= SOP for converting LifeLines Geno Data = [[TOC()]] How to pseudonomize and reformat imputed genotype (GWAS) data per study. This SOP applies to LL3. == Variables per study: == * '''studyId''' - the id of study. E.g. 'OV039' * '''studyDir''' - the folder where all converted data should go. E.g. '../home/lifelines_OV039' * '''mappingFile''' - describing what WGA ids are included and what their study ids are. E.g.: {{{ 1 LL_WGA0001 1 12345 1 LL_WGA0002 1 09876 1 LL_WGA0003 1 64542 ... }}} * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms * Items are TAB-separated and it doesn't end with a newline == Constants over all studies (for LL3): == * Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose * Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped * Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt == Expected outputs == Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing: * [studyId]_Chr[x].PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) * [studyId]_Chr[x].BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) * [studyId]_Chr[x].DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered) * [studyId]_imputation_batch.txt listing imputation batches * [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch NB: All files should be prefixed with {{{studyId}}}. > TODO monomorphic filtering == Procedure == === Step 0: request a study user === === Step 1: create subset_study.txt file for study === This is done in the generic layer: * In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers) * Export this view (tab separated, no enclosures, no headers) to subset_study.txt * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3 reformat mapping file '''WHY IS THIS?''': {{{#!sh ./formatsubsetfile.sh study.txt }}} === Step 2: generate conversion jobs === The conversion has been fully automated (see below for details). Therefore we generate all the jobs needed to convert. These jobs are produced in the 'studyDir/scripts'. Command: {{{ sh ../LL3/scripts/generateGenoJobs.sh \ --studyId OV039 \ --outputDir ../lifelines_OV039 \ --mappingFile ../mappingFile_OV039.txt }}} == Step 3: submit jobs == change directory to the scripts directory, inspect and submit: change directory: {{{ cd ../lifelines_OV039/scripts }}} list scripts (we expect jobs for each format and each chromosome): {{{ ls -lah }}} submit to cluster {{{ sh submit_jobs.sh }}} == Step 4: monitor progress, QC results == monitor progress using 'qstat' TBD how to QC. * must check that now WGA id is in the data * must check that all ids where in the set == Step 5: release == {{{ cd ../lifelines_OV039/ #give user permission to see the data chown lifelines_OV039:lifelines * }}} == Implementation details == The 'generateGenoJobs.sh' script implements the following steps: === Convert MAP/PED and generate BIM/BED/FAM=== {{{#!sh #step1 #generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile #step2: for i in {1..22} #--file [file] is input file in .map and .ped #--keep [file] tells plink what individuals to keep (from mappingFile.txt file with fam + ind id) #--recode tells plink to write results (otherwise no results!!! argh!) #--out defines output prefix (here: filtered.*) #--update-ids [file] tells prefix to update ids #result: filtered.ped/map' /target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \ --file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \ --update-ids $updateIdsFile \ --out $studyDir/${studyId}_chr$i \ --keep $keepFile \ --recode #step 3: for i in {1..22} #convert to bim/fam/bed plink \ --file $studyDir/${studyId}_chr$i \ --make-bed #remove temp rm temp_chr1 === Convert dosage format === As PLINK cannot updateIds on dosage files we created it ourselves. The command: {{{ #step1: for i in {1..22} #--subsetFile is the mappingFile #--doseFile is the imputed dosages #--outFile is where the converted file should go python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \ --subsetFile $mappingFile \ --doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \ --outFile ${studyDir}/${studyId}_chr$i.dose }}} === Step 5: copy all study* files to the lifelines0 folder === {{{#!sh cp study* ../../lifelines0 }}} * May take some time! == Maintaining the source code of the tools == To work with the sourcecode: 1. Checkout code from svn: http://www.molgenis.org/svn/standalone_tools/ 2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/ 2. Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/ == Overview == [[Image(http://i.imgur.com/nLT2e.png)]] A schematic overview of the export procedures described above.