SOP for converting LifeLines Geno Data
How to pseudonymize and reformat imputed genotype (GWAS) data per study. This SOP applies to LL3.
Variables per study:
- studyId - the id of study. E.g. 'OV039'
- studyDir - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
- mappingFile - describing what WGA ids are included and what their study ids are. E.g.:
{{{
1	LL_WGA0001	1	12345
1	LL_WGA0002	1	09876
1	LL_WGA0003	1	64542
...
}}}
- So: geno family ID - TAB - geno individual ID - TAB - study family pseudonym - TAB - study pseudonym
- Items are TAB-separated and the file does not end with a newline
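A quick way to sanity-check that a mapping file follows this format is sketched below. This is illustrative, not part of the SOP; the file name `mappingFile_OV039.txt` and the awk one-liner are assumptions.

```shell
# Illustrative check: every line of the mapping file must have exactly
# 4 TAB-separated fields. A small demo file is written first; in a real
# run you would point awk at the actual mapping file.
printf '1\tLL_WGA0001\t1\t12345\n1\tLL_WGA0002\t1\t09876' > mappingFile_OV039.txt
awk -F'\t' 'NF != 4 { bad++ } END { print (bad ? "MALFORMED" : "OK") }' mappingFile_OV039.txt
```

Note that awk still processes the last line even though the file does not end with a newline.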
Constants over all studies (for LL3):
- Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
- Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
- Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt
Expected outputs
Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing:
- [studyId]_chr<n>.PED/MAP/FAM imputed genotype files (split per chromosome, with missing-value phenotype, monomorphic SNPs filtered)
- [studyId]_chr<n>.BIM/BED/FAM imputed genotype files (split per chromosome, with missing-value phenotype, monomorphic SNPs filtered)
- [studyId]_chr<n>.DOSE imputed dosage files (split per chromosome, with missing-value phenotype, monomorphic SNPs filtered)
- [studyId]_imputation_batch.txt listing imputation batches
- [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch
NB: All files should be prefixed with studyId
TODO: document the monomorphic filtering step.
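The list of expected outputs can be checked mechanically. The sketch below is a hedged example: the `studyId`/`studyDir` values and the lowercase extensions are assumptions based on the list above, and it uses dummy files so it can be demonstrated; a real run would loop over chromosomes 1..22 against the actual release folder.

```shell
# Hedged sketch: verify that all expected per-chromosome output files exist.
studyId=OV039
studyDir=demo_dir
mkdir -p "$studyDir"
# create dummy outputs so the check below can be demonstrated
for i in 1 2; do
  for ext in ped map fam bim bed dose; do
    touch "$studyDir/${studyId}_chr$i.$ext"
  done
done
missing=0
for i in 1 2; do          # a real release check would use 1..22
  for ext in ped map fam bim bed dose; do
    [ -f "$studyDir/${studyId}_chr$i.$ext" ] || { echo "missing: ${studyId}_chr$i.$ext"; missing=1; }
  done
done
[ "$missing" -eq 0 ] && echo "all expected files present"
```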
Procedure
Step 0: request a study user
Step 1: create subset_study<n>.txt file for study<n>
This is done in the generic layer:
- In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
- In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
- Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
- scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
Reformat the mapping file (TODO: document why this step is needed):
./formatsubsetfile.sh study<n>.txt
Step 2: generate conversion jobs
The conversion has been fully automated (see Implementation details below), so we generate all the jobs needed for the conversion. These jobs are written to '[studyDir]/scripts'.
Command:
{{{
sh ../LL3/scripts/generateGenoJobs.sh \
  --studyId OV039 \
  --outputDir ../lifelines_OV039 \
  --mappingFile ../mappingFile_OV039.txt
}}}
Step 3: submit jobs
Change to the scripts directory, inspect the generated jobs, and submit them.
Change directory:
cd ../lifelines_OV039/scripts
List the scripts (we expect a job for each format and each chromosome):
ls -lah
Submit to the cluster:
sh submit_jobs.sh
Step 4: monitor progress, QC results
Monitor progress using 'qstat'.
TBD how to QC. At minimum:
- must check that no WGA id remains in the data
- must check that all ids from the mapping file are in the output
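These two checks can be sketched as follows. This is a hedged, illustrative example: it builds a tiny demo .fam file and mapping file; in a real run you would point the checks at every released file under $studyDir and at the actual mapping file.

```shell
# Hedged QC sketch with demo data (file names are illustrative).
printf '1 12345 0 0 1 -9\n1 09876 0 0 2 -9\n' > demo_chr1.fam
printf '1\tLL_WGA0001\t1\t12345\n1\tLL_WGA0002\t1\t09876\n' > demo_mapping.txt

# 1) no WGA id may remain anywhere in the released data
grep -q 'LL_WGA' demo_chr1.fam && echo "FAIL: WGA ids leaked" || echo "no WGA ids in output"

# 2) every individual id in the output must appear in the mapping file
cut -d' ' -f2 demo_chr1.fam | while read id; do
  grep -qw "$id" demo_mapping.txt || echo "FAIL: $id not in mapping"
done
```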
Step 5: release
{{{
cd ../lifelines_OV039/
# give the study user permission to see the data
chown lifelines_OV039:lifelines *
}}}
Implementation details
The 'generateGenoJobs.sh' script implements the following steps:
=== Convert MAP/PED and generate BIM/BED/FAM ===
{{{#!sh
# step 1: generate the 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile

# step 2:
for i in {1..22}
do
  # --file [file] is the input file in .map and .ped format
  # --keep [file] tells plink which individuals to keep (from the mapping file: fam + ind id)
  # --update-ids [file] tells plink to update the ids
  # --recode tells plink to write results (otherwise no results!)
  # --out defines the output prefix
  # result: ${studyId}_chr$i.ped/.map
  /target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
    --file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
    --update-ids $updateIdsFile \
    --out $studyDir/${studyId}_chr$i \
    --keep $keepFile \
    --recode
done

# step 3: convert to bim/bed/fam
for i in {1..22}
do
  plink \
    --file $studyDir/${studyId}_chr$i \
    --make-bed \
    --out $studyDir/${studyId}_chr$i
done

# remove temp files
rm temp_chr1
}}}

=== Convert dosage format ===

As PLINK cannot update ids on dosage files, we implemented this ourselves. The command:

{{{#!sh
# step 1:
for i in {1..22}
do
  # --subsetFile is the mappingFile
  # --doseFile is the imputed dosages
  # --outFile is where the converted file should go
  python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
    --subsetFile $mappingFile \
    --doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
    --outFile ${studyDir}/${studyId}_chr$i.dose
done
}}}

=== Step 5: copy all study*<n> files to the lifelines0<n> folder ===

{{{#!sh
cp study<n>* ../../lifelines0<n>
}}}
* May take some time!

== Maintaining the source code of the tools ==

To work with the source code:
 1. Check out the code from svn: http://www.molgenis.org/svn/standalone_tools/
 2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
 3. Read the manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/

== Overview ==

[[Image(http://i.imgur.com/nLT2e.png)]]

A schematic overview of the export procedures described above.