Version 61 (last modified 12 years ago)
SOP for converting LifeLines Geno Data
How to pseudonymize and reformat imputed genotype (GWAS) data per study. This SOP applies to LL3.
Variables per study:
- studyId - the id of the study, e.g. 'OV039'
- studyDir - the folder where all converted data should go, e.g. '../home/lifelines_OV039'
- mappingFile - describes which WGA ids are included and what their study ids are, e.g.:
1	LL_WGA0001	1	12345
1	LL_WGA0002	1	09876
1	LL_WGA0003	1	64542
...
- So: Geno family ID - TAB - Geno individual ID - TAB - Study family pseudonym - TAB - Study pseudonym
- Fields are TAB-separated and the file does not end with a newline
WARNING: NOW THE MAPPING FILE DOES NOT YET CONTAIN FAMILY PSEUDONYMS!
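Before running the conversion it is cheap to sanity-check the mapping file. The snippet below is my own sketch (not part of the SOP scripts); it verifies that every line has exactly 4 TAB-separated fields, using demo data in `/tmp/mappingFile_example.txt` as a stand-in for the real file:

```shell
# Sanity check (own sketch, not an SOP script): every line must have 4 TAB-separated fields.
# Demo data standing in for the real mapping file; note: no trailing newline, as the SOP states.
printf '1\tLL_WGA0001\t1\t12345\n1\tLL_WGA0002\t1\t09876' > /tmp/mappingFile_example.txt
awk -F'\t' 'NF != 4 { print "bad line " NR; bad = 1 } END { exit bad }' /tmp/mappingFile_example.txt \
  && echo "mapping file OK"
```

A malformed line makes awk exit non-zero, so the "OK" message is suppressed and the offending line number is printed.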
Constants over all studies (for LL3):
- Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
- Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
- Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt
Expected outputs
The result of this procedure is a folder readable by the lifelines_OV039 user, containing:
- [studyId]_Chr.PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
- [studyId]_Chr.BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
- [studyId]_Chr.DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
- [studyId]_imputation_batch.txt listing imputation batches
- [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch
NB: All files should be prefixed with studyId
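The expected outputs can be checked mechanically. The sketch below is my own addition; it assumes the lowercase per-chromosome naming the conversion scripts use (`${studyId}_chr$i.*`) and creates demo files for two chromosomes only (a real check would loop over {1..22}):

```shell
# Own sketch: verify that each expected per-chromosome output file exists.
studyId=OV039
studyDir=/tmp/demo_${studyId}          # demo stand-in for the real studyDir
mkdir -p "$studyDir"
for i in 1 2; do
  for ext in ped map fam bim bed dose; do
    touch "$studyDir/${studyId}_chr$i.$ext"   # stand-ins for the real outputs
  done
done
missing=0
for i in 1 2; do                        # real check would use {1..22}
  for ext in ped map fam bim bed dose; do
    [ -f "$studyDir/${studyId}_chr$i.$ext" ] || { echo "missing: ${studyId}_chr$i.$ext"; missing=1; }
  done
done
[ "$missing" -eq 0 ] && echo "all expected outputs present"
```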
TODO monomorphic filtering
Procedure
Step 0: request a study user
Contact "Target beheer" (admins, e.g. Ger) and ask for an account.
E.g.:
username = lifelines_OV039
group = none
home directory = /target/gpfs2/lifelines-genome/home/lifelines_OV039
Step 1: create subset_study<n>.txt file for study<n>
This is done in the generic layer:
- In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
- In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
- Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
- scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
Step 2: generate conversion jobs
The conversion has been fully automated (see below for details). First we generate all the jobs needed to convert.
Command:
sh /target/gpfs2/lifelines_rp/releases/LL3/scripts/generateGenoJobs.sh \
  --studyId OV039 \
  --studyDir /full/path/to/lifelines_OV039 \
  --mappingFile /full/path/to/mappingFile_OV039.txt
NB: use full path names.
Step 3: submit jobs
Change to the 'studyDir/scripts' directory, inspect the jobs, and submit:
change directory:
cd ../lifelines_OV039/scripts
list scripts (we expect jobs for 'plink' and 'dose' and each chromosome):
ls -lah
submit to cluster
sh submit_jobs.sh
monitor job completion (on an empty cluster this takes over 2 hours):
qstat -u <user>
Step 4: QC results
Steps
- check the jobs/*.finished files: one should exist for every job
- check for error logs: there should be none
- run the test
sh jobs/run_test.sh
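The '.finished' check in the first bullet can be scripted. The snippet below is my own sketch, under the assumption that each job script writes a matching '&lt;job&gt;.finished' marker on completion; the file names are demo stand-ins:

```shell
# Own sketch: report every job script without a matching .finished marker.
# Assumption: each jobs/<job>.sh touches jobs/<job>.sh.finished on success.
rm -rf /tmp/demo_jobs && mkdir -p /tmp/demo_jobs && cd /tmp/demo_jobs
touch chr1.sh chr1.sh.finished chr2.sh   # demo files; chr2.sh has no marker
for job in *.sh; do
  [ -f "$job.finished" ] || echo "MISSING: $job"
done
```

Any "MISSING:" line means a job did not complete and should be inspected or resubmitted before release.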
Step 5: release
cd ../lifelines_OV039/
# remove temp files (!)
rm temp*
# copy the quality scores
cp /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleQualityScores.txt OV039_imputation_batch_quality.txt
# give user permission to see the data
chown lifelines_OV039:lifelines *
Implementation details
The 'generateGenoJobs.sh' script implements the following steps:
Convert MAP/PED and generate BIM/BED/FAM
# step 1: generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile

# step 2: subset individuals and update ids per chromosome
for i in {1..22}
do
  # --file [file] is the input file in .map and .ped format
  # --keep [file] tells plink which individuals to keep (from the mappingFile: fam + ind id)
  # --update-ids [file] tells plink to update the ids
  # --recode tells plink to write results (otherwise no results!)
  # --out defines the output prefix
  # result: $studyDir/${studyId}_chr$i.ped/.map
  /target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
    --file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
    --update-ids $updateIdsFile \
    --out $studyDir/${studyId}_chr$i \
    --keep $keepFile \
    --recode
done

# step 3: convert to bim/bed/fam
for i in {1..22}
do
  plink \
    --file $studyDir/${studyId}_chr$i \
    --make-bed \
    --out $studyDir/${studyId}_chr$i
done

# step 4: remove temp files
rm temp*
Convert dosage format
As PLINK cannot update ids in dosage files, we implemented this ourselves. The command:
# step 1: convert the dosage file per chromosome
for i in {1..22}
do
  # --subsetFile is the mappingFile
  # --doseFile is the imputed dosages
  # --outFile is where the converted file should go
  python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
    --subsetFile $mappingFile \
    --doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
    --outFile ${studyDir}/${studyId}_chr$i.dose
done
Deprecated: Maintaining the source code of the tools
To work with the source code:
- Checkout code from svn: http://www.molgenis.org/svn/standalone_tools/
- Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
- Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/
Additional scripts
Define the source folders (where the data has been converted) and the final destination (the folder from which the data is served). Sometimes the names differ.
For example:
sources=("lifelines_OV0xy" "lifelines_OV0xy")
targets=("lifelines_OV0ab" "lifelines_OV0cd")
Script to check if all folders are there
total=${#sources[*]}
for (( i=0; i<=$(( $total - 1 )); i++ ))
do
  if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
    echo ${targets[i]}
  else
    echo "error: "${sources[i]}
  fi
done
Batch copy and setup of the data
total=${#sources[*]}
for (( i=0; i<=$(( $total - 1 )); i++ ))
do
  if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
    echo ${targets[i]}
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bim
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.ped
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.map
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bed
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.dose
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.log
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.fam
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*_imputation*
    rsync -av --progress /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ \
      /target/gpfs2/lifelines-genome/home/${targets[i]}/
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/temp*
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/imputed/test*
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/jobs*
    chown -R $user /target/gpfs2/lifelines-genome/home/${targets[i]}/*
    chmod -R a-rw /target/gpfs2/lifelines-genome/home/${targets[i]}/*
    chmod -R u+r /target/gpfs2/lifelines-genome/home/${targets[i]}/*
  else
    echo "error: "${sources[i]}
  fi
done
Count files to see if indeed all completed
for (( i=0; i<=$(( $total - 1 )); i++ ))
do
  if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
    echo "${sources[i]} -> ${targets[i]}"
    echo "raw"
    ls -lah /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/raw | wc -l
    ls -lah /target/gpfs2/lifelines-genome/home/${targets[i]}/raw | wc -l
    echo "imputed"
    ls -lah /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/imputed | wc -l
    ls -lah /target/gpfs2/lifelines-genome/home/${targets[i]}/imputed | wc -l
  fi
done