SOP for converting LifeLines Geno Data
How to pseudonymize and reformat imputed genotype (GWAS) data per study. This SOP applies to LL3.
Variables per study:
- studyId - the id of study. E.g. 'OV039'
- studyDir - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
- mappingFile - describing what WGA ids are included and what their study ids are. E.g.:
{{{
1	LL_WGA0001	1	12345
1	LL_WGA0002	1	09876
1	LL_WGA0003	1	64542
...
}}}
- So: geno family ID - TAB - geno individual ID - TAB - study family pseudonym - TAB - study pseudonym
- Items are TAB-separated and the file does not end with a newline
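A quick way to sanity-check that a mapping file follows this format is sketched below. This is illustrative, not part of the SOP; the file name `mappingFile_OV039.txt` and the awk one-liner are assumptions.

```shell
# Illustrative check: every line of the mapping file must have exactly
# 4 TAB-separated fields. A small demo file is written first; in a real
# run you would point awk at the actual mapping file.
printf '1\tLL_WGA0001\t1\t12345\n1\tLL_WGA0002\t1\t09876' > mappingFile_OV039.txt
awk -F'\t' 'NF != 4 { bad++ } END { print (bad ? "MALFORMED" : "OK") }' mappingFile_OV039.txt
```

Note that awk still processes the last line even though the file does not end with a newline.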
Constants over all studies (for LL3):
- Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
- Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
- Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt
Expected outputs
Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing:
- [studyId]_chr<n>.PED/MAP/FAM imputed genotype files (split per chromosome, with missing-value phenotype, monomorphic SNPs filtered)
- [studyId]_chr<n>.BIM/BED/FAM imputed genotype files (split per chromosome, with missing-value phenotype, monomorphic SNPs filtered)
- [studyId]_chr<n>.DOSE imputed dosage files (split per chromosome, with missing-value phenotype, monomorphic SNPs filtered)
- [studyId]_imputation_batch.txt listing imputation batches
- [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch
NB: All files should be prefixed with studyId
TODO: document the monomorphic filtering step.
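The list of expected outputs can be checked mechanically. The sketch below is a hedged example: the `studyId`/`studyDir` values and the lowercase extensions are assumptions based on the list above, and it uses dummy files so it can be demonstrated; a real run would loop over chromosomes 1..22 against the actual release folder.

```shell
# Hedged sketch: verify that all expected per-chromosome output files exist.
studyId=OV039
studyDir=demo_dir
mkdir -p "$studyDir"
# create dummy outputs so the check below can be demonstrated
for i in 1 2; do
  for ext in ped map fam bim bed dose; do
    touch "$studyDir/${studyId}_chr$i.$ext"
  done
done
missing=0
for i in 1 2; do          # a real release check would use 1..22
  for ext in ped map fam bim bed dose; do
    [ -f "$studyDir/${studyId}_chr$i.$ext" ] || { echo "missing: ${studyId}_chr$i.$ext"; missing=1; }
  done
done
[ "$missing" -eq 0 ] && echo "all expected files present"
```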
Procedure
Step 0: request a study user
Step 1: create subset_study<n>.txt file for study<n>
This is done in the generic layer:
- In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
- In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
- Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
- scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
Reformat the mapping file (TODO: document why this step is needed):
./formatsubsetfile.sh study<n>.txt
Step 2: generate conversion jobs
The conversion has been fully automated (see Implementation details below), so we generate all the jobs needed for the conversion. These jobs are written to '[studyDir]/scripts'.
Command:
{{{
sh ../LL3/scripts/generateGenoJobs.sh \
  --studyId OV039 \
  --outputDir ../lifelines_OV039 \
  --mappingFile ../mappingFile_OV039.txt
}}}
Step 3: submit jobs
Change to the scripts directory, inspect the generated jobs, and submit them.
Change directory:
cd ../lifelines_OV039/scripts
List the scripts (we expect a job for each format and each chromosome):
ls -lah
Submit to the cluster:
sh submit_jobs.sh
Step 4: monitor progress, QC results
Monitor progress using 'qstat'.
TBD how to QC. At minimum:
- must check that no WGA id remains in the data
- must check that all ids from the mapping file are in the output
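These two checks can be sketched as follows. This is a hedged, illustrative example: it builds a tiny demo .fam file and mapping file; in a real run you would point the checks at every released file under $studyDir and at the actual mapping file.

```shell
# Hedged QC sketch with demo data (file names are illustrative).
printf '1 12345 0 0 1 -9\n1 09876 0 0 2 -9\n' > demo_chr1.fam
printf '1\tLL_WGA0001\t1\t12345\n1\tLL_WGA0002\t1\t09876\n' > demo_mapping.txt

# 1) no WGA id may remain anywhere in the released data
grep -q 'LL_WGA' demo_chr1.fam && echo "FAIL: WGA ids leaked" || echo "no WGA ids in output"

# 2) every individual id in the output must appear in the mapping file
cut -d' ' -f2 demo_chr1.fam | while read id; do
  grep -qw "$id" demo_mapping.txt || echo "FAIL: $id not in mapping"
done
```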
Step 5: release
{{{
cd ../lifelines_OV039/
# give the study user permission to see the data
chown lifelines_OV039:lifelines *
}}}
Implementation details
The 'generateGenoJobs.sh' script implements the following steps:
=== Convert MAP/PED and generate BIM/BED/FAM ===
{{{#!sh
# step 1: generate the 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile

# step 2:
for i in {1..22}
do
  # --file [file] is the input file in .map and .ped format
  # --keep [file] tells plink which individuals to keep (from the mapping file: fam + ind id)
  # --update-ids [file] tells plink to update the ids
  # --recode tells plink to write results (otherwise no results!)
  # --out defines the output prefix
  # result: ${studyId}_chr$i.ped/.map
  /target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
    --file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
    --update-ids $updateIdsFile \
    --out $studyDir/${studyId}_chr$i \
    --keep $keepFile \
    --recode
done

# step 3: convert to bim/bed/fam
for i in {1..22}
do
  plink \
    --file $studyDir/${studyId}_chr$i \
    --make-bed \
    --out $studyDir/${studyId}_chr$i
done

# remove temp files
rm temp_chr1
}}}

=== Convert dosage format ===

As PLINK cannot update ids on dosage files, we implemented this ourselves. The command:

{{{#!sh
# step 1:
for i in {1..22}
do
  # --subsetFile is the mappingFile
  # --doseFile is the imputed dosages
  # --outFile is where the converted file should go
  python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
    --subsetFile $mappingFile \
    --doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
    --outFile ${studyDir}/${studyId}_chr$i.dose
done
}}}

=== Step 5: copy all study*<n> files to the lifelines0<n> folder ===

{{{#!sh
cp study<n>* ../../lifelines0<n>
}}}
* May take some time!

== Maintaining the source code of the tools ==

To work with the source code:
 1. Check out the code from svn: http://www.molgenis.org/svn/standalone_tools/
 2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
 3. Read the manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/

== Overview ==

[[Image(http://i.imgur.com/nLT2e.png)]]

A schematic overview of the export procedures described above.