wiki:SopConvertLifeLinesGenoData

Version 59 (modified by Morris Swertz, 12 years ago) (diff)

--

SOP for converting LifeLines Geno Data

How to pseudonomize and reformat imputed genotype (GWAS) data per study. This SOP applies to LL3.

Variables per study:

  • studyId - the id of study. E.g. 'OV039'
  • studyDir - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
  • mappingFile - describing what WGA ids are included and what their study ids are. E.g.:
    1   LL_WGA0001   1   12345
    1   LL_WGA0002   1   09876
    1   LL_WGA0003   1   64542
    ...
    
  • So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms
  • Items are TAB-separated and it doesn't end with a newline

WARNING: NOW THE MAPPING FILE DOES NOT YET CONTAIN FAMILY PSEUDONYMS!

Constants over all studies (for LL3):

  • Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
  • Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
  • Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores?.txt

Expected outputs

Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing:

  • [studyId]_Chr.PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
  • [studyId]_Chr.BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
  • [studyId]_Chr.DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
  • [studyId]_imputation_batch.txt listing imputation batches
  • [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch

NB: All files should be prefixed with studyId.

TODO monomorphic filtering

Procedure

Step 0: request a study user

Step 1: create subset_study<n>.txt file for study<n>

This is done in the generic layer:

  • In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
  • In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
  • Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
  • scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

Step 2: generate conversion jobs

The conversion has been fully automated (see below for details). First we generate all the jobs needed to convert.

Command:

sh /target/gpfs2/lifelines_rp/releases/LL3/scripts/generateGenoJobs.sh \
--studyId OV039 \
--studyDir /full/path/to/lifelines_OV039 \
--mappingFile /full/path/to/mappingFile_OV039.txt

NB use full path names

Step 3: submit jobs

change directory to the 'studyDir/scripts' directory, inspect jobs and submit:

change directory:

cd ../lifelines_OV039/scripts

list scripts (we expect jobs for 'plink' and 'dose' and each chromosome):

ls -lah

submit to cluster

sh submit_jobs.sh

monitor job completions (on an empty cluster this takes >2hr)

qstat -u <user>

Step 4: QC results

Steps

  • check jobs/*.finished files for all jobs (should all exist)
  • check error logs (should not be there)
  • run the test
sh jobs/run_test.sh

Step 5: release

cd ../lifelines_OV039/

#remove temp files (!)
rm temp*

#copy the quality scores
cp /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputeDosage/BeagleImputedQualityScores.txt OV039_imputation_batch_qualities.txt

#give user permission to see the data
chown lifelines_OV039:lifelines *

Implementation details

The 'generateGenoJobs.sh' script implements the following steps:

Convert MAP/PED and generate BIM/BED/FAM

#step1
#generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile


#step2: for i in {1..22}
#--file [file] is input file in .map and .ped
#--keep [file] tells plink what individuals to keep (from mappingFile.txt file with fam + ind id)
#--recode tells plink to write results (otherwise no results!!! argh!)
#--out defines output prefix (here: filtered.*)
#--update-ids [file] tells prefix to update ids
#result: filtered.ped/map'

/target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
--file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
--update-ids $updateIdsFile \
--out $studyDir/${studyId}_chr$i \
--keep $keepFile \
--recode

#step 3:  for i in {1..22}
#convert to bim/fam/bed
plink \
--file $studyDir/${studyId}_chr$i \
--make-bed

#step 4: 
#remove temp
rm temp*

Convert dosage format

As PLINK cannot updateIds on dosage files we created it ourselves. The command:

#step1: for i in {1..22}
#--subsetFile is the mappingFile
#--doseFile is the imputed dosages
#--outFile is where the converted file should go

python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
--subsetFile $mappingFile \
--doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
--outFile ${studyDir}/${studyId}_chr$i.dose

Deprecated: Maintaining the source code of the tools

To work with the sourcecode:

  1. Checkout code from svn: http://www.molgenis.org/svn/standalone_tools/
  2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
  3. Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/

Additional scripts =

Define the source folders (where data has been converted) and final destination (folder where data is served). Sometimes the name are different.

For example

sources=("lifelines_OV0xy" "lifelines_OV0xy")
targets=("lifelines_OV0ab" "lifelines_OV0cd")

Script to check if all folders are there

total=${#sources[*]}
for (( i=0; i<=$(( $total -1 )); i++ )) do
if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
echo ${targets[i]}
else
echo "error: "${sources[i]}
fi

Batch copy and setup of the data

total=${#sources[*]}
for (( i=0; i<=$(( $total -1 )); i++ )) do
if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
echo ${targets[i]}

rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bim
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.ped
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.map
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bed
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.dose
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.log
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.fam
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*_imputation*

rsync -av --progress /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ \
/target/gpfs2/lifelines-genome/home/${targets[i]}/

rm /target/gpfs2/lifelines-genome/home/${targets[i]}/temp*
chown -R $user /target/gpfs2/lifelines-genome/home/${targets[i]}/*
chmod -R a-rw /target/gpfs2/lifelines-genome/home/${targets[i]}/*
chmod -R u+r /target/gpfs2/lifelines-genome/home/${targets[i]}/*

else
echo "error: "${sources[i]}
fi
done