wiki:SopConvertLifeLinesGenoData

Version 64 (modified by cp229, 12 years ago) (diff)

--

SOP for converting LifeLines Geno Data for one study

Background

Researcher request access to LifeLines. If granted, they get a 'study'. In this study data set all identifiers should be impossible to link back to the rest of lifelines. Therefore, all 'lifelines participant identifiers' will need to be replaced with 'study sample identifiers'. This SOP describes how to do that for genotypic data.

Variables per study:

  • studyId - the id of study. E.g. 'OV039'
  • studyDir - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
  • mappingFile - describing what WGA ids are included and what their study ids are. E.g.:
    1   LL_WGA0001   1   12345
    1   LL_WGA0002   1   09876
    1   LL_WGA0003   1   64542
    ...
    
  • So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms
  • Items are TAB-separated and it doesn't end with a newline

WARNING: NOW THE MAPPING FILE DOES NOT YET CONTAIN FAMILY PSEUDONYMS!

Constants over all studies (for LL3):

  • Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
  • Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
  • Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores?.txt
  • PCA analysis of the genotypes; used for QC (filtering of individuals genetically too different): /target/gpfs2/lifelines_rp/releases/LL3/UnimputedPedMap

Expected outputs

Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing:

  • [studyId]_Chr.PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
  • [studyId]_Chr.BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
  • [studyId]_Chr.DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
  • [studyId]_imputation_batch.txt listing imputation batches
  • [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch

NB: All files should be prefixed with studyId.

TODO monomorphic filtering

Procedure

Step 0: request a study user

Contact "Target beheer" (admins, e.g. Ger) and ask for an account.

E.g. username = lifelines_OV093 group = none home directory = /target/gpfs2/lifelines-genome/home/lifelines_OV093

Step 1: create subset_study<n>.txt file for study<n>

This is done in the generic layer:

  • In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
  • In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
  • Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
  • scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

Step 2: generate conversion jobs

The conversion has been fully automated (see below for details). First we generate all the jobs needed to convert.

Command:

sh /target/gpfs2/lifelines_rp/releases/LL3/scripts/generateGenoJobs.sh \
--studyId OV039 \
--studyDir /full/path/to/lifelines_OV039 \
--mappingFile /full/path/to/mappingFile_OV039.txt

NB use full path names

Step 3: submit jobs

change directory to the 'studyDir/scripts' directory, inspect jobs and submit:

change directory:

cd ../lifelines_OV039/scripts

list scripts (we expect jobs for 'plink' and 'dose' and each chromosome):

ls -lah

submit to cluster

sh submit_jobs.sh

monitor job completions (on an empty cluster this takes >2hr)

qstat -u <user>

Step 4: QC results

Steps

  • check jobs/*.finished files for all jobs (should all exist)
  • check error logs (should not be there)
  • run the test
sh jobs/run_test.sh

Step 5: release

cd ../lifelines_OV039/

#remove temp files (!)
rm temp*

#copy the quality scores
cp /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputeDosage/BeagleImputedQualityScores.txt OV039_imputation_batch_qualities.txt

#give user permission to see the data
chown lifelines_OV039:lifelines *

Implementation details

The 'generateGenoJobs.sh' script implements the following steps:

Convert MAP/PED and generate BIM/BED/FAM

#step1
#generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile


#step2: for i in {1..22}
#--file [file] is input file in .map and .ped
#--keep [file] tells plink what individuals to keep (from mappingFile.txt file with fam + ind id)
#--recode tells plink to write results (otherwise no results!!! argh!)
#--out defines output prefix (here: filtered.*)
#--update-ids [file] tells prefix to update ids
#result: filtered.ped/map'

/target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
--file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
--update-ids $updateIdsFile \
--out $studyDir/${studyId}_chr$i \
--keep $keepFile \
--recode

#step 3:  for i in {1..22}
#convert to bim/fam/bed
plink \
--file $studyDir/${studyId}_chr$i \
--make-bed

#step 4: 
#remove temp
rm temp*

Convert dosage format

As PLINK cannot updateIds on dosage files we created it ourselves. The command:

#step1: for i in {1..22}
#--subsetFile is the mappingFile
#--doseFile is the imputed dosages
#--outFile is where the converted file should go

python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
--subsetFile $mappingFile \
--doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
--outFile ${studyDir}/${studyId}_chr$i.dose

Deprecated: Maintaining the source code of the tools

To work with the sourcecode:

  1. Checkout code from svn: http://www.molgenis.org/svn/standalone_tools/
  2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
  3. Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/

Additional scripts

Define the source folders (where data has been converted) and final destination (folder where data is served). Sometimes the name are different.

For example

sources=("lifelines_OV0xy" "lifelines_OV0xy")
targets=("lifelines_OV0ab" "lifelines_OV0cd")

Script to check if all folders are there

total=${#sources[*]}
for (( i=0; i<=$(( $total -1 )); i++ )) do
if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
echo ${targets[i]}
else
echo "error: "${sources[i]}
fi

Batch copy and setup of the data

total=${#sources[*]}
for (( i=0; i<=$(( $total -1 )); i++ )) do
if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
echo ${targets[i]}

rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bim
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.ped
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.map
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bed
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.dose
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.log
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.fam
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*_imputation*

rsync -av --progress /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ \
/target/gpfs2/lifelines-genome/home/${targets[i]}/

rm /target/gpfs2/lifelines-genome/home/${targets[i]}/temp*
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/imputed/test*
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/jobs*
chown -R $user /target/gpfs2/lifelines-genome/home/${targets[i]}/*
chmod -R a-rw /target/gpfs2/lifelines-genome/home/${targets[i]}/*
chmod -R u+r /target/gpfs2/lifelines-genome/home/${targets[i]}/*

else
echo "error: "${sources[i]}
fi
done

Count files to see if indeed all completed

for (( i=0; i<=$(( $total -1 )); i++ )) do
if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then

echo "${sources[i]} -> ${targets[i]}"
echo "raw"
ls -lah /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/raw | wc -l
ls -lah /target/gpfs2/lifelines-genome/home/${targets[i]}/raw | wc -l
echo "imputed"
ls -lah /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/imputed | wc -l
ls -lah /target/gpfs2/lifelines-genome/home/${targets[i]}/imputed | wc -l

fi
done