= SOP for converting LifeLines Geno Data = [[TOC()]] How to pseudonomize and reformat imputed genotype (GWAS) data per study. This SOP applies to LL3. == Variables per study: == * '''studyId''' - the id of study. E.g. 'OV039' * '''studyDir''' - the folder where all converted data should go. E.g. '../home/lifelines_OV039' * '''mappingFile''' - describing what WGA ids are included and what their study ids are. E.g.: {{{ 1 LL_WGA0001 1 12345 1 LL_WGA0002 1 09876 1 LL_WGA0003 1 64542 ... }}} * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms * Items are TAB-separated and it doesn't end with a newline >>WARNING: NOW THE MAPPING FILE DOES NOT YET CONTAIN FAMILY PSEUDONYMS! == Constants over all studies (for LL3): == * Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose * Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped * Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt == Expected outputs == Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing: * [studyId]_Chr[x].PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) * [studyId]_Chr[x].BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) * [studyId]_Chr[x].DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered) * [studyId]_imputation_batch.txt listing imputation batches * [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch NB: All files should be prefixed with {{{studyId}}}. > TODO monomorphic filtering == Procedure == === Step 0: request a study user === === Step 1: create subset_study.txt file for study === This is done in the generic layer: * In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers) * Export this view (tab separated, no enclosures, no headers) to subset_study.txt * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3 === Step 2: generate conversion jobs === The conversion has been fully automated (see below for details). First we generate all the jobs needed to convert. Command: {{{ sh /target/gpfs2/lifelines_rp/releases/LL3/scripts/generateGenoJobs.sh \ --studyId OV039 \ --studyDir /full/path/to/lifelines_OV039 \ --mappingFile /full/path/to/mappingFile_OV039.txt }}} NB use full path names === Step 3: submit jobs === change directory to the 'studyDir/scripts' directory, inspect jobs and submit: change directory: {{{ cd ../lifelines_OV039/scripts }}} list scripts (we expect jobs for 'plink' and 'dose' and each chromosome): {{{ ls -lah }}} submit to cluster {{{ sh submit_jobs.sh }}} monitor job completions (on an empty cluster this takes >2hr) {{{ qstat -u }}} === Step 4: QC results === Steps * check jobs/*.finished files for all jobs (should all exist) * check error logs (should not be there) * run the test {{{ sh jobs/run_test.sh }}} === Step 5: release === {{{ cd ../lifelines_OV039/ #remove temp files (!) rm temp* #copy the quality scores cp /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputeDosage/BeagleImputedQualityScores.txt OV039_imputation_batch_qualities.txt #give user permission to see the data chown lifelines_OV039:lifelines * }}} == Implementation details == The 'generateGenoJobs.sh' script implements the following steps: === Convert MAP/PED and generate BIM/BED/FAM=== {{{#!sh #step1 #generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile #step2: for i in {1..22} #--file [file] is input file in .map and .ped #--keep [file] tells plink what individuals to keep (from mappingFile.txt file with fam + ind id) #--recode tells plink to write results (otherwise no results!!! argh!) #--out defines output prefix (here: filtered.*) #--update-ids [file] tells prefix to update ids #result: filtered.ped/map' /target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \ --file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \ --update-ids $updateIdsFile \ --out $studyDir/${studyId}_chr$i \ --keep $keepFile \ --recode #step 3: for i in {1..22} #convert to bim/fam/bed plink \ --file $studyDir/${studyId}_chr$i \ --make-bed #step 4: #remove temp rm temp* }}} === Convert dosage format === As PLINK cannot updateIds on dosage files we created it ourselves. The command: {{{ #step1: for i in {1..22} #--subsetFile is the mappingFile #--doseFile is the imputed dosages #--outFile is where the converted file should go python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \ --subsetFile $mappingFile \ --doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \ --outFile ${studyDir}/${studyId}_chr$i.dose }}} == Deprecated: Maintaining the source code of the tools == To work with the sourcecode: 1. Checkout code from svn: http://www.molgenis.org/svn/standalone_tools/ 2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/ 2. Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/ = Additional scripts == Define the source folders (where data has been converted) and final destination (folder where data is served). Sometimes the name are different. For example {{{ sources=("lifelines_OV0xy" "lifelines_OV0xy") targets=("lifelines_OV0ab" "lifelines_OV0cd") }}} Script to check if all folders are there {{{ total=${#sources[*]} for (( i=0; i<=$(( $total -1 )); i++ )) do if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then echo ${targets[i]} else echo "error: "${sources[i]} fi }}} Batch copy and setup of the data {{{ total=${#sources[*]} for (( i=0; i<=$(( $total -1 )); i++ )) do if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then echo ${targets[i]} rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bim rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.ped rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.map rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bed rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.dose rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.log rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.fam rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*_imputation* rsync -av --progress /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ \ /target/gpfs2/lifelines-genome/home/${targets[i]}/ rm /target/gpfs2/lifelines-genome/home/${targets[i]}/temp* chown -R $user /target/gpfs2/lifelines-genome/home/${targets[i]}/* chmod -R a-rw /target/gpfs2/lifelines-genome/home/${targets[i]}/* chmod -R u+r /target/gpfs2/lifelines-genome/home/${targets[i]}/* else echo "error: "${sources[i]} fi done }}}