= SOP for converting LifeLines Geno Data =
[[TOC()]]

How to pseudonymize and reformat imputed genotype (GWAS) data per study.
This SOP applies to LL3.


== Variables per study: ==
* '''studyId''' - the ID of the study. E.g. 'OV039'
* '''studyDir''' - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
* '''mappingFile''' - a file listing which WGA IDs are included and what their study IDs are. E.g.:

{{{
1   LL_WGA0001   1   12345
1   LL_WGA0002   1   09876
1   LL_WGA0003   1   64542
...
}}}

  * Columns: genotype family ID, genotype individual ID, study family pseudonym, study pseudonym
  * Columns are TAB-separated, and the file must not end with a newline
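
The format rules above can be checked before any jobs are generated. The following is a minimal sketch, not part of the LL3 tooling: the function name 'validate_mapping' is hypothetical, and the duplicate check is an extra assumption about what a valid mappingFile looks like.

{{{#!python
def validate_mapping(raw):
    """Return a list of problems found in the raw bytes of a mappingFile."""
    problems = []
    # The SOP requires that the file does not end with a newline.
    if raw.endswith(b"\n"):
        problems.append("file ends with a newline")
    text = raw.decode("ascii", errors="replace")
    seen = set()
    for number, line in enumerate(text.rstrip("\n").split("\n"), start=1):
        fields = line.split("\t")
        # Four TAB-separated columns: geno FID, geno IID, study FID, study IID.
        if len(fields) != 4:
            problems.append("line %d: expected 4 TAB-separated fields, found %d"
                            % (number, len(fields)))
            continue
        if any(field == "" for field in fields):
            problems.append("line %d: empty field" % number)
        key = (fields[0], fields[1])
        # Assumption: every genotype individual should be listed only once.
        if key in seen:
            problems.append("line %d: duplicate genotype ID %s" % (number, fields[1]))
        seen.add(key)
    return problems
}}}

Usage: read the file in binary mode, e.g. {{{validate_mapping(open(path, "rb").read())}}}, and treat a non-empty result as a reason to stop.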

== Constants over all studies (for LL3): ==
* Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
* Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
* Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt

== Expected outputs ==

The result of this procedure is a folder, readable by the lifelines_OV039 user, containing:
* [studyId]_Chr[x].PED/MAP imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
* [studyId]_Chr[x].BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
* [studyId]_Chr[x].DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
* [studyId]_imputation_batch.txt listing imputation batches
* [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch

NB: All files should be prefixed with {{{studyId}}}.
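
The list of expected outputs can be turned into a checklist. This is a minimal sketch under assumptions: the helper names are hypothetical, and the lowercase 'chr' label and lowercase extensions follow the output prefixes used in the conversion commands further down, which may differ from the final naming.

{{{#!python
import os


def expected_files(study_id, chromosomes=range(1, 23), chrom_label="chr"):
    """List the file names a finished conversion is expected to contain."""
    extensions = ["ped", "map", "bim", "bed", "fam", "dose"]
    names = ["%s_%s%d.%s" % (study_id, chrom_label, chrom, ext)
             for chrom in chromosomes
             for ext in extensions]
    names.append("%s_imputation_batch.txt" % study_id)
    names.append("%s_imputation_batch_quality.txt" % study_id)
    return names


def missing_files(study_dir, names):
    """Return the expected names that are not present in study_dir."""
    return [name for name in names
            if not os.path.isfile(os.path.join(study_dir, name))]
}}}

For example, {{{missing_files("../lifelines_OV039", expected_files("OV039"))}}} should return an empty list once all jobs have finished.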

> TODO monomorphic filtering

== Procedure ==

=== Step 0: request a study user ===

=== Step 1: create subset_study<n>.txt file for study<n> ===

This is done in the generic layer:
 * In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
 * Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt 
 * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

Reformat the mapping file ('''open question: why is this needed?'''):
{{{#!sh
./formatsubsetfile.sh study<n>.txt
}}}

=== Step 2: generate conversion jobs ===

The conversion is fully automated (see Implementation details below), so we only need to generate the conversion jobs.
The jobs are written to 'studyDir/scripts'.

Command:
{{{
sh ../LL3/scripts/generateGenoJobs.sh \
--studyId OV039 \
--outputDir ../lifelines_OV039 \
--mappingFile ../mappingFile_OV039.txt
}}}
 
=== Step 3: submit jobs ===

Change to the scripts directory, inspect the generated jobs, and submit them.

Change directory:
{{{
cd ../lifelines_OV039/scripts
}}}

List the scripts (we expect a job for each format and each chromosome):
{{{
ls -lah
}}}

Submit to the cluster:
{{{
sh submit_jobs.sh
}}}

=== Step 4: monitor progress, QC results ===

Monitor progress using 'qstat'.

TBD: how to QC. At a minimum:
* check that no WGA ID is present in the converted data
* check that all IDs from the mappingFile are present in the converted set
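
Until a QC procedure is agreed on, the two checks could be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes the sample IDs are read from a converted FAM or PED file, where the first two whitespace-separated columns are the family ID and the individual ID.

{{{#!python
def read_mapping(lines):
    """Map (geno family ID, geno individual ID) to (study family ID, study pseudonym)."""
    mapping = {}
    for line in lines:
        if line.strip():
            geno_fid, geno_iid, study_fid, study_iid = line.rstrip("\n").split("\t")
            mapping[(geno_fid, geno_iid)] = (study_fid, study_iid)
    return mapping


def qc_sample_ids(mapping, sample_lines):
    """Compare the IDs in a converted FAM/PED file against the mappingFile.

    Returns (leaked, missing, unexpected):
      leaked     - individual IDs in the output that are still WGA IDs
      missing    - pseudonymized samples from the mapping that are absent
      unexpected - samples in the output that the mapping does not list
    """
    found = set()
    for line in sample_lines:
        # Only the first two columns are needed; do not split whole PED lines.
        fields = line.split(None, 2)
        if len(fields) >= 2:
            found.add((fields[0], fields[1]))
    wga_ids = {geno_iid for _, geno_iid in mapping}
    expected = set(mapping.values())
    leaked = sorted(iid for _, iid in found if iid in wga_ids)
    missing = sorted(expected - found)
    unexpected = sorted(found - expected)
    return leaked, missing, unexpected
}}}

All three lists should be empty for every chromosome before the data is released.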

=== Step 5: release ===
{{{
cd ../lifelines_OV039/

#make the study user owner of the files so that they can read the data
chown lifelines_OV039:lifelines *
}}}

== Implementation details ==

The 'generateGenoJobs.sh' script implements the following steps:

=== Convert MAP/PED and generate BIM/BED/FAM ===

{{{#!sh
#step1
#generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile


#step2: for i in {1..22}
#--file [file] is input file in .map and .ped
#--keep [file] tells plink what individuals to keep (from mappingFile.txt file with fam + ind id)
#--recode tells plink to write results (otherwise no results!!! argh!)
#--out defines the output prefix (here: $studyDir/${studyId}_chr$i)
#--update-ids [file] tells plink to replace the genotype IDs with the study pseudonyms
#result: ${studyId}_chr$i.ped/map

/target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
--file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
--update-ids $updateIdsFile \
--out $studyDir/${studyId}_chr$i \
--keep $keepFile \
--recode

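#step 3: for i in {1..22}
#convert to bim/fam/bed
#--out is needed, otherwise plink writes plink.bed/bim/fam to the current directory
plink \
--file $studyDir/${studyId}_chr$i \
--make-bed \
--out $studyDir/${studyId}_chr$i

#remove temp
rm temp_chr1
}}}
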
=== Convert dosage format ===

Because PLINK cannot update IDs in dosage files, we wrote our own converter. The command:


{{{
#step1: for i in {1..22}
#--subsetFile is the mappingFile
#--doseFile is the imputed dosages
#--outFile is where the converted file should go

python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
--subsetFile $mappingFile \
--doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
--outFile ${studyDir}/${studyId}_chr$i.dose

}}}
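
To illustrate the kind of work such a converter has to do, here is a minimal sketch. It is not the actual 'convertDose.py': the function name is hypothetical, and it assumes a PLINK-style dosage file with a header line 'SNP A1 A2' followed by one family ID / individual ID pair per sample, and one dosage value per sample on each SNP line ('values_per_sample' can be raised if the files carry two or three values per sample).

{{{#!python
def convert_dose_lines(dose_lines, mapping, values_per_sample=1):
    """Yield dosage lines restricted to mapped samples, with pseudonymized IDs.

    mapping: dict from (geno family ID, geno individual ID)
             to (study family ID, study pseudonym).
    """
    lines = iter(dose_lines)
    header = next(lines).split()
    lead, ids = header[:3], header[3:]
    # The header carries two columns (family ID, individual ID) per sample.
    samples = list(zip(ids[0::2], ids[1::2]))
    keep = [index for index, sample in enumerate(samples) if sample in mapping]
    new_ids = []
    for index in keep:
        new_ids.extend(mapping[samples[index]])
    yield " ".join(lead + new_ids)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        row, values = fields[:3], fields[3:]
        # Copy only the dosage values that belong to the kept samples.
        for index in keep:
            start = index * values_per_sample
            row.extend(values[start:start + values_per_sample])
        yield " ".join(row)
}}}

Because the function is a generator over lines, a per-chromosome dosage file can be streamed through it without loading the whole file into memory.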


=== Step 5: copy all study*<n> files to the lifelines0<n> folder ===

{{{#!sh
cp study<n>* ../../lifelines0<n>
}}}
 
* May take some time!

== Maintaining the source code of the tools ==

To work with the source code:

1. Checkout code from svn: http://www.molgenis.org/svn/standalone_tools/
2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
3. Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/

== Overview ==

[[Image(http://i.imgur.com/nLT2e.png)]]

A schematic overview of the export procedures described above.