= SOP for converting LifeLines Geno Data =

[[TOC()]]

This SOP applies to LL3.

>TODO: make molgenis 'compute' pipeline for this :-)

Data is released to researchers per study (i.e. per approved research request). For each study, a subset of the genotypes is created and made available to the researcher:
* Only the individuals selected for the study are included (e.g. 5000 out of a total of 17000).
* The identifiers are re-pseudonymized from 'Marcel identifiers' to 'study identifiers', so that data cannot be matched between studies.

== Expected outputs ==

Users expect files in PLINK format:
* TPED/TFAM genotype files (chosen for internal use because they are easier to produce)
* BIM/BED/FAM genotype files (with missing-value phenotype, monomorphic variants filtered out)
* The same BIM/BED/FAM files, split per chromosome
* MAP/PED dosage files (with missing-value phenotype, monomorphic variants filtered out)
* The same MAP/PED dosage files, split per chromosome

== Required inputs ==

The conversion procedure takes the following inputs:
* TriTyper imputed data files: /target/gpfs2/lifelines_rp/releases/LL3/
* A mapping file for the study, used to select individuals and re-pseudonymize their identifiers

Example mapping file:
{{{
LL_WGA0001   STUDYPSEUDO1   0
LL_WGA0002   STUDYPSEUDO2   0
LL_WGA0003   STUDYPSEUDO3   0
...
}}}

 * Format: genotype individual IDs, TAB, study pseudonyms, TAB, phenotypes (these can all be 0, as the TFAM will be generated later by the user)
 * Fields are TAB-separated and the file does not end with a newline
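
A malformed mapping file is easy to miss until the conversion has already run for hours. The following is a minimal sketch of a check that could be run beforehand. The function name `check_mapping` is hypothetical and not part of the existing tooling, and the uniqueness checks on the first two columns are an assumption about what a valid mapping should look like.

```shell
# check_mapping FILE
# Returns 0 if every line has exactly three non-empty TAB-separated fields
# and no genotype ID or study pseudonym occurs twice; otherwise prints the
# offending lines and returns 1. (Hypothetical helper, not an existing tool.)
check_mapping() {
    awk -F '\t' '
        NF != 3 || $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: expected 3 non-empty TAB-separated fields\n", NR
            bad = 1
            next
        }
        seen_geno[$1]++  { printf "line %d: duplicate genotype ID %s\n", NR, $1; bad = 1 }
        seen_study[$2]++ { printf "line %d: duplicate study pseudonym %s\n", NR, $2; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

For example, `check_mapping subset_study<n>.txt && echo "mapping file looks fine"`.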

== Procedure ==

=== Step 1: create subset_study<n>.txt file for study<n> ===

 * In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
 * Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt 
 * Copy the file with scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

=== Step 2: convert into study<n>.tped format ===
Estimated runtime: 4 hours (on a machine with 4 GB RAM and 2 CPUs)

Change to the release directory:
{{{#!sh 
cd /target/gpfs2/lifelines_rp/releases/LL3
}}}

Reformat the mapping file:

{{{#!sh
./formatsubsetfile.sh study<n>.txt
}}}

Run the converter on the TriTyper data and the mapping file:
{{{#!sh
/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt
}}}

Note:
* The TriTyper-to-PLINK converter resides in /target/gpfs2/lifelines_rp/releases/LL3
* The correct Java version resides in /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
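
Since the conversion runs for several hours, it may be worth confirming that all inputs are in place before starting. The sketch below checks for the file and directory names used in the commands above. The function name `preflight` is hypothetical, and the Java installation is deliberately not checked here.

```shell
# preflight DIR STUDY
# Checks that the converter jar, the TriTyper data directory and the
# subset file for STUDY exist in DIR. Returns 1 if anything is missing.
# (Hypothetical helper; the file names are taken from the commands in Step 2.)
preflight() {
    dir=$1
    study=$2
    missing=0
    for f in "$dir/TriToPlinkLifeLines.jar" "$dir/subset_${study}.txt"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            missing=1
        fi
    done
    if [ ! -d "$dir/BeagleImputedTriTyper" ]; then
        echo "missing directory: $dir/BeagleImputedTriTyper" >&2
        missing=1
    fi
    return "$missing"
}
```

For example, `preflight /target/gpfs2/lifelines_rp/releases/LL3 study<n>` before launching the converter.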

=== Step 3: convert into binary plink format ===

Convert study<n>.tped into study<n>.bed, .bim and .fam files:
{{{#!sh
plink --tfile study<n> --make-bed --out study<n>
}}}

Split study<n>.bed, .bim and .fam per chromosome:

>> this script is untested, awaiting account

{{{#!sh
# Variable holding the study name (replace <n> with the study number)
study="study<n>"

# Get all chromosomes from the .bim file (column 1)
chrs=`awk '{print $1}' "${study}.bim" | sort -nur`
echo "Chromosomes in .bim file: ${chrs}" | tr "\n" " "
echo ""

# Split/convert per chromosome
for chr in $chrs; do
	echo "Processing chromosome ${chr}"
	plink --bfile "$study" --chr "$chr" --make-bed --out "${study}${chr}"
done
}}}
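
After splitting, one simple consistency check is that the per-chromosome .bim files together contain as many variants as the full .bim file. The sketch below assumes the per-chromosome files are named `<study><chr>.bim`, following the `--out` argument of the split script. The function name `check_split` is hypothetical.

```shell
# check_split STUDY
# Compares the number of variants (lines) in STUDY.bim with the sum over
# the per-chromosome files STUDY<chr>.bim in the current directory.
# (Hypothetical helper; assumes the naming used by the split script.)
check_split() {
    study=$1
    total=$(awk 'END { print NR }' "${study}.bim")
    split_sum=0
    for chr in $(awk '{ print $1 }' "${study}.bim" | sort -u); do
        part="${study}${chr}.bim"
        if [ ! -f "$part" ]; then
            echo "missing per-chromosome file: $part" >&2
            return 1
        fi
        n=$(awk 'END { print NR }' "$part")
        split_sum=$((split_sum + n))
    done
    if [ "$total" -ne "$split_sum" ]; then
        echo "variant count mismatch: $total in ${study}.bim, $split_sum across chromosomes" >&2
        return 1
    fi
    echo "ok: $total variants across all chromosomes"
}
```

Run it from the directory holding the files, e.g. `check_split study<n>`.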

>NB: If this takes long, we should run these as cluster jobs!

=== Step 4: convert into dosage format ===

MISSING! ask Joeri?

=== Step 5: copy all study<n> files to the lifelines0<n> folder ===

{{{#!sh
cp study<n>* ../../lifelines0<n>
}}}
 
* May take some time!
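
Because the copy can take a while and the files are large, it may be useful to confirm afterwards that every file arrived intact. The following is a minimal sketch using `cmp`. The function name `verify_copy` is hypothetical and not part of the existing tooling.

```shell
# verify_copy PREFIX DEST
# Compares every file in the current directory whose name starts with
# PREFIX byte-for-byte against the file of the same name in DEST.
# Returns 1 if any copy is missing or differs. (Hypothetical helper.)
verify_copy() {
    prefix=$1
    dest=$2
    status=0
    for f in "$prefix"*; do
        if [ ! -e "$f" ]; then
            echo "no files match ${prefix}*" >&2
            return 1
        fi
        if ! cmp -s "$f" "$dest/$(basename "$f")"; then
            echo "differs or missing: $dest/$(basename "$f")" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `verify_copy study<n> ../../lifelines0<n>` from the release directory.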

== Overview ==

[[Image(http://i.imgur.com/nLT2e.png)]]

A schematic overview of the export procedures described above.