= SOP for converting LifeLines Geno Data =

[[TOC()]]

This SOP applies to LL3.

Data is released to researchers 'per study' (i.e. per approved research request). For each study, a subset of the genotypes is created and made available to the researcher:
 * Only the individuals selected for the study are included (e.g. 5000 out of a total of 17000)
 * The identifiers are 're-pseudonymized' from 'Marcel identifiers' to 'study identifiers' (so that data cannot be matched between studies)

== Expected outputs ==

Users expect files in PLINK format:
* TPED/TFAM genotype files (chosen for internal use because they are easier to produce)
* BIM/BED/FAM genotype files (with missing-value phenotype, monomorphic markers filtered out)
* The same BIM/BED/FAM files, split per chromosome
* MAP/PED dosage files (with missing-value phenotype, monomorphic markers filtered out)
* The same MAP/PED dosage files, split per chromosome

== Available inputs ==

The following are inputs for the conversion procedure:
* TriTyper imputed data files: /target/gpfs2/lifelines_rp/releases/LL3/
* An identifier mapping file per study, used to select individuals and re-pseudonymize their identifiers

Example mapping file:
{{{
LL_WGA0001   STUDYPSEUDO1   0
LL_WGA0002   STUDYPSEUDO2   0
LL_WGA0003   STUDYPSEUDO3   0
...
}}}

 * Each line contains: geno individual ID - TAB - study pseudonym - TAB - phenotype (the phenotypes can all be 0, as the TFAM will be generated later by the user)
 * Items are TAB-separated and the file does not end with a newline
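
The format rules above can be checked before the converter is run. The following is a minimal sketch, not part of the existing tooling: the function names `check_mapping` and `ends_without_newline` are hypothetical, and it assumes only that a valid file has exactly three TAB-separated fields per line and no trailing newline.

{{{#!sh
#!/bin/sh
# Hypothetical helpers (not part of the LL3 tooling) to sanity-check
# a mapping file before conversion.

# Succeeds only if every line has exactly three TAB-separated fields.
# Offending lines are reported on stdout.
check_mapping() {
    awk -F'\t' '
        BEGIN { bad = 0 }
        NF != 3 {
            printf "line %d: expected 3 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Succeeds only if the last byte of the file is not a newline.
# Command substitution strips a trailing newline, so the result is
# empty exactly when the file ends with one.
ends_without_newline() {
    [ -n "$(tail -c 1 "$1")" ]
}
}}}

For example, `check_mapping subset_molgenis<n>.txt && ends_without_newline subset_molgenis<n>.txt` would succeed only for a file that follows both rules.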

== Procedure ==

=== Step 1: create mapping file for study ===

 * In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
 * Export this view (tab separated, no enclosures, no headers) to molgenis<n>.txt 
 * Copy molgenis<n>.txt with scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

=== Step 2: run converter for study ===

Change to the release directory:
{{{#!sh 
cd /target/gpfs2/lifelines_rp/releases/LL3
}}}

Reformat the mapping file:

{{{#!sh
./formatsubsetfile.sh molgenis<n>.txt
}}}

Run the converter on the TriTyper data and the reformatted mapping file:
{{{#!sh
/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_molgenis<n>.txt
}}}

Notes:
* The converter from TriTyper to PLINK resides in /target/gpfs2/lifelines_rp/releases/LL3
* The correct Java version resides in /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/

=== Step 3: copy geno data to the study folder ===

{{{#!sh
cp study<n>.tped ../../lifelines0<n>
}}}
 
* This may take some time!

== Further Genodata ==

The commands above generate a single large file for the study in question. Researchers would like some further file manipulation to be done on this file:
 * Supply the large genodata in binary format, using the command:
     {{{ plink --tfile <data> --make-bed --out <data> }}}
     This should generate .bed, .bim and .fam files.
 * Supply the data also in separate files per chromosome. This can be done with the commands:
     {{{  plink --tfile <data> --make-bed --chr 1 --out <data_chr1> }}}
     {{{  plink --tfile <data> --make-bed --chr 2 --out <data_chr2> }}}
     {{{  ... }}}
   (a script file should take care of this series of commands)
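
The series of per-chromosome commands above could be wrapped in a loop along the following lines. This is a minimal sketch under stated assumptions: the function name `split_per_chromosome` and the variables `PLINK` and `CHROMOSOMES` are hypothetical, and the default list of autosomes 1 to 22 is an assumption (other chromosomes would have to be added to the list if they are needed).

{{{#!sh
#!/bin/sh
# Sketch: write one BED/BIM/FAM set per chromosome from a TPED/TFAM pair.
# PLINK and CHROMOSOMES are assumptions; adjust them to the local setup.
PLINK="${PLINK:-plink}"
CHROMOSOMES="${CHROMOSOMES:-1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22}"

# $1 is the TPED/TFAM prefix, e.g. "study1" for study1.tped and study1.tfam.
# Stops at the first chromosome for which plink fails.
split_per_chromosome() {
    data="$1"
    for chr in $CHROMOSOMES; do
        "$PLINK" --tfile "$data" --make-bed --chr "$chr" \
            --out "${data}_chr${chr}" || return 1
    done
}
}}}

Calling `split_per_chromosome study<n>` would then produce study<n>_chr1.bed/.bim/.fam, study<n>_chr2.bed/.bim/.fam, and so on. Setting `PLINK=echo` first prints the commands instead of running them, which is a cheap way to check the file names.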

In addition to the genodata, dosage information is also desired, both for the total (per-study) dataset and for that dataset split per chromosome.
 * Joeri has a tool for this. NB: it is slow; reimplementing it in a compiled language might be worthwhile.

[[Image(http://i.imgur.com/nLT2e.png)]]

A schematic overview of the two export paths described above.