= SOP for converting LifeLines Geno Data =
[[TOC()]]

How to pseudonymize geno data per study.

This SOP applies to LL3.

Specifications:
* Geno data is released to researchers 'per study' (i.e. per approved research request).
* Per study, a subset of the individuals is selected.
* The individual identifiers are 're-pseudonymized' from 'WGA' to 'study' identifiers.
* Data is reformatted into various PLINK formats.

== Expected outputs ==

For one study, the IDs should be filtered (e.g. to 5000 individuals) and recoded to study pseudonyms (pseudo IDs).

The user expects files in PLINK format:
* PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
* BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
* MAP/PED imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered) 
* batch.txt listing imputation batches
* impution_quality.txt listing imputation quality per SNP x batch

> TODO monomorphic filtering

== Required inputs ==

The following are input for the conversion procedure:
* Beagle imputed genotype files (fam/ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap
* Beagle imputed dosage files (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage
* A '''per study''' mapping file, used to filter the individuals and re-pseudonymize their identifiers

Example mapping file:
{{{
1   LL_WGA0001   1   STUDYPSEUDO1
1   LL_WGA0002   1   STUDYPSEUDO2
1   LL_WGA0003   1   STUDYPSEUDO3
...
}}}

 * Columns, in order: geno family ID, geno individual ID, study family pseudonym, study individual pseudonym
 * Columns are TAB-separated, and the file does not end with a newline
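
The format rules above can be checked before the file is handed to PLINK. The following is a minimal sketch, assuming a POSIX shell with awk. The function name `check_mapping` is hypothetical and not part of the existing tools. It also treats the individual ID and the study pseudonym as unique on their own, which is an assumption based on the example, where all family IDs are 1.

{{{#!sh
# Hypothetical helper (not part of the existing tools): check a mapping file.
# Succeeds only if every line has exactly four non-empty TAB-separated fields
# and no geno individual ID (column 2) or study pseudonym (column 4) is repeated.
# Assumption: individual IDs are unique without looking at the family ID.
check_mapping() {
  awk -F'\t' '
    NF != 4 { printf "line %d: expected 4 fields, found %d\n", NR, NF; bad = 1; next }
    $1 == "" || $2 == "" || $3 == "" || $4 == "" { printf "line %d: empty field\n", NR; bad = 1 }
    seen_old[$2]++ { printf "line %d: duplicate geno ID %s\n", NR, $2; bad = 1 }
    seen_new[$4]++ { printf "line %d: duplicate study pseudonym %s\n", NR, $4; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
}}}

Usage, for example: `check_mapping subset.txt && echo "mapping file OK"`. Problems are reported per line and the function returns a non-zero exit status.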

== Procedure ==

=== Step 1: create subset_study<n>.txt file for study<n> ===

 * In every STUDY<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
 * Export this view (tab-separated, no enclosures, no headers) to subset_study<n>.txt
 * Copy the file with scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

=== Step 2: convert into study<n>.tped format ===

cd to directory:
{{{#!sh 
cd /target/gpfs2/lifelines_rp/releases/LL3
}}}

Reformat the mapping file ('''TODO:''' document why this step is needed):
{{{#!sh
./formatsubsetfile.sh study<n>.txt
}}}

Filter individuals (repeat per chromosome):
{{{#!sh
# --file [prefix]  input file prefix (expects [prefix].ped and [prefix].map)
# --keep [file]    which individuals to keep (text file with family ID + individual ID)
# --recode         write the results (otherwise no results are written)
# --out [prefix]   output prefix (here: temp_chr1.*)
# result: temp_chr1.ped and temp_chr1.map

plink --file testdata_chr1 --keep subset.txt --recode --out temp_chr1
}}}

Update individual IDs (repeat per chromosome):
{{{#!sh
# --file [prefix]      input file prefix (output of the filter step)
# --update-ids [file]  which individuals to rename (text file with
#                      OLD family ID + OLD individual ID + NEW family ID + NEW individual ID)
# --recode             write the results (otherwise no results are written)
# --out [prefix]       output prefix (here: study2_chr1.*)
# result: study2_chr1.ped and study2_chr1.map

plink --file temp_chr1 --update-ids subset.txt --recode --out study2_chr1
}}}
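
Because the filter and update commands are repeated per chromosome, they can be wrapped in a loop. The following is a sketch, not the tested production procedure. It assumes chromosomes 1 to 22 and input files named `<prefix>_chr<N>`, as in the examples above. The `DRY_RUN` variable and the `run` and `convert_all_chr` helpers are hypothetical additions: with `DRY_RUN=1` the commands are only printed, so the loop can be reviewed before PLINK is started.

{{{#!sh
# Sketch only: chromosome range, file prefixes and DRY_RUN are assumptions.
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Filter and re-identify one study for chromosomes 1..22.
# $1 = input prefix, $2 = mapping file, $3 = output prefix
convert_all_chr() {
  chr=1
  while [ "$chr" -le 22 ]; do
    # keep only the individuals listed in the mapping file
    run plink --file "${1}_chr${chr}" --keep "$2" --recode --out "temp_chr${chr}" || return 1
    # replace the WGA IDs by the study pseudonyms
    run plink --file "temp_chr${chr}" --update-ids "$2" --recode --out "${3}_chr${chr}" || return 1
    chr=$((chr + 1))
  done
}
}}}

Usage, for example: `DRY_RUN=1 convert_all_chr testdata subset.txt study2` prints the 44 PLINK commands. Without `DRY_RUN` the same call executes them and stops at the first failure.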


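=== Step 3: convert to BED format ===

Convert to BED (repeat per chromosome):
{{{#!sh
# --make-bed writes binary BED/BIM/FAM files; --out sets their prefix
# (without --out, every chromosome would be written to the default plink.* files)
plink --file study2_chr1 --make-bed --out study2_chr1
}}}
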
=== Step 4: convert into dosage format ===

TODO! ask Joeri?

=== Step 5: copy all study<n>* files to the lifelines0<n> folder ===

{{{#!sh
cp study<n>* ../../lifelines0<n>
}}}
 
* May take some time!
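
Since the copy involves many large files, it may be worth confirming that it completed. The following is a minimal sketch, assuming a POSIX shell with `cksum`. The function name `verify_copy` is hypothetical and not part of the existing tools. It compares checksum, size and name of all files with a given prefix in the source and destination folders.

{{{#!sh
# Hypothetical helper (not part of the existing tools): verify a copy.
# $1 = source dir, $2 = destination dir, $3 = file name prefix (e.g. study2)
# Succeeds only if both folders hold the same files with identical checksums.
verify_copy() {
  src=$(cd "$1" && cksum "$3"*) || return 1
  dst=$(cd "$2" && cksum "$3"*) || return 1
  [ "$src" = "$dst" ]
}
}}}

Usage, for example, from the release folder: `verify_copy . ../../lifelines0<n> study<n> && echo "copy OK"`.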

== Maintaining the source code of the tools ==

To work with the source code:

1. Check out the code from svn: http://www.molgenis.org/svn/standalone_tools/
2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
3. Read the manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/

== Overview ==

[[Image(http://i.imgur.com/nLT2e.png)]]

A schematic overview of the export procedures described above.