= SOP for converting LifeLines Geno Data =

[[TOC()]]

This SOP applies to LL3. Data is released to researchers 'per study' (i.e. per approved research request).

 * Per study, a subset of the genotypes is created and made available to the researcher:
   * Only the individuals selected for the study are included (e.g. 5000 out of a total of 17000)
   * The identifiers are 're-pseudonymized' from 'Marcel identifiers' to 'study identifiers' (so data cannot be matched between studies)

== Expected outputs ==

Users expect files in PLINK format:
 * TPED/TFAM genotype files (chosen for internal use as they are easier to produce)
 * BIM/BED/FAM genotype files (with missing value phenotype, monomorphic SNPs filtered out)
 * The same, but split per chromosome
 * MAP/PED dosage files (with missing value phenotype, monomorphic SNPs filtered out)
 * The same, but split per chromosome

== Available inputs ==

The following are inputs for the conversion procedure:
 * TriTyper imputed data files: /target/gpfs2/lifelines_rp/releases/LL3/
 * Identifier mapping file per study, used to select and re-pseudonymize identifiers

Example mapping file:
{{{
LL_WGA0001 STUDYPSEUDO1 0
LL_WGA0002 STUDYPSEUDO2 0
LL_WGA0003 STUDYPSEUDO3 0
...
}}}

== Procedure ==

 * Data resides on /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedTriTyper (accessible from all our new VMs)
 * The converter from TriTyper to PLINK resides on /target/gpfs2/lifelines_rp/releases/LL3
 * The correct Java version resides on /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/

 * STEP 1: make the subset_molgenis.txt file:
   * In every MOLGENIS schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
   * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
   * Export this view (tab-separated, no enclosures, no headers) to molgenis.txt and scp it to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
   * Run the following command there:
   {{{
   ./formatsubsetfile.sh molgenis.txt
   }}}
   * Your file is now available as subset_molgenis.txt and looks like:[[BR]]LL_WGA0001 STUDYPSEUDO1 0[[BR]]LL_WGA0002 STUDYPSEUDO2 0[[BR]]LL_WGA0003 STUDYPSEUDO3 0[[BR]]...
   * So each line is: geno individual ID - TAB - study pseudonym - TAB - phenotype (the phenotypes can all be 0, as the TFAM will be generated later by the user)
   * Items are TAB-separated and the file does not end with a newline
 * STEP 2: run the converter
   * Usage:
   {{{
   /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study subset_molgenis.txt
   }}}
 * STEP 3: copy to the correct location
   * {{{cp study.tped ../../lifelines0}}}
   * May take some time!

== Further Genodata ==

The commands above generate a single large file for the study in question. Researchers would like some further file manipulation to be done on it. In the commands below, <study> stands for the TPED/TFAM file prefix produced in STEP 2 and <output prefix> for the name of the files to write.

 * Supply the large genodata in binary format, using the command:
{{{
plink --tfile <study> --make-bed --out <output prefix>
}}}
 This should generate .bed, .bim and .fam files.
 * Supply the data also in separate files per chromosome. This can be done with the commands:
{{{
plink --tfile <study> --make-bed --chr 1 --out <output prefix for chromosome 1>
plink --tfile <study> --make-bed --chr 2 --out <output prefix for chromosome 2>
...
}}}
(a script file should take care of this series of commands)

Besides the genodata, dosage information is also desired, both for the total (per-study) dataset and for that dataset split per chromosome.

 * Joeri has a tool for this. NB: it is slow; reimplementing it in a compiled language might be worthwhile.

[[Image(http://i.imgur.com/nLT2e.png)]]

A schematic overview of the two export paths described above.
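
The script file mentioned under "Further Genodata" for the per-chromosome plink commands could look roughly like the sketch below. It is not an existing script from this SOP. The input prefix {{{study}}}, the {{{_chrN}}} output suffix and the restriction to chromosomes 1 to 22 are assumptions made for illustration; only the plink options ({{{--tfile}}}, {{{--make-bed}}}, {{{--chr}}}, {{{--out}}}) come from the commands above. By default it only prints the commands, so they can be checked before anything is run.

```shell
#!/bin/sh
# Sketch: print one PLINK command per chromosome for a TPED/TFAM pair.
# Assumptions (not from the SOP): the default prefix "study", the "_chrN"
# output suffix, and the chromosome list 1-22.

PLINK="${PLINK:-plink}"   # PLINK executable; override via the environment
CHROMOSOMES="${CHROMOSOMES:-1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22}"

# Print the PLINK command that writes chromosome $3 of TPED/TFAM prefix $1
# to binary files named ${2}_chr${3}.bed/.bim/.fam.
chr_command() {
    printf '%s --tfile %s --make-bed --chr %s --out %s_chr%s\n' \
        "$PLINK" "$1" "$3" "$2" "$3"
}

# Print the commands for every chromosome in $CHROMOSOMES, one per line.
split_commands() {
    for chr in $CHROMOSOMES; do
        chr_command "$1" "$2" "$chr"
    done
}

# Arguments: input prefix and output prefix (both default to "study").
split_commands "${1:-study}" "${2:-study}"
```

Saved under a hypothetical name such as split_per_chr.sh, running {{{sh split_per_chr.sh study study}}} lists the commands, and piping that output to {{{sh}}} executes them one after another.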