
Changes between Version 7 and Version 8 of SopConvertLifeLinesGenoData

Timestamp: 2012-04-04T06:55:37+02:00
Author: Morris Swertz
Comment: --

SopConvertLifeLinesGenoData

-                      v7
+                      v8
 This SOP applies to LL3.
+>TODO: make molgenis 'compute' pipeline for this :-)
 Data is released to researcher 'per study' (i.e. an approved research request).
 …
 == Procedure ==
 === Step 1: create mapping file for study ===
+=== Step 1: create subset_study<n>.txt file for study<n> ===
  * In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
  * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
  * Export this view (tab separated, no enclosures, no headers) to molgenis<n>.txt
+ * Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
  * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
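Before copying the export to the cluster, it may be worth checking that it really is tab separated and has no empty fields. The following is a minimal sketch, not part of the SOP: `check_mapping_file` is a hypothetical helper, and it assumes the view exports exactly two columns (PA_ID and GNO_ID). It cannot tell whether a header line slipped in.

```shell
#!/bin/sh
# Hypothetical helper: check that every line of a mapping export has
# exactly two non-empty, tab-separated fields.
# Assumption: the view exports exactly two columns (PA_ID and GNO_ID).
check_mapping_file() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected 2 non-empty tab-separated fields\n", NR
            bad = 1
        }
        END {
            if (NR == 0) { print "file is empty"; exit 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

The function returns a non-zero status when any line is malformed, so it can be run on the exported file and the scp only started when it succeeds.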
 === Step 2: run convertor for study ===
+=== Step 2: run convertor to create study<n>.tped ===
 cd to directory:
 …
 {{{#!sh
 ./formatsubsetfile.sh molgenis<n>.txt
+./formatsubsetfile.sh study<n>.txt
 }}}
 run convertor on TriTyper and Mapping file:
 {{{#!sh
 /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_molgenis<n>.txt
+/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt
 }}}
+Estimated runtime:
 Note:
 …
 * Correct Java version resides on /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
+=== Step 3: copy geno data to the study folder ===
+=== Step 3: convert into binary plink format ===
+Convert .tped into study<n>.bed, .bim and .fam files:
+{{{#!sh
+plink --tfile study<n> --make-bed --out study<n>
+}}}
+Split study<n>.bed, .bim and .fam per chromosome:
+{{{#!sh
+#create variable holding study name (no spaces around '=')
+study=study<n>
+#get the distinct chromosome codes from column 1 of the .bim file
+chrs=`awk '{print $1}' ${study}.bim | sort -nur`
+echo "Chromosomes in map file: ${chrs}" | tr "\n" " "
+echo ""
+#split/convert: write one binary fileset per chromosome
+for chr in $chrs; do
+        echo "Processing chromosome $chr"
+        plink --bfile $study --chr $chr --make-bed --out ${study}_chr${chr}
+done
+}}}
+>NB: If this takes long, we should turn these into cluster jobs!
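One way to prepare the per-chromosome split for later use as separate jobs is to print the plink commands instead of running them directly. The sketch below is an illustration only: `list_chromosomes` and `print_split_commands` are hypothetical helper names, and the `_chr<N>` output suffix is an assumed naming choice. Only the plink options already used in this SOP appear in it.

```shell
#!/bin/sh
# List the distinct chromosome codes found in column 1 of a .bim file.
list_chromosomes() {
    awk '{print $1}' "$1" | sort -nu
}

# Print (do not run) one plink command per chromosome for a fileset prefix.
# The _chr<N> output suffix is an illustrative naming choice.
print_split_commands() {
    prefix=$1
    for chr in $(list_chromosomes "${prefix}.bim"); do
        echo "plink --bfile ${prefix} --chr ${chr} --make-bed --out ${prefix}_chr${chr}"
    done
}
```

The printed lines can be reviewed first, piped to `sh` to run them in sequence, or written one per file if each chromosome is to become its own job.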
+=== Step 4: convert into dosage format (MISSING!) ===
+=== Step 5: copy all study<n> files to the lifelines0<n> folder ===
 {{{#!sh
 cp study<n>.tped ../../lifelines0<n>
+cp study<n>* ../../lifelines0<n>
 }}}
 * May take some time!
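Because the copy is large, it can be useful to confirm afterwards that every file arrived intact. The following is a rough sketch under assumptions: `verify_copy` is a hypothetical helper, not an existing tool, and it compares POSIX `cksum` output for every file matching a prefix in the source and target directories.

```shell
#!/bin/sh
# Hypothetical helper: compare checksums of all files starting with a given
# prefix in a source directory against same-named files in a target directory.
# Returns non-zero if any file is missing or differs in the target.
verify_copy() {
    src_dir=$1
    dst_dir=$2
    prefix=$3
    status=0
    for f in "$src_dir"/"$prefix"*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        if [ ! -f "$dst_dir/$name" ]; then
            echo "missing: $name"
            status=1
        elif [ "$(cksum < "$f")" != "$(cksum < "$dst_dir/$name")" ]; then
            echo "differs: $name"
            status=1
        fi
    done
    return $status
}
```

It would be called with the current directory, the lifelines0<n> folder and the study<n> prefix. Note that the check reads every file twice, so it adds to the runtime of this step.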
-=== Step 4: convert into dosage format (MISSING!) ===
+=== Step 5: convert into other formats ===
+Convert the large genodata from TPED into .bed, .bim and .fam files:
+{{{#!sh
+plink --tfile <data> --make-bed --out <data>
+}}}
+     This should generate .bed, .bim and .fam files.
+ * Supply the data also in separate files per chromosome. This can be done with the commands:
+     {{{  plink --tfile <data> --make-bed --chr 1 --out <data_chr1> }}}
+     {{{  plink --tfile <data> --make-bed --chr 2 --out <data_chr2> }}}
+     {{{  ... }}}
+   (a script file should take care of this series of commands)
+In addition to the genotype data, dosage information is also wanted, both for the total (per-study) dataset and for that dataset split per chromosome.
+ * Joeri has a tool for this. NB: it is slow, reimplementing it in a compiled language might be worthwhile.
+== Overview ==
 [[Image(http://i.imgur.com/nLT2e.png)]]
+A schematic overview of the two export paths described above.