Changes between Version 26 and Version 27 of SopConvertLifeLinesGenoData

Timestamp:: 2012-04-10T21:35:26+02:00 (14 years ago)
Author:: Morris Swertz
Comment:: --

SopConvertLifeLinesGenoData

-                      v26
+                      v27
 Example mapping file:
 {{{
+LL_WGA0001   STUDYPSEUDO1   0
+LL_WGA0002   STUDYPSEUDO2   0
+LL_WGA0003   STUDYPSEUDO3   0
+   LL_WGA0001   1   STUDYPSEUDO1
+   LL_WGA0002   1   STUDYPSEUDO2
+   LL_WGA0003   1   STUDYPSEUDO3
 ...
 }}}
  * So: Geno individual IDs - TAB - Study pseudonyms - TAB - Phenotypes (can be all 0's as TFAM will be generated later by the user)
+ * So: Geno family IDs - TAB - Geno individual IDs - TAB - Study family pseudonyms - TAB - Study pseudonyms
  * Items are TAB-separated and the file doesn't end with a newline
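The format rules above can be checked before the file is passed on. The following is a minimal sketch, assuming the four-column v27 layout; `check_mapping` is a hypothetical helper name and not part of the SOP tooling.

```shell
#!/bin/sh
# check_mapping FILE
# Hypothetical helper (not part of the SOP scripts): verifies that FILE
# follows the v27 mapping layout described above.
check_mapping() {
    file=$1
    # Every line must have exactly four TAB-separated fields.
    awk -F '\t' '
        NF != 4 { printf "line %d: expected 4 fields, got %d\n", NR, NF; bad = 1 }
        END     { exit bad }
    ' "$file" || return 1
    # Command substitution strips a trailing newline, so an empty result
    # means the last byte is a newline (or the file is empty).
    if [ -z "$(tail -c 1 "$file")" ]; then
        echo "file is empty or ends with a newline"
        return 1
    fi
    return 0
}
```

Usage: `check_mapping subset_study<n>.txt && echo "mapping file looks OK"`.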
 == Procedure ==
 === Step 1: create subset_study<n>.txt file for study<n> ===
  * In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
+ * In every STUDY<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
  * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
  * Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
 …
 }}}
+reformat mapping file:
+reformat mapping file '''WHY IS THIS?''':
 {{{#!sh
 ./formatsubsetfile.sh study<n>.txt
 }}}
+run converter on TriTyper and mapping file:
+filter individuals (repeat per chr)
 {{{#!sh
+/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt
+#--file [file] is input file (expects .map and .ped)
+#--keep [file] tells plink what individuals to keep (from txt file with fam + ind id)
+#--recode tells plink to write results (otherwise no results!!! argh!)
+#--out defines output prefix (here: temp_chr1.*)
+#--update-ids [file] tells plink to update ids
+#result: temp_chr1.ped/map
+plink --file testdata_chr1 --keep subset.txt --recode --out temp_chr1
 }}}
+Note:
+* Converter from TriTyper to PLINK resides on /target/gpfs2/lifelines_rp/releases/LL3
+* Correct Java version resides on /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
+* Estimated runtime: 4 hours (4 GB / 2 CPUs @ cluster.gcc.rug.nl)
+=== Step 3: convert into binary plink format ===
+update individual IDs (repeat per chr)
+{{{
+#--file [file] is input file
+#--update-ids [file] tells plink which individuals to update
+#(from txt file with OLD fam + ind id + NEW fam id + ind id)
+#--recode tells plink to write results (otherwise no results!!! argh!)
+# result: study2_chr1.map/ped
+Convert .tped into study<n>.bed, .bim and .fam files:
+ {{{#!sh
+plink --tfile study<n> --make-bed --out study<n>
+plink --file temp_chr1 --update-ids subset.txt --recode --out study2_chr1
 }}}
-Split study<n>.bed, .bim, fam per chromosome:
+>> this script is untested, awaiting account
+#step 3:
+#convert to bed (repeat per chr)
+plink --file study2_chr1 --make-bed
-{{{#!sh
-#create variable holding study name
-study="study<n>"
-#get all chromosomes out of .bim file
-chrs=`awk '{print $1}' ${study}.bim | sort -nur`
-echo "Chromosome in Map File: ${chrs}" | tr "\n" " "
-echo ""
-#use to split/convert
-for chr in $chrs; do
-        echo "Processing chromosome ${chr}"
-        plink --bfile ${study} --chr ${chr} --make-bed --out ${study}${chr}
-done
-}}}
->NB: If this takes long we should make these cluster jobs!
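The three per-chromosome plink calls (filter, update IDs, convert to binary) can be chained in one loop. The sketch below only prints the commands (a dry run), so it can be inspected before anything is executed. Assumptions: `plink_steps` is a hypothetical helper name, the `testdata_chr<N>` input prefix and the `study2` output prefix are taken from the examples above, the chromosome list is illustrative, and the `--out` on the `--make-bed` call is an addition so that each chromosome gets its own output prefix.

```shell
#!/bin/sh
# plink_steps CHR SUBSET STUDY
# Hypothetical helper: prints the three plink calls for one chromosome.
plink_steps() {
    chr=$1
    subset=$2
    study=$3
    # 1. keep only the individuals listed in the subset file
    echo "plink --file testdata_chr${chr} --keep ${subset} --recode --out temp_chr${chr}"
    # 2. replace the geno IDs with the study pseudonyms
    echo "plink --file temp_chr${chr} --update-ids ${subset} --recode --out ${study}_chr${chr}"
    # 3. convert to binary format (--out here is an assumption, see lead-in)
    echo "plink --file ${study}_chr${chr} --make-bed --out ${study}_chr${chr}"
}

# Illustrative chromosome list; extend to the chromosomes actually present.
CHROMOSOMES="1 2 3"
for chr in $CHROMOSOMES; do
    plink_steps "$chr" subset.txt study2
done
```

To actually run the steps for one chromosome, the printed commands can be piped to a shell, for example `plink_steps 1 subset.txt study2 | sh`.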
 === Step 4: convert into dosage format ===
 MISSING! ask Joeri?
+TODO! ask Joeri?
 === Step 5: copy all study<n> files to the lifelines0<n> folder ===
+=== Step 5: copy all study*<n> files to the lifelines0<n> folder ===
 {{{#!sh