Changes between Version 7 and Version 8 of SopConvertLifeLinesGenoData


Timestamp:
2012-04-04T06:55:37+02:00
Author:
Morris Swertz

  • SopConvertLifeLinesGenoData

    v7 v8  
    44
    55This SOP applies to LL3.
     6
     7>TODO: make molgenis 'compute' pipeline for this :-)
    68
    79Data is released to researcher 'per study' (i.e. an approved research request).
     
    3840== Procedure ==
    3941
    40 === Step 1: create mapping file for study ===
     42=== Step 1: create subset_study<n>.txt file for study<n> ===
    4143
    4244 * In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
    4345 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
    44  * Export this view (tab separated, no enclosures, no headers) to molgenis<n>.txt
     46 * Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
    4547 * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
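Before copying the export to the cluster, it may be worth checking its layout. The sketch below assumes the view exports exactly two columns (PA_ID and GNO_ID, in either order); the helper name `check_mapping_file` is hypothetical and not part of the existing tooling.

```shell
# Hypothetical helper: succeed only if every line of the export has exactly
# two non-empty, tab-separated fields (assumed layout: PA_ID and GNO_ID).
check_mapping_file() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 == "" { bad++; print "malformed line " NR ": " $0 }
        END { exit (bad > 0) }
    ' "$1"
}
```

Call it with the exported file as its only argument, for example the subset file for the study at hand. A non-zero exit status means at least one line does not match the assumed layout.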
    4648
    47 === Step 2: run convertor for study ===
     49=== Step 2: run convertor to create study<n>.tped ===
    4850
    4951cd to directory:
     
    5557
    5658{{{#!sh
    57 ./formatsubsetfile.sh molgenis<n>.txt
     59./formatsubsetfile.sh study<n>.txt
    5860}}}
    5961
    6062run convertor on TriTyper and Mapping file:
    6163{{{#!sh
    62 /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_molgenis<n>.txt
     64/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt
    6365}}}
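A quick structural check of the resulting .tped can catch a truncated run early. This is a minimal sketch that assumes the convertor writes the standard PLINK transposed layout (four leading columns followed by two allele columns per sample); `check_tped` is a hypothetical helper name.

```shell
# Hypothetical helper: every row must have the same number of fields,
# at least one sample, and an even number of allele columns.
# Assumed layout: chromosome, SNP id, genetic distance, position, then 2 alleles per sample.
check_tped() {
    awk '
        NR == 1 { n = NF }
        NF != n || NF < 6 || (NF - 4) % 2 != 0 { bad++ }
        END {
            if (NR == 0 || bad > 0) exit 1
            printf "%d samples, %d SNPs\n", (n - 4) / 2, NR
        }
    ' "$1"
}
```

On success it prints the sample and SNP counts, which can be compared against the number of lines in the subset file.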
     66
     67Estimated runtime:
    6468
    6569Note:
     
    6771* Correct Java version resides on /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
    6872
    69 === Step 3: copy geno data to the study folder ===
     73=== Step 3: convert into binary plink format ===
     74
     75Convert study<n>.tped into study<n>.bed, .bim and .fam files:
     76{{{#!sh
     77plink --tfile study<n> --make-bed --out study<n>
     78}}}
     79
     80Split study<n>.bed, .bim and .fam per chromosome:
     81{{{#!sh
     82#create variable holding study name (no spaces around '=')
     83study=study<n>
     84
     85#get chromosomes out of the .bim file (first column)
     86chrs=`awk '{print $1}' ${study}.bim | sort -nur`
     87echo "Chromosomes in map file:" ${chrs}
     88
     89#split into one binary file set per chromosome
     90for chr in $chrs; do
     91        echo "Processing chromosome ${chr}"
     92        plink --bfile ${study} --chr ${chr} --make-bed --out ${study}_chr${chr}
     93done
     94}}}
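After the split, one way to confirm that no SNPs were lost is to compare line counts: the per-chromosome .bim files together should have as many lines as the whole-study .bim. The helper `check_split` below is a hypothetical sketch and only counts lines; it does not compare content.

```shell
# Hypothetical helper: first argument is the whole-study .bim,
# remaining arguments are the per-chromosome .bim files.
check_split() {
    whole=$1
    shift
    # Arithmetic expansion strips any padding that wc may add.
    total=$(( $(wc -l < "$whole") ))
    parts=$(( $(cat "$@" | wc -l) ))
    [ "$total" -eq "$parts" ]
}
```

Pass the whole-study .bim first and the per-chromosome .bim files after it. The function returns non-zero when the counts differ.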
     95
     96>NB: If this takes long we should make this cluster jobs!
     97
     98=== Step 4: convert into dosage format (MISSING!) ===
     99
     100=== Step 5: copy all study<n> files to the lifelines0<n> folder ===
    70101
    71102{{{#!sh
    72 cp study<n>.tped ../../lifelines0<n>
     103cp study<n>* ../../lifelines0<n>
    73104}}}
    74105 
    75106* May take some time!
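Because the copy is large, a byte-for-byte comparison afterwards can confirm that it completed. A minimal sketch, assuming the files are copied under their original names; `verify_copy` is a hypothetical helper.

```shell
# Hypothetical helper: first argument is the destination folder,
# remaining arguments are the source files that were copied there.
verify_copy() {
    dest=$1
    shift
    status=0
    for f in "$@"; do
        # cmp -s is silent and fails if the files differ or the copy is missing.
        if ! cmp -s "$f" "$dest/$(basename "$f")"; then
            echo "mismatch or missing: $f"
            status=1
        fi
    done
    return $status
}
```

Give it the lifelines0<n> folder as the first argument, followed by the study<n> files. It lists every file whose copy is missing or differs and returns non-zero in that case.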
    76 === Step 4: convert into dosage format (MISSING!) ===
    77107
    78 === Step 5: convert into other formats ==
    79 
    80 
    81 Convert the large genodata from TPED into .bed, .bim. and .bam files:
    82  {{{#!sh
    83 plink --tfile <data> --make-bed --out <data>
    84 }}}
    85 
    86      This should generate .bed, .bim and .fam files.
    87  * Supply the data also in separate files per chromosome. This can be done with the commands:
    88      {{{  plink --tfile <data> --make-bed --chr 1 --out <data_chr1> }}}
    89      {{{  plink --tfile <data> --make-bed --chr 2 --out <data_chr2> }}}
    90      {{{  ... }}}
    91    (a script file should take care of this series of commands)
    92 
    93 Besides the genodata dosage information is also desired, both for the total (per-study) dataset and for that dataset split per chromosome.
    94  * Joeri has a tool for this. NB: it is slow, reimplementing it in a compiled language might be worthwhile.
     108== Overview ==
    95109
    96110[[Image(http://i.imgur.com/nLT2e.png)]]
    97111
    98 A schematic overview of the two export paths described above.
    99 A schematic overview of the two export paths described above.
     112A schematic overview of the export procedures described above.