Changes between Version 7 and Version 8 of SopConvertLifeLinesGenoData
- Timestamp: 2012-04-04T06:55:37+02:00
SopConvertLifeLinesGenoData
--- v7
+++ v8
@@ -4,4 +4,6 @@
 
 This SOP applies to LL3.
+
+>TODO: make molgenis 'compute' pipeline for this :-)
 
 Data is released to researcher 'per study' (i.e. an approved research request).
…
@@ -38,12 +40,12 @@
 == Procedure ==
 
-=== Step 1: create mapping file for study===
+=== Step 1: create subset_study<n>.txt file for study<n> ===
 
 * In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
-* Export this view (tab separated, no enclosures, no headers) to molgenis<n>.txt
+* Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
 * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
 
-=== Step 2: run convertor for study===
+=== Step 2: run convertor to create study<n>.tped ===
 
 cd to directory:
…
@@ -55,11 +57,13 @@
 
 {{{#!sh
-./formatsubsetfile.sh molgenis<n>.txt
+./formatsubsetfile.sh study<n>.txt
 }}}
 
 run convertor on TriTyper and Mapping file:
 {{{#!sh
-/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_molgenis<n>.txt
+/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt
 }}}
+
+Estimated runtime:
 
 Note:
…
@@ -67,33 +71,42 @@
 * Correct Java version resides on /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
 
-=== Step 3: copy geno data to the study folder ===
+=== Step 3: convert into binary plink format ==
+
+Convert .tped into study<n>.bed, .bim. and .fam files:
+{{{#!sh
+plink --tfile study<n> --make-bed --out study<n>
+}}}
+
+Split study<n>.bed, .bim, fam per chromosome:
+{{{#!sh
+#create variable holding study name
+study = study<n>
+
+#get chromosomes out of file
+chrs=`awk '{print $1}' ${prefix}.bim | sort -nur`
+echo "Chromosome in Map File: ${chrs}" | tr "\n" " "
+echo ""
+
+#use to split/convert
+for chr in $chrs; do
+print "Processing chromosome $_\n";
+plink --bfile $study --chr $_ --make-bed --out $study$_;
+}}}
+
+>NB: If this takes long we should make this cluster jobs!
+
+=== Step 4: convert into dosage format (MISSING!) ===
+
+=== Step 5: copy all study<n> files to the lifelines0<n> folder ===
 
 {{{#!sh
-cp study<n>.tped ../../lifelines0<n>
+cp study<n>* ../../lifelines0<n>
 }}}
 
 * May take some time!
-=== Step 4: convert into dosage format (MISSING!) ===
 
-=== Step 5: convert into other formats ==
-
-
-Convert the large genodata from TPED into .bed, .bim. and .bam files:
-{{{#!sh
-plink --tfile <data> --make-bed --out <data>
-}}}
-
-This should generate .bed, .bim and .fam files.
-* Supply the data also in separate files per chromosome. This can be done with the commands:
-{{{ plink --tfile <data> --make-bed --chr 1 --out <data_chr1> }}}
-{{{ plink --tfile <data> --make-bed --chr 2 --out <data_chr2> }}}
-{{{ ... }}}
-(a script file should take care of this series of commands)
-
-Besides the genodata dosage information is also desired, both for the total (per-study) dataset and for that dataset split per chromosome.
-* Joeri has a tool for this. NB: it is slow, reimplementing it in a compiled language might be worthwhile.
+== Overview ==
 
 [[Image(http://i.imgur.com/nLT2e.png)]]
 
-A schematic overview of the two export paths described above.
-A schematic overview of the two export paths described above.
+A schematic overview of the export procedures described above.
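
Note on the per-chromosome split added to Step 3 in v8: as written it would not run in sh. The assignment `study = study<n>` has spaces around `=`, `${prefix}` is never set, `print` and `$_` are Perl rather than shell, and the `for` loop has no closing `done`. In addition, `sort -nu` can merge non-numeric chromosome codes (for example X and Y) into a single entry. The following is a minimal sketch of what the loop appears to intend, not a tested replacement for the SOP. The function names, the `PLINK` override variable and the `<study>_chr<chr>` output naming (borrowed from the `<data_chr1>` pattern in v7) are assumptions.

```shell
#!/bin/sh
# Sketch: split a binary PLINK fileset (<study>.bed/.bim/.fam) per chromosome.
# Assumption: PLINK names the plink executable; set PLINK=echo for a dry run.
PLINK="${PLINK:-plink}"

# Print each distinct chromosome code from column 1 of <prefix>.bim,
# in order of first appearance (works for numeric and non-numeric codes).
list_chromosomes() {
    awk '!seen[$1]++ {print $1}' "$1.bim"
}

# Run plink once per chromosome found in <study>.bim.
# Output prefix <study>_chr<chr> is a naming assumption.
split_per_chromosome() {
    study="$1"
    for chr in $(list_chromosomes "$study"); do
        echo "Processing chromosome $chr"
        "$PLINK" --bfile "$study" --chr "$chr" --make-bed \
            --out "${study}_chr${chr}" || return 1
    done
}
```

Usage, assuming the functions are saved in a file such as `split_chr.sh` (hypothetical name): `. ./split_chr.sh; split_per_chromosome study<n>`. Running it first with `PLINK=echo` prints the plink command lines without executing them, which is a cheap way to check the chromosome list before committing cluster time.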