Changes between Version 26 and Version 27 of SopConvertLifeLinesGenoData
Timestamp: 2012-04-10T21:35:26+02:00
SopConvertLifeLinesGenoData
v26 v27
 31  31   Example mapping file:
 32  32   {{{
 33     - LL_WGA0001 STUDYPSEUDO1 0
 34     - LL_WGA0002 STUDYPSEUDO2 0
 35     - LL_WGA0003 STUDYPSEUDO3 0
     33 + 1 LL_WGA0001 1 STUDYPSEUDO1
     34 + 1 LL_WGA0002 1 STUDYPSEUDO2
     35 + 1 LL_WGA0003 1 STUDYPSEUDO3
 36  36   ...
 37  37   }}}
 38  38
 39     - * So: Geno individual ID's - TAB - Study pseudonyms - TAB - Phenotypes (can be all 0's as TFAM will be generated later by the user)
     39 + * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms
 40  40   * Items are TAB-separated and it doesn't end with a newline
 41  41
     42 + == Procedure ==
 42  43
 43  44   === Step 1: create subset_study<n>.txt file for study<n> ===
 44  45
 45     - * In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
     46 + * In every STUDY<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
 46  47   * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
 47  48   * Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
  …   …
 55  56   }}}
 56  57
 57     - reformat mapping file:
 58     -
     58 + reformat mapping file '''WHY IS THIS?''':
 59  59   {{{#!sh
 60  60   ./formatsubsetfile.sh study<n>.txt
 61  61   }}}
 62  62
 63     - run convertor on TriTyper and Mapping file:
     63 + filter individuals (repeat per chr)
 64  64   {{{#!sh
 65     - /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt
     65 + #--file [file] is input file (expects .map and .ped)
     66 + #--keep [file] tells plink what individuals to keep (from txt file with fam + ind id)
     67 + #--recode tells plink to write results (otherwise no results!!! argh!)
     68 + #--out defines output prefix (here: filtered.*)
     69 + #--update-ids [file] tells prefix to update ids
     70 + #result: filtered.ped/map'
     71 +
     72 + plink --file testdata_chr1 --keep subset.txt --recode --out temp_chr1
 66  73   }}}
 67  74
 68     - Note:
 69     - * Convertor from TriTyper to PLINK resides on /target/gpfs2/lifelines_rp/releases/LL3
 70     - * Correct Java version resides on /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
 71     - * Estimated runtime: 4 hours (4Gb/2 cpu @ cluster.gcc.rug.nl)
 72     - === Step 3: convert into binary plink format ===
     75 + update individuals ids (repeat per chr)
     76 + {{{
     77 + #--file [file] is input file
     78 + #--keep [file] tells plink what individuals to update
     79 + #(from txt file with OLD fam + ind id + NEW fam id + ind id)
     80 + #--recode tells plink to write results (otherwise no results!!! argh!)
     81 + # result: updatedids.map/ped
 73  82
 74     - Convert .tped into study<n>.bed, .bim. and .fam files:
 75     - {{{#!sh
 76     - plink --tfile study<n> --make-bed --out study<n>
     83 + plink --file temp_chr1 --update-ids subset.txt --recode --out study2_chr1
 77  84   }}}
 78  85
 79     - Split study<n>.bed, .bim, fam per chromosome:
 80  86
 81     - >> this script is untested, awaiting account
     87 + #step 3:
     88 + #convert to bed (repeat per chr)
     89 + plink --file study2_chr1 --make-bed
 82  90
 83     - {{{#!sh
 84     - #create variable holding study name
 85     - study = study<n>
 86     -
 87     - #get all chromosomes out of .bim file
 88     - chrs=`awk '{print $1}' ${study}.bim | sort -nur`
 89     - echo "Chromosome in Map File: ${chrs}" | tr "\n" " "
 90     - echo ""
 91     -
 92     - #use to split/convert
 93     - for chr in $chrs; do
 94     - print "Processing chromosome $_\n";
 95     - plink --bfile $study --chr $_ --make-bed --out $study$_;
 96     - }}}
 97     -
 98     - >NB: If this takes long we should make this cluster jobs!
 99  91   === Step 4: convert into dosage format ===
100  92
101     - MISSING! ask Joeri?
     93 + TODO! ask Joeri?
102  94
103     - === Step 5: copy all study <n> files to the lifelines0<n> folder ===
     95 + === Step 5: copy all study*<n> files to the lifelines0<n> folder ===
104  96
105  97   {{{#!sh