= SOP for converting LifeLines Geno Data =

[[TOC()]]

This SOP applies to LL3.

>TODO: make molgenis 'compute' pipeline for this :-)

Data is released to the researcher 'per study' (i.e. per approved research request).
 * Per study, a subset of the genotypes is created and made available to the researcher:
   * Only the individuals selected for the study are included (e.g. 5000 out of a total of 17000)
   * The identifiers are 're-pseudonymized' from 'marcel identifiers' to 'study identifiers' (so data cannot be matched between studies).

== Expected outputs ==

The user expects files in PLINK format:
 * TPED/TFAM genotype files (chosen for internal use as they are easier to produce)
 * BIM/BED/FAM genotype files (with missing value phenotype, monomorphic filtered)
 * Idem, but split per chromosome
 * MAP/PED dosage files (with missing value phenotype, monomorphic filtered)
 * Idem, but split per chromosome

== Required inputs ==

The following are inputs for the conversion procedure:
 * TriTyper imputed data files: /target/gpfs2/lifelines_rp/releases/LL3/
 * Mapping file for the study, used to select individuals and re-pseudonymize identifiers

Example mapping file:
{{{
LL_WGA0001	STUDYPSEUDO1	0
LL_WGA0002	STUDYPSEUDO2	0
LL_WGA0003	STUDYPSEUDO3	0
...
}}}
 * So: Geno individual IDs - TAB - Study pseudonyms - TAB - Phenotypes (can be all 0's as the TFAM will be generated later by the user)
 * Items are TAB-separated and the file does not end with a newline

== Procedure ==

=== Step 1: create subset_study.txt file for study ===

 * In every MOLGENIS schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
 * Export this view (tab separated, no enclosures, no headers) to subset_study.txt
 * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

=== Step 2: convert into study.tped format ===

Estimated runtime: 4 hours (4 GB / 2 CPU machine)

cd to directory:
{{{#!sh
cd /target/gpfs2/lifelines_rp/releases/LL3
}}}

reformat mapping file:
{{{#!sh
./formatsubsetfile.sh study.txt
}}}

run converter on TriTyper and mapping file:
{{{#!sh
/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study subset_study.txt
}}}

Note:
 * The converter from TriTyper to PLINK resides in /target/gpfs2/lifelines_rp/releases/LL3
 * The correct Java version resides in /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/

=== Step 3: convert into binary plink format ===

Convert study.tped into study.bed, .bim and .fam files:
{{{#!sh
plink --tfile study --make-bed --out study
}}}

Split study.bed, .bim, .fam per chromosome:

>> this script is untested, awaiting account
{{{#!sh
#create variable holding study name (no spaces around '=' in shell)
study=study

#get all chromosomes out of .bim file
chrs=`awk '{print $1}' ${study}.bim | sort -nur`
echo "Chromosome in Map File: ${chrs}" | tr "\n" " "
echo ""

#use to split/convert: one plink run per chromosome
for chr in $chrs; do
  echo "Processing chromosome $chr"
  plink --bfile $study --chr $chr --make-bed --out $study$chr
done
}}}

>NB: If this takes long we should make these cluster jobs!

=== Step 4: convert into dosage format ===

MISSING! ask Joeri?
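
Before the copy in Step 5, it may help to confirm that Step 3 produced a complete .bed/.bim/.fam set for every chromosome. The following is a minimal sketch and not part of the original procedure: the function name `check_chr_files` is hypothetical, and it assumes the per-chromosome outputs are named by appending the chromosome code to the study prefix (e.g. study1.bed), which is what the split loop in Step 3 intends.

```sh
#!/bin/sh
# Hypothetical helper (not part of the original SOP):
# report per-chromosome PLINK files that are missing or empty.
# Assumed naming: <prefix><chr>.bed / .bim / .fam
check_chr_files() {
    prefix=$1
    missing=0
    # chromosome codes are taken from column 1 of the genome-wide .bim file
    for chr in $(awk '{print $1}' "${prefix}.bim" | sort -u); do
        for ext in bed bim fam; do
            if [ ! -s "${prefix}${chr}.${ext}" ]; then
                echo "missing or empty: ${prefix}${chr}.${ext}"
                missing=1
            fi
        done
    done
    # returns 0 when every expected file exists and is non-empty, 1 otherwise
    return $missing
}
```

A possible use is `check_chr_files study && cp study* ../../lifelines0`, so that the copy only starts when all expected files are present.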
=== Step 5: copy all study files to the lifelines0 folder ===

{{{#!sh
cp study* ../../lifelines0
}}}
 * May take some time!

== Overview ==

[[Image(http://i.imgur.com/nLT2e.png)]]

A schematic overview of the export procedures described above.
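
The mapping file requirements listed under Required inputs (three TAB-separated columns, no trailing newline) could be checked before the converter is started in Step 2, since that run takes hours. The following is a minimal sketch and not part of the original procedure; the function name `check_mapping_file` is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper (not part of the original SOP):
# validate a study mapping file before running the converter.
check_mapping_file() {
    file=$1
    if [ ! -s "$file" ]; then
        echo "file is missing or empty"
        return 1
    fi
    # every line must have exactly three non-empty TAB-separated fields
    awk -F '\t' '
        NF != 3 || $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: expected 3 TAB-separated fields\n", NR
            bad = 1
        }
        END { exit bad + 0 }' "$file" || return 1
    # command substitution strips a trailing newline, so an empty result
    # means the last byte of the file is a newline
    if [ -z "$(tail -c 1 "$file")" ]; then
        echo "file ends with a newline"
        return 1
    fi
    return 0
}
```

For example, `check_mapping_file subset_study.txt` prints the offending line numbers and returns a non-zero status when the file does not match the expected layout.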