= SOP for converting LifeLines Geno Data =

[[TOC()]]

This SOP applies to LL3.

> TODO: make a molgenis 'compute' pipeline for this :-)

Specifications:
 * Geno data is released to a researcher 'per study' (i.e. per approved research request).
 * Per study a subset of the individuals is selected.
 * The individual identifiers are 're-pseudonymized' from 'Marcel identifiers' to 'study identifiers' (so data cannot be matched between studies).
 * Data is reformatted into various PLINK formats.

== Expected outputs ==

The IDs should be filtered (e.g. to the 5000 individuals of one study) and recoded to study pseudo-IDs. The user expects files in PLINK format:
 * PED/MAP/FAM genotype files (split per chromosome, with missing-value phenotype, monomorphic filtered)
 * BIM/BED/FAM genotype files (split per chromosome, with missing-value phenotype, monomorphic filtered)
 * MAP/PED dosage files (split per chromosome, with missing-value phenotype, monomorphic filtered)

== Required inputs ==

The following are input for the conversion procedure:
 * Beagle imputed genotype files (fam/ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap
 * Beagle imputed dosage files (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage
 * a '''per-study''' mapping file, used to filter and re-pseudonymize the identifiers

Example mapping file:
{{{
1 LL_WGA0001 1 STUDYPSEUDO1
1 LL_WGA0002 1 STUDYPSEUDO2
1 LL_WGA0003 1 STUDYPSEUDO3
...
}}}
 * Columns: geno family ID, geno individual ID, study family pseudonym, study individual pseudonym.
 * Items are TAB-separated and the file does not end with a newline.

== Procedure ==

=== Step 1: create subset_study.txt file for study ===

 * In every STUDY schema of a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view.
 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers).
 * Export this view (tab-separated, no enclosures, no headers) to subset_study.txt.
 * scp the file to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

=== Step 2: convert into study.tped format ===

cd to the release directory:
{{{#!sh
cd /target/gpfs2/lifelines_rp/releases/LL3
}}}

Reformat the mapping file ('''WHY IS THIS?'''):
{{{#!sh
./formatsubsetfile.sh study.txt
}}}

Filter individuals (repeat per chromosome):
{{{#!sh
# --file [prefix]  input file prefix (expects .ped and .map)
# --keep [file]    individuals to keep (text file with fam id + ind id)
# --recode         tells plink to write results (otherwise no output!)
# --out [prefix]   output prefix (here: temp_chr1.*)
# result: temp_chr1.ped/.map
plink --file testdata_chr1 --keep subset.txt --recode --out temp_chr1
}}}

Update individual IDs (repeat per chromosome):
{{{#!sh
# --file [prefix]        input file prefix
# --update-ids [file]    IDs to update (text file with OLD fam id + ind id and NEW fam id + ind id)
# --recode               tells plink to write results (otherwise no output!)
# result: study2_chr1.ped/.map
plink --file temp_chr1 --update-ids subset.txt --recode --out study2_chr1
}}}

=== Step 3: convert into bed format ===

Convert to binary PLINK files (repeat per chromosome):
{{{#!sh
# --make-bed  writes binary genotype files
# --out       output prefix (here: study2_chr1.bed/.bim/.fam)
plink --file study2_chr1 --make-bed --out study2_chr1
}}}
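The commands in Steps 2 and 3 have to be repeated for every chromosome. The sketch below ties them together in one loop; it is illustrative only: the chromosome range (1-22) and the input prefix testdata_chr* are assumptions and must be adapted to the actual release file names.
{{{#!sh
#!/bin/bash
# Illustrative wrapper around Steps 2 and 3.
# ASSUMPTIONS: 22 autosomal files, input prefix testdata_chr<N>.
for CHR in {1..22}
do
    # Step 2a: keep only the individuals of this study
    plink --file testdata_chr${CHR} --keep subset.txt --recode --out temp_chr${CHR}
    # Step 2b: replace the geno IDs by the study pseudonyms
    plink --file temp_chr${CHR} --update-ids subset.txt --recode --out study2_chr${CHR}
    # Step 3: convert to binary PLINK format
    plink --file study2_chr${CHR} --make-bed --out study2_chr${CHR}
done
}}}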
=== Step 4: convert into dosage format ===

TODO! ask Joeri?

=== Step 5: copy all study* files to the lifelines0 folder ===

{{{#!sh
cp study* ../../lifelines0
}}}
 * May take some time!

A quick sanity check on the converted files is sketched at the end of this page.

== Maintaining the source code of the tools ==

To work with the source code:
 1. Checkout the code from svn: http://www.molgenis.org/svn/standalone_tools/
 2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
 3. Read the manuals at http://www.molgenis.org/svn/standalone_tools/manuals/

== Overview ==

[[Image(http://i.imgur.com/nLT2e.png)]]

A schematic overview of the export procedures described above.
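== Sanity check ==

Not part of the original procedure, but before handing the data over it is easy to verify that the number of individuals in the converted files matches the study subset. A minimal sketch, assuming the file names used in the examples above (subset.txt, study2_chr1.ped); because the subset file has no trailing newline, records are counted with awk rather than wc -l:
{{{#!sh
# Number of individuals requested for the study (subset file has no trailing
# newline, so count records with awk instead of wc -l):
awk 'END { print NR }' subset.txt

# Number of individuals in the converted data for chromosome 1
# (a .ped file has one line per individual):
wc -l study2_chr1.ped
}}}
Both counts should be identical, for every chromosome.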