SOP for converting LifeLines Geno Data
This SOP applies to LL3.
TODO: make molgenis 'compute' pipeline for this :-)
Specifications:
- Geno data is released to researchers 'per study' (i.e. per approved research request).
- Per study, a subset of the individuals is selected.
- The individual identifiers are 're-pseudonymized' from 'Marcel identifiers' to 'study identifiers' (so data cannot be matched between studies).
- Data is reformatted in various PLINK formats
Expected outputs
User expects files in PLINK format:
- TPED/TFAM genotype files (chosen for internal use as easier to produce)
- BIM/BED/FAM genotype files (with missing value phenotype, monomorphic filtered)
- The same, but split per chromosome
- MAP/PED dosage files (with missing value phenotype, monomorphic filtered)
- The same, but split per chromosome
Required inputs
The following are input for the conversion procedure:
- TriTyper imputed data files: /target/gpfs2/lifelines_rp/releases/LL3/
- Mapping file for the study, used to select individuals and re-pseudonymize their identifiers
Example mapping file:
LL_WGA0001	STUDYPSEUDO1	0
LL_WGA0002	STUDYPSEUDO2	0
LL_WGA0003	STUDYPSEUDO3	0
...
- So: Geno individual IDs - TAB - Study pseudonyms - TAB - Phenotypes (can be all 0's as TFAM will be generated later by the user)
- Items are TAB-separated and the file does not end with a newline
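The format rules above can be checked before starting the conversion. A minimal sketch, assuming one individual per line; check_mapping_file is a hypothetical helper and not part of the existing LifeLines tooling:

```shell
#!/bin/sh
# Sketch only: check_mapping_file is a hypothetical helper, not an existing tool.
# Usage: check_mapping_file FILE
# Returns 0 when FILE has three non-empty TAB-separated fields on every line
# and does not end with a newline; prints the problem to stderr otherwise.
check_mapping_file() {
    file="$1"
    # Every line: Geno ID, study pseudonym, phenotype, separated by TABs.
    awk -F '\t' '
        NF != 3 || $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: expected 3 TAB-separated fields\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$file" >&2 || return 1
    # Command substitution strips a trailing newline, so an empty result
    # for the last byte means the file ends with a newline.
    if [ -s "$file" ] && [ -z "$(tail -c 1 "$file")" ]; then
        echo "$file ends with a newline" >&2
        return 1
    fi
    return 0
}
```

Example call: check_mapping_file "study<n>.txt" && echo "mapping file OK"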
Procedure
Step 1: create subset_study<n>.txt file for study<n>
- In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
- In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
- Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
- scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
Step 2: convert into study<n>.tped format
cd to directory:
cd /target/gpfs2/lifelines_rp/releases/LL3
reformat mapping file:
./formatsubsetfile.sh study<n>.txt
run the converter on the TriTyper data and the mapping file:
/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt
Note:
- The TriTyper-to-PLINK converter resides in /target/gpfs2/lifelines_rp/releases/LL3
- The correct Java version resides in /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
- Estimated runtime: 4 hours (4 GB / 2 CPUs @ cluster.gcc.rug.nl)
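After a 4-hour run it may be worth checking the result before continuing. A minimal sketch, assuming the converter writes study<n>.tfam next to study<n>.tped (which plink --tfile needs) and that the subset file has one individual per line; count_records and check_tfam are hypothetical helpers:

```shell
#!/bin/sh
# Sketch only: both functions are hypothetical helpers, not existing tools.

# count_records FILE: number of lines, also counting a last line that has
# no trailing newline (wc -l would miss that one).
count_records() {
    awk 'END { print NR }' "$1"
}

# check_tfam SUBSET_FILE TFAM_FILE
# Checks that the TFAM has one row per selected individual and that no
# original LL_WGA identifier survived the re-pseudonymization.
check_tfam() {
    subset="$1"
    tfam="$2"
    expected=$(count_records "$subset")
    actual=$(count_records "$tfam")
    if [ "$expected" -ne "$actual" ]; then
        echo "expected $expected individuals, found $actual in $tfam" >&2
        return 1
    fi
    if grep -q 'LL_WGA' "$tfam"; then
        echo "original LL_WGA identifiers found in $tfam" >&2
        return 1
    fi
    return 0
}
```

Example call: check_tfam "subset_study<n>.txt" "study<n>.tfam"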
Step 3: convert into binary plink format
Convert .tped into study<n>.bed, .bim and .fam files:
plink --tfile study<n> --make-bed --out study<n>
Split study<n>.bed, .bim, .fam per chromosome:
Note: this script is untested (awaiting a cluster account).
# create variable holding study name
study="study<n>"
# get all chromosomes out of the .bim file
chrs=$(awk '{print $1}' "${study}.bim" | sort -nur)
echo "Chromosomes in .bim file: ${chrs}" | tr "\n" " "
echo ""
# split/convert per chromosome
for chr in $chrs; do
  echo "Processing chromosome ${chr}"
  plink --bfile "$study" --chr "$chr" --make-bed --out "${study}${chr}"
done
NB: If this takes long, we should run these as cluster jobs!
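One way to prepare for cluster jobs is to write one small script per chromosome, which can then be submitted with whatever scheduler the cluster uses (the scheduler is not specified here, so submission is left out). A minimal, untested-on-cluster sketch; list_chromosomes, write_split_jobs and the split_*.sh file names are hypothetical:

```shell
#!/bin/sh
# Sketch only: function and file names are hypothetical.

# list_chromosomes BIM_FILE: unique chromosome codes from column 1.
# Plain sort -u is used so non-numeric codes (e.g. X) are kept apart.
list_chromosomes() {
    awk '{ print $1 }' "$1" | sort -u
}

# write_split_jobs STUDY OUTDIR
# Writes OUTDIR/split_<study>_chr<chr>.sh for every chromosome found in
# STUDY.bim; each script holds one plink call. Nothing is executed here.
write_split_jobs() {
    study="$1"
    outdir="$2"
    name=$(basename "$study")
    mkdir -p "$outdir" || return 1
    for chr in $(list_chromosomes "${study}.bim"); do
        job="${outdir}/split_${name}_chr${chr}.sh"
        {
            echo '#!/bin/sh'
            echo "plink --bfile ${study} --chr ${chr} --make-bed --out ${study}${chr}"
        } > "$job" || return 1
    done
    return 0
}
```

Example call: write_split_jobs "study<n>" jobs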
Step 4: convert into dosage format
MISSING! ask Joeri?
Step 5: copy all study<n> files to the lifelines0<n> folder
cp study<n>* ../../lifelines0<n>
- May take some time!
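Because the copy is large, it can be useful to confirm that every file arrived intact. A minimal sketch that compares each copy byte-for-byte with its source; copy_and_verify is a hypothetical helper, and the extra read roughly doubles the time needed:

```shell
#!/bin/sh
# Sketch only: copy_and_verify is a hypothetical helper, not an existing tool.
# Usage: copy_and_verify DEST_DIR FILE...
# Copies each FILE into DEST_DIR and stops at the first copy that fails
# or differs from its source.
copy_and_verify() {
    dest="$1"
    shift
    for f in "$@"; do
        cp "$f" "$dest/" || return 1
        if ! cmp -s "$f" "$dest/$(basename "$f")"; then
            echo "copy of $f differs from source" >&2
            return 1
        fi
    done
    return 0
}
```

Example call: copy_and_verify "../../lifelines0<n>" "study<n>"*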
Overview
A schematic overview of the export procedures described above.