SOP for converting LifeLines Geno Data
This SOP applies to LL3.
TODO: make molgenis 'compute' pipeline for this :-)
Specifications:
- Geno data is released to researcher 'per study' (i.e. an approved research request).
- Per study, a subset of the individuals is selected.
- The individual identifiers are re-pseudonymized from 'Marcel identifiers' to 'study identifiers' (so data cannot be matched between studies).
- Data is reformatted into various PLINK formats.
Expected outputs
The IDs should be filtered to the study subset (e.g. 5000 individuals) and recoded to study pseudo-IDs for one study.
User expects files in PLINK format:
- PED/MAP/FAM genotype files (split per chromosome, phenotype set to missing, monomorphic SNPs filtered out)
- BIM/BED/FAM genotype files (split per chromosome, phenotype set to missing, monomorphic SNPs filtered out)
- MAP/PED dosage files (split per chromosome, phenotype set to missing, monomorphic SNPs filtered out)
Required inputs
The following are input for the conversion procedure:
- TriTyper imputed data files: /target/gpfs2/lifelines_rp/releases/LL3/
- mapping file for the study, used to select and re-pseudonymize identifiers
Example mapping file:
LL_WGA0001	STUDYPSEUDO1	0
LL_WGA0002	STUDYPSEUDO2	0
LL_WGA0003	STUDYPSEUDO3	0
...
- So: Geno individual ID - TAB - study pseudonym - TAB - phenotype (can be all 0's, as the TFAM will be generated later by the user)
- Fields are TAB-separated and the file does not end with a trailing newline
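Before running the conversion, the mapping file format can be sanity-checked with a one-liner such as the one below. This is a sketch only, assuming the three TAB-separated columns described above; subset_study<n>.txt is the file produced in Step 1:
awk -F'\t' 'NF != 3 { print "bad line " NR ": " $0; bad=1 } END { exit bad }' subset_study<n>.txt && echo "mapping file format looks OK"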
Procedure
Step 1: create subset_study<n>.txt file for study<n>
- In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
- In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
- Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
- scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
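For example (the username is a placeholder; host and path are the ones above):
scp subset_study<n>.txt <username>@cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3/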
Step 2: convert into study<n>.tped format
cd to directory:
cd /target/gpfs2/lifelines_rp/releases/LL3
reformat mapping file:
./formatsubsetfile.sh study<n>.txt
run the converter on the TriTyper data and mapping file:
/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt
Note:
- The converter from TriTyper to PLINK resides in /target/gpfs2/lifelines_rp/releases/LL3
- The correct Java version resides in /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
- Estimated runtime: 4 hours (4 GB / 2 CPUs @ cluster.gcc.rug.nl)
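Given the estimated runtime of about 4 hours, it can help to run the converter in the background and capture its output in a log. This is a sketch only: the -Xmx4g heap setting is an assumption based on the 4 GB note above, and the log file name is illustrative:
nohup /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -Xmx4g -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt > study<n>_convert.log 2>&1 &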
Step 3: convert into binary plink format
Convert the .tped into study<n>.bed, .bim, and .fam files:
plink --tfile study<n> --make-bed --out study<n>
Split study<n>.bed, .bim, and .fam per chromosome:
Note: this script is untested, awaiting a cluster account.
#create a variable holding the study name
study=study<n>
#get all chromosomes out of the .bim file
chrs=`awk '{print $1}' ${study}.bim | sort -nur`
echo "Chromosomes in .bim file: ${chrs}" | tr "\n" " "
echo ""
#split/convert per chromosome
for chr in ${chrs}; do
  echo "Processing chromosome ${chr}"
  plink --bfile ${study} --chr ${chr} --make-bed --out ${study}${chr}
done
NB: If this takes long, we should turn these into cluster jobs (see the sketch below)!
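A minimal sketch of such a cluster job, assuming the cluster accepts PBS/Torque scripts submitted with qsub (the scheduler, resource requests, and the script name split_chr<c>.sh are assumptions, not part of this SOP):
#!/bin/bash
#PBS -N study<n>_chr<c>
#PBS -l nodes=1:ppn=1
#PBS -l mem=4gb
#PBS -l walltime=04:00:00
#split one chromosome of study<n> into binary PLINK files
cd /target/gpfs2/lifelines_rp/releases/LL3
plink --bfile study<n> --chr <c> --make-bed --out study<n><c>
Submit one such job per chromosome, e.g. qsub split_chr<c>.sh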
Step 4: convert into dosage format
MISSING! ask Joeri?
Step 5: copy all study<n> files to the lifelines0<n> folder
cp study<n>* ../../lifelines0<n>
- May take some time!
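Optionally, the copy can be verified with checksums. This is a sketch using standard md5sum (not part of the original procedure) and replaces the plain cp above:
md5sum study<n>* > study<n>.md5
cp study<n>* ../../lifelines0<n>
cd ../../lifelines0<n> && md5sum -c study<n>.md5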
Maintaining the source code of the tools
To work with the source code:
- Checkout code from svn: http://www.molgenis.org/svn/standalone_tools/
- Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
- Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/
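For example, a checkout into a local standalone_tools directory (plain svn syntax; the URL is the one listed above):
svn checkout http://www.molgenis.org/svn/standalone_tools/ standalone_tools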
Overview
A schematic overview of the export procedures described above.