Version 29 (modified by Morris Swertz, 14 years ago) (diff)

SOP for converting LifeLines Geno Data

How to pseudonymize geno data per study

This SOP applies to LL3.

Specifications:

Geno data is released to researchers 'per study' (i.e. per approved research request).
Per study, a subset of the individuals is selected.
The individual identifiers are 're-pseudonymized' from 'WGA' to 'study' identifiers.
Data is reformatted into various PLINK formats.

Expected outputs

The IDs should be filtered (e.g. down to 5000 individuals) and recoded (pseudo IDs) for one study.

User expects files in PLINK format:

PED/MAP/FAM genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
BIM/BED/FAM genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
MAP/PED dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)

Required inputs

The following are input for the conversion procedure:

Beagle imputed genotype files (fam/ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap
Beagle imputed dosage files (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage
Per-study mapping file, used to filter and re-pseudonymize identifiers for that study

Example mapping file:

1   LL_WGA0001   1   STUDYPSEUDO1
1   LL_WGA0002   1   STUDYPSEUDO2
1   LL_WGA0003   1   STUDYPSEUDO3
...

So: Geno family IDs - TAB - Geno individual IDs - TAB - Study family pseudonyms - TAB - Study pseudonyms
Items are TAB-separated and the file does not end with a newline.
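
The format rules above can be checked before a mapping file is used. The following is a minimal sketch, not part of the original tooling: the function name check_mapping is hypothetical, and it only tests the two rules stated here (four TAB-separated fields per line, no newline at the end of the file).

```shell
# check_mapping FILE: hypothetical helper, not part of the LifeLines tools.
# Succeeds only if every line has exactly four non-empty TAB-separated
# fields and the file does not end with a newline.
check_mapping() {
    file="$1"
    # Reject empty or missing files.
    [ -s "$file" ] || return 1
    # Every line must have exactly four non-empty TAB-separated fields.
    awk -F '\t' '
        NF != 4 || $1 == "" || $2 == "" || $3 == "" || $4 == "" { bad = 1 }
        END { exit bad }
    ' "$file" || return 1
    # Command substitution strips a trailing newline, so the result is
    # empty exactly when the last byte of the file is a newline.
    [ -n "$(tail -c 1 "$file")" ]
}
```

Possible use: check_mapping subset_study<n>.txt && echo "mapping file looks OK"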

Procedure

Step 1: create subset_study<n>.txt file for study<n>

In every STUDY<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

Step 2: convert into study<n>.tped format

cd to directory:

cd /target/gpfs2/lifelines_rp/releases/LL3

reformat mapping file (open question: why is this step needed?):

./formatsubsetfile.sh study<n>.txt

filter individuals (repeat per chr)

#--file [file] is input file prefix (expects .map and .ped)
#--keep [file] tells plink what individuals to keep (from txt file with fam + ind id)
#--recode tells plink to write results (otherwise no results!)
#--out defines output prefix (here: temp_chr1.*)
#--update-ids [file] tells plink to update ids (used in the next command)
#result: temp_chr1.ped/map

plink --file testdata_chr1 --keep subset.txt --recode --out temp_chr1

update individuals ids (repeat per chr)

#--file [file] is input file prefix (here: the filtered temp_chr1.*)
#--update-ids [file] tells plink what individuals to update
#(from txt file with OLD fam + ind id + NEW fam id + ind id)
#--recode tells plink to write results (otherwise no results!)
#result: study2_chr1.map/ped

plink --file temp_chr1 --update-ids subset.txt --recode --out study2_chr1
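
Because both commands must be repeated per chromosome, they can be wrapped in a loop. This is a sketch under assumptions: the function name convert_study, its argument order, and the <prefix>_chr<N> naming are illustrative and simply mirror the file names used in the commands above; the list of chromosomes is passed in explicitly rather than assumed. Setting PLINK to "echo plink" gives a dry run that only prints the commands.

```shell
# Hypothetical wrapper around the two plink calls; all names are assumptions.
# Usage: convert_study INPUT_PREFIX SUBSET_FILE STUDY_PREFIX CHR...
# Set PLINK="echo plink" for a dry run that only prints the commands.
PLINK="${PLINK:-plink}"

convert_study() {
    in_prefix="$1"
    subset="$2"
    study="$3"
    shift 3
    for chr in "$@"; do
        # Filter: keep only the individuals listed in the subset file.
        $PLINK --file "${in_prefix}_chr${chr}" --keep "$subset" \
            --recode --out "temp_chr${chr}" || return 1
        # Recode: replace the old identifiers with the study pseudonyms.
        $PLINK --file "temp_chr${chr}" --update-ids "$subset" \
            --recode --out "${study}_chr${chr}" || return 1
    done
}
```

Possible use: PLINK="echo plink" convert_study testdata subset.txt study2 1 2 prints the four commands for chromosomes 1 and 2 without running them.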

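Step 3: convert to BED format (repeat per chr)

#--make-bed tells plink to write binary bed/bim/fam files
#--out defines output prefix (without it plink writes plink.bed/bim/fam)

plink --file study2_chr1 --make-bed --out study2_chr1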
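
After the text and binary conversions, each chromosome should have five files: ped and map from --recode, and bed, bim and fam from --make-bed. The sketch below lists any that are absent or empty. It assumes the binary files were written with the same <study>_chr<N> output prefix as the ped/map files; the function name list_missing is hypothetical.

```shell
# list_missing STUDY_PREFIX CHR...: hypothetical helper.
# Prints every expected output file that is absent or empty.
# Assumes all outputs share the prefix <STUDY_PREFIX>_chr<CHR>.
list_missing() {
    prefix="$1"
    shift
    for chr in "$@"; do
        for ext in ped map bed bim fam; do
            f="${prefix}_chr${chr}.${ext}"
            # -s is true only for files that exist and are non-empty.
            [ -s "$f" ] || printf '%s\n' "$f"
        done
    done
}
```

Possible use: list_missing study2 1 2 prints nothing when chromosomes 1 and 2 are complete.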

Step 4: convert into dosage format

TODO! ask Joeri?

Step 5: copy all study<n>* files to the lifelines0<n> folder

cp study<n>* ../../lifelines0<n>

May take some time!
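
Since the purpose of this SOP is pseudonymization, it may be worth confirming that no 'WGA' identifiers remain in the study files before they are copied or released. The sketch below counts rows in a PED file whose individual ID (second column) still starts with LL_WGA, the prefix used in the example mapping file. The function name count_wga_ids is hypothetical, and the check assumes all original identifiers carry that prefix.

```shell
# count_wga_ids PED_FILE: hypothetical helper.
# Prints the number of rows whose individual ID (column 2) still starts
# with LL_WGA. A fully re-pseudonymized file should give 0.
count_wga_ids() {
    awk '$2 ~ /^LL_WGA/ { n++ } END { print n + 0 }' "$1"
}
```

Possible use: for f in study<n>_chr*.ped; do echo "$f: $(count_wga_ids "$f")"; done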

Maintaining the source code of the tools

To work with the source code:

Check out code from svn: http://www.molgenis.org/svn/standalone_tools/
Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/

Overview

A schematic overview of the export procedures described above.
