SOP for converting LifeLines Geno Data
This SOP applies to LL3.
Data is released to researchers 'per study' (i.e. per approved research request).
- Per study a subset of the genotypes is created and made available to the researcher:
- Only individuals selected for study (e.g. 5000 out of total 17000)
- Identifiers are re-pseudonymized from 'Marcel identifiers' to 'study identifiers' (so data cannot be matched between studies).
Expected outputs
User expects files in PLINK format:
- TPED/TFAM genotype files (chosen for internal use as easier to produce)
- BIM/BED/FAM genotype files (with missing value phenotype, monomorphic filtered)
- The same, but split per chromosome
- MAP/PED dosage files (with missing value phenotype, monomorphic filtered)
- The same, but split per chromosome
Available inputs
The following are input for the conversion procedure:
- TriTyper imputed data files: /target/gpfs2/lifelines_rp/releases/LL3/
- Identifier mapping file per study to select and re-pseudonymize identifiers
Example mapping file:
LL_WGA0001 STUDYPSEUDO1 0
LL_WGA0002 STUDYPSEUDO2 0
LL_WGA0003 STUDYPSEUDO3 0
...
Procedure
- Data resides on /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedTriTyper (accessible from all our new VMs)
- Converter from TriTyper to PLINK resides on /target/gpfs2/lifelines_rp/releases/LL3
- Correct Java version resides on /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
- STEP 1: make the subset_molgenis<n>.txt file:
- In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
- In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
- Export this view (tab separated, no enclosures, no headers) to molgenis<n>.txt and scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
- Run the following command there:
./formatsubsetfile.sh molgenis<n>.txt
- Your file is now available as subset_molgenis<n>.txt and looks like:
LL_WGA0001 STUDYPSEUDO1 0
LL_WGA0002 STUDYPSEUDO2 0
LL_WGA0003 STUDYPSEUDO3 0
...
- So: genotype individual IDs - TAB - study pseudonyms - TAB - phenotypes (can be all 0s, as the TFAM will be generated later by the user)
- Items are TAB-separated and the file does not end with a newline
- STEP 2: run the converter
- Usage:
/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_molgenis<n>.txt
- STEP 3: copy to correct location
cp study<n>.tped ../../lifelines0<n>
- May take some time!
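Because the converter run is slow, it may be worth checking the subset file before STEP 2. The following is a minimal sketch of such a check, based only on the layout described in STEP 1 (three TAB-separated fields per line, no newline at the end of the file). The function name check_subset_file is hypothetical and is not part of the existing tooling; what formatsubsetfile.sh itself does is not shown here.

```shell
#!/bin/sh
# Hypothetical helper, not part of the SOP tooling: checks that a subset file
# follows the layout described in STEP 1.
check_subset_file() {
    file=$1

    # Every line must consist of exactly three TAB-separated fields:
    # genotype ID, study pseudonym, phenotype.
    awk -F '\t' '
        NF != 3 { printf "line %d: expected 3 fields, found %d\n", NR, NF; bad = 1 }
        END     { exit bad }
    ' "$file" || return 1

    # The file must not end with a newline. Command substitution strips a
    # trailing newline, so an empty result means the last byte was a newline.
    if [ -s "$file" ] && [ -z "$(tail -c 1 "$file")" ]; then
        echo "file ends with a newline"
        return 1
    fi

    echo "ok"
}
```

Possible usage, assuming the file from STEP 1 is in the current directory: check_subset_file subset_molgenis<n>.txt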
Further Genodata
The commands above generate a single large file for the study in question. Researchers would like some further file manipulation to be done on this file:
- Supply the large genodata in binary format, using the command:
plink --tfile <data> --make-bed --out <data>
This should generate .bed, .bim and .fam files.
- Supply the data also in separate files per chromosome. This can be done with the commands:
plink --tfile <data> --make-bed --chr 1 --out <data_chr1>
plink --tfile <data> --make-bed --chr 2 --out <data_chr2>
...
(a script file should take care of this series of commands)
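Such a script could look like the sketch below. It only repeats the plink command shown above in a loop. The function name split_per_chromosome, the PLINK override variable, and the default chromosome range 1 to 22 are assumptions made for illustration; adjust the range to the chromosomes actually present in the data.

```shell
#!/bin/sh
# The plink executable can be overridden, e.g. PLINK=echo for a dry run
# that only prints the arguments of each call.
PLINK=${PLINK:-plink}

# Hypothetical helper: writes one BED/BIM/FAM set per chromosome.
#   $1 = TPED/TFAM file prefix (the <data> in the commands above)
#   $2 = first chromosome (assumed default: 1)
#   $3 = last chromosome  (assumed default: 22)
split_per_chromosome() {
    data=$1
    first=${2:-1}
    last=${3:-22}

    chr=$first
    while [ "$chr" -le "$last" ]; do
        # Same command as in the SOP, with the chromosome number filled in.
        # Stop at the first failing plink call.
        "$PLINK" --tfile "$data" --make-bed --chr "$chr" \
                 --out "${data}_chr${chr}" || return 1
        chr=$((chr + 1))
    done
}
```

Possible usage: split_per_chromosome study<n> for the default range, or PLINK=echo followed by split_per_chromosome study<n> 1 2 to preview the first two commands without running plink.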
Besides the genodata, dosage information is also desired, both for the total (per-study) dataset and for that dataset split per chromosome.
- Joeri has a tool for this. NB: it is slow, reimplementing it in a compiled language might be worthwhile.
A schematic overview of the two export paths described above.