The purpose of this pipeline is to further annotate the SNPs from the 39 celiac disease patient samples. The sequencing and downstream analysis were performed at the BGI institute in China. These samples may later be augmented with 6 patients sequenced in Groningen. The initial input is the 39 samples in GFF format. We first identified an error in the GFF format: the label "alleles" should be "allele", so this has to be corrected in all files (i.e. alleles=G/A --> allele=G/A).
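The label correction described above can be sketched as follows. This is a minimal illustration, not the script that was actually used: it assumes the attributes sit in the ninth tab-separated column and are separated by ";", as in standard GFF, and the function name `fix_allele_label` is hypothetical.

```python
import re

# Matches the key "alleles=" only at the start of the attribute column
# or directly after a ";" separator, so other text is left untouched.
_ALLELES_KEY = re.compile(r"(^|;\s*)alleles=")


def fix_allele_label(line: str) -> str:
    """Rename the attribute key 'alleles=' to 'allele=' in one GFF line.

    Assumption: attributes are in the 9th tab-separated column.
    Comment lines and lines with fewer than 9 columns are returned unchanged.
    """
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line
    fields[8] = _ALLELES_KEY.sub(r"\1allele=", fields[8])
    return "\t".join(fields) + newline
```

Applying the function line by line to each of the 39 files, and writing the result to a new file, keeps the original input untouched in case the assumption about the column layout turns out to be wrong.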
The first step of the pipeline was to annotate the GFF files with reference information from the HapMap3 and 1000 Genomes projects. To do this we selected the SeattleSeqAnnotation tool. It is a fast, stable and well-known tool. Its drawbacks are that it is a web application and that its source code is closed. The tool's webpage is: http://gvs.gs.washington.edu/SeattleSeqAnnotation/index.jsp They also provide a Java program that wraps the web forms so that the tool can be run from the command line: http://gvs.gs.washington.edu/SeattleSeqAnnotation/SubmitSeattleSeqAnnotationAutoJob.java
The second step was to remove duplicates. The SeattleSeqAnnotation output contained several lines per position. We kept the first line for every duplicated position.
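A minimal sketch of this duplicate removal is shown below. It assumes the output is tab-separated, that the header line starts with "#", and that chromosome and position are the second and third columns (as in the header listed further down); the function name and the column defaults are illustrative, not taken from the original script.

```python
from typing import Iterable, Iterator


def keep_first_per_position(
    lines: Iterable[str], chrom_col: int = 1, pos_col: int = 2
) -> Iterator[str]:
    """Yield only the first line seen for each (chromosome, position) pair.

    Assumption: tab-separated columns; header/comment lines start with '#'.
    Header and blank lines are passed through unchanged.
    """
    seen = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[chrom_col], fields[pos_col])
        if key in seen:
            continue  # a later line for a position we already kept
        seen.add(key)
        yield line
```

Because the function is a generator over lines, it can be fed an open file handle directly and does not need to load a whole annotation file into memory; only the set of positions already seen is kept.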
The third step was to add annotation from Immuno_BeadChip.
The fourth step was to add the rs codes of the SNPs. The output of SeattleSeqAnnotation lacked this information for some SNPs. For these SNPs we copied the rs code from the initial GFF files.
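The rs code back-fill in the fourth step could look roughly like the sketch below. Several details are assumptions made for illustration: rows are represented as dictionaries keyed by column name, the rs codes from the GFF files have already been collected into a lookup keyed by (chromosome, position), and an empty value, "none" or "0" marks a missing rs code. How the rs code is stored inside the GFF files is not covered here.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Column name as it appears in the header of the annotated files.
RS_COLUMN = "rsID(dbSNP+1000genome)"

# Assumption: these values mean "no rs code present" in the annotation output.
MISSING_VALUES = {"", "none", "0"}


def fill_missing_rs_ids(
    rows: Iterable[Dict[str, str]],
    gff_rs_ids: Dict[Tuple[str, str], str],
) -> Iterator[Dict[str, str]]:
    """Copy rs codes from a GFF-derived lookup into rows that lack one.

    Rows that already carry an rs code, or whose position is not in the
    lookup, are yielded unchanged. Input rows are never modified in place.
    """
    for row in rows:
        current = row.get(RS_COLUMN, "").strip().lower()
        key = (row["chromosome"], row["position"])
        if current in MISSING_VALUES and key in gff_rs_ids:
            row = {**row, RS_COLUMN: gff_rs_ids[key]}
        yield row
```

Only rows without an rs code are touched, so an rs code reported by SeattleSeqAnnotation always takes precedence over the one in the initial GFF file.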
So far, the header of the 39 annotated files is:
# inDBSNPOrNot chromosome position referenceBase sampleGenotype allelesMaq allelesDBSNP accession functionGVS functionDBSNP rsID(dbSNP+1000genome) aminoAcids proteinPosition polyPhen nickLab scorePhastCons consScoreGERP chimpAllele CNV geneList AfricanHapMapFreq EuropeanHapMapFreq AsianHapMapFreq hasGenotypes dbSNPValidation repeatMasker tandemRepeat clinicalAssociation proteinSequence Immuno_BeadChip
From Patrick Deelen: I have added the annotations to the Q20 files. The only thing still missing is the eQTL results: the RUG cluster has crashed, so I can't download them. I have tested my program with some old results and it works, so I hope they reset the cluster tomorrow. The analysis was already completed, so it is only a matter of downloading.
From Patrick Deelen: I have added the eQTL results to the files. If the gene name is known, it is displayed; otherwise the probe ID is displayed.
These files were available via scp from Patrick. I have downloaded them from him and given them to Agata.
Things to do:
- Add GO annotation from (GenBrowser2 or David or ...)
- Add allele frequencies for 1KGP and HapMap3
Peter added the following annotations:
eQTL gene Celiac loci Immunochip Source_SeattleSeq Function DB-SNP PolyPhen scorePhastCons consScoreGERP CNV
Pipeline overview
script | property | description | source |
---|---|---|---|
1KGP | alleleFreq | allele frequency in 1KG | 1KG |