Changes between Initial Version and Version 1 of CLCdemo27Jan11


Timestamp:
2011-01-27T15:55:07+01:00 (13 years ago)
Author:
Wil Bruins
  • CLCdemo27Jan11

    v1 v1  
     1= CLCbio demo 27 Jan 2011 =
     2
     3Pieter demonstrated the protocol he uses to analyse raw sequencing data (alignment, SNP calling and annotation) on Thursday 27 Jan. He used Joke's MMIHS sample.
     4
     5MMIHS (megacystis microcolon intestinal hypoperistalsis syndrome) is a severe bowel syndrome that leads to the death of patients at a very young age. This patient was a foetus of consanguineous (closely related; cousins) parents, aborted before 24 weeks of pregnancy (the syndrome was detected at the 20-week ultrasound).
     6
     71. Where would we expect causal variants?\\
     8  * Because the parents are consanguineous, we expect the causal variant to be homozygous. Using a SNP array and plotting the B allele frequency, we can detect stretches where all SNPs are homozygous. These stretches are the regions of interest.
     92. Sequencing.\\
     10  * in this particular dataset (110114 lane 5) the base qualities look odd at the 3' end; check the FastQC report.
     11  * read trimming using CLC:
     12    * 0.03 limit, roughly comparable to Q20 (not exactly: an error probability of 0.03 is about Q15, and the CLC manual explains how the limit is applied)
     13    * trim first base
     14    * trim 2 or more ambiguous bases in one stretch (Ns)
     15    * CLC discards reads that are shorter than 50 bases after this trimming
     16    * all this seems similar to using BWA's -q 20
     17    * these trimming settings should also be used in the control sample (see below)
     18  * mapping:
     19    * can be relatively strict because of the trimming
     20    * mapping is done per chromosome, so keep in mind that reads may map to one chromosome (for example to a pseudogene) when they should have been mapped to another chromosome altogether
     21    * the settings mismatch cost 2, insertion cost 3, deletion cost 3 are similar to BWA allowing 5 mismatches
     22    * length fraction 0.9 with similarity 0.95 means that at least 90% of each read's length must align, and that the aligned part must be at least 95% similar to the reference sequence
     23  * control sample: should be unrelated (not family, and not the same or a similar disease, although an HNPCC control is fine for MMIHS), but should be sequenced on the same run and prepared with the same capture. A note on terminology: the library is the tube of DNA ready for sequencing, i.e. one sample or pool that has been fragmented, size selected (gel) and adapter ligated
     243. SNP calling.
     25  * CLC can detect indels up to some 5 bases (DIP detection)
     26  * quality settings: window length 11, max gaps/mismatches 2, min average quality of surrounding bases 15, min quality of central base 20
     27  * significance settings: min coverage 10, minimum allele frequency 60%. That threshold is low for the expected homozygous variants (which should differ from the reference in 100% of the reads), but the control can be 50%.
     28  * ploidy: max expected variations is 2 (diploid)
     29  * CLC uses an annotated reference: the SNPs will have gene names when applicable
     30  * how reliable are SNPs found only on forward-mapped or only on reverse-mapped reads? Pieter ignores this for now
     31  * delete SNPs that are in both sample and control (Excel)
     32  * delete SNPs that lack a gene name (whole exome)
     33  * highlight SNPs in the previously detected homozygous regions
     344. !SeattleSeq
     35  * input should be in the format chr<tab>pos<tab>ref base<tab>cons base
     36  * remove known variants and other variants you're not interested in
     375. Return to CLC to check the remaining SNPs
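
The search for homozygous stretches in step 1 can be illustrated with a small sketch. It assumes B allele frequencies for one chromosome are available as (position, BAF) pairs sorted by position; the function name, the thresholds 0.25 and 0.75, and the minimum run length are hypothetical choices for illustration, not values from the demo. A real analysis would need far longer runs and some tolerance for noisy probes.

```python
def homozygous_stretches(snps, het_low=0.25, het_high=0.75, min_snps=3):
    """Find runs of consecutive SNPs that all look homozygous.

    snps: list of (position, baf) tuples for ONE chromosome, sorted by position.
    A BAF strictly between het_low and het_high is treated as heterozygous
    (hypothetical thresholds). Returns (start, end, n_snps) tuples.
    """
    stretches = []
    run = []  # positions of the current homozygous run
    for pos, baf in snps:
        if het_low < baf < het_high:
            # Heterozygous call: close the current run if it is long enough.
            if len(run) >= min_snps:
                stretches.append((run[0], run[-1], len(run)))
            run = []
        else:
            run.append(pos)
    # Close a run that extends to the end of the chromosome.
    if len(run) >= min_snps:
        stretches.append((run[0], run[-1], len(run)))
    return stretches


example = [(100, 0.0), (200, 1.0), (300, 0.02), (400, 0.5), (500, 0.98),
           (600, 0.45), (700, 0.0), (800, 1.0), (900, 0.99), (1000, 0.01)]
print(homozygous_stretches(example))
```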
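
To relate the 0.03 trimming limit in step 2 to Phred scores such as Q20, the standard conversion Q = -10 * log10(p) can be used. The sketch below shows only this conversion; it does not reproduce CLC's trimming algorithm, which applies the limit in its own way (see the CLC manual). The function names are made up for this example.

```python
import math


def phred_from_error(p):
    """Phred quality Q = -10 * log10(p) for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Error probability p = 10 ** (-Q / 10) for a Phred quality Q."""
    return 10.0 ** (-q / 10.0)


for p in (0.05, 0.03, 0.01):
    print(f"p = {p:.2f} -> Q = {phred_from_error(p):.1f}")
```

Read as a plain error probability, 0.03 lies between Q15 and Q16, while Q20 corresponds to 0.01.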
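
The mapping thresholds in step 2 (length fraction 0.9, similarity 0.95) can be read as a per-read check. The sketch below is a simplified interpretation, assuming similarity is counted as matching bases divided by aligned bases; CLC computes this internally and may count gaps differently. The function and parameter names are hypothetical.

```python
def passes_mapping_filter(read_length, aligned_length, matches,
                          length_fraction=0.9, similarity=0.95):
    """Return True if enough of the read aligns and the aligned part is similar enough.

    read_length:    number of bases in the (trimmed) read
    aligned_length: number of read bases included in the alignment
    matches:        number of aligned bases identical to the reference
    """
    if read_length <= 0 or aligned_length <= 0:
        return False
    long_enough = aligned_length >= length_fraction * read_length
    similar_enough = matches >= similarity * aligned_length
    return long_enough and similar_enough


print(passes_mapping_filter(100, 92, 88))  # 92% aligned, about 95.7% similar
print(passes_mapping_filter(100, 85, 85))  # only 85% of the read aligns
print(passes_mapping_filter(100, 95, 89))  # aligned part is about 93.7% similar
```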
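
The filtering at the end of step 3, done in Excel during the demo, and the tab-separated layout from step 4 could also be scripted. The following is a minimal sketch under stated assumptions: each SNP is a dict with the hypothetical keys `chrom`, `pos`, `ref`, `cons` and `gene`, a SNP counts as shared with the control when chromosome, position and consensus base all match, and homozygous regions are given as (chrom, start, end) tuples. The exact column contents SeattleSeq expects should be checked against its own documentation.

```python
def filter_snps(sample_snps, control_snps, homozygous_regions):
    """Drop SNPs shared with the control or lacking a gene name; flag the rest."""
    control_keys = {(s["chrom"], s["pos"], s["cons"]) for s in control_snps}
    kept = []
    for snp in sample_snps:
        if (snp["chrom"], snp["pos"], snp["cons"]) in control_keys:
            continue  # present in both sample and control
        if not snp.get("gene"):
            continue  # no gene name (whole exome data)
        in_region = any(
            chrom == snp["chrom"] and start <= snp["pos"] <= end
            for chrom, start, end in homozygous_regions
        )
        kept.append({**snp, "in_homozygous_region": in_region})
    return kept


def to_tab_line(snp):
    """Format one SNP as chr<tab>pos<tab>ref base<tab>cons base."""
    return "\t".join([snp["chrom"], str(snp["pos"]), snp["ref"], snp["cons"]])


sample = [
    {"chrom": "1", "pos": 1500, "ref": "A", "cons": "G", "gene": "GENE_A"},
    {"chrom": "1", "pos": 2500, "ref": "C", "cons": "T", "gene": "GENE_B"},
    {"chrom": "2", "pos": 900, "ref": "G", "cons": "A", "gene": ""},
    {"chrom": "2", "pos": 4000, "ref": "T", "cons": "C", "gene": "GENE_C"},
]
control = [{"chrom": "1", "pos": 2500, "ref": "C", "cons": "T", "gene": "GENE_B"}]
regions = [("1", 1000, 2000)]

for snp in filter_snps(sample, control, regions):
    print(to_tab_line(snp), snp["in_homozygous_region"])
```

The gene names, positions and regions above are invented test values, not data from the MMIHS sample.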