= Running the VCF aggregation and de-sampling procedure =

== 1. Install VCFtools and Tabix ==

Download and install [http://sourceforge.net/projects/samtools/files/tabix/ Tabix] and [http://sourceforge.net/projects/vcftools/files/ VCFtools]. Tabix needs to be compiled first, so run ''make'' in its source directory.

== 2. Add location of VCFtools and Tabix to your path ==

For example:
{{{
export PATH=/Volumes/Users/Software/vcftools_0.1.10/bin/:/Volumes/Users/Software/tabix-0.2.6/:${PATH}
}}}
For a more permanent setup, add this line to your shell startup file (e.g. .bashrc).

== 3. Download and install Perl if you don't have it ==

Go to: https://www.perl.org/

== 4. Download the script and put it in a folder of your choosing ==

GitHub link: https://github.com/molgenis/ngs-utils/blob/master/scripts/vcf-fill-gtc.pl

Raw version for wget: https://raw.githubusercontent.com/molgenis/ngs-utils/master/scripts/vcf-fill-gtc.pl

= Aggregation procedure =

== 1. Sort, filter, bgzip and index ==

vcf-merge (used in the next step) only works on bgzip-compressed, tabix-indexed VCF files, so we must sort, filter on 'PASS', bgzip and index with tabix.
{{{
for item in mydirectory/*.vcf; do
    echo "Processing $item..."
    # Sort, then hard-filter: -H drops records whose FILTER is not PASS or "."
    vcf-sort "$item" | vcf-annotate -H > "$item.sorted.filtered"
    bgzip "$item.sorted.filtered"
    tabix -p vcf "$item.sorted.filtered.gz"
done
}}}

== 2. Merge sample VCFs into one batch VCF ==

{{{
vcf-merge mydirectory/*.vcf.sorted.filtered.gz | bgzip -c > merged.vcf.gz
}}}

== 3. Create a summary VCF per batch ==

{{{
vcf-fill-gtc.pl -vcfi merged.vcf.gz -vcfo stripped.vcf -ss -fv PASS -si -ll INFO > stripped.vcf.log
}}}
'''The option -ss is crucial here: it removes all sample details.'''

Afterwards, be sure to inspect the log file for warnings!
{{{
more stripped.vcf.log
}}}

== Troubleshooting ==

Q: My VCF files are not in completely valid VCF format!

A: There are some built-in options to help with this. For example, the following fixes a bug in old NextGene versions: a missing '>' at the end of contig meta-data lines.
{{{
perl -pi -e 's/(contig=<[^>\n]+)$/$1>/' mydirectory/*.vcf
}}}

Q: What are the script options?

A: Man page:
{{{
#
# Create a summary VCF per batch:
#   -ss      : remove sample details!
#   -fv PASS : keep only high quality variant calls that pass all filters applied in NextGene.
#              Just to be sure: variants should already have been filtered on PASS only in a previous step,
#              so this should be redundant here...
#   -si      : remove all INFO subfields except for INFO:AN and INFO:AC.
#              INFO:AN and INFO:AC were automatically updated by vcf-merge,
#              but the others were not and may contain erroneous annotation
#              that causes vcf-validator to complain the created VCF is not valid.
#
}}}
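
Q: How can I double-check that the sample details are really gone?

A: Since -ss is the option that removes sample details, it may be worth verifying the result before sharing a summary VCF. The following is a minimal sketch, not part of VCFtools or vcf-fill-gtc.pl. It assumes that the output is an uncompressed, tab-delimited VCF and that a de-sampled file has at most the eight fixed columns plus an optional FORMAT column on its #CHROM header line. Whether vcf-fill-gtc.pl keeps or drops the FORMAT column is not confirmed here. The function name `count_sample_columns` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count the sample columns on the "#CHROM" header line of a VCF.
# Assumption: columns 1-8 are the fixed VCF columns, column 9 (if present)
# is FORMAT, and every column after that is one sample.

# count_sample_columns FILE
#   Hypothetical helper (not part of VCFtools).
#   Prints the number of sample columns found in FILE.
#   Exits with status 1 if FILE has no "#CHROM" header line.
count_sample_columns() {
    awk -F'\t' '
        /^#CHROM/ {
            n = NF - 9          # columns beyond FORMAT are samples
            if (n < 0) n = 0    # 8 fixed columns only: no samples
            print n
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    ' "$1"
}

# Example use (commented out so the sketch can be sourced safely):
#   if [ "$(count_sample_columns stripped.vcf)" -eq 0 ]; then
#       echo "No sample columns found."
#   else
#       echo "WARNING: sample columns still present!"
#   fi
```

A result of 0 only shows that no per-sample columns remain. It does not replace inspecting the log file for warnings.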