Running the VCF aggregation and de-sampling procedure
1. Install VCFtools and Tabix
Download and install Tabix and VCFtools. Tabix needs to be compiled first, so use make.
2. Add location of VCFtools and Tabix to your path
For example:
export PATH=/Volumes/Users/Software/vcftools_0.1.10/bin/:/Volumes/Users/Software/tabix-0.2.6/:${PATH}
For a more permanent setup, add the same line to your shell startup file (e.g. .bashrc).
3. Download and install Perl if you don't have it
Go to: https://www.perl.org/
4. Download the script and put it in a folder of your choosing
GitHub link: https://github.com/molgenis/ngs-utils/blob/master/scripts/vcf-fill-gtc.pl
Raw version for wget: https://raw.githubusercontent.com/molgenis/ngs-utils/master/scripts/vcf-fill-gtc.pl
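Before moving on, it can help to confirm that everything installed above is reachable from the shell. The sketch below assumes a POSIX-style shell. The function name missing_tools is hypothetical, and the tool list in the example call is taken from the commands used later on this page.

```shell
# Print the name of every command in the argument list that is not on PATH.
# 'missing_tools' is a hypothetical helper name, not part of VCFtools or Tabix.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call (commented out so the snippet only defines the function):
# missing_tools vcf-merge vcf-sort vcf-annotate bgzip tabix perl
```

If the call prints nothing, all listed tools were found. Any name it prints still needs to be installed or added to PATH.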
Aggregation procedure
1. Merge sample VCFs into one batch VCF
vcf-merge CAR_*/*.vcf.sorted.filtered.gz | bgzip -c > merged.vcf.gz
2. Create a summary VCF per batch
vcf-fill-gtc.pl -vcfi merged.vcf.gz -vcfo stripped.vcf -ss -fv PASS -si -ll INFO > stripped.vcf.log
The option -ss is crucial here: it removes all sample details.
Afterwards, be sure to inspect the log file for warnings!
more stripped.vcf.log
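Paging through a long log by eye is easy to get wrong. A minimal sketch for pulling out only the suspicious lines follows. It assumes that warnings and errors are marked with the words WARN, ERROR or FATAL somewhere on the line, which is an assumption about the log format rather than something documented here. The function name log_problems is hypothetical.

```shell
# Print log lines that look like warnings or errors, prefixed with their line numbers.
# Assumption: such lines contain WARN, ERROR or FATAL; adjust the pattern to the real log format.
log_problems() {
    grep -n -E 'WARN|ERROR|FATAL' "$1"
}

# Example call (commented out so the snippet only defines the function):
# log_problems stripped.vcf.log || echo "No warnings or errors matched"
```

grep exits with a non-zero status when nothing matches, so the function can also be used as a condition in a script.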
Man page:
#
# Create a summary VCF per batch:
#  -ss      : remove sample details!
#  -fv PASS : keep only high quality variant calls that pass all filters applied in NextGene.
#             Just to be sure: variants should already have been filtered on PASS only in a previous step,
#             so this should be redundant here...
#  -si      : remove all INFO subfields except for INFO:AN and INFO:AC.
#             INFO:AN and INFO:AC were automatically updated by vcf-merge,
#             but the others were not and may contain erroneous annotation
#             that cause vcf-validator to complain the created VCF is not valid.
#
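Because removing sample details is the whole point of this step, a quick sanity check on the output may be worthwhile. In the VCF format the header line starting with #CHROM has eight fixed columns (CHROM to INFO); a FORMAT column and per-sample columns follow only when genotype data is present. The sketch below assumes that -ss drops those extra columns entirely, and uses a hypothetical function name has_sample_columns.

```shell
# Succeed (exit 0) if the #CHROM header line has more than the 8 fixed VCF columns,
# i.e. if a FORMAT column and/or sample columns are still present.
# 'has_sample_columns' is a hypothetical helper name.
has_sample_columns() {
    awk -F'\t' '/^#CHROM/ { found = (NF > 8) } END { exit(found ? 0 : 1) }' "$1"
}

# Example call (commented out so the snippet only defines the function):
# has_sample_columns stripped.vcf && echo "Sample columns are still present!"
```

This only inspects the header line; it does not check whether sample identifiers remain elsewhere in the meta-data.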
Troubleshooting
Q: My VCF files are not in a completely valid format!
A: There are some built-in options to help with this. For example:
Prepare sample VCFs for one batch; e.g. CAR_Batch1_106Samples
cd /Volumes/CardioKitVCFs/OriginalVCFs/CAR_Batch1_106Samples
Fix missing '>' at the end of contig meta-data lines.
perl -pi -e 's/(contig=<ID=[^>\n]+)$/$1>/' CAR_*/*.vcf
Sort, filter on 'PASS', bgzip and index with tabix (vcftools will not work on uncompressed, unindexed VCF files.)
for item in CAR_*/*.vcf; do
    echo "Processing $item..."
    vcf-sort "$item" | vcf-annotate -H > "$item.sorted.filtered"
    bgzip "$item.sorted.filtered"
    tabix -p vcf "$item.sorted.filtered.gz"
done
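Before returning to the merge step, one might confirm that every compressed file actually received an index. tabix -p vcf writes the index next to its input with a .tbi suffix. The helper name unindexed_files below is hypothetical.

```shell
# Print every file from the argument list that has no matching .tbi index next to it.
# 'unindexed_files' is a hypothetical helper name, not part of Tabix.
unindexed_files() {
    for gz in "$@"; do
        [ -f "$gz.tbi" ] || echo "$gz"
    done
}

# Example call (commented out so the snippet only defines the function):
# unindexed_files CAR_*/*.vcf.sorted.filtered.gz
```

An empty result suggests all files are ready for vcf-merge. Any file it lists should be re-run through tabix first.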