= Short NGS-related projects =

== Introduction ==

The Genomics Coordination Center (GCC) is an expert in the field of next-generation sequencing (NGS) data processing and analysis. NGS is applied both for diagnostic purposes, such as finding known mutations in targeted disease genes or discovering new genetic variants implicated in a disease, and for broader academic research, such as genome-wide structural changes or population-wide diversity mapping.

= Project: Writing a demultiplexer in Java =

== Background ==

To save time and costs, different samples are often pooled together when sequenced with the !HiSeq. To untangle the sequences after sequencing is complete, a barcode is attached to each sequence read to identify the sample of origin. The !HiSeq can only read the ends of the sequences (NNNNNNNNN?????????????NNNNNNNNNN). The barcode can be attached either on both ends (1) or in the middle (2). For each method there is an existing demultiplexing script that identifies the sample a sequence belongs to and removes the barcode.

1) 'Both end' demultiplexing: [http://www.bbmriwiki.nl/svn/ngs_scripts/trunk/scripts_data_archiving/demultiplex.R link]

2) 'Middle' demultiplexing: [http://www.bbmriwiki.nl/svn/ngs_scripts/trunk/demultiplex_miseq_illumina_barcoded.pl link]

However, these scripts are not very efficient and take many days to run.

== Goals ==

 * Rewrite the demultiplexing software in Java (the language of choice within the GCC, so easy to maintain)
 * Combine both processing methods in a single piece of software
 * Write TestNG test cases that capture the requirements, to guide correct development and to detect bugs when maintaining the code
 * Parallelize the algorithm to get a major performance boost (each read can be processed independently)

= Project: New NGS quality report procedure =

== Background ==

The sequencing machines produce large raw data files, which are processed and refined using a computational pipeline.
This pipeline comprises the latest tools in the field and runs on high-performance computer clusters and the grid. An essential part of delivering the end product is a quality report that contains a number of statistics, visualizations and checks of various steps in the pipeline. This document is reviewed to ensure that there were no anomalies in the input data, that all processing steps went as expected, and that there are no unwelcome surprises in the results. The scripts that produce this quality report could use an overhaul.

== Goals ==

 * Identify aspects of the pipeline that need better reporting, including missing data, missing files, clear error messages for debugging, single-read and paired-end data, etc.
 * Replace the techniques currently used to produce the report, which are not ideal, with a better-suited framework that can seamlessly combine textual output with tables and graphs
 * Rework a number of plot routines with important visualizations to be clearer and more informative
 * Verify the way the report is produced with test cases, to ensure the reporting software is working correctly and any errors are truly caused by faulty data
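
As an illustration of the demultiplexer project described earlier on this page, the Java sketch below shows one way the per-read processing could be parallelized. It is not the existing R or Perl logic: the class `DemultiplexSketch` and its methods are hypothetical names, reads are plain strings rather than FASTQ records, the barcode is assumed to sit at the start of the read (a simplification of the 'both end' layout), all barcodes are assumed to have the same length, and only exact matches are accepted. Reads without a known barcode are simply dropped here.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

/** Hypothetical sketch; names are illustrative and not part of any existing GCC code. */
final class DemultiplexSketch {

    /**
     * Assumption: the barcode is an exact prefix of the read.
     * Returns (sample, read without barcode), or empty if no barcode matches.
     */
    static Optional<Map.Entry<String, String>> assign(String read,
                                                      Map<String, String> barcodeToSample) {
        for (Map.Entry<String, String> e : barcodeToSample.entrySet()) {
            String barcode = e.getKey();
            if (read.startsWith(barcode)) {
                // Strip the barcode so that only the sample sequence remains.
                return Optional.of(Map.entry(e.getValue(), read.substring(barcode.length())));
            }
        }
        return Optional.empty();
    }

    /** Each read is handled independently, so a parallel stream can spread the work over cores. */
    static Map<String, List<String>> demultiplex(List<String> reads,
                                                 Map<String, String> barcodeToSample) {
        return reads.parallelStream()
                .map(read -> assign(read, barcodeToSample))
                .flatMap(Optional::stream) // drop reads with no matching barcode
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }
}
```

A real implementation would probably stream FASTQ records from disk instead of holding all reads in memory, tolerate sequencing errors in the barcode, and also cover the 'middle' layout. The same small inputs used to try out this sketch could serve as a starting point for the TestNG test cases named in the goals.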