Development of Gold Standard Next Generation Sequencing Data Sets

Grants and Contracts Details


Specific Aims

This research plan proposes that the development of “gold standard” data sets for various Next Generation Sequencing (NGS) studies will allow for efficient testing and benchmarking of new bioinformatics tools, algorithms, and emerging computational platforms. This project represents a first step in building NGS infrastructure for researchers and clinicians at the University of Kentucky.

Aim 1. Collect and evaluate biomedical data sets
Well-studied and well-characterized data sets will be collected for different NGS use cases, such as resequencing, RNA-seq, ChIP-seq, structural variation analysis, de novo sequencing, and metagenomics. All data sets will be of biomedical interest.

Aim 2. Collect and evaluate analysis tools and methods
A comprehensive list of the most current and widely used analysis tools, algorithms, and emerging computational platforms relevant to the NGS data sets will be compiled. The literature and University of Kentucky researchers will serve as resources for building this list.

Aim 3. Select performance metrics
To assess different methods and platforms quantitatively, metrics will be selected for comparing the tools, algorithms, and emerging computational platforms. Potential metrics include accuracy and speed of code, rates of data transfer, time-to-completion of analysis from start to delivery of results, and ease of tool use.

Aim 4. Benchmark tools and computational platforms
The tools on this list will be installed on multiple platforms, including the University of Kentucky cluster (DLX) and Globus Genomics [1]. Some tools may also be tested on national computational resources through XSEDE proposals.

Significance

As sequencing costs continue to drop and more data are generated, improved management of data analysis and storage will be essential for state-of-the-art research and for efficient clinical decision-making based on NGS.
A common challenge is identifying the sequence variations that may cause particular traits or diseases; these could be single nucleotide polymorphisms (SNPs), indels (insertions or deletions), or structural variations (such as rearrangements that change the location of genes). All of these areas are still being actively researched, and new methods are being developed to address experimental errors in base calling and computational errors in read alignment. Different sequencing technologies have been shown to produce different SNP calls [2], with as many as tens of thousands of SNPs being called only on a specific sequencing platform [3]. Beyond the variation introduced by sequencing technology, different SNP-calling pipelines may give drastically different results. Across five pipelines and fifteen samples from the same sequencing technology, the average concordance for called SNPs was only 57.4% [4]. Even more worrisome, three indel-calling pipelines gave an average concordance of only 26.8% for called indels. Differences of this size show how important benchmark data will be for testing new pipelines and technologies.
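To make the notion of concordance between pipelines concrete, the following is a minimal sketch in Python. It assumes one common definition: the number of variants called by every pipeline divided by the number called by at least one pipeline. The studies cited above may define concordance differently. The function name, the variant tuple layout, and the toy calls are all hypothetical and are not taken from any real data set.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def concordance(call_sets: Dict[str, Set[Variant]]) -> float:
    """Fraction of variants called by every pipeline among those called by any.

    Assumed definition for illustration: |intersection| / |union|.
    """
    if not call_sets:
        raise ValueError("need at least one call set")
    sets = list(call_sets.values())
    union = set().union(*sets)
    if not union:
        return 0.0  # no pipeline called anything
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Toy, made-up calls from three hypothetical pipelines on one sample.
calls = {
    "pipeline_a": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")},
    "pipeline_b": {("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"), ("chr3", 7, "T", "C")},
    "pipeline_c": {("chr1", 100, "A", "G"), ("chr2", 40, "G", "A")},
}

result = concordance(calls)
print(f"{result:.1%}")  # 2 shared variants out of 4 distinct calls: 50.0%
```

With a gold standard data set, the same comparison could be made between each pipeline's calls and the known truth set rather than only between pipelines, which is what allows accuracy to be measured and not just agreement.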
Effective start/end date: 6/1/11 to 9/30/16


  • National Center for Advancing Translational Sciences

