Almost every sequencing data repository has seen exponential growth in high-throughput sequencing data. To date, most exploratory analysis of these large datasets requires heavyweight data processing pipelines that are both resource- and labor-intensive. Recently, various algorithms have been developed to enable arbitrary sequence queries over large collections of sequencing data. These algorithms were designed to support presence/absence queries, i.e., screening for RNA-seq samples that contain a given transcript sequence. Their utility is limited because they cannot retrieve the abundance of the query sequence. Such abundance information is critical in real applications for understanding how variation in transcript expression is associated with different biological conditions or disease subtypes. In this paper, we present Gazelle, a sequence query engine that enables fast, quantified queries against large-scale RNA-seq experiments. Gazelle exploits the advantages of two different types of hashing algorithms and combines them into one integrated structure to support highly efficient and accurate sequence queries with abundance. We evaluated Gazelle on three datasets to benchmark its efficiency, its accuracy, and its utility in real-life applications. Our results show that Gazelle achieves near-perfect k-mer queries, supports on-demand sequence queries against moderately large sequence databases, and yields abundance estimates that are highly consistent with RT-qPCR as well as with traditional transcript quantification methods such as Kallisto.
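To make the distinction between a presence/absence query and a quantified query concrete, here is a minimal toy sketch. It is not Gazelle's data structure: it uses an exact in-memory dictionary of k-mer counts rather than the two hashing schemes the abstract mentions, and the function names, the `threshold` parameter, and the mean-k-mer-count estimator are all assumptions chosen for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each sample name to a Counter of k-mer occurrences over its reads.

    Hypothetical exact index; a real engine would use compact hashing instead.
    """
    return {
        name: Counter(km for read in reads for km in kmers(read, k))
        for name, reads in samples.items()
    }


def presence_query(index, query, k, threshold=1.0):
    """Return samples containing at least `threshold` of the query's k-mers."""
    qk = list(kmers(query, k))
    hits = []
    for name, counts in index.items():
        found = sum(1 for km in qk if km in counts)
        if qk and found / len(qk) >= threshold:
            hits.append(name)
    return hits


def abundance_query(index, query, k):
    """Return, per sample, the mean count of the query's k-mers.

    The mean is an assumed, deliberately simple abundance estimator.
    """
    qk = list(kmers(query, k))
    return {
        name: (sum(counts[km] for km in qk) / len(qk) if qk else 0.0)
        for name, counts in index.items()
    }


# Three tiny made-up "samples", each a list of reads.
samples = {
    "s1": ["ACGTAC", "ACGTAC"],
    "s2": ["ACGTAC"],
    "s3": ["TTTTTT"],
}
index = build_index(samples, k=3)
present = presence_query(index, "ACGTA", k=3)
abundance = abundance_query(index, "ACGTA", k=3)
```

In this toy data the presence query cannot tell `s1` from `s2`, since both contain every query k-mer, whereas the abundance query reports that the sequence is twice as abundant in `s1`. That difference is the kind of information the abstract argues presence/absence tools cannot provide.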
|Title of host publication||Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021|
|State||Published - Jan 18 2021|
|Event||12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 - Virtual, Online, United States|
Duration: Aug 1 2021 → Aug 4 2021
Bibliographical note
Publisher Copyright:
© 2021 ACM.
Keywords
- transcript query
ASJC Scopus subject areas
- Computer Science Applications
- Biomedical Engineering
- Health Informatics