Gazelle: Transcript abundance query against large-scale RNA-seq experiments

Xiaofei Zhang, Ye Yu, Chan Hee Mok, James N. MacLeod, Jinze Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

High-throughput sequencing data has grown exponentially in almost every sequencing data repository. To date, most exploratory analysis of these large datasets requires heavy-lifting data processing pipelines that are both resource- and labor-intensive. Very recently, various algorithms have been developed to enable arbitrary sequence queries over large collections of sequencing data. These algorithms were designed to support presence/absence queries, i.e., screening for RNA-seq samples that contain a given transcript sequence. Their utility is limited because they cannot retrieve the abundance of the query sequence. Abundance information is critical in real applications for understanding how variation in transcript expression is associated with different biological conditions or disease subtypes. In this paper, we present Gazelle, a sequence query engine that enables fast, quantified queries against large-scale RNA-seq experiments. Gazelle exploits the advantages of two different types of hashing algorithms and combines them into one integrated structure to support highly efficient and accurate sequence queries with abundance. We evaluated Gazelle on three datasets to benchmark its efficiency, its accuracy, and its utility in real-life applications. Our results show that Gazelle achieves near-perfect k-mer query, supports on-demand sequence queries against a moderately large sequence database, and yields abundance estimates that are highly consistent with RT-qPCR as well as with traditional transcript quantification methods such as Kallisto.
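The difference between a presence/absence query and a query with abundance can be illustrated with a toy example. The sketch below is not Gazelle's data structure: a plain Python dictionary of k-mer counts stands in for the hashing-based index, and the function names, the sample names, and the choice of the median k-mer count as the abundance summary are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map sample name -> Counter of k-mer counts.

    Hypothetical stand-in for a real index; exact counts are kept in memory.
    """
    return {
        name: Counter(km for read in reads for km in kmers(read, k))
        for name, reads in samples.items()
    }


def query(index, transcript, k):
    """Return {sample: (present, abundance)} for a query sequence.

    present:   every k-mer of the query occurs at least once in the sample.
    abundance: median count of the query's k-mers (illustrative summary only).
    """
    query_kmers = list(kmers(transcript, k))
    if not query_kmers:
        raise ValueError("query sequence is shorter than k")
    result = {}
    for name, counts in index.items():
        hits = [counts[km] for km in query_kmers]  # missing k-mers count as 0
        result[name] = (all(h > 0 for h in hits), median(hits))
    return result


# Toy reads, invented for this example.
samples = {
    "sample_a": ["ACGTAC", "ACGTAC", "CGTACG"],
    "sample_b": ["ACGTAC"],
    "sample_c": ["TTTTTT"],
}
index = build_index(samples, k=4)
answer = query(index, "ACGTAC", k=4)
```

A presence/absence query would report sample_a and sample_b identically, since both contain every k-mer of the query. The count-based summary separates them, which is the kind of information the abstract argues is needed to relate expression to biological conditions.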

Original language: English
Title of host publication: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
ISBN (Electronic): 9781450384506
State: Published - Jan 18 2021
Event: 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 - Virtual, Online, United States
Duration: Aug 1 2021 to Aug 4 2021

Publication series

Name: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021

Conference

Conference: 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
Country/Territory: United States
City: Virtual, Online
Period: 8/1/21 to 8/4/21

Bibliographical note

Publisher Copyright:
© 2021 ACM.

Keywords

  • RNA-seq
  • indexing
  • transcript query

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Biomedical Engineering
  • Health Informatics
