Abstract
Motivation: Metagenomic read classification is a critical step in the identification and quantification of microbial species sampled by high-throughput sequencing. Although many algorithms have been developed to date, they suffer from significant memory and/or computational costs. Due to the growing popularity of metagenomic data in both basic science and clinical applications, as well as the increasing volume of data being generated, efficient and accurate algorithms are in high demand.

Results: We introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequencing reads. The algorithm employs a novel data structure, called l-Othello, to support efficient querying of a taxon using its k-mer signatures. MetaOthello is an order of magnitude faster than the current state-of-the-art algorithms, Kraken and Clark, and requires only one-third of the RAM. In comparison to Kaiju, a metagenomic classification tool using protein sequences instead of genomic sequences, MetaOthello is three times faster and exhibits 20-30% higher classification sensitivity. We report comparative analyses of both scalability and accuracy using a number of simulated and empirical datasets.
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 171-178 |
| Number of pages | 8 |
| Journal | Bioinformatics |
| Volume | 34 |
| Issue number | 1 |
| DOIs | |
| State | Published - Jan 1 2018 |
Bibliographical note
Publisher Copyright: © 2017 The Author.
Funding
This work was previously submitted to and accepted by The Seventh RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-Seq 2017). We thank the reviewers for their valuable comments and suggestions. This work was supported by National Science Foundation [CAREER award grant number 1054631 to J.L.; grant CNS-1701681 to C.Q.]; and the National Institutes of Health [grant number P30CA177558 and 5R01HG006272-03 to J.L.].
| Funders | Funder number |
|---|---|
| National Science Foundation | CNS-1701681, 1054631 |
| National Institutes of Health (NIH) | P30CA177558 |
| National Human Genome Research Institute | R01HG006272 |
ASJC Scopus subject areas
- Statistics and Probability
- Biochemistry
- Molecular Biology
- Computer Science Applications
- Computational Theory and Mathematics
- Computational Mathematics