A novel data structure to support ultra-fast taxonomic classification of metagenomic sequences with k-mer signatures

Xinan Liu, Ye Yu, Jinze Liu, Jinpeng Liu, Corrine F. Elliott, Chen Qian

Research output: Article, peer-reviewed

29 Citations (Scopus)

Abstract

Motivation: Metagenomic read classification is a critical step in the identification and quantification of microbial species sampled by high-throughput sequencing. Although many algorithms have been developed to date, they suffer from significant memory and/or computational costs. Due to the growing popularity of metagenomic data in both basic science and clinical applications, as well as the increasing volume of data being generated, efficient and accurate algorithms are in high demand.

Results: We introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequencing reads. The algorithm employs a novel data structure, called l-Othello, to support efficient querying of a taxon using its k-mer signatures. MetaOthello is an order of magnitude faster than the current state-of-the-art algorithms Kraken and Clark, and requires only one-third of the RAM. In comparison to Kaiju, a metagenomic classification tool using protein sequences instead of genomic sequences, MetaOthello is three times faster and exhibits 20-30% higher classification sensitivity. We report comparative analyses of both scalability and accuracy using a number of simulated and empirical datasets.

Original language: English
Pages (from-to): 171-178
Number of pages: 8
Journal: Bioinformatics
Volume: 34
Issue number: 1
DOI
Status: Published - Jan 1 2018

Bibliographical note

Publisher Copyright:
© 2017 The Author.

Funding

This work was previously submitted to and accepted by The Seventh RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-Seq 2017). We thank the reviewers for their valuable comments and suggestions. This work was supported by National Science Foundation [CAREER award grant number 1054631 to J.L.; grant CNS-1701681 to C.Q.]; and the National Institutes of Health [grant number P30CA177558 and 5R01HG006272-03 to J.L.].

Funders (funder number)
National Science Foundation Arctic Social Science Program: CNS-1701681, 1054631
National Institutes of Health (NIH): P30CA177558
National Human Genome Research Institute: R01HG006272

    ASJC Scopus subject areas

    • Statistics and Probability
    • Biochemistry
    • Molecular Biology
    • Computer Science Applications
    • Computational Theory and Mathematics
    • Computational Mathematics
