A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement

Erik D. Huckvale, Hunter N.B. Moseley

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
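The leakage mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not use the KEGG-SMILES dataset: the toy entries, the pathway labels, and the helper names (`duplicate_fraction`, `deduplicate`, `k_fold_indices`, `leaked_test_fraction`) are all hypothetical, and duplicates are assumed to be exact repeats of a (SMILES, label) pair. The sketch measures how many test entries in a simple k-fold split also appear in the corresponding training set, before and after de-duplication.

```python
from collections import Counter


def duplicate_fraction(entries):
    """Fraction of entries that repeat an earlier, identical entry."""
    counts = Counter(entries)
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(entries)


def deduplicate(entries):
    """Drop repeated entries, keeping first occurrences in their original order."""
    return list(dict.fromkeys(entries))


def k_fold_indices(n, k):
    """Yield (train, test) index lists for a simple, unshuffled k-fold split."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def leaked_test_fraction(entries, k):
    """Fraction of test entries, over all folds, that also occur in the training set."""
    leaked = 0
    for train, test in k_fold_indices(len(entries), k):
        train_set = {entries[i] for i in train}
        leaked += sum(entries[i] in train_set for i in test)
    return leaked / len(entries)


# Hypothetical toy data: (SMILES, pathway label) pairs, two of them repeated.
toy = [
    ("CCO", "lipid"),
    ("CC(=O)O", "carbohydrate"),
    ("CCO", "lipid"),
    ("NCC(=O)O", "amino acid"),
    ("CC(=O)O", "carbohydrate"),
    ("C1=CC=CC=C1", "xenobiotics"),
]

print("duplicate fraction:", duplicate_fraction(toy))
print("leaked before de-duplication:", leaked_test_fraction(toy, k=3))
print("leaked after de-duplication:", leaked_test_fraction(deduplicate(toy), k=3))
```

Any test entry that also sits in the training set can be answered by memorization, which is why a vetting step of this kind belongs before the cross-validation split rather than after it.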

Original language: English
Article number: e0299583
Journal: PLoS ONE
Volume: 19
Issue number: 5 MAY
State: Published - May 2024

Bibliographical note

Publisher Copyright:
© 2024 Huckvale, Moseley. This is an open access article distributed under the terms of the Creative Commons Attribution License,

Funding

This work was supported by the National Science Foundation [NSF 2020026 to HNBM] and the National Institute of Environmental Health Sciences [P42ES007380].

Funders | Funder number
U.S. Department of Energy; Chinese Academy of Sciences; Guangzhou Municipal Science and Technology Project; Oak Ridge National Laboratory; Extreme Science and Engineering Discovery Environment; National Science Foundation; National Energy Research Scientific Computing Center; National Natural Science Foundation of China | 2020026
National Institutes of Health/National Institute of Environmental Health Sciences | P42ES007380

    ASJC Scopus subject areas

    • General
