Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites

Erik D. Huckvale, Christian D. Powell, Huan Jin, Hunter N.B. Moseley

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

Metabolic pathways are human-defined groupings of life-sustaining biochemical reactions, with metabolites serving as both the reactants and the products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, which hinders metabolic interpretation. To address this shortcoming, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites from their chemical descriptions. These prior models, however, are based on outdated KEGG-derived metabolite datasets, including one benchmark dataset that is invalid because it contains over 1500 duplicate entries. We have therefore developed a new benchmark dataset derived from KEGG that follows optimal standards of scientific computational reproducibility and includes all source code needed to update the benchmark dataset as KEGG changes. Using this new benchmark dataset with our atom coloring methodology, we developed and compared the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, achieved by XGBoost binary classification models for 11 KEGG-defined pathway categories.
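For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how an F1 score and a Matthews correlation coefficient (MCC) can be computed for one binary pathway-category classifier, and how per-category scores might be combined into a weighted average. It is not the authors' code. The function names, the toy labels, and the choice of weights are all assumptions made for illustration, and the abstract does not state how the weights were defined.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weighted_average(scores, weights):
    """Weighted mean of per-category scores (weights are an assumption)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Toy labels for one hypothetical pathway category:
# 1 = metabolite belongs to the category, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("F1: ", round(f1_score(tp, fp, fn), 4))
print("MCC:", round(mcc(tp, tn, fp, fn), 4))
```

MCC uses all four cells of the confusion matrix, which makes it informative when a category has few positive metabolites relative to negatives. F1 ignores true negatives.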

Original language: English
Article number: 1120
Journal: Metabolites
Volume: 13
Issue number: 11
State: Published - Nov 2023

Bibliographical note

Publisher Copyright:
© 2023 by the authors.

Funding

The research was funded by the National Science Foundation, grant number 2020026 (PI Moseley), and by the National Institutes of Health, grant number P42 ES007380 (University of Kentucky Superfund Research Program Grant; PI Pennell). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation or the National Institute of Environmental Health Sciences.

Funders and funder numbers
National Science Foundation Arctic Social Science Program: 2020026
National Institutes of Health (NIH): P42 ES007380
National Institutes of Health/National Institute of Environmental Health Sciences

    Keywords

    • KEGG
    • atom color
    • kegg_pull
    • machine learning
    • md_harmonize
    • metabolite
    • pathway

    ASJC Scopus subject areas

    • Endocrinology, Diabetes and Metabolism
    • Biochemistry
    • Molecular Biology
