Boosting Tree-Assisted Multitask Deep Learning for Small Scientific Datasets

Jian Jiang, Rui Wang, Menglun Wang, Kaifu Gao, Duc Duy Nguyen, Guo Wei Wei

Research output: Contribution to journal, Article, peer-review

71 Scopus citations

Abstract

Machine learning approaches have had tremendous success in various disciplines, but that success depends heavily on the size and quality of the datasets. Scientific datasets are often small and difficult to collect, and improving machine learning performance on them remains a major challenge in many academic fields, such as bioinformatics and medical science. Gradient boosting decision tree (GBDT) is typically optimal for small datasets, while deep learning often performs better for large datasets. This work reports a boosting tree-assisted multitask deep learning (BTAMDL) architecture that integrates GBDT and multitask deep learning (MDL) to achieve near-optimal predictions for small datasets when a large dataset exists that is well correlated with them. Two BTAMDL models are constructed: one uses purely MDL output as GBDT input, while the other admits additional features in the GBDT input. The proposed BTAMDL models are validated on four categories of datasets: toxicity, partition coefficient, solubility, and solvation. The proposed BTAMDL models are found to outperform the current state-of-the-art methods in various applications involving small datasets.
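
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the network size, the learning rate, and all variable names are hypothetical. It also assumes that the "MDL output" passed to the GBDT is the shared hidden-layer activation, that the raw input descriptors stand in for the "additional features", and that scikit-learn's GradientBoostingRegressor serves as the GBDT.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a large task and a small, correlated task
# described by the same 8 input features.
n_large, n_small, n_feat, n_hidden = 400, 40, 8, 16
w_true = rng.normal(size=n_feat)
X_large = rng.normal(size=(n_large, n_feat))
X_small = rng.normal(size=(n_small, n_feat))
y_large = np.tanh(X_large @ w_true)
y_small = 0.8 * np.tanh(X_small @ w_true) + 0.05 * rng.normal(size=n_small)

# Stack both tasks; `task` records which output head each sample supervises.
X = np.vstack([X_large, X_small])
y = np.concatenate([y_large, y_small])
task = np.concatenate([np.zeros(n_large, dtype=int), np.ones(n_small, dtype=int)])
weights = np.eye(2)[task]
weights /= weights.sum(axis=0)  # each task's loss is a mean over its own samples

# Stage 1: multitask network with one shared tanh hidden layer
# and one linear output head per task.
W1 = rng.normal(scale=0.3, size=(n_feat, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.3, size=(n_hidden, 2))
b2 = np.zeros(2)

def hidden(X_in):
    """Shared representation learned jointly from both tasks."""
    return np.tanh(X_in @ W1 + b1)

# Full-batch gradient descent on the masked squared error.
lr = 0.05
for _ in range(3000):
    H = hidden(X)
    # The error is zeroed for the head a sample does not belong to.
    err = weights * (H @ W2 + b2 - y[:, None])
    grad_H = (err @ W2.T) * (1.0 - H ** 2)
    W2 -= lr * (H.T @ err)
    b2 -= lr * err.sum(axis=0)
    W1 -= lr * (X.T @ grad_H)
    b1 -= lr * grad_H.sum(axis=0)

# Stage 2: GBDT trained on the small task only.
H_small = hidden(X_small)

# Variant 1: MDL-derived features only.
gbdt_1 = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=0)
gbdt_1.fit(H_small, y_small)

# Variant 2: MDL-derived features plus additional features
# (here the raw descriptors, as a stand-in).
X_aug = np.hstack([H_small, X_small])
gbdt_2 = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=0)
gbdt_2.fit(X_aug, y_small)

pred_1 = gbdt_1.predict(H_small)
pred_2 = gbdt_2.predict(X_aug)
```

The predictions above are made on the training samples only to keep the sketch short; a real comparison of the two variants would use a held-out split or cross-validation on the small dataset.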

Original language: English
Pages (from-to): 1235-1244
Number of pages: 10
Journal: Journal of Chemical Information and Modeling
Volume: 60
Issue number: 3
State: Published - Mar 23 2020

Bibliographical note

Publisher Copyright:
Copyright © 2020 American Chemical Society.

Funding

This work was supported in part by NSF Grants DMS-1721024, DMS-1761320, and IIS1900473 and NIH Grant GM126189. D.D.N. and G.W.W. are also funded by Bristol-Myers Squibb and Pfizer. J.J. was supported by The Chinese Scholarships Council and the National Natural Science Foundation of China under Grant Nos. 61573011 and 11972266.

Funder: Funder number
Chinese Scholarships Council
National Science Foundation Arctic Social Science Program: IIS1900473, DMS-1721024, DMS-1761320
National Institutes of Health (NIH)
National Institute of General Medical Sciences: R01GM126189
Bristol-Myers Squibb
Pfizer
National Natural Science Foundation of China (NSFC): 11972266, 61573011

    ASJC Scopus subject areas

    • General Chemistry
    • General Chemical Engineering
    • Computer Science Applications
    • Library and Information Sciences
