Inconsistent Performance of Deep Learning Models on Mammogram Classification

Xiaoqin Wang, Gongbo Liang, Yu Zhang, Hunter Blanton, Zachary Bessinger, Nathan Jacobs

Research output: Contribution to journal › Article › peer-review

109 Scopus citations

Abstract

Objectives: The performance of recently developed deep learning models for image classification surpasses that of radiologists. However, there are questions about the consistency of model performance and its generalization to unseen external data. The purpose of this study is to determine whether the high performance of deep learning on mammograms can be transferred to external data with a different data distribution.

Materials and Methods: Six deep learning models (three published models with high performance and three models designed by us) were evaluated on four different mammogram data sets, including three public data sets (Digital Database for Screening Mammography, INbreast, and Mammographic Image Analysis Society) and one private data set (UKy). The models were trained and validated on either Digital Database for Screening Mammography alone or a combined data set that included Digital Database for Screening Mammography. The models were then tested on the three external data sets. The area under the receiver operating characteristic curve (auROC) was used to evaluate model performance.

Results: The three published models reported auROC scores between 0.88 and 0.95 on the validation data set. Our models achieved between 0.71 (95% confidence interval [CI]: 0.70-0.72) and 0.79 (95% CI: 0.78-0.80) auROC on the same validation data set. However, the auROC of all six models decreased significantly on the three external test data sets, ranging only from 0.44 (95% CI: 0.43-0.45) to 0.65 (95% CI: 0.64-0.66).

Conclusion: Our results demonstrate performance inconsistency across the data sets and models, indicating that the high performance of deep learning models on one data set cannot be readily transferred to unseen external data sets. These models need further assessment and validation before being applied in clinical practice.
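As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of computing auROC together with a 95% confidence interval. The abstract does not state how the intervals were obtained, so the percentile bootstrap used here is an assumption, and the function names `auroc` and `bootstrap_ci` as well as the toy labels and scores are hypothetical.

```python
import random


def auroc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: 0/1 ground truth; scores: higher means more likely positive.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("auROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for auROC (assumed method, not from the source)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        # Resample cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical toy data, not taken from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)  # 3 of 4 positive/negative pairs ranked correctly
toy_ci = bootstrap_ci(toy_labels, toy_scores, n_boot=200)
```

Running the same two functions on a validation set and on each external test set would give directly comparable auROC values, which is the kind of comparison the Results paragraph reports.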

Original language: English
Pages (from-to): 796-803
Number of pages: 8
Journal: Journal of the American College of Radiology
Volume: 17
Issue number: 6
State: Published - Jun 2020

Bibliographical note

Publisher Copyright:
© 2020 American College of Radiology

Funding

This work was supported by Grant No. IRG 16-182-28 from the American Cancer Society (principal investigator: Xiaoqin Wang) and Grant No. IIS-1553116 from the National Science Foundation (principal investigator: Nathan Jacobs). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the American Cancer Society or National Science Foundation. The Markey Cancer Center's Research Communications Office assisted with preparation of this manuscript.

Funders | Funder number
American Cancer Society or National Science Foundation
National Science Foundation Arctic Social Science Program
American Cancer Society-Michigan Cancer Research Fund | IIS-1553116
American Cancer Society-Michigan Cancer Research Fund
University of Kentucky Markey Cancer Center

    Keywords

    • Deep learning
    • mammogram
    • performance inconsistency

    ASJC Scopus subject areas

    • Radiology, Nuclear Medicine and Imaging
