TY - JOUR
T1 - Class imbalance in out-of-distribution datasets
T2 - Improving the robustness of the TextCNN for the classification of rare cancer types
AU - De Angeli, Kevin
AU - Gao, Shang
AU - Danciu, Ioana
AU - Durbin, Eric B.
AU - Wu, Xiao Cheng
AU - Stroup, Antoinette
AU - Doherty, Jennifer
AU - Schwartz, Stephen
AU - Wiggins, Charles
AU - Damesyn, Mark
AU - Coyle, Linda
AU - Penberthy, Lynne
AU - Tourassi, Georgia D.
AU - Yoon, Hong Jun
N1 - Publisher Copyright: © 2021
PY - 2022/1
Y1 - 2022/1
AB - In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
KW - CNN
KW - Class Imbalance
KW - Deep Learning
KW - Ensemble
KW - NLP
KW - Text Classification
UR - http://www.scopus.com/inward/record.url?scp=85120156672&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85120156672&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2021.103957
DO - 10.1016/j.jbi.2021.103957
M3 - Article
C2 - 34823030
AN - SCOPUS:85120156672
SN - 1532-0464
VL - 125
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
M1 - 103957
ER -