TY - JOUR
T1 - Deep learning uncertainty quantification for clinical text classification
AU - Peluso, Alina
AU - Danciu, Ioana
AU - Yoon, Hong Jun
AU - Yusof, Jamaludin Mohd
AU - Bhattacharya, Tanmoy
AU - Spannaus, Adam
AU - Schaefferkoetter, Noah
AU - Durbin, Eric B.
AU - Wu, Xiao Cheng
AU - Stroup, Antoinette
AU - Doherty, Jennifer
AU - Schwartz, Stephen
AU - Wiggins, Charles
AU - Coyle, Linda
AU - Penberthy, Lynne
AU - Tourassi, Georgia D.
AU - Gao, Shang
N1 - Publisher Copyright: © 2023
PY - 2024/1
Y1 - 2024/1
N2 - Introduction: Machine learning algorithms are expected to work side-by-side with humans in decision-making pipelines. Thus, the ability of classifiers to make reliable decisions is of paramount importance. Deep neural networks (DNNs) represent the state-of-the-art models to address real-world classification. Although the strength of activation in DNNs is often correlated with the network's confidence, in-depth analyses are needed to establish whether they are well calibrated. Method: In this paper, we demonstrate the use of DNN-based classification tools to benefit cancer registries by automating information extraction of disease at diagnosis and at surgery from electronic text pathology reports from the US National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) population-based cancer registries. In particular, we introduce multiple methods for selective classification to achieve a target level of accuracy on multiple classification tasks while minimizing the rejection amount—that is, the number of electronic pathology reports for which the model's predictions are unreliable. We evaluate the proposed methods by comparing our approach with the current in-house deep learning-based abstaining classifier. Results: Overall, all the proposed selective classification methods effectively allow for achieving the targeted level of accuracy or higher in a trade-off analysis aimed to minimize the rejection rate. On in-distribution validation and holdout test data, with all the proposed methods, we achieve on all tasks the required target level of accuracy with a lower rejection rate than the deep abstaining classifier (DAC). Interpreting the results for the out-of-distribution test data is more complex; nevertheless, in this case as well, the rejection rate from the best among the proposed methods achieving 97% accuracy or higher is lower than the rejection rate based on the DAC. Conclusions: We show that although both approaches can flag those samples that should be manually reviewed and labeled by human annotators, the newly proposed methods retain a larger fraction and do so without retraining—thus offering a reduced computational cost compared with the in-house deep learning-based abstaining classifier.
KW - Abstaining classifier
KW - Accuracy
KW - CNN
KW - DNN
KW - Deep learning
KW - HiSAN
KW - NCI SEER
KW - Pathology reports
KW - Selective classification
KW - Text classification
KW - Uncertainty quantification
UR - http://www.scopus.com/inward/record.url?scp=85179763559&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85179763559&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2023.104576
DO - 10.1016/j.jbi.2023.104576
M3 - Article
C2 - 38101690
AN - SCOPUS:85179763559
SN - 1532-0464
VL - 149
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
M1 - 104576
ER -