Abstract
Objective: We study the performance of machine learning (ML) methods, including neural networks (NNs), to extract mutational test results from pathology reports collected by cancer registries. Given the lack of hand-labeled datasets for mutational test result extraction, we focus on the use case of extracting Epidermal Growth Factor Receptor mutation results in non-small cell lung cancers. We explore how well NNs generalize across registries, with two goals: (1) to assess how well models trained on one registry's data transfer to test data from a different registry and (2) to assess whether and to what extent such models can be improved using state-of-the-art neural domain adaptation techniques under different assumptions about what is available (labeled vs unlabeled data) at the target registry site.
Materials and methods: We collected data from two registries: the Kentucky Cancer Registry (KCR) and the Fred Hutchinson Cancer Research Center (FH) Cancer Surveillance System. We combine NNs with adversarial domain adaptation to improve cross-registry performance. We compare to other classifiers in the standard supervised classification, unsupervised domain adaptation, and supervised domain adaptation scenarios.
Results: The performance of ML methods varied between registries. To extract positive results, the basic convolutional neural network (CNN) had an F1 of 71.5% on the KCR dataset and 95.7% on the FH dataset. When trained on FH data and applied to the KCR dataset, the CNN performed poorly (positive F1: 23%). Using our proposed adversarial CNN, without any labeled data from the target registry, we match the F1 of the models trained directly on each target registry's data. The adversarial CNN's F1 improved when trained on FH and applied to the KCR dataset (positive F1: 70.8%). We found similar performance improvements when we trained on KCR and tested on FH reports (positive F1: 45% to 96%).
Conclusion: Adversarial domain adaptation improves the performance of NNs applied to pathology reports. In the unsupervised domain adaptation setting, we match the performance of models trained directly on the target registry's data by using the source registry's labeled data and unlabeled examples from the target registry.
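
The positive-class F1 figures quoted in the abstract are, by the standard definition, the harmonic mean of precision and recall computed on reports labeled positive. The sketch below illustrates that calculation only; it is not the authors' evaluation code. The function name `positive_f1` and the label strings `"positive"`, `"negative"`, and `"unknown"` are assumptions made for illustration, since the abstract does not state the label set.

```python
def positive_f1(gold, predicted, positive_label="positive"):
    """Return the F1 score for one class, treating every other label as negative."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    # True positives: positive reports that the model also called positive.
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    # False positives: the model called a non-positive report positive.
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    # False negatives: positive reports the model missed.
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)

    if tp == 0:
        # With no true positives, precision or recall is zero (or undefined); report 0.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions for five reports (not data from the study).
gold = ["positive", "negative", "positive", "unknown", "positive"]
predicted = ["positive", "positive", "negative", "unknown", "positive"]
score = positive_f1(gold, predicted)
```

In this toy example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 is also 2/3.
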
Original language | English |
---|---|
Article number | 103267 |
Journal | Journal of Biomedical Informatics |
Volume | 97 |
State | Published - Sep 2019 |
Bibliographical note
Publisher Copyright: © 2019 Elsevier Inc.
Funding
We are grateful for the support of the U.S. National Cancer Institute (NCI) through grant P30CA177558 and Surveillance, Epidemiology, and End Results Program (SEER) contracts HHSN261201300013I and HHSN261201800013I for enabling this effort. SMS and BG are supported through the NCI SEER contract HHSN26100007 and grant P30CA015704. RK’s efforts are also partially supported by the U.S. National Center for Advancing Translational Sciences via grant UL1TR001998.
Funders | Funder number |
---|---|
National Childhood Cancer Registry – National Cancer Institute | HHSN26100007, P30CA177558, P30CA015704, HHSN261201800013I |
National Center for Advancing Translational Sciences (NCATS) | UL1TR001998 |
Keywords
- Cancer registry
- Domain adaptation
- Natural language processing
- Neural networks
- Text classification
- Text mining
ASJC Scopus subject areas
- Health Informatics
- Computer Science Applications