Deep active learning for classifying cancer pathology reports

Kevin De Angeli, Shang Gao, Mohammed Alawad, Hong Jun Yoon, Noah Schaefferkoetter, Xiao Cheng Wu, Eric B. Durbin, Jennifer Doherty, Antoinette Stroup, Linda Coyle, Lynne Penberthy, Georgia Tourassi

Research output: Contribution to journal › Article › peer-review

26 Scopus citations

Abstract

Background: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model.

Results: We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples each iteration of active learning), there was no clear winner between the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes.

Conclusions: Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.
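The marginal and ratio uncertainty sampling strategies named in the abstract can be illustrated with a minimal sketch. It assumes the common textbook definitions (margin = top probability minus second-highest probability; ratio = second-highest probability divided by top probability), which may differ in detail from the formulations used in the paper. The function names, the `strategy` parameter, and the example probabilities are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def top_two(probs: Sequence[float]) -> Tuple[float, float]:
    """Return the largest and second-largest class probabilities."""
    ordered = sorted(probs, reverse=True)
    return ordered[0], ordered[1]


def margin_score(probs: Sequence[float]) -> float:
    """Margin between the two most likely classes (small = uncertain)."""
    p1, p2 = top_two(probs)
    return p1 - p2


def ratio_score(probs: Sequence[float]) -> float:
    """Ratio of second-best to best probability (near 1 = uncertain)."""
    p1, p2 = top_two(probs)
    return p2 / p1


def select_for_labelling(
    pool_probs: List[Sequence[float]], k: int, strategy: str = "margin"
) -> List[int]:
    """Return indices of the k unlabelled samples the model is least sure about.

    pool_probs holds one predicted class distribution per unlabelled sample
    (assumed here to come from the classifier's softmax output).
    """
    indices = range(len(pool_probs))
    if strategy == "margin":
        # Smallest margin first.
        ranked = sorted(indices, key=lambda i: margin_score(pool_probs[i]))
    elif strategy == "ratio":
        # Ratio closest to 1 first.
        ranked = sorted(
            indices, key=lambda i: ratio_score(pool_probs[i]), reverse=True
        )
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:k]


# Hypothetical predicted distributions for three unlabelled reports.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # top two classes nearly tied
    [0.55, 0.44, 0.01],  # top two classes close
]
chosen_margin = select_for_labelling(pool, k=2, strategy="margin")
chosen_ratio = select_for_labelling(pool, k=2, strategy="ratio")
```

In an active learning loop of the kind the abstract describes, the selected indices would be sent to human annotators, the newly labelled samples added to the training set, and the model retrained before the next round of selection.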

Original language: English
Article number: 113
Journal: BMC Bioinformatics
Volume: 22
Issue number: 1
State: Published - Dec 2021

Bibliographical note

Publisher Copyright:
© 2021, The Author(s).

Funding

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders (funder number, where listed):
DOE Public Access Plan
United States Government
U.S. Department of Energy EPSCoR
National Childhood Cancer Registry – National Cancer Institute (P30CA177558)
National Childhood Cancer Registry – National Cancer Institute

    Keywords

    • Active learning
    • Cancer pathology reports
    • Convolutional neural networks
    • Deep learning
    • Text classification

    ASJC Scopus subject areas

    • Structural Biology
    • Biochemistry
    • Molecular Biology
    • Computer Science Applications
    • Applied Mathematics
