A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification

Andrew E. Blanchard, Shang Gao, Hong Jun Yoon, J. Blair Christian, Eric B. Durbin, Xiao Cheng Wu, Antoinette Stroup, Jennifer Doherty, Stephen M. Schwartz, Charles Wiggins, Linda Coyle, Lynne Penberthy, Georgia D. Tourassi

Research output: Contribution to journal, Article, peer-review

3 Scopus citations


Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominates. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e., keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, where it yields a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
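The two-part training loss described in the abstract can be illustrated with a minimal sketch. The abstract does not say how keywords are sampled or how the two contributions are weighted, so the uniform sampling, the `kw_weight` parameter, the function names, and the placeholder keyword strings below are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class for each example."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def sample_keyword_batch(keywords_by_class, size, rng):
    """Draw (keyword, class) pairs uniformly from the flattened keyword set.

    Uniform sampling is an assumption; the abstract does not specify it.
    """
    pairs = [(kw, c) for c, kws in keywords_by_class.items() for kw in kws]
    return [rng.choice(pairs) for _ in range(size)]


def combined_loss(doc_probs, doc_labels, kw_probs, kw_labels, kw_weight=1.0):
    """Per-batch loss with one term from raw documents and one from keywords.

    kw_weight is a hypothetical knob controlling the keyword contribution.
    """
    doc_term = cross_entropy(doc_probs, doc_labels)
    kw_term = cross_entropy(kw_probs, kw_labels)
    return doc_term + kw_weight * kw_term


# Hypothetical keyword set: class 0 is common, class 1 is rare.
keywords_by_class = {0: ["common phrase"], 1: ["rare phrase", "rare term"]}
kw_batch = sample_keyword_batch(keywords_by_class, size=4, rng=random.Random(0))

# Stand-in model outputs (class probabilities) for one document and one keyword.
loss = combined_loss(
    doc_probs=[[0.5, 0.5]], doc_labels=[0],
    kw_probs=[[0.25, 0.75]], kw_labels=[1],
)
```

In a real training loop, `doc_probs` and `kw_probs` would come from the same model applied to the batch of reports and to the sampled keywords, so that gradients from both terms update shared parameters.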

Original language: English
Pages (from-to): 2796-2803
Number of pages: 8
Journal: IEEE Journal of Biomedical and Health Informatics
Issue number: 6
State: Published - Jun 1 2022

Bibliographical note

Publisher Copyright:
© 2013 IEEE.


  • Machine learning
  • medical information systems
  • natural language processing

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics
  • Electrical and Electronic Engineering
  • Health Information Management

