Leveraging output term co-occurrence frequencies and latent associations in predicting medical subject headings

Research output: Contribution to journal › Article › peer-review

12 Scopus citations


Trained indexers at the National Library of Medicine (NLM) manually tag each biomedical abstract with the most suitable terms from the Medical Subject Headings (MeSH) terminology so that it can be indexed in the PubMed information system. MeSH has over 26,000 terms, and indexers consult each article's full text when assigning them. Recent automated attempts focus on using the article title and abstract text to identify MeSH terms for the corresponding article. Most of these approaches rely on supervised machine learning techniques trained on already indexed articles and their MeSH terms. In this paper, we present a new indexing approach that leverages term co-occurrence frequencies and latent term associations computed from the MeSH term sets of nearly 18 million articles already indexed by NLM indexers. The main goal of our study is to gauge the potential of output label co-occurrences, latent associations, and relationships extracted from free text in both unsupervised and supervised indexing approaches. Using a novel and purely unsupervised approach, we achieve a micro-F-score comparable to those obtained with supervised machine learning techniques. By incorporating term co-occurrence and latent association features into a supervised learning framework, we also improve over the best results published on two public datasets.
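To make two notions from the abstract concrete, the following minimal Python sketch counts how often pairs of terms co-occur across the term sets of already indexed articles and computes a micro-averaged F-score over predicted term sets. It is an illustration under stated assumptions, not the method of the paper: the function names `cooccurrence_counts` and `micro_f_score` and the toy term sets are hypothetical, and the paper's actual features, latent association model, and evaluation setup are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(term_sets):
    """Count how often each unordered pair of terms appears in the same term set.

    Hypothetical helper: pairs are stored as alphabetically sorted tuples.
    """
    counts = Counter()
    for terms in term_sets:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


def micro_f_score(gold, predicted):
    """Micro-averaged F-score: pool true/false positives and false negatives
    over all articles, then compute precision, recall, and their harmonic mean."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # terms predicted and correct
        fp += len(p - g)   # terms predicted but not in the gold set
        fn += len(g - p)   # gold terms that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up for illustration; not drawn from PubMed).
indexed = [
    {"Humans", "Neoplasms", "Mice"},
    {"Humans", "Neoplasms"},
    {"Humans", "Diabetes Mellitus"},
]
counts = cooccurrence_counts(indexed)

gold = [{"Humans", "Neoplasms"}, {"Mice"}]
pred = [{"Humans", "Mice"}, {"Mice"}]
score = micro_f_score(gold, pred)
```

On this toy data the pair ("Humans", "Neoplasms") co-occurs in two term sets, and the pooled counts give precision and recall of 2/3 each. Micro-averaging pools decisions across all articles, so frequently assigned terms weigh more heavily than rare ones.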

Original language: English
Pages (from-to): 189-201
Number of pages: 13
Journal: Data and Knowledge Engineering
Issue number: PB
State: Published - Nov 1 2014

Bibliographical note

Funding Information:
Many thanks to the anonymous reviewers for their detailed comments and suggestions, which greatly helped improve the paper. We are grateful to Zhiyong Lu for making available the datasets used in his paper and to Trevor Cohen for open-sourcing the random indexing programs used in this paper. The project described was supported by the National Center for Advancing Translational Sciences, UL1TR000117. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Publisher Copyright:
© 2014 Elsevier B.V. All rights reserved.


Keywords

  • Medical subject headings
  • Multi-label classification
  • Output label associations
  • Reflective random indexing

ASJC Scopus subject areas

  • Information Systems and Management

