Analyzing the moving parts of a large-scale multi-label text classification pipeline: Experiences in indexing biomedical articles

Anthony Rios, Ramakanth Kavuluru

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

5 Scopus citations

Abstract

Medical Subject Headings (MeSH) is a controlled hierarchical vocabulary used by the National Library of Medicine (NLM) to index biomedical articles. The 2014 version of the MeSH terminology contains a total of 27,149 terms. Librarians at the NLM tag each biomedical article to be indexed for the PubMed search system with terms from MeSH. That is, human indexers read each article's full text and index it with a small set of descriptors, 13 on average, chosen from the more than 27,000 descriptors available in MeSH. There have been many recent attempts to automate this process that focus on using the article title and abstract text to predict MeSH terms for the corresponding article. There has also been an open automated biomedical indexing challenge, BioASQ, which started in 2013. The best general supervised learning framework in these challenges has been a pipeline with four components: (1) pre-processing and feature extraction; (2) employing the binary relevance and/or nearest neighbor approaches to select a set of candidate terms; (3) ranking these candidate terms using corresponding informative features; and (4) applying label calibration to dynamically predict the number of top terms to be included in the final selection for the input instance. The specific details of how each of these components is implemented determine the performance variations among entries in the challenge. In this paper, we analyze these moving parts of the MeSH indexing multi-label classification pipeline with experiments involving different combinations. Our best combination achieves an approximately 1% increase in micro F-score compared with the top performing team across the five weeks of the final batch of the BioASQ 2014 challenge. The main takeaway from our efforts is that small improvements/modifications to different components of the pipeline can offer moderate improvements to the overall performance of the method. Our experiences show that, at least thus far, top performances have resulted mostly from these improvements rather than from drastic changes to the core methodology.
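
For readers unfamiliar with the micro F-score mentioned in the abstract, the following is a minimal sketch of the standard micro-averaged F-measure for multi-label output. It is not the authors' or the BioASQ organizers' evaluation code. The function name `micro_f_score` and the toy MeSH-like term sets are illustrative assumptions. True positives, false positives, and false negatives are pooled over all articles before precision and recall are computed.

```python
from typing import Iterable, Set


def micro_f_score(gold: Iterable[Set[str]], predicted: Iterable[Set[str]]) -> float:
    """Micro-averaged F1 over a collection of multi-label instances.

    Counts are pooled across all instances, so articles with many
    labels contribute more than articles with few labels.
    """
    tp = fp = fn = 0
    for gold_terms, pred_terms in zip(gold, predicted):
        tp += len(gold_terms & pred_terms)   # terms correctly predicted
        fp += len(pred_terms - gold_terms)   # predicted but not in gold set
        fn += len(gold_terms - pred_terms)   # gold terms that were missed
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: two articles with gold and predicted term sets.
gold = [{"Humans", "Neoplasms"}, {"Mice", "Animals", "Liver"}]
predicted = [{"Humans", "Neoplasms", "Adult"}, {"Mice", "Liver"}]
score = micro_f_score(gold, predicted)
```

In this toy case the pooled counts are 4 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.8.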

Original language: English
Title of host publication: Proceedings - 2015 IEEE International Conference on Healthcare Informatics, ICHI 2015
Editors: Wai-Tat Fu, Prabhakaran Balakrishnan, Sanda Harabagiu, Fei Wang, Jaideep Srivatsava
Pages: 1-7
Number of pages: 7
ISBN (Electronic): 9781467395489
State: Published - Dec 8 2015
Event: 3rd IEEE International Conference on Healthcare Informatics, ICHI 2015 - Dallas, United States
Duration: Oct 21 2015 to Oct 23 2015

Publication series

Name: Proceedings - 2015 IEEE International Conference on Healthcare Informatics, ICHI 2015

Conference

Conference: 3rd IEEE International Conference on Healthcare Informatics, ICHI 2015
Country/Territory: United States
City: Dallas
Period: 10/21/15 to 10/23/15

Bibliographical note

Funding Information:
This publication was supported by the National Center for Research Resources and the National Center for Advancing Translational Sciences, US National Institutes of Health (NIH), through Grant UL1TR000117.

Publisher Copyright:
© 2015 IEEE.

ASJC Scopus subject areas

  • Health Informatics
