Medical Subject Headings (MeSH) is a controlled hierarchical vocabulary used by the National Library of Medicine (NLM) to index biomedical articles. The 2014 version of MeSH contains a total of 27,149 terms. Librarians at the NLM tag each biomedical article indexed for the PubMed search system with terms from MeSH: human indexers read each article's full text and assign it a small set of descriptors, 13 on average, out of the more than 27,000 available. Many recent attempts to automate this process have focused on using the article title and abstract text to predict MeSH terms for the corresponding article. There has also been an open automated biomedical indexing challenge, BioASQ, which started in 2013. The best general supervised learning framework in these challenges has been a pipeline with four components: (1) pre-processing and feature extraction, (2) binary relevance and/or nearest neighbor approaches to select a set of candidate terms, (3) ranking of these candidate terms using corresponding informative features, and (4) label calibration to dynamically predict the number of top terms to include in the final selection for the input instance. The specific details of how each component is implemented determine the performance variations among entries in the challenge. In this paper, we analyze these moving parts of the MeSH indexing multi-label classification pipeline with experiments involving different combinations. Our best combination achieves an increase of approximately 1% in micro F-score compared with the top-performing team across the five weeks of the final batch of the BioASQ 2014 challenge. The main takeaway from our efforts is that small improvements or modifications to different components of the pipeline can offer moderate improvements to the overall performance of the method.
Our experience shows that, at least thus far, top performance has resulted mostly from such improvements rather than from drastic changes to the core methodology.
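As a rough illustration of the last two pipeline stages and of the evaluation metric named in the abstract, the sketch below ranks scored candidate terms, keeps a per-article number of top terms, and computes the micro-averaged F-score over a toy set of articles. It is not the authors' implementation: the function names `select_terms` and `micro_f1`, the candidate scores, and the per-article cutoffs are hypothetical stand-ins for the outputs of the candidate selection, ranking, and label calibration components.

```python
from typing import Dict, List, Set


def select_terms(scores: Dict[str, float], k: int) -> Set[str]:
    """Rank candidate terms by score (ties broken alphabetically) and keep the top k.

    `k` stands in for the output of a label calibration step; here it is
    simply supplied by the caller (an assumption for illustration).
    """
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return set(ranked[:k])


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F-score: pool true/false positives and false negatives
    over all articles before computing F."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: hypothetical candidate scores for two articles.
candidate_scores = [
    {"Humans": 0.95, "Neoplasms": 0.80, "Adult": 0.55, "Mice": 0.10},
    {"Mice": 0.90, "Humans": 0.30},
]
# Hypothetical calibrated cutoffs (number of terms to keep per article).
cutoffs = [3, 1]
gold_terms = [{"Humans", "Neoplasms"}, {"Mice"}]

predicted = [select_terms(s, k) for s, k in zip(candidate_scores, cutoffs)]
score = micro_f1(gold_terms, predicted)
```

Because the micro average pools counts across articles, articles with many descriptors weigh more heavily than articles with few, which is why the choice of cutoff per article directly affects the reported score.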
|Title of host publication||Proceedings - 2015 IEEE International Conference on Healthcare Informatics, ICHI 2015|
|Editors||Wai-Tat Fu, Prabhakaran Balakrishnan, Sanda Harabagiu, Fei Wang, Jaideep Srivastava|
|Number of pages||7|
|State||Published - Dec 8 2015|
|Event||3rd IEEE International Conference on Healthcare Informatics, ICHI 2015 - Dallas, United States|
Duration: Oct 21 2015 → Oct 23 2015
Bibliographical note
Funding Information:
This publication was supported by the National Center for Research Resources and the National Center for Advancing Translational Sciences, US National Institutes of Health (NIH), through Grant UL1TR000117.
© 2015 IEEE.
ASJC Scopus subject areas
- Health Informatics