Abstract
We introduce a lexical-based inference approach for identifying subtype (or is-a relation) inconsistencies in biomedical terminologies. Given a terminology, we first represent the name of each concept in the terminology as a sequence of words. We then generate hierarchically linked and unlinked pairs of concepts, such that the two concepts in a pair have the same number of words, and contain at least one word in common and a fixed number n of different words (n = 1, 2, 3, 4, 5). The linked and unlinked concept-pairs further infer corresponding linked and unlinked term-pairs, respectively. If a linked concept-pair and an unlinked concept-pair infer the same term-pair, we consider this a potential subtype inconsistency, which may indicate a missing subtype relation or an incorrect subtype relation. We applied this approach to Gene Ontology (GO), National Cancer Institute thesaurus (NCIt), and SNOMED CT. A total of 4,841 potential subtype inconsistencies were found in GO, 2,677 in NCIt, and 53,782 in SNOMED CT. Domain experts evaluated a random sample of 211 potential inconsistencies in GO, and verified that 124 of them are valid (i.e., a precision of 58.77% for detecting subtype inconsistencies in GO). We also performed a preliminary study on the extent to which external knowledge in the Unified Medical Language System (UMLS) can provide supporting evidence for validating the detected potential inconsistencies: 0.54% (= 26/4,841) for GO, 11.43% (= 306/2,677) for NCIt, and 3.61% (= 1,940/53,782) for SNOMED CT. Results indicate that our lexical-based inference approach is a promising way to identify subtype inconsistencies and facilitates the quality improvement of biomedical terminologies.
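The pairing-and-inference step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how a term-pair is derived from a concept-pair, so the sketch assumes the term-pair consists of the words left over after the shared words are removed. The function names, the toy concept names, and the `is_a` set (assumed to already hold every direct and inferred child-parent pair) are all hypothetical.

```python
from itertools import combinations


def infer_term_pair(name_a, name_b, n):
    """Return the term-pair inferred from two concept names, or None.

    Assumption: the term-pair is what remains of each name once the
    words shared by both names are removed.
    """
    a, b = name_a.lower().split(), name_b.lower().split()
    if len(a) != len(b):          # names must have the same number of words
        return None
    common = set(a) & set(b)
    diff_a = tuple(w for w in a if w not in common)
    diff_b = tuple(w for w in b if w not in common)
    # need at least one shared word and exactly n differing words per name
    if not common or len(diff_a) != n or len(diff_b) != n:
        return None
    return (" ".join(diff_a), " ".join(diff_b))


def find_inconsistencies(names, is_a, n=1):
    """Collect term-pairs inferred by both a linked and an unlinked concept-pair.

    names: dict mapping concept id -> concept name
    is_a:  set of (child, parent) pairs, assumed transitively closed
    """
    linked, unlinked = {}, {}
    for x, y in combinations(names, 2):
        for child, parent in ((x, y), (y, x)):
            term_pair = infer_term_pair(names[child], names[parent], n)
            if term_pair is None:
                continue
            if (child, parent) in is_a:
                linked.setdefault(term_pair, []).append((child, parent))
            elif (parent, child) not in is_a:
                # no hierarchical link in either direction
                unlinked.setdefault(term_pair, []).append((child, parent))
    return {tp: (linked[tp], unlinked[tp]) for tp in linked if tp in unlinked}


# Hypothetical toy terminology, not taken from GO, NCIt, or SNOMED CT.
names = {
    1: "acute kidney infection",
    2: "acute kidney disease",
    3: "chronic liver infection",
    4: "chronic liver disease",
}
is_a = {(1, 2)}  # only concept 1 is declared a subtype of concept 2

result = find_inconsistencies(names, is_a, n=1)
for term_pair, (linked_pairs, unlinked_pairs) in result.items():
    print(term_pair, "linked:", linked_pairs, "unlinked:", unlinked_pairs)
```

In this toy input the linked pair (1, 2) and the unlinked pair (3, 4) both infer the term-pair ("infection", "disease"), so the pair (3, 4) is flagged as a potential missing subtype relation. Whether such a flag is a real error still requires expert review, as in the GO evaluation reported above.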
Original language | English |
---|---|
Title of host publication | Proceedings - 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018 |
Editors | Harald Schmidt, David Griol, Haiying Wang, Jan Baumbach, Huiru Zheng, Zoraida Callejas, Xiaohua Hu, Julie Dickerson, Le Zhang |
Pages | 1982-1989 |
Number of pages | 8 |
ISBN (Electronic) | 9781538654880 |
DOIs | |
State | Published - Jan 21 2019 |
Event | 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018 - Madrid, Spain Duration: Dec 3 2018 → Dec 6 2018 |
Publication series
Name | Proceedings - 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018 |
---|
Conference
Conference | 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018 |
---|---|
Country/Territory | Spain |
City | Madrid |
Period | 12/3/18 → 12/6/18 |
Bibliographical note
Publisher Copyright: © 2018 IEEE.
Keywords
- Gene Ontology
- incorrect subtype relations
- missing subtype relations
- National Cancer Institute thesaurus
- SNOMED CT
- subtype inconsistencies
- terminology quality assurance
- Unified Medical Language System
ASJC Scopus subject areas
- Biomedical Engineering
- Health Informatics