Abstract
We introduce a lexical-based inference approach for identifying subtype (or is-a relation) inconsistencies in biomedical terminologies. Given a terminology, we first represent the name of each concept as a sequence of words. We then generate hierarchically linked and unlinked pairs of concepts, such that the two concepts in a pair have the same number of words, share at least one word, and differ in a fixed number n of words (n = 1, 2, 3, 4, 5). The linked and unlinked concept-pairs are then used to infer corresponding linked and unlinked term-pairs, respectively. If a linked concept-pair and an unlinked concept-pair infer the same term-pair, we consider this a potential subtype inconsistency, which may indicate either a missing subtype relation or an incorrect one. We applied this approach to the Gene Ontology (GO), the National Cancer Institute thesaurus (NCIt), and SNOMED CT. A total of 4,841 potential subtype inconsistencies were found in GO, 2,677 in NCIt, and 53,782 in SNOMED CT. Domain experts evaluated a random sample of 211 potential inconsistencies in GO and verified that 124 of them are valid (i.e., a precision of 58.77% for detecting subtype inconsistencies in GO). We also performed a preliminary study of the extent to which external knowledge in the Unified Medical Language System (UMLS) can provide supporting evidence for validating the detected potential inconsistencies: 0.54% (= 26/4841) for GO, 11.43% (= 306/2677) for NCIt, and 3.61% (= 1940/53782) for SNOMED CT. The results indicate that our lexical-based inference approach is a promising way to identify subtype inconsistencies and can facilitate the quality improvement of biomedical terminologies.
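The pairing and inference steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that words are compared position by position, that a term-pair consists of the differing words of the two concepts, and that "unlinked" means neither concept is an ancestor of the other; the abstract does not specify these details. The concept IDs, names, and the helper names `ancestors`, `term_pair`, and `find_inconsistencies` are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def ancestors(concept, parents):
    """Return every concept reachable from `concept` through is-a edges."""
    seen, stack = set(), list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def term_pair(first_words, second_words, n):
    """Return the differing-word tuples, or None if the pair does not qualify.

    Assumption: words are compared position by position.
    """
    if len(first_words) != len(second_words):
        return None
    diff = [(a, b) for a, b in zip(first_words, second_words) if a != b]
    shared = len(first_words) - len(diff)
    if len(diff) != n or shared < 1:
        return None
    return tuple(a for a, _ in diff), tuple(b for _, b in diff)


def find_inconsistencies(names, parents, n=1):
    """Map each term-pair inferred by both a linked and an unlinked concept-pair
    to the concept-pairs that produced it."""
    words = {cid: name.lower().split() for cid, name in names.items()}
    anc = {cid: ancestors(cid, parents) for cid in names}
    linked, unlinked = defaultdict(list), defaultdict(list)
    # Quadratic scan over ordered pairs: fine for a toy example only.
    for a, b in permutations(names, 2):
        pair = term_pair(words[a], words[b], n)
        if pair is None:
            continue
        if b in anc[a]:            # a is-a b (directly or transitively)
            linked[pair].append((a, b))
        elif a not in anc[b]:      # no is-a path in either direction
            unlinked[pair].append((a, b))
    return {tp: (linked[tp], unlinked[tp]) for tp in linked if tp in unlinked}


# Hypothetical toy terminology: only C2 -> C1 is asserted as is-a.
names = {
    "C1": "organ disease",
    "C2": "kidney disease",
    "C3": "organ injury",
    "C4": "kidney injury",
}
parents = {"C2": ["C1"]}
result = find_inconsistencies(names, parents, n=1)
```

In this toy example the linked pair (C2, C1) and the unlinked pair (C4, C3) both infer the term-pair (kidney, organ), so (C4, C3) is flagged as a potential inconsistency. Whether it reflects a missing relation between C4 and C3 or an incorrect relation between C2 and C1 would still need expert review.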
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018 |
| Editors | Harald Schmidt, David Griol, Haiying Wang, Jan Baumbach, Huiru Zheng, Zoraida Callejas, Xiaohua Hu, Julie Dickerson, Le Zhang |
| Pages | 1982-1989 |
| Number of pages | 8 |
| ISBN (Electronic) | 9781538654880 |
| State | Published - Jan 21 2019 |
| Event | 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018 - Madrid, Spain; Duration: Dec 3 2018 → Dec 6 2018 |
Publication series
| Name | Proceedings - 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018 |
|---|---|
Conference
| Conference | 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018 |
|---|---|
| Country/Territory | Spain |
| City | Madrid |
| Period | 12/3/18 → 12/6/18 |
Bibliographical note
Publisher Copyright: © 2018 IEEE.
Funding
This work was supported by the National Science Foundation through grants 1816805, 1657306 and 1252893, and by the National Institutes of Health through grants UL1TR001998 and R21CA231904. Correspondence: [email protected]
| Funders | Funder number |
|---|---|
| National Science Foundation (NSF) | 1252893, 1931134, 1657306, 1816805 |
| National Institutes of Health (NIH) | R21CA231904, UL1TR001998 |
Keywords
- Gene Ontology
- incorrect subtype relations
- missing subtype relations
- National Cancer Institute thesaurus
- SNOMED CT
- subtype inconsistencies
- terminology quality assurance
- Unified Medical Language System
ASJC Scopus subject areas
- Biomedical Engineering
- Health Informatics