A Lexical Approach to Identifying Subtype Inconsistencies in Biomedical Terminologies

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

9 Scopus citations

Abstract

We introduce a lexical-based inference approach for identifying subtype (or is-a relation) inconsistencies in biomedical terminologies. Given a terminology, we first represent the name of each concept in the terminology as a sequence of words. We then generate hierarchically linked and unlinked pairs of concepts, such that the two concepts in a pair have the same number of words, share at least one word, and differ in a fixed number n of words (n = 1, 2, 3, 4, 5). The linked and unlinked concept-pairs in turn infer corresponding linked and unlinked term-pairs, respectively. If a linked concept-pair and an unlinked concept-pair infer the same term-pair, we consider this a potential subtype inconsistency, which may indicate a missing subtype relation or an incorrect subtype relation. We applied this approach to the Gene Ontology (GO), the National Cancer Institute Thesaurus (NCIt), and SNOMED CT. A total of 4,841 potential subtype inconsistencies were found in GO, 2,677 in NCIt, and 53,782 in SNOMED CT. Domain experts evaluated a random sample of 211 potential inconsistencies in GO and verified that 124 of them are valid (i.e., a precision of 58.77% for detecting subtype inconsistencies in GO). We also performed a preliminary study on the extent to which external knowledge in the Unified Medical Language System (UMLS) can provide supporting evidence for validating the detected potential inconsistencies: 0.54% (= 26/4,841) for GO, 11.43% (= 306/2,677) for NCIt, and 3.61% (= 1,940/53,782) for SNOMED CT. The results indicate that our lexical-based inference approach is a promising way to identify subtype inconsistencies and to facilitate the quality improvement of biomedical terminologies.
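
The pairing and inference steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how a term-pair is derived from a concept-pair, so the sketch assumes the two names are aligned position by position and that the term-pair consists of the words that differ. The function names, the `is_a` set of (child, parent) pairs, and the four toy concept names are hypothetical and are not taken from GO, NCIt, or SNOMED CT.

```python
from itertools import combinations


def term_pair(name_a, name_b, n):
    """Return the term-pair inferred by two concept names, or None.

    Assumption: names are compared position by position; the term-pair is
    made of the words that differ. The pair qualifies only if both names
    have the same number of words, exactly n words differ, and at least
    one word is shared.
    """
    a, b = name_a.lower().split(), name_b.lower().split()
    if len(a) != len(b):
        return None
    diff = [(x, y) for x, y in zip(a, b) if x != y]
    if len(diff) != n or len(diff) == len(a):
        return None
    return (" ".join(x for x, _ in diff), " ".join(y for _, y in diff))


def find_potential_inconsistencies(concepts, is_a, n=1):
    """Map each term-pair inferred by both a linked and an unlinked
    concept-pair to the concept-pairs that infer it."""
    linked, unlinked = {}, {}
    for a, b in combinations(concepts, 2):
        if (a, b) in is_a:
            pairs, bucket = [(a, b)], linked
        elif (b, a) in is_a:
            pairs, bucket = [(b, a)], linked
        else:
            # Unlinked pairs have no direction, so try both orientations.
            pairs, bucket = [(a, b), (b, a)], unlinked
        for x, y in pairs:
            tp = term_pair(x, y, n)
            if tp is not None:
                bucket.setdefault(tp, []).append((x, y))
    return {tp: (linked[tp], unlinked[tp]) for tp in linked.keys() & unlinked.keys()}


# Hypothetical toy terminology: one is-a link, stored as (child, parent).
concepts = [
    "kidney development",
    "organ development",
    "kidney morphogenesis",
    "organ morphogenesis",
]
is_a = {("kidney development", "organ development")}

result = find_potential_inconsistencies(concepts, is_a, n=1)
for tp, (linked_pairs, unlinked_pairs) in result.items():
    print(tp, linked_pairs, unlinked_pairs)
```

In this toy input, the linked pair and the unlinked "morphogenesis" pair infer the same term-pair, so the unlinked pair is flagged as a candidate. Whether a flagged candidate is a missing relation or the linked relation is incorrect still requires expert review, as in the evaluation the abstract reports.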

Original language: English
Title of host publication: Proceedings - 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018
Editors: Harald Schmidt, David Griol, Haiying Wang, Jan Baumbach, Huiru Zheng, Zoraida Callejas, Xiaohua Hu, Julie Dickerson, Le Zhang
Pages: 1982-1989
Number of pages: 8
ISBN (Electronic): 9781538654880
DOIs
State: Published - Jan 21 2019
Event: 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018 - Madrid, Spain
Duration: Dec 3 2018 to Dec 6 2018

Publication series

Name: Proceedings - 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018

Conference

Conference: 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018
Country/Territory: Spain
City: Madrid
Period: 12/3/18 to 12/6/18

Bibliographical note

Publisher Copyright:
© 2018 IEEE.

Funding

This work was supported by the National Science Foundation through grants 1816805, 1657306 and 1252893, and by the National Institutes of Health through grants UL1TR001998 and R21CA231904. Correspondence: [email protected]

Funder: Funder number
National Science Foundation (NSF): 1252893, 1931134, 1657306, 1816805
National Institutes of Health (NIH): R21CA231904, UL1TR001998

    Keywords

    • Gene Ontology
    • incorrect subtype relations
    • missing subtype relations
    • National Cancer Institute Thesaurus
    • SNOMED CT
    • subtype inconsistencies
    • terminology quality assurance
    • Unified Medical Language System

    ASJC Scopus subject areas

    • Biomedical Engineering
    • Health Informatics
