TY - JOUR
T1 - Quality Assurance of NCI Thesaurus by Mining Structural-Lexical Patterns
AU - Abeysinghe, Rashmie
AU - Brooks, Michael A.
AU - Talbert, Jeffery
AU - Cui, Licong
PY - 2017
Y1 - 2017
N2 - Quality assurance of biomedical terminologies such as the National Cancer Institute (NCI) Thesaurus is an essential part of the terminology management lifecycle. We investigate a structural-lexical approach based on non-lattice subgraphs to automatically identify missing hierarchical relations and missing concepts in the NCI Thesaurus. We mine six structural-lexical patterns exhibited in non-lattice subgraphs: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a potential specific type of error and suggests a potential type of remediation. We found 809 non-lattice subgraphs with these patterns in the NCI Thesaurus (version 16.12d). Domain experts evaluated a random sample of 50 small non-lattice subgraphs, of which 33 were confirmed to contain errors and make correct suggestions (33/50 = 66%). Of the 25 evaluated subgraphs revealing multiple patterns, 22 were verified correct (22/25 = 88%). This shows the effectiveness of our structural-lexical-pattern-based approach in detecting errors and suggesting remediations in the NCI Thesaurus.
AB - Quality assurance of biomedical terminologies such as the National Cancer Institute (NCI) Thesaurus is an essential part of the terminology management lifecycle. We investigate a structural-lexical approach based on non-lattice subgraphs to automatically identify missing hierarchical relations and missing concepts in the NCI Thesaurus. We mine six structural-lexical patterns exhibited in non-lattice subgraphs: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a potential specific type of error and suggests a potential type of remediation. We found 809 non-lattice subgraphs with these patterns in the NCI Thesaurus (version 16.12d). Domain experts evaluated a random sample of 50 small non-lattice subgraphs, of which 33 were confirmed to contain errors and make correct suggestions (33/50 = 66%). Of the 25 evaluated subgraphs revealing multiple patterns, 22 were verified correct (22/25 = 88%). This shows the effectiveness of our structural-lexical-pattern-based approach in detecting errors and suggesting remediations in the NCI Thesaurus.
UR - http://www.scopus.com/inward/record.url?scp=85046017762&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85046017762&partnerID=8YFLogxK
M3 - Article
C2 - 29854100
AN - SCOPUS:85046017762
SN - 1559-4076
VL - 2017
SP - 364
EP - 373
JO - AMIA ... Annual Symposium proceedings. AMIA Symposium
JF - AMIA ... Annual Symposium proceedings. AMIA Symposium
ER -