CRII: III: A Scalable Framework for Debugging Large Biological Ontologies

  • Cui, Licong (PI)

Grants and Contracts Details


Large biological ontologies such as Gene Ontology (GO) continue to evolve over time. GO has been widely used for inferring gene functions, predicting protein functions, and automating gene-related discoveries within large-scale genetic studies. Quality issues in GO can impact data quality, leading to misleading results or missed biological discoveries. Therefore, debugging GO for continued quality enhancement has become an indispensable part of its lifecycle. However, the size and complex structure of GO make its quality assurance (QA) a challenging task. Manual review and debugging of ontologies by curators is arduous and time-consuming, and mostly infeasible or ineffective, so automated methods are highly desirable. As biological knowledge rapidly evolves, most existing QA approaches for GO have focused on the enrichment of GO terms and have largely ignored structural information. Only a few previous studies have addressed the quality issues of the existing GO from a structural point of view, such as incorrect classification of terms and missing hierarchical relations. Moreover, there is a lack of approaches that automatically debug GO and generate change suggestions that can be systematically reviewed and incorporated into new versions. We propose a novel hybrid framework, GO-Debugger, to debug GO by leveraging both GO's underlying graph structure and the algebraic properties of GO terms. Our approach first defines a set of algebraic properties based on GO terms and subsumption relations. We then develop automatic bug detection algorithms based on these algebraic properties to identify potential errors in GO. The detection algorithms will be evaluated in two settings: (1) exhaustive detection over all GO terms; and (2) subspace detection over the GO terms contained in a graph substructure called a non-lattice structure.
We will also develop non-lattice-based subgraph matching algorithms to automatically group similar modeling issues and suggest systematic corrections for them. The computational challenges of the proposed framework will be addressed by leveraging our preliminary work on exhaustive detection of anomalies in large-scale ontologies using a scalable computational approach. We hypothesize that combining algebraic properties with a subgraph-based approach is effective in identifying potential GO quality issues and automatically generating recommendations for changes. We plan to test this hypothesis in three ways: (1) evaluate the effectiveness of the proposed framework by measuring the precision of detected errors and recommended changes, as well as the correctness of the non-lattice-based subgraph matching algorithms; (2) perform a comparative evaluation of the prevalence of "bugs" in lattice vs. non-lattice substructures in GO; (3) re-evaluate GenBank data and compare the annotation results obtained with an existing GO version against those obtained with a corrected GO version. The proposed framework is novel in that it not only automatically detects potential bugs in one of the largest and most widely used biological ontologies, but also proposes remediations for identified errors and assesses their benefit for re-annotation. This framework has the potential to be applicable to other large-scale biological ontologies, since it leverages the underlying hierarchical graph structure and the ontology terms, both of which exist in all biological ontologies. The outcome of this project will be made accessible to the general public for ontology quality assurance. The research problems addressed in the project will also be incorporated into a new graduate course, "Advanced Data Science", that the PI has created. The integrated educational plan will expose computer science students to real-world data science problems and train them for interdisciplinary research.
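The abstract does not define the non-lattice structure it refers to. As an illustration only, the sketch below assumes one common reading: two terms form a non-lattice pair when they have more than one maximal common descendant in the subsumption hierarchy. The function names and the toy hierarchy are hypothetical, and this brute-force pairwise check is not the project's scalable algorithm.

```python
from itertools import combinations


def down_set(children, node):
    """Return node plus all of its descendants (reflexive closure)."""
    seen = {node}
    stack = [node]
    while stack:
        cur = stack.pop()
        for child in children.get(cur, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def non_lattice_pairs(children):
    """Map each pair of terms with >1 maximal common descendant to those descendants.

    `children` maps a term to its direct subterms (an acyclic hierarchy).
    Assumed definition; exhaustive and quadratic in the number of terms.
    """
    nodes = set(children) | {c for cs in children.values() for c in cs}
    down = {n: down_set(children, n) for n in nodes}
    result = {}
    for a, b in combinations(sorted(nodes), 2):
        common = down[a] & down[b]
        # Keep a common descendant only if no other common descendant lies above it.
        maximal = {c for c in common
                   if not any(d != c and c in down[d] for d in common)}
        if len(maximal) > 1:
            result[(a, b)] = maximal
    return result


# Hypothetical toy hierarchy: A and B share two incomparable subterms, C and D.
toy = {"root": ["A", "B"], "A": ["C", "D"], "B": ["C", "D"], "C": ["E"], "D": ["E"]}
pairs = non_lattice_pairs(toy)
```

In this toy hierarchy only the pair (A, B) is flagged, because C and D are both maximal among their shared descendants. Under the assumed definition, such a pair marks a region where a term or a hierarchical relation may be missing.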
Effective start/end date: 3/1/17 to 2/28/19

