Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases

  • Japheth E. Gado
  • Brent E. Harrison
  • Mats Sandgren
  • Jerry Ståhlberg
  • Gregg T. Beckham
  • Christina M. Payne

Research output: Contribution to journal (article, peer-reviewed)

31 Scopus citations

Abstract

Family 7 glycoside hydrolases (GH7) are among the principal enzymes for cellulose degradation in nature and industrially. These enzymes are often bimodular, including a catalytic domain and carbohydrate-binding module (CBM) attached via a flexible linker, and exhibit an active site that binds cello-oligomers of up to ten glucosyl moieties. GH7 cellulases consist of two major subtypes: cellobiohydrolases (CBH) and endoglucanases (EG). Despite the critical importance of GH7 enzymes, there remain gaps in our understanding of how GH7 sequence and structure relate to function. Here, we employed machine learning to gain data-driven insights into relationships between sequence, structure, and function across the GH7 family. Machine-learning models, trained only on the number of residues in the active-site loops as features, were able to discriminate GH7 CBHs and EGs with up to 99% accuracy, demonstrating that the lengths of loops A4, B2, B3, and B4 strongly correlate with functional subtype across the GH7 family. Classification rules were derived such that specific residues at 42 different sequence positions each predicted the functional subtype with accuracies surpassing 87%. A random forest model trained on residues at 19 positions in the catalytic domain predicted the presence of a CBM with 89.5% accuracy. Our machine learning results recapitulate, as top-performing features, a substantial number of the sequence positions determined by previous experimental studies to play vital roles in GH7 activity. We surmise that the yet-to-be-explored sequence positions among the top-performing features also contribute to GH7 functional variation and may be exploited to understand and manipulate function.
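The idea of a single-position classification rule, as described in the abstract, can be illustrated with a minimal sketch. The abstract does not state how the rules were derived, so the exhaustive search below is an assumption, not the authors' method. The function name `best_single_position_rule` and the toy alignment are hypothetical and do not represent real GH7 sequences.

```python
def best_single_position_rule(sequences, labels, position):
    """Search for the rule "residue R at `position` implies class C, otherwise
    the other class" that classifies the most sequences correctly.

    Returns a tuple (residue, predicted_class_if_present, accuracy).
    Assumes aligned sequences of equal length and exactly two classes.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")

    best = None
    # Try every residue observed at this alignment column.
    for residue in sorted({seq[position] for seq in sequences}):
        # Try both directions of the rule (residue implies either class).
        for positive in classes:
            negative = classes[0] if positive == classes[1] else classes[1]
            correct = sum(
                (positive if seq[position] == residue else negative) == label
                for seq, label in zip(sequences, labels)
            )
            accuracy = correct / len(labels)
            if best is None or accuracy > best[2]:
                best = (residue, positive, accuracy)
    return best


# Hypothetical toy alignment (three columns), NOT real GH7 data.
toy_alignment = ["AWG", "AWC", "SWG", "AYG", "SYC", "SYG"]
toy_labels = ["CBH", "CBH", "CBH", "EG", "EG", "EG"]

for pos in range(3):
    print(pos, best_single_position_rule(toy_alignment, toy_labels, pos))
```

In this toy data, column 1 separates the two classes perfectly, while column 0 yields a weaker rule. Applied to a real alignment, a ranking of positions by rule accuracy on held-out data would be one way to shortlist candidate positions, though the thresholds and validation scheme would need to follow the paper itself.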

Original language: English
Article number: 100931
Journal: Journal of Biological Chemistry
Volume: 297
Issue number: 2
State: Published - Aug 1 2021

Bibliographical note

Publisher Copyright:
© 2021 THE AUTHORS.

Funding

Funding and additional information: This work was supported in part by the National Science Foundation (CBET-1552355 to C. M. P. in support of J. E. G.). Funding was provided to G. T. B. by the U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Bioenergy Technologies Office. This material is also based upon work supported by (while C. M. P. is serving at) the NSF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

Acknowledgments: This work was also authored in part by the Alliance for Sustainable Energy, LLC, the manager and operator of the National Renewable Energy Laboratory for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308.

Funders (funder number, where listed)
U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Bioenergy Technologies Office
National Science Foundation (NSF): CBET-1552355
Michigan State University-U.S. Department of Energy (MSU-DOE) Plant Research Laboratory: DE-AC36-08GO28308
National Renewable Energy Laboratory

    ASJC Scopus subject areas

    • Biochemistry
    • Molecular Biology
    • Cell Biology
