
S-PLM: Structure-Aware Protein Language Model via Contrastive Learning Between Sequence and Structure

  • Duolin Wang
  • Mahdi Pourmirzaei
  • Usman L. Abbas
  • Shuai Zeng
  • Negin Manshour
  • Farzaneh Esmaili
  • Biplab Poudel
  • Yuexu Jiang
  • Qing Shao
  • Jin Chen
  • Dong Xu

Research output: Contribution to journal › Article › peer-review

31 Citations (SciVal)

Abstract

Proteins play an essential role in various biological and engineering processes. Large protein language models (PLMs) present excellent potential to reshape protein research by accelerating the determination of protein functions and the design of proteins with the desired functions. The prediction and design capacity of PLMs relies on the representation gained from the protein sequences. However, the lack of crucial 3D structure information in most PLMs restricts the prediction capacity of PLMs in various applications, especially those heavily dependent on 3D structures. To address this issue, S-PLM is introduced as a 3D structure-aware PLM that utilizes multi-view contrastive learning to align the sequence and 3D structure of a protein in a coordinated latent space. S-PLM applies Swin-Transformer on AlphaFold-predicted protein structures to embed the structural information and fuses it into sequence-based embedding from ESM2. Additionally, a library of lightweight tuning tools is provided to adapt S-PLM for diverse downstream protein prediction tasks. The results demonstrate S-PLM's superior performance over sequence-only PLMs on all protein clustering and classification tasks, achieving competitiveness comparable to state-of-the-art methods requiring both sequence and structure inputs. S-PLM and its lightweight tuning tools are available at https://github.com/duolinwang/S-PLM/.
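The abstract does not spell out the training objective, so the following is only a minimal sketch of what contrastive alignment between two views can look like, assuming a generic symmetric InfoNCE (CLIP-style) loss rather than the authors' exact formulation. The function names, the temperature value, and the toy vectors are all hypothetical; in practice the sequence embeddings would come from a model such as ESM2 and the structure embeddings from a structure encoder, and input vectors are assumed to be non-zero.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(seq_emb, struct_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumption, not the paper's stated loss).

    seq_emb[i] and struct_emb[i] are two views of the same protein i.
    Matching pairs are pulled together; all other pairs in the batch
    act as negatives.
    """
    seq = [l2_normalize(v) for v in seq_emb]
    struct = [l2_normalize(v) for v in struct_emb]
    n = len(seq)

    # logits[i][j]: temperature-scaled cosine similarity of sequence i and structure j
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in struct]
        for s in seq
    ]

    # Sequence -> structure direction: row i should select column i.
    loss_seq_to_struct = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Structure -> sequence direction: column j should select row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_struct_to_seq = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_seq_to_struct + loss_struct_to_seq)


# Toy batch of two proteins with 2-D embeddings (hypothetical values).
sequence_view = [[1.0, 0.0], [0.0, 1.0]]
aligned_structure_view = [[1.0, 0.0], [0.0, 1.0]]
swapped_structure_view = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(sequence_view, aligned_structure_view)
swapped_loss = symmetric_contrastive_loss(sequence_view, swapped_structure_view)
```

In this toy batch, the loss is close to zero when each sequence embedding matches its own structure embedding and becomes large when the pairs are swapped, which is the behavior a coordinated latent space relies on.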

Original language: English
Article number: 2404212
Journal: Advanced Science
Volume: 12
Issue number: 5
DOIs
State: Published - Feb 3 2025

Bibliographical note

Publisher Copyright:
© 2024 The Author(s). Advanced Science published by Wiley-VCH GmbH.

Funding

D.W. and D.X. were partially supported by the National Institutes of Health (grant R35GM126985). Q.S., D.X., and J.C. would like to acknowledge the National Institutes of Health (grant R01LM014510). U.A., Q.S., and J.C. would like to thank AI in Medicine (AIM) at the University of Kentucky (NCATS UL1TR001998, NCI P30 CA177558). U.A. and Q.S. would also like to thank the Start‐up Fund of the University of Kentucky and Alzheimer's Association (AARG‐23‐1144638) for their financial support. Q.S. also acknowledges the financial support of the National Science Foundation (2154996). The computation for this work was partially performed on the high‐performance computing infrastructure provided by Research Computing Support Services at the University of Missouri. The authors also thank the University of Kentucky Center for Computational Sciences and Information Technology Services Research Computing for their support and use of the Morgan Compute Cluster and associated research computing resources. This work also used Delta‐GPU at NCSA through allocation CIS230053 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which was supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

Funders                                                           Funder number
University of Kentucky
University of Missouri – St. Louis
Kentucky Transportation Center, University of Kentucky
National Center for Advancing Translational Sciences (NCATS)      UL1TR001998
National Childhood Cancer Registry – National Cancer Institute    P30 CA177558
Alzheimer's Association                                           AARG‐23‐1144638
National Institutes of Health (NIH)                               R35GM126985, R01LM014510
National Science Foundation Arctic Social Science Program         2154996, 2138286, 2138296, CIS230053, 2138307, 2137603, 2138259

    Keywords

    • contrastive learning
    • deep learning
    • protein function prediction
    • protein language model
    • protein structure

    ASJC Scopus subject areas

    • Medicine (miscellaneous)
    • General Chemical Engineering
    • Biochemistry, Genetics and Molecular Biology (miscellaneous)
    • General Materials Science
    • General Engineering
    • General Physics and Astronomy
