Structure-Function-Aware Large Protein Language Models for Enhanced Biomedical Applications

Grants and Contracts Details

Description

Specific Aims

The goal of this proposal is to develop advanced deep learning technologies that accelerate the development and deployment of protein large language models (PLLMs) for biomedical research. The proposed PLLM will expedite the discovery of the underlying molecular mechanisms of human diseases and facilitate the development of effective treatments by providing accurate in-silico predictions of disease-related protein properties. Current research faces two challenges that block the extension of PLLMs into biomedical research: (a) the absence of biologically aware protein 3D structure data in existing PLLMs, and (b) the lack of efficient approaches to adapt trained PLLMs without compromising model generality. To address these challenges, the proposed research will develop deep learning algorithms for fusing 3D structure with 1D sequence information for proteins, and for adapting the trained PLLMs to biomedical tasks without losing their generality. We hypothesize that (1) molecular simulations generate biologically aware protein 3D structures, (2) multi-view denoising diffusion probabilistic models (DDPMs) can fuse 3D structures with 1D sequences of proteins to better predict protein properties, and (3) prompt engineering can adapt trained PLLMs to precisely perform biomedical tasks.
Driven by these hypotheses, we propose to construct a library of biologically aware protein 3D structures using molecular simulations and AlphaFold2, develop PLLMs using contrastive-learning multi-view DDPMs to generate 3D structure-aware embeddings (3D-SAEs) for given protein sequences, and design optimal prompts based on 3D-SAEs to perform biomedical tasks. The proposed PLLM will be validated using multiple downstream tasks not previously exposed to the model. Our preliminary results include: (a) a large training dataset comprising 540k proteins from the Swiss-Prot library, (b) a contrastive learning model integrating protein 3D structures and sequences, (c) a premier benchmark dataset for performance evaluation, (d) pretrained PLLMs optimized for downstream tasks through fine-tuning, and (e) prompt engineering for adapting trained PLLMs. Based on these preliminary data, which demonstrate the feasibility and novelty of the proposed work, we have outlined three aims to achieve the goal of this project.

Aim 1: Develop PLLMs with 3D-SAEs using contrastive-learning multi-view DDPMs. The goal of Aim 1 is to develop PLLMs that can output 3D-SAEs. We will refine our protein 3D structure library using AlphaFold2 prediction evaluation and molecular dynamics simulation. Then we will develop contrastive-learning models based on the 3D structure view, the 1D sequence view, and the 3D structure-confidence view. After that, we will tune the fusion between the 3D structure and 1D sequence views using conditional DDPMs. Finally, we will test the developed PLLMs based on the ability of the generated 3D-SAEs to predict the secondary structure of proteins.
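The multi-view contrastive learning step described in Aim 1 aligns a protein's sequence-derived embedding with its structure-derived embedding. A minimal sketch of one common formulation, a symmetric (CLIP-style) InfoNCE objective, is shown below; the function name, temperature value, and use of precomputed embedding matrices are illustrative assumptions, not the proposal's actual implementation:

```python
import numpy as np

def info_nce_loss(seq_emb, struct_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning per-protein sequence and structure
    embeddings. Rows of seq_emb and struct_emb at the same index are
    assumed to be views of the same protein (the positive pairs)."""
    # L2-normalize each view so the dot product is cosine similarity.
    seq = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    st = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)
    logits = seq @ st.T / temperature      # (N, N) cross-view similarities
    idx = np.arange(len(seq))              # matched pairs sit on the diagonal

    def xent(lg):
        # Cross-entropy with the diagonal as the target class, computed
        # with a numerically stable log-softmax.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the seq->struct and struct->seq directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls the two views of the same protein together in embedding space while pushing apart views of different proteins, which is what lets a single embedding carry both sequence and structure information.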
The expected outcome includes (a) a library of biologically aware protein 3D structures and a computational pathway to expand the library, (b) several lightweight PLLMs that produce 3D-SAEs, and (c) a preliminary evaluation of the developed PLLMs based on the ability of 3D-SAEs to predict the secondary structure of a given protein from its sequence.

Aim 2: Develop prompt engineering approaches to efficiently adapt the developed PLLMs. The goal of Aim 2 is to develop prompt-based machine learning methods that can adapt the developed PLLMs to biomedical tasks without losing their generality. We will start with manual prompt engineering that converts downstream biomedical tasks into tasks that can be handled by GPT-like models. We will then develop machine learning methods to optimize continuous prompt vectors. The performance of the prompt-based adaptations will be tested using the downstream biomedical task benchmark developed in Aim 1. The expected outcome includes (a) an evaluation of the effect of prompt-based adaptation on the PLLMs' performance on downstream tasks and (b) Python modules that enable manual and continuous prompt engineering of protein LLMs.

Aim 3: Validate the ability of prompt-based adaptation of the developed PLLMs in biomedical tasks. The goal of Aim 3 is to validate and optimize the developed PLLMs and prompt methods based on three sets of comprehensive benchmarks. We will develop a set of downstream protein property tasks and benchmark the performance of the prompt-adapted PLLMs. In parallel, we will develop a webserver to open our prompt-adapted PLLMs to biomedical researchers and seek feedback from third-party users. Finally, we will also ask our experimental collaborators to provide feedback on our models in their biomedical research areas, such as antibody affinity.
The expected outcome will include (a) several prompt-adapted protein LLMs with accurate prediction ability for downstream biomedical tasks and (b) Python modules and webservers that can be used to perform prompt-based adaptation of existing PLLMs. Overall, the success of this project will (a) deliver PLLMs that generate 3D structure-aware sequence-based embeddings, (b) design optimal prompts to adapt the developed PLLMs for a wide range of protein property predictions critical for biomedical research, and (c) provide in-silico tools and webservers to predict protein properties based on the prompt-adapted PLLMs.
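The continuous prompt engineering proposed in Aim 2 typically works by prepending a small matrix of trainable "virtual token" embeddings to the frozen model's input embeddings, so that only the prompt is updated during task adaptation and the pretrained weights (and hence the model's generality) are untouched. A minimal sketch of that mechanism, with a hypothetical class name and illustrative dimensions:

```python
import numpy as np

class SoftPromptAdapter:
    """Sketch of continuous (soft) prompt tuning: a small matrix of
    trainable virtual-token embeddings is prepended to the frozen PLLM's
    input embeddings. The prompt matrix is the only trainable parameter."""

    def __init__(self, n_prompt_tokens, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Trainable soft prompt: shape (n_prompt_tokens, embed_dim).
        self.prompt = rng.normal(scale=0.02,
                                 size=(n_prompt_tokens, embed_dim))

    def prepend(self, token_embeddings):
        """token_embeddings: (seq_len, embed_dim) output of the frozen
        embedding layer. Returns (n_prompt_tokens + seq_len, embed_dim),
        which is fed to the frozen transformer layers unchanged."""
        return np.concatenate([self.prompt, token_embeddings], axis=0)

# Usage: 10 virtual tokens prepended to a 50-residue protein's embeddings.
adapter = SoftPromptAdapter(n_prompt_tokens=10, embed_dim=32)
residue_embeddings = np.zeros((50, 32))
prompted = adapter.prepend(residue_embeddings)   # shape (60, 32)
```

During adaptation, gradients of the downstream task loss flow only into `self.prompt`, which is why a single pretrained PLLM can serve many biomedical tasks with one small prompt matrix per task.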
Status: Active
Effective start/end date: 6/21/24 – 5/31/28

Funding

  • National Library of Medicine: $343,124.00
