Papers on protein design

I’ve been catching up on recent progress and wanted to share brief summaries of the results I found interesting.

Training (mainly) on amino acid sequences

  • Large language models generate functional protein sequences across diverse families (Jul′21, Salesforce, 43 citations): sequence generation conditioned on properties like cellular component, biological process, and molecular function. Finetuning on a functional family generates highly novel proteins with the same properties as the proteins in the finetuning set.
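
    A rough sketch of what tag-conditioned generation looks like, assuming a causal protein LM with a Hugging Face-style interface; the checkpoint name and tag format below are made up for illustration, not the paper’s actual ones:

      # Hypothetical tag-conditioned protein generation sketch.
      from transformers import AutoModelForCausalLM, AutoTokenizer

      ckpt = "example/protein-lm"          # placeholder checkpoint name
      tok = AutoTokenizer.from_pretrained(ckpt)
      model = AutoModelForCausalLM.from_pretrained(ckpt)

      # Conditioning works by prepending keyword tags (e.g. a molecular
      # function) to the sequence prompt, exactly like a text prefix.
      prompt = "<mf:lysozyme-activity> M"  # tag + start of sequence
      ids = tok(prompt, return_tensors="pt").input_ids
      out = model.generate(ids, max_new_tokens=200, do_sample=True, top_p=0.9)
      print(tok.decode(out[0]))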

  • Robust deep learning based protein sequence design using ProteinMPNN (Jun′22, UWashington, 146 citations) - a notable paper on the problem of predicting a sequence given backbone coordinates. Instead of wrapping a deep learning model inside expensive energy-minimization steps, as in previous methods, they train a graph neural net end-to-end and show that it works better.
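
    A toy sketch of the core idea (not the actual ProteinMPNN architecture, whose sizes and features differ): residues become graph nodes, edges connect spatial nearest neighbors, and the network maps backbone geometry to per-position amino-acid logits:

      # Toy backbone-to-sequence GNN; all layers and sizes illustrative.
      import torch

      def knn_edges(ca, k=8):
          # ca: (N, 3) alpha-carbon coordinates
          d = torch.cdist(ca, ca)                            # pairwise distances
          return d.topk(k + 1, largest=False).indices[:, 1:] # drop self

      class ToyMPNN(torch.nn.Module):
          def __init__(self, h=64, n_aa=20, k=8):
              super().__init__()
              self.k = k
              self.embed = torch.nn.Linear(1, h)      # edge feature: distance
              self.msg = torch.nn.Linear(2 * h, h)
              self.out = torch.nn.Linear(h, n_aa)

          def forward(self, ca):
              nbr = knn_edges(ca, self.k)                     # (N, k)
              d = (ca[:, None, :] - ca[nbr]).norm(dim=-1, keepdim=True)
              node = self.embed(d).mean(dim=1)                # geometry features
              agg = node[nbr].mean(dim=1)                     # neighbor average
              node = torch.relu(self.msg(torch.cat([node, agg], dim=-1)))
              return self.out(node)                           # (N, 20) AA logits

      ca = torch.randn(50, 3) * 10                            # fake backbone
      logits = ToyMPNN()(ca)
      seq = torch.distributions.Categorical(logits=logits).sample()
      # The real model decodes autoregressively in a random residue order
      # and conditions on the full backbone (N, CA, C, O atoms).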

  • Language models of protein sequences at the scale of evolution enable accurate structure prediction (Jul′22, Meta, 103 citations). “We find that as models are scaled they learn information enabling the prediction of the three-dimensional structure of a protein at the resolution of individual atoms. ESMFold has similar accuracy to AlphaFold2 [...] for sequences with low perplexity that are well understood by the language model.” They maintain the esm GitHub repo (2k stars) with pretrained protein LMs.
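
    Folding a sequence with the pretrained ESMFold model looks like this, following the interface documented in the repo’s README at the time of writing:

      # Structure prediction with ESMFold (pip install "fair-esm[esmfold]").
      import torch, esm

      model = esm.pretrained.esmfold_v1().eval()
      if torch.cuda.is_available():
          model = model.cuda()

      sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
      with torch.no_grad():
          pdb_str = model.infer_pdb(sequence)   # returns a PDB-format string
      with open("prediction.pdb", "w") as f:
          f.write(pdb_str)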

  • Language models generalize beyond natural proteins (Dec′22, Meta, 17 citations) - another success in using a sequence-level LM to generate novel and diverse proteins. “Remarkably although the models are trained only on sequences, we find that they are capable of designing structure.” They synthesize the proteins and check for simple desirable properties like solubility.

  • A Text-guided Protein Design Framework (Feb′23, UMontreal and others): “ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. [...] To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs.” A sequence-level model where you can adjust the strength of the text conditioning by interpolating between the protein and text representations, or evaluate the similarity of a protein representation to property descriptions.
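
    A minimal sketch of that interpolation idea, with random tensors standing in for the outputs of the paper’s trained text and protein encoders:

      # Interpolating protein and text embeddings in a shared space.
      import torch
      import torch.nn.functional as F

      def interpolate(z_protein, z_text, alpha):
          # alpha in [0, 1]: 0 keeps the protein, 1 follows the text.
          z = (1 - alpha) * z_protein + alpha * z_text
          return F.normalize(z, dim=-1)

      def property_similarity(z_protein, z_property):
          # Cosine similarity to a text embedding of a property
          # description, as in CLIP-style retrieval.
          return F.cosine_similarity(z_protein, z_property, dim=-1)

      # Stand-ins for trained encoder outputs:
      z_p = F.normalize(torch.randn(256), dim=-1)  # protein embedding
      z_t = F.normalize(torch.randn(256), dim=-1)  # text embedding
      z_cond = interpolate(z_p, z_t, alpha=0.7)    # condition a decoder on this
      print(property_similarity(z_p, z_t).item())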

Training on amino acid sequences and spatial structures

  • Illuminating protein space with a programmable generative model (Dec′22, Generate Biomedicines, 30 citations): supports conditioning on partial substructure, arbitrary geometry, symmetry, properties, and natural language descriptions. Diffusion for the backbone, a graph neural net for the sequence. Birthday present idea: you can probably design a protein that looks like someone you know when folded.

  • Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models (Dec′22, UWashington, 42 citations) introduced RFdiffusion, a popular publicly available pretrained protein generation model (850 stars). It can be used for protein binder design: “RFdiffusion is an extremely powerful binder design tool but it is not magic. [...] Truncating a target is an art. For some targets, such as multidomain extracellular membranes, a natural truncation point is where two domains are joined by a flexible linker. For other proteins, such as virus spike proteins, this truncation point is less obvious. Generally you want to preserve secondary structure and introduce as few chain breaks as possible. [...] Given the high success rates we observed in the paper, for some targets it may be sufficient to only generate ~1,000 RFdiffusion backbones in a campaign. What you want is to get enough designs that pass pAE_interaction < 10 (described more in Binder Design Filtering section) such that you are able to fill a DNA order with these successful designs.”
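
    The filtering step from that quote is easy to picture in code; the sketch below assumes the per-design scores were collected into a CSV (the file and column names are placeholders, not the repo’s actual output format):

      # Keep designs passing the pAE_interaction < 10 cutoff, then take
      # as many as fit the DNA order.
      import csv

      CUTOFF = 10.0
      ORDER_SIZE = 96    # e.g. one 96-well plate of synthesized genes

      with open("binder_designs_scores.csv") as f:   # hypothetical file
          rows = list(csv.DictReader(f))

      passing = [r for r in rows if float(r["pae_interaction"]) < CUTOFF]
      passing.sort(key=lambda r: float(r["pae_interaction"]))
      order = passing[:ORDER_SIZE]
      print(f"{len(passing)} / {len(rows)} designs pass; ordering {len(order)}")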

  • Joint Generation of Protein Sequence and Structure with RoseTTAFold Sequence Space Diffusion (May′23, UWashington) introduces ProteinGenerator, a new diffusion model (a successor to RFdiffusion) that generates protein sequences and structures simultaneously. ProteinGenerator allows for the design of functional proteins with specific sequence and structural attributes, and paves the way for protein function optimization by active learning on sequence-activity datasets. A pretrained model is available on GitHub (84 stars), and you can try generating proteins on Hugging Face.
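
    A toy illustration of what diffusion in sequence space means: Gaussian noise is added to a one-hot encoding of the sequence, and a denoiser is trained to undo it (jointly with structure in the paper). Purely illustrative, not the ProteinGenerator code:

      # Forward noising of a one-hot amino-acid encoding.
      import torch

      N_AA, LENGTH = 20, 64
      seq = torch.randint(0, N_AA, (LENGTH,))
      x0 = torch.nn.functional.one_hot(seq, N_AA).float()

      def q_sample(x0, t, T=1000):
          # Standard Gaussian forward process, linear alpha-bar schedule.
          alpha_bar = 1.0 - t / T
          noise = torch.randn_like(x0)
          return (alpha_bar ** 0.5) * x0 + ((1 - alpha_bar) ** 0.5) * noise

      x_t = q_sample(x0, t=500)
      # A denoising network would predict x0 (or the noise) from x_t and t;
      # decoding back to residues is an argmax over the 20 channels.
      recovered = x_t.argmax(dim=-1)
      print((recovered == seq).float().mean().item())  # accuracy after noising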

Antibody design

  • Efficient evolution of human antibodies from general protein language models (Apr′22, Stanford, 18 citations): generate targeted mutations with an LM, evaluate the candidates experimentally, and repeat in a loop. It “improved the binding affinities of four clinically relevant, highly mature antibodies up to sevenfold”, including for human SARS-CoV-2 antibodies.
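
    A sketch of the LM step in that loop, using masked-token likelihoods from a pretrained ESM-1v model (via the fair-esm package) to propose substitutions the LM prefers over wildtype; the paper’s exact scoring and filtering details may differ:

      # Score all single-site substitutions by masked-marginal likelihood.
      import torch, esm

      model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
      model.eval()
      bc = alphabet.get_batch_converter()

      wt = "EVQLVESGGGLVQPGGSLRLSCAAS"   # toy antibody fragment
      _, _, toks = bc([("wt", wt)])

      proposals = []
      with torch.no_grad():
          for i, aa in enumerate(wt):
              masked = toks.clone()
              masked[0, i + 1] = alphabet.mask_idx       # +1 skips BOS token
              logp = torch.log_softmax(model(masked)["logits"][0, i + 1], dim=-1)
              for mut in "ACDEFGHIKLMNPQRSTVWY":
                  gain = (logp[alphabet.get_idx(mut)]
                          - logp[alphabet.get_idx(aa)]).item()
                  if mut != aa and gain > 0:             # LM prefers the mutant
                      proposals.append((f"{aa}{i + 1}{mut}", gain))

      proposals.sort(key=lambda p: -p[1])
      print(proposals[:10])   # candidates to synthesize and assay, then repeat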

  • Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models (Jul′22, Helixon and others, 28 citations): “The first deep learning-based method that can explicitly target specific antigen structures and generate antibodies at atomic resolution”. Uses a diffusion model with in silico validation.

  • Unlocking de novo antibody design with generative artificial intelligence (Jan′23, Absci, 2 citations): “Several groups have introduced models for generative antibody design with promising in silico evidence, however, no such method has demonstrated de novo antibody design with experimental validation. Here we use generative deep learning models to de novo design antibodies against three distinct targets, in a zero-shot fashion, where all designs are the result of a single round of model generations with no follow-up optimization. In particular, we screen over 400,000 antibody variants designed for binding to human epidermal growth factor receptor 2 (HER2) using our high-throughput wet lab capabilities. From these screens, we further characterize 421 binders using surface plasmon resonance (SPR), finding three that bind tighter than the therapeutic antibody trastuzumab. The binders are highly diverse, have low sequence identity to known antibodies, and adopt variable structural conformations.”