Sampling Protein Language Models for Functional Protein Design
Protein language models have emerged as powerful ways to learn complex representations of proteins, thereby improving their performance on several downstream tasks, from structure prediction to fitness prediction, property prediction, homology detection, and more. By learning a distribution over protein sequences, they are also very promising tools for designing novel and functional proteins, with broad applications in healthcare, new material, or sustainability. Given the vastness of the corresponding sample space, efficient exploration methods are critical to the success of protein engineering efforts. However, the methodologies for adequately sampling these models to achieve core protein design objectives remain underexplored and have predominantly leaned on techniques developed for Natural Language Processing. In this work, we first develop a holistic in silico protein design evaluation framework, to comprehensively compare different sampling methods. After performing a thorough review of sampling methods for language models, we introduce several sampling strategies tailored to protein design. Lastly, we compare the various strategies on our in silico benchmark, investigating the effects of key hyperparameters and highlighting practical guidance on the relative strengths of different methods.
Jeremie Theddy Darmawan, Yarin Gal, Pascal Notin
Machine Learning for Structural Biology / Generative AI and Biology workshops, NeurIPS 2023
[Paper]