Bryce Johnson


Nested Sampling for Exploration of Protein Fitness Landscapes

Recent advances in machine learning offer predictive methods to determine protein developability metrics (related to expression, solubility, and stability) from sequence alone, enabling a more efficient search for developable protein mutants to be screened experimentally. One such approach uses high-throughput measurements to learn relevant amino acid properties and their interactions, predicting developability with increased accuracy. However, even for a small protein scaffold such as Gp2, a random search can sample at most a fraction of O(10^-8) of all possible protein variants. Here, we overcome this difficulty by using nested sampling (NS), a Monte Carlo scheme for Bayesian parameter estimation and model selection that is also commonly used in statistical thermodynamics to efficiently explore energy landscapes with many competing minima. We employ NS to explore the fitness landscape inferred by the machine learning model. Our analysis includes a non-linear dimensionality reduction (UMAP) of protein properties for the high-developability sequences identified by the algorithm, density of states estimation, and a unique topographical analysis of the fitness landscape. A simple parallelization algorithm for NS that speeds up convergence and reduces runtime for large numbers of sequences is also discussed.
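As an illustration of the general idea, the following is a minimal sketch of nested sampling over sequence space, not the implementation used in this work. It assumes a hypothetical `toy_fitness` function (matches to an arbitrary target sequence) standing in for the learned developability model, and the names `nested_sampling`, `n_live`, `n_iter`, and `n_walk` are illustrative choices. At each iteration the least-fit live sequence is discarded, its fitness becomes the new threshold, and a replacement is generated by a constrained random walk of point mutations started from a surviving sequence.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def toy_fitness(seq, target):
    """Hypothetical stand-in for a learned model: number of matches to a target."""
    return sum(a == b for a, b in zip(seq, target))


def nested_sampling(fitness, length, n_live=50, n_iter=400, n_walk=30, seed=0):
    """Return (thresholds, log_volumes, live, scores) from a basic NS run."""
    rng = random.Random(seed)
    # Start from live points drawn uniformly from sequence space.
    live = ["".join(rng.choice(ALPHABET) for _ in range(length))
            for _ in range(n_live)]
    scores = [fitness(s) for s in live]
    thresholds, log_volumes = [], []

    for i in range(1, n_iter + 1):
        # Discard the least-fit live point; its fitness is the new threshold.
        worst = min(range(n_live), key=scores.__getitem__)
        threshold = scores[worst]
        thresholds.append(threshold)
        # Standard NS estimate: the remaining fraction of the space shrinks
        # by roughly exp(-1/n_live) per iteration.
        log_volumes.append(-i / n_live)

        # Clone a surviving point, then decorrelate it with point mutations
        # that are accepted only if fitness stays at or above the threshold.
        donor = rng.choice([j for j in range(n_live) if j != worst])
        seq, score = list(live[donor]), scores[donor]
        for _ in range(n_walk):
            pos = rng.randrange(length)
            old = seq[pos]
            seq[pos] = rng.choice(ALPHABET)
            new_score = fitness("".join(seq))
            if new_score >= threshold:
                score = new_score
            else:
                seq[pos] = old  # reject: revert the mutation
        live[worst], scores[worst] = "".join(seq), score

    return thresholds, log_volumes, live, scores


if __name__ == "__main__":
    target = "ACDEFGHIKLMNPQRS"  # arbitrary illustrative target
    thresholds, log_volumes, live, scores = nested_sampling(
        lambda s: toy_fitness(s, target), length=len(target))
    print("final threshold:", thresholds[-1], "best live score:", max(scores))
```

The paired `thresholds` and `log_volumes` lists are the raw material for a density of states estimate, since successive differences in the volume estimates weight each fitness level. Because this toy fitness is integer-valued, many sequences tie at the threshold; a real application would need to handle such plateaus more carefully than this sketch does.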
