From Sequence to Function: Predicting What Proteins Actually Do

The central dogma of structural biology — that sequence determines structure, and structure determines function — has guided molecular biology for decades. The first link in that chain is now largely solved: given an amino acid sequence, we can predict its three-dimensional structure with remarkable accuracy. But the second link remains stubbornly incomplete. Two proteins can share the same fold yet catalyze entirely different reactions. A single point mutation can abolish enzymatic activity without detectably altering structure. Allosteric effectors can rewire a protein's function by shifting conformational equilibria in ways invisible to static structure prediction. The structure-function relationship is real but indirect, mediated by dynamics, electrostatics, solvation, and the precise choreography of chemical groups at active sites. Predicting what a protein does from its sequence — its substrates, its catalytic mechanism, its binding partners, its regulatory logic — requires models that go beyond spatial coordinates and reason about molecular behavior.

Current approaches to functional annotation are dominated by homology transfer: if a protein is similar in sequence to one with known function, assign the same function. This works well for close homologs but fails systematically for remote homologs, orphan proteins, and proteins whose functions have diverged despite conserved folds. An estimated 30-40% of the proteins in a newly sequenced genome cannot be assigned a function by homology, which leaves an enormous blind spot in our understanding of biology. Even among annotated proteins, functional descriptions are often coarse: "oxidoreductase" tells you the reaction class but not the substrate specificity, kinetic parameters, or regulatory mechanisms that determine biological role. Gene Ontology annotations, the standard vocabulary for protein function, are hierarchical but frequently incomplete, assigning only the most general terms while leaving specific molecular functions unknown. The field needs models that predict function de novo, from the protein itself rather than from its evolutionary neighbors.
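To make the homology-transfer baseline concrete, here is a minimal sketch of the idea in Python. It is illustrative only: the function names, the toy sequences, and the 0.4 threshold are all assumptions, and the standard library's difflib ratio is used as a crude stand-in for alignment-based sequence identity, which real annotation pipelines compute with dedicated alignment tools.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b).ratio()


def transfer_annotation(
    query: str,
    annotated: Dict[str, str],
    threshold: float = 0.4,  # arbitrary cutoff chosen for illustration
) -> Optional[Tuple[str, float]]:
    """Copy the label of the most similar annotated sequence.

    Returns None when nothing clears the threshold, which is the
    situation for remote homologs and orphan proteins.
    """
    best_label, best_score = None, 0.0
    for sequence, label in annotated.items():
        score = similarity(query, sequence)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None or best_score < threshold:
        return None
    return best_label, best_score


# Made-up sequences and labels, not real proteins.
reference = {
    "MKTAYIAKQRQISFVKSHFSRQ": "oxidoreductase",
    "GSHMLEDPVDAFKLNGTPEQ": "kinase",
}

close_hit = transfer_annotation("MKTAYIAKQRQISFVKSHFSRQ", reference)
no_hit = transfer_annotation("WWWWWWWWWWWWWWWWWWWW", reference)
print(close_hit)  # label transferred from the identical reference
print(no_hit)     # None: no annotated neighbor is similar enough
```

The second query shows the failure mode described above: when no annotated neighbor is close enough, the method has nothing to transfer and the protein stays unannotated.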

At Webbeon, we have built a function prediction system within Oracle Class R that operates on three levels of functional description: molecular function (what reaction or binding event the protein performs), mechanistic detail (how it performs it — catalytic residues, transition state geometry, cofactor requirements), and biological context (where in cellular pathways the protein acts and how it is regulated). The architecture processes the protein sequence through a large protein language model to extract evolutionary and biophysical features, then conditions on predicted structural and dynamic properties — including the conformational ensemble outputs from our dynamics models — to produce functional predictions. Critically, the model is trained not just on annotated function labels but on a corpus of enzymatic reaction data (BRENDA, SABIO-RK), protein interaction networks (STRING, BioGRID), and the primary literature via an automated extraction pipeline that converts experimental findings into structured functional annotations.
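One way to picture the three levels of functional description is as a nested record. The sketch below is a hypothetical schema, not Oracle Class R's actual output format: every class name, field name, and value is an assumption introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MolecularFunction:
    """Level 1: what reaction or binding event the protein performs."""
    ec_number: Optional[str] = None
    binding_partners: List[str] = field(default_factory=list)


@dataclass
class MechanisticDetail:
    """Level 2: how the protein performs its function."""
    catalytic_residues: List[int] = field(default_factory=list)  # 1-based positions
    cofactors: List[str] = field(default_factory=list)


@dataclass
class BiologicalContext:
    """Level 3: where the protein acts and how it is regulated."""
    pathways: List[str] = field(default_factory=list)
    regulators: List[str] = field(default_factory=list)


@dataclass
class FunctionPrediction:
    sequence: str
    molecular_function: MolecularFunction
    mechanism: MechanisticDetail
    context: BiologicalContext

    def catalytic_residue_letters(self) -> List[str]:
        """Map 1-based catalytic positions back to amino acid letters."""
        return [self.sequence[i - 1] for i in self.mechanism.catalytic_residues]


# Made-up example values; the sequence is not a real protein.
prediction = FunctionPrediction(
    sequence="MKHSDLLGAG",
    molecular_function=MolecularFunction(ec_number="3.4.21.-"),
    mechanism=MechanisticDetail(catalytic_residues=[3, 4, 5]),
    context=BiologicalContext(pathways=["example pathway"]),
)
print(prediction.catalytic_residue_letters())  # ['H', 'S', 'D']
```

Keeping the three levels in one record reflects the point that they are meant to be read together rather than as separate outputs.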

For enzymatic function prediction, the results represent a meaningful advance. On the EC number prediction benchmark (predicting the four-level Enzyme Commission classification from sequence alone), Oracle Class R achieves 91.3% accuracy at the full four-digit level on a held-out test set with less than 30% sequence identity to training examples — a regime where homology transfer achieves 54.7% and the previous best ML method (ProteInfer) reaches 79.2%. More importantly, Oracle Class R provides interpretable mechanistic hypotheses: for each predicted enzyme, the model identifies the likely catalytic residues (89% overlap with experimentally verified catalytic sites in the Catalytic Site Atlas), proposes the chemical mechanism from a vocabulary of 147 mechanistic steps, and estimates kinetic parameters (kcat and KM) to within one order of magnitude for 67% of tested enzymes. These are not disconnected predictions; they form a coherent mechanistic narrative that biochemists can evaluate and test experimentally.
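For readers who want to see what two of these metrics measure, the following sketch shows one plausible way to compute them. The function names and toy data are assumptions, and the exact evaluation protocol behind the reported figures is not specified here. Four-digit EC accuracy is taken to mean that all four fields of the predicted EC number match the reference, and "within one order of magnitude" is taken to mean that the predicted and measured values differ by at most a factor of ten.

```python
import math
from typing import Sequence


def ec_level_accuracy(
    predicted: Sequence[str], actual: Sequence[str], level: int = 4
) -> float:
    """Fraction of proteins whose EC numbers agree on the first `level` fields."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        p.split(".")[:level] == a.split(".")[:level]
        for p, a in zip(predicted, actual)
    )
    return hits / len(actual)


def within_one_order(predicted: float, measured: float) -> bool:
    """True if the two positive values differ by at most a factor of ten."""
    if predicted <= 0 or measured <= 0:
        raise ValueError("kinetic parameters must be positive")
    return abs(math.log10(predicted / measured)) <= 1.0


# Toy data for illustration only.
predicted_ec = ["1.1.1.1", "2.7.11.1", "3.4.21.4", "1.1.1.2"]
actual_ec = ["1.1.1.1", "2.7.11.1", "3.4.21.5", "1.1.1.1"]

print(ec_level_accuracy(predicted_ec, actual_ec, level=4))  # 0.5
print(ec_level_accuracy(predicted_ec, actual_ec, level=3))  # 1.0

# Both values must be in the same units (for example kcat in 1/s).
print(within_one_order(12.0, 100.0))  # True
print(within_one_order(0.5, 100.0))   # False
```

The toy data show why the four-digit level is the demanding one: two of the predictions get the first three fields right and miss only the final one, which is typically where substrate specificity is encoded.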

For binding specificity and allosteric regulation, we take a different approach. Rather than classifying proteins into predefined functional categories, Oracle Class R predicts interaction profiles: given a protein, what small molecules, peptides, or other proteins will it bind, and how will those binding events affect its conformational state and activity? The model constructs a latent representation of the protein's binding landscape by integrating sequence features, predicted binding-site geometry, and the conformational ensemble's response to in silico perturbations (virtual mutations, simulated ligand placement). In a benchmark against experimentally measured selectivity profiles for 312 kinases across a panel of 72 inhibitors, Oracle Class R's predicted selectivity matrices achieve an average AUC of 0.87, enabling virtual screening that prioritizes selective compounds over promiscuous binders. For allosteric regulation, we have demonstrated that the model correctly predicts known allosteric sites for 73% of a curated set of 200 allosteric proteins, and in 15 cases has identified putative allosteric sites that were subsequently confirmed by mutagenesis in collaboration with experimental partners.
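The average AUC figure can be read as follows, under assumptions that the text above does not spell out: that AUC means area under the ROC curve, that it is computed separately for each inhibitor across kinases, and that the experimental profiles are reduced to binary binder and non-binder labels. The sketch below computes ROC AUC with the pairwise rank formulation and averages it over inhibitors; the function names and the tiny matrices are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random binder is scored above a random non-binder.

    Ties count as half a win (the Mann-Whitney formulation of ROC AUC).
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one binder and one non-binder")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def mean_auc(
    score_matrix: Sequence[Sequence[float]],
    label_matrix: Sequence[Sequence[int]],
) -> float:
    """Average per-inhibitor AUC; each row is one inhibitor, each column one kinase."""
    aucs = [roc_auc(s, y) for s, y in zip(score_matrix, label_matrix)]
    return sum(aucs) / len(aucs)


# Toy panel: 2 inhibitors x 4 kinases, values made up for illustration.
scores = [
    [0.9, 0.8, 0.3, 0.1],
    [0.7, 0.2, 0.6, 0.4],
]
labels = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
print(mean_auc(scores, labels))  # 0.875
```

Averaging per inhibitor gives each compound equal weight regardless of how many kinases it binds, which is one reasonable choice when the goal is to rank selective compounds; averaging per kinase would answer a different question.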

The long-term vision is a system that can read a protein sequence the way a skilled biochemist reads a scientific paper — extracting not just what the molecule looks like, but what it does, how it does it, and why it matters in its biological context. We are not there yet, but the gap is narrowing. Each improvement in structure prediction, dynamics modeling, and functional annotation feeds into the others, creating a flywheel where better structural understanding enables better functional prediction, which in turn guides more targeted structural and dynamical analysis. This integrated approach — treating sequence, structure, dynamics, and function as coupled aspects of a single prediction problem rather than independent tasks — is at the heart of Webbeon's biophysics program and, we believe, the key to unlocking the next generation of computational biology.

Related Research

2026-03-05

Protein Structure Prediction in the Post-AlphaFold Era

2026-02-05

Molecular Dynamics at Millisecond Scale
