Protein AI Series Part 9: Where Is the Field Heading?
The Technical Evolution of Protein AI — A Record of Key Design Decisions
This is Part 9, the final installment of a nine-part series tracing the architectural choices behind modern protein structure prediction and design models.
The Core Question
What technical choices have converged across the field, what remains divergent, and what fundamental challenges await the next generation of protein AI models?
Over the previous eight Parts, we traced a remarkable arc of innovation: from AlphaFold2’s Evoformer revolution (2021) through the co-folding era (AF3, Boltz, Chai), the emergence of design models (RFdiffusion, BoltzGen, La-Proteina), the conformational diversity challenge (Frozen Trunk, BioEmu, Vilya-1), the engineering that makes it all trainable at scale, and the data landscape — from PDB experimental structures to AFDB’s proteome-scale quaternary expansion — that increasingly determines competitive advantage. This final Part steps back to identify the patterns, gaps, and open frontiers.
1. Where the Field Has Converged — Eight Technical Axes
Each Part of this series examined one technical axis. Across all eight, clear convergence directions have emerged — alongside questions that remain stubbornly open.
| Technical Axis (Part) | Convergence Direction | Open Question |
|---|---|---|
| Input (Part 1) | Hybrid: MSA + PLM | Can PLMs fully replace MSA? |
| Trunk (Part 2) | Pairformer-based; Triangle Multiplication survives all variants | Is attention in the Trunk necessary at all? |
| Generation (Part 3) | Flow Matching for new models | Does ensemble generation still need SDEs? |
| Scope (Part 4) | All-atom co-folding; open-source ecosystem maturing | Can open-source match or exceed proprietary models? |
| Design (Part 5) | “Keep the Trunk, Replace the Head” pattern | What is the optimal sequence representation for generation? |
| Diversity (Part 6) | Frozen Trunk identified as the bottleneck | No universal solution yet |
| Training (Part 7) | FlashAttention + BF16 + FSDP + activation checkpointing | FP8 adoption? Scaling beyond 3B parameters? |
| Data (Part 8) | Distillation essential; PDB + synthetic data mixing | Optimal experimental-to-synthetic ratio? Data moat vs. open data? |
1.1 The Strongest Convergences
Pairformer as de facto standard: Every major co-folding model released since 2024 — AF3, Boltz-1/2, Chai-1, Protenix, SeedFold, OpenFold3 — uses a Pairformer-variant Trunk. The variants differ in efficiency (SeedFold: linear attention; Pairmixer: attention-free), but the core data structure — pair representation $z_{ij}$ updated through triangular operations — is universal.
Flow Matching for generation: Among models released in 2025–2026, Flow Matching has become the default generative framework: NP3, Proteina, La-Proteina, Complexa, FrameFlow, AlphaFlow. The shift from diffusion SDEs to flow matching ODEs reduces sampling steps (40 vs. 200) and simplifies training (no noise schedule tuning). EDM-style diffusion persists in AF3-family models (Boltz, Chai) that predated this convergence.
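The shift can be made concrete with a toy sketch (nothing here is any specific model's API; the linear interpolation path and step count are illustrative):

```python
import numpy as np

# Conditional flow matching with a linear path x_t = (1 - t)*x0 + t*x1.
# Training regresses a velocity field toward the path's constant target
# velocity x1 - x0; sampling integrates the resulting ODE with a handful
# of Euler steps instead of hundreds of SDE steps.

def fm_training_pair(x0, x1, t):
    """Interpolant and regression target for one (noise, data) pair."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0              # constant velocity of the linear path
    return x_t, v_target

def euler_sample(v_field, x0, n_steps=40):
    """Integrate dx/dt = v_field(x, t) from t = 0 to t = 1."""
    x, dt = np.array(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_field(x, i * dt)
    return x

# With the exact velocity field of a straight path, 40 Euler steps land
# on the data point -- straight paths are why so few steps suffice.
x1 = np.array([1.0, -2.0, 0.5])
sample = euler_sample(lambda x, t: x1, np.zeros(3))
```

Noise schedules disappear entirely: the only design choices are the interpolation path and the ODE solver.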
All-atom co-folding: The era of protein-only models is over. Every new model handles proteins, nucleic acids, small molecules, ions, and covalent modifications in a unified framework. The tokenization strategy — residue-level for polymers, atom-level for non-standard molecules — introduced by AF3 has been adopted almost universally.
Engineering stack standardization: As detailed in Part 7, the combination of FlashAttention + BF16 mixed precision + FSDP + activation checkpointing is now the default for GPU-based training. Models deviate only for architecture-specific reasons (Pairmixer has no attention to flash-optimize; AF2/AF3 run on TPUs).
1.2 The Deepest Disagreements
Sequence representation for design: Three competing approaches — geometric encoding (BoltzGen), latent variables (La-Proteina), and discrete tokens (ESM3) — address the same fundamental problem: how to jointly generate discrete amino acid sequences and continuous 3D coordinates. None has emerged as clearly superior, and the evaluation criteria themselves are disputed (in silico metrics vs. experimental validation).
Conformational diversity: The Frozen Trunk problem (Part 6) is well-diagnosed but unsolved. Vilya-1’s unified diffusion transformer is architecturally elegant but limited to macrocycles. BioEmu produces thermodynamically correct ensembles but only for monomers. NP3’s polymer prior improves ConfBench scores but remains at 52%. No approach simultaneously achieves broad molecular scope, thermodynamic correctness, and computational tractability.
2. Patterns Across the Series — Meta-Observations
2.1 “Keep the Trunk, Replace the Head”
This pattern appeared in nearly every Part:
```
Part 2: Evoformer (Trunk) stays while structure module changes
        AF2 → AF3: same Trunk concept, new diffusion head

Part 3: IPA → Diffusion → Flow Matching
        Generation head evolves; Trunk representations persist

Part 5: RF1 Trunk → RFdiffusion denoiser (SE(3) diffusion head)
        Boltz-2 Trunk → BoltzGen design head (geometric encoding)
        Trunk = reusable asset; Head = task-specific adapter

Part 6: Frozen Trunk problem
        The very success of "Keep the Trunk" creates the
        conformational diversity bottleneck — z_ij is too static
```
The recurrence of this pattern reflects a deeper truth: learning rich representations of protein physics is the hardest part. Once a model has internalized the relationship between sequence, evolutionary signal, and 3D geometry in its Trunk, adapting that knowledge to new tasks (generation, design, affinity prediction) is comparatively straightforward. The Trunk is the most valuable asset — and also the hardest to change.
2.2 The Persistence of the Triangle
Across all Trunk variants — Evoformer, Pairformer, SeedFold, Pairmixer — one operation survives every simplification attempt: Triangle Multiplication.
```
Evoformer  (2021):  Tri Mult ✓   Tri Att ✓    MSA Att ✓   Seq Att (in MSA)
Pairformer (2024):  Tri Mult ✓   Tri Att ✓    ────────    Seq Att ✓
SeedFold   (2025):  Tri Mult ✓   Linear TA    ────────    Seq Att ✓
Pairmixer  (2026):  Tri Mult ✓   ────────     ────────    ────────

                    ▲ SURVIVES   removed or   removed     removed by
                      ALL        linearized   early       Pairmixer
                      variants
```
Triangle Attention, MSA Attention, and Sequence Attention have all been removed or replaced in at least one competitive variant. Triangle Multiplication has not. The operation:
\[
z_{ij}^{\text{out}} = \sum_k a_{ik} \cdot b_{jk}
\]

implements third-party reasoning — updating the relationship between residues $i$ and $j$ through a mediating residue $k$ — and maps naturally to GPU matrix multiplication. It is both the core inductive bias and the most hardware-efficient operation in the Trunk. Pairmixer’s central finding — “Triangle Multiplication Is All You Need” — is the strongest evidence that this operation, not attention, is the irreducible core of protein representation learning.
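In code, the operation reduces to a single einsum over the pair representation. A minimal sketch, omitting the layer norm, sigmoid gating, and output projection that the full Evoformer/Pairformer block adds:

```python
import numpy as np

def triangle_multiply_outgoing(z, W_a, W_b):
    """Outgoing triangle multiplicative update, stripped to its core.

    z: (L, L, c) pair representation; W_a, W_b: (c, c) channel projections.
    The production block also applies layer norm, gating, and an output
    projection, all omitted here for clarity.
    """
    a = z @ W_a                        # edge features i -> k
    b = z @ W_b                        # edge features j -> k
    # z_out[i, j] = sum_k a[i, k] * b[j, k]: one batched matmul per channel
    return np.einsum("ikc,jkc->ijc", a, b)

rng = np.random.default_rng(0)
L, c = 6, 4
z = rng.standard_normal((L, L, c))
W_a, W_b = rng.standard_normal((2, c, c))
out = triangle_multiply_outgoing(z, W_a, W_b)
```

The einsum contraction over $k$ is exactly a matrix multiplication per channel, which is why the operation saturates GPU tensor cores so well.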
2.3 The Return of Physics-Based Priors
The field has traced an arc from physics through pure data back toward physics:
```
2021  AF2:        End-to-end learning, minimal physics priors
                  → "Let the model learn physics from data"

2024  AF3/Boltz:  Gaussian noise prior for diffusion
                  → Generic prior, no structural knowledge

2025  NP3:        Polymer Prior (Langevin dynamics with bonding potentials)
                  → Physics-informed starting point for flow matching
                  → 40 steps instead of 200

2025  BioEmu:     Boltzmann distribution p(x|seq) ∝ exp(-E(x)/kT)
                  → Explicit thermodynamic objective
                  → PPFT: free energy measurements as training signal

2025  Vilya-1:    Pure chemical features (element, bond, charge, chirality)
                  → No molecule-type labels → physics-first representation
```
The progression suggests that pure data-driven learning hits diminishing returns — physics-based priors provide the right inductive biases for problems where data is scarce (conformational ensembles, rare folds, out-of-distribution molecules). This parallels similar trends in other scientific ML domains.
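As a concrete anchor for the Boltzmann line above: the relation maps conformer energies to equilibrium populations via a softmax over $-E/kT$. The energies below are hypothetical; $kT \approx 0.593$ kcal/mol at 298 K.

```python
import numpy as np

def boltzmann_weights(energies_kcal, kT=0.593):
    """Equilibrium populations p(x) ∝ exp(-E(x)/kT) for discrete conformers.

    Subtracting the minimum energy before exponentiating is a standard
    trick for numerical stability; it cancels in the normalization.
    """
    w = np.exp(-(energies_kcal - energies_kcal.min()) / kT)
    return w / w.sum()

E = np.array([0.0, 1.0, 3.0])   # hypothetical conformer energies, kcal/mol
p = boltzmann_weights(E)        # lowest-energy conformer dominates
```

A model like BioEmu is trained so its sampled ensemble reproduces these relative populations, not just the single lowest-energy structure.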
2.4 Data as the New Competitive Axis
A pattern that emerged most clearly in Part 8: as architectures converge, data becomes the primary differentiator.
```
2021  AF2: Architecture is the breakthrough
      → Evoformer + IPA = unprecedented accuracy
      → Data (PDB) is necessary but not the innovation

2024  AF3/Boltz/Chai: Architecture converges (Pairformer + Diffusion)
      → Open-source matches AF3 on architecture alone
      → Differentiation begins to shift toward data strategy

2025  SeedFold: 26.5M distillation data is essential, not optional
      → Removing distillation mid-training causes immediate degradation
      → Architecture needs data volume to compensate for lost inductive bias

2026  IsoDDE: Proprietary experimental data creates the performance gap
      → Same architecture family as open-source models
      → Decisive advantage: internal binding data at scale

2026  AFDB Quaternary: Open data ecosystem expands to complexes
      → 1.8M high-confidence dimeric structures now publicly available
      → Open-source community gains interface data that was previously scarce
```
The implication: the “data pyramid” (Part 8) — sequences (billions) > synthetic structures (millions) > experimental structures (hundreds of thousands) > functional data (tens of thousands) — defines the landscape of competitive advantage. Each layer is harder to acquire than the one below it, and the scarcest layer (proprietary functional data) is increasingly where competitive moats form.
2.5 “What We Thought Was Essential, Wasn’t”
A recurring pattern of elimination:
| Year | Discovery | Impact |
|---|---|---|
| 2022 | ESMFold: MSA not required for structure prediction | PLM as MSA replacement |
| 2023 | RF2: Neither IPA nor Triangle Attention needed for AF2-level accuracy | Opened path to Pairmixer |
| 2024 | AF3: Recycling through the Trunk can be removed | Simplified training |
| 2025 | SeedFold: Triangle Attention can be linearized without quality loss | Sub-cubic Trunk |
| 2026 | Pairmixer: All attention in the Trunk can be removed | Attention-free protein AI |
Each “removal” discovery required first understanding why the component was included (usually for good theoretical reasons), then empirically demonstrating that the model could achieve equivalent performance without it. The accumulation of these discoveries suggests the field is converging toward a minimal sufficient architecture — one where every remaining component (Triangle Multiplication, pair representation, flow matching) has survived multiple elimination attempts.
3. The Wet-Lab Validation Gap
3.1 Computational Success $\neq$ Experimental Success
The most consequential gap in protein AI is between in silico metrics and experimental outcomes. A designed protein that scores well on designability (self-consistency via AF2 refolding), novelty, and diversity may still fail to:
- Express in a cell (solubility, folding kinetics)
- Fold into the predicted structure (kinetic traps, aggregation)
- Bind the intended target (affinity, specificity)
- Function as intended (catalysis, signaling)
3.2 Experimental Validation Status by Model
| Model | Design Target | Experimental Scale | Key Result | Publication |
|---|---|---|---|---|
| RFdiffusion | De novo binders, scaffolds | Hundreds of designs tested | ~15–30% success rates for binders; multiple designs validated in vivo | Nature 2023 |
| RFdiffusion2 | Enzyme active sites (all-atom) | Dozens of designs | Functional enzymes with designed active sites | 2025 |
| RFAntibody | Antibody CDR design | Multiple targets | Competitive with experimental methods | 2024–25 |
| Chai-2 | De novo antibody binders | 52 targets, ~20 designs each | ~16% hit rate (~100 $\times$ over previous methods); 50% of targets hit with $\leq 20$ designs | 2025 |
| BoltzGen | Universal binder design | Limited experimental data | 6 design protocols validated computationally | 2025 |
| La-Proteina | Backbone + sequence co-design | Computational benchmarks | 75%+ co-designability (in silico) | 2025 |
| Complexa | Binder design | Computational benchmarks | SOTA on binder design benchmarks (in silico) | 2025–26 |
The maturity gradient is striking: RFdiffusion (2023) has three years of experimental validation and hundreds of confirmed designs. BoltzGen and La-Proteina/Complexa (2025) show strong computational results but lack large-scale experimental confirmation. This lag is inherent — wet-lab validation takes 3–12 months per design cycle.
3.3 The In Silico–Experimental Correlation Problem
The standard computational validation pipeline:
```
Generated structure → AF2/Boltz refolding → RMSD to design → "designability"
```
This self-consistency check asks: “Does a prediction model agree that this design folds as intended?” But the prediction model shares many of the same biases as the generation model — especially when both are Pairformer-based. High self-consistency may reflect shared model biases rather than genuine foldability.
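Concretely, the designability check reduces to a superposition RMSD between the designed and refolded backbones. A minimal Kabsch sketch — the 2 Å Cα threshold is the commonly used convention, while the function names are illustrative:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)                      # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt         # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))

def self_consistent(designed_ca, refolded_ca, threshold=2.0):
    """Designability criterion: refolded structure within 2 Å of the design."""
    return kabsch_rmsd(designed_ca, refolded_ca) < threshold
```

Note that this metric inherits whatever biases the refolding model has; a low RMSD says the two models agree, not that the design folds.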
Metrics that correlate better with experimental success:
- Rosetta energy: Physics-based energy function, orthogonal to learned models
- Molecular dynamics stability: Does the design maintain its structure in simulation?
- Expression prediction: Sequence-level features predicting soluble expression
- Experimental hit rate: The only ground truth — but expensive and slow
The field needs a “PoseBusters for protein design” — a filter that catches physically implausible designs before they reach the wet lab.
3.4 The Tightening Design Loop
The practical impact of AI protein design depends on the speed of the design–make–test–learn cycle:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
2023: RFdiffusion (backbone)
→ ProteinMPNN (sequence)
→ Expression + binding assay (months)
→ Manual analysis → next round
2025: BoltzGen/Chai-2 (end-to-end: structure + sequence)
→ Automated expression (Recursion, Emerald Cloud Lab)
→ High-throughput binding (weeks)
→ Active learning → next round
Future: Unified AI model (design + predict + score)
→ Robotic synthesis + characterization (days)
→ Automatic data feedback → model update
→ Continuous improvement cycle
Each acceleration of this loop — from months to weeks to days — multiplicatively increases the value of AI design models. The models that integrate most tightly with automated experimental platforms will have the fastest learning cycles and the best real-world performance.
4. The Limits of Current Benchmarks
4.1 From CASP to Continuous Evaluation
CASP (Critical Assessment of protein Structure Prediction) has been the gold standard since 1994, running biennial blind prediction competitions. But its limitations have become apparent:
- Biennial cadence: Two-year gaps cannot track a field where major models appear quarterly
- Static targets: Test proteins are selected once; models can overfit to the distribution of CASP targets
- Structure-only focus: No evaluation of dynamics, binding, or design
CAMEO (Continuous Automated Model EvaluatiOn) provides weekly blind evaluation against newly deposited PDB structures, offering real-time tracking of model performance. PLINDER specifically benchmarks protein-ligand interaction prediction, filling a gap that CASP never addressed.
4.2 PoseBusters: Physical Validity
PoseBusters (2024) introduced a critical filter: checking whether predicted structures satisfy basic physical constraints — bond lengths within tolerance, no steric clashes, correct chirality, plausible torsion angles. Surprisingly, many AI-generated structures that score well on RMSD and lDDT fail PoseBusters checks:
```
Scenario:
  AI predicts protein-ligand complex
    lDDT: 0.85 (good)
    Ligand RMSD: 1.2 Å (good)

  BUT:
    → C-C bond stretched to 2.1 Å (should be ~1.5 Å)
    → Steric clash between ligand and protein
    → Inverted chirality at stereocenter

  PoseBusters verdict: FAIL
```
This reveals a gap between geometric accuracy (learned from data) and physical validity (constrained by chemistry). Models trained purely on structural data can produce geometrically close but physically impossible structures — a problem that matters critically for drug design, where incorrect bond geometry invalidates binding affinity predictions.
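Two of these checks are easy to sketch. The following is a much-simplified illustration in the spirit of PoseBusters, not its actual implementation; tolerances and reference lengths are illustrative:

```python
import numpy as np

def check_bond_lengths(coords, bonds, tol=0.25):
    """Flag bonds deviating more than `tol` Å from their reference length.

    coords: (N, 3) atom positions; bonds: list of (i, j, reference_Å).
    """
    bad = []
    for i, j, ref in bonds:
        d = float(np.linalg.norm(coords[i] - coords[j]))
        if abs(d - ref) > tol:
            bad.append((i, j, d))
    return bad

def check_clashes(ligand, protein, min_dist=2.0):
    """Return (ligand_atom, protein_atom) index pairs closer than `min_dist` Å."""
    d = np.linalg.norm(ligand[:, None, :] - protein[None, :, :], axis=-1)
    return np.argwhere(d < min_dist)

# The stretched C-C bond from the scenario above: 2.1 Å vs a ~1.54 Å reference
coords = np.array([[0.0, 0.0, 0.0], [2.1, 0.0, 0.0]])
violations = check_bond_lengths(coords, [(0, 1, 1.54)])
```

Checks like these are cheap relative to prediction, which is why running them as a post-filter has become standard practice.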
4.3 Data Leakage and Temporal Splits
A pervasive concern: training data overlapping with test data. The PDB grows continuously; structures deposited after a model’s training cutoff may share high sequence identity with training examples. Proper evaluation requires:
- Temporal splits: Only evaluate on structures deposited after the training data cutoff
- Sequence identity filtering: Remove test targets with $\gt 30$% sequence identity to any training example
- Structural novelty: Test on genuinely novel folds not present in the training distribution
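A minimal sketch of the first two requirements. The identity measure here is alignment-free for brevity; real pipelines compute alignment-based identity with tools such as MMseqs2 or BLAST:

```python
from datetime import date

def naive_identity(a, b):
    """Positionwise identity with no alignment -- illustration only."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def build_test_set(candidates, train_seqs, cutoff, max_identity=0.30):
    """Keep only targets deposited after `cutoff` that stay below the
    identity threshold against every training sequence."""
    kept = []
    for seq, deposited in candidates:
        if deposited <= cutoff:
            continue                   # temporal split
        if any(naive_identity(seq, t) > max_identity for t in train_seqs):
            continue                   # sequence identity filter
        kept.append(seq)
    return kept

train = ["MKTAYIAKQR", "GSSGSSGLVP"]
candidates = [
    ("MKTAYIAKQR", date(2025, 6, 1)),   # identical to training: rejected
    ("WNDPHQLVAE", date(2020, 1, 1)),   # deposited before cutoff: rejected
    ("WNDPHQLVAE", date(2025, 6, 1)),   # novel and recent: kept
]
test_set = build_test_set(candidates, train, cutoff=date(2024, 1, 1))
```

Both filters are necessary: a temporal split alone passes recent depositions of well-studied folds, and identity filtering alone passes structures the model may have memorized from pre-cutoff homologs.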
ConfBench (Part 6) exemplifies good benchmark design: it tests a capability (conformational discrimination) that models were not explicitly trained for, revealing genuine weaknesses rather than memorization.
4.4 Benchmark Overfitting
When every model optimizes for the same benchmarks, the field risks Goodhart’s Law: the metric ceases to be a good measure once it becomes the target. Signs of this in protein AI:
- Models achieving near-identical lDDT scores on standard test sets while differing substantially on out-of-distribution targets
- Hyperparameter tuning against benchmark leaderboards rather than downstream applications
- New benchmarks (ConfBench, PoseBusters, PLINDER) consistently exposing weaknesses that existing benchmarks missed
The healthiest dynamic is a co-evolution of models and benchmarks: new models expose benchmark limitations, new benchmarks expose model limitations, and both improve.
5. Industrial Applications — Where Do These Models Fit in Drug Discovery?
5.1 Structure-Based Virtual Screening
The most immediate industrial impact of protein structure prediction: enabling virtual screening for targets without experimental structures.
```
Before AI structure prediction:
  Target protein → X-ray crystallography (months-years, may fail)
    → IF structure obtained: docking-based virtual screening
    → IF not: phenotypic screening (blind)

After AI structure prediction:
  Target protein → AF3/Boltz-2 prediction (minutes)
    → Docking against AI structure
    → Prioritized compound library for experimental testing
```
This has already changed practice at major pharmaceutical companies. AI-predicted structures are routinely used for virtual screening when experimental structures are unavailable — particularly for targets in disease-relevant conformations that resist crystallization (membrane proteins, IDPs, transient complexes).
5.2 Binding Affinity: The Remaining Gap
Binding affinity prediction is the holy grail of computational drug design. Two approaches compete:
Physics-based (FEP+, TI): Free Energy Perturbation calculates relative binding free energies between congeneric ligands using molecular dynamics. Accuracy: ~1 kcal/mol for well-parameterized systems. Cost: GPU-hours per ligand pair.
AI-based (Boltz-2, IsoDDE): End-to-end prediction of $K_d$ or $\Delta G$ from structure. Boltz-2’s affinity head (Part 7, Stage 3) predicts binding affinity as an additional output. IsoDDE reportedly integrates affinity prediction into a multi-task framework.
Current status: AI affinity prediction has not yet matched FEP+ accuracy for lead optimization (where 0.5 kcal/mol differences matter). However, AI models are orders of magnitude faster and can handle targets where FEP+ requires expensive setup (novel scaffolds, protein-protein interactions, non-congeneric series). The likely convergence: AI for initial ranking, FEP+ for final optimization.
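The two accuracy scales can be connected through the thermodynamic relation $\Delta G = RT \ln K_d$, which makes clear why lead optimization is so demanding:

```python
import math

R_KCAL = 1.987e-3   # gas constant in kcal/(mol*K)

def dg_from_kd(kd_molar, temp_k=298.0):
    """Binding free energy from the dissociation constant: ΔG = RT ln(Kd)."""
    return R_KCAL * temp_k * math.log(kd_molar)

def kd_from_dg(dg_kcal, temp_k=298.0):
    """Inverse conversion: Kd = exp(ΔG / RT)."""
    return math.exp(dg_kcal / (R_KCAL * temp_k))

# A 1 nM binder corresponds to ΔG ≈ -12.3 kcal/mol. A mere twofold change
# in Kd is RT*ln(2) ≈ 0.41 kcal/mol -- which is why sub-0.5 kcal/mol
# accuracy matters in lead optimization.
dg_1nm = dg_from_kd(1e-9)
```

An AI model with ~1 kcal/mol error blurs a fivefold range in $K_d$ — acceptable for initial ranking, not for discriminating between close analogs.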
5.3 Biologics vs. Small Molecules
AI protein design’s impact differs dramatically between molecule types:
Biologics (antibodies, nanobodies, peptides): AI design is genuinely transformative.
- Traditional approach: immunization campaigns or phage display → months of screening
- AI approach: Chai-2/RFAntibody generates candidates in silico → immediate experimental testing
- Chai-2’s ~16% hit rate across diverse targets represents a step change in efficiency
- BoltzGen’s six design protocols (protein, peptide, antibody, nanobody, small molecule, redesign) provide a unified design platform
Small molecules: AI’s primary contribution is structure prediction for virtual screening, not molecule design.
- Small molecule generative models (diffusion-based, autoregressive) exist but operate in a different design space
- The binding pocket prediction from protein AI feeds into traditional SBDD (Structure-Based Drug Design) pipelines
- Protein-ligand co-folding (AF3, Boltz-2, NP3) improves pose prediction, which directly benefits small molecule optimization
5.4 The Automation Frontier
The companies integrating AI design with automated experimental platforms — Recursion (robotic biology), Isomorphic Labs (DeepMind spin-off), EvolutionaryScale (ESM3), and Chai Discovery — are positioned to close the design–test loop fastest. The competitive advantage is not the model alone but the speed of iteration: a slightly worse model with 10 $\times$ faster experimental feedback can outperform a better model with slow feedback.
6. Looking Ahead: 2026–2027
6.1 Will Unified Architectures Prevail?
Vilya-1’s unified diffusion transformer (Part 6) merges Trunk and Diffusion Module, making $z_{ij}$ dynamic at every denoising step. This solves the Frozen Trunk problem architecturally but at cost: $O(T \times L^2)$ instead of $O(L^2)$ for the pair representation.
Prediction: Unified architectures will expand beyond macrocycles to protein-ligand complexes within 2026–2027, but with approximations — updating $z_{ij}$ every $k$ steps rather than every step, or using lightweight update mechanisms. The full $O(T \times L^2)$ cost is unlikely to be acceptable for large complexes, but the principle of dynamic pair representations will become standard.
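The "every $k$ steps" compromise can be sketched in a few lines; `trunk_update` and `head_step` are hypothetical stand-ins for a pair-representation refresh and a cheap per-step denoiser:

```python
# Sketch of amortizing the pair-representation update across denoising
# steps: refresh z every k steps instead of every step.

def denoise(x, z, trunk_update, head_step, n_steps=40, k=5):
    for t in range(n_steps):
        if t % k == 0:
            z = trunk_update(x, z)     # expensive O(L^2)-state refresh
        x = head_step(x, z, t)         # lightweight denoising step
    return x

# Cost: n_steps / k trunk refreshes instead of n_steps (8 vs. 40 here),
# while z still tracks the evolving structure. Count calls to verify:
calls = {"trunk": 0}

def count_trunk(x, z):
    calls["trunk"] += 1
    return z

result = denoise(0.0, None, count_trunk, lambda x, z, t: x, n_steps=40, k=5)
```

The knob $k$ interpolates between the Frozen Trunk ($k = n\_steps$, $z$ computed once) and Vilya-1's fully dynamic regime ($k = 1$).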
6.2 Will PLMs Replace MSA?
The trend from Part 1 is clear: PLM contribution is growing while MSA dependency is shrinking. ESMFold demonstrated MSA-free prediction (with accuracy loss); Chai-1 showed that PLMs can compensate when MSA quality is poor.
Prediction: By 2027, the best single-sequence (MSA-free) model will match the best MSA-dependent model on standard benchmarks. MSA will remain useful for edge cases (orphan proteins, de novo designs with no evolutionary history) but cease to be the default requirement.
6.3 Active Learning: Closing the Loop
The most impactful near-term development is not a new architecture but a new workflow: automated design–test–learn cycles where AI models propose designs, robotic platforms test them, and results feed back into model training.
```
Current (open-loop):
  AI model (fixed) → designs → experiments → publications
                               (months later)

Future (closed-loop):
  AI model v1 → designs → robotic testing (days)
      ↑                       │
      └── retrain on results ←┘
  AI model v2 → better designs → ...
```
This requires:
- Standardized experimental protocols that produce machine-readable results
- Uncertainty quantification in AI models (knowing which designs to test first)
- Efficient fine-tuning on small batches of new experimental data
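The uncertainty-quantification requirement typically takes the form of an acquisition function that decides which designs go to the lab first. A minimal upper-confidence-bound sketch; the scores and the `beta` weight are illustrative:

```python
import numpy as np

def select_batch(pred_scores, uncertainties, k=8, beta=1.0):
    """Indices of the top-k designs by predicted score + beta * uncertainty.

    Favoring high-uncertainty designs makes each wet-lab round maximally
    informative for the subsequent retraining step.
    """
    acquisition = np.asarray(pred_scores) + beta * np.asarray(uncertainties)
    return np.argsort(-acquisition)[:k]

scores = np.array([0.9, 0.5, 0.7, 0.3])   # model's predicted quality
sigma  = np.array([0.0, 0.6, 0.1, 0.9])   # model's uncertainty
batch = select_batch(scores, sigma, k=2)  # picks the uncertain candidates
```

With `beta=0` this degenerates to pure exploitation (test only the predicted best); raising `beta` trades immediate hit rate for faster model improvement.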
6.4 Multi-Task Foundation Models
IsoDDE (Isomorphic Labs, 2026) signals the direction: a single model predicting structure, binding affinity, ADMET properties, and selectivity. The “multi-task drug design engine” treats each downstream application as a head on a shared Trunk — the “Keep the Trunk, Replace the Head” pattern scaled to its logical conclusion.
Prediction: By 2027, the leading protein AI models will be multi-task: structure prediction, confidence, affinity, design, and property prediction sharing a common Trunk. The competitive differentiation will shift from architecture (converged) to data (proprietary experimental datasets) and integration (speed of the design–test loop).
6.5 The Data Ecosystem: Expansion and Risk
The data landscape (Part 8) is evolving along three dimensions that will shape 2026–2027:
AFDB Quaternary Expansion: The 2026 expansion of the AlphaFold Database to proteome-scale quaternary structures (Han et al.) added 31M dimeric complex predictions, of which 1.8M meet high-confidence thresholds. Unlike monomeric AFDB, these contain interface information — directly usable for training binder design and PPI prediction models. This public resource partially addresses the data scarcity that previously limited complex-aware model training to PDB’s ~tens of thousands of experimental multimers.
Continuous distillation as requirement: SeedFold’s ablation (Part 8) established that distillation data is not merely useful for pre-training — it must be maintained throughout training. Removing it at step 47,612 caused immediate accuracy degradation. This finding has architectural implications: post-AF2 models that replaced IPA’s geometric inductive bias with general-purpose diffusion transformers fundamentally need more data to generalize. The 180K PDB structures are insufficient; distillation at scale (SeedFold: 26.5M, OpenFold3: 13M) is now a requirement, not an optimization.
The data flywheel and model collapse risk: As each generation of models generates synthetic training data for the next (AF2 → AFDB → Boltz-2/NP3 → next generation), a recursive loop forms. This flywheel drives improvement but carries the risk of model collapse — systematic errors reinforced rather than corrected across generations. The role of experimental PDB data as a “ground truth anchor” becomes increasingly critical. Whether the open-source community can sustain the flywheel without access to proprietary experimental data (IsoDDE’s advantage) remains the central strategic question.
6.6 Democratization of Access
The barrier to using protein AI has dropped dramatically:
| Year | Access Model | User Requirement |
|---|---|---|
| 2021 | AF2: Download code, install dependencies, run MSA search | ML engineering + biology |
| 2023 | AlphaFold Server: Web interface for prediction | Biology only |
| 2025 | Boltz API, Chai API: Cloud endpoints for prediction + design | API call |
| 2026+ | Integrated platforms: Design → predict → score in one interface | Domain question only |
This democratization means that the bottleneck shifts from computational access to biological insight: knowing which protein to target, which conformer matters, and how to validate results experimentally.
7. The Open-Source Ecosystem’s Long-Term Impact
7.1 Where Open Source Won
The open-source reproduction race (Part 4) has produced models that match or exceed AF3 on most benchmarks:
```
AF3 (2024, academic license)
 ├── Boltz-2 (MIT) ────────── matches AF3 + affinity head
 ├── OpenFold3 (Apache 2.0) ─ matches AF3, best RNA
 ├── Chai-1 (Apache 2.0) ──── matches AF3 + ESM-2 integration
 └── Protenix (open) ──────── matches AF3 with minor improvements

Verdict: Open-source collectively matches AF3 across all benchmarks
         AND adds capabilities (affinity, PLM integration) AF3 lacks
```
7.2 Where Proprietary Models Lead
IsoDDE represents capabilities that open-source has not yet matched:
- Multi-task: structure + affinity + ADMET in one model
- Trained on proprietary experimental data (Isomorphic Labs’ internal datasets)
- 1000-seed sampling with learned confidence ranking
- Integrated into a drug discovery pipeline (not just a prediction tool)
The pattern: open-source matches on architecture; proprietary leads on data and integration.
7.3 The Second-Order Innovation Effect
The most important impact of open-source is not the models themselves but the innovation they enable:
```
OpenFold (2022, Apache 2.0)
  → AlphaFlow (2024): AF2 fine-tuned for conformational ensembles
  → Training recipe shared → informed Boltz-1 development

Boltz-1/2 (MIT, training code open)
  → BoltzGen (2025): design capability added
  → BoltzDesign1: further design extensions
  → Community can experiment with training modifications

Proteina (open, 2025)
  → La-Proteina: latent flow matching extension
  → Complexa: binder design extension
  → Scaling law findings inform entire field
```
Each open-source release creates a platform for derivative innovation that the original authors could not have anticipated. The cumulative effect: the rate of innovation in protein AI is faster than any single organization could achieve alone.
7.4 License Matters
The choice between restrictive (AF3: academic-only) and permissive (Boltz-2: MIT, OpenFold3: Apache 2.0) licenses has had measurable consequences:
- Academic-only: AF3’s code is available but cannot be used commercially → pharmaceutical companies must rely on open alternatives or proprietary reproductions
- Permissive: Boltz-2 and OpenFold3 are directly usable in commercial drug discovery pipelines → faster industrial adoption
- Result: The open-source models with permissive licenses have seen wider adoption than AF3 despite AF3’s first-mover advantage
8. The Full Technical Timeline
```
[Timeline]
2018  AF1: Distance distribution prediction + gradient descent refinement

2021  AF2: Evoformer + IPA (end-to-end revolution)
      RF1: 3-track architecture (1D + 2D + 3D simultaneous)

2022  ESMFold: PLM-only prediction (no MSA required)
      ProteinMPNN: Inverse folding becomes standard
      OpenFold: AF2 open-source reproduction (Apache 2.0)
      UniFold: AF2 reproduction by DP Technology

2023  RFdiffusion: Design revolution (SE(3) diffusion)
      RF2: Triangle Attention shown unnecessary
      FrameDiff: SE(3) diffusion theoretical foundation

2024  AF3: Pairformer + EDM Diffusion (co-folding era begins)
      RFAA: All-atom co-folding (Baker Lab)
      Boltz-1, Chai-1: Open-source AF3 reproductions
      BioEmu: Boltzmann distribution learning (conformational ensembles)
      ESM3: Discrete multimodal generation (structure + sequence + function)

2025  Boltz-2: Affinity head added (structure + confidence + affinity)
      SeedFold: Linear Triangle Attention (sub-cubic Trunk)
      NP3: Flow Matching + Polymer Prior + ConfBench
      BoltzGen: Universal binder design (geometric encoding)
      Proteina → La-Proteina → Complexa: NVIDIA flow matching lineage
      OpenFold3: Complete open-source AF3 (Apache 2.0)
      Chai-2: Antibody design (~16% hit rate)
      Vilya-1: Unified Diffusion Transformer (dynamic z_ij)
      RFAntibody, RFdiffusion2: Baker Lab design extensions

2026  IsoDDE: Multi-task drug design engine (Isomorphic Labs)
      Pairmixer: Attention-free Trunk ("Triangle Multiplication Is All You Need")
      AFDB Quaternary: Proteome-scale complex predictions (31M dimers, 1.8M HC)

[Technical Axis Transitions]
Input:      MSA → PLM → Hybrid (MSA + PLM)
Trunk:      Evoformer → Pairformer → Pairmixer/SeedFold
Generation: IPA → Diffusion (EDM) → Flow Matching
Scope:      Monomer → Complex → All-atom co-folding
Purpose:    Prediction → Design → Ensemble
Training:   DDP → FSDP, FP32 → BF16, single-stage → multi-stage
Data:       PDB only → PDB + distillation → PDB + AFDB + synthetic complexes

[Lineage Map]
[Structure Prediction]                      [Design / Generation]
━━━━━━━━━━━━━━━━━━━━                        ━━━━━━━━━━━━━━━━━━━
AF1 (2018)
 └→ AF2 (2021) ─── OpenFold                 FrameDiff (2023) → FrameFlow
     │             UniFold                  ProteinMPNN (2022)
     ├→ AF2-Multimer
     └→ AF3 (2024) ─── OpenFold3
         │             Protenix
         │             SeedFold
         ├→ Boltz-1 → Boltz-2 ────────→ BoltzGen, BoltzDesign1
         ├→ Chai-1 ───────────────────→ Chai-2
         └→ IsoDDE (2026)

RF1 (2021)
 ├→ RF2 (2023)
 └→ RFAA (2024) ──────────────────────→ RFdiffusion → RFdiff-AA
                                        → RFAntibody → RFdiff2

NP1 (2022) → NP2 → NP3 (2025)
ESM-1b → ESM-2 → ESMFold ─────────────→ ESM3
BioEmu (2024) ────────────────────────→ PPFT

Proteina (backbone FM)
  → La-Proteina (latent FM)
  → Complexa (binder design)

Vilya-1 (2025) ─── Unified Diffusion Transformer
PLACER (2025) ──── Conformational ensemble
AlphaFlow ──────── Flow Matching + AF2 fine-tuning
Pairmixer (2026) ─ Attention-free Trunk
```
9. Closing Thoughts
What this series traced
Over nine installments, we followed one core question: why did each model make the technical choices it did? The answers revealed a field that is simultaneously converging (Pairformer, Flow Matching, all-atom scope, standard training stack) and diversifying (design representations, conformational diversity approaches, commercial vs. open-source strategies).
The four most important lessons
1. Representations matter more than generation methods. The Trunk — how a model represents residue-pair relationships — has been the most stable and valuable component across all model generations. Generation methods (IPA, diffusion, flow matching) have changed rapidly, but the pair representation $z_{ij}$ and the triangular operations that update it have persisted from Evoformer through Pairmixer. Investing in better representations pays off across all downstream tasks.
2. Elimination is as important as innovation. Some of the field’s most impactful contributions were removals: RF2 removing IPA, Pairmixer removing attention, ESMFold removing MSA dependence. Each removal simplified the architecture, reduced computational cost, and clarified what is truly essential. The minimal architecture — Triangle Multiplication + FFN for the Trunk, flow matching for generation — may be close to the irreducible core.
3. Engineering and data are first-class design decisions. Part 7 showed that the same architecture trained with different engineering produces different results. Part 8 showed that the same architecture trained with different data produces different results. The gap between “published architecture” and “working system” is filled by crop strategies, precision management, distillation pipelines, and data mixing ratios. As architectures converge, the competitive frontier shifts to data — who has the best experimental measurements (IsoDDE), who generates the best synthetic structures (AFDB, Teddymer), and who mixes them most effectively (Boltz-2’s 60/40, SeedFold’s 50/50).
4. Open data accelerates the entire field. The AFDB — from 360K monomers (2021) to 214M monomers (2024) to 31M quaternary complexes (2026) — has been the single most impactful public resource in protein AI. Models trained on AFDB-derived data (NP3, Boltz-2, SeedFold) collectively outperform models without it. The expansion to quaternary structures provides interface information that was previously available only through the PDB’s limited set of experimental complexes. Open data creates a rising tide that lifts all models — while proprietary data creates moats that only individual organizations can cross.
What comes next
The next generation of protein AI will likely be defined not by a new architecture but by closing the loop between computation and experiment. The models are good enough to generate plausible designs; the remaining bottleneck is knowing which designs to test and learning from the results. The organizations that solve this — integrating AI design, automated experimentation, and continuous model improvement — will define the field’s next chapter.
Part of the series: The Technical Evolution of Protein AI — A Record of Key Design Decisions