SSL for Co-Folding Part 2: How Co-Folding Models Use Synthetic Data — An SSL Perspective
What Protein AI Can Learn from Semi-Supervised Learning
This is Part 2 of a 4-part series examining how semi-supervised learning techniques from computer vision can improve protein structure prediction models.
- Part 1: The Semi-Supervised Revolution in Computer Vision
- Part 2 (this post): How Co-Folding Models Use Synthetic Data — An SSL Perspective
- Part 3: Untapped Opportunities and the Confidence Calibration Trap
- Part 4: The Road Ahead — Data Flywheels, Foundation Models, and Open Questions
The Core Question
How do protein structure prediction models use synthetic data, and what does this look like through the lens of semi-supervised learning?
In Part 1, we surveyed the evolution of semi-supervised learning in computer vision — from pseudo-labeling (Lee, 2013) through consistency regularization (Tarvainen & Valpola, 2017, NeurIPS) to the FixMatch unification (Sohn et al., 2020, NeurIPS) and beyond. We now have a shared vocabulary: pseudo-labels, confidence thresholds, teacher-student frameworks, weak-to-strong augmentation, adaptive weighting.
This Part applies that vocabulary to protein structure prediction. Every major co-folding model — AlphaFold2, AlphaFold3, Boltz-1/2, SeedFold, OpenFold3 — relies on synthetic data generated by teacher models. The field calls this “distillation.” We will show that distillation is, in every meaningful sense, semi-supervised learning — and that recognizing this connection reveals both what the field has already adopted and what remains untapped.
We proceed model by model, analyzing each across five dimensions: Teacher, Student, Data, Filtering, and Loss. Along the way, we map every design choice to its SSL counterpart. We close with systematic comparison tables and a clear accounting of which SSL techniques the field has adopted, and which it has not.
1. The Data Pyramid
Protein AI training data spans four orders of magnitude, with an inverse relationship between scale and information density:
        ┌────────────┐
        │ Functional │  ~10K-50K
        │    Data    │  (Kd, Ki, IC50)
        ├────────────┤
        │    PDB     │  ~220K
        │ Structures │  (experimental 3D)
     ┌──┴────────────┴──┐
     │    Synthetic     │  ~10M-200M
     │    Structures    │  (AF2/AF3 predictions)
  ┌──┴──────────────────┴──┐
  │       Sequences        │  ~250M-2.5B
  │ (UniProt, BFD, MGnify) │
  └────────────────────────┘
The mapping to SSL terminology is immediate:
| Data Layer | Scale | SSL Term | Symbol |
|---|---|---|---|
| Sequences | ~250M-2.5B | Unlabeled data | D_U |
| Synthetic structures | ~10M-200M | Pseudo-labeled data | D̂_U |
| PDB experimental | ~220K | Labeled data | D_L |
| Functional data | ~10K-50K | Task-specific labels | D_task |
In computer vision, the canonical SSL problem is: 50K labeled images from CIFAR-10, millions of unlabeled images from the web. The ratio is roughly 1:100 or 1:1000.
In protein AI, we have ~220K experimental structures (labeled) and ~250M-2.5B sequences (unlabeled). The ratio is 1:1000 to 1:10,000. The synthetic structures — generated by running teacher models on unlabeled sequences — sit in between, playing exactly the role of pseudo-labels in SSL.
This pyramid is structurally identical to the SSL problem in computer vision. The only difference is the label type: instead of class probabilities over 10 categories, the “label” is a 3D coordinate set over thousands of atoms. Everything else — the data asymmetry, the teacher-student pipeline, the confidence filtering, the training schedule — maps directly.
2. Model-by-Model Analysis
We now analyze six models in chronological order. For each, we characterize the distillation strategy across five dimensions and provide the SSL interpretation.
Before diving in, two loss functions appear repeatedly and are worth defining once. The Frame Aligned Point Error (FAPE), used in AF2, measures structural accuracy via local reference frames. Given rigid-body frames T_i and atom positions x_j (true) and x̂_j (predicted):
\[\text{FAPE} = \frac{1}{N_{\text{frames}}} \frac{1}{N_{\text{atoms}}} \sum_{i} \sum_{j} \left\| T_i^{-1} \circ \hat{\mathbf{x}}_j - T_i^{-1} \circ \mathbf{x}_j \right\|_{\text{clamp}}\]
The EDM diffusion loss, used in AF3 and all subsequent models, trains a denoiser D_θ to recover clean coordinates x₀ from noised coordinates x_t:
\[\mathcal{L}_{\text{diffusion}} = \mathbb{E}_{t, \epsilon}\left[\lambda(t) \left\| D_\theta(\mathbf{x}_t, t) - \mathbf{x}_0 \right\|^2\right]\]
where λ(t) is a time-dependent weighting function. The noised coordinates are defined as x_t = α_t · x₀ + σ_t · ε with ε ~ N(0, I).
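In code, the denoising objective reduces to a few lines. The sketch below is a one-sample Monte Carlo estimate under the definitions above; `edm_diffusion_loss`, the `denoiser` callable, and the `lam` weighting function are illustrative stand-ins for D_θ and λ(t), not any model's actual API:

```python
import numpy as np

def edm_diffusion_loss(denoiser, x0, t, alpha_t, sigma_t, lam, rng):
    """One-sample Monte Carlo estimate of the EDM-style denoising loss.

    denoiser : callable (x_t, t) -> predicted clean coordinates (stand-in for D_theta)
    x0       : (N, 3) array of clean atom coordinates
    lam      : callable t -> scalar weight lambda(t)
    """
    eps = rng.standard_normal(x0.shape)      # epsilon ~ N(0, I)
    x_t = alpha_t * x0 + sigma_t * eps       # noised coordinates x_t = alpha_t*x0 + sigma_t*eps
    x0_hat = denoiser(x_t, t)                # denoiser's reconstruction of x0
    return lam(t) * float(np.mean(np.sum((x0_hat - x0) ** 2, axis=-1)))
```

A perfect denoiser drives the loss to zero regardless of noise level, which is the sanity check to run first on any implementation.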
2.1 AlphaFold2: The First Self-Distillation
Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature 2021
AlphaFold2 introduced self-distillation to protein structure prediction. In SSL terms, this is the most classical form of self-training: the model generates pseudo-labels for unlabeled data, then a fresh copy of the same architecture trains on the combination.
Teacher. An “undistilled” AF2 — an earlier checkpoint trained on PDB data only. Single model inference without the 5-model ensemble re-ranking used for CASP14 submissions. The paper states: “All CASP14 models are trained with distillation from a slightly earlier version of the model.”
Student. Identical architecture (Evoformer + IPA Structure Module). Initialized from scratch — no weight transfer from the teacher. This is pure data-level knowledge transfer.
Data. Starting from 6.3M sequences in Uniclust30 (v2018-08), the pipeline applied greedy deduplication, length filtering (200 < length ≤ 1024), and MSA depth filtering (≥ 200 sequences). The result: 355,993 synthetic structures. MSA for each distillation example was subsampled to 1,000 sequences.
Filtering. AF2’s confidence filtering is finer-grained than anything in CV SSL. Rather than accepting or rejecting entire samples, AF2 uses a per-residue KL-divergence metric:
\[c_i = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} D_{\text{KL}}\left(p_{\text{ref},|i-j|}(r) \;\|\; p_{i,j}(r)\right)\]
Here p_i,j(r) is the predicted pairwise distance distribution between residues i and j, and p_ref is the reference distribution computed from 1,000 random Uniclust30 sequences. The neighborhood N_i covers residues within ±128 positions of i. Notably, this is not pLDDT — it is a distance-distribution-based confidence score. Residues with c_i < 0.5 are masked from the loss, meaning the student never trains on low-confidence regions.
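A minimal sketch of this per-residue scheme, assuming binned distance distributions and using hypothetical names (`residue_confidence`, `confidence_mask`); the real pipeline operates on distogram logits, but the structure is the same:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) over distance bins, with a small epsilon for stability."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def residue_confidence(dist_probs, ref_probs, window=128):
    """Per-residue confidence c_i (sketch of AF2's scheme).

    dist_probs : (L, L, B) predicted pairwise distance distributions p_{i,j}
    ref_probs  : (L_max, B) reference distribution p_ref indexed by |i - j|
    A residue whose predicted distributions match the sequence-separation
    background (KL near zero) is treated as low confidence.
    """
    L = dist_probs.shape[0]
    c = np.zeros(L)
    for i in range(L):
        nbrs = [j for j in range(max(0, i - window), min(L, i + window + 1)) if j != i]
        c[i] = np.mean([kl_divergence(ref_probs[abs(i - j)], dist_probs[i, j])
                        for j in nbrs])
    return c

def confidence_mask(c, threshold=0.5):
    """Boolean mask: True where the residue contributes to the loss."""
    return c >= threshold
```

A prediction indistinguishable from the background distribution yields c_i = 0 everywhere, so every residue is masked out of the loss.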
Loss. The same loss as for experimental data: FAPE + distogram + masked MSA + auxiliary losses. The only difference is the per-residue confidence masking described above. FAPE clamping (10 angstrom) is applied identically to PDB and distillation data.
Training schedule. 75% distillation / 25% PDB sampling throughout training — both in the initial phase (~10M samples, 7 days) and fine-tuning (~1.5M samples, 4 days). Total: ~11 days on 128 TPUs.
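The fixed 75:25 mixing ratio amounts to source-first sampling: pick the source according to the ratio, then sample uniformly within it. A sketch with a hypothetical `make_mixed_sampler`:

```python
import random

def make_mixed_sampler(pdb_items, distill_items, distill_frac=0.75, seed=0):
    """Yield training examples at a fixed distillation:PDB mixing ratio.

    Each draw first picks the data source (distillation with probability
    distill_frac), then samples uniformly within that source.
    """
    rng = random.Random(seed)
    while True:
        pool = distill_items if rng.random() < distill_frac else pdb_items
        yield rng.choice(pool)
```

Over many draws, the empirical distillation fraction converges to `distill_frac`; the same mechanism, with different weights per dataset, underlies the sampling tables of the later models.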
SSL interpretation. This is classic self-training (Lee, 2013). The per-residue masking is conceptually similar to FixMatch’s confidence thresholding, but with two key differences: (1) it operates at residue granularity rather than sample granularity, and (2) it uses a distance-distribution metric rather than class probability. AF2’s ablation study found self-distillation to be “one of the most impactful components” — a strong empirical endorsement of pseudo-labeling for structural biology.
2.2 AlphaFold3: Multi-Teacher Cross-Distillation
Abramson et al., “Accurate structure prediction of biomolecular interactions with AlphaFold 3,” Nature 2024
AlphaFold3 represents a qualitative leap in distillation sophistication. It is the first model to use multiple purpose-specific teachers and to explicitly separate loss types by data source.
Teachers. Three different models, each serving a distinct purpose:
| Teacher Model | Purpose | Generated Data |
|---|---|---|
| AlphaFold2 | Protein monomers | ~41M structures from MGnify |
| AF-Multimer v2.3 | Disordered regions | ~14K-25K PDB proteins with missing residues |
| AlphaFold3 itself | RNA and TF-DNA | ~65K RNA + ~16K TF structures |
The AF-Multimer distillation deserves special attention. AF3’s diffusion module tends to hallucinate compact structures for intrinsically disordered regions — they look physically plausible but are wrong. AF-Multimer’s IPA module, by contrast, predicts extended ribbon-like conformations for these regions. This “correct expression of uncertainty” is distilled into AF3 to suppress hallucination. In SSL terms, this is using a less powerful but better-calibrated teacher for specific failure modes.
Student. Fundamentally different architecture from the primary teacher (AF2):
| Dimension | AF2 (Teacher) | AF3 (Student) |
|---|---|---|
| Trunk | Evoformer | Pairformer |
| Structure module | IPA (SE(3)-equivariant) | EDM Diffusion (non-equivariant) |
| Tokenization | Residue-level | Atom-level |
| MSA processing | Row-wise + column-wise attention | Removed column-wise attention |
This is genuine cross-architecture distillation — knowledge transfer between fundamentally different model families. In CV, the closest analogue is distilling from a large ViT teacher to a smaller ConvNet student, but the architectural gap here is much wider.
Data. The training mixture is carefully weighted:
| Dataset | Size | Sampling Weight |
|---|---|---|
| Weighted PDB | ~150K | 0.50 |
| Protein monomer (long, >200 res) | ~13M | 0.495 |
| Protein monomer (short, 4-200 res) | ~28M | 0.005 |
| Disordered PDB | ~14K-25K | 0.02 |
| RNA (Rfam v14.9) | ~65K | 0.05 |
| TF positive (JASPAR + SELEX) | 16,439 | 0.021 |
| TF negative (random pairs) | N/A | 0.011 |
Note the extreme downweighting of short monomers (0.005 for 28M sequences vs. 0.495 for 13M longer ones). This is implicit quality control through sampling — short proteins are easier to predict but less informative for training.
Filtering. AF3 made a surprising decision: it removed the pLDDT ≥ 0.8 threshold that AF2 had applied to monomer distillation. Instead of filtering at the data level, AF3 manages quality through loss differentiation and sampling weights. For RNA, a stricter filter applies: average PDE (Predicted Distance Error) < 2 angstrom.
Loss. This is the most critical design decision — and the one with the strongest SSL implications:
\[\mathcal{L} = \underbrace{\alpha_{\text{diff}} \mathcal{L}_{\text{diffusion}} + \alpha_{\text{dist}} \mathcal{L}_{\text{distogram}}}_{\text{applied to all data}} + \underbrace{\alpha_{\text{conf}} \left(\mathcal{L}_{\text{pLDDT}} + \mathcal{L}_{\text{PDE}} + \alpha_{\text{PAE}} \mathcal{L}_{\text{PAE}}\right)}_{\text{applied to PDB only}}\]
with coefficients α_diff = 4, α_dist = 3 × 10⁻², and α_conf = 10⁻⁴.
The structural losses (diffusion denoising, distogram prediction) are applied to all data — PDB and synthetic alike. But the confidence losses (pLDDT, PDE, PAE) are applied exclusively to PDB experimental data. The reasoning: structural pseudo-labels from a good teacher are approximately correct and useful for training; but confidence pseudo-labels — which require knowing the true error — would propagate the teacher’s calibration errors into the student. We will return to this asymmetry in Part 3.
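The data-dependent split can be written as a single function. The sketch below uses flattened loss dictionaries and coefficients from the equation above; the function name and the `a_pae` default are illustrative, not AF3's actual interface:

```python
def af3_style_loss(struct_losses, conf_losses, is_pdb,
                   a_diff=4.0, a_dist=3e-2, a_conf=1e-4, a_pae=1.0):
    """AF3-style total loss for one sample, a simplified sketch.

    struct_losses : dict with 'diffusion' and 'distogram' scalars (all data)
    conf_losses   : dict with 'plddt', 'pde', 'pae' scalars (PDB only)
    is_pdb        : True iff the sample carries an experimental structure
    """
    total = a_diff * struct_losses["diffusion"] + a_dist * struct_losses["distogram"]
    if is_pdb:  # confidence heads are supervised on experimental data only
        total += a_conf * (conf_losses["plddt"] + conf_losses["pde"]
                           + a_pae * conf_losses["pae"])
    return total
```

The same structural batch contributes identically from either source; only the confidence term is gated on `is_pdb`.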
Training schedule. Four stages spanning ~20 days on 256 A100 GPUs:
| Stage | Crop Size | Distillation | Key Change |
|---|---|---|---|
| Initial | 384 | Monomer + disorder + RNA | Baseline training |
| Fine-tune 1 | 640 | Same + disorder unmasking | Unmask non-protein chains |
| Fine-tune 2 | 768 | + TF distillation | Add protein-DNA data |
| Fine-tune 3 | 768 | Same | Add PAE head, remove structure loss |
SSL interpretation. AF3’s multi-teacher strategy has no direct precedent in CV SSL, where single-teacher self-training dominates. The loss-type separation — structural loss on all data, confidence loss on labeled only — goes beyond the CV SSL playbook entirely. This is a protein-domain-specific insight: structural predictions can tolerate noise in ways that confidence estimates cannot.
2.3 Boltz-1 to Boltz-2: From Simple to Multi-Source
Boltz-1
Wohlwend et al., “Boltz-1: Democratizing Biomolecular Interaction Modeling,” bioRxiv 2024
Boltz-1 adopted the simplest distillation strategy among all models.
Teacher. AlphaFold2 via OpenProteinSet (pre-computed predictions with MSA features).
Student. Pairformer (48 blocks) + EDM diffusion — architecturally similar to AF3.
Data. ~270K structures from OpenProteinSet. Standard PDB set for experimental data.
Training schedule. Two stages with a distinctive transition:
| Stage | Steps | Data |
|---|---|---|
| Initial | 53K | PDB + distillation (50:50) |
| Fine-tuning | 15K | PDB only |
Boltz-1 is the only model that removes distillation data at the end of training. The final 15K steps use PDB exclusively — a strategy that stands in direct contrast to SeedFold’s finding that continuous distillation is necessary.
Loss. EDM denoising loss with per-atom-type weights (protein 1.0, DNA/RNA 5.0, ligand 10.0). No reported differentiation between distilled and experimental data.
Boltz-2
Passaro, Wohlwend et al., “Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction,” bioRxiv 2025
Boltz-2 dramatically expanded the distillation strategy, introducing the most explicit multi-source data mixing in the field.
Teachers. Two models serving complementary roles:
| Teacher | Purpose |
|---|---|
| AlphaFold2 (AFDB) | Protein monomers |
| Boltz-1 | Protein-ligand, protein-DNA, RNA, MHC-peptide complexes |
The use of their own previous model (Boltz-1) as the complex teacher is a form of iterative self-training across model generations — the student of the last round becomes the teacher for a new task in the next round.
Student. Expanded Pairformer (64 blocks, up from 48), with trifast triangle attention and BF16 mixed precision.
Data and sampling.
| Source | Content | Sampling Weight |
|---|---|---|
| PDB experimental | Ground truth structures | 60% |
| AFDB monomer (AF2) | ~5M proteins | 30% |
| Boltz-1 complex distillation | Protein-ligand, DNA, RNA, MHC | 10% |
| MD data (MISATO, ATLAS, md-CATH) | Experimental MD ensembles | (included in PDB share) |
Filtering. Boltz-2 applies different thresholds to different data types:
- AFDB monomers: global lDDT ≥ 0.5 — a notably low threshold compared to SeedFold’s pLDDT ≥ 0.8
- Boltz-1 complexes: iPDE ≤ 1.0, PDE ≤ 1.0, ipTM ≥ 0.85 — strict for interface quality
This asymmetry is interesting: Boltz-2 accepts low-confidence monomers (more data, more noise) but demands high-confidence complex predictions (less data, less noise). The implicit assumption is that monomer errors are tolerable but interface errors are not.
Loss. Same loss for all data sources — quality is managed entirely through sampling ratios, not through loss differentiation. No confidence loss separation as in AF3.
SSL interpretation. Fixed thresholds plus data-agnostic loss. Multi-teacher but without loss separation. In CV terms, this is closer to standard pseudo-labeling with hard thresholding than to the more nuanced FixMatch family. The lDDT ≥ 0.5 threshold is equivalent to setting FixMatch’s $\tau$ very low — including many uncertain predictions in training.
2.4 SeedFold: The Largest Distillation
SeedFold team, “SeedFold: Scaling Biomolecular Structure Prediction,” arXiv 2025
SeedFold holds the record for the largest distillation dataset: 26.5M structures, a 147× expansion over PDB alone.
Teacher. OpenFold running AlphaFold2 weights. The team chose AF2 over AF3 as the teacher — prioritizing inference speed and the well-established reliability of AF2 monomer predictions over AF3’s broader but newer capabilities.
Student. Modified Pairformer with linear triangle attention (sub-cubic complexity) and wider pair representations.
Data.
| Source | Samples | Sampling Weight |
|---|---|---|
| PDB experimental | 180K | 0.50 |
| AFDB (short, <200 residues) | 3.3M | 0.08 |
| MGnify (longer, median 435 residues) | 23M | 0.42 |
| Total distillation | 26.5M | 0.50 |
The MGnify component is distinctive. These 23M sequences come from metagenomic sources — uncultured organisms from soil, ocean, and gut microbiomes. Only 2M of the 23M sequences map to existing AFDB clusters, meaning the vast majority represent novel structural diversity absent from the PDB and conventional protein databases. This is the protein AI equivalent of mining the web for unlabeled images.
Filtering. pLDDT ≥ 0.8 for AFDB structures, 30-50% sequence identity clustering to ensure structural diversity.
Loss. Same loss for distilled and experimental data — no differentiation. Quality managed by 50:50 sampling weight and additional per-cluster, per-molecule-type weighting.
Critical ablation. SeedFold provides the strongest evidence in the field for the necessity of continuous distillation. When distillation data was removed at training step 47,612, intra-protein structure prediction accuracy degraded immediately. The authors offer a compelling explanation:
The architectural transition from AF2’s IPA (Invariant Point Attention) to AF3-style diffusion transformers removed strong geometric inductive biases. IPA is SE(3)-equivariant by construction — it “knows” about 3D geometry through its design. The diffusion transformer has no such built-in geometric knowledge and must learn spatial relationships entirely from data. With only 180K PDB structures, there is not enough data to learn these relationships. The 26.5M distillation set compensates for the lost inductive bias with data volume. In equation form:
\[\underbrace{\text{IPA (strong geometric prior)}}_{\text{needs less data}} \quad \longrightarrow \quad \underbrace{\text{Diffusion Transformer (weak prior)}}_{\text{needs } \gg \text{ data}}\]
This is an architectural argument for why distillation is even more critical for post-AF2 models than it was for AF2 itself.
SSL interpretation. SeedFold is the most ambitious in scale but the most basic in SSL technique. No adaptive thresholding, no loss differentiation, no weak-to-strong augmentation — just massive pseudo-labeling with a fixed confidence threshold and uniform loss. In the SSL taxonomy, this is Era 1 technique applied at Era 4 scale.
2.5 OpenFold3: 97% Synthetic
OpenFold Consortium, “OpenFold3,” 2025
OpenFold3 is the open-source reproduction of AF3’s training protocol, and the model most dominated by synthetic data in its training mix.
Teachers. AlphaFold2 (monomer distillation) and AlphaFold3 (RNA distillation) — cross-generation distillation using two different teacher models.
Student. AF3 architecture reproduction: Pairformer trunk + EDM diffusion. Released under Apache 2.0 with full training code and data.
Data.
| Source | Size | Share of Training |
|---|---|---|
| Monomer distillation (AF2, from MGnify) | ~13M | ~96% |
| RNA distillation (AF3, from Rfam v15.1) | ~125K | ~1% |
| PDB experimental | ~300K | ~3% |
97% of OpenFold3’s training data is synthetic. This is the most extreme pseudo-label-to-labeled ratio in the field — far exceeding what is typical in CV SSL, where labeled data usually constitutes at least 10-20% of the training mix.
Filtering. Monomer: MGnify cluster size ≥ 10 (statistical sufficiency, not confidence-based). RNA: Rfam cluster representatives, AF3 predictions with average PDE < 2.
Loss. Follows AF3’s protocol: confidence losses (pLDDT, PDE, PAE) applied to PDB only. This is the key design decision that AF3 pioneered and OpenFold3 inherited.
Training schedule. 155,000 total steps across three stages (Initial: 131,500; Fine-tune 1: 8,000; Fine-tune 2: 15,500). Unlike AF3, OpenFold3 omits the third fine-tuning stage and trains the PAE head from the start.
SSL interpretation. OpenFold3 demonstrates that with a sufficiently good teacher, a model can train overwhelmingly on pseudo-labels (97%) and still achieve competitive accuracy. It is the only open-source model matching AF3-level RNA performance — a direct consequence of using AF3 itself as the RNA teacher. This is the protein AI equivalent of Noisy Student (Xie et al., 2020, CVPR): if the teacher is strong enough, the student can learn primarily from pseudo-labels, with a small anchor of labeled data to prevent drift.
3. Systematic Comparison
Having analyzed each model individually, we now compare them across multiple dimensions.
3.1 Master Table: Full Model Comparison
| Model | Teacher(s) | Student Arch | Distill Scale | PDB:Distill Ratio | Confidence Filter | Loss Differentiation | Continuous Distill |
|---|---|---|---|---|---|---|---|
| AF2 | Undistilled AF2 | Same (Evoformer+IPA) | 356K | 25:75 | KL-div c_i < 0.5 | Residue masking only | Yes |
| AF3 | AF2 + AFM v2.3 + AF3 | Different (Pairformer+Diffusion) | ~41M+ | ~50:50 | None (monomer), PDE<2 (RNA) | Conf. loss PDB only | Yes |
| Boltz-1 | AF2 (OpenFold) | Pairformer+Diffusion | 270K | 50:50 → PDB only | OpenFold defaults | None | No (removed last 15K) |
| Boltz-2 | AF2 + Boltz-1 | Pairformer+Diffusion | ~5M+ | 60:40 | lDDT≥0.5, ipTM≥0.85 | None | Yes |
| SeedFold | AF2 (OpenFold) | Pairformer+Diffusion | 26.5M | 50:50 | pLDDT≥0.8 | None | Yes (ablation proven) |
| OpenFold3 | AF2 + AF3 | Pairformer+Diffusion | ~13M | 3:97 | Cluster≥10, PDE<2 | Conf. loss PDB only | Yes |
3.2 Confidence Filtering Comparison
| Model | Metric | Threshold | Granularity | Notes |
|---|---|---|---|---|
| AF2 | KL-divergence c_i | 0.5 | Residue-level | Masks individual residues in loss |
| AF3 (monomer) | None | Removed | N/A | Dropped AF2’s pLDDT ≥ 0.8 filter |
| AF3 (RNA) | PDE | < 2 | Sample-level | Applied only to RNA distillation |
| Boltz-2 (monomer) | lDDT | ≥ 0.5 | Sample-level | Notably low threshold |
| Boltz-2 (complex) | ipTM | ≥ 0.85 | Sample-level | Strict for interfaces |
| SeedFold | pLDDT | ≥ 0.8 | Sample-level | Standard threshold |
| OpenFold3 (monomer) | Cluster size | ≥ 10 | Cluster-level | Statistical, not confidence-based |
| OpenFold3 (RNA) | PDE | < 2 | Sample-level | Follows AF3 protocol |
Key observation: Every model uses fixed thresholds with hard filtering. No model employs adaptive thresholding (FlexMatch), soft weighting (SoftMatch), or learned quality predictors (SemiReward). The most sophisticated filtering is AF2’s per-residue masking — ironically the oldest approach in the field.
3.3 Loss Differentiation Comparison
| Strategy | Models | Description |
|---|---|---|
| Same loss, no differentiation | Boltz-1, Boltz-2, SeedFold | Identical loss for PDB and synthetic data |
| Residue-level masking | AF2 | Low-confidence residues excluded from loss |
| Loss-type separation | AF3, OpenFold3 | Structure loss on all data; confidence loss on PDB only |
| Stage-selective masking | AF3 | Disorder distillation: non-protein chains masked initially, unmasked in fine-tune |
The split between “same loss” and “loss-type separation” camps is stark. AF3 and OpenFold3 recognize that structural pseudo-labels and confidence pseudo-labels have fundamentally different noise tolerances. The rest of the field treats them identically.
3.4 Teacher Strategy Evolution
The field’s approach to choosing teachers has evolved significantly:
AF2 (2021): Single self-teacher
│ Same architecture, earlier checkpoint
│
▼
AF3 (2024): Multi-teacher (AF2 + AFM v2.3 + AF3-self)
│ Purpose-specific teacher selection
│ Cross-architecture distillation
│
▼
Boltz-2 (2025): Multi-teacher (AF2 + Boltz-1)
│ Own prior model as complex teacher
│ Multi-modality coverage
│
▼
OpenFold3 (2025): Multi-teacher (AF2 + AF3)
Cross-generation distillation
Open-source reproduction
3.5 Distillation Scale Timeline
The exponential growth in synthetic data is one of the most striking trends:
Distillation Scale (structures)
AF2 (2021) ████ 356K
│
Boltz-1 (2024) ████ 270K
│
Boltz-2 (2025) ████████████████████████████ ~5M
│
OpenFold3 (2025) ████████████████████████████████████████████████████████████████ ~13M
│
SeedFold (2025) ████████████████████████████████████████████████████████████████████████████████████████████████████████ 26.5M
│
AF3 (2024) ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ ~41M+
From 356K to 41M+ in four years — a 115× increase. The PDB, meanwhile, grew from ~180K to ~220K in the same period. Synthetic data is not just supplementing PDB; it is dominating training by two orders of magnitude.
3.6 Distillation Continuity Comparison
| Strategy | Models | Rationale |
|---|---|---|
| Throughout training | AF2, AF3, Boltz-2, SeedFold, OpenFold3 | Majority approach |
| Removed at end | Boltz-1 (last 15K steps PDB only) | Final calibration on experimental data |
| Staged introduction | AF3 (TF distillation added in Fine-tune 2) | Progressive complexity |
SeedFold’s ablation provides the most direct evidence: removing distillation mid-training degrades accuracy immediately. This suggests that for modern diffusion-based architectures, distillation is not a pre-training trick but a continuous necessity. Boltz-1’s removal of distillation at the end may have been feasible only because its distillation set was small (270K) — a regime where PDB alone may suffice for final fine-tuning.
4. What the Field Has Already Adopted from SSL
Having mapped each model’s distillation strategy to SSL terminology, we can now systematically assess which SSL techniques the field has adopted — even if it did not use SSL vocabulary to describe them.
4.1 Adopted Techniques
| SSL Technique | CV Reference | Co-Folding Implementation | Adoption Level |
|---|---|---|---|
| Pseudo-labeling | Lee, 2013 | Self-distillation (all models) | Full |
| Teacher-student framework | Hinton et al., 2015 | Teacher generates structures, student trains on them | Full |
| Fixed confidence threshold | FixMatch (Sohn et al., 2020) | pLDDT/ipTM/KL-div thresholds | Full |
| Cross-architecture distillation | Knowledge distillation | AF2 → AF3, AF2 → Boltz | Full |
| Multi-source data mixing | — | AF3, Boltz-2 multi-teacher | Partial |
| Loss-type separation | — | AF3 confidence loss on PDB only | Partial (AF3/OF3 only) |
| Iterative self-training (1 round) | Noisy Student (Xie et al., 2020) | Student becomes next generation’s teacher | Implicit |
4.2 The SSL-to-Protein Mapping
The mapping is precise enough to write in equation form. In CV SSL, the standard self-training loss is:
\[\mathcal{L} = \frac{1}{|\mathcal{D}_L|} \sum_{(x,y) \in \mathcal{D}_L} \ell(f_\theta(x), y) \;+\; \lambda \frac{1}{|\hat{\mathcal{D}}_U|} \sum_{(x,\hat{y}) \in \hat{\mathcal{D}}_U} \mathbb{1}[\text{conf}(\hat{y}) \geq \tau] \cdot \ell(f_\theta(x), \hat{y})\]
In protein AI, the corresponding formulation is:
\[\mathcal{L} = \underbrace{\sum_{s \in \mathcal{D}_{\text{PDB}}} w_s \cdot \mathcal{L}_{\text{struct+conf}}(s)}_{\text{labeled (PDB)}} \;+\; \underbrace{\sum_{s \in \hat{\mathcal{D}}_{\text{synth}}} w_s \cdot \mathcal{L}_{\text{struct}}(s)}_{\text{pseudo-labeled (synthetic)}}\]
Here w_s encodes the sampling weight. The term L_struct includes diffusion and distogram losses, while L_conf includes pLDDT, PDE, and PAE losses. The confidence filtering enters through data preprocessing (hard threshold on the synthetic dataset) rather than through the loss itself.
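The generic CV-side objective translates into a few lines of code. In this sketch, `self_training_loss` and the `(x, y_hat, conf)` triple format are illustrative; the hard gate corresponds to the indicator function above:

```python
def self_training_loss(model_loss, labeled, pseudo, tau=0.8, lam=1.0):
    """Generic self-training objective: supervised loss + thresholded pseudo-label loss.

    model_loss : callable (x, y) -> scalar loss
    labeled    : list of (x, y) pairs with ground-truth labels
    pseudo     : list of (x, y_hat, conf) teacher-generated triples
    """
    l_sup = sum(model_loss(x, y) for x, y in labeled) / max(len(labeled), 1)
    # Hard FixMatch-style gate: only confident pseudo-labels contribute.
    kept = [(x, y) for x, y, c in pseudo if c >= tau]
    l_pseudo = sum(model_loss(x, y) for x, y in kept) / max(len(kept), 1)
    return l_sup + lam * l_pseudo
```

The protein-AI variant replaces the indicator gate with offline filtering of the synthetic dataset and the λ weight with per-source sampling ratios, but the two terms are the same.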
4.3 The Adoption Timeline
┌──────────────────────────────────────────────────────────────────────┐
│ SSL Technique Adoption in Co-Folding │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ CV SSL Era 1 (2013-2016) Co-Folding (2021-2025) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Pseudo-labeling │ ────────► │ Self-distillation │ ✅ │
│ │ (Lee, 2013) │ │ (AF2, 2021) │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ CV SSL Era 2 (2017-2019) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Mean Teacher (EMA) │ │ Fixed offline │ │
│ │ (Tarvainen, 2017) │ ──── ✗ ──│ teacher only │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Knowledge distill. │ ────────► │ AF2 → AF3 │ ✅ │
│ │ (Hinton, 2015) │ │ (cross-arch) │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ CV SSL Era 3 (2019-2021) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ FixMatch threshold │ ────────► │ pLDDT/ipTM filtering│ ✅ │
│ │ (Sohn, 2020) │ │ (all models) │ (fixed) │
│ └─────────────────────┘ └─────────────────────┘ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Weak→Strong aug. │ │ Same augmentation │ │
│ │ (FixMatch) │ ──── ✗ ──│ for teacher/student │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ CV SSL Era 4 (2021-2025) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Adaptive threshold │ │ Fixed threshold │ │
│ │ (FlexMatch, 2021) │ ──── ✗ ──│ only │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Soft weighting │ │ Hard binary │ │
│ │ (SoftMatch, 2023) │ ──── ✗ ──│ filtering only │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Learned quality │ │ Self-assessed │ │
│ │ (SemiReward, 2024) │ ──── ✗ ──│ pLDDT only │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
The pattern is clear: co-folding models have adopted the techniques from CV SSL’s Era 1 (2013-2016) and selectively from Era 2-3. The advances of Era 4 (2021-2025) — adaptive thresholding, soft weighting, learned quality predictors — remain entirely untapped.
5. Closing: What’s Missing?
The analysis above reveals a striking asymmetry. On one hand, every major co-folding model has independently converged on pseudo-labeling, teacher-student distillation, and fixed confidence thresholds — the core of 2013-2017 era SSL. On the other hand, the techniques that defined the 2019-2025 SSL revolution remain absent.
The gaps fall into five categories:
Adaptive thresholding. Every model uses a fixed confidence threshold (pLDDT ≥ 0.8, lDDT ≥ 0.5, etc.). FlexMatch (Zhang et al., 2021, NeurIPS) and FreeMatch (Wang et al., 2023, ICLR) showed that adaptive, class-specific thresholds dramatically improve SSL in CV. In protein terms, this would mean different thresholds for alpha-helical domains vs. disordered loops vs. protein-protein interfaces — regions where teacher confidence is calibrated very differently.
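A sketch of what FlexMatch-style curriculum thresholds might look like if ported to structural regions; the region granularity and the scaling rule (threshold proportional to each region's acceptance count, as in FlexMatch's class-wise learning status) are assumptions for illustration:

```python
def adaptive_thresholds(base_tau, pass_counts):
    """FlexMatch-style curriculum thresholds, sketched over structural regions.

    pass_counts : dict region -> number of pseudo-labels already accepted.
    Regions that pass less often (harder, e.g. disordered loops) get a
    lower threshold so they are not starved of training signal.
    """
    max_count = max(pass_counts.values()) or 1
    return {region: base_tau * (n / max_count) for region, n in pass_counts.items()}
```

With a base threshold of 0.8, a region accepted a quarter as often as the easiest one would be gated at 0.2 instead of 0.8, admitting more of its pseudo-labels early in training.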
Soft weighting. All models apply binary hard filtering: a prediction is either in or out based on whether it exceeds the threshold. SoftMatch (Chen et al., 2023, ICLR) replaced this binary gate with a continuous Gaussian weight. In protein terms, a prediction with pLDDT 0.79 is currently discarded entirely (if the threshold is 0.8), while a prediction with pLDDT 0.81 gets full weight. A soft weighting scheme would make this boundary continuous.
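A SoftMatch-flavored weight, sketched with a fixed mean and width rather than the distribution-estimated parameters of the original method; `soft_weight` is a hypothetical name, with confidence on a pLDDT-like 0-1 scale:

```python
import math

def soft_weight(conf, mu=0.8, sigma=0.1):
    """Continuous sample weight in place of a hard threshold (SoftMatch-style).

    Full weight at or above the mean mu; Gaussian falloff below it, so a
    conf = 0.79 prediction is down-weighted slightly instead of discarded.
    """
    if conf >= mu:
        return 1.0
    return math.exp(-((conf - mu) ** 2) / (2 * sigma ** 2))
```

Under this scheme the 0.79/0.81 boundary from the text disappears: both predictions receive essentially full weight, while a 0.50 prediction is suppressed by roughly two orders of magnitude rather than by fiat.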
Weak-to-strong augmentation. FixMatch’s core insight — generate pseudo-labels with weak augmentation, train the student with strong augmentation — is absent from every model. Protein AI has rich augmentation axes: MSA depth (full vs. subsampled), template availability (with vs. without), sequence cropping (full vs. random crop), coordinate noise. None of these are used asymmetrically between teacher and student.
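One way the asymmetry might look for the MSA-depth axis alone; `weak_augment` and `strong_augment` are hypothetical, and the other axes named above (template dropout, cropping, coordinate noise) would compose the same way:

```python
import random

def weak_augment(msa, rng):
    """Weak view: the full MSA, as the teacher would see it when labeling."""
    return list(msa)

def strong_augment(msa, rng, max_depth=32):
    """Strong view (sketch): aggressive MSA subsampling for the student.

    The teacher pseudo-labels the weak view; the student must reproduce
    that label from the impoverished strong view.
    """
    depth = min(max_depth, len(msa))
    return rng.sample(msa, depth)
```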
Online teachers. Every model uses offline distillation: the teacher generates a fixed set of pseudo-labels before training begins. The Mean Teacher approach (Tarvainen & Valpola, 2017, NeurIPS) — where the teacher is an exponential moving average of the student, updated continuously — would allow pseudo-labels to improve as the student improves.
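The EMA teacher update itself is one line per parameter; a sketch over flat lists of floats (in practice this runs over every weight tensor after each optimizer step):

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Mean Teacher update: teacher <- decay * teacher + (1 - decay) * student.

    The teacher trails the student as a smoothed average, so its
    pseudo-labels improve continuously as the student improves.
    """
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]
```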
Confidence-weighted loss. Beyond AF3’s binary separation (confidence loss on PDB only vs. not), no model uses confidence as a continuous weight in the loss function. A natural extension: weight each residue’s structural loss by the teacher’s confidence, so that high-confidence regions contribute more to gradient updates.
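Such an extension could look like the following sketch, replacing AF2's binary residue mask with continuous weights; `confidence_weighted_loss` is a hypothetical name:

```python
def confidence_weighted_loss(per_residue_loss, per_residue_conf):
    """Weight each residue's structural loss by teacher confidence (a sketch).

    High-confidence regions contribute more to the gradient; low-confidence
    regions contribute less, but are not zeroed out entirely.
    """
    num = sum(l * c for l, c in zip(per_residue_loss, per_residue_conf))
    den = sum(per_residue_conf) or 1.0
    return num / den
```

With weights of 1.0 and 0.0 this reduces exactly to AF2's hard mask; intermediate weights interpolate between inclusion and exclusion.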
Perhaps most critically, the field has not examined whether training confidence estimators on synthetic data — rather than experimental data alone — introduces systematic calibration errors. AF3’s decision to restrict confidence loss to PDB suggests the DeepMind team recognized this risk. But most other models have not followed suit.
What has been adopted corresponds to 2013-2017 era SSL. The 2019-2025 advances — the techniques that pushed CV SSL from “useful trick” to “standard practice” — remain untapped. This is the gap we will explore in detail in Part 3.
Next: Part 3 — Untapped Opportunities and the Confidence Calibration Trap
We will systematically propose how each unadopted SSL technique could be adapted for protein structure prediction, with concrete formulations and expected impact. The centerpiece will be the confidence calibration problem: why training confidence estimators on synthetic data may be more dangerous than training structure predictors on the same data — and what to do about it.
Part of the series: What Protein AI Can Learn from Semi-Supervised Learning
References
- Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature 2021
- Abramson et al., “Accurate structure prediction of biomolecular interactions with AlphaFold 3,” Nature 2024
- Wohlwend et al., “Boltz-1: Democratizing Biomolecular Interaction Modeling,” bioRxiv 2024
- Passaro, Wohlwend et al., “Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction,” bioRxiv 2025
- SeedFold team, “SeedFold: Scaling Biomolecular Structure Prediction,” arXiv 2025
- OpenFold Consortium, “OpenFold3,” 2025
- Lee, “Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks,” ICML Workshop 2013
- Tarvainen & Valpola, “Mean teachers are better role models,” NeurIPS 2017
- Sohn et al., “FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence,” NeurIPS 2020
- Xie et al., “Self-training with Noisy Student improves ImageNet classification,” CVPR 2020
- Zhang et al., “FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling,” NeurIPS 2021
- Wang et al., “FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning,” ICLR 2023
- Chen et al., “SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning,” ICLR 2023
- Wang et al., “SemiReward: A General Reward Model for Semi-supervised Learning,” ICLR 2024
- Hinton et al., “Distilling the Knowledge in a Neural Network,” NeurIPS Workshop 2015