SSL for Co-Folding Part 2: How Co-Folding Models Use Synthetic Data — An SSL Perspective
What Protein AI Can Learn from Semi-Supervised Learning
This is Part 2 of a 4-part series examining how semi-supervised learning techniques from computer vision can improve protein structure prediction models.
- Part 1: The Semi-Supervised Revolution in Computer Vision
- Part 2 (this post): How Co-Folding Models Use Synthetic Data — An SSL Perspective
- Part 3: Untapped Opportunities and the Confidence Calibration Trap
- Part 4: The Road Ahead — Data Flywheels, Foundation Models, and Open Questions
The Core Question
How do protein structure prediction models use synthetic data, and what does this look like through the lens of semi-supervised learning?
In Part 1, we surveyed the evolution of semi-supervised learning in computer vision — from pseudo-labeling (Lee, 2013) through consistency regularization (Tarvainen & Valpola, 2017, NeurIPS) to the FixMatch unification (Sohn et al., 2020, NeurIPS) and beyond. We now have a shared vocabulary: pseudo-labels, confidence thresholds, teacher-student frameworks, weak-to-strong augmentation, adaptive weighting.
This Part applies that vocabulary to protein structure prediction. Every major co-folding model — AlphaFold2, AlphaFold3, Boltz-1/2, SeedFold, OpenFold3 — relies on synthetic data generated by teacher models. The field calls this “distillation.” We will show that distillation is, in every meaningful sense, semi-supervised learning — and that recognizing this connection reveals both what the field has already adopted and what remains untapped.
We proceed model by model, analyzing each across five dimensions: Teacher, Student, Data, Filtering, and Loss. Along the way, we map every design choice to its SSL counterpart. We close with systematic comparison tables and a clear accounting of which SSL techniques the field has adopted, and which it has not.
1. The Data Pyramid
Protein AI training data spans four orders of magnitude, with an inverse relationship between scale and information density:
        ┌────────────┐
        │ Functional │  ~10K-50K
        │    Data    │  (Kd, Ki, IC50)
        ├────────────┤
        │    PDB     │  ~220K
        │ Structures │  (experimental 3D)
     ┌──┴────────────┴──┐
     │    Synthetic     │  ~10M-200M
     │    Structures    │  (AF2/AF3 predictions)
  ┌──┴──────────────────┴──┐
  │       Sequences        │  ~250M-2.5B
  │ (UniProt, BFD, MGnify) │
  └────────────────────────┘
The mapping to SSL terminology is immediate:
| Data Layer | Scale | SSL Term | Symbol |
|---|---|---|---|
| Sequences | ~250M-2.5B | Unlabeled data | D_U |
| Synthetic structures | ~10M-200M | Pseudo-labeled data | D̂_U |
| PDB experimental | ~220K | Labeled data | D_L |
| Functional data | ~10K-50K | Task-specific labels | D_task |
In computer vision, the canonical SSL problem is: 50K labeled images from CIFAR-10, millions of unlabeled images from the web. The ratio is roughly 1:100 or 1:1000.
In protein AI, we have ~220K experimental structures (labeled) and ~250M-2.5B sequences (unlabeled). The ratio is 1:1000 to 1:10,000. The synthetic structures — generated by running teacher models on unlabeled sequences — sit in between, playing exactly the role of pseudo-labels in SSL.
This pyramid is structurally identical to the SSL problem in computer vision. The only difference is the label type: instead of class probabilities over 10 categories, the “label” is a 3D coordinate set over thousands of atoms. Everything else — the data asymmetry, the teacher-student pipeline, the confidence filtering, the training schedule — maps directly.
2. Model-by-Model Analysis
We now analyze six models in chronological order. For each, we characterize the distillation strategy across five dimensions and provide the SSL interpretation.
Before diving in, two loss functions appear repeatedly and are worth defining once. The Frame Aligned Point Error (FAPE), used in AF2, measures structural accuracy via local reference frames. Given rigid-body frames T_i and atom positions x_j (true) and x̂_j (predicted):
\[\text{FAPE} = \frac{1}{N_{\text{frames}}} \frac{1}{N_{\text{atoms}}} \sum_{i} \sum_{j} \left\| T_i^{-1} \circ \hat{\mathbf{x}}_j - T_i^{-1} \circ \mathbf{x}_j \right\|_{\text{clamp}}\]
The EDM diffusion loss, used in AF3 and all subsequent models, trains a denoiser D_θ to recover clean coordinates x₀ from noised coordinates x_t:
\[\mathcal{L}_{\text{diffusion}} = \mathbb{E}_{t, \epsilon}\left[\lambda(t) \left\| D_\theta(\mathbf{x}_t, t) - \mathbf{x}_0 \right\|^2\right]\]
where λ(t) is a time-dependent weighting function. The noised coordinates are defined as x_t = α_t · x₀ + σ_t · ε with ε ~ N(0, I).
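In code, the denoising objective reduces to a few lines. The sketch below is a one-sample Monte Carlo estimate under the definitions above; `edm_diffusion_loss`, the `denoiser` callable, and the `lam` weighting function are illustrative stand-ins for D_θ and λ(t), not any model's actual API:

```python
import numpy as np

def edm_diffusion_loss(denoiser, x0, t, alpha_t, sigma_t, lam, rng):
    """One-sample Monte Carlo estimate of the EDM-style denoising loss.

    denoiser : callable (x_t, t) -> predicted clean coordinates (stand-in for D_theta)
    x0       : (N, 3) array of clean atom coordinates
    lam      : callable t -> scalar weight lambda(t)
    """
    eps = rng.standard_normal(x0.shape)      # epsilon ~ N(0, I)
    x_t = alpha_t * x0 + sigma_t * eps       # noised coordinates x_t = alpha_t*x0 + sigma_t*eps
    x0_hat = denoiser(x_t, t)                # denoiser's reconstruction of x0
    return lam(t) * float(np.mean(np.sum((x0_hat - x0) ** 2, axis=-1)))
```

A perfect denoiser drives the loss to zero regardless of noise level, which is the sanity check to run first on any implementation.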
2.1 AlphaFold2: The First Self-Distillation
Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature 2021
AlphaFold2 introduced self-distillation to protein structure prediction. In SSL terms, this is the most classical form of self-training: the model generates pseudo-labels for unlabeled data, then a fresh copy of the same architecture trains on the combination.
Teacher. An “undistilled” AF2 — an earlier checkpoint trained on PDB data only. Single model inference without the 5-model ensemble re-ranking used for CASP14 submissions. The paper states: “All CASP14 models are trained with distillation from a slightly earlier version of the model.”
Student. Identical architecture (Evoformer + IPA Structure Module). Initialized from scratch — no weight transfer from the teacher. This is pure data-level knowledge transfer.
Data. Starting from 6.3M sequences in Uniclust30 (v2018-08), the pipeline applied greedy deduplication, length filtering (200 < length ≤ 1024), and MSA depth filtering (≥ 200 sequences). The result: 355,993 synthetic structures. MSA for each distillation example was subsampled to 1,000 sequences.
Filtering. AF2’s confidence filtering is finer-grained than anything in CV SSL. Rather than accepting or rejecting entire samples, AF2 uses a per-residue KL-divergence metric:
\[c_i = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} D_{\text{KL}}\left(p_{\text{ref},|i-j|}(r) \;\|\; p_{i,j}(r)\right)\]
Here p_i,j(r) is the predicted pairwise distance distribution between residues i and j, and p_ref is the reference distribution computed from 1,000 random Uniclust30 sequences. The neighborhood N_i covers residues within ±128 positions of i. Notably, this is not pLDDT — it is a distance-distribution-based confidence score. Residues with c_i < 0.5 are masked from the loss, meaning the student never trains on low-confidence regions.
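A minimal sketch of this per-residue scheme, assuming binned distance distributions and using hypothetical names (`residue_confidence`, `confidence_mask`); the real pipeline operates on distogram logits, but the structure is the same:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) over distance bins, with a small epsilon for stability."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def residue_confidence(dist_probs, ref_probs, window=128):
    """Per-residue confidence c_i (sketch of AF2's scheme).

    dist_probs : (L, L, B) predicted pairwise distance distributions p_{i,j}
    ref_probs  : (L_max, B) reference distribution p_ref indexed by |i - j|
    A residue whose predicted distributions match the sequence-separation
    background (KL near zero) is treated as low confidence.
    """
    L = dist_probs.shape[0]
    c = np.zeros(L)
    for i in range(L):
        nbrs = [j for j in range(max(0, i - window), min(L, i + window + 1)) if j != i]
        c[i] = np.mean([kl_divergence(ref_probs[abs(i - j)], dist_probs[i, j])
                        for j in nbrs])
    return c

def confidence_mask(c, threshold=0.5):
    """Boolean mask: True where the residue contributes to the loss."""
    return c >= threshold
```

A prediction indistinguishable from the background distribution yields c_i = 0 everywhere, so every residue is masked out of the loss.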
Loss. The same loss as for experimental data: FAPE + distogram + masked MSA + auxiliary losses. The only difference is the per-residue confidence masking described above. FAPE clamping (10 angstrom) is applied identically to PDB and distillation data.
Training schedule. 75% distillation / 25% PDB sampling throughout training — both in the initial phase (~10M samples, 7 days) and fine-tuning (~1.5M samples, 4 days). Total: ~11 days on 128 TPUs.
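The fixed 75:25 mixing ratio amounts to source-first sampling: pick the source according to the ratio, then sample uniformly within it. A sketch with a hypothetical `make_mixed_sampler`:

```python
import random

def make_mixed_sampler(pdb_items, distill_items, distill_frac=0.75, seed=0):
    """Yield training examples at a fixed distillation:PDB mixing ratio.

    Each draw first picks the data source (distillation with probability
    distill_frac), then samples uniformly within that source.
    """
    rng = random.Random(seed)
    while True:
        pool = distill_items if rng.random() < distill_frac else pdb_items
        yield rng.choice(pool)
```

Over many draws, the empirical distillation fraction converges to `distill_frac`; the same mechanism, with different weights per dataset, underlies the sampling tables of the later models.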
SSL interpretation. This is classic self-training (Lee, 2013). The per-residue masking is conceptually similar to FixMatch’s confidence thresholding, but with two key differences: (1) it operates at residue granularity rather than sample granularity, and (2) it uses a distance-distribution metric rather than class probability. AF2’s ablation study found self-distillation to be “one of the most impactful components” — a strong empirical endorsement of pseudo-labeling for structural biology.
2.2 AlphaFold3: Multi-Teacher Cross-Distillation
Abramson et al., “Accurate structure prediction of biomolecular interactions with AlphaFold 3,” Nature 2024
AlphaFold3 represents a qualitative leap in distillation sophistication. It is the first model to use multiple purpose-specific teachers and to explicitly separate loss types by data source.
Teachers. Three different models, each serving a distinct purpose:
| Teacher Model | Purpose | Generated Data |
|---|---|---|
| AlphaFold2 | Protein monomers | ~41M structures from MGnify |
| AF-Multimer v2.3 | Disordered regions | ~14K-25K PDB proteins with missing residues |
| AlphaFold3 itself | RNA and TF-DNA | ~65K RNA + ~16K TF structures |
The AF-Multimer distillation deserves special attention. AF3’s diffusion module tends to hallucinate compact structures for intrinsically disordered regions — they look physically plausible but are wrong. AF-Multimer’s IPA module, by contrast, predicts extended ribbon-like conformations for these regions. This “correct expression of uncertainty” is distilled into AF3 to suppress hallucination. In SSL terms, this is using a less powerful but better-calibrated teacher for specific failure modes.
Student. Fundamentally different architecture from the primary teacher (AF2):
| Dimension | AF2 (Teacher) | AF3 (Student) |
|---|---|---|
| Trunk | Evoformer | Pairformer |
| Structure module | IPA (SE(3)-equivariant) | EDM Diffusion (non-equivariant) |
| Tokenization | Residue-level | Atom-level |
| MSA processing | Row-wise + column-wise attention | Removed column-wise attention |
This is genuine cross-architecture distillation — knowledge transfer between fundamentally different model families. In CV, the closest analogue is distilling from a large ViT teacher to a smaller ConvNet student, but the architectural gap here is much wider.
Data. The training mixture is carefully weighted:
| Dataset | Size | Sampling Weight |
|---|---|---|
| Weighted PDB | ~150K | 0.50 |
| Protein monomer (long, >200 res) | ~13M | 0.495 |
| Protein monomer (short, 4-200 res) | ~28M | 0.005 |
| Disordered PDB | ~14K-25K | 0.02 |
| RNA (Rfam v14.9) | ~65K | 0.05 |
| TF positive (JASPAR + SELEX) | 16,439 | 0.021 |
| TF negative (random pairs) | N/A | 0.011 |
Note the extreme downweighting of short monomers (0.005 for 28M sequences vs. 0.495 for 13M longer ones). This is implicit quality control through sampling — short proteins are easier to predict but less informative for training.
Filtering. AF3 made a surprising decision: it removed the pLDDT ≥ 0.8 threshold that AF2 had applied to monomer distillation. Instead of filtering at the data level, AF3 manages quality through loss differentiation and sampling weights. For RNA, a stricter filter applies: average PDE (Predicted Distance Error) < 2 angstrom.
Loss. This is the most critical design decision — and the one with the strongest SSL implications:
\[\mathcal{L} = \underbrace{\alpha_{\text{diff}} \mathcal{L}_{\text{diffusion}} + \alpha_{\text{dist}} \mathcal{L}_{\text{distogram}}}_{\text{applied to all data}} + \underbrace{\alpha_{\text{conf}} \left(\mathcal{L}_{\text{pLDDT}} + \mathcal{L}_{\text{PDE}} + \alpha_{\text{PAE}} \mathcal{L}_{\text{PAE}}\right)}_{\text{applied to PDB only}}\]
with coefficients α_diff = 4, α_dist = 3 × 10⁻², and α_conf = 10⁻⁴.
The structural losses (diffusion denoising, distogram prediction) are applied to all data — PDB and synthetic alike. But the confidence losses (pLDDT, PDE, PAE) are applied exclusively to PDB experimental data. The reasoning: structural pseudo-labels from a good teacher are approximately correct and useful for training; but confidence pseudo-labels — which require knowing the true error — would propagate the teacher’s calibration errors into the student. We will return to this asymmetry in Part 3.
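The data-dependent split can be written as a single function. The sketch below uses flattened loss dictionaries and coefficients from the equation above; the function name and the `a_pae` default are illustrative, not AF3's actual interface:

```python
def af3_style_loss(struct_losses, conf_losses, is_pdb,
                   a_diff=4.0, a_dist=3e-2, a_conf=1e-4, a_pae=1.0):
    """AF3-style total loss for one sample, a simplified sketch.

    struct_losses : dict with 'diffusion' and 'distogram' scalars (all data)
    conf_losses   : dict with 'plddt', 'pde', 'pae' scalars (PDB only)
    is_pdb        : True iff the sample carries an experimental structure
    """
    total = a_diff * struct_losses["diffusion"] + a_dist * struct_losses["distogram"]
    if is_pdb:  # confidence heads are supervised on experimental data only
        total += a_conf * (conf_losses["plddt"] + conf_losses["pde"]
                           + a_pae * conf_losses["pae"])
    return total
```

The same structural batch contributes identically from either source; only the confidence term is gated on `is_pdb`.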
Training schedule. Four stages spanning ~20 days on 256 A100 GPUs:
| Stage | Crop Size | Distillation | Key Change |
|---|---|---|---|
| Initial | 384 | Monomer + disorder + RNA | Baseline training |
| Fine-tune 1 | 640 | Same + disorder unmasking | Unmask non-protein chains |
| Fine-tune 2 | 768 | + TF distillation | Add protein-DNA data |
| Fine-tune 3 | 768 | Same | Add PAE head, remove structure loss |
SSL interpretation. AF3’s multi-teacher strategy has no direct precedent in CV SSL, where single-teacher self-training dominates. The loss-type separation — structural loss on all data, confidence loss on labeled only — goes beyond the CV SSL playbook entirely. This is a protein-domain-specific insight: structural predictions can tolerate noise in ways that confidence estimates cannot.
2.3 Boltz-1 to Boltz-2: From Simple to Multi-Source
Boltz-1
Wohlwend et al., “Boltz-1: Democratizing Biomolecular Interaction Modeling,” bioRxiv 2024
Boltz-1 adopted the simplest distillation strategy among all models.
Teacher. AlphaFold2 via OpenProteinSet (pre-computed predictions with MSA features).
Student. Pairformer (48 blocks) + EDM diffusion — architecturally similar to AF3.
Data. ~270K structures from OpenProteinSet. Standard PDB set for experimental data.
Training schedule. Two stages with a distinctive transition:
| Stage | Steps | Data |
|---|---|---|
| Initial | 53K | PDB + distillation (50:50) |
| Fine-tuning | 15K | PDB only |
Boltz-1 is the only model that removes distillation data at the end of training. The final 15K steps use PDB exclusively — a strategy that stands in direct contrast to SeedFold’s finding that continuous distillation is necessary.
Loss. EDM denoising loss with per-atom-type weights (protein 1.0, DNA/RNA 5.0, ligand 10.0). No reported differentiation between distilled and experimental data.
Boltz-2
Passaro, Wohlwend et al., “Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction,” bioRxiv 2025
Boltz-2 dramatically expanded the distillation strategy, introducing the most explicit multi-source data mixing in the field.
Teachers. Two models serving complementary roles:
| Teacher | Purpose |
|---|---|
| AlphaFold2 (AFDB) | Protein monomers |
| Boltz-1 | Protein-ligand, protein-DNA, RNA, MHC-peptide complexes |
The use of their own previous model (Boltz-1) as the complex teacher is a form of iterative self-training across model generations — the student of the last round becomes the teacher for a new task in the next round.
Student. Expanded Pairformer (64 blocks, up from 48), with trifast triangle attention and BF16 mixed precision.
Data and sampling.
| Source | Content | Sampling Weight |
|---|---|---|
| PDB experimental | Ground truth structures | 60% |
| AFDB monomer (AF2) | ~5M proteins | 30% |
| Boltz-1 complex distillation | Protein-ligand, DNA, RNA, MHC | 10% |
| MD data (MISATO, ATLAS, md-CATH) | Experimental MD ensembles | (included in PDB share) |
Filtering. Boltz-2 applies different thresholds to different data types:
- AFDB monomers: global lDDT ≥ 0.5 — a notably low threshold compared to SeedFold’s pLDDT ≥ 0.8
- Boltz-1 complexes: iPDE ≤ 1.0, PDE ≤ 1.0, ipTM ≥ 0.85 — strict for interface quality
This asymmetry is interesting: Boltz-2 accepts low-confidence monomers (more data, more noise) but demands high-confidence complex predictions (less data, less noise). The implicit assumption is that monomer errors are tolerable but interface errors are not.
Loss. Same loss for all data sources — quality is managed entirely through sampling ratios, not through loss differentiation. No confidence loss separation as in AF3.
SSL interpretation. Fixed thresholds plus data-agnostic loss. Multi-teacher but without loss separation. In CV terms, this is closer to standard pseudo-labeling with hard thresholding than to the more nuanced FixMatch family. The lDDT ≥ 0.5 threshold is equivalent to setting FixMatch’s $\tau$ very low — including many uncertain predictions in training.
2.4 SeedFold: The Largest Distillation
SeedFold team, “SeedFold: Scaling Biomolecular Structure Prediction,” arXiv 2025
SeedFold holds the record for the largest distillation dataset: 26.5M structures, a 147× expansion over PDB alone.
Teacher. OpenFold running AlphaFold2 weights. The team chose AF2 over AF3 as the teacher — prioritizing inference speed and the well-established reliability of AF2 monomer predictions over AF3’s broader but newer capabilities.
Student. Modified Pairformer with linear triangle attention (sub-cubic complexity) and wider pair representations.
Data.
| Source | Samples | Sampling Weight |
|---|---|---|
| PDB experimental | 180K | 0.50 |
| AFDB (short, <200 residues) | 3.3M | 0.08 |
| MGnify (longer, median 435 residues) | 23M | 0.42 |
| Total distillation | 26.5M | 0.50 |
The MGnify component is distinctive. These 23M sequences come from metagenomic sources — uncultured organisms from soil, ocean, and gut microbiomes. Only 2M of the 23M sequences map to existing AFDB clusters, meaning the vast majority represent novel structural diversity absent from the PDB and conventional protein databases. This is the protein AI equivalent of mining the web for unlabeled images.
Filtering. pLDDT ≥ 0.8 for AFDB structures, 30-50% sequence identity clustering to ensure structural diversity.
Loss. Same loss for distilled and experimental data — no differentiation. Quality managed by 50:50 sampling weight and additional per-cluster, per-molecule-type weighting.
Critical ablation. SeedFold provides the strongest evidence in the field for the necessity of continuous distillation. When distillation data was removed at training step 47,612, intra-protein structure prediction accuracy degraded immediately. The authors offer a compelling explanation:
The architectural transition from AF2’s IPA (Invariant Point Attention) to AF3-style diffusion transformers removed strong geometric inductive biases. IPA is SE(3)-equivariant by construction — it “knows” about 3D geometry through its design. The diffusion transformer has no such built-in geometric knowledge and must learn spatial relationships entirely from data. With only 180K PDB structures, there is not enough data to learn these relationships. The 26.5M distillation set compensates for the lost inductive bias with data volume. In equation form:
\[\underbrace{\text{IPA (strong geometric prior)}}_{\text{needs less data}} \quad \longrightarrow \quad \underbrace{\text{Diffusion Transformer (weak prior)}}_{\text{needs } \gg \text{ data}}\]
This is an architectural argument for why distillation is even more critical for post-AF2 models than it was for AF2 itself.
SSL interpretation. SeedFold is the most ambitious in scale but the most basic in SSL technique. No adaptive thresholding, no loss differentiation, no weak-to-strong augmentation — just massive pseudo-labeling with a fixed confidence threshold and uniform loss. In the SSL taxonomy, this is Era 1 technique applied at Era 4 scale.
2.5 OpenFold3: 97% Synthetic
OpenFold Consortium, “OpenFold3,” 2025
OpenFold3 is the open-source reproduction of AF3’s training protocol, and the model most dominated by synthetic data in its training mix.
Teachers. AlphaFold2 (monomer distillation) and AlphaFold3 (RNA distillation) — cross-generation distillation using two different teacher models.
Student. AF3 architecture reproduction: Pairformer trunk + EDM diffusion. Released under Apache 2.0 with full training code and data.
Data.
| Source | Size | Share of Training |
|---|---|---|
| Monomer distillation (AF2, from MGnify) | ~13M | ~96% |
| RNA distillation (AF3, from Rfam v15.1) | ~125K | ~1% |
| PDB experimental | ~300K | ~3% |
97% of OpenFold3’s training data is synthetic. This is the most extreme pseudo-label-to-labeled ratio in the field — far exceeding what is typical in CV SSL, where labeled data usually constitutes at least 10-20% of the training mix.
Filtering. Monomer: MGnify cluster size ≥ 10 (statistical sufficiency, not confidence-based). RNA: Rfam cluster representatives, AF3 predictions with average PDE < 2.
Loss. Follows AF3’s protocol: confidence losses (pLDDT, PDE, PAE) applied to PDB only. This is the key design decision that AF3 pioneered and OpenFold3 inherited.
Training schedule. 155,000 total steps across three stages (Initial: 131,500; Fine-tune 1: 8,000; Fine-tune 2: 15,500). Unlike AF3, OpenFold3 omits the third fine-tuning stage and trains the PAE head from the start.
SSL interpretation. OpenFold3 demonstrates that with a sufficiently good teacher, a model can train overwhelmingly on pseudo-labels (97%) and still achieve competitive accuracy. It is the only open-source model matching AF3-level RNA performance — a direct consequence of using AF3 itself as the RNA teacher. This is the protein AI equivalent of Noisy Student (Xie et al., 2020, CVPR): if the teacher is strong enough, the student can learn primarily from pseudo-labels, with a small anchor of labeled data to prevent drift.
3. Systematic Comparison
Having analyzed each model individually, we now compare them across multiple dimensions.
3.1 Master Table: Full Model Comparison
| Model | Teacher(s) | Student Arch | Distill Scale | PDB:Distill Ratio | Confidence Filter | Loss Differentiation | Continuous Distill |
|---|---|---|---|---|---|---|---|
| AF2 | Undistilled AF2 | Same (Evoformer+IPA) | 356K | 25:75 | KL-div c_i < 0.5 | Residue masking only | Yes |
| AF3 | AF2 + AFM v2.3 + AF3 | Different (Pairformer+Diffusion) | ~41M+ | ~50:50 | None (monomer), PDE<2 (RNA) | Conf. loss PDB only | Yes |
| Boltz-1 | AF2 (OpenFold) | Pairformer+Diffusion | 270K | 50:50 → PDB only | OpenFold defaults | None | No (removed last 15K) |
| Boltz-2 | AF2 + Boltz-1 | Pairformer+Diffusion | ~5M+ | 60:40 | lDDT≥0.5, ipTM≥0.85 | None | Yes |
| SeedFold | AF2 (OpenFold) | Pairformer+Diffusion | 26.5M | 50:50 | pLDDT≥0.8 | None | Yes (ablation proven) |
| OpenFold3 | AF2 + AF3 | Pairformer+Diffusion | ~13M | 3:97 | Cluster≥10, PDE<2 | Conf. loss PDB only | Yes |
3.2 Confidence Filtering Comparison
| Model | Metric | Threshold | Granularity | Notes |
|---|---|---|---|---|
| AF2 | KL-divergence c_i | 0.5 | Residue-level | Masks individual residues in loss |
| AF3 (monomer) | None | Removed | N/A | Dropped AF2’s pLDDT ≥ 0.8 filter |
| AF3 (RNA) | PDE | < 2 | Sample-level | Applied only to RNA distillation |
| Boltz-2 (monomer) | lDDT | ≥ 0.5 | Sample-level | Notably low threshold |
| Boltz-2 (complex) | ipTM | ≥ 0.85 | Sample-level | Strict for interfaces |
| SeedFold | pLDDT | ≥ 0.8 | Sample-level | Standard threshold |
| OpenFold3 (monomer) | Cluster size | ≥ 10 | Cluster-level | Statistical, not confidence-based |
| OpenFold3 (RNA) | PDE | < 2 | Sample-level | Follows AF3 protocol |
Key observation: Every model uses fixed thresholds with hard filtering. No model employs adaptive thresholding (FlexMatch), soft weighting (SoftMatch), or learned quality predictors (SemiReward). The most sophisticated filtering is AF2’s per-residue masking — ironically the oldest approach in the field.
3.3 Loss Differentiation Comparison
| Strategy | Models | Description |
|---|---|---|
| Same loss, no differentiation | Boltz-1, Boltz-2, SeedFold | Identical loss for PDB and synthetic data |
| Residue-level masking | AF2 | Low-confidence residues excluded from loss |
| Loss-type separation | AF3, OpenFold3 | Structure loss on all data; confidence loss on PDB only |
| Stage-selective masking | AF3 | Disorder distillation: non-protein chains masked initially, unmasked in fine-tune |
The split between “same loss” and “loss-type separation” camps is stark. AF3 and OpenFold3 recognize that structural pseudo-labels and confidence pseudo-labels have fundamentally different noise tolerances. The rest of the field treats them identically.
3.4 Teacher Strategy Evolution
The field’s approach to choosing teachers has evolved significantly:
AF2 (2021): Single self-teacher
│ Same architecture, earlier checkpoint
│
▼
AF3 (2024): Multi-teacher (AF2 + AFM v2.3 + AF3-self)
│ Purpose-specific teacher selection
│ Cross-architecture distillation
│
▼
Boltz-2 (2025): Multi-teacher (AF2 + Boltz-1)
│ Own prior model as complex teacher
│ Multi-modality coverage
│
▼
OpenFold3 (2025): Multi-teacher (AF2 + AF3)
Cross-generation distillation
Open-source reproduction
3.5 Distillation Scale Timeline
The exponential growth in synthetic data is one of the most striking trends:
Distillation Scale (structures)
AF2 (2021) ████ 356K
│
Boltz-1 (2024) ████ 270K
│
Boltz-2 (2025) ████████████████████████████ ~5M
│
OpenFold3 (2025) ████████████████████████████████████████████████████████████████ ~13M
│
SeedFold (2025) ████████████████████████████████████████████████████████████████████████████████████████████████████████ 26.5M
│
AF3 (2024) ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ ~41M+
From 356K to 41M+ in four years — a 115× increase. The PDB, meanwhile, grew from ~180K to ~220K in the same period. Synthetic data is not just supplementing PDB; it is dominating training by two orders of magnitude.
3.6 Distillation Continuity Comparison
| Strategy | Models | Rationale |
|---|---|---|
| Throughout training | AF2, AF3, Boltz-2, SeedFold, OpenFold3 | Majority approach |
| Removed at end | Boltz-1 (last 15K steps PDB only) | Final calibration on experimental data |
| Staged introduction | AF3 (TF distillation added in Fine-tune 2) | Progressive complexity |
SeedFold’s ablation provides the most direct evidence: removing distillation mid-training degrades accuracy immediately. This suggests that for modern diffusion-based architectures, distillation is not a pre-training trick but a continuous necessity. Boltz-1’s removal of distillation at the end may have been feasible only because its distillation set was small (270K) — a regime where PDB alone may suffice for final fine-tuning.
4. What the Field Has Already Adopted from SSL
Having mapped each model’s distillation strategy to SSL terminology, we can now systematically assess which SSL techniques the field has adopted — even if it did not use SSL vocabulary to describe them.
4.1 Adopted Techniques
| SSL Technique | CV Reference | Co-Folding Implementation | Adoption Level |
|---|---|---|---|
| Pseudo-labeling | Lee, 2013 | Self-distillation (all models) | Full |
| Teacher-student framework | Hinton et al., 2015 | Teacher generates structures, student trains on them | Full |
| Fixed confidence threshold | FixMatch (Sohn et al., 2020) | pLDDT/ipTM/KL-div thresholds | Full |
| Cross-architecture distillation | Knowledge distillation | AF2 → AF3, AF2 → Boltz | Full |
| Multi-source data mixing | — | AF3, Boltz-2 multi-teacher | Partial |
| Loss-type separation | — | AF3 confidence loss on PDB only | Partial (AF3/OF3 only) |
| Iterative self-training (1 round) | Noisy Student (Xie et al., 2020) | Student becomes next generation’s teacher | Implicit |
4.2 The SSL-to-Protein Mapping
The mapping is precise enough to write in equation form. In CV SSL, the standard self-training loss is:
\[\mathcal{L} = \frac{1}{|\mathcal{D}_L|} \sum_{(x,y) \in \mathcal{D}_L} \ell(f_\theta(x), y) \;+\; \lambda \frac{1}{|\hat{\mathcal{D}}_U|} \sum_{(x,\hat{y}) \in \hat{\mathcal{D}}_U} \mathbb{1}[\text{conf}(\hat{y}) \geq \tau] \cdot \ell(f_\theta(x), \hat{y})\]
In protein AI, the corresponding formulation is:
\[\mathcal{L} = \underbrace{\sum_{s \in \mathcal{D}_{\text{PDB}}} w_s \cdot \mathcal{L}_{\text{struct+conf}}(s)}_{\text{labeled (PDB)}} \;+\; \underbrace{\sum_{s \in \hat{\mathcal{D}}_{\text{synth}}} w_s \cdot \mathcal{L}_{\text{struct}}(s)}_{\text{pseudo-labeled (synthetic)}}\]
Here w_s encodes the sampling weight. The term L_struct includes diffusion and distogram losses, while L_conf includes pLDDT, PDE, and PAE losses. The confidence filtering enters through data preprocessing (hard threshold on the synthetic dataset) rather than through the loss itself.
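The generic CV-side objective translates into a few lines of code. In this sketch, `self_training_loss` and the `(x, y_hat, conf)` triple format are illustrative; the hard gate corresponds to the indicator function above:

```python
def self_training_loss(model_loss, labeled, pseudo, tau=0.8, lam=1.0):
    """Generic self-training objective: supervised loss + thresholded pseudo-label loss.

    model_loss : callable (x, y) -> scalar loss
    labeled    : list of (x, y) pairs with ground-truth labels
    pseudo     : list of (x, y_hat, conf) teacher-generated triples
    """
    l_sup = sum(model_loss(x, y) for x, y in labeled) / max(len(labeled), 1)
    # Hard FixMatch-style gate: only confident pseudo-labels contribute.
    kept = [(x, y) for x, y, c in pseudo if c >= tau]
    l_pseudo = sum(model_loss(x, y) for x, y in kept) / max(len(kept), 1)
    return l_sup + lam * l_pseudo
```

The protein-AI variant replaces the indicator gate with offline filtering of the synthetic dataset and the λ weight with per-source sampling ratios, but the two terms are the same.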
4.3 The Adoption Timeline
┌──────────────────────────────────────────────────────────────────────┐
│ SSL Technique Adoption in Co-Folding │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ CV SSL Era 1 (2013-2016) Co-Folding (2021-2025) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Pseudo-labeling │ ────────► │ Self-distillation │ ✅ │
│ │ (Lee, 2013) │ │ (AF2, 2021) │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ CV SSL Era 2 (2017-2019) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Mean Teacher (EMA) │ │ Fixed offline │ │
│ │ (Tarvainen, 2017) │ ──── ✗ ──│ teacher only │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Knowledge distill. │ ────────► │ AF2 → AF3 │ ✅ │
│ │ (Hinton, 2015) │ │ (cross-arch) │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ CV SSL Era 3 (2019-2021) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ FixMatch threshold │ ────────► │ pLDDT/ipTM filtering│ ✅ │
│ │ (Sohn, 2020) │ │ (all models) │ (fixed) │
│ └─────────────────────┘ └─────────────────────┘ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Weak→Strong aug. │ │ Same augmentation │ │
│ │ (FixMatch) │ ──── ✗ ──│ for teacher/student │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ CV SSL Era 4 (2021-2025) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Adaptive threshold │ │ Fixed threshold │ │
│ │ (FlexMatch, 2021) │ ──── ✗ ──│ only │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Soft weighting │ │ Hard binary │ │
│ │ (SoftMatch, 2023) │ ──── ✗ ──│ filtering only │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Learned quality │ │ Self-assessed │ │
│ │ (SemiReward, 2024) │ ──── ✗ ──│ pLDDT only │ ❌ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
The pattern is clear: co-folding models have adopted the techniques from CV SSL’s Era 1 (2013-2016) and selectively from Era 2-3. The advances of Era 4 (2021-2025) — adaptive thresholding, soft weighting, learned quality predictors — remain entirely untapped.
5. Closing: What’s Missing?
The analysis above reveals a striking asymmetry. On one hand, every major co-folding model has independently converged on pseudo-labeling, teacher-student distillation, and fixed confidence thresholds — the core of 2013-2017 era SSL. On the other hand, the techniques that defined the 2019-2025 SSL revolution remain absent.
The gaps fall into five categories:
Adaptive thresholding. Every model uses a fixed confidence threshold (pLDDT ≥ 0.8, lDDT ≥ 0.5, etc.). FlexMatch (Zhang et al., 2021, NeurIPS) and FreeMatch (Wang et al., 2023, ICLR) showed that adaptive, class-specific thresholds dramatically improve SSL in CV. In protein terms, this would mean different thresholds for alpha-helical domains vs. disordered loops vs. protein-protein interfaces — regions where teacher confidence is calibrated very differently.
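A sketch of what FlexMatch-style curriculum thresholds might look like if ported to structural regions; the region granularity and the scaling rule (threshold proportional to each region's acceptance count, as in FlexMatch's class-wise learning status) are assumptions for illustration:

```python
def adaptive_thresholds(base_tau, pass_counts):
    """FlexMatch-style curriculum thresholds, sketched over structural regions.

    pass_counts : dict region -> number of pseudo-labels already accepted.
    Regions that pass less often (harder, e.g. disordered loops) get a
    lower threshold so they are not starved of training signal.
    """
    max_count = max(pass_counts.values()) or 1
    return {region: base_tau * (n / max_count) for region, n in pass_counts.items()}
```

With a base threshold of 0.8, a region accepted a quarter as often as the easiest one would be gated at 0.2 instead of 0.8, admitting more of its pseudo-labels early in training.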
Soft weighting. All models apply binary hard filtering: a prediction is either in or out based on whether it exceeds the threshold. SoftMatch (Chen et al., 2023, ICLR) replaced this binary gate with a continuous Gaussian weight. In protein terms, a prediction with pLDDT 0.79 is currently discarded entirely (if the threshold is 0.8), while a prediction with pLDDT 0.81 gets full weight. A soft weighting scheme would make this boundary continuous.
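A SoftMatch-flavored weight, sketched with a fixed mean and width rather than the distribution-estimated parameters of the original method; `soft_weight` is a hypothetical name, with confidence on a pLDDT-like 0-1 scale:

```python
import math

def soft_weight(conf, mu=0.8, sigma=0.1):
    """Continuous sample weight in place of a hard threshold (SoftMatch-style).

    Full weight at or above the mean mu; Gaussian falloff below it, so a
    conf = 0.79 prediction is down-weighted slightly instead of discarded.
    """
    if conf >= mu:
        return 1.0
    return math.exp(-((conf - mu) ** 2) / (2 * sigma ** 2))
```

Under this scheme the 0.79/0.81 boundary from the text disappears: both predictions receive essentially full weight, while a 0.50 prediction is suppressed by roughly two orders of magnitude rather than by fiat.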
Weak-to-strong augmentation. FixMatch’s core insight — generate pseudo-labels with weak augmentation, train the student with strong augmentation — is absent from every model. Protein AI has rich augmentation axes: MSA depth (full vs. subsampled), template availability (with vs. without), sequence cropping (full vs. random crop), coordinate noise. None of these are used asymmetrically between teacher and student.
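One way the asymmetry might look for the MSA-depth axis alone; `weak_augment` and `strong_augment` are hypothetical, and the other axes named above (template dropout, cropping, coordinate noise) would compose the same way:

```python
import random

def weak_augment(msa, rng):
    """Weak view: the full MSA, as the teacher would see it when labeling."""
    return list(msa)

def strong_augment(msa, rng, max_depth=32):
    """Strong view (sketch): aggressive MSA subsampling for the student.

    The teacher pseudo-labels the weak view; the student must reproduce
    that label from the impoverished strong view.
    """
    depth = min(max_depth, len(msa))
    return rng.sample(msa, depth)
```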
Online teachers. Every model uses offline distillation: the teacher generates a fixed set of pseudo-labels before training begins. The Mean Teacher approach (Tarvainen & Valpola, 2017, NeurIPS) — where the teacher is an exponential moving average of the student, updated continuously — would allow pseudo-labels to improve as the student improves.
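The EMA teacher update itself is one line per parameter; a sketch over flat lists of floats (in practice this runs over every weight tensor after each optimizer step):

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Mean Teacher update: teacher <- decay * teacher + (1 - decay) * student.

    The teacher trails the student as a smoothed average, so its
    pseudo-labels improve continuously as the student improves.
    """
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]
```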
Confidence-weighted loss. Beyond AF3’s binary separation (confidence loss on PDB only vs. not), no model uses confidence as a continuous weight in the loss function. A natural extension: weight each residue’s structural loss by the teacher’s confidence, so that high-confidence regions contribute more to gradient updates.
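Such an extension could look like the following sketch, replacing AF2's binary residue mask with continuous weights; `confidence_weighted_loss` is a hypothetical name:

```python
def confidence_weighted_loss(per_residue_loss, per_residue_conf):
    """Weight each residue's structural loss by teacher confidence (a sketch).

    High-confidence regions contribute more to the gradient; low-confidence
    regions contribute less, but are not zeroed out entirely.
    """
    num = sum(l * c for l, c in zip(per_residue_loss, per_residue_conf))
    den = sum(per_residue_conf) or 1.0
    return num / den
```

With weights of 1.0 and 0.0 this reduces exactly to AF2's hard mask; intermediate weights interpolate between inclusion and exclusion.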
Perhaps most critically, the field has not examined whether training confidence estimators on synthetic data — rather than experimental data alone — introduces systematic calibration errors. AF3’s decision to restrict confidence loss to PDB suggests the DeepMind team recognized this risk. But most other models have not followed suit.
What has been adopted corresponds to 2013-2017 era SSL. The 2019-2025 advances — the techniques that pushed CV SSL from “useful trick” to “standard practice” — remain untapped. This is the gap we will explore in detail in Part 3.
Next: Part 3 — Untapped Opportunities and the Confidence Calibration Trap
We will systematically propose how each unadopted SSL technique could be adapted for protein structure prediction, with concrete formulations and expected impact. The centerpiece will be the confidence calibration problem: why training confidence estimators on synthetic data may be more dangerous than training structure predictors on the same data — and what to do about it.
Part of the series: What Protein AI Can Learn from Semi-Supervised Learning
References
- Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature 2021
- Abramson et al., “Accurate structure prediction of biomolecular interactions with AlphaFold 3,” Nature 2024
- Wohlwend et al., “Boltz-1: Democratizing Biomolecular Interaction Modeling,” bioRxiv 2024
- Passaro, Wohlwend et al., “Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction,” bioRxiv 2025
- SeedFold team, “SeedFold: Scaling Biomolecular Structure Prediction,” arXiv 2025
- OpenFold Consortium, “OpenFold3,” 2025
- Lee, “Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks,” ICML Workshop 2013
- Tarvainen & Valpola, “Mean teachers are better role models,” NeurIPS 2017
- Sohn et al., “FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence,” NeurIPS 2020
- Xie et al., “Self-training with Noisy Student improves ImageNet classification,” CVPR 2020
- Zhang et al., “FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling,” NeurIPS 2021
- Wang et al., “FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning,” ICLR 2023
- Chen et al., “SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning,” ICLR 2023
- Wang et al., “SemiReward: A General Reward Model for Semi-supervised Learning,” ICLR 2024
- Hinton et al., “Distilling the Knowledge in a Neural Network,” NeurIPS Workshop 2015