Synthesis AI Part 1: The Synthesis Bottleneck — Why "Make" Lags Behind

AI-Driven Synthesis in Drug Discovery

This is Part 1 of a 5-part series on AI-driven synthesis in drug discovery.


1. Introduction: The Amdahl’s Law of Drug Discovery

Drug discovery runs on cycles. We design a molecule, make it, test it, learn from the results, and design the next one. This Design-Make-Test-Analyze (DMTA) loop is the heartbeat of every drug program — a typical small-molecule lead optimization campaign runs 10-15 full cycles per year. AI has compressed the Design step by orders of magnitude. But a strange asymmetry has emerged.

The overall cycle time is governed by its slowest stage, not its fastest — and Make is overwhelmingly the slowest.

Consider the arithmetic, which we explored in detail in the companion essay “Toward Accelerating the Entire Design-Make-Test Cycle with AI”:

  Typical DMTA cycle (small molecule lead optimization):

    Design:  2 days       (med chem ideation + computational scoring)
    Make:    21 days      (synthesis, purification, QC)
    Test:    10 days      (biochemical + cellular + Tier 1 ADME panel)
    Learn:   2 days       (data analysis, team review)
    ─────────────────
    Total:   35 days      (~5 weeks)

  After 100x Design acceleration (AI generative models):

    Design:  0.02 days    (minutes, not days)
    Make:    21 days      (unchanged)
    Test:    10 days      (unchanged)
    Learn:   2 days       (unchanged)
    ─────────────────
    Total:   33.02 days   → 5.7% improvement

  After 2x Make + Test acceleration (automation + prediction):

    Design:  2 days       (unchanged)
    Make:    10.5 days    (route pre-validation, robotic synthesis)
    Test:    5 days       (active learning, focused assay panel)
    Learn:   2 days       (unchanged)
    ─────────────────
    Total:   19.5 days    → 44% improvement

AstraZeneca — the pharma company that has published most extensively on DMTA optimization — reports that a full cycle takes 4-6 weeks, with synthesis alone consuming 3-6 weeks per round (Thakkar et al., 2021, Chem. Sci.). Design, by contrast, can now be completed in hours with modern generative models.

This is Amdahl’s law applied to chemistry: optimizing the fast component of a sequential process has negligible impact when the slow component dominates.
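The arithmetic above is easy to reproduce. A minimal sketch (the stage names and day counts come straight from the table; the function itself is just Amdahl's law for a sequential pipeline):

```python
# Amdahl's law for a sequential cycle: speeding up one stage helps only
# in proportion to that stage's share of the total.
def cycle_improvement(stages, speedups):
    """stages: {name: days}; speedups: {name: factor}.
    Returns (new total in days, percent improvement)."""
    base = sum(stages.values())
    new = sum(days / speedups.get(name, 1.0) for name, days in stages.items())
    return new, 100.0 * (base - new) / base

dmta = {"Design": 2, "Make": 21, "Test": 10, "Learn": 2}  # days, from the table

# 100x Design acceleration: barely moves the total
total, saved = cycle_improvement(dmta, {"Design": 100})
print(f"{total:.2f} days, {saved:.1f}% faster")   # 33.02 days, 5.7% faster

# 2x Make + Test acceleration: an order of magnitude more leverage
total, saved = cycle_improvement(dmta, {"Make": 2, "Test": 2})
print(f"{total:.2f} days, {saved:.1f}% faster")   # 19.50 days, 44.3% faster
```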

This series asks what it would take to break that bottleneck. Across five parts, we will trace the landscape of AI-driven synthesis — from predicting what a reaction produces, to planning multi-step routes backward from a target molecule, to generating only molecules that can actually be made, and finally to executing AI-proposed routes in physical laboratories.

  • Part 1 (this post): Why Make resists AI — the structural barriers and data landscape
  • Part 2: Reaction prediction — “What forms when these reactants meet?”
  • Part 3: Retrosynthesis — “How do we build this target molecule?”
  • Part 4: Synthesis-aware design — “Can we generate only makeable molecules?”
  • Part 5: Lab execution — “Does the AI-proposed route actually work?”

Here is the DMTA cycle and where each part of this series intervenes:

       ┌──────────┐      ┌──────────┐      ┌──────────┐      ┌──────────┐
       │  DESIGN  │─────▶│   MAKE   │─────▶│   TEST   │─────▶│  LEARN   │
       │          │      │          │      │          │      │          │
       │ propose  │      │ synth /  │      │ assay /  │      │ analyze  │
       │ molecule │      │ purify   │      │ screen   │      │ update   │
       └──────────┘      └──────────┘      └──────────┘      └────┬─────┘
            ▲                  │                                   │
            │            ┌─────┴──────┐                           │
            │            │ THIS SERIES│                           │
            │            │ Parts 2-5  │                           │
            │            └────────────┘                           │
            └─────────────────────────────────────────────────────┘

2. Why Chemistry Resists AI: Three Structural Barriers

AI has transformed molecular design. Structure prediction runs in seconds. Generative models propose candidates by the thousands. Docking evaluates binding in milliseconds. Why, then, does the Make step remain so stubbornly resistant?

The answer lies in three structural barriers that make chemical synthesis fundamentally harder for AI than molecular design.

These are not temporary gaps waiting for more data or bigger models. They reflect deep properties of chemistry itself.


Barrier 1: Combinatorial Explosion of Conditions

A chemical reaction is not determined by its reactants alone. Consider a Suzuki coupling — the reactants (aryl halide + boronic acid) are only the beginning. The outcome depends on:

  • Catalyst: Pd(PPh3)4 vs. Pd(dppf)Cl2 vs. Pd(OAc)2/XPhos — each favors different substrates
  • Base: K2CO3 vs. Cs2CO3 vs. K3PO4 — affects transmetalation rate
  • Solvent: DMF vs. THF vs. dioxane/water — governs solubility and reaction pathway
  • Temperature: 60 °C vs. 100 °C vs. microwave 150 °C
  • Concentration: 0.1 M vs. 0.5 M — dilution effects on selectivity
  • Atmosphere: air vs. N2 vs. Ar — oxygen sensitivity
  • Time: 2 hours vs. 24 hours vs. 3 days

The same pair of reactants can give completely different products — or no product at all — under different conditions. A textbook example: the reaction of an enolizable ketone with an aldehyde can give the β-hydroxy aldol product (kinetic control, low temperature, LDA) or the conjugated enone from aldol condensation and elimination (thermodynamic control, high temperature, NaOH). Same reactants, different conditions, different molecules.

The combinatorial space is enormous. Even discretizing each variable coarsely — say, 10 common solvents, 5 catalysts, 5 bases, 5 temperature ranges, 3 concentration regimes, 2 atmospheres — gives:

10 x 5 x 5 x 5 x 3 x 2 = 7,500 condition combinations per reaction
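The count is easy to verify by enumeration. The specific values below are illustrative placeholders for the coarse discretization above, not a recommended screening set:

```python
# Enumerate the coarse condition grid from the text: 10 x 5 x 5 x 5 x 3 x 2.
# The entries are illustrative placeholders, not a curated screening set.
from itertools import product

conditions = {
    "solvent":       ["DMF", "THF", "dioxane/water", "MeCN", "DMSO",
                      "toluene", "EtOH", "MeOH", "DCM", "NMP"],            # 10
    "catalyst":      ["Pd(PPh3)4", "Pd(dppf)Cl2", "Pd(OAc)2/XPhos",
                      "Pd2(dba)3", "PdCl2(MeCN)2"],                        # 5
    "base":          ["K2CO3", "Cs2CO3", "K3PO4", "NaOtBu", "Et3N"],       # 5
    "temperature":   ["rt", "60 C", "80 C", "100 C", "150 C (microwave)"], # 5
    "concentration": ["0.1 M", "0.25 M", "0.5 M"],                         # 3
    "atmosphere":    ["N2", "air"],                                        # 2
}

grid = list(product(*conditions.values()))
print(len(grid))  # 7500 condition combinations for a single reactant pair
```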

For a structure prediction model, the input is a sequence and the output is a structure — a well-defined mapping. For a reaction prediction model, the input must somehow encode not just the molecular graphs of reactants but the entire physicochemical context. This is a fundamentally higher-dimensional prediction problem.


Barrier 2: Exception-Rich Rules

Organic chemistry is taught through rules: Markovnikov’s rule, Baldwin’s rules for ring closure, Bredt’s rule, Woodward-Hoffmann rules. These rules are genuinely useful — they capture real thermodynamic and kinetic tendencies.

But every rule in organic chemistry has exceptions that are almost as well-known as the rule itself.

Consider a few:

| Rule | What It Says | Famous Exceptions |
|------|--------------|-------------------|
| Markovnikov’s rule | The halide adds to the more substituted carbon of an alkene | Anti-Markovnikov addition with peroxides (Kharasch, 1933) |
| Baldwin’s rules | Certain ring closure modes are kinetically disfavored | 4-exo-dig and 5-endo-trig cyclizations that violate the rules but proceed readily |
| Bredt’s rule | No double bond at bridgehead of small bicyclic systems | Stable bridgehead olefins in large rings (Wiseman, 1967) |
| Woodward-Hoffmann | Thermal [2+2] cycloadditions are forbidden | Ketene [2+2] cycloadditions proceed thermally |

This creates a dilemma for AI:

  • Physics-based simulation (QM/DFT) can in principle handle all of these correctly by computing transition state energies. But a single transition state calculation takes minutes to hours at DFT level, and days at coupled-cluster level. Running QM on every proposed reaction step in a multi-step synthesis is computationally prohibitive.
  • Data-driven approaches (neural networks trained on reaction databases) are fast — milliseconds per prediction — but they learn statistical patterns, not physical laws. They perform well on reactions similar to their training data and degrade unpredictably on novel substrates or unusual reaction types.

The result is a fundamental tension: physics is accurate but slow; data is fast but brittle. In protein structure prediction, end-to-end differentiable models (AlphaFold2, Boltz-1) have unified physics and data into a single framework. No comparable unification exists yet for reaction prediction.


Barrier 3: Negative Data Scarcity

Perhaps the most insidious barrier is what we do not see in the data. When a chemist tries a reaction and it fails — no product, wrong product, decomposition — that experiment almost never enters any database.

Failed reactions are the dark matter of chemistry: they shape the landscape but are nearly invisible to data-driven models.

The reasons are structural:

  • Publication bias: Journals publish successful reactions. A paper titled “A Highly Efficient Palladium-Catalyzed Cross-Coupling” will be accepted; “Fourteen Conditions Under Which This Coupling Failed” will not.
  • Electronic lab notebooks (ELNs) capture failures within organizations, but this data is proprietary, unstructured, and rarely shared.
  • Patent data (the basis of USPTO) reports optimized procedures, not the dozens of failed attempts that preceded them.

A model trained exclusively on successful reactions learns “what works” but has no signal for “what doesn’t work.” This is like training a medical diagnosis model only on patients who survived.

Here is the positive/negative ratio in commonly used datasets:

| Dataset | Approximate Size | Positive Reactions | Negative/Failed Reactions | Pos:Neg Ratio |
|---------|------------------|--------------------|---------------------------|---------------|
| USPTO | ~3.5M reactions | ~3.5M | ~0 | ~∞:1 |
| Reaxys | ~130M reactions | ~130M | minimal (some low-yield) | ≫100:1 |
| Pistachio | ~16M reactions | ~16M | ~0 | ~∞:1 |
| ORD | ~1M reactions | ~900K | ~100K (includes null results) | ~9:1 |
| Typical pharma ELN | varies | ~40-60% | ~40-60% | ~1:1 |

The contrast between public datasets (nearly 100% positive) and actual lab experience (roughly 50/50 success/failure) is striking. The Open Reaction Database (ORD) is the first major effort to systematically include negative results, but it remains small compared to the positive-only giants.

Without negative data, models cannot learn decision boundaries — they can only interpolate within the space of known successes.
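A toy illustration of the point — this is deliberately not a real reaction model — shows how positive-only data defeats even a perfectly fit classifier:

```python
# Toy illustration: with positive-only labels, a predictor that always
# says "works" is unbeatable on the training set and useless in the lab.
def train_majority_classifier(labels):
    """Return the constant prediction that maximizes training accuracy."""
    return max(set(labels), key=labels.count)

def accuracy(pred, labels):
    return sum(pred == y for y in labels) / len(labels)

uspto_like = [1] * 1000                # success-only, like USPTO/Pistachio
pred = train_majority_classifier(uspto_like)
print(accuracy(pred, uspto_like))      # 1.0 -- perfect on paper

eln_like = [1] * 500 + [0] * 500       # ~1:1, like a real pharma ELN
print(accuracy(pred, eln_like))        # 0.5 -- no better than a coin flip
```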


3. The Data Landscape: What AI Models Learn From

Every AI model for chemical synthesis is shaped by the data it trains on. The reaction databases available today vary enormously in size, quality, coverage, and accessibility. Understanding their strengths and limitations is essential for interpreting what any synthesis AI model can — and cannot — do.


3.1 The Major Databases

USPTO (United States Patent and Trademark Office)

The workhorse of academic reaction prediction research. Daniel Lowe’s 2012 extraction of ~3.5 million reactions from US patents (Lowe, 2012, PhD Thesis, Cambridge) created the most widely used public benchmark.

  • Strengths: Large, public, free, well-established benchmarks (USPTO-50K, USPTO-MIT, USPTO-STEREO)
  • Weaknesses: Noisy automated extraction from patent text. Condition data sparse. Biased toward industrial reaction types (cross-couplings, amide formations, reductions).

Reaxys (Elsevier)

The gold standard for reaction data in the pharmaceutical industry. Approximately 130 million reactions manually curated from the scientific literature since the 1800s.

  • Strengths: Massive coverage, expert curation, structured condition data for many entries, spans two centuries
  • Weaknesses: Commercial — expensive institutional license. Not available for ML training at scale.

Pistachio (NextMove Software)

A structured extraction of ~16 million reactions from patent literature, classified using the NextMove NameRxn system.

  • Strengths: Cleaner than raw USPTO, reactions classified by type (~1,000 named reaction categories)
  • Weaknesses: Commercial. Shares USPTO’s patent-derived biases. Condition data partial.

ORD (Open Reaction Database)

A community effort launched in 2021 (Kearnes et al., 2021, JACS) to create a standardized, open repository including conditions, yields, analytical data, and crucially, negative results.

  • Strengths: Standardized schema (Protocol Buffers), systematic conditions and outcomes, ML-ready, open-source
  • Weaknesses: Still small (~1M reactions as of early 2026). Coverage patchy — depends on voluntary contributions.

CAS SciFinder (American Chemical Society)

The largest curated reaction database, maintained by the Chemical Abstracts Service since 1907.

  • Strengths: Most comprehensive coverage, high curation quality, integrated substance and reference databases
  • Weaknesses: Highly restricted access — no bulk download for ML training. Essentially unavailable for model development.

3.2 Database Comparison

| Feature | USPTO | Reaxys | Pistachio | ORD | CAS SciFinder |
|---------|-------|--------|-----------|-----|---------------|
| Size | ~3.5M | ~130M | ~16M | ~1M | ~200M+ |
| Source | US patents | Literature | Patents | Contributed | Literature + patents |
| Condition data | Sparse, noisy | Structured, partial | Partial | Standardized, systematic | Structured |
| Yield data | Rare | Partial (~30%) | Partial | Systematic | Partial |
| Negative results | No | Rare | No | Yes (by design) | No |
| Reaction classification | No | Yes | Yes (NameRxn) | Partial | Yes |
| Access | Public, free | Commercial | Commercial | Public, free | Restricted |
| Noise level | High | Low-moderate | Moderate | Low | Low |
| ML-ready format | SMILES/RXN | Requires extraction | SMILES/RXN | Protocol Buffers | Requires extraction |

The critical observation: most datasets lack systematic reaction condition and yield data. A model trained on USPTO can learn to predict “does A + B give product C?” with reasonable accuracy. But it cannot reliably answer “under what conditions?” or “in what yield?” — because that information was never consistently recorded in the training data.
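Concretely, USPTO-style records are distributed as reaction SMILES strings of the form "reactants>agents>products". A minimal stdlib parser (the Suzuki example is illustrative) shows both what a typical record carries and what it omits:

```python
# Minimal parser for reaction SMILES, the format used by USPTO-derived sets.
# Format: "reactants>agents>products", with dot-separated molecules per field.
def parse_rxn_smiles(rxn: str):
    reactants, agents, products = rxn.split(">")
    split = lambda field: field.split(".") if field else []
    return {"reactants": split(reactants),
            "agents": split(agents),
            "products": split(products)}

# A Suzuki coupling as it might appear in a patent-derived record
rxn = "Brc1ccccc1.OB(O)c1ccccc1>>c1ccc(-c2ccccc2)cc1"
parsed = parse_rxn_smiles(rxn)
print(parsed["reactants"])  # ['Brc1ccccc1', 'OB(O)c1ccccc1']
print(parsed["agents"])     # [] -- catalyst, base, and solvent often absent
# Note what the record does NOT carry: temperature, time, atmosphere, yield.
```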


3.3 The Conditions-and-Yield Gap

Consider what a medicinal chemist actually needs versus what current models provide:

| What the chemist asks | What the model can answer | What is missing |
|-----------------------|---------------------------|-----------------|
| “What product do I get from A + B?” | Major product prediction (good) | Minor products, side reactions |
| “What conditions should I use?” | Generic defaults from literature | Substrate-specific optimization |
| “What yield can I expect?” | Rough binary (works/doesn’t) | Quantitative yield prediction |
| “Will this work on my substrate?” | Analogy to training data | OOD generalization confidence |
| “What could go wrong?” | Almost nothing | Failure mode prediction |

The gap between “this reaction is known” and “this reaction will work in your lab, under these conditions, at this yield” is where most of the practical value lies — and where data is scarcest.

The ORD is the most promising effort to close this gap — its schema captures inputs with quantities, conditions, outcomes, and crucially null outcomes. If the field can sustain and scale this effort, it will fundamentally change what synthesis AI models can learn. But for now, the landscape remains dominated by large, positive-only, condition-sparse databases.
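To make the contrast concrete, here is an illustrative ORD-inspired record in plain Python. The field names are invented for this sketch — the real ORD schema is a Protocol Buffers definition with far richer structure — but the kinds of fields (inputs with quantities, conditions, an outcome that may be null) match what the text describes:

```python
# Illustrative sketch of the KINDS of fields an ORD-style record carries.
# Field names are invented for illustration; the real ORD schema is a
# Protocol Buffers definition, not a Python dict.
failed_run = {
    "inputs": [
        {"smiles": "Brc1ccccc1",    "amount": {"value": 1.0, "units": "mmol"}},
        {"smiles": "OB(O)c1ccccc1", "amount": {"value": 1.2, "units": "mmol"}},
    ],
    "conditions": {
        "catalyst": "Pd(OAc)2/XPhos",
        "solvent": "dioxane/water",
        "temperature_c": 100,
        "time_h": 18,
        "atmosphere": "N2",
    },
    "outcome": {
        "products": [],      # null result recorded, not discarded
        "conversion_pct": 0,
        "notes": "no product observed",
    },
}

# The negative example is a first-class data point, not dark matter:
is_negative = not failed_run["outcome"]["products"]
print(is_negative)  # True
```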


4. The Landscape of AI Synthesis: A Roadmap

The barriers described above — combinatorial conditions, exception-rich rules, missing negative data — are not problems that a single model can solve. They require a layered approach, where different AI capabilities build on one another.

This series maps four layers of AI-driven synthesis, each addressing a distinct question and depending on the layers below it.

  ┌─────────────────────────────────────────────────────────────┐
  │              AI-Driven Synthesis Landscape                  │
  │                                                             │
  │   Part 2                Part 3                Part 4        │
  │  ┌─────────────┐     ┌──────────────┐     ┌─────────────┐  │
  │  │  Reaction   │     │   Retro-     │     │ Synthesis-  │  │
  │  │ Prediction  │────▶│  synthesis   │────▶│   Aware     │  │
  │  │             │     │  Planning    │     │  Design     │  │
  │  │ "What       │     │ "How to      │     │ "Generate   │  │
  │  │  forms?"    │     │  build it?"  │     │  makeable   │  │
  │  │             │     │              │     │  molecules" │  │
  │  └─────────────┘     └──────────────┘     └──────┬──────┘  │
  │        │                    │                     │         │
  │        │                    │                     ▼         │
  │        │                    │              ┌─────────────┐  │
  │        └────────────────────┴─────────────▶│    Lab      │  │
  │                                            │ Execution   │  │
  │                                            │ (Part 5)    │  │
  │                                            │ "Does it    │  │
  │                                            │  work?"     │  │
  │                                            └─────────────┘  │
  └─────────────────────────────────────────────────────────────┘

Here is what each layer addresses and why the dependencies matter:

Part 2 — Reaction Prediction: “What forms?”

Given reactants and (optionally) conditions, predict the product. This is the forward model — the atomic operation of computational chemistry. Every retrosynthesis engine needs one. Three paradigms compete: template-based, sequence-to-sequence, and graph-based.

  • Core challenge: Generalizing beyond training data, especially to novel reaction types
  • Key models: Molecular Transformer (Schwaller et al., 2019, ACS Cent. Sci.), Electron Flow Matching (Joung et al., 2025, Nature)

Part 3 — Retrosynthesis Planning: “How to build it?”

Given a target molecule, propose a synthetic route — a tree of reactions leading back to purchasable starting materials. Requires single-step retrosynthetic models and multi-step search algorithms (MCTS, A*, beam search).

  • Core challenge: Evaluating route quality beyond step count — considering yield, cost, practicality
  • Key tools: ASKCOS (Coley et al., MIT), AiZynthFinder (Thakkar et al., AstraZeneca), Synthia (Grzybowski, UNIST/Allchemy)
  • Depends on Part 2: Each proposed retrosynthetic step must be validated by a forward model
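As a sketch of the search skeleton only — real planners such as ASKCOS and AiZynthFinder use learned single-step models with MCTS/A*-style search, not a lookup table — here is a toy best-first retrosynthesis over hand-written "templates". All molecule names, templates, and scores are hypothetical:

```python
# Toy retrosynthesis search: expand the highest-scoring partial route
# first, until every leaf is in the purchasable stock. Molecules are
# opaque strings; TEMPLATES stands in for a learned single-step model.
import heapq
from itertools import count

TEMPLATES = {  # product -> [(step score, precursors)], all hypothetical
    "drug":         [(0.9, ("intermediate", "amine")),
                     (0.4, ("hard_precursor",))],
    "intermediate": [(0.8, ("aryl_bromide", "boronic_acid"))],  # e.g. Suzuki
}
STOCK = {"amine", "aryl_bromide", "boronic_acid"}  # purchasable building blocks

def plan(target, max_depth=10):
    tie = count()  # tiebreaker so the heap never compares sets
    heap = [(-1.0, next(tie), frozenset({target}), [])]
    while heap:
        neg_score, _, open_mols, route = heapq.heappop(heap)
        if not open_mols:
            return route                 # all leaves purchasable: done
        if len(route) >= max_depth:
            continue
        mol = next(iter(open_mols))      # expand one unsolved molecule
        for step_score, precursors in TEMPLATES.get(mol, []):
            remaining = (open_mols - {mol}) | {p for p in precursors
                                               if p not in STOCK}
            heapq.heappush(heap, (neg_score * step_score, next(tie),
                                  remaining, route + [(mol, precursors)]))
    return None                          # no route within the depth limit

print(plan("drug"))
# [('drug', ('intermediate', 'amine')),
#  ('intermediate', ('aryl_bromide', 'boronic_acid'))]
```

Note what the score ignores: yield, cost, and practicality are exactly the route-quality signals the text flags as the core challenge.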

Part 4 — Synthesis-Aware Design: “Generate makeable molecules”

Embed synthesis constraints directly into the generative model, rather than generating first and filtering after. This is where molecular design meets synthesis planning — and where connections to co-folding models and bespoke libraries become concrete.

  • Core challenge: Balancing drug-likeness, target affinity, and synthetic accessibility simultaneously
  • Key approaches: SynFlowNet (Cretu et al., 2024, Mila), SynFormer (Gao et al., 2025, PNAS), REINVENT + synthesis filters (AstraZeneca)
  • Depends on Part 3: The generative model needs a retrosynthesis oracle to evaluate proposed molecules

Part 5 — Lab Execution: “Does it work?”

The ultimate test: can an AI-proposed route be executed in a physical lab? Covers automated synthesis platforms, CRO handoff, and the gap between computational proposals and bench reality.

  • Core challenge: Translating algorithmic instructions into physical procedures that work on real equipment
  • Key efforts: Chemify/XDL (Cronin, Glasgow), PostEra’s AI-to-CRO pipeline, ORD data standardization
  • Depends on Parts 2-4: Lab execution validates the entire upstream stack

5. Who Is Working on This: The Research Landscape

The field is structured around four research axes. Understanding who is working on what — and where axes intersect — provides the map we will navigate throughout this series.


Axis 1: Reaction Understanding and Prediction

Core question: “Can we accurately predict what a chemical reaction produces?”

| Research Group | Affiliation | Key Contributions | Direction |
|----------------|-------------|-------------------|-----------|
| Philippe Schwaller | EPFL | Molecular Transformer (2019), RXNMapper (2021), DRFP (2022), yield prediction | Building a “language model for reactions” — treating SMILES as natural language. Unified approach from atom mapping to fingerprints to yield |
| Connor Coley | MIT | Electron Flow Matching (Nature, 2025), reaction condition prediction | Mechanism-level prediction — modeling electron flow rather than just product SMILES. Bridging data-driven speed with QM-level understanding |
| Frank Glorius | U. Munster | Generalization benchmarks, out-of-distribution evaluation | Systematically testing whether models generalize to new reaction types. Designing evaluation protocols that expose overfitting to USPTO |

Key tension in this axis: Schwaller’s approach treats reactions as a language problem (sequence-to-sequence translation), while Coley’s Electron Flow Matching treats them as a physical process (electron redistribution). Both achieve strong benchmark results. The question is which approach degrades more gracefully on truly novel chemistry.


Axis 2: Retrosynthesis Planning

Core question: “Can we automatically design synthetic routes to target molecules?”

| Research Group | Affiliation | Key Contributions | Direction |
|----------------|-------------|-------------------|-----------|
| Connor Coley | MIT | ASKCOS platform, higher-level retro strategies (2026) | Moving beyond simple disconnections to strategy-level planning — protecting groups, functional group interconversions, convergent routes. “Think like a chemist” |
| AstraZeneca (Thakkar, Genheden et al.) | AstraZeneca | AiZynthFinder, RAscore, 3-year industrial deployment report | Industrial deployment at scale. Configurable expansion policies, stock management, integration with med chem workflows |
| Bartosz Grzybowski | UNIST / Allchemy | Synthia (Chematica), 50K+ expert-encoded rules | Expert knowledge + ML hybrid. Hand-coded rules for accuracy, ML for scalability. Has demonstrated total synthesis-level route planning |
| Marwin Segler | Microsoft Research | Early neural retrosynthesis (2017), Syntheseus framework (2024) | Unified benchmarking framework. Systematic comparison of retro models + search algorithms in all combinations |

Key tension in this axis: Rules vs. learning. Grzybowski’s Synthia encodes 50,000+ expert rules and achieves remarkable accuracy on complex targets, but scaling requires ongoing expert curation. Coley’s and AstraZeneca’s ML-based approaches learn from data and scale easily, but struggle with rare or novel disconnections. Segler’s Syntheseus framework enables fair comparison across these approaches.


Axis 3: Synthesis-Aware Molecular Generation

Core question: “Can generative models produce only molecules that are actually synthesizable?”

| Research Group | Affiliation | Key Contributions | Direction |
|----------------|-------------|-------------------|-----------|
| Yoshua Bengio group | Mila | GFlowNet framework, SynFlowNet (2024), RGFN (2024) | GFlowNet-based generation where the action space is defined by reactions and building blocks. Diversity + synthesizability by construction |
| Connor Coley / Wenhao Gao | MIT | SynFormer (PNAS, 2025), SynNet (2022), SCScore | Transformer-based synthetic pathway generation. SynFormer uses Transformer + diffusion for building block selection across local and global chemical space |
| AstraZeneca | AstraZeneca | REINVENT, Lib-INVENT (2022) | RL-based molecular generation with synthesis filters. Library-focused design for combinatorial chemistry. Industrial-scale deployment |
| CGFlow group | Various | CGFlow (2025) | Structure-based SynFlowNet with co-folding confidence as reward. First integration of synthesis-aware generation with protein-ligand co-folding |

Key tension in this axis: Two competing architectures are emerging:

  • GFlowNet-based (Bengio/Mila): Actions = reactions + building blocks. Synthesizability is guaranteed by construction. Strong on diversity. Scales with available reaction templates.
  • Transformer-based (Coley/MIT): SynFormer generates full synthetic pathways directly. Greater flexibility and scalability. Synthesizability is a learned property rather than a hard constraint.

Both approaches converge on the same principle: synthesis constraints should be embedded at generation time, not applied as a post-hoc filter.
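A toy sketch of that principle: if the generator's only actions are (reaction, building block) pairs, every output carries its own route by construction. The "chemistry" below is purely symbolic string composition, and the random policy stands in for a learned one (SynFlowNet-style systems learn which actions to take; nothing here reflects their actual APIs):

```python
# Toy "synthesizable by construction" generator: the action space is
# (reaction template, building block) pairs, so every generated molecule
# ships with its own route. Reactions here are symbolic placeholders.
import random

BLOCKS = ["acid_A", "acid_B", "amine_X", "amine_Y", "boronate_Z"]
REACTIONS = {
    "amide":  lambda a, b: f"amide({a},{b})"  if "acid" in a and "amine" in b else None,
    "suzuki": lambda a, b: f"biaryl({a},{b})" if "boronate" in b else None,
}

def generate(n_steps, rng):
    mol = rng.choice(BLOCKS)
    route = [("start", mol)]
    for _ in range(n_steps):
        rxn = rng.choice(list(REACTIONS))
        block = rng.choice(BLOCKS)
        product = REACTIONS[rxn](mol, block)
        if product is not None:      # only legal actions extend the state
            mol = product
            route.append((rxn, block))
    return mol, route                # the molecule AND how to make it

mol, route = generate(n_steps=3, rng=random.Random(0))
print(mol)
print(route)  # every entry is a legal step, by construction
```

A post-hoc filter would instead generate arbitrary structures and discard the unmakeable ones; here unmakeable states are simply unreachable.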


Axis 4: Lab Execution and Automation

Core question: “Can AI-proposed synthesis be automatically executed in a physical lab?”

| Research Group / Company | Affiliation | Key Contributions | Direction |
|--------------------------|-------------|-------------------|-----------|
| Lee Cronin | U. Glasgow | Chemify platform, XDL (Chemical Description Language) | Programming language for synthesis. XDL makes procedures machine-readable and robot-executable. Demonstrated automated synthesis of complex molecules |
| PostEra | Startup (US) | COVID Moonshot, AI-to-CRO pipeline | Largest open-science AI synthesis experiment. Demonstrated end-to-end AI retrosynthesis to CRO execution at scale |
| Insilico Medicine | Startup (HK) | Chemistry42, Rentosertib (Phase IIa) | End-to-end validation from AI target identification through molecule generation to synthesis and clinical trials |
| ORD consortium | Multi-institution | Open Reaction Database, standardized reaction schema | Data infrastructure. Standardizing how reaction conditions, outcomes, and failures are recorded — enabling the next generation of condition-aware models |

Key observation: The bottleneck in this axis is not algorithms but standards and integration. Cronin’s XDL, the ORD’s data schema, and PostEra’s CRO handoff experience address different pieces of the same infrastructure gap: how to translate computational proposals into physical experiments reproducibly.


Cross-Axis Map

The four axes are not independent. Research groups and their outputs flow across boundaries:

               Axis 1: Reaction Prediction
              ┌─────────────────────────────┐
              │  Schwaller (EPFL)           │
              │  Coley (MIT) ───────────────────┐
              │  Glorius (Munster)          │   │
              └──────────────┬──────────────┘   │
                             │ forward models   │
                             │ feed into        │
                             ▼                  │
               Axis 2: Retrosynthesis           │  Coley spans
              ┌─────────────────────────────┐   │  Axes 1-3
              │  Coley (MIT) ───────────────────┤
              │  AstraZeneca                │   │
              │  Grzybowski (UNIST)         │   │
              │  Segler (Microsoft)         │   │
              └──────────────┬──────────────┘   │
                             │ route evaluation │
                             │ enables          │
                             ▼                  │
               Axis 3: Synthesis-Aware Design   │
              ┌─────────────────────────────┐   │
              │  Bengio group (Mila)        │   │
              │  Coley / Gao (MIT) ─────────────┘
              │  AstraZeneca                │
              │  CGFlow group               │
              └──────────────┬──────────────┘
                             │ generates routes
                             │ that must be
                             ▼
               Axis 4: Lab Execution
              ┌─────────────────────────────┐
              │  Cronin (Glasgow)           │
              │  PostEra                    │
              │  Insilico Medicine          │
              │  ORD consortium             │
              └─────────────────────────────┘

Three observations stand out:

  • Connor Coley (MIT) spans Axes 1-3 — from Electron Flow Matching through ASKCOS to SynFormer. This breadth positions his group uniquely for vertically integrated synthesis AI.
  • AstraZeneca leads in industrial deployment across Axes 2-3. Their 3-year AiZynthFinder retrospective (Genheden et al., 2024, J. Cheminform.) is the most detailed account of retrosynthesis AI in production.
  • The Mila/Bengio group introduced GFlowNet — a framework that natively encodes synthesis constraints. SynFlowNet and RGFN shift the paradigm from “generate then filter” to “generate within constraints.” CGFlow extends this to structure-based objectives, connecting to co-folding models traced in our Protein AI series.

Looking Ahead

This post has established three claims:

  1. The synthesis bottleneck is the rate-limiting step of AI-driven drug discovery. Accelerating Design further yields diminishing returns; the leverage is in Make.
  2. Chemistry resists AI for structural reasons — combinatorial conditions, exception-rich rules, and the near-total absence of negative data — not merely for lack of effort.
  3. The data landscape is large but lopsided. Millions of reactions are recorded, but systematic condition and yield data remain scarce, and failed reactions are nearly invisible.

These barriers are real, but they are not insurmountable. In the next four parts, we will examine the tools researchers are building to overcome them — layer by layer, from predicting individual reactions to executing full routes in automated laboratories.

In Part 2, we examine the first building block: reaction prediction — can AI reliably predict what happens when molecules meet reagents under specified conditions?


Next: Part 2 — Reaction Prediction: Can AI Predict What Chemistry Will Do?

This post is licensed under CC BY 4.0 by the author.