Harness Engineering Part 3: In Practice — Tools, Patterns, and Starting Points

Harness Engineering: From Concept to Practice

This is Part 3 of a 3-part series on Harness Engineering.


Tomorrow Morning, You Open Your Editor

You’ve read Part 1 on why harness engineering matters.

You’ve read Part 2 on how AI can design its own harness.

Now it’s Monday morning. You have an agent to ship. Your Slack is pinging. You open your editor.

What do you actually type?

This post is the practical toolkit. We’ll walk through the OSS landscape, how to pick a framework, deep dives into LangGraph, revfactory/harness, gstack, and OpenHarness, five real-world scenarios you’re probably facing, how to evaluate what you build, and a 5-week roadmap to become competent at this.

No more theory. Let’s build.


Core Principles Cheat Sheet

Pin this to your wall. These principles come from Anthropic’s engineering team plus the Meta-Harness research we covered in Part 2.

The Big Three

  1. Simplicity first. Start with LLM API calls. Add structure only when measured failure demands it.
  2. Transparency. Every decision, every tool call, every reasoning step — log it. Invisible agents are undebuggable agents.
  3. ACI (Agent-Computer Interface). Tool docs deserve as much care as UX docs. A tool with a bad description is a broken tool.
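
To make the ACI point concrete, here is what careful tool documentation can look like in an Anthropic-style tool definition. The tool itself (`search_tickets`) and all of its fields are hypothetical — the point is the description, which tells the agent when to use the tool, what the parameters mean, and what comes back:

```python
# Hypothetical tool definition, sketched in the Anthropic tool-use style.
# The description carries the weight: when to use the tool, parameter
# semantics, and the shape of the result (including the empty case).
search_tickets = {
    "name": "search_tickets",
    "description": (
        "Search open support tickets by keyword. "
        "Use this BEFORE creating a ticket, to avoid duplicates. "
        "Returns at most `limit` results, newest first. "
        "Returns an empty list (not an error) when nothing matches."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Keywords matched against ticket titles and bodies.",
            },
            "limit": {
                "type": "integer",
                "description": "Maximum results to return (default 10).",
            },
        },
        "required": ["query"],
    },
}
```

Compare this with a bare description like "Searches tickets" — the agent has no way to know it should search before creating, or that an empty result is normal rather than a failure.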

Do / Don’t

Do

  • Separate your planner, generator, and evaluator into different agents
  • Use files as source of truth, not context window
  • Make subjective goals measurable before building
  • Log everything (you learned this in Part 2)
  • Set budgets — max iterations, max tokens, max time
  • Include human checkpoints for high-stakes decisions
  • Test removal of harness components periodically

Don’t

  • Let agents self-evaluate critical work
  • Build agents when workflows suffice
  • Let context grow unbounded
  • Trust frameworks blindly without understanding the primitives
  • Add agents prematurely
  • Skip tool documentation — agents read it

The Workflow vs Agent Decision

Before you pick a framework, decide what kind of system you’re building.

                  ┌──────────────────────────────┐
                  │ Do you know the steps in     │
                  │ advance?                     │
                  └──────┬──────────────┬────────┘
                         │ Yes          │ No
                         ▼              ▼
              ┌──────────────────┐  ┌──────────────────┐
              │ Can one LLM call │  │ Does branching   │
              │ handle it?       │  │ depend on runtime│
              └──┬────────────┬──┘  │ feedback?        │
                 │ Yes        │ No  └──┬────────────┬──┘
                 ▼            ▼        │ Yes        │ No
              ┌──────┐  ┌──────────┐   ▼            ▼
              │Single│  │Workflow  │ ┌──────┐  ┌──────────┐
              │ call │  │(5 ptns)  │ │Agent │  │Workflow  │
              └──────┘  └──────────┘ │(full)│  │(router)  │
                                     └──────┘  └──────────┘

Rule of thumb: If you can draw a flowchart of the steps before you run it, you want a workflow. If you can’t, you want an agent.

Tip: Most “I need an agent” projects are actually “I need a workflow with a router and an evaluator.”


The 3-Layer OSS Landscape

The harness tooling ecosystem is starting to stratify into clear layers.

┌─────────────────────────────────────────────────────────────┐
│  LAYER 3: Meta / Auto-Harness                               │
│  ─────────────────────────────────                          │
│  • gstack — opinionated sprint workflow (23 skills)         │
│  • revfactory/harness (Claude Code plugin)                  │
│  • harness-100 (template library)                           │
│  • Meta-Harness (research, self-improvement loops)          │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│  LAYER 2: High-Level Agent Frameworks                       │
│  ─────────────────────────────────                          │
│  • CrewAI — role-playing agents + flows                     │
│  • AutoGen — conversational multi-agent                     │
│  • OpenAI Swarm — lightweight handoff                       │
│  • LangChain Agents — high-level abstractions               │
│  • Claude Code — agentic CLI                                │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│  LAYER 1: Low-Level Orchestration                           │
│  ─────────────────────────────────                          │
│  • LangGraph — StateGraph + checkpointing + HITL            │
│  • OpenHarness — lightweight open-source agent harness      │
│  • Temporal — durable workflow orchestration                │
│  • Prefect — dataflow orchestration                         │
│  • Raw SDK (Anthropic, OpenAI) — direct primitives          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Direction of dependency: Layer 2 sits on top of Layer 1 conceptually; Layer 3 sits on top of Layer 2.

Rule of thumb for starting out:

  • Experimenting? Start at Layer 1 (raw SDK or LangGraph).
  • Building production? Layer 1 (LangGraph) or Layer 2 (CrewAI/AutoGen).
  • Shipping fast in Claude Code? Layer 3 (revfactory/harness).

Common Pitfall: Picking Layer 2 without understanding Layer 1. When CrewAI does something weird, you need to understand what’s happening underneath.


Framework Selection Matrix

| Framework | Layer | Philosophy | Best for | Avoid if |
|---|---|---|---|---|
| Raw SDK | L1 | No framework | Learning, prototyping | You need state, retries |
| LangGraph | L1 | Low-level orchestration | Production stateful agents | You want fast DX |
| Temporal | L1 | Code-as-workflow | Long-running, durability critical | Task is short/stateless |
| AutoGen | L2 | Conversational multi-agent | Agents that “talk to each other” | You want tight control |
| CrewAI | L2 | Role metaphor | Non-developer-friendly team design | You need low-level control |
| OpenAI Swarm | L2 | Minimal handoff | Learning, simple multi-agent | Production systems |
| Claude Code | L2/3 | Agentic CLI | Coding tasks, file ops | Non-coding workflows |
| OpenHarness | L1 | Lightweight open-source harness | Multi-provider agents, research | Need battle-tested production infra |
| revfactory/harness | L3 | Meta-skill for Claude Code | Quick team-of-agents prototyping | Non-Claude-Code users |
| gstack | L3 | Opinionated sprint workflow | Shipping production code in Claude Code | Non-standard dev workflows |

Quick Decision Tree

What environment are you in?
│
├── Already using Claude Code?
│     │
│     ├── Shipping production code (plan → build → QA → deploy)?
│     │     └─→ gstack (opinionated sprint workflow)
│     │
│     ├── Need domain-specific agent teams?
│     │     └─→ revfactory/harness (auto-generated teams)
│     │
│     └── Unsure?
│           └─→ Start with gstack for shipping, harness for exploration
│
├── Building production customer-facing?
│     │
│     ├── Needs HITL + checkpoints?
│     │     └─→ LangGraph
│     │
│     └── Multi-agent conversational?
│           └─→ AutoGen or CrewAI
│
├── Want multi-provider flexibility (open source)?
│     │
│     └─→ OpenHarness (Claude + OpenAI + GitHub Copilot)
│
├── Research / experimentation?
│     │
│     └─→ Raw SDK, then LangGraph or OpenHarness
│
└── Non-developer stakeholders in the loop?
      │
      └─→ CrewAI (roles are readable)

Deep Dive 1: LangGraph

LangGraph is the low-level orchestration framework of choice for production stateful agents. Companies like Klarna, Uber, and J.P. Morgan use it in production.

Core Concepts

  • StateGraph: your agent is a graph where nodes are functions and edges are control flow
  • Nodes: processing functions that read/write shared state
  • Edges: static or conditional, dictate flow between nodes
  • Checkpointing: every state change saved, agent can resume from failure
  • Human-in-the-loop (HITL): pause at any node, wait for human input, resume

Minimal Example

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from typing import TypedDict

class State(TypedDict):
    messages: list
    plan: str
    output: str

def planner(state: State) -> State:
    # LLM call that produces a plan
    state["plan"] = generate_plan(state["messages"])
    return state

def executor(state: State) -> State:
    # LLM call that executes the plan
    state["output"] = execute(state["plan"])
    return state

def should_continue(state: State) -> str:
    if needs_more_work(state["output"]):
        return "executor"
    return END

graph = StateGraph(State)
graph.add_node("planner", planner)
graph.add_node("executor", executor)
graph.add_edge(START, "planner")
graph.add_edge("planner", "executor")
graph.add_conditional_edges("executor", should_continue)

app = graph.compile(checkpointer=MemorySaver())

Why Teams Pick LangGraph

  • Durable execution: failures don’t lose state; resume from checkpoint
  • HITL built in: pause and wait for human approval at any node
  • Memory: short-term (thread) + long-term (cross-session)
  • Production deployment: mature infra, observability integrations

When LangGraph Is Overkill

If your whole app is a single LLM call and a retry loop, you don’t need a graph framework. Use the raw SDK.
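
The "single call and a retry loop" shape needs almost no machinery. A minimal sketch, using a stub in place of a real SDK request (in real code, `call` would wrap something like a messages-create request):

```python
import time

def call_with_retry(call, max_attempts=3, base_delay=1.0):
    """Run one LLM call with exponential backoff.
    `call` is any zero-argument function, e.g. a wrapped SDK request."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of budget: surface the failure
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Stub standing in for a real SDK call; fails twice, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

print(call_with_retry(flaky_call, base_delay=0.01))  # → ok
```

If this is all your application needs, a graph framework only adds surface area.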

Try this: Re-implement Anthropic’s 5 workflow patterns from Part 1 as LangGraph StateGraphs. It’s the fastest way to learn.


Deep Dive 2: revfactory/harness + harness-100

If you use Claude Code, revfactory/harness is worth 30 minutes of your time. It’s a meta-skill plugin that automatically designs domain-specific agent teams.

Installation

Via Claude Code marketplace:

/plugin marketplace add revfactory/harness
/plugin install harness@harness

Direct install:

cp -r skills/harness ~/.claude/skills/harness

Required environment variable:

export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1

How You Use It

Just ask in natural language:

"Build a harness for this project"
"Design an agent team for API documentation"
"Set up a harness for code review with parallel specialists"

The plugin will analyze your domain, choose an orchestration pattern, and generate agents + skills.

The 6 Orchestration Patterns

The plugin supports six patterns for team composition:

| Pattern | Flow | Best for |
|---|---|---|
| Pipeline | A → B → C | Sequential dependent tasks |
| Fan-out / Fan-in | A → {B1, B2, B3} → C | Parallel independent work |
| Expert Pool | Dispatcher → chosen expert | Context-dependent specialization |
| Producer–Reviewer | Producer → Reviewer (loop) | Quality gates |
| Supervisor | Central agent distributes | Dynamic task allocation |
| Hierarchical Delegation | Top-down recursive | Deep task decomposition |
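
The fan-out / fan-in pattern can be sketched with ordinary stdlib concurrency. The specialist functions below are stubs standing in for real agent calls; the structure — parallel independent workers, then one aggregation step — is the pattern itself:

```python
from concurrent.futures import ThreadPoolExecutor

# Stub specialists; in a real harness each would be its own agent call.
def security_review(code):   return "security: no injection found"
def perf_review(code):       return "perf: no hot loops"
def arch_review(code):       return "arch: layering is clean"

def fan_out_fan_in(code, specialists):
    """Fan-out: run independent specialists in parallel.
    Fan-in: merge their findings into a single report (the 'C' node)."""
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        # pool.map preserves input order, so the report is deterministic
        findings = list(pool.map(lambda s: s(code), specialists))
    return "\n".join(findings)

report = fan_out_fan_in("def f(): ...",
                        [security_review, perf_review, arch_review])
print(report)
```

The same skeleton covers Expert Pool if you replace the parallel map with a dispatcher that picks one specialist.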

What It Generates

After invocation, the plugin creates this structure:

.claude/
├── agents/
│   ├── analyst.md       ← domain analysis agent
│   ├── builder.md       ← implementation agent
│   └── qa.md            ← quality verification agent
└── skills/
    ├── analyze/
    │   └── SKILL.md     ← scoped skill for analyst
    └── build/
        └── SKILL.md     ← scoped skill for builder

Each agent is a Markdown file with role, responsibilities, and tools. Each skill is a focused capability the agent can invoke.
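
As a rough illustration, a generated agent file might look like the following. This is a hypothetical sketch of the shape — the plugin's actual output fields may differ:

```markdown
# Agent: analyst

## Role
Analyze the project domain and produce a structured findings report.

## Responsibilities
- Read the codebase and summarize architecture decisions
- Flag risks for the builder agent to address
- Write findings to `analysis.md` (files as source of truth)

## Tools
- Read, Grep, Glob

## Skill
- analyze (see skills/analyze/SKILL.md)
```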

harness-100: Template Library

revfactory/harness-100 is a companion repo with 100 production-ready harness templates across 10 domains:

  • Content creation (webtoon, YouTube, blog series)
  • Software development (code review, refactoring, debugging)
  • AI/data (pipeline design, evaluation, dataset curation)
  • Healthcare
  • Marketing
  • Research synthesis
  • API documentation
  • And more

Use it as a reference, not as something to copy-paste blindly. Each template shows how a real harness was composed — which pattern, which agents, which skills.

Reported Performance

A/B testing by the author showed:

  • +60% quality improvement (79.3 vs 49.5 baseline)
  • 100% win rate across 15 software engineering tasks
  • Effectiveness scales with task complexity

Caveat: token consumption is significant. Users with smaller Claude Max plans reported hitting limits quickly. Budget accordingly.


Deep Dive 3: gstack — The Sprint Workflow Harness

If revfactory/harness is “generate an agent team for any domain,” gstack is “here’s the exact team you need to ship production code.” Built by YC CEO Garry Tan, who used it to ship 600,000+ lines of code in 60 days while running YC full-time. 64.8k GitHub stars.

Philosophy

gstack treats the software development cycle as a structured sprint:

Think → Plan → Build → Review → Test → Ship → Reflect

Each stage has dedicated skills that embody specific professional roles — not a generalist assistant, but a CEO reviewer, an engineering lead, a QA tester, a security officer. The model isn’t doing everything; each skill constrains the agent to one perspective.

23 Specialized Skills

| Stage | Skills | What they do |
|---|---|---|
| Plan | /office-hours, /plan-ceo-review, /plan-eng-review, /autoplan | Product discovery, architecture validation, sprint planning |
| Design | /design-consultation, /design-shotgun, /design-html | Design systems, variant generation, production HTML |
| Build | /browse, /codex, /investigate | Persistent browser automation, multi-AI code review, debugging |
| QA | /qa | Real browser testing, atomic commits per bug, health scoring |
| Review | /review | Multi-stage PR analysis, parallel specialist review, adversarial review (Claude + Codex) |
| Security | /cso | 15-phase vulnerability scan (OWASP + STRIDE) |
| Ship | /ship, /land-and-deploy, /canary | Pre-merge automation, merge-to-prod, staged rollout |
| Reflect | /retro | Weekly metrics across projects |

The Killer Feature: Persistent Browser QA

gstack runs a long-lived headless Chromium daemon — not a fresh browser per command. This means:

  • 100-200ms command execution (vs. 2-3 seconds for fresh launches)
  • Persistent login state — the QA agent stays authenticated across tests
  • Real UI interactions — clicks, forms, navigation, not mocked endpoints
  • Before/after diffs — visual regression detection
┌─────────────────────────────────────────────────────┐
│                     gstack /qa                      │
│                                                     │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     │
│  │ Browse   │────→│ Find bug │────→│ Fix +    │     │
│  │ real app │     │ (UI/UX)  │     │ atomic   │     │
│  │ (100ms)  │     │          │     │ commit   │     │
│  └──────────┘     └──────────┘     └────┬─────┘     │
│                                         │           │
│                                         ▼           │
│                                    [Next bug]       │
│                                    loop until       │
│                                    health score OK  │
└─────────────────────────────────────────────────────┘

Adversarial Code Review

The /review skill doesn’t just run one AI reviewer — it dispatches independent Claude + Codex reviewers, then deduplicates findings. Different models catch different blind spots. This is defense-in-depth for code quality.

When gstack Fits

  • Individual developers or small teams shipping production code — the sprint workflow matches real eng practice
  • You want strong opinions — gstack tells you how to work, not just what tools to use
  • You need real QA — persistent browser automation is unique in this space

When gstack Doesn’t Fit

  • Non-standard development workflows — the 23 skills are opinionated and may clash
  • You’re not on Claude Code — gstack is tightly coupled to the Claude Code runtime
  • You need domain-specific agents — gstack is for software development, not arbitrary domains (use revfactory/harness for that)

gstack vs revfactory/harness

|  | gstack | revfactory/harness |
|---|---|---|
| Approach | Prescriptive: “follow this sprint” | Generative: “describe your domain, I’ll design a team” |
| Skills | 23 pre-built, software-dev focused | Auto-generated per domain |
| Customization | Templates (.tmpl) | Natural language prompts |
| Strength | Shipping code fast with quality gates | Flexible agent composition for any domain |
| Stars | 64.8k | 2.5k |

They’re complementary, not competing. Use gstack for your daily coding workflow. Use revfactory/harness when you need a custom agent team for a non-coding domain.


Deep Dive 4: OpenHarness — The Lightweight Open-Source Reference

OpenHarness (from HKUDS at the University of Hong Kong) is the first open-source implementation of the formal “Agent Harness” pattern defined in recent research (Meta-Harness, Natural-Language Agent Harnesses). It delivers Claude-Code-like functionality in ~11,700 lines of Python — roughly 3% of Claude Code’s codebase.

Why It Matters

OpenHarness is not trying to replace Claude Code or LangGraph. It’s a reference implementation that answers: “What does a minimal, functional agent harness actually look like in code?”

If LangGraph is the production-grade framework and Claude Code is the polished product, OpenHarness is the textbook implementation — readable, hackable, and provider-agnostic.

Architecture (10 Subsystems)

┌──────────────────────────────────────────────┐
│               OpenHarness Core               │
│                                              │
│  ┌─────────┐  ┌───────────┐  ┌──────────┐    │
│  │ Engine  │  │  Toolkit  │  │  Skills  │    │
│  │ (loop)  │  │(43+ tools)│  │ (40+ md) │    │
│  └────┬────┘  └─────┬─────┘  └────┬─────┘    │
│       └─────────────┼─────────────┘          │
│                     ▼                        │
│  ┌─────────┐  ┌───────────┐  ┌──────────┐    │
│  │ Plugins │  │Permissions│  │ Memory & │    │
│  │         │  │           │  │  State   │    │
│  └─────────┘  └───────────┘  └──────────┘    │
│                     ▼                        │
│  ┌─────────┐  ┌───────────┐  ┌──────────┐    │
│  │  Swarm  │  │   Hooks   │  │ Config & │    │
│  │ (multi) │  │           │  │    UI    │    │
│  └─────────┘  └───────────┘  └──────────┘    │
└──────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    Anthropic       OpenAI-        GitHub
    (Claude)      compatible       Copilot
                  (DeepSeek,
                   Ollama, ...)

Multi-Provider: The Key Differentiator

Unlike Claude Code (Anthropic-only) or gstack (Claude Code-dependent), OpenHarness supports:

  • Anthropic format: Claude, Moonshot/Kimi, Bedrock, Vertex
  • OpenAI format: OpenAI, DeepSeek, Groq, Ollama, local instances
  • GitHub Copilot: OAuth device flow, no API keys required

This makes it the only harness in this list that lets you switch providers without changing your harness code. For cost-conscious teams or researchers benchmarking across models, this is significant.

What’s Good

  • Code density: 11,700 lines for near-parity with Claude Code is remarkable engineering
  • Anthropic ecosystem compatibility: CLAUDE.md files, skills, and plugins work unchanged
  • Production signals: 114+ tests, exponential backoff, token/cost tracking, permission model
  • Hackability: Small codebase means you can read and modify the entire harness

What’s Early

  • v0.1.0 (released April 1, 2026) — 5,450 stars but only days old
  • No published benchmarks against Claude Code or LangGraph
  • Multi-agent patterns exist (Swarm subsystem) but lack real-world examples
  • Documentation is solid at README level but thin on architecture deep-dives

When to Use OpenHarness

  • You want to understand harness internals — read the source as a learning exercise
  • You need multi-provider flexibility — swap between Claude, GPT, DeepSeek, Ollama
  • You’re extending the harness for research — small codebase means fast iteration
  • You want Claude-Code-like features without vendor lock-in

When Not to Use OpenHarness

  • You need battle-tested production infrastructure — use LangGraph
  • You want a polished UX — use Claude Code directly
  • You need a rich ecosystem of integrations — LangChain/LangGraph has far more

Watch this space: If OpenHarness delivers on its benchmarks and the community grows, it could become the “SQLite of agent harnesses” — the small, embeddable, reliable option that everyone reaches for when they don’t need a full framework.


5 Real-World Scenarios (And What to Use)

Here are the scenarios you’re probably facing, matched to tools and patterns.

Scenario A: Small Task, 1-2 Agents

Example: “Summarize this document and classify it into one of 5 categories.”

Recommendation: Raw SDK + prompt chaining.

[Input] → [Summarize] → [Classify] → [Output]
  • Tools: Anthropic/OpenAI SDK directly
  • Pattern: Prompt chaining
  • Don’t: Reach for a framework. You don’t need it yet.
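
The whole scenario fits in a few lines. A minimal prompt-chaining sketch, with a stub standing in for the SDK call (real code would call the Anthropic or OpenAI client instead):

```python
# Stub LLM; a real implementation would call the Anthropic/OpenAI SDK here.
def call_llm(prompt):
    if prompt.startswith("Summarize"):
        return "A short summary."
    return "billing"

CATEGORIES = ["billing", "bug", "feature", "account", "other"]

def summarize_and_classify(document):
    # Step 1 of the chain: summarize the raw document
    summary = call_llm(f"Summarize this document:\n{document}")
    # Step 2: classify the summary, constrained to known labels
    label = call_llm(
        f"Classify into one of {CATEGORIES}:\n{summary}"
    ).strip().lower()
    if label not in CATEGORIES:  # cheap gate between chain links
        label = "other"
    return summary, label

print(summarize_and_classify("Invoice #42 was charged twice."))
```

Two calls, one validation gate between them. When this stops being enough, you'll know exactly which step is failing — and that's the signal to add structure.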

Scenario B: Medium Team, 3-5 Agents

Example: “Write a technical blog post that’s researched, drafted, fact-checked, and edited.”

Recommendation: CrewAI or AutoGen with Producer-Reviewer.

[Researcher] → [Writer] → [Fact-checker] → [Editor]
                              │                │
                              └── feedback ────┘
  • Tools: CrewAI (readable roles) or AutoGen (conversational)
  • Pattern: Orchestrator-workers + evaluator-optimizer
  • Don’t: Let Writer be the Fact-checker. Separate them.

Scenario C: Long-Running Stateful

Example: “Customer support agent that handles multi-day escalations with human approval gates.”

Recommendation: LangGraph with checkpointing and HITL.

[Triage] → [Route] → [Handler] → [Human approval] → [Resolve]
                         ▲                             │
                         └── retry if rejected ────────┘
  • Tools: LangGraph (mandatory for durability), LangSmith for tracing
  • Pattern: Router + HITL
  • Don’t: Try to keep state in prompts. Use checkpointer.
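
The "state in a checkpointer, not in prompts" idea is worth seeing in miniature. The sketch below is not LangGraph's API — it's a stdlib-only stand-in that shows the mechanic: persist state after every node, so a crashed multi-day run resumes where it stopped instead of replaying everything:

```python
import json
import tempfile
from pathlib import Path

class FileCheckpointer:
    """Minimal stand-in for a real checkpointer: persist state after
    every step so a crashed run resumes where it left off."""
    def __init__(self, path):
        self.path = Path(path)
    def save(self, state):
        self.path.write_text(json.dumps(state))
    def load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"step": 0, "notes": []}

STEPS = ["triage", "route", "handler", "human_approval", "resolve"]

def run(ckpt, fail_at=None):
    state = ckpt.load()                  # resume from the last checkpoint
    for i in range(state["step"], len(STEPS)):
        if fail_at == i:
            raise RuntimeError("simulated crash")
        state["notes"].append(STEPS[i])  # do the node's work
        state["step"] = i + 1
        ckpt.save(state)                 # checkpoint after each node
    return state

ckpt = FileCheckpointer(Path(tempfile.gettempdir()) / "escalation.json")
ckpt.save({"step": 0, "notes": []})
try:
    run(ckpt, fail_at=3)                 # crash before human approval
except RuntimeError:
    pass
state = run(ckpt)                        # resume: only remaining steps run
print(state["notes"])  # → ['triage', 'route', 'handler', 'human_approval', 'resolve']
```

LangGraph's checkpointer does this per graph node, with thread-scoped state and built-in pause points for the human approval gate.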

Scenario D: Claude Code Native

Example: “I’m already in Claude Code, I need to ship a feature with plan, build, QA, and review.”

Recommendation: gstack for the sprint workflow. Use /autoplan → build → /qa → /review → /ship.

[/autoplan] → [Build] → [/qa (browser)] → [/review (adversarial)] → [/ship]
                             │                      │
                             └── fix + atomic ───────┘
                                 commit per bug
  • Tools: gstack (23 skills covering the full sprint cycle)
  • Pattern: Pipeline with evaluator-optimizer loops at QA and review stages
  • Don’t: Skip /qa. The persistent browser automation catches real UI bugs that unit tests miss.

Alternative: If your task is a custom domain (not standard software dev), use revfactory/harness to generate a domain-specific agent team instead.

[Auditor]
    ├── dispatch → [Security specialist]
    ├── dispatch → [Performance specialist]
    ├── dispatch → [Architecture specialist]
    └── merge ────── [Report aggregator]
  • Tools: revfactory/harness + harness-100 templates
  • Pattern: Fan-out / Fan-in

Scenario E: Research / Experimental

Example: “I want to find the best harness for my task. Let the AI figure it out.”

Recommendation: Meta-Harness-style loop (covered in Part 2).

[Baseline] → [Evaluate] → [Log to filesystem] → [Proposer reads history]
                                                         │
                                                         ▼
                                                  [New candidate]
                                                         │
                                                         └── loop
  • Tools: Claude Code as proposer, simple eval harness, filesystem logging
  • Pattern: Self-improvement loop
  • Don’t: Underestimate token cost. This is expensive per iteration.

Evaluation and A/B Testing Your Harness

A harness you can’t evaluate is a harness you can’t improve. Here’s the minimum evaluation setup.

Quantitative Metrics

  • Task completion rate — did the harness finish the task correctly?
  • Average iterations to completion — did it converge efficiently?
  • Token cost per successful task — what does each success actually cost?
  • Failure mode distribution — where does it break?
  • Wall-clock time — especially for long-running scenarios
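
Computing the first three metrics takes a few lines once runs are logged as structured records. A minimal sketch, assuming each run record carries `success`, `iterations`, and `tokens` fields (the record shape is an assumption, not a standard):

```python
def harness_metrics(runs):
    """Compute core quantitative metrics from a list of run records,
    each shaped like {"success": bool, "iterations": int, "tokens": int}."""
    successes = [r for r in runs if r["success"]]
    return {
        "completion_rate": len(successes) / len(runs),
        "avg_iterations": sum(r["iterations"] for r in successes) / len(successes),
        # Total spend divided by successes: failed runs still cost tokens.
        "tokens_per_success": sum(r["tokens"] for r in runs) / len(successes),
    }

runs = [
    {"success": True,  "iterations": 2, "tokens": 3000},
    {"success": True,  "iterations": 4, "tokens": 7000},
    {"success": False, "iterations": 6, "tokens": 9000},
]
print(harness_metrics(runs))
```

Note the denominator choice in `tokens_per_success`: charging failures against successes is what reveals the true cost of an unreliable harness.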

Qualitative Metrics

  • Human preference in A vs B comparisons
  • Readability of produced artifacts
  • Recoverability from errors

A/B Test Protocol

1. Define baseline harness H0
2. State hypothesis: "Splitting planner from generator → +X% success"
3. Build variant H1 (minimal change from H0)
4. Run H0 and H1 each N times on the same task set
5. Paired comparison on identical tasks
6. Compute 95% confidence interval on the delta
7. If significant, promote H1. If not, discard.
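
Steps 5–6 of the protocol can be sketched with the stdlib. This uses a normal-approximation interval on the paired deltas — a simplification; for small N a paired t-test is more rigorous. The scores below are made-up illustration data:

```python
import math
import statistics

def paired_delta_ci(scores_h0, scores_h1, z=1.96):
    """95% normal-approximation CI on the mean paired delta (H1 - H0).
    Inputs are per-task scores on the SAME task set, in the same order."""
    deltas = [b - a for a, b in zip(scores_h0, scores_h1)]
    mean = statistics.mean(deltas)
    se = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return mean - z * se, mean + z * se

# Made-up per-task success scores for baseline H0 and variant H1
h0 = [0.60, 0.55, 0.70, 0.50, 0.65, 0.58, 0.62, 0.54]
h1 = [0.72, 0.68, 0.75, 0.66, 0.70, 0.69, 0.71, 0.64]

lo, hi = paired_delta_ci(h0, h1)
# Promote H1 only if the whole interval clears zero.
print(f"delta CI: [{lo:.3f}, {hi:.3f}]  significant: {lo > 0}")
```

Pairing on identical tasks is what makes small N workable — it cancels per-task difficulty variance that would otherwise swamp the harness effect.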

Try this: Before your next harness change, write down what you expect to see. If you’re wrong more than half the time, your mental model of the harness is miscalibrated — that’s worth knowing.

Observability Tools

| Tool | Strength |
|---|---|
| LangSmith (LangChain) | Best for LangGraph, rich trace UI |
| Arize Phoenix | Open-source, framework-agnostic |
| Braintrust | Eval-focused, strong comparisons |
| Helicone | Routing + observability combined |

All four will give you tracing. Pick based on your stack.


Meta-Harness-Style Self-Improvement: Minimal Sketch

You don’t need a research lab to run Meta-Harness-style optimization. Here’s a minimal sketch:

import json
from pathlib import Path

HISTORY_DIR = Path("./harness_history")
HISTORY_DIR.mkdir(exist_ok=True)

def log_candidate(iteration, code, score, traces):
    """Save everything to disk, not just the score."""
    candidate_dir = HISTORY_DIR / f"iter_{iteration:03d}"
    candidate_dir.mkdir(exist_ok=True)
    (candidate_dir / "harness.py").write_text(code)
    (candidate_dir / "score.json").write_text(json.dumps({"score": score}))
    (candidate_dir / "traces.jsonl").write_text("\n".join(traces))

def propose_next(history_dir):
    """Claude Code (as proposer) reads history, proposes new candidate."""
    # Claude Code gets filesystem access to history_dir
    # Reads whatever files it chooses
    # Writes a new harness.py
    pass

def evaluate(harness_code, task_set):
    """Run harness on tasks, return (score, traces)."""
    pass

# Main loop
for i in range(N_ITERATIONS):
    code = propose_next(HISTORY_DIR) if i > 0 else baseline_code()
    score, traces = evaluate(code, task_set)
    log_candidate(i, code, score, traces)

Key architectural choices:

  • History on disk, not in context window
  • Full traces stored (Meta-Harness lesson)
  • Proposer chooses what to read (adaptive inspection)

Start simple: 10 iterations, 5 tasks. See what happens.


5-Week Learning Roadmap

Want to go from reading this series to shipping a harness? Here’s a sequenced path:

Week 1: Foundations

  • Read Anthropic’s “Building effective agents” essay
  • Implement each of the 5 workflow patterns (Part 1) by hand with raw SDK
  • Don’t use a framework

Week 2: Low-level Orchestration

  • Work through LangGraph quickstart
  • Re-implement one workflow pattern as a StateGraph
  • Add a checkpointer, simulate a failure, resume

Week 3: Claude Code Harness Tools

  • Install gstack and run /qa + /review on a real project
  • Install revfactory/harness and generate a custom agent team
  • Compare: gstack’s opinionated sprint vs harness’s flexible composition
  • Read 3 templates from harness-100 and analyze their structure

Week 4: Ship Something

  • Pick a real task you have
  • Design the harness (Planner + Generator + Evaluator)
  • Build it. Measure it. Iterate.
  • Keep notes on what worked, what didn’t

Week 5: Evaluation + Meta-Loops

  • Add LangSmith/Phoenix/Braintrust tracing
  • Run an A/B test on one harness change
  • Sketch a minimal Meta-Harness loop for your task
  • (Optional) Let Claude Code iterate for 5-10 rounds

After week 5, you’ll know what you’re missing — and you’ll know how to find it.


7 Takeaways (Print This)

  1. Simple first, complex later. Single call → workflow → agent. Escalate only when necessary.
  2. Separate evaluation from generation. Never let an agent grade its own work.
  3. Files > context window. State lives on disk. Context is scratch space.
  4. Rich feedback beats clever structure. (Meta-Harness lesson: log everything.)
  5. Frameworks hide; direct code reveals. Know the primitives before you adopt abstractions.
  6. Harness is iterative. It’s a system you evolve, not a template you apply.
  7. Start with templates. revfactory/harness-100 or equivalent. Don’t start from zero.

Series Wrap-Up

Three posts in, here’s the arc:

Part 1 established that the scaffolding around your model is now the bottleneck — and that “harness” is the name of that scaffolding.

Part 2 showed that AI can design its own harness when given the right information diet, and that richer feedback consistently beats cleverer structure.

Part 3 gave you the tools, patterns, and scenarios to start building harnesses that work in production today.

If you take one thing from the series: harness engineering is a measurement-driven discipline. You don’t design a harness once — you evolve it, with your finger on the numbers.

Model capability is catching up. Harness quality is still a moat.


References and Further Reading

Engineering Blogs

  • Anthropic: “Harness design for long-running LLM applications”
  • Anthropic: “Building effective agents”
  • Cognition Labs: “Don’t Build Multi-Agents”

Research

  • Meta-Harness (arXiv 2603.28052)
  • Natural-Language Agent Harnesses (arXiv 2603.25723)

OSS Projects

  • gstack
  • revfactory/harness and harness-100
  • OpenHarness (HKUDS)
  • LangGraph

This post is licensed under CC BY 4.0 by the author.