
Agent Tool Interfaces Part 2: Orchestrating Tool Interfaces — From Harness Design to GraphRAG


You know the interfaces. Now learn how to compose them — and how GraphRAG makes your agent smarter about which tools to use.

Agent Tool Interfaces: From Landscape to Orchestration

This is Part 2 of a 2-part series on Agent Tool Interfaces.


TL;DR: Having 43 tools is not a feature — it’s a liability if your agent can’t pick the right one. This post covers three things: (1) how to orchestrate CLI, MCP, and Code Execution in a single harness using the Three-Loop Architecture, (2) how to route, manage state, and recover from failures across interfaces, and (3) how GraphRAG enables intelligent tool discovery that collapses the schema bloat problem from O(n) to O(1). The punchline: a Tool Knowledge Graph with graph-guided selection achieves 43% accuracy vs 14% baseline while cutting token cost by 95%.


The 50-Tool Problem

Your agent has access to 43 GitHub tools, 8 Slack tools, 5 database tools, and a handful of file operations. That’s 60+ tools loaded into context.

What happens?

  • 44,000+ tokens consumed by schemas before any work begins
  • The LLM picks the wrong tool 28% of the time at this scale
  • Perplexity removed MCP support entirely, citing these costs

Vercel discovered the fix: they cut their tool set by 80% and performance improved. The agent got better when it had fewer choices.

“The best harness is the simplest one that still works.” — Anthropic Engineering

But you still need those 60+ tools — just not all at once. The question is: how do you give the agent the right tools at the right time?

That’s orchestration. And in 2026, the answer involves three interconnected problems: routing, state management, and discovery.


The Three-Loop Architecture

In Part 1, we mapped six tool interfaces. In production, the top three — CLI, Code Execution, and MCP — layer into a three-loop architecture:

┌──────────────────────────────────────────────────────────────┐
│                    Agent System                              │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Inner Loop (CLI)                          │  │
│  │                                                        │  │
│  │  git ─── pytest ─── eslint ─── cargo ─── docker        │  │
│  │                                                        │  │
│  │  ● Stateless subprocess   ● Zero auth overhead         │  │
│  │  ● Millisecond latency    ● Billions of training data  │  │
│  │  ● 100% reliability       ● ~1,365 tokens/op           │  │
│  └────────────────────────────────────────────────────────┘  │
│                          │                                    │
│  ┌────────────────────────────────────────────────────────┐  │
│  │             Middle Loop (Code Execution)               │  │
│  │                                                        │  │
│  │  Multi-tool chains ─── Data transforms ─── API combos  │  │
│  │                                                        │  │
│  │  ● Sandboxed (E2B/V8)    ● N steps → 1 round-trip     │  │
│  │  ● Code is inspectable   ● 99.9% cheaper than MCP     │  │
│  │  ● ~600 tokens/op        ● State within sandbox       │  │
│  └────────────────────────────────────────────────────────┘  │
│                          │                                    │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Outer Loop (MCP + Browser)                │  │
│  │                                                        │  │
│  │  Slack ─── Notion ─── Jira ─── Salesforce ─── Legacy   │  │
│  │                                                        │  │
│  │  ● OAuth 2.1 auth        ● Audit trail                │  │
│  │  ● Structured I/O        ● Self-describing schemas     │  │
│  │  ● Cross-network         ● Browser for no-API targets  │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │        State Layer (Filesystem + Checkpoints)          │  │
│  │  Files as source of truth │ Survives truncation/crash  │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Why three loops, not two?

The original Two-Loop model (CLI inner, MCP outer) missed the middle ground: multi-step workflows that are too complex for a single CLI command but don’t need OAuth or external authentication. Code Execution fills this gap — it’s the place where you chain fetch → filter → aggregate → format in a single sandboxed pass instead of 4 sequential MCP calls.
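The shape of such a middle-loop pass can be sketched in a few lines. This is a minimal illustration, not any product's API: `fetch_orders` stands in for an MCP/SDK call made from inside the sandbox, and the tool names are hypothetical.

```python
# One sandboxed script chains fetch -> filter -> aggregate -> format,
# replacing four sequential MCP round-trips with a single pass.

def fetch_orders(customer_id: str) -> list[dict]:
    # Placeholder for an MCP/SDK call executed inside the sandbox.
    return [
        {"id": 1, "total": 120.0, "status": "shipped"},
        {"id": 2, "total": 45.0, "status": "cancelled"},
        {"id": 3, "total": 300.0, "status": "shipped"},
    ]

def run_chain(customer_id: str) -> str:
    orders = fetch_orders(customer_id)                         # fetch
    shipped = [o for o in orders if o["status"] == "shipped"]  # filter
    total = sum(o["total"] for o in shipped)                   # aggregate
    return f"{len(shipped)} shipped orders, ${total:.2f} total"  # format

print(run_chain("cust_42"))  # only this summary re-enters the context window
```

The model sees one compact result instead of three intermediate payloads, which is where the token savings come from.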

Claude Code: The Canonical Implementation

Claude Code is the clearest production example of this pattern [Penligent]:

  • Inner Loop: 8 built-in CLI tools (Bash, Read, Edit, Write, Grep, Glob, Task, TodoWrite)
  • Middle Loop: Subagents via Task tool — nested conversations with isolated context windows
  • Outer Loop: MCP servers for external services via claude mcp add
  • Unified permissions: Bash(git diff *) and mcp__server__tool use identical syntax
  • Subagent scoping: Security-review agents get Read/Grep/Glob but not Edit/Bash

The LLM routes naturally between tools based on task descriptions — no explicit routing logic required. When you ask “check the latest PR and post a summary to Slack,” it uses gh (CLI) for the PR and the Slack MCP server for posting.

BKit: PDCA as a Harness Plugin

While Claude Code implements the Three-Loop Architecture internally, BKit implements it externally — as a plugin that wraps Claude Code, Gemini CLI, and Codex CLI with a structured PDCA (Plan-Do-Check-Act) workflow [GitHub].

┌──────────────────────────────────────────────────────────┐
│                     BKit Layer                           │
│                                                          │
│  Plan ──▶ Design ──▶ Do ──▶ Check ──▶ Act ──▶ Report    │
│                              │                           │
│                     gap-detector agent                   │
│                     (design vs implementation            │
│                      match rate ≥ 90%)                   │
│                              │                           │
│              ┌───────────────┼───────────────┐           │
│              ▼               ▼               ▼           │
│         Opus agents    Sonnet agents    Haiku agents     │
│         (11: complex   (19: implement) (2: light        │
│          reasoning)                     tasks)          │
└──────────────────────────────────────────────────────────┘
                         │
                         ▼
              Claude Code / Gemini CLI / Codex CLI
              (Three-Loop execution underneath)

Key design choices that map to harness engineering principles:

  • 37 skills (18 workflow / 18 capability / 1 hybrid) loaded per phase — not all at once. This avoids the 50-Tool Firehose anti-pattern.
  • Separate evaluator: A gap-detector agent compares implementation against design docs. Match rate < 70% triggers automatic iteration (max 5 cycles). This is the Planner–Generator–Evaluator pattern enforced at the system level.
  • 6-layer hook system: 18 event types (SessionStart, PreToolUse, PostToolUse, PreCompact, Stop, etc.) inject context at distinct lifecycle points.
  • Cross-platform: ~95% code reuse across Claude Code, Gemini CLI, and Codex CLI. Only manifest files and hook event names differ.
  • Context preservation: PDCA state survives compaction, improving context retention from 30-40% to 75-85%.

BKit demonstrates that the Three-Loop Architecture isn’t just for framework builders — it can be imposed on existing agents from the outside via hooks, skills, and MCP servers (bkit-pdca, bkit-analysis).


Tool Routing: How to Pick the Right Interface

Three Routing Strategies

┌──────────────────────────────────────────────────────────────┐
│                   Routing Strategy Spectrum                   │
│                                                              │
│  Fast ◄────────────────────────────────────────────► Smart   │
│                                                              │
│  Rule-Based          Hybrid               LLM-Based          │
│  ──────────          ──────               ─────────          │
│  ● 0ms overhead      ● Fast rules first   ● 50-100ms        │
│  ● Deterministic     ● Semantic middle     ● Flexible        │
│  ● Rigid             ● LLM fallback       ● Handles novel    │
│  ● Misclassifies     ● Best of both         combinations     │
│    edge cases        ● 39% cost savings   ● Explainable      │
│                                                              │
│  "if file_op: CLI"   Rules → Router → LLM  "Model picks"    │
└──────────────────────────────────────────────────────────────┘

Rule-Based: Predefined conditions per task type. If the task matches a known pattern (e.g., “run tests” → CLI), route immediately. Zero latency, but can’t handle ambiguity.

LLM-Based: The model itself decides which tool to call. This is Claude Code’s native approach — tool schemas are in context, and the model generates the appropriate tool call. Maximum flexibility, but 50-100ms additional inference latency per routing decision.

Hybrid (consensus best practice): Layer all three for optimal cost/quality [Patronus AI, Requesty]:

  1. Fast rules filter obvious cases (file ops → CLI, Slack → MCP)
  2. Semantic router handles the middle tier (embedding similarity to known patterns)
  3. LLM tackles edge cases and novel combinations

Measured results: 37-46% reduction in LLM usage, 32-38% latency improvement, 39% cost reduction.

Cost-Aware Routing

oh-my-claudecode implements model-level routing: simple tasks (variable rename) go to Haiku (fast, cheap), complex tasks (architecture decisions) go to Opus. Claimed 30-50% token savings [oh-my-claudecode GitHub].

BKit takes a different approach: role-based routing with fixed model assignments. 32 agents are pre-assigned — 11 to Opus (complex reasoning), 19 to Sonnet (implementation), 2 to Haiku (lightweight). The routing is deterministic by agent role, not dynamic by task complexity. This trades flexibility for consistency — every architecture decision always goes through Opus, every code generation always goes through Sonnet.

The cost hierarchy to internalize:

| Interface | Cost/op | When to route here |
|---|---|---|
| CLI | ~1,365 tokens | CLI tool exists, simple auth |
| Code Exec | ~600 tokens | Multi-step, data transforms |
| MCP | ~44,026 tokens | OAuth required, no CLI exists |
| Browser | ~100K+ tokens | No API at all (vision + actions) |

Always prefer the cheaper interface unless a specific requirement (auth, audit, structured output) forces you upward.


State Management: The Hardest Unsolved Problem

Each interface has a different state model. This is where most multi-interface harnesses break.

┌──────────────────────────────────────────────────────────────┐
│              State Model per Interface                        │
│                                                              │
│  CLI (Bash)         MCP (stdio)        Code Execution        │
│  ──────────         ──────────         ──────────────         │
│  Stateless          Session-based      Sandbox-scoped         │
│  subprocess.        Process persists   State persists         │
│  Env resets         for client         within sandbox         │
│  between calls.     lifetime.          lifetime.              │
│                                                              │
│  MCP (HTTP)         Subagents                                │
│  ──────────         ─────────                                │
│  Stateless.         Separate context                         │
│  Each request       window. Returns                          │
│  is independent.    summary only.                            │
└──────────────────────────────────────────────────────────────┘

Solution 1: Filesystem-as-State (Anthropic’s Recommendation)

The core insight from Meta-Harness research [arxiv:2603.28052]: externalize state to the filesystem.

Three enforced properties:

  • Externalized: State is written to artifacts, not held in transient context
  • Path-addressable: Later stages reopen the exact object by file path
  • Compaction-stable: State survives context truncation, restart, and delegation

This is how Claude Code works — CLAUDE.md, project files, and TODO lists are the state store. The conversation window is disposable; the filesystem is truth.

“Use files as source of truth, not context window.” — Anthropic Engineering [Best Practices]
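The three properties fit in a few lines of code. A minimal sketch with an illustrative `agent_state/` directory; the artifact names are hypothetical.

```python
# State is externalized to a file, addressed by path, and survives a
# context reset (modeled here by re-reading from disk).
import json
from pathlib import Path

STATE_DIR = Path("agent_state")  # illustrative location

def write_artifact(name: str, payload: dict) -> Path:
    STATE_DIR.mkdir(exist_ok=True)
    path = STATE_DIR / f"{name}.json"
    path.write_text(json.dumps(payload))   # externalized
    return path                            # path-addressable

def read_artifact(path: Path) -> dict:
    # Compaction-stable: any later stage can reopen the exact object.
    return json.loads(path.read_text())

p = write_artifact("pr_summary", {"pr": 17, "status": "reviewed"})
# ...context gets truncated, agent restarts...
assert read_artifact(p)["status"] == "reviewed"
```

Nothing here depends on the conversation window, which is exactly the point.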

Solution 2: LangGraph Checkpointing

Every state change is saved as a checkpoint. The agent can resume from any prior point after failure [LangGraph Docs]:

  • Short-term: Thread-scoped memory (within a session)
  • Long-term: Cross-session memory (across conversations)
  • Recovery: If the agent fails at node N, resume from checkpoint N-1
  • Production users: Klarna, Uber, J.P. Morgan
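The resume-from-checkpoint idea fits in a dozen lines of pure Python. This is the mechanism in miniature, not LangGraph's actual API, which wraps it in graphs, threads, and pluggable stores.

```python
# Save state after every node; on re-run, skip straight to the first
# node that has no checkpoint.
checkpoints: dict[int, list] = {}

def run_pipeline(steps, state, fail_at=None):
    start = max(checkpoints) + 1 if checkpoints else 0
    state = checkpoints.get(start - 1, state)
    for i in range(start, len(steps)):
        if i == fail_at:
            raise RuntimeError(f"node {i} failed")
        state = steps[i](state)
        checkpoints[i] = state        # checkpoint after every node
    return state

steps = [
    lambda s: s + ["fetched"],
    lambda s: s + ["parsed"],
    lambda s: s + ["saved"],
]

try:
    run_pipeline(steps, [], fail_at=2)   # crash at node 2
except RuntimeError:
    pass
result = run_pipeline(steps, [])         # resumes from checkpoint 1
```

The second run never re-executes nodes 0 and 1; it picks up the state saved at node 1 and finishes node 2.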

Solution 3: Deep Agents CompositeBackend

LangChain’s Deep Agents SDK mixes state backends [Blog]:

  • In-memory for speed
  • Filesystem for persistence
  • LangGraph Store for cross-thread memory
  • The agent offloads large context to a virtual filesystem to prevent overflow

Practical Recommendation

| Tool count | State strategy |
|---|---|
| < 10 tools | Filesystem-as-State is sufficient |
| 10-30 tools | Add LangGraph checkpointing for crash recovery |
| 30+ tools | CompositeBackend with selective offloading |

Error Handling and Fallback Chains

The Reliability Math

If each step succeeds 95% of the time, chaining 20 steps gives:

0.95^20 ≈ 0.36 → 36% end-to-end success rate

This is why production harnesses invest heavily in fallback chains.

Fallback Patterns

┌──────────────────────────────────────────────────────────────┐
│                    Fallback Chain                             │
│                                                              │
│  ┌───────────┐  timeout  ┌───────────┐  fail  ┌──────────┐  │
│  │ MCP call  │──────────▶│ CLI       │───────▶│ Human-in-│  │
│  │ (primary) │           │ fallback  │        │ the-loop │  │
│  └───────────┘           └───────────┘        └──────────┘  │
│                                                              │
│  ┌───────────┐  sandbox  ┌───────────┐  fail  ┌──────────┐  │
│  │ Code Exec │  error    │ Sequential│───────▶│ Log +    │  │
│  │ (batch)   │──────────▶│ tool calls│        │ retry    │  │
│  └───────────┘           └───────────┘        └──────────┘  │
└──────────────────────────────────────────────────────────────┘

MCP Gateway (the most impactful single optimization):

A middleware layer that solves MCP’s reliability problem [StackOne]:

  • Connection pooling: TCP timeout failures drop from 28% → ~1%
  • Schema filtering: Token cost drops 90% (3 relevant tools instead of 43)
  • Auth centralization: Single OAuth flow for all downstream servers
  • Implementations: Lasso MCP Gateway, WunderGraph (GraphQL-based), Portkey

LangGraph checkpoint recovery: If an agent fails at node N, resume from checkpoint N-1 with preserved state. No restart from scratch.

oh-my-*’s targeted repair: The VERIFY stage checks combined output. If verification fails, only the broken parts get targeted repairs in the FIX stage — no full re-execution. Auto-resume daemon handles rate limit interruptions [GitHub].


GraphRAG: How Agents Find the Right Tool

This is where the orchestration story gets interesting. Everything above — routing, state, fallbacks — assumes the agent already knows which tools exist. But with 50+ tools across multiple MCP servers, tool discovery itself becomes the bottleneck.

The Schema Bloat Problem, Quantified

MySQL MCP Server:      106 tools = 207KB = ~54,600 tokens
GitHub MCP Server:      43 tools = ~44,000 tokens
7+ MCP servers:        67,000+ tokens consumed before any work

All loaded on every request, even if you need 2 tools.

This is the “hidden token tax” [Layered.dev]. Current solutions (lazy loading, dynamic toolsets, gateways) help, but they’re syntactic optimizations — they reduce what’s loaded, not how intelligently tools are selected.

The semantic solution is retrieval: don’t load all tools — search for the right ones.

Vector RAG vs GraphRAG for Tool Selection

Vector RAG embeds tool descriptions and retrieves the top-k most similar. Simple, fast, but fundamentally limited:

| Metric | Vector RAG | GraphRAG |
|---|---|---|
| Overall accuracy | 56.2% | ~90% (FalkorDB) |
| Multi-entity queries (5+ entities) | Degrades to 0% | Stable |
| Multi-hop reasoning | Poor | Strong |
| Hallucination rate | Baseline | Up to 90% reduction |

Source: Diffbot KG-LM Benchmark [FalkorDB]

Why the gap? Vector search treats tools as independent documents. But tools have relationships — dependencies, composition patterns, input/output type compatibility — that vectors can’t capture.

Vector RAG sees:                    GraphRAG sees:
┌──────────┐                        ┌──────────┐
│ Tool A   │   (independent)        │ Tool A   │──output──▶(:DataType)
└──────────┘                        └──────────┘              │
┌──────────┐                        ┌──────────┐              │
│ Tool B   │   (independent)        │ Tool B   │◀──input───────┘
└──────────┘                        └──────────┘
┌──────────┐                        ┌──────────┐
│ Tool C   │   (independent)        │ Tool C   │──alt_to──▶(Tool B)
└──────────┘                        └──────────┘

Vector finds Tool A.                Graph finds the chain: A → B
                                    and the alternative: A → C

Building a Tool Knowledge Graph

The schema design for tool orchestration:

┌──────────────────────────────────────────────────────────────┐
│              Tool Knowledge Graph Schema                     │
│                                                              │
│  (:Tool {name, description, version, cost, latency})         │
│     │                                                        │
│     ├──[:REQUIRES_INPUT]──▶ (:DataType {schema, format})     │
│     ├──[:PRODUCES_OUTPUT]──▶ (:DataType)                     │
│     ├──[:DEPENDS_ON]──▶ (:Tool)                              │
│     ├──[:COMPOSES_WITH]──▶ (:Tool {pattern})                 │
│     ├──[:ALTERNATIVE_TO]──▶ (:Tool {tradeoff})               │
│     ├──[:BELONGS_TO]──▶ (:ToolCategory)                      │
│     ├──[:HAS_PRECONDITION]──▶ (:Condition {expression})      │
│     └──[:HAS_EFFECT]──▶ (:Effect {description})              │
│                                                              │
│  What graph structure captures that flat schemas miss:        │
│  ● Tool A's output type matches Tool B's input (composable)  │
│  ● Tool C requires Tool D to run first (dependency)          │
│  ● Tools E, F, G solve the same problem differently (alts)   │
│  ● Tool H has a precondition only Tool I can satisfy         │
└──────────────────────────────────────────────────────────────┘

Research validation:

  • SciToolAgent (Nature Computational Science, 2025): Knowledge graph of scientific tools enables informed selection and combination across biology, chemistry, and materials science [Nature]
  • Agent-as-a-Graph (arxiv:2511.18194): Represents tools AND agents as graph nodes. +14.9% Recall@5 and +14.6% nDCG@5 over prior retrievers on LiveMCPBenchmark [arxiv]
  • GAP: Graph-Based Agent Planning (NeurIPS 2025 Workshop): Models inter-task dependencies as a DAG. Enables parallel execution of independent tools while respecting sequential dependencies [arxiv]

Graph-Guided Tool Selection in Action

User query: "Find the customer's order history and send them a summary email"

Step 1: Query the Tool Knowledge Graph
        ─────────────────────────────
        MATCH path = (t:Tool)-[:PRODUCES_OUTPUT]->(d:DataType)
                     <-[:REQUIRES_INPUT]-(t2:Tool)
        WHERE t.description CONTAINS 'customer'
        RETURN path

Step 2: Graph returns the chain
        ─────────────────────────
        search_customers ──output: customer_id──▶ get_order_history
                                                   ──output: order_list──▶ send_email

Step 3: Present only 3 tools (not 106)
        ──────────────────────────────
        Token cost: ~3,000 (vs ~54,600 for full schema load)
        Accuracy: 43.13% vs 13.62% baseline

Source: RAG-MCP [arxiv:2505.03275] — formally addresses MCP schema bloat via semantic retrieval. >50% token reduction, 3x accuracy improvement.

The Hybrid Retrieval Architecture

Production systems in 2026 use three-pillar retrieval — vectors for breadth, graphs for depth, keywords for precision:

┌──────────────────────────────────────────────────────────────┐
│                Hybrid Tool Retrieval                         │
│                                                              │
│  User Query: "Deploy the staging build and notify the team"  │
│       │                                                      │
│       ├──▶ Vector Index ──▶ "deploy", "staging" → kubectl,   │
│       │    (semantic)       docker, terraform (by similarity) │
│       │                                                      │
│       ├──▶ Graph Index ──▶ kubectl ──[:DEPENDS_ON]──▶        │
│       │    (relational)    docker_build                      │
│       │                   kubectl ──[:COMPOSES_WITH]──▶       │
│       │                    slack_notify                       │
│       │                                                      │
│       └──▶ BM25 Index ──▶ exact match: "staging"            │
│            (keyword)       → staging_deploy_script           │
│                                                              │
│  Reciprocal Rank Fusion ──▶ Final ranked tool list           │
│                                                              │
│  Result: [docker_build, kubectl_deploy, slack_notify]        │
│  Token cost: ~3,000 (vs loading all schemas)                 │
└──────────────────────────────────────────────────────────────┘

Performance: Hybrid approaches show 10-30% improvement in retrieval accuracy over single-strategy systems [Calmops].
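The fusion step at the bottom of the diagram is Reciprocal Rank Fusion, which is a one-liner per list: score(d) = Σ 1 / (k + rank_d), with k commonly set to 60. The three candidate lists below are illustrative.

```python
# Merge vector, graph, and BM25 rankings into one list via RRF.

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] = scores.get(tool, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector = ["kubectl_deploy", "docker_build", "terraform_apply"]
graph  = ["docker_build", "kubectl_deploy", "slack_notify"]
bm25   = ["staging_deploy_script", "kubectl_deploy"]

print(rrf([vector, graph, bm25])[:3])
```

Tools that appear in multiple lists accumulate score, so agreement between retrievers beats a single high rank in any one of them.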


Practical Graph DB Implementations

Neo4j: The Default Knowledge Layer

Neo4j positions itself as the knowledge layer for agentic systems, with the most mature integration ecosystem:

MCP Servers (6 official/Labs servers) [Neo4j MCP]:

| Server | Purpose |
|---|---|
| Official Neo4j MCP | Schema retrieval, Cypher execution (read/write), GDS algorithms |
| MCP-Neo4j-Memory | Knowledge graph persistence — entity/relationship management |
| MCP-Neo4j-Data-Modeling | Schema design, constraint generation, 7 templates |
| MCP-Neo4j-GDS | Graph algorithms — PageRank, Louvain, Leiden, Dijkstra |

Framework integration: LangChain (langchain-neo4j), LlamaIndex (PropertyGraphIndex), CrewAI, Semantic Kernel, Google ADK [Neo4j Blog].

Text2Cypher: LLM generates Cypher from natural language. 92% accuracy on Spider-Graph benchmarks, 150ms end-to-end latency. Fine-tuned models available on HuggingFace [Neo4j Medium].

GraphRAG Retrievers as MCP Server: Neo4j published a pattern to expose vector, text2cypher, and hybrid retrievers directly as MCP tools [Neo4j Blog].

FalkorDB: Performance-First

Purpose-built for GraphRAG speed [FalkorDB]:

  • 496x faster than Neo4j for point lookups (Redis hash tables, O(1) in-memory)
  • 6x better memory efficiency
  • 2.9x faster 2-hop traversal
  • Batch ingestion at batch size 5,000: 22,784 records/s
  • Own MCP server for graph-based AI integration [FalkorDB MCP]
  • GraphRAG SDK: 90%+ accuracy for schema-heavy enterprise queries

Microsoft GraphRAG

The open-source reference implementation [GitHub]:

  • 6-phase indexing: Chunk → Extract entities → Leiden community detection → Summarize
  • Three search modes: Local (entity neighborhood), Global (community summaries), DRIFT (hybrid)
  • LazyGraphRAG (June 2025): 700x lower query cost than full GraphRAG with comparable quality. Indexing cost identical to vector RAG (0.1% of full GraphRAG) [Microsoft Research]

Zep / Graphiti: Temporal Knowledge Graphs

For agents that need to remember across sessions [GitHub]:

  • Three-tier subgraph: Episode (raw events) → Semantic Entity → Community
  • Bitemporal modeling: event_time + ingestion_time on every node/edge
  • 94.8% accuracy on Deep Memory Retrieval (vs MemGPT’s 93.4%) [arxiv:2501.13956]
  • Non-lossy updates: full history of fact validity periods

Amazon Bedrock + Neptune

Fully managed GraphRAG (GA March 2025) [AWS Blog]:

  • Automatic entity extraction during document ingestion
  • Two-step query: vector search → graph traversal for multi-hop reasoning
  • No graph DB management required

The Production Architecture

Putting it all together — the full stack for a production agent system with intelligent tool discovery:

┌────────────────────────────────────────────────────────────────┐
│                 Production Agent System                        │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                   Agent (LLM)                            │  │
│  │  Planner → Generator → Evaluator (separate contexts)     │  │
│  └──────────────────────┬───────────────────────────────────┘  │
│                         │                                      │
│  ┌──────────────────────▼───────────────────────────────────┐  │
│  │              Tool Knowledge Graph                        │  │
│  │         (Neo4j / FalkorDB / Neptune)                     │  │
│  │                                                          │  │
│  │  (:Tool)──[:REQUIRES_INPUT]──▶(:DataType)                │  │
│  │  (:Tool)──[:DEPENDS_ON]──▶(:Tool)                        │  │
│  │  (:Tool)──[:COMPOSES_WITH]──▶(:Tool)                     │  │
│  │  (:Tool)──[:ALTERNATIVE_TO]──▶(:Tool)                    │  │
│  │                                                          │  │
│  │  Hybrid Retrieval: Vector + Graph + BM25                 │  │
│  └──────────────────────┬───────────────────────────────────┘  │
│                         │ Graph-Guided Discovery               │
│  ┌──────────────────────▼───────────────────────────────────┐  │
│  │                 Hybrid Router                            │  │
│  │  Rule-based fast path → Semantic → LLM fallback          │  │
│  └───┬──────────────────┬──────────────────┬────────────────┘  │
│      │                  │                  │                    │
│      ▼                  ▼                  ▼                    │
│  ┌────────┐      ┌───────────┐      ┌───────────┐             │
│  │  CLI   │      │   Code    │      │    MCP    │             │
│  │ Inner  │      │  Middle   │      │   Outer   │             │
│  │  Loop  │      │   Loop    │      │    Loop   │             │
│  └────────┘      └───────────┘      └───────────┘             │
│      │                  │                  │                    │
│      └──────────────────┼──────────────────┘                   │
│                         ▼                                      │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │          State Layer                                     │  │
│  │  Filesystem-as-State + LangGraph Checkpoints             │  │
│  │  (survives truncation, crash, delegation)                │  │
│  └──────────────────────────────────────────────────────────┘  │
│                         │                                      │
│  ┌──────────────────────▼───────────────────────────────────┐  │
│  │          MCP Gateway                                     │  │
│  │  Schema filtering (90%) │ Connection pooling (28%→1%)    │  │
│  │  Auth centralization    │ PII detection                  │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

How It Flows

  1. Query arrives: “Find the customer’s order history and send them a summary email”
  2. Graph-guided discovery: Tool Knowledge Graph finds the chain: search_customers → get_order_history → send_email (3 tools, not 106)
  3. Hybrid router: search_customers and get_order_history are database tools → MCP outer loop. send_email is Slack → MCP outer loop. All three need OAuth → no CLI fallback.
  4. But wait — it’s a 3-step chain: Router recognizes multi-step pattern → Middle Loop (Code Execution). Agent writes a Python script that calls all three via the MCP SDK in one sandbox pass.
  5. State persisted: Results written to filesystem. If the sandbox crashes at step 2, LangGraph checkpoint enables resume from step 1’s result.
  6. MCP Gateway: All three MCP calls routed through gateway. Connection pooling prevents TCP timeouts. Schema filtering means only 3 tool schemas loaded, not 106.

Where RTK and BKit Fit

┌────────────────────────────────────────────────────────────────┐
│  Methodology Layer ──── BKit (PDCA workflow, quality gates)    │
│       │                 37 skills, 32 agents, gap-detector     │
│       ▼                                                        │
│  Harness Layer ──────── Three-Loop + GraphRAG + Routing        │
│       │                                                        │
│       ▼                                                        │
│  Tool Interface Layer ─ CLI ─── MCP ─── Code Exec ─── Browser │
│       │                  │                                     │
│       ▼                  ▼                                     │
│  Infrastructure ──────── RTK (CLI output compression, 60-90%) │
│                         ICM (persistent memory, KG)            │
│                         Grit (parallel agent git locking)      │
└────────────────────────────────────────────────────────────────┘

BKit pushes structure DOWN: methodology → harness → tool selection
RTK pushes efficiency UP: infrastructure → tool interface → harness

BKit and RTK represent two complementary directions of harness optimization:

  • BKit operates top-down — it imposes methodology (PDCA) onto the agent, controlling which tools are available at each phase and how their outputs are evaluated. The gap-detector ensures quality; phase-aware skill loading ensures focus.
  • RTK operates bottom-up — it optimizes the infrastructure that tools run on, making every CLI call cheaper without changing what the agent does. A 70% token reduction across a session means more room for reasoning.

Both validate the core thesis of this series: tool interface orchestration is a multi-layer engineering discipline, not a single protocol choice. The best systems optimize at every layer simultaneously.


Eight Anti-Patterns to Avoid

Lessons extracted from production harness failures:

1. The 50-Tool Firehose

Problem: Giving the model 50+ tool schemas and hoping it picks correctly.

Fix: Per-phase tool scoping. Planning step gets read-only tools. Execution gets write tools. QA gets test tools. Vercel cut tools by 80% and performance improved.
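The fix is mechanically simple: a phase-to-tools mapping consulted before schemas are loaded. A sketch with hypothetical phase names and tool kinds.

```python
# Per-phase tool scoping: each phase sees only its slice of the tool set.

ALL_TOOLS = {
    "read_file": "read", "grep": "read", "glob": "read",
    "edit_file": "write", "bash": "write",
    "run_tests": "test", "coverage": "test",
}

PHASE_SCOPES = {
    "planning":  {"read"},
    "execution": {"read", "write"},
    "qa":        {"read", "test"},
}

def tools_for(phase: str) -> list[str]:
    allowed = PHASE_SCOPES[phase]
    return [name for name, kind in ALL_TOOLS.items() if kind in allowed]

print(tools_for("planning"))  # write tools are never even loaded
```

The planning phase cannot misuse `edit_file` because its schema is never in context — scoping by construction rather than by prompt instruction.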

2. Self-Evaluation

Problem: Letting the same agent evaluate its own output → 90% false-positive “task complete” signals.

“Agents confidently praise their own work — even when quality is obviously mediocre.” — Anthropic

Fix: Architecturally separate evaluator with its own context. The Planner–Generator–Evaluator pattern from Harness Engineering Part 1.

3. Context Window as State Store

Problem: At 80%+ context usage, models rush to finish (context anxiety), skip verification, declare premature completion.

Fix: Externalize state to the filesystem. The context window is disposable; files are truth.
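One way to externalize state, sketched with a hypothetical file layout (the exact scheme is up to your harness):

```python
# Sketch: filesystem as the source of truth. Each step's state is written
# to disk, so the context window can be compacted or discarded at any time.
import json
import tempfile
from pathlib import Path

def save_state(workdir: Path, step: int, state: dict) -> Path:
    """Persist one step's state; zero-padded names keep lexical order."""
    path = workdir / f"state_{step:04d}.json"
    path.write_text(json.dumps(state, indent=2))
    return path

def load_latest_state(workdir: Path) -> dict:
    """Recover the most recent state after a restart or context reset."""
    latest = max(workdir.glob("state_*.json"))
    return json.loads(latest.read_text())

workdir = Path(tempfile.mkdtemp())
save_state(workdir, 1, {"todo": ["fix bug"], "done": []})
save_state(workdir, 2, {"todo": [], "done": ["fix bug"]})
assert load_latest_state(workdir)["done"] == ["fix bug"]
```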

4. MCP Schema Bloat

Problem: Loading all schemas on connect. GitHub MCP = 44,000 tokens before any work.

Fix: Lazy loading (95%), Code Mode (99.9%), MCP Gateway (90%), or GraphRAG-based discovery (95%+).
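The lazy-loading variant is straightforward to sketch: keep only tool names in context and fetch a full schema the first time a tool is actually considered. The catalog class and fetch interface below are illustrative, not any real MCP client API:

```python
# Sketch: lazy schema loading. The context pays for one schema per tool
# actually used, instead of all 60 schemas on connect.
class LazyToolCatalog:
    def __init__(self, fetch_schema):
        self._fetch = fetch_schema          # callable: name -> full schema
        self._cache: dict[str, dict] = {}

    def schema(self, name: str) -> dict:
        if name not in self._cache:
            self._cache[name] = self._fetch(name)   # one fetch, then cached
        return self._cache[name]

# Fake backing store standing in for a real MCP server.
FULL_SCHEMAS = {f"tool_{i}": {"name": f"tool_{i}", "params": {}} for i in range(60)}
calls = []
catalog = LazyToolCatalog(lambda n: (calls.append(n), FULL_SCHEMAS[n])[1])

catalog.schema("tool_7")
catalog.schema("tool_7")            # served from cache, no second fetch
assert calls == ["tool_7"]
```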

5. Over-Summarizing Feedback

Problem: Compressing execution traces to scalar scores before passing to the evaluator.

Data: Full traces achieve 50.0 accuracy vs 34.6 for scalar scores (Meta-Harness ablation [arxiv:2603.28052]).

Fix: Log everything. Summarize on demand, not by default.

6. Premature Multi-Agent

Problem: Adding agents before measuring the specific failure you’re solving.

“Multi-agent systems have coordination overhead that often exceeds their benefits.” — Cognition Labs

Data: Forrester (2025) predicts 75% of firms building custom agentic architectures will fail due to infrastructure complexity.

Fix: Start single-agent. Measure. Escalate only when measured.

7. Ignoring Amdahl’s Law

Problem: Assuming 2x workers = 2x speed. The parallelizable portion of a codebase has limits.

Fix: oh-my-* acknowledges this — for a bug fix or two-file refactor, pipeline overhead exceeds savings. Multi-agent pays off only at feature-module scale (10+ files, cross-cutting changes).
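The intuition is just Amdahl's formula: with parallelizable fraction p and n workers, speedup = 1 / ((1 - p) + p / n). Plugging in rough numbers (the fractions below are illustrative) shows why small tasks don't pay:

```python
# Amdahl's law: speedup from n parallel workers when only a fraction p
# of the work can actually run in parallel.
def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# A two-file refactor: assume only ~30% of the work parallelizes.
assert round(speedup(0.3, 4), 2) == 1.29   # 4 agents buy almost nothing

# A feature-module change: assume ~90% parallelizes.
assert round(speedup(0.9, 4), 2) == 3.08   # now 4 agents pay off
```

And the serial fraction caps the ceiling regardless of agent count, which is why coordination overhead dominates below feature-module scale.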

8. Data Overload (The Firehose Effect)

Problem: Tools returning database dumps, binary data, massive JSON payloads.

Fix: Tools should return filtered, relevant data. If a tool returns more than the model can process, the tool’s interface is broken — fix the tool, not the model. For CLI specifically, RTK intercepts command output and applies 12 filtering strategies (stats extraction, error-only, pattern grouping, deduplication) to compress 60-90% of noise before it reaches the agent — turning cargo test from 4,823 tokens to 11 [github.com/rtk-ai/rtk].
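An error-only filter, in the spirit of RTK's strategy, is a few lines of code. This is an illustration of the idea, not RTK's implementation:

```python
# Sketch: error-only filtering of CLI output before it reaches the agent.
import re

def filter_errors(raw_output: str, max_lines: int = 20) -> str:
    """Keep only lines that look like errors/failures, plus a count line."""
    lines = raw_output.splitlines()
    hits = [l for l in lines if re.search(r"\b(error|FAILED|panic)\b", l)]
    summary = f"[{len(hits)} error line(s) of {len(lines)} total]"
    return "\n".join([summary] + hits[:max_lines])

raw = "\n".join(["compiling foo v0.1"] * 200 + ["error[E0308]: mismatched types"])
filtered = filter_errors(raw)
assert "error[E0308]" in filtered
assert len(filtered.splitlines()) == 2   # summary + one error line
```

Crucially, the filter lives at the tool boundary: the model never sees the 200 noise lines, so it spends no tokens ignoring them.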


Scenario-Based Recommendations

| Scenario | Tool count | Recommended stack | Tool discovery |
| --- | --- | --- | --- |
| Single-file bug fix | 3-5 | CLI only, static routing | None needed |
| Feature development (10+ files) | 10-15 | CLI + Code Exec, LLM routing | Lazy loading |
| External service integration | 15-25 | Three-Loop, Hybrid routing | MCP Gateway |
| Enterprise platform (50+ tools) | 50+ | Three-Loop + GraphRAG | Tool Knowledge Graph |
| Multi-agent team | Variable | Per-agent tool scoping + Gateway | Graph per agent role |

The Minimal Viable Harness Roadmap

Add complexity only when measured failure demands it.

Step 1: Single LLM + CLI
        ─────────────────
        Measure: What fails? What's the completion rate?
        If ≥ 90% success → Stop here.

Step 2: + MCP (for external services)
        ─────────────────────────────
        Measure: Token cost? How much is schema overhead?
        If schema overhead < 10% of total → Stop here.

Step 3: + Code Execution (for multi-step workflows)
        ─────────────────────────────────────────────
        Measure: How many round-trips? What's the latency?
        If round-trips < 3 on average → Stop here.

Step 4: + MCP Gateway (for reliability and token cost)
        ─────────────────────────────────────────────
        Measure: MCP failure rate? Schema bloat impact?
        If failure rate < 5% → Stop here.

Step 5: + GraphRAG tool discovery (for 50+ tools)
        ─────────────────────────────────────────
        Measure: Tool selection accuracy? Discovery latency?
        Start with LazyGraphRAG (cheapest), graduate to full.

        "Don't add anything without measuring."

Starting Stack Recommendation

For most teams starting today:

  • LazyGraphRAG for knowledge retrieval — indexing cost identical to vector RAG, 700x cheaper queries [Microsoft Research]
  • Neo4j MCP Server for tool discovery — query the graph via the same MCP interface [Neo4j]
  • Embedding-based tool selection for MCP schema bloat — RAG-MCP pattern [arxiv:2505.03275]
  • Graduate to full Tool Knowledge Graph with dependency-aware execution (GAP pattern [arxiv:2510.25320]) when you exceed 50 tools.
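The embedding-based selection step can be sketched in a few lines. The bag-of-words "embedding" below is a toy stand-in for a real embedding model, and the tool descriptions are hypothetical; the shape of the retrieval loop is what matters:

```python
# Sketch of embedding-based tool selection (RAG-MCP pattern): embed each
# tool description once, embed the query, load only the top-k schemas.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())    # toy embedding: word counts

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

TOOLS = {
    "create_issue": "create a new github issue in a repository",
    "send_message": "send a slack message to a channel",
    "run_query": "run a sql query against the database",
}

def top_k_tools(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    scored = sorted(TOOLS, key=lambda t: cosine(q, embed(TOOLS[t])), reverse=True)
    return scored[:k]

assert top_k_tools("open a github issue about the bug") == ["create_issue"]
```

With a real embedding model the retrieval quality improves, but the cost structure is the same: one query embedding plus a vector lookup, instead of n schemas in every prompt.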

Connections to the Harness Engineering Series

This post fills gaps in the Harness Engineering series:

| Harness Engineering | This Post |
| --- | --- |
| Part 1: Three failure modes (context anxiety, self-eval bias, coherence drift) | Adds: the 50-tool firehose as a fourth failure mode |
| Part 2: "Richer feedback beats cleverer structure" (Meta-Harness) | Extends: richer tool discovery (GraphRAG) beats loading more schemas |
| Part 3: Framework landscape (LangGraph, CrewAI, gstack) | Extends: tool interface layer underneath the framework layer |
| Part 3: "ACI — tool docs deserve as much care as UX docs" | Extends: Tool Knowledge Graph as the structured version of ACI |

The Planner–Generator–Evaluator architecture from Part 1 benefits directly from GraphRAG:

  • Planner uses graph-guided discovery to find the minimal tool set
  • Generator receives only relevant tool schemas (95%+ token savings)
  • Evaluator can query the graph for expected tool behaviors and verify results
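The Planner's graph walk can be sketched as a transitive-dependency traversal over a toy Tool Knowledge Graph. Edges and tool names are hypothetical; a production system would query Neo4j or FalkorDB instead of an in-memory dict:

```python
# Sketch: graph-guided tool discovery. Edges encode "tool A requires tool B
# to have run first"; the planner collects the goal tool plus its
# transitive dependencies, and only those schemas reach the Generator.
from collections import deque

DEPENDS_ON = {
    "create_pr": ["commit_changes"],
    "commit_changes": ["run_tests"],
    "run_tests": [],
    "send_slack_alert": [],          # registered but unrelated to this goal
}

def minimal_tool_set(goal_tool: str) -> set[str]:
    """BFS over the dependency graph from the goal tool."""
    needed, queue = set(), deque([goal_tool])
    while queue:
        tool = queue.popleft()
        if tool not in needed:
            needed.add(tool)
            queue.extend(DEPENDS_ON.get(tool, []))
    return needed

# Only 3 of the 4 registered tools are loaded for this goal.
assert minimal_tool_set("create_pr") == {"create_pr", "commit_changes", "run_tests"}
```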

References

Tool Orchestration and Harness Design

  1. Inside Claude Code Architecture — penligent.ai
  2. Claude Code Agent Harness Architecture — wavespeed.ai
  3. Claude Code Best Practices — anthropic.com/engineering
  4. Claude Code Sandboxing — anthropic.com/engineering
  5. Anatomy of an Agent Harness — blog.langchain.com
  6. LangChain Deep Agents SDK — langchain.com/deep-agents
  7. Microsoft Agent Framework v1.0 — devblogs.microsoft.com
  8. Natural-Language Agent Harnesses — arxiv:2603.25723
  9. Meta-Harness Research — arxiv:2603.28052

Tool Routing and Optimization

  1. AI Agent Routing Tutorial — patronus.ai
  2. Intelligent LLM Routing in Enterprise AI — requesty.ai
  3. Toward Super Agent System with Hybrid AI Routers — arxiv:2504.10519
  4. MCP Token Optimization: 4 Approaches — stackone.com
  5. MCP Tool Schema Bloat: The Hidden Token Tax — layered.dev
  6. MCP vs CLI Benchmark — scalekit.com

GraphRAG and Knowledge Graphs

  1. Microsoft GraphRAG — github.com/microsoft/graphrag
  2. LazyGraphRAG: New Standard for Quality and Cost — microsoft.com/research
  3. GraphRAG: Dynamic Community Selection — microsoft.com/research
  4. RAG-MCP: Mitigating Prompt Bloat — arxiv:2505.03275
  5. Agent-as-a-Graph — arxiv:2511.18194
  6. SciToolAgent (Nature Computational Science) — nature.com
  7. GAP: Graph-Based Agent Planning (NeurIPS 2025) — arxiv:2510.25320
  8. Agentic RAG with Knowledge Graphs — arxiv:2507.16507

Graph Database Implementations

  1. Neo4j GraphRAG Workflow with LangGraph — neo4j.com/blog
  2. Neo4j GraphRAG Retrievers as MCP Server — neo4j.com/blog
  3. Text2Cypher Production Guide — medium.com/neo4j
  4. FalkorDB GraphRAG Accuracy Benchmark — falkordb.com
  5. FalkorDB MCP Integration — falkordb.com
  6. Amazon Bedrock GraphRAG with Neptune — aws.amazon.com
  7. Zep / Graphiti: Temporal Knowledge Graphs — github.com/getzep/graphiti, arxiv:2501.13956
  8. GraphRAG MCP Server (community) — github.com/rileylemm/graphrag_mcp

Agent Frameworks and Patterns

  1. LangGraph Documentation — langchain.com/langgraph
  2. LangGraph MCP Integration — latenode.com
  3. CrewAI Role-Based Orchestration — digitalocean.com
  4. AIO Sandbox — github.com/agent-infra/sandbox
  5. Tool RAG: Next Breakthrough in Scalable AI Agents — Red Hat
  6. Cognee: Knowledge Engine for AI Agents — github.com/topoteretes/cognee
  7. LlamaIndex GraphRAG v2 — developers.llamaindex.ai
  8. Graphs Meet AI Agents: Taxonomy and Opportunities — arxiv:2506.18019

Production Harness Implementations

  1. BKit for Claude Code — github.com/popup-studio-ai/bkit-claude-code
  2. BKit for Gemini CLI — github.com/popup-studio-ai/bkit-gemini
  3. BKit for Codex CLI — github.com/popup-studio-ai/bkit-codex
  4. RTK: CLI Output Compression (18.6k stars) — github.com/rtk-ai/rtk
  5. ICM: Persistent Memory for Agents — github.com/rtk-ai/icm
  6. Grit: Git for Parallel Agents — github.com/rtk-ai/grit
This post is licensed under CC BY 4.0 by the author.