Long-Running Agents: Harness, Evaluator, Handoff


A year ago “long-running agent” meant “we tried to chain prompts and it broke.”

This week, three independent talks converged on the same engineering thesis: hour-scale autonomy is a harness problem, not a prompt problem. Anthropic’s “Build Agents That Run for Hours” session frames it as adversarial evaluators plus structured handoffs. IBM frames it as harness substrate. AI LABS frames it as a development lifecycle. The agent runs for hours when you stop trying to make the model smarter and start engineering the substrate that keeps the model on-task — three components in particular: harnesses, adversarial evaluators, and structured handoffs.

The signal is loud enough that the GitHub trending board has spent four consecutive days dominated by repos that operationalize one or more of those three components. Skills registries, agent-memory layers, agent toolkits — they are the production-pattern equivalent of the same engineering shift the conference talks are naming out loud.

This piece is the practitioner version of those talks. What each one says, where they overlap, where they diverge, and how to spend the next hour of your harness work for the biggest gain in agent autonomy.

The frame shift. Before this week, “skills” was a Claude Code feature. After this week, harness-evaluator-handoff is a category — and skills, sub-agents, memory, validators, and CI loops are how you build it. The substrate is now the thing you ship, not the model.

The three talks

The three converging sources, in order of impact:

1. Anthropic — Build Agents That Run for Hours (Ash Prabaker & Andrew Wilson). The argument is structural: self-evaluation is a trap. A model asked to grade its own work returns false positives that compound across long chains. The fix is adversarial evaluator agents — separate agent instances with antagonistic objectives that grade each handoff, force re-do, and break out of plausible-but-wrong trajectories. Pair that with structured handoffs — explicit, schema-typed state transfers — and you get long horizons that don’t drift.

2. IBM — Harnesses in AI: A Deep Dive (Tejas Kumar). The diagnostic is sharp. Kumar walks through a browser-agent failure: the agent reports success, but the screenshot shows a login page. The model didn’t hallucinate — the harness lacked the verification step that would have caught the login redirect. His thesis: most “agent reliability” failures are harness gaps, not model gaps. The model is fine; the substrate around it is missing the loops that would have caught the deviation.

3. AI LABS — ADLC: Claude Code’s New Lifecycle (video). The framing is process. Agent Development Lifecycle — ADLC — is named as the successor to “vibe coding.” The implied developer workflow: stop iterating on prompts; start iterating on the agent’s production lifecycle — handoff schemas, sub-agent decomposition, evaluator harnesses, replay-and-debug loops.

Three different speakers. Three different angles. One shared thesis: the substrate around the model is now where the engineering happens.
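The browser-agent failure from the IBM talk suggests what a minimal harness-side verification step might look like. The sketch below is an illustration under stated assumptions, not code from any of the talks: PageSnapshot, StepResult, verify_step, on_dashboard, and the example.test URLs are all hypothetical names. It only shows the idea that the harness checks an observed postcondition instead of trusting the agent's own success report.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlparse


@dataclass
class PageSnapshot:
    """What the harness observed after the agent's last action (hypothetical shape)."""
    url: str
    title: str


@dataclass
class StepResult:
    claimed_success: bool  # what the agent reported
    verified: bool         # what the harness confirmed independently
    reason: str


def verify_step(claimed_success: bool,
                snapshot: PageSnapshot,
                postcondition: Callable[[PageSnapshot], bool]) -> StepResult:
    """Accept the agent's claim only if the observed page satisfies the postcondition."""
    if not claimed_success:
        return StepResult(False, False, "agent reported failure")
    if postcondition(snapshot):
        return StepResult(True, True, "postcondition holds")
    return StepResult(True, False, f"claimed success but landed on {snapshot.url}")


def on_dashboard(page: PageSnapshot) -> bool:
    # Assumed postcondition for this example: the task must end on /dashboard.
    return urlparse(page.url).path.startswith("/dashboard")


# The agent says "done", but the browser was redirected to a login page.
redirected = PageSnapshot(url="https://example.test/login?next=/dashboard", title="Sign in")
result = verify_step(True, redirected, on_dashboard)
print(result.verified, "-", result.reason)
```

In a real harness the snapshot would come from the browser tool itself; the point of the sketch is only that the success signal is derived from observed state.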

What “harness” actually means

The word “harness” got popular fast and is now overloaded. Three definitions live in the wild — they describe different things and confusing them costs hours of design discussion.

Harness as runtime. Claude Code, Cursor’s agent mode, Codex CLI, gstack, archon, obra superpowers — the program that boots the model, loads skills, threads memory, manages tool calls, and renders output. This is the most common usage. When IBM’s Tejas Kumar talks about “harness gaps,” this is what he means.

Harness as skill-pack assembly. The collection of SKILL.md files, CLAUDE.md instructions, and sub-agent definitions a developer installs in a project. gstack, the Anthropic skills repo, and academic-research-skills are harnesses in the sense that they shape what the model does, even though they ship as data, not as runtime code.

Harness as test/eval substrate. The evaluator agents, CI checks, replay loops, and adversarial graders that wrap an agent run to detect failure. This is the Anthropic talk’s “harness” — the structural pieces that catch a long-run agent before it confidently delivers garbage.

For the rest of this piece, we use harness = runtime + assembly, and we treat evaluator and handoff as separate first-class categories. The Anthropic talk’s “evaluator harness” is then just “harness + evaluator” — two things at once.

Adversarial evaluators — why self-grading fails

The Anthropic talk’s most useful technical claim is that self-evaluation is a trap.

When you ask a Claude or GPT instance to grade its own output, you get one of three failures:

Plausibility blindness. The model that produced the answer cannot distinguish its own confident-but-wrong outputs from confident-and-right ones. They feel identical from the inside.
Reward-hacking under cost pressure. When evaluator and producer share token budget, the model has every incentive to mark the work passing and move on.
Context dilution. A long agent run carries thousands of tokens of intermediate work. Asking the same model to grade against that context invites context-snowing — the answer that “fits the trajectory” gets marked as correct because it’s locally plausible.

The Anthropic fix is adversarial evaluator agents. A separate agent instance (different system prompt, different tools, different objective) runs in parallel and grades the producer’s output against an adversarial standard. It is paid (in the design sense) to find failure, not to confirm success. When the evaluator vetoes, the producer must redo the work.

This is the same pattern that won in software engineering twenty years ago when “developer writes a test for their own code” was replaced by “QA tries to break it.” The engineering substrate for agents is converging on the same workflow.

The practitioner version of this for a typical Claude Code stack:

Two sub-agents, antagonistic system prompts. Producer’s job: ship the feature. Evaluator’s job: find what’s missing. Evaluator returns a structured veto with reasons; producer must address before continuing.
Schema-typed grade objects. Don’t let the evaluator return free-form text — make it return {pass: bool, blockers: [...], suggestions: [...]} so the producer can branch on the result instead of free-reading.
One adversarial sub-agent per high-stakes handoff. Don’t grade every step — grade the transitions where bad state compounds (test-pass → ship, plan → execute, design → implement).
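The producer/evaluator pattern above can be sketched as a small gate around one handoff. This is a minimal illustration, assuming the producer and evaluator are plain callables; in a real stack they would be separate sub-agent invocations with their own system prompts. Verdict, gated_handoff, stub_producer, and stub_evaluator are hypothetical names, and the field is called passed because pass is a reserved word in Python.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    """Schema-typed grade object mirroring {pass, blockers, suggestions}."""
    passed: bool
    blockers: list[str] = field(default_factory=list)
    suggestions: list[str] = field(default_factory=list)


Producer = Callable[[str, list[str]], str]   # (task, blockers to address) -> work
Evaluator = Callable[[str, str], Verdict]    # (task, work) -> structured verdict


def gated_handoff(task: str, produce: Producer, evaluate: Evaluator,
                  max_rounds: int = 3) -> tuple[str, Verdict]:
    """Run the producer until the evaluator passes the work or rounds run out."""
    blockers: list[str] = []
    work = ""
    verdict = Verdict(passed=False, blockers=["not yet evaluated"])
    for _ in range(max_rounds):
        work = produce(task, blockers)
        verdict = evaluate(task, work)
        if verdict.passed:
            break
        blockers = verdict.blockers  # the producer must address these next round
    return work, verdict


# Stubs standing in for the two sub-agents (assumptions for illustration only).
def stub_producer(task: str, blockers: list[str]) -> str:
    return "feature + tests" if "missing tests" in blockers else "feature"


def stub_evaluator(task: str, work: str) -> Verdict:
    if "tests" not in work:
        return Verdict(passed=False, blockers=["missing tests"])
    return Verdict(passed=True)


work, verdict = gated_handoff("add export button", stub_producer, stub_evaluator)
print(verdict.passed, work)
```

Because the verdict is a typed object, the caller branches on verdict.passed and verdict.blockers instead of parsing prose. The max_rounds cap is a design choice added here so a stubborn veto cannot loop forever.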

Patterns like Archon’s deterministic-review loop and the agent-judge layer on AgentConn are concrete production versions of this exact idea.

Structured handoffs — the schema problem

The second piece of the engineering thesis is structured handoffs.

A “handoff” is what happens when one agent step ends and another begins. The naive version: the previous step appends to a free-form context buffer; the next step reads the buffer and decides what to do. This works for two or three hops. It does not work for hours.

The failure mode is well known. Free-form handoffs lose schema. The agent at step 18 reads a context buffer that’s 15K tokens long and has no canonical way to ask “what is the current state of the build?” Some of the answer is in step 3’s output. Some is in step 11’s tool call. Some is in step 14’s revised plan. The model does its best and confidently reports a state that is only partly true.

The fix is making state transfers typed and explicit. Each handoff carries a schema. Each handoff replaces the relevant slot in a structured world model. Each sub-agent reads only the slots it needs.

What that looks like in practice:

A working-state object. A JSON document the agent maintains and updates explicitly. {tasks_done, tasks_pending, files_modified, tests_passing, blockers, last_evaluator_verdict}. The agent reads-writes this object the way a service reads-writes a database — not by re-reading the entire conversation.
Memory layers. agentmemory’s ★1,226 day-one trending position is the signal here. Persistent agent memory is becoming the standard layer that survives across handoffs and across sessions. It’s why “long-running” went from prompting curiosity to infrastructure category in eight months.
Sub-agent contracts. Each spawned sub-agent runs with a typed input (“here is your task, here are your tools, here is the slot you write to”) and a typed output. The parent agent reads the output slot, not the conversation transcript. This is the convention the Anthropic skills repo pushes, and it’s what makes the SKILL.md format more than a prompt template.
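A working-state object with slot-level updates might look like the following. This is a sketch under assumptions, not the API of agentmemory or any other library: the slot names come from the list above, while WorkingState, read_state, write_slots, run_test_subagent, the schema_version field, and the JSON-file storage are hypothetical choices made for illustration.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field, fields
from pathlib import Path
from typing import Optional


@dataclass
class WorkingState:
    schema_version: int = 1  # assumption: version the schema separately from the model
    tasks_done: list[str] = field(default_factory=list)
    tasks_pending: list[str] = field(default_factory=list)
    files_modified: list[str] = field(default_factory=list)
    tests_passing: bool = False
    blockers: list[str] = field(default_factory=list)
    last_evaluator_verdict: Optional[str] = None


def read_state(path: Path) -> WorkingState:
    """Load the current state; an absent file means a fresh run."""
    if not path.exists():
        return WorkingState()
    return WorkingState(**json.loads(path.read_text()))


def write_slots(path: Path, **slots) -> WorkingState:
    """Replace only the named slots and reject anything outside the schema."""
    allowed = {f.name for f in fields(WorkingState)}
    unknown = set(slots) - allowed
    if unknown:
        raise KeyError(f"unknown state slots: {sorted(unknown)}")
    state = read_state(path)
    for name, value in slots.items():
        setattr(state, name, value)
    path.write_text(json.dumps(asdict(state), indent=2))
    return state


def run_test_subagent(files_modified: list[str]) -> dict:
    """Hypothetical sub-agent contract: typed input in, owned slots out."""
    return {"tests_passing": bool(files_modified), "last_evaluator_verdict": "pass"}


state_path = Path(tempfile.mkdtemp()) / "state.json"
write_slots(state_path, tasks_pending=["wire export button"], files_modified=["export.py"])
update = run_test_subagent(read_state(state_path).files_modified)
final = write_slots(state_path, **update)
print(final.tests_passing, final.tasks_pending)
```

The parent never reads a transcript here. It passes the sub-agent the one slot it needs and merges back only the slots the sub-agent returns.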

The agentmemory drop on GitHub trending is the practitioner-side proof that this is the actual production bottleneck right now. Memory layers don’t move a model leaderboard. They move the duration over which an agent stays useful — which is the variable that just became visible.

The ADLC piece — process, not just architecture

The third talk, AI LABS’ ADLC framing, is about process — and it’s the one with the most direct operator value.

ADLC (“Agent Development Lifecycle”) argues that the failure mode of most agent work in 2026 is not technical — it’s lifecycle-shaped. Developers iterate on prompts when they should be iterating on harness, evaluator, and handoff design. Sprints get framed as “tune the model” when they should be framed as “instrument the agent run.”

What changes if you adopt the ADLC stance:

Replay-first debugging. Every agent run captures its tool-call log, sub-agent decisions, and intermediate state. When the agent fails, you don’t re-run with a tweaked prompt — you replay the failure and identify which substrate component (harness, evaluator, handoff) let the wrong state through.
Eval-first feature work. Before you ship a new agent capability, you write an evaluator that catches the failure mode you’re afraid of. The evaluator gates the merge. This is the agent-equivalent of TDD.
Sub-agent decomposition as architecture, not optimization. ADLC treats sub-agent decomposition the way a backend team treats microservices: a structural choice up front, not a performance tweak after.
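Replay-first debugging can start from something as crude as an append-only tool-call log. The sketch below assumes tool results are JSON-serializable and that a failure can be expressed as a check over a single logged event; RunLog, first_bad_event, and the lambda tools are hypothetical stand-ins, not part of ADLC or any specific framework.

```python
import json
from typing import Any, Callable, Optional


class RunLog:
    """Append-only record of every tool call made during an agent run."""

    def __init__(self) -> None:
        self.events: list[dict[str, Any]] = []

    def call(self, tool_name: str, tool: Callable[..., Any], **args: Any) -> Any:
        """Invoke the tool and record its name, arguments, and result."""
        result = tool(**args)
        self.events.append({"tool": tool_name, "args": args, "result": result})
        return result

    def dumps(self) -> str:
        """Serialize the log as one JSON object per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)


def first_bad_event(log_text: str, check: Callable[[dict], bool]) -> Optional[int]:
    """Replay the recorded events and return the index of the first that fails the check."""
    for index, line in enumerate(log_text.splitlines()):
        if not check(json.loads(line)):
            return index
    return None


# A recorded run in which the agent kept going after a failing test step.
log = RunLog()
log.call("read_file", lambda path: {"bytes": 120}, path="export.py")
log.call("run_tests", lambda target: {"exit_code": 1}, target="tests/")
log.call("deploy", lambda env: {"status": "ok"}, env="staging")

bad = first_bad_event(log.dumps(), lambda e: e["result"].get("exit_code", 0) == 0)
print("first bad event index:", bad)
```

The same check function can later gate a merge, which is one way the replay log and the eval-first habit end up sharing code.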

The tokenmaxxing YC operator pattern and the cursor-skills-as-runtime shift are both ADLC moves — the operators who win in 2026 treat the agent as a system, not a chat session.

The Anthropic Skills announcement is the substrate piece going public. Treat the convention as the contract — author skills with explicit inputs/outputs, version them, and version the handoff schema independently of the model. That’s the substrate move that survives a model rotation.

Where the three talks disagree

The talks agree on the high-level pattern. They disagree on the substrate layer that matters most.

Anthropic prioritizes the evaluator. The Anthropic talk is engineered around the claim that adversarial evaluators are the binding constraint on hour-scale autonomy. Their bet is that handoff schemas matter, but evaluator quality matters more — a good evaluator catches a bad handoff; a good handoff cannot save a bad evaluator.

IBM prioritizes the harness substrate. Tejas Kumar’s diagnostic frame says the substrate around the model is where most agent failures live. His implicit ranking: harness > evaluator > handoff. Fix the substrate, and the rest follows.

ADLC prioritizes the lifecycle. AI LABS’ framing pushes process. Their implicit claim: harness, evaluator, and handoff design all converge if you build a replay-first, eval-first development loop. Without the lifecycle, the three components drift.

The right read for an operator: all three are right at different scales. Prototype-scale agents need a usable harness first. Hour-scale agents need adversarial evaluators. Org-scale agent systems need the ADLC lifecycle so the substrate stays maintainable. Pick the one that’s failing for you and invest there first.

What to ship in the next hour

If you have an hour to spend on hour-scale agent autonomy this week, here are the three highest-leverage moves — pick the one that matches where your stack hurts.

1. If your agent confidently delivers wrong work: add an adversarial evaluator sub-agent for the highest-stakes handoff in your flow. One sub-agent. Different system prompt. Structured veto schema. Wire it into the merge-blocking step. You will spend an hour and catch 30% more failures.

2. If your agent loses state mid-run: add a working-state JSON object the agent reads and writes explicitly. Strip the free-form context dependency. The agent at step 20 should not need to re-read the conversation; it should call read_state() and act. Memory layers like agentmemory make this concrete; auto-dream context files are another shape of the same idea.

3. If your agent runs are unreproducible: instrument the tool-call log so you can replay failures. The ADLC win is that you stop debugging by re-prompting and start debugging by re-running. Even a crude replay loop — log every tool call, every sub-agent spawn, every state update — pays back the same week.

The bigger frame: the category just hardened. “Build an agent” used to mean “compose a prompt and pray.” This week it means “design a harness, an evaluator, and a handoff schema, then iterate on those.” The shift looks the same on the GitHub trending board, in the Anthropic conference talk, and in the practitioner threads on Hacker News. When three independent surfaces all describe the same engineering shift in the same week, it’s not a vibe; it’s a category.

Originally published at AgentConn.