The Agent-of-Agents Problem: Orchestrating a Fleet
When one agent is not enough. How teams are managing fleets of background agents with Orca, Copilot /fleet, and custom orchestration.
The single-agent era lasted about nine months. For most of 2025, the engineering challenge was getting one agent to reliably complete one task — loop design, memory management, harness selection. That problem isn’t solved, but it’s increasingly irrelevant to what’s happening on the ground. Stripe now merges over 1,000 agent-authored pull requests per week. Ramp reports 30% of merged PRs in their frontend and backend repos come from agents. Clay runs billions of agent invocations monthly for GTM workflows. The agent is no longer the unit of concern — the fleet is.
And fleets have a management problem that single-agent design never anticipated. When you have five background agents running maintenance across repositories, responding to PR events, remediating vulnerabilities on a schedule, and triaging issues overnight, the question shifts from “how do I make this agent work” to “who watches the agents, and what happens when three of them conflict?”
This is the agent-of-agents problem: the orchestration layer that spawns, routes, observes, and kills a fleet of parallel agents. The tools are arriving fast — stablyai/orca shipped an open-source ADE for fleet management, GitHub launched Copilot /fleet for parallel agent execution, and Harrison Chase announced LangSmith Fleet as an enterprise workspace for managing agent fleets. The pattern is converging, and if you’re building with agents in 2026, you need to understand it.
What fleet management actually means
Single-agent reliability is about loop design — how many iterations, when to bail, how to recover from tool failures. Fleet management is a different discipline entirely. It has three operational primitives that don’t exist in single-agent architectures:
Observe. When five agents are running concurrently, you need a command center — not a terminal window. Which agent is stuck? Which one finished? Which one is about to merge a PR that conflicts with what another agent just pushed? Orca’s mobile interface lets you monitor and steer agents from your phone, get notified when an agent finishes, and send follow-ups. Airtable’s Hyperagent provides a command center to oversee an entire fleet at scale.
Route. Not every task should go to every agent. The orchestrator pattern — one agent that receives the task, breaks it into subtasks, delegates each to a specialist worker, and assembles the results — is how you avoid paying GPT-4o prices for work that GPT-4o-mini handles fine. Microsoft’s MDASH orchestrates 100+ specialized agents across an ensemble of models, with each stage handled by whichever model fits the task. The orchestrator uses a capable model while workers use cheaper, task-specific ones — cutting costs 40–60%.
Kill. The operation nobody talks about. When an agent goes off the rails — hallucinating file paths, writing to the wrong branch, spinning in a retry loop — you need to kill it without taking down the other four. Orca gives each agent its own Git worktree, so one agent’s bad merge doesn’t contaminate another’s working directory. Kill-switch isolation is the feature that separates a fleet management tool from a glorified terminal multiplexer.
💡 Fleet management is NOT the same as multi-agent frameworks. Frameworks like LangGraph and CrewAI define agent topologies at the code level. Fleet management operates at the infrastructure level — observing, routing, and killing agents that may be running different frameworks, different models, and different codebases simultaneously.
The tools: what’s shipping now
Orca: the open-source fleet ADE
Orca from stablyai is the most complete open-source fleet management tool available today. It’s an Agent Development Environment (ADE) — not a framework, not a harness, but a desktop and mobile application for managing parallel agents.
The core capability: spin up Claude Code, Codex, OpenCode, or any of 27 supported CLI agents side-by-side, each in its own isolated Git worktree, and manage them from one interface. Fan one prompt across five agents, compare the results, and merge the winner. The /orchestrate skill lets you delegate work across models to their strengths.
What sets Orca apart from terminal multiplexers like cmux or Herdr is the mobile interface. You can monitor and steer agents from your phone — a significant workflow change when you have agents running overnight builds or processing large migration tasks.
GitHub Copilot /fleet
GitHub’s answer to fleet management is the /fleet command in Copilot CLI, shipped in April 2026. Instead of working through tasks sequentially, Copilot analyzes your prompt, determines whether it can be divided into subtasks, and dispatches multiple subagents to execute them simultaneously.
The key design decision: /fleet is built into the existing Copilot CLI rather than being a separate tool. This means it inherits Copilot’s context window, tool access, and authentication — but it also means you’re locked into Copilot’s agent ecosystem. You can’t use /fleet to manage a Claude Code agent alongside a Codex agent. For that, you need something like Orca.
/fleet shines for intra-codebase parallelism — refactoring across multiple files, generating documentation for several components at once, implementing a feature that spans API, UI, and tests. It’s less suited for cross-repo fleet management or heterogeneous agent stacks.
LangSmith Fleet
Harrison Chase’s LangSmith Fleet takes the enterprise angle. It’s a workspace for creating, using, and managing agent fleets with two agent types: “Claws” (fixed credentials and identities) and “Assistants” (act on behalf of the end user). The distinction matters for compliance — in regulated environments, you need to know which agent has which permissions and audit trail.
Ona: the infrastructure layer
Ona is building the background agent infrastructure that companies like Stripe and Ramp discovered they needed. Rather than a user-facing tool, Ona provides the plumbing — scheduling, sandboxing, monitoring — that lets you run agent fleets in production without building the orchestration layer yourself.
The patterns: how to wire a fleet
The production patterns for multi-agent orchestration in 2026 have consolidated around six proven architectures. Three matter for fleet management:
Supervisor/Worker
One orchestrator agent receives the task, decomposes it, delegates subtasks to specialist workers, and assembles results. This is the default pattern — Microsoft MDASH uses it, GitHub Copilot /fleet uses it, and most custom implementations start here. The orchestrator uses a capable model (Opus, GPT-4o) while workers use cheaper, task-specific models (Haiku, GPT-4o-mini).
Parallel fan-out
Multiple agents execute the same task independently, and you pick the best result. Orca’s “fan one prompt across five agents” feature is pure fan-out. This is expensive but powerful for creative tasks where output quality varies between runs. The cost trade-off: 5x the tokens for a dramatically higher ceiling on output quality.
Hierarchical delegation
A tiered structure where higher-level agents supervise teams of lower-level workers. Higher levels focus on coordination and planning; lower levels focus on execution. This pattern emerges naturally when your fleet grows past 10 agents — a flat supervisor/worker structure doesn’t scale because the orchestrator’s context window fills with status updates from too many workers.
⚠️ 40% of multi-agent pilots fail within six months of production deployment. The failure mode is almost always orchestration complexity, not individual agent capability. Start with supervisor/worker and add hierarchy only when your orchestrator’s context window becomes the bottleneck.
What Stripe and Ramp learned running fleets in production
The most instructive data on fleet management comes from engineering teams that have been running agent fleets for months, not weeks. At the Background Agents Summit in May 2026, leaders from Stripe, Ramp, Spotify, Uber, and Harvey shared what they’ve learned.
Stripe built its agent harness on top of Goose, running one-shot agents in air-gapped sandboxes with 400+ internal tools via MCP. Engineers kick off tasks from Slack or Jira. The fleet runs across thousands of repositories, responding to PR events, remediating vulnerabilities on a schedule, and triaging issues overnight.
The key insight from Stripe: air-gapped sandboxes are non-negotiable. When agents have write access to production repos and can trigger CI, the blast radius of a single bad agent is enormous. Each agent gets its own sandbox with constrained permissions, and the orchestrator monitors for anomalous behavior.
Ramp built their system around the principle that agents should verify their own work. Each agent can run tests, query Sentry and Datadog, toggle feature flags, and take screenshots to check its output. The result: 30% of merged PRs come from agents, but with a lower revert rate than human-authored code — because agents run the full test suite before submitting.
The pattern both teams converged on: the fleet needs an immune system, not just a nervous system. Observation (the nervous system) tells you what’s happening. Self-verification (the immune system) catches problems before they reach review.
Microsoft MDASH: 100+ agents, one pipeline
For a concrete example of large-scale fleet orchestration, Microsoft’s MDASH (Multi-model Agentic Scanning Harness) is the most detailed case study available. It orchestrates over 100 specialized agents in a multi-stage pipeline for vulnerability discovery.
The architecture uses distinct agent roles: scanning agents analyze code, debate agents cross-check findings, validation agents confirm reproducibility, deduplication agents merge overlapping reports, and exploitation agents prove real-world impact. The pipeline scored 88.45% on CyberGym — the top score on the benchmark, roughly five points ahead of the next entry — and found 16 new vulnerabilities in the Windows networking stack, including four Critical RCEs.
What makes MDASH relevant to fleet management beyond security: it’s model-agnostic by design. Teams can swap or upgrade models while keeping the orchestration infrastructure intact. The pipeline structure — scan → debate → validate → dedup → exploit — is transferable to any domain where you need multiple specialized agents working in sequence with cross-checks between stages.
For more on MDASH’s architecture and benchmarks, see our detailed breakdown.
Fleet orchestration vs. single-agent loop design
If you’re already building with agents, you may be wondering where fleet management fits relative to loop engineering and harness design.
The distinction is scope:
| Concern | Single-Agent (Loop) | Fleet (Orchestration) |
|---|---|---|
| Unit | One agent, one task | N agents, M tasks |
| Failure mode | Infinite loop, hallucination | Agent conflict, resource contention |
| Recovery | Retry, escalate, bail | Kill, reroute, rebalance |
| Memory | Context window, RAG | Cross-agent state, shared context |
| Tooling | Claude Code, Codex | Orca, /fleet, LangSmith Fleet |
Loop design (loopcraft) is still foundational — a fleet of poorly-designed agents fails faster than a single well-designed one. But fleet management adds a layer that loop design doesn’t address: what happens when Agent A’s output is Agent B’s input, and Agent A hallucinates?
Agent memory is another adjacent concern that changes at fleet scale. A single agent’s memory is its context window plus whatever RAG or persistence layer you’ve added. A fleet’s memory includes cross-agent state — Agent A needs to know what Agent B already tried, or they’ll duplicate work. This is where tools like Agentfab’s bespoke self-curating memory system start to matter.
What this means for builders
If you’re running one agent, keep running one agent. Fleet management is overhead you don’t need until you do.
If you’re running two or three agents, start with Orca. The Git-worktree isolation alone is worth it — it prevents the merge-conflict nightmare that hits every team running multiple agents against the same repo. The mobile monitoring is a bonus for overnight runs.
If you’re evaluating fleet management for a team, the decision tree is:
- All Copilot? → Use /fleet. It’s integrated, it’s free with your Copilot subscription, and it handles the parallelism internally.
- Mixed agent stack? → Use Orca. It’s the only tool that manages Claude Code, Codex, OpenCode, and custom agents from one interface.
- Enterprise compliance? → Watch LangSmith Fleet. The Claws/Assistants distinction and audit trail are designed for regulated environments.
- Building your own? → Start with the supervisor/worker pattern. Use Ona’s infrastructure guides as a blueprint. Don’t build hierarchical delegation until your orchestrator’s context window is actually the bottleneck.
The agent-of-agents pattern is the inevitable consequence of agents that actually work. Once you have agents you trust to complete tasks, you want more of them. Once you have more of them, you need something to manage the fleet. The tools are here — the question is how fast your fleet grows past what manual oversight can handle.
For more on the harness layer beneath fleet management, see our guide to building agents that run for hours. For the memory systems that fleet-scale agents need, see the agent memory wars.



