
Everyone building with AI asks how many agents they need. It's the wrong question.
You can now staff an entire company with agents — one for marketing, one for research, one for QA, a box on the org chart filled by a model instead of a person. It's the obvious way to build. It's also the most common design mistake there is.
The question every agentic build runs into is simple: how many agents do I need? The reflex answer is to replicate the organisation you already know. Marketing manager becomes a marketing agent, researcher becomes a research agent, QA engineer becomes a QA agent.
But an org chart was never a design. It's a workaround for human constraints — limited hours, finite attention, the cost of hiring. None of those survive the move to agents. Copy the chart and you inherit limits that no longer apply.
The way out is to see that "how many agents" is two questions wearing one coat. Get that distinction right and the count answers itself.
The wrong default
Roles merge all the time. The full-stack engineer is a front-end and a back-end developer in one seat. The product designer holds interaction, visual, and information architecture that a larger team would split across three people. Wherever one person can actually do several jobs, the org chart quietly collapses them into one.
It stops collapsing where the person runs out. No one can be the entire engineering department, or keep every client in their head at once. So the work gets divided and the boxes get filled — not because the work is naturally separate, but because a human can only hold so much. The org chart is a map of where human capacity runs out.
An agent's capacity runs out somewhere else entirely. The limits that drew your boxes aren't the limits it has. Hand the work to an agent and the question stops being how to divide it across roles. It becomes whether it needs dividing at all.
Why agents don't need an org chart
Those merged roles are one instance of a general pattern. Every box on an org chart is there to work around a specific human limit, and each of those limits is one an agent doesn't share.
Cognitive limits. Real expertise is narrow. It takes years to build, depth trades off against breadth, and a single career only holds so many domains — so past a point, the work gets split across more people. An agent isn't bound that way. It already carries breadth no individual could, and you can widen its scope without the years it would cost a person.
Hiring and headcount. Adjusting a workforce is slow and lumpy. You can bring in a contractor for a day — but the work doesn't arrive in clean day-sized blocks, and you can't find, vet, and brief someone for each small piece as it comes. So a firm runs on a roughly fixed roster, a compromise: overstaffed in the quiet stretches, short-handed when the work spikes. Agents carry none of that friction. Capacity appears the moment the work does and falls to nothing when it's gone — it follows demand instead of guessing at it.
Labour markets. Roles are shaped by what the market sells. You build the org out of the skill-packages you can actually hire — "a paralegal," "a back-end engineer" — and the shape of the team ends up dictated by the shape of the hiring market. An agent isn't bought off that shelf. It has no pre-set trade, so the work doesn't have to bend to fit one.
Career ladders and span of control. Some of the org chart isn't about the work at all. Layers exist to give people somewhere to climb — junior, senior, lead — and managers to oversee other managers. An agent has no career to advance and needs no minder for headcount's sake; neither layer applies.
Some tools are busy hardening this mistake into a product. CrewAI has you give each agent a role, a goal, and a backstory — codifying a human limitation that no longer exists, personality and all. As Christian Lizell put it, the org chart is a skeuomorph: we copy "the shape of something without asking why that shape existed in the first place."
The work itself doesn't go away. The marketing, the research, the review — it all still has to happen. The same responsibilities on the org chart are still relevant, but the reasons the boxes divide them aren't.
The right question
The question of how many agents has to be broken down further. Before deciding, you first need to know whether the new work is the same kind of work an existing agent already does.
If it's the same kind of work, scaling horizontally — the same agent, cloned — makes sense. More volume, not more kinds: you run more instances of one agent. And you don't set that number. A well-built system spins copies up as the work arrives and stands them down as it drains away — capacity following demand instead of being guessed at. The clone count isn't something you design; it sizes itself.
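One way to picture a clone count that sizes itself is a function from queue depth to instance count. This is a minimal sketch, not a real autoscaler: the `clones_needed` helper, the per-clone throughput, and the cap are all hypothetical and chosen only for illustration.

```python
import math


def clones_needed(queued_tasks: int, tasks_per_clone: int, max_clones: int = 50) -> int:
    """Return how many copies of ONE agent definition to run right now.

    Hypothetical helper: the numbers are illustrative, not from any framework.
    """
    if tasks_per_clone <= 0:
        raise ValueError("tasks_per_clone must be positive")
    if queued_tasks <= 0:
        return 0  # no work, no clones: capacity falls to nothing
    # Enough clones to cover the queue, capped to protect budget and rate limits.
    return min(max_clones, math.ceil(queued_tasks / tasks_per_clone))


for depth in (0, 1, 40, 1000):
    print(depth, "queued ->", clones_needed(depth, tasks_per_clone=10), "clones")
```

Nothing here designs a number of agents. The only design decisions are the single definition being cloned and the cap you are willing to pay for.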
If the work is genuinely different, a new agent may need to be defined — sometimes the difference is as little as a permission boundary, sometimes it's an entirely new domain and skillset. But "different work" doesn't automatically mean a new agent. Often the better move is one agent that holds additional skills, cloned on demand.
That last part is where companies can go wrong. They see more work coming and break it down by skill — the same instinct as the org chart — spinning up multiple agents to share the load. That's a volume problem answered with a capability solution: they needed more clones of one agent, and built a team of different agents instead.
Scaling is easy. Defining a new agent is not. The factors that decide are the same for every company, but how they are applied depends on what work the company does and how it likes to work. The trick is to read those factors from scratch, not inherit the splits of a human-constrained org chart.
So the easy part is settled: same work, clone it. Which means "how many agents do I need?" was the wrong question all along. The copies are elastic — the only thing you actually design is how many kinds of agent the work calls for. Distinct definitions, not instances. The hard part is right there: when different work earns its own definition, and when it's just more for one agent to hold. That's a test, and it turns on four factors.
When should you split one agent into multiple?
If the work needs the same tools, the same context, and is measured the same way, it isn't a new capability — it's more capacity. Clone it. A support agent reading one knowledge base, fielding one customer or a thousand, is one definition at many instances — and the count looks after itself.
Sometimes the work genuinely is different, and the knee-jerk reaction is "new agent." Resist it. Most of the time the better move is an existing agent that holds the extra skill. Before defining a new one, run the work through four factors. Each is a hard boundary: if any one of them is crossed, you need a new agent. If none is, add the skill to the existing agent and clone that agent as volume demands.
Tools and permissions. Would you hand this agent the same credentials — the keys, tokens, and access that let it write to a database, send an email, or move money? The question isn't whether the new work is intellectually different; it's whether it needs a different level of trust.
Take a support agent. One that only reads tickets is low-risk: the worst it can do is say something wrong. One that can issue refunds can move money. Fold both into a single agent and every routine ticket it reads is now one bad inference — or one prompt injection buried in a customer email — away from a refund it was never meant to issue. Split them, and the refund key lives behind its own tightly-scoped agent the high-volume reader can't reach. Permissions are best set as hard walls — an agent can't make a mistake it was never given the keys to make. Different blast radius, different agent.
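The hard-wall idea can be sketched as a tool allowlist enforced outside the model. This is an illustration under stated assumptions: `AgentDefinition`, `TOOLS`, and the two tool functions are hypothetical stand-ins, not the API of any agent framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


def read_ticket(ticket_id: str) -> str:
    return f"contents of ticket {ticket_id}"


def issue_refund(ticket_id: str) -> str:
    return f"refund issued for ticket {ticket_id}"


# Hypothetical tool registry; a real system would wrap actual API clients.
TOOLS: Dict[str, Callable[[str], str]] = {
    "read_ticket": read_ticket,
    "issue_refund": issue_refund,
}


@dataclass(frozen=True)
class AgentDefinition:
    name: str
    allowed_tools: FrozenSet[str]

    def call_tool(self, tool: str, ticket_id: str) -> str:
        # The wall sits in code, not in the prompt: no grant, no call,
        # whatever the model or an injected email asks for.
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} has no access to {tool}")
        return TOOLS[tool](ticket_id)


reader = AgentDefinition("ticket_reader", frozenset({"read_ticket"}))
refunder = AgentDefinition("refund_agent", frozenset({"read_ticket", "issue_refund"}))

print(reader.call_tool("read_ticket", "T-1"))
try:
    reader.call_tool("issue_refund", "T-1")
except PermissionError as err:
    print("blocked:", err)
print(refunder.call_tool("issue_refund", "T-1"))
```

The reader can be cloned a thousand times without widening the blast radius, because no clone of it ever holds the refund key.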
Independent evaluation. Do you need to measure and tune it on its own? Every agent has a definition of "good" you score it against, and a tuning loop — you adjust the prompt and check whether the score moved. Two things break when one agent carries two jobs with different yardsticks. First, you lose sight of each: a summariser is judged on faithfulness, a classifier on accuracy, and a single blended score can't tell you whether it's summarising well and classifying badly, or the reverse. Second, the two share one prompt, so tuning summaries to be more faithful can quietly pull classification accuracy down — and the blended score hides it.
Split them and each agent gets its own metric and tuning loop: you measure faithfulness and accuracy separately, and improve each without touching the other. If you can't measure a piece of work on its own, or can't improve it without disturbing something else, it needs to be its own agent to be tuned at all.
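A small piece of arithmetic shows how a blended score can hide the trade-off. The figures below are invented for illustration, not measurements, and `blended` is a hypothetical helper that simply averages the two metrics.

```python
def blended(faithfulness: float, accuracy: float) -> float:
    """Average two task metrics into one score, as a shared agent might report."""
    return (faithfulness + accuracy) / 2


# Illustrative numbers only, not measurements.
before = {"faithfulness": 0.80, "accuracy": 0.90}
after = {"faithfulness": 0.90, "accuracy": 0.80}  # prompt tuned for faithfulness

print("blended before:", round(blended(**before), 2))
print("blended after: ", round(blended(**after), 2))
for metric in before:
    print(f"{metric} change: {after[metric] - before[metric]:+.2f}")
```

The blended score does not move, yet one job improved and the other regressed by the same amount. Only the per-metric view, which a split gives you by construction, shows it.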
Context isolation. Does its playbook pollute the other's? An agent's context — its instructions, the rules and procedures for the job, the examples and reference it leans on — works best holding only what the task in front of it needs. Picture a contract-review agent and a customer-email agent in one. The reviewer's playbook is clause taxonomies, risk flags, jurisdiction rules; the email agent's is tone guidelines, templates, escalation paths. Fold them together and every email drags in the entire contract rulebook it will never use, every contract drags in the email templates: each call is bigger, slower, and dearer, and the model's attention is spread across instructions that don't apply. Concentrated context isn't just cheaper per call — it shows up in the quality of the output.
And the playbooks bleed: the email agent picks up legalese, the reviewer turns chatty in a risk assessment. When the instructions one job needs would be noise — or actively misleading — inside the other's, split them. (This is the standing playbook, not the running history of a task — that's the next factor.)
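The cost side of a merged playbook is easy to work through. The token counts here are assumed purely for illustration, and `tokens_per_call` is a hypothetical helper, so treat this as a sketch of the arithmetic and not a benchmark.

```python
from typing import Dict, List

# Illustrative playbook sizes in tokens; not measurements.
PLAYBOOK_TOKENS: Dict[str, int] = {"contract_review": 6000, "customer_email": 1500}


def tokens_per_call(playbooks: List[str], task_tokens: int) -> int:
    """Context size for one call: every loaded playbook plus the task itself."""
    return sum(PLAYBOOK_TOKENS[name] for name in playbooks) + task_tokens


email_task = 300  # assumed size of the customer email being answered

merged = tokens_per_call(["contract_review", "customer_email"], email_task)
split = tokens_per_call(["customer_email"], email_task)

print("merged agent, per email:", merged)  # 7800
print("split agent, per email: ", split)   # 1800
```

Under these assumed sizes, every email handled by the merged agent carries 6,000 tokens of contract rules it will never use. The quality cost of that noise is harder to put a number on, but it rides on the same surplus.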
Separate memory. Would the other job's history just be noise in this one? Memory is the dynamic twin of context: not the standing playbook, but the running record a task builds as it goes — what's been done, what's been found, what's still open. Take the contract reviewer again, now also drafting customer emails. As it works a deal it accumulates a working memory of flagged clauses and open points; fold the email thread in and that record fills with customer back-and-forth, until the agent turns back to the review, wades through the chatter to find its place, and risks treating a line from an email as if it were a term in the contract.
Stale history costs twice: it grows the context on every turn, and it can actively mislead. An agent stays sharper when its memory holds only its own task. When one job's record would clutter — or contaminate — the other's, give each its own.
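In code, separate memory can be as plain as each agent owning its own running record. The `Agent` class and the sample notes are hypothetical, a sketch of the idea and not a real memory system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    name: str
    # Running record owned by this agent alone; default_factory gives
    # every instance its own list instead of one shared between them.
    memory: List[str] = field(default_factory=list)

    def note(self, entry: str) -> None:
        self.memory.append(entry)


reviewer = Agent("contract_reviewer")
emailer = Agent("customer_email")

reviewer.note("clause 7: liability cap flagged")
emailer.note("customer asked to move the call to Friday")
reviewer.note("clause 12: termination notice still open")

print(reviewer.memory)
print(emailer.memory)
```

When the reviewer returns to the deal, its record holds only flagged clauses and open points. The customer thread cannot be mistaken for a contract term because it was never written there.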
If none of the four fire, it isn't a new agent — it's a clone with a fuller playbook. And the bias should be toward fewer, because splitting isn't free. Notice, too, what isn't among the four: parallelism. Whether work can run concurrently is a scaling question — clones already run side by side as the work allows — not a reason to define a new kind of agent. The same logic warns against the opposite mistake: don't carve a single sequential task — each step feeding the next — into a chain of separate agents. They only wait on each other, and every handoff costs you. Splitting helps when the work breaks into genuinely independent parts that can run at once; it hurts when the work was really one thing in sequence.
The numbers bear it out. Anthropic reported that its multi-agent research system beat a single-agent setup by about 90% on an internal research eval, while noting that multi-agent systems use roughly 15× the tokens of a chat interaction. Google's scaling research found agent systems gained +80.9% on parallelizable tasks yet lost 39% to 70% on sequential ones, with performance degrading past about seven agents. More agents are faster and better only when the work genuinely splits. When it doesn't, they are slower, more expensive, and worse.
When should you not use multiple agents?
The four factors aren't a checklist you run once at the start. They're a test you re-run whenever the work grows or changes shape, because the answer can change with it. What stays constant is the rule for applying them — short enough to hold in your head, and small enough to fit in a single view.
Clone, or define a new agent?
Start with one. Need more of the same work? Clone it — and you don't set the number; a well-built system scales copies to demand. Define a new kind of agent only when the work trips one of these four:
| Ask of the new work | Define a new agent when… | Example |
|---|---|---|
| Tools & permissions: would you hand it the same keys? | it needs broader or more dangerous access than the existing agent should carry | reads tickets vs issues refunds |
| Independent evaluation: can you measure and tune it on its own? | its definition of "good" differs, and tuning it would disturb the other | faithfulness (summariser) vs accuracy (classifier) |
| Context isolation: would its playbook pollute the other's? | each job's instructions are noise, or misleading, inside the other's context | contract clauses vs email templates |
| Separate memory: would the other's history be noise? | its running record would clutter or contaminate the other's | a deal's working notes vs a customer thread |
If none fire, it isn't a new agent — it's the same agent with more in its playbook. Default to the fewest kinds the work needs, not the most the org chart suggests.
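The rule is small enough to write down as a function. This is a sketch of the decision, with hypothetical field names standing in for the four questions. The answers themselves still come from human judgment about the work, not from code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkAssessment:
    """Answers about a piece of new work (field names are hypothetical)."""
    same_kind_of_work: bool       # same tools, context, and yardstick
    needs_riskier_access: bool    # tools and permissions
    needs_own_evaluation: bool    # independent evaluation
    playbook_pollutes: bool       # context isolation
    history_contaminates: bool    # separate memory


def decide(work: WorkAssessment) -> str:
    if work.same_kind_of_work:
        return "clone"  # more volume, not a new kind
    crossed = (
        work.needs_riskier_access,
        work.needs_own_evaluation,
        work.playbook_pollutes,
        work.history_contaminates,
    )
    # Any single boundary is enough to justify a new definition.
    return "new agent" if any(crossed) else "extend playbook"


more_tickets = WorkAssessment(True, False, False, False, False)
refunds = WorkAssessment(False, True, False, False, False)
new_faq_topic = WorkAssessment(False, False, False, False, False)

print(decide(more_tickets))   # clone
print(decide(refunds))        # new agent
print(decide(new_faq_topic))  # extend playbook
```

Parallelism is deliberately absent from the inputs: whether work can run concurrently changes how many clones run, not how many kinds exist.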
Apply that rule and the default becomes obvious: most moments that feel like "I need another agent" aren't new kinds at all. Same work in greater volume is a clone. Different work that crosses none of the four lines is the same agent with a longer playbook. Work that runs in steps, each feeding the next, is one agent moving through the sequence. A new kind earns its place only when a boundary genuinely forces it — and most of the time, none does.
So the honest answer to "how many agents?" is: as few kinds as the work truly requires. You aren't managing a headcount of copies — those scale themselves. You're managing a much smaller number, the distinct kinds of agent the work actually needs, and the discipline is to keep it that way: add a new kind only under real pressure, never on the reflex that more work means more agents.
Restraint is the design
Setting up a new agent costs almost nothing. That's exactly what makes the mistake so easy to make. The price of an agent isn't in the building — it's in the keeping.
Agents are software, and with software the cost of maintenance dwarfs the cost of the build. Every kind of agent you define is a permanent liability — something the system has to route to, govern, secure, and watch over for as long as it exists. A clone only adds load; a new kind adds surface area. Define kinds on reflex and you don't end up with a capable system, you end up with a pile of agents — each one more to maintain, each boundary one more thing that can break.
The two ways to get it wrong cost differently. Under-split, with one bloated agent doing everything, and quality erodes: attention spreads thin, playbooks bleed into each other, and you can't tune any part without disturbing the rest. Over-split, with a new kind for every task, and the tokens, the latency, the routing, and the maintenance all multiply for boundaries the work never needed. Neither is free, and the balance between them isn't something you stumble into at runtime.
It's decided before anything is built. The best outcomes come from the effort put in up front — the analysis ahead of the design — sorting which work is the same, to be cloned and spun up automatically as demand rises, and which is different enough to earn its own kind. The cloning takes care of itself. The kinds are the judgment call — and it's the one part of this that doesn't get cheaper as the models improve. The code beneath is commoditising; the judgment above it — what to build, what to combine, what to leave to a person — is not. That judgment is the design layer, and designing it well is the work Aikin does.