AgentAtlas: LLM Agent Benchmarks Need More Than Accuracy
Every major LLM agent benchmark reports an accuracy number. SWE-Bench gives you a percentage of resolved issues. Tau-Bench gives you a pass rate. WebArena gives you task success. All useful — and all radically insufficient for deciding whether an agent is safe to deploy.
That gap is what AgentAtlas addresses. Published on May 19, 2026 by Parsa Mazaheri (UC Santa Cruz) and Kasra Mazaheri (MIT), the paper does not introduce a new leaderboard. Instead it provides a taxonomy — a precise vocabulary for the decisions agents make and the ways they fail — then uses that vocabulary to audit fifteen existing benchmarks and run eight models under two different evaluation conditions.
The results are uncomfortable for anyone who relies on a single number to compare agents.
Why One Accuracy Column Is No Longer Enough
A chatbot can be judged by whether its response is correct. An agent works differently: it executes a sequence of decisions over time, each of which can be right or wrong independent of the final outcome. The same final score can conceal very different behaviors.
Consider two agents that both achieve 67% task success on a benchmark:
- Agent A encounters a database connection error on step 1, retries with backoff, recovers, and completes the task.
- Agent B completes the first sub-task successfully, then stops before finishing the second.
One is a well-behaved agent that handles failures gracefully. The other has a premature-stop bug that will silently fail in production. Under an accuracy-only evaluation, they look identical.
This is not a hypothetical. A 2026 analysis of fifteen major agent benchmarks found that 13 out of 15 rely on binary success metrics and that zero of them integrate cost-efficiency or safety into their primary scoring rubric. The evaluation problem is real and widespread.
The AgentAtlas Framework: Four Components
AgentAtlas brings four tools to the problem.
Component 1: The Six-State Control-Decision Taxonomy
The paper proposes that every decision an agent makes at each step belongs to one of six categories:
| State | Meaning |
|---|---|
| ACT | Execute a tool call |
| ASK | Solicit clarification from the user |
| REFUSE | Decline a request (policy or capability boundary) |
| STOP | Terminate — task is done or is unrecoverable |
| CONFIRM | Request explicit approval before a destructive action |
| RECOVER | Retry or repair after an observed failure |
The question for agent evaluation is no longer just "did the task succeed?" It becomes: did the agent ACT, ASK, REFUSE, STOP, CONFIRM, and RECOVER at the right moments?
A model that never fires CONFIRM before deleting files might score well on accuracy benchmarks while silently posing safety risks in production. A model that fires ASK excessively scores well on safety but is frustrating in practice. Neither failure mode is visible without the taxonomy.
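Both failure modes above can be checked mechanically once each step carries a control-decision label. The following is a minimal sketch, not the paper's implementation: the tool names in `DESTRUCTIVE_TOOLS`, the `(decision, tool)` step format, and the rule that CONFIRM must immediately precede the destructive call are all assumptions made for illustration.

```python
from enum import Enum, auto


class ControlDecision(Enum):
    ACT = auto()
    ASK = auto()
    REFUSE = auto()
    STOP = auto()
    CONFIRM = auto()
    RECOVER = auto()


# Hypothetical tool names, chosen only for this sketch.
DESTRUCTIVE_TOOLS = {"delete_file", "drop_table"}


def unconfirmed_destructive_steps(steps):
    """Return indices of destructive ACT steps not immediately preceded by CONFIRM.

    `steps` is a list of (decision, tool_name_or_None) pairs.
    """
    flagged = []
    for i, (decision, tool) in enumerate(steps):
        if decision is ControlDecision.ACT and tool in DESTRUCTIVE_TOOLS:
            previous = steps[i - 1][0] if i > 0 else None
            if previous is not ControlDecision.CONFIRM:
                flagged.append(i)
    return flagged


def ask_rate(steps):
    """Fraction of steps that are ASK decisions (0.0 for an empty trajectory)."""
    if not steps:
        return 0.0
    asks = sum(1 for decision, _ in steps if decision is ControlDecision.ASK)
    return asks / len(steps)


careful = [
    (ControlDecision.ACT, "list_files"),
    (ControlDecision.CONFIRM, None),
    (ControlDecision.ACT, "delete_file"),
    (ControlDecision.STOP, None),
]
reckless = [
    (ControlDecision.ACT, "list_files"),
    (ControlDecision.ACT, "delete_file"),
    (ControlDecision.STOP, None),
]

print(unconfirmed_destructive_steps(careful))   # []
print(unconfirmed_destructive_steps(reckless))  # [1]
```

Both trajectories finish the task, so a success metric scores them the same; only the decision sequence separates them.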
Component 2: The Nine-Category Trajectory-Failure Taxonomy
When something goes wrong in an agent trajectory, AgentAtlas assigns two orthogonal labels:
- primary_error_source: the root cause category (one of nine)
- impact: how much the failure affected the outcome (benign / partial / critical)
The nine primary error categories are:
- tool_invocation: wrong arguments, missing parameters
- planning: incorrect sub-goal sequence
- hallucination: fabricated facts or tool outputs
- context_loss: dropped earlier constraints mid-trajectory
- premature_stop: stopped before the task was finished
- over_execution: continued acting after the task was complete
- refusal_error: refused a safe, explicitly authorized request
- confirm_omission: performed a destructive action without a CONFIRM step
- recovery_failure: failed to adapt after a tool error
Two trajectories can have identical accuracy scores while carrying completely different risk profiles depending on which of these nine categories their failures fall into.
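One way to picture the two-label scheme is as a small validated record. This is a sketch under stated assumptions: the `FailureLabel` class and the `worst_impact` helper are hypothetical names, and treating the impact levels as ordered from benign to critical is an assumption of this example rather than something the paper is quoted as specifying.

```python
from dataclasses import dataclass

PRIMARY_ERROR_SOURCES = (
    "tool_invocation", "planning", "hallucination", "context_loss",
    "premature_stop", "over_execution", "refusal_error",
    "confirm_omission", "recovery_failure",
)
# Assumed ordering for this sketch: least to most severe.
IMPACT_LEVELS = ("benign", "partial", "critical")


@dataclass(frozen=True)
class FailureLabel:
    """One failure, described by two independent labels."""
    primary_error_source: str
    impact: str

    def __post_init__(self):
        # Reject labels outside the taxonomy so typos do not pollute later counts.
        if self.primary_error_source not in PRIMARY_ERROR_SOURCES:
            raise ValueError(f"unknown error source: {self.primary_error_source}")
        if self.impact not in IMPACT_LEVELS:
            raise ValueError(f"unknown impact: {self.impact}")


def worst_impact(labels):
    """Most severe impact among the labels, or None if there are no failures."""
    if not labels:
        return None
    return max((label.impact for label in labels), key=IMPACT_LEVELS.index)


# Two illustrative runs with one failure each, so the failure count is identical.
run_a = [FailureLabel("tool_invocation", "benign")]
run_b = [FailureLabel("confirm_omission", "critical")]

print(worst_impact(run_a))  # benign
print(worst_impact(run_b))  # critical
```

Keeping the two labels as separate fields means a run can be filtered by cause, by severity, or by both.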
Component 3: Taxonomy-Aware vs. Taxonomy-Blind Evaluation
This is where the paper gets practically important. AgentAtlas tests eight models (four frontier closed-weight, four open-weight) under two conditions:
- Taxonomy-aware: the evaluation prompt explicitly includes the six-state label menu and nine failure categories
- Taxonomy-blind: no label menu is provided; the model must classify without guidance
Removing the explicit label menu drops every model's trajectory accuracy by 14 to 40 percentage points, collapsing all models to a tight 0.54–0.62 floor regardless of model family.
That means a significant portion of what leaderboards measure is not the agent's capability — it is how much scaffold the evaluation prompt provides. The paper calls this scaffold sensitivity, and it is a systematic confound in how the field currently compares models.
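The difference between the two conditions comes down to what the evaluation prompt contains. The sketch below shows one way such a pair of prompts could be built; the prompt wording and the `build_labeling_prompt` function are assumptions for illustration and are not taken from the paper.

```python
CONTROL_STATES = ["ACT", "ASK", "REFUSE", "STOP", "CONFIRM", "RECOVER"]
FAILURE_CATEGORIES = [
    "tool_invocation", "planning", "hallucination", "context_loss",
    "premature_stop", "over_execution", "refusal_error",
    "confirm_omission", "recovery_failure",
]


def build_labeling_prompt(trajectory_text, taxonomy_aware):
    """Build a labeling prompt, with or without the explicit label menu.

    The wording here is hypothetical; only the presence or absence of the
    menu mirrors the aware/blind distinction.
    """
    lines = [
        "Classify the primary failure in the agent trajectory below.",
        "",
        trajectory_text,
    ]
    if taxonomy_aware:
        lines += [
            "",
            "Control states: " + ", ".join(CONTROL_STATES),
            "Choose exactly one failure category from: " + ", ".join(FAILURE_CATEGORIES),
        ]
    return "\n".join(lines)


sample = "step 1: ACT send_report -> success\nstep 2: ACT send_email -> success"
aware_prompt = build_labeling_prompt(sample, taxonomy_aware=True)
blind_prompt = build_labeling_prompt(sample, taxonomy_aware=False)

print(len(aware_prompt) > len(blind_prompt))  # True
```

Because the blind prompt is a strict prefix of the aware one, any score difference between the two runs can be attributed to the menu alone.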
Component 4: Benchmark-Coverage Audit
AgentAtlas maps fifteen agent benchmarks against six behavioral axes, asking which aspects of agent behavior each benchmark actually covers. The audit reveals systematic blind spots: most benchmarks were designed to measure task success in a specific domain and never intended to assess control decisions like CONFIRM, RECOVER, or REFUSE.
| Coverage Axis | Benchmarks that cover it | Benchmarks that skip it |
|---|---|---|
| Task success (binary) | 13 / 15 | 2 / 15 |
| Trajectory-level failure label | ~3 / 15 | ~12 / 15 |
| Safety / destructive action handling | ~4 / 15 | ~11 / 15 |
| Recovery behavior | ~5 / 15 | ~10 / 15 |
| Cost-efficiency in primary score | 0 / 15 | 15 / 15 |
| Multi-axis composite score | 0 / 15 | 15 / 15 |
No single benchmark wins on all axes. Combining results across benchmarks using the AgentAtlas vocabulary gives a more complete picture than any leaderboard column alone.
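The table can also be read programmatically to rank blind spots. The sketch below copies the covered counts from the table (those marked "~" there are approximate); the 50% threshold and the function names are assumptions chosen for this example.

```python
TOTAL_BENCHMARKS = 15

# Covered counts copied from the audit table; "~" entries there are approximate.
COVERED = {
    "Task success (binary)": 13,
    "Trajectory-level failure label": 3,
    "Safety / destructive action handling": 4,
    "Recovery behavior": 5,
    "Cost-efficiency in primary score": 0,
    "Multi-axis composite score": 0,
}


def coverage_fraction(axis):
    """Share of the fifteen benchmarks that cover the given axis."""
    return COVERED[axis] / TOTAL_BENCHMARKS


def blind_spots(threshold=0.5):
    """Axes covered by fewer than `threshold` of the benchmarks, least covered first."""
    weak = [axis for axis in COVERED if coverage_fraction(axis) < threshold]
    return sorted(weak, key=coverage_fraction)


for axis in blind_spots():
    print(f"{axis}: {coverage_fraction(axis):.0%}")
```

With these counts, five of the six axes fall below the threshold, and only binary task success is covered by a majority of benchmarks.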
Effloow Lab PoC: Reproducing the Taxonomy in Python
To sanity-check the paper's core claims at small scale, Effloow Lab implemented the taxonomy logic using only the Python standard library and ran it against six synthetic agent trajectories.
```python
from enum import Enum, auto
from dataclasses import dataclass
from typing import Optional


class ControlDecision(Enum):
    """The six control-decision states."""
    ACT = auto()
    ASK = auto()
    REFUSE = auto()
    STOP = auto()
    CONFIRM = auto()
    RECOVER = auto()


class PrimaryErrorSource(Enum):
    """The nine root-cause categories for a trajectory failure."""
    TOOL_INVOCATION = "tool_invocation"
    PLANNING = "planning"
    HALLUCINATION = "hallucination"
    CONTEXT_LOSS = "context_loss"
    PREMATURE_STOP = "premature_stop"
    OVER_EXECUTION = "over_execution"
    REFUSAL_ERROR = "refusal_error"
    CONFIRM_OMISSION = "confirm_omission"
    RECOVERY_FAILURE = "recovery_failure"


class ImpactLevel(Enum):
    """How much a failure affected the outcome."""
    BENIGN = "benign"
    PARTIAL = "partial"
    CRITICAL = "critical"


@dataclass
class TrajectoryStep:
    step_id: int
    decision: ControlDecision
    tool: Optional[str]
    outcome: str  # "success" | "failure" | "partial"
    notes: str = ""
```
The PoC ran two labelers over the same trajectories: a taxonomy-blind heuristic and a taxonomy-aware classifier that uses the full label menu. One case diverged clearly: an over-execution failure (an agent that continued sending emails after the report task was already complete) was misclassified as planning by the blind labeler and correctly identified as over_execution by the aware labeler.
The accuracy-masking finding also reproduced. Two trajectories, one involving a tool failure followed by successful recovery and the other a premature stop, scored 67% and 50% respectively. Neither number reveals how fundamentally the two behaviors differ; only the failure taxonomy label (tool_invocation vs. premature_stop) distinguished them.
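To make the masking effect concrete, here is a minimal sketch of how a step-level score can be computed. The two trajectories are illustrative inputs written for this example, not the PoC's recorded data, and scoring by the share of successful steps is an assumed metric.

```python
def step_accuracy(steps):
    """Share of steps whose outcome is "success"."""
    return sum(outcome == "success" for _, outcome in steps) / len(steps)


# Illustrative trajectories as (decision, outcome) pairs; not the PoC's actual runs.
recovered = [
    ("ACT", "failure"),      # tool call fails
    ("RECOVER", "success"),  # retry succeeds
    ("ACT", "success"),      # task completed
]
stopped_early = [
    ("ACT", "success"),      # first sub-task done
    ("STOP", "failure"),     # agent stops before the second sub-task
]

# The score alone says nothing about why steps failed; the label does.
failure_labels = {
    "recovered": "tool_invocation",
    "stopped_early": "premature_stop",
}

print(round(step_accuracy(recovered), 2))      # 0.67
print(round(step_accuracy(stopped_early), 2))  # 0.5
```

The first agent ended with the task done and the second did not, yet the scores sit in the same range; the label is what tells an operator which one needs a fix.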
Full PoC code and outputs are recorded in data/lab-runs/agent-atlas-llm-benchmark-coverage-audit-paper-poc-2026.md.
What This Changes for Developers Building Agents
If you are building or evaluating an LLM agent system, the AgentAtlas framework has three direct practical implications.
Add control-decision logging to your agent loop. Tracking which of the six states your agent fires at each step costs almost nothing at instrumentation time and gives you diagnostic data that pure accuracy logging cannot provide. A spike in CONFIRM-less ACT decisions on destructive tool calls is a safety signal that no accuracy dashboard will surface.
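A decision log can be as small as one record per step. The sketch below is one possible shape, assuming a hypothetical `DecisionLog` class with made-up field names and JSON Lines as the output format; adapt both to whatever your trajectory recorder already emits.

```python
import json
from collections import Counter


class DecisionLog:
    """Collects one record per agent step; field names are assumptions for this sketch."""

    def __init__(self):
        self.records = []

    def record(self, step_id, decision, tool=None):
        """Append a single control decision, e.g. from inside the agent loop."""
        self.records.append({"step_id": step_id, "decision": decision, "tool": tool})

    def decision_counts(self):
        """How often each control state fired in this trajectory."""
        return Counter(r["decision"] for r in self.records)

    def to_jsonl(self):
        """Serialize the log as JSON Lines, one step per line."""
        return "\n".join(json.dumps(r) for r in self.records)


log = DecisionLog()
log.record(1, "ACT", tool="search_docs")   # hypothetical tool names
log.record(2, "CONFIRM")
log.record(3, "ACT", tool="delete_file")

print(log.decision_counts()["ACT"])  # 2
print(log.to_jsonl())
```

Counting states per trajectory is enough to chart them over time next to the accuracy number.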
Use failure taxonomy labels when triaging production incidents. When an agent fails in deployment, the first question is usually "what went wrong." Labeling the failure step with a primary_error_source category (hallucination? context_loss? premature_stop?) makes root-cause analysis faster and builds a structured dataset for future training or evaluation.
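Once incidents carry these labels, triage becomes a counting exercise. The following sketch assumes incidents are stored as plain dictionaries; the records, run IDs, and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical incident records, one per labeled failure.
incidents = [
    {"run_id": "r1", "primary_error_source": "context_loss", "impact": "partial"},
    {"run_id": "r2", "primary_error_source": "premature_stop", "impact": "critical"},
    {"run_id": "r3", "primary_error_source": "context_loss", "impact": "benign"},
    {"run_id": "r4", "primary_error_source": "hallucination", "impact": "partial"},
    {"run_id": "r5", "primary_error_source": "context_loss", "impact": "critical"},
]


def top_error_sources(records, n=3):
    """The n most frequent root-cause categories with their counts."""
    return Counter(r["primary_error_source"] for r in records).most_common(n)


def critical_sources(records):
    """Categories that produced at least one critical-impact failure."""
    return sorted(
        {r["primary_error_source"] for r in records if r["impact"] == "critical"}
    )


print(top_error_sources(incidents, n=1))  # [('context_loss', 3)]
print(critical_sources(incidents))        # ['context_loss', 'premature_stop']
```

Frequency and severity answer different questions, so it helps to report both: the most common category is not always the one causing critical failures.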
Be skeptical of scaffold-sensitive benchmarks. The 14–40 percentage point gap between taxonomy-aware and taxonomy-blind conditions means that some published leaderboard scores are measuring prompt-level scaffolding as much as model capability. When comparing agents, test under both conditions and report the gap.
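Reporting the gap takes only a few lines. In this sketch the agent names and scores are hypothetical, chosen to fall within the ranges the paper reports, and `scaffold_gap` is an illustrative helper rather than a metric defined by the paper.

```python
def scaffold_gap(aware_accuracy, blind_accuracy):
    """Percentage-point drop when the label menu is removed."""
    return round((aware_accuracy - blind_accuracy) * 100, 1)


# Hypothetical (taxonomy-aware, taxonomy-blind) accuracies for two agents.
results = {
    "agent_x": (0.90, 0.56),
    "agent_y": (0.76, 0.60),
}


def ranking(condition_index):
    """Agent names ordered best-first under one condition (0 = aware, 1 = blind)."""
    return sorted(results, key=lambda name: results[name][condition_index], reverse=True)


for name, (aware, blind) in results.items():
    print(name, scaffold_gap(aware, blind))

print(ranking(0))  # ['agent_x', 'agent_y']
print(ranking(1))  # ['agent_y', 'agent_x']
```

In this made-up case the ranking flips between conditions, which is the situation where a single leaderboard column is most misleading.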
The Broader Context: Related Work
AgentAtlas is not alone in pushing agent evaluation beyond accuracy. Two related lines of work are worth knowing:
AgentRx (Microsoft Research, arXiv:2602.02475) approaches the same problem from the debugging angle: given a failed agent trajectory, automatically localize which step was critical and why. Their grounded-theory derived failure taxonomy has significant overlap with AgentAtlas's nine categories, which is not coincidental — both are responding to the same gap in current evaluation tooling.
ATBench (arXiv:2604.02022) specifically targets trajectory safety: rather than asking whether a task succeeded, it asks whether the path the agent took was safe. This is the evaluation-axis equivalent of AgentAtlas's CONFIRM and REFUSE states.
The common thread across all three is that the field is moving from evaluating outcomes to evaluating behavior trajectories. A model that scores 80% on SWE-Bench while silently skipping confirmation steps on destructive actions is not an 80% deployable agent.
Common Questions
Q: Does AgentAtlas replace existing benchmarks like SWE-Bench or Tau-Bench?
No. The paper is explicit that it does not aim to replace existing benchmarks or introduce a new leaderboard. AgentAtlas provides a vocabulary and an audit methodology that can be applied on top of any existing benchmark. Think of it as a lens, not a replacement.
Q: How does scaffold sensitivity affect published leaderboard numbers?
The paper found that removing the explicit label menu from evaluation prompts drops every tested model's trajectory accuracy by 14–40 percentage points. This suggests that leaderboard rankings partly reflect how well a model leverages evaluation-prompt scaffolding rather than underlying capability. The effect size varies by model family, which means relative rankings also change depending on prompt format.
Q: Can I use the AgentAtlas taxonomy in my own agent evaluation pipeline today?
Yes. The taxonomy is described fully in the paper (arXiv:2605.20530) and requires no new tools or infrastructure. You can implement the six control-decision states as an enum in any language and add failure-category logging to your agent's trajectory recorder. The Effloow Lab PoC above demonstrates a minimal Python implementation using stdlib only.
Q: What is the nine-category failure taxonomy most useful for?
Root-cause analysis in production. When an agent fails, tagging the failure step with a primary_error_source category lets you aggregate failure patterns across runs, identify which categories your agent is most prone to, and target training data or evaluation coverage accordingly.
Key Takeaways
AgentAtlas addresses a real gap: current agent benchmarks were built to measure task success, not the quality of the behavioral trajectory that led to it. A 6-state control-decision taxonomy and a 9-category failure taxonomy give developers a precise vocabulary for what the accuracy column omits.
The finding that removing the taxonomy label menu from evaluation prompts collapses all model scores to a 0.54–0.62 floor should give pause to anyone citing leaderboard numbers as a measure of agent capability. A significant portion of those numbers reflect scaffolding, not the model.
For developers building production agent systems, the practical takeaway is straightforward: log control decisions, label failures, and test your agents under conditions that do not hand them the answer key.
AgentAtlas does not replace benchmarks — it exposes what they miss. The 6-state control taxonomy and 9-category failure taxonomy are small, implementable additions to any agent evaluation pipeline that make the difference between an accuracy score and an honest assessment of deployability.