AgentAtlas LLM Benchmark Coverage Audit Paper PoC 2026
Slug: agent-atlas-llm-benchmark-coverage-audit-paper-poc-2026
Track: paper-poc
Date: 2026-05-26
Source: arXiv:2605.20530 (Mazaheri & Mazaheri, UC Santa Cruz / MIT, 2026-05-19)
Environment
Python 3.x (stdlib only — no external packages)
OS: macOS Darwin 24.6.0
Goal
Reproduce the core taxonomy logic from AgentAtlas: implement the 6-state control-decision taxonomy and the 9-category trajectory-failure taxonomy, then demonstrate how taxonomy-aware vs. taxonomy-blind labeling diverges and how accuracy alone fails to distinguish failure modes.
Commands Run
# Ran in /tmp/agentAtlas-poc/poc.py
python3 /tmp/agentAtlas-poc/poc.py
Output
=== AgentAtlas PoC: Control-Decision + Failure Taxonomy Labeler ===
Traj Step Decision Outcome Blind Error Aware Error Diverges?
--------------------------------------------------------------------------------------------
1 1 ACT success None None
1 2 ACT success None None
1 3 STOP success None None
2 1 ACT failure tool_invocation tool_invocation
2 2 RECOVER success None None
2 3 STOP success None None
3 1 ACT success None None
3 2 STOP partial premature_stop premature_stop
4 1 ACT failure tool_invocation tool_invocation
5 1 ACT failure tool_invocation tool_invocation
5 2 STOP partial premature_stop premature_stop
6 1 ACT success None None
6 2 ACT partial planning over_execution YES
Divergence rate (blind vs. aware): 1/13 steps = 7.7%
(Paper reports 14-40pp accuracy gap when label menu removed from prompt)
=== What accuracy alone cannot see ===
Trajectory B (tool fail + recovery): accuracy=67%
Trajectory C (premature stop): accuracy=50%
Both score ~67% accuracy. Only failure-taxonomy label 'premature_stop' distinguishes them.
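As a cross-check, the divergence rate and the two per-trajectory accuracy figures can be recomputed from the table rows. The sketch below is not the PoC script itself. It assumes that accuracy means the fraction of steps whose outcome is "success", and that Trajectories B and C correspond to rows 2 and 3 of the table. The names TRAJECTORIES, divergence, and accuracy are illustrative.

```python
# Per-step rows copied from the output table: (decision, outcome, blind, aware).
TRAJECTORIES = {
    1: [("ACT", "success", None, None),
        ("ACT", "success", None, None),
        ("STOP", "success", None, None)],
    2: [("ACT", "failure", "tool_invocation", "tool_invocation"),
        ("RECOVER", "success", None, None),
        ("STOP", "success", None, None)],
    3: [("ACT", "success", None, None),
        ("STOP", "partial", "premature_stop", "premature_stop")],
    4: [("ACT", "failure", "tool_invocation", "tool_invocation")],
    5: [("ACT", "failure", "tool_invocation", "tool_invocation"),
        ("STOP", "partial", "premature_stop", "premature_stop")],
    6: [("ACT", "success", None, None),
        ("ACT", "partial", "planning", "over_execution")],
}


def divergence(trajectories):
    """Return (diverging_steps, total_steps) across all trajectories."""
    steps = [step for traj in trajectories.values() for step in traj]
    diverging = sum(1 for _, _, blind, aware in steps if blind != aware)
    return diverging, len(steps)


def accuracy(steps):
    """Assumed definition: fraction of steps whose outcome is 'success'."""
    return sum(1 for _, outcome, _, _ in steps if outcome == "success") / len(steps)


diverging, total = divergence(TRAJECTORIES)
print(f"Divergence: {diverging}/{total} = {diverging / total:.1%}")   # 1/13 = 7.7%
print(f"Trajectory B accuracy: {accuracy(TRAJECTORIES[2]):.0%}")      # 67%
print(f"Trajectory C accuracy: {accuracy(TRAJECTORIES[3]):.0%}")      # 50%
```

Under this assumed definition, B and C score 67% and 50% respectively, and neither number indicates which failure category was involved.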
What Was Implemented
- ControlDecision enum: 6 states (ACT, ASK, REFUSE, STOP, CONFIRM, RECOVER)
- PrimaryErrorSource enum: 9 categories (tool_invocation, planning, hallucination, context_loss, premature_stop, over_execution, refusal_error, confirm_omission, recovery_failure)
- ImpactLevel enum: 3 levels (benign, partial, critical)
- label_step(): taxonomy-blind heuristic labeler (simulates a baseline without the label menu)
- label_step_aware(): taxonomy-aware labeler (simulates a prompt that includes the explicit label menu)
- 6 synthetic trajectories covering normal success, recovery, premature stop, confirm omission, hallucination, over-execution
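The three enums might be declared along the following lines using the standard-library enum module. The class names and member lists come from this note. The string values assigned to each member are an assumption, and the actual poc.py may differ.

```python
from enum import Enum


class ControlDecision(Enum):
    """The 6 control-decision states."""
    ACT = "act"
    ASK = "ask"
    REFUSE = "refuse"
    STOP = "stop"
    CONFIRM = "confirm"
    RECOVER = "recover"


class PrimaryErrorSource(Enum):
    """The 9 trajectory-failure categories."""
    TOOL_INVOCATION = "tool_invocation"
    PLANNING = "planning"
    HALLUCINATION = "hallucination"
    CONTEXT_LOSS = "context_loss"
    PREMATURE_STOP = "premature_stop"
    OVER_EXECUTION = "over_execution"
    REFUSAL_ERROR = "refusal_error"
    CONFIRM_OMISSION = "confirm_omission"
    RECOVERY_FAILURE = "recovery_failure"


class ImpactLevel(Enum):
    """The 3 impact levels."""
    BENIGN = "benign"
    PARTIAL = "partial"
    CRITICAL = "critical"
```

Using string values that match the printed labels lets a label such as "over_execution" be parsed back with PrimaryErrorSource("over_execution").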
Key Findings
- Divergence is real. The over-execution case (Trajectory F, Step 2) is misclassified as planning by the blind labeler but correctly classified as over_execution by the aware labeler. This is directionally consistent with the paper's 14-40pp accuracy gap, although the 7.7% step-level divergence here comes from a rule-based labeler and is not directly comparable.
- An accuracy column is insufficient. Trajectory B (67% accuracy, tool failure plus recovery) and Trajectory C (50% accuracy, premature stop) both look like partially failed runs under an accuracy-only lens, and the score says nothing about why each fell short. Only the trajectory-failure labels (tool_invocation for B, premature_stop for C) disambiguate them.
- Confirm omission is invisible. Trajectory D (a destructive delete_files call without a CONFIRM step) scores as a tool_invocation failure, so it looks like an ordinary failure. Without the CONFIRM state in the taxonomy, this safety-relevant failure is indistinguishable from a benign tool error.
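The confirm-omission finding suggests a simple structural check: flag any destructive ACT step that is not immediately preceded by a CONFIRM step. The sketch below is one hypothetical way to express that rule and is not taken from the PoC script. DESTRUCTIVE_TOOLS, has_confirm_omission, and the (decision, tool) step format are assumptions made for illustration. Only delete_files appears in this note.

```python
# Hypothetical set of tools treated as destructive; only delete_files is from the note.
DESTRUCTIVE_TOOLS = {"delete_files"}


def has_confirm_omission(steps):
    """Return True if a destructive ACT is not immediately preceded by CONFIRM.

    steps: list of (decision, tool_name) tuples; tool_name may be None.
    """
    for i, (decision, tool) in enumerate(steps):
        if decision == "ACT" and tool in DESTRUCTIVE_TOOLS:
            previous = steps[i - 1][0] if i > 0 else None
            if previous != "CONFIRM":
                return True
    return False


# A destructive call with no prior CONFIRM is flagged.
print(has_confirm_omission([("ACT", "delete_files")]))
# The same call preceded by CONFIRM is not flagged.
print(has_confirm_omission([("CONFIRM", None), ("ACT", "delete_files")]))
```

A check like this depends only on the control-decision sequence, so it can flag the safety-relevant case even when the step's outcome looks like an ordinary tool failure.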
Limitations
- The heuristic labeler is rule-based (not LLM-based as in the paper), so measuring real divergence rates requires prompting actual models.
- Only 6 synthetic trajectories; paper used 1,342 generated items across 8 models.
- Does not reproduce the benchmark-coverage audit (requires access to all 15 benchmark datasets).
Conclusion
The core taxonomy logic is reproducible from the paper's description. On these synthetic trajectories, the PoC is consistent with the paper's central claim: control-decision state and failure-category labels expose failure modes that a single accuracy column cannot distinguish.
This note supports the public article and records what was actually checked.