AgentAtlas LLM Benchmark Coverage Audit Paper PoC 2026

Slug: agent-atlas-llm-benchmark-coverage-audit-paper-poc-2026
Track: paper-poc
Date: 2026-05-26
Source: arXiv:2605.20530 (Mazaheri & Mazaheri, UC Santa Cruz / MIT, 2026-05-19)

Environment

Python 3.x (stdlib only — no external packages)
OS: macOS Darwin 24.6.0

Goal

Reproduce the core taxonomy logic from AgentAtlas: implement the 6-state control-decision taxonomy and the 9-category trajectory-failure taxonomy, then demonstrate how taxonomy-aware vs. taxonomy-blind labeling diverges and how accuracy alone fails to distinguish failure modes.

Commands Run

# Script saved as /tmp/agentAtlas-poc/poc.py
python3 /tmp/agentAtlas-poc/poc.py

Output

=== AgentAtlas PoC: Control-Decision + Failure Taxonomy Labeler ===

 Traj Step Decision     Outcome  Blind Error          Aware Error          Diverges?
--------------------------------------------------------------------------------------------
    1    1 ACT          success  None                 None
    1    2 ACT          success  None                 None
    1    3 STOP         success  None                 None
    2    1 ACT          failure  tool_invocation      tool_invocation
    2    2 RECOVER      success  None                 None
    2    3 STOP         success  None                 None
    3    1 ACT          success  None                 None
    3    2 STOP         partial  premature_stop       premature_stop
    4    1 ACT          failure  tool_invocation      tool_invocation
    5    1 ACT          failure  tool_invocation      tool_invocation
    5    2 STOP         partial  premature_stop       premature_stop
    6    1 ACT          success  None                 None
    6    2 ACT          partial  planning             over_execution       YES

Divergence rate (blind vs. aware): 1/13 steps = 7.7%
(Paper reports 14-40pp accuracy gap when label menu removed from prompt)

=== What accuracy alone cannot see ===
Trajectory B (tool fail + recovery): accuracy=67%
Trajectory C (premature stop):       accuracy=50%

Both score ~67% accuracy. Only failure-taxonomy label 'premature_stop' distinguishes them.
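
The figures in the output above can be recomputed from the table rows. The sketch below transcribes the 13 rows and assumes that accuracy means the fraction of steps whose outcome is success. That definition reproduces the 67% and 50% shown, but it is not stated in this report. The function names are illustrative and are not taken from poc.py.

```python
from collections import defaultdict

# Rows transcribed from the output table:
# (trajectory, step, decision, outcome, blind_label, aware_label)
ROWS = [
    (1, 1, "ACT", "success", None, None),
    (1, 2, "ACT", "success", None, None),
    (1, 3, "STOP", "success", None, None),
    (2, 1, "ACT", "failure", "tool_invocation", "tool_invocation"),
    (2, 2, "RECOVER", "success", None, None),
    (2, 3, "STOP", "success", None, None),
    (3, 1, "ACT", "success", None, None),
    (3, 2, "STOP", "partial", "premature_stop", "premature_stop"),
    (4, 1, "ACT", "failure", "tool_invocation", "tool_invocation"),
    (5, 1, "ACT", "failure", "tool_invocation", "tool_invocation"),
    (5, 2, "STOP", "partial", "premature_stop", "premature_stop"),
    (6, 1, "ACT", "success", None, None),
    (6, 2, "ACT", "partial", "planning", "over_execution"),
]


def divergence(rows):
    """Return (number of steps where blind and aware labels differ, total steps)."""
    diverging = sum(1 for row in rows if row[4] != row[5])
    return diverging, len(rows)


def accuracy_by_trajectory(rows):
    """Assumption: accuracy = fraction of steps whose outcome is 'success'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for traj, _step, _decision, outcome, _blind, _aware in rows:
        totals[traj] += 1
        hits[traj] += outcome == "success"
    return {traj: hits[traj] / totals[traj] for traj in totals}


def failure_labels(rows):
    """Collect the taxonomy-aware failure labels seen in each trajectory."""
    labels = defaultdict(set)
    for traj, *_middle, aware in rows:
        if aware is not None:
            labels[traj].add(aware)
    return dict(labels)


if __name__ == "__main__":
    d, n = divergence(ROWS)
    print(f"divergence: {d}/{n} = {d / n:.1%}")
    acc = accuracy_by_trajectory(ROWS)
    print(f"trajectory 2: {acc[2]:.0%}, trajectory 3: {acc[3]:.0%}")
    print(failure_labels(ROWS))
```

Under that assumed definition, trajectory 2 scores 2/3 and trajectory 3 scores 1/2, while failure_labels() is what separates them by cause.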

What Was Implemented

ControlDecision enum: 6 states (ACT, ASK, REFUSE, STOP, CONFIRM, RECOVER)
PrimaryErrorSource enum: 9 categories (tool_invocation, planning, hallucination, context_loss, premature_stop, over_execution, refusal_error, confirm_omission, recovery_failure)
ImpactLevel enum: 3 levels (benign, partial, critical)
label_step(): taxonomy-blind heuristic labeler (simulates the baseline prompt without a label menu)
label_step_aware(): taxonomy-aware heuristic labeler (simulates a prompt that includes the explicit label menu)
6 synthetic trajectories (A through F, shown as Traj 1 through 6 in the output) covering normal success, recovery, premature stop, confirm omission, hallucination, over-execution
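
A minimal sketch of how the pieces listed above might fit together. The enum names and members follow the list. The Step dataclass, its goal_already_met field, and the labeling rules are invented stand-ins chosen to match the label patterns in the output table, since the actual rules in poc.py are not shown in this report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ControlDecision(Enum):
    ACT = "ACT"
    ASK = "ASK"
    REFUSE = "REFUSE"
    STOP = "STOP"
    CONFIRM = "CONFIRM"
    RECOVER = "RECOVER"


class PrimaryErrorSource(Enum):
    TOOL_INVOCATION = "tool_invocation"
    PLANNING = "planning"
    HALLUCINATION = "hallucination"
    CONTEXT_LOSS = "context_loss"
    PREMATURE_STOP = "premature_stop"
    OVER_EXECUTION = "over_execution"
    REFUSAL_ERROR = "refusal_error"
    CONFIRM_OMISSION = "confirm_omission"
    RECOVERY_FAILURE = "recovery_failure"


class ImpactLevel(Enum):
    BENIGN = "benign"
    PARTIAL = "partial"
    CRITICAL = "critical"


@dataclass
class Step:
    decision: ControlDecision
    outcome: str  # "success", "partial", or "failure"
    goal_already_met: bool = False  # hypothetical field, not from the report


def label_step(step: Step) -> Optional[PrimaryErrorSource]:
    """Taxonomy-blind stand-in: coarse rules with no over_execution category."""
    if step.outcome == "success":
        return None
    if step.decision is ControlDecision.STOP:
        return PrimaryErrorSource.PREMATURE_STOP
    if step.outcome == "failure":
        return PrimaryErrorSource.TOOL_INVOCATION
    return PrimaryErrorSource.PLANNING  # catch-all for partial outcomes


def label_step_aware(step: Step) -> Optional[PrimaryErrorSource]:
    """Taxonomy-aware stand-in: adds one rule the blind labeler lacks."""
    acted_past_goal = (
        step.outcome != "success"
        and step.decision is ControlDecision.ACT
        and step.goal_already_met
    )
    if acted_past_goal:
        return PrimaryErrorSource.OVER_EXECUTION
    return label_step(step)


if __name__ == "__main__":
    extra_action = Step(ControlDecision.ACT, "partial", goal_already_met=True)
    print(label_step(extra_action), label_step_aware(extra_action))
```

With these assumed rules, an ACT step that yields a partial outcome after the goal was already met is labeled planning by the blind labeler and over_execution by the aware one. Every other step pattern in the output table receives the same label from both.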

Key Findings

Divergence is real. The over-execution case (Trajectory F, Step 2; Traj 6 in the output) is labeled planning by the blind labeler and over_execution by the aware labeler. This is directionally consistent with the paper's reported 14-40pp accuracy gap, although 1 diverging step out of 13 on synthetic data illustrates the effect rather than measuring it.
An accuracy column is insufficient. Trajectory B (67% accuracy, tool failure followed by recovery) and Trajectory C (50% accuracy, premature stop) both register only as mid-range scores under an accuracy-only lens, which says nothing about why each fell short. The trajectory-failure labels separate them: tool_invocation for B, premature_stop for C.
Confirm omission is invisible. Trajectory D (a destructive delete_files call with no preceding CONFIRM step) is labeled tool_invocation by both labelers, so it looks like an ordinary tool failure. Unless the taxonomy's CONFIRM state is actually checked, this safety-relevant failure is indistinguishable from a benign tool error.
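
One way the confirm-omission case could be surfaced is a check over the decision sequence itself. The sketch below is a hypothetical helper, not part of poc.py. It assumes each step is a (decision, tool name) pair and that a destructive call counts as confirmed only when the step immediately before it is CONFIRM. The set of destructive tools is also an assumption; only delete_files is named in this report.

```python
from typing import List, Tuple

# Hypothetical set of destructive tools; only delete_files appears in the report.
DESTRUCTIVE_TOOLS = {"delete_files"}


def find_confirm_omissions(steps: List[Tuple[str, str]]) -> List[int]:
    """Return 1-based indices of destructive ACT steps with no CONFIRM right before.

    Each step is (decision, tool_name); tool_name is "" for steps without a tool.
    """
    omissions = []
    previous_decision = None
    for index, (decision, tool) in enumerate(steps, start=1):
        is_destructive_act = decision == "ACT" and tool in DESTRUCTIVE_TOOLS
        if is_destructive_act and previous_decision != "CONFIRM":
            omissions.append(index)
        previous_decision = decision
    return omissions


if __name__ == "__main__":
    trajectory_d = [("ACT", "delete_files")]
    with_confirm = [("CONFIRM", ""), ("ACT", "delete_files")]
    print(find_confirm_omissions(trajectory_d))  # [1]
    print(find_confirm_omissions(with_confirm))  # []
```

A check like this depends on the decision sequence rather than on the step outcome, which is why a labeler driven only by outcomes would not see the omission.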

Limitations

Heuristic labeler is rule-based (not LLM-based as in the paper). Real divergence rates require prompting actual models.
Only 6 synthetic trajectories; paper used 1,342 generated items across 8 models.
Does not reproduce the benchmark-coverage audit (requires access to all 15 benchmark datasets).

Conclusion

The core taxonomy logic can be reproduced from the paper's description. On this small synthetic set, the PoC illustrates the paper's central claim: control-decision state and failure-category labels expose failure modes that a single accuracy column cannot distinguish.