EFFLOOW LAB LAB-RUN

AgentAtlas LLM Benchmark Coverage Audit Paper PoC 2026

Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.

Slug: agent-atlas-llm-benchmark-coverage-audit-paper-poc-2026
Track: paper-poc
Date: 2026-05-26
Source: arXiv:2605.20530 (Mazaheri & Mazaheri, UC Santa Cruz / MIT, 2026-05-19)

Environment

Python 3.x (stdlib only — no external packages)
OS: macOS Darwin 24.6.0

Goal

Reproduce the core taxonomy logic from AgentAtlas: implement the 6-state control-decision taxonomy and the 9-category trajectory-failure taxonomy, then demonstrate how taxonomy-aware vs. taxonomy-blind labeling diverges and how accuracy alone fails to distinguish failure modes.

Commands Run

# Script: /tmp/agentAtlas-poc/poc.py
python3 /tmp/agentAtlas-poc/poc.py

Output

=== AgentAtlas PoC: Control-Decision + Failure Taxonomy Labeler ===

 Traj Step Decision     Outcome  Blind Error          Aware Error          Diverges?
--------------------------------------------------------------------------------------------
    1    1 ACT          success  None                 None
    1    2 ACT          success  None                 None
    1    3 STOP         success  None                 None
    2    1 ACT          failure  tool_invocation      tool_invocation
    2    2 RECOVER      success  None                 None
    2    3 STOP         success  None                 None
    3    1 ACT          success  None                 None
    3    2 STOP         partial  premature_stop       premature_stop
    4    1 ACT          failure  tool_invocation      tool_invocation
    5    1 ACT          failure  tool_invocation      tool_invocation
    5    2 STOP         partial  premature_stop       premature_stop
    6    1 ACT          success  None                 None
    6    2 ACT          partial  planning             over_execution       YES

Divergence rate (blind vs. aware): 1/13 steps = 7.7%
(Paper reports 14-40pp accuracy gap when label menu removed from prompt)

=== What accuracy alone cannot see ===
Trajectory B (tool fail + recovery): accuracy=67%
Trajectory C (premature stop):       accuracy=50%

Both score ~67% accuracy. Only failure-taxonomy label 'premature_stop' distinguishes them.
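
The summary figures in the output above can be rechecked from the table rows alone. The sketch below is not the original poc.py: it transcribes the 13 rows by hand and assumes that "accuracy" means the share of steps in a trajectory whose outcome is success, and that Trajectories B and C are trajectories 2 and 3 in the table. The function names are illustrative.

```python
from collections import defaultdict

# Rows transcribed from the output table:
# (trajectory, step, decision, outcome, blind_error, aware_error)
ROWS = [
    (1, 1, "ACT", "success", None, None),
    (1, 2, "ACT", "success", None, None),
    (1, 3, "STOP", "success", None, None),
    (2, 1, "ACT", "failure", "tool_invocation", "tool_invocation"),
    (2, 2, "RECOVER", "success", None, None),
    (2, 3, "STOP", "success", None, None),
    (3, 1, "ACT", "success", None, None),
    (3, 2, "STOP", "partial", "premature_stop", "premature_stop"),
    (4, 1, "ACT", "failure", "tool_invocation", "tool_invocation"),
    (5, 1, "ACT", "failure", "tool_invocation", "tool_invocation"),
    (5, 2, "STOP", "partial", "premature_stop", "premature_stop"),
    (6, 1, "ACT", "success", None, None),
    (6, 2, "ACT", "partial", "planning", "over_execution"),
]


def divergence_rate(rows):
    """Count steps where the blind and aware labels differ."""
    diverging = sum(1 for row in rows if row[4] != row[5])
    return diverging, len(rows), diverging / len(rows)


def step_accuracy(rows):
    """Per-trajectory share of steps with outcome 'success' (assumed definition)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for traj, _step, _decision, outcome, _blind, _aware in rows:
        totals[traj] += 1
        hits[traj] += outcome == "success"
    return {traj: hits[traj] / totals[traj] for traj in totals}


n_div, n_steps, rate = divergence_rate(ROWS)
acc = step_accuracy(ROWS)
print(f"Divergence: {n_div}/{n_steps} = {rate:.1%}")  # Divergence: 1/13 = 7.7%
print(f"Trajectory 2 accuracy: {acc[2]:.0%}")         # 67%
print(f"Trajectory 3 accuracy: {acc[3]:.0%}")         # 50%
```

Under these assumptions the recomputed values match the per-trajectory lines in the output: 67% for the tool-fail-and-recover trajectory and 50% for the premature-stop trajectory.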

What Was Implemented

  • ControlDecision enum: 6 states (ACT, ASK, REFUSE, STOP, CONFIRM, RECOVER)
  • PrimaryErrorSource enum: 9 categories (tool_invocation, planning, hallucination, context_loss, premature_stop, over_execution, refusal_error, confirm_omission, recovery_failure)
  • ImpactLevel enum: 3 levels (benign, partial, critical)
  • label_step(): taxonomy-blind heuristic labeler (simulates the baseline prompt without a label menu)
  • label_step_aware(): taxonomy-aware heuristic labeler (simulates a prompt that includes the explicit label menu)
  • 6 synthetic trajectories covering normal success, recovery, premature stop, confirm omission, hallucination, over-execution
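
The three enums listed above could be declared roughly as follows, using only the standard library. This is a minimal sketch rather than the contents of poc.py: the member names and string values follow the lists in this note, while the Step dataclass and its fields are hypothetical and added only to show how the enums might be combined.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ControlDecision(Enum):
    """The 6 control-decision states."""
    ACT = "ACT"
    ASK = "ASK"
    REFUSE = "REFUSE"
    STOP = "STOP"
    CONFIRM = "CONFIRM"
    RECOVER = "RECOVER"


class PrimaryErrorSource(Enum):
    """The 9 trajectory-failure categories."""
    TOOL_INVOCATION = "tool_invocation"
    PLANNING = "planning"
    HALLUCINATION = "hallucination"
    CONTEXT_LOSS = "context_loss"
    PREMATURE_STOP = "premature_stop"
    OVER_EXECUTION = "over_execution"
    REFUSAL_ERROR = "refusal_error"
    CONFIRM_OMISSION = "confirm_omission"
    RECOVERY_FAILURE = "recovery_failure"


class ImpactLevel(Enum):
    """The 3 impact levels."""
    BENIGN = "benign"
    PARTIAL = "partial"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Step:
    """Hypothetical record for one labeled trajectory step."""
    decision: ControlDecision
    outcome: str  # "success", "partial", or "failure", as in the output table
    error: Optional[PrimaryErrorSource] = None
    impact: Optional[ImpactLevel] = None


# Example: the premature-stop step from the output table (impact is assumed).
step = Step(ControlDecision.STOP, "partial",
            PrimaryErrorSource.PREMATURE_STOP, ImpactLevel.PARTIAL)
print(step.error.value)  # premature_stop
```

Giving each member its printed label as the value means a label string from the output table can be turned back into an enum member with a lookup such as PrimaryErrorSource("over_execution").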

Key Findings

  1. Divergence occurs. The over-execution case (Trajectory F, Step 2) is labeled planning by the blind labeler and over_execution by the aware labeler; it is the only diverging step (1 of 13). This is directionally consistent with the paper's reported 14-40pp accuracy gap, but a rule-based labeler run on six synthetic trajectories does not reproduce that figure.

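  2. Accuracy alone is insufficient. Trajectory B (67% accuracy, tool failure followed by recovery) and Trajectory C (50% accuracy, premature stop) differ only by a number under an accuracy-only lens, and that number says nothing about what went wrong. (The script's summary line reports both as "~67%", although the per-trajectory values are 67% and 50%.) Only the failure labels, tool_invocation for B and premature_stop for C, tell the two apart.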

  3. Confirm omission is invisible. Trajectory D (a destructive delete_files call without a CONFIRM step) is labeled tool_invocation by both the blind and the aware labeler in this run, so it looks like an ordinary tool failure. Unless the taxonomy includes the CONFIRM state and the labeler checks for its absence, this safety-relevant failure is indistinguishable from a benign tool error.
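
One way a labeler could surface the confirm-omission case is a rule that looks at the step preceding a destructive action. The sketch below is an assumption, not part of the PoC or the paper: the DESTRUCTIVE_TOOLS set, the (decision, tool) step encoding, and the "CONFIRM must come immediately before" rule are all hypothetical; only the delete_files tool name comes from this note.

```python
# Hypothetical set of tool names treated as destructive.
DESTRUCTIVE_TOOLS = {"delete_files"}


def find_confirm_omissions(steps):
    """Return indices of ACT steps that call a destructive tool
    without a CONFIRM step immediately before them.

    `steps` is a list of (decision, tool_name) tuples; tool_name may be None.
    """
    omissions = []
    for i, (decision, tool) in enumerate(steps):
        if decision != "ACT" or tool not in DESTRUCTIVE_TOOLS:
            continue
        previous_decision = steps[i - 1][0] if i > 0 else None
        if previous_decision != "CONFIRM":
            omissions.append(i)
    return omissions


unconfirmed = [("ACT", "delete_files")]
confirmed = [("CONFIRM", None), ("ACT", "delete_files")]
print(find_confirm_omissions(unconfirmed))  # [0]
print(find_confirm_omissions(confirmed))    # []
```

A rule like this would let a step be labeled confirm_omission regardless of whether the tool call itself succeeded, which is the distinction an outcome-only label cannot make.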

Limitations

  • Heuristic labeler is rule-based (not LLM-based as in the paper). Real divergence rates require prompting actual models.
  • Only 6 synthetic trajectories; paper used 1,342 generated items across 8 models.
  • Does not reproduce the benchmark-coverage audit (requires access to all 15 benchmark datasets).

Conclusion

Core taxonomy logic can be implemented from the paper's description. Within the limits above, the PoC is consistent with the paper's central claim: control-decision state and failure-category labels expose failure modes that a single accuracy column cannot distinguish.

This note supports the public article and records what was actually checked.