WildToolBench: LLM Tool-Use in the Wild Benchmark PoC 2026

Date: 2026-06-04
Environment: macOS 15 arm64, Python 3.12.8 stdlib only
Track: paper-poc
Paper: arXiv:2604.06185, "Benchmarking LLM Tool-Use in the Wild" (submitted February 13, 2026)

Paper Summary

WildToolBench is a benchmark grounded in real user behavior patterns collected from ToolBench-V2 production logs. Unlike prior controlled benchmarks, it preserves the natural complexity of real interactions.

Key finding: 57 LLMs were evaluated. No model exceeded 15% accuracy. This is a dramatic gap versus controlled benchmarks where top models score 60–80%.

Three challenge categories identified from user behavior analysis:

Compositional tasks: require non-linear tool chaining with fan-in/fan-out DAG topology
Implicit intent: intent is distributed across prior turns; the LLM must resolve cross-turn references to fill tool arguments
Instruction transition: conversations mix tool calls, clarifications, and casual chat; the model must detect mode shifts without explicit signals

Local PoC Executed

File: /tmp/wildtoolbench_poc_v2.py
Dependencies: Python 3.12.8 stdlib (json, collections.defaultdict); no API calls, no external packages

Challenge 1: Compositional Task DAG

Reproduced the fan-in merge topology documented in the paper:

fetch_slack  (tool: slack_list_messages)  ─┐
                                            ├── create_ticket (tool: jira_create_ticket)
search_docs  (tool: docs_search)          ─┘

PoC computed DAG depth and width:

max_depth=1 (one dependency layer)
max_width=2 (two parallel leaf nodes)
levels: {0: 2, 1: 1} — 2 parallel tools at level 0, 1 fan-in at level 1

This structure illustrates a coordination requirement that benchmarks which linearize tasks (one tool per turn) do not exercise: both level-0 calls must complete before the fan-in call at level 1 can run.
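The depth and width figures above can be reproduced with a short level computation. This is a minimal sketch, not the contents of the PoC file: the `deps` encoding and the function names `node_levels` and `dag_shape` are assumptions made for illustration, and the graph is assumed to be acyclic.

```python
from collections import defaultdict

# Hypothetical encoding of the fan-in DAG: node -> nodes it depends on.
deps = {
    "fetch_slack": [],
    "search_docs": [],
    "create_ticket": ["fetch_slack", "search_docs"],
}

def node_levels(deps):
    """Map each node to its level: 0 for leaves, else 1 + deepest dependency."""
    levels = {}

    def level(node):
        if node not in levels:
            # default=-1 makes a node with no dependencies land on level 0.
            levels[node] = 1 + max((level(d) for d in deps[node]), default=-1)
        return levels[node]

    for node in deps:
        level(node)
    return levels

def dag_shape(deps):
    """Summarize the DAG as max depth, max width, and node count per level."""
    per_level = defaultdict(int)
    for lvl in node_levels(deps).values():
        per_level[lvl] += 1
    return {
        "max_depth": max(per_level),
        "max_width": max(per_level.values()),
        "levels": dict(sorted(per_level.items())),
    }

shape = dag_shape(deps)
print(shape)  # {'max_depth': 1, 'max_width': 2, 'levels': {0: 2, 1: 1}}
```

Depth is counted in dependency layers (edges), which is why a two-level graph reports max_depth=1.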

Challenge 2: Implicit Intent

Implemented cross-turn reference scoring on a 3-turn dialogue:

Turn 1: "I'm planning a trip to Tokyo." [0 refs]
Turn 2: "What's the weather like there?" [1 ref: city from turn 1]  
Turn 3: "Find me hotels for those dates." [1 ref: city + 1 UNRESOLVABLE: dates]

Results:

per_turn scores: [0, 1, 3] (unresolvable references penalized 2x)
mean_score: 1.33
unresolvable references: 1 (the dates were never stated, so the LLM must either ask a clarifying question or hallucinate them)
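The scoring rule behind these numbers can be sketched as follows. The annotation layout (`resolvable` / `unresolvable` lists per turn) and the names `UNRESOLVABLE_WEIGHT` and `turn_score` are assumptions for illustration; only the 2x penalty and the three turns come from the results above.

```python
# Hypothetical hand annotation of the 3-turn dialogue.
turns = [
    {"text": "I'm planning a trip to Tokyo.",
     "resolvable": [], "unresolvable": []},
    {"text": "What's the weather like there?",
     "resolvable": ["city"], "unresolvable": []},
    {"text": "Find me hotels for those dates.",
     "resolvable": ["city"], "unresolvable": ["dates"]},
]

# Unresolvable references count double, matching the 2x penalty above.
UNRESOLVABLE_WEIGHT = 2

def turn_score(turn):
    """Score one turn: 1 per resolvable reference, 2 per unresolvable one."""
    return len(turn["resolvable"]) + UNRESOLVABLE_WEIGHT * len(turn["unresolvable"])

per_turn = [turn_score(t) for t in turns]
mean_score = round(sum(per_turn) / len(per_turn), 2)
unresolved = sum(len(t["unresolvable"]) for t in turns)

print(per_turn, mean_score, unresolved)  # [0, 1, 3] 1.33 1
```

Turn 3 scores 3 because it carries one resolvable reference (the city) plus one unresolvable reference (the dates) weighted at 2.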

Challenge 3: Instruction Transition

Mode-switch analysis on a 5-turn conversation mixing tool calls, clarification, and chat:

total turns: 5, switches: 3
switch_rate: 75% (3 switches across 4 adjacent turn pairs)
Transition pairs: tool_call → clarification, clarification → conversational, conversational → tool_call

The toy conversation illustrates the mode-switching pattern the paper identifies in real user sessions; models trained on clean tool-call datasets are not exposed to this pattern.
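The switch-rate arithmetic can be sketched as below. The per-turn mode labels are an assumption: the results above give 5 turns, 3 switches, and the order of the transitions, but not which turn repeats its predecessor's mode, so the repeated `tool_call` at the start is hypothetical. The function name `mode_switches` is also illustrative.

```python
# Hypothetical mode labels for the 5-turn conversation; the position of the
# repeated mode is assumed, only the switch order is taken from the results.
modes = ["tool_call", "tool_call", "clarification", "conversational", "tool_call"]

def mode_switches(modes):
    """Return the list of (from, to) switches and the fraction of adjacent
    turn pairs whose mode differs."""
    pairs = list(zip(modes, modes[1:]))
    switches = [(a, b) for a, b in pairs if a != b]
    rate = len(switches) / len(pairs) if pairs else 0.0
    return switches, rate

switches, rate = mode_switches(modes)
print(len(modes), len(switches), f"{rate:.0%}")  # 5 3 75%
```

The denominator is the number of adjacent turn pairs (turns minus one), so 3 switches over 4 pairs gives 75%.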

What Was NOT Tested

No actual LLM was evaluated against WildToolBench
No WildToolBench dataset was downloaded or parsed
No accuracy scores were reproduced (the 15% ceiling is taken directly from the paper)
No ToolBench-V2 production logs were accessed
The PoC illustrates challenge category structure, not benchmark scoring

Sources

arXiv:2604.06185: arxiv.org/abs/2604.06185 (verified via WebSearch 2026-06-04)
arXiv PDF: arxiv.org/pdf/2604.06185 (full paper accessed)
arXiv HTML: arxiv.org/html/2604.06185 (accessible)
EmergentMind ToolBench coverage: emergentmind.com/topics/toolbench
TRAJECT-Bench related work: arxiv.org/pdf/2510.04550