WildToolBench: LLM Tool-Use in the Wild Benchmark PoC (2026)
Date: 2026-06-04
Environment: macOS 15 arm64, Python 3.12.8 stdlib only
Track: paper-poc
Paper: arXiv:2604.06185, "Benchmarking LLM Tool-Use in the Wild" (submitted February 13, 2026)
Paper Summary
WildToolBench is a benchmark grounded in real user behavior patterns collected from ToolBench-V2 production logs. Unlike prior controlled benchmarks, it preserves the natural complexity of real interactions.
Key finding (as reported in the paper): of the 57 LLMs evaluated, none exceeded 15% accuracy. On controlled benchmarks, top models score 60–80%.
Three challenge categories identified from user behavior analysis:
- Compositional tasks: these require non-linear tool chaining with a fan-in/fan-out DAG topology
- Implicit intent: intent is distributed across prior turns, so the LLM must resolve cross-turn references to fill tool arguments
- Instruction transition: conversations mix tool calls, clarifications, and casual chat, so the model must detect mode shifts without explicit signals
Local PoC Executed
File: /tmp/wildtoolbench_poc_v2.py
Dependencies: Python 3.12.8 stdlib (json, collections.defaultdict) — no API calls, no external packages
Challenge 1: Compositional Task DAG
Reproduced the fan-in merge topology documented in the paper:
fetch_slack (tool: slack_list_messages) ─┐
                                         ├── create_ticket (tool: jira_create_ticket)
search_docs (tool: docs_search) ─────────┘
PoC computed DAG depth and width:
- max_depth=1 (one dependency layer)
- max_width=2 (two parallel leaf nodes)
- levels: {0: 2, 1: 1} (2 parallel tools at level 0, 1 fan-in at level 1)
The structure illustrates a coordination requirement that benchmarks which linearize tasks (one tool per turn) do not exercise: create_ticket cannot run until both upstream calls have returned.
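The depth and width figures above can be reproduced with a short level computation. The following is a minimal sketch, not the contents of /tmp/wildtoolbench_poc_v2.py: the DEPENDENCIES mapping and the function names node_levels and dag_shape are hypothetical, and a node's level is assumed to be the length of its longest path from a node with no dependencies.

```python
from collections import defaultdict

# Hypothetical encoding of the fan-in example: node -> upstream dependencies.
DEPENDENCIES = {
    "fetch_slack": [],
    "search_docs": [],
    "create_ticket": ["fetch_slack", "search_docs"],
}


def node_levels(dependencies):
    """Map each node to its level: the longest path from a dependency-free node."""
    levels = {}

    def level_of(node, stack=()):
        if node in levels:
            return levels[node]
        if node in stack:
            raise ValueError(f"cycle detected at {node!r}")
        parents = dependencies[node]
        if not parents:
            levels[node] = 0
        else:
            levels[node] = 1 + max(level_of(p, stack + (node,)) for p in parents)
        return levels[node]

    for node in dependencies:
        level_of(node)
    return levels


def dag_shape(dependencies):
    """Summarize a non-empty DAG as max depth, max width, and node count per level."""
    per_level = defaultdict(int)
    for level in node_levels(dependencies).values():
        per_level[level] += 1
    return {
        "max_depth": max(per_level),            # highest level index
        "max_width": max(per_level.values()),   # most nodes sharing one level
        "levels": dict(sorted(per_level.items())),
    }


shape = dag_shape(DEPENDENCIES)
print(shape)  # {'max_depth': 1, 'max_width': 2, 'levels': {0: 2, 1: 1}}
```

Nodes that share a level have no dependency on each other, so under this encoding max_width is an upper bound on how many tool calls could be issued in parallel.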
Challenge 2: Implicit Intent
Implemented cross-turn reference scoring on a 3-turn dialogue:
Turn 1: "I'm planning a trip to Tokyo." [0 refs]
Turn 2: "What's the weather like there?" [1 ref: city from turn 1]
Turn 3: "Find me hotels for those dates." [1 ref: city + 1 UNRESOLVABLE: dates]
Results:
- per_turn scores: [0, 1, 3] (unresolvable references penalized 2x)
- mean_score: 1.33
- unresolvable references: 1 (the LLM must ask a clarifying question or hallucinate dates)
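The scoring rule behind these numbers can be sketched as follows. This is an illustrative reconstruction rather than the original PoC code: the TURNS annotation format, the resolved_in_turn field, and the turn_score function are hypothetical, and the rule is assumed to be one point per resolvable reference and two points per unresolvable one, matching the "penalized 2x" note.

```python
# Assumed weighting: a resolvable reference costs 1, an unresolvable one costs 2.
RESOLVABLE_WEIGHT = 1
UNRESOLVABLE_WEIGHT = 2

# Hypothetical hand annotation of the 3-turn dialogue.
# resolved_in_turn is the 1-based turn that supplies the value, or None.
TURNS = [
    {"text": "I'm planning a trip to Tokyo.", "refs": []},
    {
        "text": "What's the weather like there?",
        "refs": [{"slot": "city", "resolved_in_turn": 1}],
    },
    {
        "text": "Find me hotels for those dates.",
        "refs": [
            {"slot": "city", "resolved_in_turn": 1},
            {"slot": "dates", "resolved_in_turn": None},  # never stated
        ],
    },
]


def turn_score(turn):
    """Sum the reference weights for one turn."""
    score = 0
    for ref in turn["refs"]:
        if ref["resolved_in_turn"] is None:
            score += UNRESOLVABLE_WEIGHT
        else:
            score += RESOLVABLE_WEIGHT
    return score


per_turn = [turn_score(turn) for turn in TURNS]
mean_score = round(sum(per_turn) / len(per_turn), 2)
unresolvable = sum(
    1 for turn in TURNS for ref in turn["refs"] if ref["resolved_in_turn"] is None
)

print(per_turn, mean_score, unresolvable)  # [0, 1, 3] 1.33 1
```

Turn 3 scores 3 because it carries one resolvable reference (the city) and one unresolvable reference (the dates).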
Challenge 3: Instruction Transition
Mode-switch analysis on a 5-turn conversation mixing tool calls, clarification, and chat:
- total turns: 5, switches: 3
- switch_rate: 75% (3 switches across 4 adjacent turn pairs)
- Transition pairs: tool_call → clarification → conversational → tool_call
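This illustrates how often the mode can change within a single short session. The paper identifies frequent mode switching as a structural property of real user sessions; the PoC conversation was constructed locally and does not measure real traffic. Models trained on clean tool-call datasets are unlikely to be exposed to this pattern.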
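The switch count and rate can be computed by comparing each turn's mode label with the previous one. The sketch below is a hedged reconstruction, not the original PoC code: the function name switch_stats is hypothetical, and the MODES sequence is an assumption, since the note gives only the transition chain and not which of the five turns repeats a mode.

```python
from collections import defaultdict

# Assumed 5-turn labeling consistent with the reported chain
# tool_call -> clarification -> conversational -> tool_call.
# Which turn repeats a mode is a guess; here the first two turns are tool calls.
MODES = ["tool_call", "tool_call", "clarification", "conversational", "tool_call"]


def switch_stats(modes):
    """Count mode switches between adjacent turns and the resulting switch rate."""
    pairs = list(zip(modes, modes[1:]))  # adjacent (previous, current) pairs
    transitions = defaultdict(int)
    for previous, current in pairs:
        if previous != current:
            transitions[(previous, current)] += 1
    switches = sum(transitions.values())
    return {
        "total_turns": len(modes),
        "switches": switches,
        # Rate is taken over adjacent pairs (turns - 1), not over turns.
        "switch_rate": switches / len(pairs) if pairs else 0.0,
        "transition_pairs": dict(transitions),
    }


stats = switch_stats(MODES)
print(stats["total_turns"], stats["switches"], f"{stats['switch_rate']:.0%}")  # 5 3 75%
```

Dividing by the number of adjacent pairs rather than the number of turns is what yields 75% for 3 switches in a 5-turn conversation.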
What Was NOT Tested
- No actual LLM was evaluated against WildToolBench
- No WildToolBench dataset was downloaded or parsed
- No accuracy scores were reproduced (the 15% ceiling is from the paper directly)
- No ToolBench-V2 production logs were accessed
- The PoC illustrates challenge category structure, not benchmark scoring
Sources
- arXiv:2604.06185: arxiv.org/abs/2604.06185 (verified via WebSearch 2026-06-04)
- arXiv PDF: arxiv.org/pdf/2604.06185 (full paper accessed)
- arXiv HTML: arxiv.org/html/2604.06185 (accessible)
- EmergentMind ToolBench coverage: emergentmind.com/topics/toolbench
- TRAJECT-Bench related work: arxiv.org/pdf/2510.04550
Read the article
This note supports the public article and records what was actually checked.