BoundaryRouter: Train-Free LLM-vs-Agent Routing
Your agent is expensive. Not because the model is expensive — because you're sending every query through the full ReAct loop whether or not it needs to be there. A question like "What is the capital of France?" doesn't need a web search, tool calls, or multi-step planning. But in most deployed systems today, it goes through the same execution path as "Find the latest pricing for every major cloud GPU provider and compare them."
That gap between what queries need and what they actually get is where agent infrastructure spend quietly leaks away. Agent execution typically costs 10–50x more than direct LLM inference per query once latency and API calls are combined. At scale, that multiplier breaks budgets.
The intuitive fix — build a classifier that routes cheap queries to the LLM and expensive queries to the agent — runs into a practical wall immediately. You need labeled training data. Collecting it requires running the very agent system you're trying to optimize, under supervision, at scale. It's circular.
A paper submitted to arXiv in May 2026 (arXiv:2605.07180, "Learning Agent Routing From Early Experience") from researchers at Princeton, University of Michigan, Tsinghua, SJTU, University of Edinburgh, and King's College London describes a way around this. BoundaryRouter routes queries between direct LLM inference and full agent execution without any pre-labeled training data, and without building a classifier. It uses a small seed set of queries where you run both systems, then retrieves similar past cases to inform each new routing decision.
Effloow Lab ran a conceptual PoC reproducing the routing algorithm. The full lab notes are at data/lab-runs/boundaryrouter-agent-routing-early-experience-poc-2026.md.
Why This Matters: The Routing Problem and Cost Stakes
The gap between LLM-only and agent execution is not theoretical, and it is not only a cost gap. According to the paper, ReAct improves over direct LLM inference by 44 percentage points on GAIA (web-search benchmark tasks), while chain-of-thought (CoT) prompting degrades performance by 15 percentage points on HumanEval (code tasks) compared to direct inference. The right execution strategy depends entirely on the query, and the difference between getting it right and wrong is large in both directions.
Fixed strategies fail in both directions. Always routing to agents over-spends on queries that don't need tools. Always routing to direct LLM inference leaves a large accuracy gap on queries that do need them. The question is whether you can make per-query decisions cheaply enough that the routing step itself does not eat the savings.
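One rough way to check whether the routing step eats the savings is to write the expected per-query cost as the router's own cost plus the traffic-weighted cost of each path. The sketch below does that; the function names and all of the numbers (a 20x agent multiplier, 40% of traffic sent to the agent, a router call costing 0.2 LLM-units) are illustrative assumptions, not figures from the paper.

```python
def expected_cost_per_query(agent_share, llm_cost, agent_cost, router_cost):
    """Expected cost of one query when `agent_share` of traffic goes to the agent.

    Every query pays the router cost; the rest is a weighted mix of the two paths.
    """
    return router_cost + agent_share * agent_cost + (1.0 - agent_share) * llm_cost


def savings_vs_always_agent(agent_share, llm_cost, agent_cost, router_cost):
    """Fraction of the always-agent cost that routing saves (negative means a loss)."""
    routed = expected_cost_per_query(agent_share, llm_cost, agent_cost, router_cost)
    return 1.0 - routed / agent_cost


# Assumed, illustrative numbers: costs are in units of one direct LLM call.
saving = savings_vs_always_agent(
    agent_share=0.4,   # assumed share of queries that really need the agent
    llm_cost=1.0,
    agent_cost=20.0,   # assumed multiplier inside the article's 10-50x range
    router_cost=0.2,   # assumed cost of the routing call itself
)
print(round(saving, 2))  # 0.56
```

This only models cost. It says nothing about the accuracy lost when a query that needed the agent is sent to the LLM, which is why the quality numbers reported below matter as much as the time savings.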
Prompt-based routing — asking a model to decide "does this query need an agent?" — exists and works to a degree, but it lacks grounding in actual performance evidence. The model is guessing based on surface features. BoundaryRouter grounds the decision in real observed quality deltas from past cases.
The paper shows on RouteBench, their new benchmark:
- 60.6% inference time reduction versus always routing to agents (arXiv:2605.07180)
- 28.6% performance improvement versus always using direct LLM inference (arXiv:2605.07180)
- 37.9% improvement over prompt-based routing (arXiv:2605.07180)
- 8.2% improvement over retrieval-only routing without rubric reasoning (arXiv:2605.07180)
Those numbers land in a range that matters operationally. A 60% inference time reduction changes latency SLAs. A 28% accuracy improvement over the baseline LLM-only approach means routing is doing real work, not just cost-cutting.
How BoundaryRouter Works
The framework has three phases. The first runs once during setup. The second and third run at inference time for every new query.
Phase 1: Build the Experience Memory
Select a seed set of N queries that are representative of your expected workload. The paper uses approximately 50–200 queries. For each query in the seed set:
- Run the direct LLM on the query and record its output
- Run the full agent (ReAct loop with tools) on the same query and record its output
- Score both outputs using a rubric — either a reference answer or a model-as-judge approach
- Store the record {query, llm_score, agent_score, delta}, where delta = agent_score - llm_score
This collection becomes the experience memory. A high positive delta means the agent meaningfully outperformed the LLM on that query type. A near-zero or negative delta means direct inference was sufficient.
This is a one-time cost. You run both systems on N queries, pay for N agent executions, and get a routing asset that informs all future decisions.
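The setup phase described above can be sketched as a single loop over the seed set. This is a minimal illustration, not the paper's code: `run_llm`, `run_agent`, and `score` are hypothetical callables you would supply, and the toy stand-ins at the bottom exist only so the example runs without any API calls.

```python
def build_experience_memory(seed_queries, run_llm, run_agent, score):
    """Run both systems on every seed query and record the quality delta.

    run_llm / run_agent: hypothetical callables mapping a query to an output.
    score: hypothetical callable mapping (query, output) to a numeric score.
    """
    memory = []
    for query in seed_queries:
        llm_score = score(query, run_llm(query))
        agent_score = score(query, run_agent(query))
        memory.append({
            "query": query,
            "llm_score": llm_score,
            "agent_score": agent_score,
            "delta": agent_score - llm_score,
        })
    return memory


# Toy stand-ins (assumptions for illustration only).
def fake_llm(query):
    return "Paris" if "capital" in query else "unknown"


def fake_agent(query):
    return "Paris" if "capital" in query else "fetched result"


def fake_score(query, output):
    return 0.0 if output == "unknown" else 1.0


memory = build_experience_memory(
    ["What is the capital of France?", "Find today's top AI news."],
    fake_llm, fake_agent, fake_score,
)
for record in memory:
    print(record["query"], record["delta"])
```

Storing both raw scores alongside the delta keeps the records usable later as context for the routing prompt, not just as a threshold signal.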
Phase 2: Retrieve Similar Past Cases
When a new query arrives, embed it and find the top-k most similar queries from the experience memory using cosine similarity. Those k retrieved records are the routing context — they tell you what happened when similar queries ran through both systems.
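A minimal sketch of this retrieval step, assuming an `embed` function that maps text to a fixed-length vector. The `toy_embed` keyword counter below is a placeholder so the example runs on its own; a real system would swap in a dense embedding model and precompute the memory vectors rather than re-embedding on every call.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, memory, embed, k=3):
    """Return the k memory records whose queries are most similar to `query`."""
    q_vec = embed(query)
    ranked = sorted(
        memory,
        key=lambda record: cosine(q_vec, embed(record["query"])),
        reverse=True,
    )
    return ranked[:k]


# Placeholder embedding (an assumption): counts of a few hand-picked keywords.
VOCAB = ["search", "web", "capital", "translate"]


def toy_embed(text):
    tokens = text.lower().split()
    return [tokens.count(word) for word in VOCAB]


memory = [
    {"query": "Search the web for today's top AI news", "delta": 5.5},
    {"query": "What is the capital of France?", "delta": -0.1},
    {"query": "Translate hello to Spanish", "delta": -0.1},
]
top = retrieve_top_k("Search the web for GPU prices", memory, toy_embed, k=2)
print([record["delta"] for record in top])
```

The function returns whole records rather than bare deltas, so the downstream reasoning step can see the original queries and both scores.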
Phase 3: Rubric-Guided Routing Decision
Feed the new query and the retrieved records into a reasoning prompt. The model reasons: "Given that these similar past queries showed these quality deltas, does this new query require agent execution?" The structured rubric — rather than an open-ended "should I use tools?" question — anchors the decision in observed evidence rather than surface heuristics.
The routing decision emerges from this reasoning step, and the query is dispatched accordingly.
Practical Implementation
Here is a conceptual Python implementation of the BoundaryRouter algorithm. This reproduces the core routing logic using TF-IDF for embeddings (no external dependencies required) and a heuristic delta threshold for routing. Effloow Lab ran this PoC — full output in the lab notes.
```python
#!/usr/bin/env python3
"""
BoundaryRouter PoC: conceptual reproduction of arXiv:2605.07180
Uses TF-IDF cosine similarity for retrieval and heuristic delta threshold.
Production systems should use dense embeddings and LLM-scored rubrics.
"""
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization; punctuation stays attached to tokens.
    return text.lower().split()


def build_idf(corpus):
    # Smoothed inverse document frequency over the seed queries.
    N = len(corpus)
    df = Counter()
    for doc in corpus:
        for tok in set(tokenize(doc)):
            df[tok] += 1
    return {tok: math.log((N + 1) / (freq + 1)) for tok, freq in df.items()}


def tfidf_vector(text, idf):
    # Sparse TF-IDF vector as a dict; tokens unseen in the seed corpus are dropped.
    tokens = tokenize(text)
    tf = Counter(tokens)
    vec = {}
    for tok, count in tf.items():
        if tok in idf:
            vec[tok] = (count / len(tokens)) * idf[tok]
    return vec


def cosine_sim(a, b):
    # Cosine similarity between two sparse vectors.
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(v**2 for v in a.values()))
    norm_b = math.sqrt(sum(v**2 for v in b.values()))
    return dot / (norm_a * norm_b + 1e-9)


# Synthetic experience memory: (query, llm_score, agent_score, delta)
SEED_MEMORY = [
    ("What is 2 + 2?", 9.5, 9.4, -0.1),
    ("Explain gradient descent in one sentence.", 9.0, 9.1, 0.1),
    ("What is the capital of France?", 9.8, 9.7, -0.1),
    ("Search the web for today's top AI news and summarize it.", 3.2, 8.7, 5.5),
    ("Browse GitHub and find the top Python repo by stars.", 2.0, 8.5, 6.5),
    ("Run a script to fetch and parse a CSV from a URL.", 3.5, 8.8, 5.3),
    ("Write a haiku about machine learning.", 9.2, 9.3, 0.1),
    ("Debug this error: NameError: name 'x' is not defined.", 8.5, 8.6, 0.1),
    ("Compare current pricing of GPT-4o and Claude Sonnet.", 3.0, 8.9, 5.9),
    ("Translate 'hello world' to Spanish.", 9.9, 9.8, -0.1),
    ("Find all Python files in /src and count lines of code.", 2.5, 8.7, 6.2),
    ("What are the pros and cons of microservices?", 8.8, 8.9, 0.1),
]

AGENT_THRESHOLD = 2.0  # route to agent if mean delta of top-k exceeds this


def build_index(memory, idf):
    # Pair each seed query's TF-IDF vector with its observed delta.
    return [(tfidf_vector(q, idf), delta) for q, _, _, delta in memory]


def route(query, index, idf, k=3):
    q_vec = tfidf_vector(query, idf)
    scored = [(cosine_sim(q_vec, entry[0]), entry[1]) for entry in index]
    top_k = sorted(scored, key=lambda x: -x[0])[:k]
    # Average over the cases actually retrieved, which may be fewer than k.
    mean_delta = sum(row[1] for row in top_k) / len(top_k)
    return "AGENT" if mean_delta > AGENT_THRESHOLD else "LLM"
```
The rubric-guided reasoning step in the actual BoundaryRouter paper uses an LLM prompt with the retrieved cases as context. Here is a simplified version of that pattern:
```python
def build_routing_prompt(query: str, retrieved_cases: list) -> str:
    cases_text = "\n".join(
        f"- Query: '{case['query']}' | LLM score: {case['llm_score']:.1f} "
        f"| Agent score: {case['agent_score']:.1f} | Delta: {case['delta']:+.1f}"
        for case in retrieved_cases
    )
    return f"""You are a routing agent. Decide whether the following query
should be answered by direct LLM inference or by a full agent (with tools).

Similar past queries and their outcomes:
{cases_text}

New query: "{query}"

Based on the pattern above, does this query require agent execution?
Answer with AGENT or LLM and a one-sentence justification."""
```
In a production system, this prompt goes to a lightweight model (not the full agent model) specifically for the routing decision. The cost of this routing call should be well under 1% of a full agent execution.
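Whatever model handles the routing call, its free-text reply still has to be turned into a dispatch decision. The sketch below is one hedged way to do that for a prompt that asks for "AGENT or LLM" first. The function name and the choice to fall back to the agent on an unparseable reply are assumptions for illustration; falling back to the agent favors answer quality over cost, and the opposite default is equally defensible.

```python
import re


def parse_routing_reply(reply: str, default: str = "AGENT") -> str:
    """Extract the routing label from a model reply.

    Only a label at the very start of the reply counts, so a justification
    that merely mentions "agent" or "LLM" later on cannot flip the decision.
    Anything unparseable falls back to `default` (an assumed policy).
    """
    match = re.match(r"\s*(AGENT|LLM)\b", reply, re.IGNORECASE)
    return match.group(1).upper() if match else default


print(parse_routing_reply("AGENT: needs live web data."))        # AGENT
print(parse_routing_reply("llm - general knowledge suffices."))  # LLM
print(parse_routing_reply("I am not sure."))                      # AGENT (fallback)
```

Logging the replies that hit the fallback is a cheap way to spot prompt or model drift in the router.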
RouteBench: A New Evaluation Standard
One of the paper's secondary contributions is RouteBench, a benchmark designed to evaluate routing systems across multiple difficulty levels. Prior to RouteBench, there was no standardized way to compare routing approaches — teams evaluated on their own private workloads with incompatible metrics.
RouteBench covers three routing difficulty settings:
- In-domain: The routing model sees queries from the same distribution as its seed set. This is the baseline case: how well does the system route when the problem space is familiar?
- Paraphrased: Queries are semantically equivalent to seed examples but expressed differently. This tests whether the routing is based on meaning (which should transfer) or on surface token overlap (which should not).
- Out-of-domain: Queries come from a genuinely different task distribution. This is the stress test: how does the system handle query types it has not seen in its experience memory?
The three-level structure is worth noting for practitioners designing their own routing evaluations. Most real production systems face a mix of all three conditions simultaneously. A routing framework that only performs well in-domain will degrade over time as user behavior drifts from the seed distribution.
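For a home-grown evaluation along the same three levels, a simple starting point is to report routing accuracy separately per split. The sketch below assumes each evaluation row carries a `split` tag, the router's `predicted` route, and an `oracle` route obtained by running both systems. These field names and the toy rows are hypothetical, and this is not a reproduction of RouteBench's own metrics.

```python
from collections import defaultdict


def accuracy_by_split(results):
    """Share of queries per split where the predicted route matches the oracle route."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in results:
        totals[row["split"]] += 1
        hits[row["split"]] += int(row["predicted"] == row["oracle"])
    return {split: hits[split] / totals[split] for split in totals}


# Toy rows (illustrative only, not benchmark data).
results = [
    {"split": "in_domain", "predicted": "AGENT", "oracle": "AGENT"},
    {"split": "in_domain", "predicted": "LLM", "oracle": "LLM"},
    {"split": "paraphrased", "predicted": "LLM", "oracle": "AGENT"},
    {"split": "paraphrased", "predicted": "AGENT", "oracle": "AGENT"},
    {"split": "out_of_domain", "predicted": "LLM", "oracle": "AGENT"},
]
print(accuracy_by_split(results))
```

Reporting the splits separately keeps a strong in-domain score from hiding a weak out-of-domain one.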
BoundaryRouter's rubric-guided reasoning step is specifically designed to handle the out-of-domain case better than pure retrieval. Even if no retrieved case closely matches the new query, the LLM reasoner can generalize from the structural pattern of past cases (tool-requiring vs. knowledge-requiring) rather than relying on surface similarity alone.
Comparing Routing Strategies
| Strategy | Training Required | Accuracy vs. Oracle | Cost vs. Always-Agent | Cold-Start Ready | Out-of-Domain |
|---|---|---|---|---|---|
| Always LLM | None | Low (misses agent-needed queries) | Lowest cost | Yes | N/A — no routing |
| Always Agent | None | High (but overkill on easy queries) | Highest cost (10–50x LLM) | Yes | N/A — no routing |
| Prompt-Based Routing | None | Moderate (surface heuristics) | Moderate savings | Yes | Weak (no evidence grounding) |
| Retrieval-Only Routing | Seed set (no labels) | Good in-domain | Good savings | Yes | Degrades on novel queries |
| BoundaryRouter | Seed set (no labels) | Best (rubric-grounded) | 60.6% reduction vs. always-agent | Yes (~50 seed queries) | Strong (LLM generalizes from cases) |
Source: Strategy comparison derived from arXiv:2605.07180 results on RouteBench. Oracle here means the theoretical maximum achievable by a perfect routing system with full knowledge of each query's difficulty.
The trained-classifier approach (not shown above, as it falls outside BoundaryRouter's target scope) would occupy its own row between retrieval-only and BoundaryRouter in accuracy, but it requires the labeled training data that BoundaryRouter explicitly avoids needing.
Common Mistakes and When Not to Route
Routing is not free. Before implementing BoundaryRouter, consider whether your system actually has a mixed workload worth routing.
Mistake 1: Adding routing to a homogeneous workload. If 90% of your queries are web-search tasks that need agents, routing will rarely choose LLM-only — and you've added latency and complexity for near-zero savings. Audit your query distribution first. Routing pays off when both paths see significant traffic.
Mistake 2: Under-sizing the seed set. The paper uses 50–200 seed queries. Using fewer than 30 creates an experience memory that is too sparse to retrieve meaningfully similar cases. If your workload has 10 distinct task types, aim for at least 5 examples per type in the seed set.
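A quick coverage check along these lines can be sketched as follows, assuming each seed query has been tagged with a coarse task type. The tags are an assumption of this example (they describe the task, not the routing outcome), and the function name and toy seed set are hypothetical.

```python
from collections import Counter


def undercovered_task_types(seed_set, min_per_type=5):
    """Return {task_type: count} for task types with fewer than `min_per_type` seed queries.

    seed_set: iterable of (query, task_type) pairs; task_type is an assumed manual tag.
    """
    counts = Counter(task_type for _, task_type in seed_set)
    return {task: n for task, n in counts.items() if n < min_per_type}


# Toy seed set: six web-search queries, two translation queries.
seed_set = [(f"web search query {i}", "web_search") for i in range(6)]
seed_set += [(f"translation query {i}", "translation") for i in range(2)]

print(undercovered_task_types(seed_set))  # {'translation': 2}
```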
Mistake 3: Treating the seed set as permanent. The experience memory reflects the workload at collection time. As user behavior drifts, the retrieved cases become less representative. Plan to refresh the seed set on a regular cadence — monthly is a reasonable starting point.
Mistake 4: Skipping the rubric step and using retrieval alone. Retrieval-only routing is faster to build but leaves performance on the table, particularly on out-of-domain queries. The paper shows retrieval-only is 8.2% worse than the full BoundaryRouter approach. The additional LLM reasoning call for routing costs a fraction of a full agent execution, so skipping it to save compute is rarely the right trade.
Mistake 5: Cold-starting with too few diverse seed queries. BoundaryRouter works under cold-start conditions — that's one of its stated advantages. But cold start with 50 queries of a single type will produce systematically biased routing. If you cannot collect a diverse seed set upfront, begin with prompt-based routing and build toward BoundaryRouter as you accumulate real query data.
For teams building agentic systems where the practical ceiling of coordination overhead tends to kick in around 3–4 agents (see our agent scaling ceiling analysis), routing is one of the highest-leverage optimizations available before adding more agents. It cuts cost without changing the capability surface.
FAQ
Q: How do I score seed query outputs without human annotators?
The paper uses a rubric-guided LLM-as-judge approach: a model evaluates both the LLM and agent outputs against a defined rubric (accuracy, completeness, factual correctness). This requires no human labels — just a capable judge model and a well-defined rubric for your task domain. For tasks with verifiable answers (math, code, factual queries), exact-match or execution-based scoring works and avoids judge model costs entirely.
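For the verifiable-answer case, exact-match scoring can be as small as the sketch below. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are assumptions that suit short factual answers; they are not taken from the paper, and code tasks would need execution-based checks instead.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def exact_match_score(output, reference):
    """1.0 if the normalized output equals the normalized reference, else 0.0."""
    return 1.0 if normalize(output) == normalize(reference) else 0.0


print(exact_match_score("Paris.", "paris"))  # 1.0
print(exact_match_score("four", "4"))        # 0.0 (no semantic matching)
```

The second call shows the limit of this approach: answers that are correct but phrased differently score zero, which is where a judge model earns its cost.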
Q: Does BoundaryRouter work if my agent uses a proprietary toolset not represented in the seed set?
Yes, with a caveat. The routing decision is based on whether a query type tends to benefit from agent execution, not on the specific tools available. If your seed set includes examples of tool-requiring queries (web search, file operations, API calls) that structurally resemble your proprietary tool use cases, the routing signal transfers. If your proprietary tools handle a genuinely novel query type with no analogues in the seed set, retrieval will fall back on weak similarity and the rubric step carries more weight. Test out-of-domain coverage explicitly during evaluation.
Q: Can BoundaryRouter route to more than two systems (not just LLM vs. one agent)?
The paper's framing is binary: LLM or agent. Extending to multi-way routing — LLM, lightweight agent, heavy agent — is a natural generalization but is not evaluated in arXiv:2605.07180. A reasonable extension would track per-system quality deltas in the experience memory and use the rubric to select from N options rather than binary. This has not been experimentally validated in the published work, so treat any multi-way extension as engineering exploration rather than paper-backed practice. For context on multi-agent orchestration patterns, see Sakana AI's RL Conductor work and the reward hacking considerations that come with more complex routing systems.
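As engineering exploration only, here is one hypothetical shape such an extension could take: store a score per system in each experience record, average those scores over the retrieved cases, and pick the cheapest system whose average is within a margin of the best. The system names, costs, `margin` parameter, and the selection rule itself are all assumptions of this sketch. It replaces the rubric-guided reasoning step with a plain heuristic and is not something the paper evaluates.

```python
def pick_system(retrieved_cases, costs, margin=0.5):
    """Choose the cheapest system whose mean retrieved score is within `margin` of the best.

    retrieved_cases: list of {"scores": {system_name: score}} records (assumed layout).
    costs: {system_name: relative cost per query} (assumed values).
    """
    if not retrieved_cases:
        raise ValueError("need at least one retrieved case")
    mean_score = {
        system: sum(case["scores"][system] for case in retrieved_cases) / len(retrieved_cases)
        for system in costs
    }
    best = max(mean_score.values())
    eligible = [system for system in costs if mean_score[system] >= best - margin]
    return min(eligible, key=lambda system: costs[system])


cases = [
    {"scores": {"llm": 3.0, "light_agent": 8.5, "heavy_agent": 8.8}},
    {"scores": {"llm": 2.0, "light_agent": 8.0, "heavy_agent": 8.4}},
]
costs = {"llm": 1, "light_agent": 10, "heavy_agent": 40}
print(pick_system(cases, costs))  # light_agent
```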
Key Takeaways
BoundaryRouter addresses a problem that most production agentic systems face but few have a principled answer to: when to use the agent and when not to. The training-free design matters practically. The moment you require labeled data to bootstrap a routing classifier, you've created a chicken-and-egg problem that stalls deployment. Starting from a small seed set where you run both systems and record observed quality gives you a real foundation without that dependency.
The 60.6% inference time reduction and 28.6% accuracy improvement over fixed strategies (arXiv:2605.07180) reflect the value of routing grounded in evidence rather than surface heuristics. The gap between BoundaryRouter and retrieval-only routing (8.2%) shows that the rubric-guided reasoning step is doing real work, not just decorating a similarity search.
The RouteBench benchmark is a secondary but meaningful contribution. Routing research has lacked a shared evaluation surface, and the three-level difficulty structure (in-domain, paraphrased, out-of-domain) reflects the real conditions under which production routing operates.
For teams that have already read our token efficiency work on Chain of Draft and test-time scaling research, BoundaryRouter occupies a different but complementary position in the cost-optimization stack. CoD reduces token cost within a single inference. BoundaryRouter reduces the frequency of expensive inference paths altogether. Both compound.
BoundaryRouter is a practical, training-free routing framework backed by multi-institution research (arXiv:2605.07180). The algorithm is straightforward to reproduce conceptually, the seed set requirement is low (~50–200 queries), and the savings are large enough to justify the setup cost for any mixed LLM/agent workload processing more than a few thousand queries per day. If your team is paying agent-level costs for LLM-sufficient queries, this paper is worth reading and implementing.
Effloow Lab ran a conceptual PoC reproducing the BoundaryRouter routing algorithm using TF-IDF similarity and synthetic experience memory. Full lab notes, PoC code, and output are at data/lab-runs/boundaryrouter-agent-routing-early-experience-poc-2026.md. No live LLM or agent API calls were made during this lab run. All performance figures cited above are from arXiv:2605.07180 (May 2026). RouteBench benchmark results reflect in-domain, paraphrased, and out-of-domain evaluation settings as described in the paper.