BoundaryRouter Agent Routing Early Experience PoC 2026

Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.

Date: 2026-05-30
Track: paper-poc
Source: arXiv:2605.07180, "Learning Agent Routing From Early Experience"

Paper Summary

BoundaryRouter is a training-free routing framework that decides whether a given query should be answered by:

  1. Direct LLM inference (fast, cheap, sufficient for simpler queries), or
  2. Full agent execution (slower, costlier, necessary for complex multi-step tasks)

The framework first builds an "experience memory" from a small seed set on which both systems are run. At inference time it retrieves the most similar seed cases and applies rubric-guided reasoning over them to decide the route.

Algorithm (from paper)

Phase 1: Seed Execution (one-time setup)

  • Select a seed set of N representative queries (paper uses ~50–200 queries)
  • Run BOTH the LLM and the agent on each seed query
  • Record: query, LLM output, agent output, rubric-based quality scores for each
  • Store as experience memory: {query, llm_score, agent_score, delta}
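
The seed-execution loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `Experience`, `build_experience_memory`, and the three stand-in callables (`fake_llm`, `fake_agent`, `fake_score`) are hypothetical names, and a real run would replace the stand-ins with actual LLM, agent, and rubric-scoring calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experience:
    query: str
    llm_score: float
    agent_score: float
    delta: float  # agent_score - llm_score


def build_experience_memory(
    seed_queries: List[str],
    run_llm: Callable[[str], str],
    run_agent: Callable[[str], str],
    score: Callable[[str, str], float],
) -> List[Experience]:
    """Run both systems on every seed query and record the scored pair."""
    memory = []
    for query in seed_queries:
        llm_score = score(query, run_llm(query))
        agent_score = score(query, run_agent(query))
        memory.append(
            Experience(query, llm_score, agent_score, agent_score - llm_score)
        )
    return memory


# Stand-ins so the sketch runs without any API key (assumptions, not real systems).
def fake_llm(query: str) -> str:
    return "direct answer"


def fake_agent(query: str) -> str:
    return "answer backed by tool results"


def fake_score(query: str, output: str) -> float:
    return 8.0 if "tool" in output else 4.0


memory = build_experience_memory(
    ["Search the web for today's top AI news and summarize it."],
    fake_llm,
    fake_agent,
    fake_score,
)
print(memory[0].delta)
```

Passing the two systems and the scorer in as callables keeps the one-time setup cost explicit: every seed query costs one LLM call, one agent run, and two scoring passes.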

Phase 2: Routing at Inference Time

For each new query:

  1. Embed the query using a text embedding model
  2. Retrieve top-k most similar queries from experience memory
  3. Feed retrieved cases + new query to a rubric-guided reasoning prompt
  4. Model reasons: "Based on these similar past cases, does this query require agent execution?"
  5. Route accordingly
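
Steps 3 to 5 are the part the PoC below replaces with a numeric threshold. As a hedged sketch of what the prompt-based version could look like, the code here assembles retrieved cases into a reasoning prompt and parses a model reply. The prompt wording, `build_routing_prompt`, and `parse_decision` are assumptions made for illustration; the paper's actual rubric and prompt are not reproduced in this note.

```python
from typing import List, Tuple

# Each retrieved case: (query, llm_score, agent_score)
Case = Tuple[str, float, float]


def build_routing_prompt(query: str, cases: List[Case]) -> str:
    """Format retrieved seed cases and the new query into one prompt string."""
    lines = [
        "Decide whether the new query needs full agent execution",
        "or can be answered by direct LLM inference.",
        "",
        "Similar past cases (rubric-based quality scores):",
    ]
    for i, (past_query, llm_score, agent_score) in enumerate(cases, start=1):
        delta = agent_score - llm_score
        lines.append(
            f"{i}. query={past_query!r} llm_score={llm_score} "
            f"agent_score={agent_score} delta={delta:+.1f}"
        )
    lines += [
        "",
        f"New query: {query!r}",
        "Based on these similar past cases, does this query require "
        "agent execution? Answer AGENT or LLM.",
    ]
    return "\n".join(lines)


def parse_decision(reply: str) -> str:
    """Map a free-text reply to a route, defaulting to the cheaper LLM path."""
    return "AGENT" if reply.strip().upper().startswith("AGENT") else "LLM"


example_prompt = build_routing_prompt(
    "Browse the internet and find current Bitcoin price.",
    [("Search the web for today's top AI news and summarize it.", 3.2, 8.7)],
)
print(example_prompt)
```

Defaulting to "LLM" on an unparseable reply is a design choice made for this sketch; a deployment that values answer quality over cost might default to the agent instead.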

Conceptual PoC (Python, no API key required)

The following demonstrates the BoundaryRouter concept using synthetic data:

#!/usr/bin/env python3
"""
BoundaryRouter PoC — conceptual reproduction of arXiv:2605.07180
No API key needed: uses cosine similarity on TF-IDF embeddings for retrieval,
and a heuristic rubric for routing decisions.
"""

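import math
import re
from collections import Counter

# --- Minimal TF-IDF for embedding (no external deps) ---

def tokenize(text):
    # Lowercase and keep only alphanumeric runs, so that "web." and "web"
    # count as the same token instead of missing each other on punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())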

def build_idf(corpus):
    N = len(corpus)
    df = Counter()
    for doc in corpus:
        for tok in set(tokenize(doc)):
            df[tok] += 1
    return {tok: math.log((N + 1) / (freq + 1)) for tok, freq in df.items()}

def tfidf_vector(text, idf):
    tokens = tokenize(text)
    tf = Counter(tokens)
    vec = {}
    for tok, count in tf.items():
        if tok in idf:
            vec[tok] = (count / len(tokens)) * idf[tok]
    return vec

def cosine_sim(a, b):
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(v**2 for v in a.values()))
    norm_b = math.sqrt(sum(v**2 for v in b.values()))
    return dot / (norm_a * norm_b + 1e-9)

# --- Synthetic Experience Memory ---

SEED_MEMORY = [
    # (query, llm_score, agent_score, delta)  delta = agent_score - llm_score
    ("What is 2 + 2?", 9.5, 9.4, -0.1),
    ("Explain gradient descent in one sentence.", 9.0, 9.1, 0.1),
    ("What is the capital of France?", 9.8, 9.7, -0.1),
    ("Search the web for today's top AI news and summarize it.", 3.2, 8.7, 5.5),
    ("Browse GitHub and find the top Python repo by stars today.", 2.0, 8.5, 6.5),
    ("Run a Python script to fetch and parse a CSV from a URL.", 3.5, 8.8, 5.3),
    ("Write a haiku about machine learning.", 9.2, 9.3, 0.1),
    ("Debug this error: NameError: name 'x' is not defined.", 8.5, 8.6, 0.1),
    ("Search for and compare the pricing of GPT-4o and Claude Sonnet 4.6.", 3.0, 8.9, 5.9),
    ("Translate 'hello world' to Spanish.", 9.9, 9.8, -0.1),
    ("Find all Python files in /src and count lines of code.", 2.5, 8.7, 6.2),
    ("What are the pros and cons of microservices?", 8.8, 8.9, 0.1),
]

AGENT_THRESHOLD = 2.0  # route to agent if mean delta of top-k > threshold

def build_index(memory, idf):
    return [(tfidf_vector(q, idf), llm, agent, delta)
            for q, llm, agent, delta in memory]

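def route(query, index, idf, k=3):
    """Route a query by the mean quality delta of its k nearest seed cases."""
    q_vec = tfidf_vector(query, idf)
    # Score every memory entry: (similarity, llm_score, agent_score, delta)
    scored = [(cosine_sim(q_vec, entry[0]), *entry[1:]) for entry in index]
    top_k = sorted(scored, key=lambda x: -x[0])[:k]

    # Average over the neighbours actually retrieved; this is fewer than k
    # when the memory holds fewer than k entries.
    mean_delta = sum(row[3] for row in top_k) / max(len(top_k), 1)
    decision = "AGENT" if mean_delta > AGENT_THRESHOLD else "LLM"

    return {
        "query": query,
        "decision": decision,
        "mean_delta": round(mean_delta, 2),
        "top_k_sims": [round(row[0], 3) for row in top_k],
    }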

if __name__ == "__main__":
    # Build IDF from seed set
    corpus = [q for q, *_ in SEED_MEMORY]
    idf = build_idf(corpus)
    index = build_index(SEED_MEMORY, idf)

    test_queries = [
        "What is the speed of light?",
        "Search the web and summarize the latest Claude API updates.",
        "Write a function to reverse a string in Python.",
        "Browse the internet and find current Bitcoin price.",
        "What year did World War II end?",
    ]

    print("BoundaryRouter PoC Results")
    print("=" * 55)
    for q in test_queries:
        result = route(q, index, idf)
        print(f"[{result['decision']}] (delta={result['mean_delta']:+.2f}) {result['query']}")

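PoC Output (simulated)

The listing below is illustrative and was not captured from a run of the script. With k=3, every delta the script prints is the mean of three seed deltas, so figures such as +5.23 or +0.07 cannot be produced from the seed memory above. Actual output will differ.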

BoundaryRouter PoC Results
=======================================================
[LLM]   (delta=-0.03) What is the speed of light?
[AGENT] (delta=+5.23) Search the web and summarize the latest Claude API updates.
[LLM]   (delta=+0.07) Write a function to reverse a string in Python.
[AGENT] (delta=+5.89) Browse the internet and find current Bitcoin price.
[LLM]   (delta=-0.05) What year did World War II end?

Key Findings from PoC

  • In the simulated output, the retrieval-based approach routes web-search and action tasks to full agent execution
  • Pure knowledge and code-generation queries are routed to direct LLM inference
  • TF-IDF captures only lexical overlap, which is enough to illustrate the routing concept on this synthetic seed set; production would use a dense embedding model
  • The quality delta (agent score minus LLM score) gives a simple, interpretable routing signal; the 2.0 threshold used here is set by hand
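
To see where a threshold could sit for this seed set, one option is to look for the widest gap in the sorted seed deltas. The sketch below does that for the twelve synthetic deltas in the PoC; `largest_gap` is a hypothetical helper, and gap-based selection is an assumption of this note, not a method taken from the paper.

```python
def largest_gap(deltas):
    """Return the (low, high) pair of adjacent sorted deltas with the widest gap."""
    ordered = sorted(deltas)
    gaps = [(high - low, low, high) for low, high in zip(ordered, ordered[1:])]
    _, low, high = max(gaps)
    return low, high


# Deltas copied from the PoC's SEED_MEMORY, in order.
seed_deltas = [-0.1, 0.1, -0.1, 5.5, 6.5, 5.3, 0.1, 0.1, 5.9, -0.1, 6.2, 0.1]
AGENT_THRESHOLD = 2.0  # same value as in the PoC

lo, hi = largest_gap(seed_deltas)
print(f"widest gap lies between {lo} and {hi}; midpoint {(lo + hi) / 2:.2f}")
print("threshold inside gap:", lo < AGENT_THRESHOLD < hi)
```

Individual seed deltas separate cleanly, but the router thresholds the mean over k neighbours. A mixed neighbourhood lands inside the gap, so the exact threshold value matters for queries whose nearest cases disagree.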

Limitations

  • Real BoundaryRouter uses dense embeddings (e.g., text-embedding-3-small) for retrieval, not TF-IDF
  • Production seed set requires running actual LLM+agent pairs to collect ground truth scores
  • Rubric-guided reasoning in the paper uses an LLM to score outputs; the hand-assigned synthetic scores and mean-delta threshold used here are a simplification
  • RouteBench benchmark not reproduced here — requires the full dataset
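
One way to prepare for the first limitation is to keep retrieval independent of the embedding method, so a dense model can be dropped in later. The sketch below takes the embedder as a parameter. `toy_embed` is a stand-in (a hashed bag of words, not a real embedding model), and `cosine` and `top_k` are hypothetical helper names; no embedding API is called.

```python
import hashlib
import math
from typing import Callable, List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Stand-in embedder: hashed bag of words. Not a real dense embedding."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        digest = hashlib.sha256(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query: str,
    memory_queries: List[str],
    embed: Callable[[str], Vector],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k memory queries most similar to the query, best first."""
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(m)), m) for m in memory_queries]
    return sorted(scored, key=lambda pair: -pair[0])[:k]


seed_queries = ["search the web for news", "what is 2 + 2?"]
nearest = top_k("search the web for news", seed_queries, toy_embed, k=1)
print(nearest)
```

Swapping in a real model would mean passing a different `embed` callable; in practice the seed vectors would also be computed once and cached rather than re-embedded on every query as this sketch does.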

Environment

  • Python 3.11 (stdlib only, no external dependencies)
  • Platform: macOS Darwin 24.6.0
  • Concept-level reproduction; not a full model run

Conclusion

The BoundaryRouter concept can be reproduced at the conceptual level. The TF-IDF PoC illustrates the core routing logic: retrieve similar past cases, compute the average delta (agent quality minus LLM quality), and route to the agent when that average exceeds a threshold. Because the approach is training-free, it is attractive for teams deploying mixed LLM+agent systems, although a real seed run with both systems is still needed before it can be used.
