AutoTTS: LLMs Automatically Discover Better Reasoning Strategies
Every team building on reasoning-heavy LLMs eventually runs into the same wall: you know the model can do better at inference time if you give it more compute — run it several times, do a search, chain probes — but how to allocate that compute is pure guesswork.
You try Best-of-N. You try Beam Search. Someone reads a paper and argues for Monte Carlo Tree Search. You spend two weeks tuning and still have no principled answer. This is the state of test-time scaling (TTS) in 2026: the strategies work, but they are almost entirely hand-crafted by intuition.
A paper published May 8, 2026 on arXiv challenges that directly. "LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling" (arXiv:2605.08083) proposes AutoTTS, a framework in which an LLM agent discovers the TTS strategy so you don't have to. The discovered controller cuts aggregate token usage by roughly 69.5% compared to naive parallel sampling while matching accuracy. And the entire discovery run cost $39.9.
Effloow Lab analyzed the paper, verified the GitHub repository, and reproduced the core controller logic. Here is what the approach means and how to think about applying it.
Why Hand-Crafted TTS Is a Dead End
Test-time scaling means allocating additional compute during inference — not by training a larger model, but by running the existing model more carefully. Common strategies include:
- Best-of-N / Self-Consistency (SC@N): generate N independent answers and return the most common one
- Beam Search: expand a tree of partial answers, pruning low-confidence paths
- MCTS: tree search guided by a value model
- Sequential refinement: generate → critique → revise, in a loop
Each of these requires decisions: how many samples? When to stop? When to branch? When to give up on a path? Researchers set these by intuition, publish a paper, and the cycle repeats. The design space is enormous and mostly unexplored.
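As a concrete reference point, SC@N fits in a few lines. This is a minimal sketch, not code from the paper; `sample_fn` is a placeholder we introduce for one full LLM generation:

```python
from collections import Counter
import itertools

def self_consistency(sample_fn, n: int = 8):
    """SC@N baseline: draw n independent answers, return the plurality vote.

    sample_fn is a placeholder for one full LLM generation that
    produces a final answer string.
    """
    answers = [sample_fn() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n  # plurality answer and its vote share

# Deterministic toy sampler standing in for a stochastic model
fake_llm = itertools.cycle(["42", "42", "7", "42"]).__next__
print(self_consistency(fake_llm, n=8))  # → ('42', 0.75)
```

Every strategy beyond this baseline is, in effect, an attempt to spend those N calls more intelligently.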
The deeper problem is that optimizing TTS strategies is itself an LLM task — interpreting reasoning trajectories, diagnosing failure modes, proposing code-level fixes. Which is exactly what coding LLMs are good at.
AutoTTS: Environment-Driven Strategy Discovery
AutoTTS reframes the problem. Instead of asking "what TTS strategy should I design?", it asks "what environment should I build so an agent can design the strategy?"
The environment has three components:
1. Pre-collected reasoning trajectories: AIME24 problems are solved with a base model, generating a replay buffer of how the model thinks — reasoning tokens, confidence at each step, intermediate answers. These are stored and reused.
2. Cheap evaluation without live LLM calls: the agent proposes a controller (a Python function), and the environment evaluates it by replaying the cached trajectories. No new LLM inference calls — just fast replay. This is why the discovery is cheap.
3. A coding LLM agent as the optimizer: the agent reads execution traces (why a controller failed), proposes an improved controller, and iterates. Execution trace feedback helps diagnose "this controller stopped too early in cases where the answer was converging" rather than just "accuracy was 62%."
The human only defines the replay environment structure and runs the agent. Everything else — the controller logic — emerges from search.
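In code, the replay evaluation is conceptually a loop over cached steps. The trajectory schema below (per-step token counts, answers, confidences, plus a gold label) is our assumption for illustration, not the repo's actual format:

```python
def evaluate_controller(controller, trajectories):
    """Score a candidate controller by replaying pre-recorded trajectories.

    No live LLM calls: each trajectory is a cached list of steps
    (token count, current answer, confidence) plus a gold answer.
    This schema is an assumption for illustration, not the repo's format.
    """
    solved, tokens = 0, 0
    for traj in trajectories:
        seen = []
        for step in traj["steps"]:
            seen.append(step)
            tokens += step["tokens"]
            if controller(seen) == "stop":
                break  # controller ends this problem early
        if seen and seen[-1]["answer"] == traj["gold"]:
            solved += 1
    return solved / len(trajectories), tokens

# Toy candidate controller: stop as soon as step confidence reaches 0.8
stop_at_08 = lambda seen: "stop" if seen[-1]["conf"] >= 0.8 else "continue"
```

Because the inner loop only reads cached data, many candidate controllers can be scored without a single inference call.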
Width-Depth TTS as Controller Synthesis
The paper formalizes TTS search in terms of two axes:
- Width: how many parallel branches (candidate solutions) run simultaneously
- Depth: how many reasoning steps (probe steps) each branch gets
A controller is a Python function that observes the current pool state — all active branches and their answers — and decides what to do:
| Decision | Meaning |
|---|---|
| `branch` | Spawn a new parallel candidate |
| `continue` | Give current branches more steps |
| `probe` | Run a lightweight check on a branch |
| `prune` | Drop a diverging branch |
| `stop` | Return the consensus answer |
The controller runs at each "round." The β parameter scales the cost-accuracy tradeoff: lower β means cheaper (fewer tokens), higher β means pushing accuracy.
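Putting the decisions together, the per-round loop looks roughly like the following. The `Pool` class is a toy stand-in we invented for illustration; the real pool tracks live branch trajectories:

```python
from collections import Counter

class Pool:
    """Toy branch pool; a stand-in for the real trajectory pool."""
    def __init__(self, branches):
        self.branches = branches   # current answer of each branch
        self.steps = 0             # total probe/extension steps spent

    def spawn(self):
        self.branches.append(None)  # new branch, no answer yet

    def step_all(self):
        self.steps += 1             # extend every branch one step

    def probe_all(self):
        self.steps += 1             # lightweight checks also cost steps

    def drop_weakest(self):
        if self.branches:
            self.branches.pop()

    def consensus(self):
        live = [a for a in self.branches if a is not None]
        return Counter(live).most_common(1)[0][0] if live else None


def run_rounds(controller, pool, max_rounds=16):
    """One controller decision per round until 'stop' or the budget runs out."""
    actions = {"branch": pool.spawn, "continue": pool.step_all,
               "probe": pool.probe_all, "prune": pool.drop_weakest}
    for _ in range(max_rounds):
        decision = controller(pool)
        if decision == "stop":
            break
        actions[decision]()
    return pool.consensus()
```

Any function with this observe-then-decide signature is a valid candidate, which is what makes the controller space searchable by a coding agent.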
The Discovered Controller: CMC
After the agent runs the search, the result is the Confidence Momentum Controller (CMC). It is not a large or complex function — it is four clearly interpretable mechanisms.
1. Momentum-Aware Stopping Gate
Instead of stopping when instantaneous pool confidence exceeds a threshold, CMC tracks an exponential moving average (EMA) of pool confidence and stops only when two conditions are both true:
- EMA confidence exceeds the β-scaled threshold
- Recent momentum (EMA delta) is non-negative — confidence is not declining
This prevents premature stopping when a brief coincidence of branch answers creates a false consensus spike.
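A quick numeric illustration of the gate, using α = 0.3 smoothing and a stopping threshold of 0.4 (the β = 0.5 setting, 0.8 × 0.5; the values are chosen for illustration):

```python
def ema_gate(confidences, alpha=0.3, threshold=0.4):
    """Return True if the momentum-aware gate fires at any round:
    EMA above threshold AND EMA not declining."""
    ema = prev = 0.0
    for c in confidences:
        prev, ema = ema, alpha * c + (1 - alpha) * ema
        if ema >= threshold and ema - prev >= 0.0:
            return True
    return False

# A one-round spike to 0.9 never pushes the EMA past 0.4 ...
print(ema_gate([0.9, 0.1, 0.1]))     # → False
# ... but three rounds of steady 0.75 agreement do.
print(ema_gate([0.75, 0.75, 0.75]))  # → True
```

A raw-threshold rule would have stopped on the 0.9 spike immediately; the EMA gate waits for sustained agreement.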
2. Coupled Width-Depth Control
Widening and deepening are linked through the EMA delta:
- Strong positive momentum → suppress new branch spawning (the search is converging, don't dilute it)
- Stagnation or regression → trigger widening (the current branches are stuck, explore more)
This is the opposite of most hand-crafted strategies, which treat width and depth as independent.
3. Alignment-Aware Depth Allocation
Branches whose current answer matches the emerging consensus (the plurality answer across all branches) receive a larger share of the per-round probe budget. Branches diverging from consensus get fewer steps.
This concentrates compute on the converging region of the search space — similar to beam search's pruning logic, but driven by answer agreement rather than log-probability scores.
4. Conservative Branch Abandonment
Branches are pruned only when they are clearly and persistently diverging, not on a single inconsistent step. This prevents the controller from discarding a branch that is temporarily exploring a useful direction.
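One plausible realization of "clearly and persistently diverging" is a per-branch miss-streak counter. This is our sketch; the patience value and field names are assumptions, not taken from the paper:

```python
def update_divergence(branch: dict, consensus: str, patience: int = 3) -> bool:
    """Return True when a branch should be pruned: only after `patience`
    consecutive rounds of disagreeing with the pool consensus.
    A single off-consensus step is never enough to trigger a prune.
    """
    if branch["answer"] == consensus:
        branch["miss_streak"] = 0  # re-aligned: forgive past misses
    else:
        branch["miss_streak"] = branch.get("miss_streak", 0) + 1
    return branch["miss_streak"] >= patience
```

The reset on re-alignment is what keeps temporarily exploratory branches alive.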
The Numbers
The paper evaluates CMC on AIME25 and HMMT25 (held-out benchmarks, not used during discovery) across four Qwen3 backbone scales.
| Setting | Mode | Accuracy vs SC@64 | Token cost |
|---|---|---|---|
| AutoTTS β=0.5 | Efficient | Matches SC@64 | ~30.5% of SC@64 |
| AutoTTS β=1.0 | Accurate | Surpasses SC@64 | Higher than β=0.5 |
| Discovery cost | — | — | $39.9 / 160 min |
At β=0.5, AutoTTS cuts aggregate token usage by roughly 69.5% compared with SC@64 while matching mean accuracy across models. At β=1.0, it surpasses all hand-crafted baselines in five of the eight benchmark/model cells.
The discovery phase makes zero live LLM calls during evaluation; only the agent's synthesis loop requires LLM calls, and that costs $39.9 in total.
Reproducing the CMC Logic
Effloow Lab reproduced the core CMC controller logic in Python to understand its structure. This is a conceptual reproduction based on the paper's description — not the full AutoTTS pipeline, which requires pre-collected trajectory data and the Qwen3 model weights.
```python
from dataclasses import dataclass, field
from typing import Optional
from collections import Counter
import math


@dataclass
class CMCState:
    ema_confidence: float = 0.0
    prev_ema: float = 0.0
    alpha: float = 0.3               # EMA smoothing factor
    confidence_threshold: float = 0.8
    beta: float = 0.5                # cost-accuracy tradeoff
    branches: list = field(default_factory=list)

    def pool_confidence(self) -> float:
        """Fraction of branches agreeing on the plurality answer."""
        answers = [b['answer'] for b in self.branches if b['answer'] is not None]
        if not answers:
            return 0.0
        most_common = Counter(answers).most_common(1)[0][1]
        return most_common / len(self.branches)

    def update_ema(self) -> float:
        conf = self.pool_confidence()
        self.prev_ema = self.ema_confidence
        self.ema_confidence = self.alpha * conf + (1 - self.alpha) * self.ema_confidence
        return self.ema_confidence

    def momentum(self) -> float:
        return self.ema_confidence - self.prev_ema

    def should_stop(self) -> bool:
        """Stop when EMA is high AND momentum is non-negative."""
        threshold = self.confidence_threshold * self.beta
        return self.ema_confidence >= threshold and self.momentum() >= 0.0

    def should_widen(self) -> bool:
        """Widen when confidence stagnates below the stopping threshold."""
        return self.momentum() <= 0.0 and self.ema_confidence < self.confidence_threshold * self.beta

    def consensus_answer(self) -> Optional[str]:
        answers = [b['answer'] for b in self.branches if b['answer'] is not None]
        return Counter(answers).most_common(1)[0][0] if answers else None

    def alignment_aware_budget(self, probe_budget: int) -> dict:
        """Give extra probe steps to branches matching the current consensus."""
        consensus = self.consensus_answer()
        allocations = {}
        aligned = [b for b in self.branches if b['answer'] == consensus]
        unaligned = [b for b in self.branches if b['answer'] != consensus]
        for b in aligned:
            allocations[b['id']] = math.ceil(0.6 * probe_budget / max(len(aligned), 1))
        for b in unaligned:
            allocations[b['id']] = math.floor(0.4 * probe_budget / max(len(unaligned), 1))
        return allocations
```
The full AutoTTS pipeline — including the agent-driven controller synthesis, trajectory replay, and β-parameterized search — is at github.com/zhengkid/AutoTTS.
Why This Matters for Developers
If you are building on top of a reasoning-capable LLM — whether for coding tasks, math, structured extraction, or multi-step planning — test-time scaling is how you unlock more capability without changing the model or retraining.
The problem has always been: how do you know which TTS strategy is right for your workload? Best-of-N is simple but expensive. MCTS is powerful but requires a verifier. Sequential refinement adds latency.
AutoTTS introduces a different question: instead of asking which strategy to pick, can you discover the right strategy automatically, specific to your workload?
For now, the full discovery pipeline requires reasoning-capable trajectory data and a reasonably large backbone (Qwen3-scale). But the framework generalizes — the paper notes that CMC, once discovered on AIME24 trajectories, transfers to held-out benchmarks without re-discovery. The environment is the reusable part.
Three near-term developer applications:
1. Use CMC directly: the GitHub repo publishes the discovered CMC controller. You can apply it to any pipeline that uses parallel sampling — generate N branches, track confidence across rounds, stop when CMC says stop. No discovery required.
2. Discover a domain-specific controller: if your workload is not mathematical reasoning (e.g., code generation, legal clause extraction), collect a small trajectory dataset from your domain and run the AutoTTS discovery agent on it. $40 and a few hours is a reasonable investment for a controller tuned to your specific task.
3. Tune β for your cost constraints: β=0.5 for latency-sensitive or budget-constrained systems, β=1.0 when accuracy is the primary constraint. This single parameter replaces the N-way decision of which hand-crafted TTS strategy to use.
What AutoTTS Doesn't Solve
The paper focuses on mathematical reasoning benchmarks where correct answers are verifiable. Extending to open-ended generation — where "pool confidence" and "alignment" cannot be computed from discrete answers — is an open research question.
The discovery agent itself uses GPT-4o (implied by the $39.9 cost breakdown). Using a weaker coding agent may yield less effective controllers. And the evaluation environment requires clean trajectory data, which itself takes some engineering to collect.
The 69.5% token reduction figure is also an aggregate — per-problem variance is not reported. On easy problems, SC@1 is still cheaper; AutoTTS's efficiency gains are concentrated in the medium-to-hard region of the benchmark distribution.
FAQ
Q: Can I use the CMC controller without running the full AutoTTS discovery?
Yes. The paper publishes CMC as the main discovered artifact. You can implement the controller logic (EMA confidence tracking + alignment-aware depth allocation) and apply it to any parallel sampling loop, with no dependency on the discovery pipeline.
Q: Does AutoTTS require fine-tuning the base LLM?
No. The base LLM is used only for collecting trajectories and at inference time. The discovery agent (which optimizes the controller) is a separate LLM (the paper uses GPT-4o for this role). The base reasoning model is not modified.
Q: How does AutoTTS compare to MCTS for hard reasoning tasks?
At β=1.0, AutoTTS surpasses MCTS-based baselines on AIME25/HMMT25. At β=0.5, it matches SC@64 at roughly 30% of the token cost — a regime where MCTS is far more expensive due to the tree expansion overhead. The key advantage of AutoTTS is that the controller is discovered, not designed, so it captures interactions that MCTS's hand-crafted policies miss.
Q: Does the CMC controller generalize across model sizes?
The paper tests across four Qwen3 backbone scales. CMC discovered on smaller models generalizes to larger ones, though the paper notes that accuracy gains from β=1.0 are larger on higher-capacity models.
Q: Is this approach limited to math benchmarks?
The AIME24/AIME25/HMMT25 focus is a practical choice for clean verifiability, not a fundamental limitation. The AutoTTS framework operates on any domain where you can collect reasoning trajectories and define a pool confidence signal — discrete answers, majority-vote agreement, or an external verifier.
Key Takeaways
AutoTTS changes the unit of human effort in test-time scaling from "design a strategy" to "design an environment." The discovered CMC controller is four clean mechanisms that outperform hand-crafted alternatives at a fraction of the cost.
For developers: the most actionable path is using the published CMC controller in your existing parallel sampling pipelines today, while the AutoTTS framework matures for domain-specific discovery.
For researchers: the paper's central insight — that controller synthesis over offline trajectories is cheap enough to drive agentic search — opens a path to task-specific TTS without the prohibitive cost of live-query optimization.
AutoTTS is the first published system for automatically discovering test-time scaling strategies rather than hand-crafting them. The discovered CMC controller cuts token usage by 69.5% versus naive parallel sampling while matching accuracy — and finding it cost $39.9. If you run any reasoning-heavy LLM pipeline today, CMC is worth integrating before reaching for MCTS or manual tuning.