AutoTTS: LLMs Automatically Discover Better Reasoning Strategies
Every team building on reasoning-heavy LLMs eventually runs into the same wall: you know the model can do better at inference time if you give it more compute — run it several times, do a search, chain probes — but how to allocate that compute is pure guesswork.
You try Best-of-N. You try Beam Search. Someone reads a paper and argues for Monte Carlo Tree Search. You spend two weeks tuning and still have no principled answer. This is the state of test-time scaling (TTS) in 2026: the strategies work, but they are almost entirely hand-crafted by intuition.
A paper published May 8, 2026 on arXiv challenges that directly. "LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling" (arXiv:2605.08083) proposes AutoTTS, a framework in which an LLM agent discovers the TTS strategy so you don't have to. The discovered controller cuts aggregate token usage by roughly 69.5% compared to naive parallel sampling while matching accuracy. And the entire discovery run cost $39.9.
Effloow Lab analyzed the paper, verified the GitHub repository, and reproduced the core controller logic. Here is what the approach means and how to think about applying it.
Why Hand-Crafted TTS Is a Dead End
Test-time scaling means allocating additional compute during inference — not by training a larger model, but by running the existing model more carefully. Common strategies include:
- Best-of-N / Self-Consistency (SC@N): generate N independent answers and return the most common one
- Beam Search: expand a tree of partial answers, pruning low-confidence paths
- MCTS: tree search guided by a value model
- Sequential refinement: generate → critique → revise, in a loop
Each of these requires decisions: how many samples? When to stop? When to branch? When to give up on a path? Researchers set these by intuition, publish a paper, and the cycle repeats. The design space is enormous and mostly unexplored.
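As a concrete reference point, SC@N fits in a few lines. This is a minimal sketch, not code from the paper; `sample_fn` is a placeholder we introduce for one full LLM generation:

```python
from collections import Counter
import itertools

def self_consistency(sample_fn, n: int = 8):
    """SC@N baseline: draw n independent answers, return the plurality vote.

    sample_fn is a placeholder for one full LLM generation that
    produces a final answer string.
    """
    answers = [sample_fn() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n  # plurality answer and its vote share

# Deterministic toy sampler standing in for a stochastic model
fake_llm = itertools.cycle(["42", "42", "7", "42"]).__next__
print(self_consistency(fake_llm, n=8))  # → ('42', 0.75)
```

Every strategy beyond this baseline is, in effect, an attempt to spend those N calls more intelligently.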
The deeper problem is that optimizing TTS strategies is itself an LLM task — interpreting reasoning trajectories, diagnosing failure modes, proposing code-level fixes. Which is exactly what coding LLMs are good at.
AutoTTS: Environment-Driven Strategy Discovery
AutoTTS reframes the problem. Instead of asking "what TTS strategy should I design?", it asks "what environment should I build so an agent can design the strategy?"
The environment has three components:
1. Pre-collected reasoning trajectories: AIME24 problems are solved with a base model, generating a replay buffer of how the model thinks — reasoning tokens, confidence at each step, intermediate answers. These are stored and reused.
2. Cheap evaluation without live LLM calls: the agent proposes a controller (a Python function), and the environment evaluates it by replaying the cached trajectories. No new LLM inference calls — just fast replay. This is why the discovery is cheap.
3. A coding LLM agent as the optimizer: the agent reads execution traces (why a controller failed), proposes an improved controller, and iterates. Execution trace feedback helps diagnose "this controller stopped too early in cases where the answer was converging" rather than just "accuracy was 62%."
The human only defines the replay environment structure and runs the agent. Everything else — the controller logic — emerges from search.
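In code, the replay evaluation is conceptually a loop over cached steps. The trajectory schema below (per-step token counts, answers, confidences, plus a gold label) is our assumption for illustration, not the repo's actual format:

```python
def evaluate_controller(controller, trajectories):
    """Score a candidate controller by replaying pre-recorded trajectories.

    No live LLM calls: each trajectory is a cached list of steps
    (token count, current answer, confidence) plus a gold answer.
    This schema is an assumption for illustration, not the repo's format.
    """
    solved, tokens = 0, 0
    for traj in trajectories:
        seen = []
        for step in traj["steps"]:
            seen.append(step)
            tokens += step["tokens"]
            if controller(seen) == "stop":
                break  # controller ends this problem early
        if seen and seen[-1]["answer"] == traj["gold"]:
            solved += 1
    return solved / len(trajectories), tokens

# Toy candidate controller: stop as soon as step confidence reaches 0.8
stop_at_08 = lambda seen: "stop" if seen[-1]["conf"] >= 0.8 else "continue"
```

Because the inner loop only reads cached data, many candidate controllers can be scored without a single inference call.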
Width-Depth TTS as Controller Synthesis
The paper formalizes TTS search in terms of two axes:
- Width: how many parallel branches (candidate solutions) run simultaneously
- Depth: how many reasoning steps (probe steps) each branch gets
A controller is a Python function that observes the current pool state — all active branches and their answers — and decides what to do:
| Decision | Meaning |
|---|---|
| `branch` | Spawn a new parallel candidate |
| `continue` | Give current branches more steps |
| `probe` | Run a lightweight check on a branch |
| `prune` | Drop a diverging branch |
| `stop` | Return the consensus answer |
The controller runs at each "round." The β parameter scales the cost-accuracy tradeoff: lower β means cheaper (fewer tokens), higher β means pushing accuracy.
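Putting the decisions together, the per-round loop looks roughly like the following. The `Pool` class is a toy stand-in we invented for illustration; the real pool tracks live branch trajectories:

```python
from collections import Counter

class Pool:
    """Toy branch pool; a stand-in for the real trajectory pool."""
    def __init__(self, branches):
        self.branches = branches   # current answer of each branch
        self.steps = 0             # total probe/extension steps spent

    def spawn(self):
        self.branches.append(None)  # new branch, no answer yet

    def step_all(self):
        self.steps += 1             # extend every branch one step

    def probe_all(self):
        self.steps += 1             # lightweight checks also cost steps

    def drop_weakest(self):
        if self.branches:
            self.branches.pop()

    def consensus(self):
        live = [a for a in self.branches if a is not None]
        return Counter(live).most_common(1)[0][0] if live else None


def run_rounds(controller, pool, max_rounds=16):
    """One controller decision per round until 'stop' or the budget runs out."""
    actions = {"branch": pool.spawn, "continue": pool.step_all,
               "probe": pool.probe_all, "prune": pool.drop_weakest}
    for _ in range(max_rounds):
        decision = controller(pool)
        if decision == "stop":
            break
        actions[decision]()
    return pool.consensus()
```

Any function with this observe-then-decide signature is a valid candidate, which is what makes the controller space searchable by a coding agent.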
The Discovered Controller: CMC
After the agent runs the search, the result is the Confidence Momentum Controller (CMC). It is not a large or complex function — it is four clearly interpretable mechanisms.
1. Momentum-Aware Stopping Gate
Instead of stopping when instantaneous pool confidence exceeds a threshold, CMC tracks an exponential moving average (EMA) of pool confidence and stops only when two conditions are both true:
- EMA confidence exceeds the β-scaled threshold
- Recent momentum (EMA delta) is non-negative — confidence is not declining
This prevents premature stopping when a brief coincidence of branch answers creates a false consensus spike.
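A quick numeric illustration of the gate, using α = 0.3 smoothing and a stopping threshold of 0.4 (the β = 0.5 setting, 0.8 × 0.5; the values are chosen for illustration):

```python
def ema_gate(confidences, alpha=0.3, threshold=0.4):
    """Return True if the momentum-aware gate fires at any round:
    EMA above threshold AND EMA not declining."""
    ema = prev = 0.0
    for c in confidences:
        prev, ema = ema, alpha * c + (1 - alpha) * ema
        if ema >= threshold and ema - prev >= 0.0:
            return True
    return False

# A one-round spike to 0.9 never pushes the EMA past 0.4 ...
print(ema_gate([0.9, 0.1, 0.1]))     # → False
# ... but three rounds of steady 0.75 agreement do.
print(ema_gate([0.75, 0.75, 0.75]))  # → True
```

A raw-threshold rule would have stopped on the 0.9 spike immediately; the EMA gate waits for sustained agreement.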
2. Coupled Width-Depth Control
Widening and deepening are linked through the EMA delta:
- Strong positive momentum → suppress new branch spawning (the search is converging, don't dilute it)
- Stagnation or regression → trigger widening (the current branches are stuck, explore more)
This is the opposite of most hand-crafted strategies, which treat width and depth as independent.
3. Alignment-Aware Depth Allocation
Branches whose current answer matches the emerging consensus (the plurality answer across all branches) receive a larger share of the per-round probe budget. Branches diverging from consensus get fewer steps.
This concentrates compute on the converging region of the search space — similar to beam search's pruning logic, but driven by answer agreement rather than log-probability scores.
4. Conservative Branch Abandonment
Branches are pruned only when they are clearly and persistently diverging, not on a single inconsistent step. This prevents the controller from discarding a branch that is temporarily exploring a useful direction.
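One plausible realization of "clearly and persistently diverging" is a per-branch miss-streak counter. This is our sketch; the patience value and field names are assumptions, not taken from the paper:

```python
def update_divergence(branch: dict, consensus: str, patience: int = 3) -> bool:
    """Return True when a branch should be pruned: only after `patience`
    consecutive rounds of disagreeing with the pool consensus.
    A single off-consensus step is never enough to trigger a prune.
    """
    if branch["answer"] == consensus:
        branch["miss_streak"] = 0  # re-aligned: forgive past misses
    else:
        branch["miss_streak"] = branch.get("miss_streak", 0) + 1
    return branch["miss_streak"] >= patience
```

The reset on re-alignment is what keeps temporarily exploratory branches alive.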
The Numbers
The paper evaluates CMC on AIME25 and HMMT25 (held-out benchmarks, not used during discovery) across four Qwen3 backbone scales.
| Setting | Mode | Accuracy vs SC@64 | Token cost |
|---|---|---|---|
| AutoTTS β=0.5 | Efficient | Matches SC@64 | ~30.5% of SC@64 |
| AutoTTS β=1.0 | Accurate | Surpasses SC@64 | Higher than β=0.5 |
| Discovery cost | — | — | $39.9 / 160 min |
At β=0.5, AutoTTS cuts aggregate token usage by roughly 69.5% compared with SC@64 while matching mean accuracy across models. At β=1.0, it surpasses all hand-crafted baselines in five of the eight benchmark/model cells.
The discovery phase makes zero live LLM calls during evaluation; only the agent's synthesis loop requires LLM calls, and that costs $39.9 in total.
Reproducing the CMC Logic
Effloow Lab reproduced the core CMC controller logic in Python to understand its structure. This is a conceptual reproduction based on the paper's description — not the full AutoTTS pipeline, which requires pre-collected trajectory data and the Qwen3 model weights.
```python
from dataclasses import dataclass, field
from typing import Optional
from collections import Counter
import math


@dataclass
class CMCState:
    ema_confidence: float = 0.0
    prev_ema: float = 0.0
    alpha: float = 0.3               # EMA smoothing factor
    confidence_threshold: float = 0.8
    beta: float = 0.5                # cost-accuracy tradeoff
    branches: list = field(default_factory=list)

    def pool_confidence(self) -> float:
        """Fraction of branches agreeing on the plurality answer."""
        answers = [b['answer'] for b in self.branches if b['answer'] is not None]
        if not answers:
            return 0.0
        most_common = Counter(answers).most_common(1)[0][1]
        return most_common / len(self.branches)

    def update_ema(self) -> float:
        conf = self.pool_confidence()
        self.prev_ema = self.ema_confidence
        self.ema_confidence = self.alpha * conf + (1 - self.alpha) * self.ema_confidence
        return self.ema_confidence

    def momentum(self) -> float:
        return self.ema_confidence - self.prev_ema

    def should_stop(self) -> bool:
        """Stop when EMA is high AND momentum is non-negative."""
        threshold = self.confidence_threshold * self.beta
        return self.ema_confidence >= threshold and self.momentum() >= 0.0

    def should_widen(self) -> bool:
        """Widen when confidence stagnates below the stopping threshold."""
        return self.momentum() <= 0.0 and self.ema_confidence < self.confidence_threshold * self.beta

    def consensus_answer(self) -> Optional[str]:
        answers = [b['answer'] for b in self.branches if b['answer'] is not None]
        return Counter(answers).most_common(1)[0][0] if answers else None

    def alignment_aware_budget(self, probe_budget: int) -> dict:
        """Give extra probe steps to branches matching the current consensus."""
        consensus = self.consensus_answer()
        allocations = {}
        aligned = [b for b in self.branches if b['answer'] == consensus]
        unaligned = [b for b in self.branches if b['answer'] != consensus]
        for b in aligned:
            allocations[b['id']] = math.ceil(0.6 * probe_budget / max(len(aligned), 1))
        for b in unaligned:
            allocations[b['id']] = math.floor(0.4 * probe_budget / max(len(unaligned), 1))
        return allocations
```
The full AutoTTS pipeline — including the agent-driven controller synthesis, trajectory replay, and β-parameterized search — is at github.com/zhengkid/AutoTTS.
Why This Matters for Developers
If you are building on top of a reasoning-capable LLM — whether for coding tasks, math, structured extraction, or multi-step planning — test-time scaling is how you unlock more capability without changing the model or retraining.
The problem has always been: how do you know which TTS strategy is right for your workload? Best-of-N is simple but expensive. MCTS is powerful but requires a verifier. Sequential refinement adds latency.
AutoTTS introduces a different question: instead of asking which strategy to pick, can you discover the right strategy automatically, specific to your workload?
For now, the full discovery pipeline requires reasoning-capable trajectory data and a reasonably large backbone (Qwen3-scale). But the framework generalizes — the paper notes that CMC, once discovered on AIME24 trajectories, transfers to held-out benchmarks without re-discovery. The environment is the reusable part.
Three near-term developer applications:
1. Use CMC directly: the GitHub repo publishes the discovered CMC controller. You can apply it to any pipeline that uses parallel sampling — generate N branches, track confidence across rounds, stop when CMC says stop. No discovery required.
2. Discover a domain-specific controller: if your workload is not mathematical reasoning (e.g., code generation, legal clause extraction), collect a small trajectory dataset from your domain and run the AutoTTS discovery agent on it. $40 and a few hours is a reasonable investment for a controller tuned to your specific task.
3. Tune β for your cost constraints: β=0.5 for latency-sensitive or budget-constrained systems, β=1.0 when accuracy is the primary constraint. This single parameter replaces the N-way decision of which hand-crafted TTS strategy to use.
What AutoTTS Doesn't Solve
The paper focuses on mathematical reasoning benchmarks where correct answers are verifiable. Extending to open-ended generation — where "pool confidence" and "alignment" cannot be computed from discrete answers — is an open research question.
The discovery agent itself uses GPT-4o (implied by the $39.9 cost breakdown). Using a weaker coding agent may yield less effective controllers. And the evaluation environment requires clean trajectory data, which itself takes some engineering to collect.
The 69.5% token reduction figure is also an aggregate — per-problem variance is not reported. On easy problems, SC@1 is still cheaper; AutoTTS's efficiency gains are concentrated in the medium-to-hard region of the benchmark distribution.
FAQ
Q: Can I use the CMC controller without running the full AutoTTS discovery?
Yes. The paper publishes CMC as the main discovered artifact. You can implement the controller logic (EMA confidence tracking + alignment-aware depth allocation) and apply it to any parallel sampling loop, with no dependency on the discovery pipeline.
Q: Does AutoTTS require fine-tuning the base LLM?
No. The base LLM is used only for collecting trajectories and at inference time. The discovery agent (which optimizes the controller) is a separate LLM (the paper uses GPT-4o for this role). The base reasoning model is not modified.
Q: How does AutoTTS compare to MCTS for hard reasoning tasks?
At β=1.0, AutoTTS surpasses MCTS-based baselines on AIME25/HMMT25. At β=0.5, it matches SC@64 at roughly 30% of the token cost — a regime where MCTS is far more expensive due to the tree expansion overhead. The key advantage of AutoTTS is that the controller is discovered, not designed, so it captures interactions that MCTS's hand-crafted policies miss.
Q: Does the CMC controller generalize across model sizes?
The paper tests across four Qwen3 backbone scales. CMC discovered on smaller models generalizes to larger ones, though the paper notes that accuracy gains from β=1.0 are larger on higher-capacity models.
Q: Is this approach limited to math benchmarks?
The AIME24/AIME25/HMMT25 focus is a practical choice for clean verifiability, not a fundamental limitation. The AutoTTS framework operates on any domain where you can collect reasoning trajectories and define a pool confidence signal — discrete answers, majority-vote agreement, or an external verifier.
Key Takeaways
AutoTTS changes the unit of human effort in test-time scaling from "design a strategy" to "design an environment." The discovered CMC controller is four clean mechanisms that outperform hand-crafted alternatives at a fraction of the cost.
For developers: the most actionable path is using the published CMC controller in your existing parallel sampling pipelines today, while the AutoTTS framework matures for domain-specific discovery.
For researchers: the paper's central insight — that controller synthesis over offline trajectories is cheap enough to drive agentic search — opens a path to task-specific TTS without the prohibitive cost of live-query optimization.
AutoTTS is the first published system for automatically discovering test-time scaling strategies rather than hand-crafting them. The discovered CMC controller cuts token usage by 69.5% versus naive parallel sampling while matching accuracy — and finding it cost $39.9. If you run any reasoning-heavy LLM pipeline today, CMC is worth integrating before reaching for MCTS or manual tuning.