RL Doesn't Teach LLMs New Reasoning — It Fixes 1-3% of Tokens
The conventional story about RLVR (reinforcement learning with verifiable rewards) is seductive: you take a capable base model, run GRPO or REINFORCE on math problems, and the model learns to reason better. DeepSeek-R1, OpenAI o1, and Qwen3 all lean into this narrative. It sounds like RL discovers new cognitive strategies.
A paper submitted to arXiv on May 7, 2026 — "Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning" (arXiv:2605.06241, Akgül et al.) — argues the story is wrong. And the evidence is surprisingly granular.
The Core Finding: RL Touches Almost Nothing
The researchers ran token-level analysis across multiple model families, multiple scales, and six RL algorithms. What they found:
RL modifies only 1–3% of token positions. The promoted token at each of those positions always lies within the base model's top-5 candidate tokens. And the positions RL edits are exactly the positions where the base model's own probability distribution is most uncertain — the high-entropy decision forks.
That's a narrow footprint. In a 1,000-token reasoning chain, RL is effectively intervening at 10–30 positions. Every other token stays close to what the base model would have produced anyway.
The practical implication: RL does not expand what the model knows. It does not teach new proof strategies, new algebraic moves, or new reasoning chains the base model lacked. It steers the model toward the right branch at moments of maximum uncertainty — moments where the base model was already considering the correct path among its top choices but wasn't reliably picking it.
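The two footprint claims above (how many positions change, and whether the promoted token was already a top-5 candidate) can be checked mechanically once you have per-position distributions from both models. The sketch below is illustrative only: it assumes "modified" means the greedy (argmax) token changed, which may not match the paper's exact criterion. The function name `rl_footprint` and all toy numbers are hypothetical.

```python
def argmax(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k(probs, k):
    """Indices of the k most probable tokens, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def rl_footprint(base_dists, rl_dists, k=5):
    """Compare greedy choices of a base and an RL-tuned model, position by position.

    Returns (fraction of positions whose greedy token changed,
             fraction of those changed positions whose new token is in the base top-k).
    Assumption: "modified" means the argmax token differs.
    """
    changed = 0
    in_top_k = 0
    for base, tuned in zip(base_dists, rl_dists):
        new_token = argmax(tuned)
        if new_token != argmax(base):
            changed += 1
            if new_token in top_k(base, k):
                in_top_k += 1
    frac_changed = changed / len(base_dists)
    frac_in_top_k = in_top_k / changed if changed else 1.0
    return frac_changed, frac_in_top_k


# Toy data (hypothetical): 100 positions over a 6-token vocabulary.
confident = [0.94, 0.02, 0.01, 0.01, 0.01, 0.01]
fork_base = [0.30, 0.28, 0.22, 0.10, 0.06, 0.04]
fork_tuned = [0.10, 0.75, 0.08, 0.04, 0.02, 0.01]  # RL promotes token 1

base_dists = [confident] * 98 + [fork_base] * 2
rl_dists = [confident] * 98 + [fork_tuned] * 2

frac_changed, frac_in_top_k = rl_footprint(base_dists, rl_dists)
print(f"Positions changed: {frac_changed:.0%}")                   # → 2%
print(f"Changed tokens inside base top-5: {frac_in_top_k:.0%}")   # → 100%
```

In this toy setup only the two fork positions flip, and the promoted token was the base model's second choice, which is the pattern the paper describes.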
Why Entropy Identifies Decision Forks
Shannon entropy measures the spread of a probability distribution. A low-entropy position is one where the model assigns, say, 95% probability to a single token (a near-certain continuation). A high-entropy position is one where probability mass is spread across several alternatives — the model is genuinely unsure which direction to go.
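For reference, the standard definition over a next-token distribution is shown below, measured in bits. The symbols are notation introduced here, not taken from the paper: V is the vocabulary size and p_i is the probability assigned to token i.

```latex
H(p) = -\sum_{i=1}^{V} p_i \log_2 p_i
```

Two boundary cases give a feel for the scale: a distribution that puts all its mass on one token has H = 0 bits, and a uniform distribution over k tokens has H = \log_2 k bits, so four equally likely continuations give 2 bits.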
Consider a math reasoning chain:
"The sum of the first n integers is ... [context tokens: low entropy]
Let's use the formula ... [fork: should this be closed-form or inductive? HIGH entropy]
n(n+1)/2 ... [context tokens: low entropy]
So for n=100 ... [fork: direct substitution or re-derive? MEDIUM entropy]
= 5050 [low entropy]"
The tokens that look like "Let's use" or that select among "formula", "induction", "recursion" are forks. The tokens that spell out arithmetic after the path is chosen are predictable continuations. RL concentrates its edits at the forks.
Effloow Lab ran a conceptual entropy-analysis PoC (see data/lab-runs/rethinking-rl-llm-reasoning-sparse-policy-selection-poc-2026.md). Using stdlib Python and manually specified toy probability distributions:
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Context token: predictable
context_token = [0.94, 0.04, 0.01, 0.01]
print(f"Context entropy: {entropy(context_token):.4f} bits") # → 0.40 bits

# Decision fork: model is uncertain which reasoning branch to take
fork_token = [0.28, 0.27, 0.25, 0.15, 0.05]
print(f"Fork entropy: {entropy(fork_token):.4f} bits") # → 2.15 bits
Output:
Context entropy: 0.4025 bits
Fork entropy: 2.1509 bits
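The gap is large. In this toy example, an entropy threshold (e.g., H > 1.5 bits) separates the fork from the context token. Applied across a full rollout, the same kind of gate is meant to isolate the ~1–3% of positions where RL has anything useful to do from the vast majority where the base model is already on the right track.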
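Extending the PoC from two distributions to a whole sequence, a threshold gate is a one-line filter. This is a minimal sketch on hand-written toy distributions. The helper name `gate_positions` and the 1.5-bit default are illustrative assumptions, not values from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def gate_positions(dists, threshold_bits=1.5):
    """Return the indices of positions whose entropy exceeds the threshold."""
    return [i for i, probs in enumerate(dists) if entropy(probs) > threshold_bits]


context_token = [0.94, 0.04, 0.01, 0.01]
fork_token = [0.28, 0.27, 0.25, 0.15, 0.05]

# 50 toy positions with a single decision fork at index 20
dists = [context_token] * 50
dists[20] = fork_token

gated = gate_positions(dists)
print(gated)                                                # → [20]
print(f"{len(gated) / len(dists):.0%} of positions gated")  # → 2% of positions gated
```

On real model outputs the entropies will not fall into two clean clusters, so the threshold becomes a tuning choice rather than an obvious cut.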
Corroboration From an Independent Team
The arXiv:2605.06241 finding does not stand alone. A separate paper, "Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning" (arXiv:2506.01939, NeurIPS 2025 poster), reached the same conclusion from a different direction.
That team restricted RLVR policy gradient updates to the high-entropy minority tokens only. The result on Qwen3-32B: +11.04 points on AIME '25, exceeding the gain from full-gradient RLVR. Updating only the low-entropy majority tokens: near-zero effect, or slight degradation.
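A rough way to picture "updating only the high-entropy minority" is a loss mask built by ranking positions by entropy and keeping the top slice. The sketch below is an assumption-laden illustration, not that team's code. The `keep_fraction=0.2` default merely echoes the "80/20" in the paper's title, the loss is a plain REINFORCE-style surrogate, and both function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def high_entropy_mask(dists, keep_fraction=0.2):
    """1.0 for the top `keep_fraction` of positions ranked by entropy, else 0.0."""
    entropies = [entropy(p) for p in dists]
    n_keep = max(1, round(keep_fraction * len(dists)))
    ranked = sorted(range(len(dists)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1.0 if i in keep else 0.0 for i in range(len(dists))]


def masked_policy_loss(token_logprobs, advantage, mask):
    """REINFORCE-style surrogate, averaged over the unmasked positions only."""
    total = sum(-advantage * lp * m for lp, m in zip(token_logprobs, mask))
    return total / max(1.0, sum(mask))


context_token = [0.94, 0.04, 0.01, 0.01]
fork_token = [0.28, 0.27, 0.25, 0.15, 0.05]

# 10 toy positions with forks at indices 3 and 7
dists = [context_token] * 10
dists[3] = fork_token
dists[7] = fork_token

# Log-probability of the sampled token at each position (token 0 everywhere)
token_logprobs = [math.log(d[0]) for d in dists]

mask = high_entropy_mask(dists)
loss = masked_policy_loss(token_logprobs, advantage=1.0, mask=mask)
print(mask)
print(f"{loss:.4f}")
```

Only positions 3 and 7 contribute to the loss here. The eight confident positions receive no gradient signal at all.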
A third paper, "Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs" (arXiv:2603.22446), confirmed the distributional sparsity pattern using a different analytical lens.
Three independent teams, three different methodologies, same answer: the benefits of RLVR are concentrated in a tiny subset of tokens at high-entropy positions.
RL post-training is not capability learning. It is sparse policy selection: committing to the right reasoning branch at the handful of positions where the base model is genuinely undecided. The base model already knows the answer paths — it just needs help picking the right fork, reliably.
What Is ReasonMaxxer?
The Akgül et al. paper does not just diagnose the problem — it proposes an alternative. ReasonMaxxer is what they call an "embarrassingly cheap post-training" method that achieves comparable results to full RL without running an RL optimization loop.
The method in outline:
- Sample a few hundred rollouts from the base model on a small problem set (tens of problems).
- Compute token entropy at every position across those rollouts.
- Identify high-entropy positions using an entropy gate threshold.
- Apply a contrastive loss exclusively at those positions — using the advantage-weighted signal from correct vs. incorrect rollouts, but only at the entropy-gated tokens. All other positions are anchored to the base distribution.
- Run for minutes on a single GPU.
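The outline above can be made concrete with a toy objective. The following is a hedged sketch of the general shape only, not the paper's actual loss. It assumes a mean-reward baseline for the advantage, a KL penalty as the "anchor", and (as a strong simplification) that all rollouts share the same per-position distributions. The names `gated_loss`, `gate_bits`, and `anchor_weight` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def gated_loss(rollouts, base_dists, policy_dists, gate_bits=1.5, anchor_weight=1.0):
    """Hypothetical entropy-gated objective.

    rollouts: list of (token_ids, reward), reward 1.0 if correct else 0.0.
    base_dists / policy_dists: per-position next-token distributions.
    """
    mean_reward = sum(r for _, r in rollouts) / len(rollouts)
    gated = [entropy(p) > gate_bits for p in base_dists]

    loss = 0.0
    # Advantage-weighted term, applied only at entropy-gated positions
    for tokens, reward in rollouts:
        advantage = reward - mean_reward
        for pos, tok in enumerate(tokens):
            if gated[pos]:
                loss += -advantage * math.log(policy_dists[pos][tok])
    loss /= len(rollouts)

    # Anchor every other position to the base distribution
    for pos, is_gated in enumerate(gated):
        if not is_gated:
            loss += anchor_weight * kl_divergence(base_dists[pos], policy_dists[pos])
    return loss


context_token = [0.94, 0.04, 0.01, 0.01]
fork_token = [0.28, 0.27, 0.25, 0.15, 0.05]
base_dists = [context_token, fork_token, context_token]

# One correct rollout (picks token 1 at the fork) and one incorrect (picks token 0)
rollouts = [([0, 1, 0], 1.0), ([0, 0, 0], 0.0)]

unchanged = gated_loss(rollouts, base_dists, base_dists)
shifted_fork = [0.10, 0.60, 0.15, 0.10, 0.05]  # policy now prefers token 1
shifted = gated_loss(rollouts, base_dists, [context_token, shifted_fork, context_token])

print(f"loss with policy == base: {unchanged:.4f}")
print(f"loss after promoting the correct fork token: {shifted:.4f}")
```

The second loss is lower than the first: shifting probability toward the token used by the correct rollout is rewarded at the fork, while the untouched context positions add zero anchor penalty.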
The claimed result: matches or exceeds full GRPO/REINFORCE across three model families, six scales, and six math reasoning benchmarks — at roughly 1,000× lower training cost.
The key insight behind this efficiency: if RL is only useful at 1–3% of positions, and those positions are identifiable from the base model's own entropy, you can skip the expensive on-policy rollout loop entirely. Sample once, identify the forks, and apply targeted corrections.
Why This Matters for Developers
If you're fine-tuning reasoning models, this changes what you should optimize. Standard RLVR training runs full forward/backward passes over every token on every rollout. Most of that compute affects tokens where the gradient signal provides no useful information (the model is already near-deterministic there). ReasonMaxxer's entropy gating focuses the compute budget on positions that actually move accuracy.
If you're analyzing RL-trained models, the 1–3% figure reframes what you're looking for. A 90%→92% accuracy jump from RLVR training does not mean the model learned new math. It means the model's choice at a handful of critical forks became more reliable. Looking at performance as "capability acquisition" sets the wrong expectations for what RL can and cannot do.
If you're evaluating whether RL is worth running, the implied comparison is important: if the same accuracy gain is achievable with tens of problems and minutes of training rather than thousands of problems and days of GPU time, the cost-benefit calculus for RL-based post-training shifts significantly. ReasonMaxxer is not yet widely validated beyond the paper's three model families, but the direction is clear.
If you're thinking about base model selection, the sparse-policy-selection framing suggests that the quality of a base model's top-5 token candidates at decision forks matters more than previously recognized. A base model that already has the correct reasoning path in its top-5 alternatives at branch points can be brought to high accuracy cheaply. A model that needs genuinely new paths introduced cannot be helped by RL alone — it needs additional pretraining data.
The Broader Claim: RL as Distribution Shaping, Not Discovery
The paper's framing is deliberately provocative. The authors argue that RL for LLM reasoning should be understood as distribution shaping over existing solutions, not capability learning. The base model already contains the correct reasoning paths — the training data ensured that. What's uncertain is which path the model will commit to at each fork point.
This connects to a broader pattern visible in the post-training literature. Papers like "STILL-2" and "ProRL" showed that very long RL training runs can eventually surface genuinely new reasoning behaviors, but those gains appear only after the easy wins from correcting high-entropy forks are exhausted. The "new capability" regime of RL training may require orders of magnitude more compute than the "policy correction" regime this paper characterizes.
For the vast majority of production use cases — where developers are adapting existing frontier models for specific domains — the relevant regime is the policy-correction one. And in that regime, the paper's core message holds: RL is doing very targeted work at very few positions.
What the Paper Does Not Claim
A few things to be precise about:
- The paper does not claim RL is useless. It claims RL is doing something narrower than commonly assumed, and that narrower thing can be achieved more cheaply.
- The 1–3% figure applies to the math reasoning benchmarks in the study. Different task domains (code generation, long-form reasoning, tool use) may have different entropy profiles and different fractions of high-entropy positions.
- ReasonMaxxer's results are reported on three model families. Independent replication across more diverse architectures has not yet been published as of the paper's submission date.
- The paper's authors note that "very long-horizon RL training" — far beyond what standard fine-tuning runs — may eventually teach capabilities not present in the base model. The sparse-policy-selection claim is about the regime most practitioners operate in.
FAQ
Q: Does this mean RL training is a waste of resources?
Not entirely. The paper argues that most of RL's accuracy gains in standard fine-tuning regimes come from sparse token-level corrections, not broad capability learning. Those corrections are real and produce measurable performance improvements. The argument is that you can achieve the same corrections more cheaply by targeting only the high-entropy positions directly, skipping the expensive on-policy generation loop.
Q: Can I implement entropy-gated training without ReasonMaxxer specifically?
Yes. The concept is straightforward: generate rollouts from your base model, compute per-token entropy, mask the loss to only high-entropy positions, and train. Several existing RLVR frameworks (TRL, OpenRLHF) support custom loss masks. The exact hyperparameters and gating thresholds are specific to the ReasonMaxxer paper, but the approach is implementable from the description.
Q: How does this interact with extended thinking models like o3 or Qwen3-Thinking?
Extended thinking models use separate reasoning chains (often <think> tokens) before producing final answers. Those thinking chains typically have much higher entropy throughout — the model is exploring rather than converging. The sparse-policy-selection finding was tested primarily on standard RLVR settings; its applicability to extended thinking training is an open question the paper notes explicitly.
Q: If the promoted token is always in the base model's top-5, what limits reasoning quality?
The base model's top-5 candidates at each fork. If the correct reasoning path is not among the top-5 candidates the base model considers at a critical decision point, RL cannot promote it — the token is not in scope. This is why base model quality remains the binding constraint for hard reasoning tasks. RL tunes the selector; it does not expand the candidate set.
Q: Is the corroborating arXiv:2506.01939 result from a different team?
Yes. The NeurIPS 2025 paper (arXiv:2506.01939) is from Shenzhi Wang et al., independent of the Akgül et al. team. They reached convergent conclusions using a different experimental design: rather than analyzing where RL edits tokens, they directly ablated which tokens receive gradient updates during RLVR training.
Key Takeaways
- RL post-training for reasoning modifies only 1–3% of token positions — the high-entropy decision forks where the model is genuinely uncertain.
- The promoted token is always within the base model's existing top-5 candidates. RL does not introduce paths the model had not considered.
- Base model token entropy at inference time provides a reliable proxy for identifying which positions will benefit from intervention — no RL-trained model required.
- ReasonMaxxer, a method proposed in the paper, achieves RL-comparable results using tens of problems and minutes of single-GPU training by targeting only entropy-gated positions.
- Three independent research groups have now published corroborating evidence for the sparse-policy-selection pattern.
- For most production fine-tuning use cases, the relevant question is not "can RL teach this model new capabilities?" but "can RL (or a cheaper alternative) correct the high-entropy forks reliably?"
The paper is a useful corrective to overstated claims about what RL post-training accomplishes. It also points toward a class of cheaper, more targeted post-training methods that may be more practical for teams without large GPU budgets. Whether ReasonMaxxer or similar entropy-gated methods gain adoption will become clearer as independent replication work follows the initial publication.