Rethinking RL LLM Reasoning Sparse Policy Selection PoC 2026
Date: 2026-05-28
Track: paper-poc
Paper: "Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning"
ArXiv: https://arxiv.org/abs/2605.06241
Authors: Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
Submitted: May 7, 2026 (revised May 8, 2026)
Environment
- Python 3.12 (stdlib only — no LLM API, no external deps)
- No credentials or API keys used
- Math: math module only (Shannon entropy)
What Was Reproduced
Core claim under test: RL modifications are concentrated at 1–3% of token positions — the "high-entropy decision forks" — while the vast majority of token positions are left unchanged from the base model's distribution.
Method: Simulated token probability distributions across 10 positions in a representative reasoning chain (context tokens vs. decision-fork tokens). Computed Shannon entropy H, in bits, at each position. Identified which positions exceed a fixed entropy threshold.
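The entropy in the method is the standard Shannon entropy in bits. As a worked check, the formula below is applied to the first toy distribution, [0.97, 0.02, 0.01] (pos_0_context in the script). The symbols p_i and H are notation chosen here for illustration, not taken from the paper.

```latex
H(p) = -\sum_{i} p_i \log_2 p_i

H_{\text{pos\_0}} = -\left(0.97\log_2 0.97 + 0.02\log_2 0.02 + 0.01\log_2 0.01\right)
\approx 0.0426 + 0.1129 + 0.0664 \approx 0.2219 \text{ bits}
```

A uniform distribution over five tokens would give the maximum for that support size, log2(5) ≈ 2.32 bits, so the fork positions at 2.0–2.2 bits are close to maximally uncertain.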
Commands Run
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy probability distributions:
# - Context tokens: nearly deterministic (1 dominant token, entropy ~0.2–0.7 bits)
# - Decision forks: flat/spread distribution (entropy ~2.0–2.2 bits)
positions = {
    'pos_0_context': [0.97, 0.02, 0.01],
    'pos_1_context': [0.92, 0.05, 0.02, 0.01],
    'pos_2_context': [0.88, 0.07, 0.03, 0.01, 0.01],
    'pos_3_branch': [0.33, 0.31, 0.22, 0.10, 0.04],  # <-- fork
    'pos_4_context': [0.91, 0.06, 0.02, 0.01],
    'pos_5_branch': [0.28, 0.27, 0.25, 0.15, 0.05],  # <-- fork
    'pos_6_context': [0.95, 0.03, 0.01, 0.01],
    'pos_7_context': [0.89, 0.07, 0.03, 0.01],
    'pos_8_context': [0.94, 0.04, 0.01, 0.01],
    'pos_9_branch': [0.26, 0.25, 0.24, 0.15, 0.10],  # <-- fork
}
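The listing above defines the data but stops before the reporting step. A minimal sketch of how the table under Output could be produced follows. The 1.5-bit THRESHOLD_BITS and the report helper are assumptions for illustration; the note does not record the threshold or the printing code actually used.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

THRESHOLD_BITS = 1.5  # assumed cut-off, not recorded in the note

positions = {
    'pos_0_context': [0.97, 0.02, 0.01],
    'pos_1_context': [0.92, 0.05, 0.02, 0.01],
    'pos_2_context': [0.88, 0.07, 0.03, 0.01, 0.01],
    'pos_3_branch': [0.33, 0.31, 0.22, 0.10, 0.04],
    'pos_4_context': [0.91, 0.06, 0.02, 0.01],
    'pos_5_branch': [0.28, 0.27, 0.25, 0.15, 0.05],
    'pos_6_context': [0.95, 0.03, 0.01, 0.01],
    'pos_7_context': [0.89, 0.07, 0.03, 0.01],
    'pos_8_context': [0.94, 0.04, 0.01, 0.01],
    'pos_9_branch': [0.26, 0.25, 0.24, 0.15, 0.10],
}

def report(positions, threshold):
    """Print one row per position and return the (name, entropy) pairs above threshold."""
    rows = [(name, entropy(probs)) for name, probs in positions.items()]
    forks = [(name, h) for name, h in rows if h > threshold]
    for name, h in rows:
        flag = "YES (high entropy)" if h > threshold else "no"
        print(f"{name:<18}{h:<10.4f}{flag}")
    print(f"Total positions: {len(rows)}")
    print(f"High-entropy decision forks: {len(forks)}")
    print(f"Fraction: {len(forks) / len(rows):.1%}")
    return forks

forks = report(positions, THRESHOLD_BITS)
```

With this assumed threshold the three branch positions are flagged and the seven context positions are not, matching the recorded output.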
Output
=== Sparse Policy Selection: Token-Level Entropy Analysis ===
Position          Entropy   Decision Fork?
-------------------------------------------------------
pos_0_context     0.2219    no
pos_1_context     0.5061    no
pos_2_context     0.7155    no
pos_3_branch      2.0501    YES (high entropy)
pos_4_context     0.5467    no
pos_5_branch      2.1509    YES (high entropy)
pos_6_context     0.3549    no
pos_7_context     0.6364    no
pos_8_context     0.4025    no
pos_9_branch      2.2422    YES (high entropy)
Total positions: 10
High-entropy decision forks: 3
Fraction: 30.0% (paper claims 1–3% in real models — toy example uses larger fraction)
Top-3 highest entropy positions (where RL/ReasonMaxxer would intervene):
pos_9_branch: H=2.2422
pos_5_branch: H=2.1509
pos_3_branch: H=2.0501
The other 7 positions stay frozen — base model distribution unchanged.
Observations
- The entropy separation is sharp, by construction. Context tokens cluster in 0.2–0.7 bits; decision-fork tokens jump to 2.0–2.2 bits. Any threshold between about 0.72 and 2.05 bits cleanly identifies the forks in this toy data.
- The toy model shows 30% fork positions (3/10), larger than the paper's 1–3%. This is expected: a real 8B+ reasoning model generates hundreds of tokens per chain, and most of those positions are near-deterministic (punctuation, articles, continuation of a formula step already started). The fork fraction should shrink accordingly in real chains, although this PoC does not measure it.
- No RL is needed to identify these positions in the toy setup. The base model's own probability distribution (available at inference time) is enough to locate candidate intervention points. This is consistent with the paper's core claim, but hand-crafted distributions cannot independently corroborate it.
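The scaling argument in the second observation can be illustrated with simple arithmetic: keep the same three forks but pad the chain with near-deterministic context tokens. The sketch below is purely synthetic. The 200-token chain length, the 1.5-bit threshold, and the fork_fraction helper are assumptions chosen for illustration, not measurements from any model.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def fork_fraction(distributions, threshold):
    """Fraction of positions whose entropy exceeds the threshold."""
    forks = sum(1 for d in distributions if entropy(d) > threshold)
    return forks / len(distributions)

THRESHOLD_BITS = 1.5  # assumed cut-off, not from the note or the paper

fork = [0.33, 0.31, 0.22, 0.10, 0.04]  # fork distribution from the toy example
context = [0.97, 0.02, 0.01]           # context distribution from the toy example

toy_chain = [fork] * 3 + [context] * 7     # same 3-of-10 mix as the PoC
long_chain = [fork] * 3 + [context] * 197  # hypothetical 200-token chain

print(f"toy chain:  {fork_fraction(toy_chain, THRESHOLD_BITS):.1%}")
print(f"long chain: {fork_fraction(long_chain, THRESHOLD_BITS):.1%}")
```

Three forks in ten positions is 30%; the same three forks in 200 positions is 1.5%, which falls inside the 1–3% range the paper reports. This shows only that the toy fraction is an artifact of chain length, not that real models behave this way.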
Limitations
- No live LLM logprobs used (no API key in this environment)
- Probability distributions are manually crafted for illustration
- The 1–3% figure from the paper applies to real 8B–72B model families across 6 math benchmarks; toy simulation cannot reproduce that quantitative result
- ReasonMaxxer training procedure (contrastive loss + entropy gating) was analyzed from the paper description; no actual fine-tuning was run
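With access to per-token log-probabilities, the hand-crafted distributions could be replaced by measured ones. A minimal sketch follows, assuming a hypothetical input of natural-log probabilities for the top-k candidate tokens at each position. No specific API or response format is implied, and entropy_from_logprobs is a name introduced here. Because only the top-k tokens are seen, the distribution is renormalized and the result only approximates the full-vocabulary entropy.

```python
import math

def entropy_from_logprobs(top_logprobs):
    """Shannon entropy (bits) of a renormalized top-k distribution.

    top_logprobs: natural-log probabilities of the k most likely tokens
    at one position (hypothetical input format).
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]  # top-k mass is usually below 1
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical two-position chain: one near-deterministic token, one four-way fork.
chain = [
    [math.log(0.97), math.log(0.02), math.log(0.01)],
    [math.log(0.25)] * 4,
]
entropies = [entropy_from_logprobs(lps) for lps in chain]
for i, h in enumerate(entropies):
    print(f"position {i}: H={h:.4f} bits")
```

The first position reproduces the 0.2219 bits of pos_0_context, and the uniform four-way fork gives exactly 2 bits.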
Secondary Supporting Evidence Verified (via WebSearch)
- arXiv:2506.01939 "Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning" — independently finds the same pattern using Qwen3-32B on AIME'25 (+11.04 score gain by restricting RLVR to high-entropy minority tokens). NeurIPS 2025 poster.
- arXiv:2603.22446 "Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs" — another corroborating analysis of distributional sparsity under RLVR.
- Authors: Akgül et al. (USC / Purdue collaboration) — affiliation verifiable from arxiv.org abstract page.
Conclusion
Lab PoC confirms the entropy-analysis concept is implementable with only stdlib Python. The toy model clearly separates context tokens (low entropy, leave unchanged) from decision forks (high entropy, intervene). The evidence supports writing the article with the claim "Effloow Lab ran a conceptual entropy-analysis PoC illustrating the sparse policy selection mechanism."
This note supports the public article and records what was actually checked.