Rethinking RL for LLM Reasoning: Sparse Policy Selection PoC (2026)

Date: 2026-05-28
Track: paper-poc
Paper: "Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning"
ArXiv: https://arxiv.org/abs/2605.06241
Authors: Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
Submitted: May 7, 2026 (revised May 8, 2026)

Environment

Python 3.12 (stdlib only — no LLM API, no external deps)
No credentials or API keys used
Math: math module only (Shannon entropy)

What Was Reproduced

Core claim under test: RL modifications are concentrated at 1–3% of token positions — the "high-entropy decision forks" — while the vast majority of token positions are left unchanged from the base model's distribution.

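Method: Simulated token probability distributions at 10 positions in an illustrative reasoning chain (context tokens vs. decision-fork tokens). Computed the Shannon entropy H, in bits, at each position. Flagged as decision forks the positions whose entropy exceeds a threshold; the exact cutoff is not recorded here, but any value between about 0.72 and 2.05 bits yields the same three forks.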
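
For reference, the quantity computed at each position is the standard Shannon entropy in bits. The worked line applies it to the first toy distribution used in this PoC, (0.97, 0.02, 0.01). The symbols p_i and H are notation chosen here for illustration, not taken from the paper.

```latex
H(p) = -\sum_{i} p_i \log_2 p_i
\qquad
H(0.97,\, 0.02,\, 0.01) \approx 0.0426 + 0.1129 + 0.0664 \approx 0.2219 \text{ bits}
```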

Commands Run

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy probability distributions:
# - Context tokens: nearly deterministic (1 dominant token, entropy ~0.2–0.7 bits)
# - Decision forks: flat/spread distribution (entropy ~2.0–2.2 bits)

positions = {
    'pos_0_context': [0.97, 0.02, 0.01],
    'pos_1_context': [0.92, 0.05, 0.02, 0.01],
    'pos_2_context': [0.88, 0.07, 0.03, 0.01, 0.01],
    'pos_3_branch':  [0.33, 0.31, 0.22, 0.10, 0.04],  # <-- fork
    'pos_4_context': [0.91, 0.06, 0.02, 0.01],
    'pos_5_branch':  [0.28, 0.27, 0.25, 0.15, 0.05],  # <-- fork
    'pos_6_context': [0.95, 0.03, 0.01, 0.01],
    'pos_7_context': [0.89, 0.07, 0.03, 0.01],
    'pos_8_context': [0.94, 0.04, 0.01, 0.01],
    'pos_9_branch':  [0.26, 0.25, 0.24, 0.15, 0.10],  # <-- fork
}
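
The listing above defines the distributions but not the reporting step. A minimal sketch of how such a report could be produced follows. The 1.0-bit cutoff (FORK_THRESHOLD_BITS) and the helper name find_forks are assumptions made for illustration, since the threshold actually used is not recorded. The definitions are repeated so the block runs on its own.

```python
import math

# Assumed cutoff in bits; the notes do not record the value actually used.
FORK_THRESHOLD_BITS = 1.0

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

positions = {
    'pos_0_context': [0.97, 0.02, 0.01],
    'pos_1_context': [0.92, 0.05, 0.02, 0.01],
    'pos_2_context': [0.88, 0.07, 0.03, 0.01, 0.01],
    'pos_3_branch':  [0.33, 0.31, 0.22, 0.10, 0.04],
    'pos_4_context': [0.91, 0.06, 0.02, 0.01],
    'pos_5_branch':  [0.28, 0.27, 0.25, 0.15, 0.05],
    'pos_6_context': [0.95, 0.03, 0.01, 0.01],
    'pos_7_context': [0.89, 0.07, 0.03, 0.01],
    'pos_8_context': [0.94, 0.04, 0.01, 0.01],
    'pos_9_branch':  [0.26, 0.25, 0.24, 0.15, 0.10],
}

def find_forks(dists, threshold=FORK_THRESHOLD_BITS):
    """Return {name: entropy} for positions whose entropy exceeds threshold."""
    entropies = {name: entropy(probs) for name, probs in dists.items()}
    return {name: h for name, h in entropies.items() if h > threshold}

forks = find_forks(positions)

# Per-position table: name, entropy, and whether it was flagged.
for name, probs in positions.items():
    label = "YES (high entropy)" if name in forks else "no"
    print(f"{name:<24}{entropy(probs):>10.4f}  {label:>18}")

print(f"Total positions: {len(positions)}")
print(f"High-entropy decision forks: {len(forks)}")
print(f"Fraction: {100 * len(forks) / len(positions):.1f}%")

# Rank the flagged positions from highest to lowest entropy.
top = sorted(forks.items(), key=lambda kv: kv[1], reverse=True)
for name, h in top:
    print(f"  {name}: H={h:.4f}")
```

Because the toy context and fork entropies are far apart, the flagged set does not change for any cutoff that lies between the two groups.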

Output

=== Sparse Policy Selection: Token-Level Entropy Analysis ===
Position                   Entropy   Decision Fork?
-------------------------------------------------------
pos_0_context               0.2219               no
pos_1_context               0.5061               no
pos_2_context               0.7155               no
pos_3_branch                2.0501  YES (high entropy)
pos_4_context               0.5467               no
pos_5_branch                2.1509  YES (high entropy)
pos_6_context               0.3549               no
pos_7_context               0.6364               no
pos_8_context               0.4025               no
pos_9_branch                2.2422  YES (high entropy)

Total positions: 10
High-entropy decision forks: 3
Fraction: 30.0% (paper claims 1–3% in real models — toy example uses larger fraction)

Top-3 highest entropy positions (where RL/ReasonMaxxer would intervene):
  pos_9_branch: H=2.2422
  pos_5_branch: H=2.1509
  pos_3_branch: H=2.0501

The other 7 positions stay frozen — base model distribution unchanged.

Observations

The entropy separation is sharp. Context tokens cluster in 0.2–0.7 bits; decision fork tokens jump to 2.0–2.2 bits. A simple threshold cleanly identifies the forks.
The toy model shows 30% fork positions (3/10), far above the paper's 1–3%. This is expected: a real 8B+ reasoning model generates hundreds of tokens per chain, and most of those positions are near-deterministic (punctuation, articles, the continuation of a formula step already started), so the fork fraction should shrink substantially at realistic chain lengths.
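In this toy setup, no RL is needed to identify these positions: the per-position entropy of the hand-crafted distribution is enough to flag them. This is consistent with, but does not independently test, the paper's claim that the base model's own probability distribution (available at inference time) indicates where intervention is useful.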
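
To make the scale argument concrete, the sketch below holds the number of forks fixed at the toy value of 3 and pads the chain with additional near-deterministic positions. The chain lengths are hypothetical values picked for illustration, not measurements from any model, and a longer real chain would likely contain more forks as well. The function name fork_fraction is likewise an assumption.

```python
def fork_fraction(n_forks, chain_length):
    """Fraction of positions that are forks in a chain of the given length."""
    if chain_length <= 0 or not 0 <= n_forks <= chain_length:
        raise ValueError("need chain_length > 0 and 0 <= n_forks <= chain_length")
    return n_forks / chain_length

N_FORKS = 3  # same fork count as the toy chain

# Hypothetical chain lengths; only the first (10) matches the toy example.
for length in (10, 100, 300):
    pct = 100 * fork_fraction(N_FORKS, length)
    print(f"chain length {length:>4}: {pct:.1f}% forks")
```

Under these assumptions, the fraction falls into the paper's 1–3% range once the chain reaches a few hundred positions.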

Limitations

No live LLM logprobs used (no API key in this environment)
Probability distributions are manually crafted for illustration
The 1–3% figure from the paper applies to real 8B–72B model families across 6 math benchmarks; toy simulation cannot reproduce that quantitative result
ReasonMaxxer training procedure (contrastive loss + entropy gating) was analyzed from the paper description; no actual fine-tuning was run
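
If per-token log-probabilities were available, the same entropy computation could be applied to them. The sketch below assumes a hypothetical input format: a list of natural-log probabilities for the top-k candidates at one position, renormalized because a top-k list generally does not sum to 1. The function name and the sample values are illustrative and do not correspond to any specific API.

```python
import math

def entropy_from_logprobs(logprobs):
    """Entropy in bits of a top-k candidate list given as natural-log probs.

    The top-k mass is renormalized to 1, so this is the entropy of the
    truncated distribution, which can differ from the full-vocabulary entropy.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    if total <= 0:
        raise ValueError("logprobs must contain at least one finite value")
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

# Illustrative values only: a near-deterministic position and a flat one.
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
uncertain = [math.log(0.25)] * 4

print(f"confident: {entropy_from_logprobs(confident):.4f} bits")
print(f"uncertain: {entropy_from_logprobs(uncertain):.4f} bits")
```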

Secondary Supporting Evidence Verified (via WebSearch)

arXiv:2506.01939 "Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning": earlier work reporting a closely related pattern using Qwen3-32B on AIME'25 (+11.04 score gain from restricting RLVR updates to high-entropy minority tokens). NeurIPS 2025 poster.
arXiv:2603.22446 "Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs" — another corroborating analysis of distributional sparsity under RLVR.
Authors: Akgül et al. (USC / Purdue collaboration) — affiliation verifiable from arxiv.org abstract page.

Conclusion

The lab PoC confirms that the entropy-analysis concept is implementable with only stdlib Python. On hand-crafted distributions, the toy model clearly separates context tokens (low entropy, leave unchanged) from decision forks (high entropy, intervene). The evidence supports writing the article with the claim "Effloow Lab ran a conceptual entropy-analysis PoC illustrating the sparse policy selection mechanism."