RRO: LLM Agent Optimization Through Rising Reward Trajectories (Paper PoC, 2026)

Date: 2026-05-22
Track: paper-poc
Slug: rro-rising-reward-trajectories-llm-agent-optimization-paper-poc-2026
Paper: arXiv:2505.20737 — RRO: LLM Agent Optimization Through Rising Reward Trajectories

Environment

Python 3 (stdlib only — random, math)
No external dependencies, no LLM API calls
Sandbox: temporary inline scripts (not saved to disk)

What Was Reproduced

The RRO trajectory filtering logic was reproduced in three complementary simulations:

Sim 1: Flat vs Dynamic Exploration (200 trials)

# Flat: always sample n_candidates=5 per step, take best
# RRO:  sample candidates until reward rises above previous step, then stop

Results:

Method	Avg Reward	Avg Samples/Traj
Flat (fixed 5 candidates)	0.5967	30.0
RRO (dynamic stop)	0.5179	10.1
Delta	-13.2% reward	-66.4% samples

Note: In this synthetic simulation, flat exploration achieves higher raw reward (RRO's average is 13.2% lower) because flat always takes the best of 5 candidates, whereas RRO accepts the first candidate that beats the previous step. The RRO advantage reported in the paper comes from training data quality: rising-reward trajectories are better supervision signals for the PRM, leading to better trained models. This simulation illustrates only the sample-efficiency aspect.
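
The comparison above can be sketched roughly as follows. This is a minimal illustration, not the original script: the reward model (sample_reward, Gaussian noise with a small upward drift), the use of the final-step reward as the trajectory reward, and the fallback to the best candidate when nothing rises are all assumptions, so the numbers it prints will not match the table.

```python
import random

def sample_reward(prev, rng):
    # Hypothetical reward model (assumption): Gaussian noise around the
    # previous reward with a small upward drift, clipped to [0, 1].
    return min(1.0, max(0.0, rng.gauss(prev + 0.05, 0.15)))

def flat_rollout(n_steps, n_candidates, rng):
    # Flat: always sample n_candidates per step and keep the best one.
    prev, samples = 0.0, 0
    for _ in range(n_steps):
        candidates = [sample_reward(prev, rng) for _ in range(n_candidates)]
        samples += n_candidates
        prev = max(candidates)
    return prev, samples

def rro_rollout(n_steps, max_candidates, rng):
    # RRO-style: stop sampling as soon as a candidate beats the previous step.
    prev, samples = 0.0, 0
    for _ in range(n_steps):
        best = None
        for _ in range(max_candidates):
            reward = sample_reward(prev, rng)
            samples += 1
            best = reward if best is None else max(best, reward)
            if reward > prev:
                break
        # Assumption: if no candidate rises, fall back to the best one seen.
        prev = best
    return prev, samples

def compare(n_trials=200, n_steps=6, n_candidates=5, seed=0):
    rng = random.Random(seed)
    flat = [flat_rollout(n_steps, n_candidates, rng) for _ in range(n_trials)]
    rro = [rro_rollout(n_steps, n_candidates, rng) for _ in range(n_trials)]

    def mean(values):
        return sum(values) / len(values)

    return {
        "flat_avg_reward": mean([r for r, _ in flat]),
        "flat_avg_samples": mean([s for _, s in flat]),
        "rro_avg_reward": mean([r for r, _ in rro]),
        "rro_avg_samples": mean([s for _, s in rro]),
    }

if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.4f}")
```

With 6 steps and 5 candidates per step, the flat method always uses 30 samples per trajectory, which matches the 30.0 in the table; the RRO count depends on how often the first candidate already rises.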

Sim 2: Trajectory Quality Analysis (20 trajectories)

Generated 20 simulated multi-step trajectories (6 steps each). Applied the strict rising reward criterion (each step must exceed the previous).

Result: 6/20 trajectories (30%) satisfied the strict criterion. These are the only ones RRO would select for process supervision training.
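
The strict criterion can be expressed as a short check. The sketch below is illustrative only; the function names and the example reward lists are hypothetical, and it compares consecutive step rewards without assuming any particular starting baseline.

```python
def is_strictly_rising(rewards):
    # True when every step's reward exceeds the reward of the step before it.
    return all(later > earlier for earlier, later in zip(rewards, rewards[1:]))

def filter_rising(trajectories):
    # Keep only trajectories whose per-step rewards rise strictly.
    return [t for t in trajectories if is_strictly_rising(t)]

# Hypothetical example data: each inner list is one trajectory's step rewards.
example = [
    [0.10, 0.25, 0.40, 0.55, 0.70, 0.90],  # rises at every step
    [0.10, 0.25, 0.25, 0.55, 0.70, 0.90],  # a tie breaks the strict criterion
    [0.30, 0.20, 0.40, 0.55, 0.70, 0.90],  # a drop breaks it too
]
selected = filter_rising(example)
```

Note that ties are rejected: a step that merely matches the previous reward does not count as rising.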

Sim 3: Dynamic Expansion Demo

Showed step-by-step how RRO stops candidate evaluation as soon as a rising reward is found:

Step 1 (prev=0.000): Candidate 1 = 0.000 skip, Candidate 2 = 0.107 → ACCEPT
Step 2 (prev=0.107): Candidate 1 = 0.206 → ACCEPT  
Step 3 (prev=0.206): Candidate 1 = 0.358 → ACCEPT

Average candidates evaluated: 1.33 per step (vs 5 for fixed exploration)
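
The average can be checked by replaying the three steps listed above. This is a small sketch with a hypothetical helper (expand_step); the candidate rewards are copied from the demo output.

```python
def expand_step(prev_reward, candidate_rewards):
    # Evaluate candidates in order and stop at the first one that rises.
    # Returns (accepted_reward, n_evaluated); accepted_reward is None
    # when no candidate exceeds prev_reward.
    for n_evaluated, reward in enumerate(candidate_rewards, start=1):
        if reward > prev_reward:
            return reward, n_evaluated
    return None, len(candidate_rewards)

# Candidate rewards per step, in the order they were evaluated in the demo.
demo_steps = [[0.000, 0.107], [0.206], [0.358]]

prev = 0.0
counts = []
for candidates in demo_steps:
    accepted, n_evaluated = expand_step(prev, candidates)
    counts.append(n_evaluated)
    prev = accepted

average_evaluated = sum(counts) / len(counts)
```

Step 1 needs two candidates because the first one only ties the previous reward of 0.000; steps 2 and 3 each accept their first candidate, giving (2 + 1 + 1) / 3 evaluations per step.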

What the Paper Reports

WebShop: RRO = 62.91 reward, 1.86 samples/step (vs Fixed: 61.20, 5 samples)
InterCode-SQL: RRO = 55.08 reward, 1.64 samples/step (vs Fixed: 54.68, 5 samples)
RRO outperforms baseline PRM approaches while using ~63% fewer exploration samples
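
For reference, the reduction in samples per step implied by the figures above can be computed directly. The helper name below is hypothetical; the inputs are the per-step sample counts quoted in this section.

```python
def sample_reduction(dynamic_per_step, fixed_per_step):
    # Fraction of samples saved relative to a fixed per-step budget.
    return 1.0 - dynamic_per_step / fixed_per_step

webshop_reduction = sample_reduction(1.86, 5)        # about 0.628
intercode_sql_reduction = sample_reduction(1.64, 5)  # about 0.672
```

The WebShop figure works out to roughly 63% fewer samples per step, and the InterCode-SQL figure to roughly 67%.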

Limitations of This PoC

No actual LLM used: rewards are synthetic Gaussian random draws
No PRM training loop reproduced (requires GPU + model weights)
The reward improvement over flat exploration reported in the paper comes from training quality, not greedy selection; this PoC demonstrates only the sample efficiency and filtering logic
Real InterCode-SQL and WebShop tasks require database sandboxes and web environments

Conclusion

The rising reward filtering logic is simple to reason about: at each step, stop sampling candidates as soon as one improves on the previous step's reward. This is the key insight from the paper. The concept is intuitive and the implementation is straightforward, which makes it useful for teaching even without actual model training.