RRO: Rising Reward Trajectories for LLM Agent Optimization (Paper PoC, 2026)
Date: 2026-05-22
Track: paper-poc
Slug: rro-rising-reward-trajectories-llm-agent-optimization-paper-poc-2026
Paper: arXiv:2505.20737 — RRO: LLM Agent Optimization Through Rising Reward Trajectories
Environment
- Python 3 (stdlib only: random, math)
- No external dependencies, no LLM API calls
- Sandbox: temporary inline scripts (not saved to disk)
What Was Reproduced
The RRO trajectory filtering logic was reproduced in three complementary simulations:
Sim 1: Flat vs Dynamic Exploration (200 trials)
# Flat: always sample n_candidates=5 per step, take best
# RRO: sample candidates until reward rises above previous step, then stop
Results:
| Method | Avg Reward | Avg Samples/Traj |
|---|---|---|
| Flat (fixed 5 candidates) | 0.5967 | 30.0 |
| RRO (dynamic stop) | 0.5179 | 10.1 |
| Delta | -13.2% reward | -66.4% samples |
Note: In this synthetic simulation, flat exploration scores a higher raw reward (0.5967 vs 0.5179) because it always takes the best of 5 candidates, whereas RRO stops at the first candidate that rises above the previous step. The RRO advantage reported in the paper comes from training data quality: rising-reward trajectories are better supervision signals for the PRM, which leads to better trained models. This simulation illustrates the sample-efficiency aspect only.
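The two exploration strategies compared above can be sketched as follows. This is a minimal illustration, not the script that produced the table: the reward model (`sample_reward`, a Gaussian step clipped to [0, 1]), the 6-step trajectory length, and the fallback to the best candidate seen when nothing rises are all assumptions made for this sketch, so the numbers it prints will differ from those reported above.

```python
import random

def sample_reward(rng, prev):
    """Hypothetical reward model (assumption): previous reward plus Gaussian noise, clipped to [0, 1]."""
    return min(1.0, max(0.0, prev + rng.gauss(0.05, 0.1)))

def flat_trajectory(rng, n_steps=6, n_candidates=5):
    """Fixed exploration: always draw n_candidates per step and keep the best."""
    prev, samples = 0.0, 0
    for _ in range(n_steps):
        candidates = [sample_reward(rng, prev) for _ in range(n_candidates)]
        samples += n_candidates
        prev = max(candidates)
    return prev, samples

def rro_trajectory(rng, n_steps=6, max_candidates=5):
    """Dynamic exploration: stop sampling a step once a candidate beats the previous reward."""
    prev, samples = 0.0, 0
    for _ in range(n_steps):
        best = None
        for _ in range(max_candidates):
            r = sample_reward(rng, prev)
            samples += 1
            best = r if best is None else max(best, r)
            if r > prev:  # rising reward found: stop exploring this step
                break
        prev = best  # assumption: fall back to the best seen if nothing rose
    return prev, samples

if __name__ == "__main__":
    rng = random.Random(0)
    trials = 200
    for name, fn in [("flat", flat_trajectory), ("rro", rro_trajectory)]:
        results = [fn(rng) for _ in range(trials)]
        avg_reward = sum(r for r, _ in results) / trials
        avg_samples = sum(s for _, s in results) / trials
        print(f"{name}: avg reward {avg_reward:.4f}, avg samples/traj {avg_samples:.1f}")
```

With 6 steps and 5 candidates per step, the flat strategy always uses exactly 30 samples per trajectory, which matches the 30.0 in the table; the dynamic strategy uses between 6 and 30.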
Sim 2: Trajectory Quality Analysis (20 trajectories)
Generated 20 simulated multi-step trajectories (6 steps each). Applied the strict rising reward criterion (each step must exceed the previous).
Result: 6/20 trajectories (30%) satisfied the strict criterion. These are the only ones RRO would select for process supervision training.
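The strict criterion is a one-line check over consecutive rewards. The sketch below assumes each trajectory is represented as a plain list of per-step rewards; the function names are hypothetical and chosen for illustration.

```python
def is_strictly_rising(rewards):
    """True if every step's reward exceeds the reward of the step before it."""
    return all(later > earlier for earlier, later in zip(rewards, rewards[1:]))

def filter_rising(trajectories):
    """Keep only the trajectories that satisfy the strict rising-reward criterion."""
    return [t for t in trajectories if is_strictly_rising(t)]

if __name__ == "__main__":
    trajectories = [
        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # rises at every step: kept
        [0.1, 0.2, 0.2, 0.4, 0.5, 0.6],  # plateau at step 3: rejected
        [0.3, 0.1, 0.4, 0.5, 0.6, 0.7],  # drop at step 2: rejected
    ]
    kept = filter_rising(trajectories)
    print(f"{len(kept)}/{len(trajectories)} trajectories kept")
```

Because the comparison is strict, a plateau (two equal consecutive rewards) disqualifies a trajectory just as a drop does.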
Sim 3: Dynamic Expansion Demo
Showed step-by-step how RRO stops candidate evaluation as soon as a rising reward is found:
Step 1 (prev=0.000): Candidate 1 = 0.000 skip, Candidate 2 = 0.107 → ACCEPT
Step 2 (prev=0.107): Candidate 1 = 0.206 → ACCEPT
Step 3 (prev=0.206): Candidate 1 = 0.358 → ACCEPT
Average candidates evaluated: 1.33 per step (vs 5 for fixed exploration)
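The trace above can be replayed with a small helper. This sketch assumes the candidate rewards for each step are available as an ordered list; `expand_step` is a hypothetical name, and the lists contain only the candidates shown in the trace.

```python
def expand_step(prev_reward, candidate_rewards):
    """Evaluate candidates in order and stop at the first one that beats prev_reward.

    Returns (accepted_reward, n_evaluated); accepted_reward is None if no candidate rises.
    """
    for n_evaluated, reward in enumerate(candidate_rewards, start=1):
        if reward > prev_reward:
            return reward, n_evaluated
    return None, len(candidate_rewards)

if __name__ == "__main__":
    # Candidate rewards taken from the three steps of the trace above.
    steps = [[0.000, 0.107], [0.206], [0.358]]
    prev, counts = 0.0, []
    for candidates in steps:
        accepted, n_evaluated = expand_step(prev, candidates)
        counts.append(n_evaluated)
        prev = accepted
    print(f"{sum(counts) / len(counts):.2f}")  # prints 1.33
```

The counts are 2, 1, and 1, so the average is 4/3, or about 1.33 candidates per step.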
What the Paper Reports
- WebShop: RRO = 62.91 reward, 1.86 samples/step (vs Fixed: 61.20, 5 samples)
- InterCode-SQL: RRO = 55.08 reward, 1.64 samples/step (vs Fixed: 54.68, 5 samples)
- RRO outperforms baseline PRM approaches while using ~63% fewer exploration samples
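As a quick arithmetic check on the sample counts listed above, the per-benchmark reduction relative to 5 fixed samples per step can be computed directly. The helper name `sample_reduction` is hypothetical.

```python
def sample_reduction(dynamic_samples, fixed_samples):
    """Fraction of exploration samples saved relative to fixed exploration."""
    return 1.0 - dynamic_samples / fixed_samples

if __name__ == "__main__":
    # Samples per step as listed above: RRO vs fixed exploration with 5 samples.
    for name, rro_samples in [("WebShop", 1.86), ("InterCode-SQL", 1.64)]:
        print(f"{name}: {sample_reduction(rro_samples, 5):.1%} fewer samples")
```

This gives 62.8% for WebShop and 67.2% for InterCode-SQL, so the "~63%" figure corresponds to the WebShop number.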
Limitations of This PoC
- No actual LLM used — rewards are synthetic Gaussian processes
- No PRM training loop reproduced (requires GPU + model weights)
- In the paper, the reward improvement over flat exploration comes from training quality, not greedy selection; this PoC demonstrates only the sample efficiency and filtering logic
- Real InterCode-SQL and WebShop tasks require database sandboxes and web environments
Conclusion
The rising reward filtering logic is simple to reason about: stop exploring when you find a step that improves on the previous. This is the key insight from the paper. The educational value is high — the concept is intuitive and the implementation is straightforward, even without actual model training.
This note supports the public article and records what was actually checked.