Effloow
EFFLOOW LAB LAB-RUN

DRA-GRPO Diversity-Aware Reward Adjustment Reasoning 2026

Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.

Date: 2026-05-10
Track: paper-poc
Paper: arXiv:2505.09655, "DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning"
Environment: Python 3.12, macOS Darwin 24.6.0 (Apple Silicon)


Objective

Reproduce the core DRA reward-adjustment algorithm conceptually, verify the mathematical logic of the inverse propensity scoring mechanism using the SMI Graph-Cut formula, and demonstrate the diversity collapse problem with a minimal Python simulation.


Environment Setup (verified installable)

pip install numpy scipy scikit-learn sentence-transformers
# Note: full DRA-GRPO training requires GPU + trl >= 0.13, transformers >= 4.47
# We reproduce the reward-adjustment kernel only (CPU-friendly)

Phase 1: Reproducing Diversity Collapse (Simulation)

We simulated the diversity collapse scenario described in Theorem 4.1 of the paper.

import numpy as np

# Simulated GRPO group: 8 sampled completions for 1 math problem
# 6 use the same reasoning path (Path A), 2 use different paths (Path B, C)
rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0]  # all correct except index 6

# Standard GRPO advantage: normalize within group
mean_r = np.mean(rewards)
std_r = np.std(rewards) + 1e-8
advantages = [(r - mean_r) / std_r for r in rewards]
print("Standard GRPO advantages:", [f"{a:.3f}" for a in advantages])
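# => Path C (index 7, correct) gets the same advantage as the six Path A clones;
#    Path B (index 6, incorrect) is the only completion that is penalized
# => No incentive to explore diverse correct reasoning; the update is dominated by the Path A mode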

Output (verified locally):

Standard GRPO advantages: ['0.378', '0.378', '0.378', '0.378', '0.378', '0.378', '-2.646', '0.378']

Key observation: Path C, the structurally distinct correct completion, receives the same advantage (0.378) as each of the six Path A clones; only the incorrect Path B is penalized (−2.646). The optimizer therefore has no gradient signal to prefer the diverse correct path over the redundant A clones.
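The group normalization used in this phase can also be written as a small vectorized helper. This is a minimal sketch rather than the paper's implementation; the function name `grpo_advantages` and the `eps` parameter are illustrative assumptions. It uses the population standard deviation (NumPy's default), as the listing above does.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (population std, np.std default)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Same group as above: indices 0-5 are Path A, index 6 is Path B (wrong), index 7 is Path C.
adv = grpo_advantages([1.0] * 6 + [0.0, 1.0])
print(np.round(adv, 3))

# With population std, no standardized value in a group of G samples can exceed sqrt(G - 1).
bound = float(np.sqrt(len(adv) - 1))
print(round(bound, 3))  # 2.646
```

Every correct completion, Path C included, lands on the same value (about 0.378), and the single incorrect completion sits at the bound of about −2.646.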


Phase 2: DRA Reward Adjustment (Core Algorithm PoC)

We implemented the DRA reward formula from Section 3.2 of the paper:

R̃(q, oᵢ) = R(q, oᵢ) / (1 + SMI({oᵢ}, C \ {oᵢ}))

Where SMI is the Graph-Cut Submodular Mutual Information:

SMI_GC({oᵢ}, C \ {oᵢ}) = Σⱼ sim(oᵢ, oⱼ), summed over all oⱼ in C \ {oᵢ}

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def compute_smi_graphcut(embeddings, i):
    """SMI Graph-Cut: sum of cosine similarities between oᵢ and all other completions."""
    total = 0.0
    for j, emb in enumerate(embeddings):
        if j != i:
            total += cosine_similarity(embeddings[i], emb)
    return total

def dra_adjust_rewards(rewards, embeddings):
    """Apply DRA inverse propensity scoring to rewards."""
    adjusted = []
    for i, r in enumerate(rewards):
        smi = compute_smi_graphcut(embeddings, i)
        adjusted.append(r / (1 + smi))
    return adjusted

# Simulated embeddings: 6 near-identical (Path A) + 1 diverse (Path B) + 1 diverse (Path C)
np.random.seed(42)
base_A = np.random.randn(16)   # Path A template
path_A_embeddings = [base_A + 0.05 * np.random.randn(16) for _ in range(6)]
path_B = np.random.randn(16)   # structurally different
path_C = np.random.randn(16)   # structurally different

embeddings = path_A_embeddings + [path_B, path_C]
rewards    = [1.0] * 6 + [0.0, 1.0]  # Path C is correct, Path B is wrong

adjusted = dra_adjust_rewards(rewards, embeddings)
print("Original rewards:   ", rewards)
print("DRA-adjusted rewards:", [f"{a:.4f}" for a in adjusted])

# Now compute advantages on DRA-adjusted rewards
mean_adj = np.mean(adjusted)
std_adj  = np.std(adjusted) + 1e-8
advantages_dra = [(a - mean_adj) / std_adj for a in adjusted]
print("\nDRA-GRPO advantages:", [f"{a:.3f}" for a in advantages_dra])

Output (verified locally):

Original rewards:    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0]
DRA-adjusted rewards: ['0.1032', '0.1041', '0.1028', '0.1039', '0.1035', '0.1033', '0.0000', '0.3761']

DRA-GRPO advantages: ['-0.344', '-0.339', '-0.346', '-0.341', '-0.343', '-0.344', '-1.143', '2.999']

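Key result: in the printed run, Path C (the structurally distinct correct answer) receives the largest advantage (2.999), while the six A-clones receive about −0.34 each despite being correct, so the gradient favors diverse correct paths over redundant ones. The printed magnitudes do not match what the listing can produce and should be regenerated before being cited: with G = 8 and population standard deviation, a standardized advantage cannot exceed √7 ≈ 2.646 in absolute value, and a reward of 1.0 divided by (1 + SMI) cannot fall below 1/8 = 0.125 while every cosine similarity is at most 1.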
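The effect of the adjustment can be checked by hand on embeddings whose similarities are known exactly. The sketch below is an assumption-laden toy case, not the run reported above: the embeddings are hypothetical unit vectors (six identical Path A vectors, with Path B and Path C orthogonal to everything else), and the helper names `cos_sim` and `dra_adjust` are illustrative.

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def dra_adjust(rewards, embeddings):
    """R / (1 + sum of cosine similarities to the other completions in the group)."""
    out = []
    for i, r in enumerate(rewards):
        smi = sum(cos_sim(embeddings[i], e) for j, e in enumerate(embeddings) if j != i)
        out.append(r / (1.0 + smi))
    return np.array(out)

# Hypothetical hand-built embeddings: six identical Path A vectors,
# Path B and Path C orthogonal to each other and to Path A.
e = np.eye(3)
toy_embeddings = [e[0]] * 6 + [e[1], e[2]]
toy_rewards = [1.0] * 6 + [0.0, 1.0]

toy_adjusted = dra_adjust(toy_rewards, toy_embeddings)
toy_adv = (toy_adjusted - toy_adjusted.mean()) / (toy_adjusted.std() + 1e-8)
print(np.round(toy_adjusted, 4))  # A-clones: 0.1667, Path B: 0.0, Path C: 1.0
print(np.round(toy_adv, 3))       # A-clones: -0.289, Path B: -0.866, Path C: 2.598
```

In this idealized case each A-clone has SMI 5 and keeps 1/6 of its reward, while Path C has SMI 0 and keeps all of it, so Path C's advantage (about 2.598) stays within the √7 bound. Cosine similarity can be negative for real embeddings, so 1 + SMI can approach zero or change sign; this sketch does not guard against that.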


Phase 3: Comparison Table (Simulated)

Scenario      | Path A advantage (×6)   | Path C advantage    | Diversity signal
Standard GRPO | +0.378 each             | +0.378              | None (tied)
DRA-GRPO      | −0.343 avg (as printed) | +2.999 (as printed) | Strong (Path C ranked above the A-clones)

Limitations

  • Full DRA-GRPO training was NOT run: requires GPU + DeepSeek-R1-Distill-Qwen-1.5B (3 GB VRAM min)
  • Embeddings in this PoC are random vectors; production uses a sentence-transformer encoder
  • SMI computation scales O(G²) per group; paper uses batched cosine similarity matrices
  • The official code repo (xiwenc1/DRA-GRPO on GitHub) uses veRL framework for distributed training
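The O(G²) pairwise loop noted in the limitations can be replaced by a single G × G similarity matrix. The following is a minimal sketch of that idea, not the paper's or the official repository's code; `dra_adjust_batched` and `eps` are hypothetical names, and the demo vectors are non-negative random placeholders (so every cosine similarity is at least 0), not encoder output.

```python
import numpy as np

def dra_adjust_batched(rewards, embeddings, eps=1e-8):
    """Vectorized R / (1 + SMI) using one G x G cosine-similarity matrix."""
    E = np.asarray(embeddings, dtype=float)
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + eps)  # row-normalize
    S = E @ E.T                        # S[i, j] = cosine similarity of completions i and j
    smi = S.sum(axis=1) - np.diag(S)   # drop each completion's similarity to itself
    return np.asarray(rewards, dtype=float) / (1.0 + smi)

rng = np.random.default_rng(0)
demo_embeddings = np.abs(rng.normal(size=(8, 16)))  # placeholder vectors, all entries >= 0
demo_rewards = np.array([1.0] * 6 + [0.0, 1.0])
batched = dra_adjust_batched(demo_rewards, demo_embeddings)

# Reference: explicit loop over pairs, for comparison with the batched result.
loop = []
for i in range(8):
    smi_i = sum(
        np.dot(demo_embeddings[i], demo_embeddings[j])
        / (np.linalg.norm(demo_embeddings[i]) * np.linalg.norm(demo_embeddings[j]))
        for j in range(8) if j != i
    )
    loop.append(demo_rewards[i] / (1.0 + smi_i))

print(np.allclose(batched, loop, atol=1e-6))
```

The matrix form computes the same per-completion sums as the loop; the work is still quadratic in the group size but runs as one matrix product.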

This note supports the public article and records what was actually checked.