LMR-Bench: LLM Reproduction of NLP Research Code (Paper PoC, 2026)

Date: 2026-05-22
Content track: paper-poc
Slug: lmr-bench-llm-reproduce-nlp-research-code-paper-poc-2026

Source

Paper: "LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research"
arXiv: https://arxiv.org/abs/2506.17335
ACL Anthology: https://aclanthology.org/2025.emnlp-main.314/
GitHub: https://github.com/du-nlp-lab/LMR-Bench
Venue: EMNLP 2025 Main
Authors: Shuo Yan et al. (University of Texas at Dallas, 14 authors)

Paper summary

LMR-BENCH introduces a benchmark of 28 code-reproduction tasks derived from 23 NLP papers published in top-tier venues (ACL, EMNLP, NAACL, AAAI) over the past five years.

Task format: Given (paper PDF, code repo with masked methods, implementation instruction) → agent must generate patch code that fills in the missing methods correctly.

Evaluation: Two-part. Unit tests check functional correctness, and an LLM-as-judge checks implementation fidelity.

Nine categories: tokenization, attention mechanism, positional encoding, loss function, data preprocessing, model architecture, training procedure, decoding strategy, evaluation metric.

PoC reproduction approach

This is a paper-poc: we reproduce the benchmark's core methodology conceptually, not by re-running the full benchmark (which requires API keys and compute).

Step 1: Confirm the repo exists and its structure is public

The repository at github.com/du-nlp-lab/LMR-Bench is confirmed active (a Python ≥ 3.12 requirement is stated).

Repo structure, per the paper and search results:

LMR-Bench/
  benchmark/
    {project_name}/
      repository_folder/     # masked code repo
      unit_tests/
      info.json              # metadata: paper, masked methods, instructions
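
A minimal sketch of how this layout could be walked to enumerate tasks. It assumes only the folder and file names shown in the tree above; the helper names (`load_tasks`, `make_demo_benchmark`) and the demo metadata are hypothetical, and the demo builds a throwaway synthetic tree rather than reading the real repository.

```python
import json
import tempfile
from pathlib import Path


def load_tasks(benchmark_root):
    """Return one dict per task folder that contains an info.json file."""
    tasks = []
    for project_dir in sorted(Path(benchmark_root).iterdir()):
        info_path = project_dir / "info.json"
        if not info_path.is_file():
            continue  # skip anything that is not a task folder
        tasks.append({
            "project": project_dir.name,
            "info": json.loads(info_path.read_text(encoding="utf-8")),
            "repo": project_dir / "repository_folder",
            "tests": project_dir / "unit_tests",
        })
    return tasks


def make_demo_benchmark(root):
    """Build a tiny synthetic tree for the demo (not real LMR-Bench data)."""
    project = Path(root) / "benchmark" / "demo_project"
    (project / "repository_folder").mkdir(parents=True)
    (project / "unit_tests").mkdir()
    info = {"paper_title": "Demo paper", "masked_function": "attention_score"}
    (project / "info.json").write_text(json.dumps(info), encoding="utf-8")
    return Path(root) / "benchmark"


with tempfile.TemporaryDirectory() as tmp:
    demo_tasks = load_tasks(make_demo_benchmark(tmp))
    demo_summary = [(t["project"], t["info"]["masked_function"]) for t in demo_tasks]

print(demo_summary)
```

Pointing `load_tasks` at a local clone's `benchmark/` directory should work the same way, provided the clone matches the structure described here.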

Step 2: Reproduce the masking pattern

The benchmark creates tasks by:

1. Selecting a critical function in an NLP paper's codebase
2. Replacing the function body with a "# TODO: implement" stub
3. Writing an info.json with: paper_title, arxiv_id, masked_function, implementation_instruction, dependencies
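
The three steps above can be sketched with Python's standard `ast` module. This is an illustration of the masking idea, not the benchmark authors' tooling: `mask_function`, the sample source, and every info.json value are hypothetical, and only the field names come from the description above. Because `ast.unparse` discards comments, the stub here is a bare `raise NotImplementedError` rather than a TODO comment.

```python
import ast
import json

# Synthetic source file standing in for a paper's codebase.
SOURCE = '''
def attention_score(query, key, scale=True):
    """Compute scaled dot-product attention scores."""
    score = sum(q * k for q, k in zip(query, key))
    return score / len(key) ** 0.5 if scale else score
'''


def mask_function(source, name):
    """Replace the body of function `name` with its docstring plus a stub."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            stub = ast.parse("raise NotImplementedError").body
            # Keep the docstring statement, if any, so the agent still sees it.
            keep = [node.body[0]] if ast.get_docstring(node) is not None else []
            node.body = keep + stub
            return ast.unparse(tree)
    raise ValueError(f"function {name!r} not found")


masked = mask_function(SOURCE, "attention_score")

# Field names follow the list above; all values are placeholders.
info = {
    "paper_title": "Demo paper (placeholder)",
    "arxiv_id": "0000.00000",
    "masked_function": "attention_score",
    "implementation_instruction": "Implement scaled dot-product attention scores.",
    "dependencies": [],
}
info_json = json.dumps(info, indent=2)

print(masked)
print(info_json)
```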

Example (synthetic, based on paper description):

# Original: attention_score() in a Transformer implementation
# Masked version (what agents see):
def attention_score(query, key, scale=True):
    """Compute scaled dot-product attention scores."""
    # TODO: implement
    raise NotImplementedError

# Agent's task: reproduce the body from the paper's equations and the instruction text
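
For illustration, here is one body an agent might plausibly submit for the synthetic stub above. It assumes the standard definition of scaled dot-product scores (query-key dot products divided by the square root of the key dimension) and uses plain Python lists so it runs without dependencies. The input shapes are an assumption of this sketch, not something the paper specifies.

```python
import math


def attention_score(query, key, scale=True):
    """Compute dot-product scores between each query row and each key row.

    query: list of n vectors of length d; key: list of m vectors of length d.
    Returns an n x m list of lists. With scale=True, each score is divided
    by sqrt(d).
    """
    d = len(key[0])
    factor = 1.0 / math.sqrt(d) if scale else 1.0
    return [
        [factor * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in key]
        for q in query
    ]


scores = attention_score(
    [[1.0, 0.0, 1.0, 0.0]],
    [[1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 2.0]],
)
print(scores)  # [[1.0, 0.0]]
```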

Step 3: Understand the dual-axis evaluation

Axis	Method	What it checks
Functional	Unit tests (pytest)	Output matches reference numerically
Fidelity	LLM-as-judge (GPT-4o)	Implementation matches paper's described algorithm
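
Both axes can be sketched in a few lines. The functional check compares a candidate's output with a reference within a numeric tolerance, as a unit test would. The fidelity check is shown only as prompt assembly, with no model call. The function names, tolerances, and prompt wording are assumptions of this sketch and are not taken from the benchmark.

```python
import math


def outputs_match(candidate, reference, rel_tol=1e-6, abs_tol=1e-8):
    """Recursively compare nested lists of numbers within a tolerance."""
    if isinstance(reference, (list, tuple)):
        return (
            isinstance(candidate, (list, tuple))
            and len(candidate) == len(reference)
            and all(
                outputs_match(c, r, rel_tol, abs_tol)
                for c, r in zip(candidate, reference)
            )
        )
    return math.isclose(candidate, reference, rel_tol=rel_tol, abs_tol=abs_tol)


def build_judge_prompt(instruction, reference_code, candidate_code):
    """Assemble a fidelity-judging prompt; sending it to a model is left out."""
    return "\n\n".join([
        "Decide whether CANDIDATE implements the algorithm described below.",
        "INSTRUCTION:\n" + instruction,
        "REFERENCE:\n" + reference_code,
        "CANDIDATE:\n" + candidate_code,
        "Answer with one word: faithful or unfaithful.",
    ])


# Functional axis: small numerical noise passes, a real difference fails.
functional_ok = outputs_match([[1.0, 0.0]], [[1.0 + 1e-9, 0.0]])
functional_bad = outputs_match([[1.0]], [[1.1]])

# Fidelity axis: the prompt that a judge model would receive.
prompt = build_judge_prompt(
    "Implement scaled dot-product attention scores.",
    "return q @ k.T / sqrt(d)",
    "return q @ k.T",
)
print(functional_ok, functional_bad)
```

The two axes can disagree. A candidate that omits the scaling passes any test whose inputs do not exercise it, yet a judge comparing it against the instruction should flag it, which is the motivation for checking both.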

Key findings from the paper

Best-performing model on LMR-BENCH: o3-mini (OpenAI, high-compute mode)
Pass@1 range across 28 tasks: 20–60% depending on category
Hardest category: "model architecture" (requires multi-file cross-referencing)
Easiest: "evaluation metric" (self-contained, well-defined formulas)
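
As a reading aid for the per-category figures, the sketch below shows how a Pass@1 rate per category can be computed when each task gets exactly one attempt. The outcomes are synthetic placeholders chosen for the demo. They are not results from the paper, and the function name is hypothetical.

```python
from collections import defaultdict


def pass_at_1_by_category(results):
    """results: iterable of (category, passed) pairs, one attempt per task."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, attempted]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: p / n for cat, (p, n) in totals.items()}


# Synthetic outcomes for illustration only (not from the paper).
synthetic_results = [
    ("loss function", True),
    ("loss function", False),
    ("evaluation metric", True),
    ("evaluation metric", True),
    ("model architecture", False),
]
rates = pass_at_1_by_category(synthetic_results)
print(rates)
```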

Evidence quality

This is a source-verified PoC. The paper is EMNLP 2025 peer-reviewed. The GitHub repo is public. No local execution was performed (requires API keys for evaluation). The synthetic masking example above illustrates the methodology based on the paper description.

Verdict

Paper: real, peer-reviewed. Methodology: reproducible with API keys. The article can explain the benchmark approach, show the masking pattern, and discuss implications for AI-assisted research reproduction. No fabricated metrics: all numbers are cited from the paper.