LMR-Bench LLM Reproduce NLP Research Code Paper PoC 2026
Date: 2026-05-22
Content track: paper-poc
Slug: lmr-bench-llm-reproduce-nlp-research-code-paper-poc-2026
Source
- Paper: "LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research"
- arXiv: https://arxiv.org/abs/2506.17335
- ACL Anthology: https://aclanthology.org/2025.emnlp-main.314/
- GitHub: https://github.com/du-nlp-lab/LMR-Bench
- Venue: EMNLP 2025 Main
- Authors: Shuo Yan et al. (University of Texas at Dallas, 14 authors)
Paper summary
LMR-BENCH introduces a benchmark of 28 code-reproduction tasks derived from 23 NLP papers published in top-tier venues (ACL, EMNLP, NAACL, AAAI) over the past five years.
Task format: Given (paper PDF, code repo with masked methods, implementation instruction) → agent must generate patch code that fills in the missing methods correctly.
Evaluation: two axes, unit tests (functional correctness) and LLM-as-judge (implementation fidelity).
Nine categories: tokenization, attention mechanism, positional encoding, loss function, data preprocessing, model architecture, training procedure, decoding strategy, evaluation metric.
PoC reproduction approach
This is a paper-poc: we reproduce the benchmark's core methodology conceptually, not by re-running the full benchmark (which requires API keys and compute).
Step 1: Confirm repo exists and structure is public
Repository at github.com/du-nlp-lab/LMR-Bench is confirmed active (Python ≥ 3.12 requirement stated).
Repo structure, as described in the paper and search results:
LMR-Bench/
  benchmark/
    {project_name}/
      repository_folder/   # masked code repo
      unit_tests/
      info.json            # metadata: paper, masked methods, instructions
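To make the layout above concrete, here is a minimal sketch of how a script might enumerate task folders and load their metadata. It assumes `info.json` sits directly inside each `{project_name}` folder, as in the tree above; the helper names `list_tasks` and `build_demo_benchmark` and the demo project names are hypothetical, and the demo tree is synthetic placeholder content rather than real benchmark tasks.

```python
import json
import tempfile
from pathlib import Path


def list_tasks(benchmark_root):
    """Return (project_name, info_dict) pairs for each task folder that has an info.json."""
    tasks = []
    for project_dir in sorted(Path(benchmark_root).iterdir()):
        info_path = project_dir / "info.json"
        if project_dir.is_dir() and info_path.is_file():
            info = json.loads(info_path.read_text(encoding="utf-8"))
            tasks.append((project_dir.name, info))
    return tasks


def build_demo_benchmark(root):
    """Create a tiny synthetic tree mirroring the layout (placeholder data only)."""
    for name in ("demo_project_a", "demo_project_b"):
        project = Path(root) / "benchmark" / name
        (project / "repository_folder").mkdir(parents=True)
        (project / "unit_tests").mkdir()
        info = {
            "paper_title": f"Placeholder paper for {name}",
            "masked_function": "attention_score",
        }
        (project / "info.json").write_text(json.dumps(info), encoding="utf-8")
    return Path(root) / "benchmark"


with tempfile.TemporaryDirectory() as tmp:
    demo_tasks = list_tasks(build_demo_benchmark(tmp))

task_names = [name for name, _ in demo_tasks]
print(task_names)
```

Pointing `list_tasks` at a local clone's `benchmark/` folder would be the natural next step, though the real repository may nest folders differently from this sketch.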
Step 2: Reproduce the masking pattern
The benchmark creates tasks by:
- Selecting a critical function in an NLP paper's codebase
- Replacing the function body with a `# TODO: implement` stub
- Writing an `info.json` with: paper_title, arxiv_id, masked_function, implementation_instruction, dependencies
Example (synthetic, based on paper description):
# Original: attention_score() in a Transformer implementation
# Masked version (what agents see):
def attention_score(query, key, scale=True):
    """
    Compute scaled dot-product attention scores.
    # TODO: implement
    """
    raise NotImplementedError
# Agent's task: reproduce from paper equations + instruction text
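The masking step itself can be approximated with Python's standard `ast` module. The sketch below is an assumption about how such masking could be automated, not the benchmark authors' tooling: `mask_function` is a hypothetical helper, and `ORIGINAL_SOURCE` is a synthetic stand-in for a paper's codebase. It keeps the docstring and replaces the rest of the body with `raise NotImplementedError`.

```python
import ast

# Synthetic "original" module; not taken from any benchmark task.
ORIGINAL_SOURCE = '''
import math

def attention_score(query, key, scale=True):
    """Compute scaled dot-product attention scores."""
    score = sum(q * k for q, k in zip(query, key))
    return score / math.sqrt(len(key)) if scale else score

def helper(x):
    return x + 1
'''


def mask_function(source, function_name):
    """Return `source` with the body of `function_name` replaced by a stub."""
    tree = ast.parse(source)
    found = False
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            new_body = []
            if ast.get_docstring(node) is not None:
                new_body.append(node.body[0])  # keep the docstring expression
            # Build the stub statement by parsing it, to avoid hand-built nodes.
            new_body.append(ast.parse("raise NotImplementedError").body[0])
            node.body = new_body
            found = True
    if not found:
        raise ValueError(f"function {function_name!r} not found")
    return ast.unparse(tree)  # requires Python 3.9+


masked_source = mask_function(ORIGINAL_SOURCE, "attention_score")
print(masked_source)
```

One limitation of this approach: `ast.unparse` discards comments and original formatting, so a `# TODO: implement` marker would have to be added in a separate text pass.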
Step 3: Understand the dual-axis evaluation
| Axis | Method | What it checks |
|---|---|---|
| Functional | Unit tests (pytest) | Output matches reference numerically |
| Fidelity | LLM-as-judge (GPT-4o) | Implementation matches paper's described algorithm |
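The two axes in the table can be illustrated with a small sketch that continues the synthetic `attention_score` example. Everything here is hypothetical: the reference and candidate functions, the test cases, the tolerance, and the judge prompt wording are illustrative assumptions, not the benchmark's actual tests or prompts, and no model is called.

```python
import math


def reference_attention_score(query, key, scale=True):
    """Reference: dot product of query and key, optionally divided by sqrt(d_k)."""
    score = sum(q * k for q, k in zip(query, key))
    return score / math.sqrt(len(key)) if scale else score


def candidate_attention_score(query, key, scale=True):
    """Stand-in for an agent patch: correct, but written differently."""
    total = 0.0
    for q, k in zip(query, key):
        total += q * k
    if scale:
        total /= len(key) ** 0.5
    return total


def unscaled_candidate(query, key, scale=True):
    """Stand-in for a buggy patch that ignores the scale flag."""
    return sum(q * k for q, k in zip(query, key))


def functional_check(candidate, reference, cases, tol=1e-9):
    """Functional axis: candidate output must match the reference numerically."""
    for query, key, scale in cases:
        got = candidate(query, key, scale)
        expected = reference(query, key, scale)
        if not math.isclose(got, expected, rel_tol=tol, abs_tol=tol):
            return False
    return True


def build_judge_prompt(algorithm_description, candidate_source):
    """Fidelity axis: assemble the text a judge model would receive."""
    return (
        "You are reviewing a code patch against a paper's description.\n"
        f"Described algorithm:\n{algorithm_description}\n\n"
        f"Candidate implementation:\n{candidate_source}\n\n"
        "Does the implementation follow the described algorithm? "
        "Answer yes or no, then explain."
    )


CASES = [
    ([1.0, 2.0], [3.0, 4.0], True),
    ([1.0, 0.0, -1.0], [0.5, 2.0, 0.5], False),
]

good_passes = functional_check(candidate_attention_score, reference_attention_score, CASES)
bad_passes = functional_check(unscaled_candidate, reference_attention_score, CASES)
judge_prompt = build_judge_prompt(
    "score = (q . k) / sqrt(d_k)", "def attention_score(query, key, scale=True): ..."
)
print(good_passes, bad_passes)
```

The second axis exists because numerical agreement on a handful of inputs does not show that a patch follows the paper's algorithm. A patch can pass the tests while taking a shortcut the paper does not describe.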
Key findings from paper
- Best-performing model on LMR-BENCH: o3-mini (OpenAI, high-compute mode)
- Pass@1 range across 28 tasks: 20–60% depending on category
- Hardest category: "model architecture" (requires multi-file cross-referencing)
- Easiest: "evaluation metric" (self-contained, well-defined formulas)
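For readers unfamiliar with how a per-category Pass@1 figure is derived, the sketch below shows the arithmetic under the assumption of one attempt per task, where Pass@1 reduces to the fraction of tasks whose patch passes. The records in `SYNTHETIC_RESULTS` are made-up placeholders for illustration only and do not reproduce any number from the paper. The helper name is hypothetical.

```python
from collections import defaultdict

# Made-up placeholder records; NOT results from the paper.
SYNTHETIC_RESULTS = [
    {"task": "task_01", "category": "loss function", "passed": True},
    {"task": "task_02", "category": "loss function", "passed": False},
    {"task": "task_03", "category": "evaluation metric", "passed": True},
    {"task": "task_04", "category": "evaluation metric", "passed": True},
    {"task": "task_05", "category": "model architecture", "passed": False},
]


def pass_at_1_by_category(results):
    """With one attempt per task, Pass@1 is passed tasks / total tasks per category."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in results:
        totals[record["category"]] += 1
        if record["passed"]:
            passes[record["category"]] += 1
    return {category: passes[category] / totals[category] for category in totals}


rates = pass_at_1_by_category(SYNTHETIC_RESULTS)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%}")
```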
Evidence quality
This is a source-verified PoC. The paper is peer-reviewed (EMNLP 2025) and the GitHub repo is public. No local execution was performed, because evaluation requires API keys. The synthetic masking example above illustrates the methodology based on the paper's description.
Verdict
Paper: real, peer-reviewed. Methodology: reproducible with API keys. Article can explain the benchmark approach, show the masking pattern, and discuss implications for AI-assisted research reproduction. No fabricated metrics: all numbers are cited from the paper.
Read the article
This note supports the public article and records what was actually checked.