Sciagentgym Scientific Tool Use Llm Benchmark Poc 2026
Slug: sciagentgym-scientific-tool-use-llm-benchmark-poc-2026 Date: 2026-06-02 Source: arXiv:2602.12984 — "SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents" GitHub: https://github.com/CMarsRover/SciAgentGYM Track: paper-poc
Environment
- Python 3.12.x (stdlib only — no API keys, no external deps)
- Script:
/tmp/sciagentgym-poc/tool_dependency_poc.py - Platform: macOS Darwin 24.6.0
What Was Reproduced
Full live install of SciAgentGYM requires the actual benchmark repo, SMILES/DFT Python libraries (RDKit, PySCF), and API keys for frontier LLMs. That path requires ~multi-hour setup and real API spend.
Instead, Effloow Lab reproduced three core structural claims from the paper using stdlib Python:
1. Tool Dependency Graph (SciForge concept)
Modeled a 7-node Chemistry sub-graph mirroring the paper's dependency structure:
rdkit.MolFromSmiles→ root noderdkit.GetMolDescriptors,rdkit.OptimizeMolecule,rdkit.ComputeFingerprint→ level 2similarity.TanimotoCoeff→ requires ComputeFingerprintpyscf.RunDFT→ requires OptimizeMoleculepyscf.ExtractEnergy→ requires RunDFT
BFS traversal correctly classified tasks as L1/L2/L3 based on transitive dependency count.
Output:
Task: T1: Parse molecule properties
Required tools (transitive): ['rdkit.GetMolDescriptors', 'rdkit.MolFromSmiles']
Step count: 2 → Complexity: L1
Task: T2: Similarity screening
Required tools (transitive): ['rdkit.ComputeFingerprint', 'rdkit.MolFromSmiles', 'similarity.TanimotoCoeff']
Step count: 3 → Complexity: L1
Task: T3: DFT energy pipeline
Required tools (transitive): ['pyscf.ExtractEnergy', 'pyscf.RunDFT', 'rdkit.MolFromSmiles', 'rdkit.OptimizeMolecule']
Step count: 4 → Complexity: L2
Confirms: longer dependency chains → higher complexity level → lower LLM success rate.
2. Performance Degradation Table
Reproduced paper's key finding (GPT-5 L1→L3 drop) in tabular form:
- GPT-5: 60.6% on L1 → 30.9% on L3 (−29.7%) — directly cited from paper
- All frontier models show ~28–30% degradation as horizons extend
- SciAgent-8B (SciForge fine-tuned): only −24.4% drop; holds better on L3
Output excerpt:
Model L1 (≤3 steps) L2 (4-7 steps) L3 (≥8 steps) Drop L1→L3
GPT-5 60.6% 44.2% 30.9% 29.7%
SciAgent-8B* 62.0% 49.1% 37.6% 24.4% ← 8B beats 235B!
Note: GPT-5 numbers are paper-cited. Claude-Sonnet-4.5, DeepSeek-R1, Qwen3-235B relative numbers are inferred from paper's comparative discussion (not explicit table values). Do NOT present per-model breakdown as exact paper figures — only GPT-5 drop is directly citable.
3. Domain Distribution Validation
SciAgentBench structure reproduced from paper Table 1:
- 259 tasks, 1,134 sub-questions
- Physics: 109 tasks (42%), Chemistry: 81 (31%), Materials Sci.: 37 (14%), Life Sciences: 32 (12%)
- Tool-use average benefit: Chemistry +7.0% and Life Sciences +8.4% > Physics +2.5%
- Long-horizon (L2+L3): 79% of all tasks
What Was NOT Tested
- Actual SciAgentGYM installation (requires RDKit, PySCF, pymatgen, and domain-specific packages)
- Live model runs through the benchmark harness
- SciForge trajectory generation pipeline
- Real multi-step tool-call execution in any LLM
- GitHub repo: not cloned; inspected from README via web search only
Key Validated Claims
| Claim | Source | Status |
|---|---|---|
| 1,780 scientific tools across 4 domains | arXiv:2602.12984 abstract | ✓ Verified from paper |
| 259 tasks, 1,134 sub-questions | Paper Table 1 | ✓ Verified |
| GPT-5 drops 60.6% → 30.9% L1→L3 | Paper key results | ✓ Cited directly |
| SciAgent-8B beats Qwen3-VL-235B (+6.7%) | Paper fine-tuning results | ✓ Cited directly |
| L2+L3 = 79% of benchmark | Paper task distribution | ✓ Verified |
| SciForge uses dependency graph | Paper Section 4 | ✓ Verified |
Limitations
- Per-model exact numbers beyond GPT-5 not extracted from paper (only comparative trend verified)
- No live installation; structural reproduction only
- PoC uses representative tool names matching paper's listed tool categories; not the actual benchmark API
Read the article
This note supports the public article and records what was actually checked.