EFFLOOW LAB LAB-RUN

SciAgentGym Scientific Tool-Use LLM Benchmark PoC 2026

Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.

Slug: sciagentgym-scientific-tool-use-llm-benchmark-poc-2026
Date: 2026-06-02
Source: arXiv:2602.12984, "SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents"
GitHub: https://github.com/CMarsRover/SciAgentGYM
Track: paper-poc

Environment

  • Python 3.12.x (stdlib only — no API keys, no external deps)
  • Script: /tmp/sciagentgym-poc/tool_dependency_poc.py
  • Platform: macOS Darwin 24.6.0

What Was Reproduced

A full live install of SciAgentGYM requires the benchmark repo itself, the SMILES/DFT Python libraries (RDKit, PySCF), and API keys for frontier LLMs. That path takes several hours of setup and incurs real API spend.

Instead, Effloow Lab reproduced three core structural claims from the paper using stdlib Python:

1. Tool Dependency Graph (SciForge concept)

Modeled a 7-node Chemistry sub-graph mirroring the paper's dependency structure:

  • rdkit.MolFromSmiles → root node
  • rdkit.GetMolDescriptors, rdkit.OptimizeMolecule, rdkit.ComputeFingerprint → level 2
  • similarity.TanimotoCoeff → requires ComputeFingerprint
  • pyscf.RunDFT → requires OptimizeMolecule
  • pyscf.ExtractEnergy → requires RunDFT

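A BFS traversal collected each task's transitive dependencies, and the resulting step count determined its level (L1: ≤3 steps, L2: 4-7 steps, L3: ≥8 steps). The three sample tasks land in L1 and L2; with step count defined as the number of distinct tools, a 7-node graph cannot reach the 8 steps needed for L3.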

Output:

Task: T1: Parse molecule properties
  Required tools (transitive): ['rdkit.GetMolDescriptors', 'rdkit.MolFromSmiles']
  Step count: 2 → Complexity: L1

Task: T2: Similarity screening
  Required tools (transitive): ['rdkit.ComputeFingerprint', 'rdkit.MolFromSmiles', 'similarity.TanimotoCoeff']
  Step count: 3 → Complexity: L1

Task: T3: DFT energy pipeline
  Required tools (transitive): ['pyscf.ExtractEnergy', 'pyscf.RunDFT', 'rdkit.MolFromSmiles', 'rdkit.OptimizeMolecule']
  Step count: 4 → Complexity: L2

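Consistent with the paper's framing: longer dependency chains map to higher complexity levels, which the paper associates with lower LLM success rates. The PoC itself demonstrates only the first link, since no LLM was run.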
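The classification described above can be sketched in a few lines of stdlib Python. This is a minimal reconstruction, not the actual PoC script: the `DEPS` and `TASKS` dictionaries and the function names are assumptions built from the tool list and output shown in this note, and the level thresholds are taken from the column headers of the performance table (L1 ≤3 steps, L2 4-7, L3 ≥8).

```python
from collections import deque

# Assumed reconstruction of the 7-node sub-graph: tool -> direct prerequisites.
# These are representative names from this note, not real RDKit/PySCF calls.
DEPS = {
    "rdkit.MolFromSmiles": [],
    "rdkit.GetMolDescriptors": ["rdkit.MolFromSmiles"],
    "rdkit.OptimizeMolecule": ["rdkit.MolFromSmiles"],
    "rdkit.ComputeFingerprint": ["rdkit.MolFromSmiles"],
    "similarity.TanimotoCoeff": ["rdkit.ComputeFingerprint"],
    "pyscf.RunDFT": ["rdkit.OptimizeMolecule"],
    "pyscf.ExtractEnergy": ["pyscf.RunDFT"],
}


def transitive_tools(targets):
    """Return the sorted set of tools needed to run `targets` (BFS over prerequisites)."""
    seen = set(targets)
    queue = deque(targets)
    while queue:
        tool = queue.popleft()
        for dep in DEPS[tool]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)


def complexity(step_count):
    """Map a step count to a level using the thresholds from the table headers."""
    if step_count <= 3:
        return "L1"
    if step_count <= 7:
        return "L2"
    return "L3"


# Hypothetical task definitions: each task names only its final tool(s).
TASKS = {
    "T1: Parse molecule properties": ["rdkit.GetMolDescriptors"],
    "T2: Similarity screening": ["similarity.TanimotoCoeff"],
    "T3: DFT energy pipeline": ["pyscf.ExtractEnergy"],
}

for name, targets in TASKS.items():
    tools = transitive_tools(targets)
    print(f"Task: {name}")
    print(f"  Required tools (transitive): {tools}")
    print(f"  Step count: {len(tools)} → Complexity: {complexity(len(tools))}")
```

Defining a task by its final tool and letting the traversal discover the prerequisites keeps the step count tied to the graph rather than to a hand-written list.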

2. Performance Degradation Table

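Reproduced the paper's key finding (the GPT-5 L1→L3 drop) in tabular form:

  • GPT-5: 60.6% on L1 → 30.9% on L3 (−29.7 percentage points), directly cited from the paper
  • All frontier models in the table show a drop of roughly 28–30 percentage points as horizons extend (values other than GPT-5 are inferred; see the note below)
  • SciAgent-8B (SciForge fine-tuned): a smaller drop of 24.4 percentage points; it holds up better on L3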

Output excerpt:

Model                  L1 (≤3 steps)  L2 (4-7 steps) L3 (≥8 steps) Drop L1→L3
GPT-5                       60.6%          44.2%         30.9%       29.7%
SciAgent-8B*                62.0%          49.1%         37.6%       24.4% ← 8B beats 235B!

Note: GPT-5 numbers are paper-cited. Claude-Sonnet-4.5, DeepSeek-R1, Qwen3-235B relative numbers are inferred from paper's comparative discussion (not explicit table values). Do NOT present per-model breakdown as exact paper figures — only GPT-5 drop is directly citable.
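The "Drop L1→L3" column is a difference in percentage points, not a relative change. A short sketch of that arithmetic, using only the two rows shown in the output excerpt; the dictionary layout and function names are assumptions for illustration, and the relative-drop figure is derived here rather than taken from the paper.

```python
# Rows copied from the output excerpt: (L1, L2, L3) success rates in percent.
RESULTS = {
    "GPT-5": (60.6, 44.2, 30.9),
    "SciAgent-8B": (62.0, 49.1, 37.6),
}


def drop_points(l1, l3):
    """Absolute L1→L3 drop in percentage points (the table's last column)."""
    return round(l1 - l3, 1)


def relative_drop(l1, l3):
    """L1→L3 drop as a percentage of the L1 score (derived, not from the paper)."""
    return round(100 * (l1 - l3) / l1, 1)


for model, (l1, _l2, l3) in RESULTS.items():
    print(f"{model}: -{drop_points(l1, l3)} points, "
          f"-{relative_drop(l1, l3)}% relative to L1")
```

Under these numbers GPT-5 loses 29.7 points, which is about 49% of its L1 score, while SciAgent-8B loses 24.4 points, about 39% of its L1 score.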

3. Domain Distribution Validation

SciAgentBench structure reproduced from paper Table 1:

  • 259 tasks, 1,134 sub-questions
  • Physics: 109 tasks (42%), Chemistry: 81 (31%), Materials Sci.: 37 (14%), Life Sciences: 32 (12%)
  • Average benefit from tool use: Chemistry (+7.0%) and Life Sciences (+8.4%) gain more than Physics (+2.5%)
  • Long-horizon (L2+L3): 79% of all tasks
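The distribution figures above can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming the per-domain counts listed in this note; the variable and function names are illustrative and not taken from the PoC script.

```python
# Counts as listed in this note (attributed to paper Table 1).
TOTAL_TASKS = 259
TOTAL_SUB_QUESTIONS = 1134
DOMAIN_TASKS = {
    "Physics": 109,
    "Chemistry": 81,
    "Materials Sci.": 37,
    "Life Sciences": 32,
}


def domain_shares(counts):
    """Return each domain's share of tasks, rounded to a whole percent."""
    total = sum(counts.values())
    return {domain: round(100 * n / total) for domain, n in counts.items()}


# The per-domain counts should add up to the stated task total.
counts_match = sum(DOMAIN_TASKS.values()) == TOTAL_TASKS
shares = domain_shares(DOMAIN_TASKS)
avg_sub_questions = round(TOTAL_SUB_QUESTIONS / TOTAL_TASKS, 2)

print(f"Counts sum to {TOTAL_TASKS}: {counts_match}")
print(f"Shares (%): {shares}")
print(f"Average sub-questions per task: {avg_sub_questions}")
```

The rounded shares come out as 42, 31, 14, and 12, which sum to 99 rather than 100 because each value is rounded independently.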

What Was NOT Tested

  • Actual SciAgentGYM installation (requires RDKit, PySCF, pymatgen, and domain-specific packages)
  • Live model runs through the benchmark harness
  • SciForge trajectory generation pipeline
  • Real multi-step tool-call execution in any LLM
  • GitHub repo: not cloned; inspected from README via web search only

Key Validated Claims

Claim | Source | Status
1,780 scientific tools across 4 domains | arXiv:2602.12984 abstract | ✓ Verified from paper
259 tasks, 1,134 sub-questions | Paper Table 1 | ✓ Verified
GPT-5 drops 60.6% → 30.9% L1→L3 | Paper key results | ✓ Cited directly
SciAgent-8B beats Qwen3-VL-235B (+6.7%) | Paper fine-tuning results | ✓ Cited directly
L2+L3 = 79% of benchmark | Paper task distribution | ✓ Verified
SciForge uses dependency graph | Paper Section 4 | ✓ Verified

Limitations

  • Per-model exact numbers beyond GPT-5 not extracted from paper (only comparative trend verified)
  • No live installation; structural reproduction only
  • PoC uses representative tool names matching paper's listed tool categories; not the actual benchmark API
