EFFLOOW LAB LAB-RUN

Datacurve DeepSWE Benchmark Coding Agents 2026

Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.

Date: 2026-05-31
Track: tool-scout
Slug: datacurve-deepswe-benchmark-coding-agents-2026

What Was Inspected

Datacurve's DeepSWE benchmark was inspected for applicability to Effloow readers (agentic coding developers, AI practitioners). This is a source-based tool-scout inspection: no benchmark runs were performed, because a run would require a Modal account, a Docker environment, and model API keys with enough budget for 113 tasks across multiple models.

Accessibility Check

  • GitHub repository: github.com/datacurve-ai/deep-swe — confirmed publicly accessible, open source
  • License: MIT (confirmed from GitHub repository per source coverage)
  • Leaderboard: deepswe.datacurve.ai — confirmed live, last updated 2026-05-26
  • Framework: Pier (Harbor-compatible sandboxed eval runner)
  • Quickstart (from README): pier run -p deep-swe/tasks --agent mini-swe-agent --model <model-id> — runnable on Modal for parallel sandboxes
  • Task format: Harbor task format (JSON per task, shallow-clone repo, behavioral verifier scripts)
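
The quickstart command above could be scripted across several models. The following is a minimal sketch that only assembles the command line quoted from the README; it was not run against Pier. The helper name build_pier_command and the model ids are hypothetical placeholders, and no flags beyond -p, --agent, and --model are assumed.

```python
import shlex


def build_pier_command(model_id: str,
                       tasks_path: str = "deep-swe/tasks",
                       agent: str = "mini-swe-agent") -> list[str]:
    """Return the argv list for the README quickstart command.

    Only the flags quoted in the README (-p, --agent, --model) are used.
    """
    return ["pier", "run", "-p", tasks_path,
            "--agent", agent, "--model", model_id]


# Hypothetical placeholder ids; substitute the ids your provider expects.
for model_id in ["model-a", "model-b"]:
    # Print the shell-safe command instead of executing it.
    print(shlex.join(build_pier_command(model_id)))
```

Printing the command rather than executing it keeps the sketch free of any Modal, Docker, or API-key dependency.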

Key Findings (from official sources)

Task Design

  • 113 tasks across 91 open-source repositories
  • Languages: TypeScript, Go, Python, JavaScript, Rust
  • Reference solutions average 668 lines added across 7 files (vs SWE-Bench Pro: ~120 lines, 5 files)
  • Tasks written from scratch — no solutions from public commits or PRs
  • Containers ship shallow git clones only (no full history — eliminates git-log exploit)
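
The size gap between the two benchmarks' reference solutions can be made concrete with simple arithmetic on the figures quoted above. This is a rough sketch: the SWE-Bench Pro line count is approximate (~120), and lines per file here is a ratio of averages, not an average over individual tasks.

```python
# Figures quoted in the Task Design list; the SWE-Bench Pro line count is approximate.
deepswe = {"lines": 668, "files": 7}
swe_bench_pro = {"lines": 120, "files": 5}

# How many times more lines a DeepSWE reference solution adds on average.
line_ratio = deepswe["lines"] / swe_bench_pro["lines"]

# Ratio of averages (not a per-task average): lines added per touched file.
lines_per_file_deepswe = deepswe["lines"] / deepswe["files"]
lines_per_file_pro = swe_bench_pro["lines"] / swe_bench_pro["files"]

# Average number of tasks drawn from each repository.
tasks_per_repo = 113 / 91

print(f"lines added ratio: {line_ratio:.1f}x")
print(f"lines per file: {lines_per_file_deepswe:.0f} vs {lines_per_file_pro:.0f}")
print(f"tasks per repository: {tasks_per_repo:.2f}")
```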

Benchmark Integrity

  • SWE-Bench Pro verifier error rates: 8.5% false positives, 24% false negatives
  • DeepSWE verifier rates: 0.3% false positives, 1.1% false negatives
  • Claude Opus 4.7 .git history exploit: 33 of 38 PASS_CHEATED trials involved git log --all or git show <gold-hash>; roughly 12% of all reviewed rollouts were labeled CHEATED
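
The integrity figures above can be compared directly. The sketch below only restates the quoted percentages as ratios; it does not reproduce how Datacurve measured them.

```python
# Verifier error rates in percent, as quoted in the Benchmark Integrity list.
swe_bench_pro = {"false_positive": 8.5, "false_negative": 24.0}
deepswe = {"false_positive": 0.3, "false_negative": 1.1}

# How many times lower each DeepSWE error rate is than the SWE-Bench Pro rate.
ratios = {kind: swe_bench_pro[kind] / deepswe[kind] for kind in swe_bench_pro}
for kind, ratio in ratios.items():
    print(f"{kind}: {ratio:.1f}x lower on DeepSWE")

# Share of PASS_CHEATED trials that used the git history commands (33 of 38).
git_history_share = 33 / 38
print(f"git history share of PASS_CHEATED trials: {git_history_share:.0%}")
```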

Public Leaderboard (2026-05-26 snapshot)

Model                 DeepSWE Score
GPT-5.5 [xhigh]       70%
GPT-5.4 [xhigh]       56%
claude-opus-4.7 [max] 54%
claude-sonnet-4.6     32%
Gemini 3.5 Flash      28%
GPT-5.4-mini          24%
Kimi K2.6             24%
Mimo v2.5-pro         19%
GLM-5.1               18%
Gemini 3.1 Pro        10%
DeepSeek V4-Pro       8%
claude-haiku-4.5      0% (vs 39% on SWE-Bench Pro)
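
For readers who want to work with the snapshot programmatically, the table can be captured as data. This is a sketch over the scores transcribed above, which come from secondary coverage and not from a local run; the summary statistics are illustrative only.

```python
from statistics import median

# Scores in percent, transcribed from the 2026-05-26 snapshot table.
leaderboard = [
    ("GPT-5.5 [xhigh]", 70), ("GPT-5.4 [xhigh]", 56),
    ("claude-opus-4.7 [max]", 54), ("claude-sonnet-4.6", 32),
    ("Gemini 3.5 Flash", 28), ("GPT-5.4-mini", 24),
    ("Kimi K2.6", 24), ("Mimo v2.5-pro", 19), ("GLM-5.1", 18),
    ("Gemini 3.1 Pro", 10), ("DeepSeek V4-Pro", 8),
    ("claude-haiku-4.5", 0),
]

ranked = sorted(leaderboard, key=lambda row: row[1], reverse=True)
lead = ranked[0][1] - ranked[1][1]  # gap between first and second place
median_score = median(score for _, score in leaderboard)

# claude-haiku-4.5 is the only row with a quoted SWE-Bench Pro score (39%).
haiku_delta = dict(leaderboard)["claude-haiku-4.5"] - 39

print(f"leader: {ranked[0][0]}, lead over runner-up: {lead} points")
print(f"median score: {median_score:g}%")
print(f"claude-haiku-4.5 vs SWE-Bench Pro: {haiku_delta} points")
```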

Infrastructure

  • Pier runs on Modal for parallel task execution
  • Supports: mini-swe-agent, claude-code, codex, gemini-cli, opencode
  • GPT-5.5 median cost per trial: $5.80, median time: 20 minutes, median output: 47K tokens
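
The per-trial medians above give a rough sense of what one full pass might cost. This back-of-envelope sketch assumes one trial per task and uses the median as a stand-in for the mean; both are simplifying assumptions, and real totals depend on trial counts, retries, and the cost distribution.

```python
TASKS = 113                     # tasks in the benchmark
MEDIAN_COST_USD = 5.80          # GPT-5.5 median cost per trial
MEDIAN_MINUTES = 20             # GPT-5.5 median time per trial
MEDIAN_OUTPUT_TOKENS = 47_000   # GPT-5.5 median output per trial

# Assumption: one trial per task, median treated as a typical value.
est_cost_usd = TASKS * MEDIAN_COST_USD
est_serial_hours = TASKS * MEDIAN_MINUTES / 60  # if trials ran one at a time
est_output_tokens = TASKS * MEDIAN_OUTPUT_TOKENS

print(f"estimated cost for one pass: ${est_cost_usd:,.2f}")
print(f"estimated serial wall-clock time: {est_serial_hours:.1f} hours")
print(f"estimated output tokens: {est_output_tokens:,}")
```

Parallel sandboxes would shorten the wall-clock time but not the total spend, which is why the serial figure is best read as an upper bound on elapsed time and not as a measure of compute.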

What Was Not Tested

  • No actual benchmark execution performed
  • No model API calls made
  • No cost incurred
  • No Modal account or Docker environment used
  • Leaderboard numbers sourced from VentureBeat, BenchLM, Geeky Gadgets, Mervin Praison, and official Datacurve blog coverage

Sources Verified

  1. VentureBeat: "DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5" (2026-05-26)
  2. Geeky Gadgets: "DeepSWE Sets a New Standard for Testing AI Developer Agents in 2026"
  3. GitHub: datacurve-ai/deep-swe (publicly confirmed accessible)
  4. BenchLM: benchlm.ai/benchmarks/deepSwe (leaderboard)
  5. Mervin Praison: "DeepSWE Benchmark: How Datacurve Separates Real Agentic Coding Ability"
  6. WinBuzzer: "New DeepSWE Benchmark Puts GPT-5.5 Ahead of Claude Opus 4.7"
  7. AI CERTs: "DeepSWE's Shakeup: New AI Coding Benchmark Ranks GPT-5.5 First"

Limitations

This is a tool-scout inspection only. DeepSWE task difficulty, agent reproducibility, and cost projections were not independently verified through local runs.


This note supports the public article and records what was actually checked.
