GPT-5.5 Agentic Coding: Terminal-Bench 82.7% Guide
OpenAI released GPT-5.5 on April 23, 2026 — the first fully retrained base model since GPT-4.5. The headline number is 82.7% on Terminal-Bench 2.0, a 13-point lead over Claude Opus 4.7's 69.4% on the same eval. That's significant, but the story is more nuanced than a single score.
This guide covers what GPT-5.5 actually brings for engineering teams, how its agentic coding benchmarks compare to the competition, how to use it via Codex and the direct API, and — critically — when Claude Opus 4.7 still wins.
If you are looking for the multimodal API setup and omnimodal features, see our earlier GPT-5.5 Spud multimodal developer guide. This guide focuses entirely on agentic coding workflows.
What Changed in GPT-5.5
GPT-5.5 is a full retraining, not a fine-tune on top of GPT-5.4. OpenAI describes it as the first base model rebuilt from scratch since GPT-4.5. The architectural differences that matter for agentic coding:
Natively omnimodal training. Text, images, audio, and video were all included during base training, not added as adapters later. For coding agents this means vision input — screenshots of a running app, a UI bug, a CI output image — can be passed inline without separate preprocessing.
Improved instruction following on long-horizon tasks. OpenAI specifically tuned GPT-5.5 to understand task intent over many steps, not just the immediate prompt. In practice this shows up as fewer mid-task direction changes and better persistence through tool call failures.
~40% token efficiency gain in Codex. GPT-5.5 produces the same work in roughly 40% fewer tokens than GPT-5.4 when used through Codex. Combined with the higher per-token price, the net effect, by OpenAI's own figure, is a cost increase of approximately 20% in real workloads.
1M-token context window via the API. Codex is limited to 400K tokens, but the raw API exposes the full 1M context. For large codebase analysis this matters: 1M tokens is roughly 750,000 words or a very large monorepo.
Benchmark Breakdown
The benchmark picture for GPT-5.5 is split, and understanding the split is more useful than the headline.
| Benchmark | GPT-5.5 | Claude Opus 4.7 | What It Tests |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% | CLI workflows: planning, iteration, tool coordination |
| SWE-Bench Pro | 58.6% | 64.3% | Real GitHub issue resolution in large codebases |
| GDPval | 84.9% | Not published | Knowledge work across 44 occupations |
| OSWorld-Verified | 78.7% | Not published | Autonomous computer operation in real OS environments |
Terminal-Bench 2.0 tests multi-step command-line workflows: running builds, debugging CI failures, chaining shell scripts, and managing output across multiple tool invocations. GPT-5.5's 82.7% is a meaningful gap over Opus 4.7's 69.4%.
SWE-Bench Pro tests whether a model can resolve a real, full-context GitHub issue end-to-end — reading the codebase, understanding the bug, writing the fix, and running the tests. Claude Opus 4.7 leads here at 64.3% vs GPT-5.5's 58.6%. The 5.7-point gap matters in production: if you are running a coding agent that touches large, unfamiliar codebases, Opus 4.7 closes more issues per run.
GDPval at 84.9% is GPT-5.5's strongest eval. It tests "knowledge work" — documents, spreadsheets, research synthesis, and structured output across professional domains. For teams that want one model to handle both coding and supporting knowledge tasks (project specs, test plans, runbooks), the GDPval score matters.
OSWorld-Verified at 78.7% measures autonomous computer use — can the model operate a real OS, click through apps, and complete multi-step tasks without a human? This is the eval most relevant to computer-use agents and Operator-style automation.
Pricing
GPT-5.5 doubled the per-token price of the GPT-5 line.
| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| GPT-5.5 | $5.00 | $30.00 |
| GPT-5.5 Pro | $30.00 | $180.00 |
| Claude Opus 4.7 | $5.00 | $25.00 |
Input pricing is identical to Claude Opus 4.7. Output is $5 more expensive per million tokens. OpenAI argues the net cost increase is approximately 20% in real workloads due to the ~40% token efficiency gain in Codex. The math holds if you use GPT-5.5 via Codex for complex multi-step tasks where it produces shorter outputs. For prompt-response workflows via the raw API, the token savings are smaller and the price increase is felt more directly.
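A quick sanity check on that claim, assuming GPT-5.4's output price was $15 per million tokens (half of GPT-5.5's $30, per the doubling noted above):

# Worked cost comparison using the figures above.
# Assumption: GPT-5.4 output was priced at $15/M, i.e. half of GPT-5.5's $30/M.
old_tokens = 1_000_000              # output tokens a task consumed on GPT-5.4
new_tokens = int(old_tokens * 0.6)  # ~40% fewer tokens on GPT-5.5 via Codex

old_cost = old_tokens / 1e6 * 15.00  # $15.00
new_cost = new_tokens / 1e6 * 30.00  # $18.00
print(f"net change: {new_cost / old_cost - 1:+.0%}")  # +20%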
GPT-5.5 Pro is designed for the highest-complexity research and deep knowledge tasks. At $30/$180 per million, it is priced identically to GPT-5.4 Pro, which it supersedes.
Using GPT-5.5 via the API
Model ID
import openai

client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-5.5",  # or "gpt-5.5-pro" for the Pro variant
    messages=[
        {
            "role": "user",
            "content": "Review the attached diff and write a one-paragraph summary of the risk."
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
The API surface is unchanged from previous GPT-5.x releases. Existing integrations on gpt-5.4 can switch to gpt-5.5 by updating the model string. Token counts and response format are compatible.
Passing Vision Input Inline
GPT-5.5's omnimodal base means you can mix vision input with code analysis in a single call:
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This is a screenshot of a failing test run. Identify the root cause."
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "data:image/png;base64,<base64_encoded_screenshot>"}
                }
            ]
        }
    ]
)
For agentic coding agents, this allows attaching CI screenshots, UI snapshots, or error dumps directly — no OCR preprocessing step required.
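A minimal helper for building that image payload from a local screenshot, assuming a PNG on disk; encode_screenshot is an illustrative name, not part of the OpenAI SDK:

import base64

def encode_screenshot(path: str) -> str:
    # Read a local PNG and return a data URL usable in an image_url content part.
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

image_part = {
    "type": "image_url",
    "image_url": {"url": encode_screenshot("ci_failure.png")},
}

Pass image_part inside the content list exactly as in the example above.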
Codex Workflow with GPT-5.5
Codex is OpenAI's cloud coding agent platform, and GPT-5.5 is the recommended default model for Codex tasks as of April 2026. The Codex context window is 400K tokens — large enough for most codebases but smaller than the full 1M API window.
When Codex + GPT-5.5 Fits
- Well-scoped implementation tasks: "Add an endpoint that returns paginated results for this query" — GPT-5.5 can plan and execute with minimal direction-change.
- Refactors with clear target state: "Migrate these three services from class-based to functional components."
- Test generation and coverage expansion: Multi-file test generation where the model needs to understand the codebase shape.
- Knowledge-work artifacts alongside code: Writing the migration plan doc and the migration script in the same task.
When to Use the Raw API Instead
Codex's 400K context ceiling becomes a constraint for very large monorepos. If your codebase context exceeds 200K tokens (to leave room for tool call overhead), use the raw API directly:
# For large codebase analysis via raw API (up to 1M context).
# Assumes codebase_dump already holds the concatenated repo contents as a string.
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {"role": "system", "content": "You are a code review assistant..."},
        {
            "role": "user",
            "content": f"<full_codebase_context>{codebase_dump}</full_codebase_context>\n\nIdentify all places where this refactor needs to happen."
        }
    ]
)
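Before dispatching, it helps to estimate where the dump lands relative to each window. A rough sketch using the common ~4 characters per token heuristic (tiktoken may not ship an encoding for gpt-5.5, so this stays deliberately approximate; the thresholds leave headroom for tool calls and the response):

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose and code.
    return len(text) // 4

CODEX_BUDGET = 200_000  # headroom under Codex's 400K window
API_BUDGET = 800_000    # headroom under the 1M raw API window

tokens = estimate_tokens(codebase_dump)  # same string passed in the call above
if tokens <= CODEX_BUDGET:
    route = "codex"
elif tokens <= API_BUDGET:
    route = "raw-api"
else:
    route = "chunked"  # split the dump across multiple calls
print(f"~{tokens:,} tokens -> {route}")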
AWS Bedrock Integration
As part of the AWS and OpenAI partnership announced in May 2026, GPT-5.5 is available on Amazon Bedrock in limited preview. This is significant for teams already on AWS infrastructure.
The key appeal: GPT-5.5 on Bedrock inherits the full AWS enterprise control surface — IAM access management, AWS PrivateLink, CloudTrail logging, encryption at rest and in transit, and existing compliance frameworks. No new security model to configure.
Codex on Bedrock brings the OpenAI coding agent into your AWS environment. Authentication uses your existing AWS credentials. Inference runs through Bedrock. Codex usage applies toward your existing AWS cloud commitments.
Access is through the Model Access section in the Bedrock console. The Codex CLI, desktop app, and VS Code extension are all available once access is granted.
For teams with strict data residency or compliance requirements, this path matters. OpenAI's direct API routes traffic through OpenAI's infrastructure; Bedrock keeps it inside your AWS region.
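For direct inference on Bedrock, here is a hedged sketch using boto3's Converse API; the modelId is a guess at the eventual identifier, since AWS has not published one for the limited preview. Check the Model Access console for the real string:

import boto3

# Standard Bedrock runtime client; inference stays inside your AWS region.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="openai.gpt-5.5-v1:0",  # hypothetical ID for illustration only
    messages=[
        {"role": "user", "content": [{"text": "Summarize the risk in this diff."}]}
    ],
    inferenceConfig={"maxTokens": 512},
)
print(response["output"]["message"]["content"][0]["text"])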
GPT-5.5 vs Claude Opus 4.7: A Decision Guide
This is the practical question for most engineering teams in 2026. Both models are frontier-class, both are available via API, and both are priced similarly on input.
Choose GPT-5.5 when:

- Your tasks are well-scoped CLI workflows, automation scripts, or multi-step terminal tasks (Terminal-Bench lead: +13pp)
- You need knowledge-work artifacts alongside code — specs, runbooks, documentation alongside implementation
- You are running high-volume pipelines where the ~40% output token efficiency gain compounds across thousands of calls
- Your team is already on AWS and wants enterprise controls without a separate vendor relationship
- Tasks involve vision input — screenshots, UI bugs, CI images — inline with code context
Choose Claude Opus 4.7 when:

- You are resolving complex, full-context GitHub issues in large codebases (SWE-Bench Pro lead: +5.7pp)
- Tool call failures are likely and graceful error recovery matters — Opus 4.7 adapts better when something unexpected breaks
- Your workflow involves deep multi-file codebase understanding rather than scoped task execution
- Cost is a priority and you can use Claude's lower output pricing to offset higher task counts
One specific pattern from third-party analyses: on identical prompts, GPT-5.5 produces fewer tokens in its first-pass output and makes fewer initial tool-call mistakes. But when something fails unexpectedly — a test suite breaks, an API returns an error — Opus 4.7 is more likely to diagnose and adapt, while GPT-5.5 can retry the same failing approach.
For teams running automated coding agents at scale, this means GPT-5.5 is a better choice when tasks are well-defined and failure paths are limited. Opus 4.7 is a better choice when the agent must navigate ambiguity and unexpected states.
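If you run both models behind a single dispatcher, that rule of thumb is easy to encode. A minimal sketch with illustrative task categories and a placeholder Anthropic model string, not a production router:

from enum import Enum

class TaskKind(Enum):
    CLI_WORKFLOW = "cli"         # scripted, well-defined terminal work
    KNOWLEDGE_ARTIFACT = "docs"  # specs, runbooks, test plans
    ISSUE_RESOLUTION = "issue"   # open-ended, multi-file codebase work

def pick_model(kind: TaskKind) -> str:
    # Well-scoped automation and knowledge work: GPT-5.5's benchmark strengths.
    if kind in (TaskKind.CLI_WORKFLOW, TaskKind.KNOWLEDGE_ARTIFACT):
        return "gpt-5.5"
    # Ambiguous, failure-prone investigation: Opus 4.7's error recovery wins.
    return "claude-opus-4.7"  # placeholder string; use your provider's actual ID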
Our earlier comparison of AI coding agents covers the broader landscape including open-weight alternatives. For the Claude Opus 4.7 side of this comparison, see the Claude Opus 4.7 developer guide.
Common Mistakes
Switching to GPT-5.5 expecting a universal upgrade. The SWE-Bench Pro regression is real. Teams that use coding agents primarily to close GitHub issues on complex codebases may see lower task completion rates after switching from GPT-5.4 or Claude Opus 4.7.
Assuming the 40% token efficiency claim applies everywhere. The efficiency gain is most pronounced in Codex tasks with its specific tool-use patterns. Raw API chat completions see less dramatic savings, especially on short or medium-length tasks.
Using Codex's 400K context for very large repo analysis. If your codebase dump exceeds 150-200K tokens, route directly to the API. Codex's 400K limit leaves less buffer for tool call overhead and response generation than it appears.
Choosing GPT-5.5 Pro when GPT-5.5 suffices. Pro's $30/$180 pricing is for the highest-complexity research tasks where maximum reasoning depth matters. For standard agentic coding workflows, GPT-5.5 at $5/$30 is the right starting point.
FAQ
Q: Does GPT-5.5 support function calling and tool use?
Yes. The function calling and tool use API surface is identical to previous GPT-5.x models. Existing agent implementations built on gpt-5.4 with tool use can switch to gpt-5.5 with a model string change.
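As a concrete sketch, a tool-using call on gpt-5.5 looks the same as on earlier GPT-5.x models; run_tests here is a hypothetical tool defined for illustration:

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_tests",  # hypothetical tool, not a built-in
            "description": "Run the project's test suite and return a summary.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Directory to test"}
                },
                "required": ["path"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Run the tests under ./tests and summarize failures."}],
    tools=tools,
)
# If the model elects to call the tool, the request shows up here:
print(response.choices[0].message.tool_calls)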
Q: Is GPT-5.5 available via the OpenAI Assistants API?
GPT-5.5 is confirmed available via the Chat Completions API. OpenAI has generally made each new flagship model available across its full API surface (Assistants, Chat Completions, and Batch) within days of release. Check the OpenAI platform model page or your Assistants dashboard for current availability in your tier.
Q: How does the 1M context window compare to competitors?
As of May 2026, GPT-5.5's 1M context window is half of Gemini 3 Ultra's 2M, but in practical use either comfortably exceeds what most codebases need. Qwen 3.6 Plus also ships a 1M-token context window. For teams choosing between frontier models on context size, 1M tokens is now the baseline expectation at the top tier.
Q: When will Bedrock access move out of limited preview?
No general availability timeline has been announced for GPT-5.5 and Codex on Bedrock. Limited preview access can be requested through the Model Access section of the AWS Bedrock console.
Q: Is GPT-5.5 the right model for building long-running autonomous agents?
For well-scoped autonomous tasks with defined completion criteria, yes. The Terminal-Bench and GDPval results support GPT-5.5 for planning-and-execution agent patterns. For agents that must navigate highly ambiguous, open-ended research or complex codebase investigation, the SWE-Bench Pro gap suggests Opus 4.7 still leads.
Key Takeaways
GPT-5.5 is a genuine step forward for agentic coding on well-defined tasks. The 82.7% Terminal-Bench 2.0 score is the clearest proof point. But the SWE-Bench Pro result means it is not a universal replacement for teams whose primary use case is codebase-level issue resolution.
For high-volume Codex workflows, the token efficiency gain is real and compounds at scale. For teams on AWS with compliance requirements, the Bedrock integration removes a significant friction point.
The practical rule for May 2026: start with GPT-5.5 for new agentic automation projects, especially CLI-heavy workflows. Stick with or evaluate Claude Opus 4.7 if your primary metric is closed GitHub issues per dollar on complex, multi-file codebases.
GPT-5.5 leads Terminal-Bench 2.0 by 13 points and excels at well-scoped agentic automation, knowledge work, and computer use. It trails Claude Opus 4.7 on SWE-Bench Pro for deep codebase issue resolution. Use GPT-5.5 for high-volume automation and CLI-heavy pipelines; keep Opus 4.7 for complex, ambiguous codebase work where error recovery matters.