ARTICLES · 2026-04-19 · BY EFFLOOW CONTENT FACTORY

GPT-5.4 API Guide: Reasoning Effort, Computer Use, Image Gen

Complete GPT-5.4 API developer guide: reasoning.effort levels, computer use tool, GPT Image 1.5, Realtime API GA, and mini/nano pricing.

GPT-5.4 landed in March 2026 with four headline additions that change how developers interact with OpenAI's API: a tunable reasoning.effort parameter, a native computer use tool, GPT Image 1.5, and the Realtime API going generally available. Each of these shifts what's architecturally possible — and changes the cost-quality math for production systems.

This guide covers what actually changed, how to use it in code, and how to pick the right model tier for your workload.

Why This Release Matters

Previous OpenAI releases layered capabilities on top of existing API shapes. GPT-5.4 is different: it introduces a second tuning axis for every inference call. Until now, developers controlled output quality through prompt engineering and temperature. With reasoning.effort, you directly control how much internal chain-of-thought the model applies before responding.

That shift has a compounding effect: it changes how you architect multi-step pipelines. A routing layer can send cheap, high-volume classification requests to gpt-5.4-nano with reasoning.effort: "none", while expensive code review tasks hit gpt-5.4 with reasoning.effort: "high" — all within the same OpenAI account.

The computer use and native image generation tools close gaps that previously required separate services (Playwright, DALL·E API calls, STT/TTS chains). That consolidation alone removes significant operational complexity.

The Model Lineup

GPT-5.4 ships as four distinct models with different capability and cost profiles.

Model         Context    Input ($/MTok)  Output ($/MTok)  Best For
gpt-5.4       1,050,000  $2.50           $15.00           General-purpose production
gpt-5.4-pro   1,050,000  $30.00          $180.00          High-stakes, agentic tasks
gpt-5.4-mini  400,000    $0.75           $4.50            High-volume, cost-sensitive
gpt-5.4-nano  400,000    $0.20           $1.25            Classification, data extraction

Key detail: the standard gpt-5.4 model doubles its input cost beyond 272K tokens. For codebase-scale context, that's a meaningful cliff — plan your chunking strategy accordingly.
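
That cliff math can be sketched directly from the table. Below is a minimal estimator assuming the $2.50/MTok base input rate and a straight 2× multiplier on every token past 272K (the exact billing formula isn't specified here, so treat this as an approximation):

```python
def input_cost_usd(input_tokens: int,
                   base_rate_per_mtok: float = 2.50,
                   cliff: int = 272_000,
                   multiplier: float = 2.0) -> float:
    """Estimate gpt-5.4 input cost, doubling the per-token rate past the cliff."""
    cheap = min(input_tokens, cliff)
    expensive = max(input_tokens - cliff, 0)
    per_token = base_rate_per_mtok / 1_000_000
    return cheap * per_token + expensive * per_token * multiplier

print(f"{input_cost_usd(200_000):.2f}")  # under the cliff: 0.50
print(f"{input_cost_usd(800_000):.2f}")  # 0.68 + 2.64 on the excess = 3.32
```

An 800K-token request costs more than twice what flat pricing would suggest, which is why chunking strategy matters at codebase scale.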

gpt-5.4-mini is the standout value. At a SWE-bench Pro score of 54.38% vs the standard model's 57.7%, you lose about 3 points of coding performance for a roughly 3× cost reduction on both input and output. For most production workloads, that tradeoff is straightforward.

Reasoning Effort: Your New Cost-Quality Dial

The reasoning.effort parameter was added to the Responses API. It controls how much internal chain-of-thought processing the model allocates before producing its response.

Five Effort Levels

Level   Behavior                      When to Use
none    No reasoning tokens, fastest  Routing, classification, simple extraction
low     Minimal reasoning             Summarization, translation, short Q&A
medium  Balanced (default)            General coding, analysis, longer writing
high    Deep reasoning                Complex debugging, architecture review
xhigh   Maximum depth, 3-5× cost      Math proofs, multi-step planning, security audits

Important distinction: reasoning.effort is not the same as temperature. Temperature controls randomness in output token selection. Effort controls how many internal reasoning tokens the model generates before responding — it's the difference between an LLM that glances at a problem and one that works through it step by step.

Python: Controlling Effort via the Responses API

from openai import OpenAI

client = OpenAI()

# Placeholder code under review — in practice, load your real snippet here
code_snippet = 'def handler(event):\n    return eval(event["expr"])'

# Low effort — fast, cheap, good for classification
fast_response = client.responses.create(
    model="gpt-5.4-mini",
    reasoning={"effort": "low"},
    input=[
        {
            "role": "user",
            "content": "Classify the following support ticket: 'My subscription won't renew.'"
        }
    ]
)

# High effort — deep analysis
deep_response = client.responses.create(
    model="gpt-5.4",
    reasoning={"effort": "high"},
    input=[
        {
            "role": "user",
            "content": "Review this Python function for security vulnerabilities and edge cases:\n\n" + code_snippet
        }
    ]
)

print(fast_response.output_text)
print(deep_response.output_text)

JavaScript: Effort-Aware Multi-Step Pipeline

import OpenAI from "openai";

const openai = new OpenAI();

async function triage(task) {
  // Step 1: cheap triage with nano
  const classification = await openai.responses.create({
    model: "gpt-5.4-nano",
    reasoning: { effort: "none" },
    input: [{ role: "user", content: `Classify task complexity: simple|medium|complex\nTask: ${task}` }],
  });

  const complexity = classification.output_text.trim();

  // Step 2: route to appropriate model + effort
  const config = {
    simple:  { model: "gpt-5.4-nano",  effort: "low" },
    medium:  { model: "gpt-5.4-mini",  effort: "medium" },
    complex: { model: "gpt-5.4",       effort: "high" },
  }[complexity] ?? { model: "gpt-5.4-mini", effort: "medium" };

  const result = await openai.responses.create({
    model: config.model,
    reasoning: { effort: config.effort },
    input: [{ role: "user", content: task }],
  });

  return { complexity, result: result.output_text };
}

This two-step routing pattern reduces cost on simple tasks while preserving quality for complex ones. In production systems, a routing layer like this typically reduces API spend by 40-60% compared to sending everything to the top-tier model.
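
As a sanity check on that range, the savings can be estimated from the pricing table under an assumed traffic mix — the 40/30/30 split below is illustrative, not measured, and your own distribution will move the number:

```python
# $/MTok (input, output) from the pricing table above
PRICES = {
    "gpt-5.4":      (2.50, 15.00),
    "gpt-5.4-mini": (0.75, 4.50),
    "gpt-5.4-nano": (0.20, 1.25),
}

def cost_per_request(model: str, in_tok: int = 1_000, out_tok: int = 500) -> float:
    """Dollar cost of one request at an assumed average token footprint."""
    inp, out = PRICES[model]
    return (in_tok * inp + out_tok * out) / 1_000_000

# Assumed mix: 40% simple -> nano, 30% medium -> mini, 30% complex -> full model
routed = (0.4 * cost_per_request("gpt-5.4-nano")
          + 0.3 * cost_per_request("gpt-5.4-mini")
          + 0.3 * cost_per_request("gpt-5.4"))
all_top = cost_per_request("gpt-5.4")

print(f"savings: {1 - routed / all_top:.0%}")  # -> savings: 58%
```

Even with 30% of traffic still hitting the top-tier model, the blended rate lands inside the 40-60% savings band.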

Computer Use: Desktop Automation in the API

GPT-5.4 introduced computer_use_preview as a built-in tool type in the Chat Completions endpoint. The model can take screenshots, click UI elements, type text, and navigate applications — without any external browser automation library.

It scores 75% on OSWorld, the standard benchmark for computer use capabilities. That's a meaningful jump from earlier implementations and sufficient for automating predictable desktop workflows.

How It Works

The computer use loop follows a screenshot-action-screenshot cycle:

  1. You provide an initial instruction and an optional screenshot of the current screen state
  2. The model outputs one or more computer actions (click, type, scroll, screenshot)
  3. Your code executes those actions and captures a new screenshot
  4. You send the screenshot back to continue the loop

from openai import OpenAI
import base64

client = OpenAI()

def encode_screenshot(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# Initial request with screenshot
response = client.chat.completions.create(
    model="gpt-5.4",
    tools=[{"type": "computer_use_preview"}],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Open the settings panel and enable dark mode."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{encode_screenshot('screen.png')}"
                    }
                }
            ]
        }
    ]
)

# Parse and execute the returned actions (tool_calls may be None
# when the model answers with plain text instead)
for action in response.choices[0].message.tool_calls or []:
    print(action.function.name, action.function.arguments)
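
Steps 2 and 3 of the loop — executing the model's actions locally — can be sketched as a small dispatcher. The action names and argument shapes below are assumptions for illustration; real handlers would drive an automation layer such as pyautogui or xdotool:

```python
import json

def dispatch_action(name: str, arguments: str, handlers: dict) -> str:
    """Route one model-emitted computer action to a local handler."""
    args = json.loads(arguments)  # tool call arguments arrive as a JSON string
    handler = handlers.get(name)
    if handler is None:
        return f"unsupported action: {name}"
    return handler(**args)

# Stub handlers — real ones would move the mouse and keyboard, then screenshot
handlers = {
    "click": lambda x, y: f"clicked ({x}, {y})",
    "type":  lambda text: f"typed {text!r}",
}

print(dispatch_action("click", '{"x": 120, "y": 340}', handlers))
# -> clicked (120, 340)
```

After each dispatched action you'd capture a fresh screenshot and append it to the conversation to continue the cycle.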

Practical Limitations

Computer use works well for repetitive, predictable workflows with stable UIs. It struggles with:

  • Dynamic layouts that shift between sessions
  • CAPTCHAs and security challenges
  • Very small click targets (< 20px)
  • Applications with poor accessibility metadata

Latency is also a real consideration. Screenshot round-trips add 2-4 seconds per action step. Long automation sequences can take minutes. For latency-sensitive workflows, traditional browser automation (Playwright, Puppeteer) with deterministic selectors will be faster and more reliable. Computer use shines when you don't control the application and can't write custom selectors.

GPT Image 1.5: Native Image Generation

GPT Image 1 launched as OpenAI's first natively multimodal image model — it accepts both text and image inputs and produces image outputs within the same API surface as your language requests. GPT Image 1.5, released December 2025, brought a 4× speed improvement and significantly better instruction following.

The GPT Image Family

  • gpt-image-1.5: Best quality, 10-30 second generation, best prompt adherence
  • gpt-image-1: Standard quality/cost balance
  • gpt-image-1-mini: Cost-optimized, as low as $0.005 per 1024×1024 image

API Usage

from openai import OpenAI
import base64

client = OpenAI()

# Generate an image
response = client.images.generate(
    model="gpt-image-1.5",
    prompt="A clean, dark-themed developer dashboard showing real-time API metrics, minimal UI",
    size="1792x1024",
    quality="high",
    n=1,
)

# Save the result
image_data = base64.b64decode(response.data[0].b64_json)
with open("dashboard.png", "wb") as f:
    f.write(image_data)

Inpainting: Precise Region Editing

GPT Image 1.5's inpainting is more surgical than earlier approaches. When you ask it to change a specific element, it modifies only that element. Facial features, lighting, background composition, and surrounding details remain untouched.

import base64
from pathlib import Path

# Reuses the `client` from the previous snippet

# Edit a specific region of an existing image
response = client.images.edit(
    model="gpt-image-1.5",
    image=open("product.png", "rb"),
    mask=open("jacket_mask.png", "rb"),  # white = edit zone
    prompt="Change the jacket to a navy blue color with subtle texture",
    size="1024x1024",
)

edited = base64.b64decode(response.data[0].b64_json)
Path("product_edited.png").write_bytes(edited)

For product photography, marketing assets, or UI mockup generation, this level of precision makes GPT Image 1.5 practical for production image pipelines.
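
One practical detail the snippet glosses over is producing the mask itself. A mask is just an image in which white pixels mark the editable region; a rectangular one can be generated with Pillow (assumed installed — the coordinates below are placeholders, not values tied to any real product shot):

```python
from PIL import Image, ImageDraw

def make_rect_mask(size: tuple, box: tuple, path: str) -> None:
    """Write a black mask with a white rectangle marking the edit zone."""
    mask = Image.new("L", size, 0)                 # all black = protected
    ImageDraw.Draw(mask).rectangle(box, fill=255)  # white = editable
    mask.save(path)

make_rect_mask((1024, 1024), (200, 300, 800, 900), "jacket_mask.png")
```

For irregular regions you'd paint the mask in an editor or derive it from a segmentation model instead.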

Realtime API: Voice That Actually Works in Production

The Realtime API went generally available after several months in beta. It enables bidirectional audio streaming with sub-200ms perceived latency — eliminating the STT → LLM → TTS pipeline that characterized earlier voice applications.

Architecture Comparison

Traditional voice pipeline:

Audio input → Whisper (STT) → GPT → TTS → Audio output
Latency: 1.5-3 seconds per turn
3 API calls, 3 potential failure points

Realtime API:

Audio input → gpt-realtime model → Audio output
Latency: <200ms perceived
1 WebSocket connection, 1 model

The Realtime API communicates over WebSocket or WebRTC. It natively handles:

  • Turn detection (phrase endpointing)
  • Interruption handling — users can cut off the model mid-sentence
  • Function calling while the model continues speaking
  • Text, image, and audio inputs simultaneously

Basic WebSocket Setup

import asyncio
import json
import os

import websockets

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

async def voice_session(encoded_audio_chunk: str):
    uri = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    # websockets >= 14 expects additional_headers (older releases: extra_headers)
    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 800
                }
            }
        }))

        # Send audio (base64-encoded PCM16 at 24kHz)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": encoded_audio_chunk
        }))

        # Commit the buffer and request a response
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        # Stream back audio events
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                yield event["delta"]  # base64-encoded audio chunk

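The `encoded_audio_chunk` sent above is, per the inline comment, base64-encoded PCM16 at 24kHz mono. Producing that shape from raw integer samples is a few lines of stdlib code:

```python
import base64
import struct

def encode_pcm16_chunk(samples: list[int]) -> str:
    """Pack 16-bit signed samples little-endian and base64-encode for the API."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return base64.b64encode(raw).decode("ascii")

# 240 zero samples = 10ms of silence at 24kHz (2 bytes per sample)
chunk = encode_pcm16_chunk([0] * 240)
```

In a real application the samples would come from a microphone capture library; this only handles the packing and encoding step.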
For most voice features, gpt-realtime-mini (substituted into the model query parameter above) is the better default — it's explicitly optimized for the Realtime API's latency requirements, with support for streaming audio, interruption handling, voice activity detection, and background function calling.

Common Mistakes When Adopting GPT-5.4

Using xhigh effort everywhere. The cost multiplier is 3-5×. Save it for tasks where reasoning depth actually matters: proofs, security audits, complex code reviews. Routine coding tasks perform nearly identically at medium.

Sending everything to gpt-5.4 when gpt-5.4-mini would do. Mini's SWE-bench score of 54.38% vs 57.7% means that for the vast majority of real-world coding tasks (not competition benchmarks), you won't notice the difference. Benchmark the two against your actual workload before paying 3× more.

Building computer use for dynamic UIs. Computer use is reliable when UI layouts are predictable. If the application you're automating ships frequent UI updates, the model's screen coordinates will drift. Use traditional selectors for applications you control; use computer use for applications you don't.

Ignoring the 272K token input cost cliff. The standard gpt-5.4 model doubles its input price per token beyond 272K. If you're sending entire codebases or large document sets, run the math before assuming the context window is flat-priced.

Using the Chat Completions API instead of Responses API for reasoning control. The reasoning.effort parameter is only available in the Responses API (client.responses.create()), not in Chat Completions (client.chat.completions.create()). If you try to add it to a Chat Completions call, it will be silently ignored.

FAQ

Q: What's the difference between GPT-5.4 and GPT-5.4-pro?

GPT-5.4-pro is priced at $30/$180 per million tokens and reserved for high-stakes, long-running agentic tasks where maximum capability is required. Standard gpt-5.4 ($2.50/$15) handles the large majority of production use cases. Unless you're running extended autonomous agents on complex business workflows, start with the standard model.

Q: Does reasoning.effort work in Chat Completions?

No — reasoning.effort is a Responses API parameter only. It's passed in the reasoning object alongside your input and model. If you're currently using client.chat.completions.create(), you'll need to migrate to client.responses.create() to access it.
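
A minimal sketch of that migration: a helper that maps Chat Completions-style arguments onto the Responses call shape used throughout this guide. The direct messages→input mapping is an assumption based on the examples above, not a documented equivalence:

```python
def to_responses_kwargs(chat_kwargs: dict, effort: str = "medium") -> dict:
    """Translate Chat Completions kwargs into Responses API kwargs,
    adding the reasoning.effort control Chat Completions lacks."""
    return {
        "model": chat_kwargs["model"],
        "input": chat_kwargs["messages"],  # same role/content dict shape
        "reasoning": {"effort": effort},
    }

old = {"model": "gpt-5.4", "messages": [{"role": "user", "content": "hi"}]}
new = to_responses_kwargs(old, effort="high")
# then call client.responses.create(**new)
# instead of client.chat.completions.create(**old)
```

A thin wrapper like this lets you migrate call sites incrementally rather than rewriting every request at once.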

Q: Is GPT Image 1.5 the same as DALL·E 4?

No. DALL·E models are diffusion-based. GPT Image models are natively multimodal language models that produce image outputs. They accept text and image inputs, understand compositional instructions more precisely, and integrate with the same API surface as text requests. They are separate model families.

Q: When should I use the Realtime API vs a traditional STT/LLM/TTS pipeline?

Use the Realtime API when you need sub-200ms perceived latency for conversational voice interfaces, interruption support, or real-time audio streaming. Use a traditional pipeline when you need more control over individual components, have specific STT/TTS model requirements, or are building non-conversational audio processing (batch transcription, audio analysis).

Q: Can computer use access the internet?

The computer use tool controls whatever environment you run your automation in. If your automation environment has browser access, the model can navigate websites. OpenAI's API itself doesn't proxy internet access — it returns actions for your code to execute in your environment.

Key Takeaways

GPT-5.4 is the first OpenAI release that directly exposes reasoning depth as a first-class API parameter. That unlocks cost-optimized architectures that weren't practical before: cheap routing layers, tiered workload assignment by complexity, and per-request quality control without prompt engineering gymnastics.

The practical checklist for adopting GPT-5.4:

  1. Migrate performance-sensitive flows to the Responses API to access reasoning.effort
  2. Default to gpt-5.4-mini and benchmark against standard before committing to the higher tier
  3. Use reasoning.effort: "low" for classification and routing, "high" only for tasks requiring deep analysis
  4. Evaluate computer use for legacy application automation where traditional selectors aren't available
  5. Replace STT/LLM/TTS chains with the Realtime API for conversational voice features
  6. Track the 272K token input cliff if you send large context payloads to the standard model

Bottom Line

GPT-5.4 is the most operationally consequential OpenAI release for developers since function calling. The reasoning.effort parameter alone changes how you architect inference pipelines — it's no longer just about which model you call, but how hard you make it think. Start with gpt-5.4-mini at medium effort and tune from there; you'll likely find it handles 80% of your workload at a fraction of the cost.


