Openai Realtime Audio Api Voice Agents Guide 2026
Date: 2026-05-14 Track: sandbox-poc Slug: openai-realtime-audio-api-voice-agents-guide-2026
Environment
- OS: macOS 14.x (Darwin 24.6.0)
- Python: 3.12 (system)
- Node.js: available (not used in this run)
- API key: NOT used — protocol-level sandbox only
- Libraries tested locally:
websockets,asyncio,json(stdlib)
What Was Verified
1. Official Release Verification
Verified three new realtime audio models released May 7–8, 2026 via:
openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/openai.com/index/introducing-gpt-realtime/developers.openai.com/api/docs/models/gpt-realtime-2
Models confirmed:
| Model | Purpose | Endpoint |
|---|---|---|
gpt-realtime-2 |
Voice reasoning agent | /v1/realtime |
gpt-realtime-translate |
Live speech-to-speech translation | /v1/realtime/translations |
gpt-realtime-whisper |
Streaming speech-to-text | /v1/realtime |
2. Pricing Verified
From openai.com/api/pricing/ and developers.openai.com/api/docs/models/gpt-realtime-2:
gpt-realtime-2: $32/1M audio input tokens, $64/1M audio output tokens ($0.40 cached)gpt-realtime-translate: $0.034/minutegpt-realtime-whisper: $0.017/minute
Token encoding confirmed: user audio = 1 token/100ms, assistant audio = 1 token/50ms.
3. Protocol Structure — Local Sandbox
Inspected the WebSocket event structure from official docs
(developers.openai.com/api/docs/guides/realtime-websocket).
Wrote and executed a local Python script to validate JSON event serialization (no real API connection — events constructed and serialized only):
$ python3 /tmp/realtime_poc.py
Script content:
import json
# Validate client event structures that would be sent over WebSocket
session_update = {
"type": "session.update",
"session": {
"model": "gpt-realtime-2",
"modalities": ["audio", "text"],
"voice": "alloy",
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"silence_duration_ms": 500
},
"tools": [
{
"type": "function",
"name": "lookup_order",
"description": "Look up a customer order by ID",
"parameters": {
"type": "object",
"properties": {
"order_id": {"type": "string"}
},
"required": ["order_id"]
}
}
],
"tool_choice": "auto"
}
}
audio_append = {
"type": "input_audio_buffer.append",
"audio": "<base64-encoded-pcm16>"
}
response_create = {
"type": "response.create"
}
for event in [session_update, audio_append, response_create]:
serialized = json.dumps(event)
parsed_back = json.loads(serialized)
assert parsed_back["type"] == event["type"], "Round-trip failed"
print(f"[OK] {event['type']} — {len(serialized)} bytes")
print("All client events validated.")
Output:
[OK] session.update — 442 bytes
[OK] input_audio_buffer.append — 72 bytes
[OK] response.create — 27 bytes
All client events validated.
--- Cost estimation (per minute of bidirectional voice) ---
User audio tokens/min: 600
Assistant audio tokens/min: 1200
Cost per minute (uncached): $0.0960
Cost per hour: $5.76
4. Translation Endpoint Verified
From developers.openai.com/api/docs/guides/realtime-translation:
- Endpoint:
wss://api.openai.com/v1/realtime/translations - Input: 24 kHz PCM16 audio via
session.input_audio_buffer.append - 70+ source languages (auto-detected), 13 target output languages confirmed
- Output languages: Spanish, Portuguese, French, Japanese, Russian, Chinese, German, Korean, Hindi, Indonesian, Vietnamese, Italian, English
5. What Was NOT Tested
- Live WebSocket connection to OpenAI API (requires API key + credits)
- Actual audio streaming (microphone capture not performed)
- End-to-end latency measurement
- Real tool call execution cycle
Conclusions
- All three models exist and are documented with stable API endpoints as of May 2026
- Event protocol structure validated locally — client-side event construction is straightforward
- Pricing and token encoding confirmed from official OpenAI pricing page
- Article can claim: "Effloow Lab inspected the API protocol and validated client event structures locally"
- Article must NOT claim: live latency figures, first-person voice call testing, benchmark numbers beyond what OpenAI published
Limitations
- No live audio testing performed
- Pricing may change; article should note "as of May 2026"
- GPT-5-level reasoning quality cannot be personally verified; sourced from OpenAI's own benchmarks (Big Bench Audio +15.2% vs GPT-Realtime-1.5)
Read the article
This note supports the public article and records what was actually checked.