OpenAI Launches GPT-Realtime-2: Its First Voice Model That Can Reason While You Talk
OpenAI released three new models for its Realtime API on May 7 — GPT-Realtime-2 (reasoning-capable, 128K context), GPT-Realtime-Translate (live speech translation across 70+ languages), and GPT-Realtime-Whisper (streaming transcription). GPT-Realtime-2 is the first voice model built on GPT-5-class intelligence, letting developers build voice agents that think through hard problems mid-conversation without awkward silence.
For the past two years, building a production voice AI agent has required developers to stitch together three separate services: speech-to-text, a reasoning model, and text-to-speech. The result was inevitably laggy, the handoffs introduced errors, and users could hear the seams. On May 7, OpenAI released three new models that collapse this architecture, and in doing so raise the ceiling for what voice interfaces can do.
The announcement centers on GPT-Realtime-2, the company’s first voice model built on GPT-5-class intelligence. But the release is really a three-part suite: GPT-Realtime-2 for intelligent voice interaction, GPT-Realtime-Translate for live multilingual conversation, and GPT-Realtime-Whisper for streaming transcription. Each model is available through the Realtime API today.
GPT-Realtime-2: Voice That Thinks
The headline capability of GPT-Realtime-2 is reasoning during live speech — something no prior voice AI model has done well. Previous voice systems, including OpenAI’s own earlier Realtime API models, were built on smaller, faster models optimized for latency. They could hold a conversation and call basic tools, but fell apart on complex requests that required multi-step reasoning, nuanced instructions, or long context.
GPT-Realtime-2 changes this by combining speech-to-speech processing with GPT-5-class reasoning in a single model pass. The context window has grown from 32K to 128K tokens, making it possible to sustain extended conversations, reference earlier parts of a session, and handle agentic workflows that accumulate context over time.
Developers can configure five levels of reasoning effort — minimal, low, medium, high, and xhigh — allowing applications to trade latency for accuracy depending on the task. A customer support bot handling routine queries might run on minimal; a medical intake assistant parsing complex symptoms would dial up to high or xhigh.
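As a rough illustration, here is how selecting an effort tier might look over the Realtime API's WebSocket interface. This is a sketch, not documented usage: the connection pattern and the session.update event follow the existing Realtime API conventions, while the gpt-realtime-2 model id and the reasoning field are assumptions drawn from this announcement.

```python
# Sketch: configure reasoning effort for a Realtime session.
# Assumptions: the "gpt-realtime-2" model id and the "reasoning" session
# field come from this announcement and are not confirmed parameter names.
import asyncio
import json
import os

import websockets  # pip install websockets


async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

    # Note: older versions of the websockets library name this kwarg extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                # One of: minimal | low | medium | high | xhigh
                "reasoning": {"effort": "medium"},
            },
        }))
        print(json.loads(await ws.recv()))  # expect a session.updated ack


asyncio.run(main())
```

One plausible pattern: route routine intents to minimal and escalate to high only when a query trips a complexity heuristic, keeping both latency and cost down for the common case.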
Two new developer features make the reasoning process audible rather than silent. Preambles let the model say something like “let me check that” or “one moment” before launching into a tool call — eliminating the dead air that previously made reasoning agents feel broken. Parallel tool calls let the model fire multiple backend requests simultaneously and narrate which one is currently in flight: “I’m pulling your reservation and checking the weather at the same time.”
These are not cosmetic features. Voice-UX research has repeatedly found that unexplained silence is a leading cause of user drop-off in voice interactions. Making the model's work audible keeps users engaged through the latency that reasoning inherently adds.
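A session configuration combining both features with two tools might look like the sketch below. The tool schema matches the existing Realtime API function-calling format; the preambles flag is invented here for illustration, and parallel_tool_calls mirrors the flag OpenAI uses elsewhere in its APIs, so both names are assumptions.

```python
# Sketch: a session.update payload enabling the two features described above.
# The "preambles" and "parallel_tool_calls" fields are assumed names, not
# confirmed parameters; the tool schema follows the existing Realtime API.
import json

session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a concise travel-booking assistant.",
        "preambles": True,            # hypothetical: say "one moment" before tool calls
        "parallel_tool_calls": True,  # hypothetical here: allow simultaneous requests
        "tools": [
            {
                "type": "function",
                "name": "get_reservation",
                "description": "Look up a reservation by confirmation code.",
                "parameters": {
                    "type": "object",
                    "properties": {"code": {"type": "string"}},
                    "required": ["code"],
                },
            },
            {
                "type": "function",
                "name": "get_weather",
                "description": "Fetch the current forecast for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        ],
    },
}

print(json.dumps(session_update, indent=2))
```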
Pricing and the Developer Math
GPT-Realtime-2 is priced at $32 per million audio input tokens and $64 per million audio output tokens, with cached input tokens available at $0.40 per million — a significant discount for applications that can reuse conversation context.
For reference, a typical three-minute customer service call generates roughly 5,000–7,000 audio tokens. At full pricing, that works out to about $0.23–$0.45 per call before caching, depending on the input/output mix. For high-volume consumer applications this matters; for enterprise deployments where voice agents are replacing human agents at $20–40 per hour, the economics are compelling at almost any token volume.
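A quick back-of-envelope script makes the per-call math explicit. The input/output split is an assumption; real calls vary.

```python
# Per-call cost for a 3-minute call at 5,000-7,000 audio tokens, priced at
# $32/M input and $64/M output (cached input ignored for simplicity).
INPUT_PER_M, OUTPUT_PER_M = 32.00, 64.00


def call_cost(total_tokens: int, output_share: float) -> float:
    """Dollar cost of one call, given the fraction of tokens that are output."""
    inp = total_tokens * (1 - output_share)
    out = total_tokens * output_share
    return (inp * INPUT_PER_M + out * OUTPUT_PER_M) / 1_000_000


for tokens in (5_000, 7_000):
    for share in (0.3, 0.5, 0.7):
        print(f"{tokens:,} tokens, {share:.0%} output: ${call_cost(tokens, share):.2f}")
# Prints $0.21-$0.38 across these mixes; the top of the article's
# $0.23-$0.45 range implies an output-heavy call.
```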
The five reasoning tiers also give developers fine-grained cost control. Running at minimal reasoning is substantially cheaper than xhigh, and many production applications will find that low or medium reasoning handles 80–90% of queries adequately.
GPT-Realtime-Translate: Live Multilingual Conversation
The second model, GPT-Realtime-Translate, is purpose-built for real-time speech translation. It accepts audio in more than 70 input languages and outputs translated speech in 13 languages, keeping pace with the speaker rather than buffering for complete sentences.
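Configured as a translation layer, a session might look like the sketch below. Everything beyond the session.update envelope is an assumption from this announcement: the model id and the output_language field are illustrative, not documented parameters.

```python
# Sketch: a hypothetical translation session. The model id and the
# "output_language" field are assumptions based on this announcement.
import json

session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-translate",  # assumed model id
        "output_language": "es",            # hypothetical: speak translations in Spanish
        # Input language would be detected automatically across the 70+ supported languages.
    },
}

print(json.dumps(session_update, indent=2))
```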
The practical applications are significant. A customer support center could route calls in any language to English-speaking agents, or vice versa, without requiring either party to use text or a human interpreter. Telemedicine platforms serving multilingual patient populations could offer real-time consultations without language bottlenecks. Field service teams working across borders could communicate through the model as a persistent translation layer.
GPT-Realtime-Translate is billed by the minute rather than by token, which simplifies cost modeling for applications with predictable call lengths.
GPT-Realtime-Whisper: Streaming Transcription
The third model, GPT-Realtime-Whisper, addresses a narrower but frequently requested use case: high-accuracy streaming transcription that works word-by-word as the speaker talks, rather than sentence-by-sentence.
Earlier approaches to live transcription forced developers to choose between low-latency partial transcripts (often inaccurate and in need of correction) and high-accuracy complete transcripts (delayed enough to feel disconnected from the speech). GPT-Realtime-Whisper claims to break this tradeoff, delivering transcription that is both fast and accurate by streaming token probabilities rather than committing to final text only after a sentence completes.
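Consuming the stream might look like the sketch below. The delta and completed event names follow the Realtime API's existing input-audio transcription events; the gpt-realtime-whisper model id is an assumption from this announcement.

```python
# Sketch: print word-by-word transcript updates as they stream in.
# In a real integration, audio is pushed to the session via
# input_audio_buffer.append events; that side is omitted here.
import asyncio
import json
import os

import websockets  # pip install websockets


async def stream_transcript() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"  # assumed id
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "conversation.item.input_audio_transcription.delta":
                print(event["delta"], end="", flush=True)  # partial words as they arrive
            elif event["type"] == "conversation.item.input_audio_transcription.completed":
                print("\n--", event["transcript"])  # finalized utterance


asyncio.run(stream_transcript())
```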
Like GPT-Realtime-Translate, GPT-Realtime-Whisper is billed per minute.
Why This Matters for the Ecosystem
OpenAI’s Realtime API has been available since October 2024, but adoption by enterprise developers has been limited by the reasoning gap — the original models could talk naturally but couldn’t handle the kinds of complex, multi-turn workflows that make voice agents genuinely useful in production. GPT-Realtime-2 is the first iteration that closes this gap.
The timing aligns with a broader market shift. Microsoft, Google, and Amazon have all signaled major investments in voice-based AI interfaces for enterprise in 2026, and the competition for developer mindshare in this space is intensifying. OpenAI’s decision to ship GPT-Realtime-2 alongside two companion models (rather than a single voice upgrade) suggests the company is trying to own the full developer stack for voice — covering intelligent conversation, multilingual translation, and transcription with a unified API surface.
For developers who have been waiting for voice AI capable enough to deploy in complex enterprise workflows, this week’s release removes the biggest remaining technical barrier. The open question now is not whether the models are capable enough — it is whether the pricing and latency tradeoffs clear the bar for each specific application.
Based on early-access testing reported by several enterprise developers, GPT-Realtime-2 at medium reasoning shows latency in the 800 ms to 1.2 second range for typical queries, fast enough for natural conversation. The xhigh tier, however, introduces delays of 2–4 seconds on complex requests, which will demand careful UX design in applications where users expect instant responses.
The three-model release represents the most significant evolution of OpenAI’s voice platform since its 2024 launch, and is likely to accelerate the deployment of AI voice agents across customer service, healthcare, and enterprise productivity applications through the second half of 2026.