OpenAI GPT-Realtime-2: Everything you need to know about the new voice AI models

There has always been one glaring issue with voice AI demos: they seem like magic until a request gets too complicated or the bot loses track of what it is saying. OpenAI is going after this problem with three new models released in its Realtime API today, and the standout is GPT-Realtime-2.


GPT-Realtime-2 is OpenAI's first voice model built on GPT-5's reasoning capabilities, which lets it take on more challenging requests while keeping the conversation flowing naturally. That foundation shows in the benchmarks: the model outperforms its predecessors on audio intelligence by 15.2% on the Big Bench Audio test, and by 13.8% on Audio MultiChallenge, which measures instruction following across multi-turn conversations, context integration, and mid-conversation corrections. Those tests were run at the "high" and "xhigh" reasoning settings, however, so production performance may look different at the default low-latency option.

Alongside it, OpenAI is rolling out two models that address problems developers have previously had to piece together themselves. GPT-Realtime-Translate provides live speech translation from over 70 input languages into 13 output languages as the person speaks. GPT-Realtime-Whisper transcribes speech into text in real time. Both were built from scratch rather than patched on top of existing products, and that is reflected in their cost: GPT-Realtime-Translate runs $0.034 per minute and GPT-Realtime-Whisper $0.017 per minute, well below most business-class competitors.
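At per-minute rates, capacity planning is simple multiplication. A quick sketch, where only the per-minute prices come from the announcement and the monthly volume is a made-up workload for illustration:

```python
# Per-minute prices quoted in the announcement.
TRANSLATE_PER_MIN = 0.034  # GPT-Realtime-Translate
WHISPER_PER_MIN = 0.017    # GPT-Realtime-Whisper

def monthly_cost(minutes: float, rate_per_min: float) -> float:
    """Return the monthly bill in dollars for a given usage in minutes."""
    return round(minutes * rate_per_min, 2)

# Hypothetical workload: 50,000 minutes per month.
print(monthly_cost(50_000, WHISPER_PER_MIN))    # 850.0
print(monthly_cost(50_000, TRANSLATE_PER_MIN))  # 1700.0
```

At that volume, per-minute billing makes the invoice predictable to the cent, which is part of the appeal for transcription and translation workloads.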



GPT-Realtime-2 itself is priced at $32 per million audio input tokens and $64 per million audio output tokens, with cached input at $0.40 per million tokens. Token-based pricing is harder to predict than a flat per-minute rate, so developers will need to model their usage patterns beforehand.
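Those token rates translate into a per-session cost roughly like this. The prices are the ones quoted above; the token counts in the example are invented, and real sessions will vary with audio length and reasoning settings:

```python
# Per-million-token prices for GPT-Realtime-2 from the announcement.
INPUT_PER_M = 32.00    # audio input tokens
OUTPUT_PER_M = 64.00   # audio output tokens
CACHED_PER_M = 0.40    # cached input tokens

def session_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimate one session's cost in dollars.

    cached_tokens is the portion of the input billed at the cached
    rate; the remainder is billed at the full input rate.
    """
    billable_input = input_tokens - cached_tokens
    cost = (billable_input * INPUT_PER_M
            + output_tokens * OUTPUT_PER_M
            + cached_tokens * CACHED_PER_M) / 1_000_000
    return round(cost, 4)

# Hypothetical session: 100k input tokens (60k of them cached),
# 50k output tokens.
print(session_cost(100_000, 50_000, cached_tokens=60_000))  # 4.504
```

The cached-input discount is steep (roughly 80x cheaper than fresh input at these rates), so reusing long system prompts and conversation prefixes is where the savings are.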

The Realtime API also exits beta today and is now generally available, removing a meaningful friction point for teams that were holding off on committing production systems to it. All three models are available immediately and can be tested in the OpenAI Playground.
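For teams reaching for the API directly, Realtime sessions run over a WebSocket. A minimal sketch of assembling the connection parameters, assuming the GA endpoint keeps the documented `wss://api.openai.com/v1/realtime` shape; the model identifier here is the name used in this article and should be checked against the official docs before use:

```python
import os

# Documented Realtime API WebSocket endpoint (verify against current docs).
REALTIME_URL = "wss://api.openai.com/v1/realtime"

def realtime_connection(model: str, api_key: str) -> tuple[str, dict]:
    """Build the WebSocket URL and auth headers for a Realtime session.

    Pass the returned values to any WebSocket client library.
    """
    url = f"{REALTIME_URL}?model={model}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers

url, headers = realtime_connection(
    "gpt-realtime-2",  # model name as reported here; confirm in the docs
    os.environ.get("OPENAI_API_KEY", ""),
)
print(url)  # wss://api.openai.com/v1/realtime?model=gpt-realtime-2
```

The Playground is the faster way to kick the tires, but a sketch like this is the starting point once a team moves past the demo stage.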

The real ambition here is larger than any single model. OpenAI describes the goal as moving realtime audio from simple call-and-response toward voice interfaces that can actually do work: listen, reason, translate, transcribe, and take action as a conversation unfolds. Whether that vision holds up in production at scale is what the next few months of developer adoption will show.


Vyom Ramani

A journalist with a soft spot for tech, games, and things that go beep. While waiting for a delayed metro or rebooting his brain, you’ll find him solving Rubik’s Cubes, bingeing F1, or hunting for the next great snack.
