OpenAI halves their inference cost but no one knows how

Sometime in the final week of June, several OpenAI employees allegedly told colleagues that they had solved a major problem. The cost of inference, the figure that keeps finance teams at every AI lab up at night, had allegedly been cut by more than half. The number of Nvidia GPUs serving ChatGPT traffic had allegedly dropped to a couple hundred. There has been no blog post, no press release, no podium, no CEO quote. There is just one article in The Information citing a single individual with knowledge of the internal conversation, and yet half the AI media outlets are writing headlines as if OpenAI has already solved its biggest problem. It hasn’t. This is important to keep in mind before anything else.

The actual facts are below, and there are fewer of them than you would expect. The method has reportedly been shown to work only on logged-out ChatGPT traffic: users who never signed up for an account and whose usage OpenAI limits. That is no coincidence. It is the lowest-stakes traffic segment, where an experiment can fail without anybody noticing. There is no information on whether the same method works for paid accounts, the API, or reasoning models, all of which are much more costly to operate than simple logged-out chat. The difference between halving costs for traffic that hardly matters and halving costs for traffic that pays the bills is the whole story here.

But how credible is this? Treat it the way you would treat any single-source report about a company’s internal successes. A technically vague but positive story like this one is certainly convenient for OpenAI. The company is preparing for an IPO, and people are going to question whether its business model makes sense at the scale it operates at. A story implying that there is a ceiling on how high costs can go arrives at a very convenient time, with nobody willing to speak for OpenAI on the record and with figures that are impossible to verify.

What exactly did they do? This is the one thing OpenAI has not disclosed, and every outlet covering the story has been guessing. The most probable explanations are unglamorous engineering advances: improved batching, more efficient caching, quantization that lowers numerical precision without degrading output quality, or routing simple requests to smaller, cheaper models. Any of these, alone or in combination, could plausibly account for a reduction of more than half. None of them requires new hardware.

Which brings us to Jalapeño, and to why the way OpenAI handled it matters. The cost narrative leaked through an unnamed source. Jalapeño was handled completely differently: the product had a name, Broadcom was identified as the partner, there was an unveiling, and there was a specific claim that Jalapeño is 50 percent more cost-efficient than standard AI GPUs. This is what OpenAI does when it wants the world to believe its cost story. The software optimization got no such ceremony.

And if this claim is even half true, how should the field react? Anthropic and Google are racing toward the same goal, but neither has published comparable figures, meaning that they have either not reached this milestone yet or simply decided not to disclose it. Specialized inference silicon is becoming a three-horse race between Jalapeño, Google’s TPUs, and Meta’s MTIA, with each backer convinced that its chip is the only escape route from Nvidia’s margin squeeze. If OpenAI has indeed found some software magic that works before those chips reach full deployment, which is still about a year away, competitors that rely solely on GPU efficiency gains face a deadline to do the same. But honestly, nobody knows, and anything written at this point is more or less fiction.

Vyom Ramani

A journalist with a soft spot for tech, games, and things that go beep. While waiting for a delayed metro or rebooting his brain, you’ll find him solving Rubik’s Cubes, bingeing F1, or hunting for the next great snack.