Elon Musk rarely does anything quietly, and his companies are no different. xAI has just launched standalone Speech-to-Text and Text-to-Speech APIs for developers, and they come with benchmark scores that put them at the top of the hill. But is Grok’s voice actually that good, or is this just Elon waving numbers around again?
The two new APIs, Grok STT and Grok TTS, are built on the same technology that powers Grok Voice on mobile devices, inside Tesla vehicles, and in Starlink customer support, so xAI is not introducing anything that was not already running at scale.
The STT (Speech-to-Text) API provides real-time and batch transcription in 25 languages, with speaker diarisation and word-level timestamps, and supports 12 audio file types. The TTS (Text-to-Speech) API offers five expressive voices (Ara, Eve, Leo, Rex, and Sal) across 20 languages, and can produce realistic-sounding speech using tags such as [laugh] or [sigh].
xAI’s pricing for both APIs is aggressive. STT costs $0.10 per hour for batch transcription and $0.20 per hour for streaming, while TTS is priced at $4.20 per 1M characters, both below what competitors in this space currently charge.
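To put those rates in perspective, here is a rough back-of-the-envelope estimator. It is only a sketch: the three prices are the ones quoted above, while the function names and the sample workload are made up for illustration and are not part of any xAI SDK. It also ignores any volume discounts, minimums, or taxes that may apply.

```python
# Prices as quoted in this article; everything else is hypothetical.
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated transcription cost in USD for a given number of audio hours."""
    rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated speech synthesis cost in USD for a given character count."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


if __name__ == "__main__":
    # Hypothetical monthly workload: 1,000 hours of call audio
    # and 5 million characters of synthesized speech.
    print(f"Batch STT:     ${stt_cost(1000):.2f}")
    print(f"Streaming STT: ${stt_cost(1000, streaming=True):.2f}")
    print(f"TTS:           ${tts_cost(5_000_000):.2f}")
```

Swap in your own hours and character counts to see how the bill scales. Streaming costs exactly twice as much as batch at these rates, so the choice between the two matters more as volume grows.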
On phone call entity recognition (names, account numbers, dates), xAI reports a 5.0% error rate for Grok STT, against 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI. If those numbers hold up, the gap is large enough to make Grok a go-to for speech recognition, and according to xAI that is especially true for healthcare, legal, and financial use cases.
However, self-reported benchmark numbers are, at the end of the day, self-reported. Every AI company publishes benchmark scores that put it on top. ElevenLabs has spent years optimizing its models for nuance, expressiveness, and edge cases, qualities that may or may not show up in a phone call recognition test. And with TTS, which demands not just accuracy but sound quality to boot, there is no way to judge the tool until you hear how it sounds.
What xAI does have on its side is scale. Millions of interactions across Tesla and Starlink have already put that stack to the test. As to whether Grok’s speech APIs are “better” than others: they are definitely better at something, and probably cheaper. Whether that alone is enough for developers to switch is a whole different question.