Saaras V3 explained: How 1 million hours of audio taught AI to speak “Hinglish”

The linguistic landscape of India is not a collection of neat, isolated boxes. It is a fluid, rhythmic blend where languages collide and merge in the middle of a single breath. For years, global speech recognition models have struggled with this reality, often tripping over the “Hinglish” or “Tanglish” phrases that define modern Indian conversation. Sarvam AI has challenged this status quo with the release of Saaras V3, a model built on the foundational belief that to understand India, an AI must first understand the art of the mix.


The massive data engine

The secret to Saaras V3’s fluency lies in its training scale. While many models are fine-tuned on clean, academic datasets, Sarvam AI curated over one million hours of multilingual audio data. This dataset captures the raw, unvarnished reality of Indian speech across a wide range of accents, background noise levels, and acoustic conditions. By feeding the model such a large volume of diverse data, the researchers ensured that the AI wouldn’t just recognize dictionary-perfect Hindi or Bengali, but would also become deeply familiar with the “low-resource” languages and regional dialects that are often ignored by Silicon Valley giants.

Solving the code-mixing conundrum

Code-mixing, the practice of alternating between two or more languages in a single conversation, is perhaps the greatest hurdle for traditional Automatic Speech Recognition (ASR). Most systems are designed to identify one primary language and treat everything else as an error or “noise.” Saaras V3 flips this script by treating code-mixed speech as a core case rather than an exception. Because it was trained on real-world conversations where English technical terms are naturally woven into local sentences, the model maintains a high degree of “numeric fidelity” (getting spoken numbers such as amounts right) and entity recognition. It doesn’t hallucinate or drop words when a speaker switches from Marathi to English to explain a bank transaction; it simply follows the flow.
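
To make the idea of code-mixing concrete, here is a minimal Python sketch that tags each word of a transcript by its writing script and counts how often the script changes mid-sentence. It is only an illustration: the function names `script_of` and `count_switches` are hypothetical, the sample sentence is invented, and nothing here reflects how Saaras V3 works internally. It also cannot see romanized Hinglish, where Hindi words are typed in Latin letters.

```python
import unicodedata


def script_of(token: str) -> str:
    """Return the script of a token's first letter, e.g. 'DEVANAGARI' or 'LATIN'."""
    for ch in token:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "DEVANAGARI LETTER MA".
            return unicodedata.name(ch, "OTHER").split(" ")[0]
    return "OTHER"  # digits, punctuation, and so on


def count_switches(sentence: str) -> int:
    """Count adjacent word pairs written in different scripts."""
    scripts = [script_of(tok) for tok in sentence.split()]
    scripts = [s for s in scripts if s != "OTHER"]
    return sum(1 for a, b in zip(scripts, scripts[1:]) if a != b)


# Invented example: Hindi in Devanagari with English banking terms mixed in.
sample = "मेरा account balance कितना है"
print([script_of(tok) for tok in sample.split()])
print(count_switches(sample))
```

In this sample the speaker leaves Hindi for “account balance” and then returns, which the sketch reports as two switch points in a five-word sentence.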

A unified approach to 23 languages

Rather than building twenty-three separate models for twenty-three different languages, Sarvam AI opted for a single multilingual model. This approach allows the system to leverage “cross-lingual transfer,” where the AI uses its understanding of one language to improve its performance in another, phonetically similar one. The unified design supports the 22 official languages of India plus English while keeping the model lightweight. According to the company, this architectural choice is what allows Saaras V3 to achieve a Word Error Rate of 19.3% on the IndicVoices benchmark, consistently outperforming frontier models like GPT-4o and Gemini 3 Pro when tested on the ground in India.
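
For readers unfamiliar with the metric, Word Error Rate is the number of word substitutions, deletions, and insertions needed to turn the model’s transcript into the correct one, divided by the number of words in the correct transcript. The Python sketch below shows the standard calculation; the function name `word_error_rate` and the sample sentences are illustrative assumptions, not taken from the IndicVoices benchmark or from Sarvam AI’s evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one wrong word ("apna") and one dropped word ("ka").
reference = "mujhe apne account ka balance check karna hai"
hypothesis = "mujhe apna account balance check karna hai"
print(word_error_rate(reference, hypothesis))  # 2 errors / 8 words = 0.25
```

By this measure, a score of 19.3% corresponds to roughly one error for every five words in the reference transcript.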

High speed meets high precision

Beyond mere accuracy, Saaras V3 is engineered for the fast-paced world of live interaction. Many ASR systems suffer from a “processing lag” that makes voice assistants feel clunky and robotic. Saaras V3 uses a streaming-first architecture with causal attention, in which each slice of audio is interpreted using only what came before it, so the model can begin transcribing almost the instant a person starts speaking instead of waiting for the sentence to end. With a time-to-first-token of under 150 milliseconds, the model provides the responsive backbone needed for real-time voice bots, live captions, and interactive customer service. By combining this speed with advanced features like speaker diarization – the ability to tell who is speaking in a room – Sarvam AI has created a tool that doesn’t just hear words, but understands the structure of human dialogue.
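
The causal attention mentioned above can be sketched in a few lines of Python. This is a generic, simplified illustration of the technique, not Sarvam AI’s implementation, whose details the company has not described here; the function name `causal_attention_weights` and the toy scores are assumptions. Each row stands for one audio frame, and the frame is only allowed to put weight on itself and on earlier frames.

```python
import math


def causal_attention_weights(scores: list[list[float]]) -> list[list[float]]:
    """Row-wise softmax where frame i may only attend to frames 0..i."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]  # future frames are masked out
        peak = max(visible)           # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# Toy example: three frames with identical raw scores.
for row in causal_attention_weights([[0.0] * 3 for _ in range(3)]):
    print(row)
```

Because the first frame depends on nothing that comes after it, its output can be computed as soon as it arrives, which is what makes streaming transcription possible.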

Vyom Ramani

A journalist with a soft spot for tech, games, and things that go beep. While waiting for a delayed metro or rebooting his brain, you’ll find him solving Rubik’s Cubes, bingeing F1, or hunting for the next great snack.

Digit.in