Sarvam AI’s Bulbul V3 delivers more natural, stable speech for Indian languages, accents, and mixed-language conversations than global rivals.
In independent blind listening and telephony quality evaluations, Bulbul V3 outperformed several competitors and showed advantages over Google Gemini and ChatGPT in select tasks.
Launched ahead of the India-AI Impact Summit 2026, Bulbul V3 strengthens India’s homegrown AI ecosystem, with real-time speech, enterprise features, and consent-based voice cloning.
Bengaluru-based startup Sarvam AI has launched Bulbul V3, a new text-to-speech model designed for Indian languages, accents, and real-world use cases. The company says the model delivers more natural and stable speech than global rivals and has already outperformed tools from Google and OpenAI in key evaluations. With Bulbul V3, Sarvam is positioning itself as a serious player in voice AI, an area long dominated by US-based companies. Bulbul V3 is one of several tools Sarvam has launched in a 14-day rollout ahead of the India-AI Impact Summit 2026 in New Delhi. The startup is also among the 12 entities selected under the Rs 10,300 crore India AI Mission, where sovereign Indian AI models are expected to be unveiled later this month.
Sarvam says Bulbul V3 is designed around the realities of Indian speech. People often mix languages in a single sentence, pronounce the same word differently across regions, and use names or expressions that global systems struggle to handle. According to the company, Bulbul V3 manages these challenges without breaking flow or meaning.
According to reports, the model can generate speech with natural pauses, emphasis, and pacing. It also supports real-time audio output, which is useful for live conversations, call centres, and interactive apps. Sarvam says fast response times matter in such settings, as delayed responses can hurt the user experience.
Bulbul V3 was tested by an independent third party through blind listening studies across 11 languages. Human listeners compared audio clips from different AI models without knowing which system produced them. While ElevenLabs ranked highest in overall sound quality, Bulbul V3 beat competitors like Cartesia Sonic-3 in general evaluations.
Sarvam Vision achieves state-of-the-art accuracy of 84.3% on the olmOCR-Bench (English-only subset), outperforming frontier models like Gemini 3 Pro and recent OCR models like DeepSeek OCR 2.
Sarvam also said Bulbul V3 performed best in telephony quality tests, which are important for phone-based services. The model showed fewer skipped words and mispronunciations than rivals. In related document and speech tasks handled by Sarvam Vision, the company has previously claimed better results than Google Gemini and ChatGPT on certain benchmarks.
The new model also allows users to create custom AI voices through consent-based voice cloning. Sarvam says the feature includes safeguards and is built for large enterprise use. Developers can access the model through the Sarvam Dashboard, with unlimited API usage available until February 28, 2026.
Bhaskar is a senior copy editor at Digit India, where he simplifies complex tech topics across iOS, Android, macOS, Windows, and emerging consumer tech. His work has appeared in iGeeksBlog, GuidingTech, and other publications, and he previously served as an assistant editor at TechBloat and TechReloaded. A B.Tech graduate and full-time tech writer, he is known for clear, practical guides and explainers.