Rumik’s Silk Mulberry 1.5 claims to be the best Indian Voice AI right now


Most text-to-speech tools still hand you a dropdown – voice A, B, or C, each one somebody else’s idea of “natural.” Indian AI startup Rumik wants to scrap the dropdown entirely. Its new model, Silk Mulberry 1.5, lets you type a plain-language description and get a synthetic voice built to match age, gender presentation, accent, pitch, emotional register, even mid-sentence code-switching between languages.



What makes this possible is the architecture underneath. Mulberry isn’t a conventional TTS system bolted onto a neural net; it’s built as an audio language model. A transformer backbone generates speech tokens conditioned on your voice description, and an audio decoder converts those tokens into the final waveform. Because everything is token-based, audio can stream out frame-by-frame instead of rendering in one go, which is what gives the model its headline feature: under 200 milliseconds time-to-first-chunk on a single H100 GPU, even with 80 concurrent requests running.
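To make the streaming idea concrete, here is a minimal Python sketch of a token-based pipeline that emits audio chunk by chunk and measures time-to-first-chunk. It is only an illustration: Rumik has not published an API in this piece, so every name here (`generate_speech_tokens`, `decode_frame`, `stream_audio`, `tokens_per_chunk`) is hypothetical, and the "model" and "decoder" are toy stand-ins that produce placeholder bytes, not real audio.

```python
import time
from typing import Iterator, List, Tuple


def generate_speech_tokens(description: str, text: str) -> Iterator[int]:
    """Toy stand-in for the transformer backbone (hypothetical).

    Yields one pseudo speech token per step, "conditioned" on the voice
    description only in the sense that it changes the seed.
    """
    seed = sum(map(ord, description + text))
    for step in range(len(text.split()) * 4):  # assume 4 tokens per word
        yield (seed + step * 31) % 1024


def decode_frame(tokens: List[int]) -> bytes:
    """Toy stand-in for the audio decoder: one byte per token."""
    return bytes(t % 256 for t in tokens)


def stream_audio(description: str, text: str,
                 tokens_per_chunk: int = 4) -> Iterator[bytes]:
    """Yield audio chunks as soon as enough tokens exist for each one."""
    buffer: List[int] = []
    for token in generate_speech_tokens(description, text):
        buffer.append(token)
        if len(buffer) == tokens_per_chunk:
            yield decode_frame(buffer)
            buffer = []
    if buffer:  # flush a final, shorter chunk
        yield decode_frame(buffer)


def time_to_first_chunk(stream: Iterator[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, that chunk)."""
    start = time.perf_counter()
    first = next(stream)
    return time.perf_counter() - start, first


if __name__ == "__main__":
    stream = stream_audio("deep-voiced guy from Haryana", "hello there friend")
    elapsed, first = time_to_first_chunk(stream)
    print(f"first chunk of {len(first)} bytes after {elapsed * 1000:.3f} ms")
    for chunk in stream:  # remaining chunks keep arriving afterwards
        print(f"next chunk: {len(chunk)} bytes")
```

The point of the generator structure is that the listener only waits for the first few tokens, not the whole utterance, which is why time-to-first-chunk rather than total render time is the number that matters for a streaming model.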

The transformer backbone also does some quietly useful work. It resolves pronunciation ambiguity using context, the way a human tells present-tense “read” from past-tense “read.” And it generalises across phrasing. You could train it on “Haryanvi man in his 30s, low pitch” and it’ll still understand “deep-voiced guy from Haryana,” because the model is matching meaning through attention rather than exact strings.


Rumik also published some mechanistic interpretability work behind the code-switching ability. The team trained a linear probe on the model’s internal activations to track, token by token, whether it was “thinking” in English or Hindi. The probe’s confidence spikes sharply on Hindi tokens and falls back once English resumes, suggesting the model isn’t holding a persistent “language mode,” but reacting to lexical content as it goes. It’s a small but telling look under the hood of how code-switched speech actually gets generated.
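For readers unfamiliar with linear probes, the sketch below shows the general technique in Python: a logistic-regression classifier trained on per-token activation vectors to predict a language label. It is not Rumik's code or data. The activations are synthetic (Hindi tokens are assumed to be shifted along one dimension), and the function names, dimensions, and hyperparameters are all invented for illustration.

```python
import math
import random
from typing import List, Tuple


def make_synthetic_activations(n_tokens: int, dim: int,
                               rng: random.Random) -> Tuple[List[List[float]], List[int]]:
    """Fake per-token activations; label 1 = Hindi, 0 = English (assumption).

    Hindi tokens are shifted by +3 on dimension 0, English tokens by -3,
    so a linear direction separating the two exists by construction.
    """
    acts, labels = [], []
    for _ in range(n_tokens):
        is_hindi = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 3.0 if is_hindi else -3.0
        acts.append(vec)
        labels.append(is_hindi)
    return acts, labels


def probe_confidence(x: List[float], w: List[float], b: float) -> float:
    """Probe's estimated probability that this token is Hindi."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(acts: List[List[float]], labels: List[int],
                       lr: float = 0.1, steps: int = 200) -> Tuple[List[float], float]:
    """Fit logistic regression with plain full-batch gradient descent."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = probe_confidence(x, w, b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    rng = random.Random(0)
    acts, labels = make_synthetic_activations(200, 8, rng)
    w, b = train_linear_probe(acts, labels)
    # Token-by-token trace over a fresh synthetic "sentence".
    sentence, truth = make_synthetic_activations(6, 8, rng)
    for x, y in zip(sentence, truth):
        lang = "Hindi" if y else "English"
        print(f"{lang:8s} P(Hindi) = {probe_confidence(x, w, b):.2f}")
```

Reading the probe's output one token at a time is what lets researchers distinguish a persistent "language mode" (confidence would stay high after a Hindi word) from a token-level reaction (confidence rises and falls with each word), which is the pattern the article describes.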

On benchmarks, Mulberry scores 4.23 on the Mean Opinion Score (MOS) quality scale, putting it within range of ElevenLabs v3 and Gemini 3.1 Flash TTS without quite beating either. Where it stands out is instruction-following: on the InstructTTS Eval benchmark, it hits 74% overall consistency, performing best on descriptive style prompts (82.7%) and worst on abstract role-play scenarios like “a teacher scolding a student” (64.4%).

Rumik isn’t claiming the most natural-sounding voice on the market. It’s betting that control, the ability to design a voice rather than pick one, is the more interesting problem to solve. The model is live now on the Rumik playground.


Vyom Ramani


A journalist with a soft spot for tech, games, and things that go beep. While waiting for a delayed metro or rebooting his brain, you’ll find him solving Rubik’s Cubes, bingeing F1, or hunting for the next great snack.