Microsoft releases Speech Corpus for Indian languages to help researchers build better speech recognition technology

By Digit NewsDesk | Published on 07 Sep 2018

Microsoft Research makes speech training and test data for Telugu, Tamil, and Gujarati publicly available in an attempt to advance research in speech recognition technology.



Microsoft India today announced the availability of Speech Corpus for Indian languages. Speech Corpus, now the largest publicly available speech dataset for Indian languages, offers speech training and test data for Telugu, Tamil, and Gujarati. The release is aimed at helping researchers improve speech recognition technology for applications that rely on speech data. Speech Corpus is made available through the Microsoft Research Open Data initiative, a collection of free datasets that can be used to extend research in areas like natural language processing, computer vision, and domain-specific sciences.

“We believe India’s increasing digital literacy needs to be supported by a multi-lingual digital world. Microsoft Indian Language Speech Corpus is an extension of our ongoing efforts to reduce language barriers and empower Indians to harness the full potential of the Internet. Using our technology expertise, we want to accelerate innovation in voice based computing for India by supporting researchers and academia,” commented Sundar Srinivasan, General Manager of Artificial Intelligence & Research at Microsoft India.

According to a press release shared by Microsoft, Speech Corpus for Indian languages was tested at Interspeech 2018, the world’s largest conference on the science and technology of spoken language processing. Participants in the Low Resource Speech Recognition challenge used data from the corpus to build Automatic Speech Recognition (ASR) systems. They reportedly succeeded in creating high-quality speech recognition models using the available data.

Microsoft also notes in its press release that many vernacular languages across the globe lack enough digital text, speech, and linguistic resources to build large machine learning models. The challenge is understandable, given that the differences in enunciation, accent, diction, and slang across various regions in India are very subtle. Microsoft believes that the release of Speech Corpus for Indian languages will help in overcoming these differences and in building systems that can connect more easily with users in the future.

