Microsoft creates speech recognition system with human-level accuracy

The software registered a word error rate of 5.9 percent on the industry-standard Switchboard test

Published Date: 20-Oct-2016 | Last Updated: 29-Dec-2016

In an unprecedented breakthrough, a team of researchers and engineers at Microsoft Artificial Intelligence and Research reported that they have created a technology that can recognise words from a conversation just as well as an average human. The team added that the speech recognition system makes the same number of errors as a human transcriptionist.

"We've reached human parity. This is an historic achievement," Xuedong Huang, Microsoft's chief speech scientist stated in a blog post.

According to a paper published on Monday, October 17, the researchers reported a word error rate (WER) of 5.9 percent, down from the 6.3 percent they reported last month. The researchers evaluated the speech recognition system on the "Switchboard" benchmark.
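For context, word error rate is the standard metric behind these figures: the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal, generic implementation of that definition using word-level edit distance; it is illustrative only and not Microsoft's evaluation code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference transcript,
    computed via word-level Levenshtein edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

On a benchmark like Switchboard, this ratio is computed over all reference words in the test set, so a 5.9 percent WER means roughly one word in seventeen is transcribed incorrectly.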

Switchboard is a collection of recorded telephone conversations in English, first released in the early 1990s and used by the National Institute of Standards and Technology (NIST) in the US for benchmark evaluations. It has since become the industry-standard speech recognition test, and companies such as IBM, Google, and Microsoft have used it to measure the accuracy of their speech recognition software. "This accomplishment is the culmination of over 20 years of effort," said Geoffrey Zweig, who manages the Speech & Dialog research group.

The implications of this new development are manifold. The speech recognition tech can augment consumer entertainment devices such as the Xbox, accessibility tools such as instant speech-to-text transcription, and voice assistants like Microsoft Cortana. It could also help people with speech-related impairments.

Now the team is looking at ways to ensure that the speech recognition system works just as well in real-world settings, including places with heavy background noise, such as a concert or an echoing room. The team will also try to develop software that not only recognises words but also understands them. "The next frontier is to move from recognition to understanding," stated Zweig.