Voice recognition technology with the world’s highest standard of accuracy

February 19, 2019

NEC has been involved in the research and development of voice recognition technology, including participating in a third-party evaluation test in 2018. What level has voice recognition currently reached, and how can it be used? Here, we ask a developer about the details.

Achieving 95% accuracy under harsh conditions

Takafumi Koshinaka, Ph.D.
Senior Principal Researcher
Biometrics Research Laboratories

― How accurate is NEC’s voice recognition?

NEC possesses some of the world's No. 1 biometric technologies, including face recognition. In recent years, NEC has also been involved in the research and development of voice recognition, and has been achieving exceptionally high accuracy rates.
In 2018, NEC participated in benchmark tests conducted by a third-party evaluator in the U.S., the National Institute of Standards and Technology (NIST), and successfully demonstrated the ability of its voice recognition. The test was extremely challenging from a technical standpoint. For example, the audio used for a task that involved identifying people in a telephone conversation had extremely loud background noise and line noise, and was difficult to make out even for human listeners. Despite these harsh conditions, NEC's voice recognition system maintained an accuracy rate of approximately 95%. As the accuracy rate of the baseline system set by NIST was approximately 89%, our error rate (about 5%) was less than half that of the baseline system (about 11%). As you can see, we were able to demonstrate an exceptionally high level of technological ability.
Although we are unable to publicize the results ranking because NIST's voice recognition evaluation is strongly academic in nature, this evaluation proved to be another good opportunity for us to show that our voice recognition is capable of competing globally.
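The comparison with the NIST baseline can be checked with a few lines of arithmetic. This is only a sketch: it assumes that "accuracy" here is simply one minus the error rate, which the article does not specify, and the function name `error_rate` is introduced purely for illustration.

```python
def error_rate(accuracy: float) -> float:
    """Error rate as the complement of accuracy (both given as fractions)."""
    return 1.0 - accuracy


# Figures quoted in the article (both approximate).
nec_error = error_rate(0.95)       # about 0.05
baseline_error = error_rate(0.89)  # about 0.11

# Ratio of the two error rates; a value below 0.5 means "less than half".
relative = nec_error / baseline_error
print(f"error ratio vs. baseline: {relative:.2f}")
```

Under that assumption, a six-point gain in accuracy corresponds to removing more than half of the baseline's errors.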

Unwavering recognition accuracy in any environment

― Why does NEC’s voice recognition have such high accuracy?

One main reason is that it is robust to changes in environment. The system is designed so that it can recognize speakers effectively even when various factors interfere with recognition.
In deep learning, gathering a larger amount of data leads to higher accuracy. At NEC, we use a unique kind of data augmentation technology in which noise, reverberation, and so forth are added to a piece of speech to create a different piece of speech. Through this, we can acquire a large variety of speech patterns and improve accuracy drastically. In addition to this augmentation method, we can convert Person A's voice to that of a different person, Person A', making it possible to collect speech data from a large variety of speakers efficiently. By implementing this technology, we have in fact been able to reduce recognition errors by about 30%.
Another important point is that we incorporate a unique neural network that extracts individual characteristics. In speech signals, the parts that show a person's unique properties differ from person to person. NEC has developed a unique “attention mechanism” in which the parts that show such properties are automatically extracted and relayed to the recognition neural network. This technology was first announced in a paper in September 2018, and received highly positive feedback at an academic conference.* Through the use of this mechanism, the amount of speech required for recognition has been shortened to about half of what was conventionally required.
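As a rough illustration of augmentation by adding noise and reverberation, the sketch below mixes noise into a signal at a chosen signal-to-noise ratio and convolves the signal with a simple echo pattern. It is not NEC's method: the function names, the synthetic sine-wave "speech", the Gaussian noise, and the three-tap impulse response are all assumptions made for the example, and the voice conversion step mentioned above is not covered.

```python
import math
import random


def add_noise(speech, noise, snr_db):
    """Mix equal-length noise into speech at a target signal-to-noise ratio (dB)."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that speech_power / scaled_noise_power == 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def add_reverb(speech, impulse_response):
    """Convolve speech with an impulse response, truncated to the original length."""
    out = [0.0] * len(speech)
    for i, s in enumerate(speech):
        for j, h in enumerate(impulse_response):
            if i + j < len(speech):
                out[i + j] += s * h
    return out


rng = random.Random(0)
sample_rate = 8000  # hypothetical telephone-band sampling rate

# Stand-in for a speech recording: 0.1 s of a 220 Hz tone.
clean = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(800)]
noise = [rng.gauss(0.0, 1.0) for _ in clean]

# Toy impulse response: the direct sound plus two decaying echoes.
impulse = [0.0] * 200
impulse[0], impulse[80], impulse[160] = 1.0, 0.5, 0.25

noisy = add_noise(clean, noise, snr_db=10)
reverberant = add_reverb(clean, impulse)

# One recording has become three training examples.
augmented = [clean, noisy, reverberant]
```

Each variant keeps the same speaker label as the original, which is what lets a single recording stand in for several acoustic conditions during training.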

*
K. Okabe et al., “Attentive Statistics Pooling for Deep Speaker Embedding,” INTERSPEECH 2018, Hyderabad, September 2018
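The idea of weighting the informative parts of an utterance can be sketched as attention-weighted pooling: each frame of speech receives a weight, and the weighted mean and standard deviation of the frame features are combined into one vector for the whole utterance. This is a simplified illustration rather than the published model. In particular, the attention scores are passed in directly here, whereas in a real system they would be produced by a trained network, and the name `attentive_stats_pooling` is an assumption.

```python
import math


def attentive_stats_pooling(frames, scores):
    """Pool frame-level features into a single utterance-level vector.

    frames: list of T feature vectors, each a list of D floats.
    scores: list of T unnormalized attention scores, one per frame
            (hypothetical inputs; normally the output of a small network).
    Returns the attention-weighted mean followed by the attention-weighted
    standard deviation, i.e. a list of 2 * D floats.
    """
    # Softmax over time turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) for d in range(dim)]
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [math.sqrt(max(m2 - m * m, 0.0)) for m2, m in zip(second, mean)]
    return mean + std


frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

# Equal scores reduce to ordinary mean and standard deviation over all frames.
uniform = attentive_stats_pooling(frames, [0.0, 0.0, 0.0])

# A dominant score makes the result depend almost entirely on the first frame.
focused = attentive_stats_pooling(frames, [100.0, 0.0, 0.0])
```

Because frames with low weights contribute little, a short utterance with a few highly informative frames can still yield a usable vector, which is consistent with the shorter speech duration described above.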

The only biometrics that can recognize people remotely over the telephone

― What kinds of things are you planning on applying it to?

First, what I can say is that voice recognition works exceptionally well with the telephone. Over the telephone, a person can be recognized effectively even when they are somewhere else. This is one advantage of voice recognition that other biometrics do not have.
Furthermore, speaking places comparatively little psychological burden on the user. Since it does not require deliberate actions such as placing a finger on a device or looking into a camera, it makes authentication easier, which is one of its greatest advantages.
As for situations in which these advantages can be put to use, we are currently considering two broad kinds of solutions.
First is applying it to e-commerce or Internet banking. By making identity verification and payments possible through the telephone, services with a good balance between security and user-friendliness can be developed.
Second is implementing it at call centers. Speakers will be recognized from their voices, and data from previous call logs will be available for reference. This can help avoid potential trouble and contribute to offering better services.
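The call center scenario can be pictured as a lookup: a vector representing the caller's voice is compared with vectors stored for known callers, and the matching caller's history is retrieved. The sketch below is hypothetical throughout. The three-dimensional vectors, the caller IDs, the log entries, the 0.8 threshold, and the use of cosine similarity are all assumptions chosen for illustration, not details from NEC.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify_caller(embedding, enrolled, threshold=0.8):
    """Return the best-matching enrolled caller ID, or None if no score reaches the threshold."""
    best_id, best_score = None, threshold
    for caller_id, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_id, best_score = caller_id, score
    return best_id


# Hypothetical voice vectors stored when each caller first enrolled.
enrolled = {
    "caller-001": [0.9, 0.1, 0.2],
    "caller-002": [0.1, 0.8, 0.5],
}
# Hypothetical call history keyed by the same IDs.
call_logs = {
    "caller-001": ["billing inquiry"],
    "caller-002": [],
}

incoming = [0.85, 0.15, 0.25]  # vector extracted from the current call
caller = identify_caller(incoming, enrolled)
history = call_logs.get(caller, [])
```

Returning `None` below the threshold matters in practice: an unrecognized voice should fall back to ordinary identity checks rather than be matched to the nearest known caller.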

※
The information posted on this page is the information at the time of publication.
