
Speech Recognition

What is Speech Interface Technology?

We have been researching speech interfaces to provide all kinds of people, from children to the elderly, with an easy-to-use means of handling information devices in diverse settings, including the office, the home, the automobile, and outdoors. Speech is the most natural medium used by human beings, and our objective is to use it as an interface to create a human-friendly "natural user interface" (NUI).

In this section, I would like to introduce our speech interface technology used in call centers and production management. My hope here is to give the reader a feeling for the great possibilities of speech input and speech interfaces.

Real-time conversational-speech-to-text technology - Application example: call center -

In addition to receiving customer requests and complaints and solving customer problems, a call center has the important role of feeding valuable customer comments and opinions back into products and solutions. In the past, such comments and opinions received from customers by phone were typed into computers by call-center operators. The data so obtained were then passed on to the relevant departments and compiled into management reports. This process suffered from various problems, including faulty data caused by input errors and the huge amount of time required to assemble reports. Against this background, we chose speech recognition technology as a means of making an operator's work more efficient and making the information received from customers even more valuable.

In particular, we are using technology for matching input speech against the tree-structured acoustic model described below, and technology that autonomously learns a speaker's voice features and can recognize that speaker's speech even from a small number of voice samples (autonomous speaker adaptation technology). We are also working to improve conversational-sentences analysis technology to collect and analyze the casual and chatty expressions that occur in speech.

By making full use of these technologies, we have been able to convert operator replies at a call center into text in nearly real time, display that text on the operator's screen, and display a screen that automatically extracts important keywords from the operator's speech.
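To give a concrete feeling for the keyword-extraction step, here is a toy sketch that pulls frequent content words out of a transcript. It is a simple frequency count with an assumed stopword list, standing in for the far more capable conversational-sentences analysis used in the actual system.

```python
from collections import Counter

# Illustrative only: a trivial frequency-based keyword extractor for
# transcribed call-center text; the stopword list is an assumption.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "i", "you", "my",
             "um", "uh", "well", "so", "please", "thank"}

def extract_keywords(transcript: str, top_n: int = 5) -> list[str]:
    words = [w.strip(".,?!").lower() for w in transcript.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

print(extract_keywords(
    "Um, well, my router keeps dropping the connection, "
    "and the router light blinks red when the connection drops."))
# ['router', 'connection', ...]
```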


Noise cancellation technology - Application example: VoiceDo/HT series -

An annoying problem when attempting to input or retrieve data by speech is background noise. In a relatively quiet office environment with few space constraints, such as a call center, speech can be recognized with almost no problem by having operators wear a microphone-and-earphone headset. The situation is altogether different, however, at places like production-management sites and auction houses, where personnel who often have their hands full would like to register data by speech.

For a speech recognition device, the sound of conveyor belts and the voices of various people become "noise." We developed noise cancellation technology to remove this noise.

For example, NEC's VoiceDo/HT series of hand-held terminals, used in factories, warehouses, outdoor settings, and so on, has adopted a system called "two-microphone input noise cancellation." In this system, the terminal headset is equipped with two microphones: one placed near the user's mouth to capture his or her speech, and the other placed away from the mouth to capture noise. In this way, the user's voice can be recognized after the noise has been removed.

Figure 3: Recognizes speech by inputting speech and noise with two microphones and separating noise components
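As a concrete illustration of the general idea (though not NEC's actual algorithm), the following sketch suppresses noise by spectral subtraction: the noise spectrum estimated from the outward-facing reference microphone is subtracted, frame by frame, from the spectrum of the mouth-side microphone. The frame sizes and the subtraction factor are assumed values.

```python
import numpy as np

def two_mic_noise_cancel(primary, reference, frame=512, hop=256, beta=1.0):
    """Suppress noise in `primary` (mouth-side mic: speech + noise) using
    `reference` (outward-facing mic: mostly noise) by spectral subtraction.
    Both inputs are equal-length 1-D arrays. Illustrative sketch only."""
    window = np.hanning(frame)
    out = np.zeros(len(primary))
    for start in range(0, len(primary) - frame + 1, hop):
        p = np.fft.rfft(primary[start:start + frame] * window)
        r = np.fft.rfft(reference[start:start + frame] * window)
        # Subtract the reference mic's noise magnitude (floored at zero)
        # and keep the primary signal's phase.
        mag = np.maximum(np.abs(p) - beta * np.abs(r), 0.0)
        cleaned = mag * np.exp(1j * np.angle(p))
        # Overlap-add the cleaned frame back into the output signal.
        out[start:start + frame] += np.fft.irfft(cleaned, n=frame) * window
    return out
```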

In robot applications in which no microphone can be attached near the mouth, noise cancellation is achieved by increasing the number of microphones attached to the robot.

Speech synthesis technology - Application example: VoiceDo/HT series -

What type of voice should be used to present the results of speech recognition? This is also an important issue in speech interface technology, and it has led to the appearance of speech synthesis technology that enables text to be read out by a synthesized voice. We have achieved high-quality and diverse synthesized speech in a compact configuration through speech generation technology based on an original text-analysis and waveform-editing method.

In addition to the high-accuracy text analysis performed by the conversational-sentences analysis technology mentioned earlier, we are developing other technologies in this area. These include speech generation technology that assigns natural accents and intonation, modifies waveforms accordingly, and smoothly connects those waveforms, and compression technology that reduces highly variable speech waveforms to a compact size while maintaining a natural-sounding result.

Thoughts on recognition rate

As our last topic in this section, I would like to touch upon recognition rate. The first question I am asked about speech recognition is "How high is the recognition rate?" Of course, I would like to be able to say "100%," but in truth the conditions under which recognition is performed play a large role, and achieving a rate of 100% is extremely rare. Just as it can be difficult to hear someone in a noisy place like a train-station platform, or to catch the words of a person speaking softly, there are many situations in which recognition is hampered by the surrounding environment or by features of the speaker's voice. We therefore endeavor to improve the recognition rate in various ways, such as by removing noise through the noise cancellation technology described above and by enabling the recognition of casual language and chatty expressions through learning techniques. In recent research, we have also attempted to link speech recognition with video data to raise recognition accuracy.

Nevertheless, the effect of using speech as an input interface is profound even without reaching a recognition rate of 100%. Given that a certain amount of speech can be recognized, the rest can usually be resolved by the person processing the information in question. Taking the call center as an example again, the automatic input of operator/customer exchanges, even with some mistakes, can make work much more efficient than having the operator enter all such information via a computer keyboard.

What is a waveform editing method?

A waveform editing method is a form of speech synthesis that enables any text to be read out with a synthesized voice. This is achieved by having a professional announcer utter a variety of sentences and textual patterns, recording those utterances, and cutting and pasting the recorded waveforms to generate speech corresponding to the text in question. This method recognizes that the waveform required to read out the character "a", for example, will depend on the desired intonation and accent as well as on the sounds coming before and after, so it attempts to select, from among many recorded variations, the waveform data that best fits the conditions in question. A person's voice, however, results from the continuous movement of the mouth and tongue, and simply connecting waveforms extracted from different locations in recorded speech does not produce smooth speech. To give a smooth listening experience and achieve a manner of speaking close to that of a human being, a variety of techniques are used, such as making the prior sound softer and the subsequent sound progressively louder at each join.
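That "softer prior sound, progressively louder subsequent sound" step amounts to a cross-fade at each join. Below is a minimal sketch of the smoothing idea, assuming each unit is a one-dimensional NumPy array longer than the fade length; real waveform-editing synthesizers also match pitch and energy at the joins.

```python
import numpy as np

def crossfade_concat(units, fade=64):
    """Concatenate recorded waveform units with a linear cross-fade:
    the tail of each unit is faded out while the head of the next unit
    is faded in, so the joins do not click. Illustrative sketch only."""
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for unit in units[1:]:
        unit = unit.astype(float)
        # Overlap the last `fade` samples of `out` with the first
        # `fade` samples of the next unit.
        out[-fade:] = out[-fade:] * (1.0 - ramp) + unit[:fade] * ramp
        out = np.concatenate([out, unit[fade:]])
    return out
```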

What is a tree-structured acoustic model?

In speech recognition, a sound is recognized by associating an input sound with a sound sample (an acoustic model) stored in the system. When attempting to recognize a variety of speaking styles, accuracy improves as the number of sound samples increases. Unfortunately, the amount of speech data required to prepare such a large number of samples is huge, and the resources (memory capacity, computational complexity) needed to perform recognition with them are substantial. At NEC, we have developed a method that can automatically determine the optimal way of preparing sound samples once the available speech data and usable resources have been decided on. This method arranges sound samples in a tree structure, starting with a "root" that gives approximate features and extending to "branches" and "leaves" that represent detailed features, and it selects an optimal level of samples from this structure according to the amount of speech data and usable resources. This technology achieves a high level of balance between recognition accuracy and required resources, and it has already enabled real-time recognition of an operator's speech on a typical desktop computer.
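The selection step can be sketched as follows. The node structure and the `min_samples` threshold are illustrative assumptions; the real method also weighs memory and computational budgets.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One sound sample (acoustic model) in the tree: the root is a coarse
    model, leaves are detailed variants. `n_samples` is how much training
    speech backed this node. Illustrative structure, not NEC's format."""
    name: str
    n_samples: int
    children: list["Node"] = field(default_factory=list)

def select_models(node: Node, min_samples: int) -> list[Node]:
    """Pick the deepest models that still have enough training data:
    descend while every child is well supported, otherwise stop here."""
    if node.children and all(c.n_samples >= min_samples for c in node.children):
        selected = []
        for child in node.children:
            selected += select_models(child, min_samples)
        return selected
    return [node]

# Coarse /a/ model with two detailed variants; one variant lacks data,
# so recognition falls back to the coarse root-level model.
tree = Node("a", 900, [Node("a-fast", 700), Node("a-slow", 50)])
print([n.name for n in select_models(tree, min_samples=100)])  # ['a']
```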

What is an autonomous speaker adaptation method?

Technology that uses the tree-structured acoustic model to register a speaker's utterances and efficiently learn the features of that speaker's voice from a small amount of speech data is called an "autonomous speaker adaptation method." In this method, the amount of speech data uttered by a new speaker determines how the sound samples stored in the tree-structured acoustic model are adapted to that individual. Specifically, for sounds having only a small amount of data, the method adapts samples at the "root" level, which carries approximate features; for sounds having a sufficient amount of data, it adapts samples down to the "leaf" level, which carries detailed features. In contrast to past methods, in which the level to which samples were adapted had to be adjusted manually, this technique enables optimal adaptation according to the amount of available data without the need for human intervention.
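A toy sketch of that adaptation rule follows; it assumes each tree node carries a `name`, a feature-vector `mean`, and `children` (all illustrative names), and the interpolation weight and data threshold are assumed values, not NEC's actual update rule.

```python
import numpy as np

def adapt_to_speaker(node, frames_by_sound, min_frames=50, weight=0.2):
    """Shift each sound sample's mean toward the new speaker, descending
    only where enough speaker data exists. With little data only coarse
    "root"-level nodes are adapted; with plenty, adaptation reaches the
    detailed "leaf" level."""
    frames = frames_by_sound.get(node.name, [])
    if len(frames) < min_frames:
        return  # too little data: this subtree keeps the coarser model's fit
    speaker_mean = np.mean(frames, axis=0)
    node.mean = (1.0 - weight) * node.mean + weight * speaker_mean
    for child in node.children:
        adapt_to_speaker(child, frames_by_sound, min_frames, weight)
```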

What is conversational-sentences analysis technology?

In the speech recognition process, the association of input sounds with sound samples is followed by the recognition of word sequences using a "language model." This language model is prepared by dividing text data obtained from previously collected speech into words and subjecting them to statistical processing. Spoken language differs from the written language found in newspaper articles and elsewhere in that it includes filler words like "oh" and "well" as well as polite expressions used in social settings. To deal with these special characteristics, we need conversational-sentences analysis technology. At NEC, we have extended text analysis techniques developed in the fields of text mining and automatic interpretation to spoken language, realizing high-accuracy conversational-sentences analysis technology. This has enabled us to efficiently create an accurate language model from text data generated from speech.
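As a minimal illustration of the language-model side, the sketch below builds bigram statistics from word-segmented conversational transcripts; keeping fillers like "oh" and "well" in the training text is what lets the model expect them during recognition. The function names and the absence of smoothing are simplifications for illustration.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Build a tiny bigram language model from word-segmented transcripts.
    Returns a function giving P(word | previous word). Toy sketch: real
    systems smooth the counts and use far larger corpora."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    def prob(prev, word):
        # Maximum-likelihood estimate of the bigram probability.
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

prob = train_bigram_lm([
    ["well", "the", "router", "is", "broken"],
    ["oh", "well", "it", "works", "now"],
])
print(prob("<s>", "well"))  # 0.5: fillers are likely sentence openers
```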