Speech Recognition

What is Speech Recognition Technology?

Akitoshi Okumura

To introduce myself, I am Akitoshi Okumura of the Common Platform Software Research Laboratories. Since joining NEC in 1986, I have been involved in the research and development of technologies for translating and comprehending human language. The ultimate goal of our research is an "International Prince Shotoku" system. The name refers to the legendary Japanese figure Prince Shotoku, who was said to be able to listen to eight people speaking at once and understand everything each of them was saying. Similarly, consider a situation in which a number of people are speaking together but in different languages. If there were a means of understanding what each of these people was saying, could we not overcome the language barrier and even surmount cultural differences? Giving anyone the ability to visualize conversations between people speaking not only Japanese but other world languages as well should prove very useful, not only in the business world but in everyday life.

In the first section of this article, I'd like to explain what speech recognition technology for visualizing conversations is all about.

Humans detect sound with their ears, and it is said that they identify words and connect them with meaning using the left side of the brain, while discerning emotion from words and facial expressions using the right side. To reproduce this series of functions on a machine such as a computer, we need to develop functions for recognizing the sounds themselves and functions for analyzing those sounds. As you know, however, a machine recognizes speech based on preregistered words, sentences, and grammar, which makes it difficult to recognize non-grammatical expressions that result, for example, when a speaker omits articles, rearranges word order, or inserts a sentence within a sentence. This is the weak point of machine-based recognition. At the same time, the human brain is capable of remembering tens of thousands of words through learning and experience, and slight mistakes in pronunciation, or even the use of unknown words, in another person's utterances are not a major problem, since context often enables the listener to understand what that person is trying to say. In a machine, however, the registered vocabulary and experience do not cover those of all people, which makes complete analysis of sound difficult. It would be impossible for a machine to record human vocabulary in its entirety, all existing sentences, all possible human experiences, and all voices. And even if that were possible, checking input data against all such registered data would require a huge amount of processing time and consume an excessive amount of power. By what means, then, can we efficiently and correctly recognize everyday human conversation, that is, spontaneous speech? This question has been answered by the development of novel technologies including sound modeling and search technology, process-downsizing technology, and embedding technology.

How is spontaneous speech recognized? - NEC speech recognition engine -

A small dictionary of everyday words contains a vocabulary of about 50,000 to 100,000 words, while a large dictionary of a nation's language holds more than 200,000 words. In terms of simply remembering such a vocabulary, machines are more adept than humans. However, to retrieve a word from such a huge collection of words in nearly real time, based on speech signals that differ greatly among individuals, we need an acoustic model that can effectively represent individual variation, and technology that, based on this model, can efficiently retrieve the correct word from the stored vocabulary. Specifically, we need technology that performs speech recognition by extracting feature data from a person's utterances, computing the distance between that data and previously prepared acoustic models, and using those results to compute the similarity between the utterances and the words and sentences in a dictionary (word matching).
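
As a rough illustration of this three-step flow (feature extraction, distance calculation against acoustic models, and word matching), the Python sketch below uses deliberately crude spectral features and represents each word's acoustic model as a single mean vector for brevity. All function and model names here are my own illustrative choices, not those of NEC's actual engine.

    import numpy as np

    def extract_features(samples, frame_len=400, hop=160):
        """Cut the waveform into overlapping frames and return crude spectral features."""
        frames = [samples[i:i + frame_len]
                  for i in range(0, len(samples) - frame_len + 1, hop)]
        return np.array([np.abs(np.fft.rfft(f))[:20] for f in frames])

    def acoustic_distance(features, model_mean):
        """Average distance between observed features and a stored acoustic sample (a mean vector)."""
        return float(np.mean(np.linalg.norm(features - model_mean, axis=1)))

    def recognize(samples, acoustic_models, dictionary):
        """Score each dictionary word by how well its acoustic sample matches the utterance."""
        features = extract_features(samples)
        scores = {word: acoustic_distance(features, acoustic_models[word])
                  for word in dictionary}
        return min(scores, key=scores.get)   # the closest word wins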

At the dawn of speech recognition technology, this series of processes was performed by registering a person's voice and words beforehand in a computer and employing a technique called dynamic programming (DP) matching. This was an extremely limited form of recognition technology, since it was capable of recognizing only preregistered words spoken by the person whose voice had been registered. In the second half of the 1980s, however, we achieved an explosive increase in the number of words and sentences that could be recognized through the use of a demi-syllable recognition technique. This technique originates in a method that recognizes speech not at the level of individual words but at the level of smaller units, i.e., kana (the Japanese syllabary). In addition to this technology, the development of the following technologies has brought us to a technical level where spontaneous speech can be recognized without having to specify a particular speaker or the words targeted for recognition.
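
For reference, DP matching compares an input feature sequence with a registered template while allowing the time axis to stretch and compress. The sketch below is a minimal dynamic-time-warping style implementation, assuming both sequences have already been converted to feature vectors; it illustrates the general technique, not NEC's original code.

    import numpy as np

    def dp_match(template, utterance):
        """DP (dynamic-time-warping style) match between two feature sequences.

        Both arguments are arrays of shape (frames, feature_dim);
        a smaller returned cost means a closer match.
        """
        n, m = len(template), len(utterance)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(template[i - 1] - utterance[j - 1])
                # allow the time axis to stretch or compress
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m]

    # Recognition then amounts to picking the registered word whose template fits best:
    # best_word = min(templates, key=lambda w: dp_match(templates[w], utterance_features))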


We call this set of speech recognition technologies a "speech recognition engine." It is a compact and scalable large-vocabulary continuous speech recognition engine that can be loaded on equipment ranging from mainframe computers to mobile devices. Our engine features an original "acoustic model" that plays an important part in the development of the matching algorithm, and an "information criterion" that provides an index for selecting suitable samples from a large number of candidates so as to achieve a compact acoustic model. These features underscore NEC's technical expertise and are a particular source of pride to us.

Preparing many sound samples enables a system to support many voices and to perform accurate recognition. For this reason, sound samples are generally prepared as an "acoustic model" in speech recognition technology. However, the greater the number of prepared samples, the harder it is to find a suitable sample in that large collection. For example, the sound of the vowel "a" and that of the consonant "k" can differ depending on the sounds that come before and after them, as well as on the person uttering them. Consequently, while it is necessary to prepare many samples to support variation in context and speakers, preparing a huge number of samples that covers all cases is prohibitive. Against this background, we decided to group together similar sound samples and create a parent sample to represent them. This parent sample provides an approximate representation of the features of its child sounds. The result of this process is a tree-structured model like the one shown in Fig. 1 for the case of "a" and "k". In the figure, "a1234" and "k1234" denote parent samples.
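
The idea can be pictured with the following sketch, in which a simple average stands in for the real parent-sample construction and the node names and random feature vectors are placeholders of my own:

    import numpy as np

    class SampleNode:
        """A node in a tree-structured acoustic model: leaves hold detailed
        context-dependent samples, a parent holds an approximate sample for its children."""
        def __init__(self, mean, children=None, label=""):
            self.mean = mean              # feature vector representing this sample
            self.children = children or []
            self.label = label

    def build_parent(children, label):
        """Create a parent sample (e.g. "a1234") that approximates its child samples."""
        parent_mean = np.mean([c.mean for c in children], axis=0)
        return SampleNode(parent_mean, children, label)

    # four context/speaker variants of the vowel "a" (random placeholders here)
    a_children = [SampleNode(np.random.randn(20), label=f"a{i}") for i in range(1, 5)]
    a_parent = build_parent(a_children, "a1234")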


Figure 1: Model compression by information criterion

To enable speech recognition based on this model on a compact terminal with relatively low CPU performance and small storage capacity, as few samples as possible must be selected without causing a drop in recognition performance. The index for making this selection is an "information criterion" that sums a value indicating how suitable the selected samples are, on the whole, for recognizing given data and a value indicating the number of selected samples. If recognition accuracy does not fall when using representative parent samples, the parent samples are selected; but if recognition accuracy deteriorates with the representative parents, child samples that represent more detailed sounds are used instead. In other words, easy-to-recognize sounds are matched against the parent model, which narrows down the number of candidates, while hard-to-recognize sounds are matched against the child models, which cover multiple candidate sounds.
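
A minimal sketch of how such a criterion might decide between a parent sample and its children is shown below, continuing the SampleNode tree above. The score (a fit cost plus a penalty per retained sample) and the penalty value are illustrative assumptions of mine, not NEC's actual formula.

    import numpy as np

    def fit_cost(node, data):
        """How poorly the node's representative sample explains the observed feature vectors."""
        return float(np.sum(np.linalg.norm(data - node.mean, axis=1)))

    def select_samples(node, data, size_penalty=5.0):
        """Keep the parent sample if it is good enough; otherwise descend to its children.

        score = (fit of the selected samples to the data) + size_penalty * (samples kept)
        """
        if not node.children or len(data) == 0:
            return [node]
        parent_score = fit_cost(node, data) + size_penalty
        dists = np.stack([np.linalg.norm(data - c.mean, axis=1) for c in node.children])
        nearest = dists.argmin(axis=0)            # closest child for each observation
        child_score = float(dists.min(axis=0).sum()) + size_penalty * len(node.children)
        if parent_score <= child_score:
            return [node]                         # easy case: one approximate sample suffices
        selected = []
        for i, child in enumerate(node.children):
            selected.extend(select_samples(child, data[nearest == i], size_penalty))
        return selected                           # hard case: keep the detailed child samples

    # e.g. select_samples(a_parent, np.random.randn(100, 20)) using the tree built above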

The development of this acoustic model and of the compression technique based on the information criterion has enabled us to minimize the processing required for analyzing sounds and performing matching. This, I believe, was a breakthrough in speech-recognition algorithms. The technology has allowed us to equip not only PCs but also compact terminals like PDAs with a Japanese-English bidirectional translation function for travelers at the 50,000-word level.

By what means can speech recognition be comfortably performed at low power on a portable terminal? - Parallel processing technology and acoustic look-ahead technology -

More recently, we have been working on the development of technology for executing speech recognition processing on multiple processors in parallel. To perform high-speed speech recognition at the level of tens of thousands of words on a compact device like a cell phone, processor performance must be high even when using a speech recognition engine like the one described above. At the same time, a high-performance processor consumes an enormous amount of power that can deplete a battery's charge in no time at all. In response to this problem, we turned our attention to the use of a multicore processor, which has more than one processor on a single chip. Multiple processors consume less power than a single high-performance processor, making a multicore processor more suitable for portable devices.

But even if multiple processors are used to perform a series of processes (such as calculating the distance between voice feature data and previously prepared acoustic samples and calculating the similarity between that data and dictionary words to perform word matching), failure to equalize the processing time allocated to each processor results in a bottleneck at the slowest processor, making the use of parallel processing meaningless. On investigating the time required for this series of processes, we found that word matching took longer than extracting voice features and calculating the distance between that data and previously prepared acoustic samples. In short, simply assigning the processes of "voice feature extraction", "distance calculation with acoustic samples", and "word matching" to different processors would create a situation in which two of the three processors would have to wait for the third to complete word matching.

We therefore came up with a way of shortening the time for word matching, the most time-consuming process. Our idea was to divide up the word-matching process by performing matching at the simple syllable level before performing matching at the word level. Instead of starting directly with matching at the word level, we would first calculate the more likely syllables and then use those results in matching at the word level. By developing this technology for performing preprocessing before the actual word-matching process (called "acoustic look-ahead technology"), we successfully divided the speech recognition processes equally among the following three processors, thereby maximizing the performance of a multicore processor.

For example, when attempting to recognize the word "konnichiwa", the following computational processes were allocated to three processors (a simplified code sketch follows the list):

  • Processor A: Voice feature extraction and distance calculations with acoustic samples
    This processor extracts voice (sound) features as voiceprint data and calculates the distance between that feature data and acoustic samples.
  • Processor B: Acoustic look-ahead
    This processor sequentially inputs the processing results of Processor A and calculates, in parallel, approximation scores against syllables such as "a", "ka", "sa", "ta", and "na" to determine similarity at a simple syllable level.
  • Processor C: Word-sequence matching
    This processor calculates similarity at the word level while referring to the syllables with high similarity as determined by Processor B. Again, using "konnichiwa" as an example of input speech, consider matching at the word level against the candidate words "konnichiwa" and "konbanwa". Without similarity processing at the syllable level by Processor B, it would be necessary to focus on the sound following "kon", the part shared by the two candidate words; in other words, similarity with the input sound would have to be calculated for both "ni" and "ba". In the proposed scheme, however, Processor B computes whether the input sound is closer to "ni" or "ba", and based on that result, Processor C has no need to calculate similarity for the word candidate "konbanwa", which has a low possibility of matching. In this way, we have shortened the time required for matching word sequences and divided all of the speech recognition processing equally among the three processors.
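
The sketch below mimics this three-way split in Python. It runs the stages sequentially for clarity (in the real system each stage runs on its own core), and the syllable set, candidate words, random "features", and pruning margin are placeholders of my own, not NEC's actual data or thresholds.

    import numpy as np

    SYLLABLES = ["ko", "n", "ni", "ba", "chi", "wa"]
    WORDS = {"konnichiwa": ["ko", "n", "ni", "chi", "wa"],
             "konbanwa":   ["ko", "n", "ba", "n", "wa"]}

    def processor_a(samples):
        """Voice feature extraction and distance calculation against acoustic samples."""
        return np.abs(np.random.randn(len(samples) // 160, 20))   # placeholder features

    def processor_b(features, acoustic_samples):
        """Acoustic look-ahead: an approximate score for each syllable over the utterance."""
        return {syl: -float(np.mean(np.linalg.norm(features - acoustic_samples[syl], axis=1)))
                for syl in SYLLABLES}                              # higher = more likely

    def processor_c(syllable_scores, prune_margin=1.0):
        """Word-sequence matching, skipping words whose syllables already look unlikely."""
        best = max(syllable_scores.values())
        likely = {s for s, v in syllable_scores.items() if v >= best - prune_margin}
        candidates = {w: syls for w, syls in WORDS.items() if set(syls) <= likely}
        # detailed word-level matching would run only on `candidates`;
        # e.g. "konbanwa" is dropped early if "ba" scored poorly in the look-ahead
        return candidates

    features = processor_a(np.random.randn(16000))                 # one second of (random) audio
    acoustic = {s: np.random.randn(20) for s in SYLLABLES}          # placeholder acoustic samples
    print(processor_c(processor_b(features, acoustic)))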


Figure 2: Speech recognition by a multicore processor* for cell phones

* MP series of application processors from NEC Electronics Corporation

What is an N-gram statistical language model?

An N-gram is a sequence of N words. Speech recognition research makes much use of bigrams and trigrams, that is, sets of two words and sets of three words, respectively. This technology statistically models language-related information such as "the particle 'wa' often follows the word 'kyou', and 'yoi' often precedes 'tenki', but 'wa' and 'yoi' are usually not connected." In the past, when speech recognition dealt with vocabularies of only a few dozen words, this kind of grammatical information was simply programmed into the system manually as needed. But today, with systems that deal with tens of thousands of words, there is no way such information can be programmed by hand. Instead, we learn the frequency at which sets of words appear from a large-scale text database and apply the knowledge so gained to speech recognition.
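
As a small illustration, bigram counts can be collected from word-segmented text and turned into conditional probabilities. The two-sentence corpus below is purely illustrative.

    from collections import Counter

    def ngram_counts(sentences, n=2):
        """Count how often each n-word sequence appears in a word-segmented corpus."""
        counts = Counter()
        for words in sentences:
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return counts

    # tiny illustrative corpus, already split into words
    corpus = [["kyou", "wa", "yoi", "tenki", "desu"],
              ["kyou", "wa", "ame", "desu"]]
    bigrams = ngram_counts(corpus, n=2)
    unigrams = ngram_counts(corpus, n=1)
    # P(wa | kyou) estimated as count("kyou wa") / count("kyou")
    p_wa_given_kyou = bigrams[("kyou", "wa")] / unigrams[("kyou",)]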

What is a tree-structured acoustic model?

In speech recognition, a sound is recognized by associating an input sound with a sound sample (an acoustic model) stored in the system. When attempting to recognize a variety of speaking styles, accuracy improves as the number of sound samples increases. Unfortunately, the amount of speech data required to prepare such a large number of samples is huge, and the resources (memory capacity, computational complexity) needed to perform recognition with them are large. At NEC, we have developed a method that can automatically determine the optimal way of preparing sound samples once a certain amount of speech data and the usable resources have been decided on. This method arranges sound samples in a tree structure, starting with a "root" that gives approximate features and extending to "branches" and "leaves" that represent detailed features, and selects an optimal level of samples from this structure according to the amount of speech data and the usable resources. This technology enables a good balance to be achieved between speech recognition performance and required resources. It has already enabled real-time recognition of an operator's speech on a typical desktop computer.
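
Continuing the SampleNode tree sketched earlier, selecting a level of samples under a resource limit can be pictured as choosing a "cut" through the tree. The refinement order below is simply breadth-first for brevity, whereas the actual method is guided by the information criterion and the available speech data described above.

    from collections import deque

    def select_cut(root, max_samples):
        """Choose a frontier ("cut") of the sample tree that fits a resource budget.

        Coarse parents are replaced by their more detailed children as long as the
        total number of selected samples stays within `max_samples`.
        """
        cut = [root]
        queue = deque([root])
        while queue:
            node = queue.popleft()
            if not node.children:
                continue
            if len(cut) - 1 + len(node.children) > max_samples:
                continue                  # refining this node would exceed the budget
            cut.remove(node)              # swap the coarse parent for its children
            cut.extend(node.children)
            queue.extend(node.children)
        return cut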

What is an autonomous speaker adaptation method?

Technology that uses the "tree-structured acoustic model" to register a speaker's utterances and efficiently learn the features of that speaker's voice from a small amount of speech data is called an "autonomous speaker adaptation method." In this method, the amount of speech data uttered by a new speaker determines how the sound samples stored in the tree-structured acoustic model are adapted to the features of that individual. Specifically, for sounds with only a small amount of data, the method adapts samples at a "root" level having approximate features, and for sounds with a sufficient amount of data, it adapts samples down to the "leaf" level having detailed features. In contrast to past methods, in which we had to manually adjust the level at which samples were to be adapted, this technique enables optimal adaptation according to the amount of available data without the need for human intervention.
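
A minimal sketch of this idea, again reusing the SampleNode tree from earlier: the data threshold and the simple interpolation used for adaptation are illustrative assumptions of mine, not the actual adaptation formula.

    import numpy as np

    MIN_FRAMES_FOR_LEAF = 50   # illustrative threshold, not NEC's actual setting

    def iter_leaves(node):
        """Yield the detailed "leaf" samples of the tree."""
        if not node.children:
            yield node
        else:
            for child in node.children:
                yield from iter_leaves(child)

    def find_parent(root, target):
        """Return the parent of `target` in the tree, or None if it has none."""
        for child in root.children:
            if child is target:
                return root
            found = find_parent(child, target)
            if found is not None:
                return found
        return None

    def adapt_node(node, speaker_data, weight=0.2):
        """Shift a stored sample toward the new speaker's data (simple interpolation)."""
        node.mean = (1 - weight) * node.mean + weight * np.mean(speaker_data, axis=0)

    def adapt_speaker(root, speaker_data_per_leaf):
        """Adapt coarse samples when data is scarce, detailed samples when it is plentiful."""
        for leaf in iter_leaves(root):
            data = speaker_data_per_leaf.get(leaf.label, [])
            if len(data) == 0:
                continue
            if len(data) >= MIN_FRAMES_FOR_LEAF:
                adapt_node(leaf, data)                              # enough data: adapt the leaf
            else:
                adapt_node(find_parent(root, leaf) or root, data)   # scarce data: adapt its parent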

What is conversational-sentence analysis technology?

In the speech recognition process, the association of input sounds with sound samples is followed by the recognition of word sequences using a "language model". This language model is prepared by dividing text data obtained from previously collected speech into words and subjecting them to statistical processing. Spoken language differs from the written language found in newspaper articles and elsewhere in that it includes unnecessary words like "oh" and "well" as well as polite expressions used in social settings. To deal with these special characteristics, we need conversational-sentence analysis technology. At NEC, we have extended text analysis techniques developed in the fields of text mining and automatic interpretation to the spoken language, realizing high-accuracy conversational-sentence analysis technology. This has enabled us to efficiently create an accurate language model from text data generated from speech.
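
As a toy example of the kind of preprocessing involved, the sketch below strips filler words from transcribed utterances before they are fed to n-gram counting. The filler list and helper names are my own illustrative choices; a real system would derive such knowledge from analyzed conversational text.

    import re

    # illustrative filler list; a real system would learn these from transcribed speech
    FILLERS = {"oh", "well", "uh", "um"}

    def normalize_utterance(text):
        """Clean one transcribed utterance before it is used to train the language model."""
        words = re.findall(r"[\w']+", text.lower())
        return [w for w in words if w not in FILLERS]

    def build_corpus(utterances):
        """Turn raw transcriptions into word sequences suitable for n-gram counting."""
        corpus = []
        for utterance in utterances:
            words = normalize_utterance(utterance)
            if words:
                corpus.append(words)
        return corpus

    # "Well, uh, today is a good day"  ->  ["today", "is", "a", "good", "day"]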