IBM Research

Think Research


 


Featured Concept
COVER STORY: Talking to Machines

By Eric J. Lerner

25 years of speech recognition research and accelerated ways of translating ideas into innovative offerings are paying off in new products

In Brief:

Fundamental understanding of speech recognition gleaned by Research over the past quarter-century is leading to products that satisfy customers and market needs. Recently, Research has helped to develop systems that recognize continuous speech in specialized applications, such as creation of radiologists' reports. Research has used further advances in continuous-speech recognition to create telephony applications. The ultimate goal is giving computers the ability to act on complex, naturally spoken queries and commands.


In Arthur C. Clarke's novel 2001: A Space Odyssey, the HAL 9000 computer that guides the Discovery spacecraft to the neighborhood of Saturn "became operational" on January 12, 1997. On that recent birthdate of the mythical machine that perfectly understood human speech, scientists in the human language technology group at IBM's Thomas J. Watson Research Center, headed by David Nahamoo, were preparing to celebrate a more realistic anniversary: 25 years of developing down-to-earth means of speech recognition. For while the day has not yet arrived in which any individual can go up to a computer, talk naturally to it in any language on any topic and be immediately understood, the team is making significant progress toward that goal.

Take voice recognition software such as IBM's VoiceType® Simply Speaking, which transforms users' dictation straight into written text without the intervention of a stenographer or a typist. Released late last November, Simply Speaking is fulfilling a strong market need for improvements in the human interface with computers - and doing so with remarkable success. In the first four months after its release, IBM's Speech Business Unit increased its sales by an order of magnitude compared with the previous year, says Jan Winston, who heads the unit.

Another IBM speech recognition product released last year, known as MedSpeak/Radiology, illustrates the broadening scope of the technology. The world's first continuous-speech recognition dictation and work-flow product, the system takes dictation from a radiologist examining a patient's X-ray and smoothly converts the comments into a written report. The doctor checks over the transcript, makes a few changes, adds an electronic signature and sends it on its way.

The technology "has been extremely well received by the health sector," reports Anne-Marie Derouault, a Paris-based executive for global marketing and sales in IBM's Speech Business Unit. Indeed, for hundreds of radiologists now using MedSpeak/Radiology, the ultimate goal of a generation of speech research is already here. It takes the form of continuous-speech, large-vocabulary, user-independent recognition.

MedSpeak is a manifestation of IBM's commitment to moving the results of speech recognition research rapidly into the marketplace, based on first-of-a-kind projects. The new approach closely links research, development, marketing and customers in specific niche markets to exploit current, near-future and more advanced levels of technological development. This substantially reduces the time from lab to market, and matches products more closely to users' needs.

The Year of Speech

More results of this new approach and all other efforts by the Speech Business Unit will appear at an increasing pace. Already in the pipeline are a specialized developer's toolkit that will permit software developers to use IBM's VoiceType technology in voice-activated browsers for the World Wide Web - such as VoiceType Connect for Netscape, IBM's voice-activated home director - and speech recognition systems for telephony applications. Small wonder that consulting company TMA Associates has predicted that 1997 will be "the year of speech."

That achievement will put IBM squarely at the center of almost every future information technology. "The best way to make computers universally usable is to get them to communicate the way people do - by talking and listening," explains Nahamoo. "In my view, the revolutionary potential of the Internet can be fully realized when it harnesses the power of speech technology. We are now capable of making PCs, the Internet and, soon, your TV and microwave a lot easier to operate." Adds Derouault, "Speech recognition will force Internet technology to evolve, to accommodate new users who have not been able to use computers in the past."

IBM's focus on introducing speech recognition into the marketplace views research and development as a continuous process aimed at the eventual marketing of products. "We constantly evaluate the spectrum of our R&D efforts in speech and natural language areas across Research and Speech Business Unit development teams," explains Nahamoo. "Our goal is to strike a healthy balance between short-term products and delivery within one year, longer-term technology development over two to five years, and adventurous efforts whose time frame and payoff are naturally unclear."

The need for a more rapid approach to product development based on a tighter coupling of research and development with marketing became increasingly evident as the field reached a critical point with the release of IBM's first dictation system in 1992 and its Personal Dictation System for personal computers in 1993. In 1995, the company merged all aspects of its speech technology into the Speech Business Unit, whose aim was to coordinate the development of speech technologies and products within IBM.

"We formed a virtual worldwide business as a corporate business unit," explains Winston, "and it involves close teamwork between individuals in research, development, production, marketing, sales and finance. David Nahamoo's team in Research is a critical and valued element of IBM's Speech Business Unit."

Discrete Speech Recognition

The company's effort in speech recognition encompasses two key technologies: discrete speech, in which short pauses separate each word; and continuous speech, which we use in normal conversation. IBM products in speech recognition are introduced into the market based on these two styles of speaking. The products are aimed at text-entry applications - for example, dictation for email and word processing - and query and transaction processing, such as command and control for menu navigation of desktop applications.

Discrete speech recognition is the most mature technology, with general-use products already on the market. The most recent are IBM's VoiceType 3.0 and Simply Speaking, discrete-speech general-vocabulary dictation systems intended for use on Pentium®-based personal computers. These products are available for American English, UK English (treated as a separate language), French, German, Spanish, Italian and Japanese.

"The advance of VoiceType 3.0 over the earlier products is that it is speaker-independent and doesn't need a special audio adapter card; it's just software," explains Eddie Epstein, who participated in developing the system. Previous products required fairly extensive "training" - an hour or more of reading from a prepared text - to get used to a speaker's voice. VoiceType 3.0 reduces training to 15 to 20 minutes in most cases. Speakers with thick accents and dialects might still require the full one-hour training.

Relying on its highly accurate recognition performance, this technology separates correction from dictation. Once dictation is complete, the user corrects the manuscript, and can command a playback of what was actually said. This unique feature allows users to do visual tasks, such as analyzing X-ray pictures, while simultaneously dictating a report. The corrections allow the system to continue to train itself, with the result that accuracy improves with use.

The Holy Grail

Large-vocabulary continuous-speech recognition has always been the Holy Grail of this field. After all, people do speak continuously. Only because the computational task of accurately recognizing continuous-speech style proved extremely difficult did researchers pursue discrete speech recognizers, for which the job is much easier.

The basic problem is that, in speech, one sound modifies the sound that comes after it. The slurring together of words in continuous speech multiplies greatly the ways in which these modifications occur, and hence the number of possibilities that a computer must examine to determine what the speaker has said. This phenomenon not only adversely impacts the separability of different sounds, thereby increasing the number of errors, but also increases the amount of processing power and memory required for continuous-speech recognition. In addition, the computer must determine how to break continuous speech into words, a task that the pauses between words already accomplish in discrete speech.

While discrete speech is easier on computers, it's harder on people. Most individuals are not trained to dictate well enough for a stenographer or a typist, let alone a computer. Thus continuous-speech recognition, leading to general-purpose dictation systems, remains a high priority of research, and progress towards that goal is now steady and fairly rapid.

From the start, IBM took a statistical approach to speech recognition. Instead of trying to figure out theoretically how people translate sound into meaningful words, this approach uses vast amounts of data to correlate features in speech with the basic sounds of a language. "Right now, we group sound into three thousand different units, based on their characteristic combinations of frequencies," says Watson researcher Lalit Bahl. "But to get better accuracy in continuous-speech recognition, we need to make finer distinctions among sounds. That means even more data and even more computer speed."

Bahl and his colleagues have experimented with different ways of analyzing sound, as well. The conventional method breaks up a sound according to its various frequencies, using a technique called Fourier analysis, which treats a sound as a sum of sine and cosine functions. A newer method based on wavelets, which treats sound as composed of pulses of various lengths, is showing some promise for better distinguishing closely related sounds.
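
As a rough illustration of the conventional method, the following Python sketch applies a direct discrete Fourier transform to a short synthetic signal and finds its strongest frequency components. The signal, the function name dft_magnitudes and the window length are assumptions made up for this example; they do not come from IBM's recognizers, which work on real speech with far more elaborate processing.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each frequency component, via a direct discrete Fourier transform."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

# Hypothetical signal: 64 samples mixing 4 and 10 cycles per window.
n = 64
signal = [
    math.sin(2 * math.pi * 4 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
    for t in range(n)
]

mags = dft_magnitudes(signal)

# Indices of the two strongest components, i.e. the frequencies present in the signal.
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
print(sorted(strongest))
```

The two strongest components sit at the two frequencies that were mixed into the signal, which is the sense in which Fourier analysis "breaks up" a sound.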

Going, going . . .

Once the computer distinguishes the sounds, it has to figure out how they are combined into meaningful words. Again statistics help in the IBM approach, since the computer knows the relative frequency of all groups of three words in the language being recognized. For example, it knows that in English "going to go" is more frequent than "going, too, go" and far more frequent than "going two go."
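
The three-word statistics can be sketched in a few lines of Python. The toy corpus and the function names below are invented for illustration and are not IBM's data or code; a real system would estimate its counts from a very large body of text and smooth them so that unseen word triples do not get a count of zero.

```python
from collections import Counter

def trigram_counts(sentences):
    """Count every run of three consecutive words in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

# Hypothetical toy corpus, invented for this example.
corpus = [
    "i am going to go home",
    "she is going to go now",
    "we are going to go too",
    "they are going too",
]

counts = trigram_counts(corpus)

def pick_most_frequent(candidates, counts):
    """Return the candidate word triple seen most often in the corpus."""
    return max(candidates, key=lambda c: counts[tuple(c.split())])

# Three transcriptions that sound alike but differ in spelling.
candidates = ["going to go", "going too go", "going two go"]
best = pick_most_frequent(candidates, counts)
print(best)
```

Because only the first candidate ever occurs in the corpus, the counts alone are enough to choose among transcriptions that sound identical.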

Nevertheless, in continuous speech, a huge number of possibilities still remain to be searched and compared with the uttered sounds to determine the highest probability. So researchers are looking at new and faster methods of pruning search trees to obtain the most likely alternatives.
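
The article does not say which pruning methods the researchers use. One widely used strategy is beam search, which keeps only a fixed number of the best partial hypotheses at each step. The Python sketch below shows it under stated assumptions: the candidate words, the probabilities and the function names are all hypothetical.

```python
import math

def beam_search(steps, score, beam_width):
    """Keep only the beam_width best partial word sequences at each step.

    steps: list of lists of candidate words, one list per word position.
    score: function(previous_words, word) -> log-probability of word.
    """
    beams = [((), 0.0)]  # (word sequence so far, total log-probability)
    for candidates in steps:
        expanded = [
            (seq + (word,), logp + score(seq, word))
            for seq, logp in beams
            for word in candidates
        ]
        # Pruning: discard all but the highest-scoring hypotheses.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams

# Hypothetical candidates for three word positions.
steps = [["going"], ["to", "too", "two"], ["go"]]

# Hypothetical word-pair probabilities, invented for this example.
probs = {
    ("going", "to"): 0.90, ("going", "too"): 0.09, ("going", "two"): 0.01,
    ("to", "go"): 0.50, ("too", "go"): 0.01, ("two", "go"): 0.01,
}

def score(seq, word):
    """Log-probability of word given the previous word (0.0 for the first word)."""
    if not seq:
        return 0.0
    return math.log(probs.get((seq[-1], word), 1e-6))

results = beam_search(steps, score, beam_width=2)
best_seq, best_logp = results[0]
print(" ".join(best_seq))
```

With a beam width of 2, the least likely of the three middle words is dropped before the third word is considered, so fewer complete sequences are ever scored. The trade-off is that a beam that is too narrow can discard the correct answer early.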

In one project to test and develop continuous-speech technologies, originally initiated by the Defense Advanced Research Projects Agency, researchers are attempting to transcribe broadcast radio news reports. "This is a huge free database," explains Ponani Gopalakrishnan, "and it's a good test because it tends to be worst-case material; broadcasts often have background noise, music and static that are much worse than the conditions in a quiet office. So if we can get this right, we know we can do general dictation."

One technique emerging from this work uses pauses in speech to detect and then filter out background sounds such as music or static. Another uses certain sounds from a speaker, such as vowels, to further refine the analysis. "Right now we are getting a 20 to 30 percent error rate," says Gopalakrishnan. "But with the present progress, we should be down to 10 to 15 percent in two years and at acceptable rates for real transcription products for this type of material in perhaps four years."

A Task of Low Perplexity

By late 1994, the effort to develop large-vocabulary, continuous-speech recognition in American English had progressed to the point at which James McGroddy, then IBM's senior vice president for research, began to ask scientists how the company could exploit the work in the market. The technology was not ready for an application in general vocabulary dictation, because its accuracy was too low.

Ordinary speech or written text is unpredictable, since the computer has no way of knowing if the user is dictating a business letter, a love letter, a poem or a scientific treatise. In these examples, the probability that one word follows another - the context that is vital to recognition accuracy - is low, on average. What was needed was a more limited, and therefore more predictable, application - a task with a reduced average number of choices of words that could follow each word in the vocabulary, technically referred to as a low-perplexity task. "The choice of transcribing radiologists' dictated reports on X-rays and other images was obvious, owing to the low-perplexity behavior of these reports and the availability of a research prototype as a proof of concept," recalls Nahamoo.
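
Perplexity can be made concrete with a short Python sketch. It uses the standard information-theoretic definition, the inverse of the geometric mean of the probabilities a model assigns to each word, which the article does not spell out. The probabilities below are invented for illustration and are not measurements from radiology reports or any IBM system.

```python
import math

def perplexity(word_probabilities):
    """Perplexity of a word sequence, given the model's probability for each word."""
    n = len(word_probabilities)
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / n)

# Hypothetical model probabilities, invented for illustration.
formulaic_report = [0.5, 0.5, 0.25, 0.5]      # few plausible next words
open_dictation = [0.01, 0.02, 0.005, 0.01]   # many plausible next words

low = perplexity(formulaic_report)
high = perplexity(open_dictation)
print(round(low, 2), round(high, 2))
```

Under this definition, a model that considers k words equally likely at every position has a perplexity of exactly k, which matches the article's description of perplexity as the average number of choices of words that could follow each word.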

Conventionally, radiologists dictate their findings into a tape recorder and send them to a transcriptionist for typing. Turnaround times to get transcripts back for correction and signature can range from hours to days. The new automatic radiology dictating system would allow radiologists to dictate into a computer and then correct and sign the report on the spot. This would enormously speed up this process, while saving the expense of the transcriptionist.

Not only did a need plainly exist, but it could also be defined in a limited framework. For while they require a vocabulary of 25,000 words, radiology reports are predictable. Like any medical document, they are laden with formulae that are repeated or slightly varied: "The (name of organ) appears normal." "There is evidence of (name of a medical effect) on the (name of organ)." This predictability made the speech recognition problem manageable with existing technology.

It took about six months before research, Health Industry Solutions, marketing, and development teams were in place to start product development. The team was able to get the product on the market in less than a year. Working closely with radiologists at Memorial Sloan-Kettering Cancer Center in New York and Massachusetts General Hospital in Boston, the team worked out the bugs in the system and adapted its user interfaces to the needs of the final users, the radiologists themselves.

The result is the first real-time continuous-speech dictation system on the market. Running on a PentiumPro® 200 MHz personal computer, it requires no training (although a 15-minute training session can improve accuracy). The software costs $4,500 and the hardware around $5,000. Savings on transcription can pay for the expenditure within a year.

Radiologists are very enthusiastic about the system. "From my standpoint of having responsibility for generating over 400,000 radiology reports each year, I can only call the IBM MedSpeak/Radiology system a spectacular breakthrough," says Dr. James H. Thrall, radiologist in chief at Massachusetts General Hospital.

Speech Recognition for Telecommunications

Other products that use the present limited continuous-speech recognition technology are moving rapidly toward the market. Again, the goal is to define applications with a limited number of choices. An obvious one is an automated phone-dialing system.

In the IBM Autodialer prototype, a list of names from a large company is entered into the computer, along with their pronunciations. An individual who calls the company can use the system to ask for the person needed by name; the recognizer identifies the name, repeats it back and dials the extension. "Right now we have 95 percent accuracy on the first try," explains Watson's Ken Davies. "If the machine doesn't get it right, the user repeats the name and then we get almost 100 percent accuracy, partly through the machine's adaptation from the first try and partly due to the multiple trial improvement in the hit rate." IBM plans to introduce this product shortly for use throughout the company.
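
The "multiple trial" part of that improvement can be checked with simple arithmetic. The Python sketch below assumes, purely for illustration, that each try succeeds independently with the same 95 percent accuracy. That independence is an assumption and not a claim from the article, and the sketch ignores the adaptation effect Davies mentions.

```python
def success_within(tries, per_try_accuracy):
    """Chance that at least one of several independent tries is recognized correctly."""
    return 1 - (1 - per_try_accuracy) ** tries

one_try = success_within(1, 0.95)
two_tries = success_within(2, 0.95)
print(f"{one_try:.2%} {two_tries:.2%}")
```

Under that assumption, a second try alone would raise the success rate from 95 percent to about 99.75 percent, consistent with the "almost 100 percent" figure.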

Slightly further from final development is a voice-activated Yellow Pages. "We are attempting to build a system that will enable people to extract essentially the same information that they get from human operators using speech recognition technology," says Watson scientist David Lubensky.

The system will work via spoken hierarchical menus. Searching for a Hunan restaurant, for example, a novice user can obtain successive choices in the broad categories of entertainment, restaurants and Chinese restaurants before obtaining Hunan selections. Experienced users can "barge in" over the hierarchical structure. By saying "Hunan restaurants," the user will immediately obtain the Hunan listings, thereby minimizing the time to find the number. The system will play the first five or 10 entries, or more on request. Once the inquirer has identified the needed number, the system will dial it automatically.
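
The two ways of reaching a listing can be sketched in Python as a walk down a category tree versus a direct search of it. The tree, the placeholder listings and the function names are hypothetical, and the sketch leaves out speech recognition entirely; it only shows the menu logic.

```python
# Hypothetical category tree with placeholder listings, invented for illustration.
MENU = {
    "entertainment": {
        "restaurants": {
            "chinese restaurants": {
                "hunan restaurants": ["Listing A", "Listing B"],
            },
        },
    },
}

def navigate(tree, choices):
    """Follow one spoken choice per menu level (the novice path)."""
    node = tree
    for choice in choices:
        node = node[choice]
    return node

def find_category(tree, name):
    """Search the whole tree for a category by name (the "barge-in" path)."""
    for key, child in tree.items():
        if key == name:
            return child
        if isinstance(child, dict):
            found = find_category(child, name)
            if found is not None:
                return found
    return None

novice = navigate(
    MENU,
    ["entertainment", "restaurants", "chinese restaurants", "hunan restaurants"],
)
expert = find_category(MENU, "hunan restaurants")
print(novice == expert)
```

Both paths end at the same listings; the experienced user simply gets there with one utterance instead of four.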

A development and marketing team from IBM's Telecommunications and Media Industry Solutions Unit, based in Fort Lauderdale, is actively pursuing deployment of telephony speech recognition technology to provide advanced solutions to telephone companies in areas such as enhanced services, directory assistance and dialing services.

Understanding What's Said

For the future, says Lubensky, the obvious progression will be to incorporate natural language understanding into the system - for example, permitting inquirers to ask for "Chinese restaurants in White Plains." That will require the computer to convert English sentences into a standard database query format. This type of conversion is another strong area of basic research at IBM.

"A basic problem of transforming natural language into formal language is to parse the sentence," says Watson's Salim Roukos, a key researcher in this area. "As in the case of speech recognition, we use statistical techniques, feeding the computer large databases of triplets of natural sentences, their parsings and their meanings, and derive from that statistical links between word orders and placements and final meanings."

A test system now under development uses natural language understanding for an airline schedule database. In response to a spoken request such as "I want to fly from Boston to Atlanta," the computer provides a list of Boston-to-Atlanta flights. At present, the system can recognize and correctly interpret some 5,000 sentences. A much more content-rich application such as natural-language conversational Yellow Pages is still in the future, since it needs to recognize a considerably broader variety of requests.

Also under study is the concept of incorporating speech recognition technology into products under development by IBM's subsidiary Lotus. Late last year, researchers at Watson and Lotus linked up in a project to incorporate speech technology into Kona - a series of applications such as spreadsheets, word processors and e-mail written in Java for use on the World Wide Web. "We were able to put together a sample demo using a simple microphone fairly quickly," says Watson research staff member Bruce Lucas. "Incorporating speech will differentiate the technology from such applications currently available in the market."

Another effort focuses on "collagen," which Lotus researcher Candy Sidner describes as "a fairly rich model of collaborative discourse that makes it possible to communicate with an agent about a task that the agent will carry out for the user." At present, such communication is restricted to the traditional keyboard and mouse. But preliminary work is under way to permit users to communicate with collagen agents via speech. "We're doing it for Java-based applications, and looking at using it in e-mailing, calendars and scheduling," says Sidner.

With the new organization that teams up research, development and marketing, IBM expects to roll out a steady stream of new speech recognition products. "We can not only make a market, but also differentiate IBM's offerings and improve their value to our customers," says Toby Maners, manager of worldwide speech products and business management. "Today we have dictation, command and control and text-to-speech. Tomorrow we will move further into the food chain of human interaction, with dialog and natural language understanding."

Just as important, says Mike Shannon, chief financial officer of the Speech Business Unit, the new offerings will be targeted at the mass consumer market. During the next five years or so, full continuous-speech recognition will be increasingly integrated into standard personal computer products and programs, making keyboarding a matter of choice rather than a necessity for everyone. In the longer run, as natural language understanding matures, the science fiction dream of computers like HAL that can act on complex, naturally spoken commands may well be realized.


Eric Lerner is a freelance science writer based in Lawrenceville, New Jersey.


More Information:

Coping with Chinese

Tools For Developers

IBM Speech Products


Coping with Chinese

To recognize some languages, continuous speech is the only possible approach, since those languages don't lend themselves to pauses between words. Mandarin, the most widely spoken language in China, provides an example. It is written without white space between characters, and contains several words that consist of two or more subwords, such as "bookmark." Since no basic standard exists on where to break such words in speech, explains Michael Picheny, a scientist at the Thomas J. Watson Research Center, different speakers say the words in different ways, choosing either to break the subwords or to run them on.

"Mandarin presents other special problems of its own," adds Picheny, "since like other Chinese languages it is tonal." That means that words change meaning depending on the tone pattern or voice pitch used. Mandarin has four basic tones: high, rising, dipping and falling.

Because of the tonality, researchers at Watson and the China Research Laboratory in Beijing, who were working on a Mandarin speech recognizer, had to use a set of measures different from those for recognizers of Western languages. However, with the basic set of statistical tools already developed for other languages, the team was able to develop a large-vocabulary continuous-speech Mandarin prototype in less than a year. An effort is currently under way to incorporate this technology into a marketable product.


Tools For Developers

If voice recognition is to become widespread, it can't be limited to applications developed by IBM. To help software vendors incorporate IBM technology into their products, researchers at the Thomas J. Watson Research Center and development teams in the Worldwide Speech Business Unit are developing a "toolkit" for speech recognition that the company will supply to vendors and will continually update. "What we are providing is the basic engine that does the speech recognition, and help for those who want to integrate it into an application," says Watson's Steve De Gennaro, who helps to drive the technology.

New toolkit releases will occur regularly, every 3 to 6 months. Each release will include developer's programming tools and sample programs, advice on how to use speech recognition in products and, as needed, marketing support services. IBM is already distributing information on the toolkits through a Web page.

"We'll be distributing three levels of software," explains De Gennaro. "Alpha is the early view of the newest stuff, on which we'll be soliciting feedback from the vendors. Beta is more developed and Golden is stable, finished technology." While the basic information on the technology will be distributed for free, support services will of course carry a fee, and any use of the technology in a product will require a license from IBM.

For more information see: http://www.software.ibm.com/voicetype


IBM Speech Products
IBM offers the most advanced selection of speech products available in the market today. Available in multiple languages and for a variety of operating systems, these products can also be enhanced with the addition of vocabularies available separately for the fields of law, medicine and journalism.
  • IBM VoiceType Simply Speaking for Windows 95 is a discrete-speech recognition program that allows users to dictate 70 to 100 words per minute into speech-aware applications with a greater than 90 percent accuracy rate for most people out of the box, and 95 percent after a brief training period.
  • IBM VoiceType Dictation 3.0 for Windows 95 adds to the capabilities of Simply Speaking the ability to navigate menus by voice and other features, such as direct dictation in Microsoft Word®. Versions are also available for OS/2 and Windows 3.1. The OS/2 Warp 4.0 version is integrated into the operating system, simplifying its use with applications.
  • IBM VoiceType Developers Toolkit 3.0 for Windows 95 helps application developers write speech-aware applications.
  • MedSpeak/Radiology is the world's first real-time continuous-speech dictation system. Its 25,000-word vocabulary allows radiologists to automatically dictate reports on X-rays and other images into a computer, thereby saving the time and expense of transcription from a tape.



