25 years of speech recognition research and accelerated ways of
translating ideas into innovative offerings are paying off in new products.
In Brief:
Fundamental understanding of speech recognition gleaned by Research over the
past quarter-century is leading to products that satisfy customers and market
needs. Recently, Research has helped to develop systems that recognize
continuous speech in specialized applications, such as creation of radiologists'
reports. Research has used further advances in continuous-speech recognition to
create telephony applications. The ultimate goal is giving computers the
ability to act on complex, naturally spoken queries and commands.
In Arthur C. Clarke's novel 2001: A Space Odyssey, the HAL 9000
computer that guides the Discovery spacecraft to the neighborhood of Saturn "became
operational" on January 12, 1997. On that recent birthdate of the mythical
machine that perfectly understood human speech, scientists in the human language
technology group at IBM's Thomas J. Watson Research Center, headed by David
Nahamoo, were preparing to celebrate a more realistic anniversary: 25 years of
developing down-to-earth means of speech recognition. For while the day has not
yet arrived in which any individual can go up to a computer, talk naturally to
it in any language on any topic and be immediately understood, the team is
making significant progress toward that goal.
Take voice recognition software such as IBM's VoiceType® Simply
Speaking, which transforms users' dictation straight into written text
without the intervention of a stenographer or a typist. Released late last
November, Simply Speaking is fulfilling a strong market need for improvements in
the human interface with computers - and doing so with remarkable success.
In the first four months after its release, IBM's Speech Business Unit
increased its sales by an order of magnitude compared with the previous year,
says Jan Winston, who heads the unit.
Another IBM speech recognition product released last year, known as
MedSpeak/Radiology, illustrates the broadening scope of the technology. The
world's first continuous-speech recognition dictation and work-flow product, the
system takes dictation from a radiologist examining a patient's X-ray and
smoothly converts the comments into a written report. The doctor checks over the
transcript, makes a few changes, adds an electronic signature and sends it on
its way.
The technology "has been extremely well received by the health sector,"
reports Anne-Marie Derouault, a Paris-based executive for global marketing and
sales in IBM's Speech Business Unit. Indeed, for hundreds of radiologists now
using MedSpeak/Radiology, the ultimate goal of a generation of speech research
is already here. It takes the form of continuous-speech, large-vocabulary,
user-independent recognition.
MedSpeak is a manifestation of IBM's commitment to moving the results of
speech recognition research rapidly into the marketplace, based on
first-of-a-kind projects. The new approach closely links research, development,
marketing and customers in specific niche markets to exploit current,
near-future and more advanced levels of technological development. This
substantially reduces the time from lab to market, and matches products more
closely to the users' needs.
The Year of Speech
More results of this new approach and all other efforts by the Speech
Business Unit will appear at an increasing pace. Already in the pipeline are a
specialized developer's toolkit that will permit software developers to use
IBM's VoiceType technology in voice-activated browsers for the World Wide Web -
such as VoiceType Connect for Netscape, IBM's voice-activated home director -
and speech recognition systems for telephony applications. Small wonder that
consulting company TMA Associates has predicted that 1997 will be "the year
of speech."
That achievement will put IBM squarely at the center of almost every future
information technology. "The best way to make computers universally usable
is to get them to communicate the way people do - by talking and listening,"
explains Nahamoo. "In my view, the revolutionary potential of the Internet
can be fully realized when it harnesses the power of speech technology. We are
now capable of making PCs, the Internet and, soon, your TV and microwave a lot
easier to operate." Adds Derouault, "Speech recognition will force
Internet technology to evolve, to accommodate new users who have not been able
to use computers in the past."
IBM's approach to introducing speech recognition into the marketplace treats
research and development as a continuous process aimed at the eventual marketing
of products. "We constantly evaluate the spectrum of our R&D efforts in
speech and natural language areas across Research and Speech Business Unit
development teams," explains Nahamoo. "Our goal is to strike a healthy
balance between short-term products and delivery within one year, longer-term
technology development over two to five years, and adventurous efforts whose
time frame and payoff are naturally unclear."
The need for a more rapid approach to product development based on a tighter
coupling of research and development with marketing became increasingly evident
as the field reached a critical point with the release of IBM's first dictation
system in 1992 and its Personal Dictation System for personal computers in 1993.
In 1995, the company merged all aspects of its speech technology into the Speech
Business Unit, whose aim was to coordinate the development of speech
technologies and products within IBM.
"We formed a virtual worldwide business as a corporate business unit,"
explains Winston, "and it involves close teamwork between individuals in
research, development, production, marketing, sales and finance. David Nahamoo's
team in Research is a critical and valued element of IBM's Speech Business Unit."
Discrete Speech Recognition
The company's effort in speech recognition encompasses two key technologies:
discrete speech, in which short pauses separate each word; and continuous
speech, which we use in normal conversation. IBM products in speech recognition
are introduced into the market based on these two styles of speaking. The
products are aimed at text-entry applications - for example, dictation for
email and word processing - and query and transaction processing, such as
command and control for menu navigation of desktop applications.
Discrete speech recognition is the more mature technology, with general-use
products already on the market. The most recent are IBM's VoiceType 3.0 and
Simply Speaking, discrete-speech general-vocabulary dictation systems intended
for use on Pentium®-based personal computers. These products are available
for American English, UK English (treated as a separate language), French,
German, Spanish, Italian and Japanese.
"The advance of VoiceType 3.0 over the earlier products is that it is
speaker-independent and doesn't need a special audio adapter card; it's just
software," explains Eddie Epstein, who participated in developing the
system. Previous products required fairly extensive "training" -
an hour or more of reading from a prepared text - to get used to a
speaker's voice. VoiceType 3.0 reduces training to 15 to 20 minutes in most
cases. Speakers with thick accents and dialects might still require the full
one-hour training.
Relying on its highly accurate recognition performance, this technology
separates correction from dictation. Once dictation is complete, the user
corrects the manuscript, and can command a playback of what was actually said.
This unique feature allows users to do visual tasks, such as analyzing X-ray
pictures, while simultaneously dictating a report. The corrections allow the
system to continue to train itself, with the result that accuracy improves with
use.
The Holy Grail
Large-vocabulary continuous-speech recognition has always been the Holy
Grail of this field. After all, people do speak continuously. Only because the
computational task of accurately recognizing continuous-speech style proved
extremely difficult did researchers pursue discrete speech recognizers, for
which the job is much easier.
The basic problem is that, in speech, one sound modifies the sound that
comes after it. The slurring together of words in continuous speech multiplies
greatly the ways in which these modifications occur, and hence the number of
possibilities that a computer must examine to determine what the speaker has
said. This phenomenon not only adversely impacts the separability of different
sounds, thereby increasing the number of errors, but also increases the amount
of processing power and memory required for continuous-speech recognition. In
addition, the computer must determine how to break continuous speech into words,
a task that the pauses between words already accomplish in discrete speech.
While discrete speech is easier on computers, it's harder on people. Most
individuals are not trained to dictate well enough for a stenographer or a
typist, let alone a computer. Thus continuous-speech recognition, leading to
general-purpose dictation systems, remains a high priority of research, and
progress towards that goal is now steady and fairly rapid.
From the start, IBM took a statistical approach to speech recognition.
Instead of trying to figure out theoretically how people translate sound into
meaningful words, this approach uses vast amounts of data to correlate features
in speech with the basic sounds of a language. "Right now, we group sound
into three thousand different units, based on their characteristic combinations
of frequencies," says Watson researcher Lalit Bahl. "But to get better
accuracy in continuous-speech recognition, we need to make finer distinctions
among sounds. That means even more data and even more computer speed."
Bahl and his colleagues have experimented with different ways of analyzing
sound, as well. The conventional method breaks up a sound according to its
various frequencies, using a technique called Fourier analysis, which treats a
sound as a sum of sine and cosine functions. A newer method based on wavelets,
which treats sound as composed of pulses of various lengths, is showing some
promise for better distinguishing closely related sounds.
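As a rough illustration of the Fourier idea described above, the sketch below breaks a synthetic signal into its sine and cosine components with a plain discrete Fourier transform. The sample rate, the two test tones and the function name are assumptions made up for this example; it is not a description of the analysis used in IBM's recognizers.

```python
import math

def dft_magnitudes(samples):
    """Naive discrete Fourier transform: magnitude of each frequency bin.

    Each bin correlates the signal with a cosine and a sine of one frequency.
    """
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        re = sum(s * math.cos(2 * math.pi * k * t / n) for t, s in enumerate(samples))
        im = -sum(s * math.sin(2 * math.pi * k * t / n) for t, s in enumerate(samples))
        magnitudes.append(math.hypot(re, im) / n)
    return magnitudes

SAMPLE_RATE = 800  # Hz, hypothetical value chosen so the tones fall on exact bins
N = 80             # number of samples, giving 10 Hz per bin

# Synthetic "sound": a 100 Hz tone plus a weaker 250 Hz tone.
signal = [
    math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 250 * t / SAMPLE_RATE)
    for t in range(N)
]

mags = dft_magnitudes(signal)
bin_hz = SAMPLE_RATE / N
top_two = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
strongest_hz = sorted(k * bin_hz for k in top_two)
print(strongest_hz)
```

The two largest bins correspond to the two tones that were mixed in, which is the sense in which the transform "breaks up a sound according to its various frequencies."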
Going, going . . .
Once the computer distinguishes the sounds, it has to figure out how they
are combined into meaningful words. Again statistics help in the IBM approach,
since the computer knows the relative frequency of all groups of three words in
the language being recognized. For example, it knows that in English "going
to go" is more frequent than "going, too, go" and far more
frequent than "going two go."
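The word-triplet idea can be sketched in a few lines. The toy corpus, the function names and the crude scoring rule below are all assumptions for illustration; a real system would estimate probabilities from vastly more text.

```python
from collections import Counter

def trigram_counts(sentences):
    """Count every run of three consecutive words in the training sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

def best_transcription(candidates, counts):
    """Pick the candidate whose word triplets were seen most often.

    One is added to every count so that an unseen triplet does not
    force the whole score to zero. This is a crude score, not a
    normalized probability.
    """
    def score(candidate):
        words = candidate.lower().split()
        total = 1.0
        for i in range(len(words) - 2):
            total *= counts[tuple(words[i:i + 3])] + 1
        return total
    return max(candidates, key=score)

# Tiny made-up corpus standing in for a large body of English text.
corpus = [
    "we are going to go home",
    "she is going to go now",
    "they are going to go too",
]
counts = trigram_counts(corpus)

# Three spellings that sound alike to the acoustic front end.
candidates = ["going to go", "going too go", "going two go"]
choice = best_transcription(candidates, counts)
print(choice)
```

Because only "going to go" ever appears in the corpus, it outscores the two sound-alike candidates.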
Nevertheless, in continuous speech, a huge number of possibilities still
remain to be searched and compared with the uttered sounds to determine the
highest probability. So researchers are looking at new and faster methods of
pruning search trees to obtain the most likely alternatives.
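The article does not say which pruning methods the researchers use. One widely known approach is beam search, which keeps only the few best partial hypotheses at each step. The sketch below assumes made-up per-position word probabilities and a hypothetical `beam_search` function.

```python
import math

def beam_search(steps, beam_width):
    """Extend hypotheses one word at a time, keeping only the best few.

    steps: list of dicts mapping a candidate word to its probability
           at that position (illustrative numbers, treated as independent).
    Returns the surviving (words, log_probability) pairs, best first.
    """
    beam = [((), 0.0)]
    for options in steps:
        expanded = [
            (words + (word,), logp + math.log(p))
            for words, logp in beam
            for word, p in options.items()
        ]
        expanded.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = expanded[:beam_width]  # prune everything outside the beam
    return beam

# Hypothetical candidate words for three positions in an utterance.
steps = [
    {"going": 0.9, "growing": 0.1},
    {"to": 0.6, "too": 0.3, "two": 0.1},
    {"go": 0.8, "goat": 0.2},
]

survivors = beam_search(steps, beam_width=2)
best_words, best_logp = survivors[0]
print(" ".join(best_words), round(math.exp(best_logp), 3))
```

A narrower beam means less work but a greater risk of discarding the correct hypothesis early, which is the trade-off any pruning scheme has to manage.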
In one project to test and develop continuous-speech technologies,
initiated by the Defense Advanced Research Projects Agency,
researchers are attempting to transcribe broadcast radio news reports. "This
is a huge free database," explains Ponani Gopalakrishnan, "and it's a
good test because it tends to be worst-case material; broadcasts often have
background noise, music and static that are much worse than the conditions in a
quiet office. So if we can get this right, we know we can do general dictation."
One technique emerging from this work uses pauses in speech to detect and
then filter out background sounds such as music or static. Another uses certain
sounds from a speaker, such as vowels, to further refine the analysis. "Right
now we are getting a 20 to 30 percent error rate," says Gopalakrishnan. "But
with the present progress, we should be down to 10 to 15 percent in two years
and at acceptable rates for real transcription products for this type of
material in perhaps four years."
A Task of Low Perplexity
By late 1994, the effort to develop large-vocabulary, continuous-speech
recognition in American English had progressed to the point at which James
McGroddy, then IBM's senior vice president for research, began to ask scientists
how the company could exploit the work in the market. The technology was not
ready for an application in general vocabulary dictation, because its accuracy
was too low.
Ordinary speech or written text is unpredictable, since the computer has no
way of knowing if the user is dictating a business letter, a love letter, a poem
or a scientific treatise. In these examples, the probability that one word
follows another - the context that is vital to recognition accuracy -
is low, on average. What was needed was a more limited, and therefore more
predictable, application - a task with a reduced average number of choices
of words that could follow each word in the vocabulary, technically referred to
as a low-perplexity task. "The choice of transcribing radiologists' dictated
reports on X-rays and other images was obvious, owing to the
low-perplexity behavior of these reports and the availability of a research
prototype as a proof of concept," recalls Nahamoo.
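Perplexity is commonly computed as two raised to the average negative base-2 logarithm of the probability a language model assigns to each word, which can be read as the effective number of equally likely choices per word. The sketch below uses that standard definition; the probability values are invented purely to contrast a formulaic task with an open-ended one.

```python
import math

def perplexity(word_probabilities):
    """Perplexity = 2 ** (average negative log2 probability per word)."""
    n = len(word_probabilities)
    average_neg_log2 = -sum(math.log2(p) for p in word_probabilities) / n
    return 2 ** average_neg_log2

# Hypothetical model probabilities for each word of a short passage.
formulaic_report = [1 / 8] * 5   # each next word is one of about 8 likely choices
open_dictation = [1 / 1024] * 5  # each next word is one of about 1,024 likely choices

print(perplexity(formulaic_report))
print(perplexity(open_dictation))
```

With these illustrative numbers the formulaic passage behaves as if the recognizer had to choose among 8 words at each step, against 1,024 for the open-ended one.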
Conventionally, radiologists dictate their findings into a tape recorder and
send them to a transcriptionist for typing. Turnaround times to get transcripts
back for correction and signature can range from hours to days. The new
automatic radiology dictating system would allow radiologists to dictate into a
computer and then correct and sign the report on the spot. This would enormously
speed up this process, while saving the expense of the transcriptionist.
Not only did a need plainly exist, but it could also be defined in a limited
framework. For while they require a vocabulary of 25,000 words, radiology
reports are predictable. Like any medical document, they are laden with formulae
that are repeated or slightly varied: "The (name of organ) appears normal."
"There is evidence of (name of a medical effect) on the (name of organ)."
This predictability made the speech recognition problem manageable with existing
technology.
It took about six months before research, Health Industry Solutions,
marketing, and development teams were in place to start product development. The
team was able to get the product on the market in less than a year. Working
closely with radiologists at Memorial Sloan-Kettering Cancer Center in New York
and Massachusetts General Hospital in Boston, the team worked out the bugs in
the system and adapted its user interfaces to the needs of the final users, the
radiologists themselves.
The result is the first real-time continuous-speech dictation system on the
market. Running on a Pentium® Pro 200 MHz personal computer, it requires no
training (although a 15-minute training session can improve accuracy). The
software costs $4,500 and the hardware around $5,000. Savings on transcription
can pay for the expenditure within a year.
Radiologists are very enthusiastic about the system. "From my
standpoint of having responsibility for generating over 400,000 radiology
reports each year, I can only call the IBM MedSpeak/Radiology system a
spectacular breakthrough," says Dr. James H. Thrall, radiologist in chief
at Massachusetts General Hospital.
Speech Recognition for Telecommunications
Other products that use the present limited continuous-speech recognition
technology are moving rapidly toward the market. Again, the goal is to define
applications with a limited number of choices. An obvious one is an automated
phone-dialing system.
In the IBM Autodialer prototype, a list of names from a large company is
entered into the computer, along with their pronunciations. An individual who
calls the company can use the system to ask for the person needed by name; the
recognizer identifies the name, repeats it back and dials the extension. "Right
now we have 95 percent accuracy on the first try," explains Watson's Ken
Davies. "If the machine doesn't get it right, the user repeats the
name and then we get almost 100 percent accuracy, partly through the machine's
adaptation from the first try and partly due to the multiple trial improvement
in the hit rate." IBM plans to introduce this product shortly for use
throughout the company.
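The "multiple trial" part of that improvement can be checked with simple arithmetic. The sketch below assumes the two attempts are independent and equally accurate, which the article does not state; the adaptation Davies mentions would, if anything, raise the second-try figure further.

```python
def success_within(tries, per_try_accuracy):
    """Chance of at least one correct recognition in `tries` attempts.

    Assumes independent attempts with identical accuracy (an assumption
    made for this sketch, not a claim about the Autodialer).
    """
    return 1 - (1 - per_try_accuracy) ** tries

first_try = success_within(1, 0.95)
second_try = success_within(2, 0.95)
print(round(first_try, 4), round(second_try, 4))
```

Under that assumption, two attempts at 95 percent each already give a combined hit rate of about 99.75 percent.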
Slightly further from final development is a voice-activated Yellow Pages. "We
are attempting to build a system that will enable people to extract essentially
the same information that they get from human operators using speech recognition
technology," says Watson scientist David Lubensky.
The system will work via spoken hierarchical menus. Searching for a Hunan
restaurant, for example, a novice user can obtain successive choices in the
broad categories of entertainment, restaurants and Chinese restaurants before
obtaining Hunan selections. Experienced users can "barge in" over the
hierarchical structure. By saying "Hunan restaurants," the user will
immediately obtain the Hunan listings, thereby minimizing the time to find the
number. The system will play the first five or 10 entries or more on request.
Once the inquirer has identified the needed number, the system will dial it
automatically.
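The menu behavior described above can be sketched as a small tree that is either walked level by level or searched directly by category name. The category names follow the article's example; the listings, the data layout and the function names are hypothetical.

```python
# Hypothetical directory tree. Leaves are lists of placeholder listings.
MENU = {
    "entertainment": {
        "restaurants": {
            "chinese restaurants": {
                "hunan restaurants": ["Hunan listing A", "Hunan listing B"],
            },
        },
    },
}

def find_path(tree, target, path=()):
    """Depth-first search for a category name anywhere in the tree.

    Returns the chain of categories leading to it, or None if absent.
    """
    for name, child in tree.items():
        here = path + (name,)
        if name == target:
            return here
        if isinstance(child, dict):
            found = find_path(child, target, here)
            if found:
                return found
    return None

def listings_at(tree, path):
    """Follow a chain of category names down to its listings."""
    node = tree
    for name in path:
        node = node[name]
    return node

# A novice answers one menu prompt at a time.
novice_path = ("entertainment", "restaurants", "chinese restaurants", "hunan restaurants")

# An experienced caller barges in with the final category straight away.
expert_path = find_path(MENU, "hunan restaurants")

print(listings_at(MENU, expert_path))
```

Both callers end up at the same listings; the experienced caller simply skips the intermediate prompts.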
A development and marketing team from IBM's Telecommunications and Media
Industry Solutions Unit, based in Fort Lauderdale, is actively pursuing
deployment of telephony speech recognition technology to provide advanced
solutions to telephone companies in areas such as enhanced services, directory
assistance and dialing services.
Understanding What's Said
For the future, says Lubensky, the obvious progression will be to
incorporate natural language understanding into the system - for example,
permitting inquirers to ask for "Chinese restaurants in White Plains."
That will require the computer to convert English sentences into a standard
database query format. This type of conversion is another strong area of basic
research at IBM.
"A basic problem of transforming natural language into formal language
is to parse the sentence," says Watson's Salim Roukos, a key researcher in
this area. "As in the case of speech recognition, we use statistical
techniques, feeding the computer large databases of triplets of natural
sentences, their parsings and their meanings, and derive from that statistical
links between word orders and placements and final meanings."
A test system now under development uses natural language understanding for
an airline schedule database. In response to a spoken request such as "I
want to fly from Boston to Atlanta," the computer provides a list of
Boston-to-Atlanta flights. At present, the system can recognize and correctly
interpret some 5,000 sentences. A much more content-rich application such as
natural-language conversational Yellow Pages is still in the future, since it
needs to recognize a considerably broader variety of requests.
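To make the idea of turning a sentence into a database query concrete, here is a deliberately simple sketch. It uses a hand-written pattern rather than the statistical parsing Roukos describes, and the query format, table name and function name are all invented for illustration.

```python
import re

# Hand-written pattern for "... from <city> to <city>"; a stand-in for
# the statistical parser, which would be learned from data instead.
FLIGHT_PATTERN = re.compile(
    r"\bfrom\s+(?P<origin>[A-Za-z ]+?)\s+to\s+(?P<destination>[A-Za-z ]+?)\s*[.?!]?$",
    re.IGNORECASE,
)

def to_query(sentence):
    """Convert a flight request into a structured query, or None if no match."""
    match = FLIGHT_PATTERN.search(sentence.strip())
    if match is None:
        return None
    return {
        "table": "flights",  # hypothetical table name
        "origin": match.group("origin").title(),
        "destination": match.group("destination").title(),
    }

query = to_query("I want to fly from Boston to Atlanta")
print(query)
```

A fixed pattern like this breaks as soon as the caller rephrases the request, which is one reason a broader application calls for the data-driven approach described above.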
Also under study is the concept of incorporating speech recognition
technology into products under development by IBM's subsidiary Lotus. Late last
year, researchers at Watson and Lotus linked up in a project to incorporate
speech technology into Kona - a series of applications such as
spreadsheets, word processors and e-mail written in Java for use on the World
Wide Web. "We were able to put together a sample demo using a simple
microphone fairly quickly," says Watson research staff member Bruce Lucas. "Incorporating
speech will differentiate the technology from such applications currently
available in the market."
Another effort focuses on "collagen," which Lotus researcher Candy
Sidner describes as "a fairly rich model of collaborative discourse that
makes it possible to communicate with an agent about a task that the agent will
carry out for the user." At present, such communication is restricted to
the traditional keyboard and mouse. But preliminary work is under way to permit
users to communicate with collagen agents via speech. "We're doing it for
Java-based applications, and looking at using it in e-mailing, calendars and
scheduling," says Sidner.
With the new organization that teams up research, development and marketing,
IBM expects to roll out a steady stream of new speech recognition products. "We
can not only make a market, but also differentiate IBM's offerings and improve
their value to our customers," says Toby Maners, manager of worldwide
speech products and business management. "Today we have dictation, command
and control and text-to-speech. Tomorrow we will move further into the food
chain of human interaction, with dialog and natural language understanding."
Just as important, says Mike Shannon, chief financial officer of the Speech
Business Unit, the new offerings will be targeted at the mass consumer market.
During the next five years or so, full continuous-speech recognition will be
increasingly integrated into standard personal computer products and programs,
making keyboarding a matter of choice rather than a necessity for everyone. In
the longer run, as natural language understanding matures, the science fiction
dream of computers like HAL, that can act on complex, naturally spoken commands,
may well be realized.
Eric Lerner is a freelance science writer based in Lawrenceville, New
Jersey.
Coping with Chinese
For some languages, continuous-speech recognition is the only possible
approach, since those languages don't lend themselves to pauses between words.
Mandarin, the most widely spoken language in China, provides an example. It is
written without white space between characters, and contains several words that
consist of two or more subwords, such as "bookmark." Since no basic
standard exists on where to break such words in speech, explains Michael
Picheny, a scientist at the Thomas J. Watson Research Center, different speakers
say the words in different ways, choosing either to break the subwords or to run
them on.
"Mandarin presents other special problems of its own," adds
Picheny, "since like other Chinese languages it is tonal." That means
that words change meaning depending on the tone pattern or voice pitch used.
Mandarin has four basic tones: high, rising, dipping and falling.
Because of the tonality, researchers at Watson and the China Research
Laboratory in Beijing, who were working on a Mandarin speech recognizer, had to
use a set of measures different from those for recognizers of Western languages.
However, with the basic set of statistical tools already developed for other
languages, the team was able to develop a large-vocabulary continuous-speech
Mandarin prototype in less than a year. An effort is currently under way to
incorporate this technology into a marketable product.
Tools For Developers
If voice recognition is to become widespread, it can't be limited to
applications developed by IBM. To help software vendors incorporate IBM
technology into their products, researchers at the Thomas J. Watson Research
Center and development teams in the Worldwide Speech Business Unit are
developing a "toolkit" for speech recognition that the company will
supply to vendors and will continually update. "What we are providing is
the basic engine that does the speech recognition, and help for those who want
to integrate it into an application," says Watson's Steve De Gennaro, who
helps to drive the technology.
New toolkit releases will occur regularly, every 3 to 6 months. Each release
will include developer's programming tools and sample programs, advice on how to
use speech recognition in products and, as needed, marketing support services.
IBM is already distributing information on the toolkits through a Web page.
"We'll be distributing three levels of software," explains De
Gennaro. "Alpha is the early view of the newest stuff, on which we'll
be soliciting feedback from the vendors. Beta is more developed and Golden is
stable, finished technology." While the basic information on the technology
will be distributed for free, support services will of course carry a fee, and
any use of the technology in a product will require a license from IBM.
For more information see: http://www.software.ibm.com/voicetype
IBM Speech Products
- IBM offers the most advanced selection of speech products available in
the market today. Available in multiple languages and for a variety of operating
systems, these products can also be enhanced with the addition of vocabularies
available separately for the fields of law, medicine and journalism.
- IBM VoiceType Simply Speaking for Windows 95 is a discrete-speech
recognition program that allows users to dictate at 70 to 100 words per minute
into speech-aware
applications with a greater than 90 percent accuracy rate for most people out of
the box, and 95 percent after a brief training period.
- IBM VoiceType Dictation 3.0 for Windows 95 adds to the capabilities of
Simply Speaking the ability to navigate menus by voice and other features,
such as direct dictation in Microsoft Word®. Versions are also available for
OS/2 and Windows 3.1. The OS/2 Warp 4.0 version is integrated into the operating
system, simplifying its use with applications.
- IBM VoiceType Developers Toolkit 3.0 for Windows 95 helps application
developers write speech-aware applications.
- MedSpeak/Radiology is the world's first real-time continuous-speech
dictation system. Its 25,000-word vocabulary allows radiologists to
automatically dictate reports on X-rays and other images into a computer,
thereby saving the time and expense of transcription from a tape.