Research on User Interface Technologies (UIT) at IBM is dedicated to developing innovative technologies, algorithms and tools for next-generation human-computer interfaces.
For decades, IBM has been among the leaders in all aspects of speech recognition technology. Most recently, ViaVoice for Embedded Multiplatforms, the IBM embedded speech recognition technology that represents the state of the art in using voice for command and control, was made available on multiple platforms and systems, such as Palm and iPaq handheld computers, as well as in cars.
The Telephony Speech Recognition project aims to develop algorithms that improve the accuracy and robustness of speech recognition in conversational telephony applications such as bank transactions and directory assistance. The project has significant product impact: it directly contributes algorithms and data files to WebSphere Voice Server and related voice technology offerings from IBM.
The multi-year goal of the Superhuman Speech Recognition project is to develop speech recognition technology that surpasses the abilities of humans to recognize speech. We are addressing this goal by attacking a graded series of speech recognition tasks, ranging from simple read speech to complex spontaneous conversations in difficult channels and environments. The focal point of this work is the creation of more robust and sophisticated acoustic and language modeling techniques to improve performance while simultaneously reducing the labor required to install and tune new applications.
The Audio-Visual Speech Recognition project, which has been selected as an "IBM Research Science and Technology accomplishment" for 2002, explores the use of visual information in speech recognition systems. It aims to combine visual cues with audio signals to improve automatic machine recognition of large-vocabulary continuous speech. This combination proves particularly important in acoustically challenging conditions with significant background noise, such as multi-speaker environments, production lines, airport halls, or cars. Audio-Visual Speech Recognition can also help with speech-reading for the hearing/speech impaired.
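One common way to combine the two streams is late (decision-level) fusion, in which each stream scores the candidate words separately and the scores are merged with a weight that reflects how reliable the audio is. The sketch below illustrates that general idea only; it is not taken from the project, and the function names, the weighting scheme, and the toy scores are all assumptions made for illustration.

```python
def fuse_log_likelihoods(audio_ll, visual_ll, audio_weight):
    """Combine per-word log-likelihoods from the two streams.

    audio_weight lies in [0, 1]; the visual stream gets 1 - audio_weight.
    (Hypothetical helper, not an IBM API.)
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if audio_ll.keys() != visual_ll.keys():
        raise ValueError("both streams must score the same words")
    return {
        word: audio_weight * audio_ll[word] + (1.0 - audio_weight) * visual_ll[word]
        for word in audio_ll
    }


def recognize(audio_ll, visual_ll, audio_weight):
    """Return the word with the highest fused score."""
    fused = fuse_log_likelihoods(audio_ll, visual_ll, audio_weight)
    return max(fused, key=fused.get)


# Made-up scores: in noise, "map" and "nap" sound alike, but the closed
# lips of /m/ are easy to see.
audio_scores = {"map": -4.0, "nap": -3.8}
visual_scores = {"map": -1.0, "nap": -5.0}

print(recognize(audio_scores, visual_scores, audio_weight=1.0))  # audio only
print(recognize(audio_scores, visual_scores, audio_weight=0.5))  # equal weights
```

With these toy numbers the audio-only decision is "nap", while giving the visual stream equal weight changes it to "map". In a noisy environment the audio weight would be lowered so that the visual cues count for more.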
IBM is using two key UIT technologies - speaker verification and conversational systems - to provide enhanced security for voice-based transactions. Our leading speaker verification technology (ranked first among 25 worldwide participants in the 2002 Speaker Recognition Evaluation organized by the National Institute of Standards and Technology) is combined with user knowledge (i.e., passwords and personal information) elicited through a brief conversation. The combination of the two information sources - called a "Conversational Biometric" - greatly increases the security and reliability of voice-based transactions in a non-intrusive manner and provides a flexible framework for various authentication scenarios so as to maximize user convenience. A demonstration of "Conversational Biometrics" can be seen at http://www.research.ibm.com/VIVA_Demo/.
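The idea of combining the two information sources can be sketched as a simple decision rule: a voice-match score and the fraction of correctly answered questions are merged into one score, and the dialog either accepts, rejects, or asks a further question. This is a minimal illustration under stated assumptions, not the method used in the IBM system; the score scale, the weights, the thresholds, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    voice_score: float     # speaker-verification score, assumed to lie in [0, 1]
    answers_correct: int   # knowledge questions answered correctly so far
    answers_asked: int     # knowledge questions asked so far


def decide(evidence, accept=0.8, reject=0.4, voice_weight=0.5):
    """Return "accept", "reject", or "ask_another_question".

    The thresholds and the linear weighting are illustrative assumptions.
    """
    if evidence.answers_asked:
        knowledge_score = evidence.answers_correct / evidence.answers_asked
    else:
        knowledge_score = 0.0
    combined = (voice_weight * evidence.voice_score
                + (1.0 - voice_weight) * knowledge_score)
    if combined >= accept:
        return "accept"
    if combined < reject:
        return "reject"
    return "ask_another_question"


print(decide(Evidence(voice_score=0.9, answers_correct=2, answers_asked=2)))
print(decide(Evidence(voice_score=0.6, answers_correct=1, answers_asked=2)))
print(decide(Evidence(voice_score=0.2, answers_correct=0, answers_asked=2)))
```

The middle outcome is what makes the approach conversational: when the evidence is inconclusive, the system can gather more of it instead of forcing an immediate accept or reject. Different authentication scenarios would correspond to different thresholds and weights.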
Exciting advancements are taking place in the area of text-to-speech (TTS). The goal of the project is to produce computer-generated speech that is indistinguishable from natural speech; our newest system takes a big step in that direction. The TTS system relies on a large database of natural speech, which is automatically divided into small building blocks that are then reassembled to form arbitrary word sequences. During synthesis, the blocks are chosen to minimize a cost function that accounts for several important aspects of naturalness in speech. Systems have been built for US and UK English as well as Chinese, Japanese, French, German, and Spanish.
An example of how UIT can enable human-to-human communication is the IBM Speech-To-Speech (Speech Translation) Technology. A speech translation system processes spoken input and translates its content into another language, for example from English to Mandarin Chinese and from Mandarin Chinese to English.
Another important example of translation technology is InfoScope, a handheld device equipped with a digital camera that can take snapshots of text in English, French, German, Spanish, Italian and Chinese and translate the text into another language in a matter of seconds. The device displays characteristics of augmented reality: it presents the real world in the form of a captured image, such as a restaurant sign, and merges it with virtual data by overlaying a translation of the image on the PDA's screen.
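Choosing building blocks to minimize a cost function is often formulated as a shortest-path search over the candidate blocks for each position. The sketch below shows one such formulation using dynamic programming. It is an illustration under assumptions, not the system's actual algorithm: the `Unit` fields, the two cost functions, and the toy database are hypothetical, and a real system would consider many more aspects of naturalness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    sound: str      # label of the building block
    recording: int  # which database recording it was cut from
    index: int      # position of the block inside that recording
    pitch: int      # average pitch in Hz (toy feature)


def select_units(candidates, target_cost, join_cost):
    """Pick one unit per position so that the summed target and join costs
    are minimal (dynamic programming over the candidate lattice)."""
    if not candidates or any(not options for options in candidates):
        raise ValueError("every position needs at least one candidate")
    costs = [target_cost(0, u) for u in candidates[0]]
    back = []  # back[i - 1][j]: best predecessor of candidate j at position i
    for i in range(1, len(candidates)):
        prev = candidates[i - 1]
        new_costs, pointers = [], []
        for u in candidates[i]:
            k = min(range(len(prev)), key=lambda p: costs[p] + join_cost(prev[p], u))
            new_costs.append(costs[k] + join_cost(prev[k], u) + target_cost(i, u))
            pointers.append(k)
        costs = new_costs
        back.append(pointers)
    j = min(range(len(costs)), key=costs.__getitem__)
    total, path = costs[j], [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]


target_pitch = [100, 110, 120]  # desired pitch per position (assumed)


def pitch_cost(position, unit):
    """Target cost: distance from the desired pitch."""
    return abs(unit.pitch - target_pitch[position])


def adjacency_cost(left, right):
    """Join cost: free if the blocks were adjacent in the same recording."""
    adjacent = left.recording == right.recording and right.index == left.index + 1
    return 0 if adjacent else 10


database = [
    [Unit("h", 1, 0, 100), Unit("h", 2, 0, 104)],
    [Unit("e", 2, 1, 112), Unit("e", 3, 0, 110)],
    [Unit("y", 2, 2, 118), Unit("y", 4, 0, 120)],
]
total, chosen = select_units(database, pitch_cost, adjacency_cost)
print(total, [(u.sound, u.recording) for u in chosen])
```

In this toy database the blocks with the best individual pitch match come from three different recordings and would cost 20 once the joins are counted. The search instead picks the three blocks from recording 2, whose pitch is slightly off but which join without penalty, for a total cost of 8.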
The use of UIT to increase productivity is demonstrated in BlueSpace, a next-generation workspace solution encompassing multiple software and hardware components that integrate sensors, actuators, displays and wireless networks into architectural elements. The goal of the space is to increase knowledge workers’ productivity by deterring unwanted interruptions, improving awareness and fluid communication among team members, and providing greater individual comfort through personalized environmental settings.
A project that combines many aspects of UIT in one device is the Cross-Industry Dashboard for Retail and Healthcare. This project represents a combination of groundbreaking research in new form-factor multi-modal devices (MetaPad), digital ink using new InkXML standards, speech, and analytic methods. We are developing a mobile wireless device and middleware that enable retail store employees to access store-specific information from anywhere, using input modalities such as a traditional keyboard, bar code scanners, magnetic stripe readers, and handwritten information in the form of digital ink. This technology has applications in various domains, including healthcare.
As part of the IBM Corporate Community Relations (CCR) program, the Web Adaptation project has received considerable press attention this past year and is currently in use by several organizations serving populations of elderly users and users with disabilities. For 2003, CCR plans to internationalize this project and deploy the software to CCR partners worldwide.
IBM researchers have been, and continue to be, among the worldwide leaders in developing User-Interface Technologies. Being part of IBM gives us rare opportunities to have our research affect both the state-of-the-art and the state-of-the-practice.
