Speech Recognition (SR)

Speech recognition is the process by which a computer (or other type of machine) identifies spoken words. In practice, it means talking to your computer and having it correctly recognize what you are saying. Speech recognition systems allow people to control a computer by speaking to it through a microphone, either entering text or issuing commands to the computer, e.g. to load a particular program or to print a document.

Speech recognition involves the ability to match a voice pattern against a provided or acquired vocabulary. Usually, a limited vocabulary is provided with a product and the user can record additional words. More sophisticated software has the ability to accept natural speech.
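As an illustration of matching a voice pattern against a vocabulary, the following minimal sketch uses dynamic time warping (DTW), one classical template-matching technique. The source does not say which method any given product uses, so this is only an assumption made for illustration. The feature sequences are made-up numbers standing in for real acoustic features, and the names dtw_distance, recognize and VOCABULARY are hypothetical.

```python
from typing import Dict, List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the best of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


def recognize(utterance: Sequence[float],
              vocabulary: Dict[str, List[float]]) -> str:
    """Return the vocabulary word whose stored template is closest."""
    return min(vocabulary,
               key=lambda word: dtw_distance(utterance, vocabulary[word]))


# Hypothetical templates: one invented feature sequence per word.
VOCABULARY = {
    "print": [1.0, 3.0, 4.0, 2.0],
    "load": [5.0, 5.0, 1.0, 0.0],
}

# A slower, slightly noisy rendition of the "print" template.
spoken = [1.0, 1.1, 3.2, 3.9, 4.1, 2.0]
print(recognize(spoken, VOCABULARY))
```

Recording an additional word, as described above, would correspond to adding another entry to the vocabulary dictionary. DTW tolerates differences in speaking rate because it may align one template frame with several spoken frames.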

Commercial SR systems were generally designed for lawyers, medical staff and other professionals who wanted to be able to enter text into a computer at speed without having to learn to type. As the systems have become cheaper and more reliable, they have become increasingly useful to many, but by no means all, people with disabilities.

Some people can use an SR system and get good results more or less straight away. Others need to complete the training procedure and spend many hours using the system, painstakingly correcting misrecognised words, before a satisfactory level of recognition can be achieved. Even where the best technology is in use by a person who has a clear voice, is well motivated and has all the skills apparently necessary, speech recognition can be unsuccessful. A number of other factors can influence the likelihood of success:

Speech Consistency. Consistency of speech is much more important than voice quality. Many people with quite dysarthric speech are able to use discrete speech recognition systems provided that they are consistent in their speech.

Literacy Skills. Given the frequent need to choose the desired word from a list of choices, speech recognition will be most useful for users with reasonably reliable word recognition skills.

Cognitive Skills. The cognitive load involved in using speech recognition systems can be quite high. The person using it must think not only about what they want to say, but also about how to say it. They must also monitor whether the words they used have been recognised accurately and, if not, decide on an appropriate strategy to correct them.

Visual Skills. A wide range of information is presented visually on the screen: the text that has been entered; choices for an unrecognised word or phrase; information on how the program is running; even basic information as to whether or not the microphone has been switched on.

Motivation. This is probably the most important factor for most people who try to use a speech recognition system. Initial results are often disappointing, particularly in comparison with the manufacturers' claims. There will also be occasions when levels of recognition seem to drop for no apparent reason.

Medical transcription is vital to the healthcare industry for two main reasons: it charts patients' problems and medical procedures through their lives, and it is continually used to compile statistics to analyze the health of the nation. A medical transcriptionist listens to dictated information concerning patient care and translates it into printable documentation. They typically use a headset and foot pedal to play the dictation through a transcriber or receive digital voice files, and key the appropriate text into a computer. This information is used to produce patient medical records including history and physicals, clinic notes, medical reports, and physician correspondence.

Multimodal speech synthesis, or audio-visual speech synthesis, deals with the automatic generation of voice and facial animation from arbitrary text. Applications range from research on human communication and perception, via tools for the hearing impaired, to spoken and multimodal agent-based user interfaces. A view of the face can significantly improve the intelligibility of both natural and synthetic speech, especially under degraded acoustic conditions. Moreover, facial expressions can signal emotion, add emphasis to the speech and support the interaction in a dialogue situation.

On the text side, the front-end of a synthesizer first converts raw text into written-out words; this text processing is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end.
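The front-end stages described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not how any particular synthesizer works: the abbreviation table, digit table and pronunciation lexicon are tiny invented examples, the phoneme symbols are ARPAbet-style, and the function names normalize and to_phonemes are hypothetical. Prosodic marking is left out to keep the sketch short.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup tables; a real front-end would use far larger resources.
ABBREVIATIONS: Dict[str, str] = {"dr.": "doctor", "st.": "street"}
DIGITS: Dict[str, str] = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
LEXICON: Dict[str, List[str]] = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}


def normalize(text: str) -> List[str]:
    """Text normalization: expand abbreviations and digits into plain words."""
    words: List[str] = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = token.strip(".,!?;:")
        if token and all(ch in DIGITS for ch in token):
            # Read numbers digit by digit (a deliberate simplification).
            words.extend(DIGITS[ch] for ch in token)
        elif token:
            words.append(token)
    return words


def to_phonemes(words: List[str]) -> List[Tuple[str, List[str]]]:
    """Grapheme-to-phoneme conversion by dictionary lookup only."""
    return [(word, LEXICON.get(word, ["<unk>"])) for word in words]


words = normalize("Dr. Smith has 2 cats.")
for word, phonemes in to_phonemes(words):
    print(word, " ".join(phonemes))
```

Words missing from the lexicon are marked with a placeholder here; a fuller system would typically fall back to letter-to-sound rules or a trained model for such words.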

VoiceXML is a markup language for building interactive voice applications. The VoiceXML language provides a clean and simple means for playing audio, recognizing speech and touch-tone (DTMF) input, and controlling a call flow. VoiceXML is a derivative of the Extensible Markup Language (XML). XML is the standard format for defining structured documents and data on the Web. XML enables programmers to define an arbitrary vocabulary, formally known as a schema, using a standard, well-defined, easily-parsed syntax. One XML schema might describe customer information, another might describe a mathematical equation, and yet another might describe a recipe for chocolate chip cookies. XML is easily transported across the World Wide Web (WWW) using existing Internet protocols such as HTTP. Special tools aren't required to author XML documents, and it is relatively easy to create tools, or modify existing ones, that both emit and read XML. This makes XML an ideal language for passing data back and forth between applications.
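To make this concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module to emit a small VoiceXML document and read it back. The dialog itself (a drink order with the field name "drink") is invented for illustration, and build_menu_document is a hypothetical helper name. The elements used (vxml, form, field, prompt, option, filled, value) come from the VoiceXML 2.x vocabulary.

```python
import xml.etree.ElementTree as ET


def build_menu_document() -> str:
    """Build a one-question VoiceXML dialog and return it as a string."""
    vxml = ET.Element(
        "vxml",
        {"version": "2.1", "xmlns": "http://www.w3.org/2001/vxml"},
    )
    form = ET.SubElement(vxml, "form", {"id": "order"})
    field = ET.SubElement(form, "field", {"name": "drink"})

    # Audio played to the caller (synthesized from this text).
    prompt = ET.SubElement(field, "prompt")
    prompt.text = "Would you like coffee or tea?"

    # Each option can be chosen by speaking it or by pressing a DTMF key.
    for key, choice in (("1", "coffee"), ("2", "tea")):
        option = ET.SubElement(field, "option", {"dtmf": key, "value": choice})
        option.text = choice

    # Call flow: what happens once the field has been filled.
    filled = ET.SubElement(field, "filled")
    confirm = ET.SubElement(filled, "prompt")
    confirm.text = "You chose "
    ET.SubElement(confirm, "value", {"expr": "drink"})

    return ET.tostring(vxml, encoding="unicode")


document = build_menu_document()
print(document)

# The same general-purpose XML library can read the document back.
parsed = ET.fromstring(document)
choices = [el.text for el in parsed.iter() if el.tag.endswith("option")]
print(choices)
```

No VoiceXML-specific tooling is involved: a general-purpose XML library is enough both to emit and to read the document, which is the property the paragraph above attributes to XML.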

Automatic Speech Recognition (ASR)

Computer-based pronunciation training has emerged thanks to developments in ASR technology. However, even as foreign language teachers become increasingly aware of the advantages of using ASR software, they have become concerned about the reliability of machine-scored pronunciation. This concern stems from their belief that a high degree of agreement should be obtained between automatic and human scores. Finding a high degree of correlation between the two would increase the use of ASR software for pronunciation training.
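One simple way to quantify agreement between automatic and human scores is the Pearson correlation coefficient. The sketch below is only an illustration: the source does not say which statistic is used in practice, the two score lists are invented, and pearson is a hypothetical helper name.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Invented example data: one pronunciation score per learner utterance.
human_scores = [2.0, 3.5, 4.0, 1.5, 5.0]
machine_scores = [2.5, 3.0, 4.5, 1.0, 4.5]

r = pearson(human_scores, machine_scores)
print(f"correlation between human and machine scores: {r:.2f}")
```

A value close to 1 would indicate that the machine ranks utterances much as the human rater does. Correlation alone does not show that the two scales agree in absolute terms, so it is best read alongside other checks.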

ASR is a cutting-edge technology that allows a computer or even a hand-held PDA to identify words that are read aloud or spoken into any sound-recording device. The ultimate goal of ASR technology is to achieve 100% accuracy on all words that are intelligibly spoken by any person, regardless of vocabulary size, background noise, or speaker variables.
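Recognition accuracy of this kind is commonly reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference text, divided by the number of reference words. The source does not name a metric, so the following is a minimal sketch under that assumption; word_error_rate is a hypothetical helper name and the sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("print the document", "print a document")
print(f"WER: {wer:.2f}")
```

A WER of 0 corresponds to the 100% accuracy goal described above. Because insertions are counted, WER can exceed 1 when the recognizer outputs many extra words.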