Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words into a machine-readable form. Applications include voice dialling (e.g. "Call home"), call routing (e.g. "I would like to make a collect call"), domestic appliance control, content-based spoken audio search, simple data entry (e.g. entering a credit card number), preparation of structured documents via speech-to-text processing (e.g. word processors or email), and hands-free control in aircraft cockpits. The performance of speech recognition systems is usually specified in terms of accuracy and speed. Dictation machines can achieve very high performance in controlled conditions. Commercially available speaker-dependent dictation systems usually require only a short period of training and can capture continuous speech with a large vocabulary at a normal pace with very high accuracy. Most vendors claim that their recognition software achieves between 98% and 99% accuracy under optimal conditions. Optimal conditions usually assume that users:
- have speech characteristics which match the training data,
- can achieve proper speaker adaptation, and
- work in a low-noise environment (e.g. a quiet office or laboratory).
This explains why users with strong accents may see lower recognition rates. Speech recognition in video has become a popular search technology used by several video search companies. Limited-vocabulary systems, requiring no training, can recognise a small number of words as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organisations. Both acoustic modelling and language modelling are important parts of modern statistical speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modelling also has many other applications, such as smart keyboards and document classification.
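The role of the language model can be illustrated with a toy bigram model: given the previous word, it assigns higher probability to likely continuations, which helps a recogniser choose between acoustically similar hypotheses. The corpus and smoothing constant below are illustrative assumptions, not taken from any real system.

```python
from collections import Counter

# Toy corpus (an assumption for illustration, echoing the "Call home" example).
corpus = "call home please call the office please call home".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words
vocab = len(set(corpus))

def bigram_prob(prev, word, k=1.0):
    # Add-k smoothing so unseen bigrams still get a small nonzero probability.
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab)

# In this corpus, "call home" occurs twice and "call office" never,
# so the model prefers "home" after "call".
print(bigram_prob("call", "home") > bigram_prob("call", "office"))  # -> True
```

A real recogniser would combine such language-model scores with acoustic-model scores (e.g. from HMMs) to rank candidate transcriptions.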

How it works
To understand how speech recognition works, it helps to know a little about speech and which of its features are used in the recognition process. In the human brain, thoughts are formed into sentences, and nerves control the shape of the vocal tract (the jaw, tongue, mouth, vocal cords, etc.) to produce the desired sound. The sound comes out as phonemes, the building blocks of speech. Each phoneme excites resonances of the vocal tract and therefore has high energy at those resonant frequencies; the first three resonance peaks carry most of the energy and are known as the formant frequencies. Each phoneme has a characteristic formant pattern, and it is this feature that enables each phoneme to be identified at the recognition stage. In general, speech recognition systems store reference templates of phonemes or words, compare the input speech against them, and output the closest match. Since it is frequencies that are compared, the spectra of the input and the reference template are compared rather than the raw waveforms.
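As a rough sketch of this template-matching idea, the toy example below compares magnitude spectra rather than waveforms. The sample rate, frame length, and pure-tone "phonemes" are illustrative assumptions; real systems use far richer spectral features than a single tone.

```python
import math

FS = 2000    # sample rate in Hz (an assumed toy value)
N = 100      # samples per frame (50 ms at 2 kHz)

def tone(freq):
    # A pure sine tone standing in for a phoneme with one dominant frequency.
    return [math.sin(2 * math.pi * freq * i / FS) for i in range(N)]

def magnitude_spectrum(signal):
    # Naive DFT magnitude; phase is discarded because recognition compares
    # frequency content, not the raw waveform.
    n = len(signal)
    return [abs(sum(x * complex(math.cos(2 * math.pi * k * i / n),
                                -math.sin(2 * math.pi * k * i / n))
                    for i, x in enumerate(signal)))
            for k in range(n // 2 + 1)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Stored reference templates, one per "phoneme" (labels are made up).
templates = {"ah": magnitude_spectrum(tone(300)),
             "ee": magnitude_spectrum(tone(500))}

def classify(signal):
    # Output the template whose spectrum is closest to the input's spectrum.
    spec = magnitude_spectrum(signal)
    return min(templates, key=lambda name: distance(spec, templates[name]))

print(classify(tone(310)))   # spectrally closest to the 300 Hz template -> "ah"
```

A 310 Hz input is classified as "ah" because its spectral peak lies near the 300 Hz template's peak, even though the two waveforms never match sample for sample.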

Speech is attractive as an input because it requires no training to use and is faster than most other input methods. Information can also be entered while the user is engaged in other activities, and it can be fed in via a telephone or microphone, which are relatively cheap compared with other input devices. But the recognition process has several disadvantages. In continuous speech it is hard to tell where one word ends and the next begins, which makes matching against stored reference templates difficult. As a result, in speaker-independent systems only isolated-word recognition is commercially available. Most buyers would like a system to be speaker independent, yet uttering words in isolation can be quite irritating, especially when the input is bulky and the processing speed is limited. Even in a speaker-dependent connected-word recognition system (with a limited vocabulary), the input speed is only up to about 50 words per minute, which is not very fast.
Other disadvantages include:
- Time: in practice, typing is often much faster than voice recognition.
- Money: in addition to the cost of the software and the microphone, voice recognition has had very little success on machines with less than 512 MB of RAM.
- Accuracy: this is related to the time issue; part of what makes voice recognition slower than typing is the need to correct misrecognition errors. In addition, any errors the author does not catch will not be caught by a spell checker, since they consist of the wrong word spelled correctly.
Design Issues
- Abstracted view of reality: speech is processed separately from other sound, and we hear what we expect to hear despite background noise, whether the modality is directional or broadcast.
- Human speech recognition tolerates mispronunciations, non-grammatical sentences and dialects.
- Sound/speech discrimination varies with age and depends on frequency (pitch), amplitude (loudness) and contrast (the foreground/background dB ratio).

The future
As with any automation system, Automatic Speech Recognition (ASR) systems will be employed when their speed and efficiency are higher than those of the current input method, so that savings can be made. As noted above, ASR systems have not quite reached that competitive position, although they are now more affordable than ever. Once speaker-independent continuous speech recognition systems are developed, speech recognition will become one of the popular methods of data input and will lead to the development of vocally interactive computers.
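The foreground/background contrast mentioned in the design notes above is normally quantified as a signal-to-noise ratio in decibels. A one-line sketch (the power values here are made up for illustration):

```python
import math

def snr_db(signal_power, noise_power):
    # Decibel ratio of speech (foreground) power to background-noise power.
    return 10 * math.log10(signal_power / noise_power)

print(snr_db(100.0, 1.0))   # -> 20.0: speech at 100x the noise power is +20 dB
```

Higher SNR values correspond to the "clean" conditions that recognition accuracy figures usually assume.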
Conclusions
At first glance, speech recognition seemed simple and straightforward. It has become apparent that it is a very difficult task to accomplish, one that requires much more time, effort and background knowledge than first thought.
Being able to determine what is spoken, or who the speaker is, with near-perfect accuracy is an extremely difficult task. Preventing another individual from breaking into the system can be just as difficult, as it requires a text-dependent system that will not accept anything other than what it specifies. The initial idea of simply being able to determine which word was spoken is, at best, naïve, and at worst not feasible at all.
Links
http://www.youtube.com/watch?v=kX8oYoYy2Gc&feature=related