MATCH Technology Tutorial - How Does Speech Recognition Work?

How Does Speech Recognition Work?

A speech recognizer consists of a number of components. These are learned from data, using a Speech Corpus: a collection of speech recordings together with their textual transcriptions. From this data the Speech Recognizer learns the correspondences between sounds and words.

Signal Processing

The Signal Processing stage converts the signal recorded by the microphone into Feature Vectors. Each Feature Vector is a snapshot of what is going on in the speech signal over a short interval, emphasising those features that are relevant to speech recognition. Typically, 100 feature vectors are produced per second.
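
As a rough illustration of how a rate of about 100 feature vectors per second arises, the sketch below cuts a signal into overlapping frames that start every 10 ms and computes one very simple feature (log energy) per frame. The 16 kHz sample rate, the 25 ms window and the names frame_signal and log_energy are assumptions made for this example, not values taken from the tutorial; real recognizers compute richer features.

```python
import math

def frame_signal(samples, sample_rate=16000, window_ms=25, hop_ms=10):
    """Split samples into overlapping frames (assumed 25 ms long, one every 10 ms)."""
    window = sample_rate * window_ms // 1000   # samples per frame
    hop = sample_rate * hop_ms // 1000         # samples between frame starts
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames

def log_energy(frame):
    """A deliberately simple one-dimensional feature: log of the frame energy."""
    energy = sum(x * x for x in frame)
    return math.log(energy + 1e-10)  # small constant avoids log(0) on silence

# One second of a synthetic 440 Hz tone stands in for recorded speech.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = frame_signal(signal, sample_rate)
features = [[log_energy(f)] for f in frames]   # one feature vector per frame
print(len(features), "feature vectors for one second of audio")
```

With a 10 ms step the count comes out just under 100 per second, because the last partial windows at the end of the signal are dropped.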

Acoustic Model

The Acoustic Model takes the stream of Feature Vectors and turns it into a stream of Phonemes (or Phoneme Hypotheses). A Phoneme is the unit from which words are constructed, and corresponds to a particular speech sound. An important aspect of the Acoustic Model is that it does not make definite decisions about what the stream of Phonemes is. Instead, it tells us how likely each Phoneme is at each point in the speech signal.
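
The idea of reporting likelihoods rather than hard decisions can be sketched as follows. The per-frame scores here are made-up numbers standing in for the output of a trained model, and the softmax normalisation is one common way to turn scores into probabilities; neither is taken from the tutorial.

```python
import math

def softmax(scores):
    """Turn raw per-phoneme scores into probabilities that sum to 1."""
    highest = max(scores.values())
    exps = {p: math.exp(s - highest) for p, s in scores.items()}  # shift for stability
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}

# Hypothetical scores for three frames over a tiny phoneme set.
frame_scores = [
    {"k": 2.0, "ae": 0.1, "t": 0.5},
    {"k": 0.3, "ae": 1.8, "t": 0.4},
    {"k": 0.2, "ae": 0.6, "t": 0.7},   # an ambiguous frame
]

posteriors = [softmax(scores) for scores in frame_scores]

for i, dist in enumerate(posteriors):
    ranked = sorted(dist.items(), key=lambda item: item[1], reverse=True)
    print(i, [(p, round(prob, 2)) for p, prob in ranked])
```

Every phoneme keeps some probability in every frame, so later components can still recover when the most likely phoneme in an ambiguous frame is not the one that was spoken.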

Lexicon

The Lexicon tells us how each word is constructed as a string of Phonemes. A word may have more than one entry, to allow for alternative pronunciations.
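
A lexicon of this kind can be pictured as a table from words to one or more phoneme strings. The entries and phoneme symbols below are illustrative assumptions chosen for the example, not the tutorial's actual lexicon.

```python
# Each word maps to a list of pronunciations; each pronunciation is a list of phonemes.
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "either": [["iy", "dh", "er"], ["ay", "dh", "er"]],   # two accepted pronunciations
}

def pronunciations(word):
    """Return every known phoneme string for a word, or an empty list if unknown."""
    return LEXICON.get(word.lower(), [])

def words_matching(phonemes):
    """Return the words that have the given phoneme string as a pronunciation."""
    return [w for w, prons in LEXICON.items() if phonemes in prons]

print(pronunciations("either"))
print(words_matching(["ay", "dh", "er"]))
```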

Language Model

The Language Model states which sequences of words are likely and which are not. Just using a fixed grammar is not possible, since people may say anything!
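
One simple way to express "likely or not" without forbidding anything is a bigram model: count how often each word follows another in some training text, and smooth the counts so that unseen sequences get a small but non-zero probability. The three-sentence corpus and the add-one smoothing below are assumptions made for this sketch; the tutorial does not say which kind of Language Model is used.

```python
from collections import Counter

def train_bigram(sentences):
    """Count word pairs, with <s> and </s> marking sentence start and end."""
    contexts = Counter()
    bigrams = Counter()
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        contexts.update(words[:-1])                 # every word that has a successor
        bigrams.update(zip(words[:-1], words[1:]))
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev); never exactly zero."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

corpus = ["the cat sat", "the dog sat", "the cat ran"]   # toy training text
contexts, bigrams, vocab = train_bigram(corpus)

likely = bigram_prob("the", "cat", contexts, bigrams, vocab)
unlikely = bigram_prob("cat", "the", contexts, bigrams, vocab)
print(likely, unlikely)
```

The pair seen in the training text scores higher than the unseen one, yet the unseen pair is still allowed, which is the behaviour a strict grammar cannot provide.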