How Does Speech Recognition Work?
A speech recognizer consists of several components. These are learned from data, using a Speech Corpus: a collection of speech recordings paired with their textual transcriptions. From this corpus, the Speech Recognizer learns the correspondences between sounds and words.
Signal Processing
This component converts the signal recorded by the microphone into Feature Vectors. Each Feature Vector provides a snapshot of what is going on in the speech signal at one moment, emphasising those features that are relevant to speech recognition. Typically, 100 Feature Vectors are produced per second, i.e. one every 10 milliseconds.
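As a rough illustration of the rate quoted above, the sketch below cuts a signal into overlapping frames, one starting every 10 ms, and computes a single toy feature (log energy) per frame. The 16 kHz sample rate, the 25 ms window and the names frame_signal and log_energy are assumptions made for this example; real recognizers compute much richer feature vectors.

```python
import math

def frame_signal(samples, sample_rate=16000, window_ms=25, hop_ms=10):
    """Split samples into overlapping frames; a new frame starts every hop_ms."""
    window = sample_rate * window_ms // 1000  # samples per frame (assumed 25 ms)
    hop = sample_rate * hop_ms // 1000        # samples between frame starts (10 ms)
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]

def log_energy(frame):
    """A toy one-dimensional feature: the log of the frame's total energy."""
    return math.log(sum(s * s for s in frame) + 1e-10)

# One second of a 440 Hz tone stands in for recorded speech.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = frame_signal(signal, sample_rate)
features = [[log_energy(f)] for f in frames]  # one feature vector per frame
print(len(features))  # 98: close to 100, since the last frames need a full window
```

The window is longer than the hop, so neighbouring frames overlap; this is a common design choice so that no part of the signal falls between two snapshots.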
Acoustic Model
This takes the stream of Feature Vectors and turns it into a stream of Phonemes (or Phoneme Hypotheses). A Phoneme is the basic unit from which words are constructed, and corresponds to a particular speech sound. An important aspect of the Acoustic Model is that it does not make definite decisions about what the stream of Phonemes is, but instead tells us how likely each Phoneme is at each point in the speech signal.
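The idea of Phoneme Hypotheses can be sketched as follows: for every frame the model produces one score per Phoneme, and the scores are normalised into probabilities rather than collapsed into a single answer. The phoneme set, the scores and the use of a softmax are all assumptions made for this example; a trained Acoustic Model would compute the scores from the Feature Vectors.

```python
import math

PHONEMES = ["k", "ae", "t"]  # tiny illustrative phoneme set

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three consecutive frames (one row per frame).
frame_scores = [
    [2.0, 0.5, 0.1],
    [0.3, 1.8, 0.2],
    [0.1, 0.4, 2.2],
]

# Each frame keeps a probability for every phoneme, not just the winner.
posteriors = [dict(zip(PHONEMES, softmax(row))) for row in frame_scores]
best_per_frame = [max(p, key=p.get) for p in posteriors]
print(best_per_frame)  # ['k', 'ae', 't']
```

Keeping the full set of probabilities lets later components overrule the most likely Phoneme at a frame when the surrounding words make another one more plausible.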
Lexicon
This tells us how each word is constructed as a string of Phonemes. A word may have more than one entry, so alternative pronunciations are also possible.
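A Lexicon of this kind can be sketched as a simple mapping from each word to a list of pronunciations, where each pronunciation is a list of Phonemes. The words, the phoneme symbols and the helper names below are illustrative assumptions, not the contents of any particular pronunciation dictionary.

```python
# Each word maps to one or more pronunciations (lists of phoneme symbols).
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "either": [["iy", "dh", "er"], ["ay", "dh", "er"]],  # two alternatives
    "the": [["dh", "ah"], ["dh", "iy"]],
}

def pronunciations(word):
    """Return all known pronunciations of a word (empty list if unknown)."""
    return LEXICON.get(word.lower(), [])

def words_matching(phonemes):
    """Return the words that have the given phoneme string as a pronunciation."""
    return [w for w, prons in LEXICON.items() if list(phonemes) in prons]

print(pronunciations("either"))      # both alternative pronunciations
print(words_matching(["dh", "iy"]))  # ['the']
```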
Language Model
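This states which sequences of words are likely and which are not. Just using a grammar is not possible, since people may say anything! The Language Model therefore ranks word sequences by how likely they are, rather than simply accepting or rejecting them.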
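One simple way to express "likely and unlikely" word sequences is a bigram model, which estimates the probability of each word given the word before it. The sketch below is only an illustration of that technique: the three-sentence corpus is invented, and add-one smoothing is an assumed choice that gives unseen word pairs a small but non-zero probability, reflecting the point that people may say anything.

```python
from collections import Counter

# A tiny made-up training corpus of transcriptions.
corpus = ["the cat sat", "the cat ran", "the dog sat"]

bigrams = Counter()   # counts of (previous word, word) pairs
contexts = Counter()  # how often each word appears as the previous word
vocab = set()
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # <s> marks the sentence start
    vocab.update(words[1:])
    for prev, word in zip(words, words[1:]):
        bigrams[(prev, word)] += 1
        contexts[prev] += 1

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing, so unseen pairs are never zero."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def sentence_prob(sentence):
    """Probability of a whole word sequence under the bigram model."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

# A familiar word order scores higher than a scrambled one, but neither is zero.
print(sentence_prob("the cat sat") > sentence_prob("sat the cat") > 0)  # True
```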