How Does Speech Synthesis Work?
Typically, a speech synthesis system receives text from other applications and reads it aloud. The text can be anything from system messages to passages from books. The synthesis process consists of two broad steps:
Step 1. Determine Pronunciation
At the most basic level, this involves determining the sequence of speech sounds that is needed to say the words in the input (grapheme to phoneme conversion). Processing can be more elaborate, though. If the text contains abbreviations and numbers, for example, the system needs to determine how these should be read out (text normalisation). The system may also need to split the input into smaller chunks of output text (prosodic phrasing) or determine which words need to be emphasised.
As an example, consider the input: the next available appointment with the physiotherapist is on Thursday at 2PM. The system processes it as follows:
- After text normalisation this becomes: the next available appointment with the physiotherapist is on Thursday at two P M.
- After prosodic phrasing this becomes: the next available appointment with the physiotherapist / is on Thursday / at two P M.
- After emphasis this becomes: the NEXT available appointment with the PHYSIOTHERAPIST / is on THURSDAY / at TWO P M.
- After grapheme-to-phoneme conversion this becomes: dh ax n eh k s t ax v ey l ax b ax l ax p oy n t m ax n t w ih dh dh ax f ih z iy ax dh eh r ax p ax s t ih z aa n th er z d ey ae t t uw p iy eh m.
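To make the first step more concrete, here is a minimal Python sketch of two of the stages above: text normalisation of a clock time and grapheme-to-phoneme conversion by dictionary lookup. It is an illustration only, not how any particular synthesiser works. The names NUMBER_WORDS, LEXICON, normalise and to_phonemes are hypothetical, the lexicon covers just a few words of the example, and prosodic phrasing and emphasis are left out.

```python
import re

# Hypothetical toy tables. A real system would rely on much larger
# lexicons plus rules or trained models for words it has not seen.
NUMBER_WORDS = {
    "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight",
    "9": "nine", "10": "ten", "11": "eleven", "12": "twelve",
}

LEXICON = {
    "on": ["aa", "n"],
    "thursday": ["th", "er", "z", "d", "ey"],
    "at": ["ae", "t"],
    "two": ["t", "uw"],
    "p": ["p", "iy"],
    "m": ["eh", "m"],
}


def normalise(text):
    """Expand clock times such as '2PM' into words ('two P M')."""
    def expand(match):
        hour = NUMBER_WORDS[match.group(1)]
        letters = " ".join(match.group(2).upper())  # 'PM' -> 'P M'
        return f"{hour} {letters}"

    # Hours 1-12 followed by AM or PM, with an optional space in between.
    return re.sub(r"\b(1[0-2]|[1-9])\s?([AaPp][Mm])\b", expand, text)


def to_phonemes(text):
    """Look each word up in the lexicon; unknown words become '<unk>'."""
    phones = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        phones.extend(LEXICON.get(word, ["<unk>"]))
    return phones


normalised = normalise("on Thursday at 2PM")
print(normalised)
print(" ".join(to_phonemes(normalised)))
```

The lookup reproduces the tail of the phoneme sequence in the example. Words missing from the toy lexicon are only flagged here, whereas a full system would need a way to work out their pronunciation.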
Step 2. Generate Speech
In the first step, the system specifies which speech sounds must be produced and how they should be produced. In the second step, these specifications are converted into actual speech. Cerevoice, the speech synthesis system currently used in MATCH, generates speech by searching a large database of human speech for appropriate combinations of speech sounds. The best combinations found are then concatenated and, if necessary, modified to achieve the desired effects.
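The search for the best combination of recorded sounds can be pictured with a small Python sketch. This is a generic illustration of database search by dynamic programming and is not a description of how Cerevoice is implemented. The Unit class, the single pitch feature and both cost functions are assumptions made for the example: each candidate is scored on how well it matches the wanted pitch (target cost) and on how smoothly it joins its neighbour (join cost).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A recorded speech sound in a hypothetical database."""
    unit_id: str
    phone: str
    pitch: float  # assumed feature: average pitch in Hz


def target_cost(unit, wanted_pitch):
    # How far the recording is from what the first step asked for.
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    # Assumed measure of how audible the join would be.
    return abs(left.pitch - right.pitch)


def select_units(targets, database):
    """Return the lowest-cost unit sequence for (phone, wanted_pitch) targets."""
    candidates = [[u for u in database if u.phone == phone]
                  for phone, _ in targets]
    if any(not options for options in candidates):
        raise ValueError("no recording available for some phone")

    # best[i][j] = (cheapest total cost ending in candidate j, back-pointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, unit), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(unit, targets[i][1]), prev))
        best.append(row)

    # Trace the cheapest path back from the final position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


database = [
    Unit("t1", "t", 100.0), Unit("t2", "t", 140.0),
    Unit("uw1", "uw", 105.0), Unit("uw2", "uw", 150.0),
]
chosen = select_units([("t", 120.0), ("uw", 110.0)], database)
print([u.unit_id for u in chosen])
```

In this toy database both recordings of "t" are equally far from the wanted pitch, so the join cost decides between them: the search picks the one that connects more smoothly to the best-matching "uw". The concatenation and signal modification that follow the search are not shown.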