Stress manipulation in text-to-speech synthesis using speaking rate categories
The challenge of controlling prosody in text-to-speech (TTS) systems is as old as TTS itself. The problem is not only knowing what the desired stress or intonation patterns are, nor is it limited to knowing how to control specific speech parameters (e.g. duration, amplitude, fundamental frequency). We also need to know the precise parameter settings, over entire utterances, that correspond to a given stress or intonation pattern.
We propose that the powerful TTS models afforded by deep neural networks, combined with the fact that speech parameters are often correlated and vary in orchestration, allow us to solve at least some stress and intonation problems through simplified control over a few relatively easy-to-control parameters, rather than detailed control over many parameters.
The paper presents a straightforward method of guiding word durations without recording training material specifically for this purpose. The resulting TTS engine is used to produce sentences containing Swedish words that are unstressed in their most common function, but stressed in another common function. The sentences are designed so that it is clear to a listener that the second function is the intended one. In these cases, TTS engines often fail and produce an unstressed version.
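The interface such guidance implies can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Rate` categories, the `annotate` helper, and the example sentence are all hypothetical, standing in for whatever per-word speaking-rate labels the TTS front end accepts.

```python
from enum import Enum

class Rate(Enum):
    """Hypothetical coarse speaking-rate categories fed to the TTS front end."""
    FAST = "fast"
    NORMAL = "normal"
    SLOW = "slow"

def annotate(words, slow_indices):
    """Pair each word with a rate category.

    Slowing a word is the single, easy-to-control cue; the model's
    correlated parameters (F0, amplitude, ...) are expected to follow.
    """
    return [(w, Rate.SLOW if i in slow_indices else Rate.NORMAL)
            for i, w in enumerate(words)]

# Illustrative only: slow down the fourth word to cue stress on it.
tagged = annotate(["Det", "var", "han", "som", "kom"], {3})
print(tagged)
```

The point of the design is that the caller never touches durations, F0 contours, or amplitudes directly; a single coarse category per word is the entire control surface.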
A group of 20 listeners compared samples that the TTS produced without guidance to samples where it was instructed to slow down the test words. The listeners almost unanimously preferred the latter version. This supports the notion that, owing to the orchestrated variation of speech characteristics and the strength of modern DNN models, we can provide prosodic guidance to DNN-based TTS systems without having to control every characteristic in detail.
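A paired preference test like this is commonly analyzed with an exact binomial sign test. The sketch below shows that computation; the counts are hypothetical (the source reports only that preference was "almost unanimous"), and the function name is our own.

```python
from math import comb

def sign_test_p(prefer_guided: int, n: int) -> float:
    """Two-sided exact binomial sign test under H0: preference p = 0.5."""
    half = 0.5 ** n
    lower = sum(comb(n, k) for k in range(0, prefer_guided + 1)) * half
    upper = sum(comb(n, k) for k in range(prefer_guided, n + 1)) * half
    return min(1.0, 2 * min(lower, upper))

# Hypothetical outcome: 19 of 20 listeners prefer the slowed-down version.
p = sign_test_p(19, 20)
print(f"p = {p:.2e}")  # well below 0.001, so the preference is significant
```

Even a single dissenting listener out of twenty leaves the result far below conventional significance thresholds, which is why near-unanimous preference in a 20-listener panel is strong evidence.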