Autoregressive Text-to-Speech (TTS) models such as Tacotron 2 and Flowtron have shown a great ability to synthesize natural speech, even with non-autoregressive alternatives available. However, autoregressive models are delicate to train and can suffer from various problems such as skipping tokens, mispronouncing words, or even catastrophic collapse at inference time.
Various techniques have already been proposed to mitigate these issues and reduce training difficulty. However, many of them are inflexible, significantly increase model size, or depend on pre-computed alignment data.
In this paper, we propose Convolutional Attention Consistency (CAC), a method that aids alignment learning without any of the aforementioned disadvantages. We demonstrate how integrating a second alignment learning mechanism and using it as a loss function to guide our model enhances its performance.
Figure 1: A visualization of how we guide Tacotron 2 with the Alignment Encoder's outputs, enforcing consistency.
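To make this concrete, below is a minimal sketch of what such a consistency term could look like, assuming it penalizes the squared difference between the decoder's attention weights and the Alignment Encoder's soft alignment. The function name, tensor shapes, and the choice of an L2 penalty (rather than, say, a KL term) are illustrative assumptions, not the exact formulation from the paper.

```python
import torch

def attention_consistency_loss(decoder_attention, encoder_alignment,
                               text_lengths, mel_lengths):
    """Hypothetical consistency term between two alignments.

    decoder_attention: (B, T_mel, T_text) attention weights from Tacotron 2's decoder.
    encoder_alignment: (B, T_mel, T_text) soft alignment from the Alignment Encoder.
    """
    # Treat the Alignment Encoder's output as a fixed target so the gradient
    # only pushes the decoder's attention toward it.
    target = encoder_alignment.detach()

    # Build a mask so padded text/mel positions do not contribute to the loss.
    mask = torch.zeros_like(decoder_attention, dtype=torch.bool)
    for b, (t_len, m_len) in enumerate(zip(text_lengths, mel_lengths)):
        mask[b, :m_len, :t_len] = True

    # Simple L2 penalty between the two alignment matrices.
    squared_error = (decoder_attention - target) ** 2
    return squared_error[mask].mean()
```

During training, such a term would typically be added to the usual Tacotron 2 mel and gate losses with a weighting coefficient.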
Showcase 1: LJ Speech
We trained two Tacotron 2 models on the LJ Speech dataset: one with our proposed Convolutional Attention Consistency (CAC) loss, and one with a standard diagonal guided attention (DGA) loss as a baseline. Both were trained with r=1, a learning rate of 0.001, and a batch size of 32 for 110k steps, each on a single RTX 4090. We asked ChatGPT-4 to generate challenging sentences for this showcase.
Further details can be found in the paper.
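For reference, the training setup above might be captured in a configuration along the following lines; the field names are hypothetical and loosely follow common Tacotron 2 recipes, while the values are the ones stated above.

```python
# Illustrative configuration for the two LJ Speech runs; the field names
# are assumptions, the values are those stated above.
train_config = {
    "dataset": "LJSpeech",
    "n_frames_per_step": 1,    # r = 1
    "learning_rate": 1e-3,
    "batch_size": 32,
    "max_steps": 110_000,
    "alignment_loss": "cac",   # set to "dga" for the baseline model
    "hardware": "1x RTX 4090",
}
```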
"The quick brown fox jumps over the lazy dog, while the sun sets over the peaceful valley"
Tacotron 2 CAC (Ours)
Tacotron 2 DGA
"Even though John loves chocolate, strawberries, and ice cream, he decided to try the vanilla cake instead"
"She pondered the choices: red, green, blue, or yellow; finally, she picked the blue one, which turned out to be a wise decision"
"Peter Piper picked a peck of pickled peppers, how many pickled peppers did Peter Piper pick?"
"When I visited Rome, the capital of Italy, I saw the Colosseum, the Vatican, and St. Peter's Basilica"
"Across the moonlit lake, the fireflies danced, casting flickering lights in the summer night"
Showcase 2: Twilight Sparkle from MLP:FiM
To show the flexibility of our method, we take a 44.1 kHz model pretrained with CAC and fine-tune it on a dataset of a highly emotional speaker for 75k steps. As before, we trained two models: one fine-tuned with CAC and the other with the DGA baseline.
We note that our model is better at capturing emotion and tone, even in the absence of a dedicated emotion embedding.
"Oh my gosh, I can't believe we won the championship!"
Tacotron 2 CAC→FT→CAC (Ours)
Tacotron 2 CAC→FT→DGA
"I'm so sorry for your loss, words can't express how much I sympathize"
"Even though John loves chocolate, strawberries, and ice cream, he decided to try the vanilla cake instead"
"Peter Piper picked a peck of pickled peppers, how many pickled peppers did Peter Piper pick?"
"When I visited Rome, the capital of Italy, I saw the Colosseum, the Vatican, and St. Peter's Basilica"
"Across the moonlit lake, the fireflies danced, casting flickering lights in the summer night"