3.4.3 Conclusion
[Figure: scatter plot of the standard deviation of 𝐹0 (Hz; vertical axis, 0–100) against waveform length (seconds; horizontal axis, 0–16), with one series each for Pipeline, Tacotron (Characters), and Tacotron (Phones).]
Figure 3.4: Standard deviation of 𝐹0 against output waveform length for samples predicted from our Tacotron systems and the pipeline system.
The reason our Tacotron systems were outperformed by the pipeline system is unnatural prosody. This indicates that predicting pitch accent from surface form is quite difficult, and that increasing the parameter size cannot bring the prosody up to the level of the pipeline TTS system. Similar problems are reported in a few studies using Tacotron-based systems. According to [22], unnatural prosody, including unnatural pitch, was the most common error in their manual analysis of 100 challenging English sentences. This result motivates us to introduce pitch accent modeling to end-to-end TTS models, and we cover this in Chapter 5. Before we tackle pitch accent modeling, we investigate the learning of pitch accents in Chapter 4.
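For reference, per-utterance statistics like those in Figure 3.4 can be computed along the following lines. This is a minimal sketch, assuming pyworld for 𝐹0 extraction and soundfile for audio I/O; it is not the exact analysis code used in this study.

import numpy as np
import pyworld
import soundfile as sf

def f0_std_and_length(wav_path):
    """Return (waveform length in seconds, std. dev. of voiced F0 in Hz)."""
    x, fs = sf.read(wav_path)
    x = np.asarray(x, dtype=np.float64)   # pyworld expects float64 samples
    f0, t = pyworld.dio(x, fs)            # coarse F0 contour
    f0 = pyworld.stonemask(x, f0, t, fs)  # refined F0 estimates
    voiced = f0[f0 > 0]                   # unvoiced frames are coded as 0 Hz
    f0_std = float(np.std(voiced)) if voiced.size > 0 else 0.0
    return len(x) / fs, f0_std

Plotting the two returned values against each other for every test utterance gives a scatter plot of the kind shown in Figure 3.4.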
Table 3.2: Comparison of structure and configuration between Tacotron and Tacotron2 in the literature and the modified Tacotron models in this study. Baseline Tacotron and Self-attention Tacotron correspond to models A and B illustrated in Figure 3.1. Numbers in bracketed tuples indicate the size of hidden layer(s). For the modified Tacotrons, number tuples before and after “/” denote the configuration for the small and large model parameter sizes, respectively. Note that the CBH block in the encoders of the modified Tacotrons can be replaced with a CNN. Pho., accen., and para. denote phoneme, accentual type, and parameter, respectively.
Component | Tacotron [1] | Tacotron2 [22] | Baseline Tacotron | Self-attention Tacotron
Vocoder | Griffin-Lim [73] | WaveNet [12] | WaveNet, open-source version | WaveNet, open-source version
Decoder: Post-net | CBHG(256) | CNN(512) | CNN(512) | CNN(512)
Decoder: Self-attention | - | - | - | Self-attention(256)/(1024)
Decoder: Decoder RNN | GRU(256,256) | - | LSTM(256,256)/(1024,1024) | LSTM(256,256)/(1024,1024)
Decoder: Attention | Additive(256) | Location-aware(128) | Forward(256)/(128) | Additive & Forward(256)/(128)
Decoder: Attention RNN | GRU(256) | LSTM(1024,1024) | LSTM(256)/(128) | LSTM(256)/(128)
Decoder: Pre-net | FFN(256,128) | FFN(256,256) | FFN(256,128)/(256,256) | FFN(256,128)/(256,256)
Encoder: Self-attention | - | - | - | Self-attention(32)/(64)
Encoder: Encoder core | CBHG(256) | Bi-LSTM(512), CNN(512) | Bi-LSTM(256)/(512), CBH(128)/(256) | Bi-LSTM(256)/(512), CBH(128)/(256)
Encoder: Pre-net | FFN(256,128) | - | FFN Pho.(224,112)/(448,224), FFN Accen.(32,16)/(64,32) | FFN Pho.(224,112)/(448,224), FFN Accen.(32,16)/(64,32)
Encoder: Embedding | (256) | (512) | Pho.(224)/(448), Accen.(32)/(64) | Pho.(224)/(448), Accen.(32)/(64)
#Para. w/o vocoder | 6.9×10⁶ | 27.3×10⁶ | 11.3×10⁶ / 35.8×10⁶ | 11.6×10⁶ / 41.6×10⁶
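To make the small/large convention of Table 3.2 concrete, the Self-attention Tacotron column can be transcribed as a configuration mapping. This is only an illustrative sketch; the field names below are ours, not identifiers from the original implementation.

# Hypothetical transcription of the Self-attention Tacotron column of
# Table 3.2; "small"/"large" hold the values before/after "/" in the table.
SELF_ATTENTION_TACOTRON = {
    "small": {
        "encoder": {"embedding_pho": 224, "embedding_accen": 32,
                    "prenet_pho": (224, 112), "prenet_accen": (32, 16),
                    "cbh": 128, "bi_lstm": 256, "self_attention": 32},
        "decoder": {"prenet": (256, 128), "attention_rnn_lstm": 256,
                    "attention": 256, "decoder_rnn_lstm": (256, 256),
                    "self_attention": 256, "postnet_cnn": 512},
    },
    "large": {
        "encoder": {"embedding_pho": 448, "embedding_accen": 64,
                    "prenet_pho": (448, 224), "prenet_accen": (64, 32),
                    "cbh": 256, "bi_lstm": 512, "self_attention": 64},
        "decoder": {"prenet": (256, 256), "attention_rnn_lstm": 128,
                    "attention": 128, "decoder_rnn_lstm": (1024, 1024),
                    "self_attention": 1024, "postnet_cnn": 512},
    },
}

Note that, per the table, the attention and attention RNN sizes are smaller in the large configuration (128 vs. 256); the growth in parameters comes from the decoder RNN, self-attention, and encoder layers.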
Table 3.3: Alignment error rate on the English test set. Each system was evaluated three times; each time it was trained from scratch with a different initial random seed. Ave. indicates the average error rate. Values in bold correspond to those evaluated in a listening test.
(a) Systems using phone as input
Para. size | Encoder | Self-att. | #Para. (×10⁶) | Alignment error rate (%): Ave. | 1 | 2 | 3
Small | CBHL | - | 7.9 | 7.6 | 6.2 | 8.6 | 9.0
Small | CBHL | X | 12.1 | 17.9 | 10.0 | 10.4 | 33.2
Small | CNN | - | 9.2 | 9.3 | 5.0 | 5.2 | 17.8
Small | CNN | X | 9.9 | 15.6 | 13.0 | 16.6 | 17.2
Large | CBHL | - | 36.7 | 1.0 | 0.6 | 0.8 | 1.6
Large | CBHL | X | 47.6 | 0.6 | 0.2 | 0.6 | 1.0
Large | CNN | - | 27.2 | 1.1 | 0.2 | 1.4 | 1.6
Large | CNN | X | 38.1 | 1.0 | 0.8 | 1.0 | 1.2

(b) Systems using character as input
Para. size | Encoder | Self-att. | #Para. (×10⁶) | Alignment error rate (%): Ave. | 1 | 2 | 3
Small | CBHL | - | 11.3 | 14.9 | 7.4 | 8.6 | 28.8
Small | CBHL | X | 12.1 | 23.4 | 18.8 | 21.8 | 29.6
Small | CNN | - | 9.2 | 11.1 | 6.6 | 10.8 | 15.8
Small | CNN | X | 9.9 | 18.0 | 15.0 | 16.8 | 22.2
Large | CBHL | - | 36.7 | 1.1 | 0.8 | 1.0 | 1.6
Large | CBHL | X | 47.6 | 0.8 | 0.4 | 1.0 | 1.0
Large | CNN | - | 27.2 | 0.5 | 0.2 | 0.6 | 0.6
Large | CNN | X | 38.1 | 0.7 | 0.4 | 0.6 | 1.0
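The Ave. column in Table 3.3 is simply the mean over the three runs. A minimal sketch of this aggregation, with a hypothetical data layout keyed by (parameter size, encoder, self-attention), is:

# Hypothetical layout: each system maps to its alignment error rates (%)
# from three training runs with different initial random seeds.
error_rates = {
    ("Small", "CBHL", "X"): [10.0, 10.4, 33.2],  # phone input
    ("Small", "CNN", "-"):  [5.0, 5.2, 17.8],
    # ... remaining systems ...
}
# Ave. column: mean over the three seeds, rounded to one decimal place.
averages = {system: round(sum(rates) / len(rates), 1)
            for system, rates in error_rates.items()}
print(averages)  # {('Small', 'CBHL', 'X'): 17.9, ('Small', 'CNN', '-'): 9.3}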
Table 3.4: Mann-Whitney rank test for the English listening test. Cells in one color denote statistical significance (𝑝 ≤ 0.01), while cells in the other color denote 𝑝 > 0.01.
[Significance matrix: the cell colors are not recoverable in plain text. The systems compared, as both rows and columns, are: natural speech; ABS (Tacotron mel. spec. and vocoder para.); the pipeline system (vocoder para.); and the large-parameter Tacotron systems, broken down by input (character or phone), encoder (CBHL or CNN), and self-attention (with or without).]
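The pairwise comparisons in Table 3.4 use the Mann-Whitney rank test. A minimal sketch of such pairwise testing with scipy is shown below; the scores mapping is a hypothetical data layout, with one list of listener ratings per system.

# scipy.stats.mannwhitneyu performs the Mann-Whitney rank test itself;
# the `scores` dict (system name -> listener ratings) is our assumption.
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_significance(scores, alpha=0.01):
    """Map each system pair to True if the rating distributions differ
    significantly (two-sided Mann-Whitney U test, p <= alpha)."""
    results = {}
    for sys_a, sys_b in combinations(scores, 2):
        _, p_value = mannwhitneyu(scores[sys_a], scores[sys_b],
                                  alternative="two-sided")
        results[(sys_a, sys_b)] = p_value <= alpha
    return results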
4 Learning pitch accent in end-to-end TTS
4.1 Pitch accent
Pitch accent is an accent that involves pitch change. Generating speech with correct pitch accents is especially challenging for end-to-end TTS, because end-to-end TTS does not rely on elaborate linguistic features such as part-of-speech, which provide helpful information for inferring the types of pitch accents.
Pitch accent is one of the most important features of prosody. Prosodic description differs among languages. Figure 4.1b shows a classification of languages based on prosodic description. There are two major groups: tone languages and intonation languages. The prosodic description of tone languages such as Mandarin and Japanese is lexical, which means that pitch accents (tones) are determined by linguistic units such as syllables or words. Intonation languages are further classified as fixed or variable. Fixed intonation languages have a fixed accent position within words. Variable intonation languages do not have a fixed accent position, so pitch accents cannot be determined straightforwardly. English has the variable intonation type of prosodic description.
[Figure: two classification trees. (a) Writing systems: logographic (morphemic, polymorphemic) versus phonographic (syllabic, segmental, featural); example scripts in the figure include Sumerian, Mandarin, Kana, Greek, Hangul, Roman, Cyrillic, and Semitic. (b) Prosodic descriptions: tone languages, lexical (e.g., Mandarin) or grammatical (e.g., Edo), versus intonation languages, fixed (e.g., French, Hungarian) or variable (e.g., English).]
Figure 4.1: Classification of languages based on writing systems and prosodic descriptions, based on [3].
We saw in Chapter 3 that end-to-end TTS systems struggled to generate natural pitch change in English.
The relationship between prosodic description and writing system is what makes the learning of pitch accents difficult for end-to-end TTS. Figure 4.1a shows a classification of languages based on writing system. Logographic writing systems use logographic characters, which represent meaning. Phonographic writing systems use characters which represent sound. Tone languages such as Mandarin and Japanese adopt logographic writing systems, while intonation languages such as English adopt phonographic writing systems. Surface form, and the lexical information derived from it, is necessary to infer pitch accent, but using surface form as input for TTS is normally difficult in logographic languages because the large diversity of logographic characters results in data sparsity. Phonemes are therefore normally used as TTS input for logographic languages, and the lack of surface form information makes predicting pitch accents difficult. Surface form can be used as TTS input in phonographic languages, but intonation is not necessarily lexical and is thus difficult to infer.