3.4.3 Conclusion
[Figure: scatter plot of the standard deviation of 𝐹0 (Hz; vertical axis, 0–100) against waveform length (seconds; horizontal axis, 0–16), with one series each for Pipeline, Tacotron (Characters), and Tacotron (Phones).]
Figure 3.4: Standard deviation of 𝐹0 against output waveform length for samples predicted from our Tacotron systems and the pipeline system.
The reason our Tacotron systems were outperformed by the pipeline system is unnatural prosody. This indicates that predicting pitch accent from surface form is quite difficult, and that increasing the parameter size cannot bring the prosody up to the level of the pipeline TTS system. Similar problems are reported in a few studies using Tacotron-based systems. According to [22], unnatural prosody, including unnatural pitch, was the most common error in their manual analysis of 100 challenging English sentences. This result motivates us to introduce pitch accent modeling to end-to-end TTS models, and we cover this in Chapter 5. Before we tackle pitch accent modeling, we investigate the learning of pitch accents in Chapter 4.
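For reference, per-utterance statistics like those in Figure 3.4 can be computed along the following lines. This is a minimal sketch, assuming pyworld for 𝐹0 extraction and soundfile for audio I/O; it is not the exact analysis code used in this study.

import numpy as np
import pyworld
import soundfile as sf

def f0_std_and_length(wav_path):
    """Return (waveform length in seconds, std. dev. of voiced F0 in Hz)."""
    x, fs = sf.read(wav_path)
    x = np.asarray(x, dtype=np.float64)   # pyworld expects float64 samples
    f0, t = pyworld.dio(x, fs)            # coarse F0 contour
    f0 = pyworld.stonemask(x, f0, t, fs)  # refined F0 estimates
    voiced = f0[f0 > 0]                   # unvoiced frames are coded as 0 Hz
    f0_std = float(np.std(voiced)) if voiced.size > 0 else 0.0
    return len(x) / fs, f0_std

Plotting the two returned values against each other for every test utterance gives a scatter plot of the kind shown in Figure 3.4.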
Table 3.2: Comparison of structure and configuration between Tacotron and Tacotron2 in the literature and the modified Tacotron models in this study. Baseline Tacotron and Self-attention Tacotron correspond to models A and B illustrated in Figure 3.1. Numbers in bracketed tuples indicate the size of hidden layer(s). For the modified Tacotrons, number tuples before and after “/” denote the configuration for the small and large model parameter sizes, respectively. Note that the CBH block in the encoders of the modified Tacotrons can be replaced with a CNN. Pho., accen., and para. denote phoneme, accentual type, and parameter, respectively.
Component | Tacotron [1] | Tacotron2 [22] | Baseline Tacotron | Self-attention Tacotron
Vocoder | Griffin-Lim [73] | WaveNet [12] | WaveNet, open-source version | WaveNet, open-source version
Decoder: Post-net | CBHG(256) | CNN(512) | CNN(512) | CNN(512)
Decoder: Self-attention | - | - | - | Self-attention(256)/(1024)
Decoder: Decoder RNN | GRU(256,256) | - | LSTM(256,256)/(1024,1024) | LSTM(256,256)/(1024,1024)
Decoder: Attention | Additive(256) | Location-aware(128) | Forward(256)/(128) | Additive & Forward(256)/(128)
Decoder: Attention RNN | GRU(256) | LSTM(1024,1024) | LSTM(256)/(128) | LSTM(256)/(128)
Decoder: Pre-net | FFN(256,128) | FFN(256,256) | FFN(256,128)/(256,256) | FFN(256,128)/(256,256)
Encoder: Self-attention | - | - | - | Self-attention(32)/(64)
Encoder: Encoder core | CBHG(256) | Bi-LSTM(512), CNN(512) | Bi-LSTM(256)/(512), CBH(128)/(256) | Bi-LSTM(256)/(512), CBH(128)/(256)
Encoder: Pre-net | FFN(256,128) | - | FFN Pho.(224,112)/(448,224), FFN Accen.(32,16)/(64,32) | FFN Pho.(224,112)/(448,224), FFN Accen.(32,16)/(64,32)
Encoder: Embedding | (256) | (512) | Pho.(224)/(448), Accen.(32)/(64) | Pho.(224)/(448), Accen.(32)/(64)
#Para. w/o vocoder | 6.9×10⁶ | 27.3×10⁶ | 11.3×10⁶ / 35.8×10⁶ | 11.6×10⁶ / 41.6×10⁶
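To make the small/large convention of Table 3.2 concrete, the Self-attention Tacotron column can be transcribed as a configuration mapping. This is only an illustrative sketch; the field names below are ours, not identifiers from the original implementation.

# Hypothetical transcription of the Self-attention Tacotron column of
# Table 3.2; "small"/"large" hold the values before/after "/" in the table.
SELF_ATTENTION_TACOTRON = {
    "small": {
        "encoder": {"embedding_pho": 224, "embedding_accen": 32,
                    "prenet_pho": (224, 112), "prenet_accen": (32, 16),
                    "cbh": 128, "bi_lstm": 256, "self_attention": 32},
        "decoder": {"prenet": (256, 128), "attention_rnn_lstm": 256,
                    "attention": 256, "decoder_rnn_lstm": (256, 256),
                    "self_attention": 256, "postnet_cnn": 512},
    },
    "large": {
        "encoder": {"embedding_pho": 448, "embedding_accen": 64,
                    "prenet_pho": (448, 224), "prenet_accen": (64, 32),
                    "cbh": 256, "bi_lstm": 512, "self_attention": 64},
        "decoder": {"prenet": (256, 256), "attention_rnn_lstm": 128,
                    "attention": 128, "decoder_rnn_lstm": (1024, 1024),
                    "self_attention": 1024, "postnet_cnn": 512},
    },
}

Note that, per the table, the attention and attention RNN sizes are smaller in the large configuration (128 vs. 256); the growth in parameters comes from the decoder RNN, self-attention, and encoder layers.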
Table 3.3: Alignment error rate on the English test set. Each system was evaluated three times; each time it was trained from scratch with a different initial random seed. Ave. indicates the average error rate. Values in bold correspond to those evaluated in a listening test.
(a) Systems using phone as input
Para. size | Encoder | Self-att. | #Para. (×10⁶) | Alignment error rate (%): Ave. | 1 | 2 | 3
Small | CBHL | - | 7.9 | 7.6 | 6.2 | 8.6 | 9.0
Small | CBHL | X | 12.1 | 17.9 | 10.0 | 10.4 | 33.2
Small | CNN | - | 9.2 | 9.3 | 5.0 | 5.2 | 17.8
Small | CNN | X | 9.9 | 15.6 | 13.0 | 16.6 | 17.2
Large | CBHL | - | 36.7 | 1.0 | 0.6 | 0.8 | 1.6
Large | CBHL | X | 47.6 | 0.6 | 0.2 | 0.6 | 1.0
Large | CNN | - | 27.2 | 1.1 | 0.2 | 1.4 | 1.6
Large | CNN | X | 38.1 | 1.0 | 0.8 | 1.0 | 1.2

(b) Systems using character as input
Para. size | Encoder | Self-att. | #Para. (×10⁶) | Alignment error rate (%): Ave. | 1 | 2 | 3
Small | CBHL | - | 11.3 | 14.9 | 7.4 | 8.6 | 28.8
Small | CBHL | X | 12.1 | 23.4 | 18.8 | 21.8 | 29.6
Small | CNN | - | 9.2 | 11.1 | 6.6 | 10.8 | 15.8
Small | CNN | X | 9.9 | 18.0 | 15.0 | 16.8 | 22.2
Large | CBHL | - | 36.7 | 1.1 | 0.8 | 1.0 | 1.6
Large | CBHL | X | 47.6 | 0.8 | 0.4 | 1.0 | 1.0
Large | CNN | - | 27.2 | 0.5 | 0.2 | 0.6 | 0.6
Large | CNN | X | 38.1 | 0.7 | 0.4 | 0.6 | 1.0
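The Ave. column in Table 3.3 is simply the mean over the three runs. A minimal sketch of this aggregation, with a hypothetical data layout keyed by (parameter size, encoder, self-attention), is:

# Hypothetical layout: each system maps to its alignment error rates (%)
# from three training runs with different initial random seeds.
error_rates = {
    ("Small", "CBHL", "X"): [10.0, 10.4, 33.2],  # phone input
    ("Small", "CNN", "-"):  [5.0, 5.2, 17.8],
    # ... remaining systems ...
}
# Ave. column: mean over the three seeds, rounded to one decimal place.
averages = {system: round(sum(rates) / len(rates), 1)
            for system, rates in error_rates.items()}
print(averages)  # {('Small', 'CBHL', 'X'): 17.9, ('Small', 'CNN', '-'): 9.3}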
Table 3.4: Mann-Whitney rank test for the English listening test. Cells in one color denote statistical significance (𝑝 ≤ 0.01), while cells in the other color denote 𝑝 > 0.01.
[Significance matrix: the cell colors are not recoverable in plain text. The systems compared, as both rows and columns, are: natural speech; ABS (Tacotron mel. spec. and vocoder para.); the pipeline system (vocoder para.); and the large-parameter Tacotron systems, broken down by input (character or phone), encoder (CBHL or CNN), and self-attention (with or without).]
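The pairwise comparisons in Table 3.4 use the Mann-Whitney rank test. A minimal sketch of such pairwise testing with scipy is shown below; the scores mapping is a hypothetical data layout, with one list of listener ratings per system.

# scipy.stats.mannwhitneyu performs the Mann-Whitney rank test itself;
# the `scores` dict (system name -> listener ratings) is our assumption.
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_significance(scores, alpha=0.01):
    """Map each system pair to True if the rating distributions differ
    significantly (two-sided Mann-Whitney U test, p <= alpha)."""
    results = {}
    for sys_a, sys_b in combinations(scores, 2):
        _, p_value = mannwhitneyu(scores[sys_a], scores[sys_b],
                                  alternative="two-sided")
        results[(sys_a, sys_b)] = p_value <= alpha
    return results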
4 Learning pitch accent in end-to-end TTS
4.1 Pitch accent
Pitch accent is an accent that involves pitch change. Generating speech with correct pitch accents is especially challenging for end-to-end TTS, because end-to-end TTS does not rely on elaborate linguistic features such as part-of-speech, which provide helpful information for inferring the types of pitch accents.
Pitch accent is one of the most important features of prosody. Prosodic description differs among languages. Figure 4.1b shows a classification of languages based on prosodic description. There are two major groups: tone languages and intonation languages. The prosodic description of tone languages such as Mandarin and Japanese is lexical, which means that pitch accents (tones) are determined by linguistic units such as syllables or words. Intonation languages are further classified as fixed or variable. Fixed intonation languages have a fixed accent position within words. Variable intonation languages do not have a fixed accent position, so pitch accents cannot be determined straightforwardly. English has the variable intonation type of prosodic description.
[Figure: two classification trees. (a) Writing systems: logographic (morphemic, polymorphemic) versus phonographic (syllabic, segmental, featural); example scripts in the figure include Sumerian, Mandarin, Kana, Greek, Hangul, Roman, Cyrillic, and Semitic. (b) Prosodic descriptions: tone languages, lexical (e.g., Mandarin) or grammatical (e.g., Edo), versus intonation languages, fixed (e.g., French, Hungarian) or variable (e.g., English).]
Figure 4.1: Classification of languages based on writing systems and prosodic descriptions, based on [3].
We saw in Chapter 3 that end-to-end TTS systems struggled to generate natural pitch change in English.
The relationship between prosodic description and writing system is what makes the learning of pitch accents difficult for end-to-end TTS. Figure 4.1a shows a classification of languages based on writing system. Logographic writing systems use logographic characters, which represent meaning. Phonographic writing systems use characters which represent sound. Tone languages such as Mandarin and Japanese adopt logographic writing systems, while intonation languages such as English adopt phonographic writing systems. Surface form, and the lexical information derived from it, is necessary to infer pitch accent, but using surface form as input for TTS is normally difficult in logographic languages because the large diversity of logographic characters results in data sparsity. Phonemes are therefore normally used as TTS input for logographic languages, and the lack of surface form information makes predicting pitch accents difficult. Surface form can be used as TTS input in phonographic languages, but intonation is not necessarily lexical and is thus difficult to infer.