An end-to-end, also known as sequence-to-sequence, TTS framework is the main focus of this thesis. As Figure 2.1 shows, an end-to-end TTS framework has a different structure from the pipeline one: it unifies the front-end, duration model, and acoustic model into a single model by utilizing the DNN-based sequence-to-sequence framework [18], which can map a source sequence to a target sequence of a different length. Like pipeline TTS, sequence-to-sequence TTS has variants that use different types of acoustic features and waveform generation methods: vocoder parameters with a conventional vocoder [19,20], vocoder parameters with a neural vocoder [21], mel and linear spectrograms with inverse STFT and phase recovery [1], and mel or linear spectrograms with a neural vocoder [22].
In this thesis, we use “pipeline” to refer to all methods in a pipeline framework and
“DNN-based pipeline TTS” to refer to both DNN-based SPSS and vocoder-free TTS.
We clearly mention a specific TTS method if the distinction is important. We use
“conventional” or “traditional” for the pipeline TTS framework, even though some pipeline methods are relatively new in TTS history and the first sequence-to-sequence TTS was proposed only a few years ago [19].
2.2 End-to-end TTS
2.2.1 Overview
End-to-end TTS is a new TTS framework based on a sequence-to-sequence learning method [18]. It unifies the linguistic model, acoustic model, duration model, and ultimately the waveform model [36,37] of the conventional pipeline framework into a single model. It has attracted many studies because of its potential to unify all models into one and its convenience: only text and speech are needed for model training.
Many architectures have been proposed for sequence-to-sequence based TTS.
Table 2.2 shows major end-to-end TTS methods. Notable architecture examples are Tacotron [1], Char2Wav [21], VoiceLoop [20], DeepVoice3 [34], DCTTS [38], Tacotron2 [22], and Transformer [39]. All of them have an encoder-decoder structure [40], which consists of an encoder that encodes the linguistic input and a decoder that decodes acoustic features autoregressively.
Table 2.1: List of languages, other than English, to which end-to-end TTS has been applied.

Language    Linguistic feature          Corpus        Reference
Chinese     phoneme                     Internal      [19,23,24]
Chinese     phone & tone                Internal      [25]
Chinese     pinyin                      CSS10         [26,27]
Chinese     character/phone/byte        Internal      [28]
Chinese     character                   China Daily   [29]
Chinese     full-context                Internal      [30]
Chinese     pinyin                      Data Baker    [31]
Korean      phoneme                     KSS           [32]
Korean      character                   Internal      [27]
Japanese    phoneme & accentual type    ATR Ximera    [2]
Japanese    full-context                ATR Ximera    [33]
Japanese    roman                       CSS10         [26]
Dutch       character                   CSS10         [26,27]
Finnish     character                   CSS10         [26,27]
French      character                   CSS10         [26,27]
German      character                   CSS10         [26,27]
Greek       character                   CSS10         [26,27]
Hungarian   character                   CSS10         [26,27]
Russian     character                   CSS10         [26,27]
Spanish     character                   CSS10         [26,27]
Spanish     character/phone/byte        Internal      [28]
From the perspective of the neural network (NN) type used to implement the sequence-to-sequence framework, these TTS architectures can be roughly classified into recurrent neural network (RNN)-based [1,21,22], convolutional neural network (CNN)-based [34,38], self-attention-based [39], and memory-buffer-based [20] ones.
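To make this encoder-decoder structure concrete, the following is a minimal PyTorch sketch of an RNN-based sequence-to-sequence TTS model. It is not a faithful reproduction of any system in Table 2.2: the GRU encoder, the dot-product attention, the single-frame output, and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn


class Seq2SeqTTS(nn.Module):
    """Minimal RNN-based encoder-decoder TTS sketch (illustrative only)."""

    def __init__(self, n_symbols=80, emb_dim=256, enc_dim=256, mel_dim=80):
        super().__init__()
        # Encoder: symbol embedding + bidirectional RNN over the input sequence.
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        self.encoder = nn.GRU(emb_dim, enc_dim // 2, batch_first=True,
                              bidirectional=True)
        # Autoregressive decoder: consumes the previous acoustic frame and a
        # context vector produced by an attention mechanism.
        self.decoder = nn.GRUCell(mel_dim + enc_dim, enc_dim)
        self.mel_proj = nn.Linear(enc_dim, mel_dim)

    def forward(self, symbols, n_frames):
        # symbols: (batch, U) integer IDs; n_frames: number of frames to decode.
        enc_out, _ = self.encoder(self.embedding(symbols))   # (batch, U, enc_dim)
        batch = symbols.size(0)
        state = enc_out.new_zeros(batch, enc_out.size(2))
        prev_frame = enc_out.new_zeros(batch, self.mel_proj.out_features)
        outputs = []
        for _ in range(n_frames):
            # Dot-product attention over the encoded linguistic features.
            scores = torch.bmm(enc_out, state.unsqueeze(2))  # (batch, U, 1)
            align = torch.softmax(scores, dim=1)
            context = (align * enc_out).sum(dim=1)           # (batch, enc_dim)
            state = self.decoder(torch.cat([prev_frame, context], dim=-1), state)
            prev_frame = self.mel_proj(state)
            outputs.append(prev_frame)
        return torch.stack(outputs, dim=1)                   # (batch, T, mel_dim)

Real systems additionally refine the decoder output with a post-net (see Table 2.2) and predict when to stop decoding rather than generating a fixed number of frames.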
Note that, although end-to-end TTS is a new framework, each key element, such as phone and linguistic feature embedding [41,42], the autoregressive decoder [17,43], and the postnet [44,45], had already been investigated separately for DNN-based pipeline TTS methods. A notable difference between the sequence-to-sequence and pipeline TTS frameworks is therefore not the elements themselves but the joint training of these elements.
Table 2.2: Summary of major end-to-end TTS methods. All the methods use a soft attention mechanism for implicit alignment learning.
System              Network          Alignment           Decoder output   Post-net output
Char2Wav [21]       RNN              GMM                 Vocoder          -
Tacotron [1]        RNN              Additive            Mel              Linear
VoiceLoop [20]      Memory buffer    GMM                 Vocoder          -
Deep Voice 3 [34]   CNN              Dot-product         Mel              Linear/Vocoder
Tacotron 2 [22]     RNN              Location-sensitive  Mel              Mel
Transformer [35]    Self-attention   Dot-product         Mel              Mel
2.2.2 Linguistic modeling
The pipeline TTS frameworks typically use grapheme-to-phoneme conversion. The sequence-to-sequence TTS instead uses an encoder, $\boldsymbol{y}_{1:U} = \mathrm{Encoder}(y_{1:U})$, to transform input characters $y_{1:U}$ into a linguistic representation $\boldsymbol{y}_{1:U}$ that is supposed to encode the pronunciation and possibly prosody information.
Table 2.1 shows the languages other than English to which end-to-end TTS has been applied.
An initial proof of concept of sequence-to-sequence TTS was investigated in Chinese with phoneme input [19]. Character-level input was then investigated in English [1,21] and was confirmed to be feasible in other alphabetic languages [26] and in Chinese [29,28]. The naturalness of synthetic speech using character input can be quite high for English [22].
Although implicit linguistic modeling is possible when the encoder has a sufficient number of non-linear transformation layers [46] and when there is a sufficient amount of training data, even large-scale speech corpora such as LibriTTS [47] never match a lexicon dictionary in terms of word coverage [48]. This suggests a potential limitation of end-to-end TTS for out-of-vocabulary words and out-of-domain text, especially in languages with highly non-phonemic orthography such as English and ideographic languages such as Chinese. It has been reported that phoneme input gives better results than character input [49,28] for sequence-to-sequence TTS, and many recent sequence-to-sequence TTS studies therefore avoid character input and use phonemes instead. Some studies use both characters and phonemes [34,50] or a random mix of them [51], as sketched below. Pre-training the encoder to learn the grapheme-to-phoneme mapping has also been reported to be effective [29]. Nevertheless, sequence-to-sequence TTS makes it convenient to construct a multilingual model by handling various inputs with a single model, and it has been investigated with character, phoneme, and even byte input [24,27,52,28].
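A possible sketch of such a random character/phoneme mix is shown below; the word-level lexicon and the substitution probability are hypothetical placeholders, and the exact scheme in [51] may differ.

import random


def mixed_representation(text, lexicon, phoneme_prob=0.5, seed=None):
    """Randomly replace words by their phoneme sequences (sketch of a
    character/phoneme mixed input; illustrative only)."""
    rng = random.Random(seed)
    symbols = []
    for word in text.split():
        phonemes = lexicon.get(word.lower())
        if phonemes is not None and rng.random() < phoneme_prob:
            symbols.extend(phonemes)      # use phoneme symbols for this word
        else:
            symbols.extend(list(word))    # fall back to raw characters
        symbols.append(" ")               # keep word boundaries
    return symbols[:-1]


# Hypothetical usage with a toy lexicon:
# lexicon = {"speech": ["S", "P", "IY1", "CH"]}
# mixed_representation("speech synthesis", lexicon, seed=0)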
There have also been attempts to use richer linguistic information as additional input to sequence-to-sequence TTS for tonal languages such as Chinese [53] and pitch-accent languages such as Japanese [54], where tone or accent and prosody information are indispensable yet hard to learn implicitly from graphemes or phonemes. However, some studies report that excessive information negatively affects their results [30,33]. Simpler linguistic features such as pinyin with tone [31], phones with tone [25,30], or phonemes with accentual type [2] seem to be good choices for these languages.
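As an illustration of supplying such tone or accent information alongside phones, the sketch below simply concatenates a phone embedding with a tone (or accentual-type) embedding before the encoder; the symbol inventories and embedding sizes are invented for the example and are not taken from the cited systems.

import torch
import torch.nn as nn


class PhoneToneInput(nn.Module):
    """Sketch: combine phone and tone/accent IDs into one encoder input."""

    def __init__(self, n_phones=60, n_tones=6, phone_dim=224, tone_dim=32):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, phone_dim)
        self.tone_emb = nn.Embedding(n_tones, tone_dim)

    def forward(self, phone_ids, tone_ids):
        # phone_ids, tone_ids: (batch, U) symbol-level sequences of equal length.
        # The concatenated vectors replace the plain character embeddings as
        # input to the sequence-to-sequence encoder.
        return torch.cat([self.phone_emb(phone_ids),
                          self.tone_emb(tone_ids)], dim=-1)   # (batch, U, 256)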
2.2.3 Alignment modeling
As another difference from DNN-based pipeline TTS, the sequence-to-sequence TTS does not use an external aligner or duration model. Instead, many sequence-to-sequence TTS methods use an attention mechanism [55] to implicitly align the source and target sequences. An attention mechanism computes an attention score, or alignment probability, $p(z_t = u \mid \boldsymbol{x}_{t-1}, \boldsymbol{y}_u)$, which indicates how likely it is that the alignment $z_t$ at the $t$-th time step corresponds to input position $u$, given the previous acoustic feature $\boldsymbol{x}_{t-1}$ and the $u$-th linguistic feature $\boldsymbol{y}_u$. An acoustic decoder decodes the output $\boldsymbol{x}_t$ from a context vector $\sum_{u=1}^{U} p(z_t = u \mid \boldsymbol{x}_{t-1}, \boldsymbol{y}_u)\,\boldsymbol{y}_u$, which is the expectation of the encoded linguistic features under the attention scores, as in
\[
\boldsymbol{x}_t = \mathrm{Decoder}\!\left(\sum_{u=1}^{U} p(z_t = u \mid \boldsymbol{x}_{t-1}, \boldsymbol{y}_u)\,\boldsymbol{y}_u,\; \boldsymbol{x}_{t-1}\right).
\]
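Written as code, one such decoder step might look like the sketch below, with a simple additive scorer standing in for $p(z_t = u \mid \boldsymbol{x}_{t-1}, \boldsymbol{y}_u)$; the layer names and dimensions are assumptions for illustration only.

import torch
import torch.nn as nn


class AttentionDecoderStep(nn.Module):
    """One autoregressive step: attention scores, context vector, next frame."""

    def __init__(self, mel_dim=80, enc_dim=256, att_dim=128):
        super().__init__()
        # Additive scorer over the previous frame and the encoded linguistic features.
        self.frame_proj = nn.Linear(mel_dim, att_dim)
        self.memory_proj = nn.Linear(enc_dim, att_dim)
        self.score_proj = nn.Linear(att_dim, 1)
        self.decoder_cell = nn.GRUCell(mel_dim + enc_dim, enc_dim)
        self.frame_out = nn.Linear(enc_dim, mel_dim)

    def forward(self, x_prev, y, state):
        # x_prev: (batch, mel_dim)    previous acoustic feature x_{t-1}
        # y:      (batch, U, enc_dim) encoded linguistic features y_{1:U}
        energies = self.score_proj(torch.tanh(
            self.frame_proj(x_prev).unsqueeze(1) + self.memory_proj(y)))
        p_z = torch.softmax(energies, dim=1)   # p(z_t = u | x_{t-1}, y_u)
        context = (p_z * y).sum(dim=1)         # sum_u p(z_t = u | ...) y_u
        state = self.decoder_cell(torch.cat([x_prev, context], dim=-1), state)
        x_t = self.frame_out(state)            # next acoustic feature x_t
        return x_t, p_z.squeeze(-1), state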
There are many types of attention mechanisms. Basic ones include the Gaussian mixture model (GMM) [56], additive [55], dot-product [57], and location-sensitive [58] attention mechanisms, which were designed for other tasks but have been applied to TTS as in [21,20,1,34,39,19,22]. Several attention mechanisms have also been proposed specifically for TTS, where the alignment between source and target is monotonic and the target sequence is much longer than the source. They include forward attention [25], stepwise monotonic attention [59], mixture-of-logistics attention [60], and dynamic convolution attention [61], all of which use a time-relative approach to enforce monotonicity.
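To illustrate this time-relative idea, the sketch below shows a GMM-style attention step whose kernel means are updated by non-negative increments, so the attention can only move forward over the input. The published variants cited above differ in kernel shape and parameterization; this is a schematic example only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationRelativeAttention(nn.Module):
    """Sketch of a GMM-style attention step with monotonically advancing means."""

    def __init__(self, query_dim=256, n_components=5):
        super().__init__()
        # Predict (mixture logit, log-width, log-step) per component from the query.
        self.param_proj = nn.Linear(query_dim, 3 * n_components)

    def forward(self, query, prev_means, memory_length):
        # query: (batch, query_dim) decoder state; prev_means: (batch, K).
        logits, log_width, log_step = self.param_proj(query).chunk(3, dim=-1)
        weights = torch.softmax(logits, dim=-1)        # mixture weights
        width = F.softplus(log_width) + 1e-3           # kernel widths > 0
        step = F.softplus(log_step)                    # step sizes >= 0
        means = prev_means + step                      # monotonic: never move back
        positions = torch.arange(memory_length, device=query.device).float()
        # Gaussian kernels centred on the (monotonically advancing) means.
        kernels = torch.exp(-0.5 * ((positions[None, None, :] - means[:, :, None])
                                    / width[:, :, None]) ** 2)
        align = (weights[:, :, None] * kernels).sum(dim=1)        # (batch, U)
        align = align / (align.sum(dim=-1, keepdim=True) + 1e-8)
        return align, means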
In addition, attention regularization loss has also been proposed for robust and fast