
The front-end extracts linguistic features 𝑦1:𝑈 from the input text. These linguistic features are usually represented as full-context labels, which encode the segmental (e.g., phoneme identity) and suprasegmental (e.g., phrase-level intonation) information needed to utter the text. For example, a typical front-end for English may use a pronunciation dictionary and a decision-tree-based model to conduct letter-to-sound (or grapheme-to-phoneme) conversion [66]. It may use other hand-crafted rules or statistical models to infer phrase breaks [67] and pitch accents [68]. These front-end modules are usually language dependent, and the statistical ones require supervised training and hand-annotated datasets.
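For illustration, the following is a minimal Python sketch of dictionary-based grapheme-to-phoneme conversion with a naive letter-to-sound fallback. The tiny lexicon and fallback rules are invented for this example and do not correspond to any particular front-end in [66].

```python
# Minimal, illustrative dictionary-based grapheme-to-phoneme conversion.
# The lexicon and the letter-to-sound fallback below are toy examples; a real
# front-end would use a full pronunciation dictionary and a trained
# decision-tree or neural letter-to-sound model.

LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Naive single-letter fallback rules (purely hypothetical).
LETTER_TO_SOUND = {"a": "AH", "b": "B", "c": "K", "d": "D", "e": "EH"}

def grapheme_to_phoneme(word):
    """Look the word up in the lexicon; fall back to letter-by-letter rules."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_TO_SOUND.get(ch, ch.upper()) for ch in word]

print(grapheme_to_phoneme("hello"))  # ['HH', 'AH', 'L', 'OW']
print(grapheme_to_phoneme("abc"))    # fallback: ['AH', 'B', 'K']
```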

Note that there are also methods to build a front-end with less language-specific knowledge or supervised training. For example, the vector space model may be used to derive linguistic features from input text in an unsupervised way [41,42]. However, such a method is not applicable to ideographic languages such as Japanese due to the sparse distribution of written characters. On the other hand, there are also methods to jointly optimize statistical models in the front-end with the acoustic model in the back-end [69]. However, this method was only proposed for HMM-based SPSS and cannot be applied to DNN-based SPSS directly.

2.3.2 Back-end

The back-end of the pipeline TTS framework has two models: an acoustic model and a waveform model. The acoustic model has been advanced by replacing hidden Markov models (HMMs) with DNNs [10]. On the basis of vanilla feed-forward and recurrent NNs, researchers have further proposed many NN-based acoustic models with improved probabilistic modeling capability.
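As a rough illustration of such an acoustic model, the following PyTorch sketch maps frame-level linguistic feature vectors to acoustic feature vectors (e.g., vocoder parameters) with a plain feed-forward network; the feature dimensionalities are assumed for the example and do not reflect a specific system.

```python
import torch
import torch.nn as nn

# Sketch of a feed-forward DNN acoustic model for SPSS: it maps a frame-level
# linguistic feature vector to an acoustic feature vector (e.g. vocoder
# parameters). All dimensionalities are illustrative.
class FeedForwardAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=300, acoustic_dim=187, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(linguistic_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, acoustic_dim),
        )

    def forward(self, linguistic_frames):       # (T, linguistic_dim)
        return self.net(linguistic_frames)      # (T, acoustic_dim)

model = FeedForwardAcousticModel()
dummy = torch.randn(100, 300)                   # 100 frames of linguistic features
print(model(dummy).shape)                       # torch.Size([100, 187])
```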

In particular, the vanilla feed-forward and recurrent NNs assume statistical independence among the acoustic feature frames [43]. To amend the independence assumption, autoregressive NN models were proposed to model the acoustic feature distribution of one frame conditioned on the previous frames, which can be implemented by feeding back the previous frames linearly [70] or non-linearly [43,17]. A similar idea can be used to capture the causal dependence between different acoustic feature dimensions [71]. Meanwhile, unlike the autoregressive NNs, there are also other NN-based models using intractable or implicit density functions. Examples of the former case include the post-filtering approach using a restricted Boltzmann machine (RBM) [44] and the trajectory DNN [72]. An example of the latter case is the generative adversarial network (GAN) [45]. All of these recently proposed NN-based acoustic models alleviate the over-smoothing effect observed in the vanilla NNs.
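The non-linear feedback idea can be sketched as follows: a recurrent cell receives the frame-level linguistic features together with the previously generated acoustic frame. This is only a schematic PyTorch example under assumed dimensionalities, not the exact models of [70,43,17].

```python
import torch
import torch.nn as nn

# Sketch of an autoregressive recurrent acoustic model: each output frame is
# conditioned on the frame-level linguistic features AND the previously
# generated acoustic frame, fed back through the recurrence (non-linear
# feedback). Dimensions are illustrative.
class AutoregressiveAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=300, acoustic_dim=187, hidden_dim=512):
        super().__init__()
        self.cell = nn.GRUCell(linguistic_dim + acoustic_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, linguistic_frames):               # (B, T, linguistic_dim)
        B, T, _ = linguistic_frames.shape
        h = linguistic_frames.new_zeros(B, self.cell.hidden_size)
        prev = linguistic_frames.new_zeros(B, self.proj.out_features)
        outputs = []
        for t in range(T):
            # Feed back the previously generated acoustic frame.
            x = torch.cat([linguistic_frames[:, t], prev], dim=-1)
            h = self.cell(x, h)
            prev = self.proj(h)
            outputs.append(prev)
        return torch.stack(outputs, dim=1)               # (B, T, acoustic_dim)

model = AutoregressiveAcousticModel()
print(model(torch.randn(2, 50, 300)).shape)              # torch.Size([2, 50, 187])
```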

There are two types of waveform model in the back-end: algorithmic methods and neural methods. The conventional vocoder is an algorithmic method that has long been a core component of SPSS. However, artifacts in the waveforms generated by the vocoder limit the quality of SPSS, which has motivated the development of waveform generation methods such as Griffin-Lim [73] in vocoder-free TTS and neural waveform models [8]. Compared with Griffin-Lim and related algorithms, neural waveform models show better performance [74]. They are also more flexible: they can work on various types of input acoustic features, including mel-spectrograms and vocoder parameters, be implemented with different types of NNs such as GANs [75] and normalizing flows [76], and incorporate classical signal processing components such as linear prediction coefficients (LPC) [77] and the harmonic-plus-noise model [78].
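As an illustration of vocoder-free waveform generation, the following sketch uses librosa's Griffin-Lim implementation to recover phase and invert a magnitude spectrogram. Here the spectrogram is computed from a bundled example recording rather than predicted by an acoustic model, and the STFT settings are arbitrary.

```python
import numpy as np
import librosa

# Sketch of Griffin-Lim based waveform generation. In vocoder-free TTS the
# magnitude (or power) spectrogram would be predicted by the acoustic model;
# here it is computed from an example recording just to have something to invert.
y, sr = librosa.load(librosa.ex("trumpet"))
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Iterative phase recovery followed by inverse STFT.
y_rec = librosa.griffinlim(magnitude, n_iter=32, hop_length=256)
print(y_rec.shape, sr)
```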

2.3.3 Duration model

HSMM-based SPSS models state duration with a Gaussian distribution to align the source and target [5]. In this case, the duration model is inherent to the HSMM, and there is no external duration model.

DNN-based SPSS uses an independent phoneme duration model for alignment. The phoneme duration model is trained with duration labels obtained from a forced aligner, which can be either an HMM-based or a neural recognition model. Given the phoneme durations, the linguistic features 𝑦1:𝑈 can be upsampled to the length of the acoustic feature sequence, denoted ŷ1:𝑇.
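The upsampling step itself is simple, as the following NumPy sketch with toy durations shows: each phoneme-level feature vector is repeated for as many frames as the phoneme lasts.

```python
import numpy as np

# Duration-based upsampling: phoneme-level linguistic features y_{1:U} are
# repeated according to predicted phoneme durations (in frames) to obtain
# frame-level features y_hat_{1:T}. Values and sizes are toy data.
U, D = 4, 8
phoneme_features = np.random.randn(U, D)             # y_{1:U}
durations = np.array([3, 5, 2, 6])                   # frames per phoneme

frame_features = np.repeat(phoneme_features, durations, axis=0)  # y_hat_{1:T}
assert frame_features.shape[0] == durations.sum()    # T = 16
```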

Phoneme duration models can also be used to replace soft-attention in end-to-end TTS systems. In this case, the phoneme duration model is normally an external model, so such a system is not regarded as an end-to-end TTS framework; rather, it is similar to vocoder-free TTS within the pipeline TTS framework. The forced aligner used to train the duration model can be an HMM-based [79,80] or DNN-based [81] recognition model. Some systems use a generative model as the forced aligner instead of a recognition model, such as a TTS model using hard-attention [82] or soft-attention [63].

Figure 2.1: TTS frameworks appearing in this thesis. The pipeline TTS framework consists of independent front-end and back-end models and includes HSMM based SPSS, DNN based SPSS, and vocoder-free TTS. HSMM based SPSS and DNN based SPSS predict vocoder parameters with an HSMM or a DNN, respectively, and use a conventional vocoder for waveform generation. Vocoder-free TTS predicts a power spectrogram with a DNN and uses inverse STFT and phase recovery to generate the waveform. A neural vocoder can also be used for waveform generation. Sequence-to-sequence TTS has a single model; it may use either vocoder parameters or a power spectrogram, together with the corresponding waveform generation method or a neural vocoder.


3 Analysis and issues of end-to-end TTS

3.1 Targets of analysis

3.1.1 Pronunciation

Learning pronunciation directly from text is one of the most important characteristics of the end-to-end TTS framework. Resolving pronunciation is possible in alphabetic languages, but it is not easy, as described in Section 2.2.2 in Chapter 2. Phonemes are an alternative input feature to characters for end-to-end TTS that avoids pronunciation ambiguity.

In this chapter, we focus on factors affecting the ability of an end-to-end TTS model to disambiguate the pronunciation of text. The factor we cover here is the neural network architecture; the encoder's network structure is especially important for pronunciation learning. We compare two encoder structures to see how much the pronunciation learned from text can be improved by the encoder structure: one structure is specialized for character inputs, and the other is a simple structure.
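The following PyTorch sketch contrasts the two styles of encoder: a simple embedding-plus-recurrent encoder, and a character-specialized encoder that adds 1-D convolutions over the character sequence before the recurrence. The layer sizes are illustrative, and the exact architectures compared in this chapter's experiments may differ.

```python
import torch.nn as nn

# Two illustrative encoder structures for character input. These are sketches
# of the two styles compared (simple vs. character-specialized), not the exact
# architectures evaluated in this chapter.
class SimpleEncoder(nn.Module):
    def __init__(self, n_symbols=128, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True,
                           batch_first=True)

    def forward(self, char_ids):                     # (B, L) integer characters
        out, _ = self.rnn(self.embed(char_ids))
        return out                                   # (B, L, hidden_dim)

class CharacterSpecializedEncoder(nn.Module):
    """Adds 1-D convolutions over characters to capture local context
    (e.g. letter n-grams) before the recurrent layer."""
    def __init__(self, n_symbols=128, emb_dim=256, hidden_dim=256, n_convs=3):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, emb_dim)
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                          nn.BatchNorm1d(emb_dim), nn.ReLU())
            for _ in range(n_convs)])
        self.rnn = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True,
                           batch_first=True)

    def forward(self, char_ids):                     # (B, L)
        x = self.embed(char_ids).transpose(1, 2)     # (B, emb_dim, L)
        x = self.convs(x).transpose(1, 2)            # (B, L, emb_dim)
        out, _ = self.rnn(x)
        return out                                   # (B, L, hidden_dim)
```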

3.1.2 Alignment

End-to-end TTS has an internal alignment module to align linguistic features and acoustic features, which have different lengths. Soft-attention is the most popular method for the alignment module in end-to-end TTS: it provides a simple and computationally efficient alignment mechanism that avoids marginalization over alignments.

Most end-to-end TTS methods adopt soft-attention. However, soft-attention is known to have a major issue in the case of TTS. In TTS, the alignment between linguistic and acoustic features must be monotonic because the transcription is read aloud from left to right, yet alignments predicted with soft-attention may contain fatal errors. Soft-attention can see the whole transcript at any time step of decoding, even though only a small part of the transcript should correspond to a given frame of acoustic features. This property of soft-attention suits tasks such as machine translation, in which the relation between source and target may require reordering.

It is, however, not ideal for TTS, in which the relationship between source and target is always monotonic and the target is much longer than the source, which makes predicting alignments challenging. Fatal alignment errors caused by soft-attention may result in muffling, repetition, skipping, or ill termination.
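The following NumPy sketch illustrates the point: at every decoder step, content-based soft-attention produces a distribution over the entire source sequence, and nothing constrains consecutive distributions to move monotonically. The scores are random stand-ins for the learned scoring function.

```python
import numpy as np

# Sketch of content-based soft-attention weights: at each decoder step the
# weights form a distribution over ALL source positions, with no monotonic
# constraint between steps. Scores are random placeholders for
# score(decoder state, encoder outputs).
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

U = 10                                   # source (linguistic feature) length
rng = np.random.default_rng(0)

for t in range(3):                       # a few decoder time steps
    scores = rng.normal(size=U)
    alpha = softmax(scores)              # attention weights over all positions
    print(t, alpha.round(2))             # may jump anywhere between steps
```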

Several soft-attention methods have been proposed to stabilize alignments. They include forward attention [25], stepwise monotonic attention [59], mixture-of-logistics attention [60], and dynamic convolution attention [61], all of which use a time-relative approach to enforce monotonicity.
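As a rough sketch of how such a time-relative constraint can work, the following NumPy example implements a forward-attention-style recursion as we understand it from [25]: the weight at source position u at step t can only come from positions u or u-1 at step t-1, which biases the alignment toward monotonic movement. The content-based distribution y_t is replaced by random values here.

```python
import numpy as np

# Rough sketch of a forward-attention-style recursion (after [25], without the
# transition agent): attention mass can only stay at the same source position
# or move one position forward between consecutive decoder steps.
def forward_attention_step(alpha_prev, y_t):
    shifted = np.concatenate(([0.0], alpha_prev[:-1]))   # alpha_{t-1}(u-1)
    alpha = (alpha_prev + shifted) * y_t                  # restrict movement
    return alpha / alpha.sum()                            # renormalize

U = 6
alpha = np.zeros(U)
alpha[0] = 1.0                                            # alignment starts at u = 0
for t in range(4):
    y_t = np.random.dirichlet(np.ones(U))                 # stand-in for soft-attention output
    alpha = forward_attention_step(alpha, y_t)
    print(t, alpha.round(2))                              # mass advances at most one step
```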

In this chapter, we focus on factors other than the attention mechanism itself that affect alignment error rates. The factors are 1) model parameter size and 2) neural network architecture. We choose these factors because 1) they affect the learning ability of the TTS model in general, including alignment, and 2) modifying them is not complicated. We use forward attention [25] as the attention mechanism in our experiments, as it provides robust alignment and fast learning of alignment. Alignment methods other than soft-attention, which enforce monotonicity by design, are treated in Chapters 6 and 7.

3.2 End-to-end TTS methods used for analysis