This thesis focuses on two lexical features of speech to improve end-to-end TTS: pitch accent and phoneme duration. We also advance an end-to-end TTS method that models alignment by constructing discrete latent alignments based on phoneme duration.
Background
The impact of these configurations on end-to-end TTS performance has not been investigated. Pronunciation was one of the top issues in end-to-end TTS.
Thesis overview
- Motivation
- Topic and scope
- Issues to be addressed
- Contribution
In later chapters, we propose a common framework to better handle these two features in end-to-end TTS. We introduce latent variables into the end-to-end TTS model to represent the two factors.
Outline of thesis
Chapter 5 proposes a framework to model pitch accent as a latent variable in end-to-end TTS. Chapter 7 proposes a framework to model phoneme duration as a latent variable, designed based on the method of modeling pitch accent as a latent variable in Chapter 5.
Notation
This thesis focuses on the end-to-end TTS framework, one of several frameworks for TTS. HSMM-based SPSS uses a conventional vocoder to produce the waveform [6,7], which suffers from artifacts caused by the minimum-phase assumption and other assumptions about the speech production process [8].
End-to-end TTS
- Overview
- Linguistic modeling
- Alignment modeling
- Acoustic modeling
A first proof of concept of sequence-to-sequence TTS was investigated in Chinese with phoneme input [19]. Many sequence-to-sequence TTS methods instead use an attention mechanism [55] to implicitly align the source and target sequences.
Pipeline TTS framework
- Front-end
- Back-end
- Duration model
- Pronunciation
- Alignment
Learning pronunciation directly from texts is one of the key features of the end-to-end TTS framework. In this chapter, we focus on factors that influence the ability of an end-to-end TTS model to disambiguate the pronunciation of texts.
End-to-end TTS methods used for analysis
Tacotron and Tacotron2
Zoneout [87] regularization is applied to long short-term memory (LSTM) layers in the encoder and decoder. In Tacotron, on the other hand, the regularization is only performed in the encoder pre-net and the decoder pre-net via dropout.
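To make the contrast with dropout concrete, the following minimal sketch (PyTorch; the function name and rate are illustrative, not the exact implementation used in this study) shows how zoneout keeps each recurrent state element at its previous value with some probability during training and uses the expected value at inference:

    import torch

    def zoneout(prev_state, new_state, rate, training):
        """Zoneout: with probability `rate`, a hidden unit keeps its previous
        value instead of the updated one; at inference the expectation is used."""
        if training:
            mask = torch.bernoulli(torch.full_like(prev_state, rate))
            return mask * prev_state + (1.0 - mask) * new_state
        return rate * prev_state + (1.0 - rate) * new_state

    # Toy demonstration; in practice this is applied to LSTM hidden and cell states.
    h_prev = torch.zeros(4)
    h_new = torch.ones(4)
    print(zoneout(h_prev, h_new, rate=0.5, training=True))    # mix of 0s and 1s
    print(zoneout(h_prev, h_new, rate=0.5, training=False))   # all 0.5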
Baseline Tacotron
Note that the post-net on the mel spectrogram is added for the experiments in this study (see Section 4.4.3) and was not used in the original work [1,2]. The output of the decoder RNN is projected to the mel spectrogram with a linear layer as the final output of the network.
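As a rough sketch of this output stage (PyTorch; the layer sizes and the two-layer post-net are illustrative simplifications, not the exact configuration of the baseline), the decoder output is projected to mel frames by a linear layer, and the post-net adds a residual refinement:

    import torch
    import torch.nn as nn

    decoder_dim, n_mels = 1024, 80                 # illustrative dimensions
    to_mel = nn.Linear(decoder_dim, n_mels)        # linear projection to mel frames
    post_net = nn.Sequential(                      # simplified convolutional post-net
        nn.Conv1d(n_mels, 512, kernel_size=5, padding=2), nn.Tanh(),
        nn.Conv1d(512, n_mels, kernel_size=5, padding=2),
    )

    decoder_out = torch.randn(1, 100, decoder_dim)            # (batch, frames, dim)
    mel_before = to_mel(decoder_out)                          # coarse mel prediction
    mel_after = mel_before + post_net(mel_before.transpose(1, 2)).transpose(1, 2)
    print(mel_after.shape)                                    # (1, 100, 80)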
Self-attention Tacotron
Zoneout regularization of LSTM cells was introduced in Tacotron2 [89,22], and we use LSTM with zoneout regularization for all RNN layers, including the attention RNN and decoder RNN. For the attention mechanism, we use forward attention without a transition agent [25] instead of additive attention [55].
Pipeline TTS framework
Comparison between end-to-end and pipeline TTS framework
Nevertheless, sequence-to-sequence-based TTS systems using a neural vocoder achieved relatively high scores, and the best score was achieved by a DNN-based SPSS system using a neural vocoder, a front-end based on BERT (Bidirectional Encoder Representations from Transformers), an autoregressive duration model and a non-autoregressive acoustic model with a GAN post-filter [98]. The first DNN-based model in SPSS is responsible for modeling the mel-generalized cepstral coefficient (MGC).
Experiments
Experimental conditions
We measured the error rate of 16 Tacotron models on the test set of 500 utterances. We checked the statistical significance of the results between systems with a Mann-Whitney rank test [104].
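For reference, a significance check of this kind can be run with SciPy's implementation of the Mann-Whitney test; the scores below are made-up placeholders, not results from this experiment:

    from scipy.stats import mannwhitneyu

    # Hypothetical per-utterance scores for two systems (illustrative values only).
    scores_a = [4, 3, 5, 4, 4, 3, 5, 4]
    scores_b = [3, 3, 4, 3, 2, 4, 3, 3]

    # Two-sided Mann-Whitney rank test between the two systems.
    stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    print(f"U = {stat}, p = {p_value:.4f}")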
Experimental results
Among the systems with the small parameter size, those that used self-attention had particularly high alignment error rates. In models with the large parameter size, self-attention had no impact on the fatal alignment error rates.
Conclusion
This result motivates us to introduce pitch accent modeling into the end-to-end TTS model, which we cover in Chapter 5. The relationship between the prosodic description and the writing system is what makes learning pitch accents difficult for end-to-end TTS.
TTS for pitch accent languages: Japanese as an example
Japanese is a language with a "mora" accent: an accent nucleus position, counted in mora units, exists within each accent phrase. Thus, we provide accent type labels as the minimal input for end-to-end TTS to produce pitch accents and investigate various factors affecting pitch accent prediction ability.
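As a toy illustration of why accent type labels are needed, consider the classic minimal pair /hashi/: the phoneme sequence is identical for "edge" (type 0), "chopsticks" (type 1), and "bridge" (type 2), so the accent type must be supplied alongside the phonemes. The encoding scheme in the sketch below is hypothetical and only for illustration, not the exact input format used in this study:

    # Accent type = position of the accent nucleus in mora units (0 = no nucleus).
    examples = {
        "hashi (edge)":       {"phonemes": ["h", "a", "sh", "i"], "accent_type": 0},
        "hashi (chopsticks)": {"phonemes": ["h", "a", "sh", "i"], "accent_type": 1},
        "hashi (bridge)":     {"phonemes": ["h", "a", "sh", "i"], "accent_type": 2},
    }

    # One hypothetical way to build the model input: append an accent-type token
    # to the phoneme sequence of each accent phrase.
    def build_input(entry):
        return entry["phonemes"] + [f"A{entry['accent_type']}"]

    print(build_input(examples["hashi (bridge)"]))  # ['h', 'a', 'sh', 'i', 'A2']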
TTS systems used in this investigation
- Baseline Tacotron using phoneme and accentual type
- Self-attention Tacotron using phoneme and accentual type
- Tacotron using vocoder parameters
- Pipeline systems
Self-attention Tacotron introduces "self-attention" to LSTM layers at the encoder and decoder as illustrated in Figure 4.3-B. At the encoder, the output of CBH-LSTM layers is processed with the self-attention block.
Experiments
- Japanese speech corpus
- Experimental conditions (1)
- Experimental conditions (2)
- Experimental results (1)
- Experimental results (2)
- Conclusion
For both the baseline Tacotron and the self-attention Tacotron, the CBHL encoder performed better than the CNN encoder at small parameter sizes. Self-attention was expected to improve naturalness for models with the small parameter size.
From learning to modeling pitch accents
To do so, the end-to-end TTS model must model pitch accents as a latent variable.
Pitch accent modeling
Design of latent space
For example, the tones of Mandarin can be represented by six pitch contour patterns: LL (low), HH (high), LH (rising), HL (falling), L (neutral low), and H (neutral high), and are annotated in the C-ToBI format [117]. The Japanese pitch accent in the Tokyo dialect has a more limited pattern, which consists of three pitch contour patterns: HL (falling), L (neutral low), and H (neutral high), and is annotated in the X-JToBI format [118].
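The two inventories above can be written down compactly; the small Python sketch below (constant names are ours, not part of C-ToBI or X-JToBI) makes the size of the corresponding discrete latent space explicit:

    # Tone / pitch-accent label inventories as described above.
    MANDARIN_TONE_CONTOURS = {
        "LL": "low", "HH": "high", "LH": "rising",
        "HL": "falling", "L": "neutral low", "H": "neutral high",
    }
    JAPANESE_PITCH_CONTOURS = {
        "HL": "falling", "L": "neutral low", "H": "neutral high",
    }

    # A latent code per linguistic unit only needs to index one of these patterns,
    # e.g. a codebook of size len(JAPANESE_PITCH_CONTOURS) for Japanese.
    print(len(MANDARIN_TONE_CONTOURS), len(JAPANESE_PITCH_CONTOURS))  # 6 3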
Framework to model and optimize latent variable
An abstract tone contour is a linguistic-unit-level representation of pitch accent: it represents a simplified pattern of pitch variation for a linguistic unit. The abstract tone contour is a natural choice of modeling target because it is based on linguistic units.
How to enforce latent variable to represent pitch accent
Variational autoencoder
Motivation for speaking style modeling ranges from diverse speech generation to speaking style control and prosody transfer. In contrast, the conditional VAE is used for local speaking style modeling by conditioning the latent variable on linguistic features.
VQ-VAE
Pitch accent modeling with conditional VQ-VAE
- Model definition and variational lower bound
- Model parameterization
- Attention regularization
- Training criteria in summary
We use $l_{1:U} = (l_1, \dots, l_U)$ to denote the latent pitch accent for $y_{1:U}$. To match their lengths, the pitch accent recognizer downsamples the acoustic inputs to $\bar{x}_{1:U}$, and the decoder upsamples the linguistic inputs to $\hat{y}_{1:T}$. The downsampling is based on soft attention.
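The following is a minimal sketch of the soft-attention downsampling step, assuming scaled dot-product attention with the linguistic encodings as queries over the acoustic frames; the actual attention parameterization in the model may differ:

    import torch
    import torch.nn.functional as F

    U, T, d = 10, 200, 256         # illustrative lengths and dimension
    y = torch.randn(1, U, d)       # linguistic encodings (queries)
    x = torch.randn(1, T, d)       # acoustic encodings (keys and values)

    scores = torch.matmul(y, x.transpose(1, 2)) / d ** 0.5   # (1, U, T)
    weights = F.softmax(scores, dim=-1)                      # attention over frames
    x_bar = torch.matmul(weights, x)                         # downsampled acoustics
    print(x_bar.shape)                                       # (1, U, d)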
Experiments
Experimental results
A system given sentence information (PP) had a TER of 13.5%, which is low compared with the system without sentence information (system S-), indicating the importance of sentence information in predicting tones. A system with phrase information (system PP) had a higher MOS than the system without phrase information, indicating that phrase information is a good cue for predicting pitch accents.
Conclusion
In this chapter, we formulate our end-to-end TTS method using latent alignment by marginalizing all possible alignments. In Chapter 7, we formulate our end-to-end TTS method using latent duration based on variational inference.
Alignment in end-to-end TTS and its issues
Unlike predicting a fixed-length output, the stop flag avoids unnecessary computation, but it adds complexity to implement such a trivial function.
[Figure 6.1: Common fatal alignment errors from soft attention; panel (d) shows late termination.]
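For illustration, a decoding loop with a stop flag looks roughly like the sketch below; the model interface (`init_state`/`step`) and the dummy model are hypothetical stand-ins, not the decoder of any system in this thesis:

    # Decoding stops when the predicted stop probability crosses a threshold,
    # instead of always running to a fixed maximum length.
    class DummyModel:
        def init_state(self, text):
            return 0
        def step(self, state):
            frame = [0.0] * 80                        # placeholder mel frame
            stop_prob = 1.0 if state >= 99 else 0.0   # pretend speech ends at frame 100
            return frame, stop_prob, state + 1

    def decode(model, text, max_frames=2000, stop_threshold=0.5):
        frames, state = [], model.init_state(text)
        for _ in range(max_frames):
            frame, stop_prob, state = model.step(state)
            frames.append(frame)
            if stop_prob > stop_threshold:   # stop flag fired: no wasted computation
                break
        return frames

    print(len(decode(DummyModel(), "hello")))  # 100 frames instead of max_frames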
SSNT-based TTS: TTS using hard latent alignment
Model definition and learning of SSNT-based TTS
Here $\alpha(T, U)$ is the forward variable of the forward-backward algorithm at the final input position $U$ and the final time step $T$. The back-propagation can then be computed via the backward pass of the forward-backward algorithm by introducing a backward variable $\beta(t, u)$ and using the relation between $\alpha(t, u)$, $\beta(t, u)$, and $\partial p(\boldsymbol{x}_{1:T} \mid y_{1:U}; \boldsymbol{\theta}) / \partial \boldsymbol{\theta}$.
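The sketch below is a rough NumPy illustration of the forward recursion over the alignment trellis, under the simplifying assumption that the alignment is monotonic and advances by at most one input position per output frame; `emit` and `move` are placeholders for the probabilities the network actually predicts, and a real implementation works in log space for numerical stability:

    import numpy as np

    def forward_trellis(emit, move):
        """emit[t, u]: prob. of emitting frame t from input position u.
        move[t, u]: prob. of advancing from position u to u+1 at frame t."""
        T, U = emit.shape
        alpha = np.zeros((T, U))
        alpha[0, 0] = emit[0, 0]                      # alignment starts at position 0
        for t in range(1, T):
            for u in range(U):
                stay = alpha[t - 1, u] * (1.0 - move[t - 1, u])
                advance = alpha[t - 1, u - 1] * move[t - 1, u - 1] if u > 0 else 0.0
                alpha[t, u] = (stay + advance) * emit[t, u]
        return alpha                                   # alpha[-1, -1] ~ p(x_{1:T} | y_{1:U})

    # Toy example: 5 frames, 3 input positions, uniform placeholder probabilities.
    emit = np.full((5, 3), 0.5)
    move = np.full((5, 3), 0.5)
    print(forward_trellis(emit, move)[-1, -1])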
Network structure of SSNT-based TTS
One is a fully connected layer with sigmoid activation to compute the alignment transition probability $p(a_{t,u})$ of Eq.
Sampling methods of hard alignment
Alignment sampling from a discrete distribution
To randomly sample a discrete alignment transition variable at each input position and time step with the Gumbel-Max trick, Gumbel noise is first added to each logit of the alignment transition probabilities. The argmax operation is then applied to obtain a discrete sample of the alignment. Reference [144] uses the same criterion to stop the prediction upon consuming the entire input.
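A minimal sketch of the Gumbel-Max trick (the function name is ours, and the logits are illustrative):

    import torch

    def gumbel_max_sample(logits):
        """Draw a discrete sample from a categorical distribution defined by
        `logits`: add Gumbel(0, 1) noise to each logit and take the argmax."""
        uniform = torch.rand_like(logits).clamp_(1e-9, 1.0 - 1e-9)
        gumbel_noise = -torch.log(-torch.log(uniform))
        return torch.argmax(logits + gumbel_noise, dim=-1)

    # For the binary alignment transition variable, the logits are the two
    # unnormalized log-probabilities of "move" and "stay" (illustrative values).
    print(gumbel_max_sample(torch.tensor([1.2, -0.3])))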
Stochastic alignment search
This is a typical approach to random sampling from the Bernoulli distribution, and we refer to this mode as the "Logistic mode" because it involves sampling from the logistic distribution in continuous space.
Continuous relaxation of discrete alignment variables
Figure 6.5 summarizes the sampling process of the Logistic and binary Concrete conditions implemented with neural networks. At inference time, using the logistic noise is optional in both conditions.
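The two conditions can be summarized in a few lines; the sketch below (interface and names are illustrative) adds a Logistic(0, 1) sample to the transition logit and either thresholds it at zero (Logistic condition) or applies a tempered sigmoid (binary Concrete condition):

    import torch

    def sample_transition(logit, temperature=1.0, relaxed=False, add_noise=True):
        """Logistic condition: hard Bernoulli sample via thresholding at zero.
        Binary Concrete condition: relaxed sample via a tempered sigmoid.
        At inference time the noise may be switched off."""
        noise = 0.0
        if add_noise:
            u = torch.rand_like(logit).clamp_(1e-9, 1.0 - 1e-9)
            noise = torch.log(u) - torch.log(1.0 - u)        # Logistic(0, 1) sample
        if relaxed:
            return torch.sigmoid((logit + noise) / temperature)  # in (0, 1)
        return (logit + noise > 0).float()                       # in {0, 1}

    logit = torch.tensor([0.3])
    print(sample_transition(logit), sample_transition(logit, temperature=0.5, relaxed=True))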
Experiments
- Experimental conditions (1)
- Subjective evaluations
- Experimental conditions (2)
- Subjective evaluation
- Experimental result (1)
- Experimental result (2)
- Conclusion
We conducted an experiment to compare the Logistic and binary Concrete conditions for SSNT-TTS. We used the Adam optimizer [111] and trained models with a batch size of 100 and a reduction factor of 2 [1].
Related works using duration for alignments
Another topic that will be covered in this chapter is how to incorporate the duration model and forced aligner used in the DNN-based pipeline TTS framework into end-to-end TTS. We consider the mathematical role of a duration predictor and a forced aligner in end-to-end TTS to build an end-to-end TTS method using latent duration.
TTS using latent duration
Overview of our approach
Model definition and variational lower bound
Model parameterization
CTC-based alignment search
Training criterion in summary
Implementation
Experiments
Experimental condition
The reference systems are public models built by the ESPnet-TTS team [153], and we used these specific models: transformer.v3, fastspeech.v3, tacotron2.v2, and tacotron2.v3. FastSpeech uses an external duration model, Tacotron-based systems use soft attention, and Transformer-TTS uses soft attention with positional encoding.
Experimental results
Conclusion
For the second issue, regarding pitch accent prediction, we introduced pitch accent modeling into end-to-end TTS based on conditional VQ-VAE in Chapter 5. For the final issue, regarding alignment, we proposed an end-to-end TTS method that uses latent phoneme duration based on conditional VQ-VAE in Chapter 7.
Remaining issues
In Chapter 5, we introduced pitch accent modeling for end-to-end TTS based on conditional VQ-VAE. In Chapter 7, we proposed an end-to-end TTS method that uses latent phoneme duration based on VQ-VAE.
Final remark
Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language.
Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis.
Approximate posterior
Decoder
Prior
Vector quantization
The KL divergence between the two multivariate Gaussians can be calculated analytically as
$$D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)\,\|\,\mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)\big) = \tfrac{1}{2}\Big[\operatorname{tr}\!\big(\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1\big) + (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^{\top}\boldsymbol{\Sigma}_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) - d + \ln\tfrac{\det\boldsymbol{\Sigma}_2}{\det\boldsymbol{\Sigma}_1}\Big]. \quad (A.31)$$
By substituting the KL divergence in Eq. (A.31), the corresponding term of the lower bound can be evaluated in closed form.
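For reference, the closed-form expression in Eq. (A.31) can be evaluated directly; the helper below is an illustrative NumPy implementation, not code from this work:

    import numpy as np

    def kl_gaussians(mu1, sigma1, mu2, sigma2):
        """Closed-form KL divergence D_KL(N(mu1, Sigma1) || N(mu2, Sigma2))
        for full-covariance multivariate Gaussians."""
        d = mu1.shape[0]
        inv2 = np.linalg.inv(sigma2)
        diff = mu2 - mu1
        return 0.5 * (np.trace(inv2 @ sigma1) + diff @ inv2 @ diff - d
                      + np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1)))

    # Sanity check: the KL divergence between identical Gaussians is zero.
    mu = np.zeros(3); sigma = np.eye(3)
    print(kl_gaussians(mu, sigma, mu, sigma))  # 0.0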
In summary
- Box plots of MOS scores of each system regarding naturalness of
- Box plot of results of the Japanese listening test. Red dots indicate the mean,
- Schematic comparison of learning and modeling pitch accents
- Brief overview of idea to incorporate pitch accent model to end-to-end
- Architecture of end-to-end TTS system using pitch accent modeling
- Results of a listening test
- Common fatal alignment errors from soft-attention
- Trellis structure of our model. A path that connects 𝒙 and 𝒚 represents
- Detailed network structure of the SSNT-based TTS system
- Left: Sigmoid function. Right: Binary Concrete distribution. Points on
- Network architecture of SSNT-TTS
- MOS scores of subjective evaluation
- A sample that shows overestimation of pause duration
However, we further consider the condition $y_{1:U}$ and define a parametric form for the prior in Eq. (A.21) rather than assuming it to be uniform. This parametric prior leads to the loss term in Eq. (A.34), which is not included in unconditional models.