This thesis focuses on two lexical features of speech to improve end-to-end TTS: pitch accent and phoneme duration. We also advance an end-to-end TTS method that models alignment by constructing discrete latent alignments based on phoneme duration.
Background
The impact of these configurations on end-to-end TTS performance has not been investigated. Pronunciation was one of the top issues in end-to-end TTS.
Thesis overview
- Motivation
- Topic and scope
- Issues to be addressed
- Contribution
In later chapters, we propose a common framework to better handle these two features in end-to-end TTS. We introduce latent variables into the end-to-end TTS model to represent the two factors.
Outline of thesis
Chapter 5 proposes a framework to model pitch accent as a latent variable in end-to-end TTS. Chapter 7 proposes a framework to model phoneme duration as a latent variable, designed based on the method of modeling pitch accent as a latent variable in Chapter 5.
Notation
This thesis focuses on the end-to-end TTS framework, one of several frameworks for TTS. HSMM-based SPSS uses a conventional vocoder to produce the waveform [6,7], which suffers from artifacts caused by the minimum-phase assumption and other assumptions about the speech production process [8].
End-to-end TTS
- Overview
- Linguistic modeling
- Alignment modeling
- Acoustic modeling
A first proof of concept of sequence-to-sequence TTS was investigated in Chinese with phoneme input [19]. Many sequence-to-sequence TTS methods instead use an attention mechanism [55] to implicitly align the source and target sequences.
Pipeline TTS framework
- Front-end
- Back-end
- Duration model
- Pronunciation
- Alignment
Learning pronunciation directly from texts is one of the key features of the end-to-end TTS framework. In this chapter, we focus on factors that influence the ability of an end-to-end TTS model to disambiguate the pronunciation of texts.
End-to-end TTS methods used for analysis
Tacotron and Tacotron2
Zoneout [87] regularization is applied to long short-term memory (LSTM) layers in the encoder and decoder. In Tacotron, on the other hand, the regularization is only performed in the encoder pre-net and the decoder pre-net via dropout.
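To make the contrast with dropout concrete, the following minimal sketch (PyTorch; the function name and rate are illustrative, not the exact implementation used in this study) shows how zoneout keeps each recurrent state element at its previous value with some probability during training and uses the expected value at inference:

    import torch

    def zoneout(prev_state, new_state, rate, training):
        """Zoneout: with probability `rate`, a hidden unit keeps its previous
        value instead of the updated one; at inference the expectation is used."""
        if training:
            mask = torch.bernoulli(torch.full_like(prev_state, rate))
            return mask * prev_state + (1.0 - mask) * new_state
        return rate * prev_state + (1.0 - rate) * new_state

    # Toy demonstration; in practice this is applied to LSTM hidden and cell states.
    h_prev = torch.zeros(4)
    h_new = torch.ones(4)
    print(zoneout(h_prev, h_new, rate=0.5, training=True))    # mix of 0s and 1s
    print(zoneout(h_prev, h_new, rate=0.5, training=False))   # all 0.5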
Baseline Tacotron
Note that the post-net on the mel spectrogram is added for the experiments in this study (see Section 4.4.3) and was not used in the original work [1,2]. The output of the decoder RNN is projected to the mel spectrogram with a linear layer as the final output of the network.
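As a rough sketch of this output stage (PyTorch; the layer sizes and the two-layer post-net are illustrative simplifications, not the exact configuration of the baseline), the decoder output is projected to mel frames by a linear layer, and the post-net adds a residual refinement:

    import torch
    import torch.nn as nn

    decoder_dim, n_mels = 1024, 80                 # illustrative dimensions
    to_mel = nn.Linear(decoder_dim, n_mels)        # linear projection to mel frames
    post_net = nn.Sequential(                      # simplified convolutional post-net
        nn.Conv1d(n_mels, 512, kernel_size=5, padding=2), nn.Tanh(),
        nn.Conv1d(512, n_mels, kernel_size=5, padding=2),
    )

    decoder_out = torch.randn(1, 100, decoder_dim)            # (batch, frames, dim)
    mel_before = to_mel(decoder_out)                          # coarse mel prediction
    mel_after = mel_before + post_net(mel_before.transpose(1, 2)).transpose(1, 2)
    print(mel_after.shape)                                    # (1, 100, 80)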
Self-attention Tacotron
Zoneout regularization of LSTM cells was introduced in Tacotron2 [89,22], and we use LSTM with zoneout regularization for all RNN layers, including the attention RNN and decoder RNN. For the attention mechanism, we use forward attention without a transition agent [25] instead of additive attention [55].
Pipeline TTS framework
Comparison between end-to-end and pipeline TTS framework
Nevertheless, sequence-to-sequence-based TTS systems using a neural vocoder achieved relatively high scores, and the best score was achieved by a DNN-based SPSS system using a neural vocoder, a front-end based on BERT (Bidirectional Encoder Representations from Transformers), an autoregressive duration model and a non-autoregressive acoustic model with a GAN post-filter [98]. The first DNN-based model in SPSS is responsible for modeling the mel-generalized cepstral coefficient (MGC).
Experiments
Experimental conditions
We measured the error rate of 16 Tacotron models on the test set of 500 utterances. We checked the statistical significance of the results between systems with a Mann-Whitney rank test [104].
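For reference, a significance check of this kind can be run with SciPy's implementation of the Mann-Whitney test; the scores below are made-up placeholders, not results from this experiment:

    from scipy.stats import mannwhitneyu

    # Hypothetical per-utterance scores for two systems (illustrative values only).
    scores_a = [4, 3, 5, 4, 4, 3, 5, 4]
    scores_b = [3, 3, 4, 3, 2, 4, 3, 3]

    # Two-sided Mann-Whitney rank test between the two systems.
    stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    print(f"U = {stat}, p = {p_value:.4f}")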
Experimental results
Among the systems with the small parameter size, those that used self-attention had particularly high alignment error rates. In models with the large parameter size, self-attention had no impact on the fatal alignment error rates.
Conclusion
This result motivates us to introduce pitch accent modeling into the end-to-end TTS model, which we cover in Chapter 5. The relationship between the prosodic description and the writing system is what makes learning pitch accents difficult for end-to-end TTS.
TTS for pitch accent languages: Japanese as an example
Japanese is a language with a "mora" accent: an accent nucleus position, counted in mora units, exists within each accent phrase. Thus, we provide accent type labels as the minimal input for end-to-end TTS to produce pitch accents and investigate various factors affecting pitch accent prediction ability.
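As a toy illustration of why accent type labels are needed, consider the classic minimal pair /hashi/: the phoneme sequence is identical for "edge" (type 0), "chopsticks" (type 1), and "bridge" (type 2), so the accent type must be supplied alongside the phonemes. The encoding scheme in the sketch below is hypothetical and only for illustration, not the exact input format used in this study:

    # Accent type = position of the accent nucleus in mora units (0 = no nucleus).
    examples = {
        "hashi (edge)":       {"phonemes": ["h", "a", "sh", "i"], "accent_type": 0},
        "hashi (chopsticks)": {"phonemes": ["h", "a", "sh", "i"], "accent_type": 1},
        "hashi (bridge)":     {"phonemes": ["h", "a", "sh", "i"], "accent_type": 2},
    }

    # One hypothetical way to build the model input: append an accent-type token
    # to the phoneme sequence of each accent phrase.
    def build_input(entry):
        return entry["phonemes"] + [f"A{entry['accent_type']}"]

    print(build_input(examples["hashi (bridge)"]))  # ['h', 'a', 'sh', 'i', 'A2']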
TTS systems used in this investigation
- Baseline Tacotron using phoneme and accentual type
- Self-attention Tacotron using phoneme and accentual type
- Tacotron using vocoder parameters
- Pipeline systems
Self-attention Tacotron introduces "self-attention" to LSTM layers at the encoder and decoder as illustrated in Figure 4.3-B. At the encoder, the output of CBH-LSTM layers is processed with the self-attention block.
Experiments
- Japanese speech corpus
- Experimental conditions (1)
- Experimental conditions (2)
- Experimental results (1)
- Experimental results (2)
- Conclusion
For both the baseline Tacotron and the self-attention Tacotron, the CBHL encoder performed better than the CNN encoder at small parameter sizes. Self-attention was expected to improve naturalness for models with the small parameter size.
From learning to modeling pitch accents
To do so, the end-to-end TTS model must model pitch accents as a latent variable.
Pitch accent modeling
Design of latent space
For example, the tones of Mandarin can be represented by six pitch contour patterns: LL (low), HH (high), LH (rising), HL (falling), L (neutral low), and H (neutral high), and are annotated in the C-ToBI format [117]. The Japanese pitch accent in the Tokyo dialect has a more limited pattern, which consists of three pitch contour patterns: HL (falling), L (neutral low), and H (neutral high), and is annotated in the X-JToBI format [118].
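The two inventories above can be written down compactly; the small Python sketch below (constant names are ours, not part of C-ToBI or X-JToBI) makes the size of the corresponding discrete latent space explicit:

    # Tone / pitch-accent label inventories as described above.
    MANDARIN_TONE_CONTOURS = {
        "LL": "low", "HH": "high", "LH": "rising",
        "HL": "falling", "L": "neutral low", "H": "neutral high",
    }
    JAPANESE_PITCH_CONTOURS = {
        "HL": "falling", "L": "neutral low", "H": "neutral high",
    }

    # A latent code per linguistic unit only needs to index one of these patterns,
    # e.g. a codebook of size len(JAPANESE_PITCH_CONTOURS) for Japanese.
    print(len(MANDARIN_TONE_CONTOURS), len(JAPANESE_PITCH_CONTOURS))  # 6 3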
Framework to model and optimize latent variable
An abstract tone contour is a linguistic-unit-level representation of pitch accent: it represents a simplified pattern of pitch variation for a linguistic unit. The abstract tone contour is a natural choice of modeling target because it is based on linguistic units.
How to enforce latent variable to represent pitch accent
Variational autoencoder
Motivation for speaking style modeling ranges from diverse speech generation to speaking style control and prosody transfer. In contrast, the conditional VAE is used for local speaking style modeling by conditioning the latent variable on linguistic features.
VQ-VAE
Pitch accent modeling with conditional VQ-VAE
- Model definition and variational lower bound
- Model parameterization
- Attention regularization
- Training criteria in summary
We use $l_{1:U} = (l_1, \dots, l_U)$ to denote the latent pitch accent for $y_{1:U}$. To match their lengths, the pitch accent recognizer downsamples the acoustic inputs to $\bar{x}_{1:U}$, and the decoder upsamples the linguistic inputs to $\hat{y}_{1:T}$. The downsampling is based on soft attention.
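The following is a minimal sketch of the soft-attention downsampling step, assuming scaled dot-product attention with the linguistic encodings as queries over the acoustic frames; the actual attention parameterization in the model may differ:

    import torch
    import torch.nn.functional as F

    U, T, d = 10, 200, 256         # illustrative lengths and dimension
    y = torch.randn(1, U, d)       # linguistic encodings (queries)
    x = torch.randn(1, T, d)       # acoustic encodings (keys and values)

    scores = torch.matmul(y, x.transpose(1, 2)) / d ** 0.5   # (1, U, T)
    weights = F.softmax(scores, dim=-1)                      # attention over frames
    x_bar = torch.matmul(weights, x)                         # downsampled acoustics
    print(x_bar.shape)                                       # (1, U, d)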
Experiments
Experimental results
A system given sentence information (PP) had a TER of 13.5%, which is low compared with the system without sentence information (system S-), indicating the importance of sentence information in predicting tones. A system with phrase information (system PP) had a higher MOS than the system without phrase information, indicating that phrase information is a good cue for predicting pitch accents.
Conclusion
In this chapter, we formulate our end-to-end TTS method using latent alignment by marginalizing all possible alignments. In Chapter 7, we formulate our end-to-end TTS method using latent duration based on variational inference.
Alignment in end-to-end TTS and its issues
Unlike predicting a fixed-length output, the stop flag avoids unnecessary computation, but it adds complexity to implement such a trivial function.
[Figure 6.1: Common fatal alignment errors from soft attention; panel (d) shows late termination.]
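For illustration, a decoding loop with a stop flag looks roughly like the sketch below; the model interface (`init_state`/`step`) and the dummy model are hypothetical stand-ins, not the decoder of any system in this thesis:

    # Decoding stops when the predicted stop probability crosses a threshold,
    # instead of always running to a fixed maximum length.
    class DummyModel:
        def init_state(self, text):
            return 0
        def step(self, state):
            frame = [0.0] * 80                        # placeholder mel frame
            stop_prob = 1.0 if state >= 99 else 0.0   # pretend speech ends at frame 100
            return frame, stop_prob, state + 1

    def decode(model, text, max_frames=2000, stop_threshold=0.5):
        frames, state = [], model.init_state(text)
        for _ in range(max_frames):
            frame, stop_prob, state = model.step(state)
            frames.append(frame)
            if stop_prob > stop_threshold:   # stop flag fired: no wasted computation
                break
        return frames

    print(len(decode(DummyModel(), "hello")))  # 100 frames instead of max_frames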
SSNT-based TTS: TTS using hard latent alignment
Model definition and learning of SSNT-based TTS
Here $\alpha(T, U)$ is the forward variable of the forward-backward algorithm at the final input position $U$ and the final time step $T$. The back-propagation can then be computed via the backward pass of the forward-backward algorithm by introducing a backward variable $\beta(t, u)$ and using the relation between $\alpha(t, u)$, $\beta(t, u)$, and $\partial p(\boldsymbol{x}_{1:T} \mid y_{1:U}; \boldsymbol{\theta}) / \partial \boldsymbol{\theta}$.
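The sketch below is a rough NumPy illustration of the forward recursion over the alignment trellis, under the simplifying assumption that the alignment is monotonic and advances by at most one input position per output frame; `emit` and `move` are placeholders for the probabilities the network actually predicts, and a real implementation works in log space for numerical stability:

    import numpy as np

    def forward_trellis(emit, move):
        """emit[t, u]: prob. of emitting frame t from input position u.
        move[t, u]: prob. of advancing from position u to u+1 at frame t."""
        T, U = emit.shape
        alpha = np.zeros((T, U))
        alpha[0, 0] = emit[0, 0]                      # alignment starts at position 0
        for t in range(1, T):
            for u in range(U):
                stay = alpha[t - 1, u] * (1.0 - move[t - 1, u])
                advance = alpha[t - 1, u - 1] * move[t - 1, u - 1] if u > 0 else 0.0
                alpha[t, u] = (stay + advance) * emit[t, u]
        return alpha                                   # alpha[-1, -1] ~ p(x_{1:T} | y_{1:U})

    # Toy example: 5 frames, 3 input positions, uniform placeholder probabilities.
    emit = np.full((5, 3), 0.5)
    move = np.full((5, 3), 0.5)
    print(forward_trellis(emit, move)[-1, -1])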
Network structure of SSNT-based TTS
One is a fully connected layer with sigmoid activation to compute the alignment transition probability $p(a_{t,u})$ of Eq.
Sampling methods of hard alignment
Alignment sampling from a discrete distribution
To randomly sample a discrete alignment transition variable at each input position and time step with the Gumbel-Max trick, Gumbel noise is first added to each logit of the alignment transition probabilities. The argmax operation is then applied to obtain a discrete sample of the alignment. Reference [144] uses the same criterion to stop the prediction upon consuming the entire input.
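A minimal sketch of the Gumbel-Max trick (the function name is ours, and the logits are illustrative):

    import torch

    def gumbel_max_sample(logits):
        """Draw a discrete sample from a categorical distribution defined by
        `logits`: add Gumbel(0, 1) noise to each logit and take the argmax."""
        uniform = torch.rand_like(logits).clamp_(1e-9, 1.0 - 1e-9)
        gumbel_noise = -torch.log(-torch.log(uniform))
        return torch.argmax(logits + gumbel_noise, dim=-1)

    # For the binary alignment transition variable, the logits are the two
    # unnormalized log-probabilities of "move" and "stay" (illustrative values).
    print(gumbel_max_sample(torch.tensor([1.2, -0.3])))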
Stochastic alignment search
This is a typical approach to random sampling from the Bernoulli distribution, and we refer to this mode as the "Logistic mode" because it involves sampling from the logistic distribution in continuous space.
Continuous relaxation of discrete alignment variables
Figure 6.5 summarizes the sampling process of the Logistic and binary Concrete conditions implemented with neural networks. At inference time, using the logistic noise is optional in both conditions.
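The two conditions can be summarized in a few lines; the sketch below (interface and names are illustrative) adds a Logistic(0, 1) sample to the transition logit and either thresholds it at zero (Logistic condition) or applies a tempered sigmoid (binary Concrete condition):

    import torch

    def sample_transition(logit, temperature=1.0, relaxed=False, add_noise=True):
        """Logistic condition: hard Bernoulli sample via thresholding at zero.
        Binary Concrete condition: relaxed sample via a tempered sigmoid.
        At inference time the noise may be switched off."""
        noise = 0.0
        if add_noise:
            u = torch.rand_like(logit).clamp_(1e-9, 1.0 - 1e-9)
            noise = torch.log(u) - torch.log(1.0 - u)        # Logistic(0, 1) sample
        if relaxed:
            return torch.sigmoid((logit + noise) / temperature)  # in (0, 1)
        return (logit + noise > 0).float()                       # in {0, 1}

    logit = torch.tensor([0.3])
    print(sample_transition(logit), sample_transition(logit, temperature=0.5, relaxed=True))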
Experiments
- Experimental conditions (1)
- Subjective evaluations
- Experimental conditions (2)
- Subjective evaluation
- Experimental result (1)
- Experimental result (2)
- Conclusion
We conducted an experiment to compare the Logistic and binary Concrete conditions for SSNT-TTS. We used the Adam optimizer [111] and trained models with a batch size of 100 and a reduction factor of 2 [1].
Related works using duration for alignments
Another topic that will be covered in this chapter is how to incorporate the duration model and forced aligner used in the DNN-based pipeline TTS framework into end-to-end TTS. We consider the mathematical role of a duration predictor and a forced aligner in end-to-end TTS to build an end-to-end TTS method using latent duration.
TTS using latent duration
Overview of our approach
Model definition and variational lower bound
Model parameterization
CTC-based alignment search
Training criterion in summary
Implementation
Experiments
Experimental condition
The reference systems are public models built by the ESPnet-TTS team [153], and we used these specific models: transformer.v3, fastspeech.v3, tacotron2.v2, and tacotron2.v3. FastSpeech uses an external duration model, Tacotron-based systems use soft attention, and Transformer-TTS uses soft attention with positional encoding.
Experimental results
Conclusion
For the second issue, regarding pitch accent prediction, we introduced pitch accent modeling into end-to-end TTS based on conditional VQ-VAE in Chapter 5. For the final issue, regarding alignment, we proposed an end-to-end TTS method that uses latent phoneme duration based on conditional VQ-VAE in Chapter 7.
Remaining issues
In Chapter 5, we introduced pitch accent modeling for end-to-end TTS based on conditional VQ-VAE. In Chapter 7, we proposed an end-to-end TTS method that uses latent phoneme duration based on VQ-VAE.
Final remark
Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language.
Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis.
Approximate posterior
Decoder
Prior
Vector quantization
The KL divergence between the two multivariate Gaussians can be calculated analytically as
$$D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)\,\|\,\mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)\big) = \tfrac{1}{2}\Big[\operatorname{tr}\!\big(\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1\big) + (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^{\top}\boldsymbol{\Sigma}_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) - d + \ln\tfrac{\det\boldsymbol{\Sigma}_2}{\det\boldsymbol{\Sigma}_1}\Big]. \quad (A.31)$$
By substituting the KL divergence in Eq. (A.31), the corresponding term of the lower bound can be evaluated in closed form.
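For reference, the closed-form expression in Eq. (A.31) can be evaluated directly; the helper below is an illustrative NumPy implementation, not code from this work:

    import numpy as np

    def kl_gaussians(mu1, sigma1, mu2, sigma2):
        """Closed-form KL divergence D_KL(N(mu1, Sigma1) || N(mu2, Sigma2))
        for full-covariance multivariate Gaussians."""
        d = mu1.shape[0]
        inv2 = np.linalg.inv(sigma2)
        diff = mu2 - mu1
        return 0.5 * (np.trace(inv2 @ sigma1) + diff @ inv2 @ diff - d
                      + np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1)))

    # Sanity check: the KL divergence between identical Gaussians is zero.
    mu = np.zeros(3); sigma = np.eye(3)
    print(kl_gaussians(mu, sigma, mu, sigma))  # 0.0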
In summary
- Box plots of MOS scores of each system regarding naturalness of
- Box plot of results of the Japanese listening test. Red dots indicate the mean,
- Schematic comparison of learning and modeling pitch accents
- Brief overview of idea to incorporate pitch accent model to end-to-end
- Architecture of end-to-end TTS system using pitch accent modeling
- Results of a listening test
- Common fatal alignment errors from soft-attention
- Trellis structure of our model. A path that connects 𝒙 and 𝒚 represents
- Detailed network structure of the SSNT-based TTS system
- Left: Sigmoid function. Right: Binary Concrete distribution. Points on
- Network architecture of SSNT-TTS
- MOS scores of subjective evaluation
- A sample that shows overestimation of pause duration
However, we further consider the condition $y_{1:U}$ and define a parametric form for the prior in Eq. (A.21) rather than assuming it to be uniform. This parametric prior leads to the loss term in Eq. (A.34), which is not included in unconditional models.