
[Figure 6.3: Detailed network structure of the SSNT-based TTS system (encoder: CNN and bidirectional LSTM over phoneme inputs; decoder: pre-net and LSTM; FFN layers with tanh and sigmoid activations compute the alignment transition probability and the prediction probability, which are combined into a joint probability for sampling).]

then transformed by a fully connected layer with tanh activation. The output of the tanh nonlinearity is passed to two networks. One is a fully connected layer with sigmoid activation that computes the alignment transition probability $p(a_{t,u})$ of Eq. (6.3). The other is a linear layer that computes the emission probability of Eq. (6.4).
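As a rough illustration, the two output heads can be sketched as follows. This is a minimal NumPy sketch, not the thesis implementation; all layer dimensions and weight initializations are assumptions for illustration only.

```python
# Minimal sketch of the two heads described above: a shared tanh layer feeds
# (i) a sigmoid head for the alignment transition probability of Eq. (6.3)
# and (ii) a linear head for the emission probability parameters of Eq. (6.4).
# All dimensions are hypothetical, not values from the thesis.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, D_OUT = 512, 256, 80  # hypothetical dimensions

W_h = rng.normal(scale=0.01, size=(D_IN, D_HID))   # shared tanh layer
w_a = rng.normal(scale=0.01, size=D_HID)           # transition head (sigmoid)
W_o = rng.normal(scale=0.01, size=(D_HID, D_OUT))  # emission head (linear)

def heads(x):
    h = np.tanh(x @ W_h)                       # shared tanh nonlinearity
    p_emit = 1.0 / (1.0 + np.exp(-(h @ w_a)))  # alignment transition probability
    y = h @ W_o                                # emission probability parameters
    return p_emit, y

p_emit, y = heads(rng.normal(size=D_IN))
```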

6.4 Sampling methods of hard alignment

In Bernoulli sampling, the alignment variable $z_t$ is incremented by a sample from the Bernoulli distribution whose parameter is obtained by normalizing the two nonzero cases of Eq. (6.3), namely, $z_t = z_{t-1} + k$, where $k \sim \mathrm{Bernoulli}(a_{t,u} = \text{Emit})$.
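The increment of the alignment variable can be sketched as below; the per-step probabilities are placeholders, not values from the thesis.

```python
# Minimal sketch of Bernoulli sampling of the alignment increment:
# z_t = z_{t-1} + k, with k drawn from a Bernoulli distribution whose
# parameter is the normalized transition probability of Eq. (6.3).
import numpy as np

rng = np.random.default_rng(0)

def step_alignment(z_prev, p):
    k = int(rng.random() < p)  # k ~ Bernoulli(p)
    return z_prev + k

z = 0
for p_step in (0.9, 0.4, 0.7):  # placeholder per-step probabilities
    z = step_alignment(z, p_step)
```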

Note that this differs from the transition matrix in HMM synthesis, which models duration with an exponential distribution. Our system models the alignment transition as a latent variable, so it does not define an explicit duration distribution as HMM synthesis does.

In the original SSNT, prediction stops when the EOS token is output. In our case, prediction stops when the alignment variable reaches the final input position¹, so that the entire input sequence is consumed.

¹ [144] uses the same criterion to stop prediction so that the full input is consumed.

In the following sections, we describe various alignment sampling methods for SSNT-TTS and investigate the conditions of these methods in our experiments.

6.4.1 Alignment sampling from a discrete distribution

A well-known method to sample a random variable from a discrete distribution is the Gumbel-Max trick [148]. The algorithm of the Gumbel-Max trick consists of two steps:

one, sampling a continuous random variable from a learned distribution, and two, performing the argmax operation to output a final symbol. Let us denote the density of the binary transition variables as $p(a_{t,u} = \text{Emit}) = \alpha_1$ and $p(a_{t,u} = \text{Shift}) = \alpha_2$, where $\alpha_1 + \alpha_2 = 1$. To randomly sample a discrete alignment transition variable at input position $u$ and time step $t$ with the Gumbel-Max trick, Gumbel noise is first added to each logit of the alignment transition densities, which can be written as

$$P(a_{t,u} = \text{Emit}) = P(G_1 + \log\alpha_1 > G_2 + \log\alpha_2) \tag{6.8}$$
$$= P(L + \log\alpha_1 > \log\alpha_2), \tag{6.9}$$

where $G_1$ and $G_2$ are Gumbel noises and $L = G_1 - G_2$. Since the difference $L$ is known to follow a Logistic distribution, Logistic noise can be sampled by $L = \log(U) - \log(1 - U)$, where $U$ is a sample from the standard Uniform distribution.

Finally, the argmax operation is applied to obtain a discrete sample of the alignment transition variable:

$$a_{t,u} = \operatorname{argmax}(L + \log\alpha_1, \log\alpha_2). \tag{6.10}$$

This is a typical approach to drawing random samples from the Bernoulli distribution, and we refer to this condition as the “Logistic condition” because it involves sampling from the Logistic distribution in continuous space. We will introduce another continuous distribution for the alignment transition in Section 6.4.3.
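The trick of Eqs. (6.8)–(6.10) can be sketched in a few lines; this is a minimal illustration under the definitions above, not the thesis code.

```python
# Minimal sketch of Eqs. (6.8)-(6.10): sample the binary alignment transition
# with the Gumbel-Max trick. The difference of two Gumbel noises follows a
# Logistic distribution, so one Logistic draw L = log(U) - log(1 - U) suffices.
import numpy as np

rng = np.random.default_rng(0)

def sample_transition(alpha1, alpha2):
    u = rng.random()                 # U ~ Uniform(0, 1)
    L = np.log(u) - np.log(1.0 - u)  # Logistic noise, L = G1 - G2
    # Eq. (6.10): argmax over the two noise-perturbed logits
    return "Emit" if L + np.log(alpha1) > np.log(alpha2) else "Shift"

print(sample_transition(alpha1=0.7, alpha2=0.3))
```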

In implementation, because $\alpha_1 + \alpha_2 = 1$, we parameterize them as $\alpha_1 = \mathrm{sigmoid}(\log\alpha, \lambda)$ and $\alpha_2 = 1 - \mathrm{sigmoid}(\log\alpha, \lambda)$, where $\mathrm{sigmoid}(x, \lambda) = \frac{1}{1 + \exp(-x/\lambda)}$ and $\lambda$ is a temperature hyper-parameter that controls the shape of the sigmoid function (see Figure 6.4). Accordingly, SSNT only needs to predict a single value of $\alpha$ at each time step. It can be shown that $\alpha = \alpha_1/\alpha_2$.
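A quick numerical check of this parameterization (a sketch under the definitions above; note that the identity $\alpha = \alpha_1/\alpha_2$ holds exactly at $\lambda = 1$):

```python
# Check that alpha1 = sigmoid(log alpha, lambda) with alpha2 = 1 - alpha1
# recovers alpha = alpha1 / alpha2 when lambda = 1.
import numpy as np

def sigmoid(x, lam=1.0):
    return 1.0 / (1.0 + np.exp(-x / lam))

alpha = 3.0
a1 = sigmoid(np.log(alpha), lam=1.0)
a2 = 1.0 - a1
print(np.isclose(a1 / a2, alpha))  # True
```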

6.4.2 Stochastic alignment search

Since SSNT-TTS treats the alignment as a latent variable, it must decode a single alignment path from all possible alignments during inference. From here on, we refer to the process of selecting the best alignment as search.

Greedy search is a search method that chooses, at each time step, the transition with the higher alignment transition probability:

$$a_{t,u} = \operatorname{argmax}(\log\alpha_1, \log\alpha_2) = \begin{cases} \text{Emit} & \text{if } \alpha_1 > \alpha_2 \\ \text{Shift} & \text{if } \alpha_1 < \alpha_2 \end{cases}. \tag{6.11}$$

From Eqs. (6.10) and (6.11), we can see that the difference between greedy search and random sampling from the Bernoulli distribution is just the Logistic noise $L$. For this reason, we refer to random sampling from the Bernoulli distribution as stochastic greedy search for convenience.
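This relationship can be made concrete in code; the following is a minimal sketch under the definitions above, where the `stochastic` flag toggles the Logistic noise.

```python
# Greedy search (Eq. 6.11) versus stochastic greedy search (Eq. 6.10):
# the only difference is the Logistic noise L added before the argmax.
import numpy as np

rng = np.random.default_rng(0)

def search_step(alpha1, alpha2, stochastic=False):
    L = 0.0
    if stochastic:
        u = rng.random()
        L = np.log(u) - np.log(1.0 - u)  # Logistic noise
    return "Emit" if L + np.log(alpha1) > np.log(alpha2) else "Shift"
```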

A generalized version of greedy search is beam search. Beam search considers several candidate alignments at each decoding time step and selects the overall best candidate alignment path; the number of candidates is called the beam width. Greedy search is the special case of beam search with a beam width of one. Furthermore, by combining sampling from the Bernoulli distribution with beam search, or more concretely, by adding the Logistic noise to the argmax operations during beam search, stochastic beam search can be easily implemented.
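The following is a minimal sketch of (stochastic) beam search over alignment decisions; the per-step probabilities and the absence of a termination criterion are simplifications for illustration, not the thesis implementation.

```python
# Minimal beam-search sketch over Emit/Shift decisions. Paths are scored by
# accumulated log-probability; in the stochastic variant, Logistic noise
# perturbs the scores used for selection (but not the stored scores).
import numpy as np

rng = np.random.default_rng(0)

def logistic_noise():
    u = rng.random()
    return np.log(u) - np.log(1.0 - u)

def beam_search(emit_probs, beam_width=4, stochastic=False):
    beams = [([], 0.0)]  # (path, log-probability)
    for p_emit in emit_probs:  # placeholder per-step p(Emit)
        candidates = []
        for path, score in beams:
            for label, p in (("Emit", p_emit), ("Shift", 1.0 - p_emit)):
                noise = logistic_noise() if stochastic else 0.0
                candidates.append((path + [label], score + np.log(p), noise))
        # select the best candidates by (optionally noise-perturbed) score
        candidates.sort(key=lambda c: c[1] + c[2], reverse=True)
        beams = [(path, score) for path, score, _ in candidates[:beam_width]]
    return beams[0]  # overall best candidate path

best_path, best_score = beam_search([0.9, 0.4, 0.7], beam_width=2, stochastic=True)
```

With `beam_width=1` and `stochastic=False`, this reduces to greedy search; with `beam_width=1` and `stochastic=True`, it reduces to stochastic greedy search.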

6.4.3 Continuous relaxation of discrete alignment variables

As seen in Eq. (6.9), conventional random sampling from the Bernoulli distribution uses the Logistic distribution. The use of the Logistic distribution is, however, an oversimplified assumption, since the Logistic distribution has a symmetric shape and a single mode.

More specifically, under this condition, the similarity between the sample distribution and the actual discrete distribution mainly depends on the prediction accuracy of the $\alpha_1$ parameter.

To model the hard alignment more appropriately, we propose using the binary Concrete distribution [129] for the alignment transition probability. The Concrete distribution, also known as the Gumbel-Softmax distribution [149], is a continuous distribution that relaxes the discrete argmax operation. Here, we focus on the binary case of the Concrete distribution because our alignment transition variable $a_{t,u}$ has only two states.

Using a concept similar to the Gumbel-Max trick, we can sample from the binary Concrete distribution. As in the procedure described in Section 6.4.1, random sampling from the binary Concrete distribution starts by adding the Logistic noise $L$ to the logit of the density of the alignment transition variable. Here, we introduce the parameter $\alpha = \alpha_1/\alpha_2$ and consider the logit of $\alpha$ instead. First, we add the Logistic noise $L$ inside a sigmoid function:

$$P(a_{t,u} = \text{Emit}) = \frac{1}{1 + \exp(-(\log\alpha + L)/\lambda)}, \tag{6.12}$$

where $\lambda$ is a temperature parameter. This sigmoid function serves as a continuous version of the argmax function, and a discrete sample can be obtained by applying the argmax operator to the output of the sigmoid function. The resulting distribution has the following form:

$$\mathrm{BinConcrete}(x \mid \alpha, \lambda) = \frac{\lambda\,\alpha\, x^{-\lambda-1} (1-x)^{-\lambda-1}}{\left(\alpha\, x^{-\lambda} + (1-x)^{-\lambda}\right)^2}. \tag{6.13}$$

Figure 6.4 shows the binary Concrete distribution for various values of $\alpha$ and $\lambda$.

[Figure 6.4: Left: sigmoid function with temperatures λ = 1.0, 0.2, and 0.05. Right: binary Concrete distribution. Points on the sigmoid function indicate α parameters for the binary Concrete distribution.]

The parameter $\alpha$ can be learned by using the Logistic noise during training. We refer to the condition using the binary Concrete distribution as the “binary Concrete condition”.
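Both the relaxed sample of Eq. (6.12) and the density of Eq. (6.13) are straightforward to write down; the following is a minimal sketch under the definitions above.

```python
# Minimal sketch of the binary Concrete condition: a relaxed sample in (0, 1)
# via Eq. (6.12) and the corresponding density via Eq. (6.13).
import numpy as np

rng = np.random.default_rng(0)

def binconcrete_sample(alpha, lam):
    u = rng.random()
    L = np.log(u) - np.log(1.0 - u)  # Logistic noise added before the sigmoid
    return 1.0 / (1.0 + np.exp(-(np.log(alpha) + L) / lam))  # Eq. (6.12)

def binconcrete_pdf(x, alpha, lam):
    num = lam * alpha * x ** (-lam - 1.0) * (1.0 - x) ** (-lam - 1.0)
    den = (alpha * x ** (-lam) + (1.0 - x) ** (-lam)) ** 2
    return num / den  # Eq. (6.13)

x = binconcrete_sample(alpha=3.0, lam=0.2)  # relaxed sample in (0, 1)
hard = int(x > 0.5)                         # argmax discretization (binary case)
```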

Figure 6.5 summarizes the sampling processes of the Logistic and binary Concrete conditions as implemented with neural networks. Both conditions have a similar network structure; the difference is where the Logistic noise is added: the Logistic condition adds the Logistic noise after the sigmoid function, while the binary Concrete condition adds it before the sigmoid function. Note that the use of Logistic noise is mandatory during training for the binary Concrete condition, since the randomness is what allows the model to learn the parameter $\alpha$ of the binary Concrete distribution. In contrast, the Logistic condition does not use Logistic noise during training, since during training the model computes the likelihood rather than samples. At inference time, the use of the Logistic noise is optional in both conditions. If the Logistic noise is omitted during inference, the procedure is equivalent to a deterministic search.
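The difference in noise placement can be summarized in one function; this is a minimal sketch, not the thesis implementation, and the `use_noise` flag corresponds to the optional noise at inference.

```python
# Where the Logistic noise enters in the two conditions: after the sigmoid
# for the Logistic condition, inside the sigmoid for the binary Concrete
# condition. With use_noise=False, both reduce to the same deterministic
# decision log(alpha) > 0, i.e. a deterministic (greedy) search.
import numpy as np

rng = np.random.default_rng(0)

def logistic_noise():
    u = rng.random()
    return np.log(u) - np.log(1.0 - u)

def emit_decision(log_alpha, lam, condition="logistic", use_noise=True):
    if condition == "logistic":
        a1 = 1.0 / (1.0 + np.exp(-log_alpha / lam))       # sigmoid first
        L = logistic_noise() if use_noise else 0.0        # noise after the sigmoid
        return L + np.log(a1) > np.log(1.0 - a1)          # Eq. (6.10)
    else:  # binary Concrete condition
        L = logistic_noise() if use_noise else 0.0        # noise before the sigmoid
        x = 1.0 / (1.0 + np.exp(-(log_alpha + L) / lam))  # Eq. (6.12)
        return x > 0.5                                    # argmax discretization
```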
