
4.4 Experiments

4.4.5 Experimental results (2)


generated alignment errors due to the prediction of longer sequences, as we described in the previous section. Among the baseline systems, the systems using MGC and 𝐹0 had higher scores than the systems using the mel-spectrogram under both the forced and predicted alignment conditions.

Interestingly, JA-Tacotron using forced alignment received lower scores than JA-Tacotron using predicted alignment, both with and without accentual-type labels. This result is surprising because, in traditional pipelines, forced alignment is used as an oracle and normally leads to better perceptual quality than predicted alignment. Since Tacotron learns spectrograms and alignments simultaneously, it seems to produce the best spectrograms when it infers both of them. Among the baseline pipeline systems, as expected, forced alignment gave higher scores than predicted alignment, both for the system using vocoder parameters and for the system using the mel-spectrogram. In the predicted-alignment case, the score distribution had a long tail toward the low-score region.

The best proposed system still does not match the quality of the best pipeline system: SA-Tacotron with accentual-type labels and the pipeline system using the mel-spectrogram and predicted alignment had MOSs of 3.60±0.03 and 3.90±0.03, respectively. This differs from the results of the English experiments reported in [22]. Apart from architecture, one major difference between our proposed systems and the pipeline systems is the input linguistic features: our proposed systems use phoneme and accentual-type labels only, whereas the baseline pipeline systems use a variety of linguistic labels, including word-level information such as inflected forms, conjugation types, and part-of-speech tags. In particular, an investigation on the same Japanese corpus found that the conjugation type of the next word is quite useful for 𝐹0 prediction [112].
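To make this contrast concrete, the following is a hypothetical sketch of the two kinds of input; all field names and values are invented for illustration and are not taken from the corpus tooling used in this work.

```python
# Proposed systems: phoneme and accentual-type labels only.
proposed_input = {
    "phonemes": ["k", "o", "N", "n", "i", "ch", "i", "w", "a"],
    "accent_labels": ["L", "H", "H", "H", "H"],  # one per mora (illustrative)
}

# Baseline pipeline: rich linguistic features per phone, including
# word-level information. All field names here are hypothetical.
pipeline_input_frame = {
    "phoneme": "k",
    "part_of_speech": "interjection",
    "inflected_form": None,          # e.g. for verbs and adjectives
    "conjugation_type": None,        # of the current word
    "next_conjugation_type": None,   # [112] found the next word's type useful for F0
    "accent_type": 0,                # illustrative value
    "position_in_phrase": 1,
}
```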

Table 4.3: Alignment error rate on the Japanese test set. Each system was evaluated three times, and each time it was trained from scratch with a different initial random seed. Ave. indicates the average error rate. Values in bold correspond to those evaluated in a listening test.

Para. size  Encoder  Self-attention  # Para. (1×10⁶)   Alignment error rate (%)
                                                       Ave.    1      2      3
Small       CBHL     -               11.3               2.4    0.4    2.7    4.2
Small       CBHL     X               11.6              11.5    6.0    7.9   20.6
Small       CNN      -                9.2               2.4    0.6    1.7    5.0
Small       CNN      X                9.6              18.2    4.8   14.2   35.6
Large       CBHL     -               35.8               0.2    0.2    0.2    0.2
Large       CBHL     X               41.6               0.3    0.2    0.2    0.4
Large       CNN      -               27.2               0.2    0.2    0.2    0.2
Large       CNN      X               32.8               0.2    0.2    0.2    0.2

runs. Through further investigation, we found that the combination of the post-net and self-attention made the alignments unstable in the small-parameter-size configurations.

In general, when systems have a small parameter size, their alignment learning is sensitive to the initial parameters and the network structure.
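The alignment errors counted in Table 4.3 were found with an alignment error detector (also mentioned in the listening-test discussion below). Its implementation is not given in this section; a minimal sketch of such a detector, assuming access to the decoder's attention matrix and using made-up thresholds, could look like this:

```python
import numpy as np

def has_alignment_error(attention, end_margin=3, jump_limit=10):
    """Flag a likely fatal alignment failure from an attention matrix.

    attention: (decoder_steps, encoder_steps) array whose rows sum to ~1.
    end_margin and jump_limit are illustrative thresholds, not values
    from the thesis.
    """
    path = attention.argmax(axis=1)  # most-attended input index per output frame
    # Error if decoding stops before the alignment reaches the end of the input.
    if path.max() < attention.shape[1] - 1 - end_margin:
        return True
    # Error if the path jumps too far between consecutive frames, i.e. it is
    # grossly non-monotonic (input tokens skipped or revisited).
    return bool((np.abs(np.diff(path)) > jump_limit).any())
```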

Subjective evaluation

Figure 4.10 shows the results of the listening test in Japanese, and Table 4.4 lists the outcomes of the statistical significance tests. We can see a significant gap between Tacotron systems with small and large parameter sizes. All systems with the small parameter size had low scores: baseline Tacotron scored 3.04±0.04 and 2.55±0.04 with the CBHL and CNN encoders, respectively, and self-attention Tacotron scored 2.85±0.04 and 2.23±0.04 with the CBHL and CNN encoders, respectively. For both baseline Tacotron and self-attention Tacotron, the CBHL encoder performed better than the CNN encoder under the small-parameter-size condition. On the other hand, Tacotron systems with the large parameter size had high MOSs of about 4.0. We listened to samples from the small-parameter-size systems with average MOSs below 2.5 and found incorrect pitch accents in them. Furthermore, samples from the systems using the CNN encoder with the small parameter size generally sounded flatter in pitch than those from the corresponding systems using the CBHL encoder, which is probably why listeners rated them negatively. Some low-rated samples from the small self-attention Tacotron also contained fatal alignment errors, which were identified by the alignment error detector.

Figure 4.10: Box plot of the results of the Japanese listening test. Red dots indicate means, and black bars indicate medians.

The differences among Tacotron systems with the large parameter size are not statistically significant, as Table 4.4 shows, so neither the presence of self-attention nor the difference in encoder network structure significantly affected the naturalness of the synthetic speech. Furthermore, these Tacotron systems had slightly higher scores than the pipeline system using the mel-spectrogram, which had an MOS of 3.97±0.04. The Tacotron systems using the CBHL encoder differed from the pipeline system using the mel-spectrogram with statistical significance, but the Tacotron systems using the CNN-based encoder did not. The pipeline system using vocoder parameters had a lower MOS (3.87±0.04) than both the pipeline system using the mel-spectrogram and the Tacotron systems with the large parameter size.
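For concreteness, the following sketch shows how MOS values with approximate 95% confidence intervals and the Mann-Whitney rank test used in Table 4.4 could be computed with NumPy and SciPy; the ratings below are random placeholders, not the actual listening-test data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def mos_with_ci(scores, z=1.96):
    """Mean opinion score with an approximate 95% confidence interval."""
    scores = np.asarray(scores, dtype=float)
    half_width = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean(), half_width

# Synthetic 5-point ratings standing in for two systems' listening-test data.
rng = np.random.default_rng(0)
system_a = rng.integers(1, 6, size=400)
system_b = rng.integers(2, 6, size=400)

mos, ci = mos_with_ci(system_a)
print(f"MOS = {mos:.2f} ± {ci:.2f}")

# Two-sided Mann-Whitney rank test, judged at p <= 0.01 as in Table 4.4.
_, p = mannwhitneyu(system_a, system_b, alternative="two-sided")
print("significant" if p <= 0.01 else "not significant")
```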

Table 4.4: Mann-Whitney rank test for the Japanese listening test. Cells in this color denote statistical significance (𝑝 ≤ 0.01), while cells in this color denote 𝑝 > 0.01.

(Pairwise comparisons among: Natural; the ABS conditions (Tacotron mel-spectrogram, pipeline mel-spectrogram, pipeline vocoder parameters); the two pipeline systems (mel-spectrogram, vocoder parameters); and the eight Tacotron configurations (Small/Large parameter size × CBHL/CNN encoder × with/without self-attention). The significance of each pair is indicated by cell color.)