Baseline diarisation system - Domain adaptation for speaker diarisation in low-resource environments

In this section we establish a baseline system to be used in all experimentation on the SAIGEN corpus. The baseline system, as depicted in Figure 5.1, uses the E-TDNN x-vector system, which was chosen based on its performance on the EMRAI corpus in Sections 4.2.2 through 4.3.5.

To establish a baseline system, different SAD models are evaluated on the SAIGEN corpus and the best performing model is used when reporting results on system SAD (Section 5.3.1). Thereafter, the effects of the default system settings on diarisation performance are investigated to ensure that our baseline system is optimally configured (Sections 5.3.2 and 5.3.3). Finally, the performance of the unadapted baseline system is established on the SAIGEN corpus (Section 5.3.4).

Figure 5.1: Our baseline system used for experimentation on the SAIGEN corpus.

5.3.1 Evaluation of SAD systems

Before establishing baseline results on the SAIGEN corpus, the pre-trained SAD models, as described in Section 4.2.1, are evaluated on the SAIGEN corpus and the best performing model is used when reporting results on system SAD.

Method

The SAD performance is compared in terms of F1-score, as described in Section 2.7.1.

All SAD models are used ‘as-is’. The F1-score is computed in the same manner as in Section 4.2.1: the oracle and system SAD marks are converted to binary arrays in which a 0 represents non-speech and a 1 represents speech. Each element of the binary array corresponds to 10ms of audio, which is the resolution of all three pre-trained models. The oracle SAD marks are used as ground truth and the system SAD marks as predictions; the F1-score is calculated accordingly.
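The procedure above can be illustrated with a minimal sketch. It assumes that SAD marks are available as (start, end) pairs in seconds and that the F1-score is taken over the speech class; the function names and the example segments are hypothetical and do not come from the evaluated systems.

```python
def marks_to_frames(segments, duration_s, frame_ms=10):
    """Convert (start_s, end_s) speech marks to a 0/1 list, one entry per 10 ms frame."""
    n_frames = round(duration_s * 1000 / frame_ms)
    frames = [0] * n_frames
    for start_s, end_s in segments:
        first = round(start_s * 1000 / frame_ms)
        last = min(round(end_s * 1000 / frame_ms), n_frames)
        for i in range(first, last):
            frames[i] = 1  # 1 = speech, 0 = non-speech
    return frames


def f1_score(reference, prediction):
    """F1-score of the speech class, with `reference` as ground truth."""
    pairs = list(zip(reference, prediction))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 1-second recording: oracle speech at 0.2-0.8 s, system speech at 0.3-0.9 s.
oracle = marks_to_frames([(0.2, 0.8)], duration_s=1.0)
system = marks_to_frames([(0.3, 0.9)], duration_s=1.0)
score = f1_score(oracle, system)
print(f"F1 = {score:.3f}")
```

In this toy case the two marks share 50 of their 60 speech frames, so precision and recall are both 50/60.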

Results

Table 5.1 shows the mean F1-score of each SAD baseline for each call centre and over the entire SAIGEN corpus. The TDNN-SAD performed significantly better than both the WebrtcVAD and LSTM SAD systems (Wilcoxon signed-rank test, p-value < 0.05)². Recall from Section 4.2.1 that the WebrtcVAD showed comparable performance to the TDNN-SAD on the EMRAI corpus. This is, however, not true for the SAIGEN corpus, where the TDNN-SAD consistently outperforms the other systems by a considerable margin. The TDNN-SAD will therefore be used for all experiments in this chapter when reporting on system SAD.

² We use a Wilcoxon signed-rank test instead of the corrected resampled Student’s t-test as the datasets (call centres) are independent. Additionally, due to the small size of the dataset it cannot be reliably determined whether the differences in performance of the three models are normally distributed; we therefore follow Demšar’s advice [78] and use the Wilcoxon test, which is a non-parametric alternative to the Student’s t-test.

Table 5.1: Mean F1-score taken over each call centre of the SAIGEN corpus using the baseline SAD models. The error margin is the standard deviation within each subset.

Subset           WebrtcVAD      LSTM           TDNN
Call Centre 1    0.78 ± 0.56    0.76 ± 0.53    0.86 ± 0.31
Call Centre 2    0.81 ± 0.57    0.76 ± 0.77    0.86 ± 0.38
Call Centre 3    0.85 ± 0.43    0.80 ± 0.56    0.89 ± 0.33
Call Centre 4    0.82 ± 0.52    0.76 ± 0.74    0.89 ± 0.26
Call Centre 5    0.76 ± 0.89    0.69 ± 0.94    0.82 ± 0.71
Call Centre 6    0.75 ± 0.39    0.73 ± 0.47    0.83 ± 0.39
Call Centre 7    0.73 ± 0.11    0.69 ± 0.59    0.80 ± 0.11
Call Centre 8    0.67 ± 0.12    0.60 ± 0.88    0.81 ± 0.71
Call Centre 9    0.86 ± 0.40    0.85 ± 0.47    0.90 ± 0.29
Entire Corpus    0.76 ± 0.10    0.73 ± 0.89    0.83 ± 0.79

5.3.2 X-vector segmentation

In Section 4.3.1 the E-TDNN x-vector system is implemented ‘as-is’, without changing any of the default settings. In this subsection we investigate the effect of the segmentation settings on diarisation performance. All other default settings, such as the dimensionality of the LDA transform, are not investigated in this study.

Method

The E-TDNN system segments speech using a default 2-second sliding window without overlap (2000ms step between segments). Hence, baseline results are established by comparing diarisation performance on the SAIGEN corpus while varying the step size between segments³. The segment size itself remains unchanged.
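To make the role of the step size concrete, the following sketch enumerates the 2-second windows that would be extracted from one speech region. It is an illustration only: the function name is hypothetical, and the handling of regions shorter than one window and of leftover audio at the end of a region is an assumption, not the behaviour of the E-TDNN recipe.

```python
def sliding_windows(start_ms, end_ms, window_ms=2000, step_ms=250):
    """Return (start, end) windows in ms that fit inside one speech region."""
    windows = []
    pos = start_ms
    while pos + window_ms <= end_ms:
        windows.append((pos, pos + window_ms))
        pos += step_ms
    if not windows:
        # Assumption: a region shorter than one window is kept as a single segment.
        windows.append((start_ms, end_ms))
    # Assumption: any tail shorter than one step after the last window is ignored.
    return windows


# A hypothetical 5-second speech region, segmented with each step size under study.
for step in (250, 500, 750, 1000, 2000):
    n = len(sliding_windows(0, 5000, step_ms=step))
    print(f"step {step:>4} ms -> {n} windows")
```

A smaller step yields more, heavily overlapping segments, and therefore more x-vectors to extract and cluster for the same audio.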

Results

Table 5.2 shows the mean DER taken over all three development and test splits of the SAIGEN corpus using the E-TDNN and TDNN-SAD with various segmentation step sizes. We used a constant sliding window length of 2 seconds and evaluated each segmentation setting with both oracle SAD and TDNN-SAD (system SAD). Notice that the DER remains relatively constant across step sizes. Although no significant difference is observed between step sizes (resampled Student’s t-test, p-value > 0.05), we chose to perform all further experiments using a 250ms step size, as it resulted in the lowest mean DER across the development splits. Additionally, we see a higher mean DER on the test splits than on the development splits when using TDNN-SAD but not when using oracle SAD. This indicates that it is the TDNN-SAD, and not the E-TDNN x-vector system itself, that performs worse on the test splits than on the development splits. However, as SAD systems are not the focus of this research, we do not attempt to adapt the TDNN-SAD to have similar performance on the development and test splits.

³ As mentioned in Section 4.2.2, when overlapping segments are labelled as belonging to different speakers, the speaker change point is chosen as the midpoint between the two overlapping segments.

Table 5.2: Mean DER measured on the SAIGEN corpus using the TDNN-SAD and E-TDNN baseline systems over various segmentation settings. The DER is reported over all three development and test splits with the standard error.

Step Size (ms)    Oracle                           TDNN-SAD
                  Development     Test             Development     Test
250               17.00 ± 0.28    17.20 ± 1.16     23.99 ± 0.35    28.23 ± 0.15
500               17.12 ± 0.34    17.55 ± 1.33     24.10 ± 0.19    28.58 ± 0.20
750               17.14 ± 0.56    17.15 ± 1.10     24.10 ± 0.20    27.75 ± 0.40
1000              17.07 ± 0.31    17.12 ± 1.39     24.02 ± 0.13    27.60 ± 0.34
2000              17.90 ± 0.44    17.31 ± 0.76     24.07 ± 0.18    28.42 ± 0.23
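Each entry in Table 5.2 is a mean over three splits together with its standard error. The sketch below shows one common way to compute such an entry; the per-split DER values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the exact estimator is not stated here.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return the mean of `values` and the standard error of that mean."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 denominator).
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, std_err


# Hypothetical DER (%) on three splits; not values from the SAIGEN corpus.
split_ders = [16.5, 17.0, 17.5]
mean_der, std_err = mean_and_standard_error(split_ders)
print(f"{mean_der:.2f} ± {std_err:.2f}")
```

With only three splits the standard error is itself a coarse estimate, which is worth keeping in mind when comparing rows whose means differ by less than their error margins.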

5.3.3 SAD segment padding

The TDNN-SAD is chosen for the baseline system due to its performance on the SAIGEN corpus. However, the TDNN-SAD applies segment padding as a post-processing step: an algorithm that slightly pads segments with silence to compensate for over-aggressive boundary placement. The details of the segment padding algorithm are confidential, so its effect on diarisation performance is unknown and should be verified.

Method

To investigate how segment padding affects diarisation performance we measure the mean DER over the SAIGEN corpus with and without segment padding. Additionally, we decompose the DER into its core components: false alarm, missed detection and speaker confusion. This decomposition allows us to determine how segment padding affects diarisation.
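The decomposition can be sketched at frame level as follows. This is a simplified illustration under stated assumptions: there is no overlapping speech, the hypothesis speaker labels have already been mapped to the reference speakers, and no scoring collar is applied. The function name and the example labels are hypothetical; standard scoring tools handle these details differently.

```python
def decompose_der(reference, hypothesis):
    """Split DER (%) into false alarm, missed detection and speaker confusion.

    Both inputs are per-frame speaker labels, with None marking non-speech.
    """
    pairs = list(zip(reference, hypothesis))
    total_speech = sum(1 for r, _ in pairs if r is not None)
    counts = {
        "false_alarm": sum(1 for r, h in pairs if r is None and h is not None),
        "missed_detection": sum(1 for r, h in pairs if r is not None and h is None),
        "speaker_confusion": sum(
            1 for r, h in pairs if r is not None and h is not None and r != h
        ),
    }
    # Every component is normalised by the amount of reference speech.
    rates = {name: 100.0 * count / total_speech for name, count in counts.items()}
    rates["DER"] = sum(rates.values())
    return rates


# Hypothetical 10-frame recording with two speakers, A and B.
reference = ["A", "A", "A", "A", None, None, "B", "B", "B", "B"]
hypothesis = ["A", "A", "A", None, "A", None, "A", "B", "B", "B"]
result = decompose_der(reference, hypothesis)
print(result)
```

Because all three components share the same denominator, they add up to the DER, so a change in the SAD (such as padding) can be traced to the component it affects.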

Results

Table 5.3 shows the decomposed DER taken over the SAIGEN corpus using TDNN-SAD with and without segment padding. The missed detection component of the DER is lower with segment padding than without it (corrected resampled Student’s t-test, p-value < 0.05), which indicates that the padded segments help to include speech which the TDNN-SAD missed. The opposite is true for the false alarm rate, which is higher with segment padding (corrected resampled Student’s t-test, p-value < 0.05), indicating that, although segment padding can include missed speech, it also inevitably includes non-speech. Lastly, although the mean speaker confusion rate is lower with segment padding, no significant difference is observed (corrected resampled Student’s t-test, p-value > 0.05). Since it is better to include more silence (as reflected by the false alarm rate) than to exclude speech altogether, and segment padding improves the missed detection rate, we will use TDNN-SAD with segment padding for all future experiments and will simply refer to it as ‘system-SAD’.

Table 5.3: Mean DER measured on the SAIGEN corpus using TDNN-SAD with and without segment padding and the E-TDNN baseline (using a 2-second sliding window with a 250ms step). The DER and all its components are reported over all three development and test splits. The error margin is the standard error.

Error (%)            Development                                          Test
                     TDNN-SAD with padding   TDNN-SAD without padding     TDNN-SAD with padding   TDNN-SAD without padding
False alarm          4.67 ± 0.18             2.40 ± 0.21                  7.59 ± 0.34             3.82 ± 0.26
Missed detection     2.51 ± 0.42             5.14 ± 0.31                  3.77 ± 0.51             8.41 ± 0.29
Speaker confusion    16.82 ± 0.31            17.81 ± 0.40                 16.81 ± 0.24            17.64 ± 0.33
DER                  23.99 ± 0.35            25.35 ± 0.30                 28.23 ± 0.15            29.87 ± 0.26

5.3.4 Baseline performance

This section showed that diarisation performance is relatively unaffected by the step size between segments and that the TDNN-SAD benefits from segment padding. The baseline system will therefore use the E-TDNN model with a 2-second window and a 250ms step between windows, together with the TDNN-SAD with segment padding. The performance of the baseline system on the SAIGEN corpus is shown in Table 5.4.

Table 5.4: Mean DER and JER achieved by the unadapted baseline over all three development and test splits. Results are reported using oracle and system SAD and the error margin is the standard error.

Subset         SAD       DER (%)          JER (%)
Development    Oracle    17.00 ± 0.28     32.79 ± 0.46
Development    System    23.99 ± 0.35     45.84 ± 0.57
Test           Oracle    17.20 ± 1.16     33.66 ± 1.77
Test           System    28.23 ± 0.15     49.80 ± 1.23
