Sub-fundamental frequency (SFF) filtering

5.2 Detection of dominant aperiodic component regions (DARs) in speech

5.2.1 Sub-fundamental frequency (SFF) filtering

Both aperiodic and periodic sources contain impulse excitations. In periodic sources, the impulse excitations are due to glottal closure and opening instants and they occur at regular time intervals.

In aperiodic sources, the impulse excitations are due to transient bursts and frication noises, and they occur at every instant of time with arbitrary amplitude. These time instants of occurrence of the excitation impulses are reflected as discontinuities in the signal. Discontinuities are also observed in the transitions between obstruents and sonorants, and sometimes in the end points of sonorants and

were detected using the ZFF method proposed by Murthy et al. [146]. Here, we attempt to detect some of the discontinuities due to aperiodic sources using SFF filtering.

The SFF filtering method is motivated by the zero frequency filtering (ZFF) method. In ZFF, the signal is passed through a cascade of two 0 Hz resonators. The output of the resonators grows or decays as a polynomial function of time, so the effect of the discontinuities due to impulse sequences is overridden by the large values of the filtered output. To extract the characteristics of the discontinuities due to impulse excitation, the deviation of the filtered output from its local mean is computed:

$$z(n) = y(n) - \frac{1}{2N+1} \sum_{m=-N}^{N} y(n+m) \tag{5.1}$$

where y(n) is the output of the 0 Hz resonators and 2N + 1 is the length of the window over which the local mean is computed. The trend-removed signal z(n) is the ZFF signal (ZFFS). An FIR implementation of this sequence of operations was proposed in [162]. In both designs, the filter output is a function of the trend removal window length. It was shown that epochal information is extracted well when the trend removal window length is between one and two pitch periods [146], [162]. If the window length is reduced to half a pitch period, the discontinuities due to glottal opening instants may be captured. Similarly, if a large window length (more than 3 pitch periods) is chosen, discontinuities present beyond the pitch period can be captured. Choosing a trend removal window longer than a pitch period makes the operation a band-pass filter whose center frequency lies below the fundamental (pitch) frequency. The method is therefore called SFF filtering.
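The operations described above can be sketched as follows. This is a minimal illustration assuming NumPy, not the implementation of [146] or [162]. The function names, the edge padding, and the `passes` parameter are assumptions introduced here. Each 0 Hz resonator is modelled as the ideal double integrator y(n) = 2y(n-1) - y(n-2) + x(n). Because a single application of Eq. (5.1) lowers the degree of a polynomial trend by only two, the trend removal is applied more than once in this sketch.

```python
import numpy as np

def zero_hz_resonator(x):
    """Ideal 0 Hz resonator y(n) = 2y(n-1) - y(n-2) + x(n), i.e. two running sums."""
    return np.cumsum(np.cumsum(np.asarray(x, dtype=float)))

def remove_trend(y, half_window):
    """Eq. (5.1): subtract the local mean over 2N + 1 samples, with N = half_window."""
    y = np.asarray(y, dtype=float)
    length = 2 * half_window + 1
    # Assumption: the signal is extended by repeating its edge samples.
    padded = np.pad(y, half_window, mode="edge")
    local_mean = np.convolve(padded, np.ones(length) / length, mode="valid")
    return y - local_mean

def trend_removed_output(x, half_window, passes=2):
    """Cascade of two 0 Hz resonators followed by repeated trend removal."""
    y = zero_hz_resonator(zero_hz_resonator(x))
    for _ in range(passes):  # `passes` is an illustrative parameter
        y = remove_trend(y, half_window)
    return y
```

For an isolated impulse, the output of this sketch is nonzero only in the neighbourhood of the impulse (within about `passes` times N samples), which is the sense in which the trend-removed signal localizes a discontinuity. For long signals the polynomial growth of the resonator output can exhaust floating-point precision, so a practical implementation would likely process the signal differently (for example with the FIR form mentioned in the text).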

The fundamental frequency is not calculated from the speech signal. Instead, 125 Hz is taken as the average fundamental frequency, and frequency components below this value are considered sub-fundamental frequency components. Since the trend removal window is longer than 3 pitch periods (more than 24 ms at 125 Hz), the center frequency of the band-pass filter is below 42 Hz. The value is not critical: any value between 20 and 45 Hz can be used as the center frequency.
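The arithmetic behind these numbers can be sketched as below. The helper name `sff_window` and the 16 kHz sampling rate used in the usage note are assumptions made for illustration. The last returned value is simply the reciprocal of the window duration, which is the figure the text quotes as the upper bound on the center frequency.

```python
def sff_window(fs, f0_hz=125.0, pitch_periods=3):
    """Return (N, window duration in seconds, reciprocal of that duration in Hz).

    The window has 2N + 1 samples and spans about `pitch_periods` periods of
    the assumed average fundamental frequency `f0_hz`.
    """
    samples = int(round(pitch_periods * fs / f0_hz))
    half_window = samples // 2              # N in Eq. (5.1)
    window_s = (2 * half_window + 1) / fs   # actual duration of 2N + 1 samples
    return half_window, window_s, 1.0 / window_s
```

For example, at an assumed sampling rate of 16 kHz, three pitch periods at 125 Hz correspond to 384 samples, so N = 192 and the reciprocal of the window duration is a little under 42 Hz.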

Discontinuities present beyond the pitch period are due to aperiodic sources. In DARs, the aperiodic sources are strong, and hence some of these discontinuities are detected by SFF filtering. The detection method is described for a synthetic signal and for an acoustic speech signal in the following subsections.

Analysis in synthetic signal

To illustrate the method, synthetic periodic and aperiodic sources are generated and are shown in Figure 5.1 (a). The periodic source is generated using an impulse train (with a 5 ms period), and the aperiodic source is generated using a single impulse or random noise. There are four parts in the signal. The first part (between 0 and 0.6 s) consists of an impulse followed by a periodic impulse train. The second part (between 0.6 and 1.2 s) consists of non-overlapping random noise (5 dB) followed by a periodic impulse train. In the third part (between 1.2 and 1.6 s) and the fourth part (between 1.6 s and the end), the periodic impulse train is added to 5 dB and 30 dB random noise, respectively. The synthesized signal is passed through the SFF filter, and the output signal is shown in Figure 5.1 (b).

[Figure 5.1: six panels (a)-(f). Horizontal axis: time in seconds. Vertical axis: normalized amplitude. Phone labels on the speech panels: /sh/ /iy/ /hh/ /ae/ /d/ /er/ /d/ /aa/ /r/ /k/ /s/ /uh/ /t/ /ih/ /n/.]

Figure 5.1: (a) Synthetic signal consisting of periodic and aperiodic components. (b) Sub-fundamental frequency (SFF) filtered output of the synthetic signal. (c) Energy signal computed over 5 ms window. (d) Speech signal for the utterance “She had your dark suit in”. (e) SFF filtered output of the speech signal. (f) Energy of the filtered signal shown in (e). The energy of SFF filtered signal is higher in the aperiodic region than in the periodic region.
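A signal of this kind can be generated along the following lines. This is only a sketch assuming NumPy. The sampling rate, the position of the single impulse, the split points inside the first two parts, and the reading of "5 dB" and "30 dB" as signal-to-noise ratios relative to the impulse train are all assumptions, since the text does not specify them.

```python
import numpy as np

def impulse_train(n_samples, period_samples):
    """Unit impulses every `period_samples` samples."""
    x = np.zeros(n_samples)
    x[::period_samples] = 1.0
    return x

def noise_at_snr(reference, snr_db, rng):
    """White Gaussian noise scaled so that reference power / noise power = snr_db."""
    noise = rng.standard_normal(reference.size)
    scale = np.sqrt(np.mean(reference ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return scale * noise

def synthetic_signal(fs=8000, seed=0):
    """Four-part test signal; every split point below is an assumption."""
    rng = np.random.default_rng(seed)

    def samples(seconds):
        return int(round(seconds * fs))

    period = samples(0.005)                       # 5 ms impulse period
    # Part 1 (0 to 0.6 s): a single impulse, then a periodic impulse train.
    part1 = np.zeros(samples(0.6))
    part1[samples(0.1)] = 1.0
    part1[samples(0.3):] = impulse_train(samples(0.6) - samples(0.3), period)
    # Part 2 (0.6 to 1.2 s): noise, then an impulse train, without overlap.
    train2 = impulse_train(samples(0.3), period)
    part2 = np.concatenate([noise_at_snr(train2, 5.0, rng), train2])
    # Part 3 (1.2 to 1.6 s): impulse train plus noise at 5 dB.
    train3 = impulse_train(samples(0.4), period)
    part3 = train3 + noise_at_snr(train3, 5.0, rng)
    # Part 4 (1.6 to 2.0 s): impulse train plus noise at 30 dB.
    train4 = impulse_train(samples(0.4), period)
    part4 = train4 + noise_at_snr(train4, 30.0, rng)
    return np.concatenate([part1, part2, part3, part4])
```

Fixing the random seed keeps the noise segments reproducible, which makes it easier to compare filter settings on the same signal.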

Figure 5.1 (c) shows the energy of the output signal computed over a 5 ms window. The energy is high in the aperiodic regions and close to zero in the periodic regions. For example, the energy is high around the first impulse and around the transitions from the periodic impulse train to silence. These impulse-like discontinuities have a flat spectrum and therefore contain frequency components in the sub-fundamental frequency region. For this reason, the SFF filtered output shows high energy in the vicinity of these discontinuities. The impulse train also contains very low frequency components, but, owing to its periodic nature, most of its low-frequency energy is concentrated around the fundamental frequency. The sub-fundamental frequency region does not contain sufficient energy, which results in very low energy at the output of the SFF filter.
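The energy contour mentioned above can be sketched as a running sum of squared samples. The text states only the 5 ms window length, so the use of a plain rectangular window, a sum rather than a mean, and the function name are assumptions.

```python
import numpy as np

def short_time_energy(z, fs, window_ms=5.0):
    """Running sum of squared samples over a rectangular window of `window_ms`."""
    length = max(1, int(round(window_ms * 1e-3 * fs)))
    squared = np.asarray(z, dtype=float) ** 2
    # mode="same" keeps the contour aligned with, and as long as, the input.
    return np.convolve(squared, np.ones(length), mode="same")
```

Dividing the result by its maximum gives a contour normalized to the range 0 to 1, which appears to be how the energy panels of Figure 5.1 are scaled.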

Similarly, the white random noise in the second part and the additive white random noise in the third part show higher energy at the filter output. This is due to the presence of all frequency components, including the sub-fundamental frequency components, in the random noise signal. However, the additive noise in the last part is not detected by the method. The noise added in this region is very low (30 dB signal-to-noise ratio), and therefore the amount of aperiodic component in this region is much lower than the periodic component. Due to the absence of sufficient aperiodic component energy, the last region is not captured at the filtered output.

[Figure 5.2: two panels (a)-(b). Horizontal axis: time in seconds. Vertical axis: normalized amplitude.]

Figure 5.2: (a) Synthetic signal containing aperiodic white random noise with various duration. (b) Energy of the SFF filtered output of the synthetic signal.
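The spectral argument made for the synthetic signal can be checked numerically with a sketch like the one below, which measures the fraction of a signal's non-DC spectral energy that falls in the 20 to 45 Hz range quoted earlier. The function name, the exclusion of the DC bin, and the test signals are assumptions for illustration. It uses a plain DFT rather than the SFF filter itself.

```python
import numpy as np

def sub_fundamental_fraction(x, fs, f_low=20.0, f_high=45.0):
    """Fraction of non-DC spectral energy lying between f_low and f_high (Hz)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f_low) & (freqs <= f_high)
    return power[band].sum() / power[1:].sum()

fs = 8000                                   # assumed sampling rate
train = np.zeros(fs)
train[::40] = 1.0                           # impulse train with a 5 ms period
impulse = np.zeros(fs)
impulse[0] = 1.0                            # single impulse
noise = np.random.default_rng(0).standard_normal(fs)

for name, signal in [("train", train), ("impulse", impulse), ("noise", noise)]:
    print(name, sub_fundamental_fraction(signal, fs))
```

The impulse train has energy only at multiples of its fundamental, so its fraction in this band is essentially zero, whereas the single impulse and the white noise both place a small but nonzero share of their energy there.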

The SFF method sometimes fails to detect a stretch of random noise as one unit. Figure 5.2 shows a synthetic signal containing random noise of various durations. The first unit, just after the single impulse, contains random noise of 100 ms duration. The durations of all other units are integer multiples of the duration of the first unit. From Figure 5.2 (b) it can be seen that the first unit can be detected as one unit, as the energy of the SFF filtered output is almost uniform. For all other units, however, the energy fluctuates, and it is not possible to detect each of them as a single unit. In other words, some regions of the random noise sequence are missed by the SFF method. In such cases, knowledge of VLRs can be used to merge hypothesized DARs belonging to the same unit. Moreover, vocal tract information can also be explored to detect the missed regions. The use of VLR knowledge and vocal tract information is described in the subsequent subsections.
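As a rough illustration of the merging idea, the sketch below joins consecutive hypothesized DARs when no VLR lies in the gap between them. The actual rule is given in later subsections that are not reproduced here, so this rule, the function name, and the (start, end) segment representation are hypothetical stand-ins.

```python
def merge_dars(dars, vlrs):
    """Merge consecutive DAR hypotheses that are not separated by a VLR.

    dars, vlrs: lists of (start, end) tuples in seconds.
    Hypothetical rule: two DARs are taken to belong to the same unit when
    they overlap or when no VLR overlaps the gap between them.
    """
    merged = []
    for start, end in sorted(dars):
        if merged:
            prev_start, prev_end = merged[-1]
            gap_has_vlr = any(v_start < start and v_end > prev_end
                              for v_start, v_end in vlrs)
            if start <= prev_end or not gap_has_vlr:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged
```

With DARs at (0.10, 0.20), (0.25, 0.40) and (0.90, 1.00) and a single VLR at (0.45, 0.85), the first two hypotheses are merged into (0.10, 0.40), while the third stays separate because the VLR lies between it and the merged region.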

Analysis of natural speech signal

Similar to the synthetic signal, a speech signal for the utterance “She had your dark suit in”, obtained from the TIMIT database, is plotted in Figure 5.1 (d), along with the SFF filtered signal (Figure 5.1 (e)) and its energy (Figure 5.1 (f)). The speech signal contains different types of sonorant and obstruent sounds with different proportions of periodic and aperiodic components.

The energy of the filtered signal is high in the obstruent regions and approximately zero in the sonorant regions for most of the time. In some cases, due to an adjacent obstruent region, some portion of a sonorant region has high energy in the output signal. The aperiodic components present in the burst regions of the stop consonants (/d/, /k/ and /t/) are strong, and hence the filtered output has high energy in those regions. There are two unvoiced fricatives, namely /sh/ and /s/, and one voiced fricative /hh/ in the signal. It can be seen that the energy of the filtered signal for the fricatives /s/ and /hh/ is high, and it should be possible to detect those regions. However, for the fricative /sh/, the energy gradually tapers off towards the left, which may lead to some regions going undetected.

To detect such missed regions, especially in the case of fricatives, some vocal tract system information is explored. The features related to vocal tract information are described in the next subsection.

Moreover, as with some of the random noise regions in the synthetic signal, the energy of the filtered output fluctuates in the case of the fricative /s/. This energy fluctuation will lead to the detection of two DARs for the same sound. In this case, the use of VLR information may help merge the two DARs.

5.2.2 Dominant resonant frequency (DRF) and high to low frequency components

In document: Biswajit Dev Sarma (pages 129-133)