Pitch detection Methods - Developing A Word Separation Algorithm for Recognition of Continuous

A person’s pitch originates in the vocal cords/folds, and the rate at which the vocal folds vibrate is the frequency of the pitch. So, when the vocal folds oscillate at 300 times per second, they are said to be producing a pitch of 300 Hz. A pitch detection algorithm[23]

(PDA) is an algorithm designed to estimate the pitch or fundamental frequency of a quasiperiodic or virtually periodic signal, usually a digital recording of speech or a musical note or tone. This can be done in the time domain or the frequency domain.

PDAs are used in various contexts (e.g. phonetics, music information retrieval, speech coding, musical performance systems) and so there may be different demands placed upon the algorithm. Pitch determination in speech signals is an extensively studied topic for more than four decades. Current methods are still not to a desired level of accuracy.

When it presented with a single clean pitched signal, most techniques do well, but when the signal is noisy, many current PDA still fail to perform. So a variety of algorithms exist, most falling broadly into the classes given below:

2.2.1 Zero Crossings

In a time domain feature detection method, the signal is usually preprocessed to accentuate some time domain feature, then the time between occurrences of that feature is calculated as the period of the signal [23]. A typical time domain feature detector is implemented by low pass filtering the signal, then detecting peaks or zero crossings.

Since the time between occurrences of a particular feature is used as the period estimate, feature detection schemes usually do not use all of the data available. However, this does not work well with complex waveforms which are composed of multiple sine waves with differing periods. Nevertheless, there are cases in which zero-crossing can be a

useful measure, for example in some speech applications where a single source is assumed. The algorithm's simplicity makes it cheap to implement.

2.2.2 Cepstrum

A reliable way of obtaining an estimate of the dominant fundamental frequency for long, clean, stationary speech signals is to use the cepstrum [24]. The cepstrum is a Fourier analysis of the logarithmic amplitude spectrum of the signal. If the log amplitude spectrum contains many regularly spaced harmonics, then the Fourier analysis of the spectrum will show a peak corresponding to the spacing between the harmonics: i.e. the fundamental frequency. Effectively we are treating the signal spectrum as another signal, then looking for periodicity in the spectrum itself. The cepstrum is so-called because it turns the spectrum inside-out. The x-axis of the cepstrum has units of quefrency, and peaks in the cepstrum (which relate to periodicities in the spectrum) are called rahmonics. To obtain an estimate of the fundamental frequency from the cepstrum we look for a peak in the quefrency region (shown in Figure 2.5) corresponding to typical speech fundamental frequencies.

Figure 2.5: Frequency and quefrency of speech waveform.

The cepstrum works best when the fundamental frequency is not changing too rapidly, when the fundamental frequency is not too high and when the signal is relatively noise- free. A disadvantage of the cepstrum is that it involves computationally-expensive frequency domain processing.

2.2.3 LPC based

A very powerful method for speech analysis is based on linear predictive coding (LPC) [27], also known as auto-regressive (AR) modeling. This method is widely used because it is fast and simple, yet an effective way of estimating the main parameters of speech

signals. An all-pole filter with a sufficient number of poles is a good approximation for speech signals. Thus, we could model the filter H(z) as

}} }}

}}

z a z A

}}

z z

z X

H _p

k k

1 1

} } }

}



}

} .……….(19)

where p is the order of the LPC analysis. Taking inverse z-transform in above Equation results in

}}  } } }}

}

} } } ^p

kxn k en

a n

1 ..……….(20)

Linear predictive coding gets its name from the fact that it predicts the current sample as a linear combination of its past p samples. The LPC method would be used to detect not only pitch but also formants. Waveform and frequency Analysis of LPC filter for a frame is shown in Figure 2.6.

Figure 2.6: Waveform and Frequency Analysis of LPC filter for a frame [24].

2.2.4 Autocorrelation

The correlation between two waveforms is a measure of their similarity. The waveforms are compared at different time intervals, and their “sameness” is calculated at each interval. The result of a correlation is a measure of similarity as a function of time lag between the beginnings of the two waveforms. The autocorrelation function is the correlation of a waveform with itself. One would expect exact similarity at a time lag of zero, with increasing dissimilarity as the time lag increases. The mathematical definition of the autocorrelation function is shown in first equation, for an infinite discrete function x[n], and second equation shows the mathematical definition of the autocorrelation of a finite discrete function x0[n] of size N.

}} 

^

}}} }

}

}

} }

x v xnxn v

}} 

^}^}

}}} }

}

 }^N ^v   }

x v x n x n v

R ¹

The cross-correlation between two functions x[n] and y[n] is calculated using Equation below

}} 

^

}}} }

}

}

} }

xy v xn yn v

R ……….(21)

Periodic waveforms exhibit an interesting autocorrelation characteristic: the autocorrelation function itself is periodic. As the time lag increases to half of the period of the waveform, the correlation decreases to a minimum. This is because the waveform is out of phase with its time-delayed copy. As the time lag increases again to the length of one period, the autocorrelation again increases back to a maximum, because the waveform and its time-delayed copy are in phase. The first peak in the autocorrelation indicates the period of the waveform.

The cepstrum looks for periodicity in the log spectrum of the signal, whereas autocorrelation is more strongly related to periodicity in the waveform itself. The

autocorrelation function for a section of signal shows how well the waveform shape correlates with itself at a range of different delays. We expect a periodic signal to correlate well with itself at very short delays and at delays corresponding to multiples of pitch periods. In Figure 2.7 waveform of a speech signal and autocorrelation function for a single frame of the speech signal is shown.

Figure 2.7: (a) Waveform of a speech signal.

(b) Autocorrelation function for a frame of the above speech signal.

The autocorrelation approach works best when the signal is of low, regular pitch and when the spectral content of the signal is not changing too rapidly.

2.2.5 Harmonic Product Spectrum(HPS)

HPS is based on the assumption that voice is essentially composed of a fundamental frequency (f

0), i.e. the pitch, and harmonics that occur at integer multiples of f

0. Thus, if it is downsampled the DFT of a voice signal by 2, 3, 4 etc. and then multiply the original with its (zero-extended) downsampled versions, the product should be a single peak located at the pitch frequency [25]. This idea is displayed in Figure 2.8.

Figure 2.8: HPS Algorithm.

Dalam dokumen Developing A Word Separation Algorithm for Recognition of Continuous Bangla Speech (Halaman 30-36)