
Efficient Noise Suppression for Robust Speech Recognition


The thesis deals with noise estimation techniques using a single microphone for speech recognition in noisy environments. I enhanced quantile-based noise estimation (QBNE) by adjusting the quantile level (QL) based on the relative amount of noise added to the target speech. Basically, we assign one of two different QLs, i.e. binary levels, according to a statistical moment of the power spectrum measured on a logarithmic scale at each frequency.

On the assumption that speech is generally uncorrelated with the ambient noise, the noise power spectrum can be estimated from the mixture-model parameters of the speech-absence class. I compared the proposed methods with the conventional QBNE and the minimum statistics-based method on a simple speech recognition task at different signal-to-noise ratio (SNR) levels. The experimental results show that the proposed methods are superior to the conventional methods.

Abbreviations:

  • ASR: Automatic Speech Recognition
  • DFT: Discrete Fourier Transform
  • EM: Expectation Maximization
  • GMM: Gaussian Mixture Model
  • HMM: Hidden Markov Model
  • HTK: Hidden Markov Model Toolkit
  • MFCC: Mel Frequency Cepstral Coefficients
  • MLE: Maximum Likelihood Estimator
  • MMSE: Minimum Mean Square Error
  • MS: Minimum Statistics

Introduction

From the observed log PSDs of stationary noise and speech, we can see that the distribution of stationary noise is close to Gaussian or peakier than Gaussian (super-Gaussian), while that of a speech signal is more spread out than Gaussian (sub-Gaussian). A contrast function determines whether a given distribution is super-Gaussian (positive) or sub-Gaussian (negative) by its distance from the Gaussian [4], and I assign a higher quantile level when the distribution is Gaussian or super-Gaussian. Dual GMM and RMM are used to model the class likelihoods, and the lower mean and the sigma parameter are used for noise power estimation, respectively.

The organization of this thesis is as follows: Chapter 2 gives a brief explanation of single-microphone noise reduction techniques, and Chapter 3 covers conventional noise estimation methods that do not rely on VAD. The proposed noise estimation methods are described in Chapter 4, and Chapter 5 summarizes the speech recognition experiments.

Single Microphone based Noise Suppression

Spectral subtraction

$P_y(\omega)$, $P_s(\omega)$ and $P_v(\omega)$ are the power spectra of the input signal, the clean speech and the noise, respectively. The estimated speech power is set to zero whenever the subtracted value is negative (half-wave rectification). Finally, we obtain the noise-suppressed speech $\hat{s}(n)$ by taking the inverse DFT of the square root of the right-hand side of equation (6).
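Below is a minimal sketch of this per-frame procedure, assuming a windowed frame and a separately obtained noise PSD estimate; framing, overlap-add and all names are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def spectral_subtraction_frame(frame, noise_psd):
    """Enhance one windowed frame by power spectral subtraction.

    noise_psd must have length len(frame)//2 + 1 (rfft bins).
    """
    Y = np.fft.rfft(frame)
    power = np.abs(Y) ** 2
    # Half-wave rectification: negative differences are clamped to zero.
    clean_power = np.maximum(power - noise_psd, 0.0)
    # Square root of the cleaned power, reusing the noisy phase.
    S = np.sqrt(clean_power) * np.exp(1j * np.angle(Y))
    return np.fft.irfft(S, n=len(frame))
```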

As such, the principle of spectral subtraction is very simple, but it only applies to stationary noise, as mentioned before. Since most sounds in the real world are non-stationary, musical noise often remains after filtering.

Figure 2.2 Half wave rectification for non-negative value

Wiener filter

As assumed before, speech and noise are uncorrelated and both speech and noisy speech are WSS, so the autocorrelation function of $y(n)$ can be rewritten as $r_y(\tau) = r_s(\tau) + r_v(\tau)$.
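Under these assumptions, the frequency-domain Wiener gain is the ratio of speech power to total power at each frequency. A minimal sketch, assuming the speech PSD is approximated by spectral subtraction (names and the flooring constant are illustrative):

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=1e-10):
    """H(w) = Ps(w) / (Ps(w) + Pv(w)), with Ps(w) estimated as Py(w) - Pv(w)."""
    speech_psd = np.maximum(noisy_psd - noise_psd, 0.0)
    return speech_psd / np.maximum(speech_psd + noise_psd, floor)
```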

Noise Estimation

Minimum statistics based noise estimation

  • Principle of the minimum statistics method
  • Deriving optimal time-frequency dependent smoothing factor
  • Bias factor

As mentioned before, it is assumed that the power spectrum of noisy speech is the sum of the power spectra of speech and noise. The noise variance is therefore estimated by tracking the minimum of the noisy speech power spectrum over a fixed buffer length, and a bias factor is derived for the noise estimate since the minimum tracking is biased towards lower values.

The smoothing parameter used in equation (23) has to be small for the estimate to follow the noise quickly; on the other hand, it has to be close to one to keep the variance of the smoothed power, and hence of its minimum, as small as possible. Therefore, the smoothing parameter is derived by minimizing the conditional mean squared error between the smoothed power estimate $P(\omega,t)$ and the true noise power $|V(\omega,t)|^2$.

Note that in equation (25) a time-frequency dependent smoothing factor $\beta(\omega,t)$ is used instead of the fixed factor defined in (23). However, in a real-time implementation, the estimated noise variance $|\hat{V}(\omega,t)|^2$ lags behind the true noise variance. Therefore, a correction factor $\beta_c(t)$ is calculated from the ratio between the average smoothed periodogram and the estimated noise power.

Since the minimum estimate is biased towards low values, a bias factor that compensates the minimum of the noisy speech power spectrum is derived using the statistics of the minimum of the correlated PSD estimates of the noisy speech. The bias term is obtained by finding the mean of the minimum PSD for $|V(\omega,t)|^2 = 1$, which after simplification yields $Q_{eq}$, the "equivalent degrees of freedom", a function of the smoothed periodogram and the prior noise variance.
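A simplified sketch of the overall minimum-statistics tracking loop follows; it uses a fixed smoothing factor and a fixed bias compensation instead of the optimal $\beta(\omega,t)$ and the $Q_{eq}$-based bias derived above, and all constants are illustrative assumptions rather than values from the thesis.

```python
import numpy as np

def minimum_statistics(noisy_psd_frames, beta=0.85, win=96, bias=1.5):
    """Track the noise PSD; noisy_psd_frames has shape (n_frames, n_freq)."""
    n_frames, _ = noisy_psd_frames.shape
    smoothed = np.zeros_like(noisy_psd_frames)
    noise_est = np.zeros_like(noisy_psd_frames)
    smoothed[0] = noisy_psd_frames[0]
    noise_est[0] = noisy_psd_frames[0]
    for t in range(1, n_frames):
        # Recursive smoothing of the noisy power spectrum.
        smoothed[t] = beta * smoothed[t - 1] + (1 - beta) * noisy_psd_frames[t]
        # Minimum over the last `win` smoothed frames, compensated for its
        # downward bias by a multiplicative factor.
        lo = max(0, t - win + 1)
        noise_est[t] = bias * smoothed[lo:t + 1].min(axis=0)
    return noise_est
```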

Quantile based noise estimation

The QBNE method is conceptually very simple and, like the MS-based method, can track the noise power spectrum without any reference to speech presence. However, the estimated noise power becomes close to the speech power components when only a little noise is added. As a result, the output signal is distorted by overestimation of the noise power spectrum, leading to a low recognition rate.
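A minimal sketch of the basic QBNE idea, assuming a buffer of past power spectra per frequency; the default quantile value is an illustrative choice, not the thesis setting.

```python
import numpy as np

def qbne(noisy_psd_frames, quantile=0.5):
    """Per-frequency noise estimate from a (n_frames, n_freq) power buffer.

    Speech is sparse in time-frequency, so a low-to-mid quantile of the
    observed power at each frequency approximates the noise floor.
    """
    return np.quantile(noisy_psd_frames, quantile, axis=0)
```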

Histogram based noise estimation

If we compare three noisy speech signals at different SNRs, we can easily see that the magnitude spectrum of the noiseless speech is concentrated around zero and its average is smaller than that of the others. In general, the lower the SNR of the noisy speech, the more the spectrum shifts towards higher values, and we can also observe that the variance increases as the SNR decreases.
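The histogram behaviour described above suggests a simple estimator: take the modal bin of the log-power histogram at each frequency as the noise level, since noise dominates most frames. The sketch below is an illustrative reading of this idea rather than the exact cited procedure; the bin count is an assumption.

```python
import numpy as np

def histogram_noise_estimate(noisy_psd_frames, n_bins=40):
    """Per-frequency noise estimate from the mode of the log-power histogram."""
    log_psd = np.log(noisy_psd_frames + 1e-12)
    noise_log = np.empty(log_psd.shape[1])
    for k in range(log_psd.shape[1]):
        counts, edges = np.histogram(log_psd[:, k], bins=n_bins)
        m = np.argmax(counts)
        noise_log[k] = 0.5 * (edges[m] + edges[m + 1])  # centre of the modal bin
    return np.exp(noise_log)
```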

Figure 3.2 Speech signal, magnitude spectrum and histogram at 2 kHz according to change of SNR

Proposed Method

Binary quantile level based noise estimation

  • Kurtosis based gaussianity estimation
  • Negentropy based gaussianity estimation
  • Extended infomax algorithm based gaussianity estimation

In terms of the shape of the distributions, noise histograms are close to a Gaussian or super-Gaussian distribution. On the other hand, a distribution becoming flatter means that the SNR of the input signal is increasing, so a low QL should be chosen. The input buffer $b(\omega,t)$ must be normalized to zero mean and unit variance before applying the function $f(\cdot)$ that evaluates the Gaussianity of the distribution.

It is important to measure the Gaussianity of the distribution because it is highly correlated with the amount of additive noise. The measure should not only quantify similarity to a Gaussian but also distinguish sub-Gaussian from super-Gaussian distributions. Kurtosis equals 3 for a Gaussian random variable and is less than or greater than 3 for sub-Gaussian or super-Gaussian distributions, respectively.
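A minimal sketch of kurtosis-based Gaussianity measurement and the resulting binary QL choice; the two quantile levels and the threshold at 3 are illustrative assumptions, since the thesis determines its levels experimentally.

```python
import numpy as np

def kurtosis(x):
    """Sample kurtosis: 3 for Gaussian, >3 super-Gaussian, <3 sub-Gaussian."""
    x = (x - x.mean()) / (x.std() + 1e-12)  # force zero mean, unit variance
    return np.mean(x ** 4)

def select_quantile_level(log_psd_buffer, ql_high=0.9, ql_low=0.5):
    """Gaussian/super-Gaussian log-PSD suggests a noisy band -> higher QL."""
    return ql_high if kurtosis(log_psd_buffer) >= 3.0 else ql_low
```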

Since the distribution of stationary noise is super-Gaussian or Gaussian, I obtained the binary quantile level experimentally as follows. Because the Gaussian has the largest entropy among all random variables of equal variance, the negentropy is always non-negative and is zero if and only if x follows a Gaussian distribution. This means that when the SNR of a frequency band decreases, higher QLs are more suitable.
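As a sketch, negentropy can be approximated with a nonquadratic contrast function $G(u)=\log\cosh(u)$ compared against a Gaussian reference (a Hyvarinen-style approximation; the Monte Carlo reference and the choice of $G$ are assumptions, not the thesis formula).

```python
import numpy as np

def negentropy(x, n_ref=100_000, seed=0):
    """Approximate J(x) ~ (E[G(x)] - E[G(v)])**2 with v standard Gaussian."""
    x = (x - x.mean()) / (x.std() + 1e-12)   # normalize to zero mean, unit variance
    ref = np.random.default_rng(seed).standard_normal(n_ref)
    G = lambda u: np.log(np.cosh(u))
    return (G(x).mean() - G(ref).mean()) ** 2
```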

Conversely, if the SNR of the input signal is increasing, we should choose low QLs. Estimating the type of non-Gaussianity is important because a distribution approaching the super-Gaussian indicates a very noisy condition. The decision factor k characterizes the shape of the distribution of u: 1, 0 and -1 for super-Gaussian, Gaussian and sub-Gaussian, respectively.
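A minimal sketch of such a decision factor, using the switching criterion from the extended infomax algorithm of Lee et al.; treating values near zero as "Gaussian" via a tolerance is an illustrative assumption.

```python
import numpy as np

def infomax_decision_factor(u, tol=1e-2):
    """Return 1 (super-Gaussian), 0 (near-Gaussian) or -1 (sub-Gaussian)."""
    u = (u - u.mean()) / (u.std() + 1e-12)
    stat = np.mean(1.0 / np.cosh(u) ** 2) * np.mean(u ** 2) - np.mean(u * np.tanh(u))
    if abs(stat) < tol:
        return 0
    return 1 if stat > 0 else -1
```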

Figure 4.1 Histogram of log-scale PSD for various noises and clean speech, measured at 2 kHz

Dual mixture model based noise estimation

  • Dual Gaussian mixture model based noise estimation
  • Dual Rayleigh mixture model based noise estimation

Therefore, we assume that the probability of $|Y(\omega,t)|^2$ given $C_j(\omega)$ follows a univariate Gaussian density function with a class-dependent mean and variance. From this, we estimate the noise power spectrum by taking a long-term average of the lower-mean (speech-absence) component. It is common practice to use a Gaussian probability density function, and a GMM can represent almost any form of distribution with little distortion.
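A minimal sketch of fitting such a dual (two-component) 1-D GMM with EM at a single frequency and reading off the speech-absence component; the initialization, iteration count and final read-out are illustrative assumptions.

```python
import numpy as np

def dual_gmm_noise(x, n_iter=50):
    """Fit a 2-component GMM to power values x; return the lower component mean."""
    x = np.asarray(x, dtype=float)
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([x.var(), x.var()]) + 1e-12
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of the two Gaussian components.
        d2 = (x[:, None] - mu) ** 2
        lik = w * np.exp(-0.5 * d2 / var) / np.sqrt(2 * np.pi * var)
        r = lik / (lik.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: update weights, means and variances.
        nk = r.sum(axis=0) + 1e-12
        w, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-12
    return mu.min()  # lower-mean component ~ speech absence (noise) power
```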

For estimating the noise power spectrum with small error, the Rayleigh probability density function is more suitable than the Gaussian, because the power spectrum is non-negative and the Rayleigh distribution is defined only for non-negative values. The maximum of the Rayleigh density equals $1/(\sigma\sqrt{e})$ and is reached at $x=\sigma$.

From this property, the noise power spectrum is estimated by taking the argument that maximizes the density of the speech-absence component, i.e. $x=\sigma$.
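A minimal sketch of this mode property: given the scale $\sigma$ of the speech-absence Rayleigh component (in the thesis obtained from the dual RMM), the density peaks at $x=\sigma$, so $\sigma$ itself serves as the noise estimate. The single-Rayleigh MLE below is an illustrative simplification of how such a scale could be fitted.

```python
import numpy as np

def rayleigh_sigma_mle(x):
    """ML estimate of the Rayleigh scale: sigma^2 = mean(x^2) / 2."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x ** 2) / 2.0)
```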

Figure 4.6 GMM based likelihoods for speech presence and absence

Experimental Results

According to the type of proposed method, we summarize the speech recognition results separately in Tables 5.2 and 5.3, where the "None" column lists the performance without noise suppression. I found that QBNE and MS have complementary advantages and disadvantages, as shown in Figures 5.1 and 5.2: QBNE performs quite well under severely noisy conditions, whereas MS improves as the SNR increases and is best in the clean condition.

Among the binary-QL approaches, extended infomax-based noise estimation gives the best performance, and the kurtosis-based method outperforms the negentropy-based method. The better performance can likely be attributed to the fact that the infomax algorithm and kurtosis can identify sub-Gaussian distributions, whereas negentropy cannot. For the other type of approach, based on mixture models, the recognition rates increase by an average of about 5.7% and 7.4% compared to no processing.

The RMM-based method shows a better result than the GMM approach, so the RMM is a more suitable probability model for the power spectrum. In summary, the speech recognition results show that the proposed methods are quite stable and overcome the limitations of the conventional methods across different noise types and noise levels, regardless of the type of proposed noise estimation.

Table 5.2 Results of speech recognition experiment-1 (binary QL based methods)

References

S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing.
[3] Volker Stahl, Alexander Fischer, and Rolf Bippus, "Quantile-based noise estimation for spectral subtraction and Wiener filtering," in Proceedings of ICASSP.
T.-W. Lee, M. Girolami, and T. J. Sejnowski, "Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources," Neural Computation.
[6] Gil-Jin Jang and Hoon-Young Cho, "Efficient spectrum estimation of noise using line spectral pairs for robust speech recognition," Electronics Letters.
H.-G. Hirsch, "AURORA experimental framework for performance evaluations of speech recognition systems under noisy conditions," in Proceedings of INTERSPEECH.
[10] Intae Lee and Gil-Jin Jang, "Independent vector analysis based on overlapped variable-width cliques for frequency-domain blind signal separation," EURASIP Journal on Advances in Signal Processing.

Acknowledgments

Above all, I would like to express my deepest gratitude to my advisor, Prof. I would like to thank the Machine Intelligence Lab members: Ara, Junyoung, Kibeom, Insik, Hyungju, Jiu, Doyeon, Chungho and Sungyong. He cannot share this joy with me, but I am sure he is very proud of me.

