of Hard and Soft Thresholding with Bias-compensated Noise Level

Typically, setting the threshold criteria requires an accurate estimate of the noise level in the noisy speech. After hard thresholding is applied, soft thresholding is applied to the remaining regions to further reduce the noise level.

Introduction

Speech Enhancement: Background

Speech enhancement based on spectral decomposition and filtering [14]-[22] remains a common and effective approach for enhancing speech degraded by additive acoustic noise when only the noisy speech is available. This general class is based on variations of optimal filters and includes such methods as Wiener filtering, spectral subtraction, and various maximum likelihood (ML) estimation schemes. Speech enhancement methods have also been proposed based on the periodicity due to pitch [29].

Objective of This Research

The objective of this research is a combined application of hard and soft thresholding of the transform coefficients of the noisy signal, with the bias-compensated noise level as the threshold parameter. We investigate the performance of the proposed method in both the wavelet and DCT (discrete cosine transform) domains using the corrected noise level as the threshold parameter.

Organization of the Thesis

The transformation coefficients are first divided into a number of blocks consisting of the appropriate number of consecutive coefficients for the transformed signal. The results will be compared with one of the latest methods proposed by Bahoura and Rouat [37].

Review of Speech Enhancement Techniques

Introduction
Speech Enhancement Techniques Based on Short-Time Spectral Amplitude Estimation

Speech enhancement based on direct estimation of short-time spectral amplitude
Speech enhancement techniques based on Wiener filtering

Speech Enhancement Techniques Based on Speech Model
Wavelet Speech Enhancement Based on the Teager Energy Operator

Wavelet packet analysis
Teager energy approximation
Masks construction
Threshold modulation criterion
Mask processing for the time-adapted threshold
Time-adapted threshold
Thresholding WP coefficients
Inverse transformation

Conclusion

First, the short-time spectral amplitude is estimated in the frequency domain, using the degraded speech spectrum. The power spectral density estimate of the noise process v(n) is denoted by S_v(ω).

Fig. 2.1: The spectral subtraction approach

Bias-compensated Noise Level for Wavelet and DCT Speech Enhancement

Introduction
Problem Formulation
Estimation of Noise Level: Conventional Approach
Wavelet Transform Based Proposed Enhancement Algorithm

Calculation of corrected value of noise level

In speech, most of the signal energy is usually concentrated in the lower frequency range. The higher frequency range of the transform coefficients usually contains mostly noise, and the noise level is usually estimated from it [48]. The effect of the signal components present in this region may be insignificant at low SNRs, but it is not negligible at high SNRs.

Unlike other conventional techniques, a more accurate estimate of the threshold parameter, the noise level, is obtained by compensating for the effect of the signal trace remaining in the high-frequency region of the degraded speech transform coefficients. In the DCT-based analysis, the MAD of the DCT coefficients is calculated at the best level (N/2 + 1 to N) to estimate the threshold parameter for the denoising process. Applying the uncompensated estimate σ_v+ as the threshold parameter removes some of the signal components, which has a significant negative effect at high SNRs.
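The conventional MAD-based estimate over the upper half of the DCT coefficients can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the naive DCT-II, the function names, the use of the median of absolute values as the MAD, and the 0.6745 scaling constant (the usual Gaussian consistency factor) are not given in the text.

```python
import math
import random
import statistics

def dct_ii(x):
    """Naive orthonormal DCT-II, O(N^2); adequate for a short demo frame."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def mad_noise_level(coeffs):
    """Noise level from the upper half of the coefficients (N/2+1..N, 1-based).

    Assumption: MAD is taken as the median of absolute values, and
    0.6745 converts it to a standard deviation for Gaussian noise.
    """
    high = coeffs[len(coeffs) // 2:]
    return statistics.median([abs(c) for c in high]) / 0.6745

# Demo on pure white Gaussian noise with standard deviation 0.5.
rng = random.Random(0)
noise = [rng.gauss(0.0, 0.5) for _ in range(512)]
sigma_hat = mad_noise_level(dct_ii(noise))
```

For noise-only input the estimate should fall close to the true standard deviation; when speech energy leaks into the upper half, the same estimate is biased upward, which is the effect the correction factor is meant to compensate.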

However, in noise level estimation, the WP transform of the degraded speech to the first level is used. Therefore, kurtosis can be used to estimate the correction factor, which is proportional to the signal present in the MAD of the wavelet coefficients at the best level.
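As a point of reference, the sample kurtosis itself can be computed as in the sketch below. The mapping from kurtosis to the correction factor is not reproduced here because the text does not state it; the function name and the interpretation in the comments (Gaussian noise near 3, sparse speech-like coefficients above 3) are assumptions for illustration.

```python
def kurtosis(values):
    """Sample kurtosis m4 / m2^2 (equals 3 for a Gaussian distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 * m2)

# A few large coefficients among many zeros give a heavy-tailed distribution.
sparse = [0.0] * 8 + [4.0, -4.0]
k_sparse = kurtosis(sparse)  # larger than the Gaussian value of 3
```

A kurtosis well above 3 in the high-frequency coefficients would then suggest that signal components are present alongside the noise.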

Fig. 3.1: Variation of σ_v+ with SNR: (a) wavelet; (b) DCT.

3.6) is φ, with φ_0 (for p = 0) being the SNR of the original given noisy speech. Then,

For a high SNR of the given noisy speech, an entire curve like that of the template function can be generated by adding auxiliary noise to the given noisy speech, except for a possible shift in SNR due to differences in the speech signals. On the other hand, for a very low SNR the degree of mismatch is very high, as γ(φ) resembles only a small part of the lower portion of the template function. For an intermediate SNR of the given noisy signal, the degree of similarity between the curves γ(φ) and Γ(φ) varies between the above two cases.

In this research, the maximum correlation value between the curves γ(φ) and Γ(φ) is used as a measure of the degree of similarity. This is because, regardless of its shape, γ(φ) is generally shifted from Γ(φ) in SNR, and thus the maximum correlation value is reached when they overlap. The maximum value of the cross-correlation function R_γΓ(d), denoted by R_max, indicates the degree of similarity between the template and the test functions.
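The search for the maximum correlation over all shifts between the test curve and the template curve can be sketched as below. The normalization by the total energies of both curves is an assumption (the text does not give the exact definition of the correlation function); with this choice a test curve that matches only a small part of the template yields a small maximum. Function and variable names are illustrative.

```python
import math

def max_correlation(test, template):
    """Return (R_max, best_lag) of the cross-correlation over all lags d.

    Assumed normalization: divide by sqrt(energy(test) * energy(template)),
    so that R_max = 1 only when the curves coincide after shifting.
    """
    norm = math.sqrt(sum(a * a for a in test) * sum(b * b for b in template))
    best_r, best_d = float("-inf"), 0
    for d in range(-(len(test) - 1), len(template)):
        # Sum over the samples that overlap when test is shifted by d.
        r = sum(test[i] * template[i + d]
                for i in range(len(test))
                if 0 <= i + d < len(template)) / norm
        if r > best_r:
            best_r, best_d = r, d
    return best_r, best_d

# The test curve is an exact, shifted copy of the non-zero part of the template.
r_max, lag = max_correlation([1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0])
```

Both curves are assumed to be sampled at the same SNR interval, so that a lag of one sample corresponds to one SNR step.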

On the other hand, when the SNR of the given noisy speech is very low, R_max is found to be at most 0.255. However, below such a low SNR value, the effect of signal bias on the estimated value of the noise level is insignificant.

3.9) R_max lies between Ψ and ψ, a linear interpolation for the proposed correction

Thresholding WP coefficients

Application of soft thresholding alone

Combined application of hard and soft thresholding

Reconstruction of the original signal
DCT Based Proposed Enhancement Algorithm

In the remaining regions, where the signal strength is higher than the noise, soft thresholding is applied to further enhance the noisy signal. As the noise power uniformly penetrates the actual signal in the wavelet domain, subtracting the noise power from the signal power is expected to improve the SNR of the enhanced signal. The enhanced signal is synthesized by the inverse transformation WP^-1 of the modified WP coefficients.
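A block-wise combination of hard and soft thresholding might look like the sketch below. It is only an illustration under assumptions: the criterion for deciding that a block is noise-only (block RMS not exceeding the noise level), the amplitude-domain soft-threshold rule, the block size, and all names are hypothetical and are not taken from the text.

```python
import math

def soft_threshold(c, t):
    """Shrink coefficient c toward zero by t; zero it if |c| <= t."""
    if abs(c) <= t:
        return 0.0
    return c - t if c > 0 else c + t

def hard_soft_threshold(coeffs, sigma, block_size):
    """Zero blocks judged noise-only (hard), soft-threshold the others.

    sigma is the noise level used as the threshold parameter.
    """
    out = []
    for start in range(0, len(coeffs), block_size):
        block = coeffs[start:start + block_size]
        rms = math.sqrt(sum(c * c for c in block) / len(block))
        if rms <= sigma:
            # Assumed criterion: block is no stronger than the noise.
            out.extend(0.0 for _ in block)
        else:
            out.extend(soft_threshold(c, sigma) for c in block)
    return out

coeffs = [0.1, -0.2, 0.1, 0.0, 3.0, -2.0, 0.5, 4.0]
cleaned = hard_soft_threshold(coeffs, sigma=1.0, block_size=4)
```

In this example the first block is removed entirely, while the second block keeps its large coefficients reduced in magnitude by the noise level.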

The topic of speech enhancement is widely researched, and many speech enhancement algorithms make use of the Discrete Fourier Transform (DFT) to facilitate the removal of noise embedded in the noisy speech signal. This is often done as it is easier to separate the speech energy and the noise energy in the transformation domain. For example, the energy of white noise is uniformly spread across the entire spectrum, but the energy of speech, especially voiced speech, is concentrated.

Most of these algorithms only try to modify the spectral amplitudes of the noise-disturbed speech signal to reduce the effect of the noise component, while keeping the noise-disturbed phase information intact. It is interesting to note that in [15] it was found that the best estimate of the phase of the speech component was the phase of the corrupted signal itself.
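The idea of modifying only the spectral amplitude while reusing the noisy phase can be sketched for a single frame as follows. Power spectral subtraction with a floor at zero is used here as one example of an amplitude rule; the naive DFT, the per-bin noise power input `noise_psd`, and the function names are assumptions for illustration.

```python
import cmath
import math

def dft(x):
    """Naive DFT, O(N^2); adequate for a short demo frame."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]

def idft_real(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * m / n)
                 for k in range(n)) / n).real
            for m in range(n)]

def enhance_frame(noisy, noise_psd):
    """Subtract the noise power per bin, keep the phase of the noisy frame."""
    modified = []
    for value, noise_power in zip(dft(noisy), noise_psd):
        power = max(abs(value) ** 2 - noise_power, 0.0)  # floor at zero
        modified.append(cmath.rect(math.sqrt(power), cmath.phase(value)))
    return idft_real(modified)

frame = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.75, -0.25]
unchanged = enhance_frame(frame, [0.0] * len(frame))
```

With a zero noise estimate the frame is returned unchanged, and with a noise estimate larger than every bin power the output is silence.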

Calculation of corrected value of noise level

3.18) A procedure similar to that described in Section 3.4.1 is adopted to estimate the correction factor.

Thresholding DCT coefficients

Conclusion

Results

Data Used

Estimation of Corrected Noise Level

The speech utterances are taken from the TIMIT database and then corrupted with computer-generated white noise sequences. The noise power is first estimated by Eq. (3.2); the signal power is then obtained by subtracting it from the noisy speech power. The noise and signal power values obtained in this way underestimate the SNR because of the upward bias of the estimated noise level.
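The SNR estimate described here, and the way an upward-biased noise level lowers it, can be illustrated with a short sketch. The function name and the numerical values are hypothetical; only the subtraction of the noise power from the noisy speech power comes from the text.

```python
import math

def estimate_snr_db(noisy_power, noise_power):
    """SNR in dB, with signal power = noisy speech power - noise power."""
    signal_power = noisy_power - noise_power
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must give a positive signal and noise power")
    return 10.0 * math.log10(signal_power / noise_power)

# True noise power 1.0 and signal power 10.0 give a noisy power of 11.0.
snr_true = estimate_snr_db(11.0, 1.0)
# An upward-biased noise estimate (1.2 instead of 1.0) lowers the result.
snr_biased = estimate_snr_db(11.0, 1.2)
```

The bias acts twice: it inflates the denominator and deflates the numerator, so even a modest overestimate of the noise level produces a noticeable underestimate of the SNR.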

4.1, the final SNR estimate using the corrected noise level is much more accurate than that using the biased noise level. As described in Section 3.4.1, we add auxiliary computer-generated white Gaussian noise sequences of increasing power to the given noisy signals of 22220 (s1) and 15616 (s2) samples. For convenience, we choose the auxiliary noise power so that the SNR is reduced in steps of 1 dB.
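The power of the auxiliary noise needed for a given SNR reduction follows from the fact that the powers of independent noise sequences add. The sketch below assumes that the current noise power is known or estimated; the function names and the random generator argument are hypothetical.

```python
import math
import random

def auxiliary_noise_power(noise_power, snr_drop_db):
    """Power of independent white noise that lowers the SNR by snr_drop_db.

    The total noise power must grow by the factor 10^(snr_drop_db / 10),
    so the added power is the difference to the current noise power.
    """
    return noise_power * (10.0 ** (snr_drop_db / 10.0) - 1.0)

def add_auxiliary_noise(noisy, noise_power, snr_drop_db, rng):
    """Add zero-mean white Gaussian noise of the required power to the signal."""
    std = math.sqrt(auxiliary_noise_power(noise_power, snr_drop_db))
    return [x + rng.gauss(0.0, std) for x in noisy]

# A 1 dB reduction with unit noise power needs about 0.259 units of added power.
extra = auxiliary_noise_power(1.0, 1.0)
noisier = add_auxiliary_noise([0.0] * 5, 1.0, 1.0, random.Random(0))
```

Repeating this step with increasing `snr_drop_db` would produce one point of the test curve per SNR value.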

Note that samples of the template function must be taken at the same interval. Estimated results of β (Eq. (3.9)) at the high-frequency region for different SNRs are presented in Tables 4.1 and 4.2 for two different utterances, s1 and s2, together with the corrected noise level (σ_v) using Eq.

Performance Test

Objective test
Subjective test

It is evident that the proposed method in the wavelet and DCT domains shows better enhancement performance than the previous method for almost all SNRs. The results for conventional soft thresholding in both the wavelet and DCT domains also show improved performance due to the compensation introduced in the biased noise level.

[37], (iv) soft threshold in the wavelet domain, (v) hard and soft threshold in the wavelet domain, (vi) soft threshold in the DCT domain, (vii) hard and soft threshold in the DCT domain. The SNR of the enhanced speech is shown in Table 4.4 for the proposed method along with the recent one [37].

It is also clear that the proposed method in both the wavelet and DCT domains shows better enhancement performance for this real recorded noise. In this case as well, the proposed method removes noise while introducing comparatively less distortion in the enhanced speech. In the first session, for each type of noise, listeners compare the outputs of the proposed system in the wavelet domain with those reported in [37].

In the second session, the comparison is between the outputs of the proposed system and the noisy input signal.

Table 4.2: Comparison of actual and corrected noise levels along with the correction factor, β, for the speech, "Should we chase those cowboys?", at different SNRs.

Conclusion

Summary
Suggestions for future work

The noise and signal power estimation schemes proposed here can be used for estimating SNR with very good accuracy for further processing of the noisy speech signal. The main goal of speech enhancement is to improve the perceptual aspects, that is, intelligibility and quality of speech. The intelligibility of speech can be ensured by applying the bias-compensated noise level as the threshold parameter.

The proposed combination of hard and soft thresholding has shown better performance in objective tests in terms of improved SNR. However, it may not always yield better performance in subjective evaluations, owing to the introduced distortions and artifacts known as musical noise. Therefore, a change in the thresholding technique can be explored to further improve the quality of the enhanced speech.

Therefore, a nonlinear approach incorporating speech-dependent parameters can be proposed to obtain an accurate correction factor for the biased noise level. This would estimate the noise level more accurately and make the SNR estimation scheme more reliable.

Bibliography

Pollack, “Speech communication at high noise levels: the role of a sound-operated automatic gain control system and hearing protection,” J.
Cappe, “Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor,” IEEE Trans.
Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans.
Mahmoud, “Speech enhancement using fourth-order cumulants and optimum filters in the subband domain,” Speech Communication, vol.
Zhao, “An energy-constrained signal subspace method for speech enhancement and recognition in white and colored noises,” Speech.
Drygajlo, “Combined Wiener and coherence filtering in wavelet domain for microphone array speech enhancement,” in ICASSP, Seattle, WA, pp.
Rouat, “Wavelet speech enhancement based on the Teager energy operator,” IEEE Signal Processing Letters, vol.
Le Bouquin, “Enhancement of noisy speech signals: Application to mobile radio communications,” Speech Communication, vol.