Prosody based Word Separation Algorithm - Developing A Word Separation Algorithm for Recogniti

be average of pitch during the frame and sometimes it could be totally irrelevant to pitch.

On the other hand, the frame must be long enough to contain at least one period of the voiced signal, so that the short-term autocorrelation has a peak. A short frame has another effect: the peak amplitude of the autocorrelation function decreases as the frame length decreases, which makes peak detection more difficult. The frame length is therefore important. The autocorrelation function is defined as

$$R(q) = \sum_{n=0}^{N-1-q} x(n)\,x(n+q)$$

where x(n) is a frame of N samples and q is the shift value.

As the shift value q approaches the fundamental period of the signal, the autocorrelation between the original signal and the shifted signal begins to increase and rapidly approaches its peak value at the fundamental period.

The inverse of the fundamental period is the fundamental frequency.
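The following minimal sketch illustrates both points made above: the autocorrelation peaks at the fundamental period, and the peak becomes smaller when the frame is shortened. The 16 kHz sampling rate, the 200 Hz test tone and the helper names `short_term_autocorr` and `peak_lag` are assumptions chosen for illustration; they are not taken from the original work.

```python
import numpy as np

def short_term_autocorr(frame, max_lag):
    """R(q) = sum over n of x(n) * x(n + q), for q = 0 .. max_lag."""
    n = len(frame)
    return np.array([np.dot(frame[:n - q], frame[q:]) for q in range(max_lag + 1)])

def peak_lag(r, lo, hi):
    """Lag of the largest autocorrelation value inside [lo, hi]."""
    return lo + int(np.argmax(r[lo:hi + 1]))

fs = 16000                 # assumed sampling rate in Hz
f0_true = 200.0            # assumed test tone; its period is 80 samples
period = int(fs / f0_true)

t = np.arange(int(0.040 * fs)) / fs           # one 40 ms frame (640 samples)
long_frame = np.sin(2 * np.pi * f0_true * t)
short_frame = long_frame[:int(0.010 * fs)]    # the first 10 ms only (160 samples)

r_long = short_term_autocorr(long_frame, 2 * period)
r_short = short_term_autocorr(short_frame, 2 * period)

# Search around the expected period, skipping the trivial maximum at q = 0.
lag_long = peak_lag(r_long, period // 2, 3 * period // 2)
lag_short = peak_lag(r_short, period // 2, 3 * period // 2)

f0_estimate = fs / lag_long   # inverse of the fundamental period, in Hz
```

Both frames place the peak at the same shift, but the value of R(q) at that shift is much smaller for the 10 ms frame because fewer products enter the sum.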

 Intensity

Intensity means power. Power can be derived by the following formula

……….(26)

 Duration

Duration is simply time duration.

input speech signal as shown in Figure 3.5. It plays an important role in determining where unvoiced sections of speech exist.

Figure 3.5: Input speech (panel ‘Input speech’) and the same signal after removing the DC offset (panel ‘After removing dc offset’). Axes: Amplitude vs. Time.

Because of the way human speech is produced, the speech signal is known to exhibit quasi-stationary behavior over a short period of time (20-40 ms) [7]. The choice of frame length is the typical trade-off between frequency resolution and time resolution that is required for the analysis of non-stationary signals. We select a frame length of 40 ms with 50% overlap. Framing and overlapping of the input speech are shown in Figure 3.6.
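As a minimal sketch of the framing step, the function below cuts a signal into 40 ms frames with 50% overlap. The function name `frame_signal`, the 16 kHz sampling rate in the usage lines and the choice to drop an incomplete trailing frame are assumptions made for illustration.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=40.0, overlap=0.5):
    """Split x into overlapping frames; returns an array of shape (n_frames, frame_len)."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))   # 20 ms step for 50% overlap
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    # Incomplete trailing samples are dropped (an assumption of this sketch).
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

fs = 16000                           # assumed sampling rate in Hz
signal = np.arange(fs, dtype=float)  # one second of dummy samples
frames = frame_signal(signal, fs)
```

With these settings each frame holds 640 samples and consecutive frames start 320 samples apart, so one second of signal yields 49 complete frames.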

Each frame is windowed in order to minimize signal discontinuities and spectral distortion. The Hamming window in Figure 3.7 is used to taper the signal toward zero at the beginning and end of each frame. The speech signal after applying the Hamming window is shown in Figure 3.8.

The Hamming window has the form [6]:

$$h(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$ ……….(27)
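Equation (27) can be evaluated directly, as in the sketch below. The function name `hamming_window` and the 640-sample frame length (40 ms at an assumed 16 kHz sampling rate) are illustrative choices; NumPy's built-in `np.hamming` uses the same formula and serves as a cross-check.

```python
import numpy as np

def hamming_window(N):
    """h(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for 0 <= n <= N - 1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))

N = 640                          # assumed frame length in samples
window = hamming_window(N)

frame = np.ones(N)               # dummy frame, for illustration only
windowed_frame = frame * window  # element-wise weighting of the frame
```

The window equals 0.08 at both ends rather than exactly zero, so the frame edges are strongly attenuated but not removed.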

Figure 3.6: Speech signal after framing and overlapping of input speech. Axes: Amplitude vs. Frame no.

Figure 3.7: Hamming window. Axes: Amplitude vs. Sample no.

Figure 3.8: Speech signal after applying Hamming window. Axes: Amplitude vs. Frame no.

Next, a threshold value for the noise energy is calculated. In continuous speech, the absence of speech means background silence. A perfectly silent environment is impossible because of noise, so background silence in practice means background noise. If the noise energy threshold is lower than the energy of the noise environment, noise can be falsely considered speech. If it is higher than the energy of the noise environment, some low-energy speech can be treated as noise. To capture the behavior of the noisy environment, we calculate the noise energy threshold from the first 100 ms of data. Details of the noise energy threshold selection are given in Chapter 5.
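A minimal sketch of deriving a threshold from the first 100 ms is shown below. The exact rule is described in Chapter 5 and is not reproduced here, so the use of mean-square energy, the `margin` factor and the function name `noise_energy_threshold` are all assumptions made for illustration.

```python
import numpy as np

def noise_energy_threshold(x, fs, noise_ms=100.0, margin=1.5):
    """Threshold = margin * mean-square energy of the leading noise-only samples.

    Assumes the recording starts with at least noise_ms of background noise.
    The margin value is hypothetical, not a value from the original work.
    """
    n = int(round(fs * noise_ms / 1000.0))
    lead = np.asarray(x[:n], dtype=float)
    return margin * float(np.mean(lead ** 2))

fs = 16000  # assumed sampling rate in Hz
# Dummy recording: 100 ms of low-level "noise" followed by 100 ms of "speech".
recording = np.concatenate([0.01 * np.ones(1600), np.ones(1600)])
threshold = noise_energy_threshold(recording, fs)
```

A margin near 1 keeps low-energy speech but lets more noise through, while a larger margin does the opposite, which is the trade-off described in the paragraph above.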

[Plot residue (utterance of Figure 3.10): input speech ‘Sunte pelum posta gie’, Amplitude vs. Time, with the candidate words (Sun, te, pelum, pos, ta, gie) marked.]

A speech pattern is separated from noise when its energy is greater than the noise energy.

These speech patterns can be words, subwords or noise, so we call them candidate words. Candidate words of the input speech are shown in Figures 3.9 and 3.10.

Figure 3.9: Candidate words of ‘Artho anorther mul’ detected by energy (ar, tho, anor, ther, mul). Panels: input speech, Amplitude vs. Time; energy, Amplitude vs. No. of Frame.

[Plot residue for Figures 3.11 and 3.12: pitch (Pitch value in Hz vs. No. of Frame) and autocorrelation value (Autocorrelation Value vs. No. of Frame) for ‘Koto bastota amader’ and ‘Artho anorther mul’.]

Figure 3.10: Candidate words of ‘Sunte pelum posta gie’ detected by energy.
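Candidate word detection by energy can be sketched as grouping consecutive frames whose energy exceeds the noise threshold. The function name `find_candidate_words`, the per-frame energy input and the (first frame, last frame) output format are assumptions for illustration only.

```python
import numpy as np

def find_candidate_words(frame_energy, threshold):
    """Return (first_frame, last_frame) for every run of frames above the threshold."""
    active = np.asarray(frame_energy, dtype=float) > threshold
    segments = []
    start = None
    for i, is_speech in enumerate(active):
        if is_speech and start is None:
            start = i                        # a run of speech frames begins
        elif not is_speech and start is not None:
            segments.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:                    # the signal ended inside a run
        segments.append((start, len(active) - 1))
    return segments

# Dummy per-frame energies with three bursts above a threshold of 1.0.
energies = [0.0, 0.0, 5.0, 6.0, 0.0, 0.0, 7.0, 0.0, 8.0, 9.0]
candidates = find_candidate_words(energies, threshold=1.0)
```

Each returned pair marks one candidate word, which at this stage may still be a full word, a subword or a burst of noise.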

The fundamental frequency (f₀) of each candidate word is calculated by the autocorrelation method. The resulting autocorrelation values and the frequencies derived from them are shown in Figures 3.11 and 3.12. The fundamental frequency of the human voice lies in the range 85-300 Hz, so we estimate the fundamental frequency within the slightly wider range 60-300 Hz.

Figure 3.11: Fundamental frequency of speech ‘Artho Anorther Mul’.

Figure 3.12: Fundamental frequency of speech ‘Koto bastota amader’.
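A hedged sketch of autocorrelation-based f₀ estimation restricted to the 60-300 Hz search range follows. The function name `estimate_f0`, the 16 kHz sampling rate, the mean removal and the convention of returning 0.0 for a silent frame are assumptions; the original implementation may differ.

```python
import numpy as np

def estimate_f0(frame, fs, f_min=60.0, f_max=300.0):
    """Estimate f0 in Hz from the autocorrelation peak between fs/f_max and fs/f_min."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()                 # remove any residual DC offset
    # Keep only non-negative shifts: r[q] = sum over n of x(n) * x(n + q).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(np.ceil(fs / f_max))           # shortest period considered
    lag_max = min(int(np.floor(fs / f_min)), len(frame) - 1)
    if lag_max <= lag_min or r[0] <= 0.0:
        return 0.0                               # silent or too-short frame
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / lag

fs = 16000                                       # assumed sampling rate in Hz
t = np.arange(int(0.040 * fs)) / fs              # one 40 ms frame
f0_a = estimate_f0(np.sin(2 * np.pi * 200.0 * t), fs)
f0_b = estimate_f0(np.sin(2 * np.pi * 100.0 * t), fs)
f0_silence = estimate_f0(np.zeros(len(t)), fs)
```

Restricting the search to shifts between fs/300 and fs/60 samples keeps the trivial maximum at zero shift, and any peaks outside the expected voice range, out of the decision.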

In continuous Bangla speech, the fundamental frequency typically falls from the beginning to the end of a sentence. For example, when a speaker utters ‘Ami vat khai’, f₀ on ‘Ami’ is higher than on ‘vat’ and on ‘khai’. When stress occurs within a word, however, the f₀ behavior changes. This variation also depends on the position of the word in the utterance.

For example, when stress is given, the f₀ behavior of a candidate word in the last word differs from that of candidate words in other positions. If a speaker says ‘Artho noi somman chai’, we get five candidate words: ‘ar’, ‘tho’, ‘noi’, ‘somman’, ‘chai’. Among them, the f₀ pattern of ‘ar’ and ‘tho’ is low to high. If the speaker instead says ‘Somman noi chai artho’, we again get five candidate words, but now the f₀ pattern of ‘ar’ and ‘tho’ is high to low. The last candidate word therefore behaves differently when it receives stress.
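One possible way to label an f₀ pattern as low to high or high to low is to fit a straight line to the per-frame f₀ values of a candidate word and look at the sign of its slope. This is only an illustrative assumption: the text does not state how the pattern was measured, and the function name `f0_trend`, the use of 0 for unvoiced frames and the tolerance are hypothetical.

```python
import numpy as np

def f0_trend(f0_track, tol=1e-6):
    """Classify a per-frame f0 track (Hz; 0 marks unvoiced frames) by its overall slope."""
    f0 = np.asarray(f0_track, dtype=float)
    voiced = np.nonzero(f0 > 0)[0]          # indices of frames with a pitch value
    if voiced.size < 2:
        return "flat"                       # too few points to define a direction
    slope = np.polyfit(voiced, f0[voiced], 1)[0]   # Hz per frame
    if slope > tol:
        return "low-to-high"
    if slope < -tol:
        return "high-to-low"
    return "flat"

# Dummy tracks, not measured data.
rising = f0_trend([150.0, 160.0, 0.0, 170.0, 180.0])
falling = f0_trend([200.0, 190.0, 180.0])
```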

To detect a stressed candidate word, we set the minimum f₀ to 180 Hz, the minimum f₀ difference from the previous word to 10 Hz, and the maximum f₀ for the last candidate word to 140 Hz. At 180 Hz the candidate word has received enough stress. The minimum f₀ is not set in the 150-170 Hz range because this is the normal speech range, where a normal word cannot be distinguished from a stressed candidate word. If the minimum f₀ is set above 180 Hz, some stressed candidate words are missed. We also use the fact that the f₀ difference between a candidate word and the previous candidate word is larger when the word receives stress. An f₀ difference of 2-5 Hz from the previous word does not indicate stress. A difference of 10 Hz indicates proper stress. Setting the minimum difference to 15 Hz, 20 Hz or more treats candidate words whose f₀ difference is in the 10 Hz range as separate words, which leads to false separation.

f₀ is low for the last word in Bangla speech. Because the normal pitch range is around 150 Hz, f₀ values in the 130 Hz, 120 Hz or 100 Hz range are not expected in normal speech. We therefore take 140 Hz as the maximum f₀ for the last candidate word in the speech. For candidate words in other positions, we take a minimum f₀ of 180 Hz and an f₀ difference of 10 Hz or more as the condition to merge with the previous word. Detailed experiments on these parameter selections are given in Chapter 5. Pseudocode of our prosody-based word separation algorithm is given in Figure 3.13.

1. Read the input speech.
2. Remove the DC offset from the input speech signal.
3. Frame the speech signal into 40 ms frames with 20 ms overlap.
4. Apply the Hamming window.
5. Calculate the noise energy threshold from the first 100 ms of speech.
6. Find candidate words where the speech energy is greater than the noise energy threshold.
7. for each candidate word
       calculate the fundamental frequency (f₀).
   end for
8. for each candidate word after processing
       if the candidate word's f₀ is greater than or equal to 180 Hz or its f₀ difference from the previous word is greater than or equal to 10 Hz and its frame gap is less than or equal to 20 then
           merge the candidate word with the previous word.
       end if
       8.1. for the last candidate word
           if f₀ is less than or equal to 140 Hz and the frame gap with the previous word is less than or equal to 20 then
               merge the candidate word with its previous word.
           end if
       end for
   end for
9. Return the separated words.

Figure 3.13: Pseudocode of the prosody-based word separation algorithm (PWSA).

[Plot residue from a figure whose caption is not in this section: input speech ‘Bastota chkari nie’ and the same signal after removing the DC offset, Amplitude vs. Time.]
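The merging rule in steps 8 and 8.1 of Figure 3.13 can be sketched as follows. Several details are assumptions of this sketch and are not confirmed by the text: the condition is grouped as (f₀ ≥ 180 Hz or difference ≥ 10 Hz) and gap ≤ 20 frames, the f₀ difference is taken as an absolute value against the immediately preceding candidate, the gap counts the frames between two candidates, and the last candidate is tested only with the rule of step 8.1. The names `Candidate` and `merge_candidates` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    start: int   # first frame of the candidate word
    end: int     # last frame of the candidate word
    f0: float    # fundamental frequency of the candidate word in Hz

def merge_candidates(cands, min_f0=180.0, min_diff=10.0,
                     last_max_f0=140.0, max_gap=20):
    """Merge stressed candidate words into the previous word; returns (start, end) pairs."""
    if not cands:
        return []
    words = [(cands[0].start, cands[0].end)]
    for i in range(1, len(cands)):
        prev, cur = cands[i - 1], cands[i]
        gap = cur.start - prev.end - 1          # frames between the two candidates
        if i == len(cands) - 1:
            # Step 8.1: the last candidate word is expected to have a low f0.
            pitch_ok = cur.f0 <= last_max_f0
        else:
            # Step 8: high f0, or a large f0 change relative to the previous candidate.
            pitch_ok = cur.f0 >= min_f0 or abs(cur.f0 - prev.f0) >= min_diff
        if pitch_ok and gap <= max_gap:
            words[-1] = (words[-1][0], cur.end)  # extend the previous word
        else:
            words.append((cur.start, cur.end))   # keep as a separate word
    return words

# Dummy candidates, not measured data.
example = [Candidate(0, 10, 150.0), Candidate(15, 25, 185.0),
           Candidate(60, 70, 152.0), Candidate(75, 85, 130.0)]
separated = merge_candidates(example)
```

In this dummy example the second candidate is merged because its f₀ is at least 180 Hz, the third stays separate because the gap exceeds 20 frames, and the last is merged because its f₀ is at most 140 Hz.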

Results of the prosody-based word separation algorithm are given in Chapter 5.
