be average of pitch during the frame and sometimes it could be totally irrelevant to pitch.
On the other hand, the frame should be long enough to contain at least one period of the voiced signal, so that the short-term autocorrelation has a peak. A short frame has a second drawback: the peak amplitude of the autocorrelation function decreases as the frame length decreases, which makes peak detection more difficult. The choice of frame length is therefore important. The autocorrelation function is defined as
$$R(v) = \sum_{n} x(n)\, x(n+v)$$
where $x(n)$ is the speech signal within the frame and $v$ is the shift.
Intuitively, as the shift value v approaches the fundamental period of the signal, the autocorrelation between the original signal and the shifted signal begins to increase and rapidly approaches a peak at the fundamental period.
The inverse of the fundamental period is the fundamental frequency.
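The behavior described above can be checked on a synthetic signal. The following is a minimal sketch, not the implementation used in this work: the sampling rate `fs`, the 100 Hz test tone, and the function name `autocorrelation` are assumptions chosen for illustration.

```python
import math

def autocorrelation(x, v):
    """R(v): sum of x[n] * x[n + v] over the samples where both terms exist."""
    return sum(x[n] * x[n + v] for n in range(len(x) - v))

# Assumed test signal: a 100 Hz sine sampled at 8000 Hz, so one period is 80 samples.
fs = 8000
x = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]

r_half = autocorrelation(x, 40)    # half a period: signal and shifted copy oppose each other
r_mid = autocorrelation(x, 60)     # between half and full period
r_period = autocorrelation(x, 80)  # full period: signal and shifted copy line up

# The inverse of the fundamental period (in seconds) is the fundamental frequency.
f0 = fs / 80
```

The autocorrelation grows as the shift moves from half a period toward the full period, and the frequency recovered from the 80-sample period is the 100 Hz of the test tone.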
Intensity
Intensity refers to the power of the signal, which can be derived by the following formula
……….(26)
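As an illustration of intensity as power, the sketch below uses the mean of the squared samples. This definition is a common convention and is an assumption here; the function name `average_power` is hypothetical.

```python
def average_power(x):
    """Mean squared amplitude of a sequence of samples (assumed definition of power)."""
    return sum(s * s for s in x) / len(x)

# A signal alternating between +0.5 and -0.5 has power 0.25 under this definition.
p = average_power([0.5, -0.5, 0.5, -0.5])
```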
Duration
Duration is simply the length of time that the speech segment lasts.
input speech signal as shown in Figure 3.5. It plays an important role in determining where unvoiced sections of speech exist.
Figure 3.5: Input speech (top panel) and the same signal after removing the DC offset (bottom panel); axes: Time vs. Amplitude.
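DC offset removal is usually done by subtracting the mean of the signal from every sample. The sketch below assumes that approach; the function name `remove_dc_offset` is hypothetical.

```python
def remove_dc_offset(x):
    """Subtract the mean so that the signal is centered on zero (assumed method)."""
    mean = sum(x) / len(x)
    return [s - mean for s in x]

# A signal riding on a constant offset of 2.0 is shifted back around zero.
centered = remove_dc_offset([1.0, 2.0, 3.0])
```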
Because of the way human speech is produced, the speech signal is known to exhibit quasi-stationary behavior over a short period of time (20-40 ms) [7]. The frame length is a typical trade-off between the frequency resolution and the time resolution required for the analysis of non-stationary signals. We select a frame length of 40 ms with 50% overlap. Framing and overlapping of the input speech is shown in Figure 3.6.
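The framing step can be sketched as follows. The 40 ms frame length and 50% overlap come from the text; the sampling rate `fs`, the function name `frame_signal`, and the choice to drop a trailing partial frame are assumptions.

```python
def frame_signal(x, fs, frame_ms=40, overlap=0.5):
    """Split x into frames of frame_ms milliseconds with the given fractional overlap."""
    frame_len = int(round(fs * frame_ms / 1000))
    hop = int(round(frame_len * (1 - overlap)))  # 20 ms step for 40 ms frames at 50% overlap
    # Assumption: a trailing partial frame is dropped rather than zero-padded.
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

# With an assumed fs of 8000 Hz, a frame holds 320 samples and frames start every 160 samples.
signal = list(range(1000))
frames = frame_signal(signal, fs=8000)
```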
Each individual frame is windowed in order to minimize signal discontinuities and spectral distortion. The Hamming window in Figure 3.7 is used to taper the signal toward zero at the beginning and end of each frame. The speech signal after applying the Hamming window is shown in Figure 3.8.
The Hamming window has the form [6]:
$$h(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$ ……….(27)
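Equation (27) translates directly into code. The sketch below is illustrative only; the function names `hamming` and `apply_window` are hypothetical.

```python
import math

def hamming(N):
    """Hamming window of length N: 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    """Multiply each sample of the frame by the matching window coefficient."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]

# The window equals 0.08 at both ends and reaches 1.0 in the middle of an odd-length frame.
w = hamming(5)
windowed = apply_window([1.0] * 5)
```

Note that the end coefficients are 0.08 rather than exactly zero, so the frame edges are strongly attenuated but not removed.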
Figure 3.6: Speech signal after framing and overlapping of input speech (axes: Frame no. vs. Amplitude).
Figure 3.7: Hamming window (axes: Sample no. vs. Amplitude).
Figure 3.8: Speech signal after applying Hamming window (axes: Frame no. vs. Amplitude).
Next, the threshold value for noise energy is calculated. In continuous speech, the absence of speech means background silence, but a perfectly silent environment is impossible because of noise, so background silence effectively means background noise. If the noise energy threshold is set lower than the energy of the noisy environment, noise can be falsely treated as speech; if it is set higher, some low-energy speech can be treated as noise. To capture the behavior of the noisy environment, we calculate the noise energy threshold from the first 100 ms of data. Details of the noise energy threshold selection are given in Chapter 5.
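One possible way to derive a threshold from the first 100 ms is sketched below. The exact selection rule is deferred to Chapter 5, so everything here is an assumption made for illustration: energy is taken as the mean squared amplitude, and `fs`, `scale`, and both function names are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of a block of samples (assumed energy measure)."""
    return sum(s * s for s in frame) / len(frame)

def noise_energy_threshold(x, fs, lead_ms=100, scale=1.0):
    """Energy of the first lead_ms milliseconds, multiplied by a hypothetical safety factor."""
    n = int(fs * lead_ms / 1000)
    return scale * frame_energy(x[:n])

# Assumed example: 100 ms of low-level noise followed by louder speech, at fs = 8000 Hz.
x = [0.1] * 800 + [1.0] * 800
threshold = noise_energy_threshold(x, fs=8000)
```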
[Plot panels for ‘Sunte pelum posta gie’: input speech (Time vs. Amplitude) and its candidate words (Sun, te, pelum, pos, ta, gie).]
A speech pattern is distinguished from noise when the speech energy is greater than the noise energy.
These speech patterns can be words, subwords, or noise, so we call them candidate words. Candidate words of input speech are shown in Figures 3.9 and 3.10:
Figure 3.9: Candidate words of ‘Artho anorther mul’ by energy (ar, tho, anor, ther, mul). Panels: input speech (Time vs. Amplitude) and frame energy (No. of Frame vs. Amplitude by Energy).
[Plot panels for Figures 3.11 and 3.12: autocorrelation value and pitch value in Hz against frame number, for ‘Artho anorther mul’ and ‘Koto bastota amader’.]
Figure 3.10: Candidate words of ‘Sunte pelum posta gie’ by energy.
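Candidate word detection by energy can be sketched as grouping consecutive frames whose energy exceeds the noise energy threshold. This is an illustrative sketch under that assumption; the function name `find_candidate_words` and the inclusive `(start, end)` frame convention are hypothetical.

```python
def find_candidate_words(energies, threshold):
    """Return (start_frame, end_frame) pairs, end inclusive, for runs with energy > threshold."""
    segments = []
    start = None
    for i, e in enumerate(energies):
        if e > threshold and start is None:
            start = i                        # a run of speech frames begins
        elif e <= threshold and start is not None:
            segments.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:                    # a run that lasts until the final frame
        segments.append((start, len(energies) - 1))
    return segments

# Two runs of frames rise above the threshold, so two candidate words are found.
candidates = find_candidate_words([0, 0, 5, 6, 0, 7, 7, 7, 0], threshold=1)
```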
The fundamental frequency (f0) of each candidate word is calculated by the autocorrelation method. The resulting autocorrelation values and the frequencies derived from them are shown in Figures 3.11 and 3.12. The fundamental frequency of human speech lies in the range 85-300 Hz, so we estimate the fundamental frequency within 60-300 Hz.
Figure 3.11: Fundamental frequency of speech ‘Artho Anorther Mul’.
Figure 3.12: Fundamental frequency of speech ‘Koto bastota amader’.
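Restricting the search to 60-300 Hz amounts to limiting the range of shifts examined for the autocorrelation peak. The sketch below shows one way to do this for a single frame. It is a simplified illustration, not the procedure used in this work: `fs` and the function name `estimate_f0` are assumptions, and no voiced/unvoiced decision is made.

```python
import math

def estimate_f0(frame, fs, f_min=60.0, f_max=300.0):
    """Pick the shift with the largest autocorrelation among shifts matching f_min..f_max."""
    min_lag = int(fs / f_max)                        # shortest period considered
    max_lag = min(int(fs / f_min), len(frame) - 1)   # longest period that fits in the frame

    def r(v):
        return sum(frame[n] * frame[n + v] for n in range(len(frame) - v))

    best_lag = max(range(min_lag, max_lag + 1), key=r)
    return fs / best_lag

# Assumed test frame: 40 ms of a 200 Hz sine at fs = 8000 Hz (320 samples, period 40 samples).
fs = 8000
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(320)]
f0 = estimate_f0(frame, fs)
```

Because the lower bound of 60 Hz corresponds to a period of about 16.7 ms, a 40 ms frame still contains at least two full periods at the lowest frequency searched.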
In continuous Bangla speech, the fundamental frequency pattern runs from high to low from the beginning to the end of a sentence. For example, if a speaker utters ‘Ami vat khai’, then f0 on ‘Ami’ is higher than on ‘vat’ and on ‘khai’. But whenever stress occurs within a word, the f0 behavior varies. This variation also depends on the position of the word in the speech. For example, when stress is given, the f0 behavior of a candidate word in the last word differs from that of candidate words in other positions. If a speaker says ‘Artho noi somman chai’, we get five candidate words: ‘ar’, ‘tho’, ‘noi’, ‘somman’, ‘chai’. Among them, the f0 pattern of ‘ar’ and ‘tho’ is low to high. If the speaker instead says ‘Somman noi chai artho’, we again get five candidate words, but now the f0 pattern of ‘ar’ and ‘tho’ is high to low. So we see that the last candidate word behaves differently when it receives stress.
To detect stressed candidate words, we set the minimum f0 to 180 Hz, the minimum f0 difference from the previous word to 10 Hz, and the maximum f0 for the last candidate word to 140 Hz. At 180 Hz the candidate word has received enough stress. The minimum f0 is not set in the 150-170 Hz range because this is the normal speech range, where a normal word cannot be distinguished from a stressed candidate word. If the minimum f0 is set above 180 Hz, some stressed candidate words are missed. We also use the fact that the f0 difference between a candidate word and the previous candidate word is higher whenever it receives stress. When the f0 difference from the previous word is only 2-5 Hz, the candidate word has not received stress. At 10 Hz it has received proper stress, whereas a minimum difference of 15 Hz, 20 Hz, or more treats candidate words whose f0 difference is in the 10 Hz range as separate unique words, which leads to false separation. f0 is low for the last word in Bangla speech. The 150 Hz range is the normal pitch range, so speaking in the 130 Hz, 120 Hz, or 100 Hz range is considered impossible. We therefore take a maximum f0 of 140 Hz for the last candidate word in the speech, and for candidate words in other positions we take a minimum f0 of 180 Hz and an f0 difference of 10 Hz or more as the condition for merging with the previous word. Detailed experiments on the selection of these parameters are given in Chapter 5. Pseudocode of our prosody based word separation algorithm is given in Figure 3.13.
1. read input speech.
2. remove dc offset from input speech signal.
3. frame speech signal with 40 ms length and 20 ms overlap.
4. apply Hamming window.
5. calculate noise energy threshold from first 100 ms of speech.
6. find candidate words where speech energy is greater than noise energy threshold.
7. for each candidate word
      calculate fundamental frequency (f0).
   end for
8. for each candidate word after processing
      if the candidate word's f0 is greater than or equal to 180 Hz, or its f0 difference from the previous word is greater than or equal to 10 Hz, and the frame gap is less than or equal to 20 then
         merge the candidate word with the previous word.
      end if
      8.1. for last candidate word
         if f0 is less than or equal to 140 Hz and the frame gap with the previous word is less than or equal to 20 then
            merge the candidate word with its previous word.
         end if
      end for
   end for
9. return separated words.
Figure 3.13: Pseudocode of PWSA based word separation algorithm.
The results of the prosody based word separation algorithm are given in Chapter 5.
[Plot panels: input speech ‘Bastota chkari nie’ before and after removing the DC offset; axes: Time vs. Amplitude.]
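The merging rule in steps 8 and 8.1 of Figure 3.13 can be sketched as below. The thresholds (180 Hz, 10 Hz, 140 Hz, frame gap of 20) come from the pseudocode; several details are assumptions made for this sketch and are marked in the comments: the grouping of the "or" and "and" conditions, the use of an absolute f0 difference, the frame gap measured from the end of the previous candidate to the start of the current one, and the last candidate being tested only against the 140 Hz rule. The function name `merge_candidates` and the dictionary layout are hypothetical.

```python
def merge_candidates(cands, min_f0=180.0, min_diff=10.0, last_max_f0=140.0, max_gap=20):
    """Group candidate words into words. Each candidate is a dict with
    'start' and 'end' (frame indices) and 'f0' (Hz)."""
    words = [[cands[0]]] if cands else []
    for i in range(1, len(cands)):
        prev, cur = cands[i - 1], cands[i]
        gap = cur["start"] - prev["end"]      # assumption: gap counted in frames, end to start
        if i == len(cands) - 1:
            # Step 8.1 (assumed to replace step 8 for the last candidate): low f0 signals stress.
            stressed = cur["f0"] <= last_max_f0
        else:
            # Step 8, read as (f0 >= 180 or difference >= 10); absolute difference assumed.
            stressed = cur["f0"] >= min_f0 or abs(cur["f0"] - prev["f0"]) >= min_diff
        if stressed and gap <= max_gap:
            words[-1].append(cur)             # merge with the previous word
        else:
            words.append([cur])               # start a new word
    return words

# Hypothetical candidates: the second is stressed and close to the first, the third follows
# a long gap, and the last has low f0 and is close to the third.
cands = [
    {"start": 0, "end": 10, "f0": 150.0},
    {"start": 15, "end": 25, "f0": 185.0},
    {"start": 60, "end": 70, "f0": 152.0},
    {"start": 75, "end": 85, "f0": 130.0},
]
words = merge_candidates(cands)
```

In this example the four candidates are grouped into two words of two candidates each.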