Speech was recorded directly to WAV files using a standard multimedia sound card available in a PC. The signals were digitized at a sampling rate of 8000 Hz with 16-bit resolution on a single (mono) channel.
The sentences cover different categories, such as simple, complex, compound and passive sentences. The experiment was done in two steps: first the efficiency of the Prosody based word separation algorithm was evaluated, and then the recognition performance. In total, 270 sentences were uttered by the speaker. Input was taken step by step. For example, the first 10 sentences were recorded in a room environment, and the second 10 sentences were recorded a week later. This was done to assess the efficiency of our algorithm with respect to noise and to changes in the speaker's mode. The last 135 of the 270 sentences repeat the first 135 sentences with the word positions varied. For example, if a sentence in the first 135 has the form ‘Tomar amar bangladesh’, then in the second 135 (sentences 136-270) it takes the form ‘Bangladesh tomar amar’. This was done to verify whether our Prosody based word separation algorithm is prone to error at specific word positions.
Our word separation algorithm uses the fact that stress is high on the initial word and becomes low at the end of a sentence. But what happens when a sentence has only one word? To see the performance of the word separation algorithm on isolated speech, we took 7808 isolated words. In the recognition phase, 270 sentences containing 1755 words and 7808 words of isolated speech were used. Details of our experiment are given below.
First, the DC offset is removed from the input speech signal; this plays an important role in determining where the unvoiced sections of speech are. Framing is performed with a frame length of 40 ms and an overlap of 20 ms, and a Hamming window is then applied. After preprocessing, the Prosody based word separation algorithm is applied. To determine the noise energy threshold, we use the first 100 ms of data to calculate the noise energy. For each separated word, features are extracted as Mel Frequency Cepstral Coefficients (MFCC). The computation of MFCC is based on the short-term analysis introduced in Section 2.1.5. For the MFCC processor, the filter banks are implemented in the frequency domain, as shown in Figure 4.3.
Figure 4.3: Mel-spaced filter bank from our experiment.
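The preprocessing steps described above (DC offset removal, 40 ms frames with 20 ms overlap, Hamming window) can be sketched as follows. This is a minimal Python/NumPy illustration, not the Matlab code used in the experiment; the function name `preprocess` and the zero-padding of very short inputs are assumptions made for the example.

```python
import numpy as np


def preprocess(signal, sample_rate=8000, frame_ms=40, overlap_ms=20):
    """Remove the DC offset, cut overlapping frames, apply a Hamming window.

    Returns an array of shape (n_frames, frame_len).
    """
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                  # DC offset removal

    frame_len = sample_rate * frame_ms // 1000       # 320 samples at 8 kHz
    hop = frame_len - sample_rate * overlap_ms // 1000  # 160 samples at 8 kHz

    if len(signal) < frame_len:
        # Assumption: inputs shorter than one frame are zero-padded.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop

    # Index matrix: row i selects samples [i*hop, i*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx]
    return frames * np.hamming(frame_len)            # window every frame
```

With one second of audio at 8000 Hz this yields 49 frames of 320 samples each.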
The MFCC extraction process for the Bangla word ‘Vat’ is shown in Figures 4.4-4.9:
[Plot "Input speech: Vat"; x-axis: Frame no.; y-axis: Amplitude]
Figure 4.4: After framing and overlapping the word signal ‘Vat’.
[Plot "Input speech: Vat"; x-axis: Frame no.; y-axis: Amplitude]
Figure 4.5: After applying the Hamming window to the word signal ‘Vat’.
[Plot "After performing FFT"; x-axis: Frequency (Hz); y-axis: Spectrum]
Figure 4.6: After performing the Fast Fourier transform on the word signal ‘Vat’.
[Plot "After applying mel filter bank"; x-axis: No. of mel filter; y-axis: Mel spectrum]
Figure 4.7: After applying the mel filter bank to the word signal ‘Vat’.
[Plot "After performing log"; x-axis: No. of mel filter; y-axis: log mel spectrum]
Figure 4.8: After performing log on the word signal ‘Vat’.
[Plot "MFCC coefficients of word vat"; x-axis: No. of mel filter; y-axis: MFCC coefficients value]
Figure 4.9: After performing the Inverse Fast Fourier transform on the word signal ‘Vat’.
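The chain illustrated in Figures 4.6-4.9 (spectrum, mel filter bank, logarithm, cepstral transform) can be sketched for a single windowed frame as below. This is an illustrative Python/NumPy sketch under several assumptions that the text does not state: a 512-point FFT, a power spectrum, 20 triangular filters and 12 output coefficients. It also uses a DCT-II as the final step, which is the common choice in MFCC implementations, whereas the text describes an inverse FFT.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):            # rising edge
            bank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):           # falling edge, peak at centre
            bank[i - 1, k] = (right - k) / (right - centre)
    return bank


def mfcc(frame, sample_rate=8000, n_fft=512, n_filters=20, n_coeffs=12):
    """MFCC of one windowed frame (all sizes here are assumed values)."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # power spectrum
    mel_energies = mel_filter_bank(n_filters, n_fft, sample_rate) @ spectrum
    log_mel = np.log(mel_energies + 1e-10)                   # avoid log(0)
    # DCT-II of the log mel energies (swapped in for the inverse FFT).
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
    return basis @ log_mel
```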
For training we take 16 utterances of each word; for example, 16 utterances of the word ‘vat’. The MFCC features of these 16 utterances amount to a large volume of data, not all of which is important for recognizing that particular word. Processing time is also higher if all MFCC features are used. We therefore use Vector Quantization (VQ) with the K-means algorithm to compress the data. The MFCC features of all 16 utterances of a word and the corresponding VQ codebook are shown in Figures 4.10-4.17.
Figure 4.10: Mel Frequency Cepstral Coefficients of word ‘vat’.
[Plot "VQ codebook of word vat"; x-axis: No. of Mel Filter; y-axis: VQ value]
Figure 4.11: VQ codebook of ‘vat’.
[Plot "MFCC coefficients of word ami"; x-axis: No. of Mel Filter; y-axis: MFCC coefficients value]
Figure 4.12: Mel Frequency Cepstral Coefficients of word ‘ami’.
[Plot "VQ codebook of word ami"; x-axis: No. of Mel Filter; y-axis: VQ value]
Figure 4.13: VQ codebook of ‘ami’.
[Plot "MFCC coefficients of word binamulle"; x-axis: No. of Mel Filter; y-axis: MFCC coefficients value]
Figure 4.14: Mel Frequency Cepstral Coefficients of word ‘binamulle’.
[Plot "VQ codebook of word binamulle"; x-axis: No. of Mel Filter; y-axis: VQ value]
Figure 4.15: VQ codebook of ‘binamulle’.
[Plot "MFCC coefficients of word orthonoitik"; x-axis: No. of Mel Filter; y-axis: MFCC coefficients value]
Figure 4.16: Mel Frequency Cepstral Coefficients of word ‘orthonoitik’.
[Plot "VQ codebook of word orthonoitik"; x-axis: No. of Mel Filter; y-axis: VQ value]
Figure 4.17: VQ codebook of ‘orthonoitik’.
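The VQ compression step, in which the MFCC vectors from all training utterances of a word are reduced to a small codebook by K-means, can be sketched as follows. This is an illustrative Python/NumPy version of plain Lloyd-style K-means; the codebook size, the iteration count and the random initialisation are assumptions, since the text does not give them.

```python
import numpy as np


def train_codebook(features, n_codewords=4, n_iter=20, seed=0):
    """Compress feature vectors (n_vectors, dim) into a VQ codebook.

    Returns an array of shape (n_codewords, dim).
    """
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initialise codewords from randomly chosen training vectors.
    start = rng.choice(len(features), n_codewords, replace=False)
    codebook = features[start].copy()

    for _ in range(n_iter):
        # Euclidean distance from every vector to every codeword.
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for k in range(n_codewords):
            members = features[nearest == k]
            if len(members):              # keep the old codeword if a cluster is empty
                codebook[k] = members.mean(axis=0)
    return codebook
```

For a word with 16 training utterances, `features` would be the MFCC vectors of all 16 utterances stacked row by row.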
After feature extraction and VQ compression, the database was built. For recognition, the test speech was first separated into words using our word separation algorithm, and features were then extracted. The extracted features of each separated test word were compared with the database using the Euclidean distance. If the smallest distance is less than or equal to 2, the word is recognized as the corresponding database word; otherwise it is treated as an unknown word. The whole system is implemented in Matlab 7. MFCC extraction for some test samples is shown in Figures 4.18-4.19.
[Plot "MFCC coefficients of test word binamulle"; x-axis: No. of Mel Filter; y-axis: MFCC coefficients value]
Figure 4.18: MFCC extraction of test word ‘binamulle’.
[Plot "MFCC coefficients of test word sukro"; x-axis: No. of Mel Filter; y-axis: MFCC coefficients value]
Figure 4.19: MFCC extraction of test word ‘sukro’.
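The recognition decision described earlier (nearest database word by Euclidean distance, accepted only if the distance is at most 2) can be sketched as below. The text does not say how the distance between a multi-frame test word and a codebook is aggregated; this Python sketch assumes the average distance from each test vector to its nearest codeword. The names `vq_distortion` and `recognize` are hypothetical.

```python
import numpy as np

THRESHOLD = 2.0  # acceptance threshold stated in the text


def vq_distortion(features, codebook):
    """Average distance from each feature vector to its nearest codeword.

    Assumption: this aggregation is one common choice, not confirmed by the text.
    """
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())


def recognize(features, database, threshold=THRESHOLD):
    """Return the best-matching word, or None for an unknown word.

    database maps each word to its VQ codebook (n_codewords, dim).
    """
    features = np.asarray(features, dtype=float)
    scores = {word: vq_distortion(features, np.asarray(cb, dtype=float))
              for word, cb in database.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] <= threshold else None
```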
Chapter 5
Results and Discussions
Our proposed Prosody based word separation algorithm (PWSA) is based on the fact that relative pitch information is high for the subwords of a word. We carried out experiments to examine the efficiency of our algorithm, and the results confirm this. We detect candidate words in order to calculate prosodic features. The noise energy threshold has a significant effect on the detection of candidate words. We therefore first carried out an experiment on 100 sentences containing 666 words, calculating the threshold from 50 ms, 80 ms, 100 ms, 120 ms and 150 ms of data, and selected the value that gives the best word separation accuracy. The result of this experiment is given in Table 1.
TABLE 1: NOISE ENERGY THRESHOLD FROM DIFFERENT TIME VS. WORD SEPARATION ACCURACY.
Noise energy threshold from   Wrong Separation   Right Separation   Accuracy (%)
50 ms                         72                 594                89
80 ms                         34                 632                94
100 ms                        0                  666                100
120 ms                        0                  666                100
150 ms                        0                  666                100
The experiment was carried out on 100 sentences containing 666 words.
From Table 1 we see that word separation accuracy reaches its best value once the noise energy threshold is calculated from 100 ms of data. In continuous speech, the absence of speech means background silence. But a perfectly silent environment is impossible because of noise, so background silence effectively means background noise. When the noise energy threshold is taken from 50 ms of data, the amount of data is so small that noise is wrongly considered as speech. In this case a candidate word is formed from two or three words, which leads to wrong separation.
For 80 ms the situation improves slightly but is still not satisfactory. For 100 ms and above, the noise energy threshold is calculated from enough samples to match the noisy environment, which results in high word separation accuracy. So for the rest of our experiments we calculate the noise energy threshold from 100 ms of data.
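Estimating the noise energy threshold from the leading 100 ms of a recording can be sketched as follows. The text does not give the exact formula, so this Python/NumPy sketch assumes the threshold is the mean squared amplitude of the first 100 ms multiplied by a safety `margin` (a hypothetical parameter), and that a frame counts as speech when its mean energy exceeds the threshold.

```python
import numpy as np


def noise_energy_threshold(signal, sample_rate=8000, noise_ms=100, margin=2.0):
    """Threshold from the first noise_ms of the signal (assumed noise only)."""
    n = sample_rate * noise_ms // 1000            # 800 samples at 8 kHz
    head = np.asarray(signal[:n], dtype=float)
    return margin * float(np.mean(head ** 2))


def speech_frames(signal, threshold, frame_len=320, hop=160):
    """Boolean mask: True where the frame energy exceeds the threshold."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    energies = np.array([
        np.mean(signal[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    return energies > threshold
```

Because the threshold is recomputed for each recording, it adapts to the noise level of that recording rather than relying on a fixed value.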
In continuous Bangla speech, the fundamental frequency (f0) falls from high to low from the beginning to the end of a sentence. But whenever stress occurs within a word, the f0 behavior varies. This variation also depends on the word's position in the speech: when stress is given, the f0 behavior of a candidate word in the last position differs from that of candidate words in other positions. To detect stressed candidate words, we carried out several experiments to select the minimum f0, the minimum f0 difference from the previous word, and the maximum f0 for the last candidate word. The results are shown in Tables 2-4.
TABLE 2: MINIMUM F0 VS. WORD SEPARATION ACCURACY.
Minimum f0 to detect stressed candidate words   Wrong Separation   Right Separation   Accuracy (%)
150 Hz                                          41                 625                93
160 Hz                                          21                 645                96
170 Hz                                          8                  658                98
180 Hz                                          0                  666                100
200 Hz                                          16                 650                97
220 Hz                                          26                 640                96
230 Hz                                          26                 640                96
The experiment was carried out on 100 sentences containing 666 words.
TABLE 3: MINIMUM F0 DIFFERENCE VS. WORD SEPARATION ACCURACY.
Minimum f0 difference to detect stressed candidate words   Wrong Separation   Right Separation   Accuracy (%)
2 Hz                                                       42                 624                93
5 Hz                                                       12                 654                98
10 Hz                                                      0                  666                100
15 Hz                                                      8                  658                98
20 Hz                                                      28                 638                95
25 Hz                                                      45                 621                93
The experiment was carried out on 100 sentences containing 666 words.
TABLE 4: MAXIMUM F0 VS. WORD SEPARATION ACCURACY.
Maximum f0 to detect stressed candidate words   Wrong Separation   Right Separation   Accuracy (%)
150 Hz                                          12                 654                98
140 Hz                                          0                  666                100
130 Hz                                          8                  658                98
120 Hz                                          16                 650                97
100 Hz                                          16                 650                97
The experiment was carried out on 100 sentences containing 666 words.
When a candidate word receives stress, its f0 is higher than 180 Hz. In Table 2, when the minimum f0 condition is set to 150-170 Hz, the algorithm merges the candidate word with the previous word but also wrongly merges it with other distinct words, and accuracy drops. At 180 Hz the candidate word has received enough stress and accuracy is high. At 200-230 Hz the candidate word has also received enough stress, but the condition omits those candidate words whose f0 is greater than 180 Hz and less than 200 Hz, so performance falls back to a level similar to that for 150-170 Hz. In Table 3 we use the fact that the f0 difference between a candidate word and the previous candidate word is higher when it receives stress. When the minimum f0 difference from the previous word is 2-5 Hz, candidate words that have not received stress also satisfy the condition. At 10 Hz the candidate word has received proper stress and accuracy is high. At 15 Hz, 20 Hz and above, all candidate words whose f0 difference is in the 10 Hz range are treated as separate distinct words, so performance is poor.
f0 is low for the last word in Bangla speech. In Table 4 we see that maximum accuracy is obtained at 140 Hz. 130 Hz, 120 Hz and 100 Hz are very low f0 values for a female voice, and hence no subword is merged. So we take 140 Hz as the maximum f0 for the last candidate word in the speech; for candidate words in other positions, we require a minimum f0 of 180 Hz and an f0 difference of 10 Hz or more to merge with the previous word.
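The three selected thresholds can be combined into a merge decision, sketched below in Python. This is only one reading of the stated criteria: the function name and argument layout are hypothetical, and the full PWSA also relies on energy and frame gap conditions that are not reproduced here, so this rule alone should not be expected to reproduce every row of the separation results.

```python
MIN_F0_HZ = 180.0        # minimum f0 of a stressed candidate (Table 2)
MIN_F0_RISE_HZ = 10.0    # minimum f0 rise over the previous candidate (Table 3)
LAST_MAX_F0_HZ = 140.0   # maximum f0 for the last candidate (Table 4)


def merges_with_previous(f0, prev_f0, is_last):
    """Decide whether a candidate word is merged with the previous one.

    Assumption: the last candidate merges when its f0 is at most 140 Hz;
    any other candidate merges when its f0 is at least 180 Hz and at
    least 10 Hz higher than that of the previous candidate.
    """
    if is_last:
        return f0 <= LAST_MAX_F0_HZ
    return f0 >= MIN_F0_HZ and (f0 - prev_f0) >= MIN_F0_RISE_HZ
```

For example, a candidate at 226 Hz following one at 180 Hz is merged, while a candidate at 155 Hz following one at 226 Hz is kept as a separate word.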
Word separation of several Bangla sentences, together with pitch information and merging criteria, is shown in Table 5.
TABLE 5: WORD SEPARATION OF SOME BANGLA SPEECH USING EXISTING [55] AND PROPOSED PROSODY BASED WORD SEPARATION ALGORITHM (PWSA).
Candidate Word   Separation Result Using Existing Algorithm   Pitch   Need to Change Boundary According to PWSA   Separation Result Using PWSA

Sentence: Artho noi somman chai
Ar               W                                            180     Yes                                         C
Tho              W                                            226     Yes                                         C
Noi              C                                            155     No                                          C
Somman           C                                            173     No                                          C
Chai             C                                            149     No                                          C

Sentence: Japta batase patagulo jhore pore niche
Jap              W                                            171     Yes                                         C
Ta               W                                            195     Yes                                         C
Ba               W                                            153     Yes                                         C
Ta               W                                            177     Yes                                         C
Se               W                                            204     Yes                                         C
Patagulo         C                                            169     No                                          C
Jhore            C                                            164     No                                          C
Pore             C                                            166     No                                          C
Ni               W                                            178     Yes                                         C
Che              W                                            136     Yes                                         C

Sentence: Kosto korle Kesto mile
Kos              W                                            176     Yes                                         C
To               W                                            207     Yes                                         C
Korle            C                                            187     No                                          C
Kes              W                                            187     Yes                                         C
To               W                                            191     Yes                                         C
Mile             C                                            164     No                                          C

Here W stands for wrong and C for correct separation.
We now examine the efficiency of our Prosody based word separation algorithm. The experimental results show that our proposed algorithm, which uses stress information together with energy, performs considerably better than the existing algorithm [55]. Other stress information, such as intensity and duration, does not contain useful information for word separation of Bangla speech. We implemented the existing algorithm [55], and the results are given in Table 6. They show that 598 words are not separated correctly by the existing algorithm, while the number is 28 for our new algorithm. Performance improves by 8 percentage points when the database in [55] is used, and by 32 percentage points for the new database of 1755 words. The success rate is also higher than the rate of 81% previously reported in [56].
Moreover, when only isolated speech is used, our algorithm separates all 7808 words correctly.
TABLE 6: COMPARISON OF PROPOSED ALGORITHM VS. EXISTING ALGORITHM [55].
Database   Algorithm   Correct Separation   Wrong Separation   Performance
Existing   Existing    12                   1                  92%
Existing   PWSA        13                   0                  100%
New        Existing    1157                 598                66%
New        PWSA        1727                 28                 98%
Here the existing database consists of 13 words and the new database consists of 1755 words.
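The performance figures in Table 6, and the 8-point and 32-point improvements quoted from them, follow directly from the separation counts. A short Python check, with the helper name `performance` chosen for illustration:

```python
def performance(correct, wrong):
    """Correct separations as a percentage, rounded to the nearest integer."""
    return round(100 * correct / (correct + wrong))


# (database, algorithm) -> (correct separations, wrong separations), from Table 6
counts = {
    ("existing", "existing"): (12, 1),
    ("existing", "pwsa"): (13, 0),
    ("new", "existing"): (1157, 598),
    ("new", "pwsa"): (1727, 28),
}
rates = {key: performance(*value) for key, value in counts.items()}

# Improvement of PWSA over the existing algorithm, in percentage points.
gain_existing_db = rates[("existing", "pwsa")] - rates[("existing", "existing")]
gain_new_db = rates[("new", "pwsa")] - rates[("new", "existing")]
```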
The results in Table 6 show that pitch, or stress, has a great impact in the Bangla language. Frequency follows a high rising pattern within the subwords of a word and a low rising pattern for the last word in a sentence. We correlated this stress information with energy to develop a word separation algorithm. The results show that PWSA outperforms the existing algorithm.
The threshold value for word recognition is set to 2. The experiment used to select this threshold is given in Table 7. Experimental results on Bangla continuous speech and isolated speech are given in Table 8.
TABLE 7: THRESHOLD VALUE VS. WORD RECOGNITION ACCURACY
Threshold value   Correctly Recognized (%)   Not Recognized (%)   Falsely Recognized (%)
1                 4                          96                   0
1.5               75                         25                   0
2                 97                         0                    3
2.5               91                         0                    9
It is very unlikely that the Euclidean distance between a test word’s MFCC features and a known word’s VQ-compressed MFCC features falls within the 0-0.9 range. Hence recognition accuracy for threshold value 1 in Table 7 is low. For threshold value 1.5, accuracy improves but is not satisfactory. For threshold value 2.5, the boundary is loose enough that false recognition is high. Threshold value 2 gives the maximum accuracy, and hence we select it to decide between recognition and non-recognition.
TABLE 8: RECOGNITION RESULT OF BANGLA SPEECH
Speech Recognition Types   Total Words   Wrong Recognition   Right Recognition   Success Rate (%)
Continuous                 1755          312                 1443                82
Isolated                   7808          459                 7349                94
Some words in Bangla speech are not separated correctly by our algorithm, for example ‘ponchash’, ‘metate’, ‘achen’, ‘mochre’, ‘buske’, ‘choto’ and ‘uttom’. Frequency is also high among their subwords, but because of our frame gap condition they are not separated correctly. This reflects how our algorithm works: we use the frame gap to separate words, not to merge them. If the speaker utters ‘ami’ as ‘a..mi’, we treat it as separate words. In isolated speech there is only a high-to-low stress pattern, and the special condition for the last word in our PWSA satisfies the behavior of isolated speech. In the recognition phase, some words are falsely identified, for example ‘vat’ as ‘vabi’, ‘class’ as ‘glass’ and ‘sat’ as ‘at’. Overall, 82% recognition accuracy was achieved for continuous Bangla speech and 94% for isolated Bangla speech.
A coarticulation effect exists between words in continuous speech, which makes recognition performance poorer than for isolated Bangla speech. In isolated speech the coarticulation effect is negligible, hence recognition performance is better. Prosody has a great impact in the Bangla language, which is the motivation behind our word separation algorithm, and the practical experiments on Bangla continuous speech support this. The continuous speech recognition system in [12] requires a large number of training samples and achieves a performance of 80%, whereas our speech recognizer achieves 82% with fewer training samples. We also examined whether our Prosody based word separation algorithm can separate English speech. The result of English speech separation is given in Table 9.
TABLE 9: WORD SEPARATION OF CONTINUOUS ENGLISH SPEECH BY PWSA
Total Words   Wrong Separation   Right Separation   Success Rate (%)
283           78                 205                72
English is a stress-timed language, whereas Bangla is a syllable-timed language, so prosodic information also differs between Bangla and English. In an isolated Bangla word, the prosodic pattern is ‘high to low’. For example, stress decreases from start to end in the word ‘jana’. In an isolated English word, however, the prosodic pattern varies from word to word. For example, the stress pattern is ‘low to high to low’ on the word ‘generic’ and ‘high to low to high’ on the word ‘generate’. In continuous speech, the prosodic difference between Bangla and English is even more marked. When a subword exists in continuous Bangla speech, its stress information is higher than that of the previous word, and our PWSA merges the subword with the previous word. In continuous English speech, stress information follows different patterns when a subword exists, and our PWSA achieves an accuracy of only 72%. For large speech corpora and native English speakers, performance is worse.
Noise always degrades the performance of a speech recognition system, because it cannot be completely eliminated and its behavior varies with time and environment. For this reason, we take 100 ms of data from the speech to calculate the noise energy threshold. Because it does not rely on any fixed value for the noise energy threshold, our system can work efficiently in a noisy environment.
Chapter 6
Conclusions
In the age of information technology, human-computer interaction has gained importance. Speech is the primary mode of communication among human beings, and with the recent development of speech technology people expect to exchange natural dialogue with computers. Hence research in continuous speech recognition is crucial.
Although Bangla is the fifth most widely spoken language in the world [17], research on Bangla speech recognition is not yet satisfactory. In this context, our research provides a strong base for further research on speech recognition.
In this thesis, the work has focused on establishing an efficient word separation algorithm for a speech recognition system. The effort has been divided into two main parts:
Prosody based Word separation Algorithm
Speech Recognition.