Developing A Word Separation Algorithm for Recognition of Continuous Bangla Speech

Abdus Sattar, Assistant Professor, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka. His great interest and vast experience in the field of Bangla speech recognition has influenced the author to realize the thesis.

The Process of Speech Production and Perception in Human Beings

This tube serves as an acoustic transmission system for sounds created within the vocal tract. Voiceless sounds are produced by creating a constriction somewhere in the vocal tract tube and forcing air through that constriction, thereby creating a turbulent flow of air, which acts as a random noise excitation of the vocal tract tube.

Figure 1.1: Cross Sectional view of Human speech organ [2].

Speech Recognition

Classification of Speech Recognition

Isolated speech recognition is relatively easy because word boundaries are detectable and words tend to be pronounced cleanly. The most complex of these systems is the continuous speech, speaker-independent system with a large vocabulary, and the simplest is the isolated word, speaker-dependent system with a small vocabulary.

Variabilities of Speech

The speech produced in noise is different from the speech produced in a quiet environment due to the change in speech production in an attempt to communicate more effectively across a noisy environment. A lot of effort has been put into telephone speech recognition because of its wide range of applications.

Figure 1.2: Distortion in adverse environments.

Algorithms in Speech Recognition

Zero-Crossing and Energy-Based Speech Recognition
Feature-Dependent Speech Recognition
Template-Based Speech Recognition
Knowledge-Based Speech Recognition
Stochastic Speech Recognition Systems
Connectionist Speech Recognition Systems

Template-based speech recognition systems have a database of prototype speech patterns (templates) that define the vocabulary. Knowledge-based speech recognition systems [26] incorporate expert knowledge, for example derived from spectrograms, linguistics or phonetics.

Bangla Speech Recognition

Connectionist speech recognition is based on artificial neural networks that use learning strategies to organize and optimize a network of processing elements (neurons). Thus, the speech knowledge or constraints used for speech recognition are distributed among many simple processing elements [14].

Objectives of this Thesis

Outline of this Thesis

An efficient word segmentation algorithm is needed to improve the performance of a continuous Bangla speech recognition system. After the words are segmented from continuous Bangla speech and their features are extracted, a speech recognizer is also needed to recognize the unknown word.

Feature Extraction Methods

Power Spectral
Short term Cepstrum
Linear Prediction
Linear Prediction Cepstral Coefficients
MFCC

The power spectrum of a speech signal describes the frequency content of the signal over time. The first step towards calculating the power spectrum of the speech signal is to perform a Discrete Fourier Transform (DFT).
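As an illustration of the DFT step described above, the following is a minimal sketch rather than the implementation used in this thesis. The function name `power_spectrum`, the Hamming window, the 8 kHz sampling rate and the 200 Hz test tone are all assumptions chosen for the example.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum |X[k]|^2 / N for bins k = 0..N/2, via a direct DFT."""
    n = len(frame)
    # Hamming window (an assumption here) to reduce spectral leakage.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        # DFT definition: X[k] = sum_i x[i] * exp(-j*2*pi*k*i/N)
        xk = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                 for i in range(n))
        spectrum.append(abs(xk) ** 2 / n)
    return spectrum


# Synthetic 20 ms frame: a 200 Hz sine sampled at 8 kHz (illustrative values).
fs = 8000
frame = [math.sin(2 * math.pi * 200 * i / fs) for i in range(160)]
ps = power_spectrum(frame)
peak_bin = max(range(len(ps)), key=lambda k: ps[k])
peak_hz = peak_bin * fs / len(frame)  # bin index converted to Hz
```

With 160 samples at 8 kHz each bin is 50 Hz wide, so the strongest bin lands on the frequency of the test tone. A practical system would use an FFT routine instead of this direct O(N^2) sum.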

Figure 2.1: Speech waveform of a sentence.

Pitch detection Methods

Zero Crossings
Cepstrum
LPC based
Autocorrelation
Harmonic Product Spectrum (HPS)

To obtain an estimate of the fundamental frequency from the cepstrum, we look for a peak in the quefrency region (shown in Figure 2.5) that corresponds to the characteristic fundamental frequencies of speech. Figure 2.7 shows the waveform of the speech signal and the autocorrelation functions for each frame of the speech signal.
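The cepstral peak search can be sketched as follows. This is an idealized example under stated assumptions, not the thesis code: the names `real_cepstrum` and `cepstral_pitch`, the 60-400 Hz search range, the magnitude floor, and the synthetic impulse-train input are all introduced here for illustration.

```python
import cmath
import math


def real_cepstrum(frame, floor=1e-8):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    n = len(frame)
    log_mag = []
    for k in range(n):
        xk = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                 for i in range(n))
        # The floor avoids log(0) in empty bins.
        log_mag.append(math.log(abs(xk) + floor))
    # For a real input the log magnitude is real and even, so the
    # inverse DFT reduces to a cosine sum.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n)
                for k in range(n)) / n
            for q in range(n)]


def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate f0 from the largest cepstral peak in the pitch quefrency range."""
    ceps = real_cepstrum(frame)
    q_lo = int(fs / fmax)   # shortest pitch period considered, in samples
    q_hi = int(fs / fmin)   # longest pitch period considered, in samples
    q_peak = max(range(q_lo, q_hi + 1), key=lambda q: ceps[q])
    return fs / q_peak


# Idealized voiced excitation: one impulse every 80 samples at 8 kHz (100 Hz).
fs = 8000
frame = [1.0 if i % 80 == 0 else 0.0 for i in range(400)]
f0 = cepstral_pitch(frame, fs)
```

The quefrency of the peak is a pitch period in samples, so dividing the sampling rate by it gives the fundamental frequency. Real speech frames would normally be windowed first, and the peak is far less clean than for this impulse train.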

Figure 2.5: Frequency and quefrency of speech waveform.

Classification Methods

Dynamic Time Warping
Hidden Markov Models
Neural Networks
Vector Quantization

K-means Algorithm

The Dynamic Time Warping algorithm is one of the oldest and most important algorithms in speech recognition [26]. Its time and space complexity is linear in the duration of the speech sample and the size of the vocabulary. Local backpointers must specify the reference word model (shown in Figure 2.9) of the preceding point so that the optimal word order can be recovered by tracing backwards from the last point of the word W with the best final score.
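The basic alignment idea behind Dynamic Time Warping can be sketched for the isolated-word case. This is a minimal sketch, not the one-stage continuous variant with backpointers described above: the function names, the scalar features, the three allowed local steps and the toy templates are all assumptions made for the example.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Local path constraint: vertical, horizontal or diagonal step.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def recognize(sample, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))


# Hypothetical one-dimensional "feature" templates for two words.
templates = {"word_a": [1, 2, 3, 2, 1], "word_b": [3, 3, 1, 1, 3]}
best = recognize([1, 1, 2, 3, 3, 2, 1], templates)
```

Here the sample is a time-stretched copy of `word_a`, so warping absorbs the difference in duration and the distance to that template is zero. Real systems compare feature vectors (for example MFCC frames) with a vector distance instead of scalars.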

In static classification, the neural network sees the entire input speech at once and makes a single decision. A speech recognition system must be able to estimate probability distributions of the calculated feature vectors.

Figure 2.9: Dynamic Time Warping. (a) alignment path (b) local path constraints.

Distance Measure

The centroids are then recomputed for the new clusters and the algorithm is repeated until the vectors no longer switch clusters or else the centroids no longer change.
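The assign-and-recompute loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: the squared Euclidean distance, the choice of the first k vectors as initial centroids, and the toy two-dimensional data are all made up for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (centroids, labels)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # no vector switched clusters
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return centroids, labels


points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
centroids, labels = kmeans(points, 2)
```

The loop stops as soon as an assignment pass leaves every label unchanged, which matches the stopping rule stated in the text.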

Related works on English Speech Recognition

These elements have the following patterns: weak-strong-weak, strong-weak-strong, strong-weak-strong-weak and strong-weak-weak respectively. For example, pitch movements increase more on strong than on weak syllables, but these were not significantly greater for strong vs. In the duration case, the result showed that the speaker considers boundaries before weak syllables more in need of marking than boundaries before strong syllables.

They examined how word boundaries are produced in deliberately clear speech, and the result showed that more lengthening and greater increases in intensity were applied in clear speech for weak syllables than for strong ones. The mean fundamental frequency also increased to a greater extent in weak syllables than in strong ones; pitch movement, however, increased to a greater extent in strong syllables than in weak ones.

Related works on Bangla Speech Recognition

Digital signal processing (DSP) techniques were used to extract the features of the speech signal. Initially, a word separation algorithm was used to separate words from continuous Bangla speech by comparing the energy and zero-crossing rate of the speech with those of noise. Thus, an efficient word separation algorithm is needed to correctly separate words from continuous speech.

In this context, a new word segmentation algorithm, called the prosody-based word segmentation algorithm, is developed in this research. It performs better than the algorithm in [55] and improves the performance of the speech recognition system. MFCC is used to extract features from the separated words, as it performs better than other spectrum analysis techniques [18].

Why we need a word separation algorithm

Bangla has a large vocabulary of more than 100,000 words [17], so the search space is large. Our ultimate goal is to increase the robustness of speech recognition by shrinking the search space during the decoding process. And if words can be separated from continuous speech, techniques that can be applied to isolated speech recognition can also be applied to continuous speech recognition.

Challenges of word separation from continuous speech

Problems of existing word separation algorithm in Bangla

In this research, experiments were conducted for 13 different Bangla words. Of these, 12 words were separated correctly and 1 word, 'Pakhi', could not be separated correctly. In Figure 3.1, the first section shows the input speech 'Artho noi somman chai' and the second section shows the speech segmented using algorithm [55]. The result shows that algorithm [55] tends to split single words into subwords, which makes recognition performance poor.

But a sub-word (ni, che) from a real speech 'niche' does not show the same phenomena as in the database. The length of the subwords is also very short, so they do not contain important features that can be recognized, making overall recognition poor.

Figure 3.1: Word separation of ‘Artho noi somman chai’ using algorithm [55].

Proposed method for word separation

Prosody and prosodic features

It is not only important to know what was said, but also how it was said. A typical feature of prosody is that it is associated with stretches of speech larger than one segment (phoneme), i.e.

Classification of language using prosodic features

In a stress-timed language, syllables can last different amounts of time, but there is considered to be a fairly constant amount of time (on average) between successive stressed syllables.

Prosodic parameters

On the other hand, the frame length should be long enough to contain at least one period of the voice signal, so that the short-term autocorrelation has a peak. The other effect of a short frame is that the peak amplitude of the autocorrelation function decreases as the frame length is reduced, which makes the peak detection process more difficult. As the shift value q approaches the fundamental period of the signal, the autocorrelation between the original signal and the shifted signal begins to increase and rapidly approaches its peak value at the fundamental period.
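The behavior of the short-term autocorrelation around the fundamental period can be sketched as follows. This is a minimal sketch, not the thesis implementation: the function name `autocorr_pitch`, the 60-400 Hz search range, the 30 ms frame and the 200 Hz test tone are assumptions chosen for the example. The shift is called `q` to match the text.

```python
import math


def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate f0 from the largest short-term autocorrelation peak."""
    n = len(frame)
    q_lo = int(fs / fmax)               # smallest shift considered
    q_hi = min(int(fs / fmin), n - 1)   # largest shift considered

    def r(q):
        # Short-term autocorrelation at shift q; fewer products are
        # available as q grows, which lowers the peaks of short frames.
        return sum(frame[i] * frame[i + q] for i in range(n - q))

    q_peak = max(range(q_lo, q_hi + 1), key=r)
    return fs / q_peak


# Synthetic 30 ms frame: 200 Hz sine at 8 kHz, i.e. a 40-sample period.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * i / fs) for i in range(240)]
f0 = autocorr_pitch(frame, fs)
```

The frame spans six periods of the test tone, so the peak at a shift of 40 samples stands out clearly. A frame shorter than one period would contain no such peak, which is the frame-length constraint stated above.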

Prosody based Word Separation Algorithm

For example, the f0 behavior of a candidate word in the final position differs from that of candidate words in other positions when stress is applied. To detect a stressed candidate word, we require a minimum f0 of 180 Hz and a minimum f0 difference of 10 Hz from the previous word; for the last candidate word, the maximum f0 is 140 Hz. The minimum f0 is not set in the range 150-170 Hz because this is the normal speech range, and a value in it cannot distinguish a normal word from a stressed candidate word.

We also use the fact that the f0 difference between a candidate word and the previous candidate word is higher whenever the candidate word is stressed. So we take a maximum f0 of 140 Hz for the last candidate word in the speech. For candidate words in other positions, we take a minimum f0 of 180 Hz and an f0 difference of 10 Hz or more as the condition for merging with the previous word.
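One possible reading of these thresholds is sketched below. It is an interpretation, not the thesis pseudocode: the function names, the use of a signed f0 difference, and the assumption that the final-position rule depends only on the 140 Hz ceiling are all choices made for this example. The f0 values in the usage example are hypothetical, apart from the 182 Hz value that the text reports for the candidate word 'to'.

```python
F0_MIN_STRESSED = 180.0   # Hz, minimum f0 of a stressed non-final candidate
F0_MIN_DIFF = 10.0        # Hz, minimum f0 rise over the previous candidate
F0_MAX_FINAL = 140.0      # Hz, maximum f0 of a stressed final candidate


def is_stressed(f0, prev_f0, is_last):
    """Decide whether a candidate word counts as stressed (assumed reading)."""
    if is_last:
        return f0 <= F0_MAX_FINAL
    return f0 >= F0_MIN_STRESSED and (f0 - prev_f0) >= F0_MIN_DIFF


def merge_candidates(candidates):
    """Merge stressed candidates into the preceding word.

    candidates: list of (label, f0_hz) pairs in time order.
    """
    words = []
    prev_f0 = None
    for idx, (label, f0) in enumerate(candidates):
        is_last = idx == len(candidates) - 1
        if words and is_stressed(f0, prev_f0, is_last):
            words[-1] += label      # stressed subword: attach to previous word
        else:
            words.append(label)     # otherwise it starts a new word
        prev_f0 = f0
    return words


# Hypothetical f0 values, except 182 Hz for 'to'; 'w3' is a placeholder word.
words = merge_candidates([("Bas", 165.0), ("to", 182.0), ("w3", 150.0)])
```

In this example 'to' exceeds 180 Hz and rises 17 Hz over the assumed f0 of 'Bas', so the two are merged, while the placeholder final candidate stays separate because its f0 is above 140 Hz.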

Figure 3.5: Input speech after removing DC offset.

Word Separation of Bangla speech by using PWSA

Then the noise energy threshold is calculated, and candidate words are selected by comparing the speech energy with the noise energy. According to our prosody-based word separation algorithm, the candidate word 'to' is stressed because it has an f0 of 182 Hz, and it is thus merged with the previous word 'Bas'.
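The candidate selection step can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis procedure: the function names, the 10 ms frames with 50% overlap, and the rule "threshold = factor x mean noise-frame energy" are all invented for illustration. The leading 100 ms noise window follows the setting that the experiments in this thesis report as giving the best separation accuracy.

```python
def short_time_energy(signal, frame_len, hop):
    """Remove the DC offset, then return the energy of each overlapping frame."""
    mean = sum(signal) / len(signal)      # DC offset estimate
    x = [v - mean for v in signal]
    return [sum(v * v for v in x[s:s + frame_len])
            for s in range(0, len(x) - frame_len + 1, hop)]


def candidate_segments(signal, fs, frame_ms=10, noise_ms=100, factor=3.0):
    """Return (first_frame, last_frame) index pairs of candidate words."""
    frame_len = fs * frame_ms // 1000
    hop = frame_len // 2                  # 50% overlap (assumption)
    energies = short_time_energy(signal, frame_len, hop)
    # Frames lying entirely inside the leading noise-only stretch.
    n_noise = (fs * noise_ms // 1000 - frame_len) // hop + 1
    # Assumed threshold rule: a multiple of the mean noise-frame energy.
    threshold = factor * sum(energies[:n_noise]) / n_noise
    segments, start = [], None
    for i, e in enumerate(energies):
        if e > threshold and start is None:
            start = i                     # energy rises above noise
        elif e <= threshold and start is not None:
            segments.append((start, i - 1))
            start = None                  # energy falls back to noise level
    if start is not None:
        segments.append((start, len(energies) - 1))
    return segments


# Synthetic signal at 8 kHz: quiet, loud, quiet, loud, quiet (100 ms each).
fs = 8000
levels = [0.01, 1.0, 0.01, 1.0, 0.01]
signal = [lev * (1 if i % 2 == 0 else -1) for lev in levels for i in range(800)]
segments = candidate_segments(signal, fs)
```

Each returned pair marks one candidate word in frame indices. A short gap inside a real word would split it into two candidates here, which is the case the prosody-based merging step is meant to repair.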

Figure 3.15: Speech signal after applying framing and overlapping.

Pseudocode for Training Phase

Pseudocode for Testing Phase

Experimental Details

This is done to verify whether our prosody-based word separation algorithm is error-prone at specific word positions. To see the performance of the word separation algorithm on isolated speech, we take 7808 isolated speech samples. Our proposed Prosody-Based Word Separation Algorithm (PWSA) is based on the fact that relative pitch information is high for subwords of a word.

From the table above, we can see that the word separation accuracy is best if we calculate the noise energy threshold from the first 100 ms of data. Word separation results for multiple Bangla speech samples, along with pitch information and clustering criteria, are shown in Table 5. We now conduct an experiment to verify the effectiveness of our prosody-based word separation algorithm.

We also check whether or not our prosody-based word segmentation algorithm can segment English speech.

Figure 4.3: Mel-spaced filter bank from our experiment.

Prosody based Word Separation Algorithm

Although Bangla is the fifth most widely spoken language [17] in the world, research on speech recognition in Bangla is not satisfactory. In this thesis, the work focuses on establishing an efficient word separation algorithm for speech recognition systems. Crucially, continuous speech does not provide a clear signature, such as a pause or silence, between words.

But simply comparing speech energy and zero-crossing rates to noise will not yield a correct result, because the human speech production mechanism can leave a certain gap within a word, and the existing algorithm [55] incorrectly detects this gap as a gap between words.

Speech Recognition

Prosodic features are of great importance in language, and their behavior varies from language to language (English [60], Hungarian, Finnish [58], Bengali [17], etc.). With our prosody-based word separation algorithm, word separation performance is also dramatically increased.

Future Research

[6] Yuk D., Robust Speech Recognition Using Neural Networks and Hidden Markov Models-Adaptations Using Non-linear Transformations, Ph.
Khan M., Isolated and Continuous Bangla Speech Recognition: Implementation, Performance and Application Perspective, Proceedings of the Seventh International Symposium on Natural Language Processing, SNLP, December, Pattaya, Thailand.
[35] Wang, C., Seneff, S., Lexical Stress Modeling for Improved Speech Recognition of Spontaneous Telephone Speech in the JUPITER Domain, In Proceedings of the 7th European Conference on Speech Communication and Technology, pp Aalborg, Denmark, September 2001.
[40] Werner, S., Keller, E., Fundamentals of Speech Synthesis and Speech Recognition: Basic Concepts, State of the Art, and Future Challenges, Wiley Blackwell, 1994.
[45] Rabiner L.R., A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, In Proceedings of the IEEE, pp. 257-286, vol.