
Related works on English Speech Recognition

Figure 2.13: Clusters in the k-means algorithm [30].

The k-means algorithm uses least-squares partitioning to divide the input vectors into k initial sets. It then calculates the mean point, or centroid, of each set and constructs a new partition by associating each point with its closest centroid. The centroids are then recalculated for the new clusters, and the algorithm repeats until the vectors no longer switch clusters or, equivalently, the centroids no longer change.
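The assignment/update loop described above can be sketched as follows. This is a minimal, self-contained illustration using squared Euclidean distance, not the exact implementation of [30]:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch over a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k initial centroids drawn from the data
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids unchanged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups converge to their means:
centroids, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```

With this toy input the two centroids settle at (0, 0.5) and (10, 10.5) regardless of which points are sampled as the initial centroids.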

Writers put white spaces between words, but speakers normally do not separate words with silences. The speech signal is a continuous stream of words with no reliable cues to word boundaries. This absence of pauses or any other deterministic cue to word boundaries makes it hard to segment fluent speech into discrete words. Yet, in order to understand speech, humans must be able to extract individual words effortlessly from the speech stream and use them to access lexical information. Prosody is an important feature for detecting word boundaries, and we used this feature to separate words from continuous Bangla speech. We now give a short review of word segmentation strategies for continuous English speech that use prosodic features.

English listeners segment speech at the onset of strong (but not at the onset of weak) syllables [32]. The Metrical Segmentation Strategy (MSS) in [32] holds that listeners tend to segment input before strong syllables in English. More precisely, when a full vowel is heard, a boundary is hypothesized at the beginning of the syllable of which the vowel is the nucleus, and it is that syllable that the listener uses in an attempt at lexical access. Thus the listener segments speech by making reference to an acoustic/phonetic cue. However, strong and weak syllables are not solely specified by lexical stress marking.
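The MSS heuristic can be illustrated with a small sketch: hypothesize a boundary at the onset of every strong syllable. The syllable labels in the example are hypothetical inputs chosen for illustration, not output of a real stress detector:

```python
def mss_segment(syllables):
    """Metrical Segmentation Strategy sketch: start a new chunk at the
    onset of every strong (full-vowel) syllable.
    `syllables` is a list of (syllable, is_strong) pairs."""
    chunks, current = [], []
    for syl, strong in syllables:
        if strong and current:   # strong onset -> close the previous chunk
            chunks.append(current)
            current = []
        current.append(syl)
    if current:
        chunks.append(current)
    return chunks

# A strong syllable triggers a hypothesized boundary before it:
sylls = [("seg", True), ("ment", False), ("the", False), ("speech", True)]
```

Here `mss_segment(sylls)` yields `[["seg", "ment", "the"], ["speech"]]`: a boundary is posited only before the strong syllable "speech".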

The positions of strong and weak syllables are not fixed within a word; they follow different patterns, e.g. strong-weak-weak, weak-strong-weak, strong-weak-strong, and strong-weak-strong-weak. For example, consider the words generic, generate, generation, and generous. These items have the patterns weak-strong-weak, strong-weak-strong, strong-weak-strong-weak, and strong-weak-weak respectively.

A set of labeling criteria was developed in [33] to label prosodic events in clear or continuous speech, and a scheme was proposed whereby that information can be transcribed in a machine-readable format. The authors chose to annotate prosody in a syllabic domain synchronized with a phonemic segmentation.

In English, strong syllables tend to be higher in pitch than their weak neighbors [34].

They showed that such lexical boundary information coincides with syllabic boundary information for spoken English, and they were able to detect one-third of the word boundaries in a speech corpus using these two approaches combined. The phonotactics of a language is the set of sequential constraints that operate on contiguous items at the segmental level. This sequential information can be described as a set of co-occurrence restrictions that hold within syllables. Thus, while the sequence /#pr/ is permissible in English, /#mp/ is not, where the symbol # represents a syllable boundary. Likewise, in syllable-final position, /mp#/ is valid, but /pr#/ is not.
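Such co-occurrence restrictions can be encoded as lookup tables of legal onsets and codas. The tiny sets below are an illustrative subset of English phonotactics, not a complete inventory:

```python
# Illustrative subsets only; a real system would use full phonotactic tables.
ONSETS = {"", "p", "r", "pr", "pl", "st", "str"}  # legal syllable-initial clusters
CODAS  = {"", "m", "p", "mp", "nt", "st"}         # legal syllable-final clusters

def boundary_splits(cluster):
    """Return every way a syllable boundary could legally fall inside an
    intervocalic consonant cluster: a legal coda followed by a legal onset."""
    return [(cluster[:i], cluster[i:])
            for i in range(len(cluster) + 1)
            if cluster[:i] in CODAS and cluster[i:] in ONSETS]
```

This reproduces the constraints stated above: `("", "pr")` is among the splits of `"pr"` (a boundary may precede /pr/), while `("", "mp")` is not a split of `"mp"`; conversely, `("mp", "")` is legal but `("pr", "")` is not.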

In [35], one research issue concerning English stress was addressed: how well the stress property of a vowel can be determined from the acoustics of spontaneous speech.

To answer this question, the authors studied the correlation of various pitch, energy, and duration measurements with lexical stress on a large corpus of spontaneous utterances, and identified the most informative stress features through classification experiments.

The best set of prosodic features for stress classification consists of the integral of energy, raw duration, pitch slope, and the average probability of voicing. The highest accuracy was achieved by combining MFCC features with this best prosodic feature set.
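The four features named above can be computed from frame-level measurements. The sketch below assumes hypothetical (energy, F0, voicing-probability) frames at a 10 ms step, and estimates the pitch slope with an ordinary least-squares line fit over the voiced frames:

```python
def stress_features(frames, frame_dur=0.01):
    """Hedged sketch of the prosodic feature set for one vowel.
    `frames` is a list of (energy, f0_hz, prob_voicing) tuples,
    one per analysis frame of length `frame_dur` seconds."""
    n = len(frames)
    duration = n * frame_dur                               # raw duration (s)
    energy_integral = sum(e for e, _, _ in frames) * frame_dur
    avg_voicing = sum(v for _, _, v in frames) / n
    # Pitch slope (Hz/s): least-squares line fit over voiced frames (f0 > 0).
    voiced = [(i * frame_dur, f0)
              for i, (_, f0, _) in enumerate(frames) if f0 > 0]
    if len(voiced) >= 2:
        xs, ys = zip(*voiced)
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        slope = (sum((x - mx) * (y - my) for x, y in voiced)
                 / sum((x - mx) ** 2 for x in xs))
    else:
        slope = 0.0
    return {"energy_integral": energy_integral, "duration": duration,
            "pitch_slope": slope, "avg_voicing": avg_voicing}
```

For a fully voiced vowel whose F0 rises from 100 Hz to 120 Hz over three 10 ms frames, the fitted pitch slope is 1000 Hz/s and the raw duration is 0.03 s.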

Automatic detection of prosodic prominence in continuous English speech was studied in [36, 37]. Using syllable nucleus duration, energy, overall nucleus energy, and pitch movement, a prominence function was proposed that correctly classified 80.61% of syllables as either prominent or non-prominent.
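One simple way to build such a prominence function (not the exact formulation of [36, 37]) is to z-score each cue across the utterance and threshold the summed scores:

```python
def classify_prominence(syllables, threshold=0.0):
    """Illustrative prominence function: z-score each cue (e.g. nucleus
    duration, nucleus energy, pitch movement) across the utterance and
    mark a syllable prominent if its summed z-scores exceed `threshold`.
    `syllables` is a list of equal-length cue tuples, one per syllable."""
    def zscores(vals):
        m = sum(vals) / len(vals)
        sd = (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
        return [(v - m) / sd for v in vals]
    # Transpose to cue columns, z-score each, transpose back to syllables.
    per_syllable = zip(*[zscores(col) for col in zip(*syllables)])
    return [sum(z) > threshold for z in per_syllable]

# Hypothetical (duration, energy, pitch-movement) cues for three syllables:
flags = classify_prominence([(0.20, 5.0, 40.0),
                             (0.10, 2.0, 10.0),
                             (0.22, 6.0, 50.0)])
```

With these toy cues the first and third syllables score above the utterance mean on every cue and are flagged prominent; the second is not.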

In [38], the distinction between boundaries preceding strong vs. weak syllables in clear speech was investigated, and the authors argued that there is a trade-off between different sources of information. For example, pitch movement increases more on strong syllables than on weak ones, but the increases were not significantly greater for strong vs. weak word-initial syllables. In the durational case, the results showed that speakers consider boundaries before weak syllables more in need of marking than boundaries before strong syllables.

Prosodic analyses of the syllable were presented in [39]. The authors examined how word boundaries are produced in deliberately clear speech; the results showed that in clear speech more lengthening and greater increases in intensity were applied to weak syllables than to strong ones. Mean fundamental frequency was also increased to a greater extent on weak syllables than on strong ones; pitch movement, however, increased to a greater extent on strong syllables than on weak ones.

Thus, in most English research, prosodic cues are applied to separate syllables. First, the syllables are labeled by hand as strong or weak; then prosodic parameters are measured on the strong and weak syllables and their effects examined.

Individual syllables are often clearly discernible from the acoustic signal, but word grouping is nearly indistinguishable in English [40]. Hidden Markov models [42, 44, 45], dynamic time warping [41, 43], and artificial neural networks [43] are used for continuous speech recognition. Among these, the hidden Markov model is the most popular technique for continuous speech recognition as well as word recognition in English.
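As a concrete example of one of these techniques, the textbook dynamic time warping distance between two feature sequences can be written as a small dynamic program (shown here over scalar frames for brevity):

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping: minimum cumulative frame distance
    between sequences `a` and `b` under a monotonic alignment."""
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = cost of aligning the first i frames of `a` with
    # the first j frames of `b`.
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch `a`
                                 D[i][j - 1],      # stretch `b`
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]
```

Because the alignment may stretch either sequence, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0: the repeated frame is absorbed at no cost, which is exactly why DTW tolerates variation in speaking rate.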

A common question in automatic speech processing concerns the universality of prosodic phenomena: can a prosodic speech component for language X handle prosodic phenomena in language Y? Indeed, many prosodic phenomena operate in a nearly identical manner across languages. Other phenomena, however, pattern quite differently from language to language. For example, research on French suggests that listeners of that language segment speech at syllable boundaries [46], while Japanese listeners segment speech at mora boundaries [47]. In Norwegian [38], for instance, stressed syllables tend to have low pitch. Different languages involve different combinations of acoustic parameters, or different weightings among them. Based on prosodic parameters, languages are divided into three types: stress-timed, mora-timed, and syllable-timed.

English is a stress-timed language, whereas Bangla is a syllable-timed language.

Prosodic characteristics of Bangla are completely different from those of English [39, 48], so a completely separate model is needed to analyze prosodic parameters in Bangla in order to separate words from real Bangla speech using prosodic features.
