Abstract
Sudipta Acharya December 2018
It has been observed that most languages have two forms: a written form and an oral (colloquial), or spoken, form. This thesis addresses prosody modeling for spoken Bangla. Speech is an essential medium of human communication and can also serve as a medium of interaction between humans and machines. Spoken by nearly 260 million people, Bangla is one of the most widely spoken languages in the world by number of speakers.
Bangla is the national language of Bangladesh and is also the official language of West Bengal, a state in India. The language varies considerably across dialects. Our work focuses on the dialect known as Standard Colloquial Bengali (SCB), which is spoken in and around Kolkata.
Language transfer can take place at both the segmental (more commonly known as phonetic) level and the supra-segmental (i.e., prosodic) level. Segmental features are necessary for the correct pronunciation of a word. Supra-segmental features help the listener to locate stressed words and phrase boundaries and to perceive the speaker's attitudes and emotions. This thesis concentrates on different ways to bring naturalness to synthesized speech, such as that produced by a TTS (Text-To-Speech) system.
There are three main chapters in this thesis: duration modeling, sentence medial pause modeling, and intonation modeling.
Bangla is a bound stress language, i.e., the first syllable of every word is stressed. Stress manifests itself in the supra-segmental attributes. The analysis shows that the first syllable is not lengthened in every syntactic word; rather, lengthening occurs for every prosodic word.
Similarly, phrase-final lengthening does not occur for all syntactic phrases; it occurs only for prosodic phrases. Duration modeling is carried out in an HTS (HMM-based Speech Synthesis System) based Bangla speech synthesis system using prosodic structure, which improves the duration model. The average deviation in duration from the original speech is 41 ms (standard deviation 25 ms) for the syntactic-structure-based method, whereas it is 22 ms (standard deviation 12 ms) for the prosodic-structure-based method.
The naturalness of TTS output can be improved significantly by inserting a pause at the right place and for the required duration. Pauses between words, and their durations, are vital elements in the utterance of a sentence. A linear model is developed to determine the probability of pause occurrence and the pause duration within a sentence (sentence-medial pauses) for Bangla sentences uttered at different speech rates: slow, medium and fast. For Bangla read-out speech at each speech rate, the sentence-medial pause duration and the probability of pause occurrence depend linearly on phrase type, phrase length (l) and the distance (d) between the current phrase and its dependent counterpart. A detailed analysis was carried out, and a linear regression model was developed for each phrase type to predict pause occurrence and pause duration in sentence-medial position. The results show that the model's average accuracy in predicting pause occurrence across all speech rates (fast, medium and slow) is 82%, and pause duration is predicted to within 100 ms in 79% of cases.
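As a minimal sketch of what a per-phrase-type linear model of this kind could look like, the Python snippet below predicts a pause probability and a pause duration from phrase length l and distance d. The phrase-type names, the coefficient values, and the units are hypothetical placeholders chosen for illustration; they are not the fitted values from the thesis, and a separate set of coefficients per speech rate is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearModel:
    """y = intercept + length_coef * l + distance_coef * d."""
    intercept: float
    length_coef: float
    distance_coef: float

    def predict(self, length: float, distance: float) -> float:
        return (self.intercept
                + self.length_coef * length
                + self.distance_coef * distance)


# Hypothetical coefficients for illustration only (not the thesis's values).
# Each phrase type maps to (pause-probability model, pause-duration model).
MODELS = {
    "noun_phrase": (LinearModel(0.10, 0.05, 0.10), LinearModel(50.0, 10.0, 20.0)),
    "verb_phrase": (LinearModel(0.20, 0.04, 0.08), LinearModel(80.0, 12.0, 15.0)),
}


def predict_pause(phrase_type: str, length: float, distance: float):
    """Return (probability of a pause, pause duration in ms)."""
    prob_model, dur_model = MODELS[phrase_type]
    # A linear model can leave [0, 1], so the probability is clamped.
    probability = min(1.0, max(0.0, prob_model.predict(length, distance)))
    # A negative duration is meaningless, so it is floored at zero.
    duration_ms = max(0.0, dur_model.predict(length, distance))
    return probability, duration_ms


if __name__ == "__main__":
    print(predict_pause("noun_phrase", length=3, distance=2))
```

Clamping is a design choice of this sketch: a plain linear regression does not guarantee a valid probability or a non-negative duration.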
Proper modeling of the fundamental frequency (F0) is of primary importance for generating highly natural synthetic speech. An approach to F0 contour modeling for Bangla using the parameters of the Fujisaki (superposition) model has been studied. A method is proposed to identify prosodic word boundaries, prosodic phrase boundaries and sentence types in continuous Bangla speech from the F0 contour.
Analysis and synthesis of the F0 contour have been carried out using the command-response model, and a set of rules has been generated to predict the F0 contour from written text. The method is validated through quantitative and qualitative assessments. The average mean square error between the original and generated F0 contours is 12 Hz. In the perceptual test, the average mean opinion score (MOS) is 3.2 on a 5-point scale.
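For readers unfamiliar with the command-response (Fujisaki) model, the following Python sketch shows its standard formulation: log F0 is a baseline value plus the superposed responses to phrase commands (impulses) and accent commands (rectangular steps). The default values alpha = 3.0, beta = 20.0 and gamma = 0.9 are commonly quoted settings and are assumptions here, as are the function names; the thesis's actual parameter values and rule set are not reproduced.

```python
import math


def phrase_response(t: float, alpha: float = 3.0) -> float:
    """Impulse response of the phrase control mechanism, Gp(t)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t: float, beta: float = 20.0, gamma: float = 0.9) -> float:
    """Step response of the accent control mechanism, Ga(t), capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds,
                alpha=3.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    fb          : baseline frequency in Hz
    phrase_cmds : iterable of (onset_time, magnitude)
    accent_cmds : iterable of (onset_time, offset_time, amplitude)
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)


if __name__ == "__main__":
    # Illustrative commands only: one phrase command and one accent command.
    phrases = [(0.0, 0.5)]
    accents = [(0.3, 0.6, 0.4)]
    for i in range(0, 11):
        t = i * 0.1
        print(f"{t:.1f} s  {fujisaki_f0(t, 100.0, phrases, accents):.1f} Hz")
```

Because the components add in the log domain, each command scales F0 multiplicatively, which is why the model is described as a superposition model.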
Keywords: Sentence-medial pause; Occurrence probability of pause; Prosodic features; Spoken language; Written language; Prosodic word; Prosodic phrase; Fundamental frequency; Prosodic word boundary; Prosodic phrase boundary; Bangla language; Segmental feature; Supra-segmental feature; Text-to-Speech (TTS).