
Challenges in Children’s Speech Recognition

In the document PDF gyan.iitg.ernet.in (Pages 31-34)

All ASR systems comprise three components: an acoustic model that captures the acoustic properties of the speech sound units (phonemes), a pronunciation dictionary that maps each word to its constituent phonemes, and a language model that captures the grammar or syntax of the language being spoken. Thus, the properties of a speech signal can be broadly categorized into two types: the acoustic properties and the linguistic properties. Both are governed by the physiology of the speaker. The control of the articulators affects the linguistic properties of speech, while the physical dimensions of the vocal tract and vocal folds directly influence the acoustics of the speech signal.

Children’s speech differs greatly from adults’ speech in both its acoustic and its linguistic correlates, which makes it particularly difficult for ASR systems to handle [6, 27, 28].

The acoustic and linguistic differences include differences in pitch, formant frequencies, average phone duration, speaking rate, glottal flow parameters, pronunciation and grammar [29–32]. Children are reported to have different means and variances of the acoustic correlates of speech than adults [27, 33, 34]. For example, the area of the F1-F2 formant ellipses is larger for children than for adults for most vowel phonemes [35]. Another issue in dealing with children’s speech is the difficulty of modeling its constantly changing characteristics, since all children undergo rapid development, and at varying rates. As children grow, their speech production organs also develop, so their anatomy and physiology keep changing significantly. As a result, children’s speech has higher inter- and intra-speaker acoustic variability than adults’ speech [28, 30].
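The F1-F2 ellipse comparison can be made concrete with a short sketch. It assumes the ellipse is the covariance ellipse of the (F1, F2) measurements of one vowel, which may differ from the exact construction used in [35]. The function name `formant_ellipse_area` and all numeric values are hypothetical and purely illustrative; they are not measured data.

```python
import numpy as np

def formant_ellipse_area(f1_hz, f2_hz, n_std=1.0):
    """Area (in Hz^2) of the n_std covariance ellipse of (F1, F2) points."""
    points = np.vstack([f1_hz, f2_hz])   # shape (2, n_tokens)
    cov = np.cov(points)                 # 2x2 sample covariance matrix
    # The semi-axes are n_std * sqrt(eigenvalues of cov); the product of the
    # square-rooted eigenvalues equals sqrt(det(cov)).
    return float(np.pi * n_std ** 2 * np.sqrt(np.linalg.det(cov)))

# Made-up tokens of a single vowel (illustrative only, NOT measurements).
adult_f1 = np.array([700.0, 720.0, 690.0, 710.0, 705.0])
adult_f2 = np.array([1200.0, 1230.0, 1190.0, 1215.0, 1180.0])

# Hypothetical child tokens: higher mean formants and twice the spread.
child_f1 = 1.5 * adult_f1.mean() + 2.0 * (adult_f1 - adult_f1.mean())
child_f2 = 1.5 * adult_f2.mean() + 2.0 * (adult_f2 - adult_f2.mean())

adult_area = formant_ellipse_area(adult_f1, adult_f2)
child_area = formant_ellipse_area(child_f1, child_f2)
print(f"child/adult ellipse area ratio: {child_area / adult_area:.2f}")
```

Because the area scales with the product of the spreads along both axes, doubling the spread in F1 and F2 quadruples the ellipse area in this toy setup, regardless of where the ellipse is centred.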

1.2.1 Acoustic Correlates of Children’s Speech

Several researchers have examined what makes children’s speech different from adults’ speech. In a key study by Eguchi and Hirsh [35], later summarized by Kent in [36], the age-dependent changes in the formant frequencies and the fundamental frequency of child speakers aged three to thirteen were reported. Children have formants located at high frequencies that increase non-linearly [27, 29, 30, 37, 38]. They also have high pitch frequencies, which cause large spacing between the pitch harmonics [27, 29, 30, 37]. These high formant and pitch frequencies are attributed to their inherently shorter vocal tracts and vocal folds, respectively. For instance, 5-year old children have been reported to have formant frequencies 50% higher than those of adult males [27]. Whereas 3-4 formants are present in adults’ speech, only 2-3 formants are present in children’s speech within the 0.3-3.2 kHz frequency range [28]. The higher formants of children’s speech fall outside the narrow transmission bandwidth (3.4-4.0 kHz) of telephone channels, resulting in a loss of spectral information in ASR of narrowband children’s speech. Phoneme durations and average sentence durations have also been observed to be longer for children than for adults, which in turn lowers their speaking rate [27, 29, 30, 39]. The average vowel duration of 5-year old children is reported by Lee et al. in [29] to be 36% longer than that of 12-year old children. An analysis of consonant-vowel transitions in adults’ and children’s speech in [40] notes that children’s speech has a shorter transition duration and a larger spectral difference between the consonant and the vowel of a consonant-vowel pair than adults’ speech. Studies have found a systematic decrease with age in the mean and variance of acoustic correlates such as formants, pitch and duration, with values reaching adult ranges at around 13 or 14 years [27, 30]. A specific result that is especially relevant for speech modeling is the scaling behavior of formant frequencies with respect to age.
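The link between vocal tract length, formant positions and harmonic spacing can be sketched with the textbook uniform-tube approximation, in which a tube closed at the glottis and open at the lips resonates at F_n = (2n - 1)c / (4L). This is a simplified model and not the analysis used in the cited studies. The speed of sound, the 17.5 cm adult tract length and the two pitch values are assumptions chosen for illustration; the child length is simply the one implied by the 50% higher formants reported in [27].

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed approximate value for warm, moist air

def tube_formants(tract_length_cm, n_formants=4):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * tract_length_cm)
            for n in range(1, n_formants + 1)]

def in_band(freqs_hz, lo_hz=300.0, hi_hz=3200.0):
    """Keep only the frequencies that fall inside [lo_hz, hi_hz]."""
    return [f for f in freqs_hz if lo_hz <= f <= hi_hz]

def harmonics_in_band(f0_hz, lo_hz=300.0, hi_hz=3200.0):
    """Count the pitch harmonics k * f0_hz that fall inside [lo_hz, hi_hz]."""
    count, k = 0, 1
    while k * f0_hz <= hi_hz:
        if k * f0_hz >= lo_hz:
            count += 1
        k += 1
    return count

ADULT_TRACT_CM = 17.5                   # assumed adult-male tract length
CHILD_TRACT_CM = ADULT_TRACT_CM / 1.5   # length implied by 50% higher formants
ADULT_F0_HZ, CHILD_F0_HZ = 120.0, 300.0  # hypothetical pitch values

adult_formants = tube_formants(ADULT_TRACT_CM)
child_formants = tube_formants(CHILD_TRACT_CM)

print("adult formants in 0.3-3.2 kHz:", len(in_band(adult_formants)))
print("child formants in 0.3-3.2 kHz:", len(in_band(child_formants)))
print("adult harmonics in band:", harmonics_in_band(ADULT_F0_HZ))
print("child harmonics in band:", harmonics_in_band(CHILD_F0_HZ))
```

Under these assumptions the shorter tube places fewer formants inside the 0.3-3.2 kHz band, and the higher pitch samples the spectral envelope with far fewer harmonics, which is one reason the envelope of children’s speech is harder to estimate.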

The formant frequencies, the pitch and the speaking rate are perceptually relevant sources of acoustic mismatch. On the other hand, differences in the voice source parameters, which arise from physiological differences between speakers, affect the source spectrum [41]. The glottal flow of a speech signal is characterized mainly by three parameters: the open quotient (OQ), the return quotient (RQ) and the speed quotient (SQ). The glottal flow parameters, which control the shape of the glottal pulse, affect the long-term overall shape (spectral tilt) of the speech power spectrum [41, 42]. For instance, the OQ mainly affects the levels of the lower part of the source spectrum, so that a large OQ typically means a higher level of the lowest few harmonics. The RQ affects the steepness of the source spectrum, and a large RQ corresponds to greater attenuation of the higher frequencies. The OQ, RQ and SQ have also been observed to differ between child and adult speakers [33, 34]. These differences in the glottal flow parameters are apparent as differences in the breathiness of speech [43–45]. In [28], young children’s speech has been reported to have 60% more breathiness than adults’ speech.

Adding to all of this is the increased intra-speaker spectral and temporal variability of children’s speech [6, 30, 40]. Spectral variation between repetitions of the same vowel by similar speakers was found to be 30% higher for 5-year old children than for 12-year old children [27]. The mean Euclidean distance between the cepstral coefficients of the first and second halves of vowels is 20% greater for 5-year old children than for 12-year old children. Increased intra-speaker spectral and temporal variability gives rise to greater overlap of the phonemic classes, which makes the pattern classification problem even more difficult. It has been reported that vowel classification accuracy is about 60% for 5-year old children, against about 90% for adults [27].
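The within-vowel distance measure mentioned above can be sketched as follows. This is a minimal sketch that assumes each vowel token is given as a matrix of cepstral frames and that the two halves are compared through their mean cepstral vectors; the exact procedure in the cited work may differ. The function names and the toy frames are hypothetical.

```python
import numpy as np

def half_vowel_cepstral_distance(cepstra):
    """Euclidean distance between the mean cepstral vectors of the first and
    second halves of one vowel token, given as (n_frames, n_coeffs)."""
    cepstra = np.asarray(cepstra, dtype=float)
    if cepstra.ndim != 2 or cepstra.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two frames")
    mid = cepstra.shape[0] // 2
    first_half_mean = cepstra[:mid].mean(axis=0)
    second_half_mean = cepstra[mid:].mean(axis=0)
    return float(np.linalg.norm(first_half_mean - second_half_mean))

def mean_half_vowel_distance(tokens):
    """Average the half-vowel distance over a collection of vowel tokens."""
    return float(np.mean([half_vowel_cepstral_distance(t) for t in tokens]))

# Toy tokens with two cepstral coefficients per frame (illustrative only).
drifting_token = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
stable_token = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

print(half_vowel_cepstral_distance(drifting_token))
print(mean_half_vowel_distance([drifting_token, stable_token]))
```

A larger average distance indicates that the spectrum drifts more within a single vowel, which is the kind of temporal variability reported to be higher for younger children.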

1.2.2 Linguistic Correlates of Children’s Speech

Children exhibit less precise control of the articulators, especially at the age of 5-6 years. Sometimes they have not yet learnt how to articulate specific phonemes [32]. As a result, children’s speech has many problems such as disfluencies, false starts and extraneous speech [6, 7, 46]. The frequency of mispronunciations for 8-10 year old children was noted in [46] to be almost twice as high as that for 11-14 year old children. Children have a smaller vocabulary than adults, and so they use fewer words per utterance to convey the same message. Children may not yet have fully acquired the correct inflectional forms of certain words, especially words that are exceptions to common rules. As a result, their sentences sometimes contain spurious words that are not found in adults’ speech.

On exploring children’s read speech and spontaneous speech, a similar trend with age was noted in the linguistic variability of both. Children’s spontaneous speech was also found to be less grammatical than adults’ speech. However, adult-level values were found to be reached 1-2 years earlier for read speech [47]. Linguistic variability in children’s speech reduces with age. Older children use simpler linguistic constructs and shorter utterances to convey the intended message. Disfluencies decrease with age, and children reach adult skill levels at around 12-13 years of age (somewhat earlier for boys than girls) [47]. Thus, children’s ability to use language efficiently to convey a message improves with age.
