
ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE)


ISSN (PRINT): 2320 – 8945, Volume 3, Issue 1, 2015

“Basic Acoustic Features Analysis of Vowels and C-V-C of Indian English Language”

1 Anil Kumar C., Research Scholar, Jain University, Bangalore

2 Shiva Prasad K. M., Assistant Professor, R L J I T, India

3 M. B. Manjunatha, Principal, AIT, Tumkur

4 Kodanda Ramaiah G. N., Professor & HOD, Dept. of ECE, Kuppam Engineering College, Kuppam. Email: [email protected]

Abstract- Speech is one of the most information-laden signals: a single utterance can convey more information than hundreds of simpler signals. Speech sounds have a rich, multi-layered temporal variation that conveys words, emotions, intention, expression, intonation, accent, speaker identity, gender, age, speaking style and the speaker's state of health. We use language every day without devoting much thought to the process, yet articulator movement, i.e. the movement of the lips, tongue and other organs, is among the subtlest and most adept of actions performed by human beings. A major concern in any language is the identification, separation and processing of its vowels and consonants. In this system we store voice test samples from several speakers of different ages and recording conditions, covering different vowels, consonants and their combinations. From the spectrogram we extract the formant frequencies and pitch of the different vowels and consonants using the PRAAT software.

Key words: Formant frequency, Spectrogram, vowels, PRAAT.

I. INTRODUCTION.

Speech is the most convenient means of human communication. Speech is not merely a sequence of steady-state sounds changing abruptly from one to another, nor just a signal that can be ignored once heard; it is a unique signal that conveys both linguistic and non-linguistic information and carries messages drawn from multiple levels of knowledge sources.

Speech signals are examples of information-bearing signals that evolve as functions of a single independent variable, time. Speech is not only an informative signal but also a complex wave, the acoustic output of the speaker's effort, and it serves to communicate from a speaker to one or more listeners. The typical sound produced when a phoneme is articulated is called a phone.

Most languages have 20-40 phonemes, which provide an alphabet of sounds for describing the different words of the language. Words are composed of phoneme sequences called syllables. Speech analysis is also referred to as feature extraction of speech. [1]

Speech sounds are sensations of air pressure vibrations produced by air exhaled from the lungs and modulated

and shaped by the vibrations of the glottal cords and the resonance of the vocal tract as the air is pushed out through the lips and nose.

Feature extraction refers to the conversion of the speech waveform into some type of parametric representation for further speech processing in various applications.

The parameters obtained by analysis tools are important acoustic cues. Speech analysis helps in reducing the raw speech database (speech corpus) to a manageable quantity and in extracting, from the raw data, information that is crucial for interpreting the speech signal. The detailed analysis of speech features and their relationship to human perception is a challenging task in speech processing. [3]

Speech is an immensely information-rich signal exploiting frequency-modulated, amplitude-modulated and time-modulated carriers (e.g. resonance movements, harmonics and noise, pitch intonation, power, duration) to convey information about words, speaker identity, accent, expression, speaking style and emotion. All this information is conveyed primarily within the traditional telephone bandwidth of 4 kHz; the speech energy above 4 kHz mostly contributes to audio quality and sensation. [2]

II. SCOPE OF THE WORK.

Globally there exist many different languages, and a language can be identified from speech. Speaker recognition can therefore be achieved, but identification from speech is complex and is performed through various digital speech processing techniques. Speech is random in nature, generated by the excitation of many organs in the human body, and exhibits simple and complex resonant frequencies called formants. Measuring the formant frequencies gives an idea of the utterances in speech and also of the voice quality. Speech consists of words comprising vowels, semivowels and consonants, which may be either voiced or unvoiced sounds. The major concern in any language is the processing of vowels, consonants and their combinations, such as V-C-V, C-V-C and C-C-V-C words. We implement the spectrogram technique and extract the formant frequencies and pitch from speech samples recorded and analysed using the PRAAT software. [3]


2.2. OBJECTIVE OF THE STUDY.

The objective is to generate spectrograms for different vowels, consonants and their combinations, uttered by speakers of various age groups of the same gender (female). We use the PRAAT software to analyse the formant frequencies and pitch of each sample, extracting the formant frequencies and pitch from the spectrogram. The audio frequency and other necessary information in the signal are examined from the speaker's utterance. In our system we store different samples of each speaker with different vowels, consonants and V-C-V words as a speech database or speech corpus. [4]

2.3 DATABASE CREATION

The test samples considered here are from five female subjects, Fs1 to Fs5: Fs1, Fs2 and Fs3 are adult speakers of age 22 years, Fs4 is of age 60 years, and Fs5 is a child of around 10 years. Each speaker uttered the vowels a, e, i, o, u and words containing combinations of vowels: bat, bird, boot. The samples were recorded using a table-top microphone (i-ball model M27, sensitivity -58 dB ±3 dB, frequency response 100 Hz to 16 kHz) at a sampling frequency of 22,100 Hz.
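For illustration only, the following minimal Python sketch shows one way such a recorded corpus could be loaded for analysis; the directory layout and file-naming scheme (e.g. Fs1_a.wav) are assumptions and are not specified in this work.

    # Minimal sketch: load the recorded test samples into a dictionary keyed by
    # (speaker, utterance). File names such as "Fs1_a.wav" are hypothetical.
    import os
    from scipy.io import wavfile

    SPEAKERS = ["Fs1", "Fs2", "Fs3", "Fs4", "Fs5"]
    UTTERANCES = ["a", "e", "i", "o", "u", "bat", "bird", "boot"]

    def load_corpus(root="corpus"):
        corpus = {}
        for spk in SPEAKERS:
            for utt in UTTERANCES:
                path = os.path.join(root, spk + "_" + utt + ".wav")
                if os.path.exists(path):
                    fs, samples = wavfile.read(path)  # fs expected near 22,100 Hz
                    corpus[(spk, utt)] = (fs, samples)
        return corpus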

2.4. HUMAN SPEECH PRODUCTION MECHANISM.

An outline of the anatomy of the human speech production system is shown in Fig. 1. It is divided into three major regions: the larynx, the vocal tract and the respiratory system. The combined voice production mechanism produces the variety of vibrations and spectral-temporal compositions that form the different speech sounds. The act of speech production begins with exhaling previously inhaled air from the lungs. Without subsequent modulation, this air would sound like random noise carrying no information. The information is first modulated onto the passing air by the manner and frequency of the closing and opening of the glottal folds.

Speech is a convolved signal, and simple linear filtering is not well suited to the analysis of convolved signals such as speech. [5]

The signal-processing front end that extracts the feature set is an important stage in any speech processing application, such as speech analysis or speech synthesis.

Fig 1. Anatomy of speech production.

2.5. PROBLEM STATEMENT.

Measuring acoustic features gives an idea of the formants (resonant peaks) and the pitch (fundamental frequency F0) in speech, and also of the voice quality.

Speech consists of words comprising vowels, semivowels and consonants, which may be either voiced or unvoiced sounds. The problem considered here is to analyse different words with respect to their formant frequencies and pitch. [1]
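For illustration, one standard way of estimating formant frequencies from a short speech frame is the LPC root-finding method sketched below in Python; this is given only as a reference implementation of the general technique and is not the PRAAT-based procedure used in this work (the model order of 10 and the 0.97 pre-emphasis coefficient are assumptions).

    import numpy as np
    from scipy.signal import lfilter

    def lpc_formants(frame, fs, order=10):
        # Pre-emphasise and window the frame
        x = lfilter([1.0, -0.97], [1.0], frame * np.hamming(len(frame)))
        # Autocorrelation method: solve the normal equations for the LPC coefficients
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R, r[1:order + 1])
        poly = np.concatenate(([1.0], -a))            # A(z) = 1 - sum a_k z^-k
        # Roots of A(z) in the upper half-plane correspond to formant candidates
        roots = [z for z in np.roots(poly) if np.imag(z) > 0]
        freqs = sorted(np.angle(roots) * fs / (2.0 * np.pi))
        return [f for f in freqs if f > 90.0]         # discard near-DC roots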

2.6. LIST OF VOWELS & CONSONANTS.

The phonemes of American English are classified into four categories: vowels, diphthongs, semivowels and consonants, and these can be further divided into several types. Vowels are divided into front, mid and back vowels. Diphthongs are not divided further. Consonants are divided into nasals, stops, fricatives, whisper and affricates. Semivowels are divided into liquids and glides. This is shown in Fig. 2.

III. BASIC ACOUSTIC FEATURES:

In human speech, pitch, formant frequency and intensity form the basic features. A brief description follows:

Pitch: Also known as the repetition rate of the opening and closing of the vocal folds, pitch is the fundamental frequency of vibration of the vocal folds, denoted f0. Pitch is used for the classification of voiced/unvoiced speech: for voiced components the pitch is low, and for unvoiced components it is absent. A commonly used pitch measuring method is autocorrelation, sketched below. [2]
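A minimal Python sketch of the autocorrelation pitch estimator mentioned above is given here; the 75-500 Hz search range and the 0.3 voicing threshold are assumptions, not values taken from this work.

    import numpy as np

    def pitch_autocorr(frame, fs, fmin=75.0, fmax=500.0):
        # Estimate f0 of a frame from the dominant autocorrelation peak within
        # the plausible pitch-lag range; return 0.0 for (assumed) unvoiced frames.
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lag_min = int(fs / fmax)
        lag_max = min(int(fs / fmin), len(ac) - 1)
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        if ac[0] <= 0 or ac[lag] < 0.3 * ac[0]:       # weak periodicity -> unvoiced
            return 0.0
        return fs / lag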

Fig 2. Overview of American English.

Intensity: An acoustic feature which models the energy (loudness) of a sound, simulating the way it is perceived by the human ear, by calculating the sound amplitudes in short intervals; it is usually expressed in decibels (dB). [2]
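As a small illustration, intensity can be approximated in Python from the root-mean-square amplitude of short intervals and expressed in dB; the 10 ms interval length and the unit reference amplitude below are assumptions.

    import numpy as np

    def intensity_db(signal, fs, interval_ms=10.0, ref=1.0):
        # Per-interval intensity in dB relative to a reference amplitude.
        hop = int(fs * interval_ms / 1000.0)
        levels = []
        for start in range(0, len(signal) - hop, hop):
            seg = signal[start:start + hop].astype(float)
            rms = np.sqrt(np.mean(seg ** 2))
            levels.append(20.0 * np.log10(max(rms, 1e-12) / ref))
        return np.array(levels)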

Spectrogram: A spectrogram converts a two-dimensional waveform (amplitude vs. time) into a three-dimensional pattern (amplitude vs. frequency vs. time). Spectrograms are broadly classified into two major types, wideband and narrowband. The wideband spectrogram is the basic tool for spectral analysis. Peaks in the spectrum appear as dark horizontal bands called formant resonances. Voiced sounds produce vertical striations in the spectrogram because the speech amplitude increases each time the vocal folds close. Spectrograms are used primarily to examine formant centre frequencies.


Wideband spectrograms employ 300 Hz band-pass filters with a response time of about a millisecond, which yields good time resolution (accurate durational measurement). [2] [3]
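For illustration, the following Python sketch computes a wideband and a narrowband spectrogram with scipy; a 300 Hz analysis bandwidth corresponds roughly to a window of about 3 ms, and the exact window and overlap lengths, as well as the file name, are assumptions.

    from scipy.io import wavfile
    from scipy.signal import spectrogram

    fs, x = wavfile.read("Fs4_a.wav")        # hypothetical file name

    # Wideband: short window (~3 ms, ~300 Hz bandwidth) -> good time resolution
    f_wb, t_wb, S_wb = spectrogram(x, fs, nperseg=int(0.003 * fs),
                                   noverlap=int(0.002 * fs))

    # Narrowband: long window (~25 ms) -> good frequency resolution (resolves harmonics)
    f_nb, t_nb, S_nb = spectrogram(x, fs, nperseg=int(0.025 * fs),
                                   noverlap=int(0.020 * fs))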

3.2. ABOUT PRAAT:

We have used PRAAT as our analysis platform because (a minimal usage sketch follows the list below):

 it is public-domain software and is supported on a variety of platforms, namely Windows, Macintosh, Linux and Solaris.

 it provides a variety of data structures, such as TextGrid, PitchTier and Table, to represent the various types of information used for extracting prosodic features.

 PRAAT commands can be used as a programming (scripting) language.

 it offers a suite of high-quality speech analysis routines, such as pitch tracking.
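Where the analysis is scripted from Python rather than from the PRAAT interface, the third-party praat-parselmouth package exposes the same routines; the sketch below is only an illustration (method names as given in the parselmouth documentation, and the file name is hypothetical), not the exact procedure used in this work.

    import parselmouth                        # third-party "praat-parselmouth" package

    snd = parselmouth.Sound("Fs4_a.wav")      # hypothetical file name
    pitch = snd.to_pitch()                    # PRAAT pitch tracking
    formants = snd.to_formant_burg()          # PRAAT Burg formant analysis

    t = snd.duration / 2.0                    # query mid-utterance
    f0 = pitch.get_value_at_time(t)
    f1 = formants.get_value_at_time(1, t)
    f2 = formants.get_value_at_time(2, t)
    f3 = formants.get_value_at_time(3, t)
    print(f0, f1, f2, f3)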

3.3 APPLICATIONS.

 Language processing: analysing the gender of a voice.

 Speech recognition & understanding: speech recognition refers to the recognition of each word in speech, whereas speech understanding means extracting the meaning of what is said from the words recognized.

 Home automation: electronic devices can be controlled automatically through speech.

 Robotics: the human-machine interface becomes very easy in robotic applications.

 Human speech recognition: humans do not appear to use spectral templates; rather, they perform partial recognition of phonetic units independently in different frequency bands.

IV. TEST-SAMPLE ANALYSIS.

TABLE 1: Tabulated results of the various samples with formant frequencies and pitch

The test samples considered here are from Speaker 1, Speaker 2, Speaker 3, Speaker 4 and Speaker 5 (a child of around 10 years).

Sample            F1 (Hz)    F2 (Hz)    F3 (Hz)    Pitch (Hz)

Speaker 1   a      519.16     543.04    1963.58      259.09
            e      315.00     566.76    1979.13      275.26
            i      518.48    2196.82    2817.14      266.75
            o      533.13     868.48    1743.76      266.43
            u      390.32     812.66    1666.16      295.36
            Bat    919.60    1964.54    2609.15      235.63
            Bird   477.38    1916.64    2385.90      237.86
            Boot   422.42     725.92    1669.00      265.19

Speaker 2   a      442.56     925.42    2447.94      219.60
            e      237.40     847.62    2225.66      231.87
            i      427.70     996.00    2584.16      225.04
            o      488.59     875.02     902.01      243.87
            u      370.45     995.66    1880.66      246.43
            Bat    968.08    1608.08    1990.68      240.76
            Bird   461.69    1163.97    1175.57      229.77
            Boot   382.13     667.35    1003.05      239.03

Speaker 3   a      487.42    1099.06    2201.66      242.92
            e      294.48     725.04    1927.54      278.48
            i      524.04    1031.82    1948.55      265.26
            o      580.52    1335.07    1788.62      289.44
            u      282.65     601.38    2169.93      262.20
            Bat    690.21     981.49    2027.06      246.34
            Bird   731.99     134.29    1706.08      248.42
            Boot   468.58     839.48    1286.53      277.38

Speaker 4   a      633.18    1959.88    2096.65      211.10
            e      343.28     459.58    2064.38      255.80
            i      430.77     592.73    2256.42      217.46
            o      545.96     921.53    2058.29      234.78
            u      479.49    1309.02    2009.33      242.63
            Bat    714.10     849.00    1965.83      196.14
            Bird   139.47     620.54    1454.80      209.92
            Boot   489.75    1015.49    1731.17      251.30

Speaker 5   a      540.05     606.67    1468.77      303.83
            e      390.54     808.72    1848.28      190.62
            i      677.96    1275.90    1932.15      322.47
            o      639.27     678.14    1338.80      338.15
            u      362.22     722.97    1457.97      365.49
            Bat    681.38    1079.17    2035.49      355.29
            Bird   552.23     967.91    1905.82      321.99
            Boot   405.57     711.68    1436.69      354.05

 Vowels: a, e, i, o, u

 Words with combination of vowels: bat, bird, boot

 Considered consonants: k, l, m, n

 Words with combination of consonants: why, fly, crypt

Fig 3. Spectrogram of letter “a” for Speaker 4.

Fig 4. Spectrogram of letter “e” for Speaker 4.


Fig 5. Sample formant frequency peaks obtained using PRAAT.

Fig 5 gives an idea of the formant peaks for a given input signal. Fig 3 and Fig 4 provide sample spectrograms of the speech input signal in PRAAT; spectrograms were plotted in the same way for all the samples, and the various acoustic parameters, such as formant frequencies, pitch and intensity, are tabulated in Table 1. A comparative bar-graph representation is also shown in Fig 6 for better comparative analysis.
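As a small worked example of the comparative plot, the Python sketch below reproduces one group of bars of Fig 6 from the values already listed in Table 1 (the formants of the vowel "a" for the five speakers); matplotlib is assumed to be available.

    import numpy as np
    import matplotlib.pyplot as plt

    # F1, F2, F3 (Hz) of vowel "a" for Speakers 1-5, taken from Table 1
    speakers = ["Spk1", "Spk2", "Spk3", "Spk4", "Spk5"]
    f1 = [519.16, 442.56, 487.42, 633.18, 540.05]
    f2 = [543.04, 925.42, 1099.06, 1959.88, 606.67]
    f3 = [1963.58, 2447.94, 2201.66, 2096.65, 1468.77]

    x = np.arange(len(speakers))
    width = 0.25
    plt.bar(x - width, f1, width, label="F1")
    plt.bar(x,         f2, width, label="F2")
    plt.bar(x + width, f3, width, label="F3")
    plt.xticks(x, speakers)
    plt.ylabel("Frequency (Hz)")
    plt.title("Formants of vowel 'a' across speakers (from Table 1)")
    plt.legend()
    plt.show()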

From the results it is evident that speech is not only a signal but a combination of several parameters such as gender, age, content of speech, context, and the combination of vowels and consonants of a particular language. The Indian English accent is influenced by the speakers' other native languages, which sustains interest in this area of the speech signal processing domain.

V. CONCLUSION.

The processing of vowels and consonants has been successfully implemented and tested using the PRAAT software, in which three formant frequencies, the spectrogram and the pitch of each speaker are tabulated; comparative bar graphs of more than 40 samples are plotted, and the corresponding pitch of each speaker has been plotted.

From the results we can conclude that the pitch of the child test sample considered is comparatively higher than that of the aged and young speakers, and that the formant frequencies and spectrograms of the young speakers are easily compared with the child's utterances.

The pitch of the consonants pronounced by the young speakers is found to be nearly identical, with interestingly noticeable variations in the formant frequencies and spectrogram; hence the human-machine interface is better suited to young users, which makes possible various innovative applications for the younger generation. The major separation line to bifurcate the vowels, consonants and V-C-V words of the English language, under our test conditions, lies in the three formant frequencies: the formant frequencies of the vowels of any speaker are related to the utterance in the speech, whereas the third formant frequency (F3) is found to be nearly identical or linearly varying irrespective of the speaker. Hence, with the single parameter F3, together with the pitch, we can easily distinguish the vowels from the non-vowels.

Fig 6. Summary of formant frequencies and pitch for vowels uttered by different users.

REFERENCES:

[1] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.

[2] L. Rabiner, B. Juang and B. Yegnanarayana, Fundamentals of Speech Recognition, First Edition.

[3] L. Welling and H. Ney, "A Model for Efficient Formant Estimation," Proc. IEEE ICASSP, Atlanta, 1996, pp. 797-800.

[4] Shaila D. Apte, Speech and Audio Processing, First Edition, Wiley India Pvt. Ltd., 2013.

[5] Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, First Edition, Pearson Education Ltd., 2011.

[6] Douglas O'Shaughnessy, Speech Communication: Human and Machine, Second Edition, University Press (India) Ltd., 2009.


