
ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE), ISSN (Print): 2320-8945, Volume 1, Issue 6, 2013


Speech Based Emotional Feature Extraction

Varun Kumar Tiwari & Pradnya P. Zode

Department of Electronics Engineering, Yeshwantrao Chavan College of Engineering Nagpur, India

E-mail : [email protected], [email protected]

Abstract – With the increase in the computing capabilities of microprocessors, it has become possible to perform real-time computations on speech signals and images, so automatic emotion recognition from speech signals has become an important area of research. In this paper we discuss methods of extracting emotional features from regular speech signals.

Index Terms – Emotional feature; feature extraction; Support Vector Machine; Mel Frequency Cepstral Coefficient

I. INTRODUCTION

Human emotions are expressed through speech, and a human listener can judge the emotion of an utterance; if we want a machine to understand emotion, we need to design an algorithmic approach as a solution to the problem. The algorithm is a two-stage approach: the first stage extracts features from the speech signal that correspond to the emotional state of the speaker, and the second classifies the extracted features into the proper emotional state. The accuracy of such a system depends on the accuracy of both stages: if the extracted features are not proper emotional features, the system fails, and if the classifier is unable to classify properly, the system also fails. A great deal of research in pattern recognition has produced good classifiers such as Neural Networks, Support Vector Machines, Hidden Markov Models, and Gaussian Mixture Models, so there are many options for the latter stage; our main interest is therefore to find proper techniques for extracting emotional features from speech signals. A recorded speech signal is simply a digital signal, so for feature extraction we have different methods, which are discussed in this paper, followed by the emotional features.

II. FEATURES

A. Statistical Features

A digital signal can be considered a collection of data samples, so we can compute statistical features of the signal.

If the signal has $n$ samples $x_1, x_2, x_3, \ldots, x_n$, then:

1. Mean: The mean is defined as

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

2. Median: If the sample data are arranged in increasing order, the median is the middle value if n is an odd number, or midway between the two middle values if n is an even number.

3. Mode: The mode is the most commonly occurring value.

4. Range: Range is the difference between the largest and smallest observations in the sample.

5. Deviation: Deviation is the departure of a given reading from the arithmetic mean of the group of readings.

6. Average Deviation: Average deviation is the sum of the absolute values of the deviations divided by the number of readings.


7. Variance: The sample variance ($s^2$) is computed from the squared deviations from the sample mean:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

8. Standard Deviation: The sample standard deviation ($s$) is the square root of the variance:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

These features can be calculated and considered important parameters of emotional speech; their computation is sketched below.
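As a minimal illustration, the eight features above can be computed with NumPy for a signal loaded as a 1-D sample array; the function name and returned keys are illustrative, not from the paper.

```python
import numpy as np

def statistical_features(x):
    """Compute the eight statistical features of Section II.A for a 1-D signal."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()                               # 1. mean
    median = np.median(x)                         # 2. median
    vals, counts = np.unique(x, return_counts=True)
    mode = vals[counts.argmax()]                  # 3. most frequently occurring value
    rng = x.max() - x.min()                       # 4. range
    dev = x - mean                                # 5. deviations from the mean
    avg_dev = np.abs(dev).mean()                  # 6. average absolute deviation
    var = np.var(x, ddof=1)                       # 7. sample variance s^2
    std = np.sqrt(var)                            # 8. sample standard deviation s
    return {"mean": mean, "median": median, "mode": mode, "range": rng,
            "avg_dev": avg_dev, "variance": var, "std_dev": std}
```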

B. Acoustic Features

1. Pitch: It represents the perceived fundamental frequency of a sound. It is one of the major auditory attributes of sounds along with loudness.

2. Formant: It is a peak in the frequency spectrum of a sound caused by acoustic resonance. The Formant frequencies are distinguishing or meaningful frequency components of human speech.

3. Intensity: It is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). The perception of loudness is related to both the sound pressure level and duration of a sound.

4. Magnitude Spectrum: This feature extracts the FFT (Fast Fourier Transform) magnitude spectrum from a set of audio samples. It gives a good idea about the magnitude of different frequency components within a window.

5. Power Spectrum: This feature extracts the FFT power from a set of audio samples. It gives a good idea about the power of different frequency components within a window.

6. Linear Predictive Coding (LPC) coefficients: This feature helps in representing the spectral envelope of an audio or speech signal.

7. Mel Frequency Cepstral Coefficients (MFCC): These are the coefficients derived from a cepstral representation of the audio signal in which the frequency bands are equally spaced on the Mel scale, approximating the human auditory system's response more closely. MFCCs are the most commonly and widely used features in speech recognition systems. Davis and Mermelstein [3] demonstrated that the MFCC representation approximates the structure of the human auditory system better than traditional linear predictive features. They are extracted as follows:

Figure 1. MFCC Extraction

The Mel scale is a logarithmic mapping from physical frequency to perceived frequency [8]. The cepstral coefficients extracted using this frequency scale are called MFCCs.
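As a rough sketch, the windowed spectral features above can be computed with NumPy, and the LPC and MFCC features with the librosa library (the paper's own implementation is in MATLAB). The audio path and LPC order are placeholders; the FFT size, hop size, and number of Mel coefficients mirror the parameters reported in Section IV, assuming "Melcof: 3" denotes the number of coefficients.

```python
import numpy as np
import librosa

# Load a speech sample ("speech_sample.wav" is a placeholder path).
y, sr = librosa.load("speech_sample.wav", sr=None)

n_fft, hop = 512, 410                   # FFT and hop sizes used in Section IV
frame = y[:n_fft] * np.hanning(n_fft)   # one windowed analysis frame

spectrum = np.fft.rfft(frame)
magnitude = np.abs(spectrum)            # 4. magnitude spectrum of the window
power = magnitude ** 2                  # 5. power spectrum of the window

lpc = librosa.lpc(y, order=12)          # 6. LPC coefficients (order is illustrative)

# 7. MFCC: a Mel filter bank is applied to the power spectrum, followed by a
# log and a DCT. One common form of the Mel mapping is
# mel(f) = 2595 * log10(1 + f / 700).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=3, n_fft=n_fft, hop_length=hop)
```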

III. EMOTIONAL FEATURES

A. Emotion

Emotion is a term for a mental and physiological state associated with a wide variety of feelings, thoughts, and behaviours. Emotions are subjective experiences, experienced from an individual point of view. Emotion is often associated with mood, temperament, and personality, but in general emotions are short-term, whereas moods are longer-term and temperaments or personalities are very long-term: a particular mood may be sustained for several days, and a temperament for months or years. Human emotions can be of different types, such as anger, happiness, sadness, neutral, fear, disgust, surprise, shyness, boredom, etc.

Primary Emotions: The emotions that occur as a direct result of encountering some kind of cue.

Secondary Emotions: The emotional reactions we have to other emotions. For example, a person may feel ashamed as a result of becoming anxious or sad.

B. Emotion vs. Features

The various emotional states and their relationships with the speech features are shown in Table I.

 


TABLE I
Emotion vs. Feature

Emotion   | Pitch                                          | Intensity | Speaking Rate | Voice Quality
Anger     | High mean, wide range                          | Increased | Fast          | Breathy, blaring timbre
Happiness | Increased mean and range                       | Increased | Fast          | Sometimes breathy, moderately blaring timbre
Sadness   | Normal or lower-than-normal mean, narrow range | Decreased | Slow          | Resonant timbre

IV. RESULTS AND ANALYSIS

In order to test the features, we designed an emotion recognition system in MATLAB using SVM as the classifier. SVM is a binary classifier, but by using kernel tricks it can be used as a multiclass classifier; here we used Radial Basis Function (RBF), polynomial, linear, and sigmoid kernels to test the features.

A general model of the recognition system is shown in Figure 2.

Figure 2. Speech Emotion Recognition System

In the training phase, emotional speech signals are taken from a standard speech emotion database and fed to the classifier for training; this is supervised training of the classifier. Once the classifier is trained, we can move to the working phase, where any random unknown sample can be given to the classifier for emotional-state recognition. The accuracy of the system can then be calculated by comparing the classified samples with their known emotions.
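A minimal sketch of these two phases in Python, with scikit-learn's SVC standing in for the paper's MATLAB classifier: the feature and label files are placeholders, and the 70/30 train/test split is an assumption, since the paper does not state how the samples were divided.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X = np.load("features.npy")  # placeholder: precomputed emotional features
y = np.load("labels.npy")    # placeholder: labels such as anger/happy/normal/sad

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The four kernels tested in Section IV. SVC handles the multiclass case
# internally by combining binary SVMs (one-vs-one decomposition).
for kernel in ("poly", "linear", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)  # training phase
    acc = clf.score(X_test, y_test)                 # working phase + accuracy
    print(f"{kernel}: {100 * acc:.1f}% accuracy")
```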

Here we have used four emotional classes: anger, happy, normal, and sad. The standard database [6] has 13 anger samples, 15 happy samples, 30 normal samples, and 15 sad samples.

The accuracy of the system varies depending on the selected speech feature. The main features that showed good results are presented in the tables below.

TABLE II
Statistical features (mean, median, standard deviation) as speech features
(All rows: Wav samples: 80000; FFT size: 512; Hop size: 410)

SVM (Kernel) | Accuracy (2 emotional samples each) | Accuracy (3 emotional samples each)
Poly         | 30.3                                | 34.2
Linear       | 29.7                                | 34.2
RBF          | 28.7                                | 32.1
Sigmoid      | 29.7                                | 29.4

TABLE III
Energy as speech feature
(All rows: Wav samples: 80000; FFT size: 512; Hop size: 410)

SVM (Kernel) | Accuracy (2 emotional samples each) | Accuracy (3 emotional samples each)
Poly         | 67.5                                | 70.45
Linear       | 68.2                                | 70.45
RBF          | 70.1                                | 71.59
Sigmoid      | 63.2                                | 68.18

TABLE IV
MFCC and energy as speech features
(All rows: Wav samples: 80000; FFT size: 512; Hop size: 410; Melcof: 3)

SVM (Kernel) | Accuracy (2 emotional samples each) | Accuracy (3 emotional samples each)
Poly         | 71.3                                | 72.4
Linear       | 71.3                                | 70.8
RBF          | 72.6                                | 71.8
Sigmoid      | 65.2                                | 62.18

V. CONCLUSION

The statistical features of a speech signal do not work as emotional features on their own, whereas the acoustic features play an important role in emotion recognition. Energy is one major emotional feature, and the combination of energy and MFCC provides a good emotion recognition rate, so these are important emotional features.

VI. REFERENCES

[1] Won-Jung Yoon and Kyu-Sik Park, "Building Robust Emotion Recognition System on Heterogeneous Speech Databases," IEEE Transactions on Consumer Electronics, Vol. 57, No. 2, May 2011.

[2] Stavros Ntalampiras and Nikos Fakotakis, "Modeling the Temporal Evolution of Acoustic Parameters for Speech Emotion Recognition," IEEE Transactions on Affective Computing, Vol. 3, No. 1, January-March 2012.

[3] Steven B. Davis and Paul Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. ASSP, Vol. ASSP-28, No. 4, pp. 357-366, 1980.

[4] Eun Ho Kim, Kyung Hak Hyun, Soo Hyun Kim, and Yoon Keun Kwak, "Improved Emotion Recognition With a Novel Speaker-Independent Feature," IEEE/ASME Transactions on Mechatronics, Vol. 14, No. 3, June 2009.

[5] Chul Min Lee and Shrikanth S. Narayanan, "Toward Detecting Emotions in Spoken Dialogs," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 2, March 2005.

[6] http://emotion-research.net/wiki/Databases


