Copyright is owned by the Author of the thesis. Permission is given for
a copy to be downloaded by an individual for the purpose of research and
private study only. The thesis may not be reproduced elsewhere without
the permission of the Author.
Voice Recognition System for Massey University Smarthouse
A thesis presented in partial fulfilment of the requirements for the degree of
Master of Engineering
in
Information Engineering
at
Massey University, Auckland, New Zealand
Rafik Gadalla
2006
Acknowledgment
I would like to express my sincere gratitude to my project supervisor, Dr Tom Moir, for his guidance and support throughout the duration of this project. I would also like to thank my colleagues Leon Kok, Grettle Lomiwes and Vaitheki Yoganathan and other members of the smarthouse projects for their effort and dedication towards the formation of the smarthouse.
I'm grateful to my family for all their support and encouragement. In particular I would like to thank my parents for their love, patience and encouragement, my fiancée Manal, my brothers and friends, without whose assistance this project would not have been made possible.
And above all, I would like to thank God for making it all happen.
Abstract
The concept of a smarthouse aims to integrate technology into houses to a level where most daily tasks are automated and to provide comfort, safety and entertainment to the house residents. The concept is mainly aimed at the elderly population to improve their quality of life.
In order to maintain a natural medium of communication, the house employs a speech recognition system capable of analysing spoken language and extracting commands from it. This project focuses on the development and evaluation of a Windows application developed with a high level programming language which incorporates speech recognition technology by utilising a commercial speech recognition engine.
The speech recognition system acts as a hub within the Smarthouse to receive and delegate user commands to different switching and control systems.
Initial trials were built using Dragon Naturally Speaking as the recognition engine. However, that proved inappropriate for use in the Smarthouse project as it is speaker dependent and requires each user to train it with his/her own voice.
The application now utilizes the Microsoft Speech Application Programming Interface (SAPI), a software layer which sits between applications and speech engines, and the Microsoft Speech Recognition Engine, which is freely distributed with some Microsoft products. Although Dragon Naturally Speaking offers better recognition for dictation, the Microsoft engine can be optimized using Context Free Grammar (CFG) to give enhanced recognition in the intended application. The application is designed to be speaker independent and can handle continuous speech. It connects to a database oriented expert system to carry out full conversations with the users. Audible prompts and confirmations are achieved through speech synthesis using any SAPI compliant text to speech engine.
Other developments focused on designing a telephony system using the Microsoft Telephony Application Programming Interface (TAPI). This allows the house to be remotely controlled from anywhere in the world: residents will be able to call their house from any location, and the house will be able to respond to and fulfil their commands.
Table of Contents
Acknowledgment
Abstract
List of Figures
List of Abbreviations
Chapter 1: Introduction
1.1 Massey Smarthouse
1.1.1 Location and Positioning System
1.1.2 Voice Separation System
1.1.3 House Management System
1.1.4 Remote Switching System
Chapter 2: Background
2.1 How Speech Recognition Works
2.1.1 Extracting Discriminating Speech Features
2.1.2 Extracting Phonemes
2.1.3 Applying Grammar and Language Models
2.2 SAPI
2.3 TAPI
2.3.1 API
2.3.2 TAPI server
2.3.3 Service provider interface
2.4 HMM
2.5 XML
2.6 Other Smart House Projects
2.7 Speech Interface Standards
2.7.1 VoiceXML
2.7.2 SALT
2.8 Speech Recognition in Commercial Environments
Chapter 3: Problem Formulation and System Requirements
3.1 Problem Formulation
3.2 Specifications
Chapter 4: System Implementation
4.1 Speech Interface Design Concepts
4.2 Speech Recognition Implementation
4.2.1 System Initialization
4.2.2 Command Execution
4.2.3 Performing other Functions
4.2.4 Commands Database
4.2.5 Speech Recognition flow chart
4.3 Telephony Implementation
4.4 Graphical User Interface
4.4.1 GUI Design Considerations
4.4.2 Application Interface
Chapter 5: System Testing
5.1 Preliminary Tests
5.2 Final Evaluation Methodology
5.2.1 Room Setup and Recording Equipment
5.2.2 Selection and training of Subject Speakers
5.2.3 Design of command phrases
5.2.4 Introduction of Noise
5.2.5 Telephony Testing
5.2.6 Analysis of Sound files
Chapter 6: Results and Discussion
6.1 ASR Engines Feature Comparison
6.1.1 Dragon Naturally Speaking
6.1.2 Microsoft SAPI Kit
6.1.3 Vocon 3200
6.1.4 IBM ViaVoice
6.2 System Evaluation Results
6.3 Improving Recognition Accuracy
Chapter 7: Conclusion and Future Work
7.1 Conclusion
7.2 Future Work
References
Bibliography
Appendices
Appendix A: Matlab Script Used During Testing Process
Appendix B: Grammar File Used During Testing Process
List of Figures
Figure 1.1: The different components of the Massey University Smarthouse and how they integrate together
Figure 1.2: Bluetooth watch worn by Massey Smarthouse occupants
Figure 2.1: Amplitude vs. time graph for the phrase "Massey University"
Figure 2.2: Spectrograph of the phrase "Massey University"
Figure 2.3: Structure of a continuous ASR engine
Figure 2.4: SAPI architecture
Figure 4.1: Communication between the speech system and expert system
Figure 4.2: Application's Splash Screen
Figure 4.3: Smarthouse command database schema
Figure 4.4: Simplified flow chart of the speech recognition application
Figure 4.5: Flow chart of telephony handling application
Figure 4.6: Standard naming convention used in the application's top menu
Figure 4.7: Logical grouping of components
Figure 4.8: The application's main form. The different buttons and dialogs are numbered one to eight.
Figure 4.9: Microphone Training Wizard
Figure 4.10: Add New Words Wizard
Figure 4.11: User Training Wizard
Figure 4.12: Help-About Window
Figure 5.1: Program used to assist speakers to read commands
Figure 5.2: Program used for testing ASR engine accuracy
Figure 6.1: Recognition accuracy graph for Speaker 1 (female with American accent)
Figure 6.2: Recognition accuracy graph for Speaker 2 (female with New Zealand accent)
Figure 6.3: Recognition accuracy graph for Speaker 3 (male with foreign accent)
Figure 6.4: Recognition accuracy graph for Speaker 4 (male with Scottish accent)
Figure 6.5: Recognition accuracy graph for Noise 1 (Traffic Noise)
Figure 6.6: Recognition accuracy graph for Noise 2 (Crowd Noise)
Figure 6.7: Recognition accuracy graph for Noise 3 (Children Talking Noise)
Figure 6.8: Recognition accuracy graph for Noise 4 (Ambient Music)
Figure 6.9: Recognition accuracy graph for Speaker 1 using bandpass filtered speech to simulate telephony quality speech
Figure 6.10: Beamformer microphone array
List of Abbreviations
SAPI    Speech Application Programming Interface
CFG     Context Free Grammar
TAPI    Telephony Application Programming Interface
TCP     Transmission Control Protocol
IP      Internet Protocol
PCM     Pulse Code Modulation
LPC     Linear Predictive Coding
FFT     Fast Fourier Transform
MFCC    Mel Frequency Cepstral Coefficient
HMM     Hidden Markov Model
ASR     Automatic Speech Recognition
TTS     Text To Speech
API     Application Programming Interface
DDI     Device Driver Interface
XML     eXtensible Mark-up Language
COM     Component Object Model
SPI     Service Provider Interface
DTMF    Dual Tone Multi-Frequency
SALT    Speech Application Language Tags
IVR     Interactive Voice Response
RMS     Root Mean Square
SNR     Signal to Noise Ratio
Chapter 1: Introduction
The purpose of this project was to develop a voice recognition system that can be used in the Massey University Smarthouse to respond to the occupants' needs and desires simply by taking their voice requests and transforming them into actions. The system acts as a hub that services and delegates all voice requests to other control systems within the house.
1.1 Massey Smarthouse
The Massey University Smarthouse is a collaborative research and development project among the Institute of Information and Mathematical Sciences, the Institute of Technology and Engineering and other industry partners. The goal of the project is to create a house where technology and appliances help make life easier, safer and more enjoyable for its occupants. It responds to the needs and desires of occupants by, for example, monitoring their health, adjusting lighting, temperature, or even ambient music to their personal preferences, and wherever possible assists them in all their daily tasks. The main aims of the Smarthouse are to:
• Monitor the health and safety of its occupants, by using the latest in information systems and biotechnology.
• Automate common house management tasks, thus allowing inhabitants to have a more enjoyable and comfortable life.
• Provide information and entertainment to the occupants upon their demand.
It should hide the technique and details of how it works and be completely intuitive to use (Human Centred Design).
The main beneficiaries of this project will be the elderly population who want to retain their independence, and their families and friends who can be secure in the knowledge that they are safe, well and comfortable. The health sector will benefit by being able to more effectively help and monitor people in their care. There will also be a number of other benefits for the construction industry, the appliance industry, and for other people who wish to improve their quality of life.
The Massey University Smarthouse [1] will be a world-class showcase for the integration of house automation, health care and smart appliance technology. Figure 1.1 provides an overview of the different components of the Smarthouse that are discussed below.
[Figure: diagram showing the Voice Separation Component, the Internet (weather information, TV listings, etc.), the Voice Recognition Component, the Artificial Intelligence Component, the Location System Component and the Remote Switching Component]
Figure 1.1: The different components of the Massey University Smarthouse and how they integrate together
1.1.1 Location and Positioning System
Tracking the position of occupants and devices within the house is essential to attain smart control and monitoring. The smarthouse will therefore be equipped with a ubiquitous Bluetooth network consisting of transceiver nodes that span the roof of the entire house.
The occupants of the house will wear a Bluetooth transmitting watch that contains their uniquely identifiable code, which lets the house know who they are and exactly where they are within the house.
Figure 1.2: Bluetooth watch worn by Massey Smarthouse occupants
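The text does not say how a watch's position is derived from the transceiver nodes. As a minimal sketch, assuming each node reports the watch code it hears together with a received signal strength, an occupant could be assigned to the room of the node that hears the watch most strongly. The `Sighting` record, the room names, the watch codes and the signal values are hypothetical and are not taken from the Smarthouse implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    """One report from a roof-mounted node (hypothetical record layout)."""
    node_room: str   # room in which the transceiver node is mounted
    watch_code: str  # unique code transmitted by an occupant's watch
    rssi_dbm: float  # received signal strength; closer to 0 means stronger


def locate_occupants(sightings):
    """Map each watch code to the room of the node that hears it best."""
    best = {}
    for s in sightings:
        current = best.get(s.watch_code)
        if current is None or s.rssi_dbm > current.rssi_dbm:
            best[s.watch_code] = s
    return {code: s.node_room for code, s in best.items()}


# Illustrative reports: WATCH-01 is heard by two nodes, WATCH-02 by one.
reports = [
    Sighting("kitchen", "WATCH-01", -71.0),
    Sighting("lounge", "WATCH-01", -58.0),
    Sighting("bedroom", "WATCH-02", -63.0),
]
print(locate_occupants(reports))
```

Choosing the strongest node is only the simplest possible rule; a real deployment would likely need smoothing over time to avoid an occupant appearing to jump between adjacent rooms.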
1.1.2 Voice separation system
To enable the smarthouse to be controlled by voice, two approaches can be taken. The first is for all occupants to wear a voice capturing device, in the form of a headset, a watch or similar. The second is to use wall or roof mounted microphones to allow for distant speech recognition. Because the first approach is restrictive to the occupants, Massey University's Smarthouse will use beamformer microphone arrays. The beamformer arrays utilise some well known beamforming algorithms to minimize noise and provide clean, high quality speech for the speech recognition system. The main algorithm used will be a modified version of the Griffiths-Jim beamformer.
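For orientation, the standard Griffiths-Jim structure combines a fixed beamformer with an adaptive noise canceller that is fed by a signal-blocking branch. The sketch below is a two-microphone textbook version, not the modified algorithm used in the Smarthouse. It assumes the two microphone signals are already time-aligned towards the speaker, and the function name, the filter length, the step size and the synthetic test signals are all illustrative choices.

```python
import math
import random


def griffiths_jim(mic1, mic2, taps=4, mu=0.05):
    """Two-microphone Griffiths-Jim sketch using a normalised LMS update."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent samples of the blocking branch
    output = []
    for a, b in zip(mic1, mic2):
        fixed = 0.5 * (a + b)  # fixed beamformer: average of aligned channels
        blocked = a - b        # blocking branch: aligned speech cancels out
        history = [blocked] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = fixed - noise_estimate  # enhanced output sample
        power = sum(h * h for h in history) + 1e-8
        weights = [w + mu * error * h / power for w, h in zip(weights, history)]
        output.append(error)
    return output


# Synthetic check: a sine stands in for speech; noise reaches only mic 1.
random.seed(0)
n_samples = 4000
speech = [math.sin(2 * math.pi * 0.01 * k) for k in range(n_samples)]
noise = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
mic1 = [s + v for s, v in zip(speech, noise)]
mic2 = list(speech)
enhanced = griffiths_jim(mic1, mic2)
```

In this arrangement the blocking branch contains noise but no speech, so the adaptive filter can only subtract components of the fixed beamformer output that are correlated with the noise.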
1.1.3 House Management System
The house management system is a PC based software application that contains all the rules that govern the operation of the house. It will act as the central control unit, communicating all the necessary information to and from other components within the house. The application will be equipped with an expert system implemented in the form of a database. The system will collect information from the location system, the speech recognition system and the different sensors within the house to manage the daily operations of the house in an intelligent manner.
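The schema of the expert system is not given in this section. As a minimal sketch of a rule base held in a database, assume each rule maps a recognised command and the speaker's current room to an action on a device. The table name, column names and rows below are hypothetical.

```python
import sqlite3


def build_rule_base():
    """Create an in-memory rule table (hypothetical schema and rows)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE rules ("
        " command TEXT NOT NULL,"
        " room TEXT NOT NULL,"
        " device TEXT NOT NULL,"
        " action TEXT NOT NULL,"
        " PRIMARY KEY (command, room))"
    )
    db.executemany(
        "INSERT INTO rules VALUES (?, ?, ?, ?)",
        [
            ("lights on", "lounge", "lounge_lamp", "ON"),
            ("lights on", "kitchen", "kitchen_lamp", "ON"),
            ("lights off", "lounge", "lounge_lamp", "OFF"),
        ],
    )
    return db


def decide(db, command, room):
    """Return (device, action) for a spoken command, or None if no rule applies."""
    return db.execute(
        "SELECT device, action FROM rules WHERE command = ? AND room = ?",
        (command, room),
    ).fetchone()


rules = build_rule_base()
print(decide(rules, "lights on", "kitchen"))
```

Keying the rule on both the command and the room shows how input from the speech recognition system and the location system could be combined into a single decision.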
1.1.4 Remote Switching System
Switching and control of appliances is made possible by a TCP/IP switching system built using an embedded system. Although capable of being used as a single, stand-alone device to aid in home automation, it also integrates into the smarthouse environment, allowing a number of smart appliances to be networked and controlled by the house management system. The device offers a simple web browser interface to show the status of connected devices at any given moment.
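The wire protocol of the switching device is not described here. Purely as an illustration of switching appliances over TCP/IP, the sketch below invents a one-line text protocol (`SET <device> ON|OFF` and `STATUS`). The protocol, the `SwitchNode` class and the `send_command` helper are assumptions, not the interface of the actual embedded device.

```python
import socket


class SwitchNode:
    """Device-side state for a hypothetical line-based switching protocol."""

    def __init__(self, devices):
        self.states = {name: False for name in devices}

    def handle(self, line):
        """Process 'SET <device> ON|OFF' or 'STATUS' and return a reply line."""
        parts = line.strip().split()
        if parts == ["STATUS"]:
            return " ".join(
                f"{name}={'ON' if on else 'OFF'}"
                for name, on in sorted(self.states.items())
            )
        if len(parts) == 3 and parts[0] == "SET" and parts[2] in ("ON", "OFF"):
            if parts[1] not in self.states:
                return "ERR unknown device"
            self.states[parts[1]] = parts[2] == "ON"
            return "OK"
        return "ERR bad request"


def send_command(host, port, line, timeout=2.0):
    """Send one request line to a node over TCP and return its reply line."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall((line + "\n").encode("ascii"))
        with conn.makefile("r", encoding="ascii") as reply:
            return reply.readline().strip()


# Exercise the device-side logic directly, without opening a socket.
node = SwitchNode(["heater", "lamp"])
print(node.handle("SET lamp ON"))
print(node.handle("STATUS"))
```

Separating the request handling from the socket code keeps the switching logic testable on its own; `send_command` would be what the house management system calls once a node is listening on the network.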