• Tidak ada hasil yang ditemukan

automatic speech recognition

N/A
N/A
Protected

Academic year: 2025

Membagikan "automatic speech recognition"

Copied!
8
0
0

Teks penuh

(1)

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

A UTOMATIC S PEECH R ECOGNITION Lecture 1

Overview of Speech Technology and ASR

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

General Information

• Lecturer: Atiwong Suchato

• email: [email protected]

• TA: Sirinart Tangruamsap (Nart)

• Course Webpage:

http://www.cp.eng.chula.ac.th/~atiwong/2110432

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Speech Communication

• Human use speech as the major mean to

communication to other people.

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Speech Communication

• The brain controls various articulators in the vocal tract to create the sound wave in some systematic forms (language).

• Auditory system sends nervous signal according to the received sound wave to the brain.

Speech Signal in Different Representations

(2)

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Speech and Language-related Technologies

• Human-machine interaction

– Automatic Speech Recognition – Speech Synthesis

– Text-to-Speech (TTS)

– Natural Language Understanding – Natural Language Generation

• Telecommunication

– Speech Coding – Speech Enhancement

• Human speech communication

– Speech-to-speech Language Translator – Text-to-text Language Translator

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Speech and Language-related Technologies

• Security

– Speaker ID

– Speaker Verification – Speaker Authentication

• Speech and Hearing Disorder

– Speech Synthesis – Hearing Aid

• Speech Manipulation

– Speaking Rate Adjusting – Special Effect

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Human-machine comm. via spoken language

input output

human machine

Speech

Speech Speech Speech

Recognition

Text Text

Synthesis

Text Text

Meaning Meaning

Understanding Generation

MIT Open Course Ware

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Speech-based Interface Applications

• Mostly input (recognition only)

– Simple Command and Control – Simple Data Entry (Over the phone) – Dictation

• Interactive Conversation (understanding needed)

– Information kiosks – Transactional processing – Intelligent agents

MIT Open Course Ware

(3)

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Benefits of Speech Interface

• requires no special training

• leaves hands free

• leaves eyes free

• spoken language has high data rate

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Limitation of Speech Interface

• cannot be used in noisy environment

• unnatural for some tasks (coding?)

• create distraction or annoyance in some situations

• confidential information

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

ASR vs. Touchtone

• Benefits of using an automated system, such as touchtone telephone and ASR, in stead of human customer representatives?

• Problems of touchtone system? Can ASR overcome each of the problem?

• Problem caused by ASR that will not happen if the touchtone system is used?

• Ideal ways for a call center to handle incoming calls?

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Automatic Speech Recognition

• Uncover words from acoustic signal

ASR System ASR System Speech

signal Words

(4)

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Major Components of an ASR System

Speech

Signal Representation Search Recognized

words Acoustic

Models Lexical

Models Language Models Training Data

Applying Constraints

MIT Open Course Ware

Features

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Representation

• must select the right features for the representation used for the desired task

• tasks:

– recognition of fruits in a basket containing mangos, guavas, durians, and coconuts

• weight?

• skin color?

• meat color?

• shape?

– recognition of different sound classes?

Speech

Signal Representation Features

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Constraints

• Acoustic

– signal generated by human vocal

apparatus

• Lexical

– Words in a language are limited.

• Language

– A sentence must be syntactically and semantically well formed.

Search

Recognized words Acoustic

Models Lexical

Models Language Models

Features

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Domain Constraint

• Domain example

– Movie listing – Phone directory – Weather information

• Specifying domain makes lexical and language

constraint more specific.

(5)

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Characterizing ASR System Capability

Noise-canceling mic. – cell phone Transducer

High (>30dB) – Low (<10dB) SNR

Small (<10) – Large (>200) Perplexity

Finite-state – Context-sensitive Language Model

Small (<20words) – Large (>50,000 words) Vocabulary

Speaker Dependent – Speaker Independent Enrollment

Read Speech – Spontaneous Speech Speaking Style

Isolated Word – Continuous Speech Speaking Mode

Range Parameters

MIT Open Course Ware

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

ASR Trends

automatic learning simple structure

intense knowledge engineering Knowledge

Acquisition

homogeneous and heterogeneous homogeneous

and simple heterogeneous

and complex Knowledge

Representation

probabilistic and data-driven template

matching, deterministic and data-driven heuristic and

ad hoc (rule-based) Modeling

Approaches

sub-word units sub-word units

whole-word and sub-word units Recognition

Units

mid 80’s - now mid 70’s–mid

80’s before mid 70’s

MIT Open Course Ware

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

ASR performance (English)

• High performance, speaker-independent speech recognition is now possible

– Large vocabulary for cooperative speakers in benign environments

– Moderate vocabulary for spontaneous speech over the phone

• Commercial recognition systems are now available

– Dictation

– Telephone transaction

• When well-matched to applications,

technology is able to help perform real work.

MIT Open Course Ware

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Well-matched Application?

• Realistically, in the foreseeable future, do you expect an ASR system to give a 100%

accuracy for an open domain application?

• To which application that ASR system is the least suitable?

a) home appliances control using voice command b) putting a patient medical history into a database

using conversational speech

c) an automatic dictation system

d) an ASR-automated call center

(6)

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

ASR performance (English)

• Speaker-independent, continuous speech ASR now possible

• Digit recognition over the telephone with word error rate of 0.3%

• Error rate cut in half every two years for moderate vocabulary tasks

• Error for spontaneous speech more than twice that of read speech

• Conversational speech, involving multiple speakers and poor acoustic environment, remains a challenge

• Tens of hours of training data to port to a different domain

• Statistical modeling using automatic training achieves significant advances

MIT Open Course Ware

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

ASR performance (English)

MIT Open Course Ware

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Challenges

• Co-articulation

• Speaker Independent

– Gender / Children – Dialect

– Non-native

• Spontaneous Speech

– Disfluencies

• Out-of-Vocabulary Word

– Fill words – Proper names

• Noise / Channel Robustness

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

A human-machine conversational system

• System: Welcome to the “CU airways” flight information system. Do you know the flight number?

• Caller: Yes.

• System: What is it?

• Caller: Three forty-one.

• System: Thanks. Do you want arrival or departure information?

• Caller: Departure.

• System: Is it Departing Saturday, Sunday, or Monday?

• Caller: Sunday.

• System: Great. Hold on while I check on that flight.

– <system checks for the information from the database and report the info.>

• System: Is there anything else?

• Caller: No. Goodbye.

• System: Thanks for calling “CU Airways”. Goodbye.

(7)

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

A human-machine conversational system

Speech Synthesis

Speech Recognition

Language Understanding Discourse

Context

Dialog

Management Database

Language Generation

meaning speech

speech

words sentence

visual display key tone

meaning

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Voice Prompt Design

• a challenge by itself

• how to ask? how to convey complex info.?

• Social-Psychological problem

– male or female voice – humorous

– naturalness

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Selected Demo: SPEECHBUCKS

• Speech-controlled Computer Game

• 2110432 Class Project (1 st semester / 2006)

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1

Department of Computer Engineering, Chulalongkorn University Spoken Language Systems Research Group

Selected Demo: SPEECHBUCKS

(8)

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Selected Prototypes: Faculty of Engineering, Call Center

• x8-6400

• Provides Speech

Recognition capability as an alternative mean for reaching individual personnel by phone without knowing their extensions.

2110432 Auto. Speech. Recog. | First Semester 2007 | Lecture 1 ATIWONG SUCHATO

Selected Prototypes: Telephone Call Rate Inquiry

• Under development.

• Callers used their natural speech to inquire

about international telephone call rate and

other related information.

Referensi

Dokumen terkait

For this particular project, the Mel Frequency Cepstrum Coefficient (MFCC) feature is used to extract the information or the characterization of the speech signal for

Hidden Markov Model for Continous Speech Recognition, IEEE Transactions.. on Signal

Pada pengujian dilakukan dapat disimpulkan bahwa intensitas suara, jarak, logat dan kecepatan user mengatakan perintah suara setelah suara beep serta metode

This paper involves design of a light-weight speech recognition system, which directly operates on acoustic features and perform Long Short-Term Memory LSTM based speech to text

Experimental evaluation The proposed unsupervised acoustic modelling can be used to obtain either a continuous representation of the speech sig- nal in terms of virtual phoneme state

The aim of this study was to develop an SER system that can classify and recognise six basic human emotions i.e., sadness, fear, anger, disgust, happiness, and neutral from speech

CONCLUSION The statistical features of a speech signal solely does not work as emotional features, whereas the acoustic features play an important role in emotion recognition, energy

Abstract— Since the parameterization in the perceptually relevant aspects of short-term speech spectra in ASR front-end is advantageous for speech recognition, such as Mel-LPC,