13 1 ADPCM in Speech Coding

(1)

Chapter 13 Basic

Audio

Compression

Techniques

13 1 ADPCM in Speech Coding

13.1 ADPCM in Speech Coding

13.2 G.726 ADPCM

13.3 Vocoders

(2)

13 1 ADPCM in Speech Coding

13.1 ADPCM in Speech Coding

y

ADPCM

forms the heart of the ITU's

y

ADPCM

forms the heart of the ITU s

speech compression standards

G.721,

G.723, G.726, and G.727.

y

The di erence between these standards

involves the

bit-rate

(from 3 to 5 bits per

sample) and

some algorithm details

sample) and

some algorithm details.

(3)

(4)

13 2 G 726 ADPCM

13.2 G.726 ADPCM

y

ITU G 726 supersedes ITU standards

y

ITU G.726 supersedes ITU standards

G.721 and G.723.

y

Rationale: works by

adapting a

ﬁ

xed

quantizer

in a simple way. The di erent

q

p

y

(5)

y

The standard de

ﬁ

nes a

multiplier constant

y

The standard de

ﬁ

nes a

multiplier constant

(6)

y

The input value is a

ratio

of

a di erence

ith th

f t

α

with the

factor

α

.

(7)

Backward Adaptive Quantizer

y Backward adaptive works in principle by noticing Backward adaptive works in principle by noticing

either of the cases:

◦ too many values are quantized to values far from ld h if i i i f

zero – would happen if quantizer step size in f were

too small.

◦ too many values fall close to zero too much of the y time

x would happen if the quantizer step size were too large.

y Jayant quantizer allows one to adapt a backward y Jayant quantizer allows one to adapt a backward

quantizer step size after receiving just one single output.

◦ Jayant quantizer simply expands the step size if the quantized input is in the outer levels of the quantizer, and reduces the step size if the input is near zero.

(8)

The Step Size of Jayant Quantizer

y Jayant quantizer assigns Jayant quantizer assigns multiplier values Mmultiplier values M_k_k to to

each level, with values smaller than unity for levels

near zero, and values larger than 1 for the outer le els

levels.

y For signal fn, the quantizer step size ∆ is changed

according to the quantized value k, for the according to the quantized value k, for the

previous signal value fn−1, by the simple formula

y Since it is the quantized version of the signal that

(9)

G.726 G.726 —

— Backward Adaptive Jayant

Backward Adaptive Jayant

Quantizer

y _G.726_G.726 _uses_uses_ﬁ_ﬁ_{xed quantizer steps based on the}_{xed quantizer steps based on the}

logarithm of the input diﬀerence signal, e_n divided by

(10)

y The "unlocked" part adapts via the equationp p q

y The locked part is slightly modip g y ﬁed from the unlocked partp :

y The G.726 predictor is quite complicated: it uses a linear p q p

combination of 6 quantized diﬀerences and two

(11)

13 3 Vocoders

13.3 Vocoders

y Vocoders – voice coders which cannot be y Vocoders voice coders, which cannot be

usefully applied when other analog signals, such as modem signals, are in use. g

◦ concerned with modeling speech so that the salient features are captured in as few bits as possible.

◦ use either a model of the speech waveform in time (LPC (Linear Predictive Coding) vocoding), or ... → ◦ break down the signal into frequency components

◦ break down the signal into frequency components

and model these (channel vocoders and formant vocoders).

y Vocoder simulation of the voice is not very good

(12)

Phase Insensitivity

y A complete reconstituting of speech waveform is A complete reconstituting of speech waveform is

really unnecessary, perceptually: all that is needed is for the amount of energy at any time to be

ab t ri ht and the si nal ill s nd ab t ri ht about right, and the signal will sound about right.

y Phase is a shift in the time argument inside a

function of time. function of time.

◦ Suppose we strike a piano key, and generate a roughly sinusoidal sound cos(ωt), with ω =2πf.

◦ Now if we wait suﬃcient time to generate a phase shift π/2and then strike another key, with sound

cos(2( ωt + π/2), we generate a waveform like the solid ) g line in Fig. 13.3.

(13)

y

If we did not wait before striking the

y

– If we did not wait before striking the

second note, then our waveform would

be cos(

ω

t)+cos(2

ω

t). But perceptually,

the two notes would sound the same

sound, even though in actuality they

would be shifted in phase.

(14)

Channel Vocoder

y _{Vocoders can operate at}_{Vocoders can operate at}_{low bit-rates}_{low bit rates}_,_,_{1–2 kbps}_{1 2 kbps}_{. To}_{. To}

(15)

y _{Due to}_{Due to}_{Phase Insensitivity}_{Phase Insensitivity}_(i.e._(i.e._{only the energy is}_{only the energy is} important):

◦ The waveform is "recti_ﬁed" to its absolute value.

Th _ﬁl b k d i l i l l f h

◦ The _ﬁlter bank derives relative power levels for each frequency range.

◦ A subband coder would not rectify the signal, and would

id f b d

use wider frequency bands.

y A channel vocoder also analyzes the signal to

determine the general pitch g p of the speech (p (low — bass, or high — tenor), and also the excitation of the speech.

y A channel vocoder applies a vocal tract transfer y A channel vocoder applies a vocal tract transfer

(16)

Formant

Formant Vocoder

Vocoder

Formant

Formant Vocoder

Vocoder

y _Formants_Formants_{: the}_{: the}_{salient frequency}_{salient frequency}_{components that are}_{components that are}

present in a sample of speech, as shown in Fig 13.5.

y _Rationale:_{encoding only the most important} frequencies

(17)

Linear Predictive Coding (LPC)

y LPC vocodersLPC vocoders extract salient features extract salient features of speech of speech

directly from the waveform, rather than

transforming the signal to the frequency domain

LPC F

y LPC Features:

◦ uses a time-varying model of vocal tract sound generated from a given excitation

generated from a given excitation

◦ transmits only a set of parameters modeling the

shape and excitation of the vocal tract, not actual

i l diﬀ ll bit t

signals or diﬀerences small bit-rate

y About "Linear": The speech signal generated by

the output vocal tract model is calculated as a the output vocal tract model is calculated as a function of the current speech output plus a

(18)

LPC Coding Process

y

LPC

starts

by deciding whether the current

y

LPC

starts

by deciding whether the current

segment is

voiced or unvoiced:

◦ For unvoiced: a wide-band noise generator g is

used to create sample values f(n) that act as input to the vocal tract simulator

◦ For voiced: a pulse train generator creates values

f(n)

◦ Model parameters ap _i_i: calculated by using a y g

least-squares set of equations that minimize the

di erence between the actual speech and the speech generated by the vocal tract model speech generated by the vocal tract model, excited by the noise or pulse train generators

(19)

LPC Coding Process (cont'd)

LPC Coding Process (cont d)

y

If the output values generate s(n) for

y

If the output values generate s(n), for

input values f(n), the output depends on

p

previous output sample values

:

LP

b

l l

d b

(20)

LPC Coding Process (cont'd)

LPC Coding Process (cont d)

y

By taking the

derivative of a

and

setting

y

By taking the

derivative of a

_i

and

setting

it to zero

, we get a set of J equations:

(21)

LPC Coding Process (cont'd)

LPC Coding Process (cont d)

y

An often used method to

calculate LP

y

An often-used method to

calculate LP

(22)

LPC Coding Process (cont'd)

LPC Coding Process (cont d)

y

Since

φ

(i j) can be de

ﬁ

ned as

φ

(i j)= R(|i

−

y

Since

φ

(i, j) can be de

ﬁ

ned as

φ

(i, j) R(|i

j|), and when R(0)

≥

0, the matrix {

φ

(i, j)} is

positive symmetric, there exists a fast

p

y

(23)

LPC Coding Process (cont'd)

LPC Coding Process (cont d)

y

Gain G

can be calculated:

y

Gain G

can be calculated:

(24)

Code Excited Linear Prediction

(CELP)

y CELP is a more complex family of coders that CELP is a more complex family of coders that

attempts to mitigate the lack of quality of the simple LPC model

CELP l d f h

y CELP uses a more complex description of the

excitation:

◦ An entire set (a codebook) of excitation vectors is

◦ An entire set (a codebook) of excitation vectors is matched to the actual speech, and the index of the best match is sent to the receiver

Th l i i h bi 4 800 9 600

◦ The complexity increases the bit-rate to 4,800-9,600 bps

◦ The resulting speech is perceived as being g p p g more similar and continuous

(25)

The Predictors for CELP

y

CELP coders contain two kinds of prediction:

y

CELP coders contain two kinds of prediction:

◦ LTP (Long time prediction): try to reduce

d d i h i l b _ﬁ di th b i

redundancy in speech signals by _ﬁnding the basic periodicity or pitch that causes a waveform that more or less repeats

more or less repeats

◦ STP (Short Time Prediction): try to eliminate the

(26)

Relationship between STP and LTP

y

STP

captures the

formant structure

of the

y

STP

captures the

formant structure

of the

short-term speech spectrum based on

only a

few samples

p

y

LTP,

following STP, recovers the long-term

correlation in the speech signal that

h

d

h

represents the periodicity in speech using

whole frames or subframes (1/4 of a frame)

LTP i ft

i

l

t d " d ti

– LTP is often implemented as "adaptive

codebook searching"

y

Fig 13 6 shows the relationship between

STP

y

Fig. 13.6 shows the relationship between

STP

(27)

(28)

Adaptive Codebook Searching

y

Rationale:

y

Rationale:

◦

Look in a codebook of waveforms to

_ﬁ

nd one

that matches the current subframe

◦

Codeword: a shifted speech residue segment

indexed by the by the lag

τ

corresponding to

the current speech frame or subframe in the

adaptive codebook

(29)

Open

Open Loop Codeword Searching

Loop Codeword Searching

Open

Open--Loop Codeword Searching

Loop Codeword Searching

y

Tries to minimize the

long term

y

Tries to minimize the

long-term

(30)

LZW Close

LZW Close--Loop Codeword

Loop Codeword

Searching

y Closed-loop search is more often used in CELP y Closed-loop search is more often used in CELP

coders — also called Analysis-By-Synthesis (A-B-S)

y speech is reconstructed and perceptual error for speech is reconstructed and perceptual error for

that is minimized via an adaptive codebook search, rather than simply considering sum-of-squares

y The best candidate in the adaptive codebook is

selected to minimize the distortion of locally reconstructed speech

y Parameters are found by minimizing a measure of

h di b h i i l d h

(31)

Hybrid Excitation Vocoder

y Hybrid Excitation Vocoders are di erent from y Hybrid Excitation Vocoders are di erent from

CELP in that they use model-based methods to introduce multi-model excitation

y includes two major types:

◦ MBE (Multi-Band Excitation): ( ) a blockwise codec, in which a speech analysis is carried out in a speech frame unit of about 20 msec to 30 msec

MELP (M ltib d E it ti Li P di ti ) h

◦ MELP (Multiband Excitation Linear Predictive) speech codec is a new US Federal standard to replace the old LPC-10 (FS1015) standard with the application focus ( ) pp on very low bit rate safety communications

◦ a DoD speech coding standard used mainly in military applications and satellite communications

(32)

MELP Vocoder

y

MBE utilizes the A-B-S scheme in parameter

y

MBE utilizes the A B S scheme in parameter

estimation:

◦ The parameters such as basic frequency, p q y,

spectrum envelope, and sub-band U/V decisions are all done via closed-loop searching

◦ The criterion of the closed loop optimization is

◦ The criterion of the closed-loop optimization is based on minimizing the perceptually weighted reconstructed speech error, which can be

(33)

MBE Vocoder

y MELP: also based on LPC analysis uses a y MELP: also based on LPC analysis, uses a

multiband soft-decision model for the excitation signal g

y The LP residue is bandpassed and a voicing

strength parameter is estimated for each band

y Speech can be then reconstructed by passing the

excitation through the LPC synthesis _ﬁlter

y Di erently from MBE, MELP divides the

excitation into _ﬁve _ﬁxed bands of 0-500,

(34)

MELP

MELP Vocoder

Vocoder (Cont'd)

(Cont'd)

MELP

MELP Vocoder

Vocoder (Cont d)

(Cont d)

y A voice degree parameter is estimated in each y A voice degree parameter is estimated in each

band based on the normalized correlation

function of the speech signal and the smoothed p g recti_ﬁed signal in the non-DC band

y Let s_k(n) denote the speech signal in band k, u_k(n)

(35)

MELP Vocoder (Cont'd)

MELP Vocoder (Cont d)

y

MELP adopts a jittery voiced state to

y

MELP adopts a jittery voiced state to

simulate the marginal voiced speech

segments – indicated by an aperiodic

g

y

p

_ﬂ

ag

g

y

The jittery state is determined by the

peakiness of the full-wave recti

_ﬁ

ed LP

d

( )

residue e(n):

y

If peakiness is greater than some threshold,

(36)

13 4 Further Exploration

13.4 Further Exploration

→ Link to Further Exploration for Chapter 13. Link to Further Exploration for Chapter 13.

y _{A comprehensive introduction to speech coding may}

f S '

be found in Spanias's excellent article in the textbook references.

y _{Sun Microsystems Inc has made available the code} y _{Sun Microsystems, Inc. has made available the code}

for its implementation on standards G.711, G.721, and G.723, in C. The code can be found from the link in the Chapter 13 _ﬁle in the Further Exploration

in the Chapter 13 _ﬁle in the Further Exploration section of the text website

y _{The ITU sells standards; it is linked to from the text}

website for this Chapter