Chapter 13 Basic
Chapter 13 Basic
Audio
Audio
Compression
Compression
Techniques
Techniques
13 1 ADPCM in Speech Coding
13.1 ADPCM in Speech Coding
13.2 G.726 ADPCM
13.3 Vocoders
13 1 ADPCM in Speech Coding
13 1 ADPCM in Speech Coding
13.1 ADPCM in Speech Coding
13.1 ADPCM in Speech Coding
y
ADPCM
forms the heart of the ITU's
yADPCM
forms the heart of the ITU s
speech compression standards
G.721,
G.723, G.726, and G.727.
y
The di erence between these standards
involves the
bit-rate
(from 3 to 5 bits per
sample) and
some algorithm details
sample) and
some algorithm details.
13 2 G 726 ADPCM
13 2 G 726 ADPCM
13.2 G.726 ADPCM
13.2 G.726 ADPCM
y
ITU G 726 supersedes ITU standards
yITU G.726 supersedes ITU standards
G.721 and G.723.
y
Rationale: works by
adapting a
fi
xed
quantizer
in a simple way. The di erent
q
p
y
y
The standard de
fi
nes a
multiplier constant
yThe standard de
fi
nes a
multiplier constant
y
The input value is a
ratio
of
a di erence
ith th
f t
α
with the
factor
α
.
Backward Adaptive Quantizer
Backward Adaptive Quantizer
Backward Adaptive Quantizer
Backward Adaptive Quantizer
y Backward adaptive works in principle by noticing Backward adaptive works in principle by noticing
either of the cases:
◦ too many values are quantized to values far from ld h if i i i f
zero – would happen if quantizer step size in f were
too small.
◦ too many values fall close to zero too much of the y time
x would happen if the quantizer step size were too large.
y Jayant quantizer allows one to adapt a backward y Jayant quantizer allows one to adapt a backward
quantizer step size after receiving just one single output.
◦ Jayant quantizer simply expands the step size if the quantized input is in the outer levels of the quantizer, and reduces the step size if the input is near zero.
The Step Size of Jayant Quantizer
The Step Size of Jayant Quantizer
The Step Size of Jayant Quantizer
The Step Size of Jayant Quantizer
y Jayant quantizer assigns Jayant quantizer assigns multiplier values Mmultiplier values Mkk to to
each level, with values smaller than unity for levels
near zero, and values larger than 1 for the outer le els
levels.
y For signal fn, the quantizer step size ∆ is changed
according to the quantized value k, for the according to the quantized value k, for the
previous signal value fn−1, by the simple formula
y Since it is the quantized version of the signal that
G.726
G.726 —
— Backward Adaptive Jayant
Backward Adaptive Jayant
Quantizer
Quantizer
y G.726G.726 uses uses fifixed quantizer steps based on the xed quantizer steps based on the
logarithm of the input difference signal, en divided by
y The "unlocked" part adapts via the equationp p q
y The locked part is slightly modip g y fied from the unlocked partp :
y The G.726 predictor is quite complicated: it uses a linear p q p
combination of 6 quantized differences and two
13 3 Vocoders
13 3 Vocoders
13.3 Vocoders
13.3 Vocoders
y Vocoders – voice coders which cannot be y Vocoders voice coders, which cannot be
usefully applied when other analog signals, such as modem signals, are in use. g
◦ concerned with modeling speech so that the salient features are captured in as few bits as possible.
◦ use either a model of the speech waveform in time (LPC (Linear Predictive Coding) vocoding), or ... → ◦ break down the signal into frequency components
◦ break down the signal into frequency components
and model these (channel vocoders and formant vocoders).
y Vocoder simulation of the voice is not very good
Phase Insensitivity
Phase Insensitivity
Phase Insensitivity
Phase Insensitivity
y A complete reconstituting of speech waveform is A complete reconstituting of speech waveform is
really unnecessary, perceptually: all that is needed is for the amount of energy at any time to be
ab t ri ht and the si nal ill s nd ab t ri ht about right, and the signal will sound about right.
y Phase is a shift in the time argument inside a
function of time. function of time.
◦ Suppose we strike a piano key, and generate a roughly sinusoidal sound cos(ωt), with ω =2πf.
◦ Now if we wait sufficient time to generate a phase shift π/2and then strike another key, with sound
cos(2( ωt + π/2), we generate a waveform like the solid ) g line in Fig. 13.3.
y
If we did not wait before striking the
y– If we did not wait before striking the
second note, then our waveform would
be cos(
ω
t)+cos(2
ω
t). But perceptually,
the two notes would sound the same
sound, even though in actuality they
would be shifted in phase.
Channel Vocoder
Channel Vocoder
Channel Vocoder
Channel Vocoder
y Vocoders can operate at Vocoders can operate at low bit-rateslow bit rates, , 1–2 kbps1 2 kbps. To . To
y Due to Due to Phase Insensitivity Phase Insensitivity (i.e. (i.e. only the energy is only the energy is important):
◦ The waveform is "rectified" to its absolute value.
Th fil b k d i l i l l f h
◦ The filter bank derives relative power levels for each frequency range.
◦ A subband coder would not rectify the signal, and would
id f b d
use wider frequency bands.
y A channel vocoder also analyzes the signal to
determine the general pitch g p of the speech (p (low — bass, or high — tenor), and also the excitation of the speech.
y A channel vocoder applies a vocal tract transfer y A channel vocoder applies a vocal tract transfer
Formant
Formant Vocoder
Vocoder
Formant
Formant Vocoder
Vocoder
y FormantsFormants: the : the salient frequency salient frequency components that are components that are
present in a sample of speech, as shown in Fig 13.5.
y Rationale: encoding only the most important frequencies
Linear Predictive Coding (LPC)
Linear Predictive Coding (LPC)
Linear Predictive Coding (LPC)
Linear Predictive Coding (LPC)
y LPC vocodersLPC vocoders extract salient features extract salient features of speech of speech
directly from the waveform, rather than
transforming the signal to the frequency domain
LPC F
y LPC Features:
◦ uses a time-varying model of vocal tract sound generated from a given excitation
generated from a given excitation
◦ transmits only a set of parameters modeling the
shape and excitation of the vocal tract, not actual
i l diff ll bit t
signals or differences small bit-rate
y About "Linear": The speech signal generated by
the output vocal tract model is calculated as a the output vocal tract model is calculated as a function of the current speech output plus a
LPC Coding Process
LPC Coding Process
LPC Coding Process
LPC Coding Process
y
LPC
starts
by deciding whether the current
yLPC
starts
by deciding whether the current
segment is
voiced or unvoiced:
◦ For unvoiced: a wide-band noise generator g is
used to create sample values f(n) that act as input to the vocal tract simulator
◦ For voiced: a pulse train generator creates values
◦ For voiced: a pulse train generator creates values
f(n)
◦ Model parameters ap ii: calculated by using a y g
least-squares set of equations that minimize the
di erence between the actual speech and the speech generated by the vocal tract model speech generated by the vocal tract model, excited by the noise or pulse train generators
LPC Coding Process (cont'd)
LPC Coding Process (cont'd)
LPC Coding Process (cont d)
LPC Coding Process (cont d)
y
If the output values generate s(n) for
yIf the output values generate s(n), for
input values f(n), the output depends on
p
previous output sample values
:
LP
b
l l
d b
LPC Coding Process (cont'd)
LPC Coding Process (cont'd)
LPC Coding Process (cont d)
LPC Coding Process (cont d)
y
By taking the
derivative of a
and
setting
yBy taking the
derivative of a
iand
setting
it to zero
, we get a set of J equations:
LPC Coding Process (cont'd)
LPC Coding Process (cont'd)
LPC Coding Process (cont d)
LPC Coding Process (cont d)
y
An often used method to
calculate LP
yAn often-used method to
calculate LP
LPC Coding Process (cont'd)
LPC Coding Process (cont'd)
LPC Coding Process (cont d)
LPC Coding Process (cont d)
y
Since
φ
(i j) can be de
fi
ned as
φ
(i j)= R(|i
−
ySince
φ
(i, j) can be de
fi
ned as
φ
(i, j) R(|i
j|), and when R(0)
≥
0, the matrix {
φ
(i, j)} is
positive symmetric, there exists a fast
p
y
LPC Coding Process (cont'd)
LPC Coding Process (cont'd)
LPC Coding Process (cont d)
LPC Coding Process (cont d)
y
Gain G
can be calculated:
yGain G
can be calculated:
Code Excited Linear Prediction
Code Excited Linear Prediction
(CELP)
(CELP)
y CELP is a more complex family of coders that CELP is a more complex family of coders that
attempts to mitigate the lack of quality of the simple LPC model
CELP l d f h
y CELP uses a more complex description of the
excitation:
◦ An entire set (a codebook) of excitation vectors is
◦ An entire set (a codebook) of excitation vectors is matched to the actual speech, and the index of the best match is sent to the receiver
Th l i i h bi 4 800 9 600
◦ The complexity increases the bit-rate to 4,800-9,600 bps
◦ The resulting speech is perceived as being g p p g more similar and continuous
The Predictors for CELP
The Predictors for CELP
The Predictors for CELP
The Predictors for CELP
y
CELP coders contain two kinds of prediction:
yCELP coders contain two kinds of prediction:
◦ LTP (Long time prediction): try to reduce
d d i h i l b fi di th b i
redundancy in speech signals by finding the basic periodicity or pitch that causes a waveform that more or less repeats
more or less repeats
◦ STP (Short Time Prediction): try to eliminate the
Relationship between STP and LTP
Relationship between STP and LTP
Relationship between STP and LTP
Relationship between STP and LTP
y
STP
captures the
formant structure
of the
ySTP
captures the
formant structure
of the
short-term speech spectrum based on
only a
few samples
p
y
LTP,
following STP, recovers the long-term
correlation in the speech signal that
h
d
h
represents the periodicity in speech using
whole frames or subframes (1/4 of a frame)
LTP i ft
i
l
t d " d ti
– LTP is often implemented as "adaptive
codebook searching"
y
Fig 13 6 shows the relationship between
STP
yFig. 13.6 shows the relationship between
STP
Adaptive Codebook Searching
Adaptive Codebook Searching
Adaptive Codebook Searching
Adaptive Codebook Searching
y
Rationale:
yRationale:
◦
Look in a codebook of waveforms to
fi
nd one
that matches the current subframe
◦
Codeword: a shifted speech residue segment
indexed by the by the lag
τ
corresponding to
the current speech frame or subframe in the
adaptive codebook
Open
Open Loop Codeword Searching
Loop Codeword Searching
Open
Open--Loop Codeword Searching
Loop Codeword Searching
y
Tries to minimize the
long term
yTries to minimize the
long-term
LZW Close
LZW Close--Loop Codeword
Loop Codeword
Searching
Searching
y Closed-loop search is more often used in CELP y Closed-loop search is more often used in CELP
coders — also called Analysis-By-Synthesis (A-B-S)
y speech is reconstructed and perceptual error for speech is reconstructed and perceptual error for
that is minimized via an adaptive codebook search, rather than simply considering sum-of-squares
y The best candidate in the adaptive codebook is
selected to minimize the distortion of locally reconstructed speech
y Parameters are found by minimizing a measure of
h di b h i i l d h
Hybrid Excitation Vocoder
Hybrid Excitation Vocoder
Hybrid Excitation Vocoder
Hybrid Excitation Vocoder
y Hybrid Excitation Vocoders are di erent from y Hybrid Excitation Vocoders are di erent from
CELP in that they use model-based methods to introduce multi-model excitation
y includes two major types:
◦ MBE (Multi-Band Excitation): ( ) a blockwise codec, in which a speech analysis is carried out in a speech frame unit of about 20 msec to 30 msec
MELP (M ltib d E it ti Li P di ti ) h
◦ MELP (Multiband Excitation Linear Predictive) speech codec is a new US Federal standard to replace the old LPC-10 (FS1015) standard with the application focus ( ) pp on very low bit rate safety communications
◦ a DoD speech coding standard used mainly in military applications and satellite communications
MELP Vocoder
MELP Vocoder
MELP Vocoder
MELP Vocoder
y
MBE utilizes the A-B-S scheme in parameter
yMBE utilizes the A B S scheme in parameter
estimation:
◦ The parameters such as basic frequency, p q y,
spectrum envelope, and sub-band U/V decisions are all done via closed-loop searching
◦ The criterion of the closed loop optimization is
◦ The criterion of the closed-loop optimization is based on minimizing the perceptually weighted reconstructed speech error, which can be
MBE Vocoder
MBE Vocoder
MBE Vocoder
MBE Vocoder
y MELP: also based on LPC analysis uses a y MELP: also based on LPC analysis, uses a
multiband soft-decision model for the excitation signal g
y The LP residue is bandpassed and a voicing
strength parameter is estimated for each band
y Speech can be then reconstructed by passing the
excitation through the LPC synthesis filter
y Di erently from MBE, MELP divides the
excitation into five fixed bands of 0-500,
MELP
MELP Vocoder
Vocoder (Cont'd)
(Cont'd)
MELP
MELP Vocoder
Vocoder (Cont d)
(Cont d)
y A voice degree parameter is estimated in each y A voice degree parameter is estimated in each
band based on the normalized correlation
function of the speech signal and the smoothed p g rectified signal in the non-DC band
y Let sk(n) denote the speech signal in band k, uk(n)
MELP Vocoder (Cont'd)
MELP Vocoder (Cont'd)
MELP Vocoder (Cont d)
MELP Vocoder (Cont d)
y
MELP adopts a jittery voiced state to
yMELP adopts a jittery voiced state to
simulate the marginal voiced speech
segments – indicated by an aperiodic
g
y
p
fl
ag
g
y
The jittery state is determined by the
peakiness of the full-wave recti
fi
ed LP
d
( )
residue e(n):
y
If peakiness is greater than some threshold,
13 4 Further Exploration
13 4 Further Exploration
13.4 Further Exploration
13.4 Further Exploration
→ Link to Further Exploration for Chapter 13. Link to Further Exploration for Chapter 13.
y A comprehensive introduction to speech coding may
f S '
be found in Spanias's excellent article in the textbook references.
y Sun Microsystems Inc has made available the code y Sun Microsystems, Inc. has made available the code
for its implementation on standards G.711, G.721, and G.723, in C. The code can be found from the link in the Chapter 13 file in the Further Exploration
in the Chapter 13 file in the Further Exploration section of the text website
y The ITU sells standards; it is linked to from the text
website for this Chapter