
Complete Discrete-Time Model

From the document Discrete-Time Speech Signal Processing (pages 169-174)


4.4 A Discrete-Time Model Based on Tube Concatenation

4.4.3 Complete Discrete-Time Model

EXAMPLE 4.5 Let the vocal tract length $l = 17.5$ cm and the speed of sound $c = 350$ m/s.

We want to find the number of tube sections $N$ required to cover a bandwidth of 5000 Hz, i.e., the excitation bandwidth and the vocal tract bandwidth are 5000 Hz. Recall that $\tau = \frac{l}{cN}$ and that $\frac{2\pi}{4\tau}$ is the cutoff bandwidth in rad/s. Therefore, we want $\frac{1}{4\tau} = 5000$ Hz. Solving for $\tau$, the delay across a single tube, gives $\tau = \frac{1}{20000}$ s. Thus, from above we have $N = \frac{l}{c\tau} = 10$. Since $N$ is also the order of the all-pole denominator, we can model up to $\frac{N}{2} = 5$ complex-conjugate pole pairs. We can also think of this as modeling one resonance per 1000 Hz.
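The arithmetic of Example 4.5 can be checked with a few lines of Python. This is only a sketch: the function name `tube_sections` and its argument names are our own and do not come from the text.

```python
def tube_sections(length_m, sound_speed_m_s, bandwidth_hz):
    """Return (tau, N) for a tube model covering bandwidth_hz.

    tau follows from 1 / (4 * tau) = bandwidth, and N from tau = l / (c * N).
    """
    tau = 1.0 / (4.0 * bandwidth_hz)               # delay across one section (s)
    n_sections = length_m / (sound_speed_m_s * tau)
    return tau, n_sections


# Values of Example 4.5: l = 17.5 cm, c = 350 m/s, bandwidth 5000 Hz.
tau, n = tube_sections(0.175, 350.0, 5000.0)
print("tau (s):", tau)
print("sections N:", round(n))
print("pole pairs N/2:", round(n) // 2)
```

Doubling the bandwidth with the same tract length halves $\tau$ and therefore doubles the number of sections.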

We see that the all-pole transfer function is a function of only the reflection coefficients of the original concatenated tube model, and that the reflection coefficients are a function of the cross-sectional area functions of each tube, i.e., $r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}$. Therefore, if we could estimate the area functions, we could then obtain the all-pole discrete-time transfer function. An example of this transition from the cross-sectional areas $A_k$ to $V(z)$ is given in the following example:

EXAMPLE 4.6 This example compares the concatenated tube method with the Portnoff numerical solution using coupled partial differential equations [26]. Because an infinite glottal impedance is assumed, the only loss in the system is at the lips via the radiation impedance. This can be introduced, as we saw above, with an infinitely long $(N+1)$th tube, depicted in Figure 4.19 with a terminating cross-sectional area selected to match the radiation impedance, according to Equation (4.35), so that $r_N = r_L$. By altering this last reflection coefficient, we can change the energy loss in the system and thus control the bandwidths. For example, we see in Figure 4.19 the two different cases of $r_N = 0.714$ (non-zero bandwidths) and $r_L = 1.0$ (zero bandwidths). This example summarizes in effect all we have seen up to now by comparing two discrete-time realizations of the vocal tract transfer function that have similar frequency responses: (1) a numerical simulation, derived with central difference approximations to partial derivatives in time and space, and (2) a (spatially) discretized concatenated tube model that maps to discretized time.
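The effect of the last reflection coefficient can be sketched numerically. The code below is an illustration under stated assumptions, not the procedure used for Figure 4.19: the uniform 5 cm^2 area function is hypothetical (the area data for /a/ are not available here), the step-up recursion $D_k(z) = D_{k-1}(z) + r_k z^{-k} D_{k-1}(z^{-1})$ for the all-pole denominator is assumed rather than taken from this section, and the sampling rate $1/(2\tau) = 10000$ Hz is assumed from the values of Example 4.5.

```python
import cmath
import math


def reflection_coefficients(areas, terminating_area):
    """r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k), ordered from glottis to lips.

    The last coefficient r_N uses the terminating (N+1)th area.
    """
    extended = list(areas) + [terminating_area]
    return [(nxt - cur) / (nxt + cur) for cur, nxt in zip(extended, extended[1:])]


def denominator(refl):
    """ASSUMED step-up recursion; returns [d_0, ..., d_N] for z^0 ... z^-N."""
    d = [1.0]
    for k, r in enumerate(refl, start=1):
        padded = d + [0.0]                       # D_{k-1} padded to order k
        d = [padded[i] + r * padded[k - i] for i in range(k + 1)]
    return d


def evaluate(d, z):
    """Evaluate D(z) = sum_i d_i z^-i at a complex point z."""
    return sum(coef * z ** (-i) for i, coef in enumerate(d))


N = 10
areas = [5.0] * N                                # hypothetical uniform tube (cm^2)
lossy_r = reflection_coefficients(areas, 30.0)   # 30 cm^2 terminating section
lossless_r = lossy_r[:-1] + [1.0]                # r_N = 1.0: no loss at the lips
d_lossy = denominator(lossy_r)
d_lossless = denominator(lossless_r)

fs = 10000.0                                     # ASSUMED sampling rate 1/(2*tau)
z_res = cmath.exp(2j * math.pi * 500.0 / fs)     # first uniform-tube resonance
print("r_N with loss:", round(lossy_r[-1], 3))
print("|D| at 500 Hz, lossless:", abs(evaluate(d_lossless, z_res)))
print("|D| at 500 Hz, with loss:", abs(evaluate(d_lossy, z_res)))
```

For this uniform tube the denominator reduces to $1 + r_N z^{-N}$. With $r_N = 1.0$ it vanishes on the unit circle at 500 Hz (a zero-bandwidth resonance), while with a 30 cm^2 termination $r_N = 25/35 \approx 0.714$ and the poles move inside the unit circle, giving finite bandwidths.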


[Figure 4.19 plots not reproduced. Panels: (a) area (cm^2) versus distance (Δx = 1.75 cm); (b) reflection coefficient amplitude versus distance (Δx = 1.75 cm); (c) and (d) 20 log V_a(Ω) versus frequency, 0 to 5000 Hz. Curve labels as extracted: r_N = .071 and r_N = 1.0. The figure also tabulates the formants of /a/:]

Formant   Frequency (Hz)   Bandwidth (Hz)
1st       650.3            94.1
2nd       1075.7           91.4
3rd       2463.1           107.4
4th       3558.3           198.7
5th       4631.3           89.8

Figure 4.19 Comparison of the concatenated tube approximation with the “exact” solution for the area function (estimated by Fant [7]) of the Russian vowel /a/ [26],[28]: (a) cross-section $A(x)$ for a vocal tract model with 10 lossless sections and terminated with a 30 cm^2 section that does not reflect; (b) reflection coefficients $r_k$ for 10 sections; (c) frequency response of the concatenated tube model, where the solid curve corresponds to the lossless termination (zero bandwidths) and the dashed curve corresponds to the condition with loss (finite bandwidths); (d) frequency response derived from numerical simulation of Portnoff.

SOURCE: M.R. Portnoff, A Quasi-One-Dimensional Digital Simulation for the Time-Varying Vocal Tract [26].

© 1973, M.R. Portnoff and the Massachusetts Institute of Technology. Used by permission.

$R(z)$ denotes the discrete-time radiation impedance and $V(z)$ is the discrete-time all-pole vocal tract transfer function from the volume velocity at the glottis to volume velocity at the lips.

The radiation impedance $R(z) = Z_r(z)$ is a discrete-time counterpart to the analog radiation impedance $Z_r(s)$. You will show in Exercise 4.20 that $R(z) \approx 1 - z^{-1}$ and thus acts as approximately a differentiation of volume velocity to obtain pressure, introducing about a 6 dB/octave highpass effect. Although $R(z)$ is derived as a single zero on the unit circle, it is more realistically modeled as a zero slightly inside the unit circle, i.e., $R(z) \approx 1 - \alpha z^{-1}$ with $\alpha < 1$, because near-field measurements at the lips do not give quite the 6 dB/octave rolloff predicted by a zero on the unit circle [8]. The analog $Z_r(s)$ was derived by Flanagan under the assumption of pressure measurements in the far field, i.e., “sufficiently” far from the source [8]. Considering the pressure/volume velocity relation at the lips as a differentiator, the speech pressure waveform in continuous time, $x(t)$, measured in front of the lips can be expressed as
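As a rough numerical illustration of the 6 dB/octave behavior, we can compare the gain of $R(z) = 1 - \alpha z^{-1}$ at a low frequency and at twice that frequency. The function names and the values $\alpha = 0.95$ and $\omega = 0.02\pi$ below are our own choices for the sketch.

```python
import cmath
import math


def radiation_gain_db(alpha, omega):
    """Gain in dB of R(z) = 1 - alpha * z^-1 evaluated at z = exp(j * omega)."""
    return 20.0 * math.log10(abs(1.0 - alpha * cmath.exp(-1j * omega)))


def slope_db_per_octave(alpha, omega):
    """Gain change when the frequency doubles from omega to 2 * omega."""
    return radiation_gain_db(alpha, 2.0 * omega) - radiation_gain_db(alpha, omega)


omega = 0.02 * math.pi                     # a low frequency (radians/sample)
slope_on_circle = slope_db_per_octave(1.0, omega)
slope_inside = slope_db_per_octave(0.95, omega)
print("zero on the unit circle (dB/octave):", round(slope_on_circle, 2))
print("zero at alpha = 0.95 (dB/octave):", round(slope_inside, 2))
```

With the zero on the unit circle the low-frequency slope is close to 6 dB/octave. Moving the zero inside the unit circle flattens the response near zero frequency, so the measured slope falls short of 6 dB/octave.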

$$x(t) \approx A \frac{d}{dt}\left[u_g(t) * v(t)\right] = A \left[\frac{d}{dt} u_g(t)\right] * v(t),$$

where the gain $A$ controls loudness. (The reader should prove this equality.) The effect of radiation is therefore typically included in the source function; the source to the vocal tract becomes the derivative of the glottal flow volume velocity, often referred to as the glottal flow derivative, i.e., the source is thought of as$^{18}$ $\frac{d}{dt} u_g(t)$ rather than $u_g(t)$.
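A discrete-time analog of this equality can be checked numerically, with the first difference $[1, -1]$ standing in for $d/dt$. This is a sanity check rather than a proof, and the sample sequences below are hypothetical.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


first_difference = [1.0, -1.0]             # discrete stand-in for d/dt
u_g = [0.0, 0.2, 0.6, 1.0, 0.7, 0.0]       # hypothetical glottal flow samples
v = [1.0, -0.5, 0.25]                      # hypothetical vocal tract response

# Differentiate after the vocal tract, or differentiate the source first.
differentiate_last = convolve(first_difference, convolve(u_g, v))
differentiate_first = convolve(convolve(first_difference, u_g), v)
max_error = max(abs(a - b) for a, b in zip(differentiate_last, differentiate_first))
print("largest difference:", max_error)
```

The two orderings agree because all three operations are convolutions, which are associative and commutative for linear time-invariant components.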

The discrete-time speech production model for periodic, noise, and impulsive sound sources is illustrated in Figure 4.20. Consider first the periodic (voiced) speech case. For an input consisting of glottal airflow over a single glottal cycle, the $z$-transform of the speech output is expressed as

$$X(z) = A_v G(z) H(z) = A_v G(z) V(z) R(z)$$

where $A_v$ again controls the loudness of the sound and is determined by the subglottal pressure, which increases as we speak louder, thus increasing the volume velocity at the glottis. $G(z)$ is the $z$-transform of the glottal flow input, $g[n]$, over one cycle, which may differ with the particular speech sound (i.e., phone), speaker, and speaking style. $R(z)$ is the radiation impedance that we model as a single zero, $R(z) = 1 - \alpha z^{-1}$, and $V(z)$ is a stable all-pole vocal tract transfer function from the volume velocity at the glottis to the volume velocity at the lips, which is also a function of the particular speech sound, speaker, and speaking style.

[Figure 4.20 diagram not reproduced. Block labels: gains $A_v$, $A_n$, $A_i$, each applied by a multiplier; $G(z)$ (poles) and $R(z)$ (zero); $H(z) = V(z)R(z)$ with $V(z)$ (poles and zeros) and $R(z)$ (zero); linear/nonlinear combiner; speech output.]

Figure 4.20 Overview of the complete discrete-time speech production model.

18 Moving the differentiation to the source holds strictly only when the speech production components are linear. We saw in Chapter 2 that components of a nonlinear system are not necessarily commutative.

We have seen in Chapter 3 that an approximation of a typical glottal flow waveform over one cycle is of the form

$$g[n] = (\beta^{-n} u[-n]) * (\beta^{-n} u[-n]),$$

i.e., two time-reversed exponentially decaying sequences, which has $z$-transform

$$G(z) = \frac{1}{(1 - \beta z)^2}$$

which for real $\beta < 1$ represents two identical poles outside the unit circle. This model assumes infinite glottal impedance, i.e., no loss at the glottis [no loss at the glottis allowed us in the previous section to obtain an all-pole model for $V(z)$]. All loss in the system is assumed to occur by radiation at the lips. For the voiced case, the $z$-transform at the output of the lips can then be written over one glottal cycle as a rational function with poles inside and outside the unit circle and a single zero inside the unit circle, i.e.,
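The two-pole glottal model can be explored with a short sketch. Assuming the glottal flow is the convolution of two copies of $\beta^{-n} u[-n]$, the convolution has the closed form $g[n] = (1 - n)\beta^{-n}$ for $n \le 0$, which the code below verifies on a truncated time axis. The value $\beta = 0.9$ and the truncation at $n = -40$ are arbitrary choices for illustration.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


beta = 0.9                                   # hypothetical value, 0 < beta < 1
n_min = -40                                  # truncation of the anticausal axis
times = list(range(n_min, 1))                # n = n_min, ..., 0
one_sided = [beta ** (-n) for n in times]    # beta^{-n} u[-n], truncated

full = convolve(one_sided, one_sided)        # covers n = 2 * n_min, ..., 0
g = full[-len(times):]                       # n = n_min, ..., 0 (exact on this range)
closed_form = [(1 - n) * beta ** (-n) for n in times]
max_error = max(abs(a - b) for a, b in zip(g, closed_form))

double_pole = 1.0 / beta                     # root of (1 - beta * z)^2
print("largest deviation from closed form:", max_error)
print("double pole of G(z) at z =", double_pole)
```

The sequence is nonzero only for $n \le 0$ and drops abruptly to zero after $n = 0$, and the double pole at $z = 1/\beta$ lies outside the unit circle whenever $0 < \beta < 1$.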

$$X(z) = A_v G(z) V(z) R(z) = A_v \frac{(1 - \alpha z^{-1})}{(1 - \beta z)^2 \prod_{k=1}^{C_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})} \qquad (4.39)$$

where we assume $C_i$ pole pairs of $V(z)$ inside the unit circle. This rational function, along with a uniformly spaced impulse train to impart periodicity to the input, is illustrated in the upper branch of Figure 4.20. Observe in this rational function that $V(z)$ and $R(z)$ are minimum-phase, while $G(z)$, as we have modeled it, is maximum-phase, having two poles outside the unit circle. Referring to our discussion in Chapter 2 of frequency-domain phase properties of a sequence, we can deduce that the glottal flow input is responsible for a gradual “attack” to the speech waveform within a glottal cycle during voicing. We return to this important characteristic in later chapters when we develop methods of speech analysis and synthesis.
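To make the pole-zero structure of Equation (4.39) concrete, the sketch below places the pole pairs $c_k$ using the formant frequencies and bandwidths tabulated in Figure 4.19. Several assumptions are ours and not from the text: the sampling rate of 10000 Hz, the values of $\alpha$ and $\beta$, and the common approximation that a resonance of bandwidth $B$ corresponds to a pole radius $e^{-\pi B / f_s}$.

```python
import cmath
import math

fs = 10000.0                     # ASSUMED sampling rate (Hz)
alpha, beta = 0.98, 0.9          # hypothetical radiation zero and glottal parameter
formants_hz = [650.3, 1075.7, 2463.1, 3558.3, 4631.3]    # from Figure 4.19
bandwidths_hz = [94.1, 91.4, 107.4, 198.7, 89.8]         # from Figure 4.19


def resonator_pole(freq_hz, bw_hz):
    """Upper-half-plane pole, using the ASSUMED radius exp(-pi * B / fs)."""
    return cmath.exp(-math.pi * bw_hz / fs + 2j * math.pi * freq_hz / fs)


c = [resonator_pole(f, b) for f, b in zip(formants_hz, bandwidths_hz)]


def speech_transform(z, gain=1.0):
    """Evaluate the right-hand side of Equation (4.39) at a complex point z."""
    den = (1 - beta * z) ** 2
    for ck in c:
        den *= (1 - ck / z) * (1 - ck.conjugate() / z)
    return gain * (1 - alpha / z) / den


poles_inside = c + [ck.conjugate() for ck in c]   # V(z): minimum-phase part
poles_outside = [1.0 / beta, 1.0 / beta]          # G(z): maximum-phase part
print("largest |c_k|:", max(abs(p) for p in poles_inside))
print("glottal double pole:", poles_outside[0])
print("|X| at the radiation zero:", abs(speech_transform(alpha)))
```

All ten vocal tract poles fall inside the unit circle, the glottal double pole falls outside it, and the function vanishes at the radiation zero $z = \alpha$.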

If we apply the approximate differentiation of the radiation load to the glottal input during voicing, we obtain a “source” function illustrated in Figure 4.21 for a typical glottal airflow over one glottal cycle. In the glottal flow derivative, a rapid closing of the vocal folds results in a large negative impulse-like response, called the glottal pulse, which occurs at the end of the open phase and during the return phase of the glottal cycle, as shown in Figure 4.21. The glottal pulse is sometimes considered the primary excitation for voiced speech, and has a wide bandwidth due to its impulse-like nature [1]. (Note that we are using the term “glottal pulse” more strictly than in Chapter 3.) We have also illustrated this alternative source perspective in the upper branch of Figure 4.20 by applying $R(z)$ just after $G(z)$.

[Figure 4.21 plot not reproduced. Panels: (a) $u_g(t)$ and (b) its derivative, both versus time, annotated with the open phase, closed phase, return phase, and the glottal pulse.]

Figure 4.21 Schematic of relation between (a) the glottal airflow and (b) the glottal flow derivative over a glottal cycle.

Two other inputs to the vocal tract are noise and impulsive sources. When the source is noise, as, for example, with fricative consonants, then the source is no longer a periodic glottal airflow sequence, but rather a random sequence with typically a flat spectrum, i.e., white noise, although this noise may be colored by the particular constriction and shape of the oral tract. The output $z$-transform at the lips is then expressed by

$$X(z) = A_n U(z) H(z) = A_n U(z) V(z) R(z)$$

where $U(z)$ denotes the $z$-transform of a noise sequence, $u[n]$. The third source is the “burst” which occurs during plosive consonants and which for simplicity we have modeled as an impulse. The output $z$-transform at the lips for the impulsive input is given by

$$X(z) = A_i H(z) = A_i V(z) R(z).$$

The noise and impulse sources are shown in the lower two branches of Figure 4.20. Observe that all sources can occur simultaneously as with, for example, voiced fricatives, voiced plosives, and aspirated voicing. Furthermore, these sources may not be simply linearly combined, as we saw in the voiced fricative model of Example 3.4 of Chapter 3, a possibility that we have represented by the linear/nonlinear combiner element in Figure 4.20.
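The three branches of the model can be sketched in a few lines for the purely linear case of the combiner. Everything specific here is hypothetical: the single-resonance $V(z)$, the gains, the pitch period of 80 samples, the delayed and truncated glottal pulse shape, and the function names. A nonlinear combination, as in the voiced fricative model, is not covered by this sketch.

```python
import math
import random


def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def vocal_tract_and_radiation(source, a, alpha):
    """Apply V(z) = 1 / (1 - sum_k a_k z^-k), then R(z) = 1 - alpha * z^-1."""
    v_out = []
    for n, s in enumerate(source):
        acc = s
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * v_out[n - k]
        v_out.append(acc)
    return [v_out[n] - alpha * (v_out[n - 1] if n > 0 else 0.0)
            for n in range(len(v_out))]


length, period = 200, 80
alpha, beta = 0.98, 0.8
radius, theta = 0.95, 2.0 * math.pi * 650.0 / 10000.0
a = [2.0 * radius * math.cos(theta), -radius ** 2]      # one hypothetical resonance

# Voiced source: impulse train shaped by a delayed, truncated glottal pulse.
glottal = [(21 - i) * beta ** (20 - i) for i in range(21)]
train = [1.0 if n % period == 0 else 0.0 for n in range(length)]
voiced = convolve(train, glottal)[:length]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(length)]     # fricative-like source
burst = [1.0] + [0.0] * (length - 1)                     # plosive-like impulse

A_v, A_n, A_i = 1.0, 0.1, 0.5                            # hypothetical gains
branches = [[gain * s for s in src]
            for gain, src in ((A_v, voiced), (A_n, noise), (A_i, burst))]
outputs = [vocal_tract_and_radiation(b, a, alpha) for b in branches]
speech = [sum(samples) for samples in zip(*outputs)]     # linear combiner only
print("first output samples:", [round(s, 3) for s in speech[:4]])
```

Because each branch passes through the same linear $H(z) = V(z)R(z)$, filtering the summed sources gives the same result as summing the filtered branches. That equivalence is what breaks down when the combiner is nonlinear.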

In the noise and impulse source state, oral tract constrictions may give zeros (absorption of energy by back-cavity anti-resonances) as well as poles. Zeros in the transfer function also occur for nasal consonants, as well as for nasalized vowels. Methods have been developed to compute the transfer function of these configurations [16],[17]. These techniques are based on concatenated tube models with continuity constraints at tube junctions and boundary conditions similar to those used in this chapter for obtaining the transfer function of the oral tract. (A number of simplifying cases are studied in Exercises 4.7 and 4.12.) In these cases, the vocal tract transfer function $V(z)$ has poles inside the unit circle, but may have zeros inside and outside
