5.1.2 Proposed Learning Mechanism of Effective Dictionary
The effectiveness of sparse representation depends on the creation of a suitable dictionary $\mathbf{D}$, as illustrated in Eq. (5.1). In this work, two different learning mechanisms are proposed for the estimation of an effective dictionary. The first learning mechanism exploits the well-known K-SVD algorithm to create an invariable-size global dictionary. In addition, we exploit information about the duration parameter of the speech utterance to construct an exemplar dictionary; incorporating the duration parameter can be interpreted as sparse representation over an utterance-specific adaptive dictionary, as described in the following.
Creation of Invariable-Size Global Dictionary: The learning mechanism of the invariable-size global dictionary incorporates the K-SVD algorithm. K-SVD, a generalization of the K-means clustering method, is the most widely used algorithm for learning a redundant dictionary for sparse representation [76].
As discussed earlier, the sparse representation of the LPCs of neutral and stressed speech utterances over a dictionary comprising the LPCs of neutral speech helps in reducing the acoustic mismatch between them. Therefore, the set of LP-vectors $\{\mathbf{a}_f\}_{f=1}^{N}$ of neutral speech utterances, arranged in $\mathbf{Y}$, is used to estimate the dictionary, as depicted in the block diagram shown in Figure 5.3.
The K-SVD algorithm constructs the dictionary for the best sparse representation of the training vectors (the LP-vectors in $\mathbf{Y}$) under a sparsity constraint, using the following objective function,
$$\min_{\mathbf{D},\,\mathbf{X}_Y} \left\{ \left\| \mathbf{Y} - \mathbf{D}\mathbf{X}_Y \right\|_2^2 \right\} \quad \text{subject to} \quad \left\| \mathbf{x}_f \right\|_0 \leq T_0 \;\; \forall f \qquad (5.2)$$
[Figure 5.3 depicts a block diagram of the learning pipeline: the neutral speech is framed and subjected to LP analysis to form the training vectors $\mathbf{Y}$; after dictionary initialization, sparse coding (OMP) alternates with the dictionary update (K-SVD) to produce the estimated dictionary.]

Figure 5.3: Estimation of invariable size global dictionary using the K-SVD algorithm.
where the dictionary $\mathbf{D}$ constitutes $D$ atoms, so its dimension is $P \times D$. The matrix $\mathbf{X}_Y$, having dimension $D \times N$, comprises the set of sparse coded vectors corresponding to $\mathbf{Y}$ with sparsity constraint $T_0$. The learning process involves a joint update of the dictionary atoms along with an update of the sparse coding, resulting in accelerated convergence. The sparse coding, or atom decomposition, can be done using any pursuit algorithm; in this work, orthogonal matching pursuit (OMP) [76, 78, 171–173] is used, as depicted in Figure 5.3. The sparse representations of all observed neutral and stressed speech utterances are derived over this invariable number of atoms of the same estimated dictionary $\mathbf{D}$. This estimated dictionary is referred to as the invariable-size global dictionary, owing to its global use for the sparse representation of all observed neutral and stressed speech utterances.
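The following is a minimal sketch of the two-stage K-SVD loop described above, assuming numpy and scikit-learn's orthogonal_mp for the OMP stage; the function name ksvd and its defaults are illustrative, not the exact implementation used in this work.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(Y, n_atoms, T0, n_iter=20, seed=0):
    """Learn D (P x n_atoms) so that Y ~ D @ X with each column
    of X at most T0-sparse, following Eq. (5.2)."""
    rng = np.random.default_rng(seed)
    P, N = Y.shape
    # Initialize the dictionary with randomly chosen, normalized training vectors.
    D = Y[:, rng.choice(N, n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        # Sparse coding stage: OMP solves min ||y - Dx||_2 s.t. ||x||_0 <= T0.
        X = orthogonal_mp(D, Y, n_nonzero_coefs=T0)
        # Dictionary update stage: refresh each atom (and its coefficients)
        # via a rank-1 SVD of the residual restricted to the signals that use it.
        for k in range(n_atoms):
            users = np.nonzero(X[k, :])[0]
            if users.size == 0:
                continue
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, users] = s[0] * Vt[0, :]
    return D, X
```

The alternation shown here is exactly the joint update noted above: each pass re-codes the training vectors and then re-fits every atom to the residual of the signals that actually use it, which is what accelerates convergence relative to fixing the codes for a whole epoch.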
Creation of Utterance-Specific Adaptive Dictionary: A speaker changes the speaking rate to convey information about the stress condition as well as to retain the acoustic information [5, 11]. Consequently, the duration of the speech signal is influenced, which leads to altered prosodic characteristics of stressed speech in comparison to neutral speech. The literature reviews summarized in Chapter 1 have demonstrated that the same speech utterance produced under different stress conditions exhibits different time durations, and this duration is found to be very informative for characterizing the stress classes. The investigation of changes in the duration parameter has attracted a broad range of researchers in the area of stressed speech because of its potential for classifying the different stress classes [6, 60]. In the work reported in [70], significant changes in duration have been observed for fricatives, diphthongs, nasals, semi-vowels, affricates, stops, silence, and vowels under the Lombard effect. These studies show that the duration of the speech signal carries significant information about stressed speech.
[Figure 5.4 illustrates the procedure: the LP-vectors $\mathbf{a}_1, \ldots, \mathbf{a}_F$ of the observed utterance partition the neutral LP-vector space into cells $\omega_1, \ldots, \omega_F$; the set of $K_n F$ nearest-neighbor LP-vectors gathered from these cells forms $\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_F$, whose concatenation yields the utterance-specific adaptive dictionary.]

Figure 5.4: Creation of utterance-specific adaptive dictionary using the K-NN algorithm.
The incorporation of information about the duration parameter of the speech utterance (neutral or stressed) in the estimation of the dictionary can help in improving the sparse representation of the set of LP-vectors of that utterance, thereby reducing the aforementioned acoustic mismatch. To model the information about the duration parameter of speech, we have explored the K-nearest-neighbour (K-NN) algorithm-based non-parametric probability density estimation method, as shown in Figure 5.4. The K-NN algorithm is based on the notion that the probability density at a point can be estimated by growing the radius of a sphere around that point until it contains a fixed number of sample observations [143, 147]. Integrating the duration parameter of each observed neutral or stressed speech utterance into the estimation of the dictionary yields a dictionary that adapts to the observed utterance; in this work, this dictionary is called the utterance-specific adaptive dictionary. The number of atoms in this dictionary depends on the duration of the observed speech utterance, as depicted in Figure 5.4. The following steps describe the learning mechanism of the proposed utterance-specific adaptive dictionary; for reference, the underlying K-NN density estimate is recalled first, and a code sketch of the steps is given after Step III.
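The textbook form of the K-NN density estimate underlying this construction (a standard identity from [143, 147], not an equation introduced in this work) is

$$\hat{p}(\mathbf{a}) \approx \frac{K_n}{N \, V_{K_n}(\mathbf{a})},$$

where $V_{K_n}(\mathbf{a})$ is the volume of the smallest sphere centered at $\mathbf{a}$ that encloses $K_n$ of the $N$ training LP-vectors: densely populated regions of the neutral feature-space require only a small sphere, so the selected neighbors track the local density.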
Step I: First, the LP-vectors $\{\mathbf{a}_f\}_{f=1}^{F}$ of the observed speech utterance (neutral or stressed) are determined using $P$-order LP analysis. The subscript $f$ denotes the frame index, $1 \leq f \leq F$, where $F$ is the total number of frames of the observed speech utterance.
Step II: Next, for the $F$ LP-vectors $\{\mathbf{a}_f\}_{f=1}^{F}$ of the observed speech utterance, the feature-space of the neutral-speech LP-vectors arranged in $\mathbf{Y}$ is partitioned into $F$ distinct cells $\{\omega_f\}_{f=1}^{F}$ using the Bayes decision rule through the K-NN algorithm, as shown in Figure 5.4. This feature-space comprises the vocal-tract system parameters of neutral speech; each cell collects the nearest-neighbor LP-vectors of the neutral speech utterances corresponding to a distinct LP-vector of the observed utterance, a partitioning generally referred to as the Voronoi tessellation of the space [143, 147].
Step III: The $K_n$ nearest-neighbor LP-vectors corresponding to each cell are arranged in $\mathbf{Y}_f = \left[ \mathbf{a}_{f1} \; \mathbf{a}_{f2} \; \cdots \; \mathbf{a}_{fK_n} \right]$. In this way, the set of nearest-neighbor LP-vectors $\{\mathbf{Y}_f\}_{f=1}^{F}$ is determined from all the partitioned cells. These nearest-neighbor LP-vectors in $\{\mathbf{Y}_f\}_{f=1}^{F}$ signify the estimates of the probability density corresponding to the LP-vectors $\{\mathbf{a}_f\}_{f=1}^{F}$ of the observed speech utterance. Each column of these matrices represents one of the LP-vectors in $\mathbf{Y}$, which comprises the vocal-tract system parameters of the neutral speech signals. The dictionary is created by concatenating the individual matrices $\{\mathbf{Y}_f\}_{f=1}^{F}$ as $\mathbf{D} = \left[ \mathbf{Y}_1 \; \mathbf{Y}_2 \; \cdots \; \mathbf{Y}_F \right]$. The number of atoms, $K_n F$, in this exemplar dictionary is associated with the duration of the observed speech utterance in terms of the number of frames $F$, which makes the dictionary adaptive to the observed utterance; hence it is referred to as the utterance-specific adaptive dictionary.
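A minimal sketch of Steps II and III (taking the LP-vectors from Step I as given), assuming numpy and scikit-learn's NearestNeighbors; the function name adaptive_dictionary and the array layout are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_dictionary(A_obs, Y_neutral, Kn):
    """Build the utterance-specific adaptive dictionary.

    A_obs:     (P, F) LP-vectors of the observed utterance (Step I output).
    Y_neutral: (P, N) LP-vectors of the neutral training utterances (Y).
    Returns D of shape (P, Kn*F)."""
    # Fit the neighbor structure on the neutral LP-vector space (rows = samples).
    knn = NearestNeighbors(n_neighbors=Kn).fit(Y_neutral.T)
    # Step II: for every observed LP-vector a_f, find its cell of Kn neighbors.
    _, idx = knn.kneighbors(A_obs.T)            # idx has shape (F, Kn)
    # Step III: Y_f = [a_f1 ... a_fKn], then D = [Y_1 Y_2 ... Y_F].
    blocks = [Y_neutral[:, idx[f]] for f in range(A_obs.shape[1])]
    return np.hstack(blocks)
```

Because the number of columns is $K_n F$, the dictionary grows with the utterance duration $F$, which is precisely what makes it utterance-specific.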
The estimated utterance-specific adaptive dictionary thus consists of nearest-neighbor LP-vectors of neutral speech utterances. Therefore, the sparse representation of the LP-vectors of a given observed neutral or stressed speech utterance over its corresponding utterance-specific adaptive dictionary in Eq. (5.1) will help in reducing the acoustic mismatch between them.
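As a usage illustration, the sparse representation of Eq. (5.1) over such a dictionary could be computed frame-wise with OMP; the values of Kn and T0 and the arrays below are placeholders, not parameters reported in this work.

```python
from sklearn.linear_model import orthogonal_mp

# Hypothetical usage: code each observed LP-vector over the utterance's
# own adaptive dictionary with at most T0 nonzero coefficients per frame.
D = adaptive_dictionary(A_obs, Y_neutral, Kn=8)   # Kn = 8 is illustrative
X = orthogonal_mp(D, A_obs, n_nonzero_coefs=T0)   # X has shape (Kn*F, F)
A_hat = D @ X  # reconstructed LP-vectors composed of neutral-speech atoms
```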