Greedy algorithms attempt to construct the support of the sparse code one atom at a time. In other words, they provide an iterative solution to the following problem:
$$\min_{w} \; \|Dw - x\|_2 \quad \text{s.t.} \quad \|w\|_0 \le T \qquad (5.2)$$
where T is the sparsity constraint. We have used the orthogonal matching pursuit (OMP) algorithm for sparse coding in this work.
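To make the greedy procedure concrete, the following is a minimal sketch of OMP in Python using only numpy. It is an illustration of the standard algorithm, not the exact implementation used in this work; the function name and interface are our own.

```python
import numpy as np

def omp(D, x, T):
    """Orthogonal matching pursuit: greedily select up to T atoms of the
    (column-normalized) dictionary D to approximate x, re-fitting all
    selected coefficients by least squares after each selection."""
    residual = x.copy()
    support = []
    w = np.zeros(D.shape[1])
    for _ in range(T):
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # Least-squares refit on the selected support (the "orthogonal" step).
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    w[support] = coeffs
    return w
```

The least-squares refit at every iteration is what distinguishes OMP from plain matching pursuit: the residual stays orthogonal to all atoms selected so far.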
5.1.1 Creation of dictionary for sparse representation
Two kinds of dictionaries are commonly used for sparse representation. The first is referred to as the exemplar dictionary: a redundant dictionary formed by concatenating a large number of examples so as to produce a sparse representation of the target. Such dictionaries are hand-crafted and are difficult to optimize. In contrast, a learned dictionary is derived by processing the training data so as to produce a sparse representation.
In this work, we have used the KSVD algorithm [132] for learning a redundant dictionary for sparse representation. KSVD is a generalization of the well-known K-means clustering algorithm. It constructs a dictionary of K atoms that yields the best possible representation of each of the training examples under the specified sparsity constraint.
The dictionary learning problem is represented as
$$\min_{D, W} \; \|X - DW\|_2^2 \quad \text{s.t.} \quad \|w_i\|_0 \le T \;\; \forall i \qquad (5.3)$$
where X is the matrix of dictionary training vectors, D is the learned dictionary, W is the matrix of corresponding sparse vectors w_i, T is the sparsity constraint, and i indexes the training examples.
Dictionary learning is an iterative process; each iteration alternates between two stages: sparse coding and dictionary update.
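The two-stage alternation can be sketched as follows in Python. This is a compact illustration of the K-SVD update (sparse coding with a basic OMP, then a rank-1 SVD update per atom), under our own naming; the actual implementation used in this work may differ in details such as atom replacement and stopping criteria.

```python
import numpy as np

def omp_codes(D, X, T):
    """Sparse-code every column of X with at most T atoms of D (basic OMP)."""
    W = np.zeros((D.shape[1], X.shape[1]))
    for j in range(X.shape[1]):
        x, r, S = X[:, j], X[:, j].copy(), []
        for _ in range(T):
            S.append(int(np.argmax(np.abs(D.T @ r))))
            c, *_ = np.linalg.lstsq(D[:, S], x, rcond=None)
            r = x - D[:, S] @ c
        W[S, j] = c
    return W

def ksvd(X, K, T, n_iter=10, seed=0):
    """Minimal K-SVD: alternate OMP sparse coding with per-atom rank-1
    (SVD) dictionary updates, as in the two-stage alternation above."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], K))
    D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
    for _ in range(n_iter):
        W = omp_codes(D, X, T)               # stage 1: sparse coding
        for k in range(K):                   # stage 2: dictionary update
            users = np.nonzero(W[k, :])[0]   # examples that use atom k
            if users.size == 0:
                continue
            W[k, users] = 0.0
            # Representation error without atom k, restricted to its users.
            E = X[:, users] - D @ W[:, users]
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                # new atom: leading left vector
            W[k, users] = s[0] * Vt[0, :]    # matching coefficients
    return D, W
```

Note that the SVD update only changes coefficients on each column's existing support, so the per-example sparsity constraint is preserved across iterations.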
5.2 ABWE using sparse representation
These dictionaries produce fairly accurate representations of speech signals, but those representations are neither sparse nor particularly similar for WB and NB speech. In contrast, if we create a dictionary that produces a sparse representation of speech signals, it is more likely to produce similar representations for WB and NB speech. Motivated by this hypothesis, we have explored a very simple approach to ABWE in which the sparse representation of the NB speech, obtained with respect to an NB dictionary, is applied to a corresponding WB dictionary to reconstruct the HB information for the given NB signal.
5.2.1 Proposed SR-ABWE approach
For learning the dictionaries for sparse modelling, a development set is derived from the training set of the WSJCAM0 corpus [152], which includes both male and female speakers.
The selected speech signals are segmented into frames of 20 ms with an overlap of 5 ms between consecutive frames. A total of 0.12 million frames is collected, and the resulting data matrix is used to learn a dictionary for the sparse representation of speech frames using the KSVD algorithm with 50 iterations.
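The framing step can be sketched as follows. The frame length and overlap come from the text; the 16 kHz sampling rate is an assumption (a typical rate for WB speech), and the function name is our own.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20.0, overlap_ms=5.0):
    """Cut a 1-D signal into fixed-length frames with the stated overlap;
    each frame becomes one column of the dictionary-training data matrix."""
    frame_len = int(round(frame_ms * fs / 1000.0))
    hop = frame_len - int(round(overlap_ms * fs / 1000.0))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Build a (frame_len, n_frames) index grid and gather in one shot.
    idx = np.arange(frame_len)[:, None] + hop * np.arange(n_frames)[None, :]
    return x[idx]
```

At 16 kHz this gives 320-sample frames with a 240-sample hop, so one second of speech contributes roughly 66 training columns.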
For sparse modelling, we first created a single dictionary of 4000 atoms on WB speech. The atoms of this WB dictionary are decimated by a factor of 2 and then interpolated by a factor of 2 to derive a dictionary for the sparse representation of the corresponding upsampled narrowband (NBI) signal frames. The atoms of the so-derived NBI dictionary therefore have a one-to-one mapping to those of the WB dictionary. The sparse codes of the NBI target frames obtained with the NBI dictionary are then applied to the WB dictionary to synthesize a signal having enhanced HB information. For the single-dictionary case, the bandwidth-enhanced magnitude spectra obtained using the proposed ABWE approach for a voiced (/aa/) and an unvoiced (/s/) frame are shown in Figure 5.1(a) and Figure 5.1(c), respectively. For contrast, the magnitude spectra for the sparse representation of the WB frame with the WB dictionary and for the original WB signal are also shown. The corresponding sparse codes for the NBI/WB data are shown in Figure 5.1(b) and Figure 5.1(d). On comparing with the original signal spectra, the proposed ABWE modelling approach appears to work somewhat for the voiced case, but for the unvoiced case it does not appear to be effective enough.

Figure 5.1: Panels (a) and (c) show the reconstructed spectra using the proposed ABWE approach for a voiced (/aa/) and an unvoiced (/s/) frame of speech, respectively. For contrast, the spectra for the sparse representation of the WB frame and the original frame are also shown. Panels (b) and (d) show the corresponding sparse codes obtained with the WB and NBI dictionaries. A single dictionary is used for sparse coding of both voiced and unvoiced segments.
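The coupled-dictionary reconstruction described above can be sketched as follows. For brevity, an ideal half-band FFT filter stands in for the decimate-then-interpolate chain (both remove the upper half band), the dictionary here is a small random stand-in rather than the 4000-atom KSVD dictionary, and all names are our own.

```python
import numpy as np

def lowpass_half_band(atoms):
    """Emulate decimate-by-2 followed by interpolate-by-2: keep only the
    lower half of each atom's spectrum (an ideal-filter approximation of
    the decimation/interpolation chain described in the text)."""
    A = np.fft.rfft(atoms, axis=0)
    cutoff = A.shape[0] // 2
    A[cutoff:, :] = 0.0                 # zero the upper (HB) half band
    return np.fft.irfft(A, n=atoms.shape[0], axis=0)

def enhance_frame(x_nbi, D_wb, T=5):
    """Code an upsampled-NB frame against the coupled NBI dictionary,
    then resynthesize with the one-to-one mapped WB atoms to restore HB
    content."""
    D_nbi = lowpass_half_band(D_wb)     # coupled NBI dictionary
    # Basic OMP on the NBI dictionary.
    r, S = x_nbi.copy(), []
    for _ in range(T):
        S.append(int(np.argmax(np.abs(D_nbi.T @ r))))
        c, *_ = np.linalg.lstsq(D_nbi[:, S], x_nbi, rcond=None)
        r = x_nbi - D_nbi[:, S] @ c
    # One-to-one coupling: reuse the NBI codes with the WB atoms.
    return D_wb[:, S] @ c
```

Because the NBI dictionary is a linear, band-limited image of the WB dictionary, a sparse code found on the NBI side indexes exactly the WB atoms needed to resynthesize the missing upper band.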
Noting that the characteristics of voiced and unvoiced speech differ significantly, a single dictionary may not be very effective. To improve the modelling, it was decided to learn separate dictionaries for the voiced and unvoiced cases. To judge the improvement in the proposed ABWE modelling with separate voiced and unvoiced KSVD-learned dictionaries, the magnitude spectra for the same voiced and unvoiced frames are shown in Figure 5.2. On comparing Figure 5.1 and Figure 5.2, a considerable improvement in SR-ABWE modelling can be noted with the use of separate dictionaries, in particular for the unvoiced case. Though for the unvoiced case the indices in the NBI and WB sparse codes still differ, the enhanced spectra show a better match with the original WB spectra.

Figure 5.2: Improvement in the modelling of the proposed ABWE approach with the use of separate voiced and unvoiced dictionaries in sparse coding. The example frames and the layout of panels are identical to those of Figure 5.1. Separate dictionaries are used for the voiced and unvoiced cases.
Like other existing ABWE approaches, in the proposed approach too the given NB speech is retained without any modification, while the estimated HB speech is added to it with appropriate amplitude scaling. The detailed block diagram of the proposed sparse representation based ABWE approach is shown in Figure 5.3. We further note that the creation of coupled NBI and WB dictionaries is critical for the proposed ABWE approach.
Sparse coding the NBI speech data directly with the WB dictionary leads to poor recreation of the HB information, especially for the unvoiced case. To verify this, the enhanced spectra for the NBI frame obtained by sparse coding with the WB dictionary are also given in Figure 5.1(a) and Figure 5.1(c) for the voiced and unvoiced cases, respectively. Note that very poor enhancement is obtained for the unvoiced case when no coupling is used.
Figure 5.3: The complete block diagram of the proposed sparse representation based ABWE approach.