Although they provide very good accuracy, the former suffers from being an offline system and the latter suffers from lower accuracy. Here we propose an improvement to an approach called End-to-End ASR-free Keyword Spotting. This system is inspired by traditional keyword spotting architectures and consists of three modules, namely an acoustic encoder, a phonetic encoder, and a keyword neural network. In addition to being end-to-end, the system has the advantage of being easy to implement.
Keyword spotting (KWS) is the task of identifying the putative hits of a text query in a reference audio file. In the latter case, these keyword spotting systems work on the hypotheses of a low-quality automatic speech recognition (ASR) system [1]. With the emergence of end-to-end systems, each of these traditional architectures has found a suitable end-to-end counterpart, especially in speech recognition and keyword spotting systems.
End-to-End systems provide low latency, which is especially important for streaming applications [6][7] such as voice assistants. However, the performance of End-to-End systems drops significantly when operating in low-resource scenarios [9].
Motivation
- Pre-Emphasis
- Windowing
- Discrete Fourier Transform
- Mel filter bank and log
- The Cepstrum
- Deltas and Energy
- Monophone Alignment
- Tri-phone Alignment
 
Another disadvantage is that we try to project the acoustic and phonetic features into a common space only by classifying whether the keyword is present or not, i.e., with a cross-entropy loss. This is an oversimplification of the task of projecting the outputs of the acoustic and phonetic encoders into the same common space. Since our problem depends on the spoken content, Mel Frequency Cepstral Coefficients (MFCCs) are the most suitable features.
The next step in MFCC feature extraction is to analyze the spectral content of the speech signal, since each phone is characterized by a different spectral content. Let x[1], x[2], ..., x[m] be the input signal. Phone production is a random process; to model the phones, we first need to know where each phone starts and ends.
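The steps listed above (pre-emphasis, windowing, DFT, Mel filter bank and log, cepstrum, deltas) can be sketched as follows. This is only a minimal illustration, assuming librosa and SciPy are available; the frame length, hop size, and number of coefficients are chosen purely for illustration.

```python
import numpy as np
import scipy.fftpack
import librosa  # used here only for the Mel filter bank, framing, and delta helpers

def mfcc_sketch(signal, sr=16000, n_fft=400, hop=160, n_mels=40, n_ceps=13):
    """Minimal MFCC pipeline: pre-emphasis -> windowing -> DFT -> Mel+log -> cepstrum -> deltas."""
    # Pre-emphasis: boost the high frequencies
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Framing and Hamming windowing
    frames = librosa.util.frame(emphasized, frame_length=n_fft, hop_length=hop).T
    frames = frames * np.hamming(n_fft)

    # Power spectrum via the discrete Fourier transform
    power_spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2

    # Mel filter bank and log compression
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(power_spec @ mel_fb.T + 1e-10)

    # Cepstrum: the DCT decorrelates the log-Mel energies
    ceps = scipy.fftpack.dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]

    # Deltas approximate the temporal derivatives of the cepstra
    deltas = librosa.feature.delta(ceps.T).T
    return np.hstack([ceps, deltas])

# Example: feats = mfcc_sketch(librosa.load("utt.wav", sr=16000)[0])
```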
However, marking those boundaries by hand is very difficult, while marking labels at the word or sentence level is relatively easy. Phones are the universal sounds that generalize the entire language, and there are only a limited number of them. From a pronunciation lexicon we know how each word is pronounced. Given this information, and labels marked at the word or sentence level, we can still model the phonetic distributions.
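As a small illustration of how word-level transcripts are expanded to phone sequences before alignment, a lexicon lookup might look like the following sketch; the lexicon entries and phone symbols here are made up for illustration.

```python
# Hypothetical pronunciation lexicon: word -> phone sequence (illustrative entries only).
LEXICON = {
    "cat": ["k", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "pat": ["p", "ae", "t"],
}

def transcript_to_phones(transcript, lexicon=LEXICON, sil="sil"):
    """Expand a word-level transcript into a phone sequence with optional silences."""
    phones = [sil]  # leading silence
    for word in transcript.lower().split():
        phones.extend(lexicon[word])  # OOV handling omitted in this sketch
        phones.append(sil)
    return phones

print(transcript_to_phones("pat cat"))
# ['sil', 'p', 'ae', 't', 'sil', 'k', 'ae', 't', 'sil']
```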
The time-varying distribution of phone sounds is characterized by Hidden Markov Models (HMMs), with the observation probabilities modeled by Gaussian mixture models (GMMs). As mentioned earlier, we do not have boundaries for each phone, so we use left-to-right HMMs with 3 to 5 states to model the phones.
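A minimal sketch of a 3-state left-to-right HMM with GMM observation densities is shown below, assuming the hmmlearn package; the transition matrix enforces the left-to-right topology, and the number of mixture components and the dummy training data are purely illustrative.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # assumes hmmlearn is installed

# 3-state left-to-right topology: a state can only loop or move to the next state.
model = GMMHMM(n_components=3, n_mix=4, covariance_type="diag",
               n_iter=20, init_params="mcw")  # startprob_/transmat_ set manually below
model.startprob_ = np.array([1.0, 0.0, 0.0])
model.transmat_ = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
])

# X: stacked MFCC frames for all training segments of one phone,
# lengths: number of frames in each segment (dummy data here).
X = np.random.randn(200, 13)
lengths = [100, 100]
model.fit(X, lengths)        # Baum-Welch re-estimation; zero transitions stay zero
print(model.score(X[:100]))  # log-likelihood of one segment under this phone HMM
```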
This means that each phone is modeled with an HMM, and each state has a different distribution. With tri-phones the number of states grows rapidly, so the states are clustered with a decision tree: at each node a question about the context is asked to decide which states are kept apart and which are tied together. For example, p-a-t and b-a-t have a similar ending context, so the corresponding HMM states are tied into a single state.
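As a rough illustration of how a context question could be scored, the following sketch pools the frames of a set of tri-phone states, splits them by a yes/no question on the context, and compares single-Gaussian log-likelihoods before and after the split. The question set, the data layout, and the single-Gaussian assumption are hypothetical simplifications of a real state-tying implementation.

```python
import numpy as np

def gauss_loglik(frames):
    """Log-likelihood of frames under a single diagonal Gaussian fit to them."""
    mu, var = frames.mean(axis=0), frames.var(axis=0) + 1e-6
    return -0.5 * np.sum(((frames - mu) ** 2) / var + np.log(2 * np.pi * var))

def split_gain(states, question):
    """Likelihood gain from splitting tri-phone states by a context question.

    states:   dict mapping a tri-phone name like 'p-a+t' to its frames (N x d array)
    question: function taking a tri-phone name and returning True/False
    """
    yes = [f for name, f in states.items() if question(name)]
    no = [f for name, f in states.items() if not question(name)]
    if not yes or not no:
        return 0.0
    pooled = np.vstack(list(states.values()))
    return (gauss_loglik(np.vstack(yes)) + gauss_loglik(np.vstack(no))
            - gauss_loglik(pooled))

# Hypothetical question: "is the left context an unvoiced stop?"
unvoiced_stops = {"p", "t", "k"}
q_left_unvoiced = lambda name: name.split("-")[0] in unvoiced_stops
# The question with the largest split_gain would be chosen at each tree node.
```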
Deep Neural Network based Acoustic modeling
Time-Delay Neural Networks
Language Modeling
Decoding Graph Construction
Traditional Viterbi decoding performs an exact search, but it becomes insufficient as the order of the language model increases. The acoustic model outputs can be represented as a graph, and the remaining information, all the way up to the word level, is represented in several further stages; for this we also store the information needed to map the acoustic model outputs to context-dependent (CD) phones.
All of these can be combined and optimized through composition to form a single decoding graph. After transcribing the audio at the phone or word level, we then perform a keyword search. At first glance, one might intuitively think of performing keyword detection by simply transcribing the audio to text.
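A toy, pure-Python illustration of composing two transducers is sketched below; real decoding graphs are built with WFST toolkits such as OpenFst, and this simplified dictionary-based version (with only a crude handling of output epsilons instead of the full epsilon filter) only shows the idea of matching the output labels of one graph against the input labels of the next.

```python
EPS = "<eps>"

def compose(fst1, fst2):
    """Compose two transducers given as (start, finals, arcs), where arcs maps
    state -> list of (in_label, out_label, weight, next_state).
    Output-epsilon arcs of fst1 advance fst1 alone (a simplification)."""
    start1, finals1, arcs1 = fst1
    start2, finals2, arcs2 = fst2
    start = (start1, start2)
    arcs, finals, stack, seen = {}, set(), [start], {start}

    def add(state, arc):
        arcs[state].append(arc)
        if arc[3] not in seen:
            seen.add(arc[3])
            stack.append(arc[3])

    while stack:
        state = stack.pop()
        s1, s2 = state
        arcs[state] = []
        if s1 in finals1 and s2 in finals2:
            finals.add(state)
        for i1, o1, w1, n1 in arcs1.get(s1, []):
            if o1 == EPS:                        # fst1 emits nothing: fst2 stays put
                add(state, (i1, EPS, w1, (n1, s2)))
            else:
                for i2, o2, w2, n2 in arcs2.get(s2, []):
                    if i2 == o1:                 # match fst1 output with fst2 input
                        add(state, (i1, o2, w1 + w2, (n1, n2)))
    return start, finals, arcs

# Toy lexicon transducer L (phones -> word "cat") and grammar G (word cost 1.2).
L = (0, {3}, {0: [("k", EPS, 0.0, 1)],
              1: [("ae", EPS, 0.0, 2)],
              2: [("t", "cat", 0.0, 3)]})
G = (0, {1}, {0: [("cat", "cat", 1.2, 1)]})
start, finals, arcs = compose(L, G)   # maps the phone string "k ae t" to "cat"
```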
When transcribing at the word level, the immediate problem is out-of-vocabulary (OOV) words. It is very difficult to recognize certain words, especially proper names such as company names and unusual person names. A good language model is also very important, otherwise the performance will be poor.
Discriminative Keyword Spotting
Small Foot-Print Keyword Spotting
Lattice indexing
Lattice generation
Keyword Search
Keyword-Filler based Methods
Here we briefly discuss recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and bidirectional architectures, and finally we review encoder-decoder architectures. RNNs work just like feed-forward networks, except that they also take feedback from the previous hidden units as input.
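A minimal NumPy sketch of a vanilla RNN step, where the hidden state from the previous time step is fed back together with the current input; the dimensions are arbitrary and chosen only for illustration.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Vanilla RNN over a sequence: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = np.zeros(W_hh.shape[0])
    hiddens = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # previous hidden state fed back
        hiddens.append(h)
    return np.stack(hiddens)

# Illustration: 50 frames of 13-dimensional features, 32 hidden units.
rng = np.random.default_rng(0)
x_seq = rng.standard_normal((50, 13))
W_xh = rng.standard_normal((32, 13)) * 0.1
W_hh = rng.standard_normal((32, 32)) * 0.1
b_h = np.zeros(32)
print(rnn_forward(x_seq, W_xh, W_hh, b_h).shape)  # (50, 32)
```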
Long Short Term Memory
Encoder-Decoder Architecture
Attention Mechanism
Baseline system
The baseline system we consider here is the ASR-free keyword spotting from speech proposed in [11][18]. It consists of three modules: an Acoustic Encoder, a Phonetic Encoder, and a Keyword Neural Network. It is inspired by traditional ASR-based keyword spotting, which contains an acoustic model, a language model, and a keyword search algorithm.
As we can see, there is a one-to-one correspondence between the acoustic encoder and the acoustic model. The acoustic encoder is an autoencoder: it takes the speech features as input, projects them to a fixed-length representation, and then tries to reconstruct the signal. Since speech is a time-dependent signal, the natural choice for this autoencoder is a sequential (recurrent) architecture.
In a similar way, the phonetic encoder, a character RNN language model (Char RNNLM), takes the keyword phone sequence and produces a fixed-length embedding. The acoustic autoencoder and the phonetic encoder are trained with mean squared error and cross-entropy loss respectively. Once they have been trained, we remove the decoder parts and use the encoders for keyword spotting.
The result is treated as a conjoined network and trained for keyword spotting with a cross-entropy loss. The drawback of this approach is that it does not take into account the temporal information present in the speech signal, which is lost when taking a fixed-length representation. The other problem is that it tries to project the phonetic features into the acoustic space using only the keyword spotting loss.
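A rough PyTorch sketch of this baseline, assuming GRU encoders and hypothetical dimensions: the acoustic encoder and the phonetic encoder each produce a fixed-length embedding, and a small feed-forward keyword network classifies whether the keyword is present.

```python
import torch
import torch.nn as nn

class BaselineKWS(nn.Module):
    """ASR-free KWS baseline sketch: two fixed-length encoders plus a classifier."""
    def __init__(self, feat_dim=13, n_phones=50, hid=128):
        super().__init__()
        self.acoustic_enc = nn.GRU(feat_dim, hid, batch_first=True)
        self.phone_emb = nn.Embedding(n_phones, 64)
        self.phonetic_enc = nn.GRU(64, hid, batch_first=True)
        self.kws_net = nn.Sequential(
            nn.Linear(2 * hid, hid), nn.ReLU(), nn.Linear(hid, 2))

    def forward(self, feats, phones):
        # feats: (B, T, feat_dim) speech features; phones: (B, L) keyword phone ids
        _, h_acoustic = self.acoustic_enc(feats)            # (1, B, hid)
        _, h_phonetic = self.phonetic_enc(self.phone_emb(phones))
        joint = torch.cat([h_acoustic[-1], h_phonetic[-1]], dim=-1)
        return self.kws_net(joint)                          # keyword present / absent

# Dummy forward pass with hypothetical shapes.
model = BaselineKWS()
logits = model(torch.randn(4, 200, 13), torch.randint(0, 50, (4, 8)))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 0, 1, 0]))
```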
Proposed Architecture
- Acoustic Encoder
- Phonetic Encoder
- Attention
- KWS Neural Network and Cost Function
 
In summary, the acoustic features are fed into the acoustic encoder, the phonetic pronunciation of the keyword is fed into the phonetic encoder, and the attention between these two representations is computed to obtain a context vector. The context vector and the keyword embedding are merged together and fed to the KWS neural network for classification. But again, the acoustic representations and the keyword input are very far from each other, which is evident from the training loss shown in Figure 2.
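A minimal sketch of this attention step, under the same assumptions as the baseline sketch above: the phonetic embedding acts as the query, the per-frame acoustic encoder outputs act as keys and values, and the attended context vector is concatenated with the query before classification. The projection sizes (e.g., 200 dimensions, as mentioned in the experiments) and other dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionKWS(nn.Module):
    """Proposed-architecture sketch: attention between acoustic frames and the query."""
    def __init__(self, feat_dim=13, n_phones=50, hid=128, proj=200):
        super().__init__()
        self.acoustic_enc = nn.GRU(feat_dim, hid, batch_first=True)
        self.phone_emb = nn.Embedding(n_phones, 64)
        self.phonetic_enc = nn.GRU(64, hid, batch_first=True)
        self.ctx_proj = nn.Linear(hid, proj)
        self.qry_proj = nn.Linear(hid, proj)
        self.kws_net = nn.Sequential(nn.Linear(2 * proj, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, feats, phones):
        enc, _ = self.acoustic_enc(feats)                   # (B, T, hid) per-frame states
        _, q = self.phonetic_enc(self.phone_emb(phones))    # (1, B, hid) query embedding
        q = q[-1]                                           # (B, hid)
        scores = torch.bmm(enc, q.unsqueeze(-1)).squeeze(-1)     # (B, T) dot-product scores
        attn = F.softmax(scores, dim=-1)                    # attention over time
        context = torch.bmm(attn.unsqueeze(1), enc).squeeze(1)   # (B, hid) context vector
        joint = torch.cat([self.ctx_proj(context), self.qry_proj(q)], dim=-1)
        return self.kws_net(joint), attn                    # logits and attention weights

model = AttentionKWS()
logits, attn = model(torch.randn(4, 200, 13), torch.randint(0, 50, (4, 8)))
```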
Transfer Learning from Speech Synthesizer
- Duration Modeling
- Acoustic Modeling
- Database Description
- End-to-End ASR-free Keyword Spotting
- Results and Discussion
 
For acoustic modeling, the duration is appended to the phone at the input level in order to learn the Mel Frequency Cepstral Coefficients (MFCCs). In addition, to preserve the speaker characteristics, band aperiodicity and fundamental frequency are also predicted by the network. Each phone with its left and right context and its duration is given as input, and the outputs are the MFCCs, band aperiodicity, and fundamental frequency.
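A sketch of such a synthesizer acoustic model as a simple feed-forward network, with hypothetical input/output dimensions: the input is a context-dependent phone encoding plus duration, and the output is split into MFCC, band aperiodicity, and fundamental frequency streams.

```python
import torch
import torch.nn as nn

class TTSAcousticModel(nn.Module):
    """Phone (with left/right context) + duration -> MFCC, band aperiodicity, F0."""
    def __init__(self, n_phones=50, emb=64, hid=256, n_mfcc=13, n_bap=5):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, emb)
        # Input: embeddings of left, current, right phone plus a scalar duration.
        self.net = nn.Sequential(
            nn.Linear(3 * emb + 1, hid), nn.ReLU(),
            nn.Linear(hid, hid), nn.ReLU())
        self.mfcc_head = nn.Linear(hid, n_mfcc)
        self.bap_head = nn.Linear(hid, n_bap)
        self.f0_head = nn.Linear(hid, 1)

    def forward(self, left, cur, right, dur):
        x = torch.cat([self.phone_emb(left), self.phone_emb(cur),
                       self.phone_emb(right), dur.unsqueeze(-1)], dim=-1)
        h = self.net(x)
        return self.mfcc_head(h), self.bap_head(h), self.f0_head(h)

# Dummy batch of 8 context-dependent phones with durations (in frames).
m = TTSAcousticModel()
mfcc, bap, f0 = m(torch.randint(0, 50, (8,)), torch.randint(0, 50, (8,)),
                  torch.randint(0, 50, (8,)), torch.rand(8) * 30)
```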
Although we have to create more negative examples due to the nature of the problem, we can show that the performance is not affected much even after increasing their number. The language used in the experiments is Telugu; since both databases are of similar size, we quote the best results achieved with the parameters from the original paper on the experimented database. For this we used the open-source CMU Arctic data [26], which has about 1.5 hours of English speech and consists of 1150 phonetically balanced utterances.
The context vector and the query phonetic embedding vector are each projected to 200 dimensions and concatenated to make a single vector. For the supervised query detection, we built a time delay neural network (TDNN) [13] based speech recognizer and took the best path to detect the query. From this we took a test set of 10% of the total size, giving 534 positive examples and 534 negative examples.
With the pre-trained model there is an absolute improvement of 7% over the baseline system in terms of accuracy. When calculating locations, instead of just looking at the maximum attention value, we looked at the top 3 attention peaks and checked whether they coincide with the actual location. When computing the second-best peak, we excluded 20 frames to the right and left of the previous peak because of the smooth nature of the attention function.
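The peak picking described above can be sketched as follows: repeatedly take the maximum of the attention curve and suppress a window of 20 frames on either side before taking the next peak. The window size comes from the text; the hit criterion shown is a simplified placeholder.

```python
import numpy as np

def top_attention_peaks(attn, n_peaks=3, exclude=20):
    """Top-n attention peak locations, suppressing +-`exclude` frames around each peak."""
    attn = np.array(attn, dtype=float)
    peaks = []
    for _ in range(n_peaks):
        t = int(np.argmax(attn))
        peaks.append(t)
        lo, hi = max(0, t - exclude), min(len(attn), t + exclude + 1)
        attn[lo:hi] = -np.inf   # exclude the neighbourhood of this peak
    return peaks

def location_hit(peaks, query_start, query_end):
    """Simplified check: does any peak fall inside the true query span?"""
    return any(query_start <= t <= query_end for t in peaks)

# Example with a synthetic attention curve peaking around frame 120.
attn = np.exp(-0.5 * ((np.arange(300) - 120) / 15.0) ** 2)
peaks = top_attention_peaks(attn)
print(peaks, location_hit(peaks, 100, 140))
```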
We can see that with the best system, almost 56% of them match the location of the query.
Conclusion
The proposed pre-trained network converges faster and to a better loss, as shown in Figure 2. High attention is obtained where the query is actually located, and the second-best values occur right after the query location. In terms of accuracy, the system without speech-synthesizer pretraining is close to the baseline, and pretraining brings it closer to the supervised system.
Saraclar, "Lattice indexing for spoken term detection," IEEE Transactions on Audio, Speech, and Language Processing, vol.
Jones, "Unconstrained keyword spotting using phone lattices with application to spoken document retrieval," Computer Speech and Language, vol.
Paul, "A hidden Markov model based keyword recognition system," in International Conference on Acoustics, Speech and Signal Processing, April 1990, p.
Cernocky, "Comparison of keyword spotting approaches for informal continuous speech," in Ninth European Conference on Speech Communication and Technology, 2005.
McGraw, "Streaming small-footprint keyword spotting using sequence-to-sequence models," in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
Xie, "Attention-based end-to-end models for small-footprint keyword spotting," arXiv preprint.
Picheny, "End-to-end speech recognition and keyword search on low-resource languages," in 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2017, p.
Kingsbury, "End-to-end ASR-free keyword search from speech," IEEE Journal of Selected Topics in Signal Processing, vol.