
electronics

Article

A Hybrid Spoken Language Processing System for Smart Device Troubleshooting

Praveen Edward James 1, Hou Kit Mun 1,* and Chockalingam Aravind Vaithilingam 2

1 School of Engineering, Taylor’s University, Taylor’s University Lakeside Campus, Subang Jaya, Selangor 47500, Malaysia; [email protected]

2 HiRes Laboratory, School of Engineering, Taylor’s University, Taylor’s University Lakeside Campus, No. 1, Jalan Taylor’s, Subang Jaya, Selangor 47500, Malaysia; [email protected]

* Correspondence: [email protected]

Received: 6 May 2019; Accepted: 11 June 2019; Published: 16 June 2019

Abstract: The purpose of this work is to develop a spoken language processing system for smart device troubleshooting using human-machine interaction. This system combines a software Bidirectional Long Short Term Memory Cell (BLSTM)-based speech recognizer and a hardware LSTM-based language processor for Natural Language Processing (NLP) using the serial RS232 interface. Mel Frequency Cepstral Coefficient (MFCC)-based feature vectors from the speech signal are directly input into a BLSTM network. A dropout layer is added to the BLSTM layer to reduce over-fitting and improve robustness. The speech recognition component is a combination of an acoustic modeler, a pronunciation dictionary, and a BLSTM network for generating query text, and executes in real time with 81.5% recognition accuracy and an average training time of 45 s. The language processor comprises a vectorizer, lookup dictionary, key encoder, Long Short Term Memory Cell (LSTM)-based training and prediction network, and dialogue manager, and transforms query intent to generate response text with a processing time of 0.59 s, 5% hardware utilization, and an F1 score of 95.2%.

The proposed system has a 4.17% decrease in accuracy compared with existing systems, which use parallel processing and high-speed cache memories to perform additional training that improves their accuracy. However, the language processor achieves a 36.7% decrease in processing time and a 50% decrease in hardware utilization, making the system suitable for troubleshooting smart devices.

Keywords: NLP; speech recognition; FPGA; LSTM; acoustic modeling; troubleshooting

1. Introduction

Manipulating speech signals to extract relevant information is known as speech processing [1].

This work integrates an optimized realization of speech recognition with Natural Language Processing (NLP) and a Text to Speech (TTS) system to perform Spoken Language Processing (SLP) using a hybrid software-hardware design approach. SLP involves three major tasks: translating speech to text (speech recognition); capturing the intent of the text and determining actions using data processing techniques (NLP); and responding to users through voice (speech synthesis). Long Short Term Memory cell (LSTM), a class of Recurrent Neural Networks (RNN), is currently the state of the art for continuous word speech recognition and NLP, due to its ability to process sequential data [2].

There are several LSTM-based speech recognition techniques available in the literature.

For end-to-end speech recognition, speech spectrograms are chosen directly as the pre-processing scheme and processed by a deep bidirectional LSTM network with a novel Connectionist Temporal Classification (CTC) output layer [3]. However, CTC fails to model the dependence of output frames on previous output labels. In LSTM networks, the training time increases with additional LSTM layers, and a technique utilizing a single LSTM layer with a Weighted Finite State Transducer (WFST)-based decoding approach was designed to mitigate the problem [4]. The decoding approach is fast, since it uses WFST and enables effective utilization of lexicons, but the performance is limited. RNN-T, a transducer-based RNN, is used to perform automatic speech recognition by breaking down the overall model into three sub-models, namely acoustic models, a pronunciation dictionary, and language models [5]. To enhance the performance, a Higher Order Recurrent Neural Network (HORNN), a variant of RNN, was used. This technique reduces the complexity of LSTM but uses several connections from previous time steps to eliminate vanishing long-term gradients [6].

In this work, word-based speech recognition was used in a Bidirectional LSTM to create a simpler and more reliable speech recognition model. Most NLP applications include an LSTM-based model for off-line Natural Language Understanding (NLU) using datasets [7], LSTM networks for word labelling tasks in spoken language understanding [8], and personal information combined with natural language understanding to target smart communication devices, such as Personal Digital Assistants (PDA) [9]. Additionally, LSTM-based NLP models were also utilized for POS tagging, semantic parsing on off-line datasets, and to assess the performance of machine translation by Natural Language Generation (NLG) [10,11].

In such scenarios, large-scale LSTM models are computation- and memory-intensive. To overcome this limitation, a load-balance-aware pruning method is used to compress the model, along with a scheduler and a hardware architecture called an Efficient Speech Recognition Engine (ESE) [12].

However, the random nature of the pruning technique leads to unbalanced computation and irregular memory access. Hence, a structured compression technique that eliminates these irregularities using a block-circulant matrix, together with a comprehensive framework called C-LSTM, was designed [13].

To further improve performance and energy efficiency, the Alternating Direction Method of Multipliers (ADMM) is used to reduce block size and training time [14]. These reductions are achieved by determining the exact LSTM model and its implementation based on factors such as processing element design optimization, quantization, and activation limited by quantization errors. Several parameters assess the performance of the spoken language processing system: accuracy and training time for speech recognition; processing time, hardware utilization, and F1 score for NLP; and F1 score for the entire system. These values are determined to understand the relevance of this system in the current scenario.

The speech recognition component of the proposed system is implemented as a five-layered DNN with a sequential input layer, a BLSTM-based classification layer, a fully connected layer for word-level modeling, a dropout layer, and a SoftMax output layer. The NLP part of the proposed system is implemented on an FPGA as a four-layered DNN that uses a one-pass learning strategy, with an input layer, an LSTM-based learning layer, a prediction layer, and an output layer. The text-to-speech component is designed by invoking Microsoft's Speech Application Programming Interface (SAPI) from MATLAB code.

The proposed spoken language processing system is constructed by integrating these three components using an RS232 serial interface. The rest of the paper is organized as follows. Section 1 introduces the topic in view of current ongoing research. Section 2 presents the system design and implementation. Section 3 gives the results and discussion, and Section 4 concludes the topic with an insight into the future.

2. Design and Implementation

This section presents the design of the spoken language processing system by adopting a bottom-up design methodology. The individual components are the speech recognizer, the natural language processor, and the speech synthesizer; the software-based speech recognition component is integrated with the hardware language processor component using the serial RS232 interface. The concept visualization of the proposed application is shown in Figure 1. An acoustic modeling server incorporates the proposed software speech recognition component to provide global access to any specific smart device connected through the Internet of Things (IOT) cloud platform. The query text generated from this server is received by the hardware language processor through the IOT cloud. On the other hand, the hardware language processor is embedded within the smart device as a co-processor.


Figure 1. Smart device troubleshooting system.

The server communicates with the smart device through the IOT cloud, and the smart device, along with the language processor, has an IOT device connected through an RS232 interface, facilitating communication for purposes such as control, monitoring, operation, and troubleshooting. The speech recognition component accepts speech in real time and matches features to words, which are combined into sentences. MATLAB R2018a is used to design this component as a software application.

The hardware-based language processor is coded using Verilog HDL, simulated using Modelsim 6.5 PE, synthesized, and verified experimentally. Finally, the entire system is tested with real-time data and its performance parameters are obtained. The design of the individual components is explained below.

2.1. BLSTM-Based Speech Recognition

Speech recognition involves capturing speech into the system (speech acquisition), pre-processing, feature extraction, BLSTM-based training and classification, word-level modeling, regularization with dropout, and likelihood estimation using SoftMax [15]. Speech acquisition is performed at the sentence level using MATLAB-based recording software through a microphone. The speech acquisition process captures the speech signal from a natural environment, and hence it is susceptible to noise.

This degrades the signal strength and requires pre-processing the signal using several techniques. A band-pass filter is initially used to pass signals between 50 Hz and 10,000 Hz. A pre-emphasis filter, which is a high-pass filter [16], is then used to increase the amount of energy in the high frequencies; this eases access to information in the higher formants of the speech signal and enhances phone detection accuracy. The pre-emphasis filter function is given by Equation (1).

$$H(z) = 1 - az^{-1} \quad (1)$$

where $a = 0.95$.

The magnitude and phase response of the pre-emphasis filter is shown in Figure 2.
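As a concrete illustration, a minimal Python sketch of this pre-emphasis step is given below (an assumed NumPy re-implementation, not the authors' MATLAB code):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.95) -> np.ndarray:
    """Apply the high-pass filter of Equation (1): y[n] = x[n] - a*x[n-1]."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```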


Figure 2. Magnitude and phase response of the pre-emphasis filter.

The speech signal is a natural realization of a set of randomly changing processes, where the underlying processes undergo slow changes. Speech is assumed to be stationary within a small section, called a frame. Frames are short segments of the speech signal that are isolated and processed individually. A frame contains discontinuities at its beginning and end, which can be smoothed by window functions. In this design, the Hamming window function is used and is given by Equation (2).

$$W(n) = \begin{cases} 0.54 - 0.46\cos(2\pi n/L), & 0 \le n \le L \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where $L$ is the length of the window. A good window function has a narrow main lobe and low side lobes, and a smooth tapering at the edges is desired to minimize discontinuities. Hence, the most common window used in speech processing is the Hamming window, whose side lobes become lower as the length of the window is increased [17]. The human ear is less sensitive to high frequencies. To model this, frequencies above 1000 Hz are mapped to the Mel scale. Conversion to the Mel scale involves creating a bank of triangular filters that collect energy from each frequency band, with 10 filters spaced linearly within 1000 Hz and the rest spread logarithmically above 1000 Hz. The Mel Frequency Cepstral Coefficients (MFCC) can be computed from the raw acoustic frequency $f$ using Equation (3).

$$\mathrm{mel}(f) = 2595\log_{10}\left(1 + \frac{f}{700}\right) \quad (3)$$
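For concreteness, a short Python sketch of the Hamming window of Equation (2) and the Mel mapping of Equation (3) (an illustrative re-implementation, not code from the paper):

```python
import numpy as np

def hamming_window(L: int) -> np.ndarray:
    """W(n) = 0.54 - 0.46*cos(2*pi*n/L) for 0 <= n <= L (Equation (2))."""
    n = np.arange(L + 1)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / L)

def hz_to_mel(f: float) -> float:
    """Map a raw acoustic frequency in Hz to the Mel scale (Equation (3))."""
    return 2595.0 * np.log10(1.0 + f / 700.0)
```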

Cepstrum is defined as the spectrum of the log of the spectrum of a time waveform [18]. The cepstrum is used to improve phone recognition performance, and a set of 12 coefficients is calculated; the cepstrum of a signal is the inverse DFT of the log magnitude of the DFT of the signal. In addition to the Mel Frequency Cepstral Coefficients (MFCC), the energy of the signal, which correlates with the phone identity, is computed. For a signal $x$ in a window from time sample $t_1$ to time sample $t_2$, the frame energy is calculated using Equation (4).

$$E(n) = \sum_{t=t_1}^{t_2} x^2(t) \quad (4)$$

The energy forms the 13th coefficient and the set of 12 coefficients are combined to form 13 Cepstral coefficients. These Cepstral coefficients are used in a delta function (∆) to identify changing features related to change in the spectrum [19]. For a short-time Cepstral sequenceC[n], the delta-Cepstral features are typically defined as

$$D[n] = C[n+m] - C[n-m] \quad (5)$$


where $C$ refers to the existing coefficients and $m$ is the number of coefficients used to compute $\Delta$; $m$ is set to 2. A set of 13 delta coefficients and then double-delta coefficients (delta of delta coefficients) are calculated. By combining the three sets of 13 coefficients, feature vectors with 39 coefficients are obtained. The LSTM model learns directly from the extracted features. However, they are labelled at the word level first. Labelling enables the features to be identified uniquely and converted to categorical targets. The categorized labels are directly mapped to the sentences so that the process of classifying data according to the targets can be learned by the BLSTM model, as shown in Figure 3.
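A sketch of the delta computation of Equation (5) and the assembly of the 39-coefficient feature vectors (hypothetical helper names; the frame-edge handling is an assumption):

```python
import numpy as np

def delta(C: np.ndarray, m: int = 2) -> np.ndarray:
    """D[n] = C[n+m] - C[n-m] (Equation (5)); edge frames are clamped."""
    padded = np.pad(C, ((m, m), (0, 0)), mode="edge")
    return padded[2 * m:] - padded[:-2 * m]

def assemble_features(cepstra_and_energy: np.ndarray) -> np.ndarray:
    """Stack 13 static, 13 delta, and 13 double-delta coefficients per frame."""
    d = delta(cepstra_and_energy)                 # (num_frames, 13)
    dd = delta(d)                                 # delta of delta
    return np.hstack([cepstra_and_energy, d, dd])  # (num_frames, 39)
```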


Figure 3. Bidirectional Long Short Term Memory Cell (BLSTM) structure.

The speech recognition component has multiple layers. The first layer is the input layer, followed by a BLSTM layer. This layer directly accepts two-dimensional feature vectors as inputs in a sequential manner, as the data varies with time. For training, the entire set of features representing a sentence is available. The BLSTM layer is trained to learn input patterns along with the categorized labels using the Back Propagation Through Time (BPTT) algorithm. The BPTT algorithm executes the following steps to train the BLSTM model.

Step 1: Feed-forward computation. Feed-forward computation is similar to that of a Deep Neural Network (DNN). Additionally, there is a memory cell in which past and future information can be stored. The cell is constrained by input, output, update, and forget gates to prevent random access.

Step 2: Gradient calculation. This is a mini-batch stochastic gradient calculation, obtained by taking the derivative of the cost function. The cost function measures the deviation of the actual output from the target. In this study, cross-entropy is used to calculate the loss.

Step 3: Weight modification. The weights between the output layer and the input layer are updated with a combination of gradient values, learning rate, and a function of the output.

Step 4: Dropout. The process is repeated until the cost function is less than a threshold value.

The training of the BLSTM is followed by regularization using an existing technique called dropout, by adding a dropout layer. Dropout erases information from units randomly, which forces the BLSTM to compute the lost information from the remaining data. This has a very positive effect on the BLSTM, which stores past and future information in memory cells. Possible areas for the application of dropout are indicated with dotted lines in Figure 4; dropout is applied only to the non-recurrent connections. Using dropout on these connections increases the robustness of the model and also minimizes over-fitting.


Figure 4. Dropout-based regularization.

where x represents the input, rectangular boxes indicate LSTM cells, and y represents the output.
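To make the layer stack concrete, a minimal PyTorch sketch of a comparable five-layer recognizer is given below (the paper's implementation is in MATLAB; the layer sizes and the use of the final time step here are assumptions):

```python
import torch
import torch.nn as nn

class BLSTMRecognizer(nn.Module):
    """Sequence input -> BLSTM -> fully connected (word model) -> dropout -> softmax."""
    def __init__(self, num_classes: int, feat_dim: int = 39, hidden: int = 128):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # word-level modeling
        self.drop = nn.Dropout(0.5)                   # non-recurrent connections only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.blstm(x)                      # x: (batch, frames, 39)
        out = self.drop(self.fc(out[:, -1, :]))     # summarize with the last time step
        return torch.log_softmax(out, dim=-1)
```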

Word-level modeling is done using TIMIT, an acoustic corpus of speech data [20].

2.2. LSTM-Based Language Processor

The LSTM-based NLP framework is implemented as a hardware language processor. The language processor is designed using a combination of two implementation styles, namely process-based and model-based, and hence is hybrid in nature. It is designed as a combination of functional blocks using the Verilog Hardware Description Language (HDL). Logical verification is done using Modelsim 6.5 PE. The design is prototyped on an Altera DE2-115 Board with a CYCLONE IVE EP4CE115FC8 FPGA using Quartus II software. The various functional blocks are individually explained below.

2.2.1. Universal Asynchronous Receiver Transmitter (UART) Module

The UART module has a transmitter and a receiver to enable two-way communication with the software modules. It follows asynchronous communication controlled by the baud rate. The commencement and completion of transmission of a byte are indicated with start and stop bits. The receiver receives data from the speech recognition module one bit at a time. The receiver monitors the line for the start bit, which is a transition from high to low. If the start bit is detected, the line is sampled for the first bit immediately after the start bit.

Once the 8 data bits are received, the data bits are processed into the system after the reception of a stop bit, which is a transition from low to high. The FPGA clock frequency is 50 MHz, and for the proposed system, the chosen transmission baud rate is 9600 bps. If the data at the receiver were sampled directly at the FPGA clock frequency, the data could not be retrieved correctly. Hence, the data transmission rate of the transmitter and the reception rate at the receiver are kept constant and equal by dividing the clock frequency by the integer 5208 to generate the 9600 bps baud rate.
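A quick check of the divider value (a worked example rather than code from the design):

```python
clock_hz = 50_000_000      # on-board FPGA clock
baud = 9600                # chosen UART transmission rate
divider = round(clock_hz / baud)
print(divider)             # 5208: the integer divisor that derives the baud tick
```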

2.2.2. Vectorizer

The actual input to the language processor is text. However, the language processor handles only numeric data internally. Hence, the vectorizer converts the text to a byte value. The process of receiving query data from the input device is accomplished via a UART interface (receiver), a shift register, and a First-in-First-out (FIFO) register. The process flow diagram for real-time processing of query text is shown in Figure 5. Each word of a query text is converted to a key in a sequence of steps:

1. The UART receiver receives a byte of data and transfers it to the shift register.


2. In the shift register, the maximum word length is fixed at 80 bits and is monitored by a counter. The shift register receives data until the maximum length is reached or the word end is detected.

3. The FIFO length is variable. However, for the current implementation, it is fixed as the first 8 words of a sentence. At the end of each word, the FIFO moves on to the next value, when indicated by a counter’s (~W/R) active-low signal. When the FIFO becomes full, the (~W/R) becomes active-high, indicating that the FIFO can be read sequentially.

4. The vectorizer module reads each word from the FIFO and converts it to an 8-bit value by mapping it against the information dictionary; the value is then passed to a key encoder.

5. The Key encoder receives each byte value and encodes it into a unique 8-bit key.
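A behavioral Python sketch of this word-to-key path (the shift register and FIFO timing are elided; the dictionary and key table are hypothetical placeholders):

```python
def vectorize(sentence: str, dictionary: dict, key_table: dict,
              max_words: int = 8) -> list:
    """Map each of the first 8 words to a byte value, then to a unique 8-bit key."""
    words = sentence.split()[:max_words]            # FIFO holds the first 8 words
    byte_values = [dictionary.get(w, 0) for w in words]
    return [key_table.get(b, 0) for b in byte_values]
```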


Figure 5. Vectorizer module.

The generation of the query text is followed by the generation of the information text.

2.2.3. Information Text Generation

The logic verification of the Reverse Lookup Dictionary (RLD) is given in Figure 6. Here, the number and its corresponding key are highlighted.


Figure 6. Logic verification of reverse lookup dictionary (RLD) module.

A paragraph of information text is stored in a text file and downloaded into a custom RAM called the Word Library during runtime. The location address of each word is 8-bit and corresponds to a location in the word library ranging from 00 to FF. An address generator module generates these addresses. Initially, the words are assigned byte values by the vectorizer. Then, unique keys are assigned to each byte value by the key generator, and the combination of these blocks constitutes a Reverse Lookup Dictionary (RLD). The RLD performs two different functions: it is used to generate a unique key from a given word, and it is used to recover a word given its key.

2.2.4. Key Encoder

The information text keys and query text keys generated by the previous modules are compared by a comparator and assigned four 8-bit values based on four different observed criteria, as given in Table 1. From the table, it can be seen that if the start or end of the query text or information text is detected, unique values are assigned. The same process is performed when a match or mismatch of the two text sequences is detected. This process primarily standardizes the input sequence for one-pass learning and additionally secures the data from the external environment. The block diagram of the key encoder is shown in Figure 7.

Table 1. Key encoder.

No.  Comparison Parameters  Encoder Output
1    Qtext start            Output Value: 8'd127
2    Qtext end              Output Value: 8'd63
3    Itext == Qtext         Output Value: 8'd191
4    Itext <> Qtext         Output Value: 8'd0


Figure 7. Process flow diagram of the key encoder.
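A behavioral model of the comparator logic in Table 1 (a Python sketch of the mapping only; the hardware emits these codes synchronously):

```python
def encode_key(qkey: int, ikey: int, q_start: bool, q_end: bool) -> int:
    """Map the comparison outcomes of Table 1 to four fixed 8-bit codes."""
    if q_start:
        return 127            # 8'd127: start of query text detected
    if q_end:
        return 63             # 8'd63:  end of query text detected
    return 191 if ikey == qkey else 0   # 8'd191 on match, 8'd0 on mismatch
```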

2.2.5. LSTM Decoder

The values generated by the key encoder are passed through the LSTM decoder. The LSTM decoder is a Deep Neural Network (DNN) consisting of an input layer, two LSTM hidden layers, namely the training and prediction layers, and an output layer. The LSTM layers train on the values from the key encoder and predict the response by implementing self-learning within their structure using individual LSTM modules.

An LSTM layer has a memory cell controlled by four gates, namely the input gate, output gate, update gate, and forget gate. The input gate allows new information to flow into the memory cell. The update gate allows information changes to be updated in the memory cell. The forget gate allows information in the memory cell to be erased. The output gate allows stored information to be transferred to subsequent modules or layers in the LSTM module.

The LSTM module is combined with weight and bias storage memory units to form the LSTM decoder. The key encoder values are sequentially transferred into the LSTM training layer by the input layer. In this layer, the four values generated by the encoder are random and are mapped to four ranges of values by the LSTM training layer. At the same time, the prediction layer keeps track of the order of the sequence of values and predicts the index in the word library where most of the values are located. The RLD is used as the reference for this task. If multiple indices are obtained in this process, identification of the most probable information text that constitutes the response is achieved by the dialogue modeler. The block diagram of the LSTM decoder is shown in Figure 8.


Figure 8. Long Short Term Memory Cell (LSTM) decoder.

2.2.6. Dialogue Modeler

The dialogue modeler consists of index detect, output hold, and output generation modules.

The index detect module compares the indices of the information text with those generated for the query text and calculates a score based on the matching keywords and the sequence in which these words occur.

The language processor is a synchronous module and generates the output in a sequence. Hence, the output is maintained at the previous value until the similarity score is calculated by the index detection module. This operation is performed by the output hold module. Finally, the byte values of the actual words in a sentence are generated by the output generation module. The entire processor operates using a system clock and two control signals generated by the main control unit.

2.2.7. Main Control Unit

This unit is used to synchronize all the individual hardware units. Most of the hardware blocks execute sequentially and are controlled using enable signals. The state diagram is given in Figure 9. The UART receiver operates at the baud rate; hence, the shift register needs to operate with a clock slower than the on-board clock, and a clock divider circuit, under the control of the main control unit, is used to reduce the clock speed. A shift signal is generated to shift each byte of input data into the shift register. The main control unit is implemented as a state machine. The FIFO is operated in write or read mode using a (write/read) signal.

The control block enables the query data captured by the FIFO from the shift register. This data is transferred to the subsequent modules by generating a query enable signal (QEN) after the start signal (START) is received. The query text continues to be transferred to the subsequent modules until all input words are captured in the FIFO. Then, the simultaneous transfer of query text and information text is achieved by enabling an information-text-enable signal (IEN). When a done signal (DONE) is received, the state machine returns to the initial state.
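The signal names below (START, QEN, IEN, DONE) follow the text; the three-state structure is an illustrative Python reading of Figure 9, not a transcription of the Verilog:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()      # wait for START
    QUERY = auto()     # QEN asserted: stream query text from the FIFO
    COMPARE = auto()   # IEN asserted: stream query and information text together

def next_state(state: State, start: bool, fifo_full: bool, done: bool) -> State:
    if state is State.IDLE and start:
        return State.QUERY
    if state is State.QUERY and fifo_full:
        return State.COMPARE
    if state is State.COMPARE and done:
        return State.IDLE
    return state
```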


Figure 9. Main control unit of the language processor.

The weights are predetermined through a series of simulations such that the values of the cell lie between 25% and 75% of the maximum possible value. The learning algorithm is implemented in the structure of the LSTM network and is explained below:

Step 1: The query text is vectorized and assigned unique keys.

Step 2: The information text is also vectorized, assigned unique keys, and populated in a look-up dictionary.

Step 3: The key encoder sequentially compares every query key with the information key, encodes them into 4 possible values, and transfers them sequentially to the LSTM network.

Step 4: The encoded keys are mapped to four ranges of values and the index of the response text is identified.

Step 5: The integer values of the response text following the index are extracted from the reverse lookup dictionary to obtain response information sentences.

Step 6: Finally, a proximity score is calculated based on the similarity between information and query sentence by the dialogue modeler.

Step 7: The sentence with the maximum score is selected as the most probable response sentence.
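An end-to-end software model of Steps 6 and 7 (a deliberately simplified sketch: the key encoding and range mapping of Steps 3 and 4 are collapsed into direct keyword matching):

```python
def respond(query: str, info_sentences: list) -> str:
    """Score each information sentence against the query; return the best match."""
    q_words = set(query.lower().split())

    def proximity(sentence: str) -> int:
        # Step 6: proximity score from keywords shared with the query
        return len(q_words & set(sentence.lower().split()))

    # Step 7: the sentence with the maximum score is the most probable response
    return max(info_sentences, key=proximity)
```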

The experimental setup of the spoken language processing system is shown in Figure 10.


Figure 10. Response text from the language processor.

(A demo of the spoken language processing system is presented at the following YouTube link: https://www.youtube.com/watch?v=T_6ntzVBRWI).


3. Results and Discussion

3.1. Experiment 1

The spoken language processing system is tested by performing three sets of experiments.

In experiment 1, the BLSTM-based speech recognition component is tested. The dataset consists of 100 sentences from a vocabulary of 245 words. The proposed BLSTM-based speech recognition system (BLSTM2) was trained on the 100 sentences used earlier to evaluate the performance of the HMM speech recognition system. To overcome the possibility of over-fitting, dropout-based regularization was used. The effect of dropout on the model's accuracy was evaluated by testing with and without the addition of a dropout layer (BLSTM 2 and BLSTM 1, respectively, in Table 2). During prediction, the real-time input was parameterized, and the text sentence for the speech signals was generated by comparing with the trained parameters. The results are shown in Table 2.

Table 2.Accuracy of BLSTM system.

No.  Sentences Used  BLSTM 1 Accuracy (%)  BLSTM 2 Accuracy (%)

1 This is Samsung S6 90 90

2 Lee Byung is Samsung founder 90 90

3 S6 has a touch screen 100 100

4 Is camera working? 80 90

5 Samsung has Android OS 80 80

6 The Wi-Fi is ON 80 70

7 My name is Max 50 70

8 What is your name? 70 70

9 Hi to all! 100 80

10 Conclude the event 70 80

Average Accuracy 81 82

From Table 2, the accuracy of the BLSTM-based speech recognition system was obtained as 82%. To compare and evaluate the performance, a custom-designed HMM-based continuous word speech recognition system was implemented. The accuracy of the HMM-based system was obtained from Table 3 as 77.03%, i.e., 100% minus the average WER of 22.97% for HMM 3.

Table 3.Word Error Rate of the three Hidden Markov Model (HMM)-based models.

No.  Sentences Used  HMM 1  HMM 2  HMM 3 (WER, %)

1 This is Samsung S6 26.1 25.6 25.8

2 Lee Byung is Samsung founder 20.8 21.1 21.7

3 S6 has a touch screen. 19.4 20.3 17.2

4 Is camera working? 24.5 23.3 25.4

5 Samsung has Android OS 26.1 24.8 24.5

6 The Wi-Fi is ON 24.1 25.2 21.5

7 My name is Max 24.6 25.5 24.1

8 What is your name? 25.1 24.9 23.9

9 Hi to all! 23.9 22.1 24.3

10 Conclude the event. 23.3 20.3 21.3

Average Word Error Rate (WER) 23.79 23.31 22.97

These results indicate the improvement in accuracy by implementing the BLSTM-based speech recognition system.
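For reference, the WER values in Table 3 presumably follow the standard edit-distance definition; a Python sketch (the paper does not give its scoring script):

```python
def wer(ref: list, hyp: list) -> float:
    """WER = (substitutions + deletions + insertions) / reference length * 100."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```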


3.2. Experiment 2

In this experiment, the second component of the spoken language processing system, namely the language processor, is tested. This system is LSTM-based and is implemented on an Altera DE2-115 board as dedicated hardware. The input is processed as an encoded sequence of hexadecimal bytes for every character of a sentence and is decoded as the response text by the LSTM network.

The performance metrics include processing time and F1 score. The output of the LSTM layer comprises several possible responses, and the best response is selected by a scoring mechanism using a dialogue modeler. The results are shown in Table 4.
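For clarity, the F1 score is the harmonic mean of precision and recall (the standard definition; the paper's per-query precision/recall bookkeeping is not spelled out):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)
```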

Table 4. F1 score of language processor.

No.  Query / Response  Accuracy (%)  Processing Time (s)  F1 Score (%)
1   Give the founder of Samsung? / The founder of Samsung is Lee Byung   90   0.59   94.7
2   Identify the current focus of Samsung / Samsung current focus is on smartphones   90   0.59   94.7
3   Which OS is present in S6? / S6 uses Android OS   90   0.59   94.7
4   Give the location of Samsung? / Samsung location is in South Korea   90   0.59   94.7
5   S6 was released in? / S6 was released in 2015   90   0.59   94.7
6   What is the main feature of S6 / S6 main feature is cost-effective   90   0.59   94.7
7   Tell the RAM size of S6 / S6 RAM size is 3GB   90   0.59   94.7
8   How much does S6 cost? / S6 cost is 1000 Ringgits   90   0.59   94.7
9   Is Samsung a local company? / Samsung is an Multinational (MNC) company   90   0.59   94.7
10  The main competitor of Samsung is / Samsung main competitor is Apple   100  0.59   100
AVERAGE   91   0.59   95.2

The FPGA device used in the design is the Cyclone IVE EP4CE115F29C8. The Computer Aided Design (CAD) tool used for the design is Altera Quartus II version 10.1. The design uses 4617 logic cells, among which 4475 cells are used for combinational functions and 490 cells are used for registered processing (memory-based). An alternate system, called a process-based NLP system, is designed for performance comparison. The process-based NLP system uses syntactic and semantic rules, which extract response information from the query text and generate the response text.

This system uses the same dataset of 100 sentences used for the language processor. The results are presented in Table 5 and reveal that the performance of the language processor is better in terms of F1 score and processing time: the language processor has an improvement of 0.7% in F1 score, with a 2.6× decrease in processing time over the process-based system.


Table 5. F1 score of process-based system.

No.  Query / Response  Accuracy (%)  Processing Time (s)  F1 Score (%)
1   Give the founder of Samsung? / The founder of Samsung is Lee Byung.   100   2.12   94.7
2   Identify the current focus of Samsung / Samsung current focus is on smartphones   90   2.11   88.8
3   Which OS is present in S6? / S6 uses Android OS   100   2.11   100
4   Give the location of Samsung? / Samsung location is in South Korea   100   2.11   100
5   S6 was released in? / S6 was released in 2015   90   2.13   88.8
6   What is the main feature of S6 / S6 main feature is cost-effective   90   2.11   94.7
7   Tell the RAM size of S6 / S6 RAM size is 3GB   90   2.13   94.7
8   How much does S6 cost? / S6 cost is 1000 Ringgits   90   2.13   94.7
9   Is Samsung a local company / Samsung is an MNC company   90   2.12   94.7
10  The main competitor of Samsung is / Samsung main competitor is Apple   90   2.12   94.7
AVERAGE   93   2.12   94.58

3.3. Experiment 3

In this experiment, the HMM-based speech recognition system (HMM3) and BLSTM-based speech recognition system (BLSTM2) are combined with each of the NLP systems, namely the process-based NLP system (NLP1) and language processor (NLP2), to form four Spoken Language Processing Systems (SLPS). HMM3 and NLP1 systems are combined to form SLPS1. HMM3 and NLP2 systems are combined to form the SLPS2 system. Similarly, BLSTM and NLP1 are combined to form the SLPS3 system and BLSTM and NLP2 are combined to form SLPS4.

Each system is trained and tested on 20 query sentences within a vocabulary of 243 words, constituting a training dataset of 200 sentences. The performance of these systems is assessed with their F1 scores, and the results are shown in Table 6. The F1 scores of the four systems SLPS1, SLPS2, SLPS3, and SLPS4 are obtained from Table 6 as 81.3%, 84.7%, 84.05%, and 88.2%, respectively. The hybrid speech processing system has a recognition accuracy of 81.5%, hardware utilization of 5%, processing time of 0.59 s, and an F1 score of 88.2%.

The hardware utilization of the language processor is shown in Figure 11.

Electronics 2019, 8, x FOR PEER REVIEW 14 of 16

Table 6. F1 scores of various Spoken Language Processing System (SLPS) implementations.

No. Query Sentence SLPS 1 SLPS 2 SLPS 3 SLPS 4

1 Where is Samsung located 40 70 40 70

2 Who is the founder of Samsung 80 80 60 70

3 What is the current focus of Samsung 80 70 80 90

4 When was S6 released 70 100 50 50

5 Which OS is present in Samsung 90 80 80 100

6 Is Samsung a national company 50 80 70 70

7 The main Competitor of Samsung is 90 60 80 70

8 How much does S6 cost 40 70 90 80

9 Tell the RAM size of S6 50 40 50 70

10 Airasias main office is located in 60 60 80 100

11 AK021′s arrival time is 50 70 90 100

12 Alternate flight to Perth 40 60 70 70

13 Research office is at 100 90 80 70

14 Does the computer have a name 50 60 90 90

15 Coming Eureca conference happens in 70 80 70 60

16 Conclude the Event 80 70 70 90

17 Greet everyone 70 80 80 70

18 Your native is 70 70 60 80

19 Which university are you studying at 90 100 80 100

20 Thank you for the details 100 80 80 80

F1 Score (%) 81.30 84.7 84.05 88.2

Standard deviation 2.00 1.42 1.41 1.44

The hardware utilization of the language processor is shown in Figure 11.

Figure 11. Hardware utilization of language processor.

The performance metrics of several existing systems are compared with those of the proposed system in Table 7.


Table 7. Result analysis. Accuracy and training time are speech recognition metrics; processing time, hardware utilization, and F1 score are NLP metrics.

No. | System Type (Real-Time) | Accuracy (%) | Training Time (s/Epoch) | Processing Time (s) | Hardware Utilization (%) | F1 Score (%) | Reference
1 | Full Software (V) | 85.05 | 1.1 | – | – | 95.6 | [8]
2 | Full Hardware (X) | 79.31 | – | 0.082 ms | – | 74.7 | [14]
3 | Full Hardware (Y) | – | – | 0.932 | 10.1 | 89.1 | [21]
4 | Full Hardware (Z) | 79.68 | – | 0.932 | – | 61.3 | [13]
5 | Proposed System (Software/Hardware) | 81.5 | 1.125 | 0.59 | 5 | 88.2 | –
– | Variation with Existing System | <4.15% | >2.27% | <36.7% | <50.4% | <7.7% | –

The performance of the hardware-based NLP is tested using processing time, hardware utilization, and F1 score. In the speech recognition task, the proposed system has a 4.15% decrease in recognition accuracy and a 2.27% increase in training time compared with System V [8]. System V's higher recognition accuracy and marginally shorter training time are due to the inherent parallelism and cache mechanisms it employs.

In the hardware-based NLP task, the proposed system has a 36.7% decrease in processing time and a 50.4% decrease in hardware utilization compared with System Y [21], and a 7.7% decrease in F1 score relative to the best existing F1 score, that of System V [8]. The result analysis of the proposed system is shown in Table 7. The decrease in F1 score is due to the reduced accuracy of the speech recognition stage, since the query text fed to the language processor is the output of the speech recognizer. This shows that the decreases in processing time and hardware utilization are significant.
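The variation row of Table 7 is the relative difference against the corresponding existing system; a minimal sketch reproducing it from the tabulated values (small deviations are rounding in the published table):

```python
# Relative variation of the proposed system against the corresponding
# existing system for each metric in Table 7; positive = decrease.
def variation(existing, proposed):
    return 100.0 * (existing - proposed) / existing

print(variation(85.05, 81.5))  # accuracy vs System V: 4.17 (reported as 4.15%)
print(variation(1.1, 1.125))   # training time vs V: -2.27 (a 2.27% increase)
print(variation(0.932, 0.59))  # processing time vs System Y: 36.70
print(variation(10.1, 5.0))    # hardware utilization vs Y: 50.50 (reported as 50.4%)
print(variation(95.6, 88.2))   # F1 score vs System V: 7.74 (reported as 7.7%)
```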


4. Conclusions and Future Work

The final implementation of the spoken language processing system achieved a recognition accuracy of 81.5%, a 45 s training time for speech recognition, a processing time of 0.59 s, 5% hardware utilization, and F1 scores of 95.2% for the language processor and 88.2% for the whole hybrid system. The training-time limitation of the LSTM components is overcome in the proposed hybrid speech processing system, which incorporates an FPGA-based language processor with a unique architecture and one-pass learning. The language processor is more suited to Application Specific Integrated Circuit (ASIC) implementation as a coprocessor in low-cost electronic devices.

Work is ongoing on the full-fledged application of the proposed system to the task of smart device troubleshooting.

Author Contributions: Conceptualization and Writing, P.E.J. and H.K.M.; Supervision, Review and editing of the manuscript, H.K.M. and C.A.V.

Funding: This research was funded by Taylor’s University.

Acknowledgments: The authors would like to express their sincere gratitude towards Dr. Alan Tan Wee Chiat for providing background knowledge on the fundamentals of this research.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Chollet, F. Deep Learning with Python; Manning Publications Co.: Shelter Island, NY, USA, 2017. Available online: https://www.manning.com/books/deep-learning-with-python (accessed on 6 May 2019).

2. Rabiner, L.R.; Schafer, R.W. Theory and Applications of Digital Speech Processing; Pearson: Upper Saddle River, NJ, USA, 2011; p. 64.

3. Graves, A.; Jaitly, N. Towards End-to-End Speech Recognition with Recurrent Neural Networks. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1764–1772.

4. Miao, Y.; Gowayyed, M.; Metze, F. EESEN: End-to-End Speech Recognition Using Deep RNN Models and WFST-Based Decoding. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 13–17 December 2015; pp. 167–174. [CrossRef]

5. Rao, K.; Sak, H.; Prabhavalkar, R. Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; pp. 193–199.

6. Zhang, C.; Woodland, P. High Order Recurrent Neural Networks for Acoustic Modelling. arXiv 2018, arXiv:1802.08314v1.

7. Yao, K.; Peng, B.; Zhang, Y.; Yu, D.; Zweig, G.; Shi, Y. Spoken Language Understanding Using Long Short-Term Memory Neural Networks. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), South Lake Tahoe, NV, USA, 7–10 December 2014; pp. 189–194.

8. Mesnil, G.; Dauphin, Y.; Yao, K.; Bengio, Y.; Deng, L.; Dilek, H.-T.; He, X.; Heck, L.; Tur, G.; Yu, D.; et al. Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 530–539. [CrossRef]

9. Liu, X.; Sarikaya, R.; Zhao, L.; Pan, Y. Personalized Natural Language Understanding; Interspeech: Dresden, Germany, 2016; pp. 1146–1150.

10. Dyer, C.; Ballesteros, M.; Ling, W.; Matthews, A.; Smith, N.A. Transition-Based Dependency Parsing with Stack Long Short-Term Memory. arXiv 2015, arXiv:1505.08075v1.

11. Barone, A.V.; Helcl, J.; Sennrich, R.; Haddow, B.; Birch, A. Deep Architectures for Neural Machine Translation. In Proceedings of the Workshop on Machine Translation, Copenhagen, Denmark, 7–11 September 2017.

12. Han, S.; Kang, J.; Mao, H.; Hu, Y.; Li, X.; Li, Y.; Xie, D.; Luo, H.; Yao, S.; Wang, Y.; et al. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; ACM: New York, NY, USA, 2017; pp. 75–84.
