It is hereby certified that Len Shu Yuan (ID No: 18ACB07073) has completed this final year project entitled "Improving speech-to-text recognition for Malaysian English accents using accent identification" under the supervision of Ts Dr Cheng Wai Khuen (Supervisor) of the Department of Computer Science, Faculty of Information and Communication Technology. I declare that this report entitled "Improving speech-to-text recognition for Malaysian English accents using accent identification" is my own work, except as cited in the references. For this final year project, I would like to express my sincere thanks to all parties who supported me and helped me to complete the project.
However, the performance and accuracy of speech recognition systems are greatly affected by non-native accents, for example, Malaysian English. In this project, Accent Identification (AID) techniques will be implemented to improve the performance of ASR systems in recognizing Malaysian English accents. The datasets used in this project come from Mini LibriSpeech, the Speech Accent Archive and other Malaysian English speakers.
LIST OF ABBREVIATIONS
Introduction
- Problem Statement and Motivation
- Project Scope
- Project Objectives
- Impact, Significance and Contribution
- Background Information
- Report Organization
The motivation for this project is to address the accuracy problems of advanced ASR systems when dealing with non-native accents such as Malaysian English, which is a mixture of accents influenced by the languages of several ethnic groups. The project will focus on using the Kaldi toolkit, a state-of-the-art speech recognition tool. An accented ASR system built for Malaysian English can help in the localization of applications or products.
Today, many advanced ASR systems allow users to control their devices with voice commands in different languages. However, non-native accents can degrade the performance and robustness of ASR systems developed with standard English models. Chapter three describes the methodology used in this project, including the speech corpora used, data preparation and an overview of the proposed ASR models.
Literature Review
- Overview
- Automatic Speech Recognition
- Kaldi
- Gaussian Mixture Models-Hidden Markov Model Hybrid System
- Deep Neural Network-Hidden Markov Model Hybrid Systems
- Existing Accent Identification Techniques
- Support Vector Machines (SVM) Based Accent Identification
- Convolutional Neural Networks (CNN) based Accent Identification
- Comparison of 3 AID Techniques
Kaldi is an open-source, state-of-the-art automatic speech recognition (ASR) toolkit written in C++ and licensed under the Apache v2.0 license. The top level of the diagram shows the external libraries that Kaldi depends on, both of which are freely available: OpenFst for the finite-state framework, and the BLAS/LAPACK numerical linear algebra libraries. The second level of the diagram is the C++ library, which is called from shell scripts to build and run the speech recognizer.
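To illustrate this layering, the sketch below shows how the shell-script level drives the C++ command-line level; the file paths and directory names are illustrative assumptions, not paths from this project.

# The C++ binary compute-mfcc-feats (built on the Kaldi library, which in
# turn relies on BLAS/LAPACK) can be invoked directly from the shell:
compute-mfcc-feats --config=conf/mfcc.conf \
  scp:data/train/wav.scp ark,scp:mfcc/raw_mfcc_train.ark,mfcc/raw_mfcc_train.scp

# In practice, top-level wrapper scripts hide such calls, e.g.:
steps/make_mfcc.sh --nj 4 data/train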
The last level of the Kaldi toolkit is the shell scripts that implement the various stages of speech recognition, such as data preparation, feature extraction, model training, decoding-graph construction, decoding, etc. The GMM-HMM model is one of the most widely used methods for acoustic model training in conventional speech recognition systems. According to Najafian and Russell (2020), a DNN-HMM ASR model achieved a 47% WER reduction over the GMM-HMM model, showing that the DNN-HMM model is inherently better at accommodating accented speech than the GMM-HMM model. Significant performance improvements have also been achieved in state-of-the-art ASR systems by replacing GMMs with DNNs (Hinton et al., 2012).
In the architecture of the hybrid DNN-HMM system shown in Figure 2.2.3.1, the HMM component is used to model the sequential properties of the speech signal, while the observation (emission) probabilities are estimated by the DNN from the acoustic observations (Yu & Deng, 2015). The hidden layers store important information, such as the importance of each input, and break the overall function down into specific transformations of the data (Gad, 2018). Many solutions have been proposed by other researchers, and some of them are relevant to this project.
Najafian and Russell (2020) presented a study of the relationship between AID and ASR accuracy using i-vector based AID. According to the confusion matrix in Figure 2.3.1.1, the i-vector accent identification accuracy ranges from 52% to 100%. The authors argued that the overlap between the i-vector accent spaces of the accents, shown in Figure 2.3.1.2, causes the low accuracy for some accents.
The authors compared the performance of SVMs with three kernel designs (linear, polynomial and RBF), and the best results, shown in Figure 2.3.3.1, were obtained with the linear SVM (Pedersen & Diederich, 2007).
Methodology
- Overview
- Data collection and preprocessing
- Speech corpus
- Data preparation and preprocessing
- Kaldi models
- GMM-HMM model
- DNN model
- CNN based accent identification model
- Tools to use
- Implementation issues and challenges
- Evaluation metric
In this project, speech files from Malay speakers will be used, and I have collected 20 additional speech recordings from Malaysian Chinese and Indian speakers to ensure that accents of different ethnicities are included. In addition, speech data from American English speakers will be used to enable a comparison between native and non-native speakers. After the required data is prepared, the GMM-HMM models are trained first, followed by DNN training.
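As a minimal illustration of the data preparation step, a Kaldi data directory for the Malaysian English test set would contain files like those sketched below. The utterance and speaker IDs and the file path are hypothetical; the transcript shown is the opening of the elicitation paragraph used by the Speech Accent Archive.

# wav.scp: maps each utterance ID to its audio file
my_spk01_utt01 /corpus/test_my/spk01/utt01.wav
# text: maps each utterance ID to its word-level transcript
my_spk01_utt01 PLEASE CALL STELLA ASK HER TO BRING THESE THINGS
# utt2spk: maps each utterance ID to a speaker ID (spk2utt is derived from it)
my_spk01_utt01 my_spk01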
In this project, the monophone model will be the base model for evaluating and improving the performance of the other models. It serves as a building block for the triphone models, which use contextual information (Chodroff, 2015). A decoding graph will again be generated, and the test set will be decoded using feature-space maximum likelihood linear regression (fMLLR).
For alignment, the model estimates the speaker identity (via the fMLLR matrix) and then removes it from the features by multiplying the inverse matrix with the feature vectors (Chodroff, 2015). In this project, a Time-Delay Neural Network (TDNN) architecture expects as input 40-dimensional high-resolution MFCC features computed with 25 ms frames and a 10 ms frame shift. The MFCCs and Cepstral Mean and Variance Normalization (CMVN) statistics of the speed-perturbed data are then computed.
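For reference, a high-resolution MFCC configuration matching these settings would look like the sketch below; the values follow the standard Kaldi chain recipes and are stated here as assumptions rather than settings copied from this project.

# conf/mfcc_hires.conf (sketch)
--use-energy=false   # use cepstral coefficient C0 rather than raw energy
--num-mel-bins=40    # 40 mel-filterbank bins
--num-ceps=40        # keep all 40 cepstra, giving 40-dimensional features
--low-freq=20        # low cutoff for the mel bins (Hz)
--high-freq=-400     # high cutoff: 400 Hz below the Nyquist frequency
# --frame-length=25 and --frame-shift=10 (in ms) are the Kaldi defaults,
# corresponding to the 25 ms frames and 10 ms shift described above.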
Once the data is ready, the audio files are converted to .wav format and MFCC feature extraction is performed. The first challenge is limited computing power, as my laptop's specifications are not sufficient to support training on large speech corpora. The performance of the Kaldi models is evaluated using the Word Error Rate (WER), the main metric used to measure ASR performance.
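For reference, the WER is computed from the number of substitutions S, deletions D and insertions I in the recognized output relative to the number of words N in the reference transcript:

\mathrm{WER} = \frac{S + D + I}{N} \times 100\%

A lower WER therefore indicates better recognition, and the WER can exceed 100% when the output contains many insertions.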
The performance of the CNN-based model will be evaluated using the confusion matrix generated from its predictions.
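Concretely, if C is the confusion matrix in which entry C_{ij} counts the utterances of accent i classified as accent j, the overall identification accuracy is the proportion of utterances on its diagonal:

\mathrm{Accuracy} = \frac{\sum_{i} C_{ii}}{\sum_{i} \sum_{j} C_{ij}}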
Experimental Setup and Result
- ASR training and decoding process
- Data file preparation
- Model training
- Feature extraction
- Monophone model (mono)
- Triphone models (tri1, tri2b, tri3b)
- CNN AID model
- Result analysis
Once the MFCC features are extracted, the next step is to use the Kaldi compute_cmvn_stats.sh script to perform CMVN. The monophone model (the base model) can then be trained using the Kaldi train_mono.sh script. After train_mono.sh finishes running, the decoding graph (an FST) is generated by the mkgraph.sh script.
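Assuming the standard Mini LibriSpeech recipe layout (directory names such as data/train_clean_5, data/lang and exp/mono follow that recipe and are assumptions here), these steps correspond roughly to the following commands:

# Per-speaker CMVN statistics over the extracted MFCCs
steps/compute_cmvn_stats.sh data/train_clean_5
# Monophone (base model) training
steps/train_mono.sh --nj 4 --cmd "$train_cmd" data/train_clean_5 data/lang exp/mono
# Build the decoding graph (HCLG FST) for the monophone model
utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph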
Then, the test set (dev_clean_2) is decoded based on the information generated during training. The first triphone model, tri1, is trained on the monophone alignments with delta features using the Kaldi script train_deltas.sh and the training graph generated for the base model. The other triphone model, tri2b, is trained by estimating an MLLT transform on top of LDA features; train_lda_mllt.sh is used in this case.
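A sketch of these triphone training steps, again assuming Mini LibriSpeech-style directory names; the numbers of tree leaves and Gaussians are illustrative defaults:

# Align the training data with the monophone model, then train tri1
# on delta + delta-delta features:
steps/align_si.sh --nj 4 --cmd "$train_cmd" data/train_clean_5 data/lang exp/mono exp/mono_ali
steps/train_deltas.sh --cmd "$train_cmd" 2000 10000 data/train_clean_5 data/lang exp/mono_ali exp/tri1

# Re-align with tri1, then train tri2b with LDA+MLLT:
steps/align_si.sh --nj 4 --cmd "$train_cmd" data/train_clean_5 data/lang exp/tri1 exp/tri1_ali
steps/train_lda_mllt.sh --cmd "$train_cmd" 2500 15000 data/train_clean_5 data/lang exp/tri1_ali exp/tri2b

# tri3b adds speaker adaptive training (SAT) with fMLLR:
steps/align_si.sh --nj 4 --cmd "$train_cmd" data/train_clean_5 data/lang exp/tri2b exp/tri2b_ali
steps/train_sat.sh --cmd "$train_cmd" 2500 15000 data/train_clean_5 data/lang exp/tri2b_ali exp/tri3b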
Each model is used to decode the test set (dev_clean_2) after the training and alignment process. The TDNN model then builds a new decision tree from the MFCC features, the chain topology and the alignments of the speed-perturbed data using build_tree.sh. The test set (dev_clean_2), American English (test_usa) and Malaysian English (test_my) sets are decoded using the trained ASR models.
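The tree-building step follows the standard Kaldi chain recipes and, under the same naming assumptions as above, looks roughly like this (the option values are illustrative):

# Build a new decision tree on the speed-perturbed alignments with the
# chain topology, using a frame subsampling factor of 3:
steps/nnet3/chain/build_tree.sh --frame-subsampling-factor 3 \
  --context-opts "--context-width=2 --central-position=1" \
  --cmd "$train_cmd" 3500 data/train_clean_5_sp data/lang_chain \
  exp/tri3b_ali_train_clean_5_sp exp/chain/tree_sp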
For dev_clean_2, the WER is the best and shows the largest improvement from tri3b to TDNN, which suggests that the DNN-HMM model offers a significant improvement over the GMM-HMM models. Looking at the four GMM models, the largest drop in WER occurs from monophone to triphone, suggesting that triphone modeling with delta features can significantly reduce the WER. However, the WER results of tri2b and tri3b show no improvement, which may suggest that speaker adaptive training (SAT) does not improve the recognition accuracy on the Mini LibriSpeech corpus, or that the corpus is simply not large enough to show an improvement.
The accuracy results are promising, as the CNN model correctly identifies most of the native speakers' speech.
Conclusion
- Project Review, discussions and conclusion
- Novelties and contribution
- Future work
Much work can still be done to improve the performance of the proposed ASR models and the AID model. Due to limited computing power and time, the performance of the ASR models and the CNN AID model is not yet good enough. To obtain better performance, more feature-processing methods can be tried to find the best combination.
Furthermore, the influence of the speaker's ethnicity and native language can be studied further to create a language model adapted to Malaysian English.
Bibliography
Appendices
FINAL YEAR PROJECT WEEKLY REPORT
- WORK DONE
- WORK TO BE DONE
- PROBLEMS ENCOUNTERED
- SELF EVALUATION OF THE PROGRESS: Need to speed up the progress
- SELF EVALUATION OF THE PROGRESS: Need to figure out how to improve the performance
Form title: Supervisor's Comments on Originality Report Generated by Turnitin for Submission of Final Year Project Report (for Undergraduate Programs). Final Year Project Title: Improving speech-to-text recognition for Malaysian English accents using accent identification.
The required originality parameters and limits approved by UTAR are as follows: (i) the overall similarity index must be 20% or below, (ii) the match for each individual cited source must be less than 3%, and (iii) matching texts in a continuous block must not exceed 8 words. Note: Parameters (i)-(ii) exclude quotes, the bibliography and text matches of less than 8 words. Note: The supervisor/candidate(s) must provide the Faculty/Institute with the full set of the originality report.
Based on the above results, I hereby declare that I am satisfied with the originality of the final year project report submitted by my student(s) as named above.
UNIVERSITI TUNKU ABDUL RAHMAN FACULTY OF INFORMATION & COMMUNICATION TECHNOLOGY (KAMPAR CAMPUS)