Daffodil International University

This project, submitted by Asif Shahariar, ID no., Iftekher Aziz, ID no., and Shovan Banik, ID no., to the Department of Computer Science and Engineering, Daffodil International University, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of B.Sc.

Department of Computer Science and Engineering, Faculty of Natural Sciences and Information Technology, Daffodil International University.

The deep knowledge and keen interest of our supervisor in "Machine Learning" helped us to carry out this project.

We are grateful to Syed Akhter Hossain, Professor and Head, Department of CSE, for his kind help in completing our project, and to the other faculty members and staff of the CSE department of Daffodil International University. We would like to thank all our course mates at Daffodil International University who took part in discussions while completing the coursework.

Spoken language identification is the task of detecting the language spoken by an unknown speaker.

Our main task is to identify parameters and features of spoken language that can be used to tell languages apart. The main goal of this project is to find the best algorithm for detecting a specific language. Spoken language identification has been used as a key technology in spoken language translation [1] and spoken document retrieval [2].

This problem came to our attention when we saw that Google Translate translates spoken language very well, but the languages have to be selected manually before translating.

Figure 1.2.1: Automatic language detection from speech isn’t available in Google Translate [4]

Research Question

If we can reliably detect the language from speech, we can make many people's work much easier by enabling its use in many more applications, including Google Translate. With this in mind, our project tries to find the most accurate algorithm for identifying English, Spanish, and German.

Expected Output

BACKGROUND

Related Works

Language can also be identified by dividing speech into different phonetic categories based on the acoustic structure of the language. In 1996, Zissman compared four approaches to language identification [9] and tried to find the best method among them. The four approaches are: Gaussian mixture model (GMM) classification, single-language phone recognition followed by n-gram language modeling (PRLM), parallel PRLM, and language-dependent parallel phone recognition (PPR).

Haizhou Li, Bin Ma and Chin-Hui Lee presented a vector space modeling approach to language identification. Biadsy and Hirschberg worked on identifying four dialects of Arabic using the above method [11]. Other models that have been applied to the task include a music-genre-motivated approach, a feed-forward neural network, a recurrent neural network, a convolutional neural network and a Gaussian mixture model.

From this survey we learned about the various methods researchers have used to identify languages from information gathered from speech. Currently, neural networks are widely used to extract frequency information from spoken words.

Terminologies

They proposed and implemented a language identification (LID) system to make an SVM with shifted delta cepstral (SDC) features more efficient. There, an SVM classifier with backpropagation premises was built to separate the target languages according to the probability distribution in the DLCSV space. DLCSVs were generated, and the SVM classifier ultimately estimated the posterior probability of each target language, which is used to determine the final results.

Another method for spoken language identification was proposed by Bin Ma and Haizhou Li [22], who identify the spoken language using a sound recognizer and a bag-of-sounds (BOS) classifier. Here, a high-dimensional vector classifier was used together with an SVM classifier on the original high-dimensional space. The data was split into two separate databases, one for training and one for testing the BOS classifier.
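As a rough illustration of the bag-of-sounds idea, the sketch below turns a sequence of sound tokens into a fixed-length vector of n-gram counts, which is the kind of high-dimensional vector an SVM could then classify. The token inventory, the example utterance, and the function name `bag_of_sounds` are hypothetical and are not taken from the cited system.

```python
from collections import Counter
from itertools import product


def bag_of_sounds(tokens, inventory, n=2):
    """Return a fixed-length vector of n-gram counts for a sound-token sequence."""
    # Count every run of n consecutive tokens in the utterance.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # One dimension per possible n-gram, always in the same order, so that
    # every utterance maps into the same vector space.
    return [counts[gram] for gram in product(inventory, repeat=n)]


inventory = ["a", "k", "s"]                  # hypothetical sound units
utterance = ["k", "a", "s", "a", "k", "a"]   # hypothetical recognizer output
vector = bag_of_sounds(utterance, inventory)
```

With an inventory of 3 units and bigrams the vector has 9 dimensions; a realistic inventory grows this quickly, which is why the space is described as high-dimensional.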

Identifying spoken language is a distinctive problem, and many people have tried to solve it using machine learning. In one paper, Julien Boussard, Andrew Deveau and Justin Pyron applied several methods and algorithms to this problem. They used a music-genre-motivated approach, a feed-forward neural network, a recurrent neural network, a convolutional neural network and a Gaussian mixture model in their project.

We discuss here how these methods work and which one produces the best results. For the recurrent neural model, they used the Long Short-Term Memory (LSTM) layer from Keras to construct the model. Convolutional and feed-forward neural networks are similar, but a convolutional layer computes its output differently: it slides a small shared set of weights across the input instead of connecting every input to every output.
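The difference between a feed-forward (fully connected) layer and a convolutional layer can be sketched in a few lines of plain Python. This is only a toy illustration with made-up weights, not the networks used in the cited project; the function names `dense_layer` and `conv1d_layer` are hypothetical.

```python
def dense_layer(x, weights, bias):
    """Fully connected layer: every output is a weighted sum of ALL inputs."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def conv1d_layer(x, kernel, bias=0.0):
    """1-D convolutional layer (no padding, stride 1).

    The same small kernel slides over the input, so each output only sees a
    local window. As in deep learning libraries, this is a cross-correlation.
    """
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


x = [1.0, 2.0, 3.0, 4.0]
dense_out = dense_layer(x, weights=[[1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5]],
                        bias=[0.0, 1.0])
conv_out = conv1d_layer(x, kernel=[1.0, -1.0])
```

The dense layer needs one weight per input for each output, while the convolutional layer reuses its two kernel weights at every position.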

Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in deep learning. We found LSTM applied to language identification in a paper by Ruben Zazo, Alicia Lozano-Diez, Javier Gonzalez-Dominguez, Doroteo T. Toledano, and Joaquin Gonzalez-Rodriguez. There they discuss Baum-Welch statistics and the Universal Background Model (UBM), and make several comparisons.

Figure 2.2.2: Proposed LID System Overview [26]

Comparative Analysis & Summary

From their results, we can conclude that longer audio files give better results than shorter ones when identification is based on large-vocabulary speech recognition. From the information and data above, we can see that the highest accuracy was obtained using machine learning. By comparison, with large-vocabulary speech recognition the test result on 10 seconds of data is 92.02%.

The hierarchical temporal memory method classifies four languages with an accuracy of 92%. Since machine learning gives the best accuracy, we use machine learning to classify our selected languages.

Table 2.1.1: Lexicon sizes for each language (columns: Language, Words, Ave. Length)

RESEARCH METHODOLOGY

Research Subject and Instrumentation
Statistical Analysis
Proposed Methodology

Feature extraction
Machine Learning

Implementation Requirements
Experimental results
Discussion

From the raw data we created 8 variants at 8 different pitches and another 8 variants at 8 different speeds. The machine cannot be trained directly on the data as given, so at this stage we apply several preprocessing steps to bring the data into the same configuration.
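A minimal sketch of this kind of augmentation is shown below, assuming the audio is available as a list of samples. It changes speed by plain resampling, which also shifts the pitch; changing pitch and speed independently needs a dedicated routine (for example `pitch_shift` and `time_stretch` in librosa's `effects` module), and the text does not say which tool was used. The eight factors in `RATES` and the function name `change_speed` are hypothetical.

```python
import math


def change_speed(samples, rate):
    """Resample by linear interpolation; rate > 1 plays faster (and raises pitch)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate                        # position in the original signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


# Hypothetical factors: the report mentions 8 variants but not their values.
RATES = [0.80, 0.85, 0.90, 0.95, 1.05, 1.10, 1.15, 1.20]

sr = 8000                                                          # assumed sampling rate
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]   # 1 s test tone
variants = [change_speed(tone, r) for r in RATES]
```

Each variant keeps the label of the original clip, so one recording yields several training examples.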

Notable among these steps are converting all the data to the same format and keeping every clip at an appropriate length. Feature extraction is the process of breaking down input data and extracting information from it. We extracted 20 features from each sound using MFCC and 6 additional features.
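Bringing every clip to the same length can be sketched as a trim-or-pad step. The sampling rate, the target duration, and the function name `fix_length` below are assumptions made for illustration; the report does not state the values actually used.

```python
def fix_length(samples, target_len, pad_value=0.0):
    """Trim or zero-pad a clip so that it has exactly target_len samples."""
    if len(samples) >= target_len:
        return list(samples[:target_len])
    return list(samples) + [pad_value] * (target_len - len(samples))


sr = 16000                        # hypothetical sampling rate in Hz
target = 5 * sr                   # hypothetical target duration of 5 seconds
short_clip = [0.1] * (3 * sr)     # 3 s clip, will be padded with silence
long_clip = [0.1] * (12 * sr)     # 12 s clip, will be trimmed
uniform = [fix_length(c, target) for c in (short_clip, long_clip)]
```

After this step every clip yields the same number of frames, so the extracted features line up across the data set.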

MFCC feature extraction works on audio frames at least 25 ms long. Speech sounds such as vowels carry more energy at lower frequencies when pronounced. People also hear high frequencies less precisely than low ones.
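The first two steps of a typical MFCC front end, pre-emphasis and framing, can be sketched as follows. Pre-emphasis compensates for the stronger low-frequency energy of speech, and framing cuts the signal into 25 ms pieces. The coefficient 0.97 and the 10 ms hop are common defaults assumed here, not values given in the report, and the function names are hypothetical.

```python
def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def split_frames(signal, sr, frame_ms=25, hop_ms=10):
    """Cut the signal into overlapping frames of frame_ms milliseconds."""
    frame_len = sr * frame_ms // 1000
    hop_len = sr * hop_ms // 1000
    # Only full frames are kept; a trailing partial frame is dropped.
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop_len)]


sr = 16000                         # assumed sampling rate in Hz
signal = [1.0] * 1600              # 100 ms of a constant (zero-frequency) signal
frames = split_frames(pre_emphasis(signal), sr)
```

A constant input is almost cancelled by pre-emphasis, which shows how the filter suppresses the lowest frequencies relative to the higher ones.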

When framing, we must make sure that the amplitude tapers off toward the edges of each frame. A machine processes an audio clip very differently from the way people hear it. In addition to the features obtained from MFCC, we extracted 6 more features from each audio clip.
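Tapering a frame toward its edges is usually done with a window function, and the gap between machine and human hearing is usually bridged with the mel scale. The sketch below shows a Hamming window and one widely used Hz-to-mel formula; both are standard choices assumed here, since the report does not name the exact window or mel variant, and the function names are hypothetical.

```python
import math


def hamming(n):
    """Hamming window of length n: close to 1 in the middle, 0.08 at the edges."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frame):
    """Multiply a frame by the window so its amplitude decays near the edges."""
    return [x * w for x, w in zip(frame, hamming(len(frame)))]


def hz_to_mel(f_hz):
    """Map frequency in Hz to the perceptual mel scale (a common formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


windowed = apply_window([1.0] * 5)
low_step = hz_to_mel(200) - hz_to_mel(100)      # 100 Hz step at low frequency
high_step = hz_to_mel(4100) - hz_to_mel(4000)   # same step at high frequency
```

The same 100 Hz step covers far fewer mels at 4 kHz than at 100 Hz, which mirrors the reduced resolution of human hearing at high frequencies.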

Not all algorithms perform equally well, so we experimented with several different algorithms.

CONCLUSION AND IMPLICATION FOR FUTURE RESEARCH

Conclusion

Future Research

ACRONYMS AND ABBREVIATION

Ruben Zazo, Alicia Lozano-Diez, Javier Gonzalez-Dominguez, Doroteo T. Toledano, Joaquin Gonzalez-Rodriguez, "Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks", January 2016.
Rong Tong, Bin Ma, Donglai Zhu, Haizhou Li and Eng Siong Chng, "Integrating Acoustic, Prosodic and Phonotactic Features for Spoken Language Identification", Proc.
Patil, Akshay Vishwas Johi, Harsha.K.C, Pramod.N, "Spoken language identification using machine learning", Visvesvaraya Technological University, Belgaum, pp.
Ming Li, Hongbin Suo, Xiao Wu, Ping Lu, Yonghong Yan, "Speech Language Identification Using Score Vector Modeling and Support Vector Machine", Proc.

PLAGIARISM REPORT