Identification of Regional Origin Based on Dialect Using the Evolving Multilayer Perceptron Method
Okvi Nugroho1, Opim Salim Sitompul1,*, Suherman2
1Faculty of Computer Science and Information Technology, Master of Informatics Engineering Study Program, University of North Sumatra, Medan, Indonesia
2Faculty of Engineering, Electrical Engineering, University of North Sumatra, Medan, Indonesia
Email: 1[email protected], 2,*[email protected], 3[email protected]
Corresponding Author Email: [email protected]
Abstract−Voice detection is important to the world of information technology, with applications in voice processing, biometrics, and human-computer interfaces. The voice identification carried out in this study is based on speech or dialect, using a prototype built on a Raspberry Pi device with other supporting hardware. The regional identification prototype uses a sound feature extraction technique, namely Mel Frequency Cepstral Coefficients (MFCC), together with an artificial neural network based on the Simple Evolving Connectionist System (SECoS), an evolving multilayer perceptron algorithm. The purpose of this study is to identify regional origin based on dialect or speech using the MFCC extraction technique and the Evolving Multilayer Perceptron method. The regional recognition tests produce a good level of accuracy; for example, when tested with 10 voice samples from the Aceh region, the prototype correctly identified 7 of the 10 voices. Across all tests on the Aceh, Karo, Nias, and Simalungun regions, the accuracy was 88%.
Keywords: Identification; SECoS; Prototype; MFCC
1. INTRODUCTION
Language has an important role in human life. It allows humans to communicate and develop in carrying out their functions and duties [1][2]. The role of language as a communication tool in society is very important so that the intended meaning reaches the listener [3][4]. Language is conveyed through voice, the basic medium owned by humans; although electronic media and voice applications are now widely used, they serve to support text-to-speech and speech-to-text communication as well as security systems [5][6]. Indonesia consists of many regions [2][7].
These differences are united through the national language, Indonesian, and the ethnic groups are united and regulated administratively by the Indonesian national system [8][9]. However, speakers' use of the national language is still influenced by their regional and cultural backgrounds; this is what is meant by dialect. Differences in dialect can be recognized from speech, so a speaker's regional origin can be identified from the dialect spoken.
Voice detection is very important to the world of information technology, with uses in speech processing, biometrics, and human-computer interfaces [10][11], but research is often limited to gender and emotion identification. Among the studies related to this topic, [12] identified a speaker's emotions from the voice using the Mel Frequency Cepstral Coefficient (MFCC) technique, and the designed system was validated for happy, sad, and angry emotions.
Another study [13] identified gender from voice signals by extracting characteristics from the signals and then classifying them with an SVM. Research on area identification has also been carried out by [14], although that study used only the Mel Frequency Cepstral Coefficients (MFCC) technique.
Therefore, in this research the author builds a prototype for regional recognition from dialect on a Raspberry Pi, combining the Mel Frequency Cepstral Coefficients (MFCC) technique with the multilayer perceptron algorithm.
In this study, the voices of several ethnic groups with Nias, Aceh, Simalungun, and Karo dialects were recorded and processed using time- and frequency-domain signal analysis, observed from the spectrogram, and then identified. Because this research detects regional origin based on dialect or speech, future development could apply it to an automatic answering machine or a robot operating at the biometric level. This study focuses on differentiating and identifying regions of origin based on the pronunciation of Indonesian sounds.
This study explores dataset construction and the artificial intelligence technology required. It focuses on the Mel Frequency Cepstral Coefficients (MFCC) feature extraction technique to extract features from sound based on the sound spectrum [15][16][17]. In sound processing, after feature extraction with the MFCC technique, an artificial neural network (ANN) is used [18][19][20]. On this basis, this research identifies regional origins based on dialect using the Evolving Multilayer Perceptron method, with the MFCC feature extraction technique used to obtain the dialect characteristics of each region of origin.
2. RESEARCH METHODOLOGY
2.1 Data used
The data used is voice data taken from regions in Indonesia, namely Aceh, Nias, Karo, and Simalungun. One sentence was used as the sample in the study, namely: "thank you". The data covers the 4 regions of origin, with each region represented by 100 people, giving a total of 400 people with different voice and dialect characteristics. To obtain the sound characteristics, this study uses the Mel Frequency Cepstral Coefficients (MFCC) feature extraction technique.
2.2 Development Technique
SECoS is a minimalist implementation of the ECoS principle and is often also called a simple evolving MLP (Evolving MultiLayer Perceptron) [21]. SECoS consists of three layers of neurons. The first is the input layer, which simply propagates the input features. The second, the hidden layer, is the evolving layer that grows as the network learns [22]. The third is the output layer. The input and output layers use linear activation functions, and the activation A_n of a node n in the evolving layer is given by
A_n = 1 - D_n (1)
where A_n is the activation value of node n and D_n is the normalized distance between the input vector and the incoming weight vector of that node. The distance D_n can be calculated using the formula for the normalized Hamming distance:
D_n = \frac{\sum_{i=1}^{K} |I_i - W_{i,n}|}{K} (2)
where K is the number of input nodes of the SECoS network, I is the input vector, and W is the input weight matrix of the evolving layer [23]. In this study the activation function was analysed by computing the distance not only with the normalized Hamming distance but also with the normalized Manhattan distance in Equation (3):
D_n = \frac{\sum_{i=1}^{K} |I_i - W_{i,n}|}{\sum_{i=1}^{K} (I_i + W_{i,n})} (3)
and with the normalized Euclidean distance in Equation (4):

D_n = \sqrt{\frac{\sum_{i=1}^{K} (I_i - W_{i,n})^2}{K}} (4)

Learning was then carried out using the SECoS method on the regional-origin identification data based on dialect. The SECoS algorithm is as follows:
a) Propagate the input vector I into the network.
b) If the maximum activation (Amax) of the nodes is smaller than the sensitivity threshold coefficient (Sthr), then:
c) Add a new node. Otherwise:
d) Calculate the error between the learning output (output vector Oc) and the desired value (output vector Od).
e) If the error is greater than the error threshold coefficient (Ethr) or the desired output node is not active, then:
f) Add a new node. Otherwise:
g) Update the connection weights of the winning hidden node.
h) Repeat these steps for each input vector.

When a node is added to the evolving layer, its input weights are initialized to the input vector I and its output weights are initialized to the desired output vector Od. Propagation from the hidden layer to the output layer is carried out in one of two ways. The first, the One-of-N propagation method, propagates only the hidden node with the highest activation value. The second, the Many-of-N propagation method, propagates all hidden nodes whose activation is above the activation threshold.
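To make the procedure concrete, the following is a minimal Python sketch of the SECoS training and prediction loop described above, using the normalized Hamming distance of Equation (2) and One-of-N propagation. The class and parameter names (Sthr, Ethr, the learning rates) follow the text, but the implementation details are an illustration under those assumptions, not the authors' exact code; for brevity only two of the four learning rates mentioned later are used.

import numpy as np

def hamming_dist(i_vec, w_vec):
    # Normalized Hamming distance, Eq. (2): mean absolute difference
    return np.sum(np.abs(i_vec - w_vec)) / len(i_vec)

class SECoS:
    def __init__(self, n_in, n_out, sthr=0.5, ethr=0.1, lr1=0.5, lr2=0.5):
        self.sthr, self.ethr, self.lr1, self.lr2 = sthr, ethr, lr1, lr2
        self.W1 = np.empty((0, n_in))   # input -> evolving layer weights
        self.W2 = np.empty((0, n_out))  # evolving layer -> output weights

    def _activations(self, i_vec):
        # Eq. (1): A_n = 1 - D_n for every evolving-layer node
        return np.array([1.0 - hamming_dist(i_vec, w) for w in self.W1])

    def _add_node(self, i_vec, o_des):
        # New node: input weights = input vector I, output weights = desired output Od
        self.W1 = np.vstack([self.W1, i_vec])
        self.W2 = np.vstack([self.W2, o_des])

    def train_one(self, i_vec, o_des):
        if len(self.W1) == 0:
            self._add_node(i_vec, o_des)
            return
        acts = self._activations(i_vec)
        win = int(np.argmax(acts))
        if acts[win] < self.sthr:            # no node is close enough: step b)-c)
            self._add_node(i_vec, o_des)
            return
        o_act = self.W2[win] * acts[win]     # One-of-N propagation
        if np.max(np.abs(o_des - o_act)) > self.ethr:   # step e)-f)
            self._add_node(i_vec, o_des)
        else:                                # step g): update the winning node only
            self.W1[win] += self.lr1 * (i_vec - self.W1[win])
            self.W2[win] += self.lr2 * (o_des - o_act)

    def predict(self, i_vec):
        acts = self._activations(i_vec)
        win = int(np.argmax(acts))
        return self.W2[win] * acts[win]      # One-of-N propagation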
2.3 Framework
In this study, regional origin is identified from dialect or voice using data generated by recording the predetermined sentence; the data first passes through a pre-processing stage. In the research framework, the first step converts the sound files to .WAV format. This stage is needed for the next stages, such as the training process; converting all the data to the same file format also simplifies the training stage. The process then continues with the Mel Frequency Cepstral Coefficients (MFCC) stages, which consist of Pre-Emphasis, Frame Blocking, Windowing, Fast Fourier Transform, Mel Frequency Warping, and Cepstral Liftering. The data from MFCC then enters the evolving multilayer perceptron stage. The next process groups the MFCC-extracted sound data into 3 parts: 1 part becomes the testing data and 2 parts become the data for the training and validation stages. The research framework is shown in Figure 1.
Figure 1. Research Architecture

The stages of Figure 1 are explained as follows:
1. In the first stage, data was collected by recording speech or dialects from the 4 regions in Indonesia, with a total of 10 people in each region, so 40 people in total; each person spoke the sentence used as the sample in the study, namely "thank you", and the recording was saved as a file in WAV format.
2. The second stage extracts the characteristics of the sound data stored in WAV format. The signal is digitally filtered and its features extracted from the speech or dialect using the Mel Frequency Cepstral Coefficients (MFCC) technique. The results of voice feature extraction with MFCC are as follows.
a) Pre-emphasis passes the signal through a filter stage that emphasizes the higher frequencies. This process is necessary so that the voice is heard clearly; its purpose is to produce a frequency spectrum under clear sound conditions. The stage is carried out after the sound data has been collected and operates on the signal in the time domain. The default alpha value used in the pre-emphasis filtering process is 0.33.
y[n] = s[n] - \alpha \, s[n-1] (5)

where y[n] is the pre-emphasis filtered signal and s[n] is the signal before the pre-emphasis filter.
The results of the pre-emphasis stage are shown in Figure 2.
Figure 2. Pre-Emphasis Process Sound Signals
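As a minimal sketch of this step, the filter of Equation (5) can be written in Python with NumPy; the alpha value of 0.33 follows the text (many MFCC pipelines use 0.95-0.97 instead).

import numpy as np

def pre_emphasis(signal, alpha=0.33):
    # y[n] = s[n] - alpha * s[n-1], Eq. (5); the first sample passes through unchanged
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])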
b) Windowing and framing. The framing process divides the speech signal into short segments; it is performed automatically in Python with frames of 1024 samples per speech segment. Windowing then weights each part of the signal produced by the framing process. After the sound signal is cut into frames, discontinuities appear at the beginning and end of each frame, so windowing is necessary. The goal of windowing is to minimize these discontinuities so that the speech signal is not abrupt at the start and end of the frame; this stage tapers the signal at the frame edges and smooths the frequency content. The results of the framing and windowing stages are shown in Figure 3.
Figure 3. Signals from the Framing and Windowing Process
Figure 3 shows the result of applying framing and windowing to the output of the pre-emphasis stage; the graph shows the low- and high-frequency content of the sound waves, obtained using the Hamming window:

w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right)
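The following is a minimal sketch of frame blocking and Hamming windowing, assuming a frame length of 1024 samples (matching NFFT in the next step) and a 50% overlap; the hop size is an assumption for illustration, not a value stated in the paper.

import numpy as np

def frame_and_window(signal, frame_len=1024, hop=512):
    # Cut the signal into overlapping frames and taper each frame with a
    # Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)).
    # Assumes len(signal) >= frame_len.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.empty((n_frames, frame_len))
    for k in range(n_frames):
        frames[k] = signal[k * hop : k * hop + frame_len] * window
    return frames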
c) Fast Fourier Transform (FFT). This stage converts the voice signal from the time domain to the frequency domain. The FFT is applied to every frame produced by the previous process, and each frame produces one frequency vector. The FFT length is stored in the NFFT variable with a value of 1024. The output of the FFT process is a set of complex numbers representing the frequency content. The results of the Fast Fourier Transform stage are shown in Figure 4.
Figure 4. Process of Fast Fourier Transform (FFT)
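A minimal sketch of this stage: each windowed frame is transformed with a one-sided FFT of length NFFT = 1024 as stated in the text, and the squared magnitude gives the power spectrum used by the filterbank step.

import numpy as np

def power_spectrum(frames, nfft=1024):
    # One-sided complex spectrum of each frame, then the power spectrum
    spectrum = np.fft.rfft(frames, n=nfft)
    return (np.abs(spectrum) ** 2) / nfft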
d) Filterbank/Mel-frequency. This step obtains the characteristic coefficient values, the distinctive features of the voice signal. Mel Frequency Warping is generally done using a filter bank, a set of filters used to determine the energy in particular frequency bands of the voice signal.
Figure 5. FilterBank Process Sound Signal
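The following is a minimal mel filterbank sketch operating on the power spectrum from the previous step; the number of filters (26) and the sample rate (16 kHz) are assumptions for illustration, not values given in the paper.

import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700.0)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595.0) - 1)

def mel_filterbank_energies(pow_frames, sample_rate=16000, nfft=1024, n_filters=26):
    # Filter centre frequencies are spaced evenly on the mel scale
    mel_points = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):        # triangular filters
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)
    energies = pow_frames @ fbank.T          # band energy per frame
    return np.log(np.maximum(energies, 1e-10))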
3. The third stage forms the neural network. The network model used is an evolving multilayer perceptron with a structured learning system. The regional-origin identification system based on dialect is implemented as 3 processes: training, validation, and testing. The evolving multilayer perceptron works as follows during the training stage:
1. Input sound files.
2. Perform feature extraction using Mel Frequency Cepstral Coefficients (MFCC), consisting of Pre-Emphasis, Frame Blocking/Windowing, Fast Fourier Transform, and Mel Frequency Warping.
3. Collect the column vectors of the feature extraction results from the MFCC matrix.
4. Enter the values of the sensitivity threshold (Sthr), learning rate 1, learning rate 2, learning rate 3, and learning rate 4.
5. Initialize the first node, where weight 1 (the input weights) is set to the input vector and weight 2 (the output weights) is set to the target output vector.
6. Calculate the activation value for the input vector using the function A_n = 1 - D_n.
7. Find the node with the highest activation value.
8. When the maximum activation value is smaller than the predetermined sensitivity threshold, a node is added, its input weight vector is initialized to the input vector I, and its output weight vector is initialized to the desired output vector Od.
9. Propagate to the most highly activated node using the One-of-N method, in which forward propagation from the evolving layer to the output layer uses only the node with the highest activation value.
10. Calculate the error value as the difference between the desired output and the actual output.
11. If the error between the desired output and the actual output obtained by the active node is greater than the error threshold (Ethr), a node is added, its input weight vector is initialized to the input vector I, and its output weight vector is initialized to the desired output vector Od.
12. Otherwise, weight 1 is updated using the function from Equation (3) and weight 2 is updated using the function from Equation (4).
13. The training process then continues with the next data item, repeating the steps above. Once all the data has been trained, the number of nodes and the weight matrices associated with the nodes are stored for use during the testing process. During testing, the analysis of regional origin works as follows:
a) Enter the sound file and perform feature extraction using Mel Frequency Cepstral Coefficients (MFCC), consisting of Pre-Emphasis, Frame Blocking, Windowing, Fast Fourier Transform, Mel Frequency Warping, and Cepstral Liftering.
b) Form the matrix of MFCC feature extraction results.
c) Form the column vector of sound features calculated from the MFCC matrix, which has already passed the normalization stage.
d) Initialize the total number of nodes to the total number of nodes obtained from the training stage, and initialize the weights associated with those nodes.
e) Calculate the activation value for the input vector with the equation A_n = 1 - D_n.
f) Find the node with the highest activation value.
g) Propagate through the evolving layer using One-of-N.
h) The result obtained is the accuracy of the recognition of regional origin based on dialect; a sketch of the overall train/test flow is given below.
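The following sketch ties the training and testing procedures together, reusing the SECoS class from Section 2.2. The extract_mfcc() helper is hypothetical (standing in for the MFCC front end above, returning one normalized feature vector per WAV file), and the file names and labels are illustrative only.

import numpy as np

REGIONS = ["Aceh", "Karo", "Nias", "Simalungun"]

def one_hot(region):
    # Desired output vector Od: 1.0 at the index of the speaker's region
    vec = np.zeros(len(REGIONS))
    vec[REGIONS.index(region)] = 1.0
    return vec

def train(model, training_set):
    # training_set is a list of (wav_path, region_label) pairs
    for wav_path, region in training_set:
        i_vec = extract_mfcc(wav_path)   # hypothetical MFCC front end (not shown)
        model.train_one(i_vec, one_hot(region))

def identify(model, wav_path):
    # The most active output node gives the predicted region of origin
    output = model.predict(extract_mfcc(wav_path))
    return REGIONS[int(np.argmax(output))]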
3. RESULT AND DISCUSSION
This chapter discusses the results, covering the voice samples used for training, validation, and testing to determine regional origin based on dialect, using the mel-frequency cepstral coefficient (MFCC) voice feature extraction and the evolving multilayer perceptron method.
3.1 Training, validation, and testing data
The training, validation, and test data are pre-recorded sound samples. The distribution of the training, testing, and validation data is given in Table 1 below:
Table 1. Distribution of training, testing, and validation data

No  Data             Amount of data
1   Training data    320
2   Testing data     40
3   Validation data  40
    Total            400
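As a minimal sketch of how the 320/40/40 split in Table 1 could be produced, assuming a list of 400 labelled samples; the random seed is an assumption for reproducibility.

import numpy as np

def split_dataset(samples, n_train=320, n_test=40, n_val=40, seed=0):
    # Shuffle the 400 labelled samples and slice them into the three subsets
    assert len(samples) == n_train + n_test + n_val
    idx = np.random.default_rng(seed).permutation(len(samples))
    train = [samples[i] for i in idx[:n_train]]
    test = [samples[i] for i in idx[n_train:n_train + n_test]]
    val = [samples[i] for i in idx[n_train + n_test:]]
    return train, test, val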
Based on Table 1, the data is distributed over the training, testing, and validation phases. The sound samples from the 4 regions produce sound signals as shown in Figures 5, 6, 7, and 8 below.
Figure 5. Signal Frequency Sound pattern in Aceh's voice
Figure 6. Signal Frequency Sound pattern in Karo voice
Figure 7. Signal Frequency Sound pattern in Nias voice
Figure 8. Signal Frequency Sound pattern in Simalungun voice
3.2 Sampling results of 4 regional sound patterns
Each region's sound pattern is different from the others. The sound pattern is the actual sound-signal value relative to the frequency value.
Next, the original sound signal is extracted using the mel-frequency cepstral coefficient (MFCC) method to recognize the characteristics or distinctive features of the voice itself. The overall sound-sampling data must be known before the next process stages, namely training, validation, and testing. Table 2 shows the matrix values of the overall sound sampling across all regions. Once the overall matrix values are known, each voice is processed to obtain more specific information on its characteristics, which is then used as a research reference.
Table 2. Matrix values for all regions of origin
No  Aceh      Karo      Nias      Simalungun
1   0.233232  0.434212  0.434232  0.623232
2   0.324545  0.534234  0.523131  0.234245
3   0.524242  0.234245  0.623112  0.534222
4   0.748342  0.534222  0.726323  0.572424
5   0.232344  0.572424  0.526311  0.694854
6   0.563424  0.694854  0.324859  0.621404
7   0.634242  0.621404  0.625241  0.634242
8   0.742842  0.834924  0.762423  0.742842
9   0.623231  0.829319  0.625241  0.623231
10  0.768343  0.723424  0.153739  0.768343
Based on Table 2, the matrix values for each dataset are displayed for the sounds originating from the Aceh, Karo, Nias, and Simalungun regions. The matrix values are illustrated by the plot in Figure 9.
Figure 9. The matrix value of the entire area of origin
Figure 9 shows the matrix values generated for the 4 regions, namely Aceh, Karo, Nias, and Simalungun. Each value is used in the dialect recognition process, where the matrix values serve to measure the proximity of sounds based on their values.
3.3 Regional voice recognition test
This section presents the results of testing the mel-frequency cepstral coefficient (MFCC) technique in the dialect-based regional origin recognition system, on voice samples from the 4 regions in Indonesia. The testing process uses the sounds collected previously. The detailed results are shown in Table 3 as follows:
Table 3. Test results

Origin          Correct  Wrong
Aceh            7        3
Karo            8        2
Nias            6        4
Simalungun      7        3
Number of data  28       12
Result          88%      12%
As described in Table 3, the testing process uses 10 samples per region that were set aside from the training data. Of the 10 voices tested from the Aceh region, the results are very good: 7 Aceh voices were successfully identified and 3 voices failed to be identified. The prediction results are plotted in Figure 10.
Figure 10. Prediction Results
4. CONCLUSION
Based on the whole research process, the mel-frequency cepstral coefficient (MFCC) voice features and the evolving multilayer perceptron method in the dialect- or speech-based regional recognition system recognize regional voice patterns well, and the amount of training data strongly affects the accuracy of regional-origin identification. The more variation in the training and testing data, and the greater the tolerance rate, the higher the accuracy of recognizing the speech patterns of regional origin. The authors therefore conclude that the MFCC voice feature extraction and the evolving multilayer perceptron method show good results in recognizing regional origins when applied to a Raspberry Pi device that performs speech recognition directly through a microphone. When regional recognition based on dialect was tested with 10 Aceh voices, the system identified 7 of them correctly; for Simalungun, 7 of 10 voices were identified correctly; for the Karo region, 8 voices were correct; and for Nias, 6 voices were correct, giving an overall accuracy rate of 88%. The prototype designed for speech recognition therefore runs well in identifying voices. An analysis of the problems encountered showed that voice input to the prototype could be disturbed by noise, so during voice recording the researchers used a microphone that suppresses noise.
REFERENCES
[1] B. D. Wijanarko, Y. Heryadi, D. F. Murad, C. Tho, and K. Hashimoto, “Recurrent Neural Network-based Models as Bahasa Indonesia-Sundanese Language Neural Machine Translator,” in 2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE), 2023, pp. 951–956.
[2] A. R. Lubis, S. Prayudani, M. Lubis, and O. Nugroho, “Sentiment Analysis on Online Learning During the Covid-19 Pandemic Based on Opinions on Twitter using KNN Method,” in 2022 1st International Conference on Information System & Information Technology (ICISIT), 2022, pp. 106–111.
[3] O. Mailani, I. Nuraeni, S. A. Syakila, and J. Lazuardi, “Bahasa Sebagai Alat Komunikasi Dalam Kehidupan Manusia,” Kampret J., vol. 1, no. 2, pp. 1–10, 2022.
[4] A. R. Lubis, S. Prayudani, M. Lubis, and O. Nugroho, “Latent Semantic Indexing (LSI) and Hierarchical Dirichlet Process (HDP) Models on News Data,” in 2022 5th International Conference of Computer and Informatics Engineering (IC2IE), 2022, pp. 314–319.
[5] H. Yan, H. Bai, X. Zhan, Z. Wu, L. Wen, and X. Jia, “Combination of VMD Mapping MFCC and LSTM: A New Acoustic Fault Diagnosis Method of Diesel Engine,” Sensors, vol. 22, no. 21, p. 8325, 2022.
[6] Q. Li et al., “MSP-MFCC: Energy-Efficient MFCC Feature Extraction Method with Mixed-Signal Processing Architecture for Wearable Speech Recognition Applications,” IEEE Access, vol. 8, pp. 48720–48730, 2020, doi: 10.1109/ACCESS.2020.2979799.
[7] R. D. Alamsyah and S. Suyanto, “Speech gender classification using bidirectional long short term memory,” in 2020 3rd International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), 2020, pp. 646–649.
[8] P. Mehra and S. K. Verma, “Comparing Classifiers for Recognizing the Emotions by extracting the Spectral Features of Speech Using Machine Learning,” in 2023 International Conference on Device Intelligence, Computing and Communication Technologies,(DICCT), 2023, pp. 387–391.
[9] M. A. A. Albadr, S. Tiun, M. Ayob, M. Mohammed, and F. T. AL-Dhief, “Mel-frequency cepstral coefficient features based on standard deviation and principal component analysis for language identification systems,” Cognit. Comput., vol. 13, pp. 1136–1153, 2021.
[10] K. W. Gunawan, A. A. Hidayat, T. W. Cenggoro, and B. Pardamean, “Repurposing transfer learning strategy of computer vision for owl sound classification,” Procedia Comput. Sci., vol. 216, pp. 424–430, 2023.
[11] F. Abakarim and A. Abenaou, “Voice gender recognition using acoustic features, mfccs and svm,” in International Conference on Computational Science and Its Applications, 2022, pp. 634–648.
[12] S. Mukherjee, S. Mundra, and A. Mundra, “Speech Emotion Recognition Using Convolutional Neural Networks on Spectrograms and Mel-frequency Cepstral Coefficients Images,” in Information and Communication Technology for Competitive Strategies (ICTCS 2022) Intelligent Strategies for ICT, Springer, 2023, pp. 33–41.
[13] K. Chachadi and S. R. Nirmala, “Voice-based gender recognition using neural network,” in Information and Communication Technology for Competitive Strategies (ICTCS 2020) ICT: Applications and Social Interfaces, 2022, pp. 741–749.
[14] U. E. Akpudo and J.-W. Hur, “Intelligent solenoid pump fault detection based on MFCC features, LLE and SVM,” in 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 2020, pp. 404–408.
[15] Q. Li et al., “MSP-MFCC: Energy-efficient MFCC feature extraction method with mixed-signal processing architecture for wearable speech recognition applications,” IEEE Access, vol. 8, pp. 48720–48730, 2020.
[16] A. Ashar, M. S. Bhatti, and U. Mushtaq, “Speaker identification using a hybrid cnn-mfcc approach,” in 2020 International Conference on Emerging Trends in Smart Technologies (ICETST), 2020, pp. 1–4.
[17] M. B. Alsabek, I. Shahin, and A. Hassan, “Studying the Similarity of COVID-19 Sounds based on Correlation Analysis of MFCC,” in 2020 international conference on communications, computing, cybersecurity, and informatics (CCCI), 2020, pp. 1–5.
[18] T. S. Kanchana and B. S. E. Zoraida, “A Framework for Automated Personality Prediction from Social Media Tweets,” in 2022 IEEE World Conference on Applied Intelligence and Computing (AIC), 2022, pp. 698–701.
[19] C. M. Suneera and J. Prakash, “Performance Analysis of Machine Learning and Deep Learning Models for Text Classification,” 2020 IEEE 17th India Counc. Int. Conf. INDICON 2020, 2020, doi: 10.1109/INDICON49873.2020.9342208.
[20] H. M. M. Hasan and M. A. Islam, “Emotion recognition from bengali speech using rnn modulation-based categorization,” in 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020, pp. 1131–1136.
[21] Y. Ye, R. Yi, Z. Cai, and K. Xu, “STEdge: Self-Training Edge Detection With Multilayer Teaching and Regularization,” IEEE Trans. Neural Networks Learn. Syst., 2023.
[22] M. S. Christo, J. J. Menandas, M. George, and S. V. Nuna, “DDoS Detection using Multilayer Perceptron,” in 2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC), 2023, pp. 688–693.
[23] S. A. Rather and P. S. Bala, “Lévy flight and chaos theory based gravitational search algorithm for multilayer perceptron training,” Evol. Syst., vol. 14, no. 3, pp. 365–392, 2023.