This project titled "Bengali Functional Sentence Classification through Machine Learning Approach", submitted by Antara Biswas (ID No ), Musfiqur Rahman (ID No ) and Zahura Jebin Orin (ID No ) to the Department of Computer Science and Engineering, Daffodil International University, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of B.Sc. in Computer Science and Engineering, Faculty of Science and Information Technology, Daffodil International University. We hereby declare that this project was carried out by us under the supervision of Md.
We also declare that neither this project nor any part of it has been submitted elsewhere for the award of any degree or diploma. The deep knowledge and keen interest of our supervisor in the field of Machine Learning made it possible to realize this project. We are also grateful to Touhid Bhuiyan, Professor and Head of the CSE Department, for his kind help in completing our project, and to the other faculty members and staff of the CSE Department of Daffodil International University.
We would like to thank all our classmates at Daffodil International University who took part in discussions during the completion of the course. Inspired by the studies reviewed below, this work applies machine learning methods to the classification of Bengali functional sentences.
Introduction
With continuous innovation and exploration in this field, it is predicted to grow in the coming days. The widespread use of smart technologies and devices, cloud-based solutions, and the demand for computers capable of understanding text through AI are the main drivers of the growth of NLP. As Figure 1.2 suggests, NLP can be applied to virtually any AI domain.
However, the demand for and scope of this field motivated us to work on it.
Research Objectives
Report Layout
Related Works
After applying a number of models, the authors concluded that the MNB algorithm performed best, with an accuracy of 90.17%. For Chinese at the character and word level, the LM achieved about 86.7% and 89.2%, respectively, and for Japanese at the character level about 84%. In that work, three common text classification algorithms were introduced: K-Nearest Neighbor (KNN), Naive Bayes (NB), and Support Vector Machines (SVM), all of which deliver good classification performance.
After comparison, the experimental results showed that the SVM model and the NB model gave better results.
Scopes of the Work
On the other hand, most of the previous works are Deep Learning based. As a result, those who want to build a purely machine-learning-based model for the Bengali language can also take ideas from this research paper.
Challenges
Research Methodology
Introduction
Class Selection
Best Model
Data Collection Process
Three common sources of data are open-source datasets, the Internet, and artificial data generation. All data was collected manually from various Bengali poems, short-story books, plays, novels, newspapers, articles and online portals. Figure 3.3.2 presents a pie chart showing the percentage of each sentence type.
When stored, the data was organized into three columns: the context (the sentence), the context type (the sentence name), and the class of the context.
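As a minimal sketch of this layout (the column names and example sentences here are illustrative assumptions, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical rows mirroring the three-column layout described above:
# context (the sentence), context type (the sentence name), class (integer label)
rows = [
    {"context": "তুমি কোথায় যাচ্ছ?", "context_type": "interrogative", "class": 0},
    {"context": "কী সুন্দর দৃশ্য!", "context_type": "exclamatory", "class": 1},
    {"context": "আমি বই পড়ি।", "context_type": "assertive", "class": 2},
]
df = pd.DataFrame(rows)
print(df.dtypes)  # context and context_type are object, class is int64
```

Storing the class as an integer alongside the readable sentence name keeps the data both human-checkable and directly usable as a model target.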
Data Preprocessing
- Insertion of Dataset
- Dataset Diagnosing
- Dataset Cleaning
This is because a model cannot work at all unless the collected data is supplied in a suitable file format. Models accept certain file formats, such as CSV, HTML, or XLSX, and then move on to the next stage of processing that file. After inserting the data, this step first verifies the data types of the dataset.
The context column was of object (string) type and the class column of integer type, which was a favorable aspect for us. To monitor and find null values, the info() and isna().sum() functions were applied.
We also verified the mean, standard deviation, and minimum and maximum values of the data via the describe() function. Finally, the dataset was made unique by checking for and removing duplicate entries to improve accuracy. Punctuation marks and stopwords in particular carry little information, but can sometimes give a wrong impression of the most important features.
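A small sketch of these diagnosing and cleaning steps with pandas (the toy frame below is illustrative, and describe() is assumed to be the function behind the summary statistics):

```python
import pandas as pd

# Illustrative stand-in for the real dataset
df = pd.DataFrame({
    "context": ["আমি বই পড়ি।", "তুমি কোথায় যাচ্ছ?", "আমি বই পড়ি।", None],
    "class": [2, 0, 2, 1],
})

df.info()               # data type and non-null count of each column
print(df.isna().sum())  # null values per column
print(df.describe())    # mean, std, min and max of the numeric column
df = df.dropna().drop_duplicates()  # drop nulls and duplicate entries
print(len(df))          # unique, non-null rows remain
```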
Then, punctuation marks and stop words are removed by reading each token from the corpus. Since word_tokenize is a very popular function for tokenization and cleansing, we used it in our research for this stage. Since ML algorithms cannot understand data in its raw form, the data must be encoded as integers, i.e., the numerical form from which the feature vectors are generated.
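As a simplified sketch of the cleaning step (the stopword list below is a tiny assumed subset, and plain whitespace splitting stands in for word_tokenize):

```python
import string

# Tiny illustrative Bengali stopword list; a real list is much larger.
STOPWORDS = {"এবং", "ও", "কি", "তা"}
# ASCII punctuation plus the Bengali danda
PUNCT = string.punctuation + "।"

def clean_tokens(sentence):
    tokens = sentence.split()                  # stand-in for word_tokenize
    tokens = [t.strip(PUNCT) for t in tokens]  # strip punctuation marks
    return [t for t in tokens if t and t not in STOPWORDS]

print(clean_tokens("আমি বই এবং খাতা কিনেছি।"))  # → ['আমি', 'বই', 'খাতা', 'কিনেছি']
```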
CountVectorizer, or Bag of Words (BoW), is generally used for vectorization and records the presence of words in the data. When a word is present in a text sample it returns 1 for that word, otherwise 0, producing a vector-matrix representation for each sample. This vector conversion step is essential for fitting the data to the algorithms.
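The presence/absence behaviour described above can be sketched in a few lines of plain Python (this mirrors what a binary CountVectorizer does, without the library):

```python
def binary_bow(corpus):
    # Vocabulary: every distinct token in the corpus, in sorted order
    vocab = sorted({tok for sent in corpus for tok in sent.split()})
    # 1 if the word is present in the sentence, otherwise 0
    matrix = [[1 if w in sent.split() else 0 for w in vocab] for sent in corpus]
    return vocab, matrix

corpus = ["আমি বই পড়ি", "আমি গান গাই"]
vocab, matrix = binary_bow(corpus)
print(vocab)
print(matrix)  # one 0/1 feature vector per sentence
```

In practice, scikit-learn's CountVectorizer (with binary=True for strict 0/1 output) performs this step far more efficiently.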
Machine Learning Model Selection
- Naive Bayes (NB)
- Random Forest (RF)
- Support Vector Machines (SVM)
- Extreme Gradient Boosting (XGB)
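A hedged sketch of how comparing these models might look with scikit-learn (synthetic features stand in for the real BoW vectors; the XGB line is commented out because it requires the separate xgboost package):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Synthetic non-negative features standing in for BoW count vectors (3 classes)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)
X = abs(X)  # MultinomialNB requires non-negative features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "NB": MultinomialNB(),
    "RF": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42),
    # "XGB": xgboost.XGBClassifier(),  # requires `pip install xgboost`
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```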
In equation (1), X represents the class, meaning that X tells us whether the sentence is interrogative, exclamatory or assertive under the given condition. This is one of the most commonly used methods due to its flexibility, and it usually gives excellent results without requiring hyperparameter tuning. In a decision tree, feature importance is calculated using equation (6), where the sum of all node importances is divided by the total number of nodes.
The sum of feature importances over all trees is divided by the total number of trees, as represented by equation (8) [16]. Decision trees are used for classification and regression problems and belong to supervised machine learning. The mathematical term can be represented by equation (12), where T is the current state and X is the selected attribute.
The problem that information gain favors attributes with many branches is solved by the gain ratio, which is evaluated before constructing the split. Support Vector Machine (SVM) is a supervised machine learning technique that classifies data points according to which side of a separating hyperplane they fall on [11].
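For reference, the quantities these equations describe have standard forms (a sketch using conventional notation; the report's own equation numbering may differ):

```latex
% Entropy of the current state T over c classes
E(T) = -\sum_{i=1}^{c} p_i \log_2 p_i

% Information gain of splitting T on attribute X
IG(T, X) = E(T) - \sum_{v \in \mathrm{values}(X)} \frac{|T_v|}{|T|}\, E(T_v)

% Gain ratio normalises IG by the split information, removing
% IG's bias toward attributes with many branches
\mathrm{GainRatio}(T, X) = \frac{IG(T, X)}{\mathrm{SplitInfo}(T, X)},
\qquad
\mathrm{SplitInfo}(T, X) = -\sum_{v} \frac{|T_v|}{|T|} \log_2 \frac{|T_v|}{|T|}
```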
One of the most effective ensemble algorithms used for supervised learning is XGB (Extreme Gradient Boosting). Most data scientists prefer XGB for its high execution speed and accurate estimates. XGB is a machine learning technique based on decision trees that uses a decision-tree ensemble design for gradient boosting.
Cross-validation is a data-resampling method used to evaluate the generalizability of predictive models and to prevent overfitting. Among the many resampling methods, cross-validation is usually exercised to tune model parameters [13]. In the first fold, the first subset serves as the validation set D_val-1, while the remaining nine subsets serve as the training set D_train-1.
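A minimal sketch of the 10-fold split described above (toy data; scikit-learn's KFold is assumed as the splitting utility):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)  # 50 toy samples

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Fold i: one subset is held out as D_val-i,
    # the remaining nine subsets form D_train-i
    print(f"fold {i}: train={len(train_idx)} val={len(val_idx)}")
```

Each of the 10 folds validates on a different held-out tenth of the data, so every sample is used for validation exactly once.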
Performance Parameters
Confusion Matrix
Result Analysis
In the case of NB, the precision, recall and F1 score for interrogative or class 0 were very close to each other, with a support of 358. For exclamatory or class 1, precision was observed to be lower than for class 0, while recall and the F1 score were comparable, with a support of 405. For assertive or class 2, every NB parameter was very poor, apart from support.
TABLE 5: MEASUREMENT OF PRECISION, RECALL AND F1 SCORE FOR THREE CLASSES (RANDOM FOREST ALGORITHM)
In TABLE 5, the precision, recall and F1 score for interrogative or class 0 were found to be 0.79, with a support of 358. In this comparison, the Random Forest model scored highest for precision, recall, and F1 score.
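The per-class figures discussed here can all be derived from a confusion matrix; a small sketch using a hypothetical matrix (not the report's actual counts):

```python
def per_class_metrics(cm):
    # cm[i][j]: samples of true class i predicted as class j
    n = len(cm)
    out = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true other
        fn = sum(cm[k]) - tp                       # true k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        out.append((precision, recall, f1, sum(cm[k])))  # last item = support
    return out

cm = [[80, 10, 10],   # hypothetical 3-class confusion matrix
      [5, 90, 5],
      [10, 10, 80]]
for cls, (p, r, f1, s) in enumerate(per_class_metrics(cm)):
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={s}")
```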
In addition, Figure 4.4 was plotted for class 2, indicating that SVC is the top model, with the highest values of these parameters.
Conclusion and Future Work
5.1 Conclusion
Future Works
Pardalos, "Nearest Neighbor K-Classification," in Data Mining in Agriculture, New York, NY: Springer New York, 2009, p.
, Systems and Applications (IISA), 2019.
Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol.
Chen, "An introduction to machine learning for panel data: Decision trees, random forests, and other dendrological methods," SSRN Electron.
Available: https://medium.com/@rdhawan201455/knn-k-nearest-neighbor-algorithm-maths-behind-it-and-how-to-find-the-best-value-for-k-6ff5b0955e3d
Available: https://www.marketwatch.com/press-release/natural-language-processing-nlp-market-insights-by-emerging-trends-future-growth-revenue-analysis-demand-forecast-to tesla=y
Gaussier, "A probabilistic interpretation of precision, recall, and F-score, with implications for estimation," in Lecture Notes in Computer Science, Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, p.
Hasan, "Bengali functional sentence classification through machine learning approach," in 2021 12th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2021.
APPENDIX
PLAGIARISM REPORT