• Tidak ada hasil yang ditemukan

Chapter 3 Dataset

3.2 Multiclass Dataset

A multi-class classification dataset is a dataset used for training machine learning models for classification tasks with more than two classes. Each example or sam- ple in such a dataset includes a collection of attributes that define it, as well as a label that indicates the class it belongs to. The model’s objective is to predict the class of a new sample based on its attributes. Examples of multi-class classification problems include image classification with many different object classes, text classi- fication with multiple categories of documents, and speech recognition with different spoken words or phonemes.A multi-class classification dataset for sentiment analysis can be beneficial for several reasons:

Handling nuanced sentiment: Sentiment analysis is not always a binary task of determining whether a text is positive or negative. Using a multi-class dataset can allow for the classification of texts into more nuanced sentiment categories such as positive, neutral, or negative.

Better performance: By providing more classes to classify into, a multi-class dataset can increase the model’s ability to capture more subtle differences in senti- ment. With more classes, the model can learn more complex relationships between the features of the text and the sentiment it expresses.

Handling mixed sentiment: A text may express multiple sentiments at once, or the sentiment may change over the course of the text. With a multi-class dataset, the model can learn to identify and classify these mixed sentiments.

Better understanding of the problem: When working with a multi-class dataset, it’s easier to understand how different sentiments are expressed in language and how they differ from each other. This can be useful for understanding the nuances of the problem, and for debugging and evaluating the performance of the model.

The ratings between 1 and 5 are separated into three portions in this Multiclass class dataset because three different forms of multi-class are being employed. Here, the job condition and key phrase content are displayed along with the Amazon group’s negative rating of 2, which is indicated. Similar to Infosys, the rating there is 5, which means that the content and other aspects are positive. The last review comes from Infosys and has a rating of 4, has the word ”Positive,” and is illustrated with more tables of contents.

Table 3.7: Dataset for Multiclass Classification Company

Name Rating Job Status Content Sentiments

Amazon 2

Former Employee, less than 1 year

Pay,Benefits,Discounts, Covid Testing, Good Managers Produc- tion Hours, Work Life/Sleep Balance

Negative

Infosys 5

Current Employee, more than 8 years

I don’t think there is any cons You will get great exposure

Positive

Infosys 4

Current Employee, more than 10 years

Green Card slots are less, Promotion slots are less sometimes Stability, Long term assignments, On par with market salary, Learning programs

Positive

The dataset gathers information from five separate rating counts ranging from 1 to 5 for multiclass categorization. Each rating includes the ratings 16136, 16658, 32794, 15744, and 17050 in order. After all processing, a total of 98382 ratings have already been gathered and processed. This is depicted in the following figures.

Table 3.8: Rating Based on Number of Reviews

Rating Count of Rating

1 16136

2 16658

3 32794

4 15744

5 17050

Grand Total 98382

Figure 3.6: Rating based on number of reviews

Here are some reviews of Amazon, Apple, Conduent, Dell, HCL Tech, IBM, Infosys, and some others along with their elaborated data of numbers containing Negative, Neutral, and Positive dataset

Table 3.9: Top 10 Companies Based on Sentiment Count Company

Name Negative Neutral Positive Grand Total

Amazon 7012 4084 7626 18722

Apple 462 1385 4219 6066

Concentrix 1283 2687 3653 7623

Conduent 1218 1353 810 3381

Dell Tech-

nologies 1520 1742 0 3262

HCL Tech-

nologies 2918 2663 0 5581

IBM 4369 0 0 4369

Infosys 587 1616 1142 3345

Microsoft 118 521 2538 3177

TATA 531 4470 0 5001

Grand To-

tal 20018 20521 19988 60527

Several job statuses are gathered to specify the job review more accurately and proper. the rating count of the job status is: Current Employee - 21571, Current Employee, less than 1 year - 8507, Current Employee, more than 1 year - 11874, Current Employee, more than 10 years - 2556, Current Employee, more than 3

years - 8127, Current Employee, more than 5 years - 4680, Former Employee - 12889, Former Employee, less than 1 year - 6645, Former Employee, more than 1 year - 8870, Former Employee, more than 3 years - 5282 and finally in total 91001 collections have been made. As per the figure shows.

Table 3.10: Top 10 Job Status based on Count of Rating

Job Status Count of Rating

Current Employee 21571

Current Employee, less than 1 year 8507 Current Employee, more than 1 year 11874 Current Employee, more than 10 years 2556 Current Employee, more than 3 years 8127 Current Employee, more than 5 years 4680

Former Employee 12889

Former Employee, less than 1 year 6645 Former Employee, more than 1 year 8870 Former Employee, more than 3 years 5282

Grand Total 91001

The rating count of the job status of multiclass classification is: Current Employee - 24%, Current Employee, less than 1 year - 9%, Current Employee, more than 1 year - 13%, Current Employee, more than 10 years - 3%, Current Employee, more than 3 years - 9%, Current Employee, more than 5 years - 5%, Former Employee - 14%, Former Employee, less than 1 year - 7%, Former Employee, more than 1 year - 10%, Former Employee, more than 3 years - 6% all along in 100% of collections have been made. As per the figure shows.

Figure 3.7: Top 10 Job Status based on Count of Rating

Table 3.11: Count of Sentiments

Sentiments Count of Rating

Negative 32794

Neutral 32794

Positive 32794

Grand Total 98382

Figure 3.8: Count of Sentiments

Table 3.12: Word Count based on Sentiments

Sentiments Sum of number of words

Negative 1906791

Neutral 1048725

Positive 1009845

Grand Total 3965361

Figure 3.9: Word count based on sentiments

Among all as the rating is divided into three segment each segment like Positive, Neutral and Negative has the ratio of 26%, 26% and 48% consecutively. The word cloud is divided mainly into two main categories such as pros and cons. The words that are mostly used in Positive and Very Positive and sometimes Neutral reviews are collected together.

Figure 3.10: Word Cloud of Multi Class Dataset

Therefore, the researchers collected the dataset on their own using data scraping.

In this code, BeautifulSoup is used to collect the data from the companies.

This dataset also contains nobility in such a way that,

• This dataset is completely new, raw, and collected by the researchers..

• This dataset is categorized and labeled by using AI applications to make more impactful and accurate outcomes.

• This dataset is cleaned and scrubbed. The errors, duplicates, and irrelevant data are identified from this dataset and fixed properly.

• No Human Generated Decision has been taken for labeling, rating, or any value of this dataset. This data-collecting and cleaning procedure are fully automated.

• All the invalid reviews are removed from this dataset

• The researchers used the voting system to come up with a combined answer or decision for removing irrelevant and invalid reviews.

In such a way, this dataset represents new, raw, and nobility that this research only uses for the first time.

For the analysis and pre-processing of the data, there are some consequential pro- cesses are used in this research. Firstly, the process started by importing the de- pendencies such as NumPy, panda, matplotlib, seaborn, etc. Later on, stop words, punkt, wordnet, omw-1.4 are downloaded. Secondly, the data has been loaded by the researchers in a way where these key points are maintained which are: Com- pany name, Given Rating, Job status, Pros, and Cons. Thirdly, the data cleaning part has been started where the null values are removed as a column. In the first checking, 3 in pros and cons null values were found. After removing the null values, in the second search, there are all values without null values. So, in that way, this is a user-friendly datasheet where the researchers can work. Fourthly, the datasheet has been copied to excel and the analysis part started to occur.

Chapter 4

Proposed Methodology

The proposed method starts with peparing data and creating an embedding layer for deep learning models. Bi-GRU was used for the binary dataset and Bert model for the multiclass dataset.

Dokumen terkait