Chapter 3 Dataset
3.2 Multiclass Dataset
A multi-class classification dataset is a dataset used for training machine learning models for classification tasks with more than two classes. Each example or sam- ple in such a dataset includes a collection of attributes that define it, as well as a label that indicates the class it belongs to. The model’s objective is to predict the class of a new sample based on its attributes. Examples of multi-class classification problems include image classification with many different object classes, text classi- fication with multiple categories of documents, and speech recognition with different spoken words or phonemes.A multi-class classification dataset for sentiment analysis can be beneficial for several reasons:
Handling nuanced sentiment: Sentiment analysis is not always a binary task of determining whether a text is positive or negative. Using a multi-class dataset can allow for the classification of texts into more nuanced sentiment categories such as positive, neutral, or negative.
Better performance: By providing more classes to classify into, a multi-class dataset can increase the model’s ability to capture more subtle differences in senti- ment. With more classes, the model can learn more complex relationships between the features of the text and the sentiment it expresses.
Handling mixed sentiment: A text may express multiple sentiments at once, or the sentiment may change over the course of the text. With a multi-class dataset, the model can learn to identify and classify these mixed sentiments.
Better understanding of the problem: When working with a multi-class dataset, it’s easier to understand how different sentiments are expressed in language and how they differ from each other. This can be useful for understanding the nuances of the problem, and for debugging and evaluating the performance of the model.
The ratings between 1 and 5 are separated into three portions in this Multiclass class dataset because three different forms of multi-class are being employed. Here, the job condition and key phrase content are displayed along with the Amazon group’s negative rating of 2, which is indicated. Similar to Infosys, the rating there is 5, which means that the content and other aspects are positive. The last review comes from Infosys and has a rating of 4, has the word ”Positive,” and is illustrated with more tables of contents.
Table 3.7: Dataset for Multiclass Classification Company
Name Rating Job Status Content Sentiments
Amazon 2
Former Employee, less than 1 year
Pay,Benefits,Discounts, Covid Testing, Good Managers Produc- tion Hours, Work Life/Sleep Balance
Negative
Infosys 5
Current Employee, more than 8 years
I don’t think there is any cons You will get great exposure
Positive
Infosys 4
Current Employee, more than 10 years
Green Card slots are less, Promotion slots are less sometimes Stability, Long term assignments, On par with market salary, Learning programs
Positive
The dataset gathers information from five separate rating counts ranging from 1 to 5 for multiclass categorization. Each rating includes the ratings 16136, 16658, 32794, 15744, and 17050 in order. After all processing, a total of 98382 ratings have already been gathered and processed. This is depicted in the following figures.
Table 3.8: Rating Based on Number of Reviews
Rating Count of Rating
1 16136
2 16658
3 32794
4 15744
5 17050
Grand Total 98382
Figure 3.6: Rating based on number of reviews
Here are some reviews of Amazon, Apple, Conduent, Dell, HCL Tech, IBM, Infosys, and some others along with their elaborated data of numbers containing Negative, Neutral, and Positive dataset
Table 3.9: Top 10 Companies Based on Sentiment Count Company
Name Negative Neutral Positive Grand Total
Amazon 7012 4084 7626 18722
Apple 462 1385 4219 6066
Concentrix 1283 2687 3653 7623
Conduent 1218 1353 810 3381
Dell Tech-
nologies 1520 1742 0 3262
HCL Tech-
nologies 2918 2663 0 5581
IBM 4369 0 0 4369
Infosys 587 1616 1142 3345
Microsoft 118 521 2538 3177
TATA 531 4470 0 5001
Grand To-
tal 20018 20521 19988 60527
Several job statuses are gathered to specify the job review more accurately and proper. the rating count of the job status is: Current Employee - 21571, Current Employee, less than 1 year - 8507, Current Employee, more than 1 year - 11874, Current Employee, more than 10 years - 2556, Current Employee, more than 3
years - 8127, Current Employee, more than 5 years - 4680, Former Employee - 12889, Former Employee, less than 1 year - 6645, Former Employee, more than 1 year - 8870, Former Employee, more than 3 years - 5282 and finally in total 91001 collections have been made. As per the figure shows.
Table 3.10: Top 10 Job Status based on Count of Rating
Job Status Count of Rating
Current Employee 21571
Current Employee, less than 1 year 8507 Current Employee, more than 1 year 11874 Current Employee, more than 10 years 2556 Current Employee, more than 3 years 8127 Current Employee, more than 5 years 4680
Former Employee 12889
Former Employee, less than 1 year 6645 Former Employee, more than 1 year 8870 Former Employee, more than 3 years 5282
Grand Total 91001
The rating count of the job status of multiclass classification is: Current Employee - 24%, Current Employee, less than 1 year - 9%, Current Employee, more than 1 year - 13%, Current Employee, more than 10 years - 3%, Current Employee, more than 3 years - 9%, Current Employee, more than 5 years - 5%, Former Employee - 14%, Former Employee, less than 1 year - 7%, Former Employee, more than 1 year - 10%, Former Employee, more than 3 years - 6% all along in 100% of collections have been made. As per the figure shows.
Figure 3.7: Top 10 Job Status based on Count of Rating
Table 3.11: Count of Sentiments
Sentiments Count of Rating
Negative 32794
Neutral 32794
Positive 32794
Grand Total 98382
Figure 3.8: Count of Sentiments
Table 3.12: Word Count based on Sentiments
Sentiments Sum of number of words
Negative 1906791
Neutral 1048725
Positive 1009845
Grand Total 3965361
Figure 3.9: Word count based on sentiments
Among all as the rating is divided into three segment each segment like Positive, Neutral and Negative has the ratio of 26%, 26% and 48% consecutively. The word cloud is divided mainly into two main categories such as pros and cons. The words that are mostly used in Positive and Very Positive and sometimes Neutral reviews are collected together.
Figure 3.10: Word Cloud of Multi Class Dataset
Therefore, the researchers collected the dataset on their own using data scraping.
In this code, BeautifulSoup is used to collect the data from the companies.
This dataset also contains nobility in such a way that,
• This dataset is completely new, raw, and collected by the researchers..
• This dataset is categorized and labeled by using AI applications to make more impactful and accurate outcomes.
• This dataset is cleaned and scrubbed. The errors, duplicates, and irrelevant data are identified from this dataset and fixed properly.
• No Human Generated Decision has been taken for labeling, rating, or any value of this dataset. This data-collecting and cleaning procedure are fully automated.
• All the invalid reviews are removed from this dataset
• The researchers used the voting system to come up with a combined answer or decision for removing irrelevant and invalid reviews.
In such a way, this dataset represents new, raw, and nobility that this research only uses for the first time.
For the analysis and pre-processing of the data, there are some consequential pro- cesses are used in this research. Firstly, the process started by importing the de- pendencies such as NumPy, panda, matplotlib, seaborn, etc. Later on, stop words, punkt, wordnet, omw-1.4 are downloaded. Secondly, the data has been loaded by the researchers in a way where these key points are maintained which are: Com- pany name, Given Rating, Job status, Pros, and Cons. Thirdly, the data cleaning part has been started where the null values are removed as a column. In the first checking, 3 in pros and cons null values were found. After removing the null values, in the second search, there are all values without null values. So, in that way, this is a user-friendly datasheet where the researchers can work. Fourthly, the datasheet has been copied to excel and the analysis part started to occur.
Chapter 4
Proposed Methodology
The proposed method starts with peparing data and creating an embedding layer for deep learning models. Bi-GRU was used for the binary dataset and Bert model for the multiclass dataset.