TOXIC FRIEND DETECTOR BY Lai Xuan Ying

The results will demonstrate personality tracking, that is, the prediction and identification of people's personalities, based on the techniques and methods described in this project.

INTRODUCTION

Problem Statement and Motivation
Project Objectives
Project Scope
Impact, Significance and Contribution
Report Organization

Content can be entered in English, and the input will be analyzed to produce a prediction. Personality traits will be identified using the Big Five Dimension Model to predict and identify an individual's emotions and personality.

LITERATURE REVIEW

Review of Previous Works

Personality and Emotional Models
Natural Language Processing
Sentiment Analysis
Feature Extraction
Strengths and Weaknesses

[8] Additionally, an important part of the natural language processing pipeline is a component that can discover domain vocabularies during domain analysis. All attributes are assumed to be independent of each other given the class.

Figure 2.13: Probabilistic graphical model

Overview of research

Limitation of Previous Studies

In sentiment analysis models, the TF-IDF weighting scheme is widely used because it is concise and can be implemented by considering word weights. However, this method cannot identify or reflect the features of a specific aspect of the user. The LDA model, in turn, sometimes produces topics that overlap with other topics, since the Dirichlet distribution over topics cannot capture the relationships and correlations between them.

Proposed Solution

The techniques and methods studied in the literature review can be used to convert the input data, messages or emails, into processable elements with more detail for later use in feature extraction. Moreover, Part-of-Speech (POS) tags are extracted to handle out-of-vocabulary (OOV) words and to identify the role of a word in a sentence according to the grammatical categories of the language. The LDA model will be used in our proposed system for feature extraction and feature selection.

It is also a simple and efficient model for reducing dimensionality and summarizing input data compared with other models. The training set of positive and negative sentences will be processed with LDA to find suitable latent topics and the keywords that contribute to each topic. An individual typically has hundreds or thousands of messages or emails, so a sentiment analysis model is used to classify them.

This is because the hybrid method achieves higher performance and accuracy than the two methods used independently. The general direction of emotion in each sentence is similar to the general sentiment expressed by the words of the plain text. It is also less cluttered and more resilient to noise in the input data.

PROPOSED SYSTEM METHOD/APPROACH

Design Specification

Methodology

After that, the system application will output its hypothesis, transcribing the speech into a text file along with some available details. True Positive (TP) indicates the number of positive samples correctly predicted as positive. False Positive (FP) indicates the number of negative samples incorrectly predicted as positive.

True Negative (TN) indicates the number of negative samples correctly predicted as negative. False Negative (FN) indicates the number of positive samples incorrectly predicted as negative. A confusion matrix is an N x N matrix used to evaluate the performance of a classification model, where N is the number of target classes.

The accuracy score is the ratio of correct predictions to the total number of predictions, (TP + TN) / (TP + TN + FP + FN). Precision is the number of true positives divided by the number of samples the classifier predicted as positive, TP / (TP + FP). Recall is the number of true positives divided by the number of all actual positive samples, TP / (TP + FN).
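The definitions above can be checked with a short sketch. It is only an illustration for the binary case: the function names and the example labels are made up for this purpose and are not taken from the project code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary task (labels are illustrative)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # positive sample predicted as positive
            else:
                fp += 1  # negative sample predicted as positive
        else:
            if actual == positive:
                fn += 1  # positive sample predicted as negative
            else:
                tn += 1  # negative sample predicted as negative
    return tp, fp, tn, fn


def scores(tp, fp, tn, fn):
    """Return (accuracy, precision, recall); 0.0 when a denominator is zero."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 1 = toxic, 0 = non-toxic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("TP, FP, TN, FN =", (tp, fp, tn, fn))
print("accuracy, precision, recall =", scores(tp, fp, tn, fn))
```

For two classes, the four counts are exactly the cells of the 2 x 2 confusion matrix.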

Figure 3.2: System Development Methodology

Timeline

SYSTEM DESIGN

Project Flow Diagram
Speech-To-Text Procedures Flow
Natural Language Processing and Machine Learning Trained Model
API Flow

Once the API request has been sent successfully, the Sphinx4 speech model is executed to retrieve the speech file and transcribe the speech to text. This project focuses on profiling an individual from the text transcribed from their speech. Finally, the system application will output its hypothesis, transcribing the speech to a text file along with some available details.

The utterance is broken into sequences of subword units by the acoustic model. The following diagram shows the flow of the text classification model, which has two phases: the training phase and the recognition phase. A set of words, such as a comment, is then represented by a matrix of word scores.
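One way to picture the matrix of word scores is a TF-IDF table, which is one of the encodings this report uses. The sketch below is a minimal pure-Python version under stated assumptions: it splits on whitespace and uses the plain idf = log(N / df) form, whereas library implementations usually apply smoothing and normalization. The function name and the sample comments are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, matrix) with one row per document and one column per word."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs at least once.
    doc_freq = Counter(word for doc in tokenized for word in set(doc))

    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = []
        for word in vocabulary:
            tf = counts[word] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[word])  # unsmoothed variant (assumption)
            row.append(tf * idf)
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical comments, used only to show the shape of the result.
comments = ["you are kind", "you are awful", "awful awful day"]
vocabulary, matrix = tfidf_matrix(comments)
print(vocabulary)
for row in matrix:
    print([round(score, 3) for score in row])
```

Each row is the score vector of one comment, so a word that does not occur in a comment scores zero in that row.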

One of the models will be identified as the best encoding and model for the stemmed text. Additionally, a data frame containing the evaluation metrics must be returned to evaluate the model. Whether a text is labelled toxic or non-toxic is decided by comparing its combined score against a threshold.

SYSTEM IMPLEMENTATION

Hardware Setup

Software Setup

It supports multiple languages and other features in its standard packages, and it can be extended and customized without limit. Being highly customizable, it provides a better user experience and easier management across a wide range of IDEs. It is also an open-source tool that gives researchers a "research-ready system" for carrying out their basic work, and it includes different implementations of the techniques based on its design and patterns.

It is an open-source text editor that offers syntax highlighting, code folding and auto-completion for markup, scripting and programming languages. It is free, built on open source, and can run anywhere. It also applies in the situation where a domain requests a resource from a domain different from the one serving that resource.

It is a high-level framework for using ready-made machine learning algorithms instead of writing new ones. It is also an NLP package well suited to LDA topic modeling and other machine learning embedding algorithms. It is commonly used to build information extraction, text preprocessing for deep learning, and natural language understanding.

Setting and Configuration

Flask Configuration

For instance, the address is given in the host parameter and the port number in the port parameter of app.run. Also, the text classification models will be run as mentioned in Chapter 4 for further training and testing on an individual's speech-to-text output. To allow users to interact with the Toxic Friend Detector system, a web application has been designed to present the speech-to-text output and the toxic or non-toxic classification label on the web page.

The diagram above shows the code snippet of the function that receives a large audio file and then transcribes the speech to text using recognize_google(), which calls the Google Web Speech API. The NLP pipeline then continues with tokenization, stopword removal and stemming. The two figures above show the Wordcloud of toxic and non-toxic words generated with its library during text analysis.
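The three preprocessing steps named above can be sketched in pure Python. This is an illustration only, not the project's code: the stopword list is a tiny hypothetical sample, and the suffix-stripping rule is a crude stand-in for a real stemming algorithm.

```python
import re

# Small illustrative stopword list (assumption); real pipelines use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "am", "and", "or", "to", "of", "you", "i", "it"}

# Crude suffix rules standing in for a proper stemmer (assumption).
SUFFIXES = ("ing", "ed", "ly", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [token for token in tokens if token not in STOPWORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenization, stopword removal and stemming in order."""
    return [stem(token) for token in remove_stopwords(tokenize(text))]


print(preprocess("You are always insulting and blaming the others"))
# ['alway', 'insult', 'blam', 'other']
```

The rough stems in the output, such as "blam", show why an established stemmer is normally preferred over hand-written suffix rules.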

The words that are frequent and prominent in the body of the text are the ones shown in the figures. As the three figures above show, the three embeddings, TF-IDF, Word2Vec and Doc2Vec, each store their model with a different method of encoding the stemmed text. The figure above shows a simple GUI design for users to interact with the Toxic Friend Detector system.

Figure 5.5: Speech Recognition libraries

SYSTEM EVALUATION AND DISCUSSION

After evaluating the overall performance of the 9 model combinations, logistic regression with Doc2Vec is identified as the best encoding and model among the embedding approaches. To evaluate the performance of logistic regression with Doc2Vec, the probability associated with a class is converted into the final binary class using a specified threshold. The predicted probability that a text belongs to class "1", which indicates a toxic text, is compared against a threshold t.
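The thresholding step can be sketched as follows. The probabilities and labels are invented for illustration, and treating a probability equal to t as toxic is an assumption, since the report does not state how ties are handled.

```python
def label_from_probability(p_toxic, threshold):
    """Return 1 (toxic) when the class-1 probability is at or above the threshold, else 0."""
    return 1 if p_toxic >= threshold else 0  # ">=" is an assumed tie rule


def precision_recall_at(probabilities, y_true, threshold):
    """Precision and recall for class 1 after applying the threshold."""
    y_pred = [label_from_probability(p, threshold) for p in probabilities]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical class-1 probabilities from a classifier, with the true labels.
probabilities = [0.92, 0.75, 0.61, 0.40, 0.35, 0.10]
y_true = [1, 1, 0, 1, 0, 0]

for t in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(probabilities, y_true, t)
    print(f"t={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this made-up data, raising t from 0.3 to 0.7 lifts precision from 0.60 to 1.00 while recall drops from 1.00 to 0.67, which is the trade-off that guides the choice of threshold.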

The value of the threshold can be chosen according to the desired precision and recall. The output displays the toxicity of the speech file and the text transcription of the speech. Due to unfamiliarity with the API, considerable practice was needed to understand how the API connects to the machine learning model.

The accuracy of the proposed Sphinx4 speech-to-text model is also quite low compared with existing speech recognition systems. The input can be submitted successfully and the result is displayed successfully, whether or not the person is toxic. Some of the techniques or methods may not work well in the system.

Figure 6.3: Performance measurement of every algorithm classifier with Doc2Vec

CONCLUSION AND RECOMMENDATION

Conclusion on Project Achievements

Tokenization, stop-word removal and stemming are performed in pre-processing, and TF-IDF, Word2Vec and Doc2Vec are implemented in feature engineering as embeddings for the various classifiers, such as Linear SVC, Naive Bayes and Logistic Regression. In this project, logistic regression with Doc2Vec is chosen to perform the text classification and for the final deployment, using the optimal hyperparameter setting with a low validation loss. Nevertheless, it is critical to emphasize the need for optimization to make the model more efficient in performing the same task.

At the time of writing, the work has shown reasonably good progress, and we are reasonably confident of completing the project and achieving its objective.

Recommendation

[13] Umniy Salamah, Desi Ramayanti, "Application of Logistic Regression Algorithm for Complaint Text Classification in Indonesian Ministry of Marine and Fisheries", Volume 5, Issue 5, September-October 2018, p. 2394-2231.
[14] Sayar Ul Hassan, Jameel Ahamed, Khaleel Ahmad, "Analysis of Machine Learning Based Algorithms for Text Classification", Volume 3, 2022, Pages
"Understanding TF-ID: A Simple Introduction", MonkeyLearn.
[21] "What is API: Definition, Types, Specification, Documentation", altexsoft, https://www.altexsoft.com/blog/engineering/what-is-api-definition- types-.
Radha, "Speaker-independent speech recognition system for Tamil language using HMM", Procedia Engineering, vol.
Chandra, "A review of automatic speech recognition architectures and approaches", International Journal of Signal Processing, Image Processing and Pattern Recognition, vol.
Setyono, "Sphinx4 for Indonesian Continuous Speech Recognition System", in International Seminar on Application of Information and Communication Technology, 2017, p.

Program / Course: Bachelor of Computer Science (Honours). Title: Final year project, Toxic Friend Detector.