I would like to express my sincere gratitude to the Head of the CSE Department for his kind help in completing my thesis, and to the other faculty members and staff of the CSE Department of Daffodil International University.
Phishing, which attempts to steal confidential information by impersonating a legitimate source, is one of the most widespread and successful attacks. Phishing websites succeed by manipulating human emotions: they raise alarm and create a sense of urgency by warning that failure to act can result in significant losses of data and money.
Although many detectors have been proposed, more work is needed because of the large number of fake websites. To increase the accuracy of phishing detection, this study proposes a phishing classifier that applies multinomial naive Bayes, logistic regression, and natural language processing to URL text. The results show that the approach increases the accuracy of phishing detection, and the literature review shows the success of these algorithms in URL text classification.
Introduction
- Motivation
- Rationale of the Study
- Research Questions
- Preliminaries
- Related Works
- Visual Similarity
- Heuristics-Based Approach
- Fuzzy Rule-based Approach
- Search Engine
- Machine Learning Approach
- Comparative Analysis & Summary
- Scope of the Problem
- Challenges
To evaluate the effectiveness of the proposed features, I conducted extensive experiments using different machine learning algorithms. On the Internet, phishing scams remain one of the most serious threats facing users. Visual-similarity approaches compare a suspect page with the legitimate one; this comparison includes examining the HTML tags, images, and JavaScript present on the suspect page, as well as other aspects of the website.
[20] provided a description of the construction of the proprietary machine learning model used by Google to identify phishing sites. Blacklisting is one of the simplest methods to identify phishing sites, but it cannot be used to identify new phishing sites. One of the most recent methods researchers are using to determine if a website is a phishing site is machine learning.
Research Methodology
Research Subject
Data Collection
- Tokenizer
- CountVectorizer
- Character-to-index mapping
- Sequence-padding
- Embedding
- Snowball Stemmer – NLP
- Stemming
- Logistic Regression and the Underlying Mathematics
- Multinomial Naive Bayes
- How Multinomial Naive Bayes Works
- Confusion Matrix
- F1 Score
- Conclusion
In this model, each word is assigned a unique index together with the number of times it appears in the text. The returned encoding vector therefore has the length of the entire vocabulary and holds, for each word, an integer count of how many times it appears in the phrase. Another representation that is often used is one-hot encoding, where each integer index is expressed as a binary vector with the same length as the size of the vocabulary.
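The count-based encoding described above can be sketched with the Python standard library alone. This is a minimal illustration, not the thesis implementation: the URLs, the letters-only tokenization rule, and the helper names `tokenize`, `count_vector`, and `one_hot` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical URL strings, used only for illustration.
urls = [
    "paypal.com.secure-login.example/verify/login",
    "www.wikipedia.org/wiki/Phishing",
]

def tokenize(text):
    """Split a URL into lowercase alphabetic tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary: every distinct token gets a fixed column index.
vocabulary = sorted({tok for url in urls for tok in tokenize(url)})
index = {tok: i for i, tok in enumerate(vocabulary)}

def count_vector(text):
    """Return a vector of vocabulary length holding per-token counts."""
    vec = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in index:  # tokens outside the vocabulary are ignored
            vec[index[tok]] = n
    return vec

def one_hot(token):
    """Return a binary vector with a single 1 at the token's index."""
    vec = [0] * len(vocabulary)
    vec[index[token]] = 1
    return vec

first = count_vector(urls[0])
print(vocabulary)
print(first)
```

Every vector has the same length as the vocabulary, which is what allows a classifier to treat each URL as a fixed-size row of features.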
To calculate probabilities, logistic regression uses the logistic function, also called the sigmoid function. Multinomial Naive Bayes is one of the most common supervised learning classifiers used in the analysis of categorical text data. This section describes the multinomial Naive Bayes method and the ideas associated with it.
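As a small illustration of the sigmoid function mentioned above, the sketch below maps a weighted sum of features to a probability. The weights, bias, and feature values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1."""
    # Numerically stable form that avoids overflow in exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and token counts, for illustration only.
p = predict_proba(weights=[1.5, -2.0], bias=0.0, features=[2, 1])
print(round(p, 4))
```

A probability above a chosen threshold (commonly 0.5) would be read as the positive class.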
The Multinomial Naive Bayes algorithm is a probabilistic learning approach used mostly in Natural Language Processing (NLP). It is called naive because it assumes that features are independent: the presence or absence of one feature tells us nothing about the presence or absence of another. It is important to have a basic understanding of Bayes' theorem before trying to understand how the naive Bayes classifier works, because the latter is based on the former.
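For reference, Bayes' theorem and the way the independence assumption simplifies it can be written as follows. The symbols are assumptions introduced here for illustration: $c$ is a class (tag) and $w_1, \dots, w_n$ are the tokens observed in a text.

```latex
% Bayes' theorem for two events A and B
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% Naive Bayes: with tokens assumed independent given the class c,
% the class posterior is proportional to the prior times the
% product of per-token likelihoods.
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The predicted class is the one with the largest value of the second expression.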
The probability that a tag applies to a text can be calculated with Bayes' theorem. The process involves developing and selecting a model that achieves a high degree of accuracy on data that is not part of the training sample. Therefore, it is essential to check the correctness of the model before attempting to calculate the expected values.
The F1 score of a classification problem is calculated by finding the harmonic mean of the problem's precision and recall scores.
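The definition above can be sketched directly from confusion-matrix counts. The counts below are hypothetical, and the function name `precision_recall_f1` is introduced only for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives,
# 30 false negatives.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, round(f1, 4))
```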
Implementation Requirements
- Python Libraries: Python comes with an extensive standard library.
- Numpy
- Matplotlib
- Jupyter Notebook
A natural question is why a harmonic mean is used rather than an arithmetic mean. The reason is that the harmonic mean (HM) penalizes extreme values more strictly. For example, for a model with a precision of 0 and a recall of 1, the arithmetic mean would be 0.5, whereas the HM gives 0, which is appropriate since such a model is useless for any application.
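The difference between the two means can be checked numerically. The precision and recall values below are assumed purely for illustration: a degenerate model and a balanced one.

```python
def arithmetic_mean(precision, recall):
    return (precision + recall) / 2

def harmonic_mean(precision, recall):
    """F1-style harmonic mean, defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Degenerate model (assumed values): precision 0, recall 1.
am_bad = arithmetic_mean(0.0, 1.0)
hm_bad = harmonic_mean(0.0, 1.0)

# Balanced model (assumed values): both means agree.
am_ok = arithmetic_mean(0.8, 0.8)
hm_ok = harmonic_mean(0.8, 0.8)

print(am_bad, hm_bad, am_ok, hm_ok)
```

The arithmetic mean rewards the degenerate model with a middling score, while the harmonic mean drops to zero as soon as either component does.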
With a small change to the F1 formula, we can add an adjustable parameter named beta. The F-beta metric measures the effectiveness of a model for a user who attaches beta times as much importance to recall as to precision; for example, beta = 2 weights recall twice as much as precision. In this chapter, we covered the initial work that was done on our research project, as well as the theoretical framework that underpinned it.
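A minimal sketch of the F-beta score follows, using the standard formula (1 + beta^2) * P * R / (beta^2 * P + R). The precision and recall values are hypothetical.

```python
def fbeta(precision, recall, beta):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with low precision and perfect recall.
f1 = fbeta(0.5, 1.0, beta=1)  # beta = 1 reduces to the F1 score
f2 = fbeta(0.5, 1.0, beta=2)  # recall weighted twice as much
print(round(f1, 4), round(f2, 4))
```

Because this model has high recall, its F2 score is higher than its F1 score.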
Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and in an interactive environment across platforms. Matplotlib is available for Python scripts, Python and IPython shells, Jupyter notebooks, web application servers, and four GUI toolkits. Using matplotlib, we can generate plots, histograms, power spectra, bar graphs, error plots, scatter plots and more with just a few lines of code.
Jupyter Notebook is an open source web application with an interactive environment for creating and sharing documents containing live code, equations, visualizations, and narrative text. It can be used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and many other applications.
Experimental Setup
Experimental Result & Analysis
- RegexpTokenizer
- SnowballStemmer
- WordCloud
- Logistic Regression
- Multinomial Naive Bayes
- Classification Result
- Model Testing Result
If we want complete control over how text is tokenized, we use regular expressions to do so. The Snowball stemmer is a stemming algorithm also known as the Porter2 stemmer, since it is an improved version of the Porter stemmer that resolves several of its problems. Stemming reduces a word to its stem, the part to which suffixes and prefixes attach, or to its root form, known as a lemma.
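The two preprocessing steps described above can be sketched with NLTK's `RegexpTokenizer` and `SnowballStemmer`. The URL and the letters-only pattern are assumptions for illustration; the pattern used in the thesis is not stated in this section.

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

# Keep only runs of letters; dots, slashes, and hyphens act as separators.
tokenizer = RegexpTokenizer(r"[A-Za-z]+")
stemmer = SnowballStemmer("english")

# Hypothetical URL, used only for illustration.
url = "secure-login.example.com/accounts/verifying"

tokens = tokenizer.tokenize(url)
stems = [stemmer.stem(tok) for tok in tokens]

print(tokens)
print(stems)
```

Stemming after tokenization means that variants such as a plural and its singular map to the same feature in the later count vector.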
Stemming is, in simple terms, the process of reducing a word to its root or stem so that related words fall under a common stem. "sportingli." The way each of these algorithms stems the word "sporty" reveals an important distinction between them. A word cloud is a data visualization technique for textual data in which the size of each word represents its frequency or relevance.
By using a word cloud, important aspects of the text material can be highlighted. While analyzing data on social networking platforms, it is a common practice to use word clouds. Creating a word cloud using Python requires Python modules called matplotlib, pandas, and wordcloud.
In recent years, text data has grown exponentially, resulting in an ever-increasing demand for the analysis of extremely large volumes of such data. A word cloud is a useful tool for analyzing text data by representing it as tags or words, where the importance of each word is determined by the frequency with which it appears in the text. From the findings presented above, it is clear that the logistic regression model provides the best fit, as evidenced by its accuracy score of 96%.
For cross-verification, I tested a good URL among the bad samples and a bad URL among the good samples.
Discussion
Impact on Society, Environment & Sustainability
- Impact on Society
- Impact on Environment
- Ethical Aspects
- Sustainability Plan
- Summary of the Study
- Conclusions
- Implication for Further Study
This is the reason why I created the model and why I need to refine it so that people can easily check whether a site is a phishing site. Malicious actors attempt to seize control of the vulnerable system to perform a variety of attacks on the users' transactions. In a "phishing attack," the intruder tries to trick the genuine user into believing that a fraudulent website is legitimate.
Logistic regression (LR) and multinomial naive Bayes are the two machine learning classifiers used by the proposed method. Performance measures such as F1 score, recall, and precision are used to evaluate these algorithms. The experimental findings show that the LR algorithm achieves a better F1 score as well as higher precision and recall.
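A comparison of the two classifiers over URL text can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the thesis pipeline: the six URLs and their labels are invented, the token pattern is assumed, and a dataset this small says nothing about real accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical dataset, NOT the data used in the thesis.
urls = [
    "secure-login.example.com/verify/account",
    "account-update.example.net/login/confirm",
    "free-prize.example.biz/claim/login",
    "www.example.org/docs/index",
    "news.example.org/articles/today",
    "www.example.edu/courses/index",
]
labels = ["bad", "bad", "bad", "good", "good", "good"]

def build(classifier):
    """Bag-of-words over letter runs, followed by the given classifier."""
    return make_pipeline(
        CountVectorizer(token_pattern=r"[A-Za-z]+"), classifier
    )

models = {
    "logistic_regression": build(LogisticRegression(max_iter=1000)),
    "multinomial_nb": build(MultinomialNB()),
}

predictions = {}
for name, model in models.items():
    model.fit(urls, labels)
    predictions[name] = model.predict(["example.org/docs/index"])[0]

print(predictions)
```

In a real evaluation the data would be split into training and test sets, and precision, recall, and F1 would be computed on the held-out part.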
Additionally, the LR classifier achieves a phishing detection accuracy of 96%, which is significantly higher than that of the other machine learning classifiers. Evaluation of these machine learning classifiers using larger data sets is the next step in the development of the proposed system. Nevertheless, given the potential dangers posed by rogue URLs, this figure is still too low, and there is room for future improvement.
The rapid development of technology in recent years has led to the creation of a number of innovative and complicated architectural designs and preprocessing methods, some of which are reported to be superior to those described. Because of this, there is potential for future research and it is expected that models built with these advanced algorithms will have better performance.
Summary, Conclusion, Implication for Future Research
In Proceedings of the 3rd ACM on International Workshop on Security And Privacy Analytics, IWSPA '17, pages 55–63, New York, NY, USA, 2017.
In Shlomi Dolev, Danny Hendler, Sachin Lodha, and Moti Yung, editors, Cyber Security Cryptography and Machine Learning, pages 231–248, Cham, 2019.
Vigna, “Detection and analysis of drive-by-download attacks and malicious JavaScript code,” in Proceedings of the 19th international conference on World wide web.
Loukas, “A taxonomy of attacks and an overview of defense mechanisms for semantic social engineering attacks,” ACM Computing Surveys (CSUR), vol.
APPENDIX