• Tidak ada hasil yang ditemukan

CCSIT GRADUATION PROJECTS PROCEEDING 2022

N/A
N/A
Protected

Academic year: 2024

Membagikan "CCSIT GRADUATION PROJECTS PROCEEDING 2022"

Copied!
98
0
0

Teks penuh

(1)

P ROJECTS P ROCEEDING 2022

(2)

I NTRODUCTION

، يملعلا ثحبلا معدو عيجشت لىع لصيف نب نمحرلا دبع ماملإا ةعماجب تامولعملا ةينقتو بساحلا مولع ةيلك لمعت ي ف تامولعملا ةينقتو بساحلا مولع ةيلك تابلاطو بلاط تعجش انه نمو للاخ نم يملعلا ر شنلا لىع ر شاعلا ىوتسملا

اردانف ،ةيلكلا اهيف رختفت ةقباسو ةيلكلا ي ف ةديدج ةبرجت هذه تناك .جئاتن نم هيلا اولصوت امو جرختلا ع يراشم ي ف مهلمع ط ر شني نا ةبل

ةيلكلا تابلاطو بلاط عاطتسا نا ةبرجتلا هذه نع جتن .ةدمتعم ر شن ةيعوا ي ف يملع ثحب سويرولاكبلا

ةروشنملا ةيملعلا قارولأا نودجت بيتكلا اذه ي ف .ةقومرم تارمتؤم وا تلاجم ي ف مامتهلااب ةريدجلا ثاحبلأا نم ددع ر شن ماعل ةيلكلا ةبلط نم 2021

2022 / .

The College of Computer Science and Information Technology (CCSIT) at Imam Abdulrahman

Bin Faisal University (IAU) encourages and supports scientific research. As a new experience,

CCSIT encouraged senior students to publish their findings and achievements during their

works in the graduation projects. Proudly, we can announce that our students succeed to publish

their work in journals and conferences with high reputation. In this proceeding you will find

the scientific papers published by our students for the year 2021/2022.

(3)

Section I: Journal publications

Journal 1: Phishing Email Detection Using Machine Learning Techniques

Journal 2: Intelligent Techniques for Predicting Stock Market Prices: A Critical Survey

Section II: Conference publications

Con1: Sa’ah: Creative Eco-Friendly Mobile Application That Encourages Living Sustainably Con2: Aknaf Website: Interactive Website to Automate the Institution’s Work

Con3: Flourish: Requirements and Design of an Android Application Prototype for Various Symptoms Management in ADHD Patients

Con4: Machine Learning Based Preemptive Diagnosis of Lung Cancer Using Clinical Data Con5: Leen: Web-based Platform for Pet Adoption

Con6: Road Damages Detection and Classification using Deep Learning and UAVs Con7: a comparison between vgg16 and xception models used as encoders for image captioning

Con8: Smart Inventory System

Section III: Others

Intelligent Watering System

(4)

(5)

Phishing Email Detection Using Machine Learning Techniques

Hussain Alattas1, Fay Aljohar2, Hawra Aljunibi3, Muneera Alweheibi4, Rawan Alrashdi5, Ghadeer Al azman6, Abdulrahman Alharby7 and Naya Nagy8

[email protected] [email protected] [email protected] [email protected] 2170002618

@iau.edu.sa [email protected] [email protected] [email protected]

University of Imam Abdulrahman bin Faisal, College of Computer Science, and Information Technology, KSA

Abstract

Phishing is a social engineering technique that mainly aims to steal personal or confidential data and may harm the target individual or organization in many ways. In phishing, fraudsters hide their identity as legitimate people, banks, or institutions, whether governmental or private. And since e-mail communication is the most used method in transmitting confidential or official messages, fraudsters normally target the email users to send their deceptive messages in order to extract data. However, this paper presents an overview of previously conducted studies with respect to detecting phishing email messages using machine learning. The paper’s objective is to analyze and assess the procedures of previously proposed models, datasets, and their results within the specified scope.

Keywords:

Phishing Attacks, Machine Learning, Phishing Emails, Social Engineering, Email Security.

1. Introduction

Phishing emails represent a threat in the world of the Internet, as email is the main place to send messages, whether personally or officially, as many individuals depend on it and review it daily. The interaction of one individual in an organization with a phishing message may lead to the destruction of the entire organization, this is what we mean by a threat phishing message. In this paper, we discuss some of the previous research on detecting phishing attacks in email and some models and suggested features in detecting these attacks. We also present a comparative study of classic machine learning techniques such as Random Forest, Random Forest, Naive Bayes, Decision Tree, and Support Vector Machine (SVM). This paper is sectioned by a problem statement, background, review of literature that has three sub-sections supervised machine learning techniques, non-supervised, and others; moreover, it illustrates a comparison table between models in the aspect of approaches, limitations, algorithms, response time, and accuracy.

2. Problem Statement

A phishing attack is generally accomplished by sending email messages that appear to come from a trusted source and require the user to enter financial, personal, or confidential data. The problem is when the user interacts with the email and sends the requested response, either by replying to the email by sending confidential data, visiting a website, or clicking on a link. Attackers are always coming up with new and inventive ways to dupe people into thinking their activities are related to a legitimate website or email. The user interacts without thinking when the situation seems to be dangerous, fearful, urgent, etc. Most end users usually make the decision based on how they look and feel.

3. Background

In the early 1990s, a huge number of users with false credit card details created an algorithm for stealing user information, they registered themselves on America Online (AOL) site without any validation and started using system resources. When AOL eliminated the random credit card generators in 1995, the Warez group shifted to other techniques, including communicating with individuals via AOL Messenger while pretending to be AOL employees and requesting their personal information. In 1996, American On line's Usenet group posted the first mention of the term

"phishing" in response [1]. Phishing occurs when

cybercriminals send malicious emails to trick a victim

into falling for a scam. The goal is usually to persuade

users to divulge sensitive information such as financial

data or system credentials. The advantages of phishing

for cybercriminals include its simplicity, low cost, and

(6)

2 | P a g e Published In: IJCSNS International Journal of Computer Science and Network Security, VOL.22 No.3, March 2022

effectiveness. Attackers can easily gain access to

valuable information with very little effort and for a low price. Due to this, we are going to discuss a variety of machine learning models to detect such phishing e- mails and then block them [1]. Machine learning is a method of analyzing data that automates the process of constructing analytical models. This branch of artificial intelligence relies on the idea that computers can identify patterns, learn from data, and make decisions without the need for any human interference [2].

4. Review of literature

As phishing emails constitute the primary gateway to phishing websites, several papers were examined that discuss phishing email detection and classification techniques. A major approach for phishing email detection and classification is to employ machine learning techniques.

4.1 Machine Learning and Phishing Emails Detection

Machine learning is a critical ally in fighting phishing emails. Mostly, it investigates the content, metadata, context, and regular user behavior to analyze and detect phishing. Machine-learning includes several types such as supervised machine learning which utilizes label data to train models, and unsupervised machine learning which utilizes patterns from unlabeled data to train them. Though, unsupervised machine learning may give less accurate results compared to supervised machine learning [3].

Examples of previous work regarding these machine learning techniques are going to be discussed in the subsequent sections.

4.1.1 Supervised Machine Learning Techniques

As described in [4], A. Shaheen et al. proposed a model based on supervised machine learning algorithms to classify phished and ham mail. In supervised learning algorithms, a training set is used to classify test sets. The dataset consists of 1605 emails, 1191 are ham and 414 are phished. Ham emails are derived from a publicly available dataset, while phished emails are derived from multiple sources. After preprocessing and converting the

dataset, features were extracted and used to feed the classifiers. The features are extracted from the dataset using the Python programming language and the Nerve Learning Toolkit. The dataset consists of extracted features is segmented and fed into five classifiers: Logistic, Random Forest, SVM, Voted Perceptron, and Naive Bayes. Results showed that the classification of emails through SVM and Random Forest classifiers was highly accurate, achieving the highest accuracy of 99.8%.

Akash Junnarkar et al. [5] built a comprehensive system for spam classification using semantics-based text classification and URL-based filtering. They establish a spam classification system that followed a two-step methodology to ensure that all mail received was either spam or not. The process begins with text classification and is followed by URL analysis and filtering to determine whether any links present in the email are malicious. Five machine learning algorithms were considered for text classification: K-Nearest Neighbours, Naive Bayes, Decision Tree, Random Forest, and SVM. The highest accuracy is obtained with Naive Bayes and SVM, hitting a 97.83 % accuracy rate for SVM and 95.48 % for Naive Bayes.

As Naive Bayes and S had the highest accuracy, they were implemented in the final model to identify trigger words within the text. Lists of spam trigger words and blacklisted URLs were compiled using several datasets. The model was hosted as an API that was called by JavaScript code in Google Apps script to process emails in real-time.

In [6], Jameel et al. proposed a phishing detection model that uses a feed-forward neural network. The model was created based on the characteristics of phishing emails. Thus, a set of 18 features were extracted from the tested email, these email features appear in the header and the HTML body of the email.

In a subsequent step, a multilayer feedforward neural

network is used to classify the tested email into

phishing or ham email. A total of 9100 phishing and

ham emails have been used to test this model; 4550 of

these emails are phishing emails were collected from

publicly available phishing Corpus

(www.monkey.org), while 4550 of these samples are

ham emails were collected from the Spam Assassin

project's ham corpora. According to the testing results,

(7)

the identification rate of this model was excellent (98.7%).

A method based on neural networks was proposed by George et al. [7]. The team used two datasets consisting of 4500 emails phish and ham. To identify ham and phish emails, they applied various algorithms, including Feedforward Neural Network (FNN) with back propagation, and fist order statistical measures. As a result, the false-negative rate and the false positive rate are exceptionally low. With 12 features, 99.95% of the results were classified correctly.

Kumar et al. [8] investigated the detection of phishing emails lacking links and URLs. In their proposed work, they have used NLP and WordNet. Using 600 phishing emails and 400 legitimate emails, they have compiled a list of features including the absence of recipients' names, asking for money, or mentioning money, a sense of urgency, and a sense of urgency that lures victims to respond. They had based their work on Stanford Core NLP's application program interface to identify all the words found in phishing emails.

Harikrishnan et al. proposed [9] (Term Frequency Inverse Document Frequency) TFID+ (Singular Value Decomposition) SVD and TFIDF+ (Nonnegative Matrix Factorization) NMF to evaluate if it is in fact phishing email or not. The model starts by using email datasets with and without headers passed to data pre- processing. Then, to convert words to a numeric representation it uses TFIDF. After that, it uses SVD and NMF to extract features. Lastly, to decide whether it is legitimate or not, classical Machine Learning (ML) techniques are utilized. The accuracy of the result for this model was low due to the highly imbalanced dataset.

Senturkurk et al. [10] proposed a model that begins with data set training by concentrating on the email's body and ignoring the attachments and header. After the data sets are ready, it starts the feature selection.

Then passed it to Waikato Environment for Knowledge Analysis (WEKA) tool after converting it to the proper format. Later, a sub-list is initiated below this new decision node and a sub-decision tree is built.

After that, a different algorithm used: Naïve Bayes and decision tree. Finally, the result shows it will appear

high accuracy rate when a supplied test is selected and performing datasets for all operations is in a real-time environment.

The proposed approach by Hamid et al. [11] is called the Hybrid Feature Selection (HFS). HFS applies to 6923 datasets from both Nazario and SpamAssassin datasets. In addition, it analyzes the sender behavior to resolve a feature matrix utilizing seven email relevant features to determine whether an email is phishing or not. Further, in order for HFS to classify the email, it uses an algorithm named Bayes Net algorithm for email classifications.

As shown by Adewumi and Akinyelu [12] the Firefly Algorithm (FFA) is combined with the (SVM) for machine learning classification to build a hybrid classifier called FFA_SVM. For the purpose of evaluating the FFA_SVM algorithm, a database was constructed of 4000 phishing and ham emails along with their features. FFA_SVM has outperformed the standard SVM.

Alayham et al. [13] design and develop a tool that detects the source code of a phishing site associated with a Gmail account using a decision tree algorithm and generates a report of phishing sites attached to a victim's email as the percentages of phishing emails stored in the user's mailbox. Also, the application can send notifications to the user regarding a phishing site that was detected in the incoming message. The Agile Unified Process (AUP) methodology was used to implement the tool.

Husak and J. Cegan [14] Develop an automated tool to deal with PhiGARo phishing incidents that identify individuals who respond to phishing attack attempts.

The network traffic of the honeypot is monitored, and any phishing emails detected are sent to the PhiGARo tool. The PhiGARo framework is divided into two parts, the Phishing Incident Handling section and the Phishing Response and Detection section. Initially, the phishing incident is reported by the user who recognizes the phishing message in their mailbox.

PhiGARo is implemented by Incident Handler

manually, then interpreting the results, blocking the

phishing email or URL, and finally notifying the

victims.

(8)

4 | P a g e Published In: IJCSNS International Journal of Computer Science and Network Security, VOL.22 No.3, March 2022

Egozi and Verma [15] created a phishing email

detection tool with 26 features. Features include word count, stop words, repeating punctuation, and unique words. 17 machine languages were studied and categorized under weighted and unweighted, based on the results, the weighted linear SVM algorithm represented the best model.

Unnithan et al. [16] proposed a model based on a variety of mathematical algorithms to measure if an email is a legitimate email or not. Consists of two dataset emails with headers and without headers. This sample is sent to count-based representation Term Frequency Inverse Document Frequency (TFIDF) and then combined with domain-level features to convert the input to an understandable input for machine learning algorithms. The last step in the model to decide whether it is a legitimate or phishing email is passed to several machine learning such as logistic regression, Naive Bayes, SVM.

4.1.2 Unsupervised Machine Learning Techniques

Fuertes et al. [17] is described how to develop a Scrum-based algorithm implementation of automatic learning, feature selection, and neural networks, with the goal of attack detecting and mitigating from inside the email server. The samples were divided into three different time periods and tested on a different dataset that was previously merged. Feature Selection, Neural Networks, Agile Scrum methodology, and Matlab process tool are used during the implementation of the proposed algorithm. Because the developed methods complement each other during detection, the acquired results from the concept tests are highly promising.

The findings of the three data sets were evaluated, and the average accuracy was 93.9%, and to validate the results obtained the source of information from the Phish Tank blacklist was used.

Andrade et al. [18] create a Python software that uses a machine-learning algorithm to learn how to recognize bad URLs, then provides relevant analysis and information about the bad URLs. The program also includes an examination of the analysis of anomalous behavior linked to phishing web attacks, as well as how machine learning techniques may be used to counter the problem. This analysis is carried out

using tainted datasets provided by Kaggle Phishing Dataset and Python tools to develop machine learning to detect phishing attacks by analyzing URLs to determine whether they are good or bad based on specific characteristics of URLs, with the goal of providing information in real-time so that proactive decisions can be made to reduce the impact of the attack. When information is added to machine learning algorithms and the algorithm is performed, the accuracy and error are likely to improve.

Unnithan et al. [19] proposed a model based on a variety of mathematical algorithms to measure if an email is a legitimate email or not. Consists of two dataset emails with headers and without headers. This sample is sent to count-based representation TF-IDF and then combined with domain-level features to convert the input to an understandable input for machine learning algorithms. The last step in the model to decide whether it is a legitimate or phishing email is passed to several machine learning such as logistic regression, Naive Bayes, Support Vector Machine. The accuracy of this model after testing

4.1.3 Other Machine Learning Techniques

The proposed phishing detection model in [20] by Viktorov, uses a dataset of phishing and non-phishing emails from different websites. The model starts with preprocessing the collected data to extract features from each email. Second, passed to feature selection which splits into two scenarios. Those scenarios are automated and manually. In the manually use clustering, which is like classification, but it is unsupervised. third, it is passed to the classification selection phase. fourth to multi-classifier, that uses several algorithms to build it such as Logistic regression, Decision Tree and Sequential minimal optimization. The results showed that clustering will increase the accuracy rate.

Rastenis et al. [21] discuss the Multi-Language Spam/Phishing Classification solution that classifies an unwanted email to either spam or phishing emails classes through using the email body content and a dataset that is constructed by three other known data sets: Nazario, SpamAssassin, and VilniusTech.

Additionally, it can classify the email even if it is

written in Russian and Lithuanian languages rather

(9)

than just English through integrating with existing classifying emails solutions and automated translation.

Fang et al. proposed [22] an approach named THEMIS (Greek word) that uses unbalanced dataset and divides the email into two parts: the email’s header and body.

Then, it splits it more into two levels: the char-level, and word-level for both header and body. Also, it calculates the likelihood if an email is phishing by comparing the probability with a classification value called a threshold, if the probability is greater than this value then it is a phishing email.

Li et al. have presented [23] the overall function of the Long Short-Term Memory (LSTM) Network method for big email data. LSTM cannot use an open-source dataset; thus, a filter must be conducted manually first of the nature of the phishing emails the enterprise receives. After a filter has been established, both

supervised KNN and unsupervised K Means are used to conduct labeling automation to construct a set of samples used for phishing email detection.

5 Comparison

This section represents a comparison between given machine learning techniques discussed in the literature to detect phishing emails. The comparison is based on which algorithm(s) or model(s) had been used, accuracy, Ture Positive Rate (TPR), False Positive Rate (FPR), datasets used, number of features, response time, and drawbacks.

5.1 Supervised Machine Learning Techniques Comparison Table.

Author Algorithm(s) used Accuracy TPR FPR Datasets used No. of

Features Response time Drawbacks Supervised Machine Learning Techniques

[4] Random Forest 99.87% 99.9% 0.2% N. A 9 N. A

Data used may not reflect real life

scenarios

[5] SVM 97.83 % 53.0% 3.0%

Enron Data set and spam.csv

Kaggle data

N. A N. A

There is no real-time learning of email

classifiers in the provided data sets [6]

FNN 98.72% 98% 1.2% N. A 18 0.00000067

seconds

Increased numbers of neurons will increase training and testing

time

[7] FNN 99.95% 100% 0.09% N. A 12 0.00000118

seconds N. A

[8] NLP 99.4% N. A N. A N. A N. A N. A

Unable to extract text from email

attachment [11] Bayes Net 94% 0.97% 0.13% Nazrario &

SpamAssassin 7 N. A

Graphical form in phishing emails cannot be detected

[12] SVM 99.94% N. A 0.01% Dataset consists

of 4000 emails 16 0.16 seconds N. A Decision Tree 96.5

%

92%-

97% PhishingCorpus

7 8.54 seconds

Dataset is highly imbalanced

Random Forest 97.1% Slow

(10)

6 | P a g e Published In: IJCSNS International Journal of Computer Science and Network Security, VOL.22 No.3, March 2022

[9]

KNN 97.6

%

8%-

26%

4.3 Seconds

Naive Base 94.7

%

0.01 seconds

AdaBoost 97.7% N. A

SVM 98.7% 0.16 seconds

Logistic Regression 96.8% 12.11 seconds

[10]

Naïve bayes

89% N. A N. A MIME 13

0.01 seconds Datasets must be in real-time environment to

success

Decision Tree 8.54 seconds

[13] Decision Tree 95.05% N. A N. A Used 3 available

dataset 8 8.54 seconds N. A

[14] IPFIX N. A N. A Low N. A N. A 3 to 19 per day

Must support the trustworthiness of honeytokens and honeypots

[15] SVM 90% 83.0% 96.0% IWSPA 28 0.16 seconds Takes few hours to

run

[16]

Naïve Bayes 79.5% N. A N. A

Enron and

Avocado N. A

0.01 seconds

Cannot extract feature from headers

SVM 88.4% 3593/4

583

489/458

3 0.16 seconds

Logistic Regression 80.1% N. A N. A 2.11 seconds

Table 1: Supervised Machine Learning Techniques Comparison

5.2 Unsupervised Machine Learning Techniques Table.

Table 2: Unsupervised Machine Learning Techniques Comparison.

5.3 Other Machine Learning Techniques Table.

Author Algorithm(s) used Accuracy TPR FPR Datasets used No. of

Features Response time Drawbacks Unsupervised Machine Learning Techniques

[17] Agile Scrum 93.9% N. A 2.7% Debian

Phish Tank 7 N. A N. A

[18] Logistic 90% N. A N. A Kaggle N. A 12.11 seconds N. A

[19]

SVM 95%

3807/

3572 7/217 N. A 5 N. A

cannot extract features from

headers

Naïve Bayes 94%

Logistic Regression 96%

(11)

Table 3: Other Machine Learning Techniques Comparison.

5.4 Analysis

According to the comparisons in Tables 1, 2, and 3, four main models had been considered as remarkable models among others based on different parameters for email classification. These models are SVM, NLP, Random Forest and Naive Bayes models. Despite the fact that they had gained popularity in many previous works regarding email classification techniques SVM, NLP, Random Forest, and Naive Bayes algorithms have very high accuracy, TPR, and FPR compared to other algorithms with fast response times. On the other hand, two datasets had also gained popularity in the phishing detection field to extract informative email features for classification, these are Spam Assassin and Nazario corpuses. However, our literature study had shown that there are many effective algorithms of email classification, yet attackers are becoming more and more sophisticated with powerful techniques.

Thus, each time ones want to decide which algorithms or learner are best to distinguish if an email is a phishing or non-phishing email is now becoming a difficult challenge.

6 Conclusion

Over the past few years, the problem of phishing emails has become more common. Phishing is a type of attack. The intention of phishing is to obtain personal information, such as passwords, credit card numbers, or other account information, by using emails. Phishing emails closely resemble legitimate ones, making it hard for a layperson to distinguish them. Machine learning techniques currently play a major role in phishing email detection and classification. Several models and approaches are available for phishing email detection. Each approach has its own unique advantages and capabilities, as well as limitations. Hence, this literature review has summarized and compared several methods and approaches for protecting against phishing email attacks.

References

[1] P. Verma, A. Goyal, and Y. Gigras, “Email phishing: text classification using natural language processing,” Comput. Sci. Inf. Technol., vol. 1, no. 1, pp. 1–12, 2020.

[2] E. Bisong, “What is machine learning?” in ” in Building Machine Learning and Deep Learning

Author Algorithm(s) used Accuracy TPR FPR Datasets used No. of

Features Response time Drawbacks Other Machine Learning Techniques

[21]

SVM English

only (90.07%

±3.17%) English, Russian and

Lithuanian (89.2%±2.1

4)

95.2% N. A

Nazario, SpamAssassin, and

VilniusTech.

N. A

0.16 seconds

Accuracy lessens 10%

if a mixed dataset is used for training and

testing

Random Forest Too slow

Decision Tree 8.54 seconds

Naïve Bayes 0.01 seconds

KNN 4.3 Seconds

[22] Threshold value 99.848% 99.0% 0.043

%

WordNet, Enron,

and Nazario N. A Increased

response time N. A [23]

KNN

95% 98% N. A Collected from a private enterprise 7

4.3 Seconds

Consume time on constructing the filter

K-Means Fast

[20]

Logistic Regression

93% N. A 4.89% Datasets consist of 4800 emails 47

12.11 seconds Email is not clustered before classification

which reduced the accuracy

Decision Tree 8.54 seconds

CART N. A

SMO Medium

(12)

8 | P a g e Published In: IJCSNS International Journal of Computer Science and Network Security, VOL.22 No.3, March 2022

Models on Google Cloud Platform, Berkeley, CA:

Apress, 2019, pp. 169–170.

[3] Gatefy, “How artificial intelligence and machine learning fight phishing,” Gatefy, 22-Mar-2021.

[Online]. Available: https://gatefy.com/blog/how- ai-and-ml-fight-phishing/. [Accessed: 13-Mar- 2022].

[4] S. Rawal, A. Shaheen, and S. Malik, “Phishing Detection in E-mails using Machine Learning,”

Int. J. Appl. Inf. Syst., vol. 12, no. 7, pp. 21–24, 2017.

[5] A. Junnarkar, S. Adhikari, J. Fagania, P.

Chimurkar, and D. Karia, “E-mail spam classification via machine learning and natural language processing,” in 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), 2021.

[6] N. Ghazi, M. Jameel and L. E. George, “Detection of phishing emails using feed forward neural network,” Int. J. Comput. Appl., vol. 77, no. 7, pp.

10–15, 2013.

[7] A. A. Abdullah, L. E. George, and I. J.

Mohammed, “Research Article Email Phishing Detection System Using Neural Network,”

Research Journal of Information Technology, vol.

6, no. 3, pp. 39–43, 2015.

[8] Aggarwal, Shivam, Vishal Kumar and Sithu D.

Sudarsan. “Identification and Detection of Phishing Emails Using Natural Language Processing Techniques.” SIN (2014).

[9] B, Harikrishnan & Ravi, Vinayakumar & Kp, Soman. (2018), "A Machine Learning Approach Towards Phishing Email Detection," CEN- Security@IWSPA 2018.

[10] Ş. Şentürk, E. Yerli and İ. Soğukpınar, "Email phishing detection and prevention by using data mining techniques," 2017 International Conference on Computer Science and Engineering (UBMK), 2017, pp. 707-712, doi:

10.1109/UBMK.2017.8093510.

[11] I. R. A Hamid, J. Abawajy, and T.-H. Kim,

“Using feature selection and classification scheme for automating phishing email detection,” Stud.

Inform. Contr., vol. 22, no. 1, pp. 61–70, 2013.

[12] O. A. Adewumi and A. A. Akinyelu, “A hybrid firefly and support vector machine classifier for phishing email detection,” Kybernetes, vol. 45, no. 6, pp. 977–994, 2016.

[13] R. Alayham, C. Ren, J. Arshad and A.

Muhammad, "Email Anti-Phishing Detection Application", Management & Science University, 2019.

[14] M. Husak and J. Cegan, "PhiGARo: Automatic Phishing Detection and Incident Response Framework", Masaryk University, Brno, Czech Republic, 2021.

[15] G. Egozi and R. Verma, "Phishing Email Detection Using Robust NLP Techniques", Department of Computer Science University of Houston, Houston TX, USA, 2021.

[16] Unnithan, Nidhin A., et al. "Machine learning based phishing e-mail detection." Security- CEN@ Amrita (2018): 65-69.

[17] L. Zapata, D. Ona, G. Rodriguez, and W. Fuetres,

"Phishing Attacks: Detecting and Preventing Infected E-mails Using Machine Learning Methods", Universidad de las Fuerzas Armadas ESPE, Sangolqui, Ecuador, 2019.

[18] I. Ortiz-Garc, R. Andrade and M. Cazares,

"Detection of Phishing Attacks with Machine Learning Techniques in Cognitive Security Architecture", Politecnica Salesiana,Quito, 2021.

[19] Unnithan, Nidhin A., et al. "Machine learning based phishing e-mail detection." Security- CEN@ Amrita (2018): 65-69.

[20]

Viktorov, Oleg. "Detecting phishing emails using machine learning techniques." PhD diss., Middle East University, 2017.

[21]

J. Rastenis, S. Ramanauskaitė, I. Suzdalev, K.

Tunaitytė, J. Janulevičius, and A. Čenys,

“Multilanguage spam/phishing classification by email body text: Toward automated security incident investigation,” Electronics (Basel), vol.

10, no. 6, p. 668, 2021

[22] Y. Fang, C. Zhang, C. Huang, L. Liu, and Y.

Yang, “Phishing email detection using improved RCNN model with multilevel vectors and attention mechanism,” IEEE Access, vol. 7, pp.

56329– 56340, 2019.

[23] Q. Li, M. Cheng, J. Wang, and B. Sun, “LSTM

based phishing detection for big email data,” IEEE

Trans. Big Data, pp. 1–1, 2020.

(13)

Authors Biography

Muneera Alweheibi, Rawan Alrasheddi, Fay Aljohar, and Hawra Aljunibi: are currently pursuing

their Bachelors degree in Cyber Security and Digital Forensics at the department of Networks and Communications, College of Computer Science and Information Technology (CCSIT), Imam Abdulrahman Bin Faisal University, Dammam.

Mainly, their research interests include email security, Artificial Intelligence and Machine Learning.

Hussain Alattas

is currently working in the department of Networks and Communications Department, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University (IAU), as a lecturer. Hussain has completed his BS degree in Computer Science from IAU and MS degree in Cybersecurity and Artificial Intelligence from The University of Sheffield.

Ghadeer Alazman

is currently working in the department of Networks and Communications Department, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University (IAU), as a teaching assistant.

Ghadeer has completed his BS degree in the science of

Cyber Security and Digital Forensics from IAU.

(14)

Review 1

Intelligent Techniques for Predicting Stock Market Prices: A

2

Critical Survey

3

Abstract: The stock market is a field that many people are interested in, regardless of their occupa- 4 tional background. Individuals who have adequate knowledge can buy shares in the market and 5 generate additional income. Nowadays, the cost of living has increased. Hence, the number of peo- 6 ple who are investing in the stock market is increasing dramatically. While anyone can participate 7 in the stock market at any time, there is no guarantee that they will profit from this investment. The 8 stock market is a risky way to invest, given that it is unknown whether the value of a specific stock 9 will rise or fall. Making stock market predictions using artificial intelligence techniques is a possible 10 way to help people anticipate stock market trends. The current research study showed that many 11 factors impact changes in the stock market’s value in general and in the Saudi Arabia Stock Ex- 12 change specifically. To the best of our knowledge, most previous research only considered historical 13 data for predicting stock market trends. The present study aimed to enhance the accuracy of the 14 daily closing price for three sectors of the Saudi stock market by considering historical data and 15 sentiment data. Several intelligent algorithms were considered, and their performance indicators 16 were discussed and compared. In general, this research study found that more accurate stock mar- 17 ket prediction models can be produced by employing both historical data and sentiment data. 18 19 Keywords: stock market, predictions, artificial intelligence techniques, historical data, and 20

sentiment data. 21

22

1. Introduction 23

Living expenses and taxes have been increasing in recent years, while salaries have 24 become insufficient to meet future needs. Consequently, people are more likely to start 25 new firms or look for extra sources of income. One of the widely utilised methods to ac- 26 complish that is to invest in the stock market, which can provide additional income. How- 27 ever, it requires knowledge of the stock market to correctly predict future stock prices in 28

order to avoid the potential risks. 29

The financial market is a simple system that enables individuals to buy, own and 30 then sell shares at any time with a straightforward process conducted on virtual plat- 31 forms. Although it can be beneficial to do so, investing in the stock market might result in 32 significant losses, particularly if an individual lacks an understanding of stock prices and 33 future forecasts. Furthermore, various factors, including the companies' activities and per- 34 formance, supply and demand and news reports, have significant impacts on prices. 35 These issues necessitate the development of stock price prediction applications to accu- 36

rately estimate stock market prices. 37

Since the beginning of the 21st century, artificial intelligence (AI) technologies, includ- 38 ing machine learning (ML) and deep learning (DL), have become popular and increas- 39 ingly applied in different domains. These strategies focus on employing statistical algo- 40 rithms and exploiting data to build smart systems that can learn, comprehend and act in 41 ways that are indistinguishable from humans in a particular scenario. Consequently, re- 42 searchers agree that they significantly enhance the capabilities of computation, pattern 43 matching and analysing data to extract useful insights quickly and accurately. In the field 44 of the stock market, ML or DL algorithms can be trained with different kinds of data, 45 including historical data, representing a stock’s behaviour, and sentiment data from social 46

media, in order to predict the future prices. 47

(15)

In this research study, we reviewed and analysed previously published studies that 48 applied AI-based technologies in the field of stock market prediction. Although various 49 literature reviews examining how intelligent-based systems have been used to predict the 50 stock market prices have been published in recent years, none have been as thorough as 51 this one. In this article, we cover 45 research studies published between 2015 and 2021 in 52

the field of stock market prediction. 53

This critical review study also used a novel taxonomy that, to the best of our 54 knowledge, has never been used in earlier studies. It establishes several criteria against 55 which the articles under review can be evaluated and contrasted, including: 56

• The dataset used; 57

• The ML/DL algorithms applied; 58

• The targeted market: local or global; 59

• The kind of features utilised: sentiment data/ historical data/ or both; 60 The performance results were obtained by applying AI techniques.

61 The findings of this literature review point to promising directions for future research 62 and applications in the field of stock market prediction using intelligent algorithms. Re- 63 searchers will be able to use the comparisons and discussions provided in this article to 64 determine which directions to pursue in their research, such as whether to improve intel- 65 ligent-based algorithms or consider other algorithms, which features should be added or 66 removed when building the training dataset, and which evaluation metrics should be 67

used to evaluate the created intelligent systems. 68

The rest of this article is organised as follows. Section 2 presents a literature review, 69 summarising studies focusing on stock markets and the factors that affect stock prices. 70 Section 3 presents a comparison and analysis of the examined research publications, the 71 stock markets they target using ML and DL approaches and their findings. Section 4 pre- 72 sents the study’s conclusions and recommends future research directions. 73

2. Literature Review 74

This section presents an overview of 45 studies that were conducted to predict the 75 future of stock market prices. We reviewed studies that included the idea of applying AI 76 techniques to the stock market to gain a general understanding of the models that were 77 used, to determine how far research has expanded in this field and to identify the ideas 78 that have not been applied in research on the Saudi stock market. Finally, we provide a 79 brief overview of the suggested ideas that we will follow throughout this project. 80 One of the hottest topics that is being discussed is stocks that are being traded, as 81 they are considered to be an additional source of income and savings. The need to increase 82 the source of income has grown after the rise in the cost of living and the increase in the 83 tax burden, as companies make a general appeal for cost-savings to obtain the funds 84 needed for their investments in the form of shares. In the stock market, the profit and loss 85 ratio is based on the participation rate of each individual. Although the stock market can 86 be beneficial to investors, there is a risk in participating in it, as the profit and loss ratio is 87 not guaranteed due to the stock market’s dependence on many factors, such as historical 88 data of the stock, news data, company performance and future expectations, supply and 89 demand and other factors that cause the need for applications that estimate the price of 90 the stock market shares of companies. From this point of view, we will apply AI and DL 91 techniques to estimate stock market prices. AI and DL refer to systems that simulate hu- 92 man intelligence to perform tasks and can be improved based on the information they 93 collect. AI is used in many applications and fields because it provides value to most jobs, 94 companies and industries. After reviewing 45 research studies and applications that ap- 95 plied AI techniques to estimate stock market prices in the future, we found that it pro- 96 vided the results with specific accuracy. Comparing these studies highlights the gap in 97 the research; thus, it will help us develop a new system using AI techniques to estimate 98 stock market prices and close the existing gap in the applied studies [1][2]. 99 The comparison of the research was based on important factors in the application of 100

(16)

• Common dataset used; 102

• Common algorithm used; 103

• Implementation of sentiment data to estimate the market price; 104

• Results obtained by applying AI techniques. 105

The results from the comparison and analysis will help researchers apply new ideas and 106 facilitate a new shift in the field of information systems, enabling the development of al- 107 gorithms that can be used as effective techniques for estimating stock market prices. 108 We found some gaps in our literature review. While conducting our research and 109 looking for similar studies that used the Saudi stock market, we noticed that there was a 110

lack of research that analysed sentiment data. 111

Normally, several evaluation indicators are used to evaluate intelligent models. The 112 ones most commonly used to evaluate intelligent stock market models are: 113

• Precision: also known as positive predictive value; it measures the number of 114 correctly predicted cases that turn out to be positive; 115

• Accuracy: the number of correct predictions divided by the total number of pre- 116

dictions; 117

• Correlation: an indicator of the linear relationships (meaning they change at the 118 same rate); it is a common way of interpreting simple relationships without iden- 119

tifying a cause-effect statement; 120

• Recall: also known as sensitivity; it is measured by examining how many posi- 121

tive outcomes can be predicted correctly; 122

• F1-score: a static statistic that expresses the balance between recall and precision; 123

• Error rate: measures the number of patterns that have been predicted incorrectly 124

by the model; 125

• Sum of squared errors (SSE): a weighted sum of squared errors that does not 126 equal constant variance when using heteroscedastic errors; 127

• Mean absolute error (MAE): an average error between the magnitudes of two 128

observations expressing the same phenomenon; 129

• Mean squared error (MSE): the average squared difference between the esti- 130 mated and actual values; it is a quantifier of the quality of an estimator; 131

• Root-mean-square error (RMSE): a commonly used measurement of the differ- 132 ence between the predicted and observed values (sample or population) pre- 133

dicted by the model; 134

• Mean absolute percentage error (MAPE): measures the accuracy of a forecasting 135

method, which is typically expressed as a ratio; 136

• R-squared (R2): depicts how much of a dependent variable's variance is ex- 137 plained by an independent variable or variables in a regression model. 138 139 140

2.1 Research Depending on Historical Data 141

142 One study [3] used the idea of predicting the stock price to such an extent that it can 143 be sold before its worth decreases or bought before the price increases. This study used 144

(17)

different artificial neural networks (ANNs) to foresee the stock price, but the productivity 145 of forecasting by ANNs relies on the learning algorithm used to train the ANN. This study 146 compared three algorithms: Bayesian Regularization, Scaled Conjugate Gradient (SCG) 147 and Levenberg-Marquardt (LM). It used data (ticks) from 30 November 2017 to 11 January 148 2018 (barring occasions) of Reliance Private. Each day had around 15,000 data focuses, 149 and the dataset contained around 430,000 data focuses. The data were acquired from 150 Thomson Reuter Eikon database. (This dataset was bought from Thomson Reuter). Every 151 change in the price of a stock from one trade to another has a tick that refers to it. The 152 stock price at the beginning of each 15-minute period was extracted from the tick data, 153 which represents the optional dataset run on similar algorithms. Thus, these three algo- 154 rithms have a tick-data utilisation precision of 99.9%. Moreover, for every LM, SCG, and 155 Bayesian Regularization, the exactness over the 15-minute dataset decreases to 96.2%, 156 97.0%, and 98.9%, respectively, which is significantly poor in comparison to the results 157 acquired using tick data. The neural networks (NNs) used in this study are weak; in fact, 158 many other NNs, such as long-short term memory (LSTM), give better predictions. Fur- 159 thermore, applying sentiment analysis can help achieve an additional edge in relation to 160

stock price expectations. 161

Another study [4] focused on the worst prediction accuracy domain, which is the 162 short-term prediction, using time series data of stock prices. An Alpha Vantage applica- 163 tion programming interface

(

API) was used to access the time series data of 82 random 164 stocks traded at the New York Stock Exchange. (NYSE) The API provides access to daily, 165 weekly and monthly time series data. Since this study used short-term prediction, daily 166 time series data were chosen, which includes the daily opening price, daily high and low 167 prices, daily closing price and daily volume. The study started with a simplified problem, 168 which was predicting whether the prices would increase or decrease in the subsequent 169 days using the stock prices and volumes from the previous days. For this classification 170 problem, logistic regression (LR), Bayesian Network, Simple Neural Network and Sup- 171 port Vector Machines (SVM) with a Radial Basis Function (RBF) kernel were conducted. 172 When using only past price data and technical indicators, the accuracy was found to be 173

70%, which is not high compared to other studies. 174

A study conducted by [5] used ML algorithms, such as Random Forest (RF), K-Near- 175 est Neighbours (KNN), SVM and LR, to evaluate the performance in the field of stocks. 176 That study evaluated the algorithms by assessing performance metrics, such as accuracy, 177 recall, precision and F-score, with the aim of identifying which algorithm most effectively 178 predicted the future performance of the stock market. The dataset is from Kaggle and 179 represents data from the National Stock Exchange of India. That study found that RF had 180 the highest accuracy rate for prediction and the highest recall rate, LR achieved the highest 181 precision and F-score and KNN was the worst performing algorithm among the four that 182 were studied. Overall, RF was the best algorithm, with an accuracy rate of 80.7%. After 183 obtaining the results of the four algorithms, the pros and cons of each technique were 184 identified. Thus, it is easy to determine the best and effective algorithm for the model. 185 One study [6] aimed to apply the KNN and non-linear relapse approaches to antici- 186 pate stock prices for some major companies listed on the Jordanian Stock Exchange. The 187 Jordan Steel Company (JOST), Irbid District Electricity (IREL), Arab International for Ed- 188 ucation and Investment (AIEI), Arab Financial Investment (AFIN) and the Arab Potash 189 Company (APOT) are all listed on that stock exchange to assist investors, decision-makers 190 and clients in making better investment decisions. The study used a dataset of the stock 191 information from 4 June 2009, to 24 December 2009 for five randomly chosen companies 192 recorded on the Jordan Stock Market. Each of these companies has around 200 records 193 with three ascribes, including low price, closing price and high price. The study computed 194 the total squared mistakes, RMSE, and the normal errors for the five companies and iden- 195 tified the contrasts between the anticipated values and the real values in the sample data. 196 It found that the number of errors was small, which demonstrates that the actual value 197 and predicted value are close. According to the results, there is high precision in using the 198 KNN algorithm for forecasting stock values. Then, non-linear regression was applied. The

(18)

outcome indicates that the use of data mining (DM) methods can help decision-makers at 200 various levels when using KNN to examine the data. This study [6] achieved high results 201 using the KNN algorithm as it has a small error ratio, and this yields high precision in 202 contrast with [5], which it did not achieve a high precision result. 203 A DL-based model to predict stock prices was presented in [7]. That study used the 204 historical records of the National Stock Exchange Fifty (NIFTY 50) which contains 50 in- 205 dexes listed in the National Stock Exchange of India from 29 December 2008 to 28 Decem- 206 ber 2018. A multi-step process was used to forecast the opening values of the stock prices 207 of the 50 records. Moreover, the values were predicted week by week, so when a week is 208 over, the actual values of a week are included in the training model before it starts training 209 again. The results using the convolutional neural network (CNN) algorithm show the 210 forecasting performance with a mean of 348.26 in one week, which is better than the mean 211

of 407.14 for two weeks. 212

The study conducted by [8] predicted the ability of various well-known forecasting 213 models, including dynamic versions of a single-factor Capital Asset Pricing Model 214 (CAPM)-based model and Fama and French's three-factor model, to close the gap in the 215 literature. The dataset was collected from the Shanghai Stock Market and it compared the 216 predicting performance of each of the six models with the performance of an ANN model 217 using the same predictor variables; however, it relaxed the model linearity assumption. 218 Surprisingly, there were no statistically significant differences between the CAPM and the 219 three-factor model in terms of forecasting accuracy. Furthermore, each ANN model out- 220 performed the equivalent linear model, showing that NNs might be a valuable tool for 221 predicting stock prices in emerging markets. On average, the overall accuracy of the pro- 222 posed method is equal to 0.0113 MAD, 0.3118 MAPE and 0.2807 MSE. As discussed in [3], 223 by utilising an ANN algorithm, the model provides more accurate results. 224 Another model that analysed the stock market and identified nonlinear relationships 225 between the input data and the output data was proposed in [9]. Two types of ANN algo- 226 rithms were used in this study, RBF and Multi-Layer Feed Forward (MLFF using the 227 Shanghai Stock Exchange composite index in China. The reason for using two ANN tech- 228 niques is that an RBF network can deal with nonlinear functions and operate with the 229 complexity of analysing the rules and laws in the system, while MLFF is used to deal with 230 the complex nonlinear relationship between the input and output data. This study found 231 that RBF outperformed MLFF because the RBF's error is substantially smaller. The appli- 232 cation provided an excellent comparison of two types of ANN algorithms. 233 The study conducted by [10] focused on the efficacy of DL in predicting one-month- 234 ahead stock returns in a cross-section of the Japanese Stock Market. NNs have been used 235 in several studies on stock return predictability. NNs have also been used to make indi- 236 vidual stock return estimates. The MSCI Japan Index dataset consists of data from Decem- 237 ber 1990 to November 2016 and contains 319 indexes. The study used ANNs, Support 238 Vector Regression (SVR) and RF. The result shows that deep NNs outperform shallow 239 NNs, in general, and the top networks also beat typical ML models. Indeed, the findings 240 suggest that DL shows potential as a sophisticated ML method for predicting cross-sec- 241 tional stock returns. A future study could include the use of RNN, which is designed to 242 handle time series data. An analysis of several DL models is also predicted to improve the 243 accuracy of stock return prediction in the cross-section data. 244 Another study [11] aimed to construct a novel ensemble ML framework for daily 245 stock pattern prediction by combining traditional candlestick charting with the latest AI 246 methods. The Chinese Stock Market dataset was used in this research, with a total of 247 65,000 rows of data in each round. A total of six ML models were used in this study: LLR, 248 SVM, KNN, RF, Gradient Boosting Decision Tree (GBDT) and LSTM. After comparing the 249 results of each of these models, RF and GBDT showed a good predictive ability for short- 250 term prediction, whereas the LR prediction level needs to be improved and KNN and 251 SVM only fit in some patterns. The LSTM model has more advantages as a DL, but those 252 advantages were not fully discussed. Overall, the model had an accuracy greater than 253 52%, and an F1-score greater than 50%. This research provides useful information and it 254

(19)

is distinct from other studies conducted on the basis of stock market prediction, because 255 it shows in detail the results of each model that used different algorithms. 256 A study conducted by [12] applied an ML algorithm and time series forecasting using 257 Microsoft Excel as the best statistical tools for graphic and tabular representation of the 258 prediction results. That study used data from Yahoo Finance for Amazon (AMZN) stock, 259 AAPL stock and Google stock datasets. They focused on using LR, three-month moving 260 average (3MMA) and Exponential Smoothing (ES) algorithms. Three different prediction 261 methods were considered. Of them, ES based on LR showed the best results with a 16.62 262 average absolute error. The study is distinct from others as it also applied a time-series 263 analysis to predict the stock market prices for the next month. 264 In [13], two effective models were developed using ANN and SVM classification 265 techniques to predict the direction of stock price index movement. Then, the ability to 266 anticipate the direction of movement in the daily Istanbul Stock Exchange (ISE) National 267 Index was tested. The data were from the ISE National Index's daily closing price move- 268 ment from 2 January 1997 through 31 December 2007. The average performance was 269 found to be better for the ANN model (75.74%) than the SVM model (71.52%). The pre- 270 diction performance of these models can be improved in two ways. The first way adjusts 271 the model parameters by conducting a more sensitive and thorough parameter setting. 272 The second way is to employ additional macroeconomic variables, such as foreign ex- 273 change rates, interest rates and the consumer price index, as inputs to the models. 274 One study [14], proposed a model to understand the financial market and build a 275 neural model for the financial market theory with respect to technical analysis, fundamen- 276 tal analysis and time-series analysis. That study used the feedforward multilayer percep- 277 tron ANN algorithm. This algorithm is used because of its efficiency in predicting a time 278 series and its ability to learn and recognise non-linear data. The authors conducted a sur- 279 vey to gather input from qualified professionals on the models, techniques and indicators 280 used in the pricing of stocks. A questionnaire was sent to 50 investors and analysts work- 281 ing in the stock market. The datasets were obtained from Economatica, Brazil’s Central 282 Bank, the São Paulo Stock Exchange and Thomson Reuters. The result is based on the set 283 of error metrics with a window size equal to 3, as it presents a Prediction of Change in 284 Direction (POCID) correct rate of 93.62% and a MAPE of 5.45%. The results could be fur- 285 ther improved by expanding the algorithms used to obtain high accuracy and discover 286 the best algorithm for forecasting stock market prices and trends. 287 Another study examined the prediction power of NN modelling and SVM to forecast 288 Russian stock prices [15]. The dataset was consisted of the daily Moscow Interbank Cur- 289 rency Exchange (MICEX) stock price index, as well as some technical and fundamental 290 indicators from 2002 to 2016, based on statistical and analytical methods. Datasets are used 291 for training, testing and verification in Python for ML. Feedforward NNs were used to 292 predict the MICEX index. Moreover, a back propagation (BP) algorithm was used to train 293 it. The study used the activation function as its baseline and the dependent and independ- 294 ent variables were normalised to the interval [-1,1]. To decrease the potential for overfit- 295 ting problems, the data were split into a 60% training sample and a 20% testing sample. 296 To determine which parameters of the learning algorithm and NN architecture for each 297 sample are optimal, training samples and testing samples were used. The performance of 298 the NN was also evaluated using a validation sample (20%). The NN’s optimal learning 299 parameters were found empirically using a grid search. An optimal configuration was 300 also found by training and testing processes in SVM. Data normalisation in SVM was ac- 301 complished using transformations. The prediction performance of the NNs and SVM was 302 compared based on MSE, RMSE, MAE, MAPE, R2 and the calculated coefficient of deter- 303 mination (cR2). B SVM was found to have a higher predictive power than NN modelling. 304 305 306 307 A genetic algorithm (GA) proposed in [16] was used to forecast prices and trends for 308 the India Stock Market. The dataset was extracted from the India TCS Stock Market for

(20)

trading values for 259 days, including the opening, closing, lowest and highest prices of 310 each day's trading. The GA is a search algorithm that combines the mechanics of selection 311 and genetics. There are three types of genetic operations: crossover, mutation and selec- 312 tion. Historical

data were used to predict future search points with improved performance 313 efficiently. The study was conducted using a time-series analysis, which identifies pat- 314 terns in statistical information by returning information at regular intervals of time. To 315 make a prediction for the opening price of the next week, the closing price values of the 316 previous week are taken by ignoring the negative ‘ve’ sign, since the opening price is al- 317 ways higher than the closing price. In contrast, when predicting the closing price from the 318 opening price the sign changes from a positive ‘ve’ to a negative ‘ve’. However, the sign 319 does not change when predicting low and high prices. The Chi-square test was used to 320 determine whether the prediction was significant or merely a coincidence. According to 321 the test findings, time series and GA significantly improved the prediction system's accu- 322

racy by 99.87%. 323

Another study [17] proposed a practical method for predicting stock development. It 324 did not present any numerical result as the aim was to present the most appropriate anal- 325 ysis for anticipating the stock market. A dataset from the previous year's stock market was 326 employed and divided into training and testing data to improve the accuracy. The RF 327 algorithm and the SVM algorithm were both considered. The results of the calculation 328 showed that the RF algorithm performed better in predicting a stock's market price, which 329 was proven in [5] and [11], as both studies achieved higher performance using RF. 330 Since stock price prediction using time series forecasting is one of the most complex 331 challenges in the financial field, a method was developed in [18] for predicting stock price 332 and time series using a hybrid method of GA and ANN techniques. That study used data 333 from Apple, Pepsi, IBM, McDonald's and LG. Compared to traditional models, the pro- 334 posed solution exhibited a 99.99% improvement in SSE and a 99.66% improvement in time 335 when a hybrid model of GA and BP was applied to a dataset of Apple stocks. When the 336 Pepsi dataset was used, the approach had an SSE of 0.0121281374; traditional methods 337 without using GA had an SSE of 0.4790571631. That study yielded an SSE accuracy of 338 99.42% and a time reduction of 88.75%. Their method could be further improved by com- 339 bining it with other methods, such as SVM or decision tree (DT), or by expanding their 340 study to include a time-series analysis as was done in [16]. 341 In [19], it was reported that ANN is suitable for stock market prediction since it is a 342 popular way to identify unidentified and unseen patterns in data. That study was divided 343 into two modules; one module was for training and the other was for predicting the stock 344 price based on the previous training. A method was proposed to predict the share price 345 using a BP algorithm and an MLFF network. The dataset was obtained from ACI Pharma- 346 ceuticals. Using two input datasets and five input datasets, a difference was found be- 347 tween the anticipated and actual stock price. The error percentage between the predicted 348 price and the actual price decreased when the model had more training. When using five 349 input datasets, the highest error rate was 3.28% and the lowest was 0.12%. The method 350 used to achieve a more error free prediction was done by training the system with more 351 input datasets. Another method could also be conducted to yield better results; for exam- 352 ple, in [18], ANN was also used, but it was combined with GA to improve the results. 353 Another study [20] proposed an intelligent stock market forecasting system using the 354 ability of ANN and a fuzzy inference system. The goal was to notice the patterns in non- 355 linear and disordered systems. The dataset was from BEXIMCO Ltd. Using a model that 356 combined an NN and fuzzy logic, a total of nine inputs were used from the prediction 357 dataset to compare the actual price and the predicted price. The highest error rate of this 358 model was 4.8895% and the lowest was 0.3734%. Training the model with more data, as 359 was done in [19], could improve its performance and generate a more error free predic- 360

tion. 361

The study conducted in [21] utilized SVM to forecast ISE prices in Turkey. The study 362 proposed a method for learning to predict stock price returns by considering a binary 363 classification problem (positive and negative). Positive return forecasts were represented 364

(21)

by a class label of +1 and negative predictions were represented by a class label of -1. 365 Learning the learned model involves the use of weighted SVM, RF classifi-ers, Relevance 366 Vector Machine (RVM) and Multiple Layer Perceptron (MPL). A three-layer feed forward 367 technique was implemented with 10 neurons as an input, a neural layer for each technical 368 parameter, and a neuron as an output layer to show the predicted result. To update the 369 weights, the tangent sigmoid was used as a transfer function in conjunction with a gradi- 370 ent descent algorithm. The error and output of the initial network were calculated during 371 adaptive gradient descent. In this way, a near-optimal learning rate for the local environ- 372 ment can be obtained. Additionally, a higher learning rate is guaranteed if stabilised learn- 373 ing occurs. The accuracy of the proposed method was reported to be 70%. In contrast to 374 other studies that utilised the SVM algorithm and produced high accuracy, this study 375

should improve its methods to achieve higher accuracy. 376

Another study [22] proposed a method that can forecast stocks from different mar- 377 kets and industries and predict the trend using ML algorithms, such as polynomial re- 378 gression and LR, in addition to learning techniques for predicting a time series using two 379 special types of NN recursions: spoken short-term memory and LSTM. The historical in- 380 formation contains data for the daily low, high, closing and opening prices and the vol- 381 ume of each stock. The dataset consisted of the five-year window of Alibaba, VinGroup, 382

Gambar

Table 1: Supervised Machine Learning Techniques Comparison
Table 3: Other Machine Learning Techniques Comparison.
Table 1: Research comparison & analysis.  773
Table 1 as shown below summarizes the main similarities and  dissimilarities  between  the  proposed  system  and  discussed  systems including users of the system
+7

Referensi

Dokumen terkait