BUET Central Library

The opportunity to work on the spam filter development project was a great learning experience. I thank Lutful Kabir, Director, IICT, BUET, for giving me this great learning opportunity in implementing a spam filter.

Introduction

Definition of Spam
Email Spam and its Costs

Costs of Email Spam
Characteristics of Email Spam

The Necessity of Spam Filtering
Types of Spam

Advertisement Spam
Financial Spam

Spam Filter
Impact of Personalized Spam Filter

Background and Present State of the Problem
Objective with Specific Aims and Possible Outcomes
Outline of Methodology/Experimental Design
Organization of this Report

Thus, a domain-specific and personalized email spam filter will be developed by creating two word dictionaries called SPAM and HAM, consisting of spam and legitimate words, respectively, to calculate the spam probabilities of the incoming mail [9].

Spam Filtering and Bayesian Theorem

Methods of Fighting Spam

Hiding the Email Address
Keywords
Blacklist
Whitelist
Email Interferometer

The keyword method unfortunately has some problems, as the topics of spam messages change over time. A regularly updated keyword list can handle this, but the slightest change in the spelling of a keyword defeats the match (e.g., 'softw@re' instead of 'software'). The keyword search was based on a known list of words and phrases, most of which were believed to occur only in spam.
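The weakness described above can be illustrated with a minimal sketch. The keyword list and the function name `keyword_hits` are hypothetical; the source does not give its actual list or implementation.

```python
import re

# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = {"software", "lottery", "viagra"}

def keyword_hits(text, keywords=SPAM_KEYWORDS):
    """Return the listed keywords that occur as whole words in text."""
    # '@' is kept inside words so that obfuscated spellings stay in one piece.
    words = set(re.findall(r"[a-z0-9@]+", text.lower()))
    return sorted(words & set(keywords))

print(keyword_hits("Cheap software for sale"))   # ['software']
print(keyword_hits("Cheap softw@re for sale"))   # []
```

The second call returns no hits: a single substituted character is enough to slip past an exact-match keyword list.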

The blacklist method can detect spam based on its origin rather than its content. Unfortunately, new spam hosts can appear at any moment, and the time needed to propagate blacklist updates can be a significant weakness. If a legitimate user is accidentally blacklisted, there is no way to get off the blacklist, and all mail from the blacklisted part of the network is therefore rejected.

An address-level blacklist blocks an email when its sender address matches an address on the blacklist defined by the user.
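A minimal sketch of address-level matching follows. The function names, the case-insensitive comparison, and the choice to consult the whitelist before the blacklist are assumptions made for illustration, not details confirmed by the source.

```python
def normalize(address):
    """Lower-case and trim an address so comparisons ignore case and spaces."""
    return address.strip().lower()

def address_verdict(sender, whitelist, blacklist):
    """Return 'accept', 'reject', or 'unknown' for a sender address.

    Assumption: the whitelist is checked first, then the blacklist.
    """
    sender = normalize(sender)
    if sender in {normalize(a) for a in whitelist}:
        return "accept"
    if sender in {normalize(a) for a in blacklist}:
        return "reject"
    return "unknown"

whitelist = ["friend@example.com"]
blacklist = ["offers@example.net"]
print(address_verdict("Offers@Example.net", whitelist, blacklist))  # reject
```

Mail from an address on neither list would then be passed on to a content-based check.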

Types of Spam Filters

Content-based Spam Filters
Whitelist/Blacklist Filters
Challenge/Response Filters
Community Filters
Bayesian Spam Filters

You can easily tailor the filtering to the exact type of spam message you are dealing with and, just as importantly, keep it from triggering on words or phrases that you use every day at work or with friends. As spammers resort to new tricks to evade the filters, or new products are advertised, additional filters must be created to handle them. Whitelist filters will not accept email from any address unless it is on a list of known 'good' email addresses.

While this may be true, it is our opinion that it is a valid measure, provided that the challenge is not sent as a matter of course, but only after a message has been analyzed and judged to be questionable. This information is sent to the central server, where a 'fingerprint' of the message is added to the database. The frequency of the token in the spam messages on which the filter was trained.

The frequency of the token in the good messages on which the filter was trained. There is a possibility of fooling the filter by diluting the spam message with enough apparently innocent words.

Bayesian Filter and Bayes' Theorem

Statement of Bayes' Theorem
Application of Bayes' Probability Theorem
Frequency versus Probability Format
Set Enumeration Method of Bayes' Theorem
Bayesian Spam Filter and Set Enumeration Methods

It can be thought of as a way of understanding how the probability that a theory is true is affected by new evidence. It has been used in a wide variety of contexts, from marine biology to the development of 'Bayesian' spam blockers for email systems. In the philosophy of science, it has been used to try to clarify the relationship between theory and evidence.

Many insights in the philosophy of science involving confirmation, falsification, the relationship between science and pseudoscience, and other topics can be made more precise, and sometimes extended or corrected, by using Bayes' theorem [17]. The counting method can still be considered the most basic and intuitive method for calculating a conditional probability. Therefore, if the probability of the 1st of two subsequent events is a/N and the probability of both together is P/N, then the probability of the 2nd, assuming that the 1st has happened, is P/a [19].
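Written out with the same symbols, the counting statement above is the definition of conditional probability. The numbers in the example that follows are illustrative only and are not taken from the source.

```latex
P(\text{2nd} \mid \text{1st})
  = \frac{P(\text{1st and 2nd})}{P(\text{1st})}
  = \frac{P/N}{a/N}
  = \frac{P}{a}
```

For instance, if N = 100 mails are examined, a = 40 of them contain a given word, and P = 30 of them both contain the word and are spam, then the probability that a mail is spam given that it contains the word is 30/40 = 0.75.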

Using the set counting method mentioned above, a Bayesian spam filter can be designed as a custom spam filter. Under this method, the number of times a specific word appears in spam, divided by the total number of times it appears overall (in spam and ham emails), is the spam probability of that word.
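The counting rule above can be sketched in a few lines. The function name and the sample counts are hypothetical; returning `None` for a word that has never been seen is an assumption of this sketch.

```python
from collections import Counter

def word_spam_probability(word, spam_counts, ham_counts):
    """Spam occurrences of a word divided by its total occurrences."""
    in_spam = spam_counts.get(word, 0)
    in_ham = ham_counts.get(word, 0)
    total = in_spam + in_ham
    if total == 0:
        return None  # assumption: no estimate for a word never seen before
    return in_spam / total

# Illustrative counts, not data from the source.
spam_counts = Counter({"offer": 30, "meeting": 1})
ham_counts = Counter({"offer": 10, "meeting": 19})

print(word_spam_probability("offer", spam_counts, ham_counts))    # 0.75
print(word_spam_probability("meeting", spam_counts, ham_counts))  # 0.05
```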

Personalized Email Spam Filter using Bayesian Filter

Why is Bayesian Filtering better?

Working Principle of Bayesian Filter

Algorithm of the Bayesian Filter

Loading the Email
Tokenization
Calculation
Learning Process
Training and Different Training Types
Testing

The Algorithm of the Personalized Email Spam Filter

Creating the Training HAM Database
Creating the Training SPAM Database
Creating a Probability Word Database for the Filter
Creating the HAM Database
Creating the SPAM Database
How the Actual Filtering is done

Making the Filter Personalized

Making the Token Dictionary
Calculating the Spamming Probability
Training

Bayesian spam filters work by analyzing the content of an email and then calculating the probability that it is spam. The Bayes filter works with the individual small parts of the text, the so-called tokens. There is always the opportunity to review and correct the filter's results so that it performs better in the future.

This HAM training database is created from user-selected outgoing mail and from known legitimate mails in the inbox. The SPAM training database likewise relies on known spam mails in the user's inbox. From these words, the custom filter calculates the probability that a new message is spam.

In the token dictionary, each token of the incoming mail gets COUNT_SPAM and COUNT_HAM values. When a new email is received, the token dictionary is searched for all the words included in the mail.
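The lookup can be sketched as follows. This is a minimal illustration under stated assumptions: the token dictionary is modeled as a plain Python dict mapping each word to a (COUNT_SPAM, COUNT_HAM) pair, each token's spammity is taken as its spam count divided by its total count, and words never seen before contribute zero to both totals. The function name `score_mail` is hypothetical, and how the two totals are combined into a final verdict is left outside the sketch.

```python
def score_mail(tokens, token_dictionary):
    """Return (total spammity, total legitimacy) over the tokens of one mail."""
    total_spam = 0.0
    total_ham = 0.0
    for token in tokens:
        count_spam, count_ham = token_dictionary.get(token, (0, 0))
        seen = count_spam + count_ham
        if seen == 0:
            continue  # new word: adds zero to both totals
        total_spam += count_spam / seen
        total_ham += count_ham / seen
    return total_spam, total_ham

# Illustrative counts: word -> (COUNT_SPAM, COUNT_HAM).
token_dictionary = {"prize": (3, 1), "meeting": (1, 3)}
print(score_mail(["prize", "zebra"], token_dictionary))  # (0.75, 0.25)
```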

Development of Personalized Email Spam Filter

Requirement Analysis

Creating a token dictionary to calculate the spam probability of each token of the incoming email whose spam probability is to be calculated. Starting the spam filter and continuously updating and maturing the SPAM and HAM databases, so that they ultimately become personalized for the individual user.

Software Design Phases

According to the flowchart shown in Fig 4.2, the message is tokenized after the email is received. The more accurate and reliable the tokenization, the more accurate the filter will be. Every word in the incoming mail whose length is greater than three and less than ten characters will be extracted.
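The length rule above can be sketched as a small tokenizer. Splitting on runs of ASCII letters and lower-casing the text are assumptions of this sketch; the source only states the length bounds.

```python
import re

def tokenize(body, min_len=3, max_len=10):
    """Keep words strictly longer than min_len and strictly shorter than max_len."""
    words = re.findall(r"[A-Za-z]+", body.lower())
    return [w for w in words if min_len < len(w) < max_len]

print(tokenize("Win a FREE prize now, congratulations!"))  # ['free', 'prize']
```

With the default bounds, only words of four to nine letters survive, so short function words and very long strings are dropped.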

Then, based on the tokenization, a probability value is generated for each token in the incoming mail. Next, the overall spammity and legitimacy probabilities are calculated over the words of each incoming mail. If the email is spam, each word found in the email either has its spam count updated or is inserted into the word list.

Fig 4.2: Design Phases of Personalized Spam Filter

Process Flow

[Figure: process flow. Legible stage labels: 1) the whitelist, 2) the blacklist. The label of stage 3 is not recoverable.]

Database Design

Each incoming email and the associated words of each email are stored in the dictionary table in_mail and in_mail_ respectively. The tokenized words from the in_mail table are stored here based on mail ID. The tokenized words of the outgoing mail table are stored here based on mail ID.

The probability calculation is performed on this table. MAIL_ID: unique ID for each email. WORD: each token of an incoming email, identified by its mail ID, is treated as a word. The individual words of the email being evaluated are taken from the dictionary table.

This is the table that stores all the spam words previously found in incoming mail. This is the table that stores all the legitimate words previously found in incoming mail, as well as those taken directly from the content of outgoing mail.
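The tables described above might be laid out as in the following sketch. It uses SQLite purely for illustration; the source does not name its database engine here. The table and column names in_mail, token_dictionary, spam, ham, MAIL_ID, WORD, COUNT_SPAM and COUNT_HAM appear in the text, while the SUBJECT and BODY columns, the column types, and the assignment of count columns to tables are assumptions.

```python
import sqlite3

# Hypothetical layout; only the table and column names listed above
# come from the text, everything else is assumed.
SCHEMA = """
CREATE TABLE in_mail (
    MAIL_ID INTEGER PRIMARY KEY,
    SUBJECT TEXT,
    BODY    TEXT
);
CREATE TABLE token_dictionary (
    MAIL_ID    INTEGER REFERENCES in_mail(MAIL_ID),
    WORD       TEXT,
    COUNT_SPAM INTEGER DEFAULT 0,
    COUNT_HAM  INTEGER DEFAULT 0
);
CREATE TABLE spam (WORD TEXT PRIMARY KEY, COUNT_SPAM INTEGER NOT NULL);
CREATE TABLE ham  (WORD TEXT PRIMARY KEY, COUNT_HAM  INTEGER NOT NULL);
"""

def create_database(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = create_database()
conn.execute("INSERT INTO spam (WORD, COUNT_SPAM) VALUES (?, ?)", ("prize", 1))
```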

Technology Used

The number of times this word was previously found in mails that were evaluated and finally classified as spam. When the MUA (mail user agent) is a program installed on the user's system, it is called an email client (such as Mozilla Thunderbird, Microsoft Outlook, Eudora Mail, Incredimail, or Lotus Notes). When it is a web interface used to interact with the incoming mail server, it is called webmail.

So, using the above concepts and email technology, we developed our custom spam filter. The application uses the POP3 (Post Office Protocol 3) protocol for receiving incoming mail. It is a standardized way for users to access mailboxes and transfer messages to their computers.

When using the POP protocol, all e-mail messages will be transferred from the mail server to the local computer.
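A minimal sketch of such a transfer, using Python's standard `poplib` and `email` modules, is shown below. The function names, the use of an SSL connection on port 995, and the host and credentials in the usage note are assumptions for illustration; the source does not describe its connection settings.

```python
import poplib
from email import message_from_bytes
from email.policy import default

def parse_message(raw_lines):
    """Join the byte lines returned by a POP3 RETR command and parse them."""
    return message_from_bytes(b"\r\n".join(raw_lines), policy=default)

def fetch_all(host, user, password, port=995):
    """Download every message in the mailbox (assumes POP3 over SSL)."""
    conn = poplib.POP3_SSL(host, port)
    try:
        conn.user(user)
        conn.pass_(password)
        count, _size = conn.stat()          # number of messages, mailbox size
        messages = []
        for i in range(1, count + 1):       # POP3 message numbers start at 1
            _resp, lines, _octets = conn.retr(i)
            messages.append(parse_message(lines))
        return messages
    finally:
        conn.quit()
```

A call such as `fetch_all("pop.example.com", "user", "secret")` would return parsed messages whose headers can be read with `msg["From"]` and `msg["Subject"]`; the host name and credentials here are placeholders.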

Implementation of the Developed Architecture

The number of times the word is found in out_mail_wordlist. This is the legitimacy calculation for matched words in the token_dictionary and ham tables. New words are not found in the spam or ham table, so their spammity or legitimacy value is zero. After the result, if the message is spam, the COUNT_SPAM values of all matched words in the spam table are incremented by 1.

All new words are inserted into the spam table with the COUNT_SPAM value initialized to 1. The spam table is thus updated with the words of mail that was finally classified as spam. After the result, if the message is ham, the COUNT_HAM values of all matched words in the ham table are incremented by 1.

All new words are inserted into the ham table with the COUNT_HAM value initialized to 1. Similarly, the ham table is updated with the words of mail that was finally classified as legitimate.
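The update step described above can be sketched with two in-memory dictionaries standing in for the spam and ham tables. The function name `learn` and the choice to count each distinct word once per mail are assumptions of this sketch.

```python
def learn(tokens, is_spam, spam_table, ham_table):
    """Update the table matching the final verdict with the mail's words.

    Words already in the table have their count incremented by 1;
    new words are inserted with a count of 1.
    """
    table = spam_table if is_spam else ham_table
    for word in set(tokens):  # assumption: one update per distinct word
        table[word] = table.get(word, 0) + 1

spam_table = {"prize": 2}
ham_table = {}
learn(["prize", "free", "free"], True, spam_table, ham_table)
print(sorted(spam_table.items()))  # [('free', 1), ('prize', 3)]
```

Only the table that matches the verdict changes, so a misclassified mail trains the wrong table until the user corrects it.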

Results and Discussions

Login Screen
Menu Bar
Spammacity Calculation
Delete the Mail from Mailbox
Spam Mail Body Submission
Outgoing Mail Body Submission
Blacklist and Whitelist Filter
Conclusion
Future Works

As shown in Fig 5.2, id indicates the unique email ID of each email in the mailbox, subject indicates the email subject, and spam probability gives the calculated spam probability. The Show calculation button gives the calculation details for the respective mail, and the action column lets the user delete the mail by pressing delete. Thus, the spammity and legitimacy probabilities of all tokens are taken, and the final result is calculated by dividing the total spammity probability by the total legitimacy probability.

The ham database was updated with all the tokens of the mail, so this false negative would have a significant effect on later mails that contain the same tokens. Finally, Figure 5.7 shows the whitelist and blacklist filters, which list all addresses from which incoming email has been received so far. The success of the personalized spam filtering developed in this research work depends on the tokenization techniques and on the effective creation of the proposed SPAM and HAM databases.

The more accurate and efficient the tokenization, the more accurate the filtering. Another obstacle to the development of personalized spam filtering is the creation of a SPAM and HAM learning base.