Personalized Email Spam Filter using Bayesian Filter
3.3 The Algorithm of the Personalized Email Spam Filter
Using the above algorithm for the Bayesian spam filter, a spam filter can be personalized for a particular user or, ultimately, for a particular organization.
The filter is given an initial training and then starts to work. As time goes on, the filter is trained further and is enhanced with the mail behavior of the particular user or organization, using the so-called SPAM and HAM mails [24]. So, based on the Bayesian spam filter concept and algorithm, the filter becomes more personalized day by day. The algorithm of the personalized spam filter can be given as:
3.3.1 Creating the Training HAM Database
Before mail can be filtered using the personalized spam filter, a training HAM database has to be created. This training HAM database is built from the user's selected outgoing mail and from the known legal mails in the inbox. Based on this collection of legal words, the personalized spam filter has a baseline from which to start filtering and has the initial capability to recognize the user's most frequently used legal words. During the spammacity and legitimacy calculation, these legitimacy values therefore influence the result.
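The construction of the training HAM database can be sketched as a simple word count. This is a minimal illustration, not the authors' implementation: the function names `tokenize` and `build_ham_database`, the tokenization rule, and the sample mails are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split mail text into lowercase word tokens (an assumed, simplified rule)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def build_ham_database(outgoing_mails, known_legal_inbox_mails):
    """Count how often each word occurs in the user's legitimate mail.

    Both sources named in the text are used: the selected outgoing mail
    and the known legal mails of the inbox.
    """
    ham_counts = Counter()
    for mail in list(outgoing_mails) + list(known_legal_inbox_mails):
        ham_counts.update(tokenize(mail))
    return ham_counts


# Hypothetical sample data for illustration only.
ham_db = build_ham_database(
    ["Meeting moved to Friday, please confirm"],
    ["Please review the attached project report before Friday"],
)
```

Words that the user writes often, such as "friday" and "please" here, receive higher counts and therefore carry more weight in the later legitimacy calculation.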
3.3.2 Creating the Training SPAM Database
Besides ham mail, the personalized filter also relies on a spam data file. This spam data file must include a large sample of known spam words, which are recognized as spam by the user, and must be constantly updated with each incoming mail. The spam data file also relies on the known spam mails in the user's inbox. This ensures that the personalized filter is aware of the personal mail behavior of a particular user and becomes more and more personalized day by day. Thus the training SPAM database is created. The SPAM database is updated with the incoming spam mail and can also be updated with well-known spam words downloaded from the Internet and with universal spam words. Thus a two-way adaptation is performed to create and update the spam database.
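The two-way adaptation described above could look roughly like the following sketch, which combines word counts from the user's own known spam mails with an external list of spam words. The name `build_spam_database`, the `seed_count` weight given to each external word, and the sample data are assumptions for illustration; the text does not specify how external words are weighted.

```python
import re
from collections import Counter


def tokenize(text):
    """Split mail text into lowercase word tokens (an assumed, simplified rule)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def build_spam_database(known_spam_mails, universal_spam_words=(), seed_count=1):
    """Build spam word counts from two sources.

    1. The user's own known spam mails (personal adaptation).
    2. A list of well-known spam words obtained elsewhere; each such word
       is credited with `seed_count` occurrences (an assumed weighting).
    """
    spam_counts = Counter()
    for mail in known_spam_mails:
        spam_counts.update(tokenize(mail))
    for word in universal_spam_words:
        spam_counts[word.lower()] += seed_count
    return spam_counts


# Hypothetical sample data for illustration only.
spam_db = build_spam_database(
    ["Win a FREE prize now", "free money"],
    universal_spam_words=["lottery", "free"],
)
```

Keeping the external weight small lets the user's own spam mail dominate the counts, which matches the stated goal of personalization.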
3.3.3 Creating a Probability Word Database for the Filter
A probability value is then assigned to each word or token. The probability is based on calculations that take into account how often that word occurs in spam as opposed to all previously stored mails. This is done by analyzing the user's incoming mail, the known spam words, the ham words taken from previously considered ham mails, and the contents of the selected outgoing mail. For each word, both the spammacity and the legitimacy are found. Thus the word probability database is created: all the words and tokens in both pools of mail are analyzed to generate the probability that a particular word points to the mail being spam.
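As a rough illustration of this step, the sketch below derives a spammacity and a legitimacy value for every word from the two count databases. It assumes a plain relative-frequency estimate (occurrences in spam divided by occurrences in all stored mail), which is only one possible reading of the calculation described above; the function name and the sample counts are hypothetical.

```python
def build_probability_database(ham_counts, spam_counts):
    """Assign each word a spammacity and a legitimacy value.

    Assumed estimate: spammacity = spam occurrences / all occurrences,
    legitimacy = ham occurrences / all occurrences.
    """
    prob_db = {}
    for word in set(ham_counts) | set(spam_counts):
        s = spam_counts.get(word, 0)
        h = ham_counts.get(word, 0)
        total = s + h
        if total == 0:
            continue  # word never actually observed; nothing to estimate
        prob_db[word] = {"spammacity": s / total, "legitimacy": h / total}
    return prob_db


# Hypothetical counts for illustration only.
prob_db = build_probability_database(
    ham_counts={"meeting": 3, "free": 1},
    spam_counts={"free": 3, "prize": 2},
)
```

With these counts, "free" appears three times in spam and once in ham, so its spammacity is 0.75 and its legitimacy is 0.25. A practical filter would usually also normalize by the size of each pool and keep the values away from exactly 0 and 1.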
3.3.4 Creating the HAM Database
After the initial legal or ham database is created using the user's usual outbound mailbox and the legal mail list of the inbox, the filter is ready to use, provided that the initial spam database has also been created. The filter can then distinguish between legal and illegal mails, but only with respect to that particular user or organization. As each mail is presented to the filter, the HAM database becomes more and more mature, either through new words or through updates to previously stored words. Thus the HAM database is continuously self-updated whenever a new legal mail is presented to the filter. This HAM database is tailored only to this particular user or organization; others will have no genuine use for the database.
3.3.5 Creating the SPAM Database
After the initial SPAM database is created using known spam words and known spam mails, the filter is ready to find out which mail is spam, but only with respect to that particular user or organization. As each spam mail is presented, the spam database is updated either with new spam words or with updated values for the previously stored spam words. Thus the SPAM database is created and continuously self-updated.
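The continuous self-updating of the HAM and SPAM databases can be sketched as a single learning step that is applied to every classified mail. The class name `PersonalizedDatabases`, the method `learn`, and the tokenization rule are assumptions made for this example, not part of the described system.

```python
import re
from collections import Counter


class PersonalizedDatabases:
    """Holds one user's HAM and SPAM word counts and updates them per mail."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()

    def learn(self, mail_text, is_spam):
        """Add the words of one mail to the matching database.

        New words are inserted and previously stored words have their
        counts increased, so the database matures with every mail.
        """
        words = re.findall(r"[a-z0-9$'-]+", mail_text.lower())
        target = self.spam if is_spam else self.ham
        target.update(words)


# Hypothetical mails for illustration only.
db = PersonalizedDatabases()
db.learn("free prize", is_spam=True)
db.learn("free lunch meeting", is_spam=False)
db.learn("claim your free prize", is_spam=True)
```

Because each user or organization owns a separate instance, the counts reflect only that user's mail behavior.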
3.3.6 How the Actual Filtering Is Done
Once the ham and spam databases have been created, the word probabilities can be calculated and the filter is ready for use. When a new mail arrives, it is broken down into words, and the most relevant words, i.e., those that are most significant in identifying whether the mail is spam or not, are singled out. From these words, the personalized filter calculates the probability of the new message being spam. If the probability is greater than a threshold, 1.0, then the message is classified as spam [26].
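The filtering step can be sketched as follows, assuming the common naive Bayes combination of per-word spam probabilities. Several details are assumptions for the example rather than part of the described filter: relevance is measured as distance from the neutral value 0.5, at most `max_words` words are used, word probabilities are clamped to the range 0.01 to 0.99, and unknown words are ignored. The threshold is passed in explicitly; the value 0.9 in the usage lines is purely illustrative.

```python
import re


def spam_probability(mail_text, word_spammacity, max_words=15):
    """Combine the spammacity of the most relevant words of one mail."""
    tokens = set(re.findall(r"[a-z0-9$'-]+", mail_text.lower()))
    known = [word_spammacity[t] for t in tokens if t in word_spammacity]
    if not known:
        return 0.5  # no evidence either way (assumed neutral value)

    # Most relevant words: spammacity furthest from the neutral 0.5.
    known.sort(key=lambda p: abs(p - 0.5), reverse=True)
    # Clamp so that a single word can never force a product to zero.
    top = [min(max(p, 0.01), 0.99) for p in known[:max_words]]

    spam_score = 1.0
    ham_score = 1.0
    for p in top:
        spam_score *= p
        ham_score *= 1.0 - p
    return spam_score / (spam_score + ham_score)


def is_spam(mail_text, word_spammacity, threshold):
    """Classify a mail as spam if its probability exceeds the threshold."""
    return spam_probability(mail_text, word_spammacity) > threshold


# Hypothetical word probabilities and threshold for illustration only.
probs = {"free": 0.9, "prize": 0.95, "meeting": 0.05, "report": 0.1}
spam_verdict = is_spam("Claim your FREE prize", probs, threshold=0.9)
ham_verdict = is_spam("Meeting report attached", probs, threshold=0.9)
```

In this example the first mail contains two strongly spam-indicating words and is classified as spam, while the second contains only words typical of the user's legitimate mail and is not.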
Fig 3.1 below shows this process: the good and bad training messages are introduced initially and their spammacity is calculated, after which the filter is ready to use. From that stage on, the spam filter starts to work and produces the filtered messages.
[Figure: block diagram with the labels Good Messages, Bad Messages, Word Probabilities, Initial Filter, and Filtered Messages]
Fig 3.1: Personalized Spam Filter