International Journal on Advanced Computer Theory and Engineering (IJACTE)
Bayesian Classification (Gender) Using Simulation Technique
Samar Ballav Bhoi¹ and Munesh Chandra Adhikary²
¹Directorate of Distance & Continuing Education (DDCE), F.M. University, Balasore
²P.G. Dept. of Applied Physics & Ballistics (APAB), F.M. University, Balasore
E-mail: [email protected], [email protected]
Abstract: Bayesian classification is a statistical classification technique that predicts the probability that a given sample belongs to a particular class. It is based on Bayes' theorem and shows good accuracy and speed when applied to large databases. Bayesian classification is both a supervised learning method and a statistical method for classification. It assumes an underlying probabilistic model, which allows uncertainty about the model to be captured in a principled way by determining the probabilities of the outcomes, and it can solve both diagnostic and predictive problems. The technique is named after Thomas Bayes (1702-1761), who proposed Bayes' theorem. Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be combined, and it offers a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypotheses and is robust to noise in the input data. A naive Bayes classifier is a simple probabilistic classifier that applies Bayes' theorem with strong (naive) independence assumptions; a more descriptive term for the underlying probability model would be "independent feature model".
An overview of statistical classifiers is given in the literature on pattern recognition. A Bayesian network (belief network) represents the dependencies between variables, providing a succinct specification of a joint probability distribution. The network is a directed graph in which nodes are random variables; directed links connect pairs of nodes, signifying that one node has a direct effect on another; each node has a conditional probability table quantifying the influence of its parent nodes on its value; and the graph contains no directed cycles. The links, representing direct conditional dependencies between nodes, and the probabilities attached to each link are typically established by subject matter experts (SMEs). Uncertainties can be attached to each node to make runs stochastic; a run is deterministic when a child node's values are derived exclusively from the inputs of the node's parent(s). A Bayesian network can reason from effects to causes (diagnostic inference), from causes to effects (causal inference), between causes of a common effect (intercausal inference), or by combining two or more of the above (mixed inference). One obstacle in producing a Bayesian network is the difficulty SMEs face in ascertaining all the nodes and directed links essential for an implementation in a particular domain. Finally, determining the probability weights for each link is often considered the most complex phase of creating and modifying a Bayesian network.
Data mining is an important area of A.I. in which the gender classification of data items can be drawn using a simulation process/technique. This paper gives a brief overview of data mining, introduces some of its fundamental techniques, and applies them to a real-world problem using a simulation technique. The topic is the classification of gender using Bayesian classification on a new data set of random numbers representing height, weight and foot size. Bayesian reasoning is applied to decision making and inferential statistics that deal with probability inference: knowledge of prior events is used to predict future events.
Keywords: Bayesian network, attributes, posterior probability, prior probability, data mining, extraction, simulation, classification, SMEs, KDD, data items, random number.
I. INTRODUCTION
From an academic viewpoint, a data warehouse, as defined by W. H. Inmon, is a subject-oriented, integrated, time-variant, non-volatile collection of operational data that supports management decision-making. It is a tool that manages data after and outside of the operational systems.
Data warehousing technology has evolved in business applications to support strategic decision making. It may be considered a key component of an organization's IT strategy and architecture. For example, an electricity billing company can, by analyzing data in a data warehouse, predict fraud and reduce the cost of such determinations; indeed, the technology has such potential that any company possessing proper analysis tools can benefit from it. A data warehouse thus supports Business Intelligence. Data warehousing and data mining, including through simulation techniques, are presently used in industries such as banking, airlines, hospitals, and investment and insurance (1). Data mining emphasizes three techniques:
a) Classification
b) Clustering
c) Association Rules
Data Mining vs. Knowledge Discovery in Databases (KDD)
Knowledge Discovery in Databases (KDD) is the process of finding useful information, knowledge and patterns in data, while data mining is the process of using algorithms to automatically extract the desired information and patterns within the KDD process. The steps of the KDD process are as follows:
(a) Extraction: Obtains data from various data sources.
(b) Preprocessing: Cleanses the data extracted in the previous step.
(c) Transformation: Converts the data into a common format by applying suitable techniques.
(d) Data Mining: Automatically extracts the information/patterns/knowledge.
(e) Interpretation/Evaluation: Presents the results obtained through data mining to the users, in easily understandable and meaningful format.
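For concreteness, the five KDD stages can be sketched in Python on a toy record set; all function names and the toy data below are illustrative assumptions, not part of any particular tool.

def extract(sources):
    # (a) Extraction: gather records from the given data sources.
    return [record for source in sources for record in source]

def preprocess(records):
    # (b) Preprocessing: drop records with missing height values.
    return [r for r in records if r.get("height") is not None]

def transform(records):
    # (c) Transformation: convert every height to a common format (float, cm).
    return [{**r, "height": float(r["height"])} for r in records]

def mine(records):
    # (d) Data mining: extract a trivial pattern, the average height.
    return sum(r["height"] for r in records) / len(records)

def interpret(pattern):
    # (e) Interpretation: present the result in a readable format.
    print(f"Average height: {pattern:.1f} cm")

interpret(mine(transform(preprocess(extract([[{"height": 160}, {"height": 170}]])))))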
Fig. 1: The KDD process: Extraction (E) → Preprocessing (P) → Transformation (T) → Data Mining (DM) → Interpretation (I), carrying the initial data through target, preprocessed and transformed data to a model and knowledge patterns (KP).
Classification: The classification task maps data into predefined groups or classes. Given a database/dataset D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to one class; that is, it divides the database/dataset D into the classes specified in the set C. A few simple examples of classification:
• Teachers classify students’ marks data into a set of grades as A, B, C, D, or F.
• Classification of the height of a set of persons into the classes tall, medium or short
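As an illustration, the height example can be written as an explicit mapping f: D → C in Python; the class boundaries below are read off Table-1 and are assumptions made for this sketch.

def f(height_cm: float) -> str:
    # Map a tuple (reduced here to its height attribute) to one class in C.
    if height_cm <= 170:
        return "Short"
    elif height_cm <= 195:
        return "Medium"
    else:
        return "Tall"

D = [160, 185, 200]
print([f(t) for t in D])  # ['Short', 'Medium', 'Tall']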
Classification Approach: The basic approaches to classification are:
(a) Create a specific model by evaluating training data, that is, old data that has already been classified using domain experts' knowledge.
(b) Apply the model developed to new data.
Some of the most common techniques used for classification include decision trees and neural networks. Most of these techniques are based on finding distances or use statistical methods.
Classification Using Distance (K-Nearest Neighbors): The K-Nearest Neighbors (KNN) algorithm places items in the class to which they are "closest", so it must determine the distance between an item and a class. Classes are represented by a centroid (central value) and the individual points (2). Some basic points about this algorithm:
1. The training set includes classes along with other attributes (see the training data in Table-1 below).
2. The value of K defines the number of near items (items at the smallest distance on the attributes of concern) to be used from the given training data (training data is data that has already been classified). This is explained in point (2) of the following example.
3. A new item is placed in the class in which the largest number of its close items lie (see point (3) in the following example).
4. The value of K should satisfy

K ≤ √(number of training items)    ... (1)

However, to limit the size of the sample data, we have not followed this formula in our example.
Example: Consider the following data, which tells us the person’s class depending upon gender and height.
Table-1:
Name      Gender  Height (cm)  Class
Sumitra   F       160          Short
Ananda    M       200          Tall
Ranita    F       190          Medium
Radhika   F       188          Medium
Dally     F       170          Short
Arun      M       185          Medium
Shellina  F       160          Short
Arabinda  M       170          Short
Sachin    M       220          Tall
Manoj     M       210          Tall
Sulekha   F       180          Medium
Anil      M       195          Medium
Karisma   F       190          Medium
Sabita    F       180          Medium
Sipra     F       175          Medium
Questions:
(1) Classify the new tuple <Chandan, M, 160> using the training data.
(2) Classify the tuple <Manoj, M, 210> using the training data.
(3) Taking only the height attribute for the distance calculation and K = 5, the following are the five tuples nearest to <Chandan, M, 160> (using Manhattan distance on the height attribute).
Table-2:
Name      Gender  Height (cm)  Class
Sumitra   F       160          Short
Shellina  F       160          Short
Dally     F       170          Short
Arabinda  M       170          Short
Sipra     F       175          Medium
Answers:
1) The tuple <Chandan, M, 160> is classified as Short, since most of its five nearest neighbours belong to the Short class.
2) The tuple <Manoj, M, 210> is classified as Tall.
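A minimal Python sketch of this KNN procedure, using the Table-1 data with K = 5 and Manhattan distance on the height attribute only (ties among equal distances are broken by training-set order, an assumption of the sketch):

from collections import Counter

training = [
    ("Sumitra", 160, "Short"), ("Ananda", 200, "Tall"),
    ("Ranita", 190, "Medium"), ("Radhika", 188, "Medium"),
    ("Dally", 170, "Short"), ("Arun", 185, "Medium"),
    ("Shellina", 160, "Short"), ("Arabinda", 170, "Short"),
    ("Sachin", 220, "Tall"), ("Manoj", 210, "Tall"),
    ("Sulekha", 180, "Medium"), ("Anil", 195, "Medium"),
    ("Karisma", 190, "Medium"), ("Sabita", 180, "Medium"),
    ("Sipra", 175, "Medium"),
]

def knn_classify(height, k=5):
    # Sort the training tuples by Manhattan distance |h - height| on height.
    nearest = sorted(training, key=lambda t: abs(t[1] - height))[:k]
    # Majority vote among the classes of the K nearest tuples.
    return Counter(t[2] for t in nearest).most_common(1)[0][0]

print(knn_classify(160))  # Short
print(knn_classify(210))  # Tall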
Classifying the data using the simulation technique: Consider the following data, in which the sex attribute acts as the class. Example training set:
Table-3:
Sex     Height (feet)  Weight (lbs)  Foot size (inches)
Male    6.00 (6'0")    181           12
Male    5.92 (5'11")   192           11
Male    5.58 (5'7")    172           12
Male    5.92 (5'11")   166           10
Male    5.91 (5'11")   168           10
Female  5.00 (5'0")    102           6
Female  5.50 (5'6")    151           8
Female  5.42 (5'5")    129           7
Female  5.75 (5'9")    153           9
Female  5.44 (5'5")    128           8
The classifier created from the training set under a Gaussian distribution assumption (11) is given below (the variances shown are population variances computed from the training set):
Table-4:
Sex     Mean (height)  Variance (height)  Mean (weight)  Variance (weight)  Mean (foot size)  Variance (foot size)
Male    5.866          2.1504e-02         175.8          0.9216e+02         11.00             0.8e+00
Female  5.422          5.8416e-02         132.6          3.4504e+02         7.6               1.04e+00

Let us say we have equiprobable classes, so P(male) = P(female) = 0.5. This prior distribution might be based on our knowledge of frequencies in the larger population, or on the frequencies in the training set.
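A minimal sketch of how the Table-4 parameters can be reproduced from the Table-3 rows (using population variances, i.e. divisor n, which matches the values shown):

males   = [(6.00, 181, 12), (5.92, 192, 11), (5.58, 172, 12),
           (5.92, 166, 10), (5.91, 168, 10)]
females = [(5.00, 102, 6), (5.50, 151, 8), (5.42, 129, 7),
           (5.75, 153, 9), (5.44, 128, 8)]

def mean_var(xs):
    # Mean and population variance (divisor n) of a list of numbers.
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

for label, rows in (("male", males), ("female", females)):
    for i, attr in enumerate(("height", "weight", "foot size")):
        m, v = mean_var([r[i] for r in rows])
        print(f"{label} {attr}: mean={m:.4g} variance={v:.4g}")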
Testing: Below is a sample to be classified as male or female.
Table-5:
Sex     Height (feet)  Weight (lbs)  Foot size (inches)
sample  6              130           8
We wish to determine which posterior is greater: male or female. For the classification as male, the posterior is given by

posterior(male) = P(male) p(height | male) p(weight | male) p(foot size | male) / evidence    ... (2)

For the classification as female, the posterior is given by

posterior(female) = P(female) p(height | female) p(weight | female) p(foot size | female) / evidence    ... (3)

The evidence (also termed the normalizing constant) may be calculated as (14):

evidence = P(male) p(height | male) p(weight | male) p(foot size | male) + P(female) p(height | female) p(weight | female) p(foot size | female)    ... (4)

However, given the sample, the evidence is a constant and thus scales both posteriors equally. It therefore does not affect the classification and can be ignored. We now determine the probability distribution for the sex of the sample. Each class-conditional density is Gaussian, for example

p(height = 6 | male) = (1/√(2πσ²)) exp(−(6 − μ)²/(2σ²)) ≈ 1.79    ... (5)

where μ and σ² are the parameters of the normal distribution, previously determined from the training set (Table-4). Note that a value greater than 1 is acceptable here: it is a probability density rather than a probability, because height is a continuous variable. Evaluating all six densities with the Table-4 parameters gives the posterior numerators

posterior numerator (male) ≈ 6.8e-10,   posterior numerator (female) ≈ 3.6e-04    ... (6)

Since the posterior numerator is greater in the female case, we predict that the sample is female.
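The whole calculation can be checked with a short naive Bayes sketch; the parameters are taken from Table-4, and the evidence term is dropped since, as argued above, it scales both classes equally.

import math

def gaussian(x, mean, var):
    # Normal probability density with the given mean and variance.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

params = {  # (mean, variance) per attribute, from Table-4
    "male":   {"height": (5.866, 2.1504e-2), "weight": (175.8, 0.9216e+2),
               "foot":   (11.0, 0.8)},
    "female": {"height": (5.422, 5.8416e-2), "weight": (132.6, 3.4504e+2),
               "foot":   (7.6, 1.04)},
}
prior = {"male": 0.5, "female": 0.5}
sample = {"height": 6.0, "weight": 130.0, "foot": 8.0}

for sex in ("male", "female"):
    numerator = prior[sex]
    for attr, x in sample.items():
        numerator *= gaussian(x, *params[sex][attr])
    print(sex, numerator)  # the female numerator is larger, so predict female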
II. METHOD (SIMULATION)
Simulation (3) is the process of designing a model of a real system and conducting experiments with this model in order to understand the behavior of the system in operation. The mathematical steps of the Monte Carlo Method (MCM) are as follows (a minimal code sketch follows the list):
1. Set up a probability distribution for the variables to be analyzed.
2. Construct the cumulative probability distribution for each variable.
3. Generate random numbers within the range 00 to 99 (or a wider range).
4. Conduct the simulation experiment (4) by means of random sampling.
5. Repeat step 4 until the required number of simulation runs has been generated.
6. Design and implement a course of action and maintain control.
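A minimal sketch of steps 1-5, assuming the female height probabilities of Table-6 as the distribution to be sampled; repeated runs recover the input probabilities, which is the usual sanity check for a Monte Carlo setup.

import random

# Steps 1-2: probability and cumulative distribution for height (female, Table-6).
classes = [("5.0-5.5", 0.30), ("5.5-6.0", 0.20),
           ("6.0-6.5", 0.25), ("6.5-7.5", 0.25)]

def classify(rn):
    # Step 4: walk the cumulative distribution until rn falls in its range.
    cum = 0.0
    for label, p in classes:
        cum += p
        if rn < cum * 100:
            return label
    return classes[-1][0]

random.seed(0)
# Steps 3 and 5: generate random numbers in 00-99 and repeat the experiment.
runs = [classify(random.randint(0, 99)) for _ in range(1000)]
for label, _ in classes:
    print(label, runs.count(label) / len(runs))  # close to the Table-6 probabilities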
The data above (Table-3) is now used to classify gender on the different attributes. Probability distribution of sex over the height and weight ranges:
Table-6:
Height (feet)  Weight (lbs)  Probability (F)  Probability (M)
5.0-5.5        100-125       0.30             0.26
5.5-6.0        125-150       0.20             0.24
6.0-6.5        150-175       0.25             0.20
6.5-7.5        175-200       0.25             0.30
Using the random numbers 6, 5.3, 6.6, 5.2 and 5.9, we classify the gender. The ranges of random numbers are generated from the cumulative probabilities.
Table-7:
Height (feet)  Weight (lbs)  Probability  Cumulative prob.  Range of RN
5.0-5.5        100-125       0.30         0.30              00-02
5.5-6.0        125-150       0.20         0.50              03-04
6.0-6.5        150-175       0.25         0.75              05-07
6.5-7.5        175-200       0.25         1.00              08-09
Calculation of the gender class using the random numbers:
Table-8:
Random no. (height, feet)  Sex     Weight (lbs)  Foot size (inches)
6.0                        Male    181           12
5.3                        Female  129           7
6.6                        Male    192           11
5.2                        Female  129           7
5.9                        Male    168           10
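A minimal sketch of the Table-7 lookup mechanism: a random digit in 00-09 is mapped to its height/weight class through the stated ranges. (The random numbers 6, 5.3, 6.6, 5.2, 5.9 above are heights, so this sketch only illustrates the range lookup of steps 3-4, not Table-8 itself.)

import random

ranges = [((0, 2), "5.0-5.5 ft / 100-125 lbs"),
          ((3, 4), "5.5-6.0 ft / 125-150 lbs"),
          ((5, 7), "6.0-6.5 ft / 150-175 lbs"),
          ((8, 9), "6.5-7.5 ft / 175-200 lbs")]

def lookup(rn):
    # Return the class whose random-number range contains rn.
    for (lo, hi), label in ranges:
        if lo <= rn <= hi:
            return label

random.seed(1)
for _ in range(5):
    rn = random.randint(0, 9)    # step 3: draw a random number
    print(rn, "->", lookup(rn))  # step 4: the sampled class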
III. RESULTS: DISCUSSION OF THE ABOVE METHOD
After analyzing the test, the method classifies the given data set into the different gender attributes of the items properly, so it can also be applied to other attributes of human beings when classifying data. The results obtained about gender using the simulation technique agree approximately with the previous methods. The simulation technique (7) can thus be applied to data mining problems. Some properties of this approach are as follows:
• Classification, estimation, prediction
• Usable on large data sets
• Very easy to construct
• No complicated iterative parameter estimation is needed
• Often does surprisingly well
• May not be the best possible classifier
• Robust and fast; it can usually be relied upon in many applications
Applications (9):
1. Gene regulatory networks
2. Protein structure
3. Diagnosis of illness
4. Document classification
5. Image processing
6. Data fusion
7. Decision support systems
8. Gathering data for deep space exploration
9. Artificial intelligence
10. Prediction of weather
11. On a more familiar basis, Bayesian networks are used by the friendly Microsoft Office assistant to elicit better search results.
12. Another use of Bayesian networks arises in the credit industry, where an individual may be assigned a credit score based on age, salary, credit history, etc. This is fed to a Bayesian network that allows credit card companies to decide whether the person's credit score merits a favorable application.
IV. CONCLUSION
Using the simulation technique in the Bayesian classification (9) of data, we can approximate the results of the data mining methods discussed, since the classification of data items is based entirely on the concept of probability.
The advantages of Bayesian networks (18):
• They visually represent all the relationships between the variables.
• The dependence and independence between nodes are easy to recognize.
• They can handle incomplete data.
• They suit scenarios where it is not practical to measure all variables (costs, not enough sensors, etc.).
• They help to model noisy systems.
• They can be used for any system model, from all parameters known to no parameters known.
The limitations of Bayesian networks:
• All branches must be calculated in order to calculate the probability of any one branch.
• The quality of the network's results depends on the quality of the prior beliefs or model.
• Calculation can be NP-hard.
• Calculations of probabilities using Bayes' rule and marginalization can become complex and are often characterized by subtle wording, so care must be taken to calculate them properly.
REFERENCES
[1] J Han, M Kamber, 2001 Data Mining Concepts and Techniques, Morgan Kaufmann Publishers.
[2] A K Pujari, 2004 Data Mining, [3] Gordern, Simulation techniques
[4] Giuseppe Petrone, Giuliano Cammarata – InTech.’ Modelling and Simulation’
[5] Roger McHaney - BookBoon Understanding Computer Simulation,
[6] Christian P. Robert and George Casella, Springer 2004, Monte Carlo Statistical Methods" (second edition).
[7] Christian P. Robert, George Casella.
"Introducing Monte Carlo statistical Methods"
[8] Tools and Examples for Developing Simulation Algorithms-Hans Christian Öttinger, Swiss Federal Institute of Technology Zürich, Switzerland Springer.
[9] Andrew W. Moore Professor School of Computer Science Carnegie Mellon University, Naïve
Bayes Classifiers www.cs.cmu.edu/~awm [email protected] 412-268-7599
[10] William DuMouchel Shannon Laboratory, AT&T Labs –Research ,Bayesian Measurement of Associations in Adverse Drug Reaction Databases [email protected] DIMACS Tutorial on Statistical Surveillance Methods Rutgers University June 20, 2003 [11] CS/CNS/EE 155: Probabilistic Graphical Models
Problem Set 2 Handed out: 21 Oct 2009 Due: 4 Nov 2009
[12] Learning Bayesian Networks from Data: An Efficient Approach Based on Information Theory Jie Cheng Dept. of Computing Science University of Alberta Alberta, T6G 2H1 Email:
[email protected] David Bell, Weiru Liu Faculty of Informatics, University of Ulster, UK BT37 0QB Email: {w.liu, da.bell}@ulst.ac.uk [13] http://www.bayesia.com/en/products/
bayesialab/tutorial.php
[14] ISyE8843A, Brani Vidakovic Handout 17 1 Bayesian Networks
[15] Bayesian networks Chapter 14 Section 1 – 2 [16] Naive-Bayes Classification Algorithm Lab4-
NaiveBayes.pdf
[17] XindongWu · Vipin Kumar · J. Ross Quinlan · Joydeep Ghosh · Qiang Yang · Hiroshi Motoda · Geoffrey J. McLachlan · Angus Ng · Bing Liu · Philip S. Yu · Zhi-Hua Zhou · Michael Steinbach David J. Hand · Dan Steinberg Received: 9 July 2007 `Top 10 algorithms in data mining