The following section presents several definitions of terms which are commonly used in data classification. A sample data set is presented in figure 3.2 in order to assist with the explanations.
3.3.1 Instance
An instance represents an entity described by one or more attributes [26]. Instances are typically represented visually as rows within a data set, and an example of an instance is presented by the bold rectangle in figure 3.2. For the instance shown in the figure, the value for outlook is “Sunny”, the value for temperature is “Hot”, the value for humidity is “High”, and the value for windy is “True”; the output for this instance, being whether or not to play, is “No”. An instance contains a value for each attribute, and a value for the class.
Figure 3.2: A sample data set. The weather data set, adapted from [5].
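As a minimal sketch, an instance can be thought of as a record with one field per attribute plus one field for the class. The type name `WeatherInstance` and the field names below are illustrative assumptions; only the values come from the highlighted instance described above.

```python
from typing import NamedTuple


class WeatherInstance(NamedTuple):
    """One row of the weather data set: four attribute values plus the class."""
    outlook: str
    temperature: str
    humidity: str
    windy: str
    play: str  # the class value


# The instance highlighted in figure 3.2, using the values quoted in the text.
highlighted = WeatherInstance("Sunny", "Hot", "High", "True", "No")

attributes = highlighted[:-1]  # every field except the class
label = highlighted.play       # the class value

print(attributes)
print(label)
```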
3.3.2 Attribute
An attribute, or feature, can be described as a variable which represents a particular characteristic of a data set [10, 26]. An attribute can be either numerical or categorical. Numerical attributes can be categorised as being either discrete or continuous. Discrete attributes usually consist of integers, whereas continuous attributes consist of real-valued numbers. Categorical attributes, on the other hand, are made up of a finite set of values which “puts objects into categories, e.g. the name or colour of an object” [10]. A categorical attribute could be made up of integer values; however, these values should have no mathematical significance [10] and should merely be labels.
For instance, height, age, and gender are possible examples of attributes, and these attributes have values of different types: continuous values, integer values, and categorical values respectively. Since the height attribute is continuous, and assuming the values were measured to two decimal places, typical values could include 150.33, 185.65, or 191.22. Since gender is a categorical attribute, its values are labels drawn from a finite set, in this example male or female. Attributes are visually represented as columns within a data set, as seen in figure 3.2. This data set has four attributes, namely Outlook, Temperature, Humidity, and Windy.
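The distinction between numerical and categorical attributes can be sketched as follows. The heights are the example values quoted above; the colour values and the names `code_of` and `value_of` are hypothetical, chosen only to illustrate that integer codes for a categorical attribute act as labels.

```python
from statistics import mean

# Heights are the example values from the text; colours are hypothetical.
heights = [150.33, 185.65, 191.22]         # continuous (real-valued)
colours = ["red", "green", "blue", "red"]  # categorical

# Arithmetic is meaningful for a numerical attribute.
print(round(mean(heights), 2))

# A categorical attribute may be stored as integer codes, but the codes are
# only labels: their order and magnitude carry no mathematical meaning.
code_of = {value: code for code, value in enumerate(sorted(set(colours)))}
value_of = {code: value for value, code in code_of.items()}

encoded = [code_of[c] for c in colours]
decoded = [value_of[c] for c in encoded]

print(encoded)
print(decoded)
```

Any other assignment of codes to colours would serve equally well, which is why averaging or ordering the codes would not be meaningful.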
3.3.3 Class
A class is the target output for each instance of data. In data classification, the goal of a machine learning algorithm is to predict the class through the use of the attributes. The class should be thought of as an output value, whereby a machine learning algorithm uses all the attributes as input. From the sample data set in figure 3.2, the classes are “Yes” and “No”.
An instance of data typically contains a single class; however, this may not always be the case. When more than one class is present, these instances are referred to as multi-labelled instances [5]. In this case, an instance can belong to more than one class.
3.3.4 Data set
A data set refers to the complete set of instances of data [10]. A data set consists of a number of instances, each of which is in turn built up of several attributes. Each instance has a class value, and thus a data set has several classes. A binary data set is one with only two classes, and a classification task on such a data set is referred to as binary classification. When a data set contains more than two classes, the machine learning task is referred to as multiclass classification. In figure 3.2 the data set is represented by the dotted lines, i.e. the entire table of data is the data set. For this data set, the task is to determine whether or not to play some game based on the weather conditions. There are two classes, no and yes.
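A minimal sketch of how the type of classification task follows from the number of distinct class values, treating two classes as binary and more than two as multiclass. The function name `task_type` is an illustrative assumption; the class values are those listed for the weather data set in table 3.1.

```python
def task_type(labels):
    """Name the classification task implied by the distinct class values."""
    n_classes = len(set(labels))
    if n_classes < 2:
        raise ValueError("a classification task needs at least two classes")
    if n_classes == 2:
        return "binary classification"
    return "multiclass classification"


# Class column of the weather data set, as listed in table 3.1.
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes"]

print(task_type(play))
```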
Data sets play a vital role in the development of data classification algorithms and also in the comparison between different algorithms. There are several repositories available for researchers to use in order to test their algorithms. One of the most commonly used is the UCI machine learning repository [27]. To date, the repository contains over 200 data sets for classification.
It is common practice that when a new algorithm is developed, researchers will compare the performance of the algorithm on several data sets in the UCI machine learning repository [10]. However, if a new classifier performs well on certain data sets, this does not necessarily imply that the algorithm is good [10], and thus a researcher should take care to test the performance over several data sets.
When a class within a data set is known, applying data mining techniques to such a data set is known as supervised learning. Conversely, applying data mining techniques to a data set without a known class is referred to as unsupervised learning.
3.3.5 Class balance
The class balance of a data set refers to the number of instances within each class and the ratio between them [28]. If there is approximately an equal number of instances in each class, then the data set is said to be well-balanced. An imbalanced data set is one whereby one or more classes have a number of instances significantly smaller than the number of instances in the other classes. Assume, for instance, that for a particular data set, 90% of the instances specify class A as the output, and 10% specify class B; this data set is then considered to be imbalanced. Imbalanced data sets increase the complexity of the data classification process because more data is available for one class, and thus learning from such a data set will result in a bias towards the majority class [29].
From figure 3.2, in the class column, one can see that there are 3 no labels (approximately 43% of the total data) and 4 yes labels (approximately 57% of the total data); thus it can be said that this data set is well-balanced.
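The class proportions quoted above can be checked with a short sketch. The function name `class_balance` and the use of the largest-to-smallest ratio as a summary are illustrative assumptions; the text does not fix a threshold at which a data set stops being well-balanced, so none is applied here.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the instances, as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: count / total for cls, count in counts.items()}


# Class column of the weather data set (3 "No" labels and 4 "Yes" labels).
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes"]

shares = class_balance(play)
for cls, share in sorted(shares.items()):
    print(f"{cls}: {share:.0%}")

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = max(shares.values()) / min(shares.values())
print(round(imbalance_ratio, 2))
```

Applied to the 90%/10% example, the same ratio is 9, compared with roughly 1.33 for the weather data set.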
3.3.6 Classifier
A classifier is an algorithm that can discriminate between several classes [26]. Classifiers are created in the hopes that they are able to correctly allocate class labels to as many instances of data as possible. Let $x_i$ be a vector representing the value of each attribute for the $i$th instance of data, thus $x_i = \{x_{i,0}, x_{i,1}, \ldots, x_{i,n}\}$, where $n$ represents the total number of attributes within a data set, and $x_{i,j}$ represents the value of instance $i$ for attribute $j$. Let $w_i$ correspond to the class value for the $i$th instance of data.
The goal of a machine learning algorithm is to create some function $\phi$ whereby $\phi(x_i) = w_i$ [26]. The function $\phi$ can be represented in various ways, for example through the use of mathematical mappings, decisions, or logical rules. An example of a simplistic classifier which is based on the data set in figure 3.2 would be:
IF outlook = rainy THEN don’t play
ELSE play
The output from the simplistic classifier in comparison to the data set is illustrated in table 3.1. An instance is correctly classified when the class value matches the output of the classifier. The simplistic classifier is able to correctly classify 3 instances: the third, the sixth, and the seventh. The ultimate goal is thus to create a classifier which is able to correctly classify all of the instances of data within the data set.
Instance    Class (Play)    Simplistic classifier output
1           No              Yes
2           No              Yes
3           Yes             Yes
4           Yes             No
5           Yes             No
6           No              No
7           Yes             Yes
Table 3.1: Class output from figure 3.2, and the output for the simplistic classifier.
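The comparison in table 3.1 can be reproduced with a small sketch. The function name `simplistic_classifier` and the "Yes"/"No" encoding of play and don't play are assumptions made for illustration. The two lists are copied from table 3.1 rather than recomputed, since the full attribute values of figure 3.2 are not reproduced here.

```python
def simplistic_classifier(outlook):
    """The rule from the text: do not play when the outlook is rainy."""
    return "No" if outlook.lower() == "rainy" else "Yes"


# Class values and classifier outputs copied from table 3.1.
actual = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes"]
predicted = ["Yes", "Yes", "Yes", "No", "No", "No", "Yes"]

# Instance numbers (1-based) where the classifier output matches the class.
correct = [
    i for i, (a, p) in enumerate(zip(actual, predicted), start=1) if a == p
]
accuracy = len(correct) / len(actual)

print(correct)
print(f"{len(correct)} of {len(actual)} instances correctly classified")

# The highlighted instance in figure 3.2 has outlook "Sunny" and class "No",
# so the rule misclassifies it.
print(simplistic_classifier("Sunny"))
```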