STATISTICAL METHODS
5.2 ASSESSING DIFFERENCES IN DATA SETS
For many data-mining tasks, it would be useful to learn the more general character- istics about the given data set, regarding both central tendency and data dispersion.
These simple parameters of data sets are obvious descriptors for assessing differences between different data sets. Typical measures of central tendency include mean, median, andmode, while measures of data dispersion includevarianceandstandard deviation.
The most common and effective numeric measure of the center of the data set is themeanvalue (also called the arithmetic mean). For the set ofnnumeric valuesx1, x2,…,xn, for the given featureX, the mean is
mean =1 n
n i= 1
xi
and it is a built-in function (like all other descriptive statistical measures) in most mod- ern statistical software tools. For each numeric feature in then-dimensional set of sam- ples, it is possible to calculate the mean value as a central tendency characteristic for this feature. Sometimes, each valuexiin a set may be associated with a weightwi, which reflects the frequency of occurrence, significance, or importance attached to the value. In this case, the weighted arithmetic mean or the weighted average value is
mean =
n i= 1wixi
n i= 1wi
Although the mean is the most useful quantity that we use to describe a set of data, it is not the only one. For skewed data sets, a better measure of the center of data is the median. It is the middle value of the ordered set of feature values if the set consists of an odd number of elements, and it is the average of the middle two values if the num- ber of elements in the set is even. Ifx1,x2,…,xnrepresents a data set of sizen, arranged in increasing order of magnitude, then the median is defined by
median =
xn+ 1 2 ifnis odd
xn 2+xn2 + 1
2 ifnis even
Another measure of the central tendency of a data set is themode. The mode for the set of data is the value that occurs most frequently in the set. While mean and
168 STATISTICAL METHODS
median are characteristics of primarily numeric data sets, the mode may be applied also to categorical data, but it has to be interpreted carefully because the data are not ordered. It is possible for the greatest frequency to correspond to several different values in a data set. That results in more than one mode for a given data set. Therefore, we classify data sets as unimodal (with only one mode) and multimodal (with two or more modes). Multimodal data sets may be precisely defined as bimodal, trimodal, etc. For unimodal frequency curves that are moderately asymmetrical, we have the following useful empirical relation for numeric data sets:
mean–mode≤3 × mean–median
that may be used for an analysis of data set distribution and the estimation of one central tendency measure based on the other two.
As an example, let us analyze these three measures on the simple data setTthat has the following numeric values:
T= 3, 5, 2, 9, 0, 7, 3, 6 After a sorting process the same data set is given as
T= 0, 2, 3, 3, 5, 6, 7, 9
The corresponding descriptive statistical measures for central tendency are meanT= 0 + 2 + 3 + 3 + 5 + 6 + 7 + 9
8 = 4 375
medianT= 3 + 5 2 = 4 modeT= 3
The degree to which numeric data tend to spread is called dispersion of the data, and the most common measures of dispersion are thestandard deviationσ and the varianceσ2. The variance ofnnumeric valuesx1,x2,…,xnis
σ2= 1 n−1
n i= 1
xi−mean 2
The standard deviationσis the square root of the varianceσ2. The basic properties of the standard deviationσas a measure of spread are as follows:
1. σmeasures spread about themeanand should be used only when themeanis chosen as a measure of the center.
2. σ = 0 only when there is no spread in the data, i.e., when all measurements have the same value. Otherwiseσ > 0.
ASSESSING DIFFERENCES IN DATA SETS 169
For the data set given in our example,varianceσ2andstandard deviationσare σ2=1
8
8
i= 1
xi−4 3752 σ2= 8 5532
σ= 2 9246
In many statistical software tools, a popularly used visualization tool of descrip- tive statistical measures for central tendency and dispersion is aboxplotthat is typi- cally determined by the mean value, variance, and sometimes max and min values of the data set. In our example, the minimal and maximal values in theTset are minT= 0 and maxT= 9. Graphical representation of statistical descriptors for the data setThas the form of a boxplot, given in Figure 5.1.
Analyzing large data sets requires proper understanding of the data in advance.
This would help domain experts to influence the data-mining process and to properly evaluate the results of a data-mining application. Central tendency measures for a data set are valuable only for some specific distributions of data values. Therefore it is important to know characteristics of a distribution for a data set we are analyzing.
The distribution of values in a data set is described according to the spread of its values. Usually, this is best done using a histogram representation; an example is given in Figure 5.2. In addition to quantifying the distribution of values for each fea- ture, it is also important to know the global character of the distributions and all spe- cifics. Knowing that data set has classic bell curve empowers researchers to use a broad range of traditional statistical techniques for data assessment. But in many prac- tical cases the distributions are skewed or multimodal, and traditional interpretation of concepts such as mean value or standard deviation does not have a sense.
Part of the assessment process is determining relations between features in a data set. Simple visualization through the scatter plots gives initial estimation of these rela- tions. Figure 5.3 shows part of the integrated scatter plot where it compared each pair
Values 10
Max 2σ 5
Mean –2σ
0 Min
_________________________________
Data set T
Figure 5.1. A boxplot representation of the data set T based on mean value, variance, and min and max values.
170 STATISTICAL METHODS
0 0 100200300400
10 20 30
Weeks
40 50
Figure 5.2. Displaying single feature distribution.
0.80 0.90 1.00
0.40 0.50 0.60 0.70
0.10 0.20 0.30
–0.30 –0.20 –0.10 0.00
–0.60 –0.50 –0.40
–1.00 –0.90 –0.80 –0.70
Figure 5.3. Scatter plots showing the correlation between features from–1 to 1.
ASSESSING DIFFERENCES IN DATA SETS 171
of features. This visualization technique is available in most integrated data-mining tools. Quantification of these relations is given through the correlation factor.
These visualization processes are part of data understanding phase, and they are essential in better preparation for data mining. This human interpretation helps to obtain a general view of the data. It is also possible to identify abnormal or interesting characteristics, such as anomalies.