ASSESSING DIFFERENCES IN DATA SETS

STATISTICAL METHODS

5.2 ASSESSING DIFFERENCES IN DATA SETS

For many data-mining tasks, it would be useful to learn the more general characteristics about the given data set, regarding both central tendency and data dispersion.

These simple parameters of data sets are obvious descriptors for assessing differences between different data sets. Typical measures of central tendency include mean, median, andmode, while measures of data dispersion includevarianceandstandard deviation.

The most common and effective numeric measure of the center of the data set is themeanvalue (also called the arithmetic mean). For the set ofnnumeric valuesx1, x2,…,xn, for the given featureX, the mean is

mean =1 n

n i= 1

x_i

and it is a built-in function (like all other descriptive statistical measures) in most mod- ern statistical software tools. For each numeric feature in then-dimensional set of sam- ples, it is possible to calculate the mean value as a central tendency characteristic for this feature. Sometimes, each valuexiin a set may be associated with a weightwi, which reflects the frequency of occurrence, significance, or importance attached to the value. In this case, the weighted arithmetic mean or the weighted average value is

mean =

n i= 1wixi

n i= 1wi

Although the mean is the most useful quantity that we use to describe a set of data, it is not the only one. For skewed data sets, a better measure of the center of data is the median. It is the middle value of the ordered set of feature values if the set consists of an odd number of elements, and it is the average of the middle two values if the number of elements in the set is even. Ifx1,x2,…,xnrepresents a data set of sizen, arranged in increasing order of magnitude, then the median is defined by

median =

x_n+ 1 2 ifnis odd

x_n 2+x_n2 + 1

2 ifnis even

Another measure of the central tendency of a data set is themode. The mode for the set of data is the value that occurs most frequently in the set. While mean and

168 STATISTICAL METHODS

median are characteristics of primarily numeric data sets, the mode may be applied also to categorical data, but it has to be interpreted carefully because the data are not ordered. It is possible for the greatest frequency to correspond to several different values in a data set. That results in more than one mode for a given data set. Therefore, we classify data sets as unimodal (with only one mode) and multimodal (with two or more modes). Multimodal data sets may be precisely defined as bimodal, trimodal, etc. For unimodal frequency curves that are moderately asymmetrical, we have the following useful empirical relation for numeric data sets:

mean–mode≤3 × mean–median

that may be used for an analysis of data set distribution and the estimation of one central tendency measure based on the other two.

As an example, let us analyze these three measures on the simple data setTthat has the following numeric values:

T= 3, 5, 2, 9, 0, 7, 3, 6 After a sorting process the same data set is given as

T= 0, 2, 3, 3, 5, 6, 7, 9

The corresponding descriptive statistical measures for central tendency are meanT= 0 + 2 + 3 + 3 + 5 + 6 + 7 + 9

8 = 4 375

medianT= 3 + 5 2 = 4 modeT= 3

The degree to which numeric data tend to spread is called dispersion of the data, and the most common measures of dispersion are thestandard deviationσ and the varianceσ². The variance ofnnumeric valuesx1,x2,…,xnis

σ²= 1 n−1

n i= 1

xi−mean ²

The standard deviationσis the square root of the varianceσ². The basic properties of the standard deviationσas a measure of spread are as follows:

1. σmeasures spread about themeanand should be used only when themeanis chosen as a measure of the center.

2. σ = 0 only when there is no spread in the data, i.e., when all measurements have the same value. Otherwiseσ > 0.

ASSESSING DIFFERENCES IN DATA SETS 169

For the data set given in our example,varianceσ²andstandard deviationσare σ²=1

i= 1

xi−4 375² σ²= 8 5532

σ= 2 9246

In many statistical software tools, a popularly used visualization tool of descriptive statistical measures for central tendency and dispersion is aboxplotthat is typi- cally determined by the mean value, variance, and sometimes max and min values of the data set. In our example, the minimal and maximal values in theTset are min_T= 0 and max_T= 9. Graphical representation of statistical descriptors for the data setThas the form of a boxplot, given in Figure 5.1.

Analyzing large data sets requires proper understanding of the data in advance.

This would help domain experts to influence the data-mining process and to properly evaluate the results of a data-mining application. Central tendency measures for a data set are valuable only for some specific distributions of data values. Therefore it is important to know characteristics of a distribution for a data set we are analyzing.

The distribution of values in a data set is described according to the spread of its values. Usually, this is best done using a histogram representation; an example is given in Figure 5.2. In addition to quantifying the distribution of values for each feature, it is also important to know the global character of the distributions and all spe- cifics. Knowing that data set has classic bell curve empowers researchers to use a broad range of traditional statistical techniques for data assessment. But in many prac- tical cases the distributions are skewed or multimodal, and traditional interpretation of concepts such as mean value or standard deviation does not have a sense.

Part of the assessment process is determining relations between features in a data set. Simple visualization through the scatter plots gives initial estimation of these relations. Figure 5.3 shows part of the integrated scatter plot where it compared each pair

Values 10

Max 2σ 5

Mean –2σ

0 Min

_________________________________

Data set T

Figure 5.1. A boxplot representation of the data set T based on mean value, variance, and min and max values.

170 STATISTICAL METHODS

0 0 100200300400

10 20 30

Weeks

40 50

Figure 5.2. Displaying single feature distribution.

0.80 0.90 1.00

0.40 0.50 0.60 0.70

0.10 0.20 0.30

–0.30 –0.20 –0.10 0.00

–0.60 –0.50 –0.40

–1.00 –0.90 –0.80 –0.70

Figure 5.3. Scatter plots showing the correlation between features from–1 to 1.

ASSESSING DIFFERENCES IN DATA SETS 171

of features. This visualization technique is available in most integrated data-mining tools. Quantification of these relations is given through the correlation factor.

These visualization processes are part of data understanding phase, and they are essential in better preparation for data mining. This human interpretation helps to obtain a general view of the data. It is also possible to identify abnormal or interesting characteristics, such as anomalies.

Dalam dokumen BUKU DATA MINING (Concepts, Models, Methods, and Algorithms) (Halaman 189-193)