
A hierarchical clustering procedure is one which successively merges smaller clusters into larger ones (agglomerative), or divides larger clusters into smaller ones (divisive). This process may be represented by a tree-like structure called a dendrogram, which depicts the relationship between objects or clusters. The dendrogram shows how single objects and clusters are grouped together at each step and provides a measure of similarity between them. The similarity measure here is the Euclidean distance: if the distance between two clusters is small, they are close together and hence more similar; if the distance is large, the clusters are less similar. The Euclidean distance on the y-axis of the dendrogram is the distance between the singletons, and thereafter the distance between the centroids of clusters.

Figure 4.4 demonstrates the hierarchical clustering method by way of an example. Figure 4.4 (a) shows a set of 5 points of morning and afternoon averages of kb. The method starts by assuming each point is a cluster on its own. The clusters that are closest together are then merged to form a new cluster. The distance between clusters 3 and 4 is 0.04, so they are clearly closer to each other than to the other clusters, and hence they merge to form cluster 6. The distances between clusters 2 and 1 and between clusters 2 and 5 are 0.23 and 0.21, respectively; therefore, clusters 2 and 5 merge to form cluster 7. Cluster 7 then merges with cluster 1 to form cluster 8. At the last step, clusters 6 and 8 merge. For merging clusters, Ward linkage was used. There are also other linkage options, such as single, average and complete linkage; however, according to Tufféry (2011), the Ward linkage (Ward, 1963) is considered the most effective linkage method.

Figure 4.4: (a) Five points that will be clustered using the hierarchical method. Each point starts off as a cluster on its own. (b) Dendrogram showing how the clusters in (a) were merged. Clusters 3 and 4 and clusters 2 and 5 were merged at distances 0.04 and 0.21, respectively. The centroid of cluster 7 was merged with cluster 1 at a distance of 0.4. Lastly, the centroids of clusters 6 and 8 were merged at a distance of 1.4.
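The merge sequence described above can be reproduced in a few lines with SciPy. This is a sketch only: the five coordinate pairs below are invented stand-ins for the points in Figure 4.4 (a), not the actual kb averages, and SciPy labels clusters from 0 rather than from 1.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical (morning average, afternoon average) pairs for five points;
# values are illustrative only, not the thesis data.
points = np.array([
    [0.30, 0.55],  # point 1
    [0.50, 0.60],  # point 2
    [0.62, 0.40],  # point 3
    [0.65, 0.43],  # point 4
    [0.55, 0.75],  # point 5
])

# Agglomerative clustering with Ward linkage and Euclidean distance.
# Each row of Z records one merge: [cluster_i, cluster_j, distance, size].
Z = linkage(points, method="ward", metric="euclidean")
for i, (a, b, d, n) in enumerate(Z):
    print(f"step {i}: merge {int(a)} and {int(b)} at distance {d:.3f} "
          f"-> new cluster {len(points) + i} with {int(n)} points")
```

With these coordinates, the first merge joins the two closest singletons (points 3 and 4 in the figure's 1-based numbering), mirroring the first step of the worked example.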

To demonstrate the use of hierarchical clustering on the minute-resolution kb profiles, the method was applied to the kb Principal Components. Ward's linkage method was used with the Euclidean distance as the metric. According to equation 4.3, Ward's linkage minimizes the total within-cluster sum of squared errors (SSE) when merging two clusters. The Ward's distance between two clusters A and B, having centers a and b and frequencies nA and nB, is given by

d(A, B) = \frac{d(a, b)^2}{n_A^{-1} + n_B^{-1}}, \qquad (4.3)

where a and b are the centroids of clusters A and B, respectively. Once all of the objects are clustered, the dendrogram is produced. Cutting the dendrogram at a desired level results in a set of disjoint groups (or clusters). In the present study, however, the optimal number of clusters was not known a priori, so the level at which the dendrogram should be cut had to be decided using an appropriate method. The present work used the cluster sum of squares as a guide to finding the level at which the dendrogram should be cut to yield the optimal number of clusters.
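Cutting the tree into a chosen number of disjoint groups can be done with SciPy's `fcluster`. The sketch below uses three synthetic, well-separated blobs rather than the kb Principal Components; `criterion="maxclust"` is equivalent to cutting the dendrogram at the height that yields the requested number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: three well-separated blobs (not the thesis kb data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.1, size=(20, 2))
    for c in ([0, 0], [3, 0], [0, 3])
])

Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree so that at most 3 disjoint clusters remain;
# labels returned by fcluster run from 1 upwards.
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(set(labels)))
```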

Computing the cluster sum of squares for different clustering solutions can serve as a guide for choosing the optimal number of clusters. According to Tufféry (2011), the total sum of squares, I, of the cluster is the weighted mean of the squares of the distances of the individual points from the cluster center (or centroid), and is given by

I = \sum_{i \in I} p_i (x_i - \bar{x})^2, \qquad (4.4)

where x̄ is the mean of the xi and pi is the weight associated with observation i. In a similar manner, the sum of squares of a cluster is computed with respect to its own center:

I_j = \sum_{i \in I_j} p_i (x_i - \bar{x}_j)^2. \qquad (4.5)

If the data is partitioned into k clusters, each with sums of squares I1, . . . , Ik, then the within-cluster sum of squares, IW, is

I_W = \sum_{j=1}^{k} I_j. \qquad (4.6)

The between-cluster sum of squares, IB, is defined as the mean of the squares of the distances of the centers of each cluster from the global center, given by

I_B = \sum_{j \in \text{clusters}} \left( \sum_{i \in I_j} p_i \right) (\bar{x}_j - \bar{x})^2. \qquad (4.7)

Therefore, the total sum of squares is the sum of the within-cluster and between-cluster sums of squares:

I = I_W + I_B. \qquad (4.8)

Figure 4.5 illustrates this decomposition for a set of points: the total sum of squares is the sum of the within-cluster and between-cluster sums of squares.

Figure 4.5: The total cluster sum of squares (I) is the sum of the within-cluster sum of squares (IW) and the between-cluster sum of squares (IB). Global cluster centers are indicated in red. Adapted from Tufféry (2011).
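The decomposition in equation 4.8 can be checked numerically. The sketch below uses two synthetic clusters with unit weights (pi = 1, an assumption not stated in the text); under that assumption, IW + IB reproduces the total sum of squares exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two illustrative clusters with unit weights (p_i = 1 for every point).
A = rng.normal([0, 0], 0.5, size=(30, 2))
B = rng.normal([4, 1], 0.5, size=(40, 2))
X = np.vstack([A, B])
labels = np.array([0] * 30 + [1] * 40)

global_mean = X.mean(axis=0)

# I_W: squared distances of points from their own cluster centroid (eq 4.5, 4.6).
I_W = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
          for j in (0, 1))
# I_B: cluster sizes times squared distances of centroids from the
# global center (eq 4.7 with p_i = 1).
I_B = sum((labels == j).sum() *
          ((X[labels == j].mean(axis=0) - global_mean) ** 2).sum()
          for j in (0, 1))
# I: squared distances of all points from the global center (eq 4.4).
I_total = ((X - global_mean) ** 2).sum()

print(I_W + I_B, I_total)  # the two quantities agree: I = I_W + I_B
```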

The value of IW can be used to find the optimal number of clusters present in the data. If all points belong to one cluster, i.e. k = 1, IW will be high, since there will be points far away from the cluster centroid, thus increasing the sum of squares. As k increases, IW decreases, since there are more centroids and the clusters become more homogeneous. However, the largest k is not necessarily the best clustering solution. Instead, the number of clusters should be increased such that, if the last significant decrease in IW occurs when moving from k to k + 1 clusters, the partition into k + 1 clusters is taken as correct. This is demonstrated in Figure 4.6.

To decide on the level at which to cut the dendrogram and so obtain the kb clusters, Figure 4.6 shows IW computed for values of k ranging from 1 to 10. The curve starts at a high value for k = 1, which is expected since all objects are assigned to one cluster. As k increases, IW decreases dramatically and thereafter begins to flatten out as k approaches 10. Tufféry (2011) recommends that the value of k should be chosen such that, on moving from k to k + 1, there is an insignificant decrease in IW. However, Tufféry (2011) provides no criterion for what constitutes an insignificant decrease in IW, so choosing the cut-off value of k is a matter of judgement. For the minute-resolution kb data, the last significant decrease was judged to occur on moving from k = 3 to k = 4. Therefore, the optimal number of clusters was set to 4, and the dendrogram can now be cut at the level that yields 4 clusters.

Figure 4.6: Within-cluster sum of squares for varying values of k, for the kb clusters produced by the hierarchical method. For k = 1, IW is high. As k increases, IW decreases dramatically and thereafter begins to flatten as k approaches 10. The optimal value of k is 4, since moving from k = 3 to k = 4 yields the last significant decrease in IW.
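The selection procedure above can be sketched as follows, using synthetic data with four planted groups in place of the kb Principal Components. IW is computed for k = 1 to 10 from nested cuts of a single Ward tree; the successive drops in IW then indicate where the last significant decrease occurs.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Illustrative data with four well-separated groups (not the thesis data).
centers = [(0, 0), (5, 0), (0, 5), (5, 5)]
X = np.vstack([rng.normal(c, 0.4, size=(25, 2)) for c in centers])

Z = linkage(X, method="ward", metric="euclidean")

def within_ss(X, labels):
    """I_W: sum over clusters of squared distances to the cluster centroid."""
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in np.unique(labels))

# Nested cuts of the same tree, so I_W is non-increasing in k.
I_W = [within_ss(X, fcluster(Z, t=k, criterion="maxclust"))
       for k in range(1, 11)]

# Size of the decrease at each step k -> k + 1; the curve flattens
# once k reaches the number of planted groups.
drops = [I_W[i] - I_W[i + 1] for i in range(len(I_W) - 1)]
print([round(v, 1) for v in I_W])
```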

The silhouette plot for the hierarchical kb clusters of the Durban data is given in Figure 4.7. Cluster 1 has a low SIC and is rather weakly clustered. Cluster 2 also has a low SIC. The SIC for Cluster 3 is above 0.8, indicating a compact cluster. Cluster 4 has a slightly lower SIC than Cluster 3, but it is nevertheless still sufficiently high for the cluster to be regarded as compact. The percentage of days in each cluster is also given. Days with negative SI values lie close to the border of two clusters and comprise 11% of days. For all 4 clusters produced by Ward's hierarchical method, SITOT was found to be 0.61.

Figure 4.7: Silhouette plot for clusters 1 to 4. Clusters 1 and 2 have low SIC, indicating less compact clusters. Clusters 3 and 4 have high SIC, indicating compact clusters. The percentage of days in each cluster is also given. Negative SI values correspond to days that lie closer to the border of the cluster.
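Per-cluster silhouette values of the kind shown in Figure 4.7 can be computed with scikit-learn. The data below are synthetic stand-ins with two loose and two compact groups, not the Durban kb data; the overall silhouette score is the mean of the per-point values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(3)
# Illustrative stand-in: four groups of varying spread.
centers = [(0, 0), (4, 0), (0, 4), (4, 4)]
spreads = [1.0, 1.0, 0.3, 0.4]   # first two loose, last two compact
X = np.vstack([rng.normal(c, s, size=(30, 2))
               for c, s in zip(centers, spreads)])

labels = fcluster(linkage(X, method="ward"), t=4, criterion="maxclust")

si = silhouette_samples(X, labels)   # SI value for each point ("day")
for j in np.unique(labels):
    print(f"cluster {j}: mean SI = {si[labels == j].mean():.2f}")
print(f"SI_TOT = {silhouette_score(X, labels):.2f}")  # overall mean SI
```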

Ward's hierarchical clustering procedure applied to the PCA-reduced kb data produced the dendrogram in Figure 4.8. Using the within-cluster sum of squares criterion in Figure 4.6, the dendrogram was cut at the level that produced 4 clusters. A cluster map showing the first two Principal Components is given in Figure 4.9. Clusters 3 and 4 are relatively compact; however, Clusters 1 and 2 are less compact.
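The overall pipeline of this section, i.e. PCA reduction followed by Ward clustering and a cut at 4 clusters, can be sketched end-to-end. The random matrix below is only a placeholder for the minute-resolution kb profiles, and two components are an assumed (illustrative) choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Placeholder for the minute-resolution k_b profiles: 200 "days" x 60 "minutes".
profiles = rng.normal(size=(200, 60))

# Reduce to a few principal components, then cluster with Ward linkage
# and cut the tree at the level that yields 4 clusters, mirroring the text.
pcs = PCA(n_components=2).fit_transform(profiles)
Z = linkage(pcs, method="ward", metric="euclidean")
labels = fcluster(Z, t=4, criterion="maxclust")
print(np.unique(labels))
```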