This is to certify that the thesis titled "Quality Analysis of Correlation Clustering", submitted by Mamata Samal to the Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, is a record of bona fide research work carried out under our supervision and is worthy of consideration for the award of the Institute's degree of Doctor of Philosophy.
Correlation Clustering
Correlation clustering (CC) takes a labeled graph as input; construction of the graph and its type from a given vector data set, however, is not part of CC.
Motivation of the Research Work
Contributions of the Thesis
The application of outward rotation, random projection, and randomized rounding (RPR²) techniques in the context of CC to analyze the quality of the obtained clusters. In the case of graph clustering methods, one explicitly constructs a graph from the given vector data points.
Organization of the Thesis
The role of optimal graph construction on the quality of the obtained clusters is investigated in this chapter.
Introduction
Similarly, the second term of (2.1) penalizes negatively labeled edges that are placed in the same group. For the MAXAGREE objective, a variant of the LP formulation given in (2.1) is obtained, as given in (2.2).
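Equations (2.1) and (2.2) are not reproduced in this summary; as a point of reference, the standard LP relaxation of MINDISAGREE, which matches the description above (with x_uv = 0 when u and v are co-clustered), reads:
\[
\begin{aligned}
\min\;\; & \sum_{(u,v)\in E^{+}} x_{uv} \;+\; \sum_{(u,v)\in E^{-}} \bigl(1 - x_{uv}\bigr) \\
\text{s.t.}\;\; & x_{uw} \le x_{uv} + x_{vw} \quad \forall\, u,v,w \in V, \\
& x_{uv} \in [0,1] \quad \forall\, u,v \in V.
\end{aligned}
\]
The MAXAGREE variant swaps the roles of the two terms, maximizing \(\sum_{(u,v)\in E^{+}} (1 - x_{uv}) + \sum_{(u,v)\in E^{-}} x_{uv}\) under the same constraints.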
Application of CC
A variant of CC has been proposed to obtain clusters when the edge labels go beyond the + and − labels of Bansal et al. Correlation clustering has also been applied to obtain clusters from retrieved internet search results.
Summary
This chapter studies the implication of using different rounding techniques in the context of CC. The outward rotation and RPR² rounding techniques are experimentally observed to be sensitive to the properties of the data.
Rounding Techniques
The outward rotation rounding technique is a simple transformation of the solution vectors v∗i before applying the hyperplane rounding technique. The RPR² rounding technique [27] is a family of rounding procedures, combining random projection and randomized rounding, defined by a function f : R → [0,1].
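As a concrete illustration (not the thesis's exact procedure), the sketch below implements the three rounding steps described above; the function names, the fresh-coordinate construction in outward rotation, and the particular choice of f are illustrative assumptions:

```python
import numpy as np

def hyperplane_rounding(V, rng):
    """Round unit row vectors V (n x d) to labels in {+1, -1}
    by the sign of their projection onto a random hyperplane normal."""
    r = rng.standard_normal(V.shape[1])
    return np.sign(V @ r)

def outward_rotation(V, gamma):
    """Rotate each unit vector v_i by angle gamma toward its own fresh
    coordinate axis; the rotated vectors are then rounded as usual."""
    n, d = V.shape
    W = np.zeros((n, d + n))
    W[:, :d] = np.cos(gamma) * V                        # shrink shared component
    W[np.arange(n), d + np.arange(n)] = np.sin(gamma)   # add a private component
    return W

def rpr2_rounding(V, f, rng):
    """RPR^2: project onto one random Gaussian direction, then round each
    point to +1 independently with probability f(projection)."""
    r = rng.standard_normal(V.shape[1])
    p = f(V @ r)                       # f maps R -> [0, 1] elementwise
    return np.where(rng.random(V.shape[0]) < p, 1, -1)

# Example usage with a sigmoid-shaped f (an illustrative choice):
rng = np.random.default_rng(0)
V = rng.standard_normal((10, 5))
V /= np.linalg.norm(V, axis=1, keepdims=True)
labels = rpr2_rounding(outward_rotation(V, 0.2),
                       lambda t: 1 / (1 + np.exp(-t)), rng)
```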
Empirical Study
In Table 3.4, the number of data points (nodes in the graph), the dimension of each data point, and the number of classes in each data set are shown in the 2nd, 3rd, and 4th columns, respectively. The 5th column indicates the imbalance in the data set, that is, for each positive data point, how many negative data points are sampled.
Analysis
CC with More Than Two Clusters
Conclusion
The effect of different graph construction methods on graph clustering, specifically spectral clustering, is studied in [54]. The convergence of CC quality is also studied with respect to optimal and traditional graph construction methods. An empirical study is conducted to understand the impact of optimal graph construction methods on the approximation value and, in turn, on the quality of the obtained clusters.
The following results are noted empirically for the various non-optimal and optimal graph construction methods.
Similarity Graphs
Optimal Graph Construction
When the vector data does not contain any noise, the above objective function is minimized subject to the constraint that each vertex has degree at least 1. The objective function is written in terms of a matrix M encoding the data points and the edge information between each pair of data points, so that the problem becomes minimizing ‖LX‖_F. Since scaled values of the direction lead to improved values of the objective f(w), the above inequality can be converted into an equality constraint.
The above unconstrained optimization problem is solved using standard toolboxes to obtain the undirected unweighted graph.
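As one hedged reading of this formulation (the exact objective is not reproduced here, and the thesis ultimately solves an unconstrained reformulation), a convex sketch that learns a weight matrix by minimizing the Laplacian smoothness term ‖LX‖_F under the minimum-degree constraint might look as follows; all symbols and the cvxpy modeling choices are illustrative assumptions:

```python
import numpy as np
import cvxpy as cp

def learn_graph(X):
    """Sketch: learn a symmetric, non-negative weight matrix W from data
    X (n x d) so that the Laplacian smoothness ||L X||_F is small and
    every vertex has (weighted) degree at least 1."""
    n = X.shape[0]
    W = cp.Variable((n, n), symmetric=True)
    degrees = cp.sum(W, axis=1)
    L = cp.diag(degrees) - W                   # graph Laplacian L = D - W
    objective = cp.Minimize(cp.norm(L @ X, "fro"))
    constraints = [W >= 0,                     # non-negative weights
                   cp.diag(W) == 0,            # no self-loops
                   degrees >= 1]               # minimum-degree constraint
    cp.Problem(objective, constraints).solve()
    return W.value

# Thresholding the learned weights (e.g., W > 1e-6) would yield the
# undirected unweighted graph mentioned above.
```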
An Empirical Study
The implication of optimal graph construction (both complete and general graphs) on the approximation value of CC is studied. In the case of the Wine data set, the ε-neighborhood graph's approximation value (0.876) is close to the theoretical guarantee (0.878). External quality: the edge index is considered for measuring the quality of the clusters obtained using SDP-CC.
The convergence of the intrinsic and extrinsic qualities of the SDP-CC formulation as the number of data points increases is studied.
Conclusions
Introduction
Constrained K-means Algorithm
Constrained Spectral Clustering
The matrix Q, which encodes the must-link and cannot-link constraints, is explicitly introduced into the objective function of the spectral clustering method. By substituting JSC and JCM in equation (5.2), the constrained spectral clustering objective is obtained as (5.3). For their empirical study of cSC, Wacquet et al. choose γ = 0. This choice is motivated by the fact that the first term in (5.3) is then neglected, which amounts to nullifying the JSC spectral clustering objective and giving maximum weight to the constraint graph on the data points.
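For concreteness, here is a minimal sketch (not Wacquet et al.'s exact formulation) of folding a must-link/cannot-link constraint matrix Q into a spectral embedding; the ±1 encoding of Q and the additive adjustment of the affinity are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def constrained_spectral_embedding(A, must_link, cannot_link, k, gamma=1.0):
    """Adjust the affinity matrix A (n x n) with a constraint matrix Q
    (+1 for must-link, -1 for cannot-link pairs), then take the k
    smallest eigenvectors of the resulting unnormalized Laplacian."""
    n = A.shape[0]
    Q = np.zeros((n, n))
    for i, j in must_link:
        Q[i, j] = Q[j, i] = 1.0
    for i, j in cannot_link:
        Q[i, j] = Q[j, i] = -1.0
    W = np.clip(A + gamma * Q, 0.0, None)   # constraint-adjusted affinity
    L = np.diag(W.sum(axis=1)) - W          # unnormalized Laplacian L = D - W
    _, vecs = eigh(L)
    return vecs[:, :k]                       # cluster these rows with k-means
```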
Thus, one can draw reasonable conclusions about both the constrained spectral clustering method and the correlation clustering method.
Constrained Spectral Clustering with Local Proximity Measure
To compare CC with constrained spectral clustering, the constraints in this chapter are generated based on the weights of the edges.
Flexible Constrained Spectral Clustering
Datasets and Constraint Generation
Result
CC vs. Flexible Constrained Spectral Clustering: The free parameter α in Wang's method indicates the extent to which the constraint set is satisfied. From the experimental results, it is noted that Wang's method has an advantage over the CC method on both synthetic and real data sets, as shown in Table 5.2 (Wang's method vs. CC on synthetic data sets) and Table 5.3 (Wang's method vs. CC on real-world data sets). The constraint coverage has no direct bearing on the quality of the clusters obtained with Wang's method.
Comparing Wang's method with CC shows that CC competes with Wang's method on eight out of fifteen data sets.
Summary
The MAXAGREE formulation of CC has not been used in practice due to the computational time required to obtain the partitions, since the CC formulation involves solving an expensive SDP with a large number of variables. The computational complexity of solving the SDP formulation is O(n^4.5) [77], where n is the number of variables involved in the SDP formulation (equivalently, the number of nodes in G, or the number of data points in the training set). Speeding up the SDP formulation is key to applying CC to large-scale data sets.
The authors of [13] proposed a variable reduction method to obtain an approximate solution to the original SDP formulation through the low-rank Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [70].
Various SDP Relaxation Techniques for CC
However, in this formulation it is difficult to deal with the semidefinite constraint along with the large number of variables and constraints. Second relaxation: Instead of eliminating the rank constraint as described in [32], the rank constraint and the positive semidefinite constraint are combined into a single constraint: V² − nV = 0. This approach to scaling the SDP formulation has been applied to MAX CUT with encouraging results.
Reduction of variables: A scalable formulation for solving the SDP is achieved by reducing the number of variables involved in the formulation of the SDP [13].
Scalable CC Formulation − SSDP-CC
In MAX CUT, the objective is to find a vertex set S that maximizes the weight of the edges in the cut (S, V \ S). The randomized approximation algorithm for MAX CUT [32] corresponds to the first term of equation (2.6). The second term of equation (2.6) corresponds to maximizing the number of positively correlated edges that lie within each cluster.
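For reference (equation (2.6) itself is not reproduced here), the Goemans–Williamson SDP relaxation of MAX CUT, to which the first term presumably corresponds, is:
\[
\max \;\; \frac{1}{2}\sum_{(i,j)\in E} w_{ij}\,\bigl(1 - v_i^{\top} v_j\bigr)
\qquad \text{s.t.} \quad \|v_i\| = 1 \;\; \forall\, i \in V.
\]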
The matrix V in equation (6.4) is replaced with RRᵀ, which plays a decisive role in reducing the number of variables in the SDP formulation.
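A minimal sketch of this low-rank substitution, in the spirit of the BFGS-based variable reduction mentioned above, is given below; the objective, the row normalization enforcing diag(V) = 1, and the use of L-BFGS-B with finite-difference gradients are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def ssdp_cc_lowrank(W, k, seed=0):
    """Replace the n x n PSD matrix V by R R^T with R of shape (n, k),
    k << n, and optimize over R's n*k entries instead of V's n^2."""
    n = W.shape[0]
    rng = np.random.default_rng(seed)

    def objective(r_flat):
        R = r_flat.reshape(n, k)
        R = R / np.linalg.norm(R, axis=1, keepdims=True)  # rows on unit sphere
        V = R @ R.T                       # V = R R^T is PSD by construction
        return -np.sum(W * V)             # maximize agreement <=> minimize negation

    res = minimize(objective, rng.standard_normal(n * k),
                   method="L-BFGS-B")     # gradients via finite differences
    R = res.x.reshape(n, k)
    return R / np.linalg.norm(R, axis=1, keepdims=True)
```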
Computational Study of SSDP-CC
Results of Scalable CC
In the case of a well-separated Gaussian data set, both SSDP-CC and SDP-CC obtain well-separated clusters. This is potentially due to the fact that the value of the objective function in the SSDP-CC case is very close to that of the SDP-CC formulation. In the case of the yeast data set, for the individual classes considered, SSDP-CC outperforms the SDP-CC formulation.
However, in the case of add32 for a very low rank (7), the edge index of SDP-CC is better than that of SSDP-CC.
SSDP-CC Comparison with Constrained Spectral Clustering
Real-world data sets: Even on real-world data sets, SSDP-CC takes less time than the SDP-CC formulation, as shown in Figure 6.4. It can be observed that the time required to solve the SSDP-CC formulation increases with the size of the data set. Synthetic data sets: Figure 6.10 compares the SSDP-CC method with the cSC method on synthetic data sets.
From this figure, observe that cSC has a clear edge over the SSDP-CC formulation.
Summary
Chapter 6 discusses a scalable CC formulation that reduces the number of variables involved in the SDP formulation. This chapter presents a scalable solution for the SDP formulation of CC (SDP-CC) by reducing the number of constraints, thereby reducing the time required to solve the SDP for a large number of data points (or nodes).
The proposed scalable formulation is compared with another known variant, in which scalability is achieved by reducing the number of variables involved in the SDP formulation, as discussed in Chapter 6.
Proposed Formulation
In the following, equation (7.6) is called the reduced-constraint SDP-CC, or RC SDP-CC. Note that this formulation does not yet have a theoretical result bounding the value of the objective function of the relaxed formulation. All points belonging to the first group are assigned a value of 1, and all remaining points a value of 0.
Let the first three vertices, namely v1, v2, v3, belong to the first group, and the remaining vertices to the second group.
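As one plausible reading of this example (assuming, say, n = 5 vertices and the convention that X_ij = 1 when i and j are co-clustered), the group indicator and the induced cluster matrix would be:
\[
x = (1,1,1,0,0)^{\top}, \qquad
X = x x^{\top} + (\mathbf{1}-x)(\mathbf{1}-x)^{\top} =
\begin{pmatrix}
1 & 1 & 1 & 0 & 0\\
1 & 1 & 1 & 0 & 0\\
1 & 1 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 1\\
0 & 0 & 0 & 1 & 1
\end{pmatrix}.
\]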
Experimental Evaluation
The objective function value is equal to or greater than that of the original SDP-CC formulation, since the number of constraints in the proposed formulation is reduced. The number of variables reduced in the SSDP-CC formulation is significantly smaller than the number of constraints reduced in the proposed case. As the number of constraints is reduced, the time taken to obtain the clusters decreases significantly and the objective function value increases.
Only in the case of bcsstk33 and bcsstk29 does the number of constraints reduced match the number of variables reduced.
Summary
Summary
Future Work
T. Joachims and J. Hopcroft, Error bounds for correlation clustering, in Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.
D. Klein, S. D. Kamvar, and C. D. Manning, From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering, in Proceedings of the 19th International Conference on Machine Learning (ICML), 2002.
[79] K. Wagstaff and C. Cardie, Clustering with instance-level constraints, in Proceedings of the 17th International Conference on Machine Learning (ICML), 2000.