Clustering Blockchain Data
3.5 Evaluation
64 S. S. Chawathe patterns of the transactions. For example, if a cluster contains 200 transactions, with 100 of them occurring at roughly daily intervals, 70 at weekly, and 30 at monthly, that cluster may be subdivided accordingly.
Well-known services. Certain well-known services have transaction patterns that allow association of input and output addresses. For example, the Satoshi Dice service has been studied in this context and used to connect addresses [38].
Well-known services may also be used to detect when addresses merged by other rules are likely separately controlled and should therefore be separated. For this purpose, recent work [20] divides Bitcoin services into six categories and defines a pairwise compatibility matrix for those categories. For example, gambling services are deemed incompatible with pool services, while they are compatible with exchanges. Addresses that appear in incompatible pairs of categories are deemed erroneously merged and are therefore separated.
3.4.3 Scalability
Our focus has been on understanding blockchain data and on framing the space of clustering as applied to that data. Compared to other typical application domains of clustering, work on clustering applied to blockchains is still at a very early stage.
It is therefore appropriate to focus on problem definitions and output-quality evaluation metrics rather than on performance and scalability issues such as running time and space requirements. Nevertheless, it is worth briefly considering such scalability issues here.
General-purpose density-based clustering algorithms, such as the many variants of the DBSCAN algorithm, are good candidates because they are not only well studied but also implemented widely in several popular libraries and systems [1,26,36]. Specialized algorithms for outlier detection and network clustering are also applicable [54,59,63]. The performance of DBSCAN and related algorithms strongly depends on the nature of the data, the critical ε and MinPts parameters, and on the availability of suitable indexes [24,55]. Given the volume of blockchain data, parallel execution of clustering algorithms is a useful strategy. For example, recent work has demonstrated a DBSCAN-like algorithm that is designed for the MapReduce framework [28].
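To make the role of the neighborhood radius and MinPts parameters concrete, the following is a minimal, unoptimized sketch of the basic DBSCAN procedure in plain Python. It is an illustration only, not the implementation used in any of the cited libraries; the names `dbscan`, `region_query`, `eps`, and `min_pts`, as well as the sample points, are assumptions made for this example. The linear scan in `region_query` is the step that a spatial index would replace, which is one reason index availability matters so much for performance.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        cluster += 1                       # points[i] is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

# Hypothetical 2-D feature vectors: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Each call to `region_query` scans all points, so this sketch takes quadratic time overall; on blockchain-scale data that cost is what motivates indexes and parallel variants.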
studied datasets. The volume of data, as well as other characteristics, means that it is impracticable to rely on the availability of such test data in the near future as well. It is therefore necessary to use methods that do not rely on human-studied datasets but, instead, use some other intrinsic characteristics of the input data and output clusters, i.e., an internal method [58].
3.5.1 Distance-Based Criteria
Such intrinsic criteria for evaluating clusters have been studied in the domain of document clustering in the vector-space model [51]. That work builds on earlier work [18] and on the intuitive desirability of compactness and isolation of clusters.
3.5.1.1 Cluster Quality Criteria
In particular, four quality criteria and associated indices are presented. (The following presentation uses different, more descriptive, names for the criteria and modified, but equivalent, formal definitions of the indices.)
Total intra- and inter-cluster distance. This criterion seeks to minimize both the separation of objects that are in the same cluster and the separation of clusters.
More precisely, let $\mathcal{C}$ denote a collection of clusters, $d$ a distance metric for the elements (vectors) being clustered, and $\mu(C)$ the centroid of cluster $C$:
$$\mu(C) = \frac{1}{|C|} \sum_{x \in C} x \tag{3.1}$$
The dst index that this criterion seeks to minimize may then be expressed as follows:
$$\mathrm{dst}(\mathcal{C}) = \sum_{C \in \mathcal{C}} \sum_{x \in C} d(x, \mu(C)) + \sum_{C \in \mathcal{C}} d\Bigl(\mu(C), \mu\bigl(\textstyle\bigcup \mathcal{C}\bigr)\Bigr) \tag{3.2}$$
The first term on the right-hand side quantifies intra-cluster distances, as it sums the distances of elements from the centroids of their assigned clusters.
The second quantifies inter-cluster distances, as it sums the distances of cluster centroids from the global centroid. Minimizing the intra-cluster distances is a natural expression of the general preference for denser clusters. In contrast, minimizing the inter-cluster distances may appear counterintuitive at first because it contradicts the general preference for well-separated clusters. However, the intuitive effect of this criterion is to strike a balance between the desire for compact clusters (which, on its own, could be satisfied by putting each object in its own cluster) and the desire for fewer clusters (which, on its own, could be satisfied by putting all objects in a single cluster). This distance has the same value in both these extreme cases, and is typically lower in well-chosen intermediate cases.
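The behavior at the two extremes can be checked numerically. The following sketch computes the dst index for Euclidean vectors; the function names `centroid` and `dst_index` and the four sample points are assumptions chosen for illustration, and Euclidean distance stands in for the unspecified metric d.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors (Eq. 3.1)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def dst_index(clusters, d=math.dist):
    """Total intra- plus inter-cluster distance (Eq. 3.2); lower is better."""
    everything = [x for c in clusters for x in c]
    global_centroid = centroid(everything)
    # First term: distance of every element from its own cluster's centroid.
    intra = sum(d(x, centroid(c)) for c in clusters for x in c)
    # Second term: distance of every cluster centroid from the global centroid.
    inter = sum(d(centroid(c), global_centroid) for c in clusters)
    return intra + inter

# Hypothetical data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
singletons = [[p] for p in points]       # one cluster per object
single = [points]                        # one cluster for all objects
paired = [points[:2], points[2:]]        # the natural two-cluster grouping

dst_singletons = dst_index(singletons)
dst_single = dst_index(single)
dst_paired = dst_index(paired)
```

For this data the two extreme clusterings give the same value, while the natural two-cluster grouping gives a smaller one.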
Cluster separation. Intuitively, this criterion favors compact, well-separated clusters by comparing element-wise within-cluster distances with distances between cluster centroids. More precisely, for a given cluster $C$, we may compute the ratio of the maximum of the distances between pairs of elements in that cluster to the minimum of the distances from the centroid of $C$ to the centroids of the other clusters. The sep index is defined as the reciprocal of the sum of this ratio over all clusters that contain at least two elements. This index is undefined for the two extreme cases of one cluster per object and one cluster for all objects.
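$$\mathrm{sep}(\mathcal{C}) = \left( \sum_{\substack{C \in \mathcal{C} \\ |C| > 1}} \frac{\max\{\, d(x_i, x_j) \mid x_i, x_j \in C \,\}}{\min\{\, d(\mu(C), \mu(C')) \mid C' \in \mathcal{C},\ C' \neq C \,\}} \right)^{-1} \tag{3.3}$$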
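As a rough illustration, the sep index can be computed as in the sketch below. The names `centroid` and `sep_index` and the sample clusterings are assumptions for this example, and Euclidean distance stands in for d. In both undefined extreme cases the function raises an exception (an empty minimum for a single cluster, a division by zero for all-singleton clusterings) rather than returning a value.

```python
import math
from itertools import combinations

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sep_index(clusters, d=math.dist):
    """Reciprocal of the summed diameter / nearest-centroid-distance ratios (Eq. 3.3)."""
    centroids = [centroid(c) for c in clusters]
    total = 0.0
    for i, c in enumerate(clusters):
        if len(c) < 2:
            continue  # singleton clusters are excluded from the sum
        # Largest distance between any two elements of this cluster.
        diameter = max(d(x, y) for x, y in combinations(c, 2))
        # Smallest distance from this centroid to any other cluster's centroid.
        nearest = min(d(centroids[i], centroids[j])
                      for j in range(len(clusters)) if j != i)
        total += diameter / nearest
    return 1.0 / total

# Hypothetical data: the same four points grouped sensibly and badly.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
loose = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]

sep_tight = sep_index(tight)
sep_loose = sep_index(loose)
```

Under this definition the compact, well-separated grouping scores far higher than the grouping that pairs distant points.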
Element-wise distances. The intuition behind this criterion is that, in good clusterings, the largest distance between any pair of elements in a cluster should be small, and the smallest distance between a pair of elements in different clusters should be large. More precisely, the eld index (which is to be minimized) sums, for each element, the difference between its distances to the farthest object in its cluster and the nearest object in some other cluster. This index has its maximum value when there is a single cluster with all elements. Its value is also large when elements are in singleton clusters. It favors intermediate clusterings that have large clusters.
$$\mathrm{eld}(\mathcal{C}) = \sum_{C \in \mathcal{C}} \sum_{x \in C} \Bigl( \max_{y \in C} d(x, y) - \min_{z \notin C} d(x, z) \Bigr) \tag{3.4}$$
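A direct transcription of the eld index into Python might look as follows. The name `eld_index` and the sample data are assumptions for this example. The formula does not say what the minimum should be when no element lies outside the cluster (the single-cluster case), so this sketch assumes a value of 0 there, which is consistent with the single-cluster case yielding the maximum value.

```python
import math

def eld_index(clusters, d=math.dist):
    """Sum over elements of (farthest same-cluster distance) minus
    (nearest other-cluster distance), as in Eq. 3.4; lower is better."""
    total = 0.0
    for i, c in enumerate(clusters):
        outside = [z for j, other in enumerate(clusters) if j != i for z in other]
        for x in c:
            farthest_inside = max(d(x, y) for y in c)
            # Assumption: treat the minimum over an empty set as 0.
            nearest_outside = min((d(x, z) for z in outside), default=0.0)
            total += farthest_inside - nearest_outside
    return total

# Hypothetical data: the same four points under three different clusterings.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
loose = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]
single = [[(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]]

eld_tight = eld_index(tight)
eld_loose = eld_index(loose)
eld_single = eld_index(single)
```

With these points the compact grouping has the lowest value, the grouping that pairs distant points is much higher, and the single all-inclusive cluster is highest.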
Intra- and inter-cluster similarity. Unlike the earlier criteria, this criterion depends on a method for evaluating the similarity S of two elements, and quantifies the intuition that intra-cluster similarities should be large while inter-cluster similarities should be small. More precisely, the sim index (which is to be maximized) sums, over all clusters, the difference between the total pairwise similarity for elements in the cluster and the total pairwise similarities with one object in and one object out of the cluster. This index favors large clusters over smaller clusters. Indeed, in the extreme case of one cluster per object, its value is negative, while in the other extreme of all objects in a single cluster, its value is maximized.
$$\mathrm{sim}(\mathcal{C}) = \sum_{C \in \mathcal{C}} \Biggl( \sum_{x, y \in C} S(x, y) - \sum_{x \in C,\ z \notin C} S(x, z) \Biggr) \tag{3.5}$$
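The sketch below illustrates the sim index under two stated assumptions. First, the similarity function `similarity` (the reciprocal of one plus the Euclidean distance) is hypothetical, since the criterion leaves S unspecified. Second, the within-cluster sum is taken over unordered pairs of distinct elements, which matches the observation that the index is negative when every object is in its own cluster.

```python
import math
from itertools import combinations

def similarity(x, y):
    """Hypothetical similarity in (0, 1] that decays with Euclidean distance."""
    return 1.0 / (1.0 + math.dist(x, y))

def sim_index(clusters, s=similarity):
    """Within-cluster minus cross-cluster similarity, summed over clusters
    (Eq. 3.5); higher is better."""
    total = 0.0
    for i, c in enumerate(clusters):
        outside = [z for j, other in enumerate(clusters) if j != i for z in other]
        # Assumption: unordered pairs of distinct elements inside the cluster.
        within = sum(s(x, y) for x, y in combinations(c, 2))
        # Pairs with one element inside the cluster and one outside it.
        across = sum(s(x, z) for x in c for z in outside)
        total += within - across
    return total

# Hypothetical data: four points under four different clusterings.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
singletons = [[p] for p in points]
single = [points]
tight = [points[:2], points[2:]]
loose = [[points[0], points[2]], [points[1], points[3]]]

sim_values = {name: sim_index(c) for name, c in
              [("singletons", singletons), ("single", single),
               ("tight", tight), ("loose", loose)]}
```

The results reproduce the stated bias toward large clusters: the single all-inclusive cluster scores highest and the all-singleton clustering is negative.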
3.5.1.2 Mahalanobis Distance
As a variation on the general theme of the above criteria, it may be useful to measure the distance between an element and its assigned cluster using a metric such as the Mahalanobis distance in order to better account for skewed cluster shapes and diverse variances along different dimensions [35].
If $V$ is a set of elements (vectors) with mean $\mu$ and covariance matrix $S = (\sigma_{ij})$, then the Mahalanobis distance of an element $v$ from $V$ is
$$d_m(v, V) = \sqrt{(v - \mu)^{T} S^{-1} (v - \mu)} \tag{3.6}$$
For instance, the two terms inside the parentheses in Eq. (3.5) may be replaced by the Mahalanobis distance of $x$ from the set of points in (respectively, not in) $x$'s cluster.
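The effect of accounting for differing variances can be seen in a small two-dimensional example. The sketch below is an illustration under stated assumptions: the function name `mahalanobis_2d` and the sample points are hypothetical, the covariance matrix is estimated as the sample covariance (divisor n − 1), and the 2×2 inverse is written out in closed form so that only the standard library is needed.

```python
import math

def mahalanobis_2d(v, points):
    """Mahalanobis distance (Eq. 3.6) of a 2-D vector v from a set of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]] (assumed estimator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = v[0] - mx, v[1] - my
    # (v - mu)^T S^{-1} (v - mu), using the closed-form inverse of a 2x2 matrix.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Hypothetical cluster that is elongated along the first axis.
cluster = [(-4.0, 0.0), (4.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
along = mahalanobis_2d((4.0, 0.0), cluster)    # lies along the elongated axis
across = mahalanobis_2d((0.0, 4.0), cluster)   # lies across the narrow axis
```

Both query points are at Euclidean distance 4 from the cluster mean, yet the point that lies across the narrow axis is much farther in the Mahalanobis sense.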
Work using the Mahalanobis distance for outlier detection in the address- and transaction-graphs found a strong correlation between the detected outliers and the boundaries of scatterplots of in- and out-degrees of vertices [49]. This result appears unsurprising in this case because normalized values of the same features are used for determining outliers. However, it does indicate that these features likely have the dominant influence for that method.
3.5.2 Sensitivity to Cluster Count
With other parameters held constant, as we increase the number of clusters (especially in a method such as k-means), we expect the distance of each element from the centroid of its assigned cluster to decrease. However, the rate at which the latter distance decreases will, in general, vary across clustering methods. It is natural to prefer methods that provide a more rapid reduction in this distance as the number of clusters increases. This intuition can be made more concrete by plotting a curve with number-of-clusters on the horizontal axis and average distance of elements from the cluster centroids on the vertical axis. The desired criterion then maps to seeking curves with a small area between the curve and the horizontal axis. If we invert the vertical axis, then we seek to maximize the area under the curve. In this respect, this criterion is similar to that used in the receiver operating characteristic (ROC) [21] for studying the trade-off between true positives and false positives.
While there is graphical similarity, the underlying concepts may be different in some cases. However, when clustering is being used for a classification task such as fraud detection, the concepts are similar. For instance, such a curve is used in some work that evaluates clustering for Bitcoin fraud detection [41].
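The area criterion described above can be approximated with the trapezoidal rule once the curve has been sampled. In the sketch below, the function name `area_under_curve` and the two series of average distances are hypothetical values made up for illustration; they are not measurements from any clustering run.

```python
def area_under_curve(ks, avg_distances):
    """Trapezoidal area between the sampled curve and the horizontal axis."""
    pairs = sorted(zip(ks, avg_distances))
    return sum((k2 - k1) * (y1 + y2) / 2.0
               for (k1, y1), (k2, y2) in zip(pairs, pairs[1:]))

# Number of clusters, and hypothetical average element-to-centroid distances
# for two clustering methods evaluated on the same data.
ks = [1, 2, 3, 4, 5]
method_a = [10.0, 4.0, 2.0, 1.5, 1.0]   # distance drops quickly
method_b = [10.0, 8.0, 6.0, 4.0, 1.0]   # distance drops slowly

area_a = area_under_curve(ks, method_a)
area_b = area_under_curve(ks, method_b)
```

Both methods start and end at the same average distance, but the first has the smaller area and so would be preferred under this criterion.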
3.5.3 Tagged Data
There is a small amount of tagged blockchain data that can be used for evaluating clustering, outlier detection, and related methods. This data is typically derived from well-known cases of theft or other fraudulent activity on the Bitcoin network. The scarcity of such tagged data makes it unsuitable for use in a primary evaluation method; however, it is valuable as a secondary check on the effectiveness of methods in detecting confirmed thefts or other tagged activity.
While such data has been mainly gathered passively, in response to reports of Bitcoin thefts or similar events, there is also some work [38] on actively tagging a subset of blockchain data using store purchases and other activities that trigger Bitcoin transactions, and correlating the results.
There are also some services that associate user-defined tags with Bitcoin addresses [7,31]. Since these services are easily updated by anyone without any restrictions, there are no guarantees about the accuracy of the tags and associations.
Nevertheless, the information could be used in a secondary manner for clustering, for example, to assign descriptive tags to clusters identified using more reliable data.
Another related method uses synthetically tagged data, such as outliers detected by another method, to evaluate results by measuring the distance of the outliers from the nearest cluster centroids [49]. In that work, the focus is on detecting outliers using clustering (k-means) as a baseline, but the same strategy could be used in the opposite direction. A related strategy, also used in that work, is to use the results on one model of the underlying data to evaluate results on another model, using known or assumed mappings between concepts in the two models. In particular, results on the transactions-as-vertices graph model can be used to evaluate results on the owners-as-vertices model, and vice versa.
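The first of these strategies, measuring how far synthetically tagged outliers lie from the nearest cluster centroid, can be sketched as follows. The function names and all coordinates are hypothetical; this is only an illustration of the measurement and not the procedure of the cited work.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Distance from a point to the closest of the given cluster centroids."""
    return min(math.dist(point, c) for c in centroids)

def mean_nearest_centroid_distance(points, centroids):
    """Average nearest-centroid distance over a non-empty list of points."""
    return sum(nearest_centroid_distance(p, centroids) for p in points) / len(points)

# Hypothetical centroids from a clustering run, plus points that another
# method has tagged as outliers and as ordinary elements.
centroids = [(0.0, 0.0), (10.0, 10.0)]
tagged_outliers = [(5.0, 5.0), (20.0, 20.0)]
ordinary = [(0.0, 1.0), (10.0, 9.0)]

outlier_score = mean_nearest_centroid_distance(tagged_outliers, centroids)
ordinary_score = mean_nearest_centroid_distance(ordinary, centroids)
```

If the two methods agree, the tagged outliers should lie noticeably farther from their nearest centroids than the ordinary elements do, as they do in this example.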
3.5.4 Human-Assisted Criteria
Although, as noted earlier, it is often not convenient or possible to analyze blockchain datasets with human feedback, some human-assisted methods can augment the automated methods. For this purpose, it is important that the clusters and their important features be available for multiple commonly used visualizations, such as scatterplots, histograms, and parallel-coordinates plots.
In some prior work, such human-assisted criteria have been profitably used to improve the clustering process and to validate some of the results. For example, a study on de-anonymizing Bitcoin addresses used a graphical visualization of the user network [38]. That work also illustrates the use of several other visualizations of the blockchain data for making inferences about the dominant modes in which currency moves through the network of user addresses. Similarly, a visual representation of the Bitcoin transactions graph has been used to detect communities by using a few labeled nodes as starting points [22].