Clustering Blockchain Data
3.4 Clustering
At a high level, there are two approaches for clustering blockchain data. The first approach is based on applying mostly conventional, well-studied clustering algorithms (e.g., k-means) and robust implementations to feature vectors extracted from blockchain datasets [25,30,33,62]. The key tasks in this case are determining the set of features to use, extracting them effectively from the underlying data, and transforming them (e.g., scaling, normalization) to improve clustering. Section 3.4.1 describes feature extraction for this purpose. The associated challenge of scaling clustering methods to the size and throughput of blockchain datasets is an important one as well, but one that may be addressed using well-developed prior work on clustering large datasets in general.
The second approach is based on a more direct utilization of the particular characteristics of blockchain data (e.g., co-occurrence of transaction inputs, or TXIs, or other temporal patterns [19]) and their associated semantics (e.g., identical or closely related owners). The key tasks in this case are determining which semantics, assumptions, and heuristics to use for the purpose of forming clusters, and designing algorithms tailored to those that are capable of efficiently operating on the large datasets involved. Section 3.4.2 describes such options in the context of the address graph (Sect. 3.3.3), where the focus is on determining which vertices (addresses) are likely to represent the same off-network user or other entity, i.e., on correlating the address graph with the owner graph (Sect. 3.3.4).
3.4.1 Feature Extraction
Prior work has identified and used several features from the underlying data for clustering and other purposes. For instance, the features extracted from transaction data (Sect. 3.3.1) and used by recent work on fraud detection based on trimmed k-means clustering [41] are categorized as:
• currency values: the sum, mean, and standard deviations of the amounts received and sent.
• network properties: in- and out-degrees, clustering coefficient (described below), and number of triangles.
• neighborhood: based on the sets of transactions from which a given transaction receives (and sends) satoshi, i.e., its in-neighborhood (and, respectively, out-neighborhood).
Intuitively, the local clustering coefficient [56] for a vertex measures the degree of interconnectedness among its neighbors. In more detail, it measures how similar the subgraph induced by these neighbors is to a fully connected (complete) graph on those vertices. Formally, the local clustering coefficient cc(v) of a vertex v in a directed graph D is defined as:
62 S. S. Chawathe
\[
cc(v) = \frac{\left| \left( N_D^+(v) \times N_D^+(v) \right) \cap E(D) \right|}{d_D^+(v) \left( d_D^+(v) - 1 \right)}
\]
where, following standard notation [9], we use $V(D)$ and $E(D)$ to denote the vertices and edges of a directed graph $D$, and where
\[
N_D^+(v) = \{\, u \in V(D) \mid (v, u) \in E(D) \,\}
\]
is the out-neighborhood of $v$ and $d_D^+(v) = |N_D^+(v)|$ is the out-degree of $v$.
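As a minimal sketch of this definition, the following Python function computes the local clustering coefficient for a directed graph stored as a dictionary of out-neighbor sets. The representation and the names (local_clustering_coefficient, out_neighbors) are illustrative assumptions and do not come from the cited work. Two further conventions are assumed here: the graph has no self-loops (pairs of the form (u, u) are skipped), and a vertex with out-degree below 2, for which the formula is undefined, is assigned 0.

```python
def local_clustering_coefficient(out_neighbors, v):
    """Return cc(v) for a directed graph given as {vertex: set of out-neighbors}."""
    nbrs = out_neighbors.get(v, set())
    d = len(nbrs)  # out-degree of v
    if d < 2:
        return 0.0  # assumed convention: the formula is undefined for d < 2
    # Count ordered pairs (u, w) of distinct out-neighbors of v with (u, w) an edge.
    links = sum(
        1
        for u in nbrs
        for w in nbrs
        if u != w and w in out_neighbors.get(u, set())
    )
    return links / (d * (d - 1))


# Toy graph: a points to b, c, d; among those, edges b->c, c->b, c->d exist.
graph = {
    "a": {"b", "c", "d"},
    "b": {"c"},
    "c": {"b", "d"},
    "d": set(),
}
cc_a = local_clustering_coefficient(graph, "a")  # 3 of 6 possible ordered pairs
```

In this toy graph, three of the six possible ordered pairs among the out-neighbors of a are edges, so the coefficient for a is 0.5.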
These extracted features were used as inputs to the classic and trimmed versions of the k-means clustering algorithm [15,17,47]. A curve of within-group distances versus number of clusters suggested k = 8 as a suitable parameter for the data in that study.
In similar work on anomaly detection [49], three features were used for the transactions graph (similar to the one of Sect. 3.3.1): in-degree, out-degree, and total value. The same study used six other features for the address graph (similar to the one of Sect. 3.3.3): in- and out-degrees, means of incoming and outgoing transaction values, mean time interval, and clustering coefficient. In order to account for the widely varying scales and large ranges of the feature metrics, the normalized logarithm of their values is used for further processing.
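The exact normalization used in that study is not specified here, so the following Python sketch shows only one plausible reading: take log(1 + x) of each non-negative feature value and then rescale the results to the range [0, 1]. The function name normalized_log and the choice of min-max scaling are assumptions made for illustration.

```python
import math


def normalized_log(values):
    """Map non-negative values to [0, 1] via log(1 + x) followed by min-max scaling.

    The min-max step is an assumed normalization, chosen only for illustration.
    """
    logs = [math.log1p(x) for x in values]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return [0.0 for _ in logs]  # constant feature: no spread to rescale
    return [(x - lo) / (hi - lo) for x in logs]


# Values spanning six orders of magnitude become evenly comparable.
total_values = [0, 9, 99, 999_999]
scaled = normalized_log(total_values)
```

With these inputs the logarithms are 0, ln 10, 2 ln 10, and 6 ln 10, so the scaled values are approximately 0, 1/6, 1/3, and 1.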
3.4.2 Address Merging
In principle, it is possible (and often recommended) that a Bitcoin address be used exactly once to receive funds and exactly once to send funds, with unused funds sent to a new address controlled by the owner of the first one. However, it is quite common for an address to be reused. A common case is the use of Bitcoin addresses embedded into software user interfaces or in Web pages to solicit donations. These kinds of addresses are long lived and participate in several transactions. Many such reused addresses also explicitly identify their owners, such as a software development team or a person. (As well, much like vanity license plates for vehicles, there are vanity addresses that include the owner’s name, or desired tag, as a substring when expressed in the base-58 encoding used by Bitcoin [57].) Such identified addresses are important resources not only because of the direct identification they provide but also because they may be used to bootstrap methods that can identify or classify addresses that are otherwise anonymous.
In this context, an important task that has been the focus of prior work is address merging, i.e., finding sets of Bitcoin addresses that are likely to be owned and controlled by a common person or entity. Some methods for merging Bitcoin addresses are below [16,38,53]:
Co-occurring transaction inputs. All the addresses that are referenced by inputs of a common transaction are assumed to have a common controlling owner. We may recall that transaction inputs (TXIs) do not include any addresses. However, these addresses are available in the outputs (TXOs) of prior transactions to which these inputs refer by providing transaction identifiers and output indices [60].
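The co-occurrence heuristic amounts to computing connected components over addresses, which a union-find (disjoint-set) structure handles efficiently. The Python sketch below assumes that the input addresses of each transaction have already been resolved from the referenced prior outputs; the function name merge_by_common_inputs and the list-of-lists input format are hypothetical.

```python
def merge_by_common_inputs(transactions):
    """Group addresses that appear together as inputs of some transaction.

    `transactions` is an iterable of lists of input addresses, assumed to be
    already resolved from the prior outputs that the inputs reference.
    """
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for input_addresses in transactions:
        for addr in input_addresses:
            find(addr)  # register every address, including singletons
        for addr in input_addresses[1:]:
            union(input_addresses[0], addr)

    clusters = {}
    for addr in parent:
        clusters.setdefault(find(addr), set()).add(addr)
    return sorted(sorted(c) for c in clusters.values())


# A1 and A3 never share a transaction but are linked transitively through A2.
txs = [["A1", "A2"], ["A2", "A3"], ["B1"], ["C1", "C2"]]
clusters = merge_by_common_inputs(txs)
```

Note that merging is transitive: A1 and A3 end up in the same cluster because each co-occurs with A2, even though they never appear in the same transaction.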
Transaction input–output patterns. Although the intent of transactions is not explicitly available, in some cases it may be reasonably inferred based on patterns. These patterns may use the number and variety of transaction inputs and outputs and, importantly, the incoming and outgoing distributions of value (satoshi). For example, it is very easy to determine, with high confidence, which transactions are payments to members of a Bitcoin mining pool, based on the total value (equal or close to the current block reward), and the input–output distribution: a single or very few inputs, representing the mining pool, and a large number of outputs, representing the pool members.
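A rough sketch of such a pattern test for mining-pool payouts follows. The thresholds (max_inputs, min_outputs, rel_tol), the dictionary layout of a transaction, and the function name are all hypothetical choices made for illustration; suitable values would need to be tuned against real data.

```python
def looks_like_pool_payout(tx, block_reward, max_inputs=2, min_outputs=50, rel_tol=0.1):
    """Heuristic test: few inputs, many outputs, total value near the block reward.

    `tx` is a dict with "input_values" and "output_values" (lists of satoshi
    amounts). All thresholds are illustrative assumptions.
    """
    total = sum(tx["output_values"])
    return (
        len(tx["input_values"]) <= max_inputs
        and len(tx["output_values"]) >= min_outputs
        and abs(total - block_reward) <= rel_tol * block_reward
    )


REWARD = 625_000_000  # example block reward in satoshi (6.25 BTC)

payout = {"input_values": [620_010_000], "output_values": [6_200_000] * 100}
ordinary = {"input_values": [50_000], "output_values": [30_000, 19_000]}

is_payout = looks_like_pool_payout(payout, REWARD)
is_ordinary_payout = looks_like_pool_payout(ordinary, REWARD)
```

Because the block reward changes over time, a practical version would look up the reward in effect at the height of the block containing the transaction rather than use a constant.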
Another important class of transactions in this context is the class of change-making transactions. Recall that there is no direct way to use only a part of the value available in an unspent transaction output (UTXO). Rather, that effect must be achieved by spending the entire UTXO value in a transaction, but paying part of it to an output that is controlled by the owner of the source UTXO.
Since change-making is an essential operation for many real-world transactions, the corresponding blockchain transactions are also frequent. Again, there is no explicit marker for such transactions. However, they may be detected by using a variety of techniques.
One effective technique is based on the observed practice (though not requirement) of Bitcoin wallet programs generating completely new addresses on behalf of their users whenever change-making is required [38]. The resulting observable pattern in a transaction is an address A with the following properties:
• The transaction is not a coinbase (mining reward claiming) transaction.
• There is no prior transaction that includes A in its output. (It follows that A cannot be referenced by any transaction’s input as well.)
• Among all the outputs of this transaction, A is the only one satisfying the above conditions.
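The three conditions above can be checked in a single chronological pass over the transactions while maintaining the set of addresses already seen in outputs. The Python sketch below assumes a simplified transaction record (a dict with "id", "is_coinbase", and "outputs" keys) and hypothetical function names; it is not the implementation from the cited work.

```python
def find_change_address(tx, seen_addresses):
    """Return the likely change address of `tx`, or None.

    `seen_addresses` holds every address that appeared in an output of an
    earlier transaction. The record layout is an assumption for illustration.
    """
    if tx["is_coinbase"]:
        return None  # condition 1: coinbase transactions are excluded
    # Condition 2: addresses never seen in any prior output.
    fresh = {a for a in tx["outputs"] if a not in seen_addresses}
    # Condition 3: exactly one output address may qualify.
    return next(iter(fresh)) if len(fresh) == 1 else None


def scan_for_change(transactions):
    """Process transactions in chronological order; map tx id -> change address."""
    seen = set()
    change = {}
    for tx in transactions:
        addr = find_change_address(tx, seen)
        if addr is not None:
            change[tx["id"]] = addr
        seen.update(tx["outputs"])
    return change


txs = [
    {"id": "t0", "is_coinbase": True, "outputs": ["M"]},
    {"id": "t1", "is_coinbase": False, "outputs": ["P", "Q"]},  # two fresh outputs
    {"id": "t2", "is_coinbase": False, "outputs": ["P", "R"]},  # only R is fresh
    {"id": "t3", "is_coinbase": False, "outputs": ["P", "R"]},  # nothing fresh
]
change_addresses = scan_for_change(txs)
```

Only t2 yields a change address here: t1 has two previously unseen outputs, so neither can be singled out, and t3 has none.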
Peer host address. While the above methods rely on information from the transaction model of blockchain data, some information from other models, such as the node model, may also be used to aid transaction-based clustering. An obvious candidate is the IP address of the peer host that originates a transaction. There is no explicit field in transactions that records their originating peer. At first glance, there is no way to determine whether a transaction arriving from a peer is one originated at that peer or forwarded by the peer on behalf of another, recursively. However, by using a client that simultaneously connects to a large number of peers, it is possible to use timing information to make inferences about the likely origins of a transaction. That is, if a client is connected to a large number of peers, then it is reasonable to assume that the peer from which a transaction was first received is very likely the peer at which it originated.
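The first-received rule can be sketched as a single pass over a log of relay observations collected by such a well-connected client. The log format, a sequence of (timestamp, peer_ip, tx_id) tuples, and the function name are assumptions; the IP addresses below come from ranges reserved for documentation.

```python
def infer_origin_peers(observations):
    """Map each transaction id to the peer from which it was first received.

    `observations` is an iterable of (timestamp, peer_ip, tx_id) tuples, in
    any order. The tuple layout is an assumption for illustration.
    """
    first_seen = {}
    for timestamp, peer, tx_id in observations:
        if tx_id not in first_seen or timestamp < first_seen[tx_id][0]:
            first_seen[tx_id] = (timestamp, peer)
    return {tx_id: peer for tx_id, (_, peer) in first_seen.items()}


observations = [
    (10.20, "192.0.2.1", "tx1"),
    (10.05, "198.51.100.7", "tx1"),  # earliest relay of tx1
    (11.00, "192.0.2.1", "tx2"),     # earliest relay of tx2
    (11.30, "203.0.113.9", "tx2"),
]
origins = infer_origin_peers(observations)
```

The inference is only as good as the client's coverage of the network: if the true originator is not among the connected peers, the earliest relay is merely the closest observed hop.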
Temporal patterns. When transactions matching various patterns (as above) are clustered and tagged, the clusters may be refined by observing the temporal patterns of the transactions. For example, if a cluster contains 200 transactions, with 100 of them occurring at roughly daily intervals, 70 at weekly, and 30 at monthly, that cluster may be subdivided accordingly.
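One hypothetical way to carry out such a subdivision is to group a cluster's transactions by address, compute the median gap between consecutive timestamps for each address, and label it as daily, weekly, or monthly when the gap falls within a tolerance of the nominal period. The grouping by address, the 25% tolerance, and the 30-day month are all assumptions of this sketch.

```python
from statistics import median

DAY = 86_400  # seconds in a day


def label_period(timestamps, tolerance=0.25):
    """Label a list of Unix timestamps by the median gap between consecutive ones."""
    ts = sorted(timestamps)
    if len(ts) < 2:
        return "unknown"
    gap = median(b - a for a, b in zip(ts, ts[1:]))
    for name, period in (("daily", DAY), ("weekly", 7 * DAY), ("monthly", 30 * DAY)):
        if abs(gap - period) <= tolerance * period:
            return name
    return "other"


def subdivide(timestamps_by_address):
    """Split a cluster into sub-clusters keyed by the inferred period label."""
    groups = {}
    for address, ts in timestamps_by_address.items():
        groups.setdefault(label_period(ts), []).append(address)
    return groups


cluster = {
    "addr1": [0, DAY, 2 * DAY, 3 * DAY],
    "addr2": [0, 7 * DAY, 14 * DAY],
    "addr3": [0, 30 * DAY, 61 * DAY],
}
subclusters = subdivide(cluster)
```

Using the median rather than the mean keeps a single unusually long or short gap from changing the label.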
Well-known services. Certain well-known services have transaction patterns that allow association of input and output addresses. For example, the Satoshi Dice service has been studied in this context and used to connect addresses [38].
Well-known services may also be used to detect when addresses merged by other rules are likely separately controlled and should therefore be separated. For this purpose, recent work [20] divides Bitcoin services into six categories and defines a pairwise compatibility matrix for those categories. For example, gambling services are deemed incompatible with pool services, while they are compatible with exchanges. Addresses that appear in incompatible pairs of categories are deemed erroneously merged and are therefore separated.
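The compatibility check can be sketched as a lookup over unordered pairs of categories. Only the single incompatible pair named above (gambling and pool) is encoded below; the remaining entries of the six-category matrix are not given here and are deliberately left out. The function and variable names are hypothetical.

```python
# Unordered pairs of service categories deemed incompatible. Only the pair
# stated in the text is listed; the full matrix from the cited work is not
# reproduced here.
INCOMPATIBLE = {frozenset({"gambling", "pool"})}


def has_conflict(categories):
    """Return True if any two of the given categories are incompatible."""
    cats = sorted(set(categories))
    return any(
        frozenset({a, b}) in INCOMPATIBLE
        for i, a in enumerate(cats)
        for b in cats[i + 1:]
    )


# Categories observed among the addresses of each merged cluster (toy data).
merged = {
    "cluster1": {"gambling", "exchange"},
    "cluster2": {"gambling", "pool", "exchange"},
}
suspect = sorted(name for name, cats in merged.items() if has_conflict(cats))
```

A cluster flagged this way is a candidate for splitting; how the split itself is performed is a separate design decision that this sketch does not address.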
3.4.3 Scalability
Our focus has been on understanding blockchain data and on framing the space of clustering as applied to that data. Compared to clustering in its other typical application domains, work on clustering applied to blockchains is still at a very early stage. It is therefore appropriate to prioritize problem definitions and output-quality evaluation metrics over performance and scalability issues such as running time and space requirements. Nevertheless, a brief consideration of such scalability issues is worthwhile here.
General-purpose density-based clustering algorithms, such as the many variants of the DBSCAN algorithm, are good candidates because they are not only well studied but also implemented widely in several popular libraries and systems [1,26,36]. Specialized algorithms for outlier detection and network clustering are also applicable [54,59,63]. The performance of DBSCAN and related algorithms strongly depends on the nature of the data, the critical ε and MinPts parameters, and on the availability of suitable indexes [24,55]. Given the volume of blockchain data, parallel execution of clustering algorithms is a useful strategy. For example, recent work has demonstrated a DBSCAN-like algorithm that is designed for the MapReduce framework [28].
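To make the role of the two DBSCAN parameters concrete, the following is a minimal brute-force sketch in Python, with eps as the neighborhood radius and min_pts as the minimum neighborhood size (counting the point itself) for a core point. It performs a linear scan for every neighborhood query and therefore runs in quadratic time, which is precisely the cost that a suitable spatial index avoids. It is an illustration only and is not one of the cited implementations.

```python
import math


def dbscan(points, eps, min_pts):
    """Return a list of cluster labels, with -1 marking noise (brute force, O(n^2))."""
    n = len(points)
    labels = [None] * n

    def region(i):
        # Linear scan; a spatial index would replace this in practice.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The two tight groups of three points each form clusters 0 and 1, while the isolated point at (50, 50) is labeled as noise.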