Clustering Methods for Big Data Analytics and Semi-Supervised Learning



Full Text

The volume opens with a chapter titled "Overview of Scalable Partitioning Methods for Large Data Clustering." In this chapter, Ben HajKacem et al. give a theoretical and empirical overview of scalable big data clustering methods.

Partitional Clustering Methods

Then, Section 1.4 provides an experimental evaluation of big data partitional clustering methods on various simulated and real large data sets. Several works have been proposed to improve the efficiency of conventional partitional clustering methods.

Big Data Partitional Clustering Methods

  • Parallel Methods
    • MPI-Based Methods
    • GPU-Based Methods
    • MapReduce-Based Methods
    • Spark-Based Methods
  • Data Reduction-Based Methods
  • Centers Reduction-Based Methods
  • Hybrids Methods
  • Summary of Scalable Partitional Clustering Methods for Big Data Clustering

This method is motivated by the fact that k-means requires the calculation of all distances between each of the cluster centers and data points. Therefore, the LSH technique is used to reduce the number of data points when constructing cluster centers.
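The general idea can be sketched as follows. This is a minimal illustration of LSH-based data reduction for k-means, assuming random-hyperplane hashing and numpy; it is not the specific algorithm evaluated in the chapter.

```python
# Hypothetical sketch: points that collide in the same LSH bucket are
# summarized by their mean, so cluster centers are built from far
# fewer "points" than the original data set.
import numpy as np

def lsh_reduce(X, n_planes=8, seed=0):
    """Bucket points by random-hyperplane LSH; return one weighted
    representative (the bucket mean) per bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_planes))
    codes = (X @ planes > 0)                    # boolean signatures
    keys = codes.dot(1 << np.arange(n_planes))  # signature -> integer key
    reps, weights = [], []
    for k in np.unique(keys):
        bucket = X[keys == k]
        reps.append(bucket.mean(axis=0))
        weights.append(len(bucket))
    return np.array(reps), np.array(weights, dtype=float)

def weighted_kmeans(reps, w, k=3, iters=20, seed=0):
    """Standard k-means on the (weighted) bucket representatives."""
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), k, replace=False)]
    for _ in range(iters):
        d = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = np.average(reps[mask], axis=0, weights=w[mask])
    return centers

X = np.random.default_rng(1).standard_normal((10000, 2))
reps, w = lsh_reduce(X)
centers = weighted_kmeans(reps, w)  # runs on len(reps) << len(X) points
```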

KdtKM (Kanungo et al. 2002; Pelleg and Moore 2003), TiKM (Phillips 2002; Elkan et al. 2003), CDKM (Lai et al. …)

Empirical Evaluation of Partitional Clustering Methods for Large-Scale Data

This dataset was obtained from the UCI Machine Learning Repository. The second real dataset is the Household dataset (House), which contains measurements of household electricity consumption. The analysis of the empirical results first shows that hybrid methods are significantly faster than all other methods, because they simultaneously apply several acceleration techniques to the conventional k-means method.

Conclusion

Therefore, we can conclude that the MapReduce framework and the triangle inequality technique reduce the running time of the conventional k-means method without affecting the final clustering results.
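As a concrete illustration of the triangle-inequality pruning, the following hedged sketch skips a center c_j whenever 2·d(x, c_best) ≤ d(c_best, c_j), which guarantees c_j cannot be closer to x. This is the Elkan-style bound in its simplest form, not the chapter's exact implementation.

```python
# If d(c_a, c_b) >= 2*d(x, c_a), the triangle inequality gives
# d(x, c_b) >= d(c_a, c_b) - d(x, c_a) >= d(x, c_a), so the distance
# to c_b never needs to be computed and the assignment is unchanged.
import numpy as np

def assign_with_pruning(X, centers):
    cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best, best_d = 0, np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            if 2 * best_d <= cc[best, j]:  # c_j provably cannot win
                skipped += 1
                continue
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```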

Li et al., An efficient k-means clustering algorithm on MapReduce, in Proceedings of Database Systems for Advanced Applications. Hao et al., A parallel k-means clustering algorithm with MPI, in Proceedings of the Fourth International Symposium on Parallel Architectures, Algorithms and Programming.

Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams

  • Introduction
  • Streaming Data
  • Challenges of Stream Clustering of Big Data
    • Adaptation to the Stream Changes and Outlier Awareness
    • Storage Awareness and High Clustering Quality
    • Efficient Handling of High-Dimensional, Different-Density Streaming Objects
    • Flexibility to Varying Time Allowances Between Streaming Objects
    • Energy Awareness and Lightweight Clustering of Sensor Data Streams
  • Recent Contributions in the Field of Efficient Clustering of Big Data Streams
    • High-Dimensional, Density-Based Stream Clustering Algorithms
    • Advanced Anytime Stream Clustering Algorithms
    • Energy-Efficient Algorithms for Aggregating and Clustering Sensor Streaming Data
    • A Framework and an Evaluation Measure for Subspace Stream Clustering
  • Conclusion

A self-adaptation to the different densities of the data is strongly needed when designing a stream clustering algorithm. Kröger, Density-connected subspace clustering for high-dimensional data, in Proceedings of the SIAM International Conference on Data Mining (SDM).

Figure 2.1 gives some examples of real-world applications that produce data streams.

Clustering Blockchain Data

Introduction

  • Motivation
    • Fraud Detection and Law Enforcement
    • Systems Insights
    • Anonymity and Traceability
  • Contribution
  • Organization

Methods to cluster blockchain data: clustering methods are described in the context of the above conceptual models. Blockchains, and in particular the key aspects of the data they generate, are described in Section 3.2.

Fig. 3.1 Bitcoin (XBT) prices in US dollars (USD), log scale, by date. Plot generated by combining historical exchange rate data from a freely available spreadsheet [5] (2010–2013, from the defunct Mt. Gox exchange).

Blockchain Data

  • Blocks
  • Mining
  • Transactions
  • Flow of Currency

More specifically, the primary content of each block, i.e., the transactions, is not hashed en masse with the rest of the block; the transactions are combined pairwise in a Merkle tree (see Fig. 3.5). A transaction input references a previous output: it is composed of a transaction identifier together with the index of the desired output, within the ordered list of outputs of the referenced transaction.
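A minimal sketch of the pairwise Merkle-root construction follows. It is simplified for illustration: Bitcoin additionally fixes byte order and uses transaction identifiers as leaves, whereas here raw transaction bytes are hashed directly.

```python
# Sketch: combine transactions pairwise with double SHA-256, duplicating
# the last element of an odd-length level, until one root hash remains.
import hashlib

def dhash(b: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(transactions: list[bytes]) -> bytes:
    level = [dhash(tx) for tx in transactions]  # leaves
    while len(level) > 1:
        if len(level) % 2:          # odd count: duplicate the last hash
            level.append(level[-1])
        level = [dhash(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"tx-a", b"tx-b", b"tx-c"])  # toy transactions
```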

Fig. 3.5 A simplified view of the blockchain. The transactions in each block are combined pairwise recursively in tree form to yield the Merkle-root hash in the block header

Models of Blockchain Data

  • Transactions
  • Blocks
  • Addresses
  • Owners
  • Nodes

An outpoint includes a 4-byte unsigned integer that is the output index in the referenced transaction (see output index below). For example, Bitcoin implementations have some freedom in designing the signature script used to verify authorization to spend. The color of each cell is a measure of the number of transactions with corresponding input and output counts, also using a logarithmic scale.
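For illustration, here is a hedged sketch of decoding such an input reference: a 32-byte transaction identifier followed by the 4-byte unsigned little-endian output index. The exact serialization details are simplified.

```python
# Sketch: parse an outpoint (txid + 4-byte unsigned output index)
# from a byte buffer at a given offset.
import struct

def parse_outpoint(buf: bytes, offset: int = 0):
    txid = buf[offset:offset + 32]                        # referenced transaction id
    (vout,) = struct.unpack_from("<I", buf, offset + 32)  # 4-byte unsigned index
    return txid, vout, offset + 36                        # new read offset

buf = bytes(32) + struct.pack("<I", 1)  # toy: zero txid, output index 1
txid, vout, _ = parse_outpoint(buf)
```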

Fig. 3.8 Transactions per blockchain-block for four representative block groups. In the scatterplot for each block group, the horizontal axis measures the date and time (as POSIX epoch time) at which the block was committed and the vertical axis is a count

Clustering

  • Feature Extraction
  • Address Merging
  • Scalability

Such identified addresses are important resources, not only because of the direct identification they provide, but also because they can be used to bootstrap methods that can identify or classify addresses that are otherwise anonymous. Recall that there is no direct way to spend only part of the value available in an unspent transaction output (UTXO). An obvious candidate is the IP address of the peer host from which a transaction originates.
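A common way to realize address merging is the multi-input heuristic with a union-find structure: addresses that appear together as inputs of one transaction are assumed to share an owner and are merged into one cluster. The sketch below is illustrative, not a specific system's implementation.

```python
# Union-find over addresses; inputs spent together are unioned.
parent = {}

def find(a):
    parent.setdefault(a, a)
    while parent[a] != a:
        parent[a] = parent[parent[a]]  # path halving
        a = parent[a]
    return a

def union(a, b):
    parent[find(a)] = find(b)

def merge_tx_inputs(transactions):
    """transactions: iterable of lists of input addresses."""
    for inputs in transactions:
        for addr in inputs[1:]:
            union(inputs[0], addr)

merge_tx_inputs([["A", "B"], ["B", "C"], ["D"]])
assert find("A") == find("C")  # A, B, C form one address cluster
```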

Evaluation

  • Distance-Based Criteria
    • Cluster Quality Criteria
    • Mahalanobis Distance
  • Sensitivity to Cluster Count
  • Tagged Data
  • Human-Assisted Criteria

The minimization of the intra-cluster distances is a natural expression of the general preference for denser clusters. More precisely, the sim index (which must be maximized) sums, over all clusters, the difference between the total pairwise similarity for elements in the cluster and the total pairwise similarities with one object in and one object out of the cluster. For example, a study on de-anonymization of Bitcoin addresses used a graphical visualization of the user network [38].
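One plausible formalization of the sim index described above, writing s(x, y) for the assumed pairwise similarity and C for the set of clusters:

```latex
\operatorname{sim}(\mathcal{C}) \;=\; \sum_{C \in \mathcal{C}}
\Bigl( \sum_{x, y \in C} s(x, y) \;-\; \sum_{\substack{x \in C \\ y \notin C}} s(x, y) \Bigr)
```

Maximizing this index rewards high similarity inside clusters while penalizing similarity across cluster boundaries.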

Conclusion

Han, K-means clustering via principal component analysis, in Proceedings of the Twenty-First International Conference on Machine Learning, ICML'04 (ACM, Banff, 2004). Yanovich, Automatic Bitcoin address clustering, in Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico (2017). Schweiger, SCAN: a structural clustering algorithm for networks, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'07 (ACM, New York, 2007).

An Introduction to Deep Clustering

  • Introduction
  • Essential Building Blocks for Deep Clustering
    • Learning Deep Representations
    • Deep Clustering Loss Functions
  • Sequential Multistep Deep Clustering
    • Fast Spectral Clustering
    • Deep Sparse Subspace Clustering (SSC)
    • Deep Subspace Clustering (DSC)
    • Nonnegative Matrix Factorization (NMF) + K-Means
  • Joint Deep Clustering
    • Task-Specific and Graph-Regularized Network (TAGnet)
    • FaceNet
    • Deep Clustering Network (DCN)
    • Joint NMF and K-Means (JNKM)
  • Closed-Loop Multistep Deep Clustering
  • Conclusions

Next, each family of deep clustering approaches (sequential multistep, joint, and finally closed-loop multistep deep clustering) will be discussed in order. In the following, we describe some representative algorithms that belong to each family of deep clustering methods (Figure 4.12).

Fig. 4.1 Taxonomy of Deep Clustering presented in this chapter

Spark-Based Design of Clustering Using Particle Swarm Optimization

Introduction

Among these algorithms, particle swarm optimization (PSO), as one of the swarm intelligence algorithms, has gained great popularity in the past two decades and has proven to be a fertile and fruitful research area [15]. The rest of this chapter is organized as follows: Sect. 5.2 presents background on the basic concepts related to the particle swarm optimization algorithm, the MapReduce model, and the Spark framework. Section 5.6 presents experiments that we have performed to evaluate the effectiveness of the proposed method.

Background

  • Particle Swarm Optimization
  • MapReduce Model
  • Apache Spark

It consists of a swarm of particles, where each particle is considered a potential solution to the optimization problem. Once done, the results are merged to provide a final solution to the very large and complex problem [6]. Spark owes its popularity to its ability to perform in-memory computations: data does not need to be moved to and from disk but is kept in memory.
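For reference, the canonical PSO update (the chapter's exact variant may differ) moves each particle according to its own memory and the swarm's best position:

```latex
v_i(t+1) = w\, v_i(t) + c_1 r_1 \bigl(pbestP_i(t) - x_i(t)\bigr) + c_2 r_2 \bigl(gbest(t) - x_i(t)\bigr),
\qquad x_i(t+1) = x_i(t) + v_i(t+1)
```

where w is the inertia weight, c_1 and c_2 are acceleration coefficients, and r_1, r_2 are random numbers drawn uniformly from [0, 1].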

Figure 5.1 outlines the flowchart of the MapReduce paradigm. The enormous data set is divided into several chunks, each small enough to fit on a single machine; each chunk is then assigned to a map function to be processed in parallel.

Related Works

This method first assigns each data point to the closest cluster prototype in the map function. The map function receives the particle information as a key-value pair, where the key is the particle ID and the value represents all the information related to the particle. Then, after processing, the output must be written back to the file system.
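The key-value layout described above can be sketched in a few lines of PySpark. The fitness function, the toy data, and the particle layout are assumptions for illustration, not the cited methods' actual code.

```python
# Sketch: particles stored as (particle_id, candidate_centers) pairs;
# the map step evaluates each particle's clustering fitness in parallel.
from pyspark import SparkContext

def compute_fitness(position, data):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sum((p - c) ** 2 for p, c in zip(point, center))
                   for center in position) for point in data)

sc = SparkContext(appName="pso-map-sketch")
data = [(1.0, 2.0), (2.0, 1.5), (8.0, 9.0)]  # toy data set
bdata = sc.broadcast(data)                    # shipped once to every worker

particles = sc.parallelize([(0, [[1.0, 1.0], [8.0, 8.0]]),
                            (1, [[0.0, 0.0], [9.0, 9.0]])])
fitness = particles.mapValues(
    lambda pos: compute_fitness(pos, bdata.value)).collect()
```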

Table 5.1 summarizes the existing methods.

Proposed Approach: S-PSO for Clustering Large-Scale Data

  • Data Assignment and Fitness Computation Step
  • Pbest and Gbest Update Step
  • Position and Velocity Update Step
  • K-Means Iteration Step

Let P(t) = {P_1(t), …, P_S(t)} be the collection of the particles' information, where P_i(t) = {x_i(t), v_i(t), pbestP_i(t), pbestF_i(t)} represents the information of particle i at iteration t: x_i(t) is the position, v_i(t) is the velocity, pbestP_i(t) is the best position, and pbestF_i(t) is the best fitness. Let pbestP(t) = {pbestP_1(t), …, pbestP_S(t)} be the set of personal best positions, where pbestP_i(t) is the pbestP of particle i at iteration t. Let x(t) = {x_1(t), …, x_S(t)} be the set of position values, where x_i(t) is the position of particle i at iteration t.
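As a minimal sketch, the particle record P_i(t) defined above might be represented as follows; field names mirror the chapter's notation, and the concrete layout is an assumption.

```python
# Sketch: one record per particle, holding position, velocity, and
# personal-best state; a position encodes k candidate cluster centers.
from dataclasses import dataclass
import numpy as np

@dataclass
class Particle:
    x: np.ndarray        # position x_i(t): k candidate centers
    v: np.ndarray        # velocity v_i(t)
    pbest_p: np.ndarray  # personal best position pbestP_i(t)
    pbest_f: float       # personal best fitness pbestF_i(t)

swarm = [Particle(x=np.zeros((3, 2)), v=np.zeros((3, 2)),
                  pbest_p=np.zeros((3, 2)), pbest_f=float("inf"))
         for _ in range(10)]  # S = 10 particles, k = 3 centers in 2-D
```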

Fig. 5.3 Flowchart of S-PSO

Theoretical Analysis

  • Complexity Analysis
    • Time Complexity
    • Space Complexity
    • Input/Output Complexity
  • Time-To-Start Variable Analysis

The transition between the two algorithms is controlled by a variable we introduce, called Time-To-Start. If this variable is set near the beginning of the PSO run, we get a lower-quality result but a much reduced execution time, and vice versa. Therefore, this variable must be chosen in a way that ensures a balance between quality and time.

Experiments and Results

  • Methodology
  • Environment and Data Sets Description
  • Performance Measures
  • Comparison of the Performance of S-PSO Versus Existing Methods
  • Evaluation of the Impact of Time-To-Start Variable on the Performance of S-PSO
  • Scalability Analysis

The clustering process for this dataset detects the type of attack among all connections. The speedup measure consists of fixing the size of the dataset and varying the number of computing nodes. To evaluate the scaleup of our proposed method, we increase both the size of the dataset and the number of cores.
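Assuming the standard definitions of these measures (the chapter summary does not spell them out), speedup fixes the data and varies the number of nodes n, while scaleup grows data and nodes together:

```latex
\mathrm{Speedup}(n) = \frac{T_1(D)}{T_n(D)}, \qquad
\mathrm{Scaleup}(n) = \frac{T_1(D)}{T_n(n \cdot D)}
```

where T_n(D) is the running time on n nodes for a dataset of size D; the ideal values are n and 1, respectively.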

Table 5.2 Summary of the data sets

Conclusion

Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats

Introduction

The challenge of the insider threat detection problem lies in the variety of malicious insider threats in the datasets. To address the shortcoming of the high number of false alarms, we propose a streaming anomaly detection approach, namely Ensemble of Random subspace Anomaly detectors In Data Streams (E-RAIDS). Moreover, E-RAIDS is evaluated not only in terms of the number of detected threats and FP alarms, but also in terms of (1) F1 measure, (2) voted feature subspaces, (3) real-time anomaly detection, and (4) detecting (more than one) any-behavior-all-threat.

Related Work

  • Clustering for Outlier Detection
  • Streaming Anomaly Detection for Insider Threat Detection

Multiple data sources define the stream environment of the insider threat problem. In this work, data stream clustering is used to support outlier detection techniques for real-time anomaly detection. In this book chapter, we use data stream clustering to detect outliers (malicious insider threats) while reducing the number of false alarms.

Anomaly Detection in Data Streams for Insider Threat Detection

  • Insider Threat Feature Space
  • Background on Distance-Based Outlier Detection Techniques
    • Micro-Cluster-Based Continuous Outlier Detection
    • Anytime Outlier Detection
  • E-RAIDS Approach
    • Feature Subspace Anomaly Detection
    • Ensemble of Random Feature Subspaces Voting

In the following, we give a more detailed description of the feature set used in this work. Instead, it evaluates the range queries with respect to the (far fewer) centers of the micro-clusters. As described later in Sect. 6.3.3.2, if the ensemble votes to generate an alarm, the subOutSet for each feature subspace is used to evaluate whether all malicious insider threats are detected (i.e., the goal of any-behavior all-threat).
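To make the ensemble voting concrete, here is a hedged sketch of the general random-subspace voting scheme: each member detects outliers on its own random feature subspace, and an alarm is raised only when enough subspaces agree. The toy base detector stands in for MCOD/AnyOut, which are far more sophisticated.

```python
# Sketch: ensemble of random feature subspaces; a point is alarmed
# only if at least `vote_threshold` subspace detectors flag it.
import numpy as np

def detect_outliers(X, r=2.5):
    """Toy base detector: flag points far from the subspace mean."""
    z = (X - X.mean(0)) / (X.std(0) + 1e-9)
    return set(np.where((np.abs(z) > r).any(1))[0])

def subspace_vote(X, n_members=5, subspace_dim=3, vote_threshold=3, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X), dtype=int)
    for _ in range(n_members):
        feats = rng.choice(X.shape[1], subspace_dim, replace=False)
        for idx in detect_outliers(X[:, feats]):
            votes[idx] += 1                       # this subspace votes
    return np.where(votes >= vote_threshold)[0]   # alarm only on agreement

X = np.random.default_rng(1).standard_normal((500, 8))
alarms = subspace_vote(X)
```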

Fig. 6.2 E-RAIDS framework

Experiments

  • Description of the Data set
  • Experimental Tuning
  • Evaluation Measures
  • Results and Discussion
    • MCOD vs AnyOut Base Learner for E-RAIDS in Terms of Evaluation Measures

In the state of the art, a significant number of approaches have been validated in terms of the FP (false positive) measure. As mentioned above, the ultimate goal of the E-RAIDS approach is to detect all malicious insider threats over a real-time data stream while minimizing the number of false alarms. The challenge of the insider threat problem lies in the variety and complexity of malicious insider threats in the datasets.

Table 6.1 Tuned parameters

E-RAIDS-MCOD

In this work, the results for E-RAIDS-MCOD and E-RAIDS-AnyOut are presented and discussed with respect to (1) the predefined evaluation measures; (2) voted feature subspaces; (3) real-time anomaly detection; and (4) detection of (more than one) any-behavior-all-threat. In the following, we analyze the performance of E-RAIDS with the MCOD base learner vs the AnyOut base learner against the predefined evaluation measures: TPT out of PT, FP alarms, and F1 measure. The results are reported in terms of the parameter values in the given order: r, w for E-RAIDS-MCOD and τ, oscAgr, w for E-RAIDS-AnyOut.

E-RAIDS-AnyOut

  • Real-Time Anomaly Detection in E-RAIDS
  • Conclusion and Future Work

We compare the number of feature subspaces in the ensemble that voted for a malicious insider threat in each of E-RAIDS-MCOD and E-RAIDS-AnyOut. We recall the complexity of the malicious insider threat scenarios in the CMU-CERT datasets. Phillips et al., Insider threat detection, in Proceedings of the 50th Hawaii International Conference on System Sciences (2017).

Fig. 6.3 The variation of F1 measure as a function of window size w for E-RAIDS with MCOD base learner over the communities

Effective Tensor-Based Data Clustering Through Sub-Tensor Impact Graphs

Selçuk Candan, Shengyu Huang, Xinsheng Li, and Maria Luisa Sapino

  • Introduction
    • Contributions of This Chapter: Sub-Tensor Impact Graphs
  • Background
    • Tensors
    • Tensor Decomposition
    • Tensor Decomposition and Clustering
    • Block-Based Tensor Decomposition
  • Sub-Tensor Impact Graphs (SIGs) and Sub-Tensor Impact Scores
    • Accuracy Dependency Among Sub-Tensors
    • Sub-Tensor Impact Graphs (SIGs)
    • Sub-Tensor Impact Scores
  • Application #1: Block-Incremental CP Decomposition (BICP) and Update Scheduling Based on Sub-Tensor
    • Reducing Redundant Refinements
    • Evaluation
  • Application #2: Noise-Profile Adaptive Decomposition (nTD) and Sample Assignment Based on Sub-Tensor
    • Grid-Based Probabilistic Tensor Decomposition (GPTD)
    • Noise-Sensitive Sample Assignment
    • Evaluation

If the sub-tensor is empty, then the factors are 0 matrices of the appropriate size. Intuitively, the sub-tensor impact graph represents how the decomposition accuracies of a given set of sub-tensors of an input tensor affect the overall accuracy of the combined decomposition. However, as discussed earlier, inaccuracies in the decomposition of one sub-tensor can propagate to the rest of the sub-tensors in phase 2.

Fig. 7.1 A third-order (3-mode) tensor of dimensions I × J × K

[Figure panels: RMSE with noise adaptation and execution time with noise adaptation, on the CIAO dataset]

Application #3: Personalized Tensor Decomposition (PTD) and Rank Assignment Based on Sub-Tensor

  • Problem Formulation
  • Sub-Tensor Rank Flexibility
  • Rank Assignment for Personalized Tensor Decomposition
  • Evaluation
    • Setup
    • Discussion of the Results

In particular, PTD analyzes the sub-tensor impact graph (in light of the user's interest) to identify initial decomposition orders for the sub-tensors in a way that will increase the accuracy of the final decomposition for the partitions of interest. The goal of personalized tensor decomposition (PTD) is to obtain a personalized (or preference-sensitive) decomposition X̂ of X that is more accurate on the partitions of interest. The PTD algorithm then uses this graph to calculate the impact of the inaccuracy of the initial decomposition of a sub-tensor on the final decomposition accuracy of X_P, i.e., the cells of X collectively covered by the user's statement of interest (i.e., K_P).

Table 7.1 Various tensor partitioning scenarios considered in the evaluation

Conclusions

Faloutsos, GigaTensor: scaling tensor analysis up by 100 times, algorithms and discoveries, in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). …, in Proceedings of the IEEE International Conference on Data Mining (ICDM) (2008). Xiong et al., Temporal collaborative filtering with Bayesian probabilistic tensor factorization, in Proceedings of the 2010 SIAM International Conference on Data Mining (2010).

Index

GPU, see Graphics processing unit. GPU-based k-means method (GPUKM), 6. GPU fuzzy c-means method (GPUFCM), 6. Graph-based anomaly detection (GBAD). MapReduce model, 92, 96: MR-CPSO (see MR-CPSO); using Spark (see Spark-based PSO clustering method); in fitness computation, 96; hybrid method, 95–96; personal best position, 93. Anomaly detection system in real-time: E-RAIDS (see Ensemble of random subspace anomaly detectors in data streams).

Figures

KdtKM (Kanungo et al. 2002; Pelleg and Moore 2003), TiKM (Phillips 2002; Elkan et al. 2003), CDKM (Lai et al. …)
Fig. 1.2 GPU architecture with three multiprocessors and three streaming processors
Fig. 1.3 Data flow of MapReduce framework
Fig. 2.2 Reference [12]. Two applications of mining body-generated streaming data. (a) In a health care scenario [13] and (b) in a translation scenario in collaboration with psycholinguists in the humanities area [21]
