The volume opens with a chapter titled "Overview of Scalable Partitioning Methods for Large Data Clustering." In this chapter, BenHaj Kacem et al. give a theoretical and empirical overview of scalable big data clustering methods.
Partitional Clustering Methods
Then, Section 1.4 provides an experimental evaluation of big data partitional clustering methods on various simulated and real large datasets. Several works have been proposed to improve the efficiency of conventional partitional clustering methods.
Big Data Partitional Clustering Methods
- Parallel Methods
- MPI-Based Methods
- GPU-Based Methods
- MapReduce-Based Methods
- Spark-Based Methods
- Data Reduction-Based Methods
- Centers Reduction-Based Methods
- Hybrid Methods
- Summary of Scalable Partitional Clustering Methods for Big Data Clustering
This method is motivated by the fact that k-means requires computing the distance between every cluster center and every data point. The LSH technique is therefore used to reduce the number of data points considered when constructing cluster centers.
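To make the pruning idea concrete, here is a minimal single-machine sketch (our illustration, not the chapter's implementation) in which random-hyperplane LSH buckets points and centers together, so each point is compared only with the centers hashed to its bucket; all names (`hyperplane_signatures`, `lsh_kmeans_step`) are illustrative.

```python
import numpy as np

def hyperplane_signatures(points, planes):
    """Random-hyperplane LSH: the sign pattern of projections is the bucket key."""
    return (points @ planes.T > 0).astype(np.uint8)

def lsh_kmeans_step(points, centers, n_planes=8, seed=0):
    """One k-means assignment step comparing each point only with centers
    that fall in the same LSH bucket (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, points.shape[1]))
    point_keys = [tuple(s) for s in hyperplane_signatures(points, planes)]
    center_keys = [tuple(s) for s in hyperplane_signatures(centers, planes)]

    # Group center indices by bucket key.
    buckets = {}
    for j, key in enumerate(center_keys):
        buckets.setdefault(key, []).append(j)

    labels = np.empty(len(points), dtype=int)
    for i, key in enumerate(point_keys):
        # Fall back to all centers when the point's bucket holds none.
        candidates = list(buckets.get(key, range(len(centers))))
        dists = np.linalg.norm(points[i] - centers[candidates], axis=1)
        labels[i] = candidates[int(np.argmin(dists))]
    return labels
```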
Empirical Evaluation of Partitional Clustering Methods for Large-Scale Data
This dataset was obtained from the UCI machine learning repository. The second real dataset is the Household dataset (House), which contains the results of household electricity consumption measurements. The analysis of the empirical results first shows that hybrid methods are significantly faster than all other methods, because they simultaneously use different acceleration techniques to improve the efficiency of the conventional k-means method.
Conclusion
Therefore, we can conclude that the MapReduce framework and the triangle inequality reduce the running time of the conventional k-means method without affecting the final clustering results.
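The triangle-inequality pruning works because, for a point x whose current best center is c, any center c' with d(c, c') >= 2 d(x, c) cannot be closer to x. A minimal Elkan-style sketch of this test (our illustration, not the chapter's code):

```python
import numpy as np

def assign_with_triangle_pruning(points, centers):
    """Assign points to nearest centers, skipping distance computations
    ruled out by the triangle inequality: if d(x, c_best) <= 0.5 * d(c_best, c_j),
    then c_j cannot be closer to x than c_best."""
    # Precompute pairwise center-to-center distances once per iteration.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    labels = np.empty(len(points), dtype=int)
    skipped = 0
    for i, x in enumerate(points):
        best = 0
        best_d = np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            if best_d <= 0.5 * cc[best, j]:  # pruned: c_j cannot win
                skipped += 1
                continue
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```

The pruning changes no assignment decision, which is why the clustering result is unaffected while the number of distance computations drops.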
Li et al., An efficient k-means clustering algorithm on MapReduce, in Proceedings of Database Systems for Advanced Applications. Hao et al., A parallel k-means clustering algorithm with MPI, in Proceedings of the Fourth International Symposium on Parallel Architectures, Algorithms and Programming.
Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams
- Introduction
- Streaming Data
- Challenges of Stream Clustering of Big Data
- Adaptation to the Stream Changes and Outlier Awareness
- Storage Awareness and High Clustering Quality
- Efficient Handling of High-Dimensional, Different-Density Streaming Objects
- Flexibility to Varying Time Allowances Between Streaming Objects
- Energy Awareness and Lightweight Clustering of Sensor Data Streams
- Recent Contributions in the Field of Efficient Clustering of Big Data Streams
- High-Dimensional, Density-Based Stream Clustering Algorithms
- Advanced Anytime Stream Clustering Algorithms
- Energy-Efficient Algorithms for Aggregating and Clustering Sensor Streaming Data
- A Framework and an Evaluation Measure for Subspace Stream Clustering
- Conclusion
A self-adaptation to the different densities of the data is strongly needed when designing a stream clustering algorithm. Kröger et al., Density-connected subspace clustering for high-dimensional data, in Proceedings of the SIAM International Conference on Data Mining (SDM).
Clustering Blockchain Data
Introduction
- Motivation
- Fraud Detection and Law Enforcement
- Systems Insights
- Anonymity and Traceability
- Contribution
- Organization
Blockchains, and in particular the key aspects of the data they generate, are described in Sect. 3.2. Clustering methods are then described in the context of the above conceptual models.
Blockchain Data
- Blocks
- Mining
- Transactions
- Flow of Currency
More specifically, the primary content of each block, i.e., the transactions, is not hashed en masse with the rest of the block. Each transaction input is composed of a transaction identifier together with the index of the desired output of the referenced transaction, in that transaction's ordered list of outputs.
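For concreteness, a transaction input can be modeled as a reference (transaction identifier plus output index) into an earlier transaction's output list. The following sketch uses hypothetical type names and mirrors the structure described above rather than any particular client's wire format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OutPoint:
    """Reference to a specific output of an earlier transaction."""
    txid: str   # identifier (hash) of the referenced transaction
    vout: int   # index into that transaction's ordered list of outputs

@dataclass
class TxInput:
    prev_out: OutPoint   # the unspent output (UTXO) being consumed
    script_sig: bytes    # authorization to spend the referenced output

@dataclass
class TxOutput:
    value_satoshis: int   # amount carried by this output
    script_pubkey: bytes  # conditions under which it can later be spent

@dataclass
class Transaction:
    inputs: List[TxInput]
    outputs: List[TxOutput]
```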
Models of Blockchain Data
- Transactions
- Blocks
- Addresses
- Owners
- Nodes
A 4-byte unsigned integer gives the output index in the referenced transaction (see output index below). For example, Bitcoin implementations have some freedom in designing the signature script used to verify authorization to spend. The color of each cell is a measure of the number of transactions with the corresponding input and output counts, also using a logarithmic scale.
Clustering
- Feature Extraction
- Address Merging
- Scalability
Such identified addresses are important resources not only because of the direct identification they provide, but also because they can be used to bootstrap methods that identify or classify addresses that are otherwise anonymous. Recall that there is no direct way to spend only part of the value available in an unspent transaction output (UTXO). An obvious candidate is the IP address of the peer host from which a transaction originates.
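The classic address-merging heuristic exploits exactly this UTXO behavior: all addresses appearing together as inputs of one transaction are presumed to share an owner. A minimal union-find sketch (illustrative names; it assumes input addresses have already been extracted per transaction):

```python
class DisjointSet:
    """Union-find over addresses for common-input address merging."""
    def __init__(self):
        self.parent = {}

    def find(self, a):
        self.parent.setdefault(a, a)
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def merge_input_addresses(transactions, ds=None):
    """Common-input heuristic: addresses spent together in one transaction
    are assumed to share an owner, so their clusters are merged."""
    ds = ds or DisjointSet()
    for tx in transactions:
        addrs = tx["input_addresses"]  # assumed pre-extracted per transaction
        for other in addrs[1:]:
            ds.union(addrs[0], other)
    return ds
```

Union-find keeps the merging near-linear in the number of inputs, which matters at blockchain scale (see the Scalability section above).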
Evaluation
- Distance-Based Criteria
- Cluster Quality Criteria
- Mahalanobis Distance
- Sensitivity to Cluster Count
- Tagged Data
- Human-Assisted Criteria
The minimization of the intra-cluster distances is a natural expression of the general preference for denser clusters. More precisely, the sim index (which must be maximized) sums, over all clusters, the difference between the total pairwise similarity for elements in the cluster and the total pairwise similarities with one object in and one object out of the cluster. For example, a study on de-anonymization of Bitcoin addresses used a graphical visualization of the user network [38].
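A plausible formalization of this description, with s(x, y) a pairwise similarity and C_1, ..., C_K the clusters (the notation is ours, not necessarily the chapter's):

```latex
\mathrm{sim} \;=\; \sum_{k=1}^{K} \left( \sum_{x,\, y \in C_k} s(x, y) \;-\; \sum_{x \in C_k,\; y \notin C_k} s(x, y) \right)
```

Maximizing this index rewards high within-cluster similarity while penalizing similarity across cluster boundaries, matching the stated preference for denser clusters.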
Conclusion
Han et al., K-means clustering via principal component analysis, in Proceedings of the Twenty-First International Conference on Machine Learning, ICML'04 (ACM, Banff, 2004). Yanovich et al., Automatic Bitcoin address clustering, in Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico (2017). Schweiger et al., SCAN: a structural clustering algorithm for networks, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'07 (ACM, New York, 2007).
An Introduction to Deep Clustering
- Introduction
- Essential Building Blocks for Deep Clustering
- Learning Deep Representations
- Deep Clustering Loss Functions
- Sequential Multistep Deep Clustering
- Fast Spectral Clustering
- Deep Sparse Subspace Clustering (SSC)
- Deep Subspace Clustering (DSC)
- Nonnegative Matrix Factorization (NMF) + K-Means
- Joint Deep Clustering
- Task-Specific and Graph-Regularized Network (TAGnet)
- FaceNet
- Deep Clustering Network (DCN)
- Joint NMF and K-Means (JNKM)
- Closed-Loop Multistep Deep Clustering
- Conclusions
Next, each family of deep clustering approaches (sequential multistep, joint, and finally closed-loop multistep deep clustering) is discussed in turn. In the following, we describe some representative algorithms that belong to each family of deep clustering methods (Figure 4.12).
Spark-Based Design of Clustering Using Particle Swarm Optimization
Introduction
Among these algorithms, particle swarm optimization (PSO), one of the swarm intelligence algorithms, has gained great popularity in the past two decades and has proved to be a fruitful research area [15]. The rest of this chapter is organized as follows: Sect. 5.2 presents a background on the basic concepts related to the particle swarm optimization algorithm, the MapReduce model, and the Spark framework. Section 5.6 presents experiments that we have performed to evaluate the effectiveness of the proposed method.
Background
- Particle Swarm Optimization
- MapReduce Model
- Apache Spark
PSO consists of a swarm of particles, where each particle is considered a potential solution to the optimization problem. Once done, the results are merged to provide a final solution to the very large and complex problem [6]. Spark owes its popularity to its ability to perform in-memory computations: data does not need to be moved to and from disk but is kept in memory.
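For reference, the textbook PSO update that such a swarm iterates (a generic sketch; the chapter's Spark design distributes these computations across workers):

```python
import numpy as np

def pso_update(x, v, pbest, gbest, w=0.72, c1=1.49, c2=1.49, rng=None):
    """Standard PSO velocity/position update:
    v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x <- x + v."""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v
```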
Related Works
This method first assigns each data point to the closest cluster prototype in the map function. The map function receives the particle information as a key-value pair, where the key is the particle ID and the value represents all the information related to the particle. Then, after processing, the output must be written back to the file system.
Proposed Approach: S-PSO for Clustering Large-Scale Data
- Data Assignment and Fitness Computation Step
- Pbest and Gbest Update Step
- Position and Velocity Update Step
- K-Means Iteration Step
Let P(t) = {P_1(t), ..., P_S(t)} be the collection of the particles' information, where P_i(t) = {x_i(t), v_i(t), pbestP_i(t), pbestF_i(t)} represents the information of particle i at iteration t: x_i(t) is the position, v_i(t) is the velocity, pbestP_i(t) is the best position, and pbestF_i(t) is the best fitness. Let pbestP(t) = {pbestP_1(t), ..., pbestP_S(t)} be the set of personal best positions, where pbestP_i(t) is the pbestP of particle i at iteration t. Let x(t) = {x_1(t), ..., x_S(t)} be the set of position values, where x_i(t) is the position of particle i at iteration t.
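A minimal sketch of this particle representation and the clustering fitness it would be evaluated with (hypothetical names; in the Spark design each particle would travel as a key-value pair keyed by its ID):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Particle:
    """P_i(t): one candidate clustering solution (k centroid positions)."""
    x: np.ndarray        # position: k x d matrix of cluster centers
    v: np.ndarray        # velocity, same shape as x
    pbest_p: np.ndarray  # best position found so far (pbestP_i)
    pbest_f: float       # fitness of that best position (pbestF_i)

def clustering_fitness(centers, data):
    """Fitness commonly used in PSO clustering: total distance of each
    point to its nearest center (lower is better)."""
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).sum()
```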
Theoretical Analysis
- Complexity Analysis
- Time Complexity
- Space Complexity
- Input/Output Complexity
- Time-To-Start Variable Analysis
The switch between the two algorithms is controlled by a variable we introduce called Time-To-Start. If this variable is set near the beginning of the PSO run, we obtain a lower-quality result but a much reduced execution time, and vice versa. Therefore, this variable must be chosen in a way that balances quality and running time.
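A compact sketch of the switch logic (our illustration: the PSO exploration step is replaced by a random perturbation stand-in so the example stays self-contained):

```python
import numpy as np

def hybrid_pso_kmeans(data, k, max_iter=50, time_to_start=30, seed=0):
    """Time-To-Start switch sketch: explore center sets up to time_to_start,
    then refine the best candidate with Lloyd (k-means) iterations."""
    rng = np.random.default_rng(seed)
    best = data[rng.choice(len(data), k, replace=False)]
    best_fit = np.linalg.norm(data[:, None] - best[None], axis=2).min(1).sum()

    for t in range(max_iter):
        if t < time_to_start:
            # Exploration phase (stand-in for the PSO update): perturb centers.
            cand = best + rng.normal(scale=data.std(0), size=best.shape)
        else:
            # Refinement phase: one k-means iteration from the current best.
            labels = np.linalg.norm(data[:, None] - best[None], axis=2).argmin(1)
            cand = np.array([data[labels == j].mean(0) if (labels == j).any()
                             else best[j] for j in range(k)])
        fit = np.linalg.norm(data[:, None] - cand[None], axis=2).min(1).sum()
        if fit < best_fit:
            best, best_fit = cand, fit
    return best
```

A small time_to_start spends most iterations in cheap refinement (fast, lower quality); a large one spends them in exploration (slower, better optima), which is the balance the variable controls.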
Experiments and Results
- Methodology
- Environment and Data Sets Description
- Performance Measures
- Comparison of the Performance of S-PSO Versus Existing Methods
- Evaluation of the Impact of Time-To-Start Variable on the Performance of S-PSO
- Scalability Analysis
The clustering process for this dataset detects the type of attack among all connections. The speedup measure consists of fixing the size of the dataset and varying the number of computing nodes. To evaluate the scaleup of our proposed method, we increase both the size of the dataset and the number of cores.
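For reference, the commonly used definitions of these measures (the chapter's exact formulas may differ):

```latex
\text{Speedup}(n) \;=\; \frac{T_1}{T_n},
\qquad
\text{Scaleup}(n) \;=\; \frac{T_1^{(1)}}{T_n^{(n)}}
```

Here T_n is the running time on n nodes for a fixed dataset, and T_n^{(n)} is the running time on n nodes with the dataset also scaled n-fold; the ideal values are n for speedup and 1 for scaleup.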
Conclusion
Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats
Introduction
The challenge of the insider threat detection problem lies in the variety of malicious insider threats in the datasets. To address the shortcoming of the high number of false alarms, we propose a streaming anomaly detection approach, namely Ensemble of Random subspace Anomaly detectors In Data Streams (E-RAIDS). Moreover, E-RAIDS is evaluated not only in terms of the number of detected threats and FP alarms, but also in terms of (1) F1 measure, (2) voting feature subspaces, (3) real-time anomaly detection, and (4) detecting (more than one)-behavior-all-threat.
Related Work
- Clustering for Outlier Detection
- Streaming Anomaly Detection for Insider Threat Detection
Multiple data sources define the stream environment of the insider threat problem. In this work, data stream clustering is used to support outlier detection techniques for real-time anomaly detection. In this book chapter, we use data stream clustering to detect outliers (malicious insider threats) while reducing the number of false alarms.
Anomaly Detection in Data Streams for Insider Threat Detection
- Insider Threat Feature Space
- Background on Distance-Based Outlier Detection Techniques
- Micro-Cluster-Based Continuous Outlier Detection
- Anytime Outlier Detection
- E-RAIDS Approach
- Feature Subspace Anomaly Detection
- Ensemble of Random Feature Subspaces Voting
In the following, we give a more detailed description of the feature set used in this work. Instead, it evaluates the range queries with respect to the (fewer) centers of the micro-clusters. As described later in Sect. 6.3.3.2, if the ensemble votes to generate an alarm, the subOutSet for each feature subspace is used to evaluate whether all malicious insider threats are detected (i.e., the goal of any-behavior-all-threat).
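The saving comes from answering range queries at the granularity of micro-clusters: by the triangle inequality, a micro-cluster of radius rho whose center lies within r - rho of the query point contributes all of its members, and one farther than r + rho contributes none. A minimal sketch of this pruning (illustrative only, not MCOD's actual data structures):

```python
import numpy as np

def count_neighbors_via_microclusters(q, centers, counts, rho, r):
    """Micro-cluster pruning sketch: classify each micro-cluster as fully
    inside, fully outside, or on the boundary of the range query around q.
    Only boundary clusters would need point-level checks (omitted here)."""
    d = np.linalg.norm(centers - q, axis=1)
    fully_inside = d <= r - rho    # every member is within r of q
    fully_outside = d > r + rho    # no member can be within r of q
    guaranteed = counts[fully_inside].sum()
    boundary = np.where(~fully_inside & ~fully_outside)[0]
    return guaranteed, boundary    # boundary clusters need refinement
```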
Experiments
- Description of the Data set
- Experimental Tuning
- Evaluation Measures
- Results and Discussion
- MCOD vs AnyOut Base Learner for E-RAIDS in Terms of Evaluation Measures
In the state of the art, a significant number of approaches have been validated in terms of the FP measure. As mentioned above, the ultimate goal of the E-RAIDS approach is to detect all malicious insider threats over a real-time data stream while minimizing the number of false alarms. The challenge of the insider threat problem lies in the variety and complexity of malicious insider threats in the datasets.
E-RAIDS-MCOD
In this work, the results for E-RAIDS-MCOD and E-RAIDS-AnyOut are presented and discussed with respect to (1) the predefined evaluation measures, (2) voting feature subspaces, (3) real-time anomaly detection, and (4) detection of (more than one)-behavior-all-threat. In the following, we analyze the performance of E-RAIDS with the MCOD base learner vs the AnyOut base learner against the predefined evaluation measures: TPT out of PT, FPAlarm, and F1 measure. The results are reported in terms of the parameter values, respectively (r, w) for E-RAIDS-MCOD and (τ, oscAgr, w) for E-RAIDS-AnyOut.
E-RAIDS-AnyOut
- Real-Time Anomaly Detection in E-RAIDS
- Conclusion and Future Work
We compare the number of feature subspaces in the ensemble that voted for a malicious insider threat in each of E-RAIDS-MCOD and E-RAIDS-AnyOut. We recall the complexity of the malicious insider threat scenarios in the CMU-CERT datasets. Phillips et al., Insider threat detection, in Proceedings of the 50th Hawaii International Conference on System Sciences (2017).
Effective Tensor-Based Data Clustering Through Sub-Tensor Impact Graphs
Selçuk Candan, Shengyu Huang, Xinsheng Li, and Maria Luisa Sapino
- Introduction
- Contributions of This Chapter: Sub-Tensor Impact Graphs
- Background
- Tensors
- Tensor Decomposition
- Tensor Decomposition and Clustering
- Block-Based Tensor Decomposition
- Sub-Tensor Impact Graphs (SIGs) and Sub-Tensor Impact Scores
- Accuracy Dependency Among Sub-Tensors
- Sub-Tensor Impact Graphs (SIGs)
- Sub-Tensor Impact Scores
- Application #1: Block-Incremental CP Decomposition (BICP) and Update Scheduling Based on Sub-Tensor
- Reducing Redundant Refinements
- Evaluation
- Application #2: Noise-Profile Adaptive Decomposition (nTD) and Sample Assignment Based on Sub-Tensor
- Grid-Based Probabilistic Tensor Decomposition (GPTD)
- Noise-Sensitive Sample Assignment
- Evaluation
If the sub-tensor is empty, then the factors are 0 matrices of the appropriate size. Intuitively, the sub-tensor impact graph represents how the decomposition accuracies of a given set of sub-tensors of an input tensor affect the overall accuracy of the combined decomposition. However, as discussed earlier, inaccuracies in the decomposition of one sub-tensor can propagate to the rest of the sub-tensors in Phase 2.
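One natural way to turn such a dependency graph into per-sub-tensor scores is a PageRank-style propagation, sketched below under our own assumptions (the chapter defines its own impact-score computation; the edge weights and seed weights here are placeholders):

```python
import numpy as np

def impact_scores(adj, seed_weights, alpha=0.85, iters=50):
    """PageRank-style propagation sketch for sub-tensor impact scores:
    a sub-tensor's score mixes its own seed weight (e.g., user interest
    or noise level) with score flowing in from dependent sub-tensors.
    adj[i, j] = strength with which sub-tensor i's accuracy affects j."""
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0        # avoid division by zero
    P = adj / col_sums                    # column-normalized transitions
    s = seed_weights / seed_weights.sum()
    scores = np.full(len(s), 1.0 / len(s))
    for _ in range(iters):
        scores = alpha * (P @ scores) + (1 - alpha) * s
    return scores
```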
Figures: RMSE and execution time with noise adaptation on the CIAO dataset.
Application #3: Personalized Tensor Decomposition (PTD) and Rank Assignment Based on Sub-Tensor
- Problem Formulation
- Sub-Tensor Rank Flexibility
- Rank Assignment for Personalized Tensor Decomposition
- Evaluation
- Setup
- Discussion of the Results
In particular, PTD analyzes the sub-tensor impact graph (in light of the user's interest) to identify initial decomposition orders for the sub-tensors in a way that will increase the accuracy of the final decomposition for the partitions of interest. The goal of personalized tensor decomposition (PTD) is to obtain a personalized (or preference-sensitive) decomposition X̂ of X. The PTD algorithm then uses this graph to calculate the impact of the inaccuracy of the initial decomposition of a sub-tensor on the final decomposition accuracy of X_P, i.e., the cells of X collectively covered by the user's statement of interest (i.e., K_P).
Conclusions
Faloutsos et al., GigaTensor: scaling tensor analysis up by 100 times – algorithms and discoveries, in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). In Proceedings of the IEEE International Conference on Data Mining (ICDM) (2008). Xiong et al., Temporal collaborative filtering with Bayesian probabilistic tensor factorization, in Proceedings of the 2010 SIAM International Conference on Data Mining (2010).
Index
GPU, see Graphics processing unit
GPU-based k-means method (GPUKM), 6
GPU fuzzy c-means method (GPUFCM), 6
Graph-based anomaly detection (GBAD)
MapReduce model, 92, 96
- MR-CPSO (see MR-CPSO)
- using Spark (see Spark-based PSO clustering method)
- in fitness calculation, 96
- hybrid method, 95-96
- personal best position, 93
Anomaly detection system in real-time, E-RAIDS (see Ensemble of random subspace anomaly detectors in data streams)