Clustering Methods for Big Data Analytics and Semi-Supervised Learning



Full Text

The volume opens with a chapter titled "Overview of Scalable Partitioning Methods for Large Data Clustering." In this chapter, Ben HajKacem et al. give a theoretical and empirical overview of scalable big data clustering methods.

Partitional Clustering Methods

Then, Section 1.4 provides an experimental evaluation of big data partitional clustering methods on various simulated and real large data sets. Several works have been proposed to improve the efficiency of conventional partitional clustering methods.

Big Data Partitional Clustering Methods

  • Parallel Methods
    • MPI-Based Methods
    • GPU-Based Methods
    • MapReduce-Based Methods
    • Spark-Based Methods
  • Data Reduction-Based Methods
  • Centers Reduction-Based Methods
  • Hybrids Methods
  • Summary of Scalable Partitional Clustering Methods for Big Data Clustering

This method is motivated by the fact that k-means requires the calculation of all distances between each of the cluster centers and data points. Therefore, the LSH technique is used to reduce the number of data points when constructing cluster centers.
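The general idea can be sketched as follows. This is a minimal illustration of LSH-based data reduction for k-means, assuming random-hyperplane hashing and numpy; it is not the specific algorithm evaluated in the chapter.

```python
# Hypothetical sketch: points that collide in the same LSH bucket are
# summarized by their mean, so cluster centers are built from far
# fewer "points" than the original data set.
import numpy as np

def lsh_reduce(X, n_planes=8, seed=0):
    """Bucket points by random-hyperplane LSH; return one weighted
    representative (the bucket mean) per bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_planes))
    codes = (X @ planes > 0)                    # boolean signatures
    keys = codes.dot(1 << np.arange(n_planes))  # signature -> integer key
    reps, weights = [], []
    for k in np.unique(keys):
        bucket = X[keys == k]
        reps.append(bucket.mean(axis=0))
        weights.append(len(bucket))
    return np.array(reps), np.array(weights, dtype=float)

def weighted_kmeans(reps, w, k=3, iters=20, seed=0):
    """Standard k-means on the (weighted) bucket representatives."""
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), k, replace=False)]
    for _ in range(iters):
        d = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = np.average(reps[mask], axis=0, weights=w[mask])
    return centers

X = np.random.default_rng(1).standard_normal((10000, 2))
reps, w = lsh_reduce(X)
centers = weighted_kmeans(reps, w)  # runs on len(reps) << len(X) points
```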

KdtKM (Kanungo et al. 2002; Pelleg and Moore 2003), TiKM (Phillips 2002; Elkan et al. 2003), CDKM (Lai et al. …)

Empirical Evaluation of Partitional Clustering Methods for Large-Scale Data

This dataset was obtained from the UCI Machine Learning Repository. The second real dataset is the Household dataset (House), which contains measurements of household electricity consumption. The analysis of the empirical results first shows that hybrid methods are significantly faster than all other methods, because they simultaneously apply several acceleration techniques to the conventional k-means method.

Conclusion

Therefore, we can conclude that the MapReduce framework and the triangle inequality technique reduce the running time of the conventional k-means method without affecting the final clustering results.
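As a concrete illustration of the triangle-inequality pruning, the following hedged sketch skips a center c_j whenever 2·d(x, c_best) ≤ d(c_best, c_j), which guarantees c_j cannot be closer to x. This is the Elkan-style bound in its simplest form, not the chapter's exact implementation.

```python
# If d(c_a, c_b) >= 2*d(x, c_a), the triangle inequality gives
# d(x, c_b) >= d(c_a, c_b) - d(x, c_a) >= d(x, c_a), so the distance
# to c_b never needs to be computed and the assignment is unchanged.
import numpy as np

def assign_with_pruning(X, centers):
    cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best, best_d = 0, np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            if 2 * best_d <= cc[best, j]:  # c_j provably cannot win
                skipped += 1
                continue
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```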

Li et al., An efficient k-means clustering algorithm on MapReduce, in Proceedings of Database Systems for Advanced Applications. Hao et al., A parallel k-means clustering algorithm with MPI, in Proceedings of the Fourth International Symposium on Parallel Architectures, Algorithms and Programming.

Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams

  • Introduction
  • Streaming Data
  • Challenges of Stream Clustering of Big Data
    • Adaptation to the Stream Changes and Outlier Awareness
    • Storage Awareness and High Clustering Quality
    • Efficient Handling of High-Dimensional, Different-Density Streaming Objects
    • Flexibility to Varying Time Allowances Between Streaming Objects
    • Energy Awareness and Lightweight Clustering of Sensor Data Streams
  • Recent Contributions in the Field of Efficient Clustering of Big Data Streams
    • High-Dimensional, Density-Based Stream Clustering Algorithms
    • Advanced Anytime Stream Clustering Algorithms
    • Energy-Efficient Algorithms for Aggregating and Clustering Sensor Streaming Data
    • A Framework and an Evaluation Measure for Subspace Stream Clustering
  • Conclusion

A self-adaptation to the different densities of the data is strongly needed when designing a stream clustering algorithm. Kröger, Density-connected subspace clustering for high-dimensional data, in Proceedings of the SIAM International Conference on Data Mining (SDM).

Figure 2.1 gives some examples of real-world applications that produce data streams.

Clustering Blockchain Data

Introduction

  • Motivation
    • Fraud Detection and Law Enforcement
    • Systems Insights
    • Anonymity and Traceability
  • Contribution
  • Organization

Methods to cluster blockchain data: clustering methods are described in the context of the above conceptual models. Blockchains, and in particular the key aspects of the data they generate, are described in Section 3.2.

Fig. 3.1 Bitcoin (XBT) prices in US dollars (USD), log scale, by date. Plot generated by combining historical exchange rate data from a freely available spreadsheet [5] (2010–2013, from the defunct Mt. Gox exchange).

Blockchain Data

  • Blocks
  • Mining
  • Transactions
  • Flow of Currency

More specifically, the primary content of each block, i.e., the transactions, is not hashed en masse with the rest of the block; the transactions are combined pairwise in a Merkle tree (see Fig. 3.5). A transaction input references a previous output: it is composed of a transaction identifier together with the index of the desired output, within the ordered list of outputs of the referenced transaction.
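A minimal sketch of the pairwise Merkle-root construction follows. It is simplified for illustration: Bitcoin additionally fixes byte order and uses transaction identifiers as leaves, whereas here raw transaction bytes are hashed directly.

```python
# Sketch: combine transactions pairwise with double SHA-256, duplicating
# the last element of an odd-length level, until one root hash remains.
import hashlib

def dhash(b: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(transactions: list[bytes]) -> bytes:
    level = [dhash(tx) for tx in transactions]  # leaves
    while len(level) > 1:
        if len(level) % 2:          # odd count: duplicate the last hash
            level.append(level[-1])
        level = [dhash(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"tx-a", b"tx-b", b"tx-c"])  # toy transactions
```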

Fig. 3.5 A simplified view of the blockchain. The transactions in each block are combined pairwise recursively in tree form to yield the Merkle-root hash in the block header

Models of Blockchain Data

  • Transactions
  • Blocks
  • Addresses
  • Owners
  • Nodes

An outpoint includes a 4-byte unsigned integer that is the output index in the referenced transaction (see output index below). For example, Bitcoin implementations have some freedom in designing the signature script used to verify authorization to spend. The color of each cell is a measure of the number of transactions with corresponding input and output counts, also using a logarithmic scale.
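For illustration, here is a hedged sketch of decoding such an input reference: a 32-byte transaction identifier followed by the 4-byte unsigned little-endian output index. The exact serialization details are simplified.

```python
# Sketch: parse an outpoint (txid + 4-byte unsigned output index)
# from a byte buffer at a given offset.
import struct

def parse_outpoint(buf: bytes, offset: int = 0):
    txid = buf[offset:offset + 32]                        # referenced transaction id
    (vout,) = struct.unpack_from("<I", buf, offset + 32)  # 4-byte unsigned index
    return txid, vout, offset + 36                        # new read offset

buf = bytes(32) + struct.pack("<I", 1)  # toy: zero txid, output index 1
txid, vout, _ = parse_outpoint(buf)
```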

Fig. 3.8 Transactions per blockchain-block for four representative block groups. In the scatterplot for each block group, the horizontal axis measures the date and time (as POSIX epoch time) at which the block was committed and the vertical axis is a count

Clustering

  • Feature Extraction
  • Address Merging
  • Scalability

Such identified addresses are important resources, not only because of the direct identification they provide, but also because they can be used to bootstrap methods that can identify or classify addresses that are otherwise anonymous. Recall that there is no direct way to spend only part of the value available in an unspent transaction output (UTXO). An obvious candidate is the IP address of the peer host from which a transaction originates.
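A common way to realize address merging is the multi-input heuristic with a union-find structure: addresses that appear together as inputs of one transaction are assumed to share an owner and are merged into one cluster. The sketch below is illustrative, not a specific system's implementation.

```python
# Union-find over addresses; inputs spent together are unioned.
parent = {}

def find(a):
    parent.setdefault(a, a)
    while parent[a] != a:
        parent[a] = parent[parent[a]]  # path halving
        a = parent[a]
    return a

def union(a, b):
    parent[find(a)] = find(b)

def merge_tx_inputs(transactions):
    """transactions: iterable of lists of input addresses."""
    for inputs in transactions:
        for addr in inputs[1:]:
            union(inputs[0], addr)

merge_tx_inputs([["A", "B"], ["B", "C"], ["D"]])
assert find("A") == find("C")  # A, B, C form one address cluster
```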

Evaluation

  • Distance-Based Criteria
    • Cluster Quality Criteria
    • Mahalanobis Distance
  • Sensitivity to Cluster Count
  • Tagged Data
  • Human-Assisted Criteria

The minimization of the intra-cluster distances is a natural expression of the general preference for denser clusters. More precisely, the sim index (which must be maximized) sums, over all clusters, the difference between the total pairwise similarity for elements in the cluster and the total pairwise similarities with one object in and one object out of the cluster. For example, a study on de-anonymization of Bitcoin addresses used a graphical visualization of the user network [38].
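One plausible formalization of the sim index described above, writing s(x, y) for the assumed pairwise similarity and C for the set of clusters:

```latex
\operatorname{sim}(\mathcal{C}) \;=\; \sum_{C \in \mathcal{C}}
\Bigl( \sum_{x, y \in C} s(x, y) \;-\; \sum_{\substack{x \in C \\ y \notin C}} s(x, y) \Bigr)
```

Maximizing this index rewards high similarity inside clusters while penalizing similarity across cluster boundaries.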

Conclusion

Han, K-means clustering via principal component analysis, in Proceedings of the Twenty-First International Conference on Machine Learning, ICML'04 (ACM, Banff, 2004). Yanovich, Automatic Bitcoin address clustering, in Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico (2017). Schweiger, SCAN: a structural clustering algorithm for networks, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'07 (ACM, New York, 2007).

An Introduction to Deep Clustering

  • Introduction
  • Essential Building Blocks for Deep Clustering
    • Learning Deep Representations
    • Deep Clustering Loss Functions
  • Sequential Multistep Deep Clustering
    • Fast Spectral Clustering
    • Deep Sparse Subspace Clustering (SSC)
    • Deep Subspace Clustering (DSC)
    • Nonnegative Matrix Factorization (NMF) + K-Means
  • Joint Deep Clustering
    • Task-Specific and Graph-Regularized Network (TAGnet)
    • FaceNet
    • Deep Clustering Network (DCN)
    • Joint NMF and K-Means (JNKM)
  • Closed-Loop Multistep Deep Clustering
  • Conclusions

Next, each family of deep clustering approaches (sequential multistep, joint, and finally closed-loop multistep deep clustering) will be discussed in order. In the following, we describe some representative algorithms that belong to each family of deep clustering methods (Figure 4.12).

Fig. 4.1 Taxonomy of Deep Clustering presented in this chapter

Spark-Based Design of Clustering Using Particle Swarm Optimization

Introduction

Among these algorithms, particle swarm optimization (PSO), as one of the swarm intelligence algorithms, has gained great popularity in the past two decades and has proven to be a fertile and fruitful research area [15]. The rest of this chapter is organized as follows: Sect. 5.2 presents background on the basic concepts related to the particle swarm optimization algorithm, the MapReduce model, and the Spark framework. Section 5.6 presents experiments that we have performed to evaluate the effectiveness of the proposed method.

Background

  • Particle Swarm Optimization
  • MapReduce Model
  • Apache Spark

It consists of a swarm of particles, where each particle is considered a potential solution to the optimization problem. Once done, the results are merged to provide a final solution to the very large and complex problem [6]. Spark owes its popularity to its ability to perform in-memory computations: data does not need to be moved to and from disk but is kept in memory.
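For reference, the canonical PSO update (the chapter's exact variant may differ) moves each particle according to its own memory and the swarm's best position:

```latex
v_i(t+1) = w\, v_i(t) + c_1 r_1 \bigl(pbestP_i(t) - x_i(t)\bigr) + c_2 r_2 \bigl(gbest(t) - x_i(t)\bigr),
\qquad x_i(t+1) = x_i(t) + v_i(t+1)
```

where w is the inertia weight, c_1 and c_2 are acceleration coefficients, and r_1, r_2 are random numbers drawn uniformly from [0, 1].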

Figure 5.1 outlines the flowchart of the MapReduce paradigm. The enormous data set is divided into several chunks, each small enough to fit on a single machine; each chunk is then assigned to a map function to be processed in parallel.

Related Works

This method first assigns each data point to the closest cluster prototype in the map function. The map function receives the particle information as a key-value pair, where the key is the particle ID and the value represents all the information related to the particle. Then, after processing, the output must be written back to the file system.
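The key-value layout described above can be sketched in a few lines of PySpark. The fitness function, the toy data, and the particle layout are assumptions for illustration, not the cited methods' actual code.

```python
# Sketch: particles stored as (particle_id, candidate_centers) pairs;
# the map step evaluates each particle's clustering fitness in parallel.
from pyspark import SparkContext

def compute_fitness(position, data):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sum((p - c) ** 2 for p, c in zip(point, center))
                   for center in position) for point in data)

sc = SparkContext(appName="pso-map-sketch")
data = [(1.0, 2.0), (2.0, 1.5), (8.0, 9.0)]  # toy data set
bdata = sc.broadcast(data)                    # shipped once to every worker

particles = sc.parallelize([(0, [[1.0, 1.0], [8.0, 8.0]]),
                            (1, [[0.0, 0.0], [9.0, 9.0]])])
fitness = particles.mapValues(
    lambda pos: compute_fitness(pos, bdata.value)).collect()
```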

Table 5.1 summarizes the existing methods.

Proposed Approach: S-PSO for Clustering Large-Scale Data

  • Data Assignment and Fitness Computation Step
  • Pbest and Gbest Update Step
  • Position and Velocity Update Step
  • K-Means Iteration Step

Let P(t) = {P_1(t), …, P_S(t)} be the collection of the particles' information, where P_i(t) = {x_i(t), v_i(t), pbestP_i(t), pbestF_i(t)} represents the information of particle i at iteration t: x_i(t) is the position, v_i(t) is the velocity, pbestP_i(t) is the best position, and pbestF_i(t) is the best fitness. Let pbestP(t) = {pbestP_1(t), …, pbestP_S(t)} be the set of personal best positions, where pbestP_i(t) is the pbestP of particle i at iteration t. Let x(t) = {x_1(t), …, x_S(t)} be the set of position values, where x_i(t) is the position of particle i at iteration t.
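As a minimal sketch, the particle record P_i(t) defined above might be represented as follows; field names mirror the chapter's notation, and the concrete layout is an assumption.

```python
# Sketch: one record per particle, holding position, velocity, and
# personal-best state; a position encodes k candidate cluster centers.
from dataclasses import dataclass
import numpy as np

@dataclass
class Particle:
    x: np.ndarray        # position x_i(t): k candidate centers
    v: np.ndarray        # velocity v_i(t)
    pbest_p: np.ndarray  # personal best position pbestP_i(t)
    pbest_f: float       # personal best fitness pbestF_i(t)

swarm = [Particle(x=np.zeros((3, 2)), v=np.zeros((3, 2)),
                  pbest_p=np.zeros((3, 2)), pbest_f=float("inf"))
         for _ in range(10)]  # S = 10 particles, k = 3 centers in 2-D
```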

Fig. 5.3 Flowchart of S-PSO

Theoretical Analysis

  • Complexity Analysis
    • Time Complexity
    • Space Complexity
    • Input/Output Complexity
  • Time-To-Start Variable Analysis

The transition between the two algorithms is controlled by a variable we introduce, called Time-To-Start. If this variable is set near the beginning of the PSO run, we get a lower-quality result but a much reduced execution time, and vice versa. Therefore, this variable must be chosen in a way that ensures a balance between quality and time.

Experiments and Results

  • Methodology
  • Environment and Data Sets Description
  • Performance Measures
  • Comparison of the Performance of S-PSO Versus Existing Methods
  • Evaluation of the Impact of Time-To-Start Variable on the Performance of S-PSO
  • Scalability Analysis

The clustering process for this dataset detects the type of attack among all connections. The speedup measure consists of fixing the size of the dataset and varying the number of computing nodes. To evaluate the scaleup of our proposed method, we increase both the size of the dataset and the number of cores.
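Assuming the standard definitions of these measures (the chapter summary does not spell them out), speedup fixes the data and varies the number of nodes n, while scaleup grows data and nodes together:

```latex
\mathrm{Speedup}(n) = \frac{T_1(D)}{T_n(D)}, \qquad
\mathrm{Scaleup}(n) = \frac{T_1(D)}{T_n(n \cdot D)}
```

where T_n(D) is the running time on n nodes for a dataset of size D; the ideal values are n and 1, respectively.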

Table 5.2 Summary of the data sets

Conclusion

Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats

Introduction

The challenge of the insider threat detection problem lies in the variety of malicious insider threats in the datasets. To address the shortcoming of the high number of false alarms, we propose a streaming anomaly detection approach, namely Ensemble of Random subspace Anomaly detectors In Data Streams (E-RAIDS). Moreover, E-RAIDS is evaluated not only in terms of the number of detected threats and FP alarms, but also in terms of (1) F1 measure, (2) voted feature subspaces, (3) real-time anomaly detection, and (4) detecting (more than one) any-behavior-all-threat.

Related Work

  • Clustering for Outlier Detection
  • Streaming Anomaly Detection for Insider Threat Detection

Multiple data sources define the stream environment of the insider threat problem. In this work, data stream clustering is used to support outlier detection techniques for real-time anomaly detection. In this book chapter, we use data stream clustering to detect outliers (malicious insider threats) while reducing the number of false alarms.

Anomaly Detection in Data Streams for Insider Threat Detection

  • Insider Threat Feature Space
  • Background on Distance-Based Outlier Detection Techniques
    • Micro-Cluster-Based Continuous Outlier Detection
    • Anytime Outlier Detection
  • E-RAIDS Approach
    • Feature Subspace Anomaly Detection
    • Ensemble of Random Feature Subspaces Voting

In the following, we give a more detailed description of the feature set used in this work. Instead, it evaluates the range queries with respect to the (far fewer) centers of the micro-clusters. As described later in Sect. 6.3.3.2, if the ensemble votes to generate an alarm, the subOutSet for each feature subspace is used to evaluate whether all malicious insider threats are detected (i.e., the goal of any-behavior all-threat).
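To make the ensemble voting concrete, here is a hedged sketch of the general random-subspace voting scheme: each member detects outliers on its own random feature subspace, and an alarm is raised only when enough subspaces agree. The toy base detector stands in for MCOD/AnyOut, which are far more sophisticated.

```python
# Sketch: ensemble of random feature subspaces; a point is alarmed
# only if at least `vote_threshold` subspace detectors flag it.
import numpy as np

def detect_outliers(X, r=2.5):
    """Toy base detector: flag points far from the subspace mean."""
    z = (X - X.mean(0)) / (X.std(0) + 1e-9)
    return set(np.where((np.abs(z) > r).any(1))[0])

def subspace_vote(X, n_members=5, subspace_dim=3, vote_threshold=3, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X), dtype=int)
    for _ in range(n_members):
        feats = rng.choice(X.shape[1], subspace_dim, replace=False)
        for idx in detect_outliers(X[:, feats]):
            votes[idx] += 1                       # this subspace votes
    return np.where(votes >= vote_threshold)[0]   # alarm only on agreement

X = np.random.default_rng(1).standard_normal((500, 8))
alarms = subspace_vote(X)
```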

Fig. 6.2 E-RAIDS framework

Experiments

  • Description of the Data set
  • Experimental Tuning
  • Evaluation Measures
  • Results and Discussion
    • MCOD vs AnyOut Base Learner for E-RAIDS in Terms of Evaluation Measures

In the state of the art, a significant number of approaches have been validated in terms of the FP (false positive) measure. As mentioned above, the ultimate goal of the E-RAIDS approach is to detect all malicious insider threats over a real-time data stream while minimizing the number of false alarms. The challenge of the insider threat problem lies in the variety and complexity of malicious insider threats in the datasets.

Table 6.1 Tuned parameters

E-RAIDS-MCOD

In this work, the results for E-RAIDS-MCOD and E-RAIDS-AnyOut are presented and discussed with respect to (1) the predefined evaluation measures; (2) voted feature subspaces; (3) real-time anomaly detection; and (4) detection of (more than one) any-behavior-all-threat. In the following, we analyze the performance of E-RAIDS with the MCOD base learner vs the AnyOut base learner against the predefined evaluation measures: TPT out of PT, FP alarms, and F1 measure. The results are reported in terms of the parameter values in the given order: r, w for E-RAIDS-MCOD and τ, oscAgr, w for E-RAIDS-AnyOut.

E-RAIDS-AnyOut

  • Real-Time Anomaly Detection in E-RAIDS
  • Conclusion and Future Work

We compare the number of feature subspaces in the ensemble that voted for a malicious insider threat in each of E-RAIDS-MCOD and E-RAIDS-AnyOut. We recall the complexity of the malicious insider threat scenarios in the CMU-CERT datasets. Phillips et al., Insider threat detection, in Proceedings of the 50th Hawaii International Conference on System Sciences (2017).

Fig. 6.3 The variation of F1 measure as a function of window size w for E-RAIDS with MCOD base learner over the communities

Effective Tensor-Based Data Clustering Through Sub-Tensor Impact Graphs

Selçuk Candan, Shengyu Huang, Xinsheng Li, and Maria Luisa Sapino

  • Introduction
    • Contributions of This Chapter: Sub-Tensor Impact Graphs
  • Background
    • Tensors
    • Tensor Decomposition
    • Tensor Decomposition and Clustering
    • Block-Based Tensor Decomposition
  • Sub-Tensor Impact Graphs (SIGs) and Sub-Tensor Impact Scores
    • Accuracy Dependency Among Sub-Tensors
    • Sub-Tensor Impact Graphs (SIGs)
    • Sub-Tensor Impact Scores
  • Application #1: Block-Incremental CP Decomposition (BICP) and Update Scheduling Based on Sub-Tensor
    • Reducing Redundant Refinements
    • Evaluation
  • Application #2: Noise-Profile Adaptive Decomposition (nTD) and Sample Assignment Based on Sub-Tensor
    • Grid-Based Probabilistic Tensor Decomposition (GPTD)
    • Noise-Sensitive Sample Assignment
    • Evaluation

If the sub-tensor is empty, then the factors are 0 matrices of the appropriate size. Intuitively, the sub-tensor impact graph represents how the decomposition accuracies of a given set of sub-tensors of an input tensor affect the overall accuracy of the combined decomposition. However, as discussed earlier, inaccuracies in the decomposition of one sub-tensor can propagate to the rest of the sub-tensors in phase 2.

Fig. 7.1 A third-order (3-mode) tensor of dimensions I × J × K

[Figure panels: RMSE with noise adaptation and execution time with noise adaptation, on the CIAO dataset]

Application #3: Personalized Tensor Decomposition (PTD) and Rank Assignment Based on Sub-Tensor

  • Problem Formulation
  • Sub-Tensor Rank Flexibility
  • Rank Assignment for Personalized Tensor Decomposition
  • Evaluation
    • Setup
    • Discussion of the Results

In particular, PTD analyzes the sub-tensor impact graph (in light of the user's interest) to identify initial decomposition orders for the sub-tensors in a way that will increase the accuracy of the final decomposition for the partitions of interest. The goal of personalized tensor decomposition (PTD) is to obtain a personalized (or preference-sensitive) decomposition X̂ of X that is more accurate on the partitions of interest. The PTD algorithm then uses this graph to calculate the impact of the inaccuracy of the initial decomposition of a sub-tensor on the final decomposition accuracy of X_P, i.e., the cells of X collectively covered by the user's statement of interest (i.e., K_P).

Table 7.1 Various tensor partitioning scenarios considered in the evaluation

Conclusions

Faloutsos, GigaTensor: scaling tensor analysis up by 100 times, algorithms and discoveries, in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). …, in Proceedings of the IEEE International Conference on Data Mining (ICDM) (2008). Xiong et al., Temporal collaborative filtering with Bayesian probabilistic tensor factorization, in Proceedings of the 2010 SIAM International Conference on Data Mining (2010).

Index

GPU, see Graphics processing unit. GPU-based k-means method (GPUKM), 6. GPU fuzzy c-means method (GPUFCM), 6. Graph-based anomaly detection (GBAD). MapReduce model, 92, 96: MR-CPSO (see MR-CPSO); using Spark (see Spark-based PSO clustering method); in fitness computation, 96; hybrid method, 95–96; personal best position, 93. Anomaly detection system in real-time: E-RAIDS (see Ensemble of random subspace anomaly detectors in data streams).

Figures

KdtKM (Kanungo et al. 2002; Pelleg and Moore 2003), TiKM (Phillips 2002; Elkan et al. 2003), CDKM (Lai et al. …)
Fig. 1.2 GPU architecture with three multiprocessors and three streaming processors
Fig. 1.3 Data flow of MapReduce framework
Fig. 2.2 Reference [12]. Two applications of mining body-generated streaming data. (a) In a health care scenario [13] and (b) in a translation scenario in collaboration with psycholinguists in the humanities area [21]
