182 Index
Blockchain(s) (cont.)
  private/restricted, 43
Blockchain data
  addresses, 59
  Bitcoin blockchain and Bitcoin Core, 49
  blocks, 50, 51
    Bitcoin network, 56
    Bitcoin value distribution, 57, 58
    block-height, 59
    coinbase transaction, 59
    merkle-root-hash, 58
    nonce, 59
    previous-block-header-hash, 57
    target/n-bits, 59
    time, 58
    version, 57
  clustering (see Clustering blockchain data)
  consensus-based development, 49
  documentation, 49
  flow of currency
    change-making transactions, 52
    DAG, 52, 53
    P2PKH scheme, 53–54
    transaction fees, 52–53
    UTXOs, 52
  mining, 51
  models, 47–48
  nodes, 60
  operations model, 49
  owners, 60
  transactions
    coinbase transaction, 51
    four representative block groups, 55–57
    graph, 52–54
    identifier, 51
    locktime feature, 52
    outpoint, 52
    pubkey script, 52, 55
    secondary structure, 53, 54
    sequence number, 52
    signature script, 52, 55
    value, 52
    vertices labeling, 54–55
Block-incremental CP decomposition (BICP)
  evaluation
    ALS process, 164
    alternative strategies, 163
    criteria, 163
    data sets, 163
    data updates, 163
    execution times and decomposition accuracies, 164–165
    hardware and software, 163
  redundant refinement reduction, 161–162
  update sensitive block maintenance, first phase, 160
  update sensitive refinement, first phase, 161
Block reward, 51
Block subsidy, 51
C
CANDECOMP/PARAFAC (CP) decompositions, 148–149
Center displacement k-means method (CDKM), 12
Centers reduction-based methods, 11–13
Closed-loop multistep deep clustering, 75, 85
  CCNN, 85–86
  DBC, 85
  DEC, 84–85
Cluster assignment hardening loss, 78–79
Clustering blockchain data, 48
  address merging
    bootstrap methods, 62
    co-occurring transaction inputs, 62–63
    peer host address, 63
    temporal patterns, 63–64
    transaction input–output patterns, 63
    well-known services, 64
  evaluation
    distance-based criteria, 65–67
    external, 64–65
    human-assisted criteria, 68
    internal, 65
    purity and entropy, 64
    sensitivity to cluster count, 67
    tagged data, 68
  feature extraction, 61–62
  scalability, 64
Clustering CNN (CCNN), 85–86
Clustering loss, 78
Cluster separation, 66
CluStream, 119
CMU-CERT data sets, 130
compactSize, 55
Compute Unified Device Architecture (CUDA), 5–6
Continuous Outlier Detection (COD), 123
Cryptocurrencies, 43
D
Data assignment, 13
Data clustering, 13
Data mining, 25
Data reduction algorithm to cluster large-scale data (DRFCM), 11
Data reduction-based methods, 10–11
Data sampling, 13
Data skeleton, 13
Data stream
  behaviours, 116
  data acquisition, 115
  feature space, 117
  malicious insider threat detection
    any-behaviour-all-threat, 116–117
    clustering methods (see Data stream clustering)
    in data sets, 116
  stream mining problem, 115
  threat hunting, 116
Data stream clustering
  cluster tracking, 119
  outlier detection, 120
  streaming anomaly detection, insider threat detection
    data set, 130
    deep learning, 120
    distance-based outlier detection techniques, 123–125
    DNN model, 120
    E-RAIDS approach (see Ensemble of random subspace anomaly detectors in data streams)
    feature space, 121–122
    ocSVM, 121
    RNN, 120
    XABA, 121
DBSCAN algorithm, 64
DBSCAN-based clustering models, 30
Deep clustering
  closed-loop multistep deep clustering, 75, 85
    CCNN, 85–86
    DBC, 85
    DEC, 84–85
  deep representation models, 75, 76
  joint deep clustering, 74–75, 82
    DCN, 83, 84
    FaceNet, 83
    JNKM, 84
    TAGnet, 82–83
  loss functions, 75
    autoencoder reconstruction loss, 77
    cluster assignment hardening loss, 78–79
    clustering loss, 78
    joint deep clustering loss function, 78
    types, 76–77
  sequential multistep deep clustering, 74
    deep SSC, 80
    DSC, 81
    fast spectral clustering, 79–80
    NMF+k-means, 82
  taxonomy, 74
Deep clustering network (DCN), 83, 84
Deep embedded clustering (DEC), 84–85
Deep learning (DL)
  clustering approaches (see Deep clustering)
  predictive modeling tasks, 73
  unsupervised pretraining, 73
Deep neural network (DNN), 76, 120
Deep representation (DR) models, 75, 76, 79
Deep subspace clustering (DSC), 81
Directed acyclic graph (DAG), 52, 53
Discriminatively boosted clustering (DBC), 85
Distance-based criteria
  cluster quality criteria, 65–66
  compactness and isolation, 65
  Mahalanobis distance, 67
Distance-based outlier detection techniques, 123–125
DL, see Deep learning
E
ECLUN algorithm, 38
Element-wise distances, 66
Energy-efficient distributed in-sensor-network k-center clustering algorithm with outliers (EDISKCO) algorithm
  average energy consumption, 36, 37
  clustering quality, 36, 37
  on coordinator side, 36
  memory and residual energy, 35
  on node side, 36
  SenClu, 36–37
Ensemble-based insider threat (EIT), 121
Ensemble of random subspace anomaly detectors in data streams (E-RAIDS)
  advantage, 117
  any-behaviour-all-threat, 125
  AnyOut, 117
  evaluation measures, 132–133
  experimental results
    MCOD vs. AnyOut base learner, evaluation measures, 134–138
    MCOD vs. AnyOut, voting feature subspaces, 138–139
    more than one-behaviour-all-threat detection, 141
    real-time anomaly detection, 139–141
  experimental tuning, 131–132
  feature subspaces, 117
    data repositories and survival factor, 127–128
    definition, 126
    ensemble of random feature subspaces voting, 129
    local outlier detection, 125
  framework, 125–126
  MCOD, 117
  RandSubOut, 118
  survival factor, 117–118
  vote factor, 118
F
Fast spectral clustering (FSC), 79–80
Feature space, 117
FPAlarm, 133, 134
Fuzzy c-means clustering using MapReduce framework (MRFCM), 7–8
Fuzzy c-means using MPI framework (MPIFCM), 5
G
Genetic algorithm (GA), 91
Gibbs samples, 167–168
GPU, see Graphics processing unit
GPU-based k-means method (GPUKM), 6
GPU fuzzy c-means method (GPUFCM), 6
Graph-based anomaly detection (GBAD), 121
Graphics processing unit (GPU), 14
  architecture, 6
  CUDA, 5–6
  disadvantage, memory limits, 7
  GPUFCM, 6
  GPUKM, 6
  multiprocessors, 6
  streaming processors, 6
  video and image editing, 5
Grid-based probabilistic tensor decomposition (GPTD), 166–167
H
Hadoop distributed file system (HDFS), 7, 95
HASTREAM, 32–33
Hybrid methods, 13–14
I
Input/output complexity, 104–105
Insider threat detection, 115–116
Intermediary data blow-up problem, 150
Intra- and inter-cluster similarity, 66
J
Joint deep clustering, 74–75, 82
  DCN, 83, 84
  FaceNet, 83
  JNKM, 84
  loss function, 78
  TAGnet, 82–83
Joint NMF and k-means (JNKM), 84
K
kd-tree, 11
k-means-based clustering models, 30
k-means using kd-tree structure (KdtKM), 11
Knowledge Discovery in Databases (KDD)
  dataset, 17
  evaluation and visualization, 38
KPPS algorithm, 14
Kullback–Leibler divergence, 79
L
Labeled data, 1
LiarTree algorithm, 34
LSHTiMRKM method, 13
M
Mahalanobis distance, 67
MapReduce-based k-means method (MRKM), 7
MapReduce-based k-prototypes (MRKP), 8
MapReduce model, 16
  data flow, 7, 8
  disadvantage, 9
  flowchart, 94
  fuzzy c-means clustering, 96
  HDFS, 95
  iterative algorithms, 92
  k-means method, 96
  k-prototypes, 96
  map and reduce phases, 7
  MRFCM, 7–8
  MRKM, 7
  MRKP, 8
  principal components, 94
  shuffle phase, 7
  shuffling step, 94
MCOD, see Micro-cluster-based continuous outlier detection
Merkle tree structure, 50
Message passing interface (MPI), 3, 5, 16
Micro-cluster-based continuous outlier detection (MCOD), 117, 123
  vs. AnyOut
    in evaluation measures, 134–138
    voting feature subspaces, 138–139
  centres of, 124
  experimental tuned parameters, 131–132
  micro-clusters, definition, 123
MinBatch k-means method (MBKM), 10
Miners, 51
Monte Carlo-based Bayesian decomposition, 166
MPI-based k-means (MPIKM), 5
MR-CPSO
  existing methods, 97, 98
  modules, 96–97
  shortcomings, 97–98
  vs. S-PSO, 104–105
Multiprocessors (MPs), 6
N
Noise-profile adaptive decomposition (nTD) method
  benefits, 166
  evaluation
    criteria, 170
    data sets, 169
    hardware and software, 170
    leveraging noise profiles impact, 170, 171
    noise, 169
  GPTD, 166–167
  Monte Carlo-based Bayesian decomposition, 166
  noise-sensitive sample assignment
    Gibbs samples, 167–168
    naive option, 168
    SIG-based sample assignment, 168–169
  probabilistic two-phase decomposition strategy, 165
  tensor noise, 165
Nonce, 51, 59
Nonnegative matrix factorization (NMF), 82
nTD, see Noise-profile adaptive decomposition method
O
OMRKM, 13
One class SVM (ocSVM), 121
Overlapping k-means method using Spark framework (SOKM), 10
P
Particle swarm optimization (PSO)
  algorithm, 93
  clustering method
    MapReduce model, 92, 96
    MR-CPSO (see MR-CPSO)
    using Spark (see Spark-based PSO clustering method)
  in fitness computation, 96
  hybrid method, 95–96
  personal best position, 93
  population-based optimization algorithm, 93
  social behavior of birds, 92–93
  swarm intelligence algorithms, 92
  theoretical analysis
    complexity analysis, 104–105
    time-to-start variable analysis, 105
Partitional clustering methods
  Big data analytics, 2
  efficiency, 3
  fuzzy c-means, 2
  iterative relocation procedure, 2
  k clusters, 2
  k-means, 2, 91
  k-modes, 2
  k-prototypes, 2
  for large-scale data
    empirical results, 17–20
    quality of k-means, 18
    real datasets, 17
    representative method, 16
    running time of k-means, 17–18
    simulated datasets, 17
    SSE, 18
  optimization, 1
  scalable partitional clustering methods (see Big data partitional clustering methods)
Pattern Assignment and Mean Update (PAMU), 11
Pattern Compression and Removal (PCR), 11
Pay to public key hash (P2PKH) scheme, 53–54
Personalized PageRank (PPR) scores, 159, 160
Personalized tensor decomposition (PTD), 147
  evaluation
    criteria, 175
    data set, 175
    decomposition strategies, 175
    hardware and software, 175
    results, 175–177
  foci of interest, 172
  problem formulation, 172–173
  rank assignment, 173–174
  sub-tensor rank flexibility, 173
PreDeConStream, 31–32
PRKM method, 10, 11
Pseudoanonymity, 47
PSO, see Particle swarm optimization
PTD, see Personalized tensor decomposition
R
Real-time anomaly detection system
  E-RAIDS (see Ensemble of random subspace anomaly detectors in data streams)
  RADISH, 117
Real-time stream mining problem, 115
Receiver-operator characteristic (ROC), 67
Recurrent neural network (RNN), 120
Recursive partition k-means (RPKM), 10
Resilient distributed dataset (RDD), 9, 10, 95
S
Scalable partitional clustering methods, see Big data partitional clustering methods
Semi-supervised learning, 1
Sequential multistep deep clustering, 74
  deep SSC, 80
  DSC, 81
  fast spectral clustering, 79–80
  NMF+k-means, 82
SIGs, see Sub-tensor impact graphs
Simulated annealing (SA), 91
Space complexity, 104
Spark-based k-prototypes (SKP) clustering method, 9
Spark-based methods, 9–10, 16
Spark-based PSO clustering method (S-PSO), 92
  data assignment and fitness computation step, 98–100
  environment and data sets description, 105–106
  vs. existing methods, 108
  k-means algorithm, 98
  k-means iteration step, 102–103
  methodology, 105
  vs. MR-CPSO, 104–105
  pbest and gbest update step, 101
  performance measures, 107
  position and velocity update step, 101–102
  process flowchart, 98, 99
  scalability analysis
    running time, 109, 110
    scaleup results, 109, 111
    sizeup results, 109, 112
    speedup results, 109, 111
  Time-To-Start variable impact, 108–109
Sparse subspace clustering (SSC), 80
S-PSO, see Spark-based PSO clustering method
Stochastic sub-gradient descent (SGD), 81
Stream clustering, Big data
  advanced anytime stream clustering algorithms, 34, 35
  anytime mining algorithms, 30
  budget algorithms, 30
  energy awareness and lightweight clustering, sensor data streams, 30
  energy-efficient algorithms and clustering sensor streaming data
    ECLUN algorithm, 38
    EDISKCO algorithm, 35–37
  high-dimensional density-based stream clustering algorithms
    curse of dimensionality, 30
    DBSCAN-based clustering models, 30
    HASTREAM, 32–33
    k-means-based clustering models, 30
    PreDeConStream, 31–32
    self-adjustment, 30
    subspace clustering, 30
  properties, 39, 40
  storage awareness and high clustering quality, 29
  stream changes and outlier awareness, 29
  subspace stream clustering, 38–39
Streaming data
  eye-tracking system, 28–29
  mining body-generated streaming data, 28
  multiple data collection sensors, 27, 28
  social data, 26
  static mining, 26
  streaming tweets with tags and time, 27
  wired streaming data, 27
  wireless sensor network deployment, 27–28
Streaming processors (SPs), 6