COMPUTATION FOR DEPENDENCY-MONITORING ON DATA STREAMS
by Jonathan Boidol & Andreas Hapfelmeier; presented by Dustin Tran
BRIEF INTRODUCTION
• Industrial sensors or wireless sensor networks (WSNs).
• Characteristics: large size and dimension of streaming data.
• DIMID is an algorithm to monitor dependencies in high dimensional streaming data.
• I.e., to monitor the relationships between data sources.
USING ENTROPY AS A WAY TO DETECT DEPENDENCIES IN STATIC DATA SETS.
Examples:
1. Reshef et al. compare several methods to find novel associations in data sets with a large number of variables.
2. Benesty et al. use entropy to detect delays within time series.
3. Dionisio et al. analyzed financial time series and concluded that mutual information is a superior measure of dependence between random variables.
SOME ALGORITHMS FOR STREAM MONITORING
1. PeakSim by Seliniotaki et al.
2. MID by Boidol and Hapfelmeier.
3. MISE by Keller et al.
CHARACTERISTICS
• DIMID contains a dimensionality reducing method and an estimator for entropy.
• This allows DIMID to update the current relationships with new data as it becomes available instead of recomputing everything after each update.
• A reduction in run-time from linear to logarithmic in the number of observed samples, in comparison with three other algorithms.
TECHNIQUES
1. Window approach.
2. A suitable estimator for the entropy calculation.
ADVANTAGES
1. Does not need to transform the current data in the observation window
• but operates directly on the unpreprocessed data
• does not introduce imprecision through transformation or sampling.
2. Improves the run-time from linear to logarithmic in the number of observed samples
• through an incremental calculation of the dependency measure.
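The incremental calculation can be sketched with a sorted window and binary search. This is a minimal sketch under assumed details: a plain Python list stands in for a balanced search structure (so only the search itself is O(log w) here), and `slide` and `nn_distance` are illustrative names, not the paper's API.

```python
import bisect

def nn_distance(sorted_vals, idx):
    """1-NN distance of sorted_vals[idx]: the smaller gap to its
    immediate neighbours in the sorted window."""
    left = sorted_vals[idx] - sorted_vals[idx - 1] if idx > 0 else float("inf")
    right = (sorted_vals[idx + 1] - sorted_vals[idx]
             if idx < len(sorted_vals) - 1 else float("inf"))
    return min(left, right)

def slide(sorted_vals, faded, newest):
    """Advance the window one step: remove the faded-out sample and
    insert the newest one. Both positions are located by binary search,
    and only samples adjacent to them can change their NN distance, so
    an entropy sum over log(rho_i) can be patched locally instead of
    being recomputed over the whole window."""
    sorted_vals.pop(bisect.bisect_left(sorted_vals, faded))
    bisect.insort(sorted_vals, newest)

window = [1.0, 2.0, 4.0]      # sorted copy of the current observation window
slide(window, faded=2.0, newest=3.5)
```

With a balanced tree or skip list in place of the Python list, both the removal and the insertion run in O(log w), which is where the logarithmic update cost comes from.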
MEASURING DEPENDENCY
• Mutual information measurement I: the predictability of one variable by another.
• To denote a subset of w successive measurements, we write s^i_{t,w}, which describes the measurements from stream s^i from time t to t + w − 1.
• In general, we want to compute I for some or all pairs of variables s^i_{t,w}, s^j_{t,w} ∈ S at all time points t.
OVERVIEW
DIMID consists of an initialization phase and an iterated update procedure
At t = 0, the window is initialized fully.
After t > w, only the newest and the faded-out value are utilized.
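The two phases can be sketched with a fixed-size buffer. This is a minimal sketch: `W` and `update` are illustrative names, and the window size is an assumed value.

```python
from collections import deque

W = 5  # window size w (assumed value for illustration)
window = deque(maxlen=W)

def update(window, value):
    """Slide the window one step and return the faded-out value (None
    during the initialization phase, while the window is filling) plus
    the newest one -- after t > w these two samples are the only ones
    that change."""
    faded = window[0] if len(window) == window.maxlen else None
    window.append(value)  # a deque with maxlen drops the oldest item itself
    return faded, value

stream = [3, 1, 4, 1, 5, 9, 2, 6]
for x in stream:
    faded, newest = update(window, x)
```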
CALCULATING I(X;Y)
• (1) is the probability density function p(x, y).
• h(·) is the differential entropy function.
• I(X;Y) is the difference between the sum of the marginal differential entropies h(X) + h(Y) and the joint entropy h(X, Y), i.e. I(X;Y) = h(X) + h(Y) − h(X, Y).
• I(X;Y) increases the better X allows us to predict Y and vice versa.
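As a sanity check, the identity I(X;Y) = h(X) + h(Y) − h(X, Y) can be verified on a small example. This is a sketch with a discrete toy distribution, where the same identity holds for Shannon entropies; the paper itself works with differential entropies of continuous streams.

```python
import math

# Toy joint distribution p(x, y) over two binary variables.
# Rows index x, columns index y.
p_xy = [[0.4, 0.1],
        [0.1, 0.4]]

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability cells."""
    return -sum(p * math.log(p) for p in probs if p > 0)

p_x = [sum(row) for row in p_xy]        # marginal of X
p_y = [sum(col) for col in zip(*p_xy)]  # marginal of Y

h_x = entropy(p_x)
h_y = entropy(p_y)
h_xy = entropy([p for row in p_xy for p in row])

# I(X;Y) = H(X) + H(Y) - H(X, Y)
mi = h_x + h_y - h_xy

# Cross-check against the direct definition:
# sum over x, y of p(x,y) * log(p(x,y) / (p(x) * p(y)))
mi_direct = sum(
    p_xy[i][j] * math.log(p_xy[i][j] / (p_x[i] * p_y[j]))
    for i in range(2) for j in range(2) if p_xy[i][j] > 0
)
```

Both routes yield the same positive value here, reflecting that the two variables in the toy table are dependent.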
CALCULATING H(X)
• C_E is the Euler constant (the Euler–Mascheroni constant, γ ≈ 0.5772).
• ρ_i is the Euclidean distance to the nearest neighbour of the i-th sample.
• c(n) depends only on the predetermined sample size n, while the NN-distance ρ_i determines the local density of the data points.
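A minimal sketch of the resulting 1-D nearest-neighbour entropy estimate. The constant is assumed here to take the standard Kozachenko–Leonenko form c(n) = ψ(n) − ψ(1) + log 2 (with ψ(1) = −C_E), which may differ in detail from the paper's own equation.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # the Euler constant C_E

def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{k=1}^{n-1} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))

def kl_entropy_1d(samples):
    """1-NN (Kozachenko-Leonenko style) differential entropy estimate
    in nats for 1-D data:
        H_hat = psi(n) - psi(1) + log(2) + (1/n) * sum_i log(rho_i),
    where rho_i is the nearest-neighbour distance of sample i and
    log(2) is the log-volume of the 1-D unit ball."""
    xs = sorted(samples)
    n = len(xs)
    log_rho_sum = 0.0
    for i in range(n):
        left = xs[i] - xs[i - 1] if i > 0 else float("inf")
        right = xs[i + 1] - xs[i] if i < n - 1 else float("inf")
        log_rho_sum += math.log(min(left, right))  # rho_i of the i-th sample
    return digamma_int(n) - digamma_int(1) + math.log(2.0) + log_rho_sum / n

# Sanity check on N(0, 1) samples: the true differential entropy is
# 0.5 * log(2 * pi * e), roughly 1.4189 nats.
random.seed(0)
est = kl_entropy_1d([random.gauss(0.0, 1.0) for _ in range(4000)])
```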
CALCULATING H(X,Y)
• (5) shows a simplified Johnson–Lindenstrauss mapping.
• A distance-preserving mapping from higher-dimensional into low-dimensional spaces.
• Project the data points s^i, s^j into a random subspace s^{i◦j} with dimension d = 1.
• Here r ∈ R² is a 2-dimensional vector whose entries are independently drawn from a normal distribution N(0, 1), and (s_i^T, s_j^T) is the 2×n-dimensional matrix of the sample points.
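The d = 1 projection can be sketched as follows (an illustrative sketch; `project_pair` is a hypothetical helper name). Each joint sample collapses to a single scalar, after which the joint entropy can be estimated with the same 1-D nearest-neighbour estimator used for the marginals.

```python
import random

random.seed(1)

def project_pair(si, sj):
    """Map two parallel streams onto the 1-D random subspace s_{i◦j}:
    each joint sample (si[t], sj[t]) becomes r1*si[t] + r2*sj[t], where
    the entries of r = (r1, r2) are drawn independently from N(0, 1)."""
    r1, r2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    return [r1 * a + r2 * b for a, b in zip(si, sj)]

si = [0.1, 0.4, 0.2, 0.9]
sj = [1.0, 0.8, 0.5, 0.3]
s_joint = project_pair(si, sj)   # one scalar per time step
```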
CALCULATIONS OVER TIME
EXPERIMENTS ON SENSOR DATASETS
A large variety of data sets, including movement tracking, financial time series, and environment sensors, with different lengths and a total of 50 streams.
EXPERIMENTS ON SYNTHETIC DATASET
Additionally, they created a synthetic data set LNR
Two additional functions are: Uniform noise (uniform in [0, 1]) and Gaussian noise (with μ = 0 and σ = 1).
In total there are 8 functions, giving 28 pairwise I computations.
The advantage of the synthetic data is exact knowledge of the dependencies in the data, which would otherwise have to be inferred.
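The noise streams can be generated as below. This is a sketch: the uniform and Gaussian noise match the slide, while the `linear` stream, the seed, and the length `N` are invented for illustration (the paper's other functions are not specified here).

```python
import random

random.seed(42)
N = 1000  # stream length (assumed)

t = [random.random() for _ in range(N)]  # base driver stream (assumed)

streams = {
    "linear":   [2.0 * x + 0.5 for x in t],  # hypothetical dependent stream
    "uniform":  [random.random() for _ in range(N)],         # uniform noise in [0, 1]
    "gaussian": [random.gauss(0.0, 1.0) for _ in range(N)],  # Gaussian noise, mu=0, sigma=1
}

# With k streams there are k*(k-1)/2 pairwise I values to monitor;
# for the paper's 8 functions that is 8*7/2 = 28 pairs.
pairs = [(a, b) for i, a in enumerate(streams) for b in list(streams)[i + 1:]]
```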
DATASETS
RESULT ON SENSOR DATASETS
RUNTIME ANALYSIS
CONCLUSION
• The improvement is due to the ability of an entropy-based distance measure to discriminate not only simple, linear relationships but also more complex interactions from background noise.
• Window size might influence the sensitivity of a detection algorithm.
THE END
Thank you for listening.