
VAE-based Detection

In Feiyang Cai's Dissertation (Pages 33-37)

Variational AutoEncoder (VAE) is a generative model that learns the parameters of a probability distribution to represent the data [100]. A VAE consists of an encoder, a decoder, and a loss function. The objective is to model the relationship between the observation $x$ and the low-dimensional latent variable $z$ using the loss function $\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \,\|\, p(z)]$, where $\theta$ and $\phi$ are neural network parameters. The first term in the loss function is the model fit, and the second is the KL divergence between the approximate posterior and the prior of $z$. A popular choice for the prior is the Gaussian distribution. VAE-based methods can utilize the reconstruction error or reconstruction accuracy for anomaly detection [52]. In this section, we introduce a VAE-based detection method based on Inductive Conformal Anomaly Detection (ICAD). The VAE allows generating multiple examples that are similar to the input. If these examples are not conformal to the training data, many of the corresponding nonconformity scores will be very large, indicating that the test example is out of the distribution of the training dataset.
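For the common case of a diagonal-Gaussian posterior and a standard-normal prior, the KL term of the loss above has a closed form. The sketch below is illustrative only; the function names and toy dimensions are our own and are not part of the dissertation's implementation.

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    # Closed-form KL[ q(z|x) || N(0, I) ] for a diagonal Gaussian posterior:
    # 0.5 * sum( mu^2 + sigma^2 - log sigma^2 - 1 )
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

def vae_loss(x, x_hat, mu, log_var):
    # Negative ELBO with a Gaussian likelihood: the squared reconstruction
    # error stands in for -log p(x|z), plus the KL regularizer.
    recon = np.sum((x - x_hat) ** 2)
    return recon + kl_diag_gaussian(mu, log_var)

# A posterior equal to the prior contributes zero KL.
print(kl_diag_gaussian(np.zeros(4), np.zeros(4)))  # -> 0.0
```

Minimizing this quantity trades off reconstruction fidelity against keeping the posterior close to the prior, which is what later lets the model reconstruct in-distribution inputs with small error.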

3.3.1 Offline Training and Nonconformity Measure

Let us consider an LEC $y = f(x)$ defining a mapping from input $x$ to output $y$. It should be noted that in this work only the inputs are taken into consideration for OOD detection. The set of input examples used for training is denoted by $D_{\text{train}} = \{x_i\}_{i=1}^{l}$. During the system operation, a sequence of inputs denoted by $(x_1, \ldots, x_t, \ldots)$ is processed one by one. The task of OOD detection is to quantify how different the input sequence is from the training dataset. If the difference is large, the algorithm raises an alarm indicating that the LEC may generate an output $y$ with a large error compared to the test error obtained at design time.

Since the observations of the environment are highly time-correlated, the collected training data are not exchangeable. As suggested in [62], reshuffling – a random permutation of the training dataset – is performed before training the VAE. After reshuffling, the training dataset is split into a proper training set $D_{\text{proper}} = \{x_i\}_{i=1}^{m}$ and a calibration set $D_{\text{calibration}} = \{x_i\}_{i=m+1}^{l}$. For each example in the calibration set, a function $A$ is used to compute the NCM that assigns a numerical score indicating how different a test example is from the training dataset. The nonconformity scores of the calibration examples are sorted and stored in order to be used at runtime.
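The reshuffle-and-split step can be sketched as follows. This is a minimal illustration; the dataset sizes, the 16-dimensional inputs, and the random-seed choice are assumptions, not values fixed by the text.

```python
import numpy as np

def reshuffle_and_split(D_train, m, seed=0):
    # Random permutation restores (approximate) exchangeability of the
    # time-correlated training data before the VAE is trained.
    rng = np.random.default_rng(seed)
    shuffled = D_train[rng.permutation(len(D_train))]
    # Proper training set: first m examples; calibration set: the rest.
    return shuffled[:m], shuffled[m:]

l, m = 1000, 800                     # assumed: l total examples, m proper
D_train = np.random.randn(l, 16)     # stand-in for the real input examples
D_proper, D_cal = reshuffle_and_split(D_train, m)
```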

Given a new input $x_t$, the nonconformity score $\alpha_t$ is computed using the nonconformity function $A$ relative to the proper training set, $\alpha_t = A\big(\{x_i\}_{i=1}^{m}, x_t\big)$. The computation requires evaluating the nonconformity (strangeness) of $x_t$ relative to $\{x_i\}_{i=1}^{m}$. The choice of the nonconformity function $A$ must ensure computing informative nonconformity scores in real time. Using $k$-NN, for example, requires storing the training dataset, which is infeasible for high-dimensional inputs. Instead, we learn an appropriate neural network architecture that is trained offline using the proper training set and encodes the required information in its parameters.

This neural network monitors the inputs to the perception or end-to-end control LEC and is used to compute the nonconformity scores in real time.

For a VAE, an in-distribution input $x$ should be reconstructed with a relatively small reconstruction error. Conversely, an OOD input will likely have a larger error. The reconstruction error is a good indication of the strangeness of the input relative to the training set, and it is used as the NCM. We use the squared error between the input example $x$ and the generated output example $\hat{x}$ as the NCM, defined as

$\alpha = A_{\mathrm{VAE}}(x, \hat{x}) = \|x - \hat{x}\|^2$. (3.1)

During the offline phase, for each example $x_j$, $j \in \{m+1, \ldots, l\}$, in the calibration dataset, we sample a single reconstructed input $\hat{x}_j$ from the trained VAE and compute the nonconformity score $\alpha_j$ using Equation (3.1). The precomputed nonconformity scores of the calibration data are sorted and stored in order to be used at runtime. The steps that are performed offline are summarized in Algorithm 1.
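Assuming a trained VAE is available as a reconstruction function, the offline scoring of the calibration set reduces to a few lines. The identity-plus-noise `vae_reconstruct` below is only a stand-in for the real model, which would encode the input, sample from the posterior, and decode.

```python
import numpy as np

def ncm_vae(x, x_hat):
    # Equation (3.1): squared reconstruction error used as the NCM.
    return float(np.sum((x - x_hat) ** 2))

def vae_reconstruct(x, rng):
    # Placeholder for a trained VAE; the added noise mimics sampling
    # a reconstruction rather than returning the input exactly.
    return x + 0.1 * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
D_cal = rng.standard_normal((200, 16))   # assumed calibration set
# One reconstruction per calibration example; scores sorted for runtime use.
cal_scores = sorted(ncm_vae(x, vae_reconstruct(x, rng)) for x in D_cal)
```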

Algorithm 1 VAE-based out-of-distribution detection

Input: a training set $D_{\text{train}} = \{x_i\}_{i=1}^{l}$; a test sequence $(x_1, \ldots, x_t, \ldots)$; number of calibration examples $l - m$; number of examples $N$ generated by the VAE; threshold $\tau$ and parameter $\omega$ of the CUSUM detector
Output: boolean variable $\text{Anom}_t$

Offline:
1: Split the training set $D_{\text{train}} = \{x_i\}_{i=1}^{l}$ into the proper training set $D_{\text{proper}} = \{x_i\}_{i=1}^{m}$ and the calibration set $D_{\text{calibration}} = \{x_i\}_{i=m+1}^{l}$
2: Train a VAE using the proper training set
3: for $j = m+1$ to $l$ do
4:   Generate $\hat{x}_j$ using the trained VAE
5:   $\alpha_j = A_{\mathrm{VAE}}(x_j, \hat{x}_j)$
6: end for

Online:
7: $S_0 = 0$
8: for $t = 1, 2, \ldots$ do
9:   for $k = 1$ to $N$ do
10:    Sample $\hat{x}_{t,k}$ using the trained VAE
11:    $\alpha_{t,k} = A_{\mathrm{VAE}}(x_t, \hat{x}_{t,k})$
12:    $p_{t,k} = \dfrac{|\{i = m+1, \ldots, l\} \mid \alpha_i \geq \alpha_{t,k}|}{l - m}$
13:  end for
14:  $M_t = \int_0^1 \prod_{k=1}^{N} \varepsilon\, p_{t,k}^{\varepsilon - 1}\, d\varepsilon$
15:  $S_t = \max(0, S_{t-1} + M_t - \omega)$
16:  $\text{Anom}_t \leftarrow S_t > \tau$
17: end for

3.3.2 Online Detection

Given a test input $x_t$, the $p$-value $p_t$ is computed as the fraction of calibration examples that have nonconformity scores greater than or equal to $\alpha_t$:

$p_t = \dfrac{|\{i = m+1, \ldots, l\} \mid \alpha_i \geq \alpha_t|}{l - m}$. (3.2)

It should be noted that the computation of the $p$-value can be performed efficiently online since it requires storing only the calibration nonconformity scores. Besides, since the nonconformity scores for the calibration data are sorted in the offline phase, the calculation of $p$-values can be accelerated by using a binary search. If $p_t < \varepsilon$, the example $x_t$ can be classified as an anomaly. Using a single $p$-value for detecting OOD examples can lead to an oversensitive detector with a large number of false alarms that inhibit the operation of the CPS. Our objective is to incorporate multiple examples into the detection algorithm and compute a set of $p$-values to test if there are many small $p$-values indicating an OOD input.
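Because the calibration scores are sorted offline, each $p$-value costs a single binary search. A sketch: `bisect_left` counts the scores strictly below $\alpha$, so the remainder are greater than or equal to it.

```python
import bisect

def p_value(alpha, sorted_cal_scores):
    # Fraction of calibration scores >= alpha, found by binary search
    # on the scores sorted during the offline phase.
    n = len(sorted_cal_scores)
    below = bisect.bisect_left(sorted_cal_scores, alpha)
    return (n - below) / n

cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5]   # toy sorted calibration scores
print(p_value(0.3, cal_scores))          # 3 of 5 scores are >= 0.3 -> 0.6
```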

Given an input example $x_t$ at time $t$, the encoder portion of the VAE is used to approximate the posterior distribution of the latent space and sample $N$ points $\{z_{t,k}\}_{k=1}^{N}$ from the posterior, which are used as input to the decoder portion in order to generate multiple new examples $\{\hat{x}_{t,k}\}_{k=1}^{N}$. Typically, the posterior of the latent space is approximated by a Gaussian distribution. Sampling from the posterior generates encodings $\{z_{t,k}\}_{k=1}^{N}$ so that the decoder is exposed to a range of variations of the input example and outputs $\{\hat{x}_{t,k}\}_{k=1}^{N}$.
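Drawing the $N$ encodings from a diagonal-Gaussian posterior is commonly done with the reparameterization trick; a minimal sketch follows, where the latent dimension and $N$ are illustrative choices.

```python
import numpy as np

def sample_posterior(mu, log_var, N, rng):
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I),
    # drawn N times so the decoder sees N variations of the input.
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal((N, mu.shape[0]))
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(8), np.zeros(8)             # assumed encoder outputs
z = sample_posterior(mu, log_var, N=20, rng=rng)   # shape (20, 8)
```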

For each generated example $\hat{x}_{t,k}$, $k \in \{1, \ldots, N\}$, the nonconformity score $\alpha_{t,k}$ is computed as the reconstruction error between the test input $x_t$ and the generated example using Equation (3.1). Then, the $p$-value $p_{t,k}$ is computed as the fraction of calibration examples that have nonconformity scores greater than or equal to $\alpha_{t,k}$ using Equation (3.2). It should be noted that, although the generated examples $\{\hat{x}_{t,k}\}_{k=1}^{N}$ satisfy the exchangeability assumption, they are sampled from a Gaussian distribution conditioned on the input and are not sampled from the same distribution as the calibration dataset. Therefore, the $N$ $p$-values $\{p_{t,k}\}_{k=1}^{N}$ are not independent and uniformly distributed in $[0,1]$. However, if the input $x_t$ is sampled from the same distribution as the training dataset, the $p$-values will be large, and they can be used for OOD detection.

At runtime, for every new input example $x_t$ received by the perception or end-to-end control LEC at time $t$, a power martingale [69] can be computed based on the sequence of $p$-values for some $\varepsilon$ as

$M_t^{\varepsilon} = \prod_{k=1}^{N} \varepsilon\, p_{t,k}^{\varepsilon - 1},$

and the simple mixture martingale [69] can be defined as

$M_t = \int_0^1 M_t^{\varepsilon}\, d\varepsilon = \int_0^1 \prod_{k=1}^{N} \varepsilon\, p_{t,k}^{\varepsilon - 1}\, d\varepsilon$. (3.3)

Both martingales will grow only if there are many small $p$-values in $\{p_{t,k}\}_{k=1}^{N}$, which indicates an OOD input. If most of the $p$-values are bounded away from 0, the martingales are not expected to grow. We use the simple mixture martingale in our approach to avoid the parameter tuning required for the power martingale.
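The integral in Equation (3.3) has no simple closed form for a product of $p$-values, but it is one-dimensional and cheap to evaluate numerically. A sketch using the trapezoidal rule; the grid size and the clipping floor for zero $p$-values are our own arbitrary choices.

```python
import numpy as np

def mixture_martingale(p_values, grid=1000):
    # M_t = integral over [0,1] of prod_k (eps * p_k**(eps - 1)) d(eps),
    # evaluated with the trapezoidal rule on a uniform grid.
    p = np.clip(np.asarray(p_values, dtype=float), 1e-12, 1.0)  # guard p = 0
    eps = np.linspace(0.0, 1.0, grid)
    integrand = np.prod(eps[:, None] * p[None, :] ** (eps[:, None] - 1.0),
                        axis=1)
    de = eps[1] - eps[0]
    return float(np.sum((integrand[1:] + integrand[:-1]) * 0.5 * de))

# Many small p-values make the martingale large; large p-values keep it small.
m_small = mixture_martingale([0.01] * 10)
m_large = mixture_martingale([0.9] * 10)
```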

In order to robustly detect when $M_t$ becomes consistently large, we use the cumulative sum (CUSUM) procedure [106]. CUSUM is a nonparametric stateful test and can be used to generate alarms for OOD inputs by keeping track of the historical information of the martingale values. The detector is defined as $S_0 = 0$ and $S_t = \max(0, S_{t-1} + M_t - \omega)$, where $\omega$ prevents $S_t$ from increasing consistently when the inputs come from the same distribution as the training data. An alarm is raised whenever $S_t$ exceeds a threshold $\tau$, which can be optimized using empirical data [106]. Typically, after an alarm, the test is reset with $S_t = 0$.
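The CUSUM update is a one-line recurrence; a minimal stateful sketch follows. The values of $\omega$ and $\tau$ here are arbitrary and would be tuned on empirical data, as noted above.

```python
class CusumDetector:
    # S_t = max(0, S_{t-1} + M_t - omega); alarm when S_t > tau.
    def __init__(self, omega, tau):
        self.omega, self.tau, self.S = omega, tau, 0.0

    def update(self, M_t):
        self.S = max(0.0, self.S + M_t - self.omega)
        alarm = self.S > self.tau
        if alarm:
            self.S = 0.0   # typical reset after an alarm
        return alarm

det = CusumDetector(omega=1.0, tau=5.0)
# Martingale values below omega keep S pinned at 0: no alarms.
alarms = [det.update(0.5) for _ in range(10)]
# A run of large martingale values drives S over tau and triggers an alarm.
alarms += [det.update(3.0) for _ in range(5)]
```

The drift term $\omega$ is what makes the test robust: occasional large $M_t$ values are absorbed, and only a sustained run of them accumulates past $\tau$.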

Algorithm 1 describes the VAE-based real-time OOD detection. The NCM can be computed very efficiently by executing the learned VAE neural network and generating $N$ new examples. The complexity is comparable to the complexity of the perception or end-to-end LEC that is executed in real time.
