
VAE-based Detection

In Feiyang Cai's Dissertation (Pages 33-37)

Variational AutoEncoder (VAE) is a generative model that learns the parameters of a probability distribution to represent the data [100]. A VAE consists of an encoder, a decoder, and a loss function. The objective is to model the relationship between the observation $x$ and the low-dimensional latent variable $z$ using the loss function $\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \,\|\, p(z)]$, where $\theta$ and $\phi$ are neural network parameters. The first term in the loss function is the model fit, and the second is the KL divergence between the approximate posterior and the prior of $z$. A popular choice for the prior is the Gaussian distribution. VAE-based methods can utilize the reconstruction error or reconstruction accuracy for anomaly detection [52]. In this section, we introduce a VAE-based detection method based on Inductive Conformal Anomaly Detection (ICAD). The VAE allows generating multiple examples that are similar to the input. If these examples are not conformal to the training data, many of the corresponding nonconformity scores will be very large, indicating that the test example is out of the distribution of the training dataset.
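For the common case of a diagonal-Gaussian posterior and a standard-normal prior, the KL term of the loss above has a closed form. The sketch below is illustrative only; the function names and toy dimensions are our own and are not part of the dissertation's implementation.

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    # Closed-form KL[ q(z|x) || N(0, I) ] for a diagonal Gaussian posterior:
    # 0.5 * sum( mu^2 + sigma^2 - log sigma^2 - 1 )
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

def vae_loss(x, x_hat, mu, log_var):
    # Negative ELBO with a Gaussian likelihood: the squared reconstruction
    # error stands in for -log p(x|z), plus the KL regularizer.
    recon = np.sum((x - x_hat) ** 2)
    return recon + kl_diag_gaussian(mu, log_var)

# A posterior equal to the prior contributes zero KL.
print(kl_diag_gaussian(np.zeros(4), np.zeros(4)))  # -> 0.0
```

Minimizing this quantity trades off reconstruction fidelity against keeping the posterior close to the prior, which is what later lets the model reconstruct in-distribution inputs with small error.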

3.3.1 Offline Training and Nonconformity Measure

Let us consider an LEC $y = f(x)$ defining a mapping from input $x$ to output $y$. It should be noted that in this work only the inputs are taken into consideration for OOD detection. The set of input examples used for training is denoted by $D_{\text{train}} = \{x_i\}_{i=1}^{l}$. During the system operation, a sequence of inputs denoted by $(x_1, \ldots, x_t, \ldots)$ is processed one by one. The task of OOD detection is to quantify how different the input sequence is from the training dataset. If the difference is large, the algorithm raises an alarm indicating that the LEC may generate an output $y$ with a large error compared to the test error obtained at design time.

Since the observations of the environment are highly time-correlated, the collected training data are not exchangeable. As suggested in [62], reshuffling – a random permutation of the training dataset – is performed before training the VAE. After reshuffling, the training dataset is split into a proper training set $D_{\text{proper}} = \{x_i\}_{i=1}^{m}$ and a calibration set $D_{\text{calibration}} = \{x_i\}_{i=m+1}^{l}$. For each example in the calibration set, a function $A$ is used to compute the NCM that assigns a numerical score indicating how different a test example is from the training dataset. The nonconformity scores of the calibration examples are sorted and stored in order to be used at runtime.
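The reshuffle-and-split step can be sketched as follows. This is a minimal illustration; the dataset sizes, the 16-dimensional inputs, and the random-seed choice are assumptions, not values fixed by the text.

```python
import numpy as np

def reshuffle_and_split(D_train, m, seed=0):
    # Random permutation restores (approximate) exchangeability of the
    # time-correlated training data before the VAE is trained.
    rng = np.random.default_rng(seed)
    shuffled = D_train[rng.permutation(len(D_train))]
    # Proper training set: first m examples; calibration set: the rest.
    return shuffled[:m], shuffled[m:]

l, m = 1000, 800                     # assumed: l total examples, m proper
D_train = np.random.randn(l, 16)     # stand-in for the real input examples
D_proper, D_cal = reshuffle_and_split(D_train, m)
```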

Given a new input $x_t$, the nonconformity score $\alpha_t$ is computed using the nonconformity function $A$ relative to the proper training set, $\alpha_t = A\big(\{x_i\}_{i=1}^{m}, x_t\big)$. The computation requires evaluating the nonconformity (strangeness) of $x_t$ relative to $\{x_i\}_{i=1}^{m}$. The choice of the nonconformity function $A$ must ensure computing informative nonconformity scores in real time. Using $k$-NN, for example, requires storing the training dataset, which is infeasible for high-dimensional inputs. Instead, we learn an appropriate neural network architecture that is trained offline using the proper training set and encodes the required information in its parameters.

This neural network monitors the inputs to the perception or end-to-end control LEC and is used to compute the nonconformity scores in real time.

For a VAE, an in-distribution input $x$ should be reconstructed with a relatively small reconstruction error. Conversely, an OOD input will likely have a larger error. The reconstruction error is a good indication of the strangeness of the input relative to the training set, and it is used as the NCM. We use the squared error between the input example $x$ and the generated output example $\hat{x}$ as the NCM, defined as

$\alpha = A_{\mathrm{VAE}}(x, \hat{x}) = \|x - \hat{x}\|^2$. (3.1)

During the offline phase, for each example $x_j$, $j \in \{m+1, \ldots, l\}$, in the calibration dataset, we sample a single reconstructed input $\hat{x}_j$ from the trained VAE and compute the nonconformity score $\alpha_j$ using Equation (3.1). The precomputed nonconformity scores of the calibration data are sorted and stored in order to be used at runtime. The steps that are performed offline are summarized in Algorithm 1.
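Assuming a trained VAE is available as a reconstruction function, the offline scoring of the calibration set reduces to a few lines. The identity-plus-noise `vae_reconstruct` below is only a stand-in for the real model, which would encode the input, sample from the posterior, and decode.

```python
import numpy as np

def ncm_vae(x, x_hat):
    # Equation (3.1): squared reconstruction error used as the NCM.
    return float(np.sum((x - x_hat) ** 2))

def vae_reconstruct(x, rng):
    # Placeholder for a trained VAE; the added noise mimics sampling
    # a reconstruction rather than returning the input exactly.
    return x + 0.1 * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
D_cal = rng.standard_normal((200, 16))   # assumed calibration set
# One reconstruction per calibration example; scores sorted for runtime use.
cal_scores = sorted(ncm_vae(x, vae_reconstruct(x, rng)) for x in D_cal)
```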

Algorithm 1 VAE-based out-of-distribution detection

Input: a training set $D_{\text{train}} = \{x_i\}_{i=1}^{l}$; a test sequence $(x_1, \ldots, x_t, \ldots)$; number of calibration examples $l - m$; number of examples $N$ generated by the VAE; threshold $\tau$ and parameter $\omega$ of the CUSUM detector
Output: boolean variable $\text{Anom}_t$

Offline:
1: Split the training set $D_{\text{train}} = \{x_i\}_{i=1}^{l}$ into the proper training set $D_{\text{proper}} = \{x_i\}_{i=1}^{m}$ and the calibration set $D_{\text{calibration}} = \{x_i\}_{i=m+1}^{l}$
2: Train a VAE using the proper training set
3: for $j = m+1$ to $l$ do
4:   Generate $\hat{x}_j$ using the trained VAE
5:   $\alpha_j = A_{\mathrm{VAE}}(x_j, \hat{x}_j)$
6: end for

Online:
7: $S_0 = 0$
8: for $t = 1, 2, \ldots$ do
9:   for $k = 1$ to $N$ do
10:    Sample $\hat{x}_{t,k}$ using the trained VAE
11:    $\alpha_{t,k} = A_{\mathrm{VAE}}(x_t, \hat{x}_{t,k})$
12:    $p_{t,k} = \dfrac{|\{i = m+1, \ldots, l\} \mid \alpha_i \geq \alpha_{t,k}|}{l - m}$
13:  end for
14:  $M_t = \int_0^1 \prod_{k=1}^{N} \varepsilon\, p_{t,k}^{\varepsilon - 1}\, d\varepsilon$
15:  $S_t = \max(0, S_{t-1} + M_t - \omega)$
16:  $\text{Anom}_t \leftarrow S_t > \tau$
17: end for

3.3.2 Online Detection

Given a test input $x_t$, the $p$-value $p_t$ is computed as the fraction of calibration examples that have nonconformity scores greater than or equal to $\alpha_t$:

$p_t = \dfrac{|\{i = m+1, \ldots, l\} \mid \alpha_i \geq \alpha_t|}{l - m}$. (3.2)

It should be noted that the computation of the $p$-value can be performed efficiently online since it requires storing only the calibration nonconformity scores. Besides, since the nonconformity scores for the calibration data are sorted in the offline phase, the calculation of $p$-values can be accelerated by using a binary search. If $p_t < \varepsilon$, the example $x_t$ can be classified as an anomaly. Using a single $p$-value for detecting OOD examples can lead to an oversensitive detector with a large number of false alarms that inhibit the operation of the CPS. Our objective is to incorporate multiple examples into the detection algorithm and compute a set of $p$-values to test if there are many small $p$-values indicating an OOD input.
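Because the calibration scores are sorted offline, each $p$-value costs a single binary search. A sketch: `bisect_left` counts the scores strictly below $\alpha$, so the remainder are greater than or equal to it.

```python
import bisect

def p_value(alpha, sorted_cal_scores):
    # Fraction of calibration scores >= alpha, found by binary search
    # on the scores sorted during the offline phase.
    n = len(sorted_cal_scores)
    below = bisect.bisect_left(sorted_cal_scores, alpha)
    return (n - below) / n

cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5]   # toy sorted calibration scores
print(p_value(0.3, cal_scores))          # 3 of 5 scores are >= 0.3 -> 0.6
```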

Given an input example $x_t$ at time $t$, the encoder portion of the VAE is used to approximate the posterior distribution of the latent space and sample $N$ points $\{z_{t,k}\}_{k=1}^{N}$ from the posterior, which are used as input to the decoder portion in order to generate multiple new examples $\{\hat{x}_{t,k}\}_{k=1}^{N}$. Typically, the posterior of the latent space is approximated by a Gaussian distribution. Sampling from the posterior generates encodings $\{z_{t,k}\}_{k=1}^{N}$ so that the decoder is exposed to a range of variations of the input example and outputs $\{\hat{x}_{t,k}\}_{k=1}^{N}$.
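Drawing the $N$ encodings from a diagonal-Gaussian posterior is commonly done with the reparameterization trick; a minimal sketch follows, where the latent dimension and $N$ are illustrative choices.

```python
import numpy as np

def sample_posterior(mu, log_var, N, rng):
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I),
    # drawn N times so the decoder sees N variations of the input.
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal((N, mu.shape[0]))
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(8), np.zeros(8)             # assumed encoder outputs
z = sample_posterior(mu, log_var, N=20, rng=rng)   # shape (20, 8)
```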

For each generated example $\hat{x}_{t,k}$, $k \in \{1, \ldots, N\}$, the nonconformity score $\alpha_{t,k}$ is computed as the reconstruction error between the test input $x_t$ and the generated example using Equation (3.1). Then, the $p$-value $p_{t,k}$ is computed as the fraction of calibration examples that have nonconformity scores greater than or equal to $\alpha_{t,k}$ using Equation (3.2). It should be noted that, although the generated examples $\{\hat{x}_{t,k}\}_{k=1}^{N}$ satisfy the exchangeability assumption, they are sampled from a Gaussian distribution conditioned on the input and are not sampled from the same distribution as the calibration dataset. Therefore, the $N$ $p$-values $\{p_{t,k}\}_{k=1}^{N}$ are not independent and uniformly distributed in $[0,1]$. However, if the input $x_t$ is sampled from the same distribution as the training dataset, the $p$-values will be large, and they can be used for OOD detection.

At runtime, for every new input example $x_t$ received by the perception or end-to-end control LEC at time $t$, a power martingale [69] can be computed based on the sequence of $p$-values for some $\varepsilon$ as

$M_t^{\varepsilon} = \prod_{k=1}^{N} \varepsilon\, p_{t,k}^{\varepsilon - 1},$

and the simple mixture martingale [69] can be defined as

$M_t = \int_0^1 M_t^{\varepsilon}\, d\varepsilon = \int_0^1 \prod_{k=1}^{N} \varepsilon\, p_{t,k}^{\varepsilon - 1}\, d\varepsilon$. (3.3)

Both martingales will grow only if there are many small $p$-values in $\{p_{t,k}\}_{k=1}^{N}$, which indicates an OOD input. If most of the $p$-values are bounded away from 0, the martingales are not expected to grow. We use the simple mixture martingale in our approach to avoid the parameter tuning required for the power martingale.
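The integral in Equation (3.3) has no simple closed form for a product of $p$-values, but it is one-dimensional and cheap to evaluate numerically. A sketch using the trapezoidal rule; the grid size and the clipping floor for zero $p$-values are our own arbitrary choices.

```python
import numpy as np

def mixture_martingale(p_values, grid=1000):
    # M_t = integral over [0,1] of prod_k (eps * p_k**(eps - 1)) d(eps),
    # evaluated with the trapezoidal rule on a uniform grid.
    p = np.clip(np.asarray(p_values, dtype=float), 1e-12, 1.0)  # guard p = 0
    eps = np.linspace(0.0, 1.0, grid)
    integrand = np.prod(eps[:, None] * p[None, :] ** (eps[:, None] - 1.0),
                        axis=1)
    de = eps[1] - eps[0]
    return float(np.sum((integrand[1:] + integrand[:-1]) * 0.5 * de))

# Many small p-values make the martingale large; large p-values keep it small.
m_small = mixture_martingale([0.01] * 10)
m_large = mixture_martingale([0.9] * 10)
```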

In order to robustly detect when $M_t$ becomes consistently large, we use the cumulative sum (CUSUM) procedure [106]. CUSUM is a nonparametric stateful test and can be used to generate alarms for OOD inputs by keeping track of the historical information of the martingale values. The detector is defined as $S_0 = 0$ and $S_t = \max(0, S_{t-1} + M_t - \omega)$, where $\omega$ prevents $S_t$ from increasing consistently when the inputs come from the same distribution as the training data. An alarm is raised whenever $S_t$ exceeds a threshold $\tau$, which can be optimized using empirical data [106]. Typically, after an alarm, the test is reset with $S_t = 0$.
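The CUSUM update is a one-line recurrence; a minimal stateful sketch follows. The values of $\omega$ and $\tau$ here are arbitrary and would be tuned on empirical data, as noted above.

```python
class CusumDetector:
    # S_t = max(0, S_{t-1} + M_t - omega); alarm when S_t > tau.
    def __init__(self, omega, tau):
        self.omega, self.tau, self.S = omega, tau, 0.0

    def update(self, M_t):
        self.S = max(0.0, self.S + M_t - self.omega)
        alarm = self.S > self.tau
        if alarm:
            self.S = 0.0   # typical reset after an alarm
        return alarm

det = CusumDetector(omega=1.0, tau=5.0)
# Martingale values below omega keep S pinned at 0: no alarms.
alarms = [det.update(0.5) for _ in range(10)]
# A run of large martingale values drives S over tau and triggers an alarm.
alarms += [det.update(3.0) for _ in range(5)]
```

The drift term $\omega$ is what makes the test robust: occasional large $M_t$ values are absorbed, and only a sustained run of them accumulates past $\tau$.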

Algorithm 1 describes the VAE-based real-time OOD detection. The NCM can be computed very efficiently by executing the learned VAE neural network and generating $N$ new examples. The complexity is comparable to the complexity of the perception or end-to-end LEC that is executed in real time.
