
4.3 Detection of Out-of-distribution Data in Learning-enabled Cyber-physical Systems

The encoder, predictor, and generator are jointly trained using the loss function

\mathcal{L}(\theta, \phi_c, \phi_z; x) = \mathbb{E}_{z \sim q_{\phi_z}(z|x)}[\log p_\theta(x|z)]
                                       - \mathbb{E}_{c \sim q_{\phi_c}(c|x)}[D_{\mathrm{KL}}(q_{\phi_z}(z|x) \,\|\, p(z|c))].   (4.1)

Similar to a traditional VAE, the first term enables the decoder to reconstruct the input from the latent representation with a small reconstruction error. The second term aims to minimize the Kullback-Leibler (KL) divergence between the approximate posterior and the prediction-specific prior p(z|c).
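To make the structure of Eq. (4.1) concrete, the following is a minimal sketch in PyTorch, assuming a Gaussian encoder q_{φ_z}(z|x), a categorical predictor q_{φ_c}(c|x), and one diagonal Gaussian prior p(z|c) per class; all tensor names and the interface are illustrative assumptions, not the implementation used in this work.

    import torch
    import torch.nn.functional as F

    def loss_4_1(x, x_hat, mu_z, logvar_z, class_probs, prior_mu, prior_logvar):
        """Negative of the objective in Eq. (4.1): reconstruction error plus
        the expected KL divergence to the prediction-specific prior p(z|c).

        mu_z, logvar_z:          encoder outputs q(z|x), shape (B, D)
        class_probs:             predictor outputs q(c|x), shape (B, C)
        prior_mu, prior_logvar:  per-class prior parameters, shape (C, D)
        """
        # First term: E_{z~q(z|x)}[log p(x|z)], approximated by the
        # squared reconstruction error of a sampled z.
        rec = F.mse_loss(x_hat, x, reduction="sum") / x.size(0)

        # Second term: closed-form KL between diagonal Gaussians,
        # one value per (example, class) pair, shape (B, C).
        var_z = logvar_z.exp().unsqueeze(1)            # (B, 1, D)
        diff2 = (mu_z.unsqueeze(1) - prior_mu.unsqueeze(0)) ** 2
        kl = 0.5 * (prior_logvar.unsqueeze(0) - logvar_z.unsqueeze(1)
                    + (var_z + diff2) / prior_logvar.exp().unsqueeze(0)
                    - 1.0).sum(-1)

        # E_{c~q(c|x)}[KL(...)]: weight the per-class KL by the predictor.
        kl = (class_probs * kl).sum(-1).mean()

        return rec + kl                                # minimize this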

In practice, the balance between these two terms should be carefully tuned to control the trade-off between the fidelity of reconstruction and the quality of samples from the model [116]. Recently, a calibrated decoder architecture called σ-VAE, which can automatically tune this trade-off and improve the quality of the generated samples, was developed in [116]. The idea of σ-VAE is to add a weighting parameter σ between the reconstruction term and the KL-divergence term. The parameter σ can be computed analytically and does not require manual tuning; implementation details can be found in [116]. This technique can also be used with the proposed VAE for classification and regression models in a similar fashion by adding the weighting parameter σ to the reconstruction term in Eq. (4.1).
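Following the σ-VAE idea, the weighting admits a closed form: under a Gaussian decoder with a shared variance σ², the maximum-likelihood estimate of σ² is simply the mean squared reconstruction error. A minimal sketch of that computation (names are placeholders, and additive constants of the Gaussian likelihood are dropped):

    import torch

    def sigma_vae_rec_term(x, x_hat, eps=1e-6):
        """Calibrated reconstruction term in the spirit of sigma-VAE [116]:
        sigma^2 is set to the batch MSE, its MLE under a Gaussian decoder,
        so the reconstruction/KL balance needs no manual tuning."""
        mse = ((x - x_hat) ** 2).mean()
        log_sigma2 = torch.log(mse.clamp(min=eps))   # analytic choice of sigma
        d = x[0].numel()                             # input dimensionality
        # Gaussian NLL per example with sigma^2 = mse: 0.5*d*(log sigma^2 + 1)
        return 0.5 * d * (log_sigma2 + 1.0)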

4.3.2 Inductive Out-of-distribution Detection

Our approach is based on Inductive Conformal Anomaly Detection (ICAD), which requires a suitable Nonconformity Measure (NCM) defined to quantify how different a test example is relative to the training dataset [63]. The VAE for classification and regression models are used to define the NCM. There are two significant benefits of using such models. First, the approach can scale up to high-dimensional inputs; second, the models encode both the input and output variables of the regression or classification task into the latent representations and, consequently, can be used to detect different types of OOD data.

4.3.2.1 Nonconformity measures

A test example x and its predicted label y are encoded as z in the latent space of the VAE for classification and regression model, and subsequently, the decoder portion generates a reconstructed example x̂ by sampling. If the input-output pair (x, y) is sampled from the same joint distribution P_train(x, y) as the training dataset, the test example x should be reconstructed with a relatively small reconstruction error. Therefore, the reconstruction error, or the squared error between the test input x and its reconstructed example x̂, can be used as an NCM

A^{RC}(x) = \|x - \hat{x}\|^2.   (4.2)
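Computing this NCM amounts to one encode-sample-decode pass. A sketch, assuming the model exposes hypothetical encode and decode methods that return the Gaussian posterior parameters and the reconstruction, respectively:

    import torch

    def ncm_rc(model, x):
        """Reconstruction-based NCM of Eq. (4.2)."""
        with torch.no_grad():
            mu, logvar = model.encode(x)                       # q(z|x)
            z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
            x_hat = model.decode(z)                            # p(x|z)
        return ((x - x_hat) ** 2).sum().item()                 # ||x - x_hat||^2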

It is possible that some input features have little or no contribution to the LEC prediction. Taking an image input as an example, the reconstruction-based NCM will result in a large nonconformity score when the generative model has difficulty generating fine-granularity details of the original input. Therefore, the input features should be treated differently based on their influence on the LEC output. Layer-wise Relevance Propagation (LRP) is typically used as a tool for interpreting neural networks by identifying which input features contribute most to the LEC predictions [95]. Considering an input x, by running a backward pass through the predictor portion of the VAE for classification and regression model, LRP computes a relevance map r, which has the same size as the input x. In order to deal with the problem that different inputs will have different total contributions, the relevance is normalized by the sum of the contributions of all features in the input.

Let us define a function r = G(x) to represent the LRP algorithm computing a relevance map r for a given input x. The NCM with LRP is computed by weighting the reconstruction error using the relevance map r

A^{RC-LRP}(x) = \|r \cdot (x - \hat{x})\|^2.   (4.3)
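Given a relevance function G (in practice one of the LRP rules available in attribution libraries; left abstract here), the weighted score of Eq. (4.3) is a small variation of the previous sketch. The normalization step implements the total-contribution correction described above:

    import torch

    def ncm_rc_lrp(model, x, G):
        """LRP-weighted NCM of Eq. (4.3); G(x) returns a relevance map
        with the same shape as x."""
        r = G(x)
        r = r / r.sum()                                        # normalize total contribution
        with torch.no_grad():
            mu, logvar = model.encode(x)
            z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
            x_hat = model.decode(z)
        return ((r * (x - x_hat)) ** 2).sum().item()           # ||r.(x - x_hat)||^2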

An important property introduced by the VAE for classification and regression models is that the latent representations are disentangled by the target variable, since the prior is conditioned on it. Specifically, in the classification problem, the representations will be clustered by the target class. Therefore, for a test example x and its predicted label y, the distance of the representation z to its corresponding class center c_y can be defined as the distance-based NCM

A^{dist}(x) = \|z - c_y\|^2.   (4.4)

where the center c_y can be computed as the mean of the latent representations of the training data with class label y. Note that such a distance-based NCM cannot be used for regression, since the target variable is continuous.
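A sketch of the distance-based NCM. The class centers are computed once, offline, from the latent representations of the proper training data; the predict method standing in for the predictor portion is an assumed interface:

    import torch

    def class_centers(model, xs, ys, num_classes):
        """Per-class mean of the latent representations of the training data."""
        with torch.no_grad():
            mu, _ = model.encode(xs)                 # posterior means
        return torch.stack([mu[ys == c].mean(dim=0) for c in range(num_classes)])

    def ncm_dist(model, x, centers):
        """Distance-based NCM of Eq. (4.4): squared distance of the sampled
        latent point to the center of the predicted class."""
        with torch.no_grad():
            mu, logvar = model.encode(x)
            z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
            y_hat = model.predict(x).argmax(dim=-1)  # predictor portion
        return ((z - centers[y_hat]) ** 2).sum().item()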

4.3.2.2 Detection method

The method is divided into offline and online phases. During the offline phase, the training dataset D_train = {(x_i, y_i)}_{i=1}^{l} is split into a proper training dataset D_proper = {(x_i, y_i)}_{i=1}^{m} and a calibration dataset D_calibration = {(x_i, y_i)}_{i=m+1}^{l}. Then, we train a VAE for classification and regression model using the proper training set D_proper. For each example x_j, j ∈ {m+1, ..., l}, in the calibration set, the encoder portion of the model approximates the posterior distribution of the latent space and samples a point z_j from it. The nonconformity score α_j of this example can be computed using the NCMs defined earlier (Eq. (4.2), (4.3), and (4.4)). Specifically, for the reconstruction-based NCMs A^RC and A^RC-LRP, the sampled point z_j is used to reconstruct the input; for the distance-based NCM A^dist, the sampled point z_j is used directly to compute the distance to the cluster center. The precomputed nonconformity scores for the data in the calibration set are sorted as {α_j}_{j=m+1}^{l} for online detection.
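The offline phase therefore reduces to scoring the calibration examples with the chosen NCM and sorting the results. A sketch, where ncm is any of the functions above:

    import numpy as np

    def calibrate(model, calibration_set, ncm):
        """Precompute and sort nonconformity scores of the calibration data."""
        scores = [ncm(model, x) for x, _ in calibration_set]
        return np.sort(np.asarray(scores))           # sorted for fast p-values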

During online detection, a sequence of test inputs (x_1, ..., x_t, ...) arrives at the LEC one by one. Based on the approach in Section 3.3, multiple examples are incorporated to improve the robustness of the detection.

For each test input x_t, N points {z_{t,1}, ..., z_{t,N}} are sampled from the learned posterior distribution in the latent space. Then, for each generated point z_{t,k}, the nonconformity score α_{t,k} can be computed using the same NCM A used for the calibration data. The p-value p_{t,k} can be computed as the ratio of calibration nonconformity scores that are at least as large as α_{t,k},

p_{t,k} = \frac{|\{\, i = m+1, \ldots, l : \alpha_i \geq \alpha_{t,k} \,\}|}{l - m}.
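Because the calibration scores are sorted, each p-value can be obtained with a binary search instead of a linear scan. A sketch:

    import numpy as np

    def p_value(alpha_tk, sorted_cal_scores):
        """Fraction of calibration scores that are >= alpha_{t,k}."""
        n = len(sorted_cal_scores)
        idx = np.searchsorted(sorted_cal_scores, alpha_tk, side="left")
        return (n - idx) / n                 # |{i : alpha_i >= alpha_tk}| / (l - m)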

If the test input x_t is sampled from a distribution similar to the distribution of the training dataset, most of the values in the p-value set {p_{t,k}}_{k=1}^{N} should be relatively large. However, if there are many small values in the p-value set, the test example x_t is very likely to be OOD. A martingale can be used to test if there are many small p-values in the set [69]

M_t = \int_0^1 M_t^{\varepsilon} \, d\varepsilon = \int_0^1 \prod_{k=1}^{N} \varepsilon\, p_{t,k}^{\varepsilon - 1} \, d\varepsilon.

If the test example x_t is OOD, the martingale value M_t will increase dramatically due to the many small p-values in the set. Further, as described in Section 3.3.2, a stateful CUSUM detector S can be used to generate alarms when the martingale becomes consistently large. It should be noted that, if the input is not an example in a time sequence, the stateful CUSUM detector should be omitted, and the martingale value can be used directly.
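A minimal sketch of the martingale test and the stateful CUSUM detector. The integral over ε is approximated numerically; the drift δ and threshold τ are hypothetical parameters that would be chosen by validation, and p-values are clipped away from zero for numerical stability:

    import numpy as np

    def martingale(p_values, grid_size=100):
        """M_t = integral over eps in (0,1) of prod_k eps * p_k^(eps - 1)."""
        p = np.clip(np.asarray(p_values), 1e-6, 1.0)
        eps = np.linspace(1e-3, 1.0, grid_size)
        # log-domain product: N*log(eps) + (eps - 1) * sum_k log(p_k)
        log_m = len(p) * np.log(eps) + np.outer(eps - 1.0, np.log(p)).sum(axis=1)
        return float(np.mean(np.exp(log_m)))     # rectangle-rule approximation

    class CusumDetector:
        """Stateful CUSUM over log-martingale values (cf. Section 3.3.2)."""
        def __init__(self, delta, tau):
            self.delta, self.tau, self.s = delta, tau, 0.0

        def step(self, m_t):
            # accumulate evidence while the martingale stays consistently large
            self.s = max(0.0, self.s + np.log(m_t) - self.delta)
            return self.s > self.tau             # True -> raise an alarm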

Algorithm 5 summarizes the proposed method.
