Regarding the first problem, we train autoencoders in a purely unsupervised setting to tackle anomaly detection without using any prior knowledge of the domain. We focus on developing algorithms that can capture the general patterns and sequential aspects of the data.
Problem scenario
In this thesis we focus on (ii), i.e. improving the quality of existing event logs. Improving the quality of the collected data involves two phases, namely data cleaning and imputation.
Objectives
Outline
Machine learning background
Feed-forward neural networks
In the first stage, we evaluate the output of the network for a given input using the randomly initialized weights. Then, in the second stage, we calculate the error between the actual output and the desired output, and take the derivatives of the error with respect to the weight values in the network.
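As an illustration, the two stages above can be sketched with a toy one-hidden-layer network in NumPy. The layer sizes, tanh activation, and squared-error loss are illustrative assumptions, not the exact setup used in this thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feed-forward network: 3 inputs -> 4 hidden (tanh) -> 2 outputs.
W1 = rng.normal(scale=0.1, size=(3, 4))   # random initial weights
W2 = rng.normal(scale=0.1, size=(4, 2))

def forward(x):
    h = np.tanh(x @ W1)                   # hidden activations
    y = h @ W2                            # linear output
    return h, y

x = np.array([0.5, -1.0, 2.0])
t = np.array([1.0, 0.0])                  # desired output

# Stage 1: evaluate the network output under the random weights.
h, y = forward(x)
error = 0.5 * np.sum((y - t) ** 2)        # squared-error loss

# Stage 2: derivatives of the error w.r.t. the weights (backpropagation).
dy = y - t                                # dE/dy
dW2 = np.outer(h, dy)                     # dE/dW2
dh = W2 @ dy                              # error propagated to hidden layer
dW1 = np.outer(x, dh * (1 - h ** 2))     # dE/dW1 (tanh derivative)

# A small gradient-descent step on these derivatives reduces the error.
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2
_, y_new = forward(x)
```

With the derivatives in hand, any gradient-based optimizer can update the weights; plain gradient descent is used here only for brevity.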
Recurrent Neural Networks (RNNs)
The undefined hidden vector h0 (and, in the LSTM case, the cell vector c0) can be zero-initialized or randomly initialized before the training procedure.
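A minimal NumPy sketch of a single LSTM step with zero-initialized h0 and c0; the gate layout, sizes, and weight scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 3, 5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Standard LSTM cell parameters (input, forget, output, candidate gates
# stacked into one matrix each for x and h).
Wx = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
Wh = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

def lstm_step(x, h, c):
    z = Wx @ x + Wh @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g          # cell state update
    h_new = o * np.tanh(c_new)     # hidden state update
    return h_new, c_new

# h0 and c0 are undefined at t = 0; here they are zero-initialized
# (random initialization also works, as noted in the text).
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in [np.ones(n_in), np.zeros(n_in), -np.ones(n_in)]:
    h, c = lstm_step(x, h, c)
```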
Autoencoders (AEs)
The number of hidden units in the bottleneck layer, i.e. the dimension of the code, can be lower or higher than the dimension of the input. In addition, an autoencoder can also approximate the true distribution of the data, which is especially useful in the task of pattern analysis [4].
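A minimal sketch of this architectural choice, assuming single linear encoder and decoder layers (an illustrative simplification of a real autoencoder):

```python
import numpy as np

rng = np.random.default_rng(2)
n_in = 8

def make_autoencoder(code_dim):
    # Encoder and decoder as single linear maps; the bottleneck (code)
    # dimension may be smaller or larger than the input dimension.
    enc = rng.normal(scale=0.1, size=(n_in, code_dim))
    dec = rng.normal(scale=0.1, size=(code_dim, n_in))
    return enc, dec

def reconstruct(x, enc, dec):
    code = np.tanh(x @ enc)      # undercomplete or overcomplete code
    return code @ dec            # reconstruction of the input

x = rng.normal(size=n_in)
undercomplete = make_autoencoder(code_dim=3)    # code smaller than input
overcomplete = make_autoencoder(code_dim=16)    # code larger than input
x_hat = reconstruct(x, *undercomplete)
```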
Related work on quality of event logs
In this thesis, we assume no knowledge about the process that generated an event log. In this work, we adopt a similar approach to extract the sequential information from the event log.
Event log definition
First, we start with the definition of an event log and its notation; then we discuss how we collect the datasets for our work. For any event e ∈ E and attribute name a ∈ AN, #a(e) ∈ Da is the value of attribute a for event e. Let Did be the set of event IDs, Dcase the set of case IDs, Dact the set of activity labels, Dtst the set of possible timestamps, and Dres the set of possible resource IDs.
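The attribute function #a(e) can be mirrored directly in code; the event IDs and attribute values below are hypothetical examples, not taken from the thesis datasets:

```python
# A minimal event-log representation: each event is a mapping from
# attribute names to values, mirroring #a(e) in the text.
events = {
    "e1": {"case": "c1", "activity": "Accepted", "resource": "r7"},
    "e2": {"case": "c1", "activity": "In Progress", "resource": "r7"},
}

def attr(a, e):
    """#a(e): the value of attribute named a for event e."""
    return events[e][a]
```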
In the preprocessing phase, an event log is converted into a format suitable to be fed into an autoencoder.
Datasets
Artificial datasets
Real-life datasets
Introduction
Methods
We define the reconstruction error as the distance between the input vector and the output vector, and we use this error as a signal to detect anomalies in the event log. Given a selected threshold, a variable whose reconstruction error is below the anomaly-detection threshold is classified as normal, and as anomalous otherwise.
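A minimal sketch of this threshold rule; the Euclidean distance is an assumption made for illustration, as the excerpt does not fix the exact distance measure:

```python
import numpy as np

def reconstruction_error(x, x_hat):
    # Distance between input and output vectors (Euclidean here).
    return np.linalg.norm(x - x_hat)

def is_anomalous(x, x_hat, threshold):
    # Below the threshold -> normal; at or above -> anomalous.
    return reconstruction_error(x, x_hat) >= threshold

x = np.array([1.0, 0.0, 0.0])
ok = is_anomalous(x, np.array([0.9, 0.05, 0.05]), threshold=0.5)   # False
bad = is_anomalous(x, np.array([0.0, 1.0, 0.0]), threshold=0.5)    # True
```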
Abnormal Timing Detector: To detect irregular timing, the reconstruction error is defined as the absolute value of the difference between the input and output timing attributes. For the activity attribute, in contrast, the error is defined as the mean absolute error between the probability distributions associated with the activity's input and output attributes.
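The two error definitions above can be sketched as follows (the numeric values are illustrative):

```python
import numpy as np

def timing_error(t_in, t_out):
    # Abnormal-timing detector: absolute difference of the timing attribute.
    return abs(t_in - t_out)

def activity_error(p_in, p_out):
    # Abnormal-activity detector: mean absolute error between the
    # probability distributions over activity labels.
    return np.mean(np.abs(np.asarray(p_in) - np.asarray(p_out)))

e_time = timing_error(3.0, 2.5)                    # 0.5
e_act = activity_error([1.0, 0.0], [0.6, 0.4])     # mean(|0.4|, |0.4|) = 0.4
```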
Anomalous attribute simulation
In the second detector, instead of using the probability distribution, we use the activity label itself as the anomaly signal. An activity is considered anomalous if the reconstructed label and the input label do not match. We call this the argmax-based detector because the input and output labels are each taken to be the label with maximum likelihood.
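A sketch of the argmax-based detector, assuming the input and output are probability vectors over the activity labels:

```python
import numpy as np

def argmax_detector(p_in, p_out):
    # An activity is flagged as anomalous when the most likely input
    # label and the most likely reconstructed label disagree.
    return int(np.argmax(p_in)) != int(np.argmax(p_out))

same = argmax_detector([0.8, 0.1, 0.1], [0.6, 0.3, 0.1])   # labels match
diff = argmax_detector([0.8, 0.1, 0.1], [0.2, 0.7, 0.1])   # labels differ
```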
Input data treatment
Since an autoencoder requires a fixed-size matrix as input, zero-padding is applied to all instances with fewer than p events. The goal of the model-training step is to train a model that can capture the general behavior of the data in an event log. Once a model is trained, each instance in the event log is reconstructed into an output matrix of elements c′i,j of size p×q. We introduce a masking matrix to mark zero-padding values so that they are not considered when the model calculates the loss for the weight update. As explained in Section 4.2, we use a distance-based method to classify normal and abnormal behavior, so we consider a modified RMS-error loss function that incorporates the masking matrix of Eq.

We use dropout at 0.2 in the VAE and AE to avoid overfitting, while in the LAE we use gradient clipping to avoid vanishing/exploding gradients. We start the training procedure with a learning rate of 0.01 and gradually decrease it after each training run; we find that a learning rate of 0.0001 is sufficiently good for training all three models.

The performance of the proposed methods should be evaluated mainly on their ability to identify the anomalies. There are four possible outcomes of the binary classification of a variable labeled either i (abnormal) or j (normal): TPi (true positive: label i is correctly assigned label i), FPi (false positive: label j is incorrectly assigned label i), TNi (true negative: label j is correctly identified as label j), and FNi (false negative: label i is incorrectly identified as label j).

In addition to using the above metrics for evaluation, we also create visualizations to better understand the performance of the detection models. The first visual assessment is a histogram of the reconstruction error, used to examine how this value is distributed across normal and abnormal data.
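The masked RMS-error loss described above can be sketched as follows; the padding layout and matrix sizes are illustrative:

```python
import numpy as np

def masked_rmse(X, X_hat, mask):
    # Masked RMSE: zero-padded cells (mask == 0) are excluded so that
    # padding does not contribute to the weight-update loss.
    diff2 = ((X - X_hat) ** 2) * mask
    return np.sqrt(diff2.sum() / mask.sum())

# An instance of length 2 zero-padded to p = 3 rows, q = 2 attributes.
X = np.array([[1.0, 0.5], [0.2, 0.8], [0.0, 0.0]])
mask = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
X_hat = np.array([[0.8, 0.5], [0.2, 0.8], [0.7, 0.9]])  # padding mispredicted

# Only the 4 unmasked cells count, so the mispredicted padding row
# has no effect on the loss.
loss = masked_rmse(X, X_hat, mask)   # sqrt(0.2**2 / 4) = 0.1
```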
Since we use the threshold-based method described in Section 4.2 to separate normal and abnormal data points, we superimpose the histograms of the reconstruction errors of normal and abnormal data on the same plot to test whether the detection algorithm can distinguish between them. Apart from the figures reported for the LAE (Fig. 4.5c, 4.5f, 4.5i and 4.5l), it can be seen that the distributions of the normal and abnormal groups shown in Fig. 4.5 are well separated. In conclusion, we achieve sufficiently good results in predicting anomalous activity in artificial logs, while the performance on real-life logs should be improved. Increasing the size of the hidden layers degrades the performance of the argmax-based detectors, while the performance of the threshold-based detector remains stable. Comparing the performance of the three models, the results show that the VAE and AE outperform the LAE on all datasets, although we expected the LAE, with its more complex computations, to give competitive results. In terms of model complexity, the LAE architecture is more complex than the VAE and AE models, resulting in higher computational costs (see Table 4.2).

Discussion

Introduction

Methods

In this case, x1 is drawn from a uniform distribution in the range [1,5], i.e. at most 5 events of the log are affected, and x2 from a uniform distribution in the range [1,2], i.e. at most 2 attributes can be missing (activity or timestamp). Note that missing timestamp attributes are set to NaT, whereas other missing attributes are set to NaN. As a result of this procedure, attribute values are missing completely at random. The goal of the model-training step is to train a model that can learn the latent distribution of the event-log data. After the model is trained, each instance in the event log is reconstructed into an output matrix of elements c′i,j of size p×q.
As a result of the model-learning step, the values of the missing attributes (indicated by 0 in the input matrix) are mapped, in the output matrix, to a valid value for numerical attributes and to a probability distribution for categorical attributes. To define the loss function to be optimized during model training, we introduce a masking matrix to distinguish missing values and zero-padding values from non-zero values in the input matrix. In the post-processing step, missing values of numeric attributes are converted to valid values in the event log by inverting Eq.

Experiments

Evaluation criteria

The high variability of the results for the BPI 2013 event log is due to the distribution of activity durations. Comparing the performance of the model on the real-life and artificial logs, we see that the model achieves more stable results on the latter. This is due to the fact that the artificial logs contain fewer disturbances caused by noise. It may also be explained by the fact that a sequence of activities in an event log tends to follow a particular pattern determined by the control flow of the process. It therefore appears that the performance of the model should in future work be evaluated with respect to the complexity of the control flow of the process model. The results in Table 5.6 show a remarkable effectiveness of the proposed method in reconstructing missing activity names. As the number of missing attribute values increases, the performance of the proposed reconstruction models deteriorates. Once these control-flow structures are learned by a model, they can easily be used to reliably impute missing values. However, the method can be generalized to other variables that typically belong to an event log. First, the bias introduced by the distribution of timestamp values leads to poorer imputation of timestamps compared to activity names.
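The post-processing step can be sketched as follows; since the normalization equation is not shown in this excerpt, min-max scaling is assumed purely for illustration:

```python
import numpy as np

def impute_numeric(y, lo, hi):
    # Invert the assumed min-max normalization to map the model output
    # y in [0, 1] back to a valid attribute value in [lo, hi].
    return lo + y * (hi - lo)

def impute_categorical(p, labels):
    # Reconstructed probabilities are mapped to the most likely label.
    return labels[int(np.argmax(p))]

value = impute_numeric(0.5, lo=0.0, hi=10.0)                  # 5.0
label = impute_categorical([0.1, 0.7, 0.2], ["A", "B", "C"])  # "B"
```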
Conclusion

A journal article titled "Neural Computing to Improve Event Log Quality" is currently in preparation for journal submission (expected January 2018).

In Proceedings of the 17th International Conference on Foundations of Intelligent Systems, ISMIS'08, pages 150–159, Berlin, Heidelberg.

I would like to thank Mr. Hoang Minh Le for being so responsive to all my technical questions and for providing me with the computer and software used to perform all the experiments in this thesis.

In this appendix, we graphically show the simulated anomalous time attributes. (a) Accepted-Assigned (b) Accepted-In Progress. The green dots indicate abnormal data, the blue dots indicate normal data, and the solid red line is the decision boundary.

In this appendix, we provide the visualization of the ROC curves obtained from the experiments in Chapter 4.

In this appendix, we provide a brief guide on how to reproduce the results shown in this thesis.
Experiments
Evaluation criteria
Results
Missing attribute simulation
Input data treatment
Results
Discussion
Future work