file that contains configuration information for the various subcomponents and processes of the Inference Engine. The details of the configuration file are described in Section 5.5. Lastly, when used in Anomaly Detection Mode, a set of parameters describing a previously trained deep learning model is required to instantiate a version of the model capable of performing anomaly detection.
The main outputs of the Inference Engine are the model parameters when used in Training Mode and a report detailing detected anomalies when used in Anomaly Detection Mode.
The integration of all subcomponents, and the implementation of all processes and sub-processes making up the Inference Engine, is detailed in Section 5.5.
next expected log event key is. This is then compared to the actual target key, extracted from the log file, and if they are different, an anomaly is flagged. This process is detailed further in Section 5.5.
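The comparison step described above can be sketched as follows. This is a minimal illustration only; the function name and the list-based interface are assumptions, not the engine's actual API:

```python
def flag_anomalies(predicted_keys, actual_keys):
    """Flag an anomaly wherever the predicted next log event key
    differs from the target key extracted from the log file.

    Returns the positions of the mismatching pairs so they can be
    listed in the anomaly report.
    """
    return [i for i, (predicted, actual) in enumerate(zip(predicted_keys, actual_keys))
            if predicted != actual]
```

For example, given predicted keys `[7, 19, 7]` and actual target keys `[7, 19, 12]`, the final position would be flagged as anomalous.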
The Feature Extractor subcomponent implements the necessary processes to transform an incoming parsed log file into a feature-rich dataset consisting of a window of size w of log event keys and a corresponding target log event key. The processes performed by the Feature Extractor are shown in Figure 5.2.
[Figure 5.2 appears here. It shows the three processes of the Feature Extractor (Extract Sequence, Create Features and Targets, and Transform) and how the data is transformed between them: a structured log file is reduced to a log key sequence (e.g. dc2c74b7, dc2c74b7, 5d5de21c, ...), then to input sequence - target pairs of raw keys, and finally, via the stored mapping, to integer-encoded pairs (e.g. input sequence 7, 7, 19 with target 7). A legend distinguishes the common processing path from the training-mode-only and anomaly-detection-mode-only paths.]

Figure 5.2: Processes of the Feature Extractor subcomponent. Also illustrated is how the data is transformed between processes.
As shown in Figure 5.2, the Feature Extractor consists of three main processes: Extract Sequence, Create Features and Targets, and Transform.
The Extract Sequence process, implemented as a Python method, extracts a sequence of log event keys from the provided parsed log file. When received, the parsed log file consists of a structured dataset in which each row entry corresponds to a log message. Each log message is represented by multiple data fields, including a log event key. The Extract Sequence process extracts the log keys from all messages while preserving the order in which they appear, generating a sequence of log keys that is representative of the system's execution path as recorded by the original log file. Some systems, particularly those running multiple concurrent processes, generate log files that contain log messages from multiple processes or sessions. In systems where this is known to be the case, the Extract Sequence process is able to group sequences by session, provided a session identifier is available. The output of the Extract Sequence process is a sequence of log event keys, as shown in Figure 5.2.
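The extraction logic can be sketched as follows. This is an illustrative sketch, not the thesis implementation; the function name, the dict-based row representation, and the field names `event_key` and `session_field` are assumptions:

```python
from collections import defaultdict

def extract_log_key_sequence(rows, key_field="event_key", session_field=None):
    """Extract log event keys from a parsed log file, preserving order.

    `rows` is an iterable of dicts, one per log message. With no
    session_field, a single flat sequence is returned; otherwise the
    keys are grouped into one sequence per session identifier.
    """
    if session_field is None:
        return [row[key_field] for row in rows]
    sessions = defaultdict(list)
    for row in rows:
        sessions[row[session_field]].append(row[key_field])
    return dict(sessions)
```

Grouping by session in a single ordered pass ensures that each per-session sequence still reflects that session's execution order.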
Once a sequence of log event keys representing the system's execution path has been extracted, the Create Features and Targets process applies a sliding window to the sequence of log keys to extract pairs of input sequences and target log event keys. This process is implemented as a Python method, and the size of the window, i.e. the number of log event keys to be used in the input sequence, is tunable. The sliding window is applied to the sequence of log event keys until the entire sequence has been considered; once too few log event keys remain to satisfy the window size, no further windows are created. Where log key sequences have been grouped by session, the sliding window is only applied within each session group, and not across sessions. The output of the Create Features and Targets process is a dataset of input sequence - target key pairs generated across the entire log file.
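The sliding-window step can be sketched as below. The function name is an assumption; only the windowing behaviour described above is taken from the source:

```python
def create_features_and_targets(sequence, window_size):
    """Apply a sliding window of `window_size` log event keys.

    Each window becomes an input sequence, and the key immediately
    following it becomes the target. Once too few keys remain to fill
    a window and supply a target, no further pairs are created.
    """
    return [(sequence[i:i + window_size], sequence[i + window_size])
            for i in range(len(sequence) - window_size)]
```

For example, the sequence `[7, 7, 19, 7]` with a window size of 2 yields the pairs `([7, 7], 19)` and `([7, 19], 7)`.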
Once the input features and targets have been created, the final process of the Feature Extractor transforms the data such that it can be fed into a deep learning model. At this stage of the processing pipeline, both the input sequences and targets are alphanumeric text strings.
Deep learning algorithms perform better on numerical data and, as such, the features need to be transformed. The Transform process performs this transformation. When a new model is being trained, the Transform process learns a new transformation for the data. This transformation takes the form of a mapping in which each unique log key in the entire log file dataset is mapped to a unique integer value. This mapping is stored in a YAML file so that the same mapping used for model training may be reused when the model is performing inference. During inference, the Transform process loads a previously stored mapping corresponding to the log files that are being processed. The Transform process is implemented as a Python method.
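The fit-and-transform behaviour can be sketched as follows. This is a simplified in-memory sketch; the function name is an assumption, and the YAML persistence mentioned in the comment stands in for the stored mapping described above:

```python
def fit_transform(pairs):
    """Learn a log-key-to-integer mapping and apply it to the data.

    `pairs` is a list of (input_sequence, target_key) tuples of raw
    alphanumeric keys. In the engine, the learned mapping would be
    persisted (e.g. to a YAML file) for reuse at inference time;
    here it is simply returned alongside the transformed pairs.
    """
    mapping = {}

    def encode(key):
        # Assign the next free integer the first time a key is seen.
        return mapping.setdefault(key, len(mapping))

    transformed = [([encode(k) for k in seq], encode(target))
                   for seq, target in pairs]
    return mapping, transformed
```

At inference time, the stored mapping would be loaded and applied directly rather than re-learned, so that identical keys receive identical integers across training and inference.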
The output of the Transform process, and the final output of the Feature Extractor, is a transformed dataset consisting of input sequence - target key pairs. This dataset is ready to be ingested by a deep learning model for either model training or inference.
The entire Feature Extractor subcomponent is implemented as a Python class, named FeatureExtractor, and its UML Class Diagram is shown in Figure 5.3.
An instance of FeatureExtractor is instantiated with the following parameters, which are stored as class attributes:
1. sample_by_session - a flag specifying whether the log messages in the log file are to be sampled by session
2. window_size - an integer specifying the size of the sliding window, i.e. how many log event keys to include per window
3. training_mode - a flag specifying whether the Feature Extractor is to be run in training mode or not
4. data_transformation - a path to a previously generated data transformation, required when running in Anomaly Detection mode
[Figure 5.3 appears here: a UML Class Diagram of the FeatureExtractor class, listing the attributes sample_by_session, window_size, verbose, training_mode, data_transformation, unique_keys, output_dir and name, the public methods extract_features(), extract_log_key_sequence_by_session(), extract_log_key_sequence(), create_features_add_labels(), fit_transform() and transform(), and the private method _transform().]

Figure 5.3: UML Class Diagram of the FeatureExtractor class.
5. verbose - a flag specifying whether the Feature Extractor will output information regarding its operation
6. output_dir - a string specifying a path to a directory where all Feature Extractor outputs are to be stored
7. name - a unique name for identifying a particular instance of the Feature Extractor

An additional class attribute, unique_keys, is used to store the set of unique log event keys present in the log dataset that is to undergo feature extraction. The methods of the class implement the various processes as previously described.
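The parameters above can be grouped as in the following sketch. The dataclass grouping and default values are illustrative assumptions, not the thesis implementation; only the attribute names are taken from the text:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeatureExtractorConfig:
    """Configuration mirroring the FeatureExtractor attributes.

    Grouping them into a dataclass is illustrative; the actual class
    stores these directly as instance attributes.
    """
    sample_by_session: bool
    window_size: int
    training_mode: bool
    data_transformation: Optional[str] = None  # required in Anomaly Detection mode
    verbose: bool = False
    output_dir: str = "./output"
    name: str = "feature-extractor"

# Hypothetical configuration for a training run:
config = FeatureExtractorConfig(sample_by_session=True, window_size=10,
                                training_mode=True, name="training-run-01")
```

Keeping the configuration in one structure makes the training-mode/inference-mode distinction explicit: data_transformation may remain unset during training, since the mapping is learned rather than loaded.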