Academic year: 2023

Data Miner: A subcomponent of ALFAF that performs automatic parsing of log files into a structured data set.

Inference Engine: A subcomponent of ALFAF that performs automatic analysis of log files to detect system errors.

Background to the Study

While the framework developed is designed to be evaluated primarily on the CBF system, its robustness and generalizability to other systems is considered a secondary objective that influences the framework's design. Ensuring continuous operation of the CBF and meeting its reliability and uptime requirements, which requires quickly and efficiently locating faults and correcting failures, is a challenging and non-trivial task.

Research Focus

After this, Chapter 5 describes the design and development of the Inference Engine, which implements the chosen machine learning technique for log file analysis. The design and development of an automatic, end-to-end log file analysis framework, consisting of the Data Miner and Inference Engine combined in a pipeline, is the primary objective of this research.

Scope and Limitations

Scope

In Chapter 9, these results are then used to formulate requirements or guidelines for log generation to ensure optimal performance of the developed framework. Instead, Chapter 8 subjectively evaluates the performance of the framework in the context of the MeerKAT CBF, and Chapter 9 provides insights for further refinement and testing.

Limitations

Overview of Dissertation

These research questions are posed in such a way as to drive the development of the research project towards achieving its primary objective as outlined. In addition to formulating the research questions, high-level system requirements and design considerations for the solution system are derived.

Figure 2.1: An overview of the methodology and the various phases followed to develop this project

Phase 2: System Design

Detailed Design & Development

Thereafter, the high-level design of each component is expanded to allow the development of a prototype for each subsystem. Then, once the interface definition process is complete, further subsystem design and development of each component can take place individually.

Framework Development & System Integration

While system design is in progress, the detailed design of each subsystem both informs and is informed by the interface definition process that takes place between interacting elements. Initial subsystem design must therefore occur simultaneously for all subsystems, as every subsystem element must be sufficiently designed for its design to appropriately inform and guide interface decisions.

Phase 3: System Verification & Analysis

These tests are run and the results analyzed and evaluated as part of the System Performance Testing and Analysis Process. This process considers various system metrics that enable objective evaluation of the system.

Methodology Conclusion

Complex Systems

It is instead the interaction between the components of a system that produces the system's desired output. In some cases, systems are also operated remotely, which introduces an additional layer of complexity to the system.

Complex System Failure & Debugging

In complex systems, it is typical for many of the subsystems that make up the system to be interdependent. From the presented framework, it is quite clear that the effectiveness of the debugging approach depends largely on the debugger's domain knowledge and experience with the system.

Log File Analysis

Without applying the debugging framework, where applicable, engineers and developers turn to the log files generated by the system to perform log file analysis. With sufficient content, log files can describe and illustrate the system behavior at runtime.

Figure 3.1: Example log file from the MeerKAT Correlator Beamformer

Case Study System: MeerKAT Radio Telescope’s Correlator Beamformer 24

CBF control and monitoring is facilitated through a number of software packages running on the CMC. Identifying the root cause of failures is challenging in the context of MeerKAT CBF.

Figure 3.2: Figure showing the KAT-7 Radio Telescope on site in the Karoo. Image source: [26]

Automated Log File Analysis

  • Machine Learning Based Log File Analysis
  • Log Parsing
  • Feature Engineering
  • Log File Analysis
  • Summary

The work in [64] considers the use of Natural Language Processing (NLP) techniques to treat log messages contained in log files as raw text for the purpose of performing anomaly detection. These methods also rely on log parsing to parse log messages into log events and log parameters.

Figure 3.10: Figure detailing the four separate processes of Machine Learning Based Log File Analysis.

Selection of Tools and Frameworks

  • Numpy
  • Pandas
  • Scikit-learn
  • PyTorch
  • Jupyter Notebooks

The importance and relevance of good feature engineering has been highlighted and further explored for specific machine and deep learning methods. As a result, Numpy is a critical Python package when developing machine learning and deep learning models.

The Way Forward

This facilitates independent development and testing of the Data Miner before integrating it into the framework. It also helps in evaluating the performance of the Data Miner across different datasets.

High-Level Design: Data Miner

The flow of data into and out of the Data Miner is indicated by hollow arrows in Figure 4.1. The Pre-Parser needs information about the general structure and format of the log messages.

Subcomponent Design: Data Preparer

As shown in Figure 4.4, the components of the log message are specified by enclosing them in angle brackets (< and >). The Data Transformer uses the regular expression generated by the Pre-Parser to extract the various components of the log message preamble and the log message body.
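As an illustration, the translation from an angle-bracket format specification to a regular expression with named capture groups might be sketched as follows. This is a minimal sketch, not the actual Pre-Parser implementation; the function name and the assumption that a <Content> field consumes the rest of the line are hypothetical.

```python
import re

def format_to_regex(log_format):
    """Turn a format spec such as '<Timestamp> <Level> <Content>' into a
    compiled regular expression with one named capture group per <Field>."""
    pattern = ""
    # re.split with a capturing group keeps the <Field> markers in the result.
    for part in re.split(r"(<[^<>]+>)", log_format):
        if part.startswith("<") and part.endswith(">"):
            field = part[1:-1]
            # Assumption: <Content> takes the rest of the line, while other
            # fields are single whitespace-delimited tokens.
            pattern += f"(?P<{field}>.*)" if field == "Content" else f"(?P<{field}>\\S+)"
        else:
            pattern += re.escape(part)
    return re.compile("^" + pattern + "$")

regex = format_to_regex("<Timestamp> <Level> <Content>")
match = regex.match("2023-01-17T09:15:02 ERROR sensor 3 timed out")
print(match.groupdict())
# {'Timestamp': '2023-01-17T09:15:02', 'Level': 'ERROR', 'Content': 'sensor 3 timed out'}
```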

Figure 4.2: Example excerpt from a log file highlighting the log message preamble and the log message content

Subcomponent Design: Pre-Processor

The Pre-Processor first determines whether there are unchecked regular expressions in the given list. If there are still regular expressions to be checked, the Pre-Processor selects the next one and checks whether the pattern is present in the log message.
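This check-and-substitute loop can be sketched as below. The example regexes and the placeholder token are hypothetical stand-ins; real preprocessing lists are system-specific.

```python
import re

# Hypothetical preprocessing regexes for common runtime variables;
# an actual list would be tailored to the system's log messages.
PREPROCESSING_REGEXES = [
    r"\d{1,3}(?:\.\d{1,3}){3}",   # IPv4 address
    r"0x[0-9a-fA-F]+",            # hexadecimal identifier
]

def preprocess(message, regexes=PREPROCESSING_REGEXES, placeholder="<*>"):
    """Check each regular expression in turn; wherever its pattern is
    present in the log message, replace the matched runtime variable
    with a placeholder token."""
    for pattern in regexes:
        if re.search(pattern, message):
            message = re.sub(pattern, placeholder, message)
    return message

print(preprocess("connect to 10.0.0.7 failed, handle 0x1f2e"))
# connect to <*> failed, handle <*>
```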

Subcomponent Design: Log Parser

Log Parsing Algorithms

The first stage of partitioning uses the number of log message tokens as the partition criterion. The Length Matters algorithm, or LenMa, is an online log parsing algorithm that extracts log message event templates based on the lengths of the words in each log message [80].
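The token-count partitioning criterion can be illustrated with a simplified sketch (this is only the first partitioning stage, not a complete parsing algorithm):

```python
from collections import defaultdict

def partition_by_token_count(messages):
    """Group log messages by their number of whitespace-delimited tokens:
    messages produced by the same event template usually have the same
    token count, so this is a cheap first partitioning criterion."""
    partitions = defaultdict(list)
    for msg in messages:
        partitions[len(msg.split())].append(msg)
    return dict(partitions)

logs = [
    "connected to node 4",
    "connected to node 9",
    "heartbeat missed",
]
print(partition_by_token_count(logs))
# {4: ['connected to node 4', 'connected to node 9'], 2: ['heartbeat missed']}
```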

Figure 4.9: Overview of the AEL log parsing algorithm. Image source: [49]

Log Parser Design

If the length of the longest common subsequence is greater than the threshold, then the incoming log message is considered to match the event template described by the LCSseq. If the length of the longest common subsequence is smaller than the threshold, then a new LCSObject is created and the tokenized log message is used as the LCSseq.
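The matching step can be illustrated as below. The threshold here, a fixed ratio of the message length, is an assumption for the sketch; the parser's actual threshold definition may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    computed with standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def matches_template(tokens, lcsseq, threshold_ratio=0.5):
    """The incoming message matches a stored template when the LCS between
    its tokens and the LCSseq exceeds the threshold (here assumed to be a
    ratio of the message length)."""
    return lcs_length(tokens, lcsseq) > threshold_ratio * len(tokens)

msg = "node 7 lost heartbeat".split()
template = "node <*> lost heartbeat".split()
print(matches_template(msg, template))
# True: three of the four tokens are shared, exceeding the threshold
```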

Developing a Regex-based Log Parser

This method performs the actual parsing function by calling algorithm-specific methods that result in the execution of the log parsing algorithm. Attributes also include configurable algorithm parameters that are passed to the LogParser class when instantiating the object.

Component Design: Data Miner

The Data Miner configuration file (.yaml) has the following structure:

log_format:    # format of the log messages
preprocessing: # list of regexes for preprocessing

The output of the Data Miner is two data structures generated by the processing pipeline.

Figure 4.17: Format of the Data Miner Configuration File

Parsing Algorithm Tuning

Data Miner Design Conclusion

The process followed for the design and development of the Inference Engine is described in Section 2.2.1. Initially, based on the design considerations, a high-level design of the Inference Engine is detailed and described.

Design Considerations and System Requirements

Deep Learning Framework: The Inference Engine will use the PyTorch Deep Learning Framework [74] to implement machine learning and deep learning models.

Use: The Inference Engine is a modular, self-contained tool that can be used independently of the ALFAF.

High-Level Design: Inference Engine

In Training mode, after loading data, the inference engine initiates the Model Training processing stage. During this processing phase, the inference engine attempts to detect and identify anomalies in the submitted log file.

Figure 5.1: High-Level Design of the Inference Engine component of the Automated Log File Analysis Framework

Subcomponent Design: Feature Extractor

The Extract Sequence process, implemented as a Python method, extracts a sequence of log event keys from the provided parsed log file. The size of the window, i.e. the number of log event keys to use in the input sequence, can be configured.
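A sliding-window extraction of this kind can be sketched as follows (a simplified illustration of the idea, not the actual Feature Extractor method):

```python
def extract_sequences(event_keys, window_size):
    """Slide a fixed-size window over the sequence of log event keys,
    yielding (input_window, next_key) pairs: the window is the model
    input, and the key that follows it is the prediction target."""
    pairs = []
    for i in range(len(event_keys) - window_size):
        pairs.append((event_keys[i:i + window_size], event_keys[i + window_size]))
    return pairs

keys = [3, 1, 4, 1, 5, 9]
for window, target in extract_sequences(keys, window_size=3):
    print(window, "->", target)
# [3, 1, 4] -> 1
# [1, 4, 1] -> 5
# [4, 1, 5] -> 9
```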

Figure 5.2: Processes of the Feature Extractor subcomponent. Also illustrated is how the data is transformed between processes

Subcomponent Design: Anomaly Detection Model

The methods of the class implement the various processes as previously described. Research on the use of LSTM neural networks for the problem of log file analysis in the form of anomaly detection has illustrated various LSTM architectures, in terms of the number of LSTM layers and whether the LSTM is bidirectional or not.

Figure 5.4: Architecture of the Long Short-Term Memory Recurrent Neural Network implemented by the Anomaly Detection Model

Component Design: Inference Engine

  • Feature Extraction
  • Data Loading
  • Model Training
  • Anomaly Detection

These are then loaded into tensors in preparation for the rest of the anomaly detection process. The Anomaly Detection process is performed when the Inference Engine is run in Anomaly Detection Mode.

Figure 5.7: UML Activity Diagram illustrating the functionality of the Model Training process

As seen in Figure 5.7, the input to the Model Training process is the training data, encoded in PyTorch tensors and loaded into an iterable object.

Inference Engine Design Conclusion

This chapter finalizes the design and development of the Automated Log File Analysis Framework (ALFAF). The design considerations and system requirements presented for the development of the subcomponents apply to the ALFAF.

System Interfaces

Internal Interfaces

The rows of the data structure in Figure 6.2 represent individual log messages from a particular log file, while the columns represent data fields describing the log message. Depending on the structure of the log messages, additional data fields can be generated, as described in Chapter 4.

External Interfaces

The user output interface is a data interface that describes the data format, structure, and required fields of the output data generated by the framework, which end users can use to assist in debugging. Each row of the data structure again represents a single log message from a given log file.

Figure 6.3: Data structure and format of the Debug and Suspicious Lines Reports

As shown in Figure 6.3, the data structure is based on the parsed log data structure shown in Figure 6.2, with additional data fields.

Automated Log File Analysis Framework Design

Implementation

The dataminer and inference engine attributes store instances of the DataMiner and InferenceEngine classes, which implement the Data Miner and Inference Engine subcomponents, respectively. The two modes of operation of the framework are invoked by two class methods: one that trains the anomaly detection model (training mode) and one that analyses a log file (inference mode).

Figure 6.5: UML Class Diagram for the AutomatedLFAFramework class that implements the Automated Log File Analysis Framework

Framework Design Conclusion

Compute Platform

All algorithm setup and performance evaluation for this project was performed on an available desktop computer with the configuration shown in Table 7.1. In these experiments, all Data Miner setup and processing is done on the CPU, while Inference Engine setup and processing is done on the GPU.

Datasets

Data Miner Tuning and Verification

Design and Functional Verification

Interfaces: An interface exists that facilitates data transfer from the Data Miner to the Inference Engine; it is described in detail in Section 6.2.

Performance Metrics: The Data Miner is able to output the following performance metrics: Parsing Accuracy and Parsing Time.
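Parsing Accuracy is commonly defined in log parsing benchmarks as the fraction of log messages whose predicted event group exactly matches the ground-truth group; the sketch below assumes that definition, which may differ in detail from the metric as implemented here.

```python
from collections import defaultdict

def parsing_accuracy(predicted, truth):
    """Fraction of log messages whose predicted event group coincides
    exactly with the ground-truth event group. `predicted` and `truth`
    map each message index to a template id."""
    def groups(labels):
        g = defaultdict(set)
        for idx, label in labels.items():
            g[label].add(idx)
        return set(frozenset(s) for s in g.values())

    correct_groups = groups(predicted) & groups(truth)
    correct = sum(len(g) for g in correct_groups)
    return correct / len(truth)

truth = {0: "A", 1: "A", 2: "B", 3: "C"}
predicted = {0: "E1", 1: "E1", 2: "E2", 3: "E2"}   # parser merged B and C
print(parsing_accuracy(predicted, truth))
# 0.5: only messages 0 and 1 sit in a correctly recovered group
```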

Algorithmic Tuning

The maximum size of the log file that can be consumed is limited by the available computing power, not by the design of the Data Miner.

Parsing Algorithm: The Data Miner supports several state-of-the-art log parsing algorithms, including AEL, Drain, IPLoM, LenMa, LogMine and Spell.

Table 7.2: Data Miner algorithmic tuning results

Log Parsing Algorithm Performance

Performance Evaluation

Parsing time is simply measured as the time it takes for a log parsing algorithm to complete parsing a given log file. As shown in Figure 7.1, up to a log file containing 1 million log messages, most of the log parsing algorithms' parsing times scale in a similar manner.
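A wall-clock measurement of this kind might look as follows; the helper name and the toy stand-in parser are hypothetical, not the project's benchmarking code.

```python
import time

def time_parsing(parse_fn, log_lines):
    """Measure parsing time as the wall-clock time the parsing function
    takes to consume the whole log file."""
    start = time.perf_counter()
    parse_fn(log_lines)
    return time.perf_counter() - start

# Toy stand-in for a log parsing algorithm: tokenize every line.
lines = ["msg %d from node" % i for i in range(100_000)]
elapsed = time_parsing(lambda ls: [l.split() for l in ls], lines)
print(f"parsed {len(lines)} messages in {elapsed:.3f} s")
```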

Table 7.5: Data Miner Parsing Efficiency Test Results

Inference Engine Tuning and Verification

Design and Functional Verification

Deep Learning Algorithm: The Anomaly Detection Model subcomponent of the Inference Engine uses a recurrent neural network.

Processing Mode: The Inference Engine was designed to process log files offline only.

Algorithmic Tuning

Deep Learning Framework: The development of the Inference Engine uses the PyTorch framework to implement the Anomaly Detection Model.

This set of hyper-parameters was used for all further performance evaluation of the Inference Engine.

Performance Evaluation

When performing anomaly detection, the Inference Engine has a tunable parameter that defines the criterion used to decide whether a log message is anomalous or not. This model is then used by the Inference Engine in inference mode to detect anomalies in the HDFS validation dataset.
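One common criterion, used in DeepLog-style detectors, treats the observed log key as normal if it falls within the g most probable next keys predicted by the model. The sketch below assumes that criterion; the key names and probabilities are hypothetical model output.

```python
def is_anomalous(probabilities, actual_key, num_candidates):
    """Anomaly criterion: the model assigns a probability to each possible
    next log key; the observed key is normal if it is among the
    `num_candidates` most probable keys, and anomalous otherwise."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return actual_key not in ranked[:num_candidates]

# Hypothetical model output over four candidate log keys
probs = {"k1": 0.55, "k2": 0.30, "k3": 0.10, "k4": 0.05}
print(is_anomalous(probs, "k3", num_candidates=2))   # True: k3 is not in the top 2
print(is_anomalous(probs, "k2", num_candidates=2))   # False: k2 is in the top 2
```

Increasing the number of candidates makes the criterion more permissive, trading missed anomalies against false alarms, which is why it is exposed as a tunable parameter.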

Table 7.9: Anomaly Detection Performance of the Inference Engine on the Validation Dataset

Framework Verification

Design and Functional Verification

The operation of the framework in training mode is detailed in Section 8.1.2, and the operation of inference mode is shown in Section 8.1.3. The external interfaces to the system under test and the end user are detailed in Section 6.2.

Conclusion

  • Dataset
  • Training the Framework
  • Using the Framework to Detect Anomalies
  • Subjective Evaluation

Log Message Format: The log message format of the CBF log messages is shown in Figure 8.2.

Regular Expression Preprocessing: An examination of the CBF log files reveals numerous recurring runtime variable formats.

Figure 8.1: Example log file from the MeerKAT Correlator Beamformer

Analysis

Data Miner Analysis

Adapting the log parsing algorithms to the log files being considered is necessary to achieve the best results. The results presented in Table 7.5 show that log parsing algorithms can quickly parse log files in most cases.

Inference Engine Analysis

Inference Engine performance was evaluated using the HDFS Ground Truth dataset where anomalies are labeled per session. This section discusses how certain attributes of the log files can affect the performance of the Inference Engine and which attributes are required for optimal performance.

Overall Framework Analysis

In order for the Inference Engine to model system behavior accurately, the available log files must describe the normal operation of the entire system. If the logs describe only a subset of the system's operation, normal events will be misclassified, leading to a higher false alarm rate.
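The false alarm rate mentioned here is the fraction of truly normal items flagged as anomalous, FP / (FP + TN); a small sketch with illustrative labels:

```python
def false_alarm_rate(labels, predictions):
    """False alarm (false positive) rate: the fraction of truly normal
    items (label 0) that the detector flags as anomalous (prediction 1),
    i.e. FP / (FP + TN)."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return fp / (fp + tn)

# 1 = anomalous, 0 = normal; two of the eight normal items are flagged
labels      = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(false_alarm_rate(labels, predictions))
# 0.25 (2 false alarms out of 8 normal items)
```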

Analysis Conclusion

  • Research Question 1
  • Research Question 2
  • Research Question 3
  • Research Question 4
  • Research Question 5
  • Research Question 6

The research conducted in Chapter 3 formed the basis for the design and development of the Automated Log File Analysis Framework (ALFAF). The design of the subcomponents and the framework was verified, and the performance of both the Data Miner and the Inference Engine was evaluated.

Reflection on Project Objectives

Before testing ALFAF performance at scale on the CBF system, the recommendations regarding the CBF logs made in Section 9.3.1 should be considered. The structure and content of the CBF logs was found to be the factor hindering ALFAF's anomaly detection performance.

Recommendations for Future Work

Improving ALFAF Performance on CBF

Considerations for improving the CBF logs in this regard are presented as recommendations in Section 9.3.1. The ALFAF design is modular, robust and tunable, and as such can be used to perform automated log analysis on logs generated by any system.

ALFAF Research Recommendations

A. Oliner and J. Stearley, “What Supercomputers Say: A Study of Five System Logs,” in Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2007.

M. Du and F. Li, “Spell: Streaming Parsing of System Event Logs,” in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2017.

Specifications of the Compute Platform used for experiments in this study

Data Miner algorithmic tuning results

Optimal parameters for the various log parsing algorithms after tuning on the

Data Miner Parsing Accuracy Test Results

Data Miner Parsing Efficiency Test Results

Data Miner Robustness Test Results

Search space for the various model parameters

Optimal hyper-parameters for the LSTM model of the Inference Engine

Anomaly Detection Performance of Inference Engine on Validation Dataset

Anomaly Detection Performance of Inference Engine on Test Dataset

Performance of the various log parsing algorithms on the CBF log files

Optimal hyper-parameters for the LSTM model of the Inference Engine for the

Anomaly Detection performance of the Inference Engine trained on a mixed

