3.1.3 Log File Analysis
4. Identify which processes or functions of the suspicious component are causing the failure,
5. Be able to identify the root cause of the error, and
6. Evaluate the obtained evidence to increase confidence in the identified error and to effectively determine how to rectify it.
It is often found that novice debuggers spend more time understanding the symptoms of a failure before moving on to running testing experiments. System experts, however, with their better understanding of the system and their experience in the process of debugging, are typically able to fast-track the process and begin isolating the cause of failure much sooner. This further supports the statement that domain knowledge, in addition to the ability to debug effectively, is critical to debugging failures that occur within systems.
With sufficient domain knowledge, debuggers have various options and tools for debugging at their disposal. In the case of building software systems, developers and debuggers typically employ Integrated Development Environments (IDEs) during development. These IDEs not only serve as an effective environment for software development by including useful features such as syntax highlighting, but they also make numerous options for testing and debugging code available. The most commonly used tools during software debugging are console outputs, breakpoints and unit testing.
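As a brief illustration of the first and last of these, the Python sketch below shows a temporary debug print used as console output and a unit test that codifies the expected behaviour; the function name and values are hypothetical and serve only to illustrate the techniques.

```python
import unittest


def scale_sample(value, gain=2.0):
    """Hypothetical processing step whose behaviour is being debugged."""
    result = value * gain
    # Console output: a temporary debug print exposing intermediate state.
    # A breakpoint could be placed here instead, e.g. via an IDE or
    # Python's built-in breakpoint().
    print(f"scale_sample: value={value}, gain={gain}, result={result}")
    return result


class TestScaleSample(unittest.TestCase):
    """Unit test codifying the expected behaviour of scale_sample."""

    def test_default_gain(self):
        self.assertEqual(scale_sample(3.0), 6.0)


if __name__ == "__main__":
    unittest.main()
```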
On the other hand, when building hardware systems, engineers typically find themselves spoilt for choice when deciding which tools to employ to debug and test a particular hardware design. The tools employed range from basic multimeters and vector network analysers to complex simulation software packages that enable the simulation of various hardware designs.
These tools provide the engineer with the freedom and flexibility to conduct isolated experiments in an effort to identify the root cause of the problem or failure.
Typically, however, the debugging framework described previously, and the tools used by developers and engineers for the purposes of debugging, are only available and employed during the design and development phases of a system. Once a system is deployed and operational, engineers and developers seldom have the opportunity to run tests and experiments to determine the cause of failure in large complex systems, as minimising system downtime is often a priority. Where the debugging framework cannot be applied, engineers and developers turn toward the log files generated by the system and perform log file analysis [18][12][14][1].
Log file analysis is a form of debugging that focuses on error or failure identification and lo- calisation and involves analysing log files that are produced by the system in question during operation [14][18]. Many systems, of both the software and hardware variety, are often designed and configured to generate log files during runtime. In the case of software systems, program flow and execution is recorded by writing, or printing, a text string that provides information about a certain event. This may be the result of a method, a change in program flow or state or an event triggered by some external action e.g. a user connecting to the system. These events are logged, that is printed to some system console providing information about the software’s operation, using print, printf and other similar functions in programming languages, and are typically also written to a file known as thelog file.
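A minimal sketch of this, using Python's standard logging module, is given below; the file name, logger name and messages are illustrative and do not correspond to any particular system.

```python
import logging

# Minimal sketch: route log messages to a file (file name, logger name
# and messages are illustrative only).
logging.basicConfig(
    filename="system.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("ingest")

logger.info("User connected from %s", "192.168.1.10")  # external action
logger.warning("Buffer occupancy at %d%%", 87)         # change in state

try:
    result = 1 / 0                                      # a failing operation
except ZeroDivisionError:
    logger.exception("Processing step failed")          # logged with traceback
```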
In hardware systems, the manner in which log file generation is handled depends on the nature of the system. Some hardware systems that have microcontrollers may log system states and events, and store this information in non-volatile flash memory. This log may then be retrieved by reading the contents of this flash memory. Other hardware systems generate log files through software applications and processes that control or monitor the hardware. For example, a custom processing board may be polled, through controlling software, for information about its operating temperature or system state. This controlling software application then uses the received information to generate a log file for the hardware component.
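As a rough sketch of this arrangement, the Python example below polls a hypothetical processing board for its temperature and writes the result to a log file; read_temperature() stands in for a real driver or protocol call and is not an actual vendor or MeerKAT API.

```python
import logging
import random
import time

logging.basicConfig(
    filename="processing_board.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)


def read_temperature():
    """Placeholder for a real driver call that polls the board's sensor."""
    return 45.0 + random.uniform(-2.0, 8.0)


def poll_board(interval_s, polls):
    """Periodically poll the (hypothetical) board and log its state."""
    for _ in range(polls):
        temperature = read_temperature()
        if temperature > 50.0:
            logging.warning("Board temperature high: %.1f C", temperature)
        else:
            logging.info("Board temperature nominal: %.1f C", temperature)
        time.sleep(interval_s)


if __name__ == "__main__":
    poll_board(interval_s=1.0, polls=3)
```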
Log files are generated while the system is in operation and, in most cases, engineers, operators, debuggers and sometimes even end-users can access these log files at any time after they have been generated. The content and format of these log files vary and can differ substantially between systems. Most commonly, log files contain information pertaining to the various events that occur within the system, or they contain information about the status of the system in terms of its health and performance. This information may include, but is not limited to, timestamps of when the event occurred, the inputs and outputs of the event, objects that triggered or are impacted by the event, health and status information, performance-related metrics and information about the user who initiated the event [18][19][15]. During development, system designers have control over, and consequently the responsibility for, which system information is generated and stored in these log files.
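Purely as an illustration of the kinds of fields listed above, the snippet below assembles a single structured log record; the field names and values are hypothetical and do not reflect the log format of any particular system.

```python
import json
from datetime import datetime, timezone

# Hypothetical structured log record covering the fields mentioned above:
# timestamp, event, inputs/outputs, affected object and initiating user.
record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "event": "configure_stream",
    "inputs": {"channels": 1024, "mode": "wideband"},
    "outputs": {"status": "ok", "duration_s": 2.4},
    "affected_object": "stream-0",
    "initiated_by": "operator_1",
}

# One record per line keeps the resulting file easy to search with text tools.
print(json.dumps(record))
```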
A snippet of an example log file is shown in Figure 3.1 below. This log file was generated by the Correlator Beamformer, a subsystem of the MeerKAT Radio Telescope detailed further in Section 3.1.4.
Figure 3.1: Example log file from the MeerKAT Correlator Beamformer
From the log file in Figure 3.1, the various events that occur during the initialisation process of the Correlator Beamformer can be seen.
With sufficient content, log files are able to describe and illustrate the system's behaviour during runtime. A system expert is then able to look at the log files generated under normal operation and understand what the system is doing, and why the system is exhibiting a particular behaviour.
Similarly, an expert is also able to analyse log files generated during failure cases and begin deducing where the problem originated.
Using log file analysis for debugging system failures makes the following assumptions (partly adapted from and supported by [20]):
1. The system being tested or monitored generates log files that include information applicable to the debugging of the system (execution times, state changes, variable changes, user requests, etc.),
2. System designers can define a logical and agreed-upon logging policy that specifies the information contained within the log files and their format (a minimal sketch of such a policy is given below), and
3. Engineers and operators with sufficient domain knowledge can analyse the generated log files to detect, identify and localise errors and failures that occur within the system.

If these assumptions hold true, log files become an indispensable resource to operators and engineers tasked with debugging systems that experience failures during runtime.
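Regarding the second assumption, one possible way to encode such a logging policy is as a single configuration that every component loads, sketched below using Python's logging.config; the format string, file name and logger name are assumptions made purely for illustration.

```python
import logging.config

# A minimal, hypothetical logging policy: every component loads the same
# configuration so that all log files share one format and level threshold.
LOGGING_POLICY = {
    "version": 1,
    "formatters": {
        "standard": {
            "format": "%(asctime)s %(levelname)s %(name)s %(message)s",
        }
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "filename": "component.log",
            "formatter": "standard",
        }
    },
    "root": {"handlers": ["file"], "level": "INFO"},
}

logging.config.dictConfig(LOGGING_POLICY)
logging.getLogger("subsystem.control").info("Component initialised")
```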
Although log files contain a substantial amount of information, much of which is useful in determining the cause of failures within systems, there are challenges associated with log file analysis.
Because log files are generated during system runtime, and because systems are typically run for extended periods of time, the amount of information contained within log files can quickly become overwhelming. Furthermore, the increasing complexity of the systems being designed and deployed today also results in more verbose log files, as there are now more components within the system that have messages to log [14]. The size of the log files becomes a problem when tasking an operator or engineer with manually analysing them to detect faults. This is partly alleviated by the use of text search tools, such as grep [21], that enable debuggers to search for common expected log messages that may be associated with a system failure. Further motivating the use of search tools, it is common for log messages to be associated with a log level indicating the severity or importance of each message. Example log levels from the Python logging framework include ERROR, WARNING and CRITICAL [22]. Debuggers can use search tools to search for log messages of a particular level and thereby reduce the volume of logs that need to be analysed. However, filtering log messages based on log level may hide useful system information that could potentially provide insights into the lead-up to the failure.
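The level-based filtering described above is commonly done with grep directly; the small Python sketch below mimics the same idea and is illustrative only (the file name and level names are assumptions).

```python
def filter_by_level(log_path, levels=("ERROR", "CRITICAL")):
    """Yield only those log lines mentioning one of the given level tokens,
    mimicking what a debugger might do with grep."""
    with open(log_path) as log_file:
        for line in log_file:
            if any(level in line for level in levels):
                yield line.rstrip("\n")


# Example usage (file name is illustrative). Note that, as discussed above,
# this discards surrounding context that may describe the lead-up to a failure.
for line in filter_by_level("system.log"):
    print(line)
```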
Given the often unstructured and inconsistent format of log files [15][18], they are not easily readable by humans. The content of these log files is also predominantly text-based, and it is well known that humans typically experience difficulty in analysing and finding patterns within textual data [23]. Moreover, given the large amount of data to be analysed, human error inevitably becomes a factor.
Log files are also commonly extremely verbose and noisy [12]. They record system operation throughout runtime, not only during failure, although more sophisticated logging schemes do employ log levels and filtering rules to determine which events ultimately get logged. As a result, the information that is important to debugging an error or failure is often hidden among unrelated log messages. This further increases the difficulty, even for experts, of locating the root causes of failure in these log files.
Finally, log files, by design, record all kinds of failure. A study by Mirgorodskiy et al. [1] makes a distinction between fail-stop and non-fail-stop errors and failures. Fail-stop errors refer to errors that cause an event or process to cease prematurely, whereas non-fail-stop errors refer to errors that do not result in a break in execution, but instead result in anomalous behaviour within the process or execution. Depending on the system in question, different types of errors and failures may be of varying levels of importance, and since log files record all types, the process of analysis is further complicated as the errors of concern are interleaved with failures and errors of lesser significance.
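As a contrived illustration of this distinction (not taken from the cited study), the two Python functions below show a fail-stop error, which halts execution, and a non-fail-stop error, which continues but produces an anomalous result.

```python
def fail_stop_average(readings):
    """Fail-stop: execution ceases prematurely when the input is empty."""
    return sum(readings) / len(readings)  # raises ZeroDivisionError and halts


def non_fail_stop_average(readings):
    """Non-fail-stop: execution continues but the result is anomalous."""
    if not readings:
        return 0.0  # silently returns a misleading "average" instead of failing
    return sum(readings) / len(readings)
```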
One of the most concerning aspects limiting the usefulness of log file analysis as an effective debugging tool is that, in most cases, only an engineer or operator with sufficient domain knowledge of the system in question is able to extract useful information from these log files [13]. Given the nature of what is being logged, i.e. execution flows, state changes, variable changes, request traces etc., it becomes obvious that a thorough understanding of the system and its innermost workings and behaviours is required in order to deduce the root causes of failure from log files. To the untrained eye, log files appear to be nothing more than endless lines of random text.
Log file analysis then, while potentially useful, is not without its shortcomings. However, in the presence of sufficient domain knowledge, it is especially useful for determining the root cause of an error or failure within a large, deployed system. With sufficient domain knowledge, an engineer can analyse log files and pinpoint exactly where the error occurred, what triggered it, and even how it affected other components of the system. For this reason, log file analysis is deemed a viable solution when considering the debugging of large and complex systems.
Debugging Complex System Failure
As alluded to in Section 3.1.2, the complexity found within complex systems extends into the failure and debugging of these systems. In the previous subsection, it was discussed how debugging is typically employed to find the root cause of a failure or error that occurs within a system, and how this is a non-trivial task. With complex systems, however, the cost and complexity involved in identifying root causes of failure is exacerbated as the size of the system increases [14][13].
The reason for this lies in the very nature of complex systems as previously defined: they are a collection of interacting components and systems. This results in many more possible points of failure and a larger area of focus when performing debugging, as the interdependency between systems makes root cause failure analysis even more challenging. Additionally, complex systems are typically deployed for extended periods of time or in critical applications [10], and as such, conventional debugging tools and methods are usually not applicable since minimising system downtime is a major priority. Debugging deployed complex systems therefore typically relies on log file analysis.
Log file analysis is appropriate for debugging complex systems because the logs generated, when logging is implemented correctly, encompass the entire system and all its interacting components. As a result, the root cause of failure can be identified albeit with significant manual effort.
The shortcomings of log file analysis are, however, exacerbated in the context of large, complex systems:
• due to the increase in the size and complexity of the system, the volume of log data generated grows rapidly, which makes manual review impractical and time-consuming,
• given that a complex system consists of many systems, identifying the root cause of failure may require multiple system experts, each with specific domain knowledge, to debug collaboratively,
• enforcing a system-wide logging standard across all components of the complex system may prove challenging.
As previously mentioned, downtime is incurred when a complex system fails. Reducing downtime involves both identifying the error and rectifying it, and in some instances applying fixes to reduce the likelihood of the error occurring again. It has been demonstrated in practice that the time to recover a system from failure is dominated by the time taken to identify and localise the error after it occurs [23]. As a result, it is not uncommon for system failures to be resolved by simply restarting the system, restoring operation in an attempt to minimise downtime. The process of debugging the failure is then handled as a separate activity that does not impact or inhibit continued system operation.
While log file analysis can assist in identifying the root causes of failure, the shortcomings associated with this approach make manual analysis impractical in the context of complex systems. A more effective, and preferably automated, approach to debugging failures using log file analysis within complex systems is therefore desired.
3.1.4 Case Study System: MeerKAT Radio Telescope’s Correlator Beamformer