

In the PDF document Presented by: University (Pages 34-37)


3.1.2 Complex System Failure & Debugging

Any system, whether simple or complex, is prone to failure [9]. In the context of engineering systems, failure is defined as an event or occurrence that halts or impairs system functionality and/or behaviour [1]. As previously discussed, the consequences of system failure vary depending on the system.

When a complex system fails, identifying the origin of the failure can be a challenging and time-consuming task [14]. This is a direct consequence of the nature of such systems, as they are made up of many interacting components. The larger and more complicated the system, the more individual components there are that are prone to failure. The process of debugging a failure in a complex system can therefore lead to extended periods of downtime and may require additional effort from engineers and/or operators.

Complex system failure is most analogous to failures that occur within large software systems [12]. Large software systems often comprise many interdependent and related libraries and modules that each contain a plethora of methods, classes, objects and variables. When developing large software systems, debugging uncommon failures often requires investigating all methods, classes and libraries that the particular suspicious section of code depends on. This process may involve tracing a failure across and up through multiple function calls and execution paths, and investigating the code base that makes up other components of the larger software system.
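As a minimal illustration of tracing a failure up through multiple function calls, the sketch below uses Python's standard `traceback` module. The three-layer call chain (`run_system`, `load_module`, `parse_record`) is entirely hypothetical and simply stands in for the layered structure described above; the point is that the recorded call frames expose every hop the failure crossed on its way to the top level.

```python
import sys
import traceback

def parse_record(raw):
    # Lowest-level component: fails on malformed input.
    return int(raw)

def load_module(records):
    # Mid-level component that depends on parse_record.
    return [parse_record(r) for r in records]

def run_system(records):
    # Top-level entry point where the failure is first observed.
    return sum(load_module(records))

def trace_failure(records):
    # Return the names of the call frames the failure crossed,
    # from the top of the system down to the originating call.
    try:
        run_system(records)
        return []
    except ValueError:
        frames = traceback.extract_tb(sys.exc_info()[2])
        return [frame.name for frame in frames]

print(trace_failure(["1", "2", "oops"]))
```

Walking the printed frame list from first to last mirrors the debugging process described above: the failure is observed at `run_system` but originates in `parse_record`, several calls deeper.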

Given that a complex system consists of components and smaller systems, which may in turn consist of sub-systems that are themselves assemblies of components, a debugging approach similar to that employed for software systems is required, as the failure may need to be investigated and traced across a number of components. Additionally, root cause analysis for failures caused by a component at a very low level in the system hierarchy may be even more challenging and time-consuming, as the failure is often flagged generically at a much higher level of the system.

In complex systems, it is typical for many of the subsystems making up the system to be dependent on one another. More specifically, they are dependent on the inputs and/or outputs to and from other subsystems and the interfaces between subsystems [8]. In this case, if one subsystem fails, it can cause another subsystem to fail as well, and the error that propagates to the user level may not accurately represent the actual, underlying cause of the failure.
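This masking of the underlying cause can be sketched with Python's exception chaining. The subsystem names (`SensorError`, `PipelineError`, `read_sensor`, `process`) are hypothetical; the sketch shows how the error visible at the user level differs from the low-level cause, which only survives because it is explicitly chained.

```python
class SensorError(Exception):
    """Failure in a hypothetical low-level subsystem."""

class PipelineError(Exception):
    """Failure reported at the user level."""

def read_sensor():
    # The actual, underlying cause of the failure.
    raise SensorError("voltage out of range")

def process():
    try:
        return read_sensor()
    except SensorError as exc:
        # Propagate a higher-level error, but keep the underlying
        # cause attached so it is not lost at the user level.
        raise PipelineError("processing failed") from exc

try:
    process()
except PipelineError as exc:
    print(type(exc).__name__)            # PipelineError (what the user sees)
    print(type(exc.__cause__).__name__)  # SensorError (the root cause)
```

Without the `from exc` clause, the user-level error would carry no direct link back to the failing subsystem, which is precisely the tracing problem described above.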

This scenario can often result in also having to trace a failure back and forth across multiple subsystems, and interfaces, in an effort to locate the root cause of the failure. Once the root cause of the failure is located, depending on the nature and complexity of the failure, additional analysis may be required to fully understand the failure.

From this, it can be seen that the process of debugging complex systems incurs additional challenges in performing fault localisation and root cause analysis [12][15].

Debugging

Debugging is generally described as a task that combines the processes of both testing and correcting written code [16]. Typically the term is used to refer to software systems, but it is equally applicable to hardware, processes and system behaviours.

During the design of a new system, engineers and developers typically test various aspects of the designed system to ensure a smoother experience for the user and to ensure that the system delivers the intended functionality and performance under the specified operating conditions.

A good developer or engineer is capable of identifying and testing the most common failure modes as well as a few obscure and uncommon ones. However, even system experts are seldom able to identify every single possible failure mode, and the nature of some systems is such that some failures only ever occur once the system is deployed and through user interaction with the system, i.e. during system runtime [15][17]. The occurrence of such failures often requires further debugging outside of the development phase, once the system is deployed, which, depending on the system and its application, may or may not always be possible.

The concept of debugging has evolved over the years to be more applicable to the ever-growing size and complexity of the systems being developed today [14]. Two types of debugging are considered, namely classic debugging and anomaly detection [17].

Classic debugging typically refers to software systems and involves the use of methods and techniques that determine which blocks of code, in a larger software program, are likely to contain errors and faults. This type of debugging is done by software engineers and developers.

Classic debugging, then, may be considered to be debugging that occurs during the design and development phase of a system component. This applies equally to other system components, including hardware and interfaces.

Anomaly detection describes the process of identifying unexpected events that occur during runtime and/or while the user is interacting with the system [17]. It is not used during the design and development phase of components, but is instead performed while the system is in operation, which allows it to identify system failures that typically only occur in the production or deployment environment. As a result, anomaly detection more accurately reflects the type of debugging that is required to identify failures in complex systems.
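A minimal sketch of runtime anomaly detection is given below, flagging measurements that deviate from the mean of a stream by more than a chosen number of standard deviations. The metric, the samples, and the three-sigma threshold are illustrative assumptions, not a prescription from the literature cited above.

```python
import statistics

def detect_anomalies(samples, threshold=3.0):
    # Flag runtime measurements lying more than `threshold` standard
    # deviations from the mean -- a minimal statistical anomaly detector.
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    if stdev == 0:
        return []  # no variation, nothing can be anomalous
    return [x for x in samples if abs(x - mean) / stdev > threshold]

# Twenty normal readings and one runtime outlier (hypothetical data).
readings = [10] * 20 + [100]
print(detect_anomalies(readings))  # [100]
```

In practice such a detector would run continuously against live telemetry rather than a fixed list, which is what distinguishes anomaly detection from design-time testing.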

When systems fail, the task of debugging falls either to the engineers and developers who built the system or to the system operators. In a study by Xu and Rajlich [16], it is put forward that the process of debugging is fairly cognitive in nature, and that it holds similarities to the process a medical doctor uses to perform a diagnosis, in that developers are presented with the symptoms and have to use their knowledge of the subject matter to identify the cause. The researchers go on to suggest that the process of debugging employs all six levels of Bloom’s Taxonomy of cognitive learning. These levels, in order, are knowledge, comprehension, application, analysis, synthesis, and finally, evaluation. This suggests that the process of debugging is a non-trivial task and that effective debugging requires a good foundation in domain-specific knowledge and insight into the system in question. It is also suggested by Xu and Rajlich [16] that the level of cognitive ability required for this process is predominantly what distinguishes novice debuggers from their expert counterparts.

When it comes to performing debugging, there is no general solution or methodology that may be applied. Debugging practices are typically tailored to a particular problem and system, and are either enabled by the availability of information about the failure or impaired by the lack thereof. Typically though, the process of debugging, and specifically fault localisation and detection, i.e. anomaly detection, fits the framework described below, as adapted from [16]:

1. Create a hypothesis as to the cause of the error/failure,

2. Test the hypothesis and try to replicate the error/failure,

3. Based on the findings, modify the hypothesis or create a new one, and

4. Repeat until the cause of the error/failure is determined.
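The iterative framework above can be sketched as a simple loop. The candidate hypotheses and the `reproduces` predicate are hypothetical placeholders: in practice, testing a hypothesis means running an experiment against the system, not evaluating a function.

```python
def debug_loop(hypotheses, reproduces, max_rounds=10):
    # hypotheses: ordered candidate causes of the failure (step 1).
    # reproduces: experiment returning True if a hypothesis
    #             replicates the observed error/failure (step 2).
    for round_no, hypothesis in enumerate(hypotheses, start=1):
        if round_no > max_rounds:
            break  # give up after a fixed experimentation budget
        if reproduces(hypothesis):
            return hypothesis  # cause determined (step 4 terminates)
        # Hypothesis rejected by the findings; move to the next
        # (modified or new) hypothesis (step 3).
    return None  # cause not found within the budget

# Hypothetical usage: the second hypothesis replicates the failure.
found = debug_loop(["bad config", "race condition"],
                   lambda h: h == "race condition")
print(found)  # race condition
```

The loop terminates either when a hypothesis replicates the failure or when the experimentation budget is exhausted, mirroring the repeat-until-determined structure of the framework.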

From the presented framework, it is fairly obvious that the effectiveness of the debugging approach depends largely on the debugger’s domain knowledge and experience with the system.

Knowing how a system works, and how it is expected to work, as well as understanding the intricacies behind its functioning, can greatly improve the effectiveness of the process of debugging. This idea is supported by [16], in which it is suggested that in order to debug effectively, a debugger must:

1. Understand what the symptoms of the problem mean,

2. Be able to obtain more information about the problem through further tests,

3. Identify which component(s) of the system are likely to be responsible for the failure,

4. Identify which processes or functions, of the suspicious component, are causing the failure,

5. Be able to identify the root cause of the error, and,

6. Evaluate the obtained evidence to increase confidence in the identified error and to effectively determine how to rectify it.

It is often found that novice debuggers spend more time understanding the symptoms of failure before moving on to running testing experiments. System experts, however, with their better understanding of the system, as well as experience in the process of debugging, are typically able to fast track the process and begin with isolating the cause of failure much sooner. This further supports the statement that domain knowledge, in addition to the ability to effectively debug, is critical to debugging failures that occur within systems.

With sufficient domain knowledge, debuggers have various options and tools for debugging available at their disposal. In the case of building software systems, developers and debuggers typically employ Integrated Development Environments (IDEs) during development. These IDEs not only serve as an effective environment for software development by including useful features such as syntax highlighting, but they also make numerous options for testing and debugging code available. The most commonly used tools during software debugging are console outputs, breakpoints and unit testing.
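Of the tools just mentioned, unit testing is the most readily sketched. The example below uses Python's standard `unittest` module; the function under test, `divide`, is hypothetical. Note that a good unit test pins down the expected failure behaviour as well as the normal case.

```python
import io
import unittest

def divide(a, b):
    # Hypothetical function under test.
    if b == 0:
        raise ZeroDivisionError("b must be non-zero")
    return a / b

class DivideTests(unittest.TestCase):
    def test_normal_case(self):
        self.assertEqual(divide(6, 3), 2)

    def test_failure_mode(self):
        # The expected failure behaviour is tested explicitly.
        with self.assertRaises(ZeroDivisionError):
            divide(1, 0)

# Run the suite programmatically and report the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DivideTests)
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```

Console outputs and breakpoints complement such tests during interactive debugging; unit tests have the additional benefit of guarding against the same failure reappearing later.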

On the other hand, when building hardware systems, engineers typically find themselves spoilt for choice when deciding which tools to employ to debug and test a particular hardware design. The tools employed range from basic multimeters and vector network analysers, to complex simulation software packages that enable the simulation of various hardware designs.

These tools provide the engineer with the freedom and flexibility to conduct isolated experiments in an effort to identify the root cause of the problem or failure.

Typically however, the debugging framework described previously, and the tools used by developers and engineers for the purposes of debugging, are only available and employed during the design and development phases of a system. Once deployed and operational, engineers and developers seldom have the opportunity to run tests and experiments to determine the cause of failure in large complex systems, as minimising system downtime is often a priority. Without being able to apply the debugging framework, where applicable, engineers and developers turn toward the log files generated by the system to perform log file analysis [18][12][14][1].
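A minimal sketch of such log file analysis is given below: counting ERROR entries per component to suggest where to begin fault localisation. The log format and the component names (`pump_ctrl`, `valve_io`, `scheduler`) are hypothetical; real systems vary widely in how they structure log lines.

```python
import re
from collections import Counter

def summarise_log(lines):
    # Count ERROR entries per component, assuming the hypothetical
    # line format: "<timestamp> ERROR <component>: <message>".
    pattern = re.compile(r"\bERROR\s+(\w+):")
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical log excerpt from a deployed system.
log = [
    "2024-01-01 12:00:00 INFO  scheduler: started",
    "2024-01-01 12:00:05 ERROR pump_ctrl: timeout",
    "2024-01-01 12:00:06 ERROR pump_ctrl: retry failed",
    "2024-01-01 12:00:07 ERROR valve_io: no response",
]
print(summarise_log(log).most_common(1))  # [('pump_ctrl', 2)]
```

Even this crude summary narrows the search space: the component with the most error entries is a reasonable first hypothesis for the debugging loop described earlier, though, as noted above, a propagated error may still point away from the true root cause.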
