The thesis entitled "Mitigating Soft Error Risks through Protecting Critical Variables and Blocks" by Nazim Uddin has been approved by the Board of Examiners in partial fulfillment of the requirements for the degree of Master of Science in the Department of Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna, Bangladesh, in April 2005.
Problem Statement
Safety-critical systems are expected to provide high reliability and to meet real-time constraints. Soft errors have thus become a major concern for both safety-critical systems and general-purpose computing systems.
Motivation of the Thesis
The main objective of this thesis is to develop an approach that detects soft errors and then to develop techniques that can recover the system from soft errors before it breaks down. A further goal is to devise a new technique that detects soft errors with lower time, cost, and memory requirements.
Contributions of the Thesis
A preventive soft error technique is developed that marks critical components (blocks of code), which lowers soft error risks to a great extent.
Scope of the Thesis
Thesis Organization
Chapter 2 presents an overview of soft error tolerance, different types of soft errors, and sources of soft errors, and also discusses the existing methodologies to detect and correct soft errors.
Chapter 4 provides a detailed discussion of the proposed soft error detection and correction techniques, including the definition of a preceding variable and the algorithm for preceding variable identification.
Introduction
This chapter discusses the definition of soft errors, types of soft errors, sources of soft errors, and an overview of soft error mitigation techniques.
Definition of Soft Errors
In general, the amount of charge collected (Qcoll) does not exceed an amount known as the critical charge (Qcrit), so no soft error occurs. If the collected charge Qcoll exceeds the critical charge Qcrit, a soft error occurs.
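The threshold condition above can be expressed directly as a predicate. This is a minimal sketch; the function name and the femtocoulomb values are illustrative, not taken from the thesis.

```python
# Sketch of the Qcoll/Qcrit condition: a soft error occurs only when the
# collected charge exceeds the critical charge of the circuit node.

def soft_error_occurs(q_coll_fc: float, q_crit_fc: float) -> bool:
    """Both charges in femtocoulombs; True means a soft error occurs."""
    return q_coll_fc > q_crit_fc

print(soft_error_occurs(12.0, 15.0))  # collected charge below Qcrit -> False
print(soft_error_occurs(18.0, 15.0))  # collected charge above Qcrit -> True
```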
Types of Soft Errors
- Benign Fault
- Detected Unrecoverable Error (DUE)
If the soft error is not detected by the user but affects the output of the program, the situation is described as Silent Data Corruption (SDC). If the incorrect program output is detectable but unrecoverable, the soft error is described as a Detected Unrecoverable Error (DUE).
Sources of Soft Errors
- IR or L(di/dt) Supply Noise
- Power Transients
- Alpha Particles
- Cosmic Rays
- Low-energy Cosmic Neutron Interactions with 10B found in Boro-Phospho-Silicate Glass (BPSG)
Due to the resistance of the interconnect that makes up the power network, there is a voltage drop in the network, usually called an IR drop. Fast switching currents of on-chip devices can cause supply voltage fluctuations that are classified as IR and L(di/dt) drops. The L(di/dt) drop has become a dominant part of the overall supply drop in high-performance designs.
Stopping power is one of the key concepts needed to explain the interaction of alpha particles with silicon. On the Earth's surface, about 95% of the particles that can cause soft errors are energetic neutrons, while the remainder are protons and pions. Stray electrons (similar to beta particles, although their ultimate source is unknown) make up much of the small remainder.
In the history of particle physics, cosmic rays were the source of the discovery of the positron, muon, and pi meson.
An Overview of Soft Error Mitigation Techniques
- Process Technology Solutions
- Hardware Based Approaches
- Hybrid Approaches
Soft errors are generally associated with random errors or data corruption in electronic systems. In general, soft errors are non-destructive functional errors caused by energetic ion impacts and can be corrected by resetting or rewriting the device. Redundancy-based detection techniques run a redundant copy of the same program and compare the outputs to determine whether a soft error has occurred.
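The redundant-execution idea can be sketched in a few lines: run the same computation twice and compare the outputs, flagging any mismatch. The function names and the workload are illustrative assumptions, not the thesis's implementation.

```python
# Minimal sketch of redundancy-based soft error detection: execute the
# same computation twice and compare the outputs; a mismatch signals
# that one of the two runs was corrupted by a transient error.

def compute(data):
    # stand-in for any deterministic workload
    return sum(x * x for x in data)

def run_with_redundancy(data):
    first = compute(data)
    second = compute(list(data))   # redundant copy of the same work
    if first != second:
        raise RuntimeError("soft error detected: redundant outputs disagree")
    return first

print(run_with_redundancy([1, 2, 3]))  # prints 14 when both runs agree
```

In a real deployment the two copies would run on separate threads or processes so that a single particle strike cannot corrupt both.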
Using error detection and correction codes (ECC) to detect soft errors is another popular software-implemented technique. The authors of [25] used redundant multithreading to determine the architectural vulnerability factor, and Shye et al. [26] used process-level redundancy to detect soft faults. Hardware approaches to soft error mitigation mostly include circuit-level, logic-level, and architectural solutions.
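As an illustration of the ECC idea, the sketch below implements a Hamming(7,4) code, a classic ECC that can detect and correct any single-bit error in a 4-bit data word. This is a standard textbook construction chosen for illustration; the thesis does not specify which code is used.

```python
# Hamming(7,4): 4 data bits are protected by 3 parity bits so that any
# single flipped bit can be located by the syndrome and corrected.

def hamming74_encode(d):                # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]             # parity over positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]             # parity over positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]             # parity over positions 4,5,6,7
    # codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_correct(c):               # c: list of 7 code bits
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3     # 0 means no error detected
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1            # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]     # recovered data bits

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[4] ^= 1                            # inject a single-bit soft error
print(hamming74_correct(code) == word)  # True: the error is corrected
```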
The capacitance can also be increased by adding an explicit capacitor to the diffusion area of the devices.
Introduction
Measuring the Criticality of the Block
The reason is that a transient error changes the value of one bit at a time; the probability of two bits changing, or of two transient errors occurring at the same time, is almost zero. Fault injection is done by flipping a single bit of the parameter value, either anywhere in the code or in the parameter name. A single-bit flip is therefore possible in any of the six bits, which in turn will generate a different character in each case.
Similarly, the wrong combinations for the parameter value (the binary value of '7' is 000111) can be generated. If the effects in the first three cases are almost the same, the test is terminated. To better analyze the cause and effect of a failure in the system, domain expertise is needed.
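The single-bit fault-injection step above can be sketched directly: flip each of the six bits of the value 7 (000111) in turn and record the corrupted value each injected fault would produce.

```python
# Sketch of single-bit fault injection on a 6-bit value: XOR with a
# one-hot mask flips exactly one bit per trial, modelling the assumption
# that a transient error changes only one bit at a time.

def inject_single_bit_faults(value, width=6):
    return [value ^ (1 << bit) for bit in range(width)]

faulty = inject_single_bit_faults(7)          # 7 is 000111 in six bits
print([f"{v:06b}" for v in faulty])
# flipping one bit at a time yields 000110, 000101, 000011,
# 001111, 010111, 100111 -> values 6, 5, 3, 15, 23, 39
```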
Lowering the Criticality of a Block
Once the criticality ranking is returned, the model can be refactored with the aim of reducing the criticality of its components. Model refactoring can be illustrated with an example consisting of Figure 3.1 and Figure 3.2: two states in Figure 3.1, 'Pass to Server' and 'Retrieved', have been merged into one state, 'Verifying', in the refactored state diagram (Figure 3.2).
Criticality reduction can be achieved by reducing any of the parameters: execution time (ET), number of iterations, or failure propagation (fan-in and fan-out). Next, the design is analyzed to find out which parameter contributes most to the large product value. If a component shows a higher criticality value, a soft error in it has more effect on the overall functionality of the system than one in other components.
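The ranking step can be sketched as follows. Note that the exact weighting of the parameters is not given in this section, so the product formula, the field names, and the sample numbers below are assumptions for illustration only.

```python
# Hedged sketch of a block-criticality ranking: criticality is modelled
# here as the product of execution time, iteration count, and failure
# propagation (fan-in + fan-out). The formula is an illustrative
# assumption, not the thesis's exact definition.

def criticality(exec_time_ms, iterations, fan_in, fan_out):
    return exec_time_ms * iterations * (fan_in + fan_out)

blocks = {
    "sort order decision": criticality(2.0, 1000, 3, 4),  # sample values
    "input parsing":       criticality(1.0, 10, 1, 2),
}
ranking = sorted(blocks, key=blocks.get, reverse=True)
print(ranking[0])  # the block whose parameters yield the largest product
```

Reducing any one factor (for example, merging states to cut iterations) lowers the product and thus the block's rank.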
Refactoring can be applied to the component's architecture or behavior model to lower the component's complexity.
Introduction
Flagging the single Preceding Variables
In the figures, three statements are considered and assumed to be part of a program code segment, and seven variables are set. As shown in Figure 4.2, during execution the value of variable 'rslt2' in statement 2 depends on statement 1 for the value of variable 'rslt1', and statement 3 depends on statement 2 to calculate 'var0', since 'var0' is the sum of 'rslt2' and 'var4'. If a soft error occurs in any of these variables, it can be detected by comparing and checking only 'var0'.
Considering the assignment statements, the tree in Figure 4.3 is formed and the root node (rslt0) is determined. The root node is the most critical among the nodes because it is decisive in these dependencies: if a soft error occurs on any node other than the root, it will eventually propagate to the root node.
So, to detect soft errors, comparing only the critical variables is more efficient than comparing all variables.
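The single-preceding-variable check can be sketched as follows. The variable names follow the style of Figures 4.2-4.3; the concrete values and the chain itself are illustrative.

```python
# Sketch of detection via a single preceding (root) variable: instead of
# duplicating and comparing every variable, the whole dependency chain is
# recomputed and only its final variable is compared once.

def chain(var1, var2, var3, var4):
    rslt1 = var1 + var2          # statement 1
    rslt2 = rslt1 * var3         # statement 2, depends on rslt1
    var0 = rslt2 + var4          # statement 3, depends on rslt2
    return var0

primary = chain(1, 2, 3, 4)
shadow = chain(1, 2, 3, 4)       # redundant recomputation of the chain
assert primary == shadow         # one comparison covers the whole chain
print(primary)                   # prints 13
```

A soft error in either intermediate variable (rslt1 or rslt2) propagates into var0, so this single comparison is enough to detect it.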
Identifying Multiple Preceding Variables
Consider the dependency graph in Figure 4.7: the node labeled r1 has an out-degree of 2, which means that this variable directly affects the computation of two other variables. The direction of the edges represents the flow of computational effects. If a node produces an incorrect result, that incorrect result will invalidate the results of all variables directly or indirectly dependent on it.
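The graph analysis above can be sketched with a plain adjacency list: edges point from a variable to the variables that consume its value, and the nodes with out-degree 0 (sinks, on which nothing further depends) are the preceding variables that must be checked. The graph below is illustrative, not the exact graph of Figure 4.7.

```python
# Sketch of identifying multiple preceding variables from a dependency
# graph: compute each node's out-degree and collect the sinks.

edges = {                       # variable -> variables that use its value
    "r1": ["r2", "r3"],         # r1 has out-degree 2, as in the text
    "r2": ["r4"],
    "r3": ["r4"],
    "r4": [],                   # sink: nothing depends on r4
    "r5": [],                   # an independent result, also a sink
}
out_degree = {v: len(succ) for v, succ in edges.items()}
preceding = sorted(v for v, d in out_degree.items() if d == 0)
print(out_degree["r1"], preceding)   # prints: 2 ['r4', 'r5']
```

An error in r1 propagates through r2 and r3 into r4, so checking the two sinks covers every node in this graph.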
Figure 4.8 shows a simple arithmetic problem focusing on instructions and variable dependence to describe the proposed approach. The arithmetic expressions shown in Figure 4.8 are represented using a defined instruction set as shown in Figure 4.9, where the values of the variables v1 and v2 are stored in memory locations $0 and $2. For soft fault detection, Maurizio Rebaudengo et al. [48] duplicate all variables and check for consistency, as shown in Figure 4.10.
The proposed method significantly reduces the program execution time and memory consumption.
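To make the contrast concrete, the sketch below applies the duplicate-every-variable style of [48] to a single addition: every variable gets a shadow copy, every statement is executed twice, and every variable is compared. The code is an illustrative Python rendering; [48] performs this as a source-level transformation of C code.

```python
# Hedged sketch of full variable duplication as in [48]: three
# comparisons are needed for one addition, whereas checking only the
# preceding variable would need a single comparison.

def duplicated_sum(v1, v2):
    a,  b  = v1, v2               # original variables
    a2, b2 = v1, v2               # shadow copies
    r  = a + b
    r2 = a2 + b2                  # duplicated statement
    checks = [a == a2, b == b2, r == r2]   # one check per variable
    if not all(checks):
        raise RuntimeError("soft error detected: a variable pair diverged")
    return r, len(checks)

result, n_checks = duplicated_sum(3, 4)
print(result, n_checks)           # prints: 7 3
```

The comparison count grows with the number of variables under full duplication, which is the overhead the preceding-variable approach avoids.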
Recovery from Soft Errors
Introduction
Experimental Setup
Identifying the Critical Blocks
To calculate the criticality ranking of the blocks, several random faults were injected into these blocks. For each block, a single soft fault is injected at a time and the effects on the system are observed through FMECA analysis. This process is repeated several times, typically 10 to 20 trials, to identify the critical mode of the block.
From the experimental data presented in Table 5.1, it is observed that the "sort order decision" block is more critical than any other block.
Applying Refactoring to Lower the Criticality
Soft Error Detection using Preceding Variables
- Comparisons with Existing Approaches
In the case of the bubble sort algorithm, comparing duplicate array elements after sorting can significantly reduce the number of checks and thus reduce execution time. The figures shown in the next section plot the execution time for Fibonacci, batch summation, bubble sort, quick sort, matrix multiplication, and selection sort, which are used to evaluate the performance of the proposed method. Program execution time depends on several system specification parameters, such as processor speed, primary memory size, and the number of threads running on the system; therefore, the results will differ across machines.
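The bubble-sort check described above can be sketched as follows: rather than comparing duplicated variables after every operation, a duplicate copy of the array is sorted independently and the two results are compared once at the end. The implementation details here are illustrative.

```python
# Sketch of reduced-check detection for bubble sort: sort the array and a
# duplicate copy, then make a single element-wise comparison at the end
# instead of checking every variable on every swap.

def bubble_sort(a):
    a = a[:]                              # work on a copy
    for i in range(len(a) - 1):
        for j in range(len(a) - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

data = [5, 1, 4, 2, 8]
first = bubble_sort(data)
second = bubble_sort(data)                # redundant sort of a duplicate
assert first == second                    # one comparison pass at the end
print(first)                              # prints: [1, 2, 4, 5, 8]
```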
Since the execution time is very small, 100,000 iterations are applied to make the results measurable. The plots show a clear improvement in execution time over the existing method due to the reduced number of comparisons. The execution time overhead for the selection sort algorithm is also plotted for the original code, the code transformed by Maurizio Rebaudengo et al., and the proposed method.
We believe that the execution time of the selection sort algorithm under the proposed method is minimized compared to the existing one.
Discussion
The execution times of matrix multiplication, bubble sort, and selection sort grow rapidly with increasing data size. Therefore, the proposed method reduces the risks of soft errors by refactoring the critical blocks and then detects soft errors in less time than [48] by using only the preceding variables.
Conclusions
Concluding Remarks
Future Recommendations
Reinhardt, "Detailed design and evaluation of redundant multi-threading alternatives," in Proceedings of the 29th Annual International Symposium on Computer Architecture, pp.
Mukherjee, "Transient fault detection via simultaneous multithreading," in Proceedings of the 27th International Symposium on Computer Architecture, pp.
Cheng, "Transient fault recovery using simultaneous multithreading," in Proceedings of the 29th Annual International Symposium on Computer Architecture, pp.
Falsafi, "Dual use of superscalar datapath for transient-fault detection and recovery," in Proceedings of the 34th ACM/IEEE International Symposium on Microarchitecture, pp.
Gurumurthi, "Dynamic prediction of architectural vulnerability from microarchitectural state," in Proceedings of the International Symposium on Computer Architecture, New York, NY, USA, pp.
Austin, "DIVA: A reliable substrate for deep submicron microarchitecture design," in Proceedings of the 32nd Annual International Symposium on Microarchitecture, pp.
Cheng, "Transient error recovery using concurrent multithreading," in Proceedings of the 29th Annual International Symposium on Computer Architecture, pp.