2 Timing Speculation (TS)

Better-than-worst-case (BTWC) design is a design style first introduced by Bob Colwell, the architect of the Intel Pentium Pro and Pentium 4 processors [1]. BTWC design improves performance by allowing certain timing errors to occur during normal operation, while preserving correct operation through added error detection and correction. This additional circuitry consumes extra power, and correcting an error takes time.

The advantage of BTWC design is that it gains additional performance by targeting typical operating cases and relies on EDAC (error detection and correction) circuitry to handle the occasional error. BTWC design operates within Region Two, so identifying point c is its ultimate goal.

Figure 1 illustrates a general approach to BTWC design: a core computational component coupled with a checker mechanism that validates the semantics of the core's operations [8].


Region One is the error-free region, where the clock frequency is normally chosen, shown as point a. Guardbanding is a traditional design approach that absorbs timing uncertainties by adding design margin. These timing uncertainties stem mostly from the various kinds of variation that affect circuit timing.

Overview of TS Microarchitectures

Stage-Level TS Microarchitectures
Leader-Checker TS Microarchitectures

In CMPs, two cores can be paired in a leader-checker organization.

3 Taxonomy of Design for TS

MIS is related to the input workload, which affects the circuit's internal activity and the settling time of the outputs. Paths with the same static propagation delay can have dramatically different settling-time distributions because of input workload variation. Together with the circuit's previous state, the input workload forms the basic shape of the circuit's dynamic activity curve.

Many factors affect the delay distribution of the outputs, but the input workload determines the basic shape of the distribution. Obtaining the actual distribution of output latencies for a given workload helps designers estimate the error rate of each output and select a well-balanced operating clock frequency, which is the fundamental challenge of BTWC design.

Analysis of Circuit Dynamic Behavior with Timed Ternary Decision Diagram

With the same input workload, Output A has a 99% probability of being settled by time t, while Output B has only a 53% probability of being settled by time t. Assuming the circuit has only these two paths, speeding up only the path to Output B removes most of the errors, and the circuit can then run at cycle time t with very little error correction penalty. In this example, two paths with the same static delay behave differently when exercised by a real application.
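The comparison above can be made concrete with a minimal Python sketch. It assumes that a list of per-cycle settling times is available for each output; the function name and the sample values are hypothetical and chosen only for illustration, not taken from the measurements behind the 99% and 53% figures.

```python
def settling_probability(settle_times_ps, t_ps):
    """Fraction of observed cycles in which the output had settled by time t."""
    if not settle_times_ps:
        raise ValueError("need at least one observed cycle")
    settled = sum(1 for s in settle_times_ps if s <= t_ps)
    return settled / len(settle_times_ps)


# Hypothetical per-cycle settling times (ps) for two outputs that share the
# same static delay. Illustrative values only, not measured data.
output_a = [300, 320, 350, 380, 400, 410, 420, 450, 480, 900]
output_b = [300, 520, 610, 450, 700, 480, 820, 490, 910, 500]

t = 500
p_a = settling_probability(output_a, t)  # 9 of 10 cycles settled by t
p_b = settling_probability(output_b, t)  # 5 of 10 cycles settled by t
```

Under these assumed values, the second output is the one worth speeding up, because it accounts for most of the cycles that miss time t.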

Knowing the dynamic behavior of a circuit is important for BTWC design in order to improve the performance of a design for the common case [18][19]. A study by Intel [12] introduced some circuit techniques for dynamic variation tolerance, and the authors also explored timing optimization based on path delay histogram and path activation probability.

Circuit-level implementation issues
Pipeline error recovery mechanisms
Supply Voltage Control

When an input signal transitions at the same time as the clock, metastability can occur in the Razor flip-flop. The delayed clock also introduces a new short-path constraint: if a short path lets the next cycle's data reach the flip-flop input before the delayed clock edge, that data can overwrite the value the shadow latch is meant to capture. To prevent this corruption of the shadow latch data, a minimum-path length constraint is added at the input of each Razor flip-flop in the design.
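As a rough illustration, the minimum-path check can be written as a small Python sketch. It assumes the constraint takes the form "shortest path delay must be at least the shadow latch's clock delay plus its hold time"; that form, the function names, and the parameter names are assumptions made here for illustration and are not spelled out in the text above.

```python
def violates_short_path(min_path_delay_ps, shadow_clock_delay_ps, shadow_hold_ps):
    """True if the shortest path into a Razor flip-flop is fast enough that new
    data could reach the shadow latch before it has safely captured the
    current cycle's value (assumed constraint: min delay >= delay + hold)."""
    return min_path_delay_ps < shadow_clock_delay_ps + shadow_hold_ps


def padding_needed(min_path_delay_ps, shadow_clock_delay_ps, shadow_hold_ps):
    """Extra delay (ps) to add to the short path; 0 if it is already safe."""
    required = shadow_clock_delay_ps + shadow_hold_ps
    return max(0, required - min_path_delay_ps)
```

For example, with an assumed 150 ps shadow-clock delay and 20 ps hold time, a 100 ps path would need 70 ps of padding, while a 200 ps path would need none.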

Figure 4(b) shows the pipeline timing diagram of the recovery for an instruction that fails in the EX stage of the pipeline. In addition, the flush train starts propagating the ID of the failing stage in the direction opposite to instruction flow.

Figure 5: Block diagram of Razor logic. [22]

3 Experimental Evaluation

3.1 Razor Pipeline Implementation

TIMBER: Circuit design

TIMBER flip-flop
TIMBER latch

Although the timing error is not signaled to the central error control unit (the Err1 signal is 0), the error relay logic sets the select input of flip-flop f2 to 01. Thus, when a two-stage timing error occurs at flip-flop f2, the error is masked by borrowing a TB and an ED time interval at f2. When EN is low, the transmission gate L is open and the TIMBER latch operates as a conventional master-slave flip-flop.

A timing error is detected by comparing the values stored in the master latch and the slave latch on the falling edge of the clock. The timing error is masked because the slave latch is transparent for the entire checking period.

TIMBER case study

The master latch M0 samples the value of the data signal D on the rising edge of the clock and drives the slave latch and the output Q to the sampled value. The master latch M1 samples the data signal D on the rising edge of the delayed clock DCK, after a delay δ determined by the value of the select inputs S1S0. On the rising edge of the delayed clock DCK, the transmission gate P0 opens and the transmission gate P1 closes.

So after the delay δ, for the remainder of the clock period while CK is high, the master latch M1 drives the slave latch and the output Q to the new value sampled by M1. The timing error at f2 is reported to the central error control unit by latching the error signal (the Err2 signal goes high) on the subsequent falling edge of CK. Note that this ED interval is equivalent to the sum of the ED intervals in the TIMBER flip-flop.

When a single-stage timing error occurs, the timing violation of the late-arriving data signal falls within the TB time interval. When a two-stage timing error occurs at latch l2, the error is masked by borrowing a TB and an ED time interval. Likewise, when a two-stage timing error occurs at flip-flop f2, the error is masked by borrowing a TB and an ED time interval at f2.

A timing error at f2 is flagged to the central error control unit by latching the error signal (the Err2 signal goes high) on the next falling edge of CK. When a two-stage timing error occurs at latch l2, the error is masked by borrowing a TB and an ED time interval.

This chapter presents the details of the methodology used for the research in this dissertation.

Table 1: Summary of several EDAC methodologies

Experimental setup and general work flow

It was used not only as a method of error checking, but also as a basis for error evaluation and other activities. Since two primary outputs with the same static path delay can show dramatically different settling behavior under the same input workload, the error reduction method must be applied selectively to the error-contributing paths and cells. However, the nature of TCF and BDD makes the calculation too complex for use in large circuits.

The achievements presented in this paper are: (1) effective reduction of timing speculation errors without major changes to the original circuit, by considering the effect of input workload variance on circuit activity; (2) creation of a universal design/simulation flow with commercial tools that allows accurate error estimation over a range of operating clock frequencies, giving the BTWC designer insight into efficiency gains; and (3) implementation of an offline error checking method, which allows designers to perform the desired cost-benefit analysis. Part D explains the identification of error-contributing outputs and the selection of key cells to minimize errors for a given input workload.

Error checking method

The custom error estimator and error checker in this work are both based on the information contained in the .vcd file, so understanding its contents is essential. To realize the error estimation methodology, the first step is to obtain all transition timestamps of the selected PO node. In Part A Section (1), the code prepares all transition timestamps of the given PO, and in Part A Section (2), the code extracts the settling timestamps of the given PO.
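The two extraction steps can be sketched in Python as follows. This is not the code from Part A; it is a minimal illustration that assumes a single-bit PO, a dump whose cycles start at multiples of the clock period from time 0, and hypothetical function and signal names.

```python
def transition_times(vcd_text, signal_name):
    """Return [(time, value), ...] for one scalar signal in a VCD dump."""
    code = None          # short identifier the header assigns to the signal
    now = 0              # current simulation time from the last '#<time>' line
    changes = []
    in_body = False
    for raw in vcd_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not in_body:
            parts = line.split()
            # Header declaration: $var <type> <width> <code> <name> ... $end
            if parts[0] == "$var" and len(parts) >= 5 and parts[4] == signal_name:
                code = parts[3]
            if line.startswith("$enddefinitions"):
                in_body = True
            continue
        if line.startswith("#"):
            now = int(line[1:])
        elif code is not None and line[0] in "01xXzZ" and line[1:] == code:
            changes.append((now, line[0]))
    return changes


def settling_times(changes, clock_period, n_cycles):
    """Offset of the last transition inside each cycle (None if no change).
    Assumes 'changes' is in increasing time order, as in a VCD body."""
    settle = [None] * n_cycles
    for time, _value in changes:
        cycle = time // clock_period
        if cycle < n_cycles:
            settle[cycle] = time - cycle * clock_period
    return settle


# Tiny hand-written dump; the signal names and times are illustrative only.
EXAMPLE_VCD = """
$timescale 1ps $end
$scope module top $end
$var wire 1 ! out $end
$var wire 1 % clk $end
$upscope $end
$enddefinitions $end
#0
$dumpvars
0!
0%
$end
#120
1!
#340
0!
#1000
1%
#1450
1!
"""

changes = transition_times(EXAMPLE_VCD, "out")
settle = settling_times(changes, clock_period=1000, n_cycles=2)
```

Note that the initial value written by `$dumpvars` at time 0 is recorded as a change in this sketch, so a cycle containing only that entry would report a settling offset of 0.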

A settling-time histogram of the desired outputs can be plotted with an appropriate bin size. In this work, the bin size is 50 ps, so the error estimate essentially matches the simulation result, because 50 ps is also the step size of the swept simulation described in Section V. The settling probability curve is the cumulative distribution function (CDF) of the output settling-time histogram.
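A minimal Python sketch of the histogram and its CDF is given below. It assumes integer settling times in picoseconds and right-closed bins, so that the CDF value at a bin edge is the fraction of cycles settled by that time; the function names and sample values are hypothetical.

```python
def settling_histogram(settle_times_ps, bin_ps, max_ps):
    """Count cycles per bin; bin k covers (k*bin_ps, (k+1)*bin_ps]."""
    n_bins = -(-max_ps // bin_ps)            # ceiling division
    counts = [0] * n_bins
    for s in settle_times_ps:
        if s > max_ps:
            raise ValueError("settling time exceeds histogram range")
        k = max(0, -(-s // bin_ps) - 1)      # a time on an edge goes to the lower bin
        counts[k] += 1
    return counts


def settling_cdf(counts):
    """Cumulative fraction of cycles settled by the right edge of each bin."""
    total = sum(counts)
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / total)
    return cdf


# Illustrative settling times (ps), not measured data.
times = [120, 180, 200, 240, 260, 310, 330, 340, 390, 400]
counts = settling_histogram(times, bin_ps=50, max_ps=400)
cdf = settling_cdf(counts)
```

Because the bin edges in this sketch fall on the same 50 ps grid as the swept clock periods, the estimated error rate at a swept period is simply one minus the CDF value at that edge.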

Therefore, the settling behavior of the error-prone outputs is analyzed to estimate the error rate; this analysis enables the identification of Category IV POs, which contribute to errors. Custom scripts output the settling timestamp for each cycle of the tested outputs, and the resulting histogram shows the number of errors at each tested output. In this work, a PO is considered an error contributor (Category IV) if its error rate is at least twice the average error rate of the entire circuit.
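The classification rule can be sketched in Python as shown below. The sketch assumes that the circuit's average error rate is the mean of the per-PO error rates and that the threshold is inclusive; both are assumptions, and the function names and sample data are hypothetical.

```python
def error_rate(settle_times_ps, clock_ps):
    """Fraction of cycles in which the output settles after the clock edge."""
    late = sum(1 for s in settle_times_ps if s > clock_ps)
    return late / len(settle_times_ps)


def error_contributors(settle_by_po, clock_ps, factor=2.0):
    """Names of POs whose error rate is at least `factor` times the average
    per-PO error rate (assumed reading of the Category IV rule)."""
    rates = {po: error_rate(times, clock_ps) for po, times in settle_by_po.items()}
    average = sum(rates.values()) / len(rates)
    if average == 0:
        return []                      # no errors anywhere, nothing to flag
    return sorted(po for po, r in rates.items() if r >= factor * average)


# Illustrative per-cycle settling times (ps) for four POs, 10 cycles each.
settle_by_po = {
    "po0": [100] * 10,
    "po1": [100] * 9 + [400],
    "po2": [100] * 9 + [400],
    "po3": [100] * 4 + [400] * 6,
}
category_iv = error_contributors(settle_by_po, clock_ps=350)
```

With these assumed values the per-PO error rates are 0, 0.1, 0.1, and 0.6, so only the last PO exceeds twice the average and is flagged.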

For each cycle, the switching activity of the selected nodes is stored in the LUT in sequence. The simulation clock period is swept from the error-free clock period down to 70% of the original clock period (i.e., the static critical path delay). The goal of this work is to reduce the error rate more effectively by shortening the propagation delay of the identified error-contributing POs.
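The sweep can be sketched in Python as follows. The sketch assumes that the error-free period equals the original clock period, that the step is 50 ps, and that a cycle counts as erroneous when any PO settles after the clock period; the names and sample values are hypothetical.

```python
def sweep_periods(error_free_ps, step_ps=50, floor_percent=70):
    """Clock periods from the error-free period down to floor_percent of it."""
    periods = []
    p = error_free_ps
    # Integer comparison avoids floating-point rounding at the 70% boundary.
    while p * 100 >= error_free_ps * floor_percent:
        periods.append(p)
        p -= step_ps
    return periods


def circuit_error_rate(settle_by_po, period_ps):
    """Fraction of cycles in which at least one PO settles after the period."""
    n_cycles = len(next(iter(settle_by_po.values())))
    bad = 0
    for cycle in range(n_cycles):
        if any(times[cycle] > period_ps for times in settle_by_po.values()):
            bad += 1
    return bad / n_cycles


# Illustrative per-cycle settling times (ps) for two POs over four cycles.
settle_by_po = {
    "po0": [600, 720, 980, 650],
    "po1": [810, 640, 700, 905],
}
periods = sweep_periods(1000)
curve = {p: circuit_error_rate(settle_by_po, p) for p in periods}
```

Plotting `curve` against the clock period gives the kind of error-rate curve a designer could use to weigh a shorter clock against the correction penalty.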

Selected Cell Replacement (SCR) – selectively replaces cells on the fan-in cone of the identified error-contributing POs with low-threshold voltage cells based on activity level. In this section, the error rate and the improved error rate were compared at 70% of the original clock period.

Figure 15: An example of a value change dump (VCD) file: (a) the header part and (b) the body part.

Total Error Reduction to Baseline Design

Chandrasekar, “Dynamic Voltage Drop (IR) Analysis and Design Closure: Issues and Challenges,” in 11th International Symposium on Quality Electronic Design (ISQED), pp.
Chen, “Analysis of Dynamic Circuit Behavior with Timed Ternary Decision Diagram,” in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2010, pp.
Chen, “Analysis of Digital Circuit Dynamic Behavior with Timed Ternary Decision Diagrams for Better-Than-Worst-Case Design,” in IEEE.
Chen, “CCP: A Common Case Promotion for Improved Timing Fault Tolerance with Energy Efficiency,” in IEEE/ACM International Symposium on Low-Power Electronics and Design (ISLPED), pp.
Chang, “Efficient Logic Characteristic Function for Fast Timing ATPG,” in 2006 IEEE/ACM International Conference on Computer-Aided Design, 2006, p.
Mudge, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” in 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), p.
De, “Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance,” in IEEE Journal of Solid-State Circuits, 2009, vol.
Aitken, “TIMBER: Time Borrowing and Error Relaying for Online Timing Error Resilience,” in Design, Automation & Test in Europe Conference & Exhibition, 2010, pp.
Mohanram, “Approximate Logic Circuits for Low Overhead, Non-intrusive Simultaneous Fault Detection,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), no.
Mohanram, “Low Cost Concurrent Error Masking Using Approximate Logic Circuits,” in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2013, vol.
Gardner, “Design Techniques for Cross-Layer Resilience,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010, pp.
Sanda, “Cross-Layer Resilience Challenges: Metrics and Optimization,” in Design, Automation & Test in Europe Conference &.