This chapter has presented the fundamental steps for the first two phases of DDD: transforming the program into DSA form, and applying projection to decompose the DSA program.
Recall that DDD requires slack-elastic programs to guarantee correctness. The method performs the following steps:
1. Dynamic Single Assignment Transformation
(a) Perform early decomposition to eliminate nested loops from the program.
(b) Transform the resulting programs into DSA form by rewriting selection statements and straightline code.
2. Projection
(a) Build variable dependency sets for each program. If any variables appear in multiple sets, insert the appropriate copy and input variables.
(b) Insert distributed assignments into the sequential code in order to prepare it for projection.
(c) Build the new dependency and projection sets for the program and apply the technique of projection.
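The single-assignment rewriting of step 1(b) can be illustrated with a small sketch. This is not the thesis's algorithm or notation: the toy representation (statements as target/operand pairs, versioned names like `x_1`) is purely hypothetical, and it shows only the core renaming idea, namely that every use reads the most recent version of a variable and every definition creates a fresh one.

```python
# Toy illustration of single-assignment renaming for straight-line code.
# Representation and naming scheme are hypothetical, not the thesis's own.

def to_single_assignment(stmts):
    """stmts: list of (target, operand_list) pairs for straight-line code.
    Returns renamed statements in which every target is assigned exactly once."""
    version = {}    # current version number of each variable
    renamed = []
    for target, operands in stmts:
        # Each use reads the latest version of its operand
        # (version 0 denotes an unassigned input variable).
        uses = [f"{v}_{version.get(v, 0)}" for v in operands]
        # Each definition creates a fresh version of the target.
        version[target] = version.get(target, 0) + 1
        renamed.append((f"{target}_{version[target]}", uses))
    return renamed

prog = [("x", ["a"]), ("x", ["x", "b"]), ("y", ["x"])]
print(to_single_assignment(prog))
# [('x_1', ['a_0']), ('x_2', ['x_1', 'b_0']), ('y_1', ['x_2'])]
```

Note how the second assignment to `x` becomes a definition of `x_2` that reads `x_1`, so no variable is written more than once.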
The result of these fundamental steps is a decomposed system that is semantically equivalent to the original sequential program, with added concurrency. This system may include unnecessary processes and communications, and may not be optimized for circuit performance. Further optimizations and additional techniques for DDD are described in Chapters 4 and 5.
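The dependency-set construction of step 2(a) can likewise be sketched. Again, the data structures and function names here are hypothetical illustrations rather than DDD's actual machinery: each assigned variable is mapped to the set of inputs it transitively depends on, and any variable appearing in more than one dependency set is a candidate for the copy and input variables mentioned above.

```python
# Hypothetical sketch of step 2(a): per-variable dependency sets for
# straight-line single-assignment code, plus detection of variables
# shared across sets (candidates for copy insertion).
from collections import defaultdict

def dependency_sets(stmts):
    """stmts: list of (target, operand_list) pairs in single-assignment form.
    Returns a map from each target to the set of inputs it depends on."""
    deps = {}
    for target, operands in stmts:
        s = set()
        for v in operands:
            # An operand contributes its own dependency set if it was
            # assigned earlier, or itself if it is a primary input.
            s |= deps.get(v, {v})
        deps[target] = s
    return deps

def shared_variables(deps):
    """Variables appearing in more than one dependency set."""
    count = defaultdict(int)
    for s in deps.values():
        for v in s:
            count[v] += 1
    return {v for v, n in count.items() if n > 1}

prog = [("x", ["a"]), ("y", ["x", "b"]), ("z", ["a"])]
deps = dependency_sets(prog)
print(shared_variables(deps))   # {'a'}: 'a' feeds x, y, and z
```

In this toy example the input `a` lands in several dependency sets, which is exactly the situation in which DDD would introduce copy and input variables before projection.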
Chapter 3
Asynchronous Circuits and Synthesis
Now that we have presented DDD for general process decomposition, we focus our attention on process decomposition for the high-level synthesis of asynchronous VLSI systems. Process decomposition is the first step in the asynchronous design flow, and the skill with which it is performed greatly impacts the performance and energy consumption of the final hardware. For DDD to generate systems that can be implemented as fast and energy-efficient asynchronous circuits, low-level circuit information must be incorporated in its high-level transformations.
This is the first of three chapters that present a version of DDD tailored specifically for use in the design of asynchronous hardware. We begin in this chapter by providing a general introduction to asynchronous VLSI circuits and synthesis. We then present templates for a family of fast asynchronous circuits. These templates allow DDD to estimate low-level circuit performance metrics without requiring formal logic synthesis. The ensuing chapters describe the actual modifications to DDD for use in asynchronous design.
3.1 Quasi Delay-Insensitivity
By definition, asynchronous systems eschew a global clock signal, but they may still make many different timing assumptions to synchronize their actions. The most conservative style of asynchronous design is delay-insensitive (DI), which makes no timing assumptions and guarantees the correctness of computations for any set of wire and gate delays. It has been shown that the class of completely DI systems is quite limited, excluding most circuits of interest [36].
Quasi delay-insensitive (QDI) design makes only one kind of timing assumption and is the most conservative approach commonly found in asynchronous VLSI systems. QDI systems admit isochronic forks: points where a wire branches, under the assumption that signals propagate along the different branches with similar delays. The addition of this one timing assumption allows entire microprocessors to be built. In fact, the fastest working asynchronous microprocessors to date are QDI [42]. (Other asynchronous design styles exist with more timing assumptions [21] but, as with clocked circuits, the safety margins required to ensure correctness hinder their performance.) The Caltech synthesis techniques (both manual and automated) described in this thesis localize isochronic forks to the extent that their assumptions are easily met.
The QDI design style enhances some of the inherent advantages of asynchronous design, and adds others too:
• Low Power: Unlike asynchronous design styles with more timing assumptions, no delay lines or similar elements are required to match delays along different paths for correctness. Hence, QDI circuits stop switching completely when idle, reducing idle dynamic-energy consumption to zero. From the perspective of synchronous VLSI, this is equivalent to “perfect” clock gating.
• Robustness: Independence from delays allows systems to remain correct no matter how physical parameters affect performance. With only the minimal timing assumption of isochronic forks, QDI systems are robust to variations in physical parameters such as voltage, temperature, and fabrication. (Variations in fabrication are becoming increasingly prevalent as feature size shrinks.) In practice, the voltage of QDI systems can be scaled during runtime to trade off energy and speed without requiring any dedicated circuitry or ramp-down protocols. QDI microprocessors have been demonstrated running correctly at sub-threshold voltages [42].
• Modularity: Using the Caltech synthesis flow for QDI design (both the existing manual approaches and the new DDD techniques), isochronic forks are almost always localized within individual circuits, leaving the system interconnect delay-insensitive. Modular design with QDI systems is therefore easier than with synchronous components that may have different clock domains, or with less conservative asynchronous components where different timing constraints may need to be met at the interfaces. Increased modularity also promotes the reusability of circuits designed in the asynchronous QDI style.
The main disadvantages of QDI design are an area penalty caused by the extra circuitry and wiring required to implement delay-insensitivity, and a current lack of synthesis tools for automated design. The area penalty can increase the energy consumption of a system, but this effect is usually dwarfed by the other low-power advantages of asynchronous and QDI design. The lack of automated synthesis tools is, of course, addressed in part by DDD.