This dissertation would not have been possible without his inspiration and support, and I will always appreciate his willingness to make time (even into the early hours of a Saturday morning!) to discuss my research. And needless to say, everything would have been chaos without the help of our friendly department administrative staff.
The Advantages of Asynchrony
The elimination of the global clock reduces the number of timing assumptions in the system, to a degree that depends on the style of asynchrony chosen. In any case, an immediate benefit of eliminating some timing assumptions is that fewer timing assumptions, or even none, need to be matched across component interfaces.
VLSI Synthesis
Perhaps the biggest current advantage of asynchronous design is increased robustness: the lack of timing assumptions separates issues of performance from issues of correctness. The throughput of asynchronous systems relies mostly on what are known as pipeline dynamics or handshaking dynamics, as determined during high-level synthesis.
Contributions
Together, DDD and the reconfigurable asynchronous architecture provide new possibilities for designers looking to synthesize asynchronous VLSI systems.
Organization
The main logic cells of this FPGA are based on the same asynchronous pipeline stages that DDD focuses on. These transformations can be thought of as analogous to partitioning and retiming in synchronous VLSI systems.
Notation
DDD performs process decomposition by analyzing the data dependencies in the source program to remove unnecessary synchronization and then split the program into an equivalent system of target processes. Projection: The second phase of DDD splits the DSA sequential program into a new system with one process for each variable in the code.
Basic Concepts for Data-Driven Decomposition
- Program Execution
- Program Equivalence
- Assignments and Data Dependencies
- Reaching Definitions
- DSA Transformations
The program transformations of the DSA phase all transform sequential programs into new sequential programs. In this context, we treat both regular commands and communication statements as "commands", and both regular variables and communication channels as "variables". We say that variable x directly depends on variable y if y is used in an assignment to x.
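The direct-dependency relation just defined can be sketched in a few lines of Python. The `(target, used)` tuple representation of assignments is an illustrative assumption for this sketch, not the dissertation's actual data structures.

```python
# Sketch: computing direct data dependencies from a list of assignments.
# An assignment is modeled as (target, used), where `used` is the set of
# variables (or channels) read on the right-hand side. Variable x directly
# depends on y if y is used in some assignment to x.

def direct_dependencies(assignments):
    deps = {}
    for target, used in assignments:
        deps.setdefault(target, set()).update(used)
    return deps

# Example: x := y + z; w := x
deps = direct_dependencies([("x", {"y", "z"}), ("w", {"x"})])
assert deps["x"] == {"y", "z"}
assert deps["w"] == {"x"}
```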
Dynamic Single Assignment
- Motivation
- DSA Indices
- Straightline Code
- Selection Statements
- DSA Indices from Pre-Selection Code
- DSA Indices for Post-Selection Code
- Proof of Correctness
- Repetition Statements
- Loops with Terminating Bodies
- Nested Conditional Loops
- Special Cases
- Putting It Together
For all variables x in S, rename the variable on the left-hand side of the ith assignment to x as x_n(i). But the first definition of a DSA variable in the loop body is actually the inserted copy assignment at the beginning of the loop.
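The renaming rule for straight-line code can be sketched as follows. The statement representation and the `x_1`, `x_2`, … naming scheme are illustrative assumptions, and the special handling of loop-carried copy assignments described above is omitted.

```python
# Sketch of DSA renaming for straight-line code: the i-th assignment to x
# is renamed x_i, and every use reads the most recent index. A statement is
# modeled as (target, used_list); unassigned variables read index 0.

def to_dsa(statements):
    count = {}    # next DSA index per variable
    current = {}  # current DSA name per variable
    out = []
    for target, used in statements:
        new_used = [current.get(v, v + "_0") for v in used]
        idx = count.get(target, 0) + 1
        count[target] = idx
        current[target] = f"{target}_{idx}"
        out.append((current[target], new_used))
    return out

# x := a; x := x + b  becomes  x_1 := a_0; x_2 := x_1 + b_0
dsa = to_dsa([("x", ["a"]), ("x", ["x", "b"])])
assert dsa == [("x_1", ["a_0"]), ("x_2", ["x_1", "b_0"])]
```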
Using the Projection Technique
- Dependency Sets
- Copy Variable and System Channel Insertion
- Inserting Copy Variables
- Internal Communication Channel Insertion
- Performing Projection
- Looking Ahead
For example, if a variable appears in the dependency sets of several other variables, copy variables must be inserted into the sequential code. Finally, all internal channel ports are also linked to variables in the program and included in their projection sets.
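The copy-insertion rule can be sketched under the assumption that dependency sets are given as a dict from each variable to the set of variables it depends on; the copy-naming scheme below is hypothetical.

```python
# Sketch: if x appears in the dependency sets of several other variables,
# copy variables are introduced so that each consumer reads its own copy;
# the copies become outputs of x's own process. Names are illustrative.

def insert_copies(dep_sets):
    consumers = {}
    for var, deps in dep_sets.items():
        for d in deps:
            consumers.setdefault(d, []).append(var)
    copies = {}
    for var, users in consumers.items():
        if len(users) > 1:
            copies[var] = {u: f"{var}_copy{i}"
                           for i, u in enumerate(sorted(users), 1)}
    return copies

# x feeds both y and z, so two copies of x are needed; a has one consumer.
copies = insert_copies({"y": {"x"}, "z": {"x"}, "x": {"a"}})
assert copies == {"x": {"y": "x_copy1", "z": "x_copy2"}}
```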
Related Work: Static Single Assignment Form
This system is shown in Figure 2.7. Note that there are several inefficiencies in the system implementing Q3. If DDD is used for hardware synthesis, the DSA system is clearly more efficient than the SSA system.
Summary
The most conservative style of asynchronous design is delay-insensitive (DI), which makes no timing assumptions: correctness is guaranteed regardless of delays. The QDI design style enhances some of the inherent advantages of asynchronous design and also adds others of its own.
Communications and Handshakes
Channel Encodings
Different encodings for a bit of information: (a) the data rails of an e1of2 channel C, whose enable wire is set by the receiver, or (b) all rails of a 1of2 channel C. Similarly, two information bits can be expressed using an e1of4 channel (with five wires), and three information bits can be expressed using either an e1of8 channel (with nine wires) or the combination of an e1of2 and an e1of4 channel (with eight wires in total).
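The wire counts quoted above follow from the simple rule that an e1ofN channel has N data rails plus one enable wire set by the receiver:

```python
# Wire-count arithmetic for e1ofN channels: N data rails plus one enable wire.

def e1ofn_wires(n):
    return n + 1

assert e1ofn_wires(2) == 3   # one bit
assert e1ofn_wires(4) == 5   # two bits
assert e1ofn_wires(8) == 9   # three bits
# Three bits can also use an e1of2 plus an e1of4: 3 + 5 = 8 wires total.
assert e1ofn_wires(2) + e1ofn_wires(4) == 8
```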
Handshakes
Asynchronous VLSI Synthesis Overview
Some of the intricacies of handshake expansion lie in the interleaving of handshakes for different channels, so that a minimal amount of data requires explicit storage. Execution of a PRS proceeds as follows: find the production rules whose guard conditions currently evaluate to true, and fire one of them.
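The execution rule can be illustrated with a toy production-rule simulator. The rule representation is an assumption made for this sketch, and a real PRS fires enabled rules concurrently rather than strictly one at a time.

```python
# Toy simulator for production-rule execution: repeatedly pick an enabled
# rule (guard true, target not already at its value) and fire it.
import random

def run_prs(rules, state, steps=100, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        enabled = [(v, val) for g, v, val in rules
                   if g(state) and state[v] != val]
        if not enabled:
            break
        v, val = rng.choice(enabled)
        state[v] = val
    return state

# A C-element: out goes up when both inputs are up, down when both are down.
rules = [
    (lambda s: s["a"] and s["b"], "out", True),
    (lambda s: not s["a"] and not s["b"], "out", False),
]
state = run_prs(rules, {"a": True, "b": True, "out": False})
assert state["out"] is True
```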
Asynchronous Pipeline Stages
No channel is used more than once in any execution trace of Pinit or Ploop. Although nested loops are not allowed by this specification, nested select statements and conditional communications are, as long as no channel is used more than once in a select branch.
Precharged Half-Buffers
Traditional Compilation
Using techniques described by Martin [38], we compile this HSE to generate the PRS given in Figure 3.4. The final PRS, which remains CMOS-implementable as well as stable and non-interfering, is shown in Figure 3.5.
Circuit Templates
- Unconditional Communications
- Conditional Outputs
- Conditional Inputs
- State-holding Bits
For conditional inputs, the PCHB may ignore valid data on an input channel rather than accept it. The local enable signal depends on all the input possibilities and connects the different computation networks in the PCHB.
Performance Metrics
Let τvalid be the maximum number of transitions between the circuit's receipt of input data and its generation of valid data on any output channel.
Summary
Variable Processes
Therefore, any original channels that appear in a DDD variable process are input channels and are used at most once per iteration. Likewise, each projection-introduced internal channel that appears in a DDD variable process (such as the channel Px for variable x) is used at most once per iteration.
Channel Processes
Again, since the program is deterministic, reaching definitions are preserved and the transformed process is program equivalent to the original. The new process is program-equivalent to the original and satisfies the conditions for asynchronous pipeline stages.
Isolating Hardware Units
Arrays
Thus, a separate variable aidx allows computations of functions such as g(i) to be isolated in a non-state-holding control process. Similarly, if array accesses occur within select statements, a separate variable allows the computation of the wait conditions to be separated from the state-holding control process for channel OPA.
Functional Units
To prevent any single DDD process from requiring redundant circuitry that could increase its cycle time, and thus the system cycle time, DDD tries to separate computation from state-holding bits. Note that processes PAop, PAidx, PAwr, and PArd all contain state-holding bits to distinguish between the multiple array accesses required in one iteration of the original program.
Reducing System Communications
Encoding Guard Expressions
- Removing Nested Selections
- Removing Basic Selections
If Cenc < Cunenc, then we encode the guard conditions of the selection in question. In this case, encoding the guard conditions reduces the communication cost of the selection by almost half.
Conditional Communications
If such a communication-free selection statement contains assignments to multiple variables, it is wise to check whether guard encoding leads to a better system before rewriting the selection statement.
Reducing System Computations
- Motivation
- Distillation
- Elimination
- Example
This can result in the removal of guard variables and the channels on which they are communicated, reducing the dynamic power consumption of the system. In a DDD-generated system, when an internal channel is used conditionally in one process, the same state is also always checked in the process at the other end of the channel.
Future Work
If non-deterministic selection statements exist in the original code, their results can be stored in a variable and that variable used as a guard in the subsequent deterministic selection statement. The non-deterministic selection can then be isolated and removed, leaving a deterministic locally slack elastic system to which DDD can be applied.
Summary
The global optimization problem combining recomposition and slack matching is complex; DDD addresses it with the randomized heuristic of simulated annealing. This chapter describes the circumstances under which individual processes can and should be recomposed, the system model used to implement slack matching, a new method for slack matching homogeneous systems, and the cost functions used in the simulated annealing.
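The simulated-annealing skeleton underlying such a search can be sketched generically: propose a random local change (merge or split a cluster, add or remove a slack buffer), accept worsening moves with probability exp(-delta/T), and cool. The toy cost function and cooling parameters below are illustrative, not the values used by DDD.

```python
# Generic simulated-annealing skeleton for a combinatorial minimization.
import math
import random

def anneal(state, neighbor, cost, t0=10.0, cooling=0.95, iters=2000, seed=1):
    rng = random.Random(seed)
    cur, cur_c = state, cost(state)
    best, best_c = cur, cur_c
    t = t0
    for _ in range(iters):
        cand = neighbor(cur, rng)
        delta = cost(cand) - cur_c
        # Always accept improvements; accept worsening moves with prob exp(-d/T).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur, cur_c = cand, cost(cand)
            if cur_c < best_c:
                best, best_c = cur, cur_c
        t *= cooling
    return best, best_c

# Toy problem: find the integer x minimizing (x - 7)^2.
best, c = anneal(0, lambda x, r: x + r.choice([-1, 1]), lambda x: (x - 7) ** 2)
assert c == 0
```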
Motivation
The first is that although we generally cluster processes to minimize communication and energy consumption, other criteria can be chosen by designers instead, requiring only a change in the cost functions used during the clustering heuristic. The second advantage of recomposition is that the performance optimization of slack matching can be applied simultaneously to the same decomposed system of processes.
Recomposition of PCHB Processes
Clustering in Series
Therefore, we insert LR buffers at selected points of the system, and try to do so with as little energy cost as possible. Finally, we introduce a clustering heuristic for DDD that combines recomposition with slack matching, describing approaches to special cases.
Clustering in Parallel
This eliminates switching on the wires between the clustered processes and can simplify the routing problem, but has no significant impact on system power consumption. Our heuristics cluster together processes whose features yield the greatest energy reductions; DDD focuses on grouping processes that share data or are arranged in series.
Limits to Cluster Size
The depth of these trees depends on the size of their fanin (which in turn depends on the width and number of channels) and on the maximum number of transistors allowed in series (which limits the fanin for each individual gate). All the information needed to estimate the cycle time of a recomposed process can be determined from its CHP specification (see Chapter 3).
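The dependence of tree depth on fanin and the series-transistor limit can be sketched numerically; the per-level transition count and base cycle time below are illustrative constants, not measured values.

```python
# Sketch of the tree-depth estimate behind the cluster-size limit: a
# completion tree over F inputs built from gates of at most k inputs
# (k bounded by the allowed transistors in series) has depth ceil(log_k F),
# computed here with integer arithmetic; each level adds transitions.

def tree_depth(fanin, k):
    depth, capacity = 1, k
    while capacity < fanin:
        depth += 1
        capacity *= k
    return depth

def est_cycle_time(base_transitions, fanin, k, per_level=2):
    return base_transitions + per_level * tree_depth(fanin, k)

assert tree_depth(4, 4) == 1
assert tree_depth(16, 4) == 2
assert tree_depth(17, 4) == 3
assert est_cycle_time(14, 16, 4) == 18
```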
Slack Matching
FBI Delay Model for PCHBs
- FBI Graphs
- Messages and Tokens
The delay of the circuit element represented by the node φn is δ(φn), and the delay of each node type in the FBI model is defined as follows. Edges in the graph are labeled e(n), where n is the index of the corresponding PCHB in the ring. Thus, input validities can be computed in the same number of transitions as forward delays.

Note that the highest achievable throughput of the combined system is smaller than the maximum achievable throughput of the individual pipelines. Slack matching adds buffers to Q so that the dynamic-slack ranges of the two pipelines overlap and the system can run as fast as the individual pipelines. The ring in the PCHB graph on the left implements an incrementing process INC that has been isolated but is used twice per iteration of the original sequential program. Within the framework of simulated annealing, our task is to determine the number of slack buffers Nslack(e) required for each edge e ∈ E0 such that the critical cycle time of the system is no greater than τmax.

We applied the techniques of DDD to the design of the instruction fetch unit of the Lutonium, an asynchronous 8051 microcontroller [41]. The unit is responsible for decoding variable-length instructions, generating the next program counter, reading and writing the instruction memory, and handling interrupts.

Sequential Transformations

The system created when DDD is applied to the Fetch unit of the asynchronous 8051 microcontroller.

Some approaches provide the same level of functionality as our PCHB pulldown logic. We limit the cell to three input channels because, with four input channels, the pulldown network of the computation logic grows too long for high performance. This reduces the number of transistors in series in the pulldown network, which would otherwise grow due to the additional logic required for programmability. SRAM cells sit next to the leg transistors in a sample pulldown network for a programmable PCHB with unconditional communications.
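The interaction between tokens and holes that slack matching exploits can be illustrated with a simplified homogeneous ring model; the forward and backward latencies `lf` and `lb` are illustrative constants, not the FBI model's actual delays.

```python
# Sketch: the cycle time of a pipelined ring is limited by whichever is
# scarcer, tokens or holes. In this simplified homogeneous model each stage
# contributes forward latency lf per token and backward latency lb per hole.

def ring_cycle_time(stages, tokens, lf=2, lb=2):
    holes = stages - tokens  # must be positive for the ring to move
    return max(stages * lf / tokens, stages * lb / holes)

# An 8-stage ring: too few tokens is token-limited, too many is hole-limited,
# and a balanced occupancy gives the best (smallest) cycle time.
assert ring_cycle_time(8, 2) == 8.0
assert ring_cycle_time(8, 6) == 8.0
assert ring_cycle_time(8, 4) == 4.0
```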
This decision is justified again later when we consider e1of4 channels: even without counting the additional precharge transistors, the second configuration uses over 10% more area than the first. From these patterns, a cell can implement not only basic computational blocks but also, either alone or in combination with another cell, two basic control circuits of asynchronous QDI design: the controlled merge and the controlled split.

The interconnect area penalty resulting from the multiple wires of a dual-rail channel is also partially offset by the omission of a clock tree in an asynchronous FPGA. The logic cells can be programmed to connect to these channels just as they connect to the external interconnect; the new channels are completely local. Since all cells in the FPGA are identical, optimizing the performance of a system implemented on the FPGA requires homogeneous slack matching. One cluster implementing a two-bit full adder with input channels A and B, sum output channel S, and carry channels C.

Microprocessor Execution Unit

Compared to the e1of2 cell, an equivalent e1of4 cell (with the same control logic but twice the datapath) requires 132 SRAM bits and more than five times the area. However, analog-level circuit details can cause the e1of4 cell to operate more slowly than the smaller e1of2 cell. Similarly, the e1of4 interconnect uses approximately 2.1 times the area of a synchronous 2-bit interconnect, regardless of which type of switching network is chosen. One reason is the effect of the various switching-network architectures on the area required by the e1of4 interconnect. The total fanin of a cluster's logic cells and slack buffers is therefore Imax = K·Nc + Nb. Meanwhile, every logic-cell and slack-buffer output can be copied to two different channels in the interconnect.
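The fanin formula can be checked with simple arithmetic; the example values below are illustrative, with K = 3 matching the three-input-channel cell limit mentioned earlier.

```python
# Total fanin available to a cluster's switching network: each of the Nc
# logic cells contributes K input channels and each of the Nb slack buffers
# contributes one, so Imax = K*Nc + Nb.

def cluster_fanin(k, n_cells, n_buffers):
    return k * n_cells + n_buffers

# e.g. 4 cells with 3 input channels each, plus 4 slack buffers:
assert cluster_fanin(3, 4, 4) == 16
```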
While its early stages can be applied to convert both software and hardware algorithms into concurrent systems, we have also introduced techniques in DDD that make it the first high-level synthesis method to target the pipeline stages used in high-performance asynchronous VLSI systems. We have presented parameterizable circuit templates for precharged half-buffers (PCHBs), the family of asynchronous pipeline stages used in most high-performance asynchronous VLSI systems, and we have shown that every process in the decomposed system generated by DDD can be implemented directly as a PCHB. The DDD tool can be combined with area and performance estimation tools for FPGAs to explore all possibilities, including more adventurous interconnect schemes that exploit delay insensitivity and the flexibility of asynchronous design. Although communication primitives can be used to communicate with other processes, they are not necessary for transferring information within a single process. When the CHP program reaches this structure, it halts until one of the guards becomes true. In a deterministic selection, at most one of the guard expressions g1, g2, and g3 can be true at any given moment.
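The deterministic-selection semantics described here can be modeled in a few lines; the representation of guards and branches as lambdas is an assumption of this sketch, and blocking on false guards is modeled simply by returning `None`.

```python
# Sketch of deterministic-selection semantics: execute the branch whose
# guard is true; determinism means at most one guard may hold at a time.
# Waiting is modeled by returning None when no guard is currently true.

def select(guards_and_actions, state):
    true_branches = [act for g, act in guards_and_actions if g(state)]
    assert len(true_branches) <= 1, "guards must be mutually exclusive"
    return true_branches[0](state) if true_branches else None

res = select([(lambda s: s["x"] > 0, lambda s: "pos"),
              (lambda s: s["x"] < 0, lambda s: "neg")], {"x": 5})
assert res == "pos"
```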
Critical Cycles
Dynamic Slack
Clustering Heuristic
Summary
Decomposition and System Transformations
Logic Cells
Reconfigurable PCHB Circuit
Communication Patterns
Performance and Area
Cluster Design
Mapping Examples
Full Adder and ALU
Architectural Models and Interconnect
Logic Cell Comparisons
Interconnect
Parameterized Cluster Design
Summary
Future Work
Basic Statements