This dissertation would not have been possible without his inspiration and support, and I will always appreciate his willingness to make time (even into the early hours of a Saturday morning!) to discuss my research. And needless to say, everything would have been chaos without the help of our friendly department administrative staff.
The Advantages of Asynchrony
The elimination of the global clock reduces the number of timing assumptions in the system, to a degree that depends on the style of asynchrony chosen. In any case, an immediate benefit of eliminating some timing assumptions is that fewer timing assumptions, or even none, need to be matched across component interfaces.
VLSI Synthesis
Perhaps the biggest current advantage of asynchronous design is increased robustness: the lack of timing assumptions separates issues of performance from issues of correctness. The throughput of asynchronous systems relies mostly on what are known as pipeline dynamics or handshaking dynamics, as determined during high-level synthesis.
Contributions
Together, DDD and the reconfigurable asynchronous architecture provide new possibilities for designers looking to synthesize asynchronous VLSI systems.
Organization
The main logic cells of this FPGA are based on the same asynchronous pipeline stages that DDD focuses on. These transformations can be thought of as analogous to partitioning and retiming in synchronous VLSI systems.
Notation
DDD performs process decomposition by analyzing the data dependencies in the source program to remove unnecessary synchronization and then split the program into an equivalent system of target processes. Projection: The second phase of DDD splits the DSA sequential program into a new system with one process for each variable in the code.
Basic Concepts for Data-Driven Decomposition
- Program Execution
- Program Equivalence
- Assignments and Data Dependencies
- Reaching Definitions
- DSA Transformations
The program transformations of the DSA phase all transform sequential programs into new sequential programs. In this context, we treat both regular commands and communication statements as "commands", and both regular variables and communication channels as "variables". We say that variable x directly depends on variable y if y is used in an assignment to x.
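The direct-dependency relation just defined can be sketched in a few lines of Python. The `(target, used)` tuple representation of assignments is an illustrative assumption for this sketch, not the dissertation's actual data structures.

```python
# Sketch: computing direct data dependencies from a list of assignments.
# An assignment is modeled as (target, used), where `used` is the set of
# variables (or channels) read on the right-hand side. Variable x directly
# depends on y if y is used in some assignment to x.

def direct_dependencies(assignments):
    deps = {}
    for target, used in assignments:
        deps.setdefault(target, set()).update(used)
    return deps

# Example: x := y + z; w := x
deps = direct_dependencies([("x", {"y", "z"}), ("w", {"x"})])
assert deps["x"] == {"y", "z"}
assert deps["w"] == {"x"}
```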
Dynamic Single Assignment
- Motivation
- DSA Indices
- Straightline Code
- Selection Statements
- DSA Indices from Pre-Selection Code
- DSA Indices for Post-Selection Code
- Proof of Correctness
- Repetition Statements
- Loops with Terminating Bodies
- Nested Conditional Loops
- Special Cases
- Putting It Together
For all variables x in S, rename the variable on the left-hand side of the ith assignment to x as x_n(i). But the first definition of a DSA variable in the loop body is actually the inserted copy assignment at the beginning of the loop.
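The renaming rule for straight-line code can be sketched as follows. The statement representation and the `x_1`, `x_2`, … naming scheme are illustrative assumptions, and the special handling of loop-carried copy assignments described above is omitted.

```python
# Sketch of DSA renaming for straight-line code: the i-th assignment to x
# is renamed x_i, and every use reads the most recent index. A statement is
# modeled as (target, used_list); unassigned variables read index 0.

def to_dsa(statements):
    count = {}    # next DSA index per variable
    current = {}  # current DSA name per variable
    out = []
    for target, used in statements:
        new_used = [current.get(v, v + "_0") for v in used]
        idx = count.get(target, 0) + 1
        count[target] = idx
        current[target] = f"{target}_{idx}"
        out.append((current[target], new_used))
    return out

# x := a; x := x + b  becomes  x_1 := a_0; x_2 := x_1 + b_0
dsa = to_dsa([("x", ["a"]), ("x", ["x", "b"])])
assert dsa == [("x_1", ["a_0"]), ("x_2", ["x_1", "b_0"])]
```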
Using the Projection Technique
- Dependency Sets
- Copy Variable and System Channel Insertion
- Inserting Copy Variables
- Internal Communication Channel Insertion
- Performing Projection
- Looking Ahead
For example, if a variable appears in the dependency sets of several other variables, copy variables must be inserted into the sequential code. Finally, all internal channel ports are also linked to variables in the program and included in their projection sets.
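The copy-insertion rule can be sketched under the assumption that dependency sets are given as a dict from each variable to the set of variables it depends on; the copy-naming scheme below is hypothetical.

```python
# Sketch: if x appears in the dependency sets of several other variables,
# copy variables are introduced so that each consumer reads its own copy;
# the copies become outputs of x's own process. Names are illustrative.

def insert_copies(dep_sets):
    consumers = {}
    for var, deps in dep_sets.items():
        for d in deps:
            consumers.setdefault(d, []).append(var)
    copies = {}
    for var, users in consumers.items():
        if len(users) > 1:
            copies[var] = {u: f"{var}_copy{i}"
                           for i, u in enumerate(sorted(users), 1)}
    return copies

# x feeds both y and z, so two copies of x are needed; a has one consumer.
copies = insert_copies({"y": {"x"}, "z": {"x"}, "x": {"a"}})
assert copies == {"x": {"y": "x_copy1", "z": "x_copy2"}}
```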
Related Work: Static Single Assignment Form
This system is shown in Figure 2.7. Note that there are several inefficiencies in the system implementing Q3. If DDD is used for hardware synthesis, the DSA system is clearly more efficient than the SSA system.
Summary
The most conservative style of asynchronous design is delay-insensitive (DI), which makes no timing assumptions: correctness is guaranteed regardless of delays. The QDI design style enhances some of the inherent advantages of asynchronous design and also adds others of its own.
Communications and Handshakes
Channel Encodings
Different encodings for a bit of information: (a) the data rails of an e1of2 channel C, whose enable wire is set by the receiver, or (b) all rails of a 1of2 channel C. Similarly, two information bits can be expressed using an e1of4 channel (with five wires), and three information bits can be expressed using either an e1of8 channel (with nine wires) or the combination of an e1of2 and an e1of4 channel (with eight wires in total).
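The wire counts quoted above follow from the simple rule that an e1ofN channel has N data rails plus one enable wire set by the receiver:

```python
# Wire-count arithmetic for e1ofN channels: N data rails plus one enable wire.

def e1ofn_wires(n):
    return n + 1

assert e1ofn_wires(2) == 3   # one bit
assert e1ofn_wires(4) == 5   # two bits
assert e1ofn_wires(8) == 9   # three bits
# Three bits can also use an e1of2 plus an e1of4: 3 + 5 = 8 wires total.
assert e1ofn_wires(2) + e1ofn_wires(4) == 8
```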
Handshakes
Asynchronous VLSI Synthesis Overview
Some of the intricacies of handshake expansion lie in the interleaving of handshakes for different channels, so that a minimal amount of data requires explicit storage. Execution of a PRS proceeds as follows: find the production rules whose guard conditions currently evaluate to true, and fire one of them.
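The execution rule can be illustrated with a toy production-rule simulator. The rule representation is an assumption made for this sketch, and a real PRS fires enabled rules concurrently rather than strictly one at a time.

```python
# Toy simulator for production-rule execution: repeatedly pick an enabled
# rule (guard true, target not already at its value) and fire it.
import random

def run_prs(rules, state, steps=100, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        enabled = [(v, val) for g, v, val in rules
                   if g(state) and state[v] != val]
        if not enabled:
            break
        v, val = rng.choice(enabled)
        state[v] = val
    return state

# A C-element: out goes up when both inputs are up, down when both are down.
rules = [
    (lambda s: s["a"] and s["b"], "out", True),
    (lambda s: not s["a"] and not s["b"], "out", False),
]
state = run_prs(rules, {"a": True, "b": True, "out": False})
assert state["out"] is True
```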
Asynchronous Pipeline Stages
No channel is used more than once in any execution trace of Pinit or Ploop. Although nested loops are not allowed by this specification, nested select statements and conditional communications are, as long as no channel is used more than once in a select branch.
Precharged Half-Buffers
Traditional Compilation
Using techniques described by Martin [38], we compile this HSE to generate the PRS given in Figure 3.4. The final PRS, which remains CMOS-implementable as well as stable and non-interfering, is shown in Figure 3.5.
Circuit Templates
- Unconditional Communications
- Conditional Outputs
- Conditional Inputs
- State-holding Bits
For conditional inputs, the PCHB may ignore valid data on an input channel rather than accept it. The local enable signal depends on all the input possibilities and connects the different computation networks in the PCHB.
Performance Metrics
Let τvalid be the maximum number of transitions between the circuit's receipt of input data and its generation of valid data on any output channel.
Summary
Variable Processes
Therefore, any original channels that appear in a DDD variable process are input channels and are used at most once per iteration. Likewise, each projection-introduced internal channel that appears in a DDD variable process (such as the channel Px for variable x) is used at most once per iteration.
Channel Processes
Again, since the program is deterministic, reaching definitions are preserved and the transformed process is program equivalent to the original. The new process is program-equivalent to the original and satisfies the conditions for asynchronous pipeline stages.
Isolating Hardware Units
Arrays
Thus, a separate variable aidx allows computations of functions such as g(i) to be isolated in a non-state-holding control process. Similarly, if array accesses occur within select statements, a separate variable allows the computation of the wait conditions to be separated from the state-holding control process for channel OPA.
Functional Units
To prevent any single DDD process from requiring redundant circuitry that could increase its cycle time, and thus the system cycle time, DDD tries to separate computation from state-holding bits. Note that processes PAop, PAidx, PAwr, and PArd all contain state-holding bits to distinguish between the multiple array accesses required in one iteration of the original program.
Reducing System Communications
Encoding Guard Expressions
- Removing Nested Selections
- Removing Basic Selections
If Cenc < Cunenc, then we encode the guard conditions of the selection in question. In this case, encoding the guard conditions reduces the communication cost of the selection by almost half.
Conditional Communications
If such a communication-free selection statement contains assignments to multiple variables, it is wise to check whether guard encoding leads to a better system before rewriting the selection statement.
Reducing System Computations
- Motivation
- Distillation
- Elimination
- Example
This can result in the removal of guard variables and the channels on which they are communicated, reducing the dynamic power consumption of the system. In a DDD-generated system, when an internal channel is used conditionally in one process, the same state is also always checked in the process at the other end of the channel.
Future Work
If non-deterministic selection statements exist in the original code, their results can be stored in a variable and that variable used as a guard in the subsequent deterministic selection statement. The non-deterministic selection can then be isolated and removed, leaving a deterministic locally slack elastic system to which DDD can be applied.
Summary
The global optimization problem combining recomposition and slack matching is complex; DDD addresses it with the randomized heuristic of simulated annealing. This chapter describes the circumstances under which individual processes can and should be recomposed, the system model used to implement slack matching, a new method for slack matching homogeneous systems, and the cost functions used in the simulated annealing.
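The simulated-annealing skeleton underlying such a search can be sketched generically: propose a random local change (merge or split a cluster, add or remove a slack buffer), accept worsening moves with probability exp(-delta/T), and cool. The toy cost function and cooling parameters below are illustrative, not the values used by DDD.

```python
# Generic simulated-annealing skeleton for a combinatorial minimization.
import math
import random

def anneal(state, neighbor, cost, t0=10.0, cooling=0.95, iters=2000, seed=1):
    rng = random.Random(seed)
    cur, cur_c = state, cost(state)
    best, best_c = cur, cur_c
    t = t0
    for _ in range(iters):
        cand = neighbor(cur, rng)
        delta = cost(cand) - cur_c
        # Always accept improvements; accept worsening moves with prob exp(-d/T).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur, cur_c = cand, cost(cand)
            if cur_c < best_c:
                best, best_c = cur, cur_c
        t *= cooling
    return best, best_c

# Toy problem: find the integer x minimizing (x - 7)^2.
best, c = anneal(0, lambda x, r: x + r.choice([-1, 1]), lambda x: (x - 7) ** 2)
assert c == 0
```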
Motivation
The first is that although we generally cluster processes to minimize communication and energy consumption, other criteria can be chosen by designers instead, requiring only a change in the cost functions used during the clustering heuristic. The second advantage of recomposition is that the performance optimization of slack matching can be applied simultaneously to the same decomposed system of processes.
Recomposition of PCHB Processes
Clustering in Series
Therefore, we insert LR buffers at selected points of the system, and try to do so with as little energy cost as possible. Finally, we introduce a clustering heuristic for DDD that combines recomposition with slack matching, describing approaches to special cases.
Clustering in Parallel
This eliminates switching on the wires between the clustered processes and can simplify the routing problem, but has no significant impact on system power consumption. Our heuristics cluster together processes whose features yield the greatest energy reductions; DDD focuses on grouping processes that share data or are arranged in series.
Limits to Cluster Size
The depth of these trees depends on the size of their fanin (which in turn depends on the width and number of channels) and on the maximum number of transistors allowed in series (which limits the fanin for each individual gate). All the information needed to estimate the cycle time of a recomposed process can be determined from its CHP specification (see Chapter 3).
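The dependence of tree depth on fanin and the series-transistor limit can be sketched numerically; the per-level transition count and base cycle time below are illustrative constants, not measured values.

```python
# Sketch of the tree-depth estimate behind the cluster-size limit: a
# completion tree over F inputs built from gates of at most k inputs
# (k bounded by the allowed transistors in series) has depth ceil(log_k F),
# computed here with integer arithmetic; each level adds transitions.

def tree_depth(fanin, k):
    depth, capacity = 1, k
    while capacity < fanin:
        depth += 1
        capacity *= k
    return depth

def est_cycle_time(base_transitions, fanin, k, per_level=2):
    return base_transitions + per_level * tree_depth(fanin, k)

assert tree_depth(4, 4) == 1
assert tree_depth(16, 4) == 2
assert tree_depth(17, 4) == 3
assert est_cycle_time(14, 16, 4) == 18
```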
Slack Matching
FBI Delay Model for PCHBs
- FBI Graphs
- Messages and Tokens
The delay of the circuit element represented by the node φn is δ(φn), and the delay of each node type in the FBI model is defined as follows. Edges in the graph are labeled e(n), where n is the index of the corresponding PCHB in the ring. Thus, input validities can be computed in the same number of transitions as forward delays.

Note that the highest achievable throughput of the combined system is smaller than the maximum achievable throughput of the individual pipelines. Slack matching adds buffers to Q so that the dynamic-slack ranges of the two pipelines overlap and the system can run as fast as the individual pipelines. The ring in the PCHB graph on the left implements an incrementing process INC that has been isolated but is used twice per iteration of the original sequential program. Within the framework of simulated annealing, our task is to determine the number of slack buffers Nslack(e) required for each edge e ∈ E0 such that the critical cycle time of the system is no greater than τmax.

We applied the techniques of DDD to the design of the instruction fetch unit of the Lutonium, an asynchronous 8051 microcontroller [41]. The unit is responsible for decoding variable-length instructions, generating the next program counter, reading and writing the instruction memory, and handling interrupts.

Sequential Transformations

The system created when DDD is applied to the Fetch unit of the asynchronous 8051 microcontroller.

Some approaches provide the same level of functionality as our PCHB pulldown logic. We limit the cell to three input channels because, with four input channels, the pulldown network of the computation logic grows too long for high performance. This reduces the number of transistors in series in the pulldown network, which would otherwise grow due to the additional logic required for programmability. SRAM cells sit next to the leg transistors in a sample pulldown network for a programmable PCHB with unconditional communications.
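The interaction between tokens and holes that slack matching exploits can be illustrated with a simplified homogeneous ring model; the forward and backward latencies `lf` and `lb` are illustrative constants, not the FBI model's actual delays.

```python
# Sketch: the cycle time of a pipelined ring is limited by whichever is
# scarcer, tokens or holes. In this simplified homogeneous model each stage
# contributes forward latency lf per token and backward latency lb per hole.

def ring_cycle_time(stages, tokens, lf=2, lb=2):
    holes = stages - tokens  # must be positive for the ring to move
    return max(stages * lf / tokens, stages * lb / holes)

# An 8-stage ring: too few tokens is token-limited, too many is hole-limited,
# and a balanced occupancy gives the best (smallest) cycle time.
assert ring_cycle_time(8, 2) == 8.0
assert ring_cycle_time(8, 6) == 8.0
assert ring_cycle_time(8, 4) == 4.0
```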
This decision is justified again later when we consider e1of4 channels: even without counting the additional precharge transistors, the second configuration uses over 10% more area than the first. From these patterns, a cell can implement not only basic computational blocks but also, either alone or in combination with another cell, two basic control circuits of asynchronous QDI design: the controlled merge and the controlled split.

The interconnect area penalty resulting from the multiple wires of a dual-rail channel is also partially offset by the omission of a clock tree in an asynchronous FPGA. The logic cells can be programmed to connect to these channels just as they connect to the external interconnect; the new channels are completely local. Since all cells in the FPGA are identical, optimizing the performance of a system implemented on the FPGA requires homogeneous slack matching. One cluster implementing a two-bit full adder with input channels A and B, sum output channel S, and carry channels C.

Microprocessor Execution Unit

Compared to the e1of2 cell, an equivalent e1of4 cell (with the same control logic but twice the datapath) requires 132 SRAM bits and more than five times the area. However, analog-level circuit details can cause the e1of4 cell to operate more slowly than the smaller e1of2 cell. Similarly, the e1of4 interconnect uses approximately 2.1 times the area of a synchronous 2-bit interconnect, regardless of which type of switching network is chosen. One reason is the effect of the various switching-network architectures on the area required by the e1of4 interconnect. The total fanin of a cluster's logic cells and slack buffers is therefore Imax = K·Nc + Nb. Meanwhile, every logic-cell and slack-buffer output can be copied to two different channels in the interconnect.
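The fanin formula can be checked with simple arithmetic; the example values below are illustrative, with K = 3 matching the three-input-channel cell limit mentioned earlier.

```python
# Total fanin available to a cluster's switching network: each of the Nc
# logic cells contributes K input channels and each of the Nb slack buffers
# contributes one, so Imax = K*Nc + Nb.

def cluster_fanin(k, n_cells, n_buffers):
    return k * n_cells + n_buffers

# e.g. 4 cells with 3 input channels each, plus 4 slack buffers:
assert cluster_fanin(3, 4, 4) == 16
```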
While its early stages can be applied to convert both software and hardware algorithms into concurrent systems, we have also introduced techniques in DDD that make it the first high-level synthesis method to target the pipeline stages used in high-performance asynchronous VLSI systems. We have presented parameterizable circuit templates for precharged half-buffers (PCHBs), the family of asynchronous pipeline stages used in most high-performance asynchronous VLSI systems, and we have shown that every process in the decomposed system generated by DDD can be implemented directly as a PCHB. The DDD tool can be combined with area and performance estimation tools for FPGAs to explore all possibilities, including more adventurous interconnect schemes that exploit delay insensitivity and the flexibility of asynchronous design. Although communication primitives can be used to communicate with other processes, they are not necessary for transferring information within a single process. When the CHP program reaches this structure, it halts until one of the guards becomes true. In a deterministic selection, at most one of the guard expressions g1, g2, and g3 can be true at any given moment.
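The deterministic-selection semantics described here can be modeled in a few lines; the representation of guards and branches as lambdas is an assumption of this sketch, and blocking on false guards is modeled simply by returning `None`.

```python
# Sketch of deterministic-selection semantics: execute the branch whose
# guard is true; determinism means at most one guard may hold at a time.
# Waiting is modeled by returning None when no guard is currently true.

def select(guards_and_actions, state):
    true_branches = [act for g, act in guards_and_actions if g(state)]
    assert len(true_branches) <= 1, "guards must be mutually exclusive"
    return true_branches[0](state) if true_branches else None

res = select([(lambda s: s["x"] > 0, lambda s: "pos"),
              (lambda s: s["x"] < 0, lambda s: "neg")], {"x": 5})
assert res == "pos"
```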
Critical Cycles
Dynamic Slack
Clustering Heuristic
Summary
Decomposition and System Transformations
Logic Cells
Reconfigurable PCHB Circuit
Communication Patterns
Performance and Area
Cluster Design
Mapping Examples
Full Adder and ALU
Architectural Models and Interconnect
Logic Cell Comparisons
Interconnect
Parameterized Cluster Design
Summary
Future Work
Basic Statements