Reconfigurable PCHB Circuit

7.2 Logic Cells

7.2.1 Reconfigurable PCHB Circuit

The basic logic cell largely follows the PCHB template given in Chapter 3 and shown in Figure 3.7.

There is a computation stage that consists of a pulldown network of n-transistors and a simple pullup network of p-transistors. Input and output channel validity trees are also present, along with completion logic to generate input enable signals. Again, these components must account for data dependencies when conditional communications are programmed.

Figure 7.3: Logic cell for the asynchronous FPGA. The cell includes three dual-rail input channels (A, B, and C) and one dual-rail output channel (Z). Dashed lines indicate channels on which communications may be conditional.

We limit the cell to one output channel, as multiple output channels would require multiple separate precharged computation stages that could go unused. (Area is an important factor when designing FPGA cells since they are repeated across the chip.) Instead, we group single-output logic cells inside clusters where they can share input channels. This introduces some redundancy in validity circuitry when multiple outputs are computed from the same inputs, but leaves no unusable circuitry when there is no sharing (as is often the case). We limit the cell to three input channels because with four input channels, the pulldown networks of the computation logic grow too long for high performance.

For our prototype design, we choose dual-rail encoding since its binary nature keeps the number of SRAM cells required to program the compute function low. This encoding also allows us to implement inverting functions in the routing network by simply swapping the connections to the two data rails, instead of using up an entire logic cell to perform this task. Other channel encoding choices are discussed later in Section 7.5.2. The basic circuit for the cell is illustrated in Figures 7.3 to 7.6. (These are not an exact representation of the circuit: details such as the reset circuitry and staticizers for each of the C-elements have been omitted.)
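The rail-swapping inversion can be illustrated with a small behavioural sketch. This is not part of the original design description: the names `encode`, `decode`, and `swap_rails` are hypothetical, and the model assumes the usual dual-rail convention in which rail 0 high means logic 0, rail 1 high means logic 1, and both rails low means the channel is neutral.

```python
from typing import Optional, Tuple

# A dual-rail value is modelled as (rail0, rail1).
# Assumption: exactly one rail is high for valid data, both low when neutral.
DualRail = Tuple[int, int]
NEUTRAL: DualRail = (0, 0)


def encode(bit: int) -> DualRail:
    """Encode a Boolean as dual-rail: 0 raises rail 0, 1 raises rail 1."""
    return (0, 1) if bit else (1, 0)


def decode(value: DualRail) -> Optional[int]:
    """Return the encoded bit, or None if the channel is neutral."""
    if value == NEUTRAL:
        return None
    if value == (1, 0):
        return 0
    if value == (0, 1):
        return 1
    raise ValueError("both rails high is not a legal dual-rail code")


def swap_rails(value: DualRail) -> DualRail:
    """Model inversion in the routing network by crossing the two data wires."""
    rail0, rail1 = value
    return (rail1, rail0)


# Swapping the rails inverts valid data and leaves the neutral state unchanged.
inverted_zero = decode(swap_rails(encode(0)))
inverted_one = decode(swap_rails(encode(1)))
```

Because the neutral state maps to itself, the swap does not disturb the handshake: only the interpretation of the data rails changes, which is why no logic cell is needed for inversion.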

Figure 7.4: Reconfigurable cell: computation circuitry. (The pulldown network has 3-4 n-transistors in series.)

Figure 7.5: Reconfigurable cell. (a) Input and output validity generation. (b) Circuitry for conditional communications.

Figure 7.6: Reconfigurable cell: completion circuitry.

Figure 7.7: Pulldown networks: chosen configuration. SRAM cells placed next to the outputs in a sample pulldown network for programmable PCHB with unconditional communications. (48 transistors, no charge-sharing problems.)

There are several deviations from the original PCHB template. One difference is that instead of using both the output channel enable Z.e and the local enable signal en in the pullup network and foot transistors of the precharged computation stage, we combine the output enable with internally generated signals via a C-element into a single new signal, "go." This reduces the number of transistors in series in the pulldown network, which, because of the extra logic required to program the computations, is already large. The size of the computation network also leads to the second deviation: instead of a single foot transistor, multiple feet provide parallel paths to ground. This alleviates otherwise inevitable charge-sharing problems. Finally, the "pre-comp" unit in Figure 7.4 consists of combinational gates that have been extracted from the pulldown network to reduce the number of n-transistors in series. Since the unit depends only on SRAM-cell outputs, it does not add to the forward latency or the cycle time of the PCHB after the cell has been programmed.
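As a reminder of how a C-element merges several signals into one, the following is a minimal behavioural sketch. It is an illustration only: the class name `CElement` and the second input (standing in for the internally generated signals) are hypothetical, signal polarities are simplified, and the actual inputs to the "go" C-element are those of the completion circuitry in Figure 7.6.

```python
class CElement:
    """Behavioural model of a Muller C-element (state-holding gate).

    The output rises when all inputs are high, falls when all inputs are
    low, and otherwise holds its previous value.
    """

    def __init__(self, initial: int = 0) -> None:
        self.output = initial

    def update(self, *inputs: int) -> int:
        if not inputs:
            raise ValueError("a C-element needs at least one input")
        if all(inputs):
            self.output = 1
        elif not any(inputs):
            self.output = 0
        # Mixed inputs: keep the stored value (the staticizer's job in silicon).
        return self.output


# Hypothetical use: first argument plays the role of Z.e, the second
# stands in for the internally generated signals.
go = CElement()
trace = [
    go.update(1, 0),  # inputs disagree, output holds 0
    go.update(1, 1),  # both high, output rises
    go.update(0, 1),  # inputs disagree, output holds 1
    go.update(0, 0),  # both low, output falls
]
```

With the enables merged this way, the precharged stage needs only the single "go" signal on its pullup and foot transistors, which is the source of the series-transistor saving described above.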

Meanwhile, for conditional communications, instead of the general solutions presented in Chapter 3, we create a specific implementation for our cell. This involves the use of non-precharged gates to generate the programmable conditions, as shown in Figure 7.5. The output condition signal skipz is the equivalent of the extra dummy rail presented in earlier sections, and is similarly used in the output validity. The internal signals readb and readc are analogous to useb.1 and usec.1 in the general solution for conditional inputs, while skipb and skipc are analogous to useb.0 and usec.0, respectively.

Figure 7.8: Pulldown networks: alternate configuration. SRAM cells placed next to the foot transistors in a sample pulldown network for programmable PCHB with unconditional communications. (46 transistors, not including extra precharge transistors to alleviate charge-sharing problems.)

Note that in our pulldown computation networks, we chose to place the SRAM transistors at the top (next to the output nodes) rather than at the bottom (next to the foot transistor for go). The two alternate implementations for a three-input logic cell with no conditional communications are shown in Figures 7.7 and 7.8. The second design has large internal nodes and likely requires additional internal precharge p-transistors (not shown in the diagram) to avoid charge-sharing problems. Since, aside from the additional precharge transistors, the two configurations have roughly the same size, we choose the configuration without charge-sharing problems for our basic cell.
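Functionally, either pulldown configuration makes the cell behave like an eight-entry lookup table whose entries are held in the SRAM cells. The sketch below models only that logical behaviour for unconditional communications. It is not the circuit itself: the function names are hypothetical, the output rails are modelled as active-high logical values rather than the precharged nodes _Z.0 and _Z.1, and the index ordering (A as the most significant bit, C as the least) is an assumption read off the layout of Figure 7.7.

```python
from typing import Optional, Sequence, Tuple

DualRail = Tuple[int, int]  # (rail0, rail1); (0, 0) means neutral (no data)
NEUTRAL: DualRail = (0, 0)


def to_bit(value: DualRail) -> Optional[int]:
    """Return 0 or 1 for a valid dual-rail code, or None when neutral."""
    if value == (1, 0):
        return 0
    if value == (0, 1):
        return 1
    if value == NEUTRAL:
        return None
    raise ValueError("both rails high is not a legal dual-rail code")


def evaluate_cell(sram: Sequence[int], a: DualRail, b: DualRail,
                  c: DualRail, go: int = 1) -> DualRail:
    """Logical view of the programmable computation stage.

    The output stays neutral until go is high and all three inputs are
    valid; then the SRAM bit selected by (A, B, C) chooses the output rail.
    """
    if len(sram) != 8:
        raise ValueError("a three-input dual-rail cell needs 8 SRAM bits")
    bits = [to_bit(v) for v in (a, b, c)]
    if not go or None in bits:
        return NEUTRAL
    index = 4 * bits[0] + 2 * bits[1] + bits[2]  # assumed ordering
    return (0, 1) if sram[index] else (1, 0)


# Example programming: three-input XOR (truth table for indices 0..7).
XOR3 = [0, 1, 1, 0, 1, 0, 0, 1]
z = evaluate_cell(XOR3, a=(0, 1), b=(1, 0), c=(0, 1))  # A=1, B=0, C=1
```

In this example the inputs select index 5, whose SRAM bit is 0, so `z` carries a logic 0 on the output channel.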

This decision is rejustified later when we consider e1of4 channels, since even without considering extra precharge transistors, the second configuration uses over 10% more area than the first. In general, if we consider basic cells with K e1ofM input channels, then the computation configuration with SRAM transistors on the top requires

\[
N_{\mathrm{top}}(M, K) = M^{K}\left(M\lceil \log_2 M \rceil + K + 1\right) = M^{K+1}\lceil \log_2 M \rceil + (K+1)\,M^{K}
\]

transistors, while the second implementation, with SRAM next to the foot, requires

\[
N_{\mathrm{bot}}(M, K) = M\left(M^{K}\lceil \log_2 M \rceil + \sum_{i=0}^{K} M^{i}\right) = M^{K+1}\lceil \log_2 M \rceil + \frac{M\left(M^{K+1}-1\right)}{M-1}
\]

transistors.
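The two expressions can be checked numerically against the transistor counts quoted in the captions of Figures 7.7 and 7.8. The sketch below is only an arithmetic check: the function names `n_top` and `n_bot` are ours, and evaluating the e1of4 case at K = 3 is an assumption, since the text does not state the number of inputs used for that comparison.

```python
def ceil_log2(m: int) -> int:
    """Exact integer ceil(log2(m)) for m >= 1."""
    return (m - 1).bit_length()


def n_top(m: int, k: int) -> int:
    """Transistor count with SRAM transistors next to the output nodes."""
    return m**k * (m * ceil_log2(m) + k + 1)


def n_bot(m: int, k: int) -> int:
    """Transistor count with SRAM transistors next to the foot."""
    return m * (m**k * ceil_log2(m) + sum(m**i for i in range(k + 1)))


# Dual-rail (M = 2), three inputs: the counts in Figures 7.7 and 7.8.
dual_rail_top = n_top(2, 3)  # 48
dual_rail_bot = n_bot(2, 3)  # 46

# e1of4 (M = 4), assuming K = 3 inputs.
e1of4_top = n_top(4, 3)  # 768
e1of4_bot = n_bot(4, 3)  # 852
overhead = e1of4_bot / e1of4_top - 1  # about 0.109, i.e. over 10%
```

For dual-rail inputs the two configurations differ by only two transistors, while under the K = 3 assumption the e1of4 case gives 852 against 768 transistors, an overhead of roughly 10.9% for the configuration with SRAM next to the foot.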

In document: High-Level Synthesis and Rapid Prototyping of Asynchronous VLSI Systems (pages 169-173)