5.4 Clustering Heuristic

5.4.1 Preliminaries

For clustering, DDD expresses the decomposed system as a graph G = ⟨V, E⟩. Each edge e ∈ E represents a communication channel from a source src(e) ∈ V to a sink snk(e) ∈ V. Each vertex v ∈ V either represents a PCHB or serves as a dummy source for edges representing system input channels or a dummy sink for edges representing system output channels. The number of data rails in a channel is width(e). Paths, path sources, and path sinks are defined as for FBI graphs in Section 5.3.1.1. The length N_P of a path P is the number of vertices traversed by the path, src(P) and snk(P) inclusive. The designer may place latency constraints upon the system during clustering, requiring that the longest forward-latency path between a specific system input channel and system output channel cannot exceed δ transitions.
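As a minimal sketch of this representation (not DDD's actual data structures), an edge can be modelled as a record holding src(e), snk(e), and width(e), and N_P can be computed from a list of consecutive edges. The names `Edge` and `path_length` are hypothetical, and the sketch assumes a path contains at least one edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str    # src(e): vertex that produces on the channel
    snk: str    # snk(e): vertex that consumes from the channel
    width: int  # width(e): number of data rails in the channel


def path_length(path):
    """Return N_P, the number of vertices on a path, endpoints inclusive.

    `path` is a list of consecutive edges (assumption: at least one edge).
    """
    if not path:
        raise ValueError("a path needs at least one edge")
    for prev, nxt in zip(path, path[1:]):
        if prev.snk != nxt.src:
            raise ValueError("edges are not consecutive")
    return len(path) + 1


# Hypothetical path: dummy source -> P1 -> P2 -> dummy sink
example_path = [Edge("SRC", "P1", 2), Edge("P1", "P2", 4), Edge("P2", "SNK", 2)]
example_length = path_length(example_path)
```

Here the three edges visit four vertices, so `example_length` counts both the dummy source and the dummy sink.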

Rings of vertices in G can be one of three types generated by DDD. The first type of ring includes DSA variables that were used before being defined in an iteration of the DSA sequential program.

The second type of ring includes processes with internal state bits that represent external system channels used multiple times within the original sequential program. The third type of ring includes processes for complex functions that are used multiple times in the sequential program and have been isolated from the rest of the system.

Prior to slack matching, DDD eliminates these rings and transforms G into an acyclic graph G′ = ⟨V′, E′⟩. Each ring of PCHBs is removed by selecting an edge in the ring and changing the edge's source from the original vertex to a new dummy source with the same FBI values as the original vertex. The result is a straight path, called a ring path, with a length one greater than that of the original cycle. DDD then places a latency constraint on this path, the maximum value of which is determined by using techniques from Section 5.3 to slack match the original ring to meet a specific cycle time. After the clustering phase is complete, DDD restores the ring from the original graph, each edge containing the newly determined number of slack buffers.
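The ring-breaking step can be sketched as follows, using the four-variable DSA ring of Figure 5.16 as input. This is an illustrative sketch only: the helper names `break_ring` and `walk` are hypothetical, the graph is reduced to (source, sink) pairs, and the FBI values that the dummy source inherits are not modelled.

```python
def break_ring(edges, cut):
    """Re-source the edge `cut` from a new dummy vertex.

    `edges` is a list of (src, snk) pairs and `cut` is one of them.
    Returns the modified edge list and the name of the dummy source.
    """
    if cut not in edges:
        raise ValueError("cut edge is not in the graph")
    src, snk = cut
    dummy = src + "'"  # dummy source standing in for the original vertex
    new_edges = [(dummy, snk) if e == cut else e for e in edges]
    return new_edges, dummy


def walk(edges, start):
    """Follow successors from `start` until no edge leaves the last vertex.

    Assumes every vertex has at most one outgoing edge, as in a simple ring.
    """
    succ = dict(edges)
    path = [start]
    while path[-1] in succ and succ[path[-1]] not in path:
        path.append(succ[path[-1]])
    return path


# Ring Px0 -> Px1 -> Px2 -> Px3 -> Px0, cut on the edge that closes it
ring = [("Px0", "Px1"), ("Px1", "Px2"), ("Px2", "Px3"), ("Px3", "Px0")]
acyclic, dummy = break_ring(ring, ("Px3", "Px0"))
ring_path = walk(acyclic, dummy)
```

The resulting `ring_path` starts at the dummy source and ends at the original vertex, so it has one more vertex than the ring it replaces.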

As illustrated in Figure 5.16, DSA-variable rings are broken between the vertices for x0 and xn, where n is the maximum DSA index for x. The source of the edge between these two vertices is changed from process P_xn to a new dummy source process P′_xn. For rings that include external system input channels used multiple times, the ring is broken immediately before the input channel's vertex. (If an external output channel is used multiple times, rings containing its vertex cannot exist within the system alone.) Finally, in the case of rings that include isolated processes used multiple times, the ring is broken immediately before the process that gathers inputs for the isolated process.

In both of the last two cases, the ring is slack matched for the cycle time N_st · τmax, where the entire ring is traversed only once for every N_st times that its state-holding processes are traversed. This allows the overall system to run at τmax. A scenario with an isolated incrementer that is used twice per sequential program iteration is presented in Figure 5.17.

Figure 5.16: Removing cycles for clustering.

The ring in the PCHB graph on the left implements a chain of DSA variables x0 ... x3, where the value of x3 is assigned to x0 at the beginning of the DSA program. The ring is transformed into an acyclic path with a dummy source that has the same FBI values as the variable with the maximum DSA index, as shown in the graph on the right. A latency constraint derived from slack matching the ring to run at τmax can now be placed on the path between vertices P′_x3 and P_x3.

Figure 5.17: Incrementer example of removing cycles for clustering.

The ring in the PCHB graph on the left implements an incrementer process INC that has been isolated but is used twice per iteration of the original sequential program. The processes ARG and SUM each contain one state bit implementing two states, and collect incrementer inputs and distribute incrementer outputs, respectively. The ring is transformed into an acyclic path with a dummy source that has the same FBI values as SUM, as shown in the graph on the right. A latency constraint derived from slack matching the ring to run at 2τmax (because the entire ring is traversed only once for every two times the vertices ARG, INC, and SUM are traversed) can now be placed on the path between vertices SUM′ and SUM.

Once the acyclic graph G′ has been created, if any of the latency constraints (either user-specified or ring-derived) are exceeded, then processes must be clustered in series until they are met. DDD uses a greedy heuristic to choose which edges should be removed by clustering in series. Processes should not be clustered if the new internal cycles τi or handshake cycles τh generated are greater than τmax. If it is not possible to reduce path lengths to meet latency constraints without increasing the maximum cycle time of the system, then either the latency constraints or the target cycle time must be relaxed.

Figure 5.18: Clustering schedule.

The original graph (left) can be initially configured using an ASAP (as soon as possible) schedule (right).
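The test that triggers clustering, comparing the longest path between an input and an output against a latency bound, might be sketched as below. This does not reproduce DDD's greedy heuristic, which is not detailed here. The function `longest_latency`, the per-vertex `latency` table, and all numbers are assumptions made for illustration, and the graph is assumed to be acyclic.

```python
from functools import lru_cache


def longest_latency(edges, latency, src, snk):
    """Largest summed latency over all src-to-snk paths of an acyclic graph.

    `latency[v]` is an assumed forward latency, in transitions, for vertex v.
    Returns None when snk cannot be reached from src.
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def best(v):
        if v == snk:
            return latency[v]
        tails = [t for t in (best(w) for w in succ.get(v, [])) if t is not None]
        return latency[v] + max(tails) if tails else None

    return best(src)


# Hypothetical system: input A? feeds P1, which reaches output X! directly
# and also through P2. Dummy vertices contribute no latency.
edges = [("A?", "P1"), ("P1", "P2"), ("P2", "X!"), ("P1", "X!")]
latency = {"A?": 0, "P1": 2, "P2": 2, "X!": 0}
delta = 3  # illustrative latency constraint in transitions

worst = longest_latency(edges, latency, "A?", "X!")
needs_clustering = worst > delta
```

With these illustrative numbers the path through P2 exceeds the bound, which is the situation in which processes on that path would have to be clustered in series or the constraint relaxed.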

After generating a homogeneous acyclic graph G′, DDD creates a table of L_max columns, where L_max is the length of the longest path in G′. Every vertex is assigned to a column such that for all v ∈ V, 0 ≤ col(v) < L_max. For the column assignments to be legal, the following order constraint must be fulfilled:

∀e ∈ E′ : col(snk(e)) > col(src(e))    (5.28)

Now, the span of an edge e ∈ E′ is defined as follows:

span(e) = col(snk(e)) − col(src(e)) − 1    (5.29)

We also define a schedule of G′ to be a set of column assignments to the vertices of V′. An example of an initial schedule is given in Figure 5.18.
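To make the definitions concrete, the sketch below builds an ASAP schedule for a small hypothetical acyclic graph, checks the order constraint (5.28), and evaluates the span (5.29) of every edge. The function names and the example graph are assumptions for illustration. ASAP is taken here to mean that each vertex sits in the earliest column that follows all of its predecessors.

```python
def asap_schedule(edges):
    """Assign each vertex the earliest column that follows all predecessors.

    `edges` is a list of (src, snk) pairs of an acyclic graph.
    """
    preds, verts = {}, set()
    for u, v in edges:
        verts.update((u, v))
        preds.setdefault(v, []).append(u)

    col = {}

    def place(v):
        if v not in col:
            # Vertices without predecessors land in column 0.
            col[v] = 1 + max((place(u) for u in preds.get(v, [])), default=-1)
        return col[v]

    for v in verts:
        place(v)
    return col


def is_legal(edges, col):
    """Order constraint (5.28): every sink lies right of its source."""
    return all(col[v] > col[u] for u, v in edges)


def span(edge, col):
    """Span (5.29): number of columns strictly between source and sink."""
    u, v = edge
    return col[v] - col[u] - 1


edges = [("SRC", "P1"), ("P1", "P2"), ("P2", "SNK"), ("P1", "SNK")]
col = asap_schedule(edges)
l_max = max(col.values()) + 1  # columns used; equals the longest path length
spans = {e: span(e, col) for e in edges}
```

In this example only the edge from P1 to SNK skips a column, so it is the only edge with a nonzero span.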

Figure 5.19: Splitting copy processes that exceed maximum fanout for clustering.
