Logic Parameters - A Programming Language for Parallel Graph Algorithms

quency).

• Run the tool chain: synthesis, map, place and route.

• Read log files which report resource usage, whether the desired clock frequency was met, and the achieved clock frequency.

The GRAPAL compiler’s lowest-level component measurement procedure inputs target frequency, prints out logic modules as VHDL, calls the tool chain commands, then parses log files to return resource and frequency outcomes. All higher level algorithms should not have to specify the clock frequency; instead the called procedure should return a good clock frequency. The primary component measurement procedure inputs only the component, and outputs resource usage along with achieved clock frequency. It does this by searching over target frequency with multiple lower-level passes. Since FPGA report the achieved frequency, whether it is less than or greater than the target frequency, each iteration of this search can use the previous iteration’s achieved frequency. The first iteration uses a target of100 MHz, which is usually within a factor of 2 of the final frequency. By default two iterations are performed, which is usually enough to bring the final frequency within5%of optimal.

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6

0 0.2 0.4 0.6 0.8 1

Runtime / Runtime of Default

A_net / A_all

Figure 8.1: The mean normalized runtime of all applications and graphs is plotted for each choice of A_net. On the y-axis runtime is normalized to the runtime of the default choice whereA_net =A_all/2and on the x-axisA_net is relative toA_all.

proximately proportional to area devoted to PEs,A_{P Es}. Time spent computing, assuming a good load balance, is approximately inversely proportional toN_{P Es}. So doublingA_{P Es} speeds up the computing portion of total time by approximately2. Total switch bandwidth is approximately proportional to area devoted to interconnect, A_net. Time spent routing messages is approximately inversely proportional to total switch bandwidth. So doubling A_net speeds up the communication portion of total time by approximately2.

We evaluate the choice ofA_net =A_all/2by runningLogicParameterChooseron a range of targetF_net =A_net/A_all ratios. Figure 8.1 shows the mean normalized runtime of all applications and graphs for the rangeF_net ∈ [0,1]. For these graphs the choice of A_net =A_all/2is within1%of the optimal choice ofA_net = 3/8A_all. Runtime for each run is normalized to the runtime of the run withA_net =A_all/2, then the mean is taken over all runs with a particularA_netvalue. Figure 8.1 shows that runtime only varies forF_nets in the range[1/4,1/2]. W_{f lit} is lower-bounded by aW_{f lit}^(min)soF_net cannot be decreased beyond 1/4. W_{f lit} is upper-bounded by a W_{f lit}^(max) so there is no point to increasingF_net beyond 1/2.

LogicParameterChooser =

binary search to maximize W_{f lit} in the range [W_{f lit}^min, W_{f lit}^max]

given constraint Anet(Wf lit) ≤ Aall / 2 return W_{f lit} and N_{P Es}⁽ⁱ⁾ (W_{f lit}) for each i // Sum interconnect area over all FPGAs.

// Each FPGA’s interconnect area is a function of both W_{f lit} and N_{P Es}⁽ⁱ⁾ . Anet(Wf lit) = P4

i=1 A⁽ⁱ⁾_net(Wf lit, N_{P Es}⁽ⁱ⁾ (Wf lit))

// Compute number of PEs on FPGA i as a function of Wf lit. N_{P Es}⁽ⁱ⁾ (Wf lit) =

binary search to maximize N_{P Es}⁽ⁱ⁾ in the range [1, ∞]

given constraint AP E × N_{P Es}⁽ⁱ⁾ + A⁽ⁱ⁾_net(Wf lit, N_{P Es}⁽ⁱ⁾ ) ≤ A⁽ⁱ⁾_all

Figure 8.2: LogicParameterChooser Algorithm. The outer loop maximizes W_{f lit}. With a fixedW_{f lit}, the inner loop can maximize the PE count for each FPGA,N_{P Es}⁽ⁱ⁾ , with the constraint that PEs and interconnect fit in FPGA resources. Resources used for interconnect,A⁽ⁱ⁾_net(W_{f lit}, N_{P Es}⁽ⁱ⁾ ), can only be calculated when bothW_{f lit}andN_{P Es}⁽ⁱ⁾ are known.

The master FPGA has fewer available resources than each of the three slave FPGAs since it has the MicroBlaze processor and a communication interface to the host PC (Fig- ure 5.2). In order to handle multiple FPGAs with different resources amounts the parameter choice algorithm sums resources over all chips in the target platform. A_net is logic-pairs used for interconnect on all FPGAs, and AP Es is logic-pairs used for all PEs.

A_all is the total number of logic-pairs available for GRAPAL application logic across all FPGA. The entire interconnect, including each on-chip network, uses a single flit width, so W_{f lit} is global across FPGAs. On-chip PE count can be unique to each FPGA so LogicParameterChoosermust chooseN_{P Es}⁽ⁱ⁾ for FPGAi, soNP Es =P4

i=1N_{P Es}⁽ⁱ⁾ . Figure 8.2 shows the LogicParameterChooseralgorithm. This algorithm must maximizeN_{P Es}⁽ⁱ⁾ for each FPGA along with maximizingW_{f lit}soA_net ≈A_{P Es}. It must also satisfy the constraint that resources on each FPGA are not exceeded: A⁽ⁱ⁾_{P Es}+A⁽ⁱ⁾_net ≤A⁽ⁱ⁾_all. The only variableA⁽ⁱ⁾_{P Es}depends on isN_{P Es}⁽ⁱ⁾ . A⁽ⁱ⁾_net depends onN_{P Es}⁽ⁱ⁾ for the switch count and network topology andW_{f lit} for the size of each switch. An outer binary search finds W_{f lit}and an inner binary search findsN_{P Es}⁽ⁱ⁾ givenW_{f lit}. Binary searches are adequate for finding optimal logic parameter values sinceA⁽ⁱ⁾_{P Es}andA⁽ⁱ⁾_netare monotonic functions of the

logic parameters. In FPGA i, the maximumN_{P Es}⁽ⁱ⁾ is codependent with the maximumW_{f lit} andW_{f lit}is common across all FPGA so it is simplest to findW_{f lit} first, in the outer loop.

Each binary search first finds an upper bound by starting with a base values and doubling it until constraints are violated. It then performs divide and conquer to find the maxi- mum feasible value. The range of the binary search overW_{f lit}is[W_{f lit}^(min), W_{f lit}^(max)]. W_{f lit}^(min) is the message address width, used to route a message to a PE, which must be contained in the first flit of every message.W_{f lit}^(max)is the minimum of the width of inter-FPGA channels and the width of a message. Each flit must be routable over inter-FPGA channels, and there is no reason to make flits larger than a message.

LogicParameterChoosermust estimate resources used by a single PE,A_{P E}, and estimate resources used by the interconnect, A⁽ⁱ⁾_net as a function of N_{P Es}⁽ⁱ⁾ andW_{f lit}. Each PE inputs and outputs messages, so some of its datapath widths depend on width of the address used to route messages to PEs. A_{P E} depends on N_{P Es}, which we are trying to compute as a function of A_{P E}. LogicParameterChooser places a lower bound on PE resources by using an address width of0, then places an upper bound on PE resources by using an address with large enough to address a design with as many lower-bound PEs as can fit on the FPGAs. To be conservative, the upper bound is used asA_{P E}. The effect of this estimation is minimal since the address width for this upper bound is only1or2bits greater than actual width for all benchmark applications. Since PE logic depends on the GRAPAL application, resource usage for PE with a concrete address width must be mea- sured by the procedure wrapping the FPGA tool chain, ComponentResources. Calls to the FPGA tool chain are expensive, so the two measurements forA_{P E} are performed by ComponentResourcesbefore the binary search loops.

A⁽ⁱ⁾_netis estimated as the sum of resources used by all switches in the on-chip network.

The logic for each switch is a function of bothW_{f lit}and address width,W_addr. The address width used is the same slight overestimate as that used byA_{P E}. Switch resources must be recalculated for each iteration of the outer binary search loop overW_{f lit}. To avoid a call to the expensive FPGA tool chain in each iteration, a piecewise-linear model for switch resources is used. A linear model provides a good approximation since all subcomponents of switch logic which depend onW_addrorW_{f lit}are linear structures. At compiler installation-

time, before compile-time,ComponentResourcesis run for each switch crossed with a range of powers of two ofW_addr andW_{f lit}. All powers of two are included up until the switch exceeds FPGA resources. TheseComponentResourcesmeasurements provide the vertices of the piecewise-linear model. The interpolation performed to approximate resources used by a switch,A_sw(W_addr, W_{f lit}), is:

Asw(Waddr, Wf lit) =Abase+Aaddr+Af lit

where:

A_base = A_sw(bW_addrc₂,bW_{f lit}c₂) A_addr = W_addr− bW_addrc₂

dW_addre₂− bW_addrc₂ [A_sw(dW_addre₂,bW_{f lit}c₂)−A_base] A_{f lit} = Wf lit− bWf litc2

dW_{f lit}e₂− bW_{f lit}c₂ [A_sw(bW_addrc₂,dW_{f lit}e₂)−A_base] bc₂ andde₂round down and up, respectively, to the nearest power of2.

Dalam dokumen A Programming Language for Parallel Graph Algorithms (Halaman 124-128)