PEs are composed with network switches to form the full GRAPAL application logic. The GRAPAL application logic is then composed with inter-FPGA channel logic on all FPGAs, and with the MicroBlaze processor and host-communication logic on the master FPGA. Resources used by the full design may differ slightly from the sum over components, and the achievable clock frequency may differ from the minimum clock frequency over components. The FullDesignFit algorithm therefore refines logic parameters, memory parameters, and frequency until a legal design is produced.
The first iteration of FullDesignFit compiles the full design with the parameters chosen by LogicParameterChooser and MemoryParameterChooser. These are the number of PEs on each FPGA, N_PEs(i), the flit width of the interconnect, W_flit, the PE node-memory depth, D_nodes, and the PE edge-memory depth, D_edges. It also uses the minimum clock frequency over the logic components chosen by LogicParameterChooser. Each iteration runs the FPGA tool chain separately for each FPGA with the current parameters. For each FPGA, the FPGA tool chain wrapper may return success, or failure along with the reason for failure. If success is returned for all FPGAs then FullDesignFit is finished and the HARDWARE compilation stage is complete (Figure 5.1). The reason for failure may be that excessive logic-pairs were used, excessive BlockRAMs were used, or the requested clock frequency was too high. If excessive logic-pairs were used on FPGA i then N_PEs(i) is multiplied by 0.9. Also, MemoryParameterChooser is run again to get a suitable D_nodes and D_edges for the new N_PEs(i). If the design fits in logic on all FPGAs but used excessive BlockRAMs, then both D_nodes and D_edges are multiplied by 0.9. If the design fit in logic and BlockRAMs on all FPGAs but the requested clock frequency was not met, then the lower feasible clock frequency found by the FPGA router tool is used for the next iteration. On the other hand, if the design fit and the requested clock frequency was less than 0.9 times the feasible clock frequency, then the higher feasible clock frequency is used for the next iteration.
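The refinement loop described above can be sketched as follows. This is a minimal illustration, not the compiler's actual code: the function names run_toolchain and choose_memory_params, and the result object's status and feasible_freq fields, are hypothetical stand-ins for the FPGA tool chain wrapper and MemoryParameterChooser.

```python
def full_design_fit(n_pes, d_nodes, d_edges, freq,
                    choose_memory_params, run_toolchain, num_fpgas=4):
    """Refine per-FPGA PE counts, memory depths, and clock frequency
    until the tool chain reports a legal design on every FPGA.

    n_pes is a list of PE counts, one per FPGA (N_PEs(i) in the text);
    run_toolchain(i, ...) returns an object with .status in
    {"success", "excess_logic", "excess_bram", "freq_not_met"}
    and, where applicable, .feasible_freq from the router."""
    while True:
        results = [run_toolchain(i, n_pes[i], d_nodes, d_edges, freq)
                   for i in range(num_fpgas)]
        if all(r.status == "success" for r in results):
            feasible = min(r.feasible_freq for r in results)
            if freq < 0.9 * feasible:
                # Design fits with timing slack: retry at the higher
                # feasible frequency rather than accepting a slow clock.
                freq = feasible
                continue
            return n_pes, d_nodes, d_edges, freq
        for i, r in enumerate(results):
            if r.status == "excess_logic":
                # Too many logic-pairs on FPGA i: shrink its PE count
                # by 10% and re-derive memory depths for the new count.
                n_pes[i] = int(n_pes[i] * 0.9)
                d_nodes, d_edges = choose_memory_params(n_pes)
            elif r.status == "excess_bram":
                # Logic fits but BlockRAMs are over budget:
                # shrink both memory depths by 10%.
                d_nodes = int(d_nodes * 0.9)
                d_edges = int(d_edges * 0.9)
            elif r.status == "freq_not_met":
                # Fits in logic and BlockRAMs, but timing failed:
                # fall back to the router's feasible frequency.
                freq = min(freq, r.feasible_freq)
```

Each iteration of this loop corresponds to one round of per-FPGA tool chain runs, so in the common case where the first iteration succeeds the loop body executes exactly once.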
Compilation of each full design for each FPGA through the FPGA tool chain typically takes 1 hour, but can take up to 4 hours on a 3 GHz Intel Xeon. Further, one design must be compiled for each of the four FPGAs on the BEE3 platform. Full design compilation therefore dominates the runtime of the compiler, and it is critical to minimize the number of iterations of FullDesignFit.
For all applications the first iteration of FullDesignFit is successful. This means that the models used by LogicParameterChooser and MemoryParameterChooser do not underestimate resource usage. Across applications, logic-pair utilization on the FPGAs is 95% to 97%, and BlockRAM utilization is 89% to 94%. So our models overestimate resource usage by at most 5% for logic-pairs and 11% for BlockRAMs.
Chapter 9 Conclusion
9.1 Lessons
9.1.1 Importance of Runtime Optimizations
Node decomposition to enable load balancing is the most important runtime optimization we explored. It allows us to fit larger graphs in our architecture of many small PEs than would otherwise fit. This allows us to always allocate as many PEs as possible, so our compiler does not have to trade off between the throughput of parallel operations and memory capacity for large nodes. Further, it provided a mean 2.6x speedup by load-balancing graphs that fit without decomposition.
Placement for locality mattered less than we expected. It only gave a 1.3x speedup for our benchmark applications and graphs. However, we expect placement for locality to provide greater speedups for larger systems with many more than four FPGAs.
9.1.2 Complex Algorithms in GRAPAL
We found that for the simple algorithms Bellman-Ford and ConceptNet, GRAPAL gives large per-chip speedups of up to 7x and very large energy reductions of 10x to 100x. Our implementations in GRAPAL of the complex algorithms Spatial Router and Push-Relabel did not yield per-chip speedups, though Spatial Router does reduce energy cost by 3x to 30x. We have yet to see whether GRAPAL implementations of Spatial Router and Push-Relabel can be further optimized to increase performance (Section 9.2.2).
9.1.3 Implementing the Compiler
Our compiler bridges a large gap between GRAPAL source programs and fully functioning FPGA logic. The core of the compiler that checks the source program and transforms it into HDL operators (Section 5.1) is an important part of the design of GRAPAL but is a relatively small part of the implementation effort. Implementing the parameterized logic architecture (Section 5.2) of PEs and switches was more complex. Our customized library for constructing HDL operators helped us manage the complexity of a dataflow-style architecture that is parameterized by the GRAPAL program. Managing BEE3 platform-level details was a large part of the implementation work: implementing inter-chip communication channels, implementing the host-PC-to-MicroBlaze communication channel, and supporting the sequential C program with the C standard library on the MicroBlaze processor were difficult pieces of work that we customized for the specific platform. For the compiler to support a different FPGA platform, many of these low-level customizations would need to be changed. Another difficult piece of the compiler is wrapping the FPGA tools with a simple interface. An API to the FPGA tools at the HDL level would save much effort in reformatting and semantics-discovery for anyone who wants to wrap FPGA tools into an automated, closed loop.