Butterfly Fat-Tree - Logic Architecture - A Programming Language for Parallel Graph Algorithms

5.2 Logic Architecture

5.2.2 Interconnect

5.2.2.1 Butterfly Fat-Tree

The Butterfly Fat-Tree (BFT) network connects PEs on the same chip to each other and to the bridge to the inter-chip channels. A BFT with a Rent parameter of P = 0.5 was chosen for its efficient use of two-dimensional FPGA fabric and for its simplicity. A mesh network also uses two-dimensional fabric efficiently, but it has more complex routing algorithms and is more difficult to compose into an inter-FPGA network. To compose a BFT into an inter-FPGA network, the top-level switches are simply connected to other FPGAs.

The P = 0.5 BFT is constructed recursively by connecting four child BFTs with new top-level 4-2 switches. Figure 5.12 shows a two-level BFT. Each 4-2 switch (Figure 5.13) has four bottom channels and two top channels. The address of each message arriving from a bottom channel determines whether the message goes to one of the other three bottom channels or to a top channel. The address of a message arriving from a top channel determines which bottom channel it goes to. The two top channels are chosen in round-robin order, with a preference for uncongested channels.
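The routing decision described above can be sketched as a small model. This is a minimal illustration, not the thesis's hardware design: the class name `Switch42`, the base-4 addressing of PEs (one base-4 digit per tree level), and the exact tie-breaking in `pick_top` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Switch42:
    """Hypothetical software model of one 4-2 switch.

    Assumption: PEs are numbered so that a switch at `level` sits above a
    subtree of 4**level consecutive PE addresses.
    """
    level: int        # 1 = switches directly above the PEs
    subtree_id: int   # which 4**level-PE subtree this switch serves
    congested: list = field(default_factory=lambda: [False, False])
    next_top: int = 0  # round-robin pointer over the two top channels

    def covers(self, dest: int) -> bool:
        """True if the destination PE lies below this switch."""
        return dest // 4 ** self.level == self.subtree_id

    def bottom_channel(self, dest: int) -> int:
        """Bottom channel (0..3) leading toward the destination PE."""
        return (dest // 4 ** (self.level - 1)) % 4

    def pick_top(self) -> int:
        """Round-robin over top channels, preferring an uncongested one."""
        first, second = self.next_top, 1 - self.next_top
        if self.congested[first] and not self.congested[second]:
            choice = second
        else:
            choice = first
        self.next_top = 1 - choice
        return choice

    def route_from_bottom(self, dest: int):
        """A message from below goes down if local, otherwise up."""
        if self.covers(dest):
            return ("bottom", self.bottom_channel(dest))
        return ("top", self.pick_top())

    def route_from_top(self, dest: int):
        """A message from above always goes to a bottom channel."""
        return ("bottom", self.bottom_channel(dest))
```

For example, a level-1 switch serving PEs 0 to 3 sends a message addressed to PE 2 down bottom channel 2, while a message addressed to PE 9 leaves through one of the two top channels.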

In general, BFTs are parameterized by their Rent parameter P, with 0 ≤ P ≤ 1. P determines the bandwidth between sections of the BFT: if a subtree has n PEs, then the bandwidth out of the subtree is proportional to n^P. With P = 0.5 the bandwidth out of a subtree is proportional to n^(1/2), which is the maximum bandwidth afforded by a perimeter of size n^(1/2) around a region of n PEs on a two-dimensional fabric. This match to two-dimensional fabric makes the P = 0.5 BFT area-universal: network bandwidth out of a subregion per unit area is asymptotically optimal.
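The n^P relationship can be checked numerically for the recursive construction. The sketch below assumes that each new top-level 4-2 switch takes one upward channel from each of the four child BFTs, so every added level quadruples the PE count and doubles the number of upward channels; the function names are illustrative only.

```python
import math


def subtree_pes(level: int) -> int:
    """PEs in a subtree: four children per level (assumed construction)."""
    return 4 ** level


def subtree_up_channels(level: int) -> int:
    """Upward channels: each 4-2 switch turns 4 bottom channels into 2 top."""
    return 2 ** level


def rent_bandwidth(n_pes: int, p: float = 0.5) -> float:
    """Bandwidth out of an n-PE subtree, up to a constant factor: n**p."""
    return n_pes ** p


for level in range(6):
    n = subtree_pes(level)
    # With P = 0.5 the channel count equals sqrt(n) exactly.
    assert subtree_up_channels(level) == math.isqrt(n)
    print(level, n, subtree_up_channels(level), rent_bandwidth(n))
```

Under this assumption a 256-PE subtree (level 4) has 16 upward channels, which is 256^(1/2).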

Figure 5.13: Each P = 0.5 Butterfly Fat-Tree node is a 4-2 switch. The 4-2 switch has two T-switches and two Π-switches. The T-switch has three 2-to-1 merges and three 1-to-2 splitters, where each splitter directs a message based on its address. The Π-switch has two 2-to-1 merges, two 1-to-2 splitters, two 3-to-1 merges, and two 1-to-3 splitters. Each of the 1-to-3 splitters decides, based on the message address, whether to route a message in one of the two up directions or down, and it routes messages going up to the least congested switch. All splitters in both T- and Π-switches have buffers on their outputs to prevent congestion in one output direction from blocking messages going in other directions.

Chapter 6 Performance Model

We developed a performance model for GraphStep to inform decisions about the logic architecture and runtime optimizations. In general, the performance model could be used to estimate performance for a new platform to see if it is likely to be worth the implementation effort to target that platform. The performance model helps us understand where the throughput bottlenecks are and which components of the critical path are most significant.

Section 7.1 shows which components of the critical path are important to optimize, and what the effect of each optimization is on each component rather than just on total runtime.

Section 7.3 explores the benefit of alternative synchronization styles using the performance model. Since it is difficult to instrument FPGA logic, it is especially important to have a good performance model when targeting FPGAs. We use our GRAPAL implementation on the BEE3 platform to supply concrete values for modeled times. Section 6.2 discusses the accuracy of the performance model when compared to measured application runtimes on the BEE3 platform. The mean error of the runtime predicted by the model, relative to actual runtime, is 11%.

The GraphStep performance model approximates total runtime by summing time over graph-steps. The time to perform each graph-step (T_step) is a function of communication and computation operation latency along with throughput costs of communication and computation operations on hardware resources. Components of the time are latencies (L∗), throughput-limited times (R∗), and composite times that include both L∗ components and R∗ components. Our performance model can be thought of as an elaboration of BSP [5] (Section 2.4.3) for GraphStep. Like BSP, the time of a step is a function of latency and throughput components, and the significant pieces are computation work on processors (w in BSP), network load (h in BSP), network bandwidth (g in BSP), and barrier synchronization latency (l in BSP).
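For comparison, the standard BSP cost of one superstep combines these four quantities as w + h·g + l. The sketch below shows that textbook formula only; it is not the GraphStep model itself, and the function name and sample numbers are made up for illustration.

```python
def bsp_superstep_time(work, sent, received, g, l):
    """Textbook BSP superstep cost: w + h*g + l.

    work:      per-processor local computation time
    sent:      per-processor count of messages sent
    received:  per-processor count of messages received
    g:         time per message (inverse of network bandwidth)
    l:         barrier synchronization latency
    """
    w = max(work)                                       # slowest processor
    h = max(max(s, r) for s, r in zip(sent, received))  # h-relation
    return w + h * g + l


# Hypothetical two-processor superstep, times in cycles.
print(bsp_superstep_time(work=[100, 120], sent=[10, 4],
                         received=[6, 12], g=2, l=50))
```

With these sample values w = 120 and h = 12, so the superstep costs 120 + 12·2 + 50 = 194 cycles.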

6.1 Model Definition

Graph-step time is the sum of the time taken by the four subphases: global broadcast, node iteration, dataflow activity, and global reduce (Section 5.2). Figure 6.1 illustrates the work done by the four subphases.

T_step = L_bcast + L_nodes + T_{dataflow} + L_reduce

1. L_bcast is the latency of the global broadcast from the sequential controller to all PEs (Section 6.1.1).

2. L_nodes is the time taken to iterate over the logical nodes assigned to each PE to initiate node firings (Section 6.1.2). Each PE starts iterating once it receives the global broadcast. Iteration takes one cycle for each node stored at the PE, including nodes that are not fired initially. In the example (Figure 6.1) there are two PEs, each of which initiates activity in its fanin-tree leaf nodes.

3. T_{dataflow} is the time taken by the dataflow-synchronized message passing and operation firing activity that was initiated by the node iteration (Section 6.1.3). This includes the time taken by node and edge operations and by messages in fanin and fanout trees, up to the final operations before the end of the graph-step.

4. L_reduce is the time taken by the global reduce from all PEs to the sequential controller (Section 6.1.1). The global reduce network is used for detecting both message and operation quiescence. Although the global reduce network is also used for the high-level gred methods' global reduce, only quiescence detection is on the critical path of a graph-step. Since quiescence detection works by counting message sends and receives, the increment from the last message receive, through the global reduce network, to the sequential controller is on the critical path.

       Fanin Leaves   Roots
AVG    0.97           0.97
MIN    0.91           0.92
MAX    1.00           1.00

Table 6.1: Fraction of nodes that are fanin-tree leaves and fraction that are root nodes, across all benchmark graphs. A node with no fanin-tree nodes is both a fanin-tree leaf and a root.
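The graph-step equation and the summation over graph-steps can be sketched as follows. The class and function names are illustrative assumptions, as are the sample cycle counts; how each term is itself computed is left to Sections 6.1.1 through 6.1.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphStepTime:
    """Times of the four subphases of one graph-step, in cycles."""
    l_bcast: float     # global broadcast latency
    l_nodes: float     # node iteration time
    t_dataflow: float  # dataflow-synchronized activity time
    l_reduce: float    # global reduce (quiescence detection) latency

    def total(self) -> float:
        """T_step = L_bcast + L_nodes + T_dataflow + L_reduce."""
        return self.l_bcast + self.l_nodes + self.t_dataflow + self.l_reduce


def total_runtime(steps) -> float:
    """The model approximates total runtime as the sum over graph-steps."""
    return sum(step.total() for step in steps)


# Hypothetical values for two graph-steps.
steps = [
    GraphStepTime(l_bcast=10, l_nodes=200, t_dataflow=350, l_reduce=12),
    GraphStepTime(l_bcast=10, l_nodes=200, t_dataflow=90, l_reduce=12),
]
print(total_runtime(steps))
```

Keeping the four terms separate, rather than storing only their sum, mirrors the model's purpose of showing which component of the critical path dominates a given graph-step.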

N_{PEs} is the number of processing elements. The graph is a pair of nodes and edges: G = (V, E). The set of nodes assigned to PE i is V_i, so |V| = \sum_{i=1}^{N_{PEs}} |V_i|. The function pred : V → E maps nodes to their predecessor edges, and succ : V → E maps nodes to their successor edges.
