Accuracy of Performance Model - A Programming Language for Parallel Graph Algorithms

To gauge the value of our performance model, we measure its accuracy for our applications run on the BEE3 platform. Over all applications and graphs, the mean error of modeled runtime (T_step^(model)) compared to actual runtime (T_step^(actual)) is 11% and the maximum error is 46%. The error is calculated as:

\[ \left| \frac{T_{step}^{(model)}}{T_{step}^{(actual)}} - 1 \right| \]

The performance model tends to underpredict rather than overpredict runtime. For our benchmark applications and graphs, the mean predicted runtime is 90% of the actual runtime.
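As a minimal sketch of the error metric above, the following Python computes the relative error for a few (modeled, measured) pairs and then the mean and maximum over them. The function name `step_error` and the cycle counts are hypothetical and chosen only for illustration; they are not measurements from the BEE3 platform.

```python
def step_error(t_model: float, t_actual: float) -> float:
    """Return |t_model / t_actual - 1|, the relative error of a modeled step time."""
    if t_actual <= 0:
        raise ValueError("t_actual must be positive")
    return abs(t_model / t_actual - 1.0)


# Hypothetical (modeled, measured) cycle counts, not data from the thesis.
samples = [(900.0, 1000.0), (1100.0, 1000.0), (540.0, 1000.0)]

errors = [step_error(model, actual) for model, actual in samples]
mean_error = sum(errors) / len(errors)
max_error = max(errors)
print(f"mean error = {mean_error:.2f}, max error = {max_error:.2f}")
```

Because the metric takes an absolute value, it does not distinguish underprediction from overprediction; the ratio of modeled to measured time has to be inspected separately for that.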

Figure 6.3 shows that the model underpredicts for the Push-Relabel and Spatial Router applications. Our model does not include congestion between messages in the network, which causes it to underpredict R_net. Congestion occurs when contention between messages through a merge backs up predecessor switches, causing messages not destined for the merge to stall. Our model also ignores the case where a message that is not on the critical path blocks a message on the critical path.

[Figure 6.3: Modeled cycles divided by measured cycles for each application and graph. Maximum and minimum deviations are on the right. Axes: T_step(model) / T_step(actual) versus Application.Graph; applications: Push-Relabel, Spatial Router, Bellman-Ford, ConceptNet.]

In general, computation in the four phases (global broadcast, node iteration, dataflow activity, and global reduce) can overlap. This overlap can cause the model to overpredict when actions on the critical path in a particular phase occur before all actions in the previous phase have completed. For example, node iteration on a PE close to the sequential controller starts before time L_bcast, when the global broadcast reaches the farthest PEs. If one of the nodes close to the sequential controller is on the critical path, then L_bcast does not fully contribute to T_step. Node iteration and dataflow activity overlap, so when the last node is not on the critical path, the impact of node iteration is not the full T_node. Finally, when the last message received is close to the sequential controller, L_reduce does not fully contribute to T_step. The observed underprediction indicates that time saved due to overlapping computation is not having a large impact on performance.

Chapter 7 Optimizations

This chapter explores optimizations for the logic architecture and data placement for GraphStep. We show the benefit of dedicated global broadcast and reduce networks in Section 7.1. Node decomposition (Section 7.2) and placement for locality (Section 7.4) optimize how operations and data are assigned to PEs. Section 7.3 compares alternative synchronization styles for message passing and node firing. Table 7.1 shows the speedup of each optimization.

We evaluate our optimizations on GRAPAL programs mapped to the BEE3 platform.

The use of GraphStep allows us to try these optimizations without changing the GRAPAL source program. However, most of these optimizations are relevant for graph algorithms on multiprocessor architectures in general. In particular, a parallel graph algorithm written in a general model and run on a conventional multiprocessor architecture can use node decomposition, placement for locality, and any of the synchronization styles in Section 7.3. Section 7.1 shows the benefit of choosing a multiprocessor architecture with dedicated global networks.

       Decomposition   Locality   Latency Ordering   Dedicated Global Networks
AVG    2.18            1.20       1.01               1.51
MIN    1.00            0.79       0.98               1.17
MAX    5.53            1.81       1.03               1.72

Table 7.1: Speedup due to each optimization

       L_bcast/L_step   L_nodes/L_step   L_dataflow/L_step   L_reduce/L_step
AVG    0.28             0.21             0.33                0.18
MIN    0.14             0.07             0.14                0.08
MAX    0.41             0.61             0.52                0.26

Table 7.2: Contributions, after optimization, to the graph-step critical path, L_step

7.1 Critical Path Latency Minimization

For small graphs, the critical path latency dominates graph-step time. To model critical path latency, we remove throughput times from the model for T_step defined in Section 6.1. Like T_step, the critical path latency of a graph-step (L_step) contains the global broadcast latency (L_bcast), node iteration latency (L_nodes), and global reduce latency (L_reduce). Instead of T_dataflow, the critical path of dataflow activity, L_dataflow, is used.

\[ L_{step} = L_{bcast} + L_{nodes} + L_{dataflow} + L_{reduce} \]

Table 7.2 shows the contribution of each latency component to L_step for the optimized case. These latencies are calculated with our performance model and averaged over all applications and graphs. To allow averaging, each component of L_step is reported as a fraction of total latency.
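The normalization described above can be sketched as follows: sum the four stage latencies into L_step, then divide each stage by that sum so that graphs of different sizes can be averaged. The function names and the cycle counts are hypothetical and are not taken from the performance model's implementation or from Table 7.2.

```python
def step_latency(l_bcast: float, l_nodes: float, l_dataflow: float, l_reduce: float) -> float:
    """Critical path latency of one graph-step: the sum of the four stage latencies."""
    return l_bcast + l_nodes + l_dataflow + l_reduce


def latency_fractions(l_bcast: float, l_nodes: float, l_dataflow: float, l_reduce: float) -> dict:
    """Express each stage latency as a fraction of the total step latency."""
    total = step_latency(l_bcast, l_nodes, l_dataflow, l_reduce)
    if total <= 0:
        raise ValueError("total step latency must be positive")
    return {
        "bcast": l_bcast / total,
        "nodes": l_nodes / total,
        "dataflow": l_dataflow / total,
        "reduce": l_reduce / total,
    }


# Hypothetical stage latencies in cycles for a single graph.
fractions = latency_fractions(l_bcast=100, l_nodes=50, l_dataflow=200, l_reduce=50)
print(fractions)
```

Averaging these per-graph fractions, rather than raw cycle counts, keeps large graphs from dominating the reported contributions.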

We optimize the graph-step latency by providing dedicated networks for global broadcast and reduce, and we evaluate an optimization that orders the iteration over nodes performed by each PE.

7.1.1 Global Broadcast and Reduce Optimization

One way in which we customize our logic architecture to GraphStep is that we devote a global broadcast and reduce network to minimizing L_bcast and L_reduce. This is enabled by knowledge of the types of communication in GraphStep. A general actors implementation would have to use the message-passing network for this critical global synchronization. In a nonspecialized implementation of global broadcast, a controller PE sends one message to each PE. For global reduce, each PE must send to the controller PE, which performs the global reduce computations. The unoptimized global reduce implementation would have to trade off between frequency of quiescence updates and network congestion. In the unoptimized case, messages travel over the packet-switched network and are sent or received one at a time. These unoptimized times may be optimistic since network congestion is not modeled here.

[Figure 7.1: Latency of a graph-step (L_step) as a sum of the latencies of the four stages (L_bcast, L_reduce, L_nodes, L_dataflow) for each application and graph. Applications: Spatial Router, Push-Relabel, Bellman-Ford, ConceptNet.]

The unoptimized global broadcast and reduce latencies are:

\[ L_{bcast}^{(unopt)} = \max_{i=1}^{N_{PEs}} M_i + 2 N_{PEs} \left\lceil \frac{W_{bcast}}{W_{flit}} \right\rceil \]

\[ L_{reduce}^{(unopt)} = \max_{i=1}^{N_{PEs}} M_i + N_{PEs} \left\lceil \frac{W_{reduce}}{W_{flit}} \right\rceil \]

The network latency to PE i is M_i. There are N_PEs messages sent or received, where the number of cycles for each one is the number of flits it requires. Similar to the optimized broadcast (Equation 6.1), there is a factor of 2 because there is one broadcast at the beginning of a graph-step and one at the end. Table 7.3 compares the unoptimized global broadcast and reduce to the optimized L_bcast and L_reduce (Equations 6.1 and 6.2). The speedups due to optimization on global broadcast and reduce are 2.5 and 1.8, respectively. For the total L_step, the speedup is 1.6.

L^(unopt)_bcast / L_bcast   L^(unopt)_reduce / L_reduce   L^(unopt)_step / L_step
2.5                         1.8                           1.6

Table 7.3: Speedup due to using the global broadcast and reduce network, on the L_bcast and L_reduce stages themselves and on L_step
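The unoptimized broadcast and reduce latencies in this section can be sketched in Python as below. This is an illustration under assumptions: the function names, the per-PE hop latencies, and the message and flit widths are all hypothetical, and W_bcast, W_reduce, and W_flit are taken to be widths in bits so that the ceiling gives the number of flits per message.

```python
def flits_per_message(width: int, flit_width: int) -> int:
    """Ceiling of width / flit_width using integer arithmetic."""
    if width <= 0 or flit_width <= 0:
        raise ValueError("widths must be positive")
    return -(-width // flit_width)


def unopt_bcast_latency(hop_latencies: list, w_bcast: int, w_flit: int) -> int:
    """Worst-case network latency plus two serialized broadcasts per graph-step."""
    n_pes = len(hop_latencies)
    return max(hop_latencies) + 2 * n_pes * flits_per_message(w_bcast, w_flit)


def unopt_reduce_latency(hop_latencies: list, w_reduce: int, w_flit: int) -> int:
    """Worst-case network latency plus one serialized message from every PE."""
    n_pes = len(hop_latencies)
    return max(hop_latencies) + n_pes * flits_per_message(w_reduce, w_flit)


# Hypothetical example: four PEs with network latencies M_i in cycles.
m = [3, 5, 8, 6]
l_bcast_unopt = unopt_bcast_latency(m, w_bcast=40, w_flit=16)    # 8 + 2*4*3
l_reduce_unopt = unopt_reduce_latency(m, w_reduce=20, w_flit=16)  # 8 + 4*2
print(l_bcast_unopt, l_reduce_unopt)
```

Both terms that depend on N_PEs grow linearly with the number of PEs because the controller handles one message at a time, which is the cost the dedicated networks are meant to avoid.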

7.1.2 Node Iteration Optimization

We attempted to minimize T_step by ordering nodes in each PE to be sensitive to the critical path in message passing and operation firing. The node iteration and dataflow activity stages in a graph-step should be made to overlap so their critical path latency is close to max{L_nodes, L_dataflow} instead of L_nodes + L_dataflow. We define L_dataflow(v) to be the latency of messages and operations from node v's firing to the end of the dataflow stage. This latency includes message hops through fanin trees and fanout trees. Now,

\[ L_{dataflow} = \max_{v \in V} L_{dataflow}(v) \]

This calculation uses a model of the network latencies from PE to PE as well as the latencies of operators in each PE. The node iteration optimization orders nodes in each PE by L_dataflow(v) so the higher latency nodes come earlier in the iteration.
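A minimal sketch of this ordering, assuming the per-node latencies L_dataflow(v) have already been computed by the model: each PE's node list is sorted so that nodes with the largest downstream latency fire first. The data layout (a dictionary from PE id to node list) and all names and values are hypothetical, not the GraphStep implementation.

```python
def order_nodes_by_dataflow_latency(nodes_by_pe: dict, dataflow_latency: dict) -> dict:
    """Sort each PE's nodes by descending L_dataflow(v); ties keep their original order."""
    return {
        pe: sorted(nodes, key=lambda v: dataflow_latency[v], reverse=True)
        for pe, nodes in nodes_by_pe.items()
    }


# Hypothetical placement of five nodes on two PEs, with modeled latencies in cycles.
nodes_by_pe = {0: ["a", "b", "c"], 1: ["d", "e"]}
dataflow_latency = {"a": 4, "b": 9, "c": 6, "d": 2, "e": 7}

ordered = order_nodes_by_dataflow_latency(nodes_by_pe, dataflow_latency)
l_dataflow = max(dataflow_latency.values())  # L_dataflow is the maximum over all nodes
print(ordered, l_dataflow)
```

The intent is that the longest dataflow chains start as early as possible, so that node iteration and dataflow activity overlap instead of running back to back.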

Unfortunately, optimized node ordering makes little difference to total runtime. Comparing T_step for the optimized case to the case where nodes are ordered randomly in PEs gives an average speedup of 1.01, with a minimum of 0.98 and a maximum of 1.03.
