To gauge the value of our performance model, we measure its accuracy for our applications run on the BEE3 platform. Over all applications and graphs, the mean error of modeled runtime (Tstep(model)) compared to actual runtime (Tstep(actual)) is 11% and the maximum error is 46%. The error is calculated as:
\[ \left| \frac{T_{step}(\mathrm{model})}{T_{step}(\mathrm{actual})} - 1 \right| \]
The performance model tends to underpredict performance rather than overpredict it. For our benchmark applications and graphs, the mean prediction is 90% of actual performance.
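The error metric above can be sketched in a few lines of Python. This is only an illustration: the function names and the (modeled, measured) cycle counts are hypothetical and are not data from our experiments.

```python
from statistics import mean


def relative_error(t_model: float, t_actual: float) -> float:
    """Return |t_model / t_actual - 1|, the error metric defined above."""
    if t_actual <= 0:
        raise ValueError("t_actual must be positive")
    return abs(t_model / t_actual - 1.0)


def summarize_errors(pairs):
    """Return (mean error, maximum error) over (modeled, measured) pairs."""
    errors = [relative_error(t_model, t_actual) for t_model, t_actual in pairs]
    return mean(errors), max(errors)


# Hypothetical (modeled, measured) cycle counts, chosen only for illustration.
samples = [(900.0, 1000.0), (540.0, 1000.0), (1100.0, 1000.0)]
mean_err, max_err = summarize_errors(samples)
print(f"mean error {mean_err:.2f}, max error {max_err:.2f}")
```

Because the metric is symmetric around a ratio of 1, it does not by itself distinguish underprediction from overprediction; the sign of Tstep(model) / Tstep(actual) - 1 would have to be inspected separately.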
Figure 6.3 shows that the model underpredicts for the Push-Relabel and Spatial Router applications. Our model does not include congestion between messages in the network, which causes it to underpredict Rnet. Congestion occurs when contention between messages through a merge backs up predecessor switches, causing messages not destined for the merge to stall. Our model also ignores the case where a message that is not on the critical path blocks a message on the critical path.

[Figure 6.3: Modeled cycles divided by measured cycles (Tstep(model) / Tstep(actual)) for each application and graph. Maximum and minimum deviations are on the right. Applications shown: Push-Relabel, Spatial Router, Bellman-Ford, ConceptNet.]
In general, computation in the four phases (global broadcast, node iteration, dataflow activity, and global reduce) can overlap. This overlap can cause the model to overpredict when actions on the critical path in a particular phase occur before all actions in the previous phase have completed. For example, node iteration on a PE close to the sequential controller starts before time Lbcast, when the global broadcast reaches the farthest PEs. If one of the nodes close to the sequential controller is on the critical path, then Lbcast does not fully contribute to Tstep. Node iteration and dataflow activity overlap, so when the last node is not on the critical path, the impact of node iteration is not the full Tnode. Finally, when the last message received is close to the sequential controller, Lreduce does not fully contribute to Tstep. This underprediction indicates that time saved due to overlapping computation is not having a large impact on performance.
Chapter 7
Optimizations
This chapter explores optimizations for the logic architecture and data placement for GraphStep. We show the benefit of dedicated global broadcast and reduce networks in Section 7.1. Node decomposition (Section 7.2) and placement for locality (Section 7.4) optimize how operations and data are assigned to PEs. Section 7.3 compares alternative synchronization styles for message passing and node firing. Table 7.1 shows the speedup of each optimization.
We evaluate our optimizations on GRAPAL programs mapped to the BEE3 platform.
The use of GraphStep allows us to try these optimizations without changing the GRAPAL source program. However, most of these optimizations are relevant for graph algorithms on multiprocessor architectures in general. In particular, a parallel graph algorithm written in a general model and run on a conventional multiprocessor architecture can use node decomposition, placement for locality, and any of the synchronization styles in Section 7.3. Section 7.1 shows the benefit of choosing a multiprocessor architecture with dedicated global networks.
       Decomposition   Locality   Latency Ordering   Dedicated Global Networks
AVG    2.18            1.20       1.01               1.51
MIN    1.00            0.79       0.98               1.17
MAX    5.53            1.81       1.03               1.72
Table 7.1: Speedup due to each optimization
       Lbcast/Lstep   Lnodes/Lstep   Ldataflow/Lstep   Lreduce/Lstep
AVG    0.28           0.21           0.33              0.18
MIN    0.14           0.07           0.14              0.08
MAX    0.41           0.61           0.52              0.26
Table 7.2: Contributions, after optimization, to the graph-step critical path, Lstep
7.1 Critical Path Latency Minimization
For small graphs, the critical path latency dominates graph-step time. To model critical path latency, we remove throughput times from the model for Tstep defined in Section 6.1. Like Tstep, the critical path latency of a graph-step (Lstep) contains the global broadcast latency (Lbcast), node iteration latency (Lnodes), and global reduce latency (Lreduce). Instead of Tdataflow, the critical path of dataflow activity, Ldataflow, is used.
\[ L_{step} = L_{bcast} + L_{nodes} + L_{dataflow} + L_{reduce} \]
Table 7.2 shows the contribution of each latency component to Lstep for the optimized case. These latencies are calculated with our performance model and averaged over all applications and graphs. To allow averaging, each component of Lstep is reported as a fraction of total latency.
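The normalization used for Table 7.2 can be sketched as follows: each run's four latencies are first divided by that run's Lstep, and only then averaged across runs. The function names and the cycle counts below are hypothetical and serve only to show the arithmetic.

```python
STAGES = ("bcast", "nodes", "dataflow", "reduce")


def latency_fractions(l_bcast, l_nodes, l_dataflow, l_reduce):
    """Return each stage latency as a fraction of Lstep (the sum of all four)."""
    latencies = dict(zip(STAGES, (l_bcast, l_nodes, l_dataflow, l_reduce)))
    l_step = sum(latencies.values())
    if l_step <= 0:
        raise ValueError("Lstep must be positive")
    return {stage: value / l_step for stage, value in latencies.items()}


def average_fractions(runs):
    """Average the per-run fractions over all runs (one run = one graph)."""
    per_run = [latency_fractions(*run) for run in runs]
    return {stage: sum(f[stage] for f in per_run) / len(per_run) for stage in STAGES}


# Hypothetical (Lbcast, Lnodes, Ldataflow, Lreduce) cycle counts for two graphs.
runs = [(100, 50, 150, 100), (30, 60, 90, 20)]
avg = average_fractions(runs)
print({stage: round(value, 4) for stage, value in avg.items()})
```

Normalizing before averaging keeps graphs with a large absolute Lstep from dominating the result, and the averaged fractions still sum to 1.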
We optimize the graph-step latency by providing dedicated networks for global broadcast and reduce, and we evaluate an optimization that orders the iteration over nodes performed by each PE.
7.1.1 Global Broadcast and Reduce Optimization
One way in which we customize our logic architecture to GraphStep is by devoting a global broadcast and reduce network to minimizing Lbcast and Lreduce. This is enabled by knowledge of the types of communication in GraphStep. A general actors implementation would have to use the message-passing network for this critical global synchronization. In a nonspecialized implementation of global broadcast, a controller PE sends one message to each PE. For global reduce, each PE must send to the controller PE, which performs the global reduce computations.

[Figure 7.1: Latency of a graph-step (Lstep) as a sum of the latencies of the four stages (Lbcast, Lreduce, Lnodes, Ldataflow) for each application and graph. Applications shown: Spatial Router, Push-Relabel, Bellman-Ford, ConceptNet.]

The unoptimized global reduce implementation would have to trade off between frequency of quiescence updates and network congestion. In the unoptimized case, messages travel over the packet-switched network and are sent or received one at a time. These unoptimized times may be optimistic since network congestion is not modeled here. The unoptimized global broadcast and reduce latencies are:
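\[ L^{(unopt)}_{bcast} = \max_{i=1}^{N_{PEs}} M_i + 2 N_{PEs} \left\lceil \frac{W_{bcast}}{W_{flit}} \right\rceil \]
\[ L^{(unopt)}_{reduce} = \max_{i=1}^{N_{PEs}} M_i + N_{PEs} \left\lceil \frac{W_{reduce}}{W_{flit}} \right\rceil \]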
The network latency to PE i is Mi. There are NPEs messages sent or received, where the number of cycles for each one is the number of flits it requires. Similar to the optimized broadcast (Equation 6.1), there is a factor of 2 because there is one broadcast at the beginning of a graph-step and one at the end. Table 7.3 compares the unoptimized global broadcast and reduce to the optimized Lbcast and Lreduce (Equations 6.1, 6.2). The speedups due to optimization on global broadcast and reduce are 2.5 and 1.8, respectively. For the total Lstep the speedup is 1.6.
L(unopt)bcast/Lbcast   L(unopt)reduce/Lreduce   L(unopt)step/Lstep
2.5                    1.8                      1.6
Table 7.3: Speedup due to using the global broadcast and reduce network on the Lbcast and Lreduce stages themselves and on Lstep
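As a rough illustration of the unoptimized latency formulas in this section, the following sketch evaluates them for a small configuration. The function names, the per-PE latencies Mi, and the message and flit widths are hypothetical values chosen for the example, not parameters of our BEE3 implementation.

```python
def flits_per_message(message_width: int, flit_width: int) -> int:
    """Ceiling of message_width / flit_width, using integer arithmetic."""
    if message_width <= 0 or flit_width <= 0:
        raise ValueError("widths must be positive")
    return (message_width + flit_width - 1) // flit_width


def unopt_bcast_latency(pe_latencies, w_bcast, w_flit):
    """max_i(M_i) + 2 * N_PEs * ceil(W_bcast / W_flit)."""
    n_pes = len(pe_latencies)
    return max(pe_latencies) + 2 * n_pes * flits_per_message(w_bcast, w_flit)


def unopt_reduce_latency(pe_latencies, w_reduce, w_flit):
    """max_i(M_i) + N_PEs * ceil(W_reduce / W_flit)."""
    n_pes = len(pe_latencies)
    return max(pe_latencies) + n_pes * flits_per_message(w_reduce, w_flit)


# Hypothetical network latencies M_i (cycles) from the controller to each of 4 PEs.
pe_latencies = [4, 6, 9, 12]
bcast = unopt_bcast_latency(pe_latencies, w_bcast=32, w_flit=16)    # 12 + 2*4*2
reduce_ = unopt_reduce_latency(pe_latencies, w_reduce=48, w_flit=16)  # 12 + 4*3
print(bcast, reduce_)
```

In both expressions the second term grows linearly with the number of PEs, since the controller sends or receives one message at a time.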
7.1.2 Node Iteration Optimization
We attempted to minimize Tstep by ordering nodes in each PE to be sensitive to the critical path in message passing and operation firing. The node iteration and dataflow activity stages in a graph-step should be made to overlap so their critical path latency is close to max{Lnodes, Ldataflow} instead of Lnodes + Ldataflow. We define Ldataflow(v) to be the latency of messages and operations from node v's firing to the end of the dataflow stage.
This latency includes message hops through fanin trees and fanout trees. Now,
\[ L_{dataflow} = \max_{v \in V} L_{dataflow}(v) \]
This calculation uses a model of the network latencies from PE to PE as well as the latencies of operators in each PE. The node iteration optimization orders nodes in each PE by Ldataflow(v) so the higher latency nodes come earlier in the iteration.
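The ordering heuristic can be sketched as a sort by decreasing Ldataflow(v). The completion-time function below is a deliberately simplified toy model, not our performance model: it assumes a single PE that fires one node per fixed number of cycles and ignores network contention. All names and latency values are hypothetical.

```python
def order_nodes_by_latency(nodes_in_pe, l_dataflow):
    """Order a PE's nodes so that nodes with larger Ldataflow(v) fire first."""
    return sorted(nodes_in_pe, key=lambda v: l_dataflow[v], reverse=True)


def toy_completion_time(order, l_dataflow, cycles_per_node=1):
    """Toy estimate: the node at position k fires at (k + 1) * cycles_per_node
    and its downstream dataflow activity finishes Ldataflow(v) cycles later."""
    return max((k + 1) * cycles_per_node + l_dataflow[v] for k, v in enumerate(order))


# Hypothetical Ldataflow(v) values, in cycles, for three nodes on one PE.
l_dataflow = {"a": 2, "b": 10, "c": 5}

worst = toy_completion_time(["a", "c", "b"], l_dataflow)  # longest-latency node last
ordered = order_nodes_by_latency(["a", "c", "b"], l_dataflow)
best = toy_completion_time(ordered, l_dataflow)
print(ordered, worst, best)
```

In this toy setting the worst order costs Lnodes + Ldataflow = 3 + 10 = 13 cycles, while the sorted order costs 11 cycles, close to max{Lnodes, Ldataflow} = 10.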
Unfortunately, optimized node ordering makes little difference to total runtime. Comparing Tstep for the optimized case to the case where nodes are ordered randomly in PEs gives an average speedup of 1.01, with a minimum of 0.98 and a maximum of 1.03.