L(unopt)bcast / Lbcast    L(unopt)reduce / Lreduce    L(unopt)step / Lstep
2.5                       1.8                         1.6
Table 7.3: Speedup due to using the global broadcast and reduce network on the Lbcast and Lreduce stages themselves and on Lstep.
7.1.2 Node Iteration Optimization
We attempted to minimize Tstep by ordering the nodes in each PE according to the critical path in message passing and operation firing. The node iteration and dataflow activity stages of a graph-step should overlap so that their critical-path latency is close to max{Lnodes, Ldataflow} instead of Lnodes + Ldataflow. We define Ldataflow(v) to be the latency of messages and operations from node v's firing to the end of the dataflow stage. This latency includes message hops through fanin trees and fanout trees. Now,
$$L_{dataflow} = \max_{v \in V} L_{dataflow}(v)$$
This calculation uses a model of the network latencies from PE to PE as well as the latencies of operators in each PE. The node iteration optimization orders the nodes in each PE by Ldataflow(v) so that higher-latency nodes come earlier in the iteration.
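The ordering rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not taken from the thesis: the function names are hypothetical, each node is assumed to take a fixed `t_node` to iterate, and a node's dataflow activity is assumed to start as soon as that node fires.

```python
from typing import Dict, List


def order_nodes_by_dataflow_latency(
    pe_nodes: List[str], l_dataflow: Dict[str, float]
) -> List[str]:
    """Sort one PE's nodes so that nodes with larger Ldataflow(v) come first."""
    # sorted() is stable, so nodes with equal latency keep their input order.
    return sorted(pe_nodes, key=lambda v: l_dataflow[v], reverse=True)


def critical_path_estimate(
    order: List[str], l_dataflow: Dict[str, float], t_node: float = 1.0
) -> float:
    """Toy model (assumption): node k fires at (k + 1) * t_node and its
    dataflow activity then needs l_dataflow[v] more time to finish."""
    return max((k + 1) * t_node + l_dataflow[v] for k, v in enumerate(order))


# Hypothetical latencies for three nodes placed on the same PE.
latency = {"a": 2.0, "b": 7.0, "c": 4.0}
unordered = ["a", "b", "c"]
ordered = order_nodes_by_dataflow_latency(unordered, latency)

print(ordered)                                      # ['b', 'c', 'a']
print(critical_path_estimate(unordered, latency))   # 9.0
print(critical_path_estimate(ordered, latency))     # 8.0
```

In this toy model the ordered schedule hides part of the node iteration time behind the longest dataflow path, which is the overlap the optimization aims for.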
Unfortunately, optimized node ordering makes little difference on total runtime. Comparing Tstep for the optimized case to the case where nodes are ordered randomly in PEs gives an average speedup of 1.01, with a minimum of 0.98 and a maximum of 1.03.
          Original    Decomposed    Speedup
Rdense    1395        490           5.5
Rser      268         60            4.1
Table 7.4: Decomposition reduces both Rdense and Rser.
cessor and successor edges into nodes assigned to a PE. The set of predecessor edges into PE i is Pi = pred(Vi) and the set of successors is Si = succ(Vi). So
$$|P_i| = \sum_{v \in V_i} \Delta_{in}(v) \quad \text{and} \quad |S_i| = \sum_{v \in V_i} \Delta_{out}(v).$$
The load across PEs is the maximum over predecessors and successors:
$$L = \max_{i=1}^{N_{PEs}} \left\{ \max(|P_i|, |S_i|) \right\} \tag{7.1}$$
Edge Memory and Send Memory in each PE have capacity Dedges and must fit all assigned predecessor and successor nodes, so:
$$D_{edges} \ge L$$
In a simple model where all edges are active in a graph-step, the time to input and output messages for the most imbalanced PE is:
$$R_{dense} = L$$
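As a concrete illustration of Equation 7.1, the following sketch computes the load L from per-node degrees and a node-to-PE assignment. The function name, the dictionary-based graph representation, and the example numbers are assumptions made for this example only.

```python
from typing import Dict


def pe_load(
    assignment: Dict[str, int],
    in_degree: Dict[str, int],
    out_degree: Dict[str, int],
    n_pes: int,
) -> int:
    """Return L = max over PEs i of max(|P_i|, |S_i|) as in Equation 7.1."""
    preds = [0] * n_pes  # |P_i|: sum of in-degrees of nodes on PE i
    succs = [0] * n_pes  # |S_i|: sum of out-degrees of nodes on PE i
    for node, pe in assignment.items():
        preds[pe] += in_degree[node]
        succs[pe] += out_degree[node]
    return max(max(p, s) for p, s in zip(preds, succs))


# Hypothetical three-node graph placed on two PEs.
assignment = {"a": 0, "b": 0, "c": 1}
in_degree = {"a": 3, "b": 1, "c": 2}
out_degree = {"a": 1, "b": 2, "c": 4}

load = pe_load(assignment, in_degree, out_degree, n_pes=2)
print(load)  # 4

# The memory constraint Dedges >= L for a hypothetical capacity.
d_edges = 8
print(d_edges >= load)  # True
```

Under the dense model in the text, the same number is also Rdense for this placement.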
Rser is the component of Tdataflow that models edge operation time and message input and output time (Section 6.1.3). When all edges are active, Rdense = Rser; however, edges are usually sparsely active, so Rser ≤ Rdense. We find that minimizing Rdense helps decrease Rser: Table 7.4 uses the performance model to show the impact of decomposition on both Rdense and Rser. On average across benchmark graphs, decomposition reduces Rdense by 6.3 times, which helps decrease Rser by 4.7 times.
Figure 7.2 shows the speedup from performing decomposition. Graphs with and without decomposition are placed for locality, and nodes are ordered by fanin/fanout tree latency-depth. The mean speedup is 2.6, the minimum is 1.0 and the maximum is 5.6.
[Figure 7.2 plot. Axes: Speedup; Application.Graph. Applications: Push-Relabel, Spatial Router, Bellman-Ford, ConceptNet.]
Figure 7.2: Speedup due to decomposition on Tstep for each graph which fits without decomposition.
7.2.1 Choosing ∆limit
The decomposition transform is performed on graphs at runtime before they are loaded into memory. The runtime algorithm places the limit ∆limit on in- and out-degrees:
$$\forall v \in V : \Delta_{limit} \ge \Delta_{in}(v) \wedge \Delta_{limit} \ge \Delta_{out}(v)$$
Figure 1.5 shows an example transform where ∆limit = 2. ∆limit is chosen by the runtime algorithm. Smaller ∆limit enables load-balancing nodes across PEs. However, smaller ∆limit increases the depth of fanin and fanout trees, which increases the graph-step latency, Ldataflow. So we maximize ∆limit with the constraint that PEs can be load balanced. Perfectly load balanced PEs have |Pi| = |Si| = |E|/NPEs. For each decomposed node, v, we want ∆in(v) ≤ |E|/NPEs and ∆out(v) ≤ |E|/NPEs, so we set:
$$\Delta_{limit} = \frac{|E|}{N_{PEs}}$$
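The choice of ∆limit above can be sketched as follows. The text does not say how a fractional |E|/NPEs is handled, so rounding down and clamping to a minimum of 2 are assumptions of this example, as are the function names and the sample graph.

```python
from typing import Dict, List


def choose_delta_limit(num_edges: int, n_pes: int) -> int:
    """Degree limit |E| / NPEs, rounded down (assumption) and clamped to
    at least 2 (assumption) so that fanin and fanout trees can be built."""
    if n_pes <= 0:
        raise ValueError("n_pes must be positive")
    return max(2, num_edges // n_pes)


def nodes_to_decompose(
    in_degree: Dict[str, int], out_degree: Dict[str, int], delta_limit: int
) -> List[str]:
    """Nodes whose in-degree or out-degree exceeds the limit."""
    return sorted(
        v
        for v in in_degree
        if in_degree[v] > delta_limit or out_degree[v] > delta_limit
    )


# Hypothetical graph: 1000 edges in total, mapped onto 16 PEs.
limit = choose_delta_limit(num_edges=1000, n_pes=16)
print(limit)  # 62

in_degree = {"a": 100, "b": 3, "c": 5}
out_degree = {"a": 1, "b": 70, "c": 5}
print(nodes_to_decompose(in_degree, out_degree, limit))  # ['a', 'b']
```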
[Figure 7.3 plot. Axes: Memory for Undecomposed Graph / Memory for Decomposed Graph; Application.Graph. Applications: Push-Relabel, Spatial Router, Bellman-Ford, ConceptNet.]
Figure 7.3: The ratio of memory required by the undecomposed graph to memory required for the decomposed graph for each application and graph. RMS is on the right. Uniform PE memory sizes are assumed.
For the purpose of maximizing memory load balance, it is fine to have one node take up all or most of a PE. The load balancer can use the many small nodes to pack PEs evenly.
Figure 7.3 shows, for each application and graph, the ratio of memory required when decomposition is not performed to memory required with decomposition. This assumes that memory per PE is uniform, as it is in our logic architecture. The average memory reduction due to decomposition is 10 times, and the maximum is 27 times.
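The packing argument above, where one node fills a PE and many small nodes even out the rest, can be illustrated with a greedy largest-first heuristic. This is not the load balancer used in this work, which is not described here. It is a generic sketch that ignores locality and uses a single hypothetical weight per node.

```python
import heapq
from typing import Dict, List


def pack_nodes(weights: Dict[str, int], n_pes: int) -> List[int]:
    """Greedy largest-first packing: place each node, heaviest first,
    on the PE that currently has the smallest load. Returns PE loads."""
    heap = [(0, pe) for pe in range(n_pes)]  # (current load, PE index)
    heapq.heapify(heap)
    loads = [0] * n_pes
    for _, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        load, pe = heapq.heappop(heap)
        loads[pe] = load + weight
        heapq.heappush(heap, (loads[pe], pe))
    return loads


# Hypothetical workload: one node with 8 edges plus 16 single-edge nodes,
# packed onto 3 PEs. The ideal per-PE load is 24 / 3 = 8.
weights = {"big": 8}
weights.update({f"small{i}": 1 for i in range(16)})

loads = pack_nodes(weights, n_pes=3)
print(loads)  # [8, 8, 8]
```

The large node occupies one PE entirely, and the small nodes fill the remaining PEs to the same level, so the maximum load matches the ideal.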
Unlike memory requirements, the amount of computation work for a node in a particular graph-step is dynamic. In general, only a subset of nodes is active in each graph-step, so the observed computation load balance is not the same as the memory load balance and differs from graph-step to graph-step. Large decomposed nodes that take up most of a PE could potentially ruin the computation load balance. The amount of message traffic also contributes to Tstep. The node-to-PE placement algorithm tries to place fanin and fanout nodes on the same PE as their descendants. Smaller nodes may result in less message traffic due to fewer long-distance messages, resulting in a lower Rnet. However, larger nodes mean that the depth of fanin and fanout trees is smaller, resulting in a lower Ldataflow latency.
[Figure 7.4 plot. Axes: Normalized Runtime; ∆limit / (|E| / NPEs). Series: min, RMS, max.]
Figure 7.4: Root mean square, along with min and max, across applications and graphs of Tstep for a range of ∆limits. ∆limit is normalized to |E|/NPEs.
Figure 7.4 plots the root mean square of normalized Tstep across all graphs for a range of ∆limits; ∆limit is normalized to |E|/NPEs. The choice with the best RMS is the simplest: ∆limit = |E|/NPEs. Figure 7.5 shows the effect of varying ∆limit on various contributors to graph-step time, where ∆limit is normalized to |E|/NPEs. This figure shows that Ldataflow is the dominating contributor. For ∆limit < |E|/NPEs the depth of fanin and fanout trees increases, so Ldataflow grows. For ∆limit > |E|/NPEs it is impossible to perfectly load-balance nodes across PEs. Figure 7.6 shows the number of fanin and fanout levels across decomposed nodes.
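The link between ∆limit and tree depth can be made concrete with a small calculation. The sketch below assumes a balanced fanin (or fanout) tree in which every tree node handles at most ∆limit edges. The exact tree shape produced by the decomposition transform is not specified here, so this is an idealized model with a hypothetical function name.

```python
def tree_levels(degree: int, delta_limit: int) -> int:
    """Number of levels in a balanced tree that gathers `degree` edges
    when every tree node accepts at most `delta_limit` inputs.
    A node that already satisfies the limit counts as one level."""
    if delta_limit < 2:
        raise ValueError("delta_limit must be at least 2")
    levels = 0
    capacity = 1  # edges reachable by a tree with `levels` levels
    while capacity < degree:
        capacity *= delta_limit
        levels += 1
    return max(1, levels)


# A hypothetical node with 1000 input edges under different limits.
for limit in (2, 32, 1000):
    print(limit, tree_levels(1000, limit))
# 2 10
# 32 2
# 1000 1
```

In this model, depth grows logarithmically as ∆limit shrinks, which is consistent with Ldataflow increasing for ∆limit below |E|/NPEs.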