Node Decomposition - A Programming Language for Parallel Graph Algorithms

L^(unopt)_bcast / L_bcast    L^(unopt)_reduce / L_reduce    L^(unopt)_step / L_step
2.5                          1.8                            1.6

Table 7.3: Speedup due to using the global broadcast and reduce network, on the L_bcast and L_reduce stages themselves and on L_step.

7.1.2 Node Iteration Optimization

We attempted to minimize T_step by ordering the nodes in each PE so that the order is sensitive to the critical path in message passing and operation firing. The node iteration and dataflow activity stages in a graph-step should overlap so that their critical-path latency is close to max{L_nodes, L_{dataflow}} instead of L_nodes + L_{dataflow}. We define L_{dataflow}(v) to be the latency of messages and operations from node v's firing to the end of the dataflow stage.

This latency includes message hops through fanin trees and fanout trees. Now,

L_{dataflow} = \max_{v ∈ V} L_{dataflow}(v)

This calculation uses a model of the network latencies from PE to PE as well as the latencies of operators in each PE. The node iteration optimization orders nodes in each PE by L_{dataflow}(v) so the higher-latency nodes come earlier in the iteration.
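The ordering rule above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function names, the dictionary-based data layout, and the latency values are all hypothetical, and the per-node latencies L_{dataflow}(v) are assumed to have been computed already by the latency model.

```python
from typing import Dict, List


def order_nodes_by_latency(pe_nodes: Dict[int, List[str]],
                           l_dataflow: Dict[str, float]) -> Dict[int, List[str]]:
    """Within each PE, place nodes with larger L_dataflow(v) earlier."""
    return {
        pe: sorted(nodes, key=lambda v: l_dataflow[v], reverse=True)
        for pe, nodes in pe_nodes.items()
    }


def critical_dataflow_latency(l_dataflow: Dict[str, float]) -> float:
    """L_dataflow = max over all nodes v of L_dataflow(v)."""
    return max(l_dataflow.values())


# Hypothetical placement of five nodes on two PEs, with made-up latencies.
pe_nodes = {0: ["a", "b", "c"], 1: ["d", "e"]}
l_dataflow = {"a": 3.0, "b": 9.0, "c": 5.0, "d": 2.0, "e": 7.0}

ordered = order_nodes_by_latency(pe_nodes, l_dataflow)
worst = critical_dataflow_latency(l_dataflow)
```

Here node "b" is iterated first on PE 0 because its messages have the longest remaining path, which gives its dataflow activity the most time to overlap with the rest of the node iteration.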

Unfortunately, optimized node ordering makes little difference to total runtime. Comparing T_step for the optimized case to the case where nodes are ordered randomly in PEs gives an average speedup of 1.01, with a minimum of 0.98 and a maximum of 1.03.

           Original   Decomposed   Speedup
R_dense    1395       490          5.5
R_ser      268        60           4.1

Table 7.4: Decomposition reduces both R_dense and R_ser.

... predecessor and successor edges into nodes assigned to a PE. The set of predecessor edges into PE i is P_i = pred(V_i) and the set of successors is S_i = succ(V_i). So

|P_i| = \sum_{v ∈ V_i} ∆_in(v)  and  |S_i| = \sum_{v ∈ V_i} ∆_out(v).

The load across PEs is the maximum over predecessors and successors:

L = \max_{i=1}^{N_{PEs}} {max(|P_i|, |S_i|)}    (7.1)

Edge Memory and Send Memory in each PE have capacity D_edges and must fit all assigned predecessor and successor nodes, so:

D_edges ≥ L

In a simple model where all edges are active in a graph-step, the time to input and output messages for the most imbalanced PE is:

R_dense = L
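As a worked illustration of Equation 7.1, the following Python sketch computes |P_i|, |S_i| and the load L for a small node-to-PE assignment. The function name, the dictionary layout, and the degree values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Dict, List, Tuple


def pe_load(assignment: Dict[int, List[str]],
            deg_in: Dict[str, int],
            deg_out: Dict[str, int]) -> Tuple[int, Dict[int, Tuple[int, int]]]:
    """Return (L, {pe: (|P_i|, |S_i|)}) following Equation 7.1."""
    per_pe = {}
    for pe, nodes in assignment.items():
        p_i = sum(deg_in[v] for v in nodes)   # predecessor edges into PE i
        s_i = sum(deg_out[v] for v in nodes)  # successor edges out of PE i
        per_pe[pe] = (p_i, s_i)
    load = max(max(p, s) for p, s in per_pe.values())
    return load, per_pe


# Hypothetical 4-node graph with 10 edges, placed on 2 PEs.
deg_in = {"a": 1, "b": 6, "c": 2, "d": 1}
deg_out = {"a": 4, "b": 1, "c": 2, "d": 3}
assignment = {0: ["a", "b"], 1: ["c", "d"]}

load, per_pe = pe_load(assignment, deg_in, deg_out)
d_edges = 8                 # assumed per-PE memory capacity
fits = d_edges >= load      # the constraint D_edges >= L
```

In this example PE 0 holds 7 predecessor edges because of the high in-degree of node "b", so L = 7 even though a perfectly balanced placement would need only 5 per PE.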

R_ser is the component of T_{dataflow} that models edge operation time and message input and output time (Section 6.1.3). When all edges are active, R_dense = R_ser; however, edges are usually sparsely active, so R_ser ≤ R_dense. We find that minimizing R_dense helps decrease R_ser: Table 7.4 uses the performance model to show the impact of decomposition on both R_dense and R_ser. On average across benchmark graphs, decomposition reduces R_dense by 6.3 times, which helps decrease R_ser by 4.7 times.

Figure 7.2 shows the speedup from performing decomposition. Graphs with and without decomposition are placed for locality, and nodes are ordered by fanin/fanout tree latency-depth. The mean speedup is 2.6, the minimum is 1.0, and the maximum is 5.6.

[Figure 7.2: plot of Speedup for each Application.Graph; applications shown: Push-Relabel, Spatial Router, Bellman-Ford, ConceptNet.]

Figure 7.2: Speedup due to decomposition on T_step for each graph which fits without decomposition.

7.2.1 Choosing ∆_limit

The decomposition transform is performed on graphs at runtime before they are loaded into memory. The runtime algorithm places the limit ∆_limit on in- and out-degrees:

∀v ∈ V : ∆_limit ≥ ∆_in(v) ∧ ∆_limit ≥ ∆_out(v)

Figure 1.5 shows an example transform where ∆_limit = 2. ∆_limit is chosen by the runtime algorithm. A smaller ∆_limit enables load-balancing nodes across PEs. However, a smaller ∆_limit increases the depth of fanin and fanout trees, which increases the graph-step latency, L_{dataflow}. So we maximize ∆_limit subject to the constraint that PEs can be load balanced. Perfectly load-balanced PEs have |P_i| = |S_i| = |E|/N_{PEs}. For each decomposed node, v, we want ∆_in(v) ≤ |E|/N_{PEs} and ∆_out(v) ≤ |E|/N_{PEs}, so we set:

∆_limit = |E| / N_{PEs}
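The choice of ∆_limit and the degree constraint it imposes can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and rounding |E|/N_{PEs} up to an integer is an assumption, since the text does not say how a fractional value is handled.

```python
import math
from typing import Dict, List


def choose_delta_limit(num_edges: int, num_pes: int) -> int:
    """Delta_limit = |E| / N_PEs, rounded up (the rounding is an assumption)."""
    return max(1, math.ceil(num_edges / num_pes))


def nodes_needing_decomposition(deg_in: Dict[str, int],
                                deg_out: Dict[str, int],
                                limit: int) -> List[str]:
    """Nodes that violate Delta_limit >= in-degree and Delta_limit >= out-degree."""
    return sorted(v for v in deg_in
                  if deg_in[v] > limit or deg_out[v] > limit)


# Hypothetical 4-node graph with 10 edges, targeting 4 PEs.
deg_in = {"a": 1, "b": 6, "c": 2, "d": 1}
deg_out = {"a": 4, "b": 1, "c": 2, "d": 3}
num_edges = sum(deg_in.values())

limit = choose_delta_limit(num_edges, num_pes=4)
too_big = nodes_needing_decomposition(deg_in, deg_out, limit)
```

With 10 edges and 4 PEs the limit is 3, so node "a" (out-degree 4) and node "b" (in-degree 6) would have to be decomposed, while "c" and "d" are left as they are.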

For the purpose of maximizing memory load balance, it is fine to have one node take up all or most of a PE. The load balancer can use the many small nodes to pack PEs evenly.

[Figure 7.3: plot of Memory for Undecomposed Graph / Memory for Decomposed Graph for each Application.Graph; applications shown: Push-Relabel, Spatial Router, Bellman-Ford, ConceptNet.]

Figure 7.3: The ratio of memory required by the undecomposed graph to memory required for the decomposed graph for each application and graph. RMS is on the right. Uniform PE memory sizes are assumed.

Figure 7.3 shows, for each application and graph, the ratio of memory required when decomposition is not performed to memory required with decomposition. This assumes that memory per PE is uniform, as it is in our logic architecture. The average memory reduction due to decomposition is 10 times, and the maximum is 27 times.

Unlike memory requirements, the amount of computation work for a node in a particular graph-step is dynamic. In general, only a subset of nodes is active in each graph-step, so the observed computation load balance is not the same as the memory load balance and differs from graph-step to graph-step. Large decomposed nodes that take up most of a PE could potentially ruin the computation load balance. The amount of message traffic also contributes to T_step. The node-to-PE placement algorithm tries to place fanin and fanout nodes on the same PE as their descendants. Smaller nodes may result in less message traffic due to fewer long-distance messages, which lowers R_net. However, large nodes mean the depth of fanin and fanout trees is smaller, resulting in a lower L_{dataflow} latency.

[Figure 7.4: plot of Normalized Runtime versus ∆_limit / (|E| / N_{PEs}); series: min, RMS, max.]

Figure 7.4: Root-mean-square along with min and max across applications and graphs of T_step for a range of ∆_limits. ∆_limit is normalized to |E|/N_{PEs}.

Figure 7.4 plots the root mean square of normalized T_step across all graphs for a range of ∆_limits; ∆_limit is normalized to |E|/N_{PEs}. The choice with the best RMS is the simplest: ∆_limit = |E|/N_{PEs}. Figure 7.5 shows the effect of varying ∆_limit on various contributors to graph-step time, where ∆_limit is normalized to |E|/N_{PEs}. This figure shows that L_{dataflow} is the dominating contributor. For ∆_limit < |E|/N_{PEs} the depth of fanin and fanout trees increases, so L_{dataflow} grows. For ∆_limit > |E|/N_{PEs} it is impossible to perfectly load-balance nodes across PEs. Figure 7.6 shows the number of fanin and fanout levels across decomposed nodes.
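To see why a smaller ∆_limit deepens the fanin and fanout trees, the sketch below counts the extra tree levels needed for one high-degree node. It assumes a balanced tree in which every tree node uses its full ∆_limit inputs; that tree shape, the function name `tree_depth`, and the degree of 1000 are illustrative assumptions and are not taken from the text.

```python
def tree_depth(degree: int, limit: int) -> int:
    """Extra levels of fanin (or fanout) nodes needed so that no node
    exceeds `limit` edges, assuming a balanced tree."""
    if limit < 2:
        raise ValueError("limit must be at least 2 for a tree to reduce degree")
    depth = 0
    remaining = degree
    while remaining > limit:
        # Each added level groups `limit` edges under one tree node.
        remaining = -(-remaining // limit)  # ceiling division
        depth += 1
    return depth


# Depth for a hypothetical node of degree 1000 under several limits.
depths = {limit: tree_depth(1000, limit) for limit in (2, 10, 100, 1000)}
```

Under these assumptions the depth grows roughly logarithmically as ∆_limit shrinks, which matches the qualitative trend described above: a limit below |E|/N_{PEs} buys finer-grained load balancing at the cost of more message hops per graph-step.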
