Asynchronous Dynamic Load Balancing of Tiles

Tung Nguyen, Michelle Mills Strout, Larry Carter, Jeanne Ferrante
Computer Science and Engineering Department, UC San Diego

9500 Gilman Drive, San Diego CA 92093-0114 [email protected],fmstrout, carter, [email protected]

Extended Abstract

Many scientific computations have work-intensive kernels consisting of nested loops with a regular stencil of data dependences. For such loop nests, tiling [W96] is a well-known compiler optimization that can help achieve efficient parallelism. A loop nest of depth K can be represented as a K-dimensional Iteration Space Graph (ISG) [W96], where each point is an iteration, and edges between the points represent true, value-flow dependences between iterations. Tiling is a partitioning of the points of the ISG into units of regular size and shape. Tiles can be allocated to different processors to support parallelism. In this paper, we consider the simple but common case, shown in Figure 1(a), of a 2-dimensional ISG with a stencil of three dependences of distance 1. These point dependences give rise to tile dependences, which require synchronization and communication between tiles. We address the specific case where a tile needs data computed in the tile to its left and the tile below it to execute. Execution of tiles proceeds in a wavefront fashion: each processor will be at least one row ahead of its neighbor to the right. Rebalancing can prevent much larger imbalances.
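The wavefront order implied by the left and below tile dependences can be illustrated with a small sketch. It is not taken from the paper: the function name `wavefront_steps` is hypothetical, every tile is assumed to take one time unit, and processor assignment and communication cost are ignored.

```python
def wavefront_steps(rows, cols):
    """Earliest step at which each tile can run (assumes unit-time tiles).

    Tile (r, c) needs the tile to its left, (r, c - 1), and the tile
    below it, (r - 1, c). Row 0 is taken to be the bottom row.
    """
    step = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # A missing neighbor (ISG boundary) imposes no constraint.
            left = step[r][c - 1] if c > 0 else -1
            below = step[r - 1][c] if r > 0 else -1
            step[r][c] = max(left, below) + 1
    return step


if __name__ == "__main__":
    for row in reversed(wavefront_steps(3, 4)):  # print top row first
        print(row)
```

Under these assumptions tile (r, c) runs at step r + c. With one column per processor, a processor is therefore exactly one row ahead of its right neighbor at any step, which matches the lower bound stated above.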

Our target architecture is a distributed-address space machine such as an IBM SP2 or a network of workstations. Initially, each processor is allocated the same fixed number of columns of tiles in a block distribution. Each processor executes the tiles in its allocated columns one row at a time. If each processor has the same capacity and load, then this allocation should be efficient. However, if processor loads can vary over time, then it is desirable to adjust the allocation of columns dynamically so that each processor's share of the work matches its current load. Varying processor loads can arise because of an evolving irregular application or because of contention. We present the Left-Right Handoff Protocol, a simple dynamic load balancing algorithm which does not require centralized control and is independent of network configuration. Using our algorithm, a given processor communicates only with its left and its right neighbor in successive steps to determine if some columns of tiles should be reallocated, as illustrated in Figure 1(a). When the amount of rebalancing is below a certain threshold, no idle time is introduced; that is, the faster processors do not need to wait at barriers for the slower processors.
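As a rough illustration of the initial block distribution and of a neighbor-to-neighbor reallocation decision, consider the sketch below. The abstract does not specify a rescheduler, so `columns_to_move` is a purely hypothetical policy that splits the columns of two neighbors in proportion to their measured speeds. Spreading leftover columns over the first processors in `block_distribution` is likewise an assumption.

```python
def block_distribution(num_cols, num_procs):
    """Return (first_col, end_col_exclusive) for each processor."""
    base, extra = divmod(num_cols, num_procs)
    bounds, start = [], 0
    for p in range(num_procs):
        # Assumption: leftover columns go to the lowest-numbered processors.
        width = base + (1 if p < extra else 0)
        bounds.append((start, start + width))
        start += width
    return bounds


def columns_to_move(cols_left, cols_right, tile_time_left, tile_time_right):
    """Hypothetical local rescheduler for a pair of neighbors.

    Returns the number of boundary columns to hand from the left
    processor to the right one (negative means right to left), chosen
    so that column counts are roughly proportional to processor speed.
    """
    total = cols_left + cols_right
    speed_left = 1.0 / tile_time_left
    speed_right = 1.0 / tile_time_right
    target_left = round(total * speed_left / (speed_left + speed_right))
    # Keep at least one column on each side.
    target_left = max(1, min(total - 1, target_left))
    return cols_left - target_left
```

For example, two neighbors holding four columns each, where the left one takes twice as long per tile, would hand one column to the right under this policy. Only the two neighbors' own measurements are used, so no global information is needed.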

There is an enormous amount of literature on dynamic load balancing [XL97]. Our scheme uses a local rescheduler; this has an advantage over methods using a global scheduler, such as [SS94], which can require a dedicated processor and global synchronization, and may not scale well to a large number of processors. Our technique falls in the class of deterministic iterative load-balancing techniques, more specifically diffusion methods, where the work can "diffuse" over time from processors with high "concentrations" of work to a less busy neighbor [XL94]. Diffusion methods can be further classified with respect to communication synchronization. Optimal solutions for synchronous diffusion techniques in the context of non-changing loads have been determined [Cy89], and in the case of changing loads, the variance can be bounded. However, the required synchronization may be costly. Asynchronous schemes have the potential to be faster, making them preferable in many cases. In the asynchronous case, convergence with non-varying loads can be guaranteed if communication delays are bounded; however, the rate of convergence and the optimal solution are not theoretically known [XL94]. Experimental results are therefore useful in determining the practicality of asynchronous schemes. In this paper, we present experiments to determine the efficiency and time to convergence of our asynchronous load-balancing scheme on an application that fits our ISG assumptions.
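For readers unfamiliar with diffusion methods, a generic synchronous first-order diffusion step on a linear array of processors can be sketched as follows. This illustrates the general class of methods discussed above, not the Left-Right Handoff Protocol itself. The function name `diffusion_step` and the diffusion parameter `alpha = 0.25` are assumptions chosen for the example, and loads are treated as divisible real numbers.

```python
def diffusion_step(loads, alpha=0.25):
    """One synchronous diffusion step on a linear processor array.

    Each processor exchanges a fraction alpha of the load difference
    with each of its (at most two) neighbors. All updates use the
    loads from the start of the step, so total load is conserved.
    """
    n = len(loads)
    new_loads = list(loads)
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                new_loads[i] += alpha * (loads[j] - loads[i])
    return new_loads


if __name__ == "__main__":
    loads = [8.0, 0.0, 0.0, 0.0]
    for _ in range(200):
        loads = diffusion_step(loads)
    print(loads)  # approaches the uniform distribution
```

Every processor must read its neighbors' loads from the same step, which is the synchronization cost noted above. An asynchronous scheme drops that requirement.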

The Left-Right Handoff Protocol is given by the finite state machine diagram in Figure 1(b). Both a processor, L, and the processor to its right, R, use the same diagram, although they may be in different states

sends the data needed by R for its next row, and R waits to receive the data. The processors then return to the Listening state. Each processor has a checkpoint in its current row. Prior to reaching the checkpoint, a processor can service a request to reschedule. However, if a processor reaches its checkpoint before a request is received, execution always continues to the last tile of that row. The states of the finite state machine in Figure 1(b) are partitioned according to whether the processor has neither sent nor received a message; has either sent or received, but not both; or, finally, has both sent and received.

Intuitively, processor R starts executing tiles in its current row in the Listening state, checking either for a checkpoint message from L or for whether its own checkpoint has been reached. If a message is received, then R knows that L is ahead in its current row, and moves to the Rescheduling state to call its local rescheduler to make a decision. It then sends a rescheduling message to L (which includes the data below the reallocated tiles), and proceeds in the Finish-Row state to finish the row's remaining work, which may have changed as a result of the rescheduling. When that work is completed, a Left-Right Data Transfer is performed. On the other hand, if R reaches its checkpoint before it has received a message from L, it sends L a checkpoint message, and proceeds in the Fast-Forward state to execute its current row. After completing this row, R waits to receive a message from L, which is either a rescheduling or a checkpoint message. In the case of a rescheduling message, R moves to the Rescheduling state and processes the message. In both this case and the case of a checkpoint message, R then performs a Left-Right Data Transfer before it returns to the Listening state.
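The behavior of R described above can be written as a small transition table. This is one reading of the prose, not a transcription of Figure 1(b). The event names, the split of Rescheduling into `RESCHED_DECIDE` (R runs its rescheduler) and `RESCHED_APPLY` (R processes a rescheduling message from L), and the assumption that messages arriving during Fast-Forward are handled only after the row completes are all illustrative choices.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    RESCHED_DECIDE = auto()   # R calls its local rescheduler
    FINISH_ROW = auto()
    FAST_FORWARD = auto()
    RESCHED_APPLY = auto()    # R processes L's rescheduling message
    DATA_TRANSFER = auto()    # Left-Right Data Transfer


# (state, event) -> (next state, message sent to L or None)
TRANSITIONS = {
    (State.LISTENING, "recv_checkpoint"): (State.RESCHED_DECIDE, None),
    (State.LISTENING, "at_checkpoint"): (State.FAST_FORWARD, "checkpoint"),
    (State.RESCHED_DECIDE, "decided"): (State.FINISH_ROW, "resched"),
    (State.FINISH_ROW, "row_done"): (State.DATA_TRANSFER, None),
    # Assumption: these two events are delivered after the row is done.
    (State.FAST_FORWARD, "recv_checkpoint"): (State.DATA_TRANSFER, None),
    (State.FAST_FORWARD, "recv_resched"): (State.RESCHED_APPLY, None),
    (State.RESCHED_APPLY, "applied"): (State.DATA_TRANSFER, None),
    (State.DATA_TRANSFER, "transfer_done"): (State.LISTENING, None),
}


def run(events, state=State.LISTENING):
    """Replay events; return the final state and messages sent to L."""
    sent = []
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from {state.name} on {event!r}")
        state, message = TRANSITIONS[key]
        if message is not None:
            sent.append(message)
    return state, sent
```

Replaying `["recv_checkpoint", "decided", "row_done", "transfer_done"]` follows the path where L is ahead, while `["at_checkpoint", "recv_resched", "applied", "transfer_done"]` follows the path where R reaches its checkpoint first. In this sketch both return to `LISTENING` after R has sent exactly one message.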

In the full paper, we experiment with a variety of checkpoint locations and reschedulers. We compare the execution times of no load balancing, our Left-Right Handoff Protocol, and a version which performs global synchronization at the end of every row, on both varying and static but unequal loads. We also measure the time to convergence on static but non-uniform loads.

References

[Cy89] G. Cybenko. "Load balancing for distributed memory multiprocessors". J. of Par. and Distr. Computing, 7, 1989.

[SS94] B. S. Siegell and P. Steenkiste. "Automatic generation of parallel programs with dynamic load balancing". IEEE Symp. on High Performance Dist. Computing, Aug. 1994.

[W96] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.

[XL94] C.-Z. Xu and F. C. M. Lau. "Iterative dynamic load balancing in multicomputers". J. of Operational Research Society, 45(7):786-796, 1994.

[XL97] C.-Z. Xu and F. C.M. Lau. Load Balancing in Parallel Computers. Kluwer, 1997.

Acknowledgement: We gratefully acknowledge the guidance of Professor Fran Berman in her CSE 260 class.

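[Figure 1. (a) Dependence pattern, with columns of tiles allocated to processors P0, P1, P2, and P3. (b) Finite state machine for the Left-Right Handoff Protocol, with states Listening, Fast-Forward, Rescheduling, Finish-Row, and Left-Right Data Transfer. Transitions are labeled with the checkpoint and rescheduling messages sent or received, and states are grouped by message status: no send, no receive; send, no receive; receive, no send; send, receive.]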