Data parallel load balancing strategies

Cyril Fonlupt
Laboratoire d'Informatique du Littoral
Université du Littoral, France

Philippe Marquet, Jean-Luc Dekeyser
Laboratoire d'Informatique Fondamentale de Lille
Université de Lille, France

December 17, 1996

Abstract

Programming irregular and dynamic data-parallel algorithms requires taking data distribution into account. The implementation of a load balancing algorithm is quite a difficult task for the programmer. However, a load balancing strategy may be developed independently of the application, and the integration of such a strategy in the data-parallel algorithm may be left to a library or to a data-parallel compiler run-time. We propose load distribution data-parallel algorithms for a class of irregular data-parallel algorithms called stack algorithms. Our algorithms allow the use of regular and/or irregular communication patterns to exchange work between processors. The results of a theoretical analysis of these algorithms are presented. They allow a comparison of the different load balancing algorithms and the identification of criteria for the choice of a load balancing algorithm.


1 Introduction

Two parallel models coexist: data-parallelism and task parallelism. The good qualities of the data-parallel model are recognized [2]. Nevertheless, this model is sometimes reproached for its simplicity and its rigidity. In particular, applications handling irregular data structures do not seem able to benefit from data-parallel programming. Our definition of data-parallelism goes against this idea.

Data-parallelism is the expression of single-flow algorithms. These algorithms consist in a sequence of elementary instructions applied to scalar or parallel data. Most data-parallel languages confuse data-parallel data structures with regular data structures like arrays; Fortran 90 [3, 19] and High Performance Fortran (HPF [11]) are examples of such languages. However, new language proposals attempt to widen this notion to irregular structures [7] and to take into account algorithms such as molecular dynamics simulations or computational fluid dynamics solvers. It now clearly turns out that data-parallelism does not confine itself to array parallelism.

A data-parallel structure defines a virtual machine that consists of:

elemental virtual processors. Each processor owns an instance of each parallel variable. Data-parallel operations are locally applied on each processor;

communication links between processors. A communication link represents a dependence between the elemental data owned by two different processors. These links allow the exchange of values between processors.

We study the data-parallel implementation of a class of irregular algorithms called "stack algorithms". A stack algorithm simulates the temporal evolution of a set of independent objects constituting the stack. Every object of the stack is handled with the same elemental algorithm. The treatment of an object modifies its characteristics; it may lead to the creation of a son object or to the destruction of the object. The irregularity of these algorithms comes from this dynamic variation of the stack during the execution time.
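As an illustration, here is a minimal sequential sketch of a stack algorithm in Python; the step routine and the size-splitting toy objects are hypothetical stand-ins for an application-specific elemental algorithm, not something taken from the paper.

# Minimal sequential sketch of a stack algorithm: every object is handled
# by the same elemental routine, which may create son objects (pushed back
# on the stack) or destroy the current object (by returning nothing).
def run_stack_algorithm(initial_objects, step):
    stack = list(initial_objects)
    while stack:
        obj = stack.pop()
        stack.extend(step(obj))  # creations make the stack vary dynamically

# Toy elemental algorithm (hypothetical): split an object of size > 1 into
# two sons, destroy objects of size 1.
run_stack_algorithm([8], lambda n: [n // 2, n - n // 2] if n > 1 else [])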

The tracking of particles in a detector, as in Monte Carlo simulation methods, is a prototype of stack algorithms [28]. Particles are followed step by step in the detector. At every step, a particle might collide and generate secondary particles, or disappear. New particles are stacked and their treatment is delayed. The tracking algorithm can easily be parallelized [14].

Tree search algorithms are another example of stack algorithms. Tree search is a central problem for solving different problems in Artificial Intelligence or Operations Research [16, 26]. Powley et al. [25, 24] have parallelized tree search using stack algorithms. A root node expands child nodes which themselves expand other nodes. Successive nodes are generated in parallel and are stacked in a distributed way.

We implement a stack algorithm by distributing the stack of objects on the P processors of the parallel machine.

We propose here dynamic data-parallel algorithms to ensure the load balance of the different local stacks on the processors. The implementation of a load balancing algorithm within a stack algorithm is ensured by a triggering policy. The importance of the triggering mechanism is essential [15]. The triggering mechanisms for our load balancing algorithms have been studied in [8].

Content of the Paper

We introduce our data-parallel model of computation in the next section. We then present two families of load balancing algorithms: irregular and regular algorithms. Irregular algorithms trigger load transfers between any couple of processors while regular algorithms restrict load transfers to the processors of the neighborhood. A comparison of our algorithms based on a theoretical analysis of their behavior allows the identification of criteria for the choice of a load balancing algorithm. The extension of our algorithms to other data-parallel applications and their integration in data-parallel language compilers are finally examined.

2 Our Parallel Model of Computation

This section presents the parallel machine and the data-parallel language used to describe our algorithms. Some useful notations are also introduced.

2.1 A Parallel Machine

We are working on a parallel virtual machine of P processors supporting the SPMD execution model. Each of the P processors is numbered by a unique index between 0 and P − 1.

Communication links connect processors. Two processors can communicate through two virtual networks:

1. A virtual irregular communication network. It is able to manage any communication pattern independently of the communication links.

2. A virtual regular communication network described later in this article. It is exclusively supported by the communication links.

2.2 Processor Workload

In order to estimate the unbalance of the system, we need a function to compute the load of the processors. In our case, the load of a processor is the number of data elements it owns. The load function is denoted w.

The total workload of the system, $W$, is of course equal to:

$$W = \sum_{i=0}^{P-1} w[i]$$

The average workload of the system is $\overline{w} = W/P$.

2.3 A Data-Parallel Language

In order to describe our algorithms, we use an extension of the data-parallel language proposed by Hillis and Steele [13]. The instructions are classified into two types:

1. General data-parallel instructions.

2. Data-parallel instructions managing the parallel stack. These instructions realize load transfers between processors.

2.3.1 Variables

Our data-parallel language distinguishes two types of variables. Furthermore, the parallel stack is used in our algorithms:

A parallel variable is distributed on all processors. It is made of P elementary values. The instance of the parallel variable x[ ] on the processor k will be denoted x[k].

A scalar variable consists in a single value. It is accessible on each processor.

The parallel stack is an implicit parallel object distributed on all the processors. On a given processor, the stack contains the local objects. For a given processor k, w[k] is the size of the local stack.

2.3.2 Data-Parallel Instructions

The data-parallel instructions of the language are described in the following paragraphs.

Data-Parallel Section The instruction

for all k in parallel do c od

triggers a data-parallel activity. It makes all the processors active; they all execute the same data-parallel code c. This code may include any of the data-parallel instructions described below. In this code c, k indicates the processor index.

At any time the processor activity can be modified by an if then else construction on a parallel variable. Furthermore, a data-parallel while modifies the activity for the execution of the while body. Only those processors that evaluate the controlling expression to true will be active for the current iteration of the body of the while. The data-parallel while iterates while at least one processor is active.

Data-Parallel Reductions The instruction

s := sum(par[k])

is used to compute the sum of a parallel variable par[ ] over all active processors. The result is a scalar variable. The instruction

s := count()

returns the number of active processors.

result[k] := rank(i[k])

k (index)   0  1  2  3  4
Activity    1  1  0  1  1
i[ ]        4  4  2  2  0
result[ ]   2  3  ?  1  0

Figure 1: The rank instruction

result[k] := a[b[k]]

k (index)   0  1  2  3  4
Activity    1  1  0  1  1
b[ ]        2  1  2  4  1
a[ ]        6  8  0  7  7
result[ ]   0  8  ?  7  8

Figure 2: Indirection in communications

Data-Parallel Enumeration The instruction

par[k] := enumerate()

associates each active processor with a unique integer in the range [0, count() − 1]. The enumeration starts with the processor whose index is the lowest.

Data-Parallel Prefix The instruction

par2[k] := scanAdd(par1[k])

returns for each active processor the cumulative sum of the par1[ ] variable.

Data-Parallel Sort The instruction

par1[k] := rank(par2[k])

allows a sort of the active processors based on the local value of the par2[ ] variable. An example is displayed in Figure 1.

Data-Parallel Communications The communications are implicitly expressed in our language. For example, the operation x[k] := y[k+1] will cause every active processor k to fetch the value of y[ ] from its successor. (The successor is the successor by the index.) A communication may be embedded in another instruction, allowing indirect communications (see Figure 2 for an example).

2.3.3 Load Transfers

Data-Parallel Send The instruction

send(dest[k], size[k])

sends size[ ] elements from the local stack of every active processor to the processor indexed by the dest[ ] parallel variable. These elements are removed from the local stack and stored on the remote stack.

Data-Parallel Get The instruction

receive(from[k], size[k])

is the reverse operation of the send() instruction. The data are removed from the remote stack and stored in the local stack. Note that the activity is associated with the receivers. These transfers are parallelized; the number of moved objects is not necessarily the same for any two processors.
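To make these primitives concrete, the following Python sketch simulates P local stacks and the send()/receive() transfers; the representation of work items as anonymous integers is our assumption, not the paper's.

# Toy simulation of the distributed parallel stack. stacks[k] is the local
# stack of processor k and w(k) its load, as defined in Section 2.2.
P = 12
stacks = [list(range(n)) for n in (0, 7, 4, 2, 1, 8, 7, 12, 9, 0, 0, 7)]

def w(k):
    return len(stacks[k])

def send(src, dest, size):
    # Remove `size` elements from the local stack of `src` and store them
    # on the remote stack of `dest`.
    for _ in range(min(size, w(src))):
        stacks[dest].append(stacks[src].pop())

def receive(dst, src, size):
    # Reversed operation of send(): the activity belongs to the receiver.
    send(src, dst, size)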

3 Irregular Load Balancing Redistribution Strategies

The load balancing mechanism sends stack elements from processors to other processors. For a given processor, when the elements can migrate all over the system independently of the physical links, the strategy is called irregular. As Baumgartner et al. [1] pointed out, an irregular communication pattern allows a better redistribution of work. As a rule, communication cost depends heavily on the complexity of the communication pattern.

In the following paragraphs, we present several irregular redistribution mechanisms and we give an upper bound of the complexity of these algorithms.

3.1 The Central Algorithm

The Central redistribution algorithm is based on the works of Powley [24] and Hillis [12]. Firstly, the average workload of the system is computed and broadcast to every processor in the system. Then the processors can be classified into three classes:

the idle processors (they have no data to compute);

the overloaded processors (their load is above the average workload w̄);

the other ones.

The algorithm tries to match each overloaded processor with an idle peer. The Central policy is a two-threshold policy (0, w̄). This means that the ping-pong effect of the one-threshold policy is avoided [10, 27].

The pseudo-code of the Central algorithm is presented in Figure 3. Figure 4 presents the system before the redistribution phase and the different steps of an execution. It is not difficult to prove that, as a rule, the Central scheme is not asymptotically convergent towards the uniform work distribution.

for all k in parallel do
    Initializations:
        W := sum(w[k])
        threshold := W/P
        rcv[k] := nil
        rendezvous[k] := nil
    Idle processor enumeration:
        if w[k] = 0 then
            dest[k] := enumerate()
            rcv[dest[k]] := k
        fi
    Overloaded processor enumeration:
        if w[k] > threshold then
            friend[k] := enumerate()
            rendezvous[k] := rcv[friend[k]]
        fi
    Redistribution:
        if rendezvous[k] ≠ nil then
            send(rendezvous[k], w[k]/2)
        fi
od

Figure 3: Pseudo-code of the Central algorithm

k            0  1  2  3  4  5  6  7  8  9 10 11
Initial w[k] 0  7  4  2  1  8  7 12  9  0  0  7
Final w[k]   3  4  4  2  1  4  4 12  9  4  3  7

Figure 4: The different steps of an execution of the Central algorithm (idle processor enumeration, overloaded processor enumeration, redistribution).
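A sequential Python re-reading of one Central step may help; the list w plays the role of the load function, and the pairwise matching reproduces the enumerations of Figure 3.

# One step of the Central policy: enumerate idle and overloaded processors,
# match them pairwise, and let each matched overloaded processor send half
# of its load to its idle partner. Unmatched processors keep their load.
def central_step(w):
    threshold = sum(w) // len(w)
    idle = [k for k, load in enumerate(w) if load == 0]
    overloaded = [k for k, load in enumerate(w) if load > threshold]
    for giver, taker in zip(overloaded, idle):
        half = w[giver] // 2
        w[giver] -= half
        w[taker] += half
    return w

print(central_step([0, 7, 4, 2, 1, 8, 7, 12, 9, 0, 0, 7]))
# -> [3, 4, 4, 2, 1, 4, 4, 12, 9, 4, 3, 7], the final workload of Figure 4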

Figure 5: Pseudo-code of the Rendez-Vous algorithm

3.2 The Rendez-Vous Algorithm

The Rendez-Vous redistribution algorithm we propose has a more powerful matching scheme than the Central algorithm. In contrast to the Central mechanism, the Rendez-Vous scheme allows a matching between extreme processors: very heavily loaded nodes will be matched with the most lightly loaded ones and, in the same way, processors just above the average load will exchange work with processors just under the average load.

In order to realize this exchange of work, irregular communications occur in the system. Some of the nodes play the role of mailboxes and allow the other processors to exchange their addresses. The aim of this technique is to achieve a very smooth distribution of the overall load.

The pseudo-code of the Rendez-Vous algorithm is summed up in Figure 5. Note that the two rank() instructions can be factorized into one single rank() operation. Figure 6 presents the initial step of the algorithm and the exchange of work among the processors for a given execution of the Rendez-Vous algorithm.

With an analytical method, we show that the Rendez-Vous scheme converges towards the asymptotical optimum [6]. Furthermore, on a system of P processors, at most log₂(P) Rendez-Vous iterations are needed to reach a balanced steady state.

k            0  1  2  3  4  5  6  7  8  9 10 11
Initial w[k] 0  7  4  2  1  8  7 12  9  0  0  7
Final w[k]   6  4  5  4  4  4  5  6  5  4  4  6

Figure 6: Execution of the Rendez-Vous algorithm (underloaded processor sort, overloaded processor sort, redistribution, final workload).

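The matching itself can be sketched as follows; the explicit sorting replaces the mailbox-based address exchange of the data-parallel version, and the amount moved per pair is our assumption rather than a value fixed by the paper.

# One step of the Rendez-Vous matching: the most heavily overloaded
# processor is paired with the most lightly loaded one, and each pair
# moves both of its members towards the average load.
def rendez_vous_step(w):
    avg = sum(w) // len(w)
    over = sorted((k for k, l in enumerate(w) if l > avg), key=lambda k: -w[k])
    under = sorted((k for k, l in enumerate(w) if l < avg), key=lambda k: w[k])
    for giver, taker in zip(over, under):
        amount = min(w[giver] - avg, avg - w[taker])
        w[giver] -= amount
        w[taker] += amount
    return w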

3.3 The Random Algorithm

The Random algorithm is based on a simple mechanism. Each time an element is created on a processor, it is sent to a randomly selected node anywhere in the system. For each node, the expectation of receiving a part of the load is the same regardless of its place in the system.

This redistribution scheme maintains a permanently balanced state as long as all local stacks are only growing. The destruction of elements on the stacks leads to a permanent unbalanced state which will never be corrected by the Random algorithm: as a matter of fact, in the case of deletion of elements in the stack, the Random algorithm is unable to react.

Furthermore, even if this scheme theoretically seems to have a good behavior [30], it shows experimentally a "not-so-good" behavior due to the high volume of communications it induces. In fact, if the program tends to generate lots of data, the redistribution wastes more and more time communicating.
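The mechanism fits in a few lines; this sketch assumes a shared list of local stacks as in the earlier simulation.

import random

# Random policy: every newly created element is sent to a uniformly chosen
# node, so each node has the same expectation of receiving it. Deletions,
# by contrast, are never compensated: the policy only reacts to creations.
def place_new_element(stacks, element):
    random.choice(stacks).append(element)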

4 Regular Pattern Communication Strategies

For strategies with regular communications, a neighborhood notion is defined over the network topology. The neighborhood is the set of linked processors for a given virtual topology. For a multi-grid topology, a regular communication is characterized by a parallel communication at the same distance in the same direction. For example, the instruction

x[k] := y[(k + a) mod P]

where a is a scalar value, denotes a regular communication on a ring of P processors.
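A regular communication is easy to emulate; the sketch below renders the shift above on a Python list standing for the ring.

# Regular communication x[k] := y[(k + a) mod P] on a ring of P processors:
# every active processor fetches at the same distance, in the same direction.
def regular_shift(y, a):
    P = len(y)
    return [y[(k + a) % P] for k in range(P)]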

For a multi-grid topology, regular communications use the neighborhood network. A regular communication is, as a rule, more efficient than an irregular communication. However, the number of regular communication patterns is smaller than the total number of permutations (irregular patterns): a priori, one step of a regular load balancing scheme will improve the state of the system less. Figure 7 presents an example of the difference between regular and irregular communications. For the same topology, only one step is necessary for an irregular communication pattern to redistribute the workload while two steps are needed for a regular communication pattern.

In the following paragraphs, we introduce several regular communication algorithms.

4.1 The Tiling Algorithm

The Tiling algorithm divides the system into small, disjoint sub-domains of processors called windows. A perfect load balancing is realized in each window using regular communications. In order to propagate the work over the entire system, the window is shifted for the next load balancing phase.

We describe here the algorithm for a 1D ring of processors. In that case, a window consists of a group of 2 connected processors. A perfect load balancing is realized in each window with two regular communications. For an n-dimensional grid, the window consists of 2^n processors, 2n regular communications ensure a perfect load balance in each window, and then a shift of the windows is applied.

Figure 7: Difference between a regular and an irregular communication pattern (initial workload; one irregular step versus two regular steps).

Figure 8: First and second windows on a ring of 8 processors (k = 0..7).

for all k in parallel do
    First windows:
        if (k mod 2) = 0 then
            average[k] := w[k] + w[k+1]
            average[k] := average[k]/2
            if w[k] > average[k] then
                First regular communication:
                    send(k+1, w[k] - average[k])
            else
                Second regular communication:
                    receive(k+1, average[k] - w[k])
            fi
        fi
    Second windows:
        if (k mod 2) = 1 then
            average[k] := w[k] + w[k+1]
            average[k] := average[k]/2
            if w[k] > average[k] then
                First regular communication:
                    send(k+1, w[k] - average[k])
            else
                Second regular communication:
                    receive(k+1, average[k] - w[k])
            fi
        fi
od

Figure 9: Pseudo-code of the Tiling algorithm on a ring of processors

(15)

Second link

Additionof a fourthlink Additionof a thirdlink

Figure 10: The X-Tiling algorithm: increasing the neighborhood cardinality

Figure 8 shows the division of the processors into windows on a ring of processors. Windows of 2 processors are created and the load is evenly balanced inside these "first" windows with 2 regular communications. After that, the windows are slightly shifted and the process is iterated. The pseudo-code of the Tiling algorithm can be found in Figure 9.

We prove that the load of each processor converges towards the uniform work distribution. For each processor, a sequence represents the load the processor is computing. It can be shown that this sequence is increasing and bounded. This implies that the load of each processor converges towards the average load of the system.
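One Tiling phase on a ring can be sketched as follows; the handling of odd totals (the extra unit stays on the right processor, as in Figure 9) is the only detail added here.

# One Tiling phase: disjoint windows of two neighboring processors average
# their loads with regular communications; shifting `offset` by one moves
# the windows for the next phase and lets the load propagate.
def tiling_phase(w, offset):
    P = len(w)
    for k in range(offset, offset + P, 2):
        a, b = k % P, (k + 1) % P
        total = w[a] + w[b]
        w[a], w[b] = total // 2, total - total // 2  # perfect balance in the window
    return w

w = [0, 8, 0, 8, 0, 8, 0, 8]
w = tiling_phase(w, 0)  # "first" windows (0,1), (2,3), ...
w = tiling_phase(w, 1)  # shifted "second" windows (1,2), (3,4), ...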

4.2 The X-Tiling Algorithm

Luling et al. [18] note that strategies using neighborhood communications are not very efficient for systems with a high number of processors. Note in particular that when the distribution of work is partitioned into underloaded areas and overloaded areas, it takes quite a long time for the Tiling algorithm to propagate the work from overloaded processors to underloaded ones.

We propose to increase the cardinality of the neighborhood in order to increase the number of regular communication patterns. We add some links to the topology to increase the number of processors in the neighborhood. The links are added so that the topology is embedded into a hypercube topology (Figure 10). The hypercube is the best trade-off between the number of links and the number of steps needed to connect any two processors.

The X-Tiling algorithm is similar to the Tiling algorithm. Successively, all processors in the neighborhood are balanced. The extension of the topology allows the regularity of the communications to be maintained: we keep a very regular communication pattern characterized by communications in the same direction at the same distance.

Figure 11: Pseudo-code of the Rake algorithm on a ring of processors

The load vector at time t+1 is the result of the product of the exchange matrix by the load vector at time t. As a rule, the matrix has some strong properties and the convergence can be proven. Furthermore, the speed of convergence of the algorithm is directly proportional to the second largest eigenvalue of the exchange matrix. This property allows us to evaluate and to bound the speed of convergence. We show that, by using only regular communications, the X-Tiling scheme will converge in less than log₂(P) calls to the load balancing algorithm [6].
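Written out, this analysis takes the classical diffusion form; the display below is our notation (after the diffusion analyses of [4, 33]), assuming a symmetric, doubly stochastic exchange matrix $M$, and is not a formula reproduced from the paper:

$$w^{(t+1)} = M\,w^{(t)}, \qquad \bigl\| w^{(t)} - \overline{w}\,\mathbf{1} \bigr\|_2 \;\le\; \lambda_2(M)^{\,t}\,\bigl\| w^{(0)} - \overline{w}\,\mathbf{1} \bigr\|_2$$

where $\mathbf{1}$ is the all-ones vector and $\lambda_2(M)$ the second largest eigenvalue modulus of $M$.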

4.3 The Rake Algorithm

The Tiling and X-Tiling algorithms are regular algorithms using "elementary" exchanges of load between processors. We now consider regular algorithms that use multiple regular communications to achieve a better load balance of the system.

The aim of the Rake algorithm is to redistribute the work evenly over all the processors. It is well suited for multi-dimensional grids.

The Rake algorithm uses only regular communications with the processors in its neighborhood set. A number of regular communications are realized to achieve a good redistribution of work. We describe here the Rake algorithm for a 1D ring of processors (Figure 11). In that case, the neighborhood set of a processor consists of two processors: the right processor and the left processor.

The total workload of the P processors is $W = \overline{w} \cdot P + r$, with $\overline{w} = \lfloor W/P \rfloor$ and $r = W \bmod P$.

Figure 12: Execution of the Rake algorithm on a ring of processors (P = 5, W = 13, w̄ = 2, r = 3; first transfers, second transfers, final workload).

Figure 13: The implementation of the local stacks as a "FIFO" allows neighborhood conservation

After one application of the Rake algorithm, P − r processors own w̄ data and r processors own w̄ + 1 data.

Firstly, the average load w̄ is computed. In a first transfer phase, during P iterations, each processor gives to its right neighbor the data over the average workload. After this phase, at least P − r processors own w̄ loads. In a second transfer phase, during r iterations, each processor gives to its right neighbor the data over w̄ + 1. This ensures a perfect distribution of the load. Figure 12 shows the main steps of an execution of the Rake algorithm on a ring of processors.
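A synchronous Python re-reading of the two phases (the body of Figure 11 is lost above) might look as follows; the round structure mirrors the P and r iterations of the text.

# Rake on a ring: P rounds push everything above the average w̄ to the
# right neighbor, then r rounds push everything above w̄ + 1. Afterwards,
# P - r processors own w̄ elements and r processors own w̄ + 1.
def rake(w):
    P, W = len(w), sum(w)
    avg, r = W // P, W % P
    for bound, rounds in ((avg, P), (avg + 1, r)):
        for _ in range(rounds):
            excess = [max(load - bound, 0) for load in w]  # snapshot of the round
            for k in range(P):
                w[k] -= excess[k]
                w[(k + 1) % P] += excess[k]
    return w

print(sorted(rake([0, 7, 4, 2, 1, 8, 7, 12, 9, 0, 0, 7])))
# -> [4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5] (w̄ = 4, r = 9)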

In the case of a multi-dimensional grid, the previous step is done in each of the dimensions.

On numerous data-parallel algorithms, especially in image processing, the neighborhood notion has to be kept: a picture cannot be cut into several parts and be redistributed anywhere in the system. For example, in skeletonization algorithms on a parallel machine, some processors become idle while others own all the remaining work. A load balancing algorithm can greatly improve the performance of the parallel machine, but in the case of a work redistribution the locality has to be preserved. If the local stack on each processor is managed as a "FIFO" structure, we assert that "neighbor" data remain on the same processor or on a neighbor processor in the one-dimensional case. An example of such an execution is displayed in Figure 13.

We theoretically show [8] that, in the case of the Rake algorithm, the load of the processors converges to the average workload of the system in the one-dimensional case. In the case of the multi-grid, the Rake algorithm is also convergent, but the difference of load between any 2 processors in the system is bounded by the number of dimensions.

4.4 The Pre-Computed Sliding Algorithm

The Pre-Computed Sliding algorithm is an improvement of the Rake algorithm. Instead of transferring the data over the average workload of the system like the Rake scheme, it computes the minimal number of data exchanges needed to balance the load of the system. Unlike the Rake algorithm, the Pre-Computed Sliding may send data in both directions. We describe here the Pre-Computed Sliding for a linear arrangement of processors.

After one application of the Pre-Computed Sliding algorithm, the r first processors own w̄ + 1 loads and the last P − r processors own w̄ loads.

for all k in parallel do
    Initializations:
        W := sum(w[k])
        w̄ := W/P
        r := W mod P
        sum[k] := scanAdd(w[k])
        goal[k] := (k + 1) × w̄ + min(k + 1, r)
        transfer[k] := goal[k] − sum[k]
    Left transfer phase:
        while transfer[k] > 0 do
            possible[k] := min(transfer[k], w[k + 1])
            if possible[k] > 0 then
                receive(k + 1, possible[k])
                transfer[k] := transfer[k] − possible[k]
            fi
        od
    Right transfer phase:
        while transfer[k] < 0 do
            possible[k] := min(−transfer[k], w[k])
            if possible[k] > 0 then
                send(k + 1, possible[k])
                transfer[k] := transfer[k] + possible[k]
            fi
        od
od

Figure 14: Pseudo-code of the Pre-Computed Sliding algorithm

Figure 15: Execution of the Pre-Computed Sliding algorithm (P = 5, W = 13, w̄ = 2, r = 3; initial workload w[k] = 0 5 2 4 2, goal[k] = 3 6 9 11 13; left transfers, then right transfers).

A first communication phase realizes the needed load transfers in the left direction. A second communication phase ensures the right transfers.

The pseudo-code of the Pre-Computed Sliding is presented in Figure 14. Figure 15 presents an example of execution of the Pre-Computed Sliding algorithm.

As with the Rake algorithm, the implementation of the local stacks as a "FIFO" structure allows the Pre-Computed Sliding algorithm to keep a neighborhood topology for the loads.
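The pre-computation is essentially a prefix sum; the following Python sketch collapses the iterative transfer phases into direct moves across each link, ignoring the per-round availability bound possible[k] = min(...) of the data-parallel version, which affects only the schedule of the moves, not the final distribution.

# Pre-Computed Sliding on a line of processors: a scan of the loads fixes
# how many elements must cross each link and in which direction; the r
# first processors end with w̄ + 1 elements, the others with w̄.
def pre_computed_sliding(w):
    P, W = len(w), sum(w)
    avg, r = W // P, W % P
    # goal[k]: cumulative load that processors 0..k must own once balanced
    goal = [(k + 1) * avg + min(k + 1, r) for k in range(P)]
    prefix = 0
    for k in range(P - 1):
        prefix += w[k]
        transfer = goal[k] - prefix  # > 0: pull leftwards across link (k, k+1)
        w[k] += transfer             # < 0: push -transfer elements rightwards
        w[k + 1] -= transfer
        prefix = goal[k]             # the prefix now meets its goal exactly
    return w

print(pre_computed_sliding([0, 5, 2, 4, 2]))
# -> [3, 3, 3, 2, 2], the final workload of the Figure 15 example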

4.5 The Neighbor Algorithm

The Neighbor algorithm is an adaptation of several MIMD load balancing mechanisms [32, 29]. The target architecture is cut into elementary domains called islands. An island is made of a center processor and all the processors in its neighborhood. The aim of the Neighbor algorithm is to make the load of every node in the island equal. The partial overlapping between the islands allows the load to propagate.

Each center node computes the average load of its island. Overloaded suburban nodes try to give some of their work to the center node.

In a 2D topology, each node belongs to 4 islands as a suburban node, in addition to its own island as a center node. With an analytical method, we have shown that the load of each node tends asymptotically towards the average load of the system. Unfortunately, we were not able to evaluate an upper bound on the number of iterations needed to reach this state. Nevertheless, we have empirically shown that it is very slow for systems having a large number of processors.

As for the Tiling algorithm, the Neighbor algorithm may be extended by adding links between processors so that the processor topology is embedded in a hypercube.

5 Cost Prediction

When implementing a load balancing scheme, the programmer faces the following dilemma: the trade-off between cost and quality has to be optimized.

The cost corresponds to the complexity of one iteration of the algorithm.

The quality is the product of the cost of one iteration by the number of iterations needed to reach a steady state where each virtual processor owns the average load of the system.

Even if experimental results are important, it is still difficult to compare any two strategies [17]. In order to evaluate our strategies on a common basis, we have chosen to make a mathematical analysis [8]. It allows us to calculate the number of iterations (quality) needed to reach a steady state. The mathematical method evaluates the asymptotical behavior of the algorithms. In order to evaluate the cost versus quality dilemma, we use some parameters (t_x) presented below.

The t_x define the cost of some basic parallel instructions regardless of the target architecture. These values depend on the topology and the network of the virtual machine. For example, the cost t_s of a parallel sort of P × P values on a P × P grid is in O(P).

Table 1: Cost and quality for the different load balancing schemes for a virtual grid of P × P processors (columns: Algorithm, Load Balancing Cost for 1 Iteration, Cost for Convergence).

t_s is the cost of a parallel sort, like the rank() instruction. p_c is the probability of creation of a data item during a given period of time; it is used in the Random algorithm.

Table 1 sums up our main results and points out the cost and quality of our algorithms on a P × P processor grid. Some points are worth mentioning.

One iteration of the X-Tiling algorithm leads to a perfect load balance.

The Tiling and X-Tiling algorithms, which may seem similar, have a highly different quality (O(log₂ P) for X-Tiling and linear for Tiling).

Even if the Central and Rendez-Vous algorithms look very similar, we have proven that the first one is not convergent although the second one does converge in O(log₂ P).

The Pre-Computed Sliding algorithm is a real improvement over the Rake algorithm; cost and quality for Pre-Computed Sliding are always lower than for Rake.

The choice of a good load balancing algorithm is not an easy task. Two questions may arise: for a given architecture, which is the best redistribution scheme? For a given stack algorithm, which is the best load balancing algorithm?

A Load Balancing Scheme for a Given Architecture

An architecture is characterized by one (or several) communication networks. The ratio of regular to irregular communications will be a predominant guide in the choice of an algorithm (especially for parallel machines that simulate the irregular communications on a regular communication network). Once this ratio is known, an algorithm from the regular or irregular communication class can be selected.

A Load Balancing Scheme for a Given Stack Algorithm

We empirically define the spatial and temporal disorders. They characterize the behavior of stack algorithms.

The spatial disorder refers to the distribution of workload throughout the system. Either most of the load is spatially localized (low spatial disorder), or the load is "more" evenly distributed in the system (high spatial disorder).

In the case of high spatial disorder, a strategy with a global exchange of data will be the best choice. Diffusion schemes like Tiling will not be very efficient; techniques like X-Tiling are more suited to respond to high spatial disorder.

The temporal disorder describes the evolution of load between processors. A system with a high temporal disorder is characterized by great variations (creations and destructions of data) in the processors' work (i.e., the standard deviation between any two processors is high).

A Rendez-Vous-like algorithm may be used if there is a high spatial disorder. A "clever" matching policy allows a quick decrease of the spatial disorder. For example, a single application of the Rendez-Vous algorithm leads to a good distribution of work, even if it is not optimal. To sum up, in case of high spatial disorder, a load balancing algorithm with a good cost/quality ratio has to be selected.

Furthermore, the cost predictions presented in this section have been empirically verified by some experiments on a MasPar parallel computer [5].

6 Conclusion

We have studied the data-parallel implementation of stack algorithms and have proposed data-parallel algorithms to load balance the parallel implementation of stack algorithms. We have defined a parallel model of computation and have proposed a number of algorithms characterized by regular or irregular communication patterns. The cost and quality of each of our algorithms have been examined. They allow us to identify some criteria for choosing a redistribution scheme.

Among our algorithms, the Rake and Pre-Computed Sliding algorithms allow neighborhood conservation in one dimension. Others have proposed algorithms allowing neighborhood conservation in the multi-dimensional case [21, 20]. We are working towards the definition of such algorithms for which optimality may be proven [23, 22].

We are implementing our algorithms in a parallel library aimed at parallel programmers. A data-parallel language compiler may systematically generate calls to this load balancing library [31].

References

[2] Luc Bouge. The data parallel programming model: A semantic perspective. In The Data Parallel Programming Model, pages 4-26. Lecture Notes in Computer Science, Tutorial Series, vol. 1132, 1996.

[3] Walt Brainerd, Charlie Goldberg, and Jeanne Adams. Programmer's Guide to Fortran 90. Springer-Verlag, third edition, 1995.

[4] George Cybenko. Dynamic load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7, 1989.

[5] Jean-Luc Dekeyser, Cyril Fonlupt, and Philippe Marquet. A data-parallel view of the load balancing, experimental results on MasPar MP-1. In Wolfgang Gentzsch and Uwe Harms, editors, High Performance Computing and Networking Conference, volume 797 of Lecture Notes in Computer Science, pages 338-343, Munich, Germany, April 1994.

[6] Jean-Luc Dekeyser, Cyril Fonlupt, and Philippe Marquet. Analysis of synchronous dynamic load balancing algorithms. In Parallel Computing: State-of-the-Art Perspective (ParCo'95), volume 11 of Advances in Parallel Computing, pages 455-462, Gent, Belgium, September 1995. Elsevier Science Publishers.

[7] Jean-Luc Dekeyser and Philippe Marquet. Supporting irregular and dynamic computations in data-parallel languages. In The Data Parallel Programming Model, pages 197-219. Lecture Notes in Computer Science, Tutorial Series, vol. 1132, 1996.

[8] Cyril Fonlupt. Distribution Dynamique de Données sur Machines SIMD. Thèse de doctorat (PhD thesis), Laboratoire d'Informatique Fondamentale de Lille, Université de Lille 1, December 1994. (In French).

[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.

[10] R. S. Harbus. Dynamic process migration: To migrate or not to migrate. Technical Report CSRI-42, University of Toronto, Toronto, Canada, July 1986.

[11] High Performance Fortran Forum. High Performance Fortran language specification. Scientific Programming, 2(1-2):1-170, 1993.

[12] W. Daniel Hillis. The Connection Machine. The MIT Press, Cambridge, MA, 1985. French translation: Masson, Paris, 1988.

[13] W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170-1183, December 1986.

[14] D. M. Jones and J. M. Goodfellow. Parallelization strategies for molecular simulation using the Monte Carlo algorithm. Journal of Computational Chemistry, 14(2):127-137, 1993.

[16] Vipin Kumar and V. Nageshwara Rao. Parallel depth first search (part II). Int'l Journal of Parallel Programming, 16(6), 1987.

[17] Peter Kok Keong Loh, Wen Jing Hsu, Cai Wentong, and Nadarajah Sriskanthan. How network topology affects dynamic load balancing. IEEE Parallel and Distributed Technology, pages 25-35, Fall 1996.

[18] R. Luling, B. Monien, and F. Ramme. Load balancing in large networks: A comparative study. In 3rd IEEE Symposium on Parallel and Distributed Processing, Dallas, 1991.

[19] Mike Metcalf and John Reid. Fortran 90 Explained. Oxford University Press, 1990.

[20] Serge Miguet and Jean-Marc Pierson. Dynamic load balancing in a parallel particle simulation. In High Performance Computing Symposium, pages 420-431, Montreal, Canada, July 1995.

[21] Serge Miguet and Yves Robert. Elastic load-balancing for image-processing algorithms. In H. Zima, editor, First Int'l ACPC Conference, pages 438-451, Salzburg, Austria, 1991. Springer-Verlag.

[22] David Nicol and David O'Hallaron. Improved algorithms for mapping pipelined and parallel computations. IEEE Transactions on Computers, 40(3):119-134, March 1994.

[23] David M. Nicol. Rectilinear partitioning of irregular data parallel computations. Journal of Parallel and Distributed Computing, 23(2):119-134, November 1994.

[24] Curt Powley, Chris Ferguson, and Richard Korf. Depth-first heuristic search on a SIMD machine. Artificial Intelligence, 60, 1993.

[25] Curt Powley, Chris Ferguson, and Richard E. Korf. Parallel tree search on a SIMD machine. In Third IEEE Symposium on Parallel and Distributed Processing, Dallas, TX, December 1991.

[26] V. Nageshwara Rao and Vipin Kumar. Parallel depth first search (part I). Int'l Journal of Parallel Programming, 16(6), 1987.

[27] A. Ross and B. McMillin. Experimental comparison of bidding and drafting load sharing protocols. In Proceedings of the 5th Distributed Memory Computing Conference, pages 968-974, April 1990.

[28] R. Y. Rubinstein. Simulation and the Monte-Carlo Method. John Wiley & Sons, New York, 1981.

[29] V. A. Saletore. A distributive and adaptive dynamic load balancing scheme for parallel processing of medium-grain tasks. In Proceedings of the 5th Distributed Memory Conference, pages 995-999, April 1990.

[32] Mark H. Willebeek-LeMair and Anthony P. Reeves. Strategies for dynamic load balancing on highly parallel computers. IEEE Transactions on Parallel and Distributed Systems, 4(9):979-993, September 1993.

[33] Cheng-Zhong Xu and Francis C. M. Lau. Analysis of the generalized dimension exchange method for dynamic load balancing. Journal of Parallel and Distributed Computing, 16:385-393, 1992.

[34] Cheng-Zhong Xu and Francis C. M. Lau. The generalized dimension exchange method for load balancing in k-ary n-cubes and variants. Journal of Parallel and Distributed Computing.
