An Analysis of Diffusive Load-Balancing
Raghu Subramanian    Isaac D. Scherson
raghu@ics.uci.edu    isaac@ics.uci.edu
Department of Information and Computer Science
University of California
Irvine, CA 92717-3425
U.S.A.
Abstract
Diffusion is a well-known algorithm for load-balancing in which tasks move from heavily-loaded processors to lightly-loaded neighbors. This paper presents a rigorous analysis of the performance of the diffusion algorithm on arbitrary networks.
We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's bandwidth.
For the case of the generalized mesh with wrap-around (which includes common networks like the ring, 2D-torus, 3D-torus and hypercube), we derive tighter bounds and conclude that the diffusion algorithm is inefficient for lower dimensional meshes.
* This research was supported in part by the Air Force Office of Scientific Research under grant number F49620-92-J-0126,
The load-balancing problem is as follows: Let $G$ be an undirected connected graph of $N$ nodes, and let $M$ tasks be scattered among the nodes. Redistribute the tasks such that each node ends up with either $\lfloor M/N \rfloor$ or $\lceil M/N \rceil$ tasks. The algorithm must run in a distributed fashion, i.e., each node's decisions must be based only on local knowledge. (This formulation is borrowed from [11].$^1$)
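The target of the problem statement can be made concrete with a small Python sketch. The helper names `balanced_targets` and `is_balanced` are hypothetical and only illustrate what a final distribution with $\lfloor M/N \rfloor$ or $\lceil M/N \rceil$ tasks per node looks like; they say nothing about how a distributed algorithm reaches it.

```python
def balanced_targets(total_tasks, num_nodes):
    """Return one admissible final distribution for M tasks on N nodes."""
    base, extra = divmod(total_tasks, num_nodes)
    # 'extra' nodes hold one additional task so that the counts sum to M.
    return [base + 1 if i < extra else base for i in range(num_nodes)]


def is_balanced(loads):
    """True if every node holds floor(M/N) or ceil(M/N) tasks."""
    total, n = sum(loads), len(loads)
    low = total // n
    high = -(-total // n)  # ceiling division on integers
    return all(x in (low, high) for x in loads)


example = balanced_targets(10, 4)
```

Any permutation of such a vector is equally acceptable, since the problem statement does not say which nodes receive the extra tasks.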
To motivate the above formulation, let us consider two applications in which the need for load-balancing arises. In these examples, observe that all the tasks have roughly the same execution time, and that the load-balancing phases attempt to equalize the number of tasks at each node. Also note that the high degree of parallelism makes a centralized load-balancing algorithm unviable.
Back-tracking: Consider a search space consisting of vectors where each component can assume a finite number of values. Back-tracking is an algorithm to find a vector in the search space that satisfies some feasibility condition. In back-tracking, the solution vector is constructed incrementally, component by component. A task corresponds to a partially constructed vector. Each partial vector (task) resides in the local memory of some processor.
The back-tracking algorithm alternates between expansion phases and load-balancing phases. In an expansion phase, each processor in parallel takes a partial vector (task), if any, from its local memory. If the partial vector is in fact complete, and also satisfies the feasibility condition, then the back-tracking algorithm returns the completed vector and terminates. If it is clear that the partial vector cannot be completed feasibly, then the partial vector vanishes from the local memory, and the task is said to have terminated unsuccessfully. Otherwise, several new partial vectors appear in the local memory of the same processor, corresponding to each way of extending the original partial vector by one more component, and the task is said to have spawned children.
In a load-balancing phase, partial vectors (tasks) are redistributed evenly among processors. If the load-balancing phase is omitted, then with each expansion phase, the distribution of tasks among processors gets skewed. Some processors get swamped with tasks, while others stay idle. This slows down the overall back-tracking algorithm.
If the load-balancing phase proves to be too expensive, then its cost may be amortized by invoking several expansion phases for every invocation of the load-balancing phase.
Solving PDEs: Consider a typical iterative algorithm to solve a partial differential equation (PDE). The problem is first discretized by partitioning space into regions, and the function is tentatively assumed to be constant within a region. During each iteration, each region in parallel gets the function values in the neighboring regions, and uses them to update its own function value. If, in the process, a region discovers that its function value differs drastically from the neighbors', then it realizes that the assumption that the function is constant within the region may not be valid, so it splits up into finer regions.
Here, a task corresponds to a region. Each task resides in the local memory of a processor. During an expansion phase, each processor picks a task, if any, from its local memory and executes it. This may result in additional tasks appearing in the local memory of a processor, reflecting that a region split up. During a load-balancing phase, tasks are redistributed evenly.
Other applications where load-balancing arises are branch-and-bound optimizations, theorem proving, interpretation of PROLOG programs, and ray tracing [8].
$^1$ Incidentally, the paper describes a surprising "application" of load-balancing. It is well known that permutation routing can be reduced to sorting. The authors show that general (many to many) routing can be reduced to sorting plus
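Network      Lower bound                      Upper bound
Ring         $\Omega(N^2 \log \sigma)$        $O(N^2 \sigma)$
2D-torus     $\Omega(N \log \sigma)$          $O(N \sigma)$
3D-torus     $\Omega(N^{2/3} \log \sigma)$    $O(N^{2/3} \sigma)$
Hypercube    $\Omega(\log N \log \sigma)$     $O(\sigma \log N)$
Figure 1. Bounds on the running time of the diffusion algorithm for certain common networks. $N$ denotes the number of nodes in the network, and $\sigma$ denotes the standard deviation ("imbalance") of the initial load distribution.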
Overview of Results
In this paper, we consider the well-known diffusion algorithm for load balancing. The principle behind the diffusion algorithm is that if a processor has more tasks than a neighbor, then it sends a few tasks to the neighbor. The number of tasks sent is proportional to the differential between the two processors, the proportionality constant being a characteristic of the connecting edge.
This paper presents a rigorous analysis of the performance of the diffusion algorithm on an arbitrary network. We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's electrical conductance and fluid conductance (defined in Section 3), which are measures of the network's bandwidth.
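If $N$ is the number of nodes in the network, $\Gamma$ is the network's electrical conductance, and $\sigma$ is the standard deviation ("imbalance") of the initial load distribution, then the running time of the diffusion algorithm is
$$\Omega\!\left(\frac{\log \sigma}{\Gamma}\right) \quad\text{and}\quad O\!\left(\frac{N \sigma}{\Gamma}\right).$$
If $\Phi$ is the network's fluid conductance, and $\sigma$ is as above, then the running time of the diffusion algorithm is
$$\Omega\!\left(\frac{\log \sigma}{\Phi}\right) \quad\text{and}\quad O\!\left(\frac{\sigma}{\Phi^2}\right).$$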
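For the special case of an $(n_1 \times n_2 \times \cdots \times n_d)$ mesh with wraparound, we provide the following tighter bound:
$$\Omega\!\left(\frac{d \log \sigma}{\sin^2\!\left(\pi / \max_{i=1,\ldots,d} n_i\right)}\right) \quad\text{and}\quad O\!\left(\frac{d\,\sigma}{\sin^2\!\left(\pi / \max_{i=1,\ldots,d} n_i\right)}\right). \qquad (1)$$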
Figure 1 gives a feel for Bound 1 by showing the form it assumes in certain common cases. From the table, it is clear that the diffusion algorithm is inefficient for lower dimensional meshes. For example, on a ring and a 2D-torus, the diffusion algorithm takes at least linear time, indicating that it is no better than a centralized algorithm (in which one processor collects all information and directs the load balancing).
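To see how the entries of Figure 1 arise from Bound 1, the factor $d / \sin^2(\pi / \max_i n_i)$, which multiplies $\log \sigma$ in the lower bound and $\sigma$ in the upper bound, can be evaluated numerically. The following is a minimal Python sketch; the function name `mesh_bound_factor` and the example dimensions are illustrative assumptions, not taken from the paper.

```python
import math


def mesh_bound_factor(dims):
    """Evaluate d / sin^2(pi / max n_i) for a mesh with side lengths dims."""
    d = len(dims)
    return d / math.sin(math.pi / max(dims)) ** 2


n_nodes = 1024
ring = mesh_bound_factor([1024])        # one dimension of length N
torus_2d = mesh_bound_factor([32, 32])  # two dimensions of length sqrt(N)
hypercube = mesh_bound_factor([2] * 10) # log2(N) dimensions of length 2

# Ratios against the growth rates listed in Figure 1.
ring_ratio = ring / n_nodes ** 2
torus_ratio = torus_2d / n_nodes
hypercube_ratio = hypercube / math.log2(n_nodes)
```

For large side lengths $\sin(\pi/n) \approx \pi/n$, so the ring factor behaves like $N^2/\pi^2$ and the 2D-torus factor like $2N/\pi^2$, while for the hypercube the sine equals one and the factor is just $d = \log_2 N$.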
Comparison with Prior Work
Traditional formulations of the load-balancing problem allow the tasks' execution times to differ [3]. These formulations are more general than ours, but with the generality comes intractability: the simplest such formulations turn out to be NP-complete. As a result, most work has consisted of proposing ad hoc
(1)  for iteration ← 1 to ∞ begin
(2)      All processors i parbegin
(3)          load[i] ← number of tasks at i
(4)          Broadcast load[i] to all neighbors
(5)          for each j that is i's neighbor begin
(6)              if load[i] > load[j] then
(7)                  Send P_ij (load[i] − load[j]) tasks to j
(8)          end
(9)      parend
(10) end
Figure 2. Algorithm for load-balancing (with divisible tasks)
We argue that the case of fixed-sized tasks is sufficiently important to merit study. First, as illustrated above, there are several highly parallel applications where the assumption of fixed-sized tasks is valid. Second, it has been argued that fixed-sized tasks adequately model variable-sized tasks which can be pre-empted when they exceed a certain time quantum [1]. Finally, the case of fixed-sized tasks is tractable and amenable to rigorous analysis.
Diffusion is only one of several load balancing algorithms that have been studied in the past [14]. The diffusion algorithm is studied in detail in [2] and [5]. Our presentation of the algorithm in Section 2 closely follows [5]. Our analysis of the algorithm in Section 3 extends the analysis in [5]. For example, [5] provides explicit bounds on the running time of the diffusion algorithm only in the case of a hypercube. In contrast, we provide bounds for an arbitrary network.
Both [2] and [5] make the simplifying assumption that tasks can be divided into arbitrary fractions. In Section 4 we show that this assumption raises thorny problems that cannot be glossed over. Then we suggest how the diffusion algorithm can be modified to handle indivisible tasks.
2 Diffusive Load Balancing (with Divisible Tasks)
In this section, we review the diffusion algorithm for load balancing. For simplicity, we assume that tasks are divisible into arbitrary fractions. (For example, we allow half a task to move across an edge, blithely ignoring that such a thing is meaningless.) Recall that the original aim was to end up with $\lfloor M/N \rfloor$ or $\lceil M/N \rceil$ tasks at each node. Now that we allow fractional tasks, the revised aim is to end up with exactly $M/N$ tasks at each node.
In Section 4, we will reconsider the indivisibility of tasks, and show how to modify the diffusion algorithm accordingly.
The intuition behind the diffusion algorithm is that if a processor has more tasks than a neighbor, then a few tasks diffuse to the neighbor. The number of tasks that diffuse is proportional to the difference in the number of tasks at the two processors. The proportionality constant is a characteristic of the connecting edge, and is called its diffusivity.
Figure 2 shows the diffusion algorithm in detail. Algorithm Diffuse makes use of an $N \times N$ diffusivity matrix, $P$, which satisfies the following conditions:
    $P_{ii} \ge 1/2$. (The number 1/2 is chosen only for simplicity: any positive constant will do. The import of the condition is that $P_{ii}$
    $P$ is symmetric: $P_{ij} = P_{ji}$
    $P$ is stochastic: $\sum_{j=1}^{N} P_{ij} = 1$
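As an illustration of these conditions, the following Python sketch builds a diffusivity matrix for a ring and checks it. It reads the first condition as $P_{ii} \ge 1/2$, which is an assumption since the relation is not fully legible in this copy; the ring weights (1/2 on the diagonal, 1/4 on each edge) and the helper names `ring_diffusivity` and `check_diffusivity` are likewise illustrative choices rather than the paper's.

```python
from fractions import Fraction


def ring_diffusivity(n):
    """Hypothetical helper: diffusivity matrix of an n-node ring (n >= 3)."""
    P = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = Fraction(1, 2)
        P[i][(i + 1) % n] = Fraction(1, 4)
        P[i][(i - 1) % n] = Fraction(1, 4)
    return P


def check_diffusivity(P):
    """Check the diagonal (assumed >= 1/2), symmetry and row-sum conditions."""
    n = len(P)
    diag_ok = all(P[i][i] >= Fraction(1, 2) for i in range(n))
    symmetric = all(P[i][j] == P[j][i] for i in range(n) for j in range(n))
    stochastic = all(sum(row) == 1 for row in P)
    return diag_ok and symmetric and stochastic


good = ring_diffusivity(5)
bad = ring_diffusivity(5)
bad[0][1] = Fraction(1, 3)  # breaks both symmetry and the row sum
```

Exact rationals are used so that the row-sum test is not affected by floating-point rounding.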
Each processor $i$ has a variable called load[i]. At the beginning of each iteration, load[i] is set to the number of tasks at vertex $i$ (line (3) of Figure 2). Normally load[i] would be an integer variable, but since we are assuming that tasks are divisible, it is a real variable.
Then, each processor sends its load to all its neighbors (line (4)). As a result, each processor knows the loads of all its neighbors.
If a processor's load is heavier than a neighbor's, then the processor sends some of its tasks to the neighbor (lines (6) and (7)). The number of tasks sent is proportional to the difference in load, the constant of proportionality being the appropriate entry of the matrix $P$. (Observe that this number may be non-integral.) On the other hand, if a processor's load is lower than a neighbor's, then it does not send any tasks; rather, it receives tasks from the neighbor.
The parend in line (9) tacitly implies a barrier synchronization. Thus, no processor may start the next iteration until all processors have completed the current iteration. [2, page 515] shows that Algorithm Diffuse works just as well without the barrier synchronization. We retain the barrier synchronization to simplify analysis; this yields slightly pessimistic results. In practice, the barrier synchronization is dispensed with.
We do not intend the $\infty$ in line (1) to mean that the number of iterations is infinite. The analysis in Section 3 will show that, even though the load distribution becomes increasingly balanced with each iteration of Algorithm Diffuse, it may never become exactly balanced. We symbolically denote this gradual convergence by an $\infty$ in line (1). In practice, the user decides on some tolerable imbalance, and runs enough iterations to reach within that tolerance.
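One synchronous iteration of lines (3) to (8) of Figure 2 can be sketched in Python as follows. This is a minimal sequential simulation under stated assumptions, not a distributed implementation: the 4-node ring, its weights (1/2 on the diagonal, 1/4 per edge) and the names `ring_matrix` and `diffuse_step` are all hypothetical choices made for illustration.

```python
from fractions import Fraction


def ring_matrix(n):
    """Assumed example matrix: ring with P_ii = 1/2 and P_ij = 1/4 on edges."""
    P = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = Fraction(1, 2)
        P[i][(i + 1) % n] = Fraction(1, 4)
        P[i][(i - 1) % n] = Fraction(1, 4)
    return P


def diffuse_step(load, P):
    """Simulate one barrier-synchronized iteration with divisible tasks."""
    n = len(load)
    new_load = list(load)
    for i in range(n):
        for j in range(n):
            # P[i][j] is zero for non-neighbors, so only edges move tasks.
            if i != j and load[i] > load[j]:
                amount = P[i][j] * (load[i] - load[j])
                new_load[i] -= amount
                new_load[j] += amount
    return new_load


P = ring_matrix(4)
start = [Fraction(8), Fraction(0), Fraction(0), Fraction(0)]
step1 = diffuse_step(start, P)
step2 = diffuse_step(step1, P)
```

All sends are computed from the loads at the start of the iteration, which mirrors the barrier implied by parend. On this example the loads move from (8, 0, 0, 0) to (4, 2, 0, 2) and then to (3, 2, 1, 2), approaching the balanced value of 2 per node without reaching it in finitely many steps.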
3 Analysis of the Load Balancing Algorithm with Divisible Tasks
Now we analyze the performance of Algorithm Diffuse, still retaining the assumption of divisible tasks. We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's electrical conductance and fluid conductance, which are measures of the network's bandwidth. For the case of a generalized mesh (with wrap-around), we derive tighter bounds.
To state the main result of this section, let us first introduce some terminology.
For each $t \ge 0$, define the load distribution, $\ell^{(t)}$, as
$$\begin{pmatrix} \ell^{(t)}_1 \\ \ell^{(t)}_2 \\ \vdots \\ \ell^{(t)}_N \end{pmatrix},$$
where $\ell^{(t)}_i$ is the number of tasks at vertex $i$ after iteration $t$ of Algorithm Diffuse. (Thus, $\ell^{(0)}$ denotes the initial load distribution.) Let the total load, $M$, be $\sum_{i=1}^{N} \ell^{(0)}_i$, and define the balanced distribution, $b$, as
$$\begin{pmatrix} M/N \\ M/N \\ \vdots \\ M/N \end{pmatrix}$$
bandwidth.
Imagine $G$ to be an electrical network, with edges representing resistors. Set the resistance of each edge to the reciprocal of the corresponding entry of the diffusivity matrix $P$. Let $u$ and $v$ be vertices of $G$. Define $\mathrm{Res}(u,v)$ as the effective electrical resistance between $u$ and $v$, that is, the voltage of $v$ with respect to $u$ if a unit of current were to be injected at $v$ and extracted at $u$. The electrical conductance of $G$ is defined as
$$\Gamma = \min_{u,v} \frac{1}{\mathrm{Res}(u,v)}.$$
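As a worked illustration of this definition (our own example, not taken from the paper), consider a ring of $N$ nodes with $N$ even, and assume that every edge has diffusivity $1/4$, so that every edge is a resistor of resistance 4. Writing $\Gamma$ for the electrical conductance, the series and parallel rules give:

```latex
% Ring of N nodes (N even); assumed edge diffusivity P_{ij} = 1/4, i.e. edge resistance 4.
% Two vertices at distance k are joined by two parallel paths
% of resistance 4k and 4(N-k) respectively.
\[
  \mathrm{Res}_k \;=\; \frac{4k \cdot 4(N-k)}{4k + 4(N-k)} \;=\; \frac{4k(N-k)}{N},
  \qquad
  \max_{1 \le k \le N-1} \mathrm{Res}_k \;=\; \mathrm{Res}_{N/2} \;=\; N .
\]
% The minimum of 1/Res is attained by the pair with the largest resistance,
% that is, by antipodal vertices:
\[
  \Gamma \;=\; \min_{u,v} \frac{1}{\mathrm{Res}(u,v)} \;=\; \frac{1}{N} .
\]
```

So on this ring the electrical conductance shrinks linearly with the number of nodes, which matches the intuition that a ring offers little bandwidth between distant vertices.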
Consider $G$ to be a gas distribution network, with edges representing pipes. Set the capacity of each edge to the corresponding entry of the diffusivity matrix $P$. Let $S$ be a subset of the vertices of $G$, and $\bar{S}$ be its complement. Define $\mathrm{Cap}(S,\bar{S})$ as the effective capacity between $S$ and $\bar{S}$, that is, $\sum_{i \in S,\, j \in \bar{S}} P_{ij}$. The fluid conductance$^2$
We have introduced enough terminology to state the main result of this section:
Theorem 3.1 (Correctness and Complexity of Algorithm Diffuse). The load distribution converges to the balanced distribution (regardless of its initial value):
$$\lim_{t \to \infty} \ell^{(t)} = b.$$
Moreover, the time for
to fall below a prespecified constant tolerance satisfies the following
bounds:
represents the imbalance in the initial load distribution.
For the case of an $(n_1 \times n_2 \times \cdots \times n_d)$ mesh with wraparound, the running time satisfies the following tighter bounds:
(
For clarity, we present the proof of Theorem 3.1 in two subsections. The first subsection derives the Error Bound, and the second subsection completes the proof.
3.1 The Error Bound
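For each $t \ge 0$, define the error distribution, $e^{(t)}$, as $\ell^{(t)} - b$.
Observe that from one iteration to the next, the load distribution changes according to the equation
$$\ell^{(t+1)} = P\,\ell^{(t)}, \qquad (2)$$
since, by inspection of Algorithm Diffuse, $\ell^{(t+1)}_i = \ell^{(t)}_i - \sum_{j \ne i} P_{ij}\bigl(\ell^{(t)}_i - \ell^{(t)}_j\bigr) = \sum_{j=1}^{N} P_{ij}\,\ell^{(t)}_j$, where the first step uses the symmetry of $P$ and the last step uses its stochasticity.
Also note that
$$P b = b, \qquad (3)$$
because $P$ is stochastic and all components of $b$ are equal. Equation 3 has the nice significance that if we start with a load distribution of $b$, then after one iteration we end up with $b$ again.
From Equations 2 and 3, it follows that $e^{(t+1)} = P\,e^{(t)}$; that is, the error distribution transforms in the same way as the load distribution.
$^2$ Fluid conductance is not a standard term in the interconnection network literature, but the idea occurs in several guises,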
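Since $P$ is a real symmetric matrix, it is diagonalizable, and eigenvectors corresponding to different eigenvalues form an orthogonal basis [13, page 296]. Let $\lambda_1, \ldots, \lambda_N$ be the eigenvalues of $P$; without loss of generality, let them be ordered such that $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_N|$. Let $v^{(1)}, \ldots, v^{(N)}$ be the corresponding eigenvectors. From the theory of Markov chains, it is known that $\lambda_1 = 1$, that $v^{(1)}$ is some scalar multiple of $(1, 1, \ldots, 1)^T$, and that $|\lambda_2|$ is strictly less than 1 [12].
Since the components of $e^{(t)}$ sum to zero and the components of the first eigenvector $v^{(1)}$ are all equal, their inner product is zero. So $e^{(t)}$ has no component in the direction of the basis vector $v^{(1)}$, and hence can be expressed as a linear combination of $v^{(2)}, \ldots, v^{(N)}$. Observe that $P$ scales the lengths of $v^{(2)}, \ldots, v^{(N)}$ by factors of $|\lambda_2|, \ldots, |\lambda_N|$ respectively, all of which are $\le |\lambda_2| < 1$. Therefore $P$ scales the length of $e^{(t)}$ by a factor of at most $|\lambda_2|$, which implies
$$\|e^{(t)}\| \le |\lambda_2|^{t}\,\|e^{(0)}\|. \qquad (4)$$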
We call Equation 4 the Error Bound. Informally, the Error Bound says that the length of the error vector shrinks geometrically, where the scale factor is $|\lambda_2|$.
Note that the Error Bound is tight. For, if we choose $e^{(0)}$ to be $v^{(2)}$, then each application of $P$ scales the length of $e^{(t)}$ by a factor of exactly $|\lambda_2|$. Hence for this choice of $e^{(0)}$, $\|e^{(t)}\| = |\lambda_2|^{t}\,\|e^{(0)}\|$.
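The geometric shrinking of the error vector, and the tightness of the bound, can be checked numerically on a small example. The Python sketch below assumes an 8-node ring with weights 1/2 on the diagonal and 1/4 on each edge (an illustrative choice); for that matrix the second-largest eigenvalue is $\tfrac12 + \tfrac12\cos(2\pi/N)$, a standard fact about circulant matrices rather than something stated in the paper. The initial load vector is arbitrary.

```python
import math


def ring_step(load):
    """Apply l <- P l for a ring with P_ii = 1/2 and P_ij = 1/4 on edges."""
    n = len(load)
    return [0.5 * load[i] + 0.25 * load[i - 1] + 0.25 * load[(i + 1) % n]
            for i in range(n)]


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def error(load):
    """Error distribution: load minus the balanced distribution."""
    mean = sum(load) / len(load)
    return [x - mean for x in load]


n = 8
lam2 = 0.5 + 0.5 * math.cos(2 * math.pi / n)  # second eigenvalue of this P

load = [40.0, 0.0, 3.0, 7.0, 0.0, 12.0, 1.0, 9.0]
e0 = norm(error(load))
bound_holds = True
for t in range(1, 21):
    load = ring_step(load)
    # Small slack absorbs floating-point rounding.
    bound_holds = bound_holds and norm(error(load)) <= lam2 ** t * e0 + 1e-9

# Tightness: an error vector along the second eigenvector shrinks by exactly lam2.
v2 = [math.cos(2 * math.pi * i / n) for i in range(n)]
ratio = norm(ring_step(v2)) / norm(v2)
```

The first loop checks the inequality for twenty iterations from an arbitrary start; the last two lines check that a start along the second eigenvector attains it with equality.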
3.2 Conclusion of Proof
From the Error Bound, it follows that $\lim_{t \to \infty} e^{(t)} = 0$, which proves the "correctness" part of the theorem.
Let us say that we desire a tolerance of $\epsilon$. Then we must execute the body of the line (2) loop $T$ times, where $T$ is such that
The time for any processor to execute lines (3) and (4) is at most a constant, say $c$. The time for processor $i$ to execute the loop from line (5) to line (8) is proportional to the number of tasks it has to send, which is at most
So the total time for any processor to execute
Therefore the time for all $T$ iterations is
Using the Error Bound, this expression can be bounded as
Here are several formulas to do so, drawn from the theory of rapidly mixing Markov chains:
The second eigenvalue of $P$ is bounded by the electrical conductance of $G$ as follows [4, Theorem 7]:
$$1 - 2\Gamma \le |\lambda_2|$$
The second eigenvalue of $P$ can be bounded by the fluid conductance of $G$ as follows [7]:
$$1 - 2\Phi(G) \le |\lambda_2|$$
Let $G$ be an $(n_1 \times n_2 \times \cdots \times n_d)$ mesh with wraparound. Define the matrix $P$ as follows: Set all diagonal entries to $1/2$. If $ij$ is an edge of $G$, then set $P_{ij} = \frac{1}{4d}$. Set all other non-diagonal entries to zero.
The second eigenvalue of the matrix $P$ is given by the following equation [4, Theorem 10]:
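The equation itself is not legible in this copy. As a hedged reconstruction from standard facts about circulant matrices (our own derivation, assuming every $n_i \ge 3$ so that each vertex has $2d$ distinct neighbors; it may differ in form from the statement in [4]), the spectrum of this $P$ can be written down directly:

```latex
% P = (1/2) I + (1/(4d)) A, where A is the adjacency matrix of the torus.
% The eigenvalues of A are sums of 2 cos(2 pi k_i / n_i), one term per dimension,
% so the eigenvalues of P are indexed by (k_1, ..., k_d) with 0 <= k_i < n_i:
\[
  \lambda_{k_1,\dots,k_d}
    \;=\; \frac12 + \frac{1}{2d} \sum_{i=1}^{d} \cos\frac{2\pi k_i}{n_i} .
\]
% All eigenvalues lie in [0,1]. The largest one below 1 is obtained by taking
% k_i = 1 in the longest dimension and k_i = 0 elsewhere:
\[
  \lambda_2
    \;=\; 1 - \frac{1}{2d}\Bigl(1 - \cos\frac{2\pi}{\max_i n_i}\Bigr)
    \;=\; 1 - \frac{1}{d}\,\sin^2\frac{\pi}{\max_i n_i} .
\]
```

Under these assumptions $1 - \lambda_2 = \sin^2(\pi / \max_i n_i)/d$, whose reciprocal is the factor $d / \sin^2(\pi / \max_i n_i)$ appearing in Bound 1.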
Plugging in these bounds for $|\lambda_2|$ in Equation 5 proves the complexity part of the theorem.
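The role of $|\lambda_2|$ in the complexity argument can be illustrated by computing, from the Error Bound alone, how many iterations suffice for the error length to drop below a tolerance. The following Python sketch solves $|\lambda_2|^T \|e^{(0)}\| \le \epsilon$ for the smallest integer $T$. The function name `iterations_needed` and the example numbers are hypothetical, and the sketch counts iterations only; it does not model the per-iteration cost of sending tasks that the proof also accounts for.

```python
import math


def iterations_needed(initial_error, tolerance, lam2):
    """Smallest T with |lam2|**T * initial_error <= tolerance.

    Assumes 0 < |lam2| < 1 and positive arguments.
    """
    if initial_error <= tolerance:
        return 0
    return math.ceil(math.log(initial_error / tolerance)
                     / math.log(1.0 / abs(lam2)))


# Example: 8-node ring with P_ii = 1/2, P_ij = 1/4 (assumed weights),
# whose second eigenvalue is 1/2 + 1/2 cos(2 pi / 8).
lam2 = 0.5 + 0.5 * math.cos(2 * math.pi / 8)
T = iterations_needed(100.0, 1.0, lam2)
```

Because $\log(1/|\lambda_2|) \approx 1 - |\lambda_2|$ when $|\lambda_2|$ is close to 1, the iteration count grows like $\log(\|e^{(0)}\|/\epsilon) / (1 - |\lambda_2|)$, which is why bounds on the second eigenvalue translate into bounds on the running time.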
4 Handling the Indivisibility of Tasks
Algorithm Diffuse assumed that tasks are divisible. In this section we give examples to show that the indivisibility of tasks raises non-trivial problems that cannot be glossed over. Then we show how to modify Algorithm Diffuse to handle indivisible tasks.
Once we recognize that tasks are atomic, Algorithm Diffuse has an obvious problem in line (7):
    Send $P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j])$ tasks to $j$,
because this quantity may not be integral.
Let us try replacing line (7) with
    Send $\lfloor P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rfloor$ tasks.
The problem, as Figure 3 shows, is that the load distribution may converge to an unbalanced distribution.
If we try replacing line (7) with
    Send $\lceil P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rceil$ tasks,
then an immediate problem is that a processor may not have enough tasks for all its neighbors. Even if we are willing to ignore that, Figure 4 shows that the load distribution may keep oscillating between unbalanced distributions.
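Both failure modes can be reproduced on a tiny example. The Python sketch below assumes a 4-node ring in which every edge has weight 1/3, mirroring the edge weight quoted in the figure captions; the graph, the initial loads and the function name `rounded_step` are our own illustrative choices, not the networks drawn in the figures.

```python
import math
from fractions import Fraction


def rounded_step(load, edges, weight, rounding):
    """One synchronous iteration where line (7) sends
    rounding(weight * (load[hi] - load[lo])) whole tasks along each edge."""
    new_load = list(load)
    for i, j in edges:
        hi, lo = (i, j) if load[i] > load[j] else (j, i)
        if load[hi] > load[lo]:
            amount = rounding(weight * (load[hi] - load[lo]))
            new_load[hi] -= amount
            new_load[lo] += amount
    return new_load


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
w = Fraction(1, 3)

# Floor scheme: neighbors differ by one, floor(1/3) = 0, so nothing ever moves.
stuck = rounded_step([0, 1, 2, 1], ring, w, math.floor)

# Ceiling scheme: neighbors differ by two, ceil(2/3) = 1, so whole tasks swap sides.
flip = rounded_step([0, 2, 0, 2], ring, w, math.ceil)
flop = rounded_step(flip, ring, w, math.ceil)
```

In the floor case the distribution (0, 1, 2, 1) is a fixed point even though the balanced distribution is (1, 1, 1, 1). In the ceiling case the loads alternate between (0, 2, 0, 2) and (2, 0, 2, 0) forever.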
Figure 3. The numbers of tasks at adjacent vertices differ by one. Since the weight on each edge is 1/3, we would have liked to transfer 1/3 of a task across each edge, but in the "floor" scheme, no tasks move. Thus, the load distribution has converged wrongly.
Flip a biased coin that lands heads with probability $P_{ij}$ and tails with probability $1 - P_{ij}$. If a head is obtained then send $\lceil P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rceil$ tasks. If a tail is obtained then send $\lfloor P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rfloor$ tasks.
The intuition behind this approach is that sending $\frac{2}{3}$ of a task (for example) is the same as sending a whole task with probability $\frac{2}{3}$. However, the algorithm turns out to have a curious behaviour: the load distribution balances to an extent, but fails to balance any further. More precisely, the "entropy" stabilizes at a non-zero value. (Of course, a fortuitous sequence of coin flips may balance the load, but that is very unlikely to happen.)
As the above examples show, the convergence of the algorithm greatly depends on how the fractions are rounded. Below, we state a rounding scheme that guarantees convergence to the balanced distribution. We omit the proof because it is straightforward and uninstructive.
Case 1: G is biconnected. Find an orientation of $G$ (that is, assign a direction to each edge of $G$) such that
    there are no directed cycles
    there is a unique maximal and a unique minimal vertex
    there is an edge joining the maximal to the minimal vertex
Such an orientation may be found as follows: Find an open ear decomposition of $G$ (one exists because $G$ is biconnected [6]). Orient the first open ear $E_1$ arbitrarily. Assume, by induction, that the edges of the open ears $E_1, E_2, \ldots, E_{i-1}$ have already been oriented; that there are no directed cycles yet; and that the endpoints of the open ear $E_1$ are the minimum and maximum vertices. Now we wish to orient the open ear $E_i$. Let the points of attachment of $E_i$ be the vertices $u$ and $v$. Because the current partial orientation is acyclic, directed paths cannot exist both from $u$ to $v$ and from $v$ to $u$. Without loss of generality, assume that there is no directed path from $u$ to $v$. Then orient all edges of the open ear $E_i$ from $v$ to $u$. Clearly, this will create no directed cycle, and the endpoints of the open ear $E_1$ remain the minimum and maximum vertices, thus maintaining the inductive assertion.
Modify line (7) of Algorithm Diffuse as follows:
    If the edge $ij$ is directed from $i$ to $j$, then $i$ sends $\lceil P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rceil$ tasks to $j$. If the edge $ij$ is directed from $j$ to $i$, then $i$ sends $\lfloor P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rfloor$ tasks to $j$.
Figure 4. The numbers of tasks at adjacent vertices differ by two. Since the weight on each edge is 1/3, we would have liked to transfer 2/3 of a task across each edge, but in the "ceiling" scheme, a whole task moves. Thus, the load distribution oscillates.
Case 2: G is not biconnected. Find a biconnected supergraph $H$ that can be simulated by $G$ with constant delay. The graph $H$ may be constructed by adding edges to $G$ as follows: for each cut-vertex $u$ of $G$, chain the neighbors of $u$ in a cycle. It is easy to see that $H$ can be one-to-one embedded in $G$ with dilation 2 and edge-congestion 4. By [9, page 404], $G$ can simulate $H$ with a constant-factor delay.
Thus, even though the network at hand is $G$, the algorithm can load balance as though the network were $H$, incurring only a constant-factor delay.
5 Conclusion
To summarize, we have presented a rigorous analysis of the performance of the diffusion algorithm on arbitrary networks. We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's bandwidth.
For the case of the generalized mesh with wrap-around, we derive tighter bounds and conclude that the diffusion algorithm is inefficient for lower dimensional meshes.
As shown in the back-tracking and PDE examples of Section 1, load-balancing usually arises as a part of another algorithm. This suggests that a load balancing algorithm should not be judged in
interesting to see such an analysis.
References
[1] B.D. Alleyne. Personal communications, Dept. of Electrical Engineering, Princeton University, 1994.
[2] D.P. Bertsekas and J.N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989.
[3] T.L. Casavant and J.G. Kuhl. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, 14(2):141–154, 1988.
[4] A.K. Chandra et al. The electrical resistance of a graph captures its commute and cover times. In Symposium on Theory of Computing, pages 574–586, May 1989.
[5] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. The Journal of Parallel and Distributed Computing, 7(2):279–301, 1989.
[6] H. Whitney. Non-separable and planar graphs. Transactions of the American Mathematical Society, 34:339–362, 1932.
[7] M. Jerrum and A. Sinclair. Conductance and the rapid mixing property for Markov chains: the approximation of the permanent resolved. In Symposium on Theory of Computing, pages 235–243, May 1988.
[8] R.M. Karp. Parallel combinatorial computing. In Jill P. Mesirov, editor, Very Large Scale Computation in the 21st Century, pages 221–238. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1991.
[9] T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo, CA, 1991.
[10] C.E. Leiserson and B.M. Maggs. Communication-efficient parallel algorithms for distributed random-access machines. Algorithmica, 3:53–77, 1988.
[11] D. Peleg and E. Upfal. The token distribution problem. In Symposium on Foundations of Computer Science, pages 418–427, 1986.
[12] E. Seneta. Non-negative Matrices and Markov Chains. Springer-Verlag, New York, NY, 1981.
[13] G. Strang. Linear Algebra and its Applications. Harcourt Brace Jovanovich, San Diego, CA, 1988.
[14] M.H. Willebeek-LeMair and A.P. Reeves. Strategies for dynamic load balancing on highly parallel