An analysis of diffusive load balancing


Raghu Subramanian    Isaac D. Scherson

[email protected] [email protected]

Department of Information and Computer Science

University of California

Irvine, CA 92717-3425

U.S.A.

Abstract

Diffusion is a well-known algorithm for load-balancing in which tasks move from heavily-loaded processors to lightly-loaded neighbors. This paper presents a rigorous analysis of the performance of the diffusion algorithm on arbitrary networks.

We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's bandwidth.

For the case of the generalized mesh with wrap-around (which includes common networks like the ring, 2D-torus, 3D-torus and hypercube), we derive tighter bounds and conclude that the diffusion algorithm is inefficient for lower dimensional meshes.

* This research was supported in part by the Air Force Office of Scientific Research under grant number F49620-92-J-0126,


The load-balancing problem is as follows: Let G be an undirected connected graph of N nodes, and let M tasks be scattered among the nodes. Re-distribute the tasks such that each node ends up with either $\lfloor M/N \rfloor$ or $\lceil M/N \rceil$ tasks. The algorithm must run in a distributed fashion, i.e., each node's decisions must be based only on local knowledge. (This formulation is borrowed from [11].¹)
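As a small illustration of the goal state only (not of any distributed algorithm), the following Python sketch computes the two admissible per-node task counts and checks whether a given distribution is balanced. The function names `balanced_targets` and `is_balanced` are hypothetical, introduced here for illustration.

```python
def balanced_targets(loads):
    """Return (floor(M/N), ceil(M/N)) for a list of per-node task counts."""
    total = sum(loads)        # M, the total number of tasks
    n = len(loads)            # N, the number of nodes
    low = total // n          # floor(M/N)
    high = -(-total // n)     # ceil(M/N), using integer arithmetic only
    return low, high


def is_balanced(loads):
    """True if every node holds either floor(M/N) or ceil(M/N) tasks."""
    low, high = balanced_targets(loads)
    return all(low <= x <= high for x in loads)
```

For example, with loads `[5, 0, 2]` the targets are 2 and 3, so `[3, 2, 2]` is balanced while `[5, 0, 2]` is not.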

To motivate the above formulation, let us consider two applications in which the need for load-balancing arises. In these examples, observe that all the tasks have roughly the same execution time, and the load-balancing phases attempt to equalize the number of tasks at each node. Also note that the high degree of parallelism makes a centralized load-balancing algorithm unviable.

Back-tracking: Consider a search space consisting of vectors where each component can assume a finite number of values. Back-tracking is an algorithm to find a vector in the search space that satisfies some feasibility condition. In back-tracking, the solution vector is constructed incrementally, component by component. A task corresponds to a partially constructed vector. Each partial vector (task) resides in the local memory of some processor.

The back-tracking algorithm alternates between expansion phases and load-balancing phases. In an expansion phase, each processor in parallel takes a partial vector (task), if any, from its local memory. If the partial vector is in fact complete, and also satisfies the feasibility condition, then the back-tracking algorithm returns the completed vector and terminates. If it is clear that the partial vector cannot be completed feasibly, then the partial vector vanishes from the local memory, and the task is said to have terminated unsuccessfully. Otherwise, several new partial vectors appear in the local memory of the same processor, corresponding to each way of extending the original partial vector by one more component, and the task is said to have spawned children.

In a load-balancing phase, partial vectors (tasks) are redistributed among processors evenly. If the load-balancing phase is omitted, then with each expansion phase, the distribution of tasks among processors gets skewed. Some processors get swamped with tasks, while others stay idle. This slows down the overall back-tracking algorithm.

If the load-balancing phase proves to be too expensive, then its cost may be amortized by invoking several expansion phases for every invocation of the load-balancing phase.

Solving PDEs: Consider a typical iterative algorithm to solve a partial differential equation (PDE). The problem is first discretized by partitioning space into regions, and the function is tentatively assumed to be constant within a region. During each iteration, each region in parallel gets the function values in the neighboring regions, and uses them to update its own function value. If, in the process, a region discovers that its function value differs drastically from the neighbors', then it realizes that the assumption that the function is constant within the region may not be valid, so it splits up into finer regions.

Here, a task corresponds to a region. Each task resides in the local memory of a processor. During an expansion phase, each processor picks a task, if any, from its local memory and executes it. This may result in additional tasks appearing in the local memory of a processor, reflecting that a region split up. During a load-balancing phase, tasks are redistributed evenly.

Other applications where load-balancing arises are branch-and-bound optimizations, theorem proving, interpretation of PROLOG programs, and ray tracing [8].

¹ Incidentally, the paper describes a surprising "application" of load-balancing. It is well known that permutation routing can be reduced to sorting. The authors show that general (many to many) routing can be reduced to sorting plus


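Network      Lower bound                      Upper bound
Ring         $\Omega(N^2 \log \sigma)$        $O(N^2 \sigma)$
2D-torus     $\Omega(N \log \sigma)$          $O(N \sigma)$
3D-torus     $\Omega(N^{2/3} \log \sigma)$    $O(N^{2/3} \sigma)$
Hypercube    $\Omega(\log N \log \sigma)$     $O(\sigma \log N)$

Figure 1. Bounds on the running time of the diffusion algorithm for certain common networks. $N$ denotes the number of nodes in the network, and $\sigma$ denotes the standard deviation ("imbalance") of the initial load distribution.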

Overview of Results

In this paper, we consider the well-known diffusion algorithm for load balancing. The principle behind the diffusion algorithm is that if a processor has more tasks than a neighbor, then it sends a few tasks to the neighbor. The number of tasks sent is proportional to the differential between the two processors, the proportionality constant being a characteristic of the connecting edge.

This paper presents a rigorous analysis of the performance of the diffusion algorithm on an arbitrary network. We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's electrical conductance and fluid conductance (defined in Section 3), which are measures of the network's bandwidth.

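If $N$ is the number of nodes in the network, $\Gamma$ is the network's electrical conductance, and $\sigma$ is the standard deviation ("imbalance") of the initial load distribution, then the running time of the diffusion algorithm is

$$\Omega\!\left(\frac{\log \sigma}{\Gamma}\right) \text{ and } O\!\left(\frac{N \sigma}{\Gamma}\right).$$

If $\Phi$ is the network's fluid conductance, and $\sigma$ is as above, then the running time of the diffusion algorithm is

$$\Omega\!\left(\frac{\log \sigma}{\Phi}\right) \text{ and } O\!\left(\frac{\sigma}{\Phi^2}\right).$$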

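For the special case of an $(n_1 \times n_2 \times \cdots \times n_d)$ mesh with wraparound, we provide the following tighter bound:

$$\Omega\!\left(\frac{d \log \sigma}{\sin^2\!\left(\pi / \max_{i=1 \cdots d} n_i\right)}\right) \text{ and } O\!\left(\frac{d\,\sigma}{\sin^2\!\left(\pi / \max_{i=1 \cdots d} n_i\right)}\right). \qquad (1)$$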

Figure 1 gives a feel for Bound 1 by showing the form it assumes in certain common cases. From the table, it is clear that the diffusion algorithm is inefficient for lower dimensional meshes. For example, on a ring and 2D torus, the diffusion algorithm takes at least linear time, indicating that it is no better than a centralized algorithm (in which one processor collects all information and directs the load balancing).
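To see how Bound 1 specializes to the rows of Figure 1, the following Python sketch evaluates the factor $d / \sin^2(\pi / \max_i n_i)$ that multiplies $\log \sigma$ (lower bound) and $\sigma$ (upper bound). It assumes the argument of the sine is $\pi / \max_i n_i$, which is the reading consistent with the table; the function name `mesh_factor` is hypothetical.

```python
import math


def mesh_factor(dims):
    """Return d / sin^2(pi / max n_i) for an (n_1 x ... x n_d) wraparound mesh.

    Assumption: the sine argument is pi / max(n_i), chosen to agree with Figure 1.
    """
    d = len(dims)
    return d / math.sin(math.pi / max(dims)) ** 2
```

For a ring of 64 nodes, `mesh_factor([64])` is close to $(64/\pi)^2$, which grows as $N^2$. For a hypercube of 64 nodes, `mesh_factor([2] * 6)` is 6, which is $\log_2 N$.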

Comparison with Prior Work

Traditional formulations of the load-balancing problem allow the tasks' execution times to differ [3]. These formulations are more general than ours, but with the generality comes intractability: the simplest such formulations turn out to be NP-complete. As a result, most work has consisted of proposing ad hoc


(1)  for iteration ← 1 to ∞ begin
(2)      All processors i parbegin
(3)          load[i] ← number of tasks at i
(4)          Broadcast load[i] to all neighbors
(5)          for each j that is i's neighbor begin
(6)              if load[i] > load[j] then
(7)                  Send P_ij (load[i] − load[j]) tasks to j
(8)          end
(9)      parend
(10) end

Figure 2. Algorithm for load-balancing (with divisible tasks)
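The pseudocode of Figure 2 can be simulated sequentially. The Python sketch below is one possible rendering, not the authors' implementation: it assumes the diffusivity matrix is given as a dense list of lists `P`, models the barrier by computing all transfers from the old loads, and stops when every load is within a caller-chosen `tolerance` of the average. The stopping rule and the names `diffuse_step` and `diffuse` are assumptions made for illustration.

```python
def diffuse_step(load, P):
    """One synchronous iteration of lines (3)-(8) with divisible tasks."""
    n = len(load)
    new_load = list(load)
    for i in range(n):
        for j in range(n):
            # Line (6): only the heavier endpoint of an edge sends.
            if i != j and P[i][j] > 0 and load[i] > load[j]:
                sent = P[i][j] * (load[i] - load[j])  # line (7)
                new_load[i] -= sent
                new_load[j] += sent
    return new_load


def diffuse(load, P, tolerance, max_iters=10_000):
    """Iterate until every load is within `tolerance` of the average.

    Returns the final loads and the number of iterations performed.
    """
    avg = sum(load) / len(load)
    for t in range(max_iters):
        if max(abs(x - avg) for x in load) <= tolerance:
            return load, t
        load = diffuse_step(load, P)
    return load, max_iters
```

On a 4-node ring with diagonal entries 1/2 and edge entries 1/4, starting from loads `[8, 0, 0, 0]`, one step gives `[4, 2, 0, 2]`, and repeated steps approach 2 tasks per node.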

We argue that the case of fixed-sized tasks is sufficiently important to merit study. First, as illustrated above, there are several highly parallel applications where the assumption of fixed-sized tasks is valid. Second, it has been argued that fixed-sized tasks adequately model variable-sized tasks which can be pre-empted when they exceed a certain time quantum [1]. Finally, the case of fixed-sized tasks is tractable and amenable to rigorous analysis.

Diffusion is only one of several load balancing algorithms that have been studied in the past [14]. The diffusion algorithm is studied in detail in [2] and [5]. Our presentation of the algorithm in Section 2 closely follows [5]. Our analysis of the algorithm in Section 3 extends the analysis in [5]. For example, [5] provides explicit bounds on the running time of the diffusion algorithm only in the case of a hypercube. In contrast, we provide bounds for an arbitrary network.

Both [2] and [5] make the simplifying assumption that tasks can be divided into arbitrary fractions. In Section 4 we show that this assumption raises thorny problems that cannot be glossed over. Then we suggest how the diffusion algorithm can be modified to handle indivisible tasks.

2 Diffusive Load Balancing (with Divisible Tasks)

In this section, we review the diffusion algorithm for load balancing. For simplicity, we assume that tasks are divisible into arbitrary fractions. (For example, we allow half-a-task to move across an edge, blithely ignoring that such a thing is meaningless.) Recall that the original aim was to end up with $\lfloor M/N \rfloor$ or $\lceil M/N \rceil$ tasks at each node. Now that we allow fractional tasks, the revised aim is to end up with exactly $M/N$ tasks at each node.

In Section 4, we will reconsider the indivisibility of tasks, and show how to modify the diffusion algorithm accordingly.

The intuition behind the diffusion algorithm is that if a processor has more tasks than a neighbor, then a few tasks diffuse to the neighbor. The number of tasks that diffuse is proportional to the difference in the number of tasks at the two processors. The proportionality constant is a characteristic of the connecting edge, and is called its diffusivity.

Figure 2 shows the diffusion algorithm in detail. Algorithm Diffuse makes use of an $N \times N$ diffusivity matrix, $P$, which satisfies the following conditions:

- $P_{ii} \ge 1/2$. (The number $1/2$ is chosen only for simplicity: any positive constant will do. The import of the condition is that $P_{ii}$


ij ij

- $P$ is symmetric: $P_{ij} = P_{ji}$.

- $P$ is stochastic: $\sum_{j=1}^{N} P_{ij} = 1$.
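One simple way to obtain a matrix meeting the conditions that survive in the text above (diagonal at least 1/2, symmetric, stochastic) is to give every edge the same weight $1/(2\Delta)$, where $\Delta$ is the maximum degree. The Python sketch below does this; the uniform-weight choice and the names `diffusivity_matrix` and `satisfies_conditions` are assumptions for illustration, and the source does not prescribe this particular matrix for general graphs.

```python
def diffusivity_matrix(n, edges):
    """Build a symmetric stochastic matrix with P[i][i] >= 1/2.

    Assumption: every edge gets the same weight 1 / (2 * max_degree).
    `edges` is a list of (u, v) pairs over vertices 0..n-1, with at least one edge.
    """
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    w = 1.0 / (2 * max(deg))
    P = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        P[u][v] = w
        P[v][u] = w
    for i in range(n):
        P[i][i] = 1.0 - deg[i] * w  # remaining mass stays on the diagonal
    return P


def satisfies_conditions(P, eps=1e-12):
    """Check P_ii >= 1/2, symmetry, and unit row sums."""
    n = len(P)
    diag_ok = all(P[i][i] >= 0.5 - eps for i in range(n))
    sym_ok = all(abs(P[i][j] - P[j][i]) <= eps
                 for i in range(n) for j in range(n))
    stoch_ok = all(abs(sum(row) - 1.0) <= eps for row in P)
    return diag_ok and sym_ok and stoch_ok
```

For the path 0-1-2 this yields edge weights 1/4 and diagonal entries 3/4, 1/2, 3/4.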

Each processor i has a variable called load[i]. At the beginning of each iteration, load[i] is set to the number of tasks at vertex i (line (3) of Figure 2). Normally load[i] would be an integer variable, but since we are assuming that tasks are divisible, it is a real variable.

Then, each processor sends its load to all its neighbors (line (4)). As a result, each processor knows the loads of all its neighbors.

If a processor's load is heavier than a neighbor's, then the processor sends some of its tasks to the neighbor (lines (6) and (7)). The number of tasks sent is proportional to the difference in load, the constant of proportionality being the appropriate entry of $P$. (Observe that this number may be non-integral.) On the other hand, if a processor's load is lower than a neighbor's then it does not send any tasks; rather, it receives tasks from the neighbor.

The parend in line (9) tacitly implies a barrier synchronization. Thus, no processor may start the next iteration until all processors have completed the current iteration. [2, page 515] shows that Algorithm Diffuse works just as well without the barrier synchronization. We retain the barrier synchronization to simplify analysis; this yields slightly pessimistic results. In practice, the barrier synchronization is dispensed with.

We do not intend the ∞ in line (1) to mean that the number of iterations is infinity. The analysis in Section 3 will show that, even though the load distribution becomes increasingly balanced with each iteration of Algorithm Diffuse, it may never become exactly balanced. We symbolically denote this gradual convergence by an ∞ in line (1). In practice, the user decides on some tolerable imbalance, and runs enough iterations to reach within that tolerance.

3 Analysis of the Load Balancing Algorithm with Divisible Tasks

Now we analyze the performance of Algorithm Diffuse, still retaining the assumption of divisible tasks. We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's electrical conductance and fluid conductance, which are measures of the network's bandwidth. For the case of a generalized mesh (with wrap-around), we derive tighter bounds.

To state the main result of this section, let us first introduce some terminology.

For each $t \ge 0$, define the load distribution, $\ell^{(t)}$, as

$$\ell^{(t)} = \begin{pmatrix} \ell^{(t)}_1 \\ \ell^{(t)}_2 \\ \vdots \\ \ell^{(t)}_N \end{pmatrix},$$

where $\ell^{(t)}_i$ is the number of tasks at vertex $i$ after iteration $t$ of Algorithm Diffuse. (Thus, $\ell^{(0)}$ denotes the initial load distribution.) Let the total load, $M$, be $\sum_{i=1}^{N} \ell^{(0)}_i$, and define the balanced distribution, $b$, as

$$b = \begin{pmatrix} M/N \\ M/N \\ \vdots \\ M/N \end{pmatrix}.$$

bandwidth.

Imagine $G$ to be an electrical network, with edges representing resistors. Set the resistance of each edge to the reciprocal of the corresponding entry of the diffusivity matrix $P$. Let $u$ and $v$ be vertices of $G$. Define $\mathrm{Res}(u, v)$ as the effective electrical resistance between $u$ and $v$, that is, the voltage of $v$ with respect to $u$ if a unit of current were to be injected at $v$ and extracted at $u$. The electrical conductance of $G$ is defined as

$$\Gamma = \min_{u,v} \frac{1}{\mathrm{Res}(u, v)}.$$
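The definition of electrical conductance above can be evaluated directly on small networks. The Python sketch below treats each entry $P_{ij}$ as an edge conductance (the reciprocal of the resistance), grounds $u$, injects one unit of current at $v$, and solves the resulting linear system by Gauss-Jordan elimination. It assumes a connected graph and a dense matrix `P`; the function names are hypothetical and the brute-force minimum over all pairs is meant only for illustration.

```python
def effective_resistance(P, u, v):
    """Voltage of v relative to u when unit current enters at v and leaves at u."""
    n = len(P)
    # Weighted Laplacian: edge ij has conductance P[i][j]; diagonal of P is ignored.
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                lap[i][j] = -P[i][j]
                lap[i][i] += P[i][j]
    idx = [k for k in range(n) if k != u]  # u is grounded (potential 0)
    m = len(idx)
    # Augmented system: reduced Laplacian times potentials = injected currents.
    A = [[lap[r][c] for c in idx] + [1.0 if r == v else 0.0] for r in idx]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    potentials = {idx[k]: A[k][m] / A[k][k] for k in range(m)}
    return potentials[v]


def electrical_conductance(P):
    """Minimum over all vertex pairs of 1 / Res(u, v)."""
    n = len(P)
    return min(1.0 / effective_resistance(P, u, v)
               for u in range(n) for v in range(n) if u != v)
```

On a 4-node ring with edge entries 1/4, each edge is a 4-ohm resistor. Adjacent vertices see 4 ohms in parallel with 12 ohms (3 ohms), opposite vertices see 8 in parallel with 8 (4 ohms), so the minimum of the reciprocals is 1/4.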

Consider $G$ to be a gas distribution network, with edges representing pipes. Set the capacity of each edge to the corresponding entry of the diffusivity matrix $P$. Let $S$ be a subset of the vertices of $G$, and $\bar{S}$ be its complement. Define $\mathrm{Cap}(S, \bar{S})$ as the effective capacity between $S$ and $\bar{S}$, that is, $\sum_{i \in S,\, j \in \bar{S}} P_{ij}$. The fluid conductance²

We have introduced enough terminology to state the main result of this section:

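Theorem 3.1 (Correctness and Complexity of Algorithm Diffuse). The load distribution converges to the balanced distribution (regardless of its initial value):

$$\lim_{t \to \infty} \ell^{(t)} = b.$$

Moreover, the time for the error to fall below a prespecified constant tolerance satisfies the following bounds:

$$\Omega\!\left(\frac{\log \sigma}{\Gamma}\right) \text{ and } O\!\left(\frac{N \sigma}{\Gamma}\right); \qquad \Omega\!\left(\frac{\log \sigma}{\Phi}\right) \text{ and } O\!\left(\frac{\sigma}{\Phi^2}\right),$$

where $\sigma$ represents the imbalance in the initial load distribution.

For the case of an $(n_1 \times n_2 \times \cdots \times n_d)$ mesh with wraparound, the running time satisfies the following tighter bounds:

$$\Omega\!\left(\frac{d \log \sigma}{\sin^2\!\left(\pi / \max_{i=1 \cdots d} n_i\right)}\right) \text{ and } O\!\left(\frac{d\,\sigma}{\sin^2\!\left(\pi / \max_{i=1 \cdots d} n_i\right)}\right).$$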

For clarity, we present the proof of Theorem 3.1 in two subsections. The first subsection derives the Error Bound, and the second subsection completes the proof.

3.1 The Error Bound

For each $t \ge 0$, define the error distribution, $e^{(t)}$, as $\ell^{(t)} - b$.

Observe that from one iteration to the next, the load distribution changes according to the equation

$$\ell^{(t+1)} = P\,\ell^{(t)}, \qquad (2)$$

² Fluid conductance is not a standard term in interconnection network literature, but the idea occurs in several guises,


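since, by inspection of Algorithm Diffuse, $\ell^{(t+1)}_i = \ell^{(t)}_i - \sum_{j} P_{ij}\,(\ell^{(t)}_i - \ell^{(t)}_j) = \sum_{j} P_{ij}\,\ell^{(t)}_j$, where the last step uses the fact that $P$ is stochastic.

Also note that

$$P b = b, \qquad (3)$$

because $P$ is stochastic and all components of $b$ are equal. Equation 3 has the nice significance that if we start with a load distribution of $b$, then after one iteration we end up with $b$ again.

From Equations 2 and 3, it follows that $e^{(t+1)} = P\,e^{(t)}$; that is, the error distribution transforms in the same way as the load distribution.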

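Since $P$ is a real symmetric matrix, it is diagonalizable, and eigenvectors corresponding to different eigenvalues form an orthogonal basis [13, page 296]. Let $\lambda_1, \ldots, \lambda_N$ be the eigenvalues of $P$; without loss of generality, let them be ordered such that $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_N|$. Let $v^{(1)}, \ldots, v^{(N)}$ be the corresponding eigenvectors. From the theory of Markov chains, it is known that $\lambda_1 = 1$, that $v^{(1)}$ is some scalar multiple of $(1, 1, \ldots, 1)^T$, and that $|\lambda_2|$ is strictly less than 1 [12].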

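Since the components of $e^{(t)}$ sum to zero and the components of the first eigenvector $v^{(1)}$ are all equal, their inner product is zero. So $e^{(t)}$ has no component in the direction of basis vector $v^{(1)}$, and hence can be expressed as a linear combination of $v^{(2)}, \ldots, v^{(N)}$. Observe that $P$ scales the lengths of $v^{(2)}, \ldots, v^{(N)}$ by $|\lambda_2|, \ldots, |\lambda_N|$ respectively, all of which are $\le |\lambda_2| < 1$. Therefore $P$ scales the length of $e^{(t)}$ by a factor of at most $|\lambda_2|$, which implies

$$\|e^{(t)}\| \le |\lambda_2|^{t}\, \|e^{(0)}\|. \qquad (4)$$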

We call Equation 4 the Error Bound. Informally, the Error Bound says that the length of the error vector shrinks geometrically, where the scale factor is $|\lambda_2|$.
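The geometric shrinkage described by the Error Bound can be observed numerically. The Python sketch below uses a ring of $n \ge 3$ nodes with diagonal entries 1/2 and edge entries 1/4, for which the second eigenvalue is $\cos^2(\pi/n)$ (a standard fact about circulant matrices, stated here as an assumption rather than taken from the source). It applies $P$ repeatedly and checks that the length of the error vector never exceeds $|\lambda_2|^t$ times its initial length. All function names are hypothetical.

```python
import math


def ring_matrix(n):
    """Diffusivity matrix of an n-node ring (n >= 3): diagonal 1/2, edges 1/4."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = 0.5
        P[i][(i + 1) % n] += 0.25
        P[i][(i - 1) % n] += 0.25
    return P


def apply_matrix(P, x):
    """Return the matrix-vector product P x."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]


def norm(x):
    """Euclidean length of x."""
    return math.sqrt(sum(a * a for a in x))


def check_error_bound(load, steps):
    """Check that ||e(t)|| <= lambda_2^t ||e(0)|| for t = 1..steps on a ring."""
    n = len(load)
    P = ring_matrix(n)
    lam2 = math.cos(math.pi / n) ** 2  # assumed second eigenvalue of the ring
    avg = sum(load) / n
    e0 = norm([x - avg for x in load])
    ok = True
    for t in range(1, steps + 1):
        load = apply_matrix(P, load)
        et = norm([x - avg for x in load])
        ok = ok and et <= lam2 ** t * e0 + 1e-9  # small slack for rounding
    return ok
```

Starting instead from the vector with components $\cos(2\pi i/n)$, each application of `apply_matrix` scales the vector by exactly $\cos^2(\pi/n)$, which illustrates that the bound cannot be improved in general.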

Note that the Error Bound is tight. For, if we choose $e^{(0)}$ to be $v^{(2)}$ then each application of $P$ scales the length of $e^{(t)}$ by a factor of exactly $|\lambda_2|$. Hence for this choice of $e^{(0)}$, $\|e^{(t)}\| = |\lambda_2|^{t}\, \|e^{(0)}\|$.

3.2 Conclusion of Proof

From the Error Bound, it follows that $\lim_{t \to \infty} e^{(t)} = 0$, which proves the "correctness" part of the theorem.

Let us say that we desire a tolerance of $\epsilon$. Then we must execute the body of the line (2) loop $T$ times, where $T$ is such that

The time for any processor to execute lines (3) and (4) is at most a constant, say $c$. The time for processor $i$ to execute the loop from line (5) to line (8) is proportional to the number of tasks it has to send, which is at most $\sum$

. So the total time for any processor to execute


Therefore the time for all $T$ iterations is

Using the Error Bound, this expression can be bounded as

It remains to bound $|\lambda_2|$. Here are several formulas to do so, drawn from the theory of rapidly mixing Markov chains:

The second eigenvalue of $P$ is bounded by the electrical conductance of $G$ as follows [4, Theorem 7]:

$$1 - 2\Gamma \le |\lambda_2|$$

The second eigenvalue of $P$ can be bounded by the fluid conductance of $G$ as follows [7]:

$$1 - 2\Phi(G) \le |\lambda_2|$$

Let $G$ be an $(n_1 \times n_2 \times \cdots \times n_d)$ mesh with wraparound. Define the matrix $P$ as follows: Set all diagonal entries to $1/2$. If $ij$ is an edge of $G$, then set $P_{ij} = \frac{1}{4d}$. Set all other non-diagonal entries to zero.

The second eigenvalue of the matrix $P$ is given by the following equation [4, Theorem 10]:

Plugging in these bounds for $|\lambda_2|$ in Equation 5 proves the complexity part of the theorem.
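The mesh matrix defined above (diagonal entries 1/2, edge entries $1/(4d)$) can be checked numerically against the form of Bound 1. The Python sketch below applies $P$ to the vector whose entry at each mesh point is $\cos(2\pi x_k / n_k)$, where $k$ is the longest dimension, and compares the result with $1 - \frac{1}{d}\sin^2(\pi / \max_i n_i)$ times that vector. This closed form is an assumption derived here to be consistent with Bound 1, not a quotation of [4, Theorem 10]; the check shows only that the value is an eigenvalue, not that it is the second largest. It assumes every $n_i \ge 3$ so that the two neighbors along each axis are distinct. Function names are hypothetical.

```python
import math
from itertools import product


def torus_apply(dims, x):
    """Apply P (diagonal 1/2, each edge 1/(4d)) to x, a dict keyed by coordinates."""
    d = len(dims)
    y = {}
    for p in product(*(range(n) for n in dims)):
        s = 0.5 * x[p]
        for axis in range(d):
            for step in (1, -1):
                q = list(p)
                q[axis] = (q[axis] + step) % dims[axis]  # wraparound neighbor
                s += x[tuple(q)] / (4 * d)
        y[p] = s
    return y


def predicted_lambda2(dims):
    """Assumed closed form: 1 - sin^2(pi / max n_i) / d."""
    return 1.0 - math.sin(math.pi / max(dims)) ** 2 / len(dims)
```

For a $4 \times 6$ torus the predicted value is $1 - \sin^2(\pi/6)/2 = 0.875$, and applying `torus_apply` to the cosine vector along the length-6 axis scales it by that factor.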

4 Handling the Indivisibility of Tasks

Algorithm Diffuse assumed that tasks are divisible. In this section we give examples to show that the indivisibility of tasks raises non-trivial problems that cannot be glossed over. Then we show how to modify Algorithm Diffuse to handle indivisible tasks.

Once we recognize that tasks are atomic, Algorithm Diffuse has an obvious problem in line (7):

    Send $P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j])$ tasks to j,

because this quantity may not be integral.

Let us try replacing line (7) with

    Send $\lfloor P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rfloor$ tasks.

The problem, as Figure 3 shows, is that the load distribution may converge to an unbalanced distribution.

If we try replacing line (7) with

    Send $\lceil P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rceil$ tasks,

then an immediate problem is that a processor may not have enough tasks for all its neighbors. Even if we are willing to ignore that, Figure 4 shows that the load distribution may keep oscillating between unbalanced distributions.
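Both failure modes are easy to reproduce. The Python sketch below applies one synchronous iteration with a caller-supplied rounding function and a uniform edge weight of 1/3. The two test networks (a 3-node path for the floor scheme and a 4-node ring for the ceiling scheme) are examples constructed here in the spirit of Figures 3 and 4, not reproductions of them; the function name `rounded_step` is hypothetical.

```python
import math
from fractions import Fraction


def rounded_step(load, edges, weight, rounding):
    """One synchronous iteration where each transfer is rounded to an integer.

    `edges` is a list of (u, v) pairs; `weight` is the common edge diffusivity;
    `rounding` is math.floor or math.ceil.
    """
    new_load = list(load)
    for u, v in edges:
        hi, lo = (u, v) if load[u] > load[v] else (v, u)
        if load[hi] > load[lo]:
            sent = rounding(weight * (load[hi] - load[lo]))
            new_load[hi] -= sent
            new_load[lo] += sent
    return new_load


third = Fraction(1, 3)  # exact arithmetic avoids floating-point rounding surprises
path_edges = [(0, 1), (1, 2)]
ring_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

With `math.floor`, the path with loads `[2, 1, 0]` never changes, although the balanced distribution is one task per node. With `math.ceil`, the ring with loads `[2, 0, 2, 0]` flips to `[0, 2, 0, 2]` and back indefinitely.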


Figure 3. The number of tasks at adjacent vertices differ by one. Since the weight on each edge is 1/3, we would have liked to transfer 1/3 of a task across each edge, but in the "floor" scheme, no tasks move. Thus, the load distribution has converged wrongly.

Flip a biased coin that lands head with probability $P_{ij}$ and tail with probability $1 - P_{ij}$. If a head is obtained then send $\lceil P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rceil$ tasks. If a tail is obtained then send $\lfloor P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rfloor$ tasks.

The intuition behind this approach is that sending $\frac{2}{3}$ task (for example) is the same as sending a whole task with probability $\frac{2}{3}$. However, the algorithm turns out to have a curious behaviour: the load distribution balances to an extent, but fails to balance any further. More precisely, the "entropy" stabilizes at a non-zero value. (Of course, a fortuitous sequence of coin flips may balance the load, but that is very unlikely to happen.)

As the above examples show, the convergence of the algorithm greatly depends on how the fractions are rounded. Below, we state a rounding scheme that guarantees convergence to the balanced distribution. We omit the proof because it is straightforward and uninstructive.

Case 1: G is biconnected. Find an orientation of $G$ (that is, assign a direction to each edge of $G$) such that

- there are no directed cycles
- there is a unique maximal and a unique minimal vertex
- there is an edge joining the maximal to the minimal vertex

Such an orientation may be found as follows: Find an open ear decomposition of $G$ (one exists because $G$ is biconnected [6]). Orient the first open ear $E_1$ arbitrarily. Assume, by induction, that the edges of the open ears $E_1, E_2, \ldots, E_{i-1}$ have already been oriented; that there are no directed cycles yet; and that the endpoints of the open ear $E_1$ are the minimum and maximum vertices. Now we wish to orient the open ear $E_i$. Let the points of attachment of $E_i$ be the vertices $u$ and $v$. Because the current partial orientation is acyclic, directed paths cannot exist both from $u$ to $v$ and from $v$ to $u$. Without loss of generality, assume that there is no directed path from $u$ to $v$. Then orient all edges of the open ear $E_i$ from $v$ to $u$. Clearly, this will create no directed cycle, and the endpoints of the open ear $E_1$ remain the minimum and maximum vertices, thus maintaining the inductive assertion.

Modify line (7) of Algorithm Diffuse as follows:

    If the edge $ij$ is directed from $i$ to $j$, then $i$ sends $\lceil P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rceil$ tasks to $j$. If the edge $ij$ is directed from $j$ to $i$, then $i$ sends $\lfloor P_{ij}\,(\mathrm{load}[i] - \mathrm{load}[j]) \rfloor$ tasks to $j$.


Figure 4. The number of tasks at adjacent vertices differ by two. Since the weight on each edge is 1/3, we would have liked to transfer 2/3 of a task across each edge, but in the "ceiling" scheme, a whole task moves. Thus, the load distribution oscillates.

Case 2: G is not biconnected. Find a biconnected supergraph $H$ that can be simulated by $G$ with constant delay. The graph $H$ may be constructed by adding edges to $G$ as follows: for each cut-vertex $u$ of $G$, chain the neighbors of $u$ in a cycle. It is easy to see that $H$ can be one-to-one embedded in $G$ with dilation 2 and edge-congestion 4. By [9, page 404], $G$ can simulate $H$ with a constant-factor delay.

Thus, even though the network at hand is $G$, the algorithm can load balance as though the network were $H$, incurring only a constant-factor delay.
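The construction of $H$ can be sketched in a few lines of Python. The version below finds cut-vertices by brute force (remove a vertex and test connectivity), which is adequate for illustration but not efficient, and then chains the neighbors of each cut-vertex in a cycle. The adjacency representation (a dict mapping each vertex to a set of neighbors), the sorted neighbor order, and the function names are assumptions made for this sketch.

```python
def is_connected(adj, removed=None):
    """True if the graph, minus the optional `removed` vertex, is connected."""
    verts = [v for v in adj if v != removed]
    if not verts:
        return True
    seen = {verts[0]}
    stack = [verts[0]]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w != removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(verts)


def cut_vertices(adj):
    """Vertices whose removal disconnects the graph (brute-force check)."""
    return [v for v in adj if not is_connected(adj, removed=v)]


def biconnected_supergraph(adj):
    """Add edges chaining the neighbors of every cut-vertex of G in a cycle."""
    H = {v: set(ns) for v, ns in adj.items()}
    for u in cut_vertices(adj):
        nbrs = sorted(adj[u])
        # Pair each neighbor with the next one, wrapping around to close the cycle.
        for a, b in zip(nbrs, nbrs[1:] + nbrs[:1]):
            if a != b:
                H[a].add(b)
                H[b].add(a)
    return H
```

For the path 0-1-2, vertex 1 is the only cut-vertex; chaining its neighbors adds the edge 0-2, turning the path into a triangle with no cut-vertices.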

5 Conclusion

To summarize, we have presented a rigorous analysis of the performance of the diffusion algorithm on arbitrary networks. We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's bandwidth.

For the case of the generalized mesh with wrap-around, we derive tighter bounds, and conclude that the diffusion algorithm is inefficient for lower dimensional meshes.

As shown in the back-tracking and PDE examples of Section 1, load-balancing usually arises as a part of another algorithm. This suggests that a load balancing algorithm should not be judged in


interesting to see such an analysis.

References

[1] B.D. Alleyne. Personal communication, Dept. of Electrical Engineering, Princeton University, 1994.

[2] D.P. Bertsekas and J.N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989.

[3] T.L. Casavant and J.G. Kuhn. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, 14(2):141-154, 1988.

[4] A.K. Chandra et al. The electrical resistance of a graph captures its commute and cover times. In Symposium on Theory of Computing, pages 574-586, May 1989.

[5] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. The Journal of Parallel and Distributed Computing, 7(2):279-301, 1989.

[6] H. Whitney. Non-separable and planar graphs. Transactions of the American Mathematical Society, 34:339-362, 1932.

[7] M. Jerrum and A. Sinclair. Conductance and the rapid mixing property for Markov chains: the approximation of the permanent resolved. In Symposium on Theory of Computing, pages 235-243, May 1988.

[8] R.M. Karp. Parallel combinatorial computing. In Jill P. Mesirov, editor, Very Large Scale Computation in the 21st Century, pages 221-238. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1991.

[9] T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo, CA, 1991.

[10] C.E. Leiserson and B.M. Maggs. Communication-efficient parallel algorithms for distributed random-access machines. Algorithmica, 3:53-77, 1988.

[11] D. Peleg and E. Upfal. The token distribution problem. In Symposium on Foundations of Computer Science, pages 418-427, 1986.

[12] E. Seneta. Non-negative Matrices and Markov Chains. Springer-Verlag, New York, NY, 1981.

[13] G. Strang. Linear Algebra and its Applications. Harcourt Brace Jovanovich, San Diego, CA, 1988.

[14] M.H. Willebeek-LeMair and A.P. Reeves. Strategies for dynamic load balancing on highly parallel