LOAD BALANCING PHYSICS ROUTINES

R. W. FORD

Centre for Novel Computing, The University of Manchester, U.K. and

P. M. BURTON

Meteorological Office, Bracknell, U.K.

ABSTRACT

This paper presents three algorithms for load balancing physics routines with dynamic load imbalance. Results are presented for the two most computationally demanding load imbalanced physics routines (short wave radiation and convection) in the UKMO's forecast and climate models. Results show between 30% and 40% performance improvement in these routines running on the UKMO's Cray T3E.

1 Introduction

The physics phase of forecast and climate models is particularly amenable to large scale parallelism as, typically, each grid column can be independently computed. However, the existence of load imbalance in this phase is well known [6, 8] and has been widely acknowledged as being one of the major factors in loss of performance, particularly when using large numbers of processors, for example see [5, 9, 14, 15]. Interestingly, the same load imbalance problem is also observed in data assimilation [9, 10].

A domain decomposition approach is usually employed during the physics phase of parallel forecast and climate models, where equal sized regions of the computational grid are allocated to processors. In the U.K. Meteorological Office's (UKMO's) case, these regions are rectangular in latitude and longitude and there are an equal number of processors and regions. If no load balancing is performed, then all processors will compute an equal number of grid columns (ignoring differences due to integer division) and typically, for every time-step, a particular processor will compute the same grid columns. Unfortunately, the computational cost of grid columns can vary significantly, leading to load imbalance.

Improving load balance implies a re-distribution of grid columns. For a load balancing algorithm to be effective, the performance benefit due to improved load balance must outweigh any load balancing overheads. These overheads can be split into the following:

Communication. The total amount of communication used to re-distribute the grid columns should be minimised and overlapped with computation where possible.

Number of messages. The number of messages required to perform the communication should be minimised. If the amount of communication is fixed then this also maximises the size of messages. It also maximises the size of each set of grid columns computed at a time (a chunk), if each chunk associated with a message is computed separately.

Memory requirements. The increased memory requirements should be minimised.

Chunk size. The chunk size should be maximised. Some of the benefits of doing so are noted in [3]. On vector architectures, small chunks result in small vector lengths, thus reducing performance. On message passing architectures small chunks potentially lead to short messages, which can also be inefficient. If the model implementation dynamically allocates arrays (allowing the same executable to run with varying resolutions and processors without re-compilation) then there is also an overhead per chunk; this is significant in the UKMO's unified forecast and climate model (UM) [2]. If smaller chunks prove beneficial, for example due to better cache use, then the larger chunk can subsequently be blocked.

The relative importance of the overheads described above will vary depending on the characteristics of the model and machine that it is running on. Attempting to minimise all of these overheads is useful, as it maximises the chance of obtaining good performance across a number of current and next generation machines.

As the sophistication of the load balance algorithm increases, there is a corresponding improvement in load balance and decrease in the above overheads. However, more sophisticated algorithms take longer to compute and may use global communication to calculate the load imbalance. The philosophy of this paper is to develop more sophisticated algorithms, as their cost is small compared to the cost of load imbalance and the load balance overheads described above.

Predictable dynamic imbalance, where the load also varies over time-steps but in a manner such that the load for each grid column can be accurately estimated before the computation takes place. An example of this type of imbalance is short wave radiation in the UKMO's UM.

Unpredictable dynamic imbalance, where the load varies in an unpredictable manner such that the load for each grid column cannot be accurately estimated a priori. This is typically found when work depends on the dynamic meteorological phenomena. Examples are convection and precipitation.

This paper presents solutions targeted at the latter two forms of load imbalance. Clearly, however, dynamic algorithms can also cope with static distributions. These algorithms are tested on the UKMO's two most computationally intensive load-imbalanced routines, short wave radiation and convection. Short wave radiation is used to test our predictable dynamic imbalance algorithm. This re-distributes grid columns with the minimum amount of communication and attempts to minimise the number of messages. Convection is used to test two unpredictable dynamic imbalance algorithms. The first enables under-loaded processors to steal chunks from over-loaded processors and the second uses the execution time from the previous time-step as an estimate of the load for the current time-step.

In Section 2 we describe how our work complements other work in this area. Sections 3 and 4 present the algorithms developed for predictable and unpredictable dynamic imbalance, respectively. Section 5 gives the results of our experiments for both the UKMO's operational forecast and climate models running on their Cray T3E. Finally, Section 6 draws some conclusions and points the way to future work in this area.

2 Related Work

Foster and Toonen [12] show that the load balance of the PCCM2 model [16] can be improved by swapping grid columns between processors allocated regions on the opposite side of the globe in longitude. Three ways of achieving this are presented: swapping every other grid column, swapping excess day-lit grid columns, and swapping grid columns based on a pre-computed estimate of the cost of both lit and non-lit grid columns. The algorithms work well as (again) the radiation costs dominate. They note that there are other variances, but suggest that the partitioning of PCCM2, which pairs symmetric north/south latitudes, compensates for this. This may not always be the case in general (for example in local area models) and is not the case in the UM. Our algorithm for short wave radiation can be considered as a global version of their method. Our algorithm re-partitions data amongst all processors, guaranteeing load balance, whereas they load balance purely between pairs, which does not. In their case symmetry ensures a good solution. Their methods require communication only between pairs of processors when pre-computing how many grid columns should be sent (whereas ours requires global communication). However, their method does not guarantee to minimise the number of grid columns sent (globally), and therefore the data transferred between processors, which is the dominant communication cost.

Dent and Mozdzynski [13] also employ a global re-partitioning algorithm for an irregular distribution of radiation grid columns in ECMWF's IFS [1]. As in our algorithm, they send grid columns from the most over-loaded processor to the most under-loaded processor until all processors are load balanced. They claim that this minimises the number of messages; however (to our knowledge), they do not try to match pairs, which can give a further reduction in the number of messages required.

If the (static or dynamic) computation costs of the grid columns can be modelled accurately by a set of weighting factors, typically factoring whether certain computation within the physics routine takes place at grid columns, then the grid columns can be re-partitioned to give load balance. In [3] we used the top level of a set of gather masks to predict the computational cost of each grid column and improve load balance. Isaksen and Hamrud [10] have also suggested, and Saarinen, Vasiljevic and Jarvinen [19] have used, weighting factors for a static imbalance (in data assimilation). Foster and Toonen [12] (as mentioned earlier) pre-compute the cost of lit and non-lit grid columns. However, there may be many weighting factors, their accurate calculation may be difficult, and their values will vary across architectures and as the physics are improved. Again, a dynamic load balance algorithm could be used in conjunction with this technique.

Such an algorithm could also be used for the first few time-steps to load balance a static load imbalance problem.

Elbern [17] uses execution time from the previous time-step for a dynamically varying workload, a two dimensional partition and a local shuffling of work between adjacent processors. Our algorithm moves grid columns from over-loaded processors to under-loaded processors irrespective of their location. We also treat the grid columns as a single vector, which allows us to load balance to the resolution of a single grid column rather than a line of grid columns.

If the execution time from the previous time-step is not a good measure of the current load, or the cost of the algorithm is prohibitive, then a fully dynamic algorithm such as Dynamic Chunk or Guided Self Scheduling (GSS) [7] can be used. We present results of a dynamic chunk implementation. This was chosen as it gives simple control over the chunk size.

3 Predictable dynamic imbalance: Short Wave Radiation

Short wave radiation in the UM is a good example of predictable load imbalance. For the load to be predictable, we must make some assumptions as to its characteristics. Examining the code, we observe that it is only grid columns which are lit that perform computation. In the UM these grid columns are gathered into smaller arrays. Non-lit grid columns can therefore be safely assumed to have zero cost. The next observation is that each lit grid column performs approximately the same amount of computation. Our assumption, therefore, is that each lit grid column takes the same amount of time to compute. Earlier work has proven the accuracy of this assumption [3]. Given these assumptions, an equal distribution of lit grid columns across processors should result in an equal amount of time per processor and therefore load balance.

In a global model, half of the grid columns (G) will be lit and half will be in darkness. If there are P processors, a standard domain decomposition would give each processor G/P grid columns. As the number of lit grid columns is G/2, a load balanced solution would allocate G/(2P) lit grid columns to each processor. For a standard domain decomposition, where the grid columns are partitioned in latitude and longitude, for any reasonable number of processors it is easy to see that some processors will have all of their grid columns lit. This suggests that we should observe up to 50% load imbalance in short wave radiation in a global model.
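
To make the 50% figure concrete, here is a minimal Python sketch (ours, not UM code; the name radiation_imbalance and the 96 x 144 grid are illustrative assumptions) that measures the imbalance of lit columns under a regular latitude/longitude decomposition:

    import numpy as np

    def radiation_imbalance(lit_mask, proc_rows, proc_cols):
        """Fractional imbalance, 1 - average/maximum, of lit grid columns
        under a regular latitude x longitude domain decomposition."""
        counts = [region.sum()
                  for band in np.array_split(lit_mask, proc_rows, axis=0)
                  for region in np.array_split(band, proc_cols, axis=1)]
        return 1.0 - (sum(counts) / len(counts)) / max(counts)

    # Idealised day/night split on a 96 x 144 grid with 144 processors: the
    # eastern half is lit, so some processors hold only lit columns while
    # the balanced share is half their columns.
    mask = np.zeros((96, 144), dtype=int)
    mask[:, 72:] = 1
    print(radiation_imbalance(mask, 12, 12))  # 0.5, i.e. 50% load imbalance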

If the amount of communication required per grid column is C, G_i is the number of grid columns on processor i, and G_av is the average number of lit grid columns per processor (which should be G/(2P) in the global case), then the minimum amount of communication is

    C · Σ_i max(G_i − G_av, 0)

since only the surplus above the average ever needs to move.
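
As a cross-check on this expression, a small Python sketch (our notation, not code from the paper) computes the volume directly:

    def min_communication(columns_per_proc, cost_per_column):
        """Minimum data volume needed to balance: only columns above the
        average per-processor count, G_av, have to be sent anywhere."""
        g_av = sum(columns_per_proc) / len(columns_per_proc)
        surplus = sum(max(g - g_av, 0.0) for g in columns_per_proc)
        return cost_per_column * surplus

    # Four processors holding 40, 10, 30 and 20 lit columns; the average is
    # 25, so 15 + 5 = 20 columns (a volume of 20 C) must move.
    print(min_communication([40, 10, 30, 20], cost_per_column=1.0))  # 20.0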

Achieving this re-distribution with the minimum number of messages (and therefore maximum message size) is much more difficult. We attempt to minimise the number of messages by sending grid columns from the most over-loaded processor to the most under-loaded processor until all processors are load balanced.

Our algorithm orders processors in terms of load, then sends grid columns from the processor at the head of the list to the processor at the tail of the list. The number of grid columns transferred is the amount required to balance the least imbalanced of the two processors. At least one processor is always guaranteed to be load balanced after this step. Both will be balanced if they are pairs, i.e. they are equally over- and under-loaded. If they are not pairs, the reduced imbalance on the remaining processor is checked with the list to see if it has a pair. If so, grid columns are transferred between them and they are removed from the list. If not, the processor is re-inserted in the list. This process continues until all processors are balanced; see [18] for a more detailed analysis of this algorithm.
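
The following simplified Python sketch captures the head/tail step of this scheme (our reading of the algorithm; it assumes the column counts sum to an exact multiple of the integer average and omits the explicit pair-matching refinement, so it produces a transfer plan rather than actual sends):

    def plan_transfers(columns, average):
        """Return (source, destination, n_columns) messages that balance
        `columns` (grid columns per processor) to `average`. Assumes
        sum(columns) == len(columns) * average."""
        load = {p: c - average for p, c in enumerate(columns)}  # +surplus/-deficit
        transfers = []
        active = {p for p, d in load.items() if d != 0}
        while active:
            src = max(active, key=lambda p: load[p])  # most over-loaded (head)
            dst = min(active, key=lambda p: load[p])  # most under-loaded (tail)
            n = min(load[src], -load[dst])            # balance the lesser imbalance
            transfers.append((src, dst, n))
            load[src] -= n
            load[dst] += n
            active = {p for p in active if load[p] != 0}
        return transfers

    print(plan_transfers([40, 10, 30, 20], average=25))
    # [(0, 1, 15), (2, 3, 5)]: each step balances at least one processor,
    # so at most P - 1 messages are needed in total.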

The UM gathers the lit grid columns on each processor into contiguous dynamically allocated arrays. We increase the size of these dynamic arrays appropriately so that there is space to store any grid columns sent from remote processors. This technique allows remote grid columns to be computed in the same chunk as local grid columns, thus maximising chunk sizes.
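
The effect can be pictured with a short sketch (hypothetical names; the UM's actual gather is Fortran over multi-level fields): the packed array is over-allocated so that remote columns simply append to the local lit columns and the whole buffer is computed as one chunk.

    import numpy as np

    def gather_lit(field, lit_mask, n_remote):
        """Pack lit grid columns contiguously, leaving headroom so columns
        arriving from remote processors append to the same array."""
        local = field[lit_mask]                 # gather the lit columns
        buf = np.empty(local.size + n_remote, field.dtype)
        buf[:local.size] = local                # remote columns fill the tail
        return buf

    # 10 local columns, 5 of them lit, expecting 3 remote columns.
    field = np.arange(10.0)
    lit = np.array([True, False] * 5)
    print(gather_lit(field, lit, 3).size)       # 8: one contiguous chunk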

4 Unpredictable dynamic imbalance

Convection in the UM is a good example of unpredictable dynamic imbalance. The complexity of the control paths in the code makes it difficult to obtain an accurate estimate of the computational cost of a grid column (although in [3] we show that some performance improvement is possible using this approach). In this section, we present two possible approaches to unpredictable dynamic imbalance, namely, dynamic chunk and dynamic feedback.

4.1 Dynamic Chunk

Dynamic chunk is one of a class of dynamic algorithms that splits the local grid columns into a number of chunks. Chunks are then computed locally until all are completed. When complete, the processor looks for chunks from remote processors which have yet to be computed and steals them. Dynamic chunk uses a fixed chunk size which allows easy control. Such algorithms do not require any application specific information and can be implemented efficiently on a Cray T3E. Note, this algorithm uses the T3E's global address space, one sided messaging and atomic operations to implement a shared memory algorithm, allowing chunk stealing from a processor while it is computing; this is not possible to implement efficiently in message passing without a separate server thread. There is also a clear trade-off between the load balance, which improves as the number of chunks increases (as it is not possible to load balance to a finer resolution than a chunk), and the relative overheads associated with a chunk, which increase as the chunk size decreases.
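
The sketch below conveys the idea in portable form, with Python threads and per-queue locks standing in for the T3E's one-sided messaging and atomic operations (run_dynamic_chunk and the rank ordering are our inventions; the real implementation steals across distributed memories):

    import threading
    from collections import deque

    def run_dynamic_chunk(work_per_proc, chunk_size, compute):
        """work_per_proc: one list of grid columns per 'processor'."""
        queues = [deque(cols[i:i + chunk_size]
                        for i in range(0, len(cols), chunk_size))
                  for cols in work_per_proc]
        locks = [threading.Lock() for _ in queues]

        def worker(rank):
            # Compute local chunks first, then steal from remote queues.
            for p in [rank] + [q for q in range(len(queues)) if q != rank]:
                while True:
                    with locks[p]:                  # atomic chunk claim
                        chunk = queues[p].popleft() if queues[p] else None
                    if chunk is None:
                        break
                    compute(rank, chunk)            # work done outside the lock

        threads = [threading.Thread(target=worker, args=(r,))
                   for r in range(len(queues))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # Processor 0 holds far more columns; idle processors steal its chunks.
    run_dynamic_chunk([list(range(40)), list(range(5)), list(range(5))],
                      chunk_size=5, compute=lambda rank, chunk: None)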

4.2 Dynamic Feedback

Our dynamic feedback algorithm uses the execution time from the previous time-step to estimate the load per grid column for the current time-step. The first time-step simply uses the time taken by a processor and the number of grid columns it computes to estimate the load of each grid column. The work is then re-distributed based on this estimation to load balance the code, giving each processor an average amount of time. The same algorithm discussed in Section 3 is used to minimise communication and hence the number of messages. However, this version is a little more complicated as we are balancing time, which can be split to any accuracy, and transferring grid columns, which can obviously only be transferred in integer numbers. This means that the pairs option discussed in Section 3 cannot be used and that under-loaded processors cannot be removed from the list; see [18] for more details. Subsequent time-steps use the time from locally and remotely computed chunks to estimate the load.
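
A minimal sketch of the first time-step's estimate (our formulation; it ignores the integer-rounding and list-handling complications just described): the previous step's time per local column prices each column, and surplus time above the average converts into a number of columns to send away.

    def columns_to_shed(times, columns):
        """times[i], columns[i]: last time-step's compute time and column
        count on processor i. Returns how many columns each over-loaded
        processor should send away (fractional; a real implementation
        must round to whole columns)."""
        per_col = [t / c for t, c in zip(times, columns)]  # cost per local column
        target = sum(times) / len(times)                   # average time wanted
        return [max(t - target, 0.0) / k for t, k in zip(times, per_col)]

    # Processor 0 was four times slower than the others over 100 columns each:
    print(columns_to_shed(times=[4.0, 1.0, 1.0], columns=[100, 100, 100]))
    # [50.0, 0.0, 0.0]: sending ~50 of processor 0's columns (25 to each
    # neighbour) leaves every processor with ~2.0 time units of work.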

Each local chunk and each remote chunk is computed and timed separately. This allows accurate timing of all chunks, minimises extra memory use and maximises the size of local chunks. However, although the algorithm also attempts to keep remote chunks as large as possible, these can be arbitrarily small (down to a single grid column). In the UM's convection routine, the cost of a grid column in a small chunk is significantly greater than the cost of the same grid column in a large chunk; see Section 5.2.

5 Results

5.1 Short Wave Radiation

average original time gives a measure of the overhead of the algorithm. The difference between the maximum load balanced time and the average load balanced time gives a measure of the inaccuracy in the assumptions mentioned in Section 3.

[Figure 1: Predictable dynamic imbalance algorithm with 144 processors. Time (s) against timestep for the curves "asis max", "asis av", "lb max" and "lb av".]

5.2 Convection

Figure 2 shows the results of running the UM with its current static partition (labelled "asis") and the new dynamic chunk algorithm (labelled "lb"), described in Section 4.1, for three and five chunks. The example is the same forecast model run used in the previous section, again running on 144 processors (the number of processors used operationally). For both methods the maximum and average time is given. The maximum time shows how long the routine takes, whereas the average time shows how long the routine would take if it were perfectly balanced (the ideal). Our method improves the load balance of convection, reducing the time to solution by over 40%. The difference between the average load balanced times and the average original time is negligible; however, there is some load imbalance remaining. Five chunks per processor gives a better load balance solution than three chunks per processor.

[Figure 2: Dynamic Chunk algorithm with 144 processors. Time (s) against timestep for the curves "asis max", "asis average", "lb 3 segs max", "lb 3 segs av", "lb 5 segs max" and "lb 5 segs av".]

due to the small size of the data set. The results of running the UM with the current static partition (labelled "asis") and with both the new dynamic chunk algorithm (labelled "lb chunk") and the dynamic feedback algorithm (labelled "lb feedback") are given in Figure 3. The example is running on 36 processors (the number of processors used operationally). The original code computes convection in 3 chunks on each processor; if only 1 chunk is used we see a corresponding increase in performance. Maximum and average (the ideal) times are given for these two runs. This difference shows the increased overhead of using smaller chunks. The dynamic chunk algorithm provides further improvement and the feedback algorithm gives the best results, reducing the time to solution by approximately 33%.

[Figure 3: Dynamic Chunk and Feedback algorithms with 36 processors. Time (s) against timestep for the curves "asis max", "asis average", "asis 1seg max", "asis 1seg average", "lb chunk 3segs max" and "lb feedback max".]

[Figure: summed convection segment time (s) against segment size.]

6 Conclusions and Future Work

This paper has presented an effective algorithm for load balancing predictable dynamic load imbalance, which minimises communication between processors and attempts to minimise the number of messages. This results in nearly 40% performance increase for short wave radiation in the UKMO's operational forecast model. This routine will be integrated into the UM in the next release.

It has also presented two effective algorithms for load balancing unpredictable dynamic load imbalance. These improve the performance of convection in the UKMO's operational forecast and climate models by between 30% and 40%. These algorithms are also expected to be applicable to other physics routines in the UM. The dynamic chunk algorithm is already being used operationally for convection.

For future work we intend to examine the overheads of the dynamic feedback algorithm in more detail, introducing modifications to increase the chunk size by merging remote chunks, if appropriate, and possibly attempting to reduce the dynamic memory allocation overhead which increases the costs of small chunk sizes. We also intend to implement the most appropriate algorithms in other physics routines and examine the limits of scalability of the model.

7 Acknowledgements

The authors would like to thank the UKMO for funding this research.

References

[1] Barros, S. R. M., D. Dent, L. Isaksen and G. Robinson (1994) The IFS model: overview and parallel strategies, Proc. 6th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 303-318.

[2] Bell, R. S. and A. Dickinson (1990) The Meteorological Office Unified Model for data assimilation, climate modelling and NWP and its implementation on a Cray Y-MP, Proc. 4th Workshop on the Use of Parallel Processors in Meteorology, World Scientific.

[3] Ford, R. W., D. Snelling and A. Dickinson (1994) Controlling load balance, cache use and vector length in the UM, Proc. 6th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 195-205.

[5] Michalakes, J. G. and R. S. Nanjundiah (1994) Computational load in model physics of the parallel NCAR Community Climate Model, Tech. Rep. ANL/MCS-TM-186, Argonne National Laboratory, Argonne, Illinois.

[6] Michalakes, J. G. (1991) Analysis of workload and load balancing issues in the NCAR Community Climate Model, Tech. Rep. ANL/MCS-TM-144, Argonne National Laboratory, Argonne, Illinois.

[7] Polychronopoulos, C. P. and D. J. Kuck (1987) Guided self-scheduling: a practical scheduling scheme for parallel supercomputers, IEEE Trans. Comput., vol. C-36, no. 12, pp. 1425-1439.

[8] Dent, D., et al. (1994) The IFS model performance measurements, Proc. 6th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 352-369.

[9] Burton, P. and A. Dickinson (1996) Parallelising the Unified Model for the Cray T3E, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 68-82.

[10] Isaksen, L. and M. Hamrud (1996) ECMWF operational forecasting on a distributed memory platform: analysis, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 22-35.

[11] Schattler, U. and E. Krenzien (1996) Model development for parallel computers at DWD, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 83-100.

[12] Foster, I. T. and B. R. Toonen (1994) Load-balancing algorithms for climate models, Tech. Rep. ANL/MCS-TM-190, Argonne National Laboratory, Argonne, Illinois.

[13] Dent, D. and G. Mozdzynski (1996) ECMWF operational forecasting on a distributed memory platform: forecast model, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 36-51.

[14] Eerola, K., et al. (1996) A parallel version of the Hirlam forecast model: strategy and results, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 134-143.

[16] Drake, J., et al. (1994) PCCM2: A GCM adapted for scalable parallel computers, Proc. AMS Annual Meeting, AMS.

[17] Elbern, H. (1996) Load balancing of a comprehensive air quality model, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 429-444.

[18] Ford, R. W. (1998) A message minimisation algorithm, CNC Tech. Rep., Department of Computer Science, The University of Manchester, Manchester, U.K.

[19] Saarinen, S., D. Vasiljevic and H. Jarvinen (1996) Parallelization of observation processing, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffmann and N. Kreitz, eds.), World Scientific, pp. 52-67.
