Towards Fast Reverse Time Migration Kernels using Multi-threaded Wavefront Diamond Tiling

Item Type Conference Paper

Authors Malas, T.;Hager, G.;Ltaief, Hatem;Keyes, David E.

Citation Malas, T., Hager, G., Ltaief, H., & Keyes, D. (2015). Towards Fast Reverse Time Migration Kernels using Multi-threaded Wavefront Diamond Tiling. Second EAGE Workshop on High Performance Computing for Upstream. doi:10.3997/2214-4609.201414025

Eprint version Post-print

DOI 10.3997/2214-4609.201414025

Publisher EAGE Publications

Journal Second EAGE Workshop on High Performance Computing for Upstream

Download date 2024-01-16 21:56:05

Link to Item http://hdl.handle.net/10754/578819

Second EAGE Workshop on High Performance Computing for Upstream

Introduction

Stencil computations emerge from finite-difference discretizations of differential operators on Cartesian lattices and can be found at the core of many algorithms in scientific computing, including seismic inversion, where high-order finite-difference operators prevail. The performance of naïve implementations on modern computer architectures is strongly limited by a high byte-to-flop ratio (code balance), which is why a collection of temporal blocking techniques has been devised to improve cache reuse and thus performance. Recently, high-order (i.e., long-range) stencils have gained much attention within the oil industry and beyond. High-order stencils pose specific challenges for typical temporal blocking techniques on contemporary multicore processors. Although it is theoretically possible to lower the byte-to-flop ratio enough to decouple from the memory bandwidth bottleneck, the achievable data reuse is limited in practice by cache size restrictions, and in-cache effects are already significant even without temporal blocking when constant coefficients are used. Hence, traditional temporal blocking approaches do not achieve a substantial speed-up compared to optimal, spatially blocked implementations. Our multi-threaded wavefront diamond blocking (MWD) approach alleviates the cache size problem and provides the potential for improved concurrency when updating in-cache data.

Multiple threads leverage the shared outer-level cache on the CPU to perform concurrent updates on a single grid tile, thereby effectively lowering the cache constraints and providing more flexibility for assigning parallel resources to grid tiles.

We give an overview of the MWD technique, show its advantages compared to previous temporal blocking solutions, and present benchmark data for a 25-point constant-coefficient stencil (see Fig. 1) on a recent Intel Haswell server processor.
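For orientation, the sketch below applies one 25-point star stencil at a single lattice site. It assumes the operator is the standard eighth-order central-difference Laplacian with unit grid spacing (radius 4 along each axis, giving 1 + 3·8 = 25 points); the paper does not list its coefficients, and the names `COEFFS`, `stencil_offsets`, and `laplacian_at` are illustrative rather than taken from the authors' code.

```python
# Illustrative sketch: assumes the standard 8th-order central-difference
# coefficients for the second derivative (unit grid spacing).
R = 4  # stencil radius along each axis
COEFFS = (-205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0, 8.0 / 315.0, -1.0 / 560.0)


def stencil_offsets():
    """Return the (dz, dy, dx) offsets touched by the star stencil."""
    offsets = [(0, 0, 0)]
    for r in range(1, R + 1):
        for s in (-r, r):
            offsets += [(s, 0, 0), (0, s, 0), (0, 0, s)]
    return offsets


def laplacian_at(u, k, j, i):
    """Apply the stencil to the nested-list grid u[z][y][x] at site (k, j, i)."""
    acc = 3.0 * COEFFS[0] * u[k][j][i]  # centre point, counted once per axis
    for r in range(1, R + 1):
        acc += COEFFS[r] * (
            u[k + r][j][i] + u[k - r][j][i]
            + u[k][j + r][i] + u[k][j - r][i]
            + u[k][j][i + r] + u[k][j][i - r]
        )
    return acc


if __name__ == "__main__":
    n = 2 * R + 1
    # u = x^2 + y^2 + z^2 has a Laplacian of 6 everywhere.
    u = [[[float(x * x + y * y + z * z) for x in range(n)]
          for y in range(n)] for z in range(n)]
    print(len(stencil_offsets()), laplacian_at(u, R, R, R))
```

Each update reads four neighbours in both directions along every axis, which is why the space-time tiles of such a stencil have a steep slope.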

Methodology: Multi-threaded wavefront diamond blocking

We use our cache blocking technique described in (Malas, Hager, Ltaief, Stengel, et al. 2014). The implementation code is also available online (Malas 2015).

Our proposed approach combines multi-threaded wavefront blocking (similar to (Wellein et al. 2009)) with diamond tiling. Multi-threaded wavefront blocking allows multiple threads to perform updates on the same tile in the shared last-level cache (LLC). This is particularly important for high-order stencils, where the steep slope of the tile in time requires a large data footprint for little data reuse. For example, the front view of Fig. 2a shows a case where the maximum data reuse for 32 grid cells in one dimension allows for a fourfold speedup. The large number of cores in modern processors further aggravates the problem, as a dedicated cache block is usually used per thread.

Diamond tiling offers maximum data reuse for data blocks loaded in the cache memory and provides independent space-time blocks that tessellate, allowing in-place updates, as shown in Fig. 2b. In this work, diamond tiling is performed along the y-axis, the multi-threaded wavefront is performed along the z-axis, and parallelepiped tiling is usually not used along the x-axis, in order to maximize contiguous memory accesses. Figure 2a shows three perspectives of our space-time diamond tile. It is similar to the work of (Strzodka et al. 2011), except that we use multi-threaded wavefronts instead of a single-thread wavefront.
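The reuse figure quoted for a 32-cell diamond can be reproduced with a simple one-dimensional count. This is a simplified sketch, not the paper's cache model: it assumes that the rows of a diamond widen by 2R lattice sites per time step (R = 4 being the stencil radius) up to the diamond width and then shrink symmetrically, and it defines reuse as the number of updates per lattice site of the widest row. The function names are hypothetical.

```python
def diamond_row_widths(diamond_width, radius):
    """Widths of the rows of one diamond, from its bottom tip to its top tip.

    Assumption: each time step widens (then narrows) the row by 2 * radius.
    """
    step = 2 * radius
    if diamond_width % step != 0:
        raise ValueError("diamond_width must be a multiple of 2 * radius")
    growing = list(range(step, diamond_width + 1, step))
    # Mirror the growing half, without repeating the widest row.
    return growing + growing[-2::-1]


def diamond_reuse(diamond_width, radius):
    """Updates per lattice site of the widest row (a simplified reuse measure)."""
    return sum(diamond_row_widths(diamond_width, radius)) / diamond_width


if __name__ == "__main__":
    print(diamond_row_widths(32, 4))
    print(diamond_reuse(32, 4))
```

Under these assumptions the reuse equals `diamond_width / (2 * radius)`, so a longer-range stencil needs a proportionally wider diamond, and hence more cache, to reach the same reuse.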

Figure 1 25-point isotropic constant-coefficient stencil operator (axes: x, y, z).

We utilize an auto-tuning strategy to select the most performance-efficient diamond tile size. The parameter search space is reduced by our analytical model, which predicts the largest diamond size that fits in a given cache size.
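The tuning loop can be sketched as follows. The analytical cache model is not reproduced in this paper, so the sketch takes it as a caller-supplied function; `footprint_bytes`, `run_benchmark`, and `tune_diamond_width` are hypothetical names, and the candidate widths are assumed to be multiples of a given step.

```python
def tune_diamond_width(cache_bytes, step, max_width, footprint_bytes, run_benchmark):
    """Pick the fastest diamond width among those predicted to fit in cache.

    footprint_bytes(width) -> predicted cache footprint (the caller's model).
    run_benchmark(width)   -> measured performance, higher is better.
    """
    # The model prunes the search space before anything is measured.
    candidates = [w for w in range(step, max_width + 1, step)
                  if footprint_bytes(w) <= cache_bytes]
    if not candidates:
        raise ValueError("no diamond width fits in the given cache size")
    # Only the surviving candidates are actually benchmarked.
    return max(candidates, key=run_benchmark)


if __name__ == "__main__":
    # Toy stand-ins for the model and the measurement (assumptions only).
    toy_footprint = lambda w: w * 1_000_000
    toy_benchmark = lambda w: -abs(w - 24)
    print(tune_diamond_width(40_000_000, 8, 64, toy_footprint, toy_benchmark))
```

Passing the model in as a function keeps the sketch independent of any particular footprint formula.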

a) Three perspectives of the update order in the extruded diamond cache block. Diamond tiling is performed along the y-axis and the multi-threaded wavefront along the z-axis; a two-thread wavefront is used here. The unit-stride x-axis is reduced to a single point. At this diamond width of 32 lattice sites we can achieve a 4× data reuse.

b) Diamond tiling along the y-axis. The figure is squeezed by a factor of four horizontally. The number of tiles per row corresponds to the maximum possible concurrent work at this dimension. Grey cells correspond to the subdomain boundaries in distributed memory setups and the red arrows correspond to the communication direction. Image from (Malas, Hager, Ltaief, Stengel, et al. 2014).

Figure 2 Multi-threaded wavefront diamond tiling algorithm update order. The update order of the lattice sites is shown in (a), while the tessellation of the diamond tiles is shown in (b). The front view in (a) corresponds to the same space-time dimensions as (b).

Results

We use MWD to optimize the 25-point constant-coefficient stencil performance. Figure 3 shows performance in stencil updates per second and memory bandwidth measurements at increasing cubic grid size. A spatially blocked version of the code is compared with 1-, 6-, 9-, and 18-thread wavefront diamond variants. 1WD is similar to the work of (Strzodka et al. 2011), where they call this variant CATS2.

The measurements are performed on an eighteen-core Intel Haswell-EP Xeon E5-2699v3 socket with a nominal clock speed of 2.3GHz. The socket has 128GB of DDR4-2133 RAM, with 46.5GB/s sustained memory bandwidth measured by the STREAM TRIAD (McCalpin 1995) benchmark.

Memory bandwidth measurements are performed through likwid-perfctr from the LIKWID multicore tools collection (“LIKWID Performance Tools” 2015).

We use the spatial blocking implementation described in (Malas, Hager, Ltaief, Stengel, et al. 2014).

Each lattice site update (LUP) in the spatially blocked code requires transferring 16 bytes to and from main memory: one array is updated (a read and a write) and two arrays are read, all in single precision. The performance predicted by the roofline model (Williams, Waterman, and Patterson 2009) is (50 GB/s)/(16 bytes/LUP) = 3.125 GLUP/s, which is very close to the measurement.
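The roofline estimate above is a single division, shown here as a small sketch of the memory-bound branch of the model. The function and constant names are illustrative; the 50 GB/s and 16 bytes/LUP figures are the ones quoted in the text.

```python
def roofline_glups(bandwidth_gb_per_s, bytes_per_lup):
    """Memory-bound performance limit in GLUP/s: bandwidth / code balance."""
    return bandwidth_gb_per_s / bytes_per_lup


# One updated array (read + write) and two read-only arrays,
# each access moving one 4-byte single-precision value.
BYTES_PER_LUP = 4 * (2 + 2)

if __name__ == "__main__":
    print(roofline_glups(50.0, BYTES_PER_LUP))  # 3.125
```

Temporal blocking raises this limit by lowering the bytes transferred per LUP, since each loaded value is reused for several time steps.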

As shown in Fig. 3a, MWD variants achieve about 36% to 77% speedup compared to the spatially blocked code. The 1WD performance degrades at larger grid sizes due to increasing cache capacity misses. 1WD requires maintaining 18 large tiles in the LLC, where the steep slope of the tiles limits the minimum possible tile size, which is also proportional to the leading dimension size. This results in high memory bandwidth usage for 1WD, as shown in Fig. 3b. The other MWD variants, in contrast, can use larger tiles, which enable more data reuse in the cache memory. The cache saving in MWD leads to good performance even at large grid sizes. In fact, none of the MWD variants saturates the memory bandwidth, as presented in Fig. 3b, with 18WD using about 65% of the main memory bandwidth at large grids. This makes our approach suitable not only for contemporary processors but also for future processors, as the gap between memory bandwidth and compute power is expected to increase further.

Figure 3a also shows that using more threads per cache block achieves the best performance at larger grid sizes, where the memory bandwidth savings outweigh the synchronization overhead within the thread group. Our approach reduces the data volume over slow data paths, thereby reducing the energy footprint, as presented in (Malas, Hager, Ltaief, and Keyes 2014).

Figure 3 Performance and memory bandwidth measurements of the 25-point constant-coefficient stencil computations at increasing grid size. Results are shown for spatially blocked code (Spt.blk.) and several MWD variants (1WD, 6WD, 9WD, 18WD). a) Performance in GLUP/s versus size in each dimension. b) Memory bandwidth in GB/s versus size in each dimension.

1WD does not have results at grid sizes smaller than 288 points along the y-axis, due to the concurrency limit. The minimum diamond width is 16 points and we use 18 threads, which results in a minimum of 16 × 18 = 288 points to have sufficient concurrency to use all the cores of the processor.
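The concurrency limit can be written as a one-line calculation. The sketch assumes that every thread group needs its own diamond within a row of tiles along the y-axis, which matches the 1WD case described above; the function name and the generalisation to larger thread groups are illustrative assumptions rather than results from the paper.

```python
def min_grid_points_y(min_diamond_width, threads, threads_per_group):
    """Smallest y-extent that gives every thread group one diamond per row."""
    if threads % threads_per_group != 0:
        raise ValueError("threads must be divisible by threads_per_group")
    groups = threads // threads_per_group
    return min_diamond_width * groups


if __name__ == "__main__":
    # 1WD on 18 cores: 18 single-thread groups, minimum diamond width 16.
    print(min_grid_points_y(16, 18, 1))  # 288
```

Under the same assumption, sharing a diamond among several threads lowers the number of groups and therefore the minimum y-extent needed to keep all cores busy.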

Conclusion

In this work, we show performance improvement for the 25-point stencil computations using our new temporal blocking scheme, which combines multi-threaded wavefront blocking with diamond tiling.

Our approach utilizes the maximum data reuse advantage of diamond tiling. It also reduces the cache block size requirement for stencil computations, especially for high-order stencils, by allowing threads to share cache blocks. As a result, it reduces the memory bandwidth usage, making it suitable for future bandwidth-starved systems. MWD provides a controllable trade-off between memory bandwidth pressure and thread synchronization overhead, which makes it efficient at various grid sizes and stencil types.

Acknowledgment

T. Malas was supported by the Extreme Computing Research Center at KAUST. Part of this work was supported by the German DFG priority programme 1648 (SPPEXA) within the project Terra-Neo, and by the Competence Network for Scientific High Performance Computing in Bavaria (KONWIHR) under project OMI4papps. We are indebted to Jan Eitzinger and Thomas Röhl (RRZE) who provided support with the LIKWID tool suite.

References

LIKWID Performance Tools [2015] http://code.google.com/p/likwid.

Malas, T. [2015] Girih Stencil Optimization Framework. https://github.com/tareqmalas/girih.

Malas, T., Hager, G., Ltaief, H. and Keyes, D. [2014] Towards Energy Efficiency and Maximum Computational Intensity for Stencil Algorithms Using Wavefront Diamond Temporal Blocking. arXiv preprint arXiv.org/abs/1410.5561.

Malas, T., Hager, G., Ltaief, H., Stengel, H., Wellein, G. and Keyes, D. [2014] Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates. To appear in SIAM Journal on Scientific Computing. arXiv preprint arXiv.org/abs/1410.3060.

McCalpin, J. D. [1995] STREAM: Sustainable Memory Bandwidth in High Performance Computers. Charlottesville, VA. http://www.cs.virginia.edu/stream/.

Strzodka, R., Shaheen, M., Pajak, D. and Seidel, H.-P. [2011] Cache Accurate Time Skewing in Iterative Stencil Computations. In International Conference on Parallel Processing, 571–581.

Wellein, G., Hager, G., Zeiser, T., Wittmann, M. and Fehske, H. [2009] Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization. 33rd Annual IEEE International Computer Software and Applications Conference, 579–586.

Williams, S., Waterman, A. and Patterson, D. [2009] Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4), 65–76.