8.4 Performance of MP TLR-MVM in SRI

8.4.1 Experimental Settings

8.4.2 Performance Results on 3D Seismic Dataset

Performance Impact of GPU Streams on TLR-MVM
We study the impact of the number of GPU streams on TLR-MVM performance. We use a single frequency matrix (f = 33 Hz) as a good proxy of the overall set of frequency matrices. Figure 8.5 presents the impact on FP32 TLR-MVM using A100 in terms of bandwidth and time to solution. The number of streams has a direct impact on the level of concurrency, as described in Fig. 6.7. The bandwidth starts saturating at around eight streams for FP32 TLR-MVM (70% of the sustained bandwidth), reaching up to a 2.5X speedup compared to single-stream execution. Figure 8.6 shows a similar performance impact on TLR-MVM when using the mixed precisions FP16/FP32 (i.e., MP FP16), BF16/FP32 (i.e., MP BF16), and INT8/INT32/FP16/FP32 (i.e., MP INT8). The bandwidth of MP FP16, MP BF16, and MP INT8 keeps improving as we increase the number of streams. This is an indication that there is room to further maximize hardware occupancy, especially when operating on smaller data structures than in pure FP32 TLR-MVM. At 20 streams, MP INT8 TLR-MVM eventually runs faster than MP FP16 / MP BF16 TLR-MVM, which in turn run faster than FP32 TLR-MVM, thanks to a higher arithmetic intensity. There is no significant difference between MP FP16 and MP BF16 in terms of performance, reaching the same conclusion as for the accuracy assessment.
Figure 8.5: Impact of #streams on FP32 TLR-MVM using A100 (bandwidth in GB/s and time in microseconds vs. number of CUDA streams).
Figure 8.6: Impact of #streams on mixed-precision TLR-MVM (MP FP16, MP BF16, MP INT8) using A100 (bandwidth in GB/s and time in microseconds vs. number of CUDA streams).
Performance Impact of Matrix Reordering and Mixed-Precision Computations on TLR-MVM
We now assess the performance of matrix reordering combined with various mixed-precision computations on TLR-MVM, executed on all 150 frequency matrices of the seismic datasets. Figure 8.7 compares them against dense MVM and the reference pure FP32 TLR-MVM using A100. First of all, it is important to mention that the dense matrices do not all fit in the A100 memory (115 GB > 80 GB), so for the dense MVM we simply multiply the time to solution of a single MVM by 150. With algebraic compression and matrix reordering, however, all matrices in FP32 TLR-MVM fit into the A100 memory. Looking at the MP variants of TLR-MVM, we see that MP FP16 and MP BF16 TLR-MVM perform similarly. Moreover, phase 3 takes more time than phase 1 due to the shapes of the stacked U (short and wide) and V (tall and skinny) matrices during the batch MVMs, respectively: within a single MVM, the former requires reduction steps while the latter is embarrassingly parallel. When we activate the INT8 MP variant of TLR-MVM (which in fact requires four precisions, as explained in Section 8.2.3), we get an additional performance improvement and reach up to a 12.5X speedup compared to dense MVM. Phase 2 takes more time in this MP variant due to the additional scaling operation.
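To make the phase structure concrete, the following host-side sketch spells out the three phases with plain loops standing in for the batched GPU kernels. The `Tile` layout, the uniform tile size `nb`, and the direct indexing used in place of a phase-2 reshuffle buffer are simplifications for illustration only, not the thesis data structures.

```cpp
// Host-side sketch of the three TLR-MVM phases for an nt x nt tiling with
// tile size nb; each tile is compressed as A_ij ~= U_ij * V_ij^T (column-major).
#include <vector>

struct Tile {
    std::vector<float> U;  // m x k
    std::vector<float> V;  // n x k
    int m, n, k;
};

void tlr_mvm(const std::vector<std::vector<Tile>>& A,  // A[i][j]
             const std::vector<float>& x, std::vector<float>& y,
             int nb, int nt) {
    // Phase 1: yv_ij = V_ij^T * x_j -- small independent products over the
    // tall-and-skinny V bases, embarrassingly parallel across tiles.
    std::vector<std::vector<std::vector<float>>> yv(nt, std::vector<std::vector<float>>(nt));
    for (int j = 0; j < nt; ++j)
        for (int i = 0; i < nt; ++i) {
            const Tile& t = A[i][j];
            yv[i][j].assign(t.k, 0.0f);
            for (int r = 0; r < t.k; ++r)
                for (int p = 0; p < t.n; ++p)
                    yv[i][j][r] += t.V[r * t.n + p] * x[j * nb + p];
        }

    // Phase 2: reshuffle the intermediate results from a column-wise to a
    // row-wise traversal (the MP INT8 variant applies its extra scaling here).
    // Skipped in this sketch: phase 3 simply indexes yv[i][j] directly.

    // Phase 3: y_i += U_ij * yv_ij -- the short-and-wide stacked U requires
    // reductions across each tile row.
    y.assign(nt * nb, 0.0f);
    for (int i = 0; i < nt; ++i)
        for (int j = 0; j < nt; ++j) {
            const Tile& t = A[i][j];
            for (int r = 0; r < t.k; ++r)
                for (int p = 0; p < t.m; ++p)
                    y[i * nb + p] += t.U[r * t.m + p] * yv[i][j][r];
        }
}
```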
Figure 8.7: Time breakdown on 150 frequency matrices using A100

Figure 8.8 highlights the incremental impact of matrix reordering on the memory footprint of TLR-MVM. TLR-MVM achieves 4X/17X memory footprint reductions against dense MVM with normal and Hilbert matrix reordering, respectively. When the MP INT8 variant is enabled, this translates into a 63X reduction of the memory footprint compared to dense MVM.
Figure 8.8: Memory footprint (dataset size in GB, log scale): dense MVM 115.27 GB; normal ordering 31.86 GB; Hilbert FP32 7.33 GB; Hilbert MP FP16 3.66 GB; Hilbert MP INT8 1.83 GB (63X smaller than dense MVM).
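For intuition on where these reductions come from, the footprint comparison is simple bookkeeping: a dense frequency matrix stores m × n values, while each compressed tile stores only (tile_m + tile_n) × k values for its U and V bases. The helper below is illustrative (names and the uniform tile size are assumptions); Hilbert reordering lowers the tile ranks k, and the MP variants lower the bytes per value, which together compound into the 63X reduction of Figure 8.8.

```cpp
// Illustrative footprint bookkeeping (not the thesis data structures):
// dense vs. TLR storage of one frequency matrix.
#include <cstddef>
#include <vector>

// Dense matrix: m x n values.
std::size_t dense_bytes(std::size_t m, std::size_t n, std::size_t bytes_per_value) {
    return m * n * bytes_per_value;
}

// TLR matrix: each tile A_ij ~= U_ij V_ij^T stores (tile_m + tile_n) * k_ij values.
std::size_t tlr_bytes(const std::vector<std::size_t>& tile_ranks,
                      std::size_t tile_m, std::size_t tile_n,
                      std::size_t bytes_per_value) {
    std::size_t total = 0;
    for (std::size_t k : tile_ranks)
        total += (tile_m + tile_n) * k * bytes_per_value;
    return total;
}
```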
Figure 8.9: Scalability of FP32, MP FP16, MP BF16, and MP INT8 TLR-MVM on 1, 2, 4, and 8 GPUs (time in microseconds).
Performance Scalability and Roofline of MP TLR-MVM
An analysis of the scalability of the various MP variants of our TLR-MVM algorithm is finally shown in Fig. 8.9 when processing all frequency matrices on up to eight A100s. Our various implementations deliver linear scalability, given the massive parallelism exposed to the CUDA Graph via streams, as explained in Section 6.3.3. Moreover, Figs. 8.10 and 8.11 show the roofline performance model of our MP TLR-MVM kernel with normal and Hilbert matrix reordering, respectively. We see an increase in arithmetic intensity with MP enabled, since we stream a smaller data volume from main memory. With larger datasets, we could further saturate the memory subsystem and reach a higher sustained bandwidth, while the Hilbert-reordered variants ultimately remain the fastest.

Figure 8.10: Roofline with default ordering (GFlops/s vs. operational intensity in Flops/Byte; Peak FP32 and Peak FP16 ceilings; HBM bandwidth 2039.0 GB/s). Sustained: FP32 w/ Normal 1338 GB/s, FP16 w/ Normal 841 GB/s, INT8 w/ Normal 546 GB/s.

Figure 8.11: Roofline with Hilbert ordering (GFlops/s vs. operational intensity in Flops/Byte; Peak FP32 and Peak FP16 ceilings; HBM bandwidth 2039.0 GB/s). Sustained: FP32 w/ Hilbert 689 GB/s, FP16 w/ Hilbert 377 GB/s, INT8 w/ Hilbert 207 GB/s.
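As a reading aid for Figs. 8.10 and 8.11, the roofline bound itself is a one-line formula: attainable performance is the minimum of the compute peak and the product of operational intensity and sustained bandwidth. The helper below is purely illustrative; the 0.5 flop/byte figure in the final comment is the generic value for an FP32 matrix-vector product (2mn flops over roughly 4mn bytes read), not a measured quantity.

```cpp
// Illustrative roofline helper: attainable GFlops/s for a kernel with a given
// operational intensity (flops/byte), sustained bandwidth (GB/s), and compute
// peak (GFlops/s).  All inputs are parameters; no hardware constants are assumed.
#include <algorithm>

double roofline_gflops(double oi_flops_per_byte, double bandwidth_gb_s, double peak_gflops) {
    // Memory-bound region: performance grows linearly with operational intensity
    // until it hits the compute peak.
    return std::min(peak_gflops, oi_flops_per_byte * bandwidth_gb_s);
}

// Example: an FP32 MVM-like kernel at ~0.5 flop/byte sustaining the 1338 GB/s of
// Fig. 8.10 is bounded at roughly 0.5 * 1338 = 669 GFlops/s, i.e., memory-bound.
```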
TLR-MVM Performance on Various Architectures
Figure 8.12: Performance results comparison for a single frequency matrix (f = 33 Hz): (a) memory bandwidth; (b) time to solution. Red rectangles in (a) correspond to the theoretical memory bandwidth on each system.
Figures 8.12a and 8.12b show the memory bandwidth and time to solution achieved by a single FP32 TLR-MVM on five different systems (see Table 6.1), respectively. The selected frequency matrix is a good proxy of the 150 matrices of the realistic synthetic datasets. These two figures allow us to compare our TLR-MVM implementation against the traditional dense MVM kernel on various hardware architectures. On Rome, FP32 TLR-MVM with default ordering delivers more than an order of magnitude performance improvement, thanks to the high capacity of the Last-Level Cache (LLC), but also to a proper MPI+OpenMP binding that maps processes/threads onto the underlying core complex (CCX) of the AMD packaging. Indeed, we launch 32 MPI processes with four OpenMP threads each, matching the four cores per CCX that share the same LLC, and reach a throughput higher than the theoretical memory bandwidth. On ICX, our TLR-MVM with default ordering saturates the bandwidth, unlike the dense MVM, since the compressed data of TLR-MVM is now more amenable to fitting into cache. The same analysis applies to A64FX, where our TLR-MVM benefits from cache effects along with HBM technology. On Aurora and A100, our TLR-MVM with default ordering is able to take advantage of both specific architectures (i.e., long vector and SIMD units), while also exploiting HBM. When we enable Hilbert ordering, additional performance gains can be observed on all architectures in terms of time to solution, although the sustained bandwidth decreases everywhere except on ICX. TLR-MVM with Hilbert ordering appears to create even more opportunities to fit into cache on ICX, which yields a throughput even higher than the theoretical memory bandwidth.
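As an illustration of the kind of CCX-aware placement described above, the sketch below pins the four OpenMP threads of each MPI rank to four consecutive cores so that they share one LLC slice. The contiguous core numbering and the `cores_per_ccx` constant are assumptions for the example; in practice the same binding can equally be obtained at launch time through the MPI runtime and OpenMP environment variables.

```cpp
// Illustrative CCX-aware pinning: each MPI rank binds its four OpenMP threads
// to four consecutive cores.  Core numbering is assumed contiguous per CCX;
// adjust to the actual machine topology.
#define _GNU_SOURCE
#include <sched.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int cores_per_ccx = 4;  // four cores share one LLC slice on AMD Rome
    #pragma omp parallel num_threads(cores_per_ccx)
    {
        int core = rank * cores_per_ccx + omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        sched_setaffinity(0, sizeof(cpu_set_t), &set);  // pin the calling thread
        // ... per-thread TLR-MVM work on tiles kept resident in this CCX's LLC ...
    }

    MPI_Finalize();
    return 0;
}
```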
We made an attempt to develop MP variants of TLR-MVM for ICX and A64FX (not shown), since some level of support for lower precisions is available. This requires replacing our memory-bound MVM kernel with a GEMM kernel that operates on a single vector. As explained in Section 8.2.1, we achieve low performance compared to FP32 TLR-MVM on the respective hardware. We would need further support from the vendors to reconsider these preliminary implementations in the future.
Overall Performance of the Application
To conclude, we present a short summary of the overall impact of MP TLR-MVM on the entire SRI algorithm. Following [1], we consider a scenario where we are interested in redatuming the surface seismic data to a regular grid of size 41 × 71, covering an area of about 1.3 km² at a given subsurface depth. Figure 8.13 presents a time breakdown of the two phases involved in the entire SRI application. It is clear that the compression phase (blue bar) incurs a minimal overhead, also due to the embarrassingly parallel nature of this process. On the other hand, the actual redatuming process accounts for more than 90% of the overall time to solution; most of this time is actually spent in the various TLR-MVM computations (orange bar), whilst the rest of the time is shared between the various FFTs (green bar) involved in the MDC operator and some vector-vector and vector-scalar operations in the LSQR solver (red bar). Note that if dense MVMs had been used here instead of MP TLR-MVMs, the overall time to solution would have increased further, in line with our findings in Figure 8.7.
Figure 8.13: Time breakdown of the SRI workflow in [1] after integrating MP INT8 TLR-MVM on one A100 GPU (time in minutes; compression, MP INT8 TLR-MVM, FFT, LSQR, and overhead).