

7.6.6 Performance of TLR-MVM on Cerebras CS-2 System

Impact of Tile Size on Memory Bandwidth

Figure 7.18: Design of a new communication-avoiding implementation of the TLR-MVM kernel

To understand the impact of the tile size on memory bandwidth, we perform a single precision batch of MVMs with constant size N on each PE of a single Cerebras CS-2 system. Figure 7.19 shows the aggregate memory bandwidth across a fabric of 750×994 PEs, i.e., the maximum the SDK allows us to use since some PEs are reserved for routing data on and off the wafer, for both a simulated and a real CS-2. The simulated memory bandwidth numbers are obtained from the performance model by taking the cycle count for performing a single MVM on a single PE and scaling the bandwidth up to the entire 750×994 grid of PEs. The bandwidth for the real CS-2 is obtained by performing 10,000 MVMs on each PE, dividing the measured cycle count by 10,000, and taking the worst resulting cycle count among all 745,500 PEs. We then calculate the memory bandwidth as bytes accessed × 850 MHz / cycle count. The relative bandwidth quickly saturates at 2 PB/s as we increase the matrix size, which transitions the batch MVM execution from a memory-bound to a compute-bound operation. This synthetic benchmark using constant ranks is an ideal case and gives a good perspective for our TLR-MVM kernel. The absolute bandwidth, based on the actual memory accesses, demonstrates the hardware capability by achieving a 3X speedup over the relative bandwidth.
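To make the bandwidth accounting concrete, the short Python sketch below applies the formula above, bytes accessed × 850 MHz / cycle count, and scales the result to the full fabric. The function name and the sample inputs are illustrative placeholders, not values from the benchmark harness.

    CLOCK_HZ = 850e6          # CS-2 clock rate used throughout this section
    GRID_PES = 750 * 994      # PEs the SDK exposes on one CS-2 (745,500)

    def pe_bandwidth_gbs(bytes_accessed, cycles):
        # Per-PE sustained bandwidth in GB/s: bytes * clock / cycles.
        return bytes_accessed * CLOCK_HZ / cycles / 1e9

    # Hypothetical per-PE measurement: 48 KB of tile data streamed in 12,000 cycles.
    per_pe_gbs = pe_bandwidth_gbs(48e3, 12_000)
    aggregate_pbs = per_pe_gbs * GRID_PES / 1e6   # scale to the 750x994 fabric, in PB/s
    print(f"per PE: {per_pe_gbs:.2f} GB/s, fabric aggregate: {aggregate_pbs:.2f} PB/s")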

Impact of the Stack Width on Hardware Occupancy

Figure 7.19: Impact of the tile size on the relative and absolute memory bandwidths for single precision batched MVMs with constant matrix size N

We show the performance of the TLR-MVM kernel on a single CS-2 for the five validated configurations. Because these configurations engender larger memory footprints than can fit on a single CS-2, we partition the dataset into six shards by splitting the stack width accordingly, so that the workloads are as evenly distributed as possible. Table 7.1 identifies the stack width such that each dataset shard nearly fills all PEs.

nb    acc       Stack width    PEs used     Occupancy
25    0.0001    64             4417690      99%
50    0.0001    32             4330150      97%
70    0.0001    23             4416383      98%
50    0.0003    18             4445947      99%
70    0.0003    14             4252877      95%

Table 7.1: Configurations delivering proper MDD accuracy
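As a quick sanity check on the Occupancy column, the sketch below recomputes it as PEs used divided by the PEs available to the six shards, assumed here to be 6 × 750 × 994 = 4,473,000 based on the fabric size quoted earlier; small deviations from the rounded percentages in Table 7.1 are expected.

    FABRIC_PES = 750 * 994          # PEs usable on one CS-2
    SHARDS = 6                      # dataset split into six shards

    # (nb, acc, stack width, PEs used) taken from Table 7.1
    configs = [
        (25, 1e-4, 64, 4_417_690),
        (50, 1e-4, 32, 4_330_150),
        (70, 1e-4, 23, 4_416_383),
        (50, 3e-4, 18, 4_445_947),
        (70, 3e-4, 14, 4_252_877),
    ]

    for nb, acc, width, pes_used in configs:
        occ = pes_used / (SHARDS * FABRIC_PES)   # fraction of all available PEs kept busy
        print(f"nb={nb:2d} acc={acc:.0e} stack width={width:2d} occupancy={occ:6.1%}")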

Reporting Bandwidth Metrics

Since the workload is embarrassingly parallel, with no communication or synchronization between shards of the dataset, we can compute the total aggregate bandwidth by taking the worst cycle count across all PEs and all runs. Table 7.2 shows the total number of memory accesses in bytes and the worst cycle count across all PEs.

nb    acc       Worst cycle count    Relative memory accesses (bytes)    Absolute memory accesses (bytes)
25    0.0001    21350                2.94E+11                            6.85E+11
50    0.0001    19214                2.60E+11                            6.71E+11
70    0.0001    19131                2.60E+11                            6.89E+11
50    0.0003    12275                1.64E+11                            3.89E+11
70    0.0003    12999                1.64E+11                            4.06E+11

Table 7.2: Worst cycle count and number of memory accesses (bytes)

Table 7.3 presents the aggregate relative and absolute bandwidths using six shards running successively on a single CS-2. We achieve a maximum relative sustained bandwidth of 12.26 PB/s on six shards (14.44 µs) with nb = 50 and acc = 3e−4, a nearly linear speedup compared to the bandwidth achieved in the ideal synthetic case on a single CS-2 reported in Fig. 7.19.

nb    acc       Aggregate relative bandwidth (PB/s)    Aggregate absolute bandwidth (PB/s)
25    0.0001    11.24                                  26.19
50    0.0001    11.70                                  30.15
70    0.0001    11.92                                  31.62
50    0.0003    12.26                                  29.05
70    0.0003    11.60                                  28.79

Table 7.3: Aggregate bandwidth metrics on six shards
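The aggregation described above can be expressed in a few lines. The sketch below is a minimal illustration, assuming the 850 MHz clock and that the worst cycle count sets the effective runtime; because the entries of Table 7.2 are rounded, the recomputed values only approximate the bandwidths of Table 7.3.

    CLOCK_HZ = 850e6   # CS-2 clock rate

    def aggregate_bandwidth_pbs(total_bytes, worst_cycles_per_run):
        # Aggregate bandwidth in PB/s: total bytes moved divided by the
        # time implied by the worst cycle count across all PEs and runs.
        worst_time_s = max(worst_cycles_per_run) / CLOCK_HZ
        return total_bytes / worst_time_s / 1e15

    # Illustrative call using the nb = 25, acc = 1e-4 row of Table 7.2
    # (only a single worst cycle count is tabulated, so the list has one entry).
    print(aggregate_bandwidth_pbs(2.94e11, [21350]))   # ~11.7 PB/s vs. 11.24 PB/s reported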

Performance Scalability

Table 7.4 shows strong scaling performance for the configuration nb = 25 and acc = 1e−4, based on the first strong scaling strategy with 6, 12, 16, and 20 shards. This corresponds to stack widths of 64, 32, 24, and 19, respectively, chosen to ensure enough concurrency is exposed to keep most of the PEs busy with work. To compute the aggregate bandwidth, we again use the worst cycle count recorded across all PEs for each shard configuration. We achieve 95% parallel efficiency on 20 shards; the loss is due to the lower arithmetic intensity of the batched MVMs as we decrease the stack width.

Table 7.4 also includes a 48-shard configuration based on the second strong scaling strategy. This strategy favors concurrency over memory footprint by replicating the bases such that the real and imaginary components are handled separately via eight batched MVMs. The resulting relative bandwidth for the 48-shard configuration is 86.74 PB/s (6.227 µs, with 97% parallel efficiency).

Shards    Stack width    Aggregate relative bandwidth (PB/s)    Aggregate absolute bandwidth (PB/s)
6         64             11.24                                  26.19
12        32             22.13                                  51.17
16        24             29.28                                  67.40
20        19             35.77                                  81.97
48        64             86.74                                  202.20

Table 7.4: Aggregate bandwidth for the nb = 25 and acc = 1e−4 configuration, partitioned into 6, 12, 16, and 20 shards, and the 48-shard case, using the first and second strong scaling strategies, respectively.
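The parallel efficiency quoted above can be recovered from Table 7.4. The sketch below, an illustration rather than the actual evaluation script, takes the 6-shard run as the baseline and divides the measured bandwidth gain by the ideal linear gain in shard count; the results match the 95% and 97% figures in the text to within rounding of the tabulated bandwidths.

    # (shards, aggregate relative bandwidth in PB/s) from Table 7.4
    runs = [(6, 11.24), (12, 22.13), (16, 29.28), (20, 35.77), (48, 86.74)]

    base_shards, base_bw = runs[0]   # 6-shard run is the strong scaling baseline
    for shards, bw in runs[1:]:
        speedup = bw / base_bw                 # measured gain over the baseline
        ideal = shards / base_shards           # linear scaling would give this gain
        print(f"{shards:2d} shards: parallel efficiency = {speedup / ideal:.0%}")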

Roofline Performance Models

Figure 7.20: Roofline performance models (arithmetic intensity in Flop/Byte vs. performance in GFlop/s) of the 6-shard TLR-MVM configuration on six Cerebras CS-2 systems, compared with one NVIDIA A100, one NEC SX-Aurora TSUBASA, one Fujitsu A64FX, a two-socket 64-core AMD EPYC Rome, and a two-socket 28-core Intel Ice Lake

Figure 7.21: Roofline performance models (Flop/Byte vs. GFlop/s) of the 48-shard TLR-MVM configuration on 48 Cerebras CS-2 systems, showing both relative and absolute sustained bandwidths, compared with the current top 5 supercomputers: Fugaku (159,000 Fujitsu A64FX), Frontier (37,888 AMD MI250X), LUMI (10,240 AMD MI250X), Leonardo (13,824 NVIDIA A100), and Summit (27,648 NVIDIA V100)

Figure 7.20 shows the roofline performance models of the minimum configurations of various contemporary vendor offerings required to host our real seismic processing workload in memory, namely six CS-2 systems, compared with single devices from NVIDIA, NEC, and Fujitsu, and two-socket x86 solutions from AMD and Intel. The arithmetic intensity (Flop/Byte) of TLR-MVM on the Cerebras CS-2 is lower than on the other architectures because the complex batched MVMs are split into four real batched MVMs operating on the real and imaginary parts. This strategy doubles the number of memory accesses required for TLR-MVM on the flat memory machine model, as opposed to cache-based architectures.

Our communication-avoiding TLR-MVM implementation achieves more than three orders of magnitude higher bandwidth than the sustained bandwidth of a single NVIDIA A100 GPU. Figure 7.21 shows the roofline performance models of 48 CS-2 systems, compared with the world's current top 5 supercomputers. In particular, we highlight the relative and absolute sustained bandwidth metrics obtained by our TLR-MVM implementation against the theoretical peak bandwidths of Frontier, Fugaku, LUMI, Leonardo, and Summit. The absolute sustained bandwidth reaches an impressive 202 PB/s, which demonstrates the hardware capabilities of Cerebras CS-2 systems, as long as the workloads fit into the local SRAM. Tile low-rank matrix approximation is a versatile algorithmic enabler that has demonstrated the ability to extract high performance from a broad range of hardware solutions.
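To see why the real/imaginary split roughly doubles the memory traffic, the sketch below counts matrix bytes and flops for a single-precision complex MVM of size m × n executed directly versus as four real MVMs (y_r = A_r x_r − A_i x_i, y_i = A_r x_i + A_i x_r) on a flat memory model where each real MVM streams its matrix operand from memory. This accounting is an illustration of the argument above, not the thesis' exact cost model, and it ignores the lower-order vector traffic.

    def intensity_complex_direct(m, n):
        # Complex MVM done in place: A is read once as 8-byte complex values.
        flops = 8 * m * n            # 4 real mults + 4 real adds per complex multiply-add
        bytes_moved = 8 * m * n      # each complex single-precision entry of A is 8 bytes
        return flops / bytes_moved

    def intensity_four_real_mvms(m, n):
        # Same product via 4 real MVMs; A_r and A_i are each streamed twice.
        flops = 4 * (2 * m * n)        # each real MVM costs 2mn flops
        bytes_moved = 4 * (4 * m * n)  # each real MVM re-reads a 4mn-byte real matrix
        return flops / bytes_moved

    m, n = 256, 150   # hypothetical tile dimensions, for illustration only
    print(intensity_complex_direct(m, n))   # 1.0 flop/byte
    print(intensity_four_real_mvms(m, n))   # 0.5 flop/byte: half the intensity, twice the bytes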

Assessing Power Consumption

We profile the power consumption of a single CS-2 running the worst-case load-balanced shard (out of the six shards) for the real dataset with the configuration nb = 25, acc = 1e−4, and stack width set to 64. The TLR-MVM kernel is run in a loop for 10^8 iterations to collect sustained power statistics, given the five-second coarse granularity of the CS-2 power measurement tool. We report a steady, low power consumption of 16 kW, which corresponds to an energy efficiency of 36.50 GFlops/W. This contrasts with the higher power profile of matrix-free stencil workloads [108], which draw close to 23 kW. The difference may be due to the extensive inter-PE communications required via the fabric interconnect for stencil updates, which our MDD application does not need thanks to the communication-avoiding memory layout. The 36.50 GFlops/W achieved by Cerebras for TLR-MVM, which behaves as a compute-bound kernel, compares favorably with the 52 GFlops/W of the Frontier and LUMI systems on the HPL-dominated workload of the Top500/Green500 rankings.
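For completeness, the energy efficiency metric above is simply sustained throughput divided by power draw. The back-of-the-envelope sketch below inverts the reported 36.50 GFlops/W and 16 kW to estimate the implied sustained rate; this derived number is an inference for illustration, not a figure quoted in the text.

    def gflops_per_watt(sustained_gflops, power_watts):
        # Energy efficiency = sustained rate (GFlop/s) / power (W).
        return sustained_gflops / power_watts

    # Inverting the reported numbers: 36.50 GFlops/W at a steady 16 kW implies a
    # sustained rate of roughly 36.50 * 16,000 = 584,000 GFlop/s (~0.58 PFlop/s)
    # on this shard, assuming the full 16 kW is attributed to the kernel.
    implied_gflops = 36.50 * 16_000
    print(implied_gflops, gflops_per_watt(implied_gflops, 16_000))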

Chapter 8

Seismic Redatuming by Inversion Using Mixed Precision Batched TLR-MVM

8.1 Mixed Precision TLR-MVM (MP TLR-MVM)