
Chapter 8

Seismic Redatuming by Inversion Using Mixed Precision Batched TLR-MVM

8.1 Mixed Precision TLR-MVM (MP TLR-MVM)

8.2 MP TLR-MVM Implementation and Performance Optimization

on the level of support. If the FP16/INT8 data types are not supported, we cannot deploy our kernels on the system. If these two data types are supported, we need at least some numerical libraries with MVM BLAS functions that provide FP16/INT8 interfaces. In our experience, the GEMM BLAS function is usually the one enriched with support for such data types. However, replacing our memory-bound MVM kernel with a GEMM kernel operating on a single vector may be overkill: it may provide portability, but the performance may not follow.

The third option is to design a new kernel that not only fully supports the FP16/INT8 data types but also performs the MVM operation with memory-aware optimizations to reduce latency overheads. We introduce new CUDA kernels for the latter, as explained in the next section.

[Figure: each CUDA thread is assigned elements of the U/V bases and the vector x; 32 threads per warp; vectorized memory access feeds a warp-level reduction.]

Figure 8.1: Vectorized access and CUDA block layout for the FP16 complex MVM kernel

8.2.2 MP TLR-MVM Using FP16/FP32

We show the design of our FP16/FP32 MVM kernel, which is the building block of MP TLR-MVM using FP16/FP32. We define a struct that contains two half numbers as the FP16 complex data type and carry out the implementation using this new data type. The input U/V bases of A are assumed to be in FP16.


Vectorized global memory access. We set the block size to 128, and each warp (32 CUDA cores) handles four rows/columns of the MVM kernel when manipulating the input U/V bases. The reason is to obtain coalesced memory access for both the input U/V bases and the input vector. To do so, we want to use 128-bit loads and stores as much as possible. Since each element of the U/V bases consists of two half numbers (32 bits), we need four half complex numbers stored contiguously to trigger a 128-bit load or store. Since the U/V bases are column-major, this means one thread needs to access four rows/columns at a time. For instance, in phase 1, each thread loads four elements from the vector, one at a time, which means the threads also need to access the four columns of the U bases. To ensure vectorized access, we cast the input pointer to float4 whenever we need to store or load from global memory, using the reinterpret_cast conversion provided by C++. Fig. 8.1 shows how we assign the entries of the U/V bases to different CUDA cores. Each CUDA core in the same warp loads a 4x4 block from the U/V bases, adjacent to the others at the same rows/columns, and uses four cuComplex numbers as a buffer to accumulate the results using FP32 compute units.

Warp-level reduction. After looping through the rows, the CUDA cores inside the same warp use warp-level primitive functions such as __shfl_down_sync to exchange the four partial reduction results. After the warp-level reduction operations, the first thread inside the warp holds the final results. We again cast the four cuComplex numbers to float4 to maximize occupancy. Since BF16 is also supported on CUDA, we also provide a BF16/FP32 MP TLR-MVM following the same optimizations described in this section.

8.2.3 MP TLR-MVM Using INT8/INT32/FP16/FP32

We extend the vectorization and reduction optimizations for MP TLR-MVM to support even lower precisions. The input U/V bases of A are now assumed to be in INT8. The original seismic data may not fall in the range of INT8, i.e., [-128, 127]. We therefore scale the real/imaginary parts of each of the U/V bases separately using their maximum absolute values. In this way, we preserve the information in the matrix while executing the MVMs in INT8 for TLR-MVM with limited precision loss. We now define a structure that contains two int8_t as the INT8 complex data type. In the same manner, we operate on 16 INT8 values together to use 128-bit loads and stores as much as possible. We access the V bases of A in the same manner, following Fig. 8.1, and accumulate in INT32 during phase 1. At the end of phase 1, we cast the final results obtained after the reduction into FP32 and scale them back to the range of the original seismic data. We need to do this because phase 2 shuffles the output vector of phase 1, and it would be difficult to use the output of phase 2 if the maximum absolute value were not consistent. We then cast the output to FP16 as an intermediate step and store it back to global memory, after which we can launch phase 2 with its memory shuffling. We scale only at the end of phase 2, taking all FP16 entries and converting them to INT8. We can then launch phase 3, where we read the U bases of A in INT8 and multiply them against the INT8 results coming from phase 2. Our code supporting INT8 entries for the U/V bases thus operates on four different precisions.