8.3 Numerical Accuracy of MP TLR-MVM in SRI
128,127]. We scale the real and imaginary parts of each U/V basis separately using their maximum absolute value. In this way, we preserve the information in the matrix and execute the MVMs of TLR-MVM in INT8 with limited precision loss. We define a structure containing two int8_t values as the INT8 complex data type. As before, we operate on 16 INT8 values together so as to use 128-bit loads and stores as much as possible. We access the V bases of A in the same manner, following Fig. 8.1. We accumulate in INT32 during phase 1. At the end of phase 1, we cast the final results obtained after reduction to FP32 and scale them back to the range of the original seismic data. This is necessary because phase 2 shuffles the output vector of phase 1, and the output of phase 2 would be difficult to use if the maximum absolute value were not consistent. We then cast the output to FP16 as an intermediate step and store it back to global memory. We can then launch phase 2 with memory shuffling. We scale only at the end of phase 2 by taking all FP16 entries and converting them to INT8. We can then launch phase 3, where we read the U bases of A in INT8 and multiply them against the INT8 results coming from phase 2. Our code supporting INT8 entries for the U/V bases thus operates on four different precisions.
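The scaling and accumulation steps described above can be illustrated with a minimal sketch in plain Python. It is not the GPU implementation: the function names (`quantize_int8`, `quantize_complex`, `int8_complex_dot`) are hypothetical, Python integers stand in for the INT32 accumulators, and we assume a symmetric mapping of the maximum absolute value onto 127. Because the real and imaginary parts carry separate scales, the complex dot product is split into four integer partial sums that are rescaled in floating point only at the end.

```python
def quantize_int8(values):
    """Scale floats by their maximum absolute value and round to [-127, 127]."""
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    step = scale / 127                          # value represented by one INT8 unit
    return [round(v / step) for v in values], step


def quantize_complex(vec):
    """Quantize real and imaginary parts separately, each with its own step."""
    re_q, re_step = quantize_int8([z.real for z in vec])
    im_q, im_step = quantize_int8([z.imag for z in vec])
    return re_q, im_q, re_step, im_step


def int8_complex_dot(base, x):
    """Complex dot product using integer accumulation, rescaled at the end."""
    b_re, b_im, sb_re, sb_im = quantize_complex(base)
    x_re, x_im, sx_re, sx_im = quantize_complex(x)
    # Four integer accumulators (Python ints stand in for INT32).
    rr = sum(b * v for b, v in zip(b_re, x_re))
    ii = sum(b * v for b, v in zip(b_im, x_im))
    ri = sum(b * v for b, v in zip(b_re, x_im))
    ir = sum(b * v for b, v in zip(b_im, x_re))
    # Cast to floating point and undo the scaling only after the reduction.
    real = sb_re * sx_re * rr - sb_im * sx_im * ii
    imag = sb_re * sx_im * ri + sb_im * sx_re * ir
    return complex(real, imag)


# Illustrative data only (not taken from the seismic dataset).
base = [complex(0.8, -0.3), complex(-0.5, 0.9), complex(0.1, 0.4), complex(0.7, -0.6)]
x = [complex(1.5, 0.2), complex(-0.7, 1.1), complex(0.3, -0.9), complex(1.0, 0.5)]
exact = sum(b * v for b, v in zip(base, x))
approx = int8_complex_dot(base, x)
rel_err = abs(approx - exact) / abs(exact)
```

For these made-up inputs the relative error stays within a few percent, which is consistent with the qualitative claim of limited precision loss, although it says nothing about the accuracy achieved on real seismic data.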
Figure 8.2: Schematic representation of the impact of different modelling operators
To begin with, let us briefly discuss how the frequency matrices used to perform the forward and adjoint operations in equations 7.3 and 7.4 are formed and how they relate to the physical seismic acquisition geometry. A schematic representation of a seismic acquisition geometry is displayed in panel (a) of Figure 8.2 alongside a single frequency matrix of the kernel R (panel (b) of Figure 8.2).
As is common in geophysical processing, the data recorded from sources to receivers along lines Li are organized such that sources and receivers correspond to rows and columns of the matrix, respectively. The overall matrix can therefore be interpreted as a stack of smaller matrices, where each submatrix contains sources from line Li and receivers from line Lj. Note that the choice of the tile size for the TLR compression algorithm can, but does not need to, follow the structure of these ‘geographical’ blocks. Given this arrangement of the kernel matrix, a viable approach is to divide the entire matrix into an inner band and two outer bands: tiles inside the inner band are compressed as usual and stored in FP32 precision, whilst tiles in the two outer bands are discarded (i.e., set to zero). By doing so, we implicitly assume that contributions to the integrals in equations 7.3 and 7.4 from receiver lines far from the source of interest have less impact on the overall modelling (or inversion) outputs than those from nearby receivers. Although this
[Figure 8.3 panel annotations. Data size: 1.83GB, 7.6GB, 1.83GB, 24.3GB. SNR: 15.3, 15.3, 15.6.]
Figure 8.3: Wavefields obtained by applying the entire inversion (equations 7.1 and 7.2) using the TLR-compressed R kernel and different precisions. (a) FP32, (b) FP16, (c) INT8, (d) FP32/ZERO. (e), (f), (g), and (h) corresponding errors with default ordering. (i), (j), (k) corresponding errors with Hilbert ordering (note that the solution with Hilbert ordering is not shown). (l) SNR for all of the precisions discussed in the main text.
seems a reasonable argument to make, panels (c) and (d) in Figure 8.2 show that both the far away and the nearby receivers are required in order to fully construct the seismic events when applying the forward modelling operator. More specifically, events in the reflection data (thick rays in panels (c) and (d)) that bounce off a subsurface reflector below the focusing point generally require far away receivers (e.g., x1R) past the focusing point xF to be captured, whilst those that have a shallow bounce require nearby receivers (e.g., x2R). Based on these physical arguments, we will see that zeroing out the off-diagonal tiles represents an acceptable trade-off between accuracy and computational cost only when we are interested in keeping high accuracy in the reconstruction of events at small offset (i.e., small distance between the horizontal location of the focusing point and a given receiver) whilst accepting a certain degree of error for those at far offset.
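The banded strategy can be pictured at the tile level with a short sketch. The names `band_mask`, `kept_fraction` and `half_band` are hypothetical, and the rule that a tile is kept when its row and column tile indices differ by at most `half_band` is an assumption about how the inner band is delimited; the implementation may use a different convention.

```python
def band_mask(num_tiles, half_band):
    """Return a tile mask: True for inner-band tiles, False for zeroed tiles."""
    return [[abs(i - j) <= half_band for j in range(num_tiles)]
            for i in range(num_tiles)]


def kept_fraction(num_tiles, half_band):
    """Fraction of tiles that remain after discarding the two outer bands."""
    mask = band_mask(num_tiles, half_band)
    kept = sum(row.count(True) for row in mask)
    return kept / (num_tiles * num_tiles)


# Toy example: a 6 x 6 grid of tiles with one tile kept on each side of the diagonal.
mask = band_mask(6, 1)
fraction = kept_fraction(6, 1)
for row in mask:
    print("".join("#" if keep else "." for keep in row))
```

In this toy case 16 of the 36 tiles survive, which illustrates how quickly the storage drops as the band narrows, and also how many source-receiver line pairs are removed from the integrals.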
Figure 8.3 presents a comparison of a variety of inversion results (i.e., the final total Green’s function g) along eight different receiver lines, as shown in the inset. In all cases, TLR compression is performed using the following parameters: tile size nb = 256 and accuracy ϵ = 0.001 - for a more in-depth analysis of the
choice of the parameters nb and nearly identical to that obtained by a classical implementation of the MDC operator that uses a dense, FP32 kernel R. Panel (e) displays the error of the estimated wavefield with respect to the true solution, which in this case can be computed by means of a numerical finite-difference modelling code using the exact subsurface velocity and density model. Note that this solution is not available in real-life applications, where we cannot access the exact subsurface models. Two key observations can be made at this point: i) the solution of equation 7.1 inevitably introduces an error due to the fact that we are discretizing and truncating a spatial integral (equations 7.3 and 7.4) - we will call this the theoretical error, as it arises independently of the algorithmic choices made in the implementation of the MDC operator; ii) as shown in panel (l), the SNR of the solution obtained using TLR compression is nearly identical to that of the dense solution, which indicates that TLR compression is suitable for the SRI application. Finally, panel (i) displays the error associated with the scenario where the frequency matrices of the kernel R are re-arranged using Hilbert sorting prior to compression. These results highlight the importance of Hilbert sorting: it not only lowers the overall size of the dataset to be stored and used for computations, as we will discuss later in more detail (see Figure 8.8), but it also does so without compromising the quality of the inversion product.
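To make the notion of Hilbert sorting concrete, the sketch below computes the position of a grid point along a two-dimensional Hilbert curve using the standard bit-manipulation recurrence and sorts a small grid accordingly. The function names are hypothetical, and we assume that stations are first mapped to integer coordinates on a 2^order by 2^order grid; how the actual source and receiver coordinates are discretized before sorting is not covered here.

```python
def hilbert_index(order, x, y):
    """Position of integer grid point (x, y) along a Hilbert curve of given order."""
    n = 1 << order          # grid side length, a power of two
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_sort(points, order):
    """Reorder points so that neighbours along the curve are neighbours in space."""
    return sorted(points, key=lambda p: hilbert_index(order, p[0], p[1]))


# Toy 4 x 4 grid of station positions, initially in row-major order.
grid = [(x, y) for y in range(4) for x in range(4)]
ordered = hilbert_sort(grid, 2)
```

Consecutive entries of `ordered` are always adjacent on the grid, whereas row-major ordering jumps across the whole line at the end of each row. This locality is the property that one would expect to make the re-arranged tiles more compressible.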
Next, we start to lower the precision used to represent the real and imaginary parts of all of the U and V bases of the kernel matrices. More precisely, FP16 and INT8 are used to produce the wavefields in panels (b) and (c), respectively.
Once again, when looking at the corresponding error panels both with original and Hilbert ordering (namely panels (f-g) and (j-k)), we can conclude that the entire inversion can be safely carried out using lower precision arithmetic both for storage and computations. This is also confirmed by the computed SNRs shown in panel (l), which are nearly identical for the dense, FP32, and FP16 cases and just slightly smaller for the INT8 case. We reach the same conclusions for BF16 and do not show the results due to space limitations. For the sake of comparison
and as a sanity check, we also consider the scenario where FP32 precision is used for all the tiles in the inner band whilst tiles in the two outer bands are zeroed out. Here, the size of the inner band is defined via the Half Band Length (HBF) parameter. Panel (d) displays the estimated wavefield for HBF = 5, meaning that the inner band is composed of 5 tiles around the main diagonal, whilst the rest of the tiles constitute the outer bands. In this case, the error clearly increases (panel (h)). More importantly, when looking at the SNR panel, we can conclude that the FP32/ZERO strategy is not appealing: when choosing HBF = 5, which leads to a data size comparable to that of the INT8 case, the SNR is about 3dB lower than that of the INT8 case. Even though these off-diagonal tiles may contain low contributions, they simply cannot be ignored.
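To put the 3dB gap into perspective, the sketch below evaluates an SNR in decibels for two synthetic estimates. The definition used here (ratio of the energy of the reference to the energy of the error) is a common convention and an assumption on our part, since the exact formula behind panel (l) is not restated in this section; the function name `snr_db` and the data are purely illustrative.

```python
import math


def snr_db(reference, estimate):
    """SNR in dB: energy of the reference over energy of the error."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / noise)


# Synthetic trace and error pattern (not taken from the dataset).
reference = [1.0, -2.0, 0.5, 1.5, -0.75]
error = [0.1, -0.05, 0.2, -0.1, 0.05]

estimate_a = [r + e for r, e in zip(reference, error)]
# Same error pattern with twice the energy (amplitude scaled by sqrt(2)).
estimate_b = [r + math.sqrt(2.0) * e for r, e in zip(reference, error)]

snr_gap = snr_db(reference, estimate_a) - snr_db(reference, estimate_b)
```

Under this definition, a drop of about 3dB corresponds to roughly a doubling of the error energy.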
Figure 8.4: Trace comparison for the full wavefield for receivers 1000 (left) and 4000 (right). An amplitude gain of t² is applied to all traces.
Finally, to further assess the INT8 and FP32/ZERO strategies, a single trace comparison for receivers 1000 and 4000 is shown in Figure 8.4. An amplitude gain of t² is applied to all traces. Once again, we observe a good match between the estimate from the INT8 strategy (blue trace) and the original dense-based approach (red trace). Similarly, both of them closely match the ground truth solution (black trace). On the other hand, the trace obtained using the FP32/ZERO strategy deviates from the true solution, especially for receiver 1000, which corresponds to the far offset case.
To conclude, we wish to remark that the wavefields shown in Figure 8.3 are
Table 8.1: Hardware/software specifications.
Vendor                 Intel                      AMD                 Fujitsu                 NVIDIA       NEC
Family                 Ice Lake                   EPYC Rome           Primergy A64FX          Ampere GPU   SX-Aurora TSUBASA
Model                  Xeon(R) Gold 6330          7702                FX1000                  A100         B300-8
Node(s)/Card(s)        1                          1                   1                       1            8
Socket(s)              2                          2                   4                       N/A          N/A
Cores                  56                         128                 48                      6912         8
GHz                    2.0                        2.2                 2.2                     2.6          1.6
Memory                 1004GB DDR4                512GB DDR4          32GB HBM                80GB HBM2e   48GB HBM2
Sustained BW (memory)  320GB/s                    330GB/s             800GB/s                 2.0TB/s      1.5TB/s
LLC                    86MB                       256MB               32MB                    40MB         16MB
Sustained BW (LLC)     1.1TB/s                    4TB/s               3.6TB/s                 4.8TB/s      2.1TB/s
Compiler               GCC compiler 8.2.0         GCC compiler 8.2.0  Fujitsu compiler 4.5.0  NVCC 11.5    NEC compiler 3.1.1
BLAS library           Intel OneAPI MKL 2022.0.2  BLIS 3.0.0          Fujitsu SSL II          cuBLAS 11.5  NEC NLC 2.1.0
MPI library            OpenMPI 4.1.2              OpenMPI 4.1.2       Fujitsu MPI 4.0.1       N/A          NEC MPI 2.13.0
Support FP16 HW/SW     No/Yes                     No/No               Yes/Yes                 Yes/Yes      No/No
Support INT8 HW/SW     No/Yes                     No/No               Yes/Yes                 Yes/Yes      No/No
obtained by solving Equation 7.1, followed by the evaluation of Equation 7.2.
Noting that two integral operators must be evaluated in each forward pass of the Marchenko equations (and two in each adjoint pass), the kernel is applied 42 times to obtain the Green’s function when solving the system of equations with 10 iterations of CGLS. This may raise some concern as to whether the solver is sensitive to the accumulation of small approximation errors due to the TLR-compressed kernel (especially when compression is used alongside lower precision, e.g., FP16 or INT8). However, our numerical results show that the Green’s functions reconstructed with different choices of precision are perceptually identical to the one obtained using FP32 precision (Figure 8.3(a)).
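One accounting that reproduces the figure of 42 is sketched below. It assumes a textbook CGLS recurrence started from a zero initial guess, so that a single adjoint pass is needed for initialization and each iteration costs one forward and one adjoint pass; whether this matches the exact bookkeeping of the solver used here is an assumption. The `CountingOperator` class, the small dense test matrix and the helper names are hypothetical stand-ins for the Marchenko operator.

```python
class CountingOperator:
    """Dense real matrix that counts its forward and adjoint applications."""

    def __init__(self, matrix):
        self.matrix = matrix
        self.forward_calls = 0
        self.adjoint_calls = 0

    def forward(self, x):
        self.forward_calls += 1
        return [sum(a * v for a, v in zip(row, x)) for row in self.matrix]

    def adjoint(self, y):
        self.adjoint_calls += 1
        cols = len(self.matrix[0])
        return [sum(self.matrix[i][j] * y[i] for i in range(len(y)))
                for j in range(cols)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def cgls(op, b, niter):
    """Textbook CGLS with zero initial guess (no convergence check, for counting)."""
    x = [0.0] * len(op.matrix[0])
    r = list(b)                 # zero initial guess: residual equals the data
    s = op.adjoint(r)           # single adjoint pass for initialization
    p = list(s)
    gamma = dot(s, s)
    for _ in range(niter):
        q = op.forward(p)       # one forward pass per iteration
        alpha = gamma / dot(q, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = op.adjoint(r)       # one adjoint pass per iteration
        gamma_new = dot(s, s)
        p = [si + (gamma_new / gamma) * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return x


# Small, well-conditioned toy system with a known solution of all ones.
n = 20
matrix = [[(i + 1.0) if i == j else 0.1 for j in range(n)] for i in range(n)]
b = [sum(row) for row in matrix]

op = CountingOperator(matrix)
x = cgls(op, b, 10)
passes = op.forward_calls + op.adjoint_calls   # 10 forward + 11 adjoint
kernel_applications = 2 * passes               # two integral operators per pass
```

With 10 iterations this gives 21 operator passes and therefore 42 kernel applications, so any error introduced by compression or reduced precision is injected 42 times before the final Green’s function is obtained.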