I would like to thank Professor Matteo Ravasi for his tireless guidance and the intellectual challenges he posed. Finally, I would like to thank my father JianGuo Hong, my mother WenZhen Yu, and the love of my life, Cici Ling.
EXAMINATION COMMITTEE PAGE
Algorithmic Adaptations for Ambitious Applications on Austere Architectures
We illustrate these statements on a few "extreme" applications, with some algorithmic techniques, on a wide variety of architectures that populate most of the supercomputer cabinets on the floor and are being delivered as of this writing. For the former, we adapt by exploiting data sparsity that is inherent in the physics of the problem and that can be enhanced in our matrix data structures by physics-informed ordering.
Computational Challenges in the AO System of Ground-Based Telescopes
The time for solving matrix-vector multiplications of control matrices should be less than 250 µs. The performance of an AO system is limited by the accuracy of the atmospheric parameters available to the AO controller.
Computational Challenges in Seismic Redatuming by Inversion (SRI)
The current state-of-the-art implementation [9] allows time to solution in minutes for large-scale systems with a reduced set of system parameters. Such methods allow precise editing of seismic data recorded with a full wavefield, including multiple arrivals.
Stochastic Second-Order Methods
Algebraic Compression Methods
Low-rank approaches thus appear as efficient algorithmic techniques that facilitate the convergence of HPC and Big Data problems [18]. Several low-rank matrix approximations exist, particularly as exploited in hierarchical matrices (H-matrices).
Mixed Precision Computations
However, some scientific applications require more accuracy than half precision can deliver. Mixed precision offers many benefits, such as reduced communication, lower memory usage, and faster computation.
Ground-Based Telescopes Computation
Seismic Redatuming by Inversion
Stochastic Second-Order Algorithms
The Gauss-Newton method and its globally convergent variants are often the methods of choice to handle nonlinear least-squares problems. When the damping parameter is set to a small value (or even zero in some LM variants), the update corresponds to the Gauss-Newton step.
Algebraic Compression
When the damping parameter is set to a large value, the update corresponds to a gradient step. Other variants of the LM algorithm replace the Gauss-Newton subproblem with a random model that is only accurate with a given probability [48, 49].
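The role of the damping parameter can be seen in a minimal numpy sketch (the name `lm_step` and the test matrices are assumptions for illustration, not the thesis implementation):

```python
import numpy as np

def lm_step(J, r, lam):
    """One Levenberg-Marquardt update for residuals r with Jacobian J.

    Solves (J^T J + lam*I) dx = -J^T r. As lam -> 0 this reduces to the
    Gauss-Newton step; as lam -> inf it approaches a scaled gradient step.
    """
    n = J.shape[1]
    g = J.T @ r                      # gradient of 0.5*||r||^2
    H = J.T @ J + lam * np.eye(n)    # damped Gauss-Newton Hessian
    return np.linalg.solve(H, -g)

rng = np.random.default_rng(0)
J = rng.standard_normal((20, 3))
r = rng.standard_normal(20)

gn = lm_step(J, r, 1e-12)            # ~ Gauss-Newton step
grad_like = lm_step(J, r, 1e8)       # ~ -(1/lam) * gradient
```

The same solve thus interpolates between the fast local convergence of Gauss-Newton and the robustness of gradient descent, depending on the damping value.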
Mixed Precision Computation
Contributions
- Stochastic Levenberg-Marquardt (SLM) method
- Tile Low-Rank Matrix-Vector Multiplication (TLR-MVM)
- Batched and Mixed Precision TLR-MVM
We provide a load-balanced batched TLR-MVM to mitigate the rank-difference problem in the real dataset. To our knowledge, this is one of the first times a combination of mixed precision and low-rank matrix approaches has been demonstrated in the context of scientific HPC applications.
Organization
In this chapter, we first introduce the workload of the soft real-time controller (SRTC) of an AO system. Then we introduce the SLM method, designed to improve the efficiency of the SRTC of the AO system.
SRTC Workload in Giant Ground-based Telescopes
Currently, the standard LM algorithm computes all samples in this finite-sum problem to determine the gradient and Jacobian. In contrast, the SLM algorithm selects a portion of the data samples at each iteration to perform calculations, reducing the algorithmic complexity of standard LM.
Stochastic Levenberg-Marquardt (SLM) Method
- Algorithm Description
- Theory of SLM Method
We note that the scaling used in the definition of the stochastic matrix S_k makes its expectation equal to the identity matrix I. This property underpins the convergence analysis, i.e., the smallest squared gradient norm decreases at a known asymptotic rate.
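The unbiasedness property can be checked empirically with a minimal sketch (the Bernoulli-sampling construction below is one common way to build such an S_k, assumed here for illustration; the thesis may use a different sampling scheme):

```python
import numpy as np

def sampling_matrix(n, p, rng):
    """Diagonal sampling matrix S_k: keep each sample with probability p,
    and scale kept samples by 1/p.

    The 1/p scaling makes E[S_k] = I, so a minibatch quantity such as
    J^T S_k r is an unbiased estimate of the full J^T r.
    """
    mask = rng.random(n) < p
    return np.diag(mask / p)

rng = np.random.default_rng(1)
n, p, trials = 50, 0.2, 20000
avg = np.zeros((n, n))
for _ in range(trials):
    avg += sampling_matrix(n, p, rng)
avg /= trials   # Monte Carlo estimate of E[S_k]; should approach I
```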
SLM Method Implementation and Performance Optimization
The most time-consuming part is the execution of the stochastic approximated Hessian and Gradient (HG) kernel on multiple GPUs. This minibatch size is determined by the product of the data fraction (df) and the total data set size.
Numerical Accuracy of SLM in SRTC Workload
- Performance Impact of Data Fraction
The SR is widely used in the ground-based astronomy community as an image quality indicator of the AO performance. The normalized gradient norm ∥g̃∥∞ of our SLM method decreases the fastest among all methods.
Performance of SLM in SRTC Workload
- Experimental Settings
- HG Kernel and SLM Benchmarks
- Scalability Results
Parallel efficiency decreases as the data fraction shrinks, as GPUs run out of work and cannot maintain high utilization rates. In addition, the HG kernel can capture the computing power trend across NVIDIA GPU generations.

In this chapter, we first discuss the use of linear quadratic Gaussian control in the AO system of ground-based telescopes.
LQG and DARE in AO System
The Kalman filter accounts for the uncertainties and noise in the system, providing an accurate estimate of the current state of the wavefront. These commands are sent to the deformable mirror or other controls in the AO system. However, despite its potential benefits, it is not yet used in the MAVIS AO system.
Task-based Algorithms and the Chameleon Dense Linear Algebra Library
For this reason, we choose LU factorization without pivoting as the solver in line 10 of Algorithm 2 instead of a QR-based solver.
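For reference, the discrete algebraic Riccati equation (DARE) at the heart of the LQG controller can be solved by a plain fixed-point (value) iteration; the following dense numpy sketch is illustrative only (the function name `solve_dare` and the test matrices are assumptions), whereas the thesis uses a task-based tile implementation with an LU-based inner solve:

```python
import numpy as np

def solve_dare(A, B, Q, R, iters=500):
    """Fixed-point iteration for the DARE:
       X = A^T X A - A^T X B (R + B^T X B)^{-1} B^T X A + Q.
    Converges for stabilizable (A, B) with Q > 0."""
    X = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ X @ B, B.T @ X @ A)  # inner solve
        X = A.T @ X @ (A - B @ K) + Q                      # Riccati update
    return X

rng = np.random.default_rng(2)
n, m = 4, 2
A = 0.3 * rng.standard_normal((n, n))   # contractive dynamics (illustrative)
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)
X = solve_dare(A, B, Q, R)
```

Each iteration's dominant cost is dense matrix products plus one small linear solve, which is where the LU-versus-QR solver choice above matters.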
Performance of DARE Solver
With a higher number of modes (spatial resolution), we achieve around 14 TFlop/s on a single NVIDIA A100 GPU, which is 68.8% of the theoretical peak of 20.5 TFlop/s.
Mixed Precision Chameleon Library Design
- Memory efficiency: Lower precision formats require fewer bits to represent numbers, resulting in lower memory consumption.
- Improved performance: Lower precision arithmetic operations can be executed faster on modern hardware, such as GPUs or specialized accelerators.

The precision conversion is transparent to end users, such as domain scientists in astronomy, and is delivered to the execution system immediately before computation to minimize memory consumption.
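A minimal numpy sketch of this demote-before-compute idea (illustrative only; in the actual Chameleon design the runtime converts tiles transparently just before task execution, not the user):

```python
import numpy as np

def gemm_demoted(A, B):
    """Demote FP32 operands to FP16 immediately before the GEMM.

    The FP16 copies need half the memory of the FP32 originals; here the
    product is still accumulated in FP32 to limit rounding error.
    """
    A16 = A.astype(np.float16)          # conversion just before compute
    B16 = B.astype(np.float16)
    return A16.astype(np.float32) @ B16.astype(np.float32)

rng = np.random.default_rng(3)
A = rng.standard_normal((64, 64)).astype(np.float32)
B = rng.standard_normal((64, 64)).astype(np.float32)
C16 = gemm_demoted(A, B)
C32 = A @ B
rel = np.linalg.norm(C16 - C32) / np.linalg.norm(C32)   # ~1e-3 for FP16
```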
Mixed Precision GEMM in Chameleon
HRTC Workload in Giant Ground-based Telescopes
In this strategy, the RTC delay introduced into the entire AO pipeline is entirely determined by the time to solution of the MVM predictive wavefront reconstructor. The output of the MVM corresponds to a command vector that triggers the movement of the actuators behind each deformable mirror to compensate for atmospheric turbulence in real time. Assuming a typical WFS sampling time of 1 ms (to maintain maximum uptime even in the low sodium season) and to obtain adequate AO rejection bandwidth, the total AO loop delay should be 2 frames (i.e., 2 ms) or lower.
Tile Low-Rank Matrix-Vector Multiplication (TLR-MVM)
- Algorithm Description
- Arithmetic Complexity of TLR-MVM
We go through all the tiles in the matrix A and obtain the results shown in Fig. Since MVM is inherently memory-bound, we stack the bases to increase data locality in the high-level caches of the memory subsystem. In the phase 1 calculation, the column size of each GEMV is nb and the row size of each GEMV is the sum of the ranks along this tile column.
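The three phases can be sketched in numpy as follows; this is a hedged illustration (the names `tlr_mvm`, `U`, `V`, and the exactly-rank-k tiles are assumptions), with per-tile GEMVs rather than the stacked-bases GEMVs used in the optimized implementation:

```python
import numpy as np

def tlr_mvm(U, V, x, nb, nt):
    """Three-phase TLR-MVM where tile (i, j) of A is U[i][j] @ V[i][j].T."""
    # Phase 1: per tile column j, project the input slice onto the V bases
    yv = [[V[i][j].T @ x[j * nb:(j + 1) * nb] for i in range(nt)]
          for j in range(nt)]
    # Phase 2: reshuffle intermediates from column-major to row-major order
    yu = [[yv[j][i] for j in range(nt)] for i in range(nt)]
    # Phase 3: per tile row i, expand with the U bases and accumulate
    y = np.zeros(nt * nb)
    for i in range(nt):
        for j in range(nt):
            y[i * nb:(i + 1) * nb] += U[i][j] @ yu[i][j]
    return y

rng = np.random.default_rng(4)
nb, nt, k = 32, 4, 5
# Build a matrix whose tiles are exactly rank k, so compression is lossless
U = [[rng.standard_normal((nb, k)) for _ in range(nt)] for _ in range(nt)]
V = [[rng.standard_normal((nb, k)) for _ in range(nt)] for _ in range(nt)]
A = np.block([[U[i][j] @ V[i][j].T for j in range(nt)] for i in range(nt)])
x = rng.standard_normal(nt * nb)
y = tlr_mvm(U, V, x, nb, nt)   # matches A @ x up to rounding
```

In real datasets the tiles are only approximately low rank (via truncated SVD), so the result matches the dense MVM up to the compression tolerance rather than machine precision.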
TLR-MVM Implementation and Performance Optimization
- OpenMP Implementation
- MPI+OpenMP Implementation
- CUDA Implementation
The authors in [67] implement TLR-MVM in C++ on the NEC SX-Aurora TSUBASA vector engine using standard MPI+OpenMP programming models. Since the single-precision complex data type for MVM (i.e., CGEMV) is widely supported by vendors' numerical libraries, we propose to further extend TLR-MVM to a larger family of systems, including x86, ARM, and GPU. Since SRI involves many frequency matrices, we merge each phase of TLR-MVM across all frequency matrices to further expose parallelism, reduce CUDA Graph overhead, and increase hardware occupancy.
Numerical Accuracy of TLR-MVM HRTC workload
Instead, the numerical accuracy is assessed by comparing the obtained SR of a compressed matrix to the obtained SR of the original control matrix (so if there is no compression, the resulting numerical accuracy is 1.0). From Figure 6.9, it is clear that there is an inevitable, albeit predictable, trade-off between numerical accuracy and speedup factor. For the various atmospheric conditions tested, a speedup factor of about 3.0 comes with very little loss in SR.
Performance of TLR-MVM in HRTC Workload
- Experimental Settings
- Synthetic Datasets
- MAVIS Datasets
In particular, we report the performance of TLR-MVM across three generations of NVIDIA GPUs, namely P100/V100/A100. AMD Rome and NEC Aurora come in under 200 µs for a single TLR-MVM call, opening up new possibilities for the future. Figure 6.18 shows the roofline performance models of our TLR-MVM implementation on AMD EPYC Rome and Fujitsu ARM A64FX processors on the MAVIS dataset.
Seismic Redatuming by Inversion
- Impact of Matrix Reordering
- Impact on SRI Computational Stages
In both cases, special attention was paid to the implementation of the integral operator and the handling of kernels that cannot fit directly into the main memory of a single compute node. Hilbert reordering provides the best compression rate by better exploiting the data sparsity of the frequency matrices. While the authors in [67] only focused on the implementation of the forward batched MVM operation, here we cover the entire SRI workflow, where an iterative solver is used to solve the Marchenko equations.
Batched TLR-MVM
- Algorithm Description
- Arithmetic Complexity of Batched TLR-MVM
In the seismic application, suppose there are F frequency matrices; the overall flops and memory traffic of dense MVM are F × 2mn and B × F × (mn + n + m), respectively, where B is the number of bytes per element. The overall flops and memory traffic of TLR-MVM for all frequency matrices are 4 × K_F × nb and B × (2 × K_F × nb + 4 × K_F + m + n), where K_F is the sum of all ranks over all frequencies. If the rank sum K is not large enough, the algorithm may not saturate the memory bandwidth due to a suboptimal hardware occupancy.
Batched TLR-MVM Implementation and Performance Optimization
- Merge-Phase Strategy for Intra-Node Load Balancing
- ZigZag Mapping Strategy for Inter-Node Load Balancing
According to the flop count of TLR-MVM, the rank sum K of each frequency matrix plays an important role. The ZigZag mapping strategy is designed in light of the increasing rank sums across frequencies in the seismic redatuming application, and achieves load balancing between nodes. To leverage performance on distributed-memory systems, we allocate frequency matrices evenly across compute nodes.
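The idea behind such a zigzag (boustrophedon) assignment can be sketched as follows; the function name `zigzag_map` and the linearly increasing rank sums are illustrative assumptions, not the thesis code or data:

```python
def zigzag_map(rank_sums, nnodes):
    """Assign frequency matrices to nodes in zigzag order of rank sum.

    Sort matrices by rank sum, then sweep the nodes 0..n-1, n-1..0, ...
    so that heavy and light matrices alternate on every node.
    """
    order = sorted(range(len(rank_sums)), key=lambda f: rank_sums[f])
    assign = [[] for _ in range(nnodes)]
    for pos, f in enumerate(order):
        cycle, off = divmod(pos, nnodes)
        node = off if cycle % 2 == 0 else nnodes - 1 - off  # reverse sweep
        assign[node].append(f)
    return assign

# Rank sums that grow with frequency (illustrative values only)
ranks = [10 + 5 * f for f in range(16)]
assign = zigzag_map(ranks, 4)
loads = [sum(ranks[f] for f in fs) for fs in assign]   # per-node rank sum
```

Because each node receives matrices from both ends of the sorted order, the per-node flop totals stay close even when rank sums vary strongly across frequencies, unlike a contiguous block assignment.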
Numerical Accuracy of Batched TLR-MVM in SRI
Finally, all numerical results presented below will be shown for a single focal point in the subsurface at location x_F. To begin with, we consider the same synthetic geological model used in [1] (their Fig. 3) and compare the reconstruction of the joint Green's function g = g⁺ + g⁻ obtained using the original R-kernel and the TLR-compressed R-kernel with different combinations of tile size and precision. Although not discussed by the authors in [106], this rank imbalance across different matrices inevitably leads to load imbalance.
Performance of Batched TLR-MVM in SRI
- Experimental Settings
- Synthetic Datasets
- Performance Impact Of Optimization Techniques
TLR-MVM achieves 8.94X and 1.58X performance speedups compared to dense MVMs on an Intel CSL and a NEC VE, respectively. Furthermore, TLR-MVM achieves up to a 3.3X speedup over dense MVM on a single NEC VE.
Porting TLR-MVM onto AI Accelerators
- The Graphcore IPU Hardware Technology
- Implementation Details on GraphCore IPU
- Performance of TLR-MVM on IPU
- The Cerebras CS-2 Wafer Scale Engine
- Implementation Details on Cerebras CS-2 System
- Performance of TLR-MVM on Cerebras CS-2 System
For the seismic imaging application and its large data sets, the TLR-MVM algorithm comes to the rescue to reconcile the IPU architecture with the application. We are developing a new high performance TLR-MVM kernel that takes advantage of the high SRAM memory bandwidth of Cerebras CS-2 systems while mapping the seismic datasets to the millions of PEs. We show the performance of the TLR-MVM kernel on a single CS-2 on the five validated configurations.
Mixed Precision TLR-MVM (MP TLR-MVM)
- Algorithm Description
- Arithmetic Complexity of MP TLR-MVM
MP TLR-MVM Implementation and Performance Optimization
- Hardware/Software Limitations for MP TLR-MVM
- MP TLR-MVM Using FP16/FP32
- MP TLR-MVM Using INT8/INT32/FP16/FP32
Each CUDA core in the same warp loads a 4x4 block of U/V bases close to each other in the same rows/columns. In this way, we can preserve the information in the matrix and perform MVM in INT8 for TLR-MVM with limited loss of accuracy. We manage 16 INT8s together in the same way to use 128-bit load and store as much as possible.
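The accuracy-preserving part of the INT8 path can be illustrated with per-tile symmetric quantization: scale each tile by its max-abs value, multiply in INT8 with INT32 accumulation, then dequantize. This numpy sketch is an assumption-laden illustration (names `quantize_int8`/`int8_mvm` are ours), not the CUDA kernel itself:

```python
import numpy as np

def quantize_int8(M):
    """Symmetric quantization: int8 values plus one FP32 scale per tile."""
    scale = np.abs(M).max() / 127.0
    q = np.clip(np.round(M / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def int8_mvm(M, x):
    """MVM with int8 operands, int32 accumulation, FP32 dequantization."""
    qm, sm = quantize_int8(M)
    qx, sx = quantize_int8(x)
    acc = qm.astype(np.int32) @ qx.astype(np.int32)   # int32 accumulate
    return acc.astype(np.float32) * (sm * sx)         # undo both scales

rng = np.random.default_rng(5)
M = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)
y8 = int8_mvm(M, x)
y32 = M @ x
rel = np.linalg.norm(y8 - y32) / np.linalg.norm(y32)  # small quantization error
```

On GPU hardware, the INT8 products and INT32 accumulation map to tensor-core integer instructions, which is what makes the 4x storage reduction come with a throughput gain rather than a penalty.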
Numerical Accuracy of MP TLR-MVM in SRI
A schematic representation of the seismic acquisition geometry is shown in panel (a) of Figure 8.2 next to one kernel frequency matrix R (panel (b) of Figure 8.2). Finally, panel (i) shows the error associated with a scenario where the kernel frequency matrices R are rescaled using Hilbert sorting before compression. Here, the size of the inner band is determined by the Half Band Length (HBF) parameter.
Performance of MP TLR-MVM in SRI
- Experimental Settings
- Performance Results on 3D Seismic Dataset
- Performance Impact of GPU Streams on TLR-MVM
We study the impact of the number of GPU streams on the performance of TLR-MVM. As explained in Section 8.2.1, we achieve low performance compared to FP32 TLR-MVM on the corresponding hardware. In conclusion, we present a brief summary of the general impact of MP TLR-MVM on the overall SRI algorithm.
Summary
In addition, we present a new high-performance implementation of the Multi-Dimensional Convolution operator that uses algebraic compression and mixed precision, and use it to efficiently solve the Seismic Redatuming by Inversion (SRI) problem. Powered by Tile Low-Rank matrix approaches, we can speed up the most time-consuming operations of the modeling operator, i.e., the matrix-vector multiplications. Furthermore, we integrate two mixed-precision algorithms into the core of TLR-MVM to further improve the computational intensity of the kernel.
Future Work
Ltaief, "Tile Low-Rank GEMM Using Batched Operations on GPUs," in Proceedings of Euro-Par 2018: Parallel Processing. "Exploiting Data Sparsity for Large-Scale Matrix Computations," in Proceedings of the European Conference on Parallel Processing. Rousset, "A Novel Fast And Accurate Pseudo-analytical Simulation Approach For MOAO," in Adaptive Optics Systems IV, ser.
APPENDICES