I would like to thank Professor Matteo Ravasi for his tireless guidance and the intellectual challenges he posed. Finally, I would like to thank my father JianGuo Hong, my mother WenZhen Yu, and the love of my life, Cici Ling.
EXAMINATION COMMITTEE PAGE
Algorithmic Adaptations for Ambitious Applications on Austere Architectures
We illustrate these statements on a few "extreme" applications, with some algorithmic techniques, on a wide variety of architectures that populate most of the supercomputer cabinets on the floor and are being delivered as of this writing. For the former, we adapt by exploiting data sparsity that is inherent in the physics of the problem and that can be enhanced in our matrix data structures by physics-informed ordering.
Computational Challenges in the AO System of Ground-Based Telescopes
The time for solving matrix-vector multiplications of control matrices should be less than 250 µs. The performance of an AO system is limited by the accuracy of the atmospheric parameters available to the AO controller.
Computational Challenges in Seismic Redatuming by Inversion (SRI)
The current state-of-the-art implementation [9] allows time to solution in minutes for large-scale systems with a reduced set of system parameters. Such methods allow precise editing of seismic data recorded with a full wavefield, including multiple arrivals.
Stochastic Second-Order Methods
Algebraic Compression Methods
Low-rank approaches thus appear as efficient algorithmic techniques that facilitate the convergence of HPC and Big Data problems [18]. Several low-rank matrix approximations exist, particularly as exploited in hierarchical matrices (H-matrices).
Mixed Precision Computations
However, some scientific applications require more accuracy than half precision can deliver. Mixed precision offers many benefits, such as reduced communication, lower memory usage, and faster computation.
Ground-Based Telescopes Computation
Seismic Redatuming by Inversion
Stochastic Second-Order Algorithms
The Gauss-Newton method and its globally convergent variants are often the methods of choice to handle nonlinear least-squares problems. When the damping parameter is set to a small value (or even zero in some LM variants), the update corresponds to the Gauss-Newton step.
Algebraic Compression
When the damping parameter is set to a large value, the update corresponds to a gradient step. Other variants of the LM algorithm replace the Gauss-Newton subproblem with a random model that is only accurate with a given probability [48, 49].
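The role of the damping parameter can be seen in a minimal numpy sketch (the name `lm_step` and the test matrices are assumptions for illustration, not the thesis implementation):

```python
import numpy as np

def lm_step(J, r, lam):
    """One Levenberg-Marquardt update for residuals r with Jacobian J.

    Solves (J^T J + lam*I) dx = -J^T r. As lam -> 0 this reduces to the
    Gauss-Newton step; as lam -> inf it approaches a scaled gradient step.
    """
    n = J.shape[1]
    g = J.T @ r                      # gradient of 0.5*||r||^2
    H = J.T @ J + lam * np.eye(n)    # damped Gauss-Newton Hessian
    return np.linalg.solve(H, -g)

rng = np.random.default_rng(0)
J = rng.standard_normal((20, 3))
r = rng.standard_normal(20)

gn = lm_step(J, r, 1e-12)            # ~ Gauss-Newton step
grad_like = lm_step(J, r, 1e8)       # ~ -(1/lam) * gradient
```

The same solve thus interpolates between the fast local convergence of Gauss-Newton and the robustness of gradient descent, depending on the damping value.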
Mixed Precision Computation
Contributions
- Stochastic Levenberg-Marquardt (SLM) method
- Tile Low-Rank Matrix-Vector Multiplication (TLR-MVM)
- Batched and Mixed Precision TLR-MVM
We provide a load-balanced batched TLR-MVM to mitigate the rank-difference problem in the real dataset. To our knowledge, this is one of the first times a combination of mixed precision and low-rank matrix approaches has been demonstrated in the context of scientific HPC applications.
Organization
In this chapter, we first introduce the workload of the soft real-time controller (SRTC) of an AO system. Then we introduce the SLM method, designed to improve the efficiency of the SRTC of the AO system.
SRTC Workload in Giant Ground-based Telescopes
Currently, the standard LM algorithm computes all samples in this finite-sum problem to determine the gradient and Jacobian. In contrast, the SLM algorithm selects a portion of the data samples at each iteration to perform calculations, reducing the algorithmic complexity of standard LM.
Stochastic Levenberg-Marquardt (SLM) Method
- Algorithm Description
- Theory of SLM Method
We note that the scaling used in the definition of the stochastic matrix S_k makes its expectation equal to the identity matrix I. This property underpins the convergence analysis, i.e., the smallest squared gradient norm decreases at a known asymptotic rate.
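The unbiasedness property can be checked empirically with a minimal sketch (the Bernoulli-sampling construction below is one common way to build such an S_k, assumed here for illustration; the thesis may use a different sampling scheme):

```python
import numpy as np

def sampling_matrix(n, p, rng):
    """Diagonal sampling matrix S_k: keep each sample with probability p,
    and scale kept samples by 1/p.

    The 1/p scaling makes E[S_k] = I, so a minibatch quantity such as
    J^T S_k r is an unbiased estimate of the full J^T r.
    """
    mask = rng.random(n) < p
    return np.diag(mask / p)

rng = np.random.default_rng(1)
n, p, trials = 50, 0.2, 20000
avg = np.zeros((n, n))
for _ in range(trials):
    avg += sampling_matrix(n, p, rng)
avg /= trials   # Monte Carlo estimate of E[S_k]; should approach I
```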
SLM Method Implementation and Performance Optimization
The most time-consuming part is the execution of the stochastic approximated Hessian and Gradient (HG) kernel on multiple GPUs. This minibatch size is determined by the product of the data fraction (df) and the total data set size.
Numerical Accuracy of SLM in SRTC Workload
- Performance Impact of Data Fraction
The SR is widely used in the ground-based astronomy community as an image quality indicator of the AO performance. The normalized gradient norm ∥g̃∥∞ of our SLM method decreases the fastest among all methods.
Performance of SLM in SRTC Workload
- Experimental Settings
- HG Kernel and SLM Benchmarks
- Scalability Results
Parallel efficiency decreases as the data fraction shrinks, as GPUs run out of work and cannot maintain high utilization rates. In addition, the HG kernel can capture the computing power trend across NVIDIA GPU generations.

In this chapter, we first discuss the use of linear quadratic Gaussian control in the AO system of ground-based telescopes.
LQG and DARE in AO System
The Kalman filter accounts for the uncertainties and noise in the system, providing an accurate estimate of the current state of the wavefront. These commands are sent to the deformable mirror or other controls in the AO system. However, despite its potential benefits, it is not yet used in the MAVIS AO system.
Task-based Algorithms and the Chameleon Dense Linear Algebra Library
For this reason, we choose LU factorization without pivoting as the solver in line 10 of Algorithm 2 instead of a QR-based solver.
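For reference, the discrete algebraic Riccati equation (DARE) at the heart of the LQG controller can be solved by a plain fixed-point (value) iteration; the following dense numpy sketch is illustrative only (the function name `solve_dare` and the test matrices are assumptions), whereas the thesis uses a task-based tile implementation with an LU-based inner solve:

```python
import numpy as np

def solve_dare(A, B, Q, R, iters=500):
    """Fixed-point iteration for the DARE:
       X = A^T X A - A^T X B (R + B^T X B)^{-1} B^T X A + Q.
    Converges for stabilizable (A, B) with Q > 0."""
    X = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ X @ B, B.T @ X @ A)  # inner solve
        X = A.T @ X @ (A - B @ K) + Q                      # Riccati update
    return X

rng = np.random.default_rng(2)
n, m = 4, 2
A = 0.3 * rng.standard_normal((n, n))   # contractive dynamics (illustrative)
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)
X = solve_dare(A, B, Q, R)
```

Each iteration's dominant cost is dense matrix products plus one small linear solve, which is where the LU-versus-QR solver choice above matters.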
Performance of DARE Solver
With a higher number of modes (spatial resolution), we achieve around 14 TFlop/s on a single NVIDIA A100 GPU, which is 68.8% of the theoretical peak of 20.5 TFlop/s.
Mixed Precision Chameleon Library Design
- Memory efficiency: Lower precision formats require fewer bits to represent numbers, resulting in lower memory consumption.
- Improved performance: Lower precision arithmetic operations can be executed faster on modern hardware, such as GPUs or specialized accelerators.

The precision conversion is transparent to end users, such as domain scientists in astronomy, and is delivered to the execution system immediately before computation to minimize memory consumption.
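A minimal numpy sketch of this demote-before-compute idea (illustrative only; in the actual Chameleon design the runtime converts tiles transparently just before task execution, not the user):

```python
import numpy as np

def gemm_demoted(A, B):
    """Demote FP32 operands to FP16 immediately before the GEMM.

    The FP16 copies need half the memory of the FP32 originals; here the
    product is still accumulated in FP32 to limit rounding error.
    """
    A16 = A.astype(np.float16)          # conversion just before compute
    B16 = B.astype(np.float16)
    return A16.astype(np.float32) @ B16.astype(np.float32)

rng = np.random.default_rng(3)
A = rng.standard_normal((64, 64)).astype(np.float32)
B = rng.standard_normal((64, 64)).astype(np.float32)
C16 = gemm_demoted(A, B)
C32 = A @ B
rel = np.linalg.norm(C16 - C32) / np.linalg.norm(C32)   # ~1e-3 for FP16
```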
Mixed Precision GEMM in Chameleon
HRTC Workload in Giant Ground-based Telescopes
In this strategy, the RTC delay introduced into the entire AO pipeline is entirely determined by the time to solution of the MVM predictive wavefront reconstructor. The output of the MVM corresponds to a command vector that triggers the movement of the actuators behind each deformable mirror to compensate for atmospheric turbulence in real time. Assuming a typical WFS sampling time of 1 ms (to maintain maximum uptime even in the low sodium season) and to obtain adequate AO rejection bandwidth, the total AO loop delay should be 2 frames (i.e., 2 ms) or lower.
Tile Low-Rank Matrix-Vector Multiplication (TLR-MVM)
- Algorithm Description
- Arithmetic Complexity of TLR-MVM
We go through all the tiles in the matrix A and obtain the results shown in Fig. Since MVM is inherently memory-bound, we stack the bases to increase data locality in the high-level caches of the memory subsystem. In the phase 1 calculation, the column size of each GEMV is nb and the row size of each GEMV is the sum of the ranks along this tile column.
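The three phases can be sketched in numpy as follows; this is a hedged illustration (the names `tlr_mvm`, `U`, `V`, and the exactly-rank-k tiles are assumptions), with per-tile GEMVs rather than the stacked-bases GEMVs used in the optimized implementation:

```python
import numpy as np

def tlr_mvm(U, V, x, nb, nt):
    """Three-phase TLR-MVM where tile (i, j) of A is U[i][j] @ V[i][j].T."""
    # Phase 1: per tile column j, project the input slice onto the V bases
    yv = [[V[i][j].T @ x[j * nb:(j + 1) * nb] for i in range(nt)]
          for j in range(nt)]
    # Phase 2: reshuffle intermediates from column-major to row-major order
    yu = [[yv[j][i] for j in range(nt)] for i in range(nt)]
    # Phase 3: per tile row i, expand with the U bases and accumulate
    y = np.zeros(nt * nb)
    for i in range(nt):
        for j in range(nt):
            y[i * nb:(i + 1) * nb] += U[i][j] @ yu[i][j]
    return y

rng = np.random.default_rng(4)
nb, nt, k = 32, 4, 5
# Build a matrix whose tiles are exactly rank k, so compression is lossless
U = [[rng.standard_normal((nb, k)) for _ in range(nt)] for _ in range(nt)]
V = [[rng.standard_normal((nb, k)) for _ in range(nt)] for _ in range(nt)]
A = np.block([[U[i][j] @ V[i][j].T for j in range(nt)] for i in range(nt)])
x = rng.standard_normal(nt * nb)
y = tlr_mvm(U, V, x, nb, nt)   # matches A @ x up to rounding
```

In real datasets the tiles are only approximately low rank (via truncated SVD), so the result matches the dense MVM up to the compression tolerance rather than machine precision.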
TLR-MVM Implementation and Performance Optimization
- OpenMP Implementation
- MPI+OpenMP Implementation
- CUDA Implementation
The authors in [67] implement TLR-MVM in C++ on the NEC SX-Aurora TSUBASA vector engine using standard MPI+OpenMP programming models. Since the single-precision complex data type for MVM (i.e., CGEMV) is widely supported by vendors' numerical libraries, we propose to further extend TLR-MVM to a larger family of systems, including x86, ARM, and GPU. Since SRI involves many frequency matrices, we merge each phase of TLR-MVM across all frequency matrices to further expose parallelism, reduce CUDA Graph overhead, and increase hardware occupancy.
Numerical Accuracy of TLR-MVM HRTC workload
Instead, the numerical accuracy is assessed by comparing the obtained SR of a compressed matrix to the obtained SR of the original control matrix (so if there is no compression, the resulting numerical accuracy is 1.0). From Figure 6.9, it is clear that there is an inevitable, albeit predictable, trade-off between numerical accuracy and speedup factor. For the various atmospheric conditions tested, a speedup factor of about 3.0 comes with very little loss in SR.
Performance of TLR-MVM in HRTC Workload
- Experimental Settings
- Synthetic Datasets
- MAVIS Datasets
In particular, we report the performance of TLR-MVM across three generations of NVIDIA GPUs, namely P100/V100/A100. AMD Rome and NEC Aurora come in under 200 µs for a single TLR-MVM call, opening up new possibilities for the future. Figure 6.18 shows the roofline performance models of our TLR-MVM implementation on AMD EPYC Rome and Fujitsu ARM A64FX processors on the MAVIS dataset.
Seismic Redatuming by Inversion
- Impact of Matrix Reordering
- Impact on SRI Computational Stages
In both cases, special attention was paid to the implementation of the integral operator and the handling of kernels that cannot fit directly into the main memory of a single compute node. Hilbert reordering provides the best compression rate by better exploiting the data sparsity of the frequency matrices. While the authors in [67] only focused on the implementation of the forward batched MVM operation, here we cover the entire SRI workflow, where an iterative solver is used to solve the Marchenko equations.
Batched TLR-MVM
- Algorithm Description
- Arithmetic Complexity of Batched TLR-MVM
In the seismic application, suppose there are F frequency matrices; the overall flops and memory traffic of dense MVM are F × 2mn and B × F × (mn + n + m), respectively, where B is the number of bytes per element. The overall flops and memory traffic of TLR-MVM for all frequency matrices are 4 × K_F × nb and B × (2 × K_F × nb + 4 × K_F + m + n), where K_F is the sum of all ranks over all frequencies. If the rank sum K is not large enough, the algorithm may not saturate the memory bandwidth due to a suboptimal hardware occupancy.
Batched TLR-MVM Implementation and Performance Optimization
- Merge-Phase Strategy for Intra-Node Load Balancing
- ZigZag Mapping Strategy for Inter-Node Load Balancing
According to the flop count of TLR-MVM, the rank sum K of each frequency matrix plays an important role. The ZigZag mapping strategy is designed in light of the increasing rank sums across frequencies in the seismic redatuming application, and achieves load balancing between nodes. To leverage performance on distributed-memory systems, we allocate frequency matrices evenly across compute nodes.
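The idea behind such a zigzag (boustrophedon) assignment can be sketched as follows; the function name `zigzag_map` and the linearly increasing rank sums are illustrative assumptions, not the thesis code or data:

```python
def zigzag_map(rank_sums, nnodes):
    """Assign frequency matrices to nodes in zigzag order of rank sum.

    Sort matrices by rank sum, then sweep the nodes 0..n-1, n-1..0, ...
    so that heavy and light matrices alternate on every node.
    """
    order = sorted(range(len(rank_sums)), key=lambda f: rank_sums[f])
    assign = [[] for _ in range(nnodes)]
    for pos, f in enumerate(order):
        cycle, off = divmod(pos, nnodes)
        node = off if cycle % 2 == 0 else nnodes - 1 - off  # reverse sweep
        assign[node].append(f)
    return assign

# Rank sums that grow with frequency (illustrative values only)
ranks = [10 + 5 * f for f in range(16)]
assign = zigzag_map(ranks, 4)
loads = [sum(ranks[f] for f in fs) for fs in assign]   # per-node rank sum
```

Because each node receives matrices from both ends of the sorted order, the per-node flop totals stay close even when rank sums vary strongly across frequencies, unlike a contiguous block assignment.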
Numerical Accuracy of Batched TLR-MVM in SRI
Finally, all numerical results presented below will be shown for a single focal point in the subsurface at location x_F. To begin with, we consider the same synthetic geological model used in [1] (their Fig. 3) and compare the reconstruction of the joint Green's function g = g⁺ + g⁻ obtained using the original R-kernel and the TLR-compressed R-kernel with different combinations of tile size and precision. Although not discussed by the authors in [106], this rank imbalance across different matrices inevitably leads to load imbalance.
Performance of Batched TLR-MVM in SRI
- Experimental Settings
- Synthetic Datasets
- Performance Impact Of Optimization Techniques
TLR-MVM achieves 8.94X and 1.58X performance speedups compared to dense MVMs on an Intel CSL and a NEC VE, respectively. Furthermore, TLR-MVM achieves up to a 3.3X speedup over dense MVM on a single NEC VE.
Porting TLR-MVM onto AI Accelerators
- The Graphcore IPU Hardware Technology
- Implementation Details on GraphCore IPU
- Performance of TLR-MVM on IPU
- The Cerebras CS-2 Wafer Scale Engine
- Implementation Details on Cerebras CS-2 System
- Performance of TLR-MVM on Cerebras CS-2 System
For the seismic imaging application and its large data sets, the TLR-MVM algorithm comes to the rescue to reconcile the IPU architecture with the application. We are developing a new high performance TLR-MVM kernel that takes advantage of the high SRAM memory bandwidth of Cerebras CS-2 systems while mapping the seismic datasets to the millions of PEs. We show the performance of the TLR-MVM kernel on a single CS-2 on the five validated configurations.
Mixed Precision TLR-MVM (MP TLR-MVM)
- Algorithm Description
- Arithmetic Complexity of MP TLR-MVM
MP TLR-MVM Implementation and Performance Optimization
- Hardware/Software Limitations for MP TLR-MVM
- MP TLR-MVM Using FP16/FP32
- MP TLR-MVM Using INT8/INT32/FP16/FP32
Each CUDA core in the same warp loads a 4x4 block of U/V bases close to each other in the same rows/columns. In this way, we can preserve the information in the matrix and perform MVM in INT8 for TLR-MVM with limited loss of accuracy. We manage 16 INT8s together in the same way to use 128-bit load and store as much as possible.
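The accuracy-preserving part of the INT8 path can be illustrated with per-tile symmetric quantization: scale each tile by its max-abs value, multiply in INT8 with INT32 accumulation, then dequantize. This numpy sketch is an assumption-laden illustration (names `quantize_int8`/`int8_mvm` are ours), not the CUDA kernel itself:

```python
import numpy as np

def quantize_int8(M):
    """Symmetric quantization: int8 values plus one FP32 scale per tile."""
    scale = np.abs(M).max() / 127.0
    q = np.clip(np.round(M / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def int8_mvm(M, x):
    """MVM with int8 operands, int32 accumulation, FP32 dequantization."""
    qm, sm = quantize_int8(M)
    qx, sx = quantize_int8(x)
    acc = qm.astype(np.int32) @ qx.astype(np.int32)   # int32 accumulate
    return acc.astype(np.float32) * (sm * sx)         # undo both scales

rng = np.random.default_rng(5)
M = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)
y8 = int8_mvm(M, x)
y32 = M @ x
rel = np.linalg.norm(y8 - y32) / np.linalg.norm(y32)  # small quantization error
```

On GPU hardware, the INT8 products and INT32 accumulation map to tensor-core integer instructions, which is what makes the 4x storage reduction come with a throughput gain rather than a penalty.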
Numerical Accuracy of MP TLR-MVM in SRI
A schematic representation of the seismic acquisition geometry is shown in panel (a) of Figure 8.2 next to one kernel frequency matrix R (panel (b) of Figure 8.2). Finally, panel (i) shows the error associated with a scenario where the kernel frequency matrices R are rescaled using Hilbert sorting before compression. Here, the size of the inner band is determined by the Half Band Length (HBF) parameter.
Performance of MP TLR-MVM in SRI
- Experimental Settings
- Performance Results on 3D Seismic Dataset
- Performance Impact of GPU Streams on TLR-MVM
We study the impact of the number of GPU streams on the performance of TLR-MVM. As explained in Section 8.2.1, we achieve low performance compared to FP32 TLR-MVM on the corresponding hardware. In conclusion, we present a brief summary of the general impact of MP TLR-MVM on the overall SRI algorithm.
Summary
In addition, we present a new high-performance implementation of the Multi-Dimensional Convolution operator that uses algebraic compression and mixed precision, and use it to efficiently solve the Seismic Redatuming by Inversion (SRI) problem. Powered by Tile Low-Rank matrix approaches, we can speed up the most time-consuming operations of the modeling operator, i.e., the matrix-vector multiplications. Furthermore, we integrate two mixed-precision algorithms into the core of TLR-MVM to further improve the computational intensity of the kernel.
Future Work
Ltaief, "Tile Low-Rank GEMM Using Batched Operations on GPUs," in Proceedings of Euro-Par 2018: Parallel Processing. "Exploiting Data Sparsity for Large-Scale Matrix Computations," in Proceedings of the European Conference on Parallel Processing. Rousset, "A Novel Fast And Accurate Pseudo-analytical Simulation Approach For MOAO," in Adaptive Optics Systems IV, ser.
APPENDICES