This volume summarizes the research done and the results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing". 113 in Springer's Lecture Notes in Computer Science and Engineering series, corresponding report of the first phase of SPPEXA funding.
SPPEXA: The Priority Program
Remarks on the Priority Program SPPEXA
Nagel – and three of the four researchers who have acted as program coordinators over the years – Philipp Neumann, Benjamin Uekermann and Severin Reiz – highlight the SPPEXA program itself. It provides an overview of SPPEXA's design and implementation, highlights its accompanying and supporting activities (internationalisation, especially with France and Japan; workshops; PhD courses; diversity measures) and provides some statistics.
1 Preparation
2 Design Principles
Application software is the "user" of HPC systems, usually appearing as legacy code that has been developed over many years. Therefore, the concept of SPPEXA was to have a set of larger projects or project consortia (research units - . Forschergruppen), all of which should address at least two of the six big themes with their own research agenda; and all this should combine a suitable large-scale application with methodological advances of HPC.
3 Funded Projects and Internal Structure
ESSEX-2 – Sparse Solver-Gerät im Exa-Maßstab. Gerhard Wellein (Erlangen), Achim Basermann (Köln), Holger Fehske (Greifswald), Georg Hager (Erlangen), Bruno Lang (Wuppertal), Tetsuya Sakurai (Tsukuba; japanischer Partner. yo. ExaDG – eingestellter High Order Galerkin für Exa-Skala. Guido Kanschat (Heidelberg), Katharina Kormann (München), Martin Kronbichler (München) und Wolfgang A.
4 SPPEXA Goes International
5 Joint Coordinated Activities
2017: Sebastian Schweikl (Passau, Bachelorarbeit), Simon Schwitanski (Aachen, Masterarbeit) und Moritz Kreutzer (Erlangen, Doktorarbeit);. 2018/2019: Piet Jarmatz (München/Hamburg, Masterarbeit) sowie Sebastian Kuckuk und Christian Schmitt (Erlangen, Doktorarbeit).
6 HPC Goes Data
HPC centers around the world are seeing an increasing share of data-driven workloads on their devices. If artificial intelligence, machine learning or deep learning have become so popular recently, it is much more due to the fact that an established methodology can succeed because of HPC than because of the new AI/ML/DL methodology itself.
7 Shaping the Landscape
This is not surprising: as science and scientific methodology evolve, so do the kinds of research conducted in that context. It was and still is an enabler of numerical simulation, and it has become a crucial enabler of data analysis and artificial intelligence.
8 Concluding Remarks
Qualification
Software from Project Consortia
Project Consortia Key Publications
Röhrig-Zöllner, M., Thies, J., Kreutzer, M., Alvermann, A., Pieper, A., Basermann, A., Hager, G., Wellein, G., Fehske, H.: Increasing the efficiency of the Jacobi–Davidson method by blocking. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pp.
Collaboration from France
1 HPC Software in Three Phases
Implementing them with high performance in parallel and/or distributed systems is a delicate task and requires a large ecosystem of people who have cross-disciplinary skills. These software frameworks provide the environment for high-performance programming and often hide the complexity of parallel and/or distributed architectures.
2 Trilateral Projects in SPPEXA and Their Impact
EXAMAG is an ad-hoc HPC software of the processing category of the classification given in the previous section. An important number of indirect results of SPPEXA activities (workshops, “open” trilateral meetings, doctoral retreats, etc.) created new connections between German and French colleagues and students.
3 What Will Be Next?
The images or other third-party materials in this chapter are included under the Creative Commons license of the chapter, unless otherwise noted in a line of credit accompanying the materials. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by law or exceeds the permitted use, you must obtain permission directly from the copyright holder.
Collaboration from Japan
ESSEX-II has developed pK-Open-HPC (an extended version of ppOpen-HPC, a framework for exa-feasible applications), such as preconditioned iterative solvers for quantum science. In the ESSEX-II project, CRAFT (a library for application-level checkpoint/restart and automatic fault tolerance) was developed for fault tolerance in exa-scale systems using point checking.
Xevolver and ExaFSA
For example, some loop optimizations with compiler directives are required for the NEC SX-ACE vector computer system to vectorize properly and thus run loops efficiently. In this project, Xevolver is used to apply loop optimizations without major changes to the original code.
The Role of Japan in HPC Collaborations
SPPEXA Project Consortia Reports
Ad hoc File Systems at Extreme Scales
1 Introduction
In addition, the ADA-FS project investigated how such a temporary and on-demand burst buffer file system can be integrated into the workflow of batch systems in supercomputing environments. Section 4 discusses how we can detect system resources, such as the amount of nodes' local storage or a node's NUMA configuration that can be used to deploy the GekkoFS file system, even on heterogeneous compute nodes.
2 GekkoFS—A Temporary Burst Buffer File System for HPC
Related Work
- General-Purpose Parallel File Systems
- Node-Local Burst Buffers
- Metadata Scalability
Burst buffers are fast, intermediate storage systems that aim to reduce the load on the global file system and reduce the I/O load of an application [38]. This node-local storage is then used to create a (distributed) file system that spans a number of nodes over the lifetime of a compute job, for example.
Design
- POSIX Semantics
- Architecture
- GekkoFS Client
- GekkoFS Daemon
GekkoFS daemons consist of three parts: (1) a Key Value Store (KV Store) used for storing metadata; (2) an I/O persistence layer that reads/writes data from/to the underlying local storage system; and (3) an RPC-based communication layer that accepts local and remote connections to handle file system operations. This allows GekkoFS to be network independent and to efficiently transfer large data within the file system.
Evaluation
- Experimental Setup
- Metadata Performance
- Data Performance
2 GekkoFS file creates, statistics, and discard throughput for an increased number of nodes compared to a Luster file system. 3 GekkoFS sequential throughput for each process running on its own file compared to simple SSD peak throughput.
3 Scheduling and Deployment
- Walltime Prediction
- Node Prediction
- On Demand Burst Buffer Plugin
- Related Work
Figure 5 shows a comparison of user-specified wall times and predicted wall times. T˜Req=TRun+λ(TReq−TRun) with λ, where TReq is the user-requested wall time and TRun is the task execution time.
4 Resource and Topology Detection
Design and Implementation
The data model of the source database is shown in Fig.9 and consists of four simple tables. 8 Overview of resource discovery components, the blue components are part of the sysmap tool suite, the resource database is highlighted as the yellow box, the red box represents the query component, in this case the Job-Scheduler.
5 On Demand File System in HPC Environment
Deploying on Demand File System
The batch system manages cluster resources and runs user jobs on assigned nodes. These scripts are used to clean, prepare or test the full functionality of the nodes.
Benchmarks
With reduced number of core switches, the write throughput drops due to the reduced network capacity. If storage pools are created according to the topology, it is possible to achieve the same performance as with all six switches.
Concurrent Data Staging
6 GekkoFS on NVME Based Storage Systems
For reading, the bandwidth achieved is about 14 GB/s, the values are more stable than for the sequential case, which may be some caching effects. We were able to find that the different access patterns make no difference to the write bandwidth.
7 Conclusion
Parts of this research were carried out using the Mogon II supercomputer and services offered by the Johannes Gutenberg University Mainz. The authors gratefully acknowledge the GWK support for funding this project by providing computer time through the Center for Information Services and HPC (ZIH) at TU Dresden at HPC-DA.
The AIMES Project
The main goals of the project were: (1) to increase programmability; (2) increasing storage efficiency; (3) provide a common benchmark for icosahedral models. Targeting different architectures to run high-level code is an important aspect of the compilation process.
2 Related Work
Domain-Specific Languages
Compression
3 Towards Higher-Level Code Design
- Our Approach
- Extending Modeling Language
- Extensions and Domain-Specific Concepts
- Code Example
- Workflow and Tool Design
- Configuration
- Estimating DSL Impact on Code Quality and Development Costs
A basic set of declaration specifiers allows you to control the dimensionality of the grid and the localization of the field relative to the grid. We can improve the use of caches and thus the performance of the application by 30–.
4 Evaluating Performance of our DSL
- Test Applications
- Test Systems
- Evaluating Blocking
- Evaluating Vectorization and Memory Layout Optimization
- Evaluating Inter-Kernel Optimization
- Scaling with Multiple-Node Runs
- Exploring the Use of Alternative DSLs: GridTools
We used 1,048,576 grid points (and more points over multi-node runs) to discern the surface of the world. As part of the AIMES project, we evaluated the viability of using GridTools for the dynamic core of NICAM: namely NICAM-DC.
5 Massive I/O and Compression
- The Scientific Compression Library (SCIL)
- Supported Quantities
- Compression Chain
- Algorithms
Based on the defined quantities, the values and the characteristics of the data to compress, the appropriate compression algorithm is selected. Based on the basic data type provided, the initial stage of the chain is entered.
6 Evaluation of the Compression Library SCIL
Single Core Performance
Algorithm ratio MiB/s MiB/s (a) 1%absolute tolerance. 4) output of the NICAM Icosahedral Global Atmospheric model which stored 83 variables as NetCDF. The characteristics of scientific data vary and so does the locality of data within data sets.
Compression in HDF5 and NetCDF
Enabling lossy compression and accepting an accuracy of 1% or 0.1% can improve performance in this example by up to 13x. The SCIL library can be used directly by existing models, avoiding the overhead and obtaining the results as shown above.
7 Standardized Icosahedral Benchmarks
IcoAtmosBenchmark v1: Kernels from Icosahedral Atmospheric Models
- Documentation
The values of the input variables in the kernel routine's argument list are stored as a data file, just before the call in the original model run. Similarly, the values of the output variables in the argument list are stored immediately after the call in the implementation.
8 Summary and Conclusion
In: Proceedings of the 23rd International Symposium on High Performance Parallel and Distributed Computing, p.
DASH has been developed in the context of the SPPEXA priority program for Exascale computing since 2013. We first provide an overview of the DASH Runtime System (DART) in Sect.2, focusing on features related to task execution with global dependencies and dynamic hardware topology discovery.
2 The DASH Runtime System
Tasks with Global Dependencies
- Distributed Data Dependencies
- Ordering of Dependencies
- Implementation
- Results: Blocked Cholesky Factorization
- Related Work
Finally, the application waits for the completion of the global task graph execution in the tocomplete() call. Implementations of the APGAS model can be found in the X10 [9] and Chapel [8] languages, as well as part of UPC and UPC++[21].
Dynamic Hardware Topology
- Locality-Aware Virtual Process Topology
- Locality Domain Graph
- Dynamic Hardware Locality
- Supporting Portable Efficiency
In the MPI-based implementation of the DASH runtime, a unit corresponds to an MPI rank, but can occupy a locality domain containing multiple CPU cores or, in principle, multiple compute nodes. Partial results of units are then only reduced at the unit in the respective leadership team.
3 DASH C ++ Data Structures and Algorithms
Smart Data Structures: Halo
- Stencil Specification
- Region and Halo Specifications
- Global Boundary Specification
- Halo Wrapper
- Stencil Operator and Stencil Iterator
- Performance Comparison
Instead, this special halo region can be written by the simulation (initially alone or regularly updated). The halo data exchange can be done per region or for all regions at once.
Parallel Algorithms: Sort
- Preliminaries
- Related Work
- Histogram Sort
- Evaluation and Conclusion
Due to the high communication volume, communication computation overlap is required to achieve good scaling efficiency. Similar to the Butterfly algorithm, i sends to destination (i + r) (modp) and receives from (i−r) (modp) in roundr [33].
4 Use Cases and Applications
A Productivity Study: The Cowichan Benchmarks
- The Cowichan Problems
- The Parallel Programming Approaches Compared
- Implementation Challenges and DASH Features Used
- Evaluation
- Summary
There are two versions of the Cowichan problems, and here we restrict ourselves to a subset of the problems in the second set. We compare our implementation of the Cowichan problems with existing solutions in the following four programming approaches.
Task-Based Application Study: LULESH
In a second step, the iteration chunks of the task loops were connected through dependencies, allowing a breadth-first scheduler to execute tasks from different task loop statements simultaneously. For larger problem sizes (s = 3003 elements per node), the DASH port scale acceleration over the reference implementation is about 25%.
5 Outlook and Conclusion
MEPHISTO
Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications (HPCC 2016), Sydney, pp In: Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications (HPCC 2016), Sydney, pp.
Exascale
A key issue for large-scale simulations is the scalable (in terms of size and parallelism) generation of the sparse matrix representing the model Hamiltonian. Finally, in Section 7, we highlight the collaborations initiated and supported by SPPEXA through the ESSEX-II project.
2 Summary of the ESSEX-I Software Structure
After presenting a brief overview of the most important achievements in the first phase of the ESSEX-I project, Sects.2 and 3 detail the algorithmic developments in ESSEX-II, especially regarding preconditioners and projection-based methods for obtaining intrinsic values. The particular choice of C and σ affects the performance of the spMVM run; optimal values usually depend on the matrix and hardware.
3 Algorithmic Developments
- Preconditioners (ppOpen-SOL)
- Regularization
- Hierarchical Parallel Reordering
- Multiplicative Schwarz-Type Block Red-Black Gauß–Seidel Smoother
- The BEAST Framework for Interior Definite Generalized Eigenproblems
- Projector Types
- Flexibility, Adaptivity and Auto-Tuning
- Levels of Parallelism
- A Posteriori Cross-Interval Orthogonalization
- Robustness and Resilience
- Further Progress on Contour Integral-Based Eigensolvers
- Relationship Among Contour Integral-Based Eigensolvers
- Extension to Nonlinear Eigenvalue Problems
- Recursive Algebraic Coloring Engine (RACE)
Therefore, we use an adaptation of the Conjugate Gradient (CG) method for complex symmetric matrices called COCG (conjugate orthogonal conjugate gradient [79]). Figure 4 shows the effect of the MS-BRB-GS(α) smoother on a single node of the ITO system (Intel Xeon Gold 6154 (Skylake-SP) Cluster at Kyushu University).
4 Hardware Efficiency and Scalability
Tall and Skinny Matrix-Matrix Multiplication (TSMM) on GPGPUs
Sub-optimal” is a well-defined term here since the multiplication of a M×KmatrixAmet aK×NmatrixBmetKM, Nand smallM, Ni is a memory-bound operation: AtM=N, its computational intensity is only M/8 flop/byte. 12 Best performance for each matrix size with M = Nin compared to the roofline limit, cuBLAS and CUTLASS, with K=223.
BEAST Performance and Scalability on Modern Hardware
- Node-Level Performance
- Massively Parallel Performance
An unusual finding was found on the CPU-only SNG system: although the code ran fastest with pure OpenMP on one node (223 Gflop/s), the increased performance was observed to be better with one MPI process per socket. The ideal scaling and efficiency figures in Figures 14a–c use the best value at the smallest number of nodes in the array as a reference.
5 Scalable and Sustainable Software
- PHIST and the Block-ILU
- Integration of the Block-ILU Preconditioning Technique
- BEAST
- CRAFT
- CRAFT Benchmark Application
- ScaMaC
The current status of the software developed in the ESSEX-II project is summarized as follows. 19 Sparseness pattern of the Hubbard (Hubbard, n_sites=40, n_fermions=20) and spin chain (SpinChainXXZ, n_sites=32, n_up=8) case.
6 Application Results
Eigensolvers in Quantum Physics: Graphene, Topological Insulators, and Beyond
Such large-scale calculations rely heavily on the optimized spMMVM and TSMM kernels of the GHOST library (see section 5). This problem can be partially solved by overlapping communication with computations, as in the spMMVM (see section 2), but a full solution to restore parallel efficiency is not yet available.
New Applications in Nonlinear Dynamical Systems
In order to appreciate the numerical progress reflected in these figures, the different scaling of the NMVM numerical effort (measured in terms of the dominant performance of the spMVM) for computing extremal and intrinsic eigenvalues must be taken into account (cf. discussion in [62]). The sparsity pattern is not regular and is data dependent as it reflects neighborhood relationships in the data.
7 International Collaborations
Chen, H., Maeda, Y., Imakura, A., Sakurai, T., Tisseur, F.: Improving the numerical stability of the Sakurai-Sugiura method for quadratic eigenvalue problems. Galgon, M., Krämer, L., Lang, B., Alvermann, A., Fehske, H., Pieper, A.: Improving the robustness of the FEAST algorithm and solving eigenvalue problems from graphene nanoribbons.
Galerkin for the Exa-Scale
If we define the general goal as the minimum computational cost to achieve a predetermined accuracy, this goal can be divided into three components, namely discretization efficiency in terms of the number of degrees of freedom (DoF) and time steps, solver efficiency in terms of the number of iterations, and execution efficiency [22]:. In Section 3, we describe optimizations of the conjugate gradient method for efficient memory access and communication.
2 Node-Level Performance Through Matrix-Free Implementation