
2.1 The Sparse Grid Combination Technique


In our project, we overcome this problem with the help of the sparse grid combination technique. These hierarchical solutions are then added locally to the corresponding subspaces of the sparse grid.
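As a minimal sketch of how such a combination scheme can be enumerated, assuming the classical (non-truncated) formula with ℓ_min = (1, …, 1); the function names below are illustrative and not part of the project code:

    from itertools import product
    from math import comb

    def combination_coefficients(dim, n):
        """Coefficients of the classical sparse grid combination technique:
        grids with level sum |l|_1 = n + (dim-1) - q receive the coefficient
        (-1)^q * C(dim-1, q), for q = 0, ..., dim-1 and levels l_i >= 1."""
        coeffs = {}
        for q in range(dim):
            target = n + (dim - 1) - q
            c = (-1) ** q * comb(dim - 1, q)
            for level in product(range(1, target + 1), repeat=dim):
                if sum(level) == target:
                    coeffs[level] = c
        return coeffs

    # 2D example matching Fig. 1: grids (1,3), (2,2), (3,1) get +1; (1,2), (2,1) get -1
    print(combination_coefficients(dim=2, n=3))

The coefficients sum to one, so each component grid enters the combined sparse grid solution with the proper weight.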

Fig. 1 Grids resulting from a 2D combination technique with ℓ_min = (1, 1) and ℓ_max = (3, 3).

Load Balancing

The simplest approach is to consider only the number of grid points [9], assuming that all grids with the same level sum |ℓ|_1 take the same time to execute. In this model, the runtime contribution of the number of grid points N is evaluated with a functor f(N).
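A greedy sketch of this idea, assuming f(N) = N and a simple longest-processing-time assignment to process groups; the cost functor and the point count below are illustrative assumptions, not the actual model from [9]:

    import heapq

    def grid_points(level):
        """Assumed point count of a full grid with 2^l + 1 points per dimension."""
        n = 1
        for l in level:
            n *= 2 ** l + 1
        return n

    def assign_grids(levels, num_groups, cost=grid_points):
        """Always hand the next most expensive grid to the least loaded group."""
        heap = [(0, g, []) for g in range(num_groups)]
        heapq.heapify(heap)
        for level in sorted(levels, key=cost, reverse=True):
            load, g, grids = heapq.heappop(heap)
            grids.append(level)
            heapq.heappush(heap, (load + cost(level), g, grids))
        return {g: (load, grids) for load, g, grids in heap}

    levels = [(3, 1), (2, 2), (1, 3), (2, 1), (1, 2)]   # component grids from Fig. 1
    for g, (load, grids) in sorted(assign_grids(levels, 2).items()):
        print(f"group {g}: load={load}, grids={grids}")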

Fault Tolerant Combination Technique

If a process failure occurs in one of the process groups, the remaining ranks are declared as spare processes [19]. However, this can waste valuable resources if only a few ranks in the process pool have failed.

4 Numerical Results

4.1 Convergence

  • Linear Runs
  • Nonlinear Runs
  • Scaling Analysis
    • Load Balancing
  • Fault Tolerance
    • Fault Tolerant Combination Technique
    • libSpina

The legend on the lower plot applies in the same way to the upper one. We can see that the overhead of debugging is negligible compared to the run time of the solver.

Fig. 9 Error of g(λ_max) compared to the reference solution with ℓ = (3, 1, 8, 8, 8)

5 Conclusion

There was no performance degradation of FT-GENE for large node counts, as can be seen in the lower part of Table 3. Furthermore, FT-GENE demonstrated a reasonable ability to tolerate faults at no additional performance cost, even under the constraint of relying solely on standard MPI.

Obersteiner, M., Parra Hinojosa, A., Heene, M., Bungartz, H.J., Pflüger, D.: A highly scalable algorithm-based fault-tolerant solver for gyrokinetic plasma simulations.
Parra Hinojosa, A., Harding, B., Hegland, M., Bungartz, H.J.: Dealing with silent data corruption with a sparse grid combination technique.

EXAMAG: Towards Exascale Simulations of the Magnetic Universe

1 Introduction

2 The IllustrisTNG and Auriga Simulations

Outside the galactic centers, differential rotation in the discs leads to linear amplification of the magnetic fields, typically saturating around z = 0.5–0.

Fig. 3 Faraday rotation maps seen by an external observer for three of the Auriga simulations.

Fig. 1 Thin projections through the TNG100 simulation, showing (from top to bottom) the gas density field, the metallicity, the magnetic field strength, the dark matter density, and the stellar density

3 Discontinuous Galerkin Hydrodynamics for Astrophysical Applications

The Discontinuous Galerkin Method on a Cartesian Mesh with Automatic Mesh Refinement

In [5] we developed a consistent finite volume entropy scheme based on a symmetrized version of the MHD equations. The useful properties of the DG method found for hydrodynamic simulations were also confirmed by Guillet et al.

Fig. 4 A Kelvin-Helmholtz simulation with fourth order DG and adaptive mesh refinement (see [28])

The Discontinuous Galerkin Method on a Moving Mesh

This leads to significant geometric complications in formulating a consistent time evolution of the basis function expansion. … solution in three dimensions, so the path to a high-order DG method in 3D using AREPO's moving Voronoi mesh is now open.

Fig. 6 Convergence of the MHD isodensity vortex problem computed for different DG orders.

Turbulence Simulations with Higher Order Numerics

4 Performance and Public Release of the Two Cosmological Hydrodynamical Simulation Codes GADGET-4 and AREPO

We compare different numerical schemes for controlling the divergence of the magnetic field, including constrained transport (CT), Dedner cleaning, and Powell source terms.

Fig. 9 Scaling of the GADGET-4 code for cosmological simulations of structure formation on the SuperMUC-NG supercomputer.
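For orientation, the Powell approach in its standard eight-wave form adds source terms proportional to the local divergence error to the ideal MHD equations (written here from the general literature, not from the specific implementation compared above):

    \partial_t (\rho, \rho\mathbf{v}, \mathbf{B}, e)^T + \nabla \cdot \mathbf{F}
        = -(\nabla \cdot \mathbf{B}) \, (0, \mathbf{B}, \mathbf{v}, \mathbf{v} \cdot \mathbf{B})^T

Dedner cleaning instead advects and damps the divergence error via an additional scalar field, while constrained transport keeps the discrete divergence at machine precision by construction.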

Fig. 8 Growth rate of the magnetic field strength in units of the eddy-turnover time t_ed = L/(2 M c_s) (top panel) and the ratio of the magnetic to the kinetic energy in the saturated state (bottom panel) for isothermal turbulence in a uniform box for di

5 Summary and Discussion

Marinacci, F., Vogelsberger, M., Pakmor, R., Torrey, P., Springel, V., Hernquist, L., Nelson, D., Weinberger, R., Pillepich, A., Naiman, J., Genel, S.: First results from the IllustrisTNG simulations: radio haloes and magnetic fields.
Nelson, D., Pillepich, A., Springel, V., Weinberger, R., Hernquist, L., Pakmor, R., Genel, S., Torrey, P., Vogelsberger, M., Kauffmann, G., Marinacci, F., Naiman, J.: First results from the IllustrisTNG simulations: the galaxy colour bimodality.

EXASTEEL: Towards a Virtual Laboratory for the Multiscale Simulation of Dual-Phase Steel Using High-Performance Computing

Let us note that the microscopic problems can be solved in parallel once the solution of the macroscopic problem is available as input. For scaling results of the FE2 method to more than one million MPI ranks, see Fig. 3 in Sect. 3.2 and [64].
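A toy illustration of that parallel structure, with plain Python processes standing in for MPI ranks; solve_micro_problem is a made-up placeholder, not the FE2TI micro solver:

    from concurrent.futures import ProcessPoolExecutor

    def solve_micro_problem(macro_strain):
        """Stand-in for one microscopic RVE boundary value problem driven by the
        macroscopic strain at a single Gauss point; returns a homogenized stress.
        A purely fictitious linear response is used here for illustration."""
        stiffness = 2.0
        return stiffness * macro_strain

    def fe2_macro_step(macro_strains, workers=4):
        """Once the macroscopic strains at all Gauss points are known, the
        independent micro problems can be solved concurrently."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(solve_micro_problem, macro_strains))

    if __name__ == "__main__":
        print(fe2_macro_step([0.001, 0.002, 0.004, 0.008]))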

2 Nakajima Test

It uses knowledge about the position of the crack and evaluates the major and minor strain values in the last recorded image before cracking along cross-sections perpendicular to the crack. For the least-squares fit, we need to determine optimal fitting windows for both sides of the crack separately; see Figs. 2 and 8 (below).

Fig. 1 Left: Cross section of the initial test setup of the Nakajima test. Right: Dimensions of a specimen used for the simulation of the Nakajima test with a shaft length of 25 mm and a parallel shaft width of 90 mm

3 FE2TI: A Highly Scalable Implementation of the FE2 Algorithm

  • FE2 Algorithm
  • FE2TI Software Package
  • Contact Kinematics and Incorporation of Contact Constraints for Frictionless Contact in FE2TI
  • Algorithmic Improvements in FE2TI
    • Checkpoint/Restart

However, the contact contributions to the stiffness matrix and the right-hand side are calculated in the coordinate system of the deformable body. The time required for a single macroscopic Newton step again depends on the solution time of the microscopic problems.

Fig. 3 Weak scalability of the FE2TI software on the JUQUEEN supercomputer [44]. Left: Time to solution of a single load step solving a three-dimensional heterogeneous hyperelastic model problem; uses Q1 finite elements (macro) and P2 finite elements (micr

4 Numerical Simulation of the Nakajima Test Using FE2TI

  • Description of Specimen Geometry
  • Exploiting Symmetry
  • Failure Criterion
  • Numerical Realization of the Experimental Cross Section Method

To simulate one quarter, we need to add further boundary conditions along the symmetric boundaries of the quarter. See Fig. 9 for an example of the evolution of the failure criterion W during the Nakajima simulation.

Fig. 7 Left: Symmetric computational domain from the full sample sheet and additional Dirichlet boundary conditions along the symmetric boundary surfaces

5 The Virtual Forming Limit Diagram Computed with FE2TI

Fig. 9 Evolution of the failure criterion W during the simulation for the specimen with a parallel shaft width of 50 mm.
Fig. 10 Final solution of the simulation for a specimen with a parallel shaft width of 70 mm; z direction (top left), thickness (top right), von Mises stress (bottom left), major strain (bottom center), and thinning rate (bottom right).

Fig. 9 Evolution of failure criterion W during the simulation for the specimen with a width of the parallel shaft of 50 mm

6 Linear and Nonlinear Parallel Solvers

Nonlinear FETI-DP Framework

  • Improving Energy Efficiency
  • Globalization

Fig. 14 Comparison of classical Newton-Krylov-FETI-DP (NK), Nonlinear-FETI-DP-1 (NL1), Nonlinear-FETI-DP-2 (NL2), Nonlinear-FETI-DP-3 (NL3), and Nonlinear-FETI-DP-4 (NL4) on the JUQUEEN supercomputer.
Fig. 17 Total energy to solution for Newton-Krylov-FETI-DP (NK) and Nonlinear-FETI-DP-3 (NL3) with the normal barrier (b) and the test-sleep approach (b-ts).

Fig. 13 The coarse problem is marked with squares, the interior degrees of freedom by circles, and the dual degrees of freedom by dots

Nonlinear BDDC Methods

For 262,144 subdomains, Newton-Krylov-BDDC failed to converge within 20 Newton iterations. Right: Weak scaling of the linear BDDC solver with an approximate coarse solve (using AMG) on the JUQUEEN (BG/Q) and Theta (KNL) supercomputers for a heterogeneous linear elastic model problem in two dimensions with 14k d.o.f.

7 Multicore Performance Engineering of Sparse Triangular Solves in PARDISO Using the Roofline Performance Model

The performance of the forward and backward substitution is analyzed and benchmarked for a representative set of sparse matrices on modern x86 multicore architectures. It captures the behavior of the measured performance quite well compared to the original roofline model.
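A back-of-the-envelope roofline estimate for such a substitution sweep; the machine numbers and the per-nonzero traffic are assumptions chosen only to show how the bound is formed:

    def roofline_bound(peak_flops, peak_bandwidth, arithmetic_intensity):
        """Classical roofline: performance is capped either by the peak
        floating-point rate or by bandwidth times arithmetic intensity."""
        return min(peak_flops, peak_bandwidth * arithmetic_intensity)

    # Assumed traffic per nonzero of the triangular factor: one 8-byte value and
    # one 4-byte index streamed from memory, with about 2 flops performed on it.
    intensity = 2.0 / 12.0                      # ~0.17 flop/byte
    peak = 1.0e12                               # 1 TFlop/s per socket (assumed)
    bandwidth = 100.0e9                         # 100 GB/s memory bandwidth (assumed)
    print(f"{roofline_bound(peak, bandwidth, intensity) / 1e9:.1f} GFlop/s")

With these numbers the solve is firmly bandwidth bound, which is the typical situation the roofline analysis in this section formalizes.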

Table 5 Details of the Intel and AMD hardware systems evaluated

8 Improvement of the Mechanical Model for Forming Simulations

  • Initial Volumetric Strains (IVS) Approach
  • Parameter Identification Approach for Ferritic Mixed Hardening
  • Quantification of Uncertain Material Properties
  • Crystal Plasticity
  • Macroscopic Yield Surface Based on Polycrystalline RVEs
  • One-Way Coupled Simulation of Deep-Drawing Using Polycrystalline Unit Cells

However, the overall small differences in the Bauschinger factor values indicate that the height of the ferrite hardening curve is the decisive factor, since ferrite kinematic hardening has a strong influence on the macroscopic response of DP steel.

Fig. 20 (a) Illustrations of the steps involved in the (modified) initial volumetric strains approach, (b) comparison of macroscopic stress-strain curves for FE2 uniaxial tension calculations for simplified microstructure with spherical inclusion: (i) mod

9 Conclusion

Proceedings of the 14th International Conference on Domain Decomposition Methods in Science and Engineering. http://www.ddm.org/DD14
Klawonn, A., Lanser, M., Rheinbach, O.: Toward extremely scalable nonlinear domain decomposition methods for elliptic partial differential equations.

ExaStencils: Advanced Multigrid Solver Generation

1 Overview of ExaStencils

1.1 Project Vision

Project Results

A smaller task was to decide how to describe aspects of the execution platform in the target-platform description language (TPDL) [75] (see Sect. 3.3). Project ExaStencils included several case studies whose purpose was to demonstrate the practicality and flexibility of the approach (see Sect. 5).

Project Peripherals

They include studies close to real-world problems: the simulation of non-Newtonian and non-isothermal fluids (see Sect. 5.3) and a molecular dynamics simulation (see Sect. 5.4). We performed two additional studies at the periphery of ExaStencils, exploring alternative approaches (see Sect. 6).

2 The Domain of Multigrid Stencil Codes

2.1 Multigrid

Advancing the Mathematical Analysis of Multigrid Methods

For this purpose, we need to determine the contraction properties of the iteration operator of the multigrid method for a given configuration. The code in Listing 1 is essentially a direct implementation of the formula for the iteration operator.
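A small numerical stand-in for such an analysis, using 1D Poisson with damped Jacobi smoothing and a Galerkin coarse grid; this is not Listing 1 from the text, only a generic check of the two-grid contraction factor:

    import numpy as np

    def two_grid_contraction(n=31, nu1=2, nu2=2, omega=2.0 / 3.0):
        """Spectral radius of E = S^nu2 (I - P Ac^{-1} R A) S^nu1 for the
        1D Poisson matrix, with full-weighting restriction and linear
        interpolation as prolongation."""
        h = 1.0 / (n + 1)
        A = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
             - np.diag(np.ones(n - 1), -1)) / h**2
        S = np.eye(n) - omega * (h**2 / 2) * A          # damped Jacobi smoother
        nc = (n - 1) // 2
        R = np.zeros((nc, n))
        for i in range(nc):
            R[i, 2 * i:2 * i + 3] = [0.25, 0.5, 0.25]   # full weighting
        P = 2 * R.T                                      # linear interpolation
        Ac = R @ A @ P                                   # Galerkin coarse operator
        CGC = np.eye(n) - P @ np.linalg.solve(Ac, R @ A)
        E = np.linalg.matrix_power(S, nu2) @ CGC @ np.linalg.matrix_power(S, nu1)
        return max(abs(np.linalg.eigvals(E)))

    print(f"two-grid contraction factor: {two_grid_contraction():.3f}")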

Advancing Multigrid Components

The power of the software lies in the fact that arbitrarily complex expressions can be analyzed.

Fig. 3 Convergence behavior of two block smoothers applied to the Stokes equations discretized on scaled unit-square grids with periodic and Dirichlet boundary conditions.

Fig. 2 Unknowns that are included in the blocks of the smoothing steps. (a) Vanka smoother

3 Stencil-Specific Programming in ExaStencils

  • The Domain-Specific Language ExaSlang
  • An ExaSlang Example
  • The Target-Platform Description Language
  • Generation of Target Code
  • Target-Specific Optimization
  • Parallelization
  • Compositional Optimization
  • Feature-Based Domain-Specific Optimization

Features and configuration options can have a significant impact on the performance of the generated code. We investigated the influence of different sampling strategies on the accuracy of the performance-influence models [80].

Fig. 5 ExaSlang layers of abstraction

5 Case Studies

  • Scalar Elliptic Partial Differential Equations
  • Image Processing
  • Computational Fluid Dynamics
  • Molecular Dynamics Simulation
  • Porous Media
  • A Multigrid Solver in SPIRAL

In Table 1, we compare the performance of the automatically generated multigrid solver and the legacy solver that was written manually in C. The result of the project was an embedded-DSL architecture based on runtime metaprogramming.

Fig. 10 Weak scaling for generated multigrid solvers using V(3, 3)-cycles to solve for Poisson’s equation in 3D on JUQUEEN [47, 48]

7 The Legacy of ExaStencils

7.1 Effort

Outreach

An ExaStencils-generated application solves the same problem as the EXA-DUNE application, but uses a different set of methods than the DUNE framework [9], with the aim of optimizing performance. Using this automated approach, we were even able to identify a bug and an inconsistency in the DUNE framework.

Potential

In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), pp.
In: Proceedings of the Workshop on Python for High-Performance and Scientific Computing (PyHPC), pp.
(eds.) Euro-Par 2014: Parallel Processing Workshops, Part II.

ExtraPeak: Advanced Automatic Performance Modeling for HPC Applications

This article summarizes the results of the Catwalk and ExtraPeak projects, which aim to improve this situation. The main goal of Catwalk, the first of the two projects, was to make performance modeling more attractive to a broader audience of HPC application developers by providing a method to create informative performance models as automatically as possible, in a simple and intuitive way.

2 Overview of Contributions

Catwalk

… as well as some state-of-the-art MPI [37] and OpenMP [24] implementations. The results include a scalability testing framework that combines performance modeling with expectations to systematically validate the scalability of libraries [37], a tool that automatically instruments applications to generate performance models at runtime [4], and a fully static method to derive loop iteration counts from loops with affine loop guards, using transfer functions to constrain the model search space [2].
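To make the loop-guard idea concrete, here is a toy version of the closed-form trip count such a static analysis would produce (positive steps only, purely illustrative):

    def affine_trip_count(lower, upper, step):
        """Iteration count of `for (i = lower; i < upper; i += step)` with an
        affine guard and a positive step: max(0, ceil((upper - lower) / step))."""
        if step <= 0:
            raise ValueError("only positive steps handled in this sketch")
        return max(0, -(-(upper - lower) // step))      # ceiling division

    # e.g. for (i = 3; i < 100; i += 8) executes 13 iterations
    print(affine_trip_count(3, 100, 8))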

ExtraPeak

We helped validate the complexity of Relearn, a code that simulates structural plasticity in the brain. In the rest of the article, we further describe the extensions of the base method as well as the advanced methods that build on it.

3 Extra-P

The formula represents a previously chosen metric as a function of the number of processes, and allows other parameters to be represented as well. After a 90-minute theoretical explanation of the method and tool, users were able to model the performance of two example applications, SWEEP3D and BLAST, in a 90-minute hands-on session.
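For reference, the performance model normal form used by Extra-P is commonly written as a sum of power and logarithm terms in the process count p, as in the published Catwalk work; the number of terms n and the admissible exponents i_k, j_k span the search space discussed below:

    f(p) = \sum_{k=1}^{n} c_k \, p^{i_k} \, \log_2^{j_k}(p)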

4 Developments of the Base Methods

Multi-Parameter Modeling

The size of the search space for this approach is as follows, given m parameters and an n-term model for each of them. The number of subsets of a set of n elements is 2^n, so the total size of the search space is (2^n)^m = 2^(n·m).
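As an illustrative instance of this growth, with numbers chosen only for illustration: n = 5 candidate terms per parameter and m = 2 parameters already yield (2^5)^2 = 2^10 = 1024 candidate models, and m = 3 raises this to 2^15 = 32768, which is why an exhaustive search quickly becomes impractical.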

Table 1 displays the number of times the modeling identified the entire function correctly and the times only the order term was identified correctly

Segmented Modeling

In all three cases, the model generation for an entire application took only seconds and was at least a hundred times faster than the full search. For three or more parameters, the size of the search space would have completely prevented such a comparison, which also means that the full search offers no viable alternative beyond two parameters.

Iterative Search Space Refinement

We developed an algorithm that helps Extra-P detect segmentation in the data before the models are created. The results of this evaluation show that the proposed algorithm can be used as an efficient way to find segmentation in performance data when generating empirical performance models.

5 Developments of the Advanced Methods

Lightweight Requirements Engineering

As the basis of our approach, we define a very simple notion of requirements that supports their quantification in terms of the amount of data to be stored, processed or transmitted by an application. The requirements for LULESH are listed at the top of the table as part of Step I.

Configuring and Understanding the Performance of Task-Based Applications

The last column shows the input size, derived from our models by letting the efficiency be 0.8 and the number of cores 60. Together, these terms reflect the interplay between the number of cores and the input size, and the effect this has on efficiency.
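A sketch of how such a model can be inverted for the input size; the efficiency model below is invented purely for illustration, whereas the project derives the real models empirically:

    def efficiency(cores, size, a=0.05, b=1.0):
        """Hypothetical parallel-efficiency model E(cores, size): efficiency
        drops with the core count and recovers with larger inputs."""
        return 1.0 / (1.0 + a * cores / (b + size))

    def required_input_size(cores, target, lo=1e-6, hi=1e9):
        """Bisection on the monotone model: smallest size reaching the target
        efficiency for the given core count (e.g. 0.8 at 60 cores)."""
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if efficiency(cores, mid) < target:
                lo = mid
            else:
                hi = mid
        return hi

    print(required_input_size(cores=60, target=0.8))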

6 Ongoing Work

Reducing the Cost of Measurements with Sparse Modeling

Therefore, the design of the experiment determines the quality of the model as well as the overall cost of the modeling process. Rather than requiring all combinations of all values for each parameter, it might be sufficient to select a sequence of five measurements for each parameter to create single-parameter models, but careful analysis is required to ensure that reducing the number and cost of measurements does not compromise the quality of the results.
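A small sketch contrasting the two experiment designs; the parameter values are arbitrary, and whether the sparse design preserves model quality is exactly the open question raised above:

    from itertools import product

    def full_factorial(values_per_param):
        """All combinations of all parameter values (the expensive default)."""
        return list(product(*values_per_param))

    def one_at_a_time(values_per_param):
        """Sparse alternative: vary one parameter at a time around a base point
        instead of taking the full cross product."""
        base = tuple(v[0] for v in values_per_param)
        points = {base}
        for i, values in enumerate(values_per_param):
            for v in values:
                p = list(base)
                p[i] = v
                points.add(tuple(p))
        return sorted(points)

    params = [[1, 2, 4, 8, 16], [32, 64, 128, 256, 512]]   # e.g. processes, problem size
    print(len(full_factorial(params)), "runs for the full design")
    print(len(one_at_a_time(params)), "runs for the one-at-a-time design")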

Taint Analysis for Many-Parameter Modeling

Filled circles represent a subset that is likely to be sufficient to establish a pattern of performance. We are currently in the process of analyzing the limits of this approach and trying to determine if there are any conditions required for it to be successful, or if it is a valid solution for most applications.

7 Related Work

This approach requires the user to define either a linear or power law expectation for the performance model, as opposed to the greater freedom afforded by the normal form of the performance model as defined in our approach. Marin and Mellor-Crummey use semi-automatically derived performance models to predict performance on different architectures [30].

8 Conclusion

In: Proceedings of the 1st International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), pp.
In: Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ASE ’15, pp.

FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing

In the high-performance computing (HPC) community, therefore, the operating system is sometimes considered "in the way". In the following sections, we summarize the general architecture of the FFMK OS platform and provide an overview of its higher-level services.

Fig. 1 FFMK software architecture: compute processes with performance-critical parts of the (MPI) runtime and the communication driver execute directly on the L4 microkernel; functionality that is not critical for performance is offloaded to the L4Linux kernel, whic

2 Building Blocks for a Scalable HPC Operating System

2.1 The Case for a Multi-Kernel Operating System

  • L4Re Microkernel and L4Linux
  • Decoupled Execution on L4Linux
  • Hardware Performance Variation
  • Scalable Diffusion-Based Load Balancing
  • Beyond L4: Improving Scalability with M³ and SemperOS

I/O Device Support

An important feature of the L4Re microkernel is that it maps hardware interrupts to IPC messages. However, faithful virtualization is not the only way to run a legacy operating system on top of the L4Re microkernel.

Fig. 3 Schematic view of the decoupling mechanism. The L4Re microkernel runs on every core of the system, while the virtualized L4Linux runs on a subset of those cores only

3 Algorithms and System Support for Fault Tolerance

Resilience in MPI Collective Operations

Figures

Fig. 5 The recombination step in detail [13]. After the hierarchization phase (top left), the sparse grid is globally reduced between process groups (right) and then dehierarchized within each process group (bottom left)
Fig. 12 Timings for the individual steps of the distributed recombination step for different sizes of the process groups
Fig. 13 Timings for a combination scheme computed with the nonlinear GENE functionality: ITG with adiabatic electrons, for ℓ_min = (7, 1, 5, 6, 4) to ℓ_max = (10, 1, 7, 8, 6), averaged over ten combination time steps on Hazel Hen
