Various numerical results presented in this thesis illustrate the character of the proposed IFGF-accelerated parallel acoustic solver. An overview of previous work in this area and the relevance of the IFGF method is therefore given in Section 1.2.
Integral equations
For readability, the notation used in this thesis does not explicitly indicate the dependence of the Green function 𝐺 on the wavenumber 𝜅. Combined-layer formulations, i.e., linear combinations of the single-layer (1.6) and double-layer (1.7) operators, can also be used, as shown in Section 5.5.
Previous work and contribution
Each box in the octree B is thus equipped with a set of box-centered spherical cone segments at a corresponding level of the cone hierarchy C. These two selection criteria are in fact related, since the polynomial interpolability used in the IFGF approach has direct implications for the rank of the interpolated values.
Content and layout of this thesis
PRELIMINARIES
GMRES
The use of the GMRES algorithm and other Krylov-subspace iterative solvers for linear systems in the context of the integral-equation problems under consideration motivates the fundamental problem studied in this thesis, namely the accelerated evaluation of discrete integral operators. The details of the GMRES algorithm, which do not affect the innovations presented in this thesis, are not discussed here.
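To make the role of the operator evaluation concrete, the following minimal C++ sketch, with hypothetical names (DiscreteIntegralOperator, apply) that are not part of the thesis code, shows the matrix-free interface an iterative solver such as GMRES relies on: only the operator-vector product is required, and it is precisely this product, computed naively at O(𝑁²) cost below, that the IFGF method accelerates to O(𝑁 log 𝑁).

#include <array>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;
using Point = std::array<double, 3>;

static double distance(const Point& a, const Point& b) {
    const double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Hypothetical matrix-free operator: GMRES only requires the action
// v -> A v of the discrete integral operator, never the dense matrix A.
// The naive double loop below costs O(N^2); the IFGF method replaces it
// by an O(N log N) evaluation without changing this interface.
struct DiscreteIntegralOperator {
    std::vector<Point> points;    // surface discretization points
    std::vector<double> weights;  // quadrature weights
    double kappa;                 // wavenumber

    std::vector<Complex> apply(const std::vector<Complex>& density) const {
        const double pi = 3.14159265358979323846;
        const std::size_t n = points.size();
        std::vector<Complex> result(n, Complex(0.0, 0.0));
        for (std::size_t i = 0; i < n; ++i) {      // target points
            for (std::size_t j = 0; j < n; ++j) {  // source points
                if (i == j) continue;              // singular self-term handled separately
                const double r = distance(points[i], points[j]);
                // free-space Helmholtz Green function exp(i*kappa*r) / (4*pi*r)
                const Complex g = std::exp(Complex(0.0, kappa * r)) / (4.0 * pi * r);
                result[i] += weights[j] * g * density[j];
            }
        }
        return result;
    }
};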
Chebyshev interpolation
The direct calculation of (2.7) and (2.6) is expensive owing to the evaluation of the triple sums involved. In the current implementation of the IFGF method, (2.6) is nevertheless evaluated naively, in view of the non-uniformity of the targets and the small expansion sizes.
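The kind of naive triple-sum evaluation referred to here is illustrated by the following generic C++ sketch of a three-variable Chebyshev interpolant on [−1, 1]³ (the coefficient layout and function names are assumptions, not the thesis implementation); for the small values of 𝑃 used in practice, the O(𝑃³) cost per target point remains modest.

#include <cmath>
#include <cstddef>
#include <vector>

// Evaluate the Chebyshev interpolant
//   I(x, y, z) = sum_{i,j,k} c[i][j][k] T_i(x) T_j(y) T_k(z)
// on [-1, 1]^3 by direct (naive) summation of the triple sum.
double chebyshev3d(const std::vector<std::vector<std::vector<double>>>& c,
                   double x, double y, double z) {
    // Chebyshev polynomial of the first kind, T_n(t) = cos(n * acos(t))
    auto T = [](std::size_t n, double t) {
        return std::cos(static_cast<double>(n) * std::acos(t));
    };
    const std::size_t P = c.size();  // number of interpolation points per dimension
    double sum = 0.0;
    for (std::size_t i = 0; i < P; ++i)
        for (std::size_t j = 0; j < P; ++j)
            for (std::size_t k = 0; k < P; ++k)
                sum += c[i][j][k] * T(i, x) * T(j, y) * T(k, z);
    return sum;
}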
HPC basics
Modern computing nodes generally follow a non-uniform memory access (NUMA) design (as opposed to uniform memory access (UMA)), in which the access times to the shared memory depend on the location of the memory relative to the accessing core of the multi-core processor. Based on the functions and synchronization capabilities provided by MPI, a program can be launched as a set of multiple processes, identified in what follows by their corresponding integer-valued rank within the group of all processes launched for a given program.
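A minimal example of this process model, using only standard MPI calls (not thesis-specific code):

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // integer rank identifying this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of launched processes

    std::printf("rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}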
THE INTERPOLATED FACTORED GREEN FUNCTION METHOD
Factorization of the Green function
Consider a box 𝐵(𝑥𝑆, 𝐻) of side 𝐻 centered at a point 𝑥𝑆. Then, letting 𝐼𝑆(𝑥) denote the field generated at a point 𝑥 by all point sources contained in 𝐵(𝑥𝑆, 𝐻), we consider, in particular, the local operator evaluation problem.
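For reference, a sketch of the corresponding factorization, written here for the free-space Helmholtz Green function 𝐺(𝑥, 𝑦) = 𝑒^{i𝜅|𝑥−𝑦|}/(4π|𝑥−𝑦|); the notation is intended to be consistent with (3.5) but may differ from the thesis in minor details:

\[
  I_S(x) \;=\; \sum_{m} a_m\, G(x, y_m)
         \;=\; G(x, x_S) \sum_{m} a_m\, g_S(x, y_m),
  \qquad
  g_S(x, y) \;=\; \frac{G(x, y)}{G(x, x_S)}
            \;=\; \frac{|x - x_S|}{|x - y|}\,
                  e^{\, i \kappa \left( |x - y| - |x - x_S| \right)},
\]

where the points \(y_m \in B(x_S, H)\) and the coefficients \(a_m\) denote the sources and their strengths, \(G(x, x_S)\) is the centered factor, and \(g_S\) is the analytic factor.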
Analyticity
As indicated above, the analytic properties of the factor 𝑔𝑆 play a central role in the proposed algorithm. In addition to the factorization (3.5), the proposed strategy relies on the use of a singularity-resolving change of variables.
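A hedged sketch of that change of variables follows; the precise scaling length \(h\) used below is an assumption and may differ from the thesis' definition.

\[
  s \;=\; \frac{h}{r}, \qquad r \;=\; |x - x_S|,
\]

where \(h\) denotes a fixed length proportional to the box side \(H\). Expressed in the spherical coordinates \((s, \theta, \varphi)\) centered at \(x_S\), the analytic factor \(g_S\) remains smooth up to and including \(s = 0\) (i.e., as \(r \to \infty\)), which is what allows its accurate interpolation by low-degree Chebyshev polynomials in the radial variable \(s\).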
Interpolation procedure
A two-dimensional illustration of the cone domains and associated cone segments is provided in Figure 3.3. Thus, increasingly large real-space cone segments are used as the distance from the interpolation cone segments to the origin grows.
Box octree structure
Similarly, the definitions of the analytic factor and the centered factor, as in (3.7), are appropriately extended as follows in the context of the octree structure. The associated cousin-box concept is defined in terms of the hierarchical parent-child relationship in the octree B, building on the definition of the parent box P𝐵𝑑 of a level-𝑑 box 𝐵𝑑. The neighbor- and cousin-box concepts are illustrated in Figure 3.8 for a two-dimensional example, in which the cousins of the level-4 box 𝐵4 are shown.
It follows that all cousin boxes of a given level-𝑑 box are contained in the set of 6×6×6 level-𝑑 boxes that make up the 3×3×3 level-(𝑑−1) neighborhood of the parent box.
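As an illustration of this containment property, the following C++ sketch enumerates the cousins of a box from its integer level-𝑑 coordinates by listing the 2×2×2 children of the parent's 3×3×3 neighborhood and discarding the box's own neighbors; the helper is hypothetical (no domain-bound or relevance checks) and is not the thesis data structure.

#include <array>
#include <cstdint>
#include <vector>

// Integer (i, j, k) coordinates of a box within its level of the octree.
using BoxCoords = std::array<std::int64_t, 3>;

// Cousins of a level-d box: children of the parent's neighbors (and of the
// parent itself) that are not neighbors of the box. Halving coordinates
// moves to the parent level; doubling moves back to the child level.
std::vector<BoxCoords> cousins(const BoxCoords& box) {
    const BoxCoords parent = {box[0] / 2, box[1] / 2, box[2] / 2};
    std::vector<BoxCoords> result;
    for (std::int64_t di = -1; di <= 1; ++di)          // 3 x 3 x 3 parent-level
      for (std::int64_t dj = -1; dj <= 1; ++dj)        // neighborhood ...
        for (std::int64_t dk = -1; dk <= 1; ++dk) {
          const BoxCoords p = {parent[0] + di, parent[1] + dj, parent[2] + dk};
          for (std::int64_t ci = 0; ci <= 1; ++ci)     // ... and its 2 x 2 x 2
            for (std::int64_t cj = 0; cj <= 1; ++cj)   // children (6 x 6 x 6
              for (std::int64_t ck = 0; ck <= 1; ++ck) { // level-d boxes in all)
                const BoxCoords c = {2 * p[0] + ci, 2 * p[1] + cj, 2 * p[2] + ck};
                const std::int64_t ex = c[0] - box[0];
                const std::int64_t ey = c[1] - box[1];
                const std::int64_t ez = c[2] - box[2];
                // exclude the box itself and its own 3 x 3 x 3 neighbors
                const bool neighbor = ex >= -1 && ex <= 1 &&
                                      ey >= -1 && ey <= 1 &&
                                      ez >= -1 && ez <= 1;
                if (!neighbor) result.push_back(c);
              }
        }
    return result;  // at most 6*6*6 - 3*3*3 = 189 cousins
}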
Cone segments
Figure 3.3 shows a two-dimensional illustration of the interpolation domains and associated cone segments. Finally, the set of all relevant cone segments, R𝐶, is taken as the union of the level-𝑑 relevant cone segments over all levels in the box octree structure. The level-𝑑 phase of the algorithm ((𝐷−1) ≥ 𝑑 ≥ 3) then proceeds by using the level-𝑑 spherical-coordinate interpolants 𝐼𝑃𝐶𝑑 computed previously, during the level-(𝑑+1) phase.
In particular, all interpolation points within relevant cone segments at level 𝑑 are also targets of the interpolation performed at level (𝑑+1).
Complexity analysis
The complexity of the IFGF algorithm equals the number of arithmetic operations performed in Algorithm 2. To evaluate this complexity, we first consider the cost of the level-𝐷-specific evaluations performed in the “for” loop starting on line 2. We next consider the portion of the algorithm within the loop starting on line 10, which is repeated O(log 𝑁) times (since 𝐷 ∼ log 𝑁).
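A back-of-the-envelope count consistent with these observations (a heuristic sketch, not the thesis' detailed operation count):

\[
  \text{cost} \;=\; \underbrace{O(N)}_{\text{level-}D\ \text{evaluations}}
  \;+\; \sum_{d=3}^{D-1} \underbrace{O(N)}_{\text{level-}d\ \text{phase}}
  \;=\; O(N \log N),
\]

since each of the O(log 𝑁) levels (𝐷 ∼ log 𝑁) performs a bounded amount of work per surface point and per relevant cone-segment interpolation point.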
In the specific Laplace case 𝜅 = 0, the cost of the algorithm is still O(𝑁 log 𝑁) operations, given the O(𝑁 log 𝑁) cost required for the interpolation to surface points.
MASSIVELY PARALLEL IFGF METHOD
OpenMP parallelization
Our approach to efficiently parallelizing the LevelDEvaluations function is based on a change of viewpoint, away from iterating over the relevant boxes at level 𝐷. In contrast to the serial implementation of the IFGF method presented in Section 3.6, the practical implementation of this parallel approach requires the algorithm to first determine the relevant set R𝐵𝐶𝑑. Using the notation of Note 12, the resulting parallel propagation algorithm is presented in Algorithm 8.
This approach avoids both the difficulties mentioned at the beginning of Section 4.1 (concerning the existence of only a small number of relevant boxes at the upper levels of the octree structure) and thread-safety problems, in the context of the propagation function, similar to those discussed above.
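One possible realization of such a thread-safe strategy is sketched below in C++/OpenMP (an assumption-laden illustration, not the thesis code: the type ConeSegmentData, the association of interpolation points with cone segments, and the callback child_interpolants are all hypothetical). The parallel loop runs over cone segments rather than boxes, so each thread writes only to the data of the segments it processes.

#include <array>
#include <complex>
#include <cstddef>
#include <functional>
#include <vector>

using Complex = std::complex<double>;
using Point = std::array<double, 3>;

// Hypothetical per-cone-segment data: the interpolation points associated
// with a segment and the values accumulated at those points.
struct ConeSegmentData {
    std::vector<Point> interpolation_points;
    std::vector<Complex> values;
};

// Thread-safe propagation sketch: since each loop iteration writes only to
// its own ConeSegmentData, no two threads ever update the same output array.
void propagate(std::vector<ConeSegmentData>& segments,
               const std::function<Complex(const Point&)>& child_interpolants) {
    #pragma omp parallel for schedule(dynamic)
    for (std::ptrdiff_t s = 0; s < static_cast<std::ptrdiff_t>(segments.size()); ++s) {
        ConeSegmentData& seg = segments[s];
        seg.values.assign(seg.interpolation_points.size(), Complex(0.0, 0.0));
        for (std::size_t p = 0; p < seg.interpolation_points.size(); ++p)
            // evaluate the previously computed child-level interpolants at
            // this segment's interpolation point and accumulate locally
            seg.values[p] += child_interpolants(seg.interpolation_points[p]);
    }
}

Compiled with OpenMP support (e.g., -fopenmp), the dynamic schedule additionally helps balance uneven per-segment workloads.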
MPI parallelization
- Problem decomposition and data distribution
- Practical implementation of the box-cone data structures
- Data communication
The distribution of surface discretization points is orchestrated on the basis of the ordering of the set of relevant boxes R𝑑. The green numbers indicate the order of the cone segments in the proposed Morton-based cone-segment ordering. Accordingly, the smallest boxes in the octree structure represent the smallest "unit" for the distribution of surface discretization points.
The distribution of surface discretization points is used to divide as evenly as possible among all MPI ranks the work performed in the Interpolation function (OpenMP Algorithm 9).
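For reference, the following generic C++ routine computes the standard Morton (Z-order) code obtained by interleaving the bits of integer box coordinates; an ordering of this general kind, though not necessarily this exact routine, underlies the Morton-based orderings referred to above, and such keys keep spatially close boxes close in the resulting one-dimensional order used for distribution among MPI ranks.

#include <cstdint>

// Spread the low 21 bits of v so that they occupy every third bit position.
static std::uint64_t spreadBits(std::uint64_t v) {
    v &= 0x1fffff;  // keep 21 bits per coordinate (3 * 21 = 63 bits in total)
    v = (v | (v << 32)) & 0x1f00000000ffffULL;
    v = (v | (v << 16)) & 0x1f0000ff0000ffULL;
    v = (v | (v << 8))  & 0x100f00f00f00f00fULL;
    v = (v | (v << 4))  & 0x10c30c30c30c30c3ULL;
    v = (v | (v << 2))  & 0x1249249249249249ULL;
    return v;
}

// Morton (Z-order) code of integer box coordinates (x, y, z): sorting boxes
// by this key yields a space-filling-curve order in which nearby boxes tend
// to receive nearby positions, a convenient basis for load distribution.
std::uint64_t mortonCode(std::uint64_t x, std::uint64_t y, std::uint64_t z) {
    return spreadBits(x) | (spreadBits(y) << 1) | (spreadBits(z) << 2);
}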
Parallel linearithmic complexity analysis
In the case of the propagation communication, we note that the coefficients of every relevant cone segment, at any level 𝑑, need only be made available to the ranks that own the relevant cone segments of the corresponding parent box. In the case of the interpolation communication, the relevant cone-segment coefficients must ultimately be communicated to the ranks that store the surface discretization points contained in boxes that are cousins of the box co-centered with the relevant cone segment. First, at the lowest level 𝐷, each relevant box has at most 𝐾 = 189 cousin boxes, and, since by design the surface discretization points in each smallest box are stored on a single MPI rank (Section 4.2.1), it follows that only O(1) (at most 189) different MPI ranks require the coefficients contained in each relevant cone segment.
It follows that each relevant cone segment is communicated to an O(1) number of MPI ranks at all levels 𝑑, establishing the validity of point 3) for the interpolation communication function and completing the proof of the linearithmic complexity of the proposed parallel IFGF algorithm.
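A schematic of the kind of point-to-point exchange this entails is shown below, using standard MPI calls; the coefficient layout, destination list, and function name are hypothetical placeholders rather than the thesis communication code.

#include <complex>
#include <mpi.h>
#include <vector>

// Send the coefficients of one locally owned cone segment to the O(1) MPI
// ranks that require them (e.g., the ranks storing surface points contained
// in cousin boxes), using nonblocking point-to-point communication.
void sendConeSegmentCoefficients(const std::vector<std::complex<double>>& coeffs,
                                 const std::vector<int>& destination_ranks,
                                 int tag, MPI_Comm comm,
                                 std::vector<MPI_Request>& requests) {
    for (int dest : destination_ranks) {
        MPI_Request req;
        MPI_Isend(coeffs.data(), static_cast<int>(coeffs.size()),
                  MPI_CXX_DOUBLE_COMPLEX, dest, tag, comm, &req);
        requests.push_back(req);
    }
    // The caller eventually completes all pending sends, e.g. via
    // MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
    //             MPI_STATUSES_IGNORE);
}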
NUMERICAL EXAMPLES
Background for numerical examples
- Test geometries
- Compiler and hardware
- Hardware pinning
- Data points
- Numerical error estimation
- Weak and strong parallel efficiency concepts
In all tests presented below, and in accordance with the notation introduced in the previous sections, 𝑁 denotes the number of surface discretization points, 𝑑 the size of the geometry (cf. Section 5.1.1), 𝑁𝑟 the number of MPI ranks, and 𝑁𝑐 the total number of cores used. The numerical results shown in this section demonstrate the linearithmic scaling of both the serial and the parallel IFGF methods; they also reflect the O(𝑃²) scaling of the computation time, where 𝑃 denotes the number of interpolation points per cone segment.
The results show that the linearithmic algorithmic complexity and the memory requirements of the basic IFGF algorithm are maintained in the parallel setting.
Higher order results
Laplace equation
The precomputation times (not shown) are negligible in this case, since the most cost-intensive part of the precomputation algorithm, namely the determination of the relevant cone segments, carries a negligible cost in the present Laplace context. In accordance with the Laplace-case IFGF algorithmic prescription, a fixed number of cone segments per box is used across all levels of the hierarchical data structure.
Full solver and sample engineering problems
- Scattering boundary-value problem
- Integral representations and integral equations
- Surface representation
- Chebyshev-based rectangular-polar integral equation solver
- Integration algorithm for singular interactions
- Integration algorithm for non-singular interactions
- IFGF method for the combined-layer formulation
- Scattering by a sphere
- Scattering by a submarine geometry
- Scattering by an aircraft nacelle
The submarine hull is aligned with the 𝑧 axis and the sail is parallel to the +𝑦 axis; the front of the vessel points in the +𝑧 direction. The incident plane wave strikes the vessel head-on, and Figures 5.6(a) and 5.6(b) show that the strongest interaction occurs around the vessel's bow and diving planes (also known as hydroplanes). For a closer examination, Figures 5.9(b) and 5.9(c) show the field from above, but with the scattering surface removed.
The maximum magnitude of the far field increases by a factor of about 1.5 for the 81.8𝜆 case compared to the 40.9𝜆 case.
Strong parallel scaling
Figures 5.11(f) and 5.11(g), in which the geometry is not included, present the far field, with the positive 𝑧 direction pointing out of the page, for the 40.9𝜆 and 81.8𝜆 cases, respectively. In view of the requirements of the strong-scaling setup, test problems were chosen that could be executed within a reasonable time on a single core and within the memory available on the corresponding compute node. The tables clearly show that, in all cases, the IFGF parallel efficiency is essentially independent of the geometry type.
The behavior of the IFGF algorithm under weak- and strong-scaling hardware-doubling tests is discussed above in this section as well as in Section 4.3.
Weak parallel scaling
The number of nodes, each containing 𝑁𝑐 = 56 cores, is kept proportional to the number of surface discretization points, as required by the weak-scaling paradigm.
Large sphere tests
The sphere of acoustic size 1.389𝜆 in this table coincides with the largest sphere test case considered in [58]. The largest sphere discretization shown in the present table (its size being limited by the largest number representable by the 32-bit integers assumed in our geometry-generation code, a limitation that will be removed in later implementations of the code by switching to 64-bit integers) is slightly smaller than the discretization considered in [58], which was executed on 131,072 cores. The other test cases listed in Table 5.25 include an example for a much larger sphere, 2.048𝜆 in diameter, as well as additional 1.389𝜆 test cases for various accuracies and discretization sizes, in all cases with memory consumption ranging between ≈1.2 TB and ≈4 TB.
CONCLUDING REMARKS
Conclusions
Future work
One possible improvement concerns the use of an adaptive octree structure, which splits boxes in a manner more closely aligned with the position and number of surface discretization points, rather than using a fixed 𝐷-level box octree (whose uniform subdivision can lead to unfavorable numbers of discretization points per box, which negatively affects the overall performance of the algorithm). Clearly, this is not a new technique; it has been used in the context of other acceleration methods, as shown, for example, in [20]. Such a flexible octree structure may, however, result in lower parallel efficiency, and its viability in the context of the IFGF method is the subject of further investigation.
In particular, the use of GPUs to speed up the interpolation processes, which represent the most time-consuming part of the IFGF method, seems to be a very promising avenue of research.
BIBLIOGRAPHY
Is the pollution effect of the FEM avoidable for the Helmholtz equation considering high wave numbers? Dispersion and pollution of the FEM solution for the Helmholtz equation in one, two and three dimensions. A more scalable and efficient parallelization of the adaptive integral method - Part II: BIOEM application.
In: SC ’13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.
INDEX