
3.7 Results and Discussion

3.7.3 Closure

[Figure 3.17: bar charts of total matrix generation time (s) against the number of nodes, comparing the Separate Kernels and Same Kernel variants; panel (a) HEX20, panel (b) TET4.]

Figure 3.17: Matrix generation time for (a) HEX20 and (b) TET4 type elements based on the position of the numerical integration kernel for the connecting rod example.

Table 3.5: Element type, number of nodes, number of elements, runtime and memory used for storing the global stiffness matrix in GPU global memory for example 2. S1, S2 and S3 under the runtime column denote the mesh preprocessor, index computation and value computation stages respectively.

Element type   Nodes     Elements   Runtime (s)             Memory (MB)
                                    S1      S2      S3
HEX20          22374     4134       0.04    0.01    0.02    14.49
               36326     13958      0.07    0.01    0.06    23.54
               60186     11988      0.08    0.03    0.10    39.00
               76396     15483      0.12    0.04    0.13    49.50
               88872     18163      0.11    0.05    0.15    57.59
               113503    23547      0.14    0.06    0.20    74.54

second and third stages respectively. For these two stages, a generalized algorithm with different sparse formats was presented.

The proposed strategy was tested on two examples and was found to outperform the sharedNZ implementation of Cecka et al. (2011), showing a speedup of 4× to 6× for all element types over a wide range of mesh sizes. Higher order hexahedral elements achieved greater speedups than the lower order tetrahedral elements on a particular benchmark. The mesh preprocessor stage consumed a very small fraction of the total assembly time. The index computation kernel also accounted for a significantly smaller percentage of the total assembly time, especially for the lower order elements. It was further concluded that performing assembly and numerical integration in the same kernel performs worse than using separate kernels for the two processes, and this difference is more pronounced for lower order elements.

Although assembly-based methods in FEA are relatively simple to implement, they can become intractable for large problems, where, despite using the most optimized sparse storage schemes, the memory requirements and cost of data movement become prohibitively high. This motivates another class of solvers, called matrix-free or assembly-free methods, in which all FEA computations are carried out at the element level, obviating the need to explicitly store the entire global stiffness matrix at any point in the application. Matrix-free methods are discussed in the following chapters in the context of accelerating large-scale FEA for topology optimization problems.

Chapter 4

Mesh reduction strategy for structural topology optimization

In the second objective of the thesis, we aim to combine algorithm-level and HPC-based modifications in order to significantly reduce the computational cost of density-based topology optimization using the solid isotropic material with penalization (SIMP) method.

Two of the most successful algorithm-level modifications for GPU-based large-scale topology optimization are the use of matrix-free FEA methods and the reduction of design variables through a local update strategy for the optimization algorithm. In this work, these two algorithm-level modifications are combined with efficient GPU-based acceleration using a novel mesh reduction strategy that aims to reduce the computational complexity of conventional structural topology optimization. In the proposed strategy, the effective number of design variables is reduced by using the concept of active nodes and active elements in the finite element mesh. A node is considered to be active if it is a part of at least one element with non-zero density. An element, on the other hand, is considered to be active if it contains at least one active node. Nodes and elements that do not satisfy these conditions are considered inactive and can be expelled from the computation without significant effect on the final outcome. Introducing this concept into GPU-accelerated density-based topology optimization algorithms is not straightforward due to the parallel nature of the computation and the need to dynamically update the finite element mesh at every optimization iteration. Another important challenge is locating the boundary condition nodes after performing mesh reduction to remove the inactive nodes and elements on the GPU. This issue is addressed by introducing a novel mesh numbering scheme that facilitates parallel identification of active nodes using the proposed GPU-based algorithm. The preconditioned conjugate gradient (PCG) solver is further developed using the proposed strategy and the numbering scheme.
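As a minimal illustration of these two definitions (a sketch, not the thesis implementation), the flags could be computed on the GPU with two element-parallel passes, written here in CUDA. The connectivity array connect, the density array, the 4-node element layout and the void-density threshold densMin are assumptions made only for this sketch.

// Pass 1: every element with non-void density marks its nodes as active.
// nodeActive must be zeroed before launch; concurrent writes of the value 1
// to the same node are benign.
#include <cuda_runtime.h>

__global__ void markActiveNodes(int nElem, const int* connect,
                                const float* density, float densMin,
                                int* nodeActive)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nElem || density[e] <= densMin) return;
    for (int j = 0; j < 4; ++j)
        nodeActive[connect[4 * e + j]] = 1;   // node touches a non-void element -> active
}

// Pass 2: an element stays in the computation if any of its nodes is active.
__global__ void markActiveElements(int nElem, const int* connect,
                                   const int* nodeActive, int* elemActive)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nElem) return;
    int active = 0;
    for (int j = 0; j < 4; ++j)
        active |= nodeActive[connect[4 * e + j]];
    elemActive[e] = active;
}

The first pass lets every non-void element mark its nodes, and the second pass keeps any element that touches at least one such node, mirroring the two definitions given above.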

4.1 FEA and matrix-free PCG

In order to find the responses of a structure under the given loading conditions, FEA needs to be performed at each iteration of topology optimization. Regular grids are typically used for topology optimization in the literature (Martínez-Frutos et al., 2017, Sharma et al., 2011, 2014, Sharma and Deb, 2014, Ram and Sharma, 2017). A structured grid with elements of identical dimensions is used in the present work as well. This ensures that the elemental stiffness matrices of all elements are numerically identical and hence need to be calculated only once for the entire topology optimization process. This elemental stiffness calculation is done on the host, and the data is sent to the GPU at the beginning. A matrix-free approach is taken for solving the linear system of equations after applying the boundary conditions; this is an on-the-fly approach in which the final global stiffness matrix is never constructed in its entirety. The preconditioned conjugate gradient method is used as the solution technique.
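A small sketch of this one-time setup is given below, assuming a 2D Q4 discretization (4 nodes and 2 DOF per node, i.e. an 8×8 elemental stiffness matrix stored row-major in hKe). The host-side quadrature routine that fills hKe is not shown, and the function name is purely illustrative.

#include <cuda_runtime.h>

// Uploads the single elemental stiffness matrix to GPU global memory.
// hKe holds the 64 entries of the 8x8 Q4 stiffness matrix, computed once on the host.
float* uploadElementStiffness(const float* hKe)
{
    float* dKe = nullptr;
    cudaMalloc(&dKe, 64 * sizeof(float));
    cudaMemcpy(dKe, hKe, 64 * sizeof(float), cudaMemcpyHostToDevice);
    return dKe;   // the same device copy is reused by every SpMV launch in every optimization iteration
}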

The details of the matrix-free PCG with a Jacobi preconditioner (Shewchuk, 1994) are given in Algorithm 8.

Algorithm 8: Preconditioned conjugate gradient
Input:  F: global load vector; K: global stiffness matrix; M: preconditioner
Output: [x]: array of displacements

1   i ← 0;
2   r ← b − Ax;                          SpMV, SAXPY
3   d ← M⁻¹r;
4   δ_new ← rᵀd;                         IP
5   δ_0 ← δ_new;
6   while i ≤ i_max ∧ δ_new > ε²δ_0 do
7       q ← Ad;                          SpMV
8       α ← δ_new / (dᵀq);               IP
9       x ← x + αd;                      SAXPY
10      r ← r − αq;                      SAXPY
11      s ← M⁻¹r;
12      δ_old ← δ_new;
13      δ_new ← rᵀs;                     IP
14      β ← δ_new / δ_old;
15      d ← s + βd;                      SAXPY
16      i ← i + 1;
17  end

It can be seen from Algorithm 8 that all of its computations are built from a small number of linear algebraic operations. The following three types of operations are identified in the algorithm.

• Sparse matrix-vector multiplication (SpMV)

• Inner product (IP)

• αX plus Y (SAXPY)

From lines 1 to 5, exactly one SpMV, one IP and one SAXPY operation are performed. Within the loop (lines 6 to 17), one SpMV, two IPs and three SAXPYs are performed in each iteration.
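A minimal sketch of the two simpler building blocks is given below as CUDA kernels; the SpMV is the custom matrix-free kernel discussed next (Algorithm 9). The naive atomic accumulation used for the inner product is only for illustration; a tuned implementation would use a shared-memory or warp-shuffle reduction or a library routine instead.

#include <cuda_runtime.h>

// SAXPY: y <- y + a*x
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

// Inner product: accumulates x.y into *out (*out must be zeroed before launch).
__global__ void innerProduct(int n, const float* x, const float* y, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, x[i] * y[i]);
}

In each PCG iteration, the host therefore launches one matrix-free SpMV, two inner-product kernels and three SAXPY kernels of this kind.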

Among these operations, SpMV consumes the most time. The matrix-free SpMV, adapted from Markall et al. (2013), is shown in Algorithm 9. The GPU kernel is designed in a node-by-node manner for executing the SpMV, and this implementation is referred to as the standard implementation in the rest of the chapter. Inside the kernel, in line 1, a unique thread id is assigned to each node of the mesh. Following this, the elemental stiffness matrix is loaded into shared memory in line 2. Thereafter, a thread synchronization is performed in line 3 to ensure that the copy into shared memory is complete. For each node, a loop is launched (line 6) that iterates through all of its neighboring elements. Another nested loop at line 10 iterates through all four nodes of a particular element. In lines 12 - 15, the multiplication is performed at the nodal level for each neighboring element of the node assigned to the thread. The results are written to global memory in lines 18 and 19; a hedged CUDA sketch of this kernel is given after the listing. In the next section, the proposed mesh reduction strategy along with its implementation for matrix-free PCG is discussed.

Algorithm 9: Standard matrix-free SpMV kernel for CG
Input:  [v]: vector to be multiplied; [Ke]: elemental stiffness matrix; n: total nodes
Output: [result]: result of SpMV operation

1   tx = blockIdx.x × blockDim.x + threadIdx.x;
2   Load elemental stiffness matrix into shared memory;
3   __syncthreads();
4   if tx < n then
5       eId = tx;
6       for all neighboring elements of tx do
7           e = elementNumber;
8           l = localPosition;
9           d = density[e];
10          for j ← 0 : 4 do
11              p = connect[4e + j];
12              temp  += v[2p] × (sh_Ke[2l][2j] × pow(d, penal));
13              temp1 += v[2p] × (sh_Ke[2l + 1][2j] × pow(d, penal));
14              temp  += v[2p + 1] × (sh_Ke[2l][2j + 1] × pow(d, penal));
15              temp1 += v[2p + 1] × (sh_Ke[2l + 1][2j + 1] × pow(d, penal));
16          end
17      end
18      result[2 × tx] = temp;
19      result[2 × tx + 1] = temp1;
20  end
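For concreteness, a hedged CUDA rendering of Algorithm 9 is sketched below, assuming one thread per node, 2 DOF per node and 4-node elements (an 8×8 Ke in row-major order). The node-to-element adjacency arrays nodeElemPtr, nodeElemIdx and nodeLocalPos are assumptions about how the set of neighboring elements and the local position l are stored; the actual data layout may differ. At least 64 threads per block are assumed so that the shared copy of Ke is filled completely.

#include <cuda_runtime.h>

__global__ void matrixFreeSpMV(int n, const float* v, const float* Ke,
                               const int* connect, const float* density, float penal,
                               const int* nodeElemPtr, const int* nodeElemIdx,
                               const int* nodeLocalPos, float* result)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;         // line 1: one thread per node

    __shared__ float sh_Ke[8][8];                           // line 2: stage Ke in shared memory
    if (threadIdx.x < 64)
        sh_Ke[threadIdx.x / 8][threadIdx.x % 8] = Ke[threadIdx.x];
    __syncthreads();                                        // line 3

    if (tx >= n) return;                                    // line 4

    float temp = 0.0f, temp1 = 0.0f;
    for (int k = nodeElemPtr[tx]; k < nodeElemPtr[tx + 1]; ++k) {   // line 6: neighboring elements
        int   e = nodeElemIdx[k];                           // line 7
        int   l = nodeLocalPos[k];                          // line 8: local node number of tx in e
        float w = powf(density[e], penal);                  // line 9, with SIMP penalization
        for (int j = 0; j < 4; ++j) {                       // line 10
            int p = connect[4 * e + j];                     // line 11
            // lines 12-15: 2x2 block of Ke scaled by the penalized density
            temp  += w * (sh_Ke[2 * l    ][2 * j] * v[2 * p] + sh_Ke[2 * l    ][2 * j + 1] * v[2 * p + 1]);
            temp1 += w * (sh_Ke[2 * l + 1][2 * j] * v[2 * p] + sh_Ke[2 * l + 1][2 * j + 1] * v[2 * p + 1]);
        }
    }
    result[2 * tx]     = temp;                              // line 18
    result[2 * tx + 1] = temp1;                             // line 19
}

Because each thread owns the two DOFs of its node, the writes to result are disjoint and no atomic operations are required.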