
For the proposed implementation, the PCG algorithm (algorithm 8) is modified to work only with the nodes of the mesh that are active at that particular iteration. After the list of active nodes is prepared as discussed in section 4.4, the PCG algorithm starts as shown in algorithm 8.

Algorithm 10: Computing active nodes

Input: n: total number of nodes; e: total number of elements; D[e]: array containing the density of all the elements
Output: [aN]: array containing the active nodes; new_count: number of active nodes

1 thrust::device_vector<int> t_aN(n);
2 aN ← thrust::raw_pointer_cast(t_aN);
3 thrust::sequence(t_aN.begin(), t_aN.end(), 0);
4 aN ← aN_kernel(aN);
5 new_count = thrust::count_if(t_aN.begin(), t_aN.end(), is_positive());
6 thrust::remove_if(t_aN.begin(), t_aN.end(), is_negative());
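For illustration, algorithm 10 maps almost directly onto Thrust calls. The following is a minimal sketch of this step rather than the actual implementation: the node-to-element adjacency array nodeElems (four entries per node, padded with -1), the density threshold dmin, and the body of aN_kernel are assumptions introduced here, since the listing only names the kernel and the predicates.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/count.h>
#include <thrust/remove.h>

// Predicates used by thrust::count_if and thrust::remove_if. Here "positive"
// is taken to mean "not marked", i.e. a non-negative node id.
struct is_positive { __host__ __device__ bool operator()(int x) const { return x >= 0; } };
struct is_negative { __host__ __device__ bool operator()(int x) const { return x <  0; } };

// Hypothetical marking kernel: one thread per node; a node stays active if any
// of its surrounding elements has a density above the threshold dmin.
__global__ void aN_kernel(int* aN, const float* density, const int* nodeElems, int n, float dmin)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= n) return;
    bool active = false;
    for (int k = 0; k < 4; ++k) {              // up to 4 neighbouring elements in a 2D quad mesh
        int e = nodeElems[4 * node + k];       // -1 pads missing neighbours
        if (e >= 0 && density[e] > dmin) { active = true; break; }
    }
    if (!active) aN[node] = -1;                // mark inactive nodes with a negative sentinel
}

// Host-side driver corresponding to the six lines of algorithm 10.
int compute_active_nodes(thrust::device_vector<int>& t_aN, const float* d_density,
                         const int* d_nodeElems, int n, float dmin)
{
    int* aN = thrust::raw_pointer_cast(t_aN.data());
    thrust::sequence(t_aN.begin(), t_aN.end(), 0);                          // aN[i] = i
    aN_kernel<<<(n + 255) / 256, 256>>>(aN, d_density, d_nodeElems, n, dmin);
    int new_count = thrust::count_if(t_aN.begin(), t_aN.end(), is_positive());
    thrust::remove_if(t_aN.begin(), t_aN.end(), is_negative());             // compact active ids to the front
    return new_count;
}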

Figure 4.4: Calculation of active nodes of the example mesh. (The figure shows the t_aN thrust device vector, accessed through the raw pointer aN, in which aN_kernel() marks inactive nodes with negative values before thrust::remove_if compacts the remaining active node ids.)

The key difference is that, instead of declaring the arrays r in line 2, x in line 2, d in line 3, and q in line 7 with a size of (n × ndof), they are now declared with a reduced size of (size[aN] × ndof). Here, n is the total number of nodes in the mesh and ndof is the number of DOFs per node. The SpMV operations are performed using a custom GPU kernel, whereas the IP and SAXPY operations are performed using thrust on the reduced-size arrays. These reduced-size arrays also include the displacement array resulting from the FEA. Since the other parts of the SIMP algorithm are designed to work with the full-sized displacement array, a thrust scatter operation is performed after each FEA to expand the array from (size[aN] × ndof) to (n × ndof). Here, the aN array is used as the map required by the thrust scatter operation. The details of the SpMV kernel and the thrust operations are given in the following sections.
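As a small illustration, the reduced declarations amount to sizing the PCG work vectors by the number of active nodes instead of n. The sketch below assumes ndof = 2 for the 2D case; the variable names follow algorithm 8 only loosely.

#include <thrust/device_vector.h>

// Minimal sketch: the PCG work vectors are allocated with the reduced size
// (size[aN] * ndof) rather than (n * ndof).
void allocate_reduced_pcg_arrays(int count_active)
{
    const int ndof = 2;                                  // DOFs per node in the 2D case
    const int reduced = count_active * ndof;             // size[aN] * ndof
    thrust::device_vector<float> r(reduced);             // residual
    thrust::device_vector<float> x(reduced);             // reduced displacement vector
    thrust::device_vector<float> d(reduced);             // search direction
    thrust::device_vector<float> q(reduced);             // q = K * d from the matrix-free SpMV
    // ... the PCG iterations operate only on these reduced vectors
}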

4.5.1 SpMV kernel

The matrix-free SpMV kernel with the reduced-mesh strategy is shown in algorithm 11. The array of active nodes aN (as discussed in section 4.4) is computed and sent to the GPU before the launch of the kernel, as shown in line 1. A node-by-node implementation is used for a uniform comparison with the standard implementation. At the start of the kernel, the elemental stiffness matrix is loaded into shared memory, as shown in line 3. This is followed by a barrier synchronization in line 4 to ensure that the data is copied entirely before any computation is performed.

Algorithm 11: Proposed matrix-free SpMV kernel for CG

Input: [v]: vector to be multiplied; [Ke]: elemental stiffness matrix; [aN]: array of active nodes
Output: [result]: result of the SpMV operation

1  Compute [aN] using algorithm 10;
2  tx = blockIdx.x × blockDim.x + threadIdx.x;
3  Load elemental stiffness in shared memory;
4  __syncthreads();
5  if tx < count_active then
6      eId = aN[tx];
7      for all neighboring elements of tx do
8          e = elementNumber;
9          l = localPosition;
10         d = density[e];
11         for j ← 0 : 4 do
12             p = reverse_aN[connect[4 × e + j]];
13             temp += v[2p] × (sh_Ke[2l][2j] × powf(d, penal));
14             temp1 += v[2p] × (sh_Ke[2l+1][2j] × powf(d, penal));
15             temp += v[2p+1] × (sh_Ke[2l][2j+1] × powf(d, penal));
16             temp1 += v[2p+1] × (sh_Ke[2l+1][2j+1] × powf(d, penal));
17         end
18     end
19     result[2 × tx] = temp;
20     result[2 × tx + 1] = temp1;
21 end

This kernel is launched with a number of threads equal to the number of active nodes in the mesh (count_active) in that iteration of the SIMP method, as shown in line 5. The global node number eId assigned to the thread is read from the aN array using the thread id tx in line 6. Following this, a loop iterates through all the neighboring elements of the assigned node in line 7.

The global element number, the local position of the node in that element, and the elemental density are loaded in lines 8, 9, and 10. Following this, another loop iterates through all the nodes of a neighboring element in line 11. Inside this loop, the node-level multiplications and additions are performed in lines 13 to 16. Finally, the results are stored in global memory in the result array in lines 19 and 20.
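A CUDA sketch of algorithm 11 is given below, purely for illustration. The fixed-width node-to-element adjacency arrays nodeElems and nodeLocal (four entries per node, padded with -1), the 8 × 8 elemental stiffness matrix of a 2D bilinear quadrilateral, and the requirement that blockDim.x ≥ 64 are assumptions made here and are not prescribed by the listing itself.

__global__ void spmv_active(const float* __restrict__ v,          // reduced input vector (count_active * 2)
                            float*       __restrict__ result,     // reduced output vector (count_active * 2)
                            const float* __restrict__ Ke,         // 8 x 8 elemental stiffness matrix (row major)
                            const int*   __restrict__ aN,         // active node ids (size count_active)
                            const int*   __restrict__ reverse_aN, // full-to-reduced node map (size n)
                            const int*   __restrict__ connect,    // element connectivity, 4 nodes per element
                            const int*   __restrict__ nodeElems,  // up to 4 elements surrounding each node
                            const int*   __restrict__ nodeLocal,  // local position of the node in each element
                            const float* __restrict__ density,
                            float penal, int count_active)
{
    __shared__ float sh_Ke[8][8];
    int tx = blockIdx.x * blockDim.x + threadIdx.x;

    // Line 3: cooperative load of Ke into shared memory (assumes blockDim.x >= 64).
    if (threadIdx.x < 64) sh_Ke[threadIdx.x / 8][threadIdx.x % 8] = Ke[threadIdx.x];
    __syncthreads();                              // line 4: barrier synchronization

    if (tx >= count_active) return;               // line 5: one thread per active node

    int nId = aN[tx];                             // line 6: global id of the assigned active node
    float temp = 0.0f, temp1 = 0.0f;              // accumulators for the two DOFs of this node

    for (int k = 0; k < 4; ++k) {                 // line 7: all neighbouring elements of the node
        int e = nodeElems[4 * nId + k];           // line 8: global element number
        if (e < 0) continue;                      // -1 pads missing neighbours
        int l = nodeLocal[4 * nId + k];           // line 9: local position of the node in element e
        float w = powf(density[e], penal);        // line 10: SIMP-penalized elemental density

        for (int j = 0; j < 4; ++j) {               // line 11: all nodes of the neighbouring element
            int p = reverse_aN[connect[4 * e + j]]; // line 12: position in the reduced vectors
            temp  += v[2 * p]     * sh_Ke[2 * l][2 * j]         * w;   // lines 13-16
            temp1 += v[2 * p]     * sh_Ke[2 * l + 1][2 * j]     * w;
            temp  += v[2 * p + 1] * sh_Ke[2 * l][2 * j + 1]     * w;
            temp1 += v[2 * p + 1] * sh_Ke[2 * l + 1][2 * j + 1] * w;
        }
    }
    result[2 * tx]     = temp;                    // lines 19-20: write the two DOFs of the node
    result[2 * tx + 1] = temp1;
}

The kernel would be launched with enough blocks to cover count_active threads, e.g. spmv_active<<<(count_active + 255) / 256, 256>>>(...), so that the thread count matches the number of active nodes as described above.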

It is noted that all the arrays in the CG algorithm are shortened according to count_active.

This includes the array v[] with which the matrix is multiplied. This creates out-of-bound memory accesses in lines 13 to 16 of algorithm 11. The reason is that, in line 12, the node number p read from the connectivity matrix can be anything from 0 to n. However, for accessing the shortened v[] vector, the maximum permissible value of p is count_active instead of n. To remedy this problem, we introduce another map array called reverse_aN that is computed at the start of the CG iterations. It has the size of the maximum number of nodes in the mesh (n). It is calculated from the aN array in the following manner.

reverse_aN[aN[i]] = i,    for i ← 1 : n        (4.1)

It can be seen that, by the design of the proposed implementation, only small changes need to be made to the standard GPU kernel to utilize the mesh reduction strategy. In the case of an element-by-element or DOF-by-DOF matrix-free implementation, one would need to construct the list of active elements or active DOFs in the same manner as the nodes. However, due to the node-by-node implementation, the well-known problem of race conditions is not observed, since no output locations are shared between threads. With other thread allocation strategies, this issue would need to be handled.
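Since equation 4.1 simply scatters the index sequence through the aN map, reverse_aN can be built with a single Thrust call. The sketch below is only an illustration and assumes that reverse_aN has already been allocated with size n; the function name is hypothetical.

#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/iterator/counting_iterator.h>

// Implements reverse_aN[aN[i]] = i for all active nodes i (equation 4.1).
void build_reverse_aN(const thrust::device_vector<int>& t_aN,        // active node ids (size count_active)
                      thrust::device_vector<int>&       reverse_aN,  // full-size map (size n)
                      int count_active)
{
    thrust::scatter(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(count_active),    // values 0, 1, ..., count_active - 1
                    t_aN.begin(),                                    // map: value i is written to position aN[i]
                    reverse_aN.begin());
}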

4.5.2 Thrust operations

Apart from the SpMV operation, all the vector operations in the PCG are performed using thrust. In algorithm 8, the lines (2, 4, 8, 9, 10, 13, 15) where SAXPY and IP are mentioned are the ones signifying vector operations on the GPU using thrust. In the standard implementation, these operations are performed on arrays of size (n × 2). In the proposed implementation, this size is reduced to (count_active × 2). This strategy imparts two key benefits to the matrix-free PCG. Firstly, it reduces the memory that thrust must allocate for these arrays compared to the standard implementation. Secondly, the number of floating-point operations is also significantly reduced. However, one key issue after shortening all the arrays is that the positions of the boundary nodes get scrambled, and undesirable search operations would be required to find the locations of those nodes. This is handled by renumbering the mesh so that the boundary nodes are always stored starting from node number zero, as discussed in section 4.3.
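For reference, the SAXPY and IP steps on the reduced arrays reduce to standard Thrust calls. The sketch below is a minimal illustration; the functor and function names are chosen here and are not taken from the implementation.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/inner_product.h>

struct saxpy_functor {
    float a;
    explicit saxpy_functor(float a_) : a(a_) {}
    __host__ __device__ float operator()(float x, float y) const { return a * x + y; }
};

// y <- alpha * x + y on the reduced vectors of length (count_active * 2)
void saxpy(float alpha, const thrust::device_vector<float>& x, thrust::device_vector<float>& y)
{
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(alpha));
}

// Inner product (IP) of two reduced vectors, e.g. for the residual norm in PCG
float inner_product(const thrust::device_vector<float>& x, const thrust::device_vector<float>& y)
{
    return thrust::inner_product(x.begin(), x.end(), y.begin(), 0.0f);
}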

The other nodes are numbered only after all the boundary nodes have been numbered. At the end, the shortened displacement array needs to be expanded into the full displacement array so that it can be used by the topology optimization operators. This is done by a thrust::scatter operation at the end of each FEA, with the aN array of active nodes serving as the map, as shown in the sketch below.
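A minimal sketch of this expansion step follows. Because thrust::scatter works entry by entry, the sketch assumes a DOF-level map derived from aN (two consecutive entries per active node in the 2D case, i.e. dof_map[2i] = 2·aN[i] and dof_map[2i+1] = 2·aN[i]+1); the names u_reduced, u_full, and dof_map are illustrative.

#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/fill.h>

// Expand the reduced displacement vector to the full (n * 2) vector used by
// the rest of the SIMP algorithm.
void expand_displacement(const thrust::device_vector<float>& u_reduced, // size count_active * 2
                         const thrust::device_vector<int>&   dof_map,   // DOF-level map built from aN
                         thrust::device_vector<float>&       u_full)    // size n * 2
{
    thrust::fill(u_full.begin(), u_full.end(), 0.0f);     // inactive DOFs carry zero displacement
    thrust::scatter(u_reduced.begin(), u_reduced.end(),   // values from the reduced FEA solution
                    dof_map.begin(),                      // where each reduced entry goes in the full array
                    u_full.begin());
}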