
• Domain Decomposition (DD) Preconditioner: decomposes the domain of the problem into subdomains, each of which can be solved independently [70]. The main advantage of DD preconditioners is their ability to handle problems with irregular geometries efficiently. DD preconditioners can be used with any iterative solver, such as the conjugate gradient (CG) method. However, the convergence rate of the iterative solver may depend on the size of the subdomains, the number of subdomains, and the overlap between them [71].

The CG method is known for its fast convergence rate for symmetric and positive definite matrices, which makes it ideally suited for problems in FEA [72]. In addition, it is memory-efficient for large-scale problems. The CG method is easily parallelizable, making it scalable for the solution of very large linear systems on modern high-performance computing platforms [73].

Considering these advantages, the CG method with diagonal preconditioning is chosen for solving the system of linear equations in this thesis. The diagonal preconditioner is chosen because it is computationally efficient and easy to implement, as it only requires the diagonal elements of the matrix to be computed and stored. This makes it particularly attractive for problems with a large number of unknowns, as the computational cost of the preconditioner is proportional to the size of the matrix [67].
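For illustration, a minimal serial sketch of the diagonally (Jacobi) preconditioned CG algorithm is given below. The function name pcg_jacobi, the dense row-major matrix storage, and the tolerance are illustrative assumptions only; a practical FEA implementation would operate on sparse matrices and run on the GPU.

#include <vector>
#include <cmath>
#include <cstddef>

// Jacobi (diagonal) preconditioned CG for a dense SPD matrix A, solving A x = b.
// A is stored row-major with dimension n x n; x holds the initial guess and is
// overwritten with the solution.
void pcg_jacobi(const std::vector<double>& A, const std::vector<double>& b,
                std::vector<double>& x, std::size_t n,
                double tol = 1e-10, std::size_t max_iter = 1000)
{
    std::vector<double> r(n), z(n), p(n), Ap(n), Minv(n);

    // Diagonal preconditioner: M^{-1} = 1 / diag(A)
    for (std::size_t i = 0; i < n; ++i) Minv[i] = 1.0 / A[i * n + i];

    // Initial residual r = b - A x, preconditioned residual z, search direction p
    for (std::size_t i = 0; i < n; ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j < n; ++j) s += A[i * n + j] * x[j];
        r[i] = b[i] - s;
    }
    for (std::size_t i = 0; i < n; ++i) { z[i] = Minv[i] * r[i]; p[i] = z[i]; }

    double rz = 0.0;
    for (std::size_t i = 0; i < n; ++i) rz += r[i] * z[i];

    for (std::size_t k = 0; k < max_iter; ++k) {
        // Ap = A p
        for (std::size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (std::size_t j = 0; j < n; ++j) s += A[i * n + j] * p[j];
            Ap[i] = s;
        }
        double pAp = 0.0;
        for (std::size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];
        double alpha = rz / pAp;

        // Update solution and residual, check convergence
        double rnorm2 = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rnorm2 += r[i] * r[i];
        }
        if (std::sqrt(rnorm2) < tol) break;

        // Apply the diagonal preconditioner and update the search direction
        for (std::size_t i = 0; i < n; ++i) z[i] = Minv[i] * r[i];
        double rz_new = 0.0;
        for (std::size_t i = 0; i < n; ++i) rz_new += r[i] * z[i];
        double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
}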

latency of data and instructions. In contrast, the bulk of a GPU's chip area is dedicated to a large number of parallel computing threads. Individual GPU processors are called streaming processors (SPs), and each has its own register memory. A collection of SPs constitutes a streaming multiprocessor (SM). A control unit and cache are shared among the SPs within an SM.

Figure 2.3: Simplified model of CPU and GPU architectures.

The parallel program that runs on the GPU is defined by a function called a 'kernel'. The kernel is invoked by the host (CPU) after specifying the number of threads. These threads are grouped into thread 'blocks'. Threads within a block can communicate and share information; however, threads in one block cannot communicate directly with threads in another block. The total set of thread blocks launched with a kernel is referred to as a 'grid'. The GPU architecture further clubs individual threads into groups of 32 called 'warps'. At any given time, a single instruction is executed by all the threads of a warp. The programmer can control only the size of the grid and the blocks, while the GPU handles the formation of warps.
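A minimal sketch of this launch configuration is shown below; the kernel name vecAdd and the chosen block size are illustrative only. Each thread computes its global index from its block index, the block size, and its thread index within the block.

#include <cuda_runtime.h>
#include <cstdio>

// Each thread processes one element; the global index is built from the
// block index, the block size, and the thread index within the block.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch configuration: the grid consists of 'blocks' thread blocks,
    // each containing 'threadsPerBlock' threads.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);   // expected 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}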

A GPU device is comprised of various memory types, each with different characteristics and functions. Any application's performance is highly dependent on the optimal utilization of these memories. Table 2.1 lists the memory types, their location on the device, and their scope of access.

The register and local memories are private to each thread. The shared memory is exclusive to a thread block and is generally used by the threads of a block to communicate and share data. Access to on-chip memories is faster. Global memory is the largest in size and is accessible to all threads throughout the application run time.

Table 2.1: Different types of memories available on a GPU device

Memory     Location   Accessed by
Register   On-chip    Thread
Local      On-chip    Thread
Shared     On-chip    Block
Global     Off-chip   Grid
Constant   Off-chip   Grid

However, transactions between threads and global memory are slower than those involving on-chip memories.

Global memory is generally used to share and copy data from the host (CPU). The constant memory is read-only; its contents remain unchanged throughout the program.
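The sketch below (kernel and variable names are hypothetical) illustrates where these memory types typically appear in CUDA code: global memory for the input and output arrays, shared memory for per-block staging, constant memory for a read-only parameter set from the host, and registers for per-thread indices.

#include <cuda_runtime.h>

__constant__ float scale;              // constant memory: read-only, written by the host

__global__ void scaleAndCopy(const float* in, float* out, int n)
{
    __shared__ float tile[256];        // shared memory: visible to one thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // per-thread index kept in registers
    if (i < n) tile[threadIdx.x] = scale * in[i];    // 'in' and 'out' reside in global memory
    __syncthreads();                   // threads of the block synchronise on shared data
    if (i < n) out[i] = tile[threadIdx.x];
}

// Host side: constant memory is initialised with cudaMemcpyToSymbol, e.g.
//   float h_scale = 2.0f;
//   cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));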

Various programming paradigms such as CUDA, OpenCL, and OpenACC are available to develop kernels for GPUs. OpenCL and OpenACC are vendor-neutral programming models that work across GPU devices from different manufacturers. CUDA is a programming model developed by NVIDIA to run on GPUs manufactured by them. CUDA works as an extension of the C/C++/Fortran languages and is the most popular tool for writing GPU kernels. It provides flexibility and control to the programmer for building parallel applications using GPUs. Since CUDA allows maximum control over the available GPU resources, the responsibility of achieving good performance and accurate results lies with the programmer. To accomplish this, the following considerations must be taken into account.

Race Condition:

A race condition occurs when two or more threads with access to the same memory location attempt to modify the same data at the same time. Multiple threads can therefore have different values for the same memory address at the same time. This occurs because the programmer cannot anticipate the order in which threads will access the data. Consequently, race conditions might lead to inconsistent and inaccurate results and should be avoided.
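A minimal sketch of the problem and one common remedy (atomic operations) is given below; the kernel names are hypothetical.

#include <cuda_runtime.h>

// In the racy version, several threads may read-modify-write the same bin
// concurrently, so some increments can be lost. atomicAdd serialises the
// conflicting updates and avoids the race.
__global__ void histogramRacy(const int* keys, int* bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) bins[keys[i]] += 1;            // race: non-atomic read-modify-write
}

__global__ void histogramSafe(const int* keys, int* bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[keys[i]], 1);  // atomic update: no race
}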

Thread Divergence:

The GPU architecture is designed in such a way that all the threads in a warp execute the same instruction. However, if the threads in a warp take different execution paths due to instruction-level branching, this is known as 'thread divergence.' It wastes computing resources and must be avoided.
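The kernels below (hypothetical names) sketch the effect. In the divergent version, even and odd threads of the same warp take different branches, so the warp executes both paths one after the other; branching on a per-warp quantity instead keeps all 32 threads of a warp on the same path.

#include <cuda_runtime.h>

__global__ void divergentKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0)          // threads of one warp split between branches
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }
}

__global__ void warpUniformKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((threadIdx.x / 32) % 2 == 0)   // whole warps take the same branch
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }
}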

Memory Coalescing:

Memory coalescing is a technique that enables the GPU's global memory to be utilized optimally. It is achieved when the threads of a warp access consecutive global memory addresses. In this scenario, the GPU bundles their accesses into as few memory transactions as possible. For optimal performance, global memory accesses must be coalesced.
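The copy kernels below (hypothetical names) sketch the difference. In the coalesced version, neighbouring threads read neighbouring addresses, so a warp's loads are combined into a few wide transactions; in the strided version, each thread touches an address far from its neighbours and the accesses cannot be merged.

#include <cuda_runtime.h>

__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // consecutive threads -> consecutive addresses
}

__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)i * stride;                 // neighbouring threads are 'stride' elements apart
    if (j < n) out[j] = in[j];
}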