Dr. Noha MM.
Computer Science Department Thebes Academy
High Performance Computing (HPC)
Lecture 5
Content
• Aims for HPC
• Concepts and Terminology
• Parallel Implementation
• The Layers of Implementing An Application in Hardware or Software Using Parallel Machines
Aims for HPC
Users of high-performance computing generally have different, but related, goals:
◦ Maximize performance – measured in MIPS or MFLOPS.
◦ Minimize the turnaround time to complete a specific application problem, or
◦ Maximize the problem size that can be solved in a given amount of time.
◦ Necessary for large-scale problems that could not be solved otherwise.
These are true supercomputers: they cost tens of millions of dollars and are O(1000) times faster than desktop systems.
Concepts and Terminology:
General Terminology
Task – A logically discrete section of computational work
Parallel Task – Task that can be executed by multiple processors safely
Communications – Data exchange between parallel tasks
Synchronization – The coordination of parallel tasks in real time
Cont.
Granularity – The ratio of computation to communication, also called the computation-to-communication ratio (CCR); see the ratio written out below
Coarse grain – High computation, low communication
Fine grain – Low computation, high communication
Parallel Overhead
Synchronizations
Data Communications
Overhead imposed by compilers, libraries, tools, operating systems, etc.
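As a simple formalization (not stated explicitly on the slide), the granularity or computation-to-communication ratio (CCR) of a task can be written as

CCR = \frac{T_{\text{computation}}}{T_{\text{communication}}}

where a large ratio corresponds to coarse grain and a small ratio to fine grain.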
Parallel Implementation
The benefits of parallel computing need to take into consideration:
The number of processors being deployed, and
The communication overhead of processor-to-processor and processor-to-memory transfers.
The Layers of Implementing An Application in
Hardware or Software Using Parallel Machines
Layer 1: Implementation
(a) Software Implementation
Realization of the algorithm or the application on a parallel computer platform.
This chapter concentrates on the software implementation.
(b) Custom Hardware Implementation
The realization of the application on an application-specific parallel processor system using
Application-Specific Integrated Circuits (ASICs), or
Field-Programmable Gate Arrays (FPGAs).
Layer 2: Coding Layer
The parallel algorithm is coded using a programming language.
◦ The language used depends on the target parallel computing platform.
(a) Concurrency Platform Implementation
Mapping the algorithm onto a parallel computing platform through parallel programming, which is facilitated by what are called concurrency platforms:
tools that help the programmer manage the threads and the timing of task execution on the processors.
Examples of concurrency platforms include:
Cilk++, OpenMP, or Compute Unified Device Architecture (CUDA)
(b) VLSI Tools
Mapping the algorithm onto a custom parallel computer such as a systolic array.
The programmer uses a hardware description language (HDL) such as:
◦ Verilog, or
◦ the Very High Speed Integrated Circuit Hardware Description Language (VHDL).
Layer 3: Parallelization &
Scheduling
This layer accepts the algorithm description from layer 4, and
Produces thread timing and assignment to processors for software implementation.
Alternatively,
This layer produces task scheduling and assignment to processors for a custom hardware very large-scale integration (VLSI) implementation.
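As a small illustration (assumed, not taken from the lecture), for a task graph with tasks T1, T2, and T3, where T3 depends on the outputs of T1 and T2, this layer might produce a schedule and processor assignment such as:

Time step 1: processor P1 runs T1, processor P2 runs T2
Time step 2: processor P1 runs T3, processor P2 is idle

T1 and T2 are independent, so they are assigned to different processors and run concurrently; T3 is scheduled only after both of its inputs are ready.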
Layer 4: Algorithm Design
Defines the computations required to implement the application.
Defines the tasks of the algorithm and their interdependencies.
The algorithm might or might not display parallelism at this stage.
It should not be concerned with task timing or task allocation to processors.
The results of this layer are:
Task Dependence Graph,
Task Directed Graph (DG)
Layer 5: Application
The application layer is closest to the end-user; it allows users to interact with the software application.
Some input/output (I/O) specifications might be concerned with:
Where data is stored, and
The desired timing relations of data.
Computer Programming
Serial Computer Programming
The programmer writes code in a high-level language such as C, Java, or FORTRAN, and
The code is compiled without further input from the programmer.
◦ More significantly, the programmer does not need to know the hardware details of the computing platform.
Parallelizing Compilers
A parallelizing compiler looks for simple loops and spreads their iterations among the processors.
Such compilers can easily tackle what are termed embarrassingly parallel algorithms.
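For illustration (an assumed example, not from the slides), a loop of the kind such compilers handle well: each iteration reads and writes only its own elements, so the iterations can be distributed across processors without any programmer intervention.

/* Assumed arrays a, b, c of length n.  Each iteration writes only
   its own c[i] and reads only a[i] and b[i]; there are no
   loop-carried dependences, so a parallelizing compiler can spread
   the iterations across processors automatically. */
void vector_add(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}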
Parallel Processing
The programmer must have intimate knowledge of:
How the processors interact with each other, and
When the algorithm tasks are to be executed.
Programming Parallel Computers
Possible approaches to programming parallel computers:
Message passing programming
It does not require sophisticated compilers, but it does require the programmer to explicitly perform all the data distribution and message passing.
Message passing programming can be used to implement any kind of parallel application, including irregular and asynchronous problems that are very hard to implement efficiently using a data parallel approach.
Examples: MPI (Message Passing Interface), PVM (Parallel Virtual Machine)
Shared memory programming
It is easier than message passing, but still requires the programmer to enforce one-at-a-time access to shared data using locks, semaphores, or other synchronization.
Example: OpenMP
Message Passing Programming
It is a programming paradigm targeted at distributed memory MIMD machines, and can be emulated on shared memory machines.
Data distribution and communication in SIMD machines are regular, so they can be handled by a data-parallel compiler.
The programmer must explicitly specify:
Parallel execution of code on different processors,
Distribution of data between processors, and
The exchange of data between processors when it is needed.
Data is exchanged using calls to ‘message passing’
libraries, such as the Message Passing Interface (MPI) libraries, from a standard sequential language such as Fortran, C or C++.
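As a minimal sketch (not part of the slides), assuming the work is just summing the integers 0 to N-1, a C/MPI program in which the programmer explicitly chooses the data distribution by rank and calls the library to combine the partial results:

#include <mpi.h>
#include <stdio.h>

#define N 1000000L            /* assumed total problem size */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes exist? */

    /* The programmer explicitly decides which block of the data each
       process owns (block distribution by rank). */
    long chunk = N / size;
    long start = rank * chunk;
    long end   = (rank == size - 1) ? N : start + chunk;

    double local_sum = 0.0;
    for (long i = start; i < end; i++)
        local_sum += (double)i;             /* computation on the local section */

    /* Explicit communication: combine the partial sums on rank 0. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %.0f\n", global_sum);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun, every process runs the same program but operates only on its own block of indices.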
Programming Model of Multi/Distributed computer
Shared Memory Programming
OpenMP provides high-level, standard shared-memory compiler directives and library routines for Fortran and C/C++ (a standardization idea similar to HPF and MPI).
OpenMP developed from the successful compiler directives used by vectorizing compilers for vector machines such as CRAYs, e.g., directives to vectorize loops over each element of an array, which makes it very easy to convert sequential code to run well on vector machines.
Alternative approach:
The programmer creates multiple concurrent threads that can be executed on different processors and share access to the same pool of memory.
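As a minimal sketch (not part of the slides), the same summation written for the shared-memory model with OpenMP: a single compiler directive asks the runtime to split the loop iterations among threads, and the reduction clause provides safe one-at-a-time combination of the shared result instead of hand-written locks or semaphores.

#include <omp.h>
#include <stdio.h>

#define N 1000000L

int main(void)
{
    double sum = 0.0;

    /* The directive below splits the iterations among the available
       threads; reduction(+:sum) lets each thread keep a private partial
       sum and combines them safely at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += (double)i;

    printf("max threads: %d, sum = %.0f\n", omp_get_max_threads(), sum);
    return 0;
}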
Programming Model of Multiprocessor
Multi-Processor Parallelism
Utilize multiple processors to speed up program run-time, by dividing the entire computation among the processors.
Achieved by splitting a large data set among processors. Each processor does computation on its section of the data, concurrently (in parallel) with the other processors.
Introduces an extra level in the memory hierarchy – processors may need to access data stored in the memory of another processor, or in large banks of memory shared between processors.
It is useful for applications with regular data structures.
Data Parallelism
ALGORITHMS
The IEEE Standard Dictionary of Electrical and Electronics Terms
defines an algorithm as:
“A prescribed set of well-defined rules or processes to solve a problem in a finite number of steps”.
Some tasks can run concurrently in parallel and some must run serially or sequentially one after the other.
i.e., any algorithm is composed of a serial part and a parallel part
The basic components defining an algorithm are:
1. The different tasks,
2. The dependencies among tasks, where one task's output is used as another's input,
3. The set of primary inputs needed by the algorithm, and
4. The set of primary outputs produced by the algorithm.
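As a toy example (assumed, not from the slides), an algorithm that evaluates (a + b) * (c + d) shows all four components; the comments mark them explicitly:

/* Primary inputs:  a, b, c, d
 * Tasks:           T1 = a + b,  T2 = c + d,  T3 = T1 * T2
 * Dependencies:    T3 uses the outputs of T1 and T2, so T1 and T2
 *                  form the parallel part and T3 the serial part.
 * Primary output:  the returned product. */
double example_algorithm(double a, double b, double c, double d)
{
    double t1 = a + b;   /* task T1 (independent of T2) */
    double t2 = c + d;   /* task T2 (independent of T1) */
    return t1 * t2;      /* task T3 (depends on T1, T2) */
}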
Parallel Processing
(Several processing elements working to solve a single problem)
• Elapsed Time = Computation time + Communication time + Synchronization time
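As a hypothetical worked example (numbers assumed, not from the slides): a job needing 8 s of computation that divides evenly over 4 processors, plus 1 s of communication and 0.5 s of synchronization, has a parallel elapsed time of 8/4 + 1 + 0.5 = 3.5 s against 8 s serially – a speedup of about 2.3 rather than the ideal 4.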
Principles of Parallel Computing
1. Finding enough parallelism (Amdahl's law – see the formula after this list)
2. Granularity
Need a large enough amount of work per task to hide the communication overhead
3. Locality
Large memories are slow; fast memories are small.
4. Load balance
5. Coordination and synchronization
6. Performance modeling
All these things make parallel programming harder than sequential programming.
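For reference (the standard statement of the law, not reproduced on the slide), Amdahl's law bounds the speedup on p processors when only a fraction f of the work can be parallelized:

S(p) = \frac{1}{(1 - f) + f/p}

Even as p grows without bound, the speedup cannot exceed 1/(1 - f); for example, f = 0.9 caps the speedup at 10.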
Fundamental Issues
Is the problem amenable to parallelization?
How to decompose the problem to exploit parallelism?
What machine architecture should be used?
What parallel resources are available?
What kind of speedup is desired?