Dr. Noha MM.
Computer Science Department Thebes Academy
High Performance Computing (HPC)
Lecture 5
Content
• Aims for HPC
• Concepts and Terminology
• Parallel Implementation
• The Layers of Implementing An Application in Hardware or Software Using Parallel Machines
Aims for HPC
Users of high-performance computing generally have different, but related, goals:
◦ Maximize performance – measured in MIPS or MFLOPS.
◦ Minimize the turnaround time to complete a specific application problem, or
◦ Maximize the problem size that can be solved in a given amount of time.
◦ Necessary for large-scale problems that could not be solved otherwise.
These are true supercomputers: they cost tens of millions of dollars and are O(1000) times faster than desktop systems.
Concepts and Terminology:
General Terminology
Task – A logically discrete section of computational work
Parallel Task – Task that can be executed by multiple processors safely
Communications – Data exchange between parallel tasks
Synchronization – The coordination of parallel tasks in real time
Cont.
Granularity – The ratio of computation to communication, also called the computation-to-communication ratio (CCR); see the ratio written out below
Coarse grain – High computation, low communication
Fine grain – Low computation, high communication
Parallel Overhead
Synchronizations
Data Communications
Overhead imposed by compilers, libraries, tools, operating systems, etc.
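As a simple formalization (not stated explicitly on the slide), the granularity or computation-to-communication ratio (CCR) of a task can be written as

CCR = \frac{T_{\text{computation}}}{T_{\text{communication}}}

where a large ratio corresponds to coarse grain and a small ratio to fine grain.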
Parallel Implementation
The benefits of parallel computing need to take into consideration:
The number of processors being deployed, and
The communication overhead of processor-to-processor and processor-to-memory transfers.
The Layers of Implementing An Application in
Hardware or Software Using Parallel Machines
Layer 1: Implementation
(a) Software Implementation
Realization of the algorithm or the application on a parallel computer platform.
This chapter concentrates on the software implementation.
(b) Custom Hardware Implementation
The realization of the application on an application-specific parallel processor system using
Application-Specific Integrated Circuits (ASICs), or
Field-Programmable Gate Arrays (FPGAs).
Layer 2: Coding Layer
The parallel algorithm is coded using a programming language.
◦ The language used depends on the target parallel computing platform.
(a) Concurrency Platform Implementation
Mapping the algorithm onto a parallel computing platform through parallel programming, which is facilitated by what are called concurrency platforms:
tools that help the programmer manage the threads and the timing of task execution on the processors.
Examples of concurrency platforms include:
Cilk++, OpenMP, or Compute Unified Device Architecture (CUDA)
(b) VLSI Tools
Mapping the algorithm onto a custom parallel computer such as a systolic array.
The programmer uses a hardware description language (HDL) such as:
◦ Verilog, or
◦ the Very High Speed Integrated Circuit Hardware Description Language (VHDL).
Layer 3: Parallelization &
Scheduling
This layer accepts the algorithm description from layer 4, and
Produces thread timing and assignment to processors for software implementation.
Alternatively,
This layer produces task scheduling and assignment to processors for a custom hardware very large-scale integration (VLSI) implementation.
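As a small illustration (assumed, not taken from the lecture), for a task graph with tasks T1, T2, and T3, where T3 depends on the outputs of T1 and T2, this layer might produce a schedule and processor assignment such as:

Time step 1: processor P1 runs T1, processor P2 runs T2
Time step 2: processor P1 runs T3, processor P2 is idle

T1 and T2 are independent, so they are assigned to different processors and run concurrently; T3 is scheduled only after both of its inputs are ready.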
Layer 4: Algorithm Design
Defines the computations required to implement the application.
Defines the tasks of the algorithm and their interdependencies.
The algorithm might or might not display parallelism at this stage.
It should not be concerned with task timing or task allocation to processors.
The results of this layer are:
Task Dependence Graph,
Task Directed Graph (DG)
Layer 5: Application
The application layer is closest to the end-user; it allows users to interact with the software application.
Some input/output (I/O) specifications might be concerned with:
Where data is stored, and
The desired timing relations of data.
Computer Programming
Serial Computer Programming
The programmer writes code in a high-level language such as C, Java, or FORTRAN, and
The code is compiled without further input from the programmer.
◦ More significantly, the programmer does not need to know the hardware details of the computing platform.
Parallelizing Compilers
A parallelizing compiler looks for simple loops and spreads their iterations among the processors.
Such compilers can easily tackle what are termed embarrassingly parallel algorithms.
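For illustration (an assumed example, not from the slides), a loop of the kind such compilers handle well: each iteration reads and writes only its own elements, so the iterations can be distributed across processors without any programmer intervention.

/* Assumed arrays a, b, c of length n.  Each iteration writes only
   its own c[i] and reads only a[i] and b[i]; there are no
   loop-carried dependences, so a parallelizing compiler can spread
   the iterations across processors automatically. */
void vector_add(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}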
Parallel Processing
The programmer must have intimate knowledge of:
How the processors interact with each other, and
When the algorithm tasks are to be executed.
Programming Parallel Computers
Possible approaches to programming parallel computers:
Message passing programming
It does not require sophisticated compilers, but it does require the programmer to explicitly perform all the data distribution and message passing.
Message passing programming can be used to implement any kind of parallel application, including irregular and asynchronous problems that are very hard to implement efficiently using a data parallel approach.
Examples: MPI (Message Passing Interface), PVM (Parallel Virtual Machine)
Shared memory programming
It is easier than message passing, but still requires the programmer to enforce one-at-a-time access to shared data using locks, semaphores, or other synchronization.
Example: OpenMP
Message Passing Programming
It is a programming paradigm targeted at distributed memory MIMD machines, and can be emulated on shared memory machines.
Data distribution and communication in SIMD machines are regular, so they can be handled by a data-parallel compiler.
The programmer must explicitly specify:
Parallel execution of code on different processors,
Distribution of data between processors, and
The exchange of data between processors when it is needed.
Data is exchanged using calls to ‘message passing’
libraries, such as the Message Passing Interface (MPI) libraries, from a standard sequential language such as Fortran, C or C++.
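As a minimal sketch (not part of the slides), assuming the work is just summing the integers 0 to N-1, a C/MPI program in which the programmer explicitly chooses the data distribution by rank and calls the library to combine the partial results:

#include <mpi.h>
#include <stdio.h>

#define N 1000000L            /* assumed total problem size */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes exist? */

    /* The programmer explicitly decides which block of the data each
       process owns (block distribution by rank). */
    long chunk = N / size;
    long start = rank * chunk;
    long end   = (rank == size - 1) ? N : start + chunk;

    double local_sum = 0.0;
    for (long i = start; i < end; i++)
        local_sum += (double)i;             /* computation on the local section */

    /* Explicit communication: combine the partial sums on rank 0. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %.0f\n", global_sum);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun, every process runs the same program but operates only on its own block of indices.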
Programming Model of Multi/Distributed computer
Shared Memory Programming
OpenMP provides high-level, standard shared-memory compiler directives and library routines for Fortran and C/C++ (a standardization idea similar to HPF and MPI).
OpenMP developed from the successful compiler directives used by vectorizing compilers for vector machines such as CRAYs, e.g., directives to vectorize loops over each element of an array, which makes it very easy to convert sequential code to run well on vector machines.
Alternative approach:
The programmer creates multiple concurrent threads that can be executed on different processors and share access to the same pool of memory.
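As a minimal sketch (not part of the slides), the same summation written for the shared-memory model with OpenMP: a single compiler directive asks the runtime to split the loop iterations among threads, and the reduction clause provides safe one-at-a-time combination of the shared result instead of hand-written locks or semaphores.

#include <omp.h>
#include <stdio.h>

#define N 1000000L

int main(void)
{
    double sum = 0.0;

    /* The directive below splits the iterations among the available
       threads; reduction(+:sum) lets each thread keep a private partial
       sum and combines them safely at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += (double)i;

    printf("max threads: %d, sum = %.0f\n", omp_get_max_threads(), sum);
    return 0;
}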
Programming Model of Multiprocessor
Multi-Processor Parallelism
Utilize multiple processors to speed up program run-time, by dividing the entire computation among the processors.
Achieved by splitting a large data set among processors. Each processor does computation on its section of the data, concurrently (in parallel) with the other processors.
Introduces an extra level in the memory hierarchy – processors may need to access data stored in the memory of another processor, or in large banks of memory shared between processors.
It is useful for applications with regular data structures.
Data Parallelism
ALGORITHMS
The IEEE Standard Dictionary of Electrical and Electronics Terms
defines an algorithm as:
“A prescribed set of well-defined rules or processes to solve a problem in a finite number of steps”.
Some tasks can run concurrently in parallel and some must run serially or sequentially one after the other.
i.e., any algorithm is composed of a serial part and a parallel part
The basic components defining an algorithm are:
1. The different tasks,
2. The dependencies among tasks, where one task's output is used as another's input,
3. The set of primary inputs needed by the algorithm, and
4. The set of primary outputs produced by the algorithm.
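As a toy example (assumed, not from the slides), an algorithm that evaluates (a + b) * (c + d) shows all four components; the comments mark them explicitly:

/* Primary inputs:  a, b, c, d
 * Tasks:           T1 = a + b,  T2 = c + d,  T3 = T1 * T2
 * Dependencies:    T3 uses the outputs of T1 and T2, so T1 and T2
 *                  form the parallel part and T3 the serial part.
 * Primary output:  the returned product. */
double example_algorithm(double a, double b, double c, double d)
{
    double t1 = a + b;   /* task T1 (independent of T2) */
    double t2 = c + d;   /* task T2 (independent of T1) */
    return t1 * t2;      /* task T3 (depends on T1, T2) */
}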
Parallel Processing
(Several processing elements working to solve a single problem)
• Elapsed Time = Computation time + Communication time + Synchronization time
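As a hypothetical worked example (numbers assumed, not from the slides): a job needing 8 s of computation that divides evenly over 4 processors, plus 1 s of communication and 0.5 s of synchronization, has a parallel elapsed time of 8/4 + 1 + 0.5 = 3.5 s against 8 s serially – a speedup of about 2.3 rather than the ideal 4.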
Principles of Parallel Computing
1. Finding enough parallelism (Amdahl's law – see the formula after this list)
2. Granularity
Need a large enough amount of work per task to hide the communication overhead
3. Locality
Large memories are slow; fast memories are small.
4. Load balance
5. Coordination and synchronization
6. Performance modeling
All these things make parallel programming harder than sequential programming.
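For reference (the standard statement of the law, not reproduced on the slide), Amdahl's law bounds the speedup on p processors when only a fraction f of the work can be parallelized:

S(p) = \frac{1}{(1 - f) + f/p}

Even as p grows without bound, the speedup cannot exceed 1/(1 - f); for example, f = 0.9 caps the speedup at 10.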
Fundamental Issues
Is the problem amenable to parallelization?
How to decompose the problem to exploit parallelism?
What machine architecture should be used?
What parallel resources are available?
What kind of speedup is desired?