Pemrosesan Paralel. Contoh aplikasi di atas melibatkan komputasi intensif dan memerlukan daya olah yang tinggi. Performance: Weather Prediction

(1)

Pemrosesan Paralel

Kudang B. Seminar

Kebutuhan Komputer Berkinerja Tinggi

• Peramalan cuaca

• Aerodinamik

• Kercerdasan buatan: robotik

• Rekayasa genetik

Contoh aplikasi di atas melibatkan

komputasi intensif dan memerlukan

daya olah yang tinggi

Example 1: Weather Prediction

•

Area, segments

– 3000*3000*11 cubic miles

– .1*.1*.1 cubic mile: ~ 1011_segments

•

Two day prediction

– half hour periods: ~ 100 periods

•

Computation per segment

– Temp, Pressure, Humidity, Wind speed, Wind direction

– Assume ~ 100 FLOPs

Performance: Weather Prediction

• Computational requirement: 10

15

• Serial supercomputer: 10

9

_instr/sec

• Total serial time: 10

6

_{sec =}

₂₈₀

_hours

• Not too good for

48 hour weather

prediction

(2)

Parallel Weather Prediction

•

1 K workstations, grid connected

– 108_{segment computations per processor}

– 108_{instructions per second}

– 100 instructions per segment computation – 100 time steps: 104_{seconds =}_{~3 hours}

•Much more acceptable

– Assumption: Communication not a problem here

•

More workstations:

– finer grid – better accuracy

Example 2: N body problem

•

Astronomy: bodies in space

– Attract each other: Gravitational force Newtons law – O(n*n) calculations per “snapshot”

•Galaxy: ~ 1011_{bodies -> ~ 10}22_calculations

•Calculation 1 micro sec

•Snapshot: 1016_{secs = ~10}11 _{days = ~ 3*10}8_years

– Is parallelism going to help us? NO

– What does help? Better algorithm: Barnes Hut

•Divides the space in “quad tree” •Treats “far’ away quads as one body

Other Challenging Applications

•

Satellite data acquisition: billions of bits / sec

•

Satellite data processing

– Pollution levels, Remote sensing of materials – Image recognition

•

Discrete optimization problems

– Planning, Scheduling, VLSI design

•

Material modeling

•

Nuclear weapons modeling (ASCI)

•

Airplane/Satellite/Vehicle design

Application Specific Architectures

•

Mapping an algorithm directly onto hardware

– ASICs: Application Specific Integrated Circuits – Levels of ‘specificity’

•Full custom ASICs •Standard cell ASICs •Field programmable gate arrays

– Computational models

•Dataflow graphs •Systolic arrays

– Orders of magnitude better performance – Orders of magnitude lower power

(3)

ASICS cont’

•

How much faster than General purpose?

– Example: 1D 1024 FFT

•General purpose machine (G4): 25 micro secs •ASIC device (MIT Lincoln Labs): 32 nano secs •ASIC device uses 20 milliwatts (100 * less power)

•

Future designs:

– 2 tera ops in small ( < cubic ft ) device – Target applications

•FFT

•Finite Impulse Response (FIR) Filters •Matrix multiply

•QR decomposition

Contoh Nyata

• Peramalan cuaca 24 jam di UK melibatkan sekitar 1012_operasi

untuk dieksekusi. Ini memerlukan waktu 2.7 hours pada mesin Cray-1 (berkemampuan 108_{operasi per detik}_).

Berapa

operasi untuk peramalan mingguan, bulanan,

tahunan?

• Menurut Einstein kecepatan cahaya: 3 x 108_m/dt_{. Dua}

peralatan elektronik yang masing-masing mampu melakukan

1012_{operasi/detik}_{dan terpisah dengan jarak}_{0.5 mm.}_Dalam

hal ini akan lebih lama waktu yang diperlukan bagi sinyal melakukan perjalanan antar dua peralatan tersebut daripada waktu yang diperlukan untuk melakukan eksekusi operasi (10 -12_detik)_{oleh salah satu peralatan elektronik tersebut.}

Jadi

faktor pembatasnya adalah kecepatan cahaya.

SOLUSI:

mendayagunakan paralelisme

Motivation of Parallel Computing

•

Parallel Computing is cost effective

– Off the shelf, commodity processors are very fast – Memory is very cheap

– Building a processor that is a small factor faster costs an order of magnitude more

– NoW is the time!

•Cheapest way to get more performance: multiprocessor •NoW: Networks of workstations

•Workstation can be an SMP •SMP: Symmetric Multi Processor

– Shared memory – Bus

Wile E. Coyote’s Parallel Computer

• Get a lot of the fastest processors

• Get a lot of memory per processor

• Get the fastest network

• Hook it all together

(4)

Now you need to program it!

Parallel programming introduces:

– Task partitioning, task scheduling – Data partitioning – Synchronization – Load balancing – Latency issues

•

hiding

•

tolerance

Problem with Wile E. Coyote Architecture

Von Neumann Machines not built for //ism •To get high speed, processors have lots of state

– Cache, stack, global memory

•To tolerate latency, we need fast context switch. WHY?

•No free lunch: can’t have both

– Certainly not if the processor was not designed for both

•Memory wall: memory gets slower and slower

– in terms of number of cycles it takes to access

•Memory hierarchy gets more and more complex •Memory accesses block

– No split phase memory access

Sequential vs Parallel Algorithms

•

Efficient Parallel Algorithms

– Maximize parallelism

– Minimize synchronization, remote accesses – Efficiency is Architecture Dependent

•

Efficient Sequential Algorithms

– Minimize time, space – Efficiency is portable

•Efficient C program on Pentium ~ Efficient C program on Alpha

Speedup

•

Ideal: n processors  n fold speed up

– Ideal not always possible. WHY?

– Tasks are data dependent – Not all processors are always busy – Remote data

•

Super linear speedup: >n speedup

– Nonsense! Because we can execute the faster parallel program sequentially

– No nonsense!! Because parallel computers do not just have more processors, they have more caches

(5)

Parallel Programming

•

Parallel Programming Paradigms

– Super compilers

•20 years of parallelizing compilers and what do we get? ..not much: we understand loops (a bit)

– Multithreading

•Pthreads, Solaris threads, not much difference

– Message Passing

•MPI rules, ..well, there is PVM (parallel virtual machine)

– Data parallel programming

•Niche work, but important

Implicit vs Explicit //ism

•

Implicit: super compilers

– Extract parallelism from sequential program – The general case is too hard

•pointers, aliases, recursion, separate compilation •dynamic dependence distances in array references

•

Explicit Parallelism: threads or messages – Complicates programming

•creation, allocation, scheduling of processes •data partitioning

•Synchronization ( locks, messages )

Pemrosesan Sekuensial & Paralel

3 x lebih cepat dari

Klasifikasi Mesin Paralel

Models of Computation ( Flynn 1966 )

1. Single Instruction Stream, Single Data Stream : SISD.

2. Multiple Instruction Stream, Single Data Stream : MISD.

3. Single Instruction Stream, Multiple Data Stream : SIMD.

4. Multiple Instruction Stream, Multiple Data Stream : MIMD.

(6)

SISD Computers

Untuk operasi a1 + a2 + a3 + … + an memerlukan sebanyak n akses ke memori oleh prosesor dan sebanyak n-1 operasi penjumlahan. Jadi kompleksitas waktu operasi adalah O(n).

von Neumann Architecture

Computer

MISD Computers

N prosesor yang memiliki unit kontrol pribadi, berbagi guna memori bersama (shared memori).

Parallelisme diperoleh dengan menugaskan semua prosesor mengerjakan operasi/tugas yang berbeda secara simultan pada data yang sama.

SIMD Computers

N prosesor beroperasi dibawah kendali aliran instruksi tunggal yang dikeluarkan oleh unit kontrol pusat.

(7)

MIMD Computers

_{Potensi dari 4 kelas komputer}

SPMD Computers

Program yang sama dieksekusi pada prosesor komputer MIMD. SPMD bukan merupakan paradigma hardware, ini adalah software ekuivalen dari SIMD, namun bersifat asynchronous.

Perhatikan instruksi IF X = 0 THEN S1 ELSE S2

Asumsikan X = 0 pada prosesor P1, dan untukX != 0 pada prosesor P2

Proses P1 mengeksekusi S1 paralel dengan prosesor P2 mengeksekusi S2 ( ini tidak dapat terjadi pada SIMD )

Pemrosesan Paralel. Contoh aplikasi di atas melibatkan komputasi intensif dan memerlukan daya olah yang tinggi. Performance: Weather Prediction