Parallel Processing
Kudang B. Seminar
The Need for High-Performance Computers
• Weather forecasting
• Aerodynamics
• Artificial intelligence: robotics
• Genetic engineering
The example applications above are computationally intensive and require high processing power.
Example 1: Weather Prediction
• Area, segments
  – 3000*3000*11 cubic miles
  – 0.1*0.1*0.1 cubic mile per segment: ~10^11 segments
• Two-day prediction
  – half-hour periods: ~100 periods
• Computation per segment
  – Temperature, pressure, humidity, wind speed, wind direction
  – Assume ~100 FLOPs
Performance: Weather Prediction
• Computational requirement: 10^15 operations (10^11 segments * 100 periods * 100 FLOPs)
• Serial supercomputer: 10^9 instr/sec
• Total serial time: 10^6 sec = ~280 hours
• Not too good for a 48-hour weather prediction
Parallel Weather Prediction
• 1 K workstations, grid connected
  – 10^8 segment computations per processor
  – 10^8 instructions per second
  – 100 instructions per segment computation
  – 100 time steps: 10^4 seconds = ~3 hours
• Much more acceptable
  – Assumption: communication is not a problem here
• More workstations:
  – finer grid
  – better accuracy
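The arithmetic behind the serial and parallel estimates can be checked with a short sketch. The function name `run_time_seconds` is made up for illustration, and the model assumes, as the slide does, that work splits evenly and communication is free. The results land slightly below the slide's rounded figures because 3000*3000*11/0.001 is 9.9*10^10 rather than exactly 10^11.

```python
# Back-of-the-envelope estimate using the figures quoted on the slides.
AREA_CUBIC_MILES = 3000 * 3000 * 11
SEGMENT_CUBIC_MILES = 0.1 ** 3

segments = AREA_CUBIC_MILES / SEGMENT_CUBIC_MILES   # roughly 1e11
periods = 100                                       # half-hour steps over two days
flops_per_segment = 100                             # assumed on the slide


def run_time_seconds(total_ops, ops_per_sec, processors=1):
    """Ideal run time: work split evenly, communication cost ignored (assumption)."""
    return total_ops / (ops_per_sec * processors)


total_ops = segments * periods * flops_per_segment  # roughly 1e15

serial_hours = run_time_seconds(total_ops, 1e9) / 3600
parallel_hours = run_time_seconds(total_ops, 1e8, processors=1000) / 3600

print(f"segments:       {segments:.2e}")
print(f"total ops:      {total_ops:.2e}")
print(f"serial time:    {serial_hours:.0f} hours")
print(f"parallel time:  {parallel_hours:.2f} hours")
```

Note that each workstation is assumed to be ten times slower than the serial supercomputer (10^8 versus 10^9 instructions per second), yet 1000 of them still finish about 100 times sooner.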
Example 2: N-body problem
• Astronomy: bodies in space
  – Attract each other: gravitational force, Newton's law
  – O(n*n) calculations per “snapshot”
    • Galaxy: ~10^11 bodies -> ~10^22 calculations
    • Calculation: 1 microsecond
    • Snapshot: 10^16 secs = ~10^11 days = ~3*10^8 years
  – Is parallelism going to help us? NO
  – What does help? A better algorithm: Barnes-Hut
    • Divides the space into a “quad tree”
    • Treats “far away” quads as one body
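A rough sketch of why more processors alone do not rescue the direct method. The helper names are hypothetical, the processor count of one million is an arbitrary illustration, and the Barnes-Hut line assumes roughly n*log2(n) interactions per snapshot with all constant factors ignored, so it is only an order-of-magnitude indication.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600


def direct_snapshot_years(bodies, seconds_per_calc=1e-6, processors=1):
    """All-pairs method: about n*n calculations, split ideally over the processors."""
    return bodies ** 2 * seconds_per_calc / processors / SECONDS_PER_YEAR


def tree_snapshot_days(bodies, seconds_per_calc=1e-6):
    """Assumed n*log2(n) cost for a tree method such as Barnes-Hut (constants ignored)."""
    return bodies * math.log2(bodies) * seconds_per_calc / SECONDS_PER_DAY


galaxy = 1e11
serial_years = direct_snapshot_years(galaxy)
million_cpu_years = direct_snapshot_years(galaxy, processors=1e6)
tree_days = tree_snapshot_days(galaxy)

print(f"direct, 1 processor:     {serial_years:.1e} years")
print(f"direct, 1e6 processors:  {million_cpu_years:.0f} years")
print(f"n log n, 1 processor:    {tree_days:.0f} days")
```

Even with perfect scaling over a million processors, the direct method still needs centuries per snapshot, whereas changing the algorithm brings a single processor down to the order of weeks.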
Other Challenging Applications
• Satellite data acquisition: billions of bits / sec
• Satellite data processing
  – Pollution levels, remote sensing of materials
  – Image recognition
• Discrete optimization problems
  – Planning, scheduling, VLSI design
• Material modeling
• Nuclear weapons modeling (ASCI)
• Airplane/Satellite/Vehicle design
Application Specific Architectures
• Mapping an algorithm directly onto hardware
  – ASICs: Application Specific Integrated Circuits
  – Levels of ‘specificity’
    • Full custom ASICs
    • Standard cell ASICs
    • Field programmable gate arrays
  – Computational models
    • Dataflow graphs
    • Systolic arrays
  – Orders of magnitude better performance
  – Orders of magnitude lower power
ASICs (cont'd)
• How much faster than general purpose?
  – Example: 1D 1024-point FFT
    • General purpose machine (G4): 25 microseconds
    • ASIC device (MIT Lincoln Labs): 32 nanoseconds
    • ASIC device uses 20 milliwatts (100 times less power)
• Future designs:
  – 2 tera-ops in a small (< 1 cubic ft) device
  – Target applications
    • FFT
    • Finite Impulse Response (FIR) filters
    • Matrix multiply
    • QR decomposition
Real-World Examples
• A 24-hour weather forecast for the UK involves about 10^12 operations. This takes 2.7 hours on a Cray-1 (capable of 10^8 operations per second). How many operations are needed for weekly, monthly, and yearly forecasts?
• According to Einstein, the speed of light is 3 x 10^8 m/s. Consider two electronic devices, each capable of 10^12 operations/second, separated by a distance of 0.5 mm. In this case a signal takes longer to travel between the two devices than either device takes to execute one operation (10^-12 seconds). So the limiting factor is the speed of light.
SOLUTION: exploit parallelism
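Both numbers in this slide can be verified in a few lines. This is only a check of the stated figures; the variable names are illustrative, and the signal is assumed to travel at the full speed of light, which is the best case.

```python
SPEED_OF_LIGHT = 3e8          # m/s, value used on the slide
DISTANCE = 0.5e-3             # 0.5 mm between the two devices, in metres
DEVICE_OPS_PER_SEC = 1e12

signal_time = DISTANCE / SPEED_OF_LIGHT      # time for a signal to cross the gap
op_time = 1 / DEVICE_OPS_PER_SEC             # time for one operation

print(f"signal travel time: {signal_time:.2e} s")
print(f"one operation:      {op_time:.2e} s")
print("signal is the bottleneck:", signal_time > op_time)

# Cray-1 forecast: 1e12 operations at 1e8 operations per second.
cray_hours = 1e12 / 1e8 / 3600
print(f"Cray-1 forecast time: {cray_hours:.2f} hours")
```

The signal needs about 1.7 operation times to cross half a millimetre, so making a single device faster does not help once communication distance dominates.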
Motivation of Parallel Computing
• Parallel computing is cost effective
  – Off-the-shelf, commodity processors are very fast
  – Memory is very cheap
  – Building a processor that is a small factor faster costs an order of magnitude more
  – NoW is the time!
    • Cheapest way to get more performance: multiprocessor
    • NoW: Networks of Workstations
    • A workstation can be an SMP
    • SMP: Symmetric Multi Processor
      – Shared memory
      – Bus
Wile E. Coyote’s Parallel Computer
• Get a lot of the fastest processors
• Get a lot of memory per processor
• Get the fastest network
• Hook it all together
Now you need to program it!
Parallel programming introduces:
– Task partitioning, task scheduling
– Data partitioning
– Synchronization
– Load balancing
– Latency issues
  • hiding
  • tolerance
Problem with Wile E. Coyote Architecture
Von Neumann machines are not built for parallelism
• To get high speed, processors have lots of state
  – Cache, stack, global memory
• To tolerate latency, we need fast context switches. WHY?
• No free lunch: can't have both
  – Certainly not if the processor was not designed for both
• Memory wall: memory gets slower and slower
  – in terms of the number of cycles it takes to access
• Memory hierarchy gets more and more complex
• Memory accesses block
  – No split-phase memory access
Sequential vs Parallel Algorithms
• Efficient parallel algorithms
  – Maximize parallelism
  – Minimize synchronization, remote accesses
  – Efficiency is architecture dependent
• Efficient sequential algorithms
  – Minimize time, space
  – Efficiency is portable
    • Efficient C program on Pentium ~ efficient C program on Alpha
Speedup
• Ideal: n processors, n-fold speedup
  – Ideal not always possible. WHY?
  – Tasks are data dependent
  – Not all processors are always busy
  – Remote data
• Super-linear speedup: >n speedup
  – Nonsense! Because we can execute the faster parallel program sequentially
  – No nonsense!! Because parallel computers do not just have more processors, they have more caches
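One of the reasons listed above, "not all processors are always busy", can be illustrated with a toy model. This is a sketch under simplifying assumptions: tasks are independent, they are dealt out round-robin, and communication is free. The function names and task durations are invented for the example.

```python
def parallel_time(task_times, n_procs):
    """Deal tasks out round-robin; the run ends when the busiest processor finishes."""
    loads = [0.0] * n_procs
    for i, t in enumerate(task_times):
        loads[i % n_procs] += t
    return max(loads)


def speedup(task_times, n_procs):
    """Speedup = sequential time / parallel time."""
    return sum(task_times) / parallel_time(task_times, n_procs)


balanced = [1.0] * 8              # eight equal tasks
skewed = [5.0, 1.0, 1.0, 1.0]     # one long task dominates

print("balanced, 4 processors:", speedup(balanced, 4))
print("skewed,   4 processors:", speedup(skewed, 4))
```

With equal tasks the four processors reach the ideal 4-fold speedup. With the skewed workload three processors sit idle while one finishes the long task, and the speedup falls well short of 4 even though the total work is the same.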
Parallel Programming
• Parallel programming paradigms
  – Super compilers
    • 20 years of parallelizing compilers and what do we get? Not much: we understand loops (a bit)
  – Multithreading
    • Pthreads, Solaris threads: not much difference
  – Message passing
    • MPI rules, though there is also PVM (Parallel Virtual Machine)
  – Data parallel programming
    • Niche work, but important
Implicit vs Explicit Parallelism
• Implicit: super compilers
  – Extract parallelism from a sequential program
  – The general case is too hard
    • pointers, aliases, recursion, separate compilation
    • dynamic dependence distances in array references
• Explicit parallelism: threads or messages
  – Complicates programming
    • creation, allocation, scheduling of processes
    • data partitioning
    • synchronization (locks, messages)
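The extra bookkeeping that explicit parallelism demands (creating workers, partitioning data, synchronizing with a lock) shows up even in a small example. The sketch below uses Python's standard `threading` module rather than Pthreads. The function `parallel_sum` is hypothetical, and in CPython the global interpreter lock means this illustrates the programming effort rather than a real speedup.

```python
import threading


def parallel_sum(data, n_threads=4):
    """Sum a list with explicit threads: partition, spawn, synchronize, join."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)        # independent work on this thread's partition
        with lock:                  # synchronization: protect the shared total
            total += partial

    # Data partitioning: split the list into at most n_threads contiguous chunks.
    chunk_size = max(1, (len(data) + n_threads - 1) // n_threads)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Creation and scheduling of the workers.
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                    # wait until every partial sum is in
    return total


print(parallel_sum(list(range(1, 101))))
```

The sequential equivalent is a single call to `sum(data)`; everything else in the function is overhead introduced by making the parallelism explicit.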
Sequential & Parallel Processing
3 x faster than
Classification of Parallel Machines
Models of Computation (Flynn 1966)
1. Single Instruction Stream, Single Data Stream: SISD.
2. Multiple Instruction Stream, Single Data Stream: MISD.
3. Single Instruction Stream, Multiple Data Stream: SIMD.
4. Multiple Instruction Stream, Multiple Data Stream: MIMD.
SISD Computers
The operation a1 + a2 + a3 + … + an requires n memory accesses by the processor and n-1 additions. So the time complexity of the operation is O(n).
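A minimal sketch that counts the memory accesses and additions for a sequential sum. The counters are added purely for illustration, and the function name `sequential_sum` is hypothetical.

```python
def sequential_sum(values):
    """Sum a non-empty list one element at a time, counting accesses and additions."""
    total = values[0]          # first memory access, no addition yet
    accesses, additions = 1, 0
    for v in values[1:]:
        total += v             # one access and one addition per remaining element
        accesses += 1
        additions += 1
    return total, accesses, additions


total, accesses, additions = sequential_sum([1, 2, 3, 4, 5])
print(total, accesses, additions)
```

For n = 5 the counters come out as 5 accesses and 4 additions, matching the n and n-1 stated above.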
von Neumann Architecture
Computer
MISD Computers
N processors, each with its own control unit, share a common memory (shared memory).
Parallelism is obtained by having all processors perform different operations/tasks simultaneously on the same data.
SIMD Computers
N processors operate under the control of a single instruction stream issued by a central control unit.
MIMD Computers
Potential of the 4 classes of computers
SPMD Computers
The same program is executed on every processor of a MIMD computer. SPMD is not a hardware paradigm; it is the software equivalent of SIMD, but asynchronous.
Consider the instruction IF X = 0 THEN S1 ELSE S2
Assume X = 0 on processor P1 and X != 0 on processor P2
Processor P1 executes S1 in parallel with processor P2 executing S2 (this cannot happen on SIMD)
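The IF example can be mimicked in software with two workers running the same function on different local data. This is only an illustration using threads from Python's standard library, not a real MIMD machine. The names `program` and `run_spmd` are invented, and S1 and S2 are represented by strings.

```python
from concurrent.futures import ThreadPoolExecutor


def program(x):
    """The single program every worker runs; the branch taken depends on local data."""
    if x == 0:
        return "S1"
    return "S2"


def run_spmd(local_x):
    """Run the same program once per worker, each with its own value of X."""
    with ThreadPoolExecutor(max_workers=len(local_x)) as pool:
        return list(pool.map(program, local_x))


# P1 holds X = 0, P2 holds X != 0.
results = run_spmd([0, 7])
print(results)
```

Each worker follows its own control flow through the same code, which is what distinguishes SPMD from SIMD, where all processors execute the same instruction in lockstep.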