icpp04_keynote.ppt 991KB Jun 23 2011 12:32:20 PM

© 2004 Mark D. Hill Wisconsin Multifacet Project

A Future for Parallel Computer Architectures

Mark D. Hill

Computer Sciences Department University of Wisconsin—Madison

Multifacet Project (www.cs.wisc.edu/multifacet)

August 2004

Summary

• Issues

– Moore’s Law, etc.

– Instruction Level Parallelism for More Performance

– But Memory Latency Longer (e.g., 200 FP multiplies)

• Must Exploit Memory Level Parallelism

– At Thread: Runahead & Continual Flow Pipeline

– At Processor: Simultaneous Multithreading

Outline

• Computer Architecture Drivers

– Moore’s Law, Microprocessors, & Caching

• Instruction Level Parallelism (ILP) Review

• Memory Level Parallelism (MLP)

• Improving MLP of Thread

• Improving MLP of a Core or Chip

What If Your Salary?

• Parameters

– $16 base

– 59% growth/year

– 40 years

• Initially $16 → buy book

• 3rd year’s $64 → buy computer game

• 16th year’s $27,000 → buy car

• 22nd year’s $430,000 → buy house

• 40th year’s > billion dollars → buy a lot
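The salary figures above follow from compounding the $16 base at 59% per year. A minimal sketch of that arithmetic (the function name `salary` is only illustrative, not from the talk):

```python
def salary(year, base=16.0, growth=0.59):
    """Salary after `year` years of compounding at `growth` per year."""
    return base * (1.0 + growth) ** year


# Print the milestones listed on the slide.
for year in (0, 3, 16, 22, 40):
    print(f"year {year:2d}: ${salary(year):,.0f}")
```

Running it should give values close to the rounded figures on the slide, with year 40 landing above one billion dollars.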

Microprocessor

• First Microprocessor in 1971

– Processor on one chip

– Intel 4004

– 2300 transistors

– Barely a processor

– Could access 300 bytes of memory (0.0003 megabytes)

Other “Moore’s Laws”

• Other technologies improving rapidly

– Magnetic disk capacity

– DRAM capacity

– Fiber-optic network bandwidth

• Other aspects improving slowly

– Delay to memory

– Delay to disk

– Delay across networks

• Computer Implementor’s Challenge

– Design with dissimilarly expanding resources

– To double computer performance every two years

Caching & Memory Hierarchies, cont.

• VAX-11/780

– 1 Instruction Time = 1 Memory Access

• Now

– 100s of Instruction Times = 1 Memory Access

• Caching Applied Recursively

– Registers

– Level-one cache

– Level-two cache

– Memory

– Disk
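One way to see why caching is applied recursively is to compute the average access time of a hierarchy level by level. The sketch below is illustrative only: the hit times and miss rates are hypothetical numbers, not measurements from the talk, and the function name `access_time` is an assumption.

```python
def access_time(levels, memory_time):
    """Average access time of a cache hierarchy, computed recursively.

    levels: list of (hit_time, miss_rate) pairs, nearest level first.
    memory_time: access time of the backing store behind the last level.
    """
    if not levels:
        return memory_time
    hit_time, miss_rate = levels[0]
    # A miss at this level pays the average access time of everything below it.
    return hit_time + miss_rate * access_time(levels[1:], memory_time)


# Hypothetical numbers (in cycles): L1 then L2, backed by 200-cycle memory.
hierarchy = [(1, 0.05), (10, 0.20)]
print(access_time(hierarchy, 200))
```

Each level hides most of the latency of the level behind it, which is why the same idea repeats from registers down to disk.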

Outline

• Computer Architecture Drivers

• Instruction Level Parallelism (ILP) Review

– Pipelining & Out-of-Order

– Intel P3, P4, & Banias

• Memory Level Parallelism (MLP)

• Improving MLP of Thread

• Improving MLP of a Core or Chip

Instruction Level Parallelism (ILP) 101

• Non-Pipelined (Faster via Bit Level Parallelism (BLP))

• Pipelined (ILP + BLP; 1st microprocessors RISC)

[Figure: instructions (vertical axis) versus time (horizontal axis), one diagram for the non-pipelined case and one for the pipelined case]

Instruction Level Parallelism 102

• SuperScalar (& Pipelined)

• Add Cache Misses in red

[Figure: instructions (vertical axis) versus time (horizontal axis), one diagram for superscalar execution and one with cache misses added in red]

Instruction Level Parallelism 103

• Out-of-Order (& SuperScalar & Pipelined)

• In-order fetch, decode, rename, & issuing of instructions with good branch prediction

• Out-of-order speculative execution of instructions in “window”, honoring data dependencies

• In-order retirement, preserving sequential instruction semantics

[Figure: instructions (vertical axis) versus time (horizontal axis) for out-of-order execution]

Out-of-Order Example: Intel x86 P6 Core

• “CISC” Twist to Out-of-Order

– In-order front end cracks x86 instructions into micro-ops (like RISC instructions)

– Out-of-order execution

– In-order retirement of micro-ops in x86 instruction groups

• Used in Pentium Pro, II, & III

– 3-way superscalar of micro-ops

– 10-stage pipeline (for branch misprediction penalty)

– Sophisticated branch prediction

Pentium 4 Core [Hinton 2001]

• Follow basic approach of P6 core

• Trace Cache stores dynamic micro-op sequences

• 20-stage pipeline (for branch misprediction penalty)

• 128 active micro-ops (48 loads & 24 stores)

Intel Kills Pentium 4 Roadmap

• Why? I can speculate

• Too Much Power?

– More transistors

– Higher-frequency transistors

– Designed before power became first-order design constraint

• Too Little Performance?

– Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle

– For x86 (Instructions/Program fixed): Performance = Instructions/Cycle * Frequency
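The performance equation above can be sketched in a few lines. The two designs compared here are hypothetical numbers chosen only to illustrate that a higher clock frequency does not help if instructions per cycle drop too far; they are not figures for any real processor.

```python
def time_per_program(instructions, cycles_per_instruction, frequency_hz):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cycles_per_instruction / frequency_hz


def performance(instructions_per_cycle, frequency_hz):
    """Instructions per second for a fixed instruction set."""
    return instructions_per_cycle * frequency_hz


# Hypothetical designs: a shorter pipeline at 2 GHz vs. a deeper one at 3 GHz.
short_pipe = performance(instructions_per_cycle=1.0, frequency_hz=2e9)
deep_pipe = performance(instructions_per_cycle=0.6, frequency_hz=3e9)
print(short_pipe, deep_pipe)
print(time_per_program(1e9, 1.0, 2e9))
```

With these assumed values the lower-frequency design finishes sooner, which is the kind of trade-off the slide speculates about.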

Pentium M / Banias [Gochman 2003]

• For laptops, but now more general

– Key: Feature must add at least 1% performance per 3% power

– Why: Increasing voltage for 1% perf. costs 3% power

• Techniques

– Enhance Intel SpeedStep™

– Shorter pipeline (more like P6)

– Better branch predictor (e.g., loops)

– Special handling of memory stack

– Fused micro-ops

What about Future for Intel & Others?

• Worry about power & energy (not this talk)

• Memory latency too great for out-of-order cores to tolerate (coming next)

Outline

• Computer Architecture Drivers

• Instruction Level Parallelism (ILP) Review

• Memory Level Parallelism (MLP)

– Cause & Effect

• Improving MLP of Thread

• Improving MLP of a Core or Chip

Out-of-Order w/ Slower Off-Chip Misses

• Out-of-Order (& Super-Scalar & Pipelined)

• But Off-Chip Misses are now hundreds of cycles

[Figure: instructions (vertical axis) versus time (horizontal axis) for out-of-order execution with off-chip misses; annotated "Good Case!"]

Out-of-Order w/ Slower Off-Chip Misses

• More Realistic Case

• Why does yellow instruction block?

– Assumes 4-instruction window (maximum outstanding)

– Yellow instruction awaits “instruction - 4” (1st cache miss)

– Actual windows are 32-64 instructions, but L2 miss slower

• Key Insight: Memory-Level Parallelism (MLP) [Chou, Fahs, & Abraham, ISCA 2004]
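The window limit described above can be illustrated with a toy model. This is a sketch under simplifying assumptions that are not from the talk: every miss is independent, retirement is in order, and a miss can overlap with the oldest outstanding miss only if it lies fewer than `window` instructions after it. The function names are illustrative.

```python
def memory_phases(miss_positions, window):
    """Group off-chip misses into memory phases for a given window size.

    miss_positions: instruction indices of off-chip misses.
    window: maximum number of instructions in flight.
    """
    phases = []
    for pos in sorted(miss_positions):
        # A miss joins the current phase only if it fits in the window
        # headed by the oldest miss of that phase.
        if phases and pos < phases[-1][0] + window:
            phases[-1].append(pos)
        else:
            phases.append([pos])
    return phases


def mlp(miss_positions, window):
    """MLP = number of off-chip accesses / number of memory phases."""
    phases = memory_phases(miss_positions, window)
    return len(miss_positions) / len(phases) if phases else 0.0


misses = [0, 4, 8, 12]           # hypothetical: one miss every 4 instructions
print(mlp(misses, window=4))     # each miss waits for the one 4 instructions earlier
print(mlp(misses, window=64))    # a larger window lets all four overlap
```

In this model a miss four instructions behind the oldest miss cannot enter a 4-instruction window, so the misses are serialized, matching the blocked instruction on the slide.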

[Figure: instructions (vertical axis) versus time (horizontal axis), with instructions labeled I1, I2, I3, and I4]

Out-of-Order & Memory Level Parallelism (MLP)

• Good Case

• Bad Case

Compute & Memory Phases

Compute & Memory Phases

MLP = 2

MLP Model

• MLP = # Off-Chip Accesses / # Memory Phases

• Execution has Compute & Memory Phases

– Compute Phase largely overlaps Memory Phase

– In Limit as Memory Latency increases, …

• Compute Phase hidden by Memory Phase

– Execution Time = # Memory Phases * Memory Latency

• Execution Time = (# Off-Chip Accesses / MLP) * Memory Latency

MLP Action Items

• Execution Time =

(# Off-Chip Accesses / MLP) * Memory Latency

• Reduce # Off-Chip Accesses

– E.g., better caches or compression (Multifacet)

• Reduce Memory Latency

– E.g., on-chip memory controller (AMD)

• Increase MLP (next slides)
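The three action items correspond to the three terms of the MLP model: since MLP = # Off-Chip Accesses / # Memory Phases, the number of memory phases is # Off-Chip Accesses / MLP. A small sketch, with hypothetical workload numbers chosen only for illustration:

```python
def execution_time(off_chip_accesses, mlp, memory_latency):
    """Memory-bound execution time in the limit where compute is hidden.

    # Memory Phases = # Off-Chip Accesses / MLP
    Execution Time  = # Memory Phases * Memory Latency
    """
    memory_phases = off_chip_accesses / mlp
    return memory_phases * memory_latency


# Hypothetical workload: 1,000,000 off-chip accesses, 200-cycle latency.
base = execution_time(1_000_000, mlp=1.0, memory_latency=200)
fewer_accesses = execution_time(500_000, mlp=1.0, memory_latency=200)
lower_latency = execution_time(1_000_000, mlp=1.0, memory_latency=100)
more_mlp = execution_time(1_000_000, mlp=2.0, memory_latency=200)
print(base, fewer_accesses, lower_latency, more_mlp)
```

In this model, halving the accesses, halving the latency, or doubling MLP each halve the execution time.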

What Limits MLP in Processor? [Chou et al.]

• Issue window and reorder buffer size

• Instruction fetch off-chip accesses

• Unresolvable mispredicted branches

• Load and branch issue restrictions

What Limits MLP in Program?

• Depending on data from off-chip memory accesses

• For addresses

– Bad: Pointer chasing with poor locality

– Good: Array where address calculation separate from data

• For unpredictable branch decisions

– Bad: Branching on data values with poor locality

– Good: Iterative loops with highly predictable branching

• But, as programmer, which accesses go off-chip?

• Also: very poor instruction locality

Outline

• Computer Architecture Drivers

• Instruction Level Parallelism (ILP) Review

• Memory Level Parallelism (MLP)

• Improving MLP of Thread

– Runahead, Continual Flow Pipeline

• Improving MLP of a Core or Chip

Runahead Example

• Base Out-of-Order, MLP = 1

• With Runahead, MLP = 2

[Figure: timeline of instructions I1, I2, I3, and I4 with a 4-instrn window, annotated 1. Normal mode, 2. Checkpoint, 3. Runahead mode]

Runahead Execution [Dundas ICS97, Mutlu HPCA03]

1. Execute normally until instruction M’s off-chip access blocks issue of more instructions

2. Checkpoint processor

3. Discard instruction M, set M’s destination register to poisoned, & speculatively Runahead

– Instructions propagate poisoned from source to destination

– Seek off-chip accesses to start prefetches & increase MLP

4. Restore checkpoint when off-chip access M returns
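The poison propagation in step 3 can be sketched as a toy model. This is only an illustration, not the hardware mechanism from the cited papers: the `Instr` record, the register names, and the `depth` parameter are hypothetical, and the sketch assumes that a load that goes off-chip during runahead also leaves its destination poisoned because its data has not returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dst: str
    srcs: tuple = ()
    off_chip_load: bool = False


def runahead_prefetches(program, miss_index, depth):
    """Names of off-chip loads started as prefetches while running ahead.

    program: list of Instr in program order.
    miss_index: index of the blocking off-chip access M.
    depth: how many instructions past M are examined during runahead.
    """
    poisoned = {program[miss_index].dst}   # step 3: M's destination is poisoned
    prefetches = []
    for instr in program[miss_index + 1 : miss_index + 1 + depth]:
        if any(src in poisoned for src in instr.srcs):
            poisoned.add(instr.dst)        # propagate poison from source to destination
        elif instr.off_chip_load:
            prefetches.append(instr.name)  # address is valid: start the access early
            poisoned.add(instr.dst)        # assumption: its data is not back yet
        else:
            poisoned.discard(instr.dst)    # a valid result overwrites old poison
    return prefetches


program = [
    Instr("M", "r1", ("r0",), off_chip_load=True),
    Instr("I2", "r2", ("r1",)),                      # depends on M
    Instr("I3", "r3", ("r2",), off_chip_load=True),  # address depends on M
    Instr("I4", "r4", ("r5",), off_chip_load=True),  # independent of M
    Instr("I5", "r6", ("r4",)),
    Instr("I6", "r7", ("r8",), off_chip_load=True),  # independent of M
]
print(runahead_prefetches(program, miss_index=0, depth=5))
```

In this example the loads whose addresses do not depend on M overlap with M's miss, while the pointer-chasing load I3 cannot be prefetched.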

Continual Flow Pipeline [Srinivasan ASPLOS04]

Simplified Example

Have off-chip access M free many resources, but SAVE

Keep decoding instructions

SAVE instructions dependent on M

Execute instructions independent of M

Implications of Runahead & Continual Flow

• Runahead

– Discards dependent instructions

– Speculatively executes independent instructions

– When miss returns, re-executes dependent & independent instrns

• Continual Flow Pipeline

– Saves dependent instructions

– Executes independent instructions

– When miss returns, executes only saved dependent instructions

• Assessment

– Both allow MLP to break past window limits

– Both limited by branch prediction accuracy on unresolved branches

– Continual Flow Pipeline sounds even more appealing

Outline

• Computer Architecture Drivers

• Instruction Level Parallelism (ILP) Review

• Memory Level Parallelism (MLP)

• Improving MLP of Thread

• Improving MLP of a Core or Chip

– Core: Simultaneous Multithreading

– Chip: Chip Multiprocessing

Getting MLP from Thread Level Parallelism

• Runahead & Continual Flow seek MLP for Thread

• More MLP for Processor?

– More parallel off-chip accesses for a processor?

– Yes: Simultaneous Multithreading

• More MLP for Chip?

– More parallel off-chip accesses for a chip?

– Yes: Chip Multiprocessing

Simultaneous Multithreading [U Washington]

• Turn a physical processor into S logical processors

• Need S copies of architectural state, S=2, 4, (8?)

– PC, Registers, PSW, etc. (small!)

• Completely share

– Caches, functional units, & datapaths

• Manage via threshold sharing, partition, etc.

– Physical registers, issue queue, & reorder buffer

• Intel calls it Hyper-Threading in Pentium 4

Simultaneous Multithreading Assessment

• Programming

– Supports finer-grained sharing than old-style MP

– But gains less than S and S is small

• Have Multi-Threaded Workload

– Hides off-chip latencies better than Runahead

– E.g., 4 threads w/ MLP 1.5 each → MLP = 6

• Have Single-Threaded Workload

– Base SMT No Help

– Many “Helper Thread” Ideas

• Expect SMT in processors for servers

Want to Spend More Transistors

• Not worthwhile to spend it all on cache

• Replicate Processor

• Private L1 Caches

– Low latency

– High bandwidth

• Shared L2 Cache

Piranha Processing Node

Next few slides from Luiz Barroso’s ISCA 2000 presentation of Piranha: A Scalable Architecture

• CPU: Alpha core, 1-issue, in-order, 500MHz

• L1 caches: I&D, 64KB, 2-way

• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay

• L2 cache: shared, 1MB, 8-way

• Memory Controller (MC): RDRAM, 12.8GB/sec

• Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving

• System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth

[Figure: block diagram of the Piranha chip, built up over several slides: CPUs, each with I$ and D$, connected through the ICS to L2$ banks and MEM-CTL blocks; diagram label "8 banks"]

[Figure: bar chart of Normalized Execution Time (vertical axis, 0 to 350), each bar split into CPU, L2Hit, and L2Miss components]

Configuration             OLTP   DSS
P1  (500 MHz, 1-issue)     233   350
INO (1GHz, 1-issue)        145   191
OOO (1GHz, 4-issue)        100   100
P8  (500MHz, 1-issue)       34    44

• Piranha’s performance margin 3x for OLTP and 2.2x for DSS

• Piranha has more outstanding misses → better utilizes memory system

Chip Multiprocessing Assessment: Servers

• Programming

– Supports finer-grained sharing than old-style MP

– But not as fine as SMT (yet)

– Many cores can make performance gain large

• Can Yield MLP for Chip!

– Can do CMP of SMT processors

– C cores of S-way SMT with T-way MLP per thread

– Yields Chip MLP of C*S*T (e.g., 8*2*2 = 32)

• Most Servers have Multi-Threaded Workload

• CMP is a Server Inflection Point
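The chip-level MLP arithmetic can be written as a one-line product. The sketch below treats C*S*T as an upper bound that assumes every hardware thread is busy and that threads do not contend for shared resources; that assumption and the function name `chip_mlp` are illustrative.

```python
def chip_mlp(cores, smt_ways, mlp_per_thread):
    """Chip MLP = C cores * S-way SMT * T-way MLP per thread."""
    return cores * smt_ways * mlp_per_thread


print(chip_mlp(cores=8, smt_ways=2, mlp_per_thread=2))    # the 8*2*2 example
print(chip_mlp(cores=1, smt_ways=4, mlp_per_thread=1.5))  # one SMT core, 4 threads at MLP 1.5
```

The second call reproduces the single-core SMT case of 4 threads with MLP 1.5 each.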

Chip Multiprocessing Assessment: Clients

• Most Clients (Today) have Single-Threaded Workload

– Base CMP No Help

– Use Thread Level Speculation?

– Use Helper Threads?

• CMPs for Clients?

– Depends on Threads

Outline

• Computer Architecture Drivers

• Instruction Level Parallelism (ILP) Review

• Memory Level Parallelism (MLP)

• Improving MLP of Thread

• Improving MLP of a Core or Chip

• CMP Systems

– Small, Medium, but Not Large

Small CMP Systems

• Use One CMP (with C cores of S-way SMT)

– C starts at 2-4 and grows to 16-ish

– S starts at 2, may stay at 2 or grow to 4

– Fits on your desk!

• Directly Connect CMP (C) to Memory Controller (M) or DRAM

• If Threads Useful

– >10X Performance vs. Uniprocessor

– >>10X Cost-Performance vs. non-CMP SMP

• Commodity Server!

[Figure: a CMP (C) connected directly to a memory controller (M)]

Medium CMP Systems

• Use 2-16 CMPs (with C cores of S-way SMT)

– Small: 4*4*2 = 32

– Large: 16*16*4 = 1024

• Connect CMPs & Memory Controllers (or DRAM)

[Figure: alternative arrangements of CMPs (C) and memory controllers (M), including Processor-Centric and Memory-Centric organizations]

Large CMP Systems?

• 1000s of CMPs?

• Will not happen in the commercial market

– Instead will network CMP systems into clusters

– Enhances availability & reduces cost

– Poor latency acceptable

• Market for large scientific machines probably ~$0 Billion

• Market for large government machines similar

– Nevertheless, government can make this happen (like bombers)

• The rest of us will use

– a small- or medium-CMP system

Wisconsin Multifacet (www.cs.wisc.edu/multifacet)

• Designing Commercial Servers

• Availability: SafetyNet Checkpointing [ISCA 2002]

• Programmability: Flight Data Recorder [ISCA 2003]

• Methods: Simulating a $2M Server on a $2K PC [Computer 2003]

• Performance: Cache Compression [ISCA 2004]

Token Coherence [IEEE MICRO Top Picks 03]

• Coherence Invariant (for any memory block at any time):

– One writer or multiple readers

• Implemented with distributed Finite State Machines

• Indirectly enforced (bus order, acks, blocking, etc.)

• Token Coherence Directly Enforces

– Each memory block has T tokens

– Token count stored with data (even in messages)

– Processor needs all T tokens to write

– Processor needs at least one token to read

• Last year: Glueless Multiprocessor

– Speedup 17-54% vs directory

• This Year: Medium CMP Systems

– Flat for correctness
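The token-counting rules above can be illustrated with a small model of a single memory block. This is a sketch of the invariant only, not the Token Coherence protocol itself: the class `TokenBlock`, the holder names, and the choice that memory initially holds all T tokens are assumptions made for the example.

```python
class TokenBlock:
    """Token counts for one memory block shared by several holders."""

    def __init__(self, total_tokens, holders):
        self.total = total_tokens
        self.tokens = dict.fromkeys(holders, 0)
        self.tokens["memory"] = total_tokens  # assumption: memory starts with all T tokens

    def transfer(self, src, dst, count):
        """Move tokens between holders; tokens are never created or destroyed."""
        if count < 0 or self.tokens[src] < count:
            raise ValueError("holder does not have that many tokens")
        self.tokens[src] -= count
        self.tokens[dst] += count
        assert sum(self.tokens.values()) == self.total

    def can_read(self, holder):
        return self.tokens[holder] >= 1           # at least one token to read

    def can_write(self, holder):
        return self.tokens[holder] == self.total  # all T tokens to write


block = TokenBlock(total_tokens=4, holders=["P0", "P1"])
block.transfer("memory", "P0", 1)
block.transfer("memory", "P1", 1)
print(block.can_read("P0"), block.can_read("P1"), block.can_write("P0"))

# P0 collects every token, so it may write and nobody else may read.
block.transfer("memory", "P0", 2)
block.transfer("P1", "P0", 1)
print(block.can_write("P0"), block.can_read("P1"))
```

Because a writer must hold all T tokens, no other holder can have the one token needed to read, which gives the one-writer-or-multiple-readers invariant directly.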

Conclusions

Must Exploit Memory Level Parallelism!

At Thread: Runahead & Continual Flow Pipeline

At Processor: Simultaneous Multithreading

At Chip: Chip Multiprocessing
