(1)

© 2004 Mark D. Hill Wisconsin Multifacet Project

A Future for Parallel Computer Architectures

Mark D. Hill

Computer Sciences Department University of Wisconsin—Madison

Multifacet Project (www.cs.wisc.edu/multifacet)

August 2004

(2)

Summary

• Issues

– Moore’s Law, etc.

– Instruction Level Parallelism for More Performance

– But Memory Latency Longer (e.g., 200 FP multiplies)

• Must Exploit Memory Level Parallelism

– At Thread: Runahead & Continual Flow Pipeline

– At Processor: Simultaneous Multithreading

(3)

Outline

• Computer Architecture Drivers

– Moore’s Law, Microprocessors, & Caching

• Instruction Level Parallelism (ILP) Review

• Memory Level Parallelism (MLP)

• Improving MLP of Thread

• Improving MLP of a Core or Chip

(4)

(5)

What If Your Salary?

• Parameters

– $16 base

– 59% growth/year

– 40 years

• Initially $16 → buy book

• 3rd year’s $64 → buy computer game

• 16th year’s $27,000 → buy car

• 22nd year’s $430,000 → buy house

• 40th year’s > billion dollars → buy a lot
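The figures on this slide follow from compound growth. A minimal sketch, assuming the salary after n years is base × 1.59^n (the function name `salary` is illustrative and not from the talk):

```python
def salary(years, base=16.0, growth=0.59):
    """Salary in dollars after `years` years of compound growth from `base`."""
    return base * (1.0 + growth) ** years


# Print the milestones named on the slide.
for years in (0, 3, 16, 22, 40):
    print(f"year {years:2d}: ${salary(years):,.0f}")
```

Under this assumption the 3rd, 16th, and 22nd years land close to the slide's $64, $27,000, and $430,000, and the 40th year exceeds a billion dollars.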

(6)

Microprocessor

• First Microprocessor in 1971

– Processor on one chip

– Intel 4004

– 2300 transistors

– Barely a processor

– Could access 300 bytes of memory (0.0003 megabytes)

(7)

Other “Moore’s Laws”

• Other technologies improving rapidly

– Magnetic disk capacity

– DRAM capacity

– Fiber-optic network bandwidth

• Other aspects improving slowly

– Delay to memory

– Delay to disk

– Delay across networks

• Computer Implementor’s Challenge

– Design with dissimilarly expanding resources

– To double computer performance every two years

(8)

Caching & Memory Hierarchies, cont.

• VAX-11/780

– Time for 1 Instruction = Time for 1 Memory Access

• Now

– Time for 100s of Instructions = Time for 1 Memory Access

• Caching Applied Recursively

– Registers

– Level-one cache

– Level-two cache

– Memory

– Disk
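One way to see why caching is applied recursively is the standard average-access-time recurrence: each level costs its hit latency plus its miss rate times the cost of the next level. A minimal sketch; the latencies and miss rates below are hypothetical and are not taken from the talk:

```python
def average_access_time(levels, backing_latency):
    """Average access time for a hierarchy.

    levels: list of (hit_latency, miss_rate) pairs, closest level first.
    backing_latency: latency of the final level, which always hits.
    """
    if not levels:
        return backing_latency
    hit_latency, miss_rate = levels[0]
    # A miss at this level pays the average cost of the rest of the hierarchy.
    return hit_latency + miss_rate * average_access_time(levels[1:], backing_latency)


# Hypothetical numbers, in processor cycles: L1, then L2, then memory.
hierarchy = [(2, 0.05), (12, 0.20)]
print(average_access_time(hierarchy, backing_latency=200))
```

With these assumed numbers the average stays within a few cycles even though memory itself is 200 cycles away.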

(9)

Outline

• Instruction Level Parallelism (ILP) Review

– Pipelining & Out-of-Order

– Intel P3, P4, & Banias

(10)

Instruction Level Parallelism (ILP) 101

• Non-Pipelined (Faster via Bit Level Parallelism (BLP))

• Pipelined (ILP + BLP; 1st microprocessors RISC)

[Figure: two timing diagrams, instructions (Instrns) versus time, one for non-pipelined and one for pipelined execution]
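The benefit of pipelining, and of superscalar issue on top of it, can be estimated with idealized cycle counts. A minimal sketch, assuming every stage takes one cycle and there are no stalls, hazards, or cache misses (the function names and the 4-stage, 8-instruction example are hypothetical):

```python
import math


def cycles_non_pipelined(n, stages):
    """Each instruction occupies all stages before the next one starts."""
    return n * stages


def cycles_pipelined(n, stages):
    """One instruction enters per cycle; the last needs `stages` cycles to drain."""
    return stages + n - 1


def cycles_superscalar(n, stages, width):
    """Up to `width` instructions enter the pipeline per cycle."""
    return stages + math.ceil(n / width) - 1


n, stages = 8, 4
print(cycles_non_pipelined(n, stages))         # 32
print(cycles_pipelined(n, stages))             # 11
print(cycles_superscalar(n, stages, width=2))  # 7
```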

(11)

Instruction Level Parallelism 102

• SuperScalar (& Pipelined)

• Add Cache Misses in red

[Figure: two timing diagrams, instructions (Instrns) versus time, for superscalar execution without and with cache misses]

(12)

Instruction Level Parallelism 103

• Out-of-Order (& SuperScalar & Pipelined)

• In-order fetch, decode, rename, & issuing of instructions with good branch prediction

• Out-of-order speculative execution of instructions in “window”, honoring data dependencies

• In-order retirement, preserving sequential instruction semantics

[Figure: timing diagram, instructions (Instrns) versus time, for out-of-order execution]

(13)

Out-of-Order Example: Intel x86 P6 Core

• “CISC” Twist to Out-of-Order

– In-order front end cracks x86 instructions into micro-ops (like RISC instructions)

– Out-of-order execution

– In-order retirement of micro-ops in x86 instruction groups

• Used in Pentium Pro, II, & III

– 3-way superscalar of micro-ops

– 10-stage pipeline (for branch misprediction penalty)

– Sophisticated branch prediction

(14)

Pentium 4 Core [Hinton 2001]

• Follows basic approach of P6 core

• Trace Cache stores dynamic micro-op sequences

• 20-stage pipeline (for branch misprediction penalty)

• 128 active micro-ops (48 loads & 24 stores)

(15)

Intel Kills Pentium 4 Roadmap

• Why? I can speculate

• Too Much Power?

– More transistors

– Higher-frequency transistors

– Designed before power became first-order design constraint

• Too Little Performance?

– Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle

– For x86 (Instructions/Program fixed): Performance = Instructions/Cycle * Frequency
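The performance equation on this slide shows why a higher clock alone may not help: if a deeper pipeline raises frequency but lowers instructions per cycle, the product can stay flat. A minimal sketch; the two designs and their numbers are hypothetical and do not describe any real processor:

```python
def time_per_program(instructions, cycles_per_instruction, cycle_time_s):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cycles_per_instruction * cycle_time_s


def instructions_per_second(ipc, frequency_hz):
    """With a fixed instruction count, performance is IPC * frequency."""
    return ipc * frequency_hz


# Hypothetical designs: a fast clock with low IPC versus a slower clock with higher IPC.
deep_pipeline = instructions_per_second(ipc=1.0, frequency_hz=3.0e9)
short_pipeline = instructions_per_second(ipc=1.5, frequency_hz=2.0e9)
print(deep_pipeline, short_pipeline)
```

In this assumed case both designs retire the same number of instructions per second, while the higher-frequency one would typically burn more power.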

(16)

Pentium M / Banias [Gochman 2003]

• For laptops, but now more general

– Key: Feature must add 1% performance for 3% power

– Why: Increasing voltage for 1% perf. costs 3% power

• Techniques

– Enhance Intel SpeedStep™

– Shorter pipeline (more like P6)

– Better branch predictor (e.g., loops)

– Special handling of memory stack

– Fused micro-ops
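The "1% performance for 3% power" rule is consistent with the common approximation that dynamic power scales as C·V²·f. If frequency, and with it performance, scales roughly linearly with voltage, power grows as V³. A minimal sketch under that assumption; the function names and the reading of the rule as a pass/fail threshold are illustrative, not from the talk:

```python
def relative_power(voltage_scale):
    """Power relative to baseline when voltage and frequency scale together.

    Assumes P is proportional to C * V^2 * f and f is proportional to V,
    so P is proportional to V^3.
    """
    return voltage_scale ** 3


def feature_is_worthwhile(perf_gain, power_cost, ratio=3.0):
    """A feature should cost no more power than raising voltage would
    for the same performance gain (assumed reading of the slide's rule)."""
    return power_cost <= ratio * perf_gain


print(relative_power(1.01))               # about 1.03: 1% more voltage, ~3% more power
print(feature_is_worthwhile(0.01, 0.02))  # True
print(feature_is_worthwhile(0.01, 0.05))  # False
```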

(17)

What about Future for Intel & Others?

• Worry about power & energy (not this talk)

• Memory latency too great for out-of-order cores to tolerate (coming next)

(18)

Outline

• Memory Level Parallelism (MLP)

– Cause & Effect

(19)

Out-of-Order w/ Slower Off-Chip Misses

• Out-of-Order (& Super-Scalar & Pipelined)

• But Off-Chip Misses are now hundreds of cycles

[Figure: two timing diagrams, instructions (Instrns) versus time, with off-chip misses; one is labeled "Good Case!"]

(20)

Out-of-Order w/ Slower Off-Chip Misses

• More Realistic Case

• Why does yellow instruction block?

– Assumes 4-instruction window (maximum outstanding)

– Yellow instruction awaits “instruction - 4” (1st cache miss)

– Actual windows are 32-64 instructions, but L2 miss slower

• Key Insight: Memory-Level Parallelism (MLP) [Chou, Fahs, & Abraham, ISCA 2004]

[Figure: timing diagram, instructions (Instrns) versus time, with instructions I1, I2, I3, and I4 marked]

(21)

Out-of-Order & Memory Level Parallelism (MLP)

• Good Case

• Bad Case

Compute & Memory Phases

MLP = 2

(22)

MLP Model

• MLP = # Off-Chip Accesses / # Memory Phases

• Execution has Compute & Memory Phases

– Compute Phase largely overlaps Memory Phase

– In Limit as Memory Latency increases, …

• Compute Phase hidden by Memory Phase

– Execution Time = # Memory Phases * Memory Latency

• Execution Time = (# Off-Chip Accesses / MLP) * Memory Latency

(23)

MLP Action Items

• Execution Time = (# Off-Chip Accesses / MLP) * Memory Latency

• Reduce # Off-Chip Accesses

– E.g., better caches or compression (Multifacet)

• Reduce Memory Latency

– E.g., on-chip memory controller (AMD)

• Increase MLP (next slides)
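The three action items are the three terms of the model. Since MLP = # Off-Chip Accesses / # Memory Phases, the number of memory phases is # Off-Chip Accesses / MLP, and execution time in the memory-bound limit is that count times the memory latency. A minimal sketch with hypothetical workload numbers:

```python
def execution_time(off_chip_accesses, mlp, memory_latency):
    """Memory-bound execution time: (# accesses / MLP) * latency, in cycles."""
    return off_chip_accesses / mlp * memory_latency


# Hypothetical baseline: one million off-chip accesses, MLP of 1, 200-cycle latency.
baseline = execution_time(1_000_000, 1.0, 200)

fewer_accesses = execution_time(500_000, 1.0, 200)  # e.g., better caches or compression
lower_latency = execution_time(1_000_000, 1.0, 100) # e.g., on-chip memory controller
higher_mlp = execution_time(1_000_000, 2.0, 200)    # overlap two misses per memory phase

print(baseline, fewer_accesses, lower_latency, higher_mlp)
```

In this model, halving the accesses, halving the latency, or doubling MLP each halve execution time, which is why MLP deserves the same attention as the other two levers.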

(24)

What Limits MLP in Processor? [Chou et al.]

• Issue window and reorder buffer size

• Instruction fetch off-chip accesses

• Unresolvable mispredicted branches

• Load and branch issue restrictions

(25)

What Limits MLP in Program?

• Depending on data from off-chip memory accesses

• For addresses

– Bad: Pointer chasing with poor locality

– Good: Array where address calculation separate from data

• For unpredictable branch decisions

– Bad: Branching on data values with poor locality

– Good: Iterative loops with highly predictable branching

• But, as programmer, which accesses go off-chip?

• Also: very poor instruction locality

(26)

Outline

• Improving MLP of Thread

– Runahead, Continual Flow Pipeline

(27)

Runahead Example

• Base Out-of-Order, MLP = 1

• With Runahead, MLP = 2

[Figure: timing diagrams for instructions I1, I2, I3, and I4 with a 4-instrn window; steps labeled 1. Normal mode, 2. Checkpoint, 3. Runahead mode]
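The difference between the two cases can be captured with a toy model: misses overlap only if the later one is within reach of the oldest outstanding miss. For the base out-of-order core the reach is the instruction window; runahead extends it. A minimal sketch, assuming independent misses; the miss positions and the runahead reach of 16 instructions are hypothetical:

```python
def memory_phases(miss_positions, reach):
    """Count memory phases for misses at the given instruction positions.

    A miss starts a new phase unless it lies within `reach` instructions
    of the miss that opened the current phase.
    """
    phases = 0
    covered_until = float("-inf")
    for pos in sorted(miss_positions):
        if pos >= covered_until:
            phases += 1
            covered_until = pos + reach
    return phases


def mlp(miss_positions, reach):
    """MLP = # off-chip accesses / # memory phases (needs at least one miss)."""
    return len(miss_positions) / memory_phases(miss_positions, reach)


misses = [0, 6]              # two independent off-chip misses, 6 instructions apart
print(mlp(misses, reach=4))  # 1.0: the 4-instruction window cannot reach the second miss
print(mlp(misses, reach=16)) # 2.0: runahead reaches it and overlaps both misses
```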

(28)

Runahead Execution [Dundas ICS97, Mutlu HPCA03]

1. Execute normally until instruction M’s off-chip access blocks issue of more instructions

2. Checkpoint processor

3. Discard instruction M, set M’s destination register to poisoned, & speculatively Runahead

– Instructions propagate poison from source to destination

– Seek off-chip accesses to start prefetches & increase MLP

4. Restore checkpoint when off-chip access M returns
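The poison propagation in step 3 can be sketched as a pass over the instructions that follow M. This is a simplified illustration, not the mechanism from the cited papers: the `Instr` record, register names, and the `off_chip` flag are hypothetical, and checkpoint and restore are left out.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str               # destination register
    srcs: tuple = ()        # source registers
    off_chip: bool = False  # True if this is a load that misses off-chip


def runahead(instrs, miss_index):
    """Return indices of later off-chip loads that runahead can prefetch."""
    poisoned = {instrs[miss_index].dest}  # M's destination starts poisoned
    prefetches = []
    for i in range(miss_index + 1, len(instrs)):
        ins = instrs[i]
        if any(src in poisoned for src in ins.srcs):
            poisoned.add(ins.dest)        # poison flows from source to destination
        elif ins.off_chip:
            prefetches.append(i)          # address is valid, so start the access now
            poisoned.add(ins.dest)        # its data has not arrived yet either
        else:
            poisoned.discard(ins.dest)    # destination overwritten with a valid value
    return prefetches


program = [
    Instr("r1", ("r0",), off_chip=True),  # M: the blocking miss
    Instr("r2", ("r1",)),                 # depends on M
    Instr("r3", ("r2",), off_chip=True),  # pointer chase: address depends on M
    Instr("r4", ("r0",), off_chip=True),  # independent miss
    Instr("r5", ("r4",)),                 # depends on the independent miss
]
print(runahead(program, 0))  # [3]
```

Only the independent miss is prefetched; the pointer-chasing load cannot be, because its address is poisoned.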

(29)

Continual Flow Pipeline [Srinivasan ASPLOS04]

Simplified Example

Have off-chip access M free many resources, but SAVE

Keep decoding instructions

SAVE instructions dependent on M

Execute instructions independent of M

(30)

Implications of Runahead & Continual Flow

• Runahead

– Discards dependent instructions

– Speculatively executes independent instructions

– When miss returns, re-executes dependent & independent instrns

• Continual Flow Pipeline

– Saves dependent instructions

– Executes independent instructions

– When miss returns, executes only saved dependent instructions

• Assessment

– Both allow MLP to break past window limits

– Both limited by branch prediction accuracy on unresolved branches

– Continual Flow Pipeline sounds even more appealing
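The difference in work after the miss returns can be made concrete by counting instructions. A minimal sketch, assuming each instruction following the miss is simply tagged as dependent on it or not (the function name and the example stream are hypothetical):

```python
def work_after_miss_returns(depends_on_miss):
    """Instructions executed once the miss returns, per scheme.

    depends_on_miss: one bool per instruction that followed the miss.
    """
    # Runahead restores the checkpoint, so everything after the miss runs again.
    runahead_work = len(depends_on_miss)
    # Continual Flow Pipeline runs only the dependent instructions it saved.
    cfp_work = sum(depends_on_miss)
    return runahead_work, cfp_work


stream = [True, False, False, True, False, False, False, False]
print(work_after_miss_returns(stream))  # (8, 2)
```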

(31)

Outline

• Improving MLP of a Core or Chip

– Core: Simultaneous Multithreading

– Chip: Chip Multiprocessing

(32)

Getting MLP from Thread Level Parallelism

• Runahead & Continual Flow seek MLP for Thread

• More MLP for Processor?

– More parallel off-chip accesses for a processor?

– Yes: Simultaneous Multithreading

• More MLP for Chip?

– More parallel off-chip accesses for a chip?

– Yes: Chip Multiprocessing

(33)

Simultaneous Multithreading [U Washington]

• Turn a physical processor into S logical processors

• Need S copies of architectural state, S=2, 4, (8?)

– PC, Registers, PSW, etc. (small!)

• Completely share

– Caches, functional units, & datapaths

• Manage via threshold sharing, partition, etc.

– Physical registers, issue queue, & reorder buffer

• Intel calls it Hyper-Threading in Pentium 4

(34)

Simultaneous Multithreading Assessment

• Programming

– Supports finer-grained sharing than old-style MP

– But gains less than S and S is small

• Have Multi-Threaded Workload

– Hides off-chip latencies better than Runahead

– E.g., 4 threads w/ MLP 1.5 each → MLP = 6

• Have Single-Threaded Workload

– Base SMT No Help

– Many “Helper Thread” Ideas

• Expect SMT in processors for servers

(35)

Want to Spend More Transistors

• Not worthwhile to spend it all on cache

• Replicate Processor

• Private L1 Caches

– Low latency

– High bandwidth

• Shared L2 Cache

(36)

Piranha Processing Node

Alpha core: 1-issue, in-order, 500MHz

CPU

Next few slides from Luiz Barroso’s ISCA 2000 presentation of Piranha: A Scalable Architecture

(37)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

(38)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

(39)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

Intra-chip switch (ICS) 32GB/sec, 1-cycle delay

L2 cache:

(40)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

L2 cache:

shared, 1MB, 8-way

Memory Controller (MC)

RDRAM, 12.8GB/sec

[Figure: Piranha chip block diagram: eight CPUs, each with its own I$ and D$, connected through the ICS to eight L2$ banks, each bank with its own MEM-CTL]

8 banks

(41)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

L2 cache:

shared, 1MB, 8-way

RDRAM, 12.8GB/sec

Protocol Engines (HE & RE) prog., 1K instr.,

even/odd interleaving D$ I$ L2$ ICS CPU D$ I$ L2$ L2$ CPU D$ I$ CPU D$ I$ L2$ CPU D$ I$ L2$ CPU D$ I$ L2$ L2$ CPU D$ I$ L2$ CPU D$ I$ MEM-CTL MEM-CTL

MEM-CTL MEM-CTL MEM-CTL MEM-CTL MEM-CTL MEM-CTL

(42)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

L2 cache:

shared, 1MB, 8-way

RDRAM, 12.8GB/sec

Protocol Engines (HE & RE): prog., 1K instr.,

even/odd interleaving

System Interconnect:

4-port Xbar router, topology independent, 32GB/sec total bandwidth

(43)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

L2 cache:

shared, 1MB, 8-way

RDRAM, 12.8GB/sec

Protocol Engines (HE & RE): prog., 1K instr.,

even/odd interleaving

System Interconnect:

4-port Xbar router, topology independent, 32GB/sec total bandwidth

(44)

[Figure: bar chart of Normalized Execution Time, each bar split into CPU, L2Hit, and L2Miss]

Configuration              OLTP   DSS
P1  (500 MHz, 1-issue)      233   350
INO (1 GHz, 1-issue)        145   191
OOO (1 GHz, 4-issue)        100   100
P8  (500 MHz, 1-issue)       34    44

• Piranha’s performance margin 3x for OLTP and 2.2x for DSS

• Piranha has more outstanding misses → better utilizes memory system
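The quoted margins can be checked against the chart's bar labels. A minimal sketch, assuming the labels read in the order they appear in the chart (P1, INO, OOO, P8 for OLTP, then the same for DSS) and that the margin is measured against the out-of-order (OOO) design:

```python
# Normalized execution time per configuration, as read from the chart labels.
normalized_time = {
    "OLTP": {"P1": 233, "INO": 145, "OOO": 100, "P8": 34},
    "DSS": {"P1": 350, "INO": 191, "OOO": 100, "P8": 44},
}


def speedup(workload, baseline="OOO", design="P8"):
    """Ratio of baseline time to design time for one workload."""
    times = normalized_time[workload]
    return times[baseline] / times[design]


for workload in normalized_time:
    print(f"{workload}: {speedup(workload):.2f}x")
```

Under this reading the ratios come out near the 3x and 2.2x stated on the slide.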

(45)

Chip Multiprocessing Assessment: Servers

• Programming

– Supports finer-grained sharing than old-style MP

– But not as fine as SMT (yet)

– Many cores can make performance gain large

• Can Yield MLP for Chip!

– Can do CMP of SMT processors

– C cores of S-way SMT with T-way MLP per thread

– Yields Chip MLP of C*S*T (e.g., 8*2*2 = 32)

• Most Servers have Multi-Threaded Workload

• CMP is a Server Inflection Point
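The chip-level MLP estimate is a simple product, which also covers the earlier SMT example as the single-core case. A minimal sketch; the function name is illustrative, and the model assumes every thread sustains its per-thread MLP at the same time:

```python
def chip_mlp(cores, smt_threads_per_core, mlp_per_thread):
    """Chip MLP = C * S * T, assuming all threads overlap their misses fully."""
    return cores * smt_threads_per_core * mlp_per_thread


print(chip_mlp(8, 2, 2))    # 32: the CMP-of-SMT example
print(chip_mlp(1, 4, 1.5))  # 6.0: one SMT core with 4 threads at MLP 1.5 each
```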

(46)

Chip Multiprocessing Assessment: Clients

• Most Clients (Today) have Single-Threaded Workload

– Base CMP No Help

– Use Thread Level Speculation?

– Use Helper Threads?

• CMPs for Clients?

– Depends on Threads

(47)

Outline

• CMP Systems

– Small, Medium, but Not Large

(48)

Small CMP Systems

• Use One CMP (with C cores of S-way SMT)

– C starts 2-4 and grows to 16-ish

– S starts at 2, may stay at 2 or grow to 4

– Fits on your desk!

• Directly Connect CMP (C) to Memory Controller (M) or DRAM

• If Threads Useful

– >10X Performance vs. Uniprocessor

– >>10X Cost-Performance vs. non-CMP SMP

• Commodity Server!

[Figure: one CMP (C) directly connected to a memory controller (M)]

(49)

Medium CMP Systems

• Use 2-16 CMPs (with C cores of S-way SMT)

– Small: 4*4*2 = 32

– Large: 16*16*4 = 1024

• Connect CMPs & Memory Controllers (or DRAM)

[Figure: two ways to connect CMPs (C) and memory controllers (M): Processor-Centric and Memory-Centric]

(50)

Large CMP Systems?

• 1000s of CMPs?

• Will not happen in the commercial market

– Instead will network CMP systems into clusters

– Enhances availability & reduces cost

– Poor latency acceptable

• Market for large scientific machines probably ~$0 Billion

• Market for large government machines similar

– Nevertheless, government can make this happen (like bombers)

• The rest of us will use

– a small- or medium-CMP system

(51)

Wisconsin Multifacet (www.cs.wisc.edu/multifacet)

• Designing Commercial Servers

• Availability: SafetyNet Checkpointing [ISCA 2002]

• Programmability: Flight Data Recorder [ISCA 2003]

• Methods: Simulating a $2M Server on a $2K PC [Computer 2003]

• Performance: Cache Compression [ISCA 2004]

(52)

Token Coherence [IEEE MICRO Top Picks 03]

• Coherence Invariant (for any memory block at any time):

– One writer or multiple readers

• Implemented with distributed Finite State Machines

• Indirectly enforced (bus order, acks, blocking, etc.)

• Token Coherence Directly Enforces

– Each memory block has T tokens

– Token count stored with data (even in messages)

– Processor needs all T tokens to write

– Processor needs at least one token to read

• Last year: Glueless Multiprocessor

– Speedup 17-54% vs directory

• This Year: Medium CMP Systems

– Flat for correctness
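The token-counting rules above can be sketched as a small model of a single memory block. This is a simplified illustration of the invariant only; the class and method names are hypothetical, and details of the real protocol such as data transfer, the owner token, and request handling are left out.

```python
class TokenBlock:
    """Token counts for one memory block; memory starts with all T tokens."""

    def __init__(self, total_tokens, processors):
        self.total = total_tokens
        self.tokens = {"memory": total_tokens}
        self.tokens.update({p: 0 for p in processors})

    def transfer(self, src, dst, count):
        """Move tokens between holders; tokens are never created or destroyed."""
        if count < 0 or self.tokens[src] < count:
            raise ValueError("cannot send tokens the sender does not hold")
        self.tokens[src] -= count
        self.tokens[dst] += count
        assert sum(self.tokens.values()) == self.total

    def can_read(self, processor):
        return self.tokens[processor] >= 1

    def can_write(self, processor):
        return self.tokens[processor] == self.total


block = TokenBlock(total_tokens=4, processors=["P0", "P1"])
block.transfer("memory", "P0", 1)
block.transfer("memory", "P1", 1)
print(block.can_read("P0"), block.can_read("P1"), block.can_write("P0"))  # True True False

# P0 gathers every token before writing, which leaves P1 unable to read.
block.transfer("memory", "P0", 2)
block.transfer("P1", "P0", 1)
print(block.can_write("P0"), block.can_read("P1"))  # True False
```

Because a writer must hold all T tokens and a reader at least one, a writer and any other reader or writer cannot coexist, which gives the "one writer or multiple readers" invariant directly.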

(53)

Conclusions

Must Exploit Memory Level Parallelism!

At Thread: Runahead & Continual Flow Pipeline

At Processor: Simultaneous Multithreading

At Chip: Chip Multiprocessing