© 2004 Mark D. Hill Wisconsin Multifacet Project
A Future for
Parallel Computer Architectures
Mark D. Hill
Computer Sciences Department University of Wisconsin—Madison
Multifacet Project (www.cs.wisc.edu/multifacet)
August 2004
Summary
• Issues
– Moore’s Law, etc.
– Instruction Level Parallelism for More Performance
– But Memory Latency Longer (e.g., 200 FP multiplies)
• Must Exploit Memory Level Parallelism
– At Thread: Runahead & Continual Flow Pipeline
– At Processor: Simultaneous Multithreading
Outline
• Computer Architecture Drivers
– Moore’s Law, Microprocessors, & Caching
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
• Improving MLP of a Core or Chip
What If Your Salary?
• Parameters
– $16 base
– 59% growth/year
– 40 years
• Initially $16 → buy book
• 3rd year’s $64 → buy computer game
• 16th year’s $27,000 → buy car
• 22nd year’s $430,000 → buy house
• 40th year’s > billion dollars → buy a lot
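The figures on this slide follow from compounding the $16 base at 59% per year. A minimal sketch to reproduce them, assuming "Nth year" means after N years of growth (the name `salary_after` is illustrative and not from the talk):

```python
def salary_after(years, base=16.0, growth=0.59):
    """Salary after `years` of compounding at `growth` per year."""
    return base * (1.0 + growth) ** years


# Years called out on the slide: book, game, car, house, "a lot".
for years in (0, 3, 16, 22, 40):
    print(f"year {years:2d}: ${salary_after(years):,.0f}")
```

The same 59% per year is the growth rate the talk uses as its analogy for Moore's Law.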
Microprocessor
• First Microprocessor in 1971
– Processor on one chip
– Intel 4004
– 2300 transistors
– Barely a processor
– Could access 300 bytes of memory (0.0003 megabytes)
Other “Moore’s Laws”
• Other technologies improving rapidly
– Magnetic disk capacity
– DRAM capacity
– Fiber-optic network bandwidth
• Other aspects improving slowly
– Delay to memory
– Delay to disk
– Delay across networks
• Computer Implementor’s Challenge
– Design with dissimilarly expanding resources
– To double computer performance every two years
Caching & Memory Hierarchies, cont.
• VAX-11/780
– Time of 1 Instruction = Time of 1 Memory access
• Now
– Time of 100s of Instructions = Time of 1 Memory access
• Caching Applied Recursively
– Registers
– Level-one cache
– Level-two cache
– Memory
– Disk
Outline
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
– Pipelining & Out-of-Order
– Intel P3, P4, & Banias
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
• Improving MLP of a Core or Chip
Instruction Level Parallelism (ILP) 101
• Non-Pipelined (Faster via Bit Level Parallelism (BLP))
• Pipelined (ILP + BLP; 1st microprocessors RISC)
[Figures: one timing diagram per bullet; axes: Instrns vs. Time]
Instruction Level Parallelism 102
• SuperScalar (& Pipelined)
• Add Cache Misses in red
[Figures: one timing diagram per bullet; axes: Instrns vs. Time]
Instruction Level Parallelism 103
• Out-of-Order (& SuperScalar & Pipelined)
• In-order fetch, decode, rename, & issuing of instructions with good branch prediction
• Out-of-order speculative execution of instructions in “window”, honoring data dependencies
• In-order retirement, preserving sequential instruction semantics
[Figure: timing diagram; axes: Instrns vs. Time]
Out-of-Order Example: Intel x86 P6 Core
• “CISC” Twist to Out-of-Order
– In-order front end cracks x86 instructions into micro-ops (like RISC instructions)
– Out-of-order execution
– In-Order retirement of micro-ops in x86 instruction groups
• Used in Pentium Pro, II, & III
– 3-way superscalar of micro-ops
– 10-stage pipeline (for branch misprediction penalty)
– Sophisticated branch prediction
Pentium 4 Core [Hinton 2001]
• Follow basic approach of P6 core
• Trace Cache stores dynamic micro-op sequences
• 20-stage pipeline (for branch misprediction penalty)
• 128 active micro-ops (48 loads & 24 stores)
Intel Kills Pentium 4 Roadmap
• Why? I can speculate
• Too Much Power?
– More transistors
– Higher-frequency transistors
– Designed before power became first-order design constraint
• Too Little Performance?
– Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle
• For x86 (Instructions/Program fixed): Performance ∝ Instructions/Cycle * Frequency
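As a worked instance of the Time/Program equation above, the sketch below compares two hypothetical designs. Every number is invented for illustration and is not a measurement of any Intel core; the function names are also assumptions.

```python
def time_per_program(instructions, cycles_per_instruction, cycle_time_s):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cycles_per_instruction * cycle_time_s


def performance(instructions_per_cycle, frequency_hz):
    """For a fixed instruction count: instructions retired per second."""
    return instructions_per_cycle * frequency_hz


# Hypothetical designs running the same 1e9-instruction program.
shallow = time_per_program(1e9, 1.0, 1 / 2e9)  # CPI 1.0 at 2 GHz
deep = time_per_program(1e9, 1.6, 1 / 3e9)     # CPI 1.6 at 3 GHz
print(f"shallow pipeline: {shallow:.3f} s, deep pipeline: {deep:.3f} s")
```

With these made-up values the higher-frequency design is slower, because its frequency gain is smaller than its loss in instructions per cycle.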
Pentium M / Banias [Gochman 2003]
• For laptops, but now more general
– Key: Feature must add at least 1% performance per 3% power
– Why: Increasing voltage for 1% perf. costs 3% power
• Techniques
– Enhance Intel SpeedStep™
– Shorter pipeline (more like P6)
– Better branch predictor (e.g., loops)
– Special handling of memory stack
– Fused micro-ops
What about Future for Intel & Others?
• Worry about power & energy (not this talk)
• Memory latency too great for out-of-order cores to tolerate (coming next)
Outline
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
– Cause & Effect
• Improving MLP of Thread
• Improving MLP of a Core or Chip
Out-of-Order w/ Slower Off-Chip Misses
• Out-of-Order (& Super-Scalar & Pipelined)
• But Off-Chip Misses are now hundreds of cycles
[Figures: timing diagrams; axes: Instrns vs. Time; one labeled “Good Case!”]
Out-of-Order w/ Slower Off-Chip Misses
• More Realistic Case
• Why does yellow instruction block?
– Assumes 4-instruction window (maximum outstanding)
– Yellow instruction awaits “instruction - 4” (1st cache miss)
– Actual windows are 32-64 instructions, but L2 misses are slower
• Key Insight: Memory-Level Parallelism (MLP) [Chou, Fahs, & Abraham, ISCA 2004]
[Figure: timing diagram; axes: Instrns vs. Time; instructions labeled I1, I2, I3, I4]
Out-of-Order & Memory Level Parallelism (MLP)
• Good Case
• Bad Case
[Figures: Compute & Memory Phases for each case; one labeled MLP = 2]
MLP Model
• MLP = # Off-Chip Accesses / # Memory Phases
• Execution has Compute & Memory Phases
– Compute Phase largely overlaps Memory Phase
– In Limit as Memory Latency increases, …
• Compute Phase hidden by Memory Phase
– Execution Time = # Memory Phases * Memory Latency
• Execution Time = (# Off-Chip Accesses / MLP) * Memory Latency
MLP Action Items
• Execution Time = (# Off-Chip Accesses / MLP) * Memory Latency
• Reduce # Off-Chip Accesses
– E.g., better caches or compression (Multifacet)
• Reduce Memory Latency
– E.g., on-chip memory controller (AMD)
• Increase MLP (next slides)
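Since MLP = # Off-Chip Accesses / # Memory Phases, the number of memory phases is # Off-Chip Accesses / MLP, and each of the three action items shrinks one factor of the execution time. A minimal sketch of the limit model, with invented numbers and latency measured in cycles (the function name is an assumption):

```python
def execution_time(off_chip_accesses, mlp, memory_latency):
    """Limit model: time = number of memory phases * memory latency."""
    memory_phases = off_chip_accesses / mlp
    return memory_phases * memory_latency


# Hypothetical baseline: 1M off-chip accesses, MLP of 1, 200-cycle latency.
base = execution_time(1_000_000, 1.0, 200)
fewer_accesses = execution_time(500_000, 1.0, 200)   # e.g., better caches
lower_latency = execution_time(1_000_000, 1.0, 100)  # e.g., on-chip controller
more_mlp = execution_time(1_000_000, 2.0, 200)       # e.g., techniques that follow
print(base, fewer_accesses, lower_latency, more_mlp)
```

In this model, halving the accesses, halving the latency, or doubling MLP each halve execution time.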
What Limits MLP in Processor? [Chou et al.]
• Issue window and reorder buffer size
• Instruction fetch off-chip accesses
• Unresolvable mispredicted branches
• Load and branch issue restrictions
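To illustrate the first limit, here is a toy model of how window size caps MLP. It is a sketch under strong assumptions that are not from the talk: all misses are independent, memory latency is long enough that the window always fills before a miss returns, and a miss can overlap the oldest outstanding miss only if both fit in one window. The function names are hypothetical.

```python
def count_memory_phases(miss_positions, window):
    """Greedily group misses that fit within one instruction window."""
    phases = 0
    phase_start = None
    for pos in sorted(miss_positions):
        # A miss `window` or more instructions after the phase's first miss
        # cannot issue until that first miss retires, so it opens a new phase.
        if phase_start is None or pos - phase_start >= window:
            phases += 1
            phase_start = pos
    return phases


def mlp(miss_positions, window):
    """MLP = off-chip accesses / memory phases (0.0 if there are no misses)."""
    if not miss_positions:
        return 0.0
    return len(miss_positions) / count_memory_phases(miss_positions, window)


# Two misses four instructions apart, as in the 4-instruction-window example.
print(mlp([0, 4], window=4))  # misses serialize
print(mlp([0, 4], window=8))  # misses overlap
```

With a 4-instruction window the two misses fall in separate phases (MLP = 1); with an 8-instruction window they share one phase (MLP = 2).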
What Limits MLP in Program?
• Depending on data from off-chip memory accesses
• For addresses
– Bad: Pointer chasing with poor locality
– Good: Array where address calculation separate from data
• For unpredictable branch decisions
– Bad: Branching on data values with poor locality
– Good: Iterative loops with highly predictable branching
• But, as programmer, which accesses go off-chip?
• Also: very poor instruction locality
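The address-dependence point can be made concrete with a small model. This is a sketch assuming every access goes off-chip and the window is unlimited, so the only limit is that an access cannot start until the miss that produces its address has returned. The list encoding and function name are hypothetical.

```python
def memory_phases(address_deps):
    """address_deps[i] is the index of the earlier access whose data supplies
    access i's address, or None if the address is computed independently."""
    depth = []
    for dep in address_deps:
        depth.append(1 if dep is None else depth[dep] + 1)
    return max(depth, default=0)


n = 8
array_walk = [None] * n                                  # addresses from an index
list_walk = [None] + [i for i in range(n - 1)]           # each node points to the next

for name, deps in (("array", array_walk), ("linked list", list_walk)):
    phases = memory_phases(deps)
    print(f"{name}: {phases} memory phase(s), MLP = {len(deps) / phases}")
```

Under these assumptions the array walk needs one memory phase (MLP = 8), while the pointer chase needs one phase per node (MLP = 1).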
Outline
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
– Runahead, Continual Flow Pipeline
• Improving MLP of a Core or Chip
Runahead Example
• Base Out-of-Order, MLP = 1
• With Runahead, MLP = 2
[Figure: timeline of I1, I2, I3, I4 with a 4-instrn window, annotated 1. Normal mode, 2. Checkpoint, 3. Runahead mode]
Runahead Execution [Dundas ICS97, Mutlu HPCA03]
1. Execute normally until instruction M’s off-chip access blocks issue of more instructions
2. Checkpoint processor
3. Discard instruction M, set M’s destination register to poisoned, & speculatively Runahead
– Instructions propagate poisoned from source to destination
– Seek off-chip accesses to start prefetches & increase MLP
4. Restore checkpoint when off-chip access M returns
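The poison propagation in step 3 can be sketched as a toy simulation. This is not the hardware mechanism from the cited papers: the `Instr` record, the `misses` flag, and the rule that a prefetched load also poisons its own destination are simplifying assumptions made here. The function only reads the program, which stands in for the checkpoint and restore of steps 2 and 4.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    op: str               # "load" or "alu"
    dst: str              # destination register
    srcs: tuple = ()      # source registers
    misses: bool = False  # toy flag: this load goes off-chip


def runahead(instrs, blocking_index):
    """Return indices of loads that can be prefetched past the blocking miss."""
    poisoned = {instrs[blocking_index].dst}  # M's destination is poisoned
    prefetched = []
    for i in range(blocking_index + 1, len(instrs)):
        ins = instrs[i]
        if any(src in poisoned for src in ins.srcs):
            poisoned.add(ins.dst)      # poison flows from source to destination
        elif ins.op == "load" and ins.misses:
            prefetched.append(i)       # independent off-chip access: prefetch it
            poisoned.add(ins.dst)      # its data is not available yet either
        else:
            poisoned.discard(ins.dst)  # a valid result overwrites old poison
    return prefetched


prog = [
    Instr("load", "r1", ("r0",), misses=True),  # M: the blocking miss
    Instr("alu", "r2", ("r1",)),                # depends on M
    Instr("load", "r3", ("r4",), misses=True),  # independent miss
    Instr("load", "r5", ("r2",), misses=True),  # address depends on M
    Instr("alu", "r1", ("r4",)),                # overwrites r1 with valid data
    Instr("load", "r6", ("r1",), misses=True),  # independent again
]
print(runahead(prog, 0))
```

In this example the loads at indices 2 and 5 are prefetched while M is outstanding; the load at index 3 is not, because its address is poisoned.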
Continual Flow Pipeline [Srinivasan ASPLOS04]
Simplified Example
Have off-chip access M free many resources, but SAVE
Keep decoding instructions
SAVE instructions dependent on M
Execute instructions independent of M
Implications of Runahead & Continual Flow
• Runahead
– Discards dependent instructions
– Speculatively executes independent instructions
– When miss returns, re-executes dependent & independent instrns
• Continual Flow Pipeline
– Saves dependent instructions
– Executes independent instructions
– When miss returns, executes only saved dependent instructions
• Assessment
– Both allow MLP to break past window limits
– Both limited by branch prediction accuracy on unresolved branches
– Continual Flow Pipeline sounds even more appealing
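The difference in work done after the miss returns can be counted with a toy model. This is a sketch under assumptions made here, not a description of either design: instructions are `(destination, sources)` pairs, the dependent slice is found by register tainting, and cost is simply an instruction count.

```python
def dependent_slice(instrs, miss_dst):
    """Indices of instructions that transitively read the miss's destination."""
    tainted = {miss_dst}
    dependents = []
    for i, (dst, srcs) in enumerate(instrs):
        if any(src in tainted for src in srcs):
            tainted.add(dst)
            dependents.append(i)
        else:
            tainted.discard(dst)  # overwritten with data independent of the miss
    return dependents


def work_after_miss_returns(instrs, miss_dst):
    """Instructions executed once the miss data arrives, per technique."""
    return {
        "runahead": len(instrs),  # dependent and independent are re-executed
        "continual_flow": len(dependent_slice(instrs, miss_dst)),  # saved slice only
    }


# Instructions following a miss that writes r1.
after_miss = [("r2", ("r1",)), ("r3", ("r4",)), ("r5", ("r2",)), ("r6", ("r3",))]
print(work_after_miss_returns(after_miss, "r1"))
```

Here two of the four instructions depend on the miss, so the Continual Flow count is half the Runahead count.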
Outline
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
• Improving MLP of a Core or Chip
– Core: Simultaneous Multithreading
– Chip: Chip Multiprocessing
Getting MLP from Thread Level Parallelism
• Runahead & Continual Flow seek MLP for Thread
• More MLP for Processor?
– More parallel off-chip accesses for a processor?
– Yes: Simultaneous Multithreading
• More MLP for Chip?
– More parallel off-chip accesses for a chip?
– Yes: Chip Multiprocessing
Simultaneous Multithreading [U Washington]
• Turn a physical processor into S logical processors
• Need S copies of architectural state, S=2, 4, (8?)
– PC, Registers, PSW, etc. (small!)
• Completely share
– Caches, functional units, & datapaths
• Manage via threshold sharing, partition, etc.
– Physical registers, issue queue, & reorder buffer
• Intel calls this Hyperthreading in Pentium 4
Simultaneous Multithreading Assessment
• Programming
– Supports finer-grained sharing than old-style MP
– But gains less than S and S is small
• Have Multi-Threaded Workload
– Hides off-chip latencies better than Runahead
– E.g., 4 threads w/ MLP 1.5 each → MLP = 6
• Have Single-Threaded Workload
– Base SMT No Help
– Many “Helper Thread” Ideas
• Expect SMT in processors for servers
Want to Spend More Transistors
• Not worthwhile to spend it all on cache
• Replicate Processor
• Private L1 Caches
– Low latency
– High bandwidth
• Shared L2 Cache
Piranha Processing Node
Next slides from Luiz Barroso’s ISCA 2000 presentation of Piranha: A Scalable Architecture
• CPU: Alpha core, 1-issue, in-order, 500MHz
• L1 caches: I&D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controller (MC): RDRAM, 12.8GB/sec
• Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving
• System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth
[Figure: chip block diagram with 8 CPUs (each with I$ and D$), 8 L2$ blocks, and 8 MEM-CTL blocks around the ICS; labeled “8 banks”]
[Figure: bar chart of Normalized Execution Time, bars split into CPU, L2Hit, and L2Miss]
Configuration              OLTP   DSS
P1  (500 MHz, 1-issue)      233   350
INO (1GHz, 1-issue)         145   191
OOO (1GHz, 4-issue)         100   100
P8  (500MHz, 1-issue)        34    44
• Piranha’s performance margin 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses → better utilizes memory system
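The performance margins can be recomputed from the chart's bar totals. The sketch below assumes the eight values read left to right as P1, INO, OOO, P8 for OLTP and then for DSS, and that the margin is measured against the OOO bar (normalized to 100); the dictionary layout and function name are illustrative.

```python
# Normalized execution time (OOO = 100), as read from the chart.
normalized_time = {
    "OLTP": {"P1": 233, "INO": 145, "OOO": 100, "P8": 34},
    "DSS": {"P1": 350, "INO": 191, "OOO": 100, "P8": 44},
}


def speedup(workload, baseline="OOO", design="P8"):
    """Speedup of `design` over `baseline` = baseline time / design time."""
    times = normalized_time[workload]
    return times[baseline] / times[design]


for workload in normalized_time:
    print(f"{workload}: P8 is {speedup(workload):.2f}x faster than OOO")
```

The ratios come out close to the 3x (OLTP) and 2.2x (DSS) quoted on the slide.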
Chip Multiprocessing Assessment: Servers
• Programming
– Supports finer-grained sharing than old-style MP
– But not as fine as SMT (yet)
– Many cores can make performance gain large
• Can Yield MLP for Chip!
– Can do CMP of SMT processors
– C cores of S-way SMT with T-way MLP per thread
– Yields Chip MLP of C*S*T (e.g., 8*2*2 = 32)
• Most Servers have Multi-Threaded Workload
• CMP is a Server Inflection Point
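The C*S*T product is simple to evaluate for candidate designs. A minimal sketch, assuming every hardware thread is busy and the memory system can sustain all the outstanding accesses, so the result is best read as an upper bound (the function name is an assumption):

```python
def chip_mlp(cores, smt_ways, mlp_per_thread):
    """Chip MLP = C cores * S-way SMT * T-way MLP per thread."""
    return cores * smt_ways * mlp_per_thread


print(chip_mlp(8, 2, 2))    # the slide's CMP example
print(chip_mlp(1, 4, 1.5))  # a single 4-way SMT core with MLP 1.5 per thread
```

The first call reproduces the slide's 8*2*2 = 32; the second gives 6 for one SMT core running 4 threads.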
Chip Multiprocessing Assessment: Clients
• Most Clients (Today) have Single-Threaded Workload
– Base CMP No Help
– Use Thread Level Speculation?
– Use Helper Threads?
• CMPs for Clients?
– Depends on Threads
Outline
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
• Improving MLP of a Core or Chip
• CMP Systems
– Small, Medium, but Not Large
Small CMP Systems
• Use One CMP (with C cores of S-way SMT)
– C starts 2-4 and grows to 16-ish
– S starts at 2, may stay at 2 or grow to 4
– Fits on your desk!
• Directly Connect CMP (C) to Memory Controller (M) or DRAM
• If Threads Useful
– >10X Performance vs. Uniprocessor
– >>10X Cost-Performance vs. non-CMP SMP
• Commodity Server!
Medium CMP Systems
• Use 2-16 CMPs (with C cores of S-way SMT)
– Small: 4*4*2 = 32
– Large: 16*16*4 = 1024
• Connect CMPs & Memory Controllers (or DRAM)
[Figures: topologies connecting CMPs (C) and memory controllers (M), including Processor-Centric and Memory-Centric]
Large CMP Systems?
• 1000s of CMPs?
• Will not happen in the commercial market
– Instead will network CMP systems into clusters
– Enhances availability & reduces cost
– Poor latency acceptable
• Market for large scientific machines probably ~$0 Billion
• Market for large government machines similar
– Nevertheless, government can make this happen (like bombers)
• The rest of us will use a small- or medium-CMP system
Wisconsin Multifacet (www.cs.wisc.edu/multifacet)
• Designing Commercial Servers
• Availability: SafetyNet Checkpointing [ISCA 2002]
• Programmability: Flight Data Recorder [ISCA 2003]
• Methods: Simulating a $2M Server on a $2K PC [Computer 2003]
• Performance: Cache Compression [ISCA 2004]
Token Coherence [IEEE MICRO Top Picks 03]
• Coherence Invariant (for any memory block at any time):
– One writer or multiple readers
• Implemented with distributed Finite State Machines
• Indirectly enforced (bus order, acks, blocking, etc.)
• Token Coherence Directly Enforces
– Each memory block has T tokens
– Token count stored with data (even in messages)
– Processor needs all T tokens to write
– Processor needs at least one token to read
• Last year: Glueless Multiprocessor
– Speedup 17-54% vs directory
• This Year: Medium CMP Systems
– Flat for correctness
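The token-counting rules can be sketched as a toy model of one memory block. This is only an illustration of the invariant, not the protocol itself: there are no messages, races, or data values, and the `TokenBlock` class, the "memory" holder, and the choice that memory starts with all tokens are assumptions made here.

```python
class TokenBlock:
    """Token counts for one memory block with T tokens in total."""

    def __init__(self, total_tokens, processors):
        self.total = total_tokens
        self.tokens = {p: 0 for p in processors}
        self.tokens["memory"] = total_tokens  # assumption: memory starts with all T

    def transfer(self, src, dst, count):
        """Move tokens between holders; tokens are never created or destroyed."""
        if count < 0 or self.tokens[src] < count:
            raise ValueError("cannot send tokens that are not held")
        self.tokens[src] -= count
        self.tokens[dst] += count

    def can_read(self, holder):
        return self.tokens[holder] >= 1           # at least one token to read

    def can_write(self, holder):
        return self.tokens[holder] == self.total  # all T tokens to write


block = TokenBlock(total_tokens=4, processors=["P0", "P1"])
block.transfer("memory", "P0", 1)
block.transfer("memory", "P1", 1)
print(block.can_read("P0"), block.can_read("P1"), block.can_write("P0"))

block.transfer("memory", "P0", 2)
block.transfer("P1", "P0", 1)
print(block.can_write("P0"), block.can_read("P1"))
```

Because the token total is conserved, a holder with all T tokens leaves none for anyone else, which gives the one-writer-or-multiple-readers invariant directly.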
Conclusions
Must Exploit Memory Level Parallelism!
At Thread: Runahead & Continual Flow Pipeline
At Processor: Simultaneous Multithreading
At Chip: Chip Multiprocessing