© 2004 Mark D. Hill Wisconsin Multifacet Project
Future Computer Advances are
Between a Rock (Slow Memory)
and a Hard Place (Multithreading)
Mark D. Hill
Computer Sciences Dept.
and Electrical & Computer Engineering Dept.
University of Wisconsin—Madison
Multifacet Project (www.cs.wisc.edu/multifacet)
October 2004
Executive Summary: Problem
• Expect computer performance doubling every 2 years
• Derives from Technology & Architecture
• Technology will advance for ten or more years
• But Architecture faces a Rock: Slow Memory
– a.k.a. the Memory Wall [Wulf & McKee 1995]
• Prediction: Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)
Executive Summary: Recommendation
• Chip Multiprocessing (CMP) Can Help
– Implement multiple processors per chip
– >>10x cost-performance for multithreaded workloads
– What about software with one apparent thread?
• Go to Hard Place: Mainstream Multithreading
– Make most workloads flourish with chip multiprocessing
– Computer architects can help, but the long run requires moving multithreading from the CS fringe to the center (algorithms, programming languages, …, hardware)
Outline
• Executive Summary
• Background
– Moore’s Law
– Architecture
– Instruction Level Parallelism
– Caches
• Going Forward, Processor Architecture Hits Rock
• Chip Multiprocessing to the Rescue?
Society Expects A Popular Moore’s Law
Computing critical: commerce, education, engineering, entertainment, government, medicine, science, …
– Servers (> PCs)
– Clients (= PCs)
– Embedded (< PCs)
• Come to expect a misnamed “Moore’s Law”
– Computer performance doubles every two years (same cost)
Progress in next two years = All past progress
• Important Corollary
– Computer cost halves every two years (same performance)
In ten years, same performance for about 3% of the cost (the sales tax – Jim Gray)
• Derives from Technology & Architecture
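The doubling arithmetic on this slide can be checked with a short sketch. It assumes only the slide's two-year doubling period; the function names are illustrative and do not come from the talk.

```python
# Illustrative arithmetic only; the two-year period is the slide's figure.
DOUBLING_PERIOD_YEARS = 2


def performance_multiple(years):
    """Performance at the same cost after `years`, relative to today."""
    return 2 ** (years / DOUBLING_PERIOD_YEARS)


def cost_fraction(years):
    """Cost of today's performance after `years`, as a fraction of today's cost."""
    return 1 / performance_multiple(years)


# Ten years is five doublings: 32x the performance, or 1/32 (about 3%) of the cost.
print(performance_multiple(10), cost_fraction(10))

# "Progress in next two years = all past progress": the gain from one more
# doubling equals the level reached so far.
level_now = performance_multiple(10)
gain_next_period = performance_multiple(12) - level_now
print(gain_next_period == level_now)
```

Five doublings give a factor of 32, which is where the roughly 3% figure comes from.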
(Technologist’s) Moore’s Law Provides Transistors
Number of transistors per chip doubles every two years (often quoted as 18 months)
Performance from Technology & Architecture
Reprinted from Hennessy and Patterson,"Computer Architecture:
Architects Use Transistors To Compute Faster
• Bit Level Parallelism (BLP) within Instructions
• Instruction Level Parallelism (ILP) among Instructions
• Scores of speculative instructions look sequential!
[Figures: two plots of instructions (Instrns) versus Time]
Architects Use Transistors To Tolerate Slow Memory
• Cache
– Small, Fast Memory
– Holds information (expected) to be used soon
– Mostly Successful
• Apply Recursively
– Level-one cache(s)
– Level-two cache
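One rough way to see why applying caches recursively is mostly successful is the standard average memory access time recurrence: each level costs its hit time plus its miss rate times the cost of the next level out. The sketch below is a minimal model under that assumption. The hit times, miss rates, and memory latency are hypothetical values chosen for illustration, not numbers from the talk.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles for a cache hierarchy.

    levels: list of (hit_time_cycles, local_miss_rate), level one first.
    memory_latency: cycles to reach main memory after the last level misses.
    """
    cost = memory_latency
    # Work inward from memory: each level adds its hit time and passes
    # only its misses on to the next level out.
    for hit_time, miss_rate in reversed(levels):
        cost = hit_time + miss_rate * cost
    return cost


# Hypothetical hierarchy: L1 = 2 cycles, 5% misses; L2 = 15 cycles, 20% misses;
# main memory = 300 cycles.
hierarchy = [(2, 0.05), (15, 0.20)]
print(average_access_time(hierarchy, memory_latency=300))
print(average_access_time([], memory_latency=300))  # no caches at all
```

With these assumed values the average access costs a few cycles instead of 300, even though main memory itself is no faster.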
Outline
• Executive Summary
• Background
• Going Forward, Processor Architecture Hits Rock
– Technology Continues
– Slow Memory
– Implications
• Chip Multiprocessing to the Rescue?
Future Technology Implications
• For (at least) ten years, Moore’s Law continues
– More repeated doublings of number of transistors per chip
– Faster transistors
• But hard for processor architects to use
– More transistors, due to global wire delays
– Faster transistors, due to too much dynamic power
• Moreover, hitting a Rock: Slow Memory
Rock: Memory Gets (Relatively) Slower
Reprinted from Hennessy and Patterson,"Computer Architecture:
Impact of Slow Memory (Rock)
• Off-Chip Misses are now hundreds of cycles
• More Realistic Case
• Good Case!
[Figures: two plots of instructions (Instrns) versus Time; labels: I1, I2, I3, I4; window = 4 (64); Compute Phases]
Implications of Slow Memory (Rock)
• Increasing Memory Latency hides Compute Phase
• Near Term Implications
– Reduce memory latency
– Fewer memory accesses
– More Memory Level Parallelism (MLP)
• Longer Term Implications
– What can single-threaded software do while waiting 100 instruction opportunities, 200, 400, … 1000?
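A toy model makes the three near-term levers concrete. It assumes execution time is compute cycles plus miss stalls, and that overlapping misses (MLP) divides the stall time. The function name, the miss rate of 5 per thousand instructions, and the base CPI of 1 are all hypothetical and not from the talk.

```python
def stall_fraction(miss_latency, misses_per_kilo_instr, base_cpi=1.0, mlp=1.0):
    """Fraction of time spent waiting on off-chip misses (toy model).

    miss_latency: cycles per off-chip miss.
    misses_per_kilo_instr: off-chip misses per 1000 instructions.
    base_cpi: cycles per instruction when memory is not the bottleneck.
    mlp: average number of misses overlapped (memory level parallelism).
    """
    compute_cycles = 1000 * base_cpi
    stall_cycles = misses_per_kilo_instr * miss_latency / mlp
    return stall_cycles / (compute_cycles + stall_cycles)


# Growing miss latency squeezes out the compute phase.
for latency in (100, 200, 400, 1000):
    print(latency, round(stall_fraction(latency, misses_per_kilo_instr=5), 2))

# Each lever from the slide lowers the stall fraction: shorter latency,
# fewer misses, or more overlapped misses.
print(stall_fraction(400, 5), stall_fraction(400, 5, mlp=2))
```

In this model, halving latency, halving the miss count, and doubling MLP are interchangeable, which is why the slide lists them together.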
Assessment So Far
• Appears
– Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)
• Processor performance hitting Rock (Slow Memory)
• No known way to overcome this, unless
• Redefine performance in Popular Moore’s Law
– From Processor Performance
Outline
• Executive Summary
• Background
• Going Forward, Processor Architecture Hits Rock
• Chip Multiprocessing to the Rescue?
– Small & Large CMPs
– CMP Systems
– CMP Workload
Performance for Chip, not Processor or Thread
• Chip Multiprocessing (CMP)
• Replicate Processor
• Private L1 Caches
– Low latency
– High bandwidth
• Shared L2 Cache
Piranha Processing Node
Next few slides from Luiz Barroso’s ISCA 2000 presentation of
Piranha: A Scalable Architecture
• CPU: Alpha core, 1-issue, in-order, 500MHz
• L1 caches: I&D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controller (MC): RDRAM, 12.8GB/sec
• Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving
• System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth
[Figure: chip block diagram with eight CPUs, each with I$ and D$, joined by the ICS to eight L2$ blocks and eight MEM-CTL blocks; the diagram is labeled “8 banks”]
[Figure: bar chart of Normalized Execution Time, split into CPU, L2Hit, and L2Miss components]
Workload   P1 (500 MHz, 1-issue)   INO (1GHz, 1-issue)   OOO (1GHz, 4-issue)   P8 (500MHz, 1-issue)
OLTP       233                     145                   100                   34
DSS        350                     191                   100                   44
• Piranha’s performance margin 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses, so it better utilizes the memory system
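The performance margins can be recomputed from the chart's bar totals, since a speedup is the ratio of two normalized execution times. The sketch below uses the eight values from the chart; the dictionary layout and function name are illustrative assumptions.

```python
# Normalized execution times read from the chart (OOO = 100 by construction).
normalized_time = {
    "OLTP": {"P1": 233, "INO": 145, "OOO": 100, "P8": 34},
    "DSS": {"P1": 350, "INO": 191, "OOO": 100, "P8": 44},
}


def speedup(workload, baseline, design="P8"):
    """How many times faster `design` runs than `baseline` on `workload`."""
    times = normalized_time[workload]
    return times[baseline] / times[design]


for workload in normalized_time:
    print(workload, round(speedup(workload, "OOO"), 2))
```

The ratios against the 4-issue out-of-order design come out close to the rounded 3x and 2.2x margins quoted on the slide.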
Simultaneous Multithreading (SMT)
• Multiplex S logical processors on each processor
– Replicate registers, share caches, & manage other parts
– Implementation factors keep S small, e.g., 2-4
• Cost-effective gain if threads available
– E.g., S=2 gives 1.4x performance
• Modest cost
– Limits waste if additional logical processor(s) not used
Small CMP Systems
• Use One CMP (with C cores of S-way SMT)
– C=[2,16] & S=[2,4], so C*S = [4,64]
– Size of a small PC!
• Directly Connect CMP (C) to Memory Controller (M) or DRAM
Medium CMP Systems
• Use 2-16 CMPs (with C cores of S-way SMT)
– Smaller: 2*4*4 = 32
– Larger: 16*16*4 = 1024
– In a single cabinet
• Connecting CMPs & Memory Controllers/DRAM raises many issues
[Figure: two layouts connecting CMPs (C) and memory controllers (M); one is labeled Processor-Centric]
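The thread counts for small and medium systems are the product of chips, cores per chip (C), and SMT ways per core (S). A minimal sketch of that multiplication follows; the function name is an illustrative assumption, and the ranges are the ones given on the slides.

```python
def hardware_threads(chips, cores_per_chip, smt_ways):
    """Total logical processors: chips x cores per chip (C) x SMT ways (S)."""
    return chips * cores_per_chip * smt_ways


# Small CMP system: one chip, C in [2, 16], S in [2, 4].
print(hardware_threads(1, 2, 2), hardware_threads(1, 16, 4))

# Medium CMP system: 2 to 16 chips in a single cabinet.
print(hardware_threads(2, 4, 4), hardware_threads(16, 16, 4))
```

The upper end of the medium range already exceeds a thousand logical processors, all of which need runnable threads to be useful.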
Inflection Points
• Inflection point occurs when
– Smooth input change leads to disruptive output change
• Enough transistors for …
– 1970s simple microprocessor
– 1980s pipelined RISC
– 1990s speculative out-of-order
– 2000s …
• CMP will be Server Inflection Point
– Expect >10x performance for less cost
– Implying >>10x cost-performance
So What’s Wrong with CMP Picture?
• Chip Multiprocessors
– Allow profitable use of more transistors
– Support modest to vast multithreading
– Will be inflection point for commercial servers
• But
– Many workloads have single thread (available to run)
– Even if single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing)
• Go to a Hard Place
Outline
• Executive Summary
• Background
• Going Forward, Processor Architecture Hits Rock
• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading
Thread Parallelism from Fringe to Center
• History
– Automatic Computer (vs. Human) → Computer
– Digital Computer (vs. Analog) → Computer
• Must Change
– Parallel Computer (vs. Sequential) → Computer
– Parallel Algorithm (vs. Sequential) → Algorithm
– Parallel Programming (vs. Sequential) → Programming
– Parallel Library (vs. Sequential) → Library
– Parallel X (vs. Sequential) → X
Computer Architects Can Contribute
• Chip Multiprocessor Design
– Transcend pre-CMP multiprocessor design
– Intra-CMP has lower latency & much higher bandwidth
• Hide Multithreading (Helper Threads)
• Assist Multithreading (Thread-Level Speculation)
• Ease Multithreaded Programming (Transactions)
But All of Computer Science is Needed
• Hide Multithreading (Libraries & Compilers)
• Assist Multithreading (Development Environments)
• Ease Multithreaded Programming (Languages)
• Divide & Conquer Multithreaded Complexity (Theory & Abstractions)
• Must Enable
– 99% of programmers think sequentially while
– 99% of instructions execute in parallel
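As one small illustration of hiding multithreading in a library, the sketch below wraps a thread pool behind a call that reads like a sequential map. The helper name `parallel_map` and the word-count task are hypothetical; `concurrent.futures.ThreadPoolExecutor` is part of the Python standard library. This is a sketch of the idea, not a claim about how the talk's authors would build it.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, max_workers=4):
    """Hypothetical library helper: returns the same list as
    list(map(func, items)), but the calls may run concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so callers can
        # keep reasoning about the code sequentially.
        return list(pool.map(func, items))


def word_count(line):
    """Ordinary sequential code written by the application programmer."""
    return len(line.split())


lines = ["between a rock", "and a hard place", "mainstream multithreading"]
print(parallel_map(word_count, lines))  # [3, 4, 2]
```

The caller never creates a thread or takes a lock. Note that in CPython the global interpreter lock limits the speedup of threads on CPU-bound work, so a real library might use processes or native code behind the same interface.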
Summary
• (Single-Threaded) Computing faces a Rock: Slow Memory
• Popular Moore’s Law (doubling performance) will end soon
• Chip Multiprocessing Can Help
– >>10x cost-performance for multithreaded workloads
– What about software with one apparent thread?
• Go to Hard Place: Mainstream Multithreading
– Make most workloads flourish with chip multiprocessing
– Computer architects can help, but long run