(1)

© 2004 Mark D. Hill Wisconsin Multifacet Project

Future Computer Advances are

Between a Rock (Slow Memory)

and a Hard Place (Multithreading)

Mark D. Hill

Computer Sciences Dept.

and Electrical & Computer Engineering Dept.

University of Wisconsin—Madison

Multifacet Project (www.cs.wisc.edu/multifacet)

October 2004

(2)

Executive Summary: Problem

• Expect computer performance doubling every 2 years

• Derives from Technology & Architecture

• Technology will advance for ten or more years

• But Architecture faces a Rock: Slow Memory

– a.k.a. Wall [Wulf & McKee 1995]

• Prediction: Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)

(3)

Executive Summary: Recommendation

• Chip Multiprocessing (CMP) Can Help

– Implement multiple processors per chip

– >>10x cost-performance for multithreaded workloads

– What about software with one apparent thread?

• Go to Hard Place: Mainstream Multithreading

– Make most workloads flourish with chip multiprocessing

– Computer architects can help, but the long run requires moving multithreading from the CS fringe to the center (algorithms, programming languages, …, hardware)

(4)

Outline

• Executive Summary

• Background

– Moore’s Law

– Architecture

– Instruction Level Parallelism

– Caches

• Going Forward, Processor Architecture Hits Rock

• Chip Multiprocessing to the Rescue?

(5)

Society Expects A Popular Moore’s Law

• Computing critical: commerce, education, engineering, entertainment, government, medicine, science, …

– Servers (> PCs)

– Clients (= PCs)

– Embedded (< PCs)

• Come to expect a misnamed “Moore’s Law”

– Computer performance doubles every two years (same cost)

  Progress in next two years = All past progress

• Important Corollary

– Computer cost halves every two years (same performance)

  In ten years, same performance for 3% (sales tax – Jim Gray)

• Derives from Technology & Architecture
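
The two numeric claims on this slide follow from simple doubling arithmetic. The sketch below checks them under an idealized assumption of exact doubling every two years; the function names are illustrative and not part of the talk.

```python
# Idealized model: performance doubles (or cost halves) every two years.
DOUBLING_PERIOD_YEARS = 2  # figure taken from the slide


def cost_fraction(years: float) -> float:
    """Fraction of today's cost needed for the same performance after `years`."""
    return 0.5 ** (years / DOUBLING_PERIOD_YEARS)


def performance_after(doublings: int) -> int:
    """Performance relative to the starting point after `doublings` periods."""
    return 2 ** doublings


# Ten years is five halvings: 1/32 of the cost, roughly 3%.
print(f"{cost_fraction(10):.1%}")

# After n doublings performance is 2**n. The next period adds another 2**n,
# which matches everything achieved up to that point.
n = 10
gain_next_period = performance_after(n + 1) - performance_after(n)
print(gain_next_period == performance_after(n))
```

The first line printed is the "3%" in the corollary; the second confirms that one further doubling adds as much as all earlier progress combined.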

(6)

(Technologist’s) Moore’s Law Provides Transistors

Number of transistors per chip doubles every two years (18 months)

(7)

Performance from Technology & Architecture

Reprinted from Hennessy and Patterson,"Computer Architecture:

(8)

Architects Use Transistors To Compute Faster

• Bit Level Parallelism (BLP) within Instructions

• Instruction Level Parallelism (ILP) among Instructions

• Scores of speculative instructions look sequential!

[Figures: two instruction timelines (axes: Time, Instrns)]
(9)

Architects Use Transistors To Tolerate Slow Memory

• Cache

– Small, Fast Memory

– Holds information (expected) to be used soon

– Mostly Successful

• Apply Recursively

– Level-one cache(s)

– Level-two cache
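
One common way to quantify how a recursive cache hierarchy tolerates slow memory is the average memory access time. The sketch below is a minimal illustration; every latency and miss rate in it is a hypothetical value chosen for the example, not a measurement from the talk.

```python
def average_access_time(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory):
    """Average memory access time, in cycles, for a two-level cache.

    Each level adds its own hit time, and only the fraction of accesses
    that miss there pays for the next, slower level.
    """
    time_below_l1 = l2_hit + l2_miss_rate * memory
    return l1_hit + l1_miss_rate * time_below_l1


# Hypothetical parameters (assumptions for illustration only).
MEMORY_CYCLES = 200
with_caches = average_access_time(
    l1_hit=2, l1_miss_rate=0.05, l2_hit=12, l2_miss_rate=0.2, memory=MEMORY_CYCLES
)
print(f"no cache: {MEMORY_CYCLES} cycles, two-level cache: {with_caches:.1f} cycles")
```

With these assumed numbers most accesses are served in a few cycles, which is the sense in which caches are "mostly successful"; the benefit shrinks as miss rates or memory latency grow.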

(10)

Outline

• Background

– Technology Continues

– Slow Memory

– Implications

(11)

Future Technology Implications

• For (at least) ten years, Moore’s Law continues

– More repeated doublings of number of transistors per chip

– Faster transistors

• But hard for processor architects to use

– More transistors, due to global wire delays

– Faster transistors, due to too much dynamic power

• Moreover, hitting a Rock: Slow Memory

(12)

Rock: Memory Gets (Relatively) Slower

Reprinted from Hennessy and Patterson,"Computer Architecture:

(13)

Impact of Slow Memory (Rock)

• Off-Chip Misses are now hundreds of cycles

• More Realistic Case

Good Case!

[Figures: two instruction timelines (axes: Time, Instrns) showing instructions I1 to I4, window = 4 (64), and Compute Phases]

(14)

Implications of Slow Memory (Rock)

• Increasing Memory Latency hides Compute Phase

• Near Term Implications

– Reduce memory latency

– Fewer memory accesses

– More Memory Level Parallelism (MLP)

• Longer Term Implications

– What can single-threaded software do while waiting 100 instruction opportunities, 200, 400, … 1000?
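
A simple analytic model shows why growing latency swamps the compute phase and why the three near-term levers help. This is a sketch under stated assumptions: execution time is compute cycles plus exposed miss cycles, and memory level parallelism (MLP) is modeled as the average number of misses that overlap. The function names and numbers are hypothetical.

```python
def execution_cycles(compute_cycles, misses, miss_latency, mlp=1.0):
    """Total cycles: compute time plus memory stall time not hidden by overlap."""
    return compute_cycles + misses * miss_latency / mlp


def memory_fraction(compute_cycles, misses, miss_latency, mlp=1.0):
    """Share of total execution time spent waiting on memory."""
    total = execution_cycles(compute_cycles, misses, miss_latency, mlp)
    return (total - compute_cycles) / total


# Hypothetical workload: 1000 compute cycles and 10 off-chip misses.
for latency in (100, 200, 400):
    share = memory_fraction(1000, 10, latency)
    print(f"latency {latency}: {share:.0%} of time waiting on memory")

# Overlapping four misses at a time recovers the cost of a 4x latency increase.
print(execution_cycles(1000, 10, 400, mlp=4.0) == execution_cycles(1000, 10, 100))
```

In this model, lowering `miss_latency`, lowering `misses`, and raising `mlp` correspond to the three near-term implications listed on the slide.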

(15)

Assessment So Far

• Appears

– Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)

• Processor performance hitting Rock (Slow Memory)

• No known way to overcome this, unless

• Redefine performance in Popular Moore’s Law

– From Processor Performance

(16)

Outline

• Background

– Small & Large CMPs

– CMP Systems

– CMP Workload

(17)

Performance for Chip, not Processor or Thread

• Chip Multiprocessing (CMP)

• Replicate Processor

• Private L1 Caches

– Low latency

– High bandwidth

• Shared L2 Cache

(18)

Piranha Processing Node

Alpha core: 1-issue, in-order, 500MHz

CPU

Next few slides from Luiz Barroso’s ISCA 2000 presentation of Piranha: A Scalable Architecture

(19)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

(21)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

Intra-chip switch (ICS) 32GB/sec, 1-cycle delay

L2 cache:

(22)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

L2 cache:

shared, 1MB, 8-way

Memory Controller (MC): RDRAM, 12.8GB/sec

8 banks

[Figure: Piranha chip block diagram showing CPUs with I$ and D$, L2$ banks, the ICS, and MEM-CTL units]

(23)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

L2 cache:

shared, 1MB, 8-way

RDRAM, 12.8GB/sec

Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving

[Figure: Piranha chip block diagram showing CPUs with I$ and D$, L2$ banks, the ICS, and MEM-CTL units]

(24)

Piranha Processing Node

CPU

L1 caches:

I&D, 64KB, 2-way

L2 cache:

shared, 1MB, 8-way

RDRAM, 12.8GB/sec

Protocol Engines (HE & RE): prog., 1K instr.,

even/odd interleaving

System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth

[Figure: Piranha chip block diagram showing CPUs with I$ and D$, L2$ banks, the ICS, and MEM-CTL units]

(26)

Normalized Execution Time

[Figure: stacked bars of normalized execution time, split into CPU, L2Hit, and L2Miss. Bar totals: OLTP 233, 145, 100, 34; DSS 350, 191, 100, 44]

• Piranha’s performance margin 3x for OLTP and 2.2x for DSS

• Piranha has more outstanding misses → better utilizes memory system

(27)

Simultaneous Multithreading (SMT)

• Multiplex S logical processors on each processor

– Replicate registers, share caches, & manage other parts

– Implementation factors keep S small, e.g., 2-4

• Cost-effective gain if threads available

– E.g., S=2 → 1.4x performance

• Modest cost

– Limits waste if additional logical processor(s) not used

(28)

Small CMP Systems

• Use One CMP (with C cores of S-way SMT)

– C=[2,16] & S=[2,4] → C*S = [4,64]

– Size of a small PC!

• Directly Connect CMP (C) to Memory Controller (M) or DRAM

[Figure: CMP (C) connected to memory controller (M)]

(29)

Medium CMP Systems

• Use 2-16 CMPs (with C cores of S-way SMT)

– Smaller: 2*4*4 = 32

– Larger: 16*16*4 = 1024

– In a single cabinet

• Connecting CMPs & Memory Controllers/DRAM raises many issues

[Figure: CMPs (C) and memory controllers (M) in a Processor-Centric arrangement]
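
The thread counts quoted for small and medium systems are products of chips, cores per chip, and SMT ways. A minimal sketch of that arithmetic follows; `hardware_threads` is an illustrative helper, not a name from the talk.

```python
def hardware_threads(chips: int, cores_per_chip: int, smt_ways: int) -> int:
    """Number of logical processors: chips x cores per chip (C) x SMT ways (S)."""
    return chips * cores_per_chip * smt_ways


# Small system: one CMP with C in [2, 16] and S in [2, 4].
print(hardware_threads(1, 2, 2), hardware_threads(1, 16, 4))

# Medium system: 2 to 16 CMPs in a single cabinet.
print(hardware_threads(2, 4, 4), hardware_threads(16, 16, 4))
```

These reproduce the ranges on the slides: 4 to 64 threads for a small system and 32 to 1024 for a medium one.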

(30)

Inflection Points

• Inflection point occurs when

– A smooth input change leads to a disruptive output change

• Enough transistors for …

– 1970s simple microprocessor

– 1980s pipelined RISC

– 1990s speculative out-of-order

– 2000s …

• CMP will be Server Inflection Point

– Expect >10x performance for less cost

– Implying >>10x cost-performance

(31)

So What’s Wrong with CMP Picture?

• Chip Multiprocessors

– Allow profitable use of more transistors

– Support modest to vast multithreading

– Will be inflection point for commercial servers

• But

– Many workloads have single thread (available to run)

– Even if single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing)

• Go to a Hard Place

(32)

Outline

• Background

• Go to the Hard Place of Mainstream Multithreading

(33)

Thread Parallelism from Fringe to Center

• History

– Automatic Computer (vs. Human) → Computer

– Digital Computer (vs. Analog) → Computer

• Must Change

– Parallel Computer (vs. Sequential) → Computer

– Parallel Algorithm (vs. Sequential) → Algorithm

– Parallel Programming (vs. Sequential) → Programming

– Parallel Library (vs. Sequential) → Library

– Parallel X (vs. Sequential) → X

(34)

Computer Architects Can Contribute

• Chip Multiprocessor Design

– Transcend pre-CMP multiprocessor design

– Intra-CMP has lower latency & much higher bandwidth

• Hide Multithreading (Helper Threads)

• Assist Multithreading (Thread-Level Speculation)

• Ease Multithreaded Programming (Transactions)

(35)

But All of Computer Science is Needed

• Hide Multithreading (Libraries & Compilers)

• Assist Multithreading (Development Environments)

• Ease Multithreaded Programming (Languages)

• Divide & Conquer Multithreaded Complexity (Theory & Abstractions)

• Must Enable

– 99% of programmers think sequentially while

– 99% of instructions execute in parallel
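
As one illustration of a library hiding multithreading, the sketch below wraps a thread pool behind an ordinary map-style call, so the caller's code still reads sequentially. It uses Python's standard `concurrent.futures` module; `parallel_map` and `word_count` are hypothetical helpers invented for the example. In CPython the global interpreter lock limits speedup for CPU-bound work, so this shows the programming model rather than a performance result.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, workers=4):
    """Apply func to every item on a pool of threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def word_count(text: str) -> int:
    """Ordinary sequential function written with no knowledge of threads."""
    return len(text.split())


documents = ["a rock", "and a hard place", "slow memory"]

# The call site looks like a plain map; the threading lives in the library.
counts = parallel_map(word_count, documents)
print(counts)  # [2, 4, 2]
```

The caller writes and reasons about `word_count` sequentially, while the helper decides how the work is spread across threads.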

(36)

Summary

• (Single-Threaded) Computing faces a Rock: Slow Memory

• Popular Moore’s Law (doubling performance) will end soon

• Chip Multiprocessing Can Help

– >>10x cost-performance for multithreaded workloads

– What about software with one apparent thread?

• Go to Hard Place: Mainstream Multithreading

– Make most workloads flourish with chip multiprocessing

– Computer architects can help, but the long run requires moving multithreading from the CS fringe to the center