
© 2004 Mark D. Hill Wisconsin Multifacet Project

Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading)

Mark D. Hill

Computer Sciences Dept.

and Electrical & Computer Engineering Dept.

University of Wisconsin—Madison

Multifacet Project (www.cs.wisc.edu/multifacet)

October 2004


Executive Summary: Problem

• Expect computer performance doubling every 2 years

• Derives from Technology & Architecture

• Technology will advance for ten or more years

• But Architecture faces a Rock: Slow Memory
– a.k.a. the Memory Wall [Wulf & McKee 1995]
• Prediction: Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)


Executive Summary: Recommendation

• Chip Multiprocessing (CMP) Can Help
– Implement multiple processors per chip
– >>10x cost-performance for multithreaded workloads
– What about software with one apparent thread?
• Go to Hard Place: Mainstream Multithreading
– Make most workloads flourish with chip multiprocessing
– Computer architects can help, but the long run requires moving multithreading from CS fringe to center (algorithms, programming languages, …, hardware)


Outline

• Executive Summary

• Background

– Moore’s Law
– Architecture
– Instruction Level Parallelism
– Caches

• Going Forward Processor Architecture Hits Rock

• Chip Multiprocessing to the Rescue?


Society Expects A Popular Moore’s Law

• Computing is critical: commerce, education, engineering, entertainment, government, medicine, science, …
– Servers (> PCs)
– Clients (= PCs)
– Embedded (< PCs)
• We have come to expect a misnamed “Moore’s Law”
– Computer performance doubles every two years (same cost)
  → Progress in next two years = All past progress
• Important Corollary
– Computer cost halves every two years (same performance)
  → In ten years, same performance for 3% of the cost (sales tax – Jim Gray)

• Derives from Technology & Architecture
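The two arrows on this slide are simple consequences of repeated doubling. A minimal sketch of the arithmetic, assuming an idealized, exact two-year doubling period (the function names `growth` and `relative_cost` are illustrative, not from the talk):

```python
def growth(years, doubling_period=2):
    """Performance multiple after `years`, assuming exact doubling every period."""
    return 2 ** (years / doubling_period)


def relative_cost(years, halving_period=2):
    """Cost of the same performance after `years`, relative to today's cost."""
    return 1 / growth(years, halving_period)


# Progress in the next two years equals all progress so far:
# going from year 20 to year 22 adds as much as years 0 to 20 delivered.
perf_now = growth(20)
gain_next_two_years = growth(22) - perf_now

# Five halvings in ten years leave 1/32 of the cost, roughly 3%.
cost_after_ten_years = relative_cost(10)

print(perf_now, gain_next_two_years, cost_after_ten_years)
```

Under these assumptions the gain over the next period always equals the current level, and ten years of halving leaves 1/32 of the original cost, which is where the "3%" figure comes from.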


(Technologist’s) Moore’s Law Provides Transistors

Number of transistors per chip doubles every two years (18 months)


Performance from Technology & Architecture

[Figure reprinted from Hennessy and Patterson, "Computer Architecture: …"]


Architects Use Transistors To Compute Faster

• Bit Level Parallelism (BLP) within Instructions
• Instruction Level Parallelism (ILP) among Instructions

• Scores of speculative instructions look sequential!

[Figures: instructions (vertical axis) versus time (horizontal axis)]


Architects Use Transistors To Tolerate Slow Memory

• Cache

– Small, Fast Memory

– Holds information (expected) to be used soon

– Mostly Successful

• Apply Recursively

– Level-one cache(s)
– Level-two cache


Outline

• Executive Summary

• Background

• Going Forward Processor Architecture Hits Rock

– Technology Continues
– Slow Memory

– Implications

• Chip Multiprocessing to the Rescue?


Future Technology Implications

• For (at least) ten years, Moore’s Law continues

– More repeated doublings of number of transistors per chip
– Faster transistors
• But hard for processor architects to use
– More transistors, due to global wire delays
– Faster transistors, due to too much dynamic power

• Moreover, hitting a Rock: Slow Memory


Rock: Memory Gets (Relatively) Slower

[Figure reprinted from Hennessy and Patterson, "Computer Architecture: …"]


Impact of Slow Memory (Rock)

• Off-Chip Misses are now hundreds of cycles

• More Realistic Case

[Figures: instructions (vertical axis) versus time (horizontal axis). Labels: Good Case!; I1, I2, I3, I4; window = 4 (64); Compute Phases]


Implications of Slow Memory (Rock)

• Increasing Memory Latency hides Compute Phase

• Near Term Implications

– Reduce memory latency
– Fewer memory accesses

– More Memory Level Parallelism (MLP)

• Longer Term Implications

– What can single-threaded software do while waiting 100 instruction opportunities, 200, 400, … 1000?
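To make the trend concrete, here is a minimal sketch of a toy model in which execution alternates between a compute phase and a stall on off-chip misses. The model, the function name `compute_fraction`, and the 100-cycle compute phase are assumptions for illustration only; the latencies echo the 100, 200, 400, … 1000 sequence on the slide.

```python
def compute_fraction(compute_cycles, miss_latency, mlp=1):
    """Fraction of time spent computing in a toy alternating model.

    Each compute phase of `compute_cycles` is followed by a memory stall.
    `mlp` misses overlap perfectly, so the stall charged to each compute
    phase is miss_latency / mlp (an idealized assumption).
    """
    stall = miss_latency / mlp
    return compute_cycles / (compute_cycles + stall)


# Hypothetical 100-cycle compute phase against growing miss latency.
for latency in (100, 200, 400, 1000):
    serial = compute_fraction(100, latency)
    overlapped = compute_fraction(100, latency, mlp=4)
    print(f"latency={latency:5d}  no MLP: {serial:.2f}  MLP=4: {overlapped:.2f}")
```

In this model the compute phase shrinks toward a negligible share of run time as latency grows, and overlapping misses (more MLP) recovers part of it, which matches the near-term implications listed above.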


Assessment So Far

• Appears
– Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)
• Processor performance hitting Rock (Slow Memory)

• No known way to overcome this, unless

• Redefine performance in Popular Moore’s Law

– From Processor Performance


Outline

• Executive Summary

• Background

• Going Forward Processor Architecture Hits Rock

• Chip Multiprocessing to the Rescue?

– Small & Large CMPs
– CMP Systems

– CMP Workload


Performance for Chip, not Processor or Thread

• Chip Multiprocessing (CMP)

• Replicate Processor

• Private L1 Caches

– Low latency
– High bandwidth

• Shared L2 Cache


Piranha Processing Node

• CPU: Alpha core, 1-issue, in-order, 500MHz

Next few slides from Luiz Barroso’s ISCA 2000 presentation of "Piranha: A Scalable Architecture …"


Piranha Processing Node

• CPU: Alpha core, 1-issue, in-order, 500MHz
• L1 caches: I&D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controller (MC): RDRAM, 12.8GB/sec
– 8 banks

[Diagram: CPUs, each with I$ and D$, connected through the ICS to L2$ banks and MEM-CTL blocks]


Piranha Processing Node

• CPU: Alpha core, 1-issue, in-order, 500MHz
• L1 caches: I&D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controller (MC): RDRAM, 12.8GB/sec
• Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving
• System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth

[Diagram: CPUs, each with I$ and D$, connected through the ICS to L2$ banks and MEM-CTL blocks]

Normalized Execution Time (bars split into CPU, L2Hit, and L2Miss; vertical axis 0 to 350)

Configuration              OLTP   DSS
P1  (500 MHz, 1-issue)      233   350
INO (1GHz, 1-issue)         145   191
OOO (1GHz, 4-issue)         100   100
P8  (500MHz, 1-issue)        34    44

• Piranha’s performance margin 3x for OLTP and 2.2x for DSS

• Piranha has more outstanding misses → better utilizes memory system
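The quoted margins follow from the normalized execution times on the chart. A minimal sketch, assuming the four values per workload map to the configurations in the order they appear on the chart (P1, INO, OOO, P8), with the 1GHz 4-issue OOO core as the 100 baseline; the dictionary and function names are illustrative:

```python
# Normalized execution times read from the chart (OOO = 100 baseline).
# The mapping of values to configurations is an assumption based on order.
normalized_time = {
    "OLTP": {"P1": 233, "INO": 145, "OOO": 100, "P8": 34},
    "DSS": {"P1": 350, "INO": 191, "OOO": 100, "P8": 44},
}


def speedup(workload, baseline="OOO", candidate="P8"):
    """Speedup of `candidate` over `baseline` = baseline time / candidate time."""
    times = normalized_time[workload]
    return times[baseline] / times[candidate]


for workload in normalized_time:
    print(f"{workload}: P8 is {speedup(workload):.2f}x faster than OOO")
```

The ratios come out close to the rounded 3x (OLTP) and 2.2x (DSS) figures stated on the slide.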


Simultaneous Multithreading (SMT)

• Multiplex S logical processors on each processor
– Replicate registers, share caches, & manage other parts
– Implementation factors keep S small, e.g., 2-4
• Cost-effective gain if threads available
– E.g., S=2 → 1.4x performance

• Modest cost

– Limits waste if additional logical processor(s) not used


Small CMP Systems

• Use One CMP (with C cores of S-way SMT)
– C=[2,16] & S=[2,4] → C*S = [4,64]
– Size of a small PC!
• Directly Connect CMP (C) to Memory Controller (M) or DRAM


Medium CMP Systems

• Use 2-16 CMPs (with C cores of S-way SMT)

– Smaller: 2*4*4 = 32
– Larger: 16*16*4 = 1024
– In a single cabinet
• Connecting CMPs & Memory Controllers/DRAM raises many issues

[Diagrams: two arrangements of CMPs (C) and memory controllers (M); one is labeled Processor-Centric]
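The thread counts quoted for small and medium systems are products of the number of CMPs, cores per CMP (C), and SMT ways per core (S). A minimal sketch of that arithmetic; the function name `hardware_threads` is illustrative, not from the talk:

```python
def hardware_threads(cmps, cores_per_cmp, smt_ways):
    """Logical processors in a system: CMPs x cores per CMP (C) x SMT ways (S)."""
    return cmps * cores_per_cmp * smt_ways


# Small systems: a single CMP with C in [2, 16] and S in [2, 4].
small_min = hardware_threads(1, 2, 2)
small_max = hardware_threads(1, 16, 4)

# Medium systems: the two examples given on the slide.
medium_smaller = hardware_threads(2, 4, 4)
medium_larger = hardware_threads(16, 16, 4)

print(small_min, small_max, medium_smaller, medium_larger)
```

Even the smaller configurations expose tens of logical processors, which is why the amount of available thread-level parallelism in the workload becomes the central question.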


Inflection Points

• Inflection point occurs when

– Smooth input change leads to
– Disruptive output change

• Enough transistors for …
– 1970s simple microprocessor
– 1980s pipelined RISC
– 1990s speculative out-of-order
– 2000s …

• CMP will be Server Inflection Point
– Expect >10x performance for less cost
– Implying >>10x cost-performance
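The step from ">10x performance for less cost" to ">>10x cost-performance" is a ratio of ratios. A minimal sketch with hypothetical numbers (the 10x performance and 0.5x cost values are assumptions chosen only to illustrate the relationship):

```python
def cost_performance_gain(performance_ratio, cost_ratio):
    """Improvement in performance per unit cost.

    performance_ratio: new performance / old performance
    cost_ratio: new cost / old cost (below 1.0 means cheaper)
    """
    return performance_ratio / cost_ratio


# Hypothetical example: 10x the performance at half the cost.
gain = cost_performance_gain(10, 0.5)
print(gain)
```

Any cost ratio below 1.0 pushes the cost-performance gain above the raw performance gain, so more than 10x performance at lower cost implies a larger multiple in cost-performance.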


So What’s Wrong with CMP Picture?

• Chip Multiprocessors

– Allow profitable use of more transistors
– Support modest to vast multithreading

– Will be inflection point for commercial servers

• But

– Many workloads have single thread (available to run)

– Even if single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing)

• Go to a Hard Place


Outline

• Executive Summary

• Background

• Going Forward Processor Architecture Hits Rock

• Chip Multiprocessing to the Rescue?

• Go to the Hard Place of Mainstream Multithreading


Thread Parallelism from Fringe to Center

• History

– Automatic Computer (vs. Human) → Computer
– Digital Computer (vs. Analog) → Computer

• Must Change
– Parallel Computer (vs. Sequential) → Computer
– Parallel Algorithm (vs. Sequential) → Algorithm
– Parallel Programming (vs. Sequential) → Programming
– Parallel Library (vs. Sequential) → Library
– Parallel X (vs. Sequential) → X


Computer Architects Can Contribute

• Chip Multiprocessor Design

– Transcend pre-CMP multiprocessor design

– Intra-CMP has lower latency & much higher bandwidth

• Hide Multithreading (Helper Threads)

• Assist Multithreading (Thread-Level Speculation)

• Ease Multithreaded Programming (Transactions)


But All of Computer Science is Needed

• Hide Multithreading (Libraries & Compilers)

• Assist Multithreading (Development Environments)

• Ease Multithreaded Programming (Languages)

• Divide & Conquer Multithreaded Complexity (Theory & Abstractions)

• Must Enable
– 99% of programmers think sequentially while
– 99% of instructions execute in parallel


Summary

• (Single-Threaded) Computing faces a Rock: Slow Memory

• Popular Moore’s Law (doubling performance) will end soon

• Chip Multiprocessing Can Help

– >>10x cost-performance for multithreaded workloads
– What about software with one apparent thread?

• Go to Hard Place: Mainstream Multithreading
– Make most workloads flourish with chip multiprocessing
– Computer architects can help, but the long run requires moving multithreading from CS fringe to center
