Tomorrow’s Computing Engines
February 3, 1998
Symposium on High-Performance Computer Architecture
William J. Dally
Computer Systems Laboratory, Stanford University
Focus on Tomorrow, not Yesterday
Generals tend to always fight the last war
Computer architects tend to always design the last computer
– optimized for old programs
Some Previous “Wars” (1/3)
• MARS Router (1984)
• Torus Routing Chip (1985)
• Network Design Frame (1988)
Some Previous “Wars” (2/3)
Some Previous “Wars” (3/3)
Tomorrow’s Computing Engines
• Driven by tomorrow’s applications: media
90% of Desktop Cycles will Be Spent on ‘Media’ Applications by 2000
• Quote from Scott Kirkpatrick of IBM (talk abstract)
• Media applications include
– video encode/decode
– polygon & image-based graphics
– audio processing: compression, music, speech recognition/synthesis
– modulation/demodulation at audio and video rates
• These applications involve stream processing
• So do
Typical Media Kernel
Image Warp and Composite
• Read 10,000 pixels from memory
• Perform 100 16-bit integer operations on each pixel
• Test each pixel
• Write the 3,000 result pixels that pass to memory
• Little reuse of data fetched from memory
– each pixel used once
• Little interaction between pixels
– very insensitive to operation latency
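A minimal C sketch of the kernel's shape as described above: each pixel is read once, transformed by a stand-in for the ~100 16-bit integer operations, tested, and conditionally written. The function name, transform, and test threshold are illustrative, not the actual kernel.

#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the warp-and-composite kernel: stream n
   pixels in, do 16-bit integer work on each, write out only those
   that pass the per-pixel test. Transform and threshold are made up. */
size_t warp_and_composite(const uint16_t *in, size_t n, uint16_t *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {       /* each pixel fetched exactly once */
        uint16_t p = in[i];
        p = (uint16_t)(p * 3u + 17u);      /* placeholder for ~100 16-bit ops */
        if (p > 0x1000)                    /* per-pixel test */
            out[written++] = p;            /* ~3,000 of 10,000 pass */
    }
    return written;
}

Because each iteration is independent, the loop tolerates long memory latency: many pixels can be in flight at once.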
Telepresence: A Driving Application
• Pipeline: Acquire 2D Images → Extract Depth (3D Images) → Segmentation / Model Extraction → Compression → Decompression → Rendering → Display 3D Scene
• Most kernels: latency insensitive
Tomorrow’s Technology is Wire Limited
Technology scaling makes communication the scarce resource
• 1997, 0.35µm: 64Mb DRAM or 16 64-bit FP processors at 400MHz; 18mm die, 12,000 wire tracks, 1 clock to cross the chip
• 2007, 0.10µm: 4Gb DRAM or 1K 64-bit FP processors at 2.5GHz; 32mm die, 90,000 wire tracks, 20 clocks to cross the chip
On-chip wires are getting slower
With feature size scaled by s (x_2 = s·x_1, s = 0.5):
• wire resistance: R_2 = R_1/s^2 (4x); wire capacitance: C_2 = C_1 (1x)
• delay of a fixed-length wire y: t_w = R C y^2, so t_w2 = R_2 C_2 y^2 = t_w1/s^2 (4x)
• relative to gate delay: t_w2/t_g2 = t_w1/(t_g1 s^3) (8x)
• repeated-wire signal velocity: v = 0.5 (t_g R C)^(-1/2) m/s, so v_2 = v_1 s^(1/2) (0.7x)
• distance traveled per gate delay: v t_g = 0.5 (t_g/(R C))^(1/2) m/gate, so v_2 t_g2 = v_1 t_g1 s^(3/2) (0.35x)
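As a check on the arithmetic, a small C program (assuming s = 0.5, matching the slide's 0.5x width entry) reproduces the 4x, 8x, 0.7x, and 0.35x factors:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double s = 0.5;                                          /* linear scale factor */
    printf("fixed-length wire delay: %gx\n", 1.0 / (s * s));      /* 4x    */
    printf("wire vs. gate delay:     %gx\n", 1.0 / (s * s * s));  /* 8x    */
    printf("signal velocity:         %.2fx\n", sqrt(s));          /* 0.71x */
    printf("distance per gate delay: %.2fx\n", pow(s, 1.5));      /* 0.35x */
    return 0;
}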
Bandwidth and Latency of Modern VLSI
[Plot: latency and bandwidth vs. size — over sizes from 1 to 10^5, latency grows from 1 to ~10^3 while bandwidth falls from 1 to ~10^-6 (normalized)]
Architecture for Locality
Exploit high on-chip bandwidth
[Diagram: off-chip RAM reaches the chip through 2GB/s of pin bandwidth; on chip, a vector register file feeds 10^4 32-bit ALUs through a switch at 50GB/s]
Tomorrow’s Computing Engines
• Aimed at media processing
– stream based
– latency tolerant
– low-precision
– little reuse
– lots of conditionals
• Use the large number of devices available on future chips
• Make efficient use of scarce communication resources
– bandwidth hierarchy
– no centralized resources
• Approach the performance of a special-purpose processor
Why do Special-Purpose Processors Perform Well?
Care and Feeding of ALUs
[Diagram: an ALU must be fed on two paths — data bandwidth from the registers, and instruction bandwidth from the instruction cache through the IR and IP]
Three Key Problems
• Instruction bandwidth
• Data bandwidth
• Data-dependent conditional execution
A Bandwidth Hierarchy
[Diagram: four SDRAM banks feed a streaming memory system at 1.6GB/s, which feeds a vector register file at 50GB/s, which feeds the ALU clusters at 500GB/s; 13 ALUs per cluster]
• Solves the data bandwidth problem
A Streaming Memory System
[Diagram: address generators issue stream references through a crossbar to per-bank reorder queues in front of each SDRAM bank]
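A toy C model of the dispatch side of such a memory system, assuming 32-byte bank interleaving across four banks; all constants and the queueing discipline are illustrative. Each generated stream address is steered to the bank that owns it, so each bank can service its queue at its own pace while a reorder queue restores stream order on the way back:

#include <stdio.h>

#define NBANKS 4

int main(void)
{
    unsigned base = 0x1000, stride = 24;      /* strided record stream */
    for (unsigned i = 0; i < 16; i++) {
        unsigned addr = base + i * stride;
        unsigned bank = (addr >> 5) % NBANKS; /* interleave on 32B lines */
        printf("record %2u -> bank %u, addr 0x%04x\n", i, bank, addr);
    }
    return 0;
}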
Streaming Memory Performance
[Plot: bank queue effectiveness — cycles per access (0 to 1.8) vs. queue size (1, 2, 4, 8, 16, 32, 64, infinite)]
• Exploit latency insensitivity for improved bandwidth
Compound Vector Operations
One instruction does lots of work
[Diagram: a single ~50-bit compound vector instruction (Op plus vector registers V0-V7) enters a control store, whose µIP sequences many per-element operations (Op Ra Rb triples) in the clusters; memory instructions such as LD Vd Vx drive the address generators between memory and the VRF]
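One plausible reading of the 50-bit format, sketched as a C struct: an ~10-bit opcode naming a control-store microprogram plus eight 5-bit vector-register specifiers comes to about 50 bits. This layout is an assumption for illustration, not the actual Imagine encoding.

#include <stdint.h>

/* Hypothetical compound-vector-instruction fields (~10b + 8 x 5b = 50b).
   The opcode picks a control-store microprogram; the uIP then sequences
   many (Op, Ra, Rb) ALU operations per stream element. */
typedef struct {
    uint16_t opcode;    /* control-store entry point, ~10 bits used */
    uint8_t  vreg[8];   /* vector-register operands V0..V7, 5 bits each */
} compound_vector_inst;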
Scheduling by Simulated Annealing
• List scheduling assumes global communication
– does poorly when communication is exposed
• View scheduling as a CAD problem (place and route)
– generate naïve ‘feasible’ schedule
– iteratively improve the schedule by moving operations (see the sketch below)
[Diagram: the schedule as a 2D grid of ALUs × time]
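A self-contained C sketch of this annealing loop under simplifying assumptions: operations are independent (a real scheduler must also respect data dependences and communication constraints), the cost is just schedule length, and all constants (temperature schedule, move counts) are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NOPS  32
#define NALUS 4
#define SLOTS 32

static int slot[NOPS], alu[NOPS];

static int cost(void)                 /* schedule length: latest used slot */
{
    int len = 0;
    for (int i = 0; i < NOPS; i++)
        if (slot[i] > len) len = slot[i];
    return len;
}

static int conflict(int op)           /* one op per (ALU, slot) pair */
{
    for (int i = 0; i < NOPS; i++)
        if (i != op && slot[i] == slot[op] && alu[i] == alu[op])
            return 1;
    return 0;
}

int main(void)
{
    /* naive feasible schedule: every op in its own slot on ALU 0 */
    for (int i = 0; i < NOPS; i++) { slot[i] = i; alu[i] = 0; }

    for (double T = 10.0; T > 0.01; T *= 0.95) {       /* geometric cooling */
        for (int k = 0; k < 200; k++) {
            int op = rand() % NOPS, os = slot[op], oa = alu[op];
            int before = cost();
            slot[op] = rand() % SLOTS;                 /* move one operation */
            alu[op]  = rand() % NALUS;
            int delta = cost() - before;
            /* keep improvements; accept uphill moves with prob e^(-delta/T) */
            if (conflict(op) ||
                (delta > 0 &&
                 rand() / (double)RAND_MAX >= exp(-delta / T))) {
                slot[op] = os; alu[op] = oa;           /* reject: undo move */
            }
        }
    }
    printf("final schedule length: %d slots\n", cost() + 1);
    return 0;
}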
Typical Annealing Schedule
[Plot: schedule cost vs. annealing iteration — cost falls from 166 to 13 over roughly 18,000 iterations]
Conventional Approaches to Data-Dependent Conditional Execution
Zero-Cost Conditionals
• Most Approaches to Conditional Operations are Costly
– Branching control flow: dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors): a large fraction of execution ‘opportunities’ go idle
• Conditional Vectors
– append an element to an output stream depending on a case variable (sketched below)
[Diagram: a case stream of {0,1} values steers each element to result stream 0 or 1]
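A scalar C model of the mechanism (the hardware does this per cluster, in parallel; this sketch only shows the semantics): each element is appended to the output stream only when its case value selects it, so the output is dense and no execution slots are spent on discarded elements.

#include <stddef.h>

/* Conditional stream append: keep in[i] iff case_stream[i] is 1.
   Returns the length of the compacted output stream. */
size_t conditional_append(const int *in, const int *case_stream,
                          size_t n, int *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        if (case_stream[i])
            out[len++] = in[i];
    return len;
}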
Application Sketch - Polygon Rendering
[Diagram: triangle vertices (V1, V2, V3) become vertex records (X, Y, RGB, UV); vertices are expanded into span records (Y, X1, X2, RGB1, UV1); spans are interpolated into pixel records (X, Y, RGB, UV); texture lookup produces TexturedPixel records (X, Y, RGB)]
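Reading the record fields off the diagram, the pipeline's stream types might look like the following C structs; the field layout is inferred from the slide's labels and the element types are an assumption, not taken from the actual implementation.

/* Stream record types for the rendering pipeline (fields as labeled
   in the diagram; float is an assumption). */
typedef struct { float x, y;      float rgb[3];  float uv[2];  } Vertex;
typedef struct { float y, x1, x2; float rgb1[3]; float uv1[2]; } Span;
typedef struct { float x, y;      float rgb[3];  float uv[2];  } Pixel;
typedef struct { float x, y;      float rgb[3];  } TexturedPixel;

Each stage consumes one stream type and produces the next, which is what makes the application a natural fit for the compound-vector model above.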
Status
• Working simulator of Imagine
• Simple kernels running on simulator
– FFT
• Applications being developed
– Depth extraction, video compression, polygon rendering, image-based graphics
Acknowledgements
• Students/Staff
– Don Alpert (Intel)
– Chris Buehler (MIT)
– J.P. Grossman (MIT)
– Brad Johanson
– Ujval Kapasi
– Brucek Khailany
– Abelardo Lopez-Lagunas
– Peter Mattson
– John Owens
– Scott Rixner
• Helpful Suggestions
– Henry Fuchs (UNC)
– Pat Hanrahan
– Tom Knight (MIT)
– Marc Levoy
Conclusion
• Work toward tomorrow’s computing engines
• Targeted toward media processing
– streams of low-precision samples
– little reuse
– latency tolerant
• Matched to the capabilities of communication-limited technology
– explicit bandwidth hierarchy
– explicit communication between units
– communication exposed