Tomorrow’s Computing Engines
February 3, 1998
Symposium on High-Performance Computer Architecture
William J. Dally
Computer Systems Laboratory, Stanford University
Focus on Tomorrow, not Yesterday
Generals tend to always fight the last war
Computer architects tend to always design the last computer
– optimized for old programs
Some Previous “Wars” (1/3)
• MARS Router (1984)
• Torus Routing Chip (1985)
• Network Design Frame (1988)
Some Previous “Wars” (2/3)
Some Previous “Wars” (3/3)
Tomorrow’s Computing Engines
• Driven by tomorrow’s applications: media
90% of Desktop Cycles will Be Spent on ‘Media’ Applications by 2000
• Quote from Scott Kirkpatrick of IBM (talk abstract)
• Media applications include
– video encode/decode
– polygon & image-based graphics
– audio processing: compression, music, speech recognition/synthesis
– modulation/demodulation at audio and video rates
• These applications involve stream processing
• So do
Typical Media Kernel
Image Warp and Composite
• Read 10,000 pixels from memory
• Perform 100 16-bit integer operations on each pixel
• Test each pixel
• Write the 3,000 result pixels that pass to memory
• Little reuse of data fetched from memory
– each pixel used once
• Little interaction between pixels
– very insensitive to operation latency
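A minimal C sketch of the kernel's shape as described above: each pixel is read once, transformed by a stand-in for the ~100 16-bit integer operations, tested, and conditionally written. The function name, transform, and test threshold are illustrative, not the actual kernel.

#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the warp-and-composite kernel: stream n
   pixels in, do 16-bit integer work on each, write out only those
   that pass the per-pixel test. Transform and threshold are made up. */
size_t warp_and_composite(const uint16_t *in, size_t n, uint16_t *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {       /* each pixel fetched exactly once */
        uint16_t p = in[i];
        p = (uint16_t)(p * 3u + 17u);      /* placeholder for ~100 16-bit ops */
        if (p > 0x1000)                    /* per-pixel test */
            out[written++] = p;            /* ~3,000 of 10,000 pass */
    }
    return written;
}

Because each iteration is independent, the loop tolerates long memory latency: many pixels can be in flight at once.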
Telepresence: A Driving Application
• Pipeline: Acquire 2D Images → Extract Depth (3D Images) → Segmentation / Model Extraction → Compression → Decompression → Rendering → Display 3D Scene
• Most kernels: latency insensitive
Tomorrow’s Technology is Wire Limited
Technology scaling makes communication the scarce resource
• 1997, 0.35µm: 64Mb DRAM or 16 64-bit FP processors at 400MHz; 18mm die, 12,000 wire tracks, 1 clock to cross the chip
• 2007, 0.10µm: 4Gb DRAM or 1K 64-bit FP processors at 2.5GHz; 32mm die, 90,000 wire tracks, 20 clocks to cross the chip
On-chip wires are getting slower
With feature size scaled by s (x_2 = s·x_1, s = 0.5):
• wire resistance: R_2 = R_1/s^2 (4x); wire capacitance: C_2 = C_1 (1x)
• delay of a fixed-length wire y: t_w = R C y^2, so t_w2 = R_2 C_2 y^2 = t_w1/s^2 (4x)
• relative to gate delay: t_w2/t_g2 = t_w1/(t_g1 s^3) (8x)
• repeated-wire signal velocity: v = 0.5 (t_g R C)^(-1/2) m/s, so v_2 = v_1 s^(1/2) (0.7x)
• distance traveled per gate delay: v t_g = 0.5 (t_g/(R C))^(1/2) m/gate, so v_2 t_g2 = v_1 t_g1 s^(3/2) (0.35x)
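As a check on the arithmetic, a small C program (assuming s = 0.5, matching the slide's 0.5x width entry) reproduces the 4x, 8x, 0.7x, and 0.35x factors:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double s = 0.5;                                          /* linear scale factor */
    printf("fixed-length wire delay: %gx\n", 1.0 / (s * s));      /* 4x    */
    printf("wire vs. gate delay:     %gx\n", 1.0 / (s * s * s));  /* 8x    */
    printf("signal velocity:         %.2fx\n", sqrt(s));          /* 0.71x */
    printf("distance per gate delay: %.2fx\n", pow(s, 1.5));      /* 0.35x */
    return 0;
}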
Bandwidth and Latency of Modern VLSI
[Plot: latency and bandwidth vs. size — over sizes from 1 to 10^5, latency grows from 1 to ~10^3 while bandwidth falls from 1 to ~10^-6 (normalized)]
Architecture for Locality
Exploit high on-chip bandwidth
[Diagram: off-chip RAM reaches the chip through 2GB/s of pin bandwidth; on chip, a vector register file feeds 10^4 32-bit ALUs through a switch at 50GB/s]
Tomorrow’s Computing Engines
• Aimed at media processing
– stream based
– latency tolerant
– low-precision
– little reuse
– lots of conditionals
• Use the large number of devices available on future chips
• Make efficient use of scarce communication resources
– bandwidth hierarchy
– no centralized resources
• Approach the performance of a special-purpose processor
Why do Special-Purpose Processors Perform Well?
Care and Feeding of ALUs
[Diagram: an ALU must be fed on two paths — data bandwidth from the registers, and instruction bandwidth from the instruction cache through the IR and IP]
Three Key Problems
• Instruction bandwidth
• Data bandwidth
• Data-dependent conditional execution
A Bandwidth Hierarchy
[Diagram: four SDRAM banks feed a streaming memory system at 1.6GB/s, which feeds a vector register file at 50GB/s, which feeds the ALU clusters at 500GB/s; 13 ALUs per cluster]
• Solves the data bandwidth problem
A Streaming Memory System
[Diagram: address generators issue stream references through a crossbar to per-bank reorder queues in front of each SDRAM bank]
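A toy C model of the dispatch side of such a memory system, assuming 32-byte bank interleaving across four banks; all constants and the queueing discipline are illustrative. Each generated stream address is steered to the bank that owns it, so each bank can service its queue at its own pace while a reorder queue restores stream order on the way back:

#include <stdio.h>

#define NBANKS 4

int main(void)
{
    unsigned base = 0x1000, stride = 24;      /* strided record stream */
    for (unsigned i = 0; i < 16; i++) {
        unsigned addr = base + i * stride;
        unsigned bank = (addr >> 5) % NBANKS; /* interleave on 32B lines */
        printf("record %2u -> bank %u, addr 0x%04x\n", i, bank, addr);
    }
    return 0;
}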
Streaming Memory Performance
[Plot: bank queue effectiveness — cycles per access (0 to 1.8) vs. queue size (1, 2, 4, 8, 16, 32, 64, infinite)]
• Exploit latency insensitivity for improved bandwidth
Compound Vector Operations
One instruction does lots of work
[Diagram: a single ~50-bit compound vector instruction (Op plus vector registers V0-V7) enters a control store, whose µIP sequences many per-element operations (Op Ra Rb triples) in the clusters; memory instructions such as LD Vd Vx drive the address generators between memory and the VRF]
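One plausible reading of the 50-bit format, sketched as a C struct: an ~10-bit opcode naming a control-store microprogram plus eight 5-bit vector-register specifiers comes to about 50 bits. This layout is an assumption for illustration, not the actual Imagine encoding.

#include <stdint.h>

/* Hypothetical compound-vector-instruction fields (~10b + 8 x 5b = 50b).
   The opcode picks a control-store microprogram; the uIP then sequences
   many (Op, Ra, Rb) ALU operations per stream element. */
typedef struct {
    uint16_t opcode;    /* control-store entry point, ~10 bits used */
    uint8_t  vreg[8];   /* vector-register operands V0..V7, 5 bits each */
} compound_vector_inst;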
Scheduling by Simulated Annealing
• List scheduling assumes global communication
– does poorly when communication is exposed
• View scheduling as a CAD problem (place and route)
– generate naïve ‘feasible’ schedule
– iteratively improve the schedule by moving operations (see the sketch below)
[Diagram: the schedule as a 2D grid of ALUs × time]
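A self-contained C sketch of this annealing loop under simplifying assumptions: operations are independent (a real scheduler must also respect data dependences and communication constraints), the cost is just schedule length, and all constants (temperature schedule, move counts) are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NOPS  32
#define NALUS 4
#define SLOTS 32

static int slot[NOPS], alu[NOPS];

static int cost(void)                 /* schedule length: latest used slot */
{
    int len = 0;
    for (int i = 0; i < NOPS; i++)
        if (slot[i] > len) len = slot[i];
    return len;
}

static int conflict(int op)           /* one op per (ALU, slot) pair */
{
    for (int i = 0; i < NOPS; i++)
        if (i != op && slot[i] == slot[op] && alu[i] == alu[op])
            return 1;
    return 0;
}

int main(void)
{
    /* naive feasible schedule: every op in its own slot on ALU 0 */
    for (int i = 0; i < NOPS; i++) { slot[i] = i; alu[i] = 0; }

    for (double T = 10.0; T > 0.01; T *= 0.95) {       /* geometric cooling */
        for (int k = 0; k < 200; k++) {
            int op = rand() % NOPS, os = slot[op], oa = alu[op];
            int before = cost();
            slot[op] = rand() % SLOTS;                 /* move one operation */
            alu[op]  = rand() % NALUS;
            int delta = cost() - before;
            /* keep improvements; accept uphill moves with prob e^(-delta/T) */
            if (conflict(op) ||
                (delta > 0 &&
                 rand() / (double)RAND_MAX >= exp(-delta / T))) {
                slot[op] = os; alu[op] = oa;           /* reject: undo move */
            }
        }
    }
    printf("final schedule length: %d slots\n", cost() + 1);
    return 0;
}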
Typical Annealing Schedule
[Plot: schedule cost vs. annealing iteration — cost falls from 166 to 13 over roughly 18,000 iterations]
Conventional Approaches to Data-Dependent Conditional Execution
Zero-Cost Conditionals
• Most Approaches to Conditional Operations are Costly
– Branching control flow: dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors): a large fraction of execution ‘opportunities’ go idle
• Conditional Vectors
– append an element to an output stream depending on a case variable (sketched below)
[Diagram: a case stream of {0,1} values steers each element to result stream 0 or 1]
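A scalar C model of the mechanism (the hardware does this per cluster, in parallel; this sketch only shows the semantics): each element is appended to the output stream only when its case value selects it, so the output is dense and no execution slots are spent on discarded elements.

#include <stddef.h>

/* Conditional stream append: keep in[i] iff case_stream[i] is 1.
   Returns the length of the compacted output stream. */
size_t conditional_append(const int *in, const int *case_stream,
                          size_t n, int *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        if (case_stream[i])
            out[len++] = in[i];
    return len;
}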
Application Sketch - Polygon Rendering
[Diagram: triangle vertices (V1, V2, V3) become vertex records (X, Y, RGB, UV); vertices are expanded into span records (Y, X1, X2, RGB1, UV1); spans are interpolated into pixel records (X, Y, RGB, UV); texture lookup produces TexturedPixel records (X, Y, RGB)]
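Reading the record fields off the diagram, the pipeline's stream types might look like the following C structs; the field layout is inferred from the slide's labels and the element types are an assumption, not taken from the actual implementation.

/* Stream record types for the rendering pipeline (fields as labeled
   in the diagram; float is an assumption). */
typedef struct { float x, y;      float rgb[3];  float uv[2];  } Vertex;
typedef struct { float y, x1, x2; float rgb1[3]; float uv1[2]; } Span;
typedef struct { float x, y;      float rgb[3];  float uv[2];  } Pixel;
typedef struct { float x, y;      float rgb[3];  } TexturedPixel;

Each stage consumes one stream type and produces the next, which is what makes the application a natural fit for the compound-vector model above.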
Status
• Working simulator of Imagine
• Simple kernels running on simulator
– FFT
• Applications being developed
– Depth extraction, video compression, polygon rendering, image-based graphics
Acknowledgements
• Students/Staff
– Don Alpert (Intel)
– Chris Buehler (MIT)
– J.P. Grossman (MIT)
– Brad Johanson
– Ujval Kapasi
– Brucek Khailany
– Abelardo Lopez-Lagunas
– Peter Mattson
– John Owens
– Scott Rixner
• Helpful Suggestions
– Henry Fuchs (UNC)
– Pat Hanrahan
– Tom Knight (MIT)
– Marc Levoy
Conclusion
• Work toward tomorrow’s computing engines
• Targeted toward media processing
– streams of low-precision samples
– little reuse
– latency tolerant
• Matched to the capabilities of communication-limited technology
– explicit bandwidth hierarchy
– explicit communication between units
– communication exposed