Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for
Convolutional Neural Networks
Yu-Hsin Chen¹, Joel Emer¹,², Vivienne Sze¹
¹ MIT   ² NVIDIA
Contributions of This Work
• A novel energy-efficient CNN dataflow that has been verified in a fabricated chip, Eyeriss.
• A taxonomy of CNN dataflows that classifies previous work into three categories.
• A framework that compares the energy efficiency of different dataflows under the same area and CNN setup.
Eyeriss [ISSCC 2016]: a reconfigurable CNN processor
• 4000 µm × 4000 µm die with an on-chip buffer and a spatial PE array
• 35 fps @ 278 mW* (* AlexNet CONV layers)
Deep Convolutional Neural Networks
• A cascade of CONV layers extracts features, from low-level to high-level
• 1 – 3 FC layers at the end map the high-level features to classes
• Modern deep CNNs: up to 1000 CONV layers
• Convolutions account for more than 90% of the overall computation, dominating runtime and energy consumption
High-Dimensional CNN Convolution
[Figure: an R×R filter slides over an H×H input image (feature map); at each window position, element-wise multiplication and partial sum (psum) accumulation produce one pixel of the E×E output image]
• Sliding window processing of the input image
High-Dimensional CNN Convolution
• Many input channels (C): the filter and the input image each have C channels, and psums are accumulated across channels
• Many filters (M): M filters produce the M output channels
• Many input images (N): a batch of N input images yields N output images
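For readers who want the computation spelled out, the loop nest below is a plain NumPy sketch of this high-dimensional convolution (a reading aid only, not Eyeriss code); stride 1 and no padding are assumed, so E = H − R + 1.

import numpy as np

def conv_layer(images, filters):
    # images: [N][C][H][H], filters: [M][C][R][R], output: [N][M][E][E]
    N, C, H, _ = images.shape
    M, _, R, _ = filters.shape
    E = H - R + 1
    out = np.zeros((N, M, E, E))
    for n in range(N):              # many input images (batch)
        for m in range(M):          # many filters / output channels
            for y in range(E):      # output rows
                for x in range(E):  # output columns (sliding window)
                    for c in range(C):          # accumulate over channels
                        for i in range(R):      # ...and the R x R window
                            for j in range(R):
                                out[n, m, y, x] += (
                                    images[n, c, y + i, x + j]
                                    * filters[m, c, i, j]
                                )
    return out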
Memory Access is the Bottleneck
• Each MAC* requires memory reads of the filter weight, image pixel, and partial sum, and a memory write of the updated partial sum
• Worst case: all memory reads and writes are DRAM accesses
• Example: AlexNet [NIPS 2012] has 724M MACs → 2896M DRAM accesses required (4 per MAC: 3 reads + 1 write)
• Extra levels of local memory hierarchy create two opportunities: ① data reuse and ② local accumulation
* multiply-and-accumulate
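A quick sanity check of the worst-case figure above, assuming 4 DRAM accesses per MAC (3 reads + 1 write):

ALEXNET_CONV_MACS = 724e6          # from the slide
accesses_per_mac = 3 + 1           # 3 reads + 1 write, all hitting DRAM
print(ALEXNET_CONV_MACS * accesses_per_mac)   # ~2.9e9, i.e. the 2896M figure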
Types of Data Reuse in CNN
• Convolutional Reuse (CONV layers only, sliding window): overlapping windows reuse both image pixels and filter weights
• Image Reuse (CONV and FC layers): the same image is used by multiple filters — reuses image pixels
• Filter Reuse (CONV and FC layers, batch size > 1): the same filter is used by multiple images — reuses filter weights
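The sketch below (our own illustration, not from the talk) tallies the maximum reuse each data type offers in one CONV layer with the shapes defined earlier; the layer dimensions in the example call are hypothetical.

def max_reuse(N, M, C, H, R):
    E = H - R + 1
    return {
        # convolutional reuse: a filter weight is used at every output pixel;
        # an image pixel is used by up to R x R overlapping windows
        "conv_reuse_per_weight": E * E,
        "conv_reuse_per_pixel": R * R,
        # image reuse: each pixel is needed by all M filters
        "image_reuse_per_pixel": M,
        # filter reuse: each weight is needed by all N images in the batch
        "filter_reuse_per_weight": N,
    }

# Hypothetical layer shape, for illustration only
print(max_reuse(N=16, M=256, C=192, H=15, R=3))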
Memory Access is the Bottleneck
With extra levels of local memory hierarchy, the two opportunities pay off:
1) Data reuse can reduce DRAM reads of filters/images by up to 500×**
2) Partial sum accumulation does NOT have to access DRAM
• Example: DRAM accesses in AlexNet can be reduced from 2896M to 61M (best case)
** AlexNet CONV layers
Spatial Architecture for CNN
[Figure: DRAM ↔ Global Buffer (100 – 500 kB) ↔ an array of Processing Elements (PEs); each PE contains an ALU, control, and a Reg File of 0.5 – 1.0 kB]
Local Memory Hierarchy
• Global Buffer
• Direct inter-PE network
• PE-local memory (RF)
Low-Cost Local Data Access
Normalized energy cost* of fetching data to run a MAC at the ALU:
• ALU operation: 1× (reference)
• RF → ALU (0.5 – 1.0 kB): 1×
• PE → PE over the NoC (200 – 1000 PEs): 2×
• Global Buffer → ALU (100 – 500 kB): 6×
• DRAM → ALU: 200×

How to exploit ① data reuse and ② local accumulation with limited low-cost local storage?
→ a specialized processing dataflow is required!
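The comparison framework implies an energy model of the form "accesses per level × normalized cost per level." The sketch below illustrates that idea using the cost table above; the access counts in the example call are placeholders, not measured numbers.

# Normalized access costs from the slide
NORMALIZED_COST = {"ALU": 1, "RF": 1, "NoC": 2, "buffer": 6, "DRAM": 200}

def normalized_energy_per_mac(access_counts, num_macs):
    """Energy per MAC, normalized so that one ALU operation costs 1."""
    total = num_macs * NORMALIZED_COST["ALU"]       # the MACs themselves
    for level, count in access_counts.items():      # data movement energy
        total += count * NORMALIZED_COST[level]
    return total / num_macs

# Hypothetical access counts for some dataflow on some layer (illustration only)
example = {"RF": 3.0e9, "NoC": 0.5e9, "buffer": 0.3e9, "DRAM": 0.06e9}
print(normalized_energy_per_mac(example, num_macs=1.1e9))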
Taxonomy of Existing Dataflows
• Weight Stationary (WS)
• Output Stationary (OS)
• No Local Reuse (NLR)
Weight Stationary (WS)
• Minimize weight read energy consumption
  − maximize convolutional and filter reuse of weights
• Weights (W0 – W7) stay stationary in the PEs; pixels and psums move between the Global Buffer and the PEs
• Examples: [Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [Origami, GLSVLSI 2015]
Output Stationary (OS)
• Minimize partial sum R/W energy consumption
  − maximize local accumulation
• Psums (P0 – P7) stay stationary in the PEs; pixels and weights move between the Global Buffer and the PEs
• Examples: [Gupta, ICML 2015] [ShiDianNao, ISCA 2015] [Peemen, ICCD 2013]
No Local Reuse (NLR)
• Use a large global buffer as shared storage
  − reduce DRAM access energy consumption
• Nothing stays stationary in the PEs; pixels, weights, and psums all move through the Global Buffer
• Examples: [DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015]
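To make the taxonomy concrete, the sketch below (ours, not from the talk) shows 1D loop nests in which either a weight or an output psum is the datum kept "stationary" in the innermost reuse position; NLR would keep neither in the PE.

def weight_stationary(inp, w):
    out = [0.0] * (len(inp) - len(w) + 1)
    for i, wi in enumerate(w):            # a weight is fetched once...
        for x in range(len(out)):         # ...and reused for every output (WS)
            out[x] += inp[x + i] * wi     # psums are read/written each time
    return out

def output_stationary(inp, w):
    out = []
    for x in range(len(inp) - len(w) + 1):
        acc = 0.0                         # a psum stays local...
        for i, wi in enumerate(w):        # ...while weights/pixels stream by (OS)
            acc += inp[x + i] * wi
        out.append(acc)
    return out

print(weight_stationary([1, 2, 3, 4, 5], [1, 0, -1]))   # [-2, -2, -2]
print(output_stationary([1, 2, 3, 4, 5], [1, 0, -1]))   # [-2, -2, -2]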
Energy Efficiency Comparison
[Chart: Normalized Energy/MAC of the CNN dataflows WS, OSA, OSB, OSC (variants of OS), NLR, and Row Stationary (RS)]
• Same total area
• 256 PEs
• AlexNet configuration* (batch size = 16)
* AlexNet CONV layers
Energy-Efficient Dataflow: Row Stationary (RS)
• Maximize reuse and accumulation at the RF
• Optimize for overall energy efficiency instead of for only a certain data type
1D Row Convolution in PE
[Figure: a filter row (weights a, b, c) is convolved with an image row (pixels a, b, c, d, e) to produce a row of partial sums; the PE's Reg File holds the filter row, a sliding window of the image row, and the psum being accumulated, stepping one pixel at a time]
• Maximize row convolutional reuse in RF
  − keep a filter row and an image sliding window in RF
• Maximize row psum accumulation in RF
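A minimal behavioral model (ours) of the 1D row primitive a single PE runs, with the filter row, the image sliding window, and the psum all held locally:

def pe_1d_row_conv(image_row, filter_row):
    R = len(filter_row)                      # e.g. 3 weights: a, b, c
    psums = []
    window = list(image_row[:R])             # sliding window held in the RF
    for x in range(len(image_row) - R + 1):
        acc = 0.0                            # psum accumulated inside the PE
        for w, px in zip(filter_row, window):
            acc += w * px
        psums.append(acc)
        if x + R < len(image_row):           # slide: reuse R-1 pixels, fetch 1
            window = window[1:] + [image_row[x + R]]
    return psums

print(pe_1d_row_conv([1, 2, 3, 4, 5], [1, 1, 1]))   # -> [6, 9, 12]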
2D Convolution in PE Array
A 2D convolution with a 3-row filter maps onto a logical 3×3 set of PEs, each performing a 1D row convolution:
• Output Row 1: PE 1 (filter Row 1 * image Row 1), PE 2 (filter Row 2 * image Row 2), PE 3 (filter Row 3 * image Row 3)
• Output Row 2: PE 4 (filter Row 1 * image Row 2), PE 5 (filter Row 2 * image Row 3), PE 6 (filter Row 3 * image Row 4)
• Output Row 3: PE 7 (filter Row 1 * image Row 3), PE 8 (filter Row 2 * image Row 4), PE 9 (filter Row 3 * image Row 5)
Convolutional Reuse Maximized
• Filter rows are reused across PEs horizontally: each filter row (Row 1 – 3) is shared by the PEs along a row of the array
Convolutional Reuse Maximized
• Image rows are reused across PEs diagonally: each image row (Row 1 – 5) is shared by the PEs along a diagonal of the array
Maximize 2D Accumulation in PE Array
• Partial sums accumulate across PEs vertically: each column of PEs sums its psums to produce one output row (Row 1, 2, 3)
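Putting these properties together, the sketch below (ours) models the logical PE grid: PE[i][j] convolves filter row i with image row i + j, and each column's psums are summed vertically into output row j.

def conv1d(image_row, filter_row):
    R = len(filter_row)
    return [sum(w * p for w, p in zip(filter_row, image_row[x:x + R]))
            for x in range(len(image_row) - R + 1)]

def rs_2d_conv(image, filt):
    R = len(filt)                          # number of filter rows, e.g. 3
    E = len(image) - R + 1                 # number of output rows
    output = []
    for j in range(E):                     # one PE column per output row
        col = None
        for i in range(R):                 # PE[i][j]: filter row i * image row i+j
            psums = conv1d(image[i + j], filt[i])
            col = psums if col is None else [a + b for a, b in zip(col, psums)]
        output.append(col)                 # vertical psum accumulation
    return output

# Tiny example: 3x3 all-ones filter over a 5x5 image of ones -> all 9s
print(rs_2d_conv([[1] * 5 for _ in range(5)], [[1] * 3 for _ in range(3)]))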
Dimensions Beyond 2D Convolution
The 1D row primitive in each PE is extended to cover ① multiple images, ② multiple filters, and ③ multiple channels.

Filter Reuse in PE (multiple images)
• Image 1 and Image 2 share the same filter row (Filter 1, Channel 1, Row 1)
• Processing in PE: concatenate image rows, so one filter row produces psum rows for both images
Image Reuse in PE (multiple filters)
• Filter 1 and Filter 2 share the same image row (Image 1, Channel 1, Row 1)
• Processing in PE: interleave filter rows, so one image row produces psum rows for both filters
Channel Accumulation in PE (multiple channels)
• Psum rows from Channel 1 and Channel 2 (same filter, same image) are accumulated into one psum row
• Processing in PE: interleave channels and accumulate psums locally
CNN Convolution – The Full Picture
The 2D PE-array mapping is combined with the in-PE techniques:
• Multiple images: concatenate image rows (share a filter row)
• Multiple filters: interleave filter rows (share an image row)
• Multiple channels: interleave channels and accumulate psums
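As a behavioral sketch (ours, ignoring the actual RF scheduling), the hypothetical function below shows the arithmetic one PE covers when image rows are concatenated, filter rows interleaved, and channels interleaved with accumulation:

def pe_multi(img_rows, filt_rows):
    # img_rows[n][c]  : one row of image n, channel c
    # filt_rows[f][c] : one row of filter f, channel c
    N, C = len(img_rows), len(img_rows[0])
    F = len(filt_rows)
    R = len(filt_rows[0][0])
    W = len(img_rows[0][0]) - R + 1
    psums = [[[0.0] * W for _ in range(F)] for _ in range(N)]
    for n in range(N):                 # concatenated image rows
        for f in range(F):             # interleaved filter rows
            for c in range(C):         # interleaved channels -> accumulate
                for x in range(W):
                    psums[n][f][x] += sum(
                        filt_rows[f][c][k] * img_rows[n][c][x + k]
                        for k in range(R))
    return psums                       # one psum row per (image, filter) pair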
Simulation Results
• Same total hardware area
• 256 PEs
• AlexNet Configuration
• Batch size = 16
Dataflow Comparison: CONV Layers
[Charts: Normalized Energy/MAC of WS, OSA, OSB, OSC, NLR, and RS, broken down (a) by storage level (ALU, RF, NoC, buffer, DRAM) and (b) by data type (psums, weights, pixels)]
RS uses 1.4× – 2.5× lower energy than the other dataflows
Dataflow Comparison: FC Layers
[Chart: Normalized Energy/MAC of WS, OSA, OSB, OSC, NLR, and RS, broken down by data type (psums, weights, pixels)]
RS uses at least 1.3× lower energy than the other dataflows
Row Stationary: Layer Breakdown
[Chart: normalized energy (1 MAC = 1) of AlexNet layers L1 – L8 under RS, broken down by ALU, RF, NoC, buffer, and DRAM]
• CONV layers account for 80% of the total energy; FC layers account for 20%
Summary
• We propose a Row Stationary (RS) dataflow to exploit the low-cost local memories in a spatial architecture.
• RS optimizes for overall energy efficiency, while existing CNN dataflows focus only on certain data types.
• RS has higher energy efficiency than existing dataflows:
  − 1.4× – 2.5× higher in CONV layers
  − at least 1.3× higher in FC layers (batch size ≥ 16)
• We have verified RS in a fabricated CNN processor chip, Eyeriss.
Thank You
Learn more about Eyeriss at
http://eyeriss.mit.edu