
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks


(1)

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks

Yu-Hsin Chen¹, Joel Emer¹,², Vivienne Sze¹

¹ MIT  ² NVIDIA

(2)

Contributions of This Work

• A novel energy-efficient CNN dataflow that has been verified in a fabricated chip, Eyeriss.

• A taxonomy of CNN dataflows that classifies previous work into three categories.

• A framework that compares the energy efficiency of different dataflows under the same area and CNN setup.

[Die photo: Eyeriss chip, 4000 µm × 4000 µm, showing the on-chip buffer and the spatial PE array]

Eyeriss [ISSCC 2016]: a reconfigurable CNN processor, 35 fps @ 278 mW (AlexNet CONV layers)

(3)

Deep Convolutional Neural Networks

[Figure: a chain of CONV layers extracts low-level to high-level features, followed by FC layers that produce the output classes]

Modern deep CNNs have up to 1000 CONV layers.

(4)

Deep Convolutional Neural Networks

[Figure: the same pipeline; the FC layers at the end number only 1 – 3 layers]

(5)

Deep Convolutional Neural Networks

[Figure: the CONV layers dominate the network]

Convolutions account for more than 90% of the overall computation, dominating runtime and energy consumption.

(6)

High-Dimensional CNN Convolution

[Figure: an R×R filter applied to one position ("a pixel") of an H×H input image (feature map) produces one pixel of an E×E output image via element-wise multiplication and partial sum (psum) accumulation]

(7)

High-Dimensional CNN Convolution

[Figure: sliding-window processing: the R×R filter slides over the H×H input image (feature map) to produce the E×E output image]

(8)

High-Dimensional CNN Convolution

[Figure: many input channels (C): the filter and the input image each have C channels, and the channel contributions are accumulated into one output image]

(9)

High-Dimensional CNN Convolution

[Figure: many filters (M): M filters, each R×R×C, produce M output channels]

(10)

High-Dimensional CNN Convolution

[Figure: many input images (N): a batch of N input images, each H×H×C, is convolved with the M filters to produce N output images, each E×E×M]
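Putting the dimensions together, the layer computes the loop nest sketched below. This is a minimal NumPy-style sketch using the slide's symbols (N images, M filters, C channels, H×H inputs, R×R filters, E×E outputs) and assuming stride 1 with no padding; it is only meant to make the data reuse discussed next explicit.

    import numpy as np

    def conv_layer(images, filters):
        # images:  (N, C, H, H) -- N input images, C channels, H x H pixels
        # filters: (M, C, R, R) -- M filters, C channels, R x R weights
        N, C, H, _ = images.shape
        M, _, R, _ = filters.shape
        E = H - R + 1                                  # output size (stride 1, no padding)
        outputs = np.zeros((N, M, E, E))
        for n in range(N):                             # many input images (N)
            for m in range(M):                         # many filters / output channels (M)
                for y in range(E):                     # output rows
                    for x in range(E):                 # output columns
                        for c in range(C):             # accumulate over input channels (C)
                            for i in range(R):
                                for j in range(R):
                                    outputs[n, m, y, x] += (images[n, c, y + i, x + j]
                                                            * filters[m, c, i, j])
        return outputs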

(11)

Memory Access is the Bottleneck

[Figure: each MAC (multiply-and-accumulate) reads a filter weight, an image pixel, and a partial sum from memory, and writes the updated partial sum back]

(12)

Memory Access is the Bottleneck

[Figure: worst case, every memory read and write goes to DRAM]

Worst case: all memory reads/writes are DRAM accesses.

• Example: AlexNet [NIPS 2012] has 724M MACs, which would require 2896M DRAM accesses.
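That worst-case count follows from charging all four memory operations of each MAC (read filter weight, read image pixel, read partial sum, write updated partial sum) to DRAM; a one-line sanity check:

    macs = 724e6                     # AlexNet CONV-layer MAC count from the slide
    accesses_per_mac = 4             # weight read + pixel read + psum read + psum write
    print(macs * accesses_per_mac)   # 2.896e9, i.e. the 2896M worst-case DRAM accesses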

(13)

Memory Access is the Bottleneck

[Figure: extra levels of local memory hierarchy (Mem) are inserted between DRAM and the ALU]

(14)

Memory Access is the Bottleneck

[Figure: local memory hierarchy between DRAM and the ALU]

Opportunities: data reuse, local accumulation

(15)

Types of Data Reuse in CNN

• Convolutional Reuse: CONV layers only (sliding window). Reuse: image pixels and filter weights.
• Image Reuse: CONV and FC layers. Reuse: image pixels.
• Filter Reuse: CONV and FC layers (batch size > 1). Reuse: filter weights.
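The maximum amount of each kind of reuse follows directly from the loop nest sketched earlier; the helper below computes those nominal (upper-bound) reuse factors for generic layer dimensions, ignoring image borders and assuming stride 1.

    def reuse_factors(N, M, C, E, R):
        """Nominal per-value reuse implied by the convolution loop nest."""
        filter_reuse = E * E * N    # each weight is applied at every output position of every image
        image_reuse  = R * R * M    # each pixel falls in up to R*R windows, for each of M filters
        psum_accum   = C * R * R    # each output pixel accumulates C*R*R partial products
        return filter_reuse, image_reuse, psum_accum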

(16)

Memory Access is the Bottleneck

[Figure: local memory hierarchy with the data-reuse opportunity highlighted]

Opportunities: data reuse, local accumulation

1) Can reduce DRAM reads of filters/images by up to 500×** (** AlexNet CONV layers)

(17)

Memory Access is the Bottleneck

[Figure: local memory hierarchy with both opportunities highlighted]

Opportunities: data reuse, local accumulation

1) Can reduce DRAM reads of filters/images by up to 500×
2) Partial sum accumulation does NOT have to access DRAM

(18)

Memory Access is the Bottleneck

Opportunities: data reuse, local accumulation

1) Can reduce DRAM reads of filters/images by up to 500×
2) Partial sum accumulation does NOT have to access DRAM

• Example: DRAM accesses for AlexNet can be reduced from 2896M to 61M (best case)

(19)

Spatial Architecture for CNN

[Figure: DRAM feeds a Global Buffer (100 – 500 kB), which feeds a spatial array of Processing Elements (PEs); each PE contains control logic, a register file (Reg File, 0.5 – 1.0 kB), and an ALU]

Local Memory Hierarchy
• Global Buffer
• Direct inter-PE network
• PE-local memory (RF)

(20)

Low-Cost Local Data Access

[Figure: normalized energy cost of fetching data to run a MAC at the ALU, with the PE-local register file (RF, 0.5 – 1.0 kB) access as the reference, compared against a neighboring PE over the NoC (200 – 1000 PEs), the global buffer (100 – 500 kB), and DRAM, which costs roughly 200× the reference]
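A small sketch of how these relative costs turn data-movement counts into energy. Only the roughly 200× DRAM figure appears on the slide; the NoC and buffer ratios below are illustrative placeholders for the example, not numbers from the talk.

    # Normalized energy per access; only the DRAM ~200x figure is from the slide,
    # the NoC and buffer ratios are illustrative assumptions.
    ENERGY_PER_ACCESS = {"RF": 1.0, "NoC": 2.0, "buffer": 6.0, "DRAM": 200.0}

    def data_movement_energy(access_counts):
        """Total data-movement energy, normalized to one RF access."""
        return sum(ENERGY_PER_ACCESS[level] * n for level, n in access_counts.items())

    # Serving 1000 fetches from the RF instead of DRAM cuts the cost by ~200x:
    print(data_movement_energy({"DRAM": 1000}))   # 200000.0
    print(data_movement_energy({"RF": 1000}))     # 1000.0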

(21)

Low-Cost Local Data Access

[Figure: the same storage hierarchy: RF (0.5 – 1.0 kB), inter-PE NoC (200 – 1000 PEs), global buffer (100 – 500 kB), DRAM]

How to exploit data reuse and local accumulation with limited low-cost local storage?

A specialized processing dataflow is required!

(22)

Taxonomy of Existing Dataflows

• Weight Stationary (WS)

• Output Stationary (OS)

• No Local Reuse (NLR)

(23)

Weight Stationary (WS)

• Minimize weight-read energy consumption: maximize convolutional and filter reuse of weights

[Figure: weights W0 – W7 stay stationary in the PEs; pixels and psums move between the global buffer and the PE array]

Examples: [Chakradhar, ISCA 2010], [nn-X (NeuFlow), CVPRW 2014], [Park, ISSCC 2015], [Origami, GLSVLSI 2015]
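A loop-nest caricature of the WS idea rather than of any one cited design: each PE pins one filter weight in its register file while pixels and partial sums stream past it (function name and array layout are hypothetical).

    def weight_stationary_row(weights, pixels):
        """1D convolution on a row of WS PEs: PE i holds weights[i] in its RF for the
        whole computation; pixels and partial sums stream through the array."""
        R = len(weights)
        psums = [0.0] * (len(pixels) - R + 1)
        for i, w in enumerate(weights):           # weight w is stationary in PE i
            for t in range(len(psums)):
                psums[t] += w * pixels[t + i]     # pixels and psums move; weights do not
        return psums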

(24)

Output Stationary (OS)

• Minimize partial-sum R/W energy consumption: maximize local accumulation

[Figure: partial sums P0 – P7 stay stationary in the PEs; pixels and weights move between the global buffer and the PE array]

Examples: [Gupta, ICML 2015], [ShiDianNao, ISCA 2015], [Peemen, ICCD 2013]
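The OS counterpart, again only as a sketch: each PE owns one output value, keeps its partial sum in the local RF, and streams the matching pixel/weight pairs past it (the operand streams are assumed to be pre-aligned for that output).

    def output_stationary_pe(pixel_stream, weight_stream):
        """One OS PE: the partial sum for one output pixel stays in the PE's RF and is
        accumulated locally while pixels and weights are streamed in."""
        psum = 0.0                                    # stationary in the local RF
        for px, w in zip(pixel_stream, weight_stream):
            psum += px * w                            # operands move; the psum does not
        return psum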

(25)

No Local Reuse (NLR)

• Use a large global buffer as shared storage: reduce DRAM access energy consumption

[Figure: pixels, weights, and psums all move between the global buffer and the PEs; nothing is kept in PE-local storage]

Examples: [DianNao, ASPLOS 2014], [DaDianNao, MICRO 2014], [Zhang, FPGA 2015]
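For completeness, the NLR extreme as a sketch: PEs keep no operands locally, so every MAC touches the shared global buffer (modeled here as one flat array with hypothetical operand addresses).

    def no_local_reuse_mac(global_buffer, pixel_addr, weight_addr, psum_addr):
        """NLR: every operand of every MAC is read from the shared global buffer and the
        updated partial sum is written straight back; nothing stays in the PE."""
        global_buffer[psum_addr] += global_buffer[pixel_addr] * global_buffer[weight_addr]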

(26)

Energy Efficiency Comparison

[Figure: normalized energy per MAC (0 – 2) for the dataflows WS, OSA, OSB, OSC (variants of OS), NLR, and Row Stationary (RS)]

Setup: same total area, 256 PEs, AlexNet configuration (CONV layers), batch size = 16

(27)

Energy-Efficient Dataflow: Row Stationary (RS)

• Maximize reuse and accumulation at the RF

• Optimize for overall energy efficiency instead of for only a certain data type

(28)

1D Row Convolution in PE

[Figure: a filter row * an input image row = an output image row]

(29)

1D Row Convolution in PE

[Figure: the PE's register file holds the filter row (a b c) and the first sliding window of the input image row (a b c d e); the first partial sum is produced]

(30)

1D Row Convolution in PE

[Figure: the sliding window advances by one pixel; the register file keeps the filter row and the current window, and the next partial sum is produced]

(31)

1D Row Convolution in PE

[Figure: the sliding window advances by one more pixel; the register file again holds the filter row and the current window]

(32)

1D Row Convolution in PE

[Figure: the sliding window reaches the end of the image row, completing the row of partial sums]

(33)

1D Row Convolution in PE

[Figure: the PE's register file holds the filter row, the current image sliding window, and the psum row]

• Maximize row convolutional reuse in RF: keep a filter row and the image sliding window in the RF

• Maximize row psum accumulation in RF
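A minimal sketch of this RS primitive, assuming stride 1: the filter row and an R-pixel sliding window of the image row stay in the register file while the psum row is accumulated locally (function and variable names are illustrative).

    def pe_1d_row_conv(filter_row, image_row):
        """Row-Stationary primitive: 1D convolution of one filter row with one image row,
        keeping the filter row and the current sliding window in the PE's register file."""
        R = len(filter_row)                      # e.g. 3 taps (a, b, c) as in the slides
        window = list(image_row[:R])             # sliding window resident in the RF
        psum_row = []
        for t in range(len(image_row) - R + 1):
            psum_row.append(sum(w * x for w, x in zip(filter_row, window)))
            if t + R < len(image_row):
                window = window[1:] + [image_row[t + R]]   # shift in one new pixel
        return psum_row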

(34)

2D Convolution in PE Array

[Figure: PE 1 convolves filter row 1 with image row 1]

(35)

2D Convolution in PE Array

[Figure: PE 1 – 3 convolve filter rows 1 – 3 with image rows 1 – 3; together they produce output row 1]

(36)

2D Convolution in PE Array

[Figure: a second PE column (PE 4 – 6) convolves filter rows 1 – 3 with image rows 2 – 4 to produce output row 2]

(37)

2D Convolution in PE Array

[Figure: the full 3×3 PE array: column 1 (PE 1 – 3) pairs filter rows 1 – 3 with image rows 1 – 3 for output row 1, column 2 (PE 4 – 6) pairs them with image rows 2 – 4 for output row 2, and column 3 (PE 7 – 9) pairs them with image rows 3 – 5 for output row 3]
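The mapping implied by the figure, as a sketch built on the pe_1d_row_conv helper above (0-based indices, stride 1): the PE handling filter row i in the column for output row j is given image row i + j, and each column's psum rows are summed vertically.

    def pe_array_2d_conv(filt, image):
        """2D convolution on an array of (filter rows x output rows) PEs: the PE at
        position (i, j) convolves filter row i with image row i + j, and the psum rows
        of each column j are accumulated vertically into output row j."""
        R = len(filt)                              # number of filter rows
        E = len(image) - R + 1                     # number of output rows
        output_rows = []
        for j in range(E):                         # one PE column per output row
            col = [pe_1d_row_conv(filt[i], image[i + j]) for i in range(R)]
            output_rows.append([sum(vals) for vals in zip(*col)])   # vertical accumulation
        return output_rows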

(38)

Convolutional Reuse Maximized

[Figure: in the 3×3 PE array, each filter row appears in every PE of its array row]

Filter rows are reused across PEs horizontally.

(39)

Convolutional Reuse Maximized

[Figure: each image row appears along a diagonal of the PE array]

Image rows are reused across PEs diagonally.

(40)

Maximize 2D Accumulation in PE Array

[Figure: the psums produced within each PE column are summed down the column into output rows 1 – 3]

Partial sums accumulate across PEs vertically.

(41)

Dimensions Beyond 2D Convolution

1. Multiple Images
2. Multiple Filters
3. Multiple Channels

(42)

Filter Reuse in PE (1: Multiple Images)

[Figure: row 1 of filter 1 (channel 1) * row 1 of image 1 = row 1 of psum 1, and the same filter row * row 1 of image 2 = row 1 of psum 2; the two image rows share the same filter row]

Processing in PE: concatenate the image rows, so images 1 & 2 are processed with filter 1 in one pass, producing psums 1 & 2.

(43)

Image Reuse in PE (2: Multiple Filters)

[Figure: row 1 of filter 1 * row 1 of image 1 (channel 1) = row 1 of psum 1, and row 1 of filter 2 * the same image row = row 1 of psum 2; the two filter rows share the same image row]

Processing in PE: interleave the filter rows, so filters 1 & 2 are processed with image 1 in one pass, producing psums 1 & 2.

(44)

Channel Accumulation in PE (3: Multiple Channels)

[Figure: row 1 of filter 1 (channel 1) * row 1 of image 1 (channel 1) and row 1 of filter 1 (channel 2) * row 1 of image 1 (channel 2) both contribute to the same psum row]

Accumulate psums: row 1 (channel 1) + row 1 (channel 2) = row 1.
(45)

Channel Accumulation in PE (3: Multiple Channels)

[Figure: the same computation folded into one pass: channels 1 & 2 of filter 1 * channels 1 & 2 of image 1 = one accumulated psum row]

Processing in PE: interleave the channels and accumulate their psums.

(46)

CNN Convolution – The Full Picture

[Figure: the 3×3 PE-array mapping of 2D convolution (each PE runs filter row i against image row i + j - 1 for output row j), combined with the per-PE handling of the remaining dimensions]

• Multiple images: concatenate image rows (images 1 & 2 * filter 1 = psums 1 & 2)
• Multiple filters: interleave filter rows (image 1 * filters 1 & 2 = psums 1 & 2)
• Multiple channels: interleave channel rows and accumulate (image 1 * filter 1 = psum)

(47)

Simulation Results

• Same total hardware area

• 256 PEs

• AlexNet Configuration

• Batch size = 16

(48)

Dataflow Comparison: CONV Layers

[Figure: normalized energy per MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, and RS, broken down by hardware level: ALU, RF, NoC, buffer, DRAM]

RS uses 1.4× – 2.5× lower energy than the other dataflows.

(49)

Dataflow Comparison: CONV Layers

[Figure: the same comparison broken down by data type: psums, weights, pixels]

(50)

Dataflow Comparison: FC Layers

[Figure: normalized energy per MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, and RS, broken down by data type: psums, weights, pixels]

RS uses at least 1.3× lower energy than the other dataflows.

(51)

Row Stationary: Layer Breakdown

[Figure: normalized energy (1 MAC = 1), scale 0 – 2.0e10, for AlexNet layers L1 – L8, broken down by ALU, RF, NoC, buffer, and DRAM; the CONV layers account for about 80% of the total energy and the FC layers for about 20%]

(52)

Summary

• We propose the Row Stationary (RS) dataflow to exploit the low-cost local memories in a spatial architecture.

• RS optimizes for the best overall energy efficiency, while existing CNN dataflows focus only on certain data types.

• RS has higher energy efficiency than existing dataflows: 1.4× – 2.5× higher in CONV layers and at least 1.3× higher in FC layers (batch size ≥ 16).

• We have verified RS in a fabricated CNN processor chip, Eyeriss.

[Die photo: 4000 µm × 4000 µm Eyeriss chip]

(53)

Thank You

Learn more about Eyeriss at

http://eyeriss.mit.edu
