EXPLOITING THE INHERENT PARALLELISMS OF BACK-PROPAGATION NEURAL NETWORKS TO DESIGN A SYSTOLIC ARRAY

Jai-Hoon Chung, Hyunsoo Yoon, and Seung Ryoul Maeng
Department of Computer Science
Korea Advanced Institute of Science and Technology (KAIST)
373-1 Kusong-Dong, Yusung-Gu, Taejon 305-701, Korea
ABSTRACT
In this paper, a two-dimensional systolic array for the back-propagation neural network is presented. The design is based on the classical systolic algorithm of matrix-by-vector multiplication, and exploits the inherent parallelisms of back-propagation neural networks. This design executes the forward and backward passes in parallel, and exploits the pipelined parallelism of multiple patterns in each pass. The estimated performance of this design shows that the pipelining of multiple patterns is an important factor in VLSI neural network implementations.
1. Introduction
As simulations of artificial neural networks (ANNs) are time-consuming on a sequential computer, dedicated architectures that exploit the parallelisms inherent in the ANNs are required to implement these networks. A systolic array [Kung82] is one of the best solutions to these problems. It can overcome the communication problems generated by the highly interconnected neurons, and it can exploit the massive parallelism inherent in the problem. Moreover, since the computation of ANNs can be represented by a series of matrix-by-vector multiplications, the classical systolic algorithms can be used to implement them.
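As a concrete illustration of that classical algorithm, the following Python sketch (added here for clarity; the cell layout and timing details are only illustrative and are not taken from any of the cited designs) simulates a linear systolic array computing y = Wx. Each cell holds one row of W and accumulates one output element while the elements of x march through the array one cell per cycle.

```python
import numpy as np

def systolic_matvec(W, x):
    """Simulate a linear systolic array computing y = W @ x.

    Cell i holds row i of W and accumulates y[i].  The elements of x
    march through the array one cell per cycle, so cell i sees x[j]
    at cycle i + j and multiplies it by W[i, j].
    """
    m, n = W.shape
    y = np.zeros(m)
    n_cycles = m + n - 1          # time for the last x element to leave the last cell
    pipe = [None] * m             # value currently held at each cell's input register
    for t in range(n_cycles):
        # shift: each cell passes its input value to the next cell
        for i in range(m - 1, 0, -1):
            pipe[i] = pipe[i - 1]
        pipe[0] = x[t] if t < n else None   # feed the next x element into cell 0
        # compute: cell i multiplies the x element it sees this cycle
        for i in range(m):
            j = t - i             # index of the x element now at cell i
            if pipe[i] is not None and 0 <= j < n:
                y[i] += W[i, j] * pipe[i]
    return y

W = np.arange(6, dtype=float).reshape(2, 3)
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(systolic_matvec(W, x), W @ x)
```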
There have been several research efforts on systolic algorithms and systolic array architectures to implement ANNs. The approaches can be classified into three groups: one is mapping the systolic algorithms for ANNs onto parallel computers such as Warp, MasPar MP-1, and Transputer arrays; another is designing programmable systolic arrays for general ANN models; and the other is designing a VLSI systolic array dedicated to a specific model. All of these implementations exploit the spatial parallelism and training pattern parallelism inherent in ANNs, and suggest systolic ring array or two-dimensional array structures.
The back-propagation neural network [Rume86], in addition to the above two types of parallelism, has one more parallel aspect: the forward and backward passes can be executed in parallel with pipelining of multiple patterns. This paper proposes a systolic design dedicated to the back-propagation neural network that can exploit this type of parallelism.
In the following sections, the existing systolic implementations and the inherent parallelisms that we can exploit are first reviewed. Then a systolic array that can exploit the forward/backward pipelining of multiple patterns is presented. Finally, the potential performance of this design is analyzed.
2. Background
The Back-Propagation Model
Let us consider an L-layer (layer 1 to layer L) network consisting of N_k neurons at the k-th layer. Layer 1 is the input layer, layer L is the output layer, and layers 2 to L-1 are the hidden layers. The operations of the back-propagation model are represented by the following equations.
The Forward Pass

    net_j^k(p) = \sum_{i=1}^{N_{k-1}} w_{ji}^k \, o_i^{k-1}(p)                          (1)
    o_j^k(p) = f(net_j^k(p)), where f is the non-linear activation function             (2)

The Backward Pass

    \delta_j^L(p) = (y_j(p) - o_j^L(p)) \, f'(net_j^L(p))                               (3)
    \delta_j^k(p) = f'(net_j^k(p)) \sum_{i=1}^{N_{k+1}} \delta_i^{k+1}(p) \, w_{ij}^{k+1}   (4)

The Weight Increment Update

    \Delta w_{ji}^k = \eta \sum_p \delta_j^k(p) \, o_i^{k-1}(p)                         (5)
    w_{ji}^k \leftarrow w_{ji}^k + \Delta w_{ji}^k                                      (6)

where o_i^{k-1}(p) is the output of neuron i of layer k-1 for training pattern p, w_{ji}^k is the weight of the connection from neuron i of layer k-1 to neuron j of layer k, y_j(p) is the desired output of output neuron j, \eta is the learning rate, and the error for pattern p is E_p = \frac{1}{2} \sum_j (y_j(p) - o_j^L(p))^2, with total error E = \sum_p E_p.
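The following NumPy sketch, added here as an illustration, restates Equations (1)-(6) in code. A sigmoid is assumed for f (any differentiable activation would do), and the weight increments are accumulated over the whole pattern set and applied once, i.e. batch updating.

```python
import numpy as np

def f(net):                       # assumed sigmoid activation, Equation (2)
    return 1.0 / (1.0 + np.exp(-net))

def train_epoch(weights, patterns, targets, eta=0.1):
    """One epoch of batch back-propagation over all patterns.

    weights[k] is the weight matrix from layer k+1 to layer k+2
    (shape N_{k+2} x N_{k+1}).  Weight increments are accumulated over the
    whole epoch and applied once at the end, matching Equations (5) and (6).
    """
    increments = [np.zeros_like(W) for W in weights]
    for x, y in zip(patterns, targets):
        # forward pass, Equations (1)-(2)
        outputs = [x]
        for W in weights:
            outputs.append(f(W @ outputs[-1]))
        # backward pass, Equations (3)-(4); f'(net) = o(1-o) for the sigmoid
        o_L = outputs[-1]
        delta = (y - o_L) * o_L * (1.0 - o_L)
        for k in range(len(weights) - 1, -1, -1):
            increments[k] += eta * np.outer(delta, outputs[k])   # Equation (5)
            o = outputs[k]
            delta = (weights[k].T @ delta) * o * (1.0 - o)       # propagate error downward
        # weights stay fixed within the epoch, so patterns can be pipelined
    for W, dW in zip(weights, increments):
        W += dW                                                  # Equation (6)
    return weights
```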
Inherent Parallelisms
There are three types of parallelisms inherent in back-propagation neural networks. The first is the spatial parallelism, which can be exploited by executing the operations required in each layer on many processors. The spatial parallelism may be classified into three levels according to the degree of parallelism exploited:

S1) Partitioning the neurons of each layer into groups
S2) Partitioning the operations of each neuron into the sum-of-products operations and the non-linear function
S3) Partitioning the sum-of-products operations into several groups

These partitioned operations can be executed on different processors in parallel. Most of the neural network simulators implemented on parallel computers exploit parallelisms S1) and S2), but parallelism S3) is not exploited much due to the communication overhead (a code sketch of S1) and S2) is given at the end of this subsection).

The second type of parallelism is the pattern parallelism, which can be exploited by executing the different patterns on many processors [Pome88, Mill89, Sing90]. The pattern parallelism can be classified as follows:

P1) Partitioning the training pattern set, independent execution on many processors, and epoch update
P2) Pipelining of multiple training patterns
P3) Pipelining of multiple pattern recalls

The parallelism P2) can be exploited by allowing processors that have completed their part of the operations required to train one pattern to start training the next pattern while the other processors are still processing the previously presented patterns. The parallelism P3) can be exploited by allowing the pipelining of multiple pattern recalls.

The third type of parallelism is the forward/backward pipelined parallelism, which can be exploited by executing the forward and backward passes of different training patterns in parallel [Sing90]. This parallelism implies the exploitation of the parallelism P2). The depth of pipelining can be maximized to two times the number of neurons in the network, and it is effective if the size of the training set is much larger than the number of neurons in the network. The forward/backward pipelined parallelism can be classified according to the depth of pipelining as follows:

F1) Pass-by-pass pipelining
F2) Layer-by-layer pipelining
F3) Neuron-by-neuron pipelining
F4) Connection-by-connection pipelining
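Referring back to the spatial parallelism levels above, the following sketch (an illustration added here; the group count and the NumPy formulation are only illustrative) shows S1) and S2) for a single layer: the neurons are split into groups whose sub-products could be evaluated on separate processors, and the non-linear function is applied as a separate step.

```python
import numpy as np

def layer_forward_partitioned(W, o_prev, n_groups):
    """Spatial parallelism S1): split the neurons of a layer (rows of W) into
    groups that can be evaluated independently, e.g. each on its own processor,
    then concatenate the results.  S2): the non-linear function is a separate
    operation from the sum-of-products.  A NumPy sketch, not a systolic schedule."""
    groups = np.array_split(np.arange(W.shape[0]), n_groups)
    parts = [W[g] @ o_prev for g in groups]   # independent sub matrix-by-vector products
    net = np.concatenate(parts)
    return 1.0 / (1.0 + np.exp(-net))         # non-linear function applied afterwards
```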
Systolic Implementations of Artificial Neural Networks
There have been several systolic implementations of artificial neural networks, which can be classified as follows:
1) Mapping systolic algorithms for neural networks onto parallel computers [Mann90, Mill89, Pome88, Chin90]
2) Design of a programmable systolic array for general neural networks [Kung88, Rama90]
3) Design of a VLSI systolic array dedicated to a specific neural network model [Blay89, Blay90, Kwan90]
These also can be classified according to the dimension of their resultant systolic array. The implementations are summarized in Table 1. As shown in Table 1, the two-dimensional approaches can exploit more parallelisms. Our approach is distinguished from the other approaches in that it exploits the forward/backward pipelined parallelism described above.
Table 1. Systolic Implementations of Artificial Neural Networks
Implementation              Model              Parallelism Exploited                Target
--------------------------  -----------------  -----------------------------------  ---------------
Blayo & Hurat [Blay89]      Hopfield           Spatial S1), S2), S3)                Dedicated VLSI
Blayo & Lehmann [Blay90]    Kohonen            Pattern P3)                          Dedicated VLSI
Kwan & Tsang [Kwan90]       Back-Propagation   Spatial S1), S2), S3); Pattern P3)   Dedicated VLSI
Chinn et al. [Chin90]       Back-Propagation   Spatial S1), S2), S3)                MasPar MP-1
Kato et al. [Kato90]        Back-Propagation   Spatial S1)                          Sandy/8
Updating the Weights
The issue of the weight updating time has been controversial, with contention between researchers who update network weights continuously after each training pattern is presented (continuous updating) and those who update weights only after some subset, or often after the entire set, of training patterns (batch updating). To exploit the training pattern pipelining, we use the batch updating method, because the weights must not be changed between the forward pass and the backward pass of a pattern, as shown in Equations (1)-(6); the systolic array proposed in this paper can, however, support both weight updating methods.
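A minimal sketch of the two updating disciplines follows, using hypothetical helpers grads() and apply_update() that stand for Equations (1)-(5) and (6). Because every pattern in a batch reads the same frozen weight snapshot, their forward and backward passes may overlap in the pipeline; a batch size of 1 degenerates to continuous updating, which forbids that overlap.

```python
def pipelined_batch_training(weights, patterns, grads, apply_update, batch_size):
    """Batch updating as used in this design (sketch with hypothetical helpers:
    grads() runs the forward and backward passes of Equations (1)-(5) for one
    pattern against a frozen weight snapshot; apply_update() sums the accumulated
    increments into the weights as in Equation (6))."""
    for start in range(0, len(patterns), batch_size):
        chunk = patterns[start:start + batch_size]
        increments = [grads(weights, p) for p in chunk]   # may all be in flight at once
        apply_update(weights, increments)                 # pattern feed stalls here
```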
3. A Systolic Array for Back-Propagation Neural Network
The computation of artificial neural networks can be represented by a series of matrix-by-vector multiplications interleaved with the non-linear activation functions. The matrix represents the weights associated with the connections between two adjacent layers, and the vector represents the input patterns or the outputs of a hidden layer in the forward pass, or the errors propagated in the backward pass. The resultant vector of a matrix-by-vector multiplication is applied to the non-linear activation function and to the next layer.
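In plain linear-algebra terms (a NumPy restatement added for clarity, not the systolic schedule itself), the same stationary weight matrix serves both passes: the forward activation uses W, while the back-propagated error uses its transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))        # weights between a 5-neuron and a 3-neuron layer

o_prev = rng.standard_normal(5)        # outputs of the lower layer (forward pass)
net = W @ o_prev                       # forward: activations move "upward" through W

delta_next = rng.standard_normal(3)    # errors of the upper layer (backward pass)
err = W.T @ delta_next                 # backward: errors move "downward" through W^T
```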
To exploit the spatial parallelism, we do not consider one-dimensional arrays. And to organize the systolic array efficiently even for a network in which the numbers of neurons in adjacent layers are very different, we did not use the systolic algorithms proposed in [Kung88, Kato90].
Array Organization
In Figure 1(a), the systolic array organization between two adjacent layers is shown. One layer consists of 5 neurons and the other layer consists of 3 neurons. The weight value w_ij and the weight increment \Delta w_ij correspond to the connection between neuron i of one layer and neuron j of the other layer. Each basic cell w_ij is a computational element which contains the sum-of-products operations of the forward and backward passes and the weight-increment accumulation (Equations (1), (4), and (5)); these operations can be executed in parallel, as shown in Figure 1(b).
[Figure 1 appears here: activation values flow across the array in one direction while the back-propagated errors flow in the opposite direction, and each cell extends the incoming forward and backward partial sums by a multiply-accumulate with its stored weight.]
Figure 1. Systolic array organization between two adjacent layers of multi-layer neural networks: (a) combined design for forward and backward passes; (b) basic cell operations.
[Figure 2 appears here: input patterns are fed at one edge of the array and desired outputs at the other.]

Figure 2. An example of a systolic array organization for a (5-3-2) 3-layer neural network.
In Figure 2, an example of a systolic array organization for a (5-3-2) 3-layer network is shown. The elements of the two weight matrices are set into the two-dimensional arrays: the 3-by-5 weight matrix represents the weights associated with the connections between the input layer and the hidden layer, and the 2-by-3 weight matrix represents the weights associated with the connections between the hidden layer and the output layer. The second matrix is transposed in the figure. While the weights are updated, the feeding of training patterns is stopped and the pipeline is stalled.
Basic Cell Architecture
The internal data path of the basic cell is shown in Figure 3. It consists of three multipliers, three adders, and two registers for the weight value and the weight increment. The communication clock cycle can be two times faster than the computation clock cycle, and the cell sends its value during the idle cycles.
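The following class is a behavioural sketch of such a cell (the names and the learning-rate handling are assumptions for illustration, not the exact data path of Figure 3): per cycle it extends the forward partial sum, extends the backward partial sum, and accumulates the weight increment; the increment is applied to the weight register only while the pattern feed is stalled.

```python
class BasicCell:
    """Behavioural sketch of one cell w_ij of the array (illustrative names)."""

    def __init__(self, weight, eta=0.1):
        self.w = weight          # weight register
        self.dw = 0.0            # weight-increment register
        self.eta = eta

    def step(self, activation_in, fwd_partial, error_in, bwd_partial):
        """One cycle: the three multiply-accumulate style operations in parallel."""
        fwd_out = fwd_partial + self.w * activation_in     # forward-pass MAC
        bwd_out = bwd_partial + self.w * error_in          # backward-pass MAC
        self.dw += self.eta * error_in * activation_in     # weight-increment accumulation
        return activation_in, fwd_out, error_in, bwd_out   # values passed to neighbours

    def update_weight(self):
        """Applied while the feeding of training patterns is stalled."""
        self.w += self.dw
        self.dw = 0.0
```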
Figure 3. Internal data path of the basic cell.

4. Performance Evaluation
Let us consider an L-layer network consisting of N_k neurons at the k-th layer. The required cycles, C_fwd and C_bwd, for the forward and backward passes of this design are denoted by Equation (8):
    C_fwd = C_bwd = \sum_{i=1}^{L} N_i + L - 2        (8)
When the forward and backward passes are pipelined, the required cycles for a single training pattern are denoted by Equation (9):
    C_single = C_fwd + C_bwd - N_L + 1        (9)

Equation (9) shows only the effects of the spatial parallelism and the forward/backward pipelining. The effect of the forward/backward pipelining on performance is not great even when N_L is much greater than 1. However, it makes the pipelining of multiple training patterns possible. When multiple patterns are pipelined, the required cycles for learning p training patterns are denoted by Equation (10):
    C_p = C_single + (p - 1)        (10)

The speedup, S_p, obtained by the pipelining of p training patterns is denoted by Equation (11):

    S_p = p \cdot C_single / (C_single + p - 1)        (11)

When p >> C_single, the speedup can be approximated to C_single, which is the depth of the pipelining and depends only on the network size.
For updating the weights, the cycles during which the feeding of training patterns is stopped are denoted by Equation (12):

    C_update = 2 \sum_{i=1}^{L} N_i - N_1 - N_2 + L + 1        (12)
When we update the weights t times during the presentation of p patterns, the total required cycles are denoted by Equation (13):

    C_total = C_p + t \cdot C_update        (13)
The performance of neurocomputers is measured in MCUPS (millions of connection updates per second). Let N_c be the number of connections of the target neural network; then the MCUPS is calculated by Equation (14):

    MCUPS = (N_c \cdot p) / (C_total \cdot t_cycle \cdot 10^6)        (14)

where t_cycle is the basic clock cycle time in seconds.
In this design, the basic clock cycle time is determined as the maximum among the time required by one multiplication and one addition, the time required for the non-linear activation function, and the time required to feed the input patterns continuously. Speed measurements of several high-speed neural network implementations have been performed for NETtalk [Sejn87]. Assuming we implement this design with a basic clock cycle time of 100 nsec, 248 MCUPS can be obtained for NETtalk for a single training pattern, using only the spatial parallelism and the forward/backward pipelining; for the (128-128-128) three-layer network [Sing90] with 64K training patterns pipelined, the performance that can be obtained is 324,000 MCUPS.
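The cycle-count expressions can be checked with a small calculator. The formulas below follow the reconstruction of Equations (8)-(11) and (14) given above, and the NETtalk size (203-60-26) is an assumption taken from the Warp benchmark configuration, so the printed figures are indicative rather than authoritative (weight-update stalls are ignored).

```python
def cycles_pass(layers):
    """Equation (8) as reconstructed: cycles of one forward (or backward) pass."""
    return sum(layers) + len(layers) - 2

def cycles_pattern(layers):
    """Equation (9): forward and backward passes pipelined for one pattern."""
    return 2 * cycles_pass(layers) - layers[-1] + 1

def cycles_patterns(layers, p):
    """Equation (10): p training patterns pipelined through the array."""
    return cycles_pattern(layers) + (p - 1)

def speedup(layers, p):
    """Equation (11): speedup of pipelining p patterns over one-at-a-time training."""
    return p * cycles_pattern(layers) / cycles_patterns(layers, p)

def mcups(layers, p, cycle_ns=100.0):
    """Equation (14): millions of connection updates per second."""
    connections = sum(a * b for a, b in zip(layers, layers[1:]))
    seconds = cycles_patterns(layers, p) * cycle_ns * 1e-9
    return connections * p / seconds / 1e6

print(round(mcups([203, 60, 26], p=1)))            # assumed NETtalk size, single pattern: ~248
print(round(mcups([128, 128, 128], p=64 * 1024)))  # 64K patterns pipelined: ~324,500
```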
5. Conclusion
In this paper, a systolic array for the back-propagation neural network that executes the forward and backward passes in parallel and exploits the pipelining of multiple patterns in each pass has been presented. The estimated performance of this design shows that the pipelining of multiple patterns is an important factor in VLSI implementations of artificial neural networks. To implement a large network with this design, an efficient mapping method that still allows the exploitation of the pipelining is required as a further study.

6. References
F. Blayo and P. Hurat, "A VLSI Systolic Array Dedicated to Hopfield Neural Network," VLSI for Artificial Intelligence, Kluwer Academic Press, 1989, pp. 255-264.

F. Blayo and C. Lehmann, "A Systolic Implementation of the Self Organization Algorithm," Proc. of INNC, Vol. II, Paris, July 1990.

Chinn, et al., "Systolic Array Implementations of Neural Nets on the MasPar MP-1 Massively Parallel Processor," Proc. of IJCNN, Vol. II, San Diego, California, June 1990, pp. 169-173.

B. Faure and G. Mazare, "Implementation of Back-Propagation on a VLSI Asynchronous Cellular Architecture," Proc. of INNC, Vol. II, Paris, July 1990, pp. 631-634.

H. Kato, et al., "A Parallel Neurocomputer Architecture Towards Billion Connection Updates Per Second," Proc. of IJCNN, Vol. II, Washington, D.C., January 1990, pp. 51-54.

H.K. Kwan and P.C. Tsang, "Systolic Implementation of Multi-Layer Feed-Forward Neural Network with Back-Propagation Learning Scheme," Proc. of IJCNN, Vol. II, Washington, D.C., January 1990, pp. 84-87.

H.T. Kung, "Why Systolic Architectures?," IEEE Computer, January 1982, pp. 37-46.

S.Y. Kung and J.N. Hwang, "Parallel Architectures for Artificial Neural Nets," Proc. of ICNN, Vol. II, San Diego, California, July 1988, pp. 165-172.

R. Mann and S. Haykin, "A Parallel Implementation of Kohonen Feature Maps on the Warp Systolic Computer," Proc. of IJCNN, Vol. II, Washington, D.C., January 1990, pp. 84-87.

J.R. Millan and P. Bofill, "Learning by Back-Propagation: A Systolic Algorithm and Its Transputer Implementation," Neural Networks, Vol. 1, No. 3, July 1989, pp. 119-137.

D.A. Pomerleau, et al., "Neural Network Simulation at Warp Speed: How We Got 17 Million Connections Per Second," Proc. of ICNN, Vol. II, San Diego, California, July 1988, pp. 143-150.

U. Ramacher and J. Beichter, "Systolic Synthesis of Neural Networks," Proc. of INNC, Vol. II, Paris, July 1990, pp. 572-576.

D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, MIT Press, 1986, pp. 318-362.

T.J. Sejnowski and C.R. Rosenberg, "Parallel Networks that Learn to Pronounce English Text," Complex Systems, Vol. 1, 1987, pp. 145-168.

A. Singer, "Exploiting the Inherent Parallelism of Artificial Neural Networks to Achieve 1300 Million Interconnects per Second," Proc. of INNC, Vol. II, Paris, July 1990, pp. 656-660.