Learning Models and Rules (I)
Jin Young Choi
Seoul National University
Outline
Learning rules and models
Statistical nature of learning
Empirical risk minimization
Structural risk minimization
Learning Models
Probability Discriminant Functions
• A Posteriori Probability Function
Linear Learning Models
Universal Learning Models
• Basis networks
• Gaussian basis
• High order basis
• Sine, cosine, wavelet basis
• (Deep) Neural networks
$g_i(x) = p(\omega_i \mid x)$

$g(x) = w_1 x_1 + \cdots + w_d x_d + w_0 = \mathbf{w}^t \mathbf{x} + w_0$
Linear Learning Models
[Figure: weight vs. height data separated by a linear decision surface = hyperplane H]

$g(x) = \mathbf{w}^t \mathbf{x} + w_0 = 0$
Universal Learning Models
[Figure: weight vs. height data separated by a nonlinear decision surface = hypersurface H]

$g(x) = 0$
Learning Strategy
Nonlinear Learning Approach
• Learning of nonlinear kernel parameters
• Steepest descent method (error backpropagation learning)
• Approximate Newton method
• Genetic algorithm, simulated annealing
• Slow learning speed
• Local minima
Linear Learning Approach
• Convex optimization (least squares, support vector machine)
• Fast learning speed, mathematically sound
• Kernel determination problems
• Generalization problems: high-dimensional feature space
Statistical Nature of Learning
The generalized linear discriminant function (General Learning Model):

$g(x, \theta) = \mathbf{w}^t \boldsymbol{\varphi}(x, \mathbf{v}), \qquad \theta = [\mathbf{w}^t \ \ \mathbf{v}^t]^t$

Observations (i.i.d. examples): $\{(x_i, d_i)\}_{i=1}^{N}$

For binary classification:
$d_i = 1, \ \forall x_i \in \omega_1; \qquad d_i = -1, \ \forall x_i \in \omega_2$

Multi-class classification:
$d_i = 1, \ \forall x \in \omega_i; \qquad d_j = 0, \ \forall x \notin \omega_i$

Multi-label classification: 0 1 0 1 1 0 1 1 0 …
Ex) portrait with attributes (bald, mustache, gold hair, blue eyes, etc.)
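For concreteness, here is a minimal sketch of one such generalized linear model with a Gaussian basis $\boldsymbol{\varphi}(x, \mathbf{v})$; the basis choice, dimensions, and data below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def gaussian_basis(x, centers, width=1.0):
    """phi(x, v): one Gaussian bump per center v_m (a common basis choice)."""
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * width ** 2))

def g(x, w, centers):
    """g(x, theta) = w^t phi(x, v), with theta = [w, v]."""
    return w @ gaussian_basis(x, centers)

# Toy setup: 5 Gaussian basis functions over a 2-D input space
rng = np.random.default_rng(0)
centers = rng.standard_normal((5, 2))    # v: basis centers
w = rng.standard_normal(5)               # w: linear output weights
print(g(np.array([0.1, -0.2]), w, centers))
```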
Statistical Nature of Learning
Loss Function

$L(d, g(x, \theta)) = \left(d - g(x, \theta)\right)^2$ (or entropy, KLD)
$L = -d \log g(x, \theta)$

Risk Functional (expected value of $L$; or total loss)

$R(\theta) = \int L(d, g(x, \theta)) \, dp(x, d)$

where $p(x, d)$ is the joint PDF for $x$ and $d$, but unknown.

Empirical Risk Functional

$R_{emp}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\left(d_i, g(x_i, \theta)\right)$
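A minimal sketch of how $R_{emp}$ approximates $R$ by a sample average; the model, loss, and data here are hypothetical:

```python
import numpy as np

def empirical_risk(theta, xs, ds, g, loss=lambda d, y: (d - y) ** 2):
    """R_emp(theta) = (1/N) * sum_i L(d_i, g(x_i, theta))."""
    return np.mean([loss(d, g(x, theta)) for x, d in zip(xs, ds)])

# Hypothetical linear model g(x, theta) = w^t x + w0, theta = [w, w0]
g = lambda x, theta: theta[:-1] @ x + theta[-1]

rng = np.random.default_rng(0)
xs = rng.standard_normal((100, 2))       # N = 100 i.i.d. examples
ds = np.sign(xs[:, 0] + xs[:, 1])        # toy targets in {-1, +1}
theta = np.array([1.0, 1.0, 0.0])
print(empirical_risk(theta, xs, ds, g))
```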
Statistical Nature of Learning
Definition (Convergence in Probability):
A sequence of random variables $\xi_1, \xi_2, \xi_3, \ldots, \xi_N, \ldots$ is said to converge in probability to $\xi^*$, i.e., $\xi_N \xrightarrow{P} \xi^*$, if for any $\varepsilon > 0$,
$P(|\xi_N - \xi^*| > \varepsilon) \to 0$ as $N \to \infty$.

Definition (Strict Consistency):
Let $\Lambda(c) = \{\theta \mid R(\theta) \geq c\}$. If

$\inf_{\theta \in \Lambda(c)} R_{emp}(\theta) \xrightarrow{P} \inf_{\theta \in \Lambda(c)} R(\theta), \quad \text{as } N \to \infty,$

the empirical risk functional is said to be strictly consistent.
Vapnik, Vladimir (2000). The nature of statistical learning theory. Springer.
Empirical Risk Minimization
Empirical Risk Minimization: $\theta_{emp} = \arg\min_{\theta} R_{emp}(\theta)$
Original Risk Minimization: $\theta_{o} = \arg\min_{\theta} R(\theta)$
Meaning
$R(\theta_o)$ is the probability of the optimal decision error; $R_{emp}(\theta_{emp})$ is the probability of the training error.
The difference between $R_{emp}(\theta_{emp})$ and $R(\theta_o)$ depends on the number of samples and the complexity of the classifier.
Polynomial Curve Fitting
Sum-of-Squares Error Function
[Figures: polynomial fits of order 0, 1, 3, and 9 — the 9th-order fit over-fits; Root-Mean-Square (RMS) error vs. model order; effect of data set size on the 9th-order polynomial]
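The over-fitting behavior can be reproduced with a short script; the $\sin(2\pi x)$ target, noise level, and sample sizes are assumptions for illustration:

```python
import numpy as np

# Fit polynomials of increasing order to noisy samples of sin(2*pi*x);
# training RMS error falls with order while test RMS error rises for
# high orders -- over-fitting.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test)

for M in (0, 1, 3, 9):
    w = np.polyfit(x_train, t_train, M)          # least-squares fit
    rms = lambda x, t: np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))
    print(f"M={M}: train RMS={rms(x_train, t_train):.3f}, "
          f"test RMS={rms(x_test, t_test):.3f}")
```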
Structural Risk Minimization
VC (Vapnik-Chervonenkis) dimension:
the maximum number of points that can be labeled in all possible ways (shattered)
The VC dimension of linear classifiers in $N$ dimensions is $h = N + 1$ (= number of weights, $n_w$); cf.) MLP: $O(n_w^2)$
A measure of the complexity of a classifier
Minimizing VC dim. == Minimizing Complexity
Structural Risk Minimization
Generalization Ability
N: training sample size
h: VC-dimension
e: base of the natural logarithm

Generalization bound: $p\left(\left|R(\theta_o) - R_{emp}(\theta_{emp})\right| > \varepsilon\right) < \left(\frac{2eN}{h}\right)^{h} \exp\left(-\varepsilon^2 N\right)$, so that $p\left(R - R_{emp} < \varepsilon\right) > 1 - \alpha$

Confidence interval: $\varepsilon(N, h, \alpha) = \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\alpha}{N}}$

$R(\theta_o) \leq R_{emp}(\theta_{emp}) + \varepsilon(N, h, \alpha)$, with $p \geq 1 - \alpha$
Vapnik, Vladimir (2000). The nature of statistical learning theory. Springer.
Structural Risk Minimization
For fixed training samples $N$: how can we find the appropriate complexity?
→ Structural Risk Minimization:

$(\theta_{emp}, h_{emp}) = \arg\min_{\theta,\, h} R_{emp}(\theta, h)$

Examples: SVM, pruning, regularization, dropout, use of a pre-trained net
Interim Summary
Learning rules and models: Parametric Optimization
Statistical nature of learning
Empirical risk minimization
Structural risk minimization
$\inf_{\theta \in \Lambda(c)} R_{emp}(\theta) \xrightarrow{P} \inf_{\theta \in \Lambda(c)} R(\theta), \quad \text{as } N \to \infty$

$E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left(d_i - g(x_i, \theta)\right)^2 \; \left(= R_{emp}(\theta)\right)$

$p\left(\left|R(\theta_o) - R_{emp}(\theta_{emp})\right| > \varepsilon\right) < \left(\frac{2eN}{h}\right)^{h} \exp\left(-\varepsilon^2 N\right)$
Learning Rules
Gradient descent method: error backpropagation method
Gauss-Newton method
Levenberg–Marquardt method
Newton(-Raphson) method
Support Vector Machine
Parametric PDF estimation: MLE, MAP, Bayesian learning
Non-parametric estimation: EM, MCMC
Boltzmann Machine: MAP, Boltzmann distribution, MLP, unsupervised learning
Bayesian Learning: MCMC, Variational Inference
Learning Models and Rules (II)
Jin Young Choi
Seoul National University
Outline
Gradient descent method
Error backpropagation method
Gauss-Newton method
Levenberg–Marquardt method
Newton (-Raphson) method
Empirical Risk Minimization
Empirical Risk Functional

$E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left(d_i - g(x_i, \theta)\right)^2 \; \left(= R_{emp}(\theta)\right)$
$E(\theta) = -\sum_i p_i \ln q_i(W)$
$E(\theta) = -\sum_i \left[p_i \ln q_i(W) + (1 - p_i) \ln\left(1 - q_i(W)\right)\right]$

Learning Goal
Find $\theta^* = \arg\min_{\theta} E(\theta)$

Convexity
Generally, $E(\theta)$ is not convex.
For the linear case, $E(\theta)$ is convex.

$\nabla E(\theta) \overset{def}{=} \left[\frac{\partial E(\theta)}{\partial \theta_1}, \frac{\partial E(\theta)}{\partial \theta_2}, \ldots, \frac{\partial E(\theta)}{\partial \theta_n}\right]^t$
Gradient Descent Methods
Gradient Descent Update Rule (Steepest Descent for $G = I$):

$\theta(k+1) = \theta(k) - \eta_k \, G \, \nabla E(\theta(k)), \qquad G > 0$
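A minimal sketch of this update rule; the quadratic objective and step size are illustrative assumptions, and $G = I$ recovers steepest descent:

```python
import numpy as np

def gradient_descent(grad_E, theta0, eta=0.1, G=None, n_steps=100):
    """theta(k+1) = theta(k) - eta * G @ grad E(theta(k)); G must be positive definite."""
    theta = np.asarray(theta0, dtype=float)
    G = np.eye(theta.size) if G is None else G   # G = I -> steepest descent
    for _ in range(n_steps):
        theta = theta - eta * G @ grad_E(theta)
    return theta

# Hypothetical quadratic objective E(theta) = 0.5 * theta^T A theta
A = np.array([[3.0, 0.5], [0.5, 1.0]])
print(gradient_descent(lambda th: A @ th, [1.0, -2.0]))  # -> near [0, 0]
```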
Backpropagation Learning Rule
Feedforward Neural Networks
[Figure: two-layer feedforward network — inputs $x_i$ ($i = 1, \ldots, I$), hidden units $h_j$ with $net_j = \sum_i w_{ji} x_i$, outputs $o_k$ ($k = 1, \ldots, K$) with $net_k = \sum_j w_{kj} h_j$; weights $w_{ji}$, $w_{kj}$]

Backpropagation Learning Rule
Empirical Risk Function: $E_d(w) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = \frac{1}{2} \left| t - o(x, w) \right|^2$

Gradient descent for output layer: $\Delta w_{kj} = -\eta \, \frac{\partial E_d}{\partial w_{kj}}$

Chain rule: $\frac{\partial E_d}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \, \frac{\partial net_k}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \, h_j$
Backpropagation Learning Rule
• To learn a regression problem, a linear activation function is used at the output and the sum-of-squares error is used as the loss function: $E(w) = \frac{1}{2} \left| o(x, w) - t \right|^2$, where the $k$-th component of $o(x, w)$ is $o_k = net_k = w_k^T h = \sum_j w_{kj} h_j$. Find $\frac{\partial E}{\partial net_k}$.

Sol.) Since $E(w) = \frac{1}{2} \sum_k (net_k - t_k)^2$,
$\frac{\partial E}{\partial net_k} = net_k - t_k = o_k - t_k = -(t_k - o_k) = -\delta_k$.
$\Delta w_{kj} = -\eta \, \frac{\partial E_d}{\partial w_{kj}} = \eta \, \delta_k h_j, \qquad \frac{\partial E_d}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \, \frac{\partial net_k}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \, h_j$
Backpropagation Learning Rule
• For multi-label classification (e.g., output 0110100), the sigmoid activation function is used and the loss is defined by the cross-entropy loss function: $E(w) = -\sum_k^K \left[t_k \log o_k(x, w) + (1 - t_k) \log\left(1 - o_k(x, w)\right)\right]$, where $o_k = \sigma(net_k) = \frac{1}{1 + e^{-net_k}}$. Find $\frac{\partial E}{\partial net_k}$.

Sol.) Since $\frac{\partial o_k}{\partial net_k} = \sigma(net_k)\left(1 - \sigma(net_k)\right) = o_k(1 - o_k)$,

$\frac{\partial E}{\partial net_k} = -t_k \, \frac{1}{o_k} \, \frac{\partial o_k}{\partial net_k} - (1 - t_k) \, \frac{-1}{1 - o_k} \, \frac{\partial o_k}{\partial net_k}$
$= -t_k \, \frac{1}{o_k} \, o_k (1 - o_k) - (1 - t_k) \, \frac{-1}{1 - o_k} \, o_k (1 - o_k)$
$= -t_k (1 - o_k) + (1 - t_k) \, o_k = o_k - t_k = -(t_k - o_k) = -\delta_k$
$\Delta w_{kj} = -\eta \, \frac{\partial E_d}{\partial w_{kj}} = \eta \, \delta_k h_j$
Backpropagation Learning Rule
• For multi-class classification (e.g., [0 0 0 1 0 0]), the softmax activation function is used and the loss is defined by the cross-entropy loss function: $E(w) = -\sum_i^K t_i \log o_i(x, w)$, where $o_k(x, w) = \frac{e^{net_k}}{\sum_j e^{net_j}}$. The target value $t_k \in \{0, 1\}$ is labeled as a one-hot vector. Find $\frac{\partial E}{\partial net_k}$.

Sol.)
$\frac{\partial E}{\partial net_k} = \frac{\partial}{\partial net_k}\left(-\sum_i^K t_i \log \frac{e^{net_i}}{\sum_j e^{net_j}}\right) = \frac{\partial}{\partial net_k}\left(-\sum_i^K \left[t_i \log e^{net_i} - t_i \log \sum_j e^{net_j}\right]\right)$
$= \frac{\partial}{\partial net_k}\left(-\sum_i^K \left[t_i \, net_i - t_i \log \sum_j e^{net_j}\right]\right) = -t_k + \sum_i t_i \, \frac{e^{net_k}}{\sum_j e^{net_j}}$
$= -t_k + \frac{e^{net_k}}{\sum_j e^{net_j}} \sum_i t_i = o_k - t_k = -(t_k - o_k) = -\delta_k$

$\Delta w_{kj} = -\eta \, \frac{\partial E_d}{\partial w_{kj}} = \eta \, \delta_k h_j$
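As a sanity check, the derived output-layer delta $\partial E / \partial net_k = o_k - t_k$ for the softmax/cross-entropy case can be verified numerically; the toy values below are assumptions:

```python
import numpy as np

def softmax(net):
    e = np.exp(net - net.max())          # shift for numerical stability
    return e / e.sum()

def ce_softmax(net, t):
    """Cross-entropy loss E = -sum_i t_i log o_i with softmax outputs."""
    return -np.sum(t * np.log(softmax(net)))

net = np.array([0.2, -1.0, 0.5])         # toy pre-activations
t = np.array([0.0, 1.0, 0.0])            # one-hot target
o = softmax(net)

analytic = o - t                         # dE/dnet_k = o_k - t_k = -delta_k
eps = 1e-6                               # central finite differences
numeric = np.array([
    (ce_softmax(net + eps * np.eye(3)[k], t) -
     ce_softmax(net - eps * np.eye(3)[k], t)) / (2 * eps)
    for k in range(3)])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```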
Backpropagation Learning Rule
Empirical Risk Function: $E_d(w) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$
(Output deltas — Regression: $L_2$, linear; 01001101: cross-entropy, sigmoid; 00001000: cross-entropy, softmax)

Gradient descent for hidden layer: $\Delta w_{ji} = -\eta \, \frac{\partial E_d}{\partial w_{ji}}$

Chain rule: $\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \, x_i$

Backpropagation Learning Rule
Chain rule: $\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \, x_i$

$\frac{\partial E_d}{\partial net_j} = \sum_{k \in outputs} \frac{\partial E_d}{\partial net_k} \, \frac{\partial net_k}{\partial h_j} \, \frac{\partial h_j}{\partial net_j} = \sum_{k \in outputs} (-\delta_k) \, w_{kj} \, f'(net_j) = -\delta_j$

using $\frac{\partial E_d}{\partial net_k} = -\delta_k$, $h_j = f(net_j)$, and $net_k = \sum_j w_{kj} h_j$ so that $\frac{\partial net_k}{\partial h_j} = w_{kj}$. Hence

$\Delta w_{ji} = \eta \, \delta_j \, x_i, \qquad \delta_j = f'(net_j) \sum_{k \in outputs} \delta_k \, w_{kj}$
Backpropagation Learning Rule
[Figure: multi-layer network — $h_i^{l-1}$, weights $w_{ji}^l$ with $net_j^l = \sum_i w_{ji}^l h_i^{l-1}$, $h_j^l$, weights $w_{kj}^{l+1}$ with $net_k^{l+1} = \sum_j w_{kj}^{l+1} h_j^l$, $h_k^{l+1}$]

Output layer: $\delta_k^L = -\frac{\partial E_d}{\partial net_k} = t_k - o_k$ (Regression: $L_2$, linear; 01001101: cross-entropy, sigmoid; 00001000: cross-entropy, softmax)

Layer $l$: $\Delta w_{ji}^l = \eta \, \delta_j^l \, h_i^{l-1}, \qquad \delta_j^l = f'(net_j^l) \sum_{k \in (l+1)\,layer} \delta_k^{l+1} \, w_{kj}^{l+1}$

Matrix Form (Forward): $h^0 = x, \qquad h^{l+1} = f\!\left(W^{l+1} h^l\right)$ ($f$ applied elementwise)

Matrix Form (Backward, EBP): $\delta^l = Diag\!\left[f'(net^l)\right] \left(W^{l+1}\right)^T \delta^{l+1}, \qquad \Delta W^l = \eta \, \delta^l \left(h^{l-1}\right)^T + \rho \, \Delta W^l(old)$
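Putting the forward and backward matrix forms together, a minimal two-layer sketch; the architecture, data, and learning rate are assumptions, sigmoid activations with cross-entropy output delta $\delta^L = t - o$ are used, and the momentum term $\rho \, \Delta W^l(old)$ is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda a: 1.0 / (1.0 + np.exp(-a))           # sigmoid; f' = f(1 - f)

x = rng.standard_normal((4, 1))                  # h^0 = x
t = np.array([[1.0], [0.0]])                     # target
W1 = rng.standard_normal((3, 4)) * 0.1
W2 = rng.standard_normal((2, 3)) * 0.1
eta = 0.5

for _ in range(200):
    # Forward: h^{l+1} = f(W^{l+1} h^l)
    net1 = W1 @ x;  h1 = f(net1)
    net2 = W2 @ h1; o = f(net2)
    # Backward: delta^L = t - o (cross-entropy + sigmoid),
    # delta^l = Diag[f'(net^l)] (W^{l+1})^T delta^{l+1}
    d2 = t - o
    d1 = (h1 * (1 - h1)) * (W2.T @ d2)
    # Update: dW^l = eta * delta^l (h^{l-1})^T
    W2 += eta * d2 @ h1.T
    W1 += eta * d1 @ x.T

print(np.round(o.ravel(), 3))                    # -> approaches [1, 0]
```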
Two-Layer Convolutional Neural Network
[Figure: CNN with ReLU activations — feature maps 32×32×3 → 32×32×32 → 16×16×32 → 16×16×64 → 8×8×64 → 4096 → 10]
Gradient Descent Methods
Optimal Learning Rate
• Necessary condition:

$\nabla E(\theta_{next})^T \, \nabla E(\theta_{now}) = 0, \quad \text{i.e.,} \quad \nabla E\!\left(\theta_{now} - \eta \, \nabla E(\theta_{now})\right)^T \nabla E(\theta_{now}) = 0$

Learning Rate Search Methods
• Initial Bracketing
• Line Searching
• Secant Method (Approximate Newton Method)
• Bisection Method
• Golden section search method (see the sketch below)
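A sketch of one of the listed searches, golden-section, used to pick $\eta$ along the steepest-descent direction; the bracketing interval and quadratic objective are assumptions:

```python
import numpy as np

def golden_section(phi, a, b, tol=1e-5):
    """Minimize phi(eta) on [a, b] by golden-section interval shrinking."""
    r = (np.sqrt(5) - 1) / 2                     # golden ratio ~ 0.618
    c, d = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):                      # minimum lies in [a, d]
            b, d = d, c
            c = b - r * (b - a)
        else:                                    # minimum lies in [c, b]
            a, c = c, d
            d = a + r * (b - a)
    return (a + b) / 2

# Hypothetical quadratic E and a line search along -grad E
A = np.array([[3.0, 0.5], [0.5, 1.0]])
E = lambda th: 0.5 * th @ A @ th
theta = np.array([1.0, -2.0])
g = A @ theta                                    # grad E(theta)
eta = golden_section(lambda e: E(theta - e * g), 0.0, 1.0)
print(eta, E(theta - eta * g))
```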
Newton(-Raphson) Method
Principle: the descent direction is determined using the second derivatives of the objective function, if available.
If the starting position $\theta_{now}$ is sufficiently close to a local minimum, the objective function $E(\theta)$ can be approximated by a quadratic form:

$E(\theta) \approx E(\theta_{now}) + \nabla E(\theta_{now})^T (\theta - \theta_{now}) + \frac{1}{2} (\theta - \theta_{now})^T \mathbf{H} (\theta - \theta_{now})$

where $\nabla E(\theta) = \frac{\partial E(\theta)}{\partial \theta}$ and $\mathbf{H} = \nabla^2 E(\theta_{now})$.

Since this approximation defines a quadratic function, its minimum can be determined by differentiating and setting to 0:

$\nabla E(\theta) = \nabla E(\theta_{now}) + \mathbf{H} (\theta - \theta_{now}) = 0$
$\theta_{next} = \theta_{now} - \mathbf{H}^{-1} \, \nabla E(\theta_{now})$

cf.) $\theta(k+1) = \theta(k) - \eta_k \, G \, \nabla E(\theta(k)), \quad G > 0$
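A minimal sketch of the Newton update; on a quadratic objective, assumed here for illustration, a single step lands exactly on the minimizer:

```python
import numpy as np

def newton_step(grad_E, H, theta_now):
    """theta_next = theta_now - H^{-1} grad E(theta_now)."""
    return theta_now - np.linalg.solve(H, grad_E(theta_now))

# Hypothetical quadratic E(theta) = 0.5 theta^T A theta - b^T theta:
# grad E = A theta - b, Hessian H = A, minimizer A^{-1} b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
grad_E = lambda th: A @ th - b
print(newton_step(grad_E, A, np.zeros(2)), np.linalg.solve(A, b))
```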
Gauss-Newton Method
Key idea: instead of using the Hessian matrix, we use a linearized approximation of the learning model.

$E(\theta) = \frac{1}{2} \left| d - g(x, \theta) \right|^2$
$g(x, \theta) \approx g(x, \theta_{now}) + J^T (\theta - \theta_{now}), \quad \text{where } J = \left.\frac{dg(x, \theta)}{d\theta}\right|_{\theta = \theta_{now}}$
$E(\theta) = \frac{1}{2} \left| d - g(x, \theta_{now}) - J^T (\theta - \theta_{now}) \right|^2$

Since this function is quadratic in $\theta$, its minimum can be determined by differentiating and setting to 0:

$\nabla E(\theta) = -J \left(d - g(x, \theta_{now}) - J^T (\theta - \theta_{now})\right) = 0$

Update Rule:
$\theta_{next} = \theta_{now} + (J J^T)^{-1} J \left(d - g(x, \theta_{now})\right)$
$\theta_{next} = \theta_{now} - (J J^T)^{-1} \nabla E(\theta_{now}) \quad [\because \nabla E(\theta_{now}) = -J \left(d - g(x, \theta_{now})\right)]$

cf.) $\theta(k+1) = \theta(k) - \eta_k \, G \, \nabla E(\theta(k)), \quad G > 0$
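A minimal Gauss-Newton sketch on a hypothetical one-parameter model $g(x, \theta) = e^{-\theta x}$; the model, data, and starting point are assumptions:

```python
import numpy as np

x = np.linspace(0, 2, 20)
theta_true = 1.5
d = np.exp(-theta_true * x)                      # noise-free targets

theta = np.array([0.3])                          # initial guess
for _ in range(10):
    r = d - np.exp(-theta * x)                   # residuals d - g(x, theta_now)
    J = (-x * np.exp(-theta * x))[None, :]       # J = dg/dtheta, shape (1, N)
    # theta_next = theta_now + (J J^T)^{-1} J (d - g(x, theta_now))
    theta = theta + np.linalg.solve(J @ J.T, J @ r)
print(theta)                                     # -> approaches 1.5
```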
RNN: Recurrent Neural Networks
Recurrent Neural Networks
𝑠𝑡: State of RNN after time 𝑡
𝑥𝑡: Input to RNN at time 𝑡
𝑦𝑡: Output of RNN at time 𝑡
𝑊, 𝑈, 𝑉: parameter matrices, 𝜎(⋅) : non-linear activation function
More expressive cells: GRU(Gated Recurrent Unit), LSTM(Long- Short-Term Memory), etc.
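The slide lists the ingredients but not the update equation; below is a minimal sketch assuming the common Elman-style recurrence $s_t = \tanh(W s_{t-1} + U x_t)$, $y_t = \sigma(V s_t)$ (this specific form and the toy dimensions are assumptions, not taken from the slide):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def rnn_step(s_prev, x_t, W, U, V):
    """One RNN step: state update, then output (Elman-style, an assumption)."""
    s_t = np.tanh(W @ s_prev + U @ x_t)   # new state from old state and input
    y_t = sigmoid(V @ s_t)                # output at time t
    return s_t, y_t

# Toy dimensions: state 4, input 3, output 2
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
U = rng.standard_normal((4, 3))
V = rng.standard_normal((2, 4))
s = np.zeros(4)
for x in rng.standard_normal((5, 3)):     # a length-5 input sequence
    s, y = rnn_step(s, x, W, U, V)
print(y)
```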
RNN: Recurrent Neural Networks
RNN for Sequence Generation
Q: How to use RNN to generate sequences?
A: Let 𝑥𝑡+1 = 𝑦𝑡
Q: How to initialize 𝑠0, 𝑥1? When to stop generation?
A: Use start/end of sequence token (SOS, EOS)
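A sketch of that generation loop with greedy decoding; the token encoding and stopping rule are illustrative assumptions:

```python
import numpy as np

def generate(rnn_step, s0, sos, eos_index, max_len=20):
    """Greedy sequence generation: feed y_t back as x_{t+1}, stop at EOS.

    rnn_step(s, x) -> (s_next, y) is any step function, e.g. the sketch
    above with weights bound in: lambda s, x: rnn_step(s, x, W, U, V).
    """
    s, x, seq = s0, sos, []
    for _ in range(max_len):
        s, y = rnn_step(s, x)
        token = int(np.argmax(y))         # greedy choice of the next token
        if token == eos_index:            # end-of-sequence token: stop
            break
        seq.append(token)
        x = y                             # x_{t+1} = y_t
    return seq
```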
RNN: Recurrent Neural Networks
Training RNN
We observe a sequence 𝑦∗ of edges [1,0,…]
Principle: Teacher Forcing -- Replace input and output by the real sequence
RNN: Recurrent Neural Networks
Training RNN
Loss 𝐿: Binary cross entropy
Minimize:

$L = -\sum_i \left[y_i^* \log y_i + (1 - y_i^*) \log(1 - y_i)\right]$
If 𝑦𝑖∗ = 1, we minimize −log 𝑦𝑖, making 𝑦𝑖 higher to approach 1
If 𝑦𝑖∗ = 0, we minimize −log(1 − 𝑦𝑖), making 𝑦𝑖 lower to approach 0
Since $y_i$ is computed by the RNN, this loss will adjust the RNN parameters accordingly, using backpropagation!
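A minimal numeric illustration of this loss; the sequences below are toy values, not from the slides:

```python
import numpy as np

# Binary cross-entropy over a generated edge sequence; under teacher
# forcing, the observed sequence y_star also replaces the model's own
# outputs as inputs during training.
y_star = np.array([1.0, 0.0, 1.0, 1.0])          # observed sequence (toy)
y_pred = np.array([0.9, 0.2, 0.6, 0.7])          # RNN outputs (toy)

L = -np.sum(y_star * np.log(y_pred) + (1 - y_star) * np.log(1 - y_pred))
print(L)   # pushing y_pred toward y_star (via backprop) decreases L
```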
Interim Summary
Update rules compared:

Gradient Descent: $\theta_{next} = \theta_{now} - \eta_k \, G \, \nabla E(\theta_{now})$
Gauss-Newton: $\theta_{next} = \theta_{now} - \eta_k \left(J J^t\right)^{-1} \nabla E(\theta_{now})$
Levenberg: $\theta_{next} = \theta_{now} - \eta_k \left(J J^t + \lambda I\right)^{-1} \nabla E(\theta_{now})$
Levenberg–Marquardt: $\theta_{next} = \theta_{now} - \eta_k \left(J J^t + \lambda \, Diag\left[J J^t\right]\right)^{-1} \nabla E(\theta_{now})$
Newton(-Raphson): $\theta_{next} = \theta_{now} - \eta_k \, \mathbf{H}^{-1} \, \nabla E(\theta_{now})$
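A sketch of the Levenberg-Marquardt damped update on the same hypothetical exponential model used in the Gauss-Newton sketch above; the damping constant is an assumption:

```python
import numpy as np

# Damping lambda interpolates between gradient descent (large lambda)
# and Gauss-Newton (lambda -> 0); here eta_k = 1 and the slide's
# -(...)^{-1} grad E equals +(...)^{-1} J r since grad E = -J r.
x = np.linspace(0, 2, 20)
d = np.exp(-1.5 * x)                             # toy targets
theta, lam = np.array([0.3]), 1e-2

for _ in range(20):
    r = d - np.exp(-theta * x)                   # residuals
    J = (-x * np.exp(-theta * x))[None, :]       # J = dg/dtheta
    A = J @ J.T
    # theta_next = theta_now + (J J^T + lambda * Diag[J J^T])^{-1} J r
    theta = theta + np.linalg.solve(A + lam * np.diag(np.diag(A)), J @ r)
print(theta)                                     # -> approaches 1.5
```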
Interim Summary
Error Backpropagation Rule (Gradient Descent)
[Figure: multi-layer network — $h_i^{l-1}$, weights $w_{ji}^l$, $h_j^l$, weights $w_{kj}^{l+1}$, $h_k^{l+1}$]

$\Delta w_{ji}^l = \eta \, \delta_j^l \, h_i^{l-1}, \qquad \delta_j^l = f'(net_j^l) \sum_{k \in (l+1)\,layer} \delta_k^{l+1} \, w_{kj}^{l+1}$

Output layer: $\delta_k^L = -\frac{\partial E_d}{\partial net_k} = t_k - o_k$

Matrix Form (Forward): $h^0 = x, \qquad h^{l+1} = f\!\left(W^{l+1} h^l\right)$

Matrix Form (Backward, EBP): $\delta^l = Diag\!\left[f'(net^l)\right] \left(W^{l+1}\right)^T \delta^{l+1}, \qquad \Delta W^l = \eta \, \delta^l \left(h^{l-1}\right)^T + \rho \, \Delta W^l(old)$