Academic year: 2024
(1)

Learning Models and Rules (I)

Jin Young Choi

Seoul National University

1

(2)

Outline

 Learning rules and models

 Statistical nature of learning

 Empirical risk minimization

 Structural risk minimization

(3)

Learning Models

Probability Discriminant Functions

A Posteriori Probability Function

Linear Learning Models

Universal Learning Models

Basis networks

Gaussian basis

High order basis

Sine, cosine, wavelet basis

(Deep) Neural networks

Probability discriminant function: $g_i(x) = p(x \mid \omega_i)$

Linear model: $g(x) = w_d x_d + \cdots + w_1 x_1 + w_0 = \mathbf{w}^t \mathbf{x} + w_0$

(4)

Linear Learning Models

Figure: linear decision surface (= hyperplane H) in the (weight, height) feature space.

$g(x) = \mathbf{w}^t \mathbf{x} + w_0 = 0$

(5)

Universal Learning Models

Figure: nonlinear decision surface (= hypersurface H) in the (weight, height) feature space.

$g(x) = 0$

(6)

Learning Strategy

 Nonlinear Learning Approach

• Learning of nonlinear kernel parameters

• Steepest descent method (error backpropagation learning)

• Approximate Newton method

• Genetic algorithm, simulated annealing

• Slow learning speed

• Local minima

 Linear learning approach

• Convex optimization (least squares, support vector machine)

• Fast learning speed, mathematically sound

• Kernel determination problems

• Generalization problems: high dimensional feature space

6

(7)

Statistical Nature of Learning

 The generalized linear discriminant function (general learning model)

$g(x, \theta) = \mathbf{w}^t \boldsymbol{\varphi}(x, \mathbf{v}), \qquad \theta = [\mathbf{w}^t\ \mathbf{v}^t]^t$

 Observations (i.i.d. examples): $\{(x_i, d_i)\}_{i=1}^{N}$

 For binary classification:

$d_i = 1,\ \forall x_i \in \omega_1; \qquad d_i = -1,\ \forall x_i \in \omega_2$

 Multi-class classification:

$d_i = 1,\ \forall x_i \in \omega_i; \qquad d_j = 0,\ \forall x_j \notin \omega_i$

 Multi-label classification: 0 1 0 1 1 0 1 1 0 … …

Ex) portrait with attributes (bald, mustache, gold hair, blue eyes, etc.)

(8)

Statistical Nature of Learning

 Loss Function

$L(d, g(x, \theta)) = (d - g(x, \theta))^2 \quad \text{(or entropy, KLD)}, \qquad L = -d \log g(x, \theta)$

 Risk Functional (expectation of $L$, or total loss)

$R(\theta) = \int L(d, g(x, \theta)) \, dp(x, d)$

where $p(x, d)$ is the joint PDF of $x$ and $d$, but it is unknown.

 Empirical Risk Functional

$R_{emp}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L(d_i, g(x_i, \theta))$
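The empirical risk above can be computed directly once a loss and a model are fixed. A minimal sketch for the squared loss and a linear model (the particular data and parameter values are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Empirical risk R_emp(theta) = (1/N) * sum_i L(d_i, g(x_i, theta))
# for the squared loss L(d, g) = (d - g)^2 and a linear model g(x, theta) = theta^T x.

def empirical_risk(theta, X, d):
    """Mean squared loss over N i.i.d. examples (x_i, d_i)."""
    g = X @ theta                  # model outputs g(x_i, theta)
    return np.mean((d - g) ** 2)   # (1/N) * sum of squared losses

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.array([1.0, -1.0, 0.0])
theta = np.array([1.0, -1.0])
print(empirical_risk(theta, X, d))  # this theta fits the three examples exactly
```

The true risk $R(\theta)$ would require the unknown joint PDF $p(x, d)$; the empirical risk only needs the sample.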

(9)

Statistical Nature of Learning

 Definition (Convergence in Probability):

A sequence of random variables $\theta_1, \theta_2, \theta_3, \dots, \theta_N, \dots$ is said to converge in probability to $\theta^*$, i.e., $\theta_N \xrightarrow{P} \theta^*$, if for any $\varepsilon > 0$,

$p(|\theta_N - \theta^*| > \varepsilon) \to 0 \quad \text{as } N \to \infty.$

 Definition (Strict Consistency):

Let $\Lambda(c) = \{\theta \mid R(\theta) \ge c\}$. If

$\inf_{\theta \in \Lambda(c)} R_{emp}(\theta) \xrightarrow{P} \inf_{\theta \in \Lambda(c)} R(\theta) \quad \text{as } N \to \infty,$

the empirical risk functional is said to be strictly consistent.

Vapnik, Vladimir (2000). The nature of statistical learning theory. Springer.

(10)

Empirical Risk Minimization

 Empirical Risk Minimization

$\theta_{emp} = \arg\min_{\theta} R_{emp}(\theta)$

 Original Risk Minimization

$\theta_o = \arg\min_{\theta} R(\theta)$

 Meaning

$R(\theta_o)$ is the probability of optimal decision error; $R_{emp}(\theta_{emp})$ is the probability of training error.

 The difference between $R_{emp}(\theta_{emp})$ and $R(\theta_o)$ depends on the number of samples and the complexity of the classifier.

(11)

Polynomial Curve Fitting

(12)

Sum-of-Squares Error Function

(13)

0th Order Polynomial

(14)

1st Order Polynomial

(15)

3rd Order Polynomial

(16)

9th Order Polynomial

(17)

Over-fitting

Root-Mean-Square (RMS) Error:

(18)

Data Set Size: 9th Order Polynomial
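The over-fitting behavior illustrated on these slides can be reproduced in a few lines: fit polynomials of increasing order to a small noisy sample and compare the RMS error on the training set against a held-out set. The $\sin(2\pi x)$ generating function, the noise level, and the sample sizes below are assumptions of this sketch, not values taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of sin(2*pi*x) (the data-generating function is an assumption).
def make_data(n):
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, t

x_train, t_train = make_data(10)    # small training set -> over-fitting possible
x_test, t_test = make_data(100)     # held-out set to expose it

def rms_error(w, x, t):
    # E_RMS = sqrt((1/N) * sum (y - t)^2) for the fitted polynomial w
    y = np.polyval(w, x)
    return np.sqrt(np.mean((y - t) ** 2))

for order in (0, 1, 3, 9):
    w = np.polyfit(x_train, t_train, order)   # least-squares polynomial fit
    print(order, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```

The 9th-order fit drives the training RMS error to essentially zero (10 points, 10 coefficients) while the test RMS error stays large, which is the over-fitting pattern the slides depict.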

(19)

Structural Risk Minimization

 VC (Vapnik–Chervonenkis) dimension:

The maximum number of points that can be shattered, i.e., labeled in all possible ways.

 The VC dimension of linear classifiers in $N$ dimensions is $h = N + 1$ (= the number of weights, $n_w$); cf.) MLP: $O(n_w^2)$

 Measure of the complexity of a classifier:

Minimizing the VC dimension == minimizing complexity

(20)

Structural Risk Minimization

 Generalization Ability

$p\left(\left|R(\theta_o) - R_{emp}(\theta_{emp})\right| \ge \varepsilon\right) \le \left(\frac{2eN}{h}\right)^{h} \exp(-\varepsilon^2 N)$

 N: training sample size

 h: VC dimension

 e: base of the natural logarithm

 Generalization bound: $p\left(R - R_{emp} < \varepsilon\right) > 1 - \alpha$

 Confidence interval:

$\varepsilon(N, h, \alpha) = \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\alpha}{N}}$

$R(\theta_o) \le R_{emp}(\theta_{emp}) + \varepsilon(N, h, \alpha) \quad \text{with } p \ge 1 - \alpha$

Vapnik, Vladimir (2000). The nature of statistical learning theory. Springer.

(21)

Structural Risk Minimization

 For fixed training samples N

21

(22)

Structural Risk Minimization

 For fixed training samples N

22

How can we find the appropriate complexity?

(23)

Structural Risk Minimization

 For fixed training samples N

23

How can we find the appropriate complexity? Structural Risk Minimization

$(h_{emp}, \theta_{emp}) = \arg\min_{\theta, h} R_{emp}(\theta, h)$

(24)

Structural Risk Minimization

 For fixed training samples N

24

How can we find the appropriate complexity? Structural Risk Minimization

SVM, Pruning, Regularization, Dropout, Use of pre-trained Net

$(h_{emp}, \theta_{emp}) = \arg\min_{\theta, h} \left[ R_{emp}(\theta, h) + \varepsilon(h) \right]$

(25)

Interim Summary

 Learning rules and models: Parametric Optimization

 Statistical nature of learning

 Empirical risk minimization

 Structural risk minimization

$\inf_{\theta \in \Lambda(c)} R_{emp}(\theta) \xrightarrow{P} \inf_{\theta \in \Lambda(c)} R(\theta) \quad \text{as } N \to \infty$

$E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left(d_i - g(x_i, \theta)\right)^2 \quad \left(\propto R_{emp}(\theta)\right)$

$p\left(\left|R(\theta_o) - R_{emp}(\theta_{emp})\right| \ge \varepsilon\right) \le \left(\frac{2eN}{h}\right)^{h} \exp(-\varepsilon^2 N)$

(26)

Learning Rules

 Gradient descent method: Error backpropagation method

 Gauss–Newton method

 Levenberg–Marquardt method

 Newton (-Raphson) method

 Support Vector Machine

 Parametric PDF estimation: MLE, MAP, Bayesian Learning

 Non-parametric estimation: EM, MCMC

 Boltzmann Machine: MAP, Boltzmann distribution, MLP, unsupervised learning

 Bayesian Learning: MCMC, Variational Inference

(27)

Learning Models and Rules (II)

Jin Young Choi

Seoul National University

1

(28)

Outline

 Gradient descent method

 Error backpropagation method

 Gauss–Newton method

 Levenberg–Marquardt method

 Newton (-Raphson) method

(29)

Empirical Risk Minimization

 Empirical Risk Functional

$E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left(d_i - g(x_i, \theta)\right)^2 \quad \left(\propto R_{emp}(\theta)\right)$

$E(\theta) = -\sum_i p_i \ln q_i(W)$

$E(\theta) = -\sum_i \left[ p_i \ln q_i(W) + (1 - p_i) \ln\left(1 - q_i(W)\right) \right]$

 Learning Goal

Find $\theta^* = \arg\min_{\theta} E(\theta)$

 Convexity

 Generally, $E(\theta)$ is not convex.

 For the linear case, $E(\theta)$ is convex.

(30)

Gradient Descent Methods

$\nabla E(\theta) \stackrel{def}{=} \left[\frac{\partial E(\theta)}{\partial \theta_1}, \frac{\partial E(\theta)}{\partial \theta_2}, \dots, \frac{\partial E(\theta)}{\partial \theta_n}\right]^t$

 Gradient Descent Update Rule (Steepest Descent for $G = I$)

$\theta(k+1) = \theta(k) - \eta(k)\, G\, \nabla E(\theta(k)), \quad G > 0$
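The update rule can be sketched in a few lines of numpy. The quadratic objective, fixed learning rate, and iteration count below are illustrative assumptions; on this objective the minimizer is known in closed form, so convergence is easy to check:

```python
import numpy as np

# Steepest descent (G = I) on the quadratic E(theta) = 1/2 theta^T A theta - b^T theta,
# whose gradient is A theta - b and whose minimizer solves A theta = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad_E(theta):
    return A @ theta - b

theta = np.zeros(2)
eta = 0.1                          # fixed learning rate eta(k) = 0.1
for k in range(500):
    theta = theta - eta * grad_E(theta)   # theta(k+1) = theta(k) - eta * grad E

print(theta, np.linalg.solve(A, b))  # both approach the minimizer
```

The fixed rate works here because $\eta < 2/\lambda_{max}(A)$; the learning-rate search methods discussed later choose $\eta(k)$ adaptively instead.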

(31)

Backpropagation Learning Rule

 Feedforward Neural Networks

(32)

Backpropagation Learning Rule

 Empirical Risk Function:

$E_d(w) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$

 Network (inputs $x_i$, $1 \le i \le I$; hidden units $h_j$; outputs $o_k$, $1 \le k \le K$; weights $w_{ji}$, $w_{kj}$):

$net_j = \sum_i w_{ji} x_i, \qquad net_k = \sum_j w_{kj} h_j$

 Gradient descent for the output layer:

$\Delta w_{kj} = -\eta \frac{\partial E_d}{\partial w_{kj}}$

 Chain rule:

$\frac{\partial E_d}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k}\, h_j$

(33)

Backpropagation Learning Rule

To learn a regression problem, the linear activation function is used at the output and the sum-of-squares error is used as the loss function. Define the loss as $E(w) = \frac{1}{2} \left\| o(x, w) - t \right\|^2$, where the $k$-th component of $o(x, w)$ is $o_k = net_k = w_k^T h = \sum_j w_{kj} h_j$. Find $\frac{\partial E}{\partial net_k}$.

Sol.)

Since $E(w) = \frac{1}{2} \sum_k (net_k - t_k)^2$,

$\frac{\partial E}{\partial net_k} = net_k - t_k = o_k - t_k = -(t_k - o_k) = -\delta_k.$

$\Delta w_{kj} = -\eta \frac{\partial E_d}{\partial w_{kj}} = \eta\, \delta_k\, h_j, \qquad \frac{\partial E_d}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k}\, h_j$

(34)

Backpropagation Learning Rule

• For multi-label classification (e.g., output 0110100), the sigmoid activation function is used and the loss is defined by the cross-entropy function:

$E(w) = -\sum_{k}^{K} \left[ t_k \log o_k(x, w) + (1 - t_k) \log\left(1 - o_k(x, w)\right) \right], \quad \text{where } o_k = \sigma(net_k) = \frac{1}{1 + e^{-net_k}}.$

Find $\frac{\partial E}{\partial net_k}$.

Sol.)

Since $\frac{\partial o_k}{\partial net_k} = \sigma(net_k)\left(1 - \sigma(net_k)\right) = o_k (1 - o_k)$,

$\frac{\partial E}{\partial net_k} = -t_k \frac{1}{o_k} \frac{\partial o_k}{\partial net_k} - (1 - t_k) \frac{-1}{1 - o_k} \frac{\partial o_k}{\partial net_k}$

$= -t_k \frac{1}{o_k}\, o_k (1 - o_k) + (1 - t_k) \frac{1}{1 - o_k}\, o_k (1 - o_k)$

$= -t_k (1 - o_k) + (1 - t_k)\, o_k = o_k - t_k = -(t_k - o_k) = -\delta_k$

$\Delta w_{kj} = -\eta \frac{\partial E_d}{\partial w_{kj}} = \eta\, \delta_k\, h_j, \qquad \frac{\partial E_d}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k}\, h_j$
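The result $\partial E / \partial net_k = o_k - t_k$ can be verified against a central-difference numerical derivative. The particular `net` and `t` values below are arbitrary test inputs chosen for this sketch:

```python
import numpy as np

# Numerical check of dE/dnet_k = o_k - t_k for the sigmoid + cross-entropy pair.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(net, t):
    o = sigmoid(net)
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

net = np.array([0.5, -1.0, 2.0])
t = np.array([1.0, 0.0, 1.0])

analytic = sigmoid(net) - t        # o_k - t_k, from the derivation above

eps = 1e-6
numeric = np.zeros_like(net)
for k in range(len(net)):
    dp, dm = net.copy(), net.copy()
    dp[k] += eps
    dm[k] -= eps
    # central difference: (E(net_k + eps) - E(net_k - eps)) / (2 eps)
    numeric[k] = (cross_entropy(dp, t) - cross_entropy(dm, t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # small: the two gradients agree
```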

(35)

Backpropagation Learning Rule

• For multi-class classification (e.g., [0 0 0 1 0 0]), the softmax activation function is used and the loss is defined by the cross-entropy function:

$E(w) = -\sum_{i}^{K} t_i \log o_i(x, w), \quad \text{where } o_k(x, w) = \frac{e^{net_k}}{\sum_j e^{net_j}}.$

The target value $t_k \in \{0, 1\}$ is labeled by a one-hot vector. Find $\frac{\partial E}{\partial net_k}$.

Sol.)

$\frac{\partial E}{\partial net_k} = \frac{\partial}{\partial net_k} \left( -\sum_{i}^{K} t_i \log \frac{e^{net_i}}{\sum_j e^{net_j}} \right) = \frac{\partial}{\partial net_k} \left( -\sum_{i}^{K} \left[ t_i \log e^{net_i} - t_i \log \sum_j e^{net_j} \right] \right)$

$= \frac{\partial}{\partial net_k} \left( -\sum_{i}^{K} \left[ t_i\, net_i - t_i \log \sum_j e^{net_j} \right] \right) = -t_k + \sum_i t_i \frac{e^{net_k}}{\sum_j e^{net_j}}$

$= -t_k + \frac{e^{net_k}}{\sum_j e^{net_j}} \sum_i t_i = o_k - t_k = -(t_k - o_k) = -\delta_k$

$\Delta w_{kj} = -\eta \frac{\partial E_d}{\partial w_{kj}} = \eta\, \delta_k\, h_j, \qquad \frac{\partial E_d}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k}\, h_j$
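The softmax case gives the same gradient, $\partial E / \partial net_k = o_k - t_k$, which can again be checked numerically. The `net` vector and one-hot target below are arbitrary test inputs for this sketch:

```python
import numpy as np

# Numerical check of dE/dnet_k = o_k - t_k for softmax + cross-entropy
# with a one-hot target t.
def softmax(net):
    e = np.exp(net - net.max())    # shift by max for numerical stability
    return e / e.sum()

def loss(net, t):
    return -np.sum(t * np.log(softmax(net)))

net = np.array([1.0, -0.5, 0.3, 2.0])
t = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot label

analytic = softmax(net) - t         # o_k - t_k, from the derivation above

eps = 1e-6
numeric = np.array([(loss(net + eps * np.eye(4)[k], t) -
                     loss(net - eps * np.eye(4)[k], t)) / (2 * eps)
                    for k in range(4)])

print(np.max(np.abs(analytic - numeric)))  # small: the two gradients agree
```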

(36)

Backpropagation Learning Rule

 Empirical Risk Function:

$E_d(w) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$

$net_j = \sum_i w_{ji} x_i, \qquad net_k = \sum_j w_{kj} h_j$

Loss/activation pairs: regression: $L_2$, linear; multi-label (01001101): cross-entropy, sigmoid; multi-class (00001000): cross-entropy, softmax.

 Gradient descent for the hidden layer:

$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$

 Chain rule:

$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_i$
(37)

Backpropagation Learning Rule

 Chain rule (with $h_j = f(net_j)$ and $\frac{\partial E_d}{\partial net_k} = -\delta_k$):

$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_i$

$\frac{\partial E_d}{\partial net_j} = \sum_{k \in outputs} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial h_j} \frac{\partial h_j}{\partial net_j} = \sum_{k \in outputs} (-\delta_k)\, w_{kj}\, f'(net_j) = -f'(net_j) \sum_{k \in outputs} \delta_k w_{kj} = -\delta_j$

$\Delta w_{ji} = \eta\, \delta_j\, x_i, \qquad \delta_j = f'(net_j) \sum_{k \in outputs} \delta_k w_{kj}$

(38)

Backpropagation Learning Rule

Layered network with activations $h_i^{l-1}$, $h_j^l$, $h_k^{l+1}$ and weights $w_{ji}^l$, $w_{kj}^{l+1}$:

$net_j^l = \sum_i w_{ji}^l h_i^{l-1}, \qquad net_k^{l+1} = \sum_j w_{kj}^{l+1} h_j^l$

$\delta_k^L = -\frac{\partial E_d}{\partial net_k} = t_k - o_k$

(regression: $L_2$, linear; 01001101: cross-entropy, sigmoid; 00001000: cross-entropy, softmax)

$\Delta w_{ji}^l = \eta\, \delta_j^l\, h_i^{l-1}, \qquad \delta_j^l = f'(net_j^l) \sum_{k \in (l+1)\text{-}layer} \delta_k^{l+1} w_{kj}^{l+1}$

Matrix Form (Backward, EBP):

$\Delta W^l = \eta\, \delta^l (h^{l-1})^T + \rho\, \Delta W^{l(old)}, \qquad \delta^l = \mathrm{Diag}[f'(net^l)]\, (W^{l+1})^T \delta^{l+1}$

Matrix Form (Forward):

$h^0 = x, \qquad h^{l+1} = f(W^{l+1} h^l)$
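The delta rules above can be exercised on a tiny network. This sketch assumes a tanh hidden layer and a linear output (both illustrative choices) with the $L_2$ loss, and verifies the analytic output-layer gradient against a numerical derivative; the momentum term ($\rho$) is omitted:

```python
import numpy as np

# Two-layer network with the EBP deltas from the slide:
#   delta^L = t - o,  delta^l = Diag[f'(net^l)] W^{l+1 T} delta^{l+1},
#   dE/dW^l = -delta^l h^{l-1 T}  (so  Delta W^l = eta * delta^l h^{l-1 T}).
rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(3, 2))   # hidden-layer weights
W2 = rng.normal(scale=0.5, size=(1, 3))   # output-layer weights
x = np.array([0.3, -0.7])
t = np.array([0.5])

def forward(W1, W2, x):
    net1 = W1 @ x
    h1 = np.tanh(net1)      # hidden activation f = tanh
    o = W2 @ h1             # linear output (regression)
    return net1, h1, o

def loss(W1, W2):
    _, _, o = forward(W1, W2, x)
    return 0.5 * np.sum((t - o) ** 2)

net1, h1, o = forward(W1, W2, x)
delta2 = t - o                                       # output delta (t_k - o_k)
delta1 = (1 - np.tanh(net1) ** 2) * (W2.T @ delta2)  # f'(net1) * W2^T delta2

grad_W2 = -np.outer(delta2, h1)   # dE/dW2 = -delta2 h1^T
grad_W1 = -np.outer(delta1, x)    # dE/dW1 = -delta1 x^T

# Check grad_W2 against a central-difference numerical derivative.
eps = 1e-6
num = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(W1, Wp) - loss(W1, Wm)) / (2 * eps)
print(np.max(np.abs(grad_W2 - num)))  # small: analytic and numeric agree
```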
(39)

Two Layer Convolutional Neural Network

Figure: two convolutional layers with ReLU activations; feature maps 32×32×3 → 32×32×32 → 16×16×32 → 16×16×64 → 8×8×64, flattened to 4096, then a fully connected layer to 10 outputs.

(40)

Gradient Descent Methods

14

(41)

15

Gradient Descent Methods

 Optimal Learning Rate

• Necessary Condition

 Learning Rate Search Methods

• Initial Bracketing

• Line Searching

• Secant Method (Approximate Newton Method)

• Bisection Method

• Golden section search method

$\frac{dE(\theta_{next})}{d\eta} = -\nabla E(\theta_{next})^T \nabla E(\theta_{now}) = 0, \qquad \text{where } \theta_{next} = \theta_{now} - \eta \nabla E(\theta_{now})$

(42)

16

Newton (-Raphson) Method

 Principle: the descent direction is determined using the second derivatives of the objective function $E(\theta)$, if available.

 If the starting position $\theta_{now}$ is sufficiently close to a local minimum, the objective function can be approximated by a quadratic form:

$E(\theta) \approx E(\theta_{now}) + \mathbf{g}^T (\theta - \theta_{now}) + \frac{1}{2} (\theta - \theta_{now})^T \mathbf{H} (\theta - \theta_{now}),$

where $\mathbf{g} = \nabla E(\theta_{now})$, $\mathbf{H} = \nabla^2 E(\theta_{now})$.

(43)

Newton (-Raphson) Method

$E(\theta) \approx E(\theta_{now}) + \mathbf{g}^T (\theta - \theta_{now}) + \frac{1}{2} (\theta - \theta_{now})^T \mathbf{H} (\theta - \theta_{now}),$

where $\mathbf{g} = \nabla E(\theta_{now})$, $\mathbf{H} = \nabla^2 E(\theta_{now})$.

(44)

Newton (-Raphson) Method

 Since the equation defines a quadratic function, its minimum can be determined by differentiating and setting to 0:

$\nabla E(\theta_{next}) = \mathbf{g} + \mathbf{H} (\theta_{next} - \theta_{now}) = 0 \;\Rightarrow\; \theta_{next} = \theta_{now} - \mathbf{H}^{-1} \mathbf{g}$

cf.) $\theta(k+1) = \theta(k) - \eta(k)\, G\, \nabla E(\theta(k)), \quad G > 0$
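On a quadratic objective the Newton step lands on the minimizer in a single update, which is easy to verify numerically. A minimal sketch (the particular $\mathbf{H}$, $b$, and starting point are illustrative assumptions):

```python
import numpy as np

# For E(theta) = 1/2 theta^T H theta - b^T theta, the gradient is H theta - b
# and the Hessian is the constant matrix H, so the Newton step
# theta_next = theta_now - H^{-1} grad E(theta_now) reaches the minimizer at once.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

def grad_E(theta):
    return H @ theta - b

theta = np.array([10.0, -10.0])                    # arbitrary start
theta = theta - np.linalg.solve(H, grad_E(theta))  # one Newton step

print(theta, np.linalg.solve(H, b))  # both equal the minimizer
```

For a non-quadratic $E(\theta)$ the step must be iterated, and $\mathbf{H}$ recomputed at each $\theta_{now}$, which is exactly the cost the Gauss–Newton method avoids.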

(45)

19

Gauss–Newton Method

 Key idea: to avoid computing the Hessian matrix, use a linearized approximation of the learning model.

$E(\theta) = \frac{1}{2} \left\| d - g(x, \theta) \right\|^2$

$g(x, \theta) \approx g(x, \theta_{now}) + J^T (\theta - \theta_{now}), \quad \text{where } J = \left. \frac{d g(x, \theta)}{d \theta} \right|_{\theta = \theta_{now}}$

$E(\theta) \approx \frac{1}{2} \left\| d - g(x, \theta_{now}) - J^T (\theta - \theta_{now}) \right\|^2$

(46)

20

Gauss–Newton Method

 Since the approximated function is quadratic in $\theta$, its minimum can be determined by differentiating and setting to 0:

$E(\theta) \approx \frac{1}{2} \left\| d - g(x, \theta_{now}) - J^T (\theta - \theta_{now}) \right\|^2$

$\nabla E(\theta) = -J \left( d - g(x, \theta_{now}) - J^T (\theta - \theta_{now}) \right) = 0$

 Update Rule

$\theta_{next} = \theta_{now} + (J J^T)^{-1} J \left( d - g(x, \theta_{now}) \right)$

$\theta_{next} = \theta_{now} - (J J^T)^{-1} \nabla E(\theta_{now}) \quad \left[\because \nabla E(\theta_{now}) = -J \left( d - g(x, \theta_{now}) \right)\right]$

cf.) $\theta(k+1) = \theta(k) - \eta(k)\, G\, \nabla E(\theta(k)), \quad G > 0$
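The update rule can be exercised on a small nonlinear least-squares problem. This sketch assumes the model $g(x, \theta) = a\, e^{b x}$ and noise-free targets (both illustrative choices, not from the slides), with $J$ holding $\partial g(x_i)/\partial \theta$ column-wise to match $\theta_{next} = \theta_{now} + (J J^T)^{-1} J (d - g(x, \theta_{now}))$:

```python
import numpy as np

# Gauss-Newton fit of g(x, theta) = a * exp(b * x) with theta = (a, b).
x = np.linspace(0.0, 1.0, 20)
true = np.array([2.0, -1.5])
d = true[0] * np.exp(true[1] * x)      # noise-free targets for a clean check

def model(theta):
    a, b = theta
    return a * np.exp(b * x)

def jacobian(theta):
    # J[:, i] = dg(x_i)/dtheta; rows are dg/da and dg/db.
    a, b = theta
    e = np.exp(b * x)
    return np.vstack([e, a * x * e])

theta = np.array([1.0, 0.0])           # rough initial guess
for _ in range(20):
    J = jacobian(theta)
    r = d - model(theta)               # residuals d - g(x, theta_now)
    theta = theta + np.linalg.solve(J @ J.T, J @ r)   # Gauss-Newton step

print(theta)   # approaches the true parameters (2.0, -1.5)
```

Only first derivatives of the model are needed; $J J^T$ plays the role of the Hessian without second derivatives.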

(47)

RNN: Recurrent Neural Networks

Recurrent Neural Networks

𝑠𝑡: State of RNN after time 𝑡

𝑥𝑡: Input to RNN at time 𝑡

𝑦𝑡: Output of RNN at time 𝑡

𝑊, 𝑈, 𝑉: parameter matrices, 𝜎(⋅) : non-linear activation function

More expressive cells: GRU (Gated Recurrent Unit), LSTM (Long Short-Term Memory), etc.

21

(48)

RNN: Recurrent Neural Networks

RNN for Sequence Generation

Q: How to use RNN to generate sequences?

A: Let 𝑥𝑡+1 = 𝑦𝑡

Q: How to initialize 𝑠0, 𝑥1? When to stop generation?

A: Use start/end-of-sequence tokens (SOS, EOS)

22

(49)

RNN: Recurrent Neural Networks

Training RNN

We observe a sequence 𝑦 of edges [1,0,…]

Principle: Teacher Forcing: replace the RNN's inputs and outputs with the real observed sequence during training

23

(50)

RNN: Recurrent Neural Networks

Training RNN

Loss $L$: binary cross-entropy, with targets $y_i$ and RNN output probabilities $\hat{y}_i$. Minimize:

$L = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$

If $y_i = 1$, we minimize $-\log \hat{y}_i$, pushing $\hat{y}_i$ higher, toward 1.

If $y_i = 0$, we minimize $-\log(1 - \hat{y}_i)$, pushing $\hat{y}_i$ lower, toward 0.

Since $\hat{y}_i$ is computed by the RNN, this loss adjusts the RNN parameters accordingly, via backpropagation!

24
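The loss can be evaluated directly; the probabilities below are hand-picked stand-ins for RNN outputs (not produced by an actual RNN), chosen to show that confident, correct predictions score lower than uninformative ones:

```python
import numpy as np

# Binary cross-entropy over a predicted sequence, as used to train the RNN.
# y holds the observed 0/1 values; y_hat holds predicted probabilities.
def bce(y, y_hat):
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.8])   # confident, mostly correct predictions
bad = np.array([0.5, 0.5, 0.5])    # uninformative predictions

print(bce(y, good), bce(y, bad))   # the better predictions give a lower loss
```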

(51)

Interim Summary

25

Gradient Descent: $\theta_{next} = \theta_{now} - \eta_k\, G\, \nabla E(\theta_{now})$

Newton (-Raphson): $\theta_{next} = \theta_{now} - \eta_k\, \mathbf{H}^{-1} \nabla E(\theta_{now})$

Gauss–Newton: $\theta_{next} = \theta_{now} - \eta_k\, (J J^t)^{-1} \nabla E(\theta_{now})$

Levenberg: $\theta_{next} = \theta_{now} - \eta_k\, (J J^t + \lambda I)^{-1} \nabla E(\theta_{now})$

Levenberg–Marquardt: $\theta_{next} = \theta_{now} - \eta_k\, (J J^t + \lambda\, \mathrm{Diag}[J J^t])^{-1} \nabla E(\theta_{now})$

(52)

Interim Summary

26

Error Backpropagation Rule (Gradient Descent)

$\Delta w_{ji}^l = \eta\, \delta_j^l\, h_i^{l-1}, \qquad \delta_j^l = f'(net_j^l) \sum_{k \in (l+1)\text{-}layer} \delta_k^{l+1} w_{kj}^{l+1}$

$\delta_k^L = -\frac{\partial E_d}{\partial net_k} = t_k - o_k$

Matrix Form (Backward, EBP):

$\Delta W^l = \eta\, \delta^l (h^{l-1})^T + \rho\, \Delta W^{l(old)}, \qquad \delta^l = \mathrm{Diag}[f'(net^l)]\, (W^{l+1})^T \delta^{l+1}$

Matrix Form (Forward):

$h^0 = x, \qquad h^{l+1} = f(W^{l+1} h^l)$
