(1)

Artificial Intelligence

Chapter 3

Neural Networks

Biointelligence Lab

School of Computer Sci. & Eng.

Seoul National University

(2)

Outline

3.1 Introduction

3.2 Training Single TLUs
- Gradient Descent
- Widrow-Hoff Rule
- Generalized Delta Procedure

3.3 Neural Networks
- The Backpropagation Method
- Derivation of the Backpropagation Learning Rule

3.4 Generalization, Accuracy, and Overfitting

3.5 Discussion

(3)

3.1 Introduction

• TLU (threshold logic unit): the basic unit of neural networks
- Based on some properties of biological neurons

• Training set
- Input: X = (x1, ..., xn), an n-dimensional vector of real values, Boolean values, ...
- Output: di, the associated action (label, class, ...)

• Target of training
- Finding an f(X) that corresponds "acceptably" to the members of the training set
- Supervised learning: labels are given along with the input vectors
(4)

3.2 Training Single TLUs
3.2.1 TLU Geometry

• Training a TLU: adjusting its variable weights

• A single TLU: perceptron, Adaline (adaptive linear element) [Rosenblatt 1962, Widrow 1962]

• Elements of a TLU
- Weights: W = (w1, ..., wn)
- Threshold: θ

• Output of the TLU, using the weighted sum s = W·X
- 1 if s - θ > 0
- 0 if s - θ < 0

• Hyperplane: W·X - θ = 0

(5)

Figure 3.1 TLU Geometry

(6)

3.2.2 Augmented Vectors

• Convention: the threshold is fixed at 0

• Arbitrary thresholds are handled with (n + 1)-dimensional augmented vectors:
W = (w1, ..., wn, -θ), X = (x1, ..., xn, 1)

• Output of the TLU
- 1 if W·X ≥ 0
- 0 if W·X < 0
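As a concrete illustration of the augmented convention above, the following minimal sketch (Python with NumPy; not part of the original slides, and the AND weights are hand-picked purely for the example) implements a TLU whose threshold has been folded into the weight vector:

```python
import numpy as np

def tlu_output(W, X):
    """TLU with augmented vectors: W = (w1, ..., wn, -theta), X = (x1, ..., xn, 1).

    The original test s - theta >= 0 becomes a plain dot-product test W.X >= 0.
    """
    return 1 if np.dot(W, X) >= 0 else 0

# Example: a 2-input TLU computing logical AND (weights chosen by hand).
W = np.array([1.0, 1.0, -1.5])            # (w1, w2, -theta)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    X = np.array([x1, x2, 1.0])           # augmented input vector
    print((x1, x2), "->", tlu_output(W, X))
```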

(7)

3.2.3 Gradient Descent Methods

• Training a TLU: minimizing an error function by adjusting the weight values

• Batch learning vs. incremental learning

• Commonly used error function: squared error ε = (d - f)²
- Gradient: ∂ε/∂W = [∂ε/∂w1, ..., ∂ε/∂wi, ..., ∂ε/∂w(n+1)]
- Chain rule: ∂ε/∂W = (∂ε/∂s)(∂s/∂W) = -2(d - f)(∂f/∂s) X, since s = W·X implies ∂s/∂W = X

• Dealing with the nonlinearity of ∂f/∂s:
- Ignore the threshold function: f = s
- Replace the threshold function with a differentiable nonlinear function

(8)

3.2.4 The Widrow-Hoff Procedure: First Solution

• Weight update procedure
- Uses f = s = W·X (the threshold function is ignored)
- Data labeled 1 → target 1, data labeled 0 → target -1

• Gradient: ∂ε/∂W = -2(d - f)(∂f/∂s) X = -2(d - f) X

• New weight vector: W ← W + c(d - f) X

• Widrow-Hoff (delta) rule
- (d - f) > 0 → increases s → decreases (d - f)
- (d - f) < 0 → decreases s → increases (d - f)
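A minimal sketch of incremental Widrow-Hoff training (Python/NumPy; the learning rate, epoch count, and toy AND-style data are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def train_widrow_hoff(X, d, c=0.1, epochs=50):
    """Incremental Widrow-Hoff (delta) rule on augmented inputs.

    X: array of shape (m, n+1), each row ending with the constant 1.
    d: array of m targets in {+1, -1}.
    c: learning rate (illustrative value).
    """
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for Xi, di in zip(X, d):
            f = np.dot(W, Xi)              # linear output f = s = W.X
            W += c * (di - f) * Xi         # W <- W + c(d - f)X
    return W

# Toy linearly separable problem: AND, with label 0 mapped to target -1.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float)
W = train_widrow_hoff(X, d)
print(W, np.sign(X @ W))                   # thresholding s recovers the labels
```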

(9)

3.2.5 The Generalized Delta Procedure: Second Solution

• Sigmoid function (differentiable) [Rumelhart et al. 1986]: f(s) = 1 / (1 + e^(-s))

• Gradient: ∂ε/∂W = -2(d - f)(∂f/∂s) X = -2(d - f) f(1 - f) X

• Generalized delta procedure
- Target outputs: 1, 0
- Output f = output of the sigmoid function
- Weight update: W ← W + c(d - f) f(1 - f) X
- f(1 - f) = 0 when f = 0 or f = 1
- Weight changes can therefore occur only within the 'fuzzy' region surrounding the hyperplane (near the point where f(s) = 1/2)
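A minimal sketch of a single sigmoid unit trained with the generalized delta rule (Python/NumPy; the learning rate, epoch count, initialization, and AND data are assumptions chosen for illustration):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_generalized_delta(X, d, c=0.5, epochs=2000):
    """Generalized delta rule for one sigmoid unit on augmented inputs.

    d holds targets in {0, 1}; c is an illustrative learning rate.
    """
    W = np.random.default_rng(0).normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):
        for Xi, di in zip(X, d):
            f = sigmoid(np.dot(W, Xi))
            # W <- W + c(d - f) f(1 - f) X
            W += c * (di - f) * f * (1.0 - f) * Xi
    return W

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([0, 0, 0, 1], dtype=float)     # AND with targets in {0, 1}
W = train_generalized_delta(X, d)
print(np.round(sigmoid(X @ W), 2))
```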

(10)

Figure 3.2 A Sigmoid Function

(11)

3.2.6 The Error-Correction Procedure

• Uses the threshold unit directly: when the output is wrong, (d - f) is either 1 or -1

• Weight update: W ← W + c(d - f) X

• In the linearly separable case, W converges to a solution after a finite number of iterations

• In the nonlinearly separable case, W never converges

• The Widrow-Hoff and generalized delta procedures, in contrast, find minimum-squared-error solutions even when the minimum error is not zero
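A sketch of the error-correction procedure (Python/NumPy; the OR data, learning rate, and epoch limit are illustrative choices, not from the slides). Because the thresholded output is used, updates happen only on misclassified examples:

```python
import numpy as np

def train_error_correction(X, d, c=1.0, max_epochs=100):
    """Error-correction (perceptron-style) procedure on augmented inputs.

    Targets d are in {0, 1}; (d - f) is +1 or -1 exactly when the unit errs.
    """
    W = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for Xi, di in zip(X, d):
            f = 1.0 if np.dot(W, Xi) >= 0 else 0.0
            if f != di:
                W += c * (di - f) * Xi     # W <- W + c(d - f)X
                errors += 1
        if errors == 0:                    # converged (linearly separable case)
            break
    return W

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([0, 1, 1, 1], dtype=float)    # OR function
print(train_error_correction(X, d))
```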

(12)

3.3 Neural Networks
3.3.1 Motivation

• Need for multiple TLUs
- Feedforward network: no cycles
- Recurrent network: cycles (treated in a later chapter)
- Layered feedforward network: the j-th layer receives input only from the (j - 1)-th layer

• Example: f = x1x2 + x̄1x̄2 (two-input even parity, which no single TLU can compute)
(13)

3.3.2 Notation

• Hidden units: neurons in all but the last layer

• Output of the j-th layer: X^(j), which is the input of the (j+1)-th layer

• Input vector: X^(0)

• Final output: f

• Weight vector of the i-th sigmoid unit in the j-th layer: W_i^(j) = (w_{1,i}^(j), ..., w_{l,i}^(j), ..., w_{m_(j-1)+1, i}^(j))

• Weighted sum of the i-th sigmoid unit in the j-th layer: s_i^(j) = X^(j-1) · W_i^(j)

• Number of sigmoid units in the j-th layer: m_j
(14)
(15)

3.3.3 The Backpropagation Method

• Weighted sum: s_i^(j) = X^(j-1) · W_i^(j)

• Gradient with respect to W_i^(j):
∂ε/∂W_i^(j) = (∂ε/∂s_i^(j)) (∂s_i^(j)/∂W_i^(j)) = (∂ε/∂s_i^(j)) X^(j-1) = -2(d - f)(∂f/∂s_i^(j)) X^(j-1) = -2 δ_i^(j) X^(j-1)

• Local gradient: δ_i^(j) = -(1/2) ∂ε/∂s_i^(j) = (d - f) ∂f/∂s_i^(j)

• Weight update: W_i^(j) ← W_i^(j) + c_i^(j) δ_i^(j) X^(j-1)

(16)

3.3.4 Computing Weight Changes in the Final Layer

• Local gradient (for the output layer k): δ^(k) = (d - f) ∂f/∂s^(k) = (d - f) f(1 - f)

• Weight update: W^(k) ← W^(k) + c^(k) (d - f) f(1 - f) X^(k-1)

(17)

3.3.5 Computing Changes to the Weights in Intermediate Layers

• Local gradient:
δ_i^(j) = (d - f) ∂f/∂s_i^(j)
        = (d - f) [ (∂f/∂s_1^(j+1))(∂s_1^(j+1)/∂s_i^(j)) + ... + (∂f/∂s_l^(j+1))(∂s_l^(j+1)/∂s_i^(j)) + ... + (∂f/∂s_(m_(j+1))^(j+1))(∂s_(m_(j+1))^(j+1)/∂s_i^(j)) ]
        = Σ_(l=1)^(m_(j+1)) (d - f) (∂f/∂s_l^(j+1)) (∂s_l^(j+1)/∂s_i^(j))
        = Σ_(l=1)^(m_(j+1)) δ_l^(j+1) (∂s_l^(j+1)/∂s_i^(j))

- The final output f depends on s_i^(j) through the summed inputs to the sigmoids in the (j+1)-th layer.

• Still needed: the terms ∂s_l^(j+1)/∂s_i^(j)

(18)

Weight Update in Hidden Layers (cont.)

• Since s_l^(j+1) = X^(j) · W_l^(j+1) = Σ_(v=1)^(m_j+1) f_v^(j) w_{v,l}^(j+1),

∂s_l^(j+1)/∂s_i^(j) = Σ_(v=1)^(m_j+1) w_{v,l}^(j+1) ∂f_v^(j)/∂s_i^(j)

- v ≠ i: ∂f_v^(j)/∂s_i^(j) = 0
- v = i: ∂f_i^(j)/∂s_i^(j) = f_i^(j)(1 - f_i^(j))

• Consequently, ∂s_l^(j+1)/∂s_i^(j) = f_i^(j)(1 - f_i^(j)) w_{i,l}^(j+1), and therefore

δ_i^(j) = f_i^(j)(1 - f_i^(j)) Σ_(l=1)^(m_(j+1)) δ_l^(j+1) w_{i,l}^(j+1)

(19)

Weight Update in Hidden Layers (cont.)

• Note the recursive form of the local gradient:
- Output layer: δ^(k) = f(1 - f)(d - f)
- Hidden layers: δ_i^(j) = f_i^(j)(1 - f_i^(j)) Σ_(l=1)^(m_(j+1)) δ_l^(j+1) w_{i,l}^(j+1)

• Backpropagation:
- The error is propagated backward from the output layer toward the input layer.
- The local gradients of a later layer are used to compute the local gradients of the layer before it.

(20)

3.3.5 (cont.)

• Example: the two-input even-parity function
- Learning rate: 1.0

input      target
1 0 1      0
0 0 1      1
0 1 1      0
1 1 1      1

(The third input component is the constant 1 of the augmented vectors.)

Figure 3.6 A Network to Be Trained by Backprop
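The sketch below (Python/NumPy; not from the slides) trains a small network of sigmoid units on this even-parity data, using the generalized delta rule at the output and the recursive local gradients in the hidden layer. The 2-unit hidden layer, random initialization, and epoch count are assumptions chosen for illustration, and a different seed or an extra hidden unit may be needed if training lands in a local minimum:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Even-parity training set (third input component is the constant 1).
X = np.array([[1, 0, 1], [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
d = np.array([0, 1, 0, 1], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))    # hidden layer: 2 sigmoid units
W2 = rng.normal(scale=0.5, size=3)         # output layer: 1 sigmoid unit
c = 1.0                                    # learning rate from the slide

for _ in range(5000):
    for Xi, di in zip(X, d):
        # Forward pass.
        h = sigmoid(Xi @ W1)               # hidden outputs f_i^(1)
        Xh = np.append(h, 1.0)             # augment with the constant input
        f = sigmoid(Xh @ W2)               # final output

        # Backward pass: local gradients (deltas).
        delta_out = (di - f) * f * (1.0 - f)
        delta_hid = h * (1.0 - h) * (delta_out * W2[:2])

        # Weight updates: W <- W + c * delta * input.
        W2 += c * delta_out * Xh
        W1 += c * np.outer(Xi, delta_hid)

for Xi in X:
    h = np.append(sigmoid(Xi @ W1), 1.0)
    print(Xi[:2], round(float(sigmoid(h @ W2)), 2))
```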

(21)

3.4 Generalization, Accuracy, and Overfitting

• Generalization ability
- The network appropriately classifies vectors that are not in the training set.
- Measured by accuracy.

• Curve fitting
- The number of training input vectors should be ≥ the number of degrees of freedom of the network.
- Given m data points, is an (m - 1)-degree polynomial the best model? No: it merely interpolates the points and does not capture any special structure in the data.

• Overfitting
- Extra degrees of freedom end up essentially just fitting the noise.
- Given sufficient data, the Occam's Razor principle dictates choosing the lowest-degree polynomial that adequately fits the data.
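As a small illustration of the curve-fitting point (not from the slides; the noisy linear data and the polynomial degrees compared are assumptions), fitting an (m - 1)-degree polynomial to m noisy points drives the training error to zero while the error on fresh points from the same distribution typically grows:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 8
x = np.linspace(0, 1, m)
y = 2 * x + rng.normal(scale=0.2, size=m)        # noisy samples of a line

x_new = rng.uniform(0, 1, 100)                    # fresh data, same distribution
y_new = 2 * x_new + rng.normal(scale=0.2, size=100)

for degree in (1, m - 1):                         # low-degree vs. interpolating fit
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))
```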

(22)
(23)

3.4 (cont'd)

• Out-of-sample-set error rate
- The error rate on data drawn from the same underlying distribution as the training set.

• Dividing the available data into a training set and a validation set
- Typically 2/3 for training and 1/3 for validation.

• k-fold cross-validation
- Partition the data into k disjoint subsets (folds).
- Repeat training k times, each time using one fold as the validation set and the other k - 1 folds (combined) as the training set.
- Take the average of the k validation error rates as the estimate of the out-of-sample error.
- Empirically, 10-fold cross-validation is preferred.
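A sketch of the k-fold estimate (Python/NumPy; train_fn and error_fn are hypothetical placeholders for whatever learner and error measure are being evaluated):

```python
import numpy as np

def k_fold_error(X, y, train_fn, error_fn, k=10, seed=0):
    """Estimate the out-of-sample error rate by k-fold cross-validation.

    train_fn(X_train, y_train) -> model and error_fn(model, X_val, y_val)
    -> error rate are placeholders supplied by the caller.
    """
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)              # k disjoint subsets (folds)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])    # train on the other k-1 folds
        errors.append(error_fn(model, X[val], y[val]))
    return float(np.mean(errors))               # average validation error
```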

(24)

Figure 3.8 Error Versus Number of Hidden Units

Figure 3.9 Estimate of Generalization Error Versus Number of Hidden Units

(25)

3.5 Additional Readings and Discussion

• Applications
- Pattern recognition, automatic control, brain-function modeling
- Designing and training neural networks still requires experience and experimentation.

• Major annual conferences
- Neural Information Processing Systems (NIPS)
- International Conference on Machine Learning (ICML)
- Computational Learning Theory (COLT)

• Major journals
- Neural Computation
- IEEE Transactions on Neural Networks
- Machine Learning
