Learning Models and Rules (I)
Jin Young Choi
Seoul National University
Outline
Learning rules and models
Statistical nature of learning
Empirical risk minimization
Structural risk minimization
Learning Models
Probability Discriminant Functions
• A Posteriori Probability Function
Linear Learning Models
Universal Learning Models
• Basis networks
• Gaussian basis
• High order basis
• Sine, cosine, wavelet basis
• (Deep) Neural networks
$g_i(x) = p(\omega_i \mid x)$

$g(x) = w_1 x_1 + \cdots + w_d x_d + w_0 = \mathbf{w}^t \mathbf{x} + w_0$
Linear Learning Models
[Figure: weight vs. height data separated by a linear decision surface = hyperplane H]

$g(x) = \mathbf{w}^t \mathbf{x} + w_0 = 0$
Universal Learning Models
[Figure: weight vs. height data separated by a nonlinear decision surface = hypersurface H]

$g(x) = 0$
Learning Strategy
Nonlinear Learning Approach
• Learning of nonlinear kernel parameters
• Steepest descent method (error backpropagation learning)
• Approximate Newton method
• Genetic algorithm, simulated annealing
• Slow learning speed
• Local minima
Linear Learning Approach
• Convex optimization (least squares, support vector machine)
• Fast learning speed, mathematically sound
• Kernel determination problems
• Generalization problems: high-dimensional feature space
Statistical Nature of Learning
The generalized linear discriminant function (General Learning Model):

$g(x, \theta) = \mathbf{w}^t \boldsymbol{\varphi}(x, \mathbf{v}), \qquad \theta = [\mathbf{w}^t \ \ \mathbf{v}^t]^t$

Observations (i.i.d. examples): $\{(x_i, d_i)\}_{i=1}^{N}$

For binary classification:
$d_i = 1, \ \forall x_i \in \omega_1; \qquad d_i = -1, \ \forall x_i \in \omega_2$

Multi-class classification:
$d_i = 1, \ \forall x \in \omega_i; \qquad d_j = 0, \ \forall x \notin \omega_i$

Multi-label classification: 0 1 0 1 1 0 1 1 0 …
Ex) portrait with attributes (bald, mustache, gold hair, blue eyes, etc.)
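For concreteness, here is a minimal sketch of one such generalized linear model with a Gaussian basis $\boldsymbol{\varphi}(x, \mathbf{v})$; the basis choice, dimensions, and data below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def gaussian_basis(x, centers, width=1.0):
    """phi(x, v): one Gaussian bump per center v_m (a common basis choice)."""
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * width ** 2))

def g(x, w, centers):
    """g(x, theta) = w^t phi(x, v), with theta = [w, v]."""
    return w @ gaussian_basis(x, centers)

# Toy setup: 5 Gaussian basis functions over a 2-D input space
rng = np.random.default_rng(0)
centers = rng.standard_normal((5, 2))    # v: basis centers
w = rng.standard_normal(5)               # w: linear output weights
print(g(np.array([0.1, -0.2]), w, centers))
```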
Statistical Nature of Learning
Loss Function

$L(d, g(x, \theta)) = \left(d - g(x, \theta)\right)^2$ (or entropy, KLD)
$L = -d \log g(x, \theta)$

Risk Functional (expected value of $L$; or total loss)

$R(\theta) = \int L(d, g(x, \theta)) \, dp(x, d)$

where $p(x, d)$ is the joint PDF for $x$ and $d$, but unknown.

Empirical Risk Functional

$R_{emp}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\left(d_i, g(x_i, \theta)\right)$
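A minimal sketch of how $R_{emp}$ approximates $R$ by a sample average; the model, loss, and data here are hypothetical:

```python
import numpy as np

def empirical_risk(theta, xs, ds, g, loss=lambda d, y: (d - y) ** 2):
    """R_emp(theta) = (1/N) * sum_i L(d_i, g(x_i, theta))."""
    return np.mean([loss(d, g(x, theta)) for x, d in zip(xs, ds)])

# Hypothetical linear model g(x, theta) = w^t x + w0, theta = [w, w0]
g = lambda x, theta: theta[:-1] @ x + theta[-1]

rng = np.random.default_rng(0)
xs = rng.standard_normal((100, 2))       # N = 100 i.i.d. examples
ds = np.sign(xs[:, 0] + xs[:, 1])        # toy targets in {-1, +1}
theta = np.array([1.0, 1.0, 0.0])
print(empirical_risk(theta, xs, ds, g))
```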
Statistical Nature of Learning
Definition (Convergence in Probability):
A sequence of random variables $\xi_1, \xi_2, \xi_3, \ldots, \xi_N, \ldots$ is said to converge in probability to $\xi^*$, i.e., $\xi_N \xrightarrow{P} \xi^*$, if for any $\varepsilon > 0$,
$P(|\xi_N - \xi^*| > \varepsilon) \to 0$ as $N \to \infty$.

Definition (Strict Consistency):
Let $\Lambda(c) = \{\theta \mid R(\theta) \geq c\}$. If

$\inf_{\theta \in \Lambda(c)} R_{emp}(\theta) \xrightarrow{P} \inf_{\theta \in \Lambda(c)} R(\theta), \quad \text{as } N \to \infty,$

the empirical risk functional is said to be strictly consistent.
Vapnik, Vladimir (2000). The nature of statistical learning theory. Springer.
Empirical Risk Minimization
Empirical Risk Minimization: $\theta_{emp} = \arg\min_{\theta} R_{emp}(\theta)$
Original Risk Minimization: $\theta_{o} = \arg\min_{\theta} R(\theta)$
Meaning
$R(\theta_o)$ is the probability of the optimal decision error; $R_{emp}(\theta_{emp})$ is the probability of the training error.
The difference between $R_{emp}(\theta_{emp})$ and $R(\theta_o)$ depends on the number of samples and the complexity of the classifier.
Polynomial Curve Fitting
Sum-of-Squares Error Function
[Figures: polynomial fits of order 0, 1, 3, and 9 — the 9th-order fit over-fits; Root-Mean-Square (RMS) error vs. model order; effect of data set size on the 9th-order polynomial]
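The over-fitting behavior can be reproduced with a short script; the $\sin(2\pi x)$ target, noise level, and sample sizes are assumptions for illustration:

```python
import numpy as np

# Fit polynomials of increasing order to noisy samples of sin(2*pi*x);
# training RMS error falls with order while test RMS error rises for
# high orders -- over-fitting.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test)

for M in (0, 1, 3, 9):
    w = np.polyfit(x_train, t_train, M)          # least-squares fit
    rms = lambda x, t: np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))
    print(f"M={M}: train RMS={rms(x_train, t_train):.3f}, "
          f"test RMS={rms(x_test, t_test):.3f}")
```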
Structural Risk Minimization
VC (Vapnik-Chervonenkis) dimension:
the maximum number of points that can be labeled in all possible ways (shattered)
The VC dimension of linear classifiers in $N$ dimensions is $h = N + 1$ (= number of weights, $n_w$); cf.) MLP: $O(n_w^2)$
A measure of the complexity of a classifier
Minimizing VC dim. == Minimizing Complexity
Structural Risk Minimization
Generalization Ability
N: training sample size
h: VC-dimension
e: base of the natural logarithm

Generalization bound: $p\left(\left|R(\theta_o) - R_{emp}(\theta_{emp})\right| > \varepsilon\right) < \left(\frac{2eN}{h}\right)^{h} \exp\left(-\varepsilon^2 N\right)$, so that $p\left(R - R_{emp} < \varepsilon\right) > 1 - \alpha$

Confidence interval: $\varepsilon(N, h, \alpha) = \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\alpha}{N}}$

$R(\theta_o) \leq R_{emp}(\theta_{emp}) + \varepsilon(N, h, \alpha)$, with $p \geq 1 - \alpha$
Vapnik, Vladimir (2000). The nature of statistical learning theory. Springer.
Structural Risk Minimization
For fixed training samples $N$: how can we find the appropriate complexity?
→ Structural Risk Minimization:

$(\theta_{emp}, h_{emp}) = \arg\min_{\theta,\, h} R_{emp}(\theta, h)$

Examples: SVM, pruning, regularization, dropout, use of a pre-trained net
Interim Summary
Learning rules and models: Parametric Optimization
Statistical nature of learning
Empirical risk minimization
Structural risk minimization
$\inf_{\theta \in \Lambda(c)} R_{emp}(\theta) \xrightarrow{P} \inf_{\theta \in \Lambda(c)} R(\theta), \quad \text{as } N \to \infty$

$E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left(d_i - g(x_i, \theta)\right)^2 \; \left(= R_{emp}(\theta)\right)$

$p\left(\left|R(\theta_o) - R_{emp}(\theta_{emp})\right| > \varepsilon\right) < \left(\frac{2eN}{h}\right)^{h} \exp\left(-\varepsilon^2 N\right)$
Learning Rules
Gradient descent method: error backpropagation method
Gauss-Newton method
Levenberg–Marquardt method
Newton(-Raphson) method
Support Vector Machine
Parametric PDF estimation: MLE, MAP, Bayesian learning
Non-parametric estimation: EM, MCMC
Boltzmann Machine: MAP, Boltzmann distribution, MLP, unsupervised learning
Bayesian Learning: MCMC, Variational Inference
Learning Models and Rules (II)
Jin Young Choi
Seoul National University
Outline
Gradient descent method
Error backpropagation method
Gauss-Newton method
Levenberg–Marquardt method
Newton (-Raphson) method
Empirical Risk Minimization
Empirical Risk Functional

$E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left(d_i - g(x_i, \theta)\right)^2 \; \left(= R_{emp}(\theta)\right)$
$E(\theta) = -\sum_i p_i \ln q_i(W)$
$E(\theta) = -\sum_i \left[p_i \ln q_i(W) + (1 - p_i) \ln\left(1 - q_i(W)\right)\right]$

Learning Goal
Find $\theta^* = \arg\min_{\theta} E(\theta)$

Convexity
Generally, $E(\theta)$ is not convex.
For the linear case, $E(\theta)$ is convex.

$\nabla E(\theta) \overset{def}{=} \left[\frac{\partial E(\theta)}{\partial \theta_1}, \frac{\partial E(\theta)}{\partial \theta_2}, \ldots, \frac{\partial E(\theta)}{\partial \theta_n}\right]^t$
Gradient Descent Methods
Gradient Descent Update Rule (Steepest Descent for $G = I$):

$\theta(k+1) = \theta(k) - \eta_k \, G \, \nabla E(\theta(k)), \qquad G > 0$
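A minimal sketch of this update rule; the quadratic objective and step size are illustrative assumptions, and $G = I$ recovers steepest descent:

```python
import numpy as np

def gradient_descent(grad_E, theta0, eta=0.1, G=None, n_steps=100):
    """theta(k+1) = theta(k) - eta * G @ grad E(theta(k)); G must be positive definite."""
    theta = np.asarray(theta0, dtype=float)
    G = np.eye(theta.size) if G is None else G   # G = I -> steepest descent
    for _ in range(n_steps):
        theta = theta - eta * G @ grad_E(theta)
    return theta

# Hypothetical quadratic objective E(theta) = 0.5 * theta^T A theta
A = np.array([[3.0, 0.5], [0.5, 1.0]])
print(gradient_descent(lambda th: A @ th, [1.0, -2.0]))  # -> near [0, 0]
```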
Backpropagation Learning Rule
Feedforward Neural Networks
[Figure: two-layer feedforward network — inputs $x_i$ ($i = 1, \ldots, I$), hidden units $h_j$ with $net_j = \sum_i w_{ji} x_i$, outputs $o_k$ ($k = 1, \ldots, K$) with $net_k = \sum_j w_{kj} h_j$; weights $w_{ji}$, $w_{kj}$]

Backpropagation Learning Rule
Empirical Risk Function: $E_d(w) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = \frac{1}{2} \left| t - o(x, w) \right|^2$

Gradient descent for output layer: $\Delta w_{kj} = -\eta \, \frac{\partial E_d}{\partial w_{kj}}$

Chain rule: $\frac{\partial E_d}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \, \frac{\partial net_k}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \, h_j$
Backpropagation Learning Rule
• To learn a regression problem, a linear activation function is used at the output and the sum-of-squares error is used as the loss function: $E(w) = \frac{1}{2} \left| o(x, w) - t \right|^2$, where the $k$-th component of $o(x, w)$ is $o_k = net_k = w_k^T h = \sum_j w_{kj} h_j$. Find $\frac{\partial E}{\partial net_k}$.

Sol.) Since $E(w) = \frac{1}{2} \sum_k (net_k - t_k)^2$,
$\frac{\partial E}{\partial net_k} = net_k - t_k = o_k - t_k = -(t_k - o_k) = -\delta_k$.
$\Delta w_{kj} = -\eta \, \frac{\partial E_d}{\partial w_{kj}} = \eta \, \delta_k h_j, \qquad \frac{\partial E_d}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \, \frac{\partial net_k}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k} \, h_j$
Backpropagation Learning Rule
• For multi-label classification (e.g., output 0110100), the sigmoid activation function is used and the loss is defined by the cross-entropy loss function: $E(w) = -\sum_k^K \left[t_k \log o_k(x, w) + (1 - t_k) \log\left(1 - o_k(x, w)\right)\right]$, where $o_k = \sigma(net_k) = \frac{1}{1 + e^{-net_k}}$. Find $\frac{\partial E}{\partial net_k}$.

Sol.) Since $\frac{\partial o_k}{\partial net_k} = \sigma(net_k)\left(1 - \sigma(net_k)\right) = o_k(1 - o_k)$,

$\frac{\partial E}{\partial net_k} = -t_k \, \frac{1}{o_k} \, \frac{\partial o_k}{\partial net_k} - (1 - t_k) \, \frac{-1}{1 - o_k} \, \frac{\partial o_k}{\partial net_k}$
$= -t_k \, \frac{1}{o_k} \, o_k (1 - o_k) - (1 - t_k) \, \frac{-1}{1 - o_k} \, o_k (1 - o_k)$
$= -t_k (1 - o_k) + (1 - t_k) \, o_k = o_k - t_k = -(t_k - o_k) = -\delta_k$
$\Delta w_{kj} = -\eta \, \frac{\partial E_d}{\partial w_{kj}} = \eta \, \delta_k h_j$
Backpropagation Learning Rule
• For multi-class classification (e.g., [0 0 0 1 0 0]), the softmax activation function is used and the loss is defined by the cross-entropy loss function: $E(w) = -\sum_i^K t_i \log o_i(x, w)$, where $o_k(x, w) = \frac{e^{net_k}}{\sum_j e^{net_j}}$. The target value $t_k \in \{0, 1\}$ is labeled as a one-hot vector. Find $\frac{\partial E}{\partial net_k}$.

Sol.)
$\frac{\partial E}{\partial net_k} = \frac{\partial}{\partial net_k}\left(-\sum_i^K t_i \log \frac{e^{net_i}}{\sum_j e^{net_j}}\right) = \frac{\partial}{\partial net_k}\left(-\sum_i^K \left[t_i \log e^{net_i} - t_i \log \sum_j e^{net_j}\right]\right)$
$= \frac{\partial}{\partial net_k}\left(-\sum_i^K \left[t_i \, net_i - t_i \log \sum_j e^{net_j}\right]\right) = -t_k + \sum_i t_i \, \frac{e^{net_k}}{\sum_j e^{net_j}}$
$= -t_k + \frac{e^{net_k}}{\sum_j e^{net_j}} \sum_i t_i = o_k - t_k = -(t_k - o_k) = -\delta_k$

$\Delta w_{kj} = -\eta \, \frac{\partial E_d}{\partial w_{kj}} = \eta \, \delta_k h_j$
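As a sanity check, the derived output-layer delta $\partial E / \partial net_k = o_k - t_k$ for the softmax/cross-entropy case can be verified numerically; the toy values below are assumptions:

```python
import numpy as np

def softmax(net):
    e = np.exp(net - net.max())          # shift for numerical stability
    return e / e.sum()

def ce_softmax(net, t):
    """Cross-entropy loss E = -sum_i t_i log o_i with softmax outputs."""
    return -np.sum(t * np.log(softmax(net)))

net = np.array([0.2, -1.0, 0.5])         # toy pre-activations
t = np.array([0.0, 1.0, 0.0])            # one-hot target
o = softmax(net)

analytic = o - t                         # dE/dnet_k = o_k - t_k = -delta_k
eps = 1e-6                               # central finite differences
numeric = np.array([
    (ce_softmax(net + eps * np.eye(3)[k], t) -
     ce_softmax(net - eps * np.eye(3)[k], t)) / (2 * eps)
    for k in range(3)])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```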
Backpropagation Learning Rule
Empirical Risk Function: $E_d(w) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$
(Output deltas — Regression: $L_2$, linear; 01001101: cross-entropy, sigmoid; 00001000: cross-entropy, softmax)

Gradient descent for hidden layer: $\Delta w_{ji} = -\eta \, \frac{\partial E_d}{\partial w_{ji}}$

Chain rule: $\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \, x_i$

Backpropagation Learning Rule
Chain rule: $\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \, x_i$

$\frac{\partial E_d}{\partial net_j} = \sum_{k \in outputs} \frac{\partial E_d}{\partial net_k} \, \frac{\partial net_k}{\partial h_j} \, \frac{\partial h_j}{\partial net_j} = \sum_{k \in outputs} (-\delta_k) \, w_{kj} \, f'(net_j) = -\delta_j$

using $\frac{\partial E_d}{\partial net_k} = -\delta_k$, $h_j = f(net_j)$, and $net_k = \sum_j w_{kj} h_j$ so that $\frac{\partial net_k}{\partial h_j} = w_{kj}$. Hence

$\Delta w_{ji} = \eta \, \delta_j \, x_i, \qquad \delta_j = f'(net_j) \sum_{k \in outputs} \delta_k \, w_{kj}$
Backpropagation Learning Rule
[Figure: multi-layer network — $h_i^{l-1}$, weights $w_{ji}^l$ with $net_j^l = \sum_i w_{ji}^l h_i^{l-1}$, $h_j^l$, weights $w_{kj}^{l+1}$ with $net_k^{l+1} = \sum_j w_{kj}^{l+1} h_j^l$, $h_k^{l+1}$]

Output layer: $\delta_k^L = -\frac{\partial E_d}{\partial net_k} = t_k - o_k$ (Regression: $L_2$, linear; 01001101: cross-entropy, sigmoid; 00001000: cross-entropy, softmax)

Layer $l$: $\Delta w_{ji}^l = \eta \, \delta_j^l \, h_i^{l-1}, \qquad \delta_j^l = f'(net_j^l) \sum_{k \in (l+1)\,layer} \delta_k^{l+1} \, w_{kj}^{l+1}$

Matrix Form (Forward): $h^0 = x, \qquad h^{l+1} = f\!\left(W^{l+1} h^l\right)$ ($f$ applied elementwise)

Matrix Form (Backward, EBP): $\delta^l = Diag\!\left[f'(net^l)\right] \left(W^{l+1}\right)^T \delta^{l+1}, \qquad \Delta W^l = \eta \, \delta^l \left(h^{l-1}\right)^T + \rho \, \Delta W^l(old)$
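Putting the forward and backward matrix forms together, a minimal two-layer sketch; the architecture, data, and learning rate are assumptions, sigmoid activations with cross-entropy output delta $\delta^L = t - o$ are used, and the momentum term $\rho \, \Delta W^l(old)$ is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda a: 1.0 / (1.0 + np.exp(-a))           # sigmoid; f' = f(1 - f)

x = rng.standard_normal((4, 1))                  # h^0 = x
t = np.array([[1.0], [0.0]])                     # target
W1 = rng.standard_normal((3, 4)) * 0.1
W2 = rng.standard_normal((2, 3)) * 0.1
eta = 0.5

for _ in range(200):
    # Forward: h^{l+1} = f(W^{l+1} h^l)
    net1 = W1 @ x;  h1 = f(net1)
    net2 = W2 @ h1; o = f(net2)
    # Backward: delta^L = t - o (cross-entropy + sigmoid),
    # delta^l = Diag[f'(net^l)] (W^{l+1})^T delta^{l+1}
    d2 = t - o
    d1 = (h1 * (1 - h1)) * (W2.T @ d2)
    # Update: dW^l = eta * delta^l (h^{l-1})^T
    W2 += eta * d2 @ h1.T
    W1 += eta * d1 @ x.T

print(np.round(o.ravel(), 3))                    # -> approaches [1, 0]
```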
Two-Layer Convolutional Neural Network
[Figure: CNN with ReLU activations — feature maps 32×32×3 → 32×32×32 → 16×16×32 → 16×16×64 → 8×8×64 → 4096 → 10]
Gradient Descent Methods
Optimal Learning Rate
• Necessary condition:

$\nabla E(\theta_{next})^T \, \nabla E(\theta_{now}) = 0, \quad \text{i.e.,} \quad \nabla E\!\left(\theta_{now} - \eta \, \nabla E(\theta_{now})\right)^T \nabla E(\theta_{now}) = 0$

Learning Rate Search Methods
• Initial Bracketing
• Line Searching
• Secant Method (Approximate Newton Method)
• Bisection Method
• Golden section search method (see the sketch below)
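A sketch of one of the listed searches, golden-section, used to pick $\eta$ along the steepest-descent direction; the bracketing interval and quadratic objective are assumptions:

```python
import numpy as np

def golden_section(phi, a, b, tol=1e-5):
    """Minimize phi(eta) on [a, b] by golden-section interval shrinking."""
    r = (np.sqrt(5) - 1) / 2                     # golden ratio ~ 0.618
    c, d = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):                      # minimum lies in [a, d]
            b, d = d, c
            c = b - r * (b - a)
        else:                                    # minimum lies in [c, b]
            a, c = c, d
            d = a + r * (b - a)
    return (a + b) / 2

# Hypothetical quadratic E and a line search along -grad E
A = np.array([[3.0, 0.5], [0.5, 1.0]])
E = lambda th: 0.5 * th @ A @ th
theta = np.array([1.0, -2.0])
g = A @ theta                                    # grad E(theta)
eta = golden_section(lambda e: E(theta - e * g), 0.0, 1.0)
print(eta, E(theta - eta * g))
```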
Newton(-Raphson) Method
Principle: the descent direction is determined using the second derivatives of the objective function, if available.
If the starting position $\theta_{now}$ is sufficiently close to a local minimum, the objective function $E(\theta)$ can be approximated by a quadratic form:

$E(\theta) \approx E(\theta_{now}) + \nabla E(\theta_{now})^T (\theta - \theta_{now}) + \frac{1}{2} (\theta - \theta_{now})^T \mathbf{H} (\theta - \theta_{now})$

where $\nabla E(\theta) = \frac{\partial E(\theta)}{\partial \theta}$ and $\mathbf{H} = \nabla^2 E(\theta_{now})$.

Since this approximation defines a quadratic function, its minimum can be determined by differentiating and setting to 0:

$\nabla E(\theta) = \nabla E(\theta_{now}) + \mathbf{H} (\theta - \theta_{now}) = 0$
$\theta_{next} = \theta_{now} - \mathbf{H}^{-1} \, \nabla E(\theta_{now})$

cf.) $\theta(k+1) = \theta(k) - \eta_k \, G \, \nabla E(\theta(k)), \quad G > 0$
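A minimal sketch of the Newton update; on a quadratic objective, assumed here for illustration, a single step lands exactly on the minimizer:

```python
import numpy as np

def newton_step(grad_E, H, theta_now):
    """theta_next = theta_now - H^{-1} grad E(theta_now)."""
    return theta_now - np.linalg.solve(H, grad_E(theta_now))

# Hypothetical quadratic E(theta) = 0.5 theta^T A theta - b^T theta:
# grad E = A theta - b, Hessian H = A, minimizer A^{-1} b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
grad_E = lambda th: A @ th - b
print(newton_step(grad_E, A, np.zeros(2)), np.linalg.solve(A, b))
```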
Gauss-Newton Method
Key idea: instead of using the Hessian matrix, we use a linearized approximation of the learning model.

$E(\theta) = \frac{1}{2} \left| d - g(x, \theta) \right|^2$
$g(x, \theta) \approx g(x, \theta_{now}) + J^T (\theta - \theta_{now}), \quad \text{where } J = \left.\frac{dg(x, \theta)}{d\theta}\right|_{\theta = \theta_{now}}$
$E(\theta) = \frac{1}{2} \left| d - g(x, \theta_{now}) - J^T (\theta - \theta_{now}) \right|^2$

Since this function is quadratic in $\theta$, its minimum can be determined by differentiating and setting to 0:

$\nabla E(\theta) = -J \left(d - g(x, \theta_{now}) - J^T (\theta - \theta_{now})\right) = 0$

Update Rule:
$\theta_{next} = \theta_{now} + (J J^T)^{-1} J \left(d - g(x, \theta_{now})\right)$
$\theta_{next} = \theta_{now} - (J J^T)^{-1} \nabla E(\theta_{now}) \quad [\because \nabla E(\theta_{now}) = -J \left(d - g(x, \theta_{now})\right)]$

cf.) $\theta(k+1) = \theta(k) - \eta_k \, G \, \nabla E(\theta(k)), \quad G > 0$
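A minimal Gauss-Newton sketch on a hypothetical one-parameter model $g(x, \theta) = e^{-\theta x}$; the model, data, and starting point are assumptions:

```python
import numpy as np

x = np.linspace(0, 2, 20)
theta_true = 1.5
d = np.exp(-theta_true * x)                      # noise-free targets

theta = np.array([0.3])                          # initial guess
for _ in range(10):
    r = d - np.exp(-theta * x)                   # residuals d - g(x, theta_now)
    J = (-x * np.exp(-theta * x))[None, :]       # J = dg/dtheta, shape (1, N)
    # theta_next = theta_now + (J J^T)^{-1} J (d - g(x, theta_now))
    theta = theta + np.linalg.solve(J @ J.T, J @ r)
print(theta)                                     # -> approaches 1.5
```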
RNN: Recurrent Neural Networks
Recurrent Neural Networks
𝑠𝑡: State of RNN after time 𝑡
𝑥𝑡: Input to RNN at time 𝑡
𝑦𝑡: Output of RNN at time 𝑡
𝑊, 𝑈, 𝑉: parameter matrices, 𝜎(⋅) : non-linear activation function
More expressive cells: GRU(Gated Recurrent Unit), LSTM(Long- Short-Term Memory), etc.
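The slide lists the ingredients but not the update equation; below is a minimal sketch assuming the common Elman-style recurrence $s_t = \tanh(W s_{t-1} + U x_t)$, $y_t = \sigma(V s_t)$ (this specific form and the toy dimensions are assumptions, not taken from the slide):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def rnn_step(s_prev, x_t, W, U, V):
    """One RNN step: state update, then output (Elman-style, an assumption)."""
    s_t = np.tanh(W @ s_prev + U @ x_t)   # new state from old state and input
    y_t = sigmoid(V @ s_t)                # output at time t
    return s_t, y_t

# Toy dimensions: state 4, input 3, output 2
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
U = rng.standard_normal((4, 3))
V = rng.standard_normal((2, 4))
s = np.zeros(4)
for x in rng.standard_normal((5, 3)):     # a length-5 input sequence
    s, y = rnn_step(s, x, W, U, V)
print(y)
```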
RNN: Recurrent Neural Networks
RNN for Sequence Generation
Q: How to use RNN to generate sequences?
A: Let 𝑥𝑡+1 = 𝑦𝑡
Q: How to initialize 𝑠0, 𝑥1? When to stop generation?
A: Use start/end of sequence token (SOS, EOS)
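A sketch of that generation loop with greedy decoding; the token encoding and stopping rule are illustrative assumptions:

```python
import numpy as np

def generate(rnn_step, s0, sos, eos_index, max_len=20):
    """Greedy sequence generation: feed y_t back as x_{t+1}, stop at EOS.

    rnn_step(s, x) -> (s_next, y) is any step function, e.g. the sketch
    above with weights bound in: lambda s, x: rnn_step(s, x, W, U, V).
    """
    s, x, seq = s0, sos, []
    for _ in range(max_len):
        s, y = rnn_step(s, x)
        token = int(np.argmax(y))         # greedy choice of the next token
        if token == eos_index:            # end-of-sequence token: stop
            break
        seq.append(token)
        x = y                             # x_{t+1} = y_t
    return seq
```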
RNN: Recurrent Neural Networks
Training RNN
We observe a sequence 𝑦∗ of edges [1,0,…]
Principle: Teacher Forcing -- Replace input and output by the real sequence
RNN: Recurrent Neural Networks
Training RNN
Loss 𝐿: Binary cross entropy
Minimize:

$L = -\sum_i \left[y_i^* \log y_i + (1 - y_i^*) \log(1 - y_i)\right]$
If 𝑦𝑖∗ = 1, we minimize −log 𝑦𝑖, making 𝑦𝑖 higher to approach 1
If 𝑦𝑖∗ = 0, we minimize −log(1 − 𝑦𝑖), making 𝑦𝑖 lower to approach 0
Since $y_i$ is computed by the RNN, this loss will adjust the RNN parameters accordingly, using backpropagation!
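A minimal numeric illustration of this loss; the sequences below are toy values, not from the slides:

```python
import numpy as np

# Binary cross-entropy over a generated edge sequence; under teacher
# forcing, the observed sequence y_star also replaces the model's own
# outputs as inputs during training.
y_star = np.array([1.0, 0.0, 1.0, 1.0])          # observed sequence (toy)
y_pred = np.array([0.9, 0.2, 0.6, 0.7])          # RNN outputs (toy)

L = -np.sum(y_star * np.log(y_pred) + (1 - y_star) * np.log(1 - y_pred))
print(L)   # pushing y_pred toward y_star (via backprop) decreases L
```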
Interim Summary
Update rules compared:

Gradient Descent: $\theta_{next} = \theta_{now} - \eta_k \, G \, \nabla E(\theta_{now})$
Gauss-Newton: $\theta_{next} = \theta_{now} - \eta_k \left(J J^t\right)^{-1} \nabla E(\theta_{now})$
Levenberg: $\theta_{next} = \theta_{now} - \eta_k \left(J J^t + \lambda I\right)^{-1} \nabla E(\theta_{now})$
Levenberg–Marquardt: $\theta_{next} = \theta_{now} - \eta_k \left(J J^t + \lambda \, Diag\left[J J^t\right]\right)^{-1} \nabla E(\theta_{now})$
Newton(-Raphson): $\theta_{next} = \theta_{now} - \eta_k \, \mathbf{H}^{-1} \, \nabla E(\theta_{now})$
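A sketch of the Levenberg-Marquardt damped update on the same hypothetical exponential model used in the Gauss-Newton sketch above; the damping constant is an assumption:

```python
import numpy as np

# Damping lambda interpolates between gradient descent (large lambda)
# and Gauss-Newton (lambda -> 0); here eta_k = 1 and the slide's
# -(...)^{-1} grad E equals +(...)^{-1} J r since grad E = -J r.
x = np.linspace(0, 2, 20)
d = np.exp(-1.5 * x)                             # toy targets
theta, lam = np.array([0.3]), 1e-2

for _ in range(20):
    r = d - np.exp(-theta * x)                   # residuals
    J = (-x * np.exp(-theta * x))[None, :]       # J = dg/dtheta
    A = J @ J.T
    # theta_next = theta_now + (J J^T + lambda * Diag[J J^T])^{-1} J r
    theta = theta + np.linalg.solve(A + lam * np.diag(np.diag(A)), J @ r)
print(theta)                                     # -> approaches 1.5
```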
Interim Summary
Error Backpropagation Rule (Gradient Descent)
[Figure: multi-layer network — $h_i^{l-1}$, weights $w_{ji}^l$, $h_j^l$, weights $w_{kj}^{l+1}$, $h_k^{l+1}$]

$\Delta w_{ji}^l = \eta \, \delta_j^l \, h_i^{l-1}, \qquad \delta_j^l = f'(net_j^l) \sum_{k \in (l+1)\,layer} \delta_k^{l+1} \, w_{kj}^{l+1}$

Output layer: $\delta_k^L = -\frac{\partial E_d}{\partial net_k} = t_k - o_k$

Matrix Form (Forward): $h^0 = x, \qquad h^{l+1} = f\!\left(W^{l+1} h^l\right)$

Matrix Form (Backward, EBP): $\delta^l = Diag\!\left[f'(net^l)\right] \left(W^{l+1}\right)^T \delta^{l+1}, \qquad \Delta W^l = \eta \, \delta^l \left(h^{l-1}\right)^T + \rho \, \Delta W^l(old)$