
(1)

Deep Learning

( Multi-Layer Perceptron )

Sadegh Eskandari

Department of Computer Science, University of Guilan

[email protected]

(2)

Today …

o Biological Neurons
o Artificial Neurons
o Multi-Layer Perceptron
o Training a Single Neuron
o Training an MLP (Backpropagation)
o Activation Functions
o Regularization
o Weight Initialization

(3)

Biological Neurons

o Dendrites receive impulses (synapses) from other cells.
o The cell body (soma) fires (sends a signal) if the received impulses exceed a threshold.
o The signal travels through the axon and is delivered to other cells via the axon terminals.

[Figure: a biological neuron with its dendrites, cell body (soma), axon, and axon terminals.]

(4)

Biological Neurons

Multiple layers in a biological neural network of the human cortex

Source: Montesinos López, O.A., Montesinos López, A. and Crossa, J., 2022. Fundamentals of Artificial Neural Networks and Deep Learning. In Multivariate Statistical Machine Learning Methods for Genomic Prediction (pp. 379-425). Springer, Cham.

(5)

Biological Neurons

The postnatal development of the human cerebral cortex

Source: https://developingchild.harvard.edu/

(6)

Artificial Neurons

[Figure: an artificial neuron, with the biological synapses, dendrites, and soma mapped onto inputs, weights, and the transfer/activation functions.]

$$o = \varphi\left(\sum_{j=1}^{n} w_j x_j\right) = \varphi(\mathbf{w}^T \mathbf{x})$$

where $\mathbf{w} = (w_1, w_2, \dots, w_n)^T$ and $\mathbf{x} = (x_1, x_2, \dots, x_n)^T$.

[Figure: inputs $x_{i1}, x_{i2}, x_{i3}, \dots, x_{in}$ enter with weights $w_1, w_2, w_3, \dots, w_n$; the transfer function $\sum$ aggregates them and the activation function $\varphi$ produces the output $o$.]

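As a concrete illustration of the formula above, here is a minimal Python sketch of a single artificial neuron; the tanh activation and the toy numbers are illustrative choices, not values from the slides:

```python
import numpy as np

def neuron(x, w, phi=np.tanh):
    """Single artificial neuron: o = phi(w^T x).

    x, w : 1-D arrays of equal length n
    phi  : activation function (tanh is an illustrative choice)
    """
    return phi(np.dot(w, x))

# Example with n = 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron(x, w))  # tanh(0.5*0.1 - 1.0*0.4 - 2.0*0.2) = tanh(-0.75)
```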
(7)

Artificial Neurons

Regular activation functions ($\varphi$) [figure: plots of the standard activation functions; not preserved in extraction]

(8)

Multi-Layer Perceptron (MLP): Notations

[Figure: a 3-layer fully-connected artificial neural network. Inputs $x_{i1}, x_{i2}, x_{i3}, x_{i4}$ feed the layer-1 neurons $f_{11}, f_{12}, f_{13}$; these feed the layer-2 neurons $f_{21}, f_{22}$, which feed the layer-3 output neuron $f_{31}$ producing $\hat{y}_i$.]

(9)

Multi-Layer Perceptron (MLP): Notations

[Figure: the same 3-layer network, with example weights $w_{43}^{1}$, $w_{11}^{2}$, and $w_{21}^{3}$ highlighted.]

$w_{ij}^{l}$: the weight of the edge from neuron $i$ of layer $l-1$ to neuron $j$ of layer $l$.

(10)

Multi-Layer Perceptron (MLP): Notations

[Figure: the same 3-layer network with its weight matrices.]

$$W^{1} = \begin{pmatrix} w_{11}^{1} & w_{12}^{1} & w_{13}^{1} \\ w_{21}^{1} & w_{22}^{1} & w_{23}^{1} \\ w_{31}^{1} & w_{32}^{1} & w_{33}^{1} \\ w_{41}^{1} & w_{42}^{1} & w_{43}^{1} \end{pmatrix} \qquad W^{2} = \begin{pmatrix} w_{11}^{2} & w_{12}^{2} \\ w_{21}^{2} & w_{22}^{2} \\ w_{31}^{2} & w_{32}^{2} \end{pmatrix} \qquad W^{3} = \begin{pmatrix} w_{11}^{3} \\ w_{21}^{3} \end{pmatrix}$$

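To make the matrix notation concrete, the following sketch runs a forward pass through a network with exactly these shapes ($W^1 \in \mathbb{R}^{4\times 3}$, $W^2 \in \mathbb{R}^{3\times 2}$, $W^3 \in \mathbb{R}^{2\times 1}$); the sigmoid activation and the random weights are illustrative assumptions, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, phi=sigmoid):
    """Forward pass through a fully-connected MLP.

    weights[l] has shape (n_{l-1}, n_l), matching the slide's
    W^1 (4x3), W^2 (3x2), W^3 (2x1) layout.
    """
    o = x
    for W in weights:
        o = phi(W.T @ o)   # o^l = phi((W^l)^T o^{l-1})
    return o

rng = np.random.default_rng(0)
weights = [rng.normal(size=s) for s in [(4, 3), (3, 2), (2, 1)]]
x_i = rng.normal(size=4)
print(forward(x_i, weights))  # the prediction y_hat_i
```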
(11)

Training a Single Neuron

Step 1: Loss function

Training data: $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N)^T$ and $\mathbf{y} = (y_1, y_2, \dots, y_N)^T$, where $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{in})^T$, $N$ is the number of data points, and $n$ is the dimension of each data point.

$$l_i = (y_i - o_i)^2 = \left(y_i - \varphi(\mathbf{w}^T \mathbf{x}_i)\right)^2 \qquad L = \sum_{i=1}^{N} l_i$$

Step 2: The objective function

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} L$$

Step 3: Optimization (gradient descent)

Initialize $\mathbf{w}$
for $iter = 1$ to $K$:
    $\nabla_{\mathbf{w}} L = \left(\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots, \frac{\partial L}{\partial w_n}\right)$
    $\mathbf{w}^{new} = \mathbf{w}^{old} - \gamma \nabla_{\mathbf{w}} L$

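The three steps fit in a few lines of Python. For simplicity this sketch assumes $\varphi$ is the identity, so the gradient has the closed form $\nabla_{\mathbf{w}} L = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$; the learning rate $\gamma$ and iteration count $K$ are assumed hyperparameters, not values from the slides:

```python
import numpy as np

def train_neuron(X, y, gamma=0.001, K=500):
    """Gradient descent for a single neuron with phi = identity."""
    N, n = X.shape
    w = np.zeros(n)                      # initialize w
    for _ in range(K):                   # for iter = 1 to K
        o = X @ w                        # o_i = phi(w^T x_i)
        grad = -2.0 * X.T @ (y - o)      # dL/dw = -2 sum_i (y_i - o_i) x_i
        w = w - gamma * grad             # w_new = w_old - gamma * grad
    return w

# Toy usage: recover w* = (2, -1) from noiseless data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0])
print(train_neuron(X, y))
```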
(12)

Training an MLP (Backpropagation)

o As we know, $\nabla_W L$ plays a central role in finding the optimal parameters of the model.
o But calculating these gradients in multi-layer neural networks is not a trivial task!
o Solution: Backpropagation (Rumelhart et al., 1986a)

(13)

Computational Graphs: A simple example

$$f(x, y, z) = (x + y)\,z$$

[Graph: $x$ and $y$ feed a "+" node producing $q$; $q$ and $z$ feed a "*" node producing $f$.]

(14)

Computational Graphs: A simple example

$f(x, y, z) = (x + y)\,z$, e.g. $x = -2$, $y = 5$, $z = -4$.

Forward pass: $q = x + y = 3$ and $f = qz = -12$.

Local gradients: $\frac{\partial q}{\partial x} = 1$, $\frac{\partial q}{\partial y} = 1$, $\frac{\partial f}{\partial q} = z$, $\frac{\partial f}{\partial z} = q$.

Want: $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, $\frac{\partial f}{\partial z}$.

(15)

Backward pass, step 1: seed the output node with $\frac{\partial f}{\partial f} = 1$.

(16)

Backward pass, step 2: $\frac{\partial f}{\partial z} = q = 3$.

(17)

Backward pass, step 3: $\frac{\partial f}{\partial q} = z = -4$.

(18)

Backward pass, step 4: what is $\frac{\partial f}{\partial y}$?

Chain Rule: $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \times \frac{\partial q}{\partial y}$ (upstream gradient $\times$ local gradient).

(19)

Backward pass, step 5: $\frac{\partial f}{\partial y} = (-4) \times 1 = -4$.

(20)

Backward pass, step 6: likewise, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \times \frac{\partial q}{\partial x} = (-4) \times 1 = -4$.

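The whole walkthrough fits in a few lines of code. This sketch simply replays the forward pass and then applies "downstream = upstream × local" in reverse order:

```python
# The slides' example: f(x, y, z) = (x + y) * z at x=-2, y=5, z=-4.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (reverse topological order)
df_df = 1.0                 # seed: df/df = 1
df_dz = df_df * q           # local grad of f = q*z w.r.t. z is q  ->  3
df_dq = df_df * z           # local grad w.r.t. q is z             -> -4
df_dx = df_dq * 1.0         # upstream (-4) * local dq/dx = 1      -> -4
df_dy = df_dq * 1.0         # upstream (-4) * local dq/dy = 1      -> -4

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```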
(21)-(25)

Training an MLP (Backpropagation)

[Worked-example slides; the figures were not preserved in extraction.]

(26)

Training an MLP (Backpropagation)

Step 1: Loss function

Training data: $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N)^T$ and $\mathbf{y} = (y_1, y_2, \dots, y_N)^T$, where $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{in})^T$, $N$ is the number of data points, and $n$ is the dimension of each data point.

$$l_i = (y_i - o_i)^2 = \left(y_i - \varphi(\mathbf{w}^T \mathbf{x}_i)\right)^2 \qquad L = \sum_{i=1}^{N} l_i$$

Step 2: The objective function

$$\arg\min_{w_1, w_2, w_3} L$$

Step 3: Optimization

Initialize the $w_{ij}^{k}$
for $iter = 1$ to $K$:
    for each $i, j, k$:
        $w_{ij}^{k,new} = w_{ij}^{k,old} - \gamma \frac{\partial L}{\partial w_{ij}^{k}}$

[Figure: a single output neuron $o_i$ with weights $w_1, w_2, w_3$.]
(27)

Training an MLP (Backpropagation)

[Figure: the output neuron $o_{31}$ with its incoming weights $w_{11}^{3}$ and $w_{12}^{3}$ highlighted.]

$$L = \sum_{i=1}^{N} l_i$$

$$\frac{\partial L}{\partial w_{11}^{3}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial w_{11}^{3}} \qquad \frac{\partial L}{\partial w_{12}^{3}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial w_{12}^{3}}$$
(28)

Training an MLP (Backpropagation)

[Figure: neurons $o_{21}, o_{22}, o_{31}$ with the layer-2 weights $w_{11}^{2}$, $w_{12}^{2}$, and $w_{32}^{2}$ highlighted.]

$$L = \sum_{i=1}^{N} l_i$$

$$\frac{\partial L}{\partial w_{11}^{3}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial w_{11}^{3}} \qquad \frac{\partial L}{\partial w_{12}^{3}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial w_{12}^{3}}$$

$$\frac{\partial L}{\partial w_{11}^{2}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{21}} \times \frac{\partial o_{21}}{\partial w_{11}^{2}} \qquad \frac{\partial L}{\partial w_{12}^{2}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{22}} \times \frac{\partial o_{22}}{\partial w_{12}^{2}} \qquad \frac{\partial L}{\partial w_{32}^{2}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{22}} \times \frac{\partial o_{22}}{\partial w_{32}^{2}}$$

(Each path follows the notation above: $w_{12}^{2}$ feeds neuron 2 of layer 2, so its gradient flows through $o_{22}$.)

(29)

Training an MLP (Backpropagation)

[Figure: the same network, now also showing the layer-1 outputs $o_{11}, o_{12}, o_{13}$ and the weight $w_{11}^{1}$.]

The layer-3 and layer-2 gradients are as on the previous slide. What about the first layer?

$$\frac{\partial L}{\partial w_{11}^{1}} = \,?$$
(30)

Training an MLP (Backpropagation)

[Graph: $x$ feeds both $f$ and $g$, which feed $h$, so $h = h\big(f(x), g(x)\big)$.]

Multivariate chain rule:

$$\frac{\partial h}{\partial x} = \frac{\partial h}{\partial f} \times \frac{\partial f}{\partial x} + \frac{\partial h}{\partial g} \times \frac{\partial g}{\partial x}$$

(31)

Training an MLP (Backpropagation)

[Figure: the same network; $w_{11}^{1}$ reaches the loss along two paths, through $o_{21}$ and through $o_{22}$.]

Since $o_{11}$ feeds both $o_{21}$ and $o_{22}$, the multivariate chain rule gives

$$\frac{\partial L}{\partial w_{11}^{1}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{21}} \times \frac{\partial o_{21}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}} + \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{22}} \times \frac{\partial o_{22}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}}$$

(32)

Training an MLP (Backpropagation)

[Figure: the same network and the same two-path expansion of $\partial L / \partial w_{11}^{1}$ as on the previous slide.]

Memoization: avoid redundant computations. The quantities

$$\frac{\partial L}{\partial o_{31}}, \qquad \frac{\partial L}{\partial o_{21}}, \qquad \frac{\partial L}{\partial o_{22}}$$

appear in the gradients of many different weights, so each is computed once and then reused as the gradient is propagated backward layer by layer.

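A sketch of this memoized backward pass for a 4-3-2-1 network of the slides' shape is below; the sigmoid activation, squared loss, and single training pair are illustrative assumptions. Each layer's $\partial L / \partial o^{l}$ is computed once and reused for every weight feeding that layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights):
    # Forward pass, caching each layer's output o^l
    outs = [x]
    for W in weights:
        outs.append(sigmoid(W.T @ outs[-1]))
    y_hat = outs[-1]

    # Backward pass: delta = dL/do^l is computed once per layer
    # (the memoization from the slide) and reused for all w_ij^l.
    grads = []
    delta = -2.0 * (y - y_hat)                    # dL/do of output layer
    for l in range(len(weights) - 1, -1, -1):
        local = outs[l + 1] * (1 - outs[l + 1])   # sigmoid'(net) = o(1-o)
        dnet = delta * local                      # dL/dnet^l
        grads.append(np.outer(outs[l], dnet))     # dL/dW^l
        delta = weights[l] @ dnet                 # memoized dL/do^{l-1}
    return list(reversed(grads))

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=s) for s in [(4, 3), (3, 2), (2, 1)]]
g = backprop(rng.normal(size=4), np.array([1.0]), weights)
print([gi.shape for gi in g])   # [(4, 3), (3, 2), (2, 1)]
```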
(33)

Activation Functions Revisited

o Sigmoid and Tanh are the two most used traditional activation functions in neural networks.
o $\frac{df}{dx} \in (0, 1)$ for both functions.
o Consider computing $\frac{\partial L}{\partial w_{11}^{1}}$ and $\frac{\partial L}{\partial w_{11}^{3}}$ with example derivative values:

$$\frac{\partial L}{\partial w_{11}^{1}} = \underbrace{\frac{\partial L}{\partial o_{31}}}_{0.4} \times \underbrace{\frac{\partial o_{31}}{\partial o_{21}}}_{0.2} \times \underbrace{\frac{\partial o_{21}}{\partial o_{11}}}_{0.08} \times \underbrace{\frac{\partial o_{11}}{\partial w_{11}^{1}}}_{0.9} + \underbrace{\frac{\partial L}{\partial o_{31}}}_{0.4} \times \underbrace{\frac{\partial o_{31}}{\partial o_{22}}}_{0.6} \times \underbrace{\frac{\partial o_{22}}{\partial o_{11}}}_{0.3} \times \underbrace{\frac{\partial o_{11}}{\partial w_{11}^{1}}}_{0.9} \approx 0.07$$

$$\frac{\partial L}{\partial w_{11}^{3}} = \underbrace{\frac{\partial L}{\partial o_{31}}}_{0.4} \times \underbrace{\frac{\partial o_{31}}{\partial w_{11}^{3}}}_{0.5} = 0.2$$

$$\frac{\partial L}{\partial w_{11}^{3}} \gg \frac{\partial L}{\partial w_{11}^{1}}$$

o As we move backward toward the input, the partial derivatives become smaller, because each layer multiplies in another factor less than 1.
o This is called Vanishing Gradients.

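A quick numeric illustration (assumed values, not the slides' network): with sigmoid units, each backward step multiplies the upstream gradient by the local derivative, which is at most 0.25, so after ten layers almost nothing is left:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)       # maximum value 0.25, attained at z = 0

depth = 10
upstream = 1.0
for layer in range(depth):
    upstream *= sigmoid_grad(0.0)   # best case: every local grad is 0.25
print(upstream)                     # 0.25**10 ~ 9.5e-07
```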
(34)

Activation Functions Revisited

o What's the problem with vanishing gradients?
o Consider the update step of the gradient-descent method:

$$w_{ij}^{k,new} = w_{ij}^{k,old} - \gamma \frac{\partial L}{\partial w_{ij}^{k}}$$

o If the gradient has vanished ($\frac{\partial L}{\partial w_{ij}^{k}} \approx 0$), then $w_{ij}^{k,new} \approx w_{ij}^{k,old}$ and the weight stops learning.
o This happens in the early layers of a large neural network.

(35)

Activation Functions Revisited

o ReLU: Rectified Linear Units

$$f(z) = \max(0, z)$$

(36)

Activation Functions Revisited

o ReLU: Rectified Linear Units, $f(x) = \max(0, x)$
o $\frac{df}{dx} \in \{0, 1\}$, so the factors in the chain no longer shrink the gradient.
o Consider computing $\frac{\partial L}{\partial w_{11}^{1}}$; every local derivative of a ReLU unit is either 0 or 1:

$$\frac{\partial L}{\partial w_{11}^{1}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{21}} \times \frac{\partial o_{21}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}} + \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{22}} \times \frac{\partial o_{22}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}}$$

o However, ReLU neurons cannot learn on examples for which their activation is zero (Dead Activations).

(37)

Activation Functions Revisited

ReLU vs Tanh [figure not preserved in extraction]

Source: Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), pp. 84-90.

(38)

Activation Functions Revisited

Leaky ReLU [figure not preserved in extraction]

Paper: Maas, A.L., Hannun, A.Y. and Ng, A.Y., 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML (Vol. 30, No. 1, p. 3).

Source: https://paperswithcode.com/method/leaky-relu

(39)

Activation Functions Revisited

PReLU (Parametric ReLU) [figure not preserved in extraction]

Paper: He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).

Source: https://paperswithcode.com/method/prelu

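The three rectifier variants discussed on the last few slides differ only in how they treat negative inputs. A minimal sketch (the $\alpha$ values are illustrative assumptions):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):   # Maas et al., 2013: small fixed slope
    return np.where(z > 0, z, alpha * z)

def prelu(z, alpha):             # He et al., 2015: the slope is learned
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), prelu(z, alpha=0.25))
```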
(40)

Regularization

o The aim of regularization is to avoid overfitting.
o L1-Regularization:

$$L = \sum_{i=1}^{N} l_i + \lambda \sum_{i,j,k} \left| w_{ij}^{k} \right|$$

o L2-Regularization:

$$L = \sum_{i=1}^{N} l_i + \lambda \sum_{i,j,k} \left( w_{ij}^{k} \right)^2$$

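A sketch of how the two penalties attach to the data loss; `lam` ($\lambda$) is an assumed hyperparameter and `weights` an assumed list of weight matrices:

```python
import numpy as np

def l1_penalty(weights, lam):
    # lambda * sum of |w_ij^k| over all weights
    return lam * sum(np.abs(W).sum() for W in weights)

def l2_penalty(weights, lam):
    # lambda * sum of (w_ij^k)^2 over all weights
    return lam * sum((W ** 2).sum() for W in weights)

def regularized_loss(data_loss, weights, lam=1e-3, kind="l2"):
    pen = l1_penalty if kind == "l1" else l2_penalty
    return data_loss + pen(weights, lam)
```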
(41)

Regularization

o Dropout: a very simple and elegant approach to apply regularization to neural networks

Source: Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp. 1929-1958.

(42)

Regularization

[Further dropout illustration; same source as the previous slide.]

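A sketch of dropout at training time, using the commonly implemented "inverted" variant: activations are rescaled by the keep probability $p$ during training so nothing needs to change at test time (the original paper instead scales the weights by $p$ at test time):

```python
import numpy as np

def dropout(o, p=0.5, training=True, rng=np.random.default_rng()):
    """Apply (inverted) dropout to a layer's activations o.

    p : keep probability; each unit survives with probability p.
    """
    if not training:
        return o                            # identity at test time
    mask = (rng.random(o.shape) < p) / p    # Bernoulli(p) mask, rescaled
    return o * mask
```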
(43)

Weight Initialization

o Our understanding of how the initial point affects generalization is especially primitive, offering little to no guidance for how to select the initial point.

o Designing initialization strategies is a difficult task because neural network optimization is not yet well understood.

o Perhaps the only property known with complete certainty is that the initial parameters need to “break symmetry” between different units.

• If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters.

o General Rules:

• Larger initial weights yield a stronger symmetry-breaking effect; still, the initial weights should be small (but not too small).
• The initial weights should not be zero, to avoid the dead-activation phenomenon.
• The initial weights should have enough variance that different units start from different "viewpoints".

(44)

Weight Initialization

o Uniform initialization (traditionally for Sigmoid and Tanh units):

$$w_{ij}^{k} \sim U\!\left( -\frac{1}{\sqrt{fan_{in}}},\; \frac{1}{\sqrt{fan_{in}}} \right)$$

[Figure: a unit $f$ with its fan-in (incoming) and fan-out (outgoing) connections.]

o Xavier / Glorot initialization [1] (for Sigmoid):

$$w_{ij}^{k} \sim U\!\left( -\sqrt{\frac{6}{fan_{in} + fan_{out}}},\; \sqrt{\frac{6}{fan_{in} + fan_{out}}} \right) \qquad \text{or} \qquad w_{ij}^{k} \sim N\!\left( 0,\; \frac{2}{fan_{in} + fan_{out}} \right)$$

[1]: Glorot, X. and Bengio, Y., 2010, March. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249-256). JMLR Workshop and Conference Proceedings.

(45)

Weight Initialization

[Figure: a unit $f$ with its fan-in (incoming) and fan-out (outgoing) connections.]

o He initialization [1] (for ReLU):

$$w_{ij}^{k} \sim U\!\left( -\sqrt{\frac{6}{fan_{in}}},\; \sqrt{\frac{6}{fan_{in}}} \right) \qquad \text{or} \qquad w_{ij}^{k} \sim N\!\left( 0,\; \frac{2}{fan_{in}} \right)$$

[1]: He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).

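The three schemes from these two slides as a sketch; the matrix shapes and the RNG seed are illustrative assumptions:

```python
import numpy as np

def uniform_init(fan_in, fan_out, rng):
    a = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-a, a, size=(fan_in, fan_out))

def xavier_uniform(fan_in, fan_out, rng):   # Glorot & Bengio, 2010
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):        # He et al., 2015 (for ReLU)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W1 = he_normal(4, 3, rng)
print(W1.std())   # roughly sqrt(2/4) ~ 0.71
```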
(46)

Weight Initialization

[Figure from the source below; not preserved in extraction.]

Source: He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).

(47)

Weight Initialization

[Figure from the source below; not preserved in extraction.]

Source: He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).
