Deep Learning
(Multi-Layer Perceptron)
Sadegh Eskandari
Department of Computer Science, University of Guilan
Today …
o Biological Neurons
o Artificial Neurons
o Multi-Layer Perceptron
o Training a Single Neuron
o Training an MLP (Backpropagation)
o Activation Functions
o Regularization
o Weight Initialization
Biological Neurons
o Dendrites receive impulses (synapses) from other cells.
o The cell body (soma) fires (sends a signal) if the received impulses exceed a threshold.
o The signal travels along the axon and is delivered to other cells via the axon terminals.
[Figure: a biological neuron with its dendrites, cell body (soma), axon, and axon terminals]
Biological Neurons
Multiple layers in a biological neural network of the human cortex
Source: Montesinos López, O.A., Montesinos López, A. and Crossa, J., 2022. Fundamentals of Artificial Neural Networks and Deep Learning. In Multivariate Statistical Machine Learning Methods for Genomic Prediction (pp. 379-425). Springer, Cham.
Biological Neurons
The postnatal development of the human cerebral cortex
Source: https://developingchild.harvard.edu/
Artificial Neurons
[Figure: an artificial neuron; inputs $x_{i1}, x_{i2}, \dots, x_{in}$ (dendrites) are weighted by $w_1, w_2, \dots, w_n$ (synapses), summed by the transfer function $\sum$, and passed through the activation function $\varphi$ (soma) to produce the output $o$]

$$o = \varphi\left(\sum_{j=1}^{n} w_j x_j\right) = \varphi\!\left(\mathbf{w}^T \mathbf{x}\right)$$

where $\mathbf{w} = (w_1, w_2, \dots, w_n)^T$ and $\mathbf{x} = (x_1, x_2, \dots, x_n)^T$.
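As a minimal illustration (not from the slides), the neuron above can be computed in a few lines of NumPy; the weight and input values here are arbitrary, and the sigmoid is just one possible choice of $\varphi$.

```python
import numpy as np

def sigmoid(z):
    # one common choice for the activation function φ
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x, phi=sigmoid):
    # o = φ(wᵀx): weighted sum (transfer function) followed by the activation
    return phi(np.dot(w, x))

w = np.array([0.5, -0.3, 0.8])   # arbitrary example weights
x = np.array([1.0, 2.0, -1.0])   # arbitrary example input
print(neuron_output(w, x))
```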
Artificial Neurons
Common activation functions ($\varphi$)
[Figure: plots of common activation functions]
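As an aside (not part of the slides), the classic choices of $\varphi$ and their derivatives can be written directly; the derivatives are used later for backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                     # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)             # max(0, z)

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # in (0, 0.25]

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2          # in (0, 1]

def d_relu(z):
    return (z > 0).astype(float)          # 0 or 1
```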
Multi-Layer Perceptron (MLP): Notations
[Figure: a 3-layer fully-connected network; inputs $x_{i1}, x_{i2}, x_{i3}, x_{i4}$ feed neurons $f_{11}, f_{12}, f_{13}$ in Layer 1, which feed $f_{21}, f_{22}$ in Layer 2, which feed $f_{31}$ in Layer 3, producing the prediction $\hat{y}_i$]
3-Layer Fully-Connected Artificial Neural Network
Multi-Layer Perceptron (MLP): Notations
[Figure: the same 3-layer network, with example edge weights $w_{43}^{1}$, $w_{11}^{2}$, $w_{21}^{3}$ highlighted]
3-Layer Fully-Connected Artificial Neural Network

$w_{ij}^{l}$: the weight of the edge from neuron $i$ of layer $l-1$ to neuron $j$ of layer $l$.
Multi-Layer Perceptron (MLP): Notations
[Figure: the same 3-layer network, with the weights of each layer collected into a matrix]
3-Layer Fully-Connected Artificial Neural Network

$$W^{1} = \begin{bmatrix} w_{11}^{1} & w_{12}^{1} & w_{13}^{1} \\ w_{21}^{1} & w_{22}^{1} & w_{23}^{1} \\ w_{31}^{1} & w_{32}^{1} & w_{33}^{1} \\ w_{41}^{1} & w_{42}^{1} & w_{43}^{1} \end{bmatrix}, \qquad W^{2} = \begin{bmatrix} w_{11}^{2} & w_{12}^{2} \\ w_{21}^{2} & w_{22}^{2} \\ w_{31}^{2} & w_{32}^{2} \end{bmatrix}, \qquad W^{3} = \begin{bmatrix} w_{11}^{3} \\ w_{21}^{3} \end{bmatrix}$$
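A minimal sketch (mine, not from the slides) of a forward pass through this 4-3-2-1 network using weight matrices with the shapes above; random weights and a sigmoid activation are assumed only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # layer 1: 4 inputs  -> 3 neurons
W2 = rng.normal(size=(3, 2))   # layer 2: 3 neurons -> 2 neurons
W3 = rng.normal(size=(2, 1))   # layer 3: 2 neurons -> 1 output

def forward(x):
    o1 = sigmoid(x @ W1)       # outputs of layer 1
    o2 = sigmoid(o1 @ W2)      # outputs of layer 2
    y_hat = sigmoid(o2 @ W3)   # network prediction ŷ
    return y_hat

x_i = np.array([0.2, -1.0, 0.5, 1.5])   # arbitrary example input
print(forward(x_i))
```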
Training a Single Neuron
Step 1: The loss function

Training data: $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N)^T$, $\mathbf{y} = (y_1, y_2, \dots, y_N)^T$, where $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{in})^T$, $N$ is the number of data points, and $n$ is the dimension of each data point.

$$l_i = (y_i - o_i)^2 = \left(y_i - \varphi(\mathbf{w}^T \mathbf{x}_i)\right)^2, \qquad L = \sum_{i=1}^{N} l_i$$

Step 2: The objective function

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} L$$

Step 3: Optimization (gradient descent)

Initialize $\mathbf{w}$
for $iter = 1$ to $K$:
    $\nabla_{\mathbf{w}} L = \left(\dfrac{\partial L}{\partial w_1}, \dfrac{\partial L}{\partial w_2}, \dots, \dfrac{\partial L}{\partial w_n}\right)$
    $\mathbf{w}^{new} = \mathbf{w}^{old} - \gamma \nabla_{\mathbf{w}} L$
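A hedged sketch of the three steps for a single sigmoid neuron with squared loss; the data is synthetic and the learning rate and iteration count are arbitrary choices, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                             # N=100 points, n=3 features (synthetic)
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # synthetic targets

w = rng.normal(scale=0.1, size=3)          # initialize w
gamma, K = 0.1, 500                        # learning rate γ and number of iterations K

for it in range(K):
    o = sigmoid(X @ w)                     # neuron outputs o_i
    # dL/dw = Σ_i -2 (y_i - o_i) * o_i (1 - o_i) * x_i   (squared loss + sigmoid)
    grad = -2.0 * ((y - o) * o * (1 - o)) @ X
    w = w - gamma * grad                   # gradient-descent update

print("final loss:", np.sum((y - sigmoid(X @ w)) ** 2))
```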
Training an MLP (Backpropagation)
o As we know, $\nabla_{W} L$ plays a central role in finding the optimal parameters of the model.
o But computing these gradients in a multi-layer neural network is not a trivial task!
o Solution: Backpropagation (Rumelhart et al., 1986a)
Computational Graphs: A simple example

$$f(x, y, z) = (x + y)\, z$$

[Figure: the computational graph; $x$ and $y$ feed a "+" node producing $q$, and $q$ and $z$ feed a "*" node producing $f$]

e.g., $x = -2$, $y = 5$, $z = -4$, so the forward pass gives $q = x + y = 3$ and $f = qz = -12$.

Local derivatives:

$$q = x + y: \quad \frac{\partial q}{\partial x} = 1, \ \frac{\partial q}{\partial y} = 1 \qquad\qquad f = qz: \quad \frac{\partial f}{\partial q} = z, \ \frac{\partial f}{\partial z} = q$$

Want: $\dfrac{\partial f}{\partial x}, \dfrac{\partial f}{\partial y}, \dfrac{\partial f}{\partial z}$

Walking backward through the graph:

o $\dfrac{\partial f}{\partial f} = 1$

o $\dfrac{\partial f}{\partial z} = q = 3$

o $\dfrac{\partial f}{\partial q} = z = -4$

o Chain Rule: $\dfrac{\partial f}{\partial y} = \underbrace{\dfrac{\partial f}{\partial q}}_{\text{upstream gradient}} \times \underbrace{\dfrac{\partial q}{\partial y}}_{\text{local gradient}} = -4 \times 1 = -4$

o Likewise, $\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial q} \times \dfrac{\partial q}{\partial x} = -4 \times 1 = -4$
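A tiny sketch (mine, not the lecturer's code) of the same example as code: the forward pass first, then gradients accumulated backward node by node as upstream × local.

```python
# f(x, y, z) = (x + y) * z, evaluated at x = -2, y = 5, z = -4
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# backward pass: multiply upstream gradients by local gradients
df_df = 1.0
df_dz = df_df * q          # local: ∂f/∂z = q  ->  3
df_dq = df_df * z          # local: ∂f/∂q = z  -> -4
df_dx = df_dq * 1.0        # local: ∂q/∂x = 1  -> -4
df_dy = df_dq * 1.0        # local: ∂q/∂y = 1  -> -4

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```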
Training an MLP (Backpropagation)
Step 1: The loss function

Training data: $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N)^T$, $\mathbf{y} = (y_1, y_2, \dots, y_N)^T$, where $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{in})^T$, $N$ is the number of data points, and $n$ is the dimension of each data point.

$$l_i = (y_i - \hat{y}_i)^2, \qquad L = \sum_{i=1}^{N} l_i$$

where $\hat{y}_i$ is the network's output for $\mathbf{x}_i$.

Step 2: The objective function

$$\arg\min_{W^{1}, W^{2}, W^{3}} L$$

Step 3: Optimization (gradient descent)

Initialize all $w_{ij}^{k}$
for $iter = 1$ to $K$:
    for each $i, j, k$:
        $w_{ij}^{k,\,new} = w_{ij}^{k,\,old} - \gamma \dfrac{\partial L}{\partial w_{ij}^{k}}$

[Figure: the 3-layer network producing output $o_i$, with weight matrices $W^{1}, W^{2}, W^{3}$]
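A short sketch of Step 3 only (my own, with an arbitrary learning rate), assuming the gradients $dW^{1}, dW^{2}, dW^{3}$ have already been obtained by backpropagation as shown on the following slides.

```python
gamma = 0.01                      # learning rate γ (arbitrary choice)

def gd_step(weights, grads, gamma):
    # weights, grads: lists of matrices [W1, W2, W3] and [dW1, dW2, dW3]
    return [W - gamma * dW for W, dW in zip(weights, grads)]

# usage (hypothetical names): W1, W2, W3 = gd_step([W1, W2, W3], [dW1, dW2, dW3], gamma)
```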
Training an MLP (Backpropagation)
$$L = \sum_{i=1}^{N} l_i$$

[Figure: the output neuron $o_{31}$ with its incoming weights $w_{11}^{3}$ (from $o_{21}$) and $w_{21}^{3}$ (from $o_{22}$), and the layer-2 weights $w_{11}^{2}$, $w_{21}^{2}$, $w_{32}^{2}$]

For the weights of the last layer, the chain rule gives:

$$\frac{\partial L}{\partial w_{11}^{3}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial w_{11}^{3}}, \qquad \frac{\partial L}{\partial w_{21}^{3}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial w_{21}^{3}}$$

For the weights of layer 2, the chain extends one step further back:

$$\frac{\partial L}{\partial w_{11}^{2}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{21}} \times \frac{\partial o_{21}}{\partial w_{11}^{2}}, \qquad \frac{\partial L}{\partial w_{21}^{2}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{21}} \times \frac{\partial o_{21}}{\partial w_{21}^{2}}, \qquad \frac{\partial L}{\partial w_{32}^{2}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{22}} \times \frac{\partial o_{22}}{\partial w_{32}^{2}}, \quad \dots$$
Training an MLP (Backpropagation)
[Figure: the first-layer weight $w_{11}^{1}$ feeds neuron $o_{11}$, which feeds both $o_{21}$ and $o_{22}$]

What about a weight in the first layer?

$$\frac{\partial L}{\partial w_{11}^{1}} = \;?$$

Neuron $o_{11}$ influences the loss through two paths ($o_{21}$ and $o_{22}$), so we need the multivariate chain rule.

Training an MLP (Backpropagation)

Multivariate chain rule: if $h = h\big(f(x), g(x)\big)$, then

$$\frac{\partial h}{\partial x} = \frac{\partial h}{\partial f} \times \frac{\partial f}{\partial x} + \frac{\partial h}{\partial g} \times \frac{\partial g}{\partial x}$$

[Figure: $x$ feeds both $f$ and $g$, which both feed $h$]
Training an MLP (Backpropagation)
[Figure: the two backward paths from the loss to $w_{11}^{1}$, through $o_{21}$ and through $o_{22}$]

Applying the multivariate chain rule to the first-layer weight $w_{11}^{1}$:

$$\frac{\partial L}{\partial w_{11}^{1}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{21}} \times \frac{\partial o_{21}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}} + \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{22}} \times \frac{\partial o_{22}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}}$$

and similarly for the other first-layer weights.

Memoization: avoid redundant computations. The upstream gradients $\dfrac{\partial L}{\partial o_{31}}, \dfrac{\partial L}{\partial o_{21}}, \dfrac{\partial L}{\partial o_{22}}, \dots$ appear in the expressions for many different weights, so they are computed once, stored, and reused as we sweep backward through the layers. This is exactly what backpropagation does.
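A compact sketch (mine, not the lecturer's code) of backpropagation with memoized upstream gradients for the 4-3-2-1 network, assuming sigmoid activations everywhere and the squared loss defined earlier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, W2, W3):
    # forward pass, keeping every layer output (needed for the backward pass)
    o1 = sigmoid(x @ W1)
    o2 = sigmoid(o1 @ W2)
    o3 = sigmoid(o2 @ W3)                      # network output ŷ

    # backward pass: each ∂L/∂o is computed once and reused (memoization)
    dL_do3 = -2.0 * (y - o3)                   # from l = (y - ŷ)²
    d3 = dL_do3 * o3 * (1 - o3)                # through the sigmoid of layer 3
    dW3 = np.outer(o2, d3)

    dL_do2 = d3 @ W3.T                         # upstream gradients for layer 2, stored once
    d2 = dL_do2 * o2 * (1 - o2)
    dW2 = np.outer(o1, d2)

    dL_do1 = d2 @ W2.T                         # upstream gradients for layer 1, stored once
    d1 = dL_do1 * o1 * (1 - o1)
    dW1 = np.outer(x, d1)

    return dW1, dW2, dW3
```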
Activation Functions Revisited
o Sigmoid and tanh are the two most widely used traditional activation functions in neural networks.
o $\dfrac{df}{dx} \in (0, 1)$ for both functions.
o Consider computing $\dfrac{\partial L}{\partial w_{11}^{1}}$ and $\dfrac{\partial L}{\partial w_{11}^{3}}$:

$$\frac{\partial L}{\partial w_{11}^{1}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{21}} \times \frac{\partial o_{21}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}} + \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{22}} \times \frac{\partial o_{22}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}}$$

For example, with factor values $0.4 \times 0.2 \times 0.08 \times 0.9 + 0.4 \times 0.6 \times 0.3 \times 0.9 \approx 0.07$, while

$$\frac{\partial L}{\partial w_{11}^{3}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial w_{11}^{3}} = 0.4 \times 0.5 = 0.2$$

so $\dfrac{\partial L}{\partial w_{11}^{3}} \gg \dfrac{\partial L}{\partial w_{11}^{1}}$.

o As we move backward toward the input, the values of the partial derivatives become smaller and smaller.
o This is called the vanishing gradient problem.
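A toy illustration of this effect (my own example, not from the slides): the backpropagated gradient through $k$ sigmoid layers contains a product of $k$ derivative factors, each at most 0.25, so it shrinks roughly geometrically with depth.

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # at most 0.25

z = 0.5                            # arbitrary pre-activation value
for depth in (1, 3, 5, 10):
    # product of `depth` derivative factors, as in a chain-rule path of that length
    print(depth, np.prod([d_sigmoid(z)] * depth))
```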
Activation Functions Revisited
o What is the problem with vanishing gradients?
o Consider the update step of the gradient-descent method:

$$w_{ij}^{k,\,new} = w_{ij}^{k,\,old} - \gamma \frac{\partial L}{\partial w_{ij}^{k}}$$

o If the gradient has vanished ($\dfrac{\partial L}{\partial w_{ij}^{k}} \approx 0$), then $w_{ij}^{k,\,new} \approx w_{ij}^{k,\,old}$, i.e., the weight stops learning.
o This happens in the early layers of a large neural network.
Activation Functions Revisited
o ReLU: Rectified Linear Units
$$f(z) = \max(0, z)$$
Activation Functions Revisited
o ReLU: Rectified Linear Units

$$f(x) = \max(0, x)$$

o $\dfrac{df}{dx} \in \{0, 1\}$: each derivative factor is either exactly 0 or exactly 1.
o Consider computing $\dfrac{\partial L}{\partial w_{11}^{1}}$:

$$\frac{\partial L}{\partial w_{11}^{1}} = \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{21}} \times \frac{\partial o_{21}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}} + \frac{\partial L}{\partial o_{31}} \times \frac{\partial o_{31}}{\partial o_{22}} \times \frac{\partial o_{22}}{\partial o_{11}} \times \frac{\partial o_{11}}{\partial w_{11}^{1}}$$

In the example, every factor is either 0 or 1 (1, 1, 0, 1, 0, 1, 1, 1, 0), so a path either passes the gradient unchanged or blocks it completely; the gradient does not gradually vanish.

o However, ReLU neurons cannot learn on examples for which their activation is zero (dead activations).
Activation Functions Revisited
ReLU vs Tanh
Source: Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), pp. 84-90.
Activation Functions Revisited
o Leaky ReLU
Paper: Maas, A.L., Hannun, A.Y. and Ng, A.Y., 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML (Vol. 30, No. 1, p. 3).
Source: https://paperswithcode.com/method/leaky-relu
Activation Functions Revisited
o PReLU (Parametric ReLU)
Paper: He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).
Source: https://paperswithcode.com/method/prelu
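For reference, a small sketch (not from the slides) of these two rectifier variants; the slope 0.01 for leaky ReLU is the conventional default, and in PReLU the slope $\alpha$ is a learnable parameter.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # passes positive inputs unchanged, scales negative inputs by a small fixed slope
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # same form as leaky ReLU, but alpha is learned during training
    return np.where(x > 0, x, alpha * x)
```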
Regularization
o The aim of regularization is to avoid overfitting.
o L1 regularization:

$$L = \sum_{i=1}^{N} l_i + \lambda \sum_{i,j,k} \left| w_{ij}^{k} \right|$$

o L2 regularization:

$$L = \sum_{i=1}^{N} l_i + \lambda \sum_{i,j,k} \left( w_{ij}^{k} \right)^2$$
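A hedged sketch of adding either penalty to the data loss, given the list of weight matrices; the value of $\lambda$ is an arbitrary choice.

```python
import numpy as np

def regularized_loss(data_loss, weights, lam=1e-4, kind="l2"):
    # data_loss: Σ_i l_i already computed; weights: list of weight matrices [W1, W2, W3]
    if kind == "l1":
        penalty = sum(np.sum(np.abs(W)) for W in weights)   # λ Σ |w|
    else:
        penalty = sum(np.sum(W ** 2) for W in weights)      # λ Σ w²
    return data_loss + lam * penalty
```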
Regularization
o Dropout: a very simple and elegant approach to apply regularization to neural networks
Source: Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp. 1929-1958.
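A minimal sketch (mine, not the paper's code) of inverted dropout applied to a layer's activations during training; the keep probability is an assumption.

```python
import numpy as np

def dropout(activations, p_keep=0.5, training=True):
    # during training, zero each unit with probability 1 - p_keep and rescale
    # by 1/p_keep so the expected activation is unchanged at test time
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < p_keep) / p_keep
    return activations * mask
```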
Weight Initialization
o Our understanding of how the initial point affects generalization is especially primitive, offering little to no guidance for how to select the initial point.
o Designing initialization strategies is a difficult task because neural network optimization is not yet well understood.
o Perhaps the only property known with complete certainty is that the initial parameters need to “break symmetry” between different units.
• If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters.
o General Rules:
• Larger initial weights yield a stronger symmetry-breaking effect, but weights that are too large cause other problems (e.g., saturation), so the initial weights should be small, though not too small.
• The initial weights should not be zero, to avoid the dead-activation phenomenon.
• The initial weights should have sufficient variance so that different units start out different.
Weight Initialization
o Uniform initialization (traditionally used for sigmoid and tanh units):

$$w_{ij}^{k} \sim U\!\left(-\frac{1}{\sqrt{fan_{in}}},\; \frac{1}{\sqrt{fan_{in}}}\right)$$

[Figure: a neuron $f$ with its fan-in (incoming) and fan-out (outgoing) connections]

o Xavier / Glorot initialization [1] (for sigmoid):

$$w_{ij}^{k} \sim U\!\left(-\sqrt{\frac{6}{fan_{in} + fan_{out}}},\; \sqrt{\frac{6}{fan_{in} + fan_{out}}}\right) \qquad \text{or} \qquad w_{ij}^{k} \sim N\!\left(0,\; \frac{2}{fan_{in} + fan_{out}}\right)$$

[1]: Glorot, X. and Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249-256). JMLR Workshop and Conference Proceedings.
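A quick sketch (assuming NumPy) of the Glorot/Xavier uniform rule for one weight matrix; here fan_in and fan_out are simply the matrix's input and output dimensions.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=None):
    # W ~ U(-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out)))
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W1 = glorot_uniform(4, 3)   # e.g., the first layer of the 4-3-2-1 network
```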
Weight Initialization
[Figure: a neuron $f$ with its fan-in (incoming) and fan-out (outgoing) connections]

o He initialization [1] (for ReLU):

$$w_{ij}^{k} \sim U\!\left(-\sqrt{\frac{6}{fan_{in}}},\; \sqrt{\frac{6}{fan_{in}}}\right) \qquad \text{or} \qquad w_{ij}^{k} \sim N\!\left(0,\; \frac{2}{fan_{in}}\right)$$

[1]: He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).
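And the corresponding sketch for He initialization (normal variant); again, NumPy is an assumption on my part.

```python
import numpy as np

def he_normal(fan_in, fan_out, seed=None):
    # W ~ N(0, 2/fan_in): variance scaled for ReLU units
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W1 = he_normal(4, 3)   # e.g., the first layer of the 4-3-2-1 network
```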
Weight Initialization
Source: He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).