
Advanced Deep Learning

Deep Generative Models

U Kang

Seoul National University


In This Lecture

Boltzmann Machines

Restricted Boltzmann Machines

Deep Belief Networks

Deep Boltzmann Machines

Back-Propagation through Random Operations

Directed Generative Nets


Outline

Boltzmann Machines

Restricted Boltzmann Machines

Deep Belief Networks

Deep Boltzmann Machines

Back-Propagation through Random Operations

Directed Generative Nets


Boltzmann Machines

Model a distribution over binary vectors

Undirected probabilistic graphical models

May be stacked to form deeper models

Energy-based model (recall the energy function $E(x)$)


Joint Probability Distribution

We consider a binary random vector $x$

Then, the joint probability distribution is

$$p(x) = \frac{\exp(-E(x))}{Z}$$

where $E(x)$ is the energy function, and $Z$ is the partition function that ensures $\sum_x p(x) = 1$


The Energy Function

The energy function of the BM is given by

$$E(x) = -x^T U x - b^T x$$

where $U$ is the weight matrix and $b$ is the vector of bias parameters

Represented as a fully-connected network
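As an illustration (not part of the original slides), the joint distribution of a very small Boltzmann machine can be evaluated by brute force; the sketch below assumes a 4-unit binary model with randomly drawn $U$ and $b$, and enumerates all $2^n$ states to compute the partition function.

```python
import numpy as np
from itertools import product

# Brute-force evaluation of p(x) = exp(-E(x)) / Z with E(x) = -x^T U x - b^T x.
# Only feasible for a handful of units, since Z sums over all 2^n states.
rng = np.random.default_rng(0)
n = 4                                                  # number of binary units (assumed)
U = np.triu(rng.normal(scale=0.1, size=(n, n)), k=1)   # weights on distinct pairs only
b = rng.normal(scale=0.1, size=n)

def energy(x):
    return -(x @ U @ x) - b @ x

states = np.array(list(product([0, 1], repeat=n)), dtype=float)
energies = np.array([energy(x) for x in states])
Z = np.exp(-energies).sum()            # partition function
p = np.exp(-energies) / Z              # p(x) for every joint configuration
print(p.sum())                         # sanity check: probabilities sum to 1
```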


Latent Variables (1)

What if not all the variables are observed?

Then, some of the variables are latent

The latent ones act similarly to hidden units


Latent Variables (2)

We decompose the units 𝑥 into two subsets:

the visible units 𝑣

the latent (or hidden) units ℎ

The energy function $E(v, h)$ becomes

$$E(v, h) = -v^T R v - v^T W h - h^T S h - b^T v - c^T h$$

which is similar to the structure of $E(x)$


Outline

Boltzmann Machines

Restricted Boltzmann Machines

Deep Belief Networks

Deep Boltzmann Machines

Back-Propagation through Random Operations

Directed Generative Nets


Restricted BMs

Common building blocks of deep probabilistic models

Contain the following layers:

a layer of observable variables $v$

a single layer of latent variables $h$

No connections are permitted within the same layer!


Examples

[Figure] (a) The structure of a restricted BM (RBM); (c) the structure of a deep BM


Joint Probability Distribution

The joint probability distribution is

$$p(v, h) = \frac{\exp(-E(v, h))}{Z}$$

which takes the same form as the one for a BM


The Energy Function

The energy function for an RBM is

$$E(v, h) = -b^T v - c^T h - v^T W h$$

Compare it to the non-restricted one:

$$E(v, h) = -v^T R v - v^T W h - h^T S h - b^T v - c^T h$$


The Partition Function

The partition function is defined as

$$Z = \sum_v \sum_h \exp\big(-E(v, h)\big)$$

which is intractable

This also means that $p(v) = \sum_h p(v, h) = \frac{1}{Z} \sum_h \exp\big(-E(v, h)\big)$ is also intractable


Conditional Distributions (1)

Though $p(v)$ is intractable, the bipartite structure of the RBM allows $P(h \mid v)$ and $P(v \mid h)$ to be easily computed and sampled from

We start by computing $p(h_j = 1 \mid v)$:

$$p(h \mid v) = \frac{p(h, v)}{p(v)} = \frac{1}{p(v)} \frac{1}{Z} \exp\{b^T v + c^T h + v^T W h\} = \frac{1}{Z'} \exp\{c^T h + v^T W h\} = \frac{1}{Z'} \prod_j \exp\{c_j h_j + v^T W_{:,j} h_j\}$$

$$p(h_j = 1 \mid v) = \frac{\tilde{p}(h_j = 1 \mid v)}{\tilde{p}(h_j = 0 \mid v) + \tilde{p}(h_j = 1 \mid v)} = \frac{\exp\{c_j + v^T W_{:,j}\}}{\exp\{0\} + \exp\{c_j + v^T W_{:,j}\}} = \sigma\big(c_j + v^T W_{:,j}\big)$$


Conditional Distributions (2)

Then, we express the full conditionals as factorial distributions

$$P(h \mid v) = \prod_j \sigma\big((2h - 1) \odot (c + W^T v)\big)_j$$

$$P(v \mid h) = \prod_i \sigma\big((2v - 1) \odot (b + W h)\big)_i$$
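A minimal numeric sketch of these conditionals (assumed shapes and random parameters, not from the slides): every hidden unit is conditionally independent given $v$, so $P(h \mid v)$ is computed as a vector of sigmoids and sampled elementwise.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    # P(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]) for every hidden unit j
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):
    # P(v_i = 1 | h) = sigmoid(b_i + W[i, :] h) for every visible unit i
    return sigmoid(b + W @ h)

rng = np.random.default_rng(0)
n_v, n_h = 6, 3
W = rng.normal(scale=0.1, size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)

v = rng.integers(0, 2, size=n_v).astype(float)
ph = p_h_given_v(v, W, c)
h = (rng.random(n_h) < ph).astype(float)   # one block Gibbs half-step: sample h | v
```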


Training RBM

An RBM admits

Efficient evaluation and differentiation of the unnormalized $\tilde{p}(v)$

Efficient MCMC sampling in the form of block Gibbs sampling: it is easy to sample each $h_j$ using $P(h_j = 1 \mid v) = \sigma(c_j + v^T W_{:,j})$, and each $h_j$ can be sampled independently

An RBM can therefore readily be trained with any of the techniques (e.g., CD, SML) for training models with intractable partition functions (see the sketch below)
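For concreteness, here is a minimal sketch of one contrastive divergence (CD-1) parameter update; the learning rate, batching, and initialization are simplified assumptions, not details from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01, rng=np.random.default_rng(0)):
    # positive phase: sample h0 ~ P(h | v0) from the data vector v0
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one block Gibbs step v1 ~ P(v | h0), then P(h | v1)
    pv1 = sigmoid(b + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + v1 @ W)
    # approximate log-likelihood gradient: data statistics minus model statistics
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```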


Outline

Boltzmann Machines

Restricted Boltzmann Machines

Deep Belief Networks

Deep Boltzmann Machines

Back-Propagation through Random Operations

Directed Generative Nets


Deep Belief Networks

The introduction of deep belief networks (DBNs) began the current deep learning renaissance

Contain several layers of latent variables

Contain no intra-layer connections

Undirected connections between the top 2 layers

Directed connections between all other layers


Graph Structure

A DBN with one visible and two hidden layers

It can be represented as a graphical model with inter-layer connections


Parameters

A DBN with $l$ hidden layers contains

$l$ weight matrices $W^{(1)}, \ldots, W^{(l)}$

$l + 1$ bias vectors $b^{(0)}, \ldots, b^{(l)}$

$b^{(0)}$ provides the biases for the visible layer


The Probability Distribution (1)

The probability dist. consists of three parts

The first distribution is given by

$$P(h^{(l)}, h^{(l-1)}) \propto \exp\big(b^{(l)\,T} h^{(l)} + b^{(l-1)\,T} h^{(l-1)} + h^{(l-1)\,T} W^{(l)} h^{(l)}\big)$$

It provides the joint distribution between the top two hidden layers $h^{(l)}$ and $h^{(l-1)}$


The Probability Distribution (2)

The second distribution is given by

$$P(h_i^{(k)} = 1 \mid h^{(k+1)}) = \sigma\big(b_i^{(k)} + W_{:,i}^{(k+1)\,T} h^{(k+1)}\big)$$

It provides the conditional distribution for the activation of a hidden layer $h^{(k)}$ given $h^{(k+1)}$, for $k = l - 2$ down to $1$


The Probability Distribution (3)

The third distribution is given by

$$P(v_i = 1 \mid h^{(1)}) = \sigma\big(b_i^{(0)} + W_{:,i}^{(1)\,T} h^{(1)}\big)$$

It provides the conditional distribution for the activation of the visible layer $v$ given $h^{(1)}$

In the case of real-valued visible units, $v \sim \mathcal{N}\big(v;\ b^{(0)} + W^{(1)\,T} h^{(1)},\ \beta^{-1}\big)$


Deep Belief Network

Generating a sample from DBN

First, run several steps of Gibbs sampling on the top two hidden layers

Then, use a single pass of ancestral sampling through the rest of the model to draw a sample from the visible units, as sketched below
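A minimal sketch of this procedure, under the assumptions of binary units, at least two hidden layers, and weight matrices $W^{(k)}$ shaped (units in layer $k-1$) by (units in layer $k$); it is illustrative rather than the slides' exact procedure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn(Ws, bs, n_gibbs=100, rng=np.random.default_rng(0)):
    # Ws = [W(1), ..., W(l)], bs = [b(0), ..., b(l)]; assumes l >= 2
    top_W, top_b, below_b = Ws[-1], bs[-1], bs[-2]
    h_top = rng.integers(0, 2, size=top_W.shape[1]).astype(float)
    for _ in range(n_gibbs):                 # Gibbs sampling on the top RBM
        p_below = sigmoid(below_b + top_W @ h_top)
        h_below = (rng.random(p_below.shape) < p_below).astype(float)
        p_top = sigmoid(top_b + h_below @ top_W)
        h_top = (rng.random(p_top.shape) < p_top).astype(float)
    h = h_below
    for W, b in zip(reversed(Ws[:-1]), reversed(bs[:-2])):
        # ancestral pass downward: P(h(k-1) = 1 | h(k)) = sigmoid(b(k-1) + W(k) h(k))
        p = sigmoid(b + W @ h)
        h = (rng.random(p.shape) < p).astype(float)
    return h                                 # a sample of the visible units
```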


Training DBN

Intractability

Inference in a DBN is intractable due to the “explaining away” effect within each directed layer, and due to the interaction between the top two hidden layers that have undirected connections


Training DBN

Training a DBN: greedy layer-wise training (see the sketch below)

First, train an RBM to maximize $\mathbb{E}_{v \sim p_{data}} \log p(v)$ using CD or SML; the learned parameters define the parameters of the first layer of the DBN

Next, a second RBM is trained to approximately maximize $\mathbb{E}_{v \sim p_{data}} \mathbb{E}_{h^{(1)} \sim p^{(1)}(h^{(1)} \mid v)} \log p^{(2)}(h^{(1)})$

This procedure can be repeated indefinitely, to add as many layers to the DBN as desired

The learned weights are used as the parameters of the DBN
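A minimal sketch of the greedy stacking loop, assuming a hypothetical `train_rbm(data, n_hidden)` helper (e.g., the CD-1 routine above wrapped in an epoch loop) that returns the trained `(W, b_visible, c_hidden)`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_dbn(data, layer_sizes, train_rbm):
    # data: (n_samples, n_visible); layer_sizes: widths of the hidden layers
    Ws, bs = [], []
    x = data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(x, n_hidden)      # maximize E log p of the current input
        if not Ws:
            bs.append(b)                      # b(0): biases of the visible layer
        Ws.append(W)
        bs.append(c)
        x = sigmoid(c + x @ W)                # P(h = 1 | v) becomes the next layer's "data"
    return Ws, bs                             # l weight matrices, l + 1 bias vectors
```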


Using the Trained DBN

A DBN may be used for

Generating samples from the model

Improving classification: take the weights from the DBN and use them to define an MLP; after this initialization, the MLP is fine-tuned on a classification task. The resulting MLP computes

$$h^{(1)} = \sigma\big(b^{(1)} + v^T W^{(1)}\big)$$

$$h^{(l)} = \sigma\big(b^{(l)} + h^{(l-1)\,T} W^{(l)}\big) \quad \text{for } l = 2, \ldots, m$$
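A minimal sketch of the deterministic forward pass defined by these equations; a classification layer on top of the final $h^{(m)}$ (not shown) would then be fine-tuned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_mlp_forward(v, Ws, bs):
    # Ws = [W(1), ..., W(m)], bs = [b(1), ..., b(m)] taken from the trained DBN
    h = v
    for W, b in zip(Ws, bs):
        h = sigmoid(b + h @ W)    # h(l) = sigma(b(l) + h(l-1)^T W(l))
    return h                      # representation fed to a classifier head
```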


Outline

Boltzmann Machines

Restricted Boltzmann Machines

Deep Belief Networks

Deep Boltzmann Machines

Back-Propagation through Random Operations

Directed Generative Nets


Deep Boltzmann Machines

Another kind of deep generative model

Entirely undirected models, unlike DBNs

Have several layers of latent variables


Graph Structure

A DBM with one visible and two hidden layers

There are no intra-layer connections


Joint Probability Distribution

A DBM is also an energy-based model

The joint probability distribution is given by

$$P(v, h^{(1)}, h^{(2)}) = \frac{1}{Z(\theta)} \exp\big(-E(v, h^{(1)}, h^{(2)};\, \theta)\big)$$

in the case with one visible layer $v$ and two hidden layers $h^{(1)}$ and $h^{(2)}$, where

$$E(v, h^{(1)}, h^{(2)};\, \theta) = -v^T W^{(1)} h^{(1)} - h^{(1)\,T} W^{(2)} h^{(2)}$$

Bias terms are omitted for simplicity


Bipartite Graph (1)

The DBM layers can be organized into a bipartite graph, with the odd-numbered layers on one side and the even-numbered layers on the other:


Bipartite Graph (2)

The variables in the odd (even) layers become conditionally independent when we condition on the variables in the even (odd) layers:

$$P(v, h^{(2)} \mid h^{(1)}) = P(v \mid h^{(1)})\, P(h^{(2)} \mid h^{(1)}) = \prod_i P(v_i \mid h^{(1)}) \prod_j P(h_j^{(2)} \mid h^{(1)})$$


Conditional Distributions

The activation probabilities are given by

$$P(v_i = 1 \mid h^{(1)}) = \sigma\big(W_{i,:}^{(1)} h^{(1)}\big)$$

$$P(h_i^{(1)} = 1 \mid v, h^{(2)}) = \sigma\big(v^T W_{:,i}^{(1)} + W_{i,:}^{(2)} h^{(2)}\big)$$

$$P(h_k^{(2)} = 1 \mid h^{(1)}) = \sigma\big(h^{(1)\,T} W_{:,k}^{(2)}\big)$$


Gibbs Sampling

The bipartite architecture in DBM makes Gibbs sampling efficient

In a DBM, Gibbs sampling updates all odd (or even) layers at once: iterate the following two block updates (sketched below)

Sample the block containing the visible layer and all even-numbered hidden layers

Sample the block containing all odd-numbered hidden layers
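A minimal sketch for the two-hidden-layer DBM above (biases omitted, matching the slide's energy function); here the even block is $\{v, h^{(2)}\}$ and the odd block is $\{h^{(1)}\}$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_block_gibbs(W1, W2, n_steps=100, rng=np.random.default_rng(0)):
    # W1: (n_v, n_h1), W2: (n_h1, n_h2); start from a random binary state
    n_v, n_h1 = W1.shape
    n_h2 = W2.shape[1]
    v = rng.integers(0, 2, n_v).astype(float)
    h2 = rng.integers(0, 2, n_h2).astype(float)
    for _ in range(n_steps):
        # odd block: h1 depends on both of its neighbors, v and h2
        p_h1 = sigmoid(v @ W1 + W2 @ h2)
        h1 = (rng.random(n_h1) < p_h1).astype(float)
        # even block: v and h2 are conditionally independent given h1
        p_v = sigmoid(W1 @ h1)
        v = (rng.random(n_v) < p_v).astype(float)
        p_h2 = sigmoid(h1 @ W2)
        h2 = (rng.random(n_h2) < p_h2).astype(float)
    return v, h1, h2
```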


Inference in DBM

The distribution over all hidden layers does not factorize because of interactions between layers

In the example with two hidden layers,

𝑃 ℎ 1 , ℎ 2 𝑣 does not factorize because of the interaction weights 𝑊 2 between ℎ 1 and ℎ 2


Mean Field Inference

A simple form of variational inference

We try to approximate $P(h^{(1)}, h^{(2)} \mid v)$

But, we restrict the approximating distribution to fully factorial distributions

We attempt to find 𝑄 that best fits 𝑃


Mean Field Assumption

Let $Q(h^{(1)}, h^{(2)} \mid v)$ be the approximation

The mean field assumption implies that

$$Q(h^{(1)}, h^{(2)} \mid v) = \prod_j Q(h_j^{(1)} \mid v) \prod_k Q(h_k^{(2)} \mid v)$$

It restricts the search space (for efficiency)


Mean Field Approximation

The mean field approach is to minimize

$$KL(Q \parallel P) = \sum_h Q(h \mid v) \log \frac{Q(h \mid v)}{P(h \mid v)}$$

which measures the difference between $Q$ and $P$


Parameterization

We associate the probability with a parameter

We parameterize $h^{(1)}$ and $h^{(2)}$ as

$$\hat{h}_j^{(1)} = Q(h_j^{(1)} = 1 \mid v), \qquad \hat{h}_k^{(2)} = Q(h_k^{(2)} = 1 \mid v)$$

where $\hat{h}_j^{(1)} \in [0, 1]$ and $\hat{h}_k^{(2)} \in [0, 1]$


Update Rules for Inference in DBM

We saw that the optimal $q(h_j \mid v)$ can be obtained by normalizing the unnormalized distribution

$$\tilde{q}(h_j \mid v) = \exp\big(\mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)} \log p(h, v)\big)$$

Then, we obtain the update rules

$$\hat{h}_j^{(1)} = \sigma\Big(\sum_i v_i W_{i,j}^{(1)} + \sum_k W_{j,k}^{(2)} \hat{h}_k^{(2)}\Big), \quad \forall j$$

$$\hat{h}_k^{(2)} = \sigma\Big(\sum_j W_{j,k}^{(2)} \hat{h}_j^{(1)}\Big), \quad \forall k$$

These equations define an iterative algorithm in which we alternate updates of $\hat{h}^{(1)}$ and $\hat{h}^{(2)}$ until convergence
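A minimal sketch of this fixed-point iteration for the two-hidden-layer DBM (biases omitted as before; a fixed iteration count stands in for a convergence test):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iters=50):
    # variational parameters: hat_h1[j] = Q(h1_j = 1 | v), hat_h2[k] = Q(h2_k = 1 | v)
    hat_h1 = np.full(W1.shape[1], 0.5)
    hat_h2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        hat_h1 = sigmoid(v @ W1 + W2 @ hat_h2)   # update all j at once
        hat_h2 = sigmoid(hat_h1 @ W2)            # update all k at once
    return hat_h1, hat_h2
```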


Update Rules for Inference in DBM

Proof of the update rule $\hat{h}_k^{(2)} = \sigma\big(\sum_j W_{j,k}^{(2)} \hat{h}_j^{(1)}\big),\ \forall k$

First, compute $\tilde{q}(h_j \mid v) = \exp\big(\mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)} \log p(h, v)\big)$, where $h_j = h_k^{(2)}$

$\mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)} \log p(h, v) \propto \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}[\log \tilde{p}(h, v)] = \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}\big[v^T W^{(1)} h^{(1)} + h^{(1)\,T} W^{(2)} h^{(2)}\big] = \mathbb{E}_{h_{-j} \sim q(h_{-j} \mid v)}\big[\sum_i \sum_j v_i W_{i,j}^{(1)} h_j^{(1)} + \sum_j \sum_k h_j^{(1)} W_{j,k}^{(2)} h_k^{(2)}\big]$

$= \sum_{h_{-j}} \big[\sum_i \sum_j v_i W_{i,j}^{(1)} h_j^{(1)} + \sum_j h_j^{(1)} W_{j,k}^{(2)} h_k^{(2)} + \sum_j \sum_{k' \neq k} h_j^{(1)} W_{j,k'}^{(2)} h_{k'}^{(2)}\big]\, q(h_{-j} \mid v)$

Thus, $\tilde{q}(h_j = 1 \mid v) = \tilde{q}(h_k^{(2)} = 1 \mid v) \propto \exp\big(\sum_{h_{-j}} \big[\sum_j h_j^{(1)} W_{j,k}^{(2)}\big]\, q(h_{-j} \mid v)\big) = \exp\big[\sum_j \hat{h}_j^{(1)} W_{j,k}^{(2)}\big]$. Similarly, $\tilde{q}(h_j = 0 \mid v) = \tilde{q}(h_k^{(2)} = 0 \mid v) \propto \exp[0]$

Thus, $q(h_k^{(2)} = 1 \mid v) = \dfrac{\tilde{q}(h_k^{(2)} = 1 \mid v)}{\tilde{q}(h_k^{(2)} = 0 \mid v) + \tilde{q}(h_k^{(2)} = 1 \mid v)} = \dfrac{\exp\{\sum_{j'} \hat{h}_{j'}^{(1)} W_{j',k}^{(2)}\}}{\exp(0) + \exp\{\sum_{j'} \hat{h}_{j'}^{(1)} W_{j',k}^{(2)}\}} = \sigma\big(\sum_j W_{j,k}^{(2)} \hat{h}_j^{(1)}\big)$


DBM Parameter Learning

Learning in a DBM must confront both the challenge of an intractable partition function and the challenge of an intractable posterior distribution

Variational inference allows the construction of a distribution $Q(h \mid v)$ that approximates the intractable $P(h \mid v)$. Learning then proceeds by maximizing $\mathcal{L}(v, Q, \theta)$, the variational lower bound on the intractable log-likelihood $\log P(v; \theta)$


DBM Parameter Learning

For a DBM with two hidden layers, the ELBO $\mathcal{L}$ is given by

$$\mathcal{L}(v, Q, \theta) = \sum_i \sum_j v_i W_{i,j}^{(1)} \hat{h}_j^{(1)} + \sum_j \sum_k \hat{h}_j^{(1)} W_{j,k}^{(2)} \hat{h}_k^{(2)} - \log Z(\theta) + H(Q)$$

(Proof) $\mathcal{L}(v, \theta, Q) = \mathbb{E}_{h \sim Q}[\log p(h, v)] + H(Q) = \mathbb{E}_{h \sim Q}[\log \tilde{p}(h, v)] - \log Z(\theta) + H(Q)$.

Note that $\mathbb{E}_{h \sim Q}[\log \tilde{p}(h, v)] = \mathbb{E}_{h \sim Q}\big[v^T W^{(1)} h^{(1)} + h^{(1)\,T} W^{(2)} h^{(2)}\big] = \mathbb{E}_{h \sim Q}\big[\sum_i \sum_j v_i W_{i,j}^{(1)} h_j^{(1)} + \sum_j \sum_k h_j^{(1)} W_{j,k}^{(2)} h_k^{(2)}\big] = \sum_h \big[\sum_i \sum_j v_i W_{i,j}^{(1)} h_j^{(1)} + \sum_j \sum_k h_j^{(1)} W_{j,k}^{(2)} h_k^{(2)}\big]\, Q(h \mid v) = \sum_i \sum_j v_i W_{i,j}^{(1)} \hat{h}_j^{(1)} + \sum_j \sum_k \hat{h}_j^{(1)} W_{j,k}^{(2)} \hat{h}_k^{(2)}$


DBM Parameter Learning

The bound still contains the log partition function $\log Z(\theta)$; evaluating it requires an approximate technique such as annealed importance sampling (AIS)

Also, training the model requires the gradient of the log partition function; typically stochastic maximum likelihood (SML) is used

Another popular method for DBM parameter learning is greedy layer-wise pretraining


Questions?
