Advanced Deep Learning
Deep Generative Models
U Kang
Seoul National University
In This Lecture
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Boltzmann Machines
Model a distribution over binary vectors
Undirected probabilistic graphical models
May be stacked to form deeper models
Energy-based model (recall the energy function $E(x)$)
Joint Probability Distribution
We consider a binary random vector 𝑥
Then, the joint probability distribution is
$p(x) = \frac{\exp(-E(x))}{Z}$
where $E(x)$ is the energy function, and $Z$ is the partition function that ensures $\sum_x p(x) = 1$
The Energy Function
The energy function of the BM is given by
$E(x) = -x^T U x - b^T x$
where $U$ is the weight matrix and $b$ is the bias vector
Represented as a fully-connected network
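For concreteness, here is a minimal NumPy sketch of this energy computation; the helper name bm_energy and the toy sizes are assumptions, not part of the original material.

```python
import numpy as np

def bm_energy(x, U, b):
    """Boltzmann machine energy E(x) = -x^T U x - b^T x."""
    return -x @ U @ x - b @ x

rng = np.random.default_rng(0)
n = 4
U = np.triu(rng.normal(size=(n, n)), k=1)    # one weight per pair of units, no self-connections
b = rng.normal(size=n)
x = rng.integers(0, 2, size=n).astype(float) # a binary configuration
print(bm_energy(x, U, b))                    # scalar energy of this configuration
```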
Latent Variables (1)
What if not all the variables are observed?
Then, some of the variables are latent
The latent ones act similarly to hidden units
Latent Variables (2)
We decompose the units 𝑥 into two subsets:
the visible units 𝑣
the latent (or hidden) units ℎ
The energy function $E(v, h)$ becomes
$E(v, h) = -v^T R v - v^T W h - h^T S h - b^T v - c^T h$
which has a structure similar to that of $E(x)$
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Restricted BMs
Common building blocks of deep probabilistic models
Contain the following layers:
a layer of observable variables 𝑣
a single layer of latent variables ℎ
No connections are permitted within the same layer!
Examples
(Figure) (a) The structure of a restricted Boltzmann machine (RBM); (c) the structure of a deep Boltzmann machine
Joint Probability Distribution
The joint probability distribution is
$p(v, h) = \frac{\exp(-E(v, h))}{Z}$
which has the same form as the one for a general BM
The Energy Function
The energy function for an RBM is
$E(v, h) = -b^T v - c^T h - v^T W h$
Compare it to the non-restricted one:
$E(v, h) = -v^T R v - v^T W h - h^T S h - b^T v - c^T h$
The Partition Function
The partition function is defined as
$Z = \sum_v \sum_h \exp\big(-E(v, h)\big)$
which is intractable
This also means that $p(v) = \sum_h p(h, v) = \frac{1}{Z} \sum_h \exp\big(-E(v, h)\big)$ is intractable to evaluate
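To make the intractability concrete, here is a sketch that computes $Z$ by brute force for an assumed toy-sized RBM; the number of terms grows as $2^{n_v + n_h}$, which is exactly what rules this out for realistic models. The helper names are assumptions.

```python
import itertools
import numpy as np

def rbm_energy(v, h, W, b, c):
    """RBM energy E(v, h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

def brute_force_Z(W, b, c):
    """Exact partition function by enumerating every (v, h): 2^(nv + nh) terms."""
    nv, nh = W.shape
    Z = 0.0
    for v in itertools.product([0.0, 1.0], repeat=nv):
        for h in itertools.product([0.0, 1.0], repeat=nh):
            Z += np.exp(-rbm_energy(np.array(v), np.array(h), W, b, c))
    return Z

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 3))                  # feasible only because the model is tiny
print(brute_force_Z(W, np.zeros(4), np.zeros(3)))  # already 2^(4+3) = 128 terms
```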
Conditional Distributions (1)
Though $p(v)$ is intractable, the bipartite structure of the RBM allows $P(h \mid v)$ and $P(v \mid h)$ to be easily computed and sampled from
We start by computing $p(h_j = 1 \mid v)$:
$p(h \mid v) = \frac{p(h, v)}{p(v)} = \frac{1}{p(v)} \frac{1}{Z} \exp\{b^T v + c^T h + v^T W h\} = \frac{1}{Z'} \exp\{c^T h + v^T W h\} = \frac{1}{Z'} \prod_{j=1}^{n_h} \exp\{c_j h_j + v^T W_{:,j} h_j\}$
$p(h_j = 1 \mid v) = \frac{\tilde{p}(h_j = 1 \mid v)}{\tilde{p}(h_j = 0 \mid v) + \tilde{p}(h_j = 1 \mid v)} = \frac{\exp\{c_j + v^T W_{:,j}\}}{\exp\{0\} + \exp\{c_j + v^T W_{:,j}\}} = \sigma\big(c_j + v^T W_{:,j}\big)$
Conditional Distributions (2)
Then, we express the full conditionals as the factorial distribution
$P(h \mid v) = \prod_{j=1}^{n_h} \sigma\big((2h - 1) \odot (c + W^T v)\big)_j$
$P(v \mid h) = \prod_{i=1}^{n_v} \sigma\big((2v - 1) \odot (b + W h)\big)_i$
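A minimal NumPy sketch of these conditionals, assuming $W$ has shape $(n_v, n_h)$; the helper names are assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, c, rng):
    """P(h_j = 1 | v) = sigma(c_j + v^T W_{:,j}); the h_j are independent given v."""
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b, rng):
    """P(v_i = 1 | h) = sigma(b_i + W_{i,:} h); the v_i are independent given h."""
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float), p
```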
Training RBM
RBM admits
Efficient evaluation and differentiation of the unnormalized probability $\tilde{p}(v)$
Efficient MCMC sampling in the form of block Gibbs sampling: it is easy to sample each $h_j$ using $P(h_j = 1 \mid v) = \sigma\big(c_j + v^T W_{:,j}\big)$, and each $h_j$ can be sampled independently
RBMs can readily be trained with any of the techniques (e.g., CD, SML) for training models with intractable partition functions (see the CD-1 sketch below)
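As an illustration, a sketch of one CD-1 update built on the sampling helpers from the previous sketch; the single-example update, learning rate handling, and helper names are assumptions.

```python
def cd1_step(v0, W, b, c, lr, rng):
    """One contrastive-divergence (CD-1) update on a single visible vector v0."""
    h0, p_h0 = sample_h_given_v(v0, W, c, rng)     # positive phase
    v1, _    = sample_v_given_h(h0, W, b, rng)     # one step of block Gibbs
    _, p_h1  = sample_h_given_v(v1, W, c, rng)     # negative-phase statistics
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b += lr * (v0 - v1)
    c += lr * (p_h0 - p_h1)
    return W, b, c
```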
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Deep Belief Networks
The introduction of deep belief networks (DBNs) began the current deep learning renaissance
Contain several layers of latent variables
Contain no intra-layer connections
Undirected connections between the top 2 layers
Directed connections between all other layers
Graph Structure
A DBN with one visible and two hidden layers
It can be represented as a graphical model with inter-layer connections
Parameters
A DBN with 𝑙 hidden layers contains
$l$ weight matrices $W^{(1)}, \dots, W^{(l)}$
$l + 1$ bias vectors $b^{(0)}, \dots, b^{(l)}$
$b^{(0)}$ provides the biases for the visible layer
The Probability Distribution (1)
The probability dist. consists of three parts
The first distribution is given by
$P(h^{(l)}, h^{(l-1)}) \propto \exp\big(b^{(l)T} h^{(l)} + b^{(l-1)T} h^{(l-1)} + h^{(l-1)T} W^{(l)} h^{(l)}\big)$
It provides the joint distribution between the top two hidden layers $h^{(l)}$ and $h^{(l-1)}$
The Probability Distribution (2)
The second distribution is given by
$P(h_i^{(k)} = 1 \mid h^{(k+1)}) = \sigma\big(b_i^{(k)} + W_{:,i}^{(k+1)T} h^{(k+1)}\big)$
It provides the conditional distribution for the activation of a hidden layer $h^{(k)}$ given $h^{(k+1)}$, for $k = l-2, \dots, 1$
The Probability Distribution (3)
The third distribution is given by
$P(v_i = 1 \mid h^{(1)}) = \sigma\big(b_i^{(0)} + W_{:,i}^{(1)T} h^{(1)}\big)$
It provides the conditional distribution for the activation of the visible layer $v$ given $h^{(1)}$
In the case of real-valued visible units, $v \sim \mathcal{N}\big(v;\, b^{(0)} + W^{(1)T} h^{(1)},\, \beta^{-1}\big)$
Deep Belief Network
Generating a sample from DBN
First, run several steps of Gibbs sampling on the top two hidden layers
Then, use a single pass of ancestral sampling through the rest of the model to draw a sample from the visible units
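A rough sketch of this sampling procedure, assuming NumPy, the sigmoid helper from the RBM sketch above, and a parameter layout where weights[k] connects layer k to layer k+1 (layer 0 is the visible layer) and biases[k] is the bias of layer k; this matches the l matrices and l+1 bias vectors listed earlier.

```python
def sample_dbn(weights, biases, n_gibbs, rng):
    """Draw one sample from a DBN: Gibbs sampling in the top RBM (layers l-1 and l),
    then a single ancestral pass down through the directed layers to the visible units."""
    l = len(weights)                                   # number of hidden layers
    # Gibbs sampling on the top two layers, which form an RBM with weight matrix weights[-1]
    W_top, b_low, b_top = weights[-1], biases[l - 1], biases[l]
    h_low = (rng.random(W_top.shape[0]) < 0.5).astype(float)
    for _ in range(n_gibbs):
        h_top = (rng.random(W_top.shape[1]) < sigmoid(b_top + h_low @ W_top)).astype(float)
        h_low = (rng.random(W_top.shape[0]) < sigmoid(b_low + W_top @ h_top)).astype(float)
    # Ancestral pass through the directed layers down to the visible layer
    h = h_low
    for k in range(l - 2, -1, -1):                     # layers l-2, ..., 1, then the visible layer
        p = sigmoid(biases[k] + weights[k] @ h)
        h = (rng.random(p.shape) < p).astype(float)
    return h                                           # sample of the visible units
```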
Training DBN
Intractability
Inference in a DBN is intractable due to the “explaining away” effect within each directed layer and the interaction between the two hidden layers with undirected connections at the top
Training DBN
Training DBN: greedy layer-wise training
First, train an RBM to maximize $E_{v \sim p_{data}} \log p(v)$ using CD or SML; the learned parameters define the parameters of the first layer of the DBN
Next, a second RBM is trained to approximately maximize $E_{v \sim p_{data}} E_{h^{(1)} \sim p^{(1)}(h^{(1)} \mid v)} \log p^{(2)}(h^{(1)})$
This procedure can be repeated indefinitely to add as many layers to the DBN as desired (see the sketch below)
The learned weights are used as the parameters of the DBN
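A sketch of this greedy layer-wise procedure, reusing the cd1_step and sigmoid helpers from the RBM sketches above; the training loop, deterministic "up" pass, and hyperparameters are illustrative assumptions.

```python
def greedy_pretrain_dbn(data, hidden_sizes, n_epochs, lr, rng):
    """Train one RBM per hidden layer with CD-1, feeding each RBM's hidden-unit
    probabilities to the next RBM. data: (n_samples, n_visible); hidden_sizes: widths of h^(1)..h^(l)."""
    weights, biases = [], []
    inputs = data
    for k, n_hidden in enumerate(hidden_sizes):
        n_visible = inputs.shape[1]
        W = 0.01 * rng.normal(size=(n_visible, n_hidden))
        b, c = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(n_epochs):
            for v in inputs:
                W, b, c = cd1_step(v, W, b, c, lr, rng)
        if k == 0:
            biases.append(b)                  # visible bias b^(0) comes from the first RBM
        weights.append(W)
        biases.append(c)                      # hidden bias of this RBM becomes b^(k+1)
        inputs = sigmoid(c + inputs @ W)      # deterministic "up" pass to train the next layer
    return weights, biases                    # l weight matrices and l+1 bias vectors, as on the slide
```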
Using the Trained DBN
DBN may be used for
Generating samples from the model
Improving classification: take the weights from the DBN and use them to define an MLP; after this initialization, the MLP is fine-tuned on a classification task (a sketch follows below)
$h^{(1)} = \sigma\big(b^{(1)} + v^T W^{(1)}\big)$
$h^{(l)} = \sigma\big(b^{(l)} + h^{(l-1)T} W^{(l)}\big)$ for $l = 2, \dots, m$
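A sketch of this deterministic forward pass, reusing the sigmoid helper and the (weights, biases) layout from the pretraining sketch above; a classifier head and fine-tuning step are left out.

```python
def dbn_mlp_forward(v, weights, biases):
    """Deterministic MLP pass defined by DBN weights:
    h^(1) = sigma(b^(1) + v^T W^(1)),  h^(l) = sigma(b^(l) + h^(l-1)^T W^(l))."""
    h = sigmoid(biases[1] + v @ weights[0])
    for l in range(1, len(weights)):
        h = sigmoid(biases[l + 1] + h @ weights[l])
    return h   # features on which a classification layer can be fine-tuned
```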
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Deep Boltzmann Machines
Another kind of deep generative model
Entirely undirected models unlike DBNs
Have several layers of latent variables
Graph Structure
A DBM with one visible and two hidden layers
There are no intra-layer connections
Joint Probability Distribution
A DBM is also an energy-based model
The joint probability distribution is given by
$P(v, h^{(1)}, h^{(2)}) = \frac{1}{Z(\theta)} \exp\big(-E(v, h^{(1)}, h^{(2)}; \theta)\big)$
in the case with one visible layer $v$ and two hidden layers $h^{(1)}$ and $h^{(2)}$, where
$E(v, h^{(1)}, h^{(2)}; \theta) = -v^T W^{(1)} h^{(1)} - h^{(1)T} W^{(2)} h^{(2)}$
Bias terms are omitted for simplicity
Bipartite Graph (1)
The DBM layers can be organized into a bipartite graph:
Bipartite Graph (2)
The variables in the odd (even) layers become conditionally independent when we condition on the variables in the even (odd) layers:
$P(v, h^{(2)} \mid h^{(1)}) = P(v \mid h^{(1)})\, P(h^{(2)} \mid h^{(1)}) = \prod_i P(v_i \mid h^{(1)}) \prod_j P(h_j^{(2)} \mid h^{(1)})$
Conditional Distributions
The activation probabilities are given by
$P(v_i = 1 \mid h^{(1)}) = \sigma\big(W_{i,:}^{(1)} h^{(1)}\big)$
$P(h_i^{(1)} = 1 \mid v, h^{(2)}) = \sigma\big(v^T W_{:,i}^{(1)} + W_{i,:}^{(2)} h^{(2)}\big)$
$P(h_k^{(2)} = 1 \mid h^{(1)}) = \sigma\big(h^{(1)T} W_{:,k}^{(2)}\big)$
Gibbs Sampling
The bipartite architecture in DBM makes Gibbs sampling efficient
In a DBM, Gibbs sampling updates all odd (or even) layers at once: iterate the following two steps (see the sketch below)
Sample all even layers (including the visible layer) simultaneously
Sample all odd layers simultaneously
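A sketch of one such block Gibbs sweep for the two-hidden-layer DBM above, assuming NumPy and the sigmoid helper from the RBM sketch; biases are omitted as on the slide, and the shapes assumed are W1: (n_v, n_1), W2: (n_1, n_2).

```python
def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    """One block Gibbs sweep in a two-hidden-layer DBM (biases omitted)."""
    # Even block: v and h^(2) are conditionally independent given h^(1)
    v  = (rng.random(v.shape)  < sigmoid(W1 @ h1)).astype(float)
    h2 = (rng.random(h2.shape) < sigmoid(h1 @ W2)).astype(float)
    # Odd block: h^(1) given v and h^(2)
    h1 = (rng.random(h1.shape) < sigmoid(v @ W1 + W2 @ h2)).astype(float)
    return v, h1, h2
```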
Inference in DBM
The distribution over all hidden layers does not factorize because of interactions between layers
In the example with two hidden layers,
$P(h^{(1)}, h^{(2)} \mid v)$ does not factorize because of the interaction weights $W^{(2)}$ between $h^{(1)}$ and $h^{(2)}$
Mean Field Inference
A simple form of variational inference
We try to approximate $P(h^{(1)}, h^{(2)} \mid v)$
But, we restrict the approximating distribution to fully factorial distributions
We attempt to find 𝑄 that best fits 𝑃
Mean Field Assumption
Let $Q(h^{(1)}, h^{(2)} \mid v)$ be the approximation
The mean field assumption implies that
$Q(h^{(1)}, h^{(2)} \mid v) = \prod_j Q(h_j^{(1)} \mid v) \prod_k Q(h_k^{(2)} \mid v)$
It restricts the search space (for efficiency)
Mean Field Approximation
The mean field approach is to minimize
$\mathrm{KL}(Q \| P) = \sum_h Q(h \mid v) \log \frac{Q(h \mid v)}{P(h \mid v)}$
which measures the difference between $Q$ and $P$
Parameterization
We associate the probability with a parameter
We parameterize $h^{(1)}$ and $h^{(2)}$ as
$\hat{h}_j^{(1)} = Q(h_j^{(1)} = 1 \mid v)$ and $\hat{h}_k^{(2)} = Q(h_k^{(2)} = 1 \mid v)$
where $\hat{h}_j^{(1)} \in [0, 1]$ and $\hat{h}_k^{(2)} \in [0, 1]$
Update Rules for Inference in DBM
We saw that the optimal $q(h_j \mid v)$ can be obtained by normalizing the unnormalized distribution
$\tilde{q}(h_j \mid v) = \exp\big(E_{h_{-j} \sim q(h_{-j} \mid v)} \log \tilde{p}(h, v)\big)$
Then, we obtain the update rules
$\hat{h}_j^{(1)} = \sigma\big(\sum_i v_i W_{i,j}^{(1)} + \sum_{k'} W_{j,k'}^{(2)} \hat{h}_{k'}^{(2)}\big), \; \forall j$
$\hat{h}_k^{(2)} = \sigma\big(\sum_{j'} W_{j',k}^{(2)} \hat{h}_{j'}^{(1)}\big), \; \forall k$
These equations define an iterative algorithm in which we alternate between updating $\hat{h}^{(1)}$ and $\hat{h}^{(2)}$ until convergence (see the sketch below)
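A sketch of this fixed-point iteration for the two-hidden-layer DBM, reusing NumPy and the sigmoid helper from earlier; the initialization at 0.5 and the fixed iteration count are assumptions, and biases are omitted as on the slides.

```python
def mean_field_inference(v, W1, W2, n_iters=10):
    """Mean field inference: alternate the fixed-point updates for h_hat^(1) and h_hat^(2)."""
    h1_hat = np.full(W1.shape[1], 0.5)            # Q(h_j^(1) = 1 | v), initialized at 0.5
    h2_hat = np.full(W2.shape[1], 0.5)            # Q(h_k^(2) = 1 | v)
    for _ in range(n_iters):
        h1_hat = sigmoid(v @ W1 + W2 @ h2_hat)    # update uses the current h2_hat
        h2_hat = sigmoid(h1_hat @ W2)             # update uses the freshly updated h1_hat
    return h1_hat, h2_hat
```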
Update Rules for Inference in DBM
Proof of the update rule $\hat{h}_k^{(2)} = \sigma\big(\sum_{j'} W_{j',k}^{(2)} \hat{h}_{j'}^{(1)}\big), \; \forall k$
First, compute $\tilde{q}(h_j \mid v) = \exp\big(E_{h_{-j} \sim q(h_{-j} \mid v)} \log \tilde{p}(h, v)\big)$, where $h_j = h_k^{(2)}$
$E_{h_{-j} \sim q(h_{-j} \mid v)} \log \tilde{p}(h, v) = E_{h_{-j} \sim q(h_{-j} \mid v)}\big[v^T W^{(1)} h^{(1)} + h^{(1)T} W^{(2)} h^{(2)}\big] = E_{h_{-j} \sim q(h_{-j} \mid v)}\big[\sum_i \sum_{j'} v_i W_{i,j'}^{(1)} h_{j'}^{(1)} + \sum_{j'} \sum_{k'} h_{j'}^{(1)} W_{j',k'}^{(2)} h_{k'}^{(2)}\big]$
$= \sum_{h_{-j}} \big[\sum_i \sum_{j'} v_i W_{i,j'}^{(1)} h_{j'}^{(1)} + \sum_{j'} h_{j'}^{(1)} W_{j',k}^{(2)} h_j + \sum_{j'} \sum_{k' \neq k} h_{j'}^{(1)} W_{j',k'}^{(2)} h_{k'}^{(2)}\big]\, q(h_{-j} \mid v)$
Thus, $\tilde{q}(h_j = 1 \mid v) = \tilde{q}(h_k^{(2)} = 1 \mid v) \propto \exp\Big(\sum_{h_{-j}} \big(\sum_{j'} h_{j'}^{(1)} W_{j',k}^{(2)}\big) q(h_{-j} \mid v)\Big) = \exp\big[\sum_{j'} \hat{h}_{j'}^{(1)} W_{j',k}^{(2)}\big]$; similarly, $\tilde{q}(h_j = 0 \mid v) = \tilde{q}(h_k^{(2)} = 0 \mid v) \propto \exp[0]$
Thus, $q(h_k^{(2)} = 1 \mid v) = \frac{\tilde{q}(h_k^{(2)} = 1 \mid v)}{\tilde{q}(h_k^{(2)} = 0 \mid v) + \tilde{q}(h_k^{(2)} = 1 \mid v)} = \frac{\exp\{\sum_{j'} \hat{h}_{j'}^{(1)} W_{j',k}^{(2)}\}}{\exp\{0\} + \exp\{\sum_{j'} \hat{h}_{j'}^{(1)} W_{j',k}^{(2)}\}} = \sigma\big(\sum_{j'} W_{j',k}^{(2)} \hat{h}_{j'}^{(1)}\big)$
DBM Parameter Learning
Learning in a DBM must confront both the challenge of an intractable partition function and the challenge of an intractable posterior distribution
Variational inference allows the construction of a distribution $Q(h \mid v)$ that approximates the intractable $P(h \mid v)$. Learning then proceeds by maximizing $\mathcal{L}(v, Q, \theta)$, the variational lower bound on the intractable log-likelihood $\log P(v; \theta)$
DBM Parameter Learning
For a DBM with 2 hidden layers, the ELBO $\mathcal{L}$ is given by
$\mathcal{L}(v, Q, \theta) = \sum_i \sum_{j'} v_i W_{i,j'}^{(1)} \hat{h}_{j'}^{(1)} + \sum_{j'} \sum_{k'} \hat{h}_{j'}^{(1)} W_{j',k'}^{(2)} \hat{h}_{k'}^{(2)} - \log Z(\theta) + H(Q)$
(Proof) $\mathcal{L}(v, \theta, Q) = E_{h \sim Q}[\log p(h, v)] + H(Q) = E_{h \sim Q}[\log \tilde{p}(h, v)] - \log Z(\theta) + H(Q)$.
Note that $E_{h \sim Q}[\log \tilde{p}(h, v)] = E_{h \sim Q}\big[v^T W^{(1)} h^{(1)} + h^{(1)T} W^{(2)} h^{(2)}\big] = E_{h \sim Q}\big[\sum_i \sum_{j'} v_i W_{i,j'}^{(1)} h_{j'}^{(1)} + \sum_{j'} \sum_{k'} h_{j'}^{(1)} W_{j',k'}^{(2)} h_{k'}^{(2)}\big] = \sum_h \big[\sum_i \sum_{j'} v_i W_{i,j'}^{(1)} h_{j'}^{(1)} + \sum_{j'} \sum_{k'} h_{j'}^{(1)} W_{j',k'}^{(2)} h_{k'}^{(2)}\big] Q(h \mid v) = \sum_i \sum_{j'} v_i W_{i,j'}^{(1)} \hat{h}_{j'}^{(1)} + \sum_{j'} \sum_{k'} \hat{h}_{j'}^{(1)} W_{j',k'}^{(2)} \hat{h}_{k'}^{(2)}$
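A sketch that evaluates the tractable part of this bound for a factorial $Q$ (the energy expectation plus the entropy of $Q$), assuming NumPy; the $-\log Z(\theta)$ term is deliberately left out because, as the next slide notes, it is intractable and must be approximated separately.

```python
import numpy as np

def dbm_elbo_without_logZ(v, h1_hat, h2_hat, W1, W2, eps=1e-12):
    """Tractable part of the variational bound for a 2-hidden-layer DBM:
    E_{h~Q}[-E(v, h)] + H(Q), omitting the intractable -log Z(theta) term."""
    energy_term = v @ W1 @ h1_hat + h1_hat @ W2 @ h2_hat
    # Entropy of the fully factorial Bernoulli distribution Q
    q = np.concatenate([h1_hat, h2_hat])
    entropy = -np.sum(q * np.log(q + eps) + (1 - q) * np.log(1 - q + eps))
    return energy_term + entropy
```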
DBM Parameter Learning
The bound still contains the log partition function $\log Z(\theta)$; evaluating it requires an approximation technique such as annealed importance sampling (AIS)
Also, training the model requires the gradient of the log partition function; typically SML is used
Another popular method for DBM parameter learning is greedy layer-wise pretraining