Advanced Deep Learning
Deep Generative Models
U Kang
Seoul National University
In This Lecture
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Boltzmann Machines
Model a distribution over binary vectors
Undirected probabilistic graphical models
May be stacked to form deeper models
Energy-based model (recall the energy function $E(x)$)
Joint Probability Distribution
We consider a binary random vector 𝑥
Then, the joint probability distribution is
$p(x) = \frac{\exp(-E(x))}{Z}$
where $E(x)$ is the energy function, and $Z$ is the partition function that ensures $\sum_x p(x) = 1$
The Energy Function
The energy function of the BM is given by
$E(x) = -x^T U x - b^T x$
where $U$ is the weight matrix and $b$ is the bias vector
Represented as a fully-connected network
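For concreteness, here is a minimal NumPy sketch of this energy computation; the helper name bm_energy and the toy sizes are assumptions, not part of the original material.

```python
import numpy as np

def bm_energy(x, U, b):
    """Boltzmann machine energy E(x) = -x^T U x - b^T x."""
    return -x @ U @ x - b @ x

rng = np.random.default_rng(0)
n = 4
U = np.triu(rng.normal(size=(n, n)), k=1)    # one weight per pair of units, no self-connections
b = rng.normal(size=n)
x = rng.integers(0, 2, size=n).astype(float) # a binary configuration
print(bm_energy(x, U, b))                    # scalar energy of this configuration
```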
Latent Variables (1)
What if not all the variables are observed?
Then, some of the variables are latent
The latent ones act similarly to hidden units
Latent Variables (2)
We decompose the units 𝑥 into two subsets:
the visible units 𝑣
the latent (or hidden) units ℎ
The energy function $E(v, h)$ becomes
$E(v, h) = -v^T R v - v^T W h - h^T S h - b^T v - c^T h$
which has a structure similar to that of $E(x)$
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Restricted BMs
Common building blocks of deep probabilistic models
Contain the following layers:
a layer of observable variables 𝑣
a single layer of latent variables ℎ
No connections are permitted within the same layer!
Examples
(Figure) (a) The structure of a restricted Boltzmann machine (RBM); (c) the structure of a deep Boltzmann machine
Joint Probability Distribution
The joint probability distribution is
$p(v, h) = \frac{\exp(-E(v, h))}{Z}$
which has the same form as the one for a general BM
The Energy Function
The energy function for an RBM is
$E(v, h) = -b^T v - c^T h - v^T W h$
Compare it to the non-restricted one:
$E(v, h) = -v^T R v - v^T W h - h^T S h - b^T v - c^T h$
The Partition Function
The partition function is defined as
$Z = \sum_v \sum_h \exp\big(-E(v, h)\big)$
which is intractable
This also means that $p(v) = \sum_h p(h, v) = \frac{1}{Z} \sum_h \exp\big(-E(v, h)\big)$ is intractable to evaluate
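To make the intractability concrete, here is a sketch that computes $Z$ by brute force for an assumed toy-sized RBM; the number of terms grows as $2^{n_v + n_h}$, which is exactly what rules this out for realistic models. The helper names are assumptions.

```python
import itertools
import numpy as np

def rbm_energy(v, h, W, b, c):
    """RBM energy E(v, h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

def brute_force_Z(W, b, c):
    """Exact partition function by enumerating every (v, h): 2^(nv + nh) terms."""
    nv, nh = W.shape
    Z = 0.0
    for v in itertools.product([0.0, 1.0], repeat=nv):
        for h in itertools.product([0.0, 1.0], repeat=nh):
            Z += np.exp(-rbm_energy(np.array(v), np.array(h), W, b, c))
    return Z

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 3))                  # feasible only because the model is tiny
print(brute_force_Z(W, np.zeros(4), np.zeros(3)))  # already 2^(4+3) = 128 terms
```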
Conditional Distributions (1)
Though $p(v)$ is intractable, the bipartite structure of the RBM allows $P(h \mid v)$ and $P(v \mid h)$ to be easily computed and sampled from
We start by computing $p(h_j = 1 \mid v)$:
$p(h \mid v) = \frac{p(h, v)}{p(v)} = \frac{1}{p(v)} \frac{1}{Z} \exp\{b^T v + c^T h + v^T W h\} = \frac{1}{Z'} \exp\{c^T h + v^T W h\} = \frac{1}{Z'} \prod_{j=1}^{n_h} \exp\{c_j h_j + v^T W_{:,j} h_j\}$
$p(h_j = 1 \mid v) = \frac{\tilde{p}(h_j = 1 \mid v)}{\tilde{p}(h_j = 0 \mid v) + \tilde{p}(h_j = 1 \mid v)} = \frac{\exp\{c_j + v^T W_{:,j}\}}{\exp\{0\} + \exp\{c_j + v^T W_{:,j}\}} = \sigma\big(c_j + v^T W_{:,j}\big)$
Conditional Distributions (2)
Then, we express the full conditionals as the factorial distribution
$P(h \mid v) = \prod_{j=1}^{n_h} \sigma\big((2h - 1) \odot (c + W^T v)\big)_j$
$P(v \mid h) = \prod_{i=1}^{n_v} \sigma\big((2v - 1) \odot (b + W h)\big)_i$
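A minimal NumPy sketch of these conditionals, assuming $W$ has shape $(n_v, n_h)$; the helper names are assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, c, rng):
    """P(h_j = 1 | v) = sigma(c_j + v^T W_{:,j}); the h_j are independent given v."""
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b, rng):
    """P(v_i = 1 | h) = sigma(b_i + W_{i,:} h); the v_i are independent given h."""
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float), p
```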
Training RBM
RBM admits
Efficient evaluation and differentiation of the unnormalized probability $\tilde{p}(v)$
Efficient MCMC sampling in the form of block Gibbs sampling: it is easy to sample each $h_j$ using $P(h_j = 1 \mid v) = \sigma\big(c_j + v^T W_{:,j}\big)$, and each $h_j$ can be sampled independently
RBMs can readily be trained with any of the techniques (e.g., CD, SML) for training models with intractable partition functions (see the CD-1 sketch below)
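As an illustration, a sketch of one CD-1 update built on the sampling helpers from the previous sketch; the single-example update, learning rate handling, and helper names are assumptions.

```python
def cd1_step(v0, W, b, c, lr, rng):
    """One contrastive-divergence (CD-1) update on a single visible vector v0."""
    h0, p_h0 = sample_h_given_v(v0, W, c, rng)     # positive phase
    v1, _    = sample_v_given_h(h0, W, b, rng)     # one step of block Gibbs
    _, p_h1  = sample_h_given_v(v1, W, c, rng)     # negative-phase statistics
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b += lr * (v0 - v1)
    c += lr * (p_h0 - p_h1)
    return W, b, c
```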
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Deep Belief Networks
The introduction of deep belief networks (DBNs) began the current deep learning renaissance
Contain several layers of latent variables
Contain no intra-layer connections
Undirected connections between the top 2 layers
Directed connections between all other layers
Graph Structure
A DBN with one visible and two hidden layers
It can be represented as a graphical model with inter-layer connections
Parameters
A DBN with 𝑙 hidden layers contains
$l$ weight matrices $W^{(1)}, \dots, W^{(l)}$
$l + 1$ bias vectors $b^{(0)}, \dots, b^{(l)}$
$b^{(0)}$ provides the biases for the visible layer
The Probability Distribution (1)
The probability dist. consists of three parts
The first distribution is given by
$P(h^{(l)}, h^{(l-1)}) \propto \exp\big(b^{(l)T} h^{(l)} + b^{(l-1)T} h^{(l-1)} + h^{(l-1)T} W^{(l)} h^{(l)}\big)$
It provides the joint distribution between the top two hidden layers $h^{(l)}$ and $h^{(l-1)}$
The Probability Distribution (2)
The second distribution is given by
$P(h_i^{(k)} = 1 \mid h^{(k+1)}) = \sigma\big(b_i^{(k)} + W_{:,i}^{(k+1)T} h^{(k+1)}\big)$
It provides the conditional distribution for the activation of a hidden layer $h^{(k)}$ given $h^{(k+1)}$, for $k = l-2, \dots, 1$
The Probability Distribution (3)
The third distribution is given by
$P(v_i = 1 \mid h^{(1)}) = \sigma\big(b_i^{(0)} + W_{:,i}^{(1)T} h^{(1)}\big)$
It provides the conditional distribution for the activation of the visible layer $v$ given $h^{(1)}$
In the case of real-valued visible units, $v \sim \mathcal{N}\big(v;\, b^{(0)} + W^{(1)T} h^{(1)},\, \beta^{-1}\big)$
Deep Belief Network
Generating a sample from DBN
First, run several steps of Gibbs sampling on the top two hidden layers
Then, use a single pass of ancestral sampling through the rest of the model to draw a sample from the visible units
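A rough sketch of this sampling procedure, assuming NumPy, the sigmoid helper from the RBM sketch above, and a parameter layout where weights[k] connects layer k to layer k+1 (layer 0 is the visible layer) and biases[k] is the bias of layer k; this matches the l matrices and l+1 bias vectors listed earlier.

```python
def sample_dbn(weights, biases, n_gibbs, rng):
    """Draw one sample from a DBN: Gibbs sampling in the top RBM (layers l-1 and l),
    then a single ancestral pass down through the directed layers to the visible units."""
    l = len(weights)                                   # number of hidden layers
    # Gibbs sampling on the top two layers, which form an RBM with weight matrix weights[-1]
    W_top, b_low, b_top = weights[-1], biases[l - 1], biases[l]
    h_low = (rng.random(W_top.shape[0]) < 0.5).astype(float)
    for _ in range(n_gibbs):
        h_top = (rng.random(W_top.shape[1]) < sigmoid(b_top + h_low @ W_top)).astype(float)
        h_low = (rng.random(W_top.shape[0]) < sigmoid(b_low + W_top @ h_top)).astype(float)
    # Ancestral pass through the directed layers down to the visible layer
    h = h_low
    for k in range(l - 2, -1, -1):                     # layers l-2, ..., 1, then the visible layer
        p = sigmoid(biases[k] + weights[k] @ h)
        h = (rng.random(p.shape) < p).astype(float)
    return h                                           # sample of the visible units
```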
Training DBN
Intractability
Inference in a DBN is intractable due to the “explaining away” effect within each directed layer and the interaction between the two hidden layers with undirected connections at the top
Training DBN
Training DBN: greedy layer-wise training
First, train an RBM to maximize $E_{v \sim p_{data}} \log p(v)$ using CD or SML; the learned parameters define the parameters of the first layer of the DBN
Next, a second RBM is trained to approximately maximize $E_{v \sim p_{data}} E_{h^{(1)} \sim p^{(1)}(h^{(1)} \mid v)} \log p^{(2)}(h^{(1)})$
This procedure can be repeated indefinitely to add as many layers to the DBN as desired (see the sketch below)
The learned weights are used as the parameters of the DBN
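A sketch of this greedy layer-wise procedure, reusing the cd1_step and sigmoid helpers from the RBM sketches above; the training loop, deterministic "up" pass, and hyperparameters are illustrative assumptions.

```python
def greedy_pretrain_dbn(data, hidden_sizes, n_epochs, lr, rng):
    """Train one RBM per hidden layer with CD-1, feeding each RBM's hidden-unit
    probabilities to the next RBM. data: (n_samples, n_visible); hidden_sizes: widths of h^(1)..h^(l)."""
    weights, biases = [], []
    inputs = data
    for k, n_hidden in enumerate(hidden_sizes):
        n_visible = inputs.shape[1]
        W = 0.01 * rng.normal(size=(n_visible, n_hidden))
        b, c = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(n_epochs):
            for v in inputs:
                W, b, c = cd1_step(v, W, b, c, lr, rng)
        if k == 0:
            biases.append(b)                  # visible bias b^(0) comes from the first RBM
        weights.append(W)
        biases.append(c)                      # hidden bias of this RBM becomes b^(k+1)
        inputs = sigmoid(c + inputs @ W)      # deterministic "up" pass to train the next layer
    return weights, biases                    # l weight matrices and l+1 bias vectors, as on the slide
```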
Using the Trained DBN
DBN may be used for
Generating samples from the model
Improving classification: take the weights from the DBN and use them to define an MLP; after this initialization, the MLP is fine-tuned on a classification task (a sketch follows below)
$h^{(1)} = \sigma\big(b^{(1)} + v^T W^{(1)}\big)$
$h^{(l)} = \sigma\big(b^{(l)} + h^{(l-1)T} W^{(l)}\big)$ for $l = 2, \dots, m$
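A sketch of this deterministic forward pass, reusing the sigmoid helper and the (weights, biases) layout from the pretraining sketch above; a classifier head and fine-tuning step are left out.

```python
def dbn_mlp_forward(v, weights, biases):
    """Deterministic MLP pass defined by DBN weights:
    h^(1) = sigma(b^(1) + v^T W^(1)),  h^(l) = sigma(b^(l) + h^(l-1)^T W^(l))."""
    h = sigmoid(biases[1] + v @ weights[0])
    for l in range(1, len(weights)):
        h = sigmoid(biases[l + 1] + h @ weights[l])
    return h   # features on which a classification layer can be fine-tuned
```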
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Deep Boltzmann Machines
Another kind of deep generative model
Entirely undirected models unlike DBNs
Have several layers of latent variables
Graph Structure
A DBM with one visible and two hidden layers
There are no intra-layer connections
Joint Probability Distribution
A DBM is also an energy-based model
The joint probability distribution is given by
$P(v, h^{(1)}, h^{(2)}) = \frac{1}{Z(\theta)} \exp\big(-E(v, h^{(1)}, h^{(2)}; \theta)\big)$
in the case with one visible layer $v$ and two hidden layers $h^{(1)}$ and $h^{(2)}$, where
$E(v, h^{(1)}, h^{(2)}; \theta) = -v^T W^{(1)} h^{(1)} - h^{(1)T} W^{(2)} h^{(2)}$
Bias terms are omitted for simplicity
Bipartite Graph (1)
The DBM layers can be organized into a bipartite graph:
Bipartite Graph (2)
The variables in the odd (even) layers become conditionally independent when we condition on the variables in the even (odd) layers:
$P(v, h^{(2)} \mid h^{(1)}) = P(v \mid h^{(1)})\, P(h^{(2)} \mid h^{(1)}) = \prod_i P(v_i \mid h^{(1)}) \prod_j P(h_j^{(2)} \mid h^{(1)})$
Conditional Distributions
The activation probabilities are given by
$P(v_i = 1 \mid h^{(1)}) = \sigma\big(W_{i,:}^{(1)} h^{(1)}\big)$
$P(h_i^{(1)} = 1 \mid v, h^{(2)}) = \sigma\big(v^T W_{:,i}^{(1)} + W_{i,:}^{(2)} h^{(2)}\big)$
$P(h_k^{(2)} = 1 \mid h^{(1)}) = \sigma\big(h^{(1)T} W_{:,k}^{(2)}\big)$
Gibbs Sampling
The bipartite architecture in DBM makes Gibbs sampling efficient
In a DBM, Gibbs sampling updates all odd (or even) layers at once: iterate the following two steps (see the sketch below)
Sample all even layers (including the visible layer) simultaneously
Sample all odd layers simultaneously
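A sketch of one such block Gibbs sweep for the two-hidden-layer DBM above, assuming NumPy and the sigmoid helper from the RBM sketch; biases are omitted as on the slide, and the shapes assumed are W1: (n_v, n_1), W2: (n_1, n_2).

```python
def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    """One block Gibbs sweep in a two-hidden-layer DBM (biases omitted)."""
    # Even block: v and h^(2) are conditionally independent given h^(1)
    v  = (rng.random(v.shape)  < sigmoid(W1 @ h1)).astype(float)
    h2 = (rng.random(h2.shape) < sigmoid(h1 @ W2)).astype(float)
    # Odd block: h^(1) given v and h^(2)
    h1 = (rng.random(h1.shape) < sigmoid(v @ W1 + W2 @ h2)).astype(float)
    return v, h1, h2
```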
Inference in DBM
The distribution over all hidden layers does not factorize because of interactions between layers
In the example with two hidden layers,
$P(h^{(1)}, h^{(2)} \mid v)$ does not factorize because of the interaction weights $W^{(2)}$ between $h^{(1)}$ and $h^{(2)}$
Mean Field Inference
A simple form of variational inference
We try to approximate $P(h^{(1)}, h^{(2)} \mid v)$
But, we restrict the approximating distribution to fully factorial distributions
We attempt to find 𝑄 that best fits 𝑃
Mean Field Assumption
Let $Q(h^{(1)}, h^{(2)} \mid v)$ be the approximation
The mean field assumption implies that
$Q(h^{(1)}, h^{(2)} \mid v) = \prod_j Q(h_j^{(1)} \mid v) \prod_k Q(h_k^{(2)} \mid v)$
It restricts the search space (for efficiency)
Mean Field Approximation
The mean field approach is to minimize
$\mathrm{KL}(Q \| P) = \sum_h Q(h \mid v) \log \frac{Q(h \mid v)}{P(h \mid v)}$
which measures the difference between $Q$ and $P$
Parameterization
We associate the probability with a parameter
We parameterize $h^{(1)}$ and $h^{(2)}$ as
$\hat{h}_j^{(1)} = Q(h_j^{(1)} = 1 \mid v)$ and $\hat{h}_k^{(2)} = Q(h_k^{(2)} = 1 \mid v)$
where $\hat{h}_j^{(1)} \in [0, 1]$ and $\hat{h}_k^{(2)} \in [0, 1]$
Update Rules for Inference in DBM
We saw that the optimal $q(h_j \mid v)$ can be obtained by normalizing the unnormalized distribution
$\tilde{q}(h_j \mid v) = \exp\big(E_{h_{-j} \sim q(h_{-j} \mid v)} \log \tilde{p}(h, v)\big)$
Then, we obtain the update rules
$\hat{h}_j^{(1)} = \sigma\big(\sum_i v_i W_{i,j}^{(1)} + \sum_{k'} W_{j,k'}^{(2)} \hat{h}_{k'}^{(2)}\big), \; \forall j$
$\hat{h}_k^{(2)} = \sigma\big(\sum_{j'} W_{j',k}^{(2)} \hat{h}_{j'}^{(1)}\big), \; \forall k$
These equations define an iterative algorithm in which we alternate between updating $\hat{h}^{(1)}$ and $\hat{h}^{(2)}$ until convergence (see the sketch below)
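A sketch of this fixed-point iteration for the two-hidden-layer DBM, reusing NumPy and the sigmoid helper from earlier; the initialization at 0.5 and the fixed iteration count are assumptions, and biases are omitted as on the slides.

```python
def mean_field_inference(v, W1, W2, n_iters=10):
    """Mean field inference: alternate the fixed-point updates for h_hat^(1) and h_hat^(2)."""
    h1_hat = np.full(W1.shape[1], 0.5)            # Q(h_j^(1) = 1 | v), initialized at 0.5
    h2_hat = np.full(W2.shape[1], 0.5)            # Q(h_k^(2) = 1 | v)
    for _ in range(n_iters):
        h1_hat = sigmoid(v @ W1 + W2 @ h2_hat)    # update uses the current h2_hat
        h2_hat = sigmoid(h1_hat @ W2)             # update uses the freshly updated h1_hat
    return h1_hat, h2_hat
```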
Update Rules for Inference in DBM
Proof of the update rule $\hat{h}_k^{(2)} = \sigma\big(\sum_{j'} W_{j',k}^{(2)} \hat{h}_{j'}^{(1)}\big), \; \forall k$
First, compute $\tilde{q}(h_j \mid v) = \exp\big(E_{h_{-j} \sim q(h_{-j} \mid v)} \log \tilde{p}(h, v)\big)$, where $h_j = h_k^{(2)}$
$E_{h_{-j} \sim q(h_{-j} \mid v)} \log \tilde{p}(h, v) = E_{h_{-j} \sim q(h_{-j} \mid v)}\big[v^T W^{(1)} h^{(1)} + h^{(1)T} W^{(2)} h^{(2)}\big] = E_{h_{-j} \sim q(h_{-j} \mid v)}\big[\sum_i \sum_{j'} v_i W_{i,j'}^{(1)} h_{j'}^{(1)} + \sum_{j'} \sum_{k'} h_{j'}^{(1)} W_{j',k'}^{(2)} h_{k'}^{(2)}\big]$
$= \sum_{h_{-j}} \big[\sum_i \sum_{j'} v_i W_{i,j'}^{(1)} h_{j'}^{(1)} + \sum_{j'} h_{j'}^{(1)} W_{j',k}^{(2)} h_j + \sum_{j'} \sum_{k' \neq k} h_{j'}^{(1)} W_{j',k'}^{(2)} h_{k'}^{(2)}\big]\, q(h_{-j} \mid v)$
Thus, $\tilde{q}(h_j = 1 \mid v) = \tilde{q}(h_k^{(2)} = 1 \mid v) \propto \exp\Big(\sum_{h_{-j}} \big(\sum_{j'} h_{j'}^{(1)} W_{j',k}^{(2)}\big) q(h_{-j} \mid v)\Big) = \exp\big[\sum_{j'} \hat{h}_{j'}^{(1)} W_{j',k}^{(2)}\big]$; similarly, $\tilde{q}(h_j = 0 \mid v) = \tilde{q}(h_k^{(2)} = 0 \mid v) \propto \exp[0]$
Thus, $q(h_k^{(2)} = 1 \mid v) = \frac{\tilde{q}(h_k^{(2)} = 1 \mid v)}{\tilde{q}(h_k^{(2)} = 0 \mid v) + \tilde{q}(h_k^{(2)} = 1 \mid v)} = \frac{\exp\{\sum_{j'} \hat{h}_{j'}^{(1)} W_{j',k}^{(2)}\}}{\exp\{0\} + \exp\{\sum_{j'} \hat{h}_{j'}^{(1)} W_{j',k}^{(2)}\}} = \sigma\big(\sum_{j'} W_{j',k}^{(2)} \hat{h}_{j'}^{(1)}\big)$
DBM Parameter Learning
Learning in a DBM must confront both the challenge of an intractable partition function and the challenge of an intractable posterior distribution
Variational inference allows the construction of a distribution $Q(h \mid v)$ that approximates the intractable $P(h \mid v)$. Learning then proceeds by maximizing $\mathcal{L}(v, Q, \theta)$, the variational lower bound on the intractable log-likelihood $\log P(v; \theta)$
DBM Parameter Learning
For a DBM with 2 hidden layers, the ELBO $\mathcal{L}$ is given by
$\mathcal{L}(v, Q, \theta) = \sum_i \sum_{j'} v_i W_{i,j'}^{(1)} \hat{h}_{j'}^{(1)} + \sum_{j'} \sum_{k'} \hat{h}_{j'}^{(1)} W_{j',k'}^{(2)} \hat{h}_{k'}^{(2)} - \log Z(\theta) + H(Q)$
(Proof) $\mathcal{L}(v, \theta, Q) = E_{h \sim Q}[\log p(h, v)] + H(Q) = E_{h \sim Q}[\log \tilde{p}(h, v)] - \log Z(\theta) + H(Q)$.
Note that $E_{h \sim Q}[\log \tilde{p}(h, v)] = E_{h \sim Q}\big[v^T W^{(1)} h^{(1)} + h^{(1)T} W^{(2)} h^{(2)}\big] = E_{h \sim Q}\big[\sum_i \sum_{j'} v_i W_{i,j'}^{(1)} h_{j'}^{(1)} + \sum_{j'} \sum_{k'} h_{j'}^{(1)} W_{j',k'}^{(2)} h_{k'}^{(2)}\big] = \sum_h \big[\sum_i \sum_{j'} v_i W_{i,j'}^{(1)} h_{j'}^{(1)} + \sum_{j'} \sum_{k'} h_{j'}^{(1)} W_{j',k'}^{(2)} h_{k'}^{(2)}\big] Q(h \mid v) = \sum_i \sum_{j'} v_i W_{i,j'}^{(1)} \hat{h}_{j'}^{(1)} + \sum_{j'} \sum_{k'} \hat{h}_{j'}^{(1)} W_{j',k'}^{(2)} \hat{h}_{k'}^{(2)}$
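A sketch that evaluates the tractable part of this bound for a factorial $Q$ (the energy expectation plus the entropy of $Q$), assuming NumPy; the $-\log Z(\theta)$ term is deliberately left out because, as the next slide notes, it is intractable and must be approximated separately.

```python
import numpy as np

def dbm_elbo_without_logZ(v, h1_hat, h2_hat, W1, W2, eps=1e-12):
    """Tractable part of the variational bound for a 2-hidden-layer DBM:
    E_{h~Q}[-E(v, h)] + H(Q), omitting the intractable -log Z(theta) term."""
    energy_term = v @ W1 @ h1_hat + h1_hat @ W2 @ h2_hat
    # Entropy of the fully factorial Bernoulli distribution Q
    q = np.concatenate([h1_hat, h2_hat])
    entropy = -np.sum(q * np.log(q + eps) + (1 - q) * np.log(1 - q + eps))
    return energy_term + entropy
```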
DBM Parameter Learning
The bound still contains the log partition function $\log Z(\theta)$; evaluating it requires an approximation technique such as annealed importance sampling (AIS)
Also, training the model requires the gradient of the log partition function; typically SML is used
Another popular method for DBM parameter learning is greedy layer-wise pretraining