Advanced Deep Learning
Deep Generative Models - 2
U Kang
Seoul National University
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Motivation
■ Typical neural networks implement a deterministic transformation of some input variables x
■ In developing generative models, we often wish to extend neural networks to implement stochastic transformations of x
■ One simple way to do this is to augment the neural network with extra inputs z that are sampled from a simple probability distribution, such as a uniform or Gaussian distribution
■ The function f(x, z) will appear stochastic to an observer who does not have access to z
■ Provided that f is continuous and differentiable, we can compute the gradients needed for training with back-propagation, as usual
Back-Prop Through Sampling
■ Consider the operation of drawing a sample y ~ N(μ, σ²)
■ It may seem counterintuitive to differentiate y wrt the parameters of its distribution (μ and σ²)
■ However, we can use the reparameterization trick, which rewrites the sampling process as a transformation of an underlying random value z ~ N(0, 1) into a sample from the desired distribution: y = μ + σz
■ Now we can back-propagate through the sampling operation, by regarding it as a deterministic operation with an extra input z
■ Crucially, the extra input must be a random variable that is not a function of any of the parameters we differentiate y against
Back-Prop Through Sampling
■ The result (e.g., ∂y/∂μ or ∂y/∂σ) tells us how an infinitesimal change in μ or σ would change the output if we could repeat the sampling operation with the same value of z
■ Being able to back-propagate through the sampling operation allows us to incorporate it into a larger graph
■ We can build elements of the graph on top of the output of the sampling operation; we can also build elements of the graph whose outputs are the inputs or parameters of the sampling operation
■ E.g., we could build a larger graph with μ = f(x; θ) and σ = g(x; θ), and back-propagate the gradient of a loss on y = μ + σz all the way to θ
Reparametrization
■ Reparametrization is also known as stochastic back-propagation or perturbation analysis
❑ Consider a probability distribution p(y | θ, x) = p(y | ω), where ω is a variable containing both the parameters θ and the input x
❑ We rewrite y ~ p(y | ω) as y = f(z; ω), where z is a source of randomness
❑ We may then differentiate y with respect to ω using back-propagation applied to f, so long as f is continuous and differentiable (a short code sketch follows below)
❑ Crucially, ω must not be a function of z, and z must not be a function of ω
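A minimal sketch of the reparameterization trick, written in Python with PyTorch (an assumed library choice, not part of the slides). It differentiates y = μ + σz with respect to μ and σ while holding the drawn noise z fixed.

import torch

# Parameters of the distribution y ~ N(mu, sigma^2)
mu = torch.tensor(1.5, requires_grad=True)
sigma = torch.tensor(0.8, requires_grad=True)

# Extra input: noise that is not a function of mu or sigma
z = torch.randn(())

# Rewrite sampling as a deterministic transform of (mu, sigma, z)
y = mu + sigma * z

# Back-propagate through the sampling operation
y.backward()
print(mu.grad)     # dy/dmu = 1
print(sigma.grad)  # dy/dsigma = z, the noise value that was drawn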
Example: VAE
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Other Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
VAE
GAN
Variational Autoencoder (VAE)
■ A generative model that uses learned approximate inference and can be trained purely with gradient-based methods
■ To generate a sample from the model, the VAE first draws a sample z from the code distribution p_model(z); the sample is then run through a differentiable generator network g(z)
■ Finally, x is sampled from a distribution p_model(x; g(z)) = p_model(x | z)
■ During training, however, the approximate inference network (or encoder) q(z | x) is used to obtain z, and p_model(x | z) is then viewed as a decoder network
Variational Autoencoder (VAE)
[Figure: data x → encoder q_φ(z | x) → latent code z → decoder p_θ(x | z) → reconstruction x̂]
Variational Autoencoder (VAE)
[Figure: the encoder network q(z | x) maps x to μ_{z|x} and Σ_{z|x}; a latent code is sampled as z | x ~ N(μ_{z|x}, Σ_{z|x}); the decoder network p_θ(x | z) maps z to μ_{x|z} and Σ_{x|z}; the reconstruction is sampled as x | z ~ N(μ_{x|z}, Σ_{x|z})]
Objective Function of VAE
■ The VAE is trained by maximizing the ELBO L(q) associated with any data point x:
  L(q) = E_{z~q(z|x)}[ log p_model(z, x) ] + H( q(z | x) )   (1)
       = E_{z~q(z|x)}[ log p_model(x | z) ] − D_KL( q(z | x) || p_model(z) )   (2)
❑ In eq. (1), the first term is the joint log-likelihood of the visible and hidden variables, and the second term is the entropy of the approximate posterior. When q is chosen to be a Gaussian distribution, maximizing L encourages q to place high probability mass on the many z values that could have generated x, rather than collapsing to a single point estimate of the most likely value
Objective Function of VAE
■ The VAE is trained by maximizing the ELBO L(q) associated with any data point x:
  L(q) = E_{z~q(z|x)}[ log p_model(x | z) ] − D_KL( q(z | x) || p_model(z) )   (2)
❑ In eq. (2), the first term can be viewed as the reconstruction log-likelihood found in other autoencoders, and the second term encourages the approximate posterior q(z | x) and the model prior p_model(z) to be similar (a code sketch of this objective follows below)
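A minimal sketch of eq. (2) in code, assuming (as on the later slides) a diagonal-Gaussian encoder q(z | x) = N(μ, diag(σ²)), the standard normal prior N(0, I), and a simplified Bernoulli-style decoder; the names elbo, decoder, and log_var are illustrative, not from the slides.

import torch
import torch.nn.functional as F

def elbo(x, mu, log_var, decoder):
    # Reparameterized sample z ~ q(z|x) = N(mu, diag(exp(log_var)))
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)

    # First term of eq. (2): reconstruction log-likelihood, estimated with one
    # Monte Carlo sample (negative binary cross-entropy for a Bernoulli decoder
    # over pixel values in [0, 1])
    x_hat = decoder(z)
    recon = -F.binary_cross_entropy(x_hat, x, reduction="sum")

    # Second term of eq. (2): D_KL( q(z|x) || N(0, I) ), which has a closed form
    # for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    # Maximize L(q) = recon - kl (equivalently, minimize its negative)
    return recon - kl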
Main Idea of VAE
■ Train a parametric encoder that produces the parameters of q
■ So long as z is a continuous variable, we can back-propagate through samples of z drawn from q(z | x) = q(z; f(x; θ)) in order to update θ
❑ Back-propagation through a random operation is used in the middle
■ Learning then consists solely of maximizing L wrt the parameters of the encoder and decoder. All of the expectations in L can be approximated by Monte Carlo sampling
Main Idea of VAE
■ The encoder and decoder are feedforward neural networks whose outputs are the parameters of Gaussians
❑ I.e., q(z | x) = N(μ_{z|x}, Σ_{z|x}), where μ_{z|x} and Σ_{z|x} are outputs of the encoder network; p(x | z) = N(μ_{x|z}, Σ_{x|z}), where μ_{x|z} and Σ_{x|z} are outputs of the decoder network
■ Also, z is given a simple Gaussian prior N(0, I), which makes generating samples easy (see the code sketch below)
❑ q(z | x) is modeled to be as similar as possible to N(0, I)
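A minimal sketch of such an encoder-decoder pair in PyTorch. The layer sizes (x_dim, h_dim, z_dim) are illustrative assumptions, and for simplicity the decoder here outputs a Bernoulli mean per pixel rather than a full Gaussian (μ_{x|z}, Σ_{x|z}).

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # outputs mu_{z|x}
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # outputs log of diagonal Sigma_{z|x}
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),  # simplified Bernoulli-style decoder
        )

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization: z = mu + std * eps with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.dec(z), mu, log_var

# Generating samples (next slide): draw z ~ N(0, I) and run only the decoder
# model = VAE(); x_new = model.dec(torch.randn(1, 20))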
Generating Samples from VAE
[Figure: sample z ~ N(0, I); the decoder network p_θ(x | z) maps z to μ_{x|z} and Σ_{x|z}; sample x | z ~ N(μ_{x|z}, Σ_{x|z}) to obtain a generated image x̂]
VAE Example
VAE: Discussion
■ Strengths of VAE
❑ Principled approach to generative modeling
❑ Allows inference of q(z | x), which extracts hidden features
❑ Simple to implement
■ Weaknesses of VAE
❑ Approximate inference: i.e., optimizes the lower bound (ELBO)
❑ Samples from a VAE trained on images tend to be blurry
Outline
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Other Boltzmann Machines
Back-Propagation through Random Operations
Directed Generative Nets
Generative Adversarial Network
■ Generative Adversarial Network (GAN)
❑ Another model based on a differentiable generator network
❑ Based on a game-theoretic scenario in which the generator network competes against an adversary
❑ The generator network produces samples x = g(z; θ^(g)). Its adversary, the discriminator network, attempts to distinguish samples drawn from the training data from samples produced by the generator
❑ The discriminator emits a probability value d(x; θ^(d)), the probability that x is a real training example rather than a fake sample (a minimal sketch of the two networks follows below)
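A minimal sketch of the two networks in PyTorch, assuming fully connected layers and illustrative dimensions (z_dim, x_dim); practical GANs typically use convolutional architectures instead.

import torch.nn as nn

z_dim, x_dim = 100, 784

# Generator g(z; theta_g): maps noise z to a sample x
generator = nn.Sequential(
    nn.Linear(z_dim, 256), nn.ReLU(),
    nn.Linear(256, x_dim), nn.Tanh(),
)

# Discriminator d(x; theta_d): outputs the probability that x is real
discriminator = nn.Sequential(
    nn.Linear(x_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)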
Generative Adversarial Network
■ Generator: tries to fool the discriminator
■ Discriminator: tries to distinguish real from fake images
[Figure: the generator produces fake images; the discriminator network receives both fake images and real images and outputs real or fake]
Learning in GAN
■ Learning is formulated as a zero-sum game, where a function v(θ^(g), θ^(d)) determines the payoff of the discriminator. The generator receives −v(θ^(g), θ^(d)) as its own payoff
■ During learning, each player attempts to maximize its own payoff, so that at convergence
❑ g* = arg min_g max_d v(g, d) = arg min_g max_d [ E_{x~p_data} log d(x) + E_{x~p_model} log(1 − d(x)) ]
❑ Note that the discriminator is a function of θ^(d) and the generator is a function of θ^(g)
■ This formulation drives the discriminator to learn to correctly classify samples as real or fake, and trains the generator to fool the discriminator into believing its samples are real
■ At convergence, the generator's samples are indistinguishable from real data, and the discriminator outputs ½ everywhere; the discriminator may then be discarded
Learning in GAN
■ Objective function
❑ g* = arg min_g max_d v(g, d) = arg min_g max_d [ E_{x~p_data} log d(x) + E_{x~p_model} log(1 − d(x)) ]
❑ Note that the discriminator is a function of θ^(d) and the generator is a function of θ^(g)
■ Algorithm: alternate the following two steps
❑ Step 1: update θ^(d) to maximize the objective E_{x~p_data} log d(x) + E_{x~p_model} log(1 − d(x))
❑ Step 2: update θ^(g) to minimize the objective E_{x~p_model} log(1 − d(x))
Learning in GAN
■ Algorithm: alternate the following two steps
❑ Step 1: update θ^(d) to maximize the objective E_{x~p_data} log d(x) + E_{x~p_model} log(1 − d(x))
❑ Step 2: update θ^(g) to minimize the objective E_{x~p_model} log(1 − d(x))
■ In practice, Step 2 is reformulated as follows (a code sketch of the full update follows below)
❑ Step 2: update θ^(g) to maximize the objective E_{x~p_model} log d(x)
❑ The reason is that the gradient of log(1 − d(x)) is small when d(x) is small, while the gradient of log d(x) is large, so the generator parameters are updated quickly
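A minimal sketch of one round of the alternating updates above, reusing the generator and discriminator sketched earlier; real_batch, the optimizer settings, and the small epsilon for numerical stability are assumptions for illustration.

import torch

d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
eps = 1e-8  # avoid log(0)

def gan_step(real_batch, z_dim=100):
    n = real_batch.size(0)

    # Step 1: update theta_d to maximize E_data[log d(x)] + E_model[log(1 - d(g(z)))]
    fake = generator(torch.randn(n, z_dim)).detach()   # no gradient into the generator here
    d_loss = -(torch.log(discriminator(real_batch) + eps).mean()
               + torch.log(1 - discriminator(fake) + eps).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Step 2 (non-saturating form): update theta_g to maximize E_model[log d(g(z))]
    g_loss = -torch.log(discriminator(generator(torch.randn(n, z_dim))) + eps).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()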
Samples from GAN
Case Study: CycleGAN
■ Given any two unordered image collections X and Y, CycleGAN learns to automatically translate an image from one collection into the other and vice versa
[Zhu and Park et al., Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2017]
Case Study: CycleGAN
■ Formulation
❑ The model includes two mappings G: X → Y and F: Y → X
❑ Two adversarial discriminators: D_X (to distinguish between images {x} and translated images {F(y)}) and D_Y (to distinguish between images {y} and translated images {G(x)})
❑ The objective function contains two types of loss: an adversarial loss for matching the distribution of generated images to the data distribution in the target domain, and a cycle-consistency loss to prevent the learned mappings G and F from contradicting each other
Case Study: CycleGAN
■ Adversarial loss
❑ For the mapping function G: X → Y and its discriminator D_Y, the objective is L_GAN(G, D_Y, X, Y) = E_{y~p_data(y)}[ log D_Y(y) ] + E_{x~p_data(x)}[ log(1 − D_Y(G(x))) ]
❑ G tries to generate images G(x) that look similar to images from domain Y, while D_Y aims to distinguish between translated samples G(x) and real samples y
❑ We find the best G and D_Y by optimizing min_G max_{D_Y} L_GAN(G, D_Y, X, Y)
❑ A similar adversarial loss is defined for the mapping function F: Y → X and its discriminator D_X, with objective min_F max_{D_X} L_GAN(F, D_X, Y, X)
Case Study: CycleGAN
■ Cycle-consistency loss
❑ The adversarial losses alone cannot guarantee that the learned functions map an individual input x_i to a desired output y_i; e.g., multiple images in X may be mapped to the same image in Y
❑ CycleGAN therefore introduces a cycle-consistency loss
■ Forward cycle consistency: for each image x from domain X, the image translation cycle should bring x back to the original, i.e., x → G(x) → F(G(x)) ≈ x
■ Backward cycle consistency: for each image y from domain Y, the image translation cycle should bring y back to the original, i.e., y → F(y) → G(F(y)) ≈ y
Case Study: CycleGAN
■ Full objective function
❑ Objective: L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F)
❑ Goal: solve G*, F* = arg min_{G,F} max_{D_X, D_Y} L(G, F, D_X, D_Y) (a code sketch of this objective follows below)
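A minimal sketch of the full objective in code. G, F, D_X, D_Y are assumed to be image translation networks and discriminators defined elsewhere; the L1 form of the cycle-consistency loss and the weight lambda_cyc = 10 follow the CycleGAN paper.

import torch

def cyclegan_objective(x, y, G, F, D_X, D_Y, lambda_cyc=10.0):
    # Adversarial losses L_GAN(G, D_Y, X, Y) and L_GAN(F, D_X, Y, X)
    loss_gan_G = torch.log(D_Y(y)).mean() + torch.log(1 - D_Y(G(x))).mean()
    loss_gan_F = torch.log(D_X(x)).mean() + torch.log(1 - D_X(F(y))).mean()

    # Cycle-consistency loss: F(G(x)) should recover x, and G(F(y)) should recover y
    loss_cyc = (F(G(x)) - x).abs().mean() + (G(F(y)) - y).abs().mean()

    # Full objective L(G, F, D_X, D_Y): D_X and D_Y maximize it, G and F minimize it
    return loss_gan_G + loss_gan_F + lambda_cyc * loss_cyc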
Case Study: CycleGAN
Case Study: CycleGAN
Case Study: CycleGAN
GAN: Discussion
■ Strengths of GAN
❑ State-of-the-art image samples
■ Weaknesses of GAN
❑ Unstable to train
❑ Cannot perform inference queries p(z | x)
What you need to know
■ Inference as Optimization
■ Expectation Maximization
■ MAP Inference and Sparse Coding
■ Variational Inference and Learning
Questions?