
(1)

Advanced Deep Learning

Deep Generative Models - 2

U Kang

Seoul National University

(2)

Outline

Boltzmann Machines

Restricted Boltzmann Machines

Deep Belief Networks

Deep Boltzmann Machines

Back-Propagation through Random Operations

Directed Generative Nets

(3)

Motivation

■ Typical neural networks implement a deterministic transformation of some input variables x

■ In developing generative models, we often wish to extend neural networks to implement stochastic transformations of x

■ One simple way to do this is to augment the neural network with extra inputs z that are sampled from a simple probability distribution, such as a uniform or Gaussian distribution

■ The function f(x, z) will appear stochastic to an observer who does not have access to z

■ Provided that f is continuous and differentiable, we can compute the gradients needed for training with back-propagation, as usual

(4)

Back-Prop Through Sampling

■ Consider the operation of drawing a sample y ~ N(μ, σ²)

■ It may seem counterintuitive to differentiate y with respect to the parameters of its distribution (μ and σ²)

■ However, we can use the reparameterization trick: rewrite the sampling process as transforming an underlying random value z ~ N(0, 1) to obtain a sample from the desired distribution: y = μ + σz

■ Now we can back-propagate through the sampling operation by regarding it as a deterministic operation with an extra input z (see the sketch below)

■ The extra input must be a random variable that is not a function of any of the variables whose derivatives we want to compute
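The following is a minimal sketch of this trick, assuming PyTorch (not part of the slides): the sample is rewritten as a deterministic function of μ, σ, and an extra noise input z, so automatic differentiation can return ∂y/∂μ and ∂y/∂σ.

```python
import torch

# Reparameterize y ~ N(mu, sigma^2) as y = mu + sigma * z with z ~ N(0, 1)
mu = torch.tensor(1.5, requires_grad=True)
sigma = torch.tensor(0.8, requires_grad=True)

z = torch.randn(())       # extra input; not a function of mu or sigma
y = mu + sigma * z        # deterministic transformation of (mu, sigma, z)

y.backward()
print(mu.grad)            # dy/dmu = 1
print(sigma.grad)         # dy/dsigma = z (the sampled noise value)
```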

(5)

Back-Prop Through Sampling

■ The result (e.g., ∂y/∂μ or ∂y/∂σ) tells us how an infinitesimal change in μ or σ would change the output if we could repeat the sampling operation again with the same value of z

■ Being able to back-propagate through the sampling operation allows us to incorporate it into a larger graph

■ We can build elements of the graph on top of the output of the sampling operation; we can also build elements of the graph whose outputs are the inputs or the parameters of the sampling operation

■ E.g., we could build a larger graph with μ = f(x; θ) and σ = g(x; θ), and then back-propagate through this larger graph to obtain ∇_θ y

(6)

Reparametrization

■ Reparametrization = stochastic back-propagation = perturbation analysis

❑ Consider a probability distribution p(y; θ, x) = p(y|ω), where ω is a variable containing both the parameters θ and the input x

❑ We rewrite y ~ p(y|ω) as y = f(z; ω), where z is a source of randomness

❑ We may then differentiate y with respect to ω using back-propagation applied to f, so long as f is continuous and differentiable (see the sketch below)

❑ Crucially, ω must not be a function of z, and z must not be a function of ω
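A hedged sketch of the same idea with ω = (θ, x), assuming PyTorch: the linear layer standing in for f and the softplus used to keep σ positive are illustrative choices, not prescribed by the slides.

```python
import torch

f = torch.nn.Linear(4, 2)                # illustrative network: maps x to (mu, raw scale)
x = torch.randn(4)                       # input; part of w = (theta, x)

mu, s = f(x)                             # mu = f(x; theta)
sigma = torch.nn.functional.softplus(s)  # keep sigma > 0 (illustrative choice)

z = torch.randn(())                      # source of randomness; independent of theta and x
y = mu + sigma * z                       # y = f(z; w), a deterministic op given z

y.backward()                             # back-propagation applied to f
print(f.weight.grad.shape)               # gradients of y wrt the parameters theta
```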

(7)

Example: VAE

(8)

Outline

Boltzmann Machines

Restricted Boltzmann Machines

Deep Belief Networks

Deep Boltzmann Machines

Other Boltzmann Machines

Back-Propagation through Random Operations

Directed Generative Nets

VAE

GAN

(9)

Variational Autoencoder (VAE)

■ A generative model that uses learned approximate inference and can be trained purely with gradient-based methods

■ To generate a sample from the model, the VAE first draws a sample z from the code distribution p_model(z); the sample is then run through a differentiable generator network g(z)

■ Finally, x is sampled from a distribution p_model(x; g(z)) = p_model(x|z)

■ However, during training, the approximate inference network (or encoder) q(z|x) is used to obtain z, and p_model(x|z) is then viewed as a decoder network

(10)

Variational Autoencoder (VAE)

[Figure: the encoder q_θ(z|x) maps data x to a latent code z, and the decoder p_φ(x|z) maps z back to a reconstruction x̃]

(11)

Variational Autoencoder (VAE)

[Figure: the encoder network q(z|x) outputs μ_z|x and Σ_z|x; a latent code z is sampled from z|x ~ N(μ_z|x, Σ_z|x); the decoder network p_θ(x|z) outputs μ_x|z and Σ_x|z; the reconstruction x̃ is sampled from x|z ~ N(μ_x|z, Σ_x|z)]

(12)

Objective Function of VAE

■ The VAE is trained by maximizing the ELBO L(q) associated with any data point x:

L(q) = E_{z~q(z|x)}[log p_model(z, x)] + H(q(z|x))     (1)

L(q) = E_{z~q(z|x)}[log p_model(x|z)] − D_KL(q(z|x) || p_model(z))     (2)

❑ In eq. (1), the first term is the joint log-likelihood of the visible and hidden variables, and the second term is the entropy of the approximate posterior. When q is chosen to be a Gaussian distribution, maximizing L encourages the model to place high probability mass on the many z values that could have generated x, rather than collapsing to a single point estimate of the most likely value

(13)

Objective Function of VAE

■ The VAE is trained by maximizing the ELBO L(q) associated with any data point x

❑ In eq. (2), the first term can be viewed as the reconstruction log-likelihood found in other autoencoders. The second term tries to make the approximate posterior q(z|x) similar to the model prior p_model(z); a code sketch of this form of the objective follows below

L(q) = E_{z~q(z|x)}[log p_model(x|z)] − D_KL(q(z|x) || p_model(z))     (2)
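A minimal sketch of form (2) of the ELBO, assuming PyTorch; the unit-variance Gaussian decoder and the single-sample Monte Carlo estimate of the reconstruction term are simplifying assumptions rather than something fixed by the slides.

```python
import math
import torch

def elbo(x, mu_z, logvar_z, decoder):
    # Reparameterized sample z ~ q(z|x) = N(mu_z, diag(exp(logvar_z)))
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)

    # Reconstruction log-likelihood log p(x|z) under N(x; decoder(z), I)
    x_hat = decoder(z)
    log_px_z = -0.5 * ((x - x_hat) ** 2 + math.log(2 * math.pi)).sum()

    # Closed-form KL(N(mu_z, diag(sigma^2)) || N(0, I))
    kl = 0.5 * (torch.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z).sum()

    return log_px_z - kl   # maximize this (equivalently, minimize its negative)
```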

(14)

Main Idea of VAE

■ Train a parametric encoder that produces the parameters of q

■ So long as z is a continuous variable, we can then back-propagate through samples of z drawn from q(z|x) = q(z; f(x; θ)) in order to update θ

❑ Back-propagation through a random operation is used in the middle

■ Learning then consists solely of maximizing L with respect to the parameters of the encoder and decoder. All of the expectations in L can be approximated by Monte Carlo sampling

(15)

Main Idea of VAE

■ The encoder and decoder are feedforward neural networks whose outputs are the parameters of Gaussians

❑ I.e., p(z|x) ~ N(μ_z|x, Σ_z|x), where μ_z|x and Σ_z|x are outputs of the encoder network; p(x|z) ~ N(μ_x|z, Σ_x|z), where μ_x|z and Σ_x|z are outputs of the decoder network

■ Also, z is modeled as a simple Gaussian N(0, I). This makes generating samples easy

❑ q(z|x) is trained to be as similar as possible to N(0, I) (see the sketch below)

(16)

Generating Samples from VAE

[Figure: sample z from z ~ N(0, I), run the decoder network p_θ(x|z) to obtain μ_x|z and Σ_x|z, and sample x from x|z ~ N(μ_x|z, Σ_x|z) to obtain the generated sample x̃]
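At generation time only the decoder is needed, as in the figure above. A minimal sketch assuming PyTorch; `decoder` stands for any trained decoder network p_θ(x|z) (e.g., `model.dec` from the earlier sketch), and returning the mean μ_x|z instead of sampling from N(μ_x|z, Σ_x|z) is a common simplification.

```python
import torch

def generate(decoder, n_samples=16, z_dim=20):
    with torch.no_grad():
        z = torch.randn(n_samples, z_dim)   # z ~ N(0, I), the VAE prior
        return decoder(z)                   # decoder mean mu_x|z used as the sample
```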

(17)

VAE Example

(18)

VAE: Discussion

■ Strengths of VAE

❑ Principled approach to generative modeling

❑ Allows inference of q(z|x), which extracts hidden features

❑ Simple to implement

■ Weaknesses of VAE

❑ Approximate inference: i.e., it optimizes the lower bound (ELBO) rather than the exact log-likelihood

❑ Samples from a VAE trained on images tend to be blurry

(19)

Outline

Boltzmann Machines

Restricted Boltzmann Machines

Deep Belief Networks

Deep Boltzmann Machines

Other Boltzmann Machines

Back-Propagation through Random Operations

Directed Generative Nets

(20)

Generative Adversarial Network

■ Generative Adversarial Network (GAN)

❑ Another model based on a differentiable generator network

❑ Based on a game-theoretic scenario in which the generator network competes against an adversary

❑ The generator network produces samples x = g(z; θ_g). Its adversary, the discriminator network, attempts to distinguish samples drawn from the training data from samples produced by the generator

❑ The discriminator emits a probability value d(x; θ_d), the probability that x is a real training example rather than a fake sample drawn from the model

(21)

Generative Adversarial Network

■ Generator: tries to fool the discriminator

■ Discriminator: tries to distinguish real images from fake images

[Figure: the generator produces fake images, real images come from the training data, and the discriminator network classifies each input as real or fake]

(22)

Learning in GAN

■ Learning is formulated as a zero-sum game in which a function v(θ_g, θ_d) determines the payoff of the discriminator. The generator receives −v(θ_g, θ_d) as its own payoff

■ During learning, each player attempts to maximize its own payoff, so that at convergence

❑ g* = arg min_g max_d v(θ_g, θ_d) = arg min_g max_d [E_{x~p_data} log d(x) + E_{x~p_generator} log(1 − d(x))]

❑ Note that the discriminator is a function of θ_d and the generator is a function of θ_g

■ This formulation drives the discriminator to learn to correctly classify samples as real or fake, while the generator is trained to fool the discriminator into believing its samples are real

■ At convergence, the generator's samples are indistinguishable from real data, and the discriminator outputs 1/2 everywhere. The discriminator may then be discarded

(23)

Learning in GAN

■ Objective function

❑ g* = arg min_g max_d v(θ_g, θ_d) = arg min_g max_d [E_{x~p_data} log d(x) + E_{x~p_generator} log(1 − d(x))]

❑ Note that the discriminator is a function of θ_d and the generator is a function of θ_g

■ Algorithm: alternate the following two steps

❑ Step 1: update θ_d to maximize the objective E_{x~p_data} log d(x) + E_{x~p_generator} log(1 − d(x))

❑ Step 2: update θ_g to minimize the objective E_{x~p_generator} log(1 − d(x))

(24)

Learning in GAN

■ Algorithm: alternate the following two steps

❑ Step 1: update θ_d to maximize the objective E_{x~p_data} log d(x) + E_{x~p_generator} log(1 − d(x))

❑ Step 2: update θ_g to minimize the objective E_{x~p_generator} log(1 − d(x))

■ In practice, Step 2 is reformulated as follows (see the sketch below)

❑ Step 2': update θ_g to maximize the objective E_{x~p_generator} log d(x)

❑ The reason is that the gradient of log(1 − d(x)) is small when d(x) is small (i.e., when the discriminator confidently rejects the generator's samples, as happens early in training), whereas the gradient of log d(x) is large in that regime, so the generator parameters are updated quickly
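A hedged sketch of the alternating updates above, assuming PyTorch and using the non-saturating Step 2' objective. The MLP architectures, batch size, learning rates, and the small epsilon added for numerical stability are illustrative assumptions, not part of the slides.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
eps = 1e-8

x_real = torch.rand(64, x_dim)                 # dummy batch standing in for real data

# Step 1: update theta_d to maximize E[log d(x)] + E[log(1 - d(g(z)))]
x_fake = G(torch.randn(64, z_dim)).detach()    # stop gradients into the generator
loss_d = -(torch.log(D(x_real) + eps).mean()
           + torch.log(1 - D(x_fake) + eps).mean())
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Step 2' (non-saturating form): update theta_g to maximize E[log d(g(z))]
x_fake = G(torch.randn(64, z_dim))
loss_g = -torch.log(D(x_fake) + eps).mean()
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```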

(25)

Samples from GAN

(26)

Case Study: CycleGAN

■ Given any two unordered image collections X and Y, CycleGAN learns to automatically translate an image from one into the other and vice versa

[Zhu and Park et al., Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2017]

(27)

Case Study: CycleGAN

■ Formulation

❑ The model includes two mappings G: X → Y and F: Y → X

❑ Two adversarial discriminators: D_X (to distinguish between images {x} and translated images {F(y)}) and D_Y (to distinguish between images {y} and translated images {G(x)})

❑ The objective function contains two types of loss: an adversarial loss for matching the distribution of generated images to the data distribution in the target domain, and a cycle-consistency loss to prevent the learned mappings G and F from contradicting each other

(28)

Case Study: CycleGAN

■ Adversarial loss

❑ For the mapping function G: X → Y and its discriminator D_Y, the objective is L_GAN(G, D_Y, X, Y) = E_{y~p_data(y)} log D_Y(y) + E_{x~p_data(x)} log(1 − D_Y(G(x)))

❑ Here G tries to generate images G(x) that look similar to images from domain Y, while D_Y aims to distinguish between translated samples G(x) and real samples y

❑ We want to find the best G and D_Y by optimizing min_G max_{D_Y} L_GAN(G, D_Y, X, Y)

❑ A similar adversarial loss is defined for the mapping function F: Y → X and its discriminator D_X, with the objective min_F max_{D_X} L_GAN(F, D_X, Y, X)

(29)

Case Study: CycleGAN

■ Cycle consistency loss

❑ The adversarial loss alone cannot guarantee that the learned function maps an individual input x_i to a desired output y_i. E.g., multiple images in X may be mapped to the same image in Y

❑ CycleGAN therefore introduces a cycle-consistency loss

■ Forward cycle consistency: for each image x from domain X, the image translation cycle should bring x back to the original, i.e., x → G(x) → F(G(x)) ≈ x

■ Backward cycle consistency: for each image y from domain Y, the image translation cycle should bring y back to the original, i.e., y → F(y) → G(F(y)) ≈ y

(30)

Case Study: CycleGAN

■ Full objective function

❑ Objective (sketched below): L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F)

❑ Goal: solve G*, F* = arg min_{G,F} max_{D_X,D_Y} L(G, F, D_X, D_Y)
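A hedged sketch (assuming PyTorch) of the losses above. G, F, D_X, D_Y are assumed to be nn.Module networks, with D_X and D_Y outputting probabilities in (0, 1); the L1 form of the cycle-consistency loss and λ = 10 follow the cited paper, while the log-based adversarial loss follows the slides (in practice the paper trains with a least-squares variant).

```python
import torch

def adversarial_loss(G, D_Y, x, y, eps=1e-8):
    # L_GAN(G, D_Y, X, Y) = E_y[log D_Y(y)] + E_x[log(1 - D_Y(G(x)))]
    return (torch.log(D_Y(y) + eps).mean()
            + torch.log(1 - D_Y(G(x)) + eps).mean())

def cycle_loss(G, F, x, y):
    # L_cyc(G, F) = E_x[||F(G(x)) - x||_1] + E_y[||G(F(y)) - y||_1]
    return ((F(G(x)) - x).abs().mean()
            + (G(F(y)) - y).abs().mean())

def full_objective(G, F, D_X, D_Y, x, y, lam=10.0):
    # Maximized over D_X, D_Y and minimized over G, F; the cycle term only
    # involves G and F, so it does not affect the discriminator updates.
    return (adversarial_loss(G, D_Y, x, y)
            + adversarial_loss(F, D_X, y, x)
            + lam * cycle_loss(G, F, x, y))
```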

(31)

Case Study: CycleGAN

(32)

Case Study: CycleGAN

(33)

Case Study: CycleGAN

(34)

GAN: Discussion

■ Strengths of GAN

❑ State-of-the-art image samples

■ Weaknesses of GAN

❑ Unstable to train

❑ Cannot perform inference queries q(z|x)

(35)

What you need to know

■ Inference as Optimization

■ Expectation Maximization

■ MAP Inference and Sparse Coding

■ Variational Inference and Learning

(36)

Questions?
