Inference of Bayesian Network:
MCMC
Jin Young Choi
Outline
▪ Latent Dirichlet Allocation (LDA) Model (Topic Modelling)
▪ Inference of LDA Model
▪ Markov Chain Monte Carlo (MCMC)
▪ Gibbs Sampling
▪ Collapsed Gibbs Sampling for LDA
▪ Estimation of Multinomial Parameters via Dirichlet Prior
LDA Model (Topic Modelling)
LDA Model
Likelihood: $p(w \mid z, \theta, \phi, \alpha, \beta)$   Posterior: $p(z, \theta, \phi \mid w, \alpha, \beta)$
▪ Dir(K, α): $p(\mu_1, \dots, \mu_K \mid \alpha_1, \dots, \alpha_K) = \dfrac{\Gamma\left(\sum_{i=1}^{K}\alpha_i\right)}{\prod_{i=1}^{K}\Gamma(\alpha_i)}\,\mu_1^{\alpha_1-1}\cdots\mu_K^{\alpha_K-1}$
▪ Mul(K, μ): $p(x_1, \dots, x_K \mid \mu_1, \dots, \mu_K) = \dfrac{n!}{x_1!\cdots x_K!}\,\mu_1^{x_1}\cdots\mu_K^{x_K}$
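As a quick numerical companion to these two densities (not from the lecture; SciPy is assumed to be available), the following sketch evaluates both at made-up parameter values:

```python
import numpy as np
from scipy.stats import dirichlet, multinomial

K = 3
alpha = np.array([2.0, 3.0, 4.0])   # Dirichlet concentration parameters (illustrative)
mu = np.array([0.2, 0.3, 0.5])      # a point on the simplex, sums to 1

# Dir(K, alpha): density of the proportions mu
print(dirichlet.pdf(mu, alpha))

# Mul(K, mu): probability of the count vector x = (x_1, ..., x_K) over n draws
x = np.array([1, 2, 7])
print(multinomial.pmf(x, n=x.sum(), p=mu))
```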
Inference of LDA Model
▪ Maximum A Posteriori (MAP) estimation given the observations $w, \alpha, \beta$
Not convex; a closed-form solution is not available.
Likelihood: $p(w \mid z, \theta, \phi, \alpha, \beta)$   Posterior: $p(z, \theta, \phi \mid w, \alpha, \beta)$
Likelihoods of Discrete Variables: Dirichlet and Multinomial
▪ Parameters: $\alpha_1, \dots, \alpha_K > 0$ (concentration hyperparameters)
▪ Support: $\mu_1, \dots, \mu_K \in (0,1)$ where $\sum_{i=1}^{K}\mu_i = 1$
▪ Dir(K, α): $p(\mu_1, \dots, \mu_K \mid \alpha_1, \dots, \alpha_K) = \dfrac{\Gamma\left(\sum_{i=1}^{K}\alpha_i\right)}{\prod_{i=1}^{K}\Gamma(\alpha_i)}\,\mu_1^{\alpha_1-1}\cdots\mu_K^{\alpha_K-1}$
▪ Mul(K, μ): $p(x_1, \dots, x_K \mid \mu_1, \dots, \mu_K) = \dfrac{n!}{x_1!\cdots x_K!}\,\mu_1^{x_1}\cdots\mu_K^{x_K}$
▪ Dir(K, c + α): $p(\mu \mid x, \alpha) \propto p(x \mid \mu)\,p(\mu \mid \alpha)$, where $c = (c_1, \dots, c_K)$ is the vector of occurrence counts
▪ $E[\mu_k] = \dfrac{c_k + \alpha_k}{\sum_{i=1}^{K}(c_i + \alpha_i)}$
[Figure: bar chart over categories 1–5 illustrating the Dirichlet posterior expectation]
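A concrete instance of the posterior mean above, with a symmetric prior and made-up counts (a minimal sketch, not part of the slides):

```python
import numpy as np

alpha = np.ones(5)                 # symmetric Dirichlet prior over K = 5 categories
c = np.array([3, 0, 1, 0, 1])      # observed occurrence counts (illustrative)

# Posterior is Dir(c + alpha); its mean is E[mu_k] = (c_k + alpha_k) / sum_i (c_i + alpha_i)
posterior_mean = (c + alpha) / (c + alpha).sum()
print(posterior_mean)              # [0.4, 0.1, 0.2, 0.1, 0.2]
```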
Markov Chain Monte Carlo (MCMC)
▪ Markov Chain Monte Carlo (MCMC) framework
Posteriors
Markov Blanket
i: 1 2 3 4 5 6 7 8 9
w: 1 3 2 3 3 5 4 1 6
z: 1 2 2 2 1 1 2 1 2
$h_\phi(1,3) = 1$,  $h_\phi(2,3) = 2$,  $h_\theta(d,2) = 5$
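The histogram counts in this example can be reproduced with a few lines of Python (array names are mine; topics and words are 1-indexed as on the slide):

```python
import numpy as np

w = np.array([1, 3, 2, 3, 3, 5, 4, 1, 6])   # word indices of one document d
z = np.array([1, 2, 2, 2, 1, 1, 2, 1, 2])   # topic assignment of each word

K, V = 2, 6
h_phi = np.zeros((K, V), dtype=int)   # h_phi[k, v]: how often word v is assigned to topic k
h_theta = np.zeros(K, dtype=int)      # h_theta[k]: how often topic k occurs in document d

for wi, zi in zip(w, z):
    h_phi[zi - 1, wi - 1] += 1
    h_theta[zi - 1] += 1

print(h_phi[0, 2], h_phi[1, 2], h_theta[1])   # -> 1 2 5, i.e. h_phi(1,3), h_phi(2,3), h_theta(d,2)
```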
Posteriors targeted by Gibbs sampling: $p(z \mid w, \alpha, \beta)$ and $p(z \mid w, \theta_d, \phi_k)$
Gibbs Sampling (Review)
Collapsed Gibbs Sampling for LDA
▪ Latent Variables
• $\theta$: topic distribution of a document
• $\phi$: word distribution of a topic
• $z$: topic assignment of a word $w$
• Posterior: $p(\theta, \phi, z \mid w, \alpha, \beta)$
▪ Collapsed Gibbs Sampling
• $\theta$ and $\phi$ are induced by the association between $z$ and $w$
• $z$ is a sufficient statistic for estimating $\theta$ and $\phi$
• A simpler algorithm is obtained by sampling only $z$ after marginalizing out $\theta$ and $\phi$:
• $p(z \mid w, \alpha, \beta) = \iint p(\theta, \phi, z \mid w, \alpha, \beta)\, d\theta\, d\phi$
Collapsed Gibbs Sampling
𝑝(𝑧|𝑤, 𝜃𝑑, 𝜙𝑘)
Collapsed Gibbs Sampling for LDA
▪ The Gibbs sampling equation for LDA (Topic Labelling, Unsupervised Learning)
Not related to $z_{-di}$
𝑝(𝑧|𝑤, 𝛼, 𝛽)
Collapsed Gibbs Sampling for LDA
▪ Substituting the Dirichlet and multinomial probabilities into $p(\phi \mid \beta)$ and $p(w \mid z, \phi)$,
where
▪ Dir(K, α): $p(\mu_1, \dots, \mu_K \mid \alpha_1, \dots, \alpha_K) = \dfrac{\Gamma\left(\sum_{i=1}^{K}\alpha_i\right)}{\prod_{i=1}^{K}\Gamma(\alpha_i)}\,\mu_1^{\alpha_1-1}\cdots\mu_K^{\alpha_K-1}$
▪ Mul(K, μ): $p(x_1, \dots, x_K \mid \mu_1, \dots, \mu_K) = \dfrac{n!}{x_1!\cdots x_K!}\,\mu_1^{x_1}\cdots\mu_K^{x_K}$
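Written out, the substitution yields a single Dirichlet-like product over topics and words (a standard intermediate step; $h_\phi(k,v)$ is the histogram defined on the next slide):

$$
p(w \mid z, \phi)\,p(\phi \mid \beta) \;\propto\; \prod_{k=1}^{K}\prod_{v=1}^{V} \phi_{kv}^{\,h_\phi(k,v) + \beta_v - 1}
$$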
Collapsed Gibbs Sampling for LDA
▪ The number of times that the word $w_{di} = v$ is assigned to the topic $z_{di} = k$,
where $h_\phi(k, v) \in \mathbb{N}^{K \times V}$ denotes the histogram matrix that counts these assignments, as in the example below:
w: 1 2 2 3 3 5 4 1 6
z: 1 2 2 1 1 1 2 1 2
$h_\phi(1,3) = 2$,  $h_\phi(2,3) = 0$
Collapsed Gibbs Sampling for LDA
▪ In consequence
Collapsed Gibbs Sampling for LDA
▪ Integrating the PDF over $\phi$
Collapsed Gibbs Sampling for LDA
▪ In a similar manner, for the topic proportion corresponding to a word:
w: 1 2 2 3 3 5 4 1 6
z: 1 2 2 1 1 1 2 1 2
$h_\theta(d,2) = 4$
Collapsed Gibbs Sampling for LDA
▪ In a similar manner, integrating the PDF over $\theta$
Collapsed Gibbs Sampling for LDA
▪ The joint distribution of words 𝑤 and topic assignments 𝑧 becomes
w: 1 3 2 3 3 5 4 1 6
z: 1 2 2 2 1 1 2 1 2
$h_\phi(1,3) = 1$,  $h_\phi(2,3) = 2$,  $h_\theta(d,2) = 5$
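The equation itself is not reproduced here; as a sketch, the marginalized joint is commonly written in the following standard collapsed form, using the multivariate Beta function $B(\cdot)$ and the histogram notation above:

$$
p(w, z \mid \alpha, \beta) \;=\; \prod_{d=1}^{D} \frac{B\big(h_\theta(d,\cdot) + \alpha\big)}{B(\alpha)} \;\prod_{k=1}^{K} \frac{B\big(h_\phi(k,\cdot) + \beta\big)}{B(\beta)},
\qquad
B(\alpha) = \frac{\prod_{i} \Gamma(\alpha_i)}{\Gamma\big(\sum_{i} \alpha_i\big)}
$$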
Collapsed Gibbs Sampling for LDA
▪ The Gibbs sampling equation for LDA
Not related to $z_{-di}$
𝑝(𝑧|𝑤, 𝛼, 𝛽)
Collapsed Gibbs Sampling for LDA
▪ The Gibbs sampling equation for LDA
Collapsed Gibbs Sampling for LDA
▪ The Gibbs sampling equation for LDA
$\propto \prod_{k=1}^{K}\prod_{v=1}^{V}(\cdots)$
Collapsed Gibbs Sampling for LDA
▪ The Gibbs sampling equation for LDA
𝑘 depends on the sampled 𝑧𝑑𝑖
$\propto \prod_{d=1}^{D}\prod_{k=1}^{K}(\cdots) \;=\; \prod_{d=1}^{D}\prod_{k=1}^{K}(\cdots)$   (constant)
Collapsed Gibbs Sampling for LDA
▪ The Gibbs sampling equation for LDA
▪ Sampling
[Figure: sampled categorical distributions $p(z_{di})$ over topics 1–5]
$p(z_{di} \mid z_{-di}, w, \alpha, \beta) \;\propto\; \prod_{k=1}^{K}\prod_{v=1}^{V}(\cdots) \;=\; \prod_{d=1}^{D}\prod_{k=1}^{K}(\cdots)$   — tractable
What is the inferred label of 𝑤𝑑𝑖 ?
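For completeness, the standard collapsed Gibbs full conditional (counts with superscript $-di$ exclude the current token) is

$$
p(z_{di} = k \mid z_{-di}, w, \alpha, \beta) \;\propto\;
\frac{h_\phi^{-di}(k, w_{di}) + \beta_{w_{di}}}{\sum_{v=1}^{V}\big(h_\phi^{-di}(k, v) + \beta_v\big)}\,
\big(h_\theta^{-di}(d, k) + \alpha_k\big).
$$

The sketch below implements this update in plain Python/NumPy with symmetric scalar hyperparameters; the function name, data layout, and iteration count are my own choices, not the lecture's:

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha, beta, n_iters=200, seed=0):
    """docs: list of documents, each a list of word ids in {0, ..., V-1}.
    alpha, beta: symmetric scalar hyperparameters (assumed for brevity)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    h_theta = np.zeros((D, K), dtype=int)   # topic counts per document
    h_phi = np.zeros((K, V), dtype=int)     # word counts per topic
    n_k = np.zeros(K, dtype=int)            # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]

    for d, doc in enumerate(docs):          # initialize counts from the random assignment
        for i, v in enumerate(doc):
            k = z[d][i]
            h_theta[d, k] += 1; h_phi[k, v] += 1; n_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = z[d][i]
                # remove the current token: these are the "-di" counts
                h_theta[d, k] -= 1; h_phi[k, v] -= 1; n_k[k] -= 1
                # full conditional p(z_di = k | z_-di, w, alpha, beta)
                p = (h_phi[:, v] + beta) / (n_k + V * beta) * (h_theta[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                # add the token back with its newly sampled topic
                z[d][i] = k
                h_theta[d, k] += 1; h_phi[k, v] += 1; n_k[k] += 1
    return z, h_theta, h_phi
```

The inferred topic label of $w_{di}$ is then the sampled $z_{di}$ (or the mode of its samples across iterations).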
Collapsed Gibbs Sampling for LDA
▪ Estimation of the multinomial parameters 𝜃 and 𝜙
w: 1 3 2 3 3 5 4 1 6
z: 1 2 2 2 1 1 2 1 2
$h_\phi(1,3) = 1$,  $h_\phi(2,3) = 2$,  $h_\theta(d,2) = 5$

w: 1 2 2 3 3 5 4 1 6
z: 1 2 2 1 1 1 2 1 2
$h_\theta(d,2) = 4$
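Following the Dirichlet posterior mean from the earlier slide, the point estimates can be computed from the final histograms; a minimal sketch (names are mine):

```python
import numpy as np

def estimate_theta_phi(h_theta, h_phi, alpha, beta):
    # theta[d, k] = (h_theta(d,k) + alpha_k) / sum_k' (h_theta(d,k') + alpha_k')
    theta = (h_theta + alpha) / (h_theta + alpha).sum(axis=1, keepdims=True)
    # phi[k, v] = (h_phi(k,v) + beta_v) / sum_v' (h_phi(k,v') + beta_v')
    phi = (h_phi + beta) / (h_phi + beta).sum(axis=1, keepdims=True)
    return theta, phi
```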
Interim Summary
▪ LDA Model (Topic Modelling)
▪ Inference of LDA Model
▪ Markov Chain Monte Carlo (MCMC)
▪ Gibbs Sampling
▪ Collapsed Gibbs Sampling for LDA
▪ Estimation of Multinomial Parameters via Dirichlet Prior
Inference of Bayesian Network:
Variational Inference
Jin Young Choi
Outline
▪ What is variational inference ?
▪ Kullback–Leibler divergence (KL-divergence) formulation
▪ Dual of KL-divergence
▪ Variational Inference for LDA
▪ Estimating variational parameters
▪ Estimating LDA parameters
▪ Application of VI to Generative Image Modeling (VAE)
▪ Application of LDA to Traffic Pattern Analysis
LDA Model (Topic Modelling)
Likelihood: $p(w \mid z, \theta, \phi, \alpha, \beta)$   Posterior: $p(z, \theta, \phi \mid w, \alpha, \beta)$
Markov Chain Monte Carlo (MCMC)
▪ Markov Chain Monte Carlo (MCMC) framework
Posteriors
Markov Blanket
w: 1 3 2 3 3 5 4 1 6
z: 1 2 2 2 1 1 2 1 2
$h_\phi(1,3) = 1$,  $h_\phi(2,3) = 2$,  $h_\theta(d,2) = 5$
$p(z \mid w, \phi_k, \theta_d)$
Collapsed Gibbs Sampling for LDA
▪ The Gibbs sampling equation for LDA
▪ Sampling
[Figure: sampled categorical distributions $p(z_{di})$ over topics 1–5]
$p(z_{di} \mid z_{-di}, w, \alpha, \beta) \;\propto\; \prod_{k=1}^{K}\prod_{v=1}^{V}(\cdots) \;=\; \prod_{d=1}^{D}\prod_{k=1}^{K}(\cdots)$   — tractable
What is the inferred label of 𝑤𝑑𝑖 ?
Outline
▪ What is variational inference ?
▪ Kullback–Leibler divergence (KL-divergence) formulation
▪ Dual of KL-divergence
▪ Variational Inference for LDA
▪ Estimating variational parameters
▪ Estimating LDA parameters
▪ Application of VI to Generative Image Modeling
▪ Application of LDA to Traffic Pattern Analysis
Variational Inference (VI)
▪ Approximating posteriors in an intractable probabilistic model
▪ MCMC: stochastic inference
▪ VI: deterministic inference
▪ The posterior distribution $p(z \mid x)$ can be approximated by a variational distribution $q(z)$,
where $q(z)$ should belong to a family of simpler form than $p(z \mid x)$
▪ Minimizing Kullback–Leibler divergence (KL-divergence)
Likelihood: $p(w \mid z, \theta, \phi, \alpha, \beta)$   Posterior: $p(z, \theta, \phi \mid w, \alpha, \beta)$
$p(\phi \mid w, z) \propto p(w \mid \phi, z)\,p(\phi \mid \beta) \approx q(\phi \mid \gamma)$
$C\,\phi_1^{h(1)+\beta_1}\phi_2^{h(2)+\beta_2}\cdots\phi_V^{h(V)+\beta_V} \approx C\,\phi_1^{\gamma_1}\phi_2^{\gamma_2}\cdots\phi_V^{\gamma_V}$
Variational Inference (VI)
▪ Kullback–Leibler divergence (KL-divergence)
$D_{KL}\big(q(z)\,\|\,p(z \mid x)\big) = \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z, x)] + \log p(x)$, where $\log p(x)$ is constant with respect to $q$, so it suffices to minimize $\mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z, x)]$
Variational Inference (VI)
▪ Dual of KL-divergence
$\mathcal{L}(q) = \mathbb{E}_q[\log p(x, z)] + H(q)$, where $H(q)$ is the entropy of $q(z)$; since $\log p(x) = \mathcal{L}(q) + D_{KL}\big(q(z)\,\|\,p(z \mid x)\big)$, maximizing $\mathcal{L}(q)$ is equivalent to minimizing the KL divergence
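A tiny numerical check of this identity on a made-up discrete example (values are illustrative, not from the lecture):

```python
import numpy as np

p_xz = np.array([0.20, 0.10, 0.05, 0.15])   # joint p(x, z) for a fixed x and z in {1,...,4}
q = np.array([0.25, 0.25, 0.25, 0.25])      # variational distribution q(z)

log_px = np.log(p_xz.sum())                 # log evidence log p(x)
p_post = p_xz / p_xz.sum()                  # exact posterior p(z | x)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))     # E_q[log p(x,z)] + H(q)
kl = np.sum(q * (np.log(q) - np.log(p_post)))     # D_KL(q || p(z|x))

print(log_px, elbo + kl)                    # both print the same number
```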
Variational Inference (VI)
▪ Choose a variational distribution $q(z)$
▪ A factorized approximation (Parisi, 1988) was proposed: $q(z \mid \lambda) = \prod_i q_i(z_i \mid \lambda_i)$,
where $\lambda_i$ is a variational parameter for each hidden variable $z_i$
▪ Estimation of $\lambda$: $\hat{\lambda} = \arg\max_{\lambda} \mathcal{L}(\lambda)$, obtained from $\nabla_{\lambda}\mathcal{L}(\lambda) = 0$,
where $q(\cdot)$ would be designed for $\mathcal{L}(\lambda)$ to be convex.
$p(\phi \mid w, z) \propto p(w \mid \phi, z)\,p(\phi \mid \beta) \approx q(\phi \mid \gamma)$
$C\,\phi_1^{h(1)+\beta_1}\phi_2^{h(2)+\beta_2}\cdots\phi_V^{h(V)+\beta_V} \approx C\,\phi_1^{\gamma_1}\phi_2^{\gamma_2}\cdots\phi_V^{\gamma_V}$
$p(\theta \mid z) \propto p(z \mid \theta)\,p(\theta \mid \alpha) \approx q(\theta \mid \mu)$
$C\,\theta_1^{h(1)+\alpha_1}\theta_2^{h(2)+\alpha_2}\cdots\theta_K^{h(K)+\alpha_K} \approx C\,\theta_1^{\mu_1}\theta_2^{\mu_2}\cdots\theta_K^{\mu_K}$
$\mathcal{L}(\gamma)$, $\mathcal{L}(\mu)$: here $z \to \phi,\ x \to w$ and $z \to \theta,\ x \to z$
Variational Inference for LDA
▪ Objective function and joint probability of LDA
Maximization is intractable → Variational Inference
Variational Inference for LDA
▪ A simpler variational distribution $q(\phi, \theta, z \mid \lambda, \gamma, \varphi)$,
where $\lambda, \gamma, \varphi$ are the variational parameters used for approximate inference of $\phi, \theta, z$, respectively.
$p(\phi \mid w, z) \propto p(w \mid \phi, z)\,p(\phi \mid \beta) \approx q(\phi \mid \lambda)$
$C\,\phi_1^{h(1)+\beta_1}\phi_2^{h(2)+\beta_2}\cdots\phi_V^{h(V)+\beta_V} \approx C\,\phi_1^{\lambda_1}\phi_2^{\lambda_2}\cdots\phi_V^{\lambda_V}$
$p(\theta \mid z) \propto p(z \mid \theta)\,p(\theta \mid \alpha) \approx q(\theta \mid \gamma)$
$C\,\theta_1^{h(1)+\alpha_1}\theta_2^{h(2)+\alpha_2}\cdots\theta_K^{h(K)+\alpha_K} \approx C\,\theta_1^{\gamma_1}\theta_2^{\gamma_2}\cdots\theta_K^{\gamma_K}$
Variational Inference for LDA
▪ Optimal values of the variational parameters
Variational Inference for LDA
▪ Optimal values of the variational parameters
$p(\phi, \theta, z \mid w, \alpha, \beta) = p(\phi, \theta, z, w \mid \alpha, \beta)\,/\,p(w)$
Variational Inference for LDA
▪ Dual of KL divergence
$\mathcal{L}(q) = \log p(w \mid \alpha, \beta) - D_{KL}\big[q(\phi, \theta, z \mid \lambda, \gamma, \varphi)\,\|\,p(\phi, \theta, z \mid w, \alpha, \beta)\big]$
▪ Yielding the optimal value of each variational parameter
Variational Inference for LDA
▪ Expectation: $\mathbb{E}_q[\log \theta_{dk}] = \Psi(\gamma_{dk}) - \Psi\!\left(\sum_{k'=1}^{K}\gamma_{dk'}\right)$,
where $\Psi(\cdot)$ is the digamma function (Blei et al., 2003)
▪ The iterative updates of the variational parameters are guaranteed to converge to a stationary point
▪ In each iteration, $\varphi$ is updated with $\lambda$ and $\gamma$ fixed, and then $\lambda$ and $\gamma$ are updated with $\varphi$ fixed
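A minimal coordinate-ascent sketch for one document in the style of Blei et al. (2003); the shapes and variable names (`varphi`, `gamma_d`, `lam`) are my own, and the corpus-level update of $\lambda$ is only indicated in a comment:

```python
import numpy as np
from scipy.special import digamma

def vi_update_document(w_d, gamma_d, lam, alpha):
    """One coordinate-ascent sweep for a single document d.
    w_d: (N_d,) word ids; gamma_d: (K,) Dirichlet parameter for theta_d;
    lam: (K, V) Dirichlet parameters for the topics phi; alpha: (K,) or scalar."""
    # Expectations of log theta and log phi via the digamma function
    Elog_theta = digamma(gamma_d) - digamma(gamma_d.sum())
    Elog_phi = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))

    # Update varphi_di (topic responsibilities of each word), holding lambda and gamma fixed
    log_varphi = Elog_theta[None, :] + Elog_phi[:, w_d].T          # shape (N_d, K)
    varphi = np.exp(log_varphi - log_varphi.max(axis=1, keepdims=True))
    varphi /= varphi.sum(axis=1, keepdims=True)

    # Update gamma_d, holding varphi fixed
    gamma_d_new = alpha + varphi.sum(axis=0)
    return varphi, gamma_d_new

# At the corpus level, lambda would then be updated as
# lam[k, v] = beta_v + sum over all documents and words of varphi[i, k] * [w_di == v].
```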
Variational Inference for LDA
▪ The final distributions of $\phi$ and $\theta$
[Figure: categorical distribution $p(z_{di} \mid \varphi_{di})$ over topics 1–5]
References for LDA and VI
▪ Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
▪ Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
▪ Blei, D. M. (2014). Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1(1), 203–232.
▪ Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233. URL http://dx.doi.org/10.1023/A:1007665907178
Example: Image Generation
Generative Image Modeling
Goal: Model the distribution p(x)
▪ cf. Discriminative approach
Generative Image Modeling
Issue: high-dimensional data even at low resolution (64×64×3 = 12,288 dimensions)
Goal: Model the distribution p(x)
▪ Using Deep Neural Networks (CNNs)
Generative Image Modeling
Two prominent approaches
▪ Variational Auto-encoder (VAE)
▪ Generative Adversarial Networks (GAN)
Deep Generative Models
Variational Auto-encoder (VAE)
Encoder: $x \to z$   Decoder: $z \to x$
$\mathrm{Loss} = \underbrace{-\log p_\theta(x \mid z)}_{\text{reconstruction loss}} + \underbrace{D_{KL}\big(q_\varphi(z \mid x)\,\|\,p_\theta(z)\big)}_{\text{variational inference}}$
$p_\theta(x \mid z)$: a multivariate Gaussian (real-valued data) or a Bernoulli (binary-valued data)
Variational Auto-encoder (VAE)
Encoder: $x \to z$   Decoder: $z \to x$
$\mathrm{Loss} = -\log p_\theta(x \mid z) + D_{KL}\big(q_\varphi(z \mid x)\,\|\,p_\theta(z)\big)$
Variational Inference
$p_\theta(z) = \mathcal{N}(0, I)$
Generation: sample $z$ from $\mathcal{N}(0, I)$, then decode $z \to x$
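A minimal PyTorch-style sketch of this loss for Bernoulli (binary-valued) data; the helper names and the reparameterization step are my own framing of the standard formulation, not the lecture's code:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I), so gradients flow through the encoder
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vae_loss(x, x_logits, mu, logvar):
    # Loss = -log p_theta(x | z) + D_KL( q_phi(z | x) || N(0, I) )
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```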
Interim Summary
▪ What is variational inference ?
▪ Kullback–Leibler divergence (KL-divergence) formulation
▪ Dual of KL-divergence
▪ Variational Inference for LDA
▪ Estimating variational parameters
▪ Estimating LDA parameters
▪ Application of VI to Generative Image Modeling