(1)

Inference in Multiparameter Models, Conditional Posterior, Local Conjugacy

Piyush Rai

Topics in Probabilistic Modeling and Inference (CS698X)

Feb 4, 2019

(2)

Today’s Plan

Foray into models with several parameters

Goal will be to infer the posterior over all of them (not posterior for some, MLE-II for others)
Idea of conditional/local posteriors in such problems

Local conjugacy (which helps in computing conditional posteriors)

Gibbs sampling (an algorithm that infers the joint posterior via conditional posteriors)
An example: Bayesian matrix factorization (model with many parameters)

Note: Conditional/local posterior, local conjugacy, etc. are important ideas (will appear in many inference algorithms that we will see later)

(3)

Moving Beyond Simple Models..

So far we’ve usually seen models with one “main” parameter and maybe a few hyperparams, e.g.,
Given data assumed to be from a Gaussian, infer the mean assuming variance known (or vice-versa)
Bayesian linear regression with weight vector w and noise/prior precision hyperparams β, λ

GP regression with one function to be learned

Easy posterior inference if the likelihood and prior are conjugate to each other

Otherwise have to approx. the posterior (e.g., Laplace approx. - recall Bayesian logistic regression)
Hyperparams, if desired, can also be estimated via MLE-II

Note however that MLE-II would only give a point estimate of hyperparams

What if we have a model with lots of params/hyperparams and want posteriors over all of those?

Intractable in general but today we will look at a way of doing approx. inference in such models

(4)

Multiparameter Models

Multiparameter models consist of two or more unknowns, say θ_1 and θ_2

Given data y, some examples for the simple two-parameter case

Assume the likelihood model to be of the form p(y | θ_1, θ_2) (e.g., cases 1 and 3 above)
Assume a joint prior distribution p(θ_1, θ_2)

The joint posterior p(θ_1, θ_2 | y) ∝ p(y | θ_1, θ_2) p(θ_1, θ_2)

Easy if the joint prior is conjugate to the likelihood (e.g., NIW prior for Gaussian likelihood)

Otherwise needs more work, e.g., MLE-II, MCMC, VB, etc. (already saw MLE-II, will see more later)
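
To make the joint posterior concrete, here is a minimal sketch (the data, grid ranges, and priors are illustrative assumptions, not from the slide) that evaluates p(θ_1, θ_2 | y) by brute force on a grid for Gaussian data with unknown mean θ_1 and unknown variance θ_2:

```python
import numpy as np

y = np.array([1.2, 0.7, 2.1, 1.5, 0.9])              # toy observations

theta1 = np.linspace(-3.0, 5.0, 200)                 # grid over the mean
theta2 = np.linspace(0.05, 6.0, 200)                 # grid over the variance
T1, T2 = np.meshgrid(theta1, theta2, indexing="ij")

# log p(y | theta1, theta2): Gaussian likelihood summed over the observations
loglik = sum(-0.5 * np.log(2 * np.pi * T2) - 0.5 * (yn - T1) ** 2 / T2 for yn in y)
# log p(theta1, theta2): illustrative independent priors N(0, 10^2) and Exp(1), up to constants
logprior = -0.5 * (T1 / 10.0) ** 2 - T2

# joint posterior on the grid; normalizing by a sum is feasible only because the
# parameter space was discretized -- this normalizer is what is intractable in general
logpost = loglik + logprior
post = np.exp(logpost - logpost.max())
post /= post.sum()
i1, i2 = np.unravel_index(post.argmax(), post.shape)
print("approximate posterior mode:", theta1[i1], theta2[i2])
```

Grid normalization is only feasible because the parameter space here is two-dimensional; with many parameters this normalizing integral is exactly what becomes intractable, which motivates the conditional-posterior machinery introduced later.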

(5)

Multiparameter Models: Some Examples

Multiparameter models arise in many situations, e.g.,

Probabilistic models with unknown hyperparameters (e.g., the Bayesian linear regression we just saw)
Joint analysis of data from multiple (and possibly related) groups: Hierarchical models

.. and in fact, pretty much in any non-toy example of a probabilistic model :)

(6)

Another Example Problem: Matrix Factorization/Completion

Given: Data D = {r_ij} of “interactions” (e.g., ratings) of users i = 1, . . . , N on items j = 1, . . . , M
Note: “users” and “items” could mean other things too (depends on the data)

(i, j) ∈ Ω denotes an observed user-item pair; Ω is the set of all such pairs
Only a small number of user-item ratings are observed, i.e., |Ω| ≪ NM
We would like to predict the unobserved values r_ij ∉ D

(7)

Matrix Completion via Matrix Factorization

Let’s call the full matrix R and assume it to be an approximately low-rank matrix: R = U V^T + E

U = [u_1 . . . u_N]^T is N×K and consists of the latent factors of the N users
u_i ∈ R^K denotes the latent factors (or learned features) of user i

V = [v_1 . . . v_M]^T is M×K and consists of the latent factors of the M items
v_j ∈ R^K denotes the latent factors (or learned features) of item j

E = {ε_ij} consists of the “noise” in R (not captured by the low-rank assumption)
We can write each element of the matrix R as

r_ij = u_i^T v_j + ε_ij   (i = 1, . . . , N;  j = 1, . . . , M)
Given u_i and v_j, any unobserved element in R can be predicted using the above

(8)

A Bayesian Model for Matrix Factorization

The low-rank matrix factorization model assumes r_ij = u_i^T v_j + ε_ij
Let’s assume the noise to be Gaussian: ε_ij ∼ N(ε_ij | 0, β^{-1})

This results in the following Gaussian likelihood for each observation: p(r_ij | u_i, v_j) = N(r_ij | u_i^T v_j, β^{-1})
Assume Gaussian priors on the user and item latent factors

p(u_i) = N(u_i | 0, λ_u^{-1} I_K)   and   p(v_j) = N(v_j | 0, λ_v^{-1} I_K)

The goal is to infer the latent factors U = {u_i}_{i=1}^N and V = {v_j}_{j=1}^M, given the observed ratings from R
For simplicity, we will assume the hyperparams β, λ_u, λ_v to be fixed and not to be learned
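
For concreteness, a minimal sketch of sampling synthetic data from this generative model (the sizes N, M, K, the hyperparameter values, and the 20% observation rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 50, 40, 5                 # users, items, latent dimension
beta, lam_u, lam_v = 4.0, 2.0, 2.0  # fixed precision hyperparameters

# u_i ~ N(0, lam_u^{-1} I_K), v_j ~ N(0, lam_v^{-1} I_K)
U = rng.normal(0.0, 1.0 / np.sqrt(lam_u), size=(N, K))
V = rng.normal(0.0, 1.0 / np.sqrt(lam_v), size=(M, K))

# r_ij ~ N(u_i^T v_j, beta^{-1}) for every (i, j); only a subset Omega is observed
R = U @ V.T + rng.normal(0.0, 1.0 / np.sqrt(beta), size=(N, M))
mask = rng.random((N, M)) < 0.2     # ~20% of entries observed, so |Omega| << N*M
Omega = np.argwhere(mask)           # the observed (i, j) pairs
print("observed ratings:", len(Omega), "of", N * M)
```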

(9)

The Posterior

Our target posterior distribution for this model will be

p(U, V | R) = p(R | U, V) p(U, V) / ∫∫ p(R | U, V) p(U, V) dU dV
            = [ ∏_{(i,j)∈Ω} p(r_ij | u_i, v_j) ∏_i p(u_i) ∏_j p(v_j) ] / [ ∫ · · · ∫ ∏_{(i,j)∈Ω} p(r_ij | u_i, v_j) ∏_i p(u_i) ∏_j p(v_j) du_1 · · · du_N dv_1 · · · dv_M ]

The denominator (and hence the posterior) is intractable!

Therefore, the posterior must be approximated somehow

One way to approximate is to compute the Conditional Posterior (CP) over individual variables, e.g., p(u_i | R, V, U_{-i}) and p(v_j | R, U, V_{-j})

U_{-i} denotes all of U except u_i. Note: V, U_{-i} is the set of all unknowns except u_i

V_{-j} denotes all of V except v_j. Note: U, V_{-j} is the set of all unknowns except v_j

Caveat: Each CP should be “computable” (but this is possible for models with “local conjugacy”)
Since the CP of each variable depends on all other variables, inference algos based on computing CPs usually work in an alternating fashion, until each CP converges (e.g., Gibbs sampling, which we’ll look at later)

(10)

Conditional Posterior and Local Conjugacy

(11)

Conditional Posterior and Local Conjugacy

Conditional Posteriors are easy to compute if the model admits local conjugacy
Note: Some researchers also call the CP the Complete Conditional or Local Posterior

Consider a general model with data X and K unknown params/hyperparams Θ = (θ_1, θ_2, . . . , θ_K)
Suppose the posterior p(Θ | X) = p(X | Θ) p(Θ) / p(X) is intractable (will be the case if p(Θ) isn’t conjugate)
However, suppose we can compute the following conditional posterior tractably

p(θ_k | X, Θ_{-k}) = p(X | θ_k, Θ_{-k}) p(θ_k) / ∫ p(X | θ_k, Θ_{-k}) p(θ_k) dθ_k ∝ p(X | θ_k, Θ_{-k}) p(θ_k)

.. which would be possible if p(X | θ_k, Θ_{-k}) and p(θ_k) are conjugate to each other
Such models are called “locally conjugate” models

Important: In the above context, when considering the likelihood p(X | θ_k, Θ_{-k}),
X actually refers to only that part of the data X that depends on θ_k
Θ_{-k} refers to only those unknowns that “interact” with θ_k in generating that part of the data
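
As a small worked example of local (but not global) conjugacy, not taken from the slides: suppose x_n ∼ N(μ, τ^{-1}) for n = 1, . . . , N, with independent priors μ ∼ N(μ_0, s_0^{-1}) and τ ∼ Gamma(a, b) (rate b). The joint posterior p(μ, τ | x) has no standard form, but each conditional posterior does, because each one holds the other unknown fixed:

p(μ | x, τ) = N(μ | (s_0 μ_0 + τ ∑_n x_n) / (s_0 + N τ), (s_0 + N τ)^{-1})   and   p(τ | x, μ) = Gamma(a + N/2, b + (1/2) ∑_n (x_n − μ)^2)

These two conditionals are exactly what a Gibbs sampler (coming up shortly) needs.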

(12)

Representation of Posterior

With the conditional-posterior-based approximation, the target posterior p(Θ | X) = p(X | Θ) p(Θ) / p(X)

.. is represented by several conditional posteriors p(θ_k | X, Θ_{-k}), k = 1, . . . , K

Each conditional posterior is a distribution over one unknown θ_k, given all other unknowns
Need a way to “combine” these CPs to get the overall posterior

One way to get the overall representation of the posterior is to use sampling-based inference algorithms like Gibbs sampling or MCMC (more on this later)

(13)

Detour: Gibbs Sampling (Geman and Geman, 1984)

A general sampling algorithm to simulate samples from multivariate distributions

Samples one component at a time from its conditional, conditioned on all other components
Assumes that the conditional distributions are available in a closed form

The generated samples give a sample-based approximation of the multivariate distribution

(14)

Detour: Gibbs Sampling (Geman and Geman, 1984)

Can be used to get a sampling-based approximation of a multiparameter posterior distribution
The Gibbs sampler iteratively draws random samples from conditional posteriors

When run long enough, the sampler produces samples from the joint posterior
For the simple two-parameter case θ = (θ_1, θ_2), the Gibbs sampler looks like this

Initialize θ_2^(0)
For s = 1, . . . , S
  Draw a random sample for θ_1 as θ_1^(s) ∼ p(θ_1 | θ_2^(s−1), y)
  Draw a random sample for θ_2 as θ_2^(s) ∼ p(θ_2 | θ_1^(s), y)

The set of S random samples {θ_1^(s), θ_2^(s)}_{s=1}^S represent the joint posterior distribution p(θ_1, θ_2 | y)
More on Gibbs sampling when we discuss MCMC sampling algorithms (above is the high-level idea)
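
As an illustration, a minimal sketch (not the lecture’s code) of this two-parameter Gibbs scheme, applied to the locally conjugate unknown-mean/unknown-precision Gaussian example sketched after slide (11); the toy data and hyperparameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=0.5, size=100)      # toy data
mu0, s0, a, b = 0.0, 1.0, 2.0, 2.0                # illustrative prior hyperparameters
N, S = len(y), 5000

mu_samples, tau_samples = np.empty(S), np.empty(S)
tau = 1.0                                         # initialize theta_2^(0) = tau
for s in range(S):
    # draw theta_1 = mu from its conditional posterior p(mu | tau, y): Gaussian
    prec = s0 + N * tau
    mu = rng.normal((s0 * mu0 + tau * y.sum()) / prec, 1.0 / np.sqrt(prec))
    # draw theta_2 = tau from its conditional posterior p(tau | mu, y): Gamma
    shape, rate = a + N / 2.0, b + 0.5 * np.sum((y - mu) ** 2)
    tau = rng.gamma(shape, 1.0 / rate)            # numpy parameterizes by scale = 1/rate
    mu_samples[s], tau_samples[s] = mu, tau

# the S pairs (mu_samples[s], tau_samples[s]) approximate the joint posterior p(mu, tau | y)
print("posterior mean of mu:", mu_samples[S // 2:].mean())
```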

(15)

Back to Bayesian Matrix Factorization..

(16)

Bayesian Matrix Factorization: Conditional Posteriors

The BMF model with Gaussian likelihood and Gaussian priors has local conjugacy
To see this, note that the conditional posterior for user latent factor u_i is

p(u_i | R, V, U_{-i}) ∝ ∏_{j:(i,j)∈Ω} p(r_ij | u_i, v_j) p(u_i)

Note: the posterior of u_i doesn’t actually depend on U_{-i}, or on rows of R other than row i
After substituting the likelihood and prior (both Gaussians), the conditional posterior of u_i is

p(u_i | R, V) ∝ ∏_{j:(i,j)∈Ω} N(r_ij | u_i^T v_j, β^{-1}) N(u_i | 0, λ_u^{-1} I_K)

Since V is fixed (remember we are computing conditional posteriors in an alternating fashion), the likelihood and prior are conjugate. This is just like Bayesian linear regression

Linear regression analogy: {v_j}_{j:(i,j)∈Ω}: inputs, {r_ij}_{j:(i,j)∈Ω}: responses, u_i: unknown weight vector
Likewise, the conditional posterior of v_j will be

p(v_j | R, U) ∝ ∏_{i:(i,j)∈Ω} N(r_ij | u_i^T v_j, β^{-1}) N(v_j | 0, λ_v^{-1} I_K)

.. like Bayesian lin. reg. with {u_i}_{i:(i,j)∈Ω}: inputs, {r_ij}_{i:(i,j)∈Ω}: responses, v_j: unknown weight vector

(17)

Bayesian Matrix Factorization: Conditional Posteriors

The conditional posteriors will have forms similar to the solution of Bayesian linear regression
For each u_i, its conditional posterior, given V and the ratings, is

p(u_i | R, V) = N(u_i | μ_ui, Σ_ui)

where Σ_ui = (λ_u I + β ∑_{j:(i,j)∈Ω} v_j v_j^T)^{-1}   and   μ_ui = Σ_ui (β ∑_{j:(i,j)∈Ω} r_ij v_j)

For each v_j, its conditional posterior, given U and the ratings, is

p(v_j | R, U) = N(v_j | μ_vj, Σ_vj)

where Σ_vj = (λ_v I + β ∑_{i:(i,j)∈Ω} u_i u_i^T)^{-1}   and   μ_vj = Σ_vj (β ∑_{i:(i,j)∈Ω} r_ij u_i)

These conditional posteriors can be updated in an alternating fashion until convergence
This can be implemented using a Gibbs sampler (a sketch follows below)

Note: Hyperparameters can also be inferred by computing their conditional posteriors (also see “Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo” by Salakhutdinov and Mnih (2008))

Can extend Gaussian BMF easily to other exp. family distributions while maintaining local conjugacy
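
Putting the pieces together, here is a minimal sketch of a Gibbs sampler built from these conditional posteriors (synthetic data, sizes, hyperparameters, sweep count, and burn-in are all illustrative assumptions; this is not the reference implementation from Salakhutdinov and Mnih (2008), which also samples the hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 50, 40, 5                               # users, items, latent dimension
beta, lam_u, lam_v = 4.0, 2.0, 2.0                # fixed precision hyperparameters

# synthetic data from the BMF generative model (same setup as the earlier sketch)
U_true = rng.normal(0.0, 1.0 / np.sqrt(lam_u), size=(N, K))
V_true = rng.normal(0.0, 1.0 / np.sqrt(lam_v), size=(M, K))
R = U_true @ V_true.T + rng.normal(0.0, 1.0 / np.sqrt(beta), size=(N, M))
mask = rng.random((N, M)) < 0.2                   # observed pairs (i, j) in Omega

U = rng.normal(size=(N, K))                       # initialize the latent factors
V = rng.normal(size=(M, K))
num_sweeps, burn_in, samples = 200, 100, []

for s in range(num_sweeps):
    # sample each u_i from N(mu_ui, Sigma_ui), conditioned on V and user i's ratings
    for i in range(N):
        Vo = V[mask[i]]                           # {v_j : (i, j) in Omega}
        Sigma_ui = np.linalg.inv(lam_u * np.eye(K) + beta * Vo.T @ Vo)
        mu_ui = Sigma_ui @ (beta * Vo.T @ R[i, mask[i]])
        U[i] = rng.multivariate_normal(mu_ui, Sigma_ui)
    # sample each v_j from N(mu_vj, Sigma_vj), conditioned on U and item j's ratings
    for j in range(M):
        Uo = U[mask[:, j]]                        # {u_i : (i, j) in Omega}
        Sigma_vj = np.linalg.inv(lam_v * np.eye(K) + beta * Uo.T @ Uo)
        mu_vj = Sigma_vj @ (beta * Uo.T @ R[mask[:, j], j])
        V[j] = rng.multivariate_normal(mu_vj, Sigma_vj)
    if s >= burn_in:                              # keep post-burn-in samples
        samples.append((U.copy(), V.copy()))
```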

(18)

Bayesian Matrix Factorization

The posterior predictive distribution for BMF (assuming other hyperparams known) is

p(r_ij | R) = ∫∫ p(r_ij | u_i, v_j) p(u_i, v_j | R) du_i dv_j

In general, this is hard and needs approximation

If we are using Gibbs sampling, we can use the S samples {u_i^(s), v_j^(s)}_{s=1}^S to compute the mean
For the Gaussian likelihood case, the mean can be computed as

E[r_ij] ≈ (1/S) ∑_{s=1}^{S} u_i^(s)T v_j^(s)   (Monte-Carlo averaging)

Can also compute the variance of r_ij (think how)
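
A short sketch of this Monte-Carlo averaging, written as a hypothetical helper predict_rating that consumes the post-burn-in samples produced by a sampler such as the one sketched after slide (17); the variance line is one standard way to answer the “think how” prompt (law of total variance: noise variance 1/β plus the posterior spread of u_i^T v_j):

```python
import numpy as np

def predict_rating(samples, i, j, beta):
    """Monte-Carlo predictive mean/variance of r_ij from Gibbs samples [(U, V), ...]."""
    preds = np.array([U_s[i] @ V_s[j] for (U_s, V_s) in samples])  # u_i^(s)T v_j^(s)
    mean_rij = preds.mean()                  # E[r_ij] by Monte-Carlo averaging
    var_rij = 1.0 / beta + preds.var()       # noise variance + spread of u_i^T v_j
    return mean_rij, var_rij
```

For example, predict_rating(samples, 0, 1, beta) with the samples list collected in the earlier sketch.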

[Figure: A comparison of Bayesian MF with other methods (from Salakhutdinov and Mnih (2008))]

(19)

Summary and Some Comments

Bayesian inference in even very complex probabilistic models can often be performed rather easily if the models have the local conjugacy property

It therefore helps to choose the likelihood model and priors on each param as exp. family distr.

Even if we can’t get a globally conjugate model, we can still get a model with local conjugacy
Local conjugacy allows computing conditional posteriors that are needed in inference algos like Gibbs sampling, MCMC, EM, variational inference, etc.
