Advanced Deep Learning
Confronting the Partition Function (2)
U Kang
Seoul National University
Outline
The Log-Likelihood Gradient
Stochastic Maximum Likelihood and Contrastive Divergence
Pseudolikelihood
Noise-Contrastive Estimation
Estimating the Partition Function
Pseudolikelihood
■ Another approach to the intractability of the partition function is to train the model without computing the partition function at all
■ Observation: conditional distributions can be computed without the partition function, since the partition function cancels in the ratio p(a|b) = p(a,b) / p(b)
Pseudolikelihood
■ Conditional probabilities
  ❑ Split x into a (the variables we want the distribution over), b (the conditioning variables), and c (the remaining variables):
    p(a|b) = p(a,b) / p(b) = Σ_c p̃(a,b,c) / Σ_{a,c} p̃(a,b,c)
  ❑ If a and c contain many variables, this requires marginalizing out a large number of variables
Main Idea of Pseudolikelihood
■ If we have n variables, the chain rule of probability gives:
  ❑ log p(x) = log p(x_1) + log p(x_2 | x_1) + ⋯ + log p(x_n | x_1, …, x_{n−1})
  ❑ We can make a maximally small (a single variable), but in the worst case we must still marginalize out a set of variables c of size n − 1
■ What if we simply move c into b to avoid the marginalization cost? ⇒ pseudolikelihood
  ❑ Pseudolikelihood objective: Σ_{i=1}^n log p(x_i | x_{−i})
Pseudolikelihood
■ Generalized Pseudolikelihood Estimator
  ❑ We trade computational complexity for deviation from maximum-likelihood behavior
  ❑ Uses m sets S^(1), …, S^(m) of indices of variables that appear together on the left side of the conditioning bar
  ❑ Objective function: Σ_{i=1}^m log p(x_{S^(i)} | x_{−S^(i)})   (a sketch of the singleton-set case, which recovers plain pseudolikelihood, follows below)
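A minimal sketch of the singleton-set case, i.e. the plain pseudolikelihood objective Σ_i log p(x_i | x_{−i}). The model is an assumed toy fully visible binary MRF p̃(x) = exp(½ xᵀW x + bᵀx) with symmetric W and zero diagonal (not a model from the lecture); each conditional is a sigmoid, so the partition function is never evaluated.

```python
# Pseudolikelihood for a toy fully visible binary MRF
#   p~(x) = exp(0.5 * x^T W x + b^T x),  x in {0,1}^n,
# with symmetric W and zero diagonal.  Each conditional p(x_i | x_-i) is a
# sigmoid of b_i + sum_j W_ij x_j, so no partition function is needed.
import numpy as np

def log_sigmoid(z):
    # numerically stable log(sigmoid(z))
    return -np.logaddexp(0.0, -z)

def pseudo_log_likelihood(W, b, X):
    """Sum over the data of sum_i log p(x_i | x_{-i}).

    W : (n, n) symmetric coupling matrix with zero diagonal
    b : (n,)   bias vector
    X : (m, n) binary data matrix
    """
    A = X @ W + b   # A[k, i] = b_i + sum_j W_ij x_j  (W_ii = 0, so x_i drops out)
    # log p(x_i | x_-i) = x_i * log sigma(A_i) + (1 - x_i) * log sigma(-A_i)
    ll = X * log_sigmoid(A) + (1 - X) * log_sigmoid(-A)
    return ll.sum()

# toy usage
rng = np.random.default_rng(0)
n = 5
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)
X = rng.integers(0, 2, size=(20, n)).astype(float)
print(pseudo_log_likelihood(W, b, X))
```

Gradients of this quantity can be taken with any autodiff framework; the point of the sketch is only that no sum over all 2^n states ever appears.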
Pseudolikelihood
■ Performance of pseudolikelihood-based approaches
  ❑ Depends largely on how the model will be used
  ❑ Performs poorly when a good model of the full joint p(x) is needed (e.g., sampling)
  ❑ Performs well when only the conditional distributions used during training are needed (e.g., filling in small amounts of missing values)
Pseudolikelihood
๏ฎ Performance of pseudolikelihood-based approaches
๏ฑ
Generalized pseudolikelihood:
Powerful if the data has regular structure that allows the index sets to be designed to capture the most important correlations while leaving out negligible correlations (e.g. pixels in natural images: each GPLโs set is a small, spatially
localized window)
Pseudolikelihood
■ Pseudolikelihood estimator's weaknesses
  ❑ Cannot be used with other approximations that provide only a lower bound on p̃(x), such as variational inference
    ➢ Hard to apply to deep models (e.g., the deep Boltzmann machine), because variational methods are used to approximately marginalize out the many layers of hidden variables that interact with each other
  ❑ But it is still useful for training single-layer models, or deep models using approximate inference methods that are not based on lower bounds
Outline
The Log-Likelihood Gradient
Stochastic Maximum Likelihood and Contrastive Divergence
Pseudolikelihood
Noise-Contrastive Estimation
Estimating the Partition Function
Noise-Contrastive Estimation (NCE)
■ Consider a language model which predicts a word w in a vocabulary V based on a context c
  ❑ p_θ(w|c) = u_θ(w,c) / Σ_{w'∈V} u_θ(w',c) = u_θ(w,c) / Z_θ(c)
■ The goal is to learn the parameters θ using maximum likelihood
  ❑ I.e., find the parameters of the model p_θ(w|c) that best approximate the empirical distribution p̃(w|c)
■ However, evaluating the partition function Z_θ(c) is expensive (see the sketch below)
■ How can we learn θ without evaluating the partition function?
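A minimal sketch of where the cost comes from, assuming a toy bilinear scorer and made-up sizes (not taken from the lecture): computing a single normalized probability p_θ(w|c) requires summing u_θ(w', c) over the entire vocabulary, which is the O(|V|) work NCE tries to avoid.

```python
# Toy softmax language model  p_theta(w|c) = u_theta(w,c) / Z_theta(c).
# The denominator Z_theta(c) sums over the whole vocabulary.
import numpy as np

V, d = 50_000, 128                            # vocabulary size, context dimension (made up)
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.01, size=(V, d))   # output word embeddings (part of theta)

def u_theta(w, c_vec):
    """Unnormalized score u_theta(w, c) = exp(w_out . c)."""
    return np.exp(W_out[w] @ c_vec)

def p_theta(w, c_vec):
    """Normalized probability: needs Z_theta(c), a sum over all |V| words."""
    scores = W_out @ c_vec                    # O(|V| * d) work just for the normalizer
    Z_c = np.exp(scores).sum()
    return u_theta(w, c_vec) / Z_c

c_vec = rng.normal(size=d)                    # some context representation
print(p_theta(123, c_vec))
```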
Noise-Contrastive Estimation
■ NCE reduces the language-model estimation problem to the problem of estimating the parameters of a probabilistic classifier that uses the same parameters to distinguish samples from the empirical distribution from samples generated by a noise distribution
  ❑ Intuition: a good θ should be able to classify whether a word came from the noise distribution or from the empirical distribution, i.e., model p(D = 0 or 1 | c, w)
■ The two-class training data is generated as follows: sample a context c from p̃(c), then sample one "true" example from p̃(w|c), with auxiliary label D = 1 indicating the data point is drawn from the true distribution, and k "noise" examples from a noise distribution q(w), with auxiliary label D = 0 indicating these data points are noise
■ Thus, the joint probability of d and w given c is
  ❑ p(d, w | c) = (k/(1+k)) · q(w) if d = 0,   and   (1/(1+k)) · p̃(w|c) if d = 1
Noise-Contrastive Estimation
■ Then, the conditional probability of d having observed w and c is
  ❑ p(D = 0 | c, w) = [ (k/(1+k)) q(w) ] / [ (1/(1+k)) p̃(w|c) + (k/(1+k)) q(w) ] = k·q(w) / ( p̃(w|c) + k·q(w) )
  ❑ p(D = 1 | c, w) = p̃(w|c) / ( p̃(w|c) + k·q(w) )
■ NCE replaces the empirical distribution p̃(w|c) with the model distribution p_θ(w|c)
■ NCE makes two further assumptions
  ❑ It treats the partition function value Z(c) as a learnable parameter z_c rather than computing it
  ❑ Fixing z_c = 1 for all c is effective in practice
■ With these two assumptions, the probabilities above become
  ❑ p(D = 0 | c, w) = k·q(w) / ( u_θ(w,c) + k·q(w) )
  ❑ p(D = 1 | c, w) = u_θ(w,c) / ( u_θ(w,c) + k·q(w) )
Noise-Contrastive Estimation
■ We now have a binary classification problem with parameters θ that can be trained to maximize the conditional log-likelihood of D, with k negative samples chosen per observation:
  ❑ L_NCE^k = Σ_{(w,c)∈D} ( log p(D = 1 | c, w) + k · E_{w̄~q}[ log p(D = 0 | c, w̄) ] )
■ The expectation in L_NCE^k can be approximated by Monte Carlo sampling (see the sketch below):
  ❑ L_MC-NCE^k = Σ_{(w,c)∈D} ( log p(D = 1 | c, w) + k · Σ_{i=1}^k (1/k) · log p(D = 0 | c, w̄_i) )
               = Σ_{(w,c)∈D} ( log p(D = 1 | c, w) + Σ_{i=1}^k log p(D = 0 | c, w̄_i) ),   where w̄_i ~ q
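A minimal sketch of the Monte Carlo objective above for a single (w, c) pair, assuming the unnormalized scores u_θ and the noise probabilities q(·) are already available as plain numbers (all values below are made up):

```python
# NCE objective for one observed (w, c) pair with k sampled noise words, using
#   P(D=1|c,w) = u / (u + k q(w)),   P(D=0|c,w) = k q(w) / (u + k q(w)),
# i.e. the z_c = 1 parameterization from the slides.
import numpy as np

rng = np.random.default_rng(0)

def nce_loss(u_true, q_true, u_noise, q_noise, k):
    """Negative of the Monte Carlo NCE objective for one (w, c) pair.

    u_true  : scalar u_theta(w, c) for the observed word
    q_true  : scalar q(w) for the observed word
    u_noise : (k,) array of u_theta(w_bar_i, c) for the sampled noise words
    q_noise : (k,) array of q(w_bar_i)
    """
    log_p_d1 = np.log(u_true) - np.log(u_true + k * q_true)
    log_p_d0 = np.log(k * q_noise) - np.log(u_noise + k * q_noise)
    return -(log_p_d1 + log_p_d0.sum())

# toy usage with made-up scores and a unigram-like noise distribution
k = 5
print(nce_loss(u_true=2.3, q_true=1e-4,
               u_noise=rng.uniform(0.1, 1.0, size=k),
               q_noise=np.full(k, 1e-4), k=k))
```

In practice this is summed over a minibatch of (w, c) pairs and optimized with respect to θ (written here as a loss to be minimized).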
Negative Sampling
■ Negative sampling is a special case of Noise-Contrastive Estimation (NCE) where k = |V| and q is uniform, so that k·q(w) = 1 (see the sketch below). Thus,
  ❑ p(D = 0 | c, w) = k·q(w) / ( u_θ(w,c) + k·q(w) ) = 1 / ( u_θ(w,c) + 1 )
  ❑ p(D = 1 | c, w) = u_θ(w,c) / ( u_θ(w,c) + k·q(w) ) = u_θ(w,c) / ( u_θ(w,c) + 1 )
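A minimal sketch of the simplification above, assuming the usual parameterization u_θ(w, c) = exp(s_θ(w, c)): with k·q(w) = 1, p(D = 1 | c, w) = u/(u+1) is just the sigmoid σ(s_θ(w, c)), so the per-example objective becomes a sum of log-sigmoids over the observed word and the sampled negatives (the scores below are made up):

```python
# Negative-sampling objective for one (w, c) pair, using u_theta = exp(s_theta):
#   P(D=1|c,w) = sigma(s_theta(w,c)),   P(D=0|c,w) = sigma(-s_theta(w,c)).
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

s_true = 1.7                          # s_theta(w, c) for the observed word
s_noise = np.array([-0.3, 0.4, 0.1])  # scores of sampled "negative" words

# log P(D=1|c,w) + sum_i log P(D=0|c,w_bar_i)
objective = np.log(sigmoid(s_true)) + np.log(sigmoid(-s_noise)).sum()
print(objective)
```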
Outline
The Log-Likelihood Gradient
Stochastic Maximum Likelihood and Contrastive Divergence
Pseudolikelihood
Noise-Contrastive Estimation
Estimating the Partition Function
Estimating Partition Function
■ Reminder: comparing two models
  ❑ Assume we have two models:
    ■ M_A with probability distribution p_A(x; θ_A) = (1/Z_A) · p̃_A(x; θ_A)
    ■ M_B with probability distribution p_B(x; θ_B) = (1/Z_B) · p̃_B(x; θ_B)
  ❑ If Σ_i log p_A(x^(i); θ_A) − Σ_i log p_B(x^(i); θ_B) > 0, we say that M_A is a better model than M_B
  ❑ It seems that we need to know the partition functions to evaluate this criterion
  ❑ However, since
    Σ_i log p_A(x^(i); θ_A) − Σ_i log p_B(x^(i); θ_B) = Σ_i log [ p̃_A(x^(i); θ_A) / p̃_B(x^(i); θ_B) ] − m · log [ Z(θ_A) / Z(θ_B) ]
    (with m test points x^(i)), we do not need to know the partition functions themselves, only their ratio
Estimating Partition Function
■ If we know the ratio of the two partition functions r = Z(θ_B) / Z(θ_A) and the actual value of one of them, say Z(θ_A), we can compute the value of the other:
  ❑ Z(θ_B) = ( Z(θ_B) / Z(θ_A) ) · Z(θ_A) = r · Z(θ_A)
Monte Carlo Method
■ Monte Carlo methods provide an approximation of the partition function
■ Example (simple importance sampling)
  ❑ Consider a desired partition function Z_1 = ∫ p̃_1(x) dx and a proposal distribution p_0(x) = (1/Z_0) · p̃_0(x) that supports tractable sampling and tractable evaluation of both Z_0 and p̃_0(x)
  ❑ Then
    Z_1 = ∫ p̃_1(x) dx = ∫ ( p_0(x) / p_0(x) ) · p̃_1(x) dx = Z_0 ∫ p_0(x) · ( p̃_1(x) / p̃_0(x) ) dx
  ❑ Importance-sampling estimator (a toy numerical check follows below):
    Ẑ_1 = ( Z_0 / K ) Σ_{k=1}^K p̃_1(x^(k)) / p̃_0(x^(k)),   where x^(k) ~ p_0
  ❑ This also allows us to estimate the ratio Z_1 / Z_0 directly:
    r̂ = Ẑ_1 / Z_0 = ( 1 / K ) Σ_{k=1}^K p̃_1(x^(k)) / p̃_0(x^(k))
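A minimal numerical sketch of the estimator above on an assumed 1-D toy problem (not from the lecture): the target p̃_1(x) = exp(−x²/2) has known Z_1 = √(2π), which lets us check the answer, and the proposal p_0 is a wider Gaussian with known Z_0.

```python
# Simple importance-sampling estimate of a partition function.
# Target: p~_1(x) = exp(-x^2/2), true Z_1 = sqrt(2*pi) ~= 2.5066.
# Proposal: p_0 = N(0, 2^2), with tractable Z_0 and density.
import numpy as np

rng = np.random.default_rng(0)
K = 100_000

sigma0 = 2.0
Z0 = np.sqrt(2 * np.pi) * sigma0              # known normalizer of the proposal
x = rng.normal(0.0, sigma0, size=K)           # x^(k) ~ p_0

p1_tilde = np.exp(-x**2 / 2.0)                # unnormalized target
p0_tilde = np.exp(-x**2 / (2.0 * sigma0**2))  # unnormalized proposal

ratio_hat = np.mean(p1_tilde / p0_tilde)      # estimate of Z_1 / Z_0
Z1_hat = Z0 * ratio_hat                       # estimate of Z_1
print(Z1_hat, np.sqrt(2 * np.pi))             # estimated vs true Z_1
```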
Monte Carlo Method
■ Now we have an approximation Ẑ_1 of the desired partition function
■ However, this method is often not practical, because it is difficult to find a tractable p_0 such that
  ❑ it is simple enough to evaluate, and
  ❑ it is close enough to p_1 to give a high-quality approximation
■ If p_0 is not close to p_1, then most samples from p_0 will have low probability under p_1 and therefore make a negligible contribution to the sum
Monte Carlo Method
■ Problem: the proposal p_0 is too far from p_1
■ Two strategies to address this problem
  ❑ Annealed importance sampling and bridge sampling
  ❑ Idea: introduce intermediate distributions that attempt to bridge the gap between p_0 and p_1
Annealed Importance Sampling
■ Annealed Importance Sampling (AIS) overcomes this problem by using intermediate distributions
■ We have a sequence of distributions p_0, p_1, …, p_n with a corresponding sequence of partition functions Z_0, Z_1, …, Z_n
■ Then the ratio of the endpoint partition functions telescopes through the intermediate ones:
  Z_n / Z_0 = ( Z_n / Z_0 ) · ( Z_1 / Z_1 ) ⋯ ( Z_{n−1} / Z_{n−1} ) = ( Z_1 / Z_0 ) · ( Z_2 / Z_1 ) ⋯ ( Z_n / Z_{n−1} ) = Π_{j=0}^{n−1} Z_{j+1} / Z_j
Annealed Importance Sampling
■ How can we obtain the intermediate distributions?
  ❑ One popular choice is the weighted geometric average of the target distribution p_1 and the proposal distribution p_0 (a toy sketch follows below):
    ■ p_{η_j} ∝ p_1^{η_j} · p_0^{1−η_j},   with 0 ≤ η_j ≤ 1
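A minimal sketch of AIS on the same assumed 1-D toy problem, using the geometric path above, one Metropolis transition per intermediate distribution, and importance weights accumulated as Σ_j [ log p̃_{η_j}(x) − log p̃_{η_{j−1}}(x) ]:

```python
# AIS estimate of Z_1 / Z_0 on a 1-D toy problem.
# Proposal p_0 = N(0, 3^2) (Z_0 known); target p~_1(x) = exp(-x^2/2), true Z_1 = sqrt(2*pi).
import numpy as np

rng = np.random.default_rng(0)

sigma0 = 3.0
Z0 = np.sqrt(2 * np.pi) * sigma0

def log_p0(x):                     # log p~_0 (unnormalized proposal)
    return -x**2 / (2 * sigma0**2)

def log_p1(x):                     # log p~_1 (unnormalized target)
    return -x**2 / 2.0

def log_p_eta(x, eta):             # geometric path: log p~_eta = eta*log p~_1 + (1-eta)*log p~_0
    return eta * log_p1(x) + (1 - eta) * log_p0(x)

n_chains, n_levels = 2000, 50
etas = np.linspace(0.0, 1.0, n_levels + 1)

x = rng.normal(0.0, sigma0, size=n_chains)   # exact samples from p_0
log_w = np.zeros(n_chains)                   # accumulated log importance weights

for j in range(1, n_levels + 1):
    # weight update: log p~_{eta_j}(x) - log p~_{eta_{j-1}}(x), using the current x
    log_w += log_p_eta(x, etas[j]) - log_p_eta(x, etas[j - 1])
    # one Metropolis transition that leaves p_{eta_j} invariant
    prop = x + rng.normal(0.0, 0.5, size=n_chains)
    accept = np.log(rng.uniform(size=n_chains)) < log_p_eta(prop, etas[j]) - log_p_eta(x, etas[j])
    x = np.where(accept, prop, x)

ratio_hat = np.exp(log_w).mean()             # estimate of Z_1 / Z_0
print(Z0 * ratio_hat, np.sqrt(2 * np.pi))    # estimated vs true Z_1 (roughly 2.5066)
```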
Bridge Sampling
■ Bridge sampling is another method similar to AIS
■ The only difference between the two methods is that bridge sampling uses a single distribution p_*, known as the bridge, while AIS uses a whole series of intermediate distributions
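A minimal sketch of bridge sampling on the same assumed toy problem, taking the geometric-mean bridge p̃_*(x) = √( p̃_0(x) · p̃_1(x) ) as one common choice for the single bridge distribution (and exploiting the fact that this toy target can also be sampled exactly):

```python
# Bridge-sampling estimate of Z_1 / Z_0 on the 1-D toy problem.
# Uses E_{p_0}[p~_*/p~_0] = Z_*/Z_0 and E_{p_1}[p~_*/p~_1] = Z_*/Z_1;
# their ratio estimates Z_1 / Z_0.
import numpy as np

rng = np.random.default_rng(0)
K = 100_000
sigma0 = 3.0
Z0 = np.sqrt(2 * np.pi) * sigma0

def p0_tilde(x):                   # unnormalized proposal N(0, sigma0^2)
    return np.exp(-x**2 / (2 * sigma0**2))

def p1_tilde(x):                   # unnormalized target
    return np.exp(-x**2 / 2.0)

def bridge_tilde(x):               # geometric-mean bridge (assumed choice)
    return np.sqrt(p0_tilde(x) * p1_tilde(x))

x0 = rng.normal(0.0, sigma0, size=K)   # samples from the proposal p_0
x1 = rng.normal(0.0, 1.0, size=K)      # samples from the (toy) target p_1

ratio_hat = np.mean(bridge_tilde(x0) / p0_tilde(x0)) / np.mean(bridge_tilde(x1) / p1_tilde(x1))
print(Z0 * ratio_hat, np.sqrt(2 * np.pi))   # estimated vs true Z_1
```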
What you need to know
■ The Log-Likelihood Gradient
■ Stochastic Maximum Likelihood and Contrastive Divergence
■ Pseudolikelihood
■ Noise-Contrastive Estimation
■ Estimating the Partition Function
Questions?