(1)

Advanced Deep Learning

Confronting the Partition Function - 2

U Kang

Seoul National University

(2)

Outline

The Log-Likelihood Gradient

Stochastic Maximum Likelihood and Contrastive Divergence

Pseudolikelihood

Noise-Contrastive Estimation

Estimating the Partition Function

(3)

Pseudolikelihood

■ Another approach to dealing with the intractability of the partition function is to train the model without computing the partition function at all
■ Observation: conditional distributions can be computed without the partition function

(4)

Pseudolikelihood

■ Conditional probabilities can be written as ratios of unnormalized probabilities, so the partition function cancels (see the reconstruction below)
❑ If a and c are large, however, this may require a large number of variables to be marginalized
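
A plausible reconstruction of the conditional referred to above, assuming x is partitioned into the sets a, b, and c as in the standard treatment:

$$
p(\mathbf{a} \mid \mathbf{b})
= \frac{p(\mathbf{a}, \mathbf{b})}{p(\mathbf{b})}
= \frac{\sum_{\mathbf{c}} \tilde{p}(\mathbf{a}, \mathbf{b}, \mathbf{c})}{\sum_{\mathbf{a}', \mathbf{c}} \tilde{p}(\mathbf{a}', \mathbf{b}, \mathbf{c})}
$$

The partition function cancels between numerator and denominator, but evaluating the two sums still requires marginalizing over a and c.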

(5)

Main Idea of Pseudolikelihood

■ If we have n variables, the log-likelihood decomposes into a chain of conditional probabilities (reconstructed below)
❑ We can make the set a maximally small (a single variable), but in the worst case the set c that must be marginalized still has size n − 1
■ What if we simply move c into b to avoid this computational cost? ⇒ pseudolikelihood
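
A reconstruction of the chain decomposition referred to above, together with the pseudolikelihood objective it motivates (standard formulation; the exact notation on the original slide may differ):

$$
\log p(\mathbf{x}) = \log p(x_1) + \log p(x_2 \mid x_1) + \cdots + \log p(x_n \mid x_{1:n-1})
$$

Moving every variable other than $x_i$ into the conditioning set gives the pseudolikelihood objective

$$
\sum_{i=1}^{n} \log p(x_i \mid \mathbf{x}_{-i}),
$$

each term of which is a conditional distribution that needs no partition function.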

(6)

Pseudolikelihood

■ Generalized Pseudolikelihood Estimator
❑ We trade computational complexity for deviation from maximum-likelihood behavior
❑ Uses m sets $S^{(i)}$, $i = 1, \ldots, m$, of indices of variables that appear together on the left side of the conditioning bar
❑ Objective function (reconstructed below)
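
The objective is reconstructed here in its standard form, with index sets $S^{(i)}$:

$$
\sum_{i=1}^{m} \log p\big(\mathbf{x}_{S^{(i)}} \mid \mathbf{x}_{-S^{(i)}}\big)
$$

Ordinary pseudolikelihood is the special case in which every $S^{(i)}$ contains a single variable; maximum likelihood is recovered when $m = 1$ and $S^{(1)}$ contains all variables.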

(7)

Pseudolikelihood

■ Performance of pseudolikelihood-based approaches
❑ Depends largely on how the model is used
❑ Performs poorly when a good model of the full joint p(x) is needed (e.g., sampling)
❑ Performs well when only the conditional distributions used during training are needed (e.g., filling in small amounts of missing values)

(8)

Pseudolikelihood

■ Performance of pseudolikelihood-based approaches
❑ Generalized pseudolikelihood: powerful if the data has a regular structure that allows the index sets to be designed to capture the most important correlations while leaving out negligible ones (e.g., pixels in natural images: each generalized-pseudolikelihood index set can be a small, spatially localized window)

(9)

Pseudolikelihood

■ Pseudolikelihood estimator's weaknesses
❑ Cannot be used with other approximations that provide only a lower bound on $\tilde{p}(\mathbf{x})$, such as variational inference
⇒ Hard to apply to deep models (e.g., the deep Boltzmann machine), because variational methods are used to approximately marginalize out the many layers of hidden variables that interact with each other
❑ Still useful for training single-layer models, or deep models trained with methods not based on lower bounds
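
To make the pseudolikelihood objective from the preceding slides concrete, here is a minimal sketch (not part of the original slides) for a small binary pairwise MRF; the model, its parameterization, and all names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pseudolikelihood(x, W, b):
    """Pseudolikelihood sum_i log p(x_i | x_{-i}) for a binary pairwise MRF.

    Illustrative unnormalized model: p~(x) = exp(0.5 * x^T W x + b^T x) with
    x in {0,1}^n, W symmetric, and zero diagonal.  Each conditional is a
    logistic function of the remaining variables, so no partition function
    is ever computed.
    """
    a = W @ x + b                                   # field seen by each variable
    p1 = sigmoid(a)                                 # p(x_i = 1 | x_{-i})
    return np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1))

# Toy usage: a random 5-variable model and one observed configuration
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(size=5)
x = rng.integers(0, 2, size=5).astype(float)
print(pseudolikelihood(x, W, b))
```

In practice this quantity would be summed over a dataset and maximized with respect to W and b by gradient ascent.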

(10)

Outline

The Log-Likelihood Gradient

Stochastic Maximum Likelihood and Contrastive Divergence

Pseudolikelihood

Noise-Contrastive Estimation

Estimating the Partition Function

(11)

Noise-Contrastive Estimation (NCE)

■ Consider a language model which predicts a word w in a vocabulary V based on a context c:
$$
p_\theta(w \mid c)
= \frac{u_\theta(w, c)}{\sum_{w' \in V} u_\theta(w', c)}
= \frac{u_\theta(w, c)}{Z_\theta(c)}
$$
■ The goal is to learn the parameters θ using maximum likelihood
❑ I.e., find parameters of the model $p_\theta(w \mid c)$ that best approximate the empirical distribution $\tilde{p}(w \mid c)$
■ However, evaluating the partition function $Z_\theta(c)$ is expensive, since it sums over the entire vocabulary
■ How can we learn θ without evaluating the partition function?

(12)

Noise-Contrastive Estimation

■ NCE reduces the language-model estimation problem to the problem of estimating the parameters of a probabilistic binary classifier that uses the same parameters to distinguish samples from the empirical distribution from samples generated by a noise distribution
❑ Intuition: a good θ performs well at classifying whether a word comes from the noise distribution or from the empirical distribution: p(D = 0 or 1 | c, w)
■ The two-class training data is generated as follows: sample a context c from $\tilde{p}(c)$, then sample one "true" word from $\tilde{p}(w \mid c)$ with auxiliary label D = 1, indicating that the data point is drawn from the true distribution, and k "noise" words from a noise distribution q(w) with auxiliary label D = 0, indicating that these data points are noise
■ Thus, the joint probability of d, w given c is as reconstructed below

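The joint distribution is reconstructed here by working backwards from the conditionals on the next slide (presumably the standard NCE construction):

$$
p(d, w \mid c) =
\begin{cases}
\dfrac{k}{1+k}\, q(w) & \text{if } d = 0, \\[6pt]
\dfrac{1}{1+k}\, \tilde{p}(w \mid c) & \text{if } d = 1.
\end{cases}
$$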
(13)

Noise-Contrastive Estimation

■ Then the conditional probability of d given w and c is
$$
p(D = 0 \mid c, w)
= \frac{\frac{k}{1+k}\, q(w)}{\frac{1}{1+k}\, \tilde{p}(w \mid c) + \frac{k}{1+k}\, q(w)}
= \frac{k \times q(w)}{\tilde{p}(w \mid c) + k \times q(w)}
$$
$$
p(D = 1 \mid c, w)
= \frac{\tilde{p}(w \mid c)}{\tilde{p}(w \mid c) + k \times q(w)}
$$
■ NCE replaces the empirical distribution $\tilde{p}(w \mid c)$ with the model distribution $p_\theta(w \mid c)$
■ NCE makes two further assumptions:
❑ It proposes that the partition-function value Z(c) be estimated as a parameter $z_c$
❑ Fixing $z_c = 1$ for all c is effective
■ With these two assumptions, the probabilities above become
$$
p(D = 0 \mid c, w) = \frac{k \times q(w)}{u_\theta(w, c) + k \times q(w)},
\qquad
p(D = 1 \mid c, w) = \frac{u_\theta(w, c)}{u_\theta(w, c) + k \times q(w)}
$$

(14)

Noise-Contrastive Estimation

■ We now have a binary classification problem with parameters θ that can be trained to maximize the conditional log-likelihood of D, with k negative samples drawn per observation:
$$
\mathcal{L}_{\mathrm{NCE}_k}
= \sum_{(w, c) \in \mathcal{D}}
\Big( \log p(D = 1 \mid c, w)
    + k\, \mathbb{E}_{\bar{w} \sim q}\, \log p(D = 0 \mid c, \bar{w}) \Big)
$$
■ The expectation in $\mathcal{L}_{\mathrm{NCE}_k}$ can be replaced by its Monte Carlo approximation:
$$
\mathcal{L}_{\mathrm{NCE}_k}^{\mathrm{MC}}
= \sum_{(w, c) \in \mathcal{D}}
\Big( \log p(D = 1 \mid c, w)
    + k \sum_{i=1,\, \bar{w}_i \sim q}^{k} \frac{1}{k} \log p(D = 0 \mid c, \bar{w}_i) \Big)
= \sum_{(w, c) \in \mathcal{D}}
\Big( \log p(D = 1 \mid c, w)
    + \sum_{i=1,\, \bar{w}_i \sim q}^{k} \log p(D = 0 \mid c, \bar{w}_i) \Big)
$$

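A minimal sketch (not from the slides) of the Monte Carlo NCE objective for a single (w, c) pair, assuming we are given the unnormalized model scores u_θ(·, c) with z_c fixed to 1 and the noise probabilities q(·); all names and values are illustrative.

```python
import numpy as np

def nce_loss_single(u_true, u_noise, q_true, q_noise, k):
    """Monte Carlo NCE objective for one (w, c) pair (to be maximized).

    u_true  : unnormalized model score u_theta(w, c) of the observed word
    u_noise : array of k scores u_theta(w_bar_i, c) for the sampled noise words
    q_true  : noise probability q(w) of the observed word
    q_noise : array of k noise probabilities q(w_bar_i)
    """
    p_true = u_true / (u_true + k * q_true)            # p(D = 1 | c, w)
    p_noise = (k * q_noise) / (u_noise + k * q_noise)  # p(D = 0 | c, w_bar_i)
    return np.log(p_true) + np.sum(np.log(p_noise))

# Toy usage with made-up scores and a (nearly) uniform noise distribution
k = 5
print(nce_loss_single(u_true=2.0,
                      u_noise=np.array([0.3, 1.2, 0.7, 0.1, 0.9]),
                      q_true=1e-4,
                      q_noise=np.full(k, 1e-4),
                      k=k))
```

The full objective sums this quantity over all (w, c) pairs in the training data.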
(15)

Negative Sampling

■ Negative sampling is a special case of Noise-Contrastive Estimation (NCE) in which k = |V| and q is uniform, so that k × q(w) = 1. Thus,
$$
p(D = 0 \mid c, w)
= \frac{k \times q(w)}{u_\theta(w, c) + k \times q(w)}
= \frac{1}{u_\theta(w, c) + 1}
$$
$$
p(D = 1 \mid c, w)
= \frac{u_\theta(w, c)}{u_\theta(w, c) + k \times q(w)}
= \frac{u_\theta(w, c)}{u_\theta(w, c) + 1}
$$

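A small observation that follows directly from the last expression (not stated on the slide): if the unnormalized score is written as $u_\theta(w, c) = e^{s_\theta(w, c)}$, then

$$
p(D = 1 \mid c, w) = \frac{e^{s_\theta(w, c)}}{e^{s_\theta(w, c)} + 1} = \sigma\big(s_\theta(w, c)\big),
$$

i.e., the familiar sigmoid-of-score form used in word2vec-style negative sampling.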
(16)

Outline

The Log-Likelihood Gradient

Stochastic Maximum Likelihood and Contrastive Divergence

Pseudolikelihood

Noise-Contrastive Estimation

Estimating the Partition Function

(17)

Estimating Partition Function

■ Reminder: comparing two models
❑ Assume we have two models:
■ $M_A$ with probability distribution $p_A(\mathbf{x}; \theta_A) = \frac{1}{Z_A} \tilde{p}_A(\mathbf{x}; \theta_A)$
■ $M_B$ with probability distribution $p_B(\mathbf{x}; \theta_B) = \frac{1}{Z_B} \tilde{p}_B(\mathbf{x}; \theta_B)$
❑ If $\sum_i \log p_A(x^{(i)}; \theta_A) - \sum_i \log p_B(x^{(i)}; \theta_B) > 0$, we say that $M_A$ is a better model than $M_B$
❑ It seems that we need to know the partition functions to evaluate this criterion
❑ Since
$$
\sum_i \log p_A\big(x^{(i)}; \theta_A\big) - \sum_i \log p_B\big(x^{(i)}; \theta_B\big)
= \sum_i \log \frac{\tilde{p}_A\big(x^{(i)}; \theta_A\big)}{\tilde{p}_B\big(x^{(i)}; \theta_B\big)}
- m \log \frac{Z(\theta_A)}{Z(\theta_B)},
$$
where $m$ is the number of data points, we do not need to know the partition functions themselves, only their ratio

(18)

Estimating Partition Function

■ If we know the ratio of the two partition functions $r = \frac{Z(\theta_B)}{Z(\theta_A)}$ and the actual value of one of them, say $Z(\theta_A)$, we can compute the value of the other:
$$
Z(\theta_B) = \frac{Z(\theta_B)}{Z(\theta_A)} \cdot Z(\theta_A) = r \, Z(\theta_A)
$$

(19)

Monte Carlo Method

■ Monte Carlo (MC) methods provide an approximation of the partition function
■ Example
❑ Consider a desired partition function $Z_1 = \int \tilde{p}_1(\mathbf{x})\, d\mathbf{x}$ and a proposal distribution $p_0(\mathbf{x}) = \frac{1}{Z_0} \tilde{p}_0(\mathbf{x})$ that supports tractable sampling and tractable evaluation of both $Z_0$ and $\tilde{p}_0(\mathbf{x})$
❑ Then
$$
Z_1 = \int \tilde{p}_1(\mathbf{x})\, d\mathbf{x}
    = \int \frac{p_0(\mathbf{x})}{p_0(\mathbf{x})}\, \tilde{p}_1(\mathbf{x})\, d\mathbf{x}
    = Z_0 \int p_0(\mathbf{x})\, \frac{\tilde{p}_1(\mathbf{x})}{\tilde{p}_0(\mathbf{x})}\, d\mathbf{x}
$$
❑ Monte Carlo estimator:
$$
\hat{Z}_1 = \frac{Z_0}{K} \sum_{k=1}^{K} \frac{\tilde{p}_1\big(\mathbf{x}^{(k)}\big)}{\tilde{p}_0\big(\mathbf{x}^{(k)}\big)},
\qquad \mathbf{x}^{(k)} \sim p_0
$$
❑ This also allows us to estimate the ratio directly:
$$
\widehat{\frac{Z_1}{Z_0}} = \frac{1}{K} \sum_{k=1}^{K} \frac{\tilde{p}_1\big(\mathbf{x}^{(k)}\big)}{\tilde{p}_0\big(\mathbf{x}^{(k)}\big)}
$$
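
A minimal numerical sketch of this estimator (not from the slides): the target is an unnormalized 1-D Gaussian whose exact partition function is known, the proposal is a standard Gaussian, and all names and settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target p~_1: Gaussian with mean 1 and std 0.5, so Z_1 = sqrt(2*pi)*0.5
p1_tilde = lambda x: np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)
Z1_true = np.sqrt(2 * np.pi) * 0.5

# Tractable proposal p_0: standard Gaussian with known Z_0, easy to sample from
p0_tilde = lambda x: np.exp(-0.5 * x ** 2)
Z0 = np.sqrt(2 * np.pi)

K = 100_000
x = rng.normal(size=K)                              # x^(k) ~ p_0
Z1_hat = Z0 * np.mean(p1_tilde(x) / p0_tilde(x))    # simple importance sampling estimate
print(Z1_hat, Z1_true)
```

The estimate is accurate here only because the proposal already overlaps the target well; the next slides discuss what goes wrong otherwise.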

(20)

Monte Carlo Method

■ Now we have an approximation $\hat{Z}_1$ of the desired partition function
■ However, this method is not practical, because it is difficult to find a tractable $p_0$ such that:
❑ it is simple enough to evaluate and sample from
❑ it is close enough to $p_1$ to give a high-quality approximation
■ If $p_0$ is not close to $p_1$, most samples from $p_0$ have low probability under $p_1$ and therefore make a negligible contribution to the sum

(21)

Monte Carlo Method

■ Problem: the proposal $p_0$ is too far from $p_1$
■ Two strategies to solve this problem:
❑ Annealed importance sampling and bridge sampling
❑ Idea: introduce intermediate distributions that attempt to bridge the gap between $p_0$ and $p_1$
(22)

Annealed Importance Sampling

■ Annealed Importance Sampling (AIS) overcomes this problem by using intermediate distributions
■ We have a sequence of distributions $p_0, p_1, \ldots, p_n$ with the corresponding partition functions $Z_0, Z_1, \ldots, Z_n$ (with $p_0$ the tractable proposal and $p_n$ the target)
■ Then
$$
\frac{Z_n}{Z_0}
= \frac{Z_n}{Z_0} \cdot \frac{Z_1}{Z_1} \cdots \frac{Z_{n-1}}{Z_{n-1}}
= \frac{Z_1}{Z_0} \cdot \frac{Z_2}{Z_1} \cdots \frac{Z_n}{Z_{n-1}}
= \prod_{j=0}^{n-1} \frac{Z_{j+1}}{Z_j},
$$
and each factor $Z_{j+1}/Z_j$ can be estimated with simple importance sampling because neighboring distributions are close to each other
(23)

Annealed Importance Sampling

■ How can we obtain the intermediate distributions?
❑ One popular choice is the weighted geometric average of the target distribution $p_1$ and the proposal distribution $p_0$:
$$
p_{\eta_j} \propto p_1^{\eta_j}\, p_0^{\,1-\eta_j},
\qquad 0 = \eta_0 < \eta_1 < \cdots < \eta_n = 1
$$
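
A minimal 1-D sketch of AIS with geometric intermediate distributions (not from the slides): it estimates the partition function of an unnormalized Gaussian target using a few Metropolis steps at each intermediate distribution; all names and settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized proposal p~_0 (standard normal) and target p~_1 (mean 3, std 0.5)
log_p0 = lambda x: -0.5 * x ** 2
log_p1 = lambda x: -0.5 * ((x - 3.0) / 0.5) ** 2
Z0 = np.sqrt(2 * np.pi)                 # known partition function of the proposal
Z1_true = np.sqrt(2 * np.pi) * 0.5      # exact partition function of the target

def log_p_eta(x, eta):
    # Geometric intermediate: p_eta proportional to p_1^eta * p_0^(1 - eta)
    return eta * log_p1(x) + (1.0 - eta) * log_p0(x)

def ais_log_weight(etas, n_metropolis=5, step=0.5):
    x = rng.normal()                    # start from a sample of p_0
    log_w = 0.0
    for eta_prev, eta in zip(etas[:-1], etas[1:]):
        # Importance weight between neighbouring intermediate distributions
        log_w += log_p_eta(x, eta) - log_p_eta(x, eta_prev)
        # A few Metropolis steps that leave p_eta invariant
        for _ in range(n_metropolis):
            x_prop = x + step * rng.normal()
            if np.log(rng.uniform()) < log_p_eta(x_prop, eta) - log_p_eta(x, eta):
                x = x_prop
    return log_w

etas = np.linspace(0.0, 1.0, 51)        # eta_0 = 0, ..., eta_n = 1
weights = np.exp([ais_log_weight(etas) for _ in range(2000)])
print(Z0 * weights.mean(), Z1_true)     # AIS estimate of the target Z vs. exact value
```

Because neighbouring intermediates differ only slightly, each per-step weight is well behaved even though $p_0$ and $p_1$ barely overlap.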
(24)

Bridge Sampling

■ Bridge sampling is another method similar to AIS
■ The only difference between the two methods is that bridge sampling uses a single distribution $p_*$, known as the bridge, while AIS uses a whole series of intermediate distributions
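
In the standard formulation (a reconstruction, since the estimator is not shown on the slide), the ratio of partition functions is estimated with samples from both endpoints as

$$
\frac{Z_1}{Z_0}
\approx
\frac{\frac{1}{K}\sum_{k=1}^{K} \tilde{p}_*\big(\mathbf{x}_0^{(k)}\big) \big/ \tilde{p}_0\big(\mathbf{x}_0^{(k)}\big)}
     {\frac{1}{K}\sum_{k=1}^{K} \tilde{p}_*\big(\mathbf{x}_1^{(k)}\big) \big/ \tilde{p}_1\big(\mathbf{x}_1^{(k)}\big)},
\qquad \mathbf{x}_0^{(k)} \sim p_0,\ \ \mathbf{x}_1^{(k)} \sim p_1,
$$

which works well when the bridge $p_*$ has large overlap with both $p_0$ and $p_1$.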

(25)

What you need to know

■ The Log-Likelihood Gradient
■ Stochastic Maximum Likelihood and Contrastive Divergence
■ Pseudolikelihood
■ Noise-Contrastive Estimation
■ Estimating the Partition Function

(26)

Questions?
