
(1)

When Monte Carlo and Optimization met in a Markovian dance

Gersende Fort CNRS

Institut de Mathématiques de Toulouse, France

ICTS "Advances in Applied Probability", Bengaluru, August 2019.

(2)

Intertwined, why?

(3)

To improve Monte Carlo methods targeting a distribution π

•The "naive" MC sampler depends on design parameters θ, in Rp or in innite dimension

• Theoretical studies characterize an optimal choice θ⋆ of these parameters by

θ⋆ ∈ Θ s.t. ∫ H(θ⋆, x) dπ(x) = 0   or   θ⋆ ∈ argmin_{θ∈Θ} ∫ C(θ, x) dπ(x).

• Strategies:

- Strategy 1: a preliminary "machinery" for the approximation of θ⋆; then run the MC sampler with θ ← θ⋆

- Strategy 2: learn θ and sample concomitantly

(4)

To make optimization methods tractable

• Intractable objective function

Find θ s.t. h(θ) = 0 when h is not explicit: h(θ) = ∫_X H(θ, x) dπ_θ(x),

or argmin_{θ∈Θ} f(θ) with f(θ) := ∫_X C(θ, x) dπ_θ(x).

• Intractable auxiliary quantities

Ex-1 Gradient-based methods: ∇f(θ) = ∫_X H(θ, x) dπ_θ(x)

Ex-2 Majorize-Minimization methods: at iteration t, f(θ) ≤ F_t(θ) = ∫_X H_t(θ, x) dπ_{t,θ}(x)

• Strategies: Use Monte Carlo to approximate the unknown quantities.
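For concreteness, a minimal sketch of this strategy for Ex-1: the intractable integral ∇f(θ) = ∫_X H(θ, x) dπ_θ(x) is replaced by an empirical average over (approximate) draws from π_θ. The names H and sample_pi_theta below are hypothetical placeholders for the problem-specific integrand and sampler, not part of the talk.

```python
import numpy as np

def mc_gradient(theta, H, sample_pi_theta, m=1000, rng=None):
    """Monte Carlo estimate of grad f(theta) = int_X H(theta, x) dpi_theta(x).

    H(theta, x) evaluates the integrand; sample_pi_theta(theta, m, rng)
    returns m draws (exact or MCMC) approximately distributed under pi_theta.
    Both are user-supplied placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    xs = sample_pi_theta(theta, m, rng)                 # X_1, ..., X_m
    return np.mean([H(theta, x) for x in xs], axis=0)   # empirical average
```

With exact i.i.d. draws this estimate is unbiased; with MCMC draws it is biased for finite m, which is precisely the issue discussed at the end of the talk.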

(5)

In this talk, Markov !

• from the Monte Carlo point of view:

which conditions on the updating scheme ensure convergence of the sampler? Case: Markov chain Monte Carlo sampler

• from the optimization point of view:

which conditions on the Monte Carlo approximation ensure convergence of the stochastic optimization?

Case: Stochastic Approximation methods with Markovian inputs

• (Talk) Application to a Computational Machine Learning problem: penalized Maximum Likelihood through Stochastic Proximal-Gradient based methods

(6)

Part I: Motivating examples

(7)

1st Ex. Adaptive Importance sampling by Wang-Landau approaches (1/6): The problem

• A highly multimodal target density dπ on X ⊆ R^d.

[Figure: plots of the multimodal target density on R^2.]

• Two samplers with different behaviors (plot: the x-path of a chain in R^2)

[Figure: x-path of each sampler (first coordinate vs. iteration), beta = 4.]

(8)

1st Ex. (2/6)

The strategy for choosing the proposal mechanism

• A family of proposal mechanisms obtained by biasing locally the target:

- given a partition X_1, · · · , X_I of X,

- for any weight vector θ = (θ(1), · · · , θ(I)),

dπ_θ(x) = [ 1 / Σ_{i=1}^I (θ⋆(i)/θ(i)) ] Σ_{i=1}^I 1_{X_i}(x) dπ(x)/θ(i),   with θ⋆(i) := ∫_{X_i} dπ(u).

• Optimal proposal: dπ_{θ⋆} <proof>

• Unfortunately, θ⋆ is unavailable.
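In code, the locally biased target only needs to be known up to its normalizing constant; a minimal sketch, where pi is the (possibly unnormalized) target density and stratum(x) returns the index i of the set X_i containing x — both are problem-specific placeholders.

```python
def biased_density(x, pi, stratum, theta):
    """Unnormalized density of pi_theta: proportional to pi(x) / theta(i) on X_i.

    Only ratios of this function are needed by a Metropolis-type sampler, so
    the normalizing constant sum_i theta_star(i) / theta(i) never has to be
    computed.
    """
    return pi(x) / theta[stratum(x)]
```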

(9)

1st Ex. (3/6)

If π_{θ⋆} were available

• The algorithm would be:

- Sample X_1, · · · , X_t, · · · i.i.d. with distribution dπ_{θ⋆} (or an MCMC with target dπ_{θ⋆})

- Compute the importance ratio

dπ/dπ_{θ⋆}(X_t) = I Σ_{i=1}^I 1_{X_i}(X_t) θ⋆(i)

• When approximating an expectation, set

∫ φ dπ ≈ (I/T) Σ_{t=1}^T Σ_{i=1}^I 1_{X_i}(X_t) θ⋆(i) φ(X_t).
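A sketch of the corresponding importance-reweighted estimator, assuming draws xs approximately distributed under π_{θ⋆}; stratum(x) is again a problem-specific placeholder returning the stratum index of x.

```python
import numpy as np

def reweighted_estimate(phi, xs, stratum, theta_star):
    """Estimate int phi dpi from draws xs targeting pi_{theta_star}.

    A point falling in stratum i receives the weight I * theta_star(i),
    i.e. the importance ratio dpi / dpi_{theta_star} at that point.
    """
    I = len(theta_star)
    weights = np.array([I * theta_star[stratum(x)] for x in xs])
    values = np.array([phi(x) for x in xs])
    return np.mean(weights * values)
```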

(10)

1st Ex. (4/6)

θ⋆ and therefore dπ_{θ⋆} are unknown, so what?

• θ⋆ ∈ R^I collects ∫_{X_i} dπ for all i ∈ {1, · · · , I},

• θ⋆ is the unique root of θ ↦ ∫_X H(θ, x) dπ_θ(x) ∈ R^I, where for all i ∈ {1, · · · , I}

H_i(θ, x) := θ(i) 1_{X_i}(x) − θ(i) Σ_{j=1}^I 1_{X_j}(x) θ(j),

thus suggesting the use of a Stochastic Approximation procedure: θ⋆ ≈ lim_t θ_t,

θ_{t+1} = θ_t + γ_{t+1} H(θ_t, X_{t+1}),   X_{t+1} ∼ dπ_{θ_t} or X_{t+1} ∼ one-step MCMC

• This update scheme is a normalized counter of the number of visits to the sets X_i

<proof>

(11)

1st Ex. (5/6)

The algorithm: Wang-Landau based procedures

• Initialisation: a weight vector θ_0. Repeat for t = 1, · · · , T:

- sample a point X_{t+1} ∼ P_{θ_t}(X_t, ·), where P_θ is invariant w.r.t. dπ_θ,

- update the estimate of θ⋆:

θ_{t+1} = θ_t + γ_{t+1} H(θ_t, X_{t+1}).

• Expected:

- the convergence of θ_t to θ⋆: an SA scheme, fed with an adaptive (controlled) MCMC sampler,

- the convergence of the distribution of X_t to dπ_{θ⋆},

thus allowing (see the sketch below)

∫ φ dπ ≈ (I/T) Σ_{t=1}^T Σ_{i=1}^I 1_{X_i}(X_t) θ_t(i) φ(X_t).
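Putting the two steps together, a minimal sketch of a Wang-Landau type procedure with a random-walk Metropolis kernel. The deterministic step size γ_{t+1} = 1/(t+1), the Gaussian proposal scale, and the final clipping/renormalization of θ are illustrative choices, not prescriptions from the talk (the cited papers also study self-tuned, random γ_t).

```python
import numpy as np

def wang_landau(pi, stratum, x0, theta0, T, step=0.5, rng=None):
    """Minimal Wang-Landau / Stochastic Approximation sketch.

    `pi` is the (possibly unnormalized) target density on R^d and `stratum(x)`
    returns the index i of the stratum X_i containing x; both are
    problem-specific placeholders.  At iteration t, one random-walk Metropolis
    step targets pi_{theta_t} (proportional to pi(x)/theta_t(i(x))), then
    theta_{t+1} = theta_t + gamma_{t+1} H(theta_t, X_{t+1}).
    """
    rng = np.random.default_rng() if rng is None else rng
    x, theta = np.asarray(x0, float), np.asarray(theta0, float).copy()
    for t in range(T):
        gamma = 1.0 / (t + 1)                      # illustrative step size
        # --- one step of a Metropolis kernel P_theta, invariant w.r.t. pi_theta
        prop = x + step * rng.standard_normal(x.shape)
        ratio = (pi(prop) / theta[stratum(prop)]) / (pi(x) / theta[stratum(x)])
        if rng.uniform() < min(1.0, ratio):
            x = prop
        # --- Stochastic Approximation update of theta
        H = -theta * theta[stratum(x)]             # H_i = -theta(i) theta(i(x)) ...
        H[stratum(x)] += theta[stratum(x)]         # ... + theta(i) 1_{X_i}(x)
        theta = theta + gamma * H
        theta = np.clip(theta, 1e-12, None)        # pragmatic safeguard
        theta /= theta.sum()                       # keep theta a probability vector
    return x, theta
```

Starting from θ_0 uniform over the strata, the visited points reweighted with the running estimates θ_t give the estimator displayed above.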

(12)

1st Ex. (6/6)

Does it work? Plot: convergence of θ_t and first exit times from one mode

[Figures: convergence of θ_t along the iterations, trace of the first coordinate X_{n,1} vs. iterations, and first exit time from one mode as a function of β for d = 3, 6, 12, 24, 48, 96 (reference slope 1.25).]

see F.-Kuhn-Jourdain-Lelièvre-Stoltz (2014); F.-Jourdain-Lelièvre-Stoltz (2015, 2017, 2018) for studies of these Wang-Landau based algorithms, including self-tuned SA update rules (γ_t is random).

(13)

Conclusion of the 1st example

• Iterative sampler

• Each iteration combines: (i) a sampling step X_{t+1} ∼ P_{θ_t}(X_t, ·); and (ii) an optimization step to update the knowledge of some optimal parameter.

• The points {X_1, · · · , X_t, · · · } can be seen as the output of a controlled Markov chain:

E[ f(X_{t+1}) | F_t ] = P_{θ_t} f(X_t),   F_t := σ(X_{0:t}, θ_0),

where P_θ has dπ_θ as its unique invariant distribution.

• The convergence of the parameter θ_t is the convergence of an SA scheme with "controlled Markovian" dynamics:

θ_{t+1} = θ_t + γ_{t+1} H(θ_t, X_{t+1})

(14)

2nd Example: penalized ML in latent variable models (1/6)

• An example from Pharmacokinetics:

- N patients.

- At time 0: dose D of a drug.

- For patient #i, observations Y_{i1}, · · · , Y_{iJ_i} giving the evolution of the concentration at times t_{i1}, · · · , t_{iJ_i}.

• The model (see the sketch below):

Y_{ij} = F(t_{ij}, X_i) + ε_{ij},   ε_{ij} i.i.d. ∼ N(0, σ²),

where X_i ∈ R^L is modeled as

X_i = Z_i β + d_i ∈ R^L,   d_i i.i.d. ∼ N_L(0, Ω) and independent of ε_{ij},

and Z_i is a known matrix s.t. each row of X_i has an intercept (fixed effect) and covariates.

• Statistical analysis: (i) estimation of θ = (β, σ², Ω), under sparsity constraints on β; (ii) selection of the covariates based on β̂.
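A small simulation sketch of this hierarchical model may help fix notation. The concentration curve F below (a one-compartment model with first-order absorption, dose D = 1, and a three-dimensional individual parameter) is purely hypothetical since the slides do not specify F; dimensions and values are illustrative.

```python
import numpy as np

def simulate_pk_data(Z, beta, Omega, sigma2, times, rng=None):
    """Simulate the latent-variable model of the slide.

    X_i = Z_i beta + d_i, d_i ~ N_L(0, Omega);  Y_ij = F(t_ij, X_i) + eps_ij,
    eps_ij ~ N(0, sigma2).  Here L = 3 and F is a hypothetical stand-in
    (one-compartment curve parameterized by exp(X_i) = (ka, ke, V)).
    """
    rng = np.random.default_rng() if rng is None else rng

    def F(t, x):
        ka, ke, V = np.exp(x)            # individual PK parameters, kept positive
        return (ka / (V * (ka - ke))) * (np.exp(-ke * t) - np.exp(-ka * t))

    Y = []
    for Zi, ti in zip(Z, times):         # one block per patient
        di = rng.multivariate_normal(np.zeros(Omega.shape[0]), Omega)
        Xi = Zi @ beta + di              # latent individual parameters
        Y.append(F(ti, Xi) + np.sqrt(sigma2) * rng.standard_normal(len(ti)))
    return Y
```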

(15)

2nd Ex. (2/6)

Penalized Maximum Likelihood

• The likelihood of Y := {Y_{ij}, 1 ≤ i ≤ N, 1 ≤ j ≤ J_i} is not explicit:

- The distribution of Y_{ij} given X_i is simple; the distribution of X_i is simple.

- The joint distribution has an explicit expression.

- It is an example of a latent variable model:

log L(Y; θ) = log ∫ p(Y, x_{1:N}; θ) dν(x_{1:N})

• Sparsity constraints on the parameter θ: through a penalty term g(θ).

• The penalized ML estimator is of the form

argmin_{θ∈Θ} ( −log L(Y; θ) + g(θ) ),

with an intractable objective function.

(16)

2nd Ex. (3/6)

What about first-order methods for solving the optimization?

• On the likelihood term:

- Usually regular enough so that the gradient exists and <proof>

∇_θ log L(Y; θ) = [ ∫ (∂_θ p(Y, x; θ) / p(Y, x; θ)) p(Y, x; θ) dµ(x) ] / ∫ p(Y, z; θ) dµ(z) = ∫ ∂_θ (log p(Y, x; θ)) dπ_θ(x),

where dπ_θ is the a posteriori distribution of x given Y (the dependence upon Y is omitted).

- the a posteriori distribution is known up to a normalizing constant.

• On the penalty term

- May be non-smooth, but: convex and lower semi-continuous.

- Hence a Proximal operator (implicit gradient) is associated (see the sketch below) - <See the talk>.
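As a concrete instance, a minimal sketch of one proximal-gradient step when g(θ) = λ‖θ‖₁, for which the proximal operator is soft-thresholding; grad_loglik is a placeholder returning (an estimate of) ∇_θ log L(Y; θ), which in this setting would itself be a Monte Carlo average of ∇_θ log p(Y, x; θ) over posterior draws x ∼ π_θ.

```python
import numpy as np

def prox_l1(theta, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)

def proximal_gradient_step(theta, grad_loglik, gamma, lam):
    """One (stochastic) proximal-gradient step for -log L(Y; theta) + lam * ||theta||_1.

    Gradient step on the smooth part (ascent on log L, i.e. descent on -log L),
    then the proximal map of gamma * g applied to the result.
    """
    return prox_l1(theta + gamma * grad_loglik(theta), gamma * lam)
```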

(17)

2nd Ex. (4/6)

What about EM-like methods for solving the optimization ?

• Expectation-Maximization (Dempster-Laird-Rubin, 1977), introduced to solve

argmin_{θ∈Θ} ( −log ∫_X p(x; θ) dµ(x) + g(θ) ),

where the first part is intractable, by iterating two steps:

- Expectation step

Q(θ, θ_t) := ∫ log p(x; θ) [ p(x; θ_t) / ∫ p(z; θ_t) dµ(z) ] dµ(x) = ∫ log p(x; θ) dπ_{θ_t}(x)

- Minimization step

θ_{t+1} := argmin_θ (−Q(θ, θ_t) + g(θ)).

• θ ↦ Q(θ, θ_t) is an integral which is intractable; dπ_{θ_t} is known up to a normalizing constant.

see F.-Moulines (2003); F.-Ollier-Samson (2018)

(18)

2nd Ex. (5/6)

• Both in EM-like approaches and in gradient-based approaches, we are

- faced with intractable auxiliary quantities of the form

∫_X H(θ, x) dπ_{θ_t}(x)    (1)

at iteration t of the optimization algorithm;

- an intractable integral; dπ_{θ_t} is often known up to a normalizing constant.

• What kind of approximation of the integral (1) at iteration t?

- Quadrature techniques: poor behavior w.r.t. the dimension of X.

- I.i.d. samples from π_{θ_t} to define a Monte Carlo approximation: not possible, in general.

- Use m samples {X_{j,t+1}, j ≤ m} from an MCMC sampler with unique invariant distribution dπ_{θ_t} (see the sketch below).
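A minimal sketch of this last option: a random-walk Metropolis kernel producing the m points {X_{j,t+1}}. Here log_p_joint(x, θ) is a placeholder for log p(Y, x; θ); only differences of it are used, so the normalizing constant of π_θ is never needed.

```python
import numpy as np

def rw_metropolis(log_p_joint, theta, x0, m, step=0.5, rng=None):
    """Draw m points {X_{j,t+1}} from an MCMC kernel with invariant law pi_theta.

    `log_p_joint(x, theta)` returns log p(Y, x; theta), the un-normalized
    log-posterior of x given Y; it is a problem-specific placeholder.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, float)
    samples = []
    for _ in range(m):
        prop = x + step * rng.standard_normal(x.shape)
        if np.log(rng.uniform()) < log_p_joint(prop, theta) - log_p_joint(x, theta):
            x = prop
        samples.append(x.copy())
    return np.array(samples)   # used to form (1/m) sum_j H(theta, X_{j,t+1})
```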

(19)

2nd Ex. (6/6) Does it work?

see F.-Moulines (2003) for EM-like approaches; see Atchadé-F.-Moulines (2017) and F.-Ollier-Samson (2018) for gradient-based approaches; see F.-Ollier-Samson (2018) for the parallel between EM-like and gradient-based techniques.

[Figure: parameter values along 500 iterations, four panels: Proximal MCEM with decreasing step size, Proximal MCEM with adaptive step size, Proximal SAEM with decreasing step size, Proximal SAEM with adaptive step size.]

(20)

Conclusion of the 2nd example

• Iterative optimization technique

• Each iteration combines: (i) an update of the parameter; (ii) a sampling step X_{j+1,t+1} ∼ P_{θ_t}(X_{j,t+1}, ·) to approximate auxiliary quantities.

• The convergence of {θ_t}_t is the convergence of a stochastically perturbed iterative optimization algorithm. At each iteration, an exact quantity ∫ H(θ, x) dπ_{θ_t}(x) is approximated by a Monte Carlo sum

∫ H(θ, x) dπ_{θ_t}(x) ≈ (1/m_{t+1}) Σ_{j=1}^{m_{t+1}} H(θ, X_{j,t+1})

• The points {X_{j,t+1}}_j satisfy

E[ f(X_{j,t+1}) | F_t ] = P_{θ_t}^j f(X_{0,t+1}),   F_t := σ(X_{·,0:t}, θ_0),   X_{0,t+1} = X_{m_t,t},

where P_θ has dπ_θ as its unique invariant distribution.

(21)

Conclusion of this first part (1/3): is a theory required?

(22)

Conclusion of this first part (2/3): is a theory required when sampling? YES! Convergence can be lost by the adaptation mechanism.

Even in a simple case when

∀θ ∈ Θ, P_θ is invariant w.r.t. dπ,

one can define a simple adaptation mechanism

X_{t+1} | past_{1:t} ∼ P_{θ_t}(X_t, ·),   θ_t ∈ σ(X_{1:t}),

such that

lim_t E[f(X_t)] ≠ ∫ f dπ.

<proof> A {0,1}-valued chain {X_t}_t defined by X_{t+1} ∼ P_{X_t}(X_t, ·), where the transition matrices are

P_0 = [[t_0, 1−t_0], [1−t_0, t_0]]    P_1 = [[t_1, 1−t_1], [1−t_1, t_1]]

Then P_0 and P_1 are both invariant w.r.t. [1/2, 1/2], but {X_t} is a Markov chain with transition matrix [[t_0, 1−t_0], [1−t_1, t_1]], whose invariant distribution is proportional to [1−t_1, 1−t_0] and differs from [1/2, 1/2] as soon as t_0 ≠ t_1. A simulation sketch follows below.
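A short simulation, a sketch under the parameterization above, makes the failure visible: the empirical frequency of state 0 converges to (1−t_1)/(2−t_0−t_1) rather than to 1/2.

```python
import numpy as np

def adaptive_two_state(t0, t1, T=200_000, rng=None):
    """Simulate X_{t+1} ~ P_{X_t}(X_t, .) with P_k = [[t_k, 1-t_k], [1-t_k, t_k]].

    Both kernels are invariant w.r.t. [1/2, 1/2], but the adaptively generated
    chain has invariant law proportional to [1 - t1, 1 - t0].
    """
    rng = np.random.default_rng() if rng is None else rng
    stay = {0: t0, 1: t1}          # P_k(k, k) = t_k when the current state is k
    x, visits0 = 0, 0
    for _ in range(T):
        x = x if rng.uniform() < stay[x] else 1 - x
        visits0 += (x == 0)
    return visits0 / T             # compare with 0.5 and with (1-t1)/(2-t0-t1)

# Example: t0 = 0.9, t1 = 0.5 gives an empirical frequency of state 0 close to
# (1-0.5)/(2-0.9-0.5) ≈ 0.833, far from the naively expected 0.5.
print(adaptive_two_state(0.9, 0.5))
```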

(23)

Conclusion of this first part (3/3): is a theory required when optimizing? YES! Unfortunately,

• a biased approximation <proof>

E[ (1/m_{t+1}) Σ_{j=1}^{m_{t+1}} H(θ, X_{j,t+1}) | F_t ] = ? ≠ ∫_X H(θ, x) dπ_{θ_t}(x)

• For a reduced computational cost: a bias which we would like NOT to vanish, i.e. m_t = m (possibly m = 1).

Ex. Stochastic Approximation with controlled Markovian dynamics:

θ_{t+1} = θ_t + γ_{t+1} H(θ_t, X_{t+1}),   X_{t+1} ∼ P_{θ_t}(X_t, ·)

= θ_t + γ_{t+1} ∫ H(θ_t, x) dπ_{θ_t}(x)  [the mean field h(θ_t)]  + γ_{t+1} ( H(θ_t, X_{t+1}) − h(θ_t) )  [a non-centered noise term]
