
(1)

When Monte Carlo and Optimization met in a Markovian dance

Gersende Fort CNRS

Institut de Mathématiques de Toulouse, France

ICTS "Advances in Applied Probability", Bengaluru, August 2019.

(2)

Intertwined, why?

(3)

To improve Monte Carlo methods targeting a distribution π

•The "naive" MC sampler depends on design parameters θ, in Rp or in innite dimension

• Theoretical studies characterize an optimal choice θ⋆ of these parameters by

θ⋆ ∈ Θ s.t. ∫ H(θ⋆, x) dπ(x) = 0   or   θ⋆ ∈ argmin_{θ∈Θ} ∫ C(θ, x) dπ(x).

• Strategies:

- Strategy 1: a preliminary "machinery" for the approximation of θ⋆; then run the MC sampler with θ ← θ⋆

- Strategy 2: learn θ and sample concomitantly

(4)

To make optimization methods tractable

• Intractable objective function

Find θ s.t. h(θ) = 0 when h is not explicit: h(θ) = ∫_X H(θ, x) dπ_θ(x),

or argmin_{θ∈Θ} f(θ) with f(θ) := ∫_X C(θ, x) dπ_θ(x).

• Intractable auxiliary quantities

Ex-1 Gradient-based methods: ∇f(θ) = ∫_X H(θ, x) dπ_θ(x)

Ex-2 Majorize-Minimization methods: at iteration t, f(θ) ≤ F_t(θ) = ∫_X H_t(θ, x) dπ_{t,θ}(x)

• Strategies: Use Monte Carlo to approximate the unknown quantities.
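For concreteness, a minimal sketch of this strategy for Ex-1: the intractable integral ∇f(θ) = ∫_X H(θ, x) dπ_θ(x) is replaced by an empirical average over (approximate) draws from π_θ. The names H and sample_pi_theta below are hypothetical placeholders for the problem-specific integrand and sampler, not part of the talk.

```python
import numpy as np

def mc_gradient(theta, H, sample_pi_theta, m=1000, rng=None):
    """Monte Carlo estimate of grad f(theta) = int_X H(theta, x) dpi_theta(x).

    H(theta, x) evaluates the integrand; sample_pi_theta(theta, m, rng)
    returns m draws (exact or MCMC) approximately distributed under pi_theta.
    Both are user-supplied placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    xs = sample_pi_theta(theta, m, rng)                 # X_1, ..., X_m
    return np.mean([H(theta, x) for x in xs], axis=0)   # empirical average
```

With exact i.i.d. draws this estimate is unbiased; with MCMC draws it is biased for finite m, which is precisely the issue discussed at the end of the talk.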

(5)

In this talk, Markov !

• from the Monte Carlo point of view:

which conditions on the updating scheme ensure convergence of the sampler? Case: Markov chain Monte Carlo sampler

• from the optimization point of view:

which conditions on the Monte Carlo approximation ensure convergence of the stochastic optimization?

Case: Stochastic Approximation methods with Markovian inputs

• (Talk) Application to a Computational Machine Learning problem: penalized Maximum Likelihood through Stochastic Proximal-Gradient based methods

(6)

Part I: Motivating examples

(7)

1st Ex. Adaptive Importance sampling by Wang-Landau approaches (1/6): The problem

• A highly multimodal target density dπ on X ⊆ R^d.

[Figure: plots of the multimodal target density on R^2.]

• Two samplers with different behaviors (plot: the x-path of a chain in R^2)

[Figure: x-path of each sampler (first coordinate vs. iteration), beta = 4.]

(8)

1st Ex. (2/6)

The strategy for choosing the proposal mechanism

• A family of proposal mechanisms obtained by biasing locally the target:

- given a partition X_1, · · · , X_I of X,

- for any weight vector θ = (θ(1), · · · , θ(I)),

dπ_θ(x) = [ 1 / Σ_{i=1}^I (θ⋆(i)/θ(i)) ] Σ_{i=1}^I 1_{X_i}(x) dπ(x)/θ(i),   with θ⋆(i) := ∫_{X_i} dπ(u).

• Optimal proposal: dπ_{θ⋆} <proof>

• Unfortunately, θ⋆ is unavailable.
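In code, the locally biased target only needs to be known up to its normalizing constant; a minimal sketch, where pi is the (possibly unnormalized) target density and stratum(x) returns the index i of the set X_i containing x — both are problem-specific placeholders.

```python
def biased_density(x, pi, stratum, theta):
    """Unnormalized density of pi_theta: proportional to pi(x) / theta(i) on X_i.

    Only ratios of this function are needed by a Metropolis-type sampler, so
    the normalizing constant sum_i theta_star(i) / theta(i) never has to be
    computed.
    """
    return pi(x) / theta[stratum(x)]
```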

(9)

1st Ex. (3/6)

If π_{θ⋆} were available

• The algorithm would be:

- Sample X_1, · · · , X_t, · · · i.i.d. with distribution dπ_{θ⋆} (or an MCMC with target dπ_{θ⋆})

- Compute the importance ratio

dπ/dπ_{θ⋆}(X_t) = I Σ_{i=1}^I 1_{X_i}(X_t) θ⋆(i)

• When approximating an expectation, set

∫ φ dπ ≈ (I/T) Σ_{t=1}^T Σ_{i=1}^I 1_{X_i}(X_t) θ⋆(i) φ(X_t).
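A sketch of the corresponding importance-reweighted estimator, assuming draws xs approximately distributed under π_{θ⋆}; stratum(x) is again a problem-specific placeholder returning the stratum index of x.

```python
import numpy as np

def reweighted_estimate(phi, xs, stratum, theta_star):
    """Estimate int phi dpi from draws xs targeting pi_{theta_star}.

    A point falling in stratum i receives the weight I * theta_star(i),
    i.e. the importance ratio dpi / dpi_{theta_star} at that point.
    """
    I = len(theta_star)
    weights = np.array([I * theta_star[stratum(x)] for x in xs])
    values = np.array([phi(x) for x in xs])
    return np.mean(weights * values)
```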

(10)

1st Ex. (4/6)

θ⋆ and therefore dπ_{θ⋆} are unknown, so what?

• θ⋆ ∈ R^I collects ∫_{X_i} dπ for all i ∈ {1, · · · , I},

• θ⋆ is the unique root of θ ↦ ∫_X H(θ, x) dπ_θ(x) ∈ R^I, where for all i ∈ {1, · · · , I}

H_i(θ, x) := θ(i) 1_{X_i}(x) − θ(i) Σ_{j=1}^I 1_{X_j}(x) θ(j),

thus suggesting the use of a Stochastic Approximation procedure: θ⋆ ≈ lim_t θ_t,

θ_{t+1} = θ_t + γ_{t+1} H(θ_t, X_{t+1}),   X_{t+1} ∼ dπ_{θ_t} or X_{t+1} ∼ one-step MCMC

• This update scheme is a normalized counter of the number of visits to the sets X_i

<proof>

(11)

1st Ex. (5/6)

The algorithm: Wang-Landau based procedures

• Initialisation: a weight vector θ_0. Repeat for t = 1, · · · , T:

- sample a point X_{t+1} ∼ P_{θ_t}(X_t, ·), where P_θ is invariant w.r.t. dπ_θ,

- update the estimate of θ⋆:

θ_{t+1} = θ_t + γ_{t+1} H(θ_t, X_{t+1}).

• Expected:

- the convergence of θ_t to θ⋆: an SA scheme, fed with an adaptive (controlled) MCMC sampler,

- the convergence of the distribution of X_t to dπ_{θ⋆},

thus allowing (see the sketch below)

∫ φ dπ ≈ (I/T) Σ_{t=1}^T Σ_{i=1}^I 1_{X_i}(X_t) θ_t(i) φ(X_t).
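Putting the two steps together, a minimal sketch of a Wang-Landau type procedure with a random-walk Metropolis kernel. The deterministic step size γ_{t+1} = 1/(t+1), the Gaussian proposal scale, and the final clipping/renormalization of θ are illustrative choices, not prescriptions from the talk (the cited papers also study self-tuned, random γ_t).

```python
import numpy as np

def wang_landau(pi, stratum, x0, theta0, T, step=0.5, rng=None):
    """Minimal Wang-Landau / Stochastic Approximation sketch.

    `pi` is the (possibly unnormalized) target density on R^d and `stratum(x)`
    returns the index i of the stratum X_i containing x; both are
    problem-specific placeholders.  At iteration t, one random-walk Metropolis
    step targets pi_{theta_t} (proportional to pi(x)/theta_t(i(x))), then
    theta_{t+1} = theta_t + gamma_{t+1} H(theta_t, X_{t+1}).
    """
    rng = np.random.default_rng() if rng is None else rng
    x, theta = np.asarray(x0, float), np.asarray(theta0, float).copy()
    for t in range(T):
        gamma = 1.0 / (t + 1)                      # illustrative step size
        # --- one step of a Metropolis kernel P_theta, invariant w.r.t. pi_theta
        prop = x + step * rng.standard_normal(x.shape)
        ratio = (pi(prop) / theta[stratum(prop)]) / (pi(x) / theta[stratum(x)])
        if rng.uniform() < min(1.0, ratio):
            x = prop
        # --- Stochastic Approximation update of theta
        H = -theta * theta[stratum(x)]             # H_i = -theta(i) theta(i(x)) ...
        H[stratum(x)] += theta[stratum(x)]         # ... + theta(i) 1_{X_i}(x)
        theta = theta + gamma * H
        theta = np.clip(theta, 1e-12, None)        # pragmatic safeguard
        theta /= theta.sum()                       # keep theta a probability vector
    return x, theta
```

Starting from θ_0 uniform over the strata, the visited points reweighted with the running estimates θ_t give the estimator displayed above.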

(12)

1st Ex. (6/6)

Does it work? Plot: convergence of θ_t and first exit times from one mode

[Figures: convergence of θ_t along the iterations, trace of the first coordinate X_{n,1} vs. iterations, and first exit time from one mode as a function of β for d = 3, 6, 12, 24, 48, 96 (reference slope 1.25).]

see F.-Kuhn-Jourdain-Lelièvre-Stoltz (2014); F.-Jourdain-Lelièvre-Stoltz (2015, 2017, 2018) for studies of these Wang-Landau based algorithms, including self-tuned SA update rules (γ_t is random).

(13)

Conclusion of the 1st example

• Iterative sampler

• Each iteration combines: (i) a sampling step X_{t+1} ∼ P_{θ_t}(X_t, ·); and (ii) an optimization step to update the knowledge of some optimal parameter.

• The points {X_1, · · · , X_t, · · · } can be seen as the output of a controlled Markov chain:

E[ f(X_{t+1}) | F_t ] = P_{θ_t} f(X_t),   F_t := σ(X_{0:t}, θ_0),

where P_θ has dπ_θ as its unique invariant distribution.

• The convergence of the parameter θ_t is the convergence of an SA scheme with "controlled Markovian" dynamics:

θ_{t+1} = θ_t + γ_{t+1} H(θ_t, X_{t+1})

(14)

2nd Example: penalized ML in latent variable models (1/6)

• An example from Pharmacokinetics:

- N patients.

- At time 0: dose D of a drug.

- For patient #i, observations Y_{i1}, · · · , Y_{iJ_i} giving the evolution of the concentration at times t_{i1}, · · · , t_{iJ_i}.

• The model (see the sketch below):

Y_{ij} = F(t_{ij}, X_i) + ε_{ij},   ε_{ij} i.i.d. ∼ N(0, σ²),

where X_i ∈ R^L is modeled as

X_i = Z_i β + d_i ∈ R^L,   d_i i.i.d. ∼ N_L(0, Ω) and independent of ε_{ij},

and Z_i is a known matrix s.t. each row of X_i has an intercept (fixed effect) and covariates.

• Statistical analysis: (i) estimation of θ = (β, σ², Ω), under sparsity constraints on β; (ii) selection of the covariates based on β̂.
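A small simulation sketch of this hierarchical model may help fix notation. The concentration curve F below (a one-compartment model with first-order absorption, dose D = 1, and a three-dimensional individual parameter) is purely hypothetical since the slides do not specify F; dimensions and values are illustrative.

```python
import numpy as np

def simulate_pk_data(Z, beta, Omega, sigma2, times, rng=None):
    """Simulate the latent-variable model of the slide.

    X_i = Z_i beta + d_i, d_i ~ N_L(0, Omega);  Y_ij = F(t_ij, X_i) + eps_ij,
    eps_ij ~ N(0, sigma2).  Here L = 3 and F is a hypothetical stand-in
    (one-compartment curve parameterized by exp(X_i) = (ka, ke, V)).
    """
    rng = np.random.default_rng() if rng is None else rng

    def F(t, x):
        ka, ke, V = np.exp(x)            # individual PK parameters, kept positive
        return (ka / (V * (ka - ke))) * (np.exp(-ke * t) - np.exp(-ka * t))

    Y = []
    for Zi, ti in zip(Z, times):         # one block per patient
        di = rng.multivariate_normal(np.zeros(Omega.shape[0]), Omega)
        Xi = Zi @ beta + di              # latent individual parameters
        Y.append(F(ti, Xi) + np.sqrt(sigma2) * rng.standard_normal(len(ti)))
    return Y
```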

(15)

2nd Ex. (2/6)

Penalized Maximum Likelihood

• The likelihood of Y := {Y_{ij}, 1 ≤ i ≤ N, 1 ≤ j ≤ J_i} is not explicit:

- The distribution of Y_{ij} given X_i is simple; the distribution of X_i is simple.

- The joint distribution has an explicit expression.

- It is an example of a latent variable model:

log L(Y; θ) = log ∫ p(Y, x_{1:N}; θ) dν(x_{1:N})

• Sparsity constraints on the parameter θ: through a penalty term g(θ).

• The penalized ML estimator is of the form

argmin_{θ∈Θ} ( −log L(Y; θ) + g(θ) ),

with an intractable objective function.

(16)

2nd Ex. (3/6)

What about first-order methods for solving the optimization?

• On the likelihood term:

- Usually regular enough so that the gradient exists and <proof>

∇_θ log L(Y; θ) = [ ∫ (∂_θ p(Y, x; θ) / p(Y, x; θ)) p(Y, x; θ) dµ(x) ] / ∫ p(Y, z; θ) dµ(z) = ∫ ∂_θ (log p(Y, x; θ)) dπ_θ(x),

where dπ_θ is the a posteriori distribution of x given Y (the dependence upon Y is omitted).

- the a posteriori distribution is known up to a normalizing constant.

• On the penalty term

- May be non-smooth, but: convex and lower semi-continuous.

- Hence a Proximal operator (implicit gradient) is associated (see the sketch below) - <See the talk>.
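As a concrete instance, a minimal sketch of one proximal-gradient step when g(θ) = λ‖θ‖₁, for which the proximal operator is soft-thresholding; grad_loglik is a placeholder returning (an estimate of) ∇_θ log L(Y; θ), which in this setting would itself be a Monte Carlo average of ∇_θ log p(Y, x; θ) over posterior draws x ∼ π_θ.

```python
import numpy as np

def prox_l1(theta, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)

def proximal_gradient_step(theta, grad_loglik, gamma, lam):
    """One (stochastic) proximal-gradient step for -log L(Y; theta) + lam * ||theta||_1.

    Gradient step on the smooth part (ascent on log L, i.e. descent on -log L),
    then the proximal map of gamma * g applied to the result.
    """
    return prox_l1(theta + gamma * grad_loglik(theta), gamma * lam)
```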

(17)

2nd Ex. (4/6)

What about EM-like methods for solving the optimization ?

• Expectation-Maximization (Dempster-Laird-Rubin, 1977), introduced to solve

argmin_{θ∈Θ} ( −log ∫_X p(x; θ) dµ(x) + g(θ) ),

where the first part is intractable, by iterating two steps:

- Expectation step

Q(θ, θ_t) := ∫ log p(x; θ) [ p(x; θ_t) / ∫ p(z; θ_t) dµ(z) ] dµ(x) = ∫ log p(x; θ) dπ_{θ_t}(x)

- Minimization step

θ_{t+1} := argmin_θ (−Q(θ, θ_t) + g(θ)).

• θ ↦ Q(θ, θ_t) is an integral which is intractable; dπ_{θ_t} is known up to a normalizing constant.

see F.-Moulines (2003); F.-Ollier-Samson (2018)

(18)

2nd Ex. (5/6)

• Both in EM-like approaches and in gradient-based approaches, we are

- faced with intractable auxiliary quantities of the form

∫_X H(θ, x) dπ_{θ_t}(x)    (1)

at iteration t of the optimization algorithm;

- an intractable integral; dπ_{θ_t} is often known up to a normalizing constant.

• What kind of approximation of the integral (1) at iteration t?

- Quadrature techniques: poor behavior w.r.t. the dimension of X.

- I.i.d. samples from π_{θ_t} to define a Monte Carlo approximation: not possible, in general.

- Use m samples {X_{j,t+1}, j ≤ m} from an MCMC sampler with unique invariant distribution dπ_{θ_t} (see the sketch below).
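A minimal sketch of this last option: a random-walk Metropolis kernel producing the m points {X_{j,t+1}}. Here log_p_joint(x, θ) is a placeholder for log p(Y, x; θ); only differences of it are used, so the normalizing constant of π_θ is never needed.

```python
import numpy as np

def rw_metropolis(log_p_joint, theta, x0, m, step=0.5, rng=None):
    """Draw m points {X_{j,t+1}} from an MCMC kernel with invariant law pi_theta.

    `log_p_joint(x, theta)` returns log p(Y, x; theta), the un-normalized
    log-posterior of x given Y; it is a problem-specific placeholder.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, float)
    samples = []
    for _ in range(m):
        prop = x + step * rng.standard_normal(x.shape)
        if np.log(rng.uniform()) < log_p_joint(prop, theta) - log_p_joint(x, theta):
            x = prop
        samples.append(x.copy())
    return np.array(samples)   # used to form (1/m) sum_j H(theta, X_{j,t+1})
```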

(19)

2nd Ex. (6/6) Does it work?

see F.-Moulines (2003) for EM-like approaches; see Atchadé-F.-Moulines (2017) and F.-Ollier-Samson (2018) for gradient-based approaches; see F.-Ollier-Samson (2018) for the parallel between EM-like and gradient-based techniques.

[Figure: parameter values along 500 iterations, four panels: Proximal MCEM with decreasing step size, Proximal MCEM with adaptive step size, Proximal SAEM with decreasing step size, Proximal SAEM with adaptive step size.]

(20)

Conclusion of the 2nd example

• Iterative optimization technique

• Each iteration combines: (i) an update of the parameter; (ii) a sampling step X_{j+1,t+1} ∼ P_{θ_t}(X_{j,t+1}, ·) to approximate auxiliary quantities.

• The convergence of {θ_t}_t is the convergence of a stochastically perturbed iterative optimization algorithm. At each iteration, an exact quantity ∫ H(θ, x) dπ_{θ_t}(x) is approximated by a Monte Carlo sum

∫ H(θ, x) dπ_{θ_t}(x) ≈ (1/m_{t+1}) Σ_{j=1}^{m_{t+1}} H(θ, X_{j,t+1})

• The points {X_{j,t+1}}_j satisfy

E[ f(X_{j,t+1}) | F_t ] = P_{θ_t}^j f(X_{0,t+1}),   F_t := σ(X_{·,0:t}, θ_0),   X_{0,t+1} = X_{m_t,t},

where P_θ has dπ_θ as its unique invariant distribution.

(21)

Conclusion of this first part (1/3): is a theory required?

(22)

Conclusion of this first part (2/3): is a theory required when sampling? YES! Convergence can be lost by the adaptation mechanism.

Even in a simple case when

∀θ ∈ Θ, P_θ is invariant w.r.t. dπ,

one can define a simple adaptation mechanism

X_{t+1} | past_{1:t} ∼ P_{θ_t}(X_t, ·),   θ_t ∈ σ(X_{1:t}),

such that

lim_t E[f(X_t)] ≠ ∫ f dπ.

<proof> A {0,1}-valued chain {X_t}_t defined by X_{t+1} ∼ P_{X_t}(X_t, ·), where the transition matrices are

P_0 = [[t_0, 1−t_0], [1−t_0, t_0]]    P_1 = [[t_1, 1−t_1], [1−t_1, t_1]]

Then P_0 and P_1 are both invariant w.r.t. [1/2, 1/2], but {X_t} is a Markov chain with transition matrix [[t_0, 1−t_0], [1−t_1, t_1]], whose invariant distribution is proportional to [1−t_1, 1−t_0] and differs from [1/2, 1/2] as soon as t_0 ≠ t_1. A simulation sketch follows below.
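A short simulation, a sketch under the parameterization above, makes the failure visible: the empirical frequency of state 0 converges to (1−t_1)/(2−t_0−t_1) rather than to 1/2.

```python
import numpy as np

def adaptive_two_state(t0, t1, T=200_000, rng=None):
    """Simulate X_{t+1} ~ P_{X_t}(X_t, .) with P_k = [[t_k, 1-t_k], [1-t_k, t_k]].

    Both kernels are invariant w.r.t. [1/2, 1/2], but the adaptively generated
    chain has invariant law proportional to [1 - t1, 1 - t0].
    """
    rng = np.random.default_rng() if rng is None else rng
    stay = {0: t0, 1: t1}          # P_k(k, k) = t_k when the current state is k
    x, visits0 = 0, 0
    for _ in range(T):
        x = x if rng.uniform() < stay[x] else 1 - x
        visits0 += (x == 0)
    return visits0 / T             # compare with 0.5 and with (1-t1)/(2-t0-t1)

# Example: t0 = 0.9, t1 = 0.5 gives an empirical frequency of state 0 close to
# (1-0.5)/(2-0.9-0.5) ≈ 0.833, far from the naively expected 0.5.
print(adaptive_two_state(0.9, 0.5))
```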

(23)

Conclusion of this first part (3/3): is a theory required when optimizing? YES! Unfortunately,

• a biased approximation <proof>

E[ (1/m_{t+1}) Σ_{j=1}^{m_{t+1}} H(θ, X_{j,t+1}) | F_t ] = ? ≠ ∫_X H(θ, x) dπ_{θ_t}(x)

• For a reduced computational cost: a bias which we would like NOT to vanish, i.e. m_t = m (possibly m = 1).

Ex. Stochastic Approximation with controlled Markovian dynamics:

θ_{t+1} = θ_t + γ_{t+1} H(θ_t, X_{t+1}),   X_{t+1} ∼ P_{θ_t}(X_t, ·)

= θ_t + γ_{t+1} ∫ H(θ_t, x) dπ_{θ_t}(x)  [the mean field h(θ_t)]  + γ_{t+1} ( H(θ_t, X_{t+1}) − h(θ_t) )  [a non-centered noise term]
