Stochastic Dynamic Programming
1 Introduction
We revisit stochastic dynamic programming, now for infinite state spaces, based on Stachurski Chapter 10. There is a state space $S \in \mathcal{B}(\mathbb{R}^n)$ (i.e. $S$ is a Borel set in $\mathbb{R}^n$), an action space $A \in \mathcal{B}(\mathbb{R}^m)$, and a feasibility correspondence $\Gamma : S \rightrightarrows A$ specifying the set $\Gamma(x)$ of feasible actions for any state $x$. The graph of $\Gamma$ is defined as the set
\[
\operatorname{gr}\Gamma = \{(x, u) \mid x \in S \text{ and } u \in \Gamma(x)\}
\]
The state evolves as follows. There is a shock space $Z \in \mathcal{B}(\mathbb{R}^k)$ and a sequence $(W_t)_{t=1}^{\infty}$ of iid shocks distributed according to the probability measure $\phi \in \mathcal{P}(Z)$, where $\mathcal{P}(Z)$ is the set of probability measures on $(Z, \mathcal{B}(Z))$. Correlated shocks can be handled by enlarging the definition of the state space, as shown in an example below. The transition function $F : \operatorname{gr}\Gamma \times Z \to S$ is the map $(x, u, z) \mapsto F(x, u, z) \in S$.
The agent (or, depending on the context, the social planner) at time $t = 0, 1, 2, \ldots$ observes $X_t = x$, takes action $u_t = u \in \Gamma(x)$, and gets current reward $r(x, u)$ according to the reward function $r$. Then a shock $W_{t+1} = z$ is realized and the state changes to $X_{t+1} = F(x, u, z)$. The process then repeats.
The agent does not care only about the current reward $r(x, u)$; otherwise he would simply choose $\operatorname{argmax}_{u \in \Gamma(x)} r(x, u)$. Rather, the agent wishes to maximize
\[
E\left[ \sum_{t=0}^{\infty} \rho^t r(x_t, u_t) \right]
\]
through a choice of the $u_t$'s conditional on the history thus far. We restrict the agent to Markov policies: choose a measurable function $\sigma : S \to A$ from the set
\[
\Sigma = \{\sigma : S \to A \mid \sigma(x) \in \Gamma(x)\ \forall x \in S,\ \sigma \text{ measurable}\}
\]
Example 1 Accumulation problem:
Output is given by a function $f$ of capital $k$ and a real-valued shock $W$; the depreciation rate is $\delta = 1$; the state is output $y$, and the agent's action is the choice of $k$, the investment or capital stock entering the production function $f$. With iid shocks $(W_t)_{t=1}^{\infty}$, the state evolves according to
\[
y_{t+1} = f(k_t, W_{t+1})
\]
Here, the transition function is
\[
F(y, k, z) = f(k, z)
\]
It is independent of the state. The tradeoff in this model is between consuming more now versus saving, embodied in the choice of $k_t$, in order to influence the state $y_{t+1}$ tomorrow.
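To make this concrete, here is a minimal simulation sketch of the accumulation SRS, assuming (purely for illustration) a Cobb-Douglas technology $f(k, z) = k^{\alpha} z$, a lognormal shock distribution for $\phi$, and a fixed savings-rate policy $k_t = s\, y_t$; none of these functional forms is pinned down by the model above.

```python
import numpy as np

alpha, s, T = 0.4, 0.3, 50                 # assumed technology parameter, savings rate, horizon
rng = np.random.default_rng(0)

f = lambda k, z: k**alpha * z              # assumed production function f(k, z)
sigma = lambda y: s * y                    # assumed policy: invest a fixed fraction of output

y = 1.0                                    # initial state y_0
path = [y]
for t in range(T):
    z = rng.lognormal(0.0, 0.1)            # W_{t+1} ~ phi (assumed lognormal)
    y = f(sigma(y), z)                     # y_{t+1} = f(k_t, W_{t+1}); here F(y, k, z) = f(k, z)
    path.append(y)
print(path[:5])
```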
Example 2 Correlated Shocks:
Now suppose $y_{t+1} = f(k_t, \eta_{t+1})$, where $\eta_{t+1} = g(\eta_t, W_{t+1})$, say with $g : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$, and where $(W_t)_{t=1}^{\infty}$ are iid shocks. So output is hit by correlated shocks.
Expand the state space to contain all (output, correlated shock) pairs: $(y, \eta) \in S \equiv \mathbb{R}_+ \times \mathbb{R}_+$. The transition function is the map
\[
F : (y, \eta, k, z) \mapsto \big( f(k, g(\eta, z)),\ g(\eta, z) \big)
\]
The feasibility correspondence maps states to feasible choices of the capital stock, i.e. $\Gamma : (y, \eta) \mapsto [0, y]$.
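A sketch of this expanded-state transition, again under assumed functional forms: with lognormal innovations, taking $g(\eta, z) = \eta^{\beta} z$ makes $\log \eta_t$ an AR(1) process, so the shocks hitting output are indeed correlated over time.

```python
import numpy as np

f = lambda k, eta: k**0.4 * eta            # assumed production function
g = lambda eta, z: eta**0.9 * z            # assumed shock recursion: log eta follows an AR(1)

def F(y, eta, k, z):
    """Transition on the expanded state space S = R_+ x R_+."""
    eta_next = g(eta, z)
    return f(k, eta_next), eta_next        # (y_{t+1}, eta_{t+1})

rng = np.random.default_rng(1)
y, eta = 1.0, 1.0
for t in range(5):
    k = 0.3 * y                            # any feasible choice in Gamma(y, eta) = [0, y]
    y, eta = F(y, eta, k, rng.lognormal(0.0, 0.1))
print(y, eta)
```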
For each $\sigma \in \Sigma$, we get an SRS (stochastic recursive sequence)
\[
X_{t+1} = F(X_t, \sigma(X_t), W_{t+1}), \qquad (W_t)_{t=1}^{\infty} \text{ iid} \sim \phi
\]
With this we get the associated stochastic kernel $P_\sigma(x, dy)$ on $S$. (A stochastic kernel on $S$ is a family of probability measures on $(S, \mathcal{B}(S))$, one for each state $x$, giving the transition probabilities from $x$.) Here, for each $x \in S$ and Borel set $B \in \mathcal{B}(S)$,
\[
P_\sigma(x, B) = \int \mathbf{1}_B\big( F(x, \sigma(x), z) \big)\, \phi(dz)
\]
i.e. the probability of the state being in $B$ at $t+1$ if it is $x$ at $t$ and the agent follows the policy $\sigma$.
$M_\sigma : \mathcal{P}(S) \to \mathcal{P}(S)$ is the corresponding Markov operator. For every measure $\psi$ (the marginal distribution at time $t$), $\psi M_\sigma$ gives the marginal distribution at time $t+1$: for every Borel set $B \in \mathcal{B}(S)$,
\[
\psi M_\sigma(B) = \int P_\sigma(x, B)\, \psi(dx)
\]
is the unconditional probability of being in $B$ tomorrow, having integrated out today's state $x$. Thus $M_\sigma$ supports the recursion $\psi_{t+1} = \psi_t M_\sigma$ under the policy $\sigma$.
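Both objects are easy to approximate by Monte Carlo. The sketch below, reusing the assumed accumulation-example primitives, estimates $P_\sigma(x, B)$ for an interval $B = [a, b]$, and pushes a sample from today's marginal $\psi_t$ forward to a sample from $\psi_{t+1} = \psi_t M_\sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)
F = lambda x, u, z: u**0.4 * z                        # assumed transition (accumulation example)
sigma = lambda x: 0.3 * x                             # assumed policy
draw_phi = lambda n: rng.lognormal(0.0, 0.1, size=n)  # sampler for phi (assumed lognormal)

def kernel_prob(x, a, b, n=100_000):
    """P_sigma(x, [a, b]) = integral of 1_B(F(x, sigma(x), z)) phi(dz), by Monte Carlo."""
    x_next = F(x, sigma(x), draw_phi(n))
    return np.mean((a <= x_next) & (x_next <= b))

def pushforward(sample_today):
    """Map a sample from psi_t into a sample from psi_{t+1} = psi_t M_sigma."""
    return F(sample_today, sigma(sample_today), draw_phi(len(sample_today)))

print(kernel_prob(1.0, 0.5, 0.7))
psi_next_sample = pushforward(rng.uniform(0.5, 1.5, size=10_000))
```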
Let $r_\sigma : S \to \mathbb{R}$ be defined by $r_\sigma(x) = r(x, \sigma(x))$. This is the current reward in state $x$ if I use policy $\sigma$. Then the expected reward tomorrow, if the state today is $x$ and I follow $\sigma$, is
\[
M_\sigma r_\sigma(x) = \int r_\sigma(y)\, P_\sigma(x, dy) = \int r_\sigma\big( F(x, \sigma(x), z) \big)\, \phi(dz)
\]
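The second expression gives a direct Monte Carlo recipe for $M_\sigma r_\sigma(x)$: draw shocks from $\phi$ and average the reward at the resulting next-period states. The reward used below is an illustrative bounded one (consistent with assumption A1 in the next section), not part of the model as stated.

```python
import numpy as np

rng = np.random.default_rng(3)
F = lambda x, u, z: u**0.4 * z                       # assumed transition
sigma = lambda x: 0.3 * x                            # assumed policy
r_sigma = lambda x: 1.0 - np.exp(-(x - sigma(x)))    # assumed bounded reward from consuming y - k

def expected_reward_tomorrow(x, n=100_000):
    """M_sigma r_sigma(x) = integral of r_sigma(F(x, sigma(x), z)) phi(dz)."""
    z = rng.lognormal(0.0, 0.1, size=n)
    return np.mean(r_sigma(F(x, sigma(x), z)))

print(expected_reward_tomorrow(1.0))
```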
It is useful to view $(W_t)_{t=1}^{\infty}$ as a stochastic process (a family of random variables, i.e. measurable functions defined on $\Omega$) on a common probability space $(\Omega, \mathcal{F}, \nu)$. Each $\omega \in \Omega$ leads to a realization $(W_t(\omega))_{t=1}^{\infty}$ of the $W_t$'s. This, along with $X_0 = x$, a policy $\sigma \in \Sigma$, and the SRS, recursively defines a realization $(X_t(\omega))_{t=0}^{\infty}$ of a time path for the state.
Let the r.v. $Y_\sigma : \Omega \to \mathbb{R}$ be defined by
\[
Y_\sigma(\omega) = \sum_{t=0}^{\infty} \rho^t r_\sigma(X_t(\omega)), \qquad \forall \omega \in \Omega.
\]
$Y_\sigma$ is well-defined since $0 < \rho < 1$ and $r(\cdot)$ is bounded; it captures the discounted present payoff from following policy $\sigma$ on each path $(X_t(\omega))_{t=0}^{\infty}$. The expected discounted payoff of the agent following policy $\sigma$ is given by, for all $X_0 = x \in S$:
\[
v_\sigma(x) \equiv E Y_\sigma = \int \left[ \sum_{t=0}^{\infty} \rho^t r_\sigma(X_t(\omega)) \right] \nu(d\omega) \tag{1}
\]
The objective is to choose $\sigma \in \Sigma$ to maximize $v_\sigma(x)$, for all $x \in S$.
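Under the same assumed primitives, $v_\sigma(x)$ can be estimated by simulating many paths from $X_0 = x$ and averaging truncated discounted sums; since $0 < \rho < 1$ and $r$ is bounded, the truncation error after $T$ periods is at most $\rho^T M / (1 - \rho)$.

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.95                                           # assumed discount factor
F = lambda x, u, z: u**0.4 * z                       # assumed transition
sigma = lambda x: 0.3 * x                            # assumed policy
r_sigma = lambda x: 1.0 - np.exp(-(x - sigma(x)))    # assumed bounded reward

def v_sigma_mc(x0, n_paths=5_000, T=200):
    """Estimate v_sigma(x0) = E[Y_sigma]; rho^T is about 3e-5 at T = 200, so the tail is negligible."""
    x = np.full(n_paths, float(x0))
    total = np.zeros(n_paths)
    for t in range(T):
        total += rho**t * r_sigma(x)                 # accumulate rho^t r_sigma(X_t) on each path
        x = F(x, sigma(x), rng.lognormal(0.0, 0.1, size=n_paths))
    return total.mean()                              # average of Y_sigma over the simulated omegas

print(v_sigma_mc(1.0))
```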
The function $v^* : S \to \mathbb{R}$ defined by
\[
v^*(x) = \sup_{\sigma \in \Sigma} v_\sigma(x) \quad \forall x \in S \tag{2}
\]
is called the value function of the problem. A policy $\sigma^*$ is defined to be optimal if for every $x \in S$, $v_{\sigma^*}(x) = v^*(x)$.
Note in Equation (1) that if the context of the probability measure being over sample paths is understood, we may drop the integral sign and use $E$, the expectations operator. Can we move the expectations operator inside the summation sign? Yes, using the Dominated Convergence Theorem (DCT).
Let $f_n = \sum_{t=0}^{n} \rho^t r_\sigma(X_t(\omega))$, $n = 1, 2, \ldots$, be the partial sums whose limit is the function $f = \sum_{t=0}^{\infty} \rho^t r_\sigma(X_t(\omega))$ being integrated in Equation (1). Since $0 < \rho < 1$ and $r(\cdot)$ is bounded, I can find $M > 0$ s.t. $|r_\sigma(X_t(\omega))| \le M$ for all time paths of the state, and use the constant function $g \equiv M/(1-\rho)$ as a bound on the partial sums: $|f_n| \le M/(1-\rho)$. Then I can appeal to the DCT: note that the RHS of Equation (1) is just $\int \lim f_n$. This then equals
\[
\lim \int f_n = \lim \int \sum_{t=0}^{n} \rho^t r_\sigma(X_t(\omega))\, \nu(d\omega) = \lim \sum_{t=0}^{n} \rho^t \int r_\sigma(X_t(\omega))\, \nu(d\omega) = \sum_{t=0}^{\infty} \rho^t \int r_\sigma(X_t(\omega))\, \nu(d\omega)
\]
which is the infinite sum of period-by-period expectations. The second equality, the interchange of the finite sum and the integral, follows from the linearity (finite additivity) of the integral. Note that this, $\sum_{t=0}^{\infty} \rho^t E\, r_\sigma(X_t)$, equals $\sum_{t=0}^{\infty} \rho^t M_\sigma^t r_\sigma(x)$. So we have, for every $X_0 = x \in S$,
\[
v_\sigma(x) = \sum_{t=0}^{\infty} \rho^t M_\sigma^t r_\sigma(x) \tag{3}
\]
where in period $t$ (given $X_0 = x$), the expectation is taken with respect to $P_\sigma^t(x, dy)$.
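Numerically, the interchange is just linearity of the sample average: averaging discounted sums path by path (the LHS of (3), as in the sketch above) and summing discounted per-period averages (the RHS) give the same number up to floating-point rounding. A small check, under the same assumed primitives:

```python
import numpy as np

rng = np.random.default_rng(5)
rho, n_paths, T = 0.95, 5_000, 200
F = lambda x, u, z: u**0.4 * z                       # assumed transition
sigma = lambda x: 0.3 * x                            # assumed policy
r_sigma = lambda x: 1.0 - np.exp(-(x - sigma(x)))    # assumed bounded reward

x = np.ones(n_paths)                                 # X_0 = 1 on every path
pathwise = np.zeros(n_paths)                         # running discounted sum on each path
per_period = 0.0                                     # running sum of rho^t * (sample mean of r_sigma(X_t))
for t in range(T):
    pathwise += rho**t * r_sigma(x)
    per_period += rho**t * r_sigma(x).mean()         # rho^t * estimate of M_sigma^t r_sigma(x)
    x = F(x, sigma(x), rng.lognormal(0.0, 0.1, size=n_paths))

print(pathwise.mean(), per_period)                   # agree up to rounding
```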
2 Main Results
Throughout this section, the assumptions below are maintained.
Assumptions.
A1. $r : \operatorname{gr}\Gamma \to \mathbb{R}$ is bounded and continuous.
A2. $\Gamma : S \rightrightarrows A$ is continuous and compact-valued.
A3. $(x, u) \mapsto F(x, u, z)$ is continuous for each $z \in Z$.
Definition 1 Given $w \in bB(S)$ [i.e. $w : S \to \mathbb{R}$ is bounded and $(\mathcal{B}(S), \mathcal{B})$-measurable], $\sigma \in \Sigma$ is $w$-greedy if for all $x \in S$,
\[
\sigma(x) \in \operatorname{argmax}_{u \in \Gamma(x)} \left\{ r(x, u) + \rho \int w(F(x, u, z))\, \phi(dz) \right\} \tag{4}
\]
Lemma 1 Suppose Assumptions A1, A2, A3 hold. If $w \in bcS$, then $\Sigma$ contains at least one $w$-greedy policy.
Since $w$ is continuous, the objective on the RHS of Equation (4) is continuous with respect to $u$, and $\Gamma(x)$ is compact, so a maximum exists by Weierstrass' Theorem; hence a map $\sigma$ selecting a maximizer at each $x$ exists. Its measurability can be established using 'measurable selection' results (e.g. Aliprantis and Border 1999, Section 17.3).
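A brute-force sketch of computing a $w$-greedy action at a given state: discretize $\Gamma(x)$ (assumed here to be $[0, x]$, as in the accumulation example) and approximate the integral in (4) with a fixed set of shock draws. All functional forms are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
rho = 0.95
F = lambda x, u, z: u**0.4 * z                       # assumed transition
r = lambda x, u: 1.0 - np.exp(-(x - u))              # assumed bounded reward r(x, u)
z_draws = rng.lognormal(0.0, 0.1, size=1_000)        # fixed draws approximating phi

def greedy_action(w, x, n_grid=200):
    """sigma(x) in argmax over Gamma(x) = [0, x] of r(x,u) + rho * integral w(F(x,u,z)) phi(dz)."""
    u_grid = np.linspace(1e-8, x, n_grid)
    vals = [r(x, u) + rho * np.mean(w(F(x, u, z_draws))) for u in u_grid]
    return u_grid[int(np.argmax(vals))]

w = lambda y: np.sqrt(y)                             # any candidate w in bcS (illustrative)
print(greedy_action(w, 1.0))
```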
We now state the main theorem, which says that the value function defined above in Equation (2) satisfies Bellman's equation (and indeed is the unique solution to Bellman's equation in $bB(S)$).
Theorem 1 Suppose Assumptions A1, A2, A3 hold. Then $v^*$ defined above is the unique function in $bB(S)$ that satisfies
\[
v^*(x) = \sup_{u \in \Gamma(x)} \left\{ r(x, u) + \rho \int v^*(F(x, u, z))\, \phi(dz) \right\} \tag{5}
\]
Moreover, $v^* \in bcS$; and a feasible policy is optimal if and only if it is $v^*$-greedy. At least one such policy exists.
We will use two lemmas to prove this theorem. The second lemma shows that the Bellman operator in Equation (5) above is a uniformly strict contraction. The first lemma concerns the operator $T_\sigma$ defined below, and shows that it is a contraction whose unique fixed point is the value function for the policy $\sigma$, namely $v_\sigma$ defined above.
Lemma 2 Define $T_\sigma$ on $bB(S)$ as follows: $\forall \sigma \in \Sigma$ and $x \in S$,
\[
T_\sigma w(x) = r(x, \sigma(x)) + \rho \int w(F(x, \sigma(x), z))\, \phi(dz) \tag{6}
\]
Then:
(i) $T_\sigma : bB(S) \to bB(S)$;
(ii) $T_\sigma$ is monotone;
(iii) $\forall w, w' \in bB(S)$, $\|T_\sigma w - T_\sigma w'\|_\infty \le \rho \|w - w'\|_\infty$, so $T_\sigma$ is a uniform contraction, and its unique fixed point is $v_\sigma$.
Proof. (i) If $w \in bB(S)$, then $|T_\sigma w(x)| \le \|r\|_\infty + \rho \|w\|_\infty$ for all $x$ (since $\phi$ is a probability measure), so $T_\sigma w$ is bounded; measurability of $T_\sigma w$ follows from that of $r_\sigma$ and of $x \mapsto \int w(F(x, \sigma(x), z))\, \phi(dz)$.
(ii) $w' \ge w$ implies $\int w' \ge \int w$, so $T_\sigma w' \ge T_\sigma w$.
(iii) Take any $w, w' \in bB(S)$, and any $x \in S$. (Interpret this $x$ in the expressions below as $X_0 = x$, the initial or current state.) Then
\[
|T_\sigma w(x) - T_\sigma w'(x)| = |r_\sigma(x) + \rho M_\sigma w(x) - r_\sigma(x) - \rho M_\sigma w'(x)| = \rho\, |M_\sigma (w - w')(x)|
\]
\[
\le \rho\, M_\sigma |w - w'|(x) \le \rho\, M_\sigma \|w - w'\|_\infty = \rho\, \|w - w'\|_\infty
\]
Since this holds for every $x \in S$, it holds for the supremum over them, $\|T_\sigma w - T_\sigma w'\|_\infty$. The first inequality we did in our treatment of integration: it is a sort of triangle inequality for integrals. The second inequality holds because $|w(y) - w'(y)| \le \|w - w'\|_\infty$, the supremum, for every $y$. The final equality follows because the constant $\|w - w'\|_\infty$ factors out of the integral, which then integrates the probability measure $P_\sigma(x, dy)$ to 1.
So $T_\sigma$ is a uniform contraction. To see that $T_\sigma v_\sigma = v_\sigma$, note that for any $x \in S$,
\[
v_\sigma(x) = \sum_{t=0}^{\infty} \rho^t M_\sigma^t r_\sigma(x) = r_\sigma(x) + \sum_{t=1}^{\infty} \rho^t M_\sigma^t r_\sigma(x) = r_\sigma(x) + \rho M_\sigma \sum_{t=0}^{\infty} \rho^t M_\sigma^t r_\sigma(x) = r_\sigma(x) + \rho M_\sigma v_\sigma(x) = T_\sigma v_\sigma(x)
\]
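Lemma 2 suggests a practical policy-evaluation algorithm: iterate $T_\sigma$ from any starting function, and by the contraction property the iterates converge to $v_\sigma$ at rate $\rho$ in the sup norm. A minimal sketch on a discretized state space, with linear interpolation standing in for function evaluation off the grid, under the same assumed primitives as in the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(7)
rho = 0.95
F = lambda x, u, z: u**0.4 * z                       # assumed transition
sigma = lambda x: 0.3 * x                            # assumed policy
r_sigma = lambda x: 1.0 - np.exp(-(x - sigma(x)))    # assumed bounded reward
z_draws = rng.lognormal(0.0, 0.1, size=1_000)        # fixed draws approximating phi
grid = np.linspace(0.05, 2.0, 100)                   # discretized state space

def T_sigma(w):
    """Apply (6) at each grid point; np.interp extends w off the grid (clamping at the ends)."""
    out = np.empty_like(grid)
    for i, x in enumerate(grid):
        x_next = F(x, sigma(x), z_draws)
        out[i] = r_sigma(x) + rho * np.mean(np.interp(x_next, grid, w))
    return out

w = np.zeros_like(grid)
for k in range(500):
    w_new = T_sigma(w)
    done = np.max(np.abs(w_new - w)) < 1e-8          # sup-norm distance, as in (iii)
    w = w_new
    if done:
        break
print(k, w[50])                                      # w now approximates v_sigma on the grid
```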
The next lemma is about the Bellman operator.
Lemma 3 The Bellman operator $T$ defined by
\[
T w(x) = \sup_{u \in \Gamma(x)} \left\{ r(x, u) + \rho \int w(F(x, u, z))\, \phi(dz) \right\} \tag{7}
\]
is monotone, and uniformly contracting on $(bB(S), d_\infty)$.
Proof. $w' \ge w$ implies $\int w' \ge \int w$, which implies $T w' \ge T w$; this establishes that $T$ is monotone.
To establish the second claim, note first that $|\sup w - \sup w'| \le \sup |w - w'|$, where the sup is taken over all $x \in S$. This is because
\[
\sup w = \sup(w - w' + w') \le \sup(w - w') + \sup w' \le \sup|w - w'| + \sup w'
\]
So $\sup w - \sup w' \le \sup|w - w'|$. Interchanging the roles of $w$ and $w'$ clinches the argument.
Here, we have $|T w(x) - T w'(x)|$ equal to
\[
\left| \sup_{u} \left\{ r(x, u) + \rho \int w(F(x, u, z))\, \phi(dz) \right\} - \sup_{u} \left\{ r(x, u) + \rho \int w'(F(x, u, z))\, \phi(dz) \right\} \right|
\]
\[
\le \rho \sup_{u} \left| \int \big( w(F(x, u, z)) - w'(F(x, u, z)) \big)\, \phi(dz) \right| \le \rho \sup_{u} \int \big| w(F(x, u, z)) - w'(F(x, u, z)) \big|\, \phi(dz)
\]
\[
\le \rho \sup_{u} \int \|w - w'\|_\infty\, \phi(dz) = \rho\, \|w - w'\|_\infty
\]
This is true for all $x \in S$, and so for the supremum on the LHS.
The penultimate inequality is again the sort of triangle inequality for the integral done in the integration lecture.
Proof of Theorem 1.
By Lemma 3 and Banach's fixed point theorem, there is a unique $w^* \in bB(S)$ s.t. $T w^* = w^*$. We need to show that (i) $w^* \in bcS$ and (ii) $w^* = v^*$.
(i) First note that $T$ maps $bcS$ into itself. Indeed, if $w \in bcS$, then it follows from Berge's Theorem of the Maximum that $T w$ is continuous; boundedness is routine, as in other proofs above. Recall that the Theorem of the Maximum concerns the continuity of the maximum value function of a maximization problem, and the upper hemicontinuity of the argmax correspondence.
Second, note that $bcS$ is a closed subset of $bB(S)$; this closedness of $bcS$ was established earlier in the course. So, if we start with any $w \in bcS$, the limit of the sequence $(T^n w)_{n=0}^{\infty}$ lies in $bcS$; by Banach's theorem, this limit is $w^*$.
(ii) Note that since $w^* \in bcS$, by Lemma 1 there is at least one $w^*$-greedy policy, say $\bar{\sigma}$; so $T w^* = T_{\bar{\sigma}} w^*$. Thus $w^* = T w^* = T_{\bar{\sigma}} w^*$, which implies $w^* = v_{\bar{\sigma}}$, the unique fixed point of $T_{\bar{\sigma}}$. However, by definition of the value function $v^*$, $v_\sigma \le v^*$ for any policy $\sigma$; and so $w^* = v_{\bar{\sigma}} \le v^*$.
Now for the converse, i.e. $w^* \ge v^*$. Take any $\sigma \in \Sigma$; then $w^* = T w^* \ge T_\sigma w^*$, since $\sigma$ may not maximize the RHS of the Bellman expression. Now, $T_\sigma$ is monotone, so $w^* \ge T_\sigma w^*$ implies $T_\sigma w^* \ge T_\sigma^2 w^*$ by applying $T_\sigma$ to both sides; the latter is therefore also less than or equal to $w^*$. Iterating $k$ times, $w^* \ge T_\sigma^k w^*$. Since $T_\sigma^k w^*$ converges to $v_\sigma$ uniformly (by Lemma 2 and Banach's theorem), taking limits we get $w^* \ge v_\sigma$. Since this holds for all $\sigma$, it holds for the supremum, and hence $w^* \ge v^*$. Combining the two directions, $w^* = v^*$; and since $v_{\bar{\sigma}} = w^* = v^*$, the $w^*$-greedy policy $\bar{\sigma}$ is optimal, which also establishes existence.
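The proof is constructive in spirit: iterating $T$ from any $w \in bcS$ converges to $v^*$ at rate $\rho$ (Banach), and reading off a $v^*$-greedy policy gives an approximately optimal one. A value function iteration sketch under the same assumed primitives, with $\Gamma(x) = [0, x]$:

```python
import numpy as np

rng = np.random.default_rng(8)
rho = 0.95
F = lambda x, u, z: u**0.4 * z                       # assumed transition
r = lambda x, u: 1.0 - np.exp(-(x - u))              # assumed bounded reward
z_draws = rng.lognormal(0.0, 0.1, size=500)          # fixed draws approximating phi
grid = np.linspace(0.05, 2.0, 100)                   # discretized state space

def T(w):
    """One application of the Bellman operator (5)/(7) on the grid, plus the greedy actions."""
    Tw, pol = np.empty_like(grid), np.empty_like(grid)
    for i, x in enumerate(grid):
        u_grid = np.linspace(1e-8, x, 50)            # discretized Gamma(x) = [0, x]
        vals = [r(x, u) + rho * np.mean(np.interp(F(x, u, z_draws), grid, w))
                for u in u_grid]
        j = int(np.argmax(vals))
        Tw[i], pol[i] = vals[j], u_grid[j]
    return Tw, pol

w = np.zeros_like(grid)
for k in range(500):
    w_new, pol = T(w)
    err = np.max(np.abs(w_new - w))                  # shrinks by a factor rho each iteration
    w = w_new
    if err < 1e-6:
        break
print(f"converged in {k+1} iterations")              # w approximates v*, pol is (approx.) v*-greedy
```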