Stochastic Dynamic Programming
1 Introduction
We revisit stochastic dynamic programming, now for infinite state spaces, based on Stachurski Chapter 10. There is a state space $S \in \mathcal{B}(\mathbb{R}^n)$ (i.e. $S$ is a Borel set in $\mathbb{R}^n$), an action space $A \in \mathcal{B}(\mathbb{R}^m)$, and a feasibility correspondence $\Gamma : S \rightrightarrows A$ specifying the set $\Gamma(x)$ of feasible actions for any state $x$. The graph of $\Gamma$ is defined as the set
\[
\operatorname{gr}\Gamma = \{(x, u) \mid x \in S \text{ and } u \in \Gamma(x)\}
\]
The state evolves as follows. There is a shock space $Z \in \mathcal{B}(\mathbb{R}^k)$ and a sequence $(W_t)_{t=1}^{\infty}$ of iid shocks distributed according to the probability measure $\phi \in \mathcal{P}(Z)$, where $\mathcal{P}(Z)$ is the set of probability measures on $(Z, \mathcal{B}(Z))$. Correlated shocks can be handled by enlarging the definition of the state space, as shown in an example below. The transition function $F : \operatorname{gr}\Gamma \times Z \to S$ is the map $(x, u, z) \mapsto F(x, u, z) \in S$.
The agent (or, depending on the context, the social planner) at time $t = 0, 1, 2, \ldots$ observes $X_t = x$, takes action $u_t = u \in \Gamma(x)$, and gets current reward $r(x, u)$ according to the reward function $r$. Then a shock $W_{t+1} = z$ is realized and the state changes to $X_{t+1} = F(x, u, z)$. The process then repeats.
The agent does not care only about the current reward $r(x, u)$; otherwise he would simply choose $\operatorname{argmax}_{u \in \Gamma(x)} r(x, u)$. Rather, the agent wishes to maximize
\[
E\left[ \sum_{t=0}^{\infty} \rho^t r(x_t, u_t) \right]
\]
through a choice of the $u_t$'s conditional on the history thus far. We restrict the agent to Markov policies: choose a measurable function $\sigma : S \to A$ from the set
\[
\Sigma = \{\sigma : S \to A \mid \sigma(x) \in \Gamma(x)\ \forall x \in S,\ \sigma \text{ measurable}\}
\]
Example 1 Accumulation problem:
Output is given by a function $f$ of capital $k$ and a real-valued shock $W$; the depreciation rate is $\delta = 1$; the state is output $y$, and the agent's action is the choice of $k$, the investment or capital stock entering the production function $f$. With iid shocks $(W_t)_{t=1}^{\infty}$, the state evolves according to
\[
y_{t+1} = f(k_t, W_{t+1})
\]
Here, the transition function is
\[
F(y, k, z) = f(k, z)
\]
It is independent of the state. The tradeoff in this model is between consuming more now versus saving, embodied in the choice of $k_t$, in order to influence the state $y_{t+1}$ tomorrow.
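To make this concrete, here is a minimal simulation sketch of the accumulation SRS, assuming (purely for illustration) a Cobb-Douglas technology $f(k, z) = k^{\alpha} z$, a lognormal shock distribution for $\phi$, and a fixed savings-rate policy $k_t = s\, y_t$; none of these functional forms is pinned down by the model above.

```python
import numpy as np

alpha, s, T = 0.4, 0.3, 50                 # assumed technology parameter, savings rate, horizon
rng = np.random.default_rng(0)

f = lambda k, z: k**alpha * z              # assumed production function f(k, z)
sigma = lambda y: s * y                    # assumed policy: invest a fixed fraction of output

y = 1.0                                    # initial state y_0
path = [y]
for t in range(T):
    z = rng.lognormal(0.0, 0.1)            # W_{t+1} ~ phi (assumed lognormal)
    y = f(sigma(y), z)                     # y_{t+1} = f(k_t, W_{t+1}); here F(y, k, z) = f(k, z)
    path.append(y)
print(path[:5])
```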
Example 2 Correlated Shocks:
Now suppose $y_{t+1} = f(k_t, \eta_{t+1})$, where $\eta_{t+1} = g(\eta_t, W_{t+1})$, say with $g : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$, and where $(W_t)_{t=1}^{\infty}$ are iid shocks. So output is hit by correlated shocks.
Expand the state space to contain all (output, correlated shock) pairs: $(y, \eta) \in S \equiv \mathbb{R}_+ \times \mathbb{R}_+$. The transition function is the map
\[
F : (y, \eta, k, z) \mapsto \big( f(k, g(\eta, z)),\ g(\eta, z) \big)
\]
The feasibility correspondence maps states to feasible choices of the capital stock, i.e. $\Gamma : (y, \eta) \mapsto [0, y]$.
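A sketch of this expanded-state transition, again under assumed functional forms: with lognormal innovations, taking $g(\eta, z) = \eta^{\beta} z$ makes $\log \eta_t$ an AR(1) process, so the shocks hitting output are indeed correlated over time.

```python
import numpy as np

f = lambda k, eta: k**0.4 * eta            # assumed production function
g = lambda eta, z: eta**0.9 * z            # assumed shock recursion: log eta follows an AR(1)

def F(y, eta, k, z):
    """Transition on the expanded state space S = R_+ x R_+."""
    eta_next = g(eta, z)
    return f(k, eta_next), eta_next        # (y_{t+1}, eta_{t+1})

rng = np.random.default_rng(1)
y, eta = 1.0, 1.0
for t in range(5):
    k = 0.3 * y                            # any feasible choice in Gamma(y, eta) = [0, y]
    y, eta = F(y, eta, k, rng.lognormal(0.0, 0.1))
print(y, eta)
```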
For each $\sigma \in \Sigma$, we get an SRS (stochastic recursive sequence)
\[
X_{t+1} = F(X_t, \sigma(X_t), W_{t+1}), \qquad (W_t)_{t=1}^{\infty} \text{ iid} \sim \phi
\]
With this we get the associated stochastic kernel $P_\sigma(x, dy)$ on $S$. (A stochastic kernel on $S$ is a family of probability measures on $(S, \mathcal{B}(S))$, one for each state $x$, giving the transition probabilities from $x$.) Here, for each $x \in S$ and Borel set $B \in \mathcal{B}(S)$,
\[
P_\sigma(x, B) = \int \mathbf{1}_B\big( F(x, \sigma(x), z) \big)\, \phi(dz)
\]
i.e. the probability of the state being in $B$ at $t+1$ if it is $x$ at $t$ and the agent follows the policy $\sigma$.
$M_\sigma : \mathcal{P}(S) \to \mathcal{P}(S)$ is the corresponding Markov operator. For every measure $\psi$ (the marginal distribution at time $t$), $\psi M_\sigma$ gives the marginal distribution at time $t+1$: for every Borel set $B \in \mathcal{B}(S)$,
\[
\psi M_\sigma(B) = \int P_\sigma(x, B)\, \psi(dx)
\]
is the unconditional probability of being in $B$ tomorrow, having integrated out today's state $x$. Thus $M_\sigma$ supports the recursion $\psi_{t+1} = \psi_t M_\sigma$ under the policy $\sigma$.
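Both objects are easy to approximate by Monte Carlo. The sketch below, reusing the assumed accumulation-example primitives, estimates $P_\sigma(x, B)$ for an interval $B = [a, b]$, and pushes a sample from today's marginal $\psi_t$ forward to a sample from $\psi_{t+1} = \psi_t M_\sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)
F = lambda x, u, z: u**0.4 * z                        # assumed transition (accumulation example)
sigma = lambda x: 0.3 * x                             # assumed policy
draw_phi = lambda n: rng.lognormal(0.0, 0.1, size=n)  # sampler for phi (assumed lognormal)

def kernel_prob(x, a, b, n=100_000):
    """P_sigma(x, [a, b]) = integral of 1_B(F(x, sigma(x), z)) phi(dz), by Monte Carlo."""
    x_next = F(x, sigma(x), draw_phi(n))
    return np.mean((a <= x_next) & (x_next <= b))

def pushforward(sample_today):
    """Map a sample from psi_t into a sample from psi_{t+1} = psi_t M_sigma."""
    return F(sample_today, sigma(sample_today), draw_phi(len(sample_today)))

print(kernel_prob(1.0, 0.5, 0.7))
psi_next_sample = pushforward(rng.uniform(0.5, 1.5, size=10_000))
```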
Let $r_\sigma : S \to \mathbb{R}$ be defined by $r_\sigma(x) = r(x, \sigma(x))$. This is the current reward in state $x$ if I use policy $\sigma$. Then the expected reward tomorrow, if the state today is $x$ and I follow $\sigma$, is
\[
M_\sigma r_\sigma(x) = \int r_\sigma(y)\, P_\sigma(x, dy) = \int r_\sigma\big( F(x, \sigma(x), z) \big)\, \phi(dz)
\]
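The second expression gives a direct Monte Carlo recipe for $M_\sigma r_\sigma(x)$: draw shocks from $\phi$ and average the reward at the resulting next-period states. The reward used below is an illustrative bounded one (consistent with assumption A1 in the next section), not part of the model as stated.

```python
import numpy as np

rng = np.random.default_rng(3)
F = lambda x, u, z: u**0.4 * z                       # assumed transition
sigma = lambda x: 0.3 * x                            # assumed policy
r_sigma = lambda x: 1.0 - np.exp(-(x - sigma(x)))    # assumed bounded reward from consuming y - k

def expected_reward_tomorrow(x, n=100_000):
    """M_sigma r_sigma(x) = integral of r_sigma(F(x, sigma(x), z)) phi(dz)."""
    z = rng.lognormal(0.0, 0.1, size=n)
    return np.mean(r_sigma(F(x, sigma(x), z)))

print(expected_reward_tomorrow(1.0))
```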
It is useful to view $(W_t)_{t=1}^{\infty}$ as a stochastic process (a family of random variables, i.e. measurable functions defined on $\Omega$) on a common probability space $(\Omega, \mathcal{F}, \nu)$. Each $\omega \in \Omega$ leads to a realization $(W_t(\omega))_{t=1}^{\infty}$ of the $W_t$'s. This, along with $X_0 = x$, a policy $\sigma \in \Sigma$, and the SRS, recursively defines a realization $(X_t(\omega))_{t=0}^{\infty}$ of a time path for the state.
Let the r.v. $Y_\sigma : \Omega \to \mathbb{R}$ be defined by
\[
Y_\sigma(\omega) = \sum_{t=0}^{\infty} \rho^t r_\sigma(X_t(\omega)), \qquad \forall \omega \in \Omega.
\]
$Y_\sigma$ is well-defined since $0 < \rho < 1$ and $r(\cdot)$ is bounded; it captures the discounted present payoff from following policy $\sigma$ on each path $(X_t(\omega))_{t=0}^{\infty}$. The expected discounted payoff of the agent following policy $\sigma$ is given by, for all $X_0 = x \in S$:
\[
v_\sigma(x) \equiv E Y_\sigma = \int \left[ \sum_{t=0}^{\infty} \rho^t r_\sigma(X_t(\omega)) \right] \nu(d\omega) \tag{1}
\]
The objective is to choose $\sigma \in \Sigma$ to maximize $v_\sigma(x)$, for all $x \in S$.
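Under the same assumed primitives, $v_\sigma(x)$ can be estimated by simulating many paths from $X_0 = x$ and averaging truncated discounted sums; since $0 < \rho < 1$ and $r$ is bounded, the truncation error after $T$ periods is at most $\rho^T M / (1 - \rho)$.

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.95                                           # assumed discount factor
F = lambda x, u, z: u**0.4 * z                       # assumed transition
sigma = lambda x: 0.3 * x                            # assumed policy
r_sigma = lambda x: 1.0 - np.exp(-(x - sigma(x)))    # assumed bounded reward

def v_sigma_mc(x0, n_paths=5_000, T=200):
    """Estimate v_sigma(x0) = E[Y_sigma]; rho^T is about 3e-5 at T = 200, so the tail is negligible."""
    x = np.full(n_paths, float(x0))
    total = np.zeros(n_paths)
    for t in range(T):
        total += rho**t * r_sigma(x)                 # accumulate rho^t r_sigma(X_t) on each path
        x = F(x, sigma(x), rng.lognormal(0.0, 0.1, size=n_paths))
    return total.mean()                              # average of Y_sigma over the simulated omegas

print(v_sigma_mc(1.0))
```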
The function $v^* : S \to \mathbb{R}$ defined by
\[
v^*(x) = \sup_{\sigma \in \Sigma} v_\sigma(x) \quad \forall x \in S \tag{2}
\]
is called the value function of the problem. A policy $\sigma^*$ is defined to be optimal if for every $x \in S$, $v_{\sigma^*}(x) = v^*(x)$.
Note in Equation (1) that if the context of the probability measure being over sample paths is understood, we may drop the integral sign and use $E$, the expectations operator. Can we move the expectations operator inside the summation sign? Yes, using the Dominated Convergence Theorem (DCT).
Let $f_n = \sum_{t=0}^{n} \rho^t r_\sigma(X_t(\omega))$, $n = 1, 2, \ldots$, be the partial sums whose limit is the function $f = \sum_{t=0}^{\infty} \rho^t r_\sigma(X_t(\omega))$ being integrated in Equation (1). Since $0 < \rho < 1$ and $r(\cdot)$ is bounded, I can find $M > 0$ s.t. $|r_\sigma(X_t(\omega))| \le M$ for all time paths of the state, and use the constant function $g \equiv M/(1-\rho)$ as a bound on the partial sums: $|f_n| \le M/(1-\rho)$. Then I can appeal to the DCT: note that the RHS of Equation (1) is just $\int \lim f_n$. This then equals
\[
\lim \int f_n = \lim \int \sum_{t=0}^{n} \rho^t r_\sigma(X_t(\omega))\, \nu(d\omega) = \lim \sum_{t=0}^{n} \rho^t \int r_\sigma(X_t(\omega))\, \nu(d\omega) = \sum_{t=0}^{\infty} \rho^t \int r_\sigma(X_t(\omega))\, \nu(d\omega)
\]
which is the infinite sum of period-by-period expectations. The second equality, the interchange of the finite sum and the integral, follows from the linearity (finite additivity) of the integral. Note that this, $\sum_{t=0}^{\infty} \rho^t E\, r_\sigma(X_t)$, equals $\sum_{t=0}^{\infty} \rho^t M_\sigma^t r_\sigma(x)$. So we have, for every $X_0 = x \in S$,
\[
v_\sigma(x) = \sum_{t=0}^{\infty} \rho^t M_\sigma^t r_\sigma(x) \tag{3}
\]
where in period $t$ (given $X_0 = x$), the expectation is taken with respect to $P_\sigma^t(x, dy)$.
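Numerically, the interchange is just linearity of the sample average: averaging discounted sums path by path (the LHS of (3), as in the sketch above) and summing discounted per-period averages (the RHS) give the same number up to floating-point rounding. A small check, under the same assumed primitives:

```python
import numpy as np

rng = np.random.default_rng(5)
rho, n_paths, T = 0.95, 5_000, 200
F = lambda x, u, z: u**0.4 * z                       # assumed transition
sigma = lambda x: 0.3 * x                            # assumed policy
r_sigma = lambda x: 1.0 - np.exp(-(x - sigma(x)))    # assumed bounded reward

x = np.ones(n_paths)                                 # X_0 = 1 on every path
pathwise = np.zeros(n_paths)                         # running discounted sum on each path
per_period = 0.0                                     # running sum of rho^t * (sample mean of r_sigma(X_t))
for t in range(T):
    pathwise += rho**t * r_sigma(x)
    per_period += rho**t * r_sigma(x).mean()         # rho^t * estimate of M_sigma^t r_sigma(x)
    x = F(x, sigma(x), rng.lognormal(0.0, 0.1, size=n_paths))

print(pathwise.mean(), per_period)                   # agree up to rounding
```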
2 Main Results
Throughout this section, the assumptions below are maintained.
Assumptions.
A1. $r : \operatorname{gr}\Gamma \to \mathbb{R}$ is bounded and continuous.
A2. $\Gamma : S \rightrightarrows A$ is continuous and compact-valued.
A3. $(x, u) \mapsto F(x, u, z)$ is continuous for each $z \in Z$.
Definition 1 Given $w \in bB(S)$ [i.e. $w : S \to \mathbb{R}$ is bounded and $(\mathcal{B}(S), \mathcal{B})$-measurable], $\sigma \in \Sigma$ is $w$-greedy if for all $x \in S$,
\[
\sigma(x) \in \operatorname{argmax}_{u \in \Gamma(x)} \left\{ r(x, u) + \rho \int w(F(x, u, z))\, \phi(dz) \right\} \tag{4}
\]
Lemma 1 Suppose Assumptions A1, A2, A3 hold. If $w \in bcS$, then $\Sigma$ contains at least one $w$-greedy policy.
Since $w$ is continuous, the objective on the RHS of Equation (4) is continuous with respect to $u$, and $\Gamma(x)$ is compact, so a maximum exists by Weierstrass' Theorem; hence a map $\sigma$ selecting a maximizer at each $x$ exists. Its measurability can be established using 'measurable selection' results (e.g. Aliprantis and Border 1999, Section 17.3).
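A brute-force sketch of computing a $w$-greedy action at a given state: discretize $\Gamma(x)$ (assumed here to be $[0, x]$, as in the accumulation example) and approximate the integral in (4) with a fixed set of shock draws. All functional forms are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
rho = 0.95
F = lambda x, u, z: u**0.4 * z                       # assumed transition
r = lambda x, u: 1.0 - np.exp(-(x - u))              # assumed bounded reward r(x, u)
z_draws = rng.lognormal(0.0, 0.1, size=1_000)        # fixed draws approximating phi

def greedy_action(w, x, n_grid=200):
    """sigma(x) in argmax over Gamma(x) = [0, x] of r(x,u) + rho * integral w(F(x,u,z)) phi(dz)."""
    u_grid = np.linspace(1e-8, x, n_grid)
    vals = [r(x, u) + rho * np.mean(w(F(x, u, z_draws))) for u in u_grid]
    return u_grid[int(np.argmax(vals))]

w = lambda y: np.sqrt(y)                             # any candidate w in bcS (illustrative)
print(greedy_action(w, 1.0))
```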
We now state the main theorem, which says that the value function defined above in Equation (2) satisfies Bellman's equation (and indeed is the unique solution to Bellman's equation in $bB(S)$).
Theorem 1 Suppose Assumptions A1, A2, A3 hold. Then $v^*$ defined above is the unique function in $bB(S)$ that satisfies
\[
v^*(x) = \sup_{u \in \Gamma(x)} \left\{ r(x, u) + \rho \int v^*(F(x, u, z))\, \phi(dz) \right\} \tag{5}
\]
Moreover, $v^* \in bcS$; and a feasible policy is optimal if and only if it is $v^*$-greedy. At least one such policy exists.
We will use two lemmas to prove this theorem. The second lemma shows that the Bellman operator in Equation (5) above is a uniformly strict contraction. The first lemma concerns the operator $T_\sigma$ defined below, and shows that it is a contraction whose unique fixed point is the value function for the policy $\sigma$, namely $v_\sigma$ defined above.
Lemma 2 Define $T_\sigma$ on $bB(S)$ as follows: $\forall \sigma \in \Sigma$ and $x \in S$,
\[
T_\sigma w(x) = r(x, \sigma(x)) + \rho \int w(F(x, \sigma(x), z))\, \phi(dz) \tag{6}
\]
Then:
(i) $T_\sigma : bB(S) \to bB(S)$;
(ii) $T_\sigma$ is monotone;
(iii) $\forall w, w' \in bB(S)$, $\|T_\sigma w - T_\sigma w'\|_\infty \le \rho \|w - w'\|_\infty$, so $T_\sigma$ is a uniform contraction, and its unique fixed point is $v_\sigma$.
Proof. (i) If $w \in bB(S)$, then $|T_\sigma w(x)| \le \|r\|_\infty + \rho \|w\|_\infty$ for all $x$ (since $\phi$ is a probability measure), so $T_\sigma w$ is bounded; measurability of $T_\sigma w$ follows from that of $r_\sigma$ and of $x \mapsto \int w(F(x, \sigma(x), z))\, \phi(dz)$.
(ii) $w' \ge w$ implies $\int w' \ge \int w$, so $T_\sigma w' \ge T_\sigma w$.
(iii) Take any $w, w' \in bB(S)$, and any $x \in S$. (Interpret this $x$ in the expressions below as $X_0 = x$, the initial or current state.) Then
\[
|T_\sigma w(x) - T_\sigma w'(x)| = |r_\sigma(x) + \rho M_\sigma w(x) - r_\sigma(x) - \rho M_\sigma w'(x)| = \rho\, |M_\sigma (w - w')(x)|
\]
\[
\le \rho\, M_\sigma |w - w'|(x) \le \rho\, M_\sigma \|w - w'\|_\infty = \rho\, \|w - w'\|_\infty
\]
Since this holds for every $x \in S$, it holds for the supremum over them, $\|T_\sigma w - T_\sigma w'\|_\infty$. The first inequality we did in our treatment of integration: it is a sort of triangle inequality for integrals. The second inequality holds because $|w(y) - w'(y)| \le \|w - w'\|_\infty$, the supremum, for every $y$. The final equality follows because the constant $\|w - w'\|_\infty$ factors out of the integral, which then integrates the probability measure $P_\sigma(x, dy)$ to 1.
So $T_\sigma$ is a uniform contraction. To see that $T_\sigma v_\sigma = v_\sigma$, note that for any $x \in S$,
\[
v_\sigma(x) = \sum_{t=0}^{\infty} \rho^t M_\sigma^t r_\sigma(x) = r_\sigma(x) + \sum_{t=1}^{\infty} \rho^t M_\sigma^t r_\sigma(x) = r_\sigma(x) + \rho M_\sigma \sum_{t=0}^{\infty} \rho^t M_\sigma^t r_\sigma(x) = r_\sigma(x) + \rho M_\sigma v_\sigma(x) = T_\sigma v_\sigma(x)
\]
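Lemma 2 suggests a practical policy-evaluation algorithm: iterate $T_\sigma$ from any starting function, and by the contraction property the iterates converge to $v_\sigma$ at rate $\rho$ in the sup norm. A minimal sketch on a discretized state space, with linear interpolation standing in for function evaluation off the grid, under the same assumed primitives as in the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(7)
rho = 0.95
F = lambda x, u, z: u**0.4 * z                       # assumed transition
sigma = lambda x: 0.3 * x                            # assumed policy
r_sigma = lambda x: 1.0 - np.exp(-(x - sigma(x)))    # assumed bounded reward
z_draws = rng.lognormal(0.0, 0.1, size=1_000)        # fixed draws approximating phi
grid = np.linspace(0.05, 2.0, 100)                   # discretized state space

def T_sigma(w):
    """Apply (6) at each grid point; np.interp extends w off the grid (clamping at the ends)."""
    out = np.empty_like(grid)
    for i, x in enumerate(grid):
        x_next = F(x, sigma(x), z_draws)
        out[i] = r_sigma(x) + rho * np.mean(np.interp(x_next, grid, w))
    return out

w = np.zeros_like(grid)
for k in range(500):
    w_new = T_sigma(w)
    done = np.max(np.abs(w_new - w)) < 1e-8          # sup-norm distance, as in (iii)
    w = w_new
    if done:
        break
print(k, w[50])                                      # w now approximates v_sigma on the grid
```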
The next lemma is about the Bellman operator.
Lemma 3 The Bellman operator $T$ defined by
\[
T w(x) = \sup_{u \in \Gamma(x)} \left\{ r(x, u) + \rho \int w(F(x, u, z))\, \phi(dz) \right\} \tag{7}
\]
is monotone, and uniformly contracting on $(bB(S), d_\infty)$.
Proof. $w' \ge w$ implies $\int w' \ge \int w$, which implies $T w' \ge T w$; this establishes that $T$ is monotone.
To establish the second claim, note first that $|\sup w - \sup w'| \le \sup |w - w'|$, where the sup is taken over all $x \in S$. This is because
\[
\sup w = \sup(w - w' + w') \le \sup(w - w') + \sup w' \le \sup|w - w'| + \sup w'
\]
So $\sup w - \sup w' \le \sup|w - w'|$. Interchanging the roles of $w$ and $w'$ clinches the argument.
Here, we have $|T w(x) - T w'(x)|$ equal to
\[
\left| \sup_{u} \left\{ r(x, u) + \rho \int w(F(x, u, z))\, \phi(dz) \right\} - \sup_{u} \left\{ r(x, u) + \rho \int w'(F(x, u, z))\, \phi(dz) \right\} \right|
\]
\[
\le \rho \sup_{u} \left| \int \big( w(F(x, u, z)) - w'(F(x, u, z)) \big)\, \phi(dz) \right| \le \rho \sup_{u} \int \big| w(F(x, u, z)) - w'(F(x, u, z)) \big|\, \phi(dz)
\]
\[
\le \rho \sup_{u} \int \|w - w'\|_\infty\, \phi(dz) = \rho\, \|w - w'\|_\infty
\]
This is true for all $x \in S$, and so for the supremum on the LHS.
The penultimate inequality is again the sort of triangle inequality for the integral done in the integration lecture.
Proof of Theorem 1.
By Lemma 3 and Banach's fixed point theorem, there is a unique $w^* \in bB(S)$ s.t. $T w^* = w^*$. We need to show that (i) $w^* \in bcS$ and (ii) $w^* = v^*$.
(i) First note that $T$ maps $bcS$ into itself. Indeed, if $w \in bcS$, then it follows from Berge's Theorem of the Maximum that $T w$ is continuous; boundedness is routine, as in other proofs above. Recall that the Theorem of the Maximum concerns the continuity of the maximum value function of a maximization problem, and the upper hemicontinuity of the argmax correspondence.
Second, note that $bcS$ is a closed subset of $bB(S)$; this closedness of $bcS$ was established earlier in the course. So, if we start with any $w \in bcS$, the limit of the sequence $(T^n w)_{n=0}^{\infty}$ lies in $bcS$; by Banach's theorem, this limit is $w^*$.
(ii) Note that since $w^* \in bcS$, by Lemma 1 there is at least one $w^*$-greedy policy, say $\bar{\sigma}$; so $T w^* = T_{\bar{\sigma}} w^*$. Thus $w^* = T w^* = T_{\bar{\sigma}} w^*$, which implies $w^* = v_{\bar{\sigma}}$, the unique fixed point of $T_{\bar{\sigma}}$. However, by definition of the value function $v^*$, $v_\sigma \le v^*$ for any policy $\sigma$; and so $w^* = v_{\bar{\sigma}} \le v^*$.
Now for the converse, i.e. $w^* \ge v^*$. Take any $\sigma \in \Sigma$; then $w^* = T w^* \ge T_\sigma w^*$, since $\sigma$ may not maximize the RHS of the Bellman expression. Now, $T_\sigma$ is monotone, so $w^* \ge T_\sigma w^*$ implies $T_\sigma w^* \ge T_\sigma^2 w^*$ by applying $T_\sigma$ to both sides; the latter is therefore also less than or equal to $w^*$. Iterating $k$ times, $w^* \ge T_\sigma^k w^*$. Since $T_\sigma^k w^*$ converges to $v_\sigma$ uniformly (by Lemma 2 and Banach's theorem), taking limits we get $w^* \ge v_\sigma$. Since this holds for all $\sigma$, it holds for the supremum, and hence $w^* \ge v^*$. Combining the two directions, $w^* = v^*$; and since $v_{\bar{\sigma}} = w^* = v^*$, the $w^*$-greedy policy $\bar{\sigma}$ is optimal, which also establishes existence.
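The proof is constructive in spirit: iterating $T$ from any $w \in bcS$ converges to $v^*$ at rate $\rho$ (Banach), and reading off a $v^*$-greedy policy gives an approximately optimal one. A value function iteration sketch under the same assumed primitives, with $\Gamma(x) = [0, x]$:

```python
import numpy as np

rng = np.random.default_rng(8)
rho = 0.95
F = lambda x, u, z: u**0.4 * z                       # assumed transition
r = lambda x, u: 1.0 - np.exp(-(x - u))              # assumed bounded reward
z_draws = rng.lognormal(0.0, 0.1, size=500)          # fixed draws approximating phi
grid = np.linspace(0.05, 2.0, 100)                   # discretized state space

def T(w):
    """One application of the Bellman operator (5)/(7) on the grid, plus the greedy actions."""
    Tw, pol = np.empty_like(grid), np.empty_like(grid)
    for i, x in enumerate(grid):
        u_grid = np.linspace(1e-8, x, 50)            # discretized Gamma(x) = [0, x]
        vals = [r(x, u) + rho * np.mean(np.interp(F(x, u, z_draws), grid, w))
                for u in u_grid]
        j = int(np.argmax(vals))
        Tw[i], pol[i] = vals[j], u_grid[j]
    return Tw, pol

w = np.zeros_like(grid)
for k in range(500):
    w_new, pol = T(w)
    err = np.max(np.abs(w_new - w))                  # shrinks by a factor rho each iteration
    w = w_new
    if err < 1e-6:
        break
print(f"converged in {k+1} iterations")              # w approximates v*, pol is (approx.) v*-greedy
```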