where $H$ is called the deviation matrix and is given by
$$
H = \underbrace{\big(I - T^{ssd}_{\mathcal{PM}^{\phi},\mathcal{G}} + \Pi^{ssd}\big)^{-1}}_{\text{fundamental matrix, } Z}\,\big(I - \Pi^{ssd}\big). \qquad (6.9)
$$
(d) $\vec{h}$ is not unique. If $(\vec{g},\vec{h})$ is a solution, then for all $k \ge 0$, $(\vec{g},\,\vec{h} + k\,\Pi^{ssd}\vec{h})$ is also a solution.
The Poisson equation is important because the quantity $\vec{g}$ can be used to compute the probability that the ssd-global Markov chain visits the set $Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}$, as shown in the following theorem. This is crucial because it can then be used to enforce the constraint $\eta^{ssd}_{av} = 0$ in the optimization criterion of Equation (5.12).
Theorem 6.1.3 The probability of the ssd-global Markov chain visiting $(Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss})$ for an initial distribution $\iota'_{init} \in \mathcal{M}_{S \times G}$ is given by
$$
\Pr\big[\pi \rightarrow (Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}) \mid \iota'_{init}\big] = \vec{\iota}^{\,\prime T}_{init}\,\vec{g}. \qquad (6.10)
$$
Proof: Note that under $T^{ssd}_{\mathcal{PM}^{\phi},\mathcal{G}}$, each state in $(Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss})$ is a sink by construction and therefore recurrent. Applying Lemma 5.1.1 gives
$$
\begin{aligned}
\Pr\big[\pi \rightarrow (Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}) \mid \iota'_{init}\big]
&= \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\left[\sum_{t=0}^{T} r_{av}([s_t, g_t]) \;\middle|\; \iota'_{init}\right] \\
&= \vec{\iota}^{\,\prime T}_{init}\,\Pi^{ssd}\,\vec{1}_{S\times G}\big(Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}\big) \\
&= \vec{\iota}^{\,\prime T}_{init}\,\Pi^{ssd}\,\vec{r}_{av} \\
&= \vec{\iota}^{\,\prime T}_{init}\,\vec{g},
\end{aligned} \qquad (6.11)
$$
where the second line follows from the first due to Equation (4.44), the third line follows from the fact that $\vec{r}_{av}$ can be rewritten as the indicator vector $\vec{r}_{av} = \vec{1}_{S\times G}\big(Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}\big)$, and the final line follows from $\vec{g} = \Pi^{ssd}\vec{r}_{av}$.
Theorem 6.1.3 will be used later in this chapter to enforce the constraint $\eta^{ssd}_{av}(r) = 0$ during the optimization procedure for the conservative optimization criterion of Equation (5.12).
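To make Theorem 6.1.3 concrete, the following minimal numerical sketch builds a small, hypothetical ssd-global Markov chain whose single avoid state is a sink, approximates the limiting matrix $\Pi^{ssd}$ by Cesàro averaging, forms $\vec{g} = \Pi^{ssd}\vec{r}_{av}$, and compares $\vec{\iota}^{\,\prime T}_{init}\vec{g}$ with the classical absorbing-chain formula for the probability of reaching the avoid state. The chain, its size, and the names `T_ssd`, `r_av`, and `iota_init` are illustrative assumptions, not objects defined in this thesis.

```python
import numpy as np

# Toy 4-state chain standing in for an ssd-global Markov chain; state 3 plays
# the role of the (sink) avoid state, state 2 is some other absorbing state.
T_ssd = np.array([
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
avoid = [3]
r_av = np.zeros(4)
r_av[avoid] = 1.0                       # indicator reward of the avoid set

# Limiting (Cesaro) matrix Pi^{ssd}, approximated by averaging powers of T_ssd.
P, acc, N = np.eye(4), np.zeros((4, 4)), 5000
for _ in range(N):
    acc += P
    P = P @ T_ssd
Pi_ssd = acc / N

g = Pi_ssd @ r_av                       # gain vector of the Poisson equation

iota_init = np.array([1.0, 0.0, 0.0, 0.0])
print("iota' g                :", iota_init @ g)

# Cross-check via the standard absorbing-chain formula B = (I - Q)^{-1} R,
# restricted to the transient states {0, 1}.
Q, R = T_ssd[:2, :2], T_ssd[:2, avoid]
B = np.linalg.solve(np.eye(2) - Q, R)
print("absorbing-chain answer :", B[0, 0])
```

The two printed numbers agree up to the truncation error of the Cesàro average, illustrating Equation (6.10).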
6.2 Stochastic Dynamic Programming

Consider a discrete-time stochastic system whose evolution is governed by
$$
x_{t+1} = f(x_t, \alpha_t, w_t), \qquad (6.12)
$$
where $x_t$ is the state of the system at time step $t$, $\alpha_t$ is an exogenous input that can be applied by an agent, and $w_t$ is a disturbance from some probability space. The discussion in this section is restricted to the case of finite state and action spaces, and also to a finite probability space for the disturbance.
It is required that the disturbance $w_t$ have a conditional distribution of the form $p(w_t \mid x_t, \alpha_t)$. Also of concern is the reward obtained by the agent, given by a function $r(x_t, \alpha_t, w_t)$. Next, consider a policy $\mu = (\mu_0, \mu_1, \ldots)$, where at step $t$ an action $\alpha_t$ is chosen according to $\mu_t$. For stochastic systems, the policy $\mu_t$ may require the entire execution history to successfully pick the action, i.e.,
$$
\alpha_t = \mu_t(x_0, \alpha_0, \ldots, \alpha_{t-1}, x_t). \qquad (6.13)
$$
Restricting the discussion to the discounted long term reward and a known initial state of the system, the objective of the agent is to maximize the following expected long term discounted reward
$$
\eta_\beta(x_0) = \sup_{\mu}\, \mathbb{E}\left[\sum_{t=0}^{\infty} \beta^t\, r(x_t, \mu_t, w_t)\right], \qquad 0 \le \beta < 1, \qquad (6.14)
$$
under the constraint that $x_t$ evolves according to Equation (6.12). For this particular choice of objective it is well known that a stationary Markov policy is sufficient. Formally, this means that $\alpha_t = \mu_t(x_t) = \mu(x_t)$.
The dynamic programming algorithm for this problem is given by the iteration
$$
\begin{aligned}
V^\beta_0(x) &= 0, \\
V^\beta_{k+1}(x) &= \sup_{\alpha}\, \mathbb{E}_w\Big[r(x, \alpha, w) + \beta\, V^\beta_k\big(f(x, \alpha, w)\big)\Big].
\end{aligned} \qquad (6.15)
$$
For a known initial starting state $x_0$, the optimal value $\eta^*_\beta(x_0)$ is given by the limit
$$
\eta^*_\beta(x_0) = \lim_{k\to\infty} V^\beta_k(x_0). \qquad (6.16)
$$
Since it is known that a stationary optimal policy exists in the case of infinite horizon discounted reward over a finite system model, it follows that this policy satisfies the Bellman Optimality Equation given by
$$
V^{\beta*}(x) = \sup_{\alpha}\, \mathbb{E}_w\Big[r(x, \alpha, w) + \beta\, V^{\beta*}\big(f(x, \alpha, w)\big)\Big] \qquad \forall x. \qquad (6.17)
$$
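As a concrete illustration of the recursion in Equations (6.15) and (6.16), the sketch below iterates the backup over small finite state, action, and disturbance spaces until the value function stops changing. The dynamics `f`, disturbance distribution `p_w`, reward `reward`, and all sizes are randomly generated placeholders, not quantities from this thesis.

```python
import numpy as np

n_states, n_actions, n_dist = 5, 2, 3
beta = 0.9
rng = np.random.default_rng(0)

# Placeholder model: next-state table f(x, a, w), reward r(x, a, w),
# and conditional disturbance distribution p(w | x, a).
f = rng.integers(0, n_states, size=(n_states, n_actions, n_dist))
reward = rng.random((n_states, n_actions, n_dist))
p_w = rng.random((n_states, n_actions, n_dist))
p_w /= p_w.sum(axis=2, keepdims=True)

# V_{k+1}(x) = max_a E_w[ r(x, a, w) + beta * V_k(f(x, a, w)) ]   (Eq. 6.15)
V = np.zeros(n_states)
for _ in range(500):
    Q = np.einsum('xaw,xaw->xa', p_w, reward + beta * V[f])
    V_new = Q.max(axis=1)
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-10:   # V has effectively reached the limit of Eq. (6.16)
        break

print("approximate optimal values:", V)
```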
6.2.1 Dynamic Programming Variants
Equations (6.15) and (6.16) are leveraged to perform dynamic programming in several popular ways in the literature. Restricting the scope of the discussion to a Markov decision process in which the
system evolution is given by a conditional probability distribution $x_{k+1} \sim T(x_{k+1} \mid x_k, \alpha)$ and the reward is given by $r(x, \alpha)$, the expectation in Equation (6.15) can be explicitly computed as
$$
\begin{aligned}
V^\beta_0(x) &= 0, \\
V^\beta_{k+1}(x) &= \sup_{\alpha} \sum_{x'} T(x' \mid x, \alpha)\Big[r(x, \alpha) + \beta\, V^\beta_k(x')\Big].
\end{aligned} \qquad (6.18)
$$
This iterative method constitutes what is known as the value iteration methodology for dynamic programming [12, 13]. At each iteration step $k$, $V^\beta_k(x)$ is called the value function of the state $x$. It denotes the expected long term discounted reward the agent would collect if the initial state is $x_{t=0} = x$. When the iteration has converged, the policy is computed using
$$
\mu(x) = \arg\max_{\alpha} \sum_{x'} T(x' \mid x, \alpha)\Big[r(x, \alpha) + \beta\, V^\beta_k(x')\Big]. \qquad (6.19)
$$
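A minimal value iteration sketch for the MDP form of Equations (6.18) and (6.19) follows; the transition kernel `T`, reward `r`, and all sizes are randomly generated placeholders.

```python
import numpy as np

n_states, n_actions, beta = 4, 3, 0.95
rng = np.random.default_rng(1)
T = rng.random((n_states, n_actions, n_states))   # T[x, a, x'] = T(x' | x, a)
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))             # r[x, a]

V = np.zeros(n_states)
for _ in range(10_000):
    # Since sum_x' T(x'|x,a) = 1, the backup of Equation (6.18) reduces to
    # Q[x, a] = r(x, a) + beta * sum_x' T(x'|x,a) V_k(x').
    Q = r + beta * T @ V
    V_new = Q.max(axis=1)
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-12:
        break

# Greedy policy of Equation (6.19), extracted after convergence.
Q = r + beta * T @ V
policy = Q.argmax(axis=1)
print("value function:", V)
print("greedy policy :", policy)
```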
It is also well known that for any given policy $\mu$, the value function $V^\beta_\mu(x)$ satisfies the following system of equations
$$
V^\beta_\mu(x) = \sum_{x'} T\big(x' \mid x, \mu(x)\big)\Big[r\big(x, \mu(x)\big) + \beta\, V^\beta_\mu(x')\Big], \qquad (6.20)
$$
called the Bellman Equation. While it is possible to solve the above equation using exact methods, in most cases the Bellman equation is solved by iteration: the r.h.s. of Equation (6.20) is repeatedly applied to successive values of $V^\beta_\mu$ until convergence. The Bellman equation is utilized in another variant of dynamic programming, namely policy iteration, or Howard's method [62], outlined in Algorithm 6.1.
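Before turning to Algorithm 6.1, the sketch below illustrates both routes for a fixed stationary policy: solving Equation (6.20) exactly as a linear system, and repeatedly applying its right-hand side until convergence. The model and the names `T`, `r`, and `mu` are placeholders of the same kind as in the previous sketch.

```python
import numpy as np

n_states, n_actions, beta = 4, 3, 0.95
rng = np.random.default_rng(2)
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))
mu = rng.integers(0, n_actions, size=n_states)    # an arbitrary stationary policy

# Restrict the model to the actions chosen by mu.
idx = np.arange(n_states)
T_mu = T[idx, mu, :]                  # T_mu[x, x'] = T(x' | x, mu(x))
r_mu = r[idx, mu]                     # r_mu[x]     = r(x, mu(x))

# Exact method: Equation (6.20) is linear, (I - beta * T_mu) V = r_mu.
V_exact = np.linalg.solve(np.eye(n_states) - beta * T_mu, r_mu)

# Iterative method: repeatedly apply the r.h.s. of Equation (6.20).
V_iter = np.zeros(n_states)
for _ in range(5_000):
    V_iter = r_mu + beta * T_mu @ V_iter

print(np.allclose(V_exact, V_iter))   # both evaluations agree
```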
Algorithm 6.1 Policy Iteration for Markov Decision Process
1: $iter \gets 0$
2: Choose an initial policy $\mu_{iter}$.
3: $V^\beta_{iter}(x) \gets 0 \quad \forall x$
4: repeat
5: $iter \gets iter + 1$
6: Policy Improvement: Improve the policy
$$
\mu_{iter}(x) = \arg\max_{\alpha} \sum_{x'} T(x' \mid x, \alpha)\Big[r(x, \alpha) + \beta\, V^\beta_{iter-1}(x')\Big] \qquad (6.21)
$$
7: Policy Evaluation: Solve the Bellman Equation (6.20) to get $V^\beta_{iter}(x) \;\; \forall x$.
8: until $\big|V^\beta_{iter}(x) - V^\beta_{iter-1}(x)\big| \le \varepsilon_\beta \quad \forall x$
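A compact sketch of Algorithm 6.1 on the same kind of randomly generated placeholder model is given below; here policy evaluation uses the exact linear solve of Equation (6.20) rather than iteration.

```python
import numpy as np

n_states, n_actions, beta, eps = 4, 3, 0.95, 1e-9
rng = np.random.default_rng(3)
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))
idx = np.arange(n_states)

def evaluate(mu):
    """Policy evaluation: solve the linear Bellman Equation (6.20) for V_mu."""
    T_mu, r_mu = T[idx, mu, :], r[idx, mu]
    return np.linalg.solve(np.eye(n_states) - beta * T_mu, r_mu)

mu = np.zeros(n_states, dtype=int)    # initial policy (step 2)
V_prev = np.zeros(n_states)           # V_0 = 0        (step 3)
while True:
    # Policy improvement, Equation (6.21), against the previous value function.
    Q = r + beta * T @ V_prev
    mu = Q.argmax(axis=1)
    # Policy evaluation (step 7).
    V = evaluate(mu)
    if np.max(np.abs(V - V_prev)) <= eps:   # stopping test (step 8)
        break
    V_prev = V

print("policy         :", mu)
print("value function :", V)
```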
So far, the discussion of stochastic dynamic programming in this section has only focused on the discounted reward criterion. However, the Bellman Optimality Equation, the Bellman Equation, and the value and policy iteration techniques can be derived for the expected long term average reward criterion (Definition 2.2.9) as well. In the general setting of an arbitrary reward function and an infinite state space, the existence of an optimal solution for the average case is not guaranteed [81, 108]. However, for the set of problems of interest in this thesis, the global Markov chain is a discrete time system that evolves over a finite state space, in which case the average reward does have an optimum. Additionally, as will be seen in Section 6.5, the optimal solution for the average case is not required for the algorithm proposed herein; only the evaluation of the average reward value function under a given FSC is required to guarantee LTL satisfaction. Therefore, the Bellman Equation for the average reward case is sufficient for this work. In the succeeding section, the relevant dynamic programming equations for both discounted and average rewards are summarized for the specific case of POMDPs controlled by FSCs.