
where $H$ is called the deviation matrix and is given by

$$H = \underbrace{\left(I - T^{PM_{\phi,G}}_{ssd} + \Pi_{ssd}\right)^{-1}}_{\text{fundamental matrix, } Z}\,\left(I - \Pi_{ssd}\right). \qquad (6.9)$$

(d) $\vec{h}$ is not unique. If $(\vec{g},\vec{h})$ is a solution, then for all $k \ge 0$, $(\vec{g},\, \vec{h} + k\,\Pi_{ssd}\vec{h})$ is also a solution.
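To make these objects concrete, the following sketch is a minimal numpy example using a hypothetical 3-state chain in place of $T^{PM_{\phi,G}}_{ssd}$, and it assumes the standard form of the Poisson equation, $\vec{g} + \vec{h} = \vec{r} + P\vec{h}$ with $\vec{g} = \Pi\vec{r}$ and $\vec{h} = H\vec{r}$. It computes the fundamental matrix $Z$ and the deviation matrix $H$ of Equation (6.9) and numerically checks property (d):

```python
import numpy as np

# Hypothetical 3-state chain standing in for the ssd-global Markov chain;
# state 2 is a sink, and r is an arbitrary per-state reward vector.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 0.5, 1.0])

# Stationary matrix Pi: here lim P^t exists, so a large matrix power suffices.
Pi = np.linalg.matrix_power(P, 5000)

# Fundamental matrix Z and deviation matrix H, Equation (6.9).
I = np.eye(3)
Z = np.linalg.inv(I - P + Pi)
H = Z @ (I - Pi)

# Assumed standard Poisson equation: g + h = r + P h, with g = Pi r, h = H r.
g, h = Pi @ r, H @ r
print(np.allclose(g + h, r + P @ h))          # True

# Property (d): shifting a solution by k * Pi h gives another solution.
# (Pi @ h is zero for h = H r, so start from the shifted solution h + 1,
#  which is also a solution because P @ ones = ones.)
h_alt = h + np.ones(3)
k = 3.0
h_new = h_alt + k * (Pi @ h_alt)
print(np.allclose(g + h_new, r + P @ h_new))  # True
```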

The Poisson equation is important because the quantity $\vec{g}$ can be used to compute the probability that the ssd-global Markov chain visits the set $\mathit{Avoid}^{\widetilde{PM}_{\phi,G}}$, as shown in the following theorem. This is crucial because it can then be used to enforce the constraint $\eta^{ssd}_{av} = 0$ in the optimization criterion of Equation (5.12).

Theorem 6.1.3 The probability of the ssd-global Markov chain visiting $(\mathit{Avoid}^{\widetilde{PM}_\phi \times G_{ss}})$ for an initial distribution $\vec{\iota}^{\,\prime}_{init} \in \mathcal{M}_{S\times G}$ is given by

$$\Pr\left[\pi \to (\mathit{Avoid}^{\widetilde{PM}_\phi \times G_{ss}}) \,\middle|\, \vec{\iota}^{\,\prime}_{init}\right] = \vec{\iota}^{\,\prime T}_{init}\,\vec{g}. \qquad (6.10)$$

Proof Note that under $T^{PM_{\phi,G}}_{ssd}$, each state in $(\mathit{Avoid}^{\widetilde{PM}_\phi \times G_{ss}})$ is a sink by construction and therefore recurrent. Applying Lemma 5.1.1 gives

$$\begin{aligned}
\Pr\left[\pi \to (\mathit{Avoid}^{\widetilde{PM}_\phi \times G_{ss}}) \,\middle|\, \vec{\iota}^{\,\prime}_{init}\right]
&= \lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T} E\left[\, r_{av}([s_t, g_t]) \,\middle|\, \vec{\iota}^{\,\prime}_{init} \right] \\
&= \vec{\iota}^{\,\prime T}_{init}\, \Pi_{ssd}\, \vec{1}_{S\times G}\!\left(\mathit{Avoid}^{\widetilde{PM}_\phi \times G_{ss}}\right) \\
&= \vec{\iota}^{\,\prime T}_{init}\, \Pi_{ssd}\, \vec{r}_{av} \\
&= \vec{\iota}^{\,\prime T}_{init}\, \vec{g},
\end{aligned} \qquad (6.11)$$

where line 1 implies line 2 due to Equation (4.44), and line 3 follows from the fact that $\vec{r}_{av}$ can be re-written as the indicator vector $\vec{r}_{av} = \vec{1}_{S\times G}\!\left(\mathit{Avoid}^{\widetilde{PM}_\phi \times G_{ss}}\right)$.
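As a sanity check on Theorem 6.1.3, the sketch below builds a small hypothetical chain (not the actual ssd-global chain) in which the avoid states are sinks, computes $\vec{g} = \Pi\,\vec{r}_{av}$ with $\vec{r}_{av}$ the indicator of the avoid set, and compares $\vec{\iota}^{\,\prime T}_{init}\vec{g}$ against absorption probabilities obtained directly from the transient block:

```python
import numpy as np

# Hypothetical 4-state chain; states 2 and 3 are sinks, and state 3 plays the
# role of the avoid set (a stand-in for Avoid in the ssd-global chain).
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
r_av = np.array([0.0, 0.0, 0.0, 1.0])     # indicator vector of the avoid set
iota = np.array([0.7, 0.3, 0.0, 0.0])     # initial distribution

# g = Pi r_av, with Pi obtained from a large matrix power (P^t converges here).
Pi = np.linalg.matrix_power(P, 10_000)
g = Pi @ r_av

# Reference computation: probability of absorption into state 3, from the
# transient block Q and the one-step probabilities R into the avoid sink.
Q, R = P[:2, :2], P[:2, 3]
absorb = np.linalg.solve(np.eye(2) - Q, R)

print(g[:2], absorb)       # the two agree on the transient states
print(iota @ g)            # Pr[visit avoid set | iota], as in Equation (6.10)
```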

Theorem 6.1.3 will be used later in this chapter to enforce the constraint $\eta^{ssd}_{av}(r) = 0$ during the optimization procedure for the conservative optimization criterion of Equation (5.12).

Consider a discrete-time stochastic system of the form

$$x_{t+1} = f(x_t, \alpha_t, w_t), \qquad (6.12)$$

where $x_t$ is the state of the system at time step $t$, $\alpha_t$ are exogenous inputs that can be applied by an agent, and $w_t$ is a disturbance from some probability space. The discussion in this section is restricted to finite state and action spaces, and to a finite probability space for the disturbance.

It is required that the disturbance $w_t$ have a conditional distribution of the form $p(w_t \mid x_t, \alpha_t)$. Also of concern is the reward obtained by the agent, given by a function $r(x_t, \alpha_t, w_t)$. Next, consider a policy $\mu = (\mu_0, \mu_1, \dots)$, where at step $t$ an action $\alpha_t$ is chosen according to $\mu_t$. For stochastic systems, the policy $\mu_t$ may require the entire execution history to pick the action, i.e.,

$$\alpha_t = \mu_t(x_0, \alpha_0, \dots, \alpha_{t-1}, x_t). \qquad (6.13)$$

Restricting the discussion to the discounted long-term reward and a known initial state of the system, the objective of the agent is to maximize the expected long-term discounted reward

$$\eta_\beta(x_0) = \sup_{\mu}\, E\!\left[\sum_{t=0}^{\infty} \beta^t\, r(x_t, \mu_t, w_t)\right], \qquad 0 \le \beta < 1, \qquad (6.14)$$

under the constraint that $x_t$ evolves according to Equation (6.12). For this particular choice of objective it is well known that a stationary Markov policy is sufficient. Formally, this means that $\alpha_t = \mu_t(x_t) = \mu(x_t)$.
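For illustration, the following sketch estimates the quantity being maximized in Equation (6.14) by Monte Carlo for a small hypothetical system under a fixed stationary policy; the dynamics $f$, disturbance distribution, reward, and policy below are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite system: 3 states, 2 actions, 2 disturbance values.
def f(x, a, w):
    return (x + a + w) % 3                 # deterministic given the disturbance

def p_w(x, a):
    return np.array([0.8, 0.2])            # p(w | x, a); independent of (x, a) here

def r(x, a, w):
    return 1.0 if f(x, a, w) == 0 else 0.0

def mu(x):                                 # a fixed stationary Markov policy
    return 0 if x == 0 else 1

def discounted_return(x0, beta=0.9, horizon=200):
    x, total = x0, 0.0
    for t in range(horizon):               # horizon truncates the infinite sum
        a = mu(x)
        w = rng.choice(2, p=p_w(x, a))
        total += beta**t * r(x, a, w)
        x = f(x, a, w)
    return total

# Monte Carlo estimate of the expected discounted reward from x0 = 0 under mu.
print(np.mean([discounted_return(0) for _ in range(5000)]))
```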

The dynamic programming algorithm for the preceding problem is given by the iteration

$$\begin{aligned}
V^\beta_0(x) &= 0, \\
V^\beta_{k+1}(x) &= \sup_{\alpha}\, E_{w}\!\left[\, r(x,\alpha,w) + \beta V^\beta_k\big[f(x,\alpha,w)\big] \right].
\end{aligned} \qquad (6.15)$$

For a known initial state $x_0$, the optimal value $\eta^{*}_\beta(x_0)$ is given by the limit

$$\eta^{*}_\beta(x_0) = \lim_{k\to\infty} V^\beta_k(x_0). \qquad (6.16)$$

Since it is known that a stationary optimal policy exists in the case of infinite-horizon discounted reward over a finite system model, it follows that this policy satisfies the Bellman Optimality Equation given by

$$V^{\beta*}(x) = \sup_{\alpha}\, E_{w}\!\left[\, r(x,\alpha,w) + \beta V^{\beta*}\big[f(x,\alpha,w)\big] \right] \qquad \forall x. \qquad (6.17)$$
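The iteration of Equation (6.15) and the fixed-point property of Equation (6.17) can be checked directly on the same kind of toy system; the sketch below (again with invented $f$, $p(w \mid x, \alpha)$, and $r$) performs the backup with an explicit expectation over the finite disturbance space and verifies that the iterates converge to a fixed point of the Bellman Optimality Equation:

```python
import numpy as np

# Hypothetical finite system: 3 states, 2 actions, 2 disturbance values.
S, A, W, beta = range(3), range(2), range(2), 0.9

def f(x, a, w):
    return (x + a + w) % 3

def p_w(x, a):
    return [0.8, 0.2]                      # p(w | x, a)

def r(x, a, w):
    return 1.0 if f(x, a, w) == 0 else 0.0

def backup(V):
    """One application of the right-hand side of Equation (6.15)."""
    return np.array([
        max(sum(p_w(x, a)[w] * (r(x, a, w) + beta * V[f(x, a, w)]) for w in W)
            for a in A)
        for x in S
    ])

V = np.zeros(len(S))                       # V_0(x) = 0
for _ in range(500):                       # V_k approaches eta*_beta (Eq. 6.16)
    V = backup(V)

# At convergence, V satisfies the Bellman Optimality Equation (6.17).
print(np.allclose(V, backup(V)))           # True (up to numerical tolerance)
```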

6.2.1 Dynamic Programming Variants

Equations (6.15) and (6.16) are leveraged to perform dynamic programming in several popular ways in the literature. Restricting the scope of the discussion to a Markov decision process in which the system evolution is given by a conditional probability distribution $x_{k+1} \sim T(x_{k+1} \mid x_k, \alpha)$, and the reward is given by $r(x,\alpha)$, the expectation in Equation (6.15) can be explicitly computed as

$$\begin{aligned}
V^\beta_0(x) &= 0, \\
V^\beta_{k+1}(x) &= \sup_{\alpha} \sum_{x'} T(x' \mid x, \alpha)\left[\, r(x,\alpha) + \beta V^\beta_k(x') \right].
\end{aligned} \qquad (6.18)$$

This iterative method constitutes what is known as value iteration for dynamic programming [12, 13]. At each iteration step $k$, $V^\beta_k(x)$ is called the value function of the state $x$. It denotes the expected long-term discounted reward the agent would collect if the initial state were $x_{t=0} = x$. When the iteration has converged, the policy is computed using

$$\mu(x) = \arg\max_{\alpha} \sum_{x'} T(x' \mid x, \alpha)\left[\, r(x,\alpha) + \beta V^\beta_k(x') \right]. \qquad (6.19)$$
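A compact numpy sketch of Equations (6.18) and (6.19), using a randomly generated hypothetical MDP; the transition tensor T[a, x, x'] and reward matrix R[x, a] below are placeholders, not objects from this thesis:

```python
import numpy as np

# Hypothetical MDP: T[a, x, x'] = T(x' | x, a), R[x, a] = r(x, a).
nS, nA, beta = 4, 2, 0.95
rng = np.random.default_rng(1)
T = rng.random((nA, nS, nS))
T /= T.sum(axis=2, keepdims=True)            # normalize rows into distributions
R = rng.random((nS, nA))

# Value iteration, Equation (6.18).  Since T(. | x, a) sums to one, r(x, a)
# can be pulled out of the sum over x'.
V = np.zeros(nS)
for _ in range(2000):
    Q = R + beta * np.einsum('axy,y->xa', T, V)   # Q[x, a]
    V = Q.max(axis=1)

# Greedy policy extraction, Equation (6.19).
Q = R + beta * np.einsum('axy,y->xa', T, V)
mu = Q.argmax(axis=1)
print(V, mu)
```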

It is also well known that for any given policy $\mu$, the value function $V^\beta_\mu(x)$ satisfies the following system of equations

$$V^\beta_\mu(x) = \sum_{x'} T(x' \mid x, \mu(x))\left[\, r(x,\mu(x)) + \beta V^\beta_\mu(x') \right], \qquad (6.20)$$

called the Bellman Equation. While it is possible to solve the above equation exactly, in most cases the Bellman Equation is solved by iteration: the r.h.s. of Equation (6.20) is repeatedly applied to successive values of $V^\beta_\mu$ until convergence. The Bellman Equation is also utilized in another variant of dynamic programming, namely policy iteration, or Howard's method [62], outlined in Algorithm 6.1.
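Because Equation (6.20) is linear in $V^\beta_\mu$, the exact solution is a single linear solve. The helper below sketches this for the same hypothetical T[a, x, x'] / R[x, a] representation used above; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def evaluate_policy(T, R, mu, beta):
    """Solve the Bellman Equation (6.20) exactly: (I - beta * T_mu) V = r_mu,
    where T_mu(x, x') = T(x' | x, mu(x)) and r_mu(x) = r(x, mu(x))."""
    nS = R.shape[0]
    idx = np.arange(nS)
    T_mu = T[mu, idx, :]                 # rows of T selected by the policy
    r_mu = R[idx, mu]
    return np.linalg.solve(np.eye(nS) - beta * T_mu, r_mu)
```

Iterating the right-hand side of (6.20) instead converges to the same $V^\beta_\mu$ at a geometric rate $\beta$; the linear solve is preferable for small state spaces.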

Algorithm 6.1 Policy Iteration for Markov Decision Process

1: $iter \leftarrow 0$
2: Choose an initial policy $\mu_{iter}$.
3: $V^\beta_{iter}(x) \leftarrow 0 \quad \forall x$
4: repeat
5:   $iter \leftarrow iter + 1$
6:   Policy Improvement: Improve the policy
     $$\mu_{iter}(x) = \arg\max_{\alpha} \sum_{x'} T(x' \mid x, \alpha)\left[\, r(x,\alpha) + \beta V^\beta_{iter-1}(x') \right] \qquad (6.21)$$
7:   Policy Evaluation: Solve the Bellman Equation (6.20) to get $V^\beta_{iter}(x) \quad \forall x$.
8: until $\big| V^\beta_{iter}(x) - V^\beta_{iter-1}(x) \big| \le \epsilon_\beta \quad \forall x$
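Putting the two steps together, a minimal sketch of Algorithm 6.1 for the hypothetical T[a, x, x'] / R[x, a] representation; the line numbers in the comments refer to the algorithm listing above:

```python
import numpy as np

def policy_iteration(T, R, beta, eps=1e-8):
    nA, nS, _ = T.shape
    idx = np.arange(nS)
    V = np.zeros(nS)                                  # line 3: V_0(x) = 0
    while True:                                       # lines 4-8
        # Line 6, Equation (6.21): policy improvement against the current V.
        Q = R + beta * np.einsum('axy,y->xa', T, V)
        mu = Q.argmax(axis=1)
        # Line 7: policy evaluation, solving the Bellman Equation (6.20).
        T_mu, r_mu = T[mu, idx, :], R[idx, mu]
        V_new = np.linalg.solve(np.eye(nS) - beta * T_mu, r_mu)
        # Line 8: termination test.
        if np.max(np.abs(V_new - V)) <= eps:
            return mu, V_new
        V = V_new

# Example run on a randomly generated hypothetical MDP.
rng = np.random.default_rng(2)
nS, nA = 5, 3
T = rng.random((nA, nS, nS)); T /= T.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
mu, V = policy_iteration(T, R, beta=0.9)
print(mu, V)
```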

So far, the discussion of stochastic dynamic programming in this section has focused only on the discounted reward criterion. The Bellman Optimality Equation, the Bellman Equation, and the value and policy iteration techniques can also be derived for the expected long-term average reward criterion (Definition 2.2.9). In the general setting of an arbitrary reward function and an infinite state space, however, the existence of an optimal solution for the average case is not guaranteed [81, 108]. For the set of problems of interest in this thesis, the global Markov chain is a discrete-time system that evolves over a finite state space, in which case the average reward does have an optimum. Additionally, as will be seen in Section 6.5, the optimal solution for the average case is not required for the algorithm proposed herein; only the evaluation of the average reward value function under a given FSC is required to guarantee LTL satisfaction. Therefore, the Bellman Equation for the average reward case is sufficient for this work. In the succeeding section, the relevant dynamic programming equations for both discounted and average rewards are summarized for the specific case of POMDPs controlled by FSCs.