where $H$ is called the deviation matrix and is given by
$$
H = \underbrace{\big(I - T^{ssd}_{\mathcal{PM}^{\phi},\mathcal{G}} + \Pi^{ssd}\big)^{-1}}_{\text{fundamental matrix, } Z}\,\big(I - \Pi^{ssd}\big). \qquad (6.9)
$$
(d) $\vec{h}$ is not unique. If $(\vec{g},\vec{h})$ is a solution, then for all $k \ge 0$, $(\vec{g},\,\vec{h} + k\,\Pi^{ssd}\vec{h})$ is also a solution.
The Poisson equation is important because the quantity $\vec{g}$ can be used to compute the probability that the ssd-global Markov chain visits the set $Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}$, as shown in the following theorem. This is crucial because it can then be used to enforce the constraint $\eta^{ssd}_{av} = 0$ in the optimization criterion of Equation (5.12).
Theorem 6.1.3 The probability of the ssd-global Markov chain visiting $(Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss})$ for an initial distribution $\iota'_{init} \in \mathcal{M}_{S \times G}$ is given by
$$
\Pr\big[\pi \rightarrow (Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}) \mid \iota'_{init}\big] = \vec{\iota}^{\,\prime T}_{init}\,\vec{g}. \qquad (6.10)
$$
Proof: Note that under $T^{ssd}_{\mathcal{PM}^{\phi},\mathcal{G}}$, each state in $(Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss})$ is a sink by construction and therefore recurrent. Applying Lemma 5.1.1 gives
$$
\begin{aligned}
\Pr\big[\pi \rightarrow (Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}) \mid \iota'_{init}\big]
&= \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\left[\sum_{t=0}^{T} r_{av}([s_t, g_t]) \;\middle|\; \iota'_{init}\right] \\
&= \vec{\iota}^{\,\prime T}_{init}\,\Pi^{ssd}\,\vec{1}_{S\times G}\big(Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}\big) \\
&= \vec{\iota}^{\,\prime T}_{init}\,\Pi^{ssd}\,\vec{r}_{av} \\
&= \vec{\iota}^{\,\prime T}_{init}\,\vec{g},
\end{aligned} \qquad (6.11)
$$
where the second line follows from the first due to Equation (4.44), the third line follows from the fact that $\vec{r}_{av}$ can be rewritten as the indicator vector $\vec{r}_{av} = \vec{1}_{S\times G}\big(Avoid_{\mathcal{PM}^{\phi}} \times \mathcal{G}^{ss}\big)$, and the final line follows from $\vec{g} = \Pi^{ssd}\vec{r}_{av}$.
Theorem 6.1.3 will be used later in this chapter to enforce the constraint $\eta^{ssd}_{av}(r) = 0$ during the optimization procedure for the conservative optimization criterion of Equation (5.12).
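To make Theorem 6.1.3 concrete, the following minimal numerical sketch builds a small, hypothetical ssd-global Markov chain whose single avoid state is a sink, approximates the limiting matrix $\Pi^{ssd}$ by Cesàro averaging, forms $\vec{g} = \Pi^{ssd}\vec{r}_{av}$, and compares $\vec{\iota}^{\,\prime T}_{init}\vec{g}$ with the classical absorbing-chain formula for the probability of reaching the avoid state. The chain, its size, and the names `T_ssd`, `r_av`, and `iota_init` are illustrative assumptions, not objects defined in this thesis.

```python
import numpy as np

# Toy 4-state chain standing in for an ssd-global Markov chain; state 3 plays
# the role of the (sink) avoid state, state 2 is some other absorbing state.
T_ssd = np.array([
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
avoid = [3]
r_av = np.zeros(4)
r_av[avoid] = 1.0                       # indicator reward of the avoid set

# Limiting (Cesaro) matrix Pi^{ssd}, approximated by averaging powers of T_ssd.
P, acc, N = np.eye(4), np.zeros((4, 4)), 5000
for _ in range(N):
    acc += P
    P = P @ T_ssd
Pi_ssd = acc / N

g = Pi_ssd @ r_av                       # gain vector of the Poisson equation

iota_init = np.array([1.0, 0.0, 0.0, 0.0])
print("iota' g                :", iota_init @ g)

# Cross-check via the standard absorbing-chain formula B = (I - Q)^{-1} R,
# restricted to the transient states {0, 1}.
Q, R = T_ssd[:2, :2], T_ssd[:2, avoid]
B = np.linalg.solve(np.eye(2) - Q, R)
print("absorbing-chain answer :", B[0, 0])
```

The two printed numbers agree up to the truncation error of the Cesàro average, illustrating Equation (6.10).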
6.2 Stochastic Dynamic Programming

Consider a discrete-time stochastic system whose evolution is governed by
$$
x_{t+1} = f(x_t, \alpha_t, w_t), \qquad (6.12)
$$
where $x_t$ is the state of the system at time step $t$, $\alpha_t$ is an exogenous input that can be applied by an agent, and $w_t$ is a disturbance from some probability space. The discussion in this section is restricted to the case of finite state and action spaces, and also to a finite probability space for the disturbance.
It is required that the disturbance $w_t$ have a conditional distribution of the form $p(w_t \mid x_t, \alpha_t)$. Also of concern is the reward obtained by the agent, given by a function $r(x_t, \alpha_t, w_t)$. Next, consider a policy $\mu = (\mu_0, \mu_1, \ldots)$, where at step $t$ an action $\alpha_t$ is chosen according to $\mu_t$. For stochastic systems, the policy $\mu_t$ may require the entire execution history to successfully pick the action, i.e.,
$$
\alpha_t = \mu_t(x_0, \alpha_0, \ldots, \alpha_{t-1}, x_t). \qquad (6.13)
$$
Restricting the discussion to the discounted long term reward and a known initial state of the system, the objective of the agent is to maximize the following expected long term discounted reward
$$
\eta_\beta(x_0) = \sup_{\mu}\, \mathbb{E}\left[\sum_{t=0}^{\infty} \beta^t\, r(x_t, \mu_t, w_t)\right], \qquad 0 \le \beta < 1, \qquad (6.14)
$$
under the constraint that $x_t$ evolves according to Equation (6.12). For this particular choice of objective it is well known that a stationary Markov policy is sufficient. Formally, this means that $\alpha_t = \mu_t(x_t) = \mu(x_t)$.
The dynamic programming algorithm for this problem is given by the iteration
$$
\begin{aligned}
V^\beta_0(x) &= 0, \\
V^\beta_{k+1}(x) &= \sup_{\alpha}\, \mathbb{E}_w\Big[r(x, \alpha, w) + \beta\, V^\beta_k\big(f(x, \alpha, w)\big)\Big].
\end{aligned} \qquad (6.15)
$$
For a known initial starting state $x_0$, the optimal value $\eta^*_\beta(x_0)$ is given by the limit
$$
\eta^*_\beta(x_0) = \lim_{k\to\infty} V^\beta_k(x_0). \qquad (6.16)
$$
Since it is known that a stationary optimal policy exists in the case of infinite horizon discounted reward over a finite system model, it follows that this policy satisfies the Bellman Optimality Equation given by
$$
V^{\beta*}(x) = \sup_{\alpha}\, \mathbb{E}_w\Big[r(x, \alpha, w) + \beta\, V^{\beta*}\big(f(x, \alpha, w)\big)\Big] \qquad \forall x. \qquad (6.17)
$$
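As a concrete illustration of the recursion in Equations (6.15) and (6.16), the sketch below iterates the backup over small finite state, action, and disturbance spaces until the value function stops changing. The dynamics `f`, disturbance distribution `p_w`, reward `reward`, and all sizes are randomly generated placeholders, not quantities from this thesis.

```python
import numpy as np

n_states, n_actions, n_dist = 5, 2, 3
beta = 0.9
rng = np.random.default_rng(0)

# Placeholder model: next-state table f(x, a, w), reward r(x, a, w),
# and conditional disturbance distribution p(w | x, a).
f = rng.integers(0, n_states, size=(n_states, n_actions, n_dist))
reward = rng.random((n_states, n_actions, n_dist))
p_w = rng.random((n_states, n_actions, n_dist))
p_w /= p_w.sum(axis=2, keepdims=True)

# V_{k+1}(x) = max_a E_w[ r(x, a, w) + beta * V_k(f(x, a, w)) ]   (Eq. 6.15)
V = np.zeros(n_states)
for _ in range(500):
    Q = np.einsum('xaw,xaw->xa', p_w, reward + beta * V[f])
    V_new = Q.max(axis=1)
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-10:   # V has effectively reached the limit of Eq. (6.16)
        break

print("approximate optimal values:", V)
```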
6.2.1 Dynamic Programming Variants
Equations (6.15) and (6.16) are leveraged to perform dynamic programming in several popular ways in the literature. Restricting the scope of the discussion to a Markov decision process in which the
system evolution is given by a conditional probability distribution $x_{k+1} \sim T(x_{k+1} \mid x_k, \alpha)$ and the reward is given by $r(x, \alpha)$, the expectation in Equation (6.15) can be explicitly computed as
$$
\begin{aligned}
V^\beta_0(x) &= 0, \\
V^\beta_{k+1}(x) &= \sup_{\alpha} \sum_{x'} T(x' \mid x, \alpha)\Big[r(x, \alpha) + \beta\, V^\beta_k(x')\Big].
\end{aligned} \qquad (6.18)
$$
This iterative method constitutes what is known as the value iteration methodology for dynamic programming [12, 13]. At each iteration step $k$, $V^\beta_k(x)$ is called the value function of the state $x$. It denotes the expected long term discounted reward the agent would collect if the initial state is $x_{t=0} = x$. When the iteration has converged, the policy is computed using
$$
\mu(x) = \arg\max_{\alpha} \sum_{x'} T(x' \mid x, \alpha)\Big[r(x, \alpha) + \beta\, V^\beta_k(x')\Big]. \qquad (6.19)
$$
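A minimal value iteration sketch for the MDP form of Equations (6.18) and (6.19) follows; the transition kernel `T`, reward `r`, and all sizes are randomly generated placeholders.

```python
import numpy as np

n_states, n_actions, beta = 4, 3, 0.95
rng = np.random.default_rng(1)
T = rng.random((n_states, n_actions, n_states))   # T[x, a, x'] = T(x' | x, a)
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))             # r[x, a]

V = np.zeros(n_states)
for _ in range(10_000):
    # Since sum_x' T(x'|x,a) = 1, the backup of Equation (6.18) reduces to
    # Q[x, a] = r(x, a) + beta * sum_x' T(x'|x,a) V_k(x').
    Q = r + beta * T @ V
    V_new = Q.max(axis=1)
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-12:
        break

# Greedy policy of Equation (6.19), extracted after convergence.
Q = r + beta * T @ V
policy = Q.argmax(axis=1)
print("value function:", V)
print("greedy policy :", policy)
```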
It is also well known that for any given policy $\mu$, the value function $V^\beta_\mu(x)$ satisfies the following system of equations
$$
V^\beta_\mu(x) = \sum_{x'} T\big(x' \mid x, \mu(x)\big)\Big[r\big(x, \mu(x)\big) + \beta\, V^\beta_\mu(x')\Big], \qquad (6.20)
$$
called the Bellman Equation. While it is possible to solve the above equation using exact methods, in most cases the Bellman equation is solved by iteration: the r.h.s. of Equation (6.20) is repeatedly applied to successive values of $V^\beta_\mu$ until convergence. The Bellman equation is utilized in another variant of dynamic programming, namely policy iteration, or Howard's method [62], outlined in Algorithm 6.1.
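Before turning to Algorithm 6.1, the sketch below illustrates both routes for a fixed stationary policy: solving Equation (6.20) exactly as a linear system, and repeatedly applying its right-hand side until convergence. The model and the names `T`, `r`, and `mu` are placeholders of the same kind as in the previous sketch.

```python
import numpy as np

n_states, n_actions, beta = 4, 3, 0.95
rng = np.random.default_rng(2)
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))
mu = rng.integers(0, n_actions, size=n_states)    # an arbitrary stationary policy

# Restrict the model to the actions chosen by mu.
idx = np.arange(n_states)
T_mu = T[idx, mu, :]                  # T_mu[x, x'] = T(x' | x, mu(x))
r_mu = r[idx, mu]                     # r_mu[x]     = r(x, mu(x))

# Exact method: Equation (6.20) is linear, (I - beta * T_mu) V = r_mu.
V_exact = np.linalg.solve(np.eye(n_states) - beta * T_mu, r_mu)

# Iterative method: repeatedly apply the r.h.s. of Equation (6.20).
V_iter = np.zeros(n_states)
for _ in range(5_000):
    V_iter = r_mu + beta * T_mu @ V_iter

print(np.allclose(V_exact, V_iter))   # both evaluations agree
```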
Algorithm 6.1 Policy Iteration for Markov Decision Process
1: $iter \gets 0$
2: Choose an initial policy $\mu_{iter}$.
3: $V^\beta_{iter}(x) \gets 0 \quad \forall x$
4: repeat
5: $iter \gets iter + 1$
6: Policy Improvement: Improve the policy
$$
\mu_{iter}(x) = \arg\max_{\alpha} \sum_{x'} T(x' \mid x, \alpha)\Big[r(x, \alpha) + \beta\, V^\beta_{iter-1}(x')\Big] \qquad (6.21)
$$
7: Policy Evaluation: Solve the Bellman Equation (6.20) to get $V^\beta_{iter}(x) \;\; \forall x$.
8: until $\big|V^\beta_{iter}(x) - V^\beta_{iter-1}(x)\big| \le \varepsilon_\beta \quad \forall x$
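A compact sketch of Algorithm 6.1 on the same kind of randomly generated placeholder model is given below; here policy evaluation uses the exact linear solve of Equation (6.20) rather than iteration.

```python
import numpy as np

n_states, n_actions, beta, eps = 4, 3, 0.95, 1e-9
rng = np.random.default_rng(3)
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))
idx = np.arange(n_states)

def evaluate(mu):
    """Policy evaluation: solve the linear Bellman Equation (6.20) for V_mu."""
    T_mu, r_mu = T[idx, mu, :], r[idx, mu]
    return np.linalg.solve(np.eye(n_states) - beta * T_mu, r_mu)

mu = np.zeros(n_states, dtype=int)    # initial policy (step 2)
V_prev = np.zeros(n_states)           # V_0 = 0        (step 3)
while True:
    # Policy improvement, Equation (6.21), against the previous value function.
    Q = r + beta * T @ V_prev
    mu = Q.argmax(axis=1)
    # Policy evaluation (step 7).
    V = evaluate(mu)
    if np.max(np.abs(V - V_prev)) <= eps:   # stopping test (step 8)
        break
    V_prev = V

print("policy         :", mu)
print("value function :", V)
```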
So far, the discussion of stochastic dynamic programming in this section has only focused on the discounted reward criterion. However, the Bellman Optimality Equation, the Bellman Equation, and the value and policy iteration techniques can be derived for the expected long term average reward criterion (Definition 2.2.9) as well. In the general setting of an arbitrary reward function and an infinite state space, the existence of an optimal solution for the average case is not guaranteed [81, 108]. However, for the set of problems of interest in this thesis, the global Markov chain is a discrete time system that evolves over a finite state space, in which case the average reward does have an optimum. Additionally, as will be seen in Section 6.5, the optimal solution for the average case is not required for the algorithm proposed herein; only the evaluation of the average reward value function under a given FSC is required to guarantee LTL satisfaction. Therefore, the Bellman Equation for the average reward case is sufficient for this work. In the succeeding section, the relevant dynamic programming equations for both discounted and average rewards are summarized for the specific case of POMDPs controlled by FSCs.