5.1.4 Posing the Problem as an Optimization Problem
Using the definitions of rewards from Equations (5.1), (5.4), and (5.11), for an FSC of fixed size and partitioning $G = \{G_{tr}, G_{ss}\}$, the following conservative optimization criterion is derived.
Conservative Optimization Criterion
\begin{equation}
\begin{aligned}
\max_{\omega,\,\kappa}\quad & \eta_\beta(r) \\
\text{subject to}\quad & \eta^{av}_{ssd}(r) = 0, \\
& \omega(g',\alpha \,|\, g, o) = 0 \quad \forall\, g' \in G_{tr},\ g \in G_{ss}, \\
& \sum_{(g',\alpha) \in G \times Act} \omega(g',\alpha \,|\, g, o) = 1 \quad \forall\, g \in G,\ o \in O, \\
& 0 \le \omega(g',\alpha \,|\, g, o) \le 1 \quad \forall\, g, g' \in G,\ o \in O,\ \alpha \in Act, \\
& \sum_{g \in G} \kappa(g) = 1.
\end{aligned}
\tag{5.12}
\end{equation}
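To make these constraints concrete, the following is a minimal Python sketch of an admissibility check for a candidate pair $(\omega, \kappa)$; the function name, the dense-array encoding of $\omega$, and the numerical tolerance are illustrative assumptions, not part of the formulation.

\begin{verbatim}
import numpy as np

def is_admissible(omega, kappa, G_tr, G_ss, tol=1e-9):
    """Check the non-optimization constraints of Eq. (5.12).

    omega : array of shape (|G|, |O|, |G|, |Act|); omega[g, o, g2, a]
            is the probability of moving to I-state g2 while issuing
            action a, given current I-state g and observation o.
    kappa : array of shape (|G|,), the initial I-state distribution.
    G_tr, G_ss : disjoint index sets partitioning the I-states.
    """
    # Probabilities must lie in [0, 1].
    if np.any(omega < -tol) or np.any(omega > 1 + tol):
        return False
    # Each (g, o) slice must be a distribution over (g', alpha).
    if not np.allclose(omega.sum(axis=(2, 3)), 1.0, atol=tol):
        return False
    # No transitions from steady-state I-states back to transient ones.
    for g in G_ss:
        for g2 in G_tr:
            if np.any(omega[g, :, g2, :] > tol):
                return False
    # kappa must be a distribution over I-states.
    return bool(np.isclose(kappa.sum(), 1.0, atol=tol))
\end{verbatim}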
In Equation (5.12), frequent visits to $Repeat_r^{\mathcal{PM}_\varphi}$ are incentivized by maximizing
\begin{equation}
\eta_\beta(r) = \lim_{T \to \infty} \mathbb{E}\left[ \sum_{t=0}^{T} \beta^t\, r^\beta_r(s_t)\, r^G_r(g_t) \;\middle|\; \iota^\varphi_{init} \right], \qquad 0 < \beta < 1,
\tag{5.13}
\end{equation}
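For intuition, $\eta_\beta(r)$ can be approximated numerically once the global Markov chain is flattened to an explicit transition matrix over joint states $[s, g]$. The sketch below truncates the discounted sum at a finite horizon; the variable names and the dense-matrix representation are assumptions for illustration.

\begin{verbatim}
import numpy as np

def eta_beta(P, r_prod, iota_init, beta=0.95, T=1000):
    """Truncated approximation of Eq. (5.13).

    P         : (n, n) row-stochastic transition matrix of the
                global Markov chain over joint states x = [s, g].
    r_prod    : (n,) precomputed product reward r_beta(s) * r_G(g).
    iota_init : (n,) initial distribution over joint states.
    """
    dist, total = iota_init.copy(), 0.0
    for t in range(T):
        total += (beta ** t) * (dist @ r_prod)  # E[beta^t r(x_t)]
        dist = dist @ P                         # one-step propagation
    return total
\end{verbatim}

Since $0 < \beta < 1$, the truncation error of this approximation decays geometrically in $T$.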
[Figure 5.4 depicts a chain over product states $s_1, \dots, s_5$, with $\iota^\varphi_{init}(s_1) = 1$, $s_5 \in Repeat_1^{\mathcal{PM}_\varphi}$ (reward $r^\beta_1 = 1$), $s_3 \in Avoid_1^{\mathcal{PM}_\varphi}$ (reward $r^{av}_1 = 1$), and two I-states $g_1 \in G_{tr}$, $g_2 \in G_{ss}$, with $\omega(g_1,\alpha_1 \,|\, g_1, o) = 1-p$, $\omega(g_2,\alpha_1 \,|\, g_1, o) = p$, $\omega(g_2,\alpha_1 \,|\, g_2, o) = 1$, $\kappa(g_1) = 1$, $r^G_1(g_1) = 0$, $\kappa(g_2) = 0$, $r^G_1(g_2) = 1$.]
Figure 5.4: Steady-state detecting global Markov chain. Consider the example from Figure 5.3. This figure shows how the ssd-global Markov chain might look if only one action $\alpha_1$ is allowed in the FSC. The FSC is assumed to have two I-states, $g_1 \in G_{tr}$ and $g_2 \in G_{ss}$, indicated by the large dotted squares. Since the global Markov chain's state space is the Cartesian product of the Product-POMDP and FSC states, each I-state is shown containing all the Product-POMDP states. The FSC transition probabilities are shown as dotted lines, while the black arrows between the Product-POMDP states show the actions allowed in the corresponding FSC state. The FSC starts in $g_1$. Clearly, under this policy, the probability of reaching $s_5$ is positive when $0 < p < 1$, since $s_3 \in Avoid_1^{\mathcal{PM}_\varphi}$ is allowed to be visited while the controller state is $g_1 \in G_{tr}$. However, even after the FSC has transitioned to the I-state $g_2 \in G_{ss}$ and the product state reaches $s_5 \in Repeat_1^{\mathcal{PM}_\varphi}$, there is still a finite probability of visiting $s_3$, thereby violating the constraint in Equation (5.8).
[Figure 5.5 depicts the same chain over $s_1, \dots, s_5$ and I-states $g_1 \in G_{tr}$, $g_2 \in G_{ss}$, now with $\omega(g_1,\alpha_1 \,|\, g_1, o) = 1-p$, $\omega(g_2,\alpha_1 \,|\, g_1, o) = p$, $\omega(g_2,\alpha_2 \,|\, g_2, o) = 1$, $\kappa(g_1) = 1$, $r^G_1(g_1) = 0$, $\kappa(g_2) = 0$, $r^G_1(g_2) = 1$.]
Figure 5.5: Steady-state detecting global Markov chain. Consider the same example as in the previous figure. The FSC again has two I-states, $g_1 \in G_{tr}$ and $g_2 \in G_{ss}$. However, under this policy, only $\alpha_1$ is allowed in $g_1$ and only $\alpha_2$ is allowed in $g_2$. This FSC again allows visits to $s_3 \in Avoid_1^{\mathcal{PM}_\varphi}$ when in $g_1$. There is a finite probability of reaching $[s_5, g_2]$, with $s_5 \in Repeat_1^{\mathcal{PM}_\varphi}$, if $0 < p < 1$. Also, under this policy, if the FSC I-state is $g_2$ and the product state reaches $s_5 \in Repeat_1^{\mathcal{PM}_\varphi}$, then the probability of visiting $s_3$ at any time in the future is zero, thus satisfying the constraint in Equation (5.8).
and steady-state visits to $Avoid_r^{\mathcal{PM}_\varphi}$ are forbidden by the first constraint
\begin{equation}
\eta^{av}_{ssd}(r) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{ssd}\left[ \sum_{t=0}^{T} r^{av}_r(s_t)\, r^G_r(g_t) \;\middle|\; \iota^{ss}_{init} \right] = 0
\tag{5.14}
\end{equation}
in Equation (5.12), where the new quantity, $\iota^{ss}_{init}$, introduced in Equation (5.14), will be explained shortly. The other constraints relate to the I-state partitioning that has already been introduced and the enforcement of the laws of probability for admissible FSCs. Note that the expectation needed to compute $\eta^{av}_{ssd}$ uses the ssd-global Markov chain transition distribution from Equation (5.10), whereas that of $\eta_\beta$ uses the unmodified Markov chain transition distribution. The product terms $r^\beta_r\, r^G_r$ and $r^{av}_r\, r^G_r$ ensure that visits to $Repeat_r^{\mathcal{PM}_\varphi}$ are rewarded only when the controller I-state lies in $G_{ss}$, implying that the controller can then guarantee no more visits to $Avoid_r^{\mathcal{PM}_\varphi}$. This can be seen in Figures 5.4 and 5.5 as well.
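As a hedged numerical sketch of checking the constraint in Equation (5.14), the long-run average below is approximated by Ces\`aro averaging of the expected per-step reward on the ssd-global chain; the matrix representation of the modified transition distribution from Equation (5.10) and the precomputed product reward are assumptions for illustration.

\begin{verbatim}
import numpy as np

def eta_av_ssd(P_ssd, r_av_prod, iota_ss, T=10000):
    """Cesaro-average approximation of Eq. (5.14).

    P_ssd     : (n, n) transition matrix of the ssd-global Markov
                chain (the modified distribution of Eq. (5.10)).
    r_av_prod : (n,) product reward r_av(s) * r_G(g) on joint states.
    iota_ss   : (n,) the initial distribution of Eq. (5.15).
    """
    dist, running = iota_ss.copy(), 0.0
    for t in range(T):
        running += dist @ r_av_prod
        dist = dist @ P_ssd
    # A feasible policy must drive this average to zero.
    return running / T
\end{verbatim}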
Next, define $\iota^{ss}_{init}$ as a distribution over the ssd-global Markov chain states as follows:
\begin{equation}
\iota^{ss}_{init}([s,g]) =
\begin{cases}
\dfrac{1}{|G_{ss}|\,\bigl|Repeat_r^{\mathcal{PM}_\varphi}\bigr|} & \text{if } s \in Repeat_r^{\mathcal{PM}_\varphi},\ g \in G_{ss}, \\[1ex]
0 & \text{otherwise.}
\end{cases}
\tag{5.15}
\end{equation}
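Constructing this initial distribution is mechanical; the following is a minimal sketch, assuming a hypothetical dictionary that maps joint states $[s, g]$ to vector indices.

\begin{verbatim}
import numpy as np

def make_iota_ss_init(index, repeat_states, G_ss):
    """Build the uniform distribution of Eq. (5.15) over Repeat x G_ss.

    index         : dict mapping joint states (s, g) to vector indices.
    repeat_states : product states in the repeat set.
    G_ss          : steady-state I-states of the FSC.
    """
    repeat_states, G_ss = list(repeat_states), list(G_ss)
    iota = np.zeros(len(index))
    mass = 1.0 / (len(G_ss) * len(repeat_states))
    for s in repeat_states:
        for g in G_ss:
            iota[index[(s, g)]] = mass
    return iota
\end{verbatim}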
The effect of choosing this initial condition is that, in steady state,
\begin{equation}
\forall\, [s,g] \in \bigl(Repeat_r^{\mathcal{PM}_\varphi} \times G_{ss}\bigr), \quad [s,g] \not\rightarrow \bigl(Avoid_r^{\mathcal{PM}_\varphi} \times G\bigr).
\tag{5.16}
\end{equation}
Compare this to the statement in Equation (5.8), which can be rewritten as
\begin{equation}
\Pr\Bigl[\pi \rightarrow [s,g] \in \bigl(Repeat_r^{\mathcal{PM}_\varphi} \times G\bigr)\Bigr] > 0 \;\Longrightarrow\; [s,g] \not\rightarrow \bigl(Avoid_r^{\mathcal{PM}_\varphi} \times G\bigr).
\tag{5.17}
\end{equation}
This latter statement requires the condition to hold only for those states in $(Repeat_r^{\mathcal{PM}_\varphi} \times G)$ that are actually reachable under the current $\omega$ and $\kappa$. If some repeat state is not visited by the controller during steady state, then our proposed choice of $\iota^{ss}_{init}$ adds additional feasibility constraints, which may severely reduce the obtainable reward, $\eta_\beta(r)$, and, in the worst case, render the problem infeasible. This phenomenon can be understood from the example in Figure 5.6. This is the reason that the optimization problem stated in Equation (5.12) is called a Conservative Optimization Criterion. While suboptimal, this criterion has some significant advantages. As will be outlined in the next chapter, the Conservative Optimization Criterion can be framed as a policy iteration algorithm, in which each policy improvement step can be solved efficiently. Moreover, the policy improvement step also yields a way to add I-states to the controller. This is useful for escaping the local maxima encountered during optimization of the total reward $\eta_\beta(r)$. The intuition behind the addition of FSC I-states is that they allow the generation and differentiation of many new observation and action sequences. This implies that many new paths in the global Markov chain can be explored to improve the optimization objective.
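Condition (5.16) amounts to a graph reachability check on the global Markov chain: no joint state in $Repeat_r^{\mathcal{PM}_\varphi} \times G_{ss}$ may reach $Avoid_r^{\mathcal{PM}_\varphi} \times G$ along edges of positive probability. Below is a minimal sketch of such a check, assuming a hypothetical adjacency-list representation of the chain.

\begin{verbatim}
from collections import deque

def violates_eq_5_16(adj, repeat_ss, avoid_any):
    """Forward BFS check of condition (5.16).

    adj       : dict mapping each joint-state index to the indices
                reachable in one step with positive probability.
    repeat_ss : set of indices in Repeat x G_ss (the sources).
    avoid_any : set of indices in Avoid x G (forbidden targets).
    """
    seen, frontier = set(repeat_ss), deque(repeat_ss)
    while frontier:
        x = frontier.popleft()
        if x in avoid_any:
            return True  # some source can reach the avoid set
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return False  # condition (5.16) holds
\end{verbatim}

The exact condition (5.17) would instead restrict the sources to joint states reachable under the current $(\omega, \kappa)$; forgoing that restriction is precisely what makes the criterion conservative.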
[Figure 5.6 depicts a product-POMDP with states $s_1, \dots, s_4$, deterministic transitions under actions $\alpha_1$ and $\alpha_2$ (solid black arrows), and two I-states $g_1 \in G_{tr}$, $g_2 \in G_{ss}$.]
Figure 5.6: Reduced feasibility arising from the Conservative Optimization Criterion. Consider an example of a product-POMDP in which each action results in a deterministic state transition, as shown by the solid black arrows. $Repeat_1^{\mathcal{PM}_\varphi} = \{s_2, s_3\}$ and $Avoid_1^{\mathcal{PM}_\varphi} = \{s_4\}$. There are two I-states in the FSC, with $g_1 \in G_{tr}$ and $g_2 \in G_{ss}$. Note that $\omega(g_2, \alpha_1 \,|\, g_1, o) = 1$ disconnects $s_4$ and $s_3$ from the repeat state $s_2$, which is additionally guaranteed to be visited infinitely often, thus yielding a feasible controller. However, the condition over $\eta^{av}_{ssd}$ requires that $s_4$ be disconnected from the repeat state $s_3$ as well, even though $s_3$ need not be visited at all. This prevents issuing action $\alpha_1$ completely, and no controller can be found.
The entirety of the next chapter is devoted to describing this methodology.