5.1.4 Posing the Problem as an Optimization Problem
Using the definitions of rewards from Equations (5.1), (5.4), and (5.11), for an FSC of fixed size and partitioning $G = \{G_{tr}, G_{ss}\}$, the following conservative optimization criterion is derived.
Conservative Optimization Criterion
\begin{equation}
\begin{aligned}
\max_{\omega,\,\kappa}\quad & \eta_\beta(r) \\
\text{subject to}\quad & \eta^{av}_{ssd}(r) = 0, \\
& \omega(g',\alpha \,|\, g, o) = 0 \quad \forall\, g' \in G_{tr},\ g \in G_{ss}, \\
& \sum_{(g',\alpha) \in G \times Act} \omega(g',\alpha \,|\, g, o) = 1 \quad \forall\, g \in G,\ o \in O, \\
& 0 \le \omega(g',\alpha \,|\, g, o) \le 1 \quad \forall\, g, g' \in G,\ o \in O,\ \alpha \in Act, \\
& \sum_{g \in G} \kappa(g) = 1.
\end{aligned}
\tag{5.12}
\end{equation}
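To make these constraints concrete, the following is a minimal Python sketch of an admissibility check for a candidate pair $(\omega, \kappa)$; the function name, the dense-array encoding of $\omega$, and the numerical tolerance are illustrative assumptions, not part of the formulation.

\begin{verbatim}
import numpy as np

def is_admissible(omega, kappa, G_tr, G_ss, tol=1e-9):
    """Check the non-optimization constraints of Eq. (5.12).

    omega : array of shape (|G|, |O|, |G|, |Act|); omega[g, o, g2, a]
            is the probability of moving to I-state g2 while issuing
            action a, given current I-state g and observation o.
    kappa : array of shape (|G|,), the initial I-state distribution.
    G_tr, G_ss : disjoint index sets partitioning the I-states.
    """
    # Probabilities must lie in [0, 1].
    if np.any(omega < -tol) or np.any(omega > 1 + tol):
        return False
    # Each (g, o) slice must be a distribution over (g', alpha).
    if not np.allclose(omega.sum(axis=(2, 3)), 1.0, atol=tol):
        return False
    # No transitions from steady-state I-states back to transient ones.
    for g in G_ss:
        for g2 in G_tr:
            if np.any(omega[g, :, g2, :] > tol):
                return False
    # kappa must be a distribution over I-states.
    return bool(np.isclose(kappa.sum(), 1.0, atol=tol))
\end{verbatim}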
In Equation (5.12), frequent visits to $Repeat_r^{\mathcal{PM}_\varphi}$ are incentivized by maximizing
\begin{equation}
\eta_\beta(r) = \lim_{T \to \infty} \mathbb{E}\left[ \sum_{t=0}^{T} \beta^t\, r^\beta_r(s_t)\, r^G_r(g_t) \;\middle|\; \iota^\varphi_{init} \right], \qquad 0 < \beta < 1,
\tag{5.13}
\end{equation}
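For intuition, $\eta_\beta(r)$ can be approximated numerically once the global Markov chain is flattened to an explicit transition matrix over joint states $[s, g]$. The sketch below truncates the discounted sum at a finite horizon; the variable names and the dense-matrix representation are assumptions for illustration.

\begin{verbatim}
import numpy as np

def eta_beta(P, r_prod, iota_init, beta=0.95, T=1000):
    """Truncated approximation of Eq. (5.13).

    P         : (n, n) row-stochastic transition matrix of the
                global Markov chain over joint states x = [s, g].
    r_prod    : (n,) precomputed product reward r_beta(s) * r_G(g).
    iota_init : (n,) initial distribution over joint states.
    """
    dist, total = iota_init.copy(), 0.0
    for t in range(T):
        total += (beta ** t) * (dist @ r_prod)  # E[beta^t r(x_t)]
        dist = dist @ P                         # one-step propagation
    return total
\end{verbatim}

Since $0 < \beta < 1$, the truncation error of this approximation decays geometrically in $T$.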
[Figure 5.4 depicts a chain over product states $s_1, \dots, s_5$, with $\iota^\varphi_{init}(s_1) = 1$, $s_5 \in Repeat_1^{\mathcal{PM}_\varphi}$ (reward $r^\beta_1 = 1$), $s_3 \in Avoid_1^{\mathcal{PM}_\varphi}$ (reward $r^{av}_1 = 1$), and two I-states $g_1 \in G_{tr}$, $g_2 \in G_{ss}$, with $\omega(g_1,\alpha_1 \,|\, g_1, o) = 1-p$, $\omega(g_2,\alpha_1 \,|\, g_1, o) = p$, $\omega(g_2,\alpha_1 \,|\, g_2, o) = 1$, $\kappa(g_1) = 1$, $r^G_1(g_1) = 0$, $\kappa(g_2) = 0$, $r^G_1(g_2) = 1$.]
Figure 5.4: Steady-state detecting global Markov chain. Consider the example from Figure 5.3. This figure shows how the ssd-global Markov chain might look if only one action $\alpha_1$ is allowed in the FSC. The FSC is assumed to have two I-states, $g_1 \in G_{tr}$ and $g_2 \in G_{ss}$, indicated by the large dotted squares. Since the global Markov chain's state space is the Cartesian product of the Product-POMDP and FSC states, each I-state is shown containing all the Product-POMDP states. The FSC transition probabilities are shown as dotted lines, while the black arrows between the Product-POMDP states show the actions allowed in the corresponding FSC state. The FSC starts in $g_1$. Clearly, under this policy, the probability of reaching $s_5$ is positive when $0 < p < 1$, since $s_3 \in Avoid_1^{\mathcal{PM}_\varphi}$ is allowed to be visited while the controller state is $g_1 \in G_{tr}$. However, even after the FSC has transitioned to the I-state $g_2 \in G_{ss}$ and the product state reaches $s_5 \in Repeat_1^{\mathcal{PM}_\varphi}$, there is still a finite probability of visiting $s_3$, thereby violating the constraint in Equation (5.8).
[Figure 5.5 depicts the same chain over $s_1, \dots, s_5$ and I-states $g_1 \in G_{tr}$, $g_2 \in G_{ss}$, now with $\omega(g_1,\alpha_1 \,|\, g_1, o) = 1-p$, $\omega(g_2,\alpha_1 \,|\, g_1, o) = p$, $\omega(g_2,\alpha_2 \,|\, g_2, o) = 1$, $\kappa(g_1) = 1$, $r^G_1(g_1) = 0$, $\kappa(g_2) = 0$, $r^G_1(g_2) = 1$.]
Figure 5.5: Steady-state detecting global Markov chain. Consider the same example as in the previous figure. The FSC again has two I-states, $g_1 \in G_{tr}$ and $g_2 \in G_{ss}$. However, under this policy, only $\alpha_1$ is allowed in $g_1$ and only $\alpha_2$ is allowed in $g_2$. This FSC again allows visits to $s_3 \in Avoid_1^{\mathcal{PM}_\varphi}$ when in $g_1$. There is a finite probability of reaching $[s_5, g_2]$, with $s_5 \in Repeat_1^{\mathcal{PM}_\varphi}$, if $0 < p < 1$. Also, under this policy, if the FSC I-state is $g_2$ and the product state reaches $s_5 \in Repeat_1^{\mathcal{PM}_\varphi}$, then the probability of visiting $s_3$ at any time in the future is zero, thus satisfying the constraint in Equation (5.8).
and steady-state visits to $Avoid_r^{\mathcal{PM}_\varphi}$ are forbidden by the first constraint
\begin{equation}
\eta^{av}_{ssd}(r) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{ssd}\left[ \sum_{t=0}^{T} r^{av}_r(s_t)\, r^G_r(g_t) \;\middle|\; \iota^{ss}_{init} \right] = 0
\tag{5.14}
\end{equation}
in Equation (5.12), where the new quantity, $\iota^{ss}_{init}$, introduced in Equation (5.14), will be explained shortly. The other constraints relate to the I-state partitioning that has already been introduced and the enforcement of the laws of probability for admissible FSCs. Note that the expectation needed to compute $\eta^{av}_{ssd}$ uses the ssd-global Markov chain transition distribution from Equation (5.10), whereas that of $\eta_\beta$ uses the unmodified Markov chain transition distribution. The product terms $r^\beta_r\, r^G_r$ and $r^{av}_r\, r^G_r$ ensure that visits to $Repeat_r^{\mathcal{PM}_\varphi}$ are rewarded only when the controller I-state lies in $G_{ss}$, implying that the controller can then guarantee no more visits to $Avoid_r^{\mathcal{PM}_\varphi}$. This can be seen in Figures 5.4 and 5.5 as well.
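As a hedged numerical sketch of checking the constraint in Equation (5.14), the long-run average below is approximated by Ces\`aro averaging of the expected per-step reward on the ssd-global chain; the matrix representation of the modified transition distribution from Equation (5.10) and the precomputed product reward are assumptions for illustration.

\begin{verbatim}
import numpy as np

def eta_av_ssd(P_ssd, r_av_prod, iota_ss, T=10000):
    """Cesaro-average approximation of Eq. (5.14).

    P_ssd     : (n, n) transition matrix of the ssd-global Markov
                chain (the modified distribution of Eq. (5.10)).
    r_av_prod : (n,) product reward r_av(s) * r_G(g) on joint states.
    iota_ss   : (n,) the initial distribution of Eq. (5.15).
    """
    dist, running = iota_ss.copy(), 0.0
    for t in range(T):
        running += dist @ r_av_prod
        dist = dist @ P_ssd
    # A feasible policy must drive this average to zero.
    return running / T
\end{verbatim}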
Next, define $\iota^{ss}_{init}$ as a distribution over the ssd-global Markov chain states as follows:
\begin{equation}
\iota^{ss}_{init}([s,g]) =
\begin{cases}
\dfrac{1}{|G_{ss}|\,\bigl|Repeat_r^{\mathcal{PM}_\varphi}\bigr|} & \text{if } s \in Repeat_r^{\mathcal{PM}_\varphi},\ g \in G_{ss}, \\[1ex]
0 & \text{otherwise.}
\end{cases}
\tag{5.15}
\end{equation}
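Constructing this initial distribution is mechanical; the following is a minimal sketch, assuming a hypothetical dictionary that maps joint states $[s, g]$ to vector indices.

\begin{verbatim}
import numpy as np

def make_iota_ss_init(index, repeat_states, G_ss):
    """Build the uniform distribution of Eq. (5.15) over Repeat x G_ss.

    index         : dict mapping joint states (s, g) to vector indices.
    repeat_states : product states in the repeat set.
    G_ss          : steady-state I-states of the FSC.
    """
    repeat_states, G_ss = list(repeat_states), list(G_ss)
    iota = np.zeros(len(index))
    mass = 1.0 / (len(G_ss) * len(repeat_states))
    for s in repeat_states:
        for g in G_ss:
            iota[index[(s, g)]] = mass
    return iota
\end{verbatim}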
The effect of choosing this initial condition is that, in steady state,
\begin{equation}
\forall\, [s,g] \in \bigl(Repeat_r^{\mathcal{PM}_\varphi} \times G_{ss}\bigr), \quad [s,g] \not\rightarrow \bigl(Avoid_r^{\mathcal{PM}_\varphi} \times G\bigr).
\tag{5.16}
\end{equation}
Compare this to the statement in Equation (5.8), which can be rewritten as
\begin{equation}
\Pr\Bigl[\pi \rightarrow [s,g] \in \bigl(Repeat_r^{\mathcal{PM}_\varphi} \times G\bigr)\Bigr] > 0 \;\Longrightarrow\; [s,g] \not\rightarrow \bigl(Avoid_r^{\mathcal{PM}_\varphi} \times G\bigr).
\tag{5.17}
\end{equation}
This latter statement requires the condition to hold only for those states in $(Repeat_r^{\mathcal{PM}_\varphi} \times G)$ that are actually reachable under the current $\omega$ and $\kappa$. If some repeat state is not visited by the controller during steady state, then our proposed choice of $\iota^{ss}_{init}$ adds additional feasibility constraints, which may severely reduce the obtainable reward, $\eta_\beta(r)$, and, in the worst case, render the problem infeasible. This phenomenon can be understood from the example in Figure 5.6. This is the reason that the optimization problem stated in Equation (5.12) is called a Conservative Optimization Criterion. While suboptimal, this criterion has some significant advantages. As will be outlined in the next chapter, the Conservative Optimization Criterion can be framed as a policy iteration algorithm, in which each policy improvement step can be solved efficiently. Moreover, the policy improvement step also yields a way to add I-states to the controller. This is useful for escaping the local maxima encountered during optimization of the total reward $\eta_\beta(r)$. The intuition behind the addition of FSC I-states is that they allow the generation and differentiation of many new observation and action sequences. This implies that many new paths in the global Markov chain can be explored to improve the optimization objective.
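Condition (5.16) amounts to a graph reachability check on the global Markov chain: no joint state in $Repeat_r^{\mathcal{PM}_\varphi} \times G_{ss}$ may reach $Avoid_r^{\mathcal{PM}_\varphi} \times G$ along edges of positive probability. Below is a minimal sketch of such a check, assuming a hypothetical adjacency-list representation of the chain.

\begin{verbatim}
from collections import deque

def violates_eq_5_16(adj, repeat_ss, avoid_any):
    """Forward BFS check of condition (5.16).

    adj       : dict mapping each joint-state index to the indices
                reachable in one step with positive probability.
    repeat_ss : set of indices in Repeat x G_ss (the sources).
    avoid_any : set of indices in Avoid x G (forbidden targets).
    """
    seen, frontier = set(repeat_ss), deque(repeat_ss)
    while frontier:
        x = frontier.popleft()
        if x in avoid_any:
            return True  # some source can reach the avoid set
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return False  # condition (5.16) holds
\end{verbatim}

The exact condition (5.17) would instead restrict the sources to joint states reachable under the current $(\omega, \kappa)$; forgoing that restriction is precisely what makes the criterion conservative.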
[Figure 5.6 depicts a product-POMDP with states $s_1, \dots, s_4$, deterministic transitions under actions $\alpha_1$ and $\alpha_2$ (solid black arrows), and two I-states $g_1 \in G_{tr}$, $g_2 \in G_{ss}$.]
Figure 5.6: Reduced feasibility arising from the Conservative Optimization Criterion. Consider an example of a product-POMDP in which each action results in a deterministic state transition, as shown by the solid black arrows. $Repeat_1^{\mathcal{PM}_\varphi} = \{s_2, s_3\}$ and $Avoid_1^{\mathcal{PM}_\varphi} = \{s_4\}$. There are two I-states in the FSC, with $g_1 \in G_{tr}$ and $g_2 \in G_{ss}$. Note that $\omega(g_2, \alpha_1 \,|\, g_1, o) = 1$ disconnects $s_4$ and $s_3$ from the repeat state $s_2$, which is additionally guaranteed to be visited infinitely often, thus yielding a feasible controller. However, the condition over $\eta^{av}_{ssd}$ requires that $s_4$ be disconnected from the repeat state $s_3$ as well, even though $s_3$ need not be visited at all. This prevents issuing action $\alpha_1$ completely, and no controller can be found.
The entirety of the next chapter is devoted to describing this methodology.