
Figure 6.2: Effect of the DP Backup Equation. The solid line shows a (piecewise linear) value function $V$. Applying the DP Backup Equation results in a pointwise improvement of the value function (dashed line). However, not all belief states may admit improvement, as they could already be optimal. $b$ is such a point and is called a tangent belief state.

at each belief state in the belief space, which is uncountably infinite. The stochastic bounded policy iteration algorithm, described in the remainder of this chapter, circumvents this by using a two-pronged approach: (a) set up an efficient optimization problem to find the best $\omega$ for an FSC of a given size $|G|$; and (b) add a small, bounded number of I-states to the FSC to escape local maxima as they are encountered.

The basic algorithm is presented first, before showing how it can be adapted for solving the Conservative Optimization Criterion given by Equation (5.12).

6.4.1 Bounded Stochastic Policy Iteration

Of concern is the problem of maximizing the expected long-term discounted reward criterion over a general POMDP. The state transition probabilities are given by $T(s'|s,\alpha)$, and the observation probabilities by $O(o|s)$. Most of this section follows from [118] and [54]. These authors showed that:

1. Allowing stochastic I-state transitions and action selection (i.e., FSC I-state transitions and actions sampled from distributions) enables improvement of the policy without having to add more I-states.

2. If the policy cannot be improved, then the algorithm has reached a local maximum. Specifically, there are some belief states at which no choice of $\omega$ for the current size of the FSC allows the value function to be improved. In such a case, a small number of I-states can be added that improve the policy at precisely those belief states, thus escaping the local maximum.

Both of these steps together constitute the Policy Improvement step of the policy iteration Algorithm 6.1.

Definition 6.4.1 (Tangent Belief State) A belief state $b$ is called a tangent belief state if $V^\beta(b)$ touches the DP backup of $V^\beta(b)$ from below. Since $V^\beta(b)$ must equal $V_g^\beta$ for some $g$, we also say that the I-state $g$ is tangent to the backed up value function $V^\beta$ at $b$. Tangency can be seen in Figure 6.2.

Equipped with this definition, the two steps involved in policy improvement can be carried out as follows.

Improving I-States by Solving a Linear Program

An I-state $g$ is said to be improved if the tunable parameters associated with that state can be adjusted so that $\vec{V}_g^\beta$ is increased. This step tries to improve each I-state in a round-robin fashion while keeping the other I-states fixed. The improvement is posed as a linear program (LP) as follows:

I-state Improvement LP: For the I-state $g$, the following LP is constructed over the unknowns $\epsilon$ and $\omega(g',\alpha|g,o)$, $\forall g',\alpha,o$:

$$
\begin{aligned}
\max_{\epsilon,\;\omega(g',\alpha|g,o)} \quad & \epsilon \\
\text{subject to} \quad & \text{Improvement constraints:} \\
& V^\beta([s,g]) + \epsilon \le r^\beta(s) + \beta \sum_{s',g',\alpha,o} O(o|s)\,\omega(g',\alpha|g,o)\,T(s'|s,\alpha)\,V^\beta([s',g']) \quad \forall s \\
& \text{Probability constraints:} \\
& \sum_{g',\alpha} \omega(g',\alpha|g,o) = 1 \quad \forall o \\
& \omega(g',\alpha|g,o) \ge 0 \quad \forall g',\alpha,o
\end{aligned}
\tag{6.35}
$$

The linear program searches for $\omega$ values that improve the I-state value vector $\vec{V}_g^\beta$ by maximizing the parameter $\epsilon$. If an improvement is found, i.e., $\epsilon > 0$, the parameters of the I-state are updated with the corresponding maximizing $\omega$. The value vector $\vec{V}_g^\beta$ may also be updated before proceeding to the next I-state in the round-robin order.
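To make the construction concrete, the following is a minimal sketch of how such an LP could be assembled and solved with SciPy's linprog. It is an illustrative sketch, not the implementation used here: the array names and layouts are assumptions, with T[s, a, s1] standing for $T(s'|s,\alpha)$, O[s, o] for $O(o|s)$, R[s] for $r^\beta(s)$, and V[s, g] for $V^\beta([s,g])$.

import numpy as np
from scipy.optimize import linprog

def improve_istate(g, T, O, R, V, beta):
    """Sketch of the I-state Improvement LP (6.35) for I-state g.

    Hypothetical array layout:
      T[s, a, s1] ~ T(s'|s,a),  O[s, o] ~ O(o|s),
      R[s] ~ r_beta(s),         V[s, g] ~ V_beta([s, g]).
    Decision variables: epsilon and omega(g', a | g, o) for all g', a, o.
    """
    S, A, _ = T.shape
    Obs = O.shape[1]
    G = V.shape[1]
    n_omega = G * A * Obs                     # one variable per (g', a, o)
    n_vars = 1 + n_omega                      # epsilon is variable 0

    def w(g1, a, o):                          # flat index of omega(g', a | g, o)
        return 1 + (g1 * A + a) * Obs + o

    # Objective: maximize epsilon  <=>  minimize -epsilon.
    c = np.zeros(n_vars)
    c[0] = -1.0

    # Improvement constraints (one row per state s):
    #   eps - beta * sum_{s',g',a,o} O(o|s) T(s'|s,a) V(s',g') omega <= R(s) - V(s,g)
    A_ub = np.zeros((S, n_vars))
    b_ub = np.zeros(S)
    for s in range(S):
        A_ub[s, 0] = 1.0
        for g1 in range(G):
            for a in range(A):
                expect = np.dot(T[s, a, :], V[:, g1])   # sum_{s'} T(s'|s,a) V(s',g')
                for o in range(Obs):
                    A_ub[s, w(g1, a, o)] = -beta * O[s, o] * expect
        b_ub[s] = R[s] - V[s, g]

    # Probability constraints: sum_{g',a} omega(g', a | g, o) = 1 for each o.
    A_eq = np.zeros((Obs, n_vars))
    b_eq = np.ones(Obs)
    for o in range(Obs):
        for g1 in range(G):
            for a in range(A):
                A_eq[o, w(g1, a, o)] = 1.0

    bounds = [(None, None)] + [(0.0, None)] * n_omega    # eps free, omega >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    eps = -res.fun
    omega = res.x[1:].reshape(G, A, Obs)                 # omega[g', a, o]
    return eps, omega, res

If the returned $\epsilon$ is positive, omega can be adopted as the new parameters of I-state $g$; otherwise the solver result is kept so that its dual variables can be inspected, as discussed below.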

In [118], the authors show the following interpretation of this optimization: it implicitly considers the value vectors $V_g^\beta$ of the backed up value function. A positive $\epsilon$ implies that the LP found a convex combination of the value vectors of the backed up function that dominates the current value of the I-state at every belief state. This is explained further in Figure 6.3, which is adapted from [54]. A key point is that the new value vector of the improved I-state is parallel to its current value vector, and the improved value becomes tangent to the backed up value function.

Escaping Local Maxima by Adding I-States

Eventually no I-state can be improved by further iterations, i.e., $\forall g \in G$, the corresponding LP yields an optimal value of $\epsilon = 0$. This is shown in Figure 6.4.

Theorem 6.4.2 [118] Policy Iteration has reached a local maximum if and only if $V_g$ is tangent to the backed up value function for all $g \in G$.

In order to escape local maxima, the controller can add more I-states to its structure. Here the tangency criterion becomes useful. First, note that the dual variables corresponding to the Improvement Constraints in the LP provide the tangent belief state(s) when $\epsilon = 0$. In some cases, a value vector may be tangent to the backed up value function not just at a single point, but along a line segment. Regardless, at a local maximum, each of the $|G|$ linear programs yields some tangent belief states. Most LP solvers compute the dual variables alongside the primal solution, so these tangent beliefs are readily available as a by-product of the optimization introduced above.
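As a companion to the LP sketch above, the fragment below illustrates how these duals might be read off when the LP is solved with SciPy's HiGHS backend (which exposes them as res.ineqlin.marginals); the sign convention can differ between solvers, hence the absolute value, and the helper is again only a sketch under the earlier assumptions.

import numpy as np

def tangent_belief(res, tol=1e-9):
    """Recover a tangent belief from the LP result (sketch).

    `res` is the scipy.optimize.linprog result (method="highs") from the
    I-state Improvement LP above, which has one inequality row per state s.
    When the optimal epsilon is (numerically) zero, the duals of those
    improvement constraints, normalized, give a belief over states at
    which the I-state's value is tangent to the backed up value function.
    """
    eps = -res.fun
    if eps > tol:                  # the I-state was improved; no tangency here
        return None
    duals = np.abs(res.ineqlin.marginals)   # one dual per improvement constraint
    total = duals.sum()
    return duals / total if total > tol else None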

Algorithm 6.2 shows how to use the tangent beliefs to escape the local maximum.


Figure 6.3: Graphical depiction of the effect of the I-state Improvement LP. Let the LP be solved for the I-state whose value vector is $V_1^\beta$. The solid purple line shows the backed up value function; the current value function is not shown. The backed up value vectors $V_1^{\beta'}$ and $V_2^{\beta'}$ are such that their convex combination (black dashed line) dominates the value vector $V_1^\beta$ by $\epsilon > 0$. The parameters of the I-state $g_1$ are therefore replaced by the corresponding maximizing parameters so that its value moves upward by $\epsilon$. Note that the improved value vector, given by $V_1^\beta + \epsilon$, is tangent to the backed up value function.


Figure 6.4: Policy Iteration Local Maximum. All current value vectors (solid lines) are tangent to the backed up value function (solid magenta). No improvement of any I-state is possible.

Algorithm 6.2 Bounded PI: Adding I-States to Escape Local Maxima

Input: Set $B$ of tangent beliefs from the policy improvement LPs for each I-state; $N_{new}$, the maximum number of I-states to add.

1: $N_{added} \leftarrow 0$
2: repeat
3: Pick $b \in B$, $B \leftarrow B \setminus \{b\}$
4: $Fwd \leftarrow \emptyset$
5: for all $(\alpha, o) \in Act \times O$ do
6: if $\Pr(o|b) = \sum_{s \in S} b(s)\,O(o|s) > 0$ then
7: Look ahead one step to compute the forwarded belief
$$b^{o,\alpha}(s') = \sum_s T(s'|s,\alpha)\,\frac{O(o|s)\,b(s)}{\sum_{\bar{s} \in S} O(o|\bar{s})\,b(\bar{s})}. \tag{6.36}$$
8: $Fwd \leftarrow Fwd \cup \{b^{o,\alpha}\}$
9: end if
10: end for
11: for all $b_{fwd} \in Fwd$ do
12: Apply the r.h.s. of the DP Backup Equation to $b_{fwd}$:
$$V^{\beta,backedup}(b_{fwd}) = \max_{\alpha \in Act}\left\{ r^\beta(b_{fwd}) + \beta \sum_{o \in O} \Pr(o|b_{fwd}) \left( \max_{g \in G} \sum_{s' \in S} b^{o,\alpha}_{fwd}(s')\,V_g^\beta(s') \right) \right\} \tag{6.37}$$
where $b^{o,\alpha}_{fwd}$ is computed for each product state $s' \in S$ as follows:
$$b^{o,\alpha}_{fwd}(s') = \sum_s T(s'|s,\alpha)\,\frac{O(o|s)\,b_{fwd}(s)}{\sum_{\bar{s} \in S} O(o|\bar{s})\,b_{fwd}(\bar{s})}. \tag{6.38}$$
13: Note the maximizing action $\alpha^*$ and I-state $g^*$.
14: if $V^{\beta,backedup}(b_{fwd}) > V^\beta(b_{fwd})$ then
15: Add a new deterministic I-state $g_{new}$ such that $\omega(g^*,\alpha^*|g_{new},o) = 1$ $\forall o \in O$.
16: $N_{added} \leftarrow N_{added} + 1$
17: end if
18: if $N_{added} \ge N_{new}$ then
19: return
20: end if
21: end for
22: until $B = \emptyset$

The algorithm can be understood as follows. The tangent beliefs are those at which the DP backup yields no improvement of the value function beyond the current value. Instead of improving the value at the tangent belief itself, the algorithm tries to improve the value of some belief that can be reached from the tangent belief in one step. These forwarded beliefs are computed in Steps 4-10 of Algorithm 6.2. Next, an attempt is made to improve these forwarded beliefs by DP backup (Step 12). If some action $\alpha^*$ and successor I-state $g^*$ can in fact improve the value, then a new I-state is added which deterministically selects this action and successor I-state (Steps 13-15).
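Under the same assumed NumPy layout as the earlier sketches (T[s, a, s1], O[s, o], R[s], V[s, g] are illustrative names), the forwarding and backup steps of Algorithm 6.2 might be written as follows; this mirrors Equations (6.36)-(6.38) but is not tied to any particular implementation.

import numpy as np

def forwarded_beliefs(b, T, O, tol=1e-12):
    """Steps 4-10 of Algorithm 6.2 (sketch): one-step forwarded beliefs."""
    S, A, _ = T.shape
    Obs = O.shape[1]
    fwd = []
    for a in range(A):
        for o in range(Obs):
            pr_o = float(b @ O[:, o])               # Pr(o|b) = sum_s b(s) O(o|s)
            if pr_o > tol:
                b_cond = O[:, o] * b / pr_o         # condition b on observing o
                fwd.append(b_cond @ T[:, a, :])     # push through T(.|s,a), Eq. (6.36)
    return fwd

def dp_backup(b_fwd, T, O, R, V, beta):
    """Step 12 of Algorithm 6.2 (sketch): r.h.s. of the DP Backup Equation (6.37).

    Returns the backed up value at b_fwd together with the maximizing action.
    """
    S, A, _ = T.shape
    Obs = O.shape[1]
    best_val, best_a = -np.inf, None
    for a in range(A):
        total = float(R @ b_fwd)                    # r_beta(b_fwd)
        for o in range(Obs):
            pr_o = float(b_fwd @ O[:, o])
            if pr_o <= 0.0:
                continue
            b_oa = (O[:, o] * b_fwd / pr_o) @ T[:, a, :]       # Eq. (6.38)
            total += beta * pr_o * float(np.max(V.T @ b_oa))   # max_g <b_oa, V_g>
        if total > best_val:
            best_val, best_a = total, a
    return best_val, best_a

A new I-state is warranted at $b_{fwd}$ whenever the backed up value returned here exceeds the current value $\max_{g} \sum_s b_{fwd}(s) V_g^\beta(s)$.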

Note that at the end of the algorithm, the newly added I-states $g_{new}$ have no incoming edges, i.e., no pre-existing I-state transitions to $g_{new}$. However, when the other I-states are improved in subsequent policy improvement steps, they generate transitions to any $g_{new}$ that was added. This new I-state then improves the value of the original tangent belief.
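For completeness, a rough outline of one full policy improvement pass (LP round-robin followed by Algorithm 6.2 at a local maximum) could be organized as below. It relies only on the hypothetical helpers sketched earlier, the FSC bookkeeping (a list controller of per-I-state parameter arrays) is purely suggestive, and re-evaluating $V$ after parameter changes is assumed to happen elsewhere.

import numpy as np

def policy_improvement_pass(controller, T, O, R, V, beta, n_new=2, tol=1e-9):
    """One policy improvement pass (sketch): improve I-states, else escape."""
    tangents = []
    improved = False
    for g in range(len(controller)):
        eps, omega, res = improve_istate(g, T, O, R, V, beta)
        if eps > tol:
            controller[g] = omega           # adopt the maximizing parameters
            improved = True
        else:
            b = tangent_belief(res)
            if b is not None:
                tangents.append(b)

    if not improved:                        # local maximum: escape via Algorithm 6.2
        added = 0
        for b in tangents:
            for b_fwd in forwarded_beliefs(b, T, O):
                backed_up, a_star = dp_backup(b_fwd, T, O, R, V, beta)
                if backed_up > float(np.max(V.T @ b_fwd)):
                    # A new deterministic I-state taking action a_star (with the
                    # maximizing successor, as in Step 15) would be appended here.
                    added += 1
                    if added >= n_new:
                        return controller
    return controller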

6.5 Applying Bounded Policy Iteration to LTL Reward Max-