
Figure 6.2: Effect of the DP Backup Equation. The solid line shows a (piecewise linear) value function $V$. Applying the DP Backup Equation results in a pointwise improvement of the value function (dashed line). However, not all belief states may admit improvement, as they could already be optimal. $b$ is such a point and is called a tangent belief state.

at each belief state in the belief space, which is uncountably infinite. The stochastic bounded policy iteration algorithm, described in the remainder of this chapter, circumvents this by using a two-pronged approach: (a) set up an efficient optimization problem to find the best $\omega$ for an FSC of a given size $|G|$; and (b) add a small, bounded number of I-states to the FSC to escape local maxima as they are encountered.

The basic algorithm is presented first, before showing how it can be adapted for solving the Conservative Optimization Criterion given by Equation (5.12).

6.4.1 Bounded Stochastic Policy Iteration

Of concern is the problem of maximizing the expected long-term discounted reward criterion over a general POMDP. The state transition probabilities are given by $T(s'|s,\alpha)$, and the observation probabilities by $O(o|s)$. Most of this section follows from [118] and [54]. These authors showed that:

1. Allowing stochastic I-state transitions and action selection (i.e., FSC I-state transitions and actions sampled from distributions) enables improvement of the policy without having to add more I-states.

2. If the policy cannot be improved, then the algorithm has reached a local maximum. Specifically, there are some belief states at which no choice of $\omega$ for the current size of the FSC allows the value function to be improved. In such a case, a small number of I-states can be added that improve the policy at precisely those belief states, thus escaping the local maximum.

Both of these steps together constitute the Policy Improvement step of the policy iteration Algorithm 6.1.

Definition 6.4.1 (Tangent Belief State) A belief state $b$ is called a tangent belief state if $V^\beta(b)$ touches the DP backup of $V^\beta(b)$ from below. Since $V^\beta(b)$ must equal $V_g^\beta$ for some $g$, we also say that the I-state $g$ is tangent to the backed up value function $V^\beta$ at $b$. Tangency can be seen in Figure 6.2.

Equipped with this definition, the two steps involved in policy improvement can be carried out as follows.

Improving I-States by Solving a Linear Program

An I-state $g$ is said to be improved if the tunable parameters associated with that state can be adjusted so that $\vec{V}_g^\beta$ is increased. This step tries to improve each I-state in a round-robin fashion while keeping the other I-states fixed. The improvement is posed as a linear program (LP) as follows:

I-state Improvement LP: For the I-state $g$, the following LP is constructed over the unknowns $\epsilon$ and $\omega(g',\alpha|g,o)$, $\forall g',\alpha,o$:

$$
\begin{aligned}
\max_{\epsilon,\;\omega(g',\alpha|g,o)} \quad & \epsilon \\
\text{subject to} \quad & \text{Improvement constraints:} \\
& V^\beta([s,g]) + \epsilon \le r^\beta(s) + \beta \sum_{s',g',\alpha,o} O(o|s)\,\omega(g',\alpha|g,o)\,T(s'|s,\alpha)\,V^\beta([s',g']) \quad \forall s \\
& \text{Probability constraints:} \\
& \sum_{g',\alpha} \omega(g',\alpha|g,o) = 1 \quad \forall o \\
& \omega(g',\alpha|g,o) \ge 0 \quad \forall g',\alpha,o
\end{aligned}
\tag{6.35}
$$

The linear program searches for $\omega$ values that improve the I-state value vector $\vec{V}_g^\beta$ by maximizing the parameter $\epsilon$. If an improvement is found, i.e., $\epsilon > 0$, the parameters of the I-state are updated with the corresponding maximizing $\omega$. The value vector $\vec{V}_g^\beta$ may also be updated before proceeding to the next I-state in the round-robin order.
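To make the construction concrete, the following is a minimal sketch of how such an LP could be assembled and solved with SciPy's linprog. It is an illustrative sketch, not the implementation used here: the array names and layouts are assumptions, with T[s, a, s1] standing for $T(s'|s,\alpha)$, O[s, o] for $O(o|s)$, R[s] for $r^\beta(s)$, and V[s, g] for $V^\beta([s,g])$.

import numpy as np
from scipy.optimize import linprog

def improve_istate(g, T, O, R, V, beta):
    """Sketch of the I-state Improvement LP (6.35) for I-state g.

    Hypothetical array layout:
      T[s, a, s1] ~ T(s'|s,a),  O[s, o] ~ O(o|s),
      R[s] ~ r_beta(s),         V[s, g] ~ V_beta([s, g]).
    Decision variables: epsilon and omega(g', a | g, o) for all g', a, o.
    """
    S, A, _ = T.shape
    Obs = O.shape[1]
    G = V.shape[1]
    n_omega = G * A * Obs                     # one variable per (g', a, o)
    n_vars = 1 + n_omega                      # epsilon is variable 0

    def w(g1, a, o):                          # flat index of omega(g', a | g, o)
        return 1 + (g1 * A + a) * Obs + o

    # Objective: maximize epsilon  <=>  minimize -epsilon.
    c = np.zeros(n_vars)
    c[0] = -1.0

    # Improvement constraints (one row per state s):
    #   eps - beta * sum_{s',g',a,o} O(o|s) T(s'|s,a) V(s',g') omega <= R(s) - V(s,g)
    A_ub = np.zeros((S, n_vars))
    b_ub = np.zeros(S)
    for s in range(S):
        A_ub[s, 0] = 1.0
        for g1 in range(G):
            for a in range(A):
                expect = np.dot(T[s, a, :], V[:, g1])   # sum_{s'} T(s'|s,a) V(s',g')
                for o in range(Obs):
                    A_ub[s, w(g1, a, o)] = -beta * O[s, o] * expect
        b_ub[s] = R[s] - V[s, g]

    # Probability constraints: sum_{g',a} omega(g', a | g, o) = 1 for each o.
    A_eq = np.zeros((Obs, n_vars))
    b_eq = np.ones(Obs)
    for o in range(Obs):
        for g1 in range(G):
            for a in range(A):
                A_eq[o, w(g1, a, o)] = 1.0

    bounds = [(None, None)] + [(0.0, None)] * n_omega    # eps free, omega >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    eps = -res.fun
    omega = res.x[1:].reshape(G, A, Obs)                 # omega[g', a, o]
    return eps, omega, res

If the returned $\epsilon$ is positive, omega can be adopted as the new parameters of I-state $g$; otherwise the solver result is kept so that its dual variables can be inspected, as discussed below.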

In [118], the authors show the following interpretation of this optimization: it implicitly considers the value vectors $V_g^\beta$ of the backed up value function. A positive $\epsilon$ implies that the LP found a convex combination of the value vectors of the backed up function that dominates the current value of the I-state at every belief state. This is explained further in Figure 6.3, which is adapted from [54]. A key point is that the new value vector of the improved I-state is parallel to its current value vector, and the improved value becomes tangent to the backed up value function.

Escaping Local Maxima by Adding I-States

Eventually no I-state can be improved by further iterations, i.e., $\forall g \in G$, the corresponding LP yields an optimal value of $\epsilon = 0$. This is shown in Figure 6.4.

Theorem 6.4.2 [118] Policy Iteration has reached a local maximum if and only if $V_g$ is tangent to the backed up value function for all $g \in G$.

In order to escape local maxima, the controller can add more I-states to its structure. Here the tangency criterion becomes useful. First, note that the dual variables corresponding to the Improvement Constraints in the LP provide the tangent belief state(s) when $\epsilon = 0$. In some cases, a value vector may be tangent to the backed up value function not just at a single point, but along a line segment. Regardless, at a local maximum, each of the $|G|$ linear programs yields some tangent belief states. Most LP solvers compute the dual variables alongside the primal solution, so these tangent beliefs are readily available as a by-product of the optimization introduced above.
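As a companion to the LP sketch above, the fragment below illustrates how these duals might be read off when the LP is solved with SciPy's HiGHS backend (which exposes them as res.ineqlin.marginals); the sign convention can differ between solvers, hence the absolute value, and the helper is again only a sketch under the earlier assumptions.

import numpy as np

def tangent_belief(res, tol=1e-9):
    """Recover a tangent belief from the LP result (sketch).

    `res` is the scipy.optimize.linprog result (method="highs") from the
    I-state Improvement LP above, which has one inequality row per state s.
    When the optimal epsilon is (numerically) zero, the duals of those
    improvement constraints, normalized, give a belief over states at
    which the I-state's value is tangent to the backed up value function.
    """
    eps = -res.fun
    if eps > tol:                  # the I-state was improved; no tangency here
        return None
    duals = np.abs(res.ineqlin.marginals)   # one dual per improvement constraint
    total = duals.sum()
    return duals / total if total > tol else None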

Algorithm 6.2 shows how to use the tangent beliefs to escape the local maximum.


Figure 6.3: Graphical depiction of the effect of the I-state Improvement LP. Let the LP be solved for the I-state whose value vector is $V_1^\beta$. The solid purple line shows the backed up value function; the current value function is not shown. The backed up value vectors $V_1^{\beta'}$ and $V_2^{\beta'}$ are such that their convex combination (black dashed line) dominates the value vector $V_1^\beta$ by $\epsilon > 0$. The parameters of the I-state $g_1$ are therefore replaced by the corresponding maximizing parameters so that its value moves upward by $\epsilon$. Note that the improved value vector, given by $V_1^\beta + \epsilon$, is tangent to the backed up value function.


Figure 6.4: Policy Iteration Local Maximum. All current value vectors (solid lines) are tangent to the backed up value function (solid magenta). No improvement of any I-state is possible.

Algorithm 6.2 Bounded PI: Adding I-States to Escape Local Maxima

Input: Set $B$ of tangent beliefs from the policy improvement LPs for each I-state; $N_{new}$, the maximum number of I-states to add.

1: $N_{added} \leftarrow 0$
2: repeat
3: Pick $b \in B$, $B \leftarrow B \setminus \{b\}$
4: $Fwd \leftarrow \emptyset$
5: for all $(\alpha, o) \in Act \times O$ do
6: if $\Pr(o|b) = \sum_{s \in S} b(s)\,O(o|s) > 0$ then
7: Look ahead one step to compute the forwarded belief
$$b^{o,\alpha}(s') = \sum_s T(s'|s,\alpha)\,\frac{O(o|s)\,b(s)}{\sum_{\bar{s} \in S} O(o|\bar{s})\,b(\bar{s})}. \tag{6.36}$$
8: $Fwd \leftarrow Fwd \cup \{b^{o,\alpha}\}$
9: end if
10: end for
11: for all $b_{fwd} \in Fwd$ do
12: Apply the r.h.s. of the DP Backup Equation to $b_{fwd}$:
$$V^{\beta,backedup}(b_{fwd}) = \max_{\alpha \in Act}\left\{ r^\beta(b_{fwd}) + \beta \sum_{o \in O} \Pr(o|b_{fwd}) \left( \max_{g \in G} \sum_{s' \in S} b^{o,\alpha}_{fwd}(s')\,V_g^\beta(s') \right) \right\} \tag{6.37}$$
where $b^{o,\alpha}_{fwd}$ is computed for each product state $s' \in S$ as follows:
$$b^{o,\alpha}_{fwd}(s') = \sum_s T(s'|s,\alpha)\,\frac{O(o|s)\,b_{fwd}(s)}{\sum_{\bar{s} \in S} O(o|\bar{s})\,b_{fwd}(\bar{s})}. \tag{6.38}$$
13: Note the maximizing action $\alpha^*$ and I-state $g^*$.
14: if $V^{\beta,backedup}(b_{fwd}) > V^\beta(b_{fwd})$ then
15: Add a new deterministic I-state $g_{new}$ such that $\omega(g^*,\alpha^*|g_{new},o) = 1$ $\forall o \in O$.
16: $N_{added} \leftarrow N_{added} + 1$
17: end if
18: if $N_{added} \ge N_{new}$ then
19: return
20: end if
21: end for
22: until $B = \emptyset$

The algorithm can be understood as follows. The tangent beliefs are those at which the DP backup yields no improvement of the value function beyond the current value. Instead of improving the value at the tangent belief itself, the algorithm tries to improve the value of some belief that can be reached from the tangent belief in one step. These forwarded beliefs are computed in Steps 4-10 of Algorithm 6.2. Next, an attempt is made to improve these forwarded beliefs by DP backup (Step 12). If some action $\alpha^*$ and successor I-state $g^*$ can in fact improve the value, then a new I-state is added which deterministically selects this action and successor I-state (Steps 13-15).
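Under the same assumed NumPy layout as the earlier sketches (T[s, a, s1], O[s, o], R[s], V[s, g] are illustrative names), the forwarding and backup steps of Algorithm 6.2 might be written as follows; this mirrors Equations (6.36)-(6.38) but is not tied to any particular implementation.

import numpy as np

def forwarded_beliefs(b, T, O, tol=1e-12):
    """Steps 4-10 of Algorithm 6.2 (sketch): one-step forwarded beliefs."""
    S, A, _ = T.shape
    Obs = O.shape[1]
    fwd = []
    for a in range(A):
        for o in range(Obs):
            pr_o = float(b @ O[:, o])               # Pr(o|b) = sum_s b(s) O(o|s)
            if pr_o > tol:
                b_cond = O[:, o] * b / pr_o         # condition b on observing o
                fwd.append(b_cond @ T[:, a, :])     # push through T(.|s,a), Eq. (6.36)
    return fwd

def dp_backup(b_fwd, T, O, R, V, beta):
    """Step 12 of Algorithm 6.2 (sketch): r.h.s. of the DP Backup Equation (6.37).

    Returns the backed up value at b_fwd together with the maximizing action.
    """
    S, A, _ = T.shape
    Obs = O.shape[1]
    best_val, best_a = -np.inf, None
    for a in range(A):
        total = float(R @ b_fwd)                    # r_beta(b_fwd)
        for o in range(Obs):
            pr_o = float(b_fwd @ O[:, o])
            if pr_o <= 0.0:
                continue
            b_oa = (O[:, o] * b_fwd / pr_o) @ T[:, a, :]       # Eq. (6.38)
            total += beta * pr_o * float(np.max(V.T @ b_oa))   # max_g <b_oa, V_g>
        if total > best_val:
            best_val, best_a = total, a
    return best_val, best_a

A new I-state is warranted at $b_{fwd}$ whenever the backed up value returned here exceeds the current value $\max_{g} \sum_s b_{fwd}(s) V_g^\beta(s)$.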

Note that at the end of the algorithm, the newly added I-states $g_{new}$ have no incoming edges, i.e., no pre-existing I-state transitions to $g_{new}$. However, when the other I-states are improved in subsequent policy improvement steps, they generate transitions to any $g_{new}$ that was added. This new I-state then improves the value of the original tangent belief.
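For completeness, a rough outline of one full policy improvement pass (LP round-robin followed by Algorithm 6.2 at a local maximum) could be organized as below. It relies only on the hypothetical helpers sketched earlier, the FSC bookkeeping (a list controller of per-I-state parameter arrays) is purely suggestive, and re-evaluating $V$ after parameter changes is assumed to happen elsewhere.

import numpy as np

def policy_improvement_pass(controller, T, O, R, V, beta, n_new=2, tol=1e-9):
    """One policy improvement pass (sketch): improve I-states, else escape."""
    tangents = []
    improved = False
    for g in range(len(controller)):
        eps, omega, res = improve_istate(g, T, O, R, V, beta)
        if eps > tol:
            controller[g] = omega           # adopt the maximizing parameters
            improved = True
        else:
            b = tangent_belief(res)
            if b is not None:
                tangents.append(b)

    if not improved:                        # local maximum: escape via Algorithm 6.2
        added = 0
        for b in tangents:
            for b_fwd in forwarded_beliefs(b, T, O):
                backed_up, a_star = dp_backup(b_fwd, T, O, R, V, beta)
                if backed_up > float(np.max(V.T @ b_fwd)):
                    # A new deterministic I-state taking action a_star (with the
                    # maximizing successor, as in Step 15) would be appended here.
                    added += 1
                    if added >= n_new:
                        return controller
    return controller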

6.5 Applying Bounded Policy Iteration to LTL Reward Max-