Algorithm 6.3 Bounded Policy Iteration for Conservative Optimization Criterion
Input: (a) An initial feasible FSC, $\mathcal{G}$, with I-states $G = \{G_{tr}, G_{ss}\}$, such that $\eta^{av}_{ssd}(r) = 0$. (b) Maximum size of the FSC, $N_{max}$. (c) $N_{new} \le N_{max}$, the number of I-states to attempt to add when escaping a local maximum.
1: $improved \gets True$
2: Compute the value vectors $\vec{V}_\beta$ of the discounted reward criterion $\eta_\beta$ as in Equation (6.23), or the efficient approximation in Section 6.3.1.
3: while $|G| \le N_{max}$ and $improved = True$ do
4:    $improved \gets False$
5:    for all I-states $g \in G$ do
6:        Set up the Constrained Improvement LP as in Section 6.5.1.
7:        if the Improvement LP results in an optimal $\epsilon > 0$ then
8:            Replace the parameters for I-state $g$.
9:            $improved \gets True$
10:           Compute the value vectors $\vec{V}^g_\beta$ of the discounted reward criterion $\eta_\beta$ as in Equation (6.23), or the efficient approximation in Section 6.3.1.
11:       end if
12:   end for
13:   if $improved = False$ and $|G| < N_{max}$ then
14:       $n_{added} \gets 0$
15:       $N'_{new} \gets \min(N_{new}, N_{max} - |G|)$
16:       Try to add $N'_{new}$ I-state(s) to $\mathcal{G}$ according to the constrained DP backup in Section 6.5.3.
17:       $n_{added} \gets$ actual number of I-states added in the previous step.
18:       if $n_{added} > 0$ then
19:           $improved \gets True$
20:           Compute the value vectors $\vec{V}^g_\beta$ of the discounted reward criterion $\eta_\beta$ as in Equation (6.23), or the efficient approximation in Section 6.3.1.
21:       end if
22:   end if
23: end while
Output: $\mathcal{G}$
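To make the control flow of Algorithm 6.3 concrete, the following is a minimal Python sketch of its outer loop. The helper functions evaluate_discounted_value, improve_istate, and add_istates are hypothetical placeholders for the policy evaluation step (Equation (6.23) or Section 6.3.1), the node improvement LP (Section 6.5.1), and the constrained DP backup (Section 6.5.3), respectively; they are not part of the thesis implementation.

# Sketch only: the FSC object and the three helper functions are assumed
# interfaces, not part of the text.
def bounded_policy_iteration(fsc, N_max, N_new):
    improved = True
    V_beta = evaluate_discounted_value(fsc)            # Eq. (6.23) / Sec. 6.3.1
    while len(fsc.istates) <= N_max and improved:
        improved = False
        for g in list(fsc.istates):
            # Constrained node improvement (Section 6.5.1)
            eps, new_params = improve_istate(fsc, g, V_beta)
            if eps > 0:
                fsc.set_params(g, new_params)
                improved = True
                V_beta = evaluate_discounted_value(fsc)
        if not improved and len(fsc.istates) < N_max:
            n_to_add = min(N_new, N_max - len(fsc.istates))
            # Constrained DP backup to escape the local maximum (Section 6.5.3)
            n_added = add_istates(fsc, V_beta, n_to_add)
            if n_added > 0:
                improved = True
                V_beta = evaluate_discounted_value(fsc)
    return fsc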
6.5.1 Node Improvement
The first observation is that the search over $\kappa$ can be dropped. This simplification occurs because the initial node is chosen by computing the best valued node for the initial belief, i.e.,
$$
\kappa(g_{init}) = 1, \quad \text{where } g_{init} = \operatorname*{argmax}_{g} \big(\vec{\iota}^{\,\phi}_{init}\big)^T \vec{V}^g_\beta. \qquad (6.40)
$$
Once this initial node has been selected, the above objective only differs from the typical discounted reward maximization problem of the previous sections because of the presence of the new constraint
$$
\eta^{av}_{ssd}(r) = 0, \qquad (6.41)
$$
which must be incorporated into the optimization algorithm. Using Theorem 6.1.3, the above constraint can be rewritten as
$$
\eta^{av}_{ssd}(r) = 0 \iff \big(\vec{\iota}^{\,ssd}_{init}\big)^T \vec{g} = 0, \qquad (6.42)
$$
where $\vec{g}$ uniquely solves the Poisson Equation (6.2). This allows the node improvement to be written as a bilinear program. Again, one node $g$ is improved at a time, while holding all other nodes constant, as follows.
I-state Improvement Bilinear Program:
$$
\max_{\epsilon,\;\omega(g',\alpha|g,o),\;\vec{g},\;\vec{V}_{av}} \;\; \epsilon \qquad \text{subject to}
$$
Improvement Constraints:
$$
V_\beta([s,g]) + \epsilon \;\le\; r_\beta(s) + \beta \sum_{s',g',\alpha,o} O(o|s)\,\omega(g',\alpha|g,o)\,T^{PM_\phi}(s'|s,\alpha)\,V_\beta([s',g']) \quad \forall s
$$
Poisson Equation (if $g \in G_{ss}$):
$$
\vec{V}_{av} + \vec{g} \;=\; \vec{r}_{av} + T^{PM_\phi}_{mod}\vec{V}_{av}, \qquad
\vec{g} \;=\; T^{PM_\phi}_{mod}\,\vec{g}
$$
Feasibility Constraints (if $g \in G_{ss}$):
$$
\big(\vec{\iota}^{\,ss}_{init,g}\big)^T \vec{g} \;=\; 0
$$
FSC Structure Constraints (if $g \in G_{ss}$):
$$
\omega(g',\alpha|g,o) \;=\; 0 \quad \text{if } g' \in G_{tr}
$$
Probability Constraints:
$$
\sum_{g',\alpha} \omega(g',\alpha|g,o) \;=\; 1 \quad \forall o, \qquad
\omega(g',\alpha|g,o) \;\ge\; 0 \quad \forall g',\alpha,o \qquad (6.43)
$$
Note that a node in $G_{tr}$ does not have to guarantee that Product-POMDP states are not allowed to visit $Avoid^{PM_\phi}$, and hence the extra Poisson Equation and Feasibility Constraints that appear above need only be applied to I-states $g \in G_{ss}$. Further, the FSC Structure Constraints ensure that once the execution has transitioned to steady state, the I-states in $G_{tr}$ can no longer be visited.
They are in fact a reduction in the number of unknown variables. Next, the Poisson Equation constraints introduce bilinearity into the optimization. This is because the term $T^{PM_\phi}_{mod}$, which is linear in $\omega(g',\alpha|g,o)$, is multiplied by the unknowns $\vec{V}_{av}$ and $\vec{g}$ in the two sets of constraints that form the Poisson Equation.
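For a fixed choice of $\omega$ (for example, at the end of a policy evaluation step), the Feasibility Constraint $(\vec{\iota}^{\,ss}_{init,g})^T\vec{g} = 0$ can be checked directly by solving the Poisson Equation numerically. The following NumPy sketch illustrates such a check under the assumption that the global transition matrix and reward vector have already been assembled; the function name and its arguments are illustrative only.

import numpy as np

def check_ssd_feasibility(T_mod, r_ssd, iota_init, tol=1e-9):
    # T_mod     : (N, N) row-stochastic global transition matrix under the FSC
    # r_ssd     : (N,)   per-state reward whose long-run average must vanish
    # iota_init : (N,)   initial distribution over global states
    N = T_mod.shape[0]
    I = np.eye(N)
    # Stack the two Poisson constraints:  (I - T) g = 0  and  g + (I - T) V = r.
    # The gain vector g is unique; the bias V is only free up to harmonic terms,
    # so a least-squares solve of this consistent system recovers g.
    A = np.block([[I - T_mod, np.zeros((N, N))],
                  [I,         I - T_mod]])
    b = np.concatenate([np.zeros(N), r_ssd])
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    g = sol[:N]
    return abs(iota_init @ g) <= tol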
6.5.2 Convex Relaxation of Bilinear Terms
Bilinear problems are in general hard to solve [27], unless they are equivalent to positive semidefinite or second-order cone programs, which would make the problem convex. Neither of these convexity assumptions holds for the bilinear constraints in Equation (6.43). However, several convex relaxation schemes exist for bilinear problems. Two of the most popular methods are the Reformulation-Linearization Technique (RLT) [132, 134] and the more recent SDP relaxation techniques used in [133, 136]. A good overview of these two methods can be found in [124]. This thesis utilizes a linear relaxation resulting from the RLT, summarized below, to obtain a possibly sub-optimal solution at each improvement step.
While RLT can be applied to a wide range of problems, including discrete combinatorial problems, it is introduced here for the case of Quadratically Constrained Quadratic Problems (QCQPs) over unknowns $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$. The notation follows [124] closely. A QCQP can be written as
$$
\begin{array}{ll}
\max & x^T Q_0 x + a_0^T x + b_0^T y \\
\text{s.t.} & x^T Q_k x + a_k^T x + b_k^T y \le c_k \quad \text{for } k = 1,2,\ldots,p, \\
& l_{x_i} \le x_i \le u_{x_i} \quad \text{for } i = 1,2,\ldots,n, \\
& l_{y_j} \le y_j \le u_{y_j} \quad \text{for } j = 1,2,\ldots,m.
\end{array}
\qquad (6.44)
$$
RLT is carried out as follows. For each pair $x_i, x_j$ such that the product term $x_i x_j$ appears with a nonzero coefficient in either the objective or the constraints, a new variable $X_{ij}$ is introduced, which replaces the product $x_i x_j$ in the problem. In addition, the bounds $l_{x_i}, l_{x_j}, u_{x_i}, u_{x_j}$ are utilized to produce four new linear constraints
$$
\begin{array}{rcl}
X_{ij} - l_{x_i} x_j - l_{x_j} x_i & \ge & -\,l_{x_i} l_{x_j}, \\
X_{ij} - u_{x_i} x_j - u_{x_j} x_i & \ge & -\,u_{x_i} u_{x_j}, \\
X_{ij} - l_{x_i} x_j - u_{x_j} x_i & \le & -\,l_{x_i} u_{x_j}, \\
X_{ij} - u_{x_i} x_j - l_{x_j} x_i & \le & -\,u_{x_i} l_{x_j}.
\end{array}
\qquad (6.45)
$$
The above constraints are the McCormick convex envelopes [97]. For bilinear programs with bounded variables, the McCormick envelopes are used within algorithms such as branch and bound [83] to obtain successively tighter relaxations and, ultimately, globally optimal solutions. An efficient solver that incorporates this methodology is [89].
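As a small illustration of Equation (6.45), the sketch below relaxes the toy problem of maximizing a single product $x\,y$ over a box: the product is replaced by an auxiliary variable $w$ constrained by the four McCormick envelopes, and the resulting LP is solved with SciPy. The setup is illustrative only and not part of the thesis implementation.

import numpy as np
from scipy.optimize import linprog

def mccormick_relaxation(lx, ux, ly, uy):
    # LP relaxation of max x*y over a box, with w standing in for x*y.
    # Variable order: [x, y, w]. Each row of A_ub @ z <= b_ub is one
    # McCormick envelope from Equation (6.45).
    A_ub = np.array([
        [ ly,  lx, -1.0],   #  w >= lx*y + ly*x - lx*ly
        [ uy,  ux, -1.0],   #  w >= ux*y + uy*x - ux*uy
        [-uy, -lx,  1.0],   #  w <= lx*y + uy*x - lx*uy
        [-ly, -ux,  1.0],   #  w <= ux*y + ly*x - ux*ly
    ])
    b_ub = np.array([lx * ly, ux * uy, -lx * uy, -ux * ly])
    c = np.array([0.0, 0.0, -1.0])          # maximize w == minimize -w
    bounds = [(lx, ux), (ly, uy), (None, None)]
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)

# Example: on the box [0, 1] x [0, 1] the relaxation attains w = 1 at x = y = 1,
# which here coincides with the true maximum of x*y.
print(mccormick_relaxation(0.0, 1.0, 0.0, 1.0).x)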
The following shows how to relax the bilinear constraints appearing in the Poisson Equation of the I-state Improvement Bilinear Program in Equation (6.43). Note first that while the Poisson Equation spans all global states, i.e., there are $|S| \times |G|$ equations for each of the two sets of Poisson constraints, only the equations pertaining to the current I-state under consideration have bilinear terms. To see this, rewrite the Poisson constraints in block form,
$$
\begin{bmatrix} \vec{V}^{g_1}_{av} \\ \vdots \\ \vec{V}^{g}_{av} \\ \vdots \\ \vec{V}^{g_{|G|}}_{av} \end{bmatrix}
+
\begin{bmatrix} \vec{g}_{g_1} \\ \vdots \\ \vec{g}_{g} \\ \vdots \\ \vec{g}_{g_{|G|}} \end{bmatrix}
=
\begin{bmatrix} \vec{r}^{\,g_1}_{av} \\ \vdots \\ \vec{r}^{\,g}_{av} \\ \vdots \\ \vec{r}^{\,g_{|G|}}_{av} \end{bmatrix}
+
\begin{bmatrix} T^{PM_\phi}_{mod,g_1} \\ \vdots \\ T^{PM_\phi}_{mod,g} \\ \vdots \\ T^{PM_\phi}_{mod,g_{|G|}} \end{bmatrix}
\begin{bmatrix} \vec{V}^{g_1}_{av} \\ \vdots \\ \vec{V}^{g}_{av} \\ \vdots \\ \vec{V}^{g_{|G|}}_{av} \end{bmatrix}
\quad\text{and}\quad
\begin{bmatrix} \vec{g}_{g_1} \\ \vdots \\ \vec{g}_{g} \\ \vdots \\ \vec{g}_{g_{|G|}} \end{bmatrix}
=
\begin{bmatrix} T^{PM_\phi}_{mod,g_1} \\ \vdots \\ T^{PM_\phi}_{mod,g} \\ \vdots \\ T^{PM_\phi}_{mod,g_{|G|}} \end{bmatrix}
\begin{bmatrix} \vec{g}_{g_1} \\ \vdots \\ \vec{g}_{g} \\ \vdots \\ \vec{g}_{g_{|G|}} \end{bmatrix},
\qquad (6.46)
$$
it can be seen that the bilinearity arises because the block row $T^{PM_\phi}_{mod,g}$ corresponding to the current I-state $g$, whose entries are linear in the unknowns $\omega(\cdot,\cdot|g,\cdot)$, is multiplied by $\vec{V}_{av}$ and $\vec{g}$, respectively. The other block rows of $T^{PM_\phi}_{mod}$ are not functions of the unknowns, and their values are taken from the previous policy evaluation step. The total number of bilinear terms in both sets of equations is given by $2 \times |S||O||G||Act|$.
Moreover, applying the convex relaxation requires that all terms appearing in bilinear products have finite bounds. For the unknowns $\omega(\cdot,\cdot|g,\cdot)$, $\vec{g}$ and $\vec{V}_{av}$, these bounds are given by
$$
\vec{0} \;\le\; \omega(\cdot,\cdot|g,\cdot) \;\le\; \vec{1}, \qquad
\vec{0} \;\le\; \vec{g} \;\le\; \vec{1}, \qquad
-M_1\vec{1} \;\le\; \vec{V}_{av} \;\le\; M_2\vec{1},
\qquad (6.47)
$$
where $M_1, M_2$ are large positive constants that are manually selected. Manual selection is needed because the feasibility set for $\vec{V}_{av}$ depends on the eigenvalues of $I - T^{PM_\phi}_{mod}$ [94], which is difficult to represent in terms of the optimization variables. During numerical implementation, this issue was not found to adversely affect the solution quality. This may be due to the fact that $\vec{V}_{av}$ appears neither in the objective nor in the Feasibility Constraints of Equation (6.43). In fact, the choice of $\omega$ only constrains the value of $\vec{g}$, whereas given $\omega$ and $\vec{g}$ a feasible value of $\vec{V}_{av}$ can always be found. A rigorous analysis of the effect of the choice of bounds for $\vec{V}_{av}$ remains to be carried out.
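As an illustration of how the relaxation is instantiated for Equation (6.43), the sketch below enumerates one auxiliary (RLT) variable per bilinear product, together with the bounds of Equation (6.47); each auxiliary variable would then receive four McCormick constraints as in Equation (6.45). The index sets and names are placeholders, not the thesis implementation.

from itertools import product

def rlt_aux_variables(S, O, G, Act, M1, M2):
    # One auxiliary variable per bilinear product in the Poisson constraints,
    # stored with the bounds (l_x, u_x, l_y, u_y) needed for Equation (6.45).
    aux = {}
    for s, o, g_next, a in product(S, O, G, Act):
        # omega(g',a | g,o) in [0,1] times g(s,g') in [0,1]
        aux[('omega_x_g', s, o, g_next, a)] = (0.0, 1.0, 0.0, 1.0)
        # omega(g',a | g,o) in [0,1] times V_av(s,g') in [-M1, M2]
        aux[('omega_x_V', s, o, g_next, a)] = (0.0, 1.0, -M1, M2)
    return aux

# 2 * |S| * |O| * |G| * |Act| auxiliary variables, matching the count above.
aux = rlt_aux_variables(range(4), range(2), range(3), range(2), 1e3, 1e3)
print(len(aux))   # 2 * 4 * 2 * 3 * 2 = 96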
6.5.3 Addition of I-States to Escape Local Maxima
When no I-state Improvement LP yields $\epsilon > 0$, a local maximum of the Bounded Policy Iteration (BPI) has been reached. The dual variables corresponding to the Improvement Constraints in Equation (6.43) again identify those belief states that are tangent to the backed-up value function. The process for adding I-states again involves forwarding the tangent beliefs one step and then checking whether the value of those forwarded beliefs can be improved. However, an additional check for the recurrence constraints has to be made if the involved I-state belongs to the $G_{ss}$ states of the FSC controller.
In addition, if an I-state is added to the FSC, it must also be assigned to either $G_{tr}$ or $G_{ss}$, because the next policy evaluation iteration depends on the I-state partitioning in the computation of $T^{PM_\phi}_{mod}$. The procedure for adding I-states is provided in Algorithm 6.4.
Algorithm 6.4 can be understood as follows. Assume that a tangent belief $b$ has been found for some I-state $g$.
Similar to Algorithm 6.2, instead of directly improving the value of the tangent belief, the algorithm tries to improve the value of forwarded beliefs reachable in one step from the tangent beliefs. This is given in Step 4 of Algorithm 6.4. Recall from Section 6.4.1 that when a new I-state is added, its successor states are chosen from the existing I-states. A similar approach is used in Algorithm 6.4.
However, a new node may be added to either $G_{tr}$ or $G_{ss}$, depending on the I-state that generated the original tangent belief. Recall that I-states in $G_{ss}$ have two additional constraints. First, no state in $G_{ss}$ can transition to any state in $G_{tr}$. This is enforced by limiting the successor state candidates in Steps 6-9. Second, for improving a node in $G_{ss}$, the allowed actions and transitions must satisfy the Poisson Equation constraints of Equation (6.43). This further prunes the possible successor candidates in Step 10, which is elaborated as a separate procedure in Algorithm 6.5. The rest of the procedure is identical to Algorithm 6.2, except for Step 20, in which any newly added I-state is placed in the correct partition, $G_{tr}$ or $G_{ss}$.
Algorithm 6.5 prevents any new I-state from choosing an action and successor I-state pair that may violate the Feasibility Constraints of Equation (6.43). In order to carry out this procedure, a phantom I-state $g_{phantom} \in G_{ss}$ is temporarily added to the current FSC for a pair $(g,\alpha) \in candidates$. Next, the modified transition distribution $T^{PM_\phi}_{mod,phantom}$ is computed using Equation (5.3), and the Poisson Equation is solved to obtain a new $\vec{g}$, which can be used to verify the Feasibility Constraint. If this constraint is violated, then $(g,\alpha)$ is removed from the set $candidates$. Note that the algorithm works on a copy of the original FSC and of the solution of the Poisson Equation computed at the last Policy Evaluation step. The addition of $g_{phantom}$ and the recomputation of the Poisson Equation are only used within Algorithm 6.5.
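The pruning step of Algorithm 6.5 can be sketched as follows. The FSC method add_phantom_istate and the helper solve_poisson_gain are hypothetical stand-ins for the temporary addition of $g_{phantom}$, the recomputation of $T^{PM_\phi}_{mod}$ via Equation (5.3), and the solution of the Poisson Equation.

import copy

def prune_candidates(fsc, candidates, iota_init_ss, solve_poisson_gain, tol=1e-9):
    # candidates   : set of (g, alpha) pairs proposed for a new I-state
    # iota_init_ss : initial distribution used in the Feasibility Constraint
    kept = set()
    for g, alpha in candidates:
        trial = copy.deepcopy(fsc)               # never modify the original FSC
        trial.add_phantom_istate(g, alpha)       # temporary g_phantom in G_ss
        g_vec = solve_poisson_gain(trial)        # gain vector of the Poisson Eq.
        if abs(iota_init_ss @ g_vec) <= tol:     # (iota_ss_init)^T g = 0 ?
            kept.add((g, alpha))
    return kept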
Algorithm 6.4 Adding I-states to Escape Local Maxima of Constrained Optimization Criterion