Algorithm 6.3 Bounded Policy Iteration for Conservative Optimization Criterion
Input: (a) An initial feasible FSC, $\mathcal{G}$, with I-states $G = \{G_{tr}, G_{ss}\}$, such that $\eta^{av}_{ssd}(r) = 0$. (b) Maximum size of the FSC, $N_{max}$. (c) $N_{new} \le N_{max}$, the number of I-states to attempt to add when escaping a local maximum.
1: $improved \gets True$
2: Compute the value vectors $\vec{V}_\beta$ of the discounted reward criterion $\eta_\beta$ as in Equation (6.23), or the efficient approximation in Section 6.3.1.
3: while $|G| \le N_{max}$ and $improved = True$ do
4:    $improved \gets False$
5:    for all I-states $g \in G$ do
6:        Set up the Constrained Improvement LP as in Section 6.5.1.
7:        if the Improvement LP results in an optimal $\epsilon > 0$ then
8:            Replace the parameters for I-state $g$.
9:            $improved \gets True$
10:           Compute the value vectors $\vec{V}^g_\beta$ of the discounted reward criterion $\eta_\beta$ as in Equation (6.23), or the efficient approximation in Section 6.3.1.
11:       end if
12:   end for
13:   if $improved = False$ and $|G| < N_{max}$ then
14:       $n_{added} \gets 0$
15:       $N'_{new} \gets \min(N_{new}, N_{max} - |G|)$
16:       Try to add $N'_{new}$ I-state(s) to $\mathcal{G}$ according to the constrained DP backup in Section 6.5.3.
17:       $n_{added} \gets$ actual number of I-states added in the previous step.
18:       if $n_{added} > 0$ then
19:           $improved \gets True$
20:           Compute the value vectors $\vec{V}^g_\beta$ of the discounted reward criterion $\eta_\beta$ as in Equation (6.23), or the efficient approximation in Section 6.3.1.
21:       end if
22:   end if
23: end while
Output: $\mathcal{G}$
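To make the control flow of Algorithm 6.3 concrete, the following is a minimal Python sketch of its outer loop. The helper functions evaluate_discounted_value, improve_istate, and add_istates are hypothetical placeholders for the policy evaluation step (Equation (6.23) or Section 6.3.1), the node improvement LP (Section 6.5.1), and the constrained DP backup (Section 6.5.3), respectively; they are not part of the thesis implementation.

# Sketch only: the FSC object and the three helper functions are assumed
# interfaces, not part of the text.
def bounded_policy_iteration(fsc, N_max, N_new):
    improved = True
    V_beta = evaluate_discounted_value(fsc)            # Eq. (6.23) / Sec. 6.3.1
    while len(fsc.istates) <= N_max and improved:
        improved = False
        for g in list(fsc.istates):
            # Constrained node improvement (Section 6.5.1)
            eps, new_params = improve_istate(fsc, g, V_beta)
            if eps > 0:
                fsc.set_params(g, new_params)
                improved = True
                V_beta = evaluate_discounted_value(fsc)
        if not improved and len(fsc.istates) < N_max:
            n_to_add = min(N_new, N_max - len(fsc.istates))
            # Constrained DP backup to escape the local maximum (Section 6.5.3)
            n_added = add_istates(fsc, V_beta, n_to_add)
            if n_added > 0:
                improved = True
                V_beta = evaluate_discounted_value(fsc)
    return fsc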
6.5.1 Node Improvement
The first observation is that the search over $\kappa$ can be dropped. This simplification occurs because the initial node is chosen by computing the best valued node for the initial belief, i.e.,
$$
\kappa(g_{init}) = 1, \quad \text{where } g_{init} = \operatorname*{argmax}_{g} \big(\vec{\iota}^{\,\phi}_{init}\big)^T \vec{V}^g_\beta. \qquad (6.40)
$$
Once this initial node has been selected, the above objective only differs from the typical discounted reward maximization problem of the previous sections because of the presence of the new constraint
$$
\eta^{av}_{ssd}(r) = 0, \qquad (6.41)
$$
which must be incorporated into the optimization algorithm. Using Theorem 6.1.3, the above constraint can be rewritten as
$$
\eta^{av}_{ssd}(r) = 0 \iff \big(\vec{\iota}^{\,ssd}_{init}\big)^T \vec{g} = 0, \qquad (6.42)
$$
where $\vec{g}$ uniquely solves the Poisson Equation (6.2). This allows the node improvement to be written as a bilinear program. Again, one node $g$ is improved at a time, while holding all other nodes constant, as follows.
I-state Improvement Bilinear Program:
$$
\max_{\epsilon,\;\omega(g',\alpha|g,o),\;\vec{g},\;\vec{V}_{av}} \;\; \epsilon \qquad \text{subject to}
$$
Improvement Constraints:
$$
V_\beta([s,g]) + \epsilon \;\le\; r_\beta(s) + \beta \sum_{s',g',\alpha,o} O(o|s)\,\omega(g',\alpha|g,o)\,T^{PM_\phi}(s'|s,\alpha)\,V_\beta([s',g']) \quad \forall s
$$
Poisson Equation (if $g \in G_{ss}$):
$$
\vec{V}_{av} + \vec{g} \;=\; \vec{r}_{av} + T^{PM_\phi}_{mod}\vec{V}_{av}, \qquad
\vec{g} \;=\; T^{PM_\phi}_{mod}\,\vec{g}
$$
Feasibility Constraints (if $g \in G_{ss}$):
$$
\big(\vec{\iota}^{\,ss}_{init,g}\big)^T \vec{g} \;=\; 0
$$
FSC Structure Constraints (if $g \in G_{ss}$):
$$
\omega(g',\alpha|g,o) \;=\; 0 \quad \text{if } g' \in G_{tr}
$$
Probability Constraints:
$$
\sum_{g',\alpha} \omega(g',\alpha|g,o) \;=\; 1 \quad \forall o, \qquad
\omega(g',\alpha|g,o) \;\ge\; 0 \quad \forall g',\alpha,o \qquad (6.43)
$$
Note that a node in $G_{tr}$ does not have to guarantee that Product-POMDP states are not allowed to visit $Avoid^{PM_\phi}$, and hence the extra Poisson Equation and Feasibility Constraints that appear above need only be applied to I-states $g \in G_{ss}$. Further, the FSC Structure Constraints ensure that once the execution has transitioned to steady state, the I-states in $G_{tr}$ can no longer be visited.
They are in fact a reduction in the number of unknown variables. Next, the Poisson Equation constraints introduce bilinearity into the optimization. This is because the term $T^{PM_\phi}_{mod}$, which is linear in $\omega(g',\alpha|g,o)$, is multiplied by the unknowns $\vec{V}_{av}$ and $\vec{g}$ in the two sets of constraints that form the Poisson Equation.
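For a fixed choice of $\omega$ (for example, at the end of a policy evaluation step), the Feasibility Constraint $(\vec{\iota}^{\,ss}_{init,g})^T\vec{g} = 0$ can be checked directly by solving the Poisson Equation numerically. The following NumPy sketch illustrates such a check under the assumption that the global transition matrix and reward vector have already been assembled; the function name and its arguments are illustrative only.

import numpy as np

def check_ssd_feasibility(T_mod, r_ssd, iota_init, tol=1e-9):
    # T_mod     : (N, N) row-stochastic global transition matrix under the FSC
    # r_ssd     : (N,)   per-state reward whose long-run average must vanish
    # iota_init : (N,)   initial distribution over global states
    N = T_mod.shape[0]
    I = np.eye(N)
    # Stack the two Poisson constraints:  (I - T) g = 0  and  g + (I - T) V = r.
    # The gain vector g is unique; the bias V is only free up to harmonic terms,
    # so a least-squares solve of this consistent system recovers g.
    A = np.block([[I - T_mod, np.zeros((N, N))],
                  [I,         I - T_mod]])
    b = np.concatenate([np.zeros(N), r_ssd])
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    g = sol[:N]
    return abs(iota_init @ g) <= tol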
6.5.2 Convex Relaxation of Bilinear Terms
Bilinear problems are in general hard to solve [27], unless they are equivalent to positive semidefinite or second-order cone programs, which would make the problem convex. Neither of these convexity assumptions holds for the bilinear constraints in Equation (6.43). However, several convex relaxation schemes exist for bilinear problems. Two of the most popular methods are the Reformulation-Linearization Technique (RLT) [132, 134] and the more recent SDP relaxation techniques used in [133, 136]. A good overview of these two methods can be found in [124]. This thesis utilizes a linear relaxation resulting from the RLT, summarized below, to obtain a possibly sub-optimal solution at each improvement step.
While RLT can be applied to a wide range of problems, including discrete combinatorial problems, it is introduced here for the case of Quadratically Constrained Quadratic Problems (QCQPs) over unknowns $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$. The notation follows [124] closely. A QCQP can be written as
$$
\begin{array}{ll}
\max & x^T Q_0 x + a_0^T x + b_0^T y \\
\text{s.t.} & x^T Q_k x + a_k^T x + b_k^T y \le c_k \quad \text{for } k = 1,2,\ldots,p, \\
& l_{x_i} \le x_i \le u_{x_i} \quad \text{for } i = 1,2,\ldots,n, \\
& l_{y_j} \le y_j \le u_{y_j} \quad \text{for } j = 1,2,\ldots,m.
\end{array}
\qquad (6.44)
$$
RLT is carried out as follows. For each pair $x_i, x_j$ such that the product term $x_i x_j$ appears with a nonzero coefficient in either the objective or the constraints, a new variable $X_{ij}$ is introduced, which replaces the product $x_i x_j$ in the problem. In addition, the bounds $l_{x_i}, l_{x_j}, u_{x_i}, u_{x_j}$ are utilized to produce four new linear constraints
$$
\begin{array}{rcl}
X_{ij} - l_{x_i} x_j - l_{x_j} x_i & \ge & -\,l_{x_i} l_{x_j}, \\
X_{ij} - u_{x_i} x_j - u_{x_j} x_i & \ge & -\,u_{x_i} u_{x_j}, \\
X_{ij} - l_{x_i} x_j - u_{x_j} x_i & \le & -\,l_{x_i} u_{x_j}, \\
X_{ij} - u_{x_i} x_j - l_{x_j} x_i & \le & -\,u_{x_i} l_{x_j}.
\end{array}
\qquad (6.45)
$$
The above constraints are the McCormick convex envelopes [97]. For bilinear programs with bounded variables, the McCormick envelopes are used within algorithms such as branch and bound [83] to obtain successively tighter relaxations and, ultimately, globally optimal solutions. An efficient solver that incorporates this methodology is [89].
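As a small illustration of Equation (6.45), the sketch below relaxes the toy problem of maximizing a single product $x\,y$ over a box: the product is replaced by an auxiliary variable $w$ constrained by the four McCormick envelopes, and the resulting LP is solved with SciPy. The setup is illustrative only and not part of the thesis implementation.

import numpy as np
from scipy.optimize import linprog

def mccormick_relaxation(lx, ux, ly, uy):
    # LP relaxation of max x*y over a box, with w standing in for x*y.
    # Variable order: [x, y, w]. Each row of A_ub @ z <= b_ub is one
    # McCormick envelope from Equation (6.45).
    A_ub = np.array([
        [ ly,  lx, -1.0],   #  w >= lx*y + ly*x - lx*ly
        [ uy,  ux, -1.0],   #  w >= ux*y + uy*x - ux*uy
        [-uy, -lx,  1.0],   #  w <= lx*y + uy*x - lx*uy
        [-ly, -ux,  1.0],   #  w <= ux*y + ly*x - ux*ly
    ])
    b_ub = np.array([lx * ly, ux * uy, -lx * uy, -ux * ly])
    c = np.array([0.0, 0.0, -1.0])          # maximize w == minimize -w
    bounds = [(lx, ux), (ly, uy), (None, None)]
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)

# Example: on the box [0, 1] x [0, 1] the relaxation attains w = 1 at x = y = 1,
# which here coincides with the true maximum of x*y.
print(mccormick_relaxation(0.0, 1.0, 0.0, 1.0).x)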
The following shows how to relax the bilinear constraints appearing in the Poisson Equation of the I-state Improvement Bilinear Program in Equation (6.43). Note first that while the Poisson Equation spans all global states, i.e., there are $|S| \times |G|$ equations for each of the two sets of Poisson constraints, only the equations pertaining to the current I-state under consideration have bilinear terms. To see this, rewrite the Poisson constraints in block form,
$$
\begin{bmatrix} \vec{V}^{g_1}_{av} \\ \vdots \\ \vec{V}^{g}_{av} \\ \vdots \\ \vec{V}^{g_{|G|}}_{av} \end{bmatrix}
+
\begin{bmatrix} \vec{g}_{g_1} \\ \vdots \\ \vec{g}_{g} \\ \vdots \\ \vec{g}_{g_{|G|}} \end{bmatrix}
=
\begin{bmatrix} \vec{r}^{\,g_1}_{av} \\ \vdots \\ \vec{r}^{\,g}_{av} \\ \vdots \\ \vec{r}^{\,g_{|G|}}_{av} \end{bmatrix}
+
\begin{bmatrix} T^{PM_\phi}_{mod,g_1} \\ \vdots \\ T^{PM_\phi}_{mod,g} \\ \vdots \\ T^{PM_\phi}_{mod,g_{|G|}} \end{bmatrix}
\begin{bmatrix} \vec{V}^{g_1}_{av} \\ \vdots \\ \vec{V}^{g}_{av} \\ \vdots \\ \vec{V}^{g_{|G|}}_{av} \end{bmatrix}
\quad\text{and}\quad
\begin{bmatrix} \vec{g}_{g_1} \\ \vdots \\ \vec{g}_{g} \\ \vdots \\ \vec{g}_{g_{|G|}} \end{bmatrix}
=
\begin{bmatrix} T^{PM_\phi}_{mod,g_1} \\ \vdots \\ T^{PM_\phi}_{mod,g} \\ \vdots \\ T^{PM_\phi}_{mod,g_{|G|}} \end{bmatrix}
\begin{bmatrix} \vec{g}_{g_1} \\ \vdots \\ \vec{g}_{g} \\ \vdots \\ \vec{g}_{g_{|G|}} \end{bmatrix},
\qquad (6.46)
$$
it can be seen that the bilinearity arises because the block row $T^{PM_\phi}_{mod,g}$ corresponding to the current I-state $g$, whose entries are linear in the unknowns $\omega(\cdot,\cdot|g,\cdot)$, is multiplied by $\vec{V}_{av}$ and $\vec{g}$, respectively. The other block rows of $T^{PM_\phi}_{mod}$ are not functions of the unknowns, and their values are taken from the previous policy evaluation step. The total number of bilinear terms in both sets of equations is given by $2 \times |S||O||G||Act|$.
Moreover, applying the convex relaxation requires that all terms appearing in bilinear products have finite bounds. For the unknowns $\omega(\cdot,\cdot|g,\cdot)$, $\vec{g}$ and $\vec{V}_{av}$, these bounds are given by
$$
\vec{0} \;\le\; \omega(\cdot,\cdot|g,\cdot) \;\le\; \vec{1}, \qquad
\vec{0} \;\le\; \vec{g} \;\le\; \vec{1}, \qquad
-M_1\vec{1} \;\le\; \vec{V}_{av} \;\le\; M_2\vec{1},
\qquad (6.47)
$$
where $M_1, M_2$ are large positive constants that are manually selected. Manual selection is needed because the feasibility set for $\vec{V}_{av}$ depends on the eigenvalues of $I - T^{PM_\phi}_{mod}$ [94], which is difficult to represent in terms of the optimization variables. During numerical implementation, this issue was not found to adversely affect the solution quality. This may be due to the fact that $\vec{V}_{av}$ appears neither in the objective nor in the Feasibility Constraints of Equation (6.43). In fact, the choice of $\omega$ only constrains the value of $\vec{g}$, whereas given $\omega$ and $\vec{g}$ a feasible value of $\vec{V}_{av}$ can always be found. A rigorous analysis of the effect of the choice of bounds for $\vec{V}_{av}$ remains to be carried out.
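As an illustration of how the relaxation is instantiated for Equation (6.43), the sketch below enumerates one auxiliary (RLT) variable per bilinear product, together with the bounds of Equation (6.47); each auxiliary variable would then receive four McCormick constraints as in Equation (6.45). The index sets and names are placeholders, not the thesis implementation.

from itertools import product

def rlt_aux_variables(S, O, G, Act, M1, M2):
    # One auxiliary variable per bilinear product in the Poisson constraints,
    # stored with the bounds (l_x, u_x, l_y, u_y) needed for Equation (6.45).
    aux = {}
    for s, o, g_next, a in product(S, O, G, Act):
        # omega(g',a | g,o) in [0,1] times g(s,g') in [0,1]
        aux[('omega_x_g', s, o, g_next, a)] = (0.0, 1.0, 0.0, 1.0)
        # omega(g',a | g,o) in [0,1] times V_av(s,g') in [-M1, M2]
        aux[('omega_x_V', s, o, g_next, a)] = (0.0, 1.0, -M1, M2)
    return aux

# 2 * |S| * |O| * |G| * |Act| auxiliary variables, matching the count above.
aux = rlt_aux_variables(range(4), range(2), range(3), range(2), 1e3, 1e3)
print(len(aux))   # 2 * 4 * 2 * 3 * 2 = 96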
6.5.3 Addition of I-States to Escape Local Maxima
When no I-state Improvement LP yields $\epsilon > 0$, a local maximum of the Bounded Policy Iteration (BPI) has been reached. The dual variables corresponding to the Improvement Constraints in Equation (6.43) again identify those belief states that are tangent to the backed-up value function. The process for adding I-states again involves forwarding the tangent beliefs one step and then checking whether the value of those forwarded beliefs can be improved. However, an additional check for the recurrence constraints has to be made if the involved I-state belongs to the $G_{ss}$ states of the FSC controller.
In addition, if an I-state is added to the FSC, it must also be assigned to either $G_{tr}$ or $G_{ss}$, because the next policy evaluation iteration depends on the I-state partitioning in the computation of $T^{PM_\phi}_{mod}$. The procedure for adding I-states is provided in Algorithm 6.4.
Algorithm 6.4 can be understood as follows. Assume that a tangent belief $b$ has been found for some I-state $g$.
Similar to Algorithm 6.2, instead of directly improving the value of the tangent belief, the algorithm tries to improve the value of forwarded beliefs reachable in one step from the tangent beliefs. This is given in Step 4 of Algorithm 6.4. Recall from Section 6.4.1 that when a new I-state is added, its successor states are chosen from the existing I-states. A similar approach is used in Algorithm 6.4.
However, a new node may be added to either $G_{tr}$ or $G_{ss}$, depending on the I-state that generated the original tangent belief. Recall that I-states in $G_{ss}$ have two additional constraints. First, no state in $G_{ss}$ can transition to any state in $G_{tr}$. This is enforced by limiting the successor state candidates in Steps 6-9. Second, for improving a node in $G_{ss}$, the allowed actions and transitions must satisfy the Poisson Equation constraints of Equation (6.43). This further prunes the possible successor candidates in Step 10, which is elaborated as a separate procedure in Algorithm 6.5. The rest of the procedure is identical to Algorithm 6.2, except for Step 20, in which any newly added I-state is placed in the correct partition, $G_{tr}$ or $G_{ss}$.
Algorithm 6.5 prevents any new I-state from choosing an action and successor I-state pair that may violate the Feasibility Constraints of Equation (6.43). In order to carry out this procedure, a phantom I-state $g_{phantom} \in G_{ss}$ is temporarily added to the current FSC for a pair $(g,\alpha) \in candidates$. Next, the modified transition distribution $T^{PM_\phi}_{mod,phantom}$ is computed using Equation (5.3), and the Poisson Equation is solved to obtain a new $\vec{g}$, which can be used to verify the Feasibility Constraint. If this constraint is violated, then $(g,\alpha)$ is removed from the set $candidates$. Note that the algorithm works on a copy of the original FSC and of the solution of the Poisson Equation computed at the last Policy Evaluation step. The addition of $g_{phantom}$ and the recomputation of the Poisson Equation are only used within Algorithm 6.5.
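The pruning step of Algorithm 6.5 can be sketched as follows. The FSC method add_phantom_istate and the helper solve_poisson_gain are hypothetical stand-ins for the temporary addition of $g_{phantom}$, the recomputation of $T^{PM_\phi}_{mod}$ via Equation (5.3), and the solution of the Poisson Equation.

import copy

def prune_candidates(fsc, candidates, iota_init_ss, solve_poisson_gain, tol=1e-9):
    # candidates   : set of (g, alpha) pairs proposed for a new I-state
    # iota_init_ss : initial distribution used in the Feasibility Constraint
    kept = set()
    for g, alpha in candidates:
        trial = copy.deepcopy(fsc)               # never modify the original FSC
        trial.add_phantom_istate(g, alpha)       # temporary g_phantom in G_ss
        g_vec = solve_poisson_gain(trial)        # gain vector of the Poisson Eq.
        if abs(iota_init_ss @ g_vec) <= tol:     # (iota_ss_init)^T g = 0 ?
            kept.add((g, alpha))
    return kept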
Algorithm 6.4 Adding I-states to Escape Local Maxima of Constrained Optimization Criterion