Generate Set To Visit Frequently - Natural σ-Algebra and Distributions over Countable Sets

A.4 Natural σ-Algebra and Distributions over Countable Sets

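Algorithm 4.1 Generate Set To Visit Frequently

Input: ϕ-feasible recurrent set $R_k$, Rabin acceptance pairs $\Omega^{\mathcal{PM}^{\phi}}$ of the product-POMDP

1: $Good_k = \emptyset$
2: for all $(Repeat_i^{\mathcal{PM}^{\phi}}, Avoid_i^{\mathcal{PM}^{\phi}}) \in \Omega^{\mathcal{PM}^{\phi}}$ do
3:     if $(Avoid_i^{\mathcal{PM}^{\phi}} \times G) \cap R_k = \emptyset$ then
4:         $Good_k = Good_k \cup \big( (Repeat_i^{\mathcal{PM}^{\phi}} \times G) \cap R_k \big)$
5:     end if
6: end for
7: return $Good_k$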
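The steps of the algorithm can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: it assumes global states are represented as pairs `(s, g)` and each Rabin pair as a tuple of two sets of product-POMDP states. The names `generate_good_set`, `recurrent_set`, and `rabin_pairs` are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Assumed representation: a global state is (product-POMDP state s, FSC node g).
GlobalState = Tuple[str, str]
# Assumed representation: a Rabin pair is (Repeat_i, Avoid_i), both sets of states s.
RabinPair = Tuple[FrozenSet[str], FrozenSet[str]]


def generate_good_set(
    recurrent_set: Set[GlobalState],
    rabin_pairs: Iterable[RabinPair],
) -> Set[GlobalState]:
    """Collect the states of R_k that project onto Repeat_i for every
    Rabin pair whose Avoid_i part does not intersect R_k."""
    good: Set[GlobalState] = set()
    for repeat_i, avoid_i in rabin_pairs:
        # (Avoid_i x G) ∩ R_k = ∅  <=>  no state of R_k projects into Avoid_i.
        if all(s not in avoid_i for s, _ in recurrent_set):
            # Good_k = Good_k ∪ ((Repeat_i x G) ∩ R_k)
            good |= {(s, g) for s, g in recurrent_set if s in repeat_i}
    return good


# Hypothetical data: the second pair is inconsistent with R_k because
# s3 belongs to its Avoid set and also appears in R_k.
example_recurrent_set = {("s3", "g1"), ("s4", "g2")}
example_pairs = [
    (frozenset({"s4"}), frozenset({"s2"})),
    (frozenset({"s3"}), frozenset({"s3"})),
]
example_good = generate_good_set(example_recurrent_set, example_pairs)
```

In this hypothetical example only the first pair contributes, so `example_good` contains the single global state `("s4", "g2")`.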

The algorithm identifies those Rabin acceptance pairs that are consistent with the recurrent set $R_k$, and then selects those states of $R_k$ that project onto the $Repeat^{\mathcal{PM}^{\phi}}$ part of the Rabin pair. Next, the goal is to ensure that at least some state(s) in the set $Good_k$ is (are) visited frequently in steady state. Recall that steady state implies that the path is already absorbed in $R_k$. The quantity of interest is given by the empirical, or pathwise, occupation measure from Equation (3.23) by setting $A \leftarrow Good_k$. Writing out the modified equation explicitly gives

\[
\pi^{(t)}(Good_k \mid s_0) = \frac{1}{t} \sum_{k=1}^{t} \mathbf{1}\big([s_k, g_k] \in Good_k\big), \qquad t = 1, 2, \ldots \tag{4.40}
\]

Then the expectation of the pathwise occupation measure is taken under the assumption that paths are already absorbed in $R_k$, which implies steady state behavior of the Markov chain. This is done by additionally taking the expectation with respect to an initial distribution $\iota^{ss}_{init}$ whose support is in $R_k$. The horizon is also taken to be infinite to reflect long term steady state. Formally, Equation (3.24) is modified in the following way to compute the expected pathwise occupation measure as the horizon goes to $\infty$.

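\[
\begin{aligned}
\lim_{t \to \infty} E\big[ \pi^{(t)}(Good_k) \mid \iota^{ss}_{init} \big]
&= \lim_{t \to \infty} E\Big[ \frac{1}{t} \sum_{k=1}^{t} \mathbf{1}\big([s_k, g_k] \in Good_k\big) \,\Big|\, \iota^{ss}_{init} \Big] \\
&= \sum_{[s,g] \in R_k} \iota^{ss}_{init}\big([s, g]\big) \Big( \lim_{t \to \infty} T^{(t)}(Good_k \mid [s, g]) \Big),
\end{aligned}
\tag{4.41}
\]

where $\sum_{[s,g] \in R_k} \iota^{ss}_{init}\big([s, g]\big) = 1$ ensures that $\mathrm{support}(\iota^{ss}_{init}) \subseteq R_k$.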

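Equation (4.41) can be rewritten using the vector and matrix representation of the above quantities as follows

\[
\begin{aligned}
\lim_{t \to \infty} E\big[ \pi^{(t)}(Good_k) \mid \iota^{ss}_{init} \big]
&= (\vec{\iota}^{\,ss}_{init})^{T} \Big( \lim_{t \to \infty} T^{(t)} \Big) \vec{1}^{\,S \times G}_{Good_k} \\
&= (\vec{\iota}^{\,ss}_{init})^{T} \, \Pi \, \vec{1}^{\,S \times G}_{Good_k},
\end{aligned}
\tag{4.42}
\]

where line 1 leads to line 2 using the limiting matrix, $\Pi$, as introduced in Definition 3.4.12.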
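The right-hand side of Equation (4.42) can be approximated numerically on a small chain. The sketch below is illustrative only. It assumes that the limit in (4.42) is the Cesàro (time-averaged) limit of the powers of a row-stochastic transition matrix restricted to $R_k$, which is the standard construction of a limiting matrix; Definition 3.4.12 is not reproduced here, so this reading is an assumption. The function names and the two example chains are hypothetical.

```python
from typing import List, Sequence


def mat_vec(T: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Multiply a row-stochastic matrix T by a column vector v."""
    return [sum(t_ij * v_j for t_ij, v_j in zip(row, v)) for row in T]


def expected_occupation(
    T: Sequence[Sequence[float]],
    iota: Sequence[float],
    good_indicator: Sequence[int],
    horizon: int,
) -> float:
    """Approximate iota^T * Pi * 1_Good by the finite Cesaro average
    (1/t) * sum_{k=1..t} iota^T T^k 1_Good."""
    v = [float(x) for x in good_indicator]  # T^0 applied to the indicator
    total = 0.0
    for _ in range(horizon):
        v = mat_vec(T, v)  # now holds T^k 1_Good
        total += sum(p * x for p, x in zip(iota, v))
    return total / horizon


# Hypothetical two-state recurrent classes; state 0 plays the role of Good_k.
T_periodic = [[0.0, 1.0], [1.0, 0.0]]   # period 2: T^k itself does not converge
T_aperiodic = [[0.9, 0.1], [0.5, 0.5]]  # stationary distribution (5/6, 1/6)

occ_periodic = expected_occupation(T_periodic, [1.0, 0.0], [1, 0], 1000)
occ_aperiodic = expected_occupation(T_aperiodic, [0.0, 1.0], [1, 0], 5000)
```

The periodic chain shows why a time average is used: its powers oscillate, yet the averaged occupation of state 0 is 0.5. For the aperiodic chain the average approaches the stationary probability 5/6 of state 0, independently of the initial distribution supported on the class.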

4.3.1 Equivalence to Expected Long Term Average Reward

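Proposition 4.3.1: Consider the reward structure over the global state space

\[
r([s, g]) =
\begin{cases}
1 & \text{if } [s, g] \in Good_k \\
0 & \text{otherwise.}
\end{cases}
\tag{4.43}
\]

Then the expected long term average reward is the same as the expected occupation measure of the set $Good_k$, i.e.,

\[
\eta_{av}(R_k) = \lim_{t \to \infty} E\Big[ \frac{1}{t} \sum_{k=0}^{t} r_k \,\Big|\, \iota^{ss}_{init} \Big]
= \lim_{t \to \infty} E\Big[ \pi^{(t)}(Good_k) \mid \iota^{ss}_{init} \Big]
\tag{4.44}
\]

where $r_k$ is the reward obtained at time step $k$.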
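A short argument for the proposition, written here as a sketch rather than taken from the source, follows directly from the definitions in (4.40) and (4.43). The only step not stated in the text is that the extra $k = 0$ term in the reward sum is bounded and therefore vanishes in the limit.

```latex
% With r_k = r([s_k, g_k]) = 1([s_k, g_k] \in Good_k) from (4.43):
\begin{align*}
\frac{1}{t} \sum_{k=0}^{t} r_k
  &= \frac{r_0}{t} + \frac{1}{t} \sum_{k=1}^{t} \mathbf{1}\big([s_k, g_k] \in Good_k\big)
   = \frac{r_0}{t} + \pi^{(t)}(Good_k \mid s_0). \\
\intertext{Since $0 \le r_0 \le 1$, taking expectations and letting $t \to \infty$ gives}
\lim_{t \to \infty} E\Big[ \frac{1}{t} \sum_{k=0}^{t} r_k \,\Big|\, \iota^{ss}_{init} \Big]
  &= \lim_{t \to \infty} \frac{E[\, r_0 \mid \iota^{ss}_{init} \,]}{t}
   + \lim_{t \to \infty} E\big[ \pi^{(t)}(Good_k) \mid \iota^{ss}_{init} \big]
   = \lim_{t \to \infty} E\big[ \pi^{(t)}(Good_k) \mid \iota^{ss}_{init} \big].
\end{align*}
```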

This is an important relationship, as the long term average reward and the computation of its gradient are studied extensively in the literature, especially for the case where the Markov chain is ergodic (or has a single recurrent class) [1, 10]. Our restriction on the support of the initial distribution ensures that the Markov chain evolves exclusively in a single recurrent class $R_k$, and therefore the methods described in these works can be directly utilized. The derivations of the gradients $\nabla_{\Theta}\, \eta_{av}(R_k)$ and $\nabla_{\Phi}\, \eta_{av}(R_k)$ are skipped, and the reader is referred to [1] for a detailed description. The complexity of evaluating the gradient is summarized from [1] here to complete the view of the computational burden of the gradient ascent methodology.

4.3.1.1 Complexity of Computing Gradient of $\eta_{av}(R_k)$

From [1], in the worst case, the complexity of computing $\nabla_{\Phi}\, \eta_{av}(R_k)$ is given by $O(|S|^2 |G|^2 |\Phi| |Act| |O|)$, similar to the gradient of the absorption probability. The reduced practical complexity for sparse transition and observation functions applies as well and is given by $O(c\,|S||G||\Phi||Act|)$ with $c \ll |G||G||O|$.

The gradient of $\eta_{av}(R_k)$ w.r.t. $\Theta$ is zero.

4.4 Trade Off between Absorption Probability and Visitation

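corresponding optimizing parameters are given by $G^{*}(R_k) = \{\Phi^{*}(R_k), \Theta^{*}(R_k)\}$. Then the optimum value is taken to be

\[
\max_{R_k \subseteq \phi\text{-}RecSets^{G}} \Gamma^{*}(R_k). \tag{4.49}
\]

The optimum controller is given by $G^{*}(R^{*}_k) = \{\Phi^{*}(R^{*}_k), \Theta^{*}(R^{*}_k)\}$, where

\[
R^{*}_k = \operatorname*{argmax}_{R_k \subseteq \phi\text{-}RecSets^{G}} \Gamma^{*}(R_k). \tag{4.50}
\]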

4.5 Heuristic Search for FSC Structures with a ϕ-Feasible Recurrent Set

In this section it is shown how, given a fixed size $|G|$, candidate FSCs that yield at least one ϕ-feasible recurrent set can be generated. This problem is hard in itself [33]: the hardness arises from partial observability, under which possibly unbounded sequences of actions and observations may be required to ensure that some states are never visited. The heuristic described in this section, however, restricts the search to outcomes that can be inferred from the single most recent observation and action. The proposed method is therefore incomplete: a solution may exist that the algorithm is unable to find. The details of this heuristic search are given in Algorithm 4.2.

In order to understand the algorithm, the reader is pointed to the example in Figure 4.3. It shows a part of a global Markov chain, such that the underlying product POMDP has only one Rabin acceptance pair. The global state $[s_4, g_4]$, denoted in green in Figure 4.3, is the only global state whose projection $s_4 \in Repeat_1^{\mathcal{PM}^{\phi}}$. The global state $[s_2, g_2]$, denoted in red, is such that $s_2 \in Avoid_1^{\mathcal{PM}^{\phi}}$. There are two communicating classes, $C_1$ and $C_2$, in the global Markov chain s.t. $C_1 \to C_2$, and $C_2$ is absorbing. Therefore the states in $Bad = \{[s_2, g_2], [s_5, g_5]\}$ need to be made unreachable in steady state, while ensuring that some state in $Good = \{[s_4, g_4]\}$ is recurrent. The former is done in steps 14-15 of Algorithm 4.2 by disallowing actions which lead to bad states under the latest observation. Recurrence of some state in $Good$ is ensured in steps 19-20. This recurrence may not always be guaranteed, because disconnecting states in $Bad$ by removing actions may change the communication properties of other global states. The check for $\sum_{g_l \in G,\, \alpha \in Act} I_{\omega}(g_l, \alpha \mid g_k, o) > 0$ in steps 9 and 17 makes sure that the modification in $I_{\omega}$ does not yield an inadmissible structure as defined in Equation (4.3).

[Figure 4.3: Generating Admissible Structures of FSC. The figure shows part of a global Markov chain over the global states $[s_1, g_1]$ through $[s_7, g_7]$, with node labels Good and Bad and the communicating class label $C_1$.] There are two communicating classes $C_1$ and $C_2$. In steady state it must be ensured that the green node is recurrent, while the red node is never visited, i.e., $[s_2, g_2]$ must be disconnected from $[s_1, g_1]$. Since $C_1$ can lead to $C_2$, which is absorbing, the communication between $[s_1, g_1]$ and $[s_5, g_5]$ needs to be severed. Thus, in Algorithm 4.2, $Good = \{[s_4, g_4]\}$, and $Bad = \{[s_2, g_2], [s_5, g_5]\}$.
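The goal described for the Figure 4.3 example can be illustrated with a small graph computation. The sketch below is not Algorithm 4.2, and the edge set is hypothetical, since the edges of Figure 4.3 are not recoverable from the text. Nodes 1 to 7 stand for $[s_1, g_1]$ to $[s_7, g_7]$, and transitions are cut directly, whereas the thesis removes them indirectly by disallowing actions in $I_{\omega}$. It only shows how severing the transitions into the bad states can turn a transient class into a closed class that contains the good state.

```python
from typing import Dict, FrozenSet, List, Set

Graph = Dict[int, List[int]]  # node -> successors with positive probability


def reachable(graph: Graph, start: int) -> Set[int]:
    """All nodes reachable from start (including start) by depth-first search."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def communicating_classes(graph: Graph) -> List[FrozenSet[int]]:
    """Group nodes that can reach each other (fine for small examples)."""
    reach = {u: reachable(graph, u) for u in graph}
    classes: List[FrozenSet[int]] = []
    assigned: Set[int] = set()
    for u in graph:
        if u not in assigned:
            cls = frozenset(v for v in reach[u] if u in reach[v])
            classes.append(cls)
            assigned |= cls
    return classes


def is_closed(graph: Graph, cls: FrozenSet[int]) -> bool:
    """A class is closed (recurrent in a finite chain) if no edge leaves it."""
    return all(v in cls for u in cls for v in graph[u])


# Hypothetical edges: C1 = {1, 2, 3, 4} leads to the closed class C2 = {5, 6, 7}.
before: Graph = {1: [2, 3, 5], 2: [1], 3: [4], 4: [1], 5: [6], 6: [7], 7: [5]}
# Sever 1 -> 2 (red node) and 1 -> 5 (entry into C2), as the caption requires.
after: Graph = {1: [3], 2: [1], 3: [4], 4: [1], 5: [6], 6: [7], 7: [5]}

good, bad = {4}, {2, 5}
classes_before = communicating_classes(before)
classes_after = communicating_classes(after)
feasible = [c for c in classes_after if is_closed(after, c) and c & good and not c & bad]
```

With these assumed edges, `{1, 2, 3, 4}` is not closed before the cut, while after the cut `{1, 3, 4}` is a closed class that contains the good node 4 and no bad node.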

Note that no discussion about $I_{\kappa}$ has been made in the context of feasibility. This is because setting $I_{\kappa}(g) = 1, \ \forall g \in G$, is sufficient, and this choice does not affect the ϕ-feasibility.

4.5.1 Complexity

Algorithm 4.2 presents two main sources of computational complexity. The first is the computation of strongly connected components. For a graph, these components can be found with effort
