A.4 Natural σ-Algebra and Distributions over Countable Sets
Algorithm 4.1 Generate Set To Visit Frequently
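Input: $\phi$-feasible recurrent set $R_k$, Rabin acceptance pairs of product-POMDP $\Omega^{PM^{\phi}}$
1: $\mathrm{Good}_k = \emptyset$
2: for all $(\mathrm{Repeat}_i^{PM^{\phi}}, \mathrm{Avoid}_i^{PM^{\phi}}) \in \Omega^{PM^{\phi}}$ do
3:     if $(\mathrm{Avoid}_i^{PM^{\phi}} \times G) \cap R_k = \emptyset$ then
4:         $\mathrm{Good}_k = \mathrm{Good}_k \cup \big( (\mathrm{Repeat}_i^{PM^{\phi}} \times G) \cap R_k \big)$
5:     end if
6: end for
7: return $\mathrm{Good}_k$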
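The set construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the encoding of a global state $[s, g]$ as a `(state, node)` tuple, the type aliases, and the function name `generate_good_set` are assumptions made for the example and do not come from the original text.

```python
from typing import Iterable, Set, Tuple

# Hypothetical encoding: a global state [s, g] is a (product state, FSC node) tuple.
GlobalState = Tuple[str, str]
# A Rabin pair (Repeat_i, Avoid_i), each given as a set of product states.
RabinPair = Tuple[Set[str], Set[str]]


def generate_good_set(recurrent_set: Set[GlobalState],
                      rabin_pairs: Iterable[RabinPair]) -> Set[GlobalState]:
    """Collect the states of R_k that project onto Repeat_i for every
    Rabin pair whose Avoid_i part does not intersect R_k."""
    good: Set[GlobalState] = set()
    for repeat_i, avoid_i in rabin_pairs:
        # (Avoid_i x G) ∩ R_k = ∅ means no state of R_k projects onto Avoid_i.
        hits_avoid = any(s in avoid_i for s, _ in recurrent_set)
        if not hits_avoid:
            # (Repeat_i x G) ∩ R_k: states of R_k projecting onto Repeat_i.
            good |= {(s, g) for (s, g) in recurrent_set if s in repeat_i}
    return good


# Made-up example data: the second pair is rejected because s4 is in its Avoid set.
r_k = {("s3", "g3"), ("s4", "g4")}
pairs = [({"s4"}, {"s2"}), ({"s3"}, {"s4"})]
print(generate_good_set(r_k, pairs))  # {('s4', 'g4')}
```

Projecting onto the first component of each tuple plays the role of intersecting with $\mathrm{Avoid}_i \times G$ or $\mathrm{Repeat}_i \times G$, so the FSC node set $G$ never has to be enumerated explicitly.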
The algorithm identifies those Rabin acceptance pairs that are consistent with the recurrent set $R_k$, and then selects those states from $R_k$ that project onto the $\mathrm{Repeat}^{PM^{\phi}}$ part of the Rabin pair. Next, the goal is to ensure that at least one state in the set $\mathrm{Good}_k$ is visited frequently in steady state. Recall that steady state implies that the path is already absorbed in $R_k$. The quantity of interest is given by the empirical or pathwise occupation measure from Equation (3.23) by setting $A \leftarrow \mathrm{Good}_k$. Writing out the modified equation explicitly gives
$$\pi^{(t)}(\mathrm{Good}_k \mid s_0) = \frac{1}{t} \sum_{k=1}^{t} \mathbf{1}\big([s_k, g_k] \in \mathrm{Good}_k\big), \qquad t = 1, 2, \ldots \tag{4.40}$$
Then the expectation of the pathwise occupation measure is taken with the assumption that paths are already absorbed in $R_k$, thus implying steady state behavior of the Markov chain. This can be done by additionally taking the expectation with respect to an initial distribution $\iota^{ss}_{init}$ whose support is in $R_k$. Additionally, the horizon is taken to be infinite to reflect long term steady state. Formally, this is done by modifying Equation (3.24) in the following way to compute the expected pathwise occupation measure as the horizon goes to $\infty$.
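$$\begin{aligned}
\lim_{t \to \infty} E\big[\pi^{(t)}(\mathrm{Good}_k) \mid \iota^{ss}_{init}\big]
&= \lim_{t \to \infty} E\Big[\frac{1}{t} \sum_{k=1}^{t} \mathbf{1}\big([s_k, g_k] \in \mathrm{Good}_k\big) \,\Big|\, \iota^{ss}_{init}\Big] \\
&= \sum_{[s,g] \in R_k} \iota^{ss}_{init}\big([s, g]\big) \Big( \lim_{t \to \infty} T^{(t)}(\mathrm{Good}_k \mid [s, g]) \Big),
\end{aligned} \tag{4.41}$$
where $\sum_{[s,g] \in R_k} \iota^{ss}_{init}\big([s, g]\big) = 1$ ensures that $\mathrm{support}(\iota^{ss}_{init}) \subseteq R_k$.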
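Equation (4.41) can be rewritten using the vector and matrix representation of the above quantities as follows
$$\begin{aligned}
\lim_{t \to \infty} E\big[\pi^{(t)}(\mathrm{Good}_k) \mid \iota^{ss}_{init}\big]
&= (\vec{\iota}^{\,ss}_{init})^{T} \Big( \lim_{t \to \infty} T^{(t)} \Big) \vec{1}^{\,S \times G}_{\mathrm{Good}_k} \\
&= (\vec{\iota}^{\,ss}_{init})^{T} \, \Pi \, \vec{1}^{\,S \times G}_{\mathrm{Good}_k},
\end{aligned} \tag{4.42}$$
where line 1 leads to line 2 using the limiting matrix, $\Pi$, as introduced in Definition 3.4.12.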
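The matrix form of the expected occupation measure can be evaluated numerically on a small chain. The sketch below is illustrative only: it assumes that the limiting matrix can be approximated by a finite Cesàro average of powers of the transition matrix, and the two-state chain, the function names, and the horizon are made up for the example.

```python
import numpy as np


def limiting_matrix(T: np.ndarray, horizon: int = 5000) -> np.ndarray:
    """Approximate the limiting matrix by the Cesaro average
    (1/t) * sum_{k=1..t} T^k (an assumption of this sketch)."""
    n = T.shape[0]
    power = np.eye(n)
    acc = np.zeros((n, n))
    for _ in range(horizon):
        power = power @ T  # T^k
        acc += power
    return acc / horizon


def expected_occupation(iota: np.ndarray, T: np.ndarray,
                        good_mask: np.ndarray, horizon: int = 5000) -> float:
    """Evaluate iota^T * Pi * 1_Good, with Pi replaced by its approximation."""
    return float(iota @ limiting_matrix(T, horizon) @ good_mask)


# Hypothetical two-state chain standing in for a single recurrent class.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
iota = np.array([1.0, 0.0])       # initial distribution supported on the class
good_mask = np.array([0.0, 1.0])  # indicator vector of the Good states
value = expected_occupation(iota, T, good_mask)
print(round(value, 3))  # 0.167
```

For this chain the stationary distribution is $(5/6, 1/6)$, so the printed value agrees with the stationary mass of the single Good state.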
4.3.1 Equivalence to Expected Long Term Average Reward
Proposition 4.3.1: Consider the reward structure over the global state space
$$r([s, g]) = \begin{cases} 1 & \text{if } [s, g] \in \mathrm{Good}_k \\ 0 & \text{otherwise.} \end{cases} \tag{4.43}$$
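Then the expected long term average reward is the same as the expected occupation measure of the set $\mathrm{Good}_k$, i.e.,
$$\eta_{av}(R_k) = \lim_{t \to \infty} E\Big[\frac{1}{t} \sum_{k=0}^{t} r_k \,\Big|\, \iota^{ss}_{init}\Big] = \lim_{t \to \infty} E\big[\pi^{(t)}(\mathrm{Good}_k) \mid \iota^{ss}_{init}\big] \tag{4.44}$$
where $r_k$ is the reward obtained at time step $k$.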
This is an important relationship, as the long term average reward and the computation of its gradient are studied extensively in the literature, especially for the case of the Markov chain being ergodic (or having a single recurrent class) [1, 10]. Our restriction on the support of the initial distribution ensures that the Markov chain evolves exclusively in a single recurrent class $R_k$, and therefore the methods described in these works can be directly utilized. The derivations of the gradients $\nabla_{\Theta} \eta_{av}(R_k)$ and $\nabla_{\Phi} \eta_{av}(R_k)$ are skipped, and the reader is referred to [1] for a detailed description. The complexity of evaluating the gradient is summarized from [1] here to complete the picture of the computational burden of the gradient ascent methodology.
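The equality stated in Proposition 4.3.1 can be checked empirically by simulating one long path. The following sketch is not part of the original development: the two-state chain, the seed, the path length, and the function name `simulate_averages` are assumptions chosen for illustration. It tracks the average of the 0/1 reward and the fraction of time spent in the Good set separately.

```python
import numpy as np


def simulate_averages(T, good, start, steps, seed=0):
    """Simulate one path and return (average reward, occupation fraction),
    where the reward is 1 on states in `good` and 0 elsewhere."""
    rng = np.random.default_rng(seed)
    uniforms = rng.random(steps)
    cumulative = np.cumsum(T, axis=1)  # row-wise CDFs for inverse sampling
    n = T.shape[0]
    reward = np.array([1.0 if i in good else 0.0 for i in range(n)])
    state, reward_sum, visits = start, 0.0, 0
    for u in uniforms:
        nxt = int(np.searchsorted(cumulative[state], u, side="right"))
        state = min(nxt, n - 1)  # guard against floating-point round-off
        reward_sum += reward[state]
        visits += int(state in good)
    return reward_sum / steps, visits / steps


# Hypothetical chain with a single recurrent class; state 1 is the Good state.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
avg_reward, occupation = simulate_averages(T, good={1}, start=0, steps=200_000)
print(avg_reward, occupation)
```

Both quantities coincide along every path because the reward is exactly the indicator of the Good set. For this chain they should also lie close to the stationary mass $1/6$ of the Good state.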
4.3.1.1 Complexity of Computing Gradient of ηavpRkq
From [1], in the worst case, the complexity of computing $\nabla_{\Phi} \eta_{av}(R_k)$ is given by $O(|S|^2 |G|^2 |\Phi| |Act| |O|)$, similar to the gradient of the absorption probability. The reduced practical complexity for sparse transition and observation functions applies as well and is given by $O(c |S| |G| |\Phi| |Act|)$ with $c \ll |G||G||O|$.
The gradient of $\eta_{av}(R_k)$ w.r.t. $\Theta$ is zero.
4.4 Trade Off between Absorption Probability and Visitation
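corresponding optimizing parameters are given by $G^{*}(R_k) = \{\Phi^{*}(R_k), \Theta^{*}(R_k)\}$. Then the optimum value is taken to be
$$\max_{R_k \subseteq \phi\text{-RecSets}^{G}} \Gamma^{*}(R_k). \tag{4.49}$$
The optimum controller is given by $G^{*}(R_k^{*}) = \{\Phi^{*}(R_k^{*}), \Theta^{*}(R_k^{*})\}$, where
$$R_k^{*} = \operatorname*{argmax}_{R_k \subseteq \phi\text{-RecSets}^{G}} \Gamma^{*}(R_k). \tag{4.50}$$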
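The outer maximization over recurrent sets amounts to a simple loop once a per-set optimizer is available. The sketch below assumes a hypothetical callable `optimize` that returns the optimized objective value and the corresponding parameters for a given recurrent set. The stub `toy_optimize` and its numbers are made up purely so that the example runs and do not represent results.

```python
from typing import Any, Callable, Hashable, Iterable, Tuple


def select_best_recurrent_set(
    candidates: Iterable[Hashable],
    optimize: Callable[[Hashable], Tuple[float, Any]],
) -> Tuple[Hashable, float, Any]:
    """Return (R_k*, Gamma*(R_k*), parameters) over the candidate sets."""
    best = None
    for r_k in candidates:
        value, params = optimize(r_k)  # inner optimization for this R_k
        if best is None or value > best[1]:
            best = (r_k, value, params)
    if best is None:
        raise ValueError("no candidate recurrent set was supplied")
    return best


def toy_optimize(r_k):
    """Hypothetical stand-in for the inner optimization (values are made up)."""
    table = {"R1": 0.42, "R2": 0.77, "R3": 0.61}
    return table[r_k], {"Phi": None, "Theta": None}


best_set, best_value, best_params = select_best_recurrent_set(
    ["R1", "R2", "R3"], toy_optimize)
print(best_set, best_value)  # R2 0.77
```

The inner optimizations are independent across recurrent sets, so in this sketch they could be run in any order or in parallel.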
4.5 Heuristic Search for FSC Structures with a ϕ-Feasible Recurrent Set
In this section it is shown how, given a fixed size $|G|$, candidate FSCs that yield at least one $\phi$-feasible recurrent set can be generated. This problem is hard in itself [33]: the hardness arises from partial observability, under which possibly unbounded sequences of actions and observations may be required to ensure that some states are never visited. However, the heuristic described in this section restricts the search to outcomes that can be inferred from a single, most recent, observation and action. Thus, the proposed method is incomplete: a solution may exist, but the algorithm may be unable to find it. The details of this heuristic search are given in Algorithm 4.2.
In order to understand the algorithm, the reader is pointed to the example in Figure 4.3. It shows a part of a global Markov chain, such that the underlying product POMDP has only one Rabin acceptance pair. The global state $[s_4, g_4]$, denoted in green in Figure 4.3, is the only global state whose projection $s_4 \in \mathrm{Repeat}_1^{PM^{\phi}}$. The global state $[s_2, g_2]$, denoted in red, is such that $s_2 \in \mathrm{Avoid}_1^{PM^{\phi}}$. There are two communicating classes, $C_1$ and $C_2$, in the global Markov chain such that $C_1 \to C_2$, and $C_2$ is absorbing. Therefore the states in $\mathrm{Bad} = \{[s_2, g_2], [s_5, g_5]\}$ need to be made unreachable in steady state, while ensuring that some state in $\mathrm{Good} = \{[s_4, g_4]\}$ is recurrent. The former is done in steps 14-15 of Algorithm 4.2 by disallowing actions which lead to bad states under the latest observation. Recurrence of some state in $\mathrm{Good}$ is ensured in steps 19-20. This recurrence may not always be guaranteed because disconnecting states in $\mathrm{Bad}$ by removing actions may change the communication properties of other global states. The check for $\sum_{g_l \in G, \alpha \in Act} I_{\omega}(g_l, \alpha \mid g_k, o) > 0$ in steps 9 and 17 makes sure that the modification in $I_{\omega}$ does not yield an inadmissible structure as defined in Equation (4.3).

[Figure 4.3 image: global states $[s_1, g_1]$ through $[s_7, g_7]$, with Good and Bad nodes marked and the communicating classes $C_1$ and $C_2$ outlined.]
Figure 4.3: Generating Admissible Structures of FSC. There are two communicating classes $C_1$ and $C_2$. In steady state it must be ensured that the green node is recurrent, while the red node is never visited, i.e., $[s_2, g_2]$ must be disconnected from $[s_1, g_1]$. Since $C_1$ can lead to $C_2$, which is absorbing, the communication between $[s_1, g_1]$ and $[s_5, g_5]$ needs to be severed. Thus, in Algorithm 4.2, $\mathrm{Good} = \{[s_4, g_4]\}$, and $\mathrm{Bad} = \{[s_2, g_2], [s_5, g_5]\}$.
Note that no discussion about $I_{\kappa}$ has been made in the context of feasibility. This is because setting $I_{\kappa}(g) = 1, \forall g \in G$, is sufficient and this choice does not affect the $\phi$-feasibility.
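The admissibility check on $I_{\omega}$ described above can be illustrated with a small array-based sketch. The representation is an assumption made for this example only: $I_{\omega}$ is stored as a 0/1 array indexed as `I_omega[g_k, o, g_l, alpha]`, and the helper names `is_admissible` and `prune_action` are hypothetical. The sketch covers only the check and a single pruning step, not Algorithm 4.2 itself.

```python
from typing import Optional

import numpy as np


def is_admissible(I_omega: np.ndarray) -> bool:
    """True if, for every node g_k and observation o, at least one
    (next node, action) pair is still allowed."""
    return bool((I_omega.sum(axis=(2, 3)) > 0).all())


def prune_action(I_omega: np.ndarray, g_k: int, o: int,
                 alpha: int) -> Optional[np.ndarray]:
    """Disallow action alpha at (g_k, o). Return the modified copy, or
    None if the change would leave (g_k, o) without any allowed choice."""
    candidate = I_omega.copy()
    candidate[g_k, o, :, alpha] = 0
    if candidate[g_k, o].sum() > 0:
        return candidate
    return None


# Hypothetical sizes: 2 FSC nodes, 2 observations, 2 actions, all allowed.
I_omega = np.ones((2, 2, 2, 2), dtype=int)
pruned = prune_action(I_omega, g_k=0, o=1, alpha=0)
print(is_admissible(pruned))                           # True
print(prune_action(pruned, g_k=0, o=1, alpha=1) is None)  # True
```

Returning `None` instead of an inadmissible array lets a caller skip that modification, which mirrors the role the check plays in the text.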
4.5.1 Complexity
Algorithm 4.2 presents two main sources of computational complexity. The first is the computation of strongly connected components. For a graph, these components can be found with effort