2.2.7 Brief overview of solution methods
The solution methods for typical POMDP problems fall into two categories: exact and approximate. A detailed survey can be found in the introductory chapters of [1] and in [24]. These are briefly summarized here.
2.2.7.1 Exact Methods
Exact methods typically need to employ the full information state $I_t$ to design the policy. However, this requires infinite memory as $t \to \infty$ and is intractable. As has already been mentioned, the belief state $b_t$ offers a sufficient statistic for $I_t$. Following [29, 137], the problem is solved using Dynamic Programming [16] over a belief space MDP that can be constructed from the POMDP.
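To make the belief space MDP construction concrete, the following is a minimal numpy sketch of one Bayes-filter step, i.e., the transition map of the belief space MDP. It assumes tabular arrays `T[a, s, s2] = T(s2 | s, a)` and `O[s, o] = O(o | s)` (these array conventions are illustrative, not the thesis's notation) and follows the observation-then-action ordering used later in Section 3.1.

```python
import numpy as np

def belief_update(b, o, alpha, T, O):
    """One Bayes-filter step: the transition function of the belief space MDP.

    b     : current belief over states, shape (|S|,)
    o     : index of the observation emitted by the current state
    alpha : index of the action applied afterwards
    T     : transition tensor with T[a, s, s2] = T(s2 | s, a)
    O     : observation matrix with O[s, o] = O(o | s)
    """
    # Correction: condition the belief on the received observation.
    corrected = O[:, o] * b
    corrected /= corrected.sum()
    # Prediction: push the corrected belief through the transition kernel.
    return T[alpha].T @ corrected
```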
However, since the belief state space is uncountably infinite, these algorithms may require infinite memory for representation. Moreover, their complexity grows exponentially with the size of the state space, so it is difficult to solve problems with more than a few tens of states, observations, and actions.
2.2.7.2 Approximate Methods
In recent years, there has been considerable work on approximating the value function. For example, one family of methods assumes that the underlying system is an MDP, learns the corresponding Q-function, and employs heuristics such as the most likely state heuristic, the voting heuristic, the $Q_{MDP}$ heuristic, or heuristics exploiting the entropy of the belief state [67, 106, 135]. Another is to use grid based methods [22, 55].
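As an illustration, the following is a minimal sketch of two of these heuristics, assuming the Q-function of the underlying MDP has already been computed (e.g., by value iteration); the function names are hypothetical.

```python
import numpy as np

def q_mdp_action(b, Q):
    """Q_MDP heuristic: act greedily on the belief-averaged MDP Q-values."""
    # b: belief, shape (|S|,); Q: MDP Q-function, shape (|S|, |Act|).
    return int(np.argmax(b @ Q))

def most_likely_state_action(b, Q):
    """Most likely state heuristic: act as if the MAP state were the true state."""
    return int(np.argmax(Q[np.argmax(b)]))
```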
Other approximate methods search only over the reachable belief states and fall under point-based POMDP planning [79, 113]. The algorithms in this thesis are also approximate: they search exclusively over the space of policies that require finite internal memory. This particular class of controllers, namely the Finite State Controller, is introduced in detail in Section 3.1 of the next chapter, and will be used in the rest of the thesis to propose methods for finding such controllers given a POMDP model and an LTL specification.
Chapter 3
LTL Satisfaction using Finite State Controllers
In the previous chapter, the POMDP and its associated problems in the form of optimization objectives were introduced. Further, the various methods of designing controllers to maximize these criteria were mentioned. In this chapter, one particular class of controllers, called the finite state controller (FSC), is studied in detail. The choice of this class of controllers is shown to lead to a finite state space Markov chain for the closed loop controlled system. This allows easy analysis of infinite executions of the system in the context of satisfying an LTL formula of interest. Next, the various categories of problems relating to LTL formulas over POMDPs controlled by FSCs are formalized.
Finally, a brief overview of the solution methodology for these problems is provided.
It is well known that POMDP controllers, and for some criteria MDP controllers, require memory or internal states [1, 29, 67]. Let the controller's internal states be denoted by $g \in G = \{g_1, g_2, \ldots, g_{|G|}\}$.
Finite state controllers have finite $|G|$. As mentioned before, infinite horizon problems typically require infinite $|G|$. The most popular methods that employ infinite memory design controllers that work directly in the belief space, which is continuous; this effectively implies an uncountably infinite $G$, for which the above definition does not hold.
Finite state controllers are formally defined next.
3.1 Finite State Controllers
Definition 3.1.1 (Deterministic Finite State Controller (det-FSC)) Let $\mathcal{PM}$ be a POMDP with observation set $O$, action set $Act$, and initial distribution $\iota_{init}$ as in Definition 2.2.1. A deterministic finite state controller (det-FSC) for $\mathcal{PM}$ is given by the tuple $\mathcal{G} = (G, \omega, \kappa)$ where
• $G = \{g_1, g_2, \ldots, g_{|G|}\}$ is a finite set of internal states.
• $\omega : G \times O \to G \times Act$ is a function such that, given a current internal FSC state $g_k$ and observation $o$, $(g_l, \alpha) = \omega(g_k, o)$ chooses the next internal state of the FSC and the action to apply to $\mathcal{PM}$.
• $\kappa : \mathcal{M}(S) \to G$ chooses the start state $g_0 = \kappa(\iota_{init})$ of the FSC, given the initial distribution $\iota_{init}$.

Definition 3.1.2 (Stochastic Finite State Controller (sto-FSC)) Let $\mathcal{PM}$ be a POMDP with observation set $O$, action set $Act$, and initial distribution $\iota_{init}$. A stochastic finite state controller (sto-FSC) for $\mathcal{PM}$ is given by the tuple $\mathcal{G} = (G, \omega, \kappa)$ where
• $G = \{g_1, g_2, \ldots, g_{|G|}\}$ is a finite set of internal states.
• $\omega : G \times O \to \mathcal{M}(G \times Act)$ is a function such that, given a current internal FSC state $g_k$ and observation $o$, $\omega(g_k, o)$ is a probability distribution over $G \times Act$. The next internal state and action pair $(g_l, \alpha)$ is chosen by independent sampling from $\omega(g_k, o)$. By abuse of notation, $\omega(g_l, \alpha \mid g_k, o)$ will denote the probability of transitioning to I-state $g_l$ and taking action $\alpha$ when the current I-state is $g_k$ and the observation received is $o$.
• $\kappa : \mathcal{M}(S) \to \mathcal{M}(G)$ chooses the starting internal state $g_0$ by independent sampling from $\kappa(\iota_{init})$, given the initial distribution $\iota_{init}$ of $\mathcal{PM}$. Again, by abuse of notation, $\kappa(g \mid \iota_{init})$ will denote the probability of starting the FSC in internal state $g$ when the initial distribution is $\iota_{init}$.
Any deterministic FSC can be written as a special case of stochastic FSCs. This thesis will exclusively consider the stochastic version and so the term FSC will denote a stochastic FSC unless otherwise stated.
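To fix ideas, the following is a minimal tabular sketch of a sto-FSC, assuming $\omega$ and $\kappa$ are stored as numpy arrays with the indexing conventions shown in the docstring (these conventions are illustrative, not the thesis's). A det-FSC is recovered as the special case where every row of `omega` and `kappa` is a one-hot distribution.

```python
import numpy as np

class StochasticFSC:
    """Tabular sto-FSC (G, omega, kappa) in the sense of Definition 3.1.2.

    omega[g, o, g2, a] = omega(g2, a | g, o)
    kappa[g]           = kappa(g | iota_init), precomputed for a fixed iota_init
    """
    def __init__(self, omega, kappa, rng=None):
        self.omega, self.kappa = omega, kappa
        self.rng = rng if rng is not None else np.random.default_rng()

    def reset(self):
        # Sample the initial I-state g0 from kappa(. | iota_init).
        return self.rng.choice(len(self.kappa), p=self.kappa)

    def step(self, g, o):
        # Sample the pair (g2, a) jointly from omega(., . | g, o).
        n_a = self.omega.shape[3]
        flat = self.omega[g, o].reshape(-1)
        idx = self.rng.choice(flat.size, p=flat)
        return idx // n_a, idx % n_a   # next I-state, action
```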
A schematic diagram of how an FSC controls the POMDP is shown in Figure 3.1. Under the FSC, the POMDP evolves as follows (a simulation sketch of this loop is given after the list).
1. Set $t = 0$. The POMDP initial state $s_0$ is initialized by drawing independently from the distribution $\iota_{init}$. The deterministic or stochastic function $\kappa(\iota_{init})$ is used to determine or sample the initial FSC I-state $g_0$.
2. At each time step $t \ge 0$, the POMDP emits an observation $o_t$ according to the distribution $O(\cdot \mid s_t)$.
3. The FSC determines its new state $g_{t+1}$ and action $\alpha_{t+1}$ according to the deterministic function or the stochastic distribution $\omega(\cdot \mid g_t, o_t)$.
4. The action $\alpha_{t+1}$ is applied to the POMDP, which transitions to a new state $s_{t+1}$ according to the distribution $T(\cdot \mid s_t, \alpha_{t+1})$.
5. Set $t = t + 1$ and go to step 2.
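The loop above translates directly into code. The following is a minimal simulation sketch, assuming a hypothetical `pomdp` container exposing tabular arrays `iota_init[s]`, `O[s, o] = O(o | s)`, and `T[a, s, s2] = T(s2 | s, a)`, together with the `StochasticFSC` sketch from earlier; the recorded $(s_t, g_t)$ pairs are exactly an execution of the global Markov chain defined in the next subsection.

```python
import numpy as np

def simulate(pomdp, fsc, horizon, rng=None):
    """Run the closed loop of Figure 3.1 for `horizon` steps; return the execution."""
    rng = rng if rng is not None else np.random.default_rng()
    # Step 1: draw s0 from iota_init and g0 from kappa(iota_init).
    s = rng.choice(len(pomdp.iota_init), p=pomdp.iota_init)
    g = fsc.reset()
    execution = [(s, g)]
    for _ in range(horizon):
        # Step 2: the POMDP emits o_t ~ O(. | s_t).
        o = rng.choice(pomdp.O.shape[1], p=pomdp.O[s])
        # Step 3: the FSC samples (g_{t+1}, a_{t+1}) ~ omega(., . | g_t, o_t).
        g, a = fsc.step(g, o)
        # Step 4: the POMDP transitions s_{t+1} ~ T(. | s_t, a_{t+1}).
        s = rng.choice(pomdp.T.shape[2], p=pomdp.T[a, s])
        execution.append((s, g))
    return execution
```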
Figure 3.1: POMDP controlled by an FSC. (Schematic: the POMDP emits observation $o$ to the FSC, which transitions its internal state from $g_k$ to $g_l$ via $\omega$ and applies action $\alpha$ to the POMDP.)
3.1.1 Markov Chain induced by an FSC
Closing the loop around a POMDP with an FSC, as in Figure 3.1, yields the following transition system.
Definition 3.1.3 (Global Markov Chain) Let $S$ be the state space of the POMDP $\mathcal{PM}$, and $G$ the set of I-states of the FSC $\mathcal{G}$, as in Definition 3.1.2. The global Markov chain $\mathcal{M}^{\mathcal{PM},\mathcal{G}}_{S \times G}$ with execution $\sigma = \{[s_0, g_0], [s_1, g_1], \ldots\}$, $[s_t, g_t] \in S \times G$, evolves as follows:
• The probability of the initial global state $[s_0, g_0]$ is given by
$$\iota_{init}^{\mathcal{PM},\mathcal{G}}\big[[s_0, g_0]\big] = \iota_{init}(s_0)\,\kappa(g_0 \mid \iota_{init}) \tag{3.1}$$
• The state transition probability is given by
$$T^{\mathcal{PM},\mathcal{G}}\big[[s_{t+1}, g_{t+1}] \mid [s_t, g_t]\big] = \sum_{o \in O} \sum_{\alpha \in Act} O(o \mid s_t)\,\omega(g_{t+1}, \alpha \mid g_t, o)\,T(s_{t+1} \mid s_t, \alpha) \tag{3.2}$$
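Equations (3.1) and (3.2) amount to an outer product and a tensor contraction; the following is a minimal sketch that assembles the global chain, assuming the tabular array conventions of the earlier sketches (illustrative, not the thesis's notation).

```python
import numpy as np

def global_chain(T, O, omega, kappa, iota_init):
    """Assemble the global Markov chain of Definition 3.1.3.

    T[a, s, s2]        = T(s2 | s, a)
    O[s, o]            = O(o | s)
    omega[g, o, g2, a] = omega(g2, a | g, o)
    kappa[g]           = kappa(g | iota_init)
    Returns the initial distribution and transition matrix over S x G,
    with the product state [s, g] flattened to index s * |G| + g.
    """
    n_s, n_g = T.shape[1], omega.shape[0]
    # Eq. (3.1): iota[s, g] = iota_init(s) * kappa(g | iota_init).
    iota = np.outer(iota_init, kappa).reshape(n_s * n_g)
    # Eq. (3.2): P[s, g, s2, g2] = sum_{o,a} O(o|s) omega(g2, a|g, o) T(s2|s, a).
    P = np.einsum('so,goha,asz->sgzh', O, omega, T)
    return iota, P.reshape(n_s * n_g, n_s * n_g)
```

Each row of the returned matrix sums to one, so standard finite-state Markov chain analysis can be carried out on it directly.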
Note that for a finite state space POMDP, the global Markov chain has a finite state space. Similar to the fully observable case of the Markov decision process in [8], the global Markov chain $\mathcal{M}^{\mathcal{PM},\mathcal{G}}_{S \times G}$ induced by the finite state controller is probabilistically bisimilar to the infinite state space Markov chain described in Section 2.2.3.1. Probabilistic bisimilarity is discussed in [8, 80].
We also remind the reader that this Markov chain is associated with a probability space, as described in Section 2.2.4.