3. MDP HOMOMORPHISMS AND MINIMIZATION
3.5 Identifying Homomorphisms
Theorem 3: Let $h$ be an MDP homomorphism from an MDP $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$ to an MDP $\mathcal{M}' = \langle S', A', \Psi', P', R' \rangle$. Then $B_h$, the partition of $\Psi$ induced by $h$, is a reward respecting SSP partition.
Proof: Let $h = \langle f, \{g_s \mid s \in S\} \rangle$ be the homomorphism from $\mathcal{M}$ to $\mathcal{M}'$. We need to show that the partition $B_h$ is a reward respecting SSP partition.
First let us tackle the stochastic substitution property. Let $(s_1, a_1), (s_2, a_2) \in \Psi$ be $h$-equivalent. From the definition of a homomorphism we have that $f(s_1) = f(s_2) = s' \in S'$ and $g_{s_1}(a_1) = g_{s_2}(a_2) = a' \in A'_{s'}$. Thus, for any $s \in S$, $T(s_1, a_1, [s]_{B_h|S}) = P'(s', a', f(s)) = T(s_2, a_2, [s]_{B_h|S})$. Hence $B_h$ is an SSP partition.
From condition 2 in the definition of a homomorphism, it is clear that the induced partition is reward respecting. $\Box$
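For concreteness, the two conditions used in this proof can be checked mechanically on a small finite MDP. The following Python sketch is ours, not part of the original development: it assumes $P$ and $R$ are stored as dictionaries keyed by state-action pairs, represents a partition of $\Psi$ as a list of frozensets, and computes the projection $B|S$ by grouping states that occur in exactly the same blocks of $B$.

```python
# A minimal sketch (not from the original text) for checking whether a partition
# of Psi is a reward respecting SSP partition of a small tabular MDP.
# Assumed encoding: P[(s, a)] is a dict {next_state: probability} and
# R[(s, a)] is the expected reward of the pair (s, a).

from itertools import combinations

def project_to_states(partition):
    """B|S: group states that occur in exactly the same blocks of B."""
    blocks_of = {}
    for i, block in enumerate(partition):
        for s, _ in block:
            blocks_of.setdefault(s, set()).add(i)
    groups = {}
    for s, ids in blocks_of.items():
        groups.setdefault(frozenset(ids), set()).add(s)
    return [frozenset(g) for g in groups.values()]

def block_prob(P, s, a, state_block):
    """T(s, a, [s']_{B|S}): probability that (s, a) leads into a block of states."""
    return sum(p for s2, p in P[(s, a)].items() if s2 in state_block)

def is_reward_respecting_ssp(P, R, partition, tol=1e-9):
    state_blocks = project_to_states(partition)
    for block in partition:
        for (s1, a1), (s2, a2) in combinations(block, 2):
            # reward respecting: equivalent pairs earn the same expected reward
            if abs(R[(s1, a1)] - R[(s2, a2)]) > tol:
                return False
            # SSP: equivalent pairs agree on every block transition probability
            for sb in state_blocks:
                if abs(block_prob(P, s1, a1, sb) - block_prob(P, s2, a2, sb)) > tol:
                    return False
    return True
```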
Theorem 3 establishes that the partition induced by a homomorphism is a reward respecting SSP partition. On the other hand, given any reward respecting SSP partition $B$ of $\mathcal{M}$ it is possible to construct a homomorphic image. Let $\eta(s)$ be the number of distinct classes of $B$ that contain a state-action pair with $s$ as the state component, and let $\{[(s, a_i)]_B \mid i = 1, 2, \ldots, \eta(s)\}$ be those blocks. Note that if $[s_1]_{B|S} = [s_2]_{B|S}$ then $\eta(s_1) = \eta(s_2)$, hence the following is well-defined.
Definition: Given a reward respecting SSP partition $B$ of an MDP $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$, the quotient MDP $\mathcal{M}/B$ is the MDP $\langle S', A', \Psi', P', R' \rangle$, where $S' = B|S$; $A' = \bigcup_{[s]_{B|S} \in S'} A'_{[s]_{B|S}}$, where $A'_{[s]_{B|S}} = \{a'_1, a'_2, \ldots, a'_{\eta(s)}\}$ for each $[s]_{B|S} \in S'$; $P'$ is given by $P'([s]_{B|S}, a'_i, [s']_{B|S}) = \hat{T}([(s, a_i)]_B, [s']_{B|S})$; and $R'$ is given by $R'([s]_{B|S}, a'_i) = R(s, a_i)$.
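This construction translates directly into code. The sketch below is again ours, reusing project_to_states and block_prob from the previous sketch: it picks an arbitrary representative state from each block of $B|S$ and turns each of the $\eta(s)$ blocks of $B$ containing that state into one abstract action. By the SSP property and the well-definedness noted above, the choice of representative does not matter.

```python
# A hypothetical construction of the quotient MDP M/B from a reward respecting
# SSP partition, following the definition above.  Reuses project_to_states and
# block_prob from the earlier sketch.

def quotient_mdp(P, R, partition):
    state_blocks = project_to_states(partition)      # S' = B|S
    P_q, R_q = {}, {}
    for sb in state_blocks:
        rep = next(iter(sb))                          # any representative state
        # the eta(rep) blocks of B containing a pair with rep as state component;
        # each becomes one abstract action a'_1, ..., a'_eta(rep)
        local = [b for b in partition if any(s == rep for (s, _) in b)]
        for i, b in enumerate(local):
            # any pair in b whose state is rep gives the same reward and the
            # same block transition probabilities, so pick one
            s, a = next((s, a) for (s, a) in b if s == rep)
            R_q[(sb, i)] = R[(s, a)]                  # R'([s], a'_i) = R(s, a_i)
            P_q[(sb, i)] = {sb2: block_prob(P, s, a, sb2)
                            for sb2 in state_blocks}  # P' = block transition prob.
    return state_blocks, P_q, R_q
```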
Theorem 4: Let $B$ be a reward respecting SSP partition of MDP $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$. The quotient MDP $\mathcal{M}/B$ is a homomorphic image of $\mathcal{M}$.
Proof: Given a reward respecting SSP partition $B$ of $\mathcal{M}$, we show by construction that there exists a homomorphism $h$ from $\mathcal{M}$ to the quotient MDP $\mathcal{M}/B = \langle S', A', \Psi', P', R' \rangle$. The homomorphism $h = \langle f, \{g_s \mid s \in S\} \rangle$ between $\mathcal{M}$ and $\mathcal{M}/B$ is given by $f(s) = [s]_{B|S}$ and $g_s(a) = a'_i$ such that $T(s, a, [s']_{B|S}) = P'([s]_{B|S}, a'_i, [s']_{B|S})$ for all $[s']_{B|S} \in B|S$. In other words, if $[(s, a)]_B$ is the $i$-th distinct block in the ordering used in the construction of $\mathcal{M}/B$, then $g_s(a) = a'_i$. It is easy to verify that $h$ is indeed a homomorphism. $\Box$
The partition induced on $\mathcal{M}$ by $h$ is only guaranteed to be a refinement of $B$ and is not always the same partition as $B$; in other words, $B \geq B_h$. In fact, $B_h$ is the least coarse partition such that $B_h|S = B|S$, and $\mathcal{M}/B$ is the same MDP as $\mathcal{M}/B_h$ up to a relabeling of states and actions. Thus the converse of the theorem, that for every reward respecting SSP partition there exists a homomorphism that induces it, is not always true.
It is easy to verify (by contradiction) that there exists a unique coarsest reward respecting SSP partition for any MDP. Intuitively one would expect the quotient MDP corresponding to the coarsest reward respecting SSP partition of an MDP $\mathcal{M}$ to be a minimal image of $\mathcal{M}$. The following theorem states this formally.
Theorem 5: Let $B$ be the coarsest reward respecting SSP partition of MDP $\mathcal{M}$. The quotient MDP $\mathcal{M}/B$ is a minimal image of $\mathcal{M}$.
Proof: The proof of this theorem is given in Appendix A.
Dean and Givan (1997) propose a polynomial time method to identify the coarsest reward respecting SSP partition of an MDP. Though their method operates with partitions of the state space only, it can easily be extended to $\Psi$. Given an MDP $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$, the outline of a basic model-minimization algorithm is as follows:
1. Start with any reward respecting partition $B$ of $\Psi$. The most obvious choice is the partition induced by the expected reward function $R$. This is the coarsest possible reward respecting partition, but any suitable partition will do.
2. Pick some block $b_i$ of $B$ that does not satisfy the SSP property and split $b_i$ so that it does.
3. Repeat step 2 until all violations of the SSP property are resolved. Let $B_h$ be the resulting partition.
4. Form the quotient MDP $\mathcal{M}/B_h$ and identify the homomorphism between $\mathcal{M}$ and $\mathcal{M}/B_h$.
Now one can solve $\mathcal{M}/B_h$ and lift the optimal policy to get an optimal policy for $\mathcal{M}$. It can be shown (Dean and Givan, 1997) that step 2 has to be performed only once for each block in the partition, and hence the algorithm runs in time quadratic in $|B_h|$ and linear in $|\Psi|$. The algorithm converges to the coarsest reward respecting SSP partition, provided we start with a suitable reward respecting partition.
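A simple fixed-point rendering of steps 1-3 is sketched below. This is our own naive version, not the quadratic-time Dean and Givan implementation: it recomputes $B|S$ after every pass and splits a block by grouping its pairs according to their vector of block transition probabilities, repeating until no block changes. It reuses project_to_states and block_prob from the earlier sketches; the quotient_mdp sketch above can then be applied to the result, and a policy computed for the quotient lifted back through $f$ and the $g_s$.

```python
# A naive sketch of the splitting loop (steps 1-3 above); not the quadratic-time
# algorithm of Dean and Givan, but on small tabular MDPs it converges to the same
# coarsest reward respecting SSP partition.

def minimize(P, R, Psi):
    # step 1: the partition of Psi induced by the expected reward function R
    by_reward = {}
    for (s, a) in Psi:
        by_reward.setdefault(round(R[(s, a)], 9), set()).add((s, a))
    partition = [frozenset(b) for b in by_reward.values()]

    # steps 2-3: split blocks until every block satisfies the SSP property
    changed = True
    while changed:
        changed = False
        state_blocks = project_to_states(partition)
        refined = []
        for block in partition:
            # signature of a pair = its block transition probabilities under B|S
            by_sig = {}
            for s, a in block:
                sig = tuple(round(block_prob(P, s, a, sb), 9) for sb in state_blocks)
                by_sig.setdefault(sig, set()).add((s, a))
            if len(by_sig) > 1:
                changed = True
            refined.extend(frozenset(b) for b in by_sig.values())
        partition = refined
    return partition
```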
Illustration of Minimization: An Abstract MDP Example (Revisited)

Let us return to the abstract MDP $\mathcal{M}$ from Figure 3.4(a), reproduced here in Figure 3.5(a). We now derive the minimal model of this MDP. The set of admissible state-action pairs is $\Psi = S \times A$. We start with the partition induced by the reward function:

$B_R = \big\{\{(s_2, a_1), (s_3, a_2)\},\ \{(s_2, a_2), (s_3, a_1)\},\ \{(s_1, a_1), (s_1, a_2), (s_4, a_1), (s_4, a_2)\}\big\}.$
Figure 3.5. (a) Transition graph of the example MDP $\mathcal{M}$, repeated from Figure 3.4(a). (b) Transition graph of the quotient MDP $\mathcal{M}/B$; see text for description. Note that (b) is isomorphic to the MDP in Figure 3.4(b).

We denote the blocks of the partition by $b_1$, $b_2$ and $b_3$ respectively. Now $B_R|S = \{\{s_1, s_4\}, \{s_2, s_3\}\}$, $\hat{T}(b_1, \{s_1, s_4\}) = \hat{T}(b_2, \{s_1, s_4\}) = 1.0$ and $\hat{T}(b_1, \{s_2, s_3\}) = \hat{T}(b_2, \{s_2, s_3\}) = 0.0$. Hence $b_1$ and $b_2$ satisfy the SSP property and do not need to be split. Block $b_3$ does violate the SSP property, as can be seen below:
$T(s_1, a_1, \{s_1, s_4\}) = 0$      $T(s_4, a_1, \{s_1, s_4\}) = 1.0$
$T(s_1, a_2, \{s_1, s_4\}) = 0$      $T(s_4, a_2, \{s_1, s_4\}) = 1.0$
$T(s_1, a_1, \{s_2, s_3\}) = 1.0$    $T(s_4, a_1, \{s_2, s_3\}) = 0$
$T(s_1, a_2, \{s_2, s_3\}) = 1.0$    $T(s_4, a_2, \{s_2, s_3\}) = 0$
We can fix this by splitting $b_3$ into $\{\{(s_1, a_1), (s_1, a_2)\}, \{(s_4, a_1), (s_4, a_2)\}\}$. It is easy to see that the resulting partition $B$, given by $B = \{\{(s_1, a_1), (s_1, a_2)\}, \{(s_2, a_1), (s_3, a_2)\}, \{(s_2, a_2), (s_3, a_1)\}, \{(s_4, a_1), (s_4, a_2)\}\}$, is a reward respecting SSP partition.
We can derive the quotient MDP $\mathcal{M}/B = \langle S', A', \Psi', P', R' \rangle$ as follows:
$S' = B|S = \{\{s_1\}, \{s_2, s_3\}, \{s_4\}\}$ are the states of $\mathcal{M}/B$.
Now, $\eta(s_1) = 1$, $\eta(s_2) = \eta(s_3) = 2$ and $\eta(s_4) = 1$. Let $A' = \{a'_1, a'_2\}$. Hence we set $A'_{\{s_1\}} = \{a'_1\}$, $A'_{\{s_2, s_3\}} = \{a'_1, a'_2\}$ and $A'_{\{s_4\}} = \{a'_1\}$. Now $P'(\{s_1\}, a'_1, \{s_2, s_3\}) = P(s_1, a_1, s_2) + P(s_1, a_1, s_3) = P(s_1, a_2, s_2) + P(s_1, a_2, s_3) = 1.0$. Proceeding similarly, we have
$P'(\{s_1\}, a'_1, \{s_2, s_3\}) = 1.0$      $P'(\{s_4\}, a'_1, \{s_4\}) = 1.0$
$P'(\{s_2, s_3\}, a'_1, \{s_1\}) = 0.8$      $P'(\{s_2, s_3\}, a'_2, \{s_1\}) = 0.2$
$P'(\{s_2, s_3\}, a'_1, \{s_4\}) = 0.2$      $P'(\{s_2, s_3\}, a'_2, \{s_4\}) = 0.8$
$R'(\{s_2, s_3\}, a'_1) = 0.2$, $R'(\{s_2, s_3\}, a'_2) = 0.8$ and all other rewards are zero. Figure 3.5(b) shows the transition graph for $\mathcal{M}/B$. Note that this MDP is the same as the one shown in Figure 3.4(b) except for a relabeling of states and actions. The two MDPs are examples of isomorphic MDPs, a notion we will develop further in the next chapter. Now we can define a homomorphism $\langle f, \{g_s \mid s \in S\} \rangle$ from $\mathcal{M}$ to $\mathcal{M}/B$ as follows: $f(s_1) = \{s_1\}$, $f(s_2) = \{s_2, s_3\}$, $f(s_3) = \{s_2, s_3\}$ and $f(s_4) = \{s_4\}$; $g_{s_1}(a_i) = g_{s_4}(a_i) = a'_1$ for $i = 1, 2$, $g_{s_2}(a_1) = g_{s_3}(a_2) = a'_2$ and $g_{s_2}(a_2) = g_{s_3}(a_1) = a'_1$.
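As a sanity check, the sketches above reproduce this partition and quotient on a hypothetical encoding of the example MDP. The expected rewards and block transition probabilities used below are the ones stated in the text; the individual split of $P(s_1, a_1, \cdot)$ and $P(s_1, a_2, \cdot)$ between $s_2$ and $s_3$ is not pinned down by the text, so a 0.8/0.2 split is assumed for concreteness (any split summing to 1.0 yields the same result).

```python
# A hypothetical encoding of the example MDP above, run through the sketches.
# The split of P(s1, .) between s2 and s3 is assumed (only the sum is given in
# the text); it does not affect the resulting partition.
P = {
    ('s1', 'a1'): {'s2': 0.8, 's3': 0.2},
    ('s1', 'a2'): {'s2': 0.2, 's3': 0.8},
    ('s2', 'a1'): {'s1': 0.2, 's4': 0.8},
    ('s2', 'a2'): {'s1': 0.8, 's4': 0.2},
    ('s3', 'a1'): {'s1': 0.8, 's4': 0.2},
    ('s3', 'a2'): {'s1': 0.2, 's4': 0.8},
    ('s4', 'a1'): {'s4': 1.0},
    ('s4', 'a2'): {'s4': 1.0},
}
R = {sa: 0.0 for sa in P}
R[('s2', 'a1')] = R[('s3', 'a2')] = 0.8   # pairs mapped to a'_2 in the text
R[('s2', 'a2')] = R[('s3', 'a1')] = 0.2   # pairs mapped to a'_1 in the text

B = minimize(P, R, set(P))
print(project_to_states(B))   # three state blocks: {s1}, {s2, s3}, {s4}
print(quotient_mdp(P, R, B))  # matches P' and R' above, up to relabeling
```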