
3. MDP HOMOMORPHISMS AND MINIMIZATION

3.4 Minimization Framework

Our approach to abstraction can be considered an instance of a general approach known as model minimization. The goal of MDP minimization is to form a reduced model of a system by ignoring irrelevant information. Solving this reduced model should then yield a solution to the original MDP. Frequently, minimization is accomplished by identifying states and actions that are equivalent in a well-defined sense and forming a "quotient" model by aggregating such states and actions. We build a minimization framework that is an extension of the framework of Dean and Givan (1997). It differs from their work in the notion of equivalence we employ, which is based on MDP homomorphisms.

In this section we show that homomorphic equivalence leads to preservation of optimal solutions. We start with the following theorem on optimal value equivalence, which extends the optimal value equivalence theorem developed in Givan et al. (2003) for stochastic bisimulations.

Theorem 1 (Optimal value equivalence): Let $\mathcal{M}' = \langle S', A', \Psi', P', R' \rangle$ be the homomorphic image of the MDP $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$ under the MDP homomorphism $h = \langle f, \{g_s \mid s \in S\} \rangle$. For any $(s, a) \in \Psi$, $Q^*(s, a) = Q^*(f(s), g_s(a))$.

Proof (along the lines of Givan et al., 2003): Let us define the $m$-step optimal discounted action value function recursively, for all $(s, a) \in \Psi$ and for all non-negative integers $m$, as
$$
Q_m(s, a) = R(s, a) + \gamma \sum_{s_1 \in S} \left[ P(s, a, s_1) \max_{a_1 \in A_{s_1}} Q_{m-1}(s_1, a_1) \right],
$$
and set $Q_{-1}(s_1, a_1) = 0$. Letting $V_m(s_1) = \max_{a_1 \in A_{s_1}} Q_m(s_1, a_1)$, we can rewrite this as
$$
Q_m(s, a) = R(s, a) + \gamma \sum_{s_1 \in S} P(s, a, s_1)\, V_{m-1}(s_1).
$$

Now we prove the theorem by induction on $m$. For the base case $m = 0$, we have $Q_0(s, a) = R(s, a) = R'(f(s), g_s(a)) = Q_0(f(s), g_s(a))$. Assume now that $Q_j(s, a) = Q_j(f(s), g_s(a))$ for all values of $j$ less than $m$ and all state-action pairs in $\Psi$. Then we have

$$
\begin{aligned}
Q_m(s, a) &= R(s, a) + \gamma \sum_{s' \in S} P(s, a, s')\, V_{m-1}(s') \\
&= R(s, a) + \gamma \sum_{[s']_{B_h|_S} \in B_h|_S} T\bigl(s, a, [s']_{B_h|_S}\bigr)\, V_{m-1}(s') \\
&= R'(f(s), g_s(a)) + \gamma \sum_{s' \in S'} P'(f(s), g_s(a), s')\, V_{m-1}(s') \\
&= Q_m(f(s), g_s(a)).
\end{aligned}
$$

The second line regroups the sum over the blocks of $B_h|_S$; this is valid because the inductive hypothesis implies that $V_{m-1}$ is constant on each block (indeed $V_{m-1}(s') = V_{m-1}(f(s'))$), and the second and third lines then use the fact that $h$ is a homomorphism. Since $R$ is bounded, it follows by induction that $Q^*(s, a) = Q^*(f(s), g_s(a))$ for all $(s, a) \in \Psi$. □

Corollaries:

1. For any $h$-equivalent $(s_1, a_1), (s_2, a_2) \in \Psi$, $Q^*(s_1, a_1) = Q^*(s_2, a_2)$.

2. For all equivalent $s_1, s_2 \in S$, $V^*(s_1) = V^*(s_2)$.

3. For all $s \in S$, $V^*(s) = V^*(f(s))$.

Proof: Corollary 1 follows from Theorem 1. Corollaries 2 and 3 follow from Theorem 1 and the fact that $V^*(s) = \max_{a \in A_s} Q^*(s, a)$. □
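The $m$-step recursion used in the proof of Theorem 1 is ordinary value iteration written over admissible state-action pairs. The short Python sketch below illustrates it; the dictionary-based encoding (the names states, A, P, R and the discount gamma) is our own illustrative convention and not notation from the text.

```python
# A minimal sketch of the m-step optimal action-value recursion from the proof
# of Theorem 1, assuming a finite MDP encoded with plain Python dictionaries.

def q_m_values(states, A, P, R, gamma, m):
    """Return {(s, a): Q_m(s, a)} for all admissible pairs, starting from Q_{-1} = 0."""
    Q = {(s, a): 0.0 for s in states for a in A[s]}               # Q_{-1}
    for _ in range(m + 1):
        V = {s: max(Q[(s, a)] for a in A[s]) for s in states}    # V_{k-1} from Q_{k-1}
        Q = {(s, a): R[(s, a)] + gamma * sum(P.get((s, a, s1), 0.0) * V[s1]
                                             for s1 in states)
             for s in states for a in A[s]}
    return Q
```

As $m \to \infty$ the iterates converge to $Q^*$ for $\gamma < 1$ and bounded $R$, which is what the induction argument above exploits.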

As shown by Givan et al. (2003), optimal value equivalence is not a sufficient notion of equivalence for our stated minimization goal. In many cases, even when the optimal values are equal, the optimal policies might not be related, and hence we cannot easily transform solutions of $\mathcal{M}'$ into solutions of $\mathcal{M}$. But when $\mathcal{M}'$ is a homomorphic image, a policy in $\mathcal{M}'$ can induce a closely related policy in $\mathcal{M}$.

The following describes how to derive such an induced policy.

Definition: Let $\mathcal{M}'$ be an image of $\mathcal{M}$ under homomorphism $h = \langle f, \{g_s \mid s \in S\} \rangle$. For any $s \in S$, $g_s^{-1}(a')$ denotes the set of actions that have the same image $a' \in A'_{f(s)}$ under $g_s$. Let $\pi'$ be a stochastic policy in $\mathcal{M}'$. Then $\pi'$ lifted to $\mathcal{M}$ is the policy $\pi'_{\mathcal{M}}$ such that for any $a \in g_s^{-1}(a')$, $\pi'_{\mathcal{M}}(s, a) = \pi'(f(s), a') \big/ \left|g_s^{-1}(a')\right|$.

Note: It is sufficient that $\sum_{a \in g_s^{-1}(a')} \pi'_{\mathcal{M}}(s, a) = \pi'(f(s), a')$, but we use the above definition to make the lifted policy unique.

Example 2

This example illustrates the process of lifting a policy from an image MDP to the original MDP. Consider the MDP $\mathcal{M}$ from Example 1 and $\mathcal{M}' = \langle S', A', \Psi', P', R' \rangle$ with $S' = \{s_1', s_2'\}$, $A' = \{a_1', a_2'\}$ and $\Psi' = \{(s_1', a_1'), (s_1', a_2'), (s_2', a_1')\}$. Let $h = \langle f, \{g_s \mid s \in S\} \rangle$ be a homomorphism from $\mathcal{M}$ to $\mathcal{M}'$ defined by
$$
\begin{aligned}
&f(s_1) = s_1', \quad f(s_2) = s_2', \quad f(s_3) = s_2',\\
&g_{s_1}(a_1) = a_2', \quad g_{s_2}(a_1) = a_1', \quad g_{s_3}(a_1) = a_1',\\
&g_{s_1}(a_2) = a_1', \quad g_{s_2}(a_2) = a_1'.
\end{aligned}
$$
Let $\pi'$ be a policy in $\mathcal{M}'$ with
$$
\pi'(s_1', a_1') = 0.6, \quad \pi'(s_1', a_2') = 0.4, \quad \pi'(s_2', a_1') = 1.0.
$$
Now $\pi'$ lifted to $\mathcal{M}$, the policy $\pi'_{\mathcal{M}}$, is derived as follows:
$$
\begin{aligned}
\pi'_{\mathcal{M}}(s_1, a_1) &= \pi'(s_1', a_2') = 0.4, & \pi'_{\mathcal{M}}(s_1, a_2) &= \pi'(s_1', a_1') = 0.6,\\
\pi'_{\mathcal{M}}(s_2, a_1) &= \pi'(s_2', a_1')/2 = 0.5, & \pi'_{\mathcal{M}}(s_2, a_2) &= \pi'(s_2', a_1')/2 = 0.5,\\
\pi'_{\mathcal{M}}(s_3, a_1) &= \pi'(s_2', a_1') = 1.0. &&
\end{aligned}
$$
□
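The lifting operation is easy to mechanize. The following Python sketch encodes $f$, the $g_s$ maps, and $\pi'$ from Example 2 as dictionaries (an illustrative convention of ours) and reproduces the lifted values derived above.

```python
# Lifting a stochastic image policy: pi_M(s, a) = pi'(f(s), g_s(a)) / |g_s^{-1}(g_s(a))|.
# The maps below are those of Example 2, written as plain dictionaries.

f = {"s1": "s1'", "s2": "s2'", "s3": "s2'"}                       # state map f
g = {("s1", "a1"): "a2'", ("s1", "a2"): "a1'",                    # action maps g_s
     ("s2", "a1"): "a1'", ("s2", "a2"): "a1'",
     ("s3", "a1"): "a1'"}
A = {"s1": ["a1", "a2"], "s2": ["a1", "a2"], "s3": ["a1"]}        # admissible actions in M
pi_img = {("s1'", "a1'"): 0.6, ("s1'", "a2'"): 0.4, ("s2'", "a1'"): 1.0}

def lift(pi_img, f, g, A):
    pi_M = {}
    for s, actions in A.items():
        for a in actions:
            preimage = [b for b in actions if g[(s, b)] == g[(s, a)]]
            pi_M[(s, a)] = pi_img[(f[s], g[(s, a)])] / len(preimage)
    return pi_M

print(lift(pi_img, f, g, A))
# {('s1', 'a1'): 0.4, ('s1', 'a2'): 0.6, ('s2', 'a1'): 0.5, ('s2', 'a2'): 0.5, ('s3', 'a1'): 1.0}
```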

Theorem 2: Let $\mathcal{M}' = \langle S', A', \Psi', P', R' \rangle$ be the image of $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$ under the homomorphism $h = \langle f, \{g_s \mid s \in S\} \rangle$. If $\pi'^*$ is an optimal policy for $\mathcal{M}'$, then $\pi'^*_{\mathcal{M}}$ is an optimal policy for $\mathcal{M}$.

Proof: Let $\pi'^*$ be an optimal policy in $\mathcal{M}'$. Consider any $(s, a) \in \Psi$ to which the lifted policy assigns positive probability, i.e., such that $\pi'^*(f(s), g_s(a)) > 0$. Since $\pi'^*$ is optimal, $Q^*(f(s), g_s(a))$ is the maximum value of the $Q^*$ function in state $f(s)$. From Theorem 1 we know that $Q^*(s, a) = Q^*(f(s), g_s(a))$, and by Corollary 3, $V^*(s) = V^*(f(s))$; hence $Q^*(s, a)$ is the maximum value of the $Q^*$ function in state $s$. Thus $a$ is an optimal action in state $s$, and hence $\pi'^*_{\mathcal{M}}$ is an optimal policy for $\mathcal{M}$. □

Theorem 2 establishes that an MDP can be solved by solving one of its homomorphic images. To achieve the most impact, we need to derive a smallest homomorphic image of the MDP, i.e., an image with the least number of admissible state-action pairs. The following definition formalizes this notion.


Figure 3.4. (a) Transition graph of the example MDP $\mathcal{M}$. This MDP is irreducible under a traditional minimization framework; our notion of homomorphic equivalence allows us to minimize it further. (b) Transition graph of the minimal image of the MDP $\mathcal{M}$ in (a).


Definition: An MDP $\mathcal{M}$ is a minimal MDP if for every homomorphic image $\mathcal{M}'$ of $\mathcal{M}$, there exists a homomorphism from $\mathcal{M}'$ to $\mathcal{M}$. A minimal image of an MDP $\mathcal{M}$ is a homomorphic image of $\mathcal{M}$ that is also a minimal MDP.

The model minimization problem can now be stated as: “find a minimal image of a given MDP”. Since this can be computationally prohibitive, we frequently settle for a reasonably reduced model, even if it is not a minimal MDP.
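As a rough, back-of-the-envelope illustration of why exhaustive search is prohibitive (this is not an algorithm from this chapter), consider only the state-aggregation part of the problem: the candidate partitions of an $n$-state set already number the Bell number $B_n$, computed in the sketch below via the Bell triangle, and the choice of action mappings only enlarges the search space further.

```python
# Bell numbers B_n count the ways to partition an n-element state set.
# Computed with the Bell triangle; the growth shows why naive enumeration
# of candidate state aggregations is infeasible.

def bell(n):
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[-1]

for n in (4, 10, 20):
    print(n, bell(n))   # 4 -> 15, 10 -> 115975, 20 -> 51724158235372
```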

Illustration of Minimization: An Abstract MDP Example

We illustrate our minimization framework on a very simple abstract MDP, shown in Figure 3.4(a), which we will use as a running example while we develop the framework further. We chose such a simple example to make it easier to present the computations involved in later stages. Note, though, that this MDP is irreducible under the state-equivalence-based MDP minimization framework of Dean and Givan (1997).

The parameters of $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$ are $S = \{s_1, s_2, s_3, s_4\}$, $A = \{a_1, a_2\}$, $\Psi = S \times A$, $P$ defined as in Table 3.1, and $R$ given by $R(s_2, a_1) = R(s_3, a_2) = 0.8$ and $R(s_2, a_2) = R(s_3, a_1) = 0.2$; for all other values of $i$ and $j$, $R(s_i, a_j)$ equals zero.

(i) Under action a1:

from \ to    s1     s2     s3     s4
  s1         0      0.8    0.2    0
  s2         0.2    0      0      0.8
  s3         0.8    0      0      0.2
  s4         0      0      0      1.0

(ii) Under action a2:

from \ to    s1     s2     s3     s4
  s1         0      0.2    0.8    0
  s2         0.8    0      0      0.2
  s3         0.2    0      0      0.8
  s4         0      0      0      1.0

Table 3.1. Transition probabilities for the MDP $\mathcal{M}$ shown in Figure 3.4(a): (i) under action $a_1$; (ii) under action $a_2$.
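For later reference, the parameters of $\mathcal{M}$ can be written down directly in code. The numpy encoding below (states indexed 0 to 3 for $s_1$ through $s_4$, actions 0 and 1 for $a_1$ and $a_2$) is our own illustrative convention; the numbers are taken from Table 3.1 and the reward specification above.

```python
import numpy as np

# Transition matrices of M from Table 3.1: P[a][i, j] = P(s_{i+1}, a_{a+1}, s_{j+1}).
P = np.array([
    [[0.0, 0.8, 0.2, 0.0],   # action a1
     [0.2, 0.0, 0.0, 0.8],
     [0.8, 0.0, 0.0, 0.2],
     [0.0, 0.0, 0.0, 1.0]],
    [[0.0, 0.2, 0.8, 0.0],   # action a2
     [0.8, 0.0, 0.0, 0.2],
     [0.2, 0.0, 0.0, 0.8],
     [0.0, 0.0, 0.0, 1.0]],
])

# Rewards: R[a][i] = R(s_{i+1}, a_{a+1}); only s2 and s3 yield non-zero reward.
R = np.array([
    [0.0, 0.8, 0.2, 0.0],    # action a1
    [0.0, 0.2, 0.8, 0.0],    # action a2
])

assert np.allclose(P.sum(axis=2), 1.0)   # every row is a probability distribution
```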

The MDP $\mathcal{M}'$ shown in Figure 3.4(b) is a homomorphic image of $\mathcal{M}$. It has the following parameters: $S' = \{\sigma_1, \sigma_2, \sigma_3\}$, $A' = \{\alpha_1, \alpha_2\}$, $\Psi' = \{(\sigma_1, \alpha_1), (\sigma_2, \alpha_1), (\sigma_2, \alpha_2), (\sigma_3, \alpha_1)\}$, $P'$ as shown in Table 3.2, and $R'$ defined as follows: $R'(\sigma_2, \alpha_1) = 0.2$, $R'(\sigma_2, \alpha_2) = 0.8$, and all other rewards are zero.

$P'(\sigma_1, \alpha_1, \sigma_2) = 1.0$   $P'(\sigma_3, \alpha_1, \sigma_3) = 1.0$
$P'(\sigma_2, \alpha_1, \sigma_1) = 0.8$   $P'(\sigma_2, \alpha_2, \sigma_1) = 0.2$
$P'(\sigma_2, \alpha_1, \sigma_3) = 0.2$   $P'(\sigma_2, \alpha_2, \sigma_3) = 0.8$

Table 3.2. The transition probabilities of the MDP $\mathcal{M}'$ shown in Figure 3.4(b).

One can define a homomorphism $\langle f, \{g_s \mid s \in S\} \rangle$ from $\mathcal{M}$ to $\mathcal{M}'$ as follows: $f(s_1) = \sigma_1$, $f(s_2) = f(s_3) = \sigma_2$, and $f(s_4) = \sigma_3$; $g_{s_1}(a_i) = g_{s_4}(a_i) = \alpha_1$ for $i = 1, 2$, $g_{s_2}(a_1) = g_{s_3}(a_2) = \alpha_2$, and $g_{s_2}(a_2) = g_{s_3}(a_1) = \alpha_1$.
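The homomorphism conditions can be checked mechanically: for every $(s, a) \in \Psi$ we need $R(s, a) = R'(f(s), g_s(a))$, and the total probability of reaching each block $f^{-1}(\sigma)$ under $(s, a)$ must equal $P'(f(s), g_s(a), \sigma)$. The sketch below restates the parameters of $\mathcal{M}$ and $\mathcal{M}'$ from Tables 3.1 and 3.2 (the array encoding and variable names are our own convention) and verifies both conditions for the mapping just defined.

```python
import numpy as np

# MDP M (Table 3.1): P[a][i, j] = P(s_{i+1}, a_{a+1}, s_{j+1}); R[a][i] = R(s_{i+1}, a_{a+1}).
P = np.array([
    [[0.0, 0.8, 0.2, 0.0], [0.2, 0.0, 0.0, 0.8], [0.8, 0.0, 0.0, 0.2], [0.0, 0.0, 0.0, 1.0]],
    [[0.0, 0.2, 0.8, 0.0], [0.8, 0.0, 0.0, 0.2], [0.2, 0.0, 0.0, 0.8], [0.0, 0.0, 0.0, 1.0]],
])
R = np.array([[0.0, 0.8, 0.2, 0.0], [0.0, 0.2, 0.8, 0.0]])

# Image MDP M' (Table 3.2): indices 0-2 for sigma_1..sigma_3, 0-1 for alpha_1, alpha_2.
P_img = np.zeros((2, 3, 3))
P_img[0, 0, 1] = 1.0                        # P'(sigma1, alpha1, sigma2)
P_img[0, 2, 2] = 1.0                        # P'(sigma3, alpha1, sigma3)
P_img[0, 1, 0], P_img[0, 1, 2] = 0.8, 0.2   # P'(sigma2, alpha1, .)
P_img[1, 1, 0], P_img[1, 1, 2] = 0.2, 0.8   # P'(sigma2, alpha2, .)
R_img = np.zeros((2, 3))
R_img[0, 1], R_img[1, 1] = 0.2, 0.8         # R'(sigma2, alpha1), R'(sigma2, alpha2)

f = [0, 1, 1, 2]                            # f(s1)=sigma1, f(s2)=f(s3)=sigma2, f(s4)=sigma3
g = [[0, 0], [1, 0], [0, 1], [0, 0]]        # g[s][a]: image action index of a_{a+1} in s_{s+1}

# Check the homomorphism conditions for every state-action pair of M.
for s in range(4):
    for a in range(2):
        assert np.isclose(R[a, s], R_img[g[s][a], f[s]])
        for sigma in range(3):
            block_prob = sum(P[a, s, t] for t in range(4) if f[t] == sigma)
            assert np.isclose(block_prob, P_img[g[s][a], f[s], sigma])
print("h is an MDP homomorphism from M to M'")
```

One could also compute optimal action values for both models with the same discount factor and observe that $Q^*(s, a) = Q^*(f(s), g_s(a))$ for every admissible pair, as Theorem 1 requires.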