


Constrained Markovian decision processes: the dynamic programming approach ☆

A.B. Piunovskiy a,∗, X. Mao b

a Department of Mathematical Science, Division of Statistics and Operational Research, M and O Building, University of Liverpool, Liverpool, L69 7ZL, UK
b Strathclyde University, Glasgow, UK

Received 1 April 1999; received in revised form 1 May 2000

Abstract

We consider semicontinuous controlled Markov models in discrete time with total expected losses. Only control strategies which meet a set of given constraint inequalities are admissible. One has to build an optimal admissible strategy. The main result consists in the constructive development of an optimal strategy with the help of the dynamic programming method. The model studied covers the case of a finite horizon and the case of a homogeneous discounted model with different discount factors. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Markovian decision processes; Constrained optimization; Dynamic programming; Penalty functions

1. Introduction

Constrained Markov decision processes have been studied by different authors during the last 15 years. From the formal point of view, such problems can be reformulated as linear programs on spaces of measures, and the convex analytic approach makes it possible to formulate necessary and sufficient conditions of optimality (Kuhn–Tucker theorem), to establish the form of optimal strategies, and so on. The corresponding results can be found, for instance, in the monographs by Altman [1], Borkar [3], and Piunovskiy [6], and in the papers by Feinberg and Shwartz [5] and by Piunovskiy [7].

☆ This work was performed under a grant of the Royal Society, UK.
∗ Corresponding author. E-mail address: piunov@liverpool.ac.uk (A.B. Piunovskiy).

The present article is devoted to the dynamic programming approach to constrained problems. The main idea is similar to the penalty functions method; as a result, we get an unconstrained optimization model with deterministic transitions where the Bellman principle is valid. If the problem were originally deterministic, then the basic idea would be to include the accumulated cost corresponding to the constraints in the state, and to assign an immediate cost of infinity if that part of the state exceeds the given bound on the constraint. In the stochastic case, this approach becomes a bit more complicated, since we must also remember the current probability distribution on the state space. It should be emphasized that such an approach can be developed only using the results obtained earlier with the help of the convex analytic approach (Theorem 1 below). We consider here the case of total expected losses, which reduces to the finite-horizon model and to the homogeneous discounted model (with different discount factors) in particular cases.

2. Model description and auxiliary results

Let us consider the controlled model $Z=\{X,A,p\}$, where $X$ is the Borel state space; $A$ is the action space (a metric compact); $p_t(dy|x,a)$ is the continuous transition probability, that is, $\int_X c(y)\,p_t(dy|x,a)$ is a continuous function for each continuous $c(\cdot)$. As usual, a control strategy $\pi$ is a sequence of measurable stochastic kernels $\pi_t(da|h_{t-1})$ on $A$, where $h_{t-1}=(x_0,a_1,x_1,\dots,a_{t-1},x_{t-1})$. A strategy is called Markov if it is of the form $\pi_t(da|h_{t-1})=\pi^m_t(da|x_{t-1})$ and is called stationary if $\pi_t(da|h_{t-1})=\pi^s(da|x_{t-1})$. The initial probability distribution $P_0(dx)\in P(X)$ is assumed to be fixed. Here and in what follows, $P(X)$ is the space of all probability measures on a Borel space $X$, equipped with the weak topology.

As is known, each strategy $\pi$ defines a unique probability measure $P^\pi$ on the trajectory space $H_\infty=X\times(A\times X)^\infty$. A detailed description of these constructions can be found in [3,6] and in the monographs by Bertsekas and Shreve [2] and by Dynkin and Yushkevich [4]. The integral with respect to the measure $P^\pi$ is denoted by $E^\pi$.

The traditional optimal control problem consists in the minimization of the following functional:

R(\pi) = E^\pi\left[\sum_{t=1}^{\infty} \beta_0^{t-1}\, r_t(x_{t-1},a_t)\right] \to \min_\pi, \tag{1}

where $r_t(\cdot)$ is a cost function and $\beta_0>0$ is a discount factor. Problem (1) was investigated in [2–4]. Such functionals are usually called total expected discounted losses, as distinct from the average expected losses of the type

\lim_{T\to\infty} \frac{1}{T}\, E^\pi\left[\sum_{t=1}^{T} r(x_{t-1},a_t)\right] \to \min_\pi. \tag{2}

Problem (2) was also studied in [3,4], but the present article is devoted to models with total expected losses. (However, see the Conclusion, where the average losses are mentioned.) If there are no costs in (1) beyond the time $T$, then one usually puts $\beta_0=1$ (the case of a finite horizon). If the cost function $r$ and the transition probability $p$ do not depend on the time and $\beta_0\in(0,1)$, then we deal with the homogeneous discounted model.

Let us assume that $r_t(x,a)$ is a lower-semicontinuous lower-bounded function and that the transition probability $p_t(dy|x,a)$ is continuous. Suppose that lower-semicontinuous lower-bounded functions $s^n_t(x,a)$ are given, as well as discount factors $\beta_n>0$ and real numbers $d_n$, $n=1,2,\dots,N$. A strategy $\pi$ is called admissible if the inequalities

S^n(\pi) = E^\pi\left[\sum_{t=1}^{\infty} \beta_n^{t-1}\, s^n_t(x_{t-1},a_t)\right] \le d_n, \quad n=1,2,\dots,N \tag{3}

are satisfied. In what follows, expressions (1) and (3) are assumed to be well defined. To be more specific, we study either the model with a finite horizon or the discounted model with $\beta_n\in(0,1)$, $n=0,1,\dots,N$.

One must build an optimal admissible strategy; in other words, one must solve problem (1) under constraints (3). Such problems were investigated in [1,3,5,6]. It should be noted that the strongest results were obtained for the case of the homogeneous discounted model with a common discount factor. The homogeneous model with different discount factors was studied in detail only for finite sets $X$ and $A$ [5].

In what follows, we shall need the following known result.

Theorem 1. If there exists at least one admissible strategy, then there exists a solution of problem (1),(3) defined by a Markov strategy. If the model is homogeneous and $\beta_0=\beta_1=\dots=\beta_N\in(0,1)$, then the class of stationary strategies is sufficient in problem (1),(3).

In principle, one can build the solution of the constrained problem with the help of the Lagrange multipliers technique [6]. But it seems more convenient to use the dynamic programming approach. The description of that approach, presented in Section 3, is the main result of this article.

In what follows we assume that $r_t(\cdot)\ge 0$ and $s^n_t(\cdot)\ge 0$; in the cases of a finite horizon and of the discounted model with $\beta_n\in(0,1)$, $n=0,1,\dots,N$, this assumption does not decrease the generality. Besides, one can include the time $t$ into the state, $\tilde x=(x,t)$, and obtain a homogeneous model, in which all the functions and the transition probability do not depend on the time.

Remark 1. Theorem 1 was proved in the book [6] under the assumption that $X$ is compact. But all the proofs can be generalized with the help of the results by Schäl [8]; see also the review [7]. Models with a finite or countable set $X$ were considered in [1,3].

Example. Let us consider the one-channel Markov queueing system with losses. Put $X=\{0,1\}$, where $x_t=0$ ($x_t=1$) means that the system is free (busy) at the time moment $t$; $A=\{0,1\}$, where $a_t=0$ ($a_t=1$) means that the system effects less intensive (more intensive) servicing on the interval $(t-1,t]$. The initial probability $P_0(1)$ of the system's being busy is known. The transition probability at step $t$ is expressed by the formula

p_t(y|x,a) = \begin{cases} p & \text{if } x=0,\ y=1,\\ 1-p & \text{if } x=0,\ y=0,\\ q_a & \text{if } x=1,\ y=0,\\ 1-q_a & \text{if } x=1,\ y=1. \end{cases}

Here $p$ is the probability of a customer arriving on the interval $(t-1,t]$; $q_0$ ($q_1$) is the probability that the service ends between the time moments $t-1$ and $t$ under the less (more) intensive regime; $0<q_0<q_1$. Lastly, $e_a$ is the cost of servicing under the corresponding regime on the interval $(t-1,t]$, $e_1>e_0>0$; $c>0$ is the penalty caused by the loss of a customer, which is paid only if a customer arrives at the busy system and is rejected. We have to minimize the service cost under the constraint that the expected penalty for the loss of customers is no bigger than $d$. Therefore, we put

r(x,a)=e_a, \qquad s(x,a)=xpc, \qquad N=1

and investigate the discounted model (1),(3) with the common discount factor $\beta_0=\beta_1=\beta$. The complete solution of this problem is presented later.
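To make the example concrete, the following sketch simulates the queueing chain and estimates the functionals (1) and (3) for a stationary randomized strategy by Monte Carlo. All numerical values, and the strategy itself, are illustrative assumptions rather than data from the paper:

import random

# Assumed illustrative parameters: arrival probability p, completion
# probabilities q0 < q1, service costs e0 < e1, loss penalty c, and the
# common discount factor beta = beta_0 = beta_1.
p, q0, q1 = 0.5, 0.3, 0.7
e0, e1, c = 1.0, 2.0, 4.0
beta = 0.9

def step(x, a):
    # One transition of the chain p(y|x,a): x, a in {0, 1}.
    if x == 0:
        return 1 if random.random() < p else 0
    return 0 if random.random() < (q1 if a == 1 else q0) else 1

def estimate(policy, P0_busy=0.5, n_paths=20000, horizon=200):
    # Monte Carlo estimates of R(pi), Eq. (1), and S(pi), Eq. (3),
    # truncated at `horizon` (the tail is negligible for beta < 1).
    R_sum = S_sum = 0.0
    for _ in range(n_paths):
        x = 1 if random.random() < P0_busy else 0
        disc = 1.0
        for _t in range(horizon):
            a = 1 if random.random() < policy(x) else 0  # randomized action
            R_sum += disc * (e1 if a == 1 else e0)       # r(x, a) = e_a
            S_sum += disc * x * p * c                    # s(x, a) = x p c
            x = step(x, a)
            disc *= beta
    return R_sum / n_paths, S_sum / n_paths

# Serve intensively with probability 0.4 whenever the system is busy.
R_hat, S_hat = estimate(lambda x: 0.4 if x == 1 else 0.0)
print(f"R(pi) ~ {R_hat:.3f}, S(pi) ~ {S_hat:.3f}")

A strategy is then admissible, in the sense of (3), when the estimate of $S$ stays below the bound $d$.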

3. Dynamic programming approach

The main concept of this approach is close to the penalty functions method. A new deterministic model will be built in which the losses equal $+\infty$ for non-admissible strategies. If a strategy is admissible, then the value of the main functional (1) does not change.

The state in the new deterministic model is the pair $(P_t,W_t)$, where $P_t$ and $W_t$ are the probability distribution on $X$ and the expected accumulated vector of losses associated with the functions $s_t(\cdot)$. The action $\tilde a_t=\tilde a_t(d(x,a))$ at the instant $t$ is a probability measure on $X\times A$; $\tilde a_t\in\tilde A=P(X\times A)$. If the state is $\tilde x=(P,W)$, then only those actions $\tilde a$ are available for which the projection onto $X$ (the marginal) coincides with $P$:

\tilde A(P) = \{\tilde a\in\tilde A:\ \tilde a(dx\times A)=P(dx)\}

is the space of available actions. The dynamics of the new model are defined by the relations

P_t(dx) = P_t(\tilde a_t)(dx) = \int_{X\times A} \tilde a_t(d(y,a))\, p_t(dx|y,a),

W^n_t = W^n_t(W^n_{t-1},\tilde a_t) = W^n_{t-1} + \beta_n^{t-1} \int_{X\times A} s^n_t(x,a)\, \tilde a_t(d(x,a)), \quad n=1,2,\dots,N,\ t=1,2,\dots \tag{4}

under the given initial conditions $P_0$ and $W_0=0$.
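For finite $X$ and $A$, the update (4) is a pair of tensor contractions. The sketch below is a direct transcription; the array shapes and names are mine, not the paper's:

import numpy as np

def new_model_step(P, W, a_tilde, kernel, s_costs, betas, t):
    # One step of dynamics (4) for finite X (size nx) and A (size na):
    #   P        (nx,)        distribution P_{t-1} on X
    #   W        (N,)         accumulated constraint losses W_{t-1}
    #   a_tilde  (nx, na)     action: a measure on X x A with X-marginal P
    #   kernel   (nx, na, nx) transition probabilities p(y | x, a)
    #   s_costs  (N, nx, na)  constraint cost functions s^n(x, a)
    #   betas    (N,)         discount factors beta_n
    assert np.allclose(a_tilde.sum(axis=1), P)  # a_tilde must lie in A~(P)
    # P_t(y) = sum_{x,a} a_tilde(x, a) p(y | x, a)
    P_next = np.einsum('xa,xay->y', a_tilde, kernel)
    # W_t^n = W_{t-1}^n + beta_n^{t-1} sum_{x,a} s^n(x, a) a_tilde(x, a)
    W_next = W + betas ** (t - 1) * np.einsum('nxa,xa->n', s_costs, a_tilde)
    return P_next, W_next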

Since $s_t(\cdot)\ge 0$ and $r_t(\cdot)\ge 0$, the variable $W_t$ does not decrease in the new model under every control strategy $\tilde\pi=\{\tilde a_t\}_{t=1}^\infty$. We have the model $\tilde Z=(\tilde X,\tilde A,\tilde p)$, where $\tilde X=P(X)\times R^N$, $\tilde A=P(X\times A)$, and the transition probability $\tilde p_t(d\tilde y|\tilde x,\tilde a)$ is concentrated at the unique point defined by expressions (4) (to put it differently, the model $\tilde Z$ is deterministic). Let $\tilde\pi=\{\tilde a_t\}_{t=1}^\infty$ be a deterministic programmed control strategy in the new model (all its elements are equipped with the tilde). The loss functional is defined by the formula

\tilde R(\tilde\pi) = \sum_{t=1}^{\infty} \tilde R_t(P_{t-1},W_{t-1},\tilde a_t) \to \inf_{\tilde\pi}, \tag{5}

where

\tilde R_t(P,W,\tilde a) = \begin{cases} \beta_0^{t-1} \int_{X\times A} r_t(x,a)\,\tilde a(d(x,a)) & \text{if } W\le d,\\ +\infty & \text{otherwise.} \end{cases}

Obviously, there exists a trivial one-to-one correspondence between deterministic programmed strategies $\tilde\pi$ and Markov strategies $\pi^m$ in the initial model:

\tilde\pi \leftrightarrow \pi^m: \quad \tilde a_t(d(x,a)) = \pi^m_t(da|x)\, P_{t-1}(dx).

Remark 2. Since the model $\tilde Z$ is deterministic and the initial state $(P_0,0)$ is fixed, every deterministic feedback control in the model $\tilde Z$ can be presented as a deterministic programmed strategy $\tilde a_1,\tilde a_2,\dots$: all the elements of the sequence

(P_0,W_0),\ \tilde a_1,\ (P_1,W_1),\ \tilde a_2,\ \dots

are defined uniquely.

According to the construction, if $\tilde\pi\leftrightarrow\pi^m$, then

\tilde R(\tilde\pi) = \begin{cases} R(\pi^m) & \text{if } S(\pi^m)\le d,\\ +\infty & \text{otherwise.} \end{cases}

Therefore, it is sufficient to solve problem (5) (to build an optimal deterministic programmed strategy $\tilde\pi$): the corresponding Markov control strategy $\pi^m$ is a solution of problem (1),(3) in the initial model. Theorem 1 is used here. If there are no admissible strategies in the initial model, then $\inf_{\tilde\pi}\tilde R(\tilde\pi)=+\infty$.
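Since the new model is deterministic, the loss (5) of a Markov strategy can be evaluated exactly by rolling the pair $(P_t,W_t)$ forward, without any simulation. A minimal sketch for finite $X$ and $A$, assuming time-independent costs and truncating the sum at a horizon $T$ (both assumptions are mine):

import numpy as np

def R_tilde(pi_seq, P0, kernel, r, s, d, beta0, betas, T):
    # Evaluate (5) for the Markov strategy pi_seq[t-1][x][a], t = 1..T.
    # Returns +inf as soon as some accumulated loss W^n exceeds d_n.
    P, W = P0.copy(), np.zeros(len(d))
    total = 0.0
    for t in range(1, T + 1):
        a_tilde = pi_seq[t - 1] * P[:, None]   # a~_t(x,a) = pi_t(a|x) P_{t-1}(x)
        if np.any(W > d):                      # the +infinity branch of R~_t
            return np.inf
        total += beta0 ** (t - 1) * np.einsum('xa,xa->', r, a_tilde)
        W = W + betas ** (t - 1) * np.einsum('nxa,xa->n', s, a_tilde)
        P = np.einsum('xa,xay->y', a_tilde, kernel)
    return total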

Assume that all the functions $s^n_t(\cdot)$, $n=1,2,\dots,N$, are finite and continuous. Then the mappings $P_t,W_t$ in (4) are continuous. Besides, $\tilde A(P)$ is compact for each $P$ [8], and the multifunction $P\to\tilde A(P)$ is quasicontinuous, that is, for each convergent sequence $\lim_{i\to\infty}P_i=P$, the sequence of arbitrarily chosen points $\tilde a_i\in\tilde A(P_i)$ has a limit point $\tilde a_\infty\in\tilde A(P)$ [4]. Hence, the model $\tilde Z$ is semicontinuous [4].

Example. In the queueing system described in Section 2 we have

\tilde A(P) = \{\tilde a\in\tilde A=P(X\times A):\ \tilde a(1,0)+\tilde a(1,1)=P(1),\ \tilde a(0,0)+\tilde a(0,1)=P(0)\}.

That is, if we have the probability distribution $P$ on $X$, then an action $\tilde a$ is available if and only if the marginal of $\tilde a$ coincides with $P$. Clearly, $P(0)+P(1)=1$, so we can omit the component $P(0)$ everywhere. The dynamics (4) take the form

P_t(1) = P(\tilde a_t)(1) = p\,P_{t-1}(0) + \tilde a_t(1,0)(1-q_0) + \tilde a_t(1,1)(1-q_1)
       = p - [\tilde a_t(1,0)+\tilde a_t(1,1)]\,[p-1+q_0] + \tilde a_t(1,1)(q_0-q_1),

W_t = W_t(W_{t-1},\tilde a_t) = W_{t-1} + \beta^{t-1} pc\,[\tilde a_t(1,0)+\tilde a_t(1,1)].

The loss function is equal to

\tilde R_t(P(1),W,\tilde a) = \begin{cases} \beta^{t-1}\{[\tilde a(0,0)+\tilde a(1,0)]e_0 + [\tilde a(0,1)+\tilde a(1,1)]e_1\} & \text{if } W\le d,\\ +\infty & \text{if } W>d. \end{cases}

If $\tilde\pi=\{\tilde a_t\}_{t=1}^\infty$ is a deterministic programmed strategy, then the control

\tilde a_t = \{\tilde a_t(0,0),\ \tilde a_t(0,1),\ \tilde a_t(1,0),\ \tilde a_t(1,1)\}

corresponds to the Markovian control law $\pi^m_t(a|x)$ at step $t$ of the following form:

\pi^m_t(0|0) = \frac{\tilde a_t(0,0)}{\tilde a_t(0,0)+\tilde a_t(0,1)}, \qquad \pi^m_t(1|0) = \frac{\tilde a_t(0,1)}{\tilde a_t(0,0)+\tilde a_t(0,1)},

\pi^m_t(0|1) = \frac{\tilde a_t(1,0)}{\tilde a_t(1,0)+\tilde a_t(1,1)}, \qquad \pi^m_t(1|1) = \frac{\tilde a_t(1,1)}{\tilde a_t(1,0)+\tilde a_t(1,1)}.

Here

\tilde a_t(0,0)+\tilde a_t(0,1) = P_{t-1}(0), \qquad \tilde a_t(1,0)+\tilde a_t(1,1) = P_{t-1}(1),

and we put $\pi^m_t(0|0)=0$ and $\pi^m_t(1|0)=1$ if $P_{t-1}(0)=0$; $\pi^m_t(0|1)=0$ and $\pi^m_t(1|1)=1$ if $P_{t-1}(1)=0$.
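In code, these recovery formulas amount to normalizing the rows of $\tilde a_t$, with the stated convention for rows of zero mass. A minimal sketch (the 2x2 array layout [x][a] is an assumption of mine):

import numpy as np

def markov_from_programmed(a_tilde, eps=1e-12):
    # Given a_tilde[x][a], return pi^m(a|x); rows with zero mass get the
    # convention pi(0|x) = 0, pi(1|x) = 1 used in the text.
    pi = np.empty((2, 2))
    for x in range(2):
        mass = a_tilde[x].sum()                 # equals P_{t-1}(x)
        pi[x] = a_tilde[x] / mass if mass > eps else np.array([0.0, 1.0])
    return pi

def programmed_from_markov(pi, P):
    # The inverse direction: a_tilde(x, a) = pi(a|x) P(x).
    return pi * P[:, None]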

The complete solution of this example is presented in Section 4. Now we return to the general model (4),(5). The Bellman equation in this situation is of the standard form:

v_t(P,W) = \inf_{\tilde a\in\tilde A(P)} \{\tilde R_{t+1}(P,W,\tilde a) + v_{t+1}(P_{t+1}(\tilde a),\ W_{t+1}(W,\tilde a))\}, \quad t=0,1,\dots \tag{6}

It has a solution in the class of lower-semicontinuous lower-bounded functions. We are interested only in the minimal nonnegative solution, which can be obtained by successive approximations [2]:

v^{k+1} = U v^k, \qquad v^0 \equiv 0,

where $U$ is called the Bellman operator (the operator defined by the right-hand side of (6)). The limit function $v^\infty=\lim_{k\to\infty}v^k$ exists and coincides with the solution of interest, that is, the Bellman function. If a control $\tilde a^*_t(P,W)$ provides the infimum in (6), then the feedback strategy $\tilde\pi^*_t=\tilde a^*_t(P_t,W_t)$ is optimal in the model $\tilde Z$. Notice that such a strategy (a measurable mapping $\tilde a^*_t(P,W)$) exists, since $v_t(\cdot)$ is a lower-semicontinuous lower-bounded function and $\tilde A(P)$ is compact.

Therefore, problem (5) has a solution in the form of a feedback control law and in the form of a deterministic programmed strategy (see Remark 2).

If we deal with the model with the finite horizon $T$, then the sequence $v^k$ converges in a finite number of steps: $v^{T+1}=v$.
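For the queueing example, the backward recursion (6) with $\beta=1$ can be carried out numerically on a grid over the state $(P(1),W)$. The sketch below is only a numerical illustration: the parameter values, grid sizes, nearest-neighbour interpolation, and the large constant standing in for $+\infty$ are all my choices, and the control is restricted to $\tilde a(1|0)=0$, which is justified in Section 4:

import numpy as np

# Assumed illustrative data; finite horizon, so beta_0 = beta_1 = 1.
p, q0, q1 = 0.5, 0.3, 0.7
e0, e1, c, d, T = 1.0, 2.0, 4.0, 10.0, 15

P_grid = np.linspace(0.0, 1.0, 41)       # grid for P(1)
W_grid = np.linspace(0.0, 1.5 * d, 61)   # grid for the accumulated penalty W
U_grid = np.linspace(0.0, 1.0, 21)       # grid for the control u = a~(1|1)
BIG = 1e9                                # finite stand-in for +infinity

v = np.zeros((len(P_grid), len(W_grid)))           # terminal value v_T = 0
for t in range(T, 0, -1):                          # backward recursion (6)
    v_new = np.full_like(v, BIG)
    for i, P1 in enumerate(P_grid):
        for j, W in enumerate(W_grid):
            if W > d:                              # the +infinity branch of R~
                continue
            Wn = W + p * c * P1                    # W-update: s = x p c, beta = 1
            jj = int(np.abs(W_grid - Wn).argmin()) # crude nearest-neighbour step
            best = BIG
            for u in U_grid:
                cost = e0 + (e1 - e0) * P1 * u     # one-step service cost
                P1n = p + P1 * (1.0 - p - q0) - P1 * u * (q1 - q0)
                ii = int(np.abs(P_grid - P1n).argmin())
                best = min(best, cost + v[ii, jj])
            v_new[i, j] = best
    v = v_new

i0 = int(np.abs(P_grid - 0.5).argmin())
print('v_0(P(1)=0.5, W=0) ~', v[i0, 0])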

Let us consider the homogeneous case, when all the cost functions and the transition probability do not depend on the time. We introduce the new variable

\tilde d^n_t = \frac{d_n - W^n_t}{\beta_n^{t}}, \qquad n=1,2,\dots,N.

The component $\tilde d^n_t$ equals the expected loss of the type $n$ which is admissible on the remaining interval $\{t+1,t+2,\dots\}$. The loss function at one step can be written as

\hat R(P,\tilde d,\tilde a) = \begin{cases} \int_{X\times A} r(x,a)\,\tilde a(d(x,a)) & \text{if } \tilde d\ge 0,\\ +\infty & \text{otherwise.} \end{cases}

We deal with the standard discounted model

\hat R(\tilde\pi) = \sum_{t=1}^{\infty} \beta_0^{t-1}\, \hat R(P_{t-1},\tilde d_{t-1},\tilde a_t) \to \inf_{\tilde\pi}, \tag{7}

where the homogeneous dynamic equation for $P_t$ is given by (4), and the dynamics of the component $\tilde d_t$ can be defined by the following equation:

\tilde d^n_t = \frac{1}{\beta_n}\left[\tilde d^n_{t-1} - \int_{X\times A} s^n(x,a)\,\tilde a_t(d(x,a))\right], \qquad n=1,2,\dots,N.

The Bellman equation for problem (7) is of the form

\hat v(P,\tilde d) = \inf_{\tilde a\in\tilde A(P)} \{\hat R(P,\tilde d,\tilde a) + \beta_0\, \hat v(P(\tilde a),\ \tilde d(\tilde d,\tilde a))\}, \tag{8}

and we are interested in its minimal nonnegative solution. Eq. (8) can also be solved by the successive approximations method. It is simpler than Eq. (6), since the time dependence is absent.

In actual practice, it is often easy to build the domain $G$ where $\hat v(P,\tilde d)=\infty$ (there are no admissible strategies). One can show that $G$ is an open set. On its complement

\bar G = \{(P,\tilde d)\in P(X)\times R^N:\ \hat v(P,\tilde d)<+\infty\},

Eq. (8) has a unique lower-semicontinuous uniformly bounded solution. (The Bellman operator in the right-hand side is a contraction in the space of lower-semicontinuous bounded functions on $\bar G$.) It should be emphasized that this solution, extended by infinity on $G$, provides the minimal nonnegative solution of Eq. (6) with the help of the formula

v_t(P,W) = \beta_0^{t}\, \hat v(P,\tilde d_t), \qquad \tilde d^n_t = \frac{d_n - W^n}{\beta_n^{t}}.

Eq. (8) cannot have any other bounded solutions on $\bar G$. On the contrary, if $v$ is a solution of (6), then $v+c$ is also a solution of Eq. (6) for every constant $c$.

4. Solving the example

In this section, we present the complete solution of the example described in Section 2. In what follows, we use the notation $g \triangleq \beta p/(1-\beta)$.

One can show that admissible strategies exist if and only if the initial distribution $P(x)$ and the boundary $d$ satisfy

d \ge \frac{pc(g+P(1))}{1-\beta+\beta q_1+\beta p}.

The multifunction $\tilde A(P)$ was defined in the previous section. As before, the variable $\tilde d$ denotes the expected penalty for the loss of customers which is admissible on the remaining time interval.

It is convenient to introduce the conditional probabilities

\tilde a(1|0) = \frac{\tilde a(0,1)}{P(0)}, \qquad \tilde a(1|1) = \frac{\tilde a(1,1)}{P(1)},

which are the two independent parameters of the action $\tilde a$. (If the denominator $P(0)$ or $P(1)$ equals zero, then the corresponding fraction equals an arbitrary number, say zero.) Notice that if $\tilde\pi=\{\tilde a_t\}_{t=1}^\infty$ is a deterministic programmed strategy, then the corresponding Markov strategy in the initial model has the form $\pi^m_t(a|x)=\tilde a_t(a|x)$. In this notation, the dynamics operators become

P(\tilde a)(1) = p + P(1)(1-p-q_0) - P(1)\,\tilde a(1|1)(q_1-q_0), \qquad D(\tilde d,\tilde a) = \frac{\tilde d - pcP(1)}{\beta}. \tag{11}

Since the operators $P$ and $D$ do not depend on $\tilde a(1|0)$, the optimal values of $\tilde a(0,1)$ and $\tilde a(1|0)$ are zero, and thus $P(0)=\tilde a(0,0)$. Therefore, Eq. (10) takes the following form:

\hat v(P,\tilde d) = \inf_{0\le\tilde a(1|1)\le 1} \{e_0 + (e_1-e_0)P(1)\,\tilde a(1|1) + \beta\,\hat v(P(\tilde a),\ D(\tilde d,\tilde a))\} \quad \text{for } \tilde d\ge 0 \tag{12}

(and $\hat v(P,\tilde d)=+\infty$ for $\tilde d<0$); the functions $\gamma_0$, $\gamma_1$ used below are defined later.

The unique continuous solution of Eq. (12), uniformly bounded on $\bar G$, is described in what follows.

(i) If $\tilde d \ge pc(g+P(1))/(1-\beta+\beta q_0+\beta p)$, then $\hat v(P,\tilde d)=e_0/(1-\beta)$. In this case the unique value of $\tilde a(1|1)$ providing the minimum in the right-hand side of (12) is zero.

Recall that, according to (11), the dynamics equations for $P_t(1)$ and $\tilde d_t$ look as follows:

P_t(1) = p + P_{t-1}(1)(1-p-q_0) - P_{t-1}(1)\,\tilde a_t(1|1)(q_1-q_0), \qquad \tilde d_t = \frac{\tilde d_{t-1} - pcP_{t-1}(1)}{\beta}.

In the case considered, if the inequality

\tilde d_0 = d \ge \frac{pc(g+P_0(1))}{1-\beta+\beta q_0+\beta p} \tag{14}

is satisfied at the initial step, then one must always choose the action $\tilde a_t(1|1)\equiv 0$, and inequality (14) is satisfied at every instant $t$ for $\tilde d_t$ and $P_t(1)$. In this connection, the constraint in the initial problem is not essential, and the solution of problem (1),(3) coincides with the solution of the unconstrained problem (1).

(ii) As was established at the beginning of Section 4, if $\tilde d < pc(g+P(1))/(1-\beta+\beta q_1+\beta p)$, then $\hat v(P,\tilde d)=+\infty$ and all the actions are equivalent (no control strategy is admissible). If the inequality

\tilde d_0 = d < \frac{pc(g+P_0(1))}{1-\beta+\beta q_1+\beta p} \tag{15}

is satisfied at the initial step, then

\tilde d_t < \frac{pc(g+P_t(1))}{1-\beta+\beta q_1+\beta p}

at every instant $t\ge 1$, independently of the actions $\tilde a_1,\tilde a_2,\dots,\tilde a_t\in\tilde A$. In the case considered, there are no admissible strategies.

(iii) Finally, suppose that

\frac{pc(g+P(1))}{1-\beta+\beta q_1+\beta p} \le \tilde d < \frac{pc(g+P(1))}{1-\beta+\beta q_0+\beta p}. \tag{17}

In this case one can choose an arbitrary action $0\le\tilde a(1|1)\le 1$ in the non-empty interval (16); that is, the action $\tilde a_t(1|1)$ must be chosen from the interval

\max\{0,\ 1-\gamma_1(P_{t-1},\tilde d_{t-1})\} \le \tilde a_t(1|1) \le \min\{1,\ \gamma_0(P_{t-1},\tilde d_{t-1})\} \tag{18}

at every epoch $t$ (see (16)), and then inequalities (17) are satisfied at each step $t$. If both inequalities in (17) are strict at the step $t-1$, then the left-hand side of (18) is strictly less than the right-hand side, and there are many different control strategies providing the solution of the initial problem. One of those strategies is of special interest.

Relations (18) are satisfied at every step; the proof can be carried out by induction based on the explicit formulae for $P_t(1)$ and $\tilde d_t$. Namely, the stationary control strategy $\tilde a^s$ defined by the conditional probabilities

\tilde a^s(1|1) = \tilde a^*, \qquad \tilde a^s(1|0) = 0,

where $\tilde a^*$ is given by expression (19), is optimal. (It corresponds to the strategy $\pi^s(1|0)=0$, $\pi^s(1|1)=\tilde a^*$ in the initial model.) One can show that there are no other optimal stationary control strategies.

Thus, the solution of the example described in Section 2 looks as follows. In case (17), one should always select the action $a=0$ if the system is free; if the system is busy, then the probability of the more intensive regime must be equal to expression (19). In case (14), the constraint is inessential, and one should always choose the less intensive regime $a=0$. In case (15), there are no admissible strategies.
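Numerically, the case analysis reduces to comparing $d$ with the two thresholds of (14) and (15). In the sketch below, the closed form for $\tilde a^*$, obtained by making the discounted penalty meet the bound $d$ exactly, is my reconstruction of expression (19), which could not be recovered from the source, and should be treated as an assumption:

# Assumed illustrative parameters of the example in Section 2.
p, q0, q1, c, beta = 0.5, 0.3, 0.7, 4.0, 0.9
P0_busy, d = 0.5, 10.0

g = beta * p / (1.0 - beta)          # the notation g of Section 4

def S_stationary(q):
    # Expected discounted penalty p c (g + P_0(1)) / (1 - beta + beta q + beta p)
    # under the stationary regime with service-completion probability q.
    return p * c * (g + P0_busy) / (1.0 - beta + beta * q + beta * p)

hi, lo = S_stationary(q0), S_stationary(q1)   # thresholds of (14) and (15)
if d >= hi:
    print('case (14): the constraint is inessential; always use a = 0')
elif d < lo:
    print('case (15): there are no admissible strategies')
else:
    # Case (17): saturate the constraint, with q(a*) = q0 + a*(q1 - q0).
    a_star = (p * c * (g + P0_busy) / d
              - (1.0 - beta + beta * q0 + beta * p)) / (beta * (q1 - q0))
    print(f'case (17): busy-state intensive-service probability a* = {a_star:.3f}')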

5. Conclusion

The dynamic programming approach can also be used if the constraint inequalities must be satisfied almost surely. For example, if we consider the model with the finite horizon, then one can include the accumulated constraint losses into the state, $\tilde x=(x,W)\in X\times R^N$; the action $a$ remains as in the original model, and a Bellman equation similar to (6) can be written for this augmented state.

Now consider problem (2) of minimizing the average expected losses in a homogeneous model. As is known, this problem can be reduced to the homogeneous discounted one if there exists a minorant [4]. In this case, the corresponding constrained problem can also be reduced to the constrained discounted problem stated in Section 2. (The details can be found in [6].) So the dynamic programming approach can also be suitable for problems with the average expected losses.

Note that the dynamic programming approach (in problems with total expected losses) makes it possible to build all optimal deterministic programmed strategies for the auxiliary model $\tilde Z$. The strategy $\tilde\pi$ is optimal in $\tilde Z$ if and only if, for all $t=1,2,\dots$, the action $\tilde a_t$ provides the infimum in (6) at the current values $P=P_{t-1}$ and $W=W_{t-1}$. To put it differently, the presented method allows one to construct all Markov control strategies which are optimal in the initial constrained problem. This is confirmed by the example solved in Section 4.

Lastly, it should be emphasized that if the function $r$ is bounded, then the symbol $+\infty$ in the definition of the loss $\tilde R$ in the model $\tilde Z$ can be replaced by a sufficiently large finite constant.

Acknowledgements

The main idea of this paper was discussed with S. Gaubert of INRIA (Paris). The authors are also thankful to the editor and to the anonymous referee for constructive comments which helped to improve this article.

References

[1] E. Altman, Constrained Markov Decision Processes, Chapman & Hall/CRC, Boca Raton, 1999.

[2] D.P. Bertsekas, S.E. Shreve, Stochastic Optimal Control, Academic Press, New York–San Francisco–London, 1978.

[3] V.S. Borkar, Topics in Controlled Markov Chains, Vol. 240, Longman Scientific and Technical, England, 1991.

[4] E.B. Dynkin, A.A. Yushkevich, Controlled Markov Processes and their Applications, Springer, New York, 1979.

[5] E.A. Feinberg, A. Shwartz, Constrained Markov decision models with discounted rewards, Math. Oper. Res. 20 (1995) 302–320.
[6] A.B. Piunovskiy, Optimal Control of Random Sequences in Problems with Constraints, Kluwer Academic Publishers, Dordrecht, 1997.

[7] A.B. Piunovskiy, Controlled random sequences: the convex analytic approach and constrained problems, Russ. Math. Surveys 53:6 (1998) 1233–1293.
