\begin{equation}
\begin{aligned}
\max_{\omega,\,\vec{V}_{av},\,\vec{V}^{feas}_{av},\,\vec{g},\,\vec{g}^{feas}} \quad
  & \left(\vec{\iota}^{\,\mathcal{PM}_\phi}_{init}\right)^{T} \vec{g}^{feas} \\
\text{subject to} \quad
  & \text{Poisson Equation 1:} \quad
    \vec{V}_{av} + \vec{g} = \vec{r}_{av} + T^{mod}_{\mathcal{PM}_\phi}\vec{V}_{av},
    \qquad \vec{g} = T^{mod}_{\mathcal{PM}_\phi}\vec{g} \\
  & \text{Poisson Equation 2:} \quad
    \vec{V}^{feas}_{av} + \vec{g}^{feas} = \vec{r}_{\beta} + T^{mod}_{\mathcal{PM}_\phi}\vec{V}^{feas}_{av},
    \qquad \vec{g}^{feas} = T^{mod}_{\mathcal{PM}_\phi}\vec{g}^{feas} \\
  & \text{Feasibility constraints } (\forall g \in G_{ss}): \quad
    \left(\vec{\iota}^{\,ss,g}_{init}\right)^{T} \vec{g} = 0 \\
  & \text{FSC structure constraints:} \quad
    \omega(g',\alpha \mid g,o) = 0 \ \text{ if } g' \in G_{tr} \text{ and } g \in G_{ss} \\
  & \text{Probability constraints:} \quad
    \sum_{g',\alpha} \omega(g',\alpha \mid g,o) = 1 \ \ \forall o,
    \qquad \omega(g',\alpha \mid g,o) \ge 0 \ \ \forall g',\alpha,o
\end{aligned}
\tag{6.54}
\end{equation}
Any positive value of the objective $\left(\vec{\iota}^{\,\mathcal{PM}_\phi}_{init}\right)^{T} \vec{g}^{feas}$ gives a feasible controller, and therefore the optimization need not be carried out to optimality. If the problem is infeasible, then states in $G_{ss}$ can be successively added to search for a positive objective.
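The products of FSC parameters $\omega$ and value variables are what make Equation (6.54) bilinear; as noted later in this chapter, such products can be handled with a McCormick-envelope relaxation. The following is a minimal sketch, using the cvxpy modeling library, of the envelope for a single scalar product $w = x\,y$; the variable names, bounds, and use of cvxpy are illustrative assumptions rather than the formulation used in the thesis.

\begin{verbatim}
import cvxpy as cp

# Bounds on the two factors of a single bilinear term w = x * y, e.g. an FSC
# probability omega(g', a | g, o) in [0, 1] and a bounded value-vector entry.
xL, xU = 0.0, 1.0      # assumed bounds on x
yL, yU = -5.0, 5.0     # assumed bounds on y

x, y, w = cp.Variable(), cp.Variable(), cp.Variable()

# McCormick envelope: four linear inequalities enclosing the bilinear surface.
mccormick = [
    w >= xL * y + yL * x - xL * yL,
    w >= xU * y + yU * x - xU * yU,
    w <= xU * y + yL * x - xU * yL,
    w <= xL * y + yU * x - xL * yU,
]

# Toy objective over the relaxed product, subject to the box constraints.
prob = cp.Problem(cp.Maximize(w),
                  mccormick + [xL <= x, x <= xU, yL <= y, y <= yU])
prob.solve()
print(w.value, x.value, y.value)
\end{verbatim}

In a full relaxation of (6.54), each bilinear term would receive its own auxiliary variable and the same four inequalities, with bounds obtained from the probability simplex on $\omega$ and from bounds on the value vectors.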
Figure 6.5: System models for Policy Iteration case studies. (a) Det-World (b) MDP-World (c) POMDP-World. A description of these systems can be found in Figure 4.4.
Figure 6.6: Transient behavior optimization using Bounded Policy Iteration. The x-axis denotes the number of policy improvement steps carried out. (a) The growth of the FSC size. (b) The value of the initial belief increases monotonically with each iteration. This value denotes the expected long term discounted reward for the given initial belief. (c) Since the goal is to reach cell 6, this sub-figure shows the increase in probability of reaching the goal state within 20 time steps as the FSC is optimized.
where $b$ and $c$, shown in Figure 6.5, are the requirements that the robot navigate to cell 6 and stay there, and that it avoid cell 3, respectively.
Results: The difficulty in this specification is that the robot must localize itself to the top edge of the corridor before moving rightward to cell 6. Note that a random walk performed by the robot is feasible: there is a finite probability that actions chosen randomly will lead the robot to cell 6 without visiting cell 3. The FSC used to seed the BPI algorithm was chosen to have uniform distributions over I-state transitions and actions. This is another advantage over the gradient ascent algorithm, in which such an initial parametrization of the FSC would lead to zero gradient, as mentioned in Section 4.6. Figure 6.6 shows the results of BPI in detail. It can be seen that the value of the initial belief increases monotonically with successive policy improvement steps, which include both the optimization of Equation (6.43) and the addition of I-states to escape local maxima, as discussed in Section 6.5.3.
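As a concrete illustration of the seeding step described above, a uniform FSC assigns the same probability to every (next I-state, action) pair for each (current I-state, observation) pair. The sketch below, in Python/NumPy, shows one possible array layout; the sizes and function name are hypothetical and not taken from the thesis implementation.

\begin{verbatim}
import numpy as np

def seed_uniform_fsc(num_istates, num_actions, num_obs):
    """Seed an FSC with uniform distributions over (next I-state, action)
    for every (current I-state, observation) pair.
    omega[g, o, g_next, a] = P(g_next, a | g, o)."""
    return np.full((num_istates, num_obs, num_istates, num_actions),
                   1.0 / (num_istates * num_actions))

# Hypothetical sizes for a small grid-world example.
omega = seed_uniform_fsc(num_istates=3, num_actions=4, num_obs=5)
assert np.allclose(omega.sum(axis=(2, 3)), 1.0)  # each row is a distribution
\end{verbatim}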
[Bar plot: frequency of visiting $Repeat_0$ (on the order of $10^{-3}$) versus FSC size and structure, from the seed FSC ($|G|=3$, $|G_{ss}|=2$) through successive I-state additions up to $|G|=6$, $|G_{ss}|=5$.]
Figure 6.7: Effect of Bounded Policy Iteration on steady state behavior. Bounded Policy Iteration applied to the specification $\phi_1$. The graph shows the improvement in steady state behavior as the size of the FSC increases. Only states in $G_{ss}$ were allowed to be added. The y-axis denotes the expected frequency with which states in $Repeat^{\mathcal{PM}_\phi}_0$ were visited for the product-POMDP resulting from the POMDP-World of Figure 6.5(c) and the DRA of $\phi_1$ given by Equation (6.56).
6.7.2 Case Study II - Repeated Reachability with Safety
This case study illustrates how Bounded Policy Iteration, especially the addition of I-states to the FSC, improves the steady state behavior of the controlled system.
System Model and LTL specification: The model used for this example is the POMDP-World of Figure 6.5(c) for $N = 3$, and the LTL specification is $\phi_1$ of Equation (4.51), repeated here.
\begin{equation}
\phi_1 = \square\lozenge a \,\wedge\, \square\lozenge b \,\wedge\, \square\lnot c.
\tag{6.56}
\end{equation}
Results: For this example, the controller was seeded with a feasible FSC of size $|G| = 3$, with $|G_{ss}| = 2$, using Algorithm 4.2. After the first few policy improvement steps, the initial I-state was found to be in $G_{ss}$. Since, by construction, once the FSC transitions to an I-state in $G_{ss}$ it can no longer visit states in $G_{tr}$, all new I-states added subsequently, when a local maximum was encountered, were assigned to $G_{ss}$. The improvement in steady state behavior with the addition of each I-state is shown in Figure 6.7, where it can be seen that the expected frequency of visiting $Repeat^{\mathcal{PM}_\phi}_0$ steadily increases with the addition of I-states.
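The quantity reported in Figure 6.7 is a long-run expected visit frequency. As a self-contained illustration of how such a frequency can be computed once the global Markov chain of the controlled system is available, the sketch below recovers the stationary distribution of a toy chain and sums its mass over the Repeat states; the transition matrix, the irreducibility assumption, and the state indices are all hypothetical.

\begin{verbatim}
import numpy as np

def expected_visit_frequency(P, repeat_states):
    """Long-run frequency of visiting `repeat_states` for a Markov chain with
    row-stochastic transition matrix P, via its stationary distribution.
    Assumes the chain is irreducible so that the distribution is unique."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])  # left eigenvector
    pi = pi / pi.sum()
    return pi[repeat_states].sum()

# Toy 3-state chain in which state 2 plays the role of a Repeat state.
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.2, 0.8],
              [0.5, 0.0, 0.5]])
print(expected_visit_frequency(P, repeat_states=[2]))
\end{verbatim}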
6.7.3 Concluding Remarks
In this chapter, it was shown how the Poisson Equation, which is closely related to the Bellman Equation arising in dynamic programming, can be leveraged to solve the Conservative Optimization Problem. The problem was shown to be bilinear, and a convex relaxation method involving McCormick envelopes was introduced as a solution. The stochastic bounded policy iteration algorithm, which is normally applied to constrained discounted reward problems, was adapted to the case in which certain states are required never to be visited. This ensures that a path in the global Markov chain of the controlled POMDP is eventually absorbed in a recurrent set that does not include states from $Avoid^{\mathcal{PM}_\phi}_r$, while simultaneously incentivizing frequent visits to some state(s) in $Repeat^{\mathcal{PM}_\phi}_r$. The key benefit of using this variant of dynamic programming is that it allows a controlled growth in the size of the FSC and can be treated as an anytime algorithm: the performance of the controller improves with successive iterations, but the process can be stopped by the user based on time or memory considerations. Case studies highlighting key attributes of the algorithm can be found in Section 6.7, and its application to robotic manipulation tasks can be found in Section 7.5.
Chapter 7
Robot Task Planning for Manipulation
This chapter gives a detailed introduction to task planning in robotics. Task planning has traditionally been studied under the area of domain-independent planning, and therefore some historical overview of this field of study is also provided. This chapter differentiates three types of planning/control problems in manipulation tasks: path or motion planning, simple sequencing to achieve short-horizon tasks, and high-level task planning for manipulation robots. This is because manipulation tasks present some unique and specific challenges relating to the kinematic constraints of manipulators, some of which can be addressed using specialized re-manipulation or re-grasping techniques. This re-grasping can therefore be separated with relative ease from the path planner as well as the high-level task planner. Finally, the chapter shows the details of applying the policy iteration algorithm of Chapter 6 to a few concrete task planning problems inspired by the real challenge tasks presented to the teams participating in the DARPA ARM-S challenge.