I was very fortunate to have Joel's unreserved understanding—in pursuing my own research problem of interest, in exploring opportunities outside of the PhD program, and in navigating my long-distance marriage. I am also grateful to have had the opportunity to work on the DARPA ARM-S challenge at the Jet Propulsion Laboratory (JPL).
Motivation
Since the tool and wheel manipulation tasks are kinematically coupled, the robot can pick up the impact driver in a grasp from which the wheel nuts can be removed but the driver's trigger cannot be depressed. By execution, it is meant that a set of local controllers is sequenced in time to actuate the robot.
Related Work
In the case of the (fully observable) Markov decision process (MDP), the synthesis of controllers with probabilistic satisfaction guarantees of LTL specifications is well understood [8]. In this thesis, a goal-oriented approach is taken: the controller has discrete states, and for each state it optimizes both the input to the controlled system and its own dynamics according to the LTL specification.
Thesis Overview and Contributions
Verification of LTL satisfaction using Automata
The interesting properties of the system are assumed to be given by a set of atomic propositions AP about the variables V of the system. At the beginning of the system execution, the DRA corresponding to ϕ is initialized to its initial state q0.
Common LTL formulas in Control
- Safety or Invariance
- Guarantee or Reachability
- Progress or Recurrence
- Stability or Persistence
- Obligation
- Response
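For concreteness, the canonical temporal patterns of these classes, in the order listed above, are the standard forms (p and q denote atomic propositions; the thesis's precise definitions may differ in detail):

□ p,   ◇ p,   □ ◇ p,   ◇ □ p,   □ p ∨ ◇ q,   □ (p → ◇ q).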
The states of the DRA are indicated by the nodes (circles or boxes) and are numbered.
Labeled Partially Observable Markov Decision Process
- Information State Process (ISP) induced by the POMDP
- Belief State Process (BSP) induced by the POMDP
- POMDP controllers
- Markov Chain Induced by a Policy
- Probability Space over Markov Chains
- Typical Problems over POMDPs
- Optimal and ϵ-optimal Policy
- Brief overview of solution methods
- Exact Methods
- Approximate Methods
The Markov chain M is specified by its initial probability distribution and its state transition probabilities. The underlying set of outcomes is given by the set of (infinite) paths, Paths(M), of the Markov chain M.
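Using ι_init for the initial distribution and T for the transition probabilities (illustrative symbols, not necessarily the thesis's), the probability measure is first defined on cylinder sets of finite path prefixes and then extended to all of Paths(M):

Pr( Cyl(s_0 s_1 ⋯ s_n) ) = ι_init(s_0) · ∏_{k=0}^{n−1} T(s_k, s_{k+1}).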
Concluding Remarks
Markov Chain induced by an FSC
Note that for a finite state space POMDP, the global Markov chain has a finite state space. Similar to the fully observable case of the Markov decision process in [8], the global Markov chain M^{PM,G}_{S×G} induced by the finite state controller is probabilistically bisimilar to the Markov chain on the infinite state space described in Section 2.2.3.1.
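For illustration, the following sketch assembles the transition matrix of this global Markov chain on S × G. The array layout and the names T, O and omega are assumptions made here for concreteness, not the thesis's notation; the chain moves from (s, g) to (s′, g′) by observing o with probability O(o|s), switching the I-state and choosing an action according to ω(g′, α | g, o), and transitioning with T(s′|s, α).

```python
import numpy as np

def global_chain_transition(T, O, omega):
    """Transition matrix of the global Markov chain on S x G (illustrative shapes).

    T[s, a, s2]        : POMDP transition probabilities T(s2 | s, a)
    O[s, o]            : observation probabilities O(o | s)
    omega[g, o, g2, a] : FSC parameters omega(g2, a | g, o)
    """
    S, A, _ = T.shape
    nObs = O.shape[1]
    G = omega.shape[0]
    M = np.zeros((S * G, S * G))
    for s in range(S):
        for g in range(G):
            row = s * G + g
            for o in range(nObs):
                for g2 in range(G):
                    for a in range(A):
                        # weight of observing o, switching to g2 and taking action a
                        w = O[s, o] * omega[g, o, g2, a]
                        if w == 0.0:
                            continue
                        for s2 in range(S):
                            M[row, s2 * G + g2] += w * T[s, a, s2]
    # rows of M sum to 1 whenever T, O and omega are properly normalized
    return M
```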
LTL satisfaction over POMDP executions
Inducing an FSC for PM from that of PMϕ
The action and observation sets remain unchanged between the original model PM and the product POMDP PMϕ. However, the initial I-state of the FSC is determined by the initial distribution of the product POMDP.
Verification of LTL Satisfaction using Product-POMDP
The product POMDP is the basis for a method to find policies defined by an FSC. Note that in the case of fully observable MDPs, choosing the action as a function of the current state of the product MDP is usually sufficient [8].
Measuring the Probability of Satisfaction of LTL Formulas
Since the global Markov chain is uniquely defined, the subscript on the probability operator on the r.h.s. is dropped hereafter. Likewise, any use of the expectation operator E will also be understood with respect to the underlying probability measure over the global Markov chain.
Background: Analysis of Probabilistic Satisfaction
Qualitative Problems
Finally, define Pr(PM ⊨ ϕ | G) ≜ Pr(PMϕ ⊨ ϕ | G) as the probability that the original model satisfies the LTL formula.
Quantitative Problems
Complexity of Solution for Qualitative and Quantitative Problems
Excursus on Markov chains
The Markov chain can therefore be studied exclusively on the smaller state space C; this chain is referred to as the restriction of M to C. The expected value of the path-wise occupancy measure is the t-step expected occupancy measure.
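For illustration (using indicator notation assumed here rather than taken from the thesis), the path-wise occupancy measure counts visits to a state s along a path, and its expectation gives the t-step expected occupancy measure:

μ_t(s) = E[ Σ_{k=0}^{t−1} 1{X_k = s} ],

possibly normalized by t, depending on the convention in use.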
Overview of Solution Methodology
Solutions for Qualitative Problems
Theorem 3.4.13: Given the limiting matrix Π, the Cesàro limit of the transition matrix T, the quantity I − T + Π is non-singular, so its inverse exists. Theorem 3.5.2: The formula ϕ is almost surely satisfied, i.e. Pr(PM ⊨ ϕ) = 1, if there exists an FSC G such that the induced global Markov chain satisfies the Rabin acceptance condition for some pair (Repeat_i^{PMϕ}, Avoid_i^{PMϕ}) with probability 1.
Solutions for Quantitative Problems
Proposition 3.5.4: The problem of computing the satisfaction probability of ϕ under a given FSC G can be solved by computing the probability of absorption into the ϕ-feasible recurrent sets of the global Markov chain. Structure: The structure of the FSC has two components: (a) the number of internal nodes (I-states), |G|, available.
Concluding Remarks
Structure
Formally, the structure of the FSC is defined by the collection I = {G, I_ω, I_κ}, of which the first component G = {g_1, ..., g_|G|} indexes the FSC I-states. Then, for each observation o ∈ O, there must be an I-state g_l ∈ G and an action α ∈ Act such that the FSC can switch to g_l and issue the action α to the product POMDP.
Structure Preserving Parametrization of an FSC
However, a given FSC structure is admissible only if condition (4.3) holds. The above condition can be understood as follows: the admissible structure of the FSC requires that, for each observation o_k ∈ O, there is at least one pair (g_next, α_next) ∈ G × Act such that the FSC switches to g_next and issues action α_next.
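A minimal sketch of this admissibility check, assuming the structure is represented as a map from (I-state, observation) pairs to the set of allowed (next I-state, action) pairs (a hypothetical representation, not the thesis's data structure):

```python
def is_admissible(structure, observations):
    """Return True iff every (I-state, observation) pair allows at least one
    (next I-state, action) pair, as required by the admissibility condition."""
    i_states = {g for (g, _) in structure}
    return all(structure.get((g, o)) for g in i_states for o in observations)

# Example: two I-states, one observation; admissible because every (g, o) has a successor.
structure = {(0, 'o1'): {(1, 'a')}, (1, 'o1'): {(0, 'b')}}
assert is_admissible(structure, {'o1'})
```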
Maximizing Probability of LTL Satisfaction for an FSC of Fixed Structure
- Vector Notation for Finite Sets
- Ordering Global States by Recurrent classes
- Probability of Absorption in ϕ-feasible Recurrent Sets
- Complexity and Efficient Approximation
- Gradient of Probability of Absorption
- Complexity and Efficient Computation
- Gradient Based Optimization
This section attempts to calculate the probability of absorption in this set, given the initial distribution of the product POMDP PMϕ. One source of computational complexity arises from the need to compute the recurrent classes of the global Markov chain.
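One way to compute these recurrent classes is via the strongly connected components of the transition digraph: a class is recurrent exactly when no probability mass leaves its component. The sketch below assumes a dense NumPy transition matrix and uses networkx; it is illustrative rather than the thesis's algorithm.

```python
import numpy as np
import networkx as nx

def recurrent_classes(M, tol=1e-12):
    """Recurrent classes of a finite Markov chain with row-stochastic matrix M.

    A class is recurrent iff its strongly connected component has no outgoing
    probability mass (a 'bottom' SCC of the transition digraph)."""
    n = M.shape[0]
    G = nx.DiGraph()
    G.add_nodes_from(range(n))
    G.add_edges_from((i, j) for i in range(n) for j in range(n) if M[i, j] > tol)
    recurrent = []
    for scc in nx.strongly_connected_components(G):
        if all(M[i, j] <= tol for i in scc for j in range(n) if j not in scc):
            recurrent.append(sorted(scc))
    return recurrent

# Example: a 3-state chain where state 2 is absorbing (hence recurrent).
M = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
print(recurrent_classes(M))  # -> [[2]]
```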
Ensuring Non-Infinitesimal Frequency of Visiting Repeat^{PMϕ} States
Equivalence to Expected Long Term Average Reward
- Complexity of Computing the Gradient of η_av(R_k)
This is an important relationship, since the long-term average reward and the computation of its gradient are widely studied in the literature, especially for the case when the Markov chain is ergodic (or has a single recurrent class) [1, 10]. The complexity of gradient estimation from [1] is summarized here to complete the picture of the computational burden of the gradient ascent methodology.
Trade Off between Absorption Probability and Visitation Frequency
From [1], in the worst case, the computational complexity of ∇_Φ η_av(R_k) is O(|S|²|G|²|Φ||Act||O|), similar to that of the absorption probability gradient. Furthermore, the objective is nonlinear, and first-order methods can only guarantee convergence to a local maximum.
Heuristic Search for FSC Structures with a ϕ-Feasible Recurrent Set
Complexity
For the global Markov chain, the number of nodes is |S||G|, while the worst-case number of edges is |S|²|G|². On the other hand, the nested loops in steps 4-16 have a worst-case complexity of O(|S|²|G|²|Act||O|).
Initialization of Θ and Φ
Case Studies
Case Study I - Repeated Reachability with Safety
For the MDP world, the robot must first make sure it has moved to cell 16 and then issue a right-move R signal. In addition, there is a fifth Stop action, marked X, which causes the robot to remain in its current cell.
Case Study II - Stability with Safety
For the initial state, as shown in Figure 4.7, the shortest path algorithm would always direct the robot to cell 36, even though the formula ϕ2 can be satisfied with probability 1 by navigating to one of the alternative cells. Running the gradient ascent algorithm on POMDP-World yielded two controllers that likewise correspond to ϕ2(a) and ϕ2(b).
Case Study III - Initial Feasible Controller
A test trajectory for the POMDP with the best controller is also shown in the same figure via a blue dashed line. For N = 3, although feasible FSCs exist for the POMDP world, the heuristic is unable to find them.
Concluding Remarks
- Incentivizing Frequent Visits to Repeat_r^{PMϕ}
- Computing the Probability of Visiting Avoid^{PMϕ} in Steady State
- Partitioned FSC and Steady State Detecting Global Markov Chain
- Posing the Problem as an Optimization Problem
During the transient phase of the global Markov chain execution, the global state is rapidly absorbed into a ϕ-feasible recurrent set. During the steady-state phase, the state frequently visits the states in Repeat_r^{PMϕ}.
Concluding Remarks
In fact, the value of the scalar η_av can be obtained by solving a slightly different version of the Poisson equation (6.2). Further discussion of the Poisson equation, in the context of solving it using dynamic programming, is provided in Section 6.3.2.
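For reference, the standard (multichain) average-reward evaluation equations from the dynamic programming literature, which equation (6.2) resembles, relate the gain vector ⃗g and the bias V⃗_av to the reward vector ⃗r_av and the transition matrix T as follows; the exact form of (6.2) may differ in detail:

(I − T) ⃗g = 0,
⃗g + (I − T) V⃗_av = ⃗r_av.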
Dynamic Programming Basics
Dynamic Programming Variants
However, the Bellman optimality equation, the Bellman equation, and iterative value and policy techniques can also be derived for the expected long-run average reward criterion (Definition 2.2.9). In the general setting of an arbitrary reward function and an infinite state space, however, the existence of an optimal solution for the average-reward case is not guaranteed.
Summary of Dynamic Programming Facts
Value Function of Discounted Reward Criterion
Only the evaluation of the average reward value function under a given FSC is required to ensure LTL satisfaction. Clearly, equation (6.25) shows that the value of a given I-state is a linear function of the belief state.
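Concretely, the linearity can be written as follows (a standard expansion; the bracketed global-state indexing is assumed here to match equation (6.25)). For an I-state g and a belief b over product-POMDP states,

V(g, b) = Σ_{s∈S} b(s) · V([s, g]).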
Value Function of Average Reward Criterion
However, ⃗g = Π_ssd ⃗r_av is uniquely determined by the Poisson equation and can be precomputed independently of V⃗_av. Therefore, the trick used in Section 4.2.4.1 can be applied until ⃗g converges to within a tolerance ε_{Π_ssd} > 0.
Bellman Optimality Equation / DP Backup Equation - Discounted Case
Again, exact methods for solving the linear system of equations scale cubically in the number of equations, which is 2|S||G| for Equation (6.2), with an equal number of variables consisting of both V⃗_av and ⃗g. Pre-computing ⃗g makes it possible to solve a linear system in only the unknowns V⃗_av, using Equation (6.2b).
Policy Iteration for FSCs
Bounded Stochastic Policy Iteration
In [118], the authors give the following interpretation of this optimization: it implicitly considers convex combinations of the backed-up value vectors V_g^β. Tangent beliefs are those at which performing the DP backup yields no improvement of the value function over its current value.
Applying Bounded Policy Iteration to LTL Reward Maximization
Node Improvement
The I-state Improvement Bilinear Program (6.43) contains constraints built from terms of the form O(o|s) ω(g′, α | g, o) T^{PMϕ}(s′|s, α) V_β([s′, g′]) for all s, the Poisson equation constraints (imposed if g ∈ G_ss), the constraints ω(g′, α | g, o) = 0 if g′ ∈ G_tr, and the probability (normalization) constraints on ω. The program is bilinear because the term T_mod^{PMϕ}, which is linear in ω(g′, α | g, o), multiplies the unknowns V⃗_av and ⃗g in the two sets of constraints that form the Poisson equation.
Convex Relaxation of Bilinear Terms
Below, we show how to relax the bilinear constraints that appear in the Poisson equation within the I-state Improvement Bilinear Program of equation (6.43). This relaxation is possible because V⃗_av does not appear in the objective or the feasibility constraints of equation (6.43).
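As an illustration of one standard relaxation of such products (a McCormick envelope; the thesis's specific relaxation may differ), a bilinear term w = x·y with bounds x ∈ [x_L, x_U] and y ∈ [y_L, y_U] is replaced by the four linear inequalities:

w ≥ x_L·y + x·y_L − x_L·y_L,
w ≥ x_U·y + x·y_U − x_U·y_U,
w ≤ x_U·y + x·y_L − x_U·y_L,
w ≤ x_L·y + x·y_U − x_L·y_U.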
Addition of I-States to Escape Local Maxima
Note that the algorithm operates on a copy of the original FSC, and the solution of the Poisson equation is computed at the last step of policy evaluation. The addition of the phantom I-state g_phantom and the recomputation of the Poisson equation are used only in Algorithm 6.5.
Finding an Initial Feasible Controller
Similar to Algorithm 6.2, instead of directly improving the value of the tangent belief, the algorithm tries to improve the value of the intermediate beliefs that are reachable in one step from the tangent beliefs. The rest of the procedure is identical to Algorithm 6.2, except for step 20, in which each newly added I-state is placed in the correct Gtr or Gss partition.
Case Studies
Case Study I - Stability with Safety
The x-axis indicates the number of policy improvement steps performed. (a) Growth of the FSC size. The graph shows the improvement in steady-state behavior as the size of the FSC increases.
Case Study II - Repeated Reachability with Safety
The y-axis indicates the expected frequency with which states in Repeat_0^{PMϕ} were visited for the product POMDP resulting from the POMDP world of Figure 6.5(c) and the DRA for ϕ1 given by Equation (6.56).
Concluding Remarks
Task planning has traditionally been studied in the field of domain-independent planning, and therefore a historical overview of this field is also provided. This chapter distinguishes three different types of planning/control problems in manipulation tasks: path or motion planning, simple sequencing to realize short-horizon tasks, and high-level task planning for manipulation robots.
Introduction
Robotic Paradigms
For long-term goals, a deliberative planner is needed that can plan the actions that realize the goals. Low levels can remain highly reactive, while higher levels provide increasingly deliberative reasoning for long-term goal satisfaction.
Planning in Robotics
These rely on a model of all the actions available to the robot or controller, which includes the preconditions on the world state for applicability and the effects of each action on the world. Although the world itself can be quite complex, the planner works with a world model.
Background on Domain Independent Planning
Then the state space of the world model is given by all possible truth assignments, 2^L. The planner simply needs to find a path (a sequence of actions, and thus a sequence of intermediate points in state space) that takes the initial state of the world to a given final state.
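As a toy illustration of this search over truth assignments, the following sketch implements a STRIPS-style forward breadth-first planner; the Action representation and the example propositions are hypothetical, not taken from the thesis.

```python
from collections import deque
from typing import NamedTuple, FrozenSet, Tuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # propositions that must be true to apply the action
    add: FrozenSet[str]     # propositions made true by the action
    delete: FrozenSet[str]  # propositions made false by the action

def plan(init: FrozenSet[str], goal: FrozenSet[str], actions) -> Tuple[str, ...]:
    """Breadth-first search for an action sequence from init to a state containing goal."""
    frontier = deque([(init, ())])
    seen = {init}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for a in actions:
            if a.pre <= state:
                nxt = frozenset((state - a.delete) | a.add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + (a.name,)))
    return None

# Toy example: pick up a tool, then use it to remove a nut.
acts = [
    Action('pickup_tool', frozenset(), frozenset({'holding_tool'}), frozenset()),
    Action('remove_nut', frozenset({'holding_tool'}), frozenset({'nut_removed'}), frozenset()),
]
print(plan(frozenset(), frozenset({'nut_removed'}), acts))  # ('pickup_tool', 'remove_nut')
```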
The DARPA Autonomous Robotic Manipulation Software Challenge
An Example DARPA Challenge Task for ARM-S
One of the tasks the ARM-S teams had to perform was a wheel change scenario. The process of changing a wheel includes the following main tasks: finding a battery-powered impact driver on the table (workspace), removing nuts from the wheel using the impact driver, removing the wheel from the axle and placing it on the table.
Overview of Planning Challenges in ARM-S
Motion Planning
The Manipulation Planner, in combination with the Arm Planner, selects among nearby reachable poses for a given primitive task. The main purpose of the Arm Planner is to work in the joint space of the 7-DOF arm to find collision-free paths to the poses suggested by the Manipulation Planner.
Limited Task Planner / Sequencer for ARM-S
- Re-manipulation or Re-grasping
- Kinematically Dependent Tasks
- Kinematic Verification Based Execution
For example, for the EXECUTE BEH SEQ("tool use grasp", IMPACT, Ø) instruction, the grasp planner may not find any feasible solutions for the current position of the impact driver on the table. Example 2: This example deals with the case where re-manipulation of the impact driver is required to make the next task (the tool-use grasp) feasible.
Task Planning for ARM-S
Probabilistic Outcomes and Partial Observability at the Task Level
O[ObsAttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i)] = p
O[¬ObsAttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i)] = 1 − p
O[ObsAttachedNutToWheel(nut_i) | ¬AttachedNutToWheel(nut_i)] = 1 − q
O[¬ObsAttachedNutToWheel(nut_i) | ¬AttachedNutToWheel(nut_i)] = q

T[¬AttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i), ActRemoveNut(nut_i)] = r
T[AttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i), ActRemoveNut(nut_i)] = 1 − r
LTL Goals for ARM-S Task Planner - Case Studies
- Simple Reachability Goal
- Temporally Extended Goal
The robot must perform the wheel removal task whenever a wheel assembly appears in front of it. The wheel removal task may take an unknown amount of time due to the probabilistic nature of the task.
Description of the System in RDDL
But when the task is finished, it would be useful if the robot itself could signal that it can receive a new task.
Application of Bounded Policy Iteration to ARM-S Tasks
- Preprocessing
- ARM-S Task 1
- ARM-S Task 2
- Resulting Control Policy
Deterministic model: In this case, all parameters are set to p = q = r = 1, indicating that the nut removal task is always successful and the AttachedNutToWheel(nut_i) states are fully observable.

Pr[ObsAttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i)] = p = 0.5
Pr[¬ObsAttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i)] = 1 − p = 0.5
Pr[¬ObsAttachedNutToWheel(nut_i) | ¬AttachedNutToWheel(nut_i)] = q = 0.5
Pr[ObsAttachedNutToWheel(nut_i) | ¬AttachedNutToWheel(nut_i)] = 1 − q = 0.5
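To illustrate the effect of these observation parameters on the task-level belief, the following sketch performs a Bayes update of the belief that a nut is attached (the function and its arguments are illustrative, not part of the thesis's implementation); with p = q = 0.5 the observation is completely uninformative.

```python
def belief_update(b_attached, obs_attached, p=0.5, q=0.5):
    """Bayes update of the belief that a nut is attached, using the model above:
    Pr[obs attached | attached] = p and Pr[not obs attached | not attached] = q."""
    if obs_attached:
        num = p * b_attached
        den = p * b_attached + (1 - q) * (1 - b_attached)
    else:
        num = (1 - p) * b_attached
        den = (1 - p) * b_attached + q * (1 - b_attached)
    return num / den

# With p = q = 0.5 the observation carries no information: the belief is unchanged.
assert abs(belief_update(0.7, True) - 0.7) < 1e-12
```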
Concluding Remarks
However, if the robot tries to remove the wheel from the hub, the success of the wheel removal can be used to infer the state of the nuts. However, during practical implementation, the off-the-shelf MATLAB tool in [89] was found to run out of memory for the ARM-S case studies.
Open Issues and Future Work
This can be seen as an extension of the multi-armed bandit problem popular in the reinforcement learning literature [69]. However, sampling-based approaches need to be extended to ensure pointwise satisfaction of the Poisson equation under the conservative optimization criterion, and not only at the sampled beliefs.
Measurable Space and Measure
Probability Space and Measure
Natural σ-Algebra and Distributions over Countable Sets
- The DARPA ARM-S Robot
- Graphical representation of the DRA translations of common LTL specifications
- Partially observable probabilistic grid world
- Evolution of a labeled POMDP
- POMDP controlled by an FSC
- Effect of FSC structure on ϕ-feasibility
- Logistic function
- Repeat^{PMϕ} can be visited with vanishing frequency
- Generating Admissible Structures of FSC
- System Models for Case Study I
- DRA for the LTL specification ϕ1 = □◇a ∧ □◇b ∧ ¬c
- Dependence of expected steady state average reward η on size of FSC
- System Model for Case Study II
- Sample trajectories under optimal controllers
- Assigning rewards for visiting Repeat^{PMϕ} frequently
- Modifying T^ϕ for steady-state ϕ-feasibility
- Example where visiting Avoid^{PMϕ} is required to reach Repeat^{PMϕ}
- Steady state detecting global Markov chain
- Reduced Feasibility arising from Conservative Optimization Criterion
- Value Function for a two state POMDP
- Effect of DP Backup Equation
- Effect of I-state Improvement LP
- Policy Iteration Local Maximum
- System models for Policy Iteration case studies
- Transient behavior optimization using Bounded Policy Iteration
- Effect of Bounded Policy Iteration on steady state behavior
- Various Robotic Paradigms
- A Hybrid Robot Architecture
- The DARPA ARM-S Robot
- Wheel Removal Task for the ARM-S Robot
- Abstracted digraph for removing nuts with impact driver
- Execution for remanipulation task of Example 2
- Probabilistic Outcomes and Partial Observability in ARM-S
- DRA for ARM-S Tasks
- DBN of the ARM-S task planning domain
- Bounded Policy Iteration for ARM Task 1
- Bounded Policy Iteration for ARM Task 2
- ARM-S Task 2 Policy
- Complexity of LTL satisfaction over POMDPs
- Results for GW-B under ϕ2
- Finding the Initial Feasible Controller by Algorithm
- Problem size of policy iteration for naive implementation
- Reduce problem size for policy iteration after basic preprocessing
- Generate Set To Visit Frequently
- Generate Candidate FSCs
- Policy Iteration for Markov Decision Process
- Bounded PI: Adding I-States to Escape Local Maxima
- Bounded Policy Iteration For Conservative Optimization Criterion
- Adding I-states to Escape Local Maxima of Constrained Optimization Criterion
- Pruning candidate successor I-states and actions to satisfy recurrence constraints
- A common compound task
- Task sequence for Example 1
M. Athans. The role and use of the stochastic linear-quadratic-Gaussian problem in control system design.
A. Pnueli. Applications of temporal logic to the specification and verification of reactive systems: a survey of current trends.