I was very fortunate to have Joel's unreserved understanding—in pursuing my own research problem of interest, in exploring opportunities outside of the PhD program, and in navigating my long-distance marriage. I am also grateful to have had the opportunity to work on the DARPA ARM-S challenge at the Jet Propulsion Laboratory (JPL).
Motivation
Since the tool and wheel manipulation tasks are kinematically coupled, the robot can pick up the impact driver in a grasp from which the wheel nuts can be removed but the driver's trigger cannot be depressed. By execution, it is meant that a set of local controllers is sequenced in time to actuate the robot.
Related Work
In the case of the (fully observable) Markov decision process (MDP), the synthesis of controllers with probabilistic satisfaction guarantees of LTL specifications is well understood [8]. In this thesis, a goal-oriented approach is taken: the controller has discrete states, and for each state it optimizes both the input to the controlled system and its own dynamics according to the LTL specification.
Thesis Overview and Contributions
Verification of LTL satisfaction using Automata
The interesting properties of the system are assumed to be given by a set of atomic propositions AP about the variables V of the system. At the beginning of the system execution, the DRA corresponding to ϕ is initialized to its initial state q0.
Common LTL formulas in Control
- Safety or Invariance
- Guarantee or Reachability
- Progress or Recurrence
- Stability or Persistence
- Obligation
- Response
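For concreteness, the canonical temporal patterns of these classes, in the order listed above, are the standard forms (p and q denote atomic propositions; the thesis's precise definitions may differ in detail):

□ p,   ◇ p,   □ ◇ p,   ◇ □ p,   □ p ∨ ◇ q,   □ (p → ◇ q).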
The states of the DRA are indicated by the nodes (circles or boxes) and are numbered.
Labeled Partially Observable Markov Decision Process
- Information State Process (ISP) induced by the POMDP
- Belief State Process (BSP) induced by the POMDP
- POMDP controllers
- Markov Chain Induced by a Policy
- Probability Space over Markov Chains
- Typical Problems over POMDPs
- Optimal and ϵ-optimal Policy
- Brief overview of solution methods
- Exact Methods
- Approximate Methods
The Markov chain M is specified by its initial probability distribution and its state transition probabilities. The underlying set of outcomes is given by the set of (infinite) paths, Paths(M), of the Markov chain M.
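Using ι_init for the initial distribution and T for the transition probabilities (illustrative symbols, not necessarily the thesis's), the probability measure is first defined on cylinder sets of finite path prefixes and then extended to all of Paths(M):

Pr( Cyl(s_0 s_1 ⋯ s_n) ) = ι_init(s_0) · ∏_{k=0}^{n−1} T(s_k, s_{k+1}).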
Concluding Remarks
Markov Chain induced by an FSC
Note that for a finite state space POMDP, the global Markov chain has a finite state space. Similar to the fully observable case of the Markov decision process in [8], the global Markov chain M^{PM,G}_{S×G} induced by the finite state controller is probabilistically bisimilar to the Markov chain on the infinite state space described in Section 2.2.3.1.
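For illustration, the following sketch assembles the transition matrix of this global Markov chain on S × G. The array layout and the names T, O and omega are assumptions made here for concreteness, not the thesis's notation; the chain moves from (s, g) to (s′, g′) by observing o with probability O(o|s), switching the I-state and choosing an action according to ω(g′, α | g, o), and transitioning with T(s′|s, α).

```python
import numpy as np

def global_chain_transition(T, O, omega):
    """Transition matrix of the global Markov chain on S x G (illustrative shapes).

    T[s, a, s2]        : POMDP transition probabilities T(s2 | s, a)
    O[s, o]            : observation probabilities O(o | s)
    omega[g, o, g2, a] : FSC parameters omega(g2, a | g, o)
    """
    S, A, _ = T.shape
    nObs = O.shape[1]
    G = omega.shape[0]
    M = np.zeros((S * G, S * G))
    for s in range(S):
        for g in range(G):
            row = s * G + g
            for o in range(nObs):
                for g2 in range(G):
                    for a in range(A):
                        # weight of observing o, switching to g2 and taking action a
                        w = O[s, o] * omega[g, o, g2, a]
                        if w == 0.0:
                            continue
                        for s2 in range(S):
                            M[row, s2 * G + g2] += w * T[s, a, s2]
    # rows of M sum to 1 whenever T, O and omega are properly normalized
    return M
```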
LTL satisfaction over POMDP executions
Inducing an FSC for PM from that of PMϕ
The action and observation sets remain unchanged between the original model PM and the product POMDP PMϕ. However, the initial I-state of the FSC is determined by the initial distribution of the product POMDP.
Verification of LTL Satisfaction using Product-POMDP
The product POMDP is the basis for a method to find policies defined by an FSC. Note that in the case of fully observable MDPs, choosing the action as a function of the current state of the product MDP is usually sufficient [8].
Measuring the Probability of Satisfaction of LTL Formulas
Since the global Markov chain is uniquely defined, the subscript on the probability operator on the r.h.s. is dropped hereafter. Likewise, any use of the expectation operator E will also be understood with respect to the underlying probability measure over the global Markov chain.
Background: Analysis of Probabilistic Satisfaction
Qualitative Problems
Finally, define Pr(PM ⊨ ϕ | G) ≜ Pr(PMϕ ⊨ ϕ | G) as the probability that the original model satisfies the LTL formula.
Quantitative Problems
Complexity of Solution for Qualitative and Quantitative Problems
Excursus on Markov chains
The Markov chain can therefore be studied exclusively on the smaller state space C; this chain is referred to as the restriction of M to C. The expected value of the path-wise occupancy measure is the t-step expected occupancy measure.
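For illustration (using indicator notation assumed here rather than taken from the thesis), the path-wise occupancy measure counts visits to a state s along a path, and its expectation gives the t-step expected occupancy measure:

μ_t(s) = E[ Σ_{k=0}^{t−1} 1{X_k = s} ],

possibly normalized by t, depending on the convention in use.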
Overview of Solution Methodology
Solutions for Qualitative Problems
Theorem 3.4.13: Given the limiting matrix Π, the Cesàro limit of the transition matrix T, the quantity I − T + Π is non-singular, so its inverse exists. Theorem 3.5.2: The formula ϕ is almost surely satisfied, i.e. Pr(PM ⊨ ϕ) = 1, if there exists an FSC G such that the induced global Markov chain satisfies the Rabin acceptance condition for some pair (Repeat_i^{PMϕ}, Avoid_i^{PMϕ}) with probability 1.
Solutions for Quantitative Problems
Proposition 3.5.4: The problem of computing the satisfaction probability of ϕ under a given FSC G can be solved by computing the probability of absorption into the ϕ-feasible recurrent sets of the global Markov chain. Structure: The structure of the FSC has two components: (a) the number of internal nodes (I-states), |G|, available.
Concluding Remarks
Structure
Formally, the structure of the FSC is defined by the collection I = {G, I_ω, I_κ}, of which the first component G = {g_1, ..., g_|G|} indexes the FSC I-states. Then, for each observation o ∈ O, there must be an I-state g_l ∈ G and an action α ∈ Act such that the FSC can switch to g_l and issue the action α to the product POMDP.
Structure Preserving Parametrization of an FSC
However, a given FSC structure is admissible only if condition (4.3) holds. The above condition can be understood as follows: the admissible structure of the FSC requires that, for each observation o_k ∈ O, there is at least one pair (g_next, α_next) ∈ G × Act such that the FSC switches to g_next and issues action α_next.
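A minimal sketch of this admissibility check, assuming the structure is represented as a map from (I-state, observation) pairs to the set of allowed (next I-state, action) pairs (a hypothetical representation, not the thesis's data structure):

```python
def is_admissible(structure, observations):
    """Return True iff every (I-state, observation) pair allows at least one
    (next I-state, action) pair, as required by the admissibility condition."""
    i_states = {g for (g, _) in structure}
    return all(structure.get((g, o)) for g in i_states for o in observations)

# Example: two I-states, one observation; admissible because every (g, o) has a successor.
structure = {(0, 'o1'): {(1, 'a')}, (1, 'o1'): {(0, 'b')}}
assert is_admissible(structure, {'o1'})
```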
Maximizing Probability of LTL Satisfaction for an FSC of Fixed Structure
- Vector Notation for Finite Sets
- Ordering Global States by Recurrent classes
- Probability of Absorption in ϕ-feasible Recurrent Sets
- Complexity and Efficient Approximation
- Gradient of Probability of Absorption
- Complexity and Efficient Computation
- Gradient Based Optimization
This section attempts to calculate the probability of absorption in this set, given the initial distribution of the product POMDP PMϕ. One source of computational complexity arises from the need to compute the recurrent classes of the global Markov chain.
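One way to compute these recurrent classes is via the strongly connected components of the transition digraph: a class is recurrent exactly when no probability mass leaves its component. The sketch below assumes a dense NumPy transition matrix and uses networkx; it is illustrative rather than the thesis's algorithm.

```python
import numpy as np
import networkx as nx

def recurrent_classes(M, tol=1e-12):
    """Recurrent classes of a finite Markov chain with row-stochastic matrix M.

    A class is recurrent iff its strongly connected component has no outgoing
    probability mass (a 'bottom' SCC of the transition digraph)."""
    n = M.shape[0]
    G = nx.DiGraph()
    G.add_nodes_from(range(n))
    G.add_edges_from((i, j) for i in range(n) for j in range(n) if M[i, j] > tol)
    recurrent = []
    for scc in nx.strongly_connected_components(G):
        if all(M[i, j] <= tol for i in scc for j in range(n) if j not in scc):
            recurrent.append(sorted(scc))
    return recurrent

# Example: a 3-state chain where state 2 is absorbing (hence recurrent).
M = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
print(recurrent_classes(M))  # -> [[2]]
```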
Ensuring Non-Infinitesimal Frequency of Visiting Repeat^{PMϕ} States
Equivalence to Expected Long Term Average Reward
- Complexity of Computing the Gradient of η_av(R_k)
This is an important relationship, since the long-term average reward and the computation of its gradient are widely studied in the literature, especially for the case when the Markov chain is ergodic (or has a single recurrent class) [1, 10]. The complexity of gradient estimation from [1] is summarized here to complete the picture of the computational burden of the gradient ascent methodology.
Trade Off between Absorption Probability and Visitation Frequency
From [1], in the worst case, the computational complexity of ∇_Φ η_av(R_k) is O(|S|²|G|²|Φ||Act||O|), similar to that of the absorption probability gradient. Furthermore, the objective is nonlinear, and first-order methods can only guarantee convergence to a local maximum.
Heuristic Search for FSC Structures with a ϕ-Feasible Recurrent Set
Complexity
For the global Markov chain, the number of nodes is |S||G|, while the worst-case number of edges is |S|²|G|². On the other hand, the nested loops in steps 4-16 have a worst-case complexity of O(|S|²|G|²|Act||O|).
Initialization of Θ and Φ
Case Studies
Case Study I - Repeated Reachability with Safety
For the MDP world, the robot must first make sure it has moved to cell 16 and then issue a right-move R signal. In addition, there is a fifth Stop action, marked X, which causes the robot to remain in its current cell.
Case Study II - Stability with Safety
For the initial state, as shown in Figure 4.7, the shortest path algorithm would always direct the robot to cell 36, even though the formula ϕ2 can be satisfied with probability 1 by navigating to one of the alternative cells. Running the gradient ascent algorithm on POMDP-World yielded two controllers that likewise correspond to ϕ2(a) and ϕ2(b).
Case Study III - Initial Feasible Controller
A test trajectory for the POMDP with the best controller is also shown in the same figure via a blue dashed line. For N = 3, although feasible FSCs exist for the POMDP world, the heuristic is unable to find them.
Concluding Remarks
- Incentivizing Frequent Visits to Repeat_r^{PMϕ}
- Computing the Probability of Visiting Avoid^{PMϕ} in Steady State
- Partitioned FSC and Steady State Detecting Global Markov Chain
- Posing the Problem as an Optimization Problem
During the transient phase of the global Markov chain execution, the global state is rapidly absorbed into a ϕ-feasible recurrent set. During the steady-state phase, the state frequently visits the states in Repeat_r^{PMϕ}.
Concluding Remarks
In fact, the value of the scalar η_av can be obtained by solving a slightly different version of the Poisson equation (6.2). Further discussion of the Poisson equation, in the context of solving it using dynamic programming, is provided in Section 6.3.2.
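For reference, the standard (multichain) average-reward evaluation equations from the dynamic programming literature, which equation (6.2) resembles, relate the gain vector ⃗g and the bias V⃗_av to the reward vector ⃗r_av and the transition matrix T as follows; the exact form of (6.2) may differ in detail:

(I − T) ⃗g = 0,
⃗g + (I − T) V⃗_av = ⃗r_av.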
Dynamic Programming Basics
Dynamic Programming Variants
However, the Bellman optimality equation, the Bellman equation, and iterative value and policy techniques can also be derived for the expected long-run average reward criterion (Definition 2.2.9). In the general setting of an arbitrary reward function and an infinite state space, however, the existence of an optimal solution for the average-reward case is not guaranteed.
Summary of Dynamic Programming Facts
Value Function of Discounted Reward Criterion
Only the evaluation of the average reward value function under a given FSC is required to ensure LTL satisfaction. Clearly, equation (6.25) shows that the value of a given I-state is a linear function of the belief state.
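Concretely, the linearity can be written as follows (a standard expansion; the bracketed global-state indexing is assumed here to match equation (6.25)). For an I-state g and a belief b over product-POMDP states,

V(g, b) = Σ_{s∈S} b(s) · V([s, g]).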
Value Function of Average Reward Criterion
However, ⃗g = Π_ssd ⃗r_av is uniquely determined by the Poisson equation and can be precomputed independently of V⃗_av. Therefore, the trick used in Section 4.2.4.1 can be applied until ⃗g converges to within a tolerance ε_{Π_ssd} > 0.
Bellman Optimality Equation / DP Backup Equation - Discounted Case
Again, exact methods for solving the linear system of equations scale cubically in the number of equations, which is 2|S||G| for Equation (6.2), with an equal number of variables consisting of both V⃗_av and ⃗g. Pre-computing ⃗g makes it possible to solve a linear system in only the unknowns V⃗_av, using Equation (6.2b).
Policy Iteration for FSCs
Bounded Stochastic Policy Iteration
In [118], the authors give the following interpretation of this optimization: it implicitly considers convex combinations of the backed-up value vectors V_g^β. Tangent beliefs are those at which performing the DP backup yields no improvement of the value function over its current value.
Applying Bounded Policy Iteration to LTL Reward Maximization
Node Improvement
The I-state Improvement Bilinear Program (6.43) contains constraints built from terms of the form O(o|s) ω(g′, α | g, o) T^{PMϕ}(s′|s, α) V_β([s′, g′]) for all s, the Poisson equation constraints (imposed if g ∈ G_ss), the constraints ω(g′, α | g, o) = 0 if g′ ∈ G_tr, and the probability (normalization) constraints on ω. The program is bilinear because the term T_mod^{PMϕ}, which is linear in ω(g′, α | g, o), multiplies the unknowns V⃗_av and ⃗g in the two sets of constraints that form the Poisson equation.
Convex Relaxation of Bilinear Terms
Below, we show how to relax the bilinear constraints that appear in the Poisson equation within the I-state Improvement Bilinear Program of equation (6.43). This relaxation is possible because V⃗_av does not appear in the objective or the feasibility constraints of equation (6.43).
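As an illustration of one standard relaxation of such products (a McCormick envelope; the thesis's specific relaxation may differ), a bilinear term w = x·y with bounds x ∈ [x_L, x_U] and y ∈ [y_L, y_U] is replaced by the four linear inequalities:

w ≥ x_L·y + x·y_L − x_L·y_L,
w ≥ x_U·y + x·y_U − x_U·y_U,
w ≤ x_U·y + x·y_L − x_U·y_L,
w ≤ x_L·y + x·y_U − x_L·y_U.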
Addition of I-States to Escape Local Maxima
Note that the algorithm operates on a copy of the original FSC, and the solution of the Poisson equation is computed at the last step of policy evaluation. The addition of the phantom I-state g_phantom and the recomputation of the Poisson equation are used only in Algorithm 6.5.
Finding an Initial Feasible Controller
Similar to Algorithm 6.2, instead of directly improving the value of the tangent belief, the algorithm tries to improve the value of the intermediate beliefs that are reachable in one step from the tangent beliefs. The rest of the procedure is identical to Algorithm 6.2, except for step 20, in which each newly added I-state is placed in the correct Gtr or Gss partition.
Case Studies
Case Study I - Stability with Safety
The x-axis indicates the number of policy improvement steps performed. (a) Growth of the FSC size. The graph shows the improvement in steady-state behavior as the size of the FSC increases.
Case Study II - Repeated Reachability with Safety
The y-axis indicates the expected frequency with which states in Repeat_0^{PMϕ} were visited for the product POMDP resulting from the POMDP world of Figure 6.5(c) and the DRA for ϕ1 given by Equation (6.56).
Concluding Remarks
Task planning has traditionally been studied in the field of domain-independent planning, and therefore a historical overview of this field is also provided. This chapter distinguishes three different types of planning/control problems in manipulation tasks: path or motion planning, simple sequencing to realize short-horizon tasks, and high-level task planning for manipulation robots.
Introduction
Robotic Paradigms
For long-term goals, a deliberative planner is needed that can plan the actions that realize the goals. Low levels can remain highly reactive, while higher levels provide increasingly deliberative reasoning for long-term goal satisfaction.
Planning in Robotics
These rely on a model of all the actions available to the robot or controller, which includes the preconditions on the world state for applicability and the effects of each action on the world. Although the world itself can be quite complex, the planner works with a world model.
Background on Domain Independent Planning
Then the state space of the world model is given by all possible truth assignments, 2^L. The planner simply needs to find a path (a sequence of actions, and thus a sequence of intermediate points in state space) that takes the initial state of the world to a given final state.
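As a toy illustration of this search over truth assignments, the following sketch implements a STRIPS-style forward breadth-first planner; the Action representation and the example propositions are hypothetical, not taken from the thesis.

```python
from collections import deque
from typing import NamedTuple, FrozenSet, Tuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # propositions that must be true to apply the action
    add: FrozenSet[str]     # propositions made true by the action
    delete: FrozenSet[str]  # propositions made false by the action

def plan(init: FrozenSet[str], goal: FrozenSet[str], actions) -> Tuple[str, ...]:
    """Breadth-first search for an action sequence from init to a state containing goal."""
    frontier = deque([(init, ())])
    seen = {init}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for a in actions:
            if a.pre <= state:
                nxt = frozenset((state - a.delete) | a.add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + (a.name,)))
    return None

# Toy example: pick up a tool, then use it to remove a nut.
acts = [
    Action('pickup_tool', frozenset(), frozenset({'holding_tool'}), frozenset()),
    Action('remove_nut', frozenset({'holding_tool'}), frozenset({'nut_removed'}), frozenset()),
]
print(plan(frozenset(), frozenset({'nut_removed'}), acts))  # ('pickup_tool', 'remove_nut')
```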
The DARPA Autonomous Robotic Manipulation Software Challenge
An Example DARPA Challenge Task for ARM-S
One of the tasks the ARM-S teams had to perform was a wheel change scenario. The process of changing a wheel includes the following main tasks: finding a battery-powered impact driver on the table (workspace), removing nuts from the wheel using the impact driver, removing the wheel from the axle and placing it on the table.
Overview of Planning Challenges in ARM-S
Motion Planning
The Manipulation Planner, in combination with the Arm Planner, selects among nearby reachable poses for a given primitive task. The main purpose of the Arm Planner is to work in the joint space of the 7-DOF arm to find collision-free paths to the poses suggested by the Manipulation Planner.
Limited Task Planner / Sequencer for ARM-S
- Re-manipulation or Re-grasping
- Kinematically Dependent Tasks
- Kinematic Verification Based Execution
For example, for the EXECUTE BEH SEQ("tool use grasp", IMPACT, Ø) instruction, the grasp planner may not find any feasible solutions for the current position of the impact driver on the table. Example 2: This example deals with the case where re-manipulation of the impact driver is required to make the next task (the tool-use grasp) feasible.
Task Planning for ARM-S
Probabilistic Outcomes and Partial Observability at the Task Level
O[ObsAttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i)] = p
O[¬ObsAttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i)] = 1 − p
O[ObsAttachedNutToWheel(nut_i) | ¬AttachedNutToWheel(nut_i)] = 1 − q
O[¬ObsAttachedNutToWheel(nut_i) | ¬AttachedNutToWheel(nut_i)] = q

T[¬AttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i), ActRemoveNut(nut_i)] = r
T[AttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i), ActRemoveNut(nut_i)] = 1 − r
LTL Goals for ARM-S Task Planner - Case Studies
- Simple Reachability Goal
- Temporally Extended Goal
The robot must perform the wheel removal task whenever a wheel assembly appears in front of it. The wheel removal task may take an unknown amount of time due to the probabilistic nature of the task.
Description of the System in RDDL
But when the task is finished, it would be useful if the robot itself could signal that it can receive a new task.
Application of Bounded Policy Iteration to ARM-S Tasks
- Preprocessing
- ARM-S Task 1
- ARM-S Task 2
- Resulting Control Policy
Deterministic model: In this case, all parameters are set to p = q = r = 1, indicating that the nut removal task is always successful and the AttachedNutToWheel(nut_i) states are fully observable.

Pr[ObsAttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i)] = p = 0.5
Pr[¬ObsAttachedNutToWheel(nut_i) | AttachedNutToWheel(nut_i)] = 1 − p = 0.5
Pr[¬ObsAttachedNutToWheel(nut_i) | ¬AttachedNutToWheel(nut_i)] = q = 0.5
Pr[ObsAttachedNutToWheel(nut_i) | ¬AttachedNutToWheel(nut_i)] = 1 − q = 0.5
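To illustrate the effect of these observation parameters on the task-level belief, the following sketch performs a Bayes update of the belief that a nut is attached (the function and its arguments are illustrative, not part of the thesis's implementation); with p = q = 0.5 the observation is completely uninformative.

```python
def belief_update(b_attached, obs_attached, p=0.5, q=0.5):
    """Bayes update of the belief that a nut is attached, using the model above:
    Pr[obs attached | attached] = p and Pr[not obs attached | not attached] = q."""
    if obs_attached:
        num = p * b_attached
        den = p * b_attached + (1 - q) * (1 - b_attached)
    else:
        num = (1 - p) * b_attached
        den = (1 - p) * b_attached + q * (1 - b_attached)
    return num / den

# With p = q = 0.5 the observation carries no information: the belief is unchanged.
assert abs(belief_update(0.7, True) - 0.7) < 1e-12
```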
Concluding Remarks
However, if the robot tries to remove the wheel from the hub, the success of the wheel removal can be used to infer the state of the nuts. However, during practical implementation, the off-the-shelf MATLAB tool in [89] was found to run out of memory for the ARM-S case studies.
Open Issues and Future Work
This can be seen as an extension of the multi-armed bandit problem popular in the reinforcement learning literature [69]. However, sampling-based approaches need to be extended to ensure pointwise satisfaction of the Poisson equation under the conservative optimization criterion, and not only at the sampled beliefs.
Measurable Space and Measure
Probability Space and Measure
Natural σ-Algebra and Distributions over Countable Sets
- The DARPA ARM-S Robot
- Graphical representation of the DRA translations of common LTL specifications
- Partially observable probabilistic grid world
- Evolution of a labeled POMDP
- POMDP controlled by an FSC
- Effect of FSC structure on ϕ-feasibility
- Logistic function
- Repeat^{PMϕ} can be visited with vanishing frequency
- Generating Admissible Structures of FSC
- System Models for Case Study I
- DRA for the LTL specification ϕ1 = □◇a ∧ □◇b ∧ ¬c
- Dependence of expected steady state average reward η on size of FSC
- System Model for Case Study II
- Sample trajectories under optimal controllers
- Assigning rewards for visiting Repeat^{PMϕ} frequently
- Modifying T^ϕ for steady-state ϕ-feasibility
- Example where visiting Avoid^{PMϕ} is required to reach Repeat^{PMϕ}
- Steady state detecting global Markov chain
- Reduced Feasibility arising from Conservative Optimization Criterion
- Value Function for a two state POMDP
- Effect of DP Backup Equation
- Effect of I-state Improvement LP
- Policy Iteration Local Maximum
- System models for Policy Iteration case studies
- Transient behavior optimization using Bounded Policy Iteration
- Effect of Bounded Policy Iteration on steady state behavior
- Various Robotic Paradigms
- A Hybrid Robot Architecture
- The DARPA ARM-S Robot
- Wheel Removal Task for the ARM-S Robot
- Abstracted digraph for removing nuts with impact driver
- Execution for remanipulation task of Example 2
- Probabilistic Outcomes and Partial Observability in ARM-S
- DRA for ARM-S Tasks
- DBN of the ARM-S task planning domain
- Bounded Policy Iteration for ARM Task 1
- Bounded Policy Iteration for ARM Task 2
- ARM-S Task 2 Policy
- Complexity of LTL satisfaction over POMDPs
- Results for GW-B under ϕ2
- Finding the Initial Feasible Controller by Algorithm
- Problem size of policy iteration for naive implementation
- Reduce problem size for policy iteration after basic preprocessing
- Generate Set To Visit Frequently
- Generate Candidate FSCs
- Policy Iteration for Markov Decision Process
- Bounded PI: Adding I-States to Escape Local Maxima
- Bounded Policy Iteration For Conservative Optimization Criterion
- Adding I-states to Escape Local Maxima of Constrained Optimization Criterion
- Pruning candidate successor I-states and actions to satisfy recurrence constraints
- A common compound task
- Task sequence for Example 1
M. Athans. The role and use of the stochastic linear-quadratic-Gaussian problem in control system design.
A. Pnueli. Applications of temporal logic to the specification and verification of reactive systems: a survey of current trends.