Learning to Optimize: from Theory to Practice

Academic year: 2023


Full Text

Introduction

Motivation

The field of machine learning studies a related problem: can an algorithm or program be learned from data? Second, some parameters in an algorithm's outline control its behavior during the optimization process for individual problem instances.

Challenges

Machine learning techniques have also been studied in the context of SAT solving (Xu et al., 2008), producing a solver that won several SAT competitions. To make matters worse, the rigidity of the solvers' interfaces means that we are often unable to collect useful information, e.g., interactive feedback for imitation learning.

Thesis Organization

So we randomly sample subsets of size p · |I| for p = 0.01, i.e., we expect a pseudo-backdoor to contain 1% of the integer variables. In the second project (Section 6.4), we further equip the surrogate with additional regularization properties, for example submodularity.
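As a concrete illustration of this sampling step, below is a minimal Python sketch; the function name, defaults, and variable labels are illustrative assumptions rather than code from the thesis.

```python
import random

def sample_candidate_backdoors(integer_vars, p=0.01, num_samples=10, seed=0):
    """Randomly sample candidate pseudo-backdoors, each containing roughly a
    fraction p of the integer variables (p = 0.01, i.e., about 1% of |I|)."""
    rng = random.Random(seed)
    size = max(1, round(p * len(integer_vars)))
    return [rng.sample(integer_vars, size) for _ in range(num_samples)]

# Example: 500 integer variables -> each candidate contains 5 variables.
candidates = sample_candidate_backdoors([f"x{i}" for i in range(500)])
```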

Related Works

Related Works on Learning to Optimize

Additionally, one can also train a model to decide when to run the primal heuristics built into many ILP solvers (E. B. Khalil et al., 2017). The work by Andrychowicz et al. (2016) uses LSTMs to predict the updates for gradient descent algorithms.

Policy Learning for Sequential Decision-Making

The intuition is that, when we do not know a∗ = π∗(s), we should maximize the agreement between a_A = π_A(s_A) and a_B = π_B(s_B). In the first project (Section 6.1), we learn to estimate a safety cost called Approximate Clearance Evaluation (ACE) (Otsu et al., 2020), which is used in the Enhanced Navigation (ENav) library (Toupet et al., 2020) to plan paths for the Perseverance rover on the Mars 2020 mission (Williford et al., 2018).
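To make the agreement idea concrete, the sketch below computes a symmetrized KL divergence between the two policies' action distributions on the same underlying state; minimizing it maximizes agreement. The symmetrized-KL choice is an assumption for illustration, not necessarily the exact objective used in the thesis.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def disagreement(logits_a, logits_b, eps=1e-12):
    """Symmetrized KL divergence between pi_A(. | s_A) and pi_B(. | s_B) for
    the same underlying state; minimizing it maximizes agreement when the
    optimal action a* = pi*(s) is unknown."""
    pa, pb = softmax(np.asarray(logits_a)), softmax(np.asarray(logits_b))
    kl_ab = np.sum(pa * (np.log(pa + eps) - np.log(pb + eps)), axis=-1)
    kl_ba = np.sum(pb * (np.log(pb + eps) - np.log(pa + eps)), axis=-1)
    return float(np.mean(0.5 * (kl_ab + kl_ba)))
```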

Learning to Search with Retrospective Imitation

Introduction

First, our approach iteratively refines toward solutions that may be of better quality or easier for the policy to find than the original demonstrations. We demonstrate that our approach improves upon previous imitation learning work (He, Daumé III, & Eisner, 2014) as well as commercial solvers such as Gurobi (for integer programs).

Problem Setting & Preliminaries

An imitation learning policy is initialized based on expert traces and is deployed to generate its own traces. The policy is then updated based on the feedback generated by the retrospective oracle, as shown in Figure 3.2.
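A high-level sketch of this loop is given below, assuming hypothetical `policy.fit`, `policy.rollout`, and `oracle` interfaces; it is meant to illustrate the flow described above rather than reproduce Algorithm 1 exactly.

```python
def retrospective_imitation(policy, expert_traces, envs, oracle, rounds=5):
    """Initialize from expert traces, then repeatedly deploy the policy,
    relabel its own traces with the retrospective oracle, and retrain."""
    data = list(expert_traces)
    policy.fit(data)                                    # imitate the expert
    for _ in range(rounds):
        traces = [policy.rollout(env) for env in envs]  # deploy the policy
        data += [oracle(trace) for trace in traces]     # retrospective feedback
        policy.fit(data)                                # update the policy
    return policy
```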

Retrospective Imitation Learning

A retrospective oracle (with query access to the environment) takes a search trace τ as input and outputs a retrospective optimal trace π∗(τ, s) for each terminal state s. In Figure 3.1, given the current trace with a terminal state (step ②), the retrospective optimal trace is the path along the red nodes (step ③).
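The sketch below shows one simple way such an oracle can recover the retrospective optimal trace once a terminal state is reached, by walking parent pointers back to the root; the parent dictionary is hypothetical but mirrors the 1 → 6 → 8 → 9 path in Figure 3.3.

```python
def retrospective_optimal_trace(parents, terminal):
    """Recover the root-to-terminal path, in retrospect, from parent pointers
    recorded along the search trace."""
    path, node = [terminal], terminal
    while node in parents:            # the root has no parent entry
        node = parents[node]
        path.append(node)
    return list(reversed(path))

# Hypothetical trace consistent with Figure 3.3.
parents = {2: 1, 3: 1, 4: 2, 5: 2, 6: 1, 7: 6, 8: 6, 9: 8}
assert retrospective_optimal_trace(parents, 9) == [1, 6, 8, 9]
```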

Theoretical Results

Moreover, the oracle's retrospective feedback matches the true training objective, while the dataset collected by imitation learning does not, as explained in Section 3.3. We then analyze how lower error rates affect the number of actions needed to reach a terminal state.

Figure 3.3: An example search trace by a policy. The solid black nodes (1 → 6 → 8 → 9) make up the best trace to a terminal state in retrospect.

Experimental Results

The number of candidate repairs is potentially exponential in the size of the neighborhood, which explains the "large" in large neighborhood search (LNS).

Figure 3.4: Representation of polygonal obstacle by disjunctive linear constraints.

Co-training for Policy Learning

Introduction

This is related to the co-training problem (Blum and Mitchell, 1998), where different feature representations of the same problem enable more efficient learning than using a single representation (Wan, 2009; Kumar and Daumé, 2011). In this work, we propose CoPiEr (Co-training for Policy Learning), a policy co-training meta-framework that can incorporate both reinforcement learning and imitation learning as subroutines.

Background & Preliminaries

A deterministic transition function is the natural choice for adding a node to the current partial solution; likewise, a deterministic transition function is the natural choice for adding a new node to the search tree.

A Theory of Policy Co-training

For each state, we can directly compare the actions chosen by the two policies since the action space is the same. This insight leads to a more robust analysis result where we can narrow the gap between a co-trained policy and an optimal policy.

Figure 4.2: Co-training with shared action space.

The CoPiEr Algorithm

For the special case with a shared action space, we can collect more informative feedback beyond the trajectory level. Instead, we collect interactive state-level feedback, as is popular in imitation learning algorithms such as DAgger (Stéphane Ross, Gordon, & D. Bagnell, 2011) and related approaches.
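The following sketch illustrates how such state-level feedback could be gathered in the shared-action-space case, with one policy acting and its peer providing per-state action labels; the gym-style `env` and `policy.act` interfaces are assumptions.

```python
def state_level_feedback(policy_a, policy_b, env, horizon=200):
    """Roll out policy A and record, at every visited state, the action that
    policy B would have taken (swap the arguments for the other direction)."""
    data = []
    state = env.reset()
    for _ in range(horizon):
        action = policy_a.act(state)                    # action actually taken
        data.append((state, policy_b.act(state)))       # peer label for this state
        state, _reward, done, _info = env.step(action)
        if done:
            break
    return data
```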

Experiments

For example, Table 5.6 shows a benchmark comparison between Gurobi and a state-of-the-art learning approach built on SCIP (Gasse et al., 2019), reporting the 1-shifted geometric mean of wall-clock time on the hard instances from their paper.

Figure 4.4: Discrete & continuous control tasks. Experiment results are across 5 randomly seeded runs.

Incorporating Existing Solvers as Sub-routines

A General Large Neighborhood Search Framework for Solving Integer Programs

Designing algorithms for solving difficult combinatorial optimization problems remains a valuable and challenging task. Motivated by the aforementioned drawbacks, we study how to design abstractions of large-scale combinatorial optimization problems that can exploit existing state-of-the-art solvers as generic black-box subroutines.

Background on LNS

At a high level, our LNS framework operates on an ILP by defining a decomposition of its integer variables into disjoint subsets. We first describe a version of LNS for integer programs based on integer-variable decomposition, which is a modified version of the evolutionary approach proposed in Rothberg (2007), described in Algorithm 10.
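A minimal sketch of one decomposition-based LNS pass is shown below, assuming a minimization objective and a hypothetical `solve_subproblem` wrapper around an existing solver (e.g., Gurobi or SCIP); it illustrates the idea rather than reproducing Algorithm 10.

```python
import random

def lns_decomposition_pass(solve_subproblem, int_vars, incumbent, incumbent_obj,
                           num_subsets=5, seed=0):
    """Split the integer variables into disjoint subsets; for each subset,
    fix all other integer variables to their incumbent values and re-optimize
    the free ones with the black-box solver."""
    rng = random.Random(seed)
    order = list(int_vars)
    rng.shuffle(order)
    subsets = [order[i::num_subsets] for i in range(num_subsets)]

    best, best_obj = dict(incumbent), incumbent_obj
    for subset in subsets:
        free = set(subset)
        fixed = {v: best[v] for v in best if v not in free}
        solution, obj = solve_subproblem(fixed)          # black-box re-optimization
        if obj < best_obj:                               # keep only improvements
            best, best_obj = solution, obj
    return best, best_obj
```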

Learning a Decomposition

The (weighted) adjacency matrix of the graph contains all the information defining the optimization problem, so we use it as input to the learned model. Whichever featurization we use, the feature matrix has the same number of rows as the number of integer variables we consider, so we can easily add each variable's value in the incumbent solution as an additional feature.
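The sketch below illustrates this shape convention with a couple of stand-in graph statistics (degree and total incident edge weight); the actual featurization in the thesis may differ, and only the incumbent-value column is taken directly from the description above.

```python
import numpy as np

def variable_features(adjacency, incumbent_values):
    """Build a feature matrix with one row per integer variable and append
    each variable's value in the incumbent solution as an extra column."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = (adjacency != 0).sum(axis=1, keepdims=True).astype(float)
    weight = adjacency.sum(axis=1, keepdims=True)
    incumbent = np.asarray(incumbent_values, dtype=float).reshape(-1, 1)
    return np.hstack([degree, weight, incumbent])        # shape: (#int vars, 3)
```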

Empirical Validation for Learning-based LNS

More importantly, in Figure 5.2d, Gurobi was given 2 hours of wall clock time and failed to match the solution found by Random-LNS in just under 5 seconds (the time axis is in log scale). In all three cases, the imitation learning methods, BC-LNS and FT-LNS, outperform Random-LNS.

Table 5.1: Comparison of different LNS methods and Gurobi for MVC and MAXCUT problems.

Learning Pseudo-backdoors for Mixed Integer Programs

We represent MIPs as bipartite graphs with different node features for variables and constraints, as in Gasse et al. (2019), and use graph attention networks (Veličković et al., 2018) with pooling to learn both models.

Problem Statement for Learning Pseudo-backdoors

Learning Pseudo-Backdoors

These candidate pseudo-backdoor sets are ranked according to the scoring module S(P, B; θ_S) to predict the best pseudo-backdoor B∗. We then label each MIP instance and pseudo-backdoor pair based on whether the pseudo-backdoor results in a faster solve time.
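At deployment time, the two learned modules can be combined roughly as in the sketch below; `scorer` and `classifier` are assumed callables wrapping the learned graph networks, and the 0.5 threshold is illustrative.

```python
def select_pseudo_backdoor(problem, candidates, scorer, classifier, threshold=0.5):
    """Rank candidate pseudo-backdoor sets with the scoring module, then let
    the classification module decide whether the top-ranked set is likely to
    speed up solving; otherwise fall back to the plain solver."""
    best = max(candidates, key=lambda backdoor: scorer(problem, backdoor))
    if classifier(problem, best) >= threshold:
        return best
    return None   # None signals: solve the MIP without a pseudo-backdoor
```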

Figure 5.3: The pseudo-backdoor deployment pipeline visualizes the different components used for solving a single MIP instance with the two learned models, the scoring module S(P, B; θ_S) and the classification module C(P, B; θ_C).

Experiment Results for Learning Pseudo-backdoors

We provide a thorough analysis of the proposed algorithm and prove strong performance guarantees for the learned policy. We first consider the bound on the loss of the learned policy as measured against the expert policy.

Table 5.8: Runtime comparison in seconds of standard Gurobi (grb), the score model (scorer), and the score model with subsequent classification (scorer+cls) across 3 hardness settings of GISP.

Learning Surrogates for Optimization

Learning Safety Surrogate for the Perseverance Rover

ENav uses the Approximate Clearance Evaluation (ACE) algorithm (Otsu et al., 2020) to evaluate a sorted list of candidate paths for safe passage. In this project, we train a machine learning (ML) classifier to predict ACE values so that candidate rover paths can be sorted more efficiently before the ACE evaluation step.

Method for Learning Safety Surrogate

However, the ACE cost depends not only on the terrain but also on the rover's heading. Note that the output of this prediction is the probability that ACE returns a safety violation, which is different from the output of the ACE algorithm itself.
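In deployment, the surrogate's predicted violation probability can be used to order candidate paths before the expensive ACE check, as in the sketch below; `ace_surrogate` is an assumed callable for the learned classifier.

```python
def sort_paths_by_predicted_risk(paths, heightmap, heading, ace_surrogate):
    """Order candidate paths by the predicted probability that ACE would flag
    a safety violation, so the safest-looking paths are evaluated first."""
    return sorted(paths, key=lambda path: ace_surrogate(heightmap, path, heading))
```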

Figure 6.1: An example of a Learned Heuristic. Sets of terrain heightmaps (a) and maps generated by the ACE algorithm (b) were used to train a neural network to generate an inferred ACE probability map (c).

Evaluations of the Learned Safety Surrogate

When the number of ACE evaluations exceeds the threshold, it indicates that ENav is "thinking too much" and the rover may need to stop driving until a solution is found. This would result in far fewer instances in which the rover has to stop moving before the next path is found.

Figure 6.2: A summary of key rover path planning performance metrics across various experiments.

Learning Surrogate with Submodular-Norm Regularization

The last two properties allow us to prove approximation guarantees for the resulting greedy heuristic. Our results suggest that, compared to standard learning-based baselines: (a) LeaSuRe requires significantly fewer oracle calls during training to learn the target objective (i.e., to minimize the approximation error against the oracle objective); and (b) at test time, LeaSuRe achieves superior performance on the corresponding optimization task (i.e., minimizing the regret with respect to the original combinatorial optimization task).

Background and Problem Statement

As discussed in the previous sections, many sequential decision-making problems can be characterized as constrained monotone submodular maximization problems. For example, in the case of active learning, gexp(A, e) could be the expert acquisition function that ranks the importance of labeling each unlabeled point, given the currently labeled subset.

Learning with Submodular Regularization

In the vertex cover case, gexp(A, e) can be the function that scores each vertex and determines the next best vertex to add to the cover. Then g can learn not only from the expert's selected pairs ⟨A, x⟩, but can also see the labels of pairs that the expert would not have chosen.
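Once g is learned, it can drive a standard greedy selection loop, sketched below under the assumption of a simple cardinality budget; `g(A, e)` stands in for the learned surrogate of gexp(A, e).

```python
def greedy_select(ground_set, g, budget):
    """Greedily build a set by repeatedly adding the element with the highest
    learned score given the current selection."""
    selected, remaining = [], set(ground_set)
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining, key=lambda e: g(selected, e))
        selected.append(best)
        remaining.remove(best)
    return selected
```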

Analysis for LeaSuRe

Note that the regret term above corresponds to the average difference in the score function between the learned policy and the expert policy. Although this result shows that LeaSuRe is consistent with the expert, it does not directly address how well the learned policy performs in terms of the utility it obtains.

Evaluations LeaSuRe on Protein Engineering

Conclusion & Future Directions

Proofs

Since we have trajectory mappings between M_A and M_B, we can convert an occupancy measure in M_A to one in M_B by mapping each trajectory and recomputing the counts in the new MDP representation. Thus we can make this weaker assumption, which is also more intuitive, and the original statement still holds with a different term.

Pictorial Representation of the Two Views in Risk-aware Path Planning

For the MILP view, we directly use randomly generated rectangles to define the constraint space. However, for the QCQP view, we enclose each rectangular obstacle with a circle to define the quadratic constraints.

Discrete/Continuous Control Results in Tabular Form

Algorithm Configuration Results

Visualization

Each subimage contains information about the locations of obstacles (light blue squares) and the locations of waypoints after the current LNS iteration. We highlight the subset of 5 obstacles (red circles) and 5 waypoints (dark blue squares) that appear most often in the first neighborhood of the current decomposition.

Table B.4: Parameter sweep results for (k, t) of the MAXCUT dataset for Barabási-Albert random graphs with 500 vertices.

Model Architecture

Obstacles in red and waypoints in dark blue are the ones that appear most frequently in subsets leading to large local improvement.

Domain Heuristics

The first is greedy: at each step, we accept the highest bid among the remaining bids, remove its desired items, and eliminate other bids that want any of the removed items.
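A minimal sketch of this greedy heuristic is shown below, assuming each bid is a (value, items) pair; the representation is an assumption for illustration.

```python
def greedy_auction(bids):
    """Repeatedly accept the highest remaining bid, remove the items it wants,
    and drop every other bid that requests any of the removed items."""
    accepted = []
    remaining = sorted(bids, key=lambda bid: bid[0], reverse=True)
    while remaining:
        value, items = remaining.pop(0)
        accepted.append((value, items))
        taken = set(items)
        remaining = [bid for bid in remaining if not taken & set(bid[1])]
    return accepted

# Example: bids given as (value, items) pairs.
print(greedy_auction([(10, {"a", "b"}), (8, {"b", "c"}), (5, {"d"})]))
```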

Table B.8: Model architectures for all the experiments.

Proof for section 6.7

Supplemental Details for the Protein Engineering Experiments

Figures

Figure 3.1: A visualization of retrospective imitation learning depicting components of Algorithm 1
Figure 3.2: Zoom-in views of Region A and B in Figure 3.1. At node E , the retrospective feedback indicates selecting node H over F , G , and I
Figure 3.3: An example search trace by a policy. The solid black nodes ( 1 → 6 → 8 → 9 ) make up the best trace to a terminal state in retrospect
Figure 3.4: Representation of polygonal obstacle by disjunctive linear constraints.
