
10.2.1 Factored Representation and Scalable Bi-Level Optimization

The primary challenge in MDP interdiction is that the state space is exponential in the number of state variables. We address this problem by leveraging approximation techniques for factored MDPs. We present a mixed integer linear program formulation of a Stackelberg game model of factored MDP interdiction. We represent the value function as a linear combination of a set of basis functions. For an effective basis representation, we use a Fourier basis over a Boolean hypercube to represent the value function over the binary state variables. Since the exact Fourier basis representation is still exponential, we generate basis functions iteratively while computing the attacker's optimal policy with approximate factored MDP solution approaches. In doing so, we are able to efficiently represent the exponential state space for value function computations using an optimized set of Fourier basis functions. However, in solving the MDP interdiction game, we face a second challenge: a super-exponential set of constraints corresponding to alternative evasion plans of the attacker. To tackle this problem, we develop a novel constraint generation algorithm that combines linear programming factored MDP solvers with heuristics for attack plan generation.
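The following is a minimal sketch of the kind of representation described above, assuming binary state variables encoded as 0/1 tuples and Fourier basis functions given by parity (Walsh-Hadamard) characters over subsets of variables. The function names and the particular subsets and weights are illustrative only; in the actual approach the subsets are generated iteratively and the weights are fitted by the factored MDP solver.

```python
def parity_basis(subset):
    """Fourier (Walsh-Hadamard) character chi_S(x) = (-1)^(sum_{i in S} x_i)
    over a binary state vector x in {0,1}^n."""
    def chi(x):
        return (-1) ** sum(x[i] for i in subset)
    return chi

def value(x, basis, weights):
    """Approximate value function: a linear combination of selected basis functions."""
    return sum(w * chi(x) for w, chi in zip(weights, basis))

# Toy example: 4 binary state variables and a small, incrementally chosen basis.
subsets = [(), (0,), (1,), (0, 2), (1, 3)]        # subsets S of state variables
basis = [parity_basis(S) for S in subsets]
weights = [1.0, -0.5, 0.3, 0.2, -0.1]             # would be fitted by the factored-MDP LP
print(value((1, 0, 1, 0), basis, weights))
```

The full Fourier basis over n binary variables has 2^n characters, which is why only a small, iteratively selected subset is kept.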

Our algorithms achieve dramatically improved scalability on realistic MDP problem domains from the IPC and scale to state spaces with more than 2^60 states, whereas the exact interdiction baseline does not scale beyond toy problem sizes (about 2^8 states). We also demonstrate empirically that our algorithms achieve near-optimal interdiction decisions in each MDP domain in our experiments.

10.2.2 MDP Initial State Interdiction: Single-Level Optimization

Although our proposed MDP interdiction algorithms are reasonably scalable, significant scalability limitations remain. For example, if we want to capture uncertainty about the attacker, we would need to compute an optimal policy for each attacker type, which quickly becomes intractable. We propose a novel interdiction model in which, unlike prior work, the defender modifies the initial state of the attacker. The attacker then plans from this modified initial state to compute an optimal policy. This model is also very general, because previous interdiction approaches can be modeled by adding action-specific preconditions as state variables. We demonstrate that this model admits much simpler and more scalable interdiction algorithms. Specifically, the attacker's optimal policy computation is independent of the defender's interdiction decision (the modification of the initial state to a new start state). This yields a single-level optimization problem for interdiction, in contrast to the more difficult bi-level optimization problems faced in previous work. We present a baseline initial state interdiction algorithm using factored MDP solution approaches with the Fourier basis and formulate the interdiction optimization as an integer linear program. However, we observe that the difficulty of solving the factored MDP grows exponentially in the number of interdependencies among state variables in the underlying Bayesian network. Consequently, we propose model-free reinforcement learning to further improve scalability by learning the value function directly from observations.
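The sketch below illustrates why initial state interdiction is single-level: the attacker's value function V is computed once and does not depend on the defender's choice, so interdiction reduces to picking the initial-state modification that minimizes V. For concreteness, the sketch enumerates up to a budget of variable flips; the thesis instead formulates this choice as an integer linear program over the flip variables. All names (interdict_initial_state, the budget parameter, the stand-in V) are hypothetical.

```python
from itertools import combinations

def interdict_initial_state(s0, V, budget):
    """Single-level initial state interdiction: choose up to `budget` flips of
    the attacker's binary initial state s0 that minimize the attacker's value
    V(s). V is precomputed and does not depend on the defender's decision."""
    n = len(s0)
    best_state, best_value = tuple(s0), V(tuple(s0))
    for k in range(1, budget + 1):
        for flips in combinations(range(n), k):
            s = list(s0)
            for i in flips:
                s[i] = 1 - s[i]
            s = tuple(s)
            if V(s) < best_value:
                best_state, best_value = s, V(s)
    return best_state, best_value

# Toy example with a stand-in value function over 4 binary state variables.
V = lambda s: sum(s)
print(interdict_initial_state((1, 1, 0, 1), V, budget=2))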

10.2.3 Improving Scalability with Reinforcement Learning

We begin with linear action-value Q-function approximation using the Fourier basis set and adapt the traditional Q-learning algorithm to embed a basis function selection step between gradient descent iterations. Specifically, we formulate a mixed integer linear program that computes the most important basis function to add to an existing set, based on the largest marginal impact on the Q-function approximation. We then extend the interdiction integer program to work with the Q-function and present an algorithm for state interdiction with linear Q-learning. Extensive experiments demonstrate that this approach achieves dramatically improved scalability. However, the performance still depends on the subset of basis functions chosen for the approximation. To address this limitation, we generalize the reinforcement learning framework to incorporate non-linear Q-function approximation, e.g., a neural network-based Q-function. Since the integer linear program that we developed for interdiction no longer works with the non-linear approximation, we propose two local search approaches: a) greedy local search that changes one state variable at a time, and b) a local linear approximation algorithm that linearizes the Q-function locally and solves an integer linear program in that region. Extensive experiments demonstrate that the non-linear Q-learning approach with local linear approximation achieves the best scalability while ensuring reasonable solution accuracy. We reason that the Q-learning based approaches scale better primarily because they do not need to explicitly solve the MDP and learn the value function directly from observations.
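The following is a minimal sketch of two of the pieces described above, under the same binary-state and Fourier-feature assumptions as before: a single gradient-descent update of linear Q-learning on parity features, and the greedy local search that flips one state variable at a time. The basis selection MILP and the local linear approximation are omitted; function names, learning rates, and the per-action feature layout are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def features(x, a, subsets, num_actions):
    """Fourier features chi_S(x) for binary state x, block-replicated per action a."""
    phi_x = np.array([(-1) ** sum(x[i] for i in S) for S in subsets], dtype=float)
    phi = np.zeros(len(subsets) * num_actions)
    phi[a * len(subsets):(a + 1) * len(subsets)] = phi_x
    return phi

def q_learning_step(w, x, a, r, x_next, subsets, num_actions, alpha=0.1, gamma=0.95):
    """One gradient-descent update of linear Q-learning on the Fourier features."""
    phi = features(x, a, subsets, num_actions)
    q_sa = w @ phi
    q_next = max(w @ features(x_next, b, subsets, num_actions) for b in range(num_actions))
    td_error = r + gamma * q_next - q_sa
    return w + alpha * td_error * phi

def greedy_state_interdiction(s0, attacker_value, budget):
    """Greedy local search: repeatedly flip the single state variable whose flip
    most reduces the attacker's (possibly non-linear) value, up to `budget` flips."""
    s = list(s0)
    for _ in range(budget):
        best_i, best_val = None, attacker_value(tuple(s))
        for i in range(len(s)):
            s[i] = 1 - s[i]
            v = attacker_value(tuple(s))
            if v < best_val:
                best_i, best_val = i, v
            s[i] = 1 - s[i]
        if best_i is None:
            break
        s[best_i] = 1 - s[best_i]
    return tuple(s)
```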

10.2.4 Bayesian Interdiction

So far, in our models, we assumed that the interdiction game has complete information. However, this is often not the case in real-world situations. We model this uncertainty about the attacker (e.g., its capabilities and actions) in a Bayesian game framework that incorporates several possible attacker types (in terms of the initial attack state, such that the defender does not have access to the full initial state). We observe that the value function learning is in terms of the full initial state, whether or not it is observed by the defender. Therefore, all our interdiction algorithms can be extended to the Bayesian game framework by introducing additional variables and constraints specific to an attacker type (and averaging over attacker types in the case of non-linear Q-learning). Extensive experiments demonstrate a large benefit to the defender from considering Bayesian interdiction compared to the baseline interdiction of a worst-case attack. While the runtime increases in Bayesian interdiction, the increase is small relative to the time it takes to solve the MDP. To summarize, we present a scalable extension of MDP interdiction that includes uncertainty about the attacker in a Bayesian game framework.
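The averaging-over-types step mentioned above for the non-linear Q-learning case can be sketched as follows, assuming the defender has a prior over attacker types and a learned per-type value (or Q-) function; the type names, priors, and stand-in value functions are purely illustrative.

```python
def bayesian_interdiction_value(candidate_state, type_priors, value_by_type):
    """Defender's objective for a candidate interdicted initial state:
    the attacker's value averaged over attacker types."""
    return sum(p * value_by_type[t](candidate_state) for t, p in type_priors.items())

# Toy example with two attacker types and stand-in per-type value functions.
type_priors = {"type_a": 0.7, "type_b": 0.3}
value_by_type = {
    "type_a": lambda s: sum(s),
    "type_b": lambda s: 2 * s[0] + s[1],
}
print(bayesian_interdiction_value((1, 0, 1), type_priors, value_by_type))
```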

10.3 Robust Antibody Design as an Interdiction Game