Chapter II: Background
2.5 Dueling Bandits
In a number of real-world situations, absolute numerical feedback is unavailable. For instance, as discussed in Chapter 1, humans often cannot reliably specify numerical scores. Yet, in many cases, people can more accurately provide various forms of qualitative feedback, and in particular, much of this thesis focuses on online learning from pairwise preferences. In the bandit setting, preference-based learning is called the dueling bandits problem and was first formalized in Yue, Broder, et al. (2012). The dueling bandit setting is a sequential optimization problem, in which the learning agent selects actions and receives relative pairwise preference feedback between them. In other words, the agent poses queries of the form, "Do you prefer A or B?" The dueling bandits paradigm formalizes the problem of online regret minimization through preference-based feedback.
The dueling bandits setting is defined by a tuple, $(\mathcal{A}, \varphi)$, in which $\mathcal{A}$ is the set of possible actions and $\varphi$ is a function defining the feedback generation mechanism.
The setting proceeds in a sequence of iterations or rounds. In each iteration $i$, the algorithm chooses a pair of actions $\mathbf{a}_{i1}, \mathbf{a}_{i2} \in \mathcal{A}$ to duel or compare, and observes the outcome of that comparison ($\mathbf{a}_{i1}$ and $\mathbf{a}_{i2}$ can be identical).³ The outcome of each duel between $\mathbf{a}, \mathbf{a}' \in \mathcal{A}$ represents a preference between them, and is an independent sample of a Bernoulli random variable. We denote a preference for action $\mathbf{a}$ over action $\mathbf{a}'$ as $\mathbf{a} \succ \mathbf{a}'$, and define the probability that $\mathbf{a}$ is preferred to $\mathbf{a}'$ as:
$$P(\mathbf{a} \succ \mathbf{a}') := \varphi(\mathbf{a}, \mathbf{a}') + \frac{1}{2},$$
where $\varphi(\mathbf{a}, \mathbf{a}') \in [-1/2, 1/2]$ denotes the stochastic preference between $\mathbf{a}$ and $\mathbf{a}'$, so that $P(\mathbf{a} \succ \mathbf{a}') > \frac{1}{2}$ if and only if $\varphi(\mathbf{a}, \mathbf{a}') > 0$.
In each learning iteration $i$, the preference for the $i$th query is denoted by $y_i := \mathbb{I}[\mathbf{a}_{i2} \succ \mathbf{a}_{i1}] - \frac{1}{2} \in \{-\frac{1}{2}, \frac{1}{2}\}$, where $\mathbb{I}[\cdot]$ denotes the indicator function, so that $P(y_i = \frac{1}{2}) = 1 - P(y_i = -\frac{1}{2}) = P(\mathbf{a}_{i2} \succ \mathbf{a}_{i1}) = \varphi(\mathbf{a}_{i2}, \mathbf{a}_{i1}) + \frac{1}{2}$. The outcome $y_i$ is always a binary preference (not a tie), but preference feedback need not be received for every preference query.
³ The actions are bolded because they could be represented by $d$-dimensional vectors (i.e., $\mathcal{A} \subset \mathbb{R}^d$); however, such a representation is not required.
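To make this feedback model concrete, the following minimal Python sketch simulates a single duel under an assumed tabular representation, where `phi` is a skew-symmetric matrix with entries in $[-1/2, 1/2]$; the function and variable names here are illustrative rather than part of the formal setting.

```python
import numpy as np

def duel(phi, a1, a2, rng):
    """Simulate one duel between actions a1 and a2 (indices into phi).

    phi[a, b] is the stochastic preference of a over b, so that
    P(a2 preferred to a1) = phi[a2, a1] + 1/2. Returns y = +1/2 if a2
    wins and y = -1/2 if a1 wins (ties never occur).
    """
    p_a2_wins = phi[a2, a1] + 0.5
    return 0.5 if rng.random() < p_a2_wins else -0.5

rng = np.random.default_rng(0)
phi = np.array([[0.0, 0.1], [-0.1, 0.0]])   # action 0 beats action 1 with probability 0.6
y = duel(phi, a1=1, a2=0, rng=rng)           # +1/2 with probability 0.6
```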
Bounding the regret of a bandit algorithm requires some assumptions on the tuple $(\mathcal{A}, \varphi)$. For instance, if $\mathcal{A}$ is an infinite set and the outcomes associated with different actions are independent, that is, observing $\mathbf{a} \succ \mathbf{a}'$ yields no information about any action $\mathbf{a}'' \notin \{\mathbf{a}, \mathbf{a}'\}$, then it would be impossible to achieve sublinear regret. Thus, either $\mathcal{A}$ must be finite, or there must be structural dependencies between outcomes associated with different actions.⁴
In the dueling bandits setting, the agent's decision-making quality can be quantified using a notion of cumulative regret in terms of the preference probabilities:
$$\mathrm{Reg}^{\mathrm{DB}}(T) = \sum_{i=1}^{T} \left[ \varphi(\mathbf{a}^*, \mathbf{a}_{i1}) + \varphi(\mathbf{a}^*, \mathbf{a}_{i2}) \right], \qquad (2.15)$$
where $\mathbf{a}^*$ is the optimal action. Importantly, this regret definition assumes the existence of a Condorcet winner, that is, a unique action $\mathbf{a}^*$ superior to all other actions (Sui, Zoghi, et al., 2018); formally, $\mathbf{a}^* \in \mathcal{A}$ is a Condorcet winner if $P(\mathbf{a}^* \succ \mathbf{a}) > \frac{1}{2}$ for any $\mathbf{a} \in \mathcal{A}$ with $\mathbf{a} \neq \mathbf{a}^*$. When the learning algorithm has converged to the best action $\mathbf{a}^*$, it can simply duel $\mathbf{a}^*$ against itself, incurring no additional regret. Intuitively, one can interpret Eq. (2.15) as a measure of how much the user(s) would have preferred the best action over the ones presented by the algorithm.
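As a small illustration of Eq. (2.15), the sketch below accumulates regret over a list of queried action pairs, assuming the same tabular encoding of $\varphi$ as above; the function name is illustrative. Note that dueling the Condorcet winner against itself contributes zero regret.

```python
import numpy as np

def dueling_bandit_regret(phi, a_star, queries):
    """Cumulative dueling-bandit regret of Eq. (2.15) over queried pairs (a_i1, a_i2)."""
    return sum(phi[a_star, a1] + phi[a_star, a2] for a1, a2 in queries)

# Three actions with Condorcet winner a* = 0.
phi = np.array([[ 0.0,  0.1,  0.3],
                [-0.1,  0.0,  0.2],
                [-0.3, -0.2,  0.0]])
# Regret: (0 + 0) + (0.1 + 0.3) + (0 + 0.3) = 0.7
print(dueling_bandit_regret(phi, a_star=0, queries=[(0, 0), (1, 2), (0, 2)]))
```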
The Multi-Dueling Bandits Setting
The multi-dueling bandits setting (Brost et al., 2016; Sui, Zhuang, et al., 2017) generalizes the dueling bandits setting such that in each iteration, the algorithm can select more than two actions. More specifically, in the $i$th iteration, the algorithm selects a set $S_i \subseteq \mathcal{A}$ of actions, and observes outcomes of duels between some pairs of the actions in $S_i$. These observed preferences could correspond to all action pairs in $S_i$ or to any subset. The number of actions selected in each iteration is assumed to be a fixed constant, $m := |S_i|$. When $m = 2$, the problem reduces to the original dueling bandits setting. For any $m \geq 2$, the dueling bandits regret formulation in Eq. (2.15) extends to:
$$\mathrm{Reg}^{\mathrm{MDB}}(T) = \sum_{i=1}^{T} \sum_{\mathbf{a} \in S_i} \varphi(\mathbf{a}^*, \mathbf{a}). \qquad (2.16)$$
⁴ Both of these conditions can also be true simultaneously.
The learning algorithm seeks action subsets $S_i$ that minimize the cumulative regret in Eq. (2.16). Intuitively, the algorithm must explore the action space while aiming to minimize the number of times that it selects suboptimal actions. As the algorithm converges to the best action $\mathbf{a}^*$, it increasingly chooses $S_i$ to contain only $\mathbf{a}^*$, thus incurring no additional regret.
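Under the same illustrative tabular encoding of $\varphi$, the multi-dueling regret of Eq. (2.16) can be computed as in the sketch below; the function name is not from the source.

```python
import numpy as np

def multi_dueling_regret(phi, a_star, subsets):
    """Cumulative multi-dueling regret of Eq. (2.16): for each iteration,
    sum the stochastic preferences phi(a*, a) over the selected subset S_i."""
    return sum(sum(phi[a_star, a] for a in S_i) for S_i in subsets)

phi = np.array([[ 0.0,  0.1,  0.3],
                [-0.1,  0.0,  0.2],
                [-0.3, -0.2,  0.0]])   # Condorcet winner: action 0
# Regret: (0 + 0.1 + 0.3) + (0 + 0 + 0) = 0.4
print(multi_dueling_regret(phi, a_star=0, subsets=[[0, 1, 2], [0, 0, 0]]))
```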
The multi-dueling setting could correspond to a number of different feedback collection mechanisms. For example, in some applications it may be viable to collect pairwise feedback between all pairs of actions in $S_i$. In other applications, it is more realistic to observe a subset of these preferences, for example the "winner" of $S_i$, that is, a preference set of the form $\{\mathbf{a} \succ \mathbf{a}' \mid \mathbf{a}' \in S_i \setminus \{\mathbf{a}\}\}$ for some $\mathbf{a} \in S_i$.
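For instance, the "winner of the set" mechanism could be encoded as follows; this is an illustrative sketch of the data structure, not a prescribed interface.

```python
def winner_feedback(S_i, winner):
    """Encode 'winner' feedback for the subset S_i as the preference set
    {winner > a' : a' in S_i, a' != winner}, as (preferred, dispreferred) pairs."""
    return [(winner, a) for a in S_i if a != winner]

# winner_feedback([3, 1, 4], winner=4) -> [(4, 3), (4, 1)]
```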
Algorithms for the Dueling (and Multi-Dueling) Bandits Setting
A number of algorithms have been proposed for the preference-based bandit setting, many of which are equipped with regret bounds. Earlier work on dueling bandits considered an "explore-then-exploit" algorithmic structure; examples include the Interleaved Filter (Yue, Broder, et al., 2012), Beat-the-Mean (Yue and Joachims, 2011), and SAVAGE (Urvoy et al., 2013) algorithms. Confidence-based approaches include RUCB (Zoghi, Whiteson, Munos, et al., 2014), MergeRUCB (Zoghi, Whiteson, and Rijke, 2015), and CCB (Zoghi, Karnin, et al., 2015). The RMED algorithm (Komiyama et al., 2015), meanwhile, defines Bernoulli distributions whose parameters are actions' probabilities of being preferred to one another. The approach calculates Kullback-Leibler divergences between these distributions, and uses them to estimate each action's probability of being the Condorcet winner and to decide whether actions have been sufficiently explored. RMED is optimal in the sense that when a Condorcet winner exists, the constant factor in its regret matches the regret lower bound for the dueling bandits problem.
Ailon, Karnin, and Joachims (2014) propose several strategies for reducing the dueling bandits problem to a standard multi-armed bandit problem with numerical feedback. In particular, this work introduces the Sparring method, in which two multi-armed bandit algorithms are trained in parallel. These two algorithms could be instances of any standard multi-armed bandit algorithm, for instance UCB or Thompson sampling. In every iteration, each of the two algorithms selects an action; let $\mathbf{a}_{i1}$ and $\mathbf{a}_{i2}$ be the actions selected by algorithms 1 and 2 in iteration $i$, respectively. These two actions are dueled against each other. If $\mathbf{a}_{i1}$ wins, then algorithm 1 receives a reward of 1 and algorithm 2 receives a reward of 0; conversely, if $\mathbf{a}_{i2}$ wins, then
Algorithm 3 IndependentSelfSparring (Sui, Zhuang, et al., 2017)
1: Input: $m$ = the number of actions taken in each iteration, $\eta$ = the learning rate
2: For each action $j = 1, 2, \cdots, A$, set $W_j = 0$, $F_j = 0$.
3: for $i = 1, 2, \ldots$ do
4:   for $k = 1, \ldots, m$ do
5:     For each action $j = 1, 2, \cdots, A$, sample $\theta_j \sim \mathrm{Beta}(W_j + 1, F_j + 1)$
6:     Select $\mathbf{a}_k(i) := \operatorname{argmax}_j \theta_j$
7:   end for
8:   Execute $m$ actions: $\{\mathbf{a}_1(i), \ldots, \mathbf{a}_m(i)\}$
9:   Observe pairwise feedback matrix $R \in \{0, 1, \emptyset\}^{m \times m}$
10:  for $j, k = 1, \ldots, m$ do
11:    if $R_{jk} \neq \emptyset$ then
12:      $W_{\mathbf{a}_j(i)} \leftarrow W_{\mathbf{a}_j(i)} + \eta \cdot R_{jk}$,  $F_{\mathbf{a}_j(i)} \leftarrow F_{\mathbf{a}_j(i)} + \eta \cdot (1 - R_{jk})$
13:    end if
14:  end for
15: end for
algorithm 1 receives a reward of 0 and algorithm 2 receives a reward of 1.
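To illustrate the Sparring reduction, the following sketch pairs two Beta-Bernoulli Thompson sampling players, one possible choice of the inner multi-armed bandit algorithm. The class and function names, and the tabular `phi` used to simulate duel outcomes, are assumptions made for illustration rather than the exact construction in Ailon, Karnin, and Joachims (2014).

```python
import numpy as np

class BetaBanditPlayer:
    """A Beta-Bernoulli Thompson sampling player used as one 'sparring' opponent."""

    def __init__(self, n_actions, rng):
        self.wins = np.zeros(n_actions)
        self.losses = np.zeros(n_actions)
        self.rng = rng

    def select(self):
        # Sample one reward estimate per action from its Beta posterior and pick the best.
        theta = self.rng.beta(self.wins + 1, self.losses + 1)
        return int(np.argmax(theta))

    def update(self, action, reward):
        self.wins[action] += reward
        self.losses[action] += 1 - reward


def sparring(phi, n_actions, horizon, seed=0):
    """Sparring reduction: each player picks an action, the two actions are dueled,
    the winner's player receives reward 1 and the loser's player receives reward 0."""
    rng = np.random.default_rng(seed)
    player1 = BetaBanditPlayer(n_actions, rng)
    player2 = BetaBanditPlayer(n_actions, rng)
    for _ in range(horizon):
        a1, a2 = player1.select(), player2.select()
        a1_wins = rng.random() < phi[a1, a2] + 0.5   # P(a1 beats a2) = phi[a1, a2] + 1/2
        player1.update(a1, float(a1_wins))
        player2.update(a2, float(not a1_wins))
    return player1, player2
```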
There are also a number of posterior sampling-based strategies for the dueling bandit problem. The Relative Confidence Sampling algorithm, for instance, on each iteration selects one action via posterior sampling and another via an upper confidence bound strategy (Zoghi, Whiteson, De Rijke, et al., 2014). Meanwhile, the Double Thompson Sampling algorithm (Wu and Liu, 2016) selects both actions in each iteration via posterior sampling; however, each action is selected from a restricted set. The first action is selected among those thought to be relatively well performing, while the second is chosen among actions with less certain priors.
In particular, SelfSparring (Sui, Zhuang, et al., 2017) is a state-of-the-art posterior sampling framework for preference-based learning in the multi-dueling bandits setting, inspired by Sparring (Ailon, Karnin, and Joachims, 2014). SelfSparring maintains a distribution over the probability of selecting each action, and updates it during each iteration based on the observed preferences. Actions are selected by drawing independent samples from this probability distribution, and for each sample, selecting the action with the highest sampled reward. Intuitively, a flatter sampling distribution results in more exploration, while more peaked sampling distributions reflect certainty about the optimal action and yield more exploitation.
Algorithm 3 describes SelfSparring for the case of a finite action set ($|\mathcal{A}| = A$) with independent actions, that is, observing a preference $\mathbf{a} \succ \mathbf{a}'$ does not yield information about the performance of any action $\mathbf{a}'' \notin \{\mathbf{a}, \mathbf{a}'\}$. SelfSparring maintains a Beta posterior for each action, and updates it similarly to the Bernoulli bandit posterior sampling algorithm (Algorithm 1). For each action $\mathbf{a}$, the variables $W_{\mathbf{a}}$ and $F_{\mathbf{a}}$, respectively, record the number of times that the action wins and loses against any other action (possibly scaled by a learning rate, $\eta$). Actions are then selected by drawing a sample $\theta_{\mathbf{a}} \sim \mathrm{Beta}(W_{\mathbf{a}} + 1, F_{\mathbf{a}} + 1)$ for each action $\mathbf{a} \in \mathcal{A}$, and choosing the action for which $\theta_{\mathbf{a}}$ is maximized. Sui, Zhuang, et al. (2017) derive a no-regret guarantee for Algorithm 3 when the actions have a total ordering with respect to the preferences.
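As a complement to the pseudocode, the following Python sketch mirrors Algorithm 3 under two simplifying assumptions: all pairs of distinct selected actions are compared in every iteration, and duel outcomes are simulated from an illustrative tabular $\varphi$ (a skew-symmetric matrix). The function name and the feedback simulation are not from the source.

```python
import numpy as np

def independent_self_sparring(phi, n_actions, m, horizon, eta=1.0, seed=0):
    """Sketch of Algorithm 3: per-action Beta posteriors, Thompson-style selection of
    m actions per iteration, and (scaled) win/loss updates from pairwise feedback."""
    rng = np.random.default_rng(seed)
    W = np.zeros(n_actions)   # scaled win counts
    F = np.zeros(n_actions)   # scaled loss counts
    for _ in range(horizon):
        # Lines 4-7: for each of the m slots, sample from every action's Beta
        # posterior and select the action with the highest sample.
        selected = [int(np.argmax(rng.beta(W + 1, F + 1))) for _ in range(m)]
        # Line 9 (simulated here): generate a consistent outcome for each pair of
        # distinct selected actions, with P(a_j beats a_k) = phi[a_j, a_k] + 1/2.
        for j in range(m):
            for k in range(j + 1, m):
                if selected[j] == selected[k]:
                    continue  # dueling an action against itself yields no information
                r_jk = float(rng.random() < phi[selected[j], selected[k]] + 0.5)
                # Line 12: update both actions' win/loss counts (R_{kj} = 1 - R_{jk}).
                W[selected[j]] += eta * r_jk
                F[selected[j]] += eta * (1.0 - r_jk)
                W[selected[k]] += eta * (1.0 - r_jk)
                F[selected[k]] += eta * r_jk
    return W, F

# Example usage (with phi a skew-symmetric preference matrix as in the earlier sketches):
# W, F = independent_self_sparring(phi, n_actions=3, m=2, horizon=1000)
```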
Additionally, Sui, Zhuang, et al. (2017) propose a kernel-based variant of SelfSparring called KernelSelfSparring, which leverages Gaussian process regression to model dependencies among rewards corresponding to different actions. Other dueling bandit approaches which handle structured action spaces include the Dueling Bandit Gradient Descent algorithm for convex utility functions over continuous action spaces (Yue and Joachims, 2009); the Correlational Dueling Bandits algorithm, in which a large number of actions have a low-dimensional correlative structure based on a given similarity function (Sui, Yue, and Burdick, 2017); and the StageOpt algorithm for safe learning with Gaussian process dueling bandits (Sui, Zhuang, et al., 2018).
The multi-dueling bandits setting is also considered in Brost et al. (2016), which proposes a confidence interval strategy for estimating preference probabilities.
Related Problem Settings
Aside from pairwise preferences, a number of other forms of qualitative reward feedback have also been studied. For instance, in the coactive feedback problem (Shivaswamy and Joachims, 2015; Shivaswamy and Joachims, 2012; Raman et al., 2013), the agent selects an action $\mathbf{a}_i$ in the $i$th iteration, and the environment reveals an action $\mathbf{a}'_i$ which is preferred to $\mathbf{a}_i$. With ordinal feedback, meanwhile, the user assigns each trial to one of a discrete set of ordered categories (Chu and Ghahramani, 2005a; Herbrich, Graepel, and Obermayer, 1999). In the "learning to rank" problem, an agent learns a ranking function for information retrieval using feedback such as users' clickthrough data (Burges, Shaked, et al., 2005; Radlinski and Joachims, 2005; Burges, Ragno, and Le, 2007; Liu, 2009; Schuth, Sietsma, et al., 2014; Schuth, Oosterhuis, et al., 2016; Yue, Finley, et al., 2007).