ODPP: A Unified Algorithm Framework for Unsupervised Option Discovery based on Determinantal Point Process
Jiayu Chen, Vaneet Aggarwal, and Tian Lan
Abstract—Learning rich skills through temporal abstractions without supervision of external rewards is at the frontier of Reinforcement Learning research. Existing works mainly fall into two distinctive categories: variational and Laplacian-based option discovery. The former maximizes the diversity of the discovered options through a mutual information loss but overlooks coverage of the state space, while the latter focuses on improving the coverage of options by increasing connectivity during exploration, but does not consider diversity. In this paper, we propose a unified framework that quantifies diversity and coverage through a novel use of the Determinantal Point Process (DPP) and enables unsupervised option discovery that explicitly optimizes both objectives. Specifically, we define the DPP kernel matrix with the Laplacian spectrum of the state transition graph and use the expected mode number in the trajectories as the objective to capture and enhance both diversity and coverage of the learned options. The proposed option discovery algorithm is extensively evaluated using challenging tasks built with Mujoco and Atari, demonstrating that our proposed algorithm substantially outperforms SOTA baselines from both diversity- and coverage-driven categories. The code is available at https://github.com/LucasCJYSDL/ODPP.
Index Terms—Reinforcement Learning, Option Discovery, Determinantal Point Process.
I. INTRODUCTION
REINFORCEMENT Learning (RL) has achieved impressive performance in a variety of scenarios, such as games [1], [2], [3] and robotic control [4], [5]. However, most of its applications rely on carefully-crafted, task-specific reward signals to drive exploration and learning, limiting its use in real-life scenarios that often have sparse or no rewards. To this end, it can be essential to acquire rich skills from experience in a task-agnostic manner by extracting temporal action-sequence abstractions (denoted as options) to support efficient exploration and learning. Further, the acquired skills are not task-specific and can be applied to solve multiple downstream tasks more effectively, by employing a meta-controller on top of these skills in a hierarchical manner. For instance, in a robotic navigation task, the robot can first learn locomotion skills in the environment; afterwards, an agent only needs to learn a controller to select among these skills, so the downstream task is greatly simplified to a discrete one and the same set of locomotion skills can be directly applied to different goal-achieving tasks in the environment.

J. Chen and V. Aggarwal are with the School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, USA ([email protected], [email protected]). V. Aggarwal is also with the CS Department, KAUST, Thuwal, Saudi Arabia. T. Lan is with the Department of Electrical and Computer Engineering, George Washington University, Washington D.C., 20052, USA ([email protected]).
A commonly-adopted solution is unsupervised option discovery. Existing approaches broadly fall into two categories: (1) Variational Option Discovery, e.g., [6], [7], [8], which aims to improve the diversity of discovered options by maximizing the mutual information [9] between the options and the trajectories they generate. It tends to reinforce already discovered behaviors for diversity rather than exploring to discover new ones (e.g., visiting poorly-connected states), so the learned options may have limited coverage of the state space. (2) Laplacian-based Option Discovery, e.g., [10], [11], which clusters the state space using a Laplacian spectrum embedding of the state transition graph and then learns options to connect different clusters to improve connectivity during exploration.
This approach has been shown to improve the algebraic connectivity of the state space [12] and reduce the expected covering time during exploration. However, the resulting options lack diversity due to the focus on improving connectivity (rather than finding diverse trajectories) and are thus difficult to adapt to diverse downstream tasks. We note that diversity and coverage may not go hand-in-hand in option discovery, as visualized in Figures 1(a) and 1(b). Attempts such as [13], [14] have been made to address this gap, but they rely on expert datasets that contain diverse trajectories covering the whole state space, and they lack an analytical framework for diversity and coverage.
Making novel use of the Determinantal Point Process (DPP), this paper proposes a unified framework for quantifying the diversity and coverage of options, which then enables an explicit optimization of both objectives for unsupervised option discovery.
A DPP on a set of items W provides a probability measure over all subsets of W. The expected cardinality, defined as the expected size of a random subset drawn from W according to the DPP, can be used as a diversity measure, since subsets with more diverse items are more likely to be sampled under a DPP and the expected cardinality measures the number of distinct terms in W. First, to improve the diversity among options, we consider a DPP on the set of trajectories subject to different options. By maximizing its expected cardinality, an agent is encouraged to explore diverse trajectories under different options. Second, we leverage Spectral Clustering [15] to generate a state embedding using the Laplacian spectrum of the state transition graph (i.e., b_i in Section III-C) and formulate another DPP on the set of states visited by a trajectory. By maximizing the expected cardinality of this DPP, a trajectory is encouraged to visit distant states from different communities, leading to maximum coverage of the state space.
Fig. 1. Illustrative example: (a) Trajectories from variational methods with good diversity but poor coverage. (b) Trajectories from Laplacian-based methods aiming to improve coverage by visiting poorly-connected corner states, but with poor diversity. (c) Trajectories from our proposed method with DPP to maximize both diversity and coverage. (d) The coverage and diversity measures (defined in Eq. (7) and Eq. (9)) of the options in (a)-(c).
As shown in Figure 1(d), these two objectives provide effective quantitative measures for diversity and coverage, based on which we can learn diverse options with superior coverage of the state space, as shown in Figure 1(c). Last, to establish the mapping between options and their policies so that we can learn multiple options at a time, we need to maximize the mutual information between the option choice and the trajectories subject to it, as in variational option discovery. Instead of using the whole trajectory [16] or only the goal state [6], we extract the landmark states in a trajectory through maximum a posteriori (MAP) inference of a DPP, which keeps enough information while getting rid of the noise. To sum up, our proposed framework incorporates the advantages of variational and Laplacian-based methods through a unified modeling tool, the DPP.
Our key contributions are as follows. (1) To the best of our knowledge, this is the first work to adopt DPP for option discovery. (2) We propose a novel option discovery algorithm that enables explicit maximization of diversity and coverage, capturing the advantages of both variational and Laplacian-based methods. (3) We evaluate on a series of challenging tasks built with Mujoco and Atari and show the superiority of our algorithm compared with SOTA baselines.
II. BACKGROUND AND RELATED WORKS
A. Unsupervised Option Discovery
As proposed in [17], the option framework consists of three components: an intra-option policy π : S × A → [0, 1], a termination condition β : S → {0, 1}, and an initiation set I ⊆ S. An option ⟨I, π, β⟩ is available in state s if and only if s ∈ I. If the option is taken, actions are selected according to π until the option terminates according to β (i.e., β = 1). In this paper, we focus on option discovery without the supervision of extrinsic reward signals, which are usually hard to acquire in real-life scenarios, i.e., Unsupervised Option Discovery.
A main research direction in this category is Variational Option Discovery. Its main goal is to learn a latent-conditioned option policy π(a|s, c), where c denotes the option latent, through maximizing the mutual information [9] between the option choice C and the trajectory obtained by following the policy conditioned on C. As shown in [18], this objective is equivalent to a Variational Autoencoder (VAE) [19]. According to the structure of the VAE they use, Variational Option Discovery methods can be divided into two groups: (1) Trajectory-first methods, such as EDL [13] and OPAL [14], with the structure $\tau \xrightarrow{E} c \xrightarrow{D} \tau$, where τ is the trajectory of the agent, and D and E denote the decoder and encoder of the VAE, respectively. In this case, they train the policy network π(a|s, c) as the decoder D. The quality of the options learned by these methods relies on the set of expert trajectories used as input to the encoder. For example, OPAL is designed for Offline RL [20] and assumes access to a trajectory dataset generated by a mixture of diverse policies starting at diverse initial states. EDL relies on highly-efficient exploration strategies, like State Marginal Matching [16], to simulate perfect exploration and obtain trajectories τ with good coverage of the state space for option learning.
(2) Option-first methods, such as VIC [6], DIAYN [7], VALOR [18] and DADS [8], with the structure $c \xrightarrow{E} \tau \xrightarrow{D} c$. In this case, they learn π(a|s, c) as the encoder E, and instead of learning from an "exploration expert" like EDL and OPAL, these methods start from scratch (a random policy) to discover a set of diverse options. However, due to the lack of exploration, these methods fail to expand the set of states visited by the random policy. Thus, the discovered options provide poor coverage of the state space [13]. For example, in Figure 1(a), the trajectories of different options are quite distinguishable, but none of them enters the corridor for exploration.
In this paper, we follow the option-first methods and consider a more challenging setting where the agent needs to learn to identify diverse options while fully exploring the environment, starting from a random policy (i.e., learning and exploring by itself) rather than relying on an expert trajectory set like trajectory-first methods (i.e., learning from the expert).
We realize this by integrating the variational method with another important algorithm branch in Unsupervised Option Discovery, Laplacian-based Option Discovery, such as [10], [11]. It is based on the Laplacian spectrum of the state-transition graph, which can be estimated from the state transitions in the replay buffer. The state transition graph and its Laplacian matrix are formally defined in Appendix A-A. In particular, [11] first estimates the Fiedler vector, i.e., the eigenvector corresponding to the second smallest eigenvalue of the Laplacian of the transition graph, and then trains options that connect states with high and low values in the Fiedler vector. States within these two areas are loosely connected by "bottleneck" states. By connecting them with options, the algebraic connectivity of the state space can be enhanced and thus exploration within it can be accelerated [12]. However, the options discovered through these methods lack diversity and cannot be used for multiple downstream tasks. For example, in Figure 1(b), each option can cover a long distance through the right corridor, but all of them follow the same direction given by the Fiedler vector. Hence, we propose to integrate the advantages of variational and Laplacian-based methods through a novel use of the Determinantal Point Process (DPP) (introduced in the next section), to obtain options with both superior diversity and coverage. As shown in Figure 1(c), multiple diverse options are discovered, most of which go through the "bottleneck" state for better coverage.
B. Determinantal Point Process
According to [21], given a set of items $\mathcal{W} = \{w_1, \cdots, w_N\}$, a point process $\mathcal{P}$ on $\mathcal{W}$ is a probability measure on the set of all subsets of $\mathcal{W}$. $\mathcal{P}$ is called a Determinantal Point Process (DPP) if a random subset $\mathbf{W}$ drawn according to $\mathcal{P}$ has probability
$$P_{L(\mathcal{W})}(\mathbf{W} = W) = \frac{\det(L_W)}{\sum_{W' \subseteq \mathcal{W}} \det(L_{W'})} = \frac{\det(L_W)}{\det(L + I)} \qquad (1)$$
where $I \in \mathbb{R}^{N \times N}$ is the identity matrix, $L = L(\mathcal{W}) \in \mathbb{R}^{N \times N}$ is the DPP kernel matrix, which must be symmetric and positive semidefinite, and $L_W \in \mathbb{R}^{|W| \times |W|}$ is the sub-matrix of $L$ indexed by the elements in $W$. In particular, $P_L(\mathbf{W} = \{w_i\}) \propto L_{ii}$ and $P_L(\mathbf{W} = \{w_i, w_j\}) \propto L_{ii}L_{jj} - L_{ij}^2$, where $L_{ij}$ is a measure of the similarity between items $i$ and $j$. Hence, a DPP promotes diversity within a set, because the inclusion of one item makes the inclusion of a similar item less likely by decreasing $P_L(\mathbf{W} = \{w_i, w_j\})$.
The DPP kernel matrix $L$ can be constructed as a Gram matrix [22]:
$$L = \tilde{B}^T \tilde{B} = \mathrm{Diag}(\vec{q}) \cdot B^T B \cdot \mathrm{Diag}(\vec{q}) \qquad (2)$$
where $\vec{q} = [q_1, \cdots, q_N] \in \mathbb{R}^N$ with $q_i \geq 0$ denotes the quality measure, and $B = [\vec{b}_1 \cdots \vec{b}_N] \in \mathbb{R}^{D \times N}$ is the stacked feature matrix, where $\vec{b}_i \in \mathbb{R}^D$ is the feature vector corresponding to $w_i$ and satisfies $\|\vec{b}_i\|_2 = 1$. The inner product of feature vectors, e.g., $\vec{b}_i^T \vec{b}_j$, is used as the similarity measure between items in $\mathcal{W}$. From Eq. (1)-(2), we can see that $P_{L(\mathcal{W})}(\mathbf{W} = W)$ is proportional to the squared $|W|$-dimensional volume of the parallelepiped spanned by the columns of $\tilde{B}$ corresponding to the elements in $W$. Hence, diverse sets are more probable because their feature vectors are more orthogonal and span larger volumes.
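For concreteness, the Gram-matrix construction of Eq. (2) and the subset probabilities of Eq. (1) can be computed in a few lines of NumPy. The following is a minimal sketch on a toy item set; the function and variable names are illustrative and not part of the released ODPP code.

```python
import numpy as np

def dpp_kernel(features, quality=None):
    """Build the DPP kernel L = Diag(q) B^T B Diag(q) from Eq. (2).

    features: (D, N) matrix whose columns are unit-norm feature vectors b_i.
    quality:  (N,) nonnegative quality scores q_i (defaults to all ones).
    """
    D, N = features.shape
    q = np.ones(N) if quality is None else np.asarray(quality, dtype=float)
    B_tilde = features * q          # scale column i by its quality q_i
    return B_tilde.T @ B_tilde

def subset_prob(L, subset):
    """P_L(W = subset) = det(L_subset) / det(L + I), as in Eq. (1)."""
    idx = np.asarray(subset)
    L_sub = L[np.ix_(idx, idx)]
    return np.linalg.det(L_sub) / np.linalg.det(L + np.eye(L.shape[0]))

# Toy example: two nearly parallel items and one orthogonal item.
B = np.array([[1.0, 0.99, 0.0],
              [0.0, np.sqrt(1 - 0.99**2), 1.0]])
L = dpp_kernel(B)
print(subset_prob(L, [0, 1]))  # similar pair  -> small probability
print(subset_prob(L, [0, 2]))  # diverse pair  -> larger probability
```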
The expected cardinality of samples from a DPP is a good measure of the diversity of $\mathcal{W}$ and reflects the number of modes in $\mathcal{W}$ [23], so it can be used as an objective function for diverse generation tasks. We explain in Appendix B-A why we choose the expected cardinality rather than the likelihood in Eq. (1) as the diversity measure. According to [21], the expected cardinality of a set sampled from a DPP can be computed with Eq. (3), where $\lambda_i^{\mathcal{W}}$ is the $i$-th eigenvalue of $L(\mathcal{W})$:
$$\mathbb{E}_{\mathbf{W} \sim P_{L(\mathcal{W})}}[|\mathbf{W}|] = \sum_{i=1}^{N} \frac{\lambda_i^{\mathcal{W}}}{\lambda_i^{\mathcal{W}} + 1} \qquad (3)$$
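A minimal sketch of Eq. (3) is given below, assuming the kernel has already been built as in the previous example; the contrast between orthogonal and redundant feature sets illustrates why the expected cardinality acts as a diversity measure.

```python
import numpy as np

def expected_cardinality(L):
    """Expected size of a DPP sample, Eq. (3): sum_i lambda_i / (lambda_i + 1)."""
    # L is symmetric PSD, so use eigvalsh and clip tiny negative round-off.
    lam = np.clip(np.linalg.eigvalsh(L), 0.0, None)
    return float(np.sum(lam / (lam + 1.0)))

# Diverse items (orthogonal features) vs. redundant items (identical features):
B_diverse = np.eye(3)                        # three orthogonal unit vectors
B_redundant = np.ones((3, 3)) / np.sqrt(3)   # three identical unit vectors
print(expected_cardinality(B_diverse.T @ B_diverse))      # 1.5
print(expected_cardinality(B_redundant.T @ B_redundant))  # 0.75
```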
III. PROPOSED APPROACH
As mentioned in Section II-A, to define an option, we need to relate the option to its initial state s_0, specify its termination condition, and learn the intra-option policy. In this work, we focus on discovering multiple diverse options, each of which is encoded by a one-hot vector c. To this end, we need to learn a prior network P_ω(c|s_0) to decide on the option choice at the initial state, and the intra-option policy network π_θ(a|s, c) to interact with the environment for a fixed number of time steps (i.e., the termination condition). In this way, we obtain a trajectory τ = (s_0, a_0, ..., s_T) for each corresponding option.
A. Overview
As shown in Figure 1, on the one hand, the options discovered with variational methods (i.e., (a)) are diverse but lack exploration of the whole state space and thus have poor coverage; on the other hand, each option discovered with the Laplacian-based method (i.e., (b)) covers a long range, but the options are similar to each other and lack diversity. In this section, we propose to unify the advantages of these two algorithm branches with a novel use of DPP.
In Section III-B, we propose an information-theoretic objective (i.e., Eq. (5)) and its variational lower bound (i.e., Eq. (6)) for optimization. Different from previous variational methods, we choose to maximize the mutual information between the landmark states in a trajectory and its corresponding option choice. In this way, the discovered options try to cover different landmark states, rather than arbitrary states in the state space, in order to be diverse, and the learning is more efficient since the redundant information in the trajectory is filtered out. A DPP is adopted to extract the landmark states in this part.

In Section III-C, to further improve the exploration and coverage of the discovered options, we introduce another DPP-based objective term L^{DPP}_1 in Eq. (7). Specifically, we use the states in a trajectory as the sample space and the expected number of modes, from the view of a DPP, as the objective. By maximizing L^{DPP}_1, each option is encouraged to cover multiple modes (i.e., landmark states) in its trajectory. Moreover, we claim that L^{DPP}_1 generalizes the Laplacian-based option discovery objectives by using the Laplacian spectrum as the feature embedding in the DPP. On the other hand, we note that the mutual information objective only implicitly models the diversity among options, as the difficulty of distinguishing them with the variational decoder. Thus, we further propose two DPP-based objectives, L^{DPP}_2 and L^{DPP}_3 in Eq. (8)-(9), as explicit diversity measures of the options. In particular, we use the trajectories sampled with the current prior and policy networks as the sample space. Then, from the view of a trajectory-level DPP, we maximize the expected number of modes among trajectories corresponding to different options and minimize it among trajectories of the same option. In this way, the discovered options are diverse across different options and consistent for a given option choice. We show through an ablation study that the explicit diversity enhancement with L^{DPP}_2 and L^{DPP}_3 further improves the learned options.

At last, in Section III-D, we present the overall algorithm framework, including the pseudo code and unbiased gradient estimators (Eq. (11)-(14)) for updating the networks in our training system using the overall objective (Eq. (10)). The options discovered with our unified algorithm achieve both superior diversity and coverage, as shown in Figure 1(c).
B. Landmark-based MI Maximization
Previous works have tried to improve the diversity of the learned options by maximizing the mutual information (MI) between the option choice c and the trajectory τ [18] or the goal state s_T [24] generated by the corresponding intra-option policy. However, the whole trajectory contains noisy information, and the goal state alone cannot represent the option policy sufficiently. Therefore, we propose to maximize the mutual information between c and the landmark states G in τ instead. Landmark states are representative states selected from the state space with diverse features. More precisely, after clustering all states in the state space with respect to their features, a set of landmark states that are diverse in their feature embeddings is identified as representatives of the different clusters. The landmark states G can be extracted from the trajectory τ as the result of maximum a posteriori (MAP) inference of a DPP, shown as Eq. (4), where $\mathcal{X}$ denotes the set of states in τ. The intuition is that these landmarks constitute a diverse subset of $\mathcal{X}$ and thus should be the most probable under this DPP.
$$G = \arg\max_{X \subseteq \mathcal{X}} P_{L(\mathcal{X})}(\mathbf{X} = X) = \arg\max_{X \subseteq \mathcal{X}} \frac{\det(L_X)}{\det(L + I)} \qquad (4)$$
In order to maximize the mutual information between G and c while filtering out the information in τ that is redundant for option discovery, we define the objective function (Eq. (5)) based on the Information Bottleneck framework [25], where μ(·) is the initial state distribution, I(·) denotes the mutual information, and β ≥ 0 is a Lagrange multiplier:
$$\max \; \mathbb{E}_{s_0 \sim \mu(\cdot)}\left[ I(c, G|s_0) - \beta I(c, \tau|s_0) \right] \qquad (5)$$
It is infeasible to directly compute and optimize Eq. (5), so we instead maximize its variational lower bound, shown as Eq. (6), where $s_0 \sim \mu(\cdot)$, $c \sim P_\omega(\cdot|s_0)$, $\tau \sim P_\theta(\cdot|s_0, c)$. The detailed derivation is available in Appendix A-B.
$$\max_{\omega, \theta, \phi} \; L^{IB} = H(C|S) + \mathbb{E}_{s_0, c, \tau}\left[ P_{DPP}(G|\tau) \log P_\phi(c|s_0, G) \right] - \beta \, \mathbb{E}_{s_0, c}\left[ D_{KL}\big(P_\theta(\tau|s_0, c) \,\|\, Unif(\tau|s_0)\big) \right] \qquad (6)$$
In brief, the first two terms of Eq. (6) constitute a lower bound of the first term in Eq. (5). By definition, the mutual information includes an entropy term and a posterior probability term. H(C|S) in Eq. (6) denotes the entropy term related to the distribution of the option choice at a given state, which can be estimated based on the output of the prior network P_ω. P_φ(c|s_0, G) is a variational estimate of the posterior term P(c|s_0, G) (which cannot be explicitly computed) in I(c, G|s_0), and it is modeled and trained as a neural network. Further, the use of P_{DPP}(G|τ), i.e., the probability of extracting G from τ under a DPP, allows us to estimate this lower bound by sampling at the trajectory level; it can be computed as $\frac{\det(L_G)}{\det(L+I)}$ according to Eq. (4).

On the other hand, the third term in Eq. (6) corresponds to a lower bound of the second term in Eq. (5). Instead of directly calculating $-\beta I(c, \tau|s_0)$, which is intractable, we introduce $Unif(\tau|s_0)$, i.e., the probability of the trajectory τ given s_0 under the random walk policy, to convert it into a regularization term as in Eq. (6). To be specific, by minimizing $D_{KL}(P_\theta(\tau|s_0, c) \| Unif(\tau|s_0))$, i.e., the KL divergence [9] between the distribution of the trajectories under our policy and under a random walk policy, the exploration of the trained policy π_θ can be enhanced. Note that $P_\theta(\tau|s_0, c) = \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t, c) P(s_{t+1}|s_t, a_t)$, where $P(s_{t+1}|s_t, a_t)$ is the transition function of the MDP.
Another challenging part of calculating L^{IB} (Eq. (6)) is to infer the landmarks G with Eq. (4). In this equation, L is the kernel matrix of the DPP, which can be constructed with Eq. (2) and is further discussed in the next section. The determinant terms in Eq. (4) can be computed in real time, since the horizon of the trajectory (i.e., the size of L) is a customized parameter and not larger than 1000. Further, the overall MAP inference problem of a DPP, shown as Eq. (4), is NP-hard [26], and greedy algorithms have been shown to be a promising solution. In this paper, we adopt the fast greedy method proposed in [27], which is further introduced in Appendix B with its pseudo code and complexity analysis.
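For illustration, the sketch below implements a naive greedy MAP procedure that repeatedly adds the state whose inclusion most increases det(L_G); the paper instead uses the much faster Cholesky-based greedy of [27], so this version should be read as a conceptual stand-in rather than the actual implementation. Applied to the state kernel of one trajectory, the returned indices would play the role of the landmark set G.

```python
import numpy as np

def greedy_map(L, max_size=None, eps=1e-12):
    """Naive greedy MAP inference for a DPP with kernel L: repeatedly add the
    item that most increases log det(L_G), stopping once every remaining item
    would decrease the determinant. Returns the selected indices (landmarks)."""
    N = L.shape[0]
    max_size = N if max_size is None else max_size
    selected, remaining = [], list(range(N))
    current_logdet = 0.0  # det of the empty selection is 1
    while remaining and len(selected) < max_size:
        best_gain, best_item = -np.inf, None
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = (logdet - current_logdet) if sign > 0 else -np.inf
            if gain > best_gain:
                best_gain, best_item = gain, i
        if best_item is None or best_gain < -eps:
            break  # no remaining item increases (or preserves) det(L_G)
        selected.append(best_item)
        remaining.remove(best_item)
        current_logdet += best_gain
    return selected
```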
C. Quantifying Diversity and Coverage via DPP
In this section, we propose three additional optimization terms defined with DPPs to explicitly model the coverage and diversity of the learned options. By integrating these terms with Eq. (6), the learned options are expected to have better coverage of the state space and to be more distinguishable, so the mutual information between the landmark states and their corresponding options can be further improved to achieve a better local optimum. In the three objectives, i.e., Eq. (7)-(9), we use the expected cardinality of three different DPPs as the diversity measure, i.e., Eq. (3), where the eigenvalues can be calculated in real time since the size of each kernel matrix is limited (see Appendix B-D).
The first term maximizes the coverage of the trajectory τ of each option c in the state space (with $s_0 \sim \mu(\cdot)$):
$$\max_{\omega, \theta} \; L^{DPP}_1 = \mathbb{E}_{s_0}\left[ \sum_{c, \tau} P_\omega(c|s_0) P_\theta(\tau|s_0, c) f(\tau) \right], \quad f(\tau) = \mathbb{E}_{\mathbf{X} \sim P_{L(\mathcal{X})}}[|\mathbf{X}|] = \sum_{i=1}^{T+1} \frac{\lambda_i^{\mathcal{X}}}{\lambda_i^{\mathcal{X}} + 1} \qquad (7)$$
where $\mathcal{X}$ is the set of states in τ, $L(\mathcal{X})$ is the kernel matrix built with the feature vectors corresponding to the states in $\mathcal{X}$, and f(τ) is the expected number of modes (landmark states) covered by a trajectory τ, which can be viewed as a measure of the coverage of τ. Thus, we can enhance the coverage and exploration of the learned options by maximizing L^{DPP}_1. As for the kernel matrix, according to Eq. (2), we can construct $L(\mathcal{X})$ by defining the quality measure $\vec{q}$ and a normalized feature vector for each state in $\mathcal{X}$. Since we do not have prior knowledge or reward signals for the states, we assign each state an equal quality measure of 1. As for the normalized feature vectors, we define them using the Laplacian spectrum, i.e., the eigenvectors corresponding to the D smallest eigenvalues of the Laplacian matrix of the state transition graph, denoted as $[\vec{v}_1, \cdots, \vec{v}_D]$. To be specific, for each state $s_i$, its normalized feature is defined as $\vec{b}_i = [\vec{v}_1(s_i), \cdots, \vec{v}_D(s_i)] / \sqrt{\sum_{j=1}^{D} (\vec{v}_j(s_i))^2}$. The reasons for this feature design are as follows:
(1) The same feature design is used in Spectral Clustering [15], where states with higher similarity in this feature embedding fall into the same cluster. By sampling under a DPP (i.e., $\mathbf{X} \sim P_{L(\mathcal{X})}$) with this feature design, we can collect states with diverse feature embeddings belonging to different clusters, i.e., landmark states. Then, by maximizing L^{DPP}_1 (i.e., the expected number of landmarks covered by $\mathcal{X}$), the agent is encouraged to explore better by traversing multiple clusters of the state space.

(2) Through this feature design, we generalize the Laplacian-based option discovery method of [11]. Specifically, they set a threshold to partition the state space into two parts: the set of states whose values in the Fiedler vector (i.e., $\vec{v}_2$) are higher than the threshold is used as the initiation set of an option, the other states are used as the termination set, and the option policy is trained to connect states within these two areas. We claim that, as a special case with D = 2, we can obtain similar options by maximizing L^{DPP}_1. Note that the eigenvector corresponding to the smallest eigenvalue of a Laplacian matrix is $\vec{v}_1 = \vec{1}$, so states with diverse feature embeddings encoded by $[\vec{v}_1, \vec{v}_2]$ (i.e., D = 2) tend to differ in $\vec{v}_2$. By maximizing L^{DPP}_1, the options are trained to visit states within a trajectory that differ in $\vec{v}_2$ as much as possible, which is similar to the options learned in [11] as mentioned above. We verify this empirically in Section IV. Note that the Laplacian spectrum $[\vec{v}_1, \cdots, \vec{v}_D]$ for an infinite-scale state space can be learned as a neural network through SOTA representation learning techniques [28], as introduced in Appendix B-C. Thus, our algorithm can be applied to tasks with continuous state and action spaces, for which we additionally provide a complexity analysis in Appendix B-D, covering the landmark extraction by Eq. (4), the computation of the spectral features, and the three objectives defined with DPPs.
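As a concrete (tabular) illustration of this feature design, the sketch below eigendecomposes a small graph Laplacian directly and evaluates the coverage term f(τ) of Eq. (7). For continuous state spaces the paper learns these eigenvectors with the representation technique of [28], so the direct eigendecomposition and the chain-graph example here are only for illustration.

```python
import numpy as np

def spectral_features(adjacency, D):
    """Unit-norm state features built from the D smallest Laplacian eigenvectors."""
    A = np.asarray(adjacency, dtype=float)
    Lap = np.diag(A.sum(axis=1)) - A          # (unnormalized) graph Laplacian
    _, eigvecs = np.linalg.eigh(Lap)          # eigenvectors, ascending eigenvalues
    V = eigvecs[:, :D]                        # row i holds [v_1(s_i), ..., v_D(s_i)]
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def coverage(traj_states, features):
    """f(tau) in Eq. (7): expected cardinality of the DPP over visited states."""
    B = features[traj_states].T               # (D, T+1) stacked feature matrix
    L = B.T @ B                               # quality q_i = 1 for every state
    lam = np.clip(np.linalg.eigvalsh(L), 0.0, None)
    return float(np.sum(lam / (lam + 1.0)))

# Toy 6-state chain graph: a trajectory spanning both ends scores higher.
A = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
phi = spectral_features(A, D=2)
print(coverage([0, 1, 2, 3, 4, 5], phi))  # crosses the whole chain
print(coverage([0, 1, 0, 1, 0, 1], phi))  # stays in one corner -> lower
```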
Second, we expect the trajectories generated for the same option c from the same initial state s_0 to be consistent and thus hard to distinguish by a DPP, which is important given the stochasticity of the policy output. This is realized by the following objective (with $s_0 \sim \mu(\cdot)$, $c \sim P_\omega(\cdot|s_0)$):
$$\min_{\omega, \theta} \; L^{DPP}_2 = \mathbb{E}_{s_0, c}\left[ \sum_{\vec{\tau}_{(s_0, c)}} P_\theta(\vec{\tau}|s_0, c) \, g(\vec{\tau}) \right], \quad g(\vec{\tau}) = \mathbb{E}_{\mathbf{Y} \sim P_{L(\mathcal{Y})}}[|\mathbf{Y}|] = \sum_{i=1}^{M} \frac{\lambda_i^{\mathcal{Y}}}{\lambda_i^{\mathcal{Y}} + 1} \qquad (8)$$
where $\mathcal{Y}$ is the set of M trajectories related to c starting at $s_0$ (i.e., $\{\tau_1, \cdots, \tau_M\} = \vec{\tau}_{(s_0, c)} = \vec{\tau}$), and $L(\mathcal{Y})$ is the kernel matrix built with the feature vector $\vec{b}(\tau_i)$ of each trajectory. The feature vector of a trajectory is obtained from those of the landmark states $G_i$ in the trajectory through the Structured DPP framework [29]: $\vec{b}(\tau_i) = \sum_{s \in G_i} \vec{b}(s)$. For the same reason given in Section III-B, we only keep the landmark states rather than the whole trajectory. Another natural choice for the feature vector of a trajectory is to use the hidden-layer output of the decoder $P_\phi(c|G, s_0)$, which embeds the information contained in G and is commonly adopted in DPP-related works [22], [30], [31]. We compare these two choices in Section IV.
Last, the set of trajectories corresponding to different options should be diverse, which is realized by Eq. (9), where $s_0 \sim \mu(\cdot)$, $c \sim P_\omega(\cdot|s_0)$, $\vec{\tau}_{(s_0, c)} \sim P_\theta(\cdot|s_0, c)$, $\mathcal{Z}$ is the union of the sets $\mathcal{Y}$ related to different options, i.e., $\cup_{c'} \vec{\tau}_{(s_0, c')}$, and $L(\mathcal{Z})$ is the kernel matrix built with the feature vectors of the trajectories in $\mathcal{Z}$:
$$\max_{\omega, \theta} \; L^{DPP}_3 = \mathbb{E}_{s_0, c, \vec{\tau}_{(s_0, c)}}\left[ h\big(\cup_{c'} \vec{\tau}_{(s_0, c')}\big) \right], \quad h\big(\cup_{c'} \vec{\tau}_{(s_0, c')}\big) = \mathbb{E}_{\mathbf{Z} \sim P_{L(\mathcal{Z})}}[|\mathbf{Z}|] = \sum_{i=1}^{K} \frac{\lambda_i^{\mathcal{Z}}}{\lambda_i^{\mathcal{Z}} + 1} \qquad (9)$$
To sum up, the overall objective is given by Eq. (10), where $\alpha_i \geq 0$ are the weights of the DPP objective terms and are fine-tuned as hyperparameters (listed in Appendix B):
$$\max_{\omega, \theta, \phi} \; L = L^{IB} + \alpha_1 L^{DPP}_1 - \alpha_2 L^{DPP}_2 + \alpha_3 L^{DPP}_3 \qquad (10)$$
L^{DPP}_1 measures the coverage of each trajectory by counting the landmark states that the trajectory goes through; by maximizing it, each trajectory is expected to cover a larger area of the state space. L^{DPP}_2 measures the diversity among trajectories of the same option; by minimizing it, the consistency among trajectories of the same option is enhanced. L^{DPP}_3 measures the diversity among trajectories corresponding to different options, which should be maximized to improve the diversity among options. By maximizing L^{IB}, we enhance the mutual information between an option and its trajectories, so that we can learn multiple options at a time and each option generates trajectories specific to it. These terms are proposed for different aims and do not influence each other or increase the training difficulty; by integrating them, we can learn multiple diverse options at a time, each of which has good coverage of the state space.

Regarding the hyperparameters, $\alpha_{1:3}$ in Eq. (10) and β in Eq. (6) are the most important ones, as they determine the importance of each objective term in our learning system.
We list our hyperparameter settings in Appendix B-E and note that all our evaluations on very different tasks (including the Mujoco Room/Corridor/Locomotion tasks and Atari games) are conducted with the same set of hyperparameters. Thus, our algorithm is not sensitive to the hyperparameters and is robust across a wide range of challenging RL tasks. Moreover, as mentioned above, these objective terms do not influence each other, so we can determine their weights separately. Specifically, the hyperparameters are chosen in a simple sequential, greedy manner, following the process of our ablation study (Section IV-A), which is further described in the last paragraph of Appendix B-E.
Algorithm 1 Option Discovery based on DPP (ODPP)
1: Initialize the prior network P_ω, policy network π_θ, and decoder P_φ
2: for each training episode do
3:     τ⃗ ← {}
4:     for i = 1, 2, ..., N do
5:         Sample an initial state s_0^i ∼ μ(·) and an option c ∼ P_ω(·|s_0^i)
6:         Collect a trajectory τ_i = (s_0^i, a_0^i, ..., s_{T-1}^i, a_{T-1}^i, s_T^i), where a_t^i ∼ π_θ(·|s_t^i, c)
7:         τ⃗ ← τ⃗ ∪ {τ_i}
8:     end for
9:     Update P_ω, π_θ with PPO and P_φ with SGD, based on τ⃗ and Eq. (11)-(14)
10: end for
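A high-level Python rendering of the collection loop (lines 3-8 of Algorithm 1) might look as follows, assuming a Gymnasium-style environment and hypothetical `prior.sample_option` / `policy.sample_action` wrappers around P_ω and π_θ; the update of line 9 (PPO for P_ω, π_θ and SGD for P_φ) is left to the gradient estimators of Section III-D.

```python
def collect_batch(env, prior, policy, num_traj, horizon):
    """Lines 3-8 of Algorithm 1: sample an option per trajectory, then roll out
    the option policy for a fixed horizon (the termination condition)."""
    batch = []
    for _ in range(num_traj):
        s, _ = env.reset()
        c = prior.sample_option(s)              # c ~ P_omega(.|s_0)
        states, actions = [s], []
        for _ in range(horizon):
            a = policy.sample_action(s, c)      # a ~ pi_theta(.|s, c)
            s, _, terminated, truncated, _ = env.step(a)
            actions.append(a)
            states.append(s)
            if terminated or truncated:
                break
        batch.append({"option": c, "states": states, "actions": actions})
    return batch
```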
D. Algorithm Framework
Based on Eq. (10), we can calculate the gradients with respect to ω, θ, φ, i.e., the parameters of the prior network, policy network, and variational decoder, and obtain the corresponding update rules. First, the gradient with respect to φ is given by Eq. (11) (with $s_0 \sim \mu(\cdot)$, $c \sim P_\omega(\cdot|s_0)$, $\tau \sim P_\theta(\cdot|s_0, c)$), which can be optimized as a standard likelihood maximization problem with SGD [32]:
$$\nabla_\phi L = \mathbb{E}_{s_0, c, \tau}\left[ P_{DPP}(G|\tau) \nabla_\phi \log P_\phi(c|s_0, G) \right] \qquad (11)$$
Next, for ω and θ, we obtain the unbiased gradient estimators shown in Eq. (12)-(14), whose derivation is in Appendix A-C. $A_{P_\omega}$ and $A^m_{\pi_\theta}$ are the advantage functions of the prior and policy networks, respectively, based on which we can optimize these networks with off-the-shelf RL algorithms. We use PPO [33] as the base RL algorithm.
$$\nabla_\omega L = \mathbb{E}_{s_0, c}\left[ \nabla_\omega \log P_\omega(c|s_0) \, A_{P_\omega} \right], \quad \nabla_\theta L = \mathbb{E}_{s_0, c, \vec{\tau}}\left[ \sum_{m=1}^{M} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^m|s_t^m, c) \, A^m_{\pi_\theta} \right] \qquad (12)$$
$$A_{P_\omega}(c, s_0) = -\log P_\omega(c|s_0) + \mathbb{E}_{\vec{\tau}}\left[ \sum_{m=1}^{M} A^m_{\pi_\theta}(\vec{\tau}, s_0, c) \right] \qquad (13)$$
$$A^m_{\pi_\theta} = \frac{P_{DPP}(G^m|\tau^m) \log P_\phi(c|s_0, G^m)}{M} - \frac{\beta}{M} \sum_{t=0}^{T-1} \log \pi_\theta(a_t^m|s_t^m, c) + \frac{\alpha_1}{M} f(\tau^m) - \alpha_2 \, g(\vec{\tau}_{(s_0, c)}) + \alpha_3 \, h\big(\cup_{c'} \vec{\tau}_{(s_0, c')}\big) \qquad (14)$$
Our algorithm (ODPP) is summarized in Algorithm 1: the agent interacts with the environment to collect a batch of trajectories (i.e., $\vec{\tau}$) and then applies the gradient estimators above for the update. For more implementation details, please refer to Appendix B.
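To make Eq. (14) concrete, a per-trajectory advantage could be assembled as in the sketch below from the quantities defined earlier; the inputs (decoder log-probability, the two log-determinants giving P_DPP(G^m|τ^m), the per-step action log-probabilities, and the batch-level DPP terms f, g, h) are assumed to be precomputed, and the function name is illustrative rather than part of the released code.

```python
import numpy as np

def option_advantage(logp_decoder, logdet_G, logdet_LI, logp_actions,
                     f_m, g_same, h_union, alphas, beta, M):
    """A^m_{pi_theta} of Eq. (14) for one trajectory tau^m in a batch of size M.

    logp_decoder : log P_phi(c | s_0, G^m)
    logdet_G     : log det(L_{G^m})
    logdet_LI    : log det(L + I), so P_DPP(G^m|tau^m) = exp(logdet_G - logdet_LI)
    logp_actions : sequence of log pi_theta(a_t | s_t, c) along tau^m
    f_m, g_same, h_union : the DPP terms of Eq. (7)-(9) for this batch
    """
    p_dpp = np.exp(logdet_G - logdet_LI)
    a1, a2, a3 = alphas
    return (p_dpp * logp_decoder / M
            - beta / M * np.sum(logp_actions)
            + a1 / M * f_m
            - a2 * g_same
            + a3 * h_union)
```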
IV. EVALUATION AND MAIN RESULTS
In this section, we compare ODPP with SOTA baselines on a series of challenging evaluation tasks. (1) For intuitive visualization, we test these algorithms on maze tasks built with Mujoco [34]. In particular, we select the Point (7-dim state, 2-dim action) and Ant (30-dim state, 8-dim action) as the training agents and place them in complex Mujoco Maze environments (Figure 4(a) and 4(d)). Both the Point agent and the Ant agent have continuous (i.e., infinite-scale) state and action spaces. We evaluate the diversity and coverage of the options learned by the different algorithms through a qualitative comparison of the visualized trajectories of different options. Then, we provide a quantitative study of whether these options can aid learning in downstream tasks, namely goal-achieving and exploration tasks in the maze environments. Both tasks are long-horizon, with an episode length of 500. (2) To show the applicability of our algorithm to a wide range of general RL tasks, we test it on 3D Ant locomotion tasks (Figure 6(a)) and Atari video games. In this part, we focus on evaluating whether the agent can learn effective control behaviors without the supervision of task-specific rewards, and whether a large number of options can be learned at a time with satisfactory quality.
For the baselines, we follow the option-first variational option discovery setting, which starts from a random policy rather than an efficient exploration policy as in trajectory-first methods. For a fair comparison, we compare our algorithm with SOTA option-first methods: VIC [6], DIAYN [7], and VALOR [18]. We do not compare with DADS [8] because it is designed for model-based planning rather than model-free deep RL. Further, since we claim to integrate variational and Laplacian-based option discovery, we also compare with the SOTA Laplacian-based algorithm: DCO [11].
A. Ablation Study
In Figure 2, we visualize the trajectories of the options learned by different algorithms in the Point Room task, all starting from the same initial state and with the same horizon (i.e., 50 steps). Note that the visualizations in Figures 1-3 are aerial views of the Mujoco Room/Corridor rather than toy 2D grid mazes. As mentioned in Section III-C, the objective of DCO is to train options that connect states with high and low values in the Fiedler vector of the state-transition graph. Due to its algorithm design, DCO can only learn one option at a time, whereas the other algorithms learn multiple options at the same time (10 options, as shown). (1) From (a)-(c), we can see that variational option discovery methods can discover diverse options, but these options can hardly go through the "bottleneck" states in the environment, which restricts their coverage of the state space. On the contrary, in (d), the option trained with DCO can go through two "bottleneck" states but lacks diversity, since it can only go along the direction given by the Fiedler vector (further shown in Figure 3(a)).
Fig. 2. Ablation study: (a) VIC, (b) DIAYN, (c) VALOR, (d) DCO, (e) ODPP (L^{IB}), (f) ODPP (L^{IB}, L^{DPP}_1), (g) ODPP using the trajectory feature defined by the hidden-layer output, (h) ODPP using the trajectory feature defined with the Structured DPP framework. The visualizations show the effectiveness of each component of our algorithm design and that our algorithm constructs options with higher diversity and better coverage than the baselines. Note that the visualizations in Fig. 1-3 are aerial views of the 3D Mujoco Room/Corridor rather than toy 2D grid mazes.
(a) DCO. (b) ODPP (D = 2). (c) ODPP.
Fig. 3. (a) Options learned with DCO lead the agent from areas with higher values (darker color) in the Fiedler vector to areas with lower values, but lack diversity. (b) As a special case, when setting the feature dimension to 2 and the number of options to learn to 1, our algorithm obtains options similar to those of DCO. (c) Our algorithm can learn multiple diverse options at a time, each of which has high coverage.
In contrast, as shown in (h), options learned with our algorithm have both superior diversity and coverage. (2) In (e), we already obtain significantly better options than the baselines by using only L^{IB} as the objective. From (e) to (f), it can be observed that additionally introducing the objective term L^{DPP}_1 encourages the learned options to cover more landmark states. From (f) to (h), we further add L^{DPP}_2 and L^{DPP}_3, which makes the option trajectories more distinguishable from the view of a DPP. Also, in (g) and (h), we adopt different definitions of the trajectory feature. The results show that using trajectory features defined with the Structured DPP framework, as described in Section III-C, is better than using the hidden-layer output of the decoder P_φ, which takes the sequence of landmark states G of the trajectory as input. The evaluation above shows the effectiveness of each component of our algorithm.
Next, as mentioned in Section III-C, when the feature dimension is set to D = 2, ODPP is expected to learn options similar to those of DCO. In Figure 3, we visualize the value of each state in the Fiedler vector of the state-transition graph as the background color (the darker, the higher), along with the options learned by DCO and ODPP in the Point Corridor task starting from different points. We observe from (a) and (b) that most of the learned options are similar in both direction and length. Further, if we adopt the normal setting, where D = 30 and the number of options learned at a time is 10, we obtain the diverse options shown in (c), which can be beneficial for downstream tasks. In this sense, ODPP can be viewed as a generalization and extension of DCO that uses variational tools (e.g., L^{IB}) to learn multiple diverse options at a time.
B. Evaluation in Downstream Tasks
In Figure 4, we evaluate the options learned by different algorithms on a series of downstream tasks. These options are trained without task-specific rewards and thus have the potential to be used in different downstream tasks in the environment. First, we test the options in the Point Room/Corridor goal-achieving tasks, where a Point agent is trained to reach a certain goal (i.e., the red points in Figure 4(a)-4(d)). This task is quite challenging since: (1) the agent does not know the location of the goal area; (2) the agent receives a positive reward only when it reaches the goal area and otherwise receives a penalty. Hence, the reward is highly sparse and delayed, and the agent needs to fully explore the environment to find the goal state. In Figure 4(b)-4(c), we compare the mean and standard deviation of the performance (i.e., the mean reward per decision step) of the different algorithms during training, repeated four times (each time with a different goal). It can be observed that, with options learned by ODPP, both the convergence speed and the converged value are much higher. This is because the options learned with ODPP have good coverage of the whole state space, with which the agent can search for the goal state much more efficiently.
(a) Mujoco Room (b) Point Room Task (c) Point Corridor Task
(d) Mujoco Corridor (e) Ant Room Task (f) Ant Corridor Task
Fig. 4. (a)(d) Mujoco Maze tasks. (b)(c) Applying the options to goal-achieving tasks in the Point Room/Corridor, where the agent needs to reach one of four goals (red points); we show the mean and standard deviation of the agent's performance across the four goals. (e)(f) Applying the options to exploration tasks in the Ant Room/Corridor, where the agent needs to explore as far as possible to get a higher reward; we average over five random seeds and show the mean and standard deviation. It can be observed that the agent performs better in the different downstream tasks with the skills learned by our algorithm (i.e., ODPP).
Fig. 5. (a) Agent orientation distribution corresponding to different options. (b) Setup of the coordinate system and start points. (c) The output distribution of the prior network at different start points. (d) Performance improvement in the downstream task when applying the prior initialization. The trained prior gives preference to more useful options at the corresponding states. For example, at Location #1, Option #5, which tends to go left or down, is preferred; at Location #3, Option #3 is preferred, which leads the agent to go up or right. Complete results for the 10 options and 4 locations are provided in Appendix C-A.
Note that in order to use the pretrained options in downstream tasks, we need to train an option selector P_ψ(c|s) that outputs the option choice c at state s. In this way, we simplify a continuous control task to a discrete one, and the advantage can be seen through the comparison with applying PPO (i.e., "Base RL") directly to the task. To keep the comparison fair, the PPO agent is pretrained for the same number of episodes as the option learning. Moreover, we evaluate these algorithms on the Ant Room/Corridor exploration tasks, where an Ant agent is trained to explore areas as far from the start point (the center) as possible. The reward of a trajectory is defined as the largest distance from the start point that the agent reaches during the training episode. In Figure 4(e)-4(f), we present the change of the trajectory reward during training (repeated five times with different random seeds). We observe that options trained with ODPP provide a good initialization for this exploration task and eventually cover a much larger area of the state space than the baselines.
As mentioned in Section III, we learn a prior network P_ω together with the option policy network π_θ. In Figure 5, we take the Point Corridor goal-achieving task as an example to show that a further performance improvement can be obtained in the downstream task by initializing the option selector P_ψ with P_ω, based on the fact that they have the same structure. First, we sample 10000 trajectories for each option and visualize the agent orientation distribution corresponding to different options in (a). In (b), we show the setup of the coordinate system as well as four turning points, at which the prior network outputs are shown in (c). It can be observed from (c) that P_ω gives preference to the most relevant options at a state. At Location #1, Option #5, which tends to go left or down, is preferred, while at Location #3, Option #3 is preferred, which leads the agent to go up or right.
Fig. 6. (a) Visualization of the control behaviors learned by ODPP. (b) The mean and standard deviation of the trajectory reward (i.e., the sum of the rewards within the duration of a skill) during training. (c) The performance change during training as the number of options learned at the same time increases, where the solid line and shaded area denote the mean and standard deviation of the rewards corresponding to different options. It can be observed from (a)(b) that our algorithm discovers effective skills even without the supervision of behavior-specific reward signals. Moreover, (c) shows that a large number of options can be learned at the same time with ODPP while maintaining quality. The standard deviation of the trajectory rewards of different options increases during training, showing that the diversity among different options grows.
(a) CartPole (b) RiverRaid (c) AirRaid
Fig. 7. Evaluation results on the Atari games (CartPole, RiverRaid and AirRaid), showing the superiority of our algorithm on more general, non-maze tasks. Each experiment is repeated five times with different random seeds. The solid lines represent the mean trajectory reward of the learned options across the random seeds during training, and the shaded areas represent the standard deviation. It can be observed that our algorithm performs the best in all three tasks, and the performance improvement becomes more significant as the task difficulty increases.
For complete results of the 10 options and 4 locations, please refer to Appendix C-A. Further, we show in (d) that the agent's performance in the downstream goal-achieving task can be further improved through initialization with the prior network.
C. Evaluation in 3D Locomotion Tasks
In the previous sections, we focused on the visualization of 2D trajectories of the learned options. In this section, we apply ODPP to a 3D locomotion task in which an Ant agent needs to coordinate its four legs to move by applying continuous torques to its eight hinges. In Figure 6(a), we visualize two of the control behaviors learned by ODPP without the supervision of any extrinsic rewards. Please refer to Appendix C-B for more visualizations. The upper picture shows that the Ant rolls to the right by running on Leg 1 first and then Leg 4, which is a faster way to move forward, while the lower picture shows that it learns to walk to the right by stepping with its front legs (2&3) and back legs (1&4) in turn. These behaviors are learned with only the intrinsic objectives related to diversity and coverage defined in Section III, which is quite meaningful since it is usually difficult to design effective reward functions for complex behaviors.
Further, we present numerical results to evaluate the learned control behaviors. We use the reward setting introduced in [35] as the metric, which encourages the Ant agent to move as fast as possible at the least control cost. As shown in Figure 6(b), we average over five different random seeds, and the result shows that options trained by ODPP outperform the baselines. The reward drops during training for some baselines, which is reasonable since this is not the reward function used for option learning (Eq. (13)-(14)). Finally, to see whether a large number of options can be learned at the same time with ODPP, we test the performance of the discovered options when setting the number of options to 10, 20, 40, and 60. In Figure 6(c), we use the same reward setting as in Figure 6(b), and the solid line and shaded area denote the mean and standard deviation of the rewards corresponding to different options. It can be observed that even when learning a large number of options at the same time, we still obtain options with high quality (mean) and diversity (standard deviation), both of which increase during training.
D. Evaluation in Atari Video Games
To further demonstrate the applicability and effectiveness of our algorithm (ODPP), as shown in Figure 7, we compare it with the SOTA baselines on the more general, challenging Atari tasks [3], including CartPole, RiverRaid, and AirRaid. In CartPole, a pole is attached by an unactuated joint to a cart that moves along a frictionless track; the pole starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity. RiverRaid is a top-down shooting game with the aim of destroying enemy tankers, helicopters, and jets; the player scores for each entity destroyed and, with only 4 lives and finite fuel (which can be refilled in game), aims to maximize the total score. Last, in AirRaid, the player controls a ship that moves side-to-side above two buildings, with the objective of protecting the buildings from being destroyed by the bombs of enemy ships above. CartPole has a 4-dimensional continuous state space, whereas RiverRaid and AirRaid use gray-scale images as states, which are flattened into 128-dimensional vectors as input.

We adopt ODPP and the baselines to learn skills (i.e., options) for these tasks in an unsupervised manner. Specifically, the skills are discovered/trained based on the mutual information or Laplacian-based objectives. Then, these skills are evaluated with the carefully-crafted reward functions designed for each task, which are provided by the designers of the Atari benchmark. For each algorithm, we learn 10 skills, whose average cumulative rewards (i.e., the sum of the rewards within the duration of a skill) during training are shown in Figure 7. The skill duration is set to 100 for CartPole and 50 for the other two. Note that the complete episode horizon is 200 for CartPole and 10000 for AirRaid and RiverRaid, so it would be unfair to compare the cumulative reward of a skill with that of a whole episode. For each task, we repeat the evaluation five times with different random seeds and plot the mean across the five runs as solid lines and the standard deviation as shaded areas. It can be observed from Figure 7 that our algorithm performs the best in all three tasks, and the performance improvement becomes more significant as the task difficulty increases. Similar to the results in Figure 6(b), it is reasonable that the performance of some algorithms drops in the middle, because the "Reward" used as the evaluation metric (i.e., the Y-axis in Figure 7) is the Atari environment/task reward mentioned above, while the unsupervised training of the skills is based on the variational or Laplacian objectives. Even without the supervision of reward signals, our algorithm can find effective skills for specific tasks with high return.
V. CONCLUSION AND FUTURE WORKS
This work provides a novel unsupervised option discovery algorithm (ODPP) based on the Determinantal Point Process, which unifies the advantages of variational and Laplacian-based option discovery methods. Besides the information-theoretic objective terms commonly used in previous variational option discovery works, we propose three DPP-related terms to explicitly measure and enhance the diversity and coverage of the discovered options. Moreover, through a novel design of the DPP kernel matrix based on the Laplacian spectrum of the state transition graph, our algorithm generalizes the SOTA Laplacian-based option discovery algorithms for better exploration and coverage of the state space. Regarding the evaluation, we compare ODPP with SOTA algorithms from both categories by visualizing the discovered options and applying them to a series of challenging downstream tasks. The results show that our algorithm significantly outperforms the baselines.
For future challenges and extensions, we may work on a better understanding of the learned Laplacian embeddings and analyze how they affect the options built on them. How to take advantage of SOTA exploration strategies within ODPP would also be interesting. Moreover, we could integrate ODPP with the Meta-Learning framework to discover options that can be applied in multiple diverse scenarios at much lower training cost. It would also be an interesting direction to extend ODPP to multi-agent scenarios, i.e., multi-agent option discovery based on DPP, which aims at discovering diverse cooperative skills among the agents in a task.
REFERENCES
[1] N. Brown and T. Sandholm, "Superhuman AI for multiplayer poker," Science, vol. 365, no. 6456, pp. 885–890, 2019.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[3] I. Hosu and T. Rebedea, "Playing Atari games with deep reinforcement learning and human checkpoint replay," CoRR, vol. abs/1607.05077, 2016. [Online]. Available: http://arxiv.org/abs/1607.05077
[4] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine, "Visual foresight: Model-based deep reinforcement learning for vision-based robotic control," arXiv preprint arXiv:1812.00568, 2018.
[5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[6] K. Gregor, D. J. Rezende, and D. Wierstra, "Variational intrinsic control," in Proceedings of the 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017.
[7] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, "Diversity is all you need: Learning skills without a reward function," in Proceedings of the 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net, 2019.
[8] A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman, "Dynamics-aware unsupervised discovery of skills," in Proceedings of the 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020.
[9] T. M. Cover, Elements of Information Theory. John Wiley & Sons, 1999.
[10] Y. Jinnai, J. W. Park, D. Abel, and G. D. Konidaris, "Discovering options for exploration by minimizing cover time," in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 3130–3139.
[11] Y. Jinnai, J. W. Park, M. C. Machado, and G. D. Konidaris, "Exploration in reinforcement learning with deep covering options," in Proceedings of the 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020.
[12] A. Ghosh and S. P. Boyd, "Growing well-connected graphs," in Proceedings of the 45th IEEE Conference on Decision and Control, CDC 2006. IEEE, 2006, pp. 6605–6611.
[13] V. Campos, A. Trott, C. Xiong, R. Socher, X. Giró-i-Nieto, and J. Torres, "Explore, discover and learn: Unsupervised discovery of state-covering skills," in Proceedings of the 37th International Conference on Machine Learning, ICML 2020, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, 2020, pp. 1317–1327.
[14] A. Ajay, A. Kumar, P. Agrawal, S. Levine, and O. Nachum, "OPAL: Offline primitive discovery for accelerating offline reinforcement learning," in Proceedings of the 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021.
[15] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems 14, NeurIPS 2001. MIT Press, 2001, pp. 849–856.
[16] L. Lee, B. Eysenbach, E. Parisotto, E. P. Xing, S. Levine, and R. Salakhutdinov, "Efficient exploration via state marginal matching," CoRR, vol. abs/1906.05274, 2019. [Online]. Available: http://arxiv.org/abs/1906.05274
[17] R. S. Sutton, D. Precup, and S. P. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[18] J. Achiam, H. Edwards, D. Amodei, and P. Abbeel, "Variational option discovery algorithms," CoRR, vol. abs/1807.10299, 2018. [Online]. Available: http://arxiv.org/abs/1807.10299
[19] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, 2014.
[20] S. Levine, A. Kumar, G. Tucker, and J. Fu, "Offline reinforcement learning: Tutorial, review, and perspectives on open problems," CoRR, vol. abs/2005.01643, 2020. [Online]. Available: https://arxiv.org/abs/2005.01643
[21] A. Kulesza and B. Taskar, "Determinantal point processes for machine learning," Foundations and Trends in Machine Learning, vol. 5, no. 2-3, pp. 123–286, 2012.
[22] M. Elfeki, C. Couprie, M. Rivière, and M. Elhoseiny, "GDPP: Learning diverse generations using determinantal point processes," in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 1774–1783.
[23] B. Gong, W. Chao, K. Grauman, and F. Sha, "Diverse sequential subset selection for supervised video summarization," in Advances in Neural Information Processing Systems 27, NeurIPS 2014, 2014, pp. 2069–2077.
[24] T. Kwon, "Variational intrinsic control revisited," in Proceedings of the 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021.
[25] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, "Deep variational information bottleneck," in Proceedings of the 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017.
[26] C. Ko, J. Lee, and M. Queyranne, "An exact algorithm for maximum entropy sampling," Operations Research, vol. 43, no. 4, pp. 684–691, 1995.
[27] L. Chen, G. Zhang, and E. Zhou, "Fast greedy MAP inference for determinantal point process to improve recommendation diversity," in Advances in Neural Information Processing Systems 31, NeurIPS 2018, 2018, pp. 5627–5638.
[28] K. Wang, K. Zhou, Q. Zhang, J. Shao, B. Hooi, and J. Feng, "Towards better Laplacian representation in reinforcement learning with generalized graph drawing," in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, ser. Proceedings of Machine Learning Research, vol. 139. PMLR, 2021, pp. 11003–11012.
[29] A. Kulesza and B. Taskar, "Structured determinantal point processes," in Advances in Neural Information Processing Systems 23, NeurIPS 2010. Curran Associates, Inc., 2010, pp. 1171–1179.
[30] Y. Yuan and K. M. Kitani, "Diverse trajectory forecasting with determinantal point processes," in Proceedings of the 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020.
[31] Y. Yang, Y. Wen, J. Wang, L. Chen, K. Shao, D. Mguni, and W. Zhang, "Multi-agent determinantal Q-learning," in Proceedings of the 37th International Conference on Machine Learning, ICML 2020, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, 2020, pp. 10757–10766.
[32] S. Amari, "Backpropagation and stochastic gradient descent method," Neurocomputing, vol. 5, no. 3, pp. 185–196, 1993.
[33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
[34] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
[35] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," in Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, 2016. [Online]. Available: http://arxiv.org/abs/1506.02438
[36] M. Fiedler, "Algebraic connectivity of graphs," Czechoslovak Mathematical Journal, vol. 23, no. 2, pp. 298–305, 1973.