Chapter IV: Co-training for Policy Learning
4.3 A Theory of Policy Co-training
visitation distribution. In policy iteration, we aim to maximize:

E_{τ∼π′}[ Σ_{i=0}^{n−1} γ^i A_π(s_i, a_i) ]
    = Σ_{i=0}^{n−1} E_{s_i∼ρ_{π′}(s)}[ E_{a_i∼π′(s_i)}[ γ^i A_π(s_i, a_i) ] ]
    ≈ Σ_{i=0}^{n−1} E_{s_i∼ρ_π(s)}[ E_{a_i∼π′(s_i)}[ γ^i A_π(s_i, a_i) ] ].
This approximation is made instead of taking the expectation over ρ_{π′}(s), which has a complicated dependency on the yet-unknown policy π′. Policy gradient methods adopt this approximation by using ρ_π, which depends only on the current policy. We define the approximate objective as:
η_π(π′, M) = η(π, M) + Σ_{i=0}^{n−1} E_{s_i∼ρ_π(s)}[ E_{a_i∼π′(s_i)}[ γ^i A_π(s_i, a_i) ] ],

and its associated expectation over D as J_π(π′) = E_{M∼D}[η_π(π′, M)].
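As a concrete illustration, the approximate objective can be estimated from rollouts of the current policy π. The sketch below is a minimal Monte-Carlo estimator under the assumption of a discrete action space, so the inner expectation over a ∼ π′(s_i) is an exact sum; the function name and input layout are hypothetical, not part of the chapter's algorithm:

```python
import numpy as np

def surrogate_objective(advantages, pi_new_probs, gamma):
    """Estimate sum_i E_{s_i~rho_pi}[ E_{a_i~pi'(s_i)}[ gamma^i A_pi(s_i, a_i) ] ]
    from one rollout of the current policy pi.

    advantages ..... advantages[t][a] = A_pi(s_t, a) for each discrete action a
    pi_new_probs ... pi_new_probs[t][a] = pi'(a | s_t), the candidate policy
    gamma .......... discount factor
    """
    total = 0.0
    for t, (adv, probs) in enumerate(zip(advantages, pi_new_probs)):
        # states come from rho_pi (the rollout); the expectation over
        # a ~ pi'(s_t) is computed exactly as a dot product
        total += (gamma ** t) * float(np.dot(probs, adv))
    return total
```

Averaging this quantity over many rollouts and adding η(π, M) yields an estimate of η_π(π′, M).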
better than π_B on this instance, and we could use the translated trajectory of π_A as a demonstration for π_B. Even when J(π_A) > J(π_B), because J is computed in expectation over D, π_B can still outperform π_A on some MDPs. Thus the exchange of demonstrations can go in both directions.
Formally, we can partition the distribution D into two (unnormalized) parts D_1 and D_2 such that supp(D) = supp(D_1) ∪ supp(D_2) and supp(D_1) ∩ supp(D_2) = ∅, where for an MDP M ∈ supp(D_1), η(π_A, M_A) ≥ η(π_B, M_B), and for an MDP M ∈ supp(D_2), η(π_B, M_B) > η(π_A, M_A). By construction, we can quantify the performance gap as:
Definition 1.
δ_1 = E_{M∼D_1}[ η(π_A, M_A) − η(π_B, M_B) ] ≥ 0,
δ_2 = E_{M∼D_2}[ η(π_B, M_B) − η(π_A, M_A) ] > 0.
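Given MDPs sampled from D, δ_1 and δ_2 can be estimated empirically. A minimal sketch, assuming per-MDP returns for both policies are available; since D_1 and D_2 are unnormalized parts of D, each part's sum is divided by the total sample count rather than the part size:

```python
import numpy as np

def estimate_deltas(returns_A, returns_B):
    """Empirical estimate of delta_1 and delta_2 from per-MDP returns.

    returns_A[i] and returns_B[i] approximate eta(pi_A, M_A) and
    eta(pi_B, M_B) for the i-th MDP sampled from D.
    """
    ra = np.asarray(returns_A, dtype=float)
    rb = np.asarray(returns_B, dtype=float)
    d1 = ra >= rb          # empirical supp(D_1): pi_A at least as good
    d2 = ~d1               # empirical supp(D_2): pi_B strictly better
    # unnormalized partition: each part keeps its share of D's total mass
    delta1 = float(np.sum(ra[d1] - rb[d1])) / len(ra)
    delta2 = float(np.sum(rb[d2] - ra[d2])) / len(ra)
    return delta1, delta2
```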
We can now state our first result on policy improvement.
Theorem 5. (Extension of Theorem 1 in (Kang, Jie, and Feng, 2018)) Define:
α_D^A = E_{M∼D}[ max_s D_KL(π_A(s) ‖ π′_A(s)) ],        β_{D_2}^B = E_{M∼D_2}[ max_s D_JS(π_B(s) ‖ π_A(s)) ],
α_D^B = E_{M∼D}[ max_s D_KL(π_B(s) ‖ π′_B(s)) ],        β_{D_1}^A = E_{M∼D_1}[ max_s D_JS(π_A(s) ‖ π_B(s)) ],
A_D = max_{M∈supp(D)} max_{s,a} |A_{π_A}(s, a)|,        B_{D_2} = max_{M∈supp(D_2)} max_{s,a} |A_{π_B}(s, a)|,
A_{D_1} = max_{M∈supp(D_1)} max_{s,a} |A_{π_A}(s, a)|,  B_D = max_{M∈supp(D)} max_{s,a} |A_{π_B}(s, a)|.
Here D_KL and D_JS denote the Kullback-Leibler and Jensen-Shannon divergences, respectively. Then we have:
J(π′_A) ≥ J_{π_A}(π′_A) − [ 2γ_A (4 β_{D_2}^B B_{D_2} + α_D^A A_D) ] / (1 − γ_A)² + δ_2,
J(π′_B) ≥ J_{π_B}(π′_B) − [ 2γ_B (4 β_{D_1}^A A_{D_1} + α_D^B B_D) ] / (1 − γ_B)² + δ_1.
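The right-hand side of either inequality is simple to evaluate once the divergence and advantage constants have been estimated (e.g., maximum KL/JS divergences over sampled states and maximum |advantage| over sampled MDPs). A small sketch; the function name is hypothetical and all inputs are assumed to be estimated elsewhere:

```python
def policy_improvement_bound(J_surrogate, gamma, alpha, A_max, beta, B_max, delta):
    """Lower bound on J(pi') in the style of Theorem 5:

        J(pi') >= J_surrogate - 2*gamma*(4*beta*B_max + alpha*A_max)/(1-gamma)**2 + delta

    where alpha is the max-KL term, beta the max-JS term, A_max/B_max the
    advantage magnitude bounds, and delta the cross-view gain term.
    """
    penalty = 2.0 * gamma * (4.0 * beta * B_max + alpha * A_max) / (1.0 - gamma) ** 2
    return J_surrogate - penalty + delta
```

The bound improves as the β (cross-policy divergence) terms shrink and the δ terms grow, which is the quantitative motivation for the algorithms of Section 4.4.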
Compared to conventional analyses of policy improvement, the new key terms that determine how much the policy improves are the β's and the δ's. The β's, which quantify the maximal divergence between π_A and π_B, hinder improvement, while the δ's contribute positively. If the net contribution is positive, then the policy improvement bound is larger than that of conventional single-view policy gradient.

Figure 4.2: Co-training with shared action space.
This insight motivates co-training algorithms that explicitly aim to minimize the β’s.
One technicality is how to compute D_JS(π_A(s) ‖ π_B(s)) given that the state and action spaces of the two representations might differ. Proposition 6 ensures that we can measure the Jensen-Shannon divergence between two policies with different MDP representations.
Proposition 6. For representations M_A and M_B of an MDP satisfying Assumption 1, the quantities max_s D_JS(π_B(s) ‖ π_A(s)) and max_s D_JS(π_A(s) ‖ π_B(s)) are well-defined.
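Once the two policies' action distributions have been brought into a common action space (e.g., through the trajectory mapping f_{A→B}), the divergence at a state reduces to the Jensen-Shannon divergence between two categorical distributions. A minimal numpy sketch; the eps smoothing is an implementation convenience for zero probabilities, not part of the theory:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two categorical
    action distributions p and q over a shared action space."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    def kl(a, b):
        # eps guards against log(0) without materially changing the value
        return float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the KL divergence, D_JS is symmetric and bounded (by ln 2 in nats), which is what makes the max over states in Proposition 6 well-behaved.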
Minimizing β_{D_2}^B and β_{D_1}^A is not straightforward since the trajectory mappings between the views can be very complicated. We present practical algorithms in Section 4.4.
Special Case: Performance Gap from Optimal Policy in Shared Action Setting

We now analyze the special case where the action spaces of the two views are the same, i.e., A_A = A_B. Figure 4.2 depicts the learning interaction between π_A and π_B. For each state s, we can directly compare the actions chosen by the two policies since the action space is shared. This insight leads to a stronger analysis result, in which we can bound the gap between a co-trained policy and an optimal policy.
The approach we take resembles learning reduction analyses for interactive imitation learning.
For this analysis we focus on discrete action spaces with k actions, deterministic learned policies, and a deterministic optimal policy (which is guaranteed to exist (Puterman, 2014)). We reduce policy learning to classification: for a given state s, the task of identifying the optimal action π∗(s) is a classification problem. We build upon the PAC generalization bound results of (Dasgupta, Littman, and McAllester, 2002) and show that, under Assumption 2, optimizing a measure of disagreement between the two policies leads to effective learning of π∗.

Figure 4.3: Graphical model encoding the conditional independence assumption.
Assumption 2. For a state s, its two representations s_A and s_B are conditionally independent given the optimal action π∗(s).
This assumption is common in analyses of co-training for classification (Blum and Mitchell, 1998; Dasgupta, Littman, and McAllester, 2002). Although this assumption is typically violated in practice (Nigam and Ghani, 2000), our empirical evaluation still demonstrates strong performance.
Assumption 2 corresponds to a graphical model describing the relationship between the optimal action and the state representations (Figure 4.3). The intuition is that, when we do not know a∗ = π∗(s), we should maximize the agreement between a_A = π_A(s_A) and a_B = π_B(s_B). By the data-processing inequality from information theory (Cover and Thomas, 2012), we know that I(a_A; a∗) ≥ I(a_A; a_B). In practice, this means that if a_A and a_B agree often, they must reveal substantial information about what a∗ is. We formalize this intuition and obtain an upper bound on the classification error rate, which enables quantifying the performance gap. Notice that without any information from π∗, the best we can hope for is to learn a mapping that matches π∗ up to some permutation of the action labels (Dasgupta, Littman, and McAllester, 2002). We therefore assume we have enough state-action pairs from π∗ to recover this permutation. In practice this is satisfied, as we demonstrate in Section 4.5.
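The agreement intuition can be checked empirically: a plug-in estimate of I(a_A; a_B) from a contingency table of co-occurring action choices gives an observable proxy for the unobservable I(a_A; a∗), which by the data-processing inequality it cannot exceed. A sketch (simple plug-in estimator, no bias correction; the helper name is hypothetical):

```python
import numpy as np

def empirical_mi(joint_counts):
    """Plug-in estimate (in nats) of I(a_A; a_B) from a k x k table where
    joint_counts[i][j] counts states on which pi_A chose i and pi_B chose j."""
    p = np.asarray(joint_counts, dtype=float)
    p /= p.sum()
    pa = p.sum(axis=1, keepdims=True)  # marginal of a_A
    pb = p.sum(axis=0, keepdims=True)  # marginal of a_B
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * np.log(p / (pa * pb))
    # nansum treats the 0 * log(0) cells as contributing zero
    return float(np.nansum(terms))
```

Perfect agreement between the two policies over k equally frequent actions gives ln k; independent action choices give 0.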
Formally, we connect the performance gap between a learned policy and an optimal policy with an empirical estimate of the disagreement in action choices between the two co-trained policies. Let {τ_i^A}_{i=1}^m be trajectories sampled from π_A and {f_{A→B}(τ_i^A)}_{i=1}^m be the mapped trajectories in M_B. In {f_{A→B}(τ_i^A)}_{i=1}^m, let N(a_A = i) be the number of times action i is chosen by π_A, and let N = Σ_{i=1}^k N(a_A = i) be the total number of actions in the trajectory set. Let N(a_B = i) be the number of times action i is chosen by π_B when going through the states in {f_{A→B}(τ_i^A)}_{i=1}^m, and let N(a_A = i, a_B = i) record when both actions agree on i.
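These counts are simple to collect in code. A sketch, under the assumption that the mapped trajectories are available as (a_A, s_B) pairs, i.e., the action π_A took paired with the corresponding mapped state in M_B (this pairing format is an assumption of the sketch, not fixed by the text):

```python
from collections import Counter

def agreement_counts(pairs, pi_B):
    """Collect N(a_A=i), N(a_B=i), and N(a_A=i, a_B=i) over the states of
    the mapped trajectories. pi_B is a deterministic policy mapping a
    state in M_B to an action index."""
    n_a, n_b, n_agree = Counter(), Counter(), Counter()
    for a_A, s_B in pairs:
        a_B = pi_B(s_B)
        n_a[a_A] += 1
        n_b[a_B] += 1
        if a_A == a_B:
            n_agree[a_A] += 1
    return n_a, n_b, n_agree
```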
We also require a measure of model complexity, as is common in PAC-style analyses. We use |π| to denote the number of bits needed to represent π. We can now state our second main result, quantifying the performance gap with respect to an optimal policy:
Theorem 7. Suppose Assumption 2 holds for M ∼ D and a deterministic optimal policy π∗, and let π_A and π_B be two deterministic policies for the two representations.
Define:
P̂(a_A = i | a_B = i) = N(a_A = i, a_B = i) / N(a_B = i),
P̂(a_A ≠ i | a_B = i) = N(a_A ≠ i, a_B = i) / N(a_B = i),
ε_i(π_A, π_B, σ) = √[ ( (|π_A| + |π_B|) ln 2 + ln(2k/σ) ) / ( 2 N(a_B = i) ) ],
ζ_i(π_A, π_B, σ) = P̂(a_A = i | a_B = i) − P̂(a_A ≠ i | a_B = i) − 2 ε_i(π_A, π_B, σ),
b_i(π_A, π_B, σ) = ( P̂(a_A ≠ i | a_B = i) + ε_i(π_A, π_B, σ) ) / ζ_i(π_A, π_B, σ),
ℓ(s, π) = 1(π(s) ≠ π∗(s)),    ε_A = E_{s∼ρ_{π_A}}[ ℓ(s, π_A) ].
Then with probability 1 − σ:

ε_A ≤ max_{j∈{1,...,k}} b_j(π_A, π_B, σ),    η(π_A, M_A) ≥ η(π∗, M) − u T ε_A,

where T is the time horizon and u is the largest one-step deviation loss compared with π∗.
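Given the agreement counts N(a_B = i) and N(a_A = i, a_B = i), the empirical bound max_j b_j can be evaluated directly. A sketch following the form of the definitions in Theorem 7 (returns infinity when ζ_i ≤ 0, in which case the bound is vacuous for that action; the function name is hypothetical):

```python
import math

def disagreement_bound(n_b, n_agree, bits_A, bits_B, k, sigma):
    """Evaluate max_j b_j(pi_A, pi_B, sigma) from agreement counts.

    n_b[i] ...... N(a_B = i)
    n_agree[i] .. N(a_A = i, a_B = i)
    bits_A/B .... |pi_A| and |pi_B| in bits; k actions; confidence sigma
    """
    bounds = []
    for i in range(k):
        Nb = n_b.get(i, 0)
        if Nb == 0:
            continue  # pi_B never chose action i in the sample
        p_agree = n_agree.get(i, 0) / Nb        # P_hat(a_A = i | a_B = i)
        p_dis = 1.0 - p_agree                   # P_hat(a_A != i | a_B = i)
        eps = math.sqrt(((bits_A + bits_B) * math.log(2)
                         + math.log(2 * k / sigma)) / (2 * Nb))
        zeta = p_agree - p_dis - 2 * eps
        if zeta <= 0:
            return float("inf")                 # bound vacuous for this action
        bounds.append((p_dis + eps) / zeta)
    return max(bounds) if bounds else float("inf")
```

With abundant samples and near-perfect agreement the bound shrinks toward zero, matching the intuition that high cross-view agreement certifies low error against π∗.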
To obtain a small performance gap relative to π∗, one must minimize ε_A, which measures the disagreement between π_A and π∗. However, we cannot directly estimate this quantity since we only have limited sample trajectories from π∗. Instead, we can minimize an upper bound, max_{j∈{1,...,k}} b_j(π_A, π_B, σ), which measures the maximum disagreement on actions between π_A and π_B and, importantly, can be estimated from samples. In Section 4.4, we design an algorithm that approximately minimizes this bound. The advantage of two views over a single view is that they enable us to establish an upper bound on ε_A, which is otherwise unmeasurable.