
Chapter V: Vehicle Traffic Congestion Control in Signalized Intersection Networks

5.2 Pattern-Learning with Memory and Prediction

ate the performance of our controller. First, define the average cumulative waiting time to be $W := (1/|V_D[T_{\mathrm{sim}}]|)\sum_{v \in V_D[T_{\mathrm{sim}}]} W_v[T_{\mathrm{sim}}]$. Second, define the average travel deviation to be $D := (1/|V_D[T_{\mathrm{sim}}]|)\sum_{v \in V_D[T_{\mathrm{sim}}]} (\hat{D}_v - D_v)$. Third, we keep track of $V_C[t]$, the number of vehicles that did not reach their destinations by $t \in [0, T_{\mathrm{sim}}]$.
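As a concrete illustration, the three metrics can be computed from per-vehicle records as in the hypothetical sketch below (the tuple layout and function name are illustrative assumptions, not from the source):

```python
# Hypothetical sketch of the three performance metrics, assuming each
# delivered vehicle carries (cumulative waiting time, actual distance,
# nominal distance). Names and layout are illustrative.

def congestion_metrics(delivered, total_vehicles):
    """delivered: list of (waiting_time, actual_dist, nominal_dist) tuples
    for vehicles that reached their destination by T_sim."""
    n = len(delivered)
    # Average cumulative waiting time W over delivered vehicles.
    W = sum(w for w, _, _ in delivered) / n
    # Average travel deviation D: extra distance over the nominal route.
    D = sum(a - d for _, a, d in delivered) / n
    # V_C[T_sim]: vehicles still in the network at the end of simulation.
    V_C = total_vehicles - n
    return W, D, V_C

# Example: three of four vehicles delivered by T_sim.
W, D, V_C = congestion_metrics([(4.0, 10, 8), (2.0, 6, 6), (6.0, 9, 7)], 4)
```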

constructed as $\mathrm{Eq}(\psi_1) = \mathrm{Eq}'(\psi_1)$, where

$$\mathrm{Eq}'(\psi_j) := \{v' \cdot \psi_j : v' \in \{2, \cdots, v\}\} \cup \{[v_1 + \psi_{j,1}, \cdots, v_8 + \psi_{j,8}] : [v_1, \cdots, v_8] \in \{0, \cdots, v\}^8 \setminus \mathbf{0}\}, \quad (5.1)$$

where $\cdot$ denotes multiplication by a scalar, $v$ is from Section 5.1.3, and $\mathbf{0} \in \mathbb{R}^8$ is the all-zeros vector. This means $\mathrm{Eq}'(\psi_j)$ contains the following two types of elements: 1) every elementwise multiple of $\psi_j$ up to a factor of $v$, and 2) every nonzero additive variation of the entries of $\psi_j$ up to $v$.
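The construction in (5.1) can be sketched directly for small $v$; this is a minimal Python rendition assuming patterns are 8-tuples of nonnegative vehicle counts (the function name is illustrative):

```python
from itertools import product

# Sketch of Eq'(psi_j) from (5.1): scalar multiples of the pattern plus
# all nonzero entrywise additive offsets drawn from {0,...,v}^8.

def eq_prime(psi, v):
    # Type 1: elementwise multiples of psi up to a factor of v.
    scalar_multiples = {tuple(c * x for x in psi) for c in range(2, v + 1)}
    # Type 2: every nonzero additive variation of the entries up to v.
    additive = {
        tuple(x + d for x, d in zip(psi, offset))
        for offset in product(range(v + 1), repeat=8)
        if any(offset)  # exclude the all-zeros offset
    }
    return scalar_multiples | additive
```

Note that the pattern itself is never a member of its own $\mathrm{Eq}'$ class: the scalar factors start at 2, and the all-zeros additive offset is excluded.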

For each time $t+1$ when a new pattern $\psi_k \notin \mathrm{Eq}(\Psi_I[t])$ is observed at intersection $I$, its equivalence class is constructed iteratively as:

$$\mathrm{Eq}(\psi_k) := \begin{cases} \emptyset & \text{if } \exists \psi_j \in \mathrm{Eq}(\Psi_I[t]) \text{ s.t. } \psi_k \in \mathrm{Eq}(\psi_j) \\ f(\mathrm{Eq}'(\psi_k), \{\mathrm{Eq}(\psi_j)\}_{j=1}^{K[t]}, \mathrm{Eq}(\Psi_I[t])) & \text{else} \end{cases}, \quad (5.2)$$

where $\mathrm{Eq}'$ is defined in (5.1). The function $f$ is designed to check whether every $\psi \in \mathrm{Eq}'(\psi_k)$ is already in the pattern collection, whether as a unique key or as an equivalence class member:

$$f(\mathrm{Eq}'(\psi_k), \{\mathrm{Eq}(\psi_j)\}_{j=1}^{K[t]}, \mathrm{Eq}(\Psi_I[t])) := \left\{ \psi \in \mathrm{Eq}'(\psi_k) : \tilde{f}(\psi, \{\mathrm{Eq}(\psi_j)\}_{j=1}^{K[t]}, \mathrm{Eq}(\Psi_I[t])) = 1 \right\},$$

$$\tilde{f}(\psi, \{\mathrm{Eq}(\psi_j)\}_{j=1}^{K[t]}, \mathrm{Eq}(\Psi_I[t])) := \begin{cases} 0 & \text{if } \exists \psi_j \in \mathrm{Eq}(\Psi_I[t]) \text{ s.t. } (\psi = \psi_j \vee \psi \in \mathrm{Eq}(\psi_j)) \\ 1 & \text{else} \end{cases}. \quad (5.3)$$

This construction allows all elements of $\Psi_I[t]$ to be partitioned into its unique keys and disjoint equivalence classes for all time $t$, i.e., $\Psi_I[t] = \mathrm{Eq}(\Psi_I[t]) \cup \mathrm{Eq}(\psi_1) \cup \cdots \cup \mathrm{Eq}(\psi_{K[t]})$.
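The bookkeeping in (5.2)–(5.3) can be sketched as follows, assuming patterns are stored as tuples, `keys` plays the role of the unique keys, and `eq_prime` implements (5.1) (all names are illustrative assumptions):

```python
# Minimal sketch of the new-pattern registration in (5.2)-(5.3).
# `keys` is the set of unique keys; `classes` maps each key to its
# equivalence class; eq_prime is assumed to implement (5.1).

def register_pattern(psi_k, keys, classes, eq_prime):
    # (5.2), first case: psi_k is already covered by an existing class,
    # so its own class is the empty set and nothing is stored.
    for psi_j in keys:
        if psi_k in classes[psi_j]:
            return
    # (5.2), second case: f keeps only the elements of Eq'(psi_k) that are
    # not yet in the collection, neither as a key nor as a member (5.3).
    new_class = {
        psi for psi in eq_prime(psi_k)
        if psi not in keys and all(psi not in classes[j] for j in keys)
    }
    keys.add(psi_k)
    classes[psi_k] = new_class
```

With this invariant maintained at every registration, the keys and classes stay disjoint, which is exactly the partition property stated above.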

Looking up Q-values then amounts to searching only $\mathrm{Eq}(\Psi_I[t])$ instead of the entire collection $\Psi_I[t]$, which reduces memory compared to other episodic control approaches.

The update method for each intersection's memory table closely follows that of episodic control.

At a specific intersection $I$, suppose $\psi$ is the current pattern snapshot observed at time $t$.

If $\psi \notin \Psi_I[t]$, the Q-value is approximated with $\hat{Q}_I$, which averages the Q-values of the $k$-nearest-neighbor (kNN) patterns in $\mathrm{Eq}(\Psi_I[t])$:

$$\hat{Q}_I(t, \psi, m) := \begin{cases} \dfrac{1}{k} \displaystyle\sum_{j=1}^{k} Q_I(t, \hat{\psi}_j, m) & \text{if } \psi \notin \Psi_I[t] \\ Q_I(t, \psi, m) & \text{else} \end{cases}, \quad (5.4)$$
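A minimal sketch of the lookup in (5.4), assuming the memory table is a dictionary keyed by (pattern, mode) pairs and the distance function is supplied by the caller (the table layout and names are illustrative):

```python
# Hypothetical lookup mirroring (5.4): if the pattern is unknown, average
# the stored Q-values of the k nearest unique keys.

def q_hat(table, keys, psi, m, k, dist):
    """table: dict mapping (pattern, mode) -> Q-value.
    keys: the unique keys Eq(Psi_I[t]); dist: pattern distance function."""
    if (psi, m) in table:
        # Known pattern: return the stored Q-value directly.
        return table[(psi, m)]
    # Unknown pattern: average over the k nearest unique keys.
    nearest = sorted(keys, key=lambda key: dist(psi, key))[:k]
    return sum(table[(key, m)] for key in nearest) / k
```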

where $\{\hat{\psi}_j\}_{j=1}^{k} \subseteq \mathrm{Eq}(\Psi_I[t])$ are the $k$ unique keys with the nearest distance to $\psi$ at time $t$. Here, "nearest" is measured with the $\ell_1$-norm difference, modulo the structure of the equivalence classes: $d(\psi_k, \psi_j) := \|(\{\psi_k\} \cup \mathrm{Eq}(\psi_k)) - (\{\psi_j\} \cup \mathrm{Eq}(\psi_j))\|_1$, where we briefly abuse notation to denote $\|\mathcal{B}_1 - \mathcal{B}_2\|_1 := \min\{\|b_1 - b_2\|_1 : b_1 \in \mathcal{B}_1, b_2 \in \mathcal{B}_2\}$. During training, the Q-values of the memory table are updated by comparing the existing value with the Bellman update.

Figure 5.2: Example memory table $Q_I$ for intersection $I := (1,1)$ with current pattern $\psi = [4,2,0,0,0,0,2,0]$ (blue), $v = 2$, and $k = 3$ nearest neighbors. Entries in $\mathrm{Eq}(\Psi_I[t])$ are marked with white circles. For mode 1, $\psi$ does not exist in $Q_I$, so the 3 nearest patterns (large red ball) are used during lookup; one example of a "near" pattern is in green, where the left-turn lane in the East direction has three fewer vehicles. For mode 8, an entry for $\psi$ already exists because it is equivalent to the red pattern, which is $\psi/2$.
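The class-aware distance is simply a minimum over all cross-pairs of the two sets; a self-contained sketch (helper names are illustrative):

```python
# Sketch of the class-aware l1 distance between {psi_k} ∪ Eq(psi_k) and
# {psi_j} ∪ Eq(psi_j): the minimum pairwise l1 distance across the sets.

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def class_distance(psi_k, eq_k, psi_j, eq_j):
    set_k = {psi_k} | eq_k
    set_j = {psi_j} | eq_j
    return min(l1(a, b) for a in set_k for b in set_j)
```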

Denote $\bar{\psi} \in \mathcal{S}_N$ to be the expansion of $\psi$ where zeros are placed in the positions of right-turning vehicles. Suppose the pair $(\psi, m)$ at time $t$ transitions to the pattern $\psi'$ via transition function $T_I(\cdot \mid \bar{\psi}, m)$ and yields reward $R_I(\psi, m, \psi')$, where $T_I$ and $R_I$ are dimension-reduced versions of $T$ and $R$ (from Section 5.1.3) for individual intersections. Then define:

$$r := (1 - \alpha)\hat{Q}_I(t, \psi, m) + \alpha\big(R_I(\psi, m, \psi') + \gamma \hat{Q}_I(t, \psi', m')\big). \quad (5.5)$$

Here, $\hat{Q}_I$ is the estimated Q-value computed through (5.4), $\alpha \in [0,1]$ is the learning rate, and $\gamma \in [0,1]$ is the reward discount rate. Mode $m'$ is the optimal light signal mode from pattern $\psi'$ (and varies by algorithm, e.g., Q-learning, SARSA). The update for entry $(\psi, m)$ is then performed as follows:

$$Q_I(t+1, \psi, m) \leftarrow \begin{cases} \max\{Q_I(t, \psi, m),\, r\} & \text{if } (t, \psi, m) \in Q_I \\ r & \text{else} \end{cases}. \quad (5.6)$$
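The two-step update (5.5)–(5.6) can be sketched as below, assuming the estimated Q-values for the current and next pairs have already been computed via (5.4) (all names are illustrative):

```python
# Sketch of the Bellman-style update (5.5)-(5.6): blend the current
# estimate with the observed reward plus discounted next-state value,
# then keep the max against any existing table entry.

def update_entry(table, psi, m, q_hat_cur, reward, q_hat_next, alpha, gamma):
    # (5.5): candidate value r from the Bellman update.
    r = (1 - alpha) * q_hat_cur + alpha * (reward + gamma * q_hat_next)
    # (5.6): max with the existing entry, or create the entry.
    if (psi, m) in table:
        table[(psi, m)] = max(table[(psi, m)], r)
    else:
        table[(psi, m)] = r
    return table[(psi, m)]
```

The max in (5.6) keeps the table monotone: an entry is never overwritten by a worse estimate, mirroring the episodic-control convention.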

The action $a_t \in \mathcal{A}$ is then constructed by putting together all the optimal modes $m$ of each intersection into a single vector. The PLMP algorithm with only memory implemented (without prediction) will henceforth be called pattern-learning with memory (PLM); note that it differs from episodic control by its implementation of the equivalence classes. For concreteness and variety, we consider two different ways of choosing the optimal mode $m$ given pattern $\psi$. First, greedy exploitation uses transition function $T_I$ to approximate the next pattern $\psi'$ and chooses the mode $m$ that maximizes the immediate reward $R_I(\psi, m, \psi')$. Second, episodic control (EC) exploitation chooses the mode $m$ which maximizes (5.4). We also enable exploration with some probability $\epsilon \in [0,1)$, i.e., randomly choose $m \in \mathcal{M}$.

Figure 5.3: Sample prediction procedure for intersection (0,0) and its neighbors (0,1) and (1,0). Here, $\Delta t_L = 2$ and $\Delta t_I = 1$. There are a total of three vehicles at (0,0) at time 0: two vehicles (one right-turning, one forward-going) at direction S are given the green light to pass at time 0, while one vehicle (forward-going) at direction W is given the green light to pass at time 6. Here, there are no other vehicles in the system, so each vehicle takes $\Delta t_I + \Delta t_L = 3$ timesteps to reach its next intersection.
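The two exploitation rules plus $\epsilon$-exploration can be sketched together; the helper callables (`transition`, `reward`, `q_lookup`) are assumptions standing in for $T_I$, $R_I$, and (5.4):

```python
import random

# Illustrative mode selection: with probability epsilon explore randomly;
# otherwise exploit either greedily (best one-step reward through the
# known transition) or via EC (best estimated Q-value from (5.4)).

def choose_mode(modes, psi, epsilon, strategy, transition, reward, q_lookup):
    if random.random() < epsilon:
        return random.choice(modes)  # exploration
    if strategy == "greedy":
        # Greedy: maximize the immediate reward R_I(psi, m, psi').
        return max(modes, key=lambda m: reward(psi, m, transition(psi, m)))
    # EC: maximize the estimated Q-value as in (5.4).
    return max(modes, key=lambda m: q_lookup(psi, m))
```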

5.2.2 Learning from Temporal Patterns

The VMDP implements the prediction part of the PLMP controller architecture by approximating future occurrences of patterns so that future light signal sequences can be scheduled in advance. Because the objective is to demonstrate the advantage of enabling prediction, we use a simple one-timestep lookahead assuming that all predictions are accurate due to sensors being abundantly placed throughout the grid; we defer the treatment of noisy predictions to future work.

We employ an augmented pattern representation $\phi_k = [\psi_k^\top, \zeta_k^\top]^\top \in (\mathbb{Z}_{\geq 0})^{16}$ associated with each original pattern $\psi_k \in \Psi_I[t]$. The eight additional entries $\zeta_k \in (\mathbb{Z}_{\geq 0})^8$ contain the counts of incoming vehicles in its adjacent links, and can be viewed as a projection of state $s_{t,L} \in \mathcal{S}_L$ down to left and forward turns per direction. Define $P : (\mathbb{Z}_{\geq 0})^{16} \to (\mathbb{Z}_{\geq 0})^8$ to be a projection mapping such that $P(\phi_k)$ is equal to the pattern which will occur in the next timestep.

Because a vehicle's transition time from a link to an incoming node depends on the number of other vehicles currently present on the link, we do not write the explicit form of $P$; essentially, we achieve accurate predictions by enabling one-timestep lookahead using the augmented pattern. For example, when $\Delta t_L = 1$ and there are no other vehicles in the left-turn lane of the link to the East of intersection $I$, we get $P([\mathbf{0}^\top, e_1^\top]^\top) = e_1^\top$, where $e_1$ is the first standard basis vector of $(\mathbb{Z}_{\geq 0})^8$.
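A toy rendition of the worked example above, under the same simplifying assumption that links are otherwise empty and $\Delta t_L = 1$, so every incoming vehicle counted in $\zeta$ arrives at the intersection in the next timestep (the function is a sketch, not the general form of $P$):

```python
# Toy one-timestep lookahead: phi = [psi; zeta] with 8 entries each.
# Under the stated assumption (empty links, unit travel time), the next
# pattern is exactly the incoming counts zeta.

def predict_next(phi):
    psi, zeta = phi[:8], phi[8:]
    # Empty-link special case: all incoming vehicles arrive next timestep.
    return zeta

e1 = [1, 0, 0, 0, 0, 0, 0, 0]
phi = [0] * 8 + e1  # psi = 0, zeta = e1
next_pattern = predict_next(phi)  # matches P([0^T, e1^T]^T) = e1^T
```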

We conclude this section with a side-by-side comparison of the algorithm pseudocode for vehicle traffic congestion control with PLM and PLMP.

Algorithm 1: Congestion Control via PLM

1: Initialize VMDP.
2: Initialize pattern tables $\{\Psi_I[0]\}$.
3: Create next pattern $\psi$.
4: Create next traffic light from $\psi$.
5: for $t = 1 : T_{\mathrm{sim}}$ do
6:   Propagate 1 step.
7:   Add any new vehicle arrivals.
8:   Update VMDP state.
9:   Update pattern tables $\{\Psi_I[t]\}$.
10:  Create next pattern $\psi$.
11:  Create next traffic light from $\psi$.
12: end for

Algorithm 2: Congestion Control via PLMP

1: Initialize VMDP.
2: Initialize pattern tables $\{\Psi_I[0]\}$.
3: Predict next pattern $\psi = P(\phi)$.
4: Create next traffic light from $\psi$.
5: for $t = 1 : T_{\mathrm{sim}}$ do
6:   Propagate 1 step.
7:   Add any new vehicle arrivals.
8:   Update VMDP state.
9:   Update pattern tables $\{\Psi_I[t]\}$.
10:  Predict next pattern $\psi = P(\phi)$.
11:  Create next traffic light from $\psi$.
12: end for