
Chapter V: Vehicle Traffic Congestion Control in Signalized Intersection Networks

5.2 Pattern-Learning with Memory and Prediction

ate the performance of our controller. First, define the average cumulative waiting time to be $W := (1/|V_D[T_{\mathrm{sim}}]|)\sum_{v \in V_D[T_{\mathrm{sim}}]} W_v[T_{\mathrm{sim}}]$. Second, define the average travel deviation to be $D := (1/|V_D[T_{\mathrm{sim}}]|)\sum_{v \in V_D[T_{\mathrm{sim}}]} (\hat{D}_v - D_v)$. Third, we keep track of $V_C[t]$, the number of vehicles that did not reach their destinations by $t \in [0, T_{\mathrm{sim}}]$.
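As a concrete illustration, the three metrics can be computed from per-vehicle records as in the hypothetical sketch below (the tuple layout and function name are illustrative assumptions, not from the source):

```python
# Hypothetical sketch of the three performance metrics, assuming each
# delivered vehicle carries (cumulative waiting time, actual distance,
# nominal distance). Names and layout are illustrative.

def congestion_metrics(delivered, total_vehicles):
    """delivered: list of (waiting_time, actual_dist, nominal_dist) tuples
    for vehicles that reached their destination by T_sim."""
    n = len(delivered)
    # Average cumulative waiting time W over delivered vehicles.
    W = sum(w for w, _, _ in delivered) / n
    # Average travel deviation D: extra distance over the nominal route.
    D = sum(a - d for _, a, d in delivered) / n
    # V_C[T_sim]: vehicles still in the network at the end of simulation.
    V_C = total_vehicles - n
    return W, D, V_C

# Example: three of four vehicles delivered by T_sim.
W, D, V_C = congestion_metrics([(4.0, 10, 8), (2.0, 6, 6), (6.0, 9, 7)], 4)
```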

constructed as $\mathrm{Eq}(\psi_1) = \mathrm{Eq}'(\psi_1)$, where

$$\mathrm{Eq}'(\psi_j) := \{v' \cdot \psi_j : v' \in \{2, \cdots, v\}\} \cup \{[v_1 + \psi_{j,1}, \cdots, v_8 + \psi_{j,8}] : [v_1, \cdots, v_8] \in \{0, \cdots, v\}^8 \setminus \mathbf{0}\}, \quad (5.1)$$

where $\cdot$ denotes multiplication by a scalar, $v$ is from Section 5.1.3, and $\mathbf{0} \in \mathbb{R}^8$ is the all-zeros vector. This means $\mathrm{Eq}'(\psi_j)$ contains the following two types of elements: 1) every elementwise multiple of $\psi_j$ up to a factor of $v$, and 2) every nonzero additive variation of the entries of $\psi_j$ up to $v$.
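The construction in (5.1) can be sketched directly for small $v$; this is a minimal Python rendition assuming patterns are 8-tuples of nonnegative vehicle counts (the function name is illustrative):

```python
from itertools import product

# Sketch of Eq'(psi_j) from (5.1): scalar multiples of the pattern plus
# all nonzero entrywise additive offsets drawn from {0,...,v}^8.

def eq_prime(psi, v):
    # Type 1: elementwise multiples of psi up to a factor of v.
    scalar_multiples = {tuple(c * x for x in psi) for c in range(2, v + 1)}
    # Type 2: every nonzero additive variation of the entries up to v.
    additive = {
        tuple(x + d for x, d in zip(psi, offset))
        for offset in product(range(v + 1), repeat=8)
        if any(offset)  # exclude the all-zeros offset
    }
    return scalar_multiples | additive
```

Note that the pattern itself is never a member of its own $\mathrm{Eq}'$ class: the scalar factors start at 2, and the all-zeros additive offset is excluded.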

For each time $t+1$ when a new pattern $\psi_k \notin \mathrm{Eq}(\Psi_I[t])$ is observed at intersection $I$, its equivalence class is constructed iteratively as:

$$\mathrm{Eq}(\psi_k) := \begin{cases} \emptyset & \text{if } \exists \psi_j \in \mathrm{Eq}(\Psi_I[t]) \text{ s.t. } \psi_k \in \mathrm{Eq}(\psi_j) \\ f(\mathrm{Eq}'(\psi_k), \{\mathrm{Eq}(\psi_j)\}_{j=1}^{K[t]}, \mathrm{Eq}(\Psi_I[t])) & \text{else} \end{cases}, \quad (5.2)$$

where $\mathrm{Eq}'$ is defined in (5.1). The function $f$ is designed to check whether every $\psi \in \mathrm{Eq}'(\psi_k)$ is already in the pattern collection, whether as a unique key or as an equivalence class member:

$$f(\mathrm{Eq}'(\psi_k), \{\mathrm{Eq}(\psi_j)\}_{j=1}^{K[t]}, \mathrm{Eq}(\Psi_I[t])) := \left\{ \psi \in \mathrm{Eq}'(\psi_k) : \tilde{f}(\psi, \{\mathrm{Eq}(\psi_j)\}_{j=1}^{K[t]}, \mathrm{Eq}(\Psi_I[t])) = 1 \right\},$$

$$\tilde{f}(\psi, \{\mathrm{Eq}(\psi_j)\}_{j=1}^{K[t]}, \mathrm{Eq}(\Psi_I[t])) := \begin{cases} 0 & \text{if } \exists \psi_j \in \mathrm{Eq}(\Psi_I[t]) \text{ s.t. } (\psi = \psi_j \vee \psi \in \mathrm{Eq}(\psi_j)) \\ 1 & \text{else} \end{cases}. \quad (5.3)$$

This construction allows all elements of $\Psi_I[t]$ to be partitioned into its unique keys and disjoint equivalence classes for all time $t$, i.e., $\Psi_I[t] = \mathrm{Eq}(\Psi_I[t]) \cup \mathrm{Eq}(\psi_1) \cup \cdots \cup \mathrm{Eq}(\psi_{K[t]})$.
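The bookkeeping in (5.2)–(5.3) can be sketched as follows, assuming patterns are stored as tuples, `keys` plays the role of the unique keys, and `eq_prime` implements (5.1) (all names are illustrative assumptions):

```python
# Minimal sketch of the new-pattern registration in (5.2)-(5.3).
# `keys` is the set of unique keys; `classes` maps each key to its
# equivalence class; eq_prime is assumed to implement (5.1).

def register_pattern(psi_k, keys, classes, eq_prime):
    # (5.2), first case: psi_k is already covered by an existing class,
    # so its own class is the empty set and nothing is stored.
    for psi_j in keys:
        if psi_k in classes[psi_j]:
            return
    # (5.2), second case: f keeps only the elements of Eq'(psi_k) that are
    # not yet in the collection, neither as a key nor as a member (5.3).
    new_class = {
        psi for psi in eq_prime(psi_k)
        if psi not in keys and all(psi not in classes[j] for j in keys)
    }
    keys.add(psi_k)
    classes[psi_k] = new_class
```

With this invariant maintained at every registration, the keys and classes stay disjoint, which is exactly the partition property stated above.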

Looking up Q-values then amounts to searching only $\mathrm{Eq}(\Psi_I[t])$ instead of the entire collection $\Psi_I[t]$, which reduces memory compared to other episodic control approaches.

The update method for each intersection's memory table closely follows that of episodic control.

At a specific intersection $I$, suppose $\psi$ is the current pattern snapshot observed at time $t$.

If $\psi \notin \Psi_I[t]$, the Q-value is approximated with $\hat{Q}_I$, which averages the Q-values of the $k$-nearest-neighbor (kNN) patterns in $\mathrm{Eq}(\Psi_I[t])$:

$$\hat{Q}_I(t, \psi, m) := \begin{cases} \dfrac{1}{k} \displaystyle\sum_{j=1}^{k} Q_I(t, \hat{\psi}_j, m) & \text{if } \psi \notin \Psi_I[t] \\ Q_I(t, \psi, m) & \text{else} \end{cases}, \quad (5.4)$$
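A minimal sketch of the lookup in (5.4), assuming the memory table is a dictionary keyed by (pattern, mode) pairs and the distance function is supplied by the caller (the table layout and names are illustrative):

```python
# Hypothetical lookup mirroring (5.4): if the pattern is unknown, average
# the stored Q-values of the k nearest unique keys.

def q_hat(table, keys, psi, m, k, dist):
    """table: dict mapping (pattern, mode) -> Q-value.
    keys: the unique keys Eq(Psi_I[t]); dist: pattern distance function."""
    if (psi, m) in table:
        # Known pattern: return the stored Q-value directly.
        return table[(psi, m)]
    # Unknown pattern: average over the k nearest unique keys.
    nearest = sorted(keys, key=lambda key: dist(psi, key))[:k]
    return sum(table[(key, m)] for key in nearest) / k
```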

where $\{\hat{\psi}_j\}_{j=1}^{k} \subseteq \mathrm{Eq}(\Psi_I[t])$ are the $k$ unique keys with the nearest distance to $\psi$ at time $t$. Here, "nearest" is measured with the $\ell_1$-norm difference, modulo the structure of the equivalence classes: $d(\psi_k, \psi_j) := \|(\{\psi_k\} \cup \mathrm{Eq}(\psi_k)) - (\{\psi_j\} \cup \mathrm{Eq}(\psi_j))\|_1$, where we briefly abuse notation to denote $\|\mathcal{B}_1 - \mathcal{B}_2\|_1 := \min\{\|b_1 - b_2\|_1 : b_1 \in \mathcal{B}_1, b_2 \in \mathcal{B}_2\}$. During training, the Q-values of the memory table are updated by comparing the existing value with the Bellman update.

Figure 5.2: Example memory table $Q_I$ for intersection $I := (1,1)$ with current pattern $\psi = [4,2,0,0,0,0,2,0]$ (blue), $v = 2$, and $k = 3$ nearest neighbors. Entries in $\mathrm{Eq}(\Psi_I[t])$ are marked with white circles. For mode 1, $\psi$ does not exist in $Q_I$, so the 3 nearest patterns (large red ball) are used during lookup; one example of a "near" pattern is in green, where the left-turn lane in the East direction has three fewer vehicles. For mode 8, an entry for $\psi$ already exists because it is equivalent to the red pattern, which is $\psi/2$.
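The class-aware distance is simply a minimum over all cross-pairs of the two sets; a self-contained sketch (helper names are illustrative):

```python
# Sketch of the class-aware l1 distance between {psi_k} ∪ Eq(psi_k) and
# {psi_j} ∪ Eq(psi_j): the minimum pairwise l1 distance across the sets.

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def class_distance(psi_k, eq_k, psi_j, eq_j):
    set_k = {psi_k} | eq_k
    set_j = {psi_j} | eq_j
    return min(l1(a, b) for a in set_k for b in set_j)
```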

Denote $\bar{\psi} \in \mathcal{S}_N$ to be the expansion of $\psi$ where zeros are placed in the positions of right-turning vehicles. Suppose the pair $(\psi, m)$ at time $t$ transitions to the pattern $\psi'$ via transition function $T_I(\cdot \mid \bar{\psi}, m)$ and yields reward $R_I(\psi, m, \psi')$, where $T_I$ and $R_I$ are dimension-reduced versions of $T$ and $R$ (from Section 5.1.3) for individual intersections. Then define:

$$r := (1 - \alpha)\hat{Q}_I(t, \psi, m) + \alpha\big(R_I(\psi, m, \psi') + \gamma \hat{Q}_I(t, \psi', m')\big). \quad (5.5)$$

Here, $\hat{Q}_I$ is the estimated Q-value computed through (5.4), $\alpha \in [0,1]$ is the learning rate, and $\gamma \in [0,1]$ is the reward discount rate. Mode $m'$ is the optimal light signal mode from pattern $\psi'$ (and varies by algorithm, e.g., Q-learning, SARSA). The update for entry $(\psi, m)$ is then performed as follows:

$$Q_I(t+1, \psi, m) \leftarrow \begin{cases} \max\{Q_I(t, \psi, m),\, r\} & \text{if } (t, \psi, m) \in Q_I \\ r & \text{else} \end{cases}. \quad (5.6)$$
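The two-step update (5.5)–(5.6) can be sketched as below, assuming the estimated Q-values for the current and next pairs have already been computed via (5.4) (all names are illustrative):

```python
# Sketch of the Bellman-style update (5.5)-(5.6): blend the current
# estimate with the observed reward plus discounted next-state value,
# then keep the max against any existing table entry.

def update_entry(table, psi, m, q_hat_cur, reward, q_hat_next, alpha, gamma):
    # (5.5): candidate value r from the Bellman update.
    r = (1 - alpha) * q_hat_cur + alpha * (reward + gamma * q_hat_next)
    # (5.6): max with the existing entry, or create the entry.
    if (psi, m) in table:
        table[(psi, m)] = max(table[(psi, m)], r)
    else:
        table[(psi, m)] = r
    return table[(psi, m)]
```

The max in (5.6) keeps the table monotone: an entry is never overwritten by a worse estimate, mirroring the episodic-control convention.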

The action $a_t \in \mathcal{A}$ is then constructed by putting together all the optimal modes $m$ of each intersection into a single vector. The PLMP algorithm with only memory implemented (without prediction) will henceforth be called pattern-learning with memory (PLM); note that it differs from episodic control by its implementation of the equivalence classes. For concreteness and variety, we consider two different ways of choosing the optimal mode $m$ given pattern $\psi$. First, greedy exploitation uses transition function $T_I$ to approximate the next pattern $\psi'$ and chooses the mode $m$ that maximizes the immediate reward $R_I(\psi, m, \psi')$. Second, episodic control (EC) exploitation chooses the mode $m$ which maximizes (5.4). We also enable exploration with some probability $\epsilon \in [0,1)$, i.e., randomly choose $m \in \mathcal{M}$.

Figure 5.3: Sample prediction procedure for intersection (0,0) and its neighbors (0,1) and (1,0). Here, $\Delta t_L = 2$ and $\Delta t_I = 1$. There are a total of three vehicles at (0,0) at time 0: two vehicles (one right-turning, one forward-going) at direction S are given the green light to pass at time 0, while one vehicle (forward-going) at direction W is given the green light to pass at time 6. Here, there are no other vehicles in the system, so each vehicle takes $\Delta t_I + \Delta t_L = 3$ timesteps to reach its next intersection.
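The two exploitation rules plus $\epsilon$-exploration can be sketched together; the helper callables (`transition`, `reward`, `q_lookup`) are assumptions standing in for $T_I$, $R_I$, and (5.4):

```python
import random

# Illustrative mode selection: with probability epsilon explore randomly;
# otherwise exploit either greedily (best one-step reward through the
# known transition) or via EC (best estimated Q-value from (5.4)).

def choose_mode(modes, psi, epsilon, strategy, transition, reward, q_lookup):
    if random.random() < epsilon:
        return random.choice(modes)  # exploration
    if strategy == "greedy":
        # Greedy: maximize the immediate reward R_I(psi, m, psi').
        return max(modes, key=lambda m: reward(psi, m, transition(psi, m)))
    # EC: maximize the estimated Q-value as in (5.4).
    return max(modes, key=lambda m: q_lookup(psi, m))
```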

5.2.2 Learning from Temporal Patterns

The VMDP implements the prediction part of the PLMP controller architecture by approximating future occurrences of patterns so that future light signal sequences can be scheduled in advance. Because the objective is to demonstrate the advantage of enabling prediction, we use a simple one-timestep lookahead assuming that all predictions are accurate due to sensors being abundantly placed throughout the grid; we defer the treatment of noisy predictions to future work.

We employ an augmented pattern representation $\phi_k = [\psi_k^\top, \zeta_k^\top]^\top \in (\mathbb{Z}_{\geq 0})^{16}$ associated with each original pattern $\psi_k \in \Psi_I[t]$. The eight additional entries $\zeta_k \in (\mathbb{Z}_{\geq 0})^8$ contain the counts of incoming vehicles in its adjacent links, and can be viewed as a projection of state $s_{t,L} \in \mathcal{S}_L$ down to left and forward turns per direction. Define $P : (\mathbb{Z}_{\geq 0})^{16} \to (\mathbb{Z}_{\geq 0})^8$ to be a projection mapping such that $P(\phi_k)$ is equal to the pattern which will occur in the next timestep.

Because a vehicle's transition time from a link to an incoming node depends on the number of other vehicles currently present on the link, we do not write the explicit form of $P$; essentially, we achieve accurate predictions by enabling one-timestep lookahead using the augmented pattern. For example, when $\Delta t_L = 1$ and there are no other vehicles in the left-turn lane of the link to the East of intersection $I$, we get $P([\mathbf{0}^\top, e_1^\top]^\top) = e_1^\top$, where $e_1$ is the first standard basis vector of $(\mathbb{Z}_{\geq 0})^8$.
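A toy rendition of the worked example above, under the same simplifying assumption that links are otherwise empty and $\Delta t_L = 1$, so every incoming vehicle counted in $\zeta$ arrives at the intersection in the next timestep (the function is a sketch, not the general form of $P$):

```python
# Toy one-timestep lookahead: phi = [psi; zeta] with 8 entries each.
# Under the stated assumption (empty links, unit travel time), the next
# pattern is exactly the incoming counts zeta.

def predict_next(phi):
    psi, zeta = phi[:8], phi[8:]
    # Empty-link special case: all incoming vehicles arrive next timestep.
    return zeta

e1 = [1, 0, 0, 0, 0, 0, 0, 0]
phi = [0] * 8 + e1  # psi = 0, zeta = e1
next_pattern = predict_next(phi)  # matches P([0^T, e1^T]^T) = e1^T
```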

We conclude this section with a side-by-side comparison of the algorithm pseudocode for vehicle traffic congestion control with PLM and PLMP.

Algorithm 1: Congestion Control via PLM

1: Initialize VMDP.
2: Initialize pattern tables $\{\Psi_I[0]\}$.
3: Create next pattern $\psi$.
4: Create next traffic light from $\psi$.
5: for $t = 1 : T_{\mathrm{sim}}$ do
6:   Propagate 1 step.
7:   Add any new vehicle arrivals.
8:   Update VMDP state.
9:   Update pattern tables $\{\Psi_I[t]\}$.
10:  Create next pattern $\psi$.
11:  Create next traffic light from $\psi$.
12: end for

Algorithm 2: Congestion Control via PLMP

1: Initialize VMDP.
2: Initialize pattern tables $\{\Psi_I[0]\}$.
3: Predict next pattern $\psi = P(\phi)$.
4: Create next traffic light from $\psi$.
5: for $t = 1 : T_{\mathrm{sim}}$ do
6:   Propagate 1 step.
7:   Add any new vehicle arrivals.
8:   Update VMDP state.
9:   Update pattern tables $\{\Psi_I[t]\}$.
10:  Predict next pattern $\psi = P(\phi)$.
11:  Create next traffic light from $\psi$.
12: end for