Jaebak Hwang

In the graph, we can see that the PPO algorithm with self-imitation alone cannot put out all fires, while the PPO algorithm with self-imitation and behavior cloning can extinguish all of them.

Calibrating Microscopic Traffic Simulator

This thesis addresses two main topics: (1) calibrating microscopic traffic simulations by parallel search, and (2) finding optimal fire extinguishing policies using deep reinforcement learning. Accordingly, this section also has two parts of related work: the first is related to traffic simulations and the second to deep reinforcement learning.

Finding Optimal Policy of Simulations Using Deep Reinforcement Learning

In RoboCup Rescue competitions, most teams used clustering methods to simplify the map and generated graphs based on the result of a k-clustering algorithm. In this paper, we applied a machine learning approach, a deep reinforcement learning model, to choose the actions of the agents without a k-clustering algorithm or an A* search algorithm.

Calibrating Microscopic Traffic Simulations

CMA-ES generates offspring from the search point θ based on the mutation power σ and the n×n covariance matrix C, and updates C using the n-dimensional search path vectors p_σ and p_c. CMA-ES and Active-CMA-ES have many parameters, but most of them are set by recommended equations.
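
As a minimal illustration of the sampling step described above (not the thesis implementation), one generation of offspring can be drawn from a multivariate normal distribution centred on θ with covariance σ²C; the population size λ and the NumPy factorisation are assumptions of this sketch.

```python
import numpy as np

def sample_offspring(theta, sigma, C, lam, rng=None):
    """Draw lam offspring from N(theta, sigma^2 * C), the CMA-ES sampling step."""
    rng = rng or np.random.default_rng(0)
    n = theta.shape[0]
    # Factorise C once per generation: C = B * diag(D^2) * B^T
    D2, B = np.linalg.eigh(C)
    A = B @ np.diag(np.sqrt(np.maximum(D2, 0.0)))   # A @ A.T == C
    z = rng.standard_normal((lam, n))               # independent standard normal samples
    return theta + sigma * z @ A.T                  # offspring, shape (lam, n)

# Example: a 4-dimensional parameter vector (e.g. flattened OD entries), 8 offspring
theta = np.ones(4) * 100.0
offspring = sample_offspring(theta, sigma=5.0, C=np.eye(4), lam=8)
```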

Finding Optimal Policy of Simulations using Deep Reinforcement Learning

Because the two deep neural networks are updated separately, the Actor-Critic algorithm reduces the variance of the value function and speeds up its convergence. RoboCup Rescue Simulation (RCRS) is one of the best-known disaster simulators and is used in the RoboCup competition. The purpose of the Self-Imitation algorithm is to let a model imitate its own good past experiences.
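
A compact sketch of the self-imitation idea, assuming a PyTorch actor-critic with a discrete policy: only stored transitions whose return R exceeds the current value estimate V(s) contribute to the loss. The tensor shapes and names are placeholders, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def self_imitation_loss(logits, values, actions, returns, value_coef=0.5):
    """Self-imitation loss: imitate past transitions only where R > V(s)."""
    advantage = (returns - values).clamp(min=0.0)           # (R - V)^+
    log_prob = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(log_prob * advantage.detach()).mean()   # push policy toward good past actions
    value_loss = 0.5 * advantage.pow(2).mean()              # pull V(s) up toward the good return
    return policy_loss + value_coef * value_loss

# Toy usage with random tensors standing in for a replay batch
logits = torch.randn(32, 6, requires_grad=True)   # 6 discrete actions (e.g. target buildings)
values = torch.randn(32, requires_grad=True)
actions = torch.randint(0, 6, (32,))
returns = torch.randn(32)
loss = self_imitation_loss(logits, values, actions, returns)
loss.backward()
```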

This section is part of the first main goal: calibrating dynamic traffic assignment models through parallel search. PTV Vissim, one of the best-known microscopic traffic simulators, uses a module called Dynamic Assignment to iteratively calculate the equilibrium traffic flow based on a map and an OD matrix.

Figure 1: The agent selects an action based on its policy, and the environment returns the changed state and a reward.

Definition of Problem

Nowadays, we can obtain a large amount of real-time traffic data from inductive loop sensors installed on roads, and data-driven approaches for analyzing traffic flows can potentially be more accurate than traditional macroscopic modeling of traffic flow. Dynamic traffic assignment (DTA) models are used to calculate the equilibrium traffic flow in a traffic simulation. To calibrate a DTA model, a calibration algorithm adjusts the OD matrix until the equilibrium traffic flow matches the collected data.
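
To make that objective concrete, a simple loss the calibration algorithm could minimize is the root-mean-square error between the link counts produced by simulating the current OD matrix and the counts collected by the loop sensors. This is a minimal sketch under that assumption; `run_simulation` and `loop_sensor_data` are hypothetical names, not the thesis code.

```python
import numpy as np

def calibration_loss(simulated_counts, sensor_counts):
    """RMSE between simulated and observed link counts for one time interval."""
    simulated = np.asarray(simulated_counts, dtype=float)
    observed = np.asarray(sensor_counts, dtype=float)
    return float(np.sqrt(np.mean((simulated - observed) ** 2)))

# The calibration loop adjusts the OD matrix until this loss stops improving:
# loss = calibration_loss(run_simulation(od_matrix), loop_sensor_data)
```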

The gray roads are the road segments on which traffic was simulated in PTV Vissim in our experiments. All but one of the red boxes contain two inductive loop sensors, one for each direction of the road.

Active CMA-ES and Restart-WSPSA

So, we also compared the total running time depending on the number of simulations run in parallel. The gap between CMA-ES and the other algorithms was larger than we expected, so we combined restart strategies with the WSPSA algorithm; we call the result the Restart-WSPSA algorithm. Because CMA-ES shows better results than the SPSA and WSPSA algorithms on both the small map and the toy simulations, we also tested the map in Figure 4, described in the experimental settings section.

The most important advantage of CMA-ES is that multiple simulations can run in parallel. Since an SPSA-based algorithm can run at most 2 simulations in parallel, we run CMA-ES with 2 parallel simulations by default and then increase the number of parallel simulations to 4, 8, and 16.
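
The ask/tell interface of the widely used `cma` Python package maps directly onto this kind of parallelism: the candidate OD vectors proposed in each generation can be evaluated by independent simulation runs at the same time. The sketch below assumes that package and a placeholder `evaluate` function; it is not the thesis implementation.

```python
# pip install cma
from multiprocessing import Pool
import cma

def evaluate(od_vector):
    """Placeholder objective: run one traffic simulation and return its calibration loss."""
    return sum((x - 100.0) ** 2 for x in od_vector)   # stand-in for a Vissim run

if __name__ == "__main__":
    x0 = [100.0] * 8                                  # flattened OD-matrix entries
    # {"CMA_active": True} would enable Active-CMA-ES (assumed option name)
    es = cma.CMAEvolutionStrategy(x0, 5.0, {"popsize": 16})
    with Pool(processes=16) as pool:                  # one simulation per process
        while not es.stop():
            candidates = es.ask()                     # candidate OD vectors for this generation
            losses = pool.map(evaluate, candidates)   # evaluated in parallel
            es.tell(candidates, losses)
    print(es.result.xbest)
```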

Experimental Settings

Because CMA-ES performs worse than Restart-WSPSA, we upgrade CMA-ES to Active-CMA-ES, an improved version of CMA-ES with faster convergence. We therefore estimated the total running time of the algorithms as the number of CPU cores used in parallel increases in Section 4.4. Because VISSIM supports 4 simulations in parallel and the runtimes of the other computations were negligible, we can estimate the total runtime by greedily allocating the simulations in each generation to 8 and 16 CPU cores.
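
That runtime estimate can be reproduced with a simple greedy scheduler that assigns each simulation in a generation to the core that becomes free first; the generation then finishes when the busiest core does. This is a sketch under the stated assumption that non-simulation computation is negligible, with hypothetical per-simulation times.

```python
import heapq

def generation_runtime(sim_times, num_cores):
    """Greedily assign simulations to cores; return the makespan of one generation."""
    cores = [0.0] * num_cores            # time at which each core becomes free
    heapq.heapify(cores)
    for t in sorted(sim_times, reverse=True):
        earliest = heapq.heappop(cores)  # core that frees up first
        heapq.heappush(cores, earliest + t)
    return max(cores)

# 16 simulations of ~90 s each on 8 cores run in two waves
print(generation_runtime([90.0] * 16, num_cores=8))   # 180.0
```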

The CPU has six cores, so it can run six traffic simulations in parallel. Both Restart-WSPSA and Active-CMA-ES are implemented in Python 3, and they start and control the simulations in PTV Vissim through the COM interface.
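
For reference, controlling Vissim over COM from Python typically looks like the sketch below, using pywin32. The object and attribute names follow the standard PTV Vissim COM examples but vary between versions, so treat them as assumptions rather than the thesis code; the file path is hypothetical.

```python
# Requires Windows, a licensed PTV Vissim installation, and `pip install pywin32`.
import win32com.client as com

vissim = com.Dispatch("Vissim.Vissim")                 # start/attach to Vissim via COM
vissim.LoadNet(r"C:\models\ulsan_downtown.inpx")       # hypothetical network file path
vissim.Simulation.SetAttValue("SimPeriod", 3600)       # simulate one hour (attribute name assumed)
vissim.Simulation.RunContinuous()                      # run the whole simulation

# Reading back link counts depends on the network setup; object names assumed, version-dependent:
# count = vissim.Net.DataCollectionMeasurements.ItemByKey(1).AttValue("Vehs(Current,Total,All)")
```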

Calibration Errors vs. the Number of Simulations

Nevertheless, PTV Vissim can use a maximum of four cores in parallel due to licensing restrictions. To increase the performance of Restart-WSPSA, we optimized the W matrix in each time interval by using the OD matrix calibrated in the previous time interval. We also estimated an additional loss metric for each iteration with an additional simulation, and rejected the iteration if this loss metric was more than 20% greater than the loss metric of the previous iteration.

Therefore, each iteration in Restart-WSPSA uses three simulations: two for calculating the SPSA gradient and one for measuring the loss.
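
A hedged sketch of how those three simulations fit into one iteration is shown below: two perturbed runs give the SPSA gradient estimate, and a third run measures the loss of the candidate, which is rejected if it is more than 20% worse than the previous loss. The W-matrix weighting of WSPSA and the restart logic are omitted, and `run_simulation`, `a`, and `c` are placeholders.

```python
import numpy as np

def spsa_iteration(theta, loss_prev, run_simulation, a=0.1, c=1.0, rng=None):
    """One Restart-WSPSA-style iteration: 2 simulations for the SPSA gradient, 1 for the loss."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=theta.shape)       # Rademacher perturbation
    loss_plus = run_simulation(theta + c * delta)           # simulation 1
    loss_minus = run_simulation(theta - c * delta)          # simulation 2
    grad = (loss_plus - loss_minus) / (2.0 * c) * delta     # SPSA gradient estimate (delta is +-1)
    candidate = theta - a * grad
    loss_new = run_simulation(candidate)                    # simulation 3: loss measurement
    if loss_new > 1.2 * loss_prev:                          # reject if >20% worse than before
        return theta, loss_prev
    return candidate, loss_new
```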

Figure 5: The performance of Restart-WSPSA and Active-CMA-ES in each of the four time intervals when the traffic simulations ran sequentially and in parallel.

Calibration Errors vs. the Total Running Time

Calibration Errors of Individual Roads

Conclusions

This section is part of the second main goal: finding the optimal fire suppression policy using deep reinforcement learning. Recently, there have been several examples of solving problems by training artificial intelligence models in various simulators, such as AlphaGo [3] and AlphaStar [4]. However, research on finding the best model for the optimal policy of agents in disaster simulators, such as fires or earthquakes, is not yet active.

This is because disaster responses, such as fighting overwhelming fires, often become uncontrollably difficult due to a small error in the early stages, and a small difference in early behavior produces significantly different results. Thus, we developed a software package that finds an optimal policy to extinguish fires using deep reinforcement learning.

Definition of Problem

For example, if the agent extinguishes all fires 50 time steps earlier than the maximum time step, this results in a bonus of 0.5.
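
Read literally, this suggests the bonus scales linearly with the number of time steps left when the last fire goes out, so that finishing 50 steps early in a 100-step episode yields 0.5. The snippet below encodes that assumed rule purely as an illustration; the 100-step limit is our assumption, not stated in the thesis.

```python
def time_bonus(finish_step, max_steps=100):
    """Assumed bonus rule: proportional to the time steps saved before the episode limit."""
    if finish_step is None:            # not all fires were extinguished
        return 0.0
    return (max_steps - finish_step) / max_steps

print(time_bonus(50))   # 0.5 when all fires are out 50 steps before a 100-step limit
```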

Self-Imitation and Behavior Cloning

PPO Algorithm to RCRS

The second part is the condition of the agent, which contains the water level and the x, y coordinates. After the model is given the state of RCRS as the environment, it has to choose the actions of the agents. Thus, when the model selects the id of a building, the agent goes to that building and checks whether it is on fire; if it is, the agent tries to extinguish it until the water tank is empty or the fire is out.

After the model chooses an action for the agents, we need to return the reward of that choice. So the overall flow of deep reinforcement learning is to take the observation space as input conditions, feed it to the model, let the model choose an action, and return the resulting reward.
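
A gym-style sketch of how that loop could fit together is given below; the class, field names, and numeric values are illustrative placeholders for the RCRS wrapper, not its real API.

```python
import random

class ToyFireEnv:
    """Minimal stand-in for the RCRS wrapper: the action is the id of a building to visit."""

    def __init__(self, num_buildings=6, initial_fires=3, tank_size=5):
        self.fire_level = {i: 0 for i in range(num_buildings)}
        for i in random.sample(range(num_buildings), initial_fires):
            self.fire_level[i] = 3          # fire intensity in discrete levels
        self.tank_size = tank_size
        self.water = tank_size              # water in discrete units

    def _observation(self):
        return list(self.fire_level.values()) + [self.water]

    def step(self, building_id):
        # Travel to the chosen building; if it burns, spray until water or fire runs out.
        while self.fire_level[building_id] > 0 and self.water > 0:
            self.fire_level[building_id] -= 1
            self.water -= 1
        if self.water == 0:
            self.water = self.tank_size     # refill at a hydrant (simplified)
        reward = -sum(self.fire_level.values())    # fewer/weaker fires -> higher reward
        done = all(level == 0 for level in self.fire_level.values())
        return self._observation(), reward, done

env = ToyFireEnv()
obs, reward, done = env.step(0)
```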

Experiment Setting

The water level is reduced when officers attempt to extinguish a fire (e.g., 0.8 -> 0.6), and if the water level is 0, officers cannot attempt to extinguish the fire. In the case of fire departments, officers must choose which buildings to check and extinguish. The yellow boxes are buildings on fire, and the color of each building changes as its fire level changes.

When the fire department tries to extinguish a fire, the fire level decreases and finally reaches 0 when the fire is completely extinguished. If the agent uses all the water in the water tank, the agent must refill it at a fire hydrant.

Fixed Location of Initial Fire

The other light gray boxes are roads, and the red objects on the roads are hydrants. In this case, the deep reinforcement learning models cannot find a way to extinguish all fires, but the PPO algorithm with self-imitation shows the best result among the algorithms and leaves far fewer burning buildings than the others. To test more complex cases, we changed the starting locations of the fires to be random in the next section.

Figure 8: Total return at the end of an episode in the case of 8 initial fires with fixed locations.

Random Location of Initial Fire

As we can see in Figure 15, the PPO algorithm with self-imitation can put out all fires when behavior cloning is added. In Figure 16 we can see how the PPO algorithm works with self-imitation and behavior cloning. Since the greedy algorithm cannot extinguish all fires in the six-initial-fire case, we can see that the PPO algorithm with self-imitation and behavior cloning is much better than the greedy algorithm.

In the graph, we can see that the PPO algorithm with self-imitation and behavior cloning can extinguish all the fires in most cases, because the mean return value is greater than 0, whereas the PPO algorithm with self-imitation but without behavior cloning cannot extinguish all the fires.

Figure 12: The number of remaining burning buildings at the end of an episode in the case of 6 initial fires with fixed locations.

Toy Simulator

Experiment Setups

Random Location of Initial Fire

In one setting, even without the behavior cloning algorithm the model can figure out how to put out all the fires; in another, the model cannot do so without behavior cloning but can once behavior cloning is applied.

In this graph, we reduced the pre-training epochs of the behavior cloning algorithm, and we can see that the model can still learn to extinguish all fires. Thus, we reduced the pre-training epochs of the behavior cloning algorithm again, and we see that the model cannot learn to extinguish all fires, as in Figure 26.
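
For context, behavior-cloning pre-training of the kind whose epoch count is varied here is usually plain supervised learning: cross-entropy between the policy's action logits and the actions of a demonstrator such as the greedy policy, repeated for a fixed number of epochs. The PyTorch sketch below is an assumption about that setup, not the thesis code; the network shape and names are placeholders.

```python
import torch
import torch.nn as nn

def pretrain_behavior_cloning(policy, states, expert_actions, epochs=10, lr=1e-3):
    """Supervised pre-training: make the policy imitate demonstrator actions."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):                       # fewer epochs -> weaker imitation prior
        optimizer.zero_grad()
        loss = criterion(policy(states), expert_actions)
        loss.backward()
        optimizer.step()
    return policy

# Toy usage: 7-dimensional observations, 6 candidate buildings as discrete actions
policy = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 6))
states = torch.randn(128, 7)
expert_actions = torch.randint(0, 6, (128,))
pretrain_behavior_cloning(policy, states, expert_actions, epochs=10)
```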

Figure 20: This is a big map for the toy simulator. There are 78 buildings, 4 initial random fires, and 3 hydrants.

Conclusions

However, the model cannot put out all the fires on a larger map when the behavior cloning algorithm is applied with a small number of pre-training epochs, as in Figure 26. So in the toy simulator we can see that the PPO algorithm with self-imitation and behavior cloning can put out all the fires, but it cannot learn how to do so when behavior cloning is used with too few epochs and the map size increases. We also tested the most efficient algorithm on maps of different sizes using the toy simulator and showed that the PPO algorithm with self-imitation and behavior cloning can effectively extinguish all fires in most cases, as shown in Figures 21, 23, and 25.

We assume that when the map size increases, the observation space grows multiplicatively and must be explored far more extensively, so the model cannot find the right path with a small number of pre-training epochs. In conclusion, in the toy simulator we can see that the PPO algorithm with self-imitation and behavior cloning is better than the TD3 algorithm and the greedy algorithm, and that it can extinguish all fires under different conditions.

Figures

Figure 1: The agent selects an action based on its policy, and the environment returns the changed state and a reward.
Figure 2: The Actor-Critic algorithm uses two neural networks, one for the policy and one for the value function.
Figure 3: The clip function of the surrogate objective clips the advantage-weighted update when the probability ratio falls outside the range [1-ε, 1+ε].
Figure 4: The map of a region in downtown Ulsan, South Korea. The gray roads are the road segments on which traffic was simulated in PTV Vissim in our experiments.
