Practically speaking, sample efficiency is one of the most important metrics for evaluating a model's performance. In environments with large search spaces (e.g., a continuous action space), it is a critical condition for achieving state-of-the-art performance. A game with multiple ends consists of several subgames that are played independently but influence the final outcome through the rules of the domain.
The degree of risk must therefore also be taken into account during decision making. The ends are played independently of each other, but the result of one end determines the conditions of the next end (e.g., which team throws the last stone). Monte Carlo Tree Search (MCTS) [15, 16] is a simulation-based tree search algorithm that selects nodes from the tree and uses them to explore optimal moves.
By evaluating the information of a tree node using its neighbors, the number of simulations needed to find the optimal action becomes significantly lower than in other tree search algorithms [19]. However, it is not yet possible to model all changes in the ice friction coefficient. Pebbles [20], small granular drops of water frozen onto the ice surface, are one of the main factors, and they make it difficult to estimate the friction between stones and ice.
Thus, common digital curling simulators assume a fixed friction coefficient, and the resulting discrepancy from real ice conditions affects the simulation result as execution uncertainty.
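To make this concrete, the following sketch shows one common way execution uncertainty can be represented: the intended shot parameters are perturbed with Gaussian noise before being passed to a simulator. The parameter names and noise scales are assumptions made only for this illustration, not the exact noise model of any particular simulator.

```python
import numpy as np

def perturb_shot(vx, vy, spin, rng, sigma_v=0.05, sigma_spin=0.01):
    """Apply Gaussian execution noise to an intended shot.

    vx, vy : intended release velocity components
    spin   : intended angular velocity (curl direction/magnitude)
    The noise scales (sigma_v, sigma_spin) are illustrative assumptions.
    """
    noisy_vx = vx + rng.normal(0.0, sigma_v)
    noisy_vy = vy + rng.normal(0.0, sigma_v)
    noisy_spin = spin + rng.normal(0.0, sigma_spin)
    return noisy_vx, noisy_vy, noisy_spin

# Usage: the simulator executes the perturbed shot, not the intended one.
rng = np.random.default_rng(0)
executed_shot = perturb_shot(2.2, 0.1, -1.0, rng)
```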
Reinforcement Learning
Policy iteration is an algorithm that generates and updates a policy directly by repeating policy evaluation and policy improvement steps, rather than finding the policy indirectly by optimizing a value function. In a large action space, such as a continuous or multidimensional space, a tabular representation would require generating all possible state-action pairs, which is impractical with real physical resources, so function approximation is used instead. Here, we use a deep neural network [27–29] as the approximation function, representing the policy and the value function via the network parameters ρ and θ, respectively.
First, during the iteration there is a step called policy evaluation, which estimates the true value function. To judge whether the current policy π is a good one, a value function v(s) is used to evaluate each state under the policy π. The problem is that the current estimate is not the true value function, so it must be updated throughout the iteration.
The parameterized value function vθ is approximated by a neural network and is trained toward this estimate. For a given state and action a, the policy can be trained with stochastic gradient ascent to increase the probability of actions that lead to good outcomes. At each time step, the policy πρ, parameterized by ρ, is trained by maximizing the expected return r(st).
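A minimal sketch of these two updates is given below, assuming a REINFORCE-style policy-gradient step with the value estimate as a baseline and a mean-squared-error target for vθ; the network sizes and the discrete action head are simplifications made only for this example, not the architecture used in this work.

```python
import torch
import torch.nn as nn

# Hypothetical small policy and value networks (parameters rho and theta).
policy_net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))   # pi_rho
value_net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))    # v_theta
opt_rho = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
opt_theta = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def update(states, actions, returns):
    """One policy-evaluation / policy-improvement step (REINFORCE-style sketch)."""
    # Policy evaluation: fit v_theta toward the observed returns.
    value_loss = ((value_net(states).squeeze(-1) - returns) ** 2).mean()
    opt_theta.zero_grad()
    value_loss.backward()
    opt_theta.step()

    # Policy improvement: stochastic gradient ascent on the expected return,
    # using the value estimate as a baseline to reduce variance.
    with torch.no_grad():
        advantage = returns - value_net(states).squeeze(-1)
    log_probs = torch.log_softmax(policy_net(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantage).mean()   # minimizing the negative return
    opt_rho.zero_grad()
    policy_loss.backward()
    opt_rho.step()
```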
Monte Carlo Tree Search
The information stored in a node usually consists of the number of visits to the node and an estimate of the node's expected value. During the tree search, an important step is choosing which node to explore next. The method that uses UCB as the selection function is called Upper Confidence Bounds Applied to Trees (UCT) [14].
More specifically, a node of the game tree stores two types of information: the expected reward value $\bar{v}_a$ and the number of visits $n_a$ for the corresponding action $a$. At each iteration, the algorithm visits the action that maximizes one side of the confidence interval given by the Chernoff-Hoeffding inequality [33]:
$$a^{*} = \arg\max_{a}\left(\bar{v}_a + \sqrt{\frac{2\ln N}{n_a}}\right),$$
where $N$ is the total number of visits to the parent node. It thus prefers actions that combine a high expected value with a low number of visits, balancing exploitation and exploration.
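The selection rule can be sketched as follows; the per-action statistics (total value and visit count) are assumed to be stored in a dictionary for this example.

```python
import math

def ucb_select(children):
    """Select the child action maximizing the UCB score.

    `children` is assumed to map an action to a (total_value, visit_count)
    pair; unvisited actions are tried first.
    """
    total_visits = sum(n for _, n in children.values())
    best_action, best_score = None, float("-inf")
    for action, (total_value, n) in children.items():
        if n == 0:
            return action  # always try unvisited actions first
        mean_value = total_value / n                          # exploitation term (v_bar)
        bonus = math.sqrt(2.0 * math.log(total_visits) / n)   # exploration term
        score = mean_value + bonus
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```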
Kernel Regression
The Game of Curling
The overall goal of the game is to accumulate a higher score than the opposing team. A game consists of an even number of ends (8 or 10), except in the case of overtime. In each end, eight stones are given to each team, and the players take turns throwing their stones down the ice sheet toward a set of circles called the house (the scoring area).
The team that has a stone closer to the center of the house (called the button) than any stone of the opposing team scores, and the number of its stones closer than the opponent's closest stone is the number of points it earns. If neither team scores, the throwing order is retained for the next end. The order is very important for scoring, because the player who throws the last stone (the hammer) can often determine the outcome of the end.
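As a concrete illustration of this scoring rule, the sketch below computes the score of a single end from each team's stone distances to the button; the list-of-distances representation and the 1.83 m house radius are assumptions made for this example.

```python
def score_end(team_a_dists, team_b_dists, house_radius=1.83):
    """Score one end given each team's stone distances (in meters) to the button.

    Only stones inside the house can score; the team with the single closest
    stone scores one point per stone closer than the opponent's best stone.
    """
    a_in = sorted(d for d in team_a_dists if d <= house_radius)
    b_in = sorted(d for d in team_b_dists if d <= house_radius)
    if not a_in and not b_in:
        return 0, 0                      # blank end: neither team scores
    if not b_in or (a_in and a_in[0] < b_in[0]):
        best_opponent = b_in[0] if b_in else float("inf")
        return sum(1 for d in a_in if d < best_opponent), 0
    best_opponent = a_in[0] if a_in else float("inf")
    return 0, sum(1 for d in b_in if d < best_opponent)
```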
This indicates that a strategy that works in one-end games can give different results in multi-end games. Thus, optimizing a strategy for one-end games does not guarantee that it is also a good strategy for multi-end games [23].
The Policy-Value Network
Search in Continuous Action Space
When the recursion reaches the terminal condition (line 6), we choose the action with the highest number of visits as the optimal action. Expansion. During expansion, we use a technique called progressive expansion to control the number of branches of each parent node. In this continuous search algorithm, the action selected by the UCB function is not necessarily the action to be explored or expanded directly.
In line 15, it selects another action a′t among the randomly sampled actions, namely one that minimizes the estimated number of visits W(a) while having a kernel similarity to the selected action a greater than the constant τ. This is a method not only to search in continuous space, but also to explore efficiently. That is, the simulator returns the placement of the stones after simulating the chosen action from the current state of the game.
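A minimal sketch of this kernel-guided selection is given below, assuming a Gaussian kernel over shot parameters; the kernel, the threshold value, and the data structures are illustrative assumptions rather than the exact implementation.

```python
import math

def gaussian_kernel(a, b, bandwidth=0.1):
    """Similarity between two actions (tuples of shot parameters)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def estimated_visits(a, visited, counts, kernel=gaussian_kernel):
    """Kernel-regression estimate W(a) of the number of visits near action a."""
    return sum(kernel(a, b) * counts[b] for b in visited)

def select_action_to_expand(selected, sampled_actions, visited, counts, tau=0.5):
    """Among randomly sampled actions, pick the one with the lowest estimated
    visit count W(a) whose similarity to the UCB-selected action exceeds tau."""
    candidates = [a for a in sampled_actions
                  if gaussian_kernel(a, selected) > tau]
    if not candidates:
        return selected  # fall back to the UCB-selected action itself
    return min(candidates, key=lambda a: estimated_visits(a, visited, counts))
```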
To use the kernel-based estimators, it is necessary to define initial actions before estimating their values. The problem is that it is difficult to perform a well-planned rollout at the beginning. However, using our value network vθ, which is trained on the game-play data, we can obtain the value directly, without constructing a default rollout policy or even simulating the scenarios it would generate.
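The difference can be sketched as follows; the state interface and the rollout helpers are hypothetical names used only to illustrate replacing a simulated rollout with a single query to the value network.

```python
def evaluate_leaf(state, value_net=None, rollout_policy=None, simulator=None):
    """Estimate the value of a leaf node.

    If a trained value network v_theta is available, query it directly;
    otherwise fall back to a (much more expensive) simulated rollout.
    """
    if value_net is not None:
        return value_net(state)          # one forward pass, no simulation
    # Classic Monte Carlo rollout: simulate with a default policy to the end.
    while not state.is_terminal():
        state = simulator(state, rollout_policy(state))
    return state.outcome()
```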
Backpropagation. The last step is to update the parameters of the nodes from the leaf back to the root node along the path explored by the recursion.

Only shots with clockwise rotation are analyzed; the counter-clockwise case is obtained by mirroring about the center line. However, to take out an opponent's stone, the player clearly aims the throw beyond the back line.
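Returning to the backpropagation step described above, a minimal sketch is given below, assuming each node stores per-action visit counts and running mean values; the node fields are assumptions made for this example.

```python
def backpropagate(path, outcome):
    """Update visit counts and running mean values along the explored path.

    `path` is the list of (node, action) pairs from the root to the leaf,
    and `outcome` is the value obtained at the leaf (e.g., from v_theta).
    Each node is assumed to store per-action counts n[a] and means v_bar[a].
    """
    for node, action in reversed(path):
        node.n[action] += 1
        # Incremental mean: v_bar <- v_bar + (outcome - v_bar) / n
        node.v_bar[action] += (outcome - node.v_bar[action]) / node.n[action]
```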
For our reference program, the difference between drawing a stone and taking out a stone is clear. Note that the curl shown in this figure is plotted over the action space, not over the state space (from the hog line to the back line). Training proceeds in two phases: one is supervised learning with data from a reference program, and the other is reinforcement learning with accumulated self-play data.
Supervised Learning
We analyzed the reference program as in Figure 2 and found that eliminating the part of the action space beyond the back line is very effective during training, since it reduces the size of the action space.
Self-Play Reinforcement Learning
When the score difference after the last end is zero, which means a tie, the game is counted as half a win.
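Under the assumption that ties are the only outcome other than a win or a loss, the aggregate win rate over $N$ games can then be written as
$$\text{win rate} = \frac{N_{\mathrm{win}} + 0.5\,N_{\mathrm{tie}}}{N},$$
where $N_{\mathrm{win}}$ and $N_{\mathrm{tie}}$ are the numbers of wins and ties, respectively.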
Domain Knowledge of Curling
Multi-End Strategy
Dataset
Settings
Results
The graph shows the win rates over 2,000 two-end games against DL-UCT trained with supervised learning only: 1,000 games as the first player and 1,000 games as the second player. AyumuGAT’16 is the reference program used to train the policy-value network in a supervised manner, so our first program, KR-DL, performs slightly worse than it. After further training the network on the self-play data, KR-DRL outperforms the reference program as well as the other top programs.
JiritsukunGAT'17 is a program that uses deep learning to identify multi-end strategies and a well-designed evaluation function to assess the value of each game state. It achieved the highest competition score among the reference programs, but our program KR-DRL-MES outperforms it, as shown in Figure 4.
Summary
Future Work
I would like to express my gratitude to my advisor Jaesik Choi, who has been a mentor throughout all of my thesis work.

References

D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, 2016.
K. Ohto and T. Tanaka, "A curling agent based on the Monte-Carlo tree search considering the similarity of the best action among similar states," in Proceedings of Advances in Computer Games, ACG, 2017.
R. Coulom, "Efficient selectivity and backup operators in Monte-Carlo tree search," in Computers and Games, 2007.
L. Kocsis and C. Szepesvári, "Bandit based Monte-Carlo planning," in Proceedings of the European Conference on Machine Learning, ECML, 2006.
S. James, G. Konidaris, and B. Rosman, "An analysis of Monte Carlo tree search," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
M. G. Bellemare et al., "Unifying count-based exploration and intrinsic motivation," in Proceedings of the Neural Information Processing Systems, NIPS, 2016.
S. Bubeck et al., "Online optimization in X-armed bandits," in Proceedings of the Neural Information Processing Systems, NIPS, 2008.
T. Ito and Y. Kitasei, "Proposal and implementation of digital curling," in Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, 2015.
T. Yee, V. Lisý, and M. Bowling, "Monte Carlo tree search in continuous action spaces with execution uncertainty," in Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2016.
Kim, "Development of a curling simulation for performance improvement based on a physics engine," Procedia Engineering.
J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," The Annals of Mathematical Statistics, 1952.
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proceedings of the Neural Information Processing Systems, NIPS, 1999.
W. Hoeffding, "Probability inequalities for sums of bounded random variables," in The Collected Works of Wassily Hoeffding.