A Study on Riversi AI using Deep Q-Networks

(1)

卒業研究報告題目

指導教員

岐阜工業高等専門学校電気情報工学科

平成３１年（２０１９年）２月１５日提出

DQN を利用したリバーシ AI に関する研究 A Study on Riversi AI using Deep Q-Networks

出口利憲教授

2014E27

長澤紘汰

(2)

Abstract

Recently, the reserch of Artificial Intelligence is applied to various fields of discipline and keep developing rapidly during about 70 years after it was born in Dartmouth conference. Deep Reinforcement Learning is a field which is recently and specially interested in Artifitial Intellingence. The reasons why it can become to decide everything very rapidly and precisely combining the best regression of Deep Learning with the best decision of Reinforcement Learning on a problem having big computational complexity.

In this reserch, we apply the DQN algorithm combined NFQ argorithm which is a one of Deep Reinforcement Learning Algorithms to a game of 8x8 Reversi because we don’t have a computer which is embedded Tensor Processer Unit. First, I attempt to choose the best hyper parameters, size of networks and quantity of neurons using its algorithm from random records of Cartpole problem. Second, we apply to 8x8 Reversi and examine the performance of all AI’s which are made using Keras in TensorFlow on a construction of Combolutional Neural Networks using Deep Learning. As a result, we found that its AI could initially success to learn 8x8 Reversi strategies using the best hyper parameters with the size of networks and the number of neurons selected through pre-Exp. However, its AI failed in the middle (Table 1). Therefore, in order not to fail to learn, we should make its AI learn itself using pure DQN algorithm.

(3)

Table 1 DQN AI’s win rate

episode MTS:100 Random

0 0

1 0.3 0.8

2 0.5 0.8

3 0.2 1.0

4 0.7 0.9

5 0.3 0.9

6 0.3 0.9

7 0.4 0.9

8 0.4 1.0

9 0.4 0.9

(4)

Chapter 1 Introduciton

(AI) 70

1)

DQN Cartpole

TensorFlow Keras 8x8 AI

(7)

Chapter 2 Neural Network

2.1 Neural Network

( )

2)

2.1.1 Neuron

Figure 2.1 1000 2000

2.1.2 Neuron Modeling

Figure 2.2

Frank Rosenblatt

y (2.2) xi i

wi i f

u=

!n

xiwi (2.1)

(8)

Figure 2.1 Neuron

Figure 2.2 Simple Perceptron

2.1.3 Activation Function

(Transfer Function)

(9)

-0.2 0 0.2 0.4 0.6 0.8 1 1.2

-10 -9.1 -8.2 -7.3 -6.4 -5.5 -4.6 -3.7 -2.8 -1.9 -1 -0.1 0.8 1.7 2.6 3.5 4.4 5.3 6.2 7.1 8 8.9 9.8

y

u Step

Figure 2.3 Step function

• Step function(Figure 2.3)

y =

⎧⎪

⎪⎨

⎪⎪

⎩

1 (u >0) 0 (u≤0)

(2.3)

• Sigmoid function(Figure 2.4)

y = 1

1 +e⁻^u (2.4)

• ReLU function(Figure 2.5)

y=

⎧⎪

⎪⎨

⎪⎪

⎩

u (u >0) 0 (u≤0)

(2.5)

(10)

-0.2 0 0.2 0.4 0.6 0.8 1 1.2

-10 -9.1 -8.2 -7.3 -6.4 -5.5 -4.6 -3.7 -2.8 -1.9 -1 -0.1 0.8 1.7 2.6 3.5 4.4 5.3 6.2 7.1 8 8.9 9.8

y

u Sigmoid

Figure 2.4 Sigmoid function

(11)

-2 0 2 4 6 8 10 12

-10 -9.2 -8.4 -7.6 -6.8 -6 -5.2 -4.4 -3.6 -2.8 -2 -1.2 -0.4 0.4 1.2 2 2.8 3.6 4.4 5.2 6 6.8 7.6 8.4 9.2 10

y

u ReLU

Figure 2.5 ReLU function

(12)

0 20 40 60 80 100 120

-10 -9.1 -8.2 -7.3 -6.4 -5.5 -4.6 -3.7 -2.8 -1.9 -1 -0.1 0.8 1.7 2.6 3.5 4.4 5.3 6.2 7.1 8 8.9 9.8

L

e MSE

Figure 2.6 Mean Squared Error

2.1.4 Loss Function

( )

• (mse)

(Fig- ure 2.6)

e L

• Huber

• τ-

(13)

0 2 4 6 8 10 12

-10 -9.2 -8.4 -7.6 -6.8 -6 -5.2 -4.4 -3.6 -2.8 -2 -1.2 -0.4 0.4 1.2 2 2.8 3.6 4.4 5.2 6 6.8 7.6 8.4 9.2 10

L

t - f Logistics

Figure 2.7 Logistics function

• ϵ-

• 0-1

• Logisitics 0

Logistics (Figure 2.7)

t−f L

• Hinge

(14)

Figure 2.8 Multilayer Perceptron

2.1.5 Multilayer Perceptron

Figure 2.8 1

3

David E. Rumelhart

2.2 Deep Learning

2

4

Geoﬀrey Everest Hinton

2010

•

(15)

•

2.2.1 Convolutional Neural Network

(Figure 2.9)

4 1

2)

2.3 Learning Method

RMSprop Adam

(16)

Figure 2.9 Convolutional Neural Network

2.3.1 Optimizer

• Stochastic gradient decent (SGD)

(2.6) w

w^t+1 ←w^t−η∂L(w^t)

∂w^t (2.6)

t η L

• Backpropagation

3 1986

David E. Rumelhart

–

– 0

(17)

Wkj

∆Wkj (2.7)

∆Wkj =−η ∂

∂Wkj

1 2

!n i=1

(ti −Oi)²

=η(tk−Ok)Ok(1−Ok)Hj (2.7)

O_k k t_k

( ) Hj j η

AdaGrad

∆W_ji (2.8)

∆W_ji =−η ∂

∂Wji

1 2

!n i=1

(t_i−O_i)²

=ηHj(1−Hj)

!n k=1

{Wkj(tk−Ok)Ok(1−Ok)}Xi (2.8)

Xi i

• AdaGrad

AdaGrad 2011 John C. Duchi

SGD

(2.12) w

h0 =ϵ (2.9)

ht =ht−1+∇L(w^t)² (2.10) ηt = η₀

√h_t (2.11)

w^t+1 =w^t−ηt∇E(w^t) (2.12)

(18)

RMSprop 2012 Tijmen Tieleman AdaGrad

(2.15) w

ht=αht−1+ (1−α)∇L(w^t)² (2.13) ηt= η₀

√h_t+ϵ (2.14)

w^t+1 =w^t−ηt∇L(w^t) (2.15)

α

• Adam

Adam 2012 Google Brain team

Diederik P. Kingma Mo-

mentam AdaGrad

Adam (2.20) w

mt+1 =β1mt+ (1−β1)∇L(w^t) (2.16) vt+1 =β2vt+ (1−β2)∇L(w^t)² (2.17)

ˆ

m = mt+1

1−β₁^t (2.18)

ˆ

v = vt+1

1−β₂^t (2.19)

w^t+1 =w^t−α mˆ

√vˆ+ϵ (2.20)

β1 m β2 v α

ˆ m vˆ

Adam AdaGrad

(19)

Chapter 3 Reinforcement Learning

(Figure 3.1)

TD

Q ³⁾

(MDP)

(Figure 3.2)

•

(POMDP) belief MDP

3.1 Q-Learning

Q Richard S. Sutton 1984 TD(

λ ) Christopher J.C.H. Watkins 1989

(Figure 3.3) ³⁾

π Q

(20)

Figure 3.1 Reinforcement Learning

Figure 3.2 Markov decision process

a b c d 4

4 Q Q(s_t, a) Q(s_t, b) Q(s_t, c) Q(s_t, d)

Q Q

(21)

Figure 3.3 Q-Learning

ϵ

Q ϵ -greedy

(3.1) softmax

π(s, a) =

exp(Q(s, a)

T )

!

p∈A

exp(Q(s, p)

T )

(3.1)

T A s

Q st

a s_t+1 Q(s_t, a) (3.2)

Q(st, a)←Q(st, a) +α[rt+1+γmax

p Q(st+1, a)−Q(st, a)] (3.2)

α γ

0 1 rt+1 st+1

(3.2) Q

(22)

Table 3.1 Q-Table

State\Action a1 a2 a3 an

s0 0 0.24 0.59 0.77

s₁ −0.14 0.02 −0.57 0.24 s2 −0.85 −0.04 −0.38 −0.95

sm 0.07 0.15 1.0 −1.0

α γ

α (3.3) (3.4)

Q

!∞ t=0

α(t)→ ∞ (3.3)

!∞ t=0

α(t)² < ∞ (3.4)

• Q

•

• Q

Q Q (Ta-

ble 3.1)

Neural Fitted Q Iteration Deep Q-Networks

3.1.1 ϵ -greedy Algorithm

ϵ-greedy ϵ ( 1−ϵ)

(Figure 3.4) ϵ

(23)

Figure 3.4 ϵ -greedy Algorithm

3.2 Neural Fitted Q Iteration (NFQ)

NFQ Q Q

Martin Riedmiller 2005

Q NFQ

4)

3.3 Deep Q-Networks (DQN)

(24)

0 2 4 6 8 10 12

-10 -9.2 -8.4 -7.6 -6.8 -6 -5.2 -4.4 -3.6 -2.8 -2 -1.2 -0.4 0.4 1.2 2 2.8 3.6 4.4 5.2 6 6.8 7.6 8.4 9.2 10

L

e Huber

Figure 3.5 Huber function

5)

Q Q

(3.5)

R(s, a) +γmax

a^′∈AQ(s^′, a^′) (3.5)

eθi Lθi (3.6) (3.7)

eθi =R(s, a) +γmax

a^′∈AQ(s^′, a^′;θi−1)−Q(s, a;θi) (3.6)

Lθi =

⎧⎪

⎪⎨

⎪⎪

⎩

E^&1

2(eθi)²^' (|eθi|≤1)

|e_θ_i| (|e_θ_i|>1)

(3.7)

θi i

E Huber (Figure 3.5) −1 1

Huber

∇L(θi) (3.8)

(25)

Figure 3.6 Experience Replay

∇L(θi) = E^&(R(s, a) +γmax

a^′∈AQ(s^′, a^′;θi−1)−Q(s, a;θi))∇Q(s, a;θi)^' (3.8)

E[A] A θ

Q ϵ-greedy

Experience Replay Fixed Target Q-Network clipping

6)

3.3.1 Experience Replay

Experience Replay

(Figure 3.6)

3.3.2 Fixed Target Q-Network

Fixed Target Q-Network DQN

(26)

3.3.3 Reward Clipping

Clipping −1 0 1

−1 0 1

(27)

Chapter 4 Experiment

DQN clipping Cartpole

DQN

8x8 DQN

ReLU Adam

Huber

4.1 Environment Setting

• CPU: Intel Core i7 5930K

• GPU: Nvidia Geforce GTX 980Ti

• GPU : 6GB 3

• OS: Windows10 Pro

• : Python Tensorflow Keras⁷⁾⁸⁾

4.2 pre-Exp

python Gym CartPole-v0

CartPole-v0

(Figure 4.1)

+1 0

4

( or ) Q 2 (Figure 4.2)

CSV episode

(28)

Figure 4.1 CartPole-v0

• : 10steps

• : 195rewards

• Experience Replay : 10000

• Fixed Target Q-Network : 32

DQN

4.2.1 pre-Exp1: Hyper Parametars

Q

α γ

• : 4 ( , , , )

• 1: 16 ( )

• 2: 16 ( )

• : 2 ( Q , Q )

0.00001 0.0001 0.001 0.01 0.99 0.97

0.95 12

(29)

Figure 4.2 A network for CartPole-v0

4.2.2 pre-Exp2: Networks

8x8 CartPole

16

Q α γ 0.001

0.95 16

3 4 5

Tensorflow Keras plot model (Figure 4.3)

4.2.3 pre-Exp3: Neurons

Q α γ 0.001 0.95

(30)

Figure 4.3 Output Image

(8,16) (16,8) (16,16) (16,32) (32,16) (32,32) 7

4.3 Experiment:Reversi AI

8x8

DQN AI

100 (MTS:100) AI

AI

AI DQN

26

NFQ DQN

AI

500 AI MTS:50

26 AI

1 AI

• Convolution2D

(31)

•

(32)

Chapter 5 Result

5.1 pre-Exp

5.1.1 pre-Exp1:Hyper Parametars

Figure 5.1 Figure 5.2 Figure 5.3

Discussion

Figure 5.1 0.0001

201episode 10episode 195step 0.001

episode 20 30 1000episode

0.01

0.0001 180episode

DQN 0

1episode

Figure 5.2 0.001

123episode 10episode 195step

0.01 50episode

Figure 5.3 0.001

46episode 10episode 195step

0.01

0.95 1.0

0.001 0.95

(33)

0 20

40

60

80 1 00

1 20

1 40

1 60

1 80

2 00

0 40 80 120 160 200 240 280 320 360 400 440 480 520 560 600 640 680 720 760 800 840 880 920 960

Step s

epi sodes α=0.0 00 01 α=0.0 00 1 α=0.0 01 α=0.0 1

(34)

0 20

40

60

80 1 00

1 20

1 40

1 60

1 80

2 00

0 39 78 117 156 195 234 273 312 351 390 429 468 507 546 585 624 663 702 741 780 819 858 897 936 975

Step s

epi sodes α=0.0 00 01 α=0.0 00 1 α=0.0 01 α=0.0 1

Figure 5.2 γ = 0.97

(35)

0 20

40

60

80 1 00

1 20

1 40

1 60

1 80

2 00

0 40 80 120 160 200 240 280 320 360 400 440 480 520 560 600 640 680 720 760 800 840 880 920 960

Step s

epi sodes α=0.0 00 01 α=0.0 00 1 α=0.0 01 α=0.0 1

(36)

5.1.2 pre-Exp2:Networks

Figure 5.4

Discussion

Figure 5.4 3 100episode 4 46episode

5

150step 10episode

195step 4 10

10step 5 150step

8x8

4 5

5.1.3 pre-Exp3:Neurons

Figure 5.5

Discussion

Figure 5.5 8-16

16-32 32-16 32-32 30episode

64-64 32

10

64 640-640-640

(37)

0 20

40

60

80 1 00

1 20

1 40

1 60

1 80

2 00

0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480

Step s

Epi sodes laye r: 3 laye r: 4 laye r: 5

(38)

0 20

40

60

80 1 00

1 20

1 40

1 60

1 80

2 00

0 13 26 39 52 65 78 91 104 117 130 143 156 169 182 195 208 221 234 247 260 273 286 299

Step s

Epi sodes (8,8 ) (8,1 6) (16 ,8 ) (16 ,1 6 ) (16 ,3 2 ) (32 ,1 6 ) (32 ,3 2 )

Figure 5.5 The number of neurons

(39)

Table 5.1 DQN AI’s win rate

episode MTS:100 Random

0 0

1 0.3 0.8

2 0.5 0.8

3 0.2 1.0

4 0.7 0.9

5 0.3 0.9

6 0.3 0.9

7 0.4 0.9

8 0.4 1.0

9 0.4 0.9

5.2 Experiment:Reversi AI

8x8 DQN Table 5.1 0

AI

Discussion

Table 5.1 MTS:100 AI DQN AI 4 0.70

4

AI 3 AI 0.9

5

NFQ

(40)

(41)

Chapter 6 Conclusion

8x8 DQN

Q 8x8

DQN

8x8

DQN DQN NFQ

26

Tensor Processor Unit DQN

Google Google Colab AI

AI

5

DQN 2013

2018 Rainbow Ape-X

TensorFlow Keras

AI

(42)

AI

(43)

Acknowledgements

(44)

Bibliography

1) (2008) .

2) (2018) “

” .

3) Richard S.Sutton , Andrew G.Barto (2000) , .

4) Riedmiller, Martin. (2005) “Neural fitted Q iteration–first experiences with a data eﬃcient neural reinforcement learning method.” European Conference on Machine Learning. Springer, Berlin, Heidelberg.

5) Mnih, Volodymyr, et al. (2013) “Playing atari with deep reinforcement learning.”

arXiv preprint arXiv:1312.5602.

6) (2018) PyTorch

.

7) Andreas C. Muller, Sarah Guido (2017) Python scikit-

learn 5 O’Reilly

Japan.

8) (2016) Tensorflow Google

R&D.