卒 業 研 究 報 告 題 目
指 導 教 員
岐 阜 工 業 高 等 専 門 学 校 電気情報工学科
平 成 3 1 年( 2 0 1 9 年 ) 2 月 1 5 日 提 出
DQN を利用したリバーシ AI に関する研究 A Study on Riversi AI using Deep Q-Networks
出口 利憲 教授
2014E27
長澤 紘汰Abstract
Recently, the reserch of Artificial Intelligence is applied to various fields of discipline and keep developing rapidly during about 70 years after it was born in Dartmouth confer- ence. Deep Reinforcement Learning is a field which is recently and specially interested in Artifitial Intellingence. The reasons why it can become to decide everything very rapidly and precisely combining the best regression of Deep Learning with the best decision of Reinforcement Learning on a problem having big computational complexity.
In this reserch, we apply the DQN algorithm combined NFQ argorithm which is a one of Deep Reinforcement Learning Algorithms to a game of 8x8 Reversi because we don’t have a computer which is embedded Tensor Processer Unit. First, I attempt to choose the best hyper parameters, size of networks and quantity of neurons using its algorithm from random records of Cartpole problem. Second, we apply to 8x8 Reversi and examine the performance of all AI’s which are made using Keras in TensorFlow on a construction of Combolutional Neural Networks using Deep Learning. As a result, we found that its AI could initially success to learn 8x8 Reversi strategies using the best hyper parameters with the size of networks and the number of neurons selected through pre-Exp. However, its AI failed in the middle (Table 1). Therefore, in order not to fail to learn, we should make its AI learn itself using pure DQN algorithm.
Table 1 DQN AI’s win rate
episode MTS:100 Random
0 0
1 0.3 0.8
2 0.5 0.8
3 0.2 1.0
4 0.7 0.9
5 0.3 0.9
6 0.3 0.9
7 0.4 0.9
8 0.4 1.0
9 0.4 0.9
Contents
Abstract
Chapter 1 Introduciton 1
Chapter 2 Neural Network 2
2.1 Neural Network . . . 2
2.1.1 Neuron . . . 2
2.1.2 Neuron Modeling . . . 2
2.1.3 Activation Function . . . 3
2.1.4 Loss Function . . . 7
2.1.5 Multilayer Perceptron . . . 9
2.2 Deep Learning . . . 9
2.2.1 Convolutional Neural Network . . . 10
2.3 Learning Method . . . 10
2.3.1 Optimizer . . . 11
Chapter 3 Reinforcement Learning 14 3.1 Q-Learning . . . 14
3.1.1 ϵ -greedy Algorithm . . . 17
3.2 Neural Fitted Q Iteration (NFQ) . . . 18
3.3 Deep Q-Networks (DQN) . . . 18
3.3.1 Experience Replay . . . 20
3.3.2 Fixed Target Q-Network . . . 20
3.3.3 Reward Clipping . . . 21
Chapter 4 Experiment 22 4.1 Environment Setting . . . 22
4.2 pre-Exp . . . 22
4.2.1 pre-Exp1: Hyper Parametars . . . 23
4.2.2 pre-Exp2: Networks . . . 24
4.2.3 pre-Exp3: Neurons . . . 24
4.3 Experiment:Reversi AI . . . 25
Chapter 5 Result 27
5.1 pre-Exp . . . 27
5.1.1 pre-Exp1:Hyper Parametars . . . 27
5.1.2 pre-Exp2:Networks . . . 31
5.1.3 pre-Exp3:Neurons . . . 31
5.2 Experiment:Reversi AI . . . 34
Chapter 6 Conclusion 36
Bibliography 39
Chapter 1 Introduciton
(AI) 70
1)
DQN Cartpole
TensorFlow Keras 8x8 AI
Chapter 2 Neural Network
2.1 Neural Network
( )
2)
2.1.1 Neuron
Figure 2.1 1000 2000
2.1.2 Neuron Modeling
Figure 2.2
Frank Rosenblatt
y (2.2) xi i
wi i f
u=
!n
xiwi (2.1)
Figure 2.1 Neuron
Figure 2.2 Simple Perceptron
2.1.3 Activation Function
(Transfer Function)
-0.2 0 0.2 0.4 0.6 0.8 1 1.2
-10 -9.1 -8.2 -7.3 -6.4 -5.5 -4.6 -3.7 -2.8 -1.9 -1 -0.1 0.8 1.7 2.6 3.5 4.4 5.3 6.2 7.1 8 8.9 9.8
y
u Step
Figure 2.3 Step function
• Step function(Figure 2.3)
y =
⎧⎪
⎪⎨
⎪⎪
⎩
1 (u >0) 0 (u≤0)
(2.3)
• Sigmoid function(Figure 2.4)
y = 1
1 +e−u (2.4)
• ReLU function(Figure 2.5)
y=
⎧⎪
⎪⎨
⎪⎪
⎩
u (u >0) 0 (u≤0)
(2.5)
-0.2 0 0.2 0.4 0.6 0.8 1 1.2
-10 -9.1 -8.2 -7.3 -6.4 -5.5 -4.6 -3.7 -2.8 -1.9 -1 -0.1 0.8 1.7 2.6 3.5 4.4 5.3 6.2 7.1 8 8.9 9.8
y
u Sigmoid
Figure 2.4 Sigmoid function
-2 0 2 4 6 8 10 12
-10 -9.2 -8.4 -7.6 -6.8 -6 -5.2 -4.4 -3.6 -2.8 -2 -1.2 -0.4 0.4 1.2 2 2.8 3.6 4.4 5.2 6 6.8 7.6 8.4 9.2 10
y
u ReLU
Figure 2.5 ReLU function
0 20 40 60 80 100 120
-10 -9.1 -8.2 -7.3 -6.4 -5.5 -4.6 -3.7 -2.8 -1.9 -1 -0.1 0.8 1.7 2.6 3.5 4.4 5.3 6.2 7.1 8 8.9 9.8
L
e MSE
Figure 2.6 Mean Squared Error
2.1.4 Loss Function
( )
• (mse)
(Fig- ure 2.6)
e L
• Huber
• τ-
0 2 4 6 8 10 12
-10 -9.2 -8.4 -7.6 -6.8 -6 -5.2 -4.4 -3.6 -2.8 -2 -1.2 -0.4 0.4 1.2 2 2.8 3.6 4.4 5.2 6 6.8 7.6 8.4 9.2 10
L
t - f Logistics
Figure 2.7 Logistics function
• ϵ-
• 0-1
• Logisitics 0
Logistics (Figure 2.7)
t−f L
• Hinge
Figure 2.8 Multilayer Perceptron
2.1.5 Multilayer Perceptron
Figure 2.8 1
3
David E. Rumelhart
2.2 Deep Learning
2
4
Geoffrey Everest Hinton
2010
•
•
•
•
•
•
2.2.1 Convolutional Neural Network
(Figure 2.9)
4 1
2)
2.3 Learning Method
RMSprop Adam
Figure 2.9 Convolutional Neural Network
2.3.1 Optimizer
• Stochastic gradient decent (SGD)
(2.6) w
wt+1 ←wt−η∂L(wt)
∂wt (2.6)
t η L
• Backpropagation
3 1986
David E. Rumelhart
–
– 0
Wkj
∆Wkj (2.7)
∆Wkj =−η ∂
∂Wkj
1 2
!n i=1
(ti −Oi)2
=η(tk−Ok)Ok(1−Ok)Hj (2.7)
Ok k tk
( ) Hj j η
AdaGrad
∆Wji (2.8)
∆Wji =−η ∂
∂Wji
1 2
!n i=1
(ti−Oi)2
=ηHj(1−Hj)
!n k=1
{Wkj(tk−Ok)Ok(1−Ok)}Xi (2.8)
Xi i
• AdaGrad
AdaGrad 2011 John C. Duchi
SGD
(2.12) w
h0 =ϵ (2.9)
ht =ht−1+∇L(wt)2 (2.10) ηt = η0
√ht (2.11)
wt+1 =wt−ηt∇E(wt) (2.12)
RMSprop 2012 Tijmen Tieleman AdaGrad
(2.15) w
ht=αht−1+ (1−α)∇L(wt)2 (2.13) ηt= η0
√ht+ϵ (2.14)
wt+1 =wt−ηt∇L(wt) (2.15)
α
• Adam
Adam 2012 Google Brain team
Diederik P. Kingma Mo-
mentam AdaGrad
Adam (2.20) w
mt+1 =β1mt+ (1−β1)∇L(wt) (2.16) vt+1 =β2vt+ (1−β2)∇L(wt)2 (2.17)
ˆ
m = mt+1
1−β1t (2.18)
ˆ
v = vt+1
1−β2t (2.19)
wt+1 =wt−α mˆ
√vˆ+ϵ (2.20)
β1 m β2 v α
ˆ m vˆ
Adam AdaGrad
Chapter 3 Reinforcement Learning
(Figure 3.1)
TD
Q 3)
(MDP)
(Figure 3.2)
•
•
•
(POMDP) belief MDP
3.1 Q-Learning
Q Richard S. Sutton 1984 TD(
λ ) Christopher J.C.H. Watkins 1989
(Figure 3.3) 3)
π Q
Figure 3.1 Reinforcement Learning
Figure 3.2 Markov decision process
a b c d 4
4 Q Q(st, a) Q(st, b) Q(st, c) Q(st, d)
Q Q
Figure 3.3 Q-Learning
ϵ
Q ϵ -greedy
(3.1) softmax
π(s, a) =
exp(Q(s, a)
T )
!
p∈A
exp(Q(s, p)
T )
(3.1)
T A s
Q st
a st+1 Q(st, a) (3.2)
Q(st, a)←Q(st, a) +α[rt+1+γmax
p Q(st+1, a)−Q(st, a)] (3.2)
α γ
0 1 rt+1 st+1
(3.2) Q
Table 3.1 Q-Table
State\Action a1 a2 a3 an
s0 0 0.24 0.59 0.77
s1 −0.14 0.02 −0.57 0.24 s2 −0.85 −0.04 −0.38 −0.95
sm 0.07 0.15 1.0 −1.0
α γ
α (3.3) (3.4)
Q
!∞ t=0
α(t)→ ∞ (3.3)
!∞ t=0
α(t)2 < ∞ (3.4)
• Q
•
• Q
Q Q (Ta-
ble 3.1)
Neural Fitted Q Iteration Deep Q-Networks
3.1.1 ϵ -greedy Algorithm
ϵ-greedy ϵ ( 1−ϵ)
(Figure 3.4) ϵ
Figure 3.4 ϵ -greedy Algorithm
3.2 Neural Fitted Q Iteration (NFQ)
NFQ Q Q
Martin Riedmiller 2005
Q NFQ
4)
3.3 Deep Q-Networks (DQN)
0 2 4 6 8 10 12
-10 -9.2 -8.4 -7.6 -6.8 -6 -5.2 -4.4 -3.6 -2.8 -2 -1.2 -0.4 0.4 1.2 2 2.8 3.6 4.4 5.2 6 6.8 7.6 8.4 9.2 10
L
e Huber
Figure 3.5 Huber function
5)
Q Q
(3.5)
R(s, a) +γmax
a′∈AQ(s′, a′) (3.5)
eθi Lθi (3.6) (3.7)
eθi =R(s, a) +γmax
a′∈AQ(s′, a′;θi−1)−Q(s, a;θi) (3.6)
Lθi =
⎧⎪
⎪⎨
⎪⎪
⎩
E&1
2(eθi)2' (|eθi|≤1)
|eθi| (|eθi|>1)
(3.7)
θi i
E Huber (Figure 3.5) −1 1
Huber
∇L(θi) (3.8)
Figure 3.6 Experience Replay
∇L(θi) = E&(R(s, a) +γmax
a′∈AQ(s′, a′;θi−1)−Q(s, a;θi))∇Q(s, a;θi)' (3.8)
E[A] A θ
Q ϵ-greedy
Experience Replay Fixed Target Q-Network clipping
6)
3.3.1 Experience Replay
Experience Replay
(Figure 3.6)
3.3.2 Fixed Target Q-Network
Fixed Target Q-Network DQN
3.3.3 Reward Clipping
Clipping −1 0 1
−1 0 1
Chapter 4 Experiment
DQN clipping Cartpole
DQN
8x8 DQN
ReLU Adam
Huber
4.1 Environment Setting
• CPU: Intel Core i7 5930K
• GPU: Nvidia Geforce GTX 980Ti
• GPU : 6GB 3
• OS: Windows10 Pro
• : Python Tensorflow Keras7)8)
4.2 pre-Exp
python Gym CartPole-v0
CartPole-v0
(Figure 4.1)
+1 0
4
( or ) Q 2 (Figure 4.2)
CSV episode
Figure 4.1 CartPole-v0
• : 10steps
• : 195rewards
• Experience Replay : 10000
• Fixed Target Q-Network : 32
DQN
4.2.1 pre-Exp1: Hyper Parametars
Q
α γ
• : 4 ( , , , )
• 1: 16 ( )
• 2: 16 ( )
• : 2 ( Q , Q )
0.00001 0.0001 0.001 0.01 0.99 0.97
0.95 12
Figure 4.2 A network for CartPole-v0
4.2.2 pre-Exp2: Networks
8x8 CartPole
16
Q α γ 0.001
0.95 16
3 4 5
Tensorflow Keras plot model (Figure 4.3)
4.2.3 pre-Exp3: Neurons
Q α γ 0.001 0.95
Figure 4.3 Output Image
(8,16) (16,8) (16,16) (16,32) (32,16) (32,32) 7
4.3 Experiment:Reversi AI
8x8
DQN AI
100 (MTS:100) AI
AI
AI DQN
26
NFQ DQN
AI
500 AI MTS:50
26 AI
1 AI
• Convolution2D
•
Chapter 5 Result
5.1 pre-Exp
5.1.1 pre-Exp1:Hyper Parametars
Figure 5.1 Figure 5.2 Figure 5.3
Discussion
Figure 5.1 0.0001
201episode 10episode 195step 0.001
episode 20 30 1000episode
0.01
0.0001 180episode
DQN 0
1episode
Figure 5.2 0.001
123episode 10episode 195step
0.01 50episode
Figure 5.3 0.001
46episode 10episode 195step
0.01
0.95 1.0
0.001 0.95
0 20
40
60
80 1 00
1 20
1 40
1 60
1 80
2 00
0 40 80 120 160 200 240 280 320 360 400 440 480 520 560 600 640 680 720 760 800 840 880 920 960
Step s
epi sodes α=0.0 00 01 α=0.0 00 1 α=0.0 01 α=0.0 1
0 20
40
60
80 1 00
1 20
1 40
1 60
1 80
2 00
0 39 78 117 156 195 234 273 312 351 390 429 468 507 546 585 624 663 702 741 780 819 858 897 936 975
Step s
epi sodes α=0.0 00 01 α=0.0 00 1 α=0.0 01 α=0.0 1
Figure 5.2 γ = 0.97
0 20
40
60
80 1 00
1 20
1 40
1 60
1 80
2 00
0 40 80 120 160 200 240 280 320 360 400 440 480 520 560 600 640 680 720 760 800 840 880 920 960
Step s
epi sodes α=0.0 00 01 α=0.0 00 1 α=0.0 01 α=0.0 1
5.1.2 pre-Exp2:Networks
Figure 5.4
Discussion
Figure 5.4 3 100episode 4 46episode
5
150step 10episode
195step 4 10
10step 5 150step
8x8
4 5
5.1.3 pre-Exp3:Neurons
Figure 5.5
Discussion
Figure 5.5 8-16
16-32 32-16 32-32 30episode
64-64 32
10
64 640-640-640
0 20
40
60
80 1 00
1 20
1 40
1 60
1 80
2 00
0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480
Step s
Epi sodes laye r: 3 laye r: 4 laye r: 5
0 20
40
60
80 1 00
1 20
1 40
1 60
1 80
2 00
0 13 26 39 52 65 78 91 104 117 130 143 156 169 182 195 208 221 234 247 260 273 286 299
Step s
Epi sodes (8,8 ) (8,1 6) (16 ,8 ) (16 ,1 6 ) (16 ,3 2 ) (32 ,1 6 ) (32 ,3 2 )
Figure 5.5 The number of neurons
Table 5.1 DQN AI’s win rate
episode MTS:100 Random
0 0
1 0.3 0.8
2 0.5 0.8
3 0.2 1.0
4 0.7 0.9
5 0.3 0.9
6 0.3 0.9
7 0.4 0.9
8 0.4 1.0
9 0.4 0.9
5.2 Experiment:Reversi AI
8x8 DQN Table 5.1 0
AI
Discussion
Table 5.1 MTS:100 AI DQN AI 4 0.70
4
AI 3 AI 0.9
5
NFQ
Chapter 6 Conclusion
8x8 DQN
Q 8x8
DQN
8x8
DQN DQN NFQ
26
Tensor Processor Unit DQN
Google Google Colab AI
AI
5
DQN 2013
2018 Rainbow Ape-X
TensorFlow Keras
AI
AI
Acknowledgements
Bibliography
1) (2008) .
2) (2018) “
” .
3) Richard S.Sutton , Andrew G.Barto (2000) , .
4) Riedmiller, Martin. (2005) “Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method.” European Conference on Machine Learning. Springer, Berlin, Heidelberg.
5) Mnih, Volodymyr, et al. (2013) “Playing atari with deep reinforcement learning.”
arXiv preprint arXiv:1312.5602.
6) (2018) PyTorch
.
7) Andreas C. Muller, Sarah Guido (2017) Python scikit-
learn 5 O’Reilly
Japan.
8) (2016) Tensorflow Google
R&D.