Reinforcement learning
Agent chooses actions, gets a (probabilistic) reward.
Determine a strategy to choose actions to maximize long-term reward.

Exploitation vs. exploration:
- Exploitation: choose greedily
- Exploration: learn more about the system
Non-associative model: rewards are not based on current state (i.e., only one state).
- Possibility of time-varying rewards.

k-armed bandit problem: choose among k actions (a minimal sketch follows below).
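A minimal sketch of the exploitation-vs.-exploration trade-off on a k-armed bandit, using an ε-greedy rule with sample-average value estimates (the arm count, ε, and reward distributions here are illustrative assumptions, not from the notes):

```python
import random

# Hypothetical k-armed bandit: each arm pays a noisy reward around a fixed mean.
k = 5
true_means = [random.gauss(0, 1) for _ in range(k)]

eps = 0.1          # exploration probability
Q = [0.0] * k      # sample-average estimate of each arm's value
N = [0] * k        # pull counts

for t in range(10_000):
    if random.random() < eps:
        a = random.randrange(k)                  # explore: random arm
    else:
        a = max(range(k), key=lambda i: Q[i])    # exploit: greedy arm
    r = random.gauss(true_means[a], 1)           # probabilistic reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                    # incremental sample mean

print("best arm:", max(range(k), key=lambda i: true_means[i]),
      "most-pulled arm:", max(range(k), key=lambda i: N[i]))
```

With ε = 0 the agent only exploits and can lock onto a bad arm; ε > 0 keeps exploring, so the estimates of all arms keep improving.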
Associative model: Markov Decision Process (MDP).

Markov chain: $s_1 \to s_2 \to \cdots$
- States $s_1, \ldots, s_n$
- Transition probability matrix $[P_{ij}]$; $P_{ij}$: prob. of $s_i \to s_j$; $\sum_j P_{ij} = 1$
- Finite state, one next-state distribution per state
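A small sketch of a finite Markov chain as a transition probability matrix, checking that each row (one next-state distribution per state) sums to 1; the matrix values are made up for illustration:

```python
import random

# Hypothetical 3-state chain; row i is the next-state distribution from state i.
P = [[0.8, 0.2, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.5, 0.5]]

assert all(abs(sum(row) - 1.0) < 1e-9 for row in P)  # sum_j P_ij = 1

def step(i):
    """Sample the next state s_j from state s_i's row of P."""
    return random.choices(range(len(P)), weights=P[i])[0]

s, path = 0, [0]
for _ in range(10):
    s = step(s)
    path.append(s)
print(path)  # one sampled trajectory s_0 -> s_1 -> ...
```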
Generator: at state $s$, choose action $a$ (agent's choice).
Associated probability $p(s', r \mid s, a)$: the probability of moving $s \to s'$ with reward $r$, given action $a$.

$p(s' \mid s, a) = \sum_r p(s', r \mid s, a)$

$\sum_{s'} \sum_r p(s', r \mid s, a) = 1$

Expected reward:
$r(s, a, s') = \sum_r r \cdot \dfrac{p(s', r \mid s, a)}{p(s' \mid s, a)}$
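A sketch of how the marginal $p(s' \mid s, a)$ and the expected reward $r(s, a, s')$ fall out of the four-argument dynamics $p(s', r \mid s, a)$; the dictionary `p` below is a made-up example:

```python
# Hypothetical dynamics table: p[(s_next, r, s, a)] = p(s', r | s, a).
# From state 0 under action 0: move to state 1 with reward 1 (prob 0.7),
# or stay in state 0 with reward 0 (prob 0.3).
p = {
    (1, 1.0, 0, 0): 0.7,
    (0, 0.0, 0, 0): 0.3,
}

def p_next(s_next, s, a):
    """p(s'|s,a) = sum_r p(s',r|s,a)"""
    return sum(pr for (sn, r, ss, aa), pr in p.items()
               if (sn, ss, aa) == (s_next, s, a))

def r_sas(s, a, s_next):
    """r(s,a,s') = sum_r r * p(s',r|s,a) / p(s'|s,a)"""
    num = sum(r * pr for (sn, r, ss, aa), pr in p.items()
              if (sn, ss, aa) == (s_next, s, a))
    return num / p_next(s_next, s, a)

assert abs(sum(p.values()) - 1.0) < 1e-9  # sum_{s',r} p(s',r|s,a) = 1 for (s,a)=(0,0)
print(p_next(1, 0, 0), r_sas(0, 0, 1))    # 0.7 1.0
```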
Idealized setting for "associative" reinforcement learning: state information is available to the agent.

What does it mean to maximize long-term reward?
Finite trajectory ("episode"): play a game till it ends in win/loss/draw, $T$ moves, $t = 1, \ldots, T$.

Expected return at time $t$:
$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$
Action $A_t$ at time $t$ generates reward $R_{t+1}$: $(S_t, A_t) \to (S_{t+1}, R_{t+1})$.
Many situations have no finite episode boundary: an infinite sequence of rewards.

Discounted reward:
$G_t \overset{\text{def}}{=} R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

$\gamma = 0$: "myopic"; $\gamma \to 1$: far-sighted.

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) = R_{t+1} + \gamma G_{t+1}$

If $R_t = 1$ always, $G_t = \sum_{k=0}^{\infty} \gamma^k = \dfrac{1}{1 - \gamma}$.
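A quick numeric check of the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ and of the constant-reward limit $1/(1-\gamma)$; the reward sequence and $\gamma$ are arbitrary choices:

```python
gamma = 0.9

def returns(rewards, gamma):
    """G_t for every t, by the backward recursion G_t = R_{t+1} + gamma * G_{t+1}.
    rewards[t] plays the role of R_{t+1}."""
    G = [0.0] * (len(rewards) + 1)   # G_T = 0 past the last reward
    for t in reversed(range(len(rewards))):
        G[t] = rewards[t] + gamma * G[t + 1]
    return G[:-1]

print(returns([1.0, 0.0, 2.0, 5.0], gamma))       # arbitrary finite episode

# R_t = 1 always: the truncated sum approaches 1 / (1 - gamma) = 10.
print(returns([1.0] * 200, gamma)[0], 1 / (1 - gamma))
```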
To reconcile the finite & infinite cases, for episodic tasks assume that:
- terminal states have a single self-loop
- reward is 0 for the self-loop

[Diagram: terminal state with a self-loop, $r = 0$.]

Use the same discounted-reward definition in both cases.
Actual "value" depends on the policy.

$v_\pi(s)$: value of state $s$ w.r.t. policy $\pi$:
$v_\pi(s) \overset{\text{def}}{=} E_\pi[G_t \mid S_t = s] = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$
State-value function.

$q_\pi(s, a) \overset{\text{def}}{=} E_\pi[G_t \mid S_t = s, A_t = a]$ (action-value function)

$v_\pi(s) = E_\pi[G_t \mid S_t = s] = E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$
$= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, E_\pi[G_{t+1} \mid S_{t+1} = s']\bigr]$

$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]$
Bellman equation for $v_\pi(s)$.
Find a solution to these (simultaneous) equations.
Given a policy $\pi$, we have Bellman equations to solve for $v_\pi(s)$ (a sketch below).
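For a finite MDP the Bellman equations are linear in $v_\pi$, so "solving the simultaneous equations" can be done directly: writing $P_\pi(s, s') = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$ and $r_\pi(s)$ for the expected one-step reward under $\pi$, we need $(I - \gamma P_\pi)\, v_\pi = r_\pi$. A sketch with made-up two-state numbers:

```python
import numpy as np

gamma = 0.9

# Hypothetical policy-induced chain: P_pi[s, s'] = sum_a pi(a|s) p(s'|s,a),
# r_pi[s] = expected one-step reward from s under pi.
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])

# Bellman: v = r_pi + gamma * P_pi v  =>  (I - gamma * P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v)

assert np.allclose(v, r_pi + gamma * P_pi @ v)  # the equations hold
```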
What we really want to do is to find the "best" $\pi$. Best $\pi$?

$\pi_1 \ge \pi_2$ if $\forall s,\ v_{\pi_1}(s) \ge v_{\pi_2}(s)$

There exist optimal policies. Suppose $\pi_1, \pi_2$ are both optimal:
- $\forall s,\ v_{\pi_1}(s) = v_{\pi_2}(s)$
- $\forall (s, a),\ q_{\pi_1}(s, a) = q_{\pi_2}(s, a)$
Hypothetically, there are optimum $v_*(s)$, $q_*(s, a)$. We must have

$v_*(s) = \max_a q_*(s, a)$
$= \max_a E[G_t \mid S_t = s, A_t = a]$
$= \max_a E[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$
$= \max_a E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$

$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$
Compare: $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]$.
Similarly,
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]$
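The Bellman optimality equation characterizes $v_*$ as a fixed point of the max-backup, which suggests iterating that backup until it converges (value iteration) and then reading $q_*$ and a greedy policy off the result. A sketch on a made-up 2-state, 2-action MDP, with `p[s][a]` listing `(prob, s', r)` triples:

```python
# Hypothetical dynamics: p[s][a] = [(prob, s_next, reward), ...].
p = {
    0: {0: [(1.0, 0, 0.0)],                  # stay put, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},  # try to move, usually rewarded
    1: {0: [(1.0, 1, 2.0)],
        1: [(1.0, 0, 0.0)]},
}
gamma = 0.9

v = {s: 0.0 for s in p}
for _ in range(1000):
    # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
    v = {s: max(sum(pr * (r + gamma * v[sn]) for pr, sn, r in p[s][a])
                for a in p[s])
         for s in p}

# q*(s,a) and the greedy (optimal) policy read off from the converged v*.
q = {(s, a): sum(pr * (r + gamma * v[sn]) for pr, sn, r in p[s][a])
     for s in p for a in p[s]}
pi = {s: max(p[s], key=lambda a: q[(s, a)]) for s in p}
print(v, pi)
```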