AML.tt#t2o9 - Reinforcement learning

Academic year: 2024


(1)

AML.tt#t2o9

Reinforcement learning

Agent chooses actions, gets a (probabilistic) reward.
Determine a strategy to choose actions to maximize long-term reward.

Exploitation vs Exploration
- Exploitation: choosing greedily
- Exploration: learn more about the system

(2)

Non-associative model
- Rewards are not based on current state (i.e. only one state)
- Possibility of time-varying rewards
- k-bandit problem: choose among k actions (see the sketch below)

Associative model
- Markov Decision Process
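A minimal sketch of the k-bandit setup with ε-greedy action selection, illustrating the exploitation-vs-exploration trade-off from the previous page. The Gaussian reward distributions, the value of epsilon, and the incremental-average update are illustrative assumptions, not part of the notes.

import random

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """epsilon-greedy agent on a k-armed bandit with Gaussian rewards (assumed setup)."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(k)]   # unknown to the agent
    q_est = [0.0] * k    # estimated value of each action
    counts = [0] * k     # how often each action was taken
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:                      # explore: learn more about the system
            a = rng.randrange(k)
        else:                                           # exploit: choose greedily
            a = max(range(k), key=lambda i: q_est[i])
        r = rng.gauss(true_means[a], 1.0)               # probabilistic reward
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]          # incremental sample average
        total += r
    return q_est, total / steps

if __name__ == "__main__":
    estimates, avg_reward = run_bandit()
    print("average reward:", round(avg_reward, 3))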

(3)

Markov chain: a sequence of states S_1, S_2, ..., S_t, ...
- States
- Transition probability matrix [P_ij]
  P_ij: prob of s_i → s_j, with Σ_j P_ij = 1
- Finite state space, one next-state distribution per state
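A small sketch of a Markov chain represented by its transition probability matrix: each row is the next-state distribution of one state (rows sum to 1), and a trajectory is sampled step by step. The 3-state matrix is an invented example.

import random

# Illustrative 3-state transition matrix P, P[i][j] = prob of s_i -> s_j; each row sums to 1.
P = [
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
]
assert all(abs(sum(row) - 1.0) < 1e-9 for row in P)

def sample_chain(P, s0=0, steps=10, seed=0):
    """Sample a trajectory s_0, s_1, ... using one next-state distribution per state."""
    rng = random.Random(seed)
    s, path = s0, [s0]
    for _ in range(steps):
        s = rng.choices(range(len(P)), weights=P[s])[0]
        path.append(s)
    return path

print(sample_chain(P))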
(4)

Generator
At state s, the agent chooses action a (agent's choice).
Associated probability p(s', r | s, a): taking a in s produces reward r and next state s'.

p(s' | s, a) = Σ_r p(s', r | s, a)
Σ_{s'} p(s' | s, a) = 1
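A sketch of the generator p(s', r | s, a) stored as a dictionary, checking the two identities above: marginalising over r gives p(s' | s, a), and summing that over s' gives 1. The tiny 2-state, 2-action MDP is invented for illustration.

from collections import defaultdict

# p[(s, a)] maps (s', r) -> probability; an invented 2-state, 2-action example.
p = {
    (0, 'a'): {(0, 0.0): 0.5, (1, 1.0): 0.5},
    (0, 'b'): {(1, 0.0): 1.0},
    (1, 'a'): {(0, 2.0): 0.3, (1, 0.0): 0.7},
    (1, 'b'): {(1, 1.0): 1.0},
}

def p_next_state(s, a):
    """p(s' | s, a) = sum_r p(s', r | s, a)."""
    out = defaultdict(float)
    for (s2, r), prob in p[(s, a)].items():
        out[s2] += prob
    return dict(out)

for (s, a) in p:
    dist = p_next_state(s, a)
    assert abs(sum(dist.values()) - 1.0) < 1e-9   # sum_{s'} p(s' | s, a) = 1
    print((s, a), dist)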
(5)

Expected reward
r(s, a, s') = Σ_r r · p(s', r | s, a) / p(s' | s, a)

Idealized setting for "associative" reinforcement learning: state information is available to the agent.

What does it mean to maximize long-term reward?
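A direct sketch of the expected-reward formula above, r(s, a, s') = Σ_r r · p(s', r | s, a) / p(s' | s, a), for one fixed (s, a) pair; the probabilities and rewards are made-up numbers.

# p(s', r | s, a) for one fixed (s, a), as a map (s', r) -> probability; invented numbers.
p_sa = {(0, 0.0): 0.2, (0, 5.0): 0.2, (1, 1.0): 0.6}

def expected_reward(p_sa, s2):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    p_s2 = sum(prob for (nxt, _), prob in p_sa.items() if nxt == s2)
    num = sum(r * prob for (nxt, r), prob in p_sa.items() if nxt == s2)
    return num / p_s2

print(expected_reward(p_sa, 0))   # (0.2*0.0 + 0.2*5.0) / 0.4 = 2.5
print(expected_reward(p_sa, 1))   # 1.0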

(6)

Finite trajectory
- Play a game till it ends in a win / loss / draw: T moves, t = 1, ..., T ("episode")

Expected return at time t:
G_t = R_{t+1} + R_{t+2} + ... + R_T

Action A_t at time t generates reward R_{t+1}: (S_t, A_t) → (S_{t+1}, R_{t+1})
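A minimal sketch of the undiscounted return for a finite episode: given the rewards R_1, ..., R_T of one episode, G_t = R_{t+1} + ... + R_T. The reward list is made up.

def episode_return(rewards, t):
    """G_t = R_{t+1} + ... + R_T, where rewards = [R_1, ..., R_T] (1-indexed in the notes)."""
    return sum(rewards[t:])           # rewards[t:] are R_{t+1}, ..., R_T

rewards = [0.0, 0.0, 1.0, 0.0, -1.0]  # a made-up 5-move episode (T = 5)
print([episode_return(rewards, t) for t in range(len(rewards))])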
(7)

Many situations have no finite episode boundary: an infinite sequence of rewards.

Discounted reward
G_t ≜ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1},   0 ≤ γ ≤ 1

γ = 0: "myopic" (only the immediate reward counts); γ closer to 1: future rewards matter more.
(8)

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
    = R_{t+1} + γ (R_{t+2} + γ R_{t+3} + γ² R_{t+4} + ...)
    = R_{t+1} + γ G_{t+1}

If R_t = 1 always, G_t = Σ_{k=0}^∞ γ^k = 1 / (1 - γ)
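A quick numerical check of the two facts on this page, reusing the discounted_return sketch from above: the recursion G_t = R_{t+1} + γ G_{t+1}, and that a constant reward of 1 gives G_t ≈ 1 / (1 - γ) once the horizon is long. The horizon length and γ are assumptions for the demonstration.

def discounted_return(rewards, t, gamma):
    """G_t = sum_{k>=0} gamma^k * R_{t+k+1}; rewards = [R_1, R_2, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))

gamma = 0.9
rewards = [1.0] * 500                       # R_t = 1 always (long but finite horizon)

# Recursion: G_t = R_{t+1} + gamma * G_{t+1}
g0 = discounted_return(rewards, 0, gamma)
g1 = discounted_return(rewards, 1, gamma)
assert abs(g0 - (rewards[0] + gamma * g1)) < 1e-9

# Constant reward 1: G_t approaches 1 / (1 - gamma)
print(g0, 1 / (1 - gamma))                  # both close to 10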
(9)

To reconcile the finite and infinite cases, for episodic tasks assume that
- terminal states have a single self-loop
- the reward for the self-loop is 0
Then the same discounted-reward definition can be used in both cases.

(10)

The actual "value" of a state depends on the policy.

v_π(s): value of state s w.r.t. policy π (state-value function)
v_π(s) = E_π[ G_t | S_t = s ] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]

q_π(s, a) ≜ E_π[ G_t | S_t = s, A_t = a ]
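Since v_π(s) is defined as an expectation of the return under π, one way to approximate it is by Monte Carlo: run many episodes under the fixed policy starting from s, compute the discounted return of each, and average. This is a sketch only; the toy 2-state MDP, the random policy, and the episode-truncation horizon are all assumptions made for illustration.

import random

# Invented dynamics: p[(s, a)] is a list of (prob, s', r) triples.
p = {
    (0, 'stay'): [(1.0, 0, 0.0)],
    (0, 'go'):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}
policy = {0: {'stay': 0.5, 'go': 0.5}, 1: {'stay': 0.9, 'go': 0.1}}   # pi(a|s), made up

def mc_state_value(s0, gamma=0.9, episodes=5000, horizon=100, seed=0):
    """Monte Carlo estimate of v_pi(s0): average discounted return over sampled episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):                      # truncate the infinite sum
            acts = list(policy[s])
            a = rng.choices(acts, weights=[policy[s][x] for x in acts])[0]
            probs, nexts, rews = zip(*p[(s, a)])
            i = rng.choices(range(len(probs)), weights=probs)[0]
            g += discount * rews[i]
            discount *= gamma
            s = nexts[i]
        total += g
    return total / episodes

print(mc_state_value(0))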

(11)

v_π(s) = E_π[ G_t | S_t = s ]
       = E_π[ R_{t+1} + γ G_{t+1} | S_t = s ]
       = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ E_π[ G_{t+1} | S_{t+1} = s' ] ]

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

(12)

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

Bellman equation for v_π(s).
Find a solution to these (simultaneous) equations.
- Given a policy π, we have Bellman equations to solve for v_π.

What we really want to do is to find the "best" π.
(13)

Best π?
π₁ ≥ π₂ if, for all s, v_{π₁}(s) ≥ v_{π₂}(s)

There exist optimal policies.
Suppose π₁, π₂ are both optimal:
- for all s, v_{π₁}(s) = v_{π₂}(s)
- for all (s, a), q_{π₁}(s, a) = q_{π₂}(s, a)

Hypothetically, there are optimum v*(s), q*(s, a).

(14)

We must have
v*(s) = max_a q*(s, a)
      = max_a E[ G_t | S_t = s, A_t = a ]
      = max_a E[ R_{t+1} + γ G_{t+1} | S_t = s, A_t = a ]
      = max_a E[ R_{t+1} + γ v*(S_{t+1}) | S_t = s, A_t = a ]

v*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v*(s') ]
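The Bellman optimality equation above is no longer linear because of the max over actions, but it can be solved by fixed-point iteration (value iteration): repeatedly apply v(s) ← max_a Σ_{s',r} p(s', r | s, a) [ r + γ v(s') ] until the values stop changing. A sketch on the same invented 2-state MDP as before; value iteration itself is a standard method, not something stated in these notes.

gamma = 0.9
p = {
    (0, 'stay'): [(1.0, 0, 0.0)],
    (0, 'go'):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}
states, actions = [0, 1], ['stay', 'go']

def backup(v, s, a):
    """sum_{s',r} p(s',r|s,a) [ r + gamma * v(s') ]"""
    return sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])

# Value iteration: iterate v(s) <- max_a backup(v, s, a) until it stops changing.
v = {s: 0.0 for s in states}
for _ in range(1000):
    new_v = {s: max(backup(v, s, a) for a in actions) for s in states}
    if max(abs(new_v[s] - v[s]) for s in states) < 1e-10:
        break
    v = new_v
print("v* =", v)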

(15)

v*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v*(s') ]

vs

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

Similarly,
q*(s, a) = Σ_{s',r} p(s', r | s, a) [ r + γ max_{a'} q*(s', a') ]
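A matching sketch for q*: iterate q(s, a) ← Σ_{s',r} p(s', r | s, a) [ r + γ max_{a'} q(s', a') ] to a fixed point, then read off a greedy policy as argmax_a q*(s, a). Same invented MDP as in the earlier sketches; the greedy-policy extraction is a standard consequence of the q* equation rather than something spelled out in the notes.

gamma = 0.9
p = {
    (0, 'stay'): [(1.0, 0, 0.0)],
    (0, 'go'):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}
states, actions = [0, 1], ['stay', 'go']

# Q-value iteration on q*(s,a) = sum_{s',r} p(s',r|s,a) [ r + gamma * max_{a'} q*(s',a') ]
q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(1000):
    new_q = {(s, a): sum(prob * (r + gamma * max(q[(s2, a2)] for a2 in actions))
                         for prob, s2, r in p[(s, a)])
             for s in states for a in actions}
    if max(abs(new_q[k] - q[k]) for k in q) < 1e-10:
        break
    q = new_q

# A greedy policy with respect to q* is optimal.
pi_star = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
print("q* =", q)
print("greedy policy =", pi_star)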
