AML.tt#t2o9 - Reinforcement learning

Academic year: 2024


(1)

AML.tt#t2o9

Reinforcement learning

Agent chooses actions, gets a (probabilistic) reward.
Determine a strategy to choose actions to maximize long-term reward.

Exploitation vs Exploration
- Exploitation: choosing greedily
- Exploration: learn more about the system

(2)

Non-associative model
- Rewards are not based on current state (i.e. only one state)
- Possibility of time-varying rewards
- k-bandit problem: choose among k actions (see the sketch below)

Associative model
- Markov Decision Process
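A minimal sketch of the k-bandit setup with ε-greedy action selection, illustrating the exploitation-vs-exploration trade-off from the previous page. The Gaussian reward distributions, the value of epsilon, and the incremental-average update are illustrative assumptions, not part of the notes.

import random

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """epsilon-greedy agent on a k-armed bandit with Gaussian rewards (assumed setup)."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(k)]   # unknown to the agent
    q_est = [0.0] * k    # estimated value of each action
    counts = [0] * k     # how often each action was taken
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:                      # explore: learn more about the system
            a = rng.randrange(k)
        else:                                           # exploit: choose greedily
            a = max(range(k), key=lambda i: q_est[i])
        r = rng.gauss(true_means[a], 1.0)               # probabilistic reward
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]          # incremental sample average
        total += r
    return q_est, total / steps

if __name__ == "__main__":
    estimates, avg_reward = run_bandit()
    print("average reward:", round(avg_reward, 3))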

(3)

Markov chain: a sequence of states S_1, S_2, ..., S_t, ...
- States
- Transition probability matrix [P_ij]
  P_ij: prob of s_i → s_j, with Σ_j P_ij = 1
- Finite state space, one next-state distribution per state
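A small sketch of a Markov chain represented by its transition probability matrix: each row is the next-state distribution of one state (rows sum to 1), and a trajectory is sampled step by step. The 3-state matrix is an invented example.

import random

# Illustrative 3-state transition matrix P, P[i][j] = prob of s_i -> s_j; each row sums to 1.
P = [
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
]
assert all(abs(sum(row) - 1.0) < 1e-9 for row in P)

def sample_chain(P, s0=0, steps=10, seed=0):
    """Sample a trajectory s_0, s_1, ... using one next-state distribution per state."""
    rng = random.Random(seed)
    s, path = s0, [s0]
    for _ in range(steps):
        s = rng.choices(range(len(P)), weights=P[s])[0]
        path.append(s)
    return path

print(sample_chain(P))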
(4)

Generator
At state s, the agent chooses action a (agent's choice).
Associated probability p(s', r | s, a): taking a in s produces reward r and next state s'.

p(s' | s, a) = Σ_r p(s', r | s, a)
Σ_{s'} p(s' | s, a) = 1
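A sketch of the generator p(s', r | s, a) stored as a dictionary, checking the two identities above: marginalising over r gives p(s' | s, a), and summing that over s' gives 1. The tiny 2-state, 2-action MDP is invented for illustration.

from collections import defaultdict

# p[(s, a)] maps (s', r) -> probability; an invented 2-state, 2-action example.
p = {
    (0, 'a'): {(0, 0.0): 0.5, (1, 1.0): 0.5},
    (0, 'b'): {(1, 0.0): 1.0},
    (1, 'a'): {(0, 2.0): 0.3, (1, 0.0): 0.7},
    (1, 'b'): {(1, 1.0): 1.0},
}

def p_next_state(s, a):
    """p(s' | s, a) = sum_r p(s', r | s, a)."""
    out = defaultdict(float)
    for (s2, r), prob in p[(s, a)].items():
        out[s2] += prob
    return dict(out)

for (s, a) in p:
    dist = p_next_state(s, a)
    assert abs(sum(dist.values()) - 1.0) < 1e-9   # sum_{s'} p(s' | s, a) = 1
    print((s, a), dist)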
(5)

Expected reward
r(s, a, s') = Σ_r r · p(s', r | s, a) / p(s' | s, a)

Idealized setting for "associative" reinforcement learning: state information is available to the agent.

What does it mean to maximize long-term reward?
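A direct sketch of the expected-reward formula above, r(s, a, s') = Σ_r r · p(s', r | s, a) / p(s' | s, a), for one fixed (s, a) pair; the probabilities and rewards are made-up numbers.

# p(s', r | s, a) for one fixed (s, a), as a map (s', r) -> probability; invented numbers.
p_sa = {(0, 0.0): 0.2, (0, 5.0): 0.2, (1, 1.0): 0.6}

def expected_reward(p_sa, s2):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    p_s2 = sum(prob for (nxt, _), prob in p_sa.items() if nxt == s2)
    num = sum(r * prob for (nxt, r), prob in p_sa.items() if nxt == s2)
    return num / p_s2

print(expected_reward(p_sa, 0))   # (0.2*0.0 + 0.2*5.0) / 0.4 = 2.5
print(expected_reward(p_sa, 1))   # 1.0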

(6)

Finite trajectory
- Play a game till it ends in a win / loss / draw: T moves, t = 1, ..., T ("episode")

Expected return at time t:
G_t = R_{t+1} + R_{t+2} + ... + R_T

Action A_t at time t generates reward R_{t+1}: (S_t, A_t) → (S_{t+1}, R_{t+1})
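A minimal sketch of the undiscounted return for a finite episode: given the rewards R_1, ..., R_T of one episode, G_t = R_{t+1} + ... + R_T. The reward list is made up.

def episode_return(rewards, t):
    """G_t = R_{t+1} + ... + R_T, where rewards = [R_1, ..., R_T] (1-indexed in the notes)."""
    return sum(rewards[t:])           # rewards[t:] are R_{t+1}, ..., R_T

rewards = [0.0, 0.0, 1.0, 0.0, -1.0]  # a made-up 5-move episode (T = 5)
print([episode_return(rewards, t) for t in range(len(rewards))])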
(7)

Many situations have no finite episode boundary: an infinite sequence of rewards.

Discounted reward
G_t ≜ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1},   0 ≤ γ ≤ 1

γ = 0: "myopic" (only the immediate reward counts); γ closer to 1: future rewards matter more.
(8)

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
    = R_{t+1} + γ (R_{t+2} + γ R_{t+3} + γ² R_{t+4} + ...)
    = R_{t+1} + γ G_{t+1}

If R_t = 1 always, G_t = Σ_{k=0}^∞ γ^k = 1 / (1 - γ)
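A quick numerical check of the two facts on this page, reusing the discounted_return sketch from above: the recursion G_t = R_{t+1} + γ G_{t+1}, and that a constant reward of 1 gives G_t ≈ 1 / (1 - γ) once the horizon is long. The horizon length and γ are assumptions for the demonstration.

def discounted_return(rewards, t, gamma):
    """G_t = sum_{k>=0} gamma^k * R_{t+k+1}; rewards = [R_1, R_2, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))

gamma = 0.9
rewards = [1.0] * 500                       # R_t = 1 always (long but finite horizon)

# Recursion: G_t = R_{t+1} + gamma * G_{t+1}
g0 = discounted_return(rewards, 0, gamma)
g1 = discounted_return(rewards, 1, gamma)
assert abs(g0 - (rewards[0] + gamma * g1)) < 1e-9

# Constant reward 1: G_t approaches 1 / (1 - gamma)
print(g0, 1 / (1 - gamma))                  # both close to 10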
(9)

To reconcile the finite and infinite cases, for episodic tasks assume that
- terminal states have a single self-loop
- the reward for the self-loop is 0
Then the same discounted-reward definition can be used in both cases.

(10)

The actual "value" of a state depends on the policy.

v_π(s): value of state s w.r.t. policy π (state-value function)
v_π(s) = E_π[ G_t | S_t = s ] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]

q_π(s, a) ≜ E_π[ G_t | S_t = s, A_t = a ]
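Since v_π(s) is defined as an expectation of the return under π, one way to approximate it is by Monte Carlo: run many episodes under the fixed policy starting from s, compute the discounted return of each, and average. This is a sketch only; the toy 2-state MDP, the random policy, and the episode-truncation horizon are all assumptions made for illustration.

import random

# Invented dynamics: p[(s, a)] is a list of (prob, s', r) triples.
p = {
    (0, 'stay'): [(1.0, 0, 0.0)],
    (0, 'go'):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}
policy = {0: {'stay': 0.5, 'go': 0.5}, 1: {'stay': 0.9, 'go': 0.1}}   # pi(a|s), made up

def mc_state_value(s0, gamma=0.9, episodes=5000, horizon=100, seed=0):
    """Monte Carlo estimate of v_pi(s0): average discounted return over sampled episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):                      # truncate the infinite sum
            acts = list(policy[s])
            a = rng.choices(acts, weights=[policy[s][x] for x in acts])[0]
            probs, nexts, rews = zip(*p[(s, a)])
            i = rng.choices(range(len(probs)), weights=probs)[0]
            g += discount * rews[i]
            discount *= gamma
            s = nexts[i]
        total += g
    return total / episodes

print(mc_state_value(0))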

(11)

v_π(s) = E_π[ G_t | S_t = s ]
       = E_π[ R_{t+1} + γ G_{t+1} | S_t = s ]
       = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ E_π[ G_{t+1} | S_{t+1} = s' ] ]

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

(12)

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

Bellman equation for v_π(s).
Find a solution to these (simultaneous) equations.
- Given a policy π, we have Bellman equations to solve for v_π.

What we really want to do is to find the "best" π.
(13)

Best π?
π₁ ≥ π₂ if, for all s, v_{π₁}(s) ≥ v_{π₂}(s)

There exist optimal policies.
Suppose π₁, π₂ are both optimal:
- for all s, v_{π₁}(s) = v_{π₂}(s)
- for all (s, a), q_{π₁}(s, a) = q_{π₂}(s, a)

Hypothetically, there are optimum v*(s), q*(s, a).

(14)

We must have
v*(s) = max_a q*(s, a)
      = max_a E[ G_t | S_t = s, A_t = a ]
      = max_a E[ R_{t+1} + γ G_{t+1} | S_t = s, A_t = a ]
      = max_a E[ R_{t+1} + γ v*(S_{t+1}) | S_t = s, A_t = a ]

v*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v*(s') ]
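The Bellman optimality equation above is no longer linear because of the max over actions, but it can be solved by fixed-point iteration (value iteration): repeatedly apply v(s) ← max_a Σ_{s',r} p(s', r | s, a) [ r + γ v(s') ] until the values stop changing. A sketch on the same invented 2-state MDP as before; value iteration itself is a standard method, not something stated in these notes.

gamma = 0.9
p = {
    (0, 'stay'): [(1.0, 0, 0.0)],
    (0, 'go'):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}
states, actions = [0, 1], ['stay', 'go']

def backup(v, s, a):
    """sum_{s',r} p(s',r|s,a) [ r + gamma * v(s') ]"""
    return sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])

# Value iteration: iterate v(s) <- max_a backup(v, s, a) until it stops changing.
v = {s: 0.0 for s in states}
for _ in range(1000):
    new_v = {s: max(backup(v, s, a) for a in actions) for s in states}
    if max(abs(new_v[s] - v[s]) for s in states) < 1e-10:
        break
    v = new_v
print("v* =", v)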

(15)

v*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v*(s') ]

vs

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

Similarly,
q*(s, a) = Σ_{s',r} p(s', r | s, a) [ r + γ max_{a'} q*(s', a') ]
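A matching sketch for q*: iterate q(s, a) ← Σ_{s',r} p(s', r | s, a) [ r + γ max_{a'} q(s', a') ] to a fixed point, then read off a greedy policy as argmax_a q*(s, a). Same invented MDP as in the earlier sketches; the greedy-policy extraction is a standard consequence of the q* equation rather than something spelled out in the notes.

gamma = 0.9
p = {
    (0, 'stay'): [(1.0, 0, 0.0)],
    (0, 'go'):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}
states, actions = [0, 1], ['stay', 'go']

# Q-value iteration on q*(s,a) = sum_{s',r} p(s',r|s,a) [ r + gamma * max_{a'} q*(s',a') ]
q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(1000):
    new_q = {(s, a): sum(prob * (r + gamma * max(q[(s2, a2)] for a2 in actions))
                         for prob, s2, r in p[(s, a)])
             for s in states for a in actions}
    if max(abs(new_q[k] - q[k]) for k in q) < 1e-10:
        break
    q = new_q

# A greedy policy with respect to q* is optimal.
pi_star = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
print("q* =", q)
print("greedy policy =", pi_star)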
