DYNAMIC PROGRAMMING FOR ACTION IN THE ENVIRONMENT

Academic year: 2024

(1)

DYNAMIC PROGRAMMING FOR ACTION IN THE ENVIRONMENT

Cognitive Dynamic Systems, Simon Haykin

Presented by Joo Yeon Kim & Yunseong Lee

(2)

THE BIG PICTURE

(3)

COGNITIVE DYNAMIC SYSTEMS

Build up rules of behavior over time through learning from continuous experiential interactions with the environment, and thereby deal with environmental uncertainties.

(4)

THE PERCEPTION-ACTION CYCLE

(6)

THE PERCEPTION-ACTION CYCLE

Power Spectrum:

Power Spectrum Estimation (Cognitive Radio)

Bayesian Filtering:

State Estimation (Cognitive Radar)

(7)

THE PERCEPTION-ACTION CYCLE

Scene Analysis

Feedback Channel

(9)

DYNAMIC PROGRAMMING

(10)

WHAT IS DYNAMIC PROGRAMMING?

Dynamic programming is a technique that deals with situations where decisions are made in stages (i.e. different time steps), with the outcome of each decision being predictable to some extent before the next decision is made.

A key aspect of such situations is that decisions cannot be made in isolation. Rather, the desire for a low cost at the present is balanced against the undesirability of a high cost in the future.

(11)

DYNAMIC PROGRAMMING

Markov Decision Processes

Bellman’s Optimality Criterion

Policy Iteration

Value Iteration

(12)

MARKOV DECISION PROCESS

[Figure: grid world with cells marked START, GOAL!!, DEAD END, and BOOBY TRAP.]

You can move UP, DOWN, LEFT, RIGHT

(13)

DETERMINISTIC VS. STOCHASTIC

Deterministic

Every state can be uniquely determined from the model parameters and the previous states.

The system always behaves identically for a given set of initial conditions.

Stochastic

States are determined by probability distributions.

These probability distributions introduce randomness, so the same initial conditions can lead to different outcomes.
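
As a rough illustration of the distinction above, the sketch below contrasts a deterministic transition with a stochastic one. The functions `deterministic_step` and `stochastic_step`, the integer states, and the `slip` probability are all hypothetical and are not taken from the slides.

```python
import random

def deterministic_step(state, action):
    """Deterministic model: the next state is a fixed function of (state, action)."""
    return state + action

def stochastic_step(state, action, rng, slip=0.2):
    """Stochastic model: the move succeeds with probability 1 - slip,
    otherwise the state stays where it is (slip is an assumed parameter)."""
    if rng.random() < slip:
        return state
    return state + action

rng = random.Random(0)  # fixed seed so the run is repeatable
det = [deterministic_step(0, 1) for _ in range(1000)]
sto = [stochastic_step(0, 1, rng) for _ in range(1000)]

print(set(det))             # a single outcome: the same next state every time
print(sum(sto) / len(sto))  # fraction of successful moves, near 1 - slip
```

The deterministic model is fully described by a function, while the stochastic one is described by a distribution over next states, which is what the transition model of an MDP captures.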

(14)

MARKOV DECISION PROCESSES

The Actuator’s Decision-Making Process

Discrete-time & Stochastic

[Figure: loop between the Decision Maker and the Environment, with arrows labeled Action, Reward, and State.]

Set of States: $S = \{s\}$

Set of Actions: $A = \{a\}$, $a: S \to S$

Reward Function: $R(s)$

Transition Model: $T(s, a, s')$ or $P(s' \mid s, a)$

Policy: $\pi(s) \to a$

(15)

MARKOV DECISION PROCESSES

The Actuator’s Decision-Making Process

Discrete-time & Stochastic

[Figure: loop between the Decision Maker and the Environment, with arrows labeled Action, Reward, and State.]

Set of States: $S = \{s\}$

Set of Actions: $A = \{a\}$, $a: S \to S$

Reward Function: $R(\cdot)$

Discount Factor: $\gamma$

Initial State: $s_1$

Markov Property

$P(s_{t+1} \mid a, s_0, \dots, s_t) = P(s_{t+1} \mid a, s_t)$

(17)

MARKOV DECISION PROCESS

[Figure: grid world with cells marked START, GOAL!!, DEAD END, and BOOBY TRAP.]

You can move UP, DOWN, LEFT, RIGHT

Set of States: $S = \{s\}$

Set of Actions: $A = \{a\}$, $a: S \to S$

Reward Function: $R(s)$

Transition Model: $T(s, a, s')$ or $P(s' \mid s, a)$

Policy: $\pi(s) \to a$

Define UTILITY:

$U(s_0, \dots, s_N) = R(s_0) + R(s_1) + \dots + R(s_N)$

(18)

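MARKOV DECISION PROCESS

$U(s_0, \dots, s_N) = R(s_0) + R(s_1) + \dots + R(s_N)$

But N can grow without bound, and then this sum may diverge!

$U(s_0, s_1, \dots) = R(s_0) + \gamma R(s_1) + \dots + \gamma^N R(s_N) + \dots$

Discount Factor: $0 < \gamma < 1$. With bounded rewards, the discounted sum stays finite.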
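
To make the effect of discounting concrete, here is a minimal sketch that evaluates the discounted utility of a reward sequence. The function name `discounted_utility` and the reward values are assumptions chosen for illustration only.

```python
def discounted_utility(rewards, gamma):
    """Return R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

gamma = 0.5
rewards = [1.0] * 50            # hypothetical: reward 1 in every visited state
u = discounted_utility(rewards, gamma)
bound = 1.0 / (1.0 - gamma)     # geometric-series limit for a constant reward of 1

print(u, bound)                 # the discounted sum never exceeds the bound
```

Without the discount factor the same sequence would sum to 50 and keep growing with its length, whereas the discounted sum is capped by the geometric-series limit.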

(20)

MARKOV DECISION PROCESS

$U(s_0, s_1, \dots) = R(s_0) + \gamma R(s_1) + \dots + \gamma^N R(s_N) + \dots$

We move from $s_t$ to $s_{t+1}$ by action $\pi(s_t)$.

GOAL

Find a sequence of actions, $\pi(s_t)$, such that the resulting sequence of states maximizes the total discounted reward, U.

Optimal Policy:

The policy $\pi^*$ that maximizes the expected utility U of the sequence of states generated by $\pi^*$, for all initial states $s$.

(21)

MARKOV DECISION PROCESS

Question:

How do we compute the optimal policy $\pi^*$?

(22)

BELLMAN’S EQUATION

If action a is taken in state s,

$U(s) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s')$

$U(s)$: Utility in state s

$R(s, a)$: Reward when taking action a from state s

$\gamma$: Discount factor

$T(s, a, s')$: Transition probability from s to s’ by taking a

(23)

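BELLMAN’S EQUATION

If action a is taken in state s,

$U(s) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s')$

[Figure: backup diagram. From state s, action a yields the reward R(s, a) and leads to one of the successor states $s'_1$, $s'_2$, … in S’.]

$\gamma \sum_{s'} T(s, a, s')\, U(s')$: Discounted expected value of future rewards from the successor states s’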
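
The right-hand side of Bellman’s equation can be sketched as a single backup for one state and one action. The dictionary layout of `R`, `T`, and `U` and all numbers below are hypothetical and serve only to show the arithmetic.

```python
def bellman_backup(s, a, R, T, U, gamma):
    """U(s) = R(s, a) + gamma * sum over s' of T(s, a, s') * U(s')."""
    expected_future = sum(p * U[s_next] for s_next, p in T[(s, a)].items())
    return R[(s, a)] + gamma * expected_future

# Hypothetical numbers for one state-action pair.
R = {("s", "a"): 1.0}                       # immediate reward R(s, a)
T = {("s", "a"): {"s1": 0.75, "s2": 0.25}}  # T(s, a, s') for each successor
U = {"s1": 4.0, "s2": 0.0}                  # current utilities of the successors

value = bellman_backup("s", "a", R, T, U, gamma=0.5)
print(value)  # 1.0 + 0.5 * (0.75 * 4.0 + 0.25 * 0.0) = 2.5
```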

(25)

BELLMAN’S OPTIMALITY CRITERION

$U(s) = \max_a \left( R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \right)$

[Figure: from state s, each candidate action $a_1$, $a_2$, …, $a_i$ leads to its own successor states and has the value $R(s, a_k) + \gamma \sum_{s'} T(s, a_k, s')\, U(s')$. The maximum of these values is $U(s)$.]

Again, how do we find the optimal policy?

(27)

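POLICY ITERATION

Initialize: $\pi_0$ ← guess

Evaluate: given $\pi_t$, calculate $U_t(s) = R(s) + \gamma \sum_{s'} T(s, \pi_t(s), s')\, U_t(s')$ (Bellman’s eq)

Improve: $\pi_{t+1}(s) = \operatorname{argmax}_a \sum_{s'} T(s, a, s')\, U_t(s')$

[Figure: the Evaluate and Improve steps form a loop between the policy $\pi_t$ and the utility $U_t$.]

Repeat until the utility converges → $U_t(s) \approx U_{t+1}(s)$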
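
The evaluate/improve loop can be sketched as follows on a tiny MDP. The two-state model (`STATES`, `ACTIONS`, `R`, `T`) is a made-up example, and evaluation is done here by repeated sweeps rather than by solving the linear system directly; both choices are assumptions, not something the slides prescribe.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "go"]
R = {0: 0.0, 1: 1.0}                       # hypothetical R(s)
T = {                                      # hypothetical T[s][a] = {s': probability}
    0: {"stay": {0: 1.0}, "go": {1: 0.9, 0: 0.1}},
    1: {"stay": {1: 1.0}, "go": {0: 0.9, 1: 0.1}},
}

def expected_utility(s, a, U):
    """sum over s' of T(s, a, s') * U(s')."""
    return sum(p * U[s2] for s2, p in T[s][a].items())

def evaluate(policy, tol=1e-10):
    """Sweep U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s') until it stops changing."""
    U = {s: 0.0 for s in STATES}
    while True:
        new_U = {s: R[s] + GAMMA * expected_utility(s, policy[s], U) for s in STATES}
        if max(abs(new_U[s] - U[s]) for s in STATES) < tol:
            return new_U
        U = new_U

def improve(U):
    """pi(s) = argmax_a sum_s' T(s, a, s') U(s')."""
    return {s: max(ACTIONS, key=lambda a: expected_utility(s, a, U)) for s in STATES}

def policy_iteration():
    policy = {s: "stay" for s in STATES}   # initial guess pi_0
    while True:
        U = evaluate(policy)
        new_policy = improve(U)
        if new_policy == policy:           # policy is stable, so stop
            return policy, U
        policy = new_policy

policy, U = policy_iteration()
print(policy, U)
```

In this example the loop stops when the improved policy equals the previous one, which is a common stand-in for the utility-convergence test stated on the slide.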

(29)

VALUE ITERATION

Initialize: Start with arbitrary utilities

Update utilities based on neighbors:

$\hat{U}_{t+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, \hat{U}_t(s')$ (Bellman’s eq)

→ Repeat until convergence.

The utility estimates improve with every sweep, because each update folds in more of the true reward R(s).
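
A minimal sketch of the value-iteration update, assuming a small tabular MDP. The two-state model, the stopping tolerance `tol`, and the helper `greedy_policy` are illustrative assumptions rather than part of the slides.

```python
def value_iteration(states, actions, R, T, gamma, tol=1e-10):
    """Repeat U(s) <- R(s) + gamma * max_a sum_s' T(s, a, s') U(s') until stable."""
    U = {s: 0.0 for s in states}            # arbitrary initial utilities
    while True:
        new_U = {}
        for s in states:
            best = max(
                sum(p * U[s2] for s2, p in T[s][a].items()) for a in actions
            )
            new_U[s] = R[s] + gamma * best
        if max(abs(new_U[s] - U[s]) for s in states) < tol:
            return new_U
        U = new_U

def greedy_policy(states, actions, T, U):
    """Read a policy off the utilities: pick the action with the best expected U."""
    return {
        s: max(actions, key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
        for s in states
    }

# Hypothetical two-state MDP: state 1 pays reward 1, state 0 pays nothing.
states, actions = [0, 1], ["stay", "go"]
R = {0: 0.0, 1: 1.0}
T = {
    0: {"stay": {0: 1.0}, "go": {1: 0.9, 0: 0.1}},
    1: {"stay": {1: 1.0}, "go": {0: 0.9, 1: 0.1}},
}

U = value_iteration(states, actions, R, T, gamma=0.9)
policy = greedy_policy(states, actions, T, U)
print(policy, U)
```

Unlike policy iteration, no explicit policy is kept during the sweeps; the policy is only extracted from the converged utilities at the end.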

(30)

CHECKPOINT

Markov Decision Process

The decision-making process in a cognitive dynamic system

Bellman’s Optimality Criterion

Defines the optimization problem the decision maker must solve

Policy Iteration

Finds the optimal policy by alternately evaluating and improving the policy

Value Iteration

Finds the optimal policy by iteratively estimating the utility

(31)

INTERMISSION

(32)

REINFORCEMENT LEARNING

as Dynamic Programming

(34)

REINFORCEMENT LEARNING RECAP

[Figure: loop between the Agent and the Environment, with arrows labeled Action, Reward, and State.]

Model-Based

Model-Free

(35)

REINFORCEMENT LEARNING RECAP

Model-Based (MDP)

[Figure: loop between the Decision Maker and the Environment, with arrows labeled Action, Reward, and State.]

Set of States: $S = \{s\}$

Set of Actions: $A = \{a\}$, $a: S \to S$

Reward Function: $R(s)$

Transition Model: $T(s, a, s')$ or $P(s' \mid s, a)$

Policy: $\pi(s) \to a$

(36)

REINFORCEMENT LEARNING RECAP

Model-Free

[Figure: loop between the Decision Maker and the Environment, with arrows labeled Action, Reward, and State.]

Set of States: $S = \{s\}$

Set of Actions: $A = \{a\}$, $a: S \to S$

Reward Function: $R(s)$

Transition Model: $T(s, a, s')$ or $P(s' \mid s, a)$

Policy: $\pi(s) \to a$

There is no explicit model for R and T.

We instead have explicit transition data: $\langle s, a, s', r \rangle$

(37)

MODEL-FREE REINFORCEMENT LEARNING

Temporal Difference Learning

Q-Learning

(39)

TEMPORAL DIFFERENCE (TD) LEARNING

$U(s) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s')$

Upon an action $a = \pi(s)$:

For each successor s’ of s, the updated $U(s)$ must lie “in between”

a) the new value considering only s’: $R(s) + \gamma U(s')$

b) the old value $U(s)$

The Notion of Temporal Difference

(40)

TEMPORAL DIFFERENCE (TD) LEARNING

$U(s) := (1 - \alpha)\, U(s) + \alpha \left( R(s) + \gamma U(s') \right)$

The new approximation of $U(s)$ using $\alpha$ when moving from a state s to another state s’

Rearrange the above equation to update $U(s)$:

$U(s) := U(s) + \alpha \left( R(s) + \gamma U(s') - U(s) \right)$

(42)

TEMPORAL DIFFERENCE (TD) LEARNING

$U(s) := U(s) + \alpha \left( R(s) + \gamma U(s') - U(s) \right)$

Learning Rate: $0 \le \alpha < 1$

But this says nothing about the optimal policy, $\pi^*$
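
A single TD update can be sketched as below. The function `td_update`, the state names, and the numbers are hypothetical; the point is only that the new estimate lands between the old value and the one-step target.

```python
def td_update(U, s, reward, s_next, alpha, gamma):
    """U(s) := U(s) + alpha * (R(s) + gamma * U(s') - U(s)); returns the TD error."""
    td_error = reward + gamma * U[s_next] - U[s]
    U[s] += alpha * td_error
    return td_error

U = {"A": 0.0, "B": 2.0}   # hypothetical current utility estimates
err = td_update(U, "A", reward=1.0, s_next="B", alpha=0.5, gamma=0.5)

# One-step target: 1.0 + 0.5 * 2.0 = 2.0, old value: 0.0.
print(err, U["A"])  # TD error 2.0; U("A") moves halfway to the target, to 1.0
```

Only one observed transition is needed for the update, so no transition model T is required.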

(43)

Q-LEARNING

$U(s) := U(s) + \alpha \left( R(s) + \gamma U(s') - U(s) \right)$

A policy is about taking an action a from a state s

→ Let’s introduce the action into $U(s)$!

$Q(s, a)$: Expected sum of future discounted rewards after taking action a at state s

$Q(s, a) := Q(s, a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$

(44)

Q-LEARNING

$Q(s, a) := Q(s, a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$

Finding the optimal policy is now a matter of choosing, in each state, the action that maximizes the Q-value:

$\pi(s) = \operatorname{argmax}_a Q(s, a)$
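
As a minimal sketch of tabular Q-learning, the code below learns on a short corridor. The environment (`step`, `N_STATES`, `GOAL`), the epsilon-greedy exploration with random tie-breaking, and all hyperparameters are assumptions introduced for illustration; the slides only give the update rule and the argmax policy. The reward here comes from the observed transition tuple rather than from a known R(s).

```python
import random

N_STATES, GOAL = 4, 3
ACTIONS = [-1, +1]                 # move left or right along a corridor

def step(s, a):
    """Hypothetical environment: returns (next_state, reward)."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, (1.0 if s_next == GOAL else 0.0)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)                  # explore
            else:                                        # exploit, ties broken at random
                a = max(ACTIONS, key=lambda act: (Q[(s, act)], rng.random()))
            s_next, r = step(s, a)
            best_next = max(Q[(s_next, act)] for act in ACTIONS)
            # Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning()
# pi(s) = argmax_a Q(s, a) for every non-goal state
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES) if s != GOAL}
print(policy)
```

The learner never consults a transition model; it only uses the tuples of state, action, next state, and reward that it observes while acting.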

(45)

EXAMPLE

(46)

SCENARIO

[Figure: grid world with cells marked START, GOAL!!, DEAD END, and BOOBY TRAP.]

You can move UP, DOWN, LEFT, RIGHT

(47)

PACMAN

(49)

SUMMARY

Reinforcement Learning

Model-Based: MDP

- Bellman Equation

- Value Iteration

- Policy Iteration

Model-Free

- TD Learning

- Q-Learning

(50)

DISCUSSION

(51)

QUESTIONS FOR THOUGHT

Q1. What if the problem we are trying to solve is large scale?

⇒ The state space is large!

⇒ e.g., value iteration may take forever!

⇒ Examining the entire training dataset is endless!

Q2. What if our solution “overfits” the data given by the environment?

(52)

APPROXIMATE

DYNAMIC PROGRAMMING

(53)

LINEAR APPROXIMATION APPROACH

The curse-of-dimensionality problem: the exponential growth of computational complexity with increasing dimensionality of the state space.

a) Abandon the idealized notion of optimality and be content with a suboptimal solution.

b) Approximate functions for data: $\langle s, a, s', r \rangle$

- can prevent overfitting
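
One way to read the linear approximation idea is to replace the utility table with a weighted sum of features and to adjust the weights from transition data. The sketch below uses a semi-gradient TD update for that purpose; this particular technique, the feature map `features`, and all numbers are assumptions, since the slides do not specify how the approximation is fitted.

```python
def features(s):
    """Hypothetical feature map phi(s): a bias term plus the state index."""
    return [1.0, float(s)]

def predict(w, s):
    """Linear utility estimate: U(s) is approximated by w . phi(s)."""
    return sum(wi * xi for wi, xi in zip(w, features(s)))

def td_linear_update(w, s, r, s_next, alpha, gamma):
    """One semi-gradient TD step on the weights, using one transition <s, a, s', r>."""
    td_error = r + gamma * predict(w, s_next) - predict(w, s)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, features(s))]

w = [0.0, 0.0]   # two weights, no matter how many states exist
w = td_linear_update(w, s=2, r=1.0, s_next=3, alpha=0.1, gamma=0.9)
print(w)         # each weight moves by alpha * td_error * feature value
```

The number of parameters depends on the number of features rather than on the number of states, which is how this kind of approximation sidesteps a table that grows with the state space.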

(54)

Q & A

(55)

APPENDIX

(56)

NOTATION & TERMINOLOGY

We found the notation used in the book tricky to understand, so we decided to use equivalent alternatives that are easier to follow.

<THE SLIDES>

$s \in S$: State

$\pi(s)$: Policy

$U(s)$: Utility

$R(s, a)$: Reward (or $R(s)$)

$T(s, a, s')$: Transition Probability

<TEXTBOOK>

$i \sim X_n$: State

$\mu(i)$: Policy

$J^{\mu}(i)$: Cost-to-go function

$g(X_n, \mu_n, X_{n+1})$: Transition Cost

$p_{ij}(\mu(i))$: Transition Probability
