DYNAMIC PROGRAMMING FOR ACTION IN THE ENVIRONMENT

(1)

Cognitive Dynamic Systems, Simon Haykin

Presented by Joo Yeon Kim & Yunseong Lee

(2)

THE BIG PICTURE

(3)

COGNITIVE DYNAMIC SYSTEMS

Build up rules of behavior over time through learning from continuous experiential interactions with the environment, and thereby deal with environmental uncertainties.

(4)

THE PERCEPTION-ACTION CYCLE

(5)

(6)

Power Spectrum:

Power Spectrum Estimation (Cognitive Radio)

Bayesian Filtering:

State Estimation (Cognitive Radar)

(7)

Scene Analysis

Feedback Channel

(8)

(9)

DYNAMIC PROGRAMMING

(10)

WHAT IS DYNAMIC PROGRAMMING?

• Dynamic programming is a technique that deals with situations where decisions are made in stages (i.e. different time steps), with the outcome of each decision being predictable to some extent before the next decision is made.

• A key aspect of such situations is that decisions cannot be made in isolation. Rather, the desire for a low cost at the present is balanced against the undesirability of a high cost in the future.

(11)

DYNAMIC PROGRAMMING

• Markov Decision Processes

• Bellman’s Optimality Criterion

• Policy Iteration

• Value Iteration

(12)

MARKOV DECISION PROCESS

[Figure: grid world with START, GOAL, DEAD END, and BOOBY TRAP cells]

You can move UP, DOWN, LEFT, RIGHT

(13)

DETERMINISTIC VS. STOCHASTIC

Deterministic

• Every state is uniquely determined by the model parameters and the previous states.

• The system always behaves identically for a given set of initial conditions.

Stochastic

• States are determined by probability distributions.

• Such probability distributions capture a notion of randomness.

(14)

MARKOV DECISION PROCESSES

The Actuator’s Decision-Making Process

Discrete-time & Stochastic

[Diagram: Decision Maker and Environment in a loop, linked by Action, State, and Reward]

• Set of States: $S = \{s\}$

• Set of Actions: $A = \{a\}$, $a: S \to S$

• Reward Function: $R(s)$

• Transition Model: $T(s, a, s')$ or $P(s' \mid s, a)$

• Policy: $\pi(s) \to a$

(15)

• Reward Function: $R(\cdot)$

• Discount Factor: $\gamma$

• Initial State: $s_1$

Markov Property

$P(s_{t+1} \mid a, s_0, \ldots, s_t) = P(s_{t+1} \mid a, s_t)$
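
A minimal Python sketch of the MDP ingredients (states, actions, reward, transition model, discount factor, initial state). The three-state toy problem, the action names `safe` and `risky`, and every number are hypothetical and not taken from the slides. The `step` method samples the next state from $P(s' \mid s, a)$ using only the current state and action, which is the Markov property expressed in code.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    reward: dict        # R(s)
    transition: dict    # T[(s, a)] -> {s': P(s' | s, a)}
    gamma: float        # discount factor
    initial_state: str

    def step(self, s, a, rng):
        """Sample s' ~ P(. | s, a); no earlier history is consulted."""
        successors = self.transition[(s, a)]
        next_states = list(successors)
        weights = [successors[n] for n in next_states]
        return rng.choices(next_states, weights=weights, k=1)[0]


# Hypothetical toy problem: "goal" and "trap" are absorbing states.
toy = MDP(
    states=["start", "goal", "trap"],
    actions=["safe", "risky"],
    reward={"start": -0.04, "goal": 1.0, "trap": -1.0},
    transition={
        ("start", "safe"): {"start": 0.5, "goal": 0.5},
        ("start", "risky"): {"goal": 0.8, "trap": 0.2},
        ("goal", "safe"): {"goal": 1.0},
        ("goal", "risky"): {"goal": 1.0},
        ("trap", "safe"): {"trap": 1.0},
        ("trap", "risky"): {"trap": 1.0},
    },
    gamma=0.9,
    initial_state="start",
)

rng = random.Random(0)
print(toy.step(toy.initial_state, "risky", rng))
```

Storing each row of the transition model as a dictionary keeps the sketch readable; a larger problem would more likely use arrays.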

(16)

[Figure: grid world with START, GOAL, DEAD END, and BOOBY TRAP cells]

(17)

Define UTILITY:

$U(s_0, \ldots, s_N) = R(s_0) + R(s_1) + \cdots + R(s_N)$

(18)

$U(s_0, \ldots, s_N) = R(s_0) + R(s_1) + \cdots + R(s_N)$

But $N$ can grow without bound, so this sum may diverge!

$U(s_0, \ldots) = R(s_0) + \gamma R(s_1) + \cdots + \gamma^N R(s_N) + \cdots$

Discount Factor: $0 < \gamma < 1$
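
A small Python sketch of the discounted sum, assuming a finite list of rewards stands in for the (possibly infinite) sequence; the function name `discounted_utility` is illustrative. It also checks numerically that with a constant reward of 1 and $\gamma = 0.9$ the sum stays below $1 / (1 - \gamma) = 10$, which is why discounting keeps the utility finite.

```python
def discounted_utility(rewards, gamma):
    """Return R(s_0) + gamma*R(s_1) + gamma^2*R(s_2) + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(discounted_utility([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75

# Constant reward 1, gamma = 0.9: the geometric series is bounded by 1 / (1 - 0.9) = 10.
bounded = discounted_utility([1.0] * 50, 0.9)
print(bounded < 10.0)
```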

(19)

$U(s_0, \ldots) = R(s_0) + \gamma R(s_1) + \cdots + \gamma^N R(s_N) + \cdots$

We move from $s_t$ to $s_{t+1}$ by action $\pi(s_t)$

GOAL

Find a sequence of actions, $\pi(s_t)$, such that the resulting sequence of states maximizes the total discounted reward, $U$

(20)

Optimal Policy:

The policy $\pi^*$ that maximizes the expected utility $U$ of the sequence of states generated by $\pi^*$, for all initial states $s$

(21)

Question:

How do we compute the optimal policy $\pi^*$?

(22)

BELLMAN’S EQUATION

• If action $a$ is taken in state $s$,

$U(s) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \, U(s')$

$U(s)$: Utility in state $s$

$R(s, a)$: Reward when taking action $a$ from state $s$

$\gamma$: Discount factor

$T(s, a, s')$: Transition probability from $s$ to $s'$ by taking $a$
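
As a worked illustration, the right-hand side of Bellman's equation for one fixed action can be sketched in a few lines of Python. The dictionary layout (`R[(s, a)]`, `T[(s, a)][s']`) and the numbers are assumptions made for this example only.

```python
def bellman_backup(s, a, R, T, U, gamma):
    """Return R(s, a) + gamma * sum over s' of T(s, a, s') * U(s')."""
    expected_future = sum(prob * U[s_next] for s_next, prob in T[(s, a)].items())
    return R[(s, a)] + gamma * expected_future


# Hypothetical numbers: action "a" in state "s" leads to "x" or "y" with equal probability.
R = {("s", "a"): 1.0}
T = {("s", "a"): {"x": 0.5, "y": 0.5}}
U = {"x": 2.0, "y": 4.0}

print(bellman_backup("s", "a", R, T, U, gamma=0.5))  # 1.0 + 0.5 * (0.5*2.0 + 0.5*4.0) = 2.5
```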

(23)

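[Diagram: from state $s$, action $a$ yields reward $R(s, a)$ and leads to one of the successor states $s'_1, s'_2, \ldots$]

$\gamma \sum_{s'} T(s, a, s') \, U(s')$: Discounted expected value of future rewards from the successor states $s'$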

(24)

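BELLMAN’S OPTIMALITY CRITERION

$U(s) = \max_a \left( R(s, a) + \gamma \sum_{s'} T(s, a, s') \, U(s') \right)$

[Diagram: from state $s$, each candidate action $a_1, a_2, a_i$ leads to successor states $s'_1, s'_2, s'_i$ and is scored as follows]

$R(s, a_1) + \gamma \sum_{s'} T(s, a_1, s'_1) \, U(s'_1)$

$R(s, a_2) + \gamma \sum_{s'} T(s, a_2, s'_2) \, U(s'_2)$

$R(s, a_i) + \gamma \sum_{s'} T(s, a_i, s'_i) \, U(s'_i)$

The maximum value → $U(s)$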
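
The maximization over actions can be sketched in Python as follows. This is an illustrative example under assumed data: the action names `a1` and `a2`, the dictionary layout, and the numbers are hypothetical.

```python
def optimal_backup(s, actions, R, T, U, gamma):
    """Score every action with the Bellman backup and keep the best one."""
    def score(a):
        return R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())

    best_action = max(actions, key=score)
    return score(best_action), best_action


# Hypothetical numbers: a1 pays nothing now but reaches a valuable state,
# a2 pays 1 now but reaches a poor state.
R = {("s", "a1"): 0.0, ("s", "a2"): 1.0}
T = {("s", "a1"): {"x": 1.0}, ("s", "a2"): {"y": 1.0}}
U = {"x": 10.0, "y": 2.0}

value, action = optimal_backup("s", ["a1", "a2"], R, T, U, gamma=0.5)
print(value, action)  # 5.0 a1
```

Here `a1` scores 0 + 0.5 * 10 = 5 and `a2` scores 1 + 0.5 * 2 = 2, so the low immediate reward is outweighed by the better future.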

(25)

Again, how do we find the optimal policy?

(26)

POLICY ITERATION

Initialize: $\pi_0 \leftarrow$ guess

Evaluate: given $\pi_t$, calculate $U_t(s) = R(s) + \gamma \sum_{s'} T(s, \pi_t(s), s') \, U_t(s')$ (Bellman’s eq)

Improve: $\pi_{t+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \, U_t(s')$

(27)

[Diagram: a loop alternating Evaluate and Improve; $\pi_t$: Policy, $U_t$: Utility]

Repeat until the utility converges → $U_t(s) \approx U_{t+1}(s)$
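
The evaluate/improve loop can be sketched in Python as below. This is a sketch under stated assumptions rather than a reference implementation: the two-state problem and its numbers are hypothetical, evaluation is done by repeated sweeps instead of solving the linear system, and the loop stops when the policy no longer changes, which is a common stopping rule (an unchanged policy also leaves the utility unchanged).

```python
def evaluate_policy(policy, states, R, T, gamma, tol=1e-10):
    """Iterate U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s') to a fixed point."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            expected = sum(p * U[s2] for s2, p in T[(s, policy[s])].items())
            new_value = R[s] + gamma * expected
            delta = max(delta, abs(new_value - U[s]))
            U[s] = new_value
        if delta < tol:
            return U


def improve_policy(U, states, actions, T):
    """pi(s) = argmax_a sum_s' T(s, a, s') U(s')."""
    def expected_utility(s, a):
        return sum(p * U[s2] for s2, p in T[(s, a)].items())

    return {s: max(actions, key=lambda a: expected_utility(s, a)) for s in states}


def policy_iteration(states, actions, R, T, gamma):
    policy = {s: actions[0] for s in states}  # initial guess
    while True:
        U = evaluate_policy(policy, states, R, T, gamma)
        new_policy = improve_policy(U, states, actions, T)
        if new_policy == policy:
            return policy, U
        policy = new_policy


# Hypothetical two-state problem: being in B pays 1, being in A pays 0.
states = ["A", "B"]
actions = ["stay", "go"]
R = {"A": 0.0, "B": 1.0}
T = {
    ("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0},
}

policy, U = policy_iteration(states, actions, R, T, gamma=0.5)
print(policy)  # {'A': 'go', 'B': 'stay'}
```

With $\gamma = 0.5$ the utilities settle at $U(B) = 1 + 0.5 \cdot U(B) = 2$ and $U(A) = 0.5 \cdot U(B) = 1$.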

(28)

VALUE ITERATION

Initialize: Start with arbitrary utilities

Update utilities based on neighbors:

$\hat{U}_{t+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s') \, \hat{U}_t(s')$ (Bellman’s eq)

(29)

→ Repeat until convergence.

The utility estimates improve with every sweep, as the true rewards $R(s)$ propagate through the updates.
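
A minimal Python sketch of value iteration, assuming the same kind of dictionary-based model as a stand-in for $R$ and $T$; the two-state problem and its numbers are hypothetical. Utilities start at zero (any starting values would do), every state is updated from the previous sweep, and the greedy policy is read off at the end.

```python
def value_iteration(states, actions, R, T, gamma, tol=1e-10):
    """Repeat U(s) <- R(s) + gamma * max_a sum_s' T(s, a, s') U(s') until it settles."""
    U = {s: 0.0 for s in states}  # arbitrary starting utilities
    while True:
        new_U = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[(s, a)].items()) for a in actions
            )
            for s in states
        }
        if max(abs(new_U[s] - U[s]) for s in states) < tol:
            return new_U
        U = new_U


def greedy_policy(U, states, actions, T):
    """Pick, in each state, the action with the highest expected utility."""
    def expected_utility(s, a):
        return sum(p * U[s2] for s2, p in T[(s, a)].items())

    return {s: max(actions, key=lambda a: expected_utility(s, a)) for s in states}


# Hypothetical two-state problem: being in B pays 1, being in A pays 0.
states = ["A", "B"]
actions = ["stay", "go"]
R = {"A": 0.0, "B": 1.0}
T = {
    ("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0},
}

U = value_iteration(states, actions, R, T, gamma=0.5)
print(greedy_policy(U, states, actions, T))  # {'A': 'go', 'B': 'stay'}
```

Unlike policy iteration, no explicit policy is kept during the loop; the maximization over actions sits inside the utility update itself.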

(30)

CHECKPOINT

• Markov Decision Process

Decision-making process in a cognitive dynamic system

• Bellman’s Optimality Criterion

Defines the optimization problem the decision-maker must solve

• Policy Iteration

Finds the optimal policy by alternately evaluating and improving the policy

• Value Iteration

Finds the optimal policy by iteratively estimating the utility

(31)

INTERMISSION

(32)

REINFORCEMENT LEARNING

as Dynamic Programming

(33)

REINFORCEMENT LEARNING RECAP

[Diagram: Agent and Environment in a loop, linked by Action, State, and Reward]

(34)

Model-Based Model-Free

(35)

Model-Based (MDP)

(36)

Model-Free

There is no explicit model for $R$ and $T$. We instead have explicit transition data:

$\langle s, a, s', r \rangle$

(37)

MODEL-FREE REINFORCEMENT LEARNING

• Temporal Difference Learning

• Q-Learning

(38)

TEMPORAL DIFFERENCE (TD) LEARNING

$U(s) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \, U(s')$

Upon an action $a = \pi(s)$

For every successor $s'$ of $s$, $U(s)$ must be “in between”

a) the new value considering only $s'$: $R(s) + \gamma U(s')$

b) the old value $U(s)$

(39)

The Notion of Temporal Difference

(40)

$U(s) := (1 - \alpha) \, U(s) + \alpha \, (R(s) + \gamma U(s'))$

The new approximation of $U(s)$ using $\alpha$ when moving from a state $s$ to another state $s'$

Rearrange the above equation to update $U(s)$:

$U(s) := U(s) + \alpha \, (R(s) + \gamma U(s') - U(s))$

(41)

$U(s) := U(s) + \alpha \, (R(s) + \gamma U(s') - U(s))$

Learning Rate: $0 \le \alpha < 1$
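
A single TD update can be sketched in Python as follows; the state names and numbers are hypothetical. The example shows the "in between" behavior: with $\alpha = 0.5$ the new estimate lands halfway between the old value and the one-step target $R(s) + \gamma U(s')$.

```python
def td_update(U, s, reward, s_next, alpha, gamma):
    """Apply U(s) := U(s) + alpha * (R(s) + gamma * U(s') - U(s)) in place."""
    target = reward + gamma * U[s_next]   # new value considering only s'
    U[s] += alpha * (target - U[s])       # move the old value toward the target
    return U[s]


U = {"A": 0.0, "B": 2.0}
new_value = td_update(U, "A", reward=0.0, s_next="B", alpha=0.5, gamma=0.5)
print(new_value)  # 0.5, halfway between the old value 0.0 and the target 1.0
```

Note that only one observed transition is used; no transition probabilities appear anywhere in the update.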

(42)

But this says nothing about the optimal policy, $\pi^*$

(43)

Q-LEARNING

$U(s) := U(s) + \alpha \, (R(s) + \gamma U(s') - U(s))$

Policy is about taking an action $a$ from a state $s$

→ Let’s introduce the action into $U(s)$!

$Q(s, a)$: Expected sum of future discounted rewards after taking action $a$ at state $s$

$Q(s, a) := Q(s, a) + \alpha \, (R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a))$

(44)

Finding the optimal policy is now a matter of choosing, in each state, the action that maximizes the Q-value:

$\pi(s) = \arg\max_a Q(s, a)$
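
A minimal tabular Q-learning sketch in Python, under stated assumptions: the two-state environment, its deterministic `step` function, purely random exploration, and all parameter values are hypothetical choices made for illustration. The learner never reads a transition model; it only observes which state follows each action.

```python
import random


def q_learning(states, actions, R, step, gamma, alpha, n_steps, rng):
    """Learn Q(s, a) from sampled transitions <s, a, s', r> only."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    s = states[0]
    for _ in range(n_steps):
        a = rng.choice(actions)                 # explore with random actions
        s_next = step(s, a)                     # observe the successor state
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (R[s] + gamma * best_next - Q[(s, a)])
        s = s_next
    return Q


# Hypothetical environment: "go" switches state, "stay" keeps it; B pays 1, A pays 0.
states = ["A", "B"]
actions = ["stay", "go"]
R = {"A": 0.0, "B": 1.0}
NEXT = {("A", "stay"): "A", ("A", "go"): "B", ("B", "stay"): "B", ("B", "go"): "A"}


def step(s, a):
    return NEXT[(s, a)]


Q = q_learning(states, actions, R, step, gamma=0.5, alpha=0.5,
               n_steps=2000, rng=random.Random(0))

# Read the policy off the table: pi(s) = argmax_a Q(s, a).
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
print(policy)  # {'A': 'go', 'B': 'stay'}
```

Because the update uses the maximum over next actions rather than the action actually taken next, the random behavior during learning does not prevent the table from approaching the optimal Q-values.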

(45)

EXAMPLE

(46)

SCENARIO

[Figure: grid world with START, GOAL, DEAD END, and BOOBY TRAP cells]

(47)

PACMAN

(48)

SUMMARY

(49)

SUMMARY

Reinforcement Learning

Model-Based: MDP

- Bellman Equation

- Value Iteration

- Policy Iteration

Model-Free:

- TD Learning

- Q-Learning

(50)

DISCUSSION

(51)

QUESTIONS FOR THOUGHT

Q1. What if the problem we are trying to solve is large scale?

⇒ State space is large!

⇒ e.g., value iteration may take forever!

⇒ examining the entire training dataset is endless!

Q2. What if our solution “overfits” data given by the environment?

(52)

APPROXIMATE

DYNAMIC PROGRAMMING

(53)

LINEAR APPROXIMATION APPROACH

The curse-of-dimensionality problem: the exponential growth of computational complexity with increasing dimensionality of the state space.

a) Abandon the idealized notion of optimality and be content with a suboptimal solution

b) Approximate functions for data: $\langle s, a, s', r \rangle$

- can prevent overfitting
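
One possible way to approximate the utility from transition data is a linear model $U(s) \approx w \cdot \phi(s)$ trained with a TD-style update. The Python sketch below is an assumption-laden illustration, not the method prescribed by the slides or the textbook: the feature vectors, the function names, and the numbers are hypothetical.

```python
def linear_value(w, features):
    """Approximate utility: dot product of weights and state features."""
    return sum(wi * fi for wi, fi in zip(w, features))


def td_linear_update(w, phi_s, r, phi_next, alpha, gamma):
    """One TD step on the weights for a transition <s, a, s', r>."""
    td_error = r + gamma * linear_value(w, phi_next) - linear_value(w, phi_s)
    # Each weight moves in proportion to its feature's share in U(s).
    return [wi + alpha * td_error * fi for wi, fi in zip(w, phi_s)]


# Hypothetical two-feature description of the states s and s'.
w = [0.0, 0.0]
w = td_linear_update(w, phi_s=[1.0, 0.5], r=1.0, phi_next=[0.0, 1.0],
                     alpha=0.5, gamma=0.9)
print(w)  # [0.5, 0.25]
```

The number of parameters equals the number of features rather than the number of states, which is what makes this approach attractive when the state space is large.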

(54)

Q & A

(55)

APPENDIX

(56)

NOTATION & TERMINOLOGY

• We found the notation used in the book difficult to follow, so we chose equivalent alternatives that are easier to read.

<THE SLIDES>

• $s \in S$: State

• $\pi(s)$: Policy

• $U(s)$: Utility

• $R(s, a)$: Reward (or $R(s)$)

• $T(s, a, s')$: Transition Probability

<TEXTBOOK>

• $i \sim X_n$: State

• $\mu(i)$: Policy

• $J^{\mu}(i)$: Cost-to-go function

• $g(X_n, \mu_n, X_{n+1})$: Transition Cost

• $p_{ij}(\mu(i))$: Transition Probability