
Transcript of Reinforcement Learning: Elementary Solution Methods. Lecturer: Tai-Wen Yu (虞台文), Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University.

Page 1: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Reinforcement Learning: Elementary Solution Methods

Lecturer: Tai-Wen Yu (虞台文)

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Page 2: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Content

Introduction
Dynamic Programming
Monte Carlo Methods
Temporal Difference Learning

Page 3: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Reinforcement Learning: Elementary Solution Methods

Introduction

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Page 4: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Basic Methods

Dynamic programming
– well developed, but requires a complete and accurate model of the environment

Monte Carlo methods
– don't require a model and are conceptually very simple, but are not suited for step-by-step incremental computation

Temporal-difference learning
– requires no model and is fully incremental, but is more complex to analyze

Q-Learning

Page 5: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Reinforcement Learning: Elementary Solution Methods

Dynamic Programming

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Page 6: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Dynamic Programming

A collection of algorithms that can be used to compute optimal policies given a perfect model of the environment.– e.g., a Markov decision process (MDP).

Theoretically important– An essential foundation for the understanding of

other methods.– Other methods attempt to achieve much the same

effect as DP, only with less computation and without assuming a perfect model of the environment.

Page 7: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Finite MDP Environments

An MDP consists of:
– a set of finite states S (or S⁺),
– a set of finite actions A,
– a transition distribution
  P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a },  for s, s' ∈ S⁺ and a ∈ A,
– expected immediate rewards
  R^a_{ss'} = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ]

Page 8: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Review

Return:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯ = Σ_{k=0}^{∞} γ^k r_{t+k+1}

State-value function for policy π:

V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]

Bellman equation for V^π:

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Bellman optimality equation:

V*(s) = max_{a∈A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

Page 9: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Methods of Dynamic Programming

Policy Evaluation
Policy Improvement
Policy Iteration
Value Iteration
Asynchronous DP

Page 10: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Policy Evaluation

Given a policy π, compute the state-value function V^π.

Bellman equation for V^π:

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

This is a system of |S| linear equations. It can be solved directly, but doing so may be tedious, so we'll use an iterative method.

Page 11: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Iterative Policy Evaluation

V_0 → V_1 → ⋯ → V_k → V_{k+1} → ⋯ → V^π

Each arrow is a "sweep": a sweep consists of applying a backup operation to each state.

Bellman equation for V^π:

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Full backup:

V_{k+1}(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Page 12: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

The Algorithm: Policy Evaluation

Input π, the policy to be evaluated
Initialize V(s) = 0 for all s ∈ S⁺
Repeat
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ]
    Δ ← max(Δ, |v − V(s)|)
Until Δ < θ (a small positive number)
Output V ≈ V^π
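As a concrete illustration, the loop above can be written in a few lines of Python. This is a minimal sketch, assuming a tabular MDP stored as nested arrays P[s][a][s2] and R[s][a][s2] and a stochastic policy policy[s][a]; these names are not from the slides.

```python
import numpy as np

def policy_evaluation(policy, P, R, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation of V^pi for a finite MDP.

    policy[s][a]  -- probability of taking action a in state s
    P[s][a][s2]   -- transition probability P^a_{s s2}
    R[s][a][s2]   -- expected immediate reward R^a_{s s2}
    """
    n_states = len(P)
    V = np.zeros(n_states)                      # V(s) = 0 for all s
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # full backup: sum over actions and successor states
            V[s] = sum(policy[s][a] * P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                       for a in range(len(policy[s]))
                       for s2 in range(n_states))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                       # stop when a sweep changes little
            return V
```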

Page 13: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Grid World)

Possible actions from any state s: A = {up, down, left, right}
Terminal states in the top-left and bottom-right corners (treated as one state)
Reward is −1 on all transitions until the terminal state is reached
All values are initialized to 0
Actions that would move off the grid leave the state unchanged

Page 14: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Grid World)

We start with the equiprobable random policy; in the end we obtain the optimal policy.

Page 15: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Policy Improvement

Consider V^π for a deterministic policy π.

Under what condition would it be better to take an action a ≠ π(s) when we are in state s?

The action-value of taking a in state s (and thereafter following π) is:

Q^π(s, a) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a ]
          = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Is it better to switch to action a if Q^π(s, a) > V^π(s)?

Page 16: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Policy Improvement

Suppose Q^π(s, a) > V^π(s).

Let π' be a policy identical to π except in state s, with π'(s) = a. Then

V^{π'}(s) ≥ V^π(s).

Given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action.

Page 17: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Greedy Policy ’

Select at each state the action that appears best according to Q^π(s, a):

π'(s) = argmax_a Q^π(s, a)
      = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Then V^{π'}(s) ≥ V^π(s).

Page 18: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Greedy policy π':

π'(s) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

What if V^{π'} = V^π? Then for all s,

V^{π'}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^{π'}(s') ]

V^π(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Bellman optimality equation:

V*(s) = max_{a∈A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

What can you say about this?

Page 19: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Policy Iteration

π_0 →E→ V^{π_0} →I→ π_1 →E→ V^{π_1} →I→ π_2 →E→ ⋯ →I→ π* →E→ V*

E: policy evaluation;  I: policy improvement ("greedification")

Page 20: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Policy Evaluation

Policy Improvement

Policy Iteration

Page 21: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Policy Iteration

Policy Evaluation

Policy Improvement

Optimal Policy

Page 22: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Policy Iteration

Policy Evaluation

Policy Improvement

Optimal Policy

Policy Improvement:  π'(s) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Policy Evaluation:  V_{k+1}(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Page 23: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Value Iteration

Policy Evaluation

Policy Improvement

Optimal Policy

Policy Improvement:  π'(s) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Policy Evaluation:  V_{k+1}(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Combine these two.

Page 24: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Policy Evaluation:  V_{k+1}(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Policy Improvement:  π'(s) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Combine these two.

Value Iteration:  V_{k+1}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Optimal Policy:  π*(s) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]
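A minimal sketch of value iteration under the same assumed tabular representation (P[s][a][s2], R[s][a][s2]): it performs the combined max-backup until convergence and then extracts a greedy policy.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Value iteration: combine the evaluation backup with the greedy max.

    P[s][a][s2] -- transition probability, R[s][a][s2] -- expected reward.
    Returns the value function V and a greedy (deterministic) policy.
    """
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # backed-up value of each action, then take the max (not a sum over pi)
            q = [sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                     for s2 in range(n_states)) for a in range(n_actions)]
            V[s] = max(q)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # greedy policy extraction from the converged values
    policy = [int(np.argmax([sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                 for s2 in range(n_states))
                             for a in range(n_actions)]))
              for s in range(n_states)]
    return V, policy
```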

Page 25: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Value Iteration

Page 26: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Asynchronous DP

All the DP methods described so far require exhaustive sweeps of the entire state set.

Asynchronous DP does not use sweeps. Instead it works like this:
– Repeat until a convergence criterion is met:
– Pick a state at random and apply the appropriate backup.

It still needs lots of computation, but it does not get locked into hopelessly long sweeps.

Can you select states to back up intelligently?
– YES: an agent's experience can act as a guide.

Page 27: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Generalized Policy Iteration (GPI)

Evaluation:  V → V^π

Improvement:  π → greedy(V)

π*, V*  (Optimal Policy)

Page 28: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Efficiency of DP

Finding an optimal policy is polynomial in the number of states…

BUT the number of states is often astronomical
– e.g., it often grows exponentially with the number of state variables (what Bellman called "the curse of dimensionality").

In practice, classical DP can be applied to problems with a few million states.

Asynchronous DP can be applied to larger problems, and is well suited to parallel computation.

It is surprisingly easy to come up with MDPs for which DP methods are not practical.

Page 29: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Reinforcement Learning: Elementary Solution Methods

Monte Carlo Methods

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Page 30: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

What are Monte Carlo methods?

Monte Carlo methods: methods based on random sampling.

They do not assume complete knowledge of the environment.

They learn from actual experience:
– sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.

Page 31: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo methods vs. Reinforcement Learning

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.

To ensure that well-defined returns are available, we define Monte Carlo methods only for episodic tasks.

Incremental in an episode-by-episode sense, but not in a step-by-step sense.

Page 32: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo methods for Policy Evaluation V(s)

Evaluation:  V → V^π   (done here by Monte Carlo methods)

Improvement:  π → greedy(V)

π*, V*  (Optimal Policy)

Page 33: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo methods for Policy Evaluation V(s)

Goal: learn V^π(s).
Given: some number of episodes under π which contain s.
Idea: average the returns observed after visits to s.

(Diagram: an episode visiting s several times; the first visit to s is marked, and a return Return(s) is recorded after each visit.)

Page 34: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo methods for Policy Evaluation V(s)

Every-Visit MC:
– average returns for every time s is visited in an episode

First-Visit MC:
– average returns only for the first time s is visited in an episode

Both converge asymptotically.

Page 35: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

First-Visit MC Algorithm

Initialize:
– π ← policy to be evaluated
– V ← an arbitrary state-value function
– Returns(s) ← an empty list, for all s ∈ S

Repeat forever:
– Generate an episode using π.
– For each state s occurring in the episode:
  Get the return R following the first occurrence of s
  Append R to Returns(s)
  Set V(s) to the average of Returns(s)
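A minimal Python sketch of first-visit MC evaluation. The episode-generator interface (generate_episode() returning (state, reward) pairs) and the undiscounted return (γ = 1) are assumptions made for the example.

```python
from collections import defaultdict

def first_visit_mc(generate_episode, n_episodes=1000):
    """First-visit Monte Carlo evaluation of V for a fixed policy (gamma = 1).

    generate_episode() is assumed to follow the policy being evaluated and to
    return a list of (state, reward) pairs, where reward is the reward received
    on leaving that state.
    """
    returns = defaultdict(list)        # Returns(s): all first-visit returns seen so far
    V = {}
    for _ in range(n_episodes):
        episode = generate_episode()
        # return following each time step, computed backwards
        G, rets = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + G
            rets[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:          # only the FIRST visit to s counts
                seen.add(s)
                returns[s].append(rets[t])
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```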

Page 36: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example: Blackjack

Objective:
– Have your card sum be greater than the dealer's without exceeding 21.

States (200 of them):
– current sum (12–21)
– dealer's showing card (ace–10)
– do I have a usable ace?

Reward:
– +1 for winning, 0 for a draw, −1 for losing

Actions:
– stick (stop receiving cards), hit (receive another card)

Policy:
– Stick if my sum is 20 or 21, else hit

Page 37: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example: Blackjack

Page 38: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo Estimation for Action Values Q(s, a)

If a model is not available, then it is particularly useful to estimate action values rather than state values.

By action values, we mean the expected return when starting in state s, taking action a, and thereafter following policy π.

The every-visit MC method estimates the value of a state-action pair as the average of the returns that have followed visits to the state in which the action was selected.

The first-visit MC method is similar, but only records the first-visit (like before).

Page 39: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Maintaining Exploration

Many relevant state-action pairs may never be visited.

Exploring starts:
– The first step of each episode starts at a state–action pair.
– Every such pair has a nonzero probability of being selected as the start.

But this is not a great idea in practice.
– It's better to just choose a policy which has a nonzero probability of selecting all actions.

Page 40: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo Control to Approximate Optimal Policy

Evaluation:  Q → Q^π

Improvement:  π → greedy(Q)

π*, Q*  (Optimal Policy)

Page 41: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo Control to Approximate Optimal Policy

π_0 →E→ Q^{π_0} →I→ π_1 →E→ Q^{π_1} →I→ π_2 →E→ ⋯ →I→ π* →E→ Q*

E: complete policy evaluation;  I: policy improvement

π_{k+1}(s) = argmax_a Q^{π_k}(s, a)

Page 42: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo Control to Approximate Optimal Policy

π_{k+1}(s) = argmax_a Q^{π_k}(s, a)

Q^{π_k}(s, π_{k+1}(s)) = Q^{π_k}(s, argmax_a Q^{π_k}(s, a))
                       = max_a Q^{π_k}(s, a)
                       ≥ Q^{π_k}(s, π_k(s))
                       = V^{π_k}(s)

What if V^{π_{k+1}} = V^{π_k}?  Ans: then V^{π_{k+1}} = V^{π_k} = V*.

Page 43: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo Control to Approximate Optimal Policy

What if V^{π_{k+1}} = V^{π_k}?  Ans: then V^{π_{k+1}} = V^{π_k} = V*.

This, however, requires that
– exploring starts are used, with each state–action pair having a nonzero probability of being selected as the start;
– an infinite number of episodes is available.

Page 44: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

A Monte Carlo Control Algorithm Assuming Exploring Starts

Initialize:
– Q(s, a) ← arbitrary
– π(s) ← arbitrary
– Returns(s, a) ← empty list

Repeat forever:
– Generate an episode using exploring starts and π.
– For each pair (s, a) appearing in the episode:
  R ← return following the first occurrence of (s, a)
  Append R to Returns(s, a)
  Q(s, a) ← average of Returns(s, a)
– For each s in the episode:
  π(s) ← argmax_a Q(s, a)
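A sketch of this algorithm in Python, assuming an environment hook env_step(s, a) -> (next_state, reward, done) and finite states/actions lists; it merges the evaluation and improvement steps per episode as in the pseudocode above.

```python
import random
from collections import defaultdict

def mc_exploring_starts(env_step, states, actions, n_episodes=10000, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit version)."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}
    for _ in range(n_episodes):
        # exploring start: a random state-action pair begins the episode
        s, a = random.choice(states), random.choice(actions)
        episode, done = [], False
        while not done:
            s2, r, done = env_step(s, a)
            episode.append((s, a, r))
            s, a = s2, policy.get(s2, random.choice(actions))
        G = 0.0
        for t in reversed(range(len(episode))):
            s_t, a_t, r = episode[t]
            G = r + gamma * G
            if (s_t, a_t) not in [(x[0], x[1]) for x in episode[:t]]:  # first visit
                returns[(s_t, a_t)].append(G)
                Q[(s_t, a_t)] = sum(returns[(s_t, a_t)]) / len(returns[(s_t, a_t)])
                policy[s_t] = max(actions, key=lambda b: Q[(s_t, b)])  # greedy improvement
    return Q, policy
```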

Page 45: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example: Blackjack

Exploring starts.  Initial policy as described before.

Page 46: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

On-Policy Monte Carlo Control

On-policy:
– learning about the policy that is currently being executed.

What if we don't have exploring starts? We must adopt some method of exploring states which would not have been explored otherwise.

We will introduce the ε-greedy method.

Page 47: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

ε-Soft and ε-Greedy

ε-soft policy:  π(s, a) ≥ ε / |A(s)|  for all s ∈ S and a ∈ A(s)

ε-greedy policy:

π(s, a) = ε / |A(s)|                for each non-greedy action
π(s, a) = 1 − ε + ε / |A(s)|        for the greedy action

Page 48: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

ε-Greedy Algorithm

Initialize, for all states s and actions a:
– Q(s, a) ← arbitrary
– Returns(s, a) ← empty list
– π ← an arbitrary ε-soft policy

Repeat forever:
– Generate an episode using π.
– For each (s, a) appearing in the episode:
  R ← return following the first occurrence of (s, a)
  Append R to Returns(s, a)
  Q(s, a) ← average of Returns(s, a)
– For each state s in the episode:
  a* ← argmax_a Q(s, a)
  For all a ∈ A(s):
    π(s, a) ← 1 − ε + ε / |A(s)|   if a = a*
    π(s, a) ← ε / |A(s)|           if a ≠ a*
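The action-selection rule implied by this ε-greedy policy can be sketched as follows; Q is assumed to be a dictionary keyed by (state, action), which is an assumption of the example rather than part of the slides.

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon=0.1):
    """Sample an action from the epsilon-greedy distribution above:
    the greedy action gets probability 1 - eps + eps/|A(s)|,
    every other action gets eps/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)                         # explore: uniform over all actions
    return max(actions, key=lambda a: Q.get((s, a), 0.0))     # exploit: greedy action
```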

Page 49: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

Goal: V^π(s) = ?

Episodes: generated using π'.

How can we evaluate V^π(s) using episodes generated by π'?

Assumption: π(s, a) > 0 ⟹ π'(s, a) > 0

Page 50: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

(Diagram: episodes containing s, generated under π and under π'; for the ith continuation after a visit to s, p_i(s) and p'_i(s) are its probabilities under the two policies, and E[R_i(s)] is its expected return.)

Assumption: π(s, a) > 0 ⟹ π'(s, a) > 0

Page 51: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

(Diagram: the possible continuations after a visit to s, with probabilities p_i(s) under π and expected returns E[R_i(s)].)

V^π(s) = Σ_{i=1}^{m_s} p_i(s) E[R_i(s)]
       = Σ_{i=1}^{m_s} (p_i(s) / p'_i(s)) p'_i(s) E[R_i(s)]

so an expectation taken under π can be rewritten as an expectation under π', with each return weighted by the probability ratio p_i(s)/p'_i(s).

Page 52: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

Suppose n_s samples (episodes containing a first visit to s) are generated using π', with observed returns R_i(s). Then

V^π(s) ≈ [ Σ_{i=1}^{n_s} (p_i(s)/p'_i(s)) R_i(s) ] / [ Σ_{i=1}^{n_s} (p_i(s)/p'_i(s)) ]

Page 53: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

V^π(s) ≈ [ Σ_{i=1}^{n_s} (p_i(s)/p'_i(s)) R_i(s) ] / [ Σ_{i=1}^{n_s} (p_i(s)/p'_i(s)) ]

where i indexes the ith first visit to state s.

But what is the ratio p_i(s)/p'_i(s)?

Page 54: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

For the ith first visit to s, occurring at time t in an episode that terminates at time T_i(s),

p_i(s) = Π_{k=t}^{T_i(s)−1} π(s_k, a_k) P^{a_k}_{s_k s_{k+1}}

and similarly for p'_i(s) with π' in place of π. Hence

p_i(s) / p'_i(s) = Π_{k=t}^{T_i(s)−1} π(s_k, a_k) / π'(s_k, a_k)

The environment's transition probabilities cancel, so the weight depends only on the two policies.

Page 55: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Summary

V^π(s) ≈ [ Σ_{i=1}^{n_s} (p_i(s)/p'_i(s)) R_i(s) ] / [ Σ_{i=1}^{n_s} (p_i(s)/p'_i(s)) ]

p_i(s) / p'_i(s) = Π_{k=t}^{T_i(s)−1} π(s_k, a_k) / π'(s_k, a_k)

How to approximate Q^π(s, a)?

Page 56: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

How to approximate Q^π(s, a)?

(Backup diagrams: episodes that pass through the pair (s, a), generated under π and under π', continuing via the transition probabilities P^a_{ss'}.)

Page 57: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

How to approximate Q^π(s, a)?

Apply the same weighted estimator to the returns following first visits to the pair (s, a):

Q^π(s, a) ≈ [ Σ_{i=1}^{n_{sa}} (p_i(s,a)/p'_i(s,a)) R_i(s,a) ] / [ Σ_{i=1}^{n_{sa}} (p_i(s,a)/p'_i(s,a)) ]

To obtain Q^π(s, a), set π(s, a) = 1: the first action a is given, so it contributes no factor to the weight.

Page 58: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

How to approximate Q^π(s, a)?

With π(s, a) = 1 for the given first action, the weight for the ith first visit to (s, a) at time t becomes

p_i(s, a) / p'_i(s, a) = Π_{k=t+1}^{T_i(s)−1} π(s_k, a_k) / π'(s_k, a_k)

Page 59: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

How to approximate Q^π(s, a)?

State values and action values are related by

V^π(s') = Σ_{a'∈A(s')} π(s', a') Q^π(s', a'),

so the expected next-state value Σ_{s'} P^a_{ss'} V^π(s') can equivalently be written as Σ_{s'} P^a_{ss'} Σ_{a'∈A(s')} π(s', a') Q^π(s', a').

Page 60: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Evaluating One Policy While Following Another

How to approximate Q^π(s, a)?

Q^π(s, a) ≈ [ Σ_{i=1}^{n_{sa}} (p_i(s,a)/p'_i(s,a)) R_i(s,a) ] / [ Σ_{i=1}^{n_{sa}} (p_i(s,a)/p'_i(s,a)) ]

with  p_i(s) = Π_{k=t}^{T_i(s)−1} π(s_k, a_k) P^{a_k}_{s_k s_{k+1}}.
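Putting the pieces together, here is a sketch of the weighted importance-sampling estimate of Qπ(s, a) from episodes generated by another policy. The interfaces (episodes as lists of (state, action, reward) triples, pi(s, a) and behaviour(s, a) as probability functions) are assumptions made for the example, γ = 1, and for brevity it weights every visit rather than only first visits.

```python
def off_policy_q_estimate(episodes, pi, behaviour):
    """Weighted importance-sampling estimate of Q^pi(s, a) (gamma = 1)."""
    num, den = {}, {}
    for episode in episodes:
        # returns following each time step, computed backwards
        G = [0.0] * (len(episode) + 1)
        for t in reversed(range(len(episode))):
            G[t] = episode[t][2] + G[t + 1]
        for t, (s, a, _) in enumerate(episode):
            # weight: product of probability ratios for the steps AFTER (s, a);
            # the first action is given, so its own ratio is taken as 1
            w = 1.0
            for (sk, ak, _) in episode[t + 1:]:
                w *= pi(sk, ak) / behaviour(sk, ak)
            num[(s, a)] = num.get((s, a), 0.0) + w * G[t]
            den[(s, a)] = den.get((s, a), 0.0) + w
    return {sa: num[sa] / den[sa] for sa in num if den[sa] > 0}
```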

Page 61: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Off-Policy Monte Carlo Control

Requires two policies:
– an estimation policy (deterministic), e.g., greedy
– a behaviour policy (stochastic), e.g., ε-soft

Page 62: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Off-Policy Monte Carlo Control

Policy Evaluation

Policy Improvement

Page 63: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Incremental Implementation

MC can be implemented incrementally
– saves memory

Compute the weighted average of each return

Page 64: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Incremental Implementation

Non-incremental (a weighted average of the observed returns, with weights w_i = p_i(s)/p'_i(s)):

V_n(s) = Σ_{i=1}^{n} w_i R_i(s) / Σ_{i=1}^{n} w_i

Equivalent incremental form:

V_{n+1}(s) = V_n(s) + (w_{n+1} / W_{n+1}) [ R_{n+1}(s) − V_n(s) ]
W_{n+1} = W_n + w_{n+1},   with V_0 = 0 and W_0 = 0

General form of the update:

V(s_t) ← V(s_t) + α_t [ R_t − V(s_t) ]

Page 65: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Incremental Implementation

Non-incremental:  V_n(s) = Σ_{i=1}^{n} w_i R_i(s) / Σ_{i=1}^{n} w_i

Equivalent incremental form:  V_{n+1}(s) = V_n(s) + (w_{n+1} / W_{n+1}) [ R_{n+1}(s) − V_n(s) ],  W_{n+1} = W_n + w_{n+1},  V_0 = 0, W_0 = 0

V(s_t) ← V(s_t) + α_t [ R_t − V(s_t) ]

If α_t is held constant, this is called constant-α MC.
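A small sketch showing that the incremental update reproduces the non-incremental weighted average; the data in the check are made up for illustration.

```python
def incremental_weighted_average(returns_and_weights):
    """Incremental form of the weighted average above:
    V_{n+1} = V_n + (w_{n+1} / W_{n+1}) * (R_{n+1} - V_n),  W_{n+1} = W_n + w_{n+1}."""
    V, W = 0.0, 0.0
    for R, w in returns_and_weights:
        W += w
        V += (w / W) * (R - V)
    return V

# Sanity check against the non-incremental weighted average:
data = [(10.0, 1.0), (4.0, 2.0), (7.0, 0.5)]          # (return, weight) pairs
direct = sum(w * R for R, w in data) / sum(w for _, w in data)
assert abs(incremental_weighted_average(data) - direct) < 1e-12
```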

Page 66: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Summary

MC has several advantages over DP:
– Can learn directly from interaction with the environment
– No need for full models
– No need to learn about ALL states
– Less harm from violations of the Markov property

MC methods provide an alternative policy evaluation process.

One issue to watch for: maintaining sufficient exploration
– exploring starts, soft policies

No bootstrapping (as opposed to DP).

Page 67: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Reinforcement LearningElementary Solution Methods

Temporal Difference Learning

大同大學資工所智慧型多媒體研究室

Page 68: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Temporal Difference Learning

Combine the ideas of Monte Carlo and dynamic programming (DP).

Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics.

Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

Page 69: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Monte Carlo Methods

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

(Backup diagram: starting from s_t, the sampled episode is followed all the way to the terminal state T; the target is the actual return R_t.)

Page 70: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Dynamic Programming

V(s_t) ← E_π[ r_{t+1} + γ V(s_{t+1}) ]

(Backup diagram: from s_t, a full backup over all possible rewards r_{t+1} and successor states s_{t+1}, using the current estimate V(s_{t+1}).)

Page 71: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Basic Concept of TD(0)

Dynamic programming:   V(s_t) ← E_π[ r_{t+1} + γ V(s_{t+1}) ]

Monte Carlo methods:   V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]   (target: the true return)

TD(0):                 V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

The TD(0) target r_{t+1} + γ V(s_{t+1}) uses the predicted value at time t + 1; the bracketed quantity is the temporal difference.

Page 72: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Basic Concept of TD(0)

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

(Backup diagram: from s_t, one sampled step to s_{t+1} with reward r_{t+1}, then bootstrap from V(s_{t+1}).)

Page 73: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

TD(0) Algorithm

Initialize V(s) arbitrarily; let π be the policy to be evaluated

Repeat (for each episode):
– Initialize s
– Repeat (for each step of episode):
  a ← action given by π for s
  Take action a; observe reward r and next state s'
  V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
  s ← s'
– until s is terminal
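A minimal sketch of tabular TD(0), assuming environment hooks env_reset() and env_step(s, a) -> (next_state, reward, done) and a policy(s) function; these names are assumptions for the example.

```python
def td0(env_reset, env_step, policy, n_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction, following the algorithm above."""
    V = {}
    for _ in range(n_episodes):
        s = env_reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env_step(s, a)
            # bootstrap from V(s') unless s' is terminal
            target = r + (0.0 if done else gamma * V.get(s2, 0.0))
            V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
            s = s2
    return V
```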

Page 74: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Driving Home)

State                 Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
Leaving office                  0                        30                     30
Reach car, raining              5                        35                     40
Exit highway                   20                        15                     35
Behind truck                   30                        10                     40
Home street                    40                         3                     43
Arrive home                    43                         0                     43

Page 75: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Driving Home)

(Same driving-home table as on the previous page.)

Page 76: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

TD Bootstraps and Samples

Bootstrapping: the update involves an estimate
– MC does not bootstrap
– DP bootstraps
– TD bootstraps

Sampling: the update does not involve an expected value
– MC samples
– DP does not sample
– TD samples

Page 77: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Random Walk)

(Diagram: a five-state random walk A–B–C–D–E with terminal states at both ends; episodes start at C. All rewards are 0, except a reward of 1 on the transition from E into the right terminal state.)

True values:  V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6

Page 78: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Random Walk)

(Figure: values learned by TD(0) after various numbers of episodes on the random walk.)

Page 79: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Random Walk)

(Figure: learning curves for TD(0) and MC on the random walk; data averaged over 100 sequences of episodes.)

Page 80: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Optimality of TD(0)

Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
– Compute updates according to TD(0), but only update the estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating:
– TD(0) converges for sufficiently small α.
– Constant-α MC also converges under these conditions, but to a different answer!

Page 81: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example: Random Walk under Batch Updating

After each new episode, all episodes seen so far were treated as a batch, and the algorithm was trained on that batch until convergence. The whole procedure was repeated 100 times.

Page 82: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Why is TD better at generalizing in the batch update?

MC susceptible to poor state sampling and weird episodes

TD less affected by weird episodes & sampling because estimates linked to other states that may be better sampled– i.e., estimates smoothed across states.

TD converges to correct value function for max likelihood model of the environment (certainty-equivalence estimate)

Page 83: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example: You are the predictor

Suppose you observe the following 8 episodes from an MDP:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

(Diagram: the Markov model these episodes suggest: A leads to B 100% of the time with reward 0; from B, the episode terminates with reward 1 (75%) or reward 0 (25%).)

V(A) = ?   V(B) = ?

What answer does TD(0) give? What does constant-α MC give? What would you give?

Page 84: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Learning An Action-Value Function

(Trajectory: … → s_t, a_t → r_{t+1} → s_{t+1}, a_{t+1} → r_{t+2} → s_{t+2}, a_{t+2} → …)

Goal: estimate Q^π(s, a).

After every transition from a nonterminal state s_t, do:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.

Page 85: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Sarsa: On-Policy TD Control

Initialize Q(s, a) arbitrarily

Repeat (for each episode):
– Initialize s
– Choose a from s using the policy derived from Q (e.g., ε-greedy)
– Repeat (for each step of episode):
  Take action a; observe reward r and next state s'
  Choose a' from s' using the policy derived from Q (e.g., ε-greedy)
  Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
  s ← s', a ← a'
– until s is terminal
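A sketch of Sarsa with the same assumed environment hooks (env_reset, env_step) and an ε-greedy behaviour derived from the current Q.

```python
import random

def sarsa(env_reset, env_step, actions, n_episodes=1000,
          alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control (Sarsa), following the pseudocode above."""
    Q = {}

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    for _ in range(n_episodes):
        s = env_reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env_step(s, a)
            a2 = eps_greedy(s2)
            target = r + (0.0 if done else gamma * Q.get((s2, a2), 0.0))
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            s, a = s2, a2
    return Q
```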

Page 86: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Windy World)

undiscounted, episodic, reward = –1 until goal

Standard moves

King's moves

Page 87: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Windy World)

Applying ε-greedy Sarsa to this task, with ε = 0.1, α = 0.1, and the initial values Q(s, a) = 0 for all s, a.

Page 88: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Q-Learning: Off-Policy TD Control

One-step Q-learning:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Estimation policy: deterministic (greedy).  Behaviour policy: stochastic.

Page 89: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Q-Learning: Off-Policy TD Control

Initialize Q(s, a) arbitrarily

Repeat (for each episode):
– Initialize s
– Repeat (for each step of episode):
  Choose a from s using the policy derived from Q (e.g., ε-greedy)
  Take action a; observe reward r and next state s'
  Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
  s ← s'
– until s is terminal
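A sketch of one-step Q-learning under the same assumed interfaces; note the max over next-state action values in the target.

```python
import random

def q_learning(env_reset, env_step, actions, n_episodes=1000,
               alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control (one-step Q-learning), following the pseudocode above."""
    Q = {}
    for _ in range(n_episodes):
        s = env_reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q.get((s, b), 0.0))
            s2, r, done = env_step(s, a)
            # target uses the greedy (estimation) policy at the next state
            best_next = 0.0 if done else max(Q.get((s2, b), 0.0) for b in actions)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
            s = s2
    return Q
```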

Page 90: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Example (Cliff Walking)

Page 91: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Actor-Critic Methods

(Diagram: the actor (policy) selects actions; the critic (value function) evaluates them. The environment sends the state and reward to the critic, and the critic sends a TD error to the actor.)

Explicit representation of policy as well as value function

Minimal computation to select actions

Can learn an explicit stochastic policy

Can put constraints on policies

Appealing as psychological and neural models

Page 92: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Actor-Critic Methods

(Same actor–critic diagram as on the previous page.)

Policy parameters: the preferences p(s, a).

Policy:  π_t(s, a) = Pr{ a_t = a | s_t = s } = e^{p(s, a)} / Σ_b e^{p(s, b)}

TD error:  δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

Page 93: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Actor-Critic Methods

(Same actor–critic diagram.)

Policy parameters: the preferences p(s, a).
Policy:  π_t(s, a) = Pr{ a_t = a | s_t = s } = e^{p(s, a)} / Σ_b e^{p(s, b)}
TD error:  δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

Update the state-value function using TD(0).

How should we update the policy parameters?

Page 94: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.

Actor-Critic Methods

(Same actor–critic diagram.)

TD error:  δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

δ_t may be >, =, or < 0; we want to strengthen actions that make the value larger.

How to update the policy parameters?

Method 1:  p(s_t, a_t) ← p(s_t, a_t) + β δ_t

Method 2:  p(s_t, a_t) ← p(s_t, a_t) + β δ_t [ 1 − π_t(s_t, a_t) ]
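To make the update rules concrete, here is a sketch of one actor-critic episode using Method 1. The softmax policy, the TD(0) critic update, and the preference update follow the definitions above; the environment hooks and function names are assumptions made for the example.

```python
import math
import random

def actor_critic_episode(env_reset, env_step, actions,
                         p, V, alpha=0.1, beta=0.1, gamma=1.0):
    """One episode of the actor-critic scheme above (Method 1).

    p[(s, a)]  -- action preferences (the actor's parameters)
    V[s]       -- the critic's state-value estimates, updated by TD(0)
    """

    def softmax_policy(s):
        prefs = [p.get((s, a), 0.0) for a in actions]
        m = max(prefs)
        exps = [math.exp(q - m) for q in prefs]          # pi(s,a) = e^p / sum_b e^p
        probs = [e / sum(exps) for e in exps]
        return random.choices(actions, weights=probs)[0]

    s = env_reset()
    done = False
    while not done:
        a = softmax_policy(s)
        s2, r, done = env_step(s, a)
        v_next = 0.0 if done else V.get(s2, 0.0)
        delta = r + gamma * v_next - V.get(s, 0.0)       # TD error
        V[s] = V.get(s, 0.0) + alpha * delta             # critic: TD(0) update
        p[(s, a)] = p.get((s, a), 0.0) + beta * delta    # actor: Method 1 update
        s = s2
```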