Reinforcement Learning: Elementary Solution Methods
Lecturer: 虞台文
Intelligent Multimedia Research Lab, Graduate Institute of Computer Science and Engineering, Tatung University
Content
– Introduction
– Dynamic Programming
– Monte Carlo Methods
– Temporal Difference Learning
Reinforcement Learning: Elementary Solution Methods
Introduction
Basic Methods
Dynamic programming
– Well developed, but requires a complete and accurate model of the environment.
Monte Carlo methods
– Require no model and are conceptually very simple, but are not suited for step-by-step incremental computation.
Temporal-difference learning
– Requires no model and is fully incremental, but is more complex to analyze.
Q-Learning
Reinforcement Learning: Elementary Solution Methods
Dynamic Programming
Dynamic Programming
A collection of algorithms that can be used to compute optimal policies given a perfect model of the environment, e.g., a Markov decision process (MDP).
Theoretically important:
– An essential foundation for the understanding of other methods.
– Other methods attempt to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.
Finite MDP Environments
An MDP consists of:
– A finite set of states $S$ (or $S^+$, including the terminal state),
– A finite set of actions $A$,
– A transition distribution $\mathcal{P}^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$, for $s, s' \in S^+$ and $a \in A$,
– Expected immediate rewards $\mathcal{R}^a_{ss'} = E[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']$.
Review
Return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
State-value function for policy $\pi$: $V^\pi(s) = E_\pi[R_t \mid s_t = s] = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right]$
Bellman equation for $V^\pi$: $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]$
Bellman optimality equation: $V^*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^*(s')\right]$
Methods of Dynamic Programming
– Policy Evaluation
– Policy Improvement
– Policy Iteration
– Value Iteration
– Asynchronous DP
Policy Evaluation
Given a policy $\pi$, compute the state-value function.
Bellman equation for $V^\pi$: $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]$
This is a system of $|S|$ simultaneous linear equations. It can be solved directly, but doing so may be tedious, so we will use an iterative method.
Iterative Policy Evaluation
$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$
Each arrow is a "sweep". A sweep consists of applying a backup operation to each state. The full backup is
$V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V_k(s')\right]$
The Algorithm (Iterative Policy Evaluation)
Input $\pi$, the policy to be evaluated
Initialize $V(s) = 0$ for all $s \in S^+$
Repeat
  $\Delta \leftarrow 0$
  For each $s \in S$:
    $v \leftarrow V(s)$
    $V(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V(s')\right]$
    $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
until $\Delta < \theta$ (a small positive number)
Output $V \approx V^\pi$
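To make the sweep concrete, here is a minimal Python sketch of the algorithm above, assuming the tabular model is given as NumPy arrays P[s, a, s'] for $\mathcal{P}^a_{ss'}$, R[s, a, s'] for $\mathcal{R}^a_{ss'}$, and pi[s, a] for the policy. These names and the array layout are illustrative, not from the slides.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with full backups.

    P[s, a, s2]: transition probabilities P^a_{ss'}
    R[s, a, s2]: expected immediate rewards R^a_{ss'}
    pi[s, a]:    action probabilities of the policy being evaluated
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # V(s) <- sum_a pi(s,a) sum_s' P^a_{ss'} [R^a_{ss'} + gamma V(s')]
            V[s] = np.sum(pi[s][:, None] * P[s] * (R[s] + gamma * V[None, :]))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V
```

Updating V in place during the sweep, as above, is also valid and usually converges faster than keeping a separate copy of the old values.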
Example (Grid World)
– Possible actions from any state s: A = {up, down, left, right}
– Terminal state in the top-left and bottom-right corners (treated as the same state)
– Reward is −1 on all transitions until the terminal state is reached
– All values initialized to 0
– Moving out of bounds leaves the agent in the same state
Example (Grid World)
We start with the equiprobable random policy; in the end we obtain the optimal policy.
Policy Improvement
Consider $V^\pi$ for a deterministic policy $\pi$. Under what condition would it be better to take an action $a \neq \pi(s)$ when in state $s$?
The action-value of taking $a$ in state $s$ is
$Q^\pi(s,a) = E_\pi\left[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\right] = \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]$
Is it better to switch to action $a$ if $Q^\pi(s,a) > V^\pi(s)$?
Policy Improvement
Let $\pi'$ be a policy identical to $\pi$ except in state $s$. Suppose that $\pi'(s) = a$ and $Q^\pi(s,a) > V^\pi(s)$. Then $V^{\pi'}(s) > V^\pi(s)$.
Given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action.
Greedy Policy $\pi'$
Select at each state the action that appears best according to $Q^\pi(s,a)$:
$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]$
Then $V^{\pi'}(s) \geq V^\pi(s)$.
Greedy Policy $\pi'$
What if $V^{\pi'} = V^\pi$? Then for all $s$,
$V^{\pi'}(s) = \max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^{\pi'}(s')\right]$
But this is exactly the Bellman optimality equation:
$V^*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^*(s')\right]$
What can you say about this? Both $\pi$ and $\pi'$ must be optimal policies.
Policy Iteration
$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$
E: policy evaluation; I: policy improvement ("greedification").
Alternate between the two steps until the policy is stable:
Policy evaluation: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V_k(s')\right]$
Policy improvement: $\pi(s) \leftarrow \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V(s')\right]$
Repeating evaluation and improvement yields the optimal policy.
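A minimal sketch of this loop, reusing the policy_evaluation function from the earlier sketch and the same illustrative P/R array layout:

```python
import numpy as np

def q_from_v(P, R, V, gamma):
    # Q(s, a) = sum_s' P^a_{ss'} [R^a_{ss'} + gamma V(s')]
    return np.sum(P * (R + gamma * V[None, None, :]), axis=2)

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions = P.shape[:2]
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # equiprobable start
    while True:
        V = policy_evaluation(P, R, pi, gamma)        # E: evaluation
        greedy = np.argmax(q_from_v(P, R, V, gamma), axis=1)
        new_pi = np.eye(n_actions)[greedy]            # I: greedification
        if np.array_equal(new_pi, pi):                # policy stable => optimal
            return new_pi, V
        pi = new_pi
```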
Value Iteration
Policy iteration alternates policy evaluation,
$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V_k(s')\right]$,
with policy improvement,
$\pi(s) \leftarrow \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V(s')\right]$.
Value iteration combines these two into a single backup:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V_k(s')\right]$
The greedy policy with respect to the converged value function is optimal.
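A minimal sketch of the combined backup, under the same illustrative P/R array layout as before:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # V_{k+1}(s) = max_a sum_s' P^a_{ss'} [R^a_{ss'} + gamma V_k(s')]
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return Q.argmax(axis=1), V_new   # greedy policy and optimal values
        V = V_new
```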
Asynchronous DP
All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead it works like this:
– Repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup.
Still needs lots of computation, but does not get locked into hopelessly long sweeps.
Can you select states to back up intelligently? YES: an agent's experience can act as a guide.
Generalized Policy Iteration (GPI)
Evaluation: $V \to V^\pi$
Improvement: $\pi \to \text{greedy}(V)$
Interleaving the two processes drives both toward the optimal policy and value function.
Efficiency of DP
Finding an optimal policy is polynomial in the number of states...
BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
In practice, classical DP can be applied to problems with a few million states.
Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Reinforcement Learning: Elementary Solution Methods
Monte Carlo Methods
What are Monte Carlo methods?
– A random sampling (search) method.
– Does not assume complete knowledge of the environment.
– Learns from actual experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
Monte Carlo Methods vs. Reinforcement Learning
Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
To ensure that well-defined returns are available, we define Monte Carlo methods only for episodic tasks.
They are incremental in an episode-by-episode sense, but not in a step-by-step sense.
Monte Carlo Methods for Policy Evaluation $V^\pi(s)$
As in GPI, Monte Carlo methods handle the evaluation step:
Evaluation: $V \to V^\pi$ (via Monte Carlo methods)
Improvement: $\pi \to \text{greedy}(V)$
Monte Carlo Methods for Policy Evaluation $V^\pi(s)$
Goal: learn $V^\pi(s)$.
Given: some number of episodes under $\pi$ which contain $s$.
Idea: average the returns observed after visits to $s$.
[Diagram: an episode visiting s several times; the first visit to s is distinguished, and a return Return(s) is recorded at each visit.]
Monte Carlo Methods for Policy Evaluation $V^\pi(s)$
Every-visit MC: average the returns for every time $s$ is visited in an episode.
First-visit MC: average the returns only for the first time $s$ is visited in an episode.
Both converge asymptotically.
First-Visit MC Algorithm
Initialize:
– $\pi \leftarrow$ policy to be evaluated
– $V \leftarrow$ an arbitrary state-value function
– $Returns(s) \leftarrow$ an empty list, for all $s \in S$
Repeat forever:
– Generate an episode using $\pi$
– For each state $s$ occurring in the episode:
    Get the return $R$ following the first occurrence of $s$
    Append $R$ to $Returns(s)$
    Set $V(s)$ to the average of $Returns(s)$
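A short Python sketch of first-visit MC evaluation. It assumes a hypothetical generate_episode() helper that samples one episode under $\pi$ and returns (state, reward) pairs, where the reward is the one received on leaving that state; the helper and its format are assumptions for illustration.

```python
from collections import defaultdict

def first_visit_mc(generate_episode, n_episodes, gamma=1.0):
    """generate_episode() -> [(s_0, r_1), (s_1, r_2), ...] sampled under pi."""
    returns = defaultdict(list)   # Returns(s)
    V = {}
    for _ in range(n_episodes):
        episode = generate_episode()
        # return following each time step: R_t = r_{t+1} + gamma * R_{t+1}
        G, G_at = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            G_at[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                 # first visit to s only
                seen.add(s)
                returns[s].append(G_at[t])
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```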
Example: Blackjack
Objective: have your card sum be greater than the dealer's without exceeding 21.
States (200 of them):
– current sum (12–21)
– dealer's showing card (ace–10)
– do I have a usable ace?
Reward: +1 for winning, 0 for a draw, −1 for losing
Actions: stick (stop receiving cards), hit (receive another card)
Policy: stick if my sum is 20 or 21, else hit
Example: Blackjack
Monte Carlo Estimation of Action Values $Q^\pi(s, a)$
If a model is not available, it is particularly useful to estimate action values rather than state values. By action values we mean the expected return when starting in state $s$, taking action $a$, and thereafter following policy $\pi$.
The every-visit MC method estimates the value of a state–action pair as the average of the returns that have followed visits to the state in which the action was selected. The first-visit MC method is similar, but only uses the return following the first visit (as before).
Maintaining Exploration
Many relevant state–action pairs may never be visited.
Exploring starts:
– The first step of each episode starts at a state–action pair.
– Every such pair has a nonzero probability of being selected as the start.
But exploring starts are not a great idea in practice. It is better to just choose a policy that has a nonzero probability of selecting all actions.
Monte Carlo Control to Approximate the Optimal Policy
Evaluation: $Q \to Q^\pi$
Improvement: $\pi \to \text{greedy}(Q)$
Monte Carlo Control to Approximate the Optimal Policy
$\pi_0 \xrightarrow{E} Q^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} Q^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} Q^*$
E: complete policy evaluation; I: policy improvement, with
$\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s, a)$
Monte Carlo Control to Approximate the Optimal Policy
With $\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s, a)$ we have
$Q^{\pi_k}(s, \pi_{k+1}(s)) = Q^{\pi_k}\!\big(s, \arg\max_a Q^{\pi_k}(s, a)\big) = \max_a Q^{\pi_k}(s, a) \geq Q^{\pi_k}(s, \pi_k(s)) = V^{\pi_k}(s)$
so $V^{\pi_{k+1}} \geq V^{\pi_k}$. What if $V^{\pi_{k+1}} = V^{\pi_k}$? Ans: $V^{\pi_{k+1}} = V^{\pi_k} = V^*$.
Monte Carlo Control to Approximate Optimal Policy
1k kV V What if ? 1 *k kV V V A ns.
This, however, requires that– Exploration starts with each state-action
pair having nonzero probability to be selected as the start.
– Infinite number of episodes.
A Monte Carlo Control Algorithm Assuming Exploring Starts
Initialize:
– $Q(s, a) \leftarrow$ arbitrary
– $\pi(s) \leftarrow$ arbitrary
– $Returns(s, a) \leftarrow$ empty list
Repeat forever:
– Generate an episode using $\pi$ (with an exploring start)
– For each pair $(s, a)$ appearing in the episode:
    $R \leftarrow$ return following the first occurrence of $(s, a)$
    Append $R$ to $Returns(s, a)$
    $Q(s, a) \leftarrow$ average of $Returns(s, a)$
– For each $s$ in the episode:
    $\pi(s) \leftarrow \arg\max_a Q(s, a)$
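A compact sketch of this algorithm. The generate_episode(policy) helper is assumed to implement the exploring start and return (state, action, reward) triples; Q is averaged incrementally, which is equivalent to averaging the Returns lists.

```python
from collections import defaultdict

def mc_control_es(generate_episode, actions, n_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit version).

    generate_episode(policy) must start at a random (state, action) pair
    and return a list of (state, action, reward) triples.
    """
    Q = defaultdict(float)
    N = defaultdict(int)
    policy = defaultdict(lambda: actions[0])   # arbitrary initial policy
    for _ in range(n_episodes):
        episode = generate_episode(policy)
        firsts = {}
        for t, (s, a, _) in enumerate(episode):
            firsts.setdefault((s, a), t)       # first occurrence of (s, a)
        G, returns = 0.0, {}
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G                  # return following time t
            if firsts[(s, a)] == t:
                returns[(s, a)] = G
        for (s, a), G in returns.items():
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # incremental average
            policy[s] = max(actions, key=lambda b: Q[(s, b)])  # greedify
    return policy, Q
```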
Example: Blackjack
Exploring starts; initial policy as described before.
On-Policy Monte Carlo Control
On-policy: learning about the policy currently being executed.
What if we don't have exploring starts? We must adopt some method of exploring states which would not have been explored otherwise. We will introduce the ε-greedy method.
ε-Soft and ε-Greedy
ε-soft policy: $\pi(s, a) \geq \dfrac{\epsilon}{|\mathcal{A}(s)|}$ for all $s \in S$ and $a \in \mathcal{A}(s)$.
ε-greedy policy:
$\pi(s, a) = \begin{cases} \dfrac{\epsilon}{|\mathcal{A}(s)|} & \text{for each non-greedy action} \\[4pt] 1 - \epsilon + \dfrac{\epsilon}{|\mathcal{A}(s)|} & \text{for the greedy action} \end{cases}$
ε-Greedy Algorithm
Initialize, for all states $s$ and actions $a$:
– $Q(s, a) \leftarrow$ arbitrary
– $Returns(s, a) \leftarrow$ empty list
– $\pi \leftarrow$ an arbitrary ε-soft policy
Repeat forever:
– Generate an episode using $\pi$.
– For each $(s, a)$ appearing in the episode:
    $R \leftarrow$ return following the first occurrence of $(s, a)$
    Append $R$ to $Returns(s, a)$
    $Q(s, a) \leftarrow$ average of $Returns(s, a)$
– For each state $s$ in the episode:
    $a^* \leftarrow \arg\max_a Q(s, a)$
    For all $a \in \mathcal{A}(s)$:
    $\pi(s, a) \leftarrow \begin{cases} 1 - \epsilon + \epsilon / |\mathcal{A}(s)| & a = a^* \\ \epsilon / |\mathcal{A}(s)| & a \neq a^* \end{cases}$
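For the action-selection step, sampling from an ε-greedy policy needs only a single uniform draw. In the sketch below, the uniform branch gives every action probability ε/|A(s)|, and the greedy action receives the remaining 1 − ε mass on top, exactly matching the definition above.

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon):
    """Sample an action from the epsilon-greedy policy derived from Q."""
    if random.random() < epsilon:
        return random.choice(actions)                 # epsilon/|A(s)| each
    return max(actions, key=lambda a: Q[(s, a)])      # greedy, extra 1-eps mass
```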
Evaluating One Policy While Following Another
Goal: estimate $V^\pi(s)$.
Episodes are generated using a different policy $\pi'$.
How can we evaluate $V^\pi(s)$ using the episodes generated by $\pi'$?
Assumption: $\pi(s, a) > 0 \Rightarrow \pi'(s, a) > 0$.
Evaluating One Policy While Following Another
Consider the $i$-th first visit to state $s$. Let $p_i(s)$ be the probability of the subsequent trajectory under $\pi$, let $p'_i(s)$ be its probability under $\pi'$, and let $E[R_i(s)]$ be the expected return that follows. Then
$V^\pi(s) = \sum_{i=1}^{m_s} p_i(s)\, E[R_i] = \frac{\sum_{i=1}^{m_s} p_i(s)\, E[R_i]}{\sum_{i=1}^{m_s} p_i(s)}$
since the trajectory probabilities $p_i(s)$ sum to one.
Evaluating One Policy While Following Another
Suppose $n_s$ samples are taken using $\pi'$. Weighting each observed return $R_i(s)$ by the likelihood ratio $p_i(s)/p'_i(s)$ gives the importance-sampling estimate
$V^\pi(s) \approx \frac{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}\, R_i(s)}{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}}$
Evaluating One Policy While Following Another
Here $p_i(s)$ and $p'_i(s)$ denote the probabilities of the complete trajectory following the $i$-th first visit to state $s$, under $\pi$ and $\pi'$ respectively. How do we compute the ratio $p_i(s)/p'_i(s)$?
Evaluating One Policy While Following Another
Let the $i$-th first visit to $s$ occur at time $t$, and let $T_i(s)$ be the time of termination of that episode. Then
$p_i(s) = \prod_{k=t}^{T_i(s)-1} \pi(s_k, a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}$ and $p'_i(s) = \prod_{k=t}^{T_i(s)-1} \pi'(s_k, a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}$
The environment's transition probabilities cancel in the ratio:
$\frac{p_i(s)}{p'_i(s)} = \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)}$
so the importance weight depends only on the two policies, not on the dynamics.
Summary
$V^\pi(s) \approx \frac{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}\, R_i(s)}{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}}$, with $\frac{p_i(s)}{p'_i(s)} = \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)}$
How can we approximate $Q^\pi(s, a)$?
Evaluating One Policy While Following Another
To approximate $Q^\pi(s, a)$, treat the pair $(s, a)$ as the starting point of the trajectory. This amounts to setting $\pi(s, a) = 1$ in the importance weight, so the product starts at $k = t + 1$:
$Q^\pi(s, a) \approx \frac{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}\, R_i(s)}{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}}$, with $\frac{p_i(s)}{p'_i(s)} = \prod_{k=t+1}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)}$
Note the relation between the two value functions: $V^\pi(s) = \sum_{a \in \mathcal{A}(s)} \pi(s, a)\, Q^\pi(s, a)$.
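A sketch of this weighted importance-sampling estimator in Python. It assumes pi(s, a) and behavior(s, a) are callables returning action probabilities under $\pi$ and $\pi'$, and that each episode is a list of (state, action, reward) triples; these interfaces are illustrative, not from the slides.

```python
from collections import defaultdict

def off_policy_mc_eval(episodes, pi, behavior, gamma=1.0):
    """Weighted importance-sampling estimate of V^pi from episodes
    generated by the behavior policy pi'."""
    num = defaultdict(float)   # sum_i w_i * R_i(s)
    den = defaultdict(float)   # sum_i w_i
    V = {}
    for episode in episodes:
        firsts = {}
        for t, (s, _, _) in enumerate(episode):
            firsts.setdefault(s, t)            # time of first visit to s
        G, W = 0.0, 1.0
        # Walk backward: the return G and the likelihood ratio
        # W = prod_k pi(s_k, a_k) / pi'(s_k, a_k) accumulate together.
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G
            W *= pi(s, a) / behavior(s, a)
            if firsts[s] == t:                 # first visit: record weighted return
                num[s] += W * G
                den[s] += W
                if den[s] > 0:
                    V[s] = num[s] / den[s]
    return V
```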
Off-Policy Monte Carlo Control
Requires two policies:
– the estimation policy (deterministic), e.g., greedy
– the behaviour policy (stochastic), e.g., ε-soft
Policy evaluation uses the importance-sampling estimates above; policy improvement greedifies the estimation policy.
Incremental Implementation
MC can be implemented incrementally, which saves memory. We compute the weighted average of the returns incrementally.
Non-incremental form:
$V_n(s) = \frac{\sum_{i=1}^{n} w_i R_i(s)}{\sum_{i=1}^{n} w_i}$
Equivalent incremental form:
$V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}}\left[R_{n+1}(s) - V_n\right]$, where $W_{n+1} = W_n + w_{n+1}$, with $V_0 = 0$ and $W_0 = 0$.
In general the update has the form
$V(s_t) \leftarrow V(s_t) + \alpha_t\left[R_t - V(s_t)\right]$
If $\alpha_t$ is held constant, it is called constant-α MC.
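The incremental rule is a two-line function; a minimal sketch, with the observation that replacing $w_{n+1}/W_{n+1}$ by a fixed $\alpha$ recovers constant-α MC:

```python
def incremental_update(V, W, w_next, R_next):
    """One step of the incremental weighted average:
    W_{n+1} = W_n + w_{n+1}
    V_{n+1} = V_n + (w_{n+1} / W_{n+1}) * (R_{n+1} - V_n)
    Start with V = 0.0, W = 0.0; with w_next = 1 this is a plain sample
    average, and a constant step size in place of w_next/W gives constant-alpha MC.
    """
    W = W + w_next
    V = V + (w_next / W) * (R_next - V)
    return V, W
```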
Summary
MC has several advantages over DP:
– Can learn directly from interaction with the environment
– No need for full models
– No need to learn about ALL states
– Less harmed by violations of the Markov property
MC methods provide an alternative policy evaluation process.
One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies).
No bootstrapping (as opposed to DP).
Reinforcement Learning: Elementary Solution Methods
Temporal Difference Learning
Temporal Difference Learning
TD learning combines the ideas of Monte Carlo and dynamic programming (DP).
Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics.
Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
Monte Carlo Methods
$V(s_t) \leftarrow V(s_t) + \alpha_t\left[R_t - V(s_t)\right]$
[Backup diagram: the target is the actual return, so the entire episode from $s_t$ to the terminal state T must be sampled before updating.]
Dynamic Programming
$V(s_t) \leftarrow E_\pi\left[r_{t+1} + \gamma V(s_{t+1})\right]$
[Backup diagram: a one-step full backup over all possible actions, rewards $r_{t+1}$, and successor states $s_{t+1}$.]
Basic Concept of TD(0)
Monte Carlo methods: $V(s_t) \leftarrow V(s_t) + \alpha_t\left[R_t - V(s_t)\right]$ (target: the true return)
Dynamic programming: $V(s_t) \leftarrow E_\pi\left[r_{t+1} + \gamma V(s_{t+1})\right]$ (target: a one-step expectation under the model)
TD(0): $V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$
The bracketed quantity, the sampled reward plus the predicted value at time $t + 1$ minus the current prediction, is the temporal difference.
Basic Concept of TD(0)
$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$
[Backup diagram: TD(0) backs up from the single sampled reward $r_{t+1}$ and successor state $s_{t+1}$.]
TD(0) Algorithm
Initialize $V(s)$ arbitrarily, and let $\pi$ be the policy to be evaluated.
Repeat (for each episode):
– Initialize $s$
– Repeat (for each step of the episode):
    $a \leftarrow$ action given by $\pi$ for $s$
    Take action $a$; observe reward $r$ and next state $s'$
    $V(s) \leftarrow V(s) + \alpha\left[r + \gamma V(s') - V(s)\right]$
    $s \leftarrow s'$
– until $s$ is terminal
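A direct transcription of this algorithm into Python. The env.reset()/env.step() interface is a hypothetical stand-in for the environment, and policy(s) is assumed to return the action $\pi$ picks in state s.

```python
from collections import defaultdict

def td0(env, policy, n_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation.
    env.reset() -> s;  env.step(a) -> (s2, r, done)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])   # V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            s = s2
    return V
```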
Example (Driving Home)
State              | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time
Leaving office     | 0  | 30 | 30
Reach car, raining | 5  | 35 | 40
Exit highway       | 20 | 15 | 35
Behind truck       | 30 | 10 | 40
Home street        | 40 | 3  | 43
Arrive home        | 43 | 0  | 43
TD Bootstraps and Samples
Bootstrapping: the update involves an estimate.
– MC does not bootstrap
– DP bootstraps
– TD bootstraps
Sampling: the update does not involve an expected value.
– MC samples
– DP does not sample
– TD samples
Example (Random Walk)
[Diagram: five nonterminal states A, B, C, D, E in a row; episodes start at C; all transitions have reward 0 except the transition into the right terminal state, which has reward 1.]
True values: V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6
Example (Random Walk)
[Figure: values learned by TD(0) after various numbers of episodes.]
Example (Random Walk)
[Figure: learning curves; data averaged over 100 sequences of episodes.]
Optimality of TD(0)
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
– Compute updates according to TD(0), but only update the estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating:
– TD(0) converges for sufficiently small α.
– Constant-α MC also converges under these conditions, but to a different answer!
Example: Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.
Why is TD better at generalizing in the batch update?
– MC is susceptible to poor state sampling and unusual episodes.
– TD is less affected by unusual episodes and poor sampling because its estimates are linked to those of other states, which may be better sampled; i.e., estimates are smoothed across states.
– TD converges to the correct value function for the maximum-likelihood model of the environment (the certainty-equivalence estimate).
Example: You Are the Predictor
Suppose you observe the following 8 episodes from an MDP:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
$V(A) = ?$  $V(B) = ?$
What does TD(0) give? What does constant-α MC give? What would you predict?
[Diagram: the maximum-likelihood model of the data: A leads to B with probability 100% and reward 0; from B, reward 1 with probability 75% and reward 0 with probability 25%.]
Learning an Action-Value Function
[Diagram: a trajectory of state–action pairs: $(s_t, a_t) \xrightarrow{r_{t+1}} (s_{t+1}, a_{t+1}) \xrightarrow{r_{t+2}} (s_{t+2}, a_{t+2}) \to \cdots$]
To estimate $Q^\pi(s, a)$: after every transition from a nonterminal state $s_t$, do
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.
Sarsa: On-Policy TD Control
Initialize $Q(s, a)$ arbitrarily.
Repeat (for each episode):
– Initialize $s$
– Choose $a$ from $s$ using the policy derived from $Q$ (e.g., ε-greedy)
– Repeat (for each step of the episode):
    Take action $a$; observe reward $r$ and next state $s'$
    Choose $a'$ from $s'$ using the policy derived from $Q$ (e.g., ε-greedy)
    $Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma Q(s', a') - Q(s, a)\right]$
    $s \leftarrow s'$; $a \leftarrow a'$
– until $s$ is terminal
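A sketch of tabular Sarsa under the same hypothetical env.reset()/env.step() interface as before. The name comes from the quintuple (s, a, r, s', a') used in each update.

```python
from collections import defaultdict
import random

def sarsa(env, actions, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa with an epsilon-greedy policy derived from Q."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2) if not done else None
            # terminal next state contributes Q(s', a') = 0
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```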
Example (Windy World)
Undiscounted, episodic; reward = −1 on all steps until the goal is reached.
[Figure: the gridworld with a column-dependent upward wind; one variant with standard moves, one with King's moves.]
Example (Windy World)
Applying ε-greedy Sarsa to this task, with ε = 0.1, α = 0.1, and the initial values Q(s, a) = 0 for all s, a.
Q-Learning: Off-Policy TD Control
One-step Q-learning:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$
[Backup diagrams for a deterministic policy and a stochastic policy.]
Q-Learning: Off-Policy TD Control
Initialize $Q(s, a)$ arbitrarily.
Repeat (for each episode):
– Initialize $s$
– Repeat (for each step of the episode):
    Choose $a$ from $s$ using the policy derived from $Q$ (e.g., ε-greedy)
    Take action $a$; observe reward $r$ and next state $s'$
    $Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$
    $s \leftarrow s'$
– until $s$ is terminal
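A sketch of tabular one-step Q-learning, again under the hypothetical environment interface. The differences from Sarsa are that the behaviour action is chosen fresh each step and the update bootstraps from the greedy (max) action rather than the action actually taken next, which is what makes it off-policy.

```python
from collections import defaultdict
import random

def q_learning(env, actions, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular one-step Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                  # behaviour: eps-greedy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, b)] for b in actions)
            # update targets the greedy action: off-policy
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```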
Example (Cliff Walking)
Actor-Critic Methods
[Architecture diagram: the Actor (policy) selects actions; the Critic (value function) evaluates the resulting states and sends a TD error to the actor; both observe the environment's state and reward.]
– Explicit representation of the policy as well as the value function
– Minimal computation to select actions
– Can learn an explicitly stochastic policy
– Can put constraints on policies
– Appealing as psychological and neural models
Actor-Critic Methods
Policy parameters (preferences): $p(s, a)$
Policy: $\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\} = \dfrac{e^{p(s, a)}}{\sum_b e^{p(s, b)}}$
TD error: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
Actor-Critic Methods
The critic updates the state-value function using TD(0). How should we update the policy parameters $p(s, a)$?
Actor-Critic Methods
TD error: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
If $\delta_t > 0$, the selected action performed better than expected; if $\delta_t < 0$, worse. We therefore adjust the preference in the direction that tends to maximize the value:
Method 1: $p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\, \delta_t$
Method 2: $p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\, \delta_t\left[1 - \pi_t(s_t, a_t)\right]$
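A minimal sketch of one actor-critic update using the softmax (Gibbs) policy and "Method 2" above. V and p are tabular (e.g., defaultdict(float)), and all interface names are illustrative.

```python
import math
from collections import defaultdict

def softmax_policy(p, s, actions):
    """pi(s, a) = e^{p(s,a)} / sum_b e^{p(s,b)} over the preferences p."""
    z = sum(math.exp(p[(s, b)]) for b in actions)
    return {a: math.exp(p[(s, a)]) / z for a in actions}

def actor_critic_step(V, p, s, a, r, s2, done, actions,
                      alpha=0.1, beta=0.1, gamma=1.0):
    """One update after observing the transition (s, a, r, s2)."""
    delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # TD error
    V[s] += alpha * delta                                 # critic: TD(0) update
    pi_sa = softmax_policy(p, s, actions)[a]
    p[(s, a)] += beta * delta * (1.0 - pi_sa)             # actor: "Method 2"

# Usage sketch: V = defaultdict(float); p = defaultdict(float);
# call actor_critic_step after every transition, sampling actions
# from softmax_policy(p, s, actions).
```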