Policy Gradient with Baselines
Alina Vereshchaka
CSE4/510 Reinforcement Learning, Fall 2019
October 29, 2019
*Slides are adapted from Deep Reinforcement Learning and Control by Katerina Fragkiadaki (Carnegie Mellon) & Policy Gradients by David Silver
Policy Objective Functions
Goal: given policy πθ(s, a) with parameters θ, find best θ
In episodic environments we can use the start value
J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]
In continuing environments we can use the average value
J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)
Or the average reward per time-step
J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}_s^a
where d^{\pi_\theta}(s) is the stationary distribution of the Markov chain induced by \pi_\theta
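All three objectives can be estimated from sampled trajectories. As a rough illustration, a minimal Monte Carlo sketch of the start value J_1(\theta); the Gym-style env and the policy callable are placeholder assumptions, not from the slides:

```python
import numpy as np

def estimate_start_value(env, policy, n_episodes=100, gamma=0.99):
    """Monte Carlo estimate of J_1(theta) = V^{pi_theta}(s_1).

    Assumes an old-style Gym env (reset() -> obs, step() -> 4-tuple)
    and a policy mapping an observation to an action; both are
    illustrative stand-ins.
    """
    returns = []
    for _ in range(n_episodes):
        obs, done, ret, t = env.reset(), False, 0.0, 0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ret += (gamma ** t) * reward
            t += 1
        returns.append(ret)
    return np.mean(returns)
```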
Example: Pong from Pixels

[Slides 3–12: a figure-based walkthrough of learning to play Pong directly from raw pixels; the images are not preserved in this transcript.]
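The usual setup for this example (it follows Karpathy's well-known "Pong from Pixels" demo) is to preprocess each frame and feed the difference of consecutive frames to a small policy network that outputs the probability of moving the paddle up. A minimal sketch; the crop bounds, palette values, and network shapes follow the common recipe and should be read as assumptions:

```python
import numpy as np

def preprocess(frame):
    """Crop, downsample, and binarize a 210x160x3 Pong frame
    into a flat 80x80 = 6400-dim vector."""
    img = frame[35:195][::2, ::2, 0].copy()  # crop field, downsample, 1 channel
    img[(img == 144) | (img == 109)] = 0     # erase background colours
    img[img != 0] = 1                        # paddles and ball -> 1
    return img.astype(np.float32).ravel()

def policy_forward(x, W1, W2):
    """Two-layer net: x (6400,), W1 (H, 6400), W2 (H,) -> P(action = UP)."""
    h = np.maximum(0.0, W1 @ x)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h)))   # sigmoid over a single logit
```

Acting on frame differences gives a memoryless policy access to motion information.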
REINFORCE with Baseline
REINFORCE (Monte-Carlo Policy Gradient)
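In code, the boxed algorithm amounts to: roll out one episode, compute the return G_t at every step, then ascend \alpha G_t \nabla_\theta \log \pi_\theta(a_t|s_t). A minimal PyTorch-style sketch of the update; the interface (lists of log-probabilities and rewards collected during the rollout) is an illustrative assumption:

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single finished episode.

    log_probs: list of log pi(a_t|s_t) tensors saved during the rollout.
    rewards:   list of scalar rewards r_{t+1}.
    """
    # Compute the returns G_t by scanning the rewards backwards.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    # Gradient ascent on sum_t G_t * log pi(a_t|s_t),
    # implemented as descent on the negated objective.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```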
Policy Gradient
This vanilla policy gradient update is unbiased but has high variance. Many subsequent algorithms were proposed to reduce the variance while keeping the estimator unbiased.

\nabla_\theta J(\theta) = \mathbb{E}_\pi[Q^\pi(s, a)\, \nabla_\theta \ln \pi_\theta(a \mid s)]
Policy Gradient Theorem

[Stated in a figure that is not preserved. In short: for any of the objective functions above, the policy gradient takes the form \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)], with the expectation over the on-policy state distribution.]
Policy Gradient: Problems & Solutions

Vanilla Policy Gradient
Unbiased but very noisy
Requires lots of samples to make it work

Solutions:
Baseline
Temporal structure
Other (e.g. KL trust region)
Temporal structure

[Slides 18–20: derivation shown in figures; not preserved.]
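Although the derivation figures are missing, the standard statement of the trick is short: an action taken at time t cannot influence rewards earned before t, so each log-probability term need only be weighted by the reward-to-go rather than the full-episode return. For an episodic, undiscounted objective:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T-1} r_{t'}\Big]

Dropping the unaffectable past rewards changes nothing in expectation but removes a source of variance.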
Likelihood ratio gradient estimator

Let's analyze the update:

\Delta\theta_t = \alpha G_t \nabla_\theta \log \pi_\theta(s_t, a_t)

It can be rewritten as follows:

\theta_{t+1} = \theta_t + \alpha G_t \frac{\overbrace{\nabla_\theta \pi(A_t \mid S_t, \theta)}^{①}}{\underbrace{\pi(A_t \mid S_t, \theta)}_{②}}

The update is proportional to the return G_t times the gradient of the probability of the action actually taken, divided by the probability of taking that action:

① Move most in the directions that favor actions that yield the highest return.
② The update is inversely proportional to the action probability, which counteracts the fact that frequently selected actions are at an advantage (updates would otherwise push in their direction more often).
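The identity behind this rewrite, \nabla_\theta \log \pi = \nabla_\theta \pi / \pi, is easy to verify numerically. A small self-contained check (the softmax policy and finite differences are my choices for the illustration, not from the slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -1.0, 2.0])   # logits of a 3-action policy
a, eps = 1, 1e-6                     # chosen action, finite-diff step

grad_pi = np.zeros_like(theta)       # finite-diff grad of pi(a)
grad_logpi = np.zeros_like(theta)    # finite-diff grad of log pi(a)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    grad_pi[i] = (softmax(theta + d)[a] - softmax(theta - d)[a]) / (2 * eps)
    grad_logpi[i] = (np.log(softmax(theta + d)[a])
                     - np.log(softmax(theta - d)[a])) / (2 * eps)

# grad log pi == grad pi / pi, up to finite-difference error.
assert np.allclose(grad_logpi, grad_pi / softmax(theta)[a], atol=1e-4)
```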
Policy Gradient with Baseline

Note that in general:

\mathbb{E}[b \nabla_\theta \log \pi(A_t \mid S_t)] = \mathbb{E}\Big[\sum_a \pi(a \mid S_t)\, b\, \nabla_\theta \log \pi(a \mid S_t)\Big]
= \mathbb{E}\Big[b \nabla_\theta \sum_a \pi(a \mid S_t)\Big] \quad \text{(using } \nabla_\theta \log \pi = \nabla_\theta \pi / \pi\text{)}
= \mathbb{E}[b \nabla_\theta 1]
= 0

Thus the gradient remains unchanged by the additional term.
This holds only if b does not depend on the action (though it can depend on the state).
This implies we can subtract a baseline to reduce variance.
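The zero-expectation property can also be checked by simulation: for any fixed b, the sample average of b \nabla_\theta \log \pi(A_t \mid S_t) tends to the zero vector. A small sketch (the softmax policy and the value of b are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.7, 1.2])   # logits of a 3-action policy
b = 5.0                              # any action-independent baseline

e = np.exp(theta - theta.max())
pi = e / e.sum()

# For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
grads = np.eye(3) - pi               # row a holds grad of log pi(a)

actions = rng.choice(3, size=200_000, p=pi)
print(b * grads[actions].mean(axis=0))   # ~ [0, 0, 0]
```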
Policy Gradient with Baseline

What if we subtract b from the returns?

\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi(A_t \mid S_t)\,(Q^\pi(S_t, A_t) - b)]

A good baseline is V^\pi(S_t):

\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi(A_t \mid S_t)\,(Q^\pi(S_t, A_t) - V^\pi(S_t))]
= \mathbb{E}[\nabla_\theta \log \pi(A_t \mid S_t)\, A^\pi(S_t, A_t)]

Thus we can rewrite the policy gradient using the advantage function A^\pi(S_t, A_t) = Q^\pi(S_t, A_t) - V^\pi(S_t).
REINFORCE with Baseline
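The boxed algorithm (REINFORCE with a learned state-value baseline, as in Sutton & Barto) trains a value network alongside the policy and replaces G_t with \delta_t = G_t - \hat{V}(S_t). A minimal batched PyTorch-style sketch; the two-network interface and the per-episode batching are illustrative assumptions:

```python
import torch

def reinforce_baseline_update(value_net, pi_opt, v_opt,
                              log_probs, states, rewards, gamma=0.99):
    """One update from a finished episode, using V(s) as the baseline."""
    # Returns G_t, computed backwards over the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    values = value_net(torch.stack(states)).squeeze(-1)
    deltas = returns - values.detach()       # advantage estimate G_t - V(S_t)

    # Policy step: ascend sum_t delta_t * log pi(a_t|s_t).
    pi_loss = -(torch.stack(log_probs) * deltas).sum()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

    # Value step: regress V(S_t) toward the observed return G_t.
    v_loss = torch.nn.functional.mse_loss(values, returns)
    v_opt.zero_grad()
    v_loss.backward()
    v_opt.step()
```

By the argument two slides back, subtracting \hat{V}(S_t) leaves the gradient estimate unbiased while typically reducing its variance.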
*Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).