Policy Gradient with Baselines
Alina Vereshchaka
CSE4/510 Reinforcement Learning, Fall 2019
October 29, 2019
*Slides are adapted from Deep Reinforcement Learning and Control by Katerina Fragkiadaki (Carnegie Mellon) & Policy Gradients by David Silver
Policy Objective Functions
Goal: given policy πθ(s, a) with parameters θ, find best θ
In episodic environments we can use the start value
J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]
In continuing environments we can use the average value
J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)
Or the average reward per time-step
J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}_s^a
where d^{\pi_\theta}(s) is the stationary distribution of the Markov chain induced by \pi_\theta
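All three objectives can be estimated from sampled trajectories. As a rough illustration, a minimal Monte Carlo sketch of the start value J_1(\theta); the Gym-style env and the policy callable are placeholder assumptions, not from the slides:

```python
import numpy as np

def estimate_start_value(env, policy, n_episodes=100, gamma=0.99):
    """Monte Carlo estimate of J_1(theta) = V^{pi_theta}(s_1).

    Assumes an old-style Gym env (reset() -> obs, step() -> 4-tuple)
    and a policy mapping an observation to an action; both are
    illustrative stand-ins.
    """
    returns = []
    for _ in range(n_episodes):
        obs, done, ret, t = env.reset(), False, 0.0, 0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ret += (gamma ** t) * reward
            t += 1
        returns.append(ret)
    return np.mean(returns)
```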
Example: Pong from Pixels

[Slides 3–12: a figure-based walkthrough of learning to play Pong directly from raw pixels; the images are not preserved in this transcript.]
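The usual setup for this example (it follows Karpathy's well-known "Pong from Pixels" demo) is to preprocess each frame and feed the difference of consecutive frames to a small policy network that outputs the probability of moving the paddle up. A minimal sketch; the crop bounds, palette values, and network shapes follow the common recipe and should be read as assumptions:

```python
import numpy as np

def preprocess(frame):
    """Crop, downsample, and binarize a 210x160x3 Pong frame
    into a flat 80x80 = 6400-dim vector."""
    img = frame[35:195][::2, ::2, 0].copy()  # crop field, downsample, 1 channel
    img[(img == 144) | (img == 109)] = 0     # erase background colours
    img[img != 0] = 1                        # paddles and ball -> 1
    return img.astype(np.float32).ravel()

def policy_forward(x, W1, W2):
    """Two-layer net: x (6400,), W1 (H, 6400), W2 (H,) -> P(action = UP)."""
    h = np.maximum(0.0, W1 @ x)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h)))   # sigmoid over a single logit
```

Acting on frame differences gives a memoryless policy access to motion information.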
REINFORCE with Baseline
REINFORCE (Monte-Carlo Policy Gradient)
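In code, the boxed algorithm amounts to: roll out one episode, compute the return G_t at every step, then ascend \alpha G_t \nabla_\theta \log \pi_\theta(a_t|s_t). A minimal PyTorch-style sketch of the update; the interface (lists of log-probabilities and rewards collected during the rollout) is an illustrative assumption:

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single finished episode.

    log_probs: list of log pi(a_t|s_t) tensors saved during the rollout.
    rewards:   list of scalar rewards r_{t+1}.
    """
    # Compute the returns G_t by scanning the rewards backwards.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    # Gradient ascent on sum_t G_t * log pi(a_t|s_t),
    # implemented as descent on the negated objective.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```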
Policy Gradient
This vanilla policy gradient update is unbiased but has high variance. Many subsequent algorithms were proposed to reduce the variance while keeping the estimator unbiased.

\nabla_\theta J(\theta) = \mathbb{E}_\pi[Q^\pi(s, a)\, \nabla_\theta \ln \pi_\theta(a \mid s)]
Policy Gradient Theorem

[Stated in a figure that is not preserved. In short: for any of the objective functions above, the policy gradient takes the form \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)], with the expectation over the on-policy state distribution.]
Policy Gradient: Problems & Solutions

Vanilla Policy Gradient
Unbiased but very noisy
Requires lots of samples to make it work

Solutions:
Baseline
Temporal structure
Other (e.g. KL trust region)
Temporal structure

[Slides 18–20: derivation shown in figures; not preserved.]
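Although the derivation figures are missing, the standard statement of the trick is short: an action taken at time t cannot influence rewards earned before t, so each log-probability term need only be weighted by the reward-to-go rather than the full-episode return. For an episodic, undiscounted objective:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T-1} r_{t'}\Big]

Dropping the unaffectable past rewards changes nothing in expectation but removes a source of variance.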
Likelihood ratio gradient estimator

Let's analyze the update:

\Delta\theta_t = \alpha G_t \nabla_\theta \log \pi_\theta(s_t, a_t)

It can be rewritten as follows:

\theta_{t+1} = \theta_t + \alpha G_t \frac{\overbrace{\nabla_\theta \pi(A_t \mid S_t, \theta)}^{①}}{\underbrace{\pi(A_t \mid S_t, \theta)}_{②}}

The update is proportional to the return G_t times the gradient of the probability of the action actually taken, divided by the probability of taking that action:

① Move most in the directions that favor actions that yield the highest return.
② The update is inversely proportional to the action probability, which counteracts the fact that frequently selected actions are at an advantage (updates would otherwise push in their direction more often).
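The identity behind this rewrite, \nabla_\theta \log \pi = \nabla_\theta \pi / \pi, is easy to verify numerically. A small self-contained check (the softmax policy and finite differences are my choices for the illustration, not from the slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -1.0, 2.0])   # logits of a 3-action policy
a, eps = 1, 1e-6                     # chosen action, finite-diff step

grad_pi = np.zeros_like(theta)       # finite-diff grad of pi(a)
grad_logpi = np.zeros_like(theta)    # finite-diff grad of log pi(a)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    grad_pi[i] = (softmax(theta + d)[a] - softmax(theta - d)[a]) / (2 * eps)
    grad_logpi[i] = (np.log(softmax(theta + d)[a])
                     - np.log(softmax(theta - d)[a])) / (2 * eps)

# grad log pi == grad pi / pi, up to finite-difference error.
assert np.allclose(grad_logpi, grad_pi / softmax(theta)[a], atol=1e-4)
```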
Policy Gradient with Baseline

Note that in general:

\mathbb{E}[b \nabla_\theta \log \pi(A_t \mid S_t)] = \mathbb{E}\Big[\sum_a \pi(a \mid S_t)\, b\, \nabla_\theta \log \pi(a \mid S_t)\Big]
= \mathbb{E}\Big[b \nabla_\theta \sum_a \pi(a \mid S_t)\Big] \quad \text{(using } \nabla_\theta \log \pi = \nabla_\theta \pi / \pi\text{)}
= \mathbb{E}[b \nabla_\theta 1]
= 0

Thus the gradient remains unchanged by the additional term.
This holds only if b does not depend on the action (though it can depend on the state).
This implies we can subtract a baseline to reduce variance.
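The zero-expectation property can also be checked by simulation: for any fixed b, the sample average of b \nabla_\theta \log \pi(A_t \mid S_t) tends to the zero vector. A small sketch (the softmax policy and the value of b are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.7, 1.2])   # logits of a 3-action policy
b = 5.0                              # any action-independent baseline

e = np.exp(theta - theta.max())
pi = e / e.sum()

# For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
grads = np.eye(3) - pi               # row a holds grad of log pi(a)

actions = rng.choice(3, size=200_000, p=pi)
print(b * grads[actions].mean(axis=0))   # ~ [0, 0, 0]
```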
Policy Gradient with Baseline

What if we subtract b from the returns?

\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi(A_t \mid S_t)\,(Q^\pi(S_t, A_t) - b)]

A good baseline is V^\pi(S_t):

\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi(A_t \mid S_t)\,(Q^\pi(S_t, A_t) - V^\pi(S_t))]
= \mathbb{E}[\nabla_\theta \log \pi(A_t \mid S_t)\, A^\pi(S_t, A_t)]

Thus we can rewrite the policy gradient using the advantage function A^\pi(S_t, A_t) = Q^\pi(S_t, A_t) - V^\pi(S_t).
REINFORCE with Baseline
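The boxed algorithm (REINFORCE with a learned state-value baseline, as in Sutton & Barto) trains a value network alongside the policy and replaces G_t with \delta_t = G_t - \hat{V}(S_t). A minimal batched PyTorch-style sketch; the two-network interface and the per-episode batching are illustrative assumptions:

```python
import torch

def reinforce_baseline_update(value_net, pi_opt, v_opt,
                              log_probs, states, rewards, gamma=0.99):
    """One update from a finished episode, using V(s) as the baseline."""
    # Returns G_t, computed backwards over the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    values = value_net(torch.stack(states)).squeeze(-1)
    deltas = returns - values.detach()       # advantage estimate G_t - V(S_t)

    # Policy step: ascend sum_t delta_t * log pi(a_t|s_t).
    pi_loss = -(torch.stack(log_probs) * deltas).sum()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

    # Value step: regress V(S_t) toward the observed return G_t.
    v_loss = torch.nn.functional.mse_loss(values, returns)
    v_opt.zero_grad()
    v_loss.backward()
    v_opt.step()
```

By the argument two slides back, subtracting \hat{V}(S_t) leaves the gradient estimate unbiased while typically reducing its variance.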
*Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).