Policy Gradient Methods: Pathwise Derivative Methods
and Wrap-up
March 15, 2017
Pathwise Derivative Policy Gradient Methods
Policy Gradient Estimators: Review
Deriving the Policy Gradient, Reparameterized
- Episodic MDP:

  [Stochastic computation graph: parameter θ feeds actions a_1, ..., a_T; states s_1, ..., s_T; total reward R_T]

- Want to compute ∇_θ E[R_T]. We'll use ∇_θ log π(a_t | s_t; θ)
- Reparameterize: a_t = π(s_t, z_t; θ), where z_t is noise from a fixed distribution

  [Reparameterized graph: noise nodes z_1, ..., z_T feed the actions, so θ influences each a_t deterministically]

- Only works if P(s_2 | s_1, a_1) is known
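To make the contrast concrete, here is a toy one-step check of the two estimators. The problem (unit-variance Gaussian "policy" a = θ + z, quadratic reward) is invented for illustration; the true gradient at θ = 0 is 6.

```python
import numpy as np

# Toy one-step problem (invented for illustration):
# a = theta + z, z ~ N(0, 1); reward R(a) = -(a - 3)^2.
# True gradient: d/dtheta E[R] = -2(theta - 3) = 6 at theta = 0.
rng = np.random.default_rng(0)
theta, n = 0.0, 200_000
z = rng.standard_normal(n)
a = theta + z
R = -(a - 3.0) ** 2

# Score function (REINFORCE) estimator: R * d/dtheta log N(a; theta, 1) = R * (a - theta)
score_samples = R * (a - theta)

# Pathwise (reparameterized) estimator: dR/da * da/dtheta = -2(a - 3) * 1
pathwise_samples = -2.0 * (a - 3.0)

print(score_samples.mean(), pathwise_samples.mean())  # both near 6
print(score_samples.var(), pathwise_samples.var())    # pathwise variance is far smaller
```

Both estimators are unbiased here, but the pathwise one exploits dR/da and has variance 4, versus roughly 222 for the score function estimator on this problem.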
Using a Q-function
d/dθ E[R_T] = E[ Σ_{t=1}^T (dR_T/da_t) (da_t/dθ) ]
            = E[ Σ_{t=1}^T (d/da_t E[R_T | a_t]) (da_t/dθ) ]
            = E[ Σ_{t=1}^T (dQ(s_t, a_t)/da_t) (da_t/dθ) ]
            = E[ Σ_{t=1}^T d/dθ Q(s_t, π(s_t, z_t; θ)) ]
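The final equality is just the chain rule through the action. A quick numeric check on an invented quadratic Q and linear policy (both hypothetical stand-ins, not from the SVG paper):

```python
import numpy as np

# Invented example: Q(s, a) = -(a - 2s)^2, policy pi(s, z; theta) = theta*s + 0.1*z.
def Q(s, a):
    return -(a - 2.0 * s) ** 2

def pi(s, z, theta):
    return theta * s + 0.1 * z

s, z, theta = 1.5, -0.3, 0.7
a = pi(s, z, theta)

# Chain rule: d/dtheta Q(s, pi(s, z; theta)) = dQ/da * da/dtheta
dQ_da = -2.0 * (a - 2.0 * s)
da_dtheta = s
analytic = dQ_da * da_dtheta

# Finite-difference check of the same derivative
eps = 1e-6
numeric = (Q(s, pi(s, z, theta + eps)) - Q(s, pi(s, z, theta - eps))) / (2 * eps)
print(analytic, numeric)  # the two agree
```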
SVG(0) Algorithm
- Learn Q_φ to approximate Q^{π,γ}, and use it to compute gradient estimates.
- Pseudocode:

  for iteration = 1, 2, ... do
      Execute policy π_θ to collect T timesteps of data
      Update π_θ using g ∝ ∇_θ Σ_{t=1}^T Q(s_t, π(s_t, z_t; θ))
      Update Q_φ using g ∝ ∇_φ Σ_{t=1}^T (Q_φ(s_t, a_t) − Q_t)², e.g. with TD(λ)
  end for

N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. "Learning Continuous Control Policies by Stochastic Value Gradients". arXiv preprint arXiv:1510.09142 (2015)
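The loop above can be sketched on a toy one-step problem (everything here is invented: state s ~ N(0,1), reward −(a − 2s)², linear-Gaussian policy, and a least-squares critic standing in for the TD(λ) update; the optimal policy parameter is θ = 2):

```python
import numpy as np

# One-step toy problem (invented): s ~ N(0,1), reward r = -(a - 2s)^2.
# Reparameterized policy: a = theta*s + 0.1*z. Optimal theta = 2.
rng = np.random.default_rng(0)
theta, lr = 0.0, 0.1

for iteration in range(60):
    # Execute policy to collect a batch of data
    s = rng.standard_normal(500)
    z = rng.standard_normal(500)
    a = theta * s + 0.1 * z
    r = -(a - 2.0 * s) ** 2

    # Fit Q_phi(s, a) = phi . [1, s, s^2, a, a^2, s*a] by least squares
    # (a stand-in for the TD(lambda) critic update; here Q equals the immediate reward)
    X = np.stack([np.ones_like(s), s, s**2, a, a**2, s * a], axis=1)
    phi, *_ = np.linalg.lstsq(X, r, rcond=None)

    # Policy update: g = mean of dQ_phi/da * da/dtheta, with da/dtheta = s
    dQ_da = phi[3] + 2.0 * phi[4] * a + phi[5] * s
    theta += lr * np.mean(dQ_da * s)

print(theta)  # approaches 2
```

With an exact critic fit, dQ/da = −2a + 4s, so the expected update is θ ← θ + lr·(4 − 2θ), a contraction toward θ = 2.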
SVG(1) Algorithm
- Instead of learning Q, we learn:
  - State-value function V ≈ V^{π,γ}
  - Dynamics model f, approximating s_{t+1} = f(s_t, a_t) + ζ_t
- Given transition (s_t, a_t, s_{t+1}), infer ζ_t = s_{t+1} − f(s_t, a_t)
- Q(s_t, a_t) = E[r_t + γ V(s_{t+1})] = E[r_t + γ V(f(s_t, a_t) + ζ_t)], and a_t = π(s_t, θ, ζ_t)
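The noise-inference step can be sketched on one transition; the linear model f and quadratic V below are invented stand-ins for the learned networks:

```python
import numpy as np

# Invented stand-ins for the learned dynamics model f and value function V.
def f(s, a):               # model's prediction of the next state
    return 0.9 * s + a

def V(s):                  # learned state-value function
    return -s ** 2

gamma = 0.99
s, a, s_next, r = 1.0, 0.5, 1.55, -0.2   # one observed transition

# Infer the dynamics noise for this transition
zeta = s_next - f(s, a)

# Re-simulating with the inferred noise reproduces the observed next state,
# so r + gamma * V(f(s, a) + zeta) is differentiable w.r.t. the action a
assert np.isclose(f(s, a) + zeta, s_next)

# Pathwise derivative of the one-step target w.r.t. a: gamma * V'(s_next) * df/da
dtarget_da = gamma * (-2.0 * s_next) * 1.0
print(zeta, dtarget_da)
```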
SVG(∞) Algorithm
- Just learn the dynamics model f
- Given the whole trajectory, infer all noise variables
- Freeze all policy and dynamics noise, differentiate through the entire deterministic computation graph
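A minimal sketch of "freeze the noise, differentiate the whole graph", on an invented scalar rollout (linear dynamics, linear policy, quadratic reward). The derivative is propagated forward through every timestep and checked against a finite difference:

```python
import numpy as np

# Invented toy rollout: s_{t+1} = 0.9*s_t + a_t + zeta_t, a_t = theta*s_t + z_t,
# reward r_t = -s_{t+1}^2. All noise (z_t, zeta_t) is frozen after inference,
# so the whole graph is deterministic in theta.
def rollout(theta, s0, z, zeta):
    s, R = s0, 0.0
    dsdtheta, dRdtheta = 0.0, 0.0   # forward-mode derivatives w.r.t. theta
    for t in range(len(z)):
        a = theta * s + z[t]
        dadtheta = s + theta * dsdtheta          # chain rule through the policy
        s_next = 0.9 * s + a + zeta[t]
        dsdtheta = 0.9 * dsdtheta + dadtheta     # chain rule through the dynamics
        dRdtheta += -2.0 * s_next * dsdtheta     # chain rule through the reward
        R += -s_next ** 2
        s = s_next
    return R, dRdtheta

rng = np.random.default_rng(0)
z, zeta = rng.standard_normal(5), 0.1 * rng.standard_normal(5)
theta, s0, eps = 0.3, 1.0, 1e-6

R, grad = rollout(theta, s0, z, zeta)
numeric = (rollout(theta + eps, s0, z, zeta)[0] - rollout(theta - eps, s0, z, zeta)[0]) / (2 * eps)
print(grad, numeric)  # the two agree
```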
SVG Results
- Applied to 2D robotics tasks
- Overall: different gradient estimators behave similarly

N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. "Learning Continuous Control Policies by Stochastic Value Gradients". arXiv preprint arXiv:1510.09142 (2015)
Deterministic Policy Gradient
- For Gaussian actions, variance of the score function policy gradient estimator goes to infinity as the variance goes to zero
- Intuition: finite difference gradient estimators
- But the SVG(0) gradient is fine when σ → 0:

  ∇_θ Σ_t Q(s_t, π(s_t, θ, ζ_t))

- Problem: there's no exploration.
- Solution: add noise to the policy, but estimate Q with TD(0), so it's valid off-policy
- Policy gradient is a little biased (even with Q = Q^π), but only because the state distribution is off; it gets the right gradient at every state

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. "Deterministic Policy Gradient Algorithms". ICML. 2014
Deep Deterministic Policy Gradient

- Incorporate replay buffer and target network ideas from DQN for increased stability
- Use lagged (Polyak-averaged) versions of Q_φ and π_θ for fitting Q_φ (towards Q^{π,γ}) with TD(0):

  Q_t = r_t + γ Q_{φ′}(s_{t+1}, π(s_{t+1}; θ′))

- Pseudocode:

  for iteration = 1, 2, ... do
      Act for several timesteps, add data to replay buffer
      Sample minibatch
      Update π_θ using g ∝ ∇_θ Σ_{t=1}^T Q(s_t, π(s_t, z_t; θ))
      Update Q_φ using g ∝ ∇_φ Σ_{t=1}^T (Q_φ(s_t, a_t) − Q_t)²
  end for

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. "Continuous control with deep reinforcement learning". arXiv preprint arXiv:1509.02971 (2015)
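The lagged-network machinery is small; here is a sketch with invented linear function approximators and an invented τ (real DDPG uses neural networks and a gradient optimizer):

```python
import numpy as np

tau, gamma = 0.01, 0.99

# Invented linear approximators: Q(s, a) = phi . [s, a], pi(s) = theta * s.
# Primed copies (phi_targ, theta_targ) are the lagged target networks.
phi   = np.array([1.0, 2.0]);  phi_targ   = np.array([0.5, 1.0])
theta = 0.5;                   theta_targ = 0.4

def td_target(r, s_next):
    # Compute Q_t = r + gamma * Q_{phi'}(s', pi(s'; theta')) using the lagged copies
    a_next = theta_targ * s_next
    return r + gamma * (phi_targ @ np.array([s_next, a_next]))

y = td_target(r=1.0, s_next=2.0)

# After each gradient step, move the target networks slowly toward the live ones
# (Polyak averaging): param' <- tau * param + (1 - tau) * param'
phi_targ   = tau * phi   + (1 - tau) * phi_targ
theta_targ = tau * theta + (1 - tau) * theta_targ
print(y, theta_targ)
```

Because τ is small, the targets in the critic regression drift slowly, which is what stabilizes the bootstrapped TD(0) fit.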
DDPG Results
Applied to 2D and 3D robotics tasks and driving with pixel input
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. "Continuous control with deep reinforcement learning". arXiv preprint arXiv:1509.02971 (2015)
Policy Gradient Methods: Comparison
- Two kinds of policy gradient estimator:
  - REINFORCE / score function estimator: ∇ log π(a | s) · Â
    - Learn Q or V for variance reduction, to estimate Â
  - Pathwise derivative estimators (differentiate with respect to the action)
    - SVG(0) / DPG: d/da Q(s, a) (learn Q)
    - SVG(1): d/da (r + γV(s′)) (learn f, V)
    - SVG(∞): d/da_t (r_t + γ r_{t+1} + γ² r_{t+2} + ...) (learn f)
- Pathwise derivative methods are more sample-efficient when they work (maybe), but work less generally due to high bias
Policy Gradient Methods vs Q-Function Regression Methods
- Q-function regression methods are more sample-efficient when they work, but don't work as generally
- Policy gradients are easier to debug and understand:
  - Don't have to deal with a "burn-in" period
  - When it's working, performance should be monotonically increasing
  - Diagnostics like KL, entropy, and the baseline's explained variance
- Q-function regression methods are more compatible with exploration and off-policy learning
- Policy-gradient methods are more compatible with recurrent policies
- Q-function regression methods CAN be used with continuous action spaces (e.g., S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. "Continuous deep Q-learning with model-based acceleration". (2016)), but final performance is worse (so far)
Recent Papers on Connecting Policy Gradients and Q-function Regression

- B. O'Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih. "PGQ: Combining policy gradient and Q-learning". (2016)
- Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. "Sample Efficient Actor-Critic with Experience Replay". (2016)
  - Uses adjusted returns: A. Harutyunyan, M. G. Bellemare, T. Stepleton, and R. Munos. "Q(λ) with Off-Policy Corrections". (2016); N. Jiang and L. Li. "Doubly robust off-policy value evaluation for reinforcement learning". (2016)
- O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. "Bridging the Gap Between Value and Policy Based Reinforcement Learning". (2017)
- T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. "Reinforcement Learning with Deep Energy-Based Policies". (2017)