[DL Reading Group] Bridging the Gap Between Value and Policy Based Reinforcement Learning
The Bellman equation defines the on-policy value Q^\pi, which can be learned by minimizing a single-step TD loss:

  Q^\pi(s,a) = r(s,a) + \gamma \, \mathbb{E}_\pi\!\left[Q^\pi(s',a')\right]

  L = \left(r(s,a) + \gamma Q^\pi_\theta(s',a') - Q^\pi_\theta(s,a)\right)^2

The optimal value Q^\circ satisfies the hard-max Bellman equation, learned by the Q-learning loss:

  Q^\circ(s,a) = r(s,a) + \gamma \max_{a'} Q^\circ(s',a')

  L = \left(r(s,a) + \gamma \max_{a'} Q^\circ_\theta(s',a') - Q^\circ_\theta(s,a)\right)^2
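As a minimal numerical sketch of the two one-step losses (toy tabular Q-table; all names and values here are assumed for illustration):

```python
import numpy as np

def sarsa_loss(Q, s, a, r, s_next, a_next, gamma):
    """Squared one-step TD error toward the on-policy target (Q^pi)."""
    target = r + gamma * Q[s_next, a_next]
    return (target - Q[s, a]) ** 2

def q_learning_loss(Q, s, a, r, s_next, gamma):
    """Squared one-step TD error toward the hard-max target (Q^o)."""
    target = r + gamma * Q[s_next].max()
    return (target - Q[s, a]) ** 2

# toy 2-state, 2-action table
Q = np.array([[0.0, 1.0],
              [2.0, 3.0]])
print(sarsa_loss(Q, 0, 0, 1.0, 1, 0, 0.9))    # target 1 + 0.9*Q[1,0] = 2.8
print(q_learning_loss(Q, 0, 0, 1.0, 1, 0.9))  # target 1 + 0.9*max Q[1] = 3.7
```

The only difference between the two losses is how the bootstrap action is chosen: the sampled next action versus the arg-max action.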
The on-policy Bellman equation extends naturally to an n-step (multi-step) TD loss:

  Q^\pi(s,a) = r(s,a) + \gamma \, \mathbb{E}_\pi\!\left[Q^\pi(s',a')\right]

  L = \left(\sum_{i=0}^{n-1} \gamma^i r(s_i,a_i) + \gamma^n Q^\pi_\theta(s_n,a_n) - Q^\pi_\theta(s_0,a_0)\right)^2
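A sketch of the n-step target and loss (toy reward sequence and bootstrap values assumed):

```python
import numpy as np

def n_step_loss(rewards, q_boot, q_first, gamma):
    """(sum_i gamma^i r_i + gamma^n Q(s_n,a_n) - Q(s_0,a_0))^2 for n = len(rewards)."""
    n = len(rewards)
    disc = gamma ** np.arange(n)                 # 1, gamma, gamma^2, ...
    target = np.dot(disc, rewards) + gamma ** n * q_boot
    return (target - q_first) ** 2

# three rewards of 1, bootstrap value 0: target = 1 + 0.5 + 0.25 = 1.75
print(n_step_loss([1.0, 1.0, 1.0], q_boot=0.0, q_first=2.0, gamma=0.5))
```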
By contrast, the hard-max (Q-learning) loss admits no such multi-step form; it only holds one step at a time:

  L = \left(r(s,a) + \gamma \max_{a'} Q^\circ_\theta(s',a') - Q^\circ_\theta(s,a)\right)^2
Replacing the hard max with a soft (log-sum-exp) max at temperature \tau gives the softmax Bellman equation:

  Q^*(s,a) = r(s,a) + \gamma \tau \log \sum_{a'} \exp\!\left(Q^*(s',a')/\tau\right)

Factoring out the maximizing action (s_M, a_M) shows the soft max approaches the hard max as \tau \to 0:

  \tau \log \sum_{a'} \exp\!\left(Q^*(s',a')/\tau\right)
  = \tau \log\!\left(\exp\!\left(Q^*(s_M,a_M)/\tau\right) \sum_{a'} \exp\!\left(\left(Q^*(s',a') - Q^*(s_M,a_M)\right)/\tau\right)\right)
  = \max_{a'} Q^*(s',a') + \tau \log \sum_{a'} \exp\!\left(\left(Q^*(s',a') - Q^*(s_M,a_M)\right)/\tau\right)
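The max-shift identity above is also the numerically stable way to compute the soft max; a small sketch (toy Q-values assumed):

```python
import numpy as np

def soft_max(q, tau):
    """tau * log sum_a exp(q_a / tau), via the max-shift identity."""
    m = q.max()
    return m + tau * np.log(np.sum(np.exp((q - m) / tau)))

q = np.array([1.0, 2.0, 3.0])
naive = 0.1 * np.log(np.sum(np.exp(q / 0.1)))  # direct form; can overflow for small tau
print(soft_max(q, 0.1), naive)                  # both agree, slightly above max(q) = 3
```

As tau shrinks, the correction term vanishes and the soft max converges to the hard max.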
This leads to a path consistency between the optimal value and the optimal policy:

  V^*(s) = -\tau \log \pi^*(a \mid s) + r(s,a) + \gamma V^*(s')

To derive it, consider a one-step decision problem: from state s_0 with value v_0, action a_i yields reward r_i and leads to successor state s_i with value v_i, over actions \{a_1,\dots,a_n\}, values \{v_1,\dots,v_n\}, states \{s_1,\dots,s_n\}.
Under the max-reward (hard-max) objective:

  O_{MR}(\pi) = \sum_{i=1}^{n} \pi(a_i)\left(r_i + \gamma v_i^\circ\right)

  v_0^\circ = O_{MR}(\pi^\circ) = \max_i \left(r_i + \gamma v_i^\circ\right)
Adding entropy regularization at temperature \tau:

  O_{ENT}(\pi) = \sum_{i=1}^{n} \pi(a_i)\left(r_i + \gamma v_i^* - \tau \log \pi(a_i)\right)

With S = \sum_{i=1}^{n} \exp\!\left((r_i + \gamma v_i^*)/\tau\right), this can be rewritten as a negative KL divergence plus a constant:

  O_{ENT}(\pi) = -\tau \sum_{i=1}^{n} \pi(a_i) \log \frac{\pi(a_i)}{\exp\!\left((r_i + \gamma v_i^*)/\tau\right)/S} + \tau \log S

which is maximized by the softmax policy

  \pi^*(a_i) = \frac{\exp\!\left((r_i + \gamma v_i^*)/\tau\right)}{\sum_{i'=1}^{n} \exp\!\left((r_{i'} + \gamma v_{i'}^*)/\tau\right)}

with optimal value

  v_0^* = O_{ENT}(\pi^*) = \tau \log \sum_{i=1}^{n} \exp\!\left((r_i + \gamma v_i^*)/\tau\right)
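A numerical sketch of this claim (toy rewards r_i and successor values v_i assumed): the softmax policy attains exactly the soft value v_0^*, and any other policy scores lower.

```python
import numpy as np

tau, gamma = 1.0, 0.9
r = np.array([1.0, 0.0, -1.0])   # per-action rewards r_i (toy values)
v = np.array([0.5, 2.0, 0.0])    # successor soft values v_i^* (toy values)

logits = (r + gamma * v) / tau
pi_star = np.exp(logits - logits.max())
pi_star /= pi_star.sum()                    # softmax policy pi^*(a_i)
v0 = tau * np.log(np.sum(np.exp(logits)))   # v_0^* = tau * logsumexp

def o_ent(pi):
    """Entropy-regularized objective O_ENT(pi)."""
    return np.sum(pi * (r + gamma * v - tau * np.log(pi)))

print(o_ent(pi_star), v0)        # equal: the softmax policy attains v_0^*
print(o_ent(np.ones(3) / 3))     # the uniform policy scores strictly lower here
```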
Substituting the softmax policy

  \pi^*(a_i) = \frac{\exp\!\left((r_i + \gamma v_i^*)/\tau\right)}{\sum_{i'=1}^{n} \exp\!\left((r_{i'} + \gamma v_{i'}^*)/\tau\right)}

and taking logs gives \tau \log \pi^*(a_i) = r_i + \gamma v_i^* - v_0^*, so the consistency holds for every action a_i, not only the greedy one:

  v_0^* = -\tau \log \pi^*(a_i) + r(s_i,a_i) + \gamma v_i^*
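This per-action consistency is easy to check numerically (toy rewards and values assumed):

```python
import numpy as np

tau, gamma = 0.5, 0.9
r = np.array([1.0, 0.0, 2.0])    # toy rewards r_i
v = np.array([0.0, 1.5, -0.5])   # toy successor values v_i^*

logits = (r + gamma * v) / tau
log_pi = logits - np.log(np.sum(np.exp(logits)))  # log pi^*(a_i)
v0 = tau * np.log(np.sum(np.exp(logits)))         # v_0^*

# path consistency: v_0^* = -tau*log pi^*(a_i) + r_i + gamma*v_i^* for EVERY i
print(-tau * log_pi + r + gamma * v)  # every entry equals v_0^*
```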
In general, path consistency holds along any sub-trajectory. For a single step,

  V^*(s) = -\tau \log \pi^*(a \mid s) + r(s,a) + \gamma V^*(s')

and, unrolled over t steps,

  -V^*(s_1) + \gamma^{t-1} V^*(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi^*) = 0

where

  R(s_{m:n}) = \sum_{i=0}^{n-m-1} \gamma^i r(s_{m+i}, a_{m+i}), \qquad
  G(s_{m:n}, \pi) = \sum_{i=0}^{n-m-1} \gamma^i \log \pi(a_{m+i} \mid s_{m+i})
Path Consistency Learning (PCL) measures the violation of this identity with a consistency error C_{\theta,\phi} and updates the policy parameters \theta and value parameters \phi in proportion to it:

  C_{\theta,\phi}(s_{1:t}) = -V_\phi(s_1) + \gamma^{t-1} V_\phi(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi_\theta)

  \Delta\theta \propto C_{\theta,\phi}(s_{1:t}) \, \nabla_\theta G(s_{1:t}, \pi_\theta)

  \Delta\phi \propto C_{\theta,\phi}(s_{1:t}) \left(\nabla_\phi V_\phi(s_1) - \gamma^{t-1} \nabla_\phi V_\phi(s_t)\right)
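A minimal sketch of the consistency-error computation over one sampled sub-trajectory (numpy; rewards, log-probabilities, and boundary values are toy inputs). In PCL, both the policy and value parameters would then be nudged in proportion to this scalar:

```python
import numpy as np

def consistency_error(rewards, logps, v_first, v_last, gamma, tau):
    """C = -V(s_1) + gamma^(t-1) V(s_t) + R(s_{1:t}) - tau * G(s_{1:t}, pi)."""
    steps = len(rewards)                  # t - 1 transitions in the sub-trajectory
    disc = gamma ** np.arange(steps)      # 1, gamma, gamma^2, ...
    R = np.dot(disc, rewards)             # discounted reward sum R(s_{1:t})
    G = np.dot(disc, logps)               # discounted log-prob sum G(s_{1:t}, pi)
    return -v_first + gamma ** steps * v_last + R - tau * G

# two-step sub-trajectory under a uniform 2-action policy (log 0.5 per step)
print(consistency_error([1.0, 0.0], [np.log(0.5)] * 2,
                        v_first=1.0, v_last=0.0, gamma=0.5, tau=1.0))
```

Driving this error toward zero on both on-policy and replayed sub-trajectories is what couples the value and policy updates in PCL.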
For comparison, the advantage actor-critic update uses the d-step advantage

  A_{\theta,\phi}(s_{1:d+1}) = -V_\phi(s_1) + \gamma^{d} V_\phi(s_{d+1}) + R(s_{1:d+1})

  \Delta\theta \propto \mathbb{E}_{s_{0:T}}\!\left[\sum_{i=0}^{T-1} A_{\theta,\phi}(s_{i:i+d}) \, \nabla_\theta \log \pi_\theta(a_i \mid s_i)\right]

  \Delta\phi \propto \mathbb{E}_{s_{0:T}}\!\left[\sum_{i=0}^{T-1} A_{\theta,\phi}(s_{i:i+d}) \, \nabla_\phi V_\phi(s_i)\right]

which has the same structure as PCL's consistency error, except for the discounted log-probability term:

  C_{\theta,\phi}(s_{1:t}) = -V_\phi(s_1) + \gamma^{t-1} V_\phi(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi_\theta)