Least Squares SVM for Least Squares TD Learning
Tobias Jung, University of Mainz, Germany
Daniel Polani, University of Hertfordshire, U.K.
Least Squares SVM for Least Squares TD Learning — ECAI 2006 – p.1/19
[...] in reinforcement learning, approximate policy iteration
Methods: least-squares policy evaluation (LS-TD)
Kernel-based learning (LS-SVM)
Subset of regressors approximation
Experiments: toy problems + a bigger "real-world" problem
Motivation: to obtain the best of two worlds:
Least-squares TD learning (much faster convergence than online TD($\lambda$))
Least-squares SVM (superior [...] function approximators)
A Markov decision process basically consists of:
States: $S = \{s_1, \dots, s_N\}$
Actions: $A = \{a_1, \dots, a_M\}$
Rewards: $R(s'|s, a)$
Transition probabilities: $P(s'|s, a)$ (Markov)
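[Figure: control cycle $t = 0, 1, 2, \dots$ between the decision-maker (agent, robot, ...) and the environment. In state $s_t$ the decision-maker chooses action $a_t$; the environment returns the next state $s_{t+1}$ and the reward $r_{t+1}$.]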
Objective: choose actions to maximize [...]
[...] usually [...]
⇒ choose actions to maximize long-term reward
[...] infinite-horizon discounted sum of rewards
Policy: $\pi : S \to A$ (deterministic, stationary)
Value function ($0 < \gamma < 1$ discount factor):
\[ V^\pi(s) = E_\pi\Big[ \sum_{t \ge 0} \gamma^t R(s_{t+1}|s_t, a_t) \;\Big|\; s_0 = s,\ a_t = \pi(s_t) \Big] \quad \forall s \]
Bellman's equation: $V^\pi$ obeys the fixed-point relation $V^\pi = T^\pi V^\pi$, where
\[ (T^\pi V)(s) = \sum_{s'} P(s'|s, \pi(s)) \big[ R(s'|s, \pi(s)) + \gamma V(s') \big] \]
Goal: find the optimal policy $\pi^* = \arg\max_\pi V^\pi$, i.e. a policy $\pi^*$ satisfying $V^{\pi^*}(s) \ge V^\pi(s)$, $\forall s, \forall \pi$
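The fixed-point relation $V^\pi = T^\pi V^\pi$ can be illustrated by repeatedly applying the Bellman operator until the value function stops changing. The following is a minimal sketch only: the two-state MDP, its numbers, and the names `bellman_operator` and `evaluate_policy` are assumptions made for illustration and do not come from the slides.

```python
# Tiny illustrative MDP (hypothetical numbers): 2 states, 2 actions.
# P[s][a][s2] is the transition probability P(s2|s,a), R[s][a][s2] the reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[[0.0, 1.0], [0.0, 2.0]],
     [[1.0, 0.0], [0.0, 0.5]]]
GAMMA = 0.9  # discount factor, 0 < gamma < 1


def bellman_operator(V, policy):
    """Return T^pi V, i.e. sum_s' P(s'|s,pi(s)) [R(s'|s,pi(s)) + gamma V(s')]."""
    n = len(V)
    return [sum(P[s][policy[s]][s2] * (R[s][policy[s]][s2] + GAMMA * V[s2])
                for s2 in range(n))
            for s in range(n)]


def evaluate_policy(policy, tol=1e-12, max_iter=100_000):
    """Iterate V <- T^pi V until successive iterates differ by less than tol."""
    V = [0.0] * len(P)
    for _ in range(max_iter):
        V_new = bellman_operator(V, policy)
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


policy = [1, 0]                 # pi(s0) = a1, pi(s1) = a0 (arbitrary choice)
V_pi = evaluate_policy(policy)  # approximately satisfies V = T^pi V
print(V_pi)
```

Because $T^\pi$ is a contraction for $\gamma < 1$, the iteration converges from any starting vector; the model $P$, $R$ must be known for this to work.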
[...] to obtain $\pi^*$, e.g. methods based on policy iteration
Policy iteration: choose initial policy $\pi_0$.
Iterate for $k = 0, 1, \dots$
Policy evaluation: compute $V^{\pi_k}$
Policy improvement: generate policy $\pi_{k+1}$ from $V^{\pi_k}$:
\[ \pi_{k+1}(s) = \arg\max_a \Big\{ \sum_{s'} P(s'|s, a) \big[ R(s'|s, a) + \gamma V^{\pi_k}(s') \big] \Big\}, \quad \forall s \]
Problems: the model $P(s'|s, a)$, $R(s'|s, a)$ must be known → simulation
Number of states large or states $\in \mathbb{R}^d$ → function approximation
⇒ approximate policy iteration, approximate policy evaluation (with least squares)
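The evaluation and improvement steps of policy iteration can be sketched on a small tabular problem where the model is known. This is an illustrative sketch under assumptions: the two-state MDP and the helper names (`q_value`, `evaluate`, `improve`, `policy_iteration`) are hypothetical, and policy evaluation is done by plain fixed-point sweeps rather than any method from the slides.

```python
# Hypothetical 2-state, 2-action MDP: P[s][a][s2], R[s][a][s2].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[[0.0, 1.0], [0.0, 2.0]],
     [[1.0, 0.0], [0.0, 0.5]]]
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2


def q_value(V, s, a):
    """sum_s' P(s'|s,a) [R(s'|s,a) + gamma V(s')]."""
    return sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2])
               for s2 in range(N_STATES))


def evaluate(policy, sweeps=1000):
    """Policy evaluation: approximate V^pi by repeated Bellman sweeps."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [q_value(V, s, policy[s]) for s in range(N_STATES)]
    return V


def improve(V):
    """Policy improvement: greedy action in every state with respect to V."""
    return [max(range(N_ACTIONS), key=lambda a: q_value(V, s, a))
            for s in range(N_STATES)]


def policy_iteration(max_rounds=100):
    policy = [0] * N_STATES          # initial policy pi_0
    V = evaluate(policy)
    for _ in range(max_rounds):
        new_policy = improve(V)
        if new_policy == policy:     # greedy policy unchanged: stop
            break
        policy = new_policy
        V = evaluate(policy)
    return policy, V


pi_star, V_star = policy_iteration()
print(pi_star, V_star)
```

The sketch relies on enumerating all states and on a known model, which is exactly what stops working in the large or continuous case the slide points to.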
Assume the value function is linearly parameterized:
\[ V(s) = [\phi_m(s)]^T w \]
where $\phi_m(s) = [\phi_1(s), \dots, \phi_m(s)]^T$ is the $m \times 1$ feature vector, $w$ is the $m \times 1$ weight vector, and $\phi_i(s) : S \to \mathbb{R}$ are the basis functions.
Approximate policy evaluation: choose initial policy $\pi_0$.
Iterate for $k = 0, 1, 2, \dots$
Observe a long trajectory under fixed $\pi_k$ (e.g. [...]):
\[ s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, \dots, s_{t-1}, a_{t-1}, r_t, s_t, a_t \]
where $s_i \sim P(s_i|s_{i-1}, a_{i-1})$, $r_i = R(s_i|s_{i-1}, a_{i-1})$, $a_i = \pi(s_i)$
Estimate $V^{\pi_k}$ using the trajectory (approximate policy evaluation)
Derive $\pi_{k+1}$ as greedy policy from $\pi_k$
Goal: determine weights $w$ in $V^{\pi_k}(s) = [\phi_m(s)]^T w$ from trajectory $s_0, s_1, \dots, s_t$ and rewards $r_1, \dots, r_t$.
Bellman residual minimization approach: determine weights $w$ by
\[ w = \arg\min_w \sum_{i=0}^{t-1} \Big\{ [\phi_m(s_i)]^T w - \sum_{s'} P(s'|s_i, \pi_k(s_i)) \big[ R(s'|s_i, \pi_k(s_i)) + \gamma [\phi_m(s')]^T w \big] \Big\}^2 \]
Deterministic transitions: we can use the observed trajectory:
\[ w = \arg\min_w \sum_{i=0}^{t-1} \Big\{ [\phi_m(s_i)]^T w - \big[ r_{i+1} + \gamma [\phi_m(s_{i+1})]^T w \big] \Big\}^2 \]
Fixed point approximation approach (LSTD): determine weights $w$ by solving the fixed-point equation
\[ w = \arg\min_u \sum_{i=0}^{t-1} \Big\{ [\phi_m(s_i)]^T u - \big[ r_{i+1} + \gamma [\phi_m(s_{i+1})]^T w \big] \Big\}^2 \]
where $u$ is the minimization variable and $w$ appears on both sides.
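Both criteria reduce to linear systems in $w$. With $D$ the matrix whose rows are $[\phi_m(s_i) - \gamma \phi_m(s_{i+1})]^T$ and $\Phi$ the matrix with rows $[\phi_m(s_i)]^T$, Bellman residual minimization is the least-squares problem $Dw \approx r$, while the fixed point satisfies $\Phi^T D w = \Phi^T r$. The sketch below checks this on a made-up example; the deterministic three-state cycle, the one-hot features, and all variable names are assumptions for illustration only.

```python
import numpy as np

GAMMA = 0.9


def features(s):
    """One-hot feature vector phi_m(s) for a 3-state problem (m = 3)."""
    phi = np.zeros(3)
    phi[s] = 1.0
    return phi


# Hypothetical deterministic cycle 0 -> 1 -> 2 -> 0; reward depends on the state left.
reward_on_leaving = [1.0, 0.0, 2.0]
states = [i % 3 for i in range(31)]                        # s_0, ..., s_30
r = np.array([reward_on_leaving[s] for s in states[:-1]])  # r_1, ..., r_30

Phi = np.array([features(s) for s in states[:-1]])         # rows phi(s_i)^T
Phi_next = np.array([features(s) for s in states[1:]])     # rows phi(s_{i+1})^T
D = Phi - GAMMA * Phi_next

# Bellman residual minimization: least squares on D w = r.
w_brm = np.linalg.lstsq(D, r, rcond=None)[0]

# Fixed point approximation (LSTD): solve Phi^T D w = Phi^T r.
w_lstd = np.linalg.solve(Phi.T @ D, Phi.T @ r)

# Exact value function of the cycle for comparison: (I - gamma P) V = r.
P_cycle = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
V_true = np.linalg.solve(np.eye(3) - GAMMA * P_cycle, np.array(reward_on_leaving))
print(w_brm, w_lstd, V_true)
```

In this deterministic, tabular setting both approaches recover the exact value function; they generally differ once transitions are stochastic or the features cannot represent $V^{\pi_k}$ exactly.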
Least squares in approximate policy evaluation
[Figure: three sketches of the set of approximate value functions and the Bellman operator $T^\pi$. Panel "What we are looking for": $V^\pi_{\text{true}}$ and $V^\pi_{\text{optimal}}$. Panel "Bellman residual minimization": $V^\pi$ and $T^\pi V^\pi$. Panel "Fixed point approximation": $V^\pi$ and $T^\pi V^\pi$.]
Bellman residual minimization approach:
\[ \| V - T^\pi V \|^2 \to \min_V \]
Fixed point approximation approach (LSTD):
\[ V = \arg\min_U \| U - T^\pi V \|^2 \]
where $U$ is the minimization variable and $V$ appears on both sides.
How LS-SVM (regularization networks, kernel ridge regression, Gaussian process regression) are applied in function approximation.
The short story (using subset of regressors approximation):
Given: set of $t$ training data $\{x_i, y_i\}_{i=1}^t$
Choose: kernel function $k$ that generates the RKHS $H_k$, the function space of possible solutions (e.g. polynomials, Gaussians, etc.)
Select: subset $\{x_i\}_{i=1}^m$ of the training data, where $m \ll t$
Represent: the solution by $f(\cdot) = \sum_{i=1}^m k(x_i, \cdot) w_i$
Solve: the quadratic m-by-m problem to obtain the weights $w$
\[ \min_{w \in \mathbb{R}^m} \| K_{tm} w - y \|^2 + \lambda w^T K_{mm} w \]
where $[K_{tm}]_{ij} = k(x_i, x_j)$ is a $t \times m$ matrix, $[K_{mm}]_{ij} = k(x_i, x_j)$ is an $m \times m$ matrix, and $\lambda$ is the regularization parameter.
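Setting the gradient of the objective to zero gives the m-by-m linear system $(K_{tm}^T K_{tm} + \lambda K_{mm}) w = K_{tm}^T y$. The sketch below solves it for a toy regression problem. The Gaussian kernel, the sine data, the choice of every tenth point as subset, and the function names are all assumptions made for illustration, not choices taken from the slides.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Matrix of k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for scalar inputs."""
    diff = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))


def objective(w, K_tm, K_mm, y, lam):
    """||K_tm w - y||^2 + lam * w^T K_mm w."""
    residual = K_tm @ w - y
    return float(residual @ residual + lam * (w @ K_mm @ w))


def fit_subset_of_regressors(K_tm, K_mm, y, lam):
    """Solve (K_tm^T K_tm + lam K_mm) w = K_tm^T y for the m weights."""
    return np.linalg.solve(K_tm.T @ K_tm + lam * K_mm, K_tm.T @ y)


x = np.linspace(0.0, 2.0 * np.pi, 80)   # t = 80 training inputs
y = np.sin(x)                           # toy targets
x_subset = x[::10]                      # m = 8 selected points
lam = 1e-3                              # regularization parameter

K_tm = gaussian_kernel(x, x_subset)
K_mm = gaussian_kernel(x_subset, x_subset)
w = fit_subset_of_regressors(K_tm, K_mm, y, lam)


def predict(x_new):
    """f(x) = sum_i k(x_i, x) w_i over the selected subset."""
    return gaussian_kernel(x_new, x_subset) @ w


print(predict([1.0, 2.0]))
```

Only an m-by-m system is solved, so the cost of the solve does not grow with $t$ beyond forming $K_{tm}^T K_{tm}$ and $K_{tm}^T y$.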
Application to Bellman residuals
Training data: observed trajectory $s_0, s_1, s_2, \dots, s_t$ along with rewards $r_1, r_2, \dots, r_t$
Represent the value function by $V^\pi(\cdot) = \sum_{i=1}^m k(s_i, \cdot) w_i$, where $\{s_i\}_{i=1}^m$ is a subset
Solve the quadratic m-by-m problem [...]:
\[ \min_{w \in \mathbb{R}^m} \| H_{tm} w - r \|^2 + \lambda w^T K_{mm} w \]
where
\[ k_m(\cdot) = \begin{bmatrix} k(s_1, \cdot) \\ \vdots \\ k(s_m, \cdot) \end{bmatrix}, \quad H_{tm} = \begin{bmatrix} [k_m(s_0) - \gamma k_m(s_1)]^T \\ \vdots \\ [k_m(s_{t-1}) - \gamma k_m(s_t)]^T \end{bmatrix}, \quad r = \begin{bmatrix} r_1 \\ \vdots \\ r_t \end{bmatrix} \]
Obtain generalized normal equations:
\[ w = \left( H_{tm}^T H_{tm} + \lambda K_{mm} \right)^{-1} H_{tm}^T r \]
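The generalized normal equations can be assembled directly from a trajectory. The following sketch does so for a made-up problem: the deterministic three-state cycle, the Gaussian kernel width, the tiny $\lambda$, and the choice of all three states as subset are assumptions for illustration and are not taken from the slides.

```python
import numpy as np

GAMMA = 0.9
SIGMA = 0.5     # kernel width (assumed)
LAM = 1e-8      # regularization parameter (assumed, tiny so V is nearly exact)


def kernel_vector(s, dictionary):
    """k_m(s) = [k(s_1, s), ..., k(s_m, s)]^T for a Gaussian kernel."""
    d = np.asarray(dictionary, dtype=float)
    return np.exp(-(d - s) ** 2 / (2.0 * SIGMA ** 2))


# Hypothetical deterministic cycle 0 -> 1 -> 2 -> 0 on the real line.
reward_on_leaving = [1.0, 0.0, 2.0]
states = [float(i % 3) for i in range(31)]                     # s_0, ..., s_30
r = np.array([reward_on_leaving[int(s)] for s in states[:-1]])  # r_1, ..., r_30

dictionary = [0.0, 1.0, 2.0]                                    # subset {s_i}, m = 3
K_mm = np.array([kernel_vector(s, dictionary) for s in dictionary])

# Rows of H_tm are [k_m(s_i) - gamma k_m(s_{i+1})]^T.
H_tm = np.array([kernel_vector(s, dictionary) - GAMMA * kernel_vector(s_next, dictionary)
                 for s, s_next in zip(states[:-1], states[1:])])

# Generalized normal equations: (H^T H + lam K_mm) w = H^T r.
w = np.linalg.solve(H_tm.T @ H_tm + LAM * K_mm, H_tm.T @ r)

# Value estimates V(s) = k_m(s)^T w at the three states, and the exact values.
V_hat = K_mm @ w
P_cycle = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
V_true = np.linalg.solve(np.eye(3) - GAMMA * P_cycle, np.array(reward_on_leaving))
print(V_hat, V_true)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical convenience of this sketch; the slides write the solution with an explicit inverse.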
Online selection: assume data becomes available sequentially, $t = 1, 2, \dots$
Start with an empty subset ("dictionary" of basis functions)
At time $t$: try to approximate the new data $s_t$ from the current dictionary:
Criterion: if
\[ k(s_t, s_t) - [k_m(s_t)]^T K_{mm}^{-1} k_m(s_t) > \text{TOL} \]
then $s_t$ is added to the subset
Overall costs: $O(m^2)$, where $m$ is the current size of the subset
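The admission test can be sketched in a few lines. This is an illustrative sketch under assumptions: the Gaussian kernel, its width, the tolerance, the scalar data stream, and the name `build_dictionary` are hypothetical. For simplicity it re-solves with $K_{mm}$ at every step, which costs $O(m^3)$; the $O(m^2)$ figure on the slide would require updating $K_{mm}^{-1}$ incrementally, which is not shown here.

```python
import numpy as np


def kernel(a, b, sigma=0.5):
    """Gaussian kernel on scalars (assumed choice for this sketch)."""
    return float(np.exp(-(a - b) ** 2 / (2.0 * sigma ** 2)))


def build_dictionary(stream, tol=0.1):
    """Admit s_t whenever k(s_t,s_t) - k_m(s_t)^T K_mm^{-1} k_m(s_t) > tol."""
    dictionary = []
    for s in stream:
        if not dictionary:               # empty dictionary: first point is admitted
            dictionary.append(s)
            continue
        K_mm = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
        k_m = np.array([kernel(a, s) for a in dictionary])
        residual = kernel(s, s) - k_m @ np.linalg.solve(K_mm, k_m)
        if residual > tol:
            dictionary.append(s)
    return dictionary


stream = [0.0, 0.01, 1.0, 1.02, 2.0, 0.5]
dictionary = build_dictionary(stream)
print(dictionary)
```

Points that lie close to an existing dictionary element are well approximated and skipped, so the dictionary grows only where the data visit new regions.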
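Recursive implementation: at time $t$ observe $s_{t-1} \to s_t$ under reward $r_t$.
Compute
\[ w_{tm} = \left( \begin{bmatrix} H_{t-1,m} \\ h_t^T \end{bmatrix}^T \begin{bmatrix} H_{t-1,m} \\ h_t^T \end{bmatrix} + \lambda K_{mm} \right)^{-1} \begin{bmatrix} H_{t-1,m} \\ h_t^T \end{bmatrix}^T \begin{bmatrix} r_{t-1} \\ r_t \end{bmatrix} \]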
Definition: $P_{t-1,m} = (H_{t-1,m}^T H_{t-1,m} + \lambda K_{mm})^{-1}$, $s_{t-1,m} = H_{t-1,m}^T r_{t-1}$. Updates via:
Route 1: formulas of Sherman-Morrison-Woodbury ("recursive least squares")
Route 2: update of the Cholesky factorization: $P_{t-1,m} = \Phi_{t-1,m}^{1/2} \Phi_{t-1,m}^{T/2}$
Case 1: current example is represented by the current dictionary
Update $\{P_{t-1,m}, s_{t-1,m}, w_{t-1,m}\} \to \{P_{tm}, s_{tm}, w_{tm}\}$
Costs $O(m^2)$
Case 2: current example is not represented by the current dictionary
Add $s_t$ as new basis function to the dictionary
Update $\{P_{t-1,m}, s_{t-1,m}, w_{t-1,m}\} \to \{P_{t,m+1}, s_{t,m+1}, w_{t,m+1}\}$
Costs $O(m^2)$
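For Case 1 (fixed dictionary), Route 1 amounts to a rank-one Sherman-Morrison update of $P$ each time a new row $h_t^T$ arrives. The sketch below checks that the recursion reproduces the batch solution. The random rows standing in for $h_t$, the small kernel matrix, $\lambda$, and all variable names are assumptions for illustration; Case 2 and the Cholesky route are not covered.

```python
import numpy as np

rng = np.random.default_rng(0)
m, t, lam = 3, 20, 0.1

# Assumed kernel matrix K_mm of a fixed 3-element dictionary (Gaussian kernel).
points = np.array([0.0, 1.0, 2.0])
K_mm = np.exp(-(points[:, None] - points[None, :]) ** 2 / 2.0)

# Synthetic data: row i of H plays the role of h_t^T, r holds the rewards.
H = rng.normal(size=(t, m))
r = rng.normal(size=t)

# Start from the empty-data quantities: P_0 = (lam K_mm)^{-1}, s_0 = 0.
P = np.linalg.inv(lam * K_mm)
s_vec = np.zeros(m)

for h_t, r_t in zip(H, r):
    Ph = P @ h_t
    # Sherman-Morrison: (P^{-1} + h h^T)^{-1} = P - (P h)(P h)^T / (1 + h^T P h)
    P = P - np.outer(Ph, Ph) / (1.0 + h_t @ Ph)
    s_vec = s_vec + h_t * r_t            # s_t = s_{t-1} + h_t r_t
    w_recursive = P @ s_vec              # weights after this step, O(m^2) work

# Batch solution of the same regularized problem for comparison.
w_batch = np.linalg.solve(H.T @ H + lam * K_mm, H.T @ r)
print(w_recursive, w_batch)
```

Each step touches only m-by-m quantities, which is where the $O(m^2)$ per-step cost comes from.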
Task: policy evaluation for [...], i.e. observe transitions [...] under the optimal policy
Compared: tilecoding CMAC (10x10x10) vs. our approach ([...]) vs. RBFnet (12x12)
[Figure, left panel: "Puddleworld (101x101 cells)" on the unit square, showing the goal region and an optimal trajectory.]
[Figure, right panel: "Puddleworld LSTD Policy Evaluation"; x-axis: trials (0 to 500); y-axis: mean approximation error per trial (0 to 30); curves: CMAC 10x10x10, RBFnet 12x12, LSSVM (nu=0.1, sigma=1/50), LSSVM (nu=0.01, sigma=1/50).]
Required resources: CMAC (1000 weights), RBFnet (144 weights), our approach ([...] weights)
Compared: [...] Sarsa($\lambda$) + tilecoding vs. our approach
[Figure, left panel: "Puddleworld OPI"; x-axis: trials (0 to 2000); y-axis: total reward per trial, smoothed average of 100 runs (-160 to 0); curves: online sarsa(lambda) CMAC 20x20x10, LSSVM-LSTD (nu=0.01, sigma=1/50).]
[Figure, right panel: "Puck-on-hill OPI"; x-axis: trials (0 to 1000); y-axis: total reward per trial, smoothed average of 100 runs (-0.8 to 1); curves: LSSVM-LSTD (nu=0.01, sigma=1/50), online sarsa(lambda) CMAC 10x10x10, online sarsa(lambda) CMAC 7x7x7.]
[...] improvements
Allow stochastic transitions: fixed-point approximation LSTD($\lambda$) instead of Bellman residuals
Allow model-free learning: [...] state-action values (Q-function instead of V-function)
Consider supervised [...] during basis selection ⇒ reduces size of dictionary by [...]
[...] how to maximize the time the keepers control the ball ([...])
[Figure: keepaway field with boundary (20m x 20m) and center, keepers #1 and #2, takers #1 and #2. The acting keeper with the ball may hold the ball or pass to keeper #1 or #2.]
Challenges: dimensionality of the state space ([...] dimensions)
Stochastic transitions (noise [...], multiple fully autonomous agents [...])
Real-time learning (uses "real" soccer server)
Compared: [...] Sarsa($\lambda$) + tilecoding vs. our approach
[Figure: "3vs2 keepaway (field size 20m x 20m)"; x-axis: training time (hours), 0 to 40; y-axis: episode duration (secs), 4 to 22; curves: our approach; Stone, Sutton & Kuhlman (2005); random behavior; optimized handcoded behavior.]
Demonstration
Problem: [...] in reinforcement learning
Idea: combine LS-SVM (superior generalization) with LS-TD (very fast convergence)
Methods: approximate policy evaluation with least-squares methods
deterministic + stochastic
model-based + model-free
LS-SVM + subset of regressors
online selection of subset (unsupervised + supervised)
Results: beats standard tilecoding in LSTD [...] (for [...] problems)