Special Topics in Learning (Tópicos Especiais em Aprendizagem). Prof. Reinaldo Bianchi, Centro Universitário da FEI, 2012.
![Page 1: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/1.jpg)
1
Special Topics in Learning
Prof. Reinaldo Bianchi
Centro Universitário da FEI
2012
![Page 2: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/2.jpg)
2
Goal of This Lecture
Reinforcement learning:
– Eligibility traces.
– Generalization and function approximation.
Today's lecture: chapters 7 and 8 of Sutton & Barto.
![Page 3: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/3.jpg)
3
Generalization and Function Approximation
Chapter 8 of Sutton and Barto.
![Page 4: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/4.jpg)
4
Objectives
Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
Overview of function approximation (FA) methods and how they can be adapted to RL.
Not in as much depth as in the book (Bianchi's comment).
![Page 5: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/5.jpg)
5
Value Prediction with Function Approximation
As usual: policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
In earlier chapters, value functions were stored in lookup tables.
![Page 6: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/6.jpg)
6
Adapt Supervised Learning Algorithms
Inputs → Supervised Learning System → Outputs
Training Info = desired (target) outputs
Error = (target output – actual output)
Training example = {input, target output}
![Page 7: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/7.jpg)
7
Backups as Training Examples
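As a training example: {input, target output}, where the input is a description of the state being backed up and the target output is the backed-up value.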
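As a concrete instance, consider the TD(0) backup. The following is a sketch in the notation of Sutton & Barto (α is the step size, γ the discount rate), not a transcription of the slide's own formula:

$$
V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]
$$

Read as a training example, the input is a description of $s_t$ and the target output is $r_{t+1} + \gamma V(s_{t+1})$.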
![Page 8: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/8.jpg)
8
Any FA Method?
In principle, yes:
– artificial neural networks
– decision trees
– multivariate regression methods
– etc.
But RL has some special requirements:
– usually want to learn while interacting
– ability to handle nonstationarity
– other?
![Page 9: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/9.jpg)
9
Gradient Descent Methods
Parameter vector θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T, where T denotes transpose.
![Page 10: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/10.jpg)
10
Performance Measures
A common and simple one is the mean-squared error (MSE) over a distribution P :
Let us assume that P is always the distribution of states at which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
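The mean-squared error named on this slide can be written as follows. This is a sketch in the notation of Sutton & Barto's chapter 8, where $\vec\theta_t$ is the parameter vector, $V_t$ the current approximation and $S$ the state set:

$$
\mathrm{MSE}(\vec\theta_t) = \sum_{s \in S} P(s)\left[V^\pi(s) - V_t(s)\right]^2
$$

Because $P$ weights the squared errors, states that are visited more often under $P$ count more toward the error.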
![Page 11: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/11.jpg)
11
Gradient Descent
Iteratively move down the gradient:
![Page 12: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/12.jpg)
12
Gradient Descent Cont.
For the MSE given above and using the chain rule:
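A sketch of the derivation in Sutton & Barto's notation, assuming $V_t$ is a differentiable function of $\vec\theta_t$ and the MSE is weighted by the distribution $P$. For a generic objective $f$, one gradient step is

$$
\vec\theta_{t+1} = \vec\theta_t - \alpha \nabla_{\vec\theta_t} f(\vec\theta_t).
$$

Applying this to the MSE, with a conventional factor of $\tfrac{1}{2}$ that cancels the 2 produced by the chain rule:

$$
\vec\theta_{t+1}
= \vec\theta_t - \tfrac{1}{2}\alpha \nabla_{\vec\theta_t} \mathrm{MSE}(\vec\theta_t)
= \vec\theta_t + \alpha \sum_{s \in S} P(s)\left[V^\pi(s) - V_t(s)\right]\nabla_{\vec\theta_t} V_t(s).
$$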
![Page 13: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/13.jpg)
13
Gradient Descent Cont.
Use just the sample gradient instead:
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t.
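The sample-gradient update, θ ← θ + α[V^π(s_t) − V_t(s_t)]∇V_t(s_t), can be sketched as below. This is only an illustration under stated assumptions: a linear approximator, a hypothetical two-element feature vector, a uniform state distribution, and true targets V^π(s) that are available to the learner (which is not the case in practice).

```python
import random

def features(s):
    """Hypothetical feature vector for a scalar state s: a bias term and s itself."""
    return [1.0, s]

def v_hat(theta, s):
    """Linear value estimate: dot product of parameters and features."""
    return sum(t * f for t, f in zip(theta, features(s)))

def sgd_fit(true_value, n_steps=20000, alpha0=0.5, seed=0):
    """Follow the sample gradient of the squared error, one state at a time."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for t in range(1, n_steps + 1):
        s = rng.uniform(0.0, 1.0)            # state drawn from the distribution P
        alpha = alpha0 / (1.0 + t / 100.0)   # step size decreasing with t
        error = true_value(s) - v_hat(theta, s)
        # For a linear approximator the gradient of v_hat w.r.t. theta is features(s).
        theta = [th + alpha * error * f for th, f in zip(theta, features(s))]
    return theta

# Assumed target function, chosen so that the features can represent it exactly.
theta = sgd_fit(lambda s: 2.0 * s + 1.0)
```

Since the assumed target is exactly representable, the parameters approach the generating coefficients (1 for the bias, 2 for the slope).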
![Page 14: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/14.jpg)
14
But We Don’t have these Targets
![Page 15: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/15.jpg)
15
What about TD(λ) Targets?
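A sketch of the argument in the notation of Sutton & Barto's chapter 8, given here for reference rather than as a transcription of the slide. When the true value $V^\pi(s_t)$ is unavailable, it is replaced by some target $v_t$:

$$
\vec\theta_{t+1} = \vec\theta_t + \alpha\left[v_t - V_t(s_t)\right]\nabla_{\vec\theta_t} V_t(s_t)
$$

If $E\{v_t\} = V^\pi(s_t)$, as for the Monte Carlo target $v_t = R_t$, the update still converges to a local minimum when $\alpha$ decreases appropriately. The TD(λ) target $v_t = R_t^\lambda$ is not an unbiased estimate for $\lambda < 1$ because it bootstraps from the current approximation, so that guarantee does not carry over directly, although the update is commonly used anyway.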
![Page 16: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/16.jpg)
16
On-Line Gradient-Descent TD(λ)
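A minimal sketch of on-line gradient-descent TD(λ) with accumulating eligibility traces, assuming a linear approximator. The five-state random walk, the two features, and all constants are hypothetical choices for illustration; they do not come from the slides.

```python
import random

N_STATES = 5                 # non-terminal states 0..4 of a random walk
GAMMA, LAMBDA = 1.0, 0.8

def phi(s):
    """Hypothetical features: a bias term and the normalized state index."""
    return [1.0, s / (N_STATES - 1)]

def value(theta, s):
    """Linear value estimate for state s."""
    return sum(t * f for t, f in zip(theta, phi(s)))

def td_lambda(n_episodes=5000, alpha0=0.02, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for ep in range(n_episodes):
        alpha = alpha0 / (1.0 + ep / 500.0)   # slowly decreasing step size
        e = [0.0, 0.0]                        # one eligibility trace per parameter
        s = N_STATES // 2                     # every episode starts in the middle
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= N_STATES
            r = 1.0 if s_next >= N_STATES else 0.0   # reward only on the right exit
            v_next = 0.0 if done else value(theta, s_next)
            delta = r + GAMMA * v_next - value(theta, s)   # TD error
            # Accumulating trace: decay, then add the gradient of V(s), which is phi(s).
            e = [GAMMA * LAMBDA * ei + fi for ei, fi in zip(e, phi(s))]
            theta = [th + alpha * delta * ei for th, ei in zip(theta, e)]
            if done:
                break
            s = s_next
    return theta

theta = td_lambda()
```

In this assumed task the true values increase linearly from left to right, so the two features are enough to represent them; the learned estimate for the middle state should settle near 0.5.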
![Page 17: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/17.jpg)
17
Linear Methods
![Page 18: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/18.jpg)
18
Nice Properties of Linear FA Methods
The gradient is very simple.
For MSE, the error surface is simple: a quadratic surface with a single minimum.
Linear gradient-descent TD(λ) converges:
– Step size decreases appropriately
– On-line sampling (states sampled from the on-policy distribution)
– Converges to a parameter vector whose MSE is within a bounded factor of the minimum MSE
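In the notation of Sutton & Barto's chapter 8, with $\vec\phi_s$ the feature vector of state $s$, these properties can be sketched as follows. The linear form and its gradient are

$$
V_t(s) = \vec\theta_t^{\,T}\vec\phi_s = \sum_{i=1}^{n}\theta_t(i)\,\phi_s(i),
\qquad
\nabla_{\vec\theta_t} V_t(s) = \vec\phi_s ,
$$

and the parameter vector $\vec\theta_\infty$ reached by linear gradient-descent TD(λ) satisfies the bound

$$
\mathrm{MSE}(\vec\theta_\infty) \le \frac{1-\gamma\lambda}{1-\gamma}\,\mathrm{MSE}(\vec\theta^{*}),
$$

where $\vec\theta^{*}$ is the best parameter vector. For $\lambda = 1$ the factor equals 1, so the bound is tight; it grows as $\lambda$ decreases.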
![Page 19: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/19.jpg)
19
Most Commonly Used Linear Methods
Coarse coding
Tile coding (CMAC)
Radial basis functions
Kanerva coding
![Page 20: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/20.jpg)
20
Coarse Coding
Generalization from state X to state Y depends on the number of their features whose receptive fields overlap.
![Page 21: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/21.jpg)
21
Coarse Coding
Generalization in linear function approximation methods is determined by the sizes and shapes of the features' receptive fields.
All three of these cases have roughly the same number and density of features.
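To make the role of receptive-field size concrete, here is a minimal one-dimensional sketch. It assumes interval-shaped receptive fields on [0, 1]; the function names, the number of fields and the widths are hypothetical.

```python
def make_fields(n_fields, width, low=0.0, high=1.0):
    """Evenly spaced 1-D receptive fields as (center, half_width) pairs."""
    step = (high - low) / (n_fields - 1)
    return [(low + i * step, width / 2.0) for i in range(n_fields)]

def coarse_code(x, fields):
    """Binary feature vector: 1 where x falls inside a receptive field."""
    return [1 if abs(x - c) <= h else 0 for c, h in fields]

def shared(x, y, fields):
    """Number of features that are active for both x and y."""
    fx, fy = coarse_code(x, fields), coarse_code(y, fields)
    return sum(a & b for a, b in zip(fx, fy))

# Same number and density of features, different receptive-field widths.
narrow = make_fields(n_fields=21, width=0.1)
broad = make_fields(n_fields=21, width=0.4)
```

With the broad fields, the states 0.5 and 0.6 share more active features than with the narrow ones, so an update at one of them changes the estimate at the other more strongly.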
![Page 22: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/22.jpg)
22
Coarse Coding
![Page 23: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/23.jpg)
23
Learning and Coarse Coding
Example of feature width's strong effect on initial generalization (first row) and weak effect on asymptotic accuracy (last row).
![Page 24: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/24.jpg)
24
Tile Coding
Binary feature for each tile
Number of features present at any one time is constant
Binary features make the weighted sum easy to compute
Easy to compute the indices of the features present
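The properties listed above can be sketched for a two-dimensional state as follows. This assumes uniformly offset grid tilings over the unit square; the function names, the offset scheme and the sizes are illustrative assumptions rather than a specific published implementation.

```python
import math

def active_tiles(x, y, n_tilings=5, tiles_per_dim=9, low=0.0, high=1.0):
    """Return one active tile index per tiling for a point (x, y) in [low, high]^2."""
    width = (high - low) / tiles_per_dim
    side = tiles_per_dim + 1                  # one extra row/column to cover the offsets
    indices = []
    for k in range(n_tilings):
        offset = k * width / n_tilings        # each tiling is shifted by a fraction of a tile
        col = int(math.floor((x - low + offset) / width))
        row = int(math.floor((y - low + offset) / width))
        col = min(max(col, 0), side - 1)      # clamp points on the boundary
        row = min(max(row, 0), side - 1)
        indices.append(k * side * side + row * side + col)
    return indices

def value(theta, x, y):
    """With binary features the weighted sum is just the sum of the active weights."""
    return sum(theta[i] for i in active_tiles(x, y))

n_features = 5 * 10 * 10
theta = [0.0] * n_features
```

Exactly one tile per tiling is active, so the number of active features is constant (here 5) and the value estimate costs only that many additions.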
![Page 25: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/25.jpg)
25
Tile Coding
![Page 26: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/26.jpg)
26
Example: Simulated Soccer
How does the agent decide what to do with the ball?
Complexities
– Continuous inputs
– High dimensionality
From the paper: Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)
![Page 27: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/27.jpg)
27
Problems
State space explodes exponentially in terms of dimensionality
Current methods of managing state space explosion lack automation
RL does not scale well to problems with complexities of simulated soccer…
![Page 28: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/28.jpg)
28
Quantization
Divide the state space into regions of interest
– Tile coding (Sutton & Barto, 1998)
No automated method for choosing the regions'
– granularity
– heterogeneity
– location
Prefer a learned abstraction of state space
![Page 29: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/29.jpg)
29
Kohonen Networks
Clustering algorithm
Data driven
Example clusters (figure labels):
– Agent near opponent goal
– Teammate near opponent goal
– No nearby opponents
![Page 30: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/30.jpg)
30
State Space Reduction
90 continuous-valued inputs describe the state of a soccer game
– Naïve discretization: 2^90 states
– Filtering out unnecessary inputs: still 2^18 states
– Clustering algorithm: only 5000 states
• Big win!
![Page 31: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/31.jpg)
31
Two Pass Algorithm
Pass 1:
– Use a Kohonen network and a large training set to learn the state space
Pass 2:
– Use reinforcement learning to learn utilities for states (SARSA)
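The two passes can be sketched as below. This is a deliberately simplified illustration, not the implementation from the paper: it assumes a one-dimensional chain of Kohonen units, synthetic two-dimensional data, a tabular SARSA backup, and hypothetical names and constants throughout.

```python
import random

def quantize(x, units):
    """Map a continuous state to the index of its nearest unit (the discrete state)."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in units]
    return dists.index(min(dists))

def train_som(samples, n_units=4, n_epochs=30, lr0=0.5, seed=0):
    """Pass 1: fit a 1-D Kohonen map (a chain of units) to unlabeled state vectors."""
    rng = random.Random(seed)
    units = [list(rng.choice(samples)) for _ in range(n_units)]
    for epoch in range(n_epochs):
        frac = 1.0 - epoch / n_epochs
        lr = lr0 * frac                       # learning rate decays over time
        radius = int(n_units / 2 * frac)      # neighborhood shrinks to the winner alone
        for x in rng.sample(samples, len(samples)):
            best = quantize(x, units)
            for j, w in enumerate(units):
                if abs(j - best) <= radius:   # move the winner and its chain neighbors
                    for d in range(len(w)):
                        w[d] += lr * (x[d] - w[d])
    return units

def sarsa_update(q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """Pass 2: one tabular SARSA backup over the discrete cluster indices."""
    q[s][a] += alpha * (r + gamma * q[s2][a2] - q[s][a])

# Synthetic data: two clouds of 2-D "game states" (purely illustrative).
data_rng = random.Random(1)
samples = [[data_rng.gauss(0.0, 0.05), data_rng.gauss(0.0, 0.05)] for _ in range(50)]
samples += [[data_rng.gauss(1.0, 0.05), data_rng.gauss(1.0, 0.05)] for _ in range(50)]
units = train_som(samples)

# During pass 2 each observed state is replaced by its cluster index.
q_table = [[0.0, 0.0] for _ in units]        # two hypothetical actions per cluster
s = quantize(samples[0], units)
s2 = quantize(samples[-1], units)
sarsa_update(q_table, s, 0, 1.0, s2, 1)
```

The design point is that the learner in pass 2 never sees the continuous inputs; it only sees the index returned by `quantize`, so the size of the value table is set by the number of units rather than by the input dimensionality.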
![Page 32: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/32.jpg)
32
Fragility of Learned Actions
What happens to the attacker's utility if the goalie crosses the dotted line?
![Page 33: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/33.jpg)
33
Results
Evaluate three systems
– Control: random action selection
– SARSA
– Forcing function
Evaluation criteria
– Goals scored
– Time of possession
![Page 34: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/34.jpg)
34
Cumulative Score
[Figure: SARSA vs. Random Policy. Cumulative goals scored (y-axis) versus games played (x-axis), with one curve for the Learning Team and one for the Random Team.]
![Page 35: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/35.jpg)
35
Team with Forcing Functions
[Figure: SARSA with Forcing Function vs. Random Policy. Cumulative score (y-axis) versus games played (x-axis), with one curve for the Learning Team with Forcing Functions and one for the Random Team playing against it.]
![Page 36: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/36.jpg)
36
Can you beat the “curse of dimensionality”?
Can you keep the number of features from going up exponentially with the dimension?
“Lazy learning” schemes:
– Remember all the data
– To get a new value, find the nearest neighbors and interpolate
– e.g., locally-weighted regression
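A minimal sketch of the lazy-learning idea. Note the substitution: instead of full locally-weighted regression it uses an inverse-distance-weighted average of the k nearest stored values, which is a simpler scheme in the same family. The class name and the choice of k are hypothetical.

```python
import math

class LazyValueFunction:
    """Stores every (state, value) pair and answers queries by local interpolation."""

    def __init__(self, k=3):
        self.k = k
        self.memory = []                       # all data seen so far, never summarized

    def add(self, state, value):
        self.memory.append((tuple(state), float(value)))

    def predict(self, state):
        """Interpolate among the k nearest stored states (memory must be non-empty)."""
        query = tuple(state)
        nearest = sorted(self.memory, key=lambda sv: math.dist(sv[0], query))[: self.k]
        weighted = []
        for s, v in nearest:
            d = math.dist(s, query)
            if d == 0.0:
                return v                       # exact match: return the stored value
            weighted.append((1.0 / d, v))      # closer neighbors get larger weights
        total = sum(w for w, _ in weighted)
        return sum(w * v for w, v in weighted) / total

vf = LazyValueFunction(k=2)
vf.add((0.0,), 0.0)
vf.add((1.0,), 1.0)
vf.add((5.0,), 10.0)
estimate = vf.predict((0.5,))   # halfway between the two nearest stored states
```

All the work happens at query time, so the cost of a prediction grows with the amount of stored data rather than with a fixed number of features.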
![Page 37: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/37.jpg)
37
Can you beat the “curse of dimensionality”?
Function complexity, not dimensionality, is the problem.
Kanerva coding:
– Select a bunch of binary prototypes
– Use Hamming distance as the distance measure
– Dimensionality is no longer a problem, only complexity
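A minimal sketch of Kanerva coding under stated assumptions: states are already encoded as bit tuples, prototypes are drawn uniformly at random, and a feature is active when the state lies within a fixed Hamming radius of its prototype. The function names, the number of prototypes and the radius are hypothetical.

```python
import random

def hamming(a, b):
    """Number of positions at which two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def make_prototypes(n_prototypes, n_bits, seed=0):
    """Draw random binary prototypes; their number is chosen independently of n_bits."""
    rng = random.Random(seed)
    return [tuple(rng.randint(0, 1) for _ in range(n_bits)) for _ in range(n_prototypes)]

def kanerva_features(state_bits, prototypes, radius):
    """Feature i is 1 when the state lies within `radius` bits of prototype i."""
    return [1 if hamming(state_bits, p) <= radius else 0 for p in prototypes]

prototypes = make_prototypes(n_prototypes=50, n_bits=90)
state = prototypes[0]
features = kanerva_features(state, prototypes, radius=30)
```

The feature vector has one entry per prototype (50 here) no matter how many bits describe the state, which is the sense in which the number of features tracks the complexity of the function rather than the dimensionality of the input.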
![Page 38: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/38.jpg)
38
Algorithms using Function Approximators
We now extend value prediction methods using function approximation to control methods, following the pattern of GPI (generalized policy iteration).
First we extend the state-value prediction methods to action-value prediction methods, then we combine them with policy improvement and action selection techniques.
As usual, the problem of ensuring exploration is solved by pursuing either an on-policy or an off-policy approach.
![Page 39: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/39.jpg)
39
Control with FA
Learning state-action values:
The general gradient-descent rule:
Training examples of the form:
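In the notation of Sutton & Barto's chapter 8, the three items on this slide can be sketched as follows, where $v_t$ stands for whatever target is used. The action-value function is approximated by a parameterized function of $\vec\theta_t$:

$$
Q_t(s, a) \approx Q^\pi(s, a)
$$

The general gradient-descent rule mirrors the state-value case:

$$
\vec\theta_{t+1} = \vec\theta_t + \alpha\left[v_t - Q_t(s_t, a_t)\right]\nabla_{\vec\theta_t} Q_t(s_t, a_t)
$$

The corresponding training examples pair a description of $(s_t, a_t)$ as the input with $v_t$ as the target output.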
![Page 40: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/40.jpg)
40
Control with FA
Gradient-descent Sarsa(λ) (backward view):
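A sketch of the backward view in Sutton & Barto's notation, with $\delta_t$ the TD error and $\vec e_t$ the eligibility-trace vector (one trace per parameter, with $\vec e_0 = \vec 0$):

$$
\begin{aligned}
\vec\theta_{t+1} &= \vec\theta_t + \alpha\,\delta_t\,\vec e_t \\
\delta_t &= r_{t+1} + \gamma\,Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \\
\vec e_t &= \gamma\lambda\,\vec e_{t-1} + \nabla_{\vec\theta_t} Q_t(s_t, a_t)
\end{aligned}
$$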
![Page 41: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/41.jpg)
41
Linear Gradient Descent Sarsa(λ)
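A minimal sketch of linear gradient-descent Sarsa(λ) with binary features, ε-greedy action selection and replacing traces. The corridor task, the constants and the function names are hypothetical. For brevity each state-action pair activates a single feature, so the linear approximator reduces to a table; with tile coding, `active_features` would return several indices instead.

```python
import random

N, GOAL = 6, 5                   # corridor states 0..5; state 5 is terminal
LEFT, RIGHT = 0, 1
GAMMA, LAMBDA, ALPHA, EPSILON = 1.0, 0.9, 0.1, 0.1

def active_features(s, a):
    """Indices of the binary features that are on for the pair (s, a)."""
    return [s * 2 + a]

def q(theta, s, a):
    """Linear action value: the sum of the weights of the active features."""
    return sum(theta[i] for i in active_features(s, a))

def choose(theta, s, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    ql, qr = q(theta, s, LEFT), q(theta, s, RIGHT)
    if rng.random() < EPSILON or ql == qr:
        return rng.choice((LEFT, RIGHT))
    return LEFT if ql > qr else RIGHT

def sarsa_lambda(n_episodes=300, max_steps=1000, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (N * 2)
    for _ in range(n_episodes):
        e = [0.0] * len(theta)                      # eligibility traces
        s, a = 0, choose(theta, 0, rng)
        for _ in range(max_steps):
            s2 = max(0, s - 1) if a == LEFT else s + 1
            done = s2 == GOAL
            a2 = None if done else choose(theta, s2, rng)
            target = -1.0 + (0.0 if done else GAMMA * q(theta, s2, a2))  # reward -1 per step
            delta = target - q(theta, s, a)
            e = [GAMMA * LAMBDA * ei for ei in e]   # decay all traces
            for i in active_features(s, a):
                e[i] = 1.0                          # replacing traces for active features
            theta = [th + ALPHA * delta * ei for th, ei in zip(theta, e)]
            if done:
                break
            s, a = s2, a2
    return theta

theta = sarsa_lambda()
```

In the state next to the goal, moving right ends the episode after one step with reward −1, so its learned value approaches −1 and ends up above the value of moving left.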
![Page 42: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/42.jpg)
42
Linear Gradient Descent Q(λ)
![Page 43: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/43.jpg)
43
Mountain-Car Task
![Page 44: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/44.jpg)
44
Mountain-Car Results
The effect of α, λ and the kind of traces on early performance on the mountain-car task. This study used five 9 × 9 tilings.
![Page 45: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/45.jpg)
45
Summary
Generalization
Adapting supervised-learning function approximation methods
Gradient-descent methods
Linear gradient-descent methods
– Radial basis functions
– Tile coding
– Kanerva coding
![Page 46: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/46.jpg)
46
Summary
Nonlinear gradient-descent methods? Backpropagation?
Subtleties involving function approximation, bootstrapping and the on-policy/off-policy distinction
![Page 47: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/47.jpg)
47
Conclusion
![Page 48: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/48.jpg)
48
Conclusion
We saw two important methods in today's lecture:
– Eligibility traces, which generalize learning over time.
– Function approximators, which generalize the learned value function.
Both generalize learning.
![Page 49: 1 Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012.](https://reader031.fdocument.pub/reader031/viewer/2022020117/56649cd85503460f949a1c02/html5/thumbnails/49.jpg)
49
The End.