Transcript of "Machine Learning I 80-629A / Apprentissage Automatique I 80-629" lecture slides (lcharlin/courses/80-629/slides_rl2.pdf)
Sequential Decision Making II — Week #12
Machine Learning I 80-629A
Apprentissage Automatique I 80-629
Brief recap
• Markov Decision Processes (MDPs)
• Offer a framework for sequential decision making
• Goal: find the optimal policy
• Dynamic programming and several algorithms (e.g., value iteration (VI), policy iteration (PI))
[Figure: agent-environment interaction loop (Figure 3.1, RL: An Introduction)]
From MDPs to RL
• In MDPs we assume that we know:
1. Transition probabilities: P(s' | s, a)
2. Reward function: R(s)
• RL is more general
• In RL both are typically unknown
• RL agents navigate the world to gather this information
Experience (I)
A. Supervised Learning:
• Given a fixed dataset
• Goal: maximize an objective on the test set (population)
B. Reinforcement Learning:
• Collect data as the agent interacts with the world
• Goal: maximize the sum of rewards
Experience (II)
• Any supervised learning problem can be mapped to a reinforcement learning problem (a toy sketch follows below)
• The loss function can be used as a reward function
• The dependent variables (Y) become the actions
• States encode information from the independent variables (X)
• Such a mapping isn't very useful in practice
• SL is easier than RL
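A toy sketch of this mapping, assuming a classification dataset. The class and names (SupervisedAsRL, X, y) are illustrative, not from the course:

```python
# Toy sketch: a supervised classification problem viewed as one-step RL.
# Every example (x, y) becomes an episode of length one; accuracy (the
# negative 0/1 loss) plays the role of the reward. Names are illustrative.
class SupervisedAsRL:
    def __init__(self, X, y):
        self.X, self.y = X, y            # features and labels

    def state(self, i):
        return self.X[i]                 # state: the features x_i

    def reward(self, i, action):
        # the action is a predicted label; 0/1 accuracy serves as reward
        return 1.0 if action == self.y[i] else 0.0
```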
Example
[Image: a tongue stuck to a frozen pole]
https://cdn.dribbble.com/users/3907/screenshots/318354/tounge-stuck-on-pole-shot.jpg
• Supervised Learning: being told "Don't touch, your tongue will get stuck."
• Reinforcement Learning: learning the same lesson from one's own experience.
Slide adapted from Pascal Poupart
RL applications
• Key: decision making over time, uncertain environments
• Robot navigation: self-driving cars, helicopter control
• Interactive systems: recommender systems, chatbots
• Game playing: backgammon, Go
• Healthcare: monitoring systems
Reinforcement learning and recommender systems
• Most users have multiple interactions with the system over time
• Making recommendations over time can be advantageous (e.g., you can better explore a user's preferences)
• States: some representation of user preferences (e.g., previous items they consumed)
• Actions: what to recommend
• Reward (see the sketch below):
• positive (+) if the user consumes the recommendation
• negative (−) if the user does not
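An illustrative sketch of one interaction step under these state/action/reward choices. The names and the consumption probabilities are assumptions for illustration, not part of the slides:

```python
# Illustrative sketch: one step of a recommender cast as an RL environment.
import random

def recommend_step(history, item, consume_prob):
    """State: the user's consumption history. Action: the recommended item.
    Reward: positive if the user consumes it, negative otherwise."""
    consumed = random.random() < consume_prob.get(item, 0.1)  # assumed model
    reward = 1.0 if consumed else -1.0
    next_history = history + [item] if consumed else history
    return next_history, reward
```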
Challenges of reinforcement learning
• Credit assignment problem: which action(s) should be credited for obtaining a reward?
• A series of actions may be responsible (getting coffee from the cafeteria)
• A small number of actions several time steps ago may be important (test taking: you study before, and get the grade long after)
• Exploration/exploitation tradeoff: as the agent interacts, should it exploit its current knowledge (exploitation) or seek out additional information (exploration)?
[Illustration: two actions with rewards R=10 and R=1]
Algorithms for RL
• Two main classes of approach:
1. Model-based
• Learn a model of the transitions and use it to optimize a policy given the model
2. Model-free
• Learn an optimal policy without explicitly learning transitions
Monte Carlo Methods
• Model-free
• Assume the environment is episodic
• Think of playing poker: an episode is one poker hand
• Update the policy after each episode
• Intuition:
• Experience many episodes (play many hands of poker)
• Average the rewards received at each state (what proportion of hands are won from each state)
Prediction vs. control
1. Prediction: evaluate a given policy
2. Control: learn a policy
• Sometimes also called passive (prediction) and active (control)
First-visit Monte Carlo
• Given a fixed policy (prediction)
• Calculate the value function $V_\pi(s)$ for each state
• Converges to $V_\pi$ as the number of visits to each state goes to infinity (a sketch follows below)
[Algorithm box: first-visit MC prediction; see Sutton & Barto, RL Book, Ch. 5]
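A minimal sketch of first-visit MC prediction, assuming each episode is given as a list of (state, reward) pairs. The function and variable names are illustrative:

```python
# Minimal sketch of first-visit Monte Carlo prediction.
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    returns = defaultdict(list)              # first-visit returns per state
    for episode in episodes:
        # discounted return from each time step, computed backwards
        G = [0.0] * (len(episode) + 1)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G[t] = r + gamma * G[t + 1]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                # count the first visit to s only
                seen.add(s)
                returns[s].append(G[t])
    # V(s) is the average of the first-visit returns
    return {s: sum(g) / len(g) for s, g in returns.items()}
```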
Example: grid world
• Start state is top-left (start of episode)
• Bottom right is absorbing (end of episode)
• Policy is given (gray arrows)
[Figure: a grid world with states numbered 1-18 and reward R=10 near the absorbing corner]
Episode (state sequence): 1 → 2 → 3 → 7 → 6 → 7 → 10 → 13 → 17
For state 7 (first visit):
$\mathrm{return}(7) = R(7) + \gamma R(6) + \gamma^2 R(7) + \gamma^3 R(10) + \gamma^4 R(13) + \gamma^5 R(17)$
$V(7)$ is then the average of such first-visit returns over many episodes (a short computation sketch follows below).
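A small sketch of this computation. The reward values are not recoverable from the extracted slides, so the sketch assumes R=10 at state 17 (the cell next to the absorbing corner) and R=0 everywhere else, with discount factor gamma:

```python
# Sketch: first-visit Monte Carlo return for state 7 in the episode above.
# Assumption (not stated in the extracted slides): R=10 at state 17, 0 elsewhere.
gamma = 0.9
episode = [1, 2, 3, 7, 6, 7, 10, 13, 17]
R = {s: 0.0 for s in range(1, 19)}
R[17] = 10.0

t_first = episode.index(7)                    # first visit at t = 3
tail = episode[t_first:]                      # [7, 6, 7, 10, 13, 17]
return_7 = sum(gamma**k * R[s] for k, s in enumerate(tail))
print(return_7)                               # gamma**5 * 10 = 5.9049
```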
Example: Blackjack
• Episode: one hand
• States: sum of the player's cards, the dealer's showing card, whether the player has a usable ace
• Actions: {Stay, Hit}
• Rewards: {Win +1, Tie 0, Lose -1}
• A few other assumptions: infinite deck (a sketch of the evaluated policy follows below)
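A sketch of the fixed policy evaluated just below: hit on any card sum under 20. The state fields follow the slide's description; the function name is illustrative:

```python
# Sketch of the evaluated fixed policy: hit unless the player's sum is 20 or 21.
def fixed_policy(player_sum, dealer_card, usable_ace):
    return "Stay" if player_sum >= 20 else "Hit"
```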
• Evaluates the policy that hits except when the sum of the cards is 20 or 21
[Figure 5.1, Sutton & Barto: state-value function for this policy; episodes end with R=+1 (win) or R=-1 (loss)]
Q-value for control
• We know about state-value functions V(s)
• If state transitions are known, they can be used to derive an optimal policy [recall value iteration]:
$\pi^*(s) = \arg\max_a \Big[ R(s) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \Big]$
• When state transitions are unknown, what can we do?
• Q(s, a): the value function of a (state, action) pair (a sketch follows below)
$\pi^*(s) = \arg\max_a \, Q(s, a)$
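A sketch of why this helps: the greedy policy can be read directly off a learned Q-table, with no transition model required. Here Q is assumed to be a dict keyed by (state, action) pairs:

```python
# Sketch: extracting the greedy policy from a learned Q-table.
def greedy_policy(Q, state, actions):
    # unseen (state, action) pairs default to a value of 0.0
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```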
Monte Carlo ES (control)
[Algorithm box: Monte Carlo ES; see Sutton & Barto, RL Book, Ch. 5]
• Strong reasons to believe that it converges to the optimal policy
• The "exploring starts" requirement may be unrealistic
Learning without "exploring starts"
• "Exploring starts" ensures that all states can be visited regardless of the policy
• A specific policy may not visit all states
• Solution: inject some randomness into the policy
Monte Carlo without exploring starts (on policy)
[Algorithm box: on-policy first-visit MC control; see Sutton & Barto, RL Book, Ch. 5]
• The policy's value cannot decrease from one step to the next:
$V_{\pi'}(s) \ge V_{\pi}(s), \quad \forall s \in S$
where $\pi$ is the policy at the current step and $\pi'$ the policy at the next step.
Monte Carlo methods: summary
• Allow a policy to be learned through interactions
• States are effectively treated as being independent
• Can focus on a subset of states (e.g., states for which playing optimally is of particular importance)
• Episodic (with or without exploring starts)
Temporal Difference (TD) Learning
• One of the "central ideas of RL" [Sutton & Barto, RL book]
• Monte Carlo methods update toward the observed return $G_t$:
$V(s_t) \leftarrow V(s_t) + \alpha \, [G_t - V(s_t)]$
where $\alpha$ is the step size and $G_t = \sum_{t' \ge t} \gamma^{t'-t} R(s_{t'})$ is the observed return.
• TD(0) updates "instantly", after every transition:
$V(s_t) \leftarrow V(s_t) + \alpha \, [R_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$
TD(0) for prediction
[Algorithm box: tabular TD(0); see Sutton & Barto, RL Book, Ch. 6, and the sketch below]
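A minimal sketch of tabular TD(0) prediction. The environment interface (reset()/step() returning a 3-tuple) and the callable policy are assumptions for illustration, not the slides' code:

```python
# Minimal sketch of tabular TD(0) prediction.
from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))   # assumed interface
            # TD(0) update: move V(s) toward the bootstrapped target
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```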
TD for control
• Demo: http://www.cs.toronto.edu/~lcharlin/courses/80-629/reinforcejs/gridworld_td.html (from Andrej Karpathy)
Comparing TD and MC
• MC requires going through full episodes before updating the value function (episodic); it converges to the optimal solution
• TD updates each V(s) after each transition (online); it converges to the optimal solution under some conditions on the step size $\alpha$
• Empirically, TD methods tend to converge faster
Q-learning for control
[Algorithm box: Q-learning; see Sutton & Barto, RL Book, Ch. 6]
• Actions are selected with an $\epsilon$-greedy policy:
$a = \begin{cases} \arg\max_{a} Q(a, s) & \text{with probability } 1 - \epsilon \\ \text{a random action} & \text{with probability } \epsilon \end{cases}$
• Converges to Q* as long as all (s, a) pairs continue to be updated, and with minor constraints on the learning rate (a sketch follows below)
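A sketch of tabular Q-learning with an epsilon-greedy behaviour policy, under the same assumed reset()/step() environment interface as the TD(0) sketch above:

```python
# Sketch of tabular Q-learning with epsilon-greedy exploration.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability eps, else act greedily
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)            # assumed interface
            # off-policy target: best next action, not the one actually taken
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```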
Approximation techniques
• The methods we studied so far are "tabular"
• State-value functions V (and Q) can instead be approximated
• Linear approximation: $V(s) = \mathbf{w}^\top \mathbf{x}(s)$
• Coupling between states through the features $\mathbf{x}(s)$
• Adapt the algorithms to this case
• Updates to the value function now imply updating the weights $\mathbf{w}$ using a gradient
Approximation techniques (prediction)
• Linear approximation: $V(s) = \mathbf{w}^\top \mathbf{x}(s)$
• Objective: $\sum_{s \in S} \big[ V_\pi(s) - \mathbf{w}^\top \mathbf{x}(s) \big]^2$
• Gradient update: $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \sum_{s \in S} \big[ V_\pi(s) - \mathbf{w}^\top \mathbf{x}(s) \big] \, \mathbf{x}(s)$
• In practice $V_\pi(s_t)$ is unknown; the return $G_t$ is an unbiased estimator of $V_\pi(s_t)$ and can serve as the target (a sketch follows below) [Sutton & Barto, RL Book, Ch. 9]
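A sketch of gradient Monte Carlo prediction with a linear value function, using the return G_t as the unbiased target. The episode format and the features() callable are assumptions for illustration:

```python
# Sketch of gradient Monte Carlo prediction with linear approximation.
import numpy as np

def gradient_mc(episodes, features, dim, alpha=0.01, gamma=0.9):
    w = np.zeros(dim)
    for episode in episodes:             # episode: list of (state, reward)
        G = 0.0
        for s, r in reversed(episode):   # accumulate returns backwards
            G = r + gamma * G
            x = features(s)              # feature vector x(s), assumed given
            w += alpha * (G - w @ x) * x # SGD step toward the MC target
    return w
```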
Approximation techniques
• Work both for prediction and control
• Any model can be used as the approximator
• Recent work using deep neural networks yields impressive performance on computer (Atari) games
Summary
• Today we defined RL and studied several algorithms for solving RL problems (mostly for the tabular case)
• Main challenges:
• Credit assignment
• Exploration/exploitation tradeoff
• Algorithms:
• Prediction: Monte Carlo and TD(0)
• Control: Q-learning
• Approximation algorithms can help scale reinforcement learning
Practical difficulties
• Compared to supervised learning, setting up an RL problem is often harder
• Need an environment (or at least a simulator)
• Rewards:
• In some domains the reward is clear (e.g., in games)
• In others it is much more subtle (e.g., you want to please a human)
Acknowledgements
• The algorithms are from "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto
• The definitive RL reference
• Some of these slides were adapted from Pascal Poupart's slides (CS686, U. Waterloo)
• The TD demo is from Andrej Karpathy