Transcript of: Playing Atari with Deep Reinforcement Learning (slides: fidler/teaching/2015/slides/CSC2523/jonathan...)
Playing Atari with Deep Reinforcement Learning
Jonathan Chung
Objectives
• Provide some basic understanding of RL
• Apply this understanding to the paper
• Discuss possible future directions of the paper
Reinforcement Learning
“Finding suitable actions in order to maximise the reward”
Problem definition
• States (st): where you are and where you have been
• Action (a): what can you do
• Reward (r): what can you get

[Diagram: a Markov decision process — from state st, actions a0 and a1 lead to new states with rewards r = −1 or r = +1]
Aim
• Q is defined as the maximum expected reward (Rt) after sequence s and taking action a
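In the notation of the original DQN paper, this definition can be written as the optimal action-value function (my reconstruction; π denotes a policy mapping sequences to actions):

```latex
Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \,\middle|\, s_t = s,\; a_t = a,\; \pi \right]
```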
Playing Atari with Deep Reinforcement Learning
Inputs
• Images ~ s#
• Actions = a^
• Score ~ Reward*
^Dependent on the game
#A sequence of images is used to represent the sequence s.
*All future rewards are considered but discounted based on time:
Rt = Σ_{t'=t}^{T} γ^(t'−t) r_{t'}
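As a concrete illustration (not from the slides), the discounted return Rt can be computed by folding in the discount from the end of the episode backwards:

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for t = 0."""
    total = 0.0
    # Iterate backwards so each step applies the discount exactly once.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
```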
Learning
• Bellman equation
• s — sequence, a series of states
• a — action

[Diagram: a trajectory st0 → st1 → … → st7, moving between states via actions ai and collecting rewards r = 1, 3, 4, 5, 6, 7 along the way]
Learning
• Bellman equation
• Rt — maximum expected reward; γ — discount factor:
Rt = Σ_{t'=t}^{T} γ^(t'−t) r_{t'}

[Same trajectory diagram: st0 → … → st7 via actions ai, with rewards r = 1, 3, 4, 5, 6, 7]
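The Bellman equation referred to on these slides is the standard optimality identity (my reconstruction; the slides themselves only show the trajectory diagram):

```latex
Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]
```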
Learning
• Bellman equation
• Qi → Q* as i → ∞
• Q(s, a; θ) ≈ Q*(s, a), where Q(s, a; θ) is modelled with a deep neural network called a “Q-network”

Network:
Down-sample & crop
CNN (16 8×8 filters, stride 4)
CNN (32 4×4 filters, stride 2)
Fully connected (256 units)
Fully connected (#actions units)

Loss func.: MSE between Q(s, a; θ) and the target r + γ max_a' Q(s', a'; θ)
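A minimal numpy sketch of this loss (my illustration; `q` and `q_next` stand in for the Q-network's outputs and are assumptions, not part of the slides):

```python
import numpy as np

def td_loss(q, a, r, q_next, gamma=0.99):
    """MSE between Q(s, a; theta) and the target r + gamma * max_a' Q(s', a'; theta).

    q      : (batch, n_actions) Q-values for the current states
    a      : (batch,) actions taken
    r      : (batch,) rewards received
    q_next : (batch, n_actions) Q-values for the next states
    """
    target = r + gamma * q_next.max(axis=1)      # r + gamma * max_a' Q(s', a')
    predicted = q[np.arange(len(a)), a]          # Q(s, a) for the chosen actions
    return float(np.mean((target - predicted) ** 2))

# Example: one transition, two actions.
q = np.array([[1.0, 2.0]])
q_next = np.array([[0.0, 3.0]])
print(td_loss(q, a=np.array([1]), r=np.array([1.0]), q_next=q_next, gamma=0.5))
```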
r=1
ai ai
r=3
ai ai aiai ai
r=4 r=5 r=6 r=7
Down sample & crop
CNN(16 8x8 filters, stride = 2)
CNN(32 4x4 filters, stride = 2)
Fully connected (256 units)
Fully connected (#actions units)
Q(s, a, θ)
st0 st1 st2 st3 st4 st5 st6 st7
Deep Q-learning
• Initialise data D and Q weights
• For each episode:
  • Initialise and preprocess sequence ϕ(st)
  • For t in T:
    • Select the best action at according to Q(ϕ(st), a, θ)
    • Execute the action to get reward rt and image xt+1
    • Store (ϕ(st), at, rt, ϕ(st+1)) in D
    • Sample a mini-batch from D, then perform gradient descent to update the weights
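The loop above can be sketched end to end. This is a toy stand-in, not the paper's implementation: a tabular Q replaces the Q-network, a 5-state chain replaces the Atari emulator, and ε-greedy selection with random tie-breaking is assumed for exploration; the structure (act, store in replay memory D, sample a mini-batch, update) mirrors the algorithm:

```python
import random

class ChainEnv:
    """Toy stand-in for a game: walk right along a 5-state chain to score."""
    def __init__(self):
        self.n_states, self.n_actions = 5, 2
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, 4) if a == 1 else max(self.s - 1, 0)
        done = self.s == self.n_states - 1
        return self.s, (1.0 if done else 0.0), done

def greedy(Q, s):
    # Argmax over actions, breaking ties randomly.
    best = max(Q[s])
    return random.choice([i for i, v in enumerate(Q[s]) if v == best])

def deep_q_learning(env, episodes=300, gamma=0.9, lr=0.1, eps=0.1, batch=8):
    Q = [[0.0] * env.n_actions for _ in range(env.n_states)]  # stands in for Q(s, a; theta)
    D = []                                                    # replay memory
    for _ in range(episodes):
        s = env.reset()
        for _ in range(20):
            # epsilon-greedy action selection from Q
            a = random.randrange(env.n_actions) if random.random() < eps else greedy(Q, s)
            s2, r, done = env.step(a)
            D.append((s, a, r, s2, done))                     # store transition in D
            # Sample a mini-batch and take a gradient-descent-style step.
            for (ss, aa, rr, ss2, dd) in random.sample(D, min(batch, len(D))):
                target = rr if dd else rr + gamma * max(Q[ss2])
                Q[ss][aa] += lr * (target - Q[ss][aa])
            s = s2
            if done:
                break
    return Q

random.seed(0)
Q = deep_q_learning(ChainEnv())
print([max(range(2), key=lambda i: Q[s][i]) for s in range(4)])  # learned greedy policy
```

With enough episodes the greedy policy moves right in every non-terminal state, i.e. the agent has learned to head for the reward.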
Experiments
• 7 Atari games (Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders)
• Each trained on the same network (except the number of actions, with rewards scaled accordingly)
• Sequences contained the actions and states from the last 4 frames
Experiments
• Randomly sampled s, a and s', a'
• RMSProp algorithm with mini-batches of size 32
• Total of 10 million frames, training on every 4th (or 3rd) frame
Experiments
• Outperformed the previous state of the art, Sarsa and Contingency
• Performed better than humans in Breakout, Enduro and Pong
Future direction

[Diagram: an extended trajectory st0 → … → st7 via actions ai, where intermediate game signals (units killed = 30 → 50, territory = 30% → 33%) lead to a WIN]