Transcript of "Playing Atari with Deep Reinforcement Learning" (slides by Jonathan Chung, CSC2523)

Page 1:

Playing Atari with Deep Reinforcement Learning

Jonathan Chung

Page 2:

Objectives

• Provide some basic understanding of RL

• Apply this understanding to the paper

• Discuss possible future directions of the paper

Page 3:

Reinforcement Learning

"Finding suitable actions in order to maximise the reward"

Page 4:

Problem definition

• States (s_t): where you are and where you have been

• Actions (a): what you can do

• Rewards (r): what you can get

(Diagram: state s_{t=1} with actions a_0 and a_1, leading to rewards r = -1, r = 1, r = 1.)

Page 5:

Problem definition

(The same diagram, now labelled as a Markov decision process: state s_{t=1}, actions a_0 and a_1, rewards r = 1, r = 1, r = -1.)

Page 6:

Problem definition

• States (s_t): where you are (and where you have been)

• Actions (a): what you can do

• Rewards (r): what you can get

(Same diagram as before.)

Pages 7-12: (animated builds repeating the Problem definition slide above)

Page 13:

Aim

• Q is defined as the maximum expected reward (R_t) achievable after seeing sequence s and taking action a
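Written out as in the paper, this is the optimal action-value function:

Q^*(s, a) = \max_{\pi} \mathbb{E}[ R_t \mid s_t = s, a_t = a, \pi ]

where \pi is a policy mapping sequences to actions.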

Page 14:

Playing Atari with Deep Reinforcement Learning

Page 15:

Inputs

• Images ~ s#

• Actions = a^

• Score ~ reward*

# A sequence of images is used to represent the sequence s.
^ Dependent on the game.

Page 16:

Inputs

• Images ~ s

• Actions = a

• Score ~ reward*

* All future rewards are considered but discounted based on time:

R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}

Page 17:

Learning

• Bellman equation
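The equation the bullet refers to, as stated in the paper:

Q^*(s, a) = \mathbb{E}_{s'}[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a ]

The best achievable value of a state-action pair is the immediate reward plus the discounted value of acting optimally from the next state.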

Page 18:

Learning

• Bellman equation

(Diagram: a trajectory of states s_{t0} ... s_{t7} linked by actions a_i, with rewards r = 1, 3, 4, 5, 6, 7 collected along the way.)

Page 19:

Learning

• Bellman equation

(Same trajectory diagram.)

• s - a sequence, i.e. a series of states

• a - an action

Page 20:

Learning

• Bellman equation

(Same trajectory diagram.)

Page 21:

Learning

• Bellman equation

(Same trajectory diagram, now annotated with "maximum expected reward" and "discount factor".)

Page 22:

Learning

• Bellman equation

(Same annotated diagram, with the discounted return written out:)

R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}

Page 23:

Learning

• Bellman equation

(Same trajectory diagram.)

Page 24:

Learning

• Bellman equation

• Q_i → Q* as i → ∞

• Q(s, a; θ) ≅ Q*(s, a), where Q(s, a; θ) is modelled with a deep neural network called a "Q-network"
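The convergence claim Q_i → Q* comes from the value-iteration update used in the paper:

Q_{i+1}(s, a) = \mathbb{E}[ r + \gamma \max_{a'} Q_i(s', a') \mid s, a ]

Iterating this exactly for every sequence is impractical and does not generalise, which is why Q is approximated with the parameterised Q-network Q(s, a; θ).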

Page 25:

(Trajectory diagram as before.)

Network pipeline:

• Down-sample & crop the input frames

• CNN (16 8×8 filters, stride 4)

• CNN (32 4×4 filters, stride 2)

• Fully connected (256 units)

• Fully connected (#actions units)

Loss function: MSE between the prediction Q(s, a; θ) and the target r + ɣ max_{a'} Q(s', a'; θ), as sketched in code below.
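A minimal PyTorch sketch of this architecture, assuming the paper's 84×84 inputs and 4-frame stacks; the class and variable names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(s, a; theta): one forward pass scores every action at once."""

    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4),  # 84x84 -> 16 maps of 20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),           # 20x20 -> 32 maps of 9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),                            # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# A state is the last 4 preprocessed frames; the output has one entry per action.
q_net = QNetwork(n_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```

Scoring all actions in a single forward pass is what makes greedy action selection (the arg-max over Q-values) cheap at play time.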

Page 26:

(Trajectory and network diagram as on the previous slide.)

Deep Q-learning

• Initialise the replay memory D and the Q-network weights θ

• For each episode:

• Initialise and preprocess the sequence ϕ(s_t)

• For t in 1 ... T:

• Select the best action a_t according to Q(ϕ(s_t), a; θ) (ε-greedy in the paper: with probability ε a random action is taken instead)

• Execute the action to get reward r_t and image x_{t+1}

• Store the transition (ϕ(s_t), a_t, r_t, ϕ(s_{t+1})) in D

• Sample a mini-batch from D, then perform gradient descent to update the weights (see the sketch below)
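A sketch of this loop in the same PyTorch setting. The environment interface (reset/step), the ε and γ values, and all names are assumptions for illustration; the ε-greedy choice and replay sampling follow the paper:

```python
import random
import torch
import torch.nn as nn

def run_episode(env, q_net, optimizer, memory, n_actions,
                eps=0.1, gamma=0.99, batch_size=32):
    """One episode of deep Q-learning with experience replay.

    `env` is a hypothetical environment: reset() -> state and
    step(action) -> (next_state, reward, done), where a state is a
    preprocessed 4-frame stack, i.e. phi(s_t) on the slide.
    """
    state = env.reset()
    done = False
    while not done:
        # Select the best action according to Q (epsilon-greedy).
        if random.random() < eps:
            action = random.randrange(n_actions)
        else:
            with torch.no_grad():
                action = q_net(state.unsqueeze(0)).argmax(dim=1).item()

        # Execute the action to get the reward and the next state.
        next_state, reward, done = env.step(action)

        # Store the transition (phi_t, a_t, r_t, phi_{t+1}) in D.
        memory.append((state, action, reward, next_state, done))
        state = next_state

        # Sample a mini-batch from D and take one gradient-descent step.
        if len(memory) >= batch_size:
            s, a, r, s2, d = zip(*random.sample(memory, batch_size))
            s, s2 = torch.stack(s), torch.stack(s2)
            a = torch.tensor(a)
            r = torch.tensor(r, dtype=torch.float32)
            d = torch.tensor(d, dtype=torch.float32)
            with torch.no_grad():
                # Target r + gamma * max_a' Q(s', a'); no bootstrap at terminal states.
                target = r + gamma * (1.0 - d) * q_net(s2).max(dim=1).values
            pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The replay memory D can be as simple as `collections.deque(maxlen=100_000)`; sampling transitions at random breaks the correlation between consecutive frames, which is the point of storing them in D first.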

Page 27:

Experiments

• 7 Atari games (Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders)

• Each trained on the same network (apart from the number of action outputs), with rewards scaled accordingly

• Sequences contained the actions and states from the last 4 frames

Page 28:

Experiments

• Randomly sampled transitions s, a and s', a' (experience replay)

• RMSProp with mini-batches of size 32

• Trained on a total of 10 million frames, selecting actions on every 4th frame (every 3rd for Space Invaders)
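The matching optimiser setup for the sketches above; RMSProp and the batch size come from the slide, while the learning rate is an assumed value that this paper does not quote:

```python
import torch

# Assumes the QNetwork class from the earlier sketch.
q_net = QNetwork(n_actions=4)                                   # e.g. a 4-action game
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)  # lr is an assumption
batch_size = 32                                                 # from the slide
```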

Page 29:

Experiments

• Outperformed the previous state-of-the-art Sarsa and Contingency methods

• Performed better than humans in Breakout, Enduro and Pong

Page 30:

Future direction

Page 31:

Future direction

(Diagram: a trajectory s_{t0} ... s_{t7} linked by actions a_i, annotated with intermediate signals such as units killed = 30, territory = 30%, then units killed = 50, territory = 33%, ending in a WIN.)