Transcript of: Playing Atari with Deep Reinforcement Learning (slides: fidler/teaching/2015/slides/CSC2523/jonathan...)
Playing Atari with Deep Reinforcement Learning
Jonathan Chung
Objectives
• Provide some basic understanding of RL
• Apply this understanding to the paper
• Discuss possible future directions of the paper
Reinforcement Learning
“Finding suitable actions in order to maximise the reward”
Problem definition
• States (st): where you are and where you have been
• Action (a): what can you do
• Reward (r): what can you get

[Diagram: a Markov decision process — from state st, actions a0 and a1 lead to new states with rewards r = −1 or r = +1]
Aim
• Q is defined as the maximum expected reward (Rt) after sequence s and taking action a
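In the notation of the original DQN paper, this definition can be written as the optimal action-value function (my reconstruction; π denotes a policy mapping sequences to actions):

```latex
Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \,\middle|\, s_t = s,\; a_t = a,\; \pi \right]
```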
Playing Atari with Deep Reinforcement Learning
Inputs
• Images ~ s#
• Actions = a^
• Score ~ Reward*
^Dependent on the game
#A sequence of images is used to represent the sequence s.
*All future rewards are considered but discounted based on time:
Rt = Σ_{t'=t}^{T} γ^(t'−t) r_{t'}
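As a concrete illustration (not from the slides), the discounted return Rt can be computed by folding in the discount from the end of the episode backwards:

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for t = 0."""
    total = 0.0
    # Iterate backwards so each step applies the discount exactly once.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
```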
Learning
• Bellman equation
• s — sequence, a series of states
• a — action

[Diagram: a trajectory st0 → st1 → … → st7, moving between states via actions ai and collecting rewards r = 1, 3, 4, 5, 6, 7 along the way]
Learning
• Bellman equation
• Rt — maximum expected reward; γ — discount factor:
Rt = Σ_{t'=t}^{T} γ^(t'−t) r_{t'}

[Same trajectory diagram: st0 → … → st7 via actions ai, with rewards r = 1, 3, 4, 5, 6, 7]
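The Bellman equation referred to on these slides is the standard optimality identity (my reconstruction; the slides themselves only show the trajectory diagram):

```latex
Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]
```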
Learning
• Bellman equation
• Qi → Q* as i → ∞
• Q(s, a; θ) ≈ Q*(s, a), where Q(s, a; θ) is modelled with a deep neural network called a “Q-network”

Network:
Down-sample & crop
CNN (16 8×8 filters, stride 4)
CNN (32 4×4 filters, stride 2)
Fully connected (256 units)
Fully connected (#actions units)

Loss func.: MSE between Q(s, a; θ) and the target r + γ max_a' Q(s', a'; θ)
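A minimal numpy sketch of this loss (my illustration; `q` and `q_next` stand in for the Q-network's outputs and are assumptions, not part of the slides):

```python
import numpy as np

def td_loss(q, a, r, q_next, gamma=0.99):
    """MSE between Q(s, a; theta) and the target r + gamma * max_a' Q(s', a'; theta).

    q      : (batch, n_actions) Q-values for the current states
    a      : (batch,) actions taken
    r      : (batch,) rewards received
    q_next : (batch, n_actions) Q-values for the next states
    """
    target = r + gamma * q_next.max(axis=1)      # r + gamma * max_a' Q(s', a')
    predicted = q[np.arange(len(a)), a]          # Q(s, a) for the chosen actions
    return float(np.mean((target - predicted) ** 2))

# Example: one transition, two actions.
q = np.array([[1.0, 2.0]])
q_next = np.array([[0.0, 3.0]])
print(td_loss(q, a=np.array([1]), r=np.array([1.0]), q_next=q_next, gamma=0.5))
```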
r=1
ai ai
r=3
ai ai aiai ai
r=4 r=5 r=6 r=7
Down sample & crop
CNN(16 8x8 filters, stride = 2)
CNN(32 4x4 filters, stride = 2)
Fully connected (256 units)
Fully connected (#actions units)
Q(s, a, θ)
st0 st1 st2 st3 st4 st5 st6 st7
Deep Q-learning
• Initialise data D and Q weights
• For each episode:
  • Initialise and preprocess sequence ϕ(st)
  • For t in T:
    • Select the best action at according to Q(ϕ(st), a, θ)
    • Execute the action to get reward rt and image xt+1
    • Store (ϕ(st), at, rt, ϕ(st+1)) in D
    • Sample a mini-batch from D, then perform gradient descent to update the weights
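The loop above can be sketched end to end. This is a toy stand-in, not the paper's implementation: a tabular Q replaces the Q-network, a 5-state chain replaces the Atari emulator, and ε-greedy selection with random tie-breaking is assumed for exploration; the structure (act, store in replay memory D, sample a mini-batch, update) mirrors the algorithm:

```python
import random

class ChainEnv:
    """Toy stand-in for a game: walk right along a 5-state chain to score."""
    def __init__(self):
        self.n_states, self.n_actions = 5, 2
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, 4) if a == 1 else max(self.s - 1, 0)
        done = self.s == self.n_states - 1
        return self.s, (1.0 if done else 0.0), done

def greedy(Q, s):
    # Argmax over actions, breaking ties randomly.
    best = max(Q[s])
    return random.choice([i for i, v in enumerate(Q[s]) if v == best])

def deep_q_learning(env, episodes=300, gamma=0.9, lr=0.1, eps=0.1, batch=8):
    Q = [[0.0] * env.n_actions for _ in range(env.n_states)]  # stands in for Q(s, a; theta)
    D = []                                                    # replay memory
    for _ in range(episodes):
        s = env.reset()
        for _ in range(20):
            # epsilon-greedy action selection from Q
            a = random.randrange(env.n_actions) if random.random() < eps else greedy(Q, s)
            s2, r, done = env.step(a)
            D.append((s, a, r, s2, done))                     # store transition in D
            # Sample a mini-batch and take a gradient-descent-style step.
            for (ss, aa, rr, ss2, dd) in random.sample(D, min(batch, len(D))):
                target = rr if dd else rr + gamma * max(Q[ss2])
                Q[ss][aa] += lr * (target - Q[ss][aa])
            s = s2
            if done:
                break
    return Q

random.seed(0)
Q = deep_q_learning(ChainEnv())
print([max(range(2), key=lambda i: Q[s][i]) for s in range(4)])  # learned greedy policy
```

With enough episodes the greedy policy moves right in every non-terminal state, i.e. the agent has learned to head for the reward.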
Experiments
• 7 Atari games (Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders)
• Each trained on the same network (except the number of actions, with rewards scaled accordingly)
• Sequences contained the actions and states from the last 4 frames
Experiments
• Randomly sampled s, a and s', a'
• RMSProp algorithm with mini-batches of size 32
• Total of 10 million frames, training on every 4th (or 3rd) frame
Experiments
• Outperformed the previous state of the art, Sarsa and Contingency
• Performed better than humans in Breakout, Enduro and Pong
Future direction

[Diagram: an extended trajectory st0 → … → st7 via actions ai, where intermediate game signals (units killed = 30 → 50, territory = 30% → 33%) lead to a WIN]