Transcript of "Reinforcement Learning" (Lecturer: 虞台文, Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University).

Page 1:

Reinforcement Learning

Lecturer: 虞台文

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Page 2:

Content
– Introduction
– Main Elements
– Markov Decision Process (MDP)
– Value Functions

Page 3:

Reinforcement Learning

Introduction


Page 4:

Reinforcement Learning

– Learning from interaction (with the environment)
– Goal-directed learning
– Learning what to do and its effect
– Trial-and-error search and delayed reward

– The two most important distinguishing features of reinforcement learning

Page 5:

Exploration and Exploitation

The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.

Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.

Page 6:

Supervised Learning

Diagram: Inputs → Supervised Learning System → Outputs

Training Info = desired (target) outputs

Error = (target output – actual output)

Page 7:

Reinforcement Learning

Diagram: Inputs → RL System → Outputs ("actions")

Training Info = evaluations (“rewards” / “penalties”)

Objective: get as much reward as possible

Page 8:

Reinforcement Learning

Main Elements


Page 9:

Main Elements

Diagram: the agent selects actions; the environment responds with a new state and a reward; the agent's objective is to maximize value.

Page 10:

Example (Bioreactor)

State – current temperature and other sensory readings, composition, target chemical
Actions – how much heating, stirring, what ingredients to add
Reward – moment-by-moment production of the desired chemical

Page 11:

Example (Pick-and-Place Robot)

State – current positions and velocities of joints
Actions – voltages to apply to motors
Reward – reach end-position successfully, speed, smoothness of trajectory

Page 12:

Example (Recycling Robot)

State – charge level of battery
Actions – look for cans, wait for can, go recharge
Reward – positive for finding cans, negative for running out of battery

Page 13:

Main Elements

Environment
– Its state is perceivable

Reinforcement Function
– To generate reward
– A function of states (or state/action pairs)

Value Function
– The potential to reach the goal (with maximum total reward)
– To determine the policy
– A function of state

Page 14:

The Agent-Environment Interface

Diagram: at each time step the agent observes state $s_t$ and reward $r_t$ and emits action $a_t$; the environment then returns $s_{t+1}$ and $r_{t+1}$, producing the sequence $s_t, r_t, a_t;\ s_{t+1}, r_{t+1}, a_{t+1};\ s_{t+2}, r_{t+2}, a_{t+2};\ s_{t+3}, r_{t+3}, a_{t+3};\ \ldots$

Frequently, we model the environment as a Markov Decision Process (MDP).

Page 15:

Reward Function

A reward function defines the goal in a reinforcement learning problem.
– Roughly speaking, it maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state.

$r: S \to \mathbb{R}$  or  $r: S \times A \to \mathbb{R}$

S: a set of states; A: a set of actions

Page 16:

Goals and Rewards

The agent's goal is to maximize the total amount of reward it receives.

This means maximizing not just immediate reward, but cumulative reward in the long run.

Page 17:

Goals and Rewards

Figure: an example task with states labelled Reward = 0 and Reward = 1.

Can you design another reward function?

Page 18:

Goals and Rewards

state                    reward
Win                      +1
Loss                     –1
Draw or non-terminal      0
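As a sketch of how such a reward function might be written in code (the outcome strings below are assumptions, not part of the slides), the table above translates directly:

```python
def reward(outcome: str) -> int:
    """Terminal reward for a game-playing agent, following the table above:
    +1 for a win, -1 for a loss, 0 for a draw or any non-terminal state."""
    return {"win": +1, "loss": -1}.get(outcome, 0)

# reward("win") -> 1, reward("loss") -> -1, reward("draw") -> 0
```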

Page 19:

Goals and Rewards

The reward signal is the way of communicating to the agent what we want it to achieve, not how we want it achieved.


Page 20:

Reinforcement Learning

Markov Decision Processes


Page 21:

Definition

An MDP consists of:
– A set of states S and actions A
– A transition distribution
  $\mathcal{P}^a_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}$,  $s, s' \in S$, $a \in A$
– Expected next rewards
  $\mathcal{R}^a_{ss'} = E\left[\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\right]$,  $s, s' \in S$, $a \in A$
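As an illustration (not from the slides), a finite MDP with known dynamics can be stored as lookup tables for $\mathcal{P}^a_{ss'}$ and $\mathcal{R}^a_{ss'}$; the state and action names below are hypothetical:

```python
# A finite MDP as lookup tables (hypothetical two-state example).
# P[s][a] is a list of (next_state, probability) pairs;
# R[(s, a, s_next)] = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s_next].
P = {
    "s0": {"go": [("s1", 0.8), ("s0", 0.2)], "stay": [("s0", 1.0)]},
    "s1": {"go": [("s1", 1.0)], "stay": [("s1", 1.0)]},
}
R = {
    ("s0", "go", "s1"): 1.0, ("s0", "go", "s0"): 0.0, ("s0", "stay", "s0"): 0.0,
    ("s1", "go", "s1"): 0.0, ("s1", "stay", "s1"): 0.0,
}

# Each P[s][a] must be a probability distribution over next states.
for s, acts in P.items():
    for a, transitions in acts.items():
        assert abs(sum(p for _, p in transitions) - 1.0) < 1e-9
```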

Page 22:

Decision Making

Many stochastic processes can be modeled within the MDP framework.

The process is controlled by choosing actions in each state trying to attain the maximum long-term reward.

How to find the optimal policy?

$\pi^*: S \to A$

Page 23:

Example (Recycling Robot)

Figure: transition graph of the recycling robot with states High and Low. Arcs for the actions wait, search, and recharge are labelled with (transition probability, expected reward) pairs such as $(1, \mathcal{R}^{\text{wait}})$, $(\alpha, \mathcal{R}^{\text{search}})$, $(1-\alpha, \mathcal{R}^{\text{search}})$, $(\beta, \mathcal{R}^{\text{search}})$, $(1-\beta, -3)$, and $(1, 0)$.

Page 24:

Example (Recycling Robot)

Figure: the same High/Low transition graph as on the previous page.

$\mathcal{R}^{\text{search}}$: expected number of cans while searching
$\mathcal{R}^{\text{wait}}$: expected number of cans while waiting
$\mathcal{R}^{\text{search}} > \mathcal{R}^{\text{wait}}$

$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{wait}, \text{search}\}$
$A(\text{low}) = \{\text{wait}, \text{search}, \text{recharge}\}$
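As a sketch (the numerical values of α, β, R_search, and R_wait are made up, and the −3 battery-depletion penalty and 0 recharge reward follow the usual form of this example rather than the transcript), the robot's dynamics can be encoded as (probability, next state, reward) triples:

```python
# Recycling-robot MDP: mdp[(state, action)] = [(probability, next_state, reward), ...]
alpha, beta = 0.9, 0.4               # search success probabilities (made-up values)
R_search, R_wait = 2.0, 1.0          # expected cans found; R_search > R_wait

mdp = {
    ("high", "search"):   [(alpha, "high", R_search), (1 - alpha, "low", R_search)],
    ("high", "wait"):     [(1.0, "high", R_wait)],
    ("low",  "search"):   [(beta, "low", R_search), (1 - beta, "high", -3.0)],
    ("low",  "wait"):     [(1.0, "low", R_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

S = ["high", "low"]
A = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}
```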

Page 25:

Reinforcement Learning

Value Functions


Page 26:

Value Functions

$r: S \to \mathbb{R}$  or  $r: S \times A \to \mathbb{R}$

To estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).

The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return.

Value functions are defined with respect to particular policies.

Page 27:

Returns

Episodic Tasks
– finite-horizon tasks
– indefinite-horizon tasks

Continual Tasks
– infinite-horizon tasks

Page 28:

Finite-Horizon Tasks

Return at time t: $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$

Expected return at time t: $E[R_t]$

k-armed bandit problem

Page 29:

Indefinite-Horizon Tasks

Return at time t: $R_t = r_{t+1} + r_{t+2} + \cdots$

Expected return at time t: $E[R_t]$

Play chess

Page 30:

Infinite-Horizon Tasks

Return at time t: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

Expected return at time t: $E[R_t]$

Control

Page 31:

Unified Notation

Reformulation of episodic tasks: add an absorbing terminal state that thereafter yields only zero rewards, e.g. $s_0, s_1, s_2, \ldots$ with rewards $r_1, r_2, r_3$ followed by $r_4 = r_5 = \cdots = 0$.

Discounted return at time t: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

$\gamma$: discounting factor, $0 \le \gamma \le 1$
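A small sketch of the discounted return for a finite list of sampled future rewards (the reward values and γ below are arbitrary):

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma**k * r_{t+k+1} for a finite list of future rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# r_{t+1}, r_{t+2}, r_{t+3} = 1, 0, 2 with gamma = 0.9:
print(discounted_return([1.0, 0.0, 2.0], 0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```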

Page 32:

Policies

A policy, $\pi$, is a mapping from states $s \in S$ and actions $a \in A(s)$ to the probability $\pi(s, a)$ of taking action $a$ when in state $s$.
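A stochastic policy can be stored as the table of probabilities π(s, a) itself; the sketch below reuses the recycling-robot states and actions, with made-up probabilities:

```python
import random

# pi[s][a] = probability of taking action a in state s (made-up numbers).
pi = {
    "high": {"search": 0.7, "wait": 0.3},
    "low":  {"search": 0.2, "wait": 0.3, "recharge": 0.5},
}

def sample_action(pi, s):
    """Draw an action a with probability pi(s, a)."""
    actions = list(pi[s])
    weights = [pi[s][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

print(sample_action(pi, "low"))
```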

Page 33:

Value Functions under a Policy

State-Value Function
$V^\pi(s) = E_\pi\left[ R_t \mid s_t = s \right] = E_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]$

Action-Value Function
$Q^\pi(s, a) = E_\pi\left[ R_t \mid s_t = s, a_t = a \right] = E_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right]$
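Because $V^\pi(s)$ and $Q^\pi(s,a)$ are expected discounted returns, they can be estimated by averaging returns over sampled episodes. A rough Monte-Carlo sketch, reusing the `mdp`, `pi`, `sample_action`, and `discounted_return` helpers from the earlier examples (episodes are truncated at a fixed horizon, since the recycling robot never terminates):

```python
import random

def rollout(mdp, pi, s, gamma, horizon=200):
    """Sample one truncated trajectory from state s under policy pi and
    return its discounted return."""
    rewards = []
    for _ in range(horizon):
        a = sample_action(pi, s)
        transitions = mdp[(s, a)]                      # list of (p, s', r)
        probs = [p for p, _, _ in transitions]
        _, s, r = random.choices(transitions, weights=probs, k=1)[0]
        rewards.append(r)
    return discounted_return(rewards, gamma)

def mc_state_value(mdp, pi, s, gamma=0.9, episodes=1000):
    """Monte-Carlo estimate of V^pi(s) = E_pi[R_t | s_t = s]."""
    return sum(rollout(mdp, pi, s, gamma) for _ in range(episodes)) / episodes
```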

Page 34:

Bellman Equation for a Policy π (State-Value Function)

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots) = r_{t+1} + \gamma R_{t+1}$

$V^\pi(s) = E_\pi\left[ R_t \mid s_t = s \right] = E_\pi\left[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \right]$

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$
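The Bellman equation can be used directly as an update rule (iterative policy evaluation): sweep over the states, replacing V(s) by the right-hand side until the values stop changing. A sketch in the same (probability, next state, reward) representation as the earlier recycling-robot example:

```python
def policy_evaluation(mdp, pi, states, gamma=0.9, theta=1e-8):
    """Iterate V(s) <- sum_a pi(s,a) sum_s' P^a_{ss'} [R^a_{ss'} + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])
                for a in pi[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example: policy_evaluation(mdp, pi, ["high", "low"]) with the objects defined earlier.
```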

Page 35:

Backup Diagram (State-Value Function)

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

Diagram: backup for $V^\pi$ rooted at state $s$, branching over actions $a$ and then over rewards $r$ and successor states $s'$.

Page 36:

Bellman Equation for a Policy π (Action-Value Function)

$Q^\pi(s, a) = E_\pi\left[ R_t \mid s_t = s, a_t = a \right]$

$R_t = r_{t+1} + \gamma R_{t+1}$

$Q^\pi(s, a) = E_\pi\left[ r_{t+1} + \gamma R_{t+1} \mid s_t = s, a_t = a \right]$
$= E\left[ r_{t+1} \mid s_t = s, a_t = a \right] + \gamma\, E_\pi\left[ R_{t+1} \mid s_t = s, a_t = a \right]$
$= \sum_{s'} \mathcal{P}^a_{ss'} \mathcal{R}^a_{ss'} + \gamma \sum_{s'} \mathcal{P}^a_{ss'} E_\pi\left[ R_{t+1} \mid s_{t+1} = s' \right]$
$= \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

Page 37:

Backup Diagram (Action-Value Function)

$Q^\pi(s, a) = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

Diagram: backup for $Q^\pi$ rooted at the pair $(s, a)$, branching over successor states $s'$ and then over actions $a'$.

Page 38:

Bellman Equation for a Policy

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

$Q^\pi(s, a) = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.

It can be regarded as a consistency condition between values of states and successor states, and rewards.
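Since the equations are linear, $V^\pi$ can also be obtained exactly: collecting them as $V = R_\pi + \gamma P_\pi V$ and solving $(I - \gamma P_\pi) V = R_\pi$. A numpy sketch with hypothetical two-state values:

```python
import numpy as np

gamma = 0.9
# Policy-conditioned transition matrix P_pi[s, s'] and expected one-step reward
# R_pi[s] (hypothetical numbers for a two-state MDP).
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
R_pi = np.array([1.0, -0.5])

# V^pi is the unique solution of (I - gamma * P_pi) V = R_pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)
```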

Page 39:

Example (Grid World)

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$,  $Q^\pi(s, a) = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

State: position
Actions: north, south, east, west; deterministic
Reward: an action that would take the agent off the grid leaves the position unchanged but gives reward = –1; all other actions give reward = 0, except those that move the agent out of the special states A and B as shown in the figure.

Figure: state-value function for the equiprobable random policy; $\gamma = 0.9$.
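The movement rule described above (deterministic moves, with an off-grid attempt leaving the position unchanged at a cost of −1) can be sketched as follows; the grid size and the special states A and B come from the slide's figure, which is not reproduced in this transcript, so A and B are left out and the size is an assumption:

```python
# Grid-world dynamics for the ordinary squares only (special states A and B omitted).
N = 5                                              # grid size: an assumption
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(pos, action):
    """Deterministic move; going off the grid keeps the position and costs -1."""
    r, c = pos
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc), 0.0
    return (r, c), -1.0
```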

Page 40:

Optimal Policy (π*)

$\pi^*$ is optimal if $V^{\pi^*}(s) \ge V^{\pi}(s)$ for every state $s$ and every policy $\pi$.

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

$Q^\pi(s, a) = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

Optimal State-Value Function
$V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s)$

Optimal Action-Value Function
$Q^*(s, a) = \max_\pi Q^\pi(s, a)$

What is the relation between them?

Page 41:

Optimal Value Functions

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

$Q^\pi(s, a) = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

Bellman Optimality Equations:

$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a) = \max_{a \in A(s)} \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^*(s') \right]$

$Q^*(s, a) = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a' \in A(s')} Q^*(s', a') \right]$
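Reading the Bellman optimality equation for V* as an update rule gives value iteration. A sketch in the same (probability, next state, reward) representation used for the recycling-robot example, where A[s] lists the actions available in s:

```python
def value_iteration(mdp, A, states, gamma=0.9, theta=1e-8):
    """Iterate V(s) <- max_a sum_s' P^a_{ss'} [R^a_{ss'} + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])
                for a in A[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example: V_star = value_iteration(mdp, A, ["high", "low"]) with the robot MDP above.
```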

Page 42:

Optimal Value Functions

Bellman Optimality Equations:

$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a) = \max_{a \in A(s)} \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^*(s') \right]$

$Q^*(s, a) = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a' \in A(s')} Q^*(s', a') \right]$

How do we apply the value function to determine the action to take in each state?

How do we compute it? How do we store it?
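One answer to the first question: given V* (or Q*), acting optimally is a one-step lookahead, choosing in each state the action that maximizes the bracketed term of the Bellman optimality equation; in the tabular case, "storing" the value function is just a dictionary (or array) indexed by state. A sketch in the same representation as the earlier examples:

```python
def greedy_action(mdp, A, V, s, gamma=0.9):
    """pi*(s) = argmax_a sum_s' P^a_{ss'} [R^a_{ss'} + gamma V*(s')]."""
    return max(
        A[s],
        key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)]),
    )

# Example: greedy_action(mdp, A, V_star, "low") picks the optimal action in state "low".
```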

Page 43:

Example (Grid World)

Figure: $V^*$ and $\pi^*$ for the grid world, comparing the random policy with the optimal policy.

Page 44:

Finding an Optimal Solution via Bellman

Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
– accurate knowledge of the environment dynamics;
– enough space and time to do the computation;
– the Markov property.

Page 45:

Optimality and Approximation

How much space and time do we need?
– polynomial in the number of states (via dynamic programming methods)
– BUT the number of states is often huge (e.g., backgammon has about $10^{20}$ states).

We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman Optimality Equation.