Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)


Reinforcement Learning

Partially Observable Markov Decision Processes (POMDP)

Lecturer: 虞台文

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Content

– Introduction
– Value Iteration for MDP
– Belief States & Infinite-State MDP
– Value Function of POMDP
– The PWLC Property of the Value Function

Reinforcement Learning

Partially Observable Markov Decision Processes (POMDP)

Introduction

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Definition: MDP

A Markov decision process is a tuple $\langle S, A, T, R \rangle$:

– $S$: a finite set of states of the world
– $A$: a finite set of actions
– $T: S \times A \to \Pi(S)$: the state-transition function
– $R: S \times A \to \mathbb{R}$: the reward function

$T(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$

Complete Observability

Solution procedures for MDPs give values or policies for each state.

Use of these solutions requires that the agent is able to detect the state it is currently in with complete reliability.

Such a process is therefore called a CO-MDP (completely observable MDP).

Partial Observability

Instead of directly measuring the current state, the agent makes an observation to get a hint about what state it is in.

How does the agent get this hint (i.e., guess the state)?
– It takes an action and then makes an observation.
– The observation is probabilistic, i.e., it provides only a hint.
– The "state" will be defined in a probabilistic sense.

Observation Model

$\Omega$: a finite set of observations the agent can experience of its world.

$O: S \times A \to \Pi(\Omega)$: the observation function.

$O(s', a, o) = P(o_{t+1} = o \mid s_{t+1} = s', a_t = a)$

The probability of getting observation $o$ given that the agent took action $a$ and landed in state $s'$.

Definition: POMDP

A POMDP is a tuple $\langle S, A, T, R, \Omega, O \rangle$, where

– $\langle S, A, T, R \rangle$ describes an MDP, and
– $O: S \times A \to \Pi(\Omega)$ is the observation function.

How do we find an optimal policy in such an environment?
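Before moving on, it may help to fix a concrete data layout for this tuple. The following is only an illustrative sketch (the class name POMDP and the array conventions T[s, a, s'], O[s', a, o], R[s, a] are assumptions, not something given in the slides); later sketches reuse it.

from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    T: np.ndarray   # T[s, a, s'] = P(s_{t+1} = s' | s_t = s, a_t = a)
    O: np.ndarray   # O[s', a, o] = P(o_{t+1} = o | s_{t+1} = s', a_t = a)
    R: np.ndarray   # R[s, a] = immediate reward for taking a in s
    @property
    def n_states(self):  return self.T.shape[0]
    @property
    def n_actions(self): return self.T.shape[1]
    @property
    def n_obs(self):     return self.O.shape[2]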

Reinforcement Learning

Partially Observable Markov Decision Processes (POMDP)

Value Iteration for MDP

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Acting Optimally

Finite-Horizon Model: maximize the expected total reward of the next $k$ steps,

$\max \; E\!\left[\sum_{t=0}^{k} r_t\right]$

Infinite-Horizon Discounted Model: maximize the expected discounted total reward,

$\max \; E\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad 0 \le \gamma < 1$

Is there any difference in the nature of their optimal policies?

Stationary vs. Non-Stationary Policies

Finite-Horizon Model: the optimal policy depends on the number of time steps remaining. Use a non-stationary policy

$\pi_t : S \to A$, where $t$ is the number of time steps remaining.

Infinite-Horizon Discounted Model: the optimal policy is independent of the number of time steps remaining. Use a stationary policy

$\pi : S \to A$

Value Functions

Finite-Horizon Model (non-stationary policy):

$V_{\pi,t}(s) = R(s, \pi_t(s)) + \sum_{s' \in S} T(s, \pi_t(s), s')\, V_{\pi,t-1}(s')$

Infinite-Horizon Discounted Model (stationary policy):

$V_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V_{\pi}(s')$

Optimal Policies

Finite-Horizon Model (non-stationary policy):

$V^*_t(s) = \max_a \left[ R(s,a) + \sum_{s' \in S} T(s,a,s')\, V^*_{t-1}(s') \right]$

$\pi^*_t(s) = \arg\max_a \left[ R(s,a) + \sum_{s' \in S} T(s,a,s')\, V^*_{t-1}(s') \right]$

$\pi^*_1(s) = \arg\max_a R(s,a)$

Infinite-Horizon Discounted Model (stationary policy):

$V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^*(s') \right]$

$\pi^*(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^*(s') \right]$

Optimal Policies (Finite-Horizon Model, non-stationary policy)

$V^*_t(s) = \max_a \left[ R(s,a) + \sum_{s' \in S} T(s,a,s')\, V^*_{t-1}(s') \right]$

$\pi^*_t(s) = \arg\max_a \left[ R(s,a) + \sum_{s' \in S} T(s,a,s')\, V^*_{t-1}(s') \right]$

$\pi^*_1(s) = \arg\max_a R(s,a)$

What happens to $\pi_t$ as $t \to \infty$?

What if $V_t(s) \approx V_{t-1}(s)$ for all $s$?

What about $\pi_t$ if $V_t(s) \approx V_{t-1}(s)$ for all $s$?

To find an optimal policy, do we need to spend infinite time?

Value Iteration

The MDP has a finite number of states.
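As a concrete illustration of the stopping idea above, here is a minimal value-iteration sketch for a finite-state MDP (Python/NumPy; the function name and the use of a discount factor for the infinite-horizon case are assumptions, not code from the slides).

import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    # T[s, a, s'] and R[s, a] are the MDP arrays defined earlier.
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
        Q = R + gamma * np.einsum('sap,p->sa', T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # V_t ≈ V_{t-1} for all s
            return V_new, Q.argmax(axis=1)    # value function and greedy policy
        V = V_new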

Reinforcement Learning

Partially Observable Markov Decision Processes (POMDP)

Belief States & Infinite-State MDP

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

POMDP Framework

[Diagram: the world (an MDP) emits an observation to the agent; inside the agent, a state estimator (SE) turns the action taken and the observation received into a belief state b, from which the agent chooses the next action sent back to the world.]

SE: state estimator; b: belief state

Belief States

$\mathbf{b} = [\,b(s_1), b(s_2), \ldots\,]^T, \quad s_i \in S, \quad b(s_i) \ge 0, \quad \sum_{s \in S} b(s) = 1$

There are uncountably infinitely many belief states.

State Space

The belief states themselves form the state space of the problem; there are uncountably infinitely many of them.

[Figure: for a 2-state POMDP the belief space is the interval $0 \le b(s_1) \le 1$; for a 3-state POMDP it is a triangle (2-simplex).]

State Estimation

Given the current belief state $\mathbf{b}_t$, the action $a_t$ taken, and the resulting observation $o_{t+1}$, what is the next belief state $\mathbf{b}_{t+1}$?

State Estimation

Write $\mathbf{b}_t = [\,b_t(s_1), b_t(s_2), \ldots\,]^T$ and $\mathbf{b}_{t+1} = [\,b_{t+1}(s_1), b_{t+1}(s_2), \ldots\,]^T$. For each state $s'$,

$b_{t+1}(s') = P(s' \mid o, a, \mathbf{b}_t)$

$\qquad = \dfrac{P(o \mid s', a, \mathbf{b}_t)\, P(s' \mid a, \mathbf{b}_t)}{P(o \mid a, \mathbf{b}_t)}$

$\qquad = \dfrac{P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b_t(s)}{P(o \mid a, \mathbf{b}_t)}$

$\qquad = \dfrac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b_t(s)}{P(o \mid a, \mathbf{b}_t)}$

where $P(o \mid a, \mathbf{b}_t)$ is a normalization factor.

State Estimation (continued)

Writing the update as an operator, $\mathbf{b}_{t+1} = SE(\mathbf{b}_t, a, o)$. Collecting the derivation above into matrix form,

$\mathbf{b}_{t+1} = \dfrac{T^{a,o}\, \mathbf{b}_t}{P(o \mid a, \mathbf{b}_t)}$

where, comparing with the derivation, $[T^{a,o}]_{s's} = O(s', a, o)\, T(s, a, s')$.

Remember these.

State Estimation (continued)

The normalization factor is

$P(o \mid a, \mathbf{b}_t) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b_t(s)$

It is linear w.r.t. $\mathbf{b}_t$.
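The belief update and its normalization factor translate directly into a few lines of code. This is a sketch only, reusing the hypothetical POMDP container introduced earlier; it does not guard against observations with $P(o \mid a, \mathbf{b}) = 0$.

import numpy as np

def belief_update(pomdp, b, a, o):
    # unnormalized b'(s') = O(s', a, o) * sum_s T(s, a, s') b(s)
    unnorm = pomdp.O[:, a, o] * (pomdp.T[:, a, :].T @ b)
    p_o = unnorm.sum()            # P(o | a, b): the normalization factor
    return unnorm / p_o, p_o      # b_{t+1} = SE(b, a, o), together with P(o | a, b)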

State Transition Function

With $\mathbf{b}_{t+1} = SE(\mathbf{b}_t, a, o)$, the belief-state transition function is

$\tau(\mathbf{b}, a, \mathbf{b}') = P(\mathbf{b}' \mid \mathbf{b}, a) = \sum_{o} P(\mathbf{b}' \mid \mathbf{b}, a, o)\, P(o \mid \mathbf{b}, a) = \sum_{o:\, SE(\mathbf{b}, a, o) = \mathbf{b}'} P(o \mid \mathbf{b}, a)$

(Recall that $P(o \mid \mathbf{b}, a)$ is linear w.r.t. $\mathbf{b}$.)

Suppose that $SE(\mathbf{b}, a, o_i) \ne SE(\mathbf{b}, a, o_j)$ for all $i \ne j$. Then

$\tau(\mathbf{b}, a, \mathbf{b}') = \begin{cases} P(o \mid \mathbf{b}, a) & \text{if } \mathbf{b}' = SE(\mathbf{b}, a, o) \\ 0 & \text{otherwise} \end{cases}$

POMDP = Infinite-State MDP

A POMDP is an MDP with tuple $\langle B, A, \tau, \rho \rangle$:

– $B$: the set of belief states
– $A$: the finite set of actions (the same as in the original MDP)
– $\tau: B \times A \to \Pi(B)$: the state-transition function, $\tau(\mathbf{b}, a, \mathbf{b}') = P(\mathbf{b}_{t+1} = \mathbf{b}' \mid \mathbf{b}_t = \mathbf{b}, a_t = a)$
– $\rho: B \times A \to \mathbb{R}$: the reward function

What is the reward function?

Reward Function

$\rho(\mathbf{b}, a) = \sum_{s \in S} b(s)\, R(s, a)$

where $R$ is the reward function of the original MDP. Good news: it is linear in $\mathbf{b}$.
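A brief sketch of the two belief-MDP quantities just defined, with illustrative helper names built on the earlier belief_update: the linear reward $\rho(\mathbf{b}, a)$ and the finitely many successor beliefs reachable from $\mathbf{b}$ under $a$, each occurring with probability $P(o \mid \mathbf{b}, a)$.

import numpy as np

def belief_reward(pomdp, b, a):
    return float(b @ pomdp.R[:, a])           # rho(b, a) = sum_s b(s) R(s, a)

def belief_successors(pomdp, b, a):
    succ = []
    for o in range(pomdp.n_obs):
        b_next, p_o = belief_update(pomdp, b, a, o)
        if p_o > 0:
            succ.append((b_next, p_o))        # tau(b, a, b') = P(o | b, a)
    return succ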

Reinforcement Learning

Partially Observable Markov Decision Processes (POMDP)

Value Function of POMDP

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Value Function over Belief Space

Consider a 2-state POMDP. [Figure: the value function $V(\mathbf{b})$ plotted over the one-dimensional belief space $b \in [0, 1]$.]

How do we obtain the value function in belief space? Can we use the table-based method?

Finding the Optimal Policy

POMDP = Infinite-State MDP. The general method for an MDP:

– determine the value function, then follow with policy improvement.

Value functions:

– state value function
– action value function

Review: Value Iteration

Value iteration is based on the finite-horizon value function; it finds $\pi^*_t$ on each iteration.

What is $\pi^*_1$?

The $\pi^*_1$ and $V^*_1$

Immediate reward: $\rho(\mathbf{b}, a) = \sum_{s \in S} b(s)\, R(s, a)$

$Q^a_1(\mathbf{b}) = \rho(\mathbf{b}, a) = \sum_{s \in S} b(s)\, R(s, a)$

$V^*_1(\mathbf{b}) = \max_a Q^a_1(\mathbf{b}), \qquad \pi^*_1(\mathbf{b}) = \arg\max_a Q^a_1(\mathbf{b})$

Consider a 2-state POMDP with two actions ($a_1$, $a_2$) and three observations ($o_1$, $o_2$, $o_3$).

[Figure: $Q^{a_1}_1$ and $Q^{a_2}_1$ are straight lines over the belief interval $[0, 1]$; $V^*_1$ is their upper surface, with each action optimal on the region of belief space where its line is uppermost.]

Horizon-1 Policy Trees

Consider a 2-state POMDP with two actions ($a_1$, $a_2$) and three observations ($o_1$, $o_2$, $o_3$).

[Figure: $V^*_1$ over the belief interval, partitioned into a region where the horizon-1 policy tree "$a_2$" is optimal and a region where "$a_1$" is optimal; the partition of belief space is denoted $\Pi^*_1$.]

$V^*_1$ is piecewise linear and convex (PWLC).

The $\pi^*_1$ and $V^*_1$

$Q^a_1(\mathbf{b}) = \sum_{s \in S} b(s)\, R(s, a), \qquad V^*_1(\mathbf{b}) = \max_a Q^a_1(\mathbf{b}), \qquad \pi^*_1(\mathbf{b}) = \arg\max_a Q^a_1(\mathbf{b})$

How about a 3-state POMDP and beyond? It is still PWLC.

What is the policy?

The PWLC Property

A piecewise linear function consists of linear (hyperplane) segments:

– linear function: $\sum_{i=0}^{N} \alpha_i x_i$
– the $k$-th linear segment: $\sum_{i=0}^{N} \alpha^k_i x_i$
– the $\alpha$-vector: $\boldsymbol{\alpha}^k = [\alpha^k_0, \alpha^k_1, \ldots, \alpha^k_N]$
– each segment can be represented by its $\boldsymbol{\alpha}^k$

$f(\mathbf{x}) = \max_k (\boldsymbol{\alpha}^k)^T \mathbf{x}$ is PWLC (piecewise linear and convex).
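As a small illustration, evaluating a PWLC function stored as a set of $\alpha$-vectors amounts to a max over dot products (illustrative names, not code from the slides):

import numpy as np

def pwlc_value(alphas, b):
    # alphas: array of shape (K, N), one alpha-vector per row
    values = alphas @ b                    # (alpha_k)^T b for every k
    k_best = int(values.argmax())
    return float(values[k_best]), k_best   # the value and the maximizing segment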

The $\pi^*_t$ and $V^*_t$

$Q^a_t(\mathbf{b}) = \rho(\mathbf{b}, a) + \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V^*_{t-1}(\mathbf{b}')$

Using $\tau(\mathbf{b}, a, \mathbf{b}') = P(o \mid \mathbf{b}, a)$ if $\mathbf{b}' = SE(\mathbf{b}, a, o)$ and $0$ otherwise:

$Q^a_t(\mathbf{b}) = \rho(\mathbf{b}, a) + \sum_{o} P(o \mid \mathbf{b}, a)\, V^*_{t-1}(SE(\mathbf{b}, a, o))$

Here $\rho(\mathbf{b}, a)$ is the immediate reward, $P(o \mid \mathbf{b}, a)$ is the probability of observation $o$ when action $a$ is taken in the current belief state $\mathbf{b}$, and $V^{*,a,o}_{t-1}(\mathbf{b}) = P(o \mid \mathbf{b}, a)\, V^*_{t-1}(SE(\mathbf{b}, a, o))$ is the value of observation $o$ when action $a$ is taken in $\mathbf{b}$.

$\rho(\mathbf{b}, a)$ is PWLC (it is linear). Is $V^{*,a,o}_{t-1}(\mathbf{b})$ PWLC? Yes, it is, but I will defer the proof.

The $\pi^*_2$ and $V^*_2$

$Q^a_2(\mathbf{b}) = \rho(\mathbf{b}, a) + \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V^*_1(\mathbf{b}')$

$\qquad\; = \rho(\mathbf{b}, a) + \sum_{o} P(o \mid \mathbf{b}, a)\, V^*_1(SE(\mathbf{b}, a, o)) = \rho(\mathbf{b}, a) + \sum_{o} V^{*,a,o}_1(\mathbf{b})$

$V^*_2(\mathbf{b}) = \max_a Q^a_2(\mathbf{b}), \qquad \pi^*_2(\mathbf{b}) = \arg\max_a Q^a_2(\mathbf{b}) = \arg\max_a \left[ \rho(\mathbf{b}, a) + \sum_{o} V^{*,a,o}_1(\mathbf{b}) \right]$
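For a single belief point $\mathbf{b}$, the horizon-2 backup above can be computed directly. The sketch below uses the hypothetical helpers introduced earlier (belief_reward, belief_successors) and is meant only to mirror the formulas, not to be an efficient algorithm.

import numpy as np

def q1(pomdp, b, a):
    return belief_reward(pomdp, b, a)                         # Q_1^a(b) = rho(b, a)

def v1(pomdp, b):
    return max(q1(pomdp, b, a) for a in range(pomdp.n_actions))

def q2(pomdp, b, a):
    # Q_2^a(b) = rho(b, a) + sum_o P(o | b, a) V_1*(SE(b, a, o))
    return belief_reward(pomdp, b, a) + sum(
        p_o * v1(pomdp, b_next) for b_next, p_o in belief_successors(pomdp, b, a))

def v2(pomdp, b):
    return max(q2(pomdp, b, a) for a in range(pomdp.n_actions))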

The $\pi^*_2$ and $V^*_2$: Compute $Q^{a_1}_2$

$Q^{a_1}_2(\mathbf{b}) = \rho(\mathbf{b}, a_1) + \sum_{o} P(o \mid \mathbf{b}, a_1)\, V^*_1(SE(\mathbf{b}, a_1, o))$

[Figure: from the current belief $\mathbf{b}$, taking $a_1$ leads, depending on whether the observation is $o_1$, $o_2$, or $o_3$, to one of the successor beliefs $\mathbf{b}' = SE(\mathbf{b}, a_1, o)$; each successor is evaluated on the PWLC function $V^*_1$.]

What action will you take if the observation is $o_i$ after $a_1$ is taken?

The $\pi^*_2$ and $V^*_2$

Consider an individual observation $o$ after action $a$ is taken. Define

$V^{*,a,o}_1(\mathbf{b}) = P(o \mid \mathbf{b}, a)\, V^*_1(SE(\mathbf{b}, a, o))$

Then

$Q^a_2(\mathbf{b}) = \rho(\mathbf{b}, a) + \sum_{o} V^{*,a,o}_1(\mathbf{b})$

The $\pi^*_2$ and $V^*_2$

[Figure: the immediate reward $\rho(\mathbf{b}, a_1)$ is a line over the belief interval; the transformed value function $V^{*,a,o}_1(\mathbf{b})$ is a transformed, still piecewise linear, version of $V^*_1(\mathbf{b})$.]

The $\pi^*_2$ and $V^*_2$

$Q^{a_1}_2(\mathbf{b}) = \rho(\mathbf{b}, a_1) + \sum_{o} V^{*,a_1,o}_1(\mathbf{b})$

[Figure: $Q^{a_1}_2$ is obtained by summing the line $\rho(\mathbf{b}, a_1)$ with the three transformed value functions $V^{*,a_1,o_1}_1(\mathbf{b})$, $V^{*,a_1,o_2}_1(\mathbf{b})$, and $V^{*,a_1,o_3}_1(\mathbf{b})$; the result is again piecewise linear, with its pieces indexed by the observations $o_1$, $o_2$, $o_3$.]

Horizon-2 Tree for Action 1

[Figure: each linear piece of $Q^{a_1}_2$ over the belief interval is labeled by a triple giving the best horizon-1 action for each of the three observations; each piece thus corresponds to a horizon-2 policy tree with root $a_1$ whose branches $o_1$, $o_2$, $o_3$ lead to horizon-1 policy trees from $\Pi^*_1$. $Q^{a_2}_2$ is partitioned in the same way.]

The $\pi^*_2$ and $V^*_2$

[Figure: $Q^{a_1}_2$ and $Q^{a_2}_2$ plotted together; $V^*_2$ is their upper surface, with some regions of the belief interval where $a_1$ is the optimal first action and others where $a_2$ is.]

Horizon-2 Policy Tree

[Figure: the horizon-2 policy trees $\Pi^*_2$: an initial action at the root, and for each observation $o_1$, $o_2$, $o_3$ a branch leading to a horizon-1 policy tree from $\Pi^*_1$.]

Can you figure out how to determine the value function for horizon 3 from the above discussion?

The $\pi^*_3$ and $V^*_3$

[Figure: starting from $V^*_2$, form the transformed value functions $V^{*,a_1,o_1}_2(\mathbf{b})$, $V^{*,a_1,o_2}_2(\mathbf{b})$, $V^{*,a_1,o_3}_2(\mathbf{b})$ and $V^{*,a_2,o_1}_2(\mathbf{b})$, $V^{*,a_2,o_2}_2(\mathbf{b})$, $V^{*,a_2,o_3}_2(\mathbf{b})$; summing them with the immediate rewards gives $Q^{a_1}_3(\mathbf{b})$ and $Q^{a_2}_3(\mathbf{b})$, whose upper surface is $V^*_3(\mathbf{b})$.]

The $\pi^*_3$ and $V^*_3$

[Figure: $Q^{a_1}_3(\mathbf{b})$ and $Q^{a_2}_3(\mathbf{b})$, each built from branches for the observations $o_1$, $o_2$, $o_3$; taking their upper surface yields $V^*_3$.]

How about $\pi^*_t$ and $V^*_t$ for general $t$?

Horizon-3 Policy Tree

[Figure: the horizon-3 policy trees $\Pi^*_3$: an initial action at the root, and for each observation $o_1$, $o_2$, $o_3$ a branch leading to a horizon-2 policy tree from $\Pi^*_2$, whose branches in turn lead to horizon-1 policy trees from $\Pi^*_1$.]

Reinforcement Learning

Partially Observable Markov Decision Processes (POMDP)

The PWLC Property of the Value Function

Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

Value Function for POMDP

$Q^a_t(\mathbf{b}) = \rho(\mathbf{b}, a) + \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V^*_{t-1}(\mathbf{b}')$

$V^*_t(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \sum_{o} P(o \mid \mathbf{b}, a)\, V^*_{t-1}(SE(\mathbf{b}, a, o)) \right]$

For horizon 1,

$V^*_1(\mathbf{b}) = \max_a \rho(\mathbf{b}, a) = \max_a \sum_{i} b(s_i)\, R(s_i, a)$

Let $\mathbf{r}_a = [\,R(s_1, a), R(s_2, a), \ldots\,]^T$. Then

$V^*_1(\mathbf{b}) = \max_a \mathbf{r}_a^T \mathbf{b}$

Let $\boldsymbol{\alpha}_{k,1} = \mathbf{r}_{a_k}$. Then

$V^*_1(\mathbf{b}) = \max_k \boldsymbol{\alpha}_{k,1}^T \mathbf{b}$

Theorem

$V^*_t(\mathbf{b})$ is PWLC.

Proof (by induction)

We already know the claim is true for $V^*_1(\mathbf{b})$. Assume it is true for $V^*_{t-1}(\mathbf{b})$; we then show it must be true for $V^*_t(\mathbf{b})$.

Proof

$V^*_t(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \sum_{o} P(o \mid \mathbf{b}, a)\, V^*_{t-1}(SE(\mathbf{b}, a, o)) \right]$

From the assumption, we have

$V^*_{t-1}(SE(\mathbf{b}, a, o)) = \max_k \boldsymbol{\alpha}_{k,t-1}^T\, SE(\mathbf{b}, a, o)$

so that

$V^*_t(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \sum_{o} P(o \mid \mathbf{b}, a) \max_k \boldsymbol{\alpha}_{k,t-1}^T\, SE(\mathbf{b}, a, o) \right]$

Let $\boldsymbol{\alpha}^{a,o} = \arg\max_{\boldsymbol{\alpha}_{k,t-1}} \boldsymbol{\alpha}_{k,t-1}^T\, SE(\mathbf{b}, a, o)$. Then

$V^*_t(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \sum_{o} P(o \mid \mathbf{b}, a)\, (\boldsymbol{\alpha}^{a,o})^T\, SE(\mathbf{b}, a, o) \right]$

Since $P(o \mid \mathbf{b}, a)\, SE(\mathbf{b}, a, o) = T^{a,o}\, \mathbf{b}$,

$V^*_t(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \sum_{o} (\boldsymbol{\alpha}^{a,o})^T\, T^{a,o}\, \mathbf{b} \right] = \max_a \left[ \mathbf{r}_a^T \mathbf{b} + \sum_{o} (\boldsymbol{\alpha}^{a,o})^T\, T^{a,o}\, \mathbf{b} \right] = \max_a \left[ \mathbf{r}_a^T + \sum_{o} (\boldsymbol{\alpha}^{a,o})^T\, T^{a,o} \right] \mathbf{b}$

Let $\boldsymbol{\alpha}_{k,t}^T = \mathbf{r}_a^T + \sum_{o} (\boldsymbol{\alpha}_i^{a,o})^T\, T^{a,o}$, one such vector for each action $a$ and each choice of the $\boldsymbol{\alpha}^{a,o}$'s. Then

$V^*_t(\mathbf{b}) = \max_k \boldsymbol{\alpha}_{k,t}^T\, \mathbf{b}$

Hence $V^*_t(\mathbf{b})$ is PWLC. $\blacksquare$
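The proof is constructive: it tells us how to build the horizon-$t$ $\alpha$-vectors from the horizon-$(t-1)$ ones. A sketch of that one-step backup follows (illustrative code building on the assumed POMDP container; it enumerates every combination of previous $\alpha$-vectors and does no pruning of dominated vectors).

import itertools
import numpy as np

def backup_alpha_vectors(pomdp, alphas_prev):
    # T_ao[s', s] = O(s', a, o) * T(s, a, s'), so that P(o|b,a) SE(b,a,o) = T_ao @ b
    new_alphas = []
    for a in range(pomdp.n_actions):
        T_ao = [pomdp.O[:, a, o][:, None] * pomdp.T[:, a, :].T
                for o in range(pomdp.n_obs)]
        r_a = pomdp.R[:, a]
        # one new vector per choice of a previous alpha-vector for every observation:
        # alpha_{k,t} = r_a + sum_o (T^{a,o})^T alpha^{a,o}
        for choice in itertools.product(alphas_prev, repeat=pomdp.n_obs):
            alpha = r_a + sum(T_ao[o].T @ choice[o] for o in range(pomdp.n_obs))
            new_alphas.append(alpha)
    return new_alphas   # V_t*(b) = max over these of alpha . b (before pruning)

# Horizon 1: one alpha-vector per action, alpha_{k,1} = r_{a_k}:
# alphas = [pomdp.R[:, a] for a in range(pomdp.n_actions)]
# alphas = backup_alpha_vectors(pomdp, alphas)   # horizon 2, and so on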