DESIGNING STATES, ACTIONS, AND REWARDS FOR USING POMDP IN SESSION SEARCH
Jiyun Luo, Sicong Zhang, Xuchu Dong, Grace Hui Yang
InfoSense, Department of Computer Science
Georgetown University
{jl1749,sz303,xd47}@georgetown.edu
E.g., find what city and state Dulles Airport is in; what shuttles, ride-sharing vans, and taxi cabs connect the airport to other cities; what hotels are close to the airport; what some cheap off-airport parking options are; and what metro stops are close to Dulles Airport.
DYNAMIC IR – A NEW PERSPECTIVE ON SEARCH
[Figure: the user and the search engine interact in a trial-and-error loop driven by an information need.]
CHARACTERISTICS OF DYNAMIC IR
q1 – "dulles hotels"
q2 – "dulles airport"
q3 – "dulles airport location"
q4 – "dulles metro stop"
Rich interactions: query formulation, document clicks, document examination, eye movements, mouse movements, etc.
CHARACTERISTICS OF DYNAMIC IR
Temporal dependency
CHARACTERISTICS OF DYNAMIC IR
[Figure: a search session driven by information need I. In iteration 1 the user issues query q1, the engine returns ranked documents D1, and the user clicks documents C1; this repeats through iteration n with qn, Dn, and Cn.]
REINFORCEMENT LEARNING (RL)
Fits well in this trial-and-error setting.
RL learns from repeated, varied attempts, continued until success.
The learner (also known as the agent) learns from its dynamic interactions with the world rather than from a labeled dataset as in supervised learning.
The stochastic model assumes that the system's current state depends on the previous state and action in a non-deterministic manner.
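As a concrete toy illustration of the trial-and-error learning described above (this example is ours, not from the paper): tabular Q-learning on a tiny 5-cell corridor, where the agent repeats episodes until it reliably reaches the goal, learning only from observed rewards rather than labels.

```python
import random

N_STATES, ACTIONS = 5, [-1, +1]   # move left / move right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Toy environment: reward 1 only when the goal cell is reached."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

random.seed(0)
for episode in range(200):
    s, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit the current estimate, sometimes explore
        a = random.choice(ACTIONS) if random.random() < EPS else \
            max(ACTIONS, key=lambda a: Q[(s, a)])
        s2, r, done = step(s, a)
        # Q-learning update: learn from the observed transition, no labeled data
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# After training, the greedy policy moves right (+1) from every non-goal state
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
print(policy)
```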
PARTIALLY OBSERVABLE MARKOV DECISION PROCESS (POMDP)
[Figure: a chain of hidden states s0, s1, s2, s3, …; at each step the agent takes action a_i, receives reward r_i, and sees observation o_{i+1} emitted by the hidden state.]
Key ingredients: Markov property, long-term optimization, observations, beliefs.
[R. D. Smallwood et al., '73]
GOAL OF THIS PAPER
Study the design of states, actions, and reward functions of RL algorithms in session search.
WIN-WIN SEARCH: DUAL-AGENT STOCHASTIC GAME
A Partially Observable Markov Decision Process with two agents: a cooperative game with joint optimization.
Hidden states, actions, rewards, Markov property.
[Luo, Zhang, and Yang SIGIR 2014]
PARTIALLY OBSERVABLE MARKOV DECISION PROCESS (POMDP)
A tuple (S, M, A, R, γ, O, Θ, B):
S: state space
M: state transition function
A: actions
R: reward function
γ: discount factor, 0 < γ ≤ 1
O: observations – symbols emitted according to a hidden state
Θ: observation function – Θ(s, a, o) is the probability that o is observed when the system transitions into state s after taking action a, i.e. P(o|s, a)
B: belief space – a belief is a probability distribution over the hidden states
[R. D. Smallwood et al., '73]
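Given M and Θ as defined above, the standard POMDP belief update is b'(s') ∝ Θ(s', a, o) · Σ_s M(s, a, s') · b(s). A minimal sketch with two hidden states, "relevant" (R) and "non-relevant" (NR); all numbers are illustrative, not from the paper:

```python
STATES = ["R", "NR"]

# M[(s, a)][s2] = P(s2 | s, a): transition function for one action, "rerank"
M = {("R", "rerank"):  {"R": 0.8, "NR": 0.2},
     ("NR", "rerank"): {"R": 0.3, "NR": 0.7}}

# Theta[(s2, a)][o] = P(o | s2, a): observation function over {click, skip}
Theta = {("R", "rerank"):  {"click": 0.7, "skip": 0.3},
         ("NR", "rerank"): {"click": 0.2, "skip": 0.8}}

def belief_update(b, a, o):
    """Return the new belief after taking action a and observing o."""
    unnorm = {s2: Theta[(s2, a)][o] * sum(M[(s, a)][s2] * b[s] for s in STATES)
              for s2 in STATES}
    z = sum(unnorm.values())          # normalizer: P(o | b, a)
    return {s2: p / z for s2, p in unnorm.items()}

b0 = {"R": 0.5, "NR": 0.5}
b1 = belief_update(b0, "rerank", "click")
print(b1)   # a click shifts belief mass toward the "relevant" state
```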
HIDDEN DECISION MAKING STATES
S_RT: Relevant & Exploitation
S_RR: Relevant & Exploration
S_NRT: Non-Relevant & Exploitation
S_NRR: Non-Relevant & Exploration
Example query changes (from q0 onward):
scooter price → scooter stores
collecting old US coins → selling old US coins
Philadelphia NYC travel → Philadelphia NYC train
Boston tourism → NYC tourism
[Luo, Zhang, and Yang SIGIR 2014]
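A hypothetical toy heuristic for guessing one of the four hidden states above (the function, thresholds, and proxies are our own illustration, not the paper's inference procedure): SAT clicks stand in for "Relevant", and query-term overlap stands in for exploitation vs. exploration.

```python
def guess_state(prev_query, curr_query, had_sat_click):
    """Guess S_RT / S_RR / S_NRT / S_NRR from a query transition (toy heuristic)."""
    prev, curr = set(prev_query.split()), set(curr_query.split())
    overlap = len(prev & curr) / max(len(prev | curr), 1)
    exploit = overlap >= 0.5              # mostly shared terms -> exploitation
    rel = "R" if had_sat_click else "NR"  # SAT click -> previous results relevant
    return f"S_{rel}{'T' if exploit else 'R'}"

# Refining a query after a SAT click looks like relevant + exploitation:
print(guess_state("dulles airport", "dulles airport location", True))   # S_RT
# Jumping topics with no click looks like non-relevant + exploration:
print(guess_state("boston tourism", "nyc tourism", False))              # S_NRR
```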
ACTIONS
User actions (A_u): add query terms (+Δq), remove query terms (−Δq), keep query terms (q_theme)
Search engine actions (A_se): increase/decrease/keep term weights; switch a search technique on or off (e.g., whether to use query expansion); adjust parameters in search techniques (e.g., select the best k for the top-k docs used in PRF)
Messages from the user (Σ_u): clicked documents, SAT-clicked documents
Messages from the search engine (Σ_se): top k returned documents
Messages are essentially documents that an agent thinks are relevant.
[Luo, Zhang, and Yang SIGIR 2014]
2ND MODEL: QUERY CHANGE MODEL
Based on a Markov Decision Process (MDP).
States: queries.
Observable actions:
User actions: add/remove/keep query terms, which nicely correspond to our definition of query change.
Search engine actions: increase/decrease/keep term weights.
Rewards: nDCG.
[Guan, Zhang, and Yang SIGIR 2013]
SEARCH ENGINE AGENT'S ACTIONS
Term      ∈ D_{i−1}   Action      Example
q_theme   Y           increase    "pocono mountain" in s6
q_theme   N           increase    "france world cup 98 reaction" in s28: france world cup 98 reaction stock market → france world cup 98 reaction
+Δq       Y           decrease    'policy' in s37: Merck lobbyists → Merck lobbyists US policy
+Δq       N           increase    'US' in s37: Merck lobbyists → Merck lobbyists US policy
−Δq       Y           decrease    'reaction' in s28: france world cup 98 reaction → france world cup 98
−Δq       N           no change   'legislation' in s32: bollywood legislation → bollywood law
[Guan, Zhang, and Yang SIGIR 2013]
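The decision table above can be sketched in code as follows (a simplification assuming whitespace tokenization; `d_prev_terms` stands for the terms of the previously clicked document d*_{i−1}):

```python
def term_actions(prev_query, curr_query, d_prev_terms):
    """Map each term of the query change to a search engine weight action."""
    prev, curr = prev_query.split(), curr_query.split()
    theme = [t for t in curr if t in prev]          # q_theme
    added = [t for t in curr if t not in prev]      # +Δq
    removed = [t for t in prev if t not in curr]    # −Δq
    actions = {}
    for t in theme:
        actions[t] = "increase"                     # theme terms: always increased
    for t in added:
        # old added terms (already in d*_{i-1}) decreased, novel ones increased
        actions[t] = "decrease" if t in d_prev_terms else "increase"
    for t in removed:
        # removed terms: decreased if they came from d*_{i-1}, else unchanged
        actions[t] = "decrease" if t in d_prev_terms else "no change"
    return actions

# The s37 example from the table: 'policy' was in d*_{i-1}, 'us' was not
acts = term_actions("merck lobbyists", "merck lobbyists us policy",
                    d_prev_terms={"merck", "lobbyists", "policy"})
print(acts)
```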
QUERY CHANGE RETRIEVAL MODEL (QCM)
The Bellman equation gives the optimal value for an MDP:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]
The reward function is used as the document relevance score function and is derived backwards from the Bellman equation:
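The Bellman optimality equation above can be solved by value iteration; a generic sketch on a toy two-state MDP (all numbers illustrative, not from the paper):

```python
GAMMA = 0.9
S, A = ["s0", "s1"], ["a0", "a1"]
# R[s][a]: immediate reward; P[s][a][s2]: transition probability
R = {"s0": {"a0": 0.0, "a1": 1.0}, "s1": {"a0": 2.0, "a1": 0.0}}
P = {"s0": {"a0": {"s0": 1.0, "s1": 0.0}, "a1": {"s0": 0.2, "s1": 0.8}},
     "s1": {"a0": {"s0": 0.9, "s1": 0.1}, "a1": {"s0": 0.0, "s1": 1.0}}}

# Repeatedly apply the Bellman backup until the values (nearly) converge
V = {s: 0.0 for s in S}
for _ in range(200):
    V = {s: max(R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in S)
                for a in A)
         for s in S}

# Greedy policy with respect to the converged values
policy = {s: max(A, key=lambda a: R[s][a] + GAMMA *
                 sum(P[s][a][s2] * V[s2] for s2 in S))
          for s in S}
print(V, policy)
```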
[Figure: Score(q, d) decomposes into the current reward/relevance score, the query transition model, and the maximum past relevance of the document.]
[Guan, Zhang, and Yang SIGIR 2013]
CALCULATING THE TRANSITION MODEL
Score(q_i, d) = log P(q_i|d)
  + α Σ_{t_i ∈ q_theme} [1 − P(t_i|d*_{i−1})] log P(t_i|d)
  − β Σ_{t_i ∈ +Δq, t_i ∈ d*_{i−1}} P(t_i|d*_{i−1}) log P(t_i|d)
  + ε Σ_{t_i ∈ +Δq, t_i ∉ d*_{i−1}} idf(t_i) log P(t_i|d)
  − δ Σ_{t_i ∈ −Δq} P(t_i|d*_{i−1}) log P(t_i|d)
• According to query change and search engine actions, the current reward/relevance score:
  • increases weights for theme terms
  • decreases weights for removed terms
  • increases weights for novel added terms
  • decreases weights for old added terms
[Guan, Zhang, and Yang SIGIR 2013]
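The four rules above can be sketched as per-term query weights that feed a weighted query-likelihood score Σ_t w(t) · log P(t|d). This is our simplification, not the paper's exact formula; the coefficient names alpha/beta/eps/delta and their values are illustrative:

```python
def qcm_weights(theme, added, removed, d_prev_terms, p_t_dprev, idf,
                alpha=2.2, beta=1.8, eps=0.07, delta=0.4):
    """Adjust per-term weights according to the four QCM-style rules."""
    w = {}
    for t in theme:                          # theme terms: boosted
        w[t] = 1 + alpha * (1 - p_t_dprev.get(t, 0.0))
    for t in added:
        if t in d_prev_terms:                # old added terms: demoted
            w[t] = 1 - beta * p_t_dprev.get(t, 0.0)
        else:                                # novel added terms: boosted by idf
            w[t] = 1 + eps * idf.get(t, 1.0)
    for t in removed:                        # removed terms: penalized
        w[t] = -delta * p_t_dprev.get(t, 0.0)
    return w

w = qcm_weights(theme=["merck", "lobbyists"],
                added=["us", "policy"], removed=[],
                d_prev_terms={"merck", "lobbyists", "policy"},
                p_t_dprev={"merck": 0.1, "lobbyists": 0.1, "policy": 0.3},
                idf={"us": 2.0})
print(w)   # theme and novel terms get weight > 1, old added term 'policy' < 1
```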
RELATED WORK
Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. Balancing exploration and exploitation in learning to rank online. In ECIR '11.
Xiaoran Jin, Marc Sloan, and Jun Wang. Interactive exploratory search for multi-page search results. In WWW '13.
Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit user modeling for personalized search. In CIKM '05.
Norbert Fuhr. A probability ranking principle for interactive information retrieval. In IRJ, 11(3), 2008.
STATE DESIGN OPTIONS
(S1) Fixed number of states
use two binary relevance states, "Relevant" or "Irrelevant"
use four states: whether the previously retrieved documents are relevant, and whether the user desires to explore
(S2) Varying number of states
model queries as states: n queries give n states, a potentially infinite state space
model document relevance score distributions as states: one document corresponds to one state
ACTION DESIGN OPTIONS
(A1) Technology selection: a meta-level modeling of actions
implement multiple search methods, and select the best method for each query
select the best parameters for each method
(A2) Term weight adjustment: adjusted term weights
(A3) Ranked list: one possible ranking of a list of documents is one single action. If the corpus size is N and the number of retrieved documents is n, then the size of the action space is N!/(N−n)!.
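Since ordering n documents drawn from a corpus of N is a permutation, the (A3) action space has N!/(N−n)! elements, which explodes even for tiny corpora:

```python
import math

def action_space_size(N, n):
    """Number of distinct ranked lists of n documents from a corpus of N."""
    return math.perm(N, n)           # N! / (N - n)!

print(action_space_size(10, 3))      # already 720 ranked lists from 10 docs
```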
REWARD FUNCTION DESIGN OPTIONS
(R1) Explicit feedback: rewards generated from the user's relevance assessments, e.g., nDCG, MAP
(R2) Implicit feedback: rewards obtained from user behavior, e.g., clicks, SAT clicks
SYSTEMS UNDER COMPARISON
Luo, et al. Win-Win Search: Dual-Agent Stochastic Game in Session Search. SIGIR’14
Zhang, et al. A POMDP Model for Content-Free Document Re-ranking. SIGIR’14
Guan, et al. Utilizing Query Change for Session Search. SIGIR’13
Shen, et al. Implicit user modeling for personalized search. CIKM '05
Jin, et al. Interactive exploratory search for multi page search results. WWW '13
S1A1R1(win-win)
S1A3R2
S2A2R1(QCM)
S2A1R1(UCAIR)
S2A3R1(IES)
S1A1R2
S1A2R1
S2A1R1
EXPERIMENTS
Evaluate on the TREC 2012 and 2013 Session Tracks.
The session logs contain the session topic, user queries, previously retrieved URLs and snippets, user clicks, dwell time, etc.
Task: retrieve 2,000 documents for the last query in each session.
The evaluation is based on the whole session. Metrics include nDCG@10, nDCG, nERR@10, and MAP for accuracy, and wall clock time, CPU cycles, and Big O complexity for efficiency.
Datasets: ClueWeb09 CatB and ClueWeb12 CatB; spam documents and duplicate documents are removed.
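As a reminder of the main accuracy metric, a self-contained sketch of nDCG@10 in one common formulation (linear gains; the relevance grades below are made up for illustration):

```python
import math

def dcg(gains, k=10):
    """Discounted cumulative gain of the top-k graded relevance labels."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

ranked = [3, 2, 3, 0, 1, 2]     # relevance grades in retrieved order
print(round(ndcg(ranked), 4))
```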
EFFICIENCY VS. # OF ACTIONS ON TREC 2012
When the number of actions increases, efficiency tends to drop dramatically.
S1A3R2, S1A2R1, S2A1R1(UCAIR), S2A2R1(QCM), and S2A1R1 are efficient.
S1A1R1(win-win) and S1A1R2 are moderately efficient.
S2A3R1(IES) is the slowest system.
ACCURACY VS. EFFICIENCY
TREC 2012 and TREC 2013:
Accuracy tends to increase when efficiency decreases.
S2A1R1(UCAIR) strikes a good balance between accuracy and efficiency.
S1A1R1(win-win) gives impressive accuracy with a fair degree of efficiency.
OUR RECOMMENDATION
The choice depends on whether the focus is on accuracy, whether the time limit is within one hour, or whether a balance of accuracy and efficiency is desired.
Note: the number of actions heavily affects efficiency and needs to be carefully designed.
CONCLUSIONS
POMDPs are a good fit for modeling session search and information seeking behaviors.
Design questions:
States: what changes with each time step?
Actions: how does our system change the state?
Rewards: how can we measure feedback or effectiveness?
Designing them is something between an art and empirical experimentation.
Balance between efficiency and accuracy.
RESOURCES
InfoSense: http://infosense.cs.georgetown.edu/
Dynamic IR website and tutorials: http://www.dynamic-ir-modeling.org/
Live online search engine – Dumpling: http://dumplingproject.org
Upcoming book: Dynamic Information Retrieval Modeling
TREC 2015 Dynamic Domain Track: http://trec-dd.org/ – please participate if you are interested in interactive and dynamic search.