Counterfactual Model for Online Systems
CS 7792 - Fall 2016
Thorsten Joachims
Department of Computer Science & Department of Information Science
Cornell University
Imbens & Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences, 2015. Chapters 1, 3, 12.
Interactive System Schematic
[Figure: context x → system π0 → action y for x; observed utility U(π0)]
News Recommender
• Context x:
 – User
• Action y:
 – Portfolio of news articles
• Feedback δ(x, y):
 – Reading time in minutes
Ad Placement
• Context x:
 – User and page
• Action y:
 – Ad that is placed
• Feedback δ(x, y):
 – Click / no-click
Search Engine
• Context x:
 – Query
• Action y:
 – Ranking
• Feedback δ(x, y):
 – Win/loss against baseline in interleaving
Log Data from Interactive Systems
• Data
  S = ((x_1, y_1, δ_1), …, (x_n, y_n, δ_n))
  → partial-information (aka "contextual bandit") feedback
• Properties
 – Contexts x_i drawn i.i.d. from unknown P(X)
 – Actions y_i selected by existing system π0: X → Y
 – Feedback δ_i from unknown function δ: X × Y → ℝ
[Figure: context → π0 → action → reward/loss]
[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
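To make the data model concrete, here is a minimal sketch (Python; the class and field names are hypothetical) of one record of S. Recording the logging probability π0(y|x) alongside (x, y, δ) anticipates the IPS estimator covered later in this lecture.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class LoggedInteraction:
    """One record (x_i, y_i, delta_i) of the log S."""
    x: Any             # context, e.g. user profile or query
    y: Any             # action chosen by the logging policy pi_0
    delta: float       # observed feedback, e.g. reading time or click
    propensity: float  # pi_0(y | x), recorded at logging time

# The log S is then simply a list of such records:
# S = [LoggedInteraction(x_1, y_1, delta_1, p_1), ...]
```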
Goal
• Use interaction log data
  S = ((x_1, y_1, δ_1), …, (x_n, y_n, δ_n))
  for evaluation of a system π:
 – Estimate online measures of some system π offline.
 – System π can be different from the π0 that generated the log.
Evaluation: Outline
• Evaluating online metrics offline
 – A/B testing (on-policy) vs. counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
 – Estimation via reward prediction
• Approach 2: "Model the bias"
 – Counterfactual model
 – Inverse propensity scoring (IPS) estimator
Online Performance Metrics
• Example metrics
 – CTR, revenue, time-to-success, interleaving, etc.
• The correct choice depends on the application and is not the focus of this lecture.
• This lecture: the metric is encoded as δ(x, y) [click/payoff/time for the (x, y) pair].
System
• Definition [Deterministic Policy]:
  A function y = π(x) that picks action y for context x.
• Definition [Stochastic Policy]:
  A distribution π(y|x) that samples action y given context x.
[Figure: two stochastic policies π1(Y|x) and π2(Y|x), shown as distributions over Y for a given context x]
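A minimal sketch of the two definitions (Python; the action set and the probability values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["y1", "y2", "y3"]  # assumed small, discrete action set Y

def deterministic_policy(x):
    """Deterministic policy: y = pi(x), one fixed action per context."""
    return ACTIONS[hash(x) % len(ACTIONS)]

def stochastic_policy_probs(x):
    """Stochastic policy: returns pi(y|x) for every y in Y.
    A real policy would compute these from features of x;
    here a fixed example distribution stands in."""
    return np.array([0.7, 0.2, 0.1])

def sample_action(x):
    """Draw y ~ pi(Y|x)."""
    return rng.choice(ACTIONS, p=stochastic_policy_probs(x))
```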
System Performance
Definition [Utility of Policy]:
The expected reward / utility U(π) of policy π is
  U(π) = ∫∫ δ(x, y) π(y|x) P(x) dx dy
where δ(x, y) is, e.g., the reading time of user x for portfolio y.
[Figure: action distributions π(Y|x_i) for sampled contexts x_i]
Online Evaluation: A/B Testing
Given S = ((x_1, y_1, δ_1), …, (x_n, y_n, δ_n)) collected under π0, A/B testing works as follows:
 – Deploy π1: draw x ∼ P(X), predict y ∼ π1(Y|x), get δ(x, y)
 – Deploy π2: draw x ∼ P(X), predict y ∼ π2(Y|x), get δ(x, y)
   ⋮
 – Deploy π_|H|: draw x ∼ P(X), predict y ∼ π_|H|(Y|x), get δ(x, y)
The on-policy estimate for the deployed policy is
  Û(π0) = (1/n) Σ_{i=1}^{n} δ_i
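The estimate above is just the empirical mean of the observed feedback. A sketch of one simulated A/B bucket, assuming we can sample contexts from P(X) and query δ (both hypothetical stand-ins for a real deployment):

```python
import numpy as np

rng = np.random.default_rng(1)

def ab_bucket_estimate(n, draw_context, policy_probs, actions, delta):
    """Deploy one policy on-policy: draw x ~ P(X), y ~ pi(Y|x),
    observe delta(x, y), and return (1/n) * sum_i delta_i."""
    rewards = []
    for _ in range(n):
        x = draw_context()                          # x ~ P(X)
        y = rng.choice(actions, p=policy_probs(x))  # y ~ pi(Y|x)
        rewards.append(delta(x, y))                 # observed feedback
    return float(np.mean(rewards))
```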
Pros and Cons of A/B Testing
• Pros
 – User-centric measure
 – No need for manual ratings
 – No user/expert mismatch
• Cons
 – Requires interactive experimental control
 – Risk of fielding a bad or buggy π_i
 – Number of A/B tests is limited
 – Long turnaround time
Evaluating Online Metrics Offline
• Online: on-policy A/B test
 – Each candidate policy must be deployed to collect its own data: draw y_1 from π1, get δ_1; draw y_2 from π2, get δ_2; …
• Offline: off-policy counterfactual estimates
 – Draw all of S from π0, then reuse that single log to estimate Û(π) for many candidate policies without deploying them.
[Figure: on-policy A/B testing assigns each interaction to one deployed policy; off-policy estimation reuses one log collected under π0 to produce estimates Û(π) for all candidate policies]
Evaluation: Outline (recap)
• Next: Approach 1 "Model the world" – estimation via reward prediction
Approach 1: Reward Predictor
• Idea:
 – Use S = ((x_1, y_1, δ_1), …, (x_n, y_n, δ_n)) from π0 to estimate a reward predictor δ̂(x, y).
• Deterministic π: simulated A/B testing with predicted δ̂(x, y)
 – For the actions y'_i = π(x_i) of the new policy π, generate the predicted log
   S' = ((x_1, y'_1, δ̂(x_1, y'_1)), …, (x_n, y'_n, δ̂(x_n, y'_n)))
 – Estimate the performance of π via
   Û_RP(π) = (1/n) Σ_{i=1}^{n} δ̂(x_i, y'_i)
• Stochastic π:
   Û_RP(π) = (1/n) Σ_{i=1}^{n} Σ_y δ̂(x_i, y) π(y|x_i)
[Figure: predicted rewards δ̂(x, y_1), δ̂(x, y_2), … across actions y for a context x]
Regression for Reward Prediction
Learn δ̂: X × Y → ℝ
1. Represent context/action pairs via features Ψ(x, y)
2. Learn a regression for δ̂ based on Ψ(x, y) from the log S collected under π0
3. Predict δ̂(x, y') for the actions y' = π(x) of the new policy π
[Figure: regression surface δ̂(x, y) over feature space (Ψ1, Ψ2)]
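A minimal sketch of steps 1–3 (Python/scikit-learn; the ridge regressor and the feature map psi_fn are assumed choices, not prescribed by the lecture):

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_reward_predictor(Psi_logged, deltas):
    """Step 2: regress the logged rewards delta_i on the joint
    features Psi(x_i, y_i) of the logged context/action pairs."""
    model = Ridge(alpha=1.0)
    model.fit(Psi_logged, deltas)
    return model

def rp_estimate(model, psi_fn, contexts, policy):
    """Step 3: U_RP(pi) = (1/n) * sum_i delta_hat(x_i, pi(x_i)),
    where psi_fn(x, y) builds the feature vector Psi(x, y)."""
    Psi_new = np.array([psi_fn(x, policy(x)) for x in contexts])
    return float(np.mean(model.predict(Psi_new)))
```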
News Recommender: Experiment Setup
• Context x: user profile
• Action y: ranking
 – Pick from 7 candidates to place into 3 slots
• Reward δ: "revenue"
 – A complicated hidden function
• Logging policy π0: non-uniform randomized logging system
 – Plackett-Luce "explore around current production ranker"
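For concreteness, one way such a Plackett-Luce logging policy can be implemented (a sketch; the exact parameterization used in the experiment is not given in the slides). Slots are filled without replacement by sampling from a softmax over the production ranker's scores, and the product of the per-slot probabilities is exactly the propensity π0(y|x) of the sampled ranking:

```python
import numpy as np

rng = np.random.default_rng(2)

def plackett_luce_sample(scores, k):
    """Sample a top-k ranking around the production ranker's scores.
    Returns the ranking and its propensity pi_0(ranking | x), i.e. the
    product of the softmax probabilities of each slot-filling draw."""
    scores = np.asarray(scores, dtype=float)
    remaining = list(range(len(scores)))
    ranking, propensity = [], 1.0
    for _ in range(k):
        logits = scores[remaining]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        j = rng.choice(len(remaining), p=probs)
        propensity *= probs[j]
        ranking.append(remaining.pop(j))
    return ranking, propensity

# e.g. 7 candidates into 3 slots, as in the setup above:
ranking, p0 = plackett_luce_sample(scores=[2.1, 1.4, 0.9, 0.5, 0.2, 0.1, -0.3], k=3)
```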
News Recommender: Results
[Figure: estimation error of the reward predictor (RP); RP remains inaccurate even with more training data and more logged data]
Problems of the Reward Predictor
• Modeling bias
 – The choice of features and model may be unable to represent δ(x, y).
• Selection bias
 – π0's actions are over-represented in the training data, so δ̂ is least reliable exactly where the new policy π deviates from π0.
   Û_RP(π) = (1/n) Σ_{i=1}^{n} δ̂(x_i, π(x_i))
[Figure: regression surface δ̂(x, y) over (Ψ1, Ψ2); the fit can be unreliable and biased]
Evaluation: Outline (recap)
• Next: Approach 2 "Model the bias" – the counterfactual model
Approach 2: "Model the Bias"
• Idea:
  Fix the mismatch between the distribution π0(Y|x) that generated the data and the distribution π(Y|x) we aim to evaluate. Starting from
    U(π0) = ∫∫ δ(x, y) π0(y|x) P(x) dx dy,
  replace the logging distribution π0(y|x) with the target π(y|x) by reweighting the logged data.
Counterfactual Model
• Example: treating heart attacks
 – Treatments Y:
   • Bypass / stent / drugs
 – Chosen treatment for patient x_i: y_i
 – Outcomes δ_i:
   • 5-year survival: 0 / 1
 – Which treatment is best?
[Figure: table of patients i = 1, …, n with the observed 0/1 outcome of each patient's chosen treatment]
[Analogous setting: placing a vertical on a search engine results page at position 1 / 2 / 3; feedback is click / no-click on the SERP.]
Counterfactual Model (cont'd)
• Which treatment is best?
 – Everybody drugs? Everybody stent? Everybody bypass?
 – Averaging only the factual outcomes gives drugs 3/4, stent 2/3, bypass 2/4 – really?
Treatment Effects
• Average treatment effect of treatment y:
  U(y) = (1/n) Σ_{i=1}^{n} δ(x_i, y)
• Example (n = 11 patients):
 – U(bypass) = 4/11
 – U(stent) = 6/11
 – U(drugs) = 3/11
[Figure: patients table distinguishing the factual outcome (recorded in the log for the chosen treatment) from the counterfactual outcomes (unobserved for the treatments not chosen)]
Assignment Mechanism
• Probabilistic treatment assignment
 – For patient i: π0(Y_i = y | x_i)
 – This induces selection bias.
• Inverse propensity score (IPS) estimator
  Û_IPS(y) = (1/n) Σ_{i=1}^{n} [1{y_i = y} / p_i] δ(x_i, y_i)
 – Propensity: p_i = π0(Y_i = y_i | x_i)
 – Unbiased: E[Û(y)] = U(y), if π0(Y_i = y | x_i) > 0 for all i
• Example
  Û(drugs) = (1/11) [1/0.8 + 1/0.7 + 1/0.8 + 0/0.1] ≈ 0.36 < 0.75
[Figure: patients table augmented with the assignment probabilities π0(Y_i = y | x_i) for each treatment]
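The example's arithmetic, checked in a couple of lines (the outcomes and propensities are read off the slide's table for the four patients who received drugs):

```python
# IPS estimate for "drugs": n = 11 patients, four of whom received drugs.
n = 11
outcomes     = [1,   1,   1,   0  ]   # delta_i for the patients with y_i = drugs
propensities = [0.8, 0.7, 0.8, 0.1]   # p_i = pi_0(Y_i = drugs | x_i)

u_ips = sum(d / p for d, p in zip(outcomes, propensities)) / n
print(round(u_ips, 2))  # 0.36 -- vs. the naive factual average 3/4 = 0.75
```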
Experimental vs. Observational
• Controlled experiment
 – Assignment mechanism is under our control
 – Propensities p_i = π0(Y_i = y_i | x_i) are known by design
 – Requirement: ∀y: π0(Y_i = y | x_i) > 0 (probabilistic assignment)
• Observational study
 – Assignment mechanism is not under our control
 – Propensities p_i need to be estimated: estimate π̂0(Y_i | z_i) ≈ π0(Y_i | x_i) based on observed features z_i
 – Requirement: π̂0(Y_i | z_i) = π̂0(Y_i | δ_i, z_i) (unconfoundedness: given z_i, the assignment carries no additional information about the outcome)
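A sketch of propensity estimation in the observational case (Python/scikit-learn; the multinomial logistic model is an assumed choice, and the result is only as good as the unconfoundedness of the features z):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_propensities(Z, y):
    """Estimate p_hat_i = pi_0_hat(Y_i = y_i | z_i) from observed
    features z_i by predicting the chosen treatment from Z."""
    model = LogisticRegression(max_iter=1000)
    model.fit(Z, y)                      # predict treatment from features
    probs = model.predict_proba(Z)       # one column per treatment class
    cols = np.searchsorted(model.classes_, y)
    return probs[np.arange(len(y)), cols]  # probability of the factual treatment
```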
Conditional Treatment Policies
• Policy (deterministic)
 – Context x_i describing the patient
 – Pick treatment y_i based on x_i: y_i = π(x_i)
 – Example policy:
   • π(A) = drugs, π(B) = stent, π(C) = bypass
• Average treatment effect
  U(π) = (1/n) Σ_{i=1}^{n} δ(x_i, π(x_i))
• IPS estimator
  Û_IPS(π) = (1/n) Σ_{i=1}^{n} [1{y_i = π(x_i)} / p_i] δ(x_i, y_i)
[Figure: patients table; each patient has a type in {A, B, C}]
Stochastic Treatment Policies
• Policy (stochastic)
 – Context x_i describing the patient
 – Pick treatment y based on x_i: y ∼ π(Y|x_i)
• Note
 – The assignment mechanism π0 is a stochastic policy as well!
• Average treatment effect
  U(π) = (1/n) Σ_{i=1}^{n} Σ_y δ(x_i, y) π(y|x_i)
• IPS estimator
  Û_IPS(π) = (1/n) Σ_{i=1}^{n} [π(y_i|x_i) / p_i] δ(x_i, y_i)
[Figure: patients table with types {A, B, C}]
Counterfactual Model = Logs

                  Medical              Search Engine   Ad Placement       Recommender
Context x_i       Diagnostics          Query           User + page        User + movie
Treatment y_i     Bypass/stent/drugs   Ranking         Placed ad          Watched movie
Outcome δ_i       Survival             Click metric    Click / no-click   Star rating
Propensities p_i  controlled (*)       controlled      controlled         observational
New policy π      FDA guidelines       Ranker          Ad placer          Recommender
T-effect U(π)     Average quality of the new policy.

(Context, treatment, and outcome are what is recorded in the log.)
Evaluation: Outline (recap)
• Next: the inverse propensity scoring (IPS) estimator
System Evaluation via Inverse Propensity Scoring
Definition [IPS Utility Estimator]: Given S = ((x_1, y_1, δ_1), …, (x_n, y_n, δ_n)) collected under π0, with logged propensities p_i = π0(y_i|x_i),
  Û_IPS(π) = (1/n) Σ_{i=1}^{n} δ_i [π(y_i|x_i) / π0(y_i|x_i)]
This is an unbiased estimate of U(π) for any π, provided the propensity is nonzero whenever π(y_i|x_i) > 0.
Note: if π = π0, this reduces to the online A/B test estimate
  Û_IPS(π0) = (1/n) Σ_{i=1}^{n} δ_i
→ off-policy vs. on-policy estimation.
[Horvitz & Thompson, 1952] [Rosenbaum & Rubin, 1983] [Zadrozny et al., 2003] [Li et al., 2011]
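The estimator in code (a sketch; the log format and names are hypothetical). Here target_prob(y, x) returns π(y|x), which for a deterministic π is 1 if y = π(x) and 0 otherwise, so this one function also covers the conditional treatment policies above:

```python
import numpy as np

def ips_estimate(log, target_prob):
    """U_IPS(pi) = (1/n) * sum_i delta_i * pi(y_i|x_i) / pi_0(y_i|x_i).
    `log` is a list of (x, y, delta, p0) tuples collected under pi_0,
    where p0 is the logged propensity pi_0(y|x)."""
    vals = [d * target_prob(y, x) / p0 for (x, y, d, p0) in log]
    return float(np.mean(vals))
```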
Illustration of IPS
[Figure: logging distribution π0(Y|x) vs. target distribution π(Y|x); each logged point is reweighted by π(y_i|x_i)/π0(y_i|x_i)]
IPS estimator:
  Û_IPS(π) = (1/n) Σ_{i=1}^{n} [π(y_i|x_i) / π0(y_i|x_i)] δ_i
IPS Estimator is Unbiased
  E[Û_IPS(π)]
   = Σ_{x_1,y_1} … Σ_{x_n,y_n} π0(y_1|x_1) P(x_1) ⋯ π0(y_n|x_n) P(x_n) · (1/n) Σ_{i=1}^{n} [π(y_i|x_i) / π0(y_i|x_i)] δ(x_i, y_i)
   = (1/n) Σ_{i=1}^{n} Σ_{x_i,y_i} π0(y_i|x_i) P(x_i) [π(y_i|x_i) / π0(y_i|x_i)] δ(x_i, y_i)
   = (1/n) Σ_{i=1}^{n} Σ_{x_i,y_i} π(y_i|x_i) P(x_i) δ(x_i, y_i)
   = (1/n) Σ_{i=1}^{n} U(π)
   = U(π)
The second step holds because each summand depends only on (x_i, y_i), so all other variables marginalize out; the cancellation in the third step requires probabilistic assignment, i.e. π0(y_i|x_i) > 0 wherever π(y_i|x_i) > 0.
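A quick numerical check of the unbiasedness claim (a toy sketch: one context, three actions, all quantities invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

delta = np.array([1.0, 0.5, 0.0])  # delta(x, y) for actions y in {0, 1, 2}
pi0   = np.array([0.6, 0.3, 0.1])  # logging policy pi_0(y|x), all nonzero
pi    = np.array([0.1, 0.2, 0.7])  # target policy pi(y|x)
true_U = float(pi @ delta)         # U(pi) = 0.2

def ips_once(n):
    """Collect one log of size n under pi_0 and compute U_IPS(pi)."""
    y = rng.choice(3, size=n, p=pi0)
    return float(np.mean(delta[y] * pi[y] / pi0[y]))

estimates = [ips_once(n=100) for _ in range(10_000)]
print(true_U, np.mean(estimates))  # the average estimate is close to U(pi) = 0.2
```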
News Recommender: Results
[Figure: estimation error vs. amount of logged data; IPS eventually beats RP, and the variance of the IPS estimate decays as O(1/n)]