A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems
Alan Said, Alejandro Bellogín, Arjen de Vries (CWI)
@alansaid, @abellogin, @arjenpdevries
Outline
• Evaluation
  – Real world
  – Offline
• Protocol
• Experiments & Results
• Conclusions
EVALUATION
• Not an algorithmic comparison!
• A comparison of evaluation approaches
Evaluation
• Does p@10 in [Smith, 2010a] measure the same quality as p@10 in [Smith, 2012b]?
  – Even if it does:
    • Is the underlying data the same?
    • Was cross-validation performed similarly?
    • etc.
Evaluation
• What metrics should we use?
• How should we evaluate?
  – Relevance criteria for test items
  – Cross-validation (n-fold, random)
• Should all users and items be treated the same way?
  – Do certain users and items reflect different evaluation qualities?
Offline Evaluation
Recommender system accuracy evaluation is currently based on methods from IR/ML:
– One training set
– One test set
– (One validation set)
– Algorithms are trained on the training set
– Evaluate using metric@N (e.g. p@N, where N is a page size)
• Even when N is larger than the number of test items
• p@N = 1.0 is (almost) impossible (as the sketch below illustrates)
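A minimal sketch of why this ceiling appears (hypothetical item IDs, standard precision definition): the denominator of p@N is always N, so a user with fewer than N relevant test items cannot reach 1.0 no matter how good the ranking is.

```python
def precision_at_n(ranked_items, relevant_items, n):
    """Fraction of the top-n recommended items that appear in the test set."""
    hits = sum(1 for item in ranked_items[:n] if item in relevant_items)
    return hits / n  # denominator is always n, never the (smaller) test-set size

# With only 3 relevant test items, p@10 is capped at 0.3 even for a perfect ranking:
ranked = ["i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9", "i10"]
print(precision_at_n(ranked, {"i1", "i2", "i3"}, 10))  # 0.3
```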
Evaluation in Production
• One dynamic training set
  – All of the available data at a certain point in time
  – Continuously updated
• No test set
  – Only live user interactions
• Clicked/purchased items are good recommendations (a rough sketch of this loop follows)
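A rough, hypothetical sketch of that loop; the `recommender` interface (`fit`, `top_n`) is assumed for illustration only and is not something described in the talk.

```python
def production_style_evaluation(interaction_stream, recommender, page_size=10):
    """Sketch: no held-out test set; the model always trains on all data seen so far,
    and every clicked/purchased item counts as a correct recommendation."""
    history = []                                    # continuously growing training data
    hits = shown = 0
    for user, clicked_item in interaction_stream:   # live user interactions
        recommender.fit(history)                    # update on everything available right now
        recommended = recommender.top_n(user, page_size)
        shown += page_size
        hits += clicked_item in recommended
        history.append((user, clicked_item))        # the interaction joins the training data
    return hits / shown if shown else 0.0
```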
Can we simulate this offline?
Evaluation Protocol
• Based on "real world" concepts
• Uses as much available data as possible
• Trains algorithms once per user and evaluation setting (e.g. N)
• Evaluates p@N when there are exactly N correct items in the test set
  – p@N = 1 becomes possible (gold standard)
Evaluation Protocol
Three concepts:
1. Personalized training & test sets
   – Use all available information about the system for the candidate user
   – Different test/training sets for different levels of N
2. Candidate item selection (items in test sets)
   – Only "good" items go in test sets (no random 80%-20% splits)
   – How "good" an item is depends on each user's personal preference
3. Candidate user selection (users in test sets)
   – Candidate users must have items in the training set
   – When evaluating p@N, each user in the test set should have N items in the test set
     • Effectively, precision becomes R-precision
Train each algorithm once for each user in the test set and once for each N (a sketch of the split construction follows below).
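A minimal Python sketch of how the three concepts could be combined for a single page size N, under assumptions about the data format (a list of (user, item, rating) tuples) and with hypothetical names; the exact candidate-item ranking and threshold used by the authors may differ.

```python
from collections import defaultdict

def protocol_splits(ratings, n, threshold_fn):
    """Sketch: build a personalized training/test split per candidate user.

    ratings      -- list of (user, item, rating) tuples (assumed format)
    n            -- page size; each test set must contain exactly n relevant items
    threshold_fn -- maps a user's list of ratings to that user's relevance threshold
    """
    by_user = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))

    splits = {}
    for user, pairs in by_user.items():
        rt = threshold_fn([r for _, r in pairs])
        # Candidate item selection: only "good" (above-threshold) items may enter the test set.
        relevant = [item for item, r in sorted(pairs, key=lambda p: -p[1]) if r >= rt]
        # Candidate user selection: needs at least n relevant items, plus items left for training.
        if len(relevant) < n or len(pairs) <= n:
            continue
        test_items = set(relevant[:n])
        # Personalized training set: all available data except this user's n test items.
        train = [(u, i, r) for u, i, r in ratings if not (u == user and i in test_items)]
        splits[user] = (train, test_items)
    return splits  # one training run per (candidate user, n)
```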
Evaluation Protocol
EXPERIMENTS
Experiments
• Datasets:
  – MovieLens 100k
    • Minimum 20 ratings per user
    • 943 users
    • 6.43% density
    • Not realistic
  – MovieLens 1M sample
    • 100k ratings
    • 1000 users
    • 3.0% density
• Algorithms
  – SVD
  – User-based CF (kNN)
  – Item-based CF
[Figures: number of users vs. number of ratings (log scales) for each dataset]
Experimental Settings
According to the proposed protocol:
• Evaluate R-precision for N = [1, 5, 10, 20, 50, 100]
• Users evaluated at N must have at least N items rated above the relevance threshold (RT)
• RT depends on the user's mean rating and standard deviation (one possible reading is sketched below)
• Number of runs: |N| * |users|
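The slide only states that RT depends on the user's mean rating and standard deviation; one plausible reading (an assumption, not necessarily the authors' exact formula) is RT = mean + one standard deviation, which could serve as the `threshold_fn` argument in the earlier protocol sketch.

```python
from statistics import mean, pstdev

def relevance_threshold(user_ratings):
    """Assumed per-user relevance threshold: mean rating plus one standard deviation."""
    return mean(user_ratings) + pstdev(user_ratings)

# A user who rates generously gets a higher bar for "relevant":
print(relevance_threshold([5, 4, 5, 3, 4]))  # 4.2 + ~0.75, roughly 4.95
```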
Baseline
• Evaluate p@N for N = [1, 5, 10, 20, 50, 100]
• 80%-20% training-test split (sketched below)
  – Items in test set rated at least 3
• Number of runs: 1
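For contrast, a sketch of the conventional baseline under the same assumed data format: one random 80/20 split for all users, with only ratings of at least 3 kept as test relevance judgments. Names and the handling of low-rated test candidates are illustrative assumptions.

```python
import random

def baseline_split(ratings, test_fraction=0.2, min_test_rating=3, seed=0):
    """Sketch of the baseline: a single random 80%-20% split shared by all users.

    Candidate test ratings below `min_test_rating` are pushed back into training,
    mirroring the slide's "items in test set rated at least 3" criterion.
    """
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    candidates, train = shuffled[:cut], shuffled[cut:]
    test = [t for t in candidates if t[2] >= min_test_rating]
    train += [t for t in candidates if t[2] < min_test_rating]
    return train, test  # one training run total, regardless of N
```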
Results
[Figure: User-based CF, ML1M sample]
Results
[Figures: User-based CF (ML1M sample), User-based CF (ML100k), SVD (ML1M sample), SVD (ML1M sample)]
Results
What about time?
– |N| * |users| runs vs. 1?
– Trade-off between a realistic evaluation and complexity?
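As a rough worked example using the numbers on these slides: with the six values of N and on the order of 1,000 users per dataset, the proposed protocol implies on the order of 6,000 training runs versus a single run for the baseline (fewer in practice, since not every user qualifies as a candidate at every N).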
Conclusions
• We can emulate a realistic production scenario by creating personalized training/test sets and evaluating them for each candidate user separately
• We can see how well a recommender performs at different levels of recall (page size)
• We can compare against a gold standard
• We can reduce evaluation time
Questions? Thanks!
• Also, check out:
  – ACM TIST Special Issue on RecSys Benchmarking: bit.ly/RecSysBe
  – The ACM RecSys Wiki: www.recsyswiki.com