#iCanHazRobot?: improved robot detection for IR usage statistics
Transcript of #iCanHazRobot?: improved robot detection for IR usage statistics
![Page 1: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/1.jpg)
Leabharlann UCD
An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire
UCD Library
University College Dublin, Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
[email protected]
researchrepository.ucd.ie
#iCanHazRobot?
Improved robot detection for IR usage statistics
Open Repositories 2016
Dublin, 14 June
![Page 2: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/2.jpg)
Overview and take-home points
• Usage stats are important
  – (go to the Usage Stats panel on Thursday, 16/Jun/2016: 11:00am - 12:30pm)
• Robot filtration is a problem, especially in repositories
• Robot detection has an exponential effect on usage stats’ accuracy in repositories
• 2-3 ways to improve DSpace and EPrints’ usage stats by 20% or more will be demonstrated
![Page 3: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/3.jpg)
Experimental study
• Simple random sample of 2 years of UCD repository’s download data
  – n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Applied DSpace, EPrints robot detection algorithms to the dataset
  – This is an EXPERIMENT, simulating algorithms on a DSpace repository’s usage data and Apache logs
  – The data is real, live data, and the algorithms were very easy to simulate
![Page 4: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/4.jpg)
First finding
85% of unfiltered repository downloads come from robots
• This is confirmed by a 2013 IRUS-UK white paper on 20 IRs, which also found that 85% of downloads came from robots
![Page 5: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/5.jpg)
Catching more robots improves stats (but how much depends on the number of robots)
[Figure: accuracy of download stats (inverse precision), y-axis 0 to 1, plotted against recall (robots), x-axis 0 to 1. Axis annotations: "Catch more robots" (x), "Get better stats" (y). One curve per traffic profile: typical website, 15% robot traffic; OA journal, 40% robot; Internet Archive, 91% robot; OA repositories, 85% robot.]
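The relationship plotted on this slide can be approximated with a short calculation. The following is a minimal sketch, not the method used in the study: it assumes that detection never flags a human as a robot (no false positives), and the function name `stats_accuracy` and its parameters are illustrative.

```python
def stats_accuracy(robot_share: float, recall: float) -> float:
    """Share of reported downloads made by humans after robot filtering.

    robot_share: fraction of raw downloads made by robots (0 to 1)
    recall:      fraction of robot downloads that the filter catches (0 to 1)
    Assumption: no human download is ever discarded as a robot.
    """
    humans = 1.0 - robot_share
    missed_robots = robot_share * (1.0 - recall)
    reported = humans + missed_robots
    if reported == 0:
        return 1.0  # nothing is reported, so nothing is misreported
    return humans / reported


# Robot shares taken from the slide; recall values chosen for illustration.
profiles = [
    ("Typical website", 0.15),
    ("OA journal", 0.40),
    ("OA repositories", 0.85),
    ("Internet Archive", 0.91),
]
for label, share in profiles:
    row = [round(stats_accuracy(share, r), 2) for r in (0.0, 0.5, 0.9, 1.0)]
    print(f"{label:18s} {row}")
```

Under these assumptions, the higher the robot share, the longer accuracy stays low as recall rises, and the more it gains from the last few robots caught.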
![Page 6: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/6.jpg)
Robot detection techniques used
Systems compared: DSpace; EPrints; Minho DSpace Statistics Add-on
• Rate of requests: ✓ 3
• User agent string: ✓ ✓ ✓
• robots.txt access: ✓
• Volume of requests: ✓ 2, ✓ 3
• List of known robot IP addresses: ✓ ✓
• Reverse DNS name lookup: ✓ 1
• Trap file: ✓
• User agents per IP address:
• Width of traversal in the URL space: ✓ 3
1 Only implemented nominally or experimentally
2 Via the repeat download or ‘double-click’ filter
3 Data available as a configurable report for manual decision making
![Page 7: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/7.jpg)
Measurements used in robot detection
• All measurements are a number between 0 and 1
• Recall: proportion of robots detected
  – I can haz robot?
• Precision: true positives in robot detection
  – Proportion of discounted downloads that are actually made by robots (sometimes humans are counted as robots)
• Accuracy of download stats measured as inverse precision:
  – Proportion of stats that are actually made by humans
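The three measurements can be written out from the four cells of a confusion matrix. This is a sketch following the definitions on this slide; the class and function names are illustrative, and the example counts are made up for demonstration only, not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # robot downloads correctly discounted as robots
    fp: int  # human downloads wrongly discounted as robots
    fn: int  # robot downloads missed, so still counted in the stats
    tn: int  # human downloads correctly left in the stats


def recall(c: Counts) -> float:
    """Proportion of robot downloads that were detected."""
    return c.tp / (c.tp + c.fn)


def precision(c: Counts) -> float:
    """Proportion of discounted downloads that really were robots."""
    return c.tp / (c.tp + c.fp)


def inverse_precision(c: Counts) -> float:
    """Proportion of reported downloads that really were humans."""
    return c.tn / (c.tn + c.fn)


# Made-up counts for 100 downloads, 85 of them by robots.
example = Counts(tp=60, fp=3, fn=25, tn=12)
print(round(recall(example), 2))
print(round(precision(example), 2))
print(round(inverse_precision(example), 2))
```

The functions assume non-zero denominators; a filter that flags nothing, for example, has no defined precision.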
![Page 8: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/8.jpg)
How they perform, out-of-the-box
[Figure: bar chart "Robot detection in OA IR systems", scale 0 to 1. Systems: DSpace; EPrints; Minho; Minho with monthly manual checking; No robot detection. Series: Recall; Precision; Negative precision (accuracy of download stats).]
![Page 9: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/9.jpg)
Room for improvement?
![Page 10: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/10.jpg)
1. Ability to manually check for outliers
• At UCD, once a month, we check:
  – Daily downloads for the last 2-4 months
  – Top 10 most downloaded items
  – Top 20 downloading IP addresses for the last 2-4 months
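The three monthly reports can be sketched as simple tallies over download events. This is an illustration only, not UCD's actual tooling: the event tuples, the sample data (documentation-range IP addresses) and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical download events: (day, ip_address, item_id).
events = [
    ("2016-06-01", "192.0.2.10", "item-1"),
    ("2016-06-01", "192.0.2.10", "item-2"),
    ("2016-06-01", "198.51.100.7", "item-1"),
    ("2016-06-02", "192.0.2.10", "item-1"),
    ("2016-06-02", "192.0.2.10", "item-3"),
    ("2016-06-02", "203.0.113.5", "item-1"),
]


def daily_downloads(events):
    """Downloads per day, in date order, for spotting sudden spikes."""
    return sorted(Counter(day for day, _, _ in events).items())


def top_items(events, n=10):
    """The n most downloaded items with their counts."""
    return Counter(item for _, _, item in events).most_common(n)


def top_ips(events, n=20):
    """The n IP addresses with the most downloads."""
    return Counter(ip for _, ip, _ in events).most_common(n)


print(daily_downloads(events))
print(top_items(events))
print(top_ips(events))
```

A person still has to look at the output and decide which entries are outliers; the sketch only produces the lists to review.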
![Page 11: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/11.jpg)
![Page 12: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/12.jpg)
![Page 13: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/13.jpg)
![Page 14: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/14.jpg)
![Page 15: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/15.jpg)
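[Figure: bar chart "Robots caught (Recall)", scale 0 to 1. Systems: DSpace; EPrints; Minho. Series: Out-of-the-box; With manual checking (outlier exclusion).]
[Figure: bar chart "Accuracy of reported download stats (Inverse precision)", scale 0 to 1. Systems: DSpace; EPrints; Minho; Without robot detection. Series: Out-of-the-box; With manual checking (outlier exclusion).]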
![Page 16: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/16.jpg)
2. Recalibrate the EPrints repeat-download (double-click) filter
[Figure: bar chart "Effect of double-click filter on EPrints’ robot detection and stats", scale 0 to 1. Series: Without double-click filter; With double-click filter (out-of-the-box); With recalibrated double-click filter*.]
$\frac{T_p + T_n}{n}$
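A repeat-download filter of this kind can be sketched as follows. This is not EPrints' code: the sliding-window rule, the function name and the window lengths are assumptions chosen for illustration, and the slides do not state the recalibrated value.

```python
def filter_repeat_downloads(events, window_seconds):
    """Keep a download only if the same IP has not requested the same
    item within the previous window_seconds.

    events: iterable of (timestamp_seconds, ip_address, item_id)
    Assumption: every request, kept or dropped, restarts the window.
    """
    last_seen = {}
    kept = []
    for ts, ip, item in sorted(events):
        key = (ip, item)
        previous = last_seen.get(key)
        if previous is None or ts - previous > window_seconds:
            kept.append((ts, ip, item))
        last_seen[key] = ts
    return kept


# Hypothetical events: one client re-requesting an item, one other client.
events = [
    (0, "192.0.2.10", "item-1"),
    (5, "192.0.2.10", "item-1"),
    (6, "198.51.100.7", "item-1"),
    (100, "192.0.2.10", "item-1"),
    (4000, "192.0.2.10", "item-1"),
]

# Two hypothetical calibrations of the same filter.
print(len(filter_repeat_downloads(events, window_seconds=30)))
print(len(filter_repeat_downloads(events, window_seconds=3600)))
```

A longer window discounts more repeat requests from the same address, which catches more robot traffic but also risks discarding genuine repeat downloads by humans.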
![Page 17: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/17.jpg)
3. Port Minho’s robot detection code (a log parser) onto DSpace or EPrints
• 1 Java class
• Input is Apache Combined Log Format
• Output is a database update (robot = true field)
  – Similar to EPrints' $is_robot variable in Robots.pm
  – Could be modified to update the DSpace 'isBot' field in the SOLR usage events document
• Requires 2 database tables to store learned agents and IPs
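The general shape of such a log parser can be sketched in a few lines. This is not Minho's Java class and does not reproduce its rules: the two detection heuristics (a request for robots.txt, a bot-like user agent string), the in-memory sets standing in for the two database tables, and all names and sample log lines are assumptions for illustration.

```python
import re

# Apache Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)
BOT_HINT = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def learn(lines):
    """First pass: collect IPs and agents that behave like robots."""
    ips, agents = set(), set()
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue
        parts = m["request"].split()
        path = parts[1] if len(parts) >= 2 else ""
        if path == "/robots.txt" or BOT_HINT.search(m["agent"]):
            ips.add(m["ip"])
            agents.add(m["agent"])
    return ips, agents


def is_robot(line, ips, agents):
    """Second pass: flag a request from a learned IP or learned agent."""
    m = LOG_PATTERN.match(line)
    return bool(m) and (m["ip"] in ips or m["agent"] in agents)


# Hypothetical log lines using documentation-range IP addresses.
sample = [
    '192.0.2.10 - - [14/Jun/2016:10:00:00 +0100] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleCrawler/1.0"',
    '192.0.2.10 - - [14/Jun/2016:10:00:05 +0100] "GET /bitstream/1/paper.pdf HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [14/Jun/2016:10:01:00 +0100] "GET /bitstream/1/paper.pdf HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11)"',
]
learned_ips, learned_agents = learn(sample)
for line in sample:
    print(is_robot(line, learned_ips, learned_agents), line.split()[0])
```

In a repository, the result of `is_robot` would be written back to the usage record rather than printed. The second sample line shows why learned IPs matter: the request carries a browser-like agent but comes from an address that has already fetched robots.txt.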
![Page 18: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/18.jpg)
[Figure: bar chart "Robots caught (Recall)", scale 0 to 1. Systems: DSpace; EPrints; Minho. Series: Out-of-the-box; With Minho log parser.]
[Figure: bar chart "Accuracy of reported download stats (Inverse precision)", scale 0 to 1. Systems: DSpace; EPrints; Minho; Without robot detection. Series: Out-of-the-box; With Minho log parser.]
![Page 19: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/19.jpg)
4. Combine two or more techniques
[Figure: bar chart "Robots caught (Recall)", scale 0 to 1. Systems: DSpace; EPrints; Minho. Series: Out-of-the-box; With manual checking (outlier exclusion); With recalibrated double-click filter*; With Minho log parser; With Minho and outliers; Minho, outliers, and recalibrated double-click*.]
![Page 20: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/20.jpg)
4. Combine two or more techniques
[Figure: bar chart "Accuracy of reported download stats (Inverse precision)", scale 0 to 1. Systems: DSpace; EPrints; Minho; Without robot detection. Series: Out-of-the-box; With manual checking (outlier exclusion); With recalibrated double-click filter*; With Minho log parser; With Minho and outliers; Minho, outliers, and recalibrated double-click*.]
![Page 21: #iCanHazRobot?: improved robot detection for IR usage statistics](https://reader034.fdocument.pub/reader034/viewer/2022051707/58ed09a51a28ab7e748b45c1/html5/thumbnails/21.jpg)
Thank you!