Measuring CDN performance and why you're doing it wrong
(Transcript)
Why this matters
• Performance is one of the main reasons we use a CDN
• Seems easy to measure, but isn't
• Performance is an easy way to comparison shop
• Nuanced
• Metric overload
Common mistakes
• Getting lost in data
• Focusing on one thing and one thing alone
– We like numbers!
• Forgetting about our applications
• Letting vendors influence us
– (I'm fully aware of the irony!)
Goals
• Share measurement experiences
• What does the measurement landscape look like
• Help guide towards a good testing plan
– Avoid pitfalls
CDNs: delivery platforms
• We won't talk about
– Extra stuff: security, GSLB, TLS offload, etc.
– We also won't talk about page-level optimizations
• We will focus on the delivery side
– Delivering HTTP objects
What we’ll be focusing on
• How we measure
• Metrics to measure
• What to measure
• Some gotchas, misconceptions, and common mistakes
Measurement techniques
• Pretend Users
– Synthetic tests
– Not actual users
• Real Users
– In the browser
– Actual users
Synthetic testing
• Usually a large network of test nodes all over the globe
• Highly scalable, can do lots of tests at once
• Many vendors have this model
– Examples: Catchpoint, Dynatrace (Gomez), Keynote, Pingdom, etc.
Synthetic testing
• Built to do full performance and availability testing
– Lots of "monitors" emulating what real users do
– DNS, Traceroute, Ping, Streaming, Mobile
– HTTP: object, browser, transactions/flows
• Tests set up with some frequency to repeatedly test things
– Aggregates reported
Backbone nodes
• Test machines sitting in datacenters all around the globe
• Terrible indicators of raw performance
– No latency
– Infinite bandwidth
• But really good at:
– Availability
– Scale
– Backend problems
– Global reach
Last mile nodes
• Test machines sitting behind a real, home-like internet connection
• Much better at reporting what you can expect from users, but sometimes unreliable
• Also not as dense in deployment
Synthetic testing
• Pros
– Geographic distribution
– Lots of options for testing
– Really good for on-the-spot troubleshooting
– Last-mile nodes can be pretty good proxies for performance
• Cons
– Not real users!
– Backbone nodes can be misleading
RUM
• Use JavaScript to collect timing metrics
• Can collect lots of things through browser APIs
– Page metrics, asset metrics, user-defined metrics
Use test assets
• Use this model to initiate tests in the browser
• Some vendors:
– Cedexis, TurboBytes, CloudHarmony, more…
– Usually, this isn't their business, but the data drives their main business objectives
• You can build this yourself too
Use real assets in the page
• Collect timings from actual objects
– Resource Timing
• Vendors
– SOASTA, New Relic, most synthetic vendors
– Boomerang (open source)
– Google Analytics User Timings
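To make the Resource Timing approach concrete, here is a minimal sketch of turning the browser's per-object entries into the phase metrics discussed later. The field arithmetic follows the W3C Resource Timing processing model; the helper is a plain function so it can also run on recorded entries, and the sample entry values below are made up.

```javascript
// Sketch: deriving per-object phase timings from a Resource Timing entry.
function timingBreakdown(e) {
  return {
    url: e.name,
    dns: e.domainLookupEnd - e.domainLookupStart,
    // connectEnd - connectStart includes the TLS handshake when
    // secureConnectionStart is non-zero
    connect: e.connectEnd - e.connectStart,
    ttfb: e.responseStart - e.requestStart,
    download: e.responseEnd - e.responseStart, // TTLB - TTFB
  };
}

// In a browser (cross-origin entries need a Timing-Allow-Origin header):
// const timings = performance.getEntriesByType("resource").map(timingBreakdown);
```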
DATA, DATA, DATA
• For either RUM technique, we need A LOT of data
• Too much variance otherwise
• Most vendors don't use averages
– Medians and percentiles
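To see why medians and percentiles are preferred over averages, a tiny illustration (the timing samples are made up):

```javascript
// A single stalled request drags the mean far from what a typical user saw;
// the median is robust to such outliers.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

const ttfbMs = [20, 22, 25, 24, 21, 23, 2000]; // one stalled request
const mean = ttfbMs.reduce((a, b) => a + b, 0) / ttfbMs.length;
// mean = 305 ms, but the median (p50) is 23 ms — much closer to the
// typical experience
```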
Real user measurements
• Pros
– Real users, real browsers, real-world conditions
– If you use your own content, could be close to what your users experience
– With enough data, great for granular analysis
• Cons
– We need a lot of data
– If you do it yourself, data infrastructures aren't trivial
Request timeline: DNS → TCP → (TLS) → TTFB → Download (TTLB-TTFB)
• DNS: RTT to DNS server, DNS iterations, DNS caching and TTLs
• TCP: RTT to cache server (CDN footprint & routing algorithms)
• (TLS): RTT to cache server (or several RTTs, depending on TLS False Start), efficiency of TLS engine
• TTFB: RTT to where the object is stored + storage efficiency (different for requests to origin); lower bound = network RTT
• Download (TTLB-TTFB): bandwidth, congestion avoidance algorithms (and RTT!)
Core object metrics
• Not every request experiences every metric:
– DNS: once per domain
– TCP/TLS: once per connection
– HTTP/Download: for every object (not already in browser cache)
• All techniques/tools measure and report these metrics
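These frequencies are why fixating on one phase can mislead. As a rough intuition aid, here is a toy cost model (made-up numbers, fully serialized, ignoring parallel connections and pipelining, so it is an upper bound, not a prediction):

```javascript
// Toy model: DNS is paid once per domain, the TCP handshake once per
// connection, and TTFB + download once per (uncached) object.
function pageCostMs({ domains, connections, objects, rttMs, ttfbMs, downloadMs }) {
  const dns = domains * rttMs;            // one lookup per domain
  const handshakes = connections * rttMs; // one TCP handshake each (TLS would add more)
  const fetches = objects * (ttfbMs + downloadMs);
  return dns + handshakes + fetches;
}

// Hypothetical page: 1 domain, 6 connections, 30 objects, 40 ms RTT.
// Object fetches dominate (1800 ms of 2080 ms), so shaving DNS time
// matters far less than shaving per-object TTFB.
const total = pageCostMs({
  domains: 1, connections: 6, objects: 30,
  rttMs: 40, ttfbMs: 50, downloadMs: 10,
});
```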
What the…???
• We always assume "all things equal"
• Too many factors affect page load time
– 3rd parties (sometimes varying), content from origin, layout, JS execution, etc.
• Too much variance
To be clear…
• Always use WebPageTest (or something like it) to understand your application's performance profile
• Continue to monitor application performance, and always spot check
• Be extremely careful when using it to gauge/compare CDN performance; it can mislead you
– If using RUM to measure page metrics, with lots of data, things become a little more meaningful (data volume handles variance)
Most commonly…
• Pick a "normal object"
– e.g. some object on the home page
• Set up testing from multiple places (usually with a synthetic vendor)
– And hopefully not backbone!
• Compare either overall load time, or some object metrics
Web application: objects
• The ratio of objects served from CDN cache vs. those coming from origin (through the CDN) should determine which objects to test
• If HTML is from origin, we must measure it – Essential to critical page metrics
Web application: metrics
• On any page
– DNS queries only happen a small number of times
– 6 TCP connections per domain
– Many, many, many HTTP fetches
• Core metrics
– TTFB
– Download (TTLB-TTFB), if large objects are important
– Should have a good idea of DNS/TCP/TLS, but less critical
Web application
• If CDN is only for static/cacheable objects:
– One or two representative assets
– TTFB and maybe download most important
[Diagram: Client → CDN Node]
Web application
• If CDN also delivers the whole site
– Sample of key HTML pages, delivered from origin
– TTFB will show efficiency of routing (and connection management) to origin
– TTLB will show efficiency of delivery
[Diagram: Client → CDN Node → Web Server]
[Diagram: Client → CDN Node → CDN Node → Web Server]
Software download: objects
• Pick a standard file that users will be downloading
– Representative file size
• Also pick something you expect to be on the CDN but not fetched all that often
– More on this later…
Software download: metrics
• Ratio of TCP-to-HTTP is closer to 1:1
– Especially if you have a dedicated download domain
– Could mean the same for DNS
• For large files, we care about download time
• Core metrics:
– TTFB (+ TCP and maybe DNS, if applicable)
– TTLB
– TTLB-TTFB will usually be a larger component
Download time
• Be careful about where you expect this download to happen in the lifetime of a TCP connection
• In the beginning of the connection
– Function of init_cwnd and TCP slow start (and RTT)
• Later in the connection
– Function of congestion avoidance and bandwidth
• Large files will experience both
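A toy slow-start calculation makes the point: assuming a 1460-byte MSS, an initial window of 10 segments (the common Linux default), window doubling each RTT, and no loss or congestion avoidance, we can count the round trips a transfer needs. This is a deliberate simplification, a lower bound, not a real TCP model:

```javascript
// How many round trips to deliver `bytes` under idealized slow start:
// cwnd starts at initCwnd segments and doubles every RTT.
function roundTripsToDeliver(bytes, { mss = 1460, initCwnd = 10 } = {}) {
  let delivered = 0;
  let cwnd = initCwnd;
  let rounds = 0;
  while (delivered < bytes) {
    delivered += cwnd * mss; // send a full window this round trip
    cwnd *= 2;               // slow start: exponential growth
    rounds += 1;
  }
  return rounds;
}

// A ~14 KB object fits in the initial window (1 round trip), while a
// 1 MB file needs several — so where in a connection's lifetime the
// download happens changes what you are actually measuring.
```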
Cache hit ratio
• Hits / (Hits + Misses), measured @ edge
vs.
• 1 − (Requests to Origin / Total Requests)
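The two formulas above need not agree: in a tiered CDN, an edge miss that a parent cache absorbs never reaches origin. A small worked example (all numbers made up):

```javascript
// Edge hit ratio: hits over all requests seen at the edge tier.
function edgeHitRatio({ edgeHits, edgeMisses }) {
  return edgeHits / (edgeHits + edgeMisses);
}

// Origin offload: fraction of requests that never hit origin.
function originOffload({ totalRequests, originRequests }) {
  return 1 - originRequests / totalRequests;
}

const stats = { edgeHits: 900, edgeMisses: 100, totalRequests: 1000, originRequests: 20 };
// edge hit ratio = 0.90, but origin offload = 0.98:
// 80 of the 100 edge misses were served by a parent cache tier.
```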
Offload vs. performance
Object categories: Popular / Medium Tail (1hr) / Long Tail (6hr)

            Connect (median)   Wait (median)
Popular     14 msec            19 msec    (77,000+ measurements)
1hr Tail    15 msec            26 msec    (38,000+ measurements)
6hr Tail    16 msec            32 msec    (6,400+ measurements)
The bigger picture
• It’s really easy to lock in on a metric
• Performance absolutely matters
• True performance isn't always as easy to measure as it seems
Key takeaways
• Everything is application-dependent
– Evaluate how your application works and what impacts performance the most
• Don’t get locked into a single number
• Always know your application performance and bottlenecks
• Be mindful of the bigger picture!