Dyn: Active/Active Failover with Cory von Wallenstein & Eric Rosenberry
Going Active/Active
Cory von Wallenstein Chief Technology Officer,
Dyn Inc. @cvonwallenstein
Eric Rosenberry Principal Infrastructure Architect,
iovation Inc. eric.rosenberry@iovation.com
@eprosenx
Introductions
Cory von Wallenstein Chief Technology Officer,
Dyn Inc. [email protected]
@cvonwallenstein
Eric Rosenberry Principal Infrastructure Architect,
iovation Inc. eric.rosenberry@iovation.com
@eprosenx
What Do We Mean By Active/Active?
• Active
• Passive
• Active/Passive
• Active/Active
What Are We Looking to Gain?
• High(er) availability
• Flexibility to change infrastructure without downtime
• Flexibility to expand infrastructure without four-walled limitations
• Disaster resilience
Active/Active FUD
• “It’s impossible!”
  – CAP theorem
  – WAN latency
• “It’s built into my database!”
  – NoSQL and WAN replication
• Reality is it’s somewhere in the middle, depending on what problem you’re trying to solve
http://www.flickr.com/photos/notaperfectpilot/8119088205/
“Wired people should know something about wires” - Neal Stephenson, quoted in Andrew Blum’s TED Talk “What is the Internet, Really?”
http://www.ted.com/talks/andrew_blum_what_is_the_internet_really.html
Paradigm Shift
• All system maintenance is done during business hours without impact
• All software upgrades are done during business hours
• Software upgrades do not require downtime, so code can be pushed to production more rapidly (more frequent, smaller iterations)
• Enable commodity hardware usage
The Four Questions You Need To Ask Before Embarking
1. What problem(s) am I attempting to solve?
2. How will I segment?
3. Where will I deploy?
4. How will this affect each part of my app?
Step One: Scope the Problem
• What are we replicating and why?
• How close to real time does it need to be?
  – Synchronous vs. asynchronous
• Think about this for each application tier, and set availability/distribution goals
Step One: Scope the Problem
• Example:
  – iovation end-user facing content services must be served using the closest GSLB-selected node, and each node must have N capacity (where N = our full overall global load), so overall we have more than 4N total capacity with all nodes online
  – iovation real-time API services require N+1 redundancy in each of our two Active/Active facilities, i.e. 2 * (N+1). This allows us to lose any server, plus a datacenter, and continue to function
  – Non-real-time API services (i.e. Admin Console) require 2N+ resiliency (i.e. one instance in each of our two Active/Active datacenters, with that instance running on an N+1 virtual cluster)
  – Some internal processes (i.e. Research Analytics) only require placement in one datacenter
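The redundancy targets above reduce to simple arithmetic. The following is a minimal sketch, assuming capacity is counted in servers and that the full load N needs a hypothetical four servers; the function names and the server count are illustrative and are not iovation's actual figures.

```python
def content_capacity(nodes, per_node=1.0):
    """Total capacity in multiples of N when each node can carry per_node * N."""
    return nodes * per_node


def survives(sites, servers_for_n, spare_per_site, lost_sites, lost_servers):
    """True if the surviving servers can still carry the full load N.

    Each site is built as N+1: servers_for_n servers plus spare_per_site spares.
    lost_servers are failures inside the sites that are still online.
    """
    remaining_sites = sites - lost_sites
    remaining = remaining_sites * (servers_for_n + spare_per_site) - lost_servers
    return remaining_sites > 0 and remaining >= servers_for_n


# Hypothetical sizing: N needs 4 servers, two sites, one spare each (2 * (N+1)).
print(survives(sites=2, servers_for_n=4, spare_per_site=1, lost_sites=1, lost_servers=1))
print(survives(sites=2, servers_for_n=4, spare_per_site=1, lost_sites=1, lost_servers=2))
# Four content nodes (the node count given later in the talk), each sized for N.
print(content_capacity(4))
```

The first call models losing a whole datacenter plus one more server, which is the failure the 2 * (N+1) target is meant to absorb; the second shows where that margin runs out.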
Step Two: How Will You Segment?
• Global Server Load Balancing with DNS
  – Round robin
  – Advanced load balancing
  – Active failover
  – Geographic
• Other strategies (out of scope for today):
  – Anycast
    • Challenges with TCP
  – HTTP redirection
    • Challenges with performance
  – BGP netblock-based failover
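To make the DNS-based policies above concrete, here is a minimal sketch of the decision a GSLB service makes when it picks which address to return. It is plain Python, not Dyn's API: the node names, regions, addresses (RFC 5737 documentation ranges), and the health map are all hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Node:
    name: str    # hypothetical site label
    ip: str      # documentation-range address, not a real endpoint
    region: str  # coarse geography used by the geographic policy


NODES = [
    Node("us-west", "192.0.2.10", "na"),
    Node("us-east", "192.0.2.20", "na"),
    Node("eu-west", "198.51.100.10", "eu"),
]

_rr = count()  # shared round-robin counter


def round_robin(nodes, healthy):
    """Rotate through the healthy nodes in order."""
    live = [n for n in nodes if healthy.get(n.name, False)]
    if not live:
        raise RuntimeError("no healthy nodes")
    return live[next(_rr) % len(live)]


def active_failover(nodes, healthy):
    """Always answer with the first healthy node in priority order."""
    for n in nodes:
        if healthy.get(n.name, False):
            return n
    raise RuntimeError("no healthy nodes")


def geographic(nodes, healthy, client_region):
    """Prefer a healthy node in the client's region, else use failover order."""
    for n in nodes:
        if n.region == client_region and healthy.get(n.name, False):
            return n
    return active_failover(nodes, healthy)


health = {"us-west": True, "us-east": True, "eu-west": False}
print(geographic(NODES, health, "eu").name)  # eu-west is down, so falls back
```

A real service would also have to account for resolver caching and TTLs, which this sketch ignores.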
Step Three: Where Will You Deploy?
• Going from 1 to N
• Where are you thinking?
  – What are your current datacenter assets and how can they be leveraged?
• And for what reasons?
  – Disaster resilience
  – Get closer to users
  – Room to grow
Disaster Resilience
http://maps.google.com
http://www.cogentco.com/files/images/network/network_map/networkmap_global_large.png
Speed of light 299,792.458 km/second
(in a vacuum)
Theoretical RTT ~40ms
Real RTT ~90ms
Speed of Light
• Things don’t work as well at 90ms RTT latency as they do at 9ms RTT latency
• Where can you go to get out of the way of a disaster but not create latency headaches?
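The propagation floor behind these RTT figures can be estimated directly from distance. This is a minimal sketch: the slides do not say which route produced the ~40ms and ~90ms numbers, so the path lengths below are hypothetical, and the refractive index of 1.47 is an assumed typical value for silica fiber.

```python
C_VACUUM_KM_S = 299_792.458  # speed of light in a vacuum, as quoted on the slide
FIBER_INDEX = 1.47           # assumption: typical refractive index of silica fiber


def rtt_ms(path_km, refractive_index=1.0):
    """Round-trip propagation delay in milliseconds over a path of path_km."""
    speed_km_s = C_VACUUM_KM_S / refractive_index
    return 2 * path_km / speed_km_s * 1000


# Hypothetical path lengths: a nearby metro pair and a cross-continent route.
for km in (280, 4000):
    print(f"{km} km: vacuum {rtt_ms(km):.1f} ms, fiber {rtt_ms(km, FIBER_INDEX):.1f} ms")
```

Measured RTT is higher than either figure because fiber routes rarely follow a straight line and every router and repeater on the path adds delay.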
http://www.globaldatavault.com/natural-disaster-threat-maps.htm
Implications on Selection
http://soladrive.com/images/level3-map-large.png
Where The Fiber Actually Goes
Disaster Resilience: Local Failures
http://www.datacenterknowledge.com/archives/2012/07/09/outages-surviving-electric-squirrels-ups-failures/
“A frying squirrel took out half of our Santa Clara data center two years back,” - Mike Christian, Yahoo
Local Failures
http://blog.level3.com/level-3-network/the-10-most-bizarre-and-annoying-causes-of-fiber-cuts/
“Squirrel chews account for a whopping 17% of our damages so far this year! But let me add that it is down from 28% just last year and it continues to decrease since we added cable guards to our plant.” - Fred Lawler, Level(3)
Get closer to users
http://www.akamai.com/html/technology/dataviz1.html
“Sorry, we’re full”
http://www.theregister.co.uk/2010/10/12/capgemini_merlin_data_center/
Step Three: Where Will You Deploy?
• Don’t just assume vastly different geographic areas
• How far do you need to go to get out of the same disaster zone?
  – What kinds of disasters happen in your area?
  – What geographic barriers are there?
  – Can you drive it in an emergency?
http://www.zayo.com/sites/default/files/images/Zayo-US-Network-EXTERNAL-11-1-2012.kmz
Portland to Seattle
Step Four: Think Through Your Apps
• How will these different pieces of the architecture behave with increased latency between them?
• Can you avoid real-time calls across the WAN?
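One way to reason about these two questions is to count how many sequential round trips a request makes across the WAN. A minimal sketch, assuming each call waits for the previous one to return; the call count of five is hypothetical, and the 9ms and 90ms figures are the ones quoted earlier in the talk.

```python
def wan_overhead_ms(sequential_calls, rtt_ms):
    """Latency added to one request by sequential calls that each cross the WAN."""
    return sequential_calls * rtt_ms


# Hypothetical request that makes five dependent calls to a remote tier.
for rtt in (9, 90):
    print(f"RTT {rtt} ms: 5 sequential calls add {wan_overhead_ms(5, rtt)} ms")
```

The overhead scales linearly with both factors, which is why keeping real-time calls inside a single datacenter matters more as the sites move further apart.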
Step Four: Think Through Your Apps
• Examples from iovation:
  – Web Device Print code is served from four global nodes using GSLB
    • via Dyn Traffic Management
    • Was our first Active/Active application
  – Real-time API responses are served Active/Active between Portland and Seattle
    • 50% of the time our API URL returns the PDX IP, and 50% of the time it returns the SEA IP
    • Real-time queries are handled locally within a single DC
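The 50/50 split between Portland and Seattle can be simulated in a few lines. This is a sketch under stated assumptions, not iovation's or Dyn's implementation: the addresses are RFC 5737 documentation values, the random choice stands in for whatever balancing the DNS service applies, and the fallback to a single site when the other is unhealthy is assumed behavior.

```python
import random

# Documentation-range addresses standing in for the real site IPs.
SITES = {"PDX": "192.0.2.1", "SEA": "198.51.100.1"}


def resolve_api(healthy, rng):
    """Return one site's IP, split evenly across whichever sites are healthy."""
    live = [site for site in SITES if healthy.get(site, False)]
    if not live:
        raise RuntimeError("no healthy site")
    return SITES[rng.choice(live)]


rng = random.Random(0)  # seeded so repeated runs give the same sequence
both_up = {"PDX": True, "SEA": True}
answers = [resolve_api(both_up, rng) for _ in range(10_000)]
share_pdx = answers.count(SITES["PDX"]) / len(answers)
print(f"PDX share with both sites up: {share_pdx:.2%}")

# With Portland marked unhealthy, every answer is Seattle.
print(resolve_api({"PDX": False, "SEA": True}, rng))
```

Because each query is then handled entirely within the site that answered, the split happens once at resolution time and no real-time call needs to cross between the two datacenters.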
Summary
• Top takeaways
  – Active/Active is a paradigm shift
  – It is achievable
  – Choose your locations carefully
    • Network is a primary selection criterion
    • How far do you really need to go?
  – Analyze each application tier’s constraints carefully
  – Start with low-hanging fruit
What iovation Does
Identify and re-recognize devices connecting to your business sites
Associate groups of devices that would otherwise appear unrelated
Assess real-time risk through business rules including velocity, anomaly, proxy use, etc.
Questions?
Cory von Wallenstein Chief Technology Officer,
Dyn Inc. [email protected]
@cvonwallenstein
Eric Rosenberry Principal Infrastructure Architect,
iovation Inc. eric.rosenberry@iovation.com
@eprosenx