#Surgeconf Scaling Twitter to go After the Fail Whale
Transcript of #Surgeconf Scaling Twitter to go After the Fail Whale
![Page 1: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/1.jpg)
Scaling Twitter To Go After the Fail Whale
Jonathan Reichhold - Twitter Engineering
![Page 2: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/2.jpg)
Early Twitter....
![Page 3: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/3.jpg)
2010 World Cup Challenge
•Tweet and user requests growing exponentially (good problem)
![Page 4: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/4.jpg)
Load....
![Page 5: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/5.jpg)
Monolithic Architecture
•Ruby on Rails
•Temporally-sharded MySQL
•Memcached
•~60 engineers
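As a rough illustration of what "temporally-sharded" means, the sketch below routes a row to a shard by its creation time. The shard names and boundary dates are hypothetical and are not taken from the talk; only the idea of sharding by time comes from the slide.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical boundaries: each shard holds rows created on or after its
# start date and before the next shard's start date.
SHARD_STARTS = [
    datetime(2008, 1, 1),
    datetime(2009, 1, 1),
    datetime(2010, 1, 1),
]
SHARD_NAMES = ["tweets_2008", "tweets_2009", "tweets_2010"]


def shard_for(created_at: datetime) -> str:
    """Return the name of the shard whose time range contains created_at."""
    index = bisect_right(SHARD_STARTS, created_at) - 1
    if index < 0:
        raise ValueError("timestamp precedes the oldest shard")
    return SHARD_NAMES[index]
```

In this sketch every new write lands on the most recent shard, so write load concentrates there while older shards sit comparatively idle.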
![Page 6: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/6.jpg)
Stabilize & Understand
•Learn & make improvements
•Don’t just survive
![Page 7: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/7.jpg)
Be Realistic & Ambitious
•Prioritize what can be fixed and timeframes for doing it
•Sometimes need the duct tape
•Find patterns and improvements for the long term
![Page 8: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/8.jpg)
A Bad Approach
•Flip switches/branches/other until fixed
http://www.flickr.com/photos/chrism70/1144424032
![Page 9: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/9.jpg)
Science
![Page 10: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/10.jpg)
Step 1: Trustworthy Data
• https://blog.twitter.com/2013/observability-at-twitter
![Page 11: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/11.jpg)
Step 2: Set Expectations
•Being on-call is a job, and during high-stress periods it will burn folks out
•Maintain calm and order
![Page 12: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/12.jpg)
Post Mortems
•Improvement becomes part of process
•Stress makes the system stronger, not weaker
![Page 13: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/13.jpg)
Teamwork
•All of this made possible by amazing team and management
•Culture
![Page 14: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/14.jpg)
Capacity Planning & Forecast
•Just in time but realistic
•Figure out real buffers
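One way to make "real buffers" concrete is to compute host counts with an explicit headroom fraction. This is a minimal sketch under assumed inputs; the function name, the parameters, and the example numbers are hypothetical and do not come from the talk.

```python
import math


def hosts_needed(peak_rps: float, rps_per_host: float, headroom: float) -> int:
    """Hosts required to serve peak_rps while keeping `headroom` of each host free.

    headroom is a fraction in [0, 1): 0.5 means each host is planned to run
    at no more than half of its measured capacity at peak.
    """
    if rps_per_host <= 0:
        raise ValueError("rps_per_host must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_rps = rps_per_host * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)
```

For example, `hosts_needed(10_000, 250, 0.5)` plans for 125 usable RPS per host and returns 80, twice the 40 hosts that a zero-buffer plan would give.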
![Page 15: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/15.jpg)
Longer Term Changes
•Architecture changes take time and changes in organization
![Page 16: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/16.jpg)
Improve Efficiency
•Rails/Ruby -> Scala & JVM
•200-300 RPS -> 10,000-20,000
•Single process per request -> Finagle
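The throughput figures on this slide imply a wide range of possible speedups. The short calculation below works out the bounds; the slide does not say what unit of deployment the RPS figures refer to, so the result is only a ratio between the quoted ranges.

```python
def speedup_range(old_rps: tuple, new_rps: tuple) -> tuple:
    """Return (smallest, largest) speedup implied by two (low, high) RPS ranges."""
    old_low, old_high = old_rps
    new_low, new_high = new_rps
    smallest = new_low / old_high   # worst new figure vs best old figure
    largest = new_high / old_low    # best new figure vs worst old figure
    return smallest, largest


# Ranges quoted on the slide.
smallest, largest = speedup_range((200, 300), (10_000, 20_000))
print(f"between {smallest:.1f}x and {largest:.1f}x")
```

With the slide's numbers this is roughly a 33x to 100x improvement.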
![Page 17: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/17.jpg)
Service Orientation
•Make changes at interface boundary, not in single monolith
•Team interactions simplified
•Core nouns and verbs
![Page 18: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/18.jpg)
Move out of public cloud
•Flexibility and latency demand it at some point
•Hard problem
•Datacenter as failure domain
•Mesos
![Page 19: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/19.jpg)
Dynamic Configuration
•Update routes and compare live vs dark/new
•Quickly adjust to issues
•Faster and less fragile deploys
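The "compare live vs dark/new" idea can be sketched as a router that always answers from the live path and, when a runtime flag is on, also calls the new path and records any differences. The class, the flag name `dark_enabled`, and the handler signature are hypothetical; the talk does not describe how Twitter implemented this.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str], str]


class DarkRouter:
    """Serve from the live handler; optionally shadow requests to a dark one."""

    def __init__(self, live: Handler, dark: Handler) -> None:
        self.live = live
        self.dark = dark
        # Runtime-adjustable settings, changed without a redeploy.
        self.config = {"dark_enabled": False}
        self.mismatches: List[Tuple[str, str]] = []

    def update_config(self, **changes: bool) -> None:
        """Apply a configuration change while the router keeps serving."""
        self.config.update(changes)

    def handle(self, request: str) -> str:
        live_response = self.live(request)
        if self.config["dark_enabled"]:
            try:
                dark_response = self.dark(request)
            except Exception as error:
                # A failing dark path must never affect the caller.
                self.mismatches.append((request, f"dark path raised {error!r}"))
            else:
                if dark_response != live_response:
                    self.mismatches.append((request, dark_response))
        return live_response
```

Because callers only ever see the live response, the dark path can be switched on, observed, and switched off again by a configuration change alone.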
![Page 20: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/20.jpg)
Improve storage
•Gizzard for MySQL
•Improve Memcached
•Storage as a service
•Snowflake IDs
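The slide names Snowflake IDs without describing them. The sketch below follows the publicly documented layout of the open-source Snowflake project (a 41-bit millisecond timestamp, 10 bits of machine identity, and a 12-bit per-millisecond sequence); treat that layout, the epoch constant, and the class name as assumptions for illustration, not as the production implementation.

```python
import time
from typing import Callable

EPOCH_MS = 1288834974657   # custom epoch assumed from the open-source project
WORKER_BITS = 10           # machine identity (datacenter + worker combined)
SEQUENCE_BITS = 12         # counter within a single millisecond
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1


class SnowflakeSketch:
    """Generate roughly time-ordered 64-bit IDs without central coordination."""

    def __init__(self, worker_id: int,
                 clock_ms: Callable[[], int] = lambda: int(time.time() * 1000)) -> None:
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock_ms = clock_ms
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        now = self.clock_ms()
        if now < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & SEQUENCE_MASK
            if self.sequence == 0:
                # Sequence exhausted for this millisecond: wait for the next one.
                while now <= self.last_ms:
                    now = self.clock_ms()
        else:
            self.sequence = 0
        self.last_ms = now
        timestamp = now - EPOCH_MS
        return (timestamp << (WORKER_BITS + SEQUENCE_BITS)) \
            | (self.worker_id << SEQUENCE_BITS) | self.sequence
```

Since the timestamp occupies the high bits, IDs from any worker sort approximately by creation time, and each worker can issue them without talking to a database or to other workers.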
![Page 21: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/21.jpg)
Development Speed
•Startups live and die by development speed
•Make it easier to ship, but contain the damage
![Page 22: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/22.jpg)
Conclusion
•Fail whale is now an endangered species
•Went from event-driven spikes to pushing continuous reliability improvements, to the point where events became trivial
![Page 23: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/23.jpg)
Tweet Spikes Today
• New Tweets per second (TPS) record: 143,199 TPS. Typical day: more than 500 million Tweets sent; average 5,700 TPS. (August 2 at 7:21:50 PDT; August 3 at 11:21:50 JST)
• https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
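The numbers on this slide can be cross-checked with simple arithmetic. The calculation below uses only the figures quoted above; both the daily total and the average are rounded on the slide, so the results are approximate.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_tweets = 500_000_000   # "more than 500 million Tweets" per day
average_tps = 5_700          # quoted average
record_tps = 143_199         # quoted record

# Average rate implied by the daily total.
implied_average = daily_tweets / SECONDS_PER_DAY

# How far the record spike sits above the quoted average.
peak_to_average = record_tps / average_tps

print(f"implied average: {implied_average:.0f} TPS")
print(f"record is about {peak_to_average:.1f}x the average")
```

The daily total works out to roughly 5,787 TPS, in line with the quoted average, and the record spike is about 25 times that average.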
![Page 24: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/24.jpg)
Final Thoughts
•Marathon not a sprint. Maintain systems and yourself
•We are hiring to make the system even better
![Page 25: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/25.jpg)
Endangered: Fail Whale
Jonathan Reichhold
@jreichhold
![Page 26: #Surgeconf Scaling Twitter to go After the Fail Whale](https://reader035.fdocument.pub/reader035/viewer/2022070303/54b74b404a7959ef448b4601/html5/thumbnails/26.jpg)
Questions?
•https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
•https://blog.twitter.com/2013/observability-at-twitter