Datadog + VictorOps Webinar
Transcript of Datadog + VictorOps Webinar
![Page 1: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/1.jpg)
Do’s & Don’ts of post-incident analysis
![Page 2: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/2.jpg)
Jason Hand
DevOps Evangelist
DevOps, Dogs, Horses, and Mountain Living
Twitter: @jasonhand
VictorOps
Incident management & notifications
Makes on-call suck less!
Twitter: @victorops
![Page 3: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/3.jpg)
Jason Yee
Technical writer/evangelist
Travel hacker & Chef
Twitter: @gitbisect
Datadog
SaaS-based full stack monitoring
Over a trillion data points per day
Twitter: @datadoghq
![Page 4: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/4.jpg)
Agenda
Service Disruptions
Detection
Diagnosis
Post-incident analysis
Framework
![Page 5: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/5.jpg)
Follow & Share on Twitter
#VOWebinar
@gitbisect @jasonhand
@datadogHQ @VictorOps
![Page 6: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/6.jpg)
Service Disruptions
Are a reality in ALL complex systems
There is no such thing as being soooo good, you’ll never fail
![Page 7: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/7.jpg)
Complex Systems
● Diversity
● Interdependent
● Adaptive
● Connectedness
(i.e. we can be connected but not dependent on each other)
![Page 8: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/8.jpg)
Cynefin Framework
● Obvious - cause & effect is obvious to all
● Complicated - cause & effect requires analysis or expert knowledge
● Complex - cause & effect can only be perceived in retrospect
● Chaotic - no relationship between cause & effect
Cynefin diagram by Dave Snowden CC BY-SA 3.0
![Page 9: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/9.jpg)
Contributing Factors
Systems Thinking: an understanding of a
system by examining the linkages and
interactions between the components that
comprise the entirety of that defined system
![Page 10: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/10.jpg)
MTTR vs MTBF
Mean Time To Repair
vs
Mean Time Between Failures
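As a rough illustration of the two measures, the sketch below computes both from a list of incidents. The incident timestamps are invented for the example, and MTBF is taken here as the average uptime between the end of one failure and the start of the next, which is one common definition rather than the only one.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (outage start, service restored).
incidents = [
    (datetime(2016, 10, 1, 9, 0), datetime(2016, 10, 1, 9, 30)),
    (datetime(2016, 10, 8, 14, 0), datetime(2016, 10, 8, 15, 0)),
    (datetime(2016, 10, 20, 2, 0), datetime(2016, 10, 20, 2, 15)),
]


def mean_time_to_repair(incidents):
    """Average duration from the start of a failure to restoration."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mean_time_between_failures(incidents):
    """Average uptime between the end of one failure and the start of the next."""
    ordered = sorted(incidents)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


mttr = mean_time_to_repair(incidents)
mtbf = mean_time_between_failures(incidents)
print("MTTR:", mttr)
print("MTBF:", mtbf)
```

If failures are treated as inevitable in complex systems, MTTR is the number a team can most directly influence through detection, diagnosis, and response.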
![Page 11: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/11.jpg)
Detection
![Page 12: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/12.jpg)
Collecting data is cheap
Not having it when you need it can be expensive
![Page 13: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/13.jpg)
![Page 14: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/14.jpg)
![Page 15: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/15.jpg)
![Page 16: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/16.jpg)
![Page 17: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/17.jpg)
4 qualities of good metrics
Not all metrics are created equal
![Page 18: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/18.jpg)
1. Well understood
![Page 19: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/19.jpg)
2. Granular
![Page 20: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/20.jpg)
3. Tagged & filterable
![Page 21: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/21.jpg)
4. Long-lived
![Page 22: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/22.jpg)
![Page 23: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/23.jpg)
Diagnosis
![Page 24: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/24.jpg)
Real-time Notification
![Page 25: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/25.jpg)
Getting “the right” Humans Involved
Paging has evolved to: Smart & Actionable alerts ...
Routed to the right teams and people …
With valuable context
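One way to picture "routed to the right teams, with valuable context" is a small lookup that picks an owning team and attaches a runbook link before anyone is paged. This is only a sketch: the routing table, team names, and runbook URL are hypothetical, and it does not represent the VictorOps or Datadog APIs.

```python
# Hypothetical routing table: which on-call team owns which service.
ROUTES = {"checkout": "payments-oncall", "search": "search-oncall"}
DEFAULT_TEAM = "platform-oncall"

# Hypothetical runbook links attached as context for the responder.
RUNBOOKS = {"checkout": "https://wiki.example.com/runbooks/checkout"}


def route_alert(alert):
    """Pick the owning team and attach context without mutating the input."""
    team = ROUTES.get(alert["service"], DEFAULT_TEAM)
    enriched = dict(alert)
    enriched["team"] = team
    if alert["service"] in RUNBOOKS:
        enriched["runbook"] = RUNBOOKS[alert["service"]]
    return enriched


page = route_alert({"service": "checkout", "message": "Error rate above 5%"})
fallback = route_alert({"service": "billing", "message": "Disk almost full"})
print(page)
print(fallback)
```

Services without an explicit owner fall back to a default team, so no alert is left unrouted.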
![Page 26: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/26.jpg)
Graphs, Logs, Runbooks
![Page 27: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/27.jpg)
Automation
![Page 28: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/28.jpg)
ChatOps
jhand.co/chatopsbook
![Page 29: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/29.jpg)
The Full Incident Lifecycle
![Page 30: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/30.jpg)
What we are really here to learn about...
![Page 31: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/31.jpg)
Post-incident Analysis
(a.k.a. learning review, postmortem)
![Page 32: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/32.jpg)
Do: Establish that we are here to learn
The primary objective of these exercises is to learn
![Page 33: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/33.jpg)
Do: Establish timeline of events
Identify when anomaly was first detected, first responders, SMEs pulled in to assist, conversations, commands, etc.
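A timeline is usually assembled from several sources at once: monitor alerts, chat history, and commands that were run. The sketch below merges such events into one chronological list. The events and source names are invented for illustration.

```python
from datetime import datetime

# Hypothetical events gathered from different sources: (timestamp, source, description).
alerts = [("2016-10-08T14:00:00", "monitor", "Error-rate monitor triggered")]
chat = [
    ("2016-10-08T14:04:00", "chat", "First responder acknowledged the page"),
    ("2016-10-08T14:20:00", "chat", "Database SME pulled in to assist"),
]
commands = [("2016-10-08T14:35:00", "shell", "Rolled back the latest deploy")]


def build_timeline(*sources):
    """Merge events from every source and order them by timestamp."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: datetime.fromisoformat(event[0]))


timeline = build_timeline(chat, commands, alerts)
for timestamp, source, description in timeline:
    print(f"{timestamp}  [{source}] {description}")
```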
![Page 34: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/34.jpg)
Don’t: Hijack the Discussion
Having an objective moderator run the exercise can help prevent one person (or small group) from steamrolling the conversation and avoid “groupthink”
![Page 35: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/35.jpg)
Do: Describe What Happened
Gather a detailed account of what happened from team members. What services, components, etc. were affected? Include how customers were impacted
i.e. Accountability
![Page 36: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/36.jpg)
Don’t: Explain What Happened
Explaining often leads to a less-than-objective understanding of what took place, as well as finger-pointing and blame
![Page 37: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/37.jpg)
Do: Ask “How” Things Happened
Understand in great detail “how” things happened including multiple contributing factors
![Page 38: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/38.jpg)
Don’t: Ask “Why” Things Happened
Asking “why” often contains bias and leads to blame
“Why” .. brings us to the very mysterious incentives we have in the workplace.
“How” brings us to the conditions that allowed the event to take place to begin with. - John Allspaw (CTO Etsy)
![Page 39: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/39.jpg)
Do: Understand Contributing Factors
Use Systems Thinking to see more holistically
![Page 40: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/40.jpg)
“Cause is not something found in the rubble. Cause is created in the minds of the investigators” - Sidney Dekker
![Page 41: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/41.jpg)
Don’t: Focus on a ‘Root Cause’
Rather than focusing on the ‘Root Cause’ of a service disruption, understand all of the contributing factors.
![Page 42: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/42.jpg)
Newtonian thinking … Why some still seek a root cause
We’ve created the idea that a single cause has an equal and opposite effect
In complex systems .. it doesn’t
● Humans adapt to the work they have
● Root Cause analysis ONLY works in SIMPLE systems
● Root Cause Analysis = Retrospective Cover of Ass
![Page 43: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/43.jpg)
Do: Watch For Bias
We are easily susceptible to cognitive biases such as confirmation, hindsight, anchoring, outcome, and availability
![Page 44: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/44.jpg)
Don’t: Blame Humans
Humans are only a part of the problem and response, never a contributing factor in issues
![Page 45: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/45.jpg)
Do: Include What Went Well
Much can be learned from what worked during the response to a service disruption. Capture and discuss what efforts actually went well.
![Page 46: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/46.jpg)
Don’t: Hide What Happened
Customers and end-users are savvy. Being transparent about what took place and what was done will help build trust
![Page 47: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/47.jpg)
Do: Conduct Analysis Soon
Gather the team and conduct the post-incident analysis as soon as everyone is rested
![Page 48: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/48.jpg)
Don’t: Wait longer than 48 hours
The more time passes, the less accurate accounts of what took place will be
![Page 49: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/49.jpg)
Do: Assign Action Items
Look for small incremental improvements to take action on. Each improvement item should be assigned an owner and tracked for follow up
![Page 50: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/50.jpg)
Don’t: Debate Without Action
Don’t allow for extended debate on action items. Place ideas into a “parking lot” for later action but come up with at least one action item to be implemented immediately
![Page 51: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/51.jpg)
Do: Hear from everyone
To fully understand the disruption and response you want to hear from all parties involved. Everyone’s experience was different. The more voices you hear from, the more accurate the story and timeline become.
![Page 52: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/52.jpg)
Do: Encourage Many Possible Improvements
We are looking for many possible areas for incremental improvements to our systems, processes, tools, incident response, and team members. Encourage people to build on top of existing ideas in addition to posing alternatives.
![Page 53: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/53.jpg)
Don’t: Overpromise or Overcommit
We are looking for ideas, not binding commitments. This helps to make sure you get suggestions from a wide group
![Page 54: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/54.jpg)
Do: Archive Your Postmortem
Save and store your postmortem where it is available to everyone internally for future review or as assistance during future similar incidents
![Page 55: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/55.jpg)
Do: Rinse & Repeat
Be disciplined in your post-incident analysis exercises and perform them for all incidents regardless of the severity. Practice makes perfect and these will become more efficient and useful over time
![Page 56: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/56.jpg)
Framework
![Page 57: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/57.jpg)
Post-incident analysis framework
1. Summary: what happened?
2. How was the incident detected?
3. How did we respond?
4. How did it happen?
5. How can we improve?
![Page 58: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/58.jpg)
Summary: what happened?
● Impact on customers
● Severity of the incident
● Components affected
● What ultimately resolved the incident?
● Externally shared information
![Page 59: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/59.jpg)
How was the incident detected?
● Did we have a metric that showed the incident?
● Was there a monitor/alerting on that metric?
● How long did it take to declare an incident?
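The "how long did it take" question can be answered from three timestamps pulled from the timeline. A minimal sketch, assuming the start of the disruption, the first alert, and the moment the incident was declared are all known; the function name and the example times are hypothetical.

```python
from datetime import datetime


def detection_summary(started, first_alert, declared):
    """Return how long detection and declaration took, in minutes."""
    return {
        "minutes_to_detect": (first_alert - started).total_seconds() / 60,
        "minutes_to_declare": (declared - started).total_seconds() / 60,
    }


summary = detection_summary(
    started=datetime(2016, 10, 8, 14, 0),
    first_alert=datetime(2016, 10, 8, 14, 6),
    declared=datetime(2016, 10, 8, 14, 15),
)
print(summary)
```

A large gap between the two numbers suggests the alert fired but did not lead quickly to a declared incident, which is worth discussing in the review.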
![Page 60: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/60.jpg)
How did we respond?
● Who was involved?
● ChatOps archive links
● Timeline of events
● What went well?
● What didn’t go so well?
![Page 61: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/61.jpg)
How did it happen?
● Technical deep-dive
● Include context
● Identify contributing factors
● Ask “How,” not “Why”
![Page 62: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/62.jpg)
How can we improve?
● Now (immediate actions)
● Next (in current or following sprint)
● Later (after the next sprint)
● Follow up notes
● Ensure all items are actionable and tracked
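The Now / Next / Later buckets and the "actionable and tracked" check can be sketched as a small data structure. The action items, owners, and ticket IDs below are invented for illustration, and the field names are assumptions rather than part of the framework template.

```python
BUCKETS = ("now", "next", "later")

# Hypothetical action items; owners and ticket IDs are made up for illustration.
action_items = [
    {"title": "Add alert on queue depth", "bucket": "now", "owner": "alice", "ticket": "OPS-101"},
    {"title": "Document failover runbook", "bucket": "next", "owner": "bob", "ticket": "OPS-102"},
    {"title": "Evaluate connection pooling", "bucket": "later", "owner": None, "ticket": None},
]


def group_by_bucket(items):
    """Group item titles under Now / Next / Later, keeping empty buckets visible."""
    grouped = {bucket: [] for bucket in BUCKETS}
    for item in items:
        grouped[item["bucket"]].append(item["title"])
    return grouped


def untracked(items):
    """Return titles of items that lack an owner or a tracking ticket."""
    return [item["title"] for item in items if not item["owner"] or not item["ticket"]]


plan = group_by_bucket(action_items)
needs_follow_up = untracked(action_items)
print(plan)
print("Missing owner or ticket:", needs_follow_up)
```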
![Page 63: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/63.jpg)
Summary:
![Page 64: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/64.jpg)
Resources
● Post-incident analysis framework/template
○ http://bit.ly/2dxDIT3
● Blameless postmortems & a just culture - John Allspaw
○ https://codeascraft.com/2012/05/22/blameless-postmortems/
● The infinite hows - John Allspaw
○ http://www.kitchensoap.com/2014/11/14/the-infinite-hows-or-the-dangers-of-the-five-whys/
● The human side of postmortems - Dave Zwieback
○ http://www.oreilly.com/webops-perf/free/the-human-side-of-postmortems.csp
● Writing your first postmortem - Mathias Lafeldt
○ https://medium.com/production-ready/writing-your-first-postmortem-8053c678b90f
![Page 65: Datadog + VictorOps Webinar](https://reader031.fdocument.pub/reader031/viewer/2022021918/58a4e12d1a28ab34318b6b05/html5/thumbnails/65.jpg)
Q&A
Do: Start a free trial
https://app.datadoghq.com/signup
https://victorops.com/start-free-trial