Event-driven automation, DevOps way: Automation in the IoT era, what is the reality?

Post on 08-Apr-2017


StackStorm

If-This-Then-That for DevOps automation

© 2016 BROCADE COMMUNICATIONS SYSTEMS, INC.

@StackStorm @Stack_Storm

Agenda

What is StackStorm

Why we made it

Use cases

Why is it better than …

StackStorm: What is it?


StackStorm is like …

IFTTT, for DevOps

cat /opt/stackstorm/packs/st2-demos/rules/demo.yaml
---
name: "sensu_crit_to_slack"
pack: "st2-demos"
description: "Post all critical alerts to the demo …"
enabled: true
trigger:
  type: "sensu.event_handler"
criteria:
  trigger.check.status:
    pattern: 2
    type: "equals"
  trigger.check.name:
    pattern: "demo_.*"
    type: "matchregex"
action:
  ref: "slack.post_message"
  parameters:
    message: >
      [ALERT]
      {{trigger.client.name}}
      {{trigger.check.output}}
    channel: "#demos"
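To make the rule above concrete, here is a minimal Python sketch of how its two criteria could be checked against an incoming event. It is an illustration only, not StackStorm's rules engine: the `OPERATORS` table, `lookup`, `rule_matches`, `make_event` and `render_message` are hypothetical helpers, and the sample payload (host `web301`, output `disk 95%`) is invented for the example.

```python
import re

# Hypothetical, simplified stand-ins for the two criteria operators used in the rule.
OPERATORS = {
    "equals": lambda value, pattern: value == pattern,
    "matchregex": lambda value, pattern: re.match(pattern, str(value)) is not None,
}

# The criteria section of the rule, written as a Python dict.
CRITERIA = {
    "trigger.check.status": {"pattern": 2, "type": "equals"},
    "trigger.check.name": {"pattern": "demo_.*", "type": "matchregex"},
}


def lookup(payload, dotted_key):
    """Resolve a key such as 'trigger.check.status' inside a nested dict."""
    node = payload
    for part in dotted_key.split("."):
        node = node[part]
    return node


def rule_matches(criteria, payload):
    """Return True only if every criterion holds for the payload."""
    return all(
        OPERATORS[spec["type"]](lookup(payload, key), spec["pattern"])
        for key, spec in criteria.items()
    )


def make_event(check_name, status):
    """Build an invented sample event shaped like the fields the rule reads."""
    return {
        "trigger": {
            "client": {"name": "web301"},
            "check": {"name": check_name, "status": status, "output": "disk 95%"},
        }
    }


def render_message(payload):
    """Fill in the message the rule would hand to the Slack action."""
    trigger = payload["trigger"]
    return "[ALERT] {} {}".format(trigger["client"]["name"], trigger["check"]["output"])
```

With these helpers, `rule_matches(CRITERIA, make_event("demo_disk", 2))` is true, while an event with status 1 or a check name that does not start with `demo_` is ignored.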


Trigger Action

Rule

Ingredients


IT Domains: Config mgmt, Storage, Networking, Containers, Cloud Infra, Monitoring, Ops Support

Sensors, Actions

Rules, Workflows

Trigger, Call

Automation Example

Actors: Automation, Engineer, Service, Monitoring, Incident Management

Event: “low disk on web301”

Web301 is “low disk”

Resolve known cases, fast. Is it /var/log? Clean up!

Unknown problem, need a human

Wake up, buddy. Something real is going on…
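The decision in this example can be sketched in a few lines of Python. This is a hedged illustration, not code from StackStorm: the `triage` function, the event fields and the 50% threshold are all assumptions made up for the sketch.

```python
def triage(event, usage_share_by_path, log_share_threshold=0.5):
    """Decide how to handle an event from monitoring.

    usage_share_by_path maps a directory to the share (0..1) of used disk
    space it accounts for. The threshold is an arbitrary illustrative value.
    Returns a (decision, detail) tuple.
    """
    if event["check"] != "low_disk":
        # Not a case the automation knows: hand over to a human.
        return ("escalate", "unknown event type: " + event["check"])
    if usage_share_by_path.get("/var/log", 0.0) >= log_share_threshold:
        # Known case: logs are filling the disk, so clean them up.
        return ("cleanup", "/var/log")
    # Low disk, but not caused by logs: unknown problem, need a human.
    return ("escalate", "no known cause for low disk on " + event["host"])
```

For the event on the slide, `triage({"host": "web301", "check": "low_disk"}, {"/var/log": 0.7})` resolves the known case, and any other situation falls through to waking up the engineer.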

StackStorm: Why we made it?

Opalis, now Microsoft SC Orchestrator, 2004-2008

Event-driven automation with workflows, for Enterprise IT

VMware, 2008 - 2013

OpenStack vs. VMware

DevOps Tools vs. Enterprise Suites

Event-driven automation with workflows, for Cloud and DevOps

What can be automated?

Got StackStorm!

Every automation looks like a nail

What SHOULD be automated?

From: The Practice of Cloud System Administration, by Thomas Limoncelli

What HAS BEEN automated with StackStorm

• Security checks

– On malware detection in a VM, isolate network port on a switch

• App blue-green deployment

– On Jenkins tests passed, bring up new VM cluster, deploy and configure app, set load balancer to send % of traffic to new app, monitor, roll forward, or back out

• Networking

– On BGP peer goes down: collect troubleshooting data, post on Slack & create JIRA ticket

– On Link aggregation member error, check load; if capacity of the rest of the LAG bundle is enough, disable link with error

• OpenStack

– Orphan VM clean-up: On orphans detected, shut down, email owner, keep for a few days, delete

– VM evacuation on HW failures: On host RAID failure, get list of impacted VMs, email VM owners, evacuate VMs, create JIRA ticket for hardware replacement.

• Service remediation

– Cassandra “node down” recovery: On ring node dying, deploy new node, configure, add to the ring.

– Remediating RabbitMQ, Galera cluster, MySQL, and more…
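As one worked example from the list above, the link aggregation case boils down to a capacity check before a faulty member is disabled. The sketch below is an assumption-laden illustration: `can_disable_member`, its parameters and the 80% headroom factor are hypothetical and do not come from any StackStorm pack.

```python
def can_disable_member(member_speeds_gbps, current_load_gbps, faulty_index, headroom=0.8):
    """Return True if the LAG can carry the current load without the faulty link.

    member_speeds_gbps: capacity of each member link in the bundle.
    current_load_gbps:  traffic currently carried by the whole bundle.
    faulty_index:       position of the member that reported errors.
    headroom:           fraction of remaining capacity we allow ourselves to use
                        (an illustrative safety margin, not a standard value).
    """
    remaining = sum(
        speed for index, speed in enumerate(member_speeds_gbps) if index != faulty_index
    )
    return current_load_gbps <= remaining * headroom
```

With four 10 Gbps members carrying 20 Gbps, removing one leaves 30 Gbps, so the faulty link can be disabled; with two 10 Gbps members carrying 9 Gbps, the remaining link is not enough under this margin and the case is better left to a human.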


Auto-Remediation

FB auto-remediates 98% of alarms, can you?

“Auto-Remediation & Automation at Facebook” @ Auto-Remediation Meetup SF https://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/events/236704012/

Facebook FBAR:

43 % Problem Fixed

51 % False positives

94 % Automated

FBAR & StackStorm are friends

StackStorm is inspired by FBAR

StackStorm and FB FBAR have been collaborating since 2014

Mirantis: Auto-remediating 2,000 node OpenStack cluster at Symantec with StackStorm

“Sleep Better at Night: OpenStack Cloud Auto-Healing” @ OpenStack Summit Barcelona

User: Symantec (Mirantis)

Use case: OpenStack cluster remediation

Presented by Mirantis at OpenStack Barcelona

StackStorm at Symantec

On-call, Without Automation

Timeline from 2:00 AM to 2:30 AM (marks at 2:00, 2:02, 2:07, 2:10, 2:15, 2:20 and 2:30 AM). Steps shown: PagerDuty alert; engineer wakes up; logs in and ACKs; studies the alert; checks runbook; runs diagnostics; fixes the problem.

Source: “Winston: Helping Netflix Engineers Sleep at Night” @ QCon ‘16 SF https://goo.gl/lHzq4r

On-call With Winston

Timeline from 2:00 AM to 2:15 AM (marks at 2:00, 2:05 and 2:15 AM). Steps shown: Winston; false positive; assisted diagnostics; fixed the problem.

Benefits

• Reduce MTTR (Mean Time to Resolution)

• Avoid failures (fixing on computer time, not human time)

• Reduce risk of human error (no fat fingers)

• Positive team impact

– Avoid pager fatigue and team burn-out

– Turn from reactive to proactive (break reactive vicious cycle)

– Capture operational knowledge – as code


Network Automation

Now supporting Multiple Vendors

Device support

Proven approach from cloud compute space

AUTOMATION PACKS

Some workflows [1] leverage unique device functions, so they call actions of the device’s integration pack. Other workflows [2] need to be abstracted and treat devices alike (e.g. IXP provisioning on a mixture of SLX and MLX), so they use the “unified” NAPALM pack.

INTEGRATION PACKS

Integration packs expose device configurations and operations as st2 actions. VDX and SLX packs will expose most/all of device functionality. MLX is “best effort”. The NAPALM integration pack provides “unified” device actions, just like libcloud does in the “compute” domain.

PYTHON BINDINGS

While integration packs can call device interfaces directly, Python bindings provide reuse and abstract device OS versions. PyNOS binds VDX via NETCONF. Bindings for SLX and MLX don’t exist yet. NAPALM is an open source project that exposes a unified Python API to interact with different network devices via device drivers.

DEVICE / OS INTERFACES

Devices expose various interfaces: RESTCONF, NETCONF, CLI/TELNET, SNMP.

Devices: VDX, MLX, SLX, other vendors
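The difference between the two workflow styles, [1] device-specific and [2] unified, can be sketched with a small driver registry in Python. Everything here is hypothetical: the class names, `get_driver`, and `vdx_only_operation` are placeholders invented for the illustration and are not taken from NAPALM, PyNOS or the Brocade packs.

```python
from abc import ABC, abstractmethod


class DeviceDriver(ABC):
    """Common interface a 'unified' pack would program against."""

    @abstractmethod
    def get_facts(self) -> dict:
        ...


class VdxDriver(DeviceDriver):
    def get_facts(self) -> dict:
        return {"os": "vdx"}

    def vdx_only_operation(self) -> str:
        # Placeholder for a function only this device family offers.
        return "vdx-specific result"


class MlxDriver(DeviceDriver):
    def get_facts(self) -> dict:
        return {"os": "mlx"}


# Registry that maps a device OS name to its driver class.
DRIVERS = {"vdx": VdxDriver, "mlx": MlxDriver}


def get_driver(os_name: str) -> DeviceDriver:
    """Look up and instantiate the driver for a device OS."""
    return DRIVERS[os_name]()


def unified_inventory(os_names):
    """Style [2]: treat every device alike through the shared interface."""
    return [get_driver(name).get_facts()["os"] for name in os_names]


def device_specific_step() -> str:
    """Style [1]: call a function that exists on one device family only."""
    return VdxDriver().vdx_only_operation()
```

The trade-off is the usual one: the unified path works across a mixed fleet but only offers what every driver implements, while the device-specific path reaches everything a device can do at the cost of portability.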


Including legacy, business apps …

Integration

“Innovation at Dimension Data: Optimizing Operations with Event Driven Automation” https://stackstorm.com/2016/12/15/dimension-data-devops-beyond-deployment/

Dimension Data (SP, part of NTT)

• Integrate IT systems & tools

• Security automation

• Legacy Run-book replacement

• Automation-aaS to end-users

• Top st2 contributors

IoT

Fun stuff

Serverless

Grab StackStorm & DIY

StackStorm is like …

AWS Lambda, AWS Step Functions

OpenSource, for DIY Serverless

Sensors, Actions, Rules, Workflows (StackStorm) alongside AWS Lambda and Step Functions

IT Domains: Config mgmt, Storage, Networking, Containers, Cloud Infra, Monitoring, Ops Support

Serverless with Swarm for Genomic Annotation Computing

Dmitri Zimine

http://github.com/dzimine/serverless

@dzimine

Image by Miki Yoshihito, Creative Commons license

Many Use Cases – One Platform

StackStorm automation platform

• Network Automation

• Assisted Troubleshooting

• Auto-Remediation

• IT process integration

• IoT (Internet of Things)

• Serverless

• CI/CD (Continuous Deployment)

• NFV

• Security Orchestration

• ChatOps

Why not…


Why not Scripts?


Workflows Better in Operations

• Simple to define, reason, visualize

• Transparent

– state is clear, execution is trackable: running, complete, failed steps

• Reliable

– Workflows are long-running

– Crash tolerance

– “Restart from point of failure”
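The transparency and “restart from point of failure” properties listed above can be illustrated with a toy workflow runner in Python. This is a sketch under stated assumptions, not how StackStorm's workflow engine is built: the `Workflow` class is hypothetical, and it keeps state in memory, whereas a real engine would persist it to survive a crash.

```python
class Workflow:
    """Toy workflow: an ordered list of named steps with visible per-step state."""

    def __init__(self, steps):
        # steps is a list of (name, callable) pairs.
        self.steps = steps
        self.state = {name: "pending" for name, _ in steps}

    def run(self):
        """Run steps in order; return True if all complete, False on first failure."""
        for name, func in self.steps:
            if self.state[name] == "complete":
                continue  # already done in an earlier run, do not repeat it
            self.state[name] = "running"
            try:
                func()
            except Exception:
                self.state[name] = "failed"
                return False
            self.state[name] = "complete"
        return True
```

Calling `run()` again after a failure skips the completed steps and resumes at the failed one, and `state` shows at any time which steps are pending, running, complete or failed. A plain script gives neither for free.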


Why not legacy Runbook Automation?

• Microsoft System Center Orchestrator

• HP Operations Orchestration

• Cisco Process Orchestrator (CPO)

• VMware vCO / vRealize

They do not DevOps

Infrastructure As Code

Leverage social coding and collaboration

OpenSource

Designed for DevOps

Infrastructure as code

Case Study

• Service Catalog backed by workflows

• Automate provisioning on VMware/OpenStack, 4 data centers

• Before: CPO, operator updates via GUI, click and pray, x4

• After: StackStorm, dev -> code review -> staging -> QA -> prod

265 Contributors

Source: https://openhub.net/p/st2

256 = 100,000,000₂ Contributors

StackStorm & BWC Usage

[Chart: installations per month, September to February, for StackStorm and BWC; vertical axis 0 to 3000]

StackStorm Exchange

Take away:

• Try it

• Use it

• Contribute to it

• Use StackStorm.

Try it, find automation, nail POC. Let us know, good & bad.

curl -sSL https://stackstorm.com/packages/install.sh | bash -s

docs.stackstorm.com/install

• Commit code. Become a “community maintainer”

It is not hard (2 days?). We help & support.

• Spread the word

Blog. Tweet. Talk. Mention. Bug. GitHub Star!

Contribute! Everything counts

Thank You!

@Stack_Storm

http://github.com/StackStorm/st2 (Star 1,869)

Dmitri Zimine @dzimine

StackStorm

OpenSource, Apache 2.0

• GitHub: github.com/StackStorm/st2

• Twitter: @Stack_Storm

• IRC: #stackstorm on FreeNode

• stackstorm.slack.com on Slack

• www.stackstorm.com

Brocade Workflow Composer

Commercial Edition

• Enterprise features

• Priority support

• brocade.com/bwc

• docs: bwc-docs.brocade.com

• Network lifecycle automation suite