Deployment Preparedness

Production Preparedness { name: ‘Bryan Reinero’, title: ‘Developer Advocate’, twitter: ‘@blimpyacht’, code: ‘github.com/breinero’, email: ‘[email protected]’ }

Transcript of Deployment Preparedness

Page 1: Deployment Preparedness

Production Preparedness

{ name: ‘Bryan Reinero’,

title: ‘Developer Advocate’,

twitter: ‘@blimpyacht’,

code: ‘github.com/breinero’,

email: ‘[email protected]’ }

Page 2: Deployment Preparedness

2

Deploy with Joy!

Page 3: Deployment Preparedness

3

Page 4: Deployment Preparedness

4

Page 5: Deployment Preparedness

5

Production Checklist

• Proper Infrastructure
• Proper Configuration
• Proper Monitoring
• Emergency Procedures

Page 6: Deployment Preparedness

6

Infrastructure Sizing

• RAM
• CPU
• Disk Size
• I/O Bandwidth
• Availability

Page 7: Deployment Preparedness

7

Sizing

• Indexes need to be in RAM
• Working set needs to be in RAM
• I/O Bandwidth
  - write load
  - Index updates
  - Working set migration

{ _id: ObjectId(),

tour: UUID,

user: UUID,

name: "Doug's Dogs",

desc: "The best hot-dog",

clues: [

"Hungry for a Coney Island?",

"Ask for Dr. Frankenfurter",

"Look for the hot dog stand"

],

"geometry": {

"type": "Point",

"coordinates": [125.6, 10.1]

}

}
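The sizing rule on this slide (indexes and working set both need to fit in RAM) can be sketched as simple arithmetic. This is a minimal sketch, not a sizing tool: the function name, the 80% usable fraction, and the example byte counts are all assumptions made for illustration. In practice the index figure could be taken from the totalIndexSize value in collection stats, and the working set here is assumed to mean frequently accessed documents, not counting indexes.

```python
GIB = 1024 ** 3

def fits_in_ram(index_bytes, working_set_bytes, ram_bytes, usable_fraction=0.8):
    """Return (fits, needed, budget) for a rough RAM sizing check.

    usable_fraction is a hypothetical allowance for the OS and other
    processes; it is not a MongoDB setting.
    """
    needed = index_bytes + working_set_bytes
    budget = ram_bytes * usable_fraction
    return needed <= budget, needed, budget

# Hypothetical numbers: 6 GiB of indexes, 20 GiB of hot documents, 32 GiB host.
fits, needed, budget = fits_in_ram(6 * GIB, 20 * GIB, 32 * GIB)
print(f"needed={needed / GIB:.1f} GiB budget={budget / GIB:.1f} GiB fits={fits}")
```

With these made-up numbers the check fails, which is the signal to add RAM, trim indexes, or shard before going to production.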

Page 8: Deployment Preparedness

11

Load Testing

Page 9: Deployment Preparedness

12

Load Testing

• Test it like you use it, benchmarks don’t count

Page 10: Deployment Preparedness

13

Load Testing

• Test it like you use it, benchmarks don’t count
• Test to failure

Page 11: Deployment Preparedness

14

Load Testing

• Test it like you use it, benchmarks don’t count
• Test to failure
• Instrument your code!

Page 12: Deployment Preparedness

15

Load Testing

• Test it like you use it, benchmarks don’t count
• Test to failure
• Instrument your code!

https://github.com/breinero/Firehose

https://github.com/ParsePlatform/flashback
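"Instrument your code!" can be as small as a timer wrapped around each database call. The sketch below is not taken from Firehose or flashback; the names timed, summarize, and fake_query are hypothetical, and fake_query stands in for a real query so the example runs without a database.

```python
import time
from contextlib import contextmanager
from statistics import mean

@contextmanager
def timed(samples):
    """Append the wall-clock duration of the wrapped block to samples."""
    start = time.perf_counter()
    try:
        yield
    finally:
        samples.append(time.perf_counter() - start)

def summarize(samples):
    """Return count, mean, approximate 95th percentile, and max latency."""
    ordered = sorted(samples)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {"n": len(ordered), "mean": mean(ordered),
            "p95": ordered[p95_index], "max": ordered[-1]}

def fake_query():
    # Placeholder for a real database operation.
    sum(range(1000))

latencies = []
for _ in range(50):
    with timed(latencies):
        fake_query()

report = summarize(latencies)
print(report)
```

Collecting the same numbers while the load is ramped up is one way to see where latency starts to climb, which is the point of testing to failure.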

Page 13: Deployment Preparedness

16

Growth

[Chart: Load (1K Ops / Second) against time, with Warn and Saturation levels marked]

Page 14: Deployment Preparedness

17

Growth

[Chart: Memory load, with Warn and Saturation levels marked]

Page 15: Deployment Preparedness

18

Growth

[Chart: Input Output load, with Warn and Saturation levels marked]
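The Growth charts compare a rising load against a Warn level and a Saturation level. A rough way to turn that picture into a number is to project how many periods remain before each level is reached. The sketch below assumes linear growth, which the slides do not state; the function name and all figures are hypothetical.

```python
import math

def periods_until(current, growth_per_period, threshold):
    """Periods until a linearly growing load reaches threshold.

    Returns 0 if the threshold is already met, and None if the load
    is flat or shrinking and will never reach it.
    """
    if current >= threshold:
        return 0
    if growth_per_period <= 0:
        return None
    return math.ceil((threshold - current) / growth_per_period)

# Hypothetical: 4K ops/s today, growing 0.5K ops/s per month,
# warn level at 8K ops/s and saturation (found by load testing) at 10K ops/s.
months_to_warn = periods_until(4.0, 0.5, 8.0)
months_to_saturation = periods_until(4.0, 0.5, 10.0)
print(months_to_warn, months_to_saturation)
```

The gap between the two results is the time available to add capacity once the warning fires.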

Page 16: Deployment Preparedness

19

Page 17: Deployment Preparedness

20

Monitoring

Baseline
• MongoDB Cloud Manager
• MongoDB Ops Manager
• Nagios, Zenoss, …

Detailed Query Specific
• mongotop
• db.currentOp()
• Query Profiler
• mtools

Page 18: Deployment Preparedness

21

Forensics

2014-08-08T21:15:25.181-0500 [conn1026] getmore myDB.myCollection cursorid:100012502307 ntoreturn:0 keyUpdates:0 numYields:1406953 locks(micros) r:11887558422 nreturned:289 reslen:4208149 28795759ms

2014-08-07T15:31:51.714-0500 [conn7] command myDB.$cmd command: createIndexes { createIndexes: "myColletion", indexes: [ { key: { Claims.ICN: 1.0 }, name: "test.a_1" } ] } keyUpdates:0 numYields:0 locks(micros) r:14476 w:25176930351 reslen:113 25176955ms
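Both log entries on this slide end with the operation's duration in milliseconds, and both ran for hours. A short script can pull those durations out of a log file. This is a sketch that assumes the legacy plain-text log format shown above; the regular expression, the function name slow_ops, and the threshold parameter are illustrative, and for real log analysis mtools (listed on the Monitoring slide) is the more complete option.

```python
import re

# timestamp [connection] operation namespace ... <duration>ms
SLOW_OP = re.compile(
    r"^(?P<ts>\S+) \[(?P<conn>\w+)\] (?P<op>\w+) (?P<ns>\S+) .* (?P<ms>\d+)ms$"
)

LINES = [
    "2014-08-08T21:15:25.181-0500 [conn1026] getmore myDB.myCollection "
    "cursorid:100012502307 ntoreturn:0 keyUpdates:0 numYields:1406953 "
    "locks(micros) r:11887558422 nreturned:289 reslen:4208149 28795759ms",
    '2014-08-07T15:31:51.714-0500 [conn7] command myDB.$cmd command: '
    'createIndexes { createIndexes: "myColletion", indexes: [ { key: '
    '{ Claims.ICN: 1.0 }, name: "test.a_1" } ] } keyUpdates:0 numYields:0 '
    "locks(micros) r:14476 w:25176930351 reslen:113 25176955ms",
]

def slow_ops(lines, threshold_ms=100):
    """Return (operation, namespace, ms) tuples at or above threshold_ms, slowest first."""
    hits = []
    for line in lines:
        match = SLOW_OP.match(line.strip())
        if match and int(match.group("ms")) >= threshold_ms:
            hits.append((match.group("op"), match.group("ns"), int(match.group("ms"))))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

for op, ns, ms in slow_ops(LINES):
    print(f"{op} on {ns}: {ms / 3_600_000:.1f} hours")
```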

Page 19: Deployment Preparedness

22

Logging

Page 20: Deployment Preparedness

23

Logging

• Save and Rotate
• Don’t use --quiet
• --logpath != --dbpath
• Use component verbosity for debugging
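The "--logpath != --dbpath" item can be checked before a deployment goes live. The sketch below takes one reading of that bullet, namely that the log file should not live inside the data directory; the function name and the example paths are hypothetical, and the check does not verify that the two paths sit on separate volumes.

```python
import os

def log_inside_dbpath(logpath, dbpath):
    """Return True if the log file is located under the data directory."""
    log_abs = os.path.abspath(logpath)
    db_abs = os.path.abspath(dbpath)
    return os.path.commonpath([log_abs, db_abs]) == db_abs

# Hypothetical paths for illustration only.
print(log_inside_dbpath("/data/db/mongod.log", "/data/db"))
print(log_inside_dbpath("/var/log/mongodb/mongod.log", "/data/db"))
```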

Page 21: Deployment Preparedness

24

Security

Page 22: Deployment Preparedness

25

Security

• Firewall
• Bind IP
• Encrypt Networks
• Enable Access Control
• Don’t enable REST interface
• Auditing

Limit Exposure and use Principle of Least Privilege

Page 23: Deployment Preparedness

26

Tuning

Best Practices
• Disable Transparent Huge Pages
• NTP to synchronize time
• Set ulimits
• Use XFS or Ext4
• Don’t use NFS
• Disable NUMA
• Have swap

Read Production Notes

Tunables
• Set IO Scheduler NOOP
• Adjust readaheads ( MMAPv1 )
• Avoid cgroups
• SELinux (?)
• RAID
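Several of these settings can be verified by a preflight script rather than by memory. As one example, the Transparent Huge Pages item under Best Practices can be checked on Linux by reading the kernel's sysfs file, where the active mode appears in square brackets (for example "always madvise [never]"). The function names below are hypothetical, and the sketch only reports the mode; it does not change it.

```python
def active_thp_mode(sysfs_text):
    """Return the bracketed (active) mode from the THP sysfs file content."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

def read_thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Read the active THP mode, or None if the file is unavailable."""
    try:
        with open(path) as handle:
            return active_thp_mode(handle.read())
    except OSError:
        return None

mode = read_thp_mode()
if mode is None:
    print("THP setting not found (not Linux, or THP not built in)")
elif mode != "never":
    print(f"Transparent Huge Pages mode is '{mode}'; the checklist asks for 'never'")
else:
    print("Transparent Huge Pages disabled")
```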

Page 24: Deployment Preparedness

27

Availability

http://avstop.com/ac/flighttrainghandbook/imagel4b.jpg

Page 25: Deployment Preparedness

28

Availability

[Diagram: replica set with one primary (P) and two secondaries (S) across two data centers, DC1 and DC2]

Avoid Critical Data Centers

Page 26: Deployment Preparedness

29

Availability

[Diagram: primary (P) in DC1, secondary (S) in DC2, secondary (S) in DC3]

Page 27: Deployment Preparedness

30

Availability

[Diagram: primary (P) in DC1, secondary (S) in DC2, secondary (S) in AWS]

Page 28: Deployment Preparedness

31

Availability

[Diagram: primary (P) in DC1, secondary (S) in DC2, Arbiter in DC3]

Page 29: Deployment Preparedness

32

Availability

[Diagram: primary (P) in DC1, Arbiter in AWS, secondary (S) in DC2 down for maintenance]
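The Availability slides turn on one piece of arithmetic: a replica set can elect a primary only while a majority of its voting members can reach each other. A data center is "critical" when losing it alone leaves fewer than a majority. The sketch below computes that; the function name is hypothetical, and the first layout assumes the two-data-center slide places two members in DC1 and one in DC2, which the transcript does not state outright.

```python
from collections import Counter

def critical_data_centers(member_locations):
    """Return data centers whose loss alone would leave the set without a majority.

    member_locations holds one data-center name per voting member
    (arbiters vote, so they count here too).
    """
    total = len(member_locations)
    majority = total // 2 + 1
    per_dc = Counter(member_locations)
    return sorted(dc for dc, count in per_dc.items() if total - count < majority)

# Assumed layout for the two-data-center slide: two members in DC1, one in DC2.
print(critical_data_centers(["DC1", "DC1", "DC2"]))
# One member per location, as in the three-site slides.
print(critical_data_centers(["DC1", "DC2", "DC3"]))
print(critical_data_centers(["DC1", "DC2", "AWS"]))
```

With three sites and one voting member in each, any single site (including one taken down for maintenance) can be lost while the other two still form a majority.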

Page 30: Deployment Preparedness

33

Emergency Procedures

https://spinoff.nasa.gov/spinoff2002/images/070.jpg

Page 31: Deployment Preparedness

34

Emergency Procedures


Backup and Recovery
• File System Snapshot
• MMS Cloud
• Ops Manager
• mongodump

Page 32: Deployment Preparedness

35

Backups and Recovery


PERFORM DRILLS OFTEN AND ROUTINELY

Page 33: Deployment Preparedness

36

Emergency Procedures


Document your Procedures
• Include ETAs
• Follow procedures in docs.mongodb.org

Page 34: Deployment Preparedness

37

Production Ready Architecture

L.B.

Page 35: Deployment Preparedness

38

Production Ready Architecture

L.B.

Unindexed queries

Page 36: Deployment Preparedness

39

Production Ready Architecture

L.B.

Unindexed queries

Leads to collection scans

Page 37: Deployment Preparedness

40

Production Ready Architecture

L.B.

Unindexed queries

Leads to collection scans

Results in high latencies

Page 38: Deployment Preparedness

41

Classic Failure Scenario

L.B.

Unindexed queries

Leads to collection scans

Results in high latencies

Causes memory exhaustion

Page 39: Deployment Preparedness

42

Production Ready Architecture

L.B.

Unindexed queries

Leads to collection scans

Results in high latencies

Causes memory exhaustion

CASCADING FAILURE

Page 40: Deployment Preparedness

43

Circuit Breaker

Trigger Conditions
• Latency: stats.getMean() >= max
• OpsPerSecond: stats.getN() >= max
• ConcurrentOperations: stats.getN()*stats.getMean() >= max

Page 41: Deployment Preparedness

44

Circuit Breaker

Trigger Conditions
• Latency: stats.getMean() >= max
• OpsPerSecond: stats.getN() >= max
• ConcurrentOperations: stats.getN()*stats.getMean() >= max

https://github.com/breinero/Firehose
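The three trigger conditions can be sketched over a sliding window of recent operations. This is an illustration in Python, not the Firehose implementation: the class name, the window length, and the limits are all assumptions. The concurrency estimate follows the slide's formula, operations per second multiplied by mean latency, which is Little's law.

```python
from collections import deque

class CircuitBreaker:
    """Trips when recent latency, throughput, or estimated concurrency hits a limit."""

    def __init__(self, max_latency_s, max_ops_per_s, max_concurrent, window_s=1.0):
        self.max_latency_s = max_latency_s
        self.max_ops_per_s = max_ops_per_s
        self.max_concurrent = max_concurrent
        self.window_s = window_s
        self._samples = deque()  # (finish_time, latency_s) pairs, oldest first

    def record(self, finish_time, latency_s):
        """Store one completed operation and drop samples older than the window."""
        self._samples.append((finish_time, latency_s))
        self._evict(finish_time)

    def _evict(self, now):
        while self._samples and self._samples[0][0] <= now - self.window_s:
            self._samples.popleft()

    def tripped(self, now):
        """Return True if any trigger condition holds for the current window."""
        self._evict(now)
        count = len(self._samples)
        if count == 0:
            return False
        mean_latency = sum(latency for _, latency in self._samples) / count
        ops_per_s = count / self.window_s
        concurrent = ops_per_s * mean_latency  # Little's law estimate
        return (mean_latency >= self.max_latency_s
                or ops_per_s >= self.max_ops_per_s
                or concurrent >= self.max_concurrent)

# Hypothetical limits: 0.5 s mean latency, 100 ops/s, 10 concurrent operations.
breaker = CircuitBreaker(max_latency_s=0.5, max_ops_per_s=100, max_concurrent=10)
breaker.record(finish_time=0.1, latency_s=0.01)
breaker.record(finish_time=0.2, latency_s=0.02)
print(breaker.tripped(now=0.5))   # healthy window
breaker.record(finish_time=0.6, latency_s=2.0)
print(breaker.tripped(now=0.7))   # one very slow operation raises the mean
```

When the breaker trips, the application sheds or rejects new database work, so a slow database does not drag the whole tier into the cascading failure described on the previous slides.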

Page 42: Deployment Preparedness

45

Client Side

• Don’t use ensureIndex() in application
• Look out for connection bombs
  - --maxConnect
• DO use operation timeouts
• DON’T cause socket timeouts
  - Lower keepalives
• Avoid retry bombs
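A "retry bomb" happens when every client retries a failed operation immediately and without limit, multiplying load on a database that is already struggling. One common mitigation is a bounded number of attempts with exponential backoff and random jitter. The sketch below is generic Python, not driver code: call_with_backoff and its parameters are hypothetical, and ConnectionError stands in for whatever transient error the real driver raises.

```python
import random
import time

def call_with_backoff(operation, max_attempts=4, base_delay_s=0.1,
                      max_delay_s=2.0, sleep=time.sleep, rng=random.random):
    """Call operation, retrying transient failures a bounded number of times.

    Waits a random fraction of an exponentially growing, capped delay
    between attempts, then re-raises once the attempts are used up.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(rng() * ceiling)

# Demonstration with a stand-in operation that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"

print(call_with_backoff(flaky_operation, sleep=lambda seconds: None))
```

The cap on attempts matters as much as the backoff: once it is reached the error surfaces to the caller instead of adding more load.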

Page 43: Deployment Preparedness

46

Requirements & Specs

Make a DevOps Contract
• Database Access Requirements
• Database Access Fulfillment Specification
• Cluster Configuration
• Monitoring and Alerting Specification

Page 44: Deployment Preparedness

47

Monitoring

• Opcounters
• Memory
• Page Faults
• Queues
• Replication Lag
• Oplog Window
• Background Flush Average
• Disk space
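Two of these metrics are worth reading together: replication lag is how far a secondary trails the primary, and the oplog window is how much history the oplog holds. If lag approaches the window, the secondary can fall off the end of the oplog and need a full resync. The sketch below shows the arithmetic only; the function names, the timestamps, and the 50% safety fraction are assumptions, and in practice the inputs would come from replica set status and the first and last oplog entries.

```python
from datetime import datetime, timedelta

def replication_lag(primary_optime, secondary_optime):
    """Time by which the secondary trails the primary (never negative)."""
    return max(primary_optime - secondary_optime, timedelta(0))

def oplog_window(first_entry_time, last_entry_time):
    """Span of time covered by the oplog."""
    return last_entry_time - first_entry_time

def lag_is_safe(lag, window, max_fraction=0.5):
    """True while lag stays under a chosen fraction of the oplog window."""
    return lag <= window * max_fraction

# Hypothetical timestamps for illustration.
primary = datetime(2015, 1, 2, 12, 0, 0)
secondary = datetime(2015, 1, 2, 11, 59, 30)
window = oplog_window(datetime(2015, 1, 1, 12, 0, 0), primary)
lag = replication_lag(primary, secondary)
print(lag, window, lag_is_safe(lag, window))
```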

Page 45: Deployment Preparedness

Thanks!

{ name: ‘Bryan Reinero’,

title: ‘Developer Advocate’,

twitter: ‘@blimpyacht’,

code: ‘github.com/breinero’,

email: ‘[email protected]’ }