Deployment Preparedness

Post on 17-Aug-2015


Production Preparedness

{ name: 'Bryan Reinero',
  title: 'Developer Advocate',
  twitter: '@blimpyacht',
  code: 'github.com/breinero',
  email: 'bryan@mongodb.com' }

Deploy with Joy!


Production Checklist

• Proper Infrastructure
• Proper Configuration
• Proper Monitoring
• Emergency Procedures

Infrastructure Sizing

• RAM
• CPU
• Disk Size
• I/O Bandwidth
• Availability

Sizing

• Indexes need to be in RAM
• Working set needs to be in RAM
• I/O Bandwidth
  - Write load
  - Index updates
  - Working set migration

{ _id: ObjectId(),
  tour: UUID,
  user: UUID,
  name: "Doug's Dogs",
  desc: "The best hot-dog",
  clues: [
    "Hungry for a Coney Island?",
    "Ask for Dr. Frankenfurter",
    "Look for the hot dog stand"
  ],
  geometry: {
    type: "Point",
    coordinates: [125.6, 10.1]
  }
}
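The sizing rule above (indexes and working set in RAM) can be turned into a rough back-of-the-envelope check. The sketch below is only an illustration: the function name, the document count, the average document size, the working-set fraction and the index size are all hypothetical inputs, not figures from the talk.

```python
GIB = 1024 ** 3


def fits_in_ram(doc_count, avg_doc_bytes, working_set_fraction, index_bytes, ram_bytes):
    """Rough estimate: does the working set plus all indexes fit in RAM?

    working_set_fraction is the share of documents touched regularly
    (an assumption you must measure or guess for your own workload).
    Returns (bytes_needed, fits).
    """
    working_set_bytes = doc_count * avg_doc_bytes * working_set_fraction
    needed = int(working_set_bytes + index_bytes)
    return needed, needed <= ram_bytes


# Hypothetical workload: 50M documents of ~400 bytes, 25% of them "hot",
# 3 GiB of indexes, on a 16 GiB host.
needed, fits = fits_in_ram(
    doc_count=50_000_000,
    avg_doc_bytes=400,
    working_set_fraction=0.25,
    index_bytes=3 * GIB,
    ram_bytes=16 * GIB,
)
print(f"need {needed / GIB:.1f} GiB, fits: {fits}")
```

An estimate like this only narrows the range; the load-testing steps later in the deck are what confirm it.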

Load Testing

• Test it like you use it, benchmarks don't count
• Test to failure
• Instrument your code!

https://github.com/breinero/Firehose
https://github.com/ParsePlatform/flashback
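As one possible reading of "Instrument your code!", the sketch below times named operations and summarizes them. It is not taken from Firehose or flashback; `LatencyRecorder`, `timed` and `summary` are hypothetical names, and the p95 calculation is a simple nearest-rank approximation.

```python
import time
from contextlib import contextmanager
from statistics import mean


class LatencyRecorder:
    """Collects per-operation latencies so a load test can report them."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the recorder can be tested
        self.samples = {}        # operation name -> list of durations

    @contextmanager
    def timed(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the operation raised.
            self.samples.setdefault(name, []).append(self._clock() - start)

    def summary(self, name):
        values = sorted(self.samples[name])
        p95_index = max(0, int(round(0.95 * len(values))) - 1)
        return {
            "count": len(values),
            "mean": mean(values),
            "p95": values[p95_index],
            "max": values[-1],
        }


recorder = LatencyRecorder()
with recorder.timed("find_by_tour"):
    sum(range(1000))             # stand-in for a real database call
print(recorder.summary("find_by_tour"))
```

Wrapping every database call this way during a load test gives the latency and throughput numbers needed to see where the system starts to fail.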

Growth

[Chart: load in 1K ops/second over time, with Warn and Saturation thresholds marked]

Growth

[Chart: memory load over time, with Warn and Saturation thresholds marked]

Growth

[Chart: input/output load over time, with Warn and Saturation thresholds marked]
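The growth charts track a load metric against a Warn level and a Saturation level. A minimal sketch of how that could be used for planning is shown below; it assumes equally spaced measurements and simple linear growth, and the sample values and threshold numbers are hypothetical rather than read from the charts.

```python
def periods_until(threshold, samples):
    """Estimate how many sampling periods remain before `threshold` is hit.

    samples: load measurements taken at equal intervals, oldest first.
    Uses a straight line through the first and last sample.
    Returns 0.0 if the threshold is already reached, None if load is not growing.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    current = samples[-1]
    if current >= threshold:
        return 0.0
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    if slope <= 0:
        return None
    return (threshold - current) / slope


# Hypothetical monthly peaks in thousands of ops/second.
monthly_peaks = [2, 3, 4, 5]
print("months until warn level:", periods_until(8, monthly_peaks))
print("months until saturation:", periods_until(10, monthly_peaks))
```

The gap between the two results is the time available to add capacity once the warn level is crossed.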


Monitoring

Baseline
• MongoDB Cloud Manager
• MongoDB Ops Manager
• Nagios, Zenoss, …

Detailed Query Specific
• mongotop
• db.currentOp()
• Query Profiler
• mtools
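As an illustration of query-specific monitoring, the sketch below filters the kind of result `db.currentOp()` returns (an `inprog` array whose entries carry fields such as `opid`, `active`, `secs_running`, `op` and `ns`) for long-running operations. The function name and the sample data are hypothetical; fetching the result from a live server is left out so the example stays self-contained.

```python
def long_running_ops(current_op_result, min_seconds):
    """Return active operations running at least `min_seconds`, slowest first."""
    ops = current_op_result.get("inprog", [])
    slow = [
        op for op in ops
        if op.get("active") and op.get("secs_running", 0) >= min_seconds
    ]
    return sorted(slow, key=lambda op: op["secs_running"], reverse=True)


# Made-up sample shaped like currentOp output, for illustration only.
sample = {
    "inprog": [
        {"opid": 101, "active": True, "secs_running": 2, "op": "query", "ns": "myDB.myCollection"},
        {"opid": 102, "active": True, "secs_running": 950, "op": "getmore", "ns": "myDB.myCollection"},
        {"opid": 103, "active": False, "secs_running": 0, "op": "none", "ns": ""},
        {"opid": 104, "active": True, "secs_running": 60, "op": "update", "ns": "myDB.other"},
    ]
}

for op in long_running_ops(sample, min_seconds=30):
    print(op["opid"], op["op"], op["ns"], op["secs_running"])
```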

Forensics

2014-08-08T21:15:25.181-0500 [conn1026] getmore myDB.myCollection cursorid:100012502307 ntoreturn:0 keyUpdates:0 numYields:1406953 locks(micros) r:11887558422 nreturned:289 reslen:4208149 28795759ms

2014-08-07T15:31:51.714-0500 [conn7] command myDB.$cmd command: createIndexes { createIndexes: "myColletion", indexes: [ { key: { Claims.ICN: 1.0 }, name: "test.a_1" } ] } keyUpdates:0 numYields:0 locks(micros) r:14476 w:25176930351 reslen:113 25176955ms
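Log entries like the ones above end with the operation's duration in milliseconds. The sketch below pulls out that duration along with the operation type and namespace; the regular expression is an assumption fitted to the two sample lines, not a general parser for every mongod log format (mtools, listed earlier, is the more complete option).

```python
import re

# Pattern fitted to the sample lines: timestamp, [context], operation,
# namespace, free-form details, then "<n>ms" at the end of the line.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+) \[(?P<context>[^\]]+)\] "
    r"(?P<operation>\w+) (?P<namespace>\S+) .* (?P<millis>\d+)ms$"
)


def parse_slow_op(line):
    """Return a dict describing the logged operation, or None if no match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["millis"] = int(entry["millis"])
    entry["hours"] = round(entry["millis"] / 3_600_000, 1)
    return entry


def slow_ops(lines, threshold_ms):
    """Keep only parsed entries that took at least `threshold_ms`."""
    parsed = (parse_slow_op(line) for line in lines)
    return [entry for entry in parsed if entry and entry["millis"] >= threshold_ms]


GETMORE_LINE = (
    "2014-08-08T21:15:25.181-0500 [conn1026] getmore myDB.myCollection "
    "cursorid:100012502307 ntoreturn:0 keyUpdates:0 numYields:1406953 "
    "locks(micros) r:11887558422 nreturned:289 reslen:4208149 28795759ms"
)

entry = parse_slow_op(GETMORE_LINE)
print(entry["operation"], entry["namespace"], entry["hours"], "hours")
```

Read this way, the first sample entry is a single getmore that ran for roughly eight hours.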

Logging

• Save and Rotate
• Don't use --quiet
• --logpath != --dbpath
• Use component verbosity for debugging

Security

• Firewall
• Bind IP
• Encrypt Networks
• Enable Access Control
• Don't enable REST interface
• Auditing

Limit exposure and use the Principle of Least Privilege

Tuning

Best Practices
• Disable Transparent Huge Pages
• NTP to synchronize time
• Set ulimits
• Use XFS or Ext4
• Don't use NFS
• Disable NUMA
• Have swap

Read Production Notes

Tunables
• Set I/O scheduler to NOOP
• Adjust readahead (MMAPv1)
• Avoid cgroups
• SELinux (?)
• RAID
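Several of these settings can be verified by a small pre-flight script. As one example, the sketch below checks the Transparent Huge Pages mode on Linux, where `/sys/kernel/mm/transparent_hugepage/enabled` lists the available modes with the active one in square brackets. The function names are hypothetical, and the path may differ on some distributions.

```python
THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(sysfs_text):
    """Extract the bracketed (active) mode from text like 'always madvise [never]'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {sysfs_text!r}")


def thp_is_disabled(path=THP_PATH):
    """Read the sysfs file and report whether THP is set to 'never'."""
    with open(path) as handle:
        return active_thp_mode(handle.read()) == "never"


print(active_thp_mode("always madvise [never]"))
```

Calling `thp_is_disabled()` on a database host before starting mongod would flag a machine where the setting was missed.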


Availability

http://avstop.com/ac/flighttrainghandbook/imagel4b.jpg

Avoid Critical Data Centers

[Diagram: three-member replica set (P, S, S) split across only two data centers, DC1 and DC2]

[Diagram: P in DC1, S in DC2, S in DC3]

[Diagram: P in DC1, S in DC2, S in AWS]

[Diagram: P in DC1, S in DC2, Arbiter in DC3]

[Diagram: P in DC1, Arbiter in AWS, S in DC2, with a member down for maintenance]
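The layouts above differ in whether losing one data center still leaves a voting majority. The sketch below checks that for any layout; the function name is hypothetical, arbiters are counted as voting members, and the two-data-center example assumes two members sit in DC1, which the diagram residue does not confirm.

```python
def critical_data_centers(voting_members_by_dc):
    """Return the data centers whose loss leaves no voting majority.

    voting_members_by_dc maps a data center name to the number of voting
    replica set members (data-bearing nodes and arbiters) hosted there.
    """
    total = sum(voting_members_by_dc.values())
    majority = total // 2 + 1
    return sorted(
        dc for dc, members in voting_members_by_dc.items()
        if total - members < majority
    )


# Two data centers: whichever one holds two of three members is critical.
print(critical_data_centers({"DC1": 2, "DC2": 1}))
# One member per site, including a third site or cloud region: none is critical.
print(critical_data_centers({"DC1": 1, "DC2": 1, "DC3": 1}))
```

The same check explains the arbiter layouts: the arbiter holds no data, but placing its vote at a third site keeps a majority reachable when either main data center is lost.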


Emergency Procedures

https://spinoff.nasa.gov/spinoff2002/images/070.jpg

Backup and Recovery
• File System Snapshot
• MMS Cloud
• Ops Manager
• mongodump

PERFORM DRILLS OFTEN AND ROUTINELY

Document your Procedures
• Include ETAs
• Follow procedures in docs.mongodb.org

Production Ready Architecture

[Diagram: architecture with a load balancer (L.B.)]

Classic Failure Scenario

• Unindexed queries
• Lead to collection scans
• Result in high latencies
• Cause memory exhaustion
• CASCADING FAILURE

Circuit Breaker

Trigger Conditions
• Latency: stats.getMean() >= max
• OpsPerSecond: stats.getN() >= max
• ConcurrentOperations: stats.getN()*stats.getMean() >= max

https://github.com/breinero/Firehose
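The three trigger conditions can be sketched over a sliding window of recent operations. This is an illustration, not the Firehose implementation: the class and method names are hypothetical, latencies are assumed to be in seconds, and the window is assumed to be one second so that the sample count equals ops per second. The concurrency estimate is the product of throughput and mean latency, as in the slide.

```python
from collections import deque


class CircuitBreaker:
    """Trips on mean latency, ops per second, or estimated concurrency."""

    def __init__(self, max_latency_s, max_ops_per_s, max_concurrent, window_s=1.0):
        self.max_latency_s = max_latency_s
        self.max_ops_per_s = max_ops_per_s
        self.max_concurrent = max_concurrent
        self.window_s = window_s
        self._samples = deque()          # (finished_at, latency_s), oldest first

    def record(self, finished_at, latency_s):
        self._samples.append((finished_at, latency_s))

    def _stats(self, now):
        # Drop samples that have fallen out of the sliding window.
        while self._samples and self._samples[0][0] <= now - self.window_s:
            self._samples.popleft()
        count = len(self._samples)
        mean = sum(lat for _, lat in self._samples) / count if count else 0.0
        return count, mean

    def tripped(self, now):
        """Return the list of trigger conditions currently met (empty if none)."""
        count, mean = self._stats(now)
        ops_per_s = count / self.window_s
        reasons = []
        if mean >= self.max_latency_s:
            reasons.append("latency")
        if ops_per_s >= self.max_ops_per_s:
            reasons.append("ops_per_second")
        if ops_per_s * mean >= self.max_concurrent:
            reasons.append("concurrent_operations")
        return reasons


breaker = CircuitBreaker(max_latency_s=0.5, max_ops_per_s=100, max_concurrent=10)
for i in range(4):
    breaker.record(finished_at=10.0 + i * 0.1, latency_s=1.0)
print(breaker.tripped(now=10.5))
```

When `tripped` returns a non-empty list, the application can shed or queue requests instead of piling more work onto an already slow database, which is what interrupts the cascading failure described earlier.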


Client Side

• Don't use ensureIndex() in application
• Look out for connection bombs
  - --maxConns
• DO use operation timeouts
• DON'T cause socket timeouts
  - Lower keepalives
• Avoid retry bombs
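A retry bomb happens when every client retries a failed operation immediately and indefinitely. One common way to avoid it is a bounded number of attempts with exponential backoff and random jitter, sketched below. The function name, the defaults and the choice of exception types are assumptions for illustration; a real driver raises its own exception classes.

```python
import random
import time


def call_with_backoff(operation, max_attempts=4, base_delay_s=0.1, max_delay_s=2.0,
                      sleep=time.sleep, rng=random.random,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call `operation`, retrying a bounded number of times with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise                    # give up instead of retrying forever
            # The delay cap doubles with each failed attempt, up to max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the cap so that many
            # clients do not all retry at the same instant.
            sleep(rng() * cap)


attempts = {"count": 0}


def flaky_operation():
    """Stand-in for a database call that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


print(call_with_backoff(flaky_operation, sleep=lambda seconds: None))
```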


Requirements & Specs

Make a DevOps Contract
• Database Access Requirements
• Database Access Fulfillment Specification
• Cluster Configuration
• Monitoring and Alerting Specification

Monitoring

• Opcounters
• Memory
• Page Faults
• Queues
• Replication Lag
• Oplog Window
• Background Flush Average
• Disk space

Thanks!
