Deployment Preparedness
Production Preparedness
{ name: 'Bryan Reinero',
  title: 'Developer Advocate',
  twitter: '@blimpyacht',
  code: 'github.com/breinero',
  email: '[email protected]' }
Deploy with Joy!
Production Checklist
• Proper Infrastructure
• Proper Configuration
• Proper Monitoring
• Emergency Procedures
Infrastructure Sizing
• RAM
• CPU
• Disk Size
• I/O Bandwidth
• Availability
Sizing
• Indexes need to be in RAM
• Working set needs to be in RAM
• I/O Bandwidth
  - Write load
  - Index updates
  - Working set migration
{ _id: ObjectId(),
  tour: UUID,
  user: UUID,
  name: "Doug's Dogs",
  desc: "The best hot-dog",
  clues: [
    "Hungry for a Coney Island?",
    "Ask for Dr. Frankenfurter",
    "Look for the hot dog stand"
  ],
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  }
}
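The sizing rule (indexes plus working set in RAM) can be turned into a rough back-of-the-envelope check. This is a minimal sketch, not a MongoDB API: the function name, the `hot_fraction` estimate, and the 80% headroom factor are all assumptions chosen for illustration.

```python
def fits_in_ram(index_bytes, data_bytes, hot_fraction, ram_bytes, headroom=0.8):
    """Estimate whether indexes plus the hot part of the data fit in RAM.

    hot_fraction: assumed share of the data that is touched regularly (0..1).
    headroom:     assumed share of RAM available to the database (hypothetical 80%).
    Returns (estimated_working_set_bytes, fits).
    """
    working_set = index_bytes + data_bytes * hot_fraction
    return working_set, working_set <= ram_bytes * headroom


# Hypothetical numbers: 6 GB of indexes, 20 GB of data, 25% of it hot, 16 GB of RAM.
GB = 10**9
working_set, fits = fits_in_ram(6 * GB, 20 * GB, 0.25, 16 * GB)
print(f"working set ~ {working_set / GB:.1f} GB, fits: {fits}")
```

The real inputs would come from measured index and data sizes; the point is only that the estimate should be written down and compared against the hardware before deployment.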
Load Testing
• Test it like you use it, benchmarks don’t count
• Test to failure
• Instrument your code!
https://github.com/breinero/Firehose
https://github.com/ParsePlatform/flashback
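"Instrument your code" can be as small as a timer wrapped around each database call. The sketch below is one possible shape, using only the Python standard library; the class name `OpStats` and its methods are hypothetical and are not taken from Firehose or flashback.

```python
import time
from contextlib import contextmanager
from statistics import mean


class OpStats:
    """Collects per-operation latencies so a load test can report them."""

    def __init__(self):
        self.latencies = {}  # operation name -> list of durations in seconds

    @contextmanager
    def timed(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped operation raised.
            elapsed = time.perf_counter() - start
            self.latencies.setdefault(name, []).append(elapsed)

    def summary(self, name):
        samples = self.latencies.get(name, [])
        if not samples:
            return {"count": 0}
        return {"count": len(samples), "mean_s": mean(samples), "max_s": max(samples)}


stats = OpStats()
with stats.timed("find_by_user"):
    sum(range(1000))  # stand-in for a real query
print(stats.summary("find_by_user"))
```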
Growth
[Figure: three charts of Load over time, each with Warn and Saturation threshold lines. Panels: 1K Ops / Second, Memory, Input Output.]
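The growth charts suggest a simple question: how long until load reaches the Warn or Saturation line? A minimal sketch follows, assuming linear growth, which is an assumption of this example and not something the slides state. The function name and the sample numbers are hypothetical.

```python
import math


def periods_until(current, growth_per_period, threshold):
    """Number of periods until a linearly growing metric reaches a threshold.

    Returns 0 if the threshold is already reached, and None if the metric
    is not growing (it would never reach the threshold).
    """
    if current >= threshold:
        return 0
    if growth_per_period <= 0:
        return None
    return math.ceil((threshold - current) / growth_per_period)


# Hypothetical values: 4K ops/s today, growing 1K ops/s per month.
print("months until warn:", periods_until(4, 1, 8))
print("months until saturation:", periods_until(4, 1, 10))
```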
Monitoring
Baseline
• MongoDB Cloud Manager
• MongoDB Ops Manager
• Nagios, Zenoss, …
Detailed Query Specific
• mongotop
• db.currentOp()
• Query Profiler
• mtools
Forensics
2014-08-08T21:15:25.181-0500 [conn1026] getmore myDB.myCollection cursorid:100012502307 ntoreturn:0 keyUpdates:0 numYields:1406953 locks(micros) r:11887558422 nreturned:289 reslen:4208149 28795759ms
2014-08-07T15:31:51.714-0500 [conn7] command myDB.$cmd command: createIndexes { createIndexes: "myColletion", indexes: [ { key: { Claims.ICN: 1.0 }, name: "test.a_1" } ] } keyUpdates:0 numYields:0 locks(micros) r:14476 w:25176930351 reslen:113 25176955ms
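Log lines like these end with the operation's duration in milliseconds, which makes them easy to rank. The sketch below assumes the line shape shown above (timestamp, connection in brackets, operation, namespace, trailing `<n>ms`); other server versions log differently, and the function names are hypothetical. For real work, mtools already covers this.

```python
import re

# Assumed shape: "<timestamp> [<conn>] <op> <namespace> ... <millis>ms"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+\[(?P<conn>[^\]]+)\]\s+(?P<op>\w+)\s+(?P<ns>\S+).*?\s(?P<ms>\d+)ms$"
)


def parse_slow_op(line):
    """Return a dict for a log line ending in '<n>ms', or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "ts": m.group("ts"),
        "conn": m.group("conn"),
        "op": m.group("op"),
        "ns": m.group("ns"),
        "millis": int(m.group("ms")),
    }


def slowest(lines, threshold_ms=100):
    """Parsed operations at or above threshold_ms, slowest first."""
    ops = [op for op in map(parse_slow_op, lines) if op and op["millis"] >= threshold_ms]
    return sorted(ops, key=lambda op: op["millis"], reverse=True)
```

Running `slowest()` over a log file gives a quick list of the operations worth investigating first.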
Logging
• Save and Rotate
• Don’t use --quiet
• --logpath != --dbpath
• Use component verbosity for debugging
Security
• Firewall
• Bind IP
• Encrypt Networks
• Enable Access Control
• Don’t enable REST interface
• Auditing
Limit exposure and use the Principle of Least Privilege
Tuning
Best Practices
• Disable Transparent Huge Pages
• NTP to synchronize time
• Set ulimits
• Use XFS or Ext4
• Don’t use NFS
• Disable NUMA
• Have swap
Read Production Notes
Tunables
• Set IO Scheduler NOOP
• Adjust readaheads (MMAPv1)
• Avoid cgroups
• SELinux (?)
• RAID
Availability
http://avstop.com/ac/flighttrainghandbook/imagel4b.jpg
Avoid Critical Data Centers
[Diagram: P, S, S spread across two data centers, DC1 and DC2]
[Diagram: P in DC1, S in DC2, S in DC3]
[Diagram: P in DC1, S in DC2, S in AWS]
[Diagram: P in DC1, S in DC2, Arbiter in DC3]
[Diagram: P in DC1, Arbiter in AWS, S in DC2. Caption: Down for maintenance]
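A data center is "critical" when losing it leaves the replica set without a majority of voting members, so no primary can be elected. The check below is a minimal sketch of that arithmetic. The function name and the member labels are hypothetical, and the two-data-center placement in the example is an assumed reading of the first diagram.

```python
from collections import Counter


def critical_sites(placement):
    """Sites whose loss leaves the surviving voters without a strict majority.

    placement maps each voting member (arbiters included) to its site.
    """
    votes = Counter(placement.values())
    total = sum(votes.values())
    majority = total // 2 + 1
    return sorted(site for site, n in votes.items() if total - n < majority)


# Assumed layout: two members in DC1, one in DC2. Losing DC1 loses the majority.
print(critical_sites({"P": "DC1", "S1": "DC1", "S2": "DC2"}))

# One voter per site across three sites: no single site is critical.
print(critical_sites({"P": "DC1", "S": "DC2", "Arbiter": "DC3"}))
```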
Emergency Procedures
https://spinoff.nasa.gov/spinoff2002/images/070.jpg
Backup and Recovery
• File System Snapshot
• MMS Cloud
• Ops Manager
• mongodump
PERFORM DRILLS OFTEN AND ROUTINELY
Document your Procedures
• Include ETAs
• Follow procedures in docs.mongodb.org
Production Ready Architecture
[Diagram: architecture behind a load balancer (L.B.)]
Classic Failure Scenario
• Unindexed queries
• Leads to collection scans
• Results in high latencies
• Causes memory exhaustion
• CASCADING FAILURE
Circuit Breaker
Trigger Conditions
• Latency: stats.getMean() >= max
• OpsPerSecond: stats.getN() >= max
• ConcurrentOperations: stats.getN()*stats.getMean() >= max
https://github.com/breinero/Firehose
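The three trigger conditions can be sketched over a sliding window of recent operations. This is an illustrative Python version, not the Firehose implementation: the class, its parameters, and the one-second default window are assumptions. The concurrency estimate multiplies operations per second by mean latency, mirroring the `getN()*getMean()` product above.

```python
import time
from collections import deque


class CircuitBreaker:
    """Trips when recent latency, throughput, or estimated concurrency is too high."""

    def __init__(self, max_latency_s, max_ops_per_s, max_concurrent,
                 window_s=1.0, clock=time.monotonic):
        self.max_latency_s = max_latency_s
        self.max_ops_per_s = max_ops_per_s
        self.max_concurrent = max_concurrent
        self._window_s = window_s
        self._clock = clock
        self._samples = deque()  # (completion_time, latency_s)

    def record(self, latency_s):
        """Record one completed operation and its latency in seconds."""
        now = self._clock()
        self._samples.append((now, latency_s))
        self._evict(now)

    def _evict(self, now):
        # Drop samples that fell out of the sliding window.
        while self._samples and now - self._samples[0][0] > self._window_s:
            self._samples.popleft()

    def tripped(self):
        self._evict(self._clock())
        n = len(self._samples)
        if n == 0:
            return False
        mean_latency = sum(lat for _, lat in self._samples) / n
        ops_per_s = n / self._window_s
        concurrent = ops_per_s * mean_latency  # estimated operations in flight
        return (mean_latency >= self.max_latency_s
                or ops_per_s >= self.max_ops_per_s
                or concurrent >= self.max_concurrent)
```

A caller would check `tripped()` before sending work to the database and shed or queue requests while it returns True, which is what keeps slow queries from turning into a cascading failure.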
Client Side
• Don’t use ensureIndex() in application
• Look out for connection bombs
  - --maxConns
• DO use operation timeouts
• DON’T cause socket timeouts
  - Lower keepalives
• Avoid retry bombs
Requirements & Specs
Make a DevOps Contract
• Database Access Requirements
• Database Access Fulfillment Specification
• Cluster Configuration
• Monitoring and Alerting Specification
Monitoring
• Opcounters
• Memory
• Page Faults
• Queues
• Replication Lag
• Oplog Window
• Background Flush Average
• Disk space
Thanks!