An SNMP based failure detection service
Matthias Wiesmann, Péter Urbán, Xavier Défago
Swiss National Science Foundation
Japan Advanced Institute of Science and Technology (JAIST)
Overview
• Failure Detection
• SNMP
• SNMP-FD Service
• Conclusion & Future Work
Failure detection
• Basic abstraction for distributed systems
• Needed for solving fundamental problems:
• Consensus, Atomic broadcast, Atomic commitment
• Formal definitions: P, ♢S, Ω, φ, etc.
• Also outside distributed system community
• System, Network, Cluster management, Middleware, Grid systems
• Common problem → common solution
Failure detection service
• Factor out failure detection
• Shared, separate implementation
• Complex, adaptive policies
• Proposed in the literature
• Implemented by middleware systems, grid systems
• Problem: no standard
• Every system has its own service and interfaces
Standard Needed
• External standard
• Connect clients of the service to the service
• Application, Middleware, Monitoring and administration tools
• Suspicion notifications, configuration
• Internal standard
• Use standard for heartbeat messages, internal queries
• We need some form of Middleware
Why SNMP?
• Simple Network Management Protocol
• Now present in many network-attached devices
• Routers, Smart Switches, UPS, Printers…
• Simple Middleware
• Offers basic features
• Relatively lightweight
• Many applicable standards defined
• Monitoring, fault reporting.
SNMP Architecture
[Figure: Network Management Station, Network Equipment, SNMP Agent, MIB, Monitored Object]
• Network Management Station
• Manages equipment
• Equipment’s agent
• Exposes Management Information Base (MIB)
• Sends asynchronous messages (traps) to NMS.
SNMP-FD service
[Figure: SNMP-FD architecture. Node B runs the monitored processes (p1B, p2B, p3B) and an SNMP agent exposing the Host resources MIB, the Notification & Target MIB, and the Heartbeats MIB through the dæmon's monitored interface. Node A runs the monitoring process (pA) and an SNMP agent exposing standard MIBs (Notification & Target MIB, Alarm MIB) and custom MIBs (Heartbeat Knowledge MIB, Process Knowledge MIB, Failure Detection MIB) through the dæmon's monitoring interface. Arrows are labelled Get, Set / Get, Traps, Heartbeat, and Notification; the legend distinguishes the physical link, messages, and queries / writes.]
Service Architecture
• Distributed architecture
• Dæmon runs on each host
• Monitor local processes
• Exchange heartbeats
• Export MIB information
• Can interact with equipment
• Similar to other failure detection service architectures
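The heartbeat exchange listed above can be illustrated with a minimal timeout-based detector. This is only a sketch under stated assumptions, not the SNMP-FD implementation: the class name `HeartbeatDetector`, its methods, and the fixed timeout are hypothetical, and no SNMP transport is involved. Times are passed in explicitly so the logic is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/** Timeout-based heartbeat failure detector (illustrative sketch, hypothetical names). */
class HeartbeatDetector {
    private final long timeoutMillis;
    // Arrival time of the most recent heartbeat per monitored host, in milliseconds.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat received from the given host at time nowMillis. */
    void onHeartbeat(String host, long nowMillis) {
        lastHeartbeat.put(host, nowMillis);
    }

    /** A host is suspected if it was never heard from or its last heartbeat is older than the timeout. */
    boolean isSuspected(String host, long nowMillis) {
        Long last = lastHeartbeat.get(host);
        return last == null || nowMillis - last > timeoutMillis;
    }
}

class HeartbeatDetectorDemo {
    public static void main(String[] args) {
        HeartbeatDetector fd = new HeartbeatDetector(250);
        fd.onHeartbeat("nodeB", 1_000);
        System.out.println(fd.isSuspected("nodeB", 1_100)); // queried within the timeout
        System.out.println(fd.isSuspected("nodeB", 1_300)); // queried after the timeout
    }
}
```

An adaptive detector would replace the fixed timeout with one derived from observed heartbeat arrival times.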
Features
• Uses many standard MIBs
• Host Resources MIB → Process state
• Target & Notification MIB → Subscriber description
• Alarm MIB → Current suspicion description
• Lightweight messaging:
• Heartbeats are UDP traps
• Leverages SNMP
• Configuration, Security, Interoperability with devices
New MIBs
• Heartbeat MIB
• Describes heartbeats
• Failure detection MIB
• Describes failure detector configuration
• Described in a standard SMI (ASN.1-based) syntax file
• Accessible using standard tools
SNMP table: SnmpFD::kbhbTable
index                  interval  lastbeat  lasttime      count  lostcount  mean     stddev   min  max
kt-dhcp23.jaist.ac.jp  100       2678      0:0:04:38.49  2671   0          99.9955  28.3417  0    356
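The count, mean, stddev, min and max columns of kbhbTable look like running statistics over heartbeat arrivals. As a sketch of how such values could be maintained without storing every sample, the class below applies Welford's online algorithm to inter-arrival times. The class name `HeartbeatStats`, the reading of the columns as inter-arrival statistics in milliseconds, and the use of the sample standard deviation are all assumptions; the slides do not say how SNMP-FD computes these values.

```java
/** Running statistics over heartbeat inter-arrival times (Welford's online algorithm, hypothetical names). */
class HeartbeatStats {
    private long count;              // number of inter-arrival samples seen so far
    private double mean;             // running mean, in milliseconds
    private double m2;               // running sum of squared deviations from the mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private long lastArrival = -1;   // arrival time of the previous heartbeat, -1 if none yet

    /** Feeds one heartbeat arrival time; the first arrival only sets the reference point. */
    void onHeartbeat(long arrivalMillis) {
        if (lastArrival >= 0) {
            addSample(arrivalMillis - lastArrival);
        }
        lastArrival = arrivalMillis;
    }

    private void addSample(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);    // uses the updated mean, as Welford's method requires
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    long count() { return count; }
    double mean() { return mean; }
    /** Sample standard deviation; 0 when fewer than two samples are available. */
    double stddev() { return count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0; }
    double min() { return min; }
    double max() { return max; }
}
```

Keeping only the count, the mean, and the sum of squared deviations means each update takes constant time and memory, which suits a dæmon that processes a heartbeat every 100 ms.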
Implementation
• Validate the architecture
• Implemented in Java
• Open source libraries
• SNMP4J, SNMP4J Agent, Apache Libraries
• Available for download
• http://ddsg.jaist.ac.jp//en/projects/snmp-fd/
Performance
• Simple failure detector
• Stable performance
• Consistent with expectations
• Validates approach
• Issues
• High latency
• High CPU load on receiver
• SNMP4J issue: the code spends 95% of its time there
[Figure: mistake rate (mistakes per second, log scale) and query accuracy (%) plotted against timeout (ms)]
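The two quantities on the performance plot, mistake rate and query accuracy, can be computed from a log of wrong suspicions. The sketch below assumes the usual definitions from the failure-detector quality-of-service literature: mistake rate is the number of wrong suspicions per second, and query accuracy is the fraction of time during which a correct process is not suspected. The class name `DetectorQos` and the representation of each mistake as a `{startMs, endMs}` pair are hypothetical, and the intervals are assumed not to overlap.

```java
/** Failure-detector quality-of-service metrics (illustrative sketch, hypothetical names). */
class DetectorQos {
    /** Mistakes per second: number of wrong suspicions divided by the observation time. */
    static double mistakeRate(long[][] mistakes, long observedMs) {
        return mistakes.length / (observedMs / 1000.0);
    }

    /** Fraction of the observation time during which the correct process was not suspected. */
    static double queryAccuracy(long[][] mistakes, long observedMs) {
        long wrongMs = 0;
        for (long[] m : mistakes) {
            wrongMs += m[1] - m[0];   // duration of one wrong suspicion
        }
        return 1.0 - (double) wrongMs / observedMs;
    }
}

class DetectorQosDemo {
    public static void main(String[] args) {
        // Two wrong suspicions (5 ms and 15 ms long) during a 10 s observation.
        long[][] mistakes = { {1_000, 1_005}, {5_000, 5_015} };
        System.out.println(DetectorQos.mistakeRate(mistakes, 10_000));
        System.out.println(DetectorQos.queryAccuracy(mistakes, 10_000));
    }
}
```

A longer timeout typically lowers the mistake rate and raises query accuracy, at the cost of slower detection of real failures.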
Conclusion
• Failure detection service
• Standard based approach desirable
• No need for new standard → use existing SNMP
• Approach validated by prototype
• Can be extended with more advanced FD techniques
• Hierarchical FD, adaptive FD, etc.
• Can interact with network devices
• For instance link-down traps
Future work
• Look into performance issues
• Optimizations needed in the SNMP4J library.
• Port the framework to the latest version of SNMP4J
• Support for AgentX
• Implement framework as sub-agent
• Implement more advanced failure detectors
• Handle network equipment data