© 2009 IBM Corporation IBM Research SOAR 2010 keynoteJune 7, 2010 Engineering Decentralized...
-
Upload
madlyn-poole -
Category
Documents
-
view
217 -
download
2
Transcript of © 2009 IBM Corporation IBM Research SOAR 2010 keynoteJune 7, 2010 Engineering Decentralized...
© 2009 IBM Corporation SOAR 2010 keynote June 7, 2010
IBM Research
Engineering Decentralized Autonomic Computing Systems
Jeff Kephart ([email protected])IBM Thomas J Watson Research CenterHawthorne, NY
2
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Building Effective Multivendor Autonomic Computing SystemsICAC 2006 panel
Moderator: Omer Rana (Cardiff University) Panelists
– Steve White (IBM Research)
– Julie McCann (Imperial College, London)
– Mazin Yousif (Intel Research)
– Jose Fortes (University of Florida)
– Kumar Goswami (HP Labs)
Discussion– Multi-vendor systems are an unavoidable reality – we must cope
– Standards are key to interoperability: web services (WS-*), policies, resource information (CIM); probably others too, for monitoring probes, types of events, semantic stack (OWL) …
– Virtualization is key – layer of abstraction providing uniform access to resources
– “We must identify key application scenarios, such as data centers and Grid-computing applications, that would enable wider adoption of multivendor autonomic systems.”
Rana and Kephart, IEEE Distributed Systems Online, 2006
3
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Autonomic Computing and Agents AC definition
– “Computing systems that manage themselves in accordance with high-level objectives from humans.” Kephart & Chess, IEEE Computer 2003
Agents definition– “An encapsulated computer system, situated in some
environment, and capable of flexible, autonomous action in that environment in order to meet its design objectives.” Jennings, et al, A Roadmap of Agent Research and Development, JAAMAS 1998
Autonomic elements ~ agents Autonomic systems ~ multi-agent systems
Mutual benefit for AC & Agents communities– AC is killer app for agents
– Agents community has much to contribute to AC
Service-oriented computing may evolve into agent-oriented computing over time
– Agents more proactive
– Agent standards should grow from SOC standards
Brazier, Kephart, Parunak, and Huhns, Internet Computing, June 2009
Article resulted from brainstorming session at Agents for Autonomic Computing workshop, ICAC 2008
4
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
The focus of this talk
I start from two premises:– Autonomic systems are “Computing systems that manage themselves in
accordance with high-level objectives from humans.”
– Autonomic systems ~ multi-agent systems
Which leads to…
How do we get a (decentralized) Multi-Agent System to act in accordance with high-level objectives?
My thesis– Objectives should be expressed in terms of value
– Value is an essential piece of information that must be processed, transformed, and communicated by agents
5
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Outline
Autonomic Computing and Multi-Agent Systems
Utility Functions– As means for expressing high-level objectives
– As means for managing to high-level objectives
Examples– Unity, and its commercialization
– Power and performance objectives and tradeoffs
– Applying utility concepts at the data center level
Challenges and conclusions
6
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
How to represent high-level policies?
Utility functions map any possible state of a system to a scalar value
They can be derived from– a service level agreement
– preference elicitation techniques
– simple templates, e.g. specify response time thresholds and “importance” levels
They are a very useful representation for high-level objectives
– Value can be propagated and/or converted from service level parameters to resource control parameters
PossibleState
1
PossibleState
2
PossibleState
3
a1
a2
a3
CurrentState
S
Value(RT) =
Kephart and Walsh, Policy04
7
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
How to manage with high-level policies?
Elicit utility function U(S) expressed in terms of service attributes S
Obtain models of how each attribute Si depends on set of controllable parameters C and observable parameters O
– Can express set of models as S(C, O)
– E.g., RTgold(routing weights, request rate)
– Models can be obtained, by experiment/learning, theory (queuing), …
Transform from service utility U(S) to resource utility U’ by substituting models
– U(S) = U(S(C,O)) = U’(C,O)
Optimize resource utility. As observable O changes, set C to values that maximize U’(C,O)
U(RT, RPO)
Recovery Point
ObjectiveResponse Time
U
cpu
Backup rate b
U’
U’(cpu, b; )
Transform
8 © 2005 IBM Corporation
Finding the optimal control parameters
NetUtil(cpu, b; )
Util(RT, RPO)
RT
RPO
b
cpu
b*=0.874575cpu*=2.49134U*=152.661RT*=99.5775
b
cpu
b*=1.19931cpu*=3.65144U*=137.414RT*=95.4449
b
cpu
b*=2.05265cpu*=8.58375U*=75.8644RT*=88.6853
The mapping from service level to low-level parameters is affected by observables (like in this case), which in turn affects (cpu*, b*).
9
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Outline
Autonomic Computing and Multi-Agent Systems
Utility Functions– As means for expressing high-level objectives
– As means for managing to high-level objectives
Examples– Unity, and its commercialization
– Power and performance objectives and tradeoffs
– Applying utility concepts at the data center level
Challenges and conclusions
10
IBM Research
SOAR 2010 Keynote © 2007 IBM CorporationJune 7, 2010
Unity Data Center Prototype: Experimental setup
Value(#srvrs)
Batch
AppManager
Trade3
Server Server Server Server Server Server Server Server
Value(#srvrs)
Value(#srvrs)
Demand(HTTP req/sec)
Trade3
AppManager
Value(RT)
WebSphere 5.1
DB2
AppManager
WebSphere 5.1
DB2
Value(#srvrs)
Maximize Total SLA Revenue
5 sec
Value(RT)
Demand(HTTP req/sec)
Trade3
Trade3
ResourceArbiter
Chess, Segal, Whalley and White, Unity: Experiences with a Prototype Autonomic Computing System, ICAC 2004
11
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
ResourceArbiter
Application Environment 1
ServersServersServers
U1(S1)ApplicationManager
Application Environment 2
ServersServersServers
U2(S2)ApplicationManager
R*1 R*
2
Transformation ofservice-level utility toresource-level utility:
U(R) = maxCU(S(C,R))
U2(R2)U1(R1)
Number of servers
Ma
x U
tili
ty
U(Ri)
The Unity Datacenter Prototype
Decision problem:Allocate resources R* = argmaxRUi(Ri) Effectively maximizes Ui(Si)
12
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Commercializing Unity
Barriers are not just technical in nature
Strong legacy of successful product lines must be respected; otherwise– Difficult for the vendor
– Risk alienating existing customer base
Solution: Infuse agency/autonomicity gradually into existing products– Demonstrate value incrementally at each step
We worked with colleagues at IBM Research and IBM Software Group to implement the Unity ideas in two commercial products:– Application Manager: IBM WebSphere Extended Deployment (WXD)
– Resource Arbiter: IBM Tivoli Intelligent Orchestrator (TIO)
Das et al., ICAC 2006
13
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Commercializing The Unity Datacenter Prototype
Tivoli Intelligent
Orchestrator
Application Environment 1
ServersServersServers
U1(S1)WebSphere Extended
Deployment
Application Environment 2
ServersServersServers
U2(S2)
Decision problem:Allocate resources R* = argmaxRUi(Ri) Effectively maximizes Ui(Si)
R*1 R*
2
Transformation ofservice-level utility toresource-level utility:
U(R) = maxCU(S(C,R))
U2(R2)U1(R1)
Number of servers
Ma
x U
tili
ty
U(Ri)
WebSphere Extended
Deployment
14
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Coordinating two commercial products
WXD uses utility functions internally, but
– Doesn’t communicate them externally
– Doesn’t compute them for n other than current value
– Reports “needed speed” to Objective Analyzer, which converts it to PoB
Tivoli Intelligent Orchestrator
U(S)WebSphere Extended
Deployment
V(n)
TIO doesn’t use utility functions
– Uses Probability of Breach
– PoB: probability that SLA will be violated if n servers are given to an applicationIf I give you n
servers, how valuable will that be?
If I give you n servers, how often will you exceed the response time goal?
I need 300M CPU cycles/sec
DesiredActual
15
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Original
WXD 5.1TIO 3.1
Utility-based Interactions between WXD and TIO: Step 1
WebSphere XD
Objective Analyzer
TIO Transform
Resource Allocations: n
ResourceNeeds(speed)
ResourceNeeds in PoB(n)
Policy Engine
TIO 3.1
ResourceNeeds in Fitness(n)
Utility(current n)
TIO cannot make well-founded resource allocation decisions WS XD can’t articulate its needs to TIO PoB not commensurate with utility
16
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Intermediate
WXD 6.0.2
TIO 3.1
Utility-based Interactions between WXD and TIO: Step 2
WebSphere XD
Objective Analyzer
TIO Transform
Resource Allocations: n
PoB(n)
Policy Engine
TIO 3.1
Fitness(n)
Utility(current n)
WS XD research team added ResourceUtil interface of WXD We developed a good heuristic for converting ResourceUtil to PoB in Objective Analyzer
Interpolate discrete set of ResourceUtil points and map to PoB This PoB better reflects WS XD’s needs
ResourceUtility(n)
17
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
New
Modified WXD 6.0.2
Modified TIO 3.1
Utility-based Interactions between WXD and TIO: Step 3
WebSphere XD
Objective Analyzer
1
Resource Allocations: n
Policy Engine
TIO 3.1
Utility(current n)
ResourceUtility(n)
We modified TIO to use ResourceUtil(n) directly instead of PoB(n) Most mathematically principled basis for TIO allocation decisions It enables TIO to be in perfect synch with the goals defined by WS XD Basic scheme can work, not just for XD, but for any other entity that may be requesting resource, provided that it can estimate its own utilities
ResourceUtility(n)
ResourceUtility(n)
18
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Outline
Autonomic Computing and Multi-Agent Systems
Utility Functions– As means for expressing high-level objectives
– As means for managing to high-level objectives
Examples– Unity, and its commercialization
– Power and performance objectives and tradeoffs
– Applying utility concepts at the data center level
Challenges and conclusions
19
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Multi-agent management of performance and power
To manage performance and power objectives and tradeoffs, we have explored multiple variants on a theme
Two separate agents: one for performance, the other for power
Various control parameters, various coordination and communication mechanisms
– Power controls: clock frequency & voltage, sleep modes, …
– Performance controls: routing weights, number of servers, VM placement …
– Coordination: unilateral with/without FYI, bilateral interaction, mediated, …
I show several examples ….
20
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Utility functions for performance-power tradeoffs
Performance agent– IBM WebSphere
middleware adjusts load balancing, CPU application placement parameters ,etc to maximize performance
Power agent– Autonomic Management of
Energy throttles CPU clock to slow down processor and save power
An inherent conflict!
Utility function formulation: Maximize U(perf, pwr) = U(perf) – Pwr; Pwr < PwrMax
Kephart et al., ICAC 2007
21
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Coordinating Multiple Autonomic Managers(Actually, only 2 or 3)
No frequency feedback With frequency feedback
Perf data (subset)
Pwr data Clock frequency
Perf Pwr
Perf data
Perf data (subset)
Pwr data
Perf Pwr
Perf data
Kephart et al., Coordinating Multiple Autonomic Managers to Achieve Specified Power-Performance Tradeoffs, ICAC 2007
22
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Interaction between power and performance agents How might semi-autonomous
power & performance agents interact? – Mediated through coordinator
agent, or
– Direct bi/multi-lateral interactions Scenario (with mediation)
– Performance manager observes subset sperf of system state, and controls application placement and load balancing weights
– Power manager observes subset of spwr of system state, and controls on/off state of servers
– Coordinator knows overall power-performance tradeoffs as expressed in joint utility function; queries performance, power agents for likely impact of n servers on to derive optimum n*
I relinquish control of server i to
you
Performance Mgr
sperf spwr
System
Coordinator
UPP(RT, Pwr)
Power Mgr
On/Offi
System
Placement PRouting weights wi
What-if(non; Pwr)?What-if(non; RT)?
RTest(non) Pwrest(non)
n*onn*on
Kephart, Chan, Das, Levine, Tesauro, Rawson, Lefurgy. Coordinating Multiple Autonomic Managers to Achieve Specified Power-Performance Tradeoffs. ICAC 2007.(Emergent phenomena can occur when autonomic managers don’t communicate effectively.)
23
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Power-aware Application Placement Controller for WebSphere
WebSphere XD Node Group
BNode
5
BANode
2
CNode
3
BCNode
4
BANode
1
A
B
C
HighImportance
360ms
MediumImportance
600ms
LowImportance
840ms
PlacementController
PlacementController
13 machines, single-CPU, 3.2GHz Intel Xeon, hyperthreading disabled, 4GB RAM
Flow Controller(Pacifici et al. JSAC 2005)
(Karve et al., WWW 2006,
Tang et al., WWW 2007)
Steinder, Whalley, Hanson, Kephart, Coordinated Management of Power Usage and Runtime Performance, NOMS 2008Time (minutes)
Nu
mb
er o
f cl
ien
ts
Idea. Enhance WebSphere’s Placement Controller to place apps on minimal number of nodes required to satisfy SLA. Uses utility functions, models and optimization.
24
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Models, Optimization and Utility
4 nodes on 3 nodes on
2 nodes on
20% savings40% savings
Performance of sample Web application There is a tradeoff between good response time and low power consumption
To resolve the tradeoff, we provide a utility function U(RT, Pwr)– We experimented with three spanning wide spectrum
in power-saving preference
Performance manager (APC) computes– Best number of number of servers n*– Best application placement Placement*(n*)
For a given placement P on n nodes, APC uses offline, online models to compute– Estimated RT(P, n)– Estimated Pwr(P, n)– Estimated U(RT(P,n), Pwr(P,n)) = U’(P,n)
APC uses constraint optimization to determine (P*, n*) that maximizes U’(P,n)
25
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Experimental results(3 different utility functions)
1. Always meet SLAs
2. Always maximize performance
3. Permit 10% performance degradation for 10% power savings
Conclusion. Substantial power savings can be attained without violating the SLA.
26
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Coordinated Load Balancing and Powering Off Servers
DB2DB2
WebSphere
App. Server
WebSphere
App. Server
WebSphere
App. Server
WebSphere
App. Server
HTTP
Server
HTTP
Server
WebSphere
App. Server
WebSphere
App. ServerLoad
Generator
Load
Generator
Das, Kephart, Lefurgy, Tesauro, Levine, Chan Autonomic Multi-agent Management of Power and Performance in Data Centers, AAMAS 2008
Customers typically purchase enough servers to handle peak workloads
The average workload is often much less than peak (e.g. daily and weekly cycles)
Can we intelligently power down servers when they’re not needed?
27
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Baseline experiment
Thursday Friday
Time (minutes)
Wo
rklo
ad
(#
cli
en
ts)
Re
sp
on
se
ti
me
(m
se
c)
# s
erv
ers
on
P
ow
er
(wa
tts
)
Trade6 workload; NASA traffic
SLA
Response time
Power
# servers
28
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Models, optimization and utility
Thursday Friday
Time (minutes)
Wo
rklo
ad
(#
cli
en
ts)
Re
sp
on
se
ti
me
(m
se
c)
# s
erv
ers
on
P
ow
er
(wa
tts
)
Trade6 workload; NASA traffic
SLA
Response time
Power
# servers
Avg Power Savings = 27% No SLA violation
Here we use simple utility function– U(RT) = 1 if SLA satisfied– U(RT) = 0 if SLA not satisfied– U(RT, Pwr) = U(RT) – Pwr
Offline, we measure & compute models– RT (n, )– Pwr (n,
We substitute models into U to obtain– U’ (n, ) = U (RT (n, ), Pwr (n, ))
We pre-compute policy– n*() = argmaxn U’ (n, )
During the experiment, we plug a forecast of into n*()– We then modify n* a bit– We add some extra because latencies entailed
in turning servers back on can be several minutes
– We apply some heuristics to ensure that we don’t turn servers on and off too often
– We then turn on n* servers in the app tier
29
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Power-Performance Commercialization
Managed elements (worker machines)
Workload Mgr Agent
Hardware Mgr Agent
Hardware-related sensor data
Hardware-mgmt commands to
machines under HMgr’s control
•Cost functionsCost functions for machines•Grant/refusal of request for control
•Request/release of controlRequest/release of control over machines
Workload-related sensor data
Workload-mgmt commands to
machines under WMgr’s control
Hanson et al. Autonomic Manager for Power, NOMS 2010
30
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Outline
Autonomic Computing and Multi-Agent Systems
Utility Functions– As means for expressing high-level objectives
– As means for managing to high-level objectives
Examples– Unity, and its commercialization
– Power and performance objectives and tradeoffs
– Applying utility concepts at the data center level
Challenges and conclusions
31
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Data Center as a Multi-Agent System
Power Manager
Performance Mgr
Availability Manager
I want to turn that server off to save power
Reliability Manager
I want to process workload on that server
I want to use that server to achieve 3-fold redundancy
I want to leave that server in its present power state to reduce thermal stress
I want to perform valuable computational services
I want to increase the fan speed on that CRAC unit
32
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Multi-agent Data Centers
H2OIce
Pumps
ChillerComputer Room Air Conditioners
Utility
Central UPS
Power Distribution Units
Facility Network
DC Power
Low Voltage600VAC Eqpt
GeneratorParallel or Transfer Eqpt
Utility,Rates, Incentives SubstationCommunicating Revenue Meter
CHP Fuel Cell, MicroTurbine or Turbine Power
CoolingMonitoring and Control
CoolingTower
Medium Voltage>600VAC Eqpt
$
Utility
AlternativePower
Utility
Compute• Main Frames• Volume Servers• Blade Servers
Storage• SATA Disk• Tape• Blended
Network• Corporate Networks• VoIP• Integrated Blade/Switch
In-Row Power• Modular UPS• Rack Mount PDUs
In Row Cooling• Rear Door Heat Exchanger• Liquid Cooling Racks• Overhead Cooling
IT and Networks
Security…UPSBattery
IT
Raised Floor
Compute
Storage, Netw
ork
In-row Power
and Cooling
Coordinator
33
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
MAS for Data Center ManagementResearch Challenges Marketplace realities dictate de-centralized MAS solutions to
energy management– Interaction among agents responsible for different dimensions of
management– Interaction across layers of the stack
Architectural questions– What is the scope/boundary of the agents? Where are they situated? – What and how do they communicate?– Can a multi-agent approach work, using negotiating agents and
mediators to manage performance, power, availability, reliability …?– Are markets and auctions effective coordination mechanisms when
there are numerous agents and “goods”?• What are the goods in this case (e.g. one core in a multi-core system running
in turbo mode)• We may need hierarchical markets that extend across multiple data centers• What happens when data center markets are coupled to the global economy?
Algorithmic (and other) challenges– Building/tuning deterministic and statistical what-if models on the fly– Avoiding undesirable emergent phenomena (IBM and HP Research
have observed this!)– Eliciting preferences (tradeoffs between power, performance, …)
Beyond IT– There is much to be gained by coordinating workload placement, load
balancing, etc. with facilities management, e.g. co-managing cooling and workload migration
– Agents will represent PDUs, CRACs, chillers, etc., vastly increasing the size and variety of the MAS
Lubin, Kephart, Das and Parkes. Expressive Power-Based Resource Allocation for Data Centers. IJCAI 2009.(Exploring market-based resource allocation for data centers.)
34
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Conclusions Multi-agent systems offer a powerful paradigm for engineering
decentralized autonomic computing systems– But impractical to build from scratch using standard agent tools
Humans should express objectives in terms of value (utility functions)
Value can then be propagated, processed, and transformed by the agents, and used to guide their internal decisions and their communication with other agents
Key technologies required to support this scheme include– Utility function elicitation
– Learning
– Modeling / what-if modeling
– Optimization
– Agent communication, mediation
There is lots of architectural work to do !
35
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Backup
36
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Research questions
What are the agent boundaries?
What interfaces and protocols do they use to communicate?
What information do the agents share?– When negotiating services
– When provide/receiving services
How to compose agents into systems?– How do agents advertise their service?
– Can we use planning algorithms to compose them to achieve a desired end goal?
How do we motivate agents to provide services to one another?
We need to build these systems and subject them to realistic workloads– How do we encourage the community to build AC systems?
– Where do we get the realistic workloads?
37
IBM Research
SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010
Multi-agent systems and autonomic data centers I envision data centers of the future as a complex ecosystem of interacting semi-autonomous
entities – an autonomic, multi-agent system
Autonomic computing definition– “Computing systems that manage themselves in accordance with high-level objectives from humans.” Kephart and
Chess, A Vision of Autonomic Computing, IEEE Computer, Jan. 2003
Software agent definition– “An encapsulated computer system, situated in some environment, and capable of flexible, autonomous action in
that environment in order to meet its design objectives.” Jennings, Sycara and Wooldridge, A Road Map of Agent Research and Development, JAAMAS 1998
– Multi-agent systems: collections of agents that interact with one another to achieve individual and/or system goals
Agents will– represent, or be embedded in, different products from different vendors– reside at many levels of the management stack– manipulate control knobs at all levels of the stack (from hardware/firmware up through middleware and facilities)– collectively manage the data center to specified objectives and constraints (some relating to power)– interact in intended and unintended ways with one another, and with other types of automated management
processes directed towards maintaining high levels of performance, availability, reliability, security, etc.
This vision is a natural extrapolation of present-day facts and trends– Industry and academia are developing a multitude of control knobs and automated techniques to save energy– These will be incorporated into a multitude of management products from different vendors– They will operate simultaneously within and across multiple levels of the stack– Somehow, these products will need to work together effectively, requiring some cooperative interactions
Data centers are a “killer app” for multi-agent systems– Conversely, MAS architectural and algorithmic concepts are essential to energy-efficient, autonomic data centers