© 2009 IBM Corporation IBM Research SOAR 2010 keynoteJune 7, 2010 Engineering Decentralized...

© 2009 IBM Corporation SOAR 2010 keynote June 7, 2010

IBM Research

Engineering Decentralized Autonomic Computing Systems

Jeff Kephart ([email protected])IBM Thomas J Watson Research CenterHawthorne, NY

2

IBM Research

SOAR 2010 Keynote © 2009 IBM CorporationJune 7, 2010

Building Effective Multivendor Autonomic Computing SystemsICAC 2006 panel

Moderator: Omer Rana (Cardiff University) Panelists

– Steve White (IBM Research)

– Julie McCann (Imperial College, London)

– Mazin Yousif (Intel Research)

– Jose Fortes (University of Florida)

– Kumar Goswami (HP Labs)

Discussion– Multi-vendor systems are an unavoidable reality – we must cope

– Standards are key to interoperability: web services (WS-*), policies, resource information (CIM); probably others too, for monitoring probes, types of events, semantic stack (OWL) …

– Virtualization is key – layer of abstraction providing uniform access to resources

– “We must identify key application scenarios, such as data centers and Grid-computing applications, that would enable wider adoption of multivendor autonomic systems.”

Rana and Kephart, IEEE Distributed Systems Online, 2006

3

IBM Research


Autonomic Computing and Agents AC definition

– “Computing systems that manage themselves in accordance with high-level objectives from humans.” Kephart & Chess, IEEE Computer 2003

Agents definition– “An encapsulated computer system, situated in some

environment, and capable of flexible, autonomous action in that environment in order to meet its design objectives.” Jennings, et al, A Roadmap of Agent Research and Development, JAAMAS 1998

Autonomic elements ~ agents Autonomic systems ~ multi-agent systems

Mutual benefit for AC & Agents communities– AC is killer app for agents

– Agents community has much to contribute to AC

Service-oriented computing may evolve into agent-oriented computing over time

– Agents more proactive

– Agent standards should grow from SOC standards

Brazier, Kephart, Parunak, and Huhns, Internet Computing, June 2009

Article resulted from brainstorming session at Agents for Autonomic Computing workshop, ICAC 2008

4

IBM Research


The focus of this talk

I start from two premises:– Autonomic systems are “Computing systems that manage themselves in

accordance with high-level objectives from humans.”

– Autonomic systems ~ multi-agent systems

Which leads to…

How do we get a (decentralized) Multi-Agent System to act in accordance with high-level objectives?

My thesis– Objectives should be expressed in terms of value

– Value is an essential piece of information that must be processed, transformed, and communicated by agents

5

IBM Research


Outline

Autonomic Computing and Multi-Agent Systems

Utility Functions– As means for expressing high-level objectives

– As means for managing to high-level objectives

Examples– Unity, and its commercialization

– Power and performance objectives and tradeoffs

– Applying utility concepts at the data center level

Challenges and conclusions

6

IBM Research


How to represent high-level policies?

Utility functions map any possible state of a system to a scalar value

They can be derived from– a service level agreement

– preference elicitation techniques

– simple templates, e.g. specify response time thresholds and “importance” levels

They are a very useful representation for high-level objectives

– Value can be propagated and/or converted from service level parameters to resource control parameters

PossibleState

1

PossibleState

2

PossibleState

3

a1

a2

a3

CurrentState

S

Value(RT) =

Kephart and Walsh, Policy04

7

IBM Research


How to manage with high-level policies?

Elicit utility function U(S) expressed in terms of service attributes S

Obtain models of how each attribute Si depends on set of controllable parameters C and observable parameters O

– Can express set of models as S(C, O)

– E.g., RTgold(routing weights, request rate)

– Models can be obtained, by experiment/learning, theory (queuing), …

Transform from service utility U(S) to resource utility U’ by substituting models

– U(S) = U(S(C,O)) = U’(C,O)

Optimize resource utility. As observable O changes, set C to values that maximize U’(C,O)

U(RT, RPO)

Recovery Point

ObjectiveResponse Time

U

cpu

Backup rate b

U’

U’(cpu, b; )

Transform

8 © 2005 IBM Corporation

Finding the optimal control parameters

NetUtil(cpu, b; )

Util(RT, RPO)

RT

RPO

b

cpu

b*=0.874575cpu*=2.49134U*=152.661RT*=99.5775

b

cpu

b*=1.19931cpu*=3.65144U*=137.414RT*=95.4449

b

cpu

b*=2.05265cpu*=8.58375U*=75.8644RT*=88.6853

The mapping from service level to low-level parameters is affected by observables (like in this case), which in turn affects (cpu*, b*).

9

IBM Research


Outline








10

IBM Research


Unity Data Center Prototype: Experimental setup

Value(#srvrs)

Batch

AppManager

Trade3

Server Server Server Server Server Server Server Server

Value(#srvrs)

Value(#srvrs)

Demand(HTTP req/sec)

Trade3

AppManager

Value(RT)

WebSphere 5.1

DB2

AppManager

WebSphere 5.1

DB2

Value(#srvrs)

Maximize Total SLA Revenue

5 sec

Value(RT)

Demand(HTTP req/sec)

Trade3

Trade3

ResourceArbiter

Chess, Segal, Whalley and White, Unity: Experiences with a Prototype Autonomic Computing System, ICAC 2004

11

IBM Research


ResourceArbiter

Application Environment 1

ServersServersServers

U1(S1)ApplicationManager



U2(S2)ApplicationManager

R*1 R*

2

Transformation ofservice-level utility toresource-level utility:

U(R) = maxCU(S(C,R))

U2(R2)U1(R1)

Number of servers

Ma

x U

tili

ty

U(Ri)

The Unity Datacenter Prototype

Decision problem:Allocate resources R* = argmaxRUi(Ri) Effectively maximizes Ui(Si)

12

IBM Research


Commercializing Unity

Barriers are not just technical in nature

Strong legacy of successful product lines must be respected; otherwise– Difficult for the vendor

– Risk alienating existing customer base

Solution: Infuse agency/autonomicity gradually into existing products– Demonstrate value incrementally at each step

We worked with colleagues at IBM Research and IBM Software Group to implement the Unity ideas in two commercial products:– Application Manager: IBM WebSphere Extended Deployment (WXD)

– Resource Arbiter: IBM Tivoli Intelligent Orchestrator (TIO)

Das et al., ICAC 2006

13

IBM Research


Commercializing The Unity Datacenter Prototype

Tivoli Intelligent

Orchestrator



U1(S1)WebSphere Extended

Deployment



U2(S2)

Decision problem:Allocate resources R* = argmaxRUi(Ri) Effectively maximizes Ui(Si)

R*1 R*

2

Transformation ofservice-level utility toresource-level utility:

U(R) = maxCU(S(C,R))

U2(R2)U1(R1)

Number of servers

Ma

x U

tili

ty

U(Ri)

WebSphere Extended

Deployment

14

IBM Research


Coordinating two commercial products

WXD uses utility functions internally, but

– Doesn’t communicate them externally

– Doesn’t compute them for n other than current value

– Reports “needed speed” to Objective Analyzer, which converts it to PoB

Tivoli Intelligent Orchestrator

U(S)WebSphere Extended

Deployment

V(n)

TIO doesn’t use utility functions

– Uses Probability of Breach

– PoB: probability that SLA will be violated if n servers are given to an applicationIf I give you n

servers, how valuable will that be?

If I give you n servers, how often will you exceed the response time goal?

I need 300M CPU cycles/sec

DesiredActual

15

IBM Research


Original

WXD 5.1TIO 3.1

Utility-based Interactions between WXD and TIO: Step 1

WebSphere XD

Objective Analyzer

TIO Transform

Resource Allocations: n

ResourceNeeds(speed)

ResourceNeeds in PoB(n)

Policy Engine

TIO 3.1

ResourceNeeds in Fitness(n)

Utility(current n)

TIO cannot make well-founded resource allocation decisions WS XD can’t articulate its needs to TIO PoB not commensurate with utility

16

IBM Research


Intermediate

WXD 6.0.2

TIO 3.1


WebSphere XD

Objective Analyzer

TIO Transform


PoB(n)

Policy Engine

TIO 3.1

Fitness(n)

Utility(current n)

WS XD research team added ResourceUtil interface of WXD We developed a good heuristic for converting ResourceUtil to PoB in Objective Analyzer

Interpolate discrete set of ResourceUtil points and map to PoB This PoB better reflects WS XD’s needs

ResourceUtility(n)

17

IBM Research


New

Modified WXD 6.0.2

Modified TIO 3.1


WebSphere XD

Objective Analyzer

1


Policy Engine

TIO 3.1

Utility(current n)

ResourceUtility(n)

We modified TIO to use ResourceUtil(n) directly instead of PoB(n) Most mathematically principled basis for TIO allocation decisions It enables TIO to be in perfect synch with the goals defined by WS XD Basic scheme can work, not just for XD, but for any other entity that may be requesting resource, provided that it can estimate its own utilities

ResourceUtility(n)

ResourceUtility(n)

18

IBM Research


Outline








19

IBM Research


Multi-agent management of performance and power

To manage performance and power objectives and tradeoffs, we have explored multiple variants on a theme

Two separate agents: one for performance, the other for power

Various control parameters, various coordination and communication mechanisms

– Power controls: clock frequency & voltage, sleep modes, …

– Performance controls: routing weights, number of servers, VM placement …

– Coordination: unilateral with/without FYI, bilateral interaction, mediated, …

I show several examples ….

20

IBM Research


Utility functions for performance-power tradeoffs

Performance agent– IBM WebSphere

middleware adjusts load balancing, CPU application placement parameters ,etc to maximize performance

Power agent– Autonomic Management of

Energy throttles CPU clock to slow down processor and save power

An inherent conflict!

Utility function formulation: Maximize U(perf, pwr) = U(perf) – Pwr; Pwr < PwrMax

Kephart et al., ICAC 2007

21

IBM Research


Coordinating Multiple Autonomic Managers(Actually, only 2 or 3)

No frequency feedback With frequency feedback

Perf data (subset)

Pwr data Clock frequency

Perf Pwr

Perf data

Perf data (subset)

Pwr data

Perf Pwr

Perf data

Kephart et al., Coordinating Multiple Autonomic Managers to Achieve Specified Power-Performance Tradeoffs, ICAC 2007

22

IBM Research


Interaction between power and performance agents How might semi-autonomous

power & performance agents interact? – Mediated through coordinator

agent, or

– Direct bi/multi-lateral interactions Scenario (with mediation)

– Performance manager observes subset sperf of system state, and controls application placement and load balancing weights

– Power manager observes subset of spwr of system state, and controls on/off state of servers

– Coordinator knows overall power-performance tradeoffs as expressed in joint utility function; queries performance, power agents for likely impact of n servers on to derive optimum n*

I relinquish control of server i to

you

Performance Mgr

sperf spwr

System

Coordinator

UPP(RT, Pwr)

Power Mgr

On/Offi

System

Placement PRouting weights wi

What-if(non; Pwr)?What-if(non; RT)?

RTest(non) Pwrest(non)

n*onn*on

Kephart, Chan, Das, Levine, Tesauro, Rawson, Lefurgy. Coordinating Multiple Autonomic Managers to Achieve Specified Power-Performance Tradeoffs. ICAC 2007.(Emergent phenomena can occur when autonomic managers don’t communicate effectively.)

23

IBM Research


Power-aware Application Placement Controller for WebSphere

WebSphere XD Node Group

BNode

5

BANode

2

CNode

3

BCNode

4

BANode

1

A

B

C

HighImportance

360ms

MediumImportance

600ms

LowImportance

840ms

PlacementController

PlacementController

13 machines, single-CPU, 3.2GHz Intel Xeon, hyperthreading disabled, 4GB RAM

Flow Controller(Pacifici et al. JSAC 2005)

(Karve et al., WWW 2006,

Tang et al., WWW 2007)

Steinder, Whalley, Hanson, Kephart, Coordinated Management of Power Usage and Runtime Performance, NOMS 2008Time (minutes)

Nu

mb

er o

f cl

ien

ts

Idea. Enhance WebSphere’s Placement Controller to place apps on minimal number of nodes required to satisfy SLA. Uses utility functions, models and optimization.

24

IBM Research


Models, Optimization and Utility

4 nodes on 3 nodes on

2 nodes on

20% savings40% savings

Performance of sample Web application There is a tradeoff between good response time and low power consumption

To resolve the tradeoff, we provide a utility function U(RT, Pwr)– We experimented with three spanning wide spectrum

in power-saving preference

Performance manager (APC) computes– Best number of number of servers n*– Best application placement Placement*(n*)

For a given placement P on n nodes, APC uses offline, online models to compute– Estimated RT(P, n)– Estimated Pwr(P, n)– Estimated U(RT(P,n), Pwr(P,n)) = U’(P,n)

APC uses constraint optimization to determine (P*, n*) that maximizes U’(P,n)

25

IBM Research


Experimental results(3 different utility functions)

1. Always meet SLAs

2. Always maximize performance

3. Permit 10% performance degradation for 10% power savings

Conclusion. Substantial power savings can be attained without violating the SLA.

26

IBM Research


Coordinated Load Balancing and Powering Off Servers

DB2DB2

WebSphere

App. Server

WebSphere

App. Server

WebSphere

App. Server

WebSphere

App. Server

HTTP

Server

HTTP

Server

WebSphere

App. Server

WebSphere

App. ServerLoad

Generator

Load

Generator

Das, Kephart, Lefurgy, Tesauro, Levine, Chan Autonomic Multi-agent Management of Power and Performance in Data Centers, AAMAS 2008

Customers typically purchase enough servers to handle peak workloads

The average workload is often much less than peak (e.g. daily and weekly cycles)

Can we intelligently power down servers when they’re not needed?

27

IBM Research


Baseline experiment

Thursday Friday

Time (minutes)

Wo

rklo

ad

(#

cli

en

ts)

Re

sp

on

se

ti

me

(m

se

c)

# s

erv

ers

on

P

ow

er

(wa

tts

)

Trade6 workload; NASA traffic

SLA

Response time

Power

# servers

28

IBM Research


Models, optimization and utility

Thursday Friday

Time (minutes)

Wo

rklo

ad

(#

cli

en

ts)

Re

sp

on

se

ti

me

(m

se

c)

# s

erv

ers

on

P

ow

er

(wa

tts

)

Trade6 workload; NASA traffic

SLA

Response time

Power

# servers

Avg Power Savings = 27% No SLA violation

Here we use simple utility function– U(RT) = 1 if SLA satisfied– U(RT) = 0 if SLA not satisfied– U(RT, Pwr) = U(RT) – Pwr

Offline, we measure & compute models– RT (n, )– Pwr (n,

We substitute models into U to obtain– U’ (n, ) = U (RT (n, ), Pwr (n, ))

We pre-compute policy– n*() = argmaxn U’ (n, )

During the experiment, we plug a forecast of into n*()– We then modify n* a bit– We add some extra because latencies entailed

in turning servers back on can be several minutes

– We apply some heuristics to ensure that we don’t turn servers on and off too often

– We then turn on n* servers in the app tier

29

IBM Research


Power-Performance Commercialization

Managed elements (worker machines)

Workload Mgr Agent

Hardware Mgr Agent

Hardware-related sensor data

Hardware-mgmt commands to

machines under HMgr’s control

•Cost functionsCost functions for machines•Grant/refusal of request for control

•Request/release of controlRequest/release of control over machines

Workload-related sensor data

Workload-mgmt commands to

machines under WMgr’s control

Hanson et al. Autonomic Manager for Power, NOMS 2010

30

IBM Research


Outline








31

IBM Research


Data Center as a Multi-Agent System

Power Manager

Performance Mgr

Availability Manager

I want to turn that server off to save power

Reliability Manager

I want to process workload on that server

I want to use that server to achieve 3-fold redundancy

I want to leave that server in its present power state to reduce thermal stress

I want to perform valuable computational services

I want to increase the fan speed on that CRAC unit

32

IBM Research


Multi-agent Data Centers

H2OIce

Pumps

ChillerComputer Room Air Conditioners

Utility

Central UPS

Power Distribution Units

Facility Network

DC Power

Low Voltage600VAC Eqpt

GeneratorParallel or Transfer Eqpt

Utility,Rates, Incentives SubstationCommunicating Revenue Meter

CHP Fuel Cell, MicroTurbine or Turbine Power

CoolingMonitoring and Control

CoolingTower

Medium Voltage>600VAC Eqpt

$

Utility

AlternativePower

Utility

Compute• Main Frames• Volume Servers• Blade Servers

Storage• SATA Disk• Tape• Blended

Network• Corporate Networks• VoIP• Integrated Blade/Switch

In-Row Power• Modular UPS• Rack Mount PDUs

In Row Cooling• Rear Door Heat Exchanger• Liquid Cooling Racks• Overhead Cooling

IT and Networks

Security…UPSBattery

IT

Raised Floor

Compute

Storage, Netw

ork

In-row Power

and Cooling

Coordinator

33

IBM Research


MAS for Data Center ManagementResearch Challenges Marketplace realities dictate de-centralized MAS solutions to

energy management– Interaction among agents responsible for different dimensions of

management– Interaction across layers of the stack

Architectural questions– What is the scope/boundary of the agents? Where are they situated? – What and how do they communicate?– Can a multi-agent approach work, using negotiating agents and

mediators to manage performance, power, availability, reliability …?– Are markets and auctions effective coordination mechanisms when

there are numerous agents and “goods”?• What are the goods in this case (e.g. one core in a multi-core system running

in turbo mode)• We may need hierarchical markets that extend across multiple data centers• What happens when data center markets are coupled to the global economy?

Algorithmic (and other) challenges– Building/tuning deterministic and statistical what-if models on the fly– Avoiding undesirable emergent phenomena (IBM and HP Research

have observed this!)– Eliciting preferences (tradeoffs between power, performance, …)

Beyond IT– There is much to be gained by coordinating workload placement, load

balancing, etc. with facilities management, e.g. co-managing cooling and workload migration

– Agents will represent PDUs, CRACs, chillers, etc., vastly increasing the size and variety of the MAS

Lubin, Kephart, Das and Parkes. Expressive Power-Based Resource Allocation for Data Centers. IJCAI 2009.(Exploring market-based resource allocation for data centers.)

34

IBM Research


Conclusions Multi-agent systems offer a powerful paradigm for engineering

decentralized autonomic computing systems– But impractical to build from scratch using standard agent tools

Humans should express objectives in terms of value (utility functions)

Value can then be propagated, processed, and transformed by the agents, and used to guide their internal decisions and their communication with other agents

Key technologies required to support this scheme include– Utility function elicitation

– Learning

– Modeling / what-if modeling

– Optimization

– Agent communication, mediation

There is lots of architectural work to do !

35

IBM Research


Backup

36

IBM Research


Research questions

What are the agent boundaries?

What interfaces and protocols do they use to communicate?

What information do the agents share?– When negotiating services

– When provide/receiving services

How to compose agents into systems?– How do agents advertise their service?

– Can we use planning algorithms to compose them to achieve a desired end goal?

How do we motivate agents to provide services to one another?

We need to build these systems and subject them to realistic workloads– How do we encourage the community to build AC systems?

– Where do we get the realistic workloads?

37

IBM Research


Multi-agent systems and autonomic data centers I envision data centers of the future as a complex ecosystem of interacting semi-autonomous

entities – an autonomic, multi-agent system

Autonomic computing definition– “Computing systems that manage themselves in accordance with high-level objectives from humans.” Kephart and

Chess, A Vision of Autonomic Computing, IEEE Computer, Jan. 2003

Software agent definition– “An encapsulated computer system, situated in some environment, and capable of flexible, autonomous action in

that environment in order to meet its design objectives.” Jennings, Sycara and Wooldridge, A Road Map of Agent Research and Development, JAAMAS 1998

– Multi-agent systems: collections of agents that interact with one another to achieve individual and/or system goals

Agents will– represent, or be embedded in, different products from different vendors– reside at many levels of the management stack– manipulate control knobs at all levels of the stack (from hardware/firmware up through middleware and facilities)– collectively manage the data center to specified objectives and constraints (some relating to power)– interact in intended and unintended ways with one another, and with other types of automated management

processes directed towards maintaining high levels of performance, availability, reliability, security, etc.

This vision is a natural extrapolation of present-day facts and trends– Industry and academia are developing a multitude of control knobs and automated techniques to save energy– These will be incorporated into a multitude of management products from different vendors– They will operate simultaneously within and across multiple levels of the stack– Somehow, these products will need to work together effectively, requiring some cooperative interactions

Data centers are a “killer app” for multi-agent systems– Conversely, MAS architectural and algorithmic concepts are essential to energy-efficient, autonomic data centers

© 2009 IBM Corporation IBM Research SOAR 2010 keynoteJune 7, 2010 Engineering Decentralized...

Documents

Transcript of © 2009 IBM Corporation IBM Research SOAR 2010 keynoteJune 7, 2010 Engineering Decentralized...