
NOW 1

Berkeley NOW Project

David E. Culler

[email protected]

http://now.cs.berkeley.edu/

Sun Visit

May 1, 1998

NOW 2

Project Goals

• Make a fundamental change in how we design and construct large-scale systems

– market reality:

» 50%/year performance growth => cannot allow 1-2 year engineering lag

– technological opportunity:

» single-chip “Killer Switch” => fast, scalable communication

• Highly integrated building-wide system

• Explore novel system design concepts in this new “cluster” paradigm

NOW 3

Remember the “Killer Micro”

• Technology change in all markets

• At many levels: Arch, Compiler, OS, Application

[Figure: Linpack peak performance (GFLOPS, 0.1 to 1000, log scale) vs. year, 1984-1996, for MPPs (CM-2, CM-5, Paragon XP/S (6768), Cray T3D, ASCI Red) and Cray vector machines (X-MP, Y-MP, C90, T90)]

NOW 4

Another Technological Revolution

• The “Killer Switch” – single-chip building block for scalable networks

– high bandwidth

– low latency

– very reliable

» if it’s not unplugged

=> System Area Networks

NOW 5

One Example: Myrinet

• 8 bidirectional ports of 160 MB/s each way

• < 500 ns routing delay

• Simple - just moves the bits

• Detects connectivity and deadlock

Tomorrow: gigabit Ethernet?

NOW 6

Potential: Snap together large systems

• incremental scalability

• time / cost to market

• independent failure => availability

[Figure: SPECint/SPECfp node performance in large systems vs. year (1986-1994), illustrating the engineering lag time]

NOW 7

Opportunity: Rethink O.S. Design

• Remote memory and processor are closer to you than your own disks!

• Networking Stacks?

• Virtual Memory?

• File system design?

NOW 8

Example: Traditional File System

[Figure: clients, each with a small local private file cache ($), connected over a fast channel (HIPPI) to a central server holding a global shared file cache ($$$) and RAID disk storage; the server is the bottleneck]

• Expensive

• Complex

• Non-Scalable

• Single point of failure


• Server resources at a premium

• Client resources poorly utilized

NOW 9

Truly Distributed File System

• VM: page to remote memory

[Figure: nodes, each with a processor and local file cache, connected by a scalable low-latency communication network; cluster caching across the local caches, network RAID striping across the disks, and G = node communication BW / disk BW]

NOW 10

Fast Communication Challenge

• Fast processors and fast networks

• The time is spent crossing between them

[Figure: the “Killer Switch” connecting the nodes of a “Killer Platform”; each message crosses network interface hardware and communication software on both ends, with costs spanning ns (hardware) to µs and ms (software)]

NOW 11

Opening: Intelligent Network Interfaces

• Dedicated Processing power and storage embedded in the Network Interface

• An I/O card today

• Tomorrow on chip?

[Figure: Sun Ultra 170 nodes, each with processor, cache, and memory, attached over a 50 MB/s I/O bus (S-Bus) to a Myricom NIC with its own processor and memory, connected to the 160 MB/s Myricom network]

NOW 12

Our Attack: Active Messages

• Request / Reply small active messages (RPC)

• Bulk-Transfer (store & get)

• Highly optimized communication layer on a range of HW

[Figure: a request message invokes a handler on the receiving node; that handler issues a reply, which invokes a handler back on the requester]
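To make the request/reply pattern concrete, here is a minimal sketch in C. The names (am_request, am_reply, am_poll, the handler signatures) are hypothetical stand-ins for illustration, not the actual Berkeley Active Messages interface.

```c
#include <stdio.h>

/* Token identifying the message source; a handler uses it to reply. */
typedef struct { int node; } am_token_t;

/* Hypothetical AM-layer entry points (illustrative stand-ins only). */
void am_request(int dest, void (*h)(am_token_t, int), int arg);
void am_reply(am_token_t tok, void (*h)(am_token_t, int), int arg);
void am_poll(void);

static volatile int done;

/* Reply handler: runs on the requester when the reply arrives. */
void fetch_reply(am_token_t tok, int value) {
    printf("node %d returned %d\n", tok.node, value);
    done = 1;
}

/* Request handler: runs on the remote node; it may only do a small,
 * bounded amount of work and must reply without blocking. */
int table[1024];
void fetch_request(am_token_t tok, int index) {
    am_reply(tok, fetch_reply, table[index]);
}

int main(void) {
    am_request(3, fetch_request, 42);   /* ask node 3 for table[42] */
    while (!done)
        am_poll();                      /* handlers run during polling */
    return 0;
}
```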

NOW 13

NOW System Architecture

[Figure: NOW system architecture – UNIX workstations, each with network interface hardware and communication software, connected by a fast commercial switch (Myrinet); Global Layer UNIX provides resource management, Network RAM, distributed files, and process migration; large sequential and parallel applications run on top via Sockets, Split-C, MPI, HPF, and vSM]

NOW 14

Outline

• Introduction to the NOW project

• Quick tour of the NOW lab

• Important new system design concepts

• Conclusions

• Future Directions

NOW 15

First HP/fddi Prototype

• FDDI on the HP/735 graphics bus.

• First fast message layer on an unreliable network

NOW 16

SparcStation ATM NOW

• ATM was going to take over the world.

The original INKTOMI

Today: www.hotbot.com

NOW 17

100 node Ultra/Myrinet NOW

NOW 18

Massive Cheap Storage

• Basic unit: 2 PCs double-ending four SCSI chains

Currently serving Fine Art at http://www.thinker.org/imagebase/

NOW 19

Cluster of SMPs (CLUMPS)

• Four Sun E5000s

– 8 processors

– 3 Myricom NICs

• Multiprocessor, Multi-NIC, Multi-Protocol

NOW 20

Information Servers

• Basic Storage Unit:

– Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM

– scalable backup/restore

• Dedicated Info Servers

– web,

– security,

– mail, …

• VLANs project into dept.

NOW 21

What’s Different about Clusters?

• Commodity parts?

• Communications Packaging?

• Incremental Scalability?

• Independent Failure?

• Intelligent Network Interfaces?

• Complete System on every node

– virtual memory

– scheduler

– files

– ...

NOW 22

Three important system design aspects

• Virtual Networks

• Implicit co-scheduling

• Scalable File Transfer

NOW 23

Communication Performance: Direct Network Access

• LogP: Latency, Overhead, and Bandwidth

• Active Messages: lean layer supporting programming models

[Figure: LogP parameters of the communication layer in µs (send overhead Os, receive overhead Or, latency L, gap g = 1/BW), on a 0-16 µs scale]
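As a reminder of how these LogP parameters combine (standard LogP accounting, not numbers from the slide): with send overhead o_s, receive overhead o_r, latency L, and gap g, small-message cost and rate follow

```latex
% Standard LogP cost accounting (not taken from the slide):
%   o_s, o_r = send/receive overhead, L = latency, g = gap = 1/bandwidth
\[
  T_{\text{one-way}} = o_s + L + o_r, \qquad
  T_{\text{round-trip}} = 2\,(o_s + L + o_r), \qquad
  \text{message rate} \le \tfrac{1}{g}.
\]
```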

NOW 24

Example: NAS Parallel Benchmarks

• Better node performance than the Cray T3D

• Better scalability than the IBM SP-2

[Figure: speedup on NAS LU class A vs. number of processors (up to 125) for the Cray T3D, IBM SP-2, and NOW]

NOW 25

General purpose requirements

• Many timeshared processes

– each with direct, protected access

• User and system

• Client/Server, parallel clients, parallel servers

– they grow, shrink, handle node failures

• Multiple packages in a process

– each may have its own internal communication layer

• Use communication as easily as memory

NOW 26

Virtual Networks

• Endpoint abstracts the notion of “attached to the network”

• Virtual network is a collection of endpoints that can name each other.

• Many processes on a node can each have many endpoints, each with its own protection domain.
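As a rough sketch of this abstraction (types and fields are hypothetical, not the NOW data structures): an endpoint bundles its send and receive queues with a table naming the other endpoints of its virtual network, and a protection key keeps one process's traffic out of another's endpoints.

```c
/* Hypothetical sketch of the endpoint / virtual-network abstraction.
 * Types and fields are illustrative, not the NOW data structures. */
#include <stdint.h>

#define VNET_MAX_PEERS 256

/* Names another endpoint in the same virtual network. */
typedef struct {
    uint16_t node;       /* physical node the peer lives on     */
    uint16_t endpoint;   /* endpoint index on that node         */
    uint32_t key;        /* protection key checked by the NIC   */
} vnet_name_t;

/* One endpoint: a process may own many, each its own protection domain. */
typedef struct {
    void       *send_q;                  /* mapped send queue           */
    void       *recv_q;                  /* mapped receive queue        */
    vnet_name_t peers[VNET_MAX_PEERS];   /* who this endpoint may name  */
} endpoint_t;
```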

NOW 27


How are they managed?

• How do you get direct hardware access for performance with a large space of logical resources?

• Just like virtual memory– active portion of large logical space is bound to physical

resources

[Figure: processes 1..n each hold many endpoints; active endpoints are bound to memory on the network interface (NIC Mem), while the rest are backed by host memory]

NOW 28

Solaris System Abstractions

Segment Driver• manages portions of an address space

Device Driver• manages I/O device

Virtual Network Driver
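A minimal sketch of how a process might obtain protected, direct access through such a driver, assuming a hypothetical device name and endpoint layout (open, mmap, and munmap are standard calls; nothing below is the actual NOW/Solaris code):

```c
/* Hypothetical sketch: mapping an endpoint into user space through a
 * virtual network driver. The device path and layout are made up for
 * illustration; only open/mmap/munmap/close are standard calls. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

#define ENDPOINT_BYTES 4096   /* assumed size of one endpoint's queues */

int main(void) {
    int fd = open("/dev/vnet0", O_RDWR);          /* hypothetical device */
    if (fd < 0) { perror("open"); return 1; }

    /* The driver plays the role of a segment driver: it backs this
     * mapping with NIC memory while the endpoint is active, and with
     * host memory when it is not (just like paging). */
    void *ep = mmap(NULL, ENDPOINT_BYTES, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    if (ep == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* ... user-level communication layer reads/writes queues in *ep ... */

    munmap(ep, ENDPOINT_BYTES);
    close(fd);
    return 0;
}
```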

NOW 29

Virtualization is not expensive

[Figure: LogP parameters (g, L, Or, Os) in µs, on a 0-16 µs scale, comparing GAM with the virtualized AM layer]

NOW 30

Bursty Communication among many virtual networks

[Figure: multiple clients exchanging bursts of messages with multiple servers, alternating message bursts with work]

NOW 31

Sustain high BW with many VN

[Figure: aggregate message rate (msgs/s, up to roughly 80,000) vs. number of virtual networks (1-28), for continuous traffic and for bursts of 1024, 2048, 4096, 8192, and 16384 msgs]

NOW 32

Perspective on Virtual Networks

• Networking abstractions are vertical stacks

– new function => new layer

– poke through for performance

• Virtual Networks provide a horizontal abstraction

– basis for building new, fast services

NOW 33

Beyond the Personal Supercomputer

• Able to timeshare parallel programs – with fast, protected communication

• Mix with sequential and interactive jobs

• Use fast communication in OS subsystems

– parallel file system, network virtual memory, …

• Nodes have powerful, local OS scheduler

• Problem: local schedulers do not know to run parallel jobs in parallel

NOW 34

Local Scheduling

• Local Schedulers act independently

– no global control

• Program waits while trying to communicate with peers that are not running

• 10 - 100x slowdowns for fine-grain programs!

=> need coordinated scheduling

[Figure: uncoordinated local schedules on processors P1-P4 over time; the pieces of parallel jobs A, B, and C rarely run at the same time across processors]

NOW 35

Traditional Solution: Gang Scheduling

• Global context switch according to precomputed schedule

• Inflexible, inefficient, fault prone

[Figure: gang scheduling on processors P1-P4 over time; a master globally context-switches all processors together between job A and jobs B/C according to a precomputed schedule]

NOW 36

Novel Solution: Implicit Coscheduling

• Coordinate schedulers using only the communication in the program

– very easy to build

– potentially very robust to component failures

– inherently “service on-demand”

– scalable

• Local service component can evolve.

[Figure: job A processes on each node running over local schedulers (LS) that together act as an implicit global scheduler (GS)]

NOW 37

Why it works

• Infer non-local state from local observations

• React to maintain coordination

observation          implication              action
fast response        partner scheduled        spin
delayed response     partner not scheduled    block

[Figure: workstation WS 1 runs job A and sends a request while WS 2-4 are still running job B; the requester spins briefly, then sleeps, and wakes when the response arrives as job A gets scheduled on the other workstations]
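A minimal sketch of the spin-then-block policy summarized in the table above; the spin threshold and the polling/blocking primitives are hypothetical placeholders, not the NOW implementation.

```c
/* Sketch of implicit coscheduling's two-phase wait: spin for roughly the
 * expected round-trip time; a fast reply implies the partner is scheduled,
 * so keep spinning, otherwise block and free the CPU. The primitives
 * (am_poll, reply_arrived, block_until_message, cycles) are hypothetical. */
#include <stdint.h>

extern void am_poll(void);                 /* service the network        */
extern int  reply_arrived(void);           /* has the reply come in?     */
extern void block_until_message(void);     /* sleep; NIC interrupt wakes */
extern uint64_t cycles(void);              /* cheap cycle counter        */

#define SPIN_CYCLES (5 * 2000)  /* ~a few round-trip times, assumed value */

void wait_for_reply(void) {
    uint64_t start = cycles();
    /* Phase 1: spin. If the partner is scheduled, its reply arrives
     * within a couple of round trips and we never pay a context switch. */
    while (cycles() - start < SPIN_CYCLES) {
        am_poll();
        if (reply_arrived())
            return;                 /* partner scheduled: stay spinning  */
    }
    /* Phase 2: block. A delayed response implies the partner is not
     * scheduled; yield the processor so a local job can run. */
    while (!reply_arrived())
        block_until_message();
}
```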

NOW 38

Example: Synthetic Programs

• Range of granularity and load imbalance

– spin wait: 10x slowdown

NOW 39

Implicit Coordination

• Surprisingly effective

– real programs

– range of workloads

– simple and robust

• Opens many new research questions

– fairness

• How broadly can implicit coordination be applied in the design of cluster subsystems?

NOW 40

A look at Serious File I/O

• Traditional I/O system

• NOW I/O system

• Benchmark Problem: sort a large number of 100-byte records with 10-byte keys

– start on disk, end on disk

– accessible as files (use the file system)

– Datamation sort: 1 million records

– Minute sort: as many records as possible in one minute

[Figure: traditional I/O system as one processor-memory complex vs. the NOW I/O system as many P-M nodes]
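For concreteness, the record format implied by the benchmark definition above (100-byte records, 10-byte keys); the struct and comparator names are ours.

```c
/* Datamation/Minute-sort record layout: 100 bytes total, first 10 bytes
 * are the key. The struct name is ours; the sizes come from the slide. */
#include <string.h>

typedef struct {
    unsigned char key[10];      /* sort key                     */
    unsigned char payload[90];  /* rest of the 100-byte record  */
} sort_record_t;

/* Comparator suitable for qsort(): order records by key. */
int record_cmp(const void *a, const void *b) {
    return memcmp(((const sort_record_t *)a)->key,
                  ((const sort_record_t *)b)->key, 10);
}
```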

NOW 41

World-Record Disk-to-Disk Sort

• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth

[Figure: Minute Sort – gigabytes sorted (0-9) vs. number of processors (0-100) for NOW, the SGI Power Challenge, and the SGI Origin]

NOW 42

Key Implementation Techniques

• Performance Isolation: highly tuned local disk-to-disk sort

– manage local memory

– manage disk striping

– memory-mapped I/O with madvise, buffering (see the sketch below)

– manage overlap with threads

• Efficient Communication

– completely hidden under disk I/O

– competes for I/O bus bandwidth

• Self-tuning Software

– probe available memory, disk bandwidth, trade-offs
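A minimal sketch of the memory-mapped I/O technique noted above, using the standard mmap/madvise calls; the file name and the trivial scan standing in for the sort pass are placeholders, not the NOW-Sort code.

```c
/* Sketch of memory-mapped file I/O with madvise, as used for streaming a
 * sort run from disk. File name and the scan loop are placeholders. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("run0.dat", O_RDONLY);          /* placeholder file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole input run and tell the VM system we will read it
     * sequentially, so it can prefetch aggressively and recycle pages. */
    unsigned char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    madvise((void *)buf, st.st_size, MADV_SEQUENTIAL);

    long long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* stand-in for the sort pass */
        sum += buf[i];
    printf("checksum %lld\n", sum);

    munmap(buf, st.st_size);
    close(fd);
    return 0;
}
```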

NOW 43

Towards a Cluster File System

• Remote disk system built on a virtual network

[Figure: the client issues reads and writes through RDlib to an RD server over active messages; local vs. remote rates (MB/s, roughly 5-6) for read and write, with client and server CPU utilization between 0% and 40%]

NOW 44

Conclusions

• Fast, simple Cluster Area Networks are a technological breakthrough

• Complete system on every node makes clusters a very powerful architecture.

• Extend the system globally

– virtual memory systems,

– schedulers,

– file systems, ...

• Efficient communication enables new solutions to classic systems challenges.

NOW 45

Millennium Computational Community

[Figure: campus-wide Gigabit Ethernet connecting clusters in SIMS, C.S., E.E., M.E., BMRC, N.E., IEOR, C.E., MSME, NERSC, Transport, Business, Chemistry, Astro, Physics, Biology, Economy, and Math]

NOW 46

Millennium PC Clumps

• Inexpensive, easy-to-manage cluster

• Replicated in many departments

• Prototype for very large PC cluster

NOW 47

Proactive Infrastructure

Scalable Servers

Stationary desktops

Information appliances