NOW 1 Berkeley NOW Project David E. Culler [email protected] Sun Visit May 1, 1998.
NOW 1
Berkeley NOW Project
David E. Culler
http://now.cs.berkeley.edu/
Sun Visit
May 1, 1998
NOW 2
Project Goals
• Make a fundamental change in how we design and construct large-scale systems
– market reality:
» 50%/year performance growth => cannot allow 1-2 year engineering lag
– technological opportunity:
» single-chip “Killer Switch” => fast, scalable communication
• Highly integrated building-wide system
• Explore novel system design concepts in this new “cluster” paradigm
NOW 3
Remember the “Killer Micro”
• Technology change in all markets
• At many levels: Arch, Compiler, OS, Application
[Chart: Linpack peak performance in GFLOPS (log scale, 0.1-1000) vs. year, 1984-1996, for MPPs (CM-2, CM-5, Paragon XP/S (6768), Cray T3D, ASCI Red) and Cray vector machines (X-MP, Y-MP, C90, T90)]
NOW 4
Another Technological Revolution
• The “Killer Switch”– single chip building block for scalable networks
– high bandwidth
– low latency
– very reliable
» if it’s not unplugged
=> System Area Networks
NOW 5
One Example: Myrinet
• 8 bidirectional ports of 160 MB/s each way
• < 500 ns routing delay
• Simple - just moves the bits
• Detects connectivity and deadlock
Tomorrow: gigabit Ethernet?
NOW 6
Potential: Snap together large systems
• incremental scalability
• time / cost to market
• independent failure => availability
[Chart: SpecInt/SpecFP node performance in large systems vs. year, 1986-1994, showing the engineering lag time between workstation and MPP node performance]
NOW 7
Opportunity: Rethink O.S. Design
• Remote memory and processor are closer to you than your own disks!
• Networking Stacks ?
• Virtual Memory ?
• File system design ?
NOW 8
Example: Traditional File System
[Diagram: clients, each with a small local private file cache ($), funnel through a bottleneck to a central server holding a global shared file cache ($$$) and RAID disk storage over a fast channel (HIPPI)]
• Expensive
• Complex
• Non-scalable
• Single point of failure
• Server resources at a premium
• Client resources poorly utilized
NOW 9
Truly Distributed File System
• VM: page to remote memory
[Diagram: eight nodes, each a processor with its own file cache, attached to a scalable low-latency communication network]
• Cluster caching: local cache backed by peers' caches
• Network RAID striping
• G = Node Comm BW / Disk BW
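The round-robin data layout behind network RAID striping can be sketched in a few lines. This is a hypothetical Python illustration; `stripe` and its block/node model are invented for the sketch, not part of the NOW file system.

```python
# Hypothetical sketch of RAID-0-style striping across cluster nodes,
# in the spirit of the slide's "Network RAID striping".

def stripe(blocks, n_nodes):
    """Assign block i to node i mod n_nodes (round-robin striping)."""
    layout = {node: [] for node in range(n_nodes)}
    for i, blk in enumerate(blocks):
        layout[i % n_nodes].append(blk)
    return layout

# 8 blocks striped over 4 nodes: each node serves 2 blocks, so when
# G = Node Comm BW / Disk BW exceeds 1, a read can draw on all 4 disks.
layout = stripe(list(range(8)), 4)
```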
NOW 10
Fast Communication Challenge
• Fast processors and fast networks
• The time is spent in crossing between them
[Diagram: nodes of the "killer platform", each with communication software (ms to µs scale) layered over network interface hardware (µs to ns scale), attached to the killer switch]
NOW 11
Opening: Intelligent Network Interfaces
• Dedicated Processing power and storage embedded in the Network Interface
• An I/O card today
• Tomorrow on chip?
[Diagram: Sun Ultra 170 node with processor, cache, and memory on a 50 MB/s I/O bus (S-Bus); the Myricom NIC, with its own processor and memory, connects to the 160 MB/s Myricom network]
NOW 12
Our Attack: Active Messages
• Request / Reply small active messages (RPC)
• Bulk-Transfer (store & get)
• Highly optimized communication layer on a range of HW
[Diagram: a request message invokes a handler at the remote node; the handler issues the reply, which invokes a handler back at the sender]
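The request/reply pattern can be sketched as a single-process Python toy. `Endpoint` and `am_request` are invented names for this sketch, not the real Active Message API; the point is only that each message carries the name of a handler that runs on arrival.

```python
# Minimal sketch of the Active Message request/reply (RPC) pattern.
# On real hardware, deliver() runs in the communication layer when
# the message arrives; here it is just a direct call.

class Endpoint:
    def __init__(self, name):
        self.name = name
        self.handlers = {}

    def register(self, hid, fn):
        self.handlers[hid] = fn

    def deliver(self, hid, src, *args):
        # Dispatch the named handler on message arrival.
        return self.handlers[hid](src, *args)

def am_request(dst, src, hid, *args):
    """Send a small active message; any reply is sent by the handler."""
    return dst.deliver(hid, src, *args)

server = Endpoint("server")
client = Endpoint("client")
client.register("ack", lambda src, v: v)
# The request handler replies with another active message to the sender.
server.register("incr", lambda src, v: am_request(src, server, "ack", v + 1))

result = am_request(server, client, "incr", 41)  # -> 42
```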
NOW 13
NOW System Architecture
[Diagram: four UNIX workstations, each with communication software over network interface hardware, connected by a fast commercial switch (Myrinet); a Global Layer UNIX spans the nodes, providing resource management, network RAM, distributed files, and process migration; large sequential apps and parallel apps run on top via Sockets, Split-C, MPI, HPF, and vSM]
NOW 14
Outline
• Introduction to the NOW project
• Quick tour of the NOW lab
• Important new system design concepts
• Conclusions
• Future Directions
NOW 15
First HP/fddi Prototype
• FDDI on the HP/735 graphics bus.
• First fast message layer on a non-reliable network
NOW 16
SparcStation ATM NOW
• ATM was going to take over the world.
The original INKTOMI
Today: www.hotbot.com
NOW 18
Massive Cheap Storage
• Basic unit: 2 PCs double-ending four SCSI chains
Currently serving Fine Art at http://www.thinker.org/imagebase/
NOW 19
Cluster of SMPs (CLUMPS)
• Four Sun E5000s
– 8 processors
– 3 Myricom NICs
• Multiprocessor, Multi-NIC, Multi-Protocol
NOW 20
Information Servers
• Basic Storage Unit:
– Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM
– scalable backup/restore
• Dedicated Info Servers
– web,
– security,
– mail, …
• VLANs project into dept.
NOW 21
What’s Different about Clusters?
• Commodity parts?
• Communications Packaging?
• Incremental Scalability?
• Independent Failure?
• Intelligent Network Interfaces?
• Complete system on every node
– virtual memory
– scheduler
– files
– ...
NOW 22
Three important system design aspects
• Virtual Networks
• Implicit co-scheduling
• Scalable File Transfer
NOW 23
Communication Performance: Direct Network Access
• LogP: Latency, Overhead, and Bandwidth
• Active Messages: lean layer supporting programming models
[Chart: measured LogP parameters (g, L, Or, Os) and latency vs. 1/BW from microbenchmarks, on a 0-16 µs scale]
NOW 24
[Chart: speedup on LU-A vs. number of processors, 0-125, comparing the Cray T3D, IBM SP2, and NOW]
Example: NAS Parallel Benchmarks
• Better node performance than the Cray T3D
• Better scalability than the IBM SP-2
NOW 25
General purpose requirements
• Many timeshared processes
– each with direct, protected access
• User and system
• Client/server, parallel clients, parallel servers
– they grow, shrink, handle node failures
• Multiple packages in a process
– each may have its own internal communication layer
• Use communication as easily as memory
NOW 26
Virtual Networks
• Endpoint abstracts the notion of “attached to the network”
• Virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with own protection domain.
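The abstraction can be sketched as a toy model: an endpoint is a named point of attachment, and a virtual network is the set of endpoints that can name each other. All class and field names here are invented for the sketch; the real interface is the NIC-level virtual network driver described later.

```python
# Toy model of virtual networks: many endpoints per process, and a
# protection domain per network (only peers in the same virtual
# network can be named as destinations).

class VirtualNetwork:
    def __init__(self):
        self.endpoints = {}

    def create_endpoint(self, process, name):
        """A process may own many endpoints, each bound to one network."""
        ep = {"process": process, "name": name, "inbox": []}
        self.endpoints[name] = ep
        return ep

    def send(self, src, dst_name, msg):
        # Naming is confined to this network: that confinement is
        # what makes the network a protection domain.
        if dst_name not in self.endpoints:
            raise PermissionError("endpoint not in this virtual network")
        self.endpoints[dst_name]["inbox"].append((src["name"], msg))

vn = VirtualNetwork()
a = vn.create_endpoint("proc1", "a")
b = vn.create_endpoint("proc2", "b")
vn.send(a, "b", "hello")
```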
NOW 27
How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory
– active portion of the large logical space is bound to physical resources
[Diagram: processes 1 through n each hold endpoints in host memory; the network interface, with its own processor and NIC memory, holds only the active endpoints]
NOW 28
Solaris System Abstractions
• Segment Driver: manages portions of an address space
• Device Driver: manages an I/O device
• Virtual Network Driver
NOW 30
Bursty Communication among many virtual networks
[Diagram: three clients issue bursts of messages and work to three servers over many virtual networks]
NOW 31
Sustain high BW with many VN
[Chart: aggregate msgs/s (0-80,000) vs. number of virtual networks (1-28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
NOW 32
Perspective on Virtual Networks
• Networking abstractions are vertical stacks
– new function => new layer
– poke through for performance
• Virtual networks provide a horizontal abstraction
– basis for building new, fast services
NOW 33
Beyond the Personal Supercomputer
• Able to timeshare parallel programs – with fast, protected communication
• Mix with sequential and interactive jobs
• Use fast communication in OS subsystems– parallel file system, network virtual memory, …
• Nodes have powerful, local OS scheduler
• Problem: local schedulers do not know to run parallel jobs in parallel
NOW 34
Local Scheduling
• Local schedulers act independently
– no global control
• Program waits while trying to communicate with peers that are not running
• 10 - 100x slowdowns for fine-grain programs!
=> need coordinated scheduling
[Diagram: timeline of processors P1-P4 independently scheduling jobs A, B, and C, so job A's processes rarely run at the same time]
NOW 35
Traditional Solution: Gang Scheduling
• Global context switch according to precomputed schedule
• Inflexible, inefficient, fault prone
[Diagram: a master precomputes the schedule; P1-P4 context switch globally, running all of job A together, then jobs B and C in their own time slices]
NOW 36
Novel Solution: Implicit Coscheduling
• Coordinate schedulers using only the communication in the program
– very easy to build
– potentially very robust to component failures
– inherently “service on-demand”
– scalable
• Local service component can evolve.
[Diagram: on each node, job A runs over the local scheduler (LS); coordination emerges as if a global scheduler (GS) were present, without one being built]
NOW 37
Why it works
• Infer non-local state from local observations
• React to maintain coordination

observation        implication            action
fast response      partner scheduled      spin
delayed response   partner not scheduled  block

[Diagram: WS 1 runs job A throughout; WS 2-4 run job B, then switch to job A; job A's request gets a delayed response, so it sleeps, then spins once responses come back fast]
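The table above reduces to a two-phase wait, which can be sketched in Python. The threshold value is illustrative (on the order of a context switch), not a measured NOW parameter.

```python
# Sketch of the two-phase wait behind implicit coscheduling: spin for
# about a context-switch time, and block only if the response is slow.

SPIN_THRESHOLD = 50  # illustrative, e.g. microseconds

def wait_action(response_time):
    """Fast response implies the partner is scheduled: keep spinning.
    Delayed response implies it is not scheduled: block (yield CPU)."""
    return "spin" if response_time <= SPIN_THRESHOLD else "block"

# A fast partner keeps us on the CPU; a descheduled one makes us yield.
actions = [wait_action(t) for t in (10, 500)]
```

Blocking on a delayed response is what frees the CPU for whichever job's peers are running, so the schedules drift into alignment without any global control.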
NOW 39
Implicit Coordination
• Surprisingly effective
– real programs
– range of workloads
– simple and robust
• Opens many new research questions
– fairness
• How broadly can implicit coordination be applied in the design of cluster subsystems?
NOW 40
A look at Serious File I/O
• Traditional I/O system
• NOW I/O system
• Benchmark Problem: sort large number of 100 byte records with 10 byte keys
– start on disk, end on disk
– accessible as files (use the file system)
– Datamation sort: 1 million records
– Minute sort: quantity in a minute
[Diagram: a single processor-memory node vs. four processor-memory nodes cooperating on the sort]
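The core step of the benchmark is sorting fixed-size records by their leading key bytes, which can be shown as a tiny in-memory Python stand-in. The real NOW-Sort is a parallel, disk-to-disk external sort; this only illustrates the 100-byte-record, 10-byte-key format.

```python
# Tiny stand-in for the benchmark's kernel: 100-byte records sorted
# by their leading 10-byte key. Not a disk-to-disk external sort.

import os

def make_record(key10):
    """A 100-byte record: 10-byte key followed by 90 bytes of payload."""
    return key10 + os.urandom(90)

def sort_records(records):
    return sorted(records, key=lambda r: r[:10])

recs = [make_record(b"%010d" % k) for k in (3, 1, 2)]
sorted_keys = [r[:10] for r in sort_records(recs)]
# keys come back in ascending order
```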
NOW 41
World-Record Disk-to-Disk Sort
• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
[Chart: Minute Sort gigabytes sorted (0-9) vs. number of processors (0-100), comparing NOW with the SGI Power Challenge and SGI Origin]
NOW 42
Key Implementation Techniques
• Performance Isolation: highly tuned local disk-to-disk sort
– manage local memory
– manage disk striping
– memory-mapped I/O with madvise, buffering
– manage overlap with threads
• Efficient Communication
– completely hidden under disk I/O
– competes for I/O bus bandwidth
• Self-tuning Software
– probe available memory, disk bandwidth, trade-offs
NOW 43
Towards a Cluster File System
• Remote disk system built on a virtual network
[Chart: read/write rate (MB/s), local vs. remote, around 5.0-6.0 MB/s; client and server CPU utilization for reads and writes, 0-40%]
[Diagram: client RDlib talks to an RD server over active messages]
NOW 44
Conclusions
• Fast, simple Cluster Area Networks are a technological breakthrough
• Complete system on every node makes clusters a very powerful architecture.
• Extend the system globally
– virtual memory systems,
– schedulers,
– file systems, ...
• Efficient communication enables new solutions to classic systems challenges.
NOW 45
Millennium Computational Community
[Diagram: campus-wide Gigabit Ethernet connecting clusters in SIMS, C.S., E.E., M.E., BMRC, N.E., IEOR, C.E., MSME, NERSC, Transport, Business, Chemistry, Astro, Physics, Biology, Economy, and Math]
NOW 46
Millennium PC Clumps
• Inexpensive, easy to manage Cluster
• Replicated in many departments
• Prototype for very large PC cluster