* Lehrstuhl für Betriebssysteme, RWTH Aachen
+ Lehrstuhl für Rechnerarchitektur, TU Chemnitz
Efficient Asynchronous Message Passing via SCI with Zero-Copying
Joachim Worringen*, Friedrich Seifert+, Thomas Bemmerl*
SCI Europe 2001 – Trinity College Dublin
Agenda
• What is Zero-Copying? What is it good for?
• Zero-Copying with SCI
• Support through the SMI Library (Shared Memory Interface)
• Zero-Copy Protocols in SCI-MPICH
  - Memory Allocation Setups
  - Performance Optimizations
• Performance Evaluation
  - Point-to-Point
  - Application Kernel
  - Asynchronous Communication
Zero-Copying
• Transfer of data between two user-level accessible memory buffers
  - with N explicit intermediate copies: N-way copying
  - with no intermediate copy: zero-copying
• Effective bandwidth and efficiency (B_i = bandwidth of transfer step i,
  B_peak = peak bandwidth):

    1/B_eff = sum_{i=1}^{n} 1/B_i        efficiency = B_eff / B_peak
Efficiency Comparison
[Chart: efficiency comparison of FastEthernet, GigabitEthernet and SCI DMA]
Zero-Copying with SCI
SCI does zero-copy by nature. But SCI via the I/O bus is limited:
• No SMP-style shared memory
• Specially allocated memory regions were required
  No general zero-copy possible
New possibility:
• Using user-allocated buffers for SCI communication
  Allows general zero-copy!
Connection setup is always required.
SMI Library (Shared Memory Interface)
High-level SCI support library for parallel applications or libraries:
• Application startup
• Synchronization & basic communication
• Shared-memory setup:
  - Collective regions
  - Point-to-point regions
  - Individual regions
• Dynamic memory management
• Data transfer
Data Moving (I)
Shared-memory paradigm:
• Import remote memory into the local address space
• Perform memcpy() or possibly DMA
• SMI support:
  - Region type REMOTE
  - Synchronous (PIO): SMI_Memcpy()
  - Asynchronous (DMA if possible): SMI_Imemcpy() followed by SMI_Mem_wait()
Problems:
• High mapping overhead
• Resource usage (ATT entries on the PCI-SCI adapter)
Mapping Overhead
Not suitable for dynamic memory setups!
Data Moving (II)
Connection paradigm:
• Connect to a remote memory location
• No representation in the local address space
  Only DMA possible
• SMI support:
  - Region type RDMA
  - Synchronous / asynchronous DMA: SMI_Put/SMI_Iput, SMI_Get/SMI_Iget, SMI_Memwait
Problems:
• Alignment restrictions
• Source needs to be pinned down
Setup Acceleration
Memory buffer setup costs time!
Reduce the number of setup operations to increase performance.
Desirable: only one setup operation per buffer
• Problem: limited resources
• Solution: caching of SCI segment states by lazy release
  - Leave buffers registered, remote segments connected or mapped
  - Release unneeded resources only if setup of a new resource fails
  - Different replacement strategies possible: LRU, LFU, best-fit, random, immediate
  - Attention: remote segment deallocation! Callback on connection event to release the local connection
• MPI persistent communication operations: pre-register the user buffer & give it a higher "hold" priority
Memory Allocation
Allocate "good" memory:
• MPI_Alloc_mem() / MPI_Free_mem()
• Part of MPI-2 (mostly for one-sided operations)
• SCI-MPICH defines attributes:
  - type: shared, private or default
    Shared memory performs best.
  - alignment: none, specified or default
    Non-shared memory should be page-aligned.
• "Good" memory should only be enforced for communication buffers!
Zero-Copy Protocols
• Applicable for the handshake-based rendezvous protocol
• Requirements:
  - registered user-allocated buffers, or
  - regular SCI segments ("good" memory via MPI_Alloc_mem())
• State of the memory range must be known
  SMI provides query functionality
• Registering / connecting / mapping may fail; several different setups are possible
  Fallback mechanism required
Data Transfer
[Diagram: asynchronous rendezvous between sender and receiver, each with an
application thread and a device thread. After Isend/Irecv are posted, the
control messages "ask to send" and "OK to send" are exchanged, the device
threads perform the transfer while the application threads Wait, and
"continue" / "done" complete the operation.]
Test Setup
Systems used for performance evaluation:
• Pentium III @ 800 MHz
• 512 MB RAM @ 133 MHz
• 64-bit / 66 MHz PCI (ServerWorks ServerSet III LE)
• Dolphin D330 (single-ring topology)
• Linux 2.4.4-bigphysarea
• Modified SCI driver (user memory for SCI)
Bandwidth Comparison
Application Kernel: NPB IS
• Parallel bucket sort
• Keys are integer numbers
• Dominant communication: MPI_Alltoallv for the distributed key array:

  Class | Array size [MiB] | Procs | Msg size [kiB] | Alltoallv [ms] | % of execution time
  A     | 1                | 4     | 256            | 16.363         | 34.6
  W     | 8                | 4     | 2048           | 123.921        | 36.2
MPI_Alltoallv Performance
• MPI_Alltoallv is translated into point-to-point operations:
  MPI_Isend / MPI_Irecv / MPI_Waitall
• Improved performance with asynchronous DMA operations
• Application speedup deduced:

  Class | Procs | regular [ms] | speedup | user [ms] | speedup
  A     | 4     | 7.578        | 1.22    | 9.617     | 1.16
  W     | 4     | 52.415       | 1.26    | 63.957    | 1.21
Asynchronous Communication
Goal: overlap computation & communication
• How to quantify the efficiency of this?
Typical overlapping effect:
[Figure: total time vs. computation time for synchronous and asynchronous
communication]
Saturation and Efficiency (I)
Two parameters are required:
1. Saturation s
   Duration of the computation period required to make the total time
   (communication & computation) increase
2. Efficiency ε
   Relation of overhead to message latency
Saturation and Efficiency (II)
[Figure: timelines of computation with synchronous and asynchronous
communication, showing t_total, t_busy, the overhead t_total - t_busy, and
the message latencies t_msg_s (synchronous) and t_msg_a (asynchronous)]

Efficiency:

  ε = 1 - (t_total - t_busy) / t_msg

Saturation s:

  s = t_msg - (t_total - t_busy)
Experimental Setup: Overlap
Micro-benchmark to quantify overlapping:

  latency = MPI_Wtime()
  if (sender)
      MPI_Isend(msg, msgsize)
      while (elapsed_time < spinning_duration)
          spin (with multiple threads)
      MPI_Wait()
  else
      MPI_Recv()
  latency = MPI_Wtime() - latency
Experimental Setup: Spinning
Different ways of keeping the CPU busy:
• FIXED
  Spin on a single variable for a given amount of CPU time
  No memory stress
• DAXPY
  Perform a given number of DAXPY operations on vectors
  (vector sizes x, y equivalent to the message size)
  Stresses the memory system

    y(j) = A * x(j) + y(j)
DAXPY – 64kiB Message
DAXPY – 256kiB Message
FIXED – 64kiB Message
Asynchronous Performance
Saturation and efficiency derived from the experiments:

  Experiment    | Protocol  | t_msg [ms] | s [ms] | ε
  64 kiB DAXPY  | a-DMA-0-R | 0.490      | 0.285  | 0.581
                | a-DMA-0-U | 0.735      | 0.473  | 0.643
                | s-PIO-1   | 0.572      | 0.056  | 0.043
  256 kiB DAXPY | a-DMA-0-R | 1.300      | 1.099  | 0.845
                | a-DMA-0-U | 1.506      | 1.148  | 0.762
                | s-PIO-1   | 1.895      | -0.030 | -0.015
  64 kiB FIXED  | a-DMA-0-R | 0.493      | 0.446  | 0.904
                | a-DMA-0-U | 0.738      | 0.691  | 0.936
                | s-PIO-1   | 0.567      | 0.016  | 0.028
Summary & Outlook
• Efficient utilization of new SCI driver functionality for MPI communication:
  max. bandwidth of 230 MiB/s (regular) and 190 MiB/s (user)
• Connection overhead hidden by segment caching
  Asynchronous communication pays off much earlier than before
• New (?) quantification scheme for the efficiency of asynchronous communication
• Flexible MPI memory allocation supports the MPI application writer
• Connection-oriented DMA transfers reduce resource utilization
Open issues:
• DMA alignment problems
• Segment callback required for improved connection caching