
I Tried mTCP (mTCP 使ってみた)
Userspace TCP/IP stack for the PacketShader I/O engine

Hajime Tazaki
High-speed PC Router #3

2014/5/15

• Part of the PacketShader family (@ KAIST)

• Userspace TCP/IP stack (kernel bypass)
  • works with DPDK, PacketShader I/O, and netmap

• Uses a range of acceleration techniques (fast packet I/O, affinity-accept, RSS, etc.)

• Short flows: up to 25x faster than Linux

What is mTCP?

2

[mTCP14] Jeong et al., mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, USENIX NSDI, April, 2014


About the name...

3

Why Userspace Stack ?

• Problems
  • Lack of connection locality
  • Shared file descriptor space
  • Inefficient per-packet processing
  • System call overhead

• Kernel stack performance saturation

4

[mTCP14] Jeong et al., mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, USENIX NSDI, April, 2014

Kernel Stack Performance

5

Applications get almost no CPU time

Multiple cores are used inefficiently

[mTCP14] Jeong et al., mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, USENIX NSDI, April, 2014

mTCP Design

6

[mTCP14] Jeong et al., mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, USENIX NSDI, April, 2014

• Packet batching (ps I/O engine)
  • fewer system calls
  • fewer context switches
  • (a generic batching sketch follows below)

• Minimizes crossing between CPU cores
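The effect of batching can be illustrated with plain POSIX calls, independent of mTCP: the sketch below (generic C, not mTCP internals) sends the same set of buffers either with one write() per buffer or with a single writev() that batches them all, which is the same kind of per-syscall overhead that mTCP amortizes by batching socket operations.

/* Generic syscall-batching illustration (plain POSIX, not mTCP code):
 * sending N small buffers one write() at a time pays the syscall cost
 * N times; batching them into one writev() pays it once. */
#include <sys/uio.h>
#include <unistd.h>

#define NBUF    64
#define BUFLEN  64

/* one system call per buffer */
ssize_t send_unbatched(int fd, char bufs[NBUF][BUFLEN])
{
    ssize_t total = 0;
    for (int i = 0; i < NBUF; i++) {
        ssize_t n = write(fd, bufs[i], BUFLEN);   /* NBUF syscalls */
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}

/* one system call for the whole batch */
ssize_t send_batched(int fd, char bufs[NBUF][BUFLEN])
{
    struct iovec iov[NBUF];
    for (int i = 0; i < NBUF; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = BUFLEN;
    }
    return writev(fd, iov, NBUF);                 /* 1 syscall */
}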

mTCP Design

7

mTCP Design (cont’d)

8

[mTCP14] Jeong et al., mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, USENIX NSDI, April, 2014

Application Modification

9

• BSD-like socket API, epoll-like mux/demux

• Existing applications are usable with small modifications (changed lines / total LoC):
  • lighttpd (65 / 40K LoC)
  • ab (531 / 68K LoC)
  • SSLShader (43 / 6.6K LoC)
  • WebReplay (81 / 3.3K LoC)
  • (a minimal event-loop sketch follows below)
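To give a feel for what such a port involves, here is a minimal sketch of a single-core accept/echo event loop against the mTCP-style API. The function and constant names (mtcp_create_context, mtcp_epoll_wait, MTCP_EPOLLIN, ...) follow the paper and the GitHub repository as I recall them, so treat the exact signatures, headers, and the omitted mtcp_init() configuration step as assumptions to verify against mtcp_api.h / mtcp_epoll.h.

/* Sketch of an mTCP-style accept/echo loop on one core.
 * Names follow the mTCP paper/repo; exact signatures may differ.
 * Assumes mtcp_init("mtcp.conf") has already been called at startup. */
#include <mtcp_api.h>
#include <mtcp_epoll.h>
#include <sys/socket.h>
#include <arpa/inet.h>

#define MAX_EVENTS 1024

static void run_core(int core, int port)
{
    mtcp_core_affinitize(core);               /* pin this thread to its core */
    mctx_t mctx = mtcp_create_context(core);  /* per-core mTCP context */

    int ep    = mtcp_epoll_create(mctx, MAX_EVENTS);
    int lsock = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr = { 0 };
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port        = htons(port);
    mtcp_bind(mctx, lsock, (struct sockaddr *)&addr, sizeof(addr));
    mtcp_listen(mctx, lsock, 4096);

    struct mtcp_epoll_event ev = { .events = MTCP_EPOLLIN, .data.sockid = lsock };
    mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, lsock, &ev);

    struct mtcp_epoll_event events[MAX_EVENTS];
    char buf[8192];
    for (;;) {
        int n = mtcp_epoll_wait(mctx, ep, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int sock = events[i].data.sockid;
            if (sock == lsock) {               /* new connection */
                int c = mtcp_accept(mctx, lsock, NULL, NULL);
                struct mtcp_epoll_event cev = { .events = MTCP_EPOLLIN,
                                                .data.sockid = c };
                mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, c, &cev);
            } else {                           /* echo whatever arrives */
                int r = mtcp_read(mctx, sock, buf, sizeof(buf));
                if (r <= 0)
                    mtcp_close(mctx, sock);
                else
                    mtcp_write(mctx, sock, buf, r);
            }
        }
    }
}

On multiple cores the same loop would simply be run once per core, each thread with its own mtcp_create_context(), which matches the per-core design shown earlier.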

10

mTCP Performance (64B ping-pong transactions)

[mTCP14] Jeong et al., mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, USENIX NSDI, April, 2014

25x vs. Linux
5x vs. SO_REUSEPORT
3x vs. MegaPipe


Trying Out mTCP

11

• https://github.com/eunyoung14/mtcp

• Q1: Port iperf to mTCP (is it easy?)

• Q2: Does iperf actually get faster?

Hands-On

12

Porting iperf for mTCP

• socket() => mtcp_socket()

• An LD_PRELOAD shim library might even be enough (mechanism sketched below).
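To make the LD_PRELOAD idea concrete, here is a minimal sketch of the interposition mechanism in plain C. The build and run commands are assumptions, and this shim only logs and forwards to the libc socket() instead of calling mtcp_socket(); a real mTCP shim would also have to wrap bind/connect/read/write/epoll and keep a per-thread fd-to-context mapping.

/* ldpreload_socket.c: sketch of LD_PRELOAD interposition (mechanism only).
 * Assumed build:  gcc -shared -fPIC -o libshim.so ldpreload_socket.c -ldl
 * Assumed run:    LD_PRELOAD=./libshim.so iperf -s
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>

int socket(int domain, int type, int protocol)
{
    /* look up the libc implementation this definition shadows */
    static int (*real_socket)(int, int, int);
    if (!real_socket)
        real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

    fprintf(stderr, "intercepted socket(%d, %d, %d)\n", domain, type, protocol);

    /* a real mTCP shim would return mtcp_socket(mctx, domain, type, protocol) here */
    return real_socket(domain, type, protocol);
}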

13

Performance

• PC0
  • CPU: Xeon E5-2420 @ 1.90 GHz (6 cores)
  • NIC: Intel X520
  • Linux 2.6.39.4 (ps_ixgbe.ko)

• PC1
  • CPU: Core i7-3770K @ 3.50 GHz (8 cores)
  • NIC: Intel X520-2
  • Linux 3.8.0-19 (ixgbe.ko)

15

PC0: mTCP-ed iperf, PC1: vanilla Linux

[Figure: goodput (Gbps, 0 to 5) vs. packet size (64 to 8192 bytes), send buffer = 102400 bytes, mTCP vs. Linux]

Packet Size - Goodput (Gbps)

16

mTCP: roughly constant across packet sizes

"Packet size" here is the length passed to write() / mtcp_write()

The speedup comes from system call batching

Buffer Size - Goodput (Gbps)

17

[Figure: goodput (Gbps, 0 to 2.5) vs. send buffer size (2048 to 1024000 bytes), packet size = 64 bytes, mTCP vs. Linux]

mTCP: still investigating...

"Buffer size" here is the mTCP send buffer configuration; on Linux, the wmem_max setting (the Linux side is sketched below).
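On the Linux side, the send buffer an application actually gets is its SO_SNDBUF request clamped by net.core.wmem_max, which is presumably what the wmem_max configuration above refers to. A minimal sketch (the 102400-byte value comes from the previous plot; everything else is generic POSIX, not taken from the slides):

/* sndbuf_check.c: request a send buffer and see what the kernel grants.
 * Linux clamps SO_SNDBUF to net.core.wmem_max (and books roughly twice
 * the requested value), so large buffers may require e.g.
 *   sysctl -w net.core.wmem_max=1048576
 */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int request = 102400;                     /* SendBuffer value from the plot */

    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &request, sizeof(request));

    int granted = 0;
    socklen_t len = sizeof(granted);
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &granted, &len);
    printf("requested %d bytes, kernel granted %d bytes\n", request, granted);
    return 0;
}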

Summary

18

• Roughly 1.6 Gbps regardless of packet length
  • though this is only a single connection...

• Usable with only about 50 lines of changes!

• The notorious iperf can easily be made faster!

Discussion

19

• Push complex things into userspace
  • a microkernel?
  • no need to replace the whole system
  • and performance improves too

[Figure: conventional path (Application -> BSD socket -> in-kernel network stack -> NIC) vs. kernel-bypass path (Application -> userspace network stack -> NIC over a network channel), with the user/kernel boundary marked]

• Kernel bypass = a shortcut
  • what happens to the features that get skipped?
  • IPv4 only? What about UDP?
  • routing tables?
  • if you build a middlebox, what about Quagga and friends?

Discussion (cont’d)

20

Library OS

21

• Library Operating System (LibOS)
  • an OS the application can choose for itself

• The end-to-end principle (applied to the OS?)
  • rump [1], MirageOS [2], Drawbridge [3]

• Maybe the network stack, at least, belongs in userspace by now?

[1] Kantee, Rump File Systems: Kernel Code Reborn, USENIX ATC 2009.
[2] Madhavapeddy et al., Unikernels: Library Operating Systems for the Cloud, ASPLOS 2013.
[3] Porter et al., Rethinking the Library OS from the Top Down, ASPLOS 2011.

[Figure: DCE/NUSE architecture: applications (ip, iptables, quagga) on a POSIX layer, above a librarized kernel layer (ARP, Qdisc, TCP/UDP/DCCP/SCTP, ICMP, IPv4/IPv6, Netlink, bridging, Netfilter, IPsec, tunneling, bottom halves/RCU/timers/interrupts, struct net_device, heap/stack memory), glued by DCE to the ns-3 network simulation core alongside native ns-3 applications and the ns-3 TCP/IP stack]

• Network Stack in Userspace (NUSE)

• Direct Code Execution (DCE)
  • liblinux.so (latest)
  • libfreebsd.so (10.0.0)

• The entire in-kernel implementation turned into a library

• Richer functionality
• Faster?

An implementation of LibOS

22

NUSE: Network Stack in Userspace

Hajime Tazaki (1) and Mathieu Lacage (2)

(1) NICT, Japan  (2) France

Summary

Network Stack in Userspace (NUSE)
• A framework for the userspace execution of network stacks
• Based on Direct Code Execution (DCE), designed for the ns-3 network simulator
• Supports multiple kinds and versions of kernel-space network stacks (currently net-next Linux and freebsd.git)
• Introduces a distributed validation framework under a single controller (thanks to ns-3)
• Transparent to existing code (no manual patching of network stacks)

Problem Statement on Network Stack Development

[Figure: the infinite loop of network stack development: a kernel-space network stack implementation is hard to deploy, so a userspace implementation is desirable; but implementing a network stack from scratch is not realistic (620K LOC in ./net-next/net/) and interoperability must be validated all over again, so reusing the existing implementation is desirable]

The Architecture

[Figure: (kernel-space) network stack code is compiled into a shared library, which is dynamically linked into a userspace program]

• Kernel network stacks are compiled into a shared library and linked to applications.
• Unmodified application code (userspace) and network stack code (kernel-space) are both usable.

[Figure: a userspace process containing the application (socket/syscall), the NUSE top-half, the (guest) kernel network stack library, and the NUSE bottom-half, all running on top of the (host) kernel]

• The top-half provides a transparent interface to applications via system-call redirection.
• The bottom-half provides a bridge between the userspace network stack and the host operating system.

Call trace through NUSE (this example runs the freebsd.git network stack):

(gdb) bt
--------------- Host OS / raw socket
#0  sendto () at ../sysdeps/unix/syscall-template.S:82
#1  ns3::EmuNetDevice::SendFrom () at ../src/emu/model/emu-net-device.cc:913
#2  ns3::EmuNetDevice::Send () at ../src/emu/model/emu-net-device.cc:835
--------------- NUSE bottom half
#3  ns3::LinuxSocketFdFactory::DevXmit () at ../model/linux-socket-fd-factory.cc:297
#4  sim_dev_xmit () at sim/sim.c:290
#5  fake_ether_output () at sim/sim-device.c:165
--------------- freebsd.git network stack layer
#6  arprequest () at freebsd.git/sys/netinet/if_ether.c:271
#7  arpresolve () at freebsd.git/sys/netinet/if_ether.c:419
#8  fake_ether_output () at sim/sim-device.c:89
#9  ip_output () at freebsd.git/sys/netinet/ip_output.c:631
#10 udp_output () at freebsd.git/sys/netinet/udp_usrreq.c:1233
#11 udp_send () at freebsd.git/sys/netinet/udp_usrreq.c:1580
#12 sosend_dgram () at freebsd.git/sys/kern/uipc_socket.c:1115
--------------- NUSE top half
#13 sim_sock_sendmsg () at sim/sim-socket.c:104
#14 sim_sock_sendmsg_forwarder () at sim/sim.c:88
#15 ns3::LinuxSocketFdFactory::Sendmsg () at ../model/linux-socket-fd-factory.cc:633
--------------- syscall emulation layer
#16 ns3::LinuxSocketFd::Sendmsg () at ../model/linux-socket-fd.cc:76
#17 ns3::LinuxSocketFd::Write () at ../model/linux-socket-fd.cc:44
#18 dce_write () at ../model/dce-fd.cc:290
#19 write () at ../model/libc-ns3.h:187
--------------- application
#20 main () at ../example/udp-client.cc:34
--------------- ns-3 DCE scheduler
#21 ns3::DceManager::DoStartProcess () at ../model/dce-manager.cc:283
#22 ns3::TaskManager::Trampoline () at ../model/task-manager.cc:250
#23 ns3::UcontextFiberManager::Trampoline () at ../model/ucontext-fiber-manager.cc:1
#24 ?? () from /lib64/libc.so.6
---------------
#25 ?? ()

The kernel network stack code is integrated transparently into the NUSE framework, with no modifications to the original code.

Experience

[Figure: three test setups: 1) Linux native stack (apps on the host Linux stack and NIC); 2) NUSE with the Linux stack (apps, NUSE top-half, linux net-next, NUSE bottom-half and ns3-core over a vNIC on the host); 3) NUSE with the FreeBSD stack (same layout with freebsd.git)]

• A simple UDP socket-based traffic generator is used in the three scenarios.
• The development-tree versions of the Linux and FreeBSD kernels are encapsulated by NUSE without modifying the application or the host network stack.

[Figure: UDP packet-generation performance (pps, 0 to 20000) over time (1 to 5 s) for 1) native, 2) Linux NUSE, 3) FreeBSD NUSE]

• Linux NUSE behaves differently from the native Linux stack: the current timer implementation in NUSE is simplified and not accurate enough to emulate the native one.
• Native and FreeBSD NUSE show similar UDP packet-generation behavior.

Possible Use-Cases

[Figure: a process embedding NUSE and a (guest) network stack in user-space, bypassing the host kernel stack via raw socket/tap/netmap to reach the NIC]

• Deployment of application-embedded network stacks (e.g., Firefox + NUSE Linux + MPTCP).
• No kernel stack replacement is required; the host network stack is simply bypassed.

[Figure: multiple NUSE processes, each with its own network stack instance (X, Y, Z), multiplexed over ns-3 above the host network stack and NIC]

• Debugging of multiple network stack instances via the ns-3 network simulator.
• A validation platform spanning distributed network entities (like VM multiplexing) driven by a simple, controllable scenario.

Related Work

• Userspace network stack ports: Network Simulation Cradle [6], Daytona [9], Alpine [4], Entrapid [5], MiNet [3]. These need manual patching, which makes tracking the latest versions difficult.
• Virtual machines and operating systems: KVM [8], Linux Containers [1], User-mode Linux [2]. High-level transparency gives broad compatibility for both application and network-stack code, but adds unnecessary overhead to application execution.
• In-kernel userspace execution support: Runnable Userspace Meta Program (RUMP) [7]. No manual patching (already integrated into NetBSD), but lacks integrated operation.

Future Directions

• Userspace binary emulation should ...
• More useful examples with features unavailable in current network stacks (e.g., MPTCP)
• Seek performance improvements for the userspace network stack by utilizing multiple processors
• Promote merging into the Linux/FreeBSD main trees as a general framework

References

[1] Sukadev Bhattiprolu, Eric W. Biederman, Serge Hallyn, and Daniel Lezcano. Virtual servers and checkpoint/restart in mainstream Linux. ACM SIGOPS Operating Systems Review, 42(5):104–113, July 2008.
[2] J. Dike et al. User Mode Linux. In Proceedings of the 5th Annual Linux Showcase and Conference (ALS '01), pages 3–14. USENIX Association, 2001.
[3] P. A. Dinda. The Minet TCP/IP stack. Northwestern University Department of Computer Science, Technical Report NWU-CS-02-08, 2002.
[4] David Ely, Stefan Savage, and David Wetherall. Alpine: a user-level infrastructure for network protocol development. In Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems (USITS '01), pages 1–13, Berkeley, CA, USA, 2001.
[5] X. W. Huang, R. Sharma, and S. Keshav. The ENTRAPID protocol development environment. In Proceedings of IEEE INFOCOM '99, volume 3, pages 1107–1115, March 1999.
[6] Sam Jansen and Anthony McGregor. Simulation with real world network stacks. In Proceedings of the 37th Winter Simulation Conference (WSC '05), pages 2454–2463, 2005.
[7] A. Kantee. Rump file systems: Kernel code reborn. In Proceedings of the 2009 USENIX Annual Technical Conference (USENIX ATC '09), pages 1–14. USENIX Association, 2009.
[8] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori. kvm: the Linux Virtual Machine Monitor. In Linux Symposium, pages 225–230, 2007.
[9] P. Pradhan, S. Kandula, W. Xu, A. Shaikh, and E. Nahum. Daytona: A user-level TCP stack. Technical report, 2002.

The 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Hollywood, USA, October, 2012

Next Steps

• net-next-sim (liblinux.so)
  • a userspace stack with functionality equivalent to Linux

• make menuconfig to build in only what you need
  • (work in progress)

23

Wrap-up

• mTCP: easy and fast!
  • (at least as far as accelerating packet I/O goes)

• But is being fast all that matters?

• Let's be fast, feature-rich, and flexible!

24

References

• Jeong et al., mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, USENIX NSDI, April, 2014

• iperf mTCP version
  • https://github.com/thehajime/iperf-2.0.5-mtcp

• PacketShader I/O engine mod (X520)
  • https://github.com/thehajime/mtcp

• Direct Code Execution (liblinux.so)
  • https://github.com/direct-code-execution/net-next-sim

25

• To Linux 3.8.0
  • basically no problems
  • ARP entries and static routes are configured via files

• To Mac OS X (10.9.2), TCPv4 (iperf)
  • a /32 static route cannot be configured...
  • but the connection still reaches ESTABLISHED!

Interoperability Notes

26