Windows Server 2012 R2 - Boosted by Mellanox


Description: Presentation at the joint session with Microsoft Japan - 2013/12/4

Transcript of Windows Server 2012 R2 - Boosted by Mellanox

Page 1: Windows Server 2012 R2 - Boosted by Mellanox

Mellanox Technologies Japan K.K.

Senior Systems Engineer: 友永 和総

December 4, 2013

Windows Server 2012 R2 – Boosted by Mellanox

Page 2: Windows Server 2012 R2 - Boosted by Mellanox

Session Agenda

Mellanox overview

Microsoft + Mellanox
• SMB Direct
• NVGRE offload

Appendix
• RoCE configuration
• Considerations for I/O consolidation

Reference material

• SMB Direct – Protocol Deep Dive

Page 3: Windows Server 2012 R2 - Boosted by Mellanox

Company Overview

The leading provider of high-bandwidth, low-latency interconnects for server and storage environments
• FDR InfiniBand (56Gbps) and 10/40/56 Gigabit Ethernet supported on common hardware
• Faster data access dramatically improves application performance
• Dramatically better data-center ROI through large reductions in node count and improved management efficiency

Headquarters and employees
• Yokneam (Israel) and Sunnyvale (USA)
• Approximately 1,400 employees worldwide

Solid financial footing
• FY2011 revenue: $259.3M (up 67.6%)
• FY2012 revenue: $500.8M (up 93.2%)

• Cash + investments @ 9/30/13 = $306.4M

Ticker: MLNX

* As of September 2013

Page 4: Windows Server 2012 R2 - Boosted by Mellanox

VPI (Virtual Protocol Interconnect) technology
• IB and EN with a single chip (ConnectX-3, SwitchX-2)
• IB and EN port by port (ConnectX-3, SwitchX-2)
• IB/EN bridging (SwitchX-2)

High throughput, low latency, ultra-low power

RDMA (Remote Direct Memory Access) support for high-speed data transfer

VXLAN/NVGRE offload (ConnectX-3 Pro)

[Product images]
• ConnectX-3 adapter: PCIe 3.0 x8/x16; 2 ports of FDR InfiniBand (56Gbps) or 1/10/40/56GbE Ethernet; typical power 7.9W (2-port 40GbE); approx. 17mm x 45mm ASIC
• SwitchX-2 switch ASIC: 144 network SerDes; 36x 40/56GbE, 64x 10GbE, or 48x 10GbE + 12x 40/56GbE; InfiniBand, Ethernet, or InfiniBand/Ethernet bridging; 83W (36x 40GbE) or 63W (64x 10GbE) at 100% load

Mellanox core technology: high-performance, highly integrated ASICs

Page 5: Windows Server 2012 R2 - Boosted by Mellanox

Leading Supplier of End-to-End Interconnect Solutions

[Portfolio diagram] Virtual Protocol Interconnect spanning server/compute, switch/gateway, storage front/back-end, and Metro/WAN: 56G InfiniBand & FCoIB alongside 10/40/56GbE & FCoE. Product lines: ICs, adapter cards, switches/gateways, cables/modules, and host/fabric software.

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

Page 7: Windows Server 2012 R2 - Boosted by Mellanox

Top Tier OEMs, ISVs and Distribution Channels

[Logo wall] Hardware OEMs (server, storage, medical, embedded), software partners, and selected channel partners.

Page 8: Windows Server 2012 R2 - Boosted by Mellanox

InfiniBand Enables Lowest Application Cost in the Cloud (Examples)

• Microsoft Windows Azure: 90.2% cloud efficiency, 33% lower cost per application
• Cloud application performance improved up to 10X; 3X increase in VMs per physical server
• Consolidation of network and storage I/O: 32% lower cost per application, 694% higher network performance

Page 9: Windows Server 2012 R2 - Boosted by Mellanox


Microsoft + Mellanox

Page 10: Windows Server 2012 R2 - Boosted by Mellanox

SMB Direct (RDMA) – a technology introduced in Windows Server 2012 that dramatically raises I/O performance

Enabled by Mellanox ConnectX-3 (InfiniBand and 10G/40G Ethernet NICs)

Hyper-V over SMB Direct
• Higher consolidation ratios and better application performance

Hyper-V RDMA Live Migration (Windows Server 2012 R2)
• Shorter live-migration times; simpler, more efficient operations

Microsoft SQL Server 2012 – AlwaysOn
• Low-latency Mellanox networking between the mirrored DB servers improves database write performance

Hyper-V based VDI solutions
• Doubles the number of VDI clients on the same hardware configuration, improving cost performance

Microsoft SQL 2012 Parallel Data Warehouse V2
• A high-speed database appliance built on fast SMB Direct

NVGRE offload (Windows Server 2012 R2)

Enabled by Mellanox ConnectX-3 Pro (10G/40G Ethernet NICs)
• Overlay-network packet processing is offloaded to the network adapter in hardware
• Removes the CPU bottleneck so the full bandwidth of the network can be used

Mellanox technology in Microsoft solutions

Page 11: Windows Server 2012 R2 - Boosted by Mellanox

Remote Direct Memory Access
• A data-transfer technique that achieves zero-copy and CPU bypass
• Supported as a standard interconnect protocol
• Transfers data directly between the buffers of remote applications
• Enables data transfer at very low latency

RDMA protocols
• InfiniBand – up to 56Gb/s
• RDMA over Converged Ethernet (RoCE) – up to 40Gb/s

SMB Direct integrates RDMA processing into the Windows file-sharing protocol (SMB 3.0)

[Diagram] SMB 3.0 on top of RDMA, over InfiniBand or Ethernet. With a Mellanox Ethernet NIC, RDMA also runs over Ethernet.*

*DCB configuration is required when using RDMA over Ethernet

Key features of RDMA

Page 12: Windows Server 2012 R2 - Boosted by Mellanox

RoCE Frame

RoCE (RDMA over Converged Ethernet)

Source: http://blog.infinibandta.org/2012/02/13/roce-and-infiniband-which-should-i-choose/

Source: IBTA Supplement to InfiniBand Architecture Specification Volume 1 Release 1.2.1 – Annex A16: RDMA over Converged Ethernet (RoCE)

Page 13: Windows Server 2012 R2 - Boosted by Mellanox

RDMA Overview

RDMA over InfiniBand or Ethernet

[Diagram: two racks, with user space, kernel, and hardware layers]
• TCP/IP path: data is copied from the application buffer through OS and NIC buffers on one side, and back up through NIC and OS buffers to the application buffer on the other side – a "bucket brigade" style of transfer.
• RDMA path: the HCAs move data directly between the two applications' buffers – a "thick, directly connected hose" style of transfer.

If you think of the data as water, that is the difference between the two approaches.

Page 14: Windows Server 2012 R2 - Boosted by Mellanox

I/O Offload Frees Up CPU for Application Processing

[Chart: CPU split between user space and system space]
• Without RDMA: ~53% of CPU available for the application, ~47% CPU overhead
• With RDMA and offload: ~88% of CPU available for the application, ~12% CPU overhead

Page 15: Windows Server 2012 R2 - Boosted by Mellanox

SMB Direct - File Read

[RoCE frame capture: SMB Client / SMB Server] For a file read, the data is delivered by an RDMA Write issued by the SMB server.

Page 16: Windows Server 2012 R2 - Boosted by Mellanox

SMB Direct - File Write

[RoCE frame capture: SMB Client / SMB Server] For a file write, the data is fetched by an RDMA Read issued by the SMB server.

Page 17: Windows Server 2012 R2 - Boosted by Mellanox

© 2013 Mellanox Technologies 17

How to watch the RDMA Traffic

• ibdump.exe

Page 18: Windows Server 2012 R2 - Boosted by Mellanox

Measuring SMB Direct Performance

[Benchmark configurations] Each setup runs an I/O micro-benchmark against Fusion-io storage:
• Single server (local baseline)
• SMB Client / SMB Server connected over 10GbE
• SMB Client / SMB Server connected over FDR InfiniBand
• SMB Client / SMB Server connected over QDR InfiniBand

Page 19: Windows Server 2012 R2 - Boosted by Mellanox


Microsoft Delivers Low-Cost Replacement to High-End Storage

FDR 56Gb/s InfiniBand delivers 5X higher throughput with 50% less CPU overhead vs. 10GbE

Native Throughput Performance over FDR InfiniBand

Page 20: Windows Server 2012 R2 - Boosted by Mellanox

[Benchmark configurations: Native vs. Remote VM]
• Native: SQLIO running directly on a single server against local storage (SAS RAID controllers, each with a JBOD of eight SSDs)
• Remote VM: SQLIO running in a VM on a Hyper-V host (SMB 3.0), connected through RDMA NICs to a file server (SMB 3.0) backed by the same SAS RAID controller / JBOD / SSD storage

Hyper-V over SMB Direct – Performance

Page 21: Windows Server 2012 R2 - Boosted by Mellanox

Configuration   BW (MB/sec)   IOPS (512KB IOs/sec)   %CPU Privileged   Latency (ms)
Native          10,090        38,492                 ~2.5%             ~3
Remote VM       10,367        39,548                 ~4.6%             ~3

SMB 3.0 Performance in Virtualized Environment

SMB 3.0 over InfiniBand Delivers Native Performance

Page 22: Windows Server 2012 R2 - Boosted by Mellanox

[Benchmark configuration] A file client (SMB 3.0) running SQLIO, connected through multiple RDMA NICs to a file server (SMB 3.0) using Storage Spaces over six SAS HBAs, each attached to a JBOD of eight SSDs.

Results:
• 8KB random reads from a mirrored space (disk): ~600,000 IOPS
• 8KB random reads from cache (RAM): ~1,000,000 IOPS
• 32KB random reads from a mirrored space (disk): ~500,000 IOPS (~16.5 GBytes/sec)

EchoStreams: InfiniBand Enables Near Linear Scalability

Page 23: Windows Server 2012 R2 - Boosted by Mellanox


Hyper-V Live Migration over SMB

SMB as a transport for Live Migration of VMs

Delivers the power of SMB to provide:
• RDMA (SMB Direct)
• Streaming over multiple NICs (SMB Multichannel)

Provides highest bandwidth and lowest latency

[Callouts] Multiple links can be used; RDMA makes full use of high-bandwidth networks; RDMA keeps the CPU load to a minimum

[Chart: live-migration times in seconds (0–70 s) – new in Windows Server 2012 R2]

Page 24: Windows Server 2012 R2 - Boosted by Mellanox

RDMA data transfer enables low-latency writes across the mirrored database pair

Microsoft SQL Server 2012 AlwaysOn

Page 25: Windows Server 2012 R2 - Boosted by Mellanox


Microsoft PDW* V2 – 10X Faster & 50% Lower Capital Cost

[Appliance components] Control Node, Mgmt. Node, LZ, Backup Node

PDW V1: Ethernet, InfiniBand & Fiber Channel
• 160 cores on 10 compute nodes
• 1.28 TB of RAM on compute
• Up to 30 TB of temp DB
• Up to 150 TB of user data
• Estimated total HW component list price: $1M

PDW V2: InfiniBand & Ethernet
• 128 cores on 8 compute nodes
• 2 TB of RAM on compute
• Up to 168 TB of temp DB
• Up to 1 PB of user data
• Estimated total HW component list price: $500K

• Pure hardware costs are ~50% lower
• Price per raw TB is close to 70% lower due to higher capacity
• 70% more disk I/O bandwidth

*Parallel Data Warehouse

Page 26: Windows Server 2012 R2 - Boosted by Mellanox


Accelerating Microsoft SQL 2012 Parallel Data Warehouse V2

Analyze 1 Petabyte of Data in 1 Second

Up to 100X faster performance than legacy data warehouse queries

Up to 50X faster data query, up to 2X the data loading rate

Unlimited storage scalability for future proofing

Accelerated by Mellanox FDR 56Gb/s InfiniBand end-to-end solutions

Page 27: Windows Server 2012 R2 - Boosted by Mellanox


NVGRE H/W Offload

Page 28: Windows Server 2012 R2 - Boosted by Mellanox


ConnectX-3 Pro | The Next Generation Cloud Competitive Asset

World’s first Cloud offload interconnect solution

Provides hardware offloads for Overlay Networks – enables mobility, scalability, serviceability

Dramatically lowers CPU overhead, reduces cloud application cost

Highest throughput (10, 40GbE & 56GbE), SR-IOV, PCIe Gen3, low power

More users

Mobility

Scalability

Simpler Management

Lower Application Cost

Cloud 2.0

The Foundation of Cloud 2.0

Page 29: Windows Server 2012 R2 - Boosted by Mellanox

World’s First HW Offload Engines for Overlay Network Protocols

Introducing L2 Virtual Tunneling solutions for virtualized data centers

• NVGRE and VXLAN

Virtual L2 tunnels provide a method for “creating” virtual domains on top of a scalable L3 virtualized infrastructure

• Enabling virtual domains with complete isolation

Targeting public/private cloud networks with multi-tenants

Mellanox uniqueness: HW offload = higher performance

• Checksums, LSO, FlowID calculation, VLAN Stripping / insertion

• Combined with steering mechanisms: RSS, VMQ

[Diagram] Three virtual domains (Domain1, Domain2, Domain3), each spanning VMs on multiple servers, connected by Layer 2 tunneling over a physical switch.

Page 30: Windows Server 2012 R2 - Boosted by Mellanox

NVGRE

• MAC over GRE
• 24-bit tenant ID

[Encapsulation] outer MAC | GRE | inner MAC | inner IP (v4/v6) | ….
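To make the 24-bit tenant ID concrete, here is a minimal C sketch of the GRE header as NVGRE uses it (per RFC 7637: key-present flag, protocol type 0x6558 for Transparent Ethernet Bridging, and a key word carrying a 24-bit Virtual Subnet ID plus an 8-bit FlowID). The struct and helper names are illustrative, not from the slides, and values are shown in host byte order rather than as they appear on the wire.

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal sketch of the GRE header used by NVGRE (RFC 7637). On the wire the
 * fields are big-endian; they are shown in host byte order here. */
struct nvgre_gre_hdr {
    uint16_t flags_version;  /* 0x2000: Key Present bit set, GRE version 0     */
    uint16_t protocol_type;  /* 0x6558: Transparent Ethernet Bridging          */
    uint32_t key;            /* 24-bit Virtual Subnet ID (VSID) + 8-bit FlowID */
};

static uint32_t nvgre_key(uint32_t vsid, uint8_t flow_id)
{
    return ((vsid & 0xFFFFFFu) << 8) | flow_id;  /* tenant id in the top 24 bits */
}

int main(void)
{
    struct nvgre_gre_hdr h = { 0x2000, 0x6558, nvgre_key(0x00ABCD, 7) };
    printf("VSID=0x%06X FlowID=%u\n",
           (unsigned)(h.key >> 8), (unsigned)(h.key & 0xFFu));
    return 0;
}
```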

Page 31: Windows Server 2012 R2 - Boosted by Mellanox

[Charts: NVGRE with ConnectX-3 Pro offloads vs. NVGRE without offloads]
• Throughput (Gb/s, higher is better): ~65% higher throughput with offloads
• CPU overhead (CPU cycles per byte, lower is better): ~80% lower CPU overhead with offloads

Higher Throughput for Less CPU Overhead

NVGRE Initial Performance Results (ConnectX-3 Pro, 10GbE)

Page 32: Windows Server 2012 R2 - Boosted by Mellanox

ConnectX-3 Pro NVGRE Throughput

[Chart: bandwidth (Gb/s) vs. number of VM pairs, ConnectX-3 Pro 10GbE]

VM pairs                  2      4      8      16
NVGRE offload disabled    4.55   4.8    5.0    5.5
NVGRE offload enabled     8.7    9.15   9.2    8.65

Page 33: Windows Server 2012 R2 - Boosted by Mellanox


Microsoft:

• http://smb3.info

• Blog posts from Microsoft about SMB Direct

Mellanox.com:

• http://www.mellanox.com/page/file_storage

- Recipe and how-to guides

• http://www.mellanox.com/page/edc_system

- Demo/Test RDMA on Windows Server 2012

Links

Page 34: Windows Server 2012 R2 - Boosted by Mellanox

Mellanox RDMA technology is built into Windows Server 2012 and Windows Server 2012 R2 as the standard "SMB Direct" feature – a highly effective technology that raises I/O performance while lowering CPU load.

Windows Server 2012 and Windows Server 2012 R2 are the first in the industry to ship, as a standard OS feature, a breakthrough technology that combines the operability and manageability of a file protocol with performance that exceeds even block storage.

• Simply installing Mellanox ConnectX-3 unlocks outstanding network performance – on the order of 10x over conventional setups (exceeding even block storage)

• Works not only over InfiniBand but also over Ethernet

• In Hyper-V environments, RDMA is used not only for file-storage access (Hyper-V over SMB) but also for live migration

• Applicable to a wide range of solutions built on Windows Server 2012 R2 – Hyper-V, VDI, SQL Server

Summary

Page 35: Windows Server 2012 R2 - Boosted by Mellanox

Appendix

Page 36: Windows Server 2012 R2 - Boosted by Mellanox

RoCE Configuration

http://www.mellanox.com/pdf/whitepapers/WP_Deploying_Windows_Server_Eth.pdf

http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf

Because RoCE uses standard Ethernet frames, it will in principle work without any special configuration, but DCB (flow-control) settings should be applied for performance and stability:

• Windows host configuration

• Ethernet switch configuration

Page 37: Windows Server 2012 R2 - Boosted by Mellanox

Considerations for Using SMB Direct

http://technet.microsoft.com/ja-jp/library/jj134210.aspx (limitations in the Windows specification)

You can use SMB Direct in the Hyper-V management operating system to enable Hyper-V over SMB and to provide storage to virtual machines that use the Hyper-V storage stack. However, RDMA-capable network adapters are not exposed directly to Hyper-V clients. If you connect an RDMA-capable network adapter to a virtual switch, the virtual network adapters on that switch are no longer RDMA-capable.

If you disable SMB Multichannel, SMB Direct is also disabled. Because SMB Multichannel detects network-adapter capabilities and determines whether an adapter is RDMA-capable, a client cannot use SMB Direct when SMB Multichannel is disabled.

SMB Direct is not supported on down-level versions of Windows Server; it is supported only on Windows Server 2012.

Page 38: Windows Server 2012 R2 - Boosted by Mellanox

Considerations for I/O Consolidation

From the consideration on the previous page: "If you connect an RDMA-capable network adapter to a virtual switch, the virtual network adapters on that switch are no longer RDMA-capable."

Does this mean that SMB Direct (the storage-access and live-migration path) and VM traffic (TCP/IP, which must go through a virtual switch) cannot be consolidated onto a single interface?

Mellanox's solution

• Add a virtual interface with the part_man command, so that one physical port is presented to the OS as two logical ports

Page 39: Windows Server 2012 R2 - Boosted by Mellanox

The part_man Command

Example:

# part_man add "Ethernet 4" <any name>

Current status
• InfiniBand: supported
• Ethernet: release planned (as explained at the session)

MLNX WinOF 4.55 User Manual

Page 40: Windows Server 2012 R2 - Boosted by Mellanox

Reference Material: SMB Direct – Protocol Deep Dive

Page 42: Windows Server 2012 R2 - Boosted by Mellanox


RDMA Transports

• The SMBDirect Protocol is transport-independent.

• It requires only an RDMA lower layer for sending and receiving the messages.

• The RDMA transports most commonly used by SMBDirect include:

- iWARP

- InfiniBand Reliable Connected mode

- RDMA over Converged Ethernet (RoCE)

Protocols Transported by SMBDirect

• SMB2 Protocol [MS-SMB2]

- when SMB2 version 3.0 is negotiated

- by both client and server

- over an RDMA-capable transport

Relationship to Other Protocols

Page 43: Windows Server 2012 R2 - Boosted by Mellanox

The transport must provide in-order delivery

The transport must support direct data placement via RDMA Write and RDMA Read requests

• Examples: iWARP, InfiniBand, RoCE

Only 3 message types

• Negotiate request
• Negotiate response
• Data transfer

Little-endian order

• least-significant byte first

Uses multiple connections

• 1st connection – negotiation
• 2nd and later connections – RDMA

SMBDirect Protocol Overview

Page 44: Windows Server 2012 R2 - Boosted by Mellanox


New additions for RDMA

• Negotiate

- Server capability advertisement: the server must advertise multi-channel support (multiple connections per session), because SMB Direct always starts with a TCP connection and then opens a second connection (or more) for RDMA

- Session setup: the initial session needs nothing special (normal processing); a new connection is created when RDMA is detected

› As part of RDMA connection setup, an SMB Direct negotiation occurs

› The new RDMA connection joins the previously set-up session

Initialization

Page 45: Windows Server 2012 R2 - Boosted by Mellanox

A normal TCP connection is used to negotiate SMB2.2 (SMB 3.0) and set up the session

After session setup:

• The client uses the FSCTL_QUERY_TRANSPORT_INFO FSCTL to query the server's interface capabilities

• Interface Capability = RDMA_CAPABLE

If a server RDMA NIC is found, and a local RDMA NIC is found that can connect to the server RDMA NIC:

• Additional connections using RDMA are created and bound to the session

• The original TCP connection is idled – all traffic goes over the RDMA channel

RDMA NICs are always selected first over other types of NICs

Creating the RDMA Connection

Page 46: Windows Server 2012 R2 - Boosted by Mellanox

The 3 messages are:

• Negotiate request – negotiate RDMA parameters
• Negotiate response – negotiate RDMA parameters
• Data transfer – encapsulates SMB2 messages

SMBDirect Data Transfer Mode

• Send/Receive mode

- Transmit SMB3 metadata requests and small SMB3 reads/writes

• RDMA mode

- Transmit data for large SMB3 reads/writes

SMBDirect Message Types – 3 Messages
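As a rough illustration of the split above, the helper below chooses between the two modes for a single I/O. The function name and the simple size test are my own simplification – a real client also considers the negotiated MaxReadWriteSize and its own policy when deciding where to cut over.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: small SMB3 metadata and small reads/writes travel as
 * the payload of Send/Receive messages; large reads/writes are placed
 * directly via RDMA using a registered buffer descriptor. */
static bool use_rdma_placement(uint32_t io_length, uint32_t max_send_payload)
{
    /* max_send_payload would be derived from the negotiated MaxReceiveSize
     * minus the SMBDirect data-transfer header and padding. */
    return io_length > max_send_payload;
}

int main(void)
{
    printf("8 KiB read  -> %s\n", use_rdma_placement(8192, 1000) ? "RDMA" : "Send");
    printf("500 B write -> %s\n", use_rdma_placement(500, 1000) ? "RDMA" : "Send");
    return 0;
}
```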

Page 47: Windows Server 2012 R2 - Boosted by Mellanox


Credits are bidirectional and asymmetric

Peers MUST avoid credit deadlock

• All sends must request at least one credit

• When consuming final credit, at least one must also be granted by the message

• These rules avoid deadlock

Peers SHOULD grant many credits

Peers can perform dynamic credit management

KEEPALIVE mechanism supports a liveness probe

• As a side effect, it refreshes credits

SMBDirect Credits
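The sketch below encodes the deadlock-avoidance rules from this slide as simple bookkeeping. The struct, function, and the choice of requesting 10 credits are mine, not from [MS-SMBD]; it only shows how "every send requests at least one credit" and "the final credit must also grant one" combine.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative bookkeeping for SMBDirect send credits (names are mine). */
struct credit_state {
    uint16_t send_credits;   /* credits we may spend on Send operations */
};

/* Fill in CreditsRequested/CreditsGranted for an outgoing message that will
 * consume one send credit; returns false if sending now could deadlock. */
static bool prepare_send(struct credit_state *s, uint16_t grantable,
                         uint16_t *credits_requested, uint16_t *credits_granted)
{
    if (s->send_credits == 0)
        return false;                  /* wait for the peer to grant more */

    /* Rule 1: all sends must request at least one credit. */
    *credits_requested = 10;           /* ask generously (peers SHOULD grant many) */

    /* Rule 2: a message that consumes our final credit must also grant at
     * least one, otherwise neither side could ever send again. */
    if (s->send_credits == 1 && grantable == 0)
        return false;

    *credits_granted = grantable;
    s->send_credits--;
    return true;
}

int main(void)
{
    struct credit_state s = { .send_credits = 1 };
    uint16_t req, grant;
    printf("can send with nothing to grant: %s\n",
           prepare_send(&s, 0, &req, &grant) ? "yes" : "no");   /* no  */
    printf("can send while granting one:    %s\n",
           prepare_send(&s, 1, &req, &grant) ? "yes" : "no");   /* yes */
    return 0;
}
```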

Page 48: Windows Server 2012 R2 - Boosted by Mellanox

SMB Direct Negotiate Request

CreditsRequested (2 bytes): The number of Send Credits requested of the receiver.
PreferredSendSize (4 bytes): The maximum number of bytes that the sender requests to transmit in a single message.
MaxReceiveSize (4 bytes): The maximum number of bytes that the sender can receive in a single message.
MaxFragmentedSize (4 bytes): The maximum number of upper-layer bytes that the sender can receive as the result of a sequence of fragmented Send operations.

[Wire layout, 32-bit words]
MinVersion (2 bytes) | MaxVersion (2 bytes)
Reserved (2 bytes) | CreditsRequested (2 bytes)
PreferredSendSize (4 bytes)
MaxReceiveSize (4 bytes)
MaxFragmentedSize (4 bytes)
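A compact C rendering of the layout above may help; the snake_case field names and packing pragma are my own, and fields are little-endian on the wire, matching the protocol overview.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the 20-byte SMBDirect Negotiate Request ([MS-SMBD] 2.2.1). */
#pragma pack(push, 1)
struct smbd_negotiate_req {
    uint16_t min_version;          /* lowest protocol version supported  */
    uint16_t max_version;          /* highest protocol version supported */
    uint16_t reserved;             /* MUST be 0                          */
    uint16_t credits_requested;    /* Send Credits requested             */
    uint32_t preferred_send_size;  /* max bytes sender wants per Send    */
    uint32_t max_receive_size;     /* max bytes sender can receive       */
    uint32_t max_fragmented_size;  /* max reassembled upper-layer bytes  */
};
#pragma pack(pop)

_Static_assert(sizeof(struct smbd_negotiate_req) == 20,
               "negotiate request is 20 bytes on the wire");

int main(void)
{
    printf("sizeof(smbd_negotiate_req) = %zu\n",
           sizeof(struct smbd_negotiate_req));
    return 0;
}
```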

Page 49: Windows Server 2012 R2 - Boosted by Mellanox

[Wire layout, 32-bit words]
MinVersion (2 bytes) | MaxVersion (2 bytes)
NegotiatedVersion (2 bytes) | Reserved (2 bytes)
CreditsRequested (2 bytes) | CreditsGranted (2 bytes)
Status (4 bytes)
MaxReadWriteSize (4 bytes)
PreferredSendSize (4 bytes)
MaxReceiveSize (4 bytes)
MaxFragmentedSize (4 bytes)

SMBDirect Negotiate Response

NegotiatedVersion (2 bytes): The SMBDirect Protocol version that has been selected for this connection. This value MUST be one of the values from the range specified by the SMBDirect Negotiate Request message.
CreditsRequested (2 bytes): The number of Send Credits requested of the receiver.
CreditsGranted (2 bytes): The number of Send Credits granted by the sender.
Status (4 bytes): Indicates whether the SMBDirect Negotiate Request message succeeded. The value MUST be set to STATUS_SUCCESS (0x0000) if the SMBDirect Negotiate Request message succeeds.
MaxReadWriteSize (4 bytes): The maximum number of bytes that the sender will transfer via RDMA Write or RDMA Read request to satisfy a single upper-layer read or write request.
PreferredSendSize (4 bytes): The maximum number of bytes that the sender will transmit in a single message. This value MUST be less than or equal to the MaxReceiveSize value of the SMBDirect Negotiate Request message.
MaxReceiveSize (4 bytes): The maximum number of bytes that the sender can receive in a single message.
MaxFragmentedSize (4 bytes): The maximum number of upper-layer bytes that the sender can receive as the result of a sequence of fragmented Send operations.
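The same layout, sketched in C with my own field names (little-endian on the wire, as before):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the 32-byte SMBDirect Negotiate Response ([MS-SMBD] 2.2.2). */
#pragma pack(push, 1)
struct smbd_negotiate_resp {
    uint16_t min_version;
    uint16_t max_version;
    uint16_t negotiated_version;   /* must lie in the requested range    */
    uint16_t reserved;
    uint16_t credits_requested;
    uint16_t credits_granted;
    uint32_t status;               /* STATUS_SUCCESS (0) on success      */
    uint32_t max_read_write_size;  /* per RDMA Read/Write transfer limit */
    uint32_t preferred_send_size;  /* <= requester's MaxReceiveSize      */
    uint32_t max_receive_size;
    uint32_t max_fragmented_size;
};
#pragma pack(pop)

_Static_assert(sizeof(struct smbd_negotiate_resp) == 32,
               "negotiate response is 32 bytes on the wire");

int main(void)
{
    printf("sizeof(smbd_negotiate_resp) = %zu\n",
           sizeof(struct smbd_negotiate_resp));
    return 0;
}
```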

Page 50: Windows Server 2012 R2 - Boosted by Mellanox

[Wire layout, 32-bit words]
CreditsRequested (2 bytes) | CreditsGranted (2 bytes)
Flags (2 bytes) | Reserved (2 bytes)
RemainingDataLength (4 bytes)
DataOffset (4 bytes)
DataLength (4 bytes)
Padding (variable)
Buffer (variable)

SMBDirect Data Transfer Message

RemainingDataLength (4 bytes): The amount of data, in bytes, remaining in a sequence of fragmented messages. If this value is 0x00000000, this message is the final message in the sequence.
DataOffset (4 bytes): The offset, in bytes, from the beginning of the SMBDirect header to the first byte of the message’s data payload. If no data payload is associated with this message, this value MUST be 0. This offset MUST be 8-byte aligned from the beginning of the message.
DataLength (4 bytes): The length, in bytes, of the message’s data payload. If no data payload is associated with this message, this value MUST be 0.
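In C, again with my own names: the fixed header is 20 bytes, and with the usual 4 bytes of padding the payload begins at DataOffset = 24 (0x18), 8-byte aligned, as in the examples on the following pages.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the SMBDirect Data Transfer header ([MS-SMBD] 2.2.3). */
#pragma pack(push, 1)
struct smbd_data_transfer_hdr {
    uint16_t credits_requested;
    uint16_t credits_granted;
    uint16_t flags;                  /* 0x0000 in the examples here           */
    uint16_t reserved;
    uint32_t remaining_data_length;  /* 0 for the final (or only) fragment    */
    uint32_t data_offset;            /* offset of payload from header start   */
    uint32_t data_length;            /* payload bytes carried by this message */
    /* variable Padding and Buffer (the payload) follow on the wire */
};
#pragma pack(pop)

_Static_assert(sizeof(struct smbd_data_transfer_hdr) == 20,
               "fixed data-transfer header is 20 bytes");

int main(void)
{
    printf("header %zu bytes + 4 bytes padding -> DataOffset 24\n",
           sizeof(struct smbd_data_transfer_hdr));
    return 0;
}
```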

Page 51: Windows Server 2012 R2 - Boosted by Mellanox

[Wire layout]
Offset (8 bytes)
Token (4 bytes)
Length (4 bytes)

SMBDirect Buffer Descriptor V1 Structure

Offset (8 bytes): The RDMA provider-specific offset, in bytes, identifying the first byte of data to be transferred to or from the registered buffer.
Token (4 bytes): An RDMA provider-assigned Steering Tag for accessing the registered buffer.
Length (4 bytes): The size, in bytes, of the data to be transferred to or from the registered buffer.
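A C sketch of this 16-byte descriptor, populated with the example values used later on page 56 (field names are mine; the descriptor travels inside the upper-layer SMB2 read/write request, see the Channel field on the next page):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the 16-byte SMBDirect Buffer Descriptor V1 ([MS-SMBD] 2.2.3.1). */
#pragma pack(push, 1)
struct smbd_buffer_descriptor_v1 {
    uint64_t offset;  /* provider-specific offset into the registered buffer */
    uint32_t token;   /* provider-assigned steering tag for the buffer       */
    uint32_t length;  /* bytes to transfer to or from the registered buffer  */
};
#pragma pack(pop)

_Static_assert(sizeof(struct smbd_buffer_descriptor_v1) == 16,
               "buffer descriptor V1 is 16 bytes");

int main(void)
{
    struct smbd_buffer_descriptor_v1 d = {
        .offset = 0x00000000ABCDE012ull,  /* example values from page 56 */
        .token  = 0x1A00BC56u,
        .length = 0x100000u,              /* 1 MiB */
    };
    printf("descriptor: offset=%llx token=%x length=%u\n",
           (unsigned long long)d.offset, (unsigned)d.token, (unsigned)d.length);
    return 0;
}
```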

Page 52: Windows Server 2012 R2 - Boosted by Mellanox

[Wire layout, 32-bit words]
StructureSize (2 bytes) | Padding (1 byte) | Reserved (1 byte)
Length (4 bytes)
Offset (8 bytes)
FileId (16 bytes)
MinimumCount (4 bytes)
Channel (4 bytes)
RemainingBytes (4 bytes)
ReadChannelInfoOffset (2 bytes) | ReadChannelInfoLength (2 bytes)
Buffer (variable)

SMB2 READ Request

Channel (4 bytes): For SMB 2.002 and 2.1 dialects, this field MUST NOT be used and MUST be reserved. The client MUST set this field to 0, and the server MUST ignore it on receipt. For the SMB 3.0 dialect, this field MUST contain exactly one of the following values:

SMB2_CHANNEL_NONE (0x00000000): No channel information is present in the request. The ReadChannelInfoOffset and ReadChannelInfoLength fields MUST be set to 0 by the client and MUST be ignored by the server.

SMB2_CHANNEL_RDMA_V1 (0x00000001): One or more SMB_DIRECT_BUFFER_DESCRIPTOR_V1 structures, as specified in [MS-SMBD] section 2.2.3.1, are present in the channel information specified by the ReadChannelInfoOffset and ReadChannelInfoLength fields.
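For reference, here is the READ request body from the layout above as a C sketch, with the two SMB 3.0 Channel values used by SMB Direct. Field names are mine and the 16-byte FileId is left as a raw byte array.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the SMB2 READ request body ([MS-SMB2] 2.2.19). */
#define SMB2_CHANNEL_NONE     0x00000000u
#define SMB2_CHANNEL_RDMA_V1  0x00000001u

#pragma pack(push, 1)
struct smb2_read_req {
    uint16_t structure_size;            /* 49: counts the first Buffer byte  */
    uint8_t  padding;
    uint8_t  reserved;
    uint32_t length;                    /* bytes to read                     */
    uint64_t offset;                    /* file offset                       */
    uint8_t  file_id[16];
    uint32_t minimum_count;
    uint32_t channel;                   /* SMB2_CHANNEL_* above              */
    uint32_t remaining_bytes;
    uint16_t read_channel_info_offset;  /* locates the SMBDirect descriptors */
    uint16_t read_channel_info_length;
    /* variable Buffer follows; for SMB2_CHANNEL_RDMA_V1 it holds one or more
     * SMBDirect Buffer Descriptor V1 structures */
};
#pragma pack(pop)

_Static_assert(sizeof(struct smb2_read_req) == 48,
               "fixed READ body is 48 bytes (StructureSize 49 counts Buffer[0])");

int main(void)
{
    printf("channel for an RDMA read: 0x%08X\n", SMB2_CHANNEL_RDMA_V1);
    return 0;
}
```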

Page 53: Windows Server 2012 R2 - Boosted by Mellanox

The initiator (for example, an SMB2 client) sends an SMBDirect Negotiate message, indicating that it is capable of the 1.0 version of the protocol, can send and receive up to 1 KiB of data per Send operation, and can reassemble fragmented Sends up to 128 KiB.

• The SMBDirect Negotiate request message fields are set to the following:

- MinVersion: 0x0100 MaxVersion: 0x0100 Reserved: 0x0000 CreditsRequested: 0x000A (10)

- PreferredSendSize: 0x00000400 (1 KiB) MaxReceiveSize: 0x00000400 (1 KiB) MaxFragmentedSize: 0x00020000 (128 KiB)

The peer receives the SMBDirect Negotiate request and selects version 1.0 as the version for the connection. The negotiate response indicates that the peer can receive up to 1 KiB of data per Send operation, and requests that the requestor permit the same. The negotiate response also grants an initial batch of 10 Send Credits and requests 10 Send Credits to be used for future messages.

• The SMBDirect Negotiate response message fields are set to the following:

- MinVersion: 0x0100 MaxVersion: 0x0100 NegotiatedVersion: 0x0100 Reserved: 0x0000

- CreditsRequested: 0x000A (10) CreditsGranted: 0x000A (10) Status: 0x0000

- MaxReadWriteSize: 0x00100000 (1MiB) PreferredSendSize: 0x00000400 (1KiB)

- MaxReceiveSize: 0x00000400 (1KiB) MaxFragmentedSize: 0x00020000 (128KiB)

The peer sends the first data transfer, typically an upper-layer SMB2 Negotiate Request. The message grants an initial credit limit of 10, and requests 10 credits to begin sending normal traffic.

• The SMBDirect Data Transfer message fields are set to the following:

- CreditsRequested: 0x000A (10) CreditsGranted: 0x000A (10) Flags: 0x0000

- Reserved: 0x0000 RemainingDataLength: 0x000000 (nonfragmented message) DataOffset: 0x00000018 (24)

- DataLength: 0x00000xxx (length of Buffer) Padding: 0x00000000 (4 bytes of 0x00) Buffer: (Upper layer message)

An SMBDirect Version 1.0 Protocol connection has now been established, and the initial message is processed.

Example - Establishing a Connection
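As a small, hypothetical cross-check of this exchange (the helper and its name are mine, not part of the protocol), the responder's choices have to fit what the requester advertised, which the example values above satisfy:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Validate a negotiate response against the request that produced it,
 * following the rules quoted on the earlier pages. */
static bool response_acceptable(uint16_t req_min_ver, uint16_t req_max_ver,
                                uint32_t req_max_receive_size,
                                uint16_t resp_negotiated_ver,
                                uint32_t resp_preferred_send_size,
                                uint32_t resp_status)
{
    if (resp_status != 0)                        /* STATUS_SUCCESS expected   */
        return false;
    if (resp_negotiated_ver < req_min_ver || resp_negotiated_ver > req_max_ver)
        return false;                            /* version outside the range */
    /* The peer may not send more per message than we can receive. */
    return resp_preferred_send_size <= req_max_receive_size;
}

int main(void)
{
    /* Example values from this page: versions 0x0100, MaxReceiveSize 1 KiB,
     * PreferredSendSize 1 KiB, Status 0. */
    bool ok = response_acceptable(0x0100, 0x0100, 0x00000400,
                                  0x0100, 0x00000400, 0x0000);
    printf("connection established: %s\n", ok ? "yes" : "no");
    return 0;
}
```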

Page 54: Windows Server 2012 R2 - Boosted by Mellanox

The peer uses the Send operation to transmit the data because the upper-layer request did not provide an RDMA Buffer Descriptor. An SMBDirect Data Transfer message is sent that contains the 500 bytes of data as the message’s payload. The message requests 10 Send Credits to maintain the current credit limit and grants 1 Send Credit to replace the credit request used by the final message.

• The SMBDirect Data Transfer message fields are set to the following:

- CreditsRequested: 0x000A (10)

- CreditsGranted: 0x0001

- Flags: 0x0000

- Reserved: 0x0000

- RemainingDataLength: 0x000000 (nonfragmented message)

- DataOffset: 0x00000018 (24)

- DataLength: 0x000001F4 (500 = size of the data payload)

- Padding: 0x00000000 (4 bytes of 0x00)

- Buffer: (Upper layer message)

Example - Peer Transmits 500 Bytes of Data

Page 55: Windows Server 2012 R2 - Boosted by Mellanox


The peer uses fragmented Send operations to transmit the data because the message exceeds the remote peer’s negotiated MaxReceiveSize, but is within the MaxFragmentedSize. A sequence of fragmented Sends of SMBDirect Data Transfer messages is prepared. The messages each request 10 Send Credits and grant a Send Credit to maintain the credits offered to the peer for expected responses. Because the fragmented sequence requires more credits (65) than are currently available (10), several pauses can occur while waiting for credit replenishment.

The first SMBDirect Data Transfer message fields are set to the following:
• CreditsRequested: 0x000A (10)  CreditsGranted: 0x0001  Flags: 0x0000
• Reserved: 0x0000  RemainingDataLength: 0x000000xxx (63KiB remaining)  DataOffset: 0x00000018 (24)
• DataLength: 0x000003E8 (1000 = MaxReceiveSize – 24)  Padding: 0x00000000 (4 bytes of 0x00)
• Buffer: (1000 bytes of the upper-layer message)

The second SMBDirect Data Transfer message fields are set to the following:
• CreditsRequested: 0x000A (10)  CreditsGranted: 0x0001  Flags: 0x0000
• Reserved: 0x0000  RemainingDataLength: 0x000000xxx (62KiB remaining)  DataOffset: 0x00000018 (24)
• DataLength: 0x000003E8 (1000 = MaxReceiveSize – 24)  Padding: 0x00000000 (4 bytes of 0x00)
• Buffer: (1000 bytes of the upper-layer message)

(Additional intermediate fragments, and pauses, elided…)

The final SMBDirect Data Transfer message fields are set to the following:
• CreditsRequested: 0x000A (10)  CreditsGranted: 0x0001  Flags: 0x0000
• Reserved: 0x0000  RemainingDataLength: 0x00000000 (final message of fragmented sequence)
• DataOffset: 0x00000018 (24)
• DataLength: 0x00000218 (536 = last fragment)  Padding: 0x00000000 (4 bytes of 0x00)
• Buffer: (536 final bytes of the upper-layer message)

Example - Peer Transmits 64 KiB of Data
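The fragment sizes in this example fall out of simple arithmetic; the short program below reproduces it under the values negotiated in the earlier example (MaxReceiveSize of 1 KiB, 24-byte DataOffset).

```c
#include <stdint.h>
#include <stdio.h>

/* 64 KiB of upper-layer data, sent as SMBDirect Data Transfer payloads of at
 * most MaxReceiveSize - DataOffset bytes each. */
int main(void)
{
    const uint32_t total    = 64u * 1024u;       /* 65536 bytes                */
    const uint32_t max_recv = 1024u;             /* negotiated MaxReceiveSize  */
    const uint32_t data_off = 24u;               /* 20-byte header + 4 padding */
    const uint32_t per_send = max_recv - data_off;            /* 1000 bytes    */

    uint32_t full = total / per_send;            /* 65 fragments of 1000 bytes */
    uint32_t last = total % per_send;            /* 536-byte final fragment    */

    printf("%u fragments of %u bytes + a final fragment of %u bytes\n",
           full, per_send, last);    /* far more than the 10 credits granted initially */
    return 0;
}
```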

Page 56: Windows Server 2012 R2 - Boosted by Mellanox

The upper layer performs the transfer via RDMA. The buffer containing the data to be written is registered, obtaining the following single-element SMBDirect Buffer Descriptor V1. The buffer descriptor will be embedded in the upper-layer Write request.

• The SMBDirect Buffer Descriptor V1 fields are set to the following:

- Base: 0x00000000ABCDE012

- Length: 0x0000000000100000 (1 MiB)

- Token: 0x1A00BC56

The peer sends an SMBDirect Data Transfer message that contains an upper-layer Write request which includes the SMBDirect Buffer Descriptor V1 describing the 1 MiB buffer. The upper-layer message totals 500 bytes.

• The SMBDirect Data Transfer message fields are set to the following:

- CreditsRequested: 0x000A (10) CreditsGranted: 0x0001 (1) Flags: 0x0000 Reserved: 0x0000

- RemainingDataLength: 0x000000 (nonfragmented message) DataOffset: 0x00000018 (24)

- DataLength: 0x000001F4 (500 = size of the data payload) Padding: 0x00000000 (4 bytes of 0x00)

- Buffer: (Upper-layer message)

The message is recognized by the upper layer as a Write request via RDMA, and the supplied SMBDirect buffer descriptor is used to RDMA Read the data from the peer into a local buffer.

(the RDMA device performs an RDMA Read operation)

The write processing is completed, and the upper layer later replies to the peer.

The peer deregisters the buffer and completes the operation.

Example - Peer Transmits 1 MiB of Data Via Upper Layer

Page 57: Windows Server 2012 R2 - Boosted by Mellanox

The upper layer performs the transfer via RDMA. The buffer containing the data to be read is registered, and the following single-element SMBDirect Buffer Descriptor V1 is obtained. The buffer descriptor will be embedded in the upper-layer Read request.

• The SMBDirect Buffer Descriptor V1 fields are set to the following:

- Base: 0x00000000DCBA024 Length: 0x0000000000100000 (1 MiB) Token: 0x1A00BC57

The peer sends an SMBDirect Data Transfer message that contains an upper-layer Read request which includes the SMBDirect Buffer Descriptor describing the 1 MiB buffer. The upper-layer message totals 500 bytes.

• The SMBDirect Data Transfer message fields are set to the following:

- CreditsRequested: 0x000A (10) CreditsGranted: 0x0001 Flags: 0x0000

- Reserved: 0x0000 RemainingDataLength: 0x000000 (nonfragmented message)

- DataOffset: 0x00000018 (24) DataLength: 0x000001F4 (500 = size of the data payload)

- Padding: 0x00000000 (4 bytes of 0x00)

- Buffer: (Upper-layer message)

The message is recognized by the upper layer as a Read request via RDMA, and the 1MiB of data is prepared.

The supplied SMBDirect Buffer Descriptor V1 is used by an RDMA Write request to write the data to the peer from a local buffer. • (the RDMA device performs an RDMA Write operation)

The read processing is completed, and the reply is sent.

The peer deregisters the buffer and completes the operation.

Example - Peer Receives 1 MiB of Data Via Upper Layer

Page 58: Windows Server 2012 R2 - Boosted by Mellanox

Thank You