Windows Server 2012 R2 - Boosted by Mellanox
Transcript of Windows Server 2012 R2 - Boosted by Mellanox
Mellanox Technologies Japan K.K.
Senior System Engineer: Kazusa Tomonaga
December 4, 2013
Windows Server 2012 R2 – Boosted by Mellanox
© 2013 Mellanox Technologies 2
Agenda
Mellanox overview
Microsoft + Mellanox
• SMB Direct
• NVGRE offload
Appendix
• RoCE configuration
• Considerations for I/O consolidation
Reference
• SMB Direct – Protocol Deep Dive
© 2013 Mellanox Technologies 3
Company Overview
Leading provider of high-bandwidth, low-latency interconnects for the server and storage markets
• FDR InfiniBand (56Gbps) and 10/40/56 Gigabit Ethernet supported on common hardware
• Dramatically improves application performance by accelerating data access
• Dramatically improves the ROI of data-center IT infrastructure through large reductions in node count and better management efficiency
Headquarters and employees
• Yokneam (Israel) and Sunnyvale (USA)
• Approximately 1,400 employees worldwide
Solid financial position
• FY2011 revenue: $259.3M (up 67.6%)
• FY2012 revenue: $500.8M (up 93.2%)
• Cash + investments @ 9/30/13 = $306.4M
Ticker: MLNX
* As of September 2013
© 2013 Mellanox Technologies 4
VPI (Virtual Protocol Interconnect) technology
• IB and EN with single chip (ConnectX-3、SwitchX-2)
• IB and EN port by port (ConnectX-3、SwitchX-2)
• IB/EN Bridging(SwitchX-2)
High throughput, low latency, ultra-low power
RDMA (Remote Direct Memory Access) support for high-speed data transfer
VXLAN/NVGRE offload (ConnectX-3 Pro)
[Diagrams: ConnectX-3 adapter silicon and SwitchX-2 switch silicon (17mm and 45mm packages). ConnectX-3: PCIe 3.0 x8/x16, 2 x FDR InfiniBand (56Gbps) or 2 x 56Gbps Ethernet ports (1/10/40/56GbE modes), typical power 7.9W for 2-port 40GbE. SwitchX-2: 144 network SerDes lanes, configurable as 36x 40/56GbE, 64x 10GbE, or 48x 10GbE + 12x 40/56GbE (Ethernet mode 1/10/40/56GbE); InfiniBand or Ethernet, InfiniBand + Ethernet, or InfiniBand/Ethernet bridging; 83W at 100% load for 36x 40GbE, 63W for 64x 10GbE.]
Mellanox core technology: high-performance, highly integrated ASICs
© 2013 Mellanox Technologies 5
Leading Supplier of End-to-End Interconnect Solutions
[Diagram: Virtual Protocol Interconnect on both the server/compute side and the storage front/back-end side - 56G InfiniBand & FCoIB and 10/40/56GbE & FCoE links through switches/gateways, out to Metro/WAN. Product lines: Host/Fabric Software, ICs, Switches/Gateways, Adapter Cards, Cables/Modules.]
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
© 2013 Mellanox Technologies 6
The Future Depends on Fastest Interconnects
[Chart: interconnect speed adoption - 1Gb/s, 10Gb/s, 40/56Gb/s]
© 2013 Mellanox Technologies 7
Top Tier OEMs, ISVs and Distribution Channels
Medical
Server
Storage
Embedded
Hardware OEMs Software Partners Selected Channel Partners
© 2013 Mellanox Technologies 8
InfiniBand Enables Lowest Application Cost in the Cloud (Examples)
Microsoft Windows Azure
90.2% Cloud Efficiency
33% Lower Cost per Application
Cloud Application Performance
Improved up to 10X
3X Increase in VMs per Physical Server
Consolidation of Network and Storage I/O
32% Lower Cost per Application
694% Higher Network Performance
© 2013 Mellanox Technologies 9
Microsoft + Mellanox
© 2013 Mellanox Technologies 10
SMB Direct (RDMA) - a technology introduced with Windows Server 2012 that dramatically raises I/O performance
Enabled by Mellanox ConnectX-3 (InfiniBand and 10G/40G Ethernet NICs)
Hyper-V over SMB Direct
• Higher consolidation ratios and better application performance (see the share sketch at the end of this slide)
Hyper-V RDMA Live Migration (Windows Server 2012 R2)
• Shorter live-migration times; simpler, more efficient operations
Microsoft SQL Server 2012 - AlwaysOn
• The mirrored DB servers are connected by a low-latency Mellanox network, improving database write performance
Hyper-V based VDI solutions
• Doubles the number of VDI clients on the same hardware configuration, improving cost performance
Microsoft SQL 2012 Parallel Data Warehouse V2
• High-speed database appliance built on fast SMB Direct
NVGRE offload (Windows Server 2012 R2)
Enabled by Mellanox ConnectX-3 Pro (10G/40G Ethernet NICs)
• Packet processing for overlay networks is offloaded to the network adapter hardware
• Removes the CPU bottleneck and makes full use of high-bandwidth networks
Mellanox technology in Microsoft solutions
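As a rough illustration of the Hyper-V over SMB pattern listed above (a hedged sketch: share name, path, server and account names are placeholders, not taken from this deck), the file server publishes an SMB 3.0 share and the Hyper-V host simply stores VM files on the UNC path:
# On the SMB 3.0 file server: create a share for VM files (names and paths are examples)
New-SmbShare -Name "VMS" -Path "D:\VMS" -FullAccess "CONTOSO\HV01$", "CONTOSO\Hyper-V Admins"
# On the Hyper-V host: create a VM whose files live on that share
New-VM -Name "VM01" -MemoryStartupBytes 4GB -Path "\\FS01\VMS"
If both ends have RDMA-capable Mellanox NICs, SMB Direct is used for this storage path automatically.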
© 2013 Mellanox Technologies 11
Remote Direct Memory Access
• Zero-copy, CPU-bypass data-transfer technology
• Supported as a standard part of the interconnect protocols
• Transfers data directly between the buffers of remote applications
• Enables data transfer with very low latency
RDMA protocols
• InfiniBand - up to 56Gb/s
• RDMA-over-Converged-Ethernet (RoCE) - up to 40Gb/s
SMB Direct integrates RDMA processing into the Windows file-sharing protocol (SMB 3.0)
Features of RDMA technology
[Diagram: SMB 3.0 running on RDMA over either InfiniBand or Ethernet.]
With a Mellanox Ethernet NIC, RDMA also runs over Ethernet* (see the quick check below)
*DCB configuration is required when using RDMA over Ethernet
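A quick way to confirm that Windows recognizes the Mellanox adapter as RDMA-capable, using the inbox cmdlets (the adapter name is an example):
# List adapters that report RDMA (Network Direct) capability and whether it is enabled
Get-NetAdapterRdma
# Enable RDMA on a specific adapter if it is currently disabled
Enable-NetAdapterRdma -Name "Ethernet 4"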
© 2013 Mellanox Technologies 12
RoCE Frame
RoCE (RDMA over Converged Ethernet)
Source: http://blog.infinibandta.org/2012/02/13/roce-and-infiniband-which-should-i-choose/
Source: IBTA Supplement to InfiniBand Architecture Specification Volume 1 Release 1.2.1 - Annex A16: RDMA over Converged Ethernet (RoCE)
© 2013 Mellanox Technologies 13
RDMA overview
[Diagram: two servers, RACK 1 and RACK 2. With TCP/IP, data moves from Application 1's buffer down through the OS and NIC buffers (crossing the user, kernel and hardware layers), across the network, and back up through the NIC buffer and OS into Application 2's buffer. With RDMA over InfiniBand or Ethernet, the HCAs move data directly between the application buffers.]
If data were water: TCP/IP transfers it like a bucket brigade, while RDMA transfers it like a thick hose connected directly between the two ends.
© 2013 Mellanox Technologies 14
I/O Offload Frees Up CPU for Application Processing
[Chart: CPU utilization split between user space and system space. Without RDMA: ~47% CPU overhead, ~53% of the CPU available for the application. With RDMA and offload: ~12% CPU overhead, ~88% of the CPU available for the application.]
© 2013 Mellanox Technologies 15
SMB Direct - File Read
SMB Client
SMB Server
RDMA Write
by SMB Server
RoCE frame capture
© 2013 Mellanox Technologies 16
SMB Direct - File Write
SMB Client
SMB Server
RoCE frame capture
RDMA Read
by SMB Server
© 2013 Mellanox Technologies 17
How to watch the RDMA Traffic
• ibdump.exe
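ibdump captures the RDMA traffic at the adapter and writes a pcap file that Wireshark can open (presumably how the RoCE frame captures on the previous slides were taken). A typical invocation looks roughly like the following; the device name, port number and file name are examples, so check ibdump's help output in your WinOF release for the exact options:
# Capture RDMA/RoCE frames from port 1 of the first ConnectX-3 device into a pcap file
ibdump.exe -d mlx4_0 -i 1 -w smbdirect_capture.pcap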
© 2013 Mellanox Technologies 18
Measuring SMB Direct Performance
[Diagram: four test setups, each running an I/O micro-benchmark against Fusion-IO storage. (1) A single server with local Fusion-IO storage. (2) SMB Client and SMB Server connected by dual 10GbE links. (3) SMB Client and SMB Server connected by dual FDR InfiniBand links. (4) SMB Client and SMB Server connected by dual QDR InfiniBand links.]
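For reference, an I/O micro-benchmark of this kind can be driven with a tool such as Microsoft's sqlio.exe (the tool shown on the following Hyper-V slides). The parameters below are only an illustrative sketch, not the configuration behind these results:
# 512 KB sequential reads, 8 threads, 8 outstanding I/Os per thread, 60 seconds, with latency statistics
sqlio.exe -kR -fsequential -b512 -t8 -o8 -s60 -LS \\smbserver\share\testfile.dat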
© 2013 Mellanox Technologies 19
Microsoft Delivers Low-Cost Replacement to High-End Storage
FDR 56Gb/s InfiniBand delivers 5X higher throughput with 50% less CPU overhead vs. 10GbE
Native Throughput Performance over FDR InfiniBand
© 2013 Mellanox Technologies 20
[Diagram: two configurations compared. Native: a single server runs SQLIO directly against local storage built from SAS RAID controllers and JBODs of SSDs. Remote VM: a VM runs SQLIO on a Hyper-V host (SMB 3.0) that reaches a File Server (SMB 3.0) through RDMA NICs, with the same SAS RAID controller / JBOD / SSD back end.]
Hyper-V over SMB Direct - Performance: Native vs. Remote VM
© 2013 Mellanox Technologies 21
Configuration | BW (MB/sec) | IOPS (512 KB IOs/sec) | %CPU Privileged | Latency (ms)
Native        | 10,090      | 38,492                | ~2.5%           | ~3
Remote VM     | 10,367      | 39,548                | ~4.6%           | ~3
SMB 3.0 Performance in Virtualized Environment
SMB 3.0 over InfiniBand Delivers Native Performance
© 2013 Mellanox Technologies 22
[Diagram: a File Client (SMB 3.0) running SQLIO, connected to a File Server (SMB 3.0) by three pairs of RDMA NICs. The file server runs Storage Spaces over six SAS HBAs, each attached to a JBOD of eight SSDs.]
• 8KB random reads from a mirrored space (disk): ~600,000 IOPS
• 8KB random reads from cache (RAM): ~1,000,000 IOPS
• 32KB random reads from a mirrored space (disk): ~500,000 IOPS (~16.5 GBytes/sec)
EchoStreams: InfiniBand Enables Near Linear Scalability
© 2013 Mellanox Technologies 23
Hyper-V Live Migration over SMB
SMB as a transport for Live Migration of VMs
Delivers the power of SMB to provide: • RDMA (SMB Direct)
• Streaming over multiple NICs (SMB Multichannel)
Provides highest bandwidth and lowest latency
Multiple links can be used in parallel (SMB Multichannel)
RDMA makes full use of high-bandwidth networks
RDMA keeps the CPU load to a minimum
[Chart: live migration times in seconds (0-70 s scale) for the different transports.] New in Windows Server 2012 R2
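On Windows Server 2012 R2, selecting SMB (and therefore SMB Direct and SMB Multichannel when available) as the live-migration transport is a single host-level setting; a minimal sketch with the inbox Hyper-V cmdlets:
# Use SMB as the transport for live migration traffic
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB
# Verify the current setting
Get-VMHost | Select-Object VirtualMachineMigrationPerformanceOption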
© 2013 Mellanox Technologies 24
RDMA data transfer enables low-latency writes across the mirrored databases
Microsoft SQL Server 2012 AlwaysOn
© 2013 Mellanox Technologies 25
Microsoft PDW* V2 – 10X Faster & 50% Lower Capital Cost
[Diagram: PDW appliance racks with Control Node, Mgmt. Node, LZ and Backup Node plus compute nodes.]
PDW V1
• Interconnect: Ethernet, InfiniBand & Fiber Channel
• 160 cores on 10 compute nodes
• 1.28 TB of RAM on compute
• Up to 30 TB of temp DB
• Up to 150 TB of user data
• Estimated total HW component list price: $1M
PDW V2
• Interconnect: InfiniBand & Ethernet
• 128 cores on 8 compute nodes
• 2 TB of RAM on compute
• Up to 168 TB of temp DB
• Up to 1 PB of user data
• Estimated total HW component list price: $500K
• Pure hardware costs are ~50% lower
• Price per raw TB is close to 70% lower due to higher capacity
• 70% more disk I/O bandwidth
*Parallel Data Warehouse
© 2013 Mellanox Technologies 26
Accelerating Microsoft SQL 2012 Parallel Data Warehouse V2
Analyze 1 Petabyte of Data in 1 Second
Up to 100X faster performance than legacy data warehouse queries
Up to 50X faster data query, up to 2X the data loading rate
Unlimited storage scalability for future proofing
Accelerated by Mellanox FDR 56Gb/s InfiniBand end-to-end solutions
© 2013 Mellanox Technologies 27
NVGRE H/W Offload
© 2013 Mellanox Technologies 28
ConnectX-3 Pro | The Next Generation Cloud Competitive Asset
World’s first Cloud offload interconnect solution
Provides hardware offloads for Overlay Networks – enables mobility, scalability, serviceability
Dramatically lowers CPU overhead, reduces cloud application cost
Highest throughput (10, 40GbE & 56GbE), SR-IOV, PCIe Gen3, low power
More users
Mobility
Scalability
Simpler Management
Lower Application Cost
Cloud 2.0
The Foundation of Cloud 2.0
© 2013 Mellanox Technologies 29
World’s First HW Offload Engines for Overlay Network Protocols
Introducing L2 Virtual Tunneling solutions for virtualized data centers
• NVGRE and VXLAN
Virtual L2 tunnels provide a method for "creating" virtual domains on top of a scalable L3 virtualized infrastructure
• Enabling virtual domains with complete isolation
Targeting public/private cloud networks with multi-tenants
Mellanox uniqueness: HW offload = higher performance
• Checksums, LSO, FlowID calculation, VLAN Stripping / insertion
• Combined with steering mechanisms: RSS, VMQ
[Diagram: three virtual domains (Domain1, Domain2, Domain3), each spanning VMs on several servers, connected across the physical switch by Layer 2 tunneling.]
© 2013 Mellanox Technologies 30
NVGRE
• MAC over GRE
• 24 bit tenant id
[NVGRE frame layout as shown on the slide: MAC | GRE | MAC | IP (v4/v6) | ...]
© 2013 Mellanox Technologies 31
[Charts comparing NVGRE with ConnectX-3 Pro offloads against NVGRE without offloads: throughput in Gb/s (higher is better) is about 65% higher with offloads; CPU overhead in CPU cycles per byte (lower is better) is about 80% lower with offloads.]
Higher Throughput for Less CPU Overhead
NVGRE Initial Performance Results (ConnectX-3 Pro, 10GbE)
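To check whether the NVGRE (encapsulated packet) task offload is active on a ConnectX-3 Pro port, Windows Server 2012 R2 ships the cmdlets below; the adapter name is an example and the exact output depends on the WinOF driver:
# Show the encapsulated packet task offload (NVGRE) state per adapter
Get-NetAdapterEncapsulatedPacketTaskOffload
# Enable it on a specific adapter if it is disabled
Enable-NetAdapterEncapsulatedPacketTaskOffload -Name "Ethernet 4"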
© 2013 Mellanox Technologies 32
ConnectX-3 Pro NVGRE Throughput
Throughput, ConnectX-3 Pro 10 GbE (Gb/s):
VM Pairs               | 2    | 4    | 8   | 16
NVGRE Offload Disabled | 4.55 | 4.8  | 5   | 5.5
NVGRE Offload Enabled  | 8.7  | 9.15 | 9.2 | 8.65
© 2013 Mellanox Technologies 33
Microsoft:
• http://smb3.info
• Blog posts from Microsoft about SMB Direct
Mellanox.com:
• http://www.mellanox.com/page/file_storage
- Recipe and how-to guides
• http://www.mellanox.com/page/edc_system
- Demo/Test RDMA on Windows Server 2012
Links
© 2013 Mellanox Technologies 34
Mellanox RDMA technology is built into Windows Server 2012 and Windows Server 2012 R2 as the standard "SMB Direct" feature; it is a highly effective technology that raises I/O performance while lowering CPU load.
Windows Server 2012 and Windows Server 2012 R2 are the first in the industry to ship, as a standard OS feature, a breakthrough technology that combines the operability and manageability of a file protocol with performance exceeding even block storage.
• Simply installing a Mellanox ConnectX-3 unlocks overwhelming network performance - on the order of roughly 10x previous levels (exceeding even block storage)
• Works over Ethernet as well as InfiniBand
• In Hyper-V environments, not only file-storage access (Hyper-V over SMB) but also live migration runs over RDMA
• Applicable to a wide range of solutions built on Windows Server 2012 R2 - Hyper-V, VDI, SQL Server
Summary
© 2013 Mellanox Technologies 35
Appendix
© 2013 Mellanox Technologies 36
RoCE configuration
http://www.mellanox.com/pdf/whitepapers/WP_Deploying_Windows_Server_Eth.pdf
http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf
Because RoCE uses ordinary Ethernet frames, it will in principle work with no configuration at all, but DCB (flow control) should be configured for performance and stability (a Windows-side sketch follows this list):
• Configuration on the Windows host
• Configuration on the Ethernet switch
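A minimal sketch of the Windows-host side of this DCB/PFC configuration, using the inbox NetQos cmdlets. The priority value (3) and adapter name are examples; the Ethernet switch must be configured to match (see the Mellanox guides linked above):
# Install the Data Center Bridging feature
Install-WindowsFeature Data-Center-Bridging
# Classify SMB Direct traffic (NetworkDirect port 445) as 802.1p priority 3
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
# Enable Priority Flow Control only for priority 3
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7
# Apply DCB/QoS settings on the RoCE-capable adapter
Enable-NetAdapterQos -Name "Ethernet 4"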
© 2013 Mellanox Technologies 37
Considerations when using SMB Direct
http://technet.microsoft.com/ja-jp/library/jj134210.aspx (limitations in the Windows specification)
You can use SMB Direct in the Hyper-V management (parent) operating system to enable Hyper-V over SMB and to provide storage to virtual machines that use the Hyper-V storage stack. However, RDMA-capable network adapters are not exposed directly to Hyper-V guests. If you bind an RDMA-capable network adapter to a virtual switch, the virtual network adapters created from that switch are no longer RDMA-capable.
Disabling SMB Multichannel also disables SMB Direct. Because SMB Multichannel detects network adapter capabilities and determines whether an adapter is RDMA-capable, a client cannot use SMB Direct when SMB Multichannel is disabled.
SMB Direct is not supported on down-level versions of Windows Server. It is supported only on Windows Server 2012.
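To confirm on a running system that SMB traffic is actually using RDMA, the inbox SMB cmdlets can be checked while copying a file to the share (a hedged sketch; output formats vary by OS version):
# Client side: interfaces SMB considers usable, with their RDMA capability
Get-SmbClientNetworkInterface
# Active SMB connections and the negotiated dialect (3.0 or later is required for SMB Direct)
Get-SmbConnection
# Multichannel connections in use; RDMA-capable interfaces should appear here
Get-SmbMultichannelConnection
# The "SMB Direct Connection" performance counter set also shows live RDMA activity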
© 2013 Mellanox Technologies 38
Considerations for I/O consolidation
From the consideration on the previous page: "If you bind an RDMA-capable network adapter to a virtual switch, the virtual network adapters created from that switch are no longer RDMA-capable."
Does that mean SMB Direct (the storage-access and live-migration path) and VM-to-VM traffic (TCP/IP, which must go through a virtual switch) cannot be consolidated onto a single interface?
Mellanox's solution
• Add a virtual interface with the part_man command so that one physical port is presented to the OS as two logical ports
© 2013 Mellanox Technologies 39
part_man command
Example:
# part_man add "Ethernet 4" <any name>
Current status • InfiniBand: supported
• Ethernet: release planned (explained at the session)
MLNX WinOF 4.55 User Manual
© 2013 Mellanox Technologies 40
Reference: SMB Direct - Protocol Deep Dive
© 2013 Mellanox Technologies 41
[MS-SMBD]
Available below
• http://msdn.microsoft.com/en-us/library/hh536346(v=prot.20).aspx
SMB Direct Specification
© 2013 Mellanox Technologies 42
RDMA Transports
• The SMBDirect Protocol is transport-independent.
• It requires only an RDMA lower layer for sending and receiving the messages.
• The RDMA transports most commonly used by SMBDirect include:
- iWARP
- InfiniBand Reliable Connected mode
- RDMA over Converged Ethernet (RoCE)
Protocols Transported by SMBDirect
• SMB2 Protocol [MS-SMB2]
- When SMB2 version 3.0 is negotiated
- both client and server
- RDMA-capable transport
Relationship to Other Protocols
© 2013 Mellanox Technologies 43
Requires in-order delivery
Must support direct data placement via RDMA Write and RDMA Read requests
• Example : iWARP, InfiniBand, RoCE
Only 3 message types
• Negotiate request
• Negotiate response
• Data transfer
Little-endian order
• least-significant byte first
Uses multiple connections
• 1st connection - negotiation
• 2nd and subsequent connections - RDMA
SMBDirect Protocol Overview
© 2013 Mellanox Technologies 44
New additions for RDMA
• Negotiate
- Server capability advertisement
Server must advertise Multi-channel support (multiple connections per session) because
SMB Direct always starts with a TCP connection, then opens a second connection (or more)
for RDMA
- Session setup
Initial session –None (normal processing)
New Connection is created when RDMA is detected
› As part of RDMA connection setup, an SMB Direct negotiation occurs
› New RDMA connection joins previously setup session
Initialization
© 2013 Mellanox Technologies 45
A normal TCP connection is used to negotiate SMB2.2 and set up the session
After session setup
• Client uses the FSCTL_QUERY_TRANSPORT_INFO FSCTL to query server interface capabilities
• Interface Capability = RDMA_CAPABLE
If Server RDMA NIC is found and a local RDMA NIC is found that can connect to the server RDMA NIC
• Then additional connections using RDMA will be created and bound to session
• Original TCP connection is idled - all traffic goes over the RDMA channel
RDMA NICs are always selected first over other types of NICs
Creating the RDMA Connection
© 2013 Mellanox Technologies 46
The 3 messages are:
• Negotiate request - negotiate RDMA parameters
• Negotiate response - negotiate RDMA parameters
• Data transfer - encapsulates SMB2 messages
SMBDirect Data Transfer Mode
• Send/Receive mode
- Transmit SMB3 metadata requests and small SMB3 reads/writes
• RDMA mode
- Transmit data for large SMB3 reads/writes
SMBDirect Message Types – 3 Messages
© 2013 Mellanox Technologies 47
Credits are bidirectional and asymmetric
Peers MUST avoid credit deadlock
• All sends must request at least one credit
• When consuming final credit, at least one must also be granted by the message
• These rules avoid deadlock
Peers SHOULD grant many credits
Peers can perform dynamic credit management
KEEPALIVE mechanism supports liveness probe
• Side effect to refresh credits
SMBDirect Credits
© 2013 Mellanox Technologies 48
SMB Direct Negotiate Request
CreditsRequested (2 bytes): The number of Send Credits requested of the receiver.
PreferredSendSize (4 bytes): The maximum number of bytes that the sender requests to transmit in a single message.
MaxReceiveSize (4 bytes): The maximum number of bytes that the sender can receive in a single message.
MaxFragmentedSize (4 bytes): The maximum number of upper-layer bytes that the sender can receive as the result of a sequence of fragmented Send operations.
Wire layout (each row is 32 bits):
MinVersion (2 bytes) | MaxVersion (2 bytes)
Reserved (2 bytes) | CreditsRequested (2 bytes)
PreferredSendSize (4 bytes)
MaxReceiveSize (4 bytes)
MaxFragmentedSize (4 bytes)
© 2013 Mellanox Technologies 49
SMBDirect Negotiate Response
Wire layout (each row is 32 bits):
MinVersion (2 bytes) | MaxVersion (2 bytes)
NegotiatedVersion (2 bytes) | Reserved (2 bytes)
CreditsRequested (2 bytes) | CreditsGranted (2 bytes)
Status (4 bytes)
MaxReadWriteSize (4 bytes)
PreferredSendSize (4 bytes)
MaxReceiveSize (4 bytes)
MaxFragmentedSize (4 bytes)
NegotiatedVersion (2 bytes): The SMBDirect Protocol version that has been selected for this connection. This value MUST be one of the values from the range specified by the SMBDirect Negotiate Request message.
CreditsRequested (2 bytes): The number of Send Credits requested of the receiver.
CreditsGranted (2 bytes): The number of Send Credits granted by the sender.
Status (4 bytes): Indicates whether the SMBDirect Negotiate Request message succeeded. The value MUST be set to STATUS_SUCCESS (0x0000) if the SMBDirect Negotiate Request message succeeds.
MaxReadWriteSize (4 bytes): The maximum number of bytes that the sender will transfer via RDMA Write or RDMA Read request to satisfy a single upper-layer read or write request.
PreferredSendSize (4 bytes): The maximum number of bytes that the sender will transmit in a single message. This value MUST be less than or equal to the MaxReceiveSize value of the SMBDirect Negotiate Request message.
MaxReceiveSize (4 bytes): The maximum number of bytes that the sender can receive in a single message.
MaxFragmentedSize (4 bytes): The maximum number of upper-layer bytes that the sender can receive as the result of a sequence of fragmented Send operations.
© 2013 Mellanox Technologies 50
SMBDirect Data Transfer Message
Wire layout (each row is 32 bits):
CreditsRequested (2 bytes) | CreditsGranted (2 bytes)
Flags (2 bytes) | Reserved (2 bytes)
RemainingDataLength (4 bytes)
DataOffset (4 bytes)
DataLength (4 bytes)
Padding (variable)
Buffer (variable)
RemainingDataLength (4 bytes): The amount of data, in bytes, remaining in a sequence of fragmented messages. If this value is 0x00000000, this message is the final message in the sequence.
DataOffset (4 bytes): The offset, in bytes, from the beginning of the SMBDirect header to the first byte of the message's data payload. If no data payload is associated with this message, this value MUST be 0. This offset MUST be 8-byte aligned from the beginning of the message.
DataLength (4 bytes): The length, in bytes, of the message's data payload. If no data payload is associated with this message, this value MUST be 0.
© 2013 Mellanox Technologies 51
SMBDirect Buffer Descriptor V1 Structure
Wire layout (each row is 32 bits):
Offset (8 bytes)
Token (4 bytes)
Length (4 bytes)
Offset (8 bytes): The RDMA provider-specific offset, in bytes, identifying the first byte of data to be transferred to or from the registered buffer.
Token (4 bytes): An RDMA provider-assigned Steering Tag for accessing the registered buffer.
Length (4 bytes): The size, in bytes, of the data to be transferred to or from the registered buffer.
© 2013 Mellanox Technologies 52
SMB2 READ Request
Wire layout (each row is 32 bits):
StructureSize (2 bytes) | Padding (1 byte) | Reserved (1 byte)
Length (4 bytes)
Offset (8 bytes)
FileId (16 bytes)
MinimumCount (4 bytes)
Channel (4 bytes)
RemainingBytes (4 bytes)
ReadChannelInfoOffset (2 bytes) | ReadChannelInfoLength (2 bytes)
Buffer (variable)
Channel (4 bytes): For SMB 2.002 and 2.1 dialects, this field MUST NOT be used and MUST be reserved. The client MUST set this field to 0, and the server MUST ignore it on receipt. For the SMB 3.0 dialect, this field MUST contain exactly one of the following values:
SMB2_CHANNEL_NONE (0x00000000): No channel information is present in the request. The ReadChannelInfoOffset and ReadChannelInfoLength fields MUST be set to 0 by the client and MUST be ignored by the server.
SMB2_CHANNEL_RDMA_V1 (0x00000001): One or more SMB_DIRECT_BUFFER_DESCRIPTOR_V1 structures as specified in [MS-SMBD] section 2.2.3.1 are present in the channel information specified by the ReadChannelInfoOffset and ReadChannelInfoLength fields.
© 2013 Mellanox Technologies 53
The initiator (for example, an SMB2 client) sends an SMBDirect Negotiate message, indicating that it is capable of the
1.0 version of the protocol, can send and receive up to 1 KiB of data per Send operation, and can reassemble
fragmented Sends up to 128 KiB.
• The SMBDirect Negotiate request message fields are set to the following:
- MinVersion: 0x0100 MaxVersion: 0x0100 Reserved: 0x0000 CreditsRequested: 0x000A (10)
- PreferredSendSize: 0x00000400 (1 KiB) MaxReceiveSize: 0x00000400 (1 KiB) MaxFragmentedSize: 0x00020000 (128 KiB)
The peer receives the SMBDirect Negotiate request and selects version 1.0 as the version for the connection. The
negotiate response indicates that the peer can receive up to 1 KiB of data per Send operation, and requests that the
requestor permit the same. The negotiate response also grants an initial batch of 10 Send Credits and requests 10
Send Credits to be used for future messages.
• The SMBDirect Negotiate response message fields are set to the following:
- MinVersion: 0x0100 MaxVersion: 0x0100 NegotiatedVersion: 0x0100 Reserved: 0x0000
- CreditsRequested: 0x000A (10) CreditsGranted: 0x000A (10) Status: 0x0000
- MaxReadWriteSize: 0x00100000 (1MiB) PreferredSendSize: 0x00000400 (1KiB)
- MaxReceiveSize: 0x00000400 (1KiB) MaxFragmentedSize: 0x00020000 (128KiB)
The peer sends the first data transfer, typically an upper-layer SMB2 Negotiate Request. The message grants an initial
credit limit of 10, and requests 10 credits to begin sending normal traffic.
• The SMBDirect Data Transfer message fields are set to the following:
- CreditsRequested: 0x000A (10) CreditsGranted: 0x000A (10) Flags: 0x0000
- Reserved: 0x0000 RemainingDataLength: 0x000000 (nonfragmented message) DataOffset: 0x00000018 (24)
- DataLength: 0x00000xxx (length of Buffer) Padding: 0x00000000 (4 bytes of 0x00) Buffer: (Upper layer message)
An SMBDirect Version 1.0 Protocol connection has now been established, and the initial message is processed.
Example - Establishing a Connection
© 2013 Mellanox Technologies 54
The peer uses the Send operation to transmit the data because the upper layer request did not
provide an RDMA Buffer Descriptor. An SMBDirect Data Transfer message is sent that contains the
500 bytes of data as the message’s payload. The message requests 10 Send Credits to maintain
the current credit limit and grants 1 Send Credit to replace the credit request used by the final
message.
• The SMBDirect Data Transfer message fields are set to the following:
- CreditsRequested: 0x000A (10)
- CreditsGranted: 0x0001
- Flags: 0x0000
- Reserved: 0x0000
- RemainingDataLength: 0x000000 (nonfragmented message)
- DataOffset: 0x00000018 (24)
- DataLength: 0x000001F4 (500 = size of the data payload)
- Padding: 0x00000000 (4 bytes of 0x00)
- Buffer: (Upper layer message)
Example - Peer Transmits 500 Bytes of Data
© 2013 Mellanox Technologies 55
The peer uses fragmented Send operations to transmit the data because the message exceeds the remote peer’s negotiated MaxReceiveSize, but is within the MaxFragmentedSize. A sequence of fragmented Sends of SMBDirect Data Transfer messages is prepared. The messages each request 10 Send Credits and grant a Send Credit to maintain the credits offered to the peer for expected responses. Because the fragmented sequence requires more credits (65) than are currently available (10), several pauses can occur while waiting for credit replenishment.
The SMBDirect Data Transfer message fields are set to the following: • CreditsRequested: 0x000A (10) CreditsGranted: 0x0001 Flags: 0x0000
• Reserved: 0x0000 RemainingDataLength: 0x000000xxx (63KiB remaining) DataOffset: 0x00000018 (24)
• DataLength: 0x000003F8 (1000 = MaxReceiveSize – 24) Padding: 0x00000000 (4 bytes of 0x00)
• Buffer: (1000 bytes of the upper-layer message)
The SMBDirect Data Transfer message fields are set to the following: • CreditsRequested: 0x000A (10) CreditsGranted: 0x0001 Flags: 0x0000
• Reserved: 0x0000 RemainingDataLength: 0x000000xxx (62KiB remaining) DataOffset: 0x00000018 (24)
• DataLength: 0x000003F8 (1000 = MaxReceiveSize – 24) Padding: 0x00000000 (4 bytes of 0x00)
• Buffer: (1000 bytes of the upper-layer message)
(Additional intermediate fragments, and pauses, elided…)
The SMBDirect Data Transfer message fields are set to the following: • CreditsRequested: 0x000A (10) CreditsGranted: 0x0001 Flags: 0x0000
• Reserved: 0x0000 RemainingDataLength: 0x000000000 (final message of fragmented sequence)
• DataOffset: 0x00000018 (24)
• DataLength: 0x00000218 (536 = last fragment) Padding: 0x00000000 (4 bytes of 0x00)
• Buffer: (536 final bytes of the upper-layer message)
Example - Peer Transmits 64 KiB of Data
© 2013 Mellanox Technologies 56
The upper layer performs the transfer via RDMA. The buffer containing the data to be written is registered, obtaining the following single-element SMBDirect Buffer Descriptor V1. The buffer descriptor will be embedded in the upper-layer Write request. • The SMBDirect Buffer Descriptor V1 fields are set to the following:
- Base: 0x00000000ABCDE012
- Length: 0x0000000000100000 (1 MiB)
- Token: 0x1A00BC56
The peer sends an SMBDirect Data Transfer message that contains an upper layer Write request which includes the SMBDirect Buffer Descriptor V1 describing the 1 MiB buffer. The upper layer message totals 500 bytes. • The SMBDirect Data Transfer message fields are set to the following:
- CreditsRequested: 0x000A (10) CreditsGranted: 0x0001 (1) Flags: 0x0000 Reserved: 0x0000
- RemainingDataLength: 0x000000 (nonfragmented message) DataOffset: 0x00000018 (24)
- DataLength: 0x000001F4 (500 = size of the data payload) Padding: 0x00000000 (4 bytes of 0x00)
- Buffer: (Upper-layer message)
The message is recognized by the upper layer as a Write request via RDMA, and the supplied SMBDirect buffer descriptor is used to RDMA Read the data from the peer into a local buffer.
(the RDMA device performs an RDMA Read operation)
The write processing is completed, and the upper layer later replies to the peer.
The peer deregisters the buffer and completes the operation.
Example - Peer Transmits 1 MiB of Data Via Upper Layer
© 2013 Mellanox Technologies 57
The upper layer performs the transfer via RDMA. The buffer containing the data to be read is registered, and the following single-element SMB Buffer Descriptor V1 is obtained. The buffer descriptor will be embedded in the upper-layer read request. • The SMBDirect Buffer Descriptor V1 fields are set to the following:
- Base: 0x00000000DCBA024 Length: 0x0000000000100000 (1 MiB) Token: 0x1A00BC57
The peer sends an SMBDirect Data Transfer message that contains an upper-layer Read request which includes the SMBDirect Buffer Descriptor describing the 1 MiB buffer. The upper-layer message totals 500 bytes. • The SMBDirect Data Transfer message fields are set to the following:
- CreditsRequested: 0x000A (10) CreditsGranted: 0x0001 Flags: 0x0000
- Reserved: 0x0000 RemainingDataLength: 0x000000 (nonfragmented message)
- DataOffset: 0x00000018 (24) DataLength: 0x000001F4 (500 = size of the data payload)
- Padding: 0x00000000 (4 bytes of 0x00)
- Buffer: (Upper-layer message)
The message is recognized by the upper layer as a Read request via RDMA, and the 1MiB of data is prepared.
The supplied SMBDirect Buffer Descriptor V1 is used by an RDMA Write request to write the data to the peer from a local buffer. • (the RDMA device performs an RDMA Write operation)
The read processing is completed, and the reply is sent.
The peer deregisters the buffer and completes the operation.
Example - Peer Receives 1 MiB of Data Via Upper Layer
Thank You