The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame...

25
Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shun Ando, Masahiro Maeda, Takahide Yoshikawa, Koji Hosoe, and Toshiyuki Shimizu Fujitsu Limited The Tofu Interconnect 2 August 26, 2014, Hot Interconnects 22 Copyright 2014 FUJITSU LIMITED

Transcript of The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame...

Page 1: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto,

Shun Ando, Masahiro Maeda, Takahide Yoshikawa,

Koji Hosoe, and Toshiyuki Shimizu

Fujitsu Limited

The Tofu Interconnect 2

August 26, 2014, Hot Interconnects 22 Copyright 2014 FUJITSU LIMITED

Page 2: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Introduction

Tofu interconnect

Highly-scalable six-dimensional network topology

Tofu interconnect 2

Link speed has been upgraded from 40 Gbps to 100 Gbps

System-on-Chip integration

New features

1 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

Tofu interconnect 2 (Tofu2)

2010 2012 2015

Tofu interconnect (Tofu1)

Page 3: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

A Next Generation Interconnect

Optical-dominant: 2/3 of network links are optical

2 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

0

0.2

0.4

0.6

0.8

1

1.2

IBM BlueGene/Q

IB-FDRFat-Tree

CrayCascade

Tofu1 Tofu2

Gro

ss r

aw

bitra

te p

er

no

de

(T

bp

s)

Optical network links

Electrical network links

6.25 Gbps

25 Gbps

25 Gbps

12.5 Gbps

14 Gbps

14 Gbps

14 Gbps

10 Gbps

5 Gbps

Page 4: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Copyright 2014 FUJITSU LIMITED

Agenda

Introduction

Implementation

Transmission Technology

New Features

Preliminary evaluations

Summary

August 26, 2014, Hot Interconnects 22 3

Page 5: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Reduction in the Chip Area Size

The process technology shrinks from 65 to 20 nm

System-on-chip integration eliminates the host bus interface

Chip area shrinks to 1/3 size

Tofu network

interface (TNI)

and

Tofu barrier

interface (TBI)

Tofu network router (TNR)

PCI

Express

root

complex

Host bus interface

August 26, 2014, Hot Interconnects 22 Copyright 2014 FUJITSU LIMITED 4

Tofu1 < 300mm2

InterConnect Controller (65nm)

Tofu2 < 100mm2 – SPARC64TM XIfx (20nm)

TNI and TBI

TNR

PCI Express

root complex

34 core processor

with 8 HMC links

Page 6: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Optical Modules

Optical modules are placed next to the processor SoC

25 Gbps high-speed signals connect the optical modules

5 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

SPARC64TM XIfx integrating Tofu2

Board-mounted optical assembly

Electrical connector (8 lanes)

25 Gbps high-speed signals

(12 lanes for each)

Page 7: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Physical Topology on a Board

3 nodes on a CPU/memory board (CMB)

10 ports for each node

20 ports for optical links

10 ports for electrical links

4 ports for connection to the neighbor nodes on the CMB

6 ports for connection to the other CMBs within the chassis

6 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

Page 8: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Six-dimensional network: {X, Y, Z, A, B, C}

The size of a chassis = {1, 1, 3, 2, 1, 2}

3 nodes on a CMB connects along the Z-axis

4 CMBs in a chassis connect by the A- and C-axes

Logical Topology in a Chassis

7 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

Page 9: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Packaging Hierarchy

A chassis is 2U sized

A 19-inch rack mounts a maximum of 18 chassis

Racks are interconnected by links to the X- and Y-axes

8 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

SPARC64TM XIfx

1 TFlops

CPU/memory board

3 nodes

2U chassis

12 nodes

19-inch rack

216 nodes

Post-FX10 System

Petaflops per 5 racks

Page 10: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Copyright 2014 FUJITSU LIMITED

Agenda

Introduction

Implementation

Transmission Technology

New Features

Preliminary evaluations

Summary

August 26, 2014, Hot Interconnects 22 9

Page 11: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Physical Layer

The physical layer is based on 100GbE (100GBASE-SR4)

The data transfer rate is 25.78125 Gbps

Increased more than fourfold from 6.25 Gbps

The encoding scheme is 64b/66b

Enhanced from 8b/10b

Each link has 4 lanes of signals

Each link provides 12.5 GB/s peak throughput

Increased by 2.5 times from 5.0 GB/s

10 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

Tofu1 Tofu2

Data transfer rate 6.25 Gbps 25.78125 Gbps

Encoding 8b/10b 64b/66b

Signals per link 8 lanes 4 lanes

Link throughput 5.0 GB/s 12.5 GB/s

Page 12: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Frame Format

Tofu2 harnesses the Ethernet frame format

Preamble, frame check sequence and inter-frame gap are the same

The frame size is a multiple of 32 bytes

The maximum frame size is 2016 bytes

Single frame encapsulates transport and data-link packets

Transport and data-link layer packets of Tofu1 had their own frame

August 26, 2014, Hot Interconnects 22 Copyright 2014 FUJITSU LIMITED 11

Preamble Data FCS Destination

Address

Source

Address

Type /

Length

8 bytes 46 – 1500 bytes 4 bytes 6 bytes 6 bytes 2 bytes

Preamble

8 bytes

Routing

Header

8 bytes

Transport Layer

Packet

Data-Link

Layer Packet FCS

32 – 1984 bytes 4 bytes 16 bytes

Flags

4 bytes

Ethernet frame format

Tofu2 frame format

Page 13: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Copyright 2014 FUJITSU LIMITED

Agenda

Introduction

Implementation

Transmission Technology

New Features

Preliminary evaluations

Summary

August 26, 2014, Hot Interconnects 22 12

Page 14: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Cache Injection

The major issue of latency: cache flush

Ordinary DMA from a peripheral causes flushing of CPU cache

Tofu2 can inject received data into L2 cache directly

Bypassing main memory

Data to be injected is indicated with a flag bit that is set by the sender

Tofu2’s cache injection feature is harmless

Cache injection is only performed when cache hits and the line is in exclusive state

A cache line in exclusive state is highly likely to be polled by a corresponding processor core

13 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

Page 15: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Session-mode Control Queue

Offloading a collective communication of long messages

Command execution may be delayed until a reception of Put

Control flow can be branched or joined

An example of handshaking pipelined gather in a ring logical-topology August 26, 2014, Hot Interconnects 22 Copyright 2014 FUJITSU LIMITED 14

Wait for session progress

Wait for command

and session progress

NOP

Session-mode Session-mode Normal-mode

Command queueing

can be delayed

Execution

Wait for command

Enqueuing commands

Page 16: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

RDMA Atomic Read-Modify-Write

Atomically read, modify and write back remote data

Typical operations: compare-and-swap and fetch-and-add

Usage: software-based synchronization and lock-free algorithms

Atomicity

Guaranteed by extending the coherency protocol of processor

• not by extending each network interface

Strong atomicity: No memory accesses can break atomicity

Mutual atomicity: Atomic operations of processor and Tofu2 mutually guarantee their atomicity

Mutual atomicity enables an efficient implementation of unified multi-process and multi-thread runtime

15 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

Page 17: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Copyright 2014 FUJITSU LIMITED

Agenda

Introduction

Implementation

Transmission Technology

New Features

Preliminary evaluations

Summary

August 26, 2014, Hot Interconnects 22 16

Page 18: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Evaluation Environment

System-level logic simulations

Using Cadence Palladium XP hardware emulator

Simulation model

Verilog RTL codes for the production

Multiple nodes of Post-FX10

Test programs

Executed on the simulated processor cores

Used Tofu2 hardware directly

Simulation waveform

Provides detail information of signal propagation delay in logic circuits

Latency results can be obtained directly from simulation waveforms

Throughput results were calculated from measured latency values

17 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

Page 19: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Throughput of Single Put Transfer

Achieved 11.46 GB/s of throughput which is 92% efficiency

August 26, 2014, Hot Interconnects 22 Copyright 2014 FUJITSU LIMITED 18

11.46

4.66

0.03

0.06

0.13

0.25

0.50

1.00

2.00

4.00

8.00

16.00

4 16 64 256 1024 4096

Thro

ughp

ut of P

ut tr

ansfe

r (G

B/s

)

Message size (byte)

Tofu2

Tofu1

Page 20: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Throughput of Concurrent Put Transfers

Linear increase in throughput without the host bus bottleneck

19 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

45.82

17.63

0

5

10

15

20

25

30

35

40

45

50

1-way 2-way 3-way 4-way

Thro

ughp

ut of P

ut tr

ansfe

r (G

B/s

)

Tofu2

Tofu1

Host bus bottleneck

Page 21: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Communication Latencies

Cache injection reduced latency by 0.16 μsec

Overheads of the other two features are low

Session-mode introduced no additional latency

The overhead of an atomic operation was about 0.1 μsec

20 Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22

Pattern Method Latency

One-way Put 8-byte to memory 0.87 μsec

Put 8-byte to cache 0.71 μsec

Round-trip Put 8-byte ping-pong by CPU 1.42 μsec

Put 8-byte ping-pong by session 1.41 μsec

Atomic RMW 8-byte 1.53 μsec

Page 22: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Breakdown of Latency

More than 0.2 μsec less latency than Tofu1

August 26, 2014, Hot Interconnects 22 Copyright 2014 FUJITSU LIMITED 21

0

100

200

300

400

500

600

700

800

900

1000

Tofu1 Tofu2

Esti

mate

d c

um

ula

tive l

ate

ncy (

ns

ec

)

Rx CPU

Rx Host bus

Rx TNI

Packet Transfer

Tx TNI

Tx Host bus

Tx CPU

Cache injection (~0.16 μsec)

Host bus elimination (~0.06 μsec)

Page 23: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Copyright 2014 FUJITSU LIMITED

Agenda

Introduction

Implementation

Transmission Technology

New Features

Preliminary evaluations

Summary

August 26, 2014, Hot Interconnects 22 22

Page 24: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Summary

Tofu interconnect 2: the next generation interconnect

Optical-dominant: 2/3 of network links are optical

System-on-Chip integrated controller

New features

Cache injection reduces communication latency without cache pollution

Session-mode offloads various collective communication algorithms

Atomic RMW functions guarantee mutual atomicity

Preliminary evaluation results

Throughput 11.46 GB/s

Latency 0.71 μsec

Cache injection reduced latency by 0.16 μsec

Elimination of the host bus reduced latency by about 0.06 μsec

Session-mode introduced no additional latency

Overhead of an atomic operation was about 0.1 μsec

Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22 23

Page 25: The Tofu Interconnect 2 · Frame Format Tofu2 harnesses the Ethernet frame format Preamble, frame check sequence and inter-frame gap are the same The frame size is a multiple of 32

Copyright 2014 FUJITSU LIMITED August 26, 2014, Hot Interconnects 22 24