An STDM (Static Time Division Multiplexing) switch on a multi-FPGA system
2019.10.4
K.Azegami1, K.Musha1, K.Hironaka1, A.B.Ahmed1, M.Koibuchi2, Y.Hu2 and
H. Amano1
1Keio University, 2National Institute of Informatics
Special Thanks to Tomohiro Kudoh, Ryosei Takano, Kohei Itoh, Kensuke Iizuka and Yugo Yamauchi
MEC (Multi-access Edge Computing)

[Figure: a centralized cloud vs. MEC data centers placed near IoT devices, connected over 5G; example use case: a smart community]

5G enables < 0.5 msec latency. Timing-critical jobs demand low power but high performance.
→ FPGA Computing
FPGA computing in the MEC data center

• Benefits:
  • Suitable for timing-critical applications.
    • Data are transferred directly from/to IoT devices by the FPGA hardware.
    • Constant-time execution is achieved with the hardware engine.
  • Low-energy computation.
    • Much more energy efficient than GPUs.
  • Scalability for multiple access.
    • Easy to increase the number of FPGAs.
    • An FPGA itself can be used as a switch.
• The programming environment has improved:
  • OpenCL is widespread for computational usage.
  • Vivado HLS is popularly used for general usage.
• High-performance FPGAs are available:
  • Intel's Stratix 10 with many floating-point DSPs.
  • Xilinx's UltraScale+ with UltraRAM.

→ However, cost is an important issue in the MEC data center.
[Figure: a host PC connected over PCIe to multiple FPGAs]
Our Proposal: Virtual FPGA for MEC

Many cost-efficient, middle-scale FPGAs are tightly connected with their own serial links. They can be treated as if they were a single FPGA at the HLS description level.

This gives higher performance per cost than a conventional FPGA in the cloud. Practically infinite resources can be used, separated into a number of virtual FPGAs and shared by multiple accesses.

Flow-in-Cloud (FiC) is the first prototype.
Flow-in-Cloud (FiC) overview

[Figure: FPGA boards (FiC-SW), each holding HLS modules and an STDM switch, form a circuit-switching network over high-speed serial links; data flow in directly from IoT devices, and a host CPU attaches through an I/O board (KCU1500)]
This talk focuses on the STDM switch.
Flow-in-Cloud (FiC) SW board

• FPGA: Xilinx Kintex UltraScale XCKU085/095
• FiC network: 8 x 4 high-speed serial links at 9.9 Gbps
• Control network: Ethernet, via a Raspberry Pi 3 Model B on the switch control board
• DDR4 SDRAM: 16 Gb x 2
• The FPGA hosts the STDM switch and an application logic area for HLS modules.

Here, we call each link a "channel" and a group of four channels a "bundle". A board has 8 bundles, each of which has 4 channels.
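As a rough cross-check (an illustrative calculation, not taken from the slides: it assumes the 8.5 Gbps effective per-lane payload rate shown in the block diagram, behind the 9.9 Gbps line rate), the per-board aggregate bandwidth works out as follows:

```python
# Aggregate link bandwidth per board (back-of-the-envelope arithmetic;
# the 8.5 Gbps effective rate and the 8 x 4 lane count are from the slides).
lanes = 8 * 4                     # 8 bundles x 4 channels
effective_gbps = 8.5              # payload rate behind each 9.9 Gbps link
aggregate_gb_per_sec = lanes * effective_gbps / 8
print(aggregate_gb_per_sec)       # -> 34.0
```

This matches the 34 GB/sec aggregated throughput quoted in the summary.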
Block Diagram of FiC

[Figure: block diagram of a FiC board. The static domain contains the STDM switches (sw0–sw3), Xilinx Aurora link cores (8.5 Gbps x 32 lanes = 4 channels x 8 lanes per side, 9.9 Gbps line rate), DRAM, and an Ethernet link to the Raspberry Pi 3. The partially reconfigurable (PR) domain holds the HLS module, which connects to the switch through 170-bit ports at 100 MHz.]
STDM (Static Time Division Multiplexing)

[Figure: an example 4x4 switch with four slots (S1–S4) per port]

Input data arriving at each port are registered cyclically. In each slot, an input register is selected according to the pre-loaded table and transferred to the output register. Output data are sent cyclically to the output port.
STDM (Static Time Division Multiplexing)

[Figure: the same 4x4 switch; output port 1 carries the repeating slot sequence P2S1, P2S2, P1S3, P3S4]

• A circuit is established between source and destination.
• Latency and bandwidth are guaranteed.
• Latency = 45 + 2 x (# of slots) clock cycles.
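The table-driven selection above can be sketched in a few lines (a minimal behavioral model, not the FiC RTL; the class and function names are illustrative):

```python
# Minimal model of an STDM crossbar: in each slot, every output port
# copies the input port named by a pre-loaded table (illustrative only).
class StdmSwitch:
    def __init__(self, table):
        # table[slot][out_port] = in_port forwarded in that slot
        self.table = table

    def cycle(self, slot, inputs):
        sel = self.table[slot % len(self.table)]
        return [inputs[src] for src in sel]

def latency_cycles(num_slots):
    # Latency model from the slide: 45 + 2 x (# of slots) clock cycles.
    return 45 + 2 * num_slots

# Output port 1 carries P2, P2, P1, P3 across slots S1..S4, as in the figure.
sw = StdmSwitch([[1, 0, 3, 2], [1, 2, 0, 3], [0, 3, 1, 2], [2, 1, 3, 0]])
print(sw.cycle(0, ["P1", "P2", "P3", "P4"])[0])  # -> P2
print(latency_cycles(4))                         # -> 53
```

Because the table is loaded before communication starts, no per-packet arbitration is needed, which is what keeps the latency and bandwidth fixed.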
Multicast using the STDM

[Figure: the 4x4 switch; several output ports select the same input in one slot (one output is for internal usage)]

Multicast is done efficiently: multiple outputs can receive the same data in a specific slot.
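Multicast needs no extra mechanism: a slot table in which several outputs name the same input already duplicates the data (a sketch; the function name is illustrative):

```python
def stdm_slot(sel, inputs):
    # sel[out] = input port chosen for this slot; repeating the same
    # input index across outputs multicasts that input's data.
    return [inputs[src] for src in sel]

# Outputs 1-3 all select input port 1 in this slot; output 4 takes port 3.
print(stdm_slot([0, 0, 0, 2], ["A", "B", "C", "D"]))  # -> ['A', 'A', 'A', 'C']
```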
The resource usage

[Table: FPGA resource usage per block; GT = high-speed serial transceivers]

Enough area remains for HLS designs. Four switches are provided for each channel.
The current configuration: 4x6 torus

[Figure: the current FiC with 24 boards (numbered 0–23, named fic00–fic11 and m2fic00–m2fic11), connected in a 4x6 torus; each board uses four link ports (1–4)]
Round Trip Time (24 nodes, HLS-to-HLS)

[Graph: latency (clock cycles) vs. transferred data (x 160 bit x 4) for each slot number, over the 24-board system]

1.2 GB/sec HLS-to-HLS effective bandwidth; 4.6 μsec max. latency.
[Graph: node-to-node bandwidth (HLS-to-HLS), in GB/sec (0–1.4), vs. transferred data (x 160 bit x 4, up to 12000), for 2, 4, 8, 16, 32, and 64 slots]
Comparison with other FPGA-connected multi-GPU systems and MPI/InfiniBand

[Graph: bandwidth (GB/sec, 0–4) vs. data size (4 B to 1 MB) for FiC 2-slot, GPU-to-GPU (PEACH3), GPU-to-GPU (PEACH2), and GPU-to-GPU (MPI/IB)]

Kaneda, et al., "Performance Evaluation of PEACH3: FPGA Switch for Tightly Coupled Accelerators," HEART 2017.
Round Trip Time vs. the number of slots

[Graph: round trip time (clocks, 0–1200) vs. the number of slots (0–70), for 2, 4, and 8 boards]

The pass-through time for a board is about 450 nsec with 2 slots. Slot-synchronization delay increases the round trip time.
Comparison with other systems

[Graph: latency (μsec, 0–6) vs. data size (8–512 bytes) for FiC 2-slot (450 ns), the K computer's Tofu (100 ns), GPU-to-GPU (PEACH3), GPU-to-GPU (PEACH2), and GPU-to-GPU (MPI/IB)]

Ajima, et al., "The Tofu Interconnect," Hot Interconnects 2012.
Broadcast can be done within the maximum hop count

[Figure: the 4x6 torus of 24 boards; a broadcast from one board reaches all others in five steps (①–⑤)]
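The five broadcast steps match the diameter of the 4x6 torus, which can be checked with a one-liner (a sketch assuming bidirectional links in each ring; the function name is illustrative):

```python
def torus_max_hops(dims):
    # With bidirectional rings, the farthest node is floor(k/2) hops
    # away in each dimension; the diameter is the sum over dimensions.
    return sum(k // 2 for k in dims)

print(torus_max_hops((4, 6)))  # -> 5, the number of broadcast steps shown
```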
1-to-all and All-to-all

[Graph: latency (clocks, 0–500) vs. array size (0–30) for 2x2, 2x4, 4x4, and 4x6 configurations]

All-to-all broadcast can be done efficiently.
Reduction Computation

[Figure: the 4x6 torus; partial results are gathered in step ① and combined in step ②]
[Graph: reduction of 170-bit integer data — computation time (clocks, 0–1200) vs. the number of slots (0–20)]

Each node finishes its computation in one clock cycle.
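A rough latency model for such a reduction can be sketched by combining the one-cycle-per-node combine above with the per-hop latency formula from the STDM slide (an illustrative estimate of a chain-shaped reduction, not a measured FiC figure; both function names are hypothetical):

```python
def hop_latency(num_slots):
    # Per-hop latency from the STDM slide: 45 + 2 x (# of slots) cycles.
    return 45 + 2 * num_slots

def chain_reduction_cycles(num_nodes, num_slots):
    # Each node combines in one cycle; partial results traverse
    # num_nodes - 1 links before the final value is ready.
    return num_nodes * 1 + (num_nodes - 1) * hop_latency(num_slots)

print(chain_reduction_cycles(4, 2))  # 4 + 3 * 49 = 151
```

The model shows why the measured reduction time grows with the slot count even though each node's arithmetic takes only one cycle.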
Summary

● 34 GB/sec max. aggregated throughput with 32 links
  ○ vs. PEACH3: 31.4 GB/sec max.
  ○ vs. Catapult-2: 10 GB/sec max. (40 Gbps = 5 GB/sec, x 2)
● 1.2 GB/sec 1-to-1 HLS throughput
  ○ Advantageous for small data sizes
● 450 nsec pass-through time
  ○ Much smaller than PEACH3 and MPI/InfiniBand
● 4.6 μsec maximum HLS-to-HLS latency
● Future work
  ○ Improving the input/output delay of the HLS part
  ○ Slot synchronization