An STDM (Static Time Division Multiplexing) switch on a multi-FPGA system
2019.10.4
K.Azegami1, K.Musha1, K.Hironaka1, A.B.Ahmed1, M.Koibuchi2, Y.Hu2 and
H. Amano1
1Keio University, 2National Institute of Informatics
Special Thanks to Tomohiro Kudoh, Ryosei Takano, Kohei Itoh, Kensuke Iizuka and Yugo Yamauchi
MEC (Multi-access Edge Computing)

[Figure: a centralized cloud vs. MEC data centers placed near IoT devices, connected over 5G; example use case: a smart community]

5G enables < 0.5 msec latency. Timing-critical jobs demand low power but high performance.
→ FPGA Computing
FPGA computing in the MEC data center

• Benefits:
  • Suitable for timing-critical applications.
    • Data are transferred directly from/to IoT devices by the FPGA hardware.
    • Constant-time execution is achieved with the hardware engine.
  • Low-energy computation.
    • Much more energy efficient than GPUs.
  • Scalability for multiple access.
    • Easy to increase the number of FPGAs.
    • An FPGA itself can be used as a switch.
• The programming environment has improved:
  • OpenCL is widespread for computational usage.
  • Vivado HLS is popularly used for general usage.
• High-performance FPGAs are available:
  • Intel's Stratix 10 with many floating-point DSPs.
  • Xilinx's UltraScale+ with UltraRAM.

→ However, cost is an important issue in the MEC data center.
[Figure: a host PC connected over PCIe to multiple FPGAs]
Our Proposal: Virtual FPGA for MEC

Many cost-efficient, middle-scale FPGAs are tightly connected with their own serial links. They can be treated as if they were a single FPGA at the HLS description level.

This gives higher performance per cost than a conventional FPGA in the cloud. Practically infinite resources can be used, separated into a number of virtual FPGAs and shared by multiple accesses.

Flow-in-Cloud (FiC) is the first prototype.
Flow-in-Cloud (FiC) overview

[Figure: FPGA boards (FiC-SW), each holding HLS modules and an STDM switch, form a circuit-switching network over high-speed serial links; data flow in directly from IoT devices, and a host CPU attaches through an I/O board (KCU1500)]
This talk focuses on the STDM switch.
Flow-in-Cloud (FiC) SW board

• FPGA: Xilinx Kintex UltraScale XCKU085/095
• FiC network: 8 x 4 high-speed serial links at 9.9 Gbps
• Control network: Ethernet, via a Raspberry Pi 3 Model B on the switch control board
• DDR4 SDRAM: 16 Gb x 2
• The FPGA hosts the STDM switch and an application logic area for HLS modules.

Here, we call each link a "channel" and a group of four channels a "bundle". A board has 8 bundles, each of which has 4 channels.
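As a rough cross-check (an illustrative calculation, not taken from the slides: it assumes the 8.5 Gbps effective per-lane payload rate shown in the block diagram, behind the 9.9 Gbps line rate), the per-board aggregate bandwidth works out as follows:

```python
# Aggregate link bandwidth per board (back-of-the-envelope arithmetic;
# the 8.5 Gbps effective rate and the 8 x 4 lane count are from the slides).
lanes = 8 * 4                     # 8 bundles x 4 channels
effective_gbps = 8.5              # payload rate behind each 9.9 Gbps link
aggregate_gb_per_sec = lanes * effective_gbps / 8
print(aggregate_gb_per_sec)       # -> 34.0
```

This matches the 34 GB/sec aggregated throughput quoted in the summary.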
Block Diagram of FiC

[Figure: block diagram of a FiC board. The static domain contains the STDM switches (sw0–sw3), Xilinx Aurora link cores (8.5 Gbps x 32 lanes = 4 channels x 8 lanes per side, 9.9 Gbps line rate), DRAM, and an Ethernet link to the Raspberry Pi 3. The partially reconfigurable (PR) domain holds the HLS module, which connects to the switch through 170-bit ports at 100 MHz.]
STDM (Static Time Division Multiplexing)

[Figure: an example 4x4 switch with four slots (S1–S4) per port]

Input data arriving at each port are registered cyclically. In each slot, an input register is selected according to the pre-loaded table and transferred to the output register. Output data are sent cyclically to the output port.
STDM (Static Time Division Multiplexing)

[Figure: the same 4x4 switch; output port 1 carries the repeating slot sequence P2S1, P2S2, P1S3, P3S4]

• A circuit is established between source and destination.
• Latency and bandwidth are guaranteed.
• Latency = 45 + 2 x (# of slots) clock cycles.
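The table-driven selection above can be sketched in a few lines (a minimal behavioral model, not the FiC RTL; the class and function names are illustrative):

```python
# Minimal model of an STDM crossbar: in each slot, every output port
# copies the input port named by a pre-loaded table (illustrative only).
class StdmSwitch:
    def __init__(self, table):
        # table[slot][out_port] = in_port forwarded in that slot
        self.table = table

    def cycle(self, slot, inputs):
        sel = self.table[slot % len(self.table)]
        return [inputs[src] for src in sel]

def latency_cycles(num_slots):
    # Latency model from the slide: 45 + 2 x (# of slots) clock cycles.
    return 45 + 2 * num_slots

# Output port 1 carries P2, P2, P1, P3 across slots S1..S4, as in the figure.
sw = StdmSwitch([[1, 0, 3, 2], [1, 2, 0, 3], [0, 3, 1, 2], [2, 1, 3, 0]])
print(sw.cycle(0, ["P1", "P2", "P3", "P4"])[0])  # -> P2
print(latency_cycles(4))                         # -> 53
```

Because the table is loaded before communication starts, no per-packet arbitration is needed, which is what keeps the latency and bandwidth fixed.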
Multicast using the STDM

[Figure: the 4x4 switch; several output ports select the same input in one slot (one output is for internal usage)]

Multicast is done efficiently: multiple outputs can receive the same data in a specific slot.
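Multicast needs no extra mechanism: a slot table in which several outputs name the same input already duplicates the data (a sketch; the function name is illustrative):

```python
def stdm_slot(sel, inputs):
    # sel[out] = input port chosen for this slot; repeating the same
    # input index across outputs multicasts that input's data.
    return [inputs[src] for src in sel]

# Outputs 1-3 all select input port 1 in this slot; output 4 takes port 3.
print(stdm_slot([0, 0, 0, 2], ["A", "B", "C", "D"]))  # -> ['A', 'A', 'A', 'C']
```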
The resource usage

[Table: FPGA resource usage per block; GT = high-speed serial transceivers]

Enough area remains for HLS designs. Four switches are provided for each channel.
The current configuration: 4x6 torus

[Figure: the current FiC with 24 boards (numbered 0–23, named fic00–fic11 and m2fic00–m2fic11), connected in a 4x6 torus; each board uses four link ports (1–4)]
Round Trip Time (24 nodes, HLS-to-HLS)

[Graph: latency (clock cycles) vs. transferred data (x 160 bit x 4) for each slot number, over the 24-board system]

1.2 GB/sec HLS-to-HLS effective bandwidth; 4.6 μsec max. latency.
[Graph: node-to-node bandwidth (HLS-to-HLS), in GB/sec (0–1.4), vs. transferred data (x 160 bit x 4, up to 12000), for 2, 4, 8, 16, 32, and 64 slots]
Comparison with other FPGA-connected multi-GPU systems and MPI/InfiniBand

[Graph: bandwidth (GB/sec, 0–4) vs. data size (4 B to 1 MB) for FiC 2-slot, GPU-to-GPU (PEACH3), GPU-to-GPU (PEACH2), and GPU-to-GPU (MPI/IB)]

Kaneda, et al., "Performance Evaluation of PEACH3: FPGA Switch for Tightly Coupled Accelerators," HEART 2017.
Round Trip Time vs. the number of slots

[Graph: round trip time (clocks, 0–1200) vs. the number of slots (0–70), for 2, 4, and 8 boards]

The pass-through time for a board is about 450 nsec with 2 slots. Slot-synchronization delay increases the round trip time.
Comparison with other systems

[Graph: latency (μsec, 0–6) vs. data size (8–512 bytes) for FiC 2-slot (450 ns), the K computer's Tofu (100 ns), GPU-to-GPU (PEACH3), GPU-to-GPU (PEACH2), and GPU-to-GPU (MPI/IB)]

Ajima, et al., "The Tofu Interconnect," Hot Interconnects 2012.
Broadcast can be done within the maximum hop count

[Figure: the 4x6 torus of 24 boards; a broadcast from one board reaches all others in five steps (①–⑤)]
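The five broadcast steps match the diameter of the 4x6 torus, which can be checked with a one-liner (a sketch assuming bidirectional links in each ring; the function name is illustrative):

```python
def torus_max_hops(dims):
    # With bidirectional rings, the farthest node is floor(k/2) hops
    # away in each dimension; the diameter is the sum over dimensions.
    return sum(k // 2 for k in dims)

print(torus_max_hops((4, 6)))  # -> 5, the number of broadcast steps shown
```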
1-to-all and All-to-all

[Graph: latency (clocks, 0–500) vs. array size (0–30) for 2x2, 2x4, 4x4, and 4x6 configurations]

All-to-all broadcast can be done efficiently.
Reduction Computation

[Figure: the 4x6 torus; partial results are gathered in step ① and combined in step ②]
[Graph: reduction of 170-bit integer data — computation time (clocks, 0–1200) vs. the number of slots (0–20)]

Each node finishes its computation in one clock cycle.
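A rough latency model for such a reduction can be sketched by combining the one-cycle-per-node combine above with the per-hop latency formula from the STDM slide (an illustrative estimate of a chain-shaped reduction, not a measured FiC figure; both function names are hypothetical):

```python
def hop_latency(num_slots):
    # Per-hop latency from the STDM slide: 45 + 2 x (# of slots) cycles.
    return 45 + 2 * num_slots

def chain_reduction_cycles(num_nodes, num_slots):
    # Each node combines in one cycle; partial results traverse
    # num_nodes - 1 links before the final value is ready.
    return num_nodes * 1 + (num_nodes - 1) * hop_latency(num_slots)

print(chain_reduction_cycles(4, 2))  # 4 + 3 * 49 = 151
```

The model shows why the measured reduction time grows with the slot count even though each node's arithmetic takes only one cycle.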
Summary

● 34 GB/sec max. aggregated throughput with 32 links
  ○ vs. PEACH3: 31.4 GB/sec max.
  ○ vs. Catapult-2: 10 GB/sec max. (40 Gbps = 5 GB/sec, x 2)
● 1.2 GB/sec 1-to-1 HLS throughput
  ○ Advantageous for small data sizes
● 450 nsec pass-through time
  ○ Much smaller than PEACH3 and MPI/InfiniBand
● 4.6 μsec maximum HLS-to-HLS latency
● Future work
  ○ Improving the input/output delay of the HLS part
  ○ Slot synchronization