Deep Learning with GPU Optimized Servers...
Confidential
© 2016 Supermicro
Server Dept.
James Hao 郝旭光
Deep Learning with GPU Optimized Servers: Making Deep Learning More Efficient
HETEROGENEOUS COMPUTING
SERIAL WORKLOADS
• Optimized for low-latency access to cached data sets
• Control logic for out-of-order and speculative execution
PARALLEL WORKLOADS
• Optimized for data-parallel, throughput computation
• Architecture tolerant of memory latency
• More transistors dedicated to computation
GPU APPLICATIONS
GPU Compute enables future applications
• Enriching the user experience via GPU compute
• Delivering heterogeneous, energy-efficient computing
• Allowing developers to unlock the potential of complex applications for consumers
Application areas
• Defense Intelligence
• Safety and Security
• Computational Finance
• Research and Scientific
• Machine Learning
• Media & Entertainment
• Oil & Gas
• CAD and CAE
X10 GPU Server Portfolio (GPU Optimized)
Ratio: GPU:CPU
TOWER
• 7048GR: 4:2 (4U)
RACK
• 4028GR: 8:2 (4U)
• 1028GQ: 4:2 (1U)
• 2028GR: 6:2 (2U)
• 1028GR: 3:2 (1U)
• 1018GR/5018GR: 2:1 (1U)
DEEP LEARNING
• 4028GR-TR2: 10:2 (4U)
• 1028GQ-TXRT: 4:2 (1U)
• 4028GR-TXRT: 8:2 (4U)
Machine Learning – Driven By Scale
• CPU: 1 million connections (2007)
• GPU: 10 million connections (2008)
• Cloud (many CPUs): 1 billion connections (2011)
• HPC (many GPUs): 100 billion connections (2015)
Architecture / Code / Experiment
CUSTOMER PAIN POINTS
PROBLEM
Machine Learning / AI applications have large datasets that go well beyond the capacity of a single GPU.
SOLUTION
Aggregate GPU resources to tackle large-dataset computation, in conjunction with high-speed connectivity to minimize latency.
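As a rough illustration of what aggregating GPU resources can mean for a dataset that does not fit on one device, the sketch below splits a dataset's sample indices into near-equal contiguous shards, one per GPU. The function name `shard_ranges` and the sample count are hypothetical and not taken from the slides; real frameworks handle this partitioning in their own ways.

```python
def shard_ranges(n_samples: int, n_gpus: int) -> list[range]:
    """Split indices 0..n_samples-1 into n_gpus contiguous, near-equal shards."""
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    base, extra = divmod(n_samples, n_gpus)
    shards = []
    start = 0
    for gpu in range(n_gpus):
        # The first `extra` GPUs take one additional sample each.
        size = base + (1 if gpu < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


if __name__ == "__main__":
    # Hypothetical dataset size, spread over an 8-GPU system.
    for gpu, shard in enumerate(shard_ranges(1_000_003, 8)):
        print(f"GPU {gpu}: {len(shard)} samples starting at index {shard.start}")
```

Each GPU then works on its own shard, and the interconnect between GPUs determines how quickly partial results can be exchanged.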
Server Portfolio GPU Peering
Best-in-class technology designed for augmented performance in Machine Learning applications, so that users can train twice as fast and explore networks twice as large.
1028GQ-TXR: Scalability
• 1U Chassis
• Dual HSW/BDW CPUs
• 16 DDR4 DIMMs
• 2x 2.5” HS HDD bays
• 4 Pascal w/ 40GB/s NVLink
• 3 x16 / 1 x8 PCIe 3.0 slots
• 2x 2000W Titanium PWS
4028GR-TR2: Flexibility
• 4U Chassis
• Dual HSW/BDW CPUs
• 24 DDR4 DIMMs
• 24x 2.5” HS HDD bays
• 10 Double-Wide GPUs
• 11 x16 / 1 x8 PCIe 3.0 slots
• 4x (2+2) 2000W Titanium PWS
4028GR-TXR: HyperScale
• 4U Chassis
• Dual HSW/BDW CPUs
• 24 DDR4 DIMMs
• 16x 2.5” HS HDD bays
• 8 Pascal w/ 20GB/s NVLink
• 4 x16 / 2 x8 PCIe 3.0 slots
• 4x (2+2) 2000W Titanium PWS
New Architecture, More Performance
[Diagram: GPU slot layout of SYS-4028GR-TR(T) and SYS-4028GR-TR(T)2]
FROM   TO     SYS-4028GR-TR(T) (µs)   SYS-4028GR-TR(T)2 (µs)
GPU1   GPU2   6.6                     6.6
GPU2   GPU4   6.7                     6.6
GPU3   GPU9   21.2                    6.7
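To make the comparison concrete, the short sketch below computes the per-pair ratio between the two systems from the microsecond figures in the table. It assumes those figures are GPU-to-GPU latencies, as the units suggest; the names `LATENCY_US`, `speedup` and `worst_case` are illustrative only.

```python
# (from, to): (SYS-4028GR-TR(T) in µs, SYS-4028GR-TR(T)2 in µs), copied from the table.
LATENCY_US = {
    ("GPU1", "GPU2"): (6.6, 6.6),
    ("GPU2", "GPU4"): (6.7, 6.6),
    ("GPU3", "GPU9"): (21.2, 6.7),
}


def speedup(old_us: float, new_us: float) -> float:
    """How many times lower the new figure is than the old one."""
    return old_us / new_us


def worst_case(column: int) -> float:
    """Largest figure in one column (0 = TR(T), 1 = TR(T)2)."""
    return max(pair[column] for pair in LATENCY_US.values())


if __name__ == "__main__":
    for (src, dst), (old, new) in LATENCY_US.items():
        print(f"{src} -> {dst}: {old} us -> {new} us ({speedup(old, new):.2f}x)")
    print(f"worst case: {worst_case(0)} us vs {worst_case(1)} us")
```

The first two pairs are essentially unchanged; the gain is in the GPU3 to GPU9 pair, where the figure drops from 21.2 µs to 6.7 µs, a little over a threefold reduction.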
PASCAL GPU ARCHITECTURE
NVLink
• Interconnect at 80 GB/s (Speed of CPU Memory)
Stacked 3D Memory
• 4x Higher Bandwidth: 1 TB/s (2.5x Capacity, 4x more Efficient)
Unified Memory
• Lower development effort (Available today in CUDA 6)
[Diagram labels: NVLink 80 GB/s; Stacked HBM Memory 1 TB/s; DDR4 Memory 50-75 GB/s; Unified Memory]
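For a sense of scale, the sketch below turns the bandwidth figures quoted on this slide into ideal transfer times for a fixed payload. The 16 GB payload is a hypothetical size chosen for illustration, and the result is a best-case estimate that ignores protocol overhead and latency, so real transfers would take longer.

```python
GB = 1e9  # decimal gigabyte, as used in vendor bandwidth figures

# Peak bandwidths quoted on the slide, in bytes per second.
BANDWIDTH = {
    "NVLink": 80 * GB,
    "Stacked HBM": 1000 * GB,  # 1 TB/s
    "DDR4 (low)": 50 * GB,
    "DDR4 (high)": 75 * GB,
}


def transfer_seconds(n_bytes: float, bytes_per_second: float) -> float:
    """Ideal transfer time: payload size divided by peak bandwidth."""
    return n_bytes / bytes_per_second


if __name__ == "__main__":
    payload = 16 * GB  # hypothetical payload size, not from the slides
    for name, bw in BANDWIDTH.items():
        ms = transfer_seconds(payload, bw) * 1e3
        print(f"{name:12s}: {ms:7.1f} ms")
```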
SYS-1028GQ-TXR / TXRT
PASCAL GPU READY
• Performance: 10 TFLOPS FP32
• NVLink: 5x PCIe
• 3D Memory: 2x Memory Bandwidth
X10 SUPERMICRO ADVANTAGE
● PERFORMANCE: 4x Pascal GPUs in 1U
● NVLINK: 80GB/s High Bandwidth GPU Interconnect
● GPU RDMA: Direct Internode GPU Interconnect
● EFFICIENCY: Titanium-rated Power Supply
● DESIGN: No GPU preheating
ADVANTAGES
• All GPUs are capable of peer-to-peer direct access to all other GPUs’ memory, as well as direct transfer (memcpy) operations via NVLink at high bandwidth
• High performance for collective communications
• PCIe bandwidth remains fully available for host and/or NIC communication during inter-GPU communication
Unparalleled 1U platform for highly parallel applications. No one else can do so much in 1U! Up to 4 Pascal GPUs with NVLink in 1U, supporting optimized GPU RDMA.
NVLINK ARCHITECTURE: CUBE MESH
SYS-4028GR-TXRT
Processor Support: Dual Xeon E5-2600 v4/v3 CPUs (Socket R3); 8 Tesla P100 (Pascal) GPUs (SXM2)
Memory Capacity: 24 DIMMs, 3TB ECC DDR4 2400MHz
Expansion Slots: 4 PCI-e 3.0 x16 (for RDMA via EDR); 2 PCI-e 3.0 x8
I/O ports: 1x VGA, 2x 10G-BaseT LAN, 3x USB 3.0, and 1x IPMI dedicated LAN port
Drive Bays: 16 hot-swap 2.5” drive bays (supports 8x NVMe)
System Cooling: 8 heavy-duty fans optimized to support 8 GPU cards
Power Supply: 4 x 2000W (2+2) Titanium Level efficiency redundant power supplies
Key Features:
● THROUGHPUT: Highest Parallelism with 8x Pascal GPUs
● NVLINK: 80GB/s High Bandwidth GPU Interconnect
● RDMA FABRIC: Lowest latency of data access and transfer
● FLEXIBILITY: Revolutionary Rack Scale Design
● DESIGN: Independent GPU and CPU thermal zones
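The slide names a cube mesh but the wiring diagram did not survive extraction. The sketch below models one commonly described 8-GPU hybrid cube mesh, in which each group of four GPUs is fully connected and each GPU has one more link to its counterpart in the other group. That this is the exact wiring of the SYS-4028GR-TXRT is an assumption, and the GPU numbering is arbitrary. The 20 GB/s per-link figure is the one quoted for the 8-GPU system earlier in the deck.

```python
from collections import deque
from itertools import combinations

LINK_GBPS = 20  # per-link NVLink figure quoted for the 8-GPU system
N_GPUS = 8


def hybrid_cube_mesh() -> dict[int, set[int]]:
    """Return {gpu: neighbours} for an ASSUMED 8-GPU hybrid cube mesh."""
    links = {g: set() for g in range(N_GPUS)}
    for quad in (range(0, 4), range(4, 8)):
        for a, b in combinations(quad, 2):  # full connectivity inside each quad
            links[a].add(b)
            links[b].add(a)
    for g in range(4):  # one link between matching GPUs of the two quads
        links[g].add(g + 4)
        links[g + 4].add(g)
    return links


def max_hops(links: dict[int, set[int]]) -> int:
    """Largest shortest-path length between any two GPUs (breadth-first search)."""
    worst = 0
    for src in links:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in links[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    mesh = hybrid_cube_mesh()
    for gpu, peers in mesh.items():
        print(f"GPU {gpu}: links to {sorted(peers)}, {len(peers) * LINK_GBPS} GB/s aggregate")
    print("maximum hops between any two GPUs:", max_hops(mesh))
```

Under this assumed wiring every GPU has four links, which at 20 GB/s each matches the 80 GB/s aggregate figure in the key features, and any GPU reaches any other in at most two hops.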
GPU: 1U DP SYS-1028GQ-TR(T)
Processor Support: Dual Xeon E5-2600 v4/v3 CPUs (Socket R3)
Memory Capacity: 16 DIMMs, up to 1TB ECC DDR4 2400MHz
Expansion Slots: 4 PCI-e x16 Gen 3.0 for double-wide GPU cards; 2 x8 (in x16 slot) LP cards
I/O ports: 1x VGA, 2x GbE or 2x 10GBase-T LAN, 2x USB 3.0, and 1x IPMI dedicated LAN port
Drive Bays: 2 hot-swap 2.5” drive bays; 4 total 2.5” HDD bays
System Cooling: 9 counter-rotating fans with optimal fan speed control
Power Supply: 2000W Platinum Level efficiency redundant power supply
Motherboard: X10DGQ
Chassis: CSE-118GQETS-R2K03P
Key Features:
• Supports up to 4 double-width GPU cards (including GTX)
• Redundant Platinum Level 2000W power supplies
• No GPU preheating
• Cost Optimized System
Key Applications:
• Oil & Gas
• Research & Scientific
• VDI technology
• Computational Finance
Thank You!