Deep Learning with GPU Optimized Servers...

Confidential

© 2016 Supermicro

Server Dept.

James Hao 郝旭光

Deep Learning with GPU Optimized Servers: Making Deep Learning More Efficient


HETEROGENEOUS COMPUTING

SERIAL WORKLOADS
• Optimized for low-latency access to cached data sets
• Control logic for out-of-order and speculative execution

PARALLEL WORKLOADS
• Optimized for data-parallel, throughput computation
• Architecture tolerant of memory latency
• More transistors dedicated to computation


GPU APPLICATIONS

GPU Compute enables future applications
• Enriching the user experience via GPU compute
• Delivering heterogeneous energy-efficient computing
• Allows developers to unlock the potential of complex applications for consumers

Application areas
• Defense Intelligence
• Safety and Security
• Computational Finance
• Research and Scientific
• Machine Learning
• Media & Entertainment
• Oil & Gas
• CAD and CAE


X10 GPU Server Portfolio

GPU OPTIMIZED

Ratio: GPU:CPU

TOWER / RACK
7048GR           4:2 (4U)
4028GR           8:2 (4U)
1028GQ           4:2 (1U)
2028GR           6:2 (2U)
1028GR           3:2 (1U)
1018GR/5018GR    2:1 (1U)

DEEP LEARNING
4028GR-TR2       10:2 (4U)
1028GQ-TXRT      4:2 (1U)
4028GR-TXRT      8:2 (4U)


Machine Learning – Driven By Scale

Platform            Connections    Year
CPU                 1 million      2007
GPU                 10 million     2008
Cloud (Many CPU)    1 billion      2011
HPC (Many GPU)      100 billion    2015

[Diagram labels: Architecture, Code, Experiment]


CUSTOMER PAIN POINTS

PROBLEM
Machine Learning / AI applications have large datasets, well beyond the capacity of a single GPU.

SOLUTION
Aggregate GPU resources to tackle large-dataset computation, in conjunction with high-speed connectivity to minimize latency.


Server Portfolio GPU Peering

Best-in-class technology designed for augmented performance in Machine Learning applications, enabling users to train twice as fast and explore networks twice as large.

1028GQ-TXR
• 1U Chassis
• Dual HSW/BDW CPUs
• 16 DDR4 DIMMs
• 2 x 2.5” HS HDD bays
• 4 Pascal w/ 40GB/s NVLink
• 3 x16 + 1 x8 PCIe 3.0 slots
• 2 x 2000W Titanium PWS
Scalability

4028GR-TR2
• 4U Chassis
• Dual HSW/BDW CPUs
• 24 DDR4 DIMMs
• 10 Double-Wide GPUs
• 11 x16 + 1 x8 PCIe 3.0 slots
• 4 (2+2) x 2000W Titanium PWS
Flexibility

4028GR-TXR
• 4U Chassis
• Dual HSW/BDW CPUs
• 24 DDR4 DIMMs
• 8 Pascal w/ 20GB/s NVLink
• 4 x16 + 2 x8 PCIe 3.0 slots
• 4 (2+2) x 2000W Titanium PWS
HyperScale


New Architecture, More Performance

[Figure: numbered layout diagrams of SYS-4028GR-TR(T) and SYS-4028GR-TR(T)2]

GPU-to-GPU latency (µs)

FROM    TO      SYS-4028GR-TR(T)    SYS-4028GR-TR(T)2
GPU1    GPU2    6.6                 6.6
GPU2    GPU4    6.7                 6.6
GPU3    GPU9    21.2                6.7
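To make the latency comparison on this slide easier to read, the short Python sketch below computes how many times lower each SYS-4028GR-TR(T)2 figure is than the matching SYS-4028GR-TR(T) figure. The numbers are copied from the slide; the names `latency_us` and `improvement` are illustrative only and do not come from the source.

```python
# GPU-to-GPU latencies in microseconds, copied from the slide.
# Keys are (source GPU, destination GPU); values are
# (SYS-4028GR-TR(T), SYS-4028GR-TR(T)2).
latency_us = {
    ("GPU1", "GPU2"): (6.6, 6.6),
    ("GPU2", "GPU4"): (6.7, 6.6),
    ("GPU3", "GPU9"): (21.2, 6.7),
}


def improvement(old_us, new_us):
    """Return how many times lower the new latency is than the old one."""
    return old_us / new_us


# Ratio per GPU pair; a value of 1.0 means no change.
ratios = {pair: improvement(old, new) for pair, (old, new) in latency_us.items()}

for (src, dst), ratio in ratios.items():
    print(f"{src} -> {dst}: {ratio:.2f}x")
```

On these figures, only the GPU3 to GPU9 path changes materially (roughly a threefold reduction); the other two pairs are essentially unchanged.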


PASCAL GPU ARCHITECTURE

NVLink
• Interconnect at 80 GB/s (Speed of CPU Memory)

Stacked 3D Memory
• 4x Higher Bandwidth – 1 TB/s (2.5x Capacity, 4x more Efficient)

Unified Memory
• Lower development effort (Available today in CUDA 6)

[Diagram labels: NVLink 80 GB/s; Stacked HBM Memory 1 TB/s; DDR4 Memory 50-75 GB/s; Unified Memory]
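As a rough illustration of what the bandwidth figures on this slide imply, the Python sketch below estimates the ideal time to move a payload over each path. The bandwidths are the ones quoted on the slide (1 TB/s is taken as 1000 GB/s); the 16 GB payload size and the helper name `ideal_transfer_ms` are assumptions made for the example, and the result is a lower bound that ignores latency and protocol overhead.

```python
# Bandwidths in GB/s as quoted on the slide. DDR4 is given as a range,
# so both ends are kept. 1 TB/s is treated as 1000 GB/s.
BANDWIDTH_GB_S = {
    "NVLink": 80.0,
    "Stacked HBM": 1000.0,
    "DDR4 (low)": 50.0,
    "DDR4 (high)": 75.0,
}

# Hypothetical payload size, chosen only for illustration.
PAYLOAD_GB = 16.0


def ideal_transfer_ms(payload_gb, bandwidth_gb_s):
    """Lower-bound transfer time in milliseconds (no latency, no overhead)."""
    return payload_gb / bandwidth_gb_s * 1000.0


times_ms = {
    name: ideal_transfer_ms(PAYLOAD_GB, bw) for name, bw in BANDWIDTH_GB_S.items()
}

for name, ms in times_ms.items():
    print(f"{name:12s} {ms:7.1f} ms")
```

Real transfers will be slower than these idealized numbers, but the relative ordering shows why the slide compares the 80 GB/s interconnect to CPU memory speed.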


SYS-1028GQ-TXR / TXRT

PASCAL GPU READY
• Performance – 10 TFLOPs FP32
• NVLink – 5x PCIe
• 3D Memory – 2x Memory Bandwidth

X10 SUPERMICRO ADVANTAGE
● PERFORMANCE: 4x PASCAL GPUs IN 1U
● NVLINK: 80GB/s High Bandwidth GPU Interconnect
● GPU RDMA: Direct Internode GPU Interconnect
● EFFICIENCY: Titanium-rated Power Supply
● DESIGN: No GPU preheating

ADVANTAGES
• All GPUs capable of Peer-to-Peer direct access to all other GPUs’ memory, as well as direct transfer (memcpy) operations via NVLink at high bandwidth
• High performance for collective communications
• PCIe bandwidth fully available for host and/or NIC communication during inter-GPU communication

Unparalleled 1U platform for highly parallel applications. No one else can do so much in 1U! Up to 4 Pascal GPUs with NVLink in 1U, supporting optimized GPU RDMA.


NVLINK ARCHITECTURE: CUBE MESH

SYS-4028GR-TXRT

Processor Support: Dual Xeon E5-2600 v4/v3 CPUs (Socket R3); 8 Tesla P100 (Pascal) GPUs (SXM2)
Memory Capacity: 24 DIMMs, 3TB ECC DDR4 2400MHz
Expansion Slots: 4 PCI-e 3.0 x16 (for RDMA via EDR), 2 PCI-e 3.0 x8
I/O Ports: 1x VGA, 2x 10G-BaseT LAN, 3x USB 3.0, and 1x IPMI dedicated LAN port
Drive Bays: 16 hot-swap 2.5” drive bays (supports 8x NVMe)
System Cooling: 8 heavy-duty fans optimized to support 8 GPU cards
Power Supply: 4 x 2000W (2+2) Titanium Level efficiency redundant power supplies

Key Features:
● THROUGHPUT: Highest Parallelism with 8x Pascal GPUs
● NVLINK: 80GB/s High Bandwidth GPU Interconnect
● RDMA FABRIC: Lowest latency of data access and transfer
● FLEXIBILITY: Revolutionary Rack Scale Design
● DESIGN: Independent GPU and CPU thermal zones
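The slide names the NVLink topology as a cube mesh but does not show the wiring in this transcript. The Python sketch below is therefore only an illustration under a stated assumption: that the 8 GPUs are wired as two fully connected groups of four, with each GPU also linked to its counterpart in the other group. The GPU numbering 0-7 and the function names are hypothetical. Under that assumption, it counts the links per GPU and the worst-case number of hops between any two GPUs.

```python
from collections import deque

NUM_GPUS = 8  # matches the 8 GPUs listed for this system


def build_cube_mesh():
    """Adjacency sets for the ASSUMED wiring: two fully connected quads
    (GPUs 0-3 and 4-7) plus one cross link from GPU g to GPU g + 4."""
    links = {g: set() for g in range(NUM_GPUS)}
    for quad_start in (0, 4):
        quad = range(quad_start, quad_start + 4)
        for a in quad:
            for b in quad:
                if a != b:
                    links[a].add(b)
    for g in range(4):
        links[g].add(g + 4)
        links[g + 4].add(g)
    return links


def hops_from(links, src):
    """Breadth-first search: minimum number of link hops from src to each GPU."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        for nxt in links[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


links = build_cube_mesh()
links_per_gpu = sorted(len(peers) for peers in links.values())
max_hops = max(max(hops_from(links, g).values()) for g in range(NUM_GPUS))

print("links per GPU:", links_per_gpu)
print("max hops between any two GPUs:", max_hops)
```

If the actual wiring differs from this assumption, only `build_cube_mesh` needs to change; the hop-count logic is independent of the topology.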


GPU: 1U DP SYS-1028GQ-TR(T)

Motherboard: X10DGQ
Chassis: CSE-118GQETS-R2K03P

Processor Support: Dual Xeon E5-2600 v4/v3 CPUs (Socket R3)
Memory Capacity: 16 DIMMs, up to 1TB ECC DDR4 2400MHz
Expansion Slots: 4 PCI-e x16 Gen 3.0 for double-wide GPU cards, 2 x8 (in x16 slot) LP card
I/O Ports: 1x VGA, 2x GbE or 2x 10GbaseT LAN, 2x USB 3.0, and 1x IPMI dedicated LAN port
Drive Bays: 2 hot-swap 2.5” drive bays; 4 total 2.5” HDD bays
System Cooling: 9 counter-rotating fans with optimal fan speed control
Power Supply: 2000W Platinum Level efficiency redundant power supply

Key Features:
• Supports up to 4 double-width GPU cards (including GTX)
• Redundant Platinum Level 2000W power supplies
• No GPU preheat
• Cost-optimized system

Key Applications:
• Oil & Gas
• Research & Scientific
• VDI technology
• Computational Finance
