HPC on Cloud
류현곤 차장 (Deputy Director), scent.gist.ac.kr
NVIDIA CONFIDENTIAL
Agenda
VDI, GPU on Cloud
CUDA
OpenACC
CUDA Kepler Architecture
CUDA on ARM
CUDA library
cuBLAS
cuFFT
pyCUDA
VDI, GPU on CLOUD
Night and Day Difference: Without GPU vs. With GPU
Iray Photorealism
Iray Photograph
VIRTUAL DESKTOPS
VIRTUAL MACHINE
NVIDIA Driver
NVIDIA GRID Enabled Virtual Desktop
NVIDIA GRID GPU
VDI
NVIDIA GRID ENABLED Hypervisor
VIRTUAL DESKTOP: Virtualized GPU
VIRTUAL REMOTE WORKSTATION: Dedicated GPU
DESIGNER: PTC, ANSYS, MSC Patran; Siemens NX 8.5; CATIA, DELMIA, SIMULIA
POWER USER: PLM, Factory Floor Work Instructions, TechPubs; SolidWorks, AutoDesk, Adobe CS Visualization
KNOWLEDGE WORKER: MS Office, Photoshop
Graphics Options in Virtualization
GRAPHICS ACCELERATED VDI
[Chart: application compatibility vs. number of CCUs]
SoftPC VDI (w/o GPU)
Dedicated GPU: HDX 3D Pro, View vDGA
NVIDIA GRID vGPU: XenDesktop 7.1
Shim graphics: View sVGA, RemoteFX
NVIDIA GRID vGPU Architecture
Each Virtual Machine on the hypervisor runs a Guest OS with the NVIDIA GRID USM driver, the virtual desktop apps, and remote display.
Host side: the GPU Hypervisor, the Hypervisor Device Emulation Framework, the Virtual GPU Manager, and the Resource Manager sit between the VMs and the NVIDIA GRID GPU (with GPU MMU).
The remote protocol carries state and graphics commands; each VM gets a dedicated channel to the GPU.
vGPU = VGX Hypervisor: partitions GPU memory and allocates it to VMs, and time-slices GPU cores as needed for processing.
NVIDIA GRID K1 (Shipping Now)
GPU: 4 Kepler GPUs
CUDA cores: 768 (192 / GPU)
Memory size: 16 GB DDR3 (4 GB / GPU)
Power: 130 W
Form factor: dual-slot ATX, 10.5"
Display IO: none
Aux power requirement: 6-pin connector
PCIe: x16
PCIe generation: Gen3 (Gen2 compatible)
Cooling solution: passive
# users: 4–100 (1)
OpenGL: 4.x
Microsoft DirectX: 11
(1) Number of users depends on software solution, workload, and screen resolution
GRID K1 = 4x Quadro K600
NVIDIA GRID K2 (Shipping Now)
GPU: 2 high-end Kepler GPUs
CUDA cores: 3072 (1536 / GPU)
Memory size: 8 GB GDDR5
Power: 225 W
Form factor: dual-slot ATX, 10.5"
Display IO: none
Aux power requirement: 8-pin connector
PCIe: x16
PCIe generation: Gen3 (Gen2 compatible)
Cooling solution: passive
# users: 2–64 (1)
OpenGL: 4.3
Microsoft DirectX: 11
(1) Number of users depends on software solution, workload, and screen resolution
GRID K2 = 2x Quadro K5000
CUDA 101
The Era of Accelerated Computing is Here
[Timeline, 1980–2020: Era of Vector Computing → Era of Distributed Computing → Era of Accelerated Computing]
GPU Roadmap
[Chart: DP GFLOPS per Watt (log scale, 0.5–32) vs. year, 2008–2014]
Tesla (CUDA, FP64) → Fermi → Kepler (Dynamic Parallelism) → Maxwell (Unified Virtual Memory) → Volta (Stacked DRAM)
Agenda
OpenACC
Kepler Architecture
CUDA 5.5 features
OpenACC
Application Code = GPU + CPU: use the GPU to parallelize the compute-intensive functions, and keep the rest of the sequential code on the CPU.
Directives: Easy, Open & Powerful
Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours
"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
-- Developer at the Global Manufacturer of Navigation Systems
OpenACC Directives
Program myscience
... serial code ...
!$acc kernels
do k = 1,n1
do i = 1,n2
... parallel code ...
enddo
enddo
!$acc end kernels
...
End Program myscience
Your original Fortran or C code, plus simple compiler hints (the !$acc directives): the compiler parallelizes the marked region for the GPU while the rest stays on the CPU. Works on many-core GPUs and multicore CPUs.
Can you mix CUDA and OpenACC? Yes, you can even use CUDA to manage memory.
2 Basic Steps to Get Started
Step 1: Annotate source code with directives:
!$acc data copy(util1,util2,util3) copyin(ip,scp2,scp2i)
!$acc parallel loop
…
!$acc end parallel
!$acc end data
Step 2: Compile & run:
pgf90 -ta=nvidia -Minfo=accel file.f
OpenACC Directives Example
!$acc data copy(A,Anew)
iter=0
do while ( err > tol .and. iter < iter_max )
iter = iter +1
err=0._fp_kind
!$acc kernels
do j=1,m
do i=1,n
Anew(i,j) = .25_fp_kind *( A(i+1,j ) + A(i-1,j ) &
+A(i ,j-1) + A(i ,j+1))
err = max( err, Anew(i,j)-A(i,j))
end do
end do
!$acc end kernels
IF(mod(iter,100)==0 .or. iter == 1) print *, iter, err
A= Anew
end do
!$acc end data
Annotations: the data directive copies arrays into GPU memory for the whole region; the kernels directive parallelizes the loop nest inside it; end kernels closes off the parallel region; end data closes the data region and copies the data back.
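The directive-annotated loop nest above is an ordinary Jacobi sweep. A minimal CPU sketch of the same computation, using NumPy arrays as stand-ins for the Fortran arrays and an assumed toy 16x16 grid:

```python
import numpy as np

def jacobi_step(A):
    # Anew(i,j) = 0.25*(A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1)) on the interior
    Anew = A.copy()
    Anew[1:-1, 1:-1] = 0.25 * (A[2:, 1:-1] + A[:-2, 1:-1]
                               + A[1:-1, :-2] + A[1:-1, 2:])
    err = np.max(np.abs(Anew - A))   # max change, used as the convergence test
    return Anew, err

# Toy boundary-value problem: one hot edge, iterate until the change is small
A = np.zeros((16, 16))
A[0, :] = 1.0
err, tol, it = 1.0, 1e-4, 0
while err > tol and it < 10000:
    A, err = jacobi_step(A)
    it += 1
```

The `!$acc kernels` region corresponds to `jacobi_step`; the `!$acc data` region corresponds to keeping `A` resident across all iterations of the while loop instead of copying it every sweep.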
Kepler Architecture
Kepler GK110 Block Diagram
Architecture
7.1B Transistors
15 SMX units
> 1 TFLOP FP64
1.5 MB L2 Cache
384-bit GDDR5
PCI Express Gen3
Power vs Clock Speed Example
Fermi ran its datapath logic at 2x clock; Kepler runs twice as many units (A and B) at 1x clock.
Logic: 1.8x area, 0.9x power (vs. 1.0x / 1.0x)
Clocking: 1.0x area, 0.5x power (vs. 1.0x / 1.0x)
GPU Performance
Theoretical peak performance is determined by:
ratio of DP units (1/2, 1/3, 1/8, 1/16)
CUDA cores per MP (8, 32, 48, 192)
MPs per GPU (1, 2, 7, 8, 13, 14, 15)
FMA (1 or 2 flops per clock)
clock in MHz (704, 870, 1200)
Example: DP 1.2 Tflops, SP 2.4 Tflops
Measured performance is determined by:
DGEMM ratio (75%, 85%, 95%)
CPU + GPU DGEMM
DTRSM ratio
MPI ratio (50%, 90%)
Example: DGEMM 1117 Gflops, DP 880 Gflops
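The theoretical-peak arithmetic above is a single product of those factors. A sketch with assumed K20-like numbers (13 SMX, 192 SP cores each, 705 MHz, 1/3 DP:SP ratio — illustrative, not from the slide):

```python
def peak_gflops(n_mp, cores_per_mp, clock_mhz, flops_per_clock=2):
    # flops_per_clock=2 assumes one fused multiply-add (FMA) per core per cycle
    return n_mp * cores_per_mp * clock_mhz * flops_per_clock / 1e3

sp = peak_gflops(13, 192, 705)   # single-precision peak, GFLOPS (~3.5 TF)
dp = sp / 3                      # GK110 executes DP at 1/3 the SP rate (~1.2 TF)
```

The ~1.2 TF DP result matches the slide's theoretical example; the measured DGEMM figure is then this peak times the DGEMM efficiency ratio.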
new CUDA-HPL (N = 50,000)
( >50 ms) memcpyH2D ( 1024 x 50000 doubles )
loop 50 ( 50,000/1024)
( >1 ms) Gather ( 1 kernel - read/write 1024x1024 doubles)
( >2 ms) Scatter ( 1 kernel - read/write 1024x1024 doubles)
( >4 ms) cublasDtrsm ( 31 Kernels - A=1024x1024 B=1024x1024 )
( >1 ms) Transpose ( 1 kernel - read/write 1024x1024 doubles)
( >110 ms) cublasDgemm ( 1 or 2 kernels A=1024x50000 B=1024x1024 C=1024x50000)
( >50 ms) memcpyD2H ( 1024 x 50000 doubles)
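As a back-of-envelope check on those memcpy timings (assuming 8-byte doubles and taking the slide's ">50 ms" as the transfer time):

```python
n_doubles = 1024 * 50000          # one panel of the N = 50,000 HPL matrix
bytes_moved = n_doubles * 8       # doubles are 8 bytes: 409,600,000 bytes
t = 0.050                         # the ">50 ms" lower bound from the timing above
bandwidth_gbs = bytes_moved / t / 1e9   # ~8.2 GB/s, a plausible PCIe host-transfer rate
```

This suggests the H2D/D2H steps are PCIe-bound, which is why they bracket the compute steps rather than overlapping them in this breakdown.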
New Instruction: SHFL
Data exchange between threads within a warp
Avoids use of shared memory
One 32-bit value per exchange
4 variants:
[Diagram: four shuffle patterns across warp lanes a b c d e f g h]
__shfl(): indexed any-to-any
__shfl_up(): shift right to nth neighbour
__shfl_down(): shift left to nth neighbour
__shfl_xor(): butterfly (XOR) exchange
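A pure-Python model of the four variants, on an assumed 8-lane "warp" (for the up/down shifts, out-of-range lanes keep their own value, matching the intrinsics' behavior):

```python
def shfl(vals, src):            # __shfl(): indexed any-to-any broadcast from lane src
    return [vals[src]] * len(vals)

def shfl_up(vals, delta):       # __shfl_up(): lane i reads lane i-delta; low lanes unchanged
    return [vals[i - delta] if i >= delta else vals[i] for i in range(len(vals))]

def shfl_down(vals, delta):     # __shfl_down(): lane i reads lane i+delta; high lanes unchanged
    n = len(vals)
    return [vals[i + delta] if i + delta < n else vals[i] for i in range(n)]

def shfl_xor(vals, mask):       # __shfl_xor(): butterfly exchange of lane i with lane i^mask
    return [vals[i ^ mask] for i in range(len(vals))]

lanes = list("abcdefgh")        # stand-in for one 32-bit value per warp lane
```

For example, `shfl_xor(lanes, 1)` swaps adjacent pairs, which is the first stage of a butterfly reduction across the warp.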
Texture Cache Unlocked
Added a new path for compute
Avoids the texture unit
Allows a global address to be fetched and cached
Eliminates texture setup
Why use it?
Separate pipeline from shared/L1
Highest miss bandwidth
Flexible, e.g. unaligned accesses
Managed automatically by compiler
“const __restrict” indicates eligibility
[Diagram: SMX reads through the Tex units and the read-only data cache, backed by L2]
const __restrict Example
__global__ void saxpy(float x, float y, const float * __restrict input, float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    // Compiler will automatically use texture for "input"
    output[offset] = (input[offset] * x) + y;
}
Annotate eligible kernel parameters with const __restrict; the compiler will automatically map loads to the read-only data cache path.
What is Dynamic Parallelism?
The ability to launch new grids from the GPU
Dynamically
Simultaneously
Independently
Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.
What Does It Mean?
From GPU as co-processor to autonomous, dynamic parallelism.
CUDA 5.5
ARM: Fastest Growing CPU
[Chart: x86 and Cortex processors shipped, 1993–2013; processor market share (0–100%) and processor shipments (0–5 billion)]
Source: Mercury Research, ARM, internal estimates
ARM + CUDA development kits
PCI-e slot
MXM slot: compatible with all GPUs
CARMA MXM: shipping now; Kayla mITX: shipping now; Kayla MXM: coming soon
MXM slot: K20X compatible, Quadro 1000M compatible
Unpack the Kayla
ARM Kayla: order from SECO (Italy), $600 each
Kepler GPU, mATX power
RS-232 cable to a PC
Login ID: ubuntu, password: ubuntu
CUDA
Smarter Robots to Energy Efficient Supercomputers
Building self-aware robots; 35% more energy-efficient supercomputers
[Chart: GPU utilization (%) over time, without Hyper-Q vs. with Hyper-Q]
[Diagram: CUDA MPI ranks 0–3 each talk to the GPU through a shared CUDA Server Process]
Multi-Process Server Required for Hyper-Q / MPI
$ mpirun -np 4 my_cuda_app
No application re-compile to share the GPU; no user configuration needed; can be preconfigured by the SysAdmin.
MPI ranks using CUDA are clients; the server spawns on demand, one per user.
One job per user: there is no isolation between MPI ranks; exclusive-process mode enforces a single server.
One GPU per rank: no cudaSetDevice(); only CUDA device 0 is visible.
Strong Scaling of CP2K on Cray XK7
Hyper-Q with multiple MPI ranks leads to a 2.5x speedup over a single MPI rank using the GPU (blog post by Peter Messmer of NVIDIA).
Stream Priorities Accelerate the Critical Path
[Timelines: Stream 1 runs Kernel A and Kernel B; Stream 2 runs Kernel X, then Kernel C. Without priorities, Kernel X waits for Stream 1 to drain; with a high-priority Stream 2, Kernel X runs as soon as it is launched.]
With priorities: especially useful when Kernel X generates data for MPI_Send().
cuBLAS
CUDA-accelerated library of the BLAS routines for matrix computation
Built into the latest CUDA releases
CUDA-HPL uses the cuBLAS library
cuBLAS Model
CPU → GPU: malloc, memcpy; compute on the GPU; GPU → CPU: memcpy, free.
cuBLAS call steps
GPU Memory Alloc
cublasAlloc( n_size, memSize, GPU_ptr);
GPU Memory upload(set)
cublasSetVector( n_size, memSize, CPU_ptr, 1, GPU_ptr, 1 );
GPU Compute
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
GPU Memory download(get)
cublasGetVector(n2, memSize , GPU_ptr, 1, CPU_ptr, 1);
nvcc mysrc.c –lcuda –lcublas
Main code in cuBLAS
h_A = (float*) malloc(n2 * sizeof(h_A[0]));
h_B = (float *) malloc(n2 * sizeof(h_B[0]));
h_C = (float *) malloc(n2 * sizeof(h_C[0]));
cublasAlloc (n2, sizeof(d_A[0]), (void**)&d_A );
cublasAlloc (n2, sizeof(d_B[0]), (void **)&d_B );
cublasAlloc (n2, sizeof(d_C[0]), (void **)&d_C );
cublasSetVector (n2, sizeof(h_A[0]), h_A, 1, d_A, 1 );
cublasSetVector (n2, sizeof(h_B[0]), h_B, 1, d_B, 1 );
cublasSetVector (n2, sizeof(h_C[0]), h_C, 1, d_C, 1 );
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N );
cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1 );
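What the cublasSgemm('n','n',...) call above computes, sketched on the CPU; the column-major layout and the 4x4 size with alpha=2, beta=0.5 are assumptions for illustration:

```python
import numpy as np

# cublasSgemm('n','n',N,N,N,alpha,d_A,N,d_B,N,beta,d_C,N) computes
# C := alpha * A @ B + beta * C on column-major (Fortran-order) matrices.
N, alpha, beta = 4, 2.0, 0.5
rng = np.random.default_rng(0)
A = np.asfortranarray(rng.standard_normal((N, N)).astype(np.float32))
B = np.asfortranarray(rng.standard_normal((N, N)).astype(np.float32))
C = np.asfortranarray(rng.standard_normal((N, N)).astype(np.float32))

ref = alpha * (A @ B) + beta * C      # the BLAS definition, vectorized

out = C.copy(order="F")               # explicit triple loop, same result
for j in range(N):                    # column-major: j is the outer (column) index
    for i in range(N):
        acc = sum(A[i, k] * B[k, j] for k in range(N))
        out[i, j] = alpha * acc + beta * C[i, j]
```

The column-major convention is why the host arrays in the slide are handed to cublasSetVector untransposed: cuBLAS inherits Fortran BLAS layout, not C row-major layout.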
cuFFT
CUDA-accelerated FFT library (FFTW-style)
Included in the latest CUDA toolkit
http://docs.nvidia.com/cuda/cufft/index.html
How to use cuFFT
#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1);
cudaFree(data2);
cuFFT Sample
#include <stdio.h>
#include <math.h>
#include "cufft.h"
int main(int argc, char *argv[])
{
cufftComplex *a_h, *a_d;
cufftHandle plan;
int N = 1024, batchSize = 10;
int i, nBytes;
double maxError;
nBytes = sizeof(cufftComplex)*N*batchSize;
a_h = (cufftComplex *)malloc(nBytes);
for (i=0; i < N*batchSize; i++)
{
a_h[i].x = sinf(i);
a_h[i].y = cosf(i);
}
cudaMalloc((void **)&a_d, nBytes);
cudaMemcpy(a_d, a_h, nBytes,
cudaMemcpyHostToDevice);
cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);
cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);
cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);
cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);
// check error - normalize
for (maxError = 0.0, i=0; i < N*batchSize; i++)
{
maxError = max(fabs(a_h[i].x/N-sinf(i)), maxError);
maxError = max(fabs(a_h[i].y/N-cosf(i)), maxError);
}
printf("Max fft error = %g\n", maxError);
cufftDestroy(plan);
free(a_h);
cudaFree(a_d);
return 0;
}
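The /N in the sample's error check exists because cuFFT, like FFTW, leaves transforms unnormalized: FORWARD followed by INVERSE returns the input scaled by N. A NumPy sketch of the same round trip (np.fft.ifft does normalize, so it is rescaled by N here to mimic CUFFT_INVERSE):

```python
import numpy as np

N, batch = 1024, 10
i = np.arange(N * batch)
a = (np.sin(i) + 1j * np.cos(i)).reshape(batch, N)   # same test signal as the sample

fwd = np.fft.fft(a, axis=1)                 # CUFFT_FORWARD (unnormalized)
inv = np.fft.ifft(fwd, axis=1) * N          # CUFFT_INVERSE is unnormalized: scale back up
max_err = np.max(np.abs(inv / N - a))       # hence the division by N in the sample's check
```

Forgetting this normalization is a classic cuFFT porting bug: results come back correct in shape but N times too large.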
Indirect debugging with cuFFT
Image Processing with pyCUDA
Image-processing CUDA kernels
[Diagram: pixels stored as R G B triples, one value per channel]
Pipeline: Original → resize, rotate, color conversion → histogram, sorting → filtering, thinning, compression → FFT, BLAS → Result
Image-processing tasks
Image file handling: many formats, such as BMP, JPG, PNG, and DICOM
Image libraries such as OpenCV and ITK
Screen output and debugging: consider Windows/Linux UI
Qt library, MITK library
Can we focus on just the CUDA kernel development?
Leave file I/O and the UI to Python,
and use CUDA for GPU acceleration:
pyCUDA!!!
PYTHON 101
http://www.python.org/
PIL: Python Image Library
NumPy: Python module for scientific computing (matrix operations)
pyCUDA: bridges Python and CUDA
Python Image Library
from __future__ import division
import numpy
import matplotlib.pyplot as plt
from PIL import Image
# read image file
img = Image.open("6x3-pixel.png")
arr = numpy.array(img)
print arr
Annotations: image library initialize; image file read; array for image processing; show plot.
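What numpy.array(img) yields for an RGB image, sketched without a file on disk (a 3x6 all-red image is assumed here in place of 6x3-pixel.png):

```python
import numpy

# PIL hands back a (height, width, channels) uint8 array for an RGB image
arr = numpy.zeros((3, 6, 3), dtype=numpy.uint8)
arr[:, :, 0] = 255                 # fill the R plane: every pixel becomes pure red

red_plane = arr[:, :, 0]           # one channel as a 2-D array
top_left = arr[0, 0]               # one pixel as an (R, G, B) triple
```

This (rows, cols, RGB) layout is what the CUDA kernels in the following slides index into, one thread per pixel.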
pyCUDA import
########### for image processing ###############
from __future__ import division
import numpy
import matplotlib.pyplot as plt
from PIL import Image
from scipy import misc
########### for pyCUDA ##############
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy.linalg as la
from pycuda.compiler import SourceModule
UI part
########### for image processing ###############
from __future__ import division
import numpy
import matplotlib.pyplot as plt
from PIL import Image
from scipy import misc
########### for pyCUDA ##############
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy.linalg as la
from pycuda.compiler import SourceModule
########### read image file ##############
img = Image.open("Fisheye-Nikkor 10.5mm-sample4-building.jpg")
# convert image to numpy array
arr = numpy.array(img)
# upload to GPU
# TODO kernel
# download to CPU
# result value
tmp = numpy.empty_like(arr)
tmp = arr
# plot the numpy array
plt.subplot(121)
plt.title("original")
plt.imshow(arr)
plt.subplot(122)
plt.title("defished")
plt.imshow(tmp)
plt.show()
pyCUDA Template
# pyCUDA initialize
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
import numpy.linalg as la
from pycuda.compiler import SourceModule

# define CUDA kernel
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
# define a Python function from the kernel
multiply_them = mod.get_function("multiply_them")

# init on CPU memory (400 elements, to match the block size below)
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

# launch kernel: drv.In/drv.Out define the cudaMemcpy direction,
# block defines the job index
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1))

print(a*b)
print(dest)
print(dest-a*b)
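As a sanity check, the kernel here is just an elementwise multiply; a CPU-only NumPy sketch (no GPU required) of the same computation that the final `print(dest-a*b)` verifies:

```python
import numpy as np

# CPU reference for the multiply_them kernel: elementwise product.
# dest - a*b should be zero everywhere, which is what the final
# print in the pyCUDA template checks.
a = np.random.randn(400).astype(np.float32)
b = np.random.randn(400).astype(np.float32)
dest = a * b  # what the GPU kernel computes, one element per thread
diff = dest - a * b
```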
CUDA Kernel Launch
bright = mod.get_function("bright")
# read image file
img = Image.open("papua.jpg")
# convert image to numpy array
arr = numpy.array(img)
tmp = numpy.empty_like(arr)
bright(
    drv.In(arr), drv.Out(tmp),
    grid=(40,60,1), block=(40,20,1)
)
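A minimal sketch (plain Python, no GPU needed) of why `grid=(40,60,1)` with `block=(40,20,1)` covers the 1600x1200 image used here, assuming one thread per pixel:

```python
# grid * block must cover the image: 40*40 = 1600 threads in x,
# 60*20 = 1200 threads in y, i.e. one thread per pixel.
width, height = 1600, 1200
block = (40, 20, 1)
grid = (width // block[0], height // block[1], 1)
```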
Writing the CUDA kernel
mod = SourceModule("""
__global__ void bright(unsigned char *in, unsigned char *out){
    // image is 1600x1200: grid=(40,60,1), block=(40,20,1)
    int ix = blockIdx.x * blockDim.x + threadIdx.x;  // 0..1599
    int iy = blockIdx.y * blockDim.y + threadIdx.y;  // 0..1199
    // byte offsets into the flattened (height, width, 3) array
    int px = ix * 3;
    int py = iy * 3;
    int pixel_R = (py * 1600) + px + 0;
    int pixel_G = (py * 1600) + px + 1;
    int pixel_B = (py * 1600) + px + 2;
    int R, G, B;
    R = in[pixel_R];
    G = in[pixel_G];
    B = in[pixel_B];
    // halve each channel to darken the image
    out[pixel_R] = R/2;
    out[pixel_G] = G/2;
    out[pixel_B] = B/2;
    return;
}
""")
Consider the relationship between the NumPy array layout and the CUDA job index.
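The kernel's flattened index arithmetic can be checked against NumPy's own memory layout with a small host-side sketch (the width 1600 comes from the kernel; the `ix`, `iy` values are illustrative):

```python
import numpy as np

# An RGB image of shape (H, W, 3) stores pixel (iy, ix) channel c at
# flat offset iy*W*3 + ix*3 + c -- the same arithmetic as pixel_R/G/B
# in the kernel (py*1600 + px + c, with px = ix*3 and py = iy*3).
H, W = 1200, 1600
arr = np.arange(H * W * 3, dtype=np.int64).reshape(H, W, 3)
flat = arr.ravel()

ix, iy = 7, 5                  # an arbitrary pixel
px, py = ix * 3, iy * 3
pixel_R = (py * 1600) + px + 0
pixel_G = (py * 1600) + px + 1
pixel_B = (py * 1600) + px + 2
```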
NumPy array data structure (PNG image example)
arr = np.array(img)
[
[[ 0 0 0] [ 0 0 0] [255 255 255] [ 0 0 0] [ 0 0 0] [128 0 255]]
[[ 0 0 0] [ 0 0 0] [ 0 0 0] [ 0 0 0] [ 0 0 0] [237 28 36]]
[[ 0 0 0] [ 0 0 0] [ 0 0 0] [ 0 0 0] [ 0 0 0] [136 0 21]]
]
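The printout shown here can be reconstructed directly as a NumPy array; the pixel values are taken from the example, and the shape convention is (height, width, channels):

```python
import numpy as np

# Rebuild the 6x3-pixel PNG example: mostly black, with a few
# colored pixels in the positions shown in the printout.
arr = np.zeros((3, 6, 3), dtype=np.uint8)
arr[0, 2] = [255, 255, 255]   # white pixel, row 0
arr[0, 5] = [128, 0, 255]
arr[1, 5] = [237, 28, 36]
arr[2, 5] = [136, 0, 21]
```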
Plot output
out[pixel_R]=R/2;
out[pixel_G]=G/2;
out[pixel_B]=B/2;
# plot the numpy array
plt.subplot(121)
plt.title("original")
plt.imshow(arr)
plt.subplot(122)
plt.title("result")
plt.imshow(tmp)
plt.show()