The State of OpenACC - GTC On-Demand Featured Talks

Transcript of slides, …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdf

Page 1

The State of OpenACC

Akira Naruse

NVIDIA Developer Technologies

Page 2

Ways to accelerate an application on GPUs

Application

Library: switch to GPU-enabled libraries; easy to get started

CUDA: write the key routines in CUDA; high flexibility

OpenACC: insert directives into existing code; easy acceleration

Page 3

OPENACC

Program myscience

... serial code ...

!$acc kernels

do k = 1,n1

do i = 1,n2

... parallel code ...

enddo

enddo

!$acc end kernels

... serial code …

End Program myscience

CPU

GPU

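Existing C/Fortran code, with hints added

Simple: add compiler hints to existing code

Powerful: with modest effort, the compiler parallelizes the code automatically

Open: multiple compiler vendors support multiple accelerators

NVIDIA, AMD, Intel (planned)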

Page 4

Execution model

Application code

GPU

CPU

Parallel parts: GPU code is generated (the compute-heavy parts)

Sequential parts: CPU code is generated

$acc parallel

$acc end parallel

Page 5

void saxpy(int n,

float a,

float *x,

float *restrict y)

{

#pragma acc parallel copy(y[:n]) copyin(x[:n])

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

}

...

saxpy(N, 3.0, x, y);

...

SAXPY (Y=A*X+Y, C/C++)

OpenMP OpenACC

void saxpy(int n,

float a,

float *x,

float *restrict y)

{

#pragma omp parallel for

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

}

...

saxpy(N, 3.0, x, y);

...

Page 6

subroutine saxpy(n, a, X, Y)

real :: a, X(:), Y(:)

integer :: n, i

!$acc parallel copy(Y(:)) copyin(X(:))

do i=1,n

Y(i) = a*X(i)+Y(i)

enddo

!$acc end parallel

end subroutine saxpy

...

call saxpy(N, 3.0, x, y)

...

SAXPY (Y=A*X+Y, FORTRAN)

OpenMP OpenACC

subroutine saxpy(n, a, X, Y)

real :: a, X(:), Y(:)

integer :: n, i

!$omp parallel do

do i=1,n

Y(i) = a*X(i)+Y(i)

enddo

!$omp end parallel do

end subroutine saxpy

...

call saxpy(N, 3.0, x, y)

...

Page 7

Combining with OpenMP

OpenMP / OpenACC

void saxpy(int n, float a,

float *x,

float *restrict y)

{

#pragma acc parallel copy(y[:n]) copyin(x[:n])

#pragma omp parallel for

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

}

...

saxpy(N, 3.0, x, y);

...

Page 8

Easy to compile

OpenMP / OpenACC

void saxpy(int n, float a,

float *x,

float *restrict y)

{

#pragma acc parallel copy(y[:n]) copyin(x[:n])

#pragma omp parallel for

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

}

...

saxpy(N, 3.0, x, y);

...

$ pgcc -Minfo -acc saxpy.c

saxpy:

16, Generating present_or_copy(y[:n])

Generating present_or_copyin(x[:n])

Generating Tesla code

19, Loop is parallelizable

Accelerator kernel generated

19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Page 9

Easy to run

OpenMP / OpenACC

void saxpy(int n, float a,

float *x,

float *restrict y)

{

#pragma acc kernels copy(y[:n]) copyin(x[:n])

#pragma omp parallel for

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

}

...

saxpy(N, 3.0, x, y);

...

$ pgcc -Minfo -acc saxpy.c

saxpy:

16, Generating present_or_copy(y[:n])

Generating present_or_copyin(x[:n])

Generating Tesla code

19, Loop is parallelizable

Accelerator kernel generated

19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

$ nvprof ./a.out

==10302== NVPROF is profiling process 10302, command: ./a.out

==10302== Profiling application: ./a.out

==10302== Profiling result:

Time(%) Time Calls Avg Min Max Name

62.95% 3.0358ms 2 1.5179ms 1.5172ms 1.5186ms [CUDA memcpy HtoD]

31.48% 1.5181ms 1 1.5181ms 1.5181ms 1.5181ms [CUDA memcpy DtoH]

5.56% 268.31us 1 268.31us 268.31us 268.31us saxpy_19_gpu
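The profile above can be summarized with a little arithmetic: almost all of the measured GPU time goes to host/device copies rather than to the saxpy kernel. The following is a minimal sketch of that calculation; the function names are hypothetical, and the only inputs are the three times printed by nvprof on this slide.

```c
#include <stdio.h>

/* Fraction of the profiled GPU time spent in host<->device copies.
   All arguments are times in milliseconds. */
double transfer_fraction(double htod_ms, double dtoh_ms, double kernel_ms)
{
    double transfer_ms = htod_ms + dtoh_ms;
    return transfer_ms / (transfer_ms + kernel_ms);
}

/* Prints the copy share for the saxpy profile quoted on the slide. */
void print_saxpy_profile_share(void)
{
    /* Values copied from the nvprof table: 3.0358 ms HtoD,
       1.5181 ms DtoH, 268.31 us for the saxpy_19_gpu kernel. */
    double f = transfer_fraction(3.0358, 1.5181, 0.26831);
    printf("copies: %.1f%% of profiled GPU time\n", 100.0 * f);
}
```

The result is roughly 94%, consistent with the sum of the two memcpy rows in the Time(%) column.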

Page 10

Fast with little effort

Automotive: Real-Time Object Detection, Global Manufacturer of Navigation Systems: 5x in 40 hours

Financial: Valuation of Stock Portfolios using Monte Carlo, Global Technology Consulting Company: 2x in 4 hours

Life Science: Interaction of Solvents and Biomolecules, University of Texas at San Antonio: 5x in 8 hours

Page 11

Compilers and tools

Timeline: from October 2013, from December 2013, from January 2014, 2015 (planned)

Compilers

Debugging tools

OpenACC 2.0 support

Page 12

SPEC ACCEL

15 OpenACC benchmarks

www.spec.org/accel

Page 13

NCAR-CISL, ORNL / CESM • CAM-SE (HOMME)

• LANL / POP

NASA / GEOS-5 NOAA-GFDL / CFSv2 • NOAA-GFDL / MOM6

UKMO / HadGEM3 • UM

• NEMO

MPI-M / MPI-ESM • ECHAM6

• MPIOM

RIKEN, UniTokyo / NICAM IPSL / DYNAMICO

UKMO / UM ECMWF / IFS DWD / GME NOAA-NCEP / GFS EC, CMC / GEM USNRL / NAVGEM NOAA-ESRL / FIM DWD, MPI-M / ICON NOAA-ESRL / NIM NCAR / MPAS-A

LANL / POP NOAA-GFDL / MOM6 CNRS, STFC/ NEMO USNRL / HYCOM MIT / MITgcm LANL / MPAS-O MPI-M / ICON-OCE

NCAR-M3 / WRF USNRL / COAMPS DWD, MCH / COSMO MFR / AROME MFR, ICHEC / HARMONIE • HIRLAM + ALADIN

JAMSTEC-JMA / ASUCA CAS-CMA / GRAPES UniMiami / OLAM

NCAR-M3 / WRF DWD, MCH / COSMO UniMiami / OLAM

Rutgers-UCLA / ROMS UNC-ND / ADCIRC

MPAS-O

MPAS-A or NIM

MPAS-A or NIM

ICON-ATM

NIM

GungHo

PantaRhei

MPAS-O

NIM?

ICON-OCE

GPU Development (8) CAM-SE, GEOS-5, NEMO, WRF, COSMO, NIM, FIM, GRAPES

GPU Evaluation (15) POP, ICON, NICAM, OLAM, GungHo, PantaRhei, ASUCA,

HARMONIE, COAMPS, HYCOM, MITgcm, ROMS, ADCIRC,

DYNAMICO, MOM6

GPU Not Started (7) MPAS-A, MPAS-O, GFS, GEM, NAVGEM, AROME, ICON-OCE

Indicates Next-Gen Model

ICON

GungHo

Climate (C), Weather (W), Ocean (O)

Climate, weather, and ocean models

Page 14

Migration to OpenACC

Model | Focus | GPU Approach | Collaboration
NCAR / WRF | NWP/Climate-R | (1) OpenACC, (2) CUDA | (1) NCAR-MMM, (2) SSEC UW-M
DWD / COSMO | NWP/Climate-R | CUDA+OpenACC | CSCS, MeteoSwiss (MCH)
ORNL / CAM-SE | Climate-G | CUDA-F OpenACC | ORNL, Cray
NCAR / CAM-SE | Climate-G | CUDA, CUDA-F, OpenACC | NCAR-CISL
NOAA / NIM&FIM | NWP/Climate-G | F2C-ACC, OpenACC | NOAA-ESRL, PGI
NASA / GEOS-5 | Climate-G | CUDA-F OpenACC | NASA, PGI
CNRS / NEMO | Ocean GCM | OpenACC | STFC
UKMO / GungHo | NWP/Climate-G | OpenACC | STFC, UKMO in future?
USNRL / HYCOM | Ocean GCM | OpenACC | US Naval Research Lab
RIKEN / NICAM | Climate-G | OpenACC | RIKEN, UniTokyo
UNC / ADCIRC | Storm Surge | OpenACC (AmgX?) | LSU LONI
NOAA / MOM6 | Ocean GCM | OpenACC | NOAA-GFDL
NASA / FV-Core | Atmospheric GCM | OpenACC | NASA, NOAA-GFDL

Other Evaluations: US: COAMPS, MPAS, ROMS, OLAM; Europe: ICON, IFS, HARMONIE, DYNAMICO; Asia-Pacific: ASUCA (JP), GRAPES (CN)

Page 15

How far can you go with OpenACC?

Page 16

Example: Jacobi iteration

while ( error > tol ) {

error = 0.0;

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

}

A(i,j) A(i+1,j) A(i-1,j)

A(i,j-1)

A(i,j+1)

Page 17

Parallel regions (kernels construct)

parallel and kernels: both mark a parallel region

parallel: starts parallel execution

kernels: may produce multiple kernels

while ( error > tol ) {

error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

}

Page 18

Parallel regions (kernels construct)

parallel and kernels: both mark a parallel region

parallel: starts parallel execution

kernels: may produce multiple GPU kernels

while ( error > tol ) {

error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

}

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:

60, Loop carried scalar dependence for 'error' at line 64

...

Accelerator scalar kernel generated

61, Loop carried scalar dependence for 'error' at line 64

...

Accelerator scalar kernel generated

Page 19

Reduction (reduction clause)

Operators

+ sum

* product

max maximum

min minimum

| bitwise OR

& bitwise AND

^ bitwise XOR

|| logical OR

&& logical AND

while ( error > tol ) {

error = 0.0;

#pragma acc kernels

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

}

Page 20

Reduction (reduction clause)

Operators

+ sum

* product

max maximum

min minimum

| bitwise OR

& bitwise AND

^ bitwise XOR

|| logical OR

&& logical AND

while ( error > tol ) {

error = 0.0;

#pragma acc kernels

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

}

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:

59, Generating present_or_copyout(Anew[1:4094][1:4094])

Generating present_or_copyin(A[:][:])

Generating Tesla code

61, Loop is parallelizable

63, Loop is parallelizable

Accelerator kernel generated

61, #pragma acc loop gang /* blockIdx.y */

63, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Max reduction generated for error
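As a smaller illustration of the reduction clause, the sketch below applies two of the listed operators, max and +, to one-dimensional loops. The function names are hypothetical and not from the slides. It uses `parallel loop` rather than `kernels` for brevity, and writes the max as explicit comparisons so that no math library is needed. A compiler without OpenACC support should ignore the pragmas and run the loops serially.

```c
/* max reduction: largest absolute difference between two arrays. */
float max_abs_diff(const float *a, const float *b, int n)
{
    float err = 0.0f;
#pragma acc parallel loop reduction(max:err) copyin(a[0:n], b[0:n])
    for (int i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;      /* absolute value */
        if (d > err) err = d;      /* running maximum */
    }
    return err;
}

/* + reduction: dot product of two arrays. */
float dot(const float *x, const float *y, int n)
{
    float sum = 0.0f;
#pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

In both functions each iteration updates a private copy of the scalar, and the runtime combines the copies with the named operator at the end of the loop.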

Page 21

Specifying data transfers (data clauses)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

}

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:

59, Generating present_or_copyout(Anew[1:4094][1:4094])

Generating present_or_copyin(A[:][:])

Generating Tesla code

61, Loop is parallelizable

63, Loop is parallelizable

Accelerator kernel generated

61, #pragma acc loop gang /* blockIdx.y */

63, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Max reduction generated for error

Page 22

Specifying data transfers (data clauses)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels \

pcopyout(Anew[1:N-2][1:M-2]) pcopyin(A[0:N][0:M])

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

#pragma acc kernels \

pcopyout(A[1:N-2][1:M-2]) pcopyin(Anew[1:N-2][1:M-2])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

}

copyin (Host → GPU)

copyout (GPU → Host)

copy

create

present

pcopyin

pcopyout

pcopy

pcreate

Page 23

Specifying data transfers (data clauses)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels \

pcopy(Anew[:][:]) pcopyin(A[:][:])

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

#pragma acc kernels \

pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

}

copyin (Host → GPU)

copyout (GPU → Host)

copy

create

present

pcopyin

pcopyout

pcopy

pcreate

Page 24

Data transfer is the bottleneck (NVVP)

[NVVP timeline: 1 cycle contains two GPU kernel runs; GPU utilization: low]

Page 25

Excessive data transfers

while ( error > tol ) {

error = 0.0;

#pragma acc kernels \

pcopy(Anew[:][:]) \

pcopyin(A[:][:])

{

}

#pragma acc kernels \

pcopy(A[:][:]) \

pcopyin(Anew[:][:])

{

}

}

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

Host GPU

copyin

copyin

copyout

copyout

Page 26

Data regions (data construct)

copyin (CPU → GPU)

copyout (GPU → CPU)

copy

create

present

pcopyin

pcopyout

pcopy

pcreate

#pragma acc data pcopy(A) create(Anew)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

#pragma acc kernels pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

}
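To make the effect of the data construct concrete, here is a self-contained sketch of the same Jacobi loop. It is not the speaker's code: the flat row-major storage, the `IDX` macro, the function name `jacobi_solve`, and the `max_iter` guard are assumptions added for illustration. The plain `copy` and `create` clauses stand in for the slide's `pcopy` spellings. Without OpenACC support the pragmas are ignored and the function runs serially.

```c
/* Index into an n x m grid stored row by row (hypothetical helper). */
#define IDX(j, i, m) ((j) * (m) + (i))

/* Runs Jacobi sweeps until the largest change is at most tol or
   max_iter sweeps have been done. Returns the number of sweeps. */
int jacobi_solve(float *restrict a, float *restrict anew,
                 int n, int m, float tol, int max_iter)
{
    float error = tol + 1.0f;
    int iter = 0;

    /* a is copied to the device once and back once;
       anew only needs device storage. */
#pragma acc data copy(a[0:n*m]) create(anew[0:n*m])
    while (error > tol && iter < max_iter) {
        error = 0.0f;

#pragma acc kernels present(a[0:n*m], anew[0:n*m])
#pragma acc loop independent reduction(max:error)
        for (int j = 1; j < n - 1; j++) {
#pragma acc loop independent reduction(max:error)
            for (int i = 1; i < m - 1; i++) {
                float v = 0.25f * (a[IDX(j, i + 1, m)] + a[IDX(j, i - 1, m)] +
                                   a[IDX(j - 1, i, m)] + a[IDX(j + 1, i, m)]);
                float d = v - a[IDX(j, i, m)];
                if (d < 0.0f) d = -d;
                if (d > error) error = d;
                anew[IDX(j, i, m)] = v;
            }
        }

        /* Copy the interior back; both arrays are already on the device. */
#pragma acc kernels present(a[0:n*m], anew[0:n*m])
#pragma acc loop independent
        for (int j = 1; j < n - 1; j++) {
#pragma acc loop independent
            for (int i = 1; i < m - 1; i++)
                a[IDX(j, i, m)] = anew[IDX(j, i, m)];
        }
        iter++;
    }
    return iter;
}
```

Because the arrays stay resident for the whole `while` loop, only the scalar `error` has to cross the bus on each iteration.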

Page 27

Appropriate data transfers

#pragma acc data \

pcopy(A) create(Anew)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels \

pcopy(Anew[:][:]) \

pcopyin(A[:][:])

{

}

#pragma acc kernels \

pcopy(A[:][:]) \

pcopyin(Anew[:][:])

{

}

}

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

}

}

copyin

copyout

Host GPU

Page 28

Reduced data transfers (NVVP)

[NVVP timeline: 1 cycle; GPU utilization: high]

Page 29

Two kinds of work

Data transfer

Compute offload

Both compute offload and data transfer have to be considered

GPU Memory CPU Memory

PCI

Page 30

Kernel tuning

Page 31

Kernel tuning (loop construct)

#pragma acc data pcopy(A) create(Anew)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

...

}

Page 32

Kernel tuning (loop construct)

Gang

Worker

Vector … SIMD width

Independent

Collapse

Seq

...

#pragma acc data pcopy(A) create(Anew)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

...

}

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:

59, Generating present_or_copyout(Anew[1:4094][1:4094])

Generating present_or_copyin(A[:][:])

Generating Tesla code

61, Loop is parallelizable

63, Loop is parallelizable

Accelerator kernel generated

61, #pragma acc loop gang /* blockIdx.y */

63, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Max reduction generated for error

Page 33

Kernel tuning (loop construct)

Gang

Worker

Vector … SIMD width

Collapse

Independent

Seq

Cache

Tile

#pragma acc data pcopy(A) create(Anew)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])

#pragma acc loop reduction(max:error) gang vector(1)

for (int j = 1; j < N-1; j++) {

#pragma acc loop reduction(max:error) gang vector(128)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

...

}

Page 34

Execution configuration (vector clause)

#pragma acc loop gang vector(4)
for (j = 0; j < 16; j++) {
#pragma acc loop gang vector(16)
  for (i = 0; i < 16; i++) {
    ...

[Figure: the 16 x 16 iteration space (axes i and j) covered by four 4 x 16 blocks stacked along j]

#pragma acc loop gang vector(8)
for (j = 0; j < 16; j++) {
#pragma acc loop gang vector(8)
  for (i = 0; i < 16; i++) {
    ...

[Figure: the 16 x 16 iteration space (axes i and j) covered by a 2 x 2 arrangement of 8 x 8 blocks]
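The block counts in the two figures follow from a rounded-up division of each loop extent by its vector length. The sketch below shows that arithmetic. The function names are hypothetical, and it assumes each `vector(...)` length becomes one dimension of a thread block, which is how the compiler messages on the earlier slides map loops to blockIdx and threadIdx.

```c
/* Number of blocks needed to cover `extent` iterations when each block
   handles `block` iterations (rounded up). */
int blocks_needed(int extent, int block)
{
    return (extent + block - 1) / block;
}

/* Total blocks for a 2D loop nest with nj x ni iterations, using
   vector length vj on the outer loop and vi on the inner loop. */
int grid_blocks(int nj, int ni, int vj, int vi)
{
    return blocks_needed(nj, vj) * blocks_needed(ni, vi);
}
```

For the 16 x 16 nest, `grid_blocks(16, 16, 4, 16)` and `grid_blocks(16, 16, 8, 8)` both give four blocks; only the block shape differs, which changes the memory access pattern within each block.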

Page 35

Kernel tuning (loop construct)

Gang

Worker

Vector … SIMD width

Collapse

Independent

Seq

Cache

Tile

...

#pragma acc data pcopy(A) create(Anew)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])

#pragma acc loop reduction(max:error) \

collapse(2) gang vector(128)

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

...

}

Page 36

Kernel tuning (loop construct)

Gang

Worker

Vector … SIMD width

Collapse

Independent

Seq

Cache

Tile

...

#pragma acc data pcopy(A) create(Anew)

while ( error > tol ) {

error = 0.0;

#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])

#pragma acc loop reduction(max:error) independent

for (int jj = 1; jj < NN-1; jj++) {

int j = list_j[jj];

#pragma acc loop reduction(max:error)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

}

}

...

}

Page 37

Kernel tuning (loop construct)

Gang

Worker

Vector … SIMD width

Collapse

Independent

Seq

Cache

Tile

...

#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])

#pragma acc loop seq

for (int k = 3; k < NK-3; k++) {

#pragma acc loop

for (int j = 0; j < NJ; j++) {

#pragma acc loop

for (int i = 0; i < NI; i++) {

Anew[k][j][i] = func(

A[k-1][j][i], A[k-2][j][i], A[k-3][j][i],

A[k+1][j][i], A[k+2][j][i], A[k+3][j][i], ...

);

}

}

}

Page 38

Can it be combined easily with MPI?

Page 39

MPI parallelization (halo exchange)

Block decomposition

Each process owns one block

Data at the boundaries (halo) is exchanged

A(i,j) A(i+1,j) A(i-1,j)

A(i,j-1)

A(i,j+1)

Page 40

MPI JACOBI ITERATION

#pragma acc data pcopy(A) create(Anew)

while ( error > tol ) {

#pragma acc kernels pcopy(Anew) pcopyin(A)

calc_new_A( Anew, A, ... );

#pragma acc kernels pcopy(A) pcopyin(Anew)

update_A( A, Anew );

}

Page 41

MPI JACOBI ITERATION

#pragma acc data pcopy(A) create(Anew)

while ( error > tol ) {

pack_data_at_boundary( send_buf, A, ... );

exchange_data_by_MPI( recv_buf, send_buf, ... );

unpack_data_to_halo( A, recv_buf, ... );

#pragma acc kernels pcopy(Anew) pcopyin(A)

calc_new_A( Anew, A, ... );

#pragma acc kernels pcopy(A) pcopyin(Anew)

update_A( A, Anew );

}

1. Pack the data to send

2. Exchange the data

3. Unpack the received data

GPU

GPU

MPI

Page 42

MPI JACOBI ITERATION

#pragma acc data pcopy(A) create(Anew)

while ( error > tol ) {

#pragma acc kernels pcopyin(A) pcopyout(send_buf)

pack_data_at_boundary( send_buf, A, ... );

exchange_data_by_MPI( recv_buf, send_buf, ... );

#pragma acc kernels pcopy(A) pcopyin(recv_buf)

unpack_data_to_halo( A, recv_buf, ... );

#pragma acc kernels pcopy(Anew) pcopyin(A)

calc_new_A( Anew, A, ... );

#pragma acc kernels pcopy(A) pcopyin(Anew)

update_A( A, Anew );

}

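1. Pack the data into the send buffer on the GPU, then transfer it to the host

2. Exchange data with the neighboring processes

3. Transfer to the GPU, then unpack the receive buffer on the GPU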

GPU

GPU

MPI
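The slide elides the bodies of the pack and unpack routines. The sketch below shows one plausible shape for them under stated assumptions: a block of `rows` interior rows plus one halo row above and below, stored row by row, and a decomposition along rows only. The names `pack_rows` and `unpack_rows` and this layout are hypothetical, not taken from the talk. The MPI exchange itself is left out so the example needs no MPI library.

```c
/* Block layout (assumed): (rows + 2) x cols floats, row 0 and row rows+1
   are halo rows, rows 1..rows are owned by this process. */

/* Copy the first and last owned rows into send_buf (2 * cols floats). */
void pack_rows(float *restrict send_buf, const float *restrict a,
               int rows, int cols)
{
#pragma acc kernels copyin(a[0:(rows+2)*cols]) copyout(send_buf[0:2*cols])
#pragma acc loop independent
    for (int i = 0; i < cols; i++) {
        send_buf[i]        = a[1 * cols + i];     /* first owned row */
        send_buf[cols + i] = a[rows * cols + i];  /* last owned row  */
    }
}

/* Copy received rows from recv_buf (2 * cols floats) into the halo rows. */
void unpack_rows(float *restrict a, const float *restrict recv_buf,
                 int rows, int cols)
{
#pragma acc kernels copy(a[0:(rows+2)*cols]) copyin(recv_buf[0:2*cols])
#pragma acc loop independent
    for (int i = 0; i < cols; i++) {
        a[0 * cols + i]          = recv_buf[i];         /* upper halo */
        a[(rows + 1) * cols + i] = recv_buf[cols + i];  /* lower halo */
    }
}
```

When these routines are called inside an enclosing data region that already holds `a`, the data clauses find the array present on the device, so only the small buffers move between host and GPU.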

Page 43

MPI Jacobi iteration (NVVP)

[NVVP timeline, 1 cycle: data packing (Pack), MPI exchange (MPI), data unpacking (Upck)]

Page 44

Overlap (async/wait clauses)

while ( error > tol ) {

#pragma acc kernels pcopyin(A) pcopyout(send_buf)

pack_data_at_boundary( send_buf, A, ... );

exchange_data_by_MPI( recv_buf, send_buf, ... );

#pragma acc kernels pcopy(A) pcopyin(recv_buf)

unpack_data_to_halo( A, recv_buf, ... );

#pragma acc kernels pcopy(Anew) pcopyin(A)

calc_new_A( Anew, A, ... );

#pragma acc kernels pcopy(A) pcopyin(Anew)

update_A( A, Anew );

}

Page 45

Overlap (async/wait clauses)

while ( error > tol ) {

#pragma acc kernels pcopyin(A) pcopyout(send_buf)

pack_data_at_boundary( send_buf, A, ... );

#pragma acc kernels pcopy(Anew) pcopyin(A)

calc_new_A_inside( Anew, A, ... );

exchange_data_by_MPI( recv_buf, send_buf, ... );

#pragma acc kernels pcopy(A) pcopyin(recv_buf)

unpack_data_to_halo( A, recv_buf, ... );

#pragma acc kernels pcopy(Anew) pcopyin(A)

calc_new_A_at_boundary( Anew, A, ... );

#pragma acc kernels pcopy(A) pcopyin(Anew)

update_A( A, Anew );

}

Interior

Boundary

Page 46

Overlap (async/wait clauses)

while ( error > tol ) {

#pragma acc kernels pcopyin(A) pcopyout(send_buf) async(2)

pack_data_at_boundary( send_buf, A, ... );

#pragma acc kernels pcopy(Anew) pcopyin(A) async(1)

calc_new_A_inside( Anew, A, ... );

#pragma acc wait(2)

exchange_data_by_MPI( recv_buf, send_buf, ... );

#pragma acc kernels pcopy(A) pcopyin(recv_buf) async(2)

unpack_data_to_halo( A, recv_buf, ... );

#pragma acc kernels pcopy(Anew) pcopyin(A) async(2)

calc_new_A_at_boundary( Anew, A, ... );

#pragma acc kernels pcopy(A) pcopyin(Anew) wait(1,2)

update_A( A, Anew );

}
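The queue mechanics used above can be seen in isolation in a much smaller example. The sketch below is an illustration only, with a hypothetical function name: two independent loops are issued on async queues 1 and 2, and a `wait` directive joins both before the data region ends. A compiler without OpenACC support ignores the pragmas and runs the loops one after the other, with the same result.

```c
/* Doubles every element of a and adds one to every element of b.
   The two updates are independent, so they may run concurrently. */
void update_two_arrays(float *restrict a, float *restrict b, int n)
{
#pragma acc data copy(a[0:n], b[0:n])
    {
        /* Queue 1: returns to the host without waiting for completion. */
#pragma acc kernels present(a[0:n]) async(1)
#pragma acc loop independent
        for (int i = 0; i < n; i++)
            a[i] = 2.0f * a[i];

        /* Queue 2: independent of queue 1. */
#pragma acc kernels present(b[0:n]) async(2)
#pragma acc loop independent
        for (int i = 0; i < n; i++)
            b[i] = b[i] + 1.0f;

        /* Block the host until both queues have finished, so the
           copy-out at the end of the data region sees final values. */
#pragma acc wait(1, 2)
    }
}
```

Work placed on the same queue runs in issue order, which is why the slide can put the unpack and the boundary update both on queue 2 without an explicit wait between them.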

Page 47

Overlap (NVVP)

[NVVP timeline, 1 cycle, with overlap: MPI, Pack, Upck]

Page 48

Is OpenACC actually being used?

Page 49

NICAM

Weather and climate model by RIKEN AICS / University of Tokyo

- Huge code base (hundreds of thousands of lines)

- No hot spots (the Pareto principle does not hold)

Two kinds of computation with different characteristics

- Dynamics: memory-bandwidth bound

- Physics: compute bound

Page 50

NICAM: dynamics (NICAM-DC)

GPU port with OpenACC

- All major subroutines (more than 50) run on the GPU

- MPI supported

- 2 weeks

Good scalability

- Tsubame 2.5, up to 2560 GPUs

- Scaling factor: 0.8

[Figure: weak scaling, log-log plot of Performance (GFLOPS) against Number of CPUs or GPUs, for Tsubame 2.5 (GPU: K20X), K computer, and Tsubame 2.5 (CPU: WSM)]

Page 51

NICAM: dynamics (NICAM-DC)

[Figure: log-log plot of Measured Performance (GFLOPS) against Aggregate Peak Memory Bandwidth (GB/s), for Tsubame 2.5 (GPU: K20X), K computer, and Tsubame 2.5 (CPU: WSM)]

Page 52

NICAM: physics (SCALE-LES)

Atmospheric radiation transfer

- The most expensive computation in the physics part

- GPU port with OpenACC is complete

[Bar chart: speedup vs. 1 CPU core]

Xeon E5-2690v2 (3.0GHz, 10-core): 1 core 1.00, 2 cores 1.99, 4 cores 3.88, 10 cores 8.51

Tesla K40: 1 GPU 37.8, 2 GPUs 76.0, 4 GPUs 151

(*) Includes PCI data transfer time; grid size: 1256x32x32

Page 53

SEISM3D

Earthquake simulation by the Earthquake Research Institute, University of Tokyo (Prof. Furumura)

GPU port of the major subroutines is complete

- Memory-bandwidth bound, 3D model (2D decomposition), communication between neighboring processes

[Bar chart: SEISM3D (480x480x1024, 1K steps), Time (sec)]

K: 8x SPARC64VIIIfx: 605

CPU: 8x Xeon E5-2690v2: 459

GPU: 8x Tesla K40: 134

3.4x speedup (whole application)

[Stacked bar chart: breakdown of GPU execution time, GPU: 8x Tesla K40, axis 0 to 140. Components: diff3d_*, update_stress, update_stress_pml, update_vel, update_vel_pml, (other subroutines), [CUDA memcpy HtoD], [CUDA memcpy DtoH], Others (CPU, MPI and so on)]

Page 54

SEISM3D

[Figure: log-log plot of Performance (M grids/sec) against Total Peak Memory Bandwidth (GB/s), for Tesla K40, SX9, FX10, K, Xeon E5-2* v2 (IVB), Xeon E5-4* (SDB), and Xeon X7* (NHL EX)]

Page 55

FFR/BCM

Next-generation CFD by RIKEN AICS / Hokkaido University (Assoc. Prof. Tsubokura)

MUSCL_bench:

- Flux computation based on the MUSCL scheme (a very complex computation)

- The main part of the CFD computation (60-70%)

- GPU port with OpenACC is complete

[Bar chart: speedup vs. 1 CPU core]

Xeon E5-2690v2 (3.0GHz, 10-core): 1 core 1.00, 2 cores 1.93, 5 cores 4.55, 10 cores 8.30

Tesla K40: 1 GPU 33.21

(*) Includes PCI data transfer time; size: 80x32x32x32

Page 56

Summary

Presented the current state of OpenACC

Simple: add directives to existing code

Powerful: GPUs can be used with little effort

Open: a growing number of adoptions

Key Enhancements in CUDA 6

Akira Naruse

NVIDIA Developer Technologies

CUDA 6

Unified Memory

XT libraries

Drop-in libraries

GPUDirect RDMA

Developer tools

Unified Memory

[Figure: the memory model as seen by the developer. Now: host memory and GPU memory. With Unified Memory: a single unified memory]

Cumbersome Memory Management

CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

GPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        char *d_data;
        data = (char *)malloc(N);
        cudaMalloc(&d_data, N);
        fread(data, 1, N, fp);
        cudaMemcpy(d_data, data, N, ..);
        qsort<<<...>>>(d_data,N,1,compare);
        cudaDeviceSynchronize();
        cudaMemcpy(data, d_data, N, ..);
        use_data(data);
        cudaFree(d_data);
        free(data);
    }

Simplified Memory Management

CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

Unified Memory (CUDA 6):

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);   // one pointer, valid on CPU and GPU
        fread(data, 1, N, fp);
        qsort<<<...>>>(data,N,1,compare);
        cudaDeviceSynchronize();
        use_data(data);
        cudaFree(data);
    }

Unified Memory Management (Future)

CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

Future?

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort<<<...>>>(data,N,1,compare);
        cudaDeviceSynchronize();
        use_data(data);
        free(data);
    }

DEEP COPY

    struct dataElem {
        int prop1;
        int prop2;
        char *text;
    };

[Figure: a dataElem (prop1, prop2, *text) and the string "Hello World" in CPU memory, shown next to GPU memory]

DEEP COPY

[Figure: the dataElem (prop1, prop2, *text) and the string "Hello World" appear in both CPU memory and GPU memory]

Two copies are required

DEEP COPY

[Figure: dataElem (prop1, prop2, *text) and "Hello World" in both CPU memory and GPU memory; the GPU-side *text must point to the GPU copy of the string]

    void launch(dataElem *elem) {
        dataElem *g_elem;
        char *g_text;
        int textlen = strlen(elem->text) + 1;   // include the terminating NUL
        cudaMalloc(&g_elem, sizeof(dataElem));
        cudaMalloc(&g_text, textlen);
        cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
        cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
        cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);
        kernel<<< ... >>>(g_elem);
    }

In practice, three copies are required
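The three copies can be mimicked on the host alone, which makes the pointer fix-up easier to see. This is a plain C sketch, not CUDA: `malloc` and `memcpy` stand in for `cudaMalloc` and `cudaMemcpy`, and `deep_copy_elem` and `free_elem_copy` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct dataElem {
    int prop1;
    int prop2;
    char *text;
};

/* Deep-copies *elem into a second, separately allocated region.
 * malloc/memcpy stand in for cudaMalloc/cudaMemcpy (an assumption made
 * for illustration). Returns NULL if an allocation fails. */
struct dataElem *deep_copy_elem(const struct dataElem *elem)
{
    assert(elem != NULL && elem->text != NULL);
    size_t textlen = strlen(elem->text) + 1;   /* include the NUL */
    struct dataElem *g_elem = malloc(sizeof *g_elem);
    char *g_text = malloc(textlen);
    if (g_elem == NULL || g_text == NULL) {
        free(g_elem);
        free(g_text);
        return NULL;
    }
    memcpy(g_elem, elem, sizeof *g_elem);           /* copy 1: the struct */
    memcpy(g_text, elem->text, textlen);            /* copy 2: the string */
    memcpy(&g_elem->text, &g_text, sizeof g_text);  /* copy 3: fix up the pointer */
    return g_elem;
}

/* Releases both allocations made by deep_copy_elem. */
void free_elem_copy(struct dataElem *g_elem)
{
    if (g_elem != NULL) {
        free(g_elem->text);
        free(g_elem);
    }
}
```

After copy 1 alone, the `text` field of the copy still holds the address of the source string. Copy 3 overwrites that field with the address of the copied string, which is the step the two-copy picture leaves out.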

DEEP COPY (Unified Memory)

[Figure: with Unified Memory, a single dataElem (prop1, prop2, *text) and "Hello World" are visible from both CPU memory and GPU memory]

    void launch(dataElem *elem) {
        kernel<<< ... >>>(elem);
    }

Linked List

[Figure: a linked list of four nodes, each with key, data, and next fields, shown against CPU memory and GPU memory]

Linked List

[Figure: a linked list of four nodes (key, data, next) between CPU memory and GPU memory. Transfer everything?]

Transfer everything, every time

- Bottlenecked by PCI bandwidth

Transfer everything the first time, then only the updated parts

- Very complex to implement

Keep the data in CPU memory and let the GPU access it over PCI

- Access over PCI is slow

Linked List (Unified Memory)

[Figure: a linked list of four nodes (key, data, next) placed in Unified Memory; both the CPU and the GPU reach it with ordinary memory accesses]

The list can be manipulated from both the CPU and the GPU

- Insertion, deletion

No explicit synchronization between CPU memory and GPU memory is needed after the list is updated

Simultaneous access from the CPU and the GPU is not allowed; mutual exclusion is required

Roadmap

CUDA 6: Ease of use

- Single pointer
- No memcpy code required
- Data structures shared with the host-side program

Next: Optimization

- Prefetching
- Data-migration hints
- Additional OS support

Pascal

- System allocator integration
- Stack memory integration
- Memory coherency accelerated in hardware

XT Libraries

XT Libraries

cuBLAS-XT and cuFFT-XT

No explicit data-transfer instructions are needed

- The library allocates the GPU memory it requires

Automatic multi-GPU support

- No multi-GPU-specific code is needed

Handles problem sizes that exceed GPU memory capacity (out-of-core)

- Overlaps kernel execution with data transfer (BLAS level 3)

CUBLAS

cuBLAS matrix multiplication code:

    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaMalloc(&d_A, ..);
    cudaMalloc(&d_B, ..);
    cudaMalloc(&d_C, ..);
    cublasSetMatrix(.., A, .., d_A, ..);   // host A to device d_A
    cublasSetMatrix(.., B, .., d_B, ..);   // host B to device d_B
    cublasDgemm(handle, .., d_A, .., d_B, .., d_C, ..);
    cublasGetMatrix(.., d_C, .., C, ..);   // device d_C to host C
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cublasDestroy(handle);

CUBLAS and CUBLAS-XT

Matrix multiplication code, cuBLAS:

    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaMalloc(&d_A, ..);
    cudaMalloc(&d_B, ..);
    cudaMalloc(&d_C, ..);
    cublasSetMatrix(.., A, .., d_A, ..);
    cublasSetMatrix(.., B, .., d_B, ..);
    cublasDgemm(handle, .., d_A, .., d_B, .., d_C, ..);
    cublasGetMatrix(.., d_C, .., C, ..);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cublasDestroy(handle);

Matrix multiplication code, cuBLAS-XT:

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    cublasXtDgemm(handle, .., A, .., B, .., C, ..);   // host pointers passed directly
    cublasXtDestroy(handle);

CUBLAS-XT API

GPUs to use

- cublasXtDeviceSelect(): number of GPUs and the IDs of the GPUs to use

Blocking size

- cublasXtSetBlockDim(): set the blocking size
- cublasXtGetBlockDim(): get the current setting

CPU/GPU hybrid execution

- cublasXtSetCpuRoutine(): set the CPU BLAS implementation
- cublasXtSetCpuRatio(): set the share of work given to the CPU

Pinned memory

- cublasXtSetPinningMemMode(): set the pinned-memory mode
- cublasXtGetPinningMemMode(): get the current setting
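To make "blocking size" concrete: a tiled matrix multiply splits the matrices into square blocks and works through one block product at a time, so each block product is a unit of work that could be handed to a device. The host-only C sketch below shows only that tiling idea. It is not how cuBLAS-XT is implemented, it uses row-major storage for readability (BLAS itself is column-major), and `blocked_dgemm` and its parameters are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for n x n row-major matrices, processed in bs x bs tiles.
 * bs plays the role of the blocking size (hypothetical parameter). */
void blocked_dgemm(size_t n, size_t bs,
                   const double *A, const double *B, double *C)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        for (size_t jj = 0; jj < n; jj += bs) {
            for (size_t kk = 0; kk < n; kk += bs) {
                /* One tile product. Edge tiles are clipped to n. */
                size_t i_end = ii + bs < n ? ii + bs : n;
                size_t j_end = jj + bs < n ? jj + bs : n;
                size_t k_end = kk + bs < n ? kk + bs : n;
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

The result does not depend on `bs`; the blocking size only changes how the work is cut up, which is why it can be exposed as a tuning knob.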

CUBLAS-XT

Supports all BLAS level 3 routines

Works even when the matrix size exceeds GPU memory capacity (out-of-core)

[Figure: cuBLAS ZGEMM Performance on 2 GPUs. GFLOPS (0 to 2500) vs. matrix size NxN (0 to 28672); series: 1 K20c, 2 K20c; regions: in-core, out-of-core]


CUBLAS-XT (NVVP)

Drop-in Libraries

Drop-in Libraries

Enable GPU use through standard library APIs

NVBLAS

- Automatically redirects BLAS level 3 function calls to cuBLAS
- No source changes are needed to use cuBLAS

How to use

- Add NVBLAS and recompile
- On Linux, it can be used just by setting LD_PRELOAD (no recompilation needed)

CPU: dgemm(.., A, .., B, .., C, ..);

NVBLAS: dgemm(.., A, .., B, .., C, ..);

NVBLAS (LINUX)

Configuration file (nvblas.conf):

    NVBLAS_LOGFILE nvblas.log
    NVBLAS_CPU_BLAS_LIB libmkl_intel_lp64.so \ libmkl_core.so \ libmkl_intel_thread.so
    NVBLAS_GPU_LIST 0 # ALL, ALL0
    NVBLAS_TILE_DIM 2048
    NVBLAS_AUTOPIN_MEM_ENABLED

Run:

    $ LD_PRELOAD=/usr/local/cuda-6.0/lib64/libnvblas.so ./a.out

NVBLAS

Applicable to applications that use BLAS level 3

- Octave, Scilab, etc.

[Figure: matrix multiplication in R. fp64 GFlops/s (0 to 3000) vs. matrix dimension (0 to 35000); series: nvBLAS, 4x K20X GPUs; MKL, 6-core Xeon E5-2667 CPU]

NVBLAS Demo


CUDA 6

Making parallel computing easy

developer.nvidia.com/cuda-toolkit

CUDA Registered Developer Program

CUDA 6.5 RC

64-bit ARM machines

Microsoft Visual Studio 2013 (VC12)

cuFFT callbacks

cuSPARSE (BSR storage format)

CUDA occupancy calculation API

CUDA FORTRAN debugging features

Application replay mode (Visual Profiler and nvprof)

nvprune utility (reduces object size)