SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop...

27
HPC Advisory Council China Workshop October 28, 2012 CENTER for MANYCORE PROGRAMMING 매니코어 프로그래밍 연구단 SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU Clusters Jaejin Lee Center for Manycore Programming School of Computer Science and Engineering Seoul National University [email protected] http://aces.snu.ac.kr

Transcript of SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop...

Page 1: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

HPC Advisory Council China Workshop October 28, 2012

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단

SnuCL: An OpenCL Framework for Heterogeneous

CPU/GPU Clusters

Jaejin LeeCenter for Manycore Programming

School of Computer Science and EngineeringSeoul National University

[email protected]://aces.snu.ac.kr

Page 2: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Heterogeneous Computing SystemsContain different types of processors

Processors: CPUs, DSPs, GPUs, FPGAs, or ASICs For extra performance and power efficiency

Heterogeneity inISAs, processing power, power consumption, memory hierarchies, micro-architectures, etc.

GPGPU systems and clusters are widening their user base

2

PCI-E

CPUcore corecore corecore core

CPUcore corecore corecore core

mem

GPU

mem

GPU

mem

GPU

mem

GPU

Main memory

Compute node

Interconnection network

Page 3: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Parallel Programming ModelsAn interface between the programmer and the parallel machine when developing an application

Languages, libraries, language extensions, compiler directives, etc.

Important to have balance between delivering high performance and ease of programming

3

High performance

Ease of

programming

Page 4: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Ease of Programming

How to handle the heterogeneity?

4

Page 5: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

OpenCL

Open Computing Language

A framework (parallel programming model) for heterogeneous parallel computing

A language, API, libraries, and a runtime system

The specification of OpenCL 1.0 was released in late 2008

Now, OpenCL 1.2

5

Page 6: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

OpenCL (contd.)

From mobile devices to supercomputers

Portable code across different architecturesCPUs, GPUs, Cell BE processors, etc.Not yet portable performance

Based on ANSI/ISO C99 standard

Supported by many vendors, such as Apple, AMD, ARM, IBM, Intel, NVIDA, SAMSUNG, etc.

6

Page 7: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Data ParallelismAlso known as loop-level parallelismPerforming the same operation to different items of data at the same timeMore data, more parallelism

7

for(i=0; i<16; i++){ c[i] = a[i] + b[i];}

+ =

a[i] b[i] c[i]+ =

+ =

+ =

+ =

+ =

+ =

+ =

+ =

+ =

+ =

+ =

+ =

+ =

+ =

+ =

+ =

+

=

+

=

+

=

+

=

+

=

+

=

+

=

+

=

+

=

+

=

+

=

+

=

a[i]

b[i]

c[i]

+

=

+

=

+

=

+

=

+

=

Page 8: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

OpenCL ApplicationThe combination of programs running on a host processor and OpenCL compute devices

Compute devices: CPUs, GPUs, etc.A host program + OpenCL programs

OpenCL program: a set of kernelsBased on ISO C99

8

main(){…cl_context = clCreateContextFromType( … );…cmd_queue = clCreateCommandQueue(…);…memobj[0] = clCreateBuffer(…);...}

Host program

OpenCL program

__kernel void mat_mul( __global const float *a,__global const float *b,__global float *c)

{ ...}

__kernel void vec_add(__global const float *a, __global const float *b, __global float *c){ ...}

OpenCL program+

Page 9: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Vector Addition Example

9

void vec_add(int n, const float *A, const float *B, float *C){ int i; for (i=0; i<n; i++) c[i] = a[i] + b[i];}

__kernel void vec_add( __global const float *A, __global const float *B, __global float *C){ int id = get_global_id(0);

C[id] = A[id] + B[id];}

main(){

float srcA[N], srcB[N], srcC[N];

// initialize srcA and srcB...vec_add(m, srcA, srcB, srcC);...}

Page 10: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

OpenCL ApplicationHost program

Executes on the host and manages kernel execution

KernelsBasic unit of executable code (a function) on compute devicesWhen executed, many instances are created

Exploits data parallelism

The host program and kernels all run in parallel

10

Host Device 0 Device 1

Kernel-A

Kernel-BKernel-C

Kernel-A

Page 11: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Limitations

Current OpenCL implementations are targeting parallelism for multiple compute devices under a single OS instance

An application for a heterogeneous CPU/GPU cluster MPI + OpenCL or MPI + CUDAComplicated, less portable, and hard to maintain

11

Page 12: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

An Illusion of a Single OS Instance

If the programmer can write applications for heterogeneous CPU/GPU clusters using only OpenCL

Easy to program and more portable

12

PCI-E

CPUcore corecore corecore core

CPUcore corecore corecore core

mem

GPU

mem

GPU

mem

GPU

mem

GPU

Main memory

Compute node

Interconnection network

Main memory

CPU

A system image running a single OS instance

...... GPUCPU CPU CPU CPU GPU GPU GPU GPU GPU

SnuCL

Page 13: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

SnuCL

An OpenCL framework [ICS’12]Freely available, open-source software developed at Seoul National UniversitySupports x86 CPUs, AMD GPUs, and NVIDIA GPUsIts source code is publicly available at http://aces.snu.ac.kr

SnuCL version 1.2 beta released June 13, 2012 (supports OpenCL 1.2)Passed most of OpenCL conformance tests

Platform layer + runtime + kernel compiler

13

Page 14: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

SnuCL (contd.)

Naturally extends the original OpenCL semantics to the heterogeneous cluster environment

Provides an illusion of a heterogeneous system running a single OS instance

With SnuCL, an OpenCL application written for a single heterogeneous system runs on a heterogeneous cluster without any modification

14

Page 15: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

The Effect of Using SnuCL

Copy buffers between different nodes in the cluster environment (Buffer A → Buffer B)

15

Previous approach (Mixture of MPI and OpenCL)

SnuCL(OpenCL only)

MPI_Init(..);MPI_Comm_rank(MPI_COMM_WORLD, &rank);…cl_mem bufferA = clCreateBuffer(…);cl_mem bufferB = clCreateBuffer(…);…void *temp = malloc(…);if (rank == SRC_DEV) { clEnqueueReadBuffer(cq, bufferA, …, temp, …); MPI_Send(temp, …, DST_DEV, …);} else if (rank == DST_DEV) { MPI_Recv(temp, …, SRC_DEV, …); clEnqueueWriteBuffer(cq, bufferB, …, temp, …);}…MPI_Finalize();

…cl_mem bufferA = clCreateBuffer(…);cl_mem bufferB = clCreateBuffer(…);…clEnqueueCopyBuffer(cq, bufferA, bufferB, …);…

Page 16: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

How to Achieve the Single System Image?

SnuCL runtime provides the illusionMapping components between the OpenCL platform and underlying hardware resources

Source-to-source kernel restructuring techniquesOpenCL C to C for CPUs

Buffer management techniquesEfficient node to node data transferConsistency management

16

OpenCL application

x86 CPUs NVIDIA GPUs

SnuCL runtime

SnuCL OpenCL-to-C

compiler

AMD GPUs

NVIDIAOpenCL runtime

AMD OpenCL runtime

Page 17: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

SnuCL Extensions to OpenCLSnuCL has extensions to OpenCL for copying buffers (e.g., memory objects)

Buffer-copy memory commands are often inefficient in the cluster environment depending on the access patternSimilar to MPI collective communication operations

17

SnuCL MPI EquivalentclEnqueueAlltoAllBufferclEnqueueBroadcastBufferclEnqueueScatterBufferclEnqueueGatherBufferclEnqueueAllGatherBufferclEnqueueReduceBufferclEnqueueAllReduceBufferclEnqueueReduceScatterBufferclEnqueueScanBuffer

MPI_AlltoallMPI_BcastMPI_ScatterMPI_GatherMPI_AllgatherMPI_ReduceMPI_AllreduceMPI_Reduce_scatterMPI_Scan

Page 18: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Matrix Multiplication Performance

18

Page 19: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

SnuCL for CPU Devices in the Cluster

19

The kernel code is portable across different types of compute devices

The OpenCL application for GPUs will run on CPU devices with SnuCL

Replace CL_DEVICE_TYPE_GPU with CL_DEVICE_TYPE_CPU

Page 20: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

SNU NPB Suite

20

Most of the applications in NAS Parallel Benchmarks (NPB 3.3) are implemented in C, OpenMP C, and OpenCL [IISWC ’11]

NPB-SER-C: a serial C version of the NPB code NPB-OMP-C: an OpenMP C version of the NPB codeNPB-OCL: an OpenCL version of the NPB code for a single deviceNPB-OCL-MD: an OpenCL version of the NPB code for multiple OpenCL compute devices

Source code is publicly availablehttp://aces.snu.ac.kr

Page 21: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

21

ApplicationsApplication Source Description Input Global memory

size (MB)Extensions

used

BinomialOption AMD Binomial option pricing 65504 or 2097152 samples, 512 steps, 100 iterations 2.0 or 64.0

BlackScholes PARSEC Black-Scholes PDE 33538048 options, 100 iterations 895.6

BT NAS Block tridiagonal solver Class C or Class D 1982.1 or 30686.7

CG NAS Conjugate gradient Class C or Class D 1102.6 or 20399.1

CP Parboil Coulombic potential 16384x16384, 1000 atoms 4.1

EP NAS Embarrassingly parallel Class D 0.8

FT NAS 3-D FFT PDE Class B or Class C 2816.0 or 11264.0 AlltoAll

MatrixMul NVIDIA Matrix multiplication 10752x10752 or 16384x16384 1323.0 or 3072.0 Broadcast

MG NAS Multigrid Class C or Class D 3575.3 or 28343.7

Nbody NVIDIA N-Body simulation 1048576 bodies 64.0

SP NAS Pentadiagonal solver Class C or Class D 1477.9 or 19974.4

Page 22: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Speedup (over 1 CPU core)

22

CPU devices only (11 cores in a CPU device), 1 CPU device per nodeSnuCL-Static : Using the static scheduling for the kernel workload distributionThe numbers on x-axis represent the number of CPU compute devices

GPU devices only (4 GPU device per node)The numbers on x-axis represent the number of GPU compute devices.

0 1000 2000 3000 4000 5000 6000

1 2 4 8 16 32 1 2 4 8 16 32 1 2 4 8 16 32 1 2 4 8 16 32 1 2 4 8 16 32 1 2 4 8 16 32 BinomialOption BlackScholes CP EP.D MatrixMul Nbody

Spee

dup

9515

0

4

8

12

16

4 9 16 25 36 1 2 4 8 16 32 4 8 16 32 4 8 16 32 4 9 16 25 36 BT.C CG.C FT.B MG.C SP.C

0

10

20

30

40

50

1 2 4 8 1 2 4 8 1 4 9 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 4 9 BinomialOption BlackScholes BT.C CG.C CP EP.D FT.B MatrixMul MG.C Nbody SP.C

Spee

dup

SnuCL-Static SnuCL 76,82 70,73

Page 23: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

SnuCL vs. MPI-Fortran

23

Speedup over a single compute node (a CPU compute device with 4 CPU cores) Normalized to MPI-Fortran

MPI-Fortran : The unmodified original MPI-Fortran versions from NPB

The numbers on x-axis represent the number of compute nodes (CPU compute devices)

0 1 4

16 64

256

1 4 16

64

256 1 4 16

64

25

6 4 16

64

256 4 16

64

25

6 1 4 16

64

256 1 4 16

64

25

6 1 4 16

64

256 1 4 16

64

25

6 4 16

64

256 1 4 16

64

25

6 4 16

64

256

BinomialOption BlackScholes BT.D CG.D CP EP.D FT.C MatrixMul MG.D Nbody SP.D

Spee

dup

MPI-Fortran SnuCL

Page 24: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Future Directions

Scalability for more than 1000 nodes

Achieving a single compute device image for multiple heterogeneous devices in a heterogeneous CPU/GPU cluster

AutotuningTo make performance portable across heterogeneous devices

Intelligent load balancing between multiple heterogeneous compute devices

24

Page 25: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

References

[ICS] Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. SnuCL: an OpenCL Framework for Heterogeneous CPU/GPU Clusters, ICS ’12: Proceedings of the 26th International Conference on Supercomputing, San Servolo Island, Venice, Italy, June 2012.[IISWC] Sangmin Seo, Gangwon Jo, and Jaejin Lee. Performance Characterization of the NAS Parallel Benchmarks in OpenCL, IISWC ’11: Proceedings of the 2011 IEEE International Symposium on Workload Characterization, Austin, Texas, USA, November 2011.[PACT] Jun Lee, Jungwon Kim, Junghyun Kim, Sangmin Seo, and Jaejin Lee. An OpenCL Framework for Homogeneous Manycores with no Hardware Cache Coherence, PACT ’11: Proceedings of the 20th ACM/IEEE/IFIP International Conference on Parallel Architectures and Compilation Techniques, Galveston Island, Texas, USA, October 2011.

25

Page 26: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

References (contd.)[LCPC] Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. OpenCL as a Programming Model for GPU Clusters, LCPC ’11: Proceedings of the 24th International Workshop on Languages and Compilers for Parallel Computing, Fort Collins, Colorado, USA, September 2011.[PPoPP] Jungwon Kim, Honggyu Kim, Joo Hwan Lee, and Jaejin Lee. Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, PPoPP ʼ11: Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.  277 — 288, San Antonio, Texas, USA, February 2011,  DOI: 10.1145/1941553.1941591. [PACT] Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jungho Park, Honggyu Kim, Thanh Tuan Dao, Yongjin Cho, Sung Jong Seo, Seung Hak Lee, Seung Mo Cho, Hyo Jung Song, Sang-Bum Suh, and Jong-Deok Choi. An OpenCL Framework for Heterogeneous Multicores with Local Memory, PACT ’10: Proceedings of the 19th ACM/IEEE/IFIP International Conference on Parallel Architectures and Compilation Techniques, pp. 193 — 204, Vienna, Austria, September 2010, DOI: 10.1145/1854273.1854301.

26

Page 27: SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU … · HPC Advisory Council China Workshop October 28, 2012 C ENTER for M ANYCORE P ROGRAMMING 6aP`p M6aP`p M6aP`p M6aP`p E e

CENTER for MANYCORE PROGRAMMING

�6aP`p

�M6aP`p

M6aP`p�

M6aP`p�

!"#$E��e�� �aP`p

E��e��BQ!m�W�F<1hN�.�i[W�`pV�2l���T.��X��]O4fW�?tJ�(*�����+XGR�'V��,lK�

]_n��V�CklK�>On ������O#J�3#/�aP`p��?l`p�'W�E��e�rj0��8lL��aP`p�)��?l`pW�

PC�]O�bU�:#.�%c�HT5��>OE�=-������h�0�ZW.�9�o�D�I ��E��eW�@?U� ������ �^O@?V�

PC]T.�>Ol5�X�7W�@?�sOJ�!n�>qU�;��X&*YW�@?sOV�d�n ��\AU������J�D/$�"Xi�>OV�

SgT.�n

매니코어 프로그래밍 연구단HPC Advisory Council

China Workshop October 28, 2012

Contributors to SnuCL

27

Jungwon KimSangmin Seo

Jun LeeJeongho NahGangwon Jo

Jaejin Lee

SnuCL is publicly available at http://aces.snu.ac.kr