More Modern GPU
Preferred Networks, Inc.
Preferred Infrastructure, Inc.
12/3 2015 PFI/PFN
GPU/CUDA
- the GPU offers far more raw compute than the CPU
  - Titan X: 6 TFlops (3072 cores) vs. Xeon: 0.8 TFlops (18 cores)
  - the CPU figure already counts SIMD lanes
- the GPU also has far higher memory bandwidth (hundreds of GB/s)
- GPGPU drew wide attention once; then the CPU caught up for a while
- with GPU/CUDA maturing, the GPU is in the spotlight again
- the HW is exposed through a stable programming model (CUDA),
  something the CPU side never quite standardized
Libraries around CUDA
- CUDA [Okuta 2015]
- cupy (part of chainer)
  - a numpy-like interface on the GPU
- modern gpu / thrust
  - STL-like algorithm libraries on top of CUDA
cupy (chainer.cuda.cupy)
- developed inside chainer as a numpy-compatible array library
- Numpy code can be moved to the GPU with few changes
- implemented on top of CUDA
  - kernels are compiled with Nvidia's toolchain and loaded at runtime
- used throughout Chainer
- see the Chainer Meetup slides for details
Background: GPU architecture
- MIMD (Multiple Instruction, Multiple Data)
  - each core runs its own instruction stream
  - e.g. PEZY-SC; "datacenter as a computer" on one chip
- SIMD (Single Instruction, Multiple Data)
  - one instruction operates on many data elements
  - e.g. SSE
- the GPU is MIMD + SIMD, so-called SIMT
  - the SMs run independently (MIMD)
  - within an SM, warps of 16-32 threads execute in lockstep (SIMD)
  - with gather/scatter memory access
(Prefix-)Scan
- given X[0, n), Scan computes all running sums
  X[i] := X[0] + X[1] + ... + X[i-1]   (exclusive scan)
  X[i] := X[0] + X[1] + ... + X[i]     (inclusive scan)
- Scan parallelizes by doubling strides:
  for all i in parallel: X[i] += X[i-1]
  for all i in parallel: X[i] += X[i-2]
  for all i in parallel: X[i] += X[i-4]
  ...
- e.g. X[7]:
  X[7] += X[6]
  X[7] += X[5]   (X[5] already holds X[4]+X[5])
  ...
- done in O(log n) steps
Modern GPU
- a library of GPU algorithm building blocks
- developed at Nvidia, released in 2013
  http://nvlabs.github.io/moderngpu/
- organized around three ideas:
  1. ...
  2. ...
  3. ...
- this talk mainly covers idea 1
Merge (1/2)
- given two sorted arrays X, Y, produce the sorted array Z of all their elements
- X = {1, 3, 3, 5, 7, 9, 10, 10, 11, 13, 15}
- Y = {0, 2, 3, 3, 7, 8, 8, 9, 10, 11, 14}
- Z = {0, 1, 2, 3, 3, 3, 3, 5, 7, 7, 8, 8, 9, 9, 10, 10, 10, 11, 11, 13, 14, 15}
- with n = |X|, m = |Y|, the sequential merge takes O(n+m)
Merge (2/2)
- sequential merge of X, Y into Z:

ix = 0; iy = 0; iz = 0;
while (iz < n + m) {
  if (iy >= m || (ix < n && comp(X[ix], Y[iy]))) Z[iz++] = X[ix++];
  else Z[iz++] = Y[iy++];
}
// comp(a, b): true if a should be output before b (e.g. a <= b for a stable merge)
Merge, step by step
- X = 0 1 3 5 6 6 7 9 10
- Y = 1 2 2 4 7 8 9 9 10
- at every step the merge takes one element from X or from Y:
  either Z[iz++] = X[ix++] or Z[iz++] = Y[iy++]
  - e.g. X[4] = 6 < Y[4] = 7, so the next output comes from X
- this sequence of choices looks inherently sequential
Parallelizing merge: partition the output
- X = 0 1 3 5 6 6 7 9 10
- Y = 1 2 2 4 7 8 9 9 10
- split the output into fixed-size blocks: Z[0,4), Z[4,8), Z[8,12), Z[12,16), ...
- each block of 4 outputs reads its own contiguous slices of X and Y,
  so the blocks can be merged independently in parallel:
  block 0: X: 0 1    Y: 1 2
  block 1: X: 3 5    Y: 2 4
  block 2: X: 6 6 7  Y: 7
  block 3: X: 9      Y: 8 9 9
  block 4: X: 10     Y: 10
Finding the partitions in parallel
- X = 0 1 3 5 6 6 7 9 10
- Y = 1 2 2 4 7 8 9 9 10
- each block boundary is found by binary search along a diagonal:
  for output position 8, find the i at which X[i] < Y[8 - i] stops holding
- every boundary can be searched independently, so all partitions
  are found in parallel
- each block then merges its fixed-size slices into Z sequentially,
  in registers, with the loop unrolled (#pragma unroll)
- the whole merge becomes memory-bandwidth bound (Titan X: 288 GB/s)
Other primitives in Modern GPU
- Bulk Insert, Bulk Delete
- Segmented Vector Reduction
  - a Reduction over many variable-length segments
- DB-style Joins
  - Outer, Inner, Left-, Right- Join
- MapReduce building blocks
- Modern GPU sits alongside Thrust and Cub
MapReduce on the GPU
- Map + Shuffle(Sort) + Reduce can each be run on the GPU
- in distributed MapReduce, Shuffle is the costly step; on a single
  GPU it reduces to a sort in device memory
- Map:     D -> [K, V]    each input D emits key-value pairs
- Shuffle: [K, V] -> [K, [V]]    group the values by key
- Reduce:  [V] -> Z
- overall: [D] -> [K, Z]
Example: word count on the GPU
- pipeline:
  1. mark word boundaries
     - a word ends where !isAlpha(right) && isAlpha(left)
  2. extract the words
     - Segmented Reduction
     - Reduce
     - emit (K, R) pairs for the Reduce step
  3. sort the words
  4. count each word with a Segmented Reduction
Experiment: word count
- code: https://github.com/hillbig/gpuexperiments
- input: 700MB of English text
- GPU: Titan X
- 106,888,008 words in total, 1,252,268 distinct words
- GPU: 1.67 s (plus 0.2 s for the CPU->GPU transfer)
- CPU: 14.80 s
- roughly a 10x speedup
Sample output (top words)

input=300000000 wordCount=45788064 distinctWord=1129243 words=457880640
 1: 2528465 the
 2: 1564080 of
 3: 1219248 and
 4: 986168 in
 5: 862412 a
 6: 862356 to
 7: 507386 is
 8: 484451 The
 9: 445334 was
10: 336005 for
11: 334510 s
12: 316207 as
13: 295183 by
14: 282728 with
15: 281566 on
16: 241960 that
17: 235218 doc
18: 221649 from
19: 193797 at
20: 189947 his
    157175 an
Notes on the implementation
- built with nvcc + thrust
  - the template-heavy code makes nvcc compilation slow
- ...
Summary
- the GPU is becoming much easier to program
  - cupy, thrust, and cub hide raw CUDA
- higher-level concurrency abstractions are still missing
  - e.g. nothing like Go's channels and goroutines yet
- "datacenter as a computer" is collapsing into a single chip (TFlops in 1 chip)
Copyright 2015- Preferred Networks. All Rights Reserved.