ICAL
Capabilities and Limitations of Distributed Computing in the GPU Architecture
11/17/09 ICAL 2
Outline
• Parallel computing with GPU
• NVIDIA CUDA
• SVD matrix computation
• Conclusion
Parallel computing with GPU
• Parallel computing
• Flynn’s Taxonomy
• Algorithm decomposition
• Amdahl’s Law
• Correctness concepts
Parallel computing
• Parallel computing is a form of computation in which many calculations are carried out simultaneously.
• Parallel computer hardware:
– Single machine: multi-core CPU, GPU
– Multiple machines: clusters, MPPs, grids
Parallel computing (cont.)
• There are several kinds of parallel computing, such as:
– Bit-level parallelism
– Instruction-level parallelism
– Data decomposition
– Task decomposition
• The speedup that parallel computing can achieve is limited.
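Algorithm decomposition
• Task decomposition: preparing the dinner is split into different tasks (purchasing, cooking, cleaning the table) that different people do at the same time, e.g. John cleans the table while Mary goes shopping. Then everyone enjoys the dinner.
• Data decomposition: washing the dishes is a single task split over the data, e.g. John and Mary each wash part of the dishes.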
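The two decomposition styles in the dinner example can be sketched in Python with the standard `concurrent.futures` module. This is only an illustration: the function names (`purchase`, `clean_table`, `wash`) and the dish list are hypothetical and chosen to mirror the slide, not taken from it.

```python
from concurrent.futures import ThreadPoolExecutor

def purchase():
    return "groceries bought"

def clean_table():
    return "table cleaned"

def wash(dishes):
    # The same operation applied to one worker's share of the data.
    return [f"{d} washed" for d in dishes]

def task_decomposition():
    # Different tasks run concurrently: Mary shops while John cleans the table.
    with ThreadPoolExecutor(max_workers=2) as pool:
        mary = pool.submit(purchase)
        john = pool.submit(clean_table)
        return [mary.result(), john.result()]

def data_decomposition(dishes, workers=2):
    # One task, split over the data: each worker washes its own slice.
    chunks = [dishes[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(wash, chunks)
    return sorted(item for chunk in results for item in chunk)

print(task_decomposition())
print(data_decomposition(["plate", "cup", "bowl", "fork"]))
```

In task decomposition the workers run different functions. In data decomposition they run the same function on different slices, which is the pattern that maps most directly onto SIMD-style hardware such as a GPU.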
Flynn’s Taxonomy
| | Single data | Multiple data |
|---|---|---|
| Single instruction | SISD | SIMD |
| Multiple instruction | MISD | MIMD |
Amdahl’s Law
• Amdahl's law is a model for the expected speedup from partial improvement
$$\text{Speedup} = \frac{1}{(1 - P) + \dfrac{P}{S}}$$
P: parallel portion
S: speedup of the parallel portion
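As a worked example, Amdahl's law in its usual form, speedup = 1 / ((1 - P) + P / S), can be evaluated with a few lines of Python. The function name `amdahl_speedup` and the input values (90% parallel portion, 96x speedup of that portion) are assumptions chosen for illustration, not figures from the slides.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is sped up by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)

# Hypothetical numbers: 90% of the runtime is parallelizable and that
# part runs 96x faster.
print(round(amdahl_speedup(0.9, 96), 2))   # 9.14

# Upper bound as s grows without limit: 1 / (1 - p)
print(round(1.0 / (1.0 - 0.9), 2))         # 10.0
```

Even with a 96x faster parallel portion, the remaining 10% of serial work caps the overall speedup just below 10x.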
Correctness concepts
• Race condition
• Deadlock
[Figure: race condition on a shared variable a. Labels: a=19, Read a, save a=a+1, a=20, save a=21, ERROR!]
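A race condition of this kind (a lost update on a shared variable) can be sketched in Python. The first function replays one bad interleaving deterministically rather than relying on real thread timing. The second shows a common fix with `threading.Lock`. The starting value 19 follows the slide; the function names and the exact interleaving are assumptions made for illustration.

```python
import threading

def lost_update(start=19):
    # Deterministic replay of the bad interleaving: both threads read
    # the shared value before either one saves its result.
    a = start
    read_by_t1 = a          # thread 1 reads a (19)
    read_by_t2 = a          # thread 2 reads a (19) before thread 1 saves
    a = read_by_t1 + 1      # thread 1 saves 20
    a = read_by_t2 + 1      # thread 2 saves 20 again; one increment is lost
    return a

def locked_increments(start=19, threads=2):
    # Each thread holds the lock for its whole read-modify-write step.
    state = {"a": start}
    lock = threading.Lock()

    def increment():
        with lock:
            state["a"] = state["a"] + 1

    workers = [threading.Thread(target=increment) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return state["a"]

print(lost_update())        # 20, although two increments were intended
print(locked_increments())  # 21
```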
NVIDIA CUDA
• Historical Trends
• CUDA
• Programming Languages
• Reported Speedup
Historical Trends
CUDA
• CUDA: Compute Unified Device Architecture
• CUDA is the parallel computing engine in NVIDIA GPUs (graphics processing units)
Programming Languages
Application
C/C++ Fortran OpenCL ......
NVIDIA GPU with the CUDA Parallel Computing Architecture
Reported Speedup
CUDA Architecture
• Physical Reality behind CUDA
• CUDA Architectures
• Introducing the “Fermi” Architecture
• SM Architecture
• CUDA Core Architecture
Physical Reality behind CUDA
[Figure labels: CPU (host), GPU (device), Main Memory]
CUDA Architectures
• G80
– First CUDA-capable processor
• G8x, G9x
– Global memory
• GT200
– Double precision
– Shared memory
– Larger register file
– Relaxed memory coalescing rules
Basic CUDA architecture
“Fermi” Architecture
• 3 billion transistors
• Over 2x the cores (512 total)
• 8x the peak DP performance
• L1 and L2 caches
• ~2x memory bandwidth
• Up to 1 terabyte of GPU memory
SM Architecture
• 32 CUDA cores per SM (Streaming Multiprocessor)
• 8x peak double precision floating point performance
• Dual Thread Scheduler
• 64 KB of RAM for shared memory and L1 cache
CUDA Core Architecture
• Implements the new IEEE 754-2008 floating-point standard
• Fused multiply-add (FMA) instruction for both single and double precision
• Newly designed integer ALU optimized for 64-bit and extended precision operations
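The practical effect of a fused multiply-add is that `a*b + c` is rounded once instead of twice. The sketch below emulates that single rounding in Python with exact rational arithmetic from the standard `fractions` module. It is an emulation for illustration only, not how the hardware computes it, and the function names and input values are assumptions chosen so the difference is visible in double precision.

```python
from fractions import Fraction

def unfused_multiply_add(a, b, c):
    # Two roundings: a*b is rounded to a double, then the sum is rounded again.
    return a * b + c

def emulated_fma(a, b, c):
    # One rounding: compute a*b + c exactly, then round to a double once.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# The exact product is 1 - 2**-60, which a double cannot hold.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(unfused_multiply_add(a, b, c))  # 0.0: the tiny term is rounded away
print(emulated_fma(a, b, c))          # -2**-60, about -8.67e-19
```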
SVD matrix computation
• SVD
• SVD matrix computation
• Experiment Datasets
• Experiment Environment
• Experiment Results
SVD
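• The singular value decomposition (SVD) is an important matrix factorization with many applications in signal processing and statistics.
• Suppose M is an m-by-n matrix. Then there exists a factorization of the form
$$M = U \Sigma V^*$$
where U is an m-by-m unitary matrix, Σ is an m-by-n diagonal matrix with non-negative real entries (the singular values), and V* is the conjugate transpose of an n-by-n unitary matrix V.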
SVD matrix computation
[Figure: Image → RGB pixel matrices → SVD matrices. The image is split into R, G, and B pixel-value matrices, and each matrix is factorized separately.]
$$M_R = U_R \Sigma_R V_R^*$$
$$M_G = U_G \Sigma_G V_G^*$$
$$M_B = U_B \Sigma_B V_B^*$$
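The per-channel factorization can be sketched on the CPU with NumPy. This is not the GPU implementation used in the experiment: `np.linalg.svd` stands in for whatever SVD routine was actually run, the helper names are hypothetical, and a small random array replaces the test images.

```python
import numpy as np

def svd_per_channel(image):
    """Return {channel: (U, singular_values, Vh)} for an H x W x 3 RGB array."""
    factors = {}
    for index, name in enumerate("RGB"):
        channel = image[:, :, index].astype(np.float64)
        # np.linalg.svd returns V already conjugate-transposed (Vh = V*).
        factors[name] = np.linalg.svd(channel, full_matrices=False)
    return factors

def reconstruct(u, s, vh):
    # M = U * diag(s) * V*
    return (u * s) @ vh

# Small random stand-in for a test image (the slides use 1024x1024 images).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)

factors = svd_per_channel(image)
max_error = max(
    np.abs(reconstruct(*factors[name]) - image[:, :, i]).max()
    for i, name in enumerate("RGB")
)
print(max_error < 1e-9)
```

The three channels are independent of each other, so their factorizations can run in parallel.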
Experiment Datasets
• 3 test images
• RGB full color
• 1024 x 1024 (1,048,576 pixels in total)
Experiment Environment
| GPU | |
|---|---|
| Device | NVIDIA GeForce 9600 GSO |
| Cores | 96 |
| Processor clock | 1375 MHz |
| Standard memory | 384 MB |
| Memory bandwidth | 38.4 GB/sec |

| CPU | |
|---|---|
| Device | Intel Core2 Quad Q9300 |
| Cores | 4 |
| Processor clock | 2.5 GHz |
| FSB speed | 1333 MHz |
| L2 cache | 6 MB |
Experiment Results
Conclusion
• Using a GPU to speed up programs is feasible.
• NVIDIA CUDA is well suited to SIMD-style parallel computing.
• However, passing data between main memory and GPU memory incurs additional cost.
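The effect of the data-transfer cost can be illustrated with a simple model: total GPU time is the kernel time plus the time to copy data between main memory and GPU memory. Everything in the sketch below is an assumption for illustration: the function name, the 20x kernel speedup, the 4 GB/s link rate, and the CPU timings are hypothetical and are not measurements from the experiment.

```python
def effective_speedup(cpu_seconds, kernel_speedup, bytes_moved, link_bytes_per_second):
    """Speedup of the whole job once host<->device copies are counted."""
    transfer_seconds = bytes_moved / link_bytes_per_second
    gpu_seconds = cpu_seconds / kernel_speedup + transfer_seconds
    return cpu_seconds / gpu_seconds

# Hypothetical workload: one 1024x1024 RGB image, assuming 1 byte per
# channel value, copied to the GPU and back. None of these timings are measured.
image_bytes = 1024 * 1024 * 3
round_trip = 2 * image_bytes

for cpu_seconds in (0.001, 0.1, 10.0):
    s = effective_speedup(cpu_seconds, kernel_speedup=20.0,
                          bytes_moved=round_trip,
                          link_bytes_per_second=4e9)
    print(f"CPU time {cpu_seconds:>6} s -> overall speedup {s:.2f}x")
```

Under these assumed numbers, a very short job ends up slower on the GPU than on the CPU because the copies dominate. A long-running job approaches the kernel speedup because the fixed transfer cost is amortized.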