CGPOP Analysis and Optimization
Analysis and Optimization of CGPOP
Hongtao Cai, Xiaoxiang Hu, Haoruo Peng
Department of CST, Tsinghua University
SIAM Annual Meeting, July 9, 2012
Acknowledgment
Prof. Xiaoge Wang, Prof. Wei Xue
Support from the State 863 Project Fund
Support from Explore-100, Tianhe-1A, Shenwei supercomputer systems
Support from SIAM
Outline
Background
Research
Analysis of original PCG Method in CGPOP
Optimizations:
Chebyshev
Richardson-PCG
Richardson-Chebyshev
Experiments
Future Work
Parallel Ocean Program
The crucial role of oceans in global climate
Cover 70% of the Earth's surface
Water has about 1000 times the heat capacity of air
Repository of carbon (93%)
Transport heat
POP: surface pressure of the oceans [1]
Conjugate Gradient Parallel Ocean Program (CGPOP)
Three computation parts: Barotropic, 3D-update, Baroclinic
Barotropic computation dominates when the core count exceeds 10,000 [2]
CGPOP contains the core part of the barotropic computation
Linear equation system in every time step
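$$\left[\nabla \cdot H \nabla - \frac{1}{g \alpha \tau \Delta t}\right] \eta^{n+1} = \nabla \cdot H \left[\frac{U}{g \alpha \tau} + \nabla \eta^{n-1}\right] - \frac{\eta^{n}}{g \alpha \tau \Delta t} - \frac{q_W^{n}}{g \alpha \tau}$$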
Ax = b
(A is a real, sparse, symmetric, positive-definite matrix)
Our work: exploring new algorithms in CGPOP, with experiments on top supercomputers in the world.
Chron-Gear (Chronopoulos-Gear) Preconditioned Conjugate Gradient Solver
Matrix-vector Multiplication, Dot Product, Daxpy: Communication
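
As a rough illustration of how these kernels fit together, here is a minimal serial sketch of a Chronopoulos-Gear style PCG loop. It is not the CGPOP source: the dense list-of-lists matrix, the diagonal preconditioner, and the names `chron_gear_pcg`, `matvec`, `dot` and `axpy` are assumptions made for this example. Each iteration uses one mat-vec, four daxpy-like updates and three dot products; in a parallel code the three dot products can share a single global reduction.

```python
def matvec(A, x):
    """Dense matrix-vector product y = A x (A is a list of rows)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    """Dot product; in a parallel run this needs a global reduction."""
    return sum(xi * yi for xi, yi in zip(x, y))


def axpy(alpha, x, y):
    """Return alpha * x + y (the daxpy kernel)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def chron_gear_pcg(A, b, diag, tol=1e-12, max_iter=1000):
    """Solve A x = b for SPD A; diag is an assumed diagonal preconditioner."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual for the zero initial guess
    s = [0.0] * n        # search direction
    p = [0.0] * n        # p = A s, updated by recurrence
    rho_old, sigma = 1.0, 0.0
    b_norm2 = dot(b, b)
    for it in range(max_iter):
        z = [ri / di for ri, di in zip(r, diag)]   # z = M^{-1} r
        az = matvec(A, z)                          # 1 mat-vec
        # Three dot products that a parallel code can fuse into one reduction.
        rho, delta, r_norm2 = dot(r, z), dot(az, z), dot(r, r)
        if r_norm2 <= tol * tol * b_norm2:
            return x, it
        beta = rho / rho_old
        sigma = delta - beta * beta * sigma        # equals s^T A s
        alpha = rho / sigma
        s = axpy(beta, s, z)                       # 4 daxpy-like updates
        p = axpy(beta, p, az)
        x = axpy(alpha, s, x)
        r = axpy(-alpha, p, r)
        rho_old = rho
    return x, max_iter


# Small SPD test system (tridiagonal, diagonally dominant).
N = 5
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(N)]
     for i in range(N)]
x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
b = matvec(A, x_true)
x, iters = chron_gear_pcg(A, b, diag=[4.0] * N)
```

The recurrence for `sigma` is what removes the separate reduction for the step length, so the dot products no longer sit at two different synchronization points.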
PCG Solver
(on Shenwei Supercomputer)
Percentage of Time consumed by Dot Product
Three Variants
1S1D / 2S2D / 2S1D
1-sided MPI: put/get
2-sided MPI: send/receive
2D: direct data access, more memory
1D: ocean points stored compactly; less memory, indirect data access
[Figure: 2D and 1D data layouts]
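
The difference between the two layouts can be sketched in a few lines. The mask, the field values and the function names below are made up for illustration and are not taken from CGPOP. The 2D version addresses neighbours directly by `(i, j + 1)`, while the 1D version stores only ocean points and reaches neighbours through an index table.

```python
# Hypothetical 4x5 block: 1 marks an ocean point, 0 a land point.
MASK = [
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
]
FIELD_2D = [[10.0 * i + j for j in range(5)] for i in range(4)]


def to_compact(field2d, mask):
    """Pack ocean values into a 1D list and build an east-neighbour index table."""
    index = {}
    values = []
    for i, row in enumerate(mask):
        for j, is_ocean in enumerate(row):
            if is_ocean:
                index[(i, j)] = len(values)
                values.append(field2d[i][j])
    # east[k] is the compact index of the point east of k, or -1 for land/boundary.
    east = [-1] * len(values)
    for (i, j), k in index.items():
        east[k] = index.get((i, j + 1), -1)
    return values, east


def east_sum_2d(field2d, mask):
    """2D layout: neighbours are found by direct (i, j + 1) addressing."""
    out = []
    for i, row in enumerate(mask):
        for j, is_ocean in enumerate(row):
            if is_ocean:
                has_east = j + 1 < len(row) and mask[i][j + 1]
                out.append(field2d[i][j] + (field2d[i][j + 1] if has_east else 0.0))
    return out


def east_sum_1d(values, east):
    """1D layout: neighbours are found through the index table (indirect access)."""
    return [v + (values[e] if e >= 0 else 0.0) for v, e in zip(values, east)]


values, east = to_compact(FIELD_2D, MASK)
result_2d = east_sum_2d(FIELD_2D, MASK)
result_1d = east_sum_1d(values, east)
```

Both routines give the same numbers; the compact form holds 12 values instead of 20 here, at the price of one extra index lookup per neighbour access.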
Total Time for 1 Time Step (on Tianhe-1A)
Analysis Conclusions
Dot products consume a significant share of the solver time
Of the three variants, 2S1D is selected as the benchmark
Chebyshev
Mat-vec Mul, Daxpy, No Dot Product
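
A minimal sketch of an unpreconditioned Chebyshev iteration is given here to show why no dot product is needed: the step coefficients come from eigenvalue bounds instead of inner products. The bounds `lam_min` and `lam_max`, the fixed iteration count, and the function names are assumptions of this example; the slides do not say how the bounds were obtained or how convergence was checked.

```python
def matvec(A, x):
    """Dense matrix-vector product y = A x (A is a list of rows)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def chebyshev_solve(A, b, lam_min, lam_max, iters):
    """Chebyshev iteration for SPD A with eigenvalues in [lam_min, lam_max].

    Assumes lam_min < lam_max. Uses only mat-vec and axpy-style updates.
    """
    theta = 0.5 * (lam_max + lam_min)   # centre of the spectrum interval
    delta = 0.5 * (lam_max - lam_min)   # half-width of the interval
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    x = [0.0] * len(b)
    r = list(b)                          # residual for the zero initial guess
    d = [ri / theta for ri in r]
    for _ in range(iters):
        x = [xi + di for xi, di in zip(x, d)]
        ad = matvec(A, d)
        r = [ri - adi for ri, adi in zip(r, ad)]
        rho_new = 1.0 / (2.0 * sigma1 - rho)
        d = [rho_new * rho * di + (2.0 * rho_new / delta) * ri
             for di, ri in zip(d, r)]
        rho = rho_new
    return x


# Tridiagonal SPD test matrix; Gershgorin's theorem gives eigenvalues in [2, 6].
N = 6
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(N)]
     for i in range(N)]
x_true = [float(i + 1) for i in range(N)]
b = matvec(A, x_true)
x = chebyshev_solve(A, b, lam_min=2.0, lam_max=6.0, iters=40)
```

The trade-off is that the method depends on the quality of the eigenvalue bounds, and a practical solver would still check the residual norm every so often, which costs an occasional reduction.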
PCG: (4 Daxpy + 1 MV + 3 DP) × total_iter1
CBS (Chebyshev): (3 Daxpy + 1 MV) × total_iter2
Legend: Dot Product (DP), Daxpy, Mat-Vec Mul (MV)
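
Reading the two operation counts as per-iteration costs gives a rough break-even condition. The per-kernel times $t_{daxpy}$, $t_{MV}$ and $t_{DP}$ are symbols introduced here for illustration; they do not appear in the slides.

```latex
T_{PCG} = (4\,t_{daxpy} + t_{MV} + 3\,t_{DP}) \cdot total\_iter_1,
\qquad
T_{CBS} = (3\,t_{daxpy} + t_{MV}) \cdot total\_iter_2

T_{CBS} < T_{PCG}
\iff
\frac{total\_iter_2}{total\_iter_1}
< \frac{4\,t_{daxpy} + t_{MV} + 3\,t_{DP}}{3\,t_{daxpy} + t_{MV}}
= 1 + \frac{t_{daxpy} + 3\,t_{DP}}{3\,t_{daxpy} + t_{MV}}
```

Under this simple model, Chebyshev can afford more iterations than PCG, and the allowance grows as the dot product time $t_{DP}$ grows relative to the other kernels.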
Richardson-PCG
Single Precision: Faster [5]
A processor can process 2 double-precision or 4 single-precision values at a time
Less memory pressure
Double Precision: More Accurate
Mix the two
Richardson Method (Splitting Method)
$Ax = b,\ A = M - N \;\Rightarrow\; x = M^{-1}Nx + M^{-1}b$
Iteration: $x_{k+1} \leftarrow M^{-1}N x_k + M^{-1}b$, with $\rho(M^{-1}N) < 1$
Our Motivation: Let $M = A_{float}$, s.t. $M^{-1}N = I - M^{-1}A \approx 0$
Our Method: $x_{k+1} \leftarrow x_k + M^{-1}(b - A x_k)$
Same as solving $A_{float}\,\Delta x = b - A x_k$
Approximation: Tolerance
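
The outer/inner structure of this method can be sketched as follows. This is an emulation under stated assumptions, not the authors' code: single precision is imitated by rounding the matrix, the residual and the correction to 32-bit floats with `struct`, while the inner arithmetic itself still runs in Python doubles; the inner solver is plain CG without a preconditioner; and all function names and tolerances are chosen for this example.

```python
import struct


def to_single(v):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def axpy(alpha, x, y):
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def cg(A, b, tol, max_iter=200):
    """Plain CG used as the inner solver, stopped at a loose relative tolerance."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    stop = tol * tol * rr
    for _ in range(max_iter):
        if rr <= stop:
            break
        ap = matvec(A, p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)
        rr = rr_new
    return x


def richardson_mixed(A, b, inner_tol=1e-4, outer_tol=1e-12, max_outer=50):
    """x <- x + M^{-1}(b - A x), with M the single-precision copy of A."""
    a_float = [[to_single(a) for a in row] for row in A]   # convert matrix once
    x = [0.0] * len(b)
    bb = dot(b, b)
    for outer in range(max_outer):
        r = axpy(-1.0, matvec(A, x), b)          # double-precision residual
        if dot(r, r) <= outer_tol * outer_tol * bb:
            return x, outer
        r_s = [to_single(v) for v in r]          # convert vector: double -> single
        dx = cg(a_float, r_s, inner_tol)         # approximate inner solve
        dx = [to_single(v) for v in dx]          # correction held in single
        x = axpy(1.0, dx, x)                     # convert back and update in double
    return x, max_outer


# Entries 4.1 and -1.3 are not exactly representable in single precision.
N = 6
A = [[4.1 if i == j else (-1.3 if abs(i - j) == 1 else 0.0) for j in range(N)]
     for i in range(N)]
x_true = [0.1 * (i + 1) for i in range(N)]
b = matvec(A, x_true)
x, outer_iters = richardson_mixed(A, b)
```

Because the residual is always recomputed in double precision, the loose inner tolerance and the rounding of the correction only affect how many outer iterations are needed, not the accuracy that is finally reached.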
PCG: (4 Daxpy + 1 DMV + 3 DDP) × total_iter1
Rich-PCG: 1 CM + (4 Saxpy + 1 SMV + 3 SDP) × total_iter3 + (2 Daxpy + 1 DMV + 1 DDP + 2 CV) × outer_iter
Legend: Double Mat-Vec Mul (DMV), Daxpy, Double Dot Product (DDP), Single Mat-Vec Mul (SMV), Saxpy, Single Dot Product (SDP), Convert Vector (CV), Convert Matrix (CM)
Richardson-Chebyshev
Rich-CBS: 1 CM + (3 Saxpy + 1 SMV) × total_iter4 + (2 Daxpy + 1 DMV + 1 DDP + 2 CV) × outer_iter4
Rich-PCG: 1 CM + (4 Saxpy + 1 SMV + 3 SDP) × total_iter3 + (2 Daxpy + 1 DMV + 1 DDP + 2 CV) × outer_iter3
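
A compact sketch of how the two ideas might combine: a double-precision Richardson outer loop whose correction is computed by a fixed number of Chebyshev steps on the single-precision copy of the matrix. As before this is an emulation with assumed names and parameters (`richardson_chebyshev`, `inner_iters`, the eigenvalue bounds), with single precision imitated by rounding through `struct`; the only dot product left is the outer convergence check.

```python
import struct


def to_single(v):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def chebyshev(A, b, lam_min, lam_max, iters):
    """Fixed-length Chebyshev iteration; no dot products inside the loop."""
    theta = 0.5 * (lam_max + lam_min)
    delta = 0.5 * (lam_max - lam_min)
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    x = [0.0] * len(b)
    r = list(b)
    d = [ri / theta for ri in r]
    for _ in range(iters):
        x = [xi + di for xi, di in zip(x, d)]
        ad = matvec(A, d)
        r = [ri - adi for ri, adi in zip(r, ad)]
        rho_new = 1.0 / (2.0 * sigma1 - rho)
        d = [rho_new * rho * di + (2.0 * rho_new / delta) * ri
             for di, ri in zip(d, r)]
        rho = rho_new
    return x


def richardson_chebyshev(A, b, lam_min, lam_max, inner_iters=10,
                         tol=1e-12, max_outer=50):
    a_float = [[to_single(a) for a in row] for row in A]   # convert matrix once
    x = [0.0] * len(b)
    bb = sum(v * v for v in b)
    for outer in range(max_outer):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]   # double residual
        if sum(v * v for v in r) <= tol * tol * bb:        # one dot product
            return x, outer
        r_s = [to_single(v) for v in r]                    # double -> single
        dx = chebyshev(a_float, r_s, lam_min, lam_max, inner_iters)
        x = [xi + to_single(di) for xi, di in zip(x, dx)]  # single -> double
    return x, max_outer


# Tridiagonal SPD test matrix; Gershgorin bounds give eigenvalues in [1.5, 6.7].
N = 6
A = [[4.1 if i == j else (-1.3 if abs(i - j) == 1 else 0.0) for j in range(N)]
     for i in range(N)]
x_true = [0.1 * (i + 1) for i in range(N)]
b = matvec(A, x_true)
x, outer_iters = richardson_chebyshev(A, b, lam_min=1.5, lam_max=6.7)
```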
Experiments
Our supercomputers:
Tianhe-1A
CPU: 2.93 GHz Intel Xeon X5670
Memory: 32/48 GB per node. Bandwidth: 40 GB/s
Network: 160 Gbps, 22 ns. Fat-tree structure
Shenwei
CPU: 1.1 GHz Shenwei processor
Memory: 32 GB per node. Bandwidth: 68 GB/s (List result)
Network: Crossbar for every 256 CPUs. Fat-tree structure
Experiments on Tianhe-1A
Experiments on Shenwei
Conclusion
Two techniques
Reducing dot products
Effective at large core counts (more than 5000)
Mixed precision
Effective at small core counts (fewer than 1000)
Future Work
Complete the investigation of the current code
Integrate the optimization techniques into our ocean modeling programs
Apply our methods to other parallel programs
References
[1] R. Smith, P. Gent, “Reference Manual for the Parallel Ocean Program (POP)”, May 2002, pp. 1-74.
[2] A. Stone, J. M. Dennis, M. M. Strout, “The CGPOP Miniapp, Version 1.0”, July 2011, pp. 4-5.
[3] Y. Saad, A. Sameh, P. Saylor, “Solving elliptic difference equations on a linear array of processors”, SIAM J. Sci. Stat. Comput., Vol. 6, No. 4, October 1985, pp. 1049-1063.
[4] E. Stiefel, “Kernel polynomials in linear algebra and their numerical applications”, Nat. Bur. Standards, Appl. Math. Series 49, 1958, pp. 1-22.
[5] A. Buttari, E. Lyon, J. Dongarra, “Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy”, ACM Transactions on Math. Software, Vol. 34, No. 4, Article 17, pp. 1-8.