Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook
Lattice QCD software for the IBM Blue Gene/P architecture
Stefan Krieg <s.krieg@fz-juelich.de>
Jülich Supercomputing Center, Forschungszentrum Jülich
and FB C, Mathematik und Naturwissenschaften,
Bergische Universität Wuppertal
Joint Seminar on High Performance Computing, Trinity College, Dublin - Bergische Universität Wuppertal
Stefan Krieg 30.01.2009
BlueGene software for LQCD FZ Jülich, BU Wuppertal
Outline
Lattice QCD
  - LQCD basics
  - Simulation algorithm
Blue Gene/P hardware
  - BGP system
  - BGP functionality
Lattice QCD software for Blue Gene
  - The Wilson kernel
  - Communication
  - Serial code
Performance results
  - Kernel performance
  - Mixed prec. inverters
Conclusion and outlook
Some Lattice QCD basics
The lattice
Lattice QCD (LQCD) is QCD defined on a 4-dimensional periodic lattice; it is a way to define QCD in a mathematically precise way.
Key ingredients are:
- The quarks, living on the lattice sites
- The gluons, living on the lattice links
- Typically the LQCD action connects only neighboring sites (plain Wilson)
Simulations of LQCD are the only available method to directly access the low energy regime of QCD.
QCD particle spectrum (Blue Gene, BMW coll.)
Science 322, 1224 (2008)
Simulation algorithm
Importance sampling
- After Wick rotation, the path integral becomes equivalent to a partition function of statistical mechanics:
$$\langle O \rangle_{F,G}^{\mathrm{QCD}} = \int D\bar\psi\, D\psi\, DA\; O(\bar\psi,\psi,A)\, \exp\{-S_E^{\mathrm{QCD}}(\bar\psi,\psi,A)\} = \int DA\, \det[D]\, \langle O(A)\rangle_F\, \exp\{-S_{E,G}^{\mathrm{QCD}}(A)\}$$
- The Boltzmann weight can be interpreted as a probability.
- If $A_i$ has the probability $p \propto \exp\{-S_E^{\mathrm{QCD}}(A_i)\}$ to appear in the ensemble of $A$'s, then
$$\langle O \rangle_{F,G}^{\mathrm{QCD}} = \frac{1}{N} \sum_{i=1}^{N} \langle O(A_i)\rangle_F$$
gives a statistical estimate of an operator expectation value.
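The averaging formula above can be tried on a toy problem. The sketch below is not LQCD: it assumes a single real variable x with the hypothetical action S(x) = x²/2 in place of the gauge field, and it uses a plain Metropolis update (not the HMC of the talk) to produce configurations with probability ∝ exp(−S). The names `action`, `metropolis_chain` and `estimate` are illustrative only.

```python
import math
import random

def action(x):
    # Toy Euclidean action S(x) = x^2 / 2 (assumption: stands in for S_E[A]).
    return 0.5 * x * x

def metropolis_chain(n_samples, step=1.5, seed=1):
    """Generate configurations x_i distributed as exp(-S(x)) via Metropolis."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        # Accept with probability min(1, exp(-(S_new - S_old))).
        if rng.random() < math.exp(min(0.0, action(x) - action(proposal))):
            x = proposal
        samples.append(x)
    return samples

def estimate(observable, samples):
    # <O> ~ (1/N) sum_i O(x_i): a plain average, the weight sits in the sampling.
    return sum(observable(x) for x in samples) / len(samples)

samples = metropolis_chain(200_000)
x2 = estimate(lambda x: x * x, samples)  # exact value for this toy action is 1
```

The Boltzmann weight never appears in the average itself; it only enters through the accept/reject decision that shapes the ensemble.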
Simulation algorithm: HMC
The fermionic determinant
$$\int \prod_i d\eta_i\, d\eta_i^\dagger\, \exp\Big\{-\sum_{ij} \eta_i^\dagger D_{ij} \eta_j\Big\} = \det[D]$$
is included via the pseudofermions (pos. definite $K$):
$$\int \prod_i dc_i\, dc_i^*\, \exp\Big\{-\sum_{ij} c_i^* K_{ij} c_j\Big\} = \frac{1}{\det[K]}$$
The Hybrid Monte Carlo algorithm now proceeds as follows:
1. Generate a pseudofermion vector
2. Molecular dynamics evolution of the gauge links (⇒ inversions)
3. Metropolis accept/reject step (energy: $E = S + \frac{1}{2}\Pi^2$)
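Steps 2 and 3 can be illustrated with a minimal sketch on one real degree of freedom. It assumes the toy action S(q) = q²/2, so there are no pseudofermions and no inversions (step 1 is omitted); the leapfrog integrator and all names (`hmc_step`, `n_md`, `dt`) are choices made for this example, not taken from the talk.

```python
import math
import random

def action(q):
    return 0.5 * q * q          # toy action S(q), assumption for illustration

def force(q):
    return -q                   # -dS/dq

def hmc_step(q, rng, n_md=10, dt=0.1):
    """One HMC trajectory: momentum refresh, leapfrog MD, Metropolis test."""
    p = rng.gauss(0.0, 1.0)                     # conjugate momentum Pi
    e_old = action(q) + 0.5 * p * p             # E = S + Pi^2 / 2
    q_new, p_new = q, p
    p_new += 0.5 * dt * force(q_new)            # initial half step in momentum
    for i in range(n_md):
        q_new += dt * p_new                     # full step in position
        if i < n_md - 1:
            p_new += dt * force(q_new)          # full step in momentum
    p_new += 0.5 * dt * force(q_new)            # final half step in momentum
    e_new = action(q_new) + 0.5 * p_new * p_new
    # Accept with probability min(1, exp(-(E_new - E_old))).
    if rng.random() < math.exp(min(0.0, e_old - e_new)):
        return q_new, True
    return q, False

rng = random.Random(7)
q, accepted, q2_sum, n_traj = 0.0, 0, 0.0, 20_000
for _ in range(n_traj):
    q, ok = hmc_step(q, rng)
    accepted += ok
    q2_sum += q * q
acceptance = accepted / n_traj
q2 = q2_sum / n_traj            # exact value for this toy action is 1
```

In the real algorithm the force evaluation is where the inversions of the fermion matrix enter, which is why the kernel performance discussed later dominates the cost.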
Blue Gene/P system
Blue Gene/P system structure
Level          Contents                                      Peak        Memory
Chip           4 processors                                  13.6 GF/s
Compute card   1 chip                                        13.6 GF/s   2.0 GB DDR2 (4.0 GB optional)
Node card      32 chips (4x4x2); 32 compute, 0-1 IO cards    435 GF/s    64 GB
Rack           32 node cards, cabled 8x8x16                  13.9 TF/s   2 TB
System         72 racks, 72x32x32                            1 PF/s      144 TB
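The peak and memory figures of the hierarchy follow from simple multiplication. The short check below assumes an 850 MHz core clock and 4 floating point operations per cycle and core; neither number is stated on the slide, they are filled in here from the publicly documented Blue Gene/P specification.

```python
CLOCK_HZ = 850e6        # assumption: core clock, not given on the slide
FLOPS_PER_CYCLE = 4     # assumption: two fused multiply-adds per cycle and core
CORES_PER_CHIP = 4      # "4 processors" per chip

chip_gf = CORES_PER_CHIP * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9   # per chip / compute card
node_card_gf = 32 * chip_gf                                   # 32 chips per node card
rack_tf = 32 * node_card_gf / 1e3                             # 32 node cards per rack
system_pf = 72 * rack_tf / 1e3                                # 72 racks

node_card_gb = 32 * 2                                         # 2 GB per compute card
rack_tb = 32 * node_card_gb / 1024
system_tb = 72 * rack_tb

print(f"chip {chip_gf:.1f} GF/s, node card {node_card_gf:.0f} GF/s, "
      f"rack {rack_tf:.1f} TF/s, system {system_pf:.2f} PF/s")
print(f"memory: node card {node_card_gb} GB, rack {rack_tb:.0f} TB, "
      f"system {system_tb:.0f} TB")
```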
The Blue Gene/P node
Blue Gene/P networks
Blue Gene/P main networks are:
- The 3-dimensional torus network
  - Bandwidth (node): 6*2*425 MB/s = 5.1 GB/s
  - HW latency (1 hop): 100 ns (32 B), 800 ns (256 B packet)
  - HW latency (worst): 3.2 µs
  - DMA controlled: overlapping communication and computation
- The tree network for collectives
  - Bandwidth (node): 2*850 MB/s = 1.7 GB/s
  - HW latency (worst): 3.5 µs
- Global barrier network
Blue Gene/P functionality
DMA communications
[Figure: data paths between two compute nodes. On each node the DMA connects memory (load/store, read/store), the counters (updates) and the FIFOs (send/recv.) to the torus hardware.]
DMA is capable of:
- Direct-put: put data into memory on destination node
- MemFifo comms: put data into reception fifo on destination node
- Remote-get: put a descriptor into injection fifo on destination node
- Prefetch-only: prefetch data into L3 (no transfers)
- Destination node can be the node itself (local transfers)
DMA “directly” programmable: SPI
“Double Hummer” FPU
The “Double Hummer” FPU features:
- Instructions optimized for complex arithmetic: only 2 instructions required for complex multiplication
- 32 primary + 32 secondary registers
- Capability to load 16 Byte quadword
- 5 stage pipeline
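Complex multiplication with “Double Hummer” FPU
[Figure: data flow of the operands A, B and C through the multiply and add units of the primary and secondary FPU.]
Complex multiplication (A×B=C)
$$\mathrm{Re}(C) = \mathrm{Re}(A)\,\mathrm{Re}(B) - \mathrm{Im}(A)\,\mathrm{Im}(B)$$
$$\mathrm{Im}(C) = \mathrm{Im}(A)\,\mathrm{Re}(B) + \mathrm{Re}(A)\,\mathrm{Im}(B)$$
Required:
- a cross copy primary multiply (2+1 register operands)
- a cross mixed negative secondary multiply-add (3+1 register operands)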
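The two-instruction sequence can be emulated in a few lines. This is a sketch of the arithmetic only: a register pair is modeled as a tuple (primary, secondary) holding (real, imaginary) part, and the two function names merely paraphrase the slide's wording; they are not actual instruction mnemonics or intrinsics.

```python
def cross_copy_primary_multiply(a, b):
    # The primary element of A (its real part) multiplies both elements of B.
    a_re, _ = a
    b_re, b_im = b
    return (a_re * b_re, a_re * b_im)

def cross_mixed_negative_secondary_multiply_add(a, b, t):
    # The secondary element of A (its imaginary part) multiplies B crosswise;
    # the product is subtracted in the primary and added in the secondary slot.
    _, a_im = a
    b_re, b_im = b
    return (t[0] - a_im * b_im, t[1] + a_im * b_re)

def complex_multiply(a, b):
    """C = A x B in two emulated instructions."""
    t = cross_copy_primary_multiply(a, b)
    return cross_mixed_negative_secondary_multiply_add(a, b, t)

c = complex_multiply((1.0, 2.0), (3.0, 4.0))   # (1 + 2i)(3 + 4i) = -5 + 10i
```

The first call produces Re(A)Re(B) and Re(A)Im(B); the second adds −Im(A)Im(B) and +Im(A)Re(B), which reproduces both lines of the formula above.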
The Wilson kernel
Wilson kernel properties
- Wilson kernel is a sparse matrix vector multiplication
- Sparse: memory footprint of kernel scales linearly with N
- Wilson kernel is a first order derivative connecting nearest neighbors (NN)
- However, the “NN coefficients” are random matrices
Blue Gene/P implementation strategy:
1. (scalar) Spin project forward
2. (comm) Start communication forward
3. (scalar) Spin project backward and SU(3) multiply
4. (comm) Wait forward; start communication backward
5. (scalar) SU(3) multiply fwd. and sum up
6. (comm) Wait backward
7. (scalar) Add backward
⇒ Calculations and communications overlap.
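The ordering of the seven steps can be followed on a strongly simplified model. The sketch assumes a 1-dimensional periodic lattice with U(1) phases as links instead of SU(3) matrices, no spin structure, and "communication" that is just a list copy between simulated ranks; it only shows which data is exchanged when, and that the result agrees with a single-domain computation. All function names are hypothetical.

```python
import cmath
import random

def hop_reference(U, psi):
    """(D psi)(x) = U(x) psi(x+1) + conj(U(x-1)) psi(x-1), periodic in x."""
    n = len(psi)
    return [U[x] * psi[(x + 1) % n] + U[x - 1].conjugate() * psi[x - 1]
            for x in range(n)]

def hop_distributed(U, psi, n_ranks):
    """Same operator with the lattice split over n_ranks (must divide len(psi))."""
    loc = len(psi) // n_ranks
    Ul = [U[r * loc:(r + 1) * loc] for r in range(n_ranks)]
    pl = [psi[r * loc:(r + 1) * loc] for r in range(n_ranks)]
    out = [[0j] * loc for _ in range(n_ranks)]
    # Steps 1+2: every rank posts its first site; the previous rank needs it
    # as psi(x+1) for its last site. Think of this data as "in flight".
    fwd_halo = [pl[(r + 1) % n_ranks][0] for r in range(n_ranks)]
    # Step 3: backward term, multiplied by the link where the link lives.
    bwd_send = [Ul[r][-1].conjugate() * pl[r][-1] for r in range(n_ranks)]
    for r in range(n_ranks):
        for x in range(1, loc):
            out[r][x] = Ul[r][x - 1].conjugate() * pl[r][x - 1]
    # Step 4: forward data has arrived; start sending the backward products.
    bwd_halo = [bwd_send[r - 1] for r in range(n_ranks)]
    # Step 5: link multiply for the forward term and sum up.
    for r in range(n_ranks):
        for x in range(loc):
            nxt = pl[r][x + 1] if x + 1 < loc else fwd_halo[r]
            out[r][x] += Ul[r][x] * nxt
    # Steps 6+7: backward data has arrived; add it on the first local site.
    for r in range(n_ranks):
        out[r][0] += bwd_halo[r]
    return [v for block in out for v in block]

rng = random.Random(0)
n = 16
U = [cmath.exp(1j * rng.uniform(0, 2 * cmath.pi)) for _ in range(n)]
psi = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
max_diff = max(abs(a - b) for a, b in
               zip(hop_reference(U, psi), hop_distributed(U, psi, 4)))
```

In this model the local work of steps 3 and 5 does not depend on the data posted just before it, which is what allows real hardware to hide the transfer behind the computation.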
Wilson kernel communication pattern
Match the 4-dimensional periodic physics lattice to the communication hardware:
- Put 3 dimensions along torus directions
- Use local transfers for the 4th dimension, between the four cores (CPU0 to CPU3) of a node:
    Core0 → Core1
      ↑       ↓
    Core2 → Core3
⇒ 4-dimensional torus
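A minimal sketch of the resulting process grid: each process is labeled (x, y, z, core), where the first three entries are torus coordinates and the fourth is the core index on the node. The ring order of the cores is assumed here to be plain index order modulo 4, which may differ from the order drawn on the slide; the function names are hypothetical.

```python
def neighbor(coord, direction, sign, dims):
    """Periodic neighbor of coord = (x, y, z, core) in the given direction."""
    c = list(coord)
    c[direction] = (c[direction] + sign) % dims[direction]
    return tuple(c)

def transfer_kind(direction):
    # Directions 0-2 go over the torus network, direction 3 stays on the node.
    return "torus" if direction < 3 else "local"

dims = (8, 8, 16, 4)            # example: one rack (8x8x16 nodes), 4 cores each
here = (7, 0, 0, 3)
fwd_t = neighbor(here, 3, +1, dims)    # wraps around to core 0 on the same node
bwd_x = neighbor((0, 0, 0, 0), 0, -1, dims)
```

Since every direction wraps periodically, the processes form a 4-dimensional torus that matches the boundary conditions of the lattice.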
Using the DMA
Injection Fifo
[Figure: the injection fifo as a contiguous chunk of DDR memory between the fifo starting address and the fifo ending address. New descriptors are injected at the fifo tail; the DMA sends the descriptor at the fifo head to the torus HW and decrements the counter; head and tail wrap around.]
- Fifo is a contiguous chunk of DDR memory
- The Fifo has
  - Starting address
  - Ending address
  - Head
  - Tail
- Message descriptors are injected into the injection fifo
- DMA “executes” a descriptor, then updates the fifo head
- New descriptor is injected at the fifo tail; the tail is then updated
- Tail is wrapped around
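The head/tail mechanics are those of a ring buffer and can be modeled as follows. This is a toy model, not the SPI interface: the class name, the slot-based addressing and the convention of keeping one slot free to distinguish "full" from "empty" are assumptions made for the example.

```python
class InjectionFifo:
    """Toy injection fifo: software writes at the tail, the DMA reads at the head."""

    def __init__(self, n_slots):
        self.slots = [None] * n_slots   # stands in for the contiguous DDR chunk
        self.head = 0                   # next descriptor the DMA will execute
        self.tail = 0                   # next free slot for a new descriptor

    def pending(self):
        return (self.tail - self.head) % len(self.slots)

    def inject(self, descriptor):
        if self.pending() == len(self.slots) - 1:
            raise BufferError("fifo full")
        self.slots[self.tail] = descriptor
        self.tail = (self.tail + 1) % len(self.slots)   # tail wraps around

    def dma_execute_one(self):
        if self.head == self.tail:
            return None                                  # nothing to do
        descriptor = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)   # head wraps around
        return descriptor

fifo = InjectionFifo(4)
for name in ("a", "b", "c"):
    fifo.inject(name)
first = fifo.dma_execute_one()     # frees one slot
fifo.inject("d")                   # written to the last slot, tail wraps to 0
rest = [fifo.dma_execute_one() for _ in range(4)]
```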
Counters
Two types of counters:
- Reception (rDMA) counters
  - Base address: associated virtual address in main memory; incoming data is stored relative to this address
  - Max address: associated virtual address in main memory; defines the “memory reception window”
  - Counter value: is decreased by the DMA when receiving a message associated with the counter
- Injection (iDMA) counters
  - Base address: associated virtual address in main memory; data is sent relative to this address
  - Counter value: is decreased by the DMA when sending a message associated with the counter
Descriptors
[Figure: layout of a message descriptor with the fields Target X,Y,Z; Hints, …; Inj. counter Grp; Inj. counter ID; Send offset; Rec. counter Grp; Rec. counter ID; Recv offset; Msg. length.]
A message descriptor contains (direct-put):
- Coordinates of the destination node (nonlocal comm.)
- Send offset relative to inj. counter base address
- Injection counter group and ID
- Reception offset relative to rec. counter base address (on destination node)
- Reception counter group and ID (on destination node)
- Message length
SPI software layer
SPI contains low level function calls to
- access and manipulate fifos, counters etc.
- create descriptors, inject descriptors
- use the global interrupt and collective network
Counters and descriptors
For e.g. a “direct-put” using SPI one could proceed as follows:
1. Allocate 1 injection fifo, 1 reception and 1 injection counter (if you do not have them already)
2. Set the injection counter base address, e.g. to the smallest virtual address of the data that you want to communicate
3. Set the injection counter value to the number of bytes you intend to send
4. (destination node) Set the reception counter base address to the smallest virtual address where you want to store received data
5. Set the reception counter max address
6. Set the reception counter value to the number of bytes sent
For each contiguous chunk of data, inject 1 descriptor into the inj. fifo:
1. Calculate the address offset relative to the injection counter base
2. Calculate the address offset relative to the reception counter base
3. Give the message size
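The bookkeeping of a direct-put can be mimicked with a toy model in which memory is a `bytearray` and addresses are plain indices. Nothing here is the SPI API: `Counter`, `Descriptor` and `direct_put` are hypothetical names, and destination coordinates as well as counter groups and IDs are left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Counter:
    base: int                        # base address (index into the toy memory)
    value: int = 0                   # bytes still outstanding
    max_addr: Optional[int] = None   # reception counters only: end of the window

@dataclass
class Descriptor:
    send_offset: int    # relative to the injection counter base
    recv_offset: int    # relative to the reception counter base
    length: int         # message length in bytes

def direct_put(src_mem, dst_mem, inj, rec, desc):
    """Copy one contiguous chunk and decrement both counters, as the DMA would."""
    src = inj.base + desc.send_offset
    dst = rec.base + desc.recv_offset
    if rec.max_addr is not None and dst + desc.length > rec.max_addr:
        raise ValueError("message leaves the reception window")
    dst_mem[dst:dst + desc.length] = src_mem[src:src + desc.length]
    inj.value -= desc.length
    rec.value -= desc.length

src_mem = bytearray(range(32))       # sender memory holding bytes 0..31
dst_mem = bytearray(32)              # receiver memory, initially zero
inj = Counter(base=4, value=8)       # smallest address of the data, 8 bytes to send
rec = Counter(base=0, value=8, max_addr=8)
# Two contiguous chunks (bytes 4..7 and 16..19), hence two descriptors.
for desc in (Descriptor(0, 0, 4), Descriptor(12, 4, 4)):
    direct_put(src_mem, dst_mem, inj, rec, desc)
done = inj.value == 0 and rec.value == 0   # completion is detected by polling
```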
Summary: kernel persistent communication
Setting up the persistent communication for the kernel with SPI:
1. Allocate 6+2 injection fifos, 2 reception and 2 injection counters
2. Set the reception counter base and max and the injection counter base addresses (same on every node)
3. Deactivate fifos and counters
4. Calculate the offsets and message sizes etc. (block-strided-move) and inject the descriptors
5. Move each fifo head to the fifo tail
6. Activate fifos and counters
To start the communication:
1. Set the reception and injection counter values
2. Move the fifo heads to the fifo start
And the DMA does its work in the background. Poll counters to make sure communication is completed.
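The reason this scheme is cheap to restart can be seen in a small model: the descriptors are written once, and every later communication only rewinds the head and resets the counter. The sketch is an assumption-laden toy (descriptors are reduced to their message length, the counter is a dictionary, and the DMA is a function call instead of background hardware).

```python
class PersistentFifo:
    """Toy fifo whose descriptors are injected once and replayed many times."""

    def __init__(self, message_lengths):
        self.slots = list(message_lengths)   # setup step 4: inject descriptors
        self.tail = len(self.slots)
        self.head = self.tail                # setup step 5: head == tail, idle

    def start(self):
        self.head = 0                        # start step 2: head back to fifo start

    def dma_run(self, counter):
        # Stands in for the DMA working through the fifo in the background.
        while self.head != self.tail:
            counter["value"] -= self.slots[self.head]
            self.head += 1

fifo = PersistentFifo([64, 64, 128])
counter = {"value": 0}
fifo.dma_run(counter)                        # idle fifo: nothing is sent
idle_value = counter["value"]

completed = []
for _ in range(3):                           # e.g. three kernel applications
    counter["value"] = 256                   # start step 1: set counter value
    fifo.start()
    fifo.dma_run(counter)
    completed.append(counter["value"] == 0)  # poll: zero means all data moved
```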
Using the “Double Hummer” FPU
Problems
Making efficient use of the “Double Hummer” FPU is not easy:
- Secondary FPU and secondary registers are only accessible through oedipus instructions
- Oedipus instructions always work on 16 Byte aligned 16 Byte quadwords = 2 double precision floating point numbers
  → User has to make sure data is aligned
- Compilers frequently fail to rearrange code properly
  → “Simdization” fails
⇒ To get maximum performance one needs to go low level
Fast and easy: XL intrinsics
- IBM XL compilers provide built-in functions (intrinsics) that map to (e.g. floating point) assembly instructions, e.g. __lfpd, __stfpd, __fpmadd and __dcbt
- Intrinsics operate on “double _Complex” variables that map to registers
- The number of variables is not limited (unlike the number of registers)
- Scheduling will be done by the compiler
⇒ Comparatively easy to use.
- LQCD code has large parts optimized using intrinsics
- Use a set of macros that implement basic mathematical operations
- With macros optimization is almost trivial
Example: daxpy
[Figure: daxpy performance in flops/cycle (0 to 1) versus vector length in KBytes (0 to 32). Legend: 450d unrolled, aligned; ESSL BG.]
Example: caxpy
[Figure: caxpy performance in flops/cycle (0 to 1.4) versus vector length in KBytes (0 to 32). Legend: 450d; 450d unrolled, aligned; intrinsics.]
Faster and annoying: assembly
When using the intrinsics, the compiler has great influence on the performance.
- For more control use (gcc inline) assembly
- Kernel serial code written in assembly
- Uses explicit prefetches (dcbt)
- All scheduling and register allocation done by hand
- No problem if all that is already there (BGL kernel)
⇒ Performance typically another 10% better compared to intrinsics
⇒ Code generation typically 10 times slower compared to intrinsics
Kernel performance
Wilson kernel performance (strong scaling)
[Figure: Wilson kernel performance in TFlop/s (0 to 80) versus number of cores (4K, 16K, 32K, 64K), i.e. 1, 4, 8 and 16 BG/P racks, with a reference line at 37.5% of peak.]
Wilson kernel shows
- almost perfect strong scaling,
- a large scaling range,
- perfect weak scaling,
- and reaches 37.5% of absolute peak
Performance comparison BGL/BGP
[Figure, two panels: performance in % of peak (left axis) and kernel memory footprint in MByte (right axis, 0 to 10) versus number of cores, for single and double precision (curves: perf single, perf double, mem single, mem double). Left panel: BG/L with 2K, 8K, 16K cores (1, 4, 8 racks), performance axis 0 to 35. Right panel: BG/P with 4K, 16K, 32K, 64K cores (1, 4, 8, 16 racks), performance axis 0 to 50.]
- Optimized comm. + calc.
- Large scaling region
Performance of mixed precision inverters
Mixed precision via iterative refinement
- Single precision kernel has best performance and largest scaling region
- Valence and sea sector calculations require double precision accuracy
- Solution: mixed precision inverters
Use a less precise inverse $a^{-1} \approx A^{-1}$:
1. Compute $r = b - A x_0$
2. Compute $a^{-1} r = A^{-1} r + r_s$, with $|r_s| = \epsilon |r|$
3. Update $x$: $x_1 = x_0 + a^{-1} r$
⇒ $|b - A x_1| = |b - A x_0 - A a^{-1} r| = |r - A A^{-1} r - r_s| = \epsilon |r|$
- This can be used to speed up a generic CG
- Precision of $a^{-1}$ can be tuned
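The refinement loop can be demonstrated without any single precision arithmetic by letting a deliberately sloppy solver play the role of $a^{-1}$. In the sketch below that role is taken by a CG stopped at relative accuracy `eps_inner`; the test matrix (a small tridiagonal, positive definite one), the tolerances and all function names are assumptions made for the example.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return dot(x, x) ** 0.5

def cg(A, b, rel_tol, max_iter=1000):
    """Plain CG, stopped once the recursive residual is below rel_tol * |b|."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    target = (rel_tol * norm(b)) ** 2
    for _ in range(max_iter):
        if rr <= target:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def refine(A, b, eps_inner=1e-3, tol=1e-12, max_outer=20):
    """Outer loop in full precision, inner solves only to accuracy eps_inner."""
    x = [0.0] * len(b)
    history = []
    for _ in range(max_outer):
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]   # step 1
        history.append(norm(r) / norm(b))
        if history[-1] <= tol:
            break
        dx = cg(A, r, eps_inner)                             # step 2: a^{-1} r
        x = [xi + di for xi, di in zip(x, dx)]               # step 3
    return x, history

n = 20
A = [[2.5 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0 + 0.1 * i for i in range(n)]
x, history = refine(A, b)
```

`history` records the relative residual before each outer step; each inexact inner solve gains roughly a factor `eps_inner`, so a handful of cheap solves reaches the full target accuracy.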
Mixed precision GMRESR
relGMRESR(A, b, ε)
{computes x with ‖Ax − b‖ ≤ ε · ‖b‖ via relaxed GMRESR}
x = 0; {initial value}
r = b;
U = C = []; {empty matrix}
while ‖r‖ > ε · ‖b‖ do
    solve Au = r to relative accuracy ξ {preconditioner}
    compute c with ‖Au − c‖ ≤ ε · ‖b‖ · ‖u‖/‖r‖;
    for i = 1:size(C,2) do
        β = C[:, i]† · c;
        c = c − β · C[:, i];
        u = u − β · U[:, i];
    end for
    ν = ‖c‖; c = c/ν; u = u/ν; {scale u by the norm of c before normalization}
    C = [C, c]; U = [U, u];
    α = c† · r;
    x = x + α · u;
    r = r − α · c;
end while
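A runnable sketch of the loop above, for real matrices and in pure Python. Two simplifications are assumed: the product c = Au is computed exactly instead of to the relaxed accuracy, and the inner "preconditioner" solve is a simple minimal residual iteration stopped at relative accuracy ξ, chosen only to keep the example short (the talk uses a lower precision inverter there). The test matrix and all function names are hypothetical.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return dot(x, x) ** 0.5

def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

def inner_solve(A, r, xi, max_iter=200):
    """Approximate solve of A u = r by minimal residual steps (assumption)."""
    u = [0.0] * len(r)
    res = list(r)
    target = xi * norm(r)
    for _ in range(max_iter):
        if norm(res) <= target:
            break
        Ares = matvec(A, res)
        omega = dot(res, Ares) / dot(Ares, Ares)
        u = axpy(omega, res, u)
        res = axpy(-omega, Ares, res)
    return u

def rel_gmresr(A, b, eps, xi=0.1, max_outer=50):
    x = [0.0] * len(b)
    r = list(b)
    U, C = [], []
    nb = norm(b)
    while norm(r) > eps * nb and len(C) < max_outer:
        u = inner_solve(A, r, xi)            # preconditioner
        c = matvec(A, u)                     # exact here, relaxed in the original
        for ci, ui in zip(C, U):             # orthogonalize c against previous c's
            beta = dot(ci, c)
            c = axpy(-beta, ci, c)
            u = axpy(-beta, ui, u)
        nc = norm(c)                         # scale c and u by the same factor
        c = [v / nc for v in c]
        u = [v / nc for v in u]
        C.append(c)
        U.append(u)
        alpha = dot(c, r)
        x = axpy(alpha, u, x)
        r = axpy(-alpha, c, r)
    return x

n = 30
A = [[3.0 if i == j else (-1.0 if j == i + 1 else (-0.5 if j == i - 1 else 0.0))
      for j in range(n)] for i in range(n)]
b = [1.0] * n
x = rel_gmresr(A, b, eps=1e-10)
true_rel_res = norm(axpy(-1.0, matvec(A, x), b)) / norm(b)
```

The relation c = Au is preserved by every update inside the loop, which is why x and r can be updated consistently with the same coefficient α.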
Mixed precision inverter performance
[Figure, two panels with a logarithmic vertical axis from 1 down to 0.0001 versus time. Left panel: Wilson, CG-64 vs. CG-mix, time axis 0 to 60 sec. Right panel: overlap, CG-64 vs. GMRES-mix, time axis 0 to 25 in units of 1000 sec.]
Conclusion and Outlook: LQCD software
Blue Gene/L and Blue Gene/P low level software layer
- is the core of all the simulation codes used within BMW projects on Blue Gene
- contains all performance critical routines
- extends, combined with mixed action inverters, the scaling region of the code to tens of thousands of CPUs
- delivers up to 37.5% sustained performance
- has been successfully used within the LQCD simulations to test and verify the Blue Gene systems at Jülich
In the future
- further optimize software layer
- include support for CELL BE
Excuse me? THEORETIC? Surely you must be joking... Boeing!