Recent advances on the solution phase of direct solvers with...
Transcript of Recent advances on the solution phase of direct solvers with...
Recent advances on the solution phaseof direct solvers with multiple sparseright-hand sides
P. Amestoy1, J.-Y. L’Excellent2, G. Moreau2
1. Universite de Toulouse INPT and IRIT , 2. Universite de Lyon, Inria and LIP-ENS Lyon ,[email protected]
Mumps User’s Days, Montbonnot, June 1-2 2017
SIAM CSE 17, Altanta, Georgia
Introduction
ோߝ��������������������������௫ܧ ൌ 10
ൌ, െ
|, |
�௫ܧ����������
����������������������������������������������������������������������������
��
Linear systems of equations :Ax = b, A is sparseSolve phase (Ly = b,Ux = y) maybe critical.
01234D
epth
(km
)
0Dip (km)
5
10
15
20
Cross (
km)
5 10 15 20
3000 4000 5000 6000m/s
Application coming from Helmholtz or Maxwell equations:
name n (million) nrhs nnz/nrhs Tfacto Tsolve
sei70m 2.9 2302 587 1258 1267
sei50m 7.1 2302 486 6289 2985
E1 0.33 8000 9.8 55.2 291
E3 2.8 8000 7.5 1951 5610
Table: Characteristics of matrices and right-hand-sides.
2 / 19
Introduction
Objectives:
• focus on the forward solution phase Ly = b;
• exploit sparsity of right-hand-sides;
• limit the number of operations (∆);
3 / 19
Overview
Exploitation of sparse right-hand-sidesContext of studyTree pruningExploitation of subintervals of columns at each node
Minimizing the number of operationsPermutation of columnsAdapted blocking technique
Conclusion
4 / 19
Context: analysis phase
Ordering: reorder variables of the matrix A to reduce fill-in and buildelimination tree:
• Nested Dissection ⇒ build tree of separators.
1
4
2
5
10
11
13
143
6
12
15
7
9
8
16
18
17
19
20
21
2223
24
25
26
27
u15
u14
u13u10
u7
u6u3
u7
•••••••••••••••••••••••••••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
• •
•
•
•
•
•••
••
•••• • • •
◦
◦ ◦
◦
◦ ◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦
◦ ◦
◦
◦ ◦◦
◦
◦ ◦◦
◦ ◦
◦ ◦◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦ ◦
◦
◦
◦ ◦◦ ◦◦ ◦◦ ◦ ◦◦ ◦◦ ◦◦
◦ ◦ ◦
◦◦
◦◦◦
◦◦
◦
◦◦
◦◦
◦
◦
◦◦
◦
◦◦◦◦◦ ◦
3D physical domain (cube)
separator tree L factor
5 / 19
Context: analysis phase
Ordering: reorder variables of the matrix A to reduce fill-in and buildelimination tree:
• Nested Dissection ⇒ build tree of separators.
1
4
2
5
10
11
13
143
6
12
15
7
9
8
16
18
17
19
20
21
2223
24
25
26
27
u15
u14
u13u10
u7
u6u3
u7
•••••••••••••••••••••••••••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
• •
•
•
•
•
•••
••
•••• • • •
◦
◦ ◦
◦
◦ ◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦
◦ ◦
◦
◦ ◦◦
◦
◦ ◦◦
◦ ◦
◦ ◦◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦ ◦
◦
◦
◦ ◦◦ ◦◦ ◦◦ ◦ ◦◦ ◦◦ ◦◦
◦ ◦ ◦
◦◦
◦◦◦
◦◦
◦
◦◦
◦◦
◦
◦
◦◦
◦
◦◦◦◦◦ ◦
3D physical domain (cube) separator tree
L factor
5 / 19
Context: analysis phase
Ordering: reorder variables of the matrix A to reduce fill-in and buildelimination tree:
• Nested Dissection ⇒ build tree of separators.
1
4
2
5
10
11
13
143
6
12
15
7
9
8
16
18
17
19
20
21
2223
24
25
26
27
u15
u14
u13u10
u7
u6u3
u7
•••••••••••••••••••••••••••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
• •
•
•
•
•
•••
••
•••• • • •
◦
◦ ◦
◦
◦ ◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦
◦ ◦
◦
◦ ◦◦
◦
◦ ◦◦
◦ ◦
◦ ◦◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦ ◦
◦
◦
◦ ◦◦ ◦◦ ◦◦ ◦ ◦◦ ◦◦ ◦◦
◦ ◦ ◦
◦◦
◦◦◦
◦◦
◦
◦◦
◦◦
◦
◦
◦◦
◦
◦◦◦◦◦ ◦
3D physical domain (cube) separator tree
L factor
5 / 19
Context: analysis phase
Ordering: reorder variables of the matrix A to reduce fill-in and buildelimination tree:
• Nested Dissection ⇒ build tree of separators.
1
4
2
5
10
11
13
143
6
12
15
7
9
8
16
18
17
19
20
21
2223
24
25
26
27
u15
u14
u13u10
u7
u6u3
u7
•••••••••••••••••••••••••••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
• •
•
•
•
•
•••
••
•••• • • •
◦
◦ ◦
◦
◦ ◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦
◦ ◦
◦
◦ ◦◦
◦
◦ ◦◦
◦ ◦
◦ ◦◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦ ◦
◦
◦
◦ ◦◦ ◦◦ ◦◦ ◦ ◦◦ ◦◦ ◦◦
◦ ◦ ◦
◦◦
◦◦◦
◦◦
◦
◦◦
◦◦
◦
◦
◦◦
◦
◦◦◦◦◦ ◦
3D physical domain (cube) separator tree
L factor
5 / 19
Context: analysis phase
Ordering: reorder variables of the matrix A to reduce fill-in and buildelimination tree:
• Nested Dissection ⇒ build tree of separators.
1
4
2
5
10
11
13
143
6
12
15
7
9
8
16
18
17
19
20
21
2223
24
25
26
27
u15
u14
u13u10
u7
u6u3
u7
•••••••••••••••••••••••••••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
• •
•
•
•
•
•••
••
•••• • • •
◦
◦ ◦
◦
◦ ◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦
◦ ◦
◦
◦ ◦◦
◦
◦ ◦◦
◦ ◦
◦ ◦◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦ ◦
◦
◦
◦ ◦◦ ◦◦ ◦◦ ◦ ◦◦ ◦◦ ◦◦
◦ ◦ ◦
◦◦
◦◦◦
◦◦
◦
◦◦
◦◦
◦
◦
◦◦
◦
◦◦◦◦◦ ◦
3D physical domain (cube) separator tree L factor
5 / 19
Context: analysis phase
Ordering: reorder variables of the matrix A to reduce fill-in and buildelimination tree:
• Nested Dissection ⇒ build tree of separators.
1
4
2
5
10
11
13
143
6
12
15
7
9
8
16
18
17
19
20
21
2223
24
25
26
27
u15
u14
u13u10
u7
u6u3
u7
•••••••••••••••••••••••••••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
• •
•
•
•
•
•••
••
•••• • • •
◦
◦ ◦
◦
◦ ◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦
◦ ◦
◦
◦ ◦◦
◦
◦ ◦◦
◦ ◦
◦ ◦◦
◦
◦
◦
◦
◦
◦◦
◦
◦
◦ ◦
◦
◦
◦ ◦◦ ◦◦ ◦◦ ◦ ◦◦ ◦◦ ◦◦
◦ ◦ ◦
◦◦
◦◦◦
◦◦
◦
◦◦
◦◦
◦
◦
◦◦
◦
◦◦◦◦◦ ◦
3D physical domain (cube) separator tree L factor
5 / 19
Forward solution process
u15
u14u7
0
u15
u7
u14
Block operations:
• y1 ← L−111 b1
• b2 ← b2 − L21y1
#flops for node u is given by:
Fu = 2 ∗ (#entries in L11 + L21)
Total #flops: ∆ =∑u∈TFu
6 / 19
Forward solution process
u15
u14u7
0
u15
u7
u14
Block operations:
• y1 ← L−111 b1
• b2 ← b2 − L21y1
#flops for node u is given by:
Fu = 2 ∗ (#entries in L11 + L21)
Total #flops: ∆ =∑u∈TFu
6 / 19
Forward solution process
u15
u14u7
0
u15
u7
u14
Block operations:
• y1 ← L−111 b1
• b2 ← b2 − L21y1
#flops for node u is given by:
Fu = 2 ∗ (#entries in L11 + L21)
Total #flops: ∆ =∑u∈TFu
6 / 19
Forward solution process
u15
u14u7
0
u15
u7
u14
Block operations:
• y1 ← L−111 b1
• b2 ← b2 − L21y1
#flops for node u is given by:
Fu = 2 ∗ (#entries in L11 + L21)
Total #flops: ∆ =∑u∈TFu
6 / 19
Forward solution process
u15
u14u7
0
u15
u7
u14
Block operations:
• y1 ← L−111 b1
• b2 ← b2 − L21y1
#flops for node u is given by:
Fu = 2 ∗ (#entries in L11 + L21)
Total #flops: ∆ =∑u∈TFu
6 / 19
Forward solution process
u15
u14u7
0
u15
u7
u14
Block operations:
• y1 ← L−111 b1
• b2 ← b2 − L21y1
#flops for node u is given by:
Fu = 2 ∗ (#entries in L11 + L21)
Total #flops: ∆ =∑u∈TFu
6 / 19
Forward solution process
Forward solve phase processes the tree from bottom to top:
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
×
×
×ff
ff × : initial non-zeros
f : filled non-zeros
u15
u14
u13
u12u11
u10
u9u8
u7
u6
u5u4
u3
u2u1
Computation follows paths in the tree T [Gilbert, 1994].
↪→ Tree pruning (T → Tp(b)) to reduce computation:
∆ =∑
u∈Tp(b)
Fu
7 / 19
Forward solution process
Forward solve phase processes the tree from bottom to top:
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
×
×
×ff
ff × : initial non-zeros
f : filled non-zeros
u15
u14
u13
u12u11
u10
u9u8
u7
u6
u5u4
u3
u2u1
Computation follows paths in the tree T [Gilbert, 1994].
↪→ Tree pruning (T → Tp(b)) to reduce computation:
∆ =∑
u∈Tp(b)
Fu
7 / 19
Forward solution process
Forward solve phase processes the tree from bottom to top:
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
×
×
×ff
ff × : initial non-zeros
f : filled non-zeros
u15
u14
u13
u12u11
u10
u9u8
u7
u6
u5u4
u3
u2u1
Computation follows paths in the tree T [Gilbert, 1994].
↪→ Tree pruning (T → Tp(b)) to reduce computation:
∆ =∑
u∈Tp(b)
Fu
7 / 19
Forward solution process
Forward solve phase processes the tree from bottom to top:
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
×
×
×ff
ff × : initial non-zeros
f : filled non-zeros
u15
u14
u13
u12u11
u10
u9u8
u7
u6
u5u4
u3
u2u1
Computation follows paths in the tree T [Gilbert, 1994].
↪→ Tree pruning (T → Tp(b)) to reduce computation:
∆ =∑
u∈Tp(b)
Fu
7 / 19
Forward solution process
Forward solve phase processes the tree from bottom to top:
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
×
×
×ff
ff × : initial non-zeros
f : filled non-zeros
u15
u14
u13
u12u11
u10
u9u8
u7
u6
u5u4
u3
u2u1
Computation follows paths in the tree T [Gilbert, 1994].
↪→ Tree pruning (T → Tp(b)) to reduce computation:
∆ =∑
u∈Tp(b)
Fu
7 / 19
Forward solution process
Forward solve phase processes the tree from bottom to top:
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
×
×
×ff
ff × : initial non-zeros
f : filled non-zeros
u15
u14
u13
u12u11
u10
u9u8
u7
u6
u5u4
u3
u2u1
Computation follows paths in the tree T [Gilbert, 1994].
↪→ Tree pruning (T → Tp(b)) to reduce computation:
∆ =∑
u∈Tp(b)
Fu
7 / 19
Exposition of padded zeros
When B is a matrix with multiple columns:
• use of BLAS 3 operations for efficiency;
• Tp(B) =⋃Tp(Bi ), where Bi is column i of B;
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
nrhs
1 2 3 4 5 6
u15
u14
u13
u12u11
u10
u9u8
u7
u6
u5u4
u3
u2u1
But still, extra computations are done ...
∆ = nrhs ×∑
u∈Tp(B)Fu
8 / 19
Exposition of padded zeros
When B is a matrix with multiple columns:
• use of BLAS 3 operations for efficiency;
• Tp(B) =⋃Tp(Bi ), where Bi is column i of B;
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
nrhs
1 2 3 4 5 6
u15
u14
u13
u12u11
u10
u9u8
u7
u6
u5u4
u3
u2u1
But still, extra computations are done ...
∆ = nrhs ×∑
u∈Tp(B)Fu
8 / 19
Solutions
What are the possible alternatives ?
• Indirections: rebuilding data structures;
• Sequential: solution phase on each column⇒ optimal (∆ = ∆min) but not efficient;
• Regular blocking: how to build blocks ?◦ minimal access to factors (out of core) [Amestoy et al.,SISC,2012];◦ minimal number of operations (in core) [Yamazaki et al.,2013];
• Exploitation of subintervals of columns at each node [Amestoy et
al.,SISC,2015].
9 / 19
Solutions
What are the possible alternatives ?
• Indirections: rebuilding data structures;
• Sequential: solution phase on each column⇒ optimal (∆ = ∆min) but not efficient;
• Regular blocking: how to build blocks ?◦ minimal access to factors (out of core) [Amestoy et al.,SISC,2012];◦ minimal number of operations (in core) [Yamazaki et al.,2013];
• Exploitation of subintervals of columns at each node [Amestoy et
al.,SISC,2015].
9 / 19
Work on node subintervals
Let u ∈ T :
Active columns at node u
Zu = {i ∈ {1, . . . ,m} | u ∈ Tp(Bi )}
Subinterval is given by:
θu = max(Zu)−min(Zu) + 1
Example: θu1 = 1, θu10 = 6
∆ =∑
u∈Tp(B)
Fu × θu
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
1 2 3 4 5 6θ1 = 1θ2 = 1θ3 = 4θ4 = 1θ5 = 1θ6 = 4θ7 = 5θ8 = 6θ9 = 0θ10 = 6θ11 = 1θ12 = 1θ13 = 3θ14 = 6θ15 = 6
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
1 4 2 5 6 3θ1 = 1θ2 = 1θ3 = 2θ4 = 1θ5 = 1θ6 = 2θ7 = 4θ8 = 5θ9 = 0θ10 = 5θ11 = 1θ12 = 1θ13 = 3θ14 = 6θ15 = 6
↪→ ∆ is extremely dependant on column permutation.
10 / 19
Work on node subintervals
Let u ∈ T :
Active columns at node u
Zu = {i ∈ {1, . . . ,m} | u ∈ Tp(Bi )}
Subinterval is given by:
θu = max(Zu)−min(Zu) + 1
Example: θu1 = 1, θu10 = 6
∆ =∑
u∈Tp(B)
Fu × θu
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
1 2 3 4 5 6θ1 = 1θ2 = 1θ3 = 4θ4 = 1θ5 = 1θ6 = 4θ7 = 5θ8 = 6θ9 = 0θ10 = 6θ11 = 1θ12 = 1θ13 = 3θ14 = 6θ15 = 6
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
1 4 2 5 6 3θ1 = 1θ2 = 1θ3 = 2θ4 = 1θ5 = 1θ6 = 2θ7 = 4θ8 = 5θ9 = 0θ10 = 5θ11 = 1θ12 = 1θ13 = 3θ14 = 6θ15 = 6
↪→ ∆ is extremely dependant on column permutation.
10 / 19
Work on node subintervals
Let u ∈ T :
Active columns at node u
Zu = {i ∈ {1, . . . ,m} | u ∈ Tp(Bi )}
Subinterval is given by:
θu = max(Zu)−min(Zu) + 1
Example: θu1 = 1, θu10 = 6
∆ =∑
u∈Tp(B)
Fu × θu
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
1 2 3 4 5 6θ1 = 1θ2 = 1θ3 = 4θ4 = 1θ5 = 1θ6 = 4θ7 = 5θ8 = 6θ9 = 0θ10 = 6θ11 = 1θ12 = 1θ13 = 3θ14 = 6θ15 = 6
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
1 4 2 5 6 3θ1 = 1θ2 = 1θ3 = 2θ4 = 1θ5 = 1θ6 = 2θ7 = 4θ8 = 5θ9 = 0θ10 = 5θ11 = 1θ12 = 1θ13 = 3θ14 = 6θ15 = 6
↪→ ∆ is extremely dependant on column permutation.
10 / 19
Work on node subintervals
Let u ∈ T :
Active columns at node u
Zu = {i ∈ {1, . . . ,m} | u ∈ Tp(Bi )}
Subinterval is given by:
θu = max(Zu)−min(Zu) + 1
Example: θu1 = 1, θu10 = 6
∆ =∑
u∈Tp(B)
Fu × θu
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
1 2 3 4 5 6θ1 = 1θ2 = 1θ3 = 4θ4 = 1θ5 = 1θ6 = 4θ7 = 5θ8 = 6θ9 = 0θ10 = 6θ11 = 1θ12 = 1θ13 = 3θ14 = 6θ15 = 6
×
f
f×
f
f×
×
ff
f
×
×ff
×f
f
f
×ff
×ff×
×
f
ff
1 4 2 5 6 3θ1 = 1θ2 = 1θ3 = 2θ4 = 1θ5 = 1θ6 = 2θ7 = 4θ8 = 5θ9 = 0θ10 = 5θ11 = 1θ12 = 1θ13 = 3θ14 = 6θ15 = 6
↪→ ∆ is extremely dependant on column permutation.
10 / 19
Problem statement & algorithms
Goal is to minimize or decrease ∆ =∑
u∈Tp(B)Fu × θu:
• find permutation σ of columns to decrease θu,∀u ∈ Tp(B);
• in case of blocking, minimize the number of blocks.
Proposed heuristics:
• based on geometrical properties (Nested Dissection);
• generalization possible thanks to pruned tree Tp(B).
11 / 19
Problem statement & algorithms
Goal is to minimize or decrease ∆ =∑
u∈Tp(B)Fu × θu:
• find permutation σ of columns to decrease θu,∀u ∈ Tp(B);
• in case of blocking, minimize the number of blocks.
Proposed heuristics:
• based on geometrical properties (Nested Dissection);
• generalization possible thanks to pruned tree Tp(B).
11 / 19
Flat Tree Algorithm
Intuition based on a simple 2D example:
2D physical domain
u0
u1 u2a b c
d
e
f
g
h
i
j
k
l
structure of B
a bc cbd e f g h i j k l
T [u1]
T [u2]
u0
T [u11]
T [u12]
u1
T [u21]
T [u12]
u2u0
×
f
×
×
f×
×
×
×
×
×f
×
f
f
×
××
f
×f
f
×
f
×
f×
×
×××
×××
×f
×f×
×
ff
×
××f
×ff
u0
u2
u22u21
u1
u12u11
elimination tree
• Nested Dissection ⇒ partition right-hand-sides into 3 sets (a, b, c);
• θu1 = a + c + b
⇒ θu1 = a + b;
↪→ Top-down approach + local optimisation for the nodes at thecurrent layer in the tree.
12 / 19
Flat Tree Algorithm
Intuition based on a simple 2D example:
2D physical domain
u0
u1 u2a b c
d
e
f
g
h
i
j
k
l
structure of B
a bc cbd e f g h i j k l
T [u1]
T [u2]
u0
T [u11]
T [u12]
u1
T [u21]
T [u12]
u2u0
×
f
×
×
f×
×
×
×
×
×f
×
f
f
×
××
f
×f
f
×
f
×
f×
×
×××
×××
×f
×f×
×
ff
×
××f
×ff
u0
u2
u22u21
u1
u12u11
elimination tree
• Nested Dissection ⇒ partition right-hand-sides into 3 sets (a, b, c);
• θu1 = a + c + b ⇒ θu1 = a + b;
↪→ Top-down approach + local optimisation for the nodes at thecurrent layer in the tree.
12 / 19
Flat Tree Algorithm
Intuition based on a simple 2D example:
2D physical domain
u0
u1 u2a b c
d
e
f
g
h
i
j
k
l
structure of B
a bc cbd e f g h i j k l
T [u1]
T [u2]
u0
T [u11]
T [u12]
u1
T [u21]
T [u12]
u2u0
×
f
×
×
f×
×
×
×
×
×f
×
f
f
×
××
f
×f
f
×
f
×
f×
×
×××
×××
×f
×f×
×
ff
×
××f
×ff
u0
u2
u22u21
u1
u12u11
elimination tree
• Nested Dissection ⇒ partition right-hand-sides into 3 sets (a, b, c);
• θu1 = a + c + b ⇒ θu1 = a + b;
↪→ Top-down approach + local optimisation for the nodes at thecurrent layer in the tree.
12 / 19
Results on the Flat Tree
flops: normalized with the dense case; Ordering: Nested Dissection;
E1 E3 sei70m sei50mMatrices
0.0
0.1
0.2
0.3
0.4
0.5
0.6
Norm
aliz
ed fl
ops
TP INT PO FT LB
Strategies:
• TP = tree pruning only;
• INT = tree pruning + nodeinterval+natural order;
• PO = tree pruning+nodeinterval+Postorder;
• FT = tree pruning+node interval+FlatTree;
• LB = Lower Bound (∆min).
↪→ Still 28% above the lower bound on one case.
13 / 19
Adapted blocking technique
Objective: decrease ∆ with the creation of a minimum number ofgroups.
×f
f
f
×
ff
f
×
f
f×
f
f×
×ff
×ff×
×
f
ff
×
×ff
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
4 2 1 5 6 34
×f
f
f
2
×
ff
f
1×
f
f×
f
f×
5
×ff
×ff×
6
×
f
ff
3
×
×ff
4 2 6 3
×f
f
f
×
ff
f
×
f
ff
×
×ff
1 5×
f
f×
f
f×
×ff
×ff×
Computations on explicit zeros still exist.
• however, not performant (loss of BLAS 3 operations);
• need to find some property to group right hand sides togetherwithout introducing extra operations.
14 / 19
Adapted blocking technique
Objective: decrease ∆ with the creation of a minimum number ofgroups.
×f
f
f
×
ff
f
×
f
f×
f
f×
×ff
×ff×
×
f
ff
×
×ff
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
4 2 1 5 6 34
×f
f
f
2
×
ff
f
1×
f
f×
f
f×
5
×ff
×ff×
6
×
f
ff
3
×
×ff
4 2 6 3
×f
f
f
×
ff
f
×
f
ff
×
×ff
1 5×
f
f×
f
f×
×ff
×ff×
∆min may be obtain by creating nrhs groups:
• however, not performant (loss of BLAS 3 operations);
• need to find some property to group right hand sides togetherwithout introducing extra operations.
14 / 19
Adapted blocking technique
Objective: decrease ∆ with the creation of a minimum number ofgroups.
×f
f
f
×
ff
f
×
f
f×
f
f×
×ff
×ff×
×
f
ff
×
×ff
u1u2u3u4u5u6u7u8u9u10u11u12u13u14u15
4 2 1 5 6 34
×f
f
f
2
×
ff
f
1×
f
f×
f
f×
5
×ff
×ff×
6
×
f
ff
3
×
×ff
4 2 6 3
×f
f
f
×
ff
f
×
f
ff
×
×ff
1 5×
f
f×
f
f×
×ff
×ff×
∆min may be obtain by creating nrhs groups:
• however, not performant (loss of BLAS 3 operations);
• need to find some property to group right hand sides togetherwithout introducing extra operations.
14 / 19
Adapted blocking technique
Principle (1): group sets of right hand sides that belong to differentsubdomains (starting with root separator).
a b c
a bc cc b
2D Domain structure of B
• non-zero structure of a and c are disjoint;
• whereas b may have the non-zero structure of both a and c ;
• thus, we extract them.
15 / 19
Adapted blocking technique
Principle (2): extract set of right hand sides that belong to bothsubdomains (starting with root separator).
a b c
a bc cc b
2D Domain structure of B
• non-zero structure of a and c are disjoint;
• whereas b may have the non-zero structure of both a and c ;
• thus, we extract them.
15 / 19
Adapted blocking technique
Principle (2): extract set of right hand sides that belong to bothsubdomains (starting with root separator).
a b c
a bc cc b
2D Domain structure of B
• non-zero structure of a and c are disjoint;
• whereas b may have the non-zero structure of both a and c ;
• thus, we extract them.
15 / 19
Comparison with a regular blocking strategy
Our Blocking algorithm (BLK):
• greedy algorithm to choose nextgroup;
• stop condition: ∆ < ∆tol ,where ∆tol = 1.01∆min.
Regular blocking algorithm(REG):
• split in chunk of regular size;
• stop condition: ∆ < ∆tol ,where ∆tol = 1.01∆min.
nb groups 5Hz 7Hz E1 E3
REG 328 255 363 258
BLK 3 3 4 4
Table: Number of groups created for each strategy with a tolerance such that∆ < 1.01×∆tol .
16 / 19
Conclusion
Achievements:
• implementation of two heuristics (permutation, blocking);
• 90% decrease in flops by exploiting sparsity;
• Up to 40% decrease in time for forward solve w.r.t. INT strategyand Nested Dissection ordering (sequential).
Perspectives:
• adapt the Flat Tree algorithm to unbalanced trees;
• parallelism and sparsity aspects of Flat Tree permutation;
• extend to more general test cases.
17 / 19
Acknowledgements
• LIP laboratory for access to the machines;
• EMGS et SEISCOPE for providing test cases;
• This work was performed within the frameworks of both theMUMPS consortium and the LABEX MILYON(ANR-10-LABX-0070) of Universite de Lyon, within the program”Investissements d’Avenir” (ANR-11-IDEX-0007) operated by theFrench National Research Agency (ANR).
18 / 19
? Thanks!Questions?