Post on 15-Mar-2020
IBM Research – Tokyo
April 20-22, 2016 | COOLChips XIX @ Yokohama, Japan © 2016 IBM Corporation
How SIMD Width Affects Energy Efficiency: A Case Study on Sorting
Hiroshi Inoue
IBM Research – Tokyo
Goal & Approach
Goal:
§ to understand how SIMD width affects execution time and energy consumption
  – not to propose a new energy-efficient algorithm or system
Approach:
§ to take SIMD mergesort as an example
§ to measure execution time, power, and energy (= execution time × power) with various hardware configurations on a commodity PC
  – SIMD width (8-way AVX, 4-way SSE, or 1-way scalar)
  – memory bandwidth
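The relation energy = execution time × power is exact when power stays constant over a run. The slides do not say how individual meter readings were aggregated, so the following is only a minimal sketch, assuming the power meter yields timestamped readings; the sample format and the name `energy_joules` are hypothetical.

```python
def energy_joules(samples):
    """Integrate (time_sec, power_watt) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of one trapezoid: average power times elapsed time.
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Hypothetical trace: a constant 50 W held for 2 s.
trace = [(0.0, 50.0), (1.0, 50.0), (2.0, 50.0)]
print(energy_joules(trace))  # 100.0 J, i.e. 2 s × 50 W
```

With constant power the integral reduces to the product used throughout the talk.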
SIMD mergesort
§ Combines the advantages of sorting networks (SIMD friendly) and the usual mergesort (lower computational complexity)
  – usual comparison-based mergesort in memory
    • computational complexity of O(N log(N))
    • mostly sequential memory accesses
  – vector-register-level bitonic merge operation implemented with SIMD min/max instructions
    • data parallelism
    • fewer conditional branches
→ A wider vector gives a sub-linear reduction in the number of instructions
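One way to see why the reduction is sub-linear: a bitonic merge of two W-lane registers needs log2(2W) compare stages, so the stage count grows each time the width doubles. The rough model below counts only min/max instructions per output element and ignores shuffles, loads, and stores. It is an illustrative assumption with made-up function names, not a measurement from the talk.

```python
def merge_stages(width):
    """Compare stages in a bitonic merge of two width-lane registers."""
    if width < 1 or width & (width - 1):
        raise ValueError("width must be a power of two")
    return (2 * width).bit_length() - 1  # equals log2(2 * width)


def minmax_per_element(width):
    """One min and one max per stage; each merge step emits `width` values."""
    return 2 * merge_stages(width) / width


for w in (4, 8):
    print(w, merge_stages(w), minmax_per_element(w))
```

In this model, going from 4 to 8 lanes cuts the min/max count per element by 1.5x rather than 2x, because the 8-lane network needs four stages instead of three.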
SIMD-based merge for values in two vector registers (example of bitonic merge)
[Figure: a three-stage bitonic merge network (stage 1, stage 2, stage 3) between input and output; each stage performs four "<" comparisons. Example input registers: 1 4 7 8 and 6 5 3 2.]
Input: two vector registers, each containing four presorted values
SIMD merging: one SIMD comparison and "shuffle" operations for each stage, without conditional branches
Output: the eight values in the two vector registers are now sorted
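The three-stage merge can be emulated in plain Python to make the data flow concrete. This is a sketch only: lists stand in for 4-lane vector registers, `lane_min` and `lane_max` stand in for SIMD min/max instructions, and index gathering stands in for the shuffles. All names are illustrative, and the lane layout used by the actual implementation may differ.

```python
def lane_min(a, b):
    """Lane-wise minimum, standing in for a SIMD min instruction."""
    return [min(x, y) for x, y in zip(a, b)]


def lane_max(a, b):
    """Lane-wise maximum, standing in for a SIMD max instruction."""
    return [max(x, y) for x, y in zip(a, b)]


def bitonic_merge_4x2(a, b):
    """Merge two ascending 4-value registers into ascending (low, high)."""
    v = list(a) + list(reversed(b))  # ascending then descending: bitonic
    for dist in (4, 2, 1):           # stage 1, stage 2, stage 3
        # "Shuffle": gather the four lanes on each side of the comparisons.
        left_idx = [i for i in range(8) if (i // dist) % 2 == 0]
        left = [v[i] for i in left_idx]
        right = [v[i + dist] for i in left_idx]
        # One 4-lane min and one 4-lane max per stage, with no branches.
        lo, hi = lane_min(left, right), lane_max(left, right)
        for k, i in enumerate(left_idx):
            v[i], v[i + dist] = lo[k], hi[k]
    return v[:4], v[4:]


print(bitonic_merge_4x2([1, 4, 7, 8], [2, 3, 5, 6]))
# ([1, 2, 3, 4], [5, 6, 7, 8])
```

The second input is reversed before the first stage so that the eight values form a bitonic sequence, which is what lets a fixed pattern of comparisons sort them.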
Evaluation
§ Hardware: a commodity PC + external power meter
  – Core i7 4770 (Haswell), 3.4 GHz, 4 cores, 8 threads
  – one or two 4-GB DDR3-1333 DIMMs (single or dual channel)
  – Yokogawa WT-210 power meter (for system-level power)
  – Red Hat Enterprise Linux 6.5, gcc-5.2
§ Tested algorithms (for sorting 256 M random 32-bit integers)
  – SIMD mergesort with scalar (1-way), SSE (4-way), or AVX (8-way)
  – radix sort (scalar)
  – quicksort (std::sort, scalar)
Summary of observations
1. Execution time
  – Wider SIMD gives larger speedup (up to 10x)
2. Power
  – SIMD increases power by at most 15%
3. Energy (= execution time × power)
  – Lower energy consumption with wider SIMD
4. Power and execution time with lower bandwidth-to-compute ratios
  – Wider SIMD may yield better performance with lower power!
Refer to the paper (not covered today)
§ Energy consumption with various bandwidth-to-compute ratios (achieved using DVFS)
  – Need to balance core compute performance and memory bandwidth to minimize energy consumption
Execution time (scalar vs. SIMD with 1 thread)
[Figure: bar chart of execution time (sec, 0 to 30; lower is faster) with 1, 2, 4, and 8 threads (8 threads = 4 cores with 2-way SMT). Series: mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), radix sort without SIMD.]
SIMD mergesort with 1 thread: 9.7x speedup by 8-way SIMD, 6.8x speedup by 4-way SIMD
✓ Wider SIMD gave larger speedup, as expected
Execution time (scalar vs. SIMD with 8 threads)
[Figure: bar chart of execution time (sec, 0 to 30; lower is faster) with 1, 2, 4, and 8 threads (8 threads = 4 cores with 2-way SMT). Series: mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), radix sort without SIMD.]
With 8 threads: 5.0x speedup by 8-way SIMD, 4.4x speedup by 4-way SIMD
✓ Smaller gains from SIMD due to the memory bandwidth bottleneck
Execution time (8-way vs. 4-way)
[Figure: bar chart of execution time (sec, 0.0 to 4.5; lower is faster) with 1, 2, 4, and 8 threads (8 threads = 4 cores with 2-way SMT). Series: mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), radix sort without SIMD. Annotations: 42% speedup and 14% speedup.]
✓ 8-way SIMD (AVX) gave additional speedups over 4-way SIMD (SSE)
Power
[Figure: bar chart of power (watt, 0 to 120; lower is better) with 1, 2, 4, and 8 threads (8 threads = 4 cores with 2-way SMT). Series: mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), radix sort without SIMD. Annotation: only up to 15% increase in power.]
✓ The increase in power from using SIMD was not significant
Energy (= execution time × power) with 1 thread
[Figure: bar chart of energy (joule, 0 to 1200; lower is better) with 1, 2, 4, and 8 threads (8 threads = 4 cores with 2-way SMT). Series: mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), radix sort without SIMD.]
With 1 thread: 8.8x reduction in energy by 8-way SIMD, 6.4x reduction in energy by 4-way SIMD
✓ Energy consumption was significantly reduced due to shorter execution time
Energy (= execution time × power) with 8 threads
[Figure: bar chart of energy (joule, 0 to 1200; lower is better) with 1, 2, 4, and 8 threads (8 threads = 4 cores with 2-way SMT). Series: mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), radix sort without SIMD.]
With 8 threads: 4.6x reduction in energy by 8-way SIMD, 3.9x reduction in energy by 4-way SIMD
✓ Energy consumption was significantly reduced due to shorter execution time
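Because energy = execution time × power, dividing a speedup by the matching energy reduction gives the implied power ratio between the SIMD and scalar runs. The sketch below applies this to the speedup and energy figures reported on the slides; the function name is illustrative, and the inputs are rounded slide values, so the results are approximate.

```python
def implied_power_ratio(speedup, energy_reduction):
    """Return power_simd / power_scalar from the time and energy ratios."""
    return speedup / energy_reduction


# (speedup, energy reduction) pairs as reported on the slides.
reported = {
    "8-way, 1 thread": (9.7, 8.8),
    "4-way, 1 thread": (6.8, 6.4),
    "8-way, 8 threads": (5.0, 4.6),
    "4-way, 8 threads": (4.4, 3.9),
}
for label, (speedup, reduction) in reported.items():
    print(f"{label}: {implied_power_ratio(speedup, reduction):.2f}x power")
```

Each implied ratio falls between roughly 1.06 and 1.13, in line with the reported power increase of at most 15%.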
Energy (= execution time × power), 8-way vs. 4-way
[Figure: bar chart of energy (joule, 0 to 200; lower is better) with 1, 2, 4, and 8 threads (8 threads = 4 cores with 2-way SMT). Series: mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), radix sort without SIMD.]
38% less energy: 42% less execution time with 3% higher power
16% less energy: 14% less execution time with 2% lower power
✓ Wider SIMD yielded better performance with lower power when using 8 threads
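The energy change follows from the changes in time and power. Here is a small sketch for the case with 14% less execution time and 2% lower power; the function name is illustrative, and the inputs are the rounded slide figures, so the result is approximate.

```python
def energy_change(time_change, power_change):
    """Relative energy change given relative time and power changes."""
    return (1 + time_change) * (1 + power_change) - 1


print(f"{energy_change(-0.14, -0.02):+.1%}")  # -15.7%
```

This comes to about 15.7% less energy, close to the 16% shown on the slide.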
Power and execution time
[Figure: throughput (1/sec, 0.0 to 1.2; higher is faster) versus power (watt, 10 to 110; lower is better) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, and mergesort without SIMD. Points are shown for 1, 2, 4, and 8 threads (8 threads = 4 cores with SMT); idle power is marked. Annotations: "Wider SIMD yields higher performance and power" and "Wider SIMD yields shorter time and lower power".]
✓ Wider SIMD yielded better performance with lower power when using 8 threads
Power and execution time with reduced bandwidth
[Figure: throughput (1/sec, 0.0 to 1.2; higher is faster) versus power (watt, 10 to 110; lower is better) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, and mergesort without SIMD, plotted for two configurations: with 2 memory channels (full bandwidth) and with 1 memory channel (half bandwidth). Annotation: "Wider SIMD yields shorter time and lower power".]
✓ With lower memory bandwidth, the power reduction from SIMD was more significant
Summary & Future work
§ Summary of this study
  – Wider SIMD gives larger speedup and lower energy consumption
  – It can also lower power by reducing the number of instructions when the bandwidth-to-compute ratio is low
  – (It is important to balance core performance and memory bandwidth to achieve the best energy efficiency)
→ Increasing SIMD width will be important for future low-power processors, even with limited bandwidth-to-compute ratios
§ Future work
  – to evaluate with other workloads, especially floating-point-intensive applications