Analyzing the Impact of Data Prefetching on Chip MultiProcessors
Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan
ACSAC-13, August 4, 2008
2
Background
• CMP (Chip MultiProcessor):
  – Several processor cores integrated on a chip
  – High performance through parallel processing
  – New feature: cache-to-cache data transfer
• Limiting factor of CMP performance
  – The memory-wall problem is more critical
    • High frequency of off-chip accesses
    • Bandwidth does not scale with the number of cores
[Figure: a CMP chip with two cores, each with a private L1 cache, sharing an on-chip L2 cache]
Data prefetching is more important in CMPs
Motivation & Goal
• Motivation
  – Conventional prefetch techniques have been developed for uniprocessors
  – It is not clear that these techniques achieve high performance in CMPs
  – Is it necessary for prefetch techniques to consider CMP features?
  – We need to know the effect of prefetching on CMPs
• Goal
  – Analysis of the prefetch effect on CMPs
3
Outline
• Introduction
• Prefetch Taxonomy for Multiprocessors
• Extension for CMPs
• Quantitative Analysis
• Conclusions
4
Classification of Prefetches According to Their Impact on Memory Performance
• Focusing on each individual prefetch
• Definition of the prefetch states:
  – Initial state: the state just after a block is prefetched into the cache
  – Final state: the state when the block is evicted from the cache
  – The state transitions based on events during the lifetime of the prefetched block in the cache
5
Definition of Events
Event 1. The prefetched block is accessed by the local core
Event 2. The local core accesses a block that has been evicted from the cache by the prefetch
Event 3. The prefetch causes a downgrade, followed by a subsequent upgrade, in a remote core
6
[Figure: Event 1 — the local core prefetches block A from main memory; a later Load A hits in the local L1, hiding the off-chip access latency]
Definition of Events
7
[Figure: Event 2 — the local core's prefetch of block A evicts block B; a subsequent Load B by the local core misses]
Definition of Events
8
[Figure: Event 3 — the local core prefetches block A; a Store A by the remote core sends an invalidate request that downgrades the prefetched copy]
The State Transition of Prefetch in Local Core
9
[State diagram: a prefetch enters the Useless state (one memory access is increased in the local core). Event 1 moves Useless → Useful (the number of local L1 cache misses is decreased). Event 2 moves Useless → Useless/Conflict (the numbers of local L1 cache misses and memory accesses are increased). Event 1 then moves Useless/Conflict → Useful/Conflict.]
[Figure: the local core prefetches A, evicting block B; Load A hits, but Load B misses]
The State Transition of Prefetch in Local and Remote Cores*
10
[State diagram: as in the local-core case — Event 2 moves Useless → Useless/Conflict, and Event 1 moves Useless → Useful and Useless/Conflict → Useful/Conflict]
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.
The State Transition of Prefetch in Local and Remote Cores
11
[State diagram: Event 3 adds two states — Useless → Harmful (the numbers of invalidation requests and memory accesses are increased) and, via Event 2, Harmful → Harmful/Conflict (the number of cache misses is also increased)]
[Figure: the remote core's Store A invalidates the prefetched copy of A in the local L1; the local core's Load B misses on the block evicted by the prefetch]
Considering Cache-to-Cache Data Transfer
• Event 4. The prefetched block, loaded from the L2 cache or main memory, is accessed by a remote core
[Figure: Event 4 — the local core prefetches block A; the remote core's Load A is served by a cache-to-cache transfer from the local L1, hiding the off-chip access latency]
12
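Event 4 can be illustrated with a minimal coherence-lookup sketch (the lookup order and data structures below are our assumptions, not the M5 implementation):

```python
# Minimal sketch of Event 4: on a remote core's miss, the coherence protocol
# checks the other cores' L1s before the L2 or memory; a hit in a peer L1 is
# served by a cache-to-cache transfer, hiding the off-chip latency.

def service_miss(addr, requester, l1_caches, l2_cache):
    """Return where block `addr` is found for core `requester`."""
    for core, contents in l1_caches.items():
        if core != requester and addr in contents:
            return f"cache-to-cache from core {core}"
    if addr in l2_cache:
        return "L2"
    return "main memory"

l1 = {0: {"A"}, 1: set()}   # core 0 has prefetched block A into its L1
print(service_miss("A", requester=1, l1_caches=l1, l2_cache=set()))
# -> cache-to-cache from core 0
```

A prefetch that triggers only this event did not help its own core, but it still turned a remote off-chip access into an on-chip transfer.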
The State Transition in CMPs
13
[State diagram: Event 1 moves Useless → Useful and Useless/Conflict → Useful/Conflict; Event 2 moves Useless → Useless/Conflict and Harmful → Harmful/Conflict; Event 3 moves Useless → Harmful and Useless/Conflict → Harmful/Conflict]
The State Transition in CMPs
14
[State diagram: Event 4 adds two states — Useless → Useless/Remote (the number of L2 accesses is decreased in the remote core) and Useless/Conflict → Useless/Conflict/Remote (the number of L2 accesses is decreased in the remote core, and the number of cache misses is increased in the local core)]
[Figure: the local core prefetches A, evicting B; the remote core's Load A is served cache-to-cache; the local core's Load B misses]
Classification of Prefetches in CMPs
15
[Table: the eight prefetch states and their per-prefetch effects on memory performance —
  Useful (best case): one cache miss is decreased in the local core
  Useful/Conflict
  Useless: one memory access is increased in the local core
  Useless/Conflict
  Useless/Remote (better case): one L2 access is decreased in the remote core
  Useless/Conflict/Remote
  Harmful (worse case): one memory access in the local core and one invalidate request in the remote core are increased
  Harmful/Conflict (worst case): additionally, one cache miss is increased in the local core]
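The classification above can be expressed as a small decision procedure over the observed events (an illustrative Python sketch, not the simulator's code; the precedence among the Useful, Harmful, and Remote outcomes is our assumption):

```python
# Illustrative sketch: derive the final taxonomy state for one prefetched
# block from the events observed during its lifetime.
# Event numbering follows the talk: 1 = accessed by the local core,
# 2 = the block evicted by the prefetch is re-accessed later (conflict),
# 3 = downgraded by a remote core's write (harmful),
# 4 = fetched by a remote core via cache-to-cache transfer (remote).

def classify_prefetch(events):
    """Map the set of observed events to a taxonomy state
    (assumed precedence: Useful beats Harmful beats Remote)."""
    useful   = 1 in events
    conflict = 2 in events
    harmful  = 3 in events
    remote   = 4 in events

    if useful:
        return "Useful/Conflict" if conflict else "Useful"
    if harmful:
        return "Harmful/Conflict" if conflict else "Harmful"
    if remote:
        return "Useless/Conflict/Remote" if conflict else "Useless/Remote"
    return "Useless/Conflict" if conflict else "Useless"

# A prefetch never used locally, whose fill evicted a block that later missed,
# but whose data a remote core picked up cache-to-cache:
print(classify_prefetch({2, 4}))   # Useless/Conflict/Remote
```

Counting prefetches per final state is exactly how the breakdowns in the following charts are produced.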
Outline
• Introduction
• Prefetch Taxonomy
  – for Multiprocessors
  – for CMPs
• Quantitative Analysis
• Conclusions
16
Simulation Environment
• Simulator
  – M5: CMP simulator
    • Prefetch mechanism attached to the L1 cache
    • Stride prefetch and tagged prefetch
    • MOESI coherence protocol
• Benchmark programs
  – SPLASH-2: scientific computation programs
[Figure: four cores, each with 64KB 2-way L1 I- and D-caches, sharing a 4MB 8-way L2 cache and main memory]
17
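For reference, a stride prefetcher of the kind evaluated here can be sketched as follows (illustrative Python; the per-PC table, confirmation policy, and 64-byte stride in the example are assumptions, not the M5 mechanism):

```python
# Illustrative stride-prefetch sketch: a per-PC table remembers the last
# address and stride; once the same stride is seen twice in a row, the next
# address in the pattern is prefetched.

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a memory access; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        if stride != 0 and stride == last_stride:
            return addr + stride        # stride confirmed: prefetch ahead
        return None

p = StridePrefetcher()
for a in (0, 64, 128, 192):             # a load streaming through 64B blocks
    hint = p.access(pc=0x400, addr=a)
print(hint)   # 256
```

Tagged prefetch differs in that it fetches the next sequential block on a miss or on the first reference to a previously prefetched block, rather than tracking strides per instruction.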
Can Conventional Prefetch Techniques Exploit Cache-to-Cache Data Transfer?
[Figure: breakdown of prefetches into Useful, Useful/Conflict, Useless, Useless/Remote, Useless/Conflict, Useless/Conflict/Remote, Harmful, and Harmful/Conflict (0–100%) for (1) stride prefetch and (2) tagged prefetch on FMM, LU, Radix, and Water]
• The percentage of Useless/Remote and Useless/Conflict/Remote prefetches is only 5%
→ Conventional prefetch techniques do not exploit cache-to-cache data transfer effectively
18
Are Prefetched-Block Invalidations a Serious Problem for CMPs?
[Figure: breakdown of prefetches (0–100%) for (1) stride prefetch and (2) tagged prefetch on FMM, LU, Radix, and Water]
• Harmful and Harmful/Conflict prefetches are extremely few (0.2% on average)
→ Invalidations of prefetched blocks are negligible
19
20
Multiprocessor vs. Chip Multiprocessor
• Harmful and Harmful/Conflict prefetches
  – 0.01–0.70% in CMPs (tagged prefetch) → small negative impact
  – 2–18% in MPs* (sequential prefetch) → large negative impact
• Why does this difference occur?
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.
The Reason for the Difference in Invalidation Rate
• Difference in the lifetime of prefetched blocks in the cache
  – Long lifetime (large cache size) → high possibility of invalidation
  – Short lifetime (small cache size) → low possibility of invalidation
• If the cache size is large, the negative impact is large (as in MPs)
• If the cache size is small, the negative impact is small (as in CMPs)
[Figure: a CMP, with small per-core L1 caches over a shared on-chip L2, vs. a multiprocessor, where each chip's large L2 loads prefetched blocks and keeps coherence]
21
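The lifetime argument can be made concrete with a toy model (our illustration, not from the talk; the rates and residency times are hypothetical): if remote writes to a prefetched block arrive as a Poisson process with rate w writes per cycle, a block resident for L cycles is invalidated before eviction with probability 1 − exp(−wL), and residency grows roughly with cache size.

```python
# Toy model: probability that a prefetched block is invalidated before it is
# evicted, assuming Poisson-distributed remote writes at rate w per cycle and
# a cache residency of L cycles. Bigger caches keep blocks longer, so they
# see far more invalidations -- the MP vs. CMP difference above.
import math

def p_invalidated(write_rate, residency_cycles):
    return 1.0 - math.exp(-write_rate * residency_cycles)

w = 1e-5   # hypothetical remote-write rate per cycle
for label, cycles in [("small L1 (short residency)", 2_000),
                      ("large L2 (long residency)", 200_000)]:
    print(f"{label}: P(invalidated) = {p_invalidated(w, cycles):.3f}")
```

With these placeholder numbers the short-residency block is invalidated about 2% of the time versus roughly 86% for the long-residency one, matching the qualitative trend in the next chart.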
The Invalidation Rate of Prefetched Blocks with Varying L1 Cache Size (tagged prefetch)
[Figure: invalidated rate (0–60%) for L1 cache sizes of 128KB, 256KB, 512KB, and 1MB on FMM, LU, Radix, and Water]
• Larger cache → large negative impact (as in MPs)
• Smaller cache → small negative impact (as in CMPs)
22
Summary
• Contributions
  – A new method to analyze prefetch effects on CMPs
  – Quantitative analysis of two types of prefetches
• Observations
  – Conventional prefetch techniques DO NOT exploit cache-to-cache data transfer effectively
  – Harmful prefetches are NOT harmful in CMPs
• Future work
  – Propose a novel prefetch technique exploiting the features of CMPs
23
Any questions? ~Please speak slowly~
Thank you
Average Memory Access Time (AMAT)
25
[Figure: AMAT (0–3) for base, stride, and tagged prefetch on FMM, LU, Radix, and Water, broken down into L1 cache, remote L1 cache, L2 cache, shared bus, memory bus, and main memory components]
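A breakdown like this composes level by level; here is a hedged sketch of the standard AMAT recurrence (the latencies and miss rates below are illustrative placeholders, not the paper's measurements):

```python
# AMAT for a simple two-level hierarchy: each level contributes
# (fraction of accesses reaching it) * (its latency). The chart above adds
# further components (remote L1, shared bus, memory bus) the same way.

def amat(l1_hit, l1_miss_rate, l2_lat, l2_miss_rate, mem_lat):
    """Average memory access time in cycles."""
    return l1_hit + l1_miss_rate * (l2_lat + l2_miss_rate * mem_lat)

# Prefetching mainly shrinks the miss rate seen by demand accesses:
base     = amat(1, 0.05, 12, 0.40, 200)   # no prefetch        -> 5.6 cycles
prefetch = amat(1, 0.05, 12, 0.10, 200)   # fewer demand misses -> 2.6 cycles
print(base, prefetch)
```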
Harmful and Harmful/Conflict Prefetches with Varying Number of Cores
26
[Figure: percentage of harmful and harmful/conflict prefetches (0.00–3.00%) for 2, 4, and 8 cores on Barnes, FMM, LU, Radix, Raytrace, and Water]
MultiProcessor Traffic and Miss Taxonomy (MPTMT [Jerger’06])
• MultiProcessor Traffic and Miss Taxonomy (MPTMT)
  – An extended version of the uniprocessor taxonomy (Srinivasan et al.)
  – Prefetches are classified according to their effects on memory performance
  – By counting the classified prefetches, we can measure the prefetch effects precisely
27