Analyzing the Impact of Data Prefetching on Chip MultiProcessors
Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan
ACSAC-13, August 4, 2008
2
Background
• CMP (Chip MultiProcessor):
  – Several processor cores integrated on a chip
  – High performance through parallel processing
  – New feature: cache-to-cache data transfer
• Limiting factor of CMP performance
  – The memory-wall problem is more critical
    • High frequency of off-chip accesses
    • Bandwidth does not scale with the number of cores
[Figure: a CMP chip with two cores, each with a private L1 cache, sharing an on-chip L2 cache]
Data prefetching is more important in CMPs
Motivation & Goal
• Motivation
  – Conventional prefetch techniques have been developed for uniprocessors
  – It is not clear that these techniques achieve high performance in CMPs
  – Is it necessary for prefetch techniques to consider CMP features?
  – We need to know the effect of prefetching on CMPs
• Goal
  – Analysis of the prefetch effect on CMPs
3
Outline
• Introduction
• Prefetch Taxonomy for Multiprocessors
• Extension for CMPs
• Quantitative Analysis
• Conclusions
4
Classification of Prefetches According to Their Impact on Memory Performance
• Focusing on each individual prefetch
• Definition of the prefetch states:
  – Initial state: the state just after a block is prefetched into the cache
  – Final state: the state when the block is evicted from the cache
  – The state transitions based on events during the lifetime of the prefetched block in the cache
5
Definition of Events
Event 1. The prefetched block is accessed by the local core
Event 2. The local core accesses a block that has been evicted from the cache by the prefetch
Event 3. The prefetch causes a downgrade, followed by a subsequent upgrade, in a remote core
6
[Figure: Event 1 — the local core prefetches block A from main memory; a later Load A hits in the local L1, hiding the off-chip access latency]
Definition of Events
7
[Figure: Event 2 — the local core's prefetch of block A evicts block B; a subsequent Load B by the local core misses]
Definition of Events
8
[Figure: Event 3 — the local core prefetches block A; a Store A by the remote core sends an invalidate request that downgrades the prefetched copy]
The State Transition of Prefetch in Local Core
9
[State diagram: a prefetch enters the Useless state (one memory access is increased in the local core). Event 1 moves Useless → Useful (the number of local L1 cache misses is decreased). Event 2 moves Useless → Useless/Conflict (the numbers of local L1 cache misses and memory accesses are increased). Event 1 then moves Useless/Conflict → Useful/Conflict.]
[Figure: the local core prefetches A, evicting block B; Load A hits, but Load B misses]
The State Transition of Prefetch in Local and Remote Cores*
10
[State diagram: as in the local-core case — Event 2 moves Useless → Useless/Conflict, and Event 1 moves Useless → Useful and Useless/Conflict → Useful/Conflict]
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.
The State Transition of Prefetch in Local and Remote Cores
11
[State diagram: Event 3 adds two states — Useless → Harmful (the numbers of invalidation requests and memory accesses are increased) and, via Event 2, Harmful → Harmful/Conflict (the number of cache misses is also increased)]
[Figure: the remote core's Store A invalidates the prefetched copy of A in the local L1; the local core's Load B misses on the block evicted by the prefetch]
Considering Cache-to-Cache Data Transfer
• Event 4. The prefetched block, loaded from the L2 cache or main memory, is accessed by a remote core
[Figure: Event 4 — the local core prefetches block A; the remote core's Load A is served by a cache-to-cache transfer from the local L1, hiding the off-chip access latency]
12
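Event 4 can be illustrated with a minimal coherence-lookup sketch (the lookup order and data structures below are our assumptions, not the M5 implementation):

```python
# Minimal sketch of Event 4: on a remote core's miss, the coherence protocol
# checks the other cores' L1s before the L2 or memory; a hit in a peer L1 is
# served by a cache-to-cache transfer, hiding the off-chip latency.

def service_miss(addr, requester, l1_caches, l2_cache):
    """Return where block `addr` is found for core `requester`."""
    for core, contents in l1_caches.items():
        if core != requester and addr in contents:
            return f"cache-to-cache from core {core}"
    if addr in l2_cache:
        return "L2"
    return "main memory"

l1 = {0: {"A"}, 1: set()}   # core 0 has prefetched block A into its L1
print(service_miss("A", requester=1, l1_caches=l1, l2_cache=set()))
# -> cache-to-cache from core 0
```

A prefetch that triggers only this event did not help its own core, but it still turned a remote off-chip access into an on-chip transfer.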
The State Transition in CMPs
13
[State diagram: Event 1 moves Useless → Useful and Useless/Conflict → Useful/Conflict; Event 2 moves Useless → Useless/Conflict and Harmful → Harmful/Conflict; Event 3 moves Useless → Harmful and Useless/Conflict → Harmful/Conflict]
The State Transition in CMPs
14
[State diagram: Event 4 adds two states — Useless → Useless/Remote (the number of L2 accesses is decreased in the remote core) and Useless/Conflict → Useless/Conflict/Remote (the number of L2 accesses is decreased in the remote core, and the number of cache misses is increased in the local core)]
[Figure: the local core prefetches A, evicting B; the remote core's Load A is served cache-to-cache; the local core's Load B misses]
Classification of Prefetches in CMPs
15
[Table: the eight prefetch states and their per-prefetch effects on memory performance —
  Useful (best case): one cache miss is decreased in the local core
  Useful/Conflict
  Useless: one memory access is increased in the local core
  Useless/Conflict
  Useless/Remote (better case): one L2 access is decreased in the remote core
  Useless/Conflict/Remote
  Harmful (worse case): one memory access in the local core and one invalidate request in the remote core are increased
  Harmful/Conflict (worst case): additionally, one cache miss is increased in the local core]
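The classification above can be expressed as a small decision procedure over the observed events (an illustrative Python sketch, not the simulator's code; the precedence among the Useful, Harmful, and Remote outcomes is our assumption):

```python
# Illustrative sketch: derive the final taxonomy state for one prefetched
# block from the events observed during its lifetime.
# Event numbering follows the talk: 1 = accessed by the local core,
# 2 = the block evicted by the prefetch is re-accessed later (conflict),
# 3 = downgraded by a remote core's write (harmful),
# 4 = fetched by a remote core via cache-to-cache transfer (remote).

def classify_prefetch(events):
    """Map the set of observed events to a taxonomy state
    (assumed precedence: Useful beats Harmful beats Remote)."""
    useful   = 1 in events
    conflict = 2 in events
    harmful  = 3 in events
    remote   = 4 in events

    if useful:
        return "Useful/Conflict" if conflict else "Useful"
    if harmful:
        return "Harmful/Conflict" if conflict else "Harmful"
    if remote:
        return "Useless/Conflict/Remote" if conflict else "Useless/Remote"
    return "Useless/Conflict" if conflict else "Useless"

# A prefetch never used locally, whose fill evicted a block that later missed,
# but whose data a remote core picked up cache-to-cache:
print(classify_prefetch({2, 4}))   # Useless/Conflict/Remote
```

Counting prefetches per final state is exactly how the breakdowns in the following charts are produced.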
Outline
• Introduction
• Prefetch Taxonomy
  – for Multiprocessors
  – for CMPs
• Quantitative Analysis
• Conclusions
16
Simulation Environment
• Simulator
  – M5: CMP simulator
    • Prefetch mechanism attached to the L1 cache
    • Stride prefetch and tagged prefetch
    • MOESI coherence protocol
• Benchmark programs
  – SPLASH-2: scientific computation programs
[Figure: four cores, each with 64KB 2-way L1 I- and D-caches, sharing a 4MB 8-way L2 cache and main memory]
17
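For reference, a stride prefetcher of the kind evaluated here can be sketched as follows (illustrative Python; the per-PC table, confirmation policy, and 64-byte stride in the example are assumptions, not the M5 mechanism):

```python
# Illustrative stride-prefetch sketch: a per-PC table remembers the last
# address and stride; once the same stride is seen twice in a row, the next
# address in the pattern is prefetched.

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a memory access; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        if stride != 0 and stride == last_stride:
            return addr + stride        # stride confirmed: prefetch ahead
        return None

p = StridePrefetcher()
for a in (0, 64, 128, 192):             # a load streaming through 64B blocks
    hint = p.access(pc=0x400, addr=a)
print(hint)   # 256
```

Tagged prefetch differs in that it fetches the next sequential block on a miss or on the first reference to a previously prefetched block, rather than tracking strides per instruction.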
Can Conventional Prefetch Techniques Exploit Cache-to-Cache Data Transfer?
[Figure: breakdown of prefetches into Useful, Useful/Conflict, Useless, Useless/Remote, Useless/Conflict, Useless/Conflict/Remote, Harmful, and Harmful/Conflict (0–100%) for (1) stride prefetch and (2) tagged prefetch on FMM, LU, Radix, and Water]
• The percentage of Useless/Remote and Useless/Conflict/Remote prefetches is only 5%
→ Conventional prefetch techniques do not exploit cache-to-cache data transfer effectively
18
Are Prefetched-Block Invalidations a Serious Problem for CMPs?
[Figure: breakdown of prefetches (0–100%) for (1) stride prefetch and (2) tagged prefetch on FMM, LU, Radix, and Water]
• Harmful and Harmful/Conflict prefetches are extremely few (0.2% on average)
→ Invalidations of prefetched blocks are negligible
19
20
Multiprocessor vs. Chip Multiprocessor
• Harmful and Harmful/Conflict prefetches
  – 0.01–0.70% in CMPs (tagged prefetch) → small negative impact
  – 2–18% in MPs* (sequential prefetch) → large negative impact
• Why does this difference occur?
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.
The Reason for the Difference in Invalidation Rate
• Difference in the lifetime of prefetched blocks in the cache
  – Long lifetime (large cache size) → high possibility of invalidation
  – Short lifetime (small cache size) → low possibility of invalidation
• If the cache size is large, the negative impact is large (as in MPs)
• If the cache size is small, the negative impact is small (as in CMPs)
[Figure: a CMP, with small per-core L1 caches over a shared on-chip L2, vs. a multiprocessor, where each chip's large L2 loads prefetched blocks and keeps coherence]
21
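The lifetime argument can be made concrete with a toy model (our illustration, not from the talk; the rates and residency times are hypothetical): if remote writes to a prefetched block arrive as a Poisson process with rate w writes per cycle, a block resident for L cycles is invalidated before eviction with probability 1 − exp(−wL), and residency grows roughly with cache size.

```python
# Toy model: probability that a prefetched block is invalidated before it is
# evicted, assuming Poisson-distributed remote writes at rate w per cycle and
# a cache residency of L cycles. Bigger caches keep blocks longer, so they
# see far more invalidations -- the MP vs. CMP difference above.
import math

def p_invalidated(write_rate, residency_cycles):
    return 1.0 - math.exp(-write_rate * residency_cycles)

w = 1e-5   # hypothetical remote-write rate per cycle
for label, cycles in [("small L1 (short residency)", 2_000),
                      ("large L2 (long residency)", 200_000)]:
    print(f"{label}: P(invalidated) = {p_invalidated(w, cycles):.3f}")
```

With these placeholder numbers the short-residency block is invalidated about 2% of the time versus roughly 86% for the long-residency one, matching the qualitative trend in the next chart.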
The Invalidation Rate of Prefetched Blocks with Varying L1 Cache Size (tagged prefetch)
[Figure: invalidated rate (0–60%) for L1 cache sizes of 128KB, 256KB, 512KB, and 1MB on FMM, LU, Radix, and Water]
• Larger cache → large negative impact (as in MPs)
• Smaller cache → small negative impact (as in CMPs)
22
Summary
• Contributions
  – A new method to analyze prefetch effects on CMPs
  – Quantitative analysis of two types of prefetches
• Observations
  – Conventional prefetch techniques DO NOT exploit cache-to-cache data transfer effectively
  – Harmful prefetches are NOT harmful in CMPs
• Future work
  – Propose a novel prefetch technique exploiting the features of CMPs
23
Any questions? ~Please speak slowly~
Thank you
Average Memory Access Time (AMAT)
25
[Figure: AMAT (0–3) for base, stride, and tagged prefetch on FMM, LU, Radix, and Water, broken down into L1 cache, remote L1 cache, L2 cache, shared bus, memory bus, and main memory components]
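A breakdown like this composes level by level; here is a hedged sketch of the standard AMAT recurrence (the latencies and miss rates below are illustrative placeholders, not the paper's measurements):

```python
# AMAT for a simple two-level hierarchy: each level contributes
# (fraction of accesses reaching it) * (its latency). The chart above adds
# further components (remote L1, shared bus, memory bus) the same way.

def amat(l1_hit, l1_miss_rate, l2_lat, l2_miss_rate, mem_lat):
    """Average memory access time in cycles."""
    return l1_hit + l1_miss_rate * (l2_lat + l2_miss_rate * mem_lat)

# Prefetching mainly shrinks the miss rate seen by demand accesses:
base     = amat(1, 0.05, 12, 0.40, 200)   # no prefetch        -> 5.6 cycles
prefetch = amat(1, 0.05, 12, 0.10, 200)   # fewer demand misses -> 2.6 cycles
print(base, prefetch)
```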
Harmful and Harmful/Conflict Prefetches with Varying Number of Cores
26
[Figure: percentage of harmful and harmful/conflict prefetches (0.00–3.00%) for 2, 4, and 8 cores on Barnes, FMM, LU, Radix, Raytrace, and Water]
MultiProcessor Traffic and Miss Taxonomy (MPTMT [Jerger’06])
• MultiProcessor Traffic and Miss Taxonomy (MPTMT)
  – An extended version of the uniprocessor taxonomy (Srinivasan et al.)
  – Prefetches are classified according to their effects on memory performance
  – By counting the classified prefetches, we can measure the prefetch effects precisely
27