ISCA-2000 Overseas Survey Report
![Page 2: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/2.jpg)
2
Conference Overview
• The 27th Annual International Symposium on Computer Architecture, Vancouver, Canada, June 10-14
– Keynote: 1
– Panel: 1
– Regular papers: 29 (acceptance rate 17%)
• Attendees: 444 (13 from Japan, 4 of them from universities)
• http://www.cs.rochester.edu/~ISCA2k/
![Page 3: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/3.jpg)
3
Papers Introduced
• Multiple-Banked Register File Architecture
• On the Value Locality of Store Instructions
• Completion Time Multiple Branch Prediction for Enhancing Trace Cache Performance
• Circuits for Wide-Window Superscalar Processor
• Trace Preconstruction
• A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots
• A Fully Associative Software-Managed Cache Design
• Performance Analysis of the Alpha 21264-based Compaq ES40 System
• Piranha: A Scalable Architecture Based on Single Chip Multiprocessing
![Page 4: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/4.jpg)
4
Multiple-Banked Register File Architecture
Jose-Lorenzo Cruz et al.
Universitat Politecnica de Catalunya, Spain
ISCA-2000 p.316-325
Session 8 – Microarchitecture Innovations
![Page 5: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/5.jpg)
5
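Register File Organization
[Figure: a register file of 32 registers (Reg0 to Reg31). A decoder takes a 5-bit register number (RegNo) and a Read signal, and tri-state buffers drive the selected 64-bit value onto output port out1.]
• the number of registers
• the number of ports (Read, Write)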
![Page 6: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/6.jpg)
6
Motivation
• The register file access time is one of the critical delays
• The access time of the register file depends on the number of registers and the number of ports
– Instruction window -> registers
– Issue width -> ports
![Page 7: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/7.jpg)
7
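Research Goals
• Increase the number of register file ports
• Get close to a register file that can be accessed in a single cycle
[Figure: a request and the returned value between two RegFile blocks, drawn against the machine cycle.]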
![Page 8: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/8.jpg)
8
Impact of Register File Architecture
[Figure: IPC (2.00 to 4.00) versus physical register file size (48, 64, 96, 128, 160, 192, 224, 256) for SPECint95.]
![Page 9: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/9.jpg)
9
Observation
• A processor needs many physical registers, but only a very small number are actually required at a given moment.
– Registers with no value
– Value used by later instructions
– Last-use and overwrite
– Bypass only or never read
![Page 10: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/10.jpg)
10
Multiple-Banked Register File Architecture
[Figure: two organizations of Bank 1 and Bank 2. (a) one-level: the banks sit side by side. (b) multi-level (register file cache): the banks are stacked between the uppermost level and the lowest level.]
![Page 11: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/11.jpg)
11
Register File Cache
[Figure: Bank 1 and Bank 2 as a two-level register file. Uppermost level: 16 registers. Lowest level: 128 registers.]
• The lowest level is always written.
• Data is moved only from lower to upper level.
• Cached in upper level based on heuristics.
• There is a prefetch mechanism.
![Page 12: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/12.jpg)
12
Caching and Fetching Policy
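• Non-bypass caching
– Only results that were not read from the bypass logic are stored in the upper level
• Ready caching
– Only values needed by instructions that have not yet issued are stored in the upper level
• Fetch-on-demand
– A value is transferred to the upper level at the point it becomes needed
• Prefetch-first-pair -> next slide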
Locality properties of registers and memory are very different.
![Page 13: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/13.jpg)
13
Prefetch-first-pair
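(1) P1 = P2 + P3
(2) P4 = P3 + P6
(3) P7 = P1 + P8
Instruction (3) is the first instruction to use P1, the result register of instruction (1). Its other source register, P8, is the one that gets prefetched.
P8 is prefetched when instruction (1) issues.
• Instructions (1) to (3) have already passed through the rename stage of the processor.
• P1 to P8 are physical register numbers produced by the hardware renaming.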
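The prefetch-first-pair rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Instr` record and the `prefetch_first_pair` function are hypothetical names, instructions are modelled as already renamed, and the search simply scans forward in program order for the first consumer of the result register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """A renamed instruction: one destination and its source physical registers."""
    dest: str
    srcs: tuple


def prefetch_first_pair(program, index):
    """Return the register to prefetch when program[index] issues, or None.

    Finds the first later instruction that reads the producer's result and
    returns that consumer's *other* source operand.
    """
    producer = program[index]
    for later in program[index + 1:]:
        if producer.dest in later.srcs:
            others = [s for s in later.srcs if s != producer.dest]
            return others[0] if others else None
    return None  # no consumer found in the window


# The three instructions from the slide.
program = [
    Instr("P1", ("P2", "P3")),  # (1)
    Instr("P4", ("P3", "P6")),  # (2)
    Instr("P7", ("P1", "P8")),  # (3)
]
target = prefetch_first_pair(program, 0)
```

For instruction (1) the sketch selects P8, matching the slide; instruction (2) has no consumer in this short window, so nothing is prefetched for it.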
![Page 14: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/14.jpg)
14
Evaluation Results (conf. C3)
• One-cycle single-banked
– Area 18855, cycle 5.22 ns (191 MHz)
– Read 4 port, Write 3 port
• Two-cycle single-banked
– Area 18855, cycle 2.61 ns (383 MHz)
– Read 4 port, Write 3 port
• Register file cache
– Area 20529, cycle 2.61 ns (383 MHz)
– Upper: Read 4 port, Write 4 port
– Lower: Write 4 port, Bus 2
![Page 15: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/15.jpg)
15
Evaluation Results
[Figure: relative instruction throughput (0.0 to 3.0) on SpecInt95 and SpecFP95. Legend: 1-cycle; non-bypass caching + prefetch-first-pair; 2-cycle, 1-bypass.]
![Page 16: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/16.jpg)
16
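Novelty of the Research
• Proposal of the register file cache
– Capable of high-speed operation
– Brings the number of access cycles close to 1 by reducing the miss rate of the upper level
• Proposal of two caching policies and two fetching policies
• Performance evaluation that takes area and cycle time into account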
![Page 17: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/17.jpg)
17
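Comments on the Research
• Demands: a huge register file, more ports, and a shorter access time
• Even the register file, which used to have a simple organization, now needs a multi-level organization like a cache
• Is growing complexity unavoidable in future large-scale ILP architectures?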
![Page 18: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/18.jpg)
18
On the Value Locality of Store Instructions
Kevin M. Lepak et al.
University of Wisconsin
ISCA-2000 p.182-191
Session 5a – Analysis of Workloads and Systems
![Page 19: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/19.jpg)
19
Value Locality (Value Prediction)
• Value locality
– a program attribute that describes the likelihood of the recurrence of previously-seen program values
• The result (data value) that an instruction produced last time is related to the data value it produces this time.
(1) P1 = P2 + P3
(2) P4 = P3 + P6
(3) P7 = P1 + P8
History of the results written to P1 ... 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1 ?
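To make the history on this slide concrete, the sketch below compares two simple value predictors on it. Both are illustrative assumptions rather than anything the slides prescribe: a last-value predictor (predict the previous result again) and a one-deep context predictor (remember which value followed each value last time). The function names are hypothetical.

```python
def last_value_hits(history):
    """Count how often the previous value correctly predicts the next one."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev == cur)


def context_predictor(history):
    """Count hits of a table mapping 'previous value' -> 'value that followed'.

    Returns (hits, prediction for the value after the end of the history).
    """
    table = {}
    hits = 0
    for prev, cur in zip(history, history[1:]):
        if table.get(prev) == cur:
            hits += 1
        table[prev] = cur  # learn what followed `prev` this time
    return hits, table.get(history[-1])


# Result history of P1 from the slide.
history = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1]
lv_hits = last_value_hits(history)
ctx_hits, next_guess = context_predictor(history)
```

On this history the last-value predictor never hits, while the context predictor is correct for the whole second period and answers the "?" on the slide with 2.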
![Page 20: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/20.jpg)
20
Research Goals
• Much published work has focused on load instruction outcomes.
• Examine the implications of store value locality
– Introduce the notion of silent stores
– Introduce two new definitions of multiprocessor true sharing
– Reduce multiprocessor data and address traffic
![Page 21: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/21.jpg)
21
Memory-centric and producer-centric Locality
• Program structure store value locality
– The locality of values written by a particular static store.
• Message-passing store value locality
– The locality of values written to a particular address in data memory.
![Page 22: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/22.jpg)
22
20%-62% of stores are silent stores
[Figure: silent stores (%), on a 0 to 100 scale, for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, oltp, barnes, and ocean.]
A silent store is a store that does not change the system state.
![Page 23: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/23.jpg)
23
Silent Store Removal Mechanism
• Realistic Method
– All previous store addresses must be known.
– Load the data from the memory subsystem.
– If the data values are equal, the store is update-silent.
  • Remove from the LSQ and flag the RUU entry
– If the store is silent, the store retires with no memory access.
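The compare-before-write idea behind the removal mechanism can be sketched as below. This is a functional model only, under the assumption that memory is a plain dictionary; it ignores the LSQ, the RUU, and all timing, and the function name is hypothetical.

```python
def retire_stores(memory, stores):
    """Apply (address, value) stores, squashing the silent ones.

    A store is treated as silent when the value already held at the address
    equals the value being stored. Returns (silent_count, write_count).
    """
    silent = 0
    writes = 0
    for addr, value in stores:
        if addr in memory and memory[addr] == value:
            silent += 1          # same value: retire with no memory access
            continue
        memory[addr] = value     # value changes: perform the write
        writes += 1
    return silent, writes


memory = {0x10: 7, 0x20: 0}
stores = [(0x10, 7), (0x20, 5), (0x20, 5), (0x30, 1)]
silent, writes = retire_stores(memory, stores)
```

In this small trace two of the four stores are silent, so only two writes reach memory.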
![Page 24: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/24.jpg)
24
Evaluation Results
• Writeback Reduction
– Reduction ranges from 81% to 0%
– Average 33% reduction
• Instruction Throughput
– Speedups of 6.3% and 6.9% for realistic and perfect removal mechanisms.
![Page 25: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/25.jpg)
25
New Definition of False Sharing
• Multiprocessor applications
– All of the previous definitions rely on the specific addresses in the same block.
– No attempt is made to determine when the invalidation of a block is unnecessary because the value stored in the line does not change.
– Silent stores and stochastically silent stores
![Page 26: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/26.jpg)
26
Address-based Definition of Sharing [Dubois 1993]
• Cold Miss
– The first miss to a given block by a processor
• Essential Miss
– A cold miss is an essential miss
– If during the lifetime of a block, the processor accesses a value defined by another processor since the last essential miss to that block, it is an essential miss.
• Pure True Sharing miss (PTS)
– An essential miss that is not cold.
• Pure False Sharing miss (PFS)
– A non-essential miss.
![Page 27: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/27.jpg)
27
Update-based False Sharing (UFS)
• Essential Miss
– A cold miss is an essential miss
– If during the lifetime of a block, the processor accesses an address which has had a different data value defined by another processor since the last essential miss to that block, it is an essential miss.
![Page 28: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/28.jpg)
28
Stochastic False Sharing (SFS)
• It seems intuitive that
– if we define false sharing to compensate for the effect of silent stores, we could also define it in the presence of stochastically silent stores (values that are trivially predictable via some mechanism)
![Page 29: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/29.jpg)
29
Novelty of the Research
• Overall characterization of store value locality
• Notion of silent stores
• Uniprocessor speedup by squashing silent stores
• Definition of UFS and SFS
• How to exploit UFS to reduce address and data bus traffic on shared memory multiprocessors
![Page 30: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/30.jpg)
30
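Comments on the Research
• Brings together a range of observations about the value locality of store instructions.
• The evaluation uses a preliminary configuration and serves as motivation for future research.
• Exploiting this locality in parallel computers requires more detailed study.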
![Page 31: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/31.jpg)
31
Completion Time Multiple Branch Prediction for Enhancing Trace Cache Performance
Ryan Rakvic et al.
Carnegie Mellon University
ISCA-2000 p.47-58
Session 2a – Exploiting Traces
![Page 32: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/32.jpg)
32
Branch Prediction and Multiple Branch Prediction
[Figure: control flow graph of basic blocks connected by Taken and Not Taken edges. Branch prediction: T or N? Multiple branch prediction: (T or N) (T or N) (T or N)?]
![Page 33: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/33.jpg)
33
Motivation and Goals
• Wide Instruction Fetching
– 4-way -> 8-way
• Multiple branch prediction
– Branch Execution Rate: about 15%
– One branch prediction per cycle is not enough.
• Tree-Based Multiple Branch Predictor (TMP)
![Page 34: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/34.jpg)
34
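Example of the Branch Prediction Used: gshare
[Figure: gshare. The global history of the preceding branches B-3, B-2, B-1 (here T, N, N) is combined with branch B0 to form the index into a table of tags and two-bit counters (2BC). Each entry is a two-bit bimodal branch predictor with states 00, 01, 10, 11 that moves on Taken and Not taken outcomes; in the example the selected counter is 10, so B0 is predicted Taken.]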
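The gshare scheme on this slide can be sketched as follows. It is a minimal model under stated assumptions: the table is untagged (the slide's Tag field is left out), the index is the XOR of the branch address and the global history, and the class name, table size, and the `run` helper are all hypothetical choices for illustration.

```python
class Gshare:
    """Untagged gshare: global history XOR branch address indexes 2-bit counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # 1 = weakly not taken
        self.history = 0                         # newest outcome in bit 0

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # 10 or 11 means taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def run(predictor, pc, outcomes):
    """Predict then update for each outcome; return a list of correct/incorrect."""
    results = []
    for taken in outcomes:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results


# A single branch that alternates taken / not taken.
outcomes = [i % 2 == 0 for i in range(40)]
results = run(Gshare(), 0x40, outcomes)
```

A lone two-bit counter cannot follow an alternating branch, but once the history register has filled with the pattern the history-indexed counters predict every outcome in this run.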
![Page 35: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/35.jpg)
35
Tree-Based Multiple Branch Predictor (TMP)
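[Figure: TMP. The global history of the preceding branches B-3, B-2, B-1 (here T, N, N) and B0 select an entry, Tree(i), in the Tree-based Pattern History Table (Tree-PHT). The entry is a tree rooted at B0; following the predicted direction at each node through B1, B2, B3, B4 gives the predicted path, TTNT in the example.]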
![Page 36: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/36.jpg)
36
Tree-based Pattern History Table
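[Figure: each Tree-based Pattern History Table (Tree-PHT) entry holds a tag and a tree. Every tree node is a two-bit bimodal branch predictor with states 00, 01, 10, 11 that moves on Taken and Not taken outcomes. Walking the tree along the predicted directions gives the predicted path, NTNT in the example.]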
![Page 37: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/37.jpg)
37
Updating of 2-bit bimodal tree-node
[Figure: update of the two-bit bimodal tree nodes at completion time. Recently completed path: TNNN. Old predicted path: NTNT. New predicted path: TNTT. The counters of the nodes on the completed path move one state toward the observed Taken or Not taken outcome.]
![Page 38: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/38.jpg)
38
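Tree-PHT with second-level PHT (Node-PHT) for tree-node prediction
[Figure: a Tree-based Pattern History Table (Tree-PHT) entry holds a tag and a tree. Each tree node keeps n bits of local history, which index a Node Pattern History Table (Node-PHT) of 2-bit bimodal counters; the Node-PHT outputs give the T or N prediction per node and so the predicted path.]
Node-PHT variants:
• global (g)
• per-branch (p)
• shared (s)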
![Page 39: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/39.jpg)
39
Novelty and Evaluation Results
• Proposal of TMP
– Three-level branch predictor
– Maintain a tree structure
– Completion time update
• TMPs-best (shared)
– The number of entries in the Tree-PHT: 2K
– Local history bit: 6
– 72KB Memory
• 96%: 1 block
• 93%: 2 consecutive blocks
• 87%: 3 consecutive blocks
• 82%: 4 consecutive blocks
![Page 40: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/40.jpg)
40
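Comments on the Research
• Proposes a three-level prediction mechanism for predicting the targets of multiple branch instructions per cycle
• Branch prediction becomes even more complex, but
• Steady performance improvement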
![Page 41: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/41.jpg)
41
Circuits for Wide-Window Superscalar Processor
Dana S. Henry, Bradley C. Kuszmaul, Gabriel H. Loh and Rahul Sami
ISCA-2000 p.236-247
Session 6 – Circuit Considerations
![Page 42: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/42.jpg)
42
Instruction Window
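[Figure: an out-of-order superscalar processor and its instruction window. Instruction supply passes instructions to the instruction window; the window passes instructions and data to execution; execution results (tag, value) are fed back to the window.]
Each instruction window entry holds: Src1-Tag, Valid, Src2-Tag, Valid, Op
• Wake-up
• Schedule
(1) P1 = P2 + P3 (P2 is Src1, P3 is Src2)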
![Page 43: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/43.jpg)
43
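Motivation
• Enlarging the instruction window increases the opportunity to exploit instruction-level parallelism
• The Alpha 21264 window size is 35
• The MIPS R10000 window size is ??
• Building a large instruction window is difficult
• Power4, two 4-issue processors
• Intel Itanium, VLIW techniques
[Figure: the executed instruction stream and the instruction window over it.]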
![Page 44: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/44.jpg)
44
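Research Goals
• Realize a large (128-entry) instruction window that operates at high speed
• Propose a log-depth cyclic segmented prefix (CSP) circuit
• Discuss the relationship between the log-depth cyclic segmented prefix circuit and cycle time
• Discuss the performance gain from a large instruction window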
![Page 45: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/45.jpg)
45
Gate-delay cyclic segmented prefix (CSP)
[Figure: a 4-entry wrap-around reordering buffer with an adjacent, linear gate-delay cyclic segmented prefix. Each entry i (0 to 3) has inputs in i and s i and output out i; Head and Tail mark the occupied part of the buffer.]
![Page 46: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/46.jpg)
46
Commit logic using CSP
[Figure: commit logic built from a 4-entry CSP with inputs in i, s i and output out i per entry. Two examples show entries between Head and Tail marked done, done, done, and not done.]
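The commit logic on this slide can be modelled with a linear cyclic segmented prefix. The sketch below is behavioural only and rests on assumptions: the exact input and output convention of the paper's circuit is not given on the slides, so here `out[i]` is taken to be the operator applied to all inputs from the segment start up to, but not including, entry `i`, and the segment bit is set at the head of the buffer. All names are hypothetical.

```python
def cyclic_segmented_prefix(inputs, segment, op, identity):
    """Linear-time model of a cyclic segmented prefix over a circular buffer.

    segment[i] is True where a new segment starts (at least one must be set).
    out[i] accumulates inputs from the segment start up to, excluding, entry i.
    """
    n = len(inputs)
    start = segment.index(True)       # begin the walk at a segment boundary
    out = [identity] * n
    acc = identity
    for k in range(n):
        i = (start + k) % n           # wrap around the buffer
        if segment[i]:
            acc = identity            # a segment start resets the prefix
        out[i] = acc
        acc = op(acc, inputs[i])
    return out


def can_commit(done, head):
    """An entry may commit if it is done and every older entry is done."""
    segment = [i == head for i in range(len(done))]
    older_done = cyclic_segmented_prefix(done, segment, lambda a, b: a and b, True)
    return [o and d for o, d in zip(older_done, done)]


# Head at entry 1: program order is entries 1, 2, 3, 0.
commit_a = can_commit([False, True, True, True], head=1)
# Head at entry 3: program order is entries 3, 0, 1, 2.
commit_b = can_commit([True, False, True, True], head=3)
```

In the second example entry 2 is done but cannot commit, because the older entry 1 is not done yet. The paper's contribution is computing the same function in logarithmic gate depth; this loop only shows what is being computed.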
![Page 47: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/47.jpg)
47
Wake-up logic for logical register R5
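[Figure: wake-up logic for logical register R5, built from a 4-entry CSP with inputs in i, s i and output out i per entry. The window between Head and Tail holds:]
A: R5 = R8 + R1 (done)
B: R1 = R5 + R1 (not done)
C: R5 = R3 + R3 (not done)
D: R4 = R5 + R7 (not done)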
![Page 48: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/48.jpg)
48
Scheduler logic scheduling two FUs
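[Figure: scheduler logic for two functional units. A chain of adders between Head and Tail counts the requests, producing the values 1, 1, 2, 2 along the window, which holds:]
A: R5 = R8 + R1 (request)
B: R1 = R5 + R1
C: R5 = R3 + R3 (request)
D: R4 = R5 + R7 (request)
For the logarithmic gate-delay implementations, see pp. 240-241.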
![Page 49: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/49.jpg)
49
Evaluation Results
• Designed a 128-entry instruction window
• Commit logic: 1.41 ns (709 MHz)
• Wakeup logic: 1.54 ns (649 MHz)
• Schedule logic: 1.69 ns (591 MHz)
• Achieves an operating speed above 500 MHz with current process technology
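The clock rates on this slide follow directly from the delays, since a delay of t ns corresponds to 1000 / t MHz. The short sketch below reproduces the arithmetic; the function name is hypothetical, and truncating to whole MHz is an assumption chosen because it matches the figures on the slide.

```python
def cycle_ns_to_mhz(cycle_ns):
    """Convert a cycle time in nanoseconds to a clock frequency in MHz."""
    return 1000.0 / cycle_ns


# Delays reported on the slide.
delays_ns = {"commit": 1.41, "wakeup": 1.54, "schedule": 1.69}
clock_mhz = {name: int(cycle_ns_to_mhz(ns)) for name, ns in delays_ns.items()}

# The slowest of the three circuits bounds the clock of the whole window.
slowest = min(clock_mhz, key=clock_mhz.get)
```

The schedule logic is the slowest of the three, and its 591 MHz is what supports the "above 500 MHz" claim.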
![Page 50: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/50.jpg)
50
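Comments on the Research
• Demonstrated the feasibility of a 128-entry instruction window.
• Until now, increasing the number of instruction window entries was considered difficult.
• Interesting in that it overturns this view.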
![Page 51: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/51.jpg)
51
Trace Preconstruction
Quinn Jacobson et al.
Sun Microsystems
ISCA-2000 p.37-46
Session 2a – Exploiting Traces
![Page 52: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/52.jpg)
52
Trace Cache
[Figure: instruction fetch with a trace cache. Components: Branch Predict, I-Cache, Trace Constructor, Trace Cache, Next Trace Predictor, Execution Engine.]
Traces are snapshots of short segments of the dynamic instruction stream.
![Page 53: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/53.jpg)
53
Trace Cache Slow Path
[Figure: the same instruction fetch organization (Branch Predict, I-Cache, Trace Constructor, Trace Cache, Next Trace Predictor, Execution Engine) with the slow path marked.]
If the trace cache does not have the needed trace, the slow path is used.
When the trace cache is busy, the slow path hardware is idle.
![Page 54: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/54.jpg)
54
Goals
• Trace cache enables high bandwidth, low latency instruction supply.
• Trace cache performance may suffer due to capacity and compulsory misses.
• By performing a function analogous to prefetching, trace preconstruction augments a trace cache.
![Page 55: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/55.jpg)
55
Preconstruction Method
• For preconstruction to be successful
– Region start points must identify instructions that the actual execution path will reach in the future.
• Heuristic
– loop back edge
– procedure call
![Page 56: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/56.jpg)
56
Preconstruction Example
[Figure: CFG with basic blocks a to j and control transfers JAL, Br1, Br2, Br3, Br4, JMP.]
Executed path: a b c c c c c d e g h i i i h i i i i i
Region start points: JAL, Br1, Br2
Preconstructed trace contents shown across Regions 1 to 3: d e g, f g, j j, j j, j
![Page 57: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/57.jpg)
57
Preconstruction Efficiency
• Slow-path dynamic branch predictor
– To reduce the number of paths
– Assume a bimodal branch predictor
– Only the strongly biased path is followed during preconstruction.
• Trace Alignment
– . . c c c d e g . . .
– <c c c> <d e g> or <. c c> <c d e> or ...
– In order for a preconstructed trace to be useful, it must align with the actual execution path
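The alignment problem on this slide can be shown with a small sketch. It assumes fixed-length traces of three blocks and a hypothetical `split_traces` helper; real trace selection also ends traces at branch limits, which is ignored here.

```python
def split_traces(stream, offset, length=3):
    """Cut `stream` into consecutive full-length traces starting at `offset`."""
    return [
        tuple(stream[i:i + length])
        for i in range(offset, len(stream) - length + 1, length)
    ]


# The slide's stream, with one unknown block '.' in front.
stream = [".", "c", "c", "c", "d", "e", "g"]

# A trace that preconstruction built ahead of time.
preconstructed = {("d", "e", "g")}

aligned = split_traces(stream, offset=1)     # <c c c> <d e g>
misaligned = split_traces(stream, offset=0)  # <. c c> <c d e>

hits_aligned = [t for t in aligned if t in preconstructed]
hits_misaligned = [t for t in misaligned if t in preconstructed]
```

The same preconstructed trace is useful under the first alignment and useless under the second, even though both cover the same instructions.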
![Page 58: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/58.jpg)
58
Conclusions
• One implementation of trace preconstruction
– SPECint95 benchmarks with large working set sizes
– Reduces the trace cache miss rates by 30% to 80%
– 3% to 10% overall performance improvement
![Page 59: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/59.jpg)
59
A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots
Matthew C. Merten et al.
Coordinated Science Lab
ISCA-2000 p.59-70
Session 2a – Exploiting Traces
![Page 60: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/60.jpg)
60
Hot Spot
• Hot spot: frequently executed instruction code
– The 10-to-90 and 1-to-99 rules
![Page 61: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/61.jpg)
61
Goals
• Dynamic extraction and relayout of program hot spots
– A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization, ISCA 1999
• Improve instruction fetch rate
![Page 62: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/62.jpg)
62
Branch Behavior Buffer with new fields shaded in gray
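[Figure: Branch Behavior Buffer (BBB) with a Refresh Timer and a Reset Timer. The branch address is compared (=) against the Branch Tag of the indexed entry. Fields per entry: Branch Tag, Exec Counter, Taken Counter, Cand Flag, Taken Valid Bit, Remapped Taken Addr, FT Valid Bit, Remapped FT Addr, Call ID, Touched Bit. A 0/1 select chooses between +I and -D as the input of a saturating adder that updates the Hot Spot Detection Counter. At zero: hot spot detected.]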
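The Hot Spot Detection Counter in this figure can be sketched as a saturating counter. Several details are assumptions not stated on the slide: that the counter starts at its maximum, that a candidate branch subtracts D while a non-candidate branch adds I, and the particular values used below. The class name is hypothetical.

```python
class HotSpotDetectionCounter:
    """Saturating counter that reaches zero when candidate branches dominate."""

    def __init__(self, max_value, inc, dec):
        self.max_value = max_value
        self.inc = inc            # +I, applied for non-candidate branches (assumed)
        self.dec = dec            # -D, applied for candidate branches (assumed)
        self.value = max_value    # assumed starting point

    def observe(self, is_candidate):
        """Update on one retired branch; return True when a hot spot is detected."""
        if is_candidate:
            self.value = max(0, self.value - self.dec)
        else:
            self.value = min(self.max_value, self.value + self.inc)
        return self.value == 0


# A run dominated by candidate branches drives the counter to zero.
hot = HotSpotDetectionCounter(max_value=10, inc=2, dec=1)
hot_results = [hot.observe(True) for _ in range(10)]

# An even mix of candidate and non-candidate branches never does.
mixed = HotSpotDetectionCounter(max_value=10, inc=2, dec=1)
mixed_results = [mixed.observe(i % 2 == 0) for i in range(100)]
```

With I larger than D, the counter only reaches zero when a large share of the executed branches are already marked as candidates, which is the condition the detector is looking for.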
![Page 63: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/63.jpg)
63
Memory-Based Code Cache
• Use a region of memory called a code cache to contain the translated code.
• Standard paging mechanisms are used to manage a set of virtual pages reserved for the code cache.
• The code cache pages can be allocated by the operating system and marked as read-only executable code.
• The remapping hardware must be allowed to write this region.
![Page 64: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/64.jpg)
64
Hardware Interaction
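[Figure: pipeline stages Fetch, Decode, ..., Execute, In-order Retire, with the Branch Predictor, BTB, and Icache at the front end. Profile information flows to the Branch Behavior Buffer (BBB). The Trace Generation Unit (TGU) writes the code cache in memory, and an Update arrow leads back to the front end.]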
![Page 65: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/65.jpg)
65
Code Deployment
• Original code cannot be altered
• Transfer to the optimized code is handled by the Branch Target Buffer.
– BTB target for the entry point branch is updated with the address of the entry point target in the code cache.
• Optimized code consists of only the original code.
![Page 66: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/66.jpg)
66
Trace Generation Algorithm
• Scan Mode
– Search for a trace entry point. This is the initial mode following hot spot detection.
• Fill Mode
– Construct a trace by writing each retired instruction into the memory-based code cache.
• Pending Mode
– Pause trace construction until a new path is executed.
![Page 67: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/67.jpg)
67
State Transition
[Figure: state transition diagram over Scan Mode, Fill Mode, and Pending Mode. Scan Mode is entered from Profile Mode on hot spot detection.]
• New Entry Point: (jcc || jmp) && candidate && taken
• End Trace:
– 2 -> jcc && candidate && both_dir_remapped
– 3 -> jcc && !candidate && off_path_branch_cnt > max_off_path_allowed
– 4 -> red && red_addr_mismatch
– 5 -> jcc && candidate && recursive_call
• Cold Branch: (jcc || jmp) && !candidate
• Merge: jcc && candidate && other_dir_not_remapped && exec_dir_remapped
• Continue: jcc && addr_matches_pending_target && exec_dir_not_remapped
![Page 68: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/68.jpg)
68
Trace Example with Optimization (1/3)
[Figure: (b) original code layout, annotated with the execution order during remapping and the taken branches. Blocks: A, B, C, D. Code segments: A1 B1 A2 C2 A3 B2 C3 D1 C1.]
![Page 69: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/69.jpg)
69
Trace Example with Optimization (2/3)
[Figure: (c) trace generated by basic remapping, annotated with the execution order during remapping and the taken branches. Code segments: A1 B1 A2 C2 A3 B2 C3 D1 C1. Entrance: C1. Remapped segments shown: RM-D1, RM-A1, RM-A2, RM-C2, RM-B1.]
![Page 70: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/70.jpg)
70
Trace Example with Optimization (3/3)
[Figure: (d) the application of two remapping optimizations, patching and branch replication, annotated with the execution order during remapping. Code segments: A1 B1 A2 C2 A3 B2 C3 D1 C1. Entrance: C1. Remapped segments shown: RM-A1, RM-B1, RM-A2, RM-C3, RM-A3, RM-B2, RM-C3, RM-D1.]
![Page 71: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/71.jpg)
71
Fetched IPC for various fetch mechanisms
[Figure: fetch IPC (0 to 8) for six configurations: IC:64KB; IC:64KB remap; IC:64KB, TC:8KB; IC:64KB, TC:8KB remap; IC:64KB, TC:128KB; IC:64KB, TC:128KB remap.]
![Page 72: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/72.jpg)
72
Conclusion
• Detect the hot spots to perform
– code straightening
– partial function inlining
– loop unrolling
• Achieve significant fetch performance improvement at little extra hardware cost
• Create opportunities for more aggressive optimizations in the future
![Page 73: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/73.jpg)
73
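Comments on the Research
• Follows last year's work on hot spot detection and moves on to hot spot optimization
• Is the optimization algorithm too complex?
• Are the possible optimizations unlimited?
– Software-based optimization
– Translation between instruction sets
– etc.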
![Page 74: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/74.jpg)
74
A Fully Associative Software-Managed Cache Design
Erik G. Hallnor et al.
The University of Michigan
ISCA-2000 p.107-116
Session 3 – Memory Hierarchy Improvement
![Page 75: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/75.jpg)
75
Data Cache Hierarchy
[Figure: data cache hierarchy. MPU (100 million transistors), L1 data cache, L2 data cache (1 MB), off-chip data cache, main memory (DRAM). Inset: a 2-way set-associative data cache; the data address is split into tag and offset, and each way holds lines of one tag and four data words.]
![Page 76: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/76.jpg)
76
Research Goals
• Processor-memory gap
– miss latencies approaching a thousand instruction execution times
• On-chip SRAM caches in excess of one megabyte [Alpha 21364]
• Re-examination of how secondary caches are organized and managed
– Practical, fully associative, software-managed secondary cache system
![Page 77: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/77.jpg)
77
Indirect Index Cache (IIC)
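[Figure: Indirect Index Cache (IIC). The data address is split into tag and offset. The tag is hashed into the primary hash table, where each entry holds four tag entries (TE) and a chain pointer into secondary storage for chaining, which holds further TEs and chain pointers. Each TE has the fields TAG, STATUS, INDEX, and REPL. Every TE on the path is compared with the tag (=?) to produce a hit signal, and the matching entry selects the block in the data array that supplies the data value.]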
![Page 78: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/78.jpg)
78
Generational Replacement
• Blocks are grouped into a small number of prioritized pools.
– Blocks that are referenced regularly are promoted into higher-priority pools.
– Unreferenced blocks are demoted into lower-priority pools.
[Figure: a fresh pool plus Pool 0 (lowest priority), Pool 1, Pool 2, and Pool 3 (highest priority). Arrows labelled Ref = 1 move blocks toward higher-priority pools; arrows labelled Ref = 0 move them toward lower-priority pools.]
![Page 79: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/79.jpg)
79
Generational Replacement Algorithm
• Each pool is a variable-length FIFO queue of blocks.
• On a hit, only the block's reference bit is updated.
• On each miss, the algorithm checks the head of each pool FIFO.
– Head block's reference bit is set -> Next higher-priority pool, reference bit=0
– Head block's reference bit is not set -> Next lower-priority pool, reference bit=0
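The algorithm on this slide can be sketched as below. The pool movement follows the bullets; everything else is an assumption made to get a runnable model: new blocks enter the lowest-priority pool (the fresh pool is left out), the victim is the head of the lowest-priority non-empty pool, and the class and method names are hypothetical.

```python
from collections import deque


class GenerationalCache:
    """Fully associative cache with generational replacement (simplified)."""

    def __init__(self, capacity, num_pools=4):
        self.capacity = capacity
        self.pools = [deque() for _ in range(num_pools)]  # index 0 = lowest priority
        self.ref = {}  # resident block -> reference bit

    def access(self, block):
        """Return True on a hit, False on a miss."""
        if block in self.ref:
            self.ref[block] = True        # hit: only the reference bit changes
            return True
        self._age_pools()                 # miss: examine the head of each pool
        if len(self.ref) >= self.capacity:
            self._evict()
        self.pools[0].append(block)       # assumed entry point for new blocks
        self.ref[block] = False
        return False

    def _age_pools(self):
        # Snapshot the heads first so each block moves at most once per miss.
        heads = [(lvl, pool[0]) for lvl, pool in enumerate(self.pools) if pool]
        top = len(self.pools) - 1
        for lvl, block in heads:
            self.pools[lvl].popleft()
            target = min(lvl + 1, top) if self.ref[block] else max(lvl - 1, 0)
            self.ref[block] = False       # the reference bit is cleared either way
            self.pools[target].append(block)

    def _evict(self):
        for pool in self.pools:           # lowest-priority non-empty pool first
            if pool:
                del self.ref[pool.popleft()]
                return

    def resident(self):
        return set(self.ref)


cache = GenerationalCache(capacity=2)
hits = [cache.access(b) for b in ["A", "A", "B", "C"]]
```

With two blocks of capacity, the reused block A survives the arrival of C and the never-reused block B is evicted, whereas plain LRU would have evicted A.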
![Page 80: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/80.jpg)
80
Evaluation Result and Conclusion
• Substantial reduction in miss rates (8-85%) relative to a conventional 4-way associative LRU cache.
• IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform as CPU-relative latencies grow.
![Page 81: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/81.jpg)
81
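Comments on the Research
• Assumes a large on-chip secondary cache
• Although software assists it, to the programmer and to the executing code it still looks like a cache.
• Designs that also take instruction set or programmer support into account may be worth studying.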
![Page 82: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/82.jpg)
82
Performance Analysis of the Alpha 21264-based Compaq ES40 System
Zarka Cvetanovic, R.E.Kessler
Compaq Computer Corporation
ISCA-2000 p.192-202
Session 5a – Analysis of Workloads and Systems
![Page 83: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/83.jpg)
83
Motivation
[Figure: shared memory multiprocessor comparison. SPECfp_rate95 (0 to 3000) at 1, 2, and 4 CPUs for the Compaq ES40 / 21264 667MHz, HP PA-8500 440MHz, and SUN USparc-II 400MHz, with the Compaq ES40 highlighted.]
![Page 84: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/84.jpg)
84
Research Goals
• Evaluation of Compaq ES40 shared memory multiprocessor
– Up to four Alpha 21264 CPUs
• Quantitatively show the performance
– Instructions per cycle
– Branch mispredicts
– Cache misses
![Page 85: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/85.jpg)
85
Alpha 21264 Microprocessor
• 15-million transistors, 4-way out-of-order superscalar
• 80 in-flight instructions
• 35-entry instruction window
• hybrid (local and global) branch prediction
• two-cluster integer execution core
• load hit/miss, store/load prediction
![Page 86: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/86.jpg)
86
Instruction Cache Miss Rate Comparison
[Figure: Icache misses per 1000 retires (0 to 60) on SPECfp95, SPECint95, and TPM (transactions per minute) for the Alpha 21164 and the Alpha 21264.]
8KB direct-mapped -> 64KB two-way associative
![Page 87: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/87.jpg)
87
Branch Mispredicts Comparison
[Figure: branch mispredicts per 1000 retires (0 to 22) on SPECfp95, SPECint95, and TPM for the Alpha 21164 and the Alpha 21264.]
2-bit predictor -> local and global hybrid predictor
![Page 88: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/88.jpg)
88
IPC Comparison
[Figure: retired instructions per cycle (0.0 to 2.0) on SPECfp95, SPECint95, and TPM for the Alpha 21164 and the Alpha 21264.]
![Page 89: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/89.jpg)
89
Compaq ES40 Block Diagram
[Figure: block diagram. Four Alpha 21264 CPUs, each with its own L2 cache (labeled 8MB); a control chip; eight crossbar-based data switches; four memory banks; two PCI chips, each with a PCI bus. Links are labeled 64b and 256b.]
![Page 90: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/90.jpg)
90
Inter-processor Communication
• Write-invalidate cache coherence
– The 21264 passes its L2 miss requests to the control chip.
– The control chip simultaneously forwards the request to DRAM and to the other 21264s.
– The other 21264s check for necessary coherence violations and respond.
– The control chip responds with data from DRAM or from another 21264.
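The request flow above can be sketched as a small simulation. This is only an illustration under stated assumptions: the names `Cpu`, `ControlChip`, `read`, and `write` are hypothetical, each L2 is reduced to an address-to-value dictionary with no line states, and the forwarding that the hardware does simultaneously is done here in sequence.

```python
class Cpu:
    """Hypothetical model of one 21264 and its L2 cache (address -> value)."""

    def __init__(self, name):
        self.name = name
        self.l2 = {}

    def probe(self, addr, invalidate):
        """Answer a forwarded request: return our copy (or None), dropping it if asked."""
        value = self.l2.get(addr)
        if invalidate:
            self.l2.pop(addr, None)
        return value


class ControlChip:
    """Hypothetical control chip that forwards each request to DRAM and the other CPUs."""

    def __init__(self, cpus, dram):
        self.cpus = cpus
        self.dram = dram

    def miss(self, requester, addr, for_write):
        # Forward the request to DRAM and to every CPU except the requester.
        dram_value = self.dram.get(addr, 0)
        replies = [cpu.probe(addr, invalidate=for_write)
                   for cpu in self.cpus if cpu is not requester]
        # Respond with data from another CPU if one holds the line, else from DRAM.
        cached = [value for value in replies if value is not None]
        return cached[0] if cached else dram_value


def read(chip, cpu, addr):
    """Read through the L2; on a miss, ask the control chip for the data."""
    if addr not in cpu.l2:
        cpu.l2[addr] = chip.miss(cpu, addr, for_write=False)
    return cpu.l2[addr]


def write(chip, cpu, addr, value):
    """Write-invalidate: remove every other copy, then update the local one."""
    chip.miss(cpu, addr, for_write=True)
    cpu.l2[addr] = value
```

A short usage example: after `cpu1` writes an address that `cpu0` has cached, `cpu0`'s copy is gone and its next read is served from `cpu1`'s cache rather than from DRAM.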
![Page 91: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/91.jpg)
91
STREAM Copy Bandwidth
• The STREAM benchmark
• Measures the best memory bandwidth in megabytes per second
– COPY: a(i) = b(i)
– SCALE: a(i) = q*b(i)
– SUM: a(i) = b(i) + c(i)
– TRIAD: a(i) = b(i) + q*c(i)
– General rule for STREAM: each array must be at least 4x the size of the sum of all the last-level caches used in the run, or 1 million elements, whichever is larger.
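The four kernels and the sizing rule can be sketched in a few lines. This is a minimal illustration, not the STREAM source: the function names `min_elements` and `run_stream` are hypothetical, each kernel runs once rather than taking the best of several trials, and pure-Python lists mean the reported MB/s reflects interpreter overhead rather than hardware memory bandwidth. The byte counts assume 8-byte elements, with two arrays touched per element for COPY and SCALE and three for SUM and TRIAD.

```python
import time

ELEM_BYTES = 8  # assumed: 8-byte double-precision elements


def min_elements(llc_bytes, elem_bytes=ELEM_BYTES):
    """Smallest array length allowed by the sizing rule quoted above."""
    return max(4 * llc_bytes // elem_bytes, 1_000_000)


def run_stream(n, q=3.0):
    """Run COPY, SCALE, SUM, TRIAD once each; return ({kernel: MB/s}, final a)."""
    a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
    # kernel name -> (arrays read or written per element, function producing new a)
    kernels = {
        "COPY": (2, lambda: [b[i] for i in range(n)]),
        "SCALE": (2, lambda: [q * b[i] for i in range(n)]),
        "SUM": (3, lambda: [b[i] + c[i] for i in range(n)]),
        "TRIAD": (3, lambda: [b[i] + q * c[i] for i in range(n)]),
    }
    rates = {}
    for name, (arrays, kernel) in kernels.items():
        start = time.perf_counter()
        a[:] = kernel()
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        rates[name] = arrays * n * ELEM_BYTES / elapsed / 1e6
    return rates, a
```

For example, `min_elements(32 * 2**20)` gives the minimum array length for a run whose last-level caches total 32 MiB (an assumed figure, such as four CPUs with 8MB each).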
![Page 92: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/92.jpg)
92
STREAM Copy Bandwidth
[Figure: memory copy bandwidth in MB/sec (axis 0 to 3000) for 1-CPU, 2-CPU, 3-CPU, and 4-CPU runs. Systems: Compaq ES40/21264 667MHz, Compaq ES40/21264 500MHz, SUN Ultra Enterprise 6000, AlphaServer 4100 5/600. Annotation: < 3 GBytes/sec. Data labels: 1197, 2547, 263, 470.]
![Page 93: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/93.jpg)
93
Research Results
• Five times the memory bandwidth
• Microprocessor enhancement
• Compiler enhancement
• Compaq ES40 provides 2 to 3 times the performance of the AlphaServer 4100
![Page 94: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/94.jpg)
94
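Comments on the Research
• Performance improvement achieved by adopting a crossbar for the interconnect
• Interesting as a paper that describes the behavior of the Alpha 21264
• Interesting as a detailed performance evaluation of a real machine
• No novelty in the ideas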
![Page 95: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/95.jpg)
95
Piranha: A Scalable Architecture Based on Single Chip Multiprocessing
Luiz Andre Barroso
Compaq Computer Corporation
ISCA-2000 p.282-293
Session 7 – Extracting Parallelism
![Page 96: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/96.jpg)
96
Research Motivation
• Complex processors
– Higher development cost
– Longer design times
• On-line transaction processing (OLTP)
– Little instruction-level parallelism
– Thread-level or process-level parallelism
• Semiconductor integration density
• Chip multiprocessing (CMP)
![Page 97: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/97.jpg)
97
Research Objective
• Piranha: a research prototype at Compaq
– Targeted at parallel commercial workloads
– Chip multiprocessing architectures
– Small team, modest investment, short design time
• General-purpose microprocessors or Piranha?
![Page 98: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/98.jpg)
98
Single-chip Piranha processing node
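[Figure: block diagram of one chip. CPU0 to CPU7, each with iL1 and dL1 caches, connect through the Intra-Chip Switch to L2 banks 0 to 7 and memory controllers MC0 to MC7. Each memory controller drives a Direct Rambus array of RDRAM devices numbered 0 to 31. The chip also contains the Home Engine, Remote Engine, System Control, and a Packet Switch with Output Queue, Input Queue, and Router attached to the Interconnect Links.]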
![Page 99: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/99.jpg)
99
Alpha CPU Core and L1 Caches
• CPU: Single-issue, in-order, 500MHz datapath
• iL1, dL1: 64KB two-way set-associative
– Single-cycle latency
• TLB: 256 entries, 4-way set-associative
• 2-bit state field per cache line for MESI protocol
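The 2-bit field follows from MESI having four states. A minimal sketch, with assumptions labeled: the numeric encoding in `Mesi` is hypothetical (the slide does not give one), and the 64-byte line size used in the example call is an assumption, not a figure from the slide.

```python
from enum import IntEnum


class Mesi(IntEnum):
    """The four MESI states; the numeric values are an assumed encoding."""
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2
    MODIFIED = 3


def state_storage_bits(cache_bytes, line_bytes, bits_per_line=2):
    """Total state-field storage for one cache, in bits."""
    return (cache_bytes // line_bytes) * bits_per_line


# Four states fit in 2 bits, which matches the field width on the slide.
assert int(max(Mesi)).bit_length() == 2

# 64KB L1 with an assumed 64-byte line: number of lines x 2 bits.
print(state_storage_bits(64 * 1024, line_bytes=64))
```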
![Page 100: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/100.jpg)
100
Intra-Chip Switch (ICS)
• ICS manages 27 clients.
• Conceptually, ICS is a crossbar.
• Eight internal datapaths
• Capacity of 32 GB/sec
– 500MHz, 64-bit bus (8 bytes), 8 datapaths
– 500MHz x 8 bytes x 8 = 32 GB/sec
– 3 times the available memory bandwidth
![Page 101: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/101.jpg)
101
Second-Level Cache
• L2: 1MB unified instruction/data cache
– Physically partitioned into eight banks
• Each bank is 8-way set-associative
• Non-inclusive on-chip cache hierarchy
– Keeps a duplicate copy of the L1 tags and state
• L2 controllers are responsible for coherence within a chip.
![Page 102: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/102.jpg)
102
Memory Controller
• Eight memory controllers
• Each RDRAM channel has a maximum bandwidth of 1.6GB/sec
• Maximum local memory bandwidth of 12.8GB/sec
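The 12.8GB/sec figure is the per-channel peak multiplied by the number of controllers, and it can be set against the 32 GB/sec Intra-Chip Switch capacity given on the ICS slide. A minimal sketch, assuming one RDRAM channel per memory controller and decimal units (1 GB = 10^9 bytes); the function names are hypothetical.

```python
MB = 10**6  # decimal megabytes, matching the GB/sec figures on the slides


def local_memory_bandwidth(channels=8, per_channel_mb=1600):
    """Peak local memory bandwidth in bytes/sec (assumes one channel per controller)."""
    return channels * per_channel_mb * MB


def ics_capacity(clock_hz=500 * 10**6, width_bytes=8, datapaths=8):
    """Peak Intra-Chip Switch capacity in bytes/sec."""
    return clock_hz * width_bytes * datapaths


memory = local_memory_bandwidth()
switch = ics_capacity()
print(memory / 10**9)   # local memory bandwidth in GB/sec
print(switch / memory)  # ICS capacity relative to memory bandwidth
```

With these inputs the ratio comes out at 2.5, so the "3 times the available memory bandwidth" on the ICS slide reads as an approximate figure.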
![Page 103: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/103.jpg)
103
Single-chip Piranha versus out-of-order processor
[Figure: normalized execution time (axis 0 to 400, OOO = 100).]
• Configurations: P1 (500MHz, 1-issue), INO (1GHz, 1-issue), OOO (1GHz, 4-issue), P8 (500MHz, 1-issue)
• On-Line Transaction Processing (OLTP): P1 233, INO 145, OOO 100, P8 34
• DSS (Query 6 of the TPC-D): P1 350, INO 191, OOO 100, P8 44
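Because the chart reports execution time normalized to OOO = 100, relative speedups follow by division. A minimal sketch using the eight bar values from the chart, read in order as P1, INO, OOO, P8 for OLTP and then for DSS; the names `TIMES` and `speedup` are hypothetical.

```python
# Normalized execution times from the chart (OOO = 100 for each workload).
TIMES = {
    "OLTP": {"P1": 233, "INO": 145, "OOO": 100, "P8": 34},
    "DSS": {"P1": 350, "INO": 191, "OOO": 100, "P8": 44},
}


def speedup(workload, config, baseline="OOO"):
    """How many times faster `config` is than `baseline` on `workload`."""
    times = TIMES[workload]
    return times[baseline] / times[config]


for workload in TIMES:
    print(workload, round(speedup(workload, "P8"), 1))
```

Passing `baseline="P1"` instead compares the P8 configuration against the P1 configuration.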
![Page 104: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/104.jpg)
104
Example System Configuration
[Figure: six P chips and two I/O chips connected together.]
A Piranha system with six processing chips (8 CPUs each) and two I/O chips.
![Page 105: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/105.jpg)
105
Novelty of the Research
• Detailed evaluation of database workloads in the context of CMP
• Simple processor, standard ASIC design
– Short design time, small team, small investment
![Page 106: ISCA-2000 海外調査報告](https://reader035.fdocument.pub/reader035/viewer/2022062217/568148b8550346895db5d209/html5/thumbnails/106.jpg)
106
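Comments on the Research
• Transaction processing cannot exploit instruction-level parallelism.
• Is CMP the stronger approach?
• Is widespread adoption only a matter of time?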