lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu...
Transcript of lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu...
![Page 1: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/1.jpg)
Advanced Architecture
Computer Organization and Assembly Languages Yung-Yu Chuang
with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, Robert Sedgwick and Kevin Wayne
![Page 2: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/2.jpg)
Intel microprocessor history
![Page 3: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/3.jpg)
3
Early Intel microprocessors
• Intel 8080 (1972)– 64K addressable RAM– 8-bit registers– CP/M operating system– 5,6,8,10 MHz– 29K transistors
• Intel 8086/8088 (1978)– IBM-PC used 8088– 1 MB addressable RAM– 16-bit registers– 16-bit data bus (8-bit for 8088)– separate floating-point unit (8087)– used in low-cost microcontrollers now
my first computer (1986)
![Page 4: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/4.jpg)
4
The IBM-AT
• Intel 80286 (1982)– 16 MB addressable RAM– Protected memory– several times faster than 8086– introduced IDE bus architecture– 80287 floating point unit– Up to 20MHz– 134K transistors
![Page 5: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/5.jpg)
5
Intel IA-32 Family
• Intel386 (1985)– 4 GB addressable RAM– 32-bit registers– paging (virtual memory)– Up to 33MHz
• Intel486 (1989)– instruction pipelining– Integrated FPU– 8K cache
• Pentium (1993)– Superscalar (two parallel pipelines)
![Page 6: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/6.jpg)
6
Intel P6 Family• Pentium Pro (1995)
– advanced optimization techniques in microcode– More pipeline stages– On-board L2 cache
• Pentium II (1997)– MMX (multimedia) instruction set– Up to 450MHz
• Pentium III (1999)– SIMD (streaming extensions) instructions (SSE)– Up to 1+GHz
• Pentium 4 (2000)– NetBurst micro-architecture, tuned for multimedia– 3.8+GHz
• Pentium D (2005, Dual core)
![Page 7: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/7.jpg)
IA32 Processors
• Totally Dominate Computer Market• Evolutionary Design
– Starting in 1978 with 8086– Added more features as time goes on– Still support old features, although obsolete
• Complex Instruction Set Computer (CISC)– Many different instructions with many different
formats• But, only small subset encountered with Linux programs
– Hard to match performance of Reduced Instruction Set Computers (RISC)
– But, Intel has done just that!
![Page 8: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/8.jpg)
ARM history
• 1983 developed by Acorn computers– To replace 6502 in BBC computers – 4-man VLSI design team– Its simplicity comes from the inexperience team– Match the needs for generalized SoC for reasonable
power, performance and die size– The first commercial RISC implemenation
• 1990 ARM (Advanced RISC Machine), owned by Acorn, Apple and VLSI
![Page 9: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/9.jpg)
ARM Ltd
Design and license ARM core design but not fabricate
![Page 10: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/10.jpg)
Why ARM?
• One of the most licensed and thus widespread processor cores in the world– Used in PDA, cell phones, multimedia players,
handheld game console, digital TV and cameras– ARM7: GBA, iPod– ARM9: NDS, PSP, Sony Ericsson, BenQ– ARM11: Apple iPhone, Nokia N93, N800– 90% of 32-bit embedded RISC processors till 2009
• Used especially in portable devices due to its low power consumption and reasonable performance
![Page 11: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/11.jpg)
ARM powered products
![Page 12: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/12.jpg)
12
Performance boost
• Increasing clock rate is insufficient. Architecture (pipeline/cache/SIMD) becomes more significant.
In his 1965 paper,Intel co-founder Gordon Mooreobserved that “the number of transistors per square inch had doubled every 18 months.
![Page 13: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/13.jpg)
Basic architecture
![Page 14: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/14.jpg)
Basic microcomputer design
• clock synchronizes CPU operations• control unit (CU) coordinates sequence of
execution steps• ALU performs arithmetic and logic operations
Central Processor Unit(CPU)
Memory StorageUnit
registers
ALU clock
I/ODevice
#1
I/ODevice
#2
data bus
control bus
address bus
CU
![Page 15: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/15.jpg)
Basic microcomputer design
• The memory storage unit holds instructions and data for a running program
• A bus is a group of wires that transfer data from one part to another (data, address, control)
Central Processor Unit(CPU)
Memory StorageUnit
registers
ALU clock
I/ODevice
#1
I/ODevice
#2
data bus
control bus
address bus
CU
![Page 16: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/16.jpg)
Clock
• synchronizes all CPU and BUS operations• machine (clock) cycle measures time of a single
operation• clock is used to trigger events
one cycle
1
0
• Basic unit of time, 1GHz→clock cycle=1ns• An instruction could take multiple cycles to
complete, e.g. multiply in 8088 takes 50 cycles
![Page 17: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/17.jpg)
Instruction execution cycle
• Fetch• Decode• Fetch
operands• Execute • Store output
I-1 I-2 I-3 I-4
PC program
I-1instructionregister
op1op2
memory fetch
ALU
registers
writ
e
decode
execute
read
writ
e
(output)
registers
flags
program counterinstruction queue
![Page 18: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/18.jpg)
Pipeline
![Page 19: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/19.jpg)
Multi-stage pipeline
• Pipelining makes it possible for processor to execute instructions in parallel
• Instruction execution divided into discrete stages
S1 S2 S3 S4 S51
Cyc
les
StagesS6
23456789
101112
I-1
I-2
I-1
I-2
I-1
I-2
I-1
I-2
I-1
I-2
I-1
I-2
Example of a non-pipelined processor. For example, 80386. Many wasted cycles.
![Page 20: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/20.jpg)
Pipelined execution
• More efficient use of cycles, greater throughput of instructions: (80486 started to use pipelining)
S1 S2 S3 S4 S51
Cyc
les
StagesS6
234567
I-1I-2 I-1
I-2 I-1I-2 I-1
I-2 I-1I-2 I-1
I-2
For k stages and n instructions, the number of required cycles is:
k + (n – 1)
compared to k*n
![Page 21: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/21.jpg)
• Pipelining requires buffers– Each buffer holds a single value– Ideal scenario: equal work for each stage
• Sometimes it is not possible• Slowest stage determines the flow rate in the
entire pipeline
Pipelined execution
![Page 22: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/22.jpg)
Pipelined execution
• Some reasons for unequal work stages– A complex step cannot be subdivided conveniently– An operation takes variable amount of time to
execute, e.g. operand fetch time depends on where the operands are located
• Registers • Cache • Memory
– Complexity of operation depends on the type of operation
• Add: may take one cycle• Multiply: may take several cycles
![Page 23: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/23.jpg)
• Operand fetch of I2 takes three cycles– Pipeline stalls for two cycles
• Caused by hazards– Pipeline stalls reduce overall throughput
Pipelined execution
![Page 24: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/24.jpg)
Wasted cycles (pipelined)
• When one of the stages requires two or more clock cycles, clock cycles are again wasted.
S1 S2 S3 S4 S51
Cyc
les
Stages
S6
234567
I-1I-2I-3
I-1I-2I-3
I-1I-2I-3
I-1
I-2 I-1I-1
89
I-3 I-2I-2
exe
1011
I-3I-3
I-1
I-2
I-3
For k stages and ninstructions, the number of required cycles is:
k + (2n – 1)
![Page 25: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/25.jpg)
Superscalar
A superscalar processor has multiple execution pipelines. In the following, note that Stage S4 has left and right pipelines (u and v).
S1 S2 S3 u S51
Cyc
les
Stages
S6
234567
I-1I-2I-3I-4
I-1I-2I-3I-4
I-1I-2I-3I-4
I-1
I-3 I-1I-2 I-1
v
I-2
I-4
S4
89
I-3I-4
I-2I-3
10 I-4
I-2
I-4
I-1
I-3
For k states and ninstructions, the number of required cycles is:
k + n
Pentium: 2 pipelinesPentium Pro: 3
![Page 26: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/26.jpg)
Pipeline stages
• Pentium 3: 10• Pentium 4: 20~31• Next-generation micro-architecture: 14• ARM7: 3
![Page 27: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/27.jpg)
Hazards
• Three types of hazards– Resource hazards
• Occurs when two or more instructions use the same resource, also called structural hazards
– Data hazards• Caused by data dependencies between instructions,
e.g. result produced by I1 is read by I2– Control hazards
• Default: sequential execution suits pipelining• Altering control flow (e.g., branching) causes
problems, introducing control dependencies
![Page 28: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/28.jpg)
Data hazardsadd r1, r2, #10 ; write r1sub r3, r1, #20 ; read r1
fetch decode reg ALU wb
fetch decode reg ALUstall wb
![Page 29: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/29.jpg)
Data hazards
• Forwarding: provides output result as soon as possible
fetch decode reg ALU wb
fetch decode reg ALUstall
add r1, r2, #10 ; write r1sub r3, r1, #20 ; read r1
wb
![Page 30: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/30.jpg)
Data hazards
• Forwarding: provides output result as soon as possible
fetch decode reg ALU wb
fetch decode reg ALUstall
add r1, r2, #10 ; write r1sub r3, r1, #20 ; read r1
fetch decode reg ALUstall wb
wb
![Page 31: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/31.jpg)
Control hazards
fetch decode reg ALU wb
fetch decode reg ALU wb
fetch decode reg ALU wb
fetch decode reg ALU
fetch decode reg ALU
bz r1, targetadd r2, r4, 0...
target: add r2, r3, 0
wb
![Page 32: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/32.jpg)
Control hazards
• Braches alter control flow– Require special attention in pipelining– Need to throw away some instructions in the
pipeline• Depends on when we know the branch is taken• Pipeline wastes three clock cycles
– Called branch penalty– Reducing branch penalty
• Determine branch decision early
![Page 33: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/33.jpg)
Control hazards
• Delayed branch execution– Effectively reduces the branch penalty– We always fetch the instruction following the branch
• Why throw it away?• Place a useful instruction to execute• This is called delay slot
add R2,R3,R4
branch target
sub R5,R6,R7
. . .
branch target
add R2,R3,R4
sub R5,R6,R7
. . .
Delay slot
![Page 34: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/34.jpg)
Branch prediction
• Three prediction strategies– Fixed
• Prediction is fixed– Example: branch-never-taken
» Not proper for loop structures
– Static• Strategy depends on the branch type
– Conditional branch: always not taken– Loop: always taken
– Dynamic• Takes run-time history to make more accurate predictions
![Page 35: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/35.jpg)
Branch prediction
• Static prediction– Improves prediction accuracy over Fixed
Instruction type Instruction Distribution
(%)
Prediction: Branch taken?
Correct prediction
(%) Unconditional branch
70*0.4 = 28 Yes 28
Conditional branch
70*0.6 = 42 No 42*0.6 = 25.2
Loop 10 Yes 10*0.9 = 9
Call/return 20 Yes 20
Overall prediction accuracy = 82.2%
![Page 36: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/36.jpg)
Branch prediction
• Dynamic branch prediction– Uses runtime history
• Takes the past n branch executions of the branch type and makes the prediction
– Simple strategy• Prediction of the next branch is the majority of the
previous n branch executions• Example: n = 3
– If two or more of the last three branches were taken, the prediction is “branch taken”
• Depending on the type of mix, we get more than 90% prediction accuracy
![Page 37: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/37.jpg)
Branch prediction
• Impact of past n branches on prediction accuracy
Type of mix n Compiler Business Scientific0 64.1 64.4 70.4 1 91.9 95.2 86.6 2 93.3 96.5 90.8 3 93.7 96.6 91.0 4 94.5 96.8 91.8 5 94.7 97.0 92.0
![Page 38: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/38.jpg)
10
Predict branch
01
Predict no branch
Branch prediction
00
Predict no branch
11
Predict branch
branch
branch
nobranch
nobranch
branch
nobranch
nobranch branch
![Page 39: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/39.jpg)
Multitasking
• OS can run multiple programs at the same time.• Multiple threads of execution within the same
program.• Scheduler utility assigns a given amount of CPU
time to each running program.• Rapid switching of tasks
– gives illusion that all programs are running at once– the processor must support task switching– scheduling policy, round-robin, priority
![Page 40: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/40.jpg)
Cache
![Page 41: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/41.jpg)
SRAM vs DRAM
Tran. Access Needsper bit time refresh? Cost Applications
SRAM 4 or 6 1X No 100X cache memories
DRAM 1 10X Yes 1X Main memories,frame buffers
Central Processor Unit(CPU)
Memory StorageUnit
registers
ALU clock
I/ODevice
#1
I/ODevice
#2
data bus
control bus
address bus
CU
![Page 42: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/42.jpg)
The CPU-Memory gap
The gap widens between DRAM, disk, and CPU speeds.
110
1001,000
10,000100,000
1,000,00010,000,000
100,000,000
1980 1985 1990 1995 2000
year
ns
Disk seek timeDRAM access timeSRAM access timeCPU cycle time
register cache memory disk
Access time(cycles)
1 1-10 50-100 20,000,000
![Page 43: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/43.jpg)
Memory hierarchies
• Some fundamental and enduring properties of hardware and software:– Fast storage technologies cost more per byte, have
less capacity, and require more power (heat!). – The gap between CPU and main memory speed is
widening.– Well-written programs tend to exhibit good locality.
• They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
![Page 44: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/44.jpg)
Memory system in practice
Larger, slower, and cheaper (per byte) storage devices
registers
on-chip L1cache (SRAM)
main memory(DRAM)
local secondary storage (virtual memory)(local disks)
remote secondary storage(tapes, distributed file systems, Web servers)
off-chip L2cache (SRAM)
L0:
L1:
L2:
L3:
L4:
L5:
Smaller, faster, and more expensive (per byte) storage devices
![Page 45: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/45.jpg)
Reading from memory
• Multiple machine cycles are required when reading from memory, because it responds much more slowly than the CPU (e.g.33 MHz). The wasted clock cycles are called wait states.
Processor Chip
L1 Data1 cycle latency
16 KB4-way assoc
Write-through32B lines
L1 Instruction16 KB, 4-way
32B lines
Regs. L2 Unified128KB--2 MB4-way assocWrite-back
Write allocate32B lines
MainMemory
Up to 4GB
Pentium III cache hierarchy
![Page 46: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/46.jpg)
Cache memory
• High-speed expensive static RAM both inside and outside the CPU.– Level-1 cache: inside the CPU– Level-2 cache: outside the CPU
• Cache hit: when data to be read is already in cache memory
• Cache miss: when data to be read is not in cache memory. When? compulsory, capacity and conflict.
• Cache design: cache size, n-way, block size, replacement policy
![Page 47: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/47.jpg)
Caching in a memory hierarchy
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Larger, slower, cheaper Storage device at level k+1 is partitioned into blocks.
Data is copied between levels in block-sized transfer units
8 9 14 3Smaller, faster, more Expensive device at level k caches a subset of the blocks from level k+1
level k
level k+1
4
4
4 10
10
10
![Page 48: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/48.jpg)
Request14
Request12
General caching concepts
• Program needs object d, which is stored in some block b.
• Cache hit– Program finds b in the cache at
level k. E.g., block 14.
• Cache miss– b is not at level k, so level k cache
must fetch it from level k+1. E.g., block 12.
– If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?
• Placement policy: where can the new block go? E.g., b mod 4
• Replacement policy: which block should be evicted? E.g., LRU
9 3
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
levelk
level k+1
1414
12
14
4*
4*12
12
0 1 2 3
Request12
4*4*12
![Page 49: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/49.jpg)
Locality• Principle of Locality: programs tend to reuse
data and instructions near those they have used recently, or that were recently referenced themselves.– Temporal locality: recently referenced items are
likely to be referenced in the near future.– Spatial locality: items with nearby addresses tend to
be referenced close together in time.• In general, programs with good locality run
faster then programs with poor locality• Locality is the reason why cache and virtual
memory are designed in architecture and operating system. Another example is web browser caches recently visited webpages.
![Page 50: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/50.jpg)
Locality example
• Data– Reference array elements in succession (stride-1
reference pattern):– Reference sum each iteration:
• Instructions– Reference instructions in sequence:– Cycle through loop repeatedly:
sum = 0;for (i = 0; i < n; i++)
sum += a[i];return sum;
Spatial locality
Spatial locality
Temporal locality
Temporal locality
![Page 51: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/51.jpg)
Locality example
• Being able to look at code and get a qualitative sense of its locality is important. Does this function have good locality?
int sum_array_rows(int a[M][N]){
int i, j, sum = 0;
for (i = 0; i < M; i++)for (j = 0; j < N; j++)
sum += a[i][j];return sum;
} stride-1 reference pattern
![Page 52: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/52.jpg)
Locality example
• Does this function have good locality?
int sum_array_cols(int a[M][N]){
int i, j, sum = 0;
for (j = 0; j < N; j++)for (i = 0; i < M; i++)
sum += a[i][j];return sum;
} stride-N reference pattern
![Page 53: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/53.jpg)
Blocked matrix multiply performance• Blocking (bijk and bikj) improves performance
by a factor of two over unblocked versions (ijk and jik)– relatively insensitive to array size.
0
10
20
30
40
50
60
25 50 75 100
125
150
175
200
225
250
275
300
325
350
375
400
Array size (n)
Cyc
les/
itera
tion
kjijkikijikjjikijkbijk (bsize = 25)bikj (bsize = 25)
![Page 54: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/54.jpg)
Cache-conscious programming
• make sure that memory is cache-aligned
• Split data into hot and cold (list example)
• Use union and bitfields to reduce size and increase locality
![Page 55: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/55.jpg)
SIMD
![Page 56: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/56.jpg)
56
SIMD
• MMX (Multimedia Extension) was introduced in 1996 (Pentium with MMX and Pentium II).
• Intel analyzed multimedia applications and found they share the following characteristics:– Small native data types (8-bit pixel, 16-bit audio)– Recurring operations– Inherent parallelism
![Page 57: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/57.jpg)
57
SIMD
• SIMD (single instruction multiple data) architecture performs the same operation on multiple data elements in parallel
• PADDW MM0, MM1
![Page 58: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/58.jpg)
58
SISD/SIMD/Streaming
![Page 59: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/59.jpg)
59
MMX
• After analyzing a lot of existing applications such as graphics, MPEG, music, speech recognition, game, image processing, they found that many multimedia algorithms execute the same instructions on many pieces of data in a large data set.
• Typical elements are small, 8 bits for pixels, 16 bits for audio, 32 bits for graphics and general computing.
• New data type: 64-bit packed data type. Why 64 bits?– Good enough– Practical
![Page 60: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/60.jpg)
60
MMX data types
![Page 61: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/61.jpg)
61
MMX instructions
• 57 MMX instructions are defined to perform the parallel operations on multiple data elements packed into 64-bit data types.
• These include add, subtract, multiply, compare, and shift, data conversion, 64-bit data move, 64-bit logical operation and multiply-add for multiply-accumulate operations.
• All instructions except for data move use MMX registers as operands.
• Most complete support for 16-bit operations.
![Page 62: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/62.jpg)
62
Saturation arithmetic
wrap-around saturating
• Useful in graphics applications.• When an operation overflows or underflows,
the result becomes the largest or smallest possible representable number.
• Two types: signed and unsigned saturation
![Page 63: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/63.jpg)
63
Keys to SIMD programming
• Efficient data layout• Elimination of branches
![Page 64: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/64.jpg)
64
Application: frame difference
A B
|A-B|
![Page 65: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/65.jpg)
65
Application: frame difference
A-B B-A
(A-B) or (B-A)
![Page 66: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/66.jpg)
66
Application: frame differenceMOVQ mm1, A //move 8 pixels of image AMOVQ mm2, B //move 8 pixels of image BMOVQ mm3, mm1 // mm3=APSUBSB mm1, mm2 // mm1=A-BPSUBSB mm2, mm3 // mm2=B-APOR mm1, mm2 // mm1=|A-B|
![Page 67: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/67.jpg)
67
Data-independent computation
• Each operation can execute without needing to know the results of a previous operation.
• Example, sprite overlayfor i=1 to sprite_Sizeif sprite[i]=clr then out_color[i]=bg[i]else out_color[i]=sprite[i]
• How to execute data-dependent calculations on several pixels in parallel.
![Page 68: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/68.jpg)
68
Application: sprite overlay
![Page 69: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/69.jpg)
69
Application: sprite overlayMOVQ mm0, spriteMOVQ mm2, mm0MOVQ mm4, bgMOVQ mm1, clrPCMPEQW mm0, mm1PAND mm4, mm0PANDN mm0, mm2POR mm0, mm4
![Page 70: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/70.jpg)
70
Other SIMD architectures
• Graphics Processing Unit (GPU): nVidia 7800, 24 pipelines (8 vector/16 fragment)
![Page 71: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/71.jpg)
Impacts on programming
• You need to be aware of architecture issues to write more efficient programs (such as cache-aware).
• Need parallel thinking for better utilizing parallel features of processors.
![Page 72: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/72.jpg)
RISC v.s. CISC
![Page 73: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/73.jpg)
Trade-offs of instruction sets
• Before 1980, the trend is to increase instruction complexity (one-to-one mapping if possible) to bridge the gap. Reduce fetch from memory. Selling point: number of instructions, addressing modes. (CISC)
• 1980, RISC. Simplify and regularize instructions to introduce advanced architecture for better performance, pipeline, cache, superscalar.
high-level language machine codesemantic gap
compiler
C, C++Lisp, Prolog, Haskell…
![Page 74: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/74.jpg)
RISC
• 1980, Patternson and Ditzel (Berkeley),RISC• Features
– Fixed-length instructions– Load-store architecture– Register file
• Organization– Hard-wired logic– Single-cycle instruction– Pipeline
• Pros: small die size, short development time, high performance
• Cons: low code density, not x86 compatible
![Page 75: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/75.jpg)
RISC Design Principles
• Simple operations– Simple instructions that can execute in one cycle
• Register-to-register operations– Only load and store operations access memory– Rest of the operations on a register-to-register basis
• Simple addressing modes– A few addressing modes (1 or 2)
• Large number of registers– Needed to support register-to-register operations– Minimize the procedure call and return overhead
![Page 76: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/76.jpg)
RISC Design Principles
• Fixed-length instructions– Facilitates efficient instruction execution
• Simple instruction format– Fixed boundaries for various fields
• opcode, source operands,…
![Page 77: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/77.jpg)
CISC and RISC
• CISC – complex instruction set– large instruction set– high-level operations (simpler for compiler?)– requires microcode interpreter (could take a long
time)– examples: Intel 80x86 family
• RISC – reduced instruction set– small instruction set– simple, atomic instructions– directly executed by hardware very quickly– easier to incorporate advanced architecture design– examples: ARM (Advanced RISC Machines) and DEC
Alpha (now Compaq), PowerPC, MIPS
![Page 78: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/78.jpg)
CISC and RISC
CISC(Intel 486)
RISC(MIPS R4000)
#instructions 235 94
Addr. modes 11 1
Inst. Size (bytes) 1-12 4
GP registers 8 32
![Page 79: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/79.jpg)
Why RISC?
• Simple instructions are preferred– Complex instructions are mostly ignored by
compilers• Due to semantic gap
• Simple data structures– Complex data structures are used relatively
infrequently– Better to support a few simple data types efficiently
• Synthesize complex ones• Simple addressing modes
– Complex addressing modes lead to variable length instructions
• Lead to inefficient instruction decoding and scheduling
![Page 80: lec08 architecture.ppt [相容模式] - 國立臺灣大學 · Advanced Architecture ... Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, ... – 4-man VLSI](https://reader030.fdocument.pub/reader030/viewer/2022020412/5ad59ea27f8b9a6d708d1cac/html5/thumbnails/80.jpg)
Why RISC? (cont’d)
• Large register set– Efficient support for procedure calls and returns
• Patterson and Sequin’s study– Procedure call/return: 1215% of HLL statements
» Constitute 3133% of machine language instructions» Generate nearly half (45%) of memory references
– Small activation record• Tanenbaum’s study
– Only 1.25% of the calls have more than 6 arguments– More than 93% have less than 6 local scalar variables– Large register set can avoid memory references