Chapter 3
CPUs
Prof. Chung-Ta King, Department of Computer Science, National Tsing Hua University
(Slides are taken from the textbook slides)
CPUs-2
Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption
CPUs-3
I/O devices
An embedded system usually includes some input/output devices. A typical I/O interface to the CPU:
[Figure: CPU connected to a device through status and data registers; the device mechanism sits behind them]
CPUs-4
Application: 8251 UART
Universal asynchronous receiver/transmitter (UART): provides serial communication. 8251 functions are usually integrated into a standard PC interface chip.
Allows many communication parameters:
    Baud (bit) rate: e.g., 9600 bits/sec
    Number of bits per character
    Parity/no parity, even/odd parity
    Length of stop bit (1, 1.5, 2 bits)
CPUs-5
8251 CPU interface
[Figure: CPU connected to the 8251 over 8-bit status and data registers; the 8251 drives a serial port (xmit/rcv)]
[Timing: the line idles with no character; a start bit precedes bits 0..n-1, followed by a stop bit]
CPUs-6
Programming I/O
Two types of instructions can support I/O:
    special-purpose I/O instructions
    memory-mapped load/store instructions
Intel x86 provides in, out instructions. Most other CPUs use memory-mapped I/O.
Memory-mapped I/O provides addresses for I/O registers.
CPUs-7
ARM memory-mapped I/O
Define location for device:
    DEV1 EQU 0x1000
Read/write code:
    LDR r1,#DEV1    ; set up device address
    LDR r0,[r1]     ; read DEV1
    LDR r0,#8       ; set up value to write
    STR r0,[r1]     ; write value to device
CPUs-8
SHARC memory-mapped I/O
Device must be in external memory space (above 0x400000). Use DM to control access:
    I0 = 0x400000;
    M0 = 0;
    R1 = DM(I0,M0);
CPUs-9
Peek and poke
Traditional C interfaces:
    int peek(char *location) {
        return *location;
    }
    void poke(char *location, char newval) {
        (*location) = newval;
    }
CPUs-10
Busy/wait output
Simplest way to program a device: use instructions to test when the device is ready.
    #define OUT_CHAR   0x1000 // device data register
    #define OUT_STATUS 0x1001 // device status register

    current_char = mystring;
    while (*current_char != '\0') {
        poke(OUT_CHAR, *current_char);
        while (peek(OUT_STATUS) != 0); // busy waiting
        current_char++;
    }
CPUs-11
Simultaneous busy/wait input and output
    while (TRUE) {
        /* read */
        while (peek(IN_STATUS) == 0);
        achar = (char)peek(IN_DATA);
        /* write */
        poke(OUT_DATA, achar);
        poke(OUT_STATUS, 1);
        while (peek(OUT_STATUS) != 0);
    }
CPUs-12
Interrupt I/O
Busy/wait is very inefficient:
    CPU can't do other work while testing the device
    Hard to do simultaneous I/O
Interrupts allow a device to change the flow of control in the CPU, causing a subroutine call to handle the device.
[Figure: CPU and device with status and data registers as before, plus interrupt request and interrupt acknowledge lines between them; the PC and IR inside the CPU connect to the data/address bus]
CPUs-13
Interrupt behavior
Based on the subroutine call mechanism. An interrupt forces the next instruction to be a subroutine call to a predetermined location. The return address is saved so the foreground program can resume.
CPUs-14
Interrupt physical interface
CPU and device are connected by the CPU bus. CPU and device handshake:
    device asserts interrupt request
    CPU checks the interrupt request line at the beginning of each instruction cycle
    CPU asserts interrupt acknowledge when it can handle the interrupt
    CPU fetches the next instruction from the interrupt handler routine
CPUs-15
Example: character I/O handlers
    void input_handler() {
        achar = peek(IN_DATA);
        gotchar = TRUE;
        poke(IN_STATUS, 0);
    }
    void output_handler() {
    }
CPUs-16
Example: interrupt-driven main program
    main() {
        while (TRUE) {
            if (gotchar) {
                poke(OUT_DATA, achar);
                poke(OUT_STATUS, 1);
                gotchar = FALSE;
            }
        }
    }
CPUs-17
Example: interrupt I/O with buffers
Queue for characters:
[Figure: circular buffer with head and tail pointers; head advances on removal, tail on insertion]
CPUs-18
Buffer-based input handler
    void input_handler() {
        char achar;
        if (full_buffer()) error = 1;
        else {
            achar = peek(IN_DATA);
            add_char(achar);
        }
        poke(IN_STATUS, 0);
        if (nchars == 1) {
            poke(OUT_DATA, remove_char());
            poke(OUT_STATUS, 1);
        }
    }
CPUs-19
Debugging interrupt code
What if you forget to save and restore registers in the handler? The foreground program can exhibit mysterious bugs. The bugs will be hard to repeat---they depend on interrupt timing.
CPUs-20
Priorities and Vectors
Two mechanisms allow us to make interrupts more specific:
    Priorities determine which interrupt gets the CPU first
    Vectors determine what code (handler routine) is called for each type of interrupt
Mechanisms are orthogonal: most CPUs provide both.
CPUs-21
Prioritized interrupts
[Figure: devices 1..n drive interrupt request lines L1..Ln into the CPU; the CPU returns an interrupt acknowledge]
CPUs-22
Interrupt prioritization
Masking: an interrupt with priority lower than the current priority is not recognized until the pending interrupt is complete.
Non-maskable interrupt (NMI): highest priority, never masked. Often used for power-down.
CPUs-23
Interrupt Vectors
Allow different devices to be handled by different code. Require an additional vector line from device to CPU.
[Figure: the device sends a vector along with its interrupt request/acknowledge handshake; the CPU uses the vector to index the interrupt vector table, whose entries point to handlers 0..3]
CPUs-24
Generic interrupt mechanism
Assume priority selection is handled before this point.
[Flowchart: continue execution until an interrupt is pending; if its priority does not exceed the current priority, ignore it; otherwise assert acknowledge and wait for the device's vector (a timeout raises a bus error); then call table[vector]]
CPUs-25
Interrupt sequence CPU acknowledges request Device sends vector CPU calls handler Software processes request CPU restores state to foreground program
CPUs-26
Sources of interrupt overhead Handler execution time Interrupt mechanism overhead Register save/restore Pipeline-related penalties Cache-related penalties
CPUs-27
ARM interrupts
ARM7 supports two types of interrupts:
    Fast interrupt requests (FIQs)
    Interrupt requests (IRQs)
FIQ takes priority over IRQ.
Interrupt table starts at location 0.
CPUs-28
ARM interrupt procedure
CPU actions:
    Save PC; copy CPSR to SPSR.
    Force bits in CPSR to record interrupt.
    Force PC to vector.
Handler responsibilities:
    Restore proper PC.
    Restore CPSR from SPSR.
    Clear interrupt disable flags.
CPUs-29
ARM interrupt latency
Worst-case latency to respond to an interrupt is 27 cycles:
    Two cycles to synchronize the external request.
    Up to 20 cycles to complete the current instruction.
    Three cycles for data abort.
    Two cycles to enter the interrupt handling state.
CPUs-30
SHARC interrupt structure
Interrupts are vectored and prioritized. Priorities are fixed: reset highest, user SW interrupt 3 lowest. Vectors are also fixed; a vector is an offset in the vector table. The table starts at 0x20000 in internal memory, 0x40000 in external memory.
CPUs-31
SHARC interrupt sequence
Start: must be executing or in IDLE/IDLE16.
    1. Output appropriate interrupt vector address.
    2. Push PC value onto PC stack.
    3. Set bit in interrupt latch register.
    4. Set IMASKP to current nesting state.
CPUs-32
SHARC interrupt return
Initiated by the RTI instruction.
    1. Return to address at top of PC stack.
    2. Pop PC stack.
    3. Pop status stack if appropriate.
    4. Clear bits in interrupt latch register and IMASKP.
CPUs-33
SHARC interrupt performance
Three stages of response:
    1 cycle: synchronization and latching
    1 cycle: recognition
    2 cycles: branching to vector
Total latency: 3 cycles. Multiprocessor vector interrupts have 6-cycle latency.
CPUs-34
Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption
CPUs-35
Supervisor mode
May want to provide protective barriers between programs, e.g., to avoid memory corruption. Need supervisor mode to manage the various programs. SHARC does not have a supervisor mode.
CPUs-36
ARM supervisor mode
Use the SWI instruction to enter supervisor mode, similar to a subroutine call:
    SWI CODE_1
Sets PC to 0x08. The argument to SWI is passed to supervisor-mode code to request various services. Saves CPSR in SPSR.
CPUs-37
Exception
Exception: internally detected error. Exceptions are synchronous with instructions but unpredictable. The exception mechanism is built on top of the interrupt mechanism. Exceptions are usually prioritized and vectorized. A single instruction may generate more than one exception.
CPUs-38
Trap
Trap (software interrupt): an exception generated by an instruction. Used to call supervisor mode.
ARM uses the SWI instruction for traps. SHARC offers three levels of software interrupts, called by setting bits in the IRPTL register.
CPUs-39
Co-processor
Co-processor: added function unit that is called by instruction, e.g., for floating-point operations. A co-processor instruction can cause a trap and be handled by software (if no such co-processor exists). ARM allows up to 16 co-processors.
CPUs-40
Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption
CPUs-41
Caches and CPUs
[Figure: the CPU issues addresses to a cache controller; the cache sits between the CPU and main memory, returning data on a hit]
Memory access speed is falling further and further behind the CPU. Cache: reduce the speed gap.
CPUs-42
Cache operation
May have caches for:
    instructions
    data
    data + instructions (unified)
Memory access time is no longer deterministic.
    Cache hit: required location is in cache
    Cache miss: required location is not in cache
    Working set: set of locations used by the program in a time interval
CPUs-43
Types of cache misses
Compulsory (cold): location has never been accessed.
Capacity: working set is too large.
Conflict: multiple locations in the working set map to the same cache entry.
CPUs-44
Memory system performance
Cache performance benefits:
    Keep frequently-accessed locations in the fast cache.
    Cache retrieves more than one word at a time, so sequential accesses are faster after the first access.
h = cache hit rate; t_cache = cache access time; t_main = main memory access time. Average memory access time:
    t_av = h * t_cache + (1 - h) * t_main
CPUs-45
Multi-level caches
h1 = L1 cache hit rate; h2 = rate of accesses that miss in L1 but hit in L2. Average memory access time:
    t_av = h1 * t_L1 + h2 * t_L2 + (1 - h1 - h2) * t_main
[Figure: CPU connected to the L1 cache, which is backed by the L2 cache]
CPUs-46
Replacement policies
Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location. Two popular strategies:
    Random
    Least-recently used (LRU)
CPUs-47
Cache organizations
Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented).
Direct-mapped: each memory location maps onto exactly one cache entry.
N-way set-associative: each memory location can go into one of n sets.
CPUs-48
Direct-mapped cache
[Figure: the address is split into tag, index, and offset; the index selects a cache block holding a valid bit, a tag (e.g., 0xabcd), and data bytes; a hit requires the stored tag to equal the address tag, and the offset selects the byte within the block]
CPUs-49
Direct-mapped cache locations
Many locations map onto the same cache block. Conflict misses are easy to generate:
    Array a[] uses locations 0, 1, 2, ...
    Array b[] uses locations 1024, 1025, 1026, ...
    Operation a[i] + b[i] generates conflict misses if locations 0 and 1024 map to the same block in the cache.
Write operations:
    Write-through: immediately copy write to main memory.
    Write-back: write to main memory only when location is removed from cache.
CPUs-50
Set-associative cache
A set of direct-mapped caches:
[Figure: sets 1..n are probed in parallel; their hit and data outputs are combined]
CPUs-51
Memory Management Units
Memory management unit (MMU) translates addresses. MMUs are not common in embedded systems, which rarely have secondary storage.
[Figure: the CPU issues logical addresses to the MMU, which sends physical addresses to main memory; data moves between main memory and secondary storage by swapping]
CPUs-52
Memory management tasks
Allows programs to move in physical memory during execution. Allows virtual memory:
    memory images kept in secondary storage;
    images returned to main memory on demand during execution.
Page fault: request for a location not resident in memory, which generates an exception.
CPUs-53
Address translation
Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses. Two basic schemes:
    segmented
    paged
Segmentation and paging can be combined (x86).
CPUs-54
Segment address translation
[Figure: the logical address from the CPU is added to the segment base address to form the physical address sent to memory; a range check against the segment's lower and upper bounds raises a range error on violation]
CPUs-55
Page address translation
[Figure: the logical address from the CPU is split into a page number and an offset; the page number indexes the page table to fetch the page base, which is concatenated with the offset to form the physical address sent to memory]
CPUs-56
Page table organizations
[Figure: a page table may be organized flat (a single array of page descriptors) or as a tree of page descriptors]
CPUs-57
Caching address translations
Large translation tables require main memory accesses. TLB (translation lookaside buffer): a cache for address translations. Typically small.
CPUs-58
ARM & SHARC memory management
Memory region types:
    section: 1 Mbyte block
    large page: 64 kbytes
    small page: 4 kbytes
An address is marked as section-mapped or page-mapped. Two-level translation scheme. SHARC does not have an MMU.
CPUs-59
ARM address translation
[Figure: the translation table base register is concatenated with the 1st index to fetch a first-level table descriptor; that descriptor is concatenated with the 2nd index to fetch a second-level table descriptor, which combines with the offset to form the physical address]
CPUs-60
Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption
CPUs-61
Performance Acceleration
Three factors can substantially improve system performance:
    Pipelining
    Superscalar execution
    Caching
Need to take advantage of them where possible. But they also complicate performance analysis.
CPUs-62
Pipelining
Several instructions are executed simultaneously at different stages of completion. Various conditions can cause pipeline stalls that reduce utilization:
    branches
    memory system delays
    data hazards, etc.
Both ARM and SHARC have 3-stage pipes:
    fetch instruction from memory;
    decode opcode and operands;
    execute.
CPUs-63
ARM pipeline execution
[Figure: over cycles 1-3, add r0,r1,#5 moves through fetch, decode, and execute while sub r2,r3,r6 and cmp r2,#3 each follow one stage behind]
CPUs-64
Pipeline performance
Latency: time it takes for an instruction to get through the pipeline.
Throughput: number of instructions executed per time period.
Pipelining increases throughput without reducing latency.
Pipeline stall: if a step cannot be completed in the same amount of time, the pipeline stalls. Bubbles introduced by a stall increase latency and reduce throughput.
CPUs-65
Data stall
Multi-cycle execution and data stall (LDMIA: load multiple):
[Figure: ldmia r0,{r2,r3} occupies the execute stage for two cycles (load r2, then load r3); sub r2,r3,r6 and cmp r2,#3 stall behind it before executing]
CPUs-66
Control stalls
Branches often introduce stalls (branch penalty). Stall time may depend on whether the branch is taken. May have to squash instructions that already started executing. Don't know what to fetch until the condition is evaluated.
CPUs-67
ARM pipelined branch
[Figure: bne foo spends multiple cycles in execute; the following sub r2,r3,r6 is fetched and decoded but squashed when the branch resolves, and add r0,r1,r2 at foo is fetched instead]
CPUs-68
Delayed branch
To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch are always executed, whether or not the branch is taken.
SHARC supports delayed and non-delayed branches:
    Specified by a bit in the branch instruction
    2-instruction branch delay slot
CPUs-69
Example: SHARC code scheduling
    L1=5;
    DM(I0,M1)=R1;
    L8=8;
    DM(I8,M9)=R2;
CPU cannot use a DAG on the cycle just after loading the DAG's register, because both need the same internal bus. CPU performs a NOP between the register assignment and the DM access.
CPUs-70
Rescheduled SHARC code
    L1=5;
    L8=8;
    DM(I0,M1)=R1;
    DM(I8,M9)=R2;
Avoids two NOP cycles.
CPUs-71
Example: ARM execution time
Determine execution time of an FIR filter:
    for (i=0; i<N; i++)
        f = f + c[i]*x[i];
Only the branch in the loop test may take more than one cycle: BLT loop takes 1 cycle best case, 3 worst case.
CPUs-72
Superscalar execution
A superscalar processor can execute several instructions per cycle, using multiple pipelined data paths. Programs execute faster, but it is harder to determine how much faster.
CPUs-73
Superscalar Processor
[Figure: a control unit issues two instructions per cycle to parallel instruction units, which share the register file]
Data dependencies
Execution time depends on operands, not just the opcode. A superscalar CPU checks data dependencies dynamically:
    add r2,r0,r1
    add r3,r2,r5
The second add cannot issue until the first has produced r2 (a data dependency).
CPUs-75
Memory system performance
Caches introduce indeterminacy in execution time, which depends on the order of execution. Cache miss penalty: added time due to a cache miss. Several reasons for a miss: compulsory, conflict, capacity.
CPUs-76
CPU power consumption
Most modern CPUs are designed with power consumption in mind to some degree. Power vs. energy:
    Power is energy consumption per unit time
    Heat depends on power consumption
    Battery life depends on energy consumption
CPUs-77
CMOS power consumption
Voltage drops: power consumption proportional to V^2.
Toggling: more activity means more power.
Leakage: basic circuit characteristic; can be eliminated only by disconnecting power.
CPUs-78
CPU power-saving strategies
Reduce power supply voltage.
Run at lower clock frequency.
Disable function units with control signals when not in use.
Disconnect parts from the power supply when not in use to eliminate leakage currents.
CPUs-79
Power management styles
Static power management: does not depend on CPU activity.
    Example: user-activated power-down mode
Dynamic power management: based on CPU activity.
    Example: disabling unused function units
CPUs-80
Application: PowerPC 603
Provides doze, nap, and sleep modes for static power management. Dynamic power management features:
    Can shut down unused execution units
    Cache organized into subarrays to minimize amount of active circuitry
CPUs-81
PowerPC 603 activity
Percentage of time units are idle for SPEC integer/floating-point:
    unit             SPECint92  SPECfp92
    D cache              29%       28%
    I cache              29%       17%
    load/store           35%       17%
    fixed-point          38%       76%
    floating-point       99%       30%
    system register      89%       97%
Idle units are turned off by switching off their clocks. Pipeline stages can be turned on or off.
CPUs-82
Power-down costs
Going into a power-down mode costs:
    time
    energy
Must determine if going into the mode is worthwhile. Can model CPU power states with a power state machine.
CPUs-83
Application: StrongARM SA-1100
Processor takes two supplies:
    VDD is the main 3.3V supply (can be switched on and off)
    VDDX is the 1.5V supply (always remains on)
Three power modes:
    Run: normal operation
    Idle: stops CPU clock, with logic still powered
    Sleep: shuts off most of chip activity; entered in 3 steps, each about 30 µs; wakeup takes > 10 ms
CPUs-84
SA-1100 power state machine
[State machine: run (P_run = 400 mW), idle (P_idle = 50 mW), sleep (P_sleep = 0.16 mW); run and idle transition to each other in about 10 µs, run and idle enter sleep in about 90 µs, and sleep returns to run in about 160 ms]
CPUs-85
Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption Example Design: Data Compressor
CPUs-86
Goals
Compress data transmitted over a serial line:
    Receives byte-size input symbols.
    Produces output symbols packed into bytes.
Will build the software module only here.
CPUs-87
Collaboration diagram for compressor
[Diagram: an input object sends 1..n input symbols to the data compressor, which sends 1..m packed output symbols to an output object]
CPUs-88
Huffman coding
Early statistical text compression algorithm. Select non-uniform size codes:
    Use shorter codes for more common symbols.
    Use longer codes for less common symbols.
To allow decoding, codes must have unique prefixes: no code can be a prefix of a longer valid code.
CPUs-89
Huffman example
    character  P
    a          .45
    b          .24
    c          .11
    d          .08
    e          .07
    f          .05
[Tree: f(.05) and e(.07) merge into P=.12; d(.08) and c(.11) merge into P=.19; those merge into P=.31; b(.24) joins giving P=.55; a(.45) joins at the root, P=1]
CPUs-90
Example Huffman code
Read code from root to leaves:
    a  1
    b  01
    c  0000
    d  0001
    e  0010
    f  0011
CPUs-91
Huffman coder requirements table
    name                  data compression module
    purpose               code module for Huffman compression
    inputs                encoding table, uncoded byte-size inputs
    outputs               packed compression output symbols
    functions             Huffman coding
    performance           fast
    manufacturing cost    N/A
    power                 N/A
    physical size/weight  N/A
CPUs-92
Building a specification
The collaboration diagram shows only steady-state input/output. A real system must:
    Accept an encoding table.
    Allow a system reset that flushes the compression buffer.
CPUs-93
data-compressor class
    data-compressor
        buffer: data-buffer
        table: symbol-table
        current-bit: integer
        encode(): boolean, data-buffer
        flush()
        new-symbol-table()
CPUs-94
Data-compressor behaviors
encode: takes one-byte input, generates packed encoded symbols and a Boolean indicating whether the buffer is full.
new-symbol-table: installs a new symbol table in the object, throws away the old table.
flush: returns current state of buffer, including number of valid bits in buffer.
CPUs-95
Auxiliary classes
    data-buffer
        databuf[databuflen]: character
        len: integer
        insert()
        length(): integer
    symbol-table
        symbols[nsymbols]: data-buffer
        len: integer
        value(): symbol
        load()
CPUs-96
Auxiliary class roles
data-buffer holds both packed and unpacked symbols. The longest Huffman code for 8-bit inputs is 256 bits.
symbol-table indexes the encoded version of each symbol. load() puts data in a new symbol table.
CPUs-97
Class relationships
[Diagram: data-compressor has one symbol-table and one data-buffer (1-to-1 associations)]
CPUs-98
Encode behavior
[State diagram: on an input symbol, add it to the buffer; if the buffer is filled, create a new buffer, add the overflow to it, and return true; otherwise return false]
CPUs-99
Insert behavior
[State diagram: if the input symbol fills the buffer, pack its bottom bits into this buffer and its top bits into the overflow buffer; otherwise pack it entirely into this buffer; then update the length]
CPUs-100
Program design
In an object-oriented language, we can reflect the UML specification in the code more directly. In a non-object-oriented language, we must either:
    add code to provide object-oriented features, or
    diverge from the specification structure.
CPUs-101
C++ classes
    class data_buffer {
        char databuf[databuflen];
        int len;
        int length_in_chars() { return len/bitsperbyte; }
    public:
        void insert(data_buffer, data_buffer&);
        int length() { return len; }
        int length_in_bytes() { return (int)ceil(len/8.0); }
        int initialize();
        ...
CPUs-102
C++ classes, cont'd.
    class data_compressor {
        data_buffer buffer;
        int current_bit;
        symbol_table table;
    public:
        boolean encode(char, data_buffer&);
        void new_symbol_table(symbol_table);
        int flush(data_buffer&);
        data_compressor();
        ~data_compressor();
    };
CPUs-103
C code
    struct data_compressor_struct {
        data_buffer buffer;
        int current_bit;
        sym_table table;
    };
    typedef struct data_compressor_struct data_compressor, *data_compressor_ptr;
    boolean data_compressor_encode(data_compressor_ptr mycmptrs,
        char isymbol, data_buffer *fullbuf) ...
CPUs-104
Testing
Test by encoding, then decoding:
[Diagram: input symbols and the symbol table feed the encoder; the encoder's output feeds the decoder; the decoder's result is compared against the original input symbols]
CPUs-105
Code inspection tests
Look at the code for potential problems:
    Can we run past the end of the symbol table?
    What happens when the next symbol does not fill the buffer? When it exactly fills it?
    Do very long encoded symbols work properly? Very short symbols?
    Does flush() work properly?