L4: Architectural Level Design

Prof. Jun Dong Cho, Sungkyunkwan University
http://vlsicad.skku.ac.kr

System-Level Solutions
• Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity
• Temporal locality: average lifetimes of variables (less temporal storage, probability of future accesses referenced in the recent past)
• Precompute physical capacitance of interconnect and switching activity (number of bus accesses)
• Architecture-driven voltage scaling: choose a more parallel architecture
• Supply voltage scaling: lowering Vdd reduces energy, but increases delays

Software Power Issues
Up to 40% of the on-chip power is dissipated on the buses!
• System software: OS, BIOS, compilers
• Software can affect energy consumption at various levels

Inter-Instruction Effects
• The energy cost of an instruction varies depending on the previous instruction
• For example, XOR BX, 1; ADD AX, DX
• I_est = (319.2 + 313.6) / 2 = 316.4 mA, I_obs = 323.2 mA
• The difference is defined as the circuit state overhead
• Need to specify the overhead as a function of pairs of instructions
• Further inter-instruction effects are due to pipeline stalls and cache misses
• Instruction reordering to improve cache hit ratio
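The circuit state overhead in the example above is simple arithmetic. A minimal Python sketch that uses only the currents quoted on the slide; the function name is illustrative and not part of the original material:

```python
def circuit_state_overhead(base_currents_ma, observed_ma):
    """Return (estimated, overhead) in mA for an alternating instruction sequence.

    The estimate is the average of the per-instruction base currents; the
    overhead is whatever the measurement shows on top of that estimate.
    """
    estimated = sum(base_currents_ma) / len(base_currents_ma)
    return estimated, observed_ma - estimated


# Base currents of XOR and ADD and the observed current of the pair (slide values).
estimated, overhead = circuit_state_overhead([319.2, 313.6], 323.2)
print(f"I_est = {estimated:.1f} mA, overhead = {overhead:.1f} mA")
```

For the slide's numbers the overhead comes out at about 6.8 mA.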

Software Power Optimization
• Instruction packing
– reduces cache misses, which carry a high power penalty
– example
• Fujitsu DSP
• permits an ALU operation and a memory data transfer to be packed
• Instruction ordering
– attempts to minimize the energy associated with the circuit state effect
– reorders instructions to minimize the total power for a given table
• Operand swapping
– minimizes activity associated with the operand
– attempts to swap operands to the ALU or FPU
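Instruction ordering against a pairwise overhead table can be sketched as a small search. The Python sketch below is an illustration under stated assumptions, not the method used by any particular compiler: the table values other than the XOR/ADD pair are hypothetical, each instruction is assumed to appear once, and the search is exhaustive, so it only suits short basic blocks.

```python
from itertools import permutations

# Pairwise circuit-state overheads in mA. Only the XOR/ADD value follows from
# the slide's example; the others are hypothetical and would have to be measured.
OVERHEAD = {
    ("XOR", "ADD"): 6.8, ("ADD", "XOR"): 6.8,
    ("XOR", "MOV"): 2.0, ("MOV", "XOR"): 2.0,
    ("ADD", "MOV"): 4.0, ("MOV", "ADD"): 4.0,
}


def sequence_overhead(order):
    """Sum the overhead of every adjacent instruction pair in the order."""
    return sum(OVERHEAD.get((a, b), 0.0) for a, b in zip(order, order[1:]))


def respects(order, deps):
    """True if every (before, after) dependency keeps its relative order."""
    pos = {ins: i for i, ins in enumerate(order)}
    return all(pos[before] < pos[after] for before, after in deps)


def best_order(instructions, deps=()):
    """Try every legal permutation and keep the one with the least overhead."""
    legal = (p for p in permutations(instructions) if respects(p, deps))
    return min(legal, key=sequence_overhead)


# ADD is assumed to depend on the result of XOR, so XOR must come first.
order = best_order(["XOR", "ADD", "MOV"], deps=[("XOR", "ADD")])
print(order, sequence_overhead(order))
```

With this table the independent MOV is moved between XOR and ADD, which avoids the most expensive adjacent pair.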

Software Power Optimization
• Minimizing memory access costs
– minimizes the number of memory accesses required by an algorithm
– example

Before
  FOR i := 1 TO N DO
    B[i] = f(A[i]);
  END_FOR;
  FOR i := 1 TO N DO
    C[i] = g(B[i]);
  END_FOR;

After
  FOR i := 1 TO N DO
    B[i] = f(A[i]);
    C[i] = g(B[i]);
  END_FOR;

• Memory bank assignment
– formulated as a graph partitioning problem
– each group corresponds to a memory bank
– the optimum code sequence can vary when dual loads are used

[Figure: access graph for a code fragment with nodes a, b, c, d, e, and the partitioned access graph with the nodes split between Bank A and Bank B]
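Memory bank assignment as graph partitioning can be illustrated with a tiny exhaustive search. In the Python sketch below the edge list is hypothetical (the figure's edges did not survive in the transcript); an edge weight is assumed to count how often two variables are accessed together, so placing them in different banks lets a dual load fetch both at once.

```python
from itertools import combinations


def partition_banks(variables, access_edges):
    """Exhaustive 2-way partition that maximizes the weight of split pairs.

    variables:    list of variable names
    access_edges: {(u, v): weight}; weight = how often u and v are needed together
    Returns (bank_a, bank_b, cut_weight). Only practical for small graphs.
    """
    first, rest = variables[0], variables[1:]  # pin one node to break symmetry
    best_cut, best_bank_a = -1, None
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            bank_a = {first, *extra}
            cut = sum(w for (u, v), w in access_edges.items()
                      if (u in bank_a) != (v in bank_a))
            if cut > best_cut:
                best_cut, best_bank_a = cut, bank_a
    return best_bank_a, set(variables) - best_bank_a, best_cut


# Hypothetical access graph over the five variables named in the figure.
edges = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1, ("d", "e"): 1}
bank_a, bank_b, cut = partition_banks(["a", "b", "c", "d", "e"], edges)
print(sorted(bank_a), sorted(bank_b), cut)
```

For larger programs a heuristic partitioner would replace the exhaustive loop; the objective stays the same.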

Power Management Mode
• Support power management
– easy control for applications and OS
• APM: Advanced Power Management
– power states
• Full On
• APM Enabled
• APM Standby
• APM Suspend
• Off
• APM system

[Figure: APM system block diagram. Blocks: APM-aware applications, APM-aware device drivers, APM driver, APM BIOS, APM BIOS-controlled hardware, add-in devices. The operating system layer is marked OS dependent and the BIOS layer OS independent.]

Power Management Mode
• APM state transitions

[Figure: APM state transition diagram. States: Full On, APM Enabled, APM Standby, APM Suspend / Hibernation, Off; the APM states are grouped as power managed. Transition labels: On Switch; APM Enable / Enable Call; APM Disable / Disable Call; Short Inactivity / Standby Call; Long Inactivity / Suspend Interrupt / Suspend Call; Off Switch / Off Call. Side arrows: device responsiveness decreases toward Off, power usage increases toward Full On.]
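The state diagram can be read as a small table-driven state machine. The Python sketch below is an assumption-laden reading of the transition labels: the source and target of each edge are inferred from the label names, wake-up transitions are omitted because the transcript does not list them, and the event names are invented for illustration.

```python
from enum import Enum


class ApmState(Enum):
    FULL_ON = "Full On"
    ENABLED = "APM Enabled"
    STANDBY = "APM Standby"
    SUSPEND = "APM Suspend"
    OFF = "Off"


# (state, event) -> next state. Edge endpoints are assumed, not taken from a spec.
TRANSITIONS = {
    (ApmState.FULL_ON, "apm_enable"): ApmState.ENABLED,
    (ApmState.ENABLED, "apm_disable"): ApmState.FULL_ON,
    (ApmState.ENABLED, "short_inactivity"): ApmState.STANDBY,
    (ApmState.ENABLED, "standby_call"): ApmState.STANDBY,
    (ApmState.STANDBY, "long_inactivity"): ApmState.SUSPEND,
    (ApmState.STANDBY, "suspend_interrupt"): ApmState.SUSPEND,
    (ApmState.STANDBY, "suspend_call"): ApmState.SUSPEND,
    (ApmState.OFF, "on_switch"): ApmState.FULL_ON,
}


def next_state(state, event):
    """Apply one event; unknown events leave the state unchanged."""
    if event in ("off_switch", "off_call") and state is not ApmState.OFF:
        return ApmState.OFF  # the off switch is assumed to work from every state
    return TRANSITIONS.get((state, event), state)


state = ApmState.OFF
for event in ["on_switch", "apm_enable", "short_inactivity", "long_inactivity"]:
    state = next_state(state, event)
print(state.value)
```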

Power Management Mode
• PowerPC 603
– Doze
• clocks running to data cache, snooping logic, and time base/decrementer only
– Nap
• clocks running to time base/decrementer only
– Sleep
• all clocks stopped, no external input clock
• MIPS 4200
– Reduced power
• clocks at 1/4 bus clock frequency
• Hitachi SH7032
– Sleep
• CPU clocks stopped, peripherals remain clocked
– Standby
• all clocks stopped, peripherals initialized

Power Optimization
• Modeling and Technology
• Circuit Design Level
• Logic and Module Design Level
• Architecture and System Design Level
• Some Design Examples
– ARM7TDMI

Some Design Examples
• ARM7TDMI core
– size: 1 mm² @ 0.25 µm
– power:
• 0.181 W @ 33 MHz, 5 V
• 143 MIPS/W
– features
• 32-bit addressing
• 32x8 DSP multiplier
• 32-bit register bank and ALU
• 32-bit barrel shifter
– Thumb instruction set
• compressed 32-bit ARM instructions
• high code density

Processor   System        Power (W)   MIPS/W
ARM7D       33 MHz, 5 V   0.165       185
ARM7TDMI    33 MHz, 5 V   0.181       143
PC403GA     40 MHz, 5 V   1           39
V810        25 MHz, 5 V   0.5         36
68349       25 MHz, 5 V   0.96        9
29200       16 MHz, 5 V   1.1         7
486DX       33 MHz, 5 V   4.5         6
i960SA      16 MHz, 5 V   1.25        4

Processor with Power Management
• Clock power management
– basic logical method
• gated clocking
– hardware method
• external pin + control register bit
– software method
• specific instructions + control register bit

Avoiding Wasteful Computation
• Preservation of data correlation
• Distributed computing / locality of reference
• Application-specific processing
• Demand-driven operation
• Transformation for memory size reduction
• Consider that arrays A and C are already available in memory
• When A is consumed, another array B is generated; when C is consumed, a scalar value D is produced
• Memory size can be reduced by executing the j loop before the i loop, so that C is consumed before B is generated and the same memory space can be used for both arrays
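The loop reordering for memory size reduction can be sketched as follows. This Python sketch assumes that B and C have the same length and that f and g are placeholder functions (neither is given in the slides); it only illustrates that once the j loop has consumed C, the storage of C is free to hold B.

```python
def run_reordered(a, c, f, g):
    """Run the j loop (C -> D) before the i loop (A -> B) and reuse C's storage.

    a, c: input lists of equal length (assumption)
    f:    element-wise function producing B from A (placeholder)
    g:    accumulation step producing the scalar D from C (placeholder)
    """
    d = 0
    for j in range(len(c)):      # j loop: C is consumed, scalar D is produced
        d = g(d, c[j])
    shared = c                   # C is dead from here on, so its buffer is reused
    for i in range(len(a)):      # i loop: A is consumed, B is written in place
        shared[i] = f(a[i])
    return shared, d


a = [1, 2, 3]
c = [4, 5, 6]
b, d = run_reordered(a, c, f=lambda x: 2 * x, g=lambda acc, x: acc + x)
print(b, d, b is c)
```

Only two array buffers are live at any time instead of three, which is the saving the slide describes.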

Avoiding Wasteful Computation

Architecture Lower Power Design
• Optimum supply voltage architecture through hardware duplication (trading area for lower power) and/or pipelining
– complex and fewer instructions require less encoding, but larger decode logic!
• use small complex instructions with smaller instruction length (e.g., Hitachi SH: 16-bit fixed-length, arithmetic instructions use only two operands; NEC V800: variable-length instruction decoding overhead)
• Superscalar: CPI < 1: parallel instruction execution. VLIW architecture.

Variable Supply Voltage Block Diagram

• Computational work varies with time. An approach to reduce the energy consumption of such systems beyond shut down involves the dynamic adjustment of supply voltage based on computational workload.

• The basic idea is to lower the supply voltage during periods of low workload, rather than run at a fixed supply and idle for some fraction of the time.

• The supply voltage and clock rate are increased during high-workload periods.

Power Reduction using Variable Supply
• Circuits with a fixed supply voltage work at a fixed speed and idle if the data sample requires less than the maximum amount of computation. Power is then reduced only in a linear fashion, since the energy per operation is fixed.
• If the workload for a given sample period is less than peak, the delay of the processing element can be increased by a factor of 1/workload without loss in throughput, allowing the processor to operate at a lower supply voltage. Thus, energy per operation varies.

Data Driven Signal Processing
• The basic idea of averaging: two samples are buffered and their workloads are averaged.
• The averaged workload is then used as the effective workload to drive the power supply.
• Using a ping-pong buffering scheme, data samples I(n+2), I(n+3) are being buffered while I(n), I(n+1) are being processed.
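Why averaging helps can be seen with a deliberately crude model. The Python sketch below assumes that the supply voltage scales in proportion to the workload (threshold voltage ignored), so processing w units of work at rate r costs energy proportional to w * r^2; this model and the sample workloads are assumptions for illustration.

```python
def energy_per_sample(workloads):
    """Each sample is processed at its own rate: energy ~ w * w^2."""
    return sum(w ** 3 for w in workloads)


def energy_averaged(workloads, window=2):
    """Samples are buffered in groups and processed at the group's average rate."""
    total = 0.0
    for k in range(0, len(workloads), window):
        group = workloads[k:k + window]
        avg = sum(group) / len(group)
        total += sum(group) * avg ** 2   # same work, done at the averaged rate
    return total


workloads = [0.2, 1.0, 0.4, 0.8]          # hypothetical normalized workloads
separate = energy_per_sample(workloads)
averaged = energy_averaged(workloads)
print(f"per-sample: {separate:.3f}, averaged over pairs: {averaged:.3f}")
```

Because the energy curve is convex in the workload, running at the averaged rate never costs more than tracking each sample separately, at the price of the extra buffering latency.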

Datapath Parallelization

Memory Parallelization

To first order, P = C · (f/2) · Vdd^2

Pipelined Micro-P

Architecture Trade-Off

Pipelined Implementation

P_pipeline = (1.15C)(0.58V)^2 (f) = 0.39P

P_parallel = (2.15C)(0.58V)^2 (0.5f) = 0.36P
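The two ratios above follow from scaling capacitance, voltage, and frequency in the first-order power expression. A short Python check using the slide's factors; the function name is illustrative:

```python
def relative_power(cap_factor, voltage_factor, freq_factor):
    """Power relative to the reference design, from P proportional to C * V^2 * f."""
    return cap_factor * voltage_factor ** 2 * freq_factor


# Pipelined: 1.15x capacitance, 0.58x supply voltage, same clock frequency.
pipeline = relative_power(1.15, 0.58, 1.0)
# Parallel: 2.15x capacitance, 0.58x supply voltage, half the clock frequency.
parallel = relative_power(2.15, 0.58, 0.5)
print(f"pipeline {pipeline:.2f}P, parallel {parallel:.2f}P")
```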

Different Classes of RISC Micro-P

Application Specific Coprocessor

• DSPs are increasingly called upon to perform tasks for which they are not ideally suited, for example, Viterbi decoding.
• They may also take considerably more energy than a custom solution.
• Use the DSP for the portions of algorithms for which it is well suited, and craft an application-specific coprocessor (i.e., custom hardware) for other tasks.
• This is an example of the difference between power and energy.
• The application-specific coprocessor may actually consume more power than the DSP, but it may be able to accomplish the same task in far less time, resulting in a net energy savings.
• Power consumption varies dramatically with the instruction being executed.
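The power versus energy distinction is a one-line calculation. In the Python sketch below all numbers are hypothetical and chosen only to show a case where the higher-power unit uses less energy:

```python
def energy_uj(power_mw, time_ms):
    """Energy in microjoules: milliwatts times milliseconds."""
    return power_mw * time_ms


# Hypothetical figures: the coprocessor draws more power but finishes sooner.
dsp = energy_uj(power_mw=100.0, time_ms=10.0)
coprocessor = energy_uj(power_mw=150.0, time_ms=1.0)
print(f"DSP: {dsp:.0f} uJ, coprocessor: {coprocessor:.0f} uJ")
```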

Clock per Instruction (CPI)

SUPERPIPELINE micro-P

VLIW Architecture
• The compiler takes responsibility for finding the operations that can be issued in parallel and for creating a single very long instruction containing these operations.
• VLIW instruction decoding is easier than superscalar instruction decoding because of the fixed format and the absence of instruction dependencies.
• The fixed format could place more limitations on the combination of operations.
• Intel P6: CISC instructions are combined on chip to provide a set of micro-operations (i.e., a long instruction word) that can be executed in parallel.
• As power becomes a major issue in the design of fast microprocessors, the simpler architecture is the better one. VLIW architectures, being simpler than N-issue machines, could be considered promising for achieving high speed and low power simultaneously.

Architecture Optimization
• 2's complement architecture
– correlator example
• 64 MHz random input
• 64 kHz accumulated output
• 1024 length
– accumulator acts as a low-pass filter
• higher order bits have little switching activity
– high switching activity of the adder
• all of the input bits to the adder switch each time the input changes sign

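[Figure: 2's complement accumulator datapath. Labels: 4-bit input, 14-bit adder path, registers clocked by CLK (64 MHz), output register clocked by CLK (64 kHz); signals in_latched, current_sum, add_out.]

[Plot: transition activity (0.0 to 1.0) versus bit position (0 to 12) for add_out, in_latched, and current_sum; the sign-extension region is marked.]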
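The effect of sign extension on switching activity can be reproduced with a short simulation. The Python sketch below models only the sign-extended input bus (in_latched), not the full correlator; the sample count and random seed are arbitrary choices.

```python
import random


def bit_activity(values, width):
    """Fraction of steps on which each bit toggles, LSB first."""
    mask = (1 << width) - 1
    counts = [0] * width
    for prev, cur in zip(values, values[1:]):
        # Python integers act as infinitely sign-extended two's complement,
        # so masking the XOR to `width` bits models the sign-extended bus.
        diff = (prev ^ cur) & mask
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    steps = len(values) - 1
    return [c / steps for c in counts]


rng = random.Random(1)
inputs = [rng.randint(-8, 7) for _ in range(5000)]   # 4-bit signed random input
in_latched = bit_activity(inputs, width=14)           # sign-extended to 14 bits
print([round(a, 2) for a in in_latched])
```

Bits 3 to 13 all carry the sign, so they toggle together on every sign change, which is why the whole upper part of the adder input switches for a random input.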

Architecture Optimization
• Sign-magnitude architecture
– low switching activity in the high order bits
• no sign-extension is being performed
• higher order bits only need an incrementer
– power is not sensitive to very rapid fluctuations in the input data

[Plot: transition activity (0.0 to 1.0) versus bit position (0 to 12) for sum (2's complement) and suma + sumb (sign-magnitude).]

[Figure: sign-magnitude datapath with two accumulators, POSACC (suma) and NEGACC (sumb), each with a gated clk and a clk (64 kHz) output register; the sign bit goes to the control logic; the input is clocked at clk (64 MHz).]

Input pattern                          2's complement (mW)   Sign-magnitude (mW)
constant (7, 7, ...)                   1.97                  2.25
ramp (-7, -6, ..., 6, 7, ...)          2.13                  2.51
random                                 3.42                  2.51
min->max->min (-7, +7, -7, +7, ...)    5.28                  2.46

Architecture Optimization
• Ordering of input signals
– the ordering of operations can result in reduced switching activity
– example
• multiplication with a constant: IN + (IN >> 7) + (IN >> 8)
– topology II
• the output of the first adder has a small amplitude -> lower switching activity
• switched 30% less

[Figure: two adder topologies that compute SUM1 and SUM2 from IN, IN >> 7, and IN >> 8.]

[Plot: transition activity (0.0 to 0.4) versus bit position (0 to 12) for SUM1 and SUM2, one plot per topology.]
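The effect of operation ordering can be checked by counting bit toggles at the adder outputs. The Python sketch below rests on assumptions: topology I is taken to add IN and IN >> 7 first, topology II to add the two shifted terms first, and the input is modeled as unsigned 12-bit random data. It does not attempt to reproduce the 30% figure from the slide.

```python
import random


def toggles(values, width=16):
    """Total number of bit transitions between consecutive values."""
    mask = (1 << width) - 1
    return sum(bin((prev ^ cur) & mask).count("1")
               for prev, cur in zip(values, values[1:]))


def compare_topologies(samples):
    """Return total adder-output toggles for (topology I, topology II)."""
    # Topology I (assumed): SUM1 = IN + (IN >> 7), SUM2 = SUM1 + (IN >> 8)
    sum1_a = [x + (x >> 7) for x in samples]
    sum2_a = [s + (x >> 8) for s, x in zip(sum1_a, samples)]
    # Topology II (assumed): SUM1 = (IN >> 7) + (IN >> 8), SUM2 = SUM1 + IN
    sum1_b = [(x >> 7) + (x >> 8) for x in samples]
    sum2_b = [s + x for s, x in zip(sum1_b, samples)]
    assert sum2_a == sum2_b   # both orderings compute the same result
    return (toggles(sum1_a) + toggles(sum2_a),
            toggles(sum1_b) + toggles(sum2_b))


rng = random.Random(0)
samples = [rng.randrange(1 << 12) for _ in range(2000)]  # unsigned 12-bit input
topo1, topo2 = compare_topologies(samples)
print(topo1, topo2)
```

The final sums are identical, so the whole difference comes from SUM1: adding the two small shifted terms first keeps the upper bits of the first adder quiet.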

Architecture Optimization
• Reducing glitching activity
– static designs can exhibit spurious transitions
• finite propagation delay from one logic block to the next
– important to balance all signal paths and reduce the logic depth
– multiple input addition
• 4-input case: the chained implementation has 1.5 times the activity of the tree implementation
• 8-input case: the chained implementation has 2.5 times the activity of the tree implementation

[Figure: chained implementation ((A + B) + C) + D versus tree implementation (A + B) + (C + D).]
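The logic depth of the two adder arrangements can be compared directly. A small Python sketch, assuming two-input adders throughout; it counts adder levels only and does not model glitches themselves:

```python
def chain_depth(n_inputs):
    """Adder levels on the longest path of a chained n-input addition."""
    return n_inputs - 1


def tree_depth(n_inputs):
    """Adder levels of a balanced tree: ceil(log2(n)) using integer arithmetic."""
    return (n_inputs - 1).bit_length()


for n in (4, 8):
    print(n, chain_depth(n), tree_depth(n))
```

In the chain, late adders receive one input after many adder delays and the other immediately, which is the path imbalance that produces spurious transitions; the tree keeps the arrival times matched.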

Synchronous VS. Asynchronous SYSTEMS

• Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in computation but are required for synchronization. As technology advances, systems tend to get bigger and bigger, and the delay on the clock wires can no longer be ignored; clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have new inputs to process. The biggest gate is the clock driver itself, which must switch.

• Asynchronous system (self-timed): an input signal (request) starts the computation on a module, and an output signal (acknowledge) signifies the completion of the computation and the availability of the requested data. Asynchronous systems are potentially responsive to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.

Synchronous VS. Asynchronous SYSTEMS

• More difficult to implement, requiring explicit synchronization between communicating blocks without clocks
• If a signal feeds directly into conventional gate-level circuitry, invalid logic levels could propagate throughout the system.
• Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.
• Asynchronous designs are not widely used; designers can't find the supporting design tools and methodologies they need.
• The error corrector of a DCC (Digital Compact Cassette) player saves 80% power compared to its synchronous counterpart.
• Offers more architectural options and freedom, encourages distributed, localized control, and offers more freedom to adapt the supply voltage

Asynchronous Modules

Example: ABCS protocol

6% more logics

Control Synthesis Flow

PIPELINED SELF-TIMED micro P

Programming Style

Speed vs. Power Optimization
