L10 : Lower Power High Level Synthesis(1)


Page 1: L10 : Lower Power High Level Synthesis(1)

L10 : Lower Power High Level Synthesis(1)

August 1999, Sungkyunkwan University, Prof. Jun-Dong Cho

http://vada.skku.ac.kr

Page 2: L10 : Lower Power High Level Synthesis(1)

Low Power Design Flow

(Flow diagram; block labels as they appear on the slide)

• Function Partitioning and HW/SW Allocation

• System-Level Specification

• System-Level Power Analysis

• Behavioral Description

• Software Functions

• Processor Selection

• Power-driven Behavioral Transformation

• Behavioral-Level Power Analysis

• Power Conscious Behavioral Description

• RT-Level Power Analysis

• High-Level Synthesis and Optimization

• Software Optimization

• Software-Level Power Analysis

• To RT-Level Design

Page 3: L10 : Lower Power High Level Synthesis(1)

Early Analysis Leads to Power Savings

National Semiconductor Success

A LAN switch ASIC of 200K gates and 41 memories characterized for state-dependent power.

DesignPower revealed excessive power consumption by the memories due to redundant read cycles.

The RTL was fixed and the power consumption was reduced.

Page 4: L10 : Lower Power High Level Synthesis(1)

Module Selection

• Select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (or where to put registers), such that a minimal hardware cost is obtained under given timing and throughput constraints.

• Full pipelining is ineffective when there are clock period mismatches between the execution times of the operators. Performing operations in sequence without intermediate buffering can result in a reduction of the critical path.

• When clustering operations into non-pipelined hardware modules, the reusability of these modules over the complete computational graph should be maximized.

• During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.
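The selection step above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each operation must fit in one clock period (no chaining or multicycling), and the library entries, delays and areas are hypothetical values, not data from the lecture.

```python
# Hypothetical module library: (name, delay in ns, relative area) per operation type.
LIBRARY = {
    "mul": [("booth", 40, 900), ("wallace", 25, 1500)],
    "add": [("ripple", 20, 100), ("carry_lookahead", 10, 220)],
}


def select_modules(op_types, clock_period_ns, library=LIBRARY):
    """Pick, per operation type, the smallest-area module that fits in one clock period."""
    choice = {}
    for op in op_types:
        fitting = [m for m in library[op] if m[1] <= clock_period_ns]
        if not fitting:
            raise ValueError(f"no {op} module meets a {clock_period_ns} ns clock")
        # Among the modules that meet timing, minimize area (hardware cost).
        choice[op] = min(fitting, key=lambda m: m[2])[0]
    return choice


print(select_modules(["mul", "add"], 30))  # faster multiplier is needed
print(select_modules(["mul", "add"], 45))  # slower, cheaper multiplier fits
```

With a tight clock the faster, more expensive multiplier is swapped in; with a relaxed clock the cheaper one suffices.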

Page 5: L10 : Lower Power High Level Synthesis(1)

High-Level Power Estimation

• $P_{core} = P_{DP} + P_{MEM} + P_{CNTR} + P_{PROC}$

• $P_{DP} = P_{REG} + P_{MUX} + P_{FU} + P_{INT}$, where
– $P_{REG}$ is the power of the registers
– $P_{MUX}$ is the power of the multiplexers
– $P_{FU}$ is the power of the functional units
– $P_{INT}$ is the power of the physical interconnect capacitance

• Interconnect (HYPER): $C_{int} = C_{total} / N$, where $C_{total}$ is the total estimated capacitance of the chip and $N$ is an estimate of the number of physical interconnects. The average activity is the total number of interconnect accesses multiplied by an average signal transition probability.

Page 6: L10 : Lower Power High Level Synthesis(1)

Estimation

• Estimate min and max bounds on the required resources to
– delimit the design space, with the min bounds serving as an initial solution
– serve as entries in a resource utilization table which guides the transformation, assignment and scheduling operations

• Max bound on execution time is $t_{max}$: topological ordering of the DFG using ASAP and ALAP

• Minimum bounds on the number of resources for each resource class:

$N_{R_i} \geq \lceil O_{R_i} \, d_{R_i} / t_{max} \rceil$

where $N_{R_i}$ is the number of resources of class $R_i$, $d_{R_i}$ is the duration of a single operation, and $O_{R_i}$ is the number of operations.
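A minimal sketch of the minimum resource bound, assuming the usual form $N_{R_i} \geq \lceil O_{R_i} d_{R_i} / t_{max} \rceil$ (operations of one class, each occupying a unit for $d_{R_i}$ cycles, must fit in $t_{max}$ cycles). The function name and the example numbers are illustrative only.

```python
def min_resources(op_count, op_duration, t_max):
    """Lower bound on the units of one resource class: ceil(O * d / t_max)."""
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (op_count * op_duration + t_max - 1) // t_max


# Hypothetical example: 7 multiplications of 2 cycles each, 5 cycles available.
print(min_resources(7, 2, 5))  # 14 unit-cycles in 5 cycles -> 3
```

The bound ignores data dependencies, so it is only a starting point for the resource utilization table.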

Page 7: L10 : Lower Power High Level Synthesis(1)

High-Level Power Estimation: PREG

• Compute the lifetimes of all the variables in the given VHDL code.

• Represent the lifetime of each variable as a vertical line from statement i through statement i + n in the column j reserved for the corresponding variable $v_j$.

• Determine the maximum number N of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.

• Estimate the minimal number N of sets of registers necessary to implement the code by using register sharing. Register sharing has to be applied whenever a group of variables with the same bit-width $b_i$ have non-overlapping lifetimes.

• Select a possible mapping of variables into registers by using register sharing.

• Compute the number $w_i$ of writes to the variables mapped to the same set of registers. Estimate the activity $n_i$ of each set of registers by dividing $w_i$ by the number of statements $S$: $n_i = w_i / S$; hence $TR_{i,max} = n_i f_{clk}$.

• Power in latches and flip-flops is consumed not only during output transitions, but also at every clock edge by the internal clock buffers.

• The non-switching power $P_{NSK}$ dissipated by internal clock buffers accounts for 30% of the average power for a 0.38-micron, 3.3 V process.

• In total, $P_{REG} = \sum_{k=1}^{N} P_k$, where each $P_k$ combines the non-switching power $P_{NSK}$ and the transition power $P_t$ of register set $k$, the latter depending on $n_k$ and $TR_k = n_k f_{clk}$.
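The lifetime analysis above can be sketched as follows. This is an illustrative reading of the procedure, assuming each lifetime is given as an inclusive (first statement, last statement) pair; the variable names and numbers are hypothetical.

```python
def max_overlap(lifetimes):
    """Maximum number of simultaneously live variables (the cut-line count N).

    lifetimes maps a variable name to (first_stmt, last_stmt), both inclusive.
    """
    events = []
    for start, end in lifetimes.values():
        events.append((start, 1))     # variable becomes live
        events.append((end + 1, -1))  # variable is dead after its last statement
    live = best = 0
    # At equal positions the -1 events sort first, so a register freed
    # at a statement can be reused by a variable born at that statement.
    for _, delta in sorted(events):
        live += delta
        best = max(best, live)
    return best


def toggle_rate(writes, statements, f_clk):
    """TR = n * f_clk with activity n = w / S, as on the slide."""
    return (writes / statements) * f_clk


lifetimes = {"a": (1, 3), "b": (2, 4), "c": (3, 5), "d": (5, 6)}
print(max_overlap(lifetimes))        # a, b and c overlap at statement 3 -> 3
print(toggle_rate(5, 20, 100e6))     # 0.25 * 100 MHz
```

The overlap count gives the register bound N; an actual variable-to-register mapping would additionally group variables by bit-width.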

Page 8: L10 : Lower Power High Level Synthesis(1)

PCNTR

• After scheduling, the control is defined and optimized by the hardware mapper, and further by the logic synthesis process, before mapping to layout.

• Like interconnect, therefore, the control needs to be estimated statistically.

• Global control model:

$C_{FSM} = \alpha_1 N_{states} + \alpha_2$

The total number of transitions is strongly dependent on the number of states. For a 1.2 µm technology, $\alpha_1$ is 4.9 fF and $\alpha_2$ is 22.1 fF.

• Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller.

$C_{lc} = \beta_0 + \beta_1 N_{trans} + \beta_2 N_{states} + \beta_3 B_f$

For a 1.2 µm technology, $\beta_0 = 72$, $\beta_1 = 0.15$, $\beta_2 = 8.3$, $\beta_3 = 0.55$.

Here $N_{trans}$ is the number of transitions, $N_{states}$ is the number of states, $B_f$ is the bus factor, and $C_{lc}$ is the capacitance switched in any local controller in one sample period. $B_f$ is the ratio of the number of bus accesses to the number of busses.
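A small sketch that evaluates the two linear controller models. It assumes the forms $C_{FSM} = \alpha_1 N_{states} + \alpha_2$ and $C_{lc} = \beta_0 + \beta_1 N_{trans} + \beta_2 N_{states} + \beta_3 B_f$; the coefficient names are labels chosen here, the default values are the 1.2 µm figures as read from the slide, and the example inputs are hypothetical.

```python
def c_fsm(n_states, alpha1=4.9, alpha2=22.1):
    """Global controller capacitance (fF), linear in the number of states."""
    return alpha1 * n_states + alpha2


def c_lc(n_trans, n_states, bus_factor, beta=(72.0, 0.15, 8.3, 0.55)):
    """Local controller capacitance switched per sample period."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * n_trans + b2 * n_states + b3 * bus_factor


def bus_factor(bus_accesses, n_busses):
    """B_f: ratio of the number of bus accesses to the number of busses."""
    return bus_accesses / n_busses


# Hypothetical design: 10 states, 100 controller output transitions, 8 accesses on 4 busses.
print(c_fsm(10))
print(c_lc(100, 10, bus_factor(8, 4)))
```

Because both models are linear fits, they are only meaningful for the technology they were characterized on.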

Page 9: L10 : Lower Power High Level Synthesis(1)

Ntrans

• The number of transitions depends on assignment, scheduling, optimizations, logic optimization, the standard cell library used, the amount of glitching and the statistics of the inputs.

• $N_{trans}$ is estimated from $N_{nodes}$, $N_{edges}$, $S$ and $N_{Exu}$ using three fitted coefficients (the equation itself is not legible in this transcript), where $N_{trans}$ is the number of transitions on the outputs of the local controllers, $S$ is the number of control cycles per sample period, $N_{edges}$ and $N_{nodes}$ are the numbers of edges and nodes in the CDFG, and $N_{Exu}$ is an estimate of the total number of execution units. For a 1.2 µm technology, the three coefficients are 178.7, 7.2 and 2.0.

Page 10: L10 : Lower Power High Level Synthesis(1)

Exploring the Design Space

• Find the minimal-area solution subject to the timing constraints.

• By checking the critical paths, determine whether the proposed graph violates the timing constraints. If so, retiming, pipelining and tree height reduction can be applied.

• After an acceptable graph is obtained, the resource allocation process is initiated:
– change the available hardware (FUs, registers, busses)
– redistribute the time allocation over the sub-graphs
– transform the graph to reduce the hardware requirements.

• Use a rejectionless probabilistic iterative search technique (a variant of Simulated Annealing), where moves are always accepted. This approach reduces computational complexity and gives faster convergence.

Page 11: L10 : Lower Power High Level Synthesis(1)

Behavioral Synthesis

• Loop unrolling: localize the data to reduce the activity of the inputs of the functional units, or compute two output samples in parallel based on two input samples:

$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$

$Y_n = X_n + A\,Y_{n-1} = X_n + A\,(X_{n-1} + A\,Y_{n-2})$

Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation:

$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$

$Y_n = X_n + A\,X_{n-1} + A^2\,Y_{n-2}$

The transformation yields a critical path of 3, thus the voltage can be dropped.

• Clock selection: choose the optimal system clock period to eliminate slack, improve resource utilization, and enable greater voltage scaling.

• Module selection: for each operation, choose a library template.

• Flow graph restructuring: pull out operations on the critical cycle.
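As a quick check that unrolling followed by distributivity and constant propagation preserves the filter's behavior, the sketch below compares the original recurrence $Y_n = X_n + A\,Y_{n-1}$ with the transformed two-sample form. The function names and test data are illustrative, and an even number of input samples is assumed.

```python
def iir_direct(x, a, y_init=0):
    """Original recurrence: Y[n] = X[n] + A * Y[n-1]."""
    y, prev = [], y_init
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y


def iir_unrolled(x, a, y_init=0):
    """Two outputs per iteration, after distributivity and constant propagation:
    Y[n-1] = X[n-1] + A * Y[n-2]
    Y[n]   = X[n] + A * X[n-1] + A^2 * Y[n-2]
    """
    if len(x) % 2:
        raise ValueError("this sketch assumes an even number of samples")
    a2 = a * a  # A^2 is a constant, computed once
    y, y_nm2 = [], y_init
    for i in range(0, len(x), 2):
        y_nm1 = x[i] + a * y_nm2
        y_n = x[i + 1] + a * x[i] + a2 * y_nm2  # no longer waits for y_nm1
        y.extend([y_nm1, y_n])
        y_nm2 = y_n
    return y


samples = [1, 2, 3, 4]
print(iir_direct(samples, 2))    # [1, 4, 11, 26]
print(iir_unrolled(samples, 2))  # [1, 4, 11, 26]
```

In the transformed form $Y_n$ no longer depends on $Y_{n-1}$, so both outputs can be computed in parallel from $Y_{n-2}$; this is what shortens the critical path per pair of samples.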

Page 12: L10 : Lower Power High Level Synthesis(1)

High-Level Power Estimation: PMUX and PFU

Page 13: L10 : Lower Power High Level Synthesis(1)

Critical Path

• Longest delay path from input to output in combinational logic

• Determines the operating clock frequency

• Resizing non-critical path transistors (In-Place Optimization)

• Critical path in synchronous sequential logic:

$t_{cycle,min} = t_{ff,max} + t_{logic,max} + t_{setup,max} + t_{skew,max}$

where
– $t_{cycle,min}$: min. value of clock period
– $t_{ff,max}$: max. value of flip-flop delay
– $t_{logic,max}$: max. value of critical path delay
– $t_{setup,max}$: max. value of setup time of flip-flop
– $t_{skew,max}$: max. value of clock skew

(Figure: D flip-flops clocked by clk on either side of a combinational logic block, with path A and path B marked.)
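The cycle-time relation can be evaluated directly. The sketch below assumes $t_{cycle,min}$ is the sum of the four worst-case terms; the delay values are hypothetical.

```python
def min_cycle_ns(t_ff, t_logic, t_setup, t_skew):
    """Minimum clock period: worst-case flip-flop delay + logic delay + setup + skew."""
    return t_ff + t_logic + t_setup + t_skew


def max_frequency_mhz(t_cycle_ns):
    """Highest clock frequency (MHz) for a given period in ns."""
    return 1000.0 / t_cycle_ns


# Hypothetical delays in ns.
period = min_cycle_ns(t_ff=1.0, t_logic=7.5, t_setup=1.0, t_skew=0.5)
print(period, max_frequency_mhz(period))  # 10.0 ns -> 100.0 MHz
```

Shortening the critical path (the $t_{logic,max}$ term) either raises the achievable frequency or, at a fixed frequency, leaves slack that can be traded for a lower supply voltage.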

Page 14: L10 : Lower Power High Level Synthesis(1)

Data path Synthesis

Page 15: L10 : Lower Power High Level Synthesis(1)

System Partitioning

• To decide which components of the system will be realized in hardware and which will be implemented in software.

• High-quality partitioning is critical in high-level synthesis. To be useful, high-level synthesis algorithms should be able to handle very large systems. Typically, designers partition high-level design specifications manually into procedures, each of which is then synthesized individually. Different partitionings of the high-level specifications may produce substantial differences in the resulting IC chip areas and overall system performance.

• To decide whether the system functions are distributed or not. Distributed processors, memories and controllers can lead to significant power savings. The drawback is the increase in area. E.g., a non-distributed and a distributed design of a vector quantizer.

Page 16: L10 : Lower Power High Level Synthesis(1)

Circuit Partitioning

• graph and physical representation

Page 17: L10 : Lower Power High Level Synthesis(1)

VHDL example

(Figure labels: process communication, control/data flow graph, behavioral description)

Page 18: L10 : Lower Power High Level Synthesis(1)

Clustering Example

• Two-cluster Partition

• Three-cluster Partition

Page 19: L10 : Lower Power High Level Synthesis(1)

Clustering (Cont’d)

Page 20: L10 : Lower Power High Level Synthesis(1)

- Design Automation Laboratory -

Abstraction levels:

• System level

• Behavioral level

• Architecture level

• Logic/circuit level

• Device/process level

High-level synthesis steps:

• Design Specification

• CDFG (Control Data Flow Graph)

• Low power and fast scheduling: fast, and enables resource sharing for low power

• Register allocation for low power: minimizing switching activity in registers

• Resource allocation for low power: minimizing switching activity in resources and interconnection

• Datapath generation and controller synthesis

• Write VHDL

Page 21: L10 : Lower Power High Level Synthesis(1)

High-Level Synthesis

Instructions

Operations

Variables

Arrays

signals

Behavioral description of the circuit

Control

Datapath Memory

Operators, Registers,

Memory, Multiplexor

Control

scheduling

Memory inferencing

Register sharing

Control inferencing

for (I = 0; I <= 2; I = I + 1) begin
  @(posedge clk);
  if (rgb[I] % 8) begin
    p = rgb[I] % 8;
    g = filter(x, y) * 8;
  end
  ............

constraints

RTL (register transfer level) architecture

High-level synthesis

Page 22: L10 : Lower Power High Level Synthesis(1)

High-Level Synthesis

• The allocation task determines the type and quantity of resources used in the RTL design. It also determines the clocking scheme, memory hierarchy and pipelining style. To perform the required trade-offs, the allocation task must determine the exact area and performance values.

• The scheduling task schedules operations and memory references into clock cycles. If the number of clock cycles is a constraint, the scheduler has to produce a design with the fewest functional units.

• The binding task assigns operations and memory references within each clock cycle to available hardware units. A resource can be shared by different operations if they are mutually exclusive, i.e. they will never execute simultaneously.

Page 23: L10 : Lower Power High Level Synthesis(1)

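Example of the high-level synthesis process

(Figure: a CDFG with +, < and * operators is scheduled into control steps 1 to 4. Labels: operator, control step, scheduling, resource allocation, CDFG, hardware library, module binding.)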

Page 24: L10 : Lower Power High Level Synthesis(1)

Low Power Scheduling

Page 25: L10 : Lower Power High Level Synthesis(1)

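Low-power methods proposed at the high level

• Operator sharing among sibling operations [Fang, 96]

• Resource sharing that takes data correlation into account [Gebotys, 97]

• Shutting down FUs (demand-driven operation) [Alidina, 94]

• Exploiting the regularity of operations [Rabaey, 96]

• Using dual supply voltages [Sarrafzadeh, 96]

• Minimizing spurious operations [Hwang, 96]

• Minimizing switching activity with a minimum-cost flow algorithm, plus minimizing capacitance by simplifying the interconnect structure [Cho, 97]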

Page 26: L10 : Lower Power High Level Synthesis(1)

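Register power consumption model

(Figure: inputs i, j, k feed a MUX whose output x drives the Register; the register output y drives a DeMux with outputs i*, j*, k*. The MUX and DeMux each have a Control input. Capacitances $C_{out,Mux}$ and $C_{in,Register}$ load x; $C_{out,Register}$ and $C_{in,DeMux}$ load y.)

Power(Register) = switching(x) × ($C_{out,Mux}$ + $C_{in,Register}$) + switching(y) × ($C_{out,Register}$ + $C_{in,DeMux}$)

Since switching(x) = switching(y), Power(Register) = switching(y) × $C_{total}$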
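A minimal sketch of this register model. It assumes that switching(y) is measured as the number of bit flips (Hamming distance) between successive values stored in the register, which is one common interpretation and not stated on the slide; the result is a relative figure, since the supply voltage and clock frequency factors are left out. All capacitance values are hypothetical.

```python
def switching(values, width=8):
    """Total bit flips between successive values held by the register."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(values, values[1:]))


def register_power(values, c_out_mux, c_in_reg, c_out_reg, c_in_demux, width=8):
    """Relative power: switching(y) * C_total, using switching(x) = switching(y)."""
    c_total = c_out_mux + c_in_reg + c_out_reg + c_in_demux
    return switching(values, width) * c_total


# Hypothetical sequence of values written to a 4-bit register.
seq = [0b0000, 0b1111, 0b1010]
print(switching(seq, width=4))                    # 4 + 2 = 6 bit flips
print(register_power(seq, 1, 2, 3, 4, width=4))   # 6 * 10 = 60
```

Under this model, binding variables with similar successive values to the same register lowers switching(y), which is the goal of register allocation for low power.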

Page 27: L10 : Lower Power High Level Synthesis(1)

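CDFG (control data flow graph)

e = a + b;
g = c + d;
f = e + b;
h = f * g;

(Figure: CDFG representation of the circuit, with inputs a, b, c, d, operation nodes +1, +2, +3 and *1, intermediate values e, g, f, and output h.)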
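The four statements of this CDFG are small enough to schedule by hand, but a short sketch makes the ASAP and ALAP orderings explicit. It assumes every operation takes one control step and that primary inputs are available before step 1; the data structure and function names are illustrative.

```python
# result variable -> (operator, left operand, right operand)
OPS = {
    "e": ("+", "a", "b"),
    "g": ("+", "c", "d"),
    "f": ("+", "e", "b"),
    "h": ("*", "f", "g"),
}


def asap(ops):
    """Earliest control step of each operation."""
    step = {}

    def visit(v):
        if v not in ops:          # primary input, ready at step 0
            return 0
        if v not in step:
            _, left, right = ops[v]
            step[v] = 1 + max(visit(left), visit(right))
        return step[v]

    for v in ops:
        visit(v)
    return step


def alap(ops, n_steps):
    """Latest control step of each operation within n_steps."""
    users = {v: [u for u, (_, l, r) in ops.items() if v in (l, r)] for v in ops}
    step = {}

    def visit(v):
        if v not in step:
            step[v] = min((visit(u) - 1 for u in users[v]), default=n_steps)
        return step[v]

    for v in ops:
        visit(v)
    return step


def units_needed(ops, step):
    """Functional units per operator type: max operations of that type in any step."""
    per_step = {}
    for v, s in step.items():
        key = (ops[v][0], s)
        per_step[key] = per_step.get(key, 0) + 1
    need = {}
    for (op, _), count in per_step.items():
        need[op] = max(need.get(op, 0), count)
    return need


print(asap(OPS))                         # e and g in step 1, f in step 2, h in step 3
print(alap(OPS, 3))                      # g moves to step 2
print(units_needed(OPS, alap(OPS, 3)))   # two adders, one multiplier
```

Both schedules finish in three control steps and need two adders and one multiplier; they differ only in where g = c + d is placed, which changes variable lifetimes and therefore the register count.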

Page 28: L10 : Lower Power High Level Synthesis(1)

Schematic to CDFG of FIR3

Page 29: L10 : Lower Power High Level Synthesis(1)

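Determining the number of registers and resources

(Figure: lifetime chart for variables a through h over steps 1 to 4, next to the scheduled CDFG with nodes +1, +2, +3 and *1 over control steps 1 to 3. Resources used per control step: step 1: A1; step 2: A1, A2; step 3: M1.)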