L11: Lower Power High Level Synthesis(2)


Page 1: L11: Lower Power High Level Synthesis(2)

L11: Lower Power High Level Synthesis(2)

1999. 8

Sungkyunkwan University, Prof. Jun Dong Cho http://vada.skku.ac.kr

Page 2: L11: Lower Power High Level Synthesis(2)

Exploiting spatial locality for interconnect power reduction

• A spatially local cluster: a group of algorithm operations that are tightly connected to each other in the flowgraph representation.

• Two nodes are tightly connected to each other in the flowgraph representation if the shortest distance between them, in terms of the number of edges traversed, is low.

• A spatially local assignment is a mapping of the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware.

• Partitioning the algorithm into spatially local clusters ensures that the majority of the data transfers take place within clusters (with local bus) and relatively few occur between clusters (with global bus).

• The partitioning information is passed to the architecture netlist and floorplanning tools.

• Local: a given adder outputs data to its own inputs.

• Global: a given adder outputs data to another adder's inputs.
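The definitions above can be illustrated with a minimal Python sketch. It assumes the flowgraph is given as a list of edges between operation names and that a cluster assignment is already known; the function names, the node names, and the example graph are hypothetical and not taken from the slides.

```python
from collections import deque


def shortest_distance(edges, source, target):
    """Number of edges on the shortest (undirected) path between two nodes."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return None  # the two nodes are not connected


def count_transfers(edges, cluster_of):
    """Split data transfers into local (same cluster) and global (across clusters)."""
    local = sum(1 for u, v in edges if cluster_of[u] == cluster_of[v])
    return local, len(edges) - local


# Hypothetical flowgraph: three additions feeding two multiplications.
edges = [("a1", "a2"), ("a2", "a3"), ("a3", "m1"), ("m1", "m2")]
cluster_of = {"a1": 0, "a2": 0, "a3": 0, "m1": 1, "m2": 1}
local, global_ = count_transfers(edges, cluster_of)
```

With this assignment only the `a3` to `m1` transfer crosses clusters, so it is the only one that would need the global bus.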

Page 3: L11: Lower Power High Level Synthesis(2)

Hardware Mapping

• The last step in the synthesis process maps the allocated, assigned and scheduled flow graph (called the decorated flow graph) onto the available hardware blocks.

• The result of this process is a structural description of the processor architecture, (e.g., sdl input to the Lager IV silicon assembly environment).

• The mapping process transforms the flow graph into three structural sub-graphs:

– the data path structure graph

– the controller state machine graph

– the interface graph (between data path control inputs and the controller output signals)

Page 4: L11: Lower Power High Level Synthesis(2)

Spectral Partitioning in High-Level Synthesis

• The eigenvector placement obtained forms an ordering in which nodes tightly connected to each other are placed close together.

• The relative distances are a measure of the tightness of the connections.

• Use the eigenvector ordering to generate several partitioning solutions.

• The area estimates are based on distribution graphs.

• A distribution graph displays the expected number of operations executed in each time slot.

• Local bus power: the number of local data transfers times the area of the cluster.

• Global bus power: the number of global data transfers times the total area.
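The eigenvector ordering can be sketched in plain Python as follows. This is only an illustration under stated assumptions: it approximates the Laplacian eigenvector of the second-smallest eigenvalue (the Fiedler vector) with a shifted power iteration, assumes a connected graph, and scores each candidate split only by the number of global transfers it creates. The function names and the example graph are hypothetical, and the actual HYPER-LP procedure may differ.

```python
def fiedler_vector(num_nodes, edges, iterations=500):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue."""
    degree = [0.0] * num_nodes
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
        degree[u] += 1.0
        degree[v] += 1.0
    # 2 * max degree bounds the largest Laplacian eigenvalue, so
    # M = shift * I - L has only non-negative eigenvalues.
    shift = 2.0 * max(degree)
    # Zero-mean start vector (orthogonal to the constant eigenvector).
    x = [i - (num_nodes - 1) / 2.0 for i in range(num_nodes)]
    for _ in range(iterations):
        # y = M x = (shift - degree) * x + A x
        y = [(shift - degree[i]) * x[i] + sum(x[j] for j in neighbors[i])
             for i in range(num_nodes)]
        mean = sum(y) / num_nodes
        y = [value - mean for value in y]  # deflate the constant eigenvector
        norm = sum(value * value for value in y) ** 0.5
        x = [value / norm for value in y]
    return x


def candidate_partitions(num_nodes, edges):
    """Order nodes by eigenvector value and try every split point of the ordering."""
    vector = fiedler_vector(num_nodes, edges)
    order = sorted(range(num_nodes), key=lambda node: vector[node])
    candidates = []
    for cut in range(1, num_nodes):
        left = set(order[:cut])
        global_transfers = sum(1 for u, v in edges if (u in left) != (v in left))
        candidates.append((global_transfers, sorted(left), sorted(order[cut:])))
    return order, candidates


# Hypothetical flowgraph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
order, candidates = candidate_partitions(6, edges)
best = min(candidates, key=lambda candidate: candidate[0])
```

For graphs with a small spectral gap the fixed iteration count may be too low; a library eigensolver would be the more robust choice in practice.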

Page 5: L11: Lower Power High Level Synthesis(2)

Finding a good Partition

Page 6: L11: Lower Power High Level Synthesis(2)

Interconnection Estimation

• For connections within a datapath (over-the-cell routing), routing between units increases the actual height of the datapath by approximately 20-30%, and most wire lengths are about 30-40% of the datapath height.

• Average global bus length: the square root of the estimated chip area.

• The three terms of the area estimate represent white space, active area of the components, and wiring area. The coefficients are derived statistically.
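The rules of thumb above translate into a few one-line estimators. The sketch below is illustrative only: the function names are hypothetical, and the default values 0.25 and 0.35 are simply midpoints of the 20-30% and 30-40% ranges quoted above, not coefficients from the slides. The statistically derived area coefficients are not given in the transcript, so the chip area is taken as an input.

```python
import math


def routed_datapath_height(unrouted_height, routing_overhead=0.25):
    """Datapath height after over-the-cell routing (assumed 20-30% increase)."""
    return unrouted_height * (1.0 + routing_overhead)


def typical_local_wire_length(datapath_height, fraction=0.35):
    """Typical wire length inside a datapath (assumed 30-40% of its height)."""
    return datapath_height * fraction


def average_global_bus_length(estimated_chip_area):
    """Average global bus length as the square root of the estimated chip area."""
    return math.sqrt(estimated_chip_area)


# Example with made-up dimensions (arbitrary length units).
height = routed_datapath_height(100.0)            # 125.0
local_wire = typical_local_wire_length(height)    # about 43.75
global_bus = average_global_bus_length(4.0e6)     # 2000.0
```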

Page 7: L11: Lower Power High Level Synthesis(2)

Incorporating into HYPER-LP

Page 8: L11: Lower Power High Level Synthesis(2)

Experiments

Page 9: L11: Lower Power High Level Synthesis(2)

Datapath Generation

• Register file recognition and multiplexer reduction:

– Individual registers are merged as much as possible into register files.

– This reduces the number of bus multiplexers, the overall number of busses (since all registers in a file share the input and output busses), and the number of control signals (since a register file uses a local decoder).

• Minimize the multiplexers and I/O busses simultaneously (clique partitioning is NP-complete, so simulated annealing is used).

• Data path partitioning is used to optimize the processor floorplan.

• The core idea is to grow pairs of isomorphic regions, as large as possible, from corresponding pairs of seed nodes.

Page 10: L11: Lower Power High Level Synthesis(2)

Hardware Mapper

Page 11: L11: Lower Power High Level Synthesis(2)

Hyper's Basic Architecture Model

Page 12: L11: Lower Power High Level Synthesis(2)

Hyper's Crossbar Network

Page 13: L11: Lower Power High Level Synthesis(2)

Refined Architecture Model

Page 14: L11: Lower Power High Level Synthesis(2)

Bus Merging

Page 15: L11: Lower Power High Level Synthesis(2)

Fanin Bus Merging

Page 16: L11: Lower Power High Level Synthesis(2)

Fanout Bus merging

Page 17: L11: Lower Power High Level Synthesis(2)

Global bus Merging

Page 18: L11: Lower Power High Level Synthesis(2)

Test Example

Page 19: L11: Lower Power High Level Synthesis(2)

Control Signal Assignment

Page 20: L11: Lower Power High Level Synthesis(2)

Factors of the coarse-grained model (obtained by a switch-level simulator)

Page 21: L11: Lower Power High Level Synthesis(2)

- Design Automation Laboratory -

Low Power Scheduling and Binding

[Figure: two schedules of operations 1-5 over control steps 1-4, with each operation bound to unit A1 or A2.]

(a) Scheduling without low-power consideration

(b) Scheduling with low-power consideration

Page 22: L11: Lower Power High Level Synthesis(2)

[Equation not recoverable from the extraction: an expression in the coarse-grained factors $P_{cla}$ and $P_{booth}$, including cla/booth terms, whose result is a power reduction of 17%.]

The coarse-grained model provides a fast estimate of the power consumption when no information about the activity of the input data to the functional units is available.

Page 23: L11: Lower Power High Level Synthesis(2)

Fine-grained model

Average Hamming Distance is:

$$AHD(x) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(x_{i-1}, x_i)$$

where $H(p, q)$ is the Hamming distance between $p$ and $q$; $x_i$ is the value of operand $x$ in iteration $i$ of the algorithm; and $n$ is the total number of iterations.

The fine-grained model applies when information about the activity of the input data to the functional units is available.
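For a finite trace of operand values, the average Hamming distance can be estimated directly. The following Python sketch assumes non-negative integer operand values and averages over the observed transitions instead of taking the limit; the function names are hypothetical.

```python
def hamming_distance(p, q):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(p ^ q).count("1")


def average_hamming_distance(values):
    """Finite-trace estimate of AHD: mean Hamming distance of consecutive values."""
    if len(values) < 2:
        raise ValueError("need at least two operand values")
    transitions = list(zip(values, values[1:]))
    return sum(hamming_distance(a, b) for a, b in transitions) / len(transitions)


# A 4-bit operand that toggles every bit in each iteration has AHD = 4,
# while one that changes a single bit per iteration has AHD = 1.
high_activity = average_hamming_distance([0b0000, 0b1111, 0b0000])
low_activity = average_hamming_distance([0b000, 0b001, 0b011, 0b111])
```

A higher value means more bits switch at the functional-unit inputs per iteration, which is the activity measure the fine-grained model relies on.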

Page 24: L11: Lower Power High Level Synthesis(2)

Effect of the operand activity on the power consumption of an 8 × 8-bit Booth multiplier.

[Figure: plot with axis labels "AHD" and "Input data".]

Page 25: L11: Lower Power High Level Synthesis(2)

High-Level Power Estimation: PMUX and PFU

Page 26: L11: Lower Power High Level Synthesis(2)

Loop Interchange

If matrix A is laid out in memory in column-major form, execution order (a.2) implies more cache misses than execution order (b.2). Thus, the compiler chooses algorithm (b.1) to reduce the running time.
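The effect of the loop order can be shown with a toy model. The listings (a.1) and (b.1) are not part of the transcript, so the Python sketch below does not reproduce them. It only compares the two possible loop orders over a column-major matrix, under a deliberately simple assumption: a cache that holds a single line, so that an access misses whenever it falls in a different line than the previous access. All names are hypothetical.

```python
def column_major_address(i, j, n_rows):
    """Linear address of element A[i][j] when A is stored column by column."""
    return j * n_rows + i


def access_sequence(n_rows, n_cols, row_index_innermost):
    """Addresses touched by a doubly nested loop over all elements of A."""
    if row_index_innermost:
        # for j: for i: A[i][j]  -> consecutive addresses in column-major layout
        return [column_major_address(i, j, n_rows)
                for j in range(n_cols) for i in range(n_rows)]
    # for i: for j: A[i][j]  -> stride of n_rows between consecutive accesses
    return [column_major_address(i, j, n_rows)
            for i in range(n_rows) for j in range(n_cols)]


def count_misses(addresses, line_size):
    """Misses in a toy cache holding one line of `line_size` elements."""
    misses = 0
    current_line = None
    for address in addresses:
        line = address // line_size
        if line != current_line:
            misses += 1
            current_line = line
    return misses


# 4 x 4 matrix, cache line of 4 elements.
misses_interchanged = count_misses(access_sequence(4, 4, True), 4)   # 4
misses_original = count_misses(access_sequence(4, 4, False), 4)      # 16
```

In this model the order with the row index innermost walks memory sequentially and misses once per column, while the other order misses on every access.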