
Exploring Custom Instruction Synthesis for Application-Specific Instruction Set Processors with Multiple Design Objectives

Lin, Hai; Fei, Yunsi

ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), 2010

Date: 2010/05/20

吳俊雄


OUTLINE

INTRODUCTION

MULTI-OBJECTIVE ASIP DESIGN

Two Algorithms for Custom Instruction Synthesis

1. Mixed Integer Linear Programming

2. Simulated Annealing Method

EXPERIMENTAL RESULTS


INTRODUCTION

Traditional custom instruction synthesis flows for ASIPs mainly target performance improvement.

We show that the existing custom instruction exploration algorithms

1. Mixed Integer Linear Programming (MILP)

2. Simulated Annealing Method

and cost estimation methods

1. Performance improvement

2. Energy efficiency

3. Area overhead


INTRODUCTION

Our work presented in this paper makes three major contributions:

1. We address the importance of energy and resource efficiency in ASIP design

2. We discuss a set of key factors during the custom instruction selection

3. We show that traditional design space exploration algorithms are either infeasible or inefficient at estimating all the necessary factors

Since the theoretical complexity of exploring the design space thoroughly is O(2^n), most practical techniques adopt heuristics to prune the design space during the search.

We present a holistic ASIP synthesis and simulation flow that allows the optimization goal to be adjusted among energy efficiency, area overhead and performance.


MULTI-OBJECTIVE ASIP DESIGN

There are two major energy factors:

1. Instruction fetch consumes a considerable portion of the total energy within a processor.

2. The data communication between operations is originally implemented through register file accesses within the base processor.

The dynamic energy consumption is affected by the reduction of the number of instructions and data register file accesses.



Custom processor 1 with CFU1 achieves better performance improvement, because it utilizes operation parallelism in the DFG to reduce the total execution cycles.

Custom processor 2 with CFU2 achieves larger energy saving, because it realizes a sub-graph covering more operations and data transfer edges.



We show that generating custom instructions from a DFG can be viewed as solving an operation scheduling problem.

The scheduling scheme should ensure data dependency and that the input/output edges of each software stage satisfy the I/O constraint set by the register file ports.

For a scheduling scheme, the number of software stages with operations in them represents the number of instructions for the customized processor.

The edges across different software stages represent register file accesses.
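The counting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DFG, the operation names, the stage assignments and the function name are all hypothetical, and only edges between operations inside the DFG are counted (live-in and live-out values are ignored).

```python
def count_instructions_and_rf_accesses(stage_of, edges):
    """Return (number of non-empty software stages, number of cross-stage edges)."""
    # Each software stage that holds at least one operation becomes one instruction.
    num_instructions = len(set(stage_of.values()))
    # A data edge whose endpoints sit in different stages goes through the register file.
    rf_accesses = sum(1 for src, dst in edges if stage_of[src] != stage_of[dst])
    return num_instructions, rf_accesses


# Hypothetical 5-operation DFG: op1 and op2 feed op3, op3 feeds op4 and op5.
edges = [("op1", "op3"), ("op2", "op3"), ("op3", "op4"), ("op3", "op5")]

# One operation per stage (pure software) versus two fused stages.
software = {"op1": 0, "op2": 1, "op3": 2, "op4": 3, "op5": 4}
fused = {"op1": 0, "op2": 0, "op3": 0, "op4": 1, "op5": 1}

print(count_instructions_and_rf_accesses(software, edges))
print(count_instructions_and_rf_accesses(fused, edges))
```

In this toy case, grouping op1, op2 and op3 into one stage removes both instructions and cross-stage edges, which is the effect the scheduling view is meant to capture.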


Two Algorithms for Custom Instruction Synthesis

Mixed Integer Linear Programming (MILP)

Primary variable definition:

i: index of the operations; l: index of the software stages.

Parameter definition: hardware execution delay

k: index of the operation types.

S_{3,4} = 1



Assistant variable definition: execution cycle delay

Constraints:

1. data dependency constraint

2. I/O constraint
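As a rough illustration of the two constraints, the following Python sketch checks a candidate stage assignment outside of any MILP solver. It is not the MILP formulation itself. The function name, the example DFG and the way inputs and outputs are counted (distinct values crossing a stage boundary) are assumptions; the 4-input/2-output default is one reading of the 4/2 register file constraint quoted in the experimental setup.

```python
def check_schedule(stage_of, edges, primary_inputs, primary_outputs,
                   max_in=4, max_out=2):
    """Return True if the stage assignment meets both constraints."""
    # Data dependency: a producer may not sit in a later stage than its consumer.
    for src, dst in edges:
        if stage_of[src] > stage_of[dst]:
            return False

    inputs = {}   # stage -> distinct values read from outside the stage
    outputs = {}  # stage -> distinct values the stage must write back
    for src, dst in edges:
        if stage_of[src] != stage_of[dst]:
            inputs.setdefault(stage_of[dst], set()).add(src)
            outputs.setdefault(stage_of[src], set()).add(src)
    # Values entering or leaving the whole DFG also use register file ports.
    for value, consumer in primary_inputs:
        inputs.setdefault(stage_of[consumer], set()).add(value)
    for op in primary_outputs:
        outputs.setdefault(stage_of[op], set()).add(op)

    # I/O constraint: every stage must fit within the register file ports.
    return (all(len(v) <= max_in for v in inputs.values())
            and all(len(v) <= max_out for v in outputs.values()))


# Hypothetical DFG: op1 and op2 feed op3, op3 feeds op4 and op5.
edges = [("op1", "op3"), ("op2", "op3"), ("op3", "op4"), ("op3", "op5")]
primary_inputs = [("a", "op1"), ("b", "op1"), ("c", "op2"), ("d", "op2")]
primary_outputs = ["op4", "op5"]

fused = {"op1": 0, "op2": 0, "op3": 0, "op4": 1, "op5": 1}
backward = {"op1": 1, "op2": 0, "op3": 0, "op4": 1, "op5": 1}

print(check_schedule(fused, edges, primary_inputs, primary_outputs))
print(check_schedule(backward, edges, primary_inputs, primary_outputs))
```

Operations placed in the same stage are treated as fused into one instruction, so an edge inside a stage uses no register file port in this sketch.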

Sd_6 = 0.8



SN: the number of instructions

SE: the total number of data accesses

For multi-issue, out-of-order processors, the execution cycle count equals the longest execution path delay of the DFG.

The largest number of operations of this type among the different software stages.

The number of functional modules (operators) of type k needed in the final custom hardware extension.



The unit hardware area of functional module type k.

Cost terms: energy consumption, area overhead, execution cycle
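One way to read the area and cost definitions together is sketched below in Python. It is an assumption-laden illustration, not the formulation from the slides: it takes the number of modules of type k to be the largest count of type-k operations in any single stage, and it combines the three cost terms as a plain weighted sum. The function names, the weight names (w_energy, w_area, w_cycles), the operation types and the unit areas are all hypothetical.

```python
from collections import Counter


def hardware_area(stage_of, op_type, unit_area):
    """Estimate area as sum over types of unit area times modules needed."""
    # Count operations of each type within each software stage.
    per_stage = {}
    for op, stage in stage_of.items():
        per_stage.setdefault(stage, Counter())[op_type[op]] += 1
    # Modules of type k needed = largest count of type k in any single stage.
    modules = Counter()
    for counts in per_stage.values():
        for k, c in counts.items():
            modules[k] = max(modules[k], c)
    return sum(unit_area[k] * n for k, n in modules.items())


def weighted_cost(energy, area, cycles, w_energy, w_area, w_cycles):
    """Weighted sum of the three cost terms (assumed form)."""
    return w_energy * energy + w_area * area + w_cycles * cycles


# Hypothetical schedule, operation types and unit areas.
stage_of = {"op1": 0, "op2": 0, "op3": 0, "op4": 1, "op5": 1}
op_type = {"op1": "add", "op2": "add", "op3": "mul", "op4": "add", "op5": "shift"}
unit_area = {"add": 1.0, "mul": 4.0, "shift": 0.5}

area = hardware_area(stage_of, op_type, unit_area)
print(area)
print(weighted_cost(energy=10.0, area=area, cycles=3,
                    w_energy=1, w_area=0, w_cycles=0))
```

Setting two of the three weights to zero reduces the cost to a single objective, which mirrors how the experiments isolate energy reduction and performance improvement.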

The advantage of applying MILP to solve the scheduling problem is that, theoretically, it can find the optimum solution to the problem with sufficient searching time.



Simulated Annealing Method

Solution vector definition: OPv = {op1, op2, op3, ..., opn}

Solution variation mechanism:

In each iteration, we randomly select n operations and move them to a different software stage to generate a new solution.

n represents the maximum distance between the current solution and the one it evolves to.

t is the current temperature, T is the starting temperature, and N is the total number of operations.



The allowable range within which an operation can move is determined by the locations of its parent and child nodes.

In our algorithm, the actual moving range for an operation is further tightened by the current temperature: range = R * sqr(t/T). We randomly move the operation to a software stage within this range.
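The move step can be sketched in Python as follows. Two assumptions are made loudly here: "sqr" is read as a square root (if it denotes squaring, replace math.sqrt accordingly), and the range is treated as a radius around the operation's current stage. The helper names and the parents/children dictionaries are hypothetical.

```python
import math
import random


def allowed_window(op, stage_of, parents, children, num_stages):
    """Stages an operation may occupy without violating data dependencies."""
    earliest = max((stage_of[p] for p in parents.get(op, [])), default=0)
    latest = min((stage_of[c] for c in children.get(op, [])),
                 default=num_stages - 1)
    return earliest, latest


def pick_new_stage(earliest, latest, current, t, T, R, rng=random):
    """Pick a stage near the current one; the radius shrinks as t falls."""
    # Assumed reading of the formula: radius = R * sqrt(t / T).
    radius = int(R * math.sqrt(t / T))
    low = max(earliest, current - radius)
    high = min(latest, current + radius)
    return rng.randint(low, high)


# Hypothetical example: op3 has parents in stages 1 and 2 and a child in stage 9.
stage_of = {"op1": 1, "op2": 2, "op3": 5, "op4": 9}
parents = {"op3": ["op1", "op2"]}
children = {"op3": ["op4"]}

earliest, latest = allowed_window("op3", stage_of, parents, children, num_stages=12)
print(earliest, latest)
print(pick_new_stage(earliest, latest, stage_of["op3"], t=100.0, T=100.0, R=3))
```

At the starting temperature the operation can move up to R stages in either direction; near the end of the anneal the radius falls to zero and the operation stays where it is.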

R=[3~8]



Solution acceptance mechanism: A new solution is accepted when its cost is smaller than that of the current solution. When its cost is larger, it can still be accepted with a probability p.
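A minimal Python sketch of such an acceptance test is given below. The expression for p is not reproduced in this transcript, so the sketch assumes the standard Metropolis form p = exp(-(new cost - current cost) / t), which may differ from the one used in the paper. The function name is hypothetical.

```python
import math
import random


def accept(new_cost, current_cost, t, rng=random):
    """Decide whether simulated annealing moves to the new solution."""
    # Always take an improvement.
    if new_cost < current_cost:
        return True
    # Metropolis criterion (assumed): worse moves pass with probability p,
    # which shrinks as the cost increase grows or the temperature t drops.
    p = math.exp(-(new_cost - current_cost) / t)
    return rng.random() < p


print(accept(1.0, 2.0, t=1.0))
print(accept(1000.0, 1.0, t=0.001))
```

With this form, a high temperature lets the search escape local minima early on, while a low temperature makes it behave almost greedily.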

Simulated Annealing algorithm balances the trade-off between the solution quality and searching time.




MULTI-OBJECTIVE ASIP SYNTHESIS FLOW



CPLEX is used to solve the MILP problem for design space exploration.

The baseline processor is an out-of-order MIPS-style processor.

The ratio between the weight variables g1 and g2 is set to 12.2 : 1.

The register file I/O constraints are set to 4/2.

We perform experiments for energy reduction by setting the variables å2 and å3 to zero, and for performance improvement by setting å1 and å2 to zero.



The average speedup is 1.42 for Binary Tree, 1.64 for MILP (p.), and 1.56 for MILP (e.).

The average energy consumption reductions are 18.1%, 22.7% and 29.8%.



The custom instruction templates presented in (b) and (c) target performance and energy efficiency, respectively. The templates identified for energy efficiency, shown in (c), contain more operations and have longer critical paths than the sub-graphs shown in (b).



For different designs, the ratio between å1 and å2 can be varied to find the best trade-off between them.

å3 = 0

å1 = 1, å2 = 0

å1 = å2 = 0.5



The SA algorithm achieves an average performance speedup of 1.46, slightly lower than the 1.64 achieved by the MILP algorithm.
