
Technische Universität München

Department of Electrical Engineering and Information Technology

Institute for Electronic Design Automation

Efficient Parallelization of Robustness Validation for Digital Circuits

Bachelor Thesis

Suyash Shukla


Supervisor: Dipl.-Ing. Martin Barke

Supervising Professor: Prof. Dr.-Ing. Ulf Schlichtmann

Topic issued: 24.10.2012

Date of submission: 14.01.2013

Suyash Shukla
Alzeyer Strasse 2
80993 Munich


Abstract

As the feature sizes of integrated circuits (ICs) enter nanometer technologies, this shrink in size brings up reliability issues such as negative bias temperature instability (NBTI) and hot carrier injection (HCI) as potential threats. All these reliability issues cause irregular and unsteady failures, ultimately causing the performance of the chips to degrade over time. Due to these issues, ageing analysis, which captures the change of device parameters over time, is now being studied. In this thesis, the robustness of integrated circuits is studied by means of a couple of C++ programs, and parts of the code are efficiently parallelized for the robustness validation of digital circuits. I used the OpenMP interface and its directives to make the existing codes run in parallel, using multiple processors and therefore saving time.


Contents

1. Introduction
   1.1. Motivation
   1.2. Goals
   1.3. Achievements
   1.4. Report structure
2. Background
   2.1. Robustness model
   2.2. Measuring robustness
   2.3. Adaptation to different applications
        2.3.1. Timing analysis and timing graph (TG)
   2.4. Calculating the area
3. Parallel programming with OpenMP
   3.1. Overview
   3.2. Architecture and parallel programming models
        3.2.1. Architecture
        3.2.2. Parallel programming models
               3.2.2.1. Shared and distributed memory
               3.2.2.2. Threads model
               3.2.2.3. Single program, multiple data
   3.3. OpenMP
        3.3.1. OpenMP executable directives and their clauses
        3.3.2. OpenMP runtime library routines
        3.3.3. Environment variables
        3.3.4. Summary
4. Implementation of parallel program and OpenMP in robustness
   4.1. The existing serial program: robustness.cpp
   4.2. The existing serial programs: DichotomyCalculation.cpp and RobustnessCalculation.cpp
   4.3. Ideology
   4.4. Changes made in existing code
   4.5. Difficulties faced
5. Evaluation and Results
6. Conclusion
   6.1. Further improvements
Acknowledgments
References


List of Figures

Figure 1: Operating conditions and system properties, Π_nom and Φ_nom
Figure 2: Property or performance space, Φ_S, of task A and task B
Figure 3: Perturbation space, Π_S, where the system has to work properly for two tasks A and B
Figure 4: A robust system
Figure 5: Computation of arrival time (AT)
Figure 6: Perturbation and performance space being mapped to VDD, T and f respectively
Figure 7: Finding valid points starting from the 4 specification corners by dichotomy calculation
Figure 8: A pair of double temp which gives the boundary of the robust region
Figure 9: Serial computing
Figure 10: Parallel computing
Figure 11: The four main components of a computer
Figure 12: Shared memory configuration
Figure 13: Distributed memory configuration
Figure 14: Threads model
Figure 15: OpenMP uses the fork-join model for parallel execution
Figure 16: Three components of OpenMP used to examine and modify the parallel execution parameters
Figure 17: The "sections" directive
Figure 18: Screenshot of my basic OpenMP code used in the program
Figure 19: Declaring one *tg and then using it to call different functions, all from the same class, TG
Figure 20: Code where the use_profile and the maps of all the nodes of the circuit are created
Figure 21: The four corners of specifications which need to be verified by the function verifySpec( ) only once
Figure 22: Depending on 'specViolated', area calculation takes place by the function iterate_d( )
Figure 23: Iteration taking place in two directions from each corner for calculating the area of the robust region
Figure 24: Eight valid points which define the area and are needed for calculating the robustness value
Figure 25: Each TG calls functions like updateAT( ) by: m_TGVec[omp_get_thread_num( )] -> updateAT( )
Figure 26: The values of 'temp1' to 'temp5' are pushed back in the vector outside the parallel region
Figure 27: CMakeLists.txt, which includes all the different compiler flags
Figure 28: The area of the robust region cannot be determined by simply parallelizing iterate_d( )
Figure 29: Both processors are being utilized for the command 'robustness'
Figure 30: Time elapsed for computation decreases with an increase in OpenMP threads
Figure 31: Results for cell c3540_i89, 2D, robustness at 10 years, RECHNER3, number of threads: 1
Figure 32: Results for cell c3540_i89, 2D, robustness at 10 years, RECHNER3, number of threads: 2
Figure 33: Results for cell c3540_i89, 2D, robustness at 10 years, RECHNER3, number of threads: 4
Figure 34: Results for cell c3540_i89, 2D, robustness at 10 years, RECHNER3, number of threads: 8


List of Tables

Table 1: CPU usage of the two machines while running the program 'robustness' with different numbers of OpenMP threads
Table 2: Results for NangateDesigns: cell c1355_i89, dimension: 2D, maximum age: robustness at 10 years, machine: REIN (2 core processors)
Table 3: Results for NangateDesigns: cell c1355_i89, dimension: 2D, maximum age: robustness at 10 years, machine: RECHNER3 (8 core processors)
Table 4: Results for NangateDesigns: cell c1908_i89, dimension: 2D, maximum age: robustness at 10 years, machine: REIN (2 core processors)
Table 5: Results for NangateDesigns: cell c1908_i89, dimension: 2D, maximum age: robustness at 10 years, machine: RECHNER3 (8 core processors)


1. Introduction

Nothing in this world lasts forever, due to unavoidable conditions. This applies to humans, animals, things and even digital circuits. All manufacturers of integrated circuits wish that their integrated circuits would operate without faults for many years, which is impossible. Due to issues like NBTI, HCI, time-dependent dielectric breakdown (TDDB) and electromigration (EM) over prolonged periods, device characteristics change, which causes the circuit to fail. Over time, the gate delays also increase as the circuit ages. This leads to violations of the timing specification, although the specifications were met right after manufacturing. To increase the lifetime of these circuits, the ability of different ICs to cope with such reliability issues and errors during their execution was investigated. This ability is called the robustness of integrated circuits. Digital components in these circuits should show robust behaviour. This means that the output of the circuit should depend entirely on the input and should be controllable.

The analysis of robustness in this thesis is integrated into an automated design flow, which includes the calculation of robustness over the years in the presence of NBTI and HCI by software algorithms and programs. In this report, an efficient parallelization of the ageing analysis for various digital circuits is discussed. Certain parts of the code are executed in parallel, depending on the number of threads and processors. This parallelization speeds up the calculation of robustness for integrated circuits, which saves a lot of time, especially when dealing with larger circuits. The code is parallelized by using parallelism tools like OpenMP together with the existing C++ code. The parallelized code together with the OpenMP interface is evaluated by analyzing the CPU usage percentage and measuring the time elapsed while the computation takes place.

1.1. Motivation

Most of the programs people write and execute run serially: the instructions in the program are executed one after the other. This is called serial computing, where the program runs on a single processor. The existing codes were all serial programs, i.e. they ran on a single processor even though the computer had two or more processor cores. With multi-core machines becoming standard, I want my program to utilize all the other processors, i.e. I want it to work on several tasks at once, in parallel. That is what multi-threading allows me to do. Recently, however, there has been a huge shift towards parallel computing. A major source of speed-up is parallelizing operations. Parallel machines provide a wonderful opportunity for applications with large computational requirements, in my case larger integrated circuits. In order to achieve this, I need to create extra threads, each running code simultaneously on a different processor core, utilizing all the resources. The introduction of multi-core processors has encouraged parallel computing, where the program executes in a more efficient way, which is only possible with the proper use of parallel multiprocessing interfaces.


1.2. Goals

The objective of this project is to find parts of the code that can be efficiently parallelized by using parallelization tools or other specific interfaces. In particular, this project has the following aims:

- Get a clear idea of what the program needs to do
- Decide on what protocol or interface to use, depending on the existing programming language
- Design algorithms for how the work can be distributed between multiple processors
- Finally, implement the algorithms in the areas of the programs that need to be parallelized

1.3. Achievements

The OpenMP interface was ultimately chosen to parallelize the codes. The goal of OpenMP is to provide a standard and portable API for writing shared memory parallel programs. This API is specified for the C/C++ and Fortran programming languages, and it is portable across major platforms, especially Linux/Unix. The language extensions in OpenMP fall into three categories:

1. Directives

2. Run-time routines

3. Environment variables

OpenMP directives are used to tell the compiler which parts of the program are to be executed in parallel and how those parts of the code are assigned to individual threads. The compiler then generates explicitly threaded code, which results in a multithreaded object program.

1.4. Report Structure

Background

This section will discuss all the previous work done related to the project, robustness validation. It will also discuss some background directly relevant to understanding this report.

Parallel programming with OpenMP

This section will discuss parallel computing and the model for shared memory programming. OpenMP and its extensions used in the project will be briefly discussed.


Implementation of OpenMP in robustness validation

This section will discuss the existing programs and their functions, and the algorithm and idea that motivated parallelizing the existing code. The difficulties faced while implementing the OpenMP parallel code will also be discussed in this section.

Evaluation

This section will evaluate my results by analyzing processor usage and the time taken while running the robustness program with a single thread and with multiple threads.

Conclusion

This section will summarise the work done. Further improvements or changes that can be carried forward will also be discussed here.


2. Background

The design of robust systems is becoming important as integrated circuits enter nanotechnology. Digital components play a leading role in embedded systems. Their external environment can be software, other digital and analog components, or the physical world. As these digital components age over time, the input data received from their respective environment can be imprecise or contain errors. Nonetheless, a well-designed system should have an output behaviour that remains robust even in the presence of such inaccuracies in the input sequence. The design of such robust components is a challenging task in the field of embedded systems. For critical sectors where precision is a must, like automotive and aerospace, for example flight controllers, a smooth control action, longevity and robust behaviour must be guaranteed. These systems should be highly immune to external disturbances and perturbations.

However, given the different implementations of a system in terms of robustness, their specifications and applications, engineers needed a concept that takes care of the needs of microelectronic circuits and systems through solid definitions and proper measuring techniques. Therefore, a general model for quantitatively measuring robustness was created.

2.1. Robustness model

In Figure 1, the system S takes in an input sequence x(t) and transforms it into an output sequence y(t). The operating conditions Π and the system properties Φ have to be considered too. The properties or performance features Φ_i have to be fulfilled for the system to be robust, whereas each perturbation Π_i which influences the performance is simply a representation of all operating conditions, the perturbation space, to which all the production devices will be exposed.

Figure 1: Operating conditions and system properties are given by their nominal values Π_nom and Φ_nom or their sets Π and Φ respectively. [Ref: [9]]


In case a system performs two applications, task A and task B, the performance space has to stay within the regions Φ_A and Φ_B. Therefore, the properties of a robust system lie in the intersection Φ_A ∩ Φ_B, as shown in Figure 2.

In the case of the perturbation space, the system has to work fault-free at least in the union Π_A ∪ Π_B, as shown in Figure 3. Normally, this region is larger than the union of both tasks' perturbation spaces.

Figure 2: Property or performance space, Φ_S, of task A and task B, Φ_A ∩ Φ_B (shown in blue) [Ref: [9]]

Figure 3: Perturbation space, Π_S, where the system has to work properly for two tasks A and B, Π_A ∪ Π_B (shown in blue) [Ref: [9]]


A system can be called robust if all the points of the set Π_S in the perturbation space can be mapped to points in the performance space within the area Φ_S (Figure 4). The margin between Φ and Φ_S can be used for a quantitative measurement of the robustness, which is explained in the next section.

Figure 4: A robust system where all specified points in the perturbation space are transformed into points within the specified area Φ. [Ref: [9]]

2.2. Measuring robustness

Robustness in this project is measured as a probability. A circuit would work for all operating conditions and be ideally robust if it had a robustness value of 1. The advantage of calculating robustness as a probability is that a comparison between completely different systems becomes possible. In this project, the logic for calculating robustness as a probability was already coded in DichotomyCalculation.cpp.

The function probs( ) is called, which returns a pair of double values that is assigned to 'workingProb'. The final robustness value is then computed by calling the function robustness( ) with 'workingProb.first, workingProb.second' as its input.

Code 1: The functions probs( ) and robustness( double tmpProb, double specProb ) are defined in RobustnessCalculation.cpp

std::pair< double, double > workingProb = probs( );
double robustnessprobValue = robustness( workingProb.first, workingProb.second );


2.3. Adaptation to different applications

The following section gives an overview of how the robustness model and the way of measuring robustness can be adapted to different applications like the ageing analysis.

Over time, the circuit performance degrades due to effects like NBTI and HCI, which influences the robustness of the whole system. To evaluate this decrease in performance, static timing analysis (STA) with a gate model is needed to compute gate delays. The gate model provides a delay for rising and falling input transitions for each timing arc. In today's chips, a unique critical path is found, and the gates on this path get aged due to technology scaling.

Code 2: This returns a Boolean value which determines whether it is the most critical path of a circuit or not. 'path' holds the list of IDs of either nodes or edges in the critical path

bool getCriticalPath( std::list< std::pair< unsigned int, bool > > *path );

Code 3: This prints the most critical path of the circuit

void printCriticalPath( std::list< std::pair< unsigned int, bool > > path, bool rising );

2.3.1. Timing Analysis and Timing Graph (TG)

Timing analysis is required for many different steps during the design process of digital circuits. Timing analysis (TA) determines the maximum clock frequency a circuit can operate at. Static timing analysis (STA) determines the path which has the longest path delay and the latest arrival time (Figure 5), from which the circuit delay is calculated and the setup time constraints are verified.


Figure 5: Computation of arrival time (AT). Here, 'a' and 'b' are inputs of a gate and 'c' is the output. The AT of 'a' is 9 and of 'b' is 10. The delay from 'a' to 'c' and from 'b' to 'c' is 4 and 2 respectively. Hence, the AT at output 'c' is always the maximum over its inputs' arrival times plus delays; in this case it is 13 (9+4 > 10+2). [Ref: [10]]
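To make this propagation rule concrete, here is a minimal sketch (not the thesis TG implementation; the data layout is an assumption for illustration):

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Arrival-time propagation for one node, assuming its fan-in is given
// as (input AT, edge delay) pairs.
double computeAT(const std::vector<std::pair<double, double> >& fanin) {
    double at = 0.0;
    for (size_t i = 0; i < fanin.size(); i++)
        at = std::max(at, fanin[i].first + fanin[i].second); // AT(c) = max(AT(i) + d(i,c))
    return at;
}

int main() {
    // The example from Figure 5: a (AT 9, delay 4) and b (AT 10, delay 2).
    std::vector<std::pair<double, double> > fanin;
    fanin.push_back(std::make_pair(9.0, 4.0));
    fanin.push_back(std::make_pair(10.0, 2.0));
    printf("AT(c) = %.0f\n", computeAT(fanin)); // prints 13
    return 0;
}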

Code 4: This returns a pair of rising and falling timings for all the sink nodes in the timing graph

std::pair <sslv, sslv> getSinkArrivalTime( );

Code 5: This updates the arrival time of all the nodes in the timing graph by iterating over each node

void updateAT( );

The implemented STA uses a timing graph (TG). This includes the nodes (gate inputs and outputs) and the edges. Edges can be of two types: edges from gate inputs to their outputs, which carry the gate delays for that particular timing arc, and edges from a gate output to another gate input, which carry the delays caused by the interconnect network. Gate delays are more important than interconnect delays and should be the main consideration. Every TG edge has two edge weights because of the rise and fall delays.

Code 6: This creates an edge in the timing graph

unsigned int createEdge( std::string name, OpenAccess_4::oaOccNet *oaNet, unsigned int start_node, unsigned int end_node, std::string pin1, std::string pin2 );

Code 7: This creates a node in the timing graph

unsigned int createNode( std::string type, std::string name, OpenAccess_4::oaOccInst *oaInst );


A source node and a sink node are added to the TG as well. All the primary inputs and primary outputs are connected to the source and sink nodes respectively.

Code 8: This creates a Sink Node in the TG

unsigned int createSinkNode( std::string name, OpenAccess_4::oaOccTerm *oaTerm );

Code 9: This creates a Source Node in the TG

unsigned int createSourceNode( std::string name, OpenAccess_4::oaOccTerm *oaTerm );

Besides ageing, the circuits have to deal with other variability too:

- Variations of the perturbations or operating conditions, i.e. changes in the supply voltage VDD and the operating temperature T
- The performance space is reduced to the delay/frequency of the circuit

2.4. Calculating the area

Robustness is determined in the perturbation space by effectively sampling the area in which the circuit still works properly. This sampling uses the ageing analysis mentioned above for the transformation to the performance space. Dichotomy is the method implemented in this project to calculate this particular area.

Figure 6: Perturbation and performance space being mapped to VDD, T and f respectively. [Ref: [9]]


The dichotomy calculation first finds the four corners of the specification and checks if any one of them is violated. This happens only once, hence it is not important to parallelize these functions. When the function calculateRobustness( ) is called in the main program robustness.cpp through the pointer dichotomy*, the program jumps to DichotomyCalculation.cpp and starts executing it. The program runs in the following way:

1. 'specViolated' is declared as a pair of a Boolean and an int value, depending on the function verifySpec( ).

2. verifySpec( ) calls the function verifyPoint( ).

3. These two functions work together and return a pair of Boolean and integer values, which is assigned to 'specViolated'. Depending on this pair, it is determined whether any of the specification points were violated.

4. Assuming that none of the four specification corners were violated (the normal case), iteration starts from these corners. From each corner, it iterates in two directions: one along the temperature axis ('T') and another along the voltage axis ('V'). See Figure 7. In the existing code, this happened one after the other. My aim was to parallelize this iterate_d( ) function so that all eight iterate_d( ) calls would run in parallel, calculating the 'validPoints' concurrently. This is where OpenMP comes into use.


Figure 7: Finding valid points starting from the 4 specification corners by Dichotomy calculation

5. The function iterate_d( ) returns a pair of double values, which is assigned to 'temp'. Since eight iterations take place, I ended up with eight values, temp1 to temp8.

6. These eight values are then stored in a vector named 'm_validPoints'. This vector holds eight pairs of double values, one value representing the voltage and the other the temperature. The order of these temp values is very important. The first iteration has to start at the point (m_tempReq.first, m_voltReq.first), iterate till the boundary (m_tempReq.first, m_tempLimit.first), and store its calculated validPoint in m_validPoints[0]. The result of the 2nd iteration is stored in m_validPoints[1], of the 3rd iteration in m_validPoints[2], and so on. See Figure 8.


Figure 8: A pair of double temp stores a point depending on a voltage and a temperature value. This gives the boundary of the robust region, which is important for the calculation of robustness as a probability.

7. The vector m_validPoints is then passed to the function probs( ), which computes a pair of double values, 'workingProb', by iterating through all eight points in the vector.

8. The final robustness value is then computed by calling the function robustness( ) with '(workingProb.first, workingProb.second)' as its input. (See Section 2.2, Code 1)
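The dichotomy search inside iterate_d( ) can be sketched for a single direction as follows. This is a hedged sketch based on the description above, not the thesis code: the real iterate_d( ) signature differs, and the toy verifyPoint( ) below simply assumes the circuit works below 100 degrees.

#include <cmath>
#include <cstdio>
#include <utility>

typedef std::pair<double, double> doublePair;

// Toy stand-in for RobustnessCalculation::verifyPoint() (illustration only).
bool verifyPoint(double temp, double /*voltage*/) { return temp < 100.0; }

// Bisect along the temperature axis between a known-valid corner and the
// limit until the remaining interval is smaller than intervalLimit,
// keeping the last valid temperature as the boundary point.
doublePair iterateTemperature(doublePair start, double tempLimit,
                              double intervalLimit) {
    double valid = start.first;   // known-valid temperature (corner)
    double invalid = tempLimit;   // other end of the search range
    while (std::fabs(invalid - valid) > intervalLimit) {
        double mid = 0.5 * (valid + invalid);
        if (verifyPoint(mid, start.second))
            valid = mid;          // mid is still inside the robust region
        else
            invalid = mid;        // mid violates the spec: shrink the range
    }
    return doublePair(valid, start.second);
}

int main() {
    doublePair p = iterateTemperature(doublePair(25.0, 1.1), 125.0, 0.5);
    printf("valid point: T = %.2f, V = %.2f\n", p.first, p.second);
    return 0;
}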

Recall step 4 above, where I mentioned parallelizing the function iterate_d( ). In the next chapter, parallel computing with OpenMP and its directives are discussed.


3. Parallel programming with OpenMP

In this chapter, the benefits of parallel computing are discussed and the approach taken in OpenMP to support the development of parallel applications is described.

3.1. Overview

These days, programs are mostly written in a way that they execute in series and run on a single processing unit. A function is broken into a discrete series of other functions or instructions, and each instruction is executed one after another; just one instruction executes at a time.

The function iterate_d( ) calls verifyPoint( ) twice and then verifyTolerance( ). This is done serially, as shown in Figure 9.

Figure 9: Serial computing [Ref: [5]]

In simple words, parallel computing uses multiple processors simultaneously to solve computational problems. A function is broken into discrete parts which are executed concurrently, and each part is then broken down into a series of instructions. Instructions from each part execute simultaneously on different processors, depending on the number of threads used.

Parallel computing utilizes all the available resources, which can be multiple processor cores, any number of computers connected within a network, or both.

Some advantages of parallel computing are:

- It provides concurrency
- Multiple program instructions are executable at any moment in time
- By utilizing more resources, a task is completed faster, hence saving time and, in some cases, maybe money too
- Larger, complex computations or problems become solvable

Parallel computing is shown in Figure 10. Here, the large function iterate_d( ) is broken into smaller discrete parts which are executed concurrently. Instructions from each part execute simultaneously on different processors depending on the number of threads; in this figure, two threads are used.

Figure 10: Parallel computing [Ref: [5]]

3.2. Architecture and parallel programming models

3.2.1. Architecture

There are four main components in computers: memory, control unit, arithmetic logic unit and input/output.


Figure 11: The four main components of a computer [Ref: [5]]

1. The data and program instructions are both stored in RAM or read/write memory. The instructions (iterate_d) are coded data which command the computer to do a certain task, while the data (tempReq and voltReq) is just some information or set of values taken in by the program.

2. The function of the control unit is to fetch instructions or data from this memory, decode them and sequentially coordinate operations to accomplish the programmed task.

3. The ALU does the basic math operations which are needed in the program.

4. Input/output allows the user to interact with the computer.

Parallel computers still follow this basic design, just multiplied in units; the fundamental architecture remains the same. They are characterized based on their data and instruction streams, forming various types of computer organizations: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data) and MIMD (multiple instruction, multiple data). Data and instructions are read or fetched from memory, and data is stored or written to memory.

In my case, the computer operates on single instruction, multiple data (SIMD). This is a type of parallel computer where all the processor cores execute the same instruction at any given time, but each processor takes different data values as input to execute that instruction.

3.2.2. Parallel programming models

Several parallel programming models exist:

- Shared memory
- Distributed memory
- Threads model
- Data parallel
- Single program, multiple data (SPMD)
- Multiple program, multiple data (MPMD)

Amongst these models, shared and distributed memory, the threads model and SPMD will be studied in detail.

3.2.2.1. Shared and distributed memory

Shared memory: In shared memory systems, each processor can access all the memory through one global address space (see Figure 12). Access to this global memory space is controlled by various mechanisms. Data can be shared or private: shared data is accessible by all the threads, whereas private data can be accessed only by the thread that owns it. Developing a program on this model has advantages:

- There is no need to explicitly specify the communication of data between tasks
- Synchronization takes place, but it is implicit
- It is easier to modify existing serial codes into parallel ones

This model allows multiple processors to operate individually while having access to all the memory resources. The programmer must make sure that writing to global memory is handled correctly, especially when a variable is both read and written in calculations.

The disadvantage of shared memory is that when multiple cores happen to access the same location in the global memory space, a bottleneck can occur which ultimately slows down or hangs the whole program.

Figure 12: Shared memory configuration [Ref: [5]]

Distributed memory: In distributed memory systems, each processor has its own local memory, and the processors are connected to each other through different types of networks. Because of this, the memory of one processor cannot be mapped to another; hence each processor operates independently. The advantages of this configuration are:

- The size of the available memory increases every time more processors are utilized
- Processors need not communicate with each other, as each of them can quickly access its own memory, saving the overhead time

The disadvantage of distributed memory is that the programmer needs to explicitly pass all the data to a processor when it needs to do a task. That requires specific coding, which can be very tedious. It is also quite difficult to write a distributed memory parallel program, and an existing program cannot easily be made parallel, as a lot of re-coding is required.

Figure 13: Distributed memory configuration

3.2.2.2. Threads model

This model is a type of shared memory parallel programming where a single process can have multiple, concurrent execution paths. To implement the threads model, the programmer needs a set of compiler directives and runtime library routines that are called within the parallel code. It is the programmer's responsibility to select the parts of code which need to be parallelized.

The execution of a function by the threads is not always in order: any thread can execute the function as soon as it is free from its previous task. The amount of time taken by each thread for a task is also not fixed. Threads need to communicate with each other as well; they need to synchronize to make sure that they are not updating the same memory address in the global memory space at the same time.

I used OpenMP to create and implement threads (see Section 3.3).


Figure 14: Threads model, where tasks (the function iterate_d( )) can be distributed to the different threads

3.2.2.3. Single program, multiple data

SPMD is a high-level programming model and a subset of SIMD (see Section 3.2.1).

Single program: The same program is executed at the same time on multiple data streams, with each task running its own copy of it.

Multiple data: The tasks take different input data.

3.3. OpenMP

OpenMP (Open Multi-Processing) is a standard application programming interface (API) for writing shared memory parallel applications in C, C++ or Fortran. This parallelization tool was chosen over the others because OpenMP makes it very easy to parallelize existing serial codes, which in this project were written in C++. With careful planning in advance of which parts of the code should be parallelized, and by using the OpenMP directives, clauses and library subroutines, the work of parallel computing gets easier. OpenMP also has the advantage of being widely used, highly portable and ideally suited for multi-core processor machines. Multiple cores are naturally harder to coordinate than a single core: the algorithms are trickier and the programs are more complex.


OpenMP allows multi-threaded, explicit parallel programming. A combination of several threads helps to execute code faster by running it in parallel. The threads provide fast switching as they share the workload. There is communication between threads, and hence synchronization is important too. A core allows only one thread to run at a time; a core that allows multi-threading has additional hardware to switch between threads in very little time, with very little overhead.

A parallel region is basically a block of code executed by all the threads simultaneously. Each thread has a thread ID, which can be determined by calling a special OpenMP function. An OpenMP team always has one master thread and several workers. The OpenMP program starts with this master thread, which has thread ID 0. OpenMP operates on the fork-join model, which is shown in Figure 15.

Figure 15: OpenMP uses Fork-Join model for parallel execution

When the first parallel region is encountered, which is created by #pragma omp parallel, a team of threads is forked to carry out the work within that parallel region. The number of threads created depends on the runtime routine omp_set_num_threads. The code within the parallel region is then executed by the threads in no fixed order, depending on how long each thread takes to finish its previous task; the code runs with a random thread order and time slicing. When the end of the parallel region is reached, the threads synchronize and re-join into the master thread, which executes the rest of the code serially.
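As a minimal illustration of the fork-join model (a sketch, not part of the thesis code; compiled with a flag such as -fopenmp):

#include <omp.h>
#include <cstdio>

int main() {
    omp_set_num_threads(4);        // request a team of four threads
    #pragma omp parallel           // fork: the team executes this block
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                              // join: implicit barrier at the end
    printf("Back to serial execution in the master thread\n");
    return 0;
}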

Parallelization using OpenMP is specified by its components, which are:

1. Directives and their clauses

2. Runtime library routines

3. Environment variables

Figure 16: These three pieces are typically used to examine and modify the parallel execution parameters. Taken together, they define what is called an API (application programming interface). [Ref: [2]]

3.3.1. OpenMP executable directives and their clauses

1. #pragma omp parallel: The parallel construct starts parallel execution by creating a team of threads (see Figure 15).

2. #pragma omp sections: The sections construct contains a set of individual code blocks that are distributed over the threads and executed. Each section directive must be nested within a SECTIONS / END SECTIONS directive pair. A thread may execute more than one section if it is quick enough; see Figure 17. There is no guarantee in which order the sections will run, and it is the programmer's responsibility to make them work in order. I discuss this later in Section 4.5, point 2, as I encountered this very problem while modifying the existing code with this particular directive.


Figure 17: The “sections” directive
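A hedged sketch of the sections construct (illustrative only; taskA( ) and taskB( ) are hypothetical stand-ins for independent code blocks):

#include <omp.h>
#include <cstdio>

void taskA() { printf("task A on thread %d\n", omp_get_thread_num()); }
void taskB() { printf("task B on thread %d\n", omp_get_thread_num()); }

int main() {
    #pragma omp parallel
    {
        #pragma omp sections   // distribute the blocks over the team
        {
            #pragma omp section
            taskA();
            #pragma omp section
            taskB();           // may run concurrently with taskA()
        }                      // implicit barrier at the end of sections
    }
    return 0;
}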

3. #pragma omp single: The single construct serializes a section of code. It specifies that the associated structured block is executed by just one thread, not necessarily the master thread. It is useful when dealing with print statements: the I/O operations can be enclosed by a single directive, so that any thread that is not busy can perform the I/O operation, while the other threads skip the single construct and move on with the code.

Note: This directive was initially used in my modified parallel program just for the print statements. But I decided to print the values calculated by all the threads, since the threads execute the same program with different data values. Hence, this directive ultimately was not used in my program.

4. #pragma omp barrier: The barrier construct synchronizes all the threads in the team; it is like a pause operation. Upon encountering this directive, each thread waits for the other threads to arrive at this point. Only when all the individual threads have finished their computation and arrived at this point do they continue executing the operations behind the barrier.

5. #pragma omp critical: The critical construct restricts the associated structured block to being executed by only one thread at a time. The first thread to reach this directive gets to execute the code. Only when the current thread exits the critical region can other threads execute the code. This means that a value can be updated only once at a time.
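For illustration, a minimal sketch of critical protecting a shared counter (not thesis code):

#include <omp.h>
#include <cstdio>

int main() {
    int counter = 0;                    // shared by all threads
    #pragma omp parallel num_threads(4)
    {
        #pragma omp critical            // one thread at a time updates counter
        counter++;                      // without critical this would be a race
    }
    printf("counter = %d\n", counter);  // prints 4
    return 0;
}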

Most of the OpenMP directives support clauses. These clauses, together with the directive, specify additional information determining the execution context, mostly about the variables used and their values within the parallel code region. If a variable is visible in the parallel code and is not mentioned in any data-sharing attribute clause, then the variable is considered to be shared. The clauses I used are:

1. private(list) clause: This private clause was used with the #pragma omp parallel construct. It requires each thread to create a private instance of a specified variable. These specified variables are therefore private to each thread and not accessible by the other threads. Their values are lost at the end of the block.

2. shared(list) clause: I used shared with the #pragma omp parallel construct as well. This clause specifies that the named variable is shared and accessible by all the threads in the team. Its value persists after the end of the block.

This is illustrated in Figure 18.

Figure 18: Screenshot of my OpenMP code used in the program, just for determining the number of threads that are actually used
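A minimal sketch of both clauses together (illustrative only, close in spirit to the code in Figure 18 but not a copy of it):

#include <omp.h>
#include <cstdio>

int main() {
    int tid;                         // each thread gets its own private copy
    int nthreads = 0;                // one shared instance, visible to all
    #pragma omp parallel private(tid) shared(nthreads)
    {
        tid = omp_get_thread_num();  // writes only the thread's private copy
        if (tid == 0)                // let the master record the team size
            nthreads = omp_get_num_threads();
        printf("thread %d running\n", tid);
    }
    // The private tid values are gone here; the shared nthreads persists.
    printf("%d threads were used\n", nthreads);
    return 0;
}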


3.3.2. OpenMP runtime library routines

OpenMP provides several execution environment routines that help the programmer to manage the parallel program and to monitor and affect the threads, processors and parallel environment. The execution environment routines I used in my project code are:

1. void omp_set_num_threads(int num_threads): Sets the number of threads used for the parallel regions. The programmer can specify the number of parallel threads in two different ways: either through the omp_set_num_threads() runtime library routine, or through the OMP_NUM_THREADS environment variable.

2. int omp_get_num_threads(void): Returns the number of threads in the current team of the parallel region.

3. int omp_get_thread_num(void): Returns the unique thread ID of the encountering thread (thread numbering begins at 0 and ends at num_threads minus 1).

All these routines are declared in the standard OpenMP include file, omp.h. Besides the execution environment routines, OpenMP also provides lock routines for synchronization and timing routines that support a portable wall clock timer.

3.3.3. Environment variables

OpenMP environment variables are all upper case. These variables are read at program start-up, and any later modification to their values is ignored. They control the execution of the parallel code.

I will mention more about environment variables in Section 6.1.
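As a quick illustration (a minimal sketch; the reported value assumes no overriding omp_set_num_threads() call or num_threads clause), a variable like OMP_NUM_THREADS set in the environment before start-up determines the default team size:

#include <omp.h>
#include <cstdio>

// Run with e.g. OMP_NUM_THREADS=8 set in the environment: that value
// determines the default team size reported here.
int main() {
    printf("default team size: %d\n", omp_get_max_threads());
    return 0;
}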

3.3.4. Summary

OpenMP is a shared memory model: the threads communicate by sharing variables. Unintended sharing of data causes race conditions. These happen when the threads are scheduled differently and access the same variable concurrently, which changes the program's outcome. To control this, synchronization between the threads is important, which is done by using synchronization constructs like barrier, critical, single, etc.


4. Implementation of Parallel Program and OpenMP in Robustness

This section discusses how parts of the existing robustness.cpp, DichotomyCalculation.cpp and RobustnessCalculation.cpp programs were made parallel, and the challenges faced while implementing this. For the efficient parallelization of the robustness evaluation, I modified and dealt with several existing serial programs:

1. robustness.cpp

2. DichotomyCalculation.cpp

3. DichotomyCalculation.h

4. RobustnessCalculation.cpp

5. RobustnessCalculation.h

The main program of my project is robustness.cpp. There are several pointers in this program that point to various functions and objects of other classes. I was mainly dealing with:

TG *tg;
EDARobust::RobustnessCalculation *calc;
EDARobust::DichotomyCalculation *dichotomy;

4.1. The existing serial program: robustness.cpp

A. The main program, robustness.cpp, starts by loading the following:

1. XML timing library: void loadTimingLib(std::string libname);
   where libname is the XML file name from which the timing library is loaded.

2. Circuit design: void loadOA(std::string libname, std::string cellname, std::string viewname);
   where libname is the Open Access library name, cellname is the Open Access cell name and viewname is the Open Access view name.

3. XML constraints library: void loadConstraintsLib(std::string libname);
   where libname is the XML file name from which the timing constraints are loaded.

These three functions belong to the class TG. The pointer *tg has the structure of this class and is later allocated enough memory for storing functions or objects from this particular class. *tg was then used to point to and call these three functions from the class TG.


The actual code:

Figure 19: Declaring one *tg and then using it to call different functions, all from the same class, TG

B. Once the three files are loaded, a usage profile for the ageing analysis is selected:

void set_useProfile(useProfile prof) {
    use_profile_ = prof;
}

This is called through the same pointer *tg, as the function belongs to class TG. For example, this profile sets AgingAnalysis = true, NBTI = true, HCI = true, the clock frequency to 10 GHz, Vdd to 1.32 V, the temperature to 125 deg C, and the age to 1 year * 365 days * 24 hours (see Figure 20).

C. Next, the pointer *tg points to the functions that do the following:

1. Return all the source nodes in the timing graph with their corresponding IDs:

std::map<unsigned int, Node*> getSourceNodes() {
    return source_nodes_;
}

The unsigned int specifies the corresponding ID. The function returns a map (id -> pointer) with all source nodes the timing graph contains. This map is stored in "sourcenodes".


2. Return all the sink nodes in the timing graph with their corresponding IDs:

std::map<unsigned int, Node*> getSinkNodes() {
    return sink_nodes_;
}

The unsigned int specifies the corresponding ID. The function returns a map (id -> pointer) with all sink nodes the timing graph contains. This map is stored in "sinknodes".

3. Return all the nodes (including sink and source nodes) in the timing graph with their corresponding IDs:

std::map<unsigned int, Node*> getNodes() {
    return nodes_;
}

The unsigned int specifies the corresponding ID. The function returns a map (id -> pointer) with all the nodes the timing graph contains, including the source and sink nodes. This map is stored in "nodes", as can be seen in the last three lines of Figure 20.

Figure 20: Code where the use_profile and the maps of all the nodes of the circuit are created

D. Next, the signal probability (SP) and transition density (TD) are set at the inputs. After setting the profile specification, the workload WL (i.e. the time a transistor is under stress due to the impacts of NBTI and HCI) is determined at the gate inputs by the factors SP and TD respectively. Indirectly, the ageing analysis and the performance of an aged gate model are also determined by these factors. SP and TD are given as probabilities. They are found by iterating through all the sourcenodes. If the ageing analysis of the profile is set to true, these SP and TD values are updated.

E. General setup parameters are then declared, like the minimum and maximum temperature, minimum and maximum voltage, minimum and maximum frequency, age, stepSize, NBTI, HCI, etc.

F. Next, depending on the user's selection of dimensions (2D or 3D) and the area calculation method (dichotomy, stopsign or gradient), robustness is calculated. Most parts of the code that were modified and parallelized are in the function calculateRobustness( ), which is defined in DichotomyCalculation.cpp and RobustnessCalculation.cpp.

4.2. The existing serial programs: DichotomyCalculation.cpp and RobustnessCalculation.cpp

The main function calculateRobustness( ) is called in DichotomyCalculation.cpp. This function calls many other functions which are defined in RobustnessCalculation.cpp. These two programs execute together and calculate the robustness value of a given circuit at a specific time. The main functions I dealt with and modified were:

1. bool RobustnessCalculation::verifyPoint(double temp, double voltage);

2. void RobustnessCalculation::iterate_d(doublePair start, doublePair boundary, char alteratedParam, double intervalLimit, bool negativeFlag);

3. bool RobustnessCalculation::verifyTolerance(double temp, double voltage, double tolerance);

There is another function, verifySpec( ), which is called to verify whether any of the specifications are outside the robust region; see Figure 21. This function then calls verifyPoint( ), which returns a Boolean value. Based on that, verifySpec( ) returns a pair of bool and int values and assigns it to 'specViolated', as explained in Figure 22. The area is then calculated, depending on the 'specViolated' value.


Figure 21: The four corners of specifications which need to be verified by the function verifySpec( ) only once; hence parallelizing them makes little difference

Figure 22: Depending on ‘specViolated’, area calculation takes place by function called iterate_d( )

Under normal conditions, none of the four corners is violated. Based on that, the area is then calculated, either by the dichotomy, stopSign or gradient method. Each of these functions calls iterate( ). In my case, the area is always calculated by the dichotomy method, which calls the function iterate_d( ).

The function iterate_d( ) is one of the most important functions in calculating robustness. Iteration begins from the four specification corners, and each iteration calculates a valid point. Each corner iterates in two directions, temperature and voltage; see Figure 23. All eight iterations take place one after another, which is very time consuming.

Figure 23: Iteration taking place in two directions from each corner. This is important for the calculation of the area of the robust region.

This function calls verifyPoint( ) and verifyTolerance( ), which use *tg to access the useProfile, updateArrivalTime( ) and getSinkArrivalTime( ). These values are different and updated for each iteration.

After the eight iterations are completed, there are eight valid points which represent the outline of the robust region. These eight points are stored in a vector called 'm_validPoints'. The function probs( ) then iterates through this vector 'm_validPoints' and returns a pair of double values. These values are passed to the function robustness( ), which returns a double value, 'robustnessprobValue'. This is the final robustness value.

4.3. Ideology

My main aim was to speed up the process of calculating robustness, especially for larger circuits like c1355_i89, c1908_i89 and c3540_i89. This is only possible if the robust region is calculated faster. The robust region is defined by the eight valid points (see Figure 24) and their vector 'm_validPoints'. In order to achieve this, I had to run the iterate_d( ) function in parallel. This lets me compute all eight valid points concurrently, saving a lot of time.


Figure 24: Eight valid points which define the area and are needed for calculating the robustness value.

If I parallelize the function iterate_d( ) by using the threads model (see Section 3.2.2.2), the multiple worker threads will also execute functions such as verifyPoint( ) and verifyTolerance( ) simultaneously. Hence, I need to provide a private copy of tg to each of these threads, as the values tg holds will be different, based on the useProfile, updateArrivalTime( ) and getSinkArrivalTime( ).

Hence, the number of copies of tg depends on the number of threads in the parallel region (Ref: 1). So, depending on NUM_THREADS, I create that many copies of tg and store them all in a vector of class TG called 'TGVec'.

Each instance of tg can be accessed by TGVec[0], TGVec[1] and so on, where 0, 1, ... are the thread IDs (Ref: 33). The thread ID can be obtained by the function omp_get_thread_num( ). This means tg0 is accessed by TGVec[thread ID = 0]. See Figure 25.

Figure 25: Four copies of tg are created since we have four threads running in parallel. Each tg can call functions like updateArrivalTime( ) by: m_TGVec[omp_get_thread_num( )] -> updateAT( )
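The one-object-copy-per-thread pattern can be illustrated with a small runnable toy (a hedged sketch; the Worker class is a hypothetical stand-in for TG, not the thesis code):

#include <omp.h>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the TG class: each thread gets its own
// instance, so concurrent updates cannot interfere with each other.
struct Worker {
    int state;
    Worker() : state(0) {}
    void update(int v) { state = v; }
};

int main() {
    const int NUM_THREADS = 4;
    omp_set_num_threads(NUM_THREADS);

    std::vector<Worker*> workers;        // analogous to TGVec
    for (int j = 0; j < NUM_THREADS; j++)
        workers.push_back(new Worker);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        workers[tid]->update(tid * 10);  // each thread uses only its own copy
    }

    for (int j = 0; j < NUM_THREADS; j++) {
        printf("worker %d state = %d\n", j, workers[j]->state);
        delete workers[j];
    }
    return 0;
}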


4.4. Changes made in existing code

1. Instead of declaring just one tg, I now create several copies of tg, depending on the number of threads, NUM_THREADS. I then store them in a vector 'TGVec'.

std::vector<TG*> TGVec;
for ( int j = 0 ; j < NUM_THREADS ; j++ ) {
    TG *tg;
    tg = new TG;
    TGVec.push_back(tg);
}

2. Since there are several copies of tg now, I need to point every tg to the functions:

- TGVec[i] -> set_useProfile(prof1);
- TGVec[i] -> getSourceNodes();
- TGVec[i] -> getSinkNodes();
- TGVec[i] -> getNodes();

for ( int i = 0 ; i < TGVec.size() ; i++ ) {
    TGVec[i] -> set_useProfile(prof1);
}

3. Existing code:

useProfile *oldProf = m_timer -> get_useProfile();
useProfile newProf;

m_timer was declared as 'static TG *m_timer;'. But since this function executes in parallel in verifyPoint( ) and verifyTolerance( ), I need to create multiple copies of 'm_timer', i.e. of the tgs.

Modified code:

useProfile *oldProf = m_TGVec[omp_get_thread_num()] -> get_useProfile();


Refer to Figure 25.

4. Existing code:

startPoint = doublePair (m_tempReq.first, m_voltReq.first);
boundary = doublePair (m_tempReq.first, m_tempLimit.first);
iterate_d (startPoint, boundary, 't', m_intervalLimit);

The function iterate_d( ) is no longer a void function. It now returns a pair of double values, 'start.first' and 'start.second'. After iterating eight times in parallel sections (see Section 3.3.1, points 1 and 2) and hitting the barrier directive, each thread waits for the rest to finish their computation (see Section 3.3.1, point 4). The eight values are then stored in the vector 'm_validPoints'. It is important to maintain the order of the iteration values, starting from 'voltReq.first', 'tempReq.first' and iterating with respect to the temperature axis first.

Modified code:

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel              // Parallel region begins here
    {
        #pragma omp sections          // Code is distributed amongst the threads
        {
            #pragma omp section
            {
                startPoint = doublePair (m_tempReq.first, m_voltReq.first);
                boundary = doublePair (m_tempReq.first, m_tempLimit.first);
                temp1 = iterate_d (startPoint, boundary, 't', m_intervalLimit);
            }
            // ... seven more sections, one per valid point ...
        }
        #pragma omp barrier           // All threads wait here for each other
    }                                 // Parallel region ends here

    m_validPoints.push_back (doublePair (temp1.first, temp1.second));


Figure 26: The values ‘temp1’ to ‘temp5’ are pushed back into the vector outside the parallel region. The first value pushed back is temp1 and the last is temp5. It is very important to maintain this order.
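
As a compilable miniature of the whole pattern (only two of the eight sections are shown, iterate_d( ) is replaced by a stub, and the numeric start and boundary values are illustrative placeholders, not values from the thesis):

    #include <omp.h>
    #include <utility>
    #include <vector>

    typedef std::pair<double, double> doublePair;

    // Stub standing in for the thesis function; returns the point it reached.
    doublePair iterate_d(doublePair start, doublePair bound, char axis, double lim) {
        (void) bound; (void) axis; (void) lim;
        return start;
    }

    int main() {
        std::vector<doublePair> m_validPoints;
        doublePair temp1, temp2;  // ... up to temp8 in the real code

        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section
                { temp1 = iterate_d(doublePair(25.0, 1.1), doublePair(25.0, 125.0), 't', 0.01); }
                #pragma omp section
                { temp2 = iterate_d(doublePair(25.0, 1.1), doublePair(1.1, 0.9), 'v', 0.01); }
                // ... six more sections, one per valid point ...
            }
        }  // implicit barrier: all temp values are ready here

        // Push back in a fixed order so the outline of the robust region is kept.
        m_validPoints.push_back(temp1);
        m_validPoints.push_back(temp2);
        // ... temp3 to temp8 in the same manner ...
        return 0;
    }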

5. To analyze the results and compare the time taken for calculating robustness with different numbers of threads, I used an OpenMP timing routine:

    double omp_get_wtime(void);

It returns the elapsed real (wall-clock) time in seconds for any computation or function; in this example, calculateRobustness( ).

    double OmpStart = omp_get_wtime();
    { ... calculateRobustness ( ) ... }
    double OmpEnd = omp_get_wtime();

    std::cout << "Robustness at " << age << " years calculated in ";
    std::cout << (OmpEnd - OmpStart) << " OpenMP real time seconds!";
    std::cout << std::endl;


6. The number of threads for parallel computing could simply be hard-coded, e.g. ‘#define NUM_THREADS 2’ (or any other integer value such as four or eight). Instead of hard-coding it, I declared it as a command-line argument: the user can now set any number of threads and the program runs accordingly. The number of threads is set with --numthreads <int> on the command line. The advantage is that the user does not need to recompile the program every time.

Added code:

    TCLAP::ValueArg<int> numThreadsArg("", "numthreads",
        "Override Number of threads usage", false, 1, "int", cmd);

If nothing is declared on the command line, the number of threads defaults to 1.
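
A self-contained sketch of how the pieces fit together (the surrounding main( ) and the version string are illustrative, not the thesis code):

    #include <tclap/CmdLine.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        TCLAP::CmdLine cmd("Robustness", ' ', "1.0");
        TCLAP::ValueArg<int> numThreadsArg("", "numthreads",
            "Override Number of threads usage", false, 1, "int", cmd);
        cmd.parse(argc, argv);

        int NUM_THREADS = numThreadsArg.getValue();  // 1 if the option is absent
        omp_set_num_threads(NUM_THREADS);
        // ... calculateRobustness( ) ...
        return 0;
    }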

4.5. Difficulties faced

1. Running an OpenMP “Hello World” program in parallel was quite a challenge at the start. OpenMP needs to be enabled with compiler flags such as -fopenmp or -openmp, depending on the compiler used (e.g. GNU or Intel). In this project, all compiler flags are declared in a CMake file called CMakeLists.txt. I searched the internet [8] for the CMake compiler flags and added the few lines of code to this particular file. See Figure 27.

Figure 27: CMakeLists.txt which includes all the different compiler flags
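
The added lines were of the following kind; a minimal CMake sketch (the exact flags in the project’s CMakeLists.txt may differ):

    find_package(OpenMP)
    if(OPENMP_FOUND)
        # Append the OpenMP flags (e.g. -fopenmp for GNU) to the compiler flags
        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
    endif()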


2. The main idea was to speed up the eight iterations, and parallelizing the function iterate_d( ) with #pragma omp parallel and #pragma omp sections was clearly the way to do it. Since the iteration function was previously void, it filled the vector ‘m_validPoints’ with the pairs of double ‘temp’ values (the valid points) itself. If I simply parallelized this function, the order in which the ‘temp’ values are pushed into the vector would become totally random. See Figure 28.

Figure 28: The area of the robust region cannot be determined by simply parallelizing the iterate_d( ) function. As the threads execute the code in the sections in a totally random order, the order of the valid points is violated too.

To overcome this problem, I modified the function iterate_d( ). It now returns the eight ‘temp’ values, computed in parallel and, of course, in a random order. After the parallel section of code has ended, these eight values are pushed back in the correct order, from ‘temp1’ through ‘temp8’. See Figure 8 and Figure 24 for a better understanding.


5. Evaluation and Results

The program is run with multiple threads, as chosen by the user. This is set by the command-line option --numthreads x, where x is the number of threads (see Section 4.4, point 6), and is made possible by the OpenMP interface. The OpenMP include file for the library routines is mandatory:

    #include <omp.h>

These multiple threads execute the program by sharing the work amongst themselves. Hence, the computation time for calculating robustness depends on the number of threads used. For example, if a Rechner machine takes 36 seconds to compute robustness with a single thread, it takes around 9 seconds with eight threads running in parallel for the same circuit, a 4x speed-up. Overall, the time spent computing robustness is reduced by a factor of roughly 3.2 to 4.3, depending on the size of the circuit: the larger the circuit, the greater the speed-up. Compare Table 3, Table 5 and Figure 34.

These results were observed after running three large circuits several times each on the machines Rein, Rechner1, Rechner2 and Rechner3. The program ‘Robustness’ now utilizes all available resources and runs on all processor cores.

                 Rein (2 core processors)    Rechner (8 core processors)
    1 thread     90% – 100%                  90% – 100%
    2 threads    140% – 180%                 160% – 199%
    4 threads    140% – 180%                 340% – 380%

Table 1: CPU usage of the two machines while running the program ‘Robustness’ with different numbers of OpenMP threads


Figure 29: Both processors are utilized by the command ‘Robustness’

With an increasing number of threads, the computation time decreases. The time elapsed for calculating robustness is reported both in CPU seconds and in real (wall-clock) seconds. The total CPU time rises slightly when more threads are used, but compared to the savings in real time this does not make a huge difference. The results were obtained on the machines Rein and Rechner with the larger circuits c1355_i89, c1908_i89 and c3540_i89.

    Num_Threads          Robustness Value   CPU Time [s]   OpenMP Real Time [s]
    1 (existing code)    0.325518           31.32          31.345
    2                    0.325518           32.09          19.6791
    4                    0.325518           31.86          18.4646
    8                    0.325518           31.99          18.4398

Table 2: Results for NangateDesigns: cell c1355_i89, dimension: 2D, maximum age: Robustness at 10 years, machine: REIN (2 core processors)


    Num_Threads          Robustness Value   CPU Time [s]   OpenMP Real Time [s]
    1 (existing code)    0.325518           28.63          28.697
    2                    0.325518           28.96          17.6511
    4                    0.325518           29.62          11.2495
    8                    0.325518           32.32          8.63535
    12                   0.325518           39.9           11.4595

Table 3: Results for NangateDesigns: cell c1355_i89, dimension: 2D, maximum age: Robustness at 10 years, machine: RECHNER3 (8 core processors)

My main aims were:

1. to reduce the run time for computing robustness as the number of threads increases, and
2. to keep the robustness result the same, regardless of the number of threads.

The readings show that both criteria were met. However, in the case of Rechner3 with 12 threads (see Table 3 and Table 5), the time taken to compute the circuit’s robustness is higher than with 8 threads. This is caused by the surplus threads, which cannot get a core of their own on a machine with only 8 processor cores and therefore delay the execution of the program. 8 threads are sufficient for an 8 core processor; ideally, the number of threads should equal the number of processors, but it isn’t strictly necessary.
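
One way to guard against such oversubscription, as a small sketch (not part of the thesis code), is to cap the requested thread count at the number of available processors:

    #include <omp.h>
    #include <algorithm>

    // Returns the requested thread count, limited to the available cores.
    int chooseThreadCount(int requested) {
        int procs = omp_get_num_procs();   // number of processors available
        return std::min(requested, procs);
    }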

    Num_Threads          Robustness Value   CPU Time [s]   OpenMP Real Time [s]
    1 (existing code)    0.317204           47.15          47.1668
    2                    0.317204           48.6           28.8759
    4                    0.317204           48.29          28.1237
    8                    0.317204           48.47          27.5828

Table 4: Results for NangateDesigns: cell c1908_i89, dimension: 2D, maximum age: Robustness at 10 years, machine: REIN (2 core processors)

    Num_Threads          Robustness Value   CPU Time [s]   OpenMP Real Time [s]
    1 (existing code)    0.317204           43.55          43.6704
    2                    0.317204           43.53          25.3362
    4                    0.317204           44.19          16.4133
    8                    0.317204           47.97          11.563
    12                   0.317204           50.06          11.7313

Table 5: Results for NangateDesigns: cell c1908_i89, dimension: 2D, maximum age: Robustness at 10 years, machine: RECHNER3 (8 core processors)


Figure 30: Time elapsed for computation decreases with an increase in OpenMP threads. These results are based on the values from Table 5

Figure 31: Results for NangateDesigns: cell c3540_i89, dimension: 2D, maximum age: Robustness at 10 years,

machine: RECHNER3 (8 core processors), number of threads: 1



Figure 32: Results for NangateDesigns: cell c3540_i89, dimension: 2D, maximum age: Robustness at 10 years,

machine: RECHNER3 (8 core processors), number of threads: 2

Figure 33: Results for NangateDesigns: cell c3540_i89, dimension: 2D, maximum age: Robustness at 10 years,

machine: RECHNER3 (8 core processors), number of threads: 4


Figure 34: Results for NangateDesigns: cell c3540_i89, dimension: 2D, maximum age: Robustness at 10 years,

machine: RECHNER3 (8 core processors), number of threads: 8


6. Conclusion

During the course of this thesis, an efficient parallelization of the program using OpenMP was implemented. The most important and time-consuming task for me was the research on the OpenMP interface and understanding how the whole program works. Without that, I would not have understood how code is actually parallelized and which parts of the code had to be parallelized. Parallel computer programs are more difficult to write than sequential ones, as they require more planning and more skill to troubleshoot software bugs such as race conditions. Thus, at the beginning of this thesis, I was mainly busy familiarizing myself with most of the variables and functions.

Parallelism has been employed for many years, mainly in high-performance computers or supercomputers built from several processor cores. With multi-core computers being so common these days, interest has grown massively. Multithreaded parallelization proves to be a simple and effective method to reduce the computation time of both constrained and unconstrained global programming problems, and OpenMP was the interface chosen here for parallel computing. OpenMP is NOT:

1. meant for distributed-memory parallel systems;
2. necessarily the most efficient use of shared memory;
3. required to check for data conflicts, race conditions and deadlocks;
4. designed to guarantee that input or output to the same file is synchronous when executed in parallel; the programmer is responsible for synchronizing input and output.

It still has the advantage of being easy to apply to existing serial code, allowing incremental parallelization. OpenMP provides a compact yet powerful programming model for shared-memory programming. It allows the programmer to parallelize one section of the code at a time.

6.1. Further Improvements

The number of OpenMP threads can be set in several ways. The most basic is to hard-code it; I hard-coded it in the program initially for testing purposes.


The number of threads could be declared at the start of the program, next to the include files:

    #define NUM_THREADS 2
    omp_set_num_threads(NUM_THREADS);

The number of OpenMP threads can also be set as an environment variable in the Linux terminal with the command:

    export OMP_NUM_THREADS=<int>

where <int> is any user-defined integer. int NUM_THREADS could then be obtained as:

    int NUM_THREADS = omp_get_num_threads();

Unfortunately, the environment-variable approach could not be made to work. With this working, the command-line input for the number of threads would not be needed.
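
In hindsight, a likely cause is that omp_get_num_threads( ) returns 1 when called outside a parallel region, so it cannot observe OMP_NUM_THREADS there. A minimal sketch using omp_get_max_threads( ), which does reflect the environment variable, could look like this:

    #include <omp.h>
    #include <iostream>

    int main() {
        // omp_get_max_threads() honours OMP_NUM_THREADS even in serial code,
        // whereas omp_get_num_threads() would simply return 1 here.
        int NUM_THREADS = omp_get_max_threads();
        std::cout << "Would run with " << NUM_THREADS << " threads" << std::endl;
        return 0;
    }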

As it stands, the code is automated: the number of threads can be any integer, declared on the command line (--numthreads #), and this setting applies only to the program command ‘Robustness’.

The objective of this bachelor thesis, however, is met. Parts of the robustness validation programs are now efficiently parallelized, which speeds up the whole process of robustness calculation without altering the robustness value, regardless of the number of threads.


Acknowledgments

This thesis results from my work as an undergraduate student at the Institute for Electronic Design Automation at the Technische Universität München.

Firstly, I express my gratitude to Professor Ulf Schlichtmann for giving me such a wonderful opportunity to complete my bachelor thesis at his institute. I consider it my privilege to have worked on such a novel topic there.

I would like to thank my supervisor, Mr. Martin Barke, for being so kind and encouraging throughout the course of this bachelor thesis. Without his help and continued support, the successful completion of this project would not have been possible. I have learnt a lot from him, and the skills I developed under his guidance are exceptional.

I am also very thankful to my colleagues, Mr. Sharad Shukla and Mr. Lachman Karunakaran, for their active help throughout my thesis. It was a great time with my colleagues at the institute, one which I will never forget.

Above all, I thank my parents for their constant support and love in all aspects. Without their

blessings, I wouldn’t have stood a chance to be a part of this great university.


References

[1] A. Kiessling: “An Introduction to Parallel Programming with OpenMP”, pedagogical seminar, April 2009

[2] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald and R. Menon: “Parallel Programming in OpenMP”, 2001

[3] T. Mattson and B. Chapman: “OpenMP in Action”, 2005

[4] K. Boguslawski: “Parallel Programming with OpenMP”, group talk, 28 April 2010

[5] B. Barney: “Introduction to Parallel Computing”, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#Abstract

[6] B. Barney: “OpenMP”, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/openMP/#ProgrammingModel

[7] “Introduction to OpenMP, Part 1”, http://community.topcoder.com/tc?module=Static&d1=features&d2=091106

[8] “How to use OpenMP in Cmake”, March 2012, http://quotidianlinux.wordpress.com/2012/03/26/how-to-use-openmp-in-cmake/

[9] M. Barke, M. Kärgel, W. Lu, F. Salfelder, L. Hedrich, M. Olbrich, M. Radetzki and U. Schlichtmann: “Robustness Validation of Integrated Circuits and Systems”, Asia Symposium on Quality Electronic Design (ASQED), July 2012

[10] D. Lorenz, M. Barke and U. Schlichtmann: “Ageing analysis at gate and macro cell level”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2010

[11] J. Soulié: “C++ Language Tutorial”, June 2007, http://www.cplusplus.com/doc/tutorial/

[12] “C++ Vector”, http://www.cplusplus.com/reference/vector/vector/

[13] “Summary of OpenMP 3.0 C/C++ Syntax”, OpenMP Architecture Review Board, http://www.openmp.org/mp-documents/OpenMP3.0-SummarySpec.pdf

[14] J. Yliluoma: “Guide into OpenMP: Easy multithreading programming for C++”, September 2007, http://bisqwit.iki.fi/story/howto/openmp/