Technische Universität München
Department of Electrical Engineering and Information Technology
Institute for Electronic Design Automation
Efficient Parallelization of Robustness Validation for Digital Circuits
Bachelor Thesis
Suyash Shukla
Supervisor : Dipl.-Ing. Martin Barke
Supervising Professor : Prof. Dr.-Ing. Ulf Schlichtmann
Topic issued : 24.10.2012
Date of submission : 14.01.2013
Suyash Shukla
AlzeyerStrasse 2
80993 Munich
Abstract
As the feature sizes of ICs (integrated circuits) shrink into the nanometer regime, reliability issues such as negative bias temperature instability (NBTI) and hot carrier injection (HCI) are emerging as potential threats. These ageing effects cause irregular and unsteady failures, ultimately degrading the performance of the chips over time. The impact of ageing, which leads to a change of device parameters over time, is therefore now being studied. In this thesis, the robustness of integrated circuits is studied by means of a couple of C++ programs, and parts of the code are efficiently parallelized for the robustness validation of digital circuits. I used the OpenMP interface and its directives to make the existing codes run in parallel on multiple processors, thereby saving computation time.
Contents
1. Introduction ......................................................................................................................................... 11
1.1. Motivation ......................................................................................................................................... 11
1.2. Goals .................................................................................................................................................. 12
1.3. Achievements .................................................................................................................................... 12
1.4. Report structure ................................................................................................................................. 12
2. Background .......................................................................................................................................... 14
2.1. Robustness model .............................................................................................................................. 14
2.2. Measuring robustness ....................................................................................................................... 16
2.3. Adaptation to different applications ..................................................................................... 17
2.3.1. Timing analysis and timing graph (TG) .......................................................................................... 17
2.4. Calculating the area........................................................................................................................... 19
3. Parallel programming with OpenMP .................................................................................................... 23
3.1. Overview ............................................................................................................................................ 23
3.2. Architecture and parallel programming models ................................................................................ 24
3.2.1. Architecture .................................................................................................................................. 24
3.2.2. Parallel programming models ....................................................................................................... 25
3.2.2.1. Shared and distributed memory ......................................................................................... 26
3.2.2.2. Threads model ..................................................................................................................... 27
3.2.2.3. Single program, multiple data ............................................................................................. 28
3.3. OpenMP ............................................................................................................................................. 28
3.3.1. OpenMP executable directives and its clauses ............................................................................. 30
3.3.2. OpenMP runtime library routines................................................................................................. 33
3.3.3. Environment variables .................................................................................................................. 33
3.3.4. Summary ....................................................................................................................................... 33
4. Implementation of parallel program and OpenMP in robustness ......................................................... 34
4.1. The existing serial program: robustness.cpp ..................................................................................... 34
4.2. The existing serial program: DichotomyCalculation.cpp and RobustnessCalculation.cpp ................ 37
4.3. Ideology ............................................................................................................................................. 39
4.4. Changes made in existing code .......................................................................................................... 41
4.5. Difficulties faced ................................................................................................................................ 44
5. Evaluation and Results.......................................................................................................................... 46
6. Conclusion ............................................................................................................................................ 52
6.1. Further improvements ....................................................................................................................... 52
Acknowledgments ......................................................................................................................................... 54
References ..................................................................................................................................................... 55
LIST OF FIGURES
FIGURE 1: OPERATING CONDITIONS AND SYSTEM PROPERTIES, Π NOM AND Φ NOM .................................................... 14
FIGURE 2: PROPERTY OR PERFORMANCE SPACE, ΦS, OF TASK A AND TASK B .......................................................................... 15
FIGURE 3: PERTURBATION SPACE, ΠS, WHERE THE SYSTEM HAS TO WORK PROPERLY FOR TWO TASKS A AND B ............................. 15
FIGURE 4: A ROBUST SYSTEM ....................................................................................................................................... 16
FIGURE 5: COMPUTATION OF ARRIVAL TIME (AT) ........................................................................................................... 18
FIGURE 6: PERTURBATION AND PERFORMANCE SPACE BEING MAPPED TO VDD, T AND F RESPECTIVELY ........................................ 19
FIGURE 7: FINDING VALID POINTS STARTING FROM THE 4 SPECIFICATION CORNERS BY DICHOTOMY CALCULATION ......................... 21
FIGURE 8: A PAIR OF DOUBLE TEMP WHICH GIVES THE BOUNDARY OF THE ROBUST REGION. ..................................................... 22
FIGURE 9: SERIAL COMPUTING ..................................................................................................................................... 23
FIGURE 10: PARALLEL COMPUTING ............................................................................................................................... 24
FIGURE 11: THE FOUR MAIN COMPONENTS OF A COMPUTER .............................................................................................. 25
FIGURE 12: SHARED MEMORY CONFIGURATION ............................................................................................................... 26
FIGURE 13: DISTRIBUTED MEMORY CONFIGURATION ........................................................................................................ 27
FIGURE 14: THREADS MODEL ....................................................................................................................................... 28
FIGURE 15: OPENMP USES FORK-JOIN MODEL FOR PARALLEL EXECUTION ............................................................................ 29
FIGURE 16: THREE COMPONENTS OF OPENMP USED TO EXAMINE AND MODIFY THE PARALLEL EXECUTION PARAMETERS ................. 30
FIGURE 17: THE “SECTIONS” DIRECTIVE ......................................................................................................................... 31
FIGURE 18: SCREENSHOT OF MY BASIC OPENMP CODE USED IN THE PROGRAM ..................................................................... 32
FIGURE 19: DECLARING ONE *TG AND THEN USING IT TO CALL DIFFERENT FUNCTIONS, ALL FROM THE SAME CLASS, TG ................. 35
FIGURE 20: CODES WHERE THE USE_PROFILE AND THE MAPS OF ALL THE NODES OF THE CIRCUIT ARE CREATED ............................ 36
FIGURE 21: THE FOUR CORNERS OF SPECIFICATIONS WHICH NEED TO BE VERIFIED BY FUNCTION VERIFYSPEC( ) ONLY ONCE ............ 38
FIGURE 22: DEPENDING ON ‘SPECVIOLATED’, AREA CALCULATION TAKES PLACE BY FUNCTION CALLED ITERATE_D( ) ..................... 38
FIGURE 23: ITERATION TAKING PLACE IN TWO DIRECTIONS FROM EACH CORNER FOR CALCULATING THE AREA OF ROBUST REGION. ... 39
FIGURE 24: EIGHT VALID POINTS WHICH DEFINE THE AREA AND ARE NEEDED FOR CALCULATING THE ROBUSTNESS VALUE. .............. 40
FIGURE 25: EACH TG CALLS FUNCTIONS LIKE UPDATEAT( ) BY : M_TGVEC[OMP_GET_THREAD_NUM( )] -> UPDATEAT( ) ........... 40
FIGURE 26: THE VALUES OF ‘TEMP1’ TO ‘TEMP5’ ARE PUSHED BACK IN THE VECTOR OUTSIDE THE PARALLEL REGION. .................... 43
FIGURE 27: CMAKELISTS.TXT WHICH INCLUDES ALL THE DIFFERENT COMPILER FLAGS .............................................................. 44
FIGURE 28: THE AREA OF THE ROBUST REGION CANNOT BE DETERMINED BY SIMPLY PARALLELIZING THE ITERATE_D( ) ................... 45
FIGURE 29: BOTH THE PROCESSORS ARE BEING UTILIZED FOR THE COMMAND ‘ROBUSTNESS’ .................................................... 47
FIGURE 30: TIME ELAPSED FOR COMPUTATION DECREASES WITH AN INCREASE IN OPENMP THREADS ........................................ 49
FIGURE 31: RESULTS FOR CELL C3540_I89, 2D, ROBUSTNESS AT 10 YEARS, RECHNER3, NUMBER OF THREADS: 1 .................. 49
FIGURE 32: RESULTS FOR CELL C3540_I89, 2D, ROBUSTNESS AT 10 YEARS, RECHNER3, NUMBER OF THREADS: 2 ................... 50
FIGURE 33: RESULTS FOR CELL C3540_I89, 2D, ROBUSTNESS AT 10 YEARS, RECHNER3, NUMBER OF THREADS: 4 .................. 50
FIGURE 34: RESULTS FOR CELL C3540_I89, 2D, ROBUSTNESS AT 10 YEARS, RECHNER3, NUMBER OF THREADS: 8 .................. 51
List of Tables
TABLE 1: CPU USAGE OF THE TWO MACHINES WHILE RUNNING THE PROGRAM ‘ROBUSTNESS’ WITH DIFFERENT NUMBER OF OPENMP
THREADS ........................................................................................................................................................ 46
TABLE 2: RESULTS FOR NANGATEDESIGNS: CELL C1355_I89, DIMENSION: 2D, MAXIMUM AGE: ROBUSTNESS AT 10 YEARS,
MACHINE: REIN (2 CORE PROCESSORS) ................................................................................................................ 47
TABLE 3: RESULTS FOR NANGATEDESIGNS: CELL C1355_I89, DIMENSION: 2D, MAXIMUM AGE: ROBUSTNESS AT 10 YEARS,
MACHINE: RECHNER3 (8 CORE PROCESSORS) ...................................................................................................... 48
TABLE 4: RESULTS FOR NANGATEDESIGNS: CELL C1908_I89, DIMENSION: 2D, MAXIMUM AGE: ROBUSTNESS AT 10 YEARS,
MACHINE: REIN (2 CORE PROCESSORS) ................................................................................................................ 48
TABLE 5: RESULTS FOR NANGATEDESIGNS: CELL C1908_I89, DIMENSION: 2D, MAXIMUM AGE: ROBUSTNESS AT 10 YEARS,
MACHINE: RECHNER3 (8 CORE PROCESSORS) ...................................................................................................... 48
1. Introduction
Nothing in this world lasts forever, due to unavoidable conditions. This applies to humans, animals, everyday objects and even digital circuits. All manufacturers of integrated circuits would like their chips to operate without faults for many years, but this cannot be guaranteed. Over prolonged periods, issues like NBTI, HCI, time dependent dielectric breakdown (TDDB) and electromigration (EM) change the device characteristics and eventually cause the circuit to fail. Gate delays also increase as the circuit ages. This leads to violations of the timing specification, even though the specifications were met right after manufacturing. To increase the lifetime of these circuits, the ability of different ICs to cope with such reliability issues and errors during their execution was investigated. This ability is called the robustness of integrated circuits. The digital components in these circuits should behave robustly: the output of the circuit should depend entirely on its input and should be controllable.
In this thesis, the analysis of robustness is integrated into an automated design flow that calculates robustness over the years, in the presence of NBTI and HCI, using software algorithms and programs. This report discusses an efficient parallelization of ageing analysis for various digital circuits. Certain parts of the code are executed in parallel, depending on the number of threads and processors. This parallelization speeds up the robustness calculation considerably, especially when dealing with larger circuits. The code is parallelized by using OpenMP together with the existing C++ code. The parallelized code together with the OpenMP interface is finally evaluated by analyzing the CPU usage and measuring the time elapsed during the computation.
1.1. Motivation
Most of the programs people write and execute run serially: the instructions in the program are executed one after the other on a single processor. This is called serial computing. The existing codes were all serial programs, i.e. they ran on a single processor even though the computer had two or more cores. With multi-core machines now being standard, I want my program to utilize all the available processors, i.e. to work on several tasks at once, in parallel. That is what multi-threading allows me to do. Recently, there has been a strong shift towards parallel computing, because parallelizing operations is a major source of speed-up. Parallel machines provide a wonderful opportunity for applications with large computational requirements, in my case larger integrated circuits. In order to exploit them, I need to create extra threads, each running code simultaneously on a different processor core and thereby utilizing all the resources. The introduction of multi-core processors has encouraged parallel computing, where the program executes more efficiently, but this is only possible with proper use of parallel multiprocessing interfaces.
1.2. Goals
The objective of this project is to find out parts of code that can be efficiently parallelized by
using parallelization tools or other specific interfaces. In particular, this project has the
following aims:
Get a clear idea of what the program needs to do
Decide which protocol or interface to use, depending on the existing programming language
Design algorithms for distributing the work between multiple processors
Finally, implement these algorithms in the parts of the programs that need to be parallelized
1.3. Achievements
The OpenMP interface was ultimately chosen to parallelize the codes. The goal of OpenMP is to provide a standard and portable API for writing shared-memory parallel programs. This API is specified for the C/C++ and Fortran programming languages and is portable across major platforms, especially Linux/Unix. The language extensions in OpenMP fall into three categories:
1. Directives
2. Run-time routines
3. Environment variables
OpenMP directives tell the compiler which parts of the program are to be executed in parallel and how those parts of the code are assigned to individual threads. The compiler then generates explicitly threaded code, resulting in a multithreaded object program.
1.4. Report Structure
Background
This section will discuss all the previous work done related to the project, robustness
validation. It will also discuss some background directly relevant to understanding this report.
Parallel programming with OpenMP
This section will discuss parallel computing and model for shared memory programming.
OpenMP and its extensions used in the project will be briefly discussed.
Implementation of OpenMP in robustness validation
This section will describe the existing programs and their functions, as well as the algorithm and the idea behind parallelizing the existing code. The difficulties faced while implementing the OpenMP parallel code will also be discussed in this section.
Evaluation
This section will evaluate my results by analyzing processor usage and the time taken while running the robustness program with a single thread and with multiple threads.
Conclusion
This section will summarise the work done. Further improvements or changes that can be
carried forward will also be discussed here.
2. Background
The design of robust systems is becoming important as integrated circuits enter nanometer technology. Digital components play a leading role in embedded systems. Their external environment can be software, other digital and analog components, or the physical world. As these digital components age over time, the input data received from their respective environment can be imprecise or contain errors. Nonetheless, a well designed system should have an output behaviour that remains robust even in the presence of such inaccuracy in the input sequence. Designing such robust components is a challenging task in the field of embedded systems. In sectors where precision is a must, such as automotive and aerospace applications like flight controllers, smooth control action, longevity and robust behaviour must be guaranteed. These systems should be highly immune to external disturbances and perturbations.
However, given the different implementations of a system, their specifications and their applications, engineers need a concept that addresses the requirements of microelectronic circuits and systems with solid definitions and proper measuring techniques. Therefore, a general model for quantitatively measuring robustness was created.
2.1. Robustness model
In Figure 1, system ‘S’ takes an input sequence x(t) and transforms it into an output sequence y(t). The operating conditions Π and the system properties Φ have to be considered as well. The properties or performance features Φi have to be fulfilled for the system to be robust, whereas each perturbation Πi that influences the performance is simply a representation of the operating conditions or perturbation space to which all production devices will be exposed.
Figure 1: Operating conditions and system properties are given by their nominal values Π nom and Φ nom or their sets Π and Φ respectively. [Ref: [9]]
If a system performs two applications, task A and task B, its performance has to stay within the regions ΦA and ΦB. Therefore, the properties of a robust system lie in the intersection ΦA ∩ ΦB, as shown in Figure 2.
For the perturbation space, the system has to work fault-free at least in the union ΠA U ΠB, as shown in Figure 3. Normally, the region in which the system actually works is larger than this union of the two tasks' perturbation spaces.
Figure 2: Property or performance space, ΦS, of task A and task B, ΦA ∩ ΦB (Shown in blue) [Ref: [9]]
Figure 3: Perturbation space, ΠS, where the system has to work properly for two tasks A and B, ΠA U ΠB (shown in blue) [Ref: [9]]
A system can be called robust if all points of the set ΠS in the perturbation space are mapped to points in the performance space within the area ΦS (Figure 4). The margin between Φ and ΦS can be used for a quantitative measurement of robustness, which is explained in the next section.
Figure 4: A robust system where all specified points in the perturbation space are transformed into points within the specified area Φ. [Ref: [9]]
2.2. Measuring robustness
Robustness in this project is measured as a probability. A circuit that works under all operating conditions is ideally robust and has a robustness value of ‘1’. The advantage of calculating robustness as a probability is that a comparison between completely different systems becomes possible. In this project, the logic for calculating robustness as a probability was already coded in DichotomyCalculation.cpp.
The function probs( ) is called and returns a pair of double values, which is assigned to ‘workingProb’.
The final robustness value is then computed by calling the function robustness( ) with ‘workingProb.first, workingProb.second’ as its input.
Code 1: Function probs( ) and robustness( double tmpProb, double specProb ) are
defined in RobustnessCalculation.cpp
std::pair<double, double> workingProb = probs();
double robustnessprobValue = robustness( workingProb.first, workingProb.second );
2.3. Adaptation to different applications
The following section gives an overview of how the robustness model and the way of measuring robustness can be adapted to different applications such as ageing analysis.
Over time, the circuit performance degrades due to effects like NBTI and HCI, which influences the robustness of the whole system. To evaluate this decrease in performance, static timing analysis (STA) with a gate model is needed to compute the gate delays. The gate model provides a delay for the rising and falling input transition of each timing arc. In today's chips, a unique critical path is found, and the gates on this path age due to technology scaling.
Code 2: This returns a Boolean value which determines whether it is the most critical
path of a circuit or not. “Path” shows the list of Ids of either nodes or edges in the
critical path
bool getCriticalPath( std::list < std::pair < unsigned int, bool > > *path );
Code 3: This prints the most critical path of the circuit
void printCriticalPath( std::list < std::pair < unsigned int, bool > > path, bool rising );
2.3.1. Timing Analysis and Timing Graph (TG)
Timing analysis is required for many different steps during the design process of digital circuits. Timing analysis (TA) determines the maximum clock frequency at which a circuit can operate. Static timing analysis (STA) determines the path with the longest path delay and the latest arrival time (Figure 5), which yields the circuit delay and verifies the setup time constraints.
Figure 5: Computation of the arrival time (AT). Here, ‘a’ and ‘b’ are inputs of a gate and ‘c’ is the output. The AT of ‘a’ is 9 and of ‘b’ is 10. The delay from ‘a’ to ‘c’ is 4 and from ‘b’ to ‘c’ is 2. Hence, the AT at output c is always the maximum over its inputs; in this case it is 13 (9+4 > 10+2). [Ref: [10]]
Code 4: This returns a pair of rising and falling timings for all the sink nodes in the
timing graph
std::pair <sslv, sslv> getSinkArrivalTime( );
Code 5: This updates arrival time of all the nodes in the timing graph by iterating
over each node
void updateAT( );
The implemented STA uses a timing graph (TG), consisting of nodes (gate inputs and outputs) and edges. Edges can be of two types. Edges from gate inputs to their outputs carry the gate delays of that particular timing arc; edges from a gate output to another gate input carry the delays caused by the interconnect network. Gate delays are more significant than interconnect delays and are the ones that have to be considered. Every TG edge has two edge weights, one for the rise delay and one for the fall delay.
Code 6: This creates an Edge in the timing graph
unsigned int createEdge( std::string name, OpenAccess_4::oaOccNet *oaNet, unsigned int
start_node, unsigned int end_node, std::string pin1, std::string pin2 );
Code 7: This creates a Node in the timing graph
unsigned int createNode( std::string type, std::string name, OpenAccess_4::oaOccInst
*oaInst );
A source node and sink node are added to the TG as well. All the primary inputs and primary
outputs are connected to source and sink nodes respectively.
Code 8: This creates a Sink Node in the TG
unsigned int createSinkNode( std::string name, OpenAccess_4::oaOccTerm *oaTerm );
Code 9: This creates a Source Node in the TG
unsigned int createSourceNode( std::string name, OpenAccess_4::oaOccTerm *oaTerm );
Besides ageing, the circuits have to deal with other variability as well:
Variations of the perturbations or operating conditions; here only the supply voltage VDD and the operating temperature T are varied
The performance space is reduced to the delay/frequency of the circuit
2.4. Calculating the area
Robustness is determined in the perturbation space by effectively sampling the area in which the circuit still works properly. This sampling uses the ageing analysis described above for the transformation into the performance space. In this project, dichotomy is the method implemented to calculate this particular area.
Figure 6: Perturbation and performance space being mapped to VDD, T and f respectively. [Ref: [9]]
The dichotomy calculation first finds the four corners of the specification and checks whether any of them is violated. This happens only once, so it is not important to parallelize these functions. When the function calculateRobustness( ) is called in the main program robustness.cpp via the pointer dichotomy*, the program jumps to DichotomyCalculation.cpp and starts executing it. The program runs in the following way:
1. ‘specViolated’ is declared as a pair of a Boolean and an int value, depending on the function verifySpec( ).
2. verifySpec( ) calls the function verifyPoint( ).
3. These two functions work together and return a pair of a Boolean and an integer value, which is assigned to ‘specViolated’. Depending on this pair, it is determined whether any of the specification points was violated.
4. Assuming that none of the four specification corners was violated (the normal case), iteration starts from these corners. From each corner the program iterates in two directions: one along the temperature axis (‘T’) and one along the voltage axis (‘V’). See Figure 7.
In the existing code this happened one after the other. My aim was to parallelize this iterate_d( ) function so that all eight iterate_d( ) calls run in parallel, calculating the ‘validPoints’ concurrently. This is where OpenMP comes into use.
Figure 7: Finding valid points starting from the 4 specification corners by Dichotomy calculation
5. The function iterate_d( ) returns a pair of double values, which is assigned to ‘temp’. Since eight iterations take place, I end up with eight values, temp1 to temp8.
6. These eight values are then stored in a vector named ‘m_validPoints’. This vector holds eight pairs of double values, one value representing the voltage and the other the temperature. The order of these temp values is very important. The iteration has to start at the point m_tempReq.first, m_voltReq.first, iterate to the boundary m_tempReq.first, m_tempLimit.first, and the calculated validPoint has to be stored in m_validPoint[0]. The result of the 2nd iteration is stored in m_validPoint[1], of the 3rd iteration in m_validPoint[2], and so on. See Figure 8.
Figure 8: A pair of double temp stores a point depending on a voltage and a temperature value. This gives the boundary of the robust region, which is important for the calculation of robustness as a probability.
7. The vector m_validPoint is then passed to the function probs( ), which computes a pair of double values, ‘workingProb’, by iterating through all eight points in the vector.
8. The final robustness value is then computed by calling the function robustness( ) with ‘workingProb.first, workingProb.second’ as its input.
(See Section 2.2, Code 1)
Recall step 4 in Section 2.4, where parallelizing the function iterate_d( ) was proposed. In the next chapter, parallel computing with OpenMP and its directives is discussed.
3. Parallel programming with OpenMP
In this chapter, the benefits of parallel computing are discussed and the approach taken in
OpenMP to support the development of parallel applications is described.
3.1. Overview
These days, programs are mostly written in such a way that they execute serially and run on a single processing unit. A function is broken into a discrete series of sub-functions or instructions, and each instruction is executed one after the other; only one instruction executes at a time.
The function iterate_d( ) calls verifyPoint( ) twice and then verifyTolerance( ). This is done serially, as shown in Figure 9.
Figure 9: Serial computing [Ref: [5]]
In simple words, parallel computing uses multiple processors simultaneously to solve a computational problem, running on multiple processing units. A function is broken into discrete parts that are executed concurrently, and each part is further broken down into a series of instructions. Instructions from each part execute simultaneously on different processors, depending on the number of threads used.
Parallel computing utilizes all the available resources, which can be multiple processor cores, any number of computers connected in a network, or both.
Some advantages of parallel computing are:
Provides concurrency
Multiple program instructions can be executed at any moment in time
By utilizing more resources, a task is completed faster, saving time and, in some cases, maybe money too
Larger, more complex computations and problems become solvable
Parallel computing is shown in Figure 10. Here, the large function iterate_d( ) is broken into smaller discrete parts which are executed concurrently. Instructions from each part execute simultaneously on different processors, depending on the number of threads. In this figure, two threads are used.
Figure 10: Parallel computing [Ref: [5]]
3.2. Architecture and parallel programming models
3.2.1. Architecture
A computer has four main components: memory, a control unit, an arithmetic logic unit (ALU) and input/output.
Figure 11: The four main components of a computer [Ref: [5]]
1. The data and the program instructions are both stored in RAM or read/write memory. The instructions (iterate_d) are coded data commanding the computer to perform a certain task, while the data (tempReq and voltReq) is simply information or a set of values taken in by the program.
2. The control unit fetches instructions and data from this memory, decodes them and sequentially coordinates operations to accomplish the programmed task.
3. The ALU performs the basic mathematical operations needed in the program.
4. Input/output lets the user interact with the computer.
Parallel computers still follow this basic design, just multiplied in units; the
fundamental architecture remains the same. They are classified by their data and
instruction streams into various types of computer organization: SISD (single
instruction, single data), SIMD (single instruction, multiple data), MISD (multiple
instruction, single data) and MIMD (multiple instruction, multiple data). Data and
instructions are read or fetched from memory; data is stored or written back to memory.
In my case, the computer is organized to deal with a single instruction and multiple data
(SIMD). This is a type of parallel computer where all the processor cores execute the same
instruction at any given time, but each processor takes in different data values as input
to execute that instruction.
3.2.2. Parallel programming models
Several programming models are:
Shared memory
Distributed memory
Threads model
Data parallel
Single Program, Multiple Data (SPMD)
Multiple Program, Multiple Data (MPMD)
Amongst these models, shared and distributed memory, threads model and SPMD will be
studied in detail.
3.2.2.1. Shared and distributed memory
Shared memory: In shared memory systems, each processor has the ability to access all the
memory from one global address space (see Figure 12). The access to this global memory
space is controllable by using various mechanisms. Data can be shared or private. Shared data
is accessible by all the threads whereas private data can be accessed only by the thread that
owns it. Developing a program on this model is an advantage as:
There is no need to specify, explicitly, the communication of data between tasks.
Synchronization takes place but it’s implicit
It is easier to modify the existing serial codes into parallel
This allows multiple processors to operate individually while having access to all the memory
resources. The programmer must make sure that writing to global memory is handled
correctly especially when a variable is being read and written in calculations.
The disadvantage of shared memory is that when multiple cores happen to access the same
memory in the global memory space, bottleneck could be caused which ultimately slows
down or hangs the whole program.
Figure 12: Shared memory configuration [Ref: [5]]
Distributed memory: In distributed-memory systems, each processor has its own local memory
and is connected to the other processors through some type of network. Because of this, the
memory of one processor cannot be mapped onto another, so each processor operates
independently. The advantages of this configuration are:
The total memory available increases with every additional processor
Processors need not communicate with each other, as each of them can quickly access
its own memory, saving overhead time
The disadvantage of distributed memory is that the programmer needs to explicitly pass data
to a processor whenever it needs to do a task. That requires specific coding, which can be
very tedious. It is also quite difficult to write a distributed-memory parallel program,
and an existing program cannot easily be made parallel, as a lot of re-coding is required.
Figure 13: Distributed memory configuration
3.2.2.2. Threads model
This model is a type of shared memory parallel programming where a single process can
have multiple, concurrent execution paths. To implement this thread model, the programmer
needs a set of compiler directives and runtime library routines that are called within the
parallel code. It is the programmer’s responsibly to select parts of code which need to be
parallelized.
The execution of any function done by the threads is not always in order. Any thread can
execute the function as soon as it’s free from its previous task. The amount of time taken by
each thread to do a task is also not fixed. Threads need to communicate with each other as
well; they need to synchronize with each other to make sure that they are not updating the
same memory address at the same time in the global memory space.
I used OpenMP to create and implement threads (See Section 3.3).
Figure 14: Threads model where there is a possibility for distribution of tasks (function
iterate_d( )) to the different threads
3.2.2.3. Single program, multiple data
SPMD is a high level programming model and a subset of SIMD (See Section 3.2.1).
Single Program: The same program is executed together at the same time on multiple data
streams by creating their copies by their respective tasks.
Multiple Data: The inputs for the task use different data.
3.3. OpenMP
OpenMP (Open Multi-Processing) is a standard Application Programming Interface (API) for
writing shared memory parallel applications in C, C++ or Fortran. The reason this
parallelization tool was chosen over the others is because OpenMP has an advantage of being
very easy to parallelize the existing serial codes which were written in C++. With careful
planning in advance on what parts of code should be parallelised and by using OpenMP
directives, clauses and the library subroutines, the work of parallel computing gets easier. It
also has the advantage of being widely used, highly portable and ideally suited for multi-core
processor machines. Multiple cores are naturally harder to coordinate than single core.
Algorithms are trickier and the programs are more complex.
OpenMP allows multi-threaded, explicit parallel programming. A team of several threads
executes code faster by running it in parallel, with the threads sharing the workload.
Threads communicate with each other, so synchronization is important too. A core runs only
one thread at a time; a core that supports multi-threading has additional hardware to
switch between threads with very little overhead.
A parallel region is a block of code executed by all the threads simultaneously. Each
thread has a thread ID, which can be determined by calling a special OpenMP function. An
OpenMP team always has one master thread and several workers. The OpenMP program starts
with this master thread, which has thread ID 0. OpenMP operates on the fork-join model,
which is shown in Figure 15.
Figure 15: OpenMP uses Fork-Join model for parallel execution
When the first parallel region, created by #pragma omp parallel, is encountered, a team of
threads is forked to carry out the work within that region. The number of threads created
depends on the runtime library routine omp_set_num_threads. The code within the parallel
region is then executed by the threads in no fixed order, depending on how long each thread
takes to finish its previous task; the code runs with a random thread order and time
slicing.
When the end of the parallel code is reached, the threads synchronize and re-join into the
master thread, which executes the rest of the code serially.
Parallelization using OpenMP is specified by its components, which are:
1. Directives and its clauses
2. Runtime library routines
3. Environment variables
Figure 16: These three pieces are typically used to examine and modify the parallel execution
parameters. Taken together, they define what is called an API (Application programming interface) [Ref: [2]]
3.3.1. OpenMP executable directives and its clauses
1. #pragma omp parallel: The parallel construct starts the parallel execution by creating
a team of threads (See Figure 15).
2. #pragma omp sections: The sections construct contains a set of individual code blocks
that are distributed over, and executed by, the threads. Each section directive must be
nested within a sections construct. A thread may execute more than one section if it is
quick enough; see Figure 17. There is no guarantee in what order the sections will run,
and it is the programmer's responsibility to make them work in order.
I discuss this later in Section 4.5, point 2, as I encountered exactly this problem
while modifying the existing code by inserting this particular sections directive.
Figure 17: The “sections” directive
3. #pragma omp single: The single construct serializes a section of code: the associated
structured block is executed by just one thread, not necessarily the master thread. It
is useful when dealing with print statements; I/O operations can be enclosed by a single
directive so that any thread that is not busy can perform the I/O operation. The other
threads skip the single construct and move on with the code.
Note: This directive was initially used in my modified parallel program just for the
print statements. But I decided to print the values calculated by all the threads,
since the threads execute the same program with different data values. Hence, this
directive was ultimately not used in my program.
4. #pragma omp barrier: The barrier construct synchronizes all the threads in the team;
it acts like a pause. Upon encountering this directive, each thread waits for the other
threads to arrive at the same point. Only when all the individual threads have finished
their computation and arrived at this point do they continue executing the operations
after the barrier.
5. #pragma omp critical: The critical construct restricts the associated structured block
to execution by only one thread at a time. The first thread to reach this directive
executes the code; only when that thread exits the critical region can the other
threads execute it. This means the value can be updated only once at a time.
Most OpenMP directives support clauses. Together with the directive, these clauses specify
additional information determining the execution context, mostly about the variables used
and their values within the parallel code region. If a variable is visible in the parallel
code and is not mentioned in any data-sharing attribute clause, the variable is considered
shared. A few of the clauses I used are:
1. private(list) clause: This clause was used with the #pragma omp parallel construct.
It requires each thread to create a private instance of the specified variable. These
variables are therefore private to each thread and not accessible by the other
threads; their values are lost at the end of the block.
2. shared(list) clause: I also used shared with the #pragma omp parallel construct.
This clause specifies that the named variable is shared and accessible by all the
threads in the team; its value persists after the end of the block.
This is illustrated in Figure 18.
Figure 18: Screenshot of my OpenMP code used in the program for determining the number of threads that are actually used
3.3.2. OpenMP runtime library routines
OpenMP provides several execution environment routines that help the programmer manage a
parallel program and monitor and affect threads, processors and the parallel environment.
A few of the execution environment routines I used in my project code are:
1. void omp_set_num_threads(int num_threads): Sets the number of threads used for the
parallel regions. The programmer can specify the number of parallel threads in two
different ways: either through the omp_set_num_threads() runtime library routine, or
through the OMP_NUM_THREADS environment variable.
2. int omp_get_num_threads(void): Returns the number of threads in the current team of
the parallel region.
3. int omp_get_thread_num(void): Returns the unique thread ID of the encountering
thread (thread numbering runs from 0 to num_threads minus 1).
All these prototypes are defined in the standard OpenMP include file, omp.h. Besides the
execution environment routines, OpenMP also provides lock routines for synchronization and
timing routines supporting a portable wall-clock timer.
3.3.3. Environment variables
OpenMP environment variables are all upper case. These variables are read at program
start-up, and any later modification of their values is ignored. They control the
execution of the parallel code.
I will say more about environment variables in Section 6.1.
3.3.4. Summary
OpenMP is a shared-memory model in which the threads communicate by sharing variables.
Unintended sharing of data causes race conditions: when the threads are scheduled
differently and access the same variable concurrently, the program's outcome can change.
To control this, synchronization between the threads is important, which is done using
synchronization constructs such as barrier, critical and single.
4. Implementation of Parallel Program and OpenMP in
Robustness
This section discusses how parts of the existing robustness.cpp, DichotomyCalculation.cpp
and RobustnessCalculation.cpp programs were made parallel and the challenges faced while
implementing this. For the efficient parallelization of the robustness evaluation, I
modified and dealt with several existing serial source files:
1. robustness.cpp
2. DichotomyCalculation.cpp
3. DichotomyCalculation.h
4. RobustnessCalculation.cpp
5. RobustnessCalculation.h
The main program of my project is robustness.cpp. It contains several pointers that point
to functions and objects of various other classes. I mainly dealt with:
TG *tg;
EDARobust::RobustnessCalculation *calc;
EDARobust::DichotomyCalculation *dichotomy;
4.1. The existing serial program: robustness.cpp
A. The main program, robustness.cpp starts by loading the following:
1. XML Timing Library: void loadTimingLib(std::string libname);
where libname is the XML file name from where the timing library is loaded.
2. Circuit Design:
void loadOA(std::string libname, std::string cellname, std::string viewname);
where libname is Open Access library name, cellname is Open Access Cell name and
viewname is Open Access View name.
3. XML Constraints Library: void loadConstraintsLib(std::string libname);
where libname is the XML file name from which the timing constraints are loaded.
These three functions belong to the class TG. The pointer *tg is of this class type and is
later assigned memory allocated for an object of this class. *tg was then used to point to
and call these three functions of the class TG.
The actual code:
Figure 19: Declaring one *tg and then using it to call different functions, all from the same
class, TG
B. Once the three files are loaded, a usage profile for the ageing analysis was selected:
void set_useProfile(useProfile prof) {
use_profile_ = prof;
}
This is called through the same pointer *tg, as the function belongs to class TG.
For example, this profile sets AgingAnalysis = true, NBTI = true, HCI = true,
clockFreq to 10 GHz, Vdd to 1.32 V, the temperature to 125 °C, and the age to
1 year * 365 days * 24 hours (see Figure 20).
C. Next, the pointer *tg calls the functions that do the following:
1. Return all the source nodes in the timing graph with their corresponding IDs:
std::map<unsigned int, Node*> getSourceNodes() {
return source_nodes_;
}
The unsigned int specifies the corresponding ID. It returns a map (ID -> pointer)
with all the source nodes the timing graph contains. This map is stored in
“sourcenodes”.
2. Returns all the sink nodes in timing graph with corresponding ID:
std::map<unsigned int, Node*> getSinkNodes() {
return sink_nodes_;
}
The unsigned int specifies the corresponding ID. It returns a map (id -> pointer)
with all sink nodes the timing graph contains. This map is stored in “sinknodes”.
3. Returns all the nodes (including sink and source nodes) in the timing graph with
corresponding ID:
std::map<unsigned int, Node*> getNodes() {
return nodes_;
}
The unsigned int specifies the corresponding ID. It returns a map (ID -> pointer) with all
the nodes the timing graph contains, including the source and sink nodes. This map is
stored in “nodes”, as can be seen in the last three lines of Figure 20.
Figure 20: Codes where the use_Profile and the maps of all the nodes of the circuit are created
D. Next, the Signal Probability (SP) and Transition Density (TD) are set at the inputs.
After setting the profile specification, the workload WL (i.e. the time a transistor is
under stress due to the impacts of NBTI and HCI) is determined at the gate inputs by
factors such as SP and TD. Indirectly, the ageing analysis and the performance of an aged
gate model are also determined by these factors.
SP and TD are given as probabilities. They are determined by iterating through all the
source nodes. If the ageing analysis of the profile is set to true, these SP and TD values
are updated.
E. General setup parameters are then declared, like the minimum and maximum
temperature, minimum and maximum voltage, min and max frequency, age, stepSize,
NBTI, HCI etc.
F. Next, depending on the user selection of dimensions (2D or 3D) and the area
calculation method (dichotomy, stopsign or gradient), robustness is calculated. Most
parts of the codes that were modified and parallelized are in the function,
calculateRobustness( ), which is defined in DichotomyCalculation.cpp and
RobustnessCalculation.cpp.
4.2. The existing serial programs: DichotomyCalculation.cpp and
RobustnessCalculation.cpp
The main function calculateRobustness( ) is called in DichotomyCalculation.cpp. This
function calls many other functions which are defined in RobustnessCalculation.cpp. These
two programs execute together and calculate the value of robustness of a given circuit in a
specific time. The main functions I dealt with and modified were:
1. bool RobustnessCalculation::verifyPoint(double temp, double voltage);
2. void RobustnessCalculation::iterate_d(doublePair start, doublePair boundary,
char alteratedParam, double intervalLimit, bool negativeFlag);
3. bool RobustnessCalculation::verifyTolerance(double temp, double voltage,
double tolerance)
There is another function, verifySpec( ), which is called to verify whether any of the
specifications lie outside the robust region, see Figure 21. This function in turn calls
verifyPoint( ), which returns a Boolean value. Based on that, verifySpec( ) returns a pair
of bool and int values and assigns it to ‘specViolated’, as explained in Figure 22. The
area is then calculated depending on the ‘specViolated’ value.
Figure 21: The four corners of specifications which need to be verified by function verifySpec( ) only once, hence parallelizing them isn’t of much difference
Figure 22: Depending on ‘specViolated’, area calculation takes place by function called iterate_d( )
Under normal conditions, none of the four corners is violated. Based on that, the area is
then calculated, either by the method of dichotomy, stopSign or gradient. Each of these
functions calls iterate( ). In my case, the area is always calculated by the method of
dichotomy, which calls the function iterate_d( ).
The function iterate_d( ) is one of the most important functions in calculating
robustness. Iteration begins from the four specification corners, each of which calculates
a valid point. Each corner iterates in two directions, temperature and voltage; see
Figure 23. All eight iterations take place one after another, which is very time
consuming.
Figure 23: Iteration taking place in two directions from each corner. This is important for the calculation of area of robust region.
This function calls verifyPoint( ) and verifyTolerance( ), which use *tg to access the
useProfile, updateArrivalTime( ) and getSinkArrivalTime( ). These values are different
and updated for each iteration.
After the eight iterations are completed, there are eight valid points which represent the
outline of the robust region. These eight points are stored in a vector called
‘m_validPoints’. The function probs( ) then iterates through this vector and returns a
pair of double values. These values are passed to the function robustness( ), which
returns a double value, ‘robustnessprobValue’. This is the final robustness value.
4.3. Ideology
My main aim was to speed up the process of calculating robustness, especially for larger
circuits like c1355_i89, c1908_i89 and c3540_i89. This is possible only if the robust
region is calculated more quickly. The robust region is defined by the eight valid points
(see Figure 24) and their vector ‘m_validPoints’. In order to achieve this, I had to run
the iterate_d( ) function in parallel, which makes it possible to compute all eight valid
points concurrently, saving a lot of time.
Figure 24: Eight valid points which define the area and are needed for calculating the robustness value.
If I parallelize the function iterate_d( ) using the threads model (see Section 3.2.2.2),
the multiple worker threads will also execute functions such as verifyPoint( ) and
verifyTolerance( ) simultaneously. Hence, I need to provide a private copy of tg to each
of these threads, as the values tg holds will differ, based on the useProfile,
updateArrivalTime( ) and getSinkArrivalTime( ).
The number of copies of tg therefore depends on the number of threads in the parallel
region (Ref: 1). So, depending on NUM_THREADS, I create that many copies of tg and store
them all in a vector of class TG called ‘TGVec’.
Each instance of tg can be accessed as TGVec[0], TGVec[1] and so on, where 0, 1, ... are
the thread IDs (Ref: 33). The thread ID is obtained by the function
omp_get_thread_num( ). This means tg0 is accessed by TGVec[thread ID = 0].
See Figure 25.
Figure 25: Four copies of tg are created since we have four threads running in parallel. Each tg can call functions like updateArrivalTime( ) by :
m_TGVec[omp_get_thread_num( )] -> updateAT( )
4.4. Changes made in existing code
1. Instead of declaring just one tg, I now create several copies of tg, depending on the
number of threads, NUM_THREADS, and store them in a vector ‘TGVec’.
std::vector<TG*> TGVec;
for ( int j = 0 ; j < NUM_THREADS ; j++) {
TG *tg;
tg = new TG;
TGVec.push_back(tg);
}
2. Since there are several copies of tg now, I need to point every tg to the functions:
- TGVec[i]-> set_useProfile(prof1);
- TGVec[i]->getSourceNodes();
- TGVec[i]->getSinkNodes();
- TGVec[i]->getNodes();
for ( int i = 0 ; i < TGVec.size() ; i++ ) {
TGVec[i] -> set_useProfile(prof1);
}
3. Existing code:
useProfile *oldProf = m_timer -> get_useProfile();
useProfile newProf;
m_timer was declared as ‘static TG *m_timer;’. But since this function executes in
parallel in verifyPoint( ) and verifyTolerance( ), I need to create multiple copies of
‘m_timer’, i.e. multiple copies of tg.
Modified code:
useProfile *oldProf = m_TGVec[omp_get_thread_num()] -> get_useProfile();
Refer to Figure 25.
4. Existing code:
startPoint = doublePair (m_tempReq.first, m_voltReq.first);
boundary = doublePair (m_tempReq.first, m_tempLimit.first);
iterate_d (startPoint, boundary, 't', m_intervalLimit);
The function iterate_d( ) is no longer a void function: it now returns a doublePair
value, ‘start.first’ and ‘start.second’.
After iterating eight times in parallel sections (see Section 3.3.1, points 1 and 2)
and hitting the barrier directive, each thread waits for the rest to finish their
computation (see Section 3.3.1, point 4).
The eight values are then stored in a vector ‘m_validPoints’. It is important to
maintain the order of the iteration values, starting from ‘voltReq.first’,
‘tempReq.first’, iterating with respect to the temperature axis first.
Modified code:
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel // Parallel region begins here
{
#pragma omp sections // Code is distributed amongst the threads
{
#pragma omp section
{
startPoint = doublePair (m_tempReq.first, m_voltReq.first);
boundary = doublePair (m_tempReq.first, m_tempLimit.first);
temp1 = iterate_d (startPoint, boundary, 't', m_intervalLimit);
}
}
#pragma omp barrier // All threads wait here for each other
} // Parallel region ends here
m_validPoints.push_back (doublePair (temp1.first, temp1.second) );
Figure 26: The values ‘temp1’ to ‘temp8’ are pushed back into the vector outside the parallel region. The first value pushed back into the vector is temp1 and the last is temp8. It is
very important to maintain this order.
5. To analyze the results and compare the difference in time taken for calculating
robustness with different numbers of threads, I used an OpenMP timing routine,
double omp_get_wtime(void);
This returns the elapsed real time in seconds for any kind of computation or
function; in this example, calculateRobustness( ).
double OmpStart;
double OmpEnd;
OmpStart = omp_get_wtime();
{ … calculateRobustness ( ) … }
OmpEnd = omp_get_wtime();
std::cout << "Robustness at "<< age << "years calculated in ";
std::cout << static_cast<double> (OmpEnd - OmpStart) << " OpenMP real time
seconds! ";
std::cout << std::endl;
6. To set the number of threads for parallel computing, I could hard-code it:
‘#define NUM_THREADS 2’ (or any other integer value like four or eight). Instead of
hard coding it, I declared it as a command line argument, so the user can now set any
number of threads and the program runs accordingly. The number of threads is set with
--numthreads <int> on the command line. The advantage is that the user does not need
to recompile the program every time.
Added code:
TCLAP::ValueArg<int> numThreadsArg("", "numthreads", "Override Number
of threads usage", false, 1, "int", cmd);
If nothing is declared on the command line, the number of threads defaults to 1.
4.5. Difficulties faced
1. Running an OpenMP “Hello World” program in parallel was quite a challenge at the
start. OpenMP needs to be enabled with certain compiler flags, such as -fopenmp or
-openmp, depending on the compiler used (e.g. GNU or Intel). In this project, all these
compiler flags are declared in a CMake file called CMakeLists.txt. I searched the internet
[refer to [8]] for the CMake compiler flags and added the few lines of code to this
particular file. See Figure 27.
Figure 27: CMakeLists.txt which includes all the different compiler flags
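As a hedged alternative to hand-maintained per-compiler flags (the target name ‘robustness’ below is a placeholder, not necessarily the project’s actual CMake target), newer CMake versions offer a compiler-independent way to enable OpenMP:

```cmake
# Sketch only: let CMake pick the right OpenMP flag for the compiler in
# use (GNU: -fopenmp, Intel: -qopenmp, etc.).
# Requires CMake >= 3.9 for the imported OpenMP::OpenMP_CXX target.
find_package(OpenMP)
if(OpenMP_CXX_FOUND)
    target_link_libraries(robustness PUBLIC OpenMP::OpenMP_CXX)
endif()
```
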
2. The main idea was to speed up the eight iterations, so parallelizing the function
iterate_d( ) with #pragma omp parallel and #pragma omp section definitely had to be done.
Since the iteration function was previously void, it filled the vector ‘m_validPoints’
with the pairs of double ‘temp’ values (valid points) itself. If I had simply parallelized
this function, the order in which the ‘temp’ values are pushed back into the vector would
have become totally random. See Figure 28.
Figure 28: The area of the robust region cannot be determined by simply parallelizing the iterate_d( ) function. As the threads carry out the code in the sections in a totally
random manner, the order of these valid points is violated too.
To overcome this problem, I modified the function iterate_d( ). It now returns the eight
‘temp’ values, computed in parallel in a random order of course. After the parallel
section of code has ended, these eight values are simply pushed back in the correct order,
from ‘temp1’ through ‘temp8’. See Figure 8 and Figure 24 for better understanding.
5. Evaluation and Results
The program can be run with multiple threads, as chosen by the user. This is set with the
command line option --numthreads x, where x is the number of threads (see Section 4.4,
point 6). This was made possible by the OpenMP interface. The OpenMP include file for the
library routines is mandatory:
#include <omp.h>
These multiple threads execute the program by sharing the work amongst them. Hence, the
computation time for calculating robustness depends on the number of threads used. For
example, if a Rechner machine takes 36 seconds to compute robustness using a single
thread, it takes around 9 seconds with eight threads running in parallel on the same
circuit. The time spent computing robustness is reduced by roughly a factor of 3.2 to 4.3,
depending on the size of the circuit: the larger the circuit, the greater the speedup.
Compare Table 3, Table 5 and Figure 34.
These results were observed after running three large circuits several times each on the
Rein machine and on the Rechner1, Rechner2 and Rechner3 machines.
The program now utilizes all the available resources: the program ‘Robustness’ now runs
on all the processor cores.
              Rein (2 core processors)   Rechner (8 core processors)
1 thread      90% – 100%                 90% – 100%
2 threads     140% – 180%                160% – 199%
4 threads     140% – 180%                340% – 380%
Table 1: CPU usage of the two machines while running the program ‘Robustness’ with
different numbers of OpenMP threads
Figure 29: Both the processors are being utilized for the command ‘Robustness’
With an increase in the number of threads, the computation time decreases. The time
elapsed for calculating robustness is reported both in CPU seconds and in real (wall
clock) seconds. The CPU time increases once more threads are used, but compared to the
real time it does not make a huge difference.
The results were analyzed using the machines Rein and Rechner and larger circuits like
c1355_i89, c1908_i89 and c3540_i89.
Num_Threads Robustness Value CPU Time (seconds) OpenMP Real Time
1 (existing code) 0.325518 31.32 31.345 seconds
2 0.325518 32.09 19.6791 seconds
4 0.325518 31.86 18.4646 seconds
8 0.325518 31.99 18.4398 seconds
Table 2: Results for NangateDesigns: cell c1355_i89, dimension: 2D, maximum age: Robustness at 10 years,
machine: REIN (2 core processors)
Num_Threads Robustness Value CPU Time (seconds) OpenMP Real Time
1 (existing code) 0.325518 28.63 28.697 seconds
2 0.325518 28.96 17.6511 seconds
4 0.325518 29.62 11.2495 seconds
8 0.325518 32.32 8.63535 seconds
12 0.325518 39.9 11.4595 seconds
Table 3: Results for NangateDesigns: cell c1355_i89, dimension: 2D, maximum age: Robustness at 10 years, machine: RECHNER3 (8 core processors)
My main aims were:
1. To reduce the run time for computing robustness as the number of threads increases
2. To keep the robustness result the same, regardless of the number of threads
The readings show that both criteria were met. However, in the case of Rechner3 with 12
threads (see Table 3 and Table 5), the time taken to compute the circuit's robustness is
longer than with 8 threads. This is because of the several idle threads which do not share
in the work at all, since the machine has just 8 processor cores; these idle threads delay
the execution of the program. 8 threads are sufficient for an 8-core machine. Ideally, it
is preferable to keep the number of threads equal to the number of processors, but it is
not strictly necessary.
Num_Threads Robustness Value CPU Time (seconds) OpenMP Real Time
1 (existing code) 0.317204 47.15 47.1668 seconds
2 0.317204 48.6 28.8759 seconds
4 0.317204 48.29 28.1237 seconds
8 0.317204 48.47 27.5828 seconds
Table 4: Results for NangateDesigns: cell c1908_i89, dimension: 2D, maximum age: Robustness at 10 years,
machine: REIN (2 core processors)
Num_Threads Robustness Value CPU Time (seconds) OpenMP Real Time
1 (existing code) 0.317204 43.55 43.6704 seconds
2 0.317204 43.53 25.3362 seconds
4 0.317204 44.19 16.4133 seconds
8 0.317204 47.97 11.563 seconds
12 0.317204 50.06 11.7313 seconds
Table 5: Results for NangateDesigns: cell c1908_i89, dimension: 2D, maximum age: Robustness at 10 years, machine: RECHNER3 (8 core processors)
Figure 30: Time elapsed for computation decreases with an increase in OpenMP threads. These results are based on the values from Table 5
Figure 31: Results for NangateDesigns: cell c3540_i89, dimension: 2D, maximum age: Robustness at 10 years,
machine: RECHNER3 (8 core processors), number of threads: 1
Figure 32: Results for NangateDesigns: cell c3540_i89, dimension: 2D, maximum age: Robustness at 10 years,
machine: RECHNER3 (8 core processors), number of threads: 2
Figure 33: Results for NangateDesigns: cell c3540_i89, dimension: 2D, maximum age: Robustness at 10 years,
machine: RECHNER3 (8 core processors), number of threads: 4
Figure 34: Results for NangateDesigns: cell c3540_i89, dimension: 2D, maximum age: Robustness at 10 years,
machine: RECHNER3 (8 core processors), number of threads: 8
6. Conclusion
During the course of this thesis, efficient parallelization of the program using OpenMP was
implemented. The most important and time-consuming task was researching the OpenMP
interface and understanding how the whole program works; without that, I could not have
understood how code is actually parallelized or which parts of the code were worth
parallelizing. Parallel programs are more difficult to write than sequential ones, since they
require more planning and the skill to troubleshoot software bugs such as race conditions.
At the beginning of this thesis, I was therefore mainly occupied with familiarizing myself
with the relevant variables and functions.
Parallelism has been employed for many years, mainly in high-performance computers and
supercomputers built from many processor cores. Now that multi-core processors are
common in ordinary computers, interest in it has grown massively. Multithreaded
parallelization proves to be a simple and effective way to reduce the computation time of
both constrained and unconstrained global programming problems, and OpenMP is one of
the most accessible ways to exploit it.
Although OpenMP is NOT:
1. meant for distributed-memory parallel systems,
2. necessarily the most efficient use of shared memory,
3. going to check for data conflicts, race conditions or deadlocks on the programmer's
behalf,
4. designed to guarantee that input or output to the same file is synchronized when
executed in parallel (the programmer is responsible for synchronizing input and
output),
it still has the advantage of being easy to apply to existing serial code and of allowing
incremental parallelization. OpenMP provides a compact yet powerful programming model
for shared-memory programming, and it allows the programmer to parallelize individual
sections of the code one at a time.
6.1. Further Improvements
The number of OpenMP threads can be set in several ways. The most basic is to hard-code
it, which is what I did initially for testing purposes.
The number of threads could be declared at the start of the program, next to the include files:
#define NUM_THREADS 2
omp_set_num_threads(NUM_THREADS);
The number of OpenMP threads can also be set as an environment variable in the Linux
terminal with the command:
export OMP_NUM_THREADS=<n>
where <n> is a user-defined positive integer. NUM_THREADS could then be obtained in the
program as:
int NUM_THREADS = omp_get_num_threads();
Unfortunately, this environment-variable approach could not be made to work. With it
working, the command-line input for the number of threads would not be needed.
As implemented, the code is automated: the number of threads can be any integer, declared
on the command line (--numthreads <n>), and this option is used only for the program
command 'Robustness'.
The objective of this bachelor thesis is nevertheless met. Parts of the robustness validation
programs are now efficiently parallelized, which speeds up the whole robustness calculation
without altering the robustness value, regardless of the number of threads.
Acknowledgments
This thesis results from my work as an undergraduate student at the Institute for Electronic
Design Automation at the Technische Universität München.
Firstly, I express my gratitude to Professor Ulf Schlichtmann for giving me the wonderful
opportunity to complete my bachelor thesis at his institute. I consider it a privilege to have
worked on such a novel topic there.
I would like to thank my supervisor, Mr. Martin Barke, for being so kind and encouraging
throughout the course of this bachelor thesis. Without his help and continued support, the
successful completion of this project would not have been possible. I have learnt a lot from
him and the skills I developed under his guidance are exceptional.
I am also very thankful to my colleagues, Mr. Sharad Shukla and Mr. Lachman Karunakaran
for their active help throughout my thesis. It was a great time with the colleagues at the
institute, which I will never forget.
Above all, I thank my parents for their constant support and love in all aspects. Without their
blessings, I wouldn’t have stood a chance to be a part of this great university.
References
[1] Alina Kiessling (April 2009): “An Introduction to Parallel Programming with
OpenMP”, A Pedagogical Seminar
[2] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, Ramesh
Menon (2001): “Parallel Programming in OpenMP”
[3] Tim Mattson, Barbara Chapman (2005): “OpenMP in Action”
[4] Katharina Boguslawski (28 April, 2010): “Parallel Programming with OpenMP,”
Group Talk
[5] Blaise Barney: “Introduction to Parallel Computing”, Lawrence Livermore National
Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#Abstract
[6] Blaise Barney: “OpenMP”, Lawrence Livermore National Laboratory,
https://computing.llnl.gov/tutorials/openMP/#ProgrammingModel
[7] “Introduction to OpenMP, Part 1”,
http://community.topcoder.com/tc?module=Static&d1=features&d2=091106
[8] March, 2012: “How to use OpenMP in Cmake”,
http://quotidianlinux.wordpress.com/2012/03/26/how-to-use-openmp-in-cmake/
[9] M. Barke, M. Kärgel, W. Lu, F. Salfelder, L. Hedrich, M. Olbrich, M. Radetzki and
U. Schlichtmann (July 2012): "Robustness Validation of Integrated Circuits and
Systems", Asia Symposium on Quality Electronic Design (ASQED)
[10] D. Lorenz, M. Barke and U. Schlichtmann (November 2010): "Ageing analysis at
gate and macro cell level", IEEE/ACM International Conference on Computer-
Aided Design (ICCAD)
[11] Juan Soulié (June 2007): "C++ Language Tutorial",
http://www.cplusplus.com/doc/tutorial/
[12] "C++ Vector", http://www.cplusplus.com/reference/vector/vector/
[13] "Summary of OpenMP 3.0 C/C++ Syntax", OpenMP Architecture Review Board,
http://www.openmp.org/mp-documents/OpenMP3.0-SummarySpec.pdf
[14] Joel Yliluoma (September 2007): "Guide into OpenMP: Easy multithreading
programming for C++", bisqwit.iki.fi/story/howto/openmp/