Report FFT Implementation 08gr943


    FFT Parallelization for OFDM Systems

    9TH SEMESTER PROJECT, AAU

APPLIED SIGNAL PROCESSING AND IMPLEMENTATION (ASPI)

Group 943

Jeremy LERESTEUX

    Jean-Michel LORY

    Olivier LE JACQUES


    AALBORG UNIVERSITY

    INSTITUTE FOR ELECTRONIC SYSTEMS

    Fredrik Bajers Vej 7 DK-9220 Aalborg East Phone 96 35 80 80 http://www.esn.aau.dk

    TITLE:

    FFT Parallelization

    for OFDM Systems

THEME: Parallel Architecture Processing

    FFT implementation

    PROJECT PERIOD:

    9th Semester

    September 2008 to January 2009

    PROJECT GROUP:

    ASPI 08gr943

    PARTICIPANTS:

Jeremy Leresteux

[email protected]

    Jean-Michel Lory

    [email protected]

Olivier Le Jacques

    [email protected]

    SUPERVISORS:

    Yannick Le Moullec (AAU)

    Ole Mikkelsen (Rohde&Schwarz)

Jes Toft Kristensen (Rohde&Schwarz)

    PUBLICATIONS: 8

    NUMBER OF PAGES: 46

    APPENDICES: 1 CD-ROM

    FINISHED: 5th of January 2009

    Abstract

This 9th semester project for the Applied Signal Processing and Implementation specialization at Aalborg University is a study of the parallelization of FFT algorithms for OFDM receivers on the Cell BE. The project focuses on mobile applications, such as LTE, which require efficient bandwidth utilization; this can be achieved by means of the OFDM technology. A significant contribution in OFDM comes from the IFFT/FFT operations. This can be exploited by parallelizing special FFT algorithms to yield a lower operation count and thereby improve the computation time. This project investigates the possibilities and differences, with regard to time usage, when computing FFT algorithms on multiple processors of the Cell BE. First, LTE and OFDM are defined and explained. Then, two Fast Fourier Transform algorithms, a Radix-2 DIT FFT and a Sørensen FFT algorithm (SFFT), are examined and mapped onto the Cell BE processor architecture. Afterwards, both algorithms are tested and the results are discussed. It appears that the SFFT algorithm outperforms the Radix-2 DIT algorithm in terms of execution time and performance. In the conclusion, an assessment is made and future perspectives are discussed.


    Preface

This report is the documentation for a 9th semester project in Applied Signal Processing and Implementation (ASPI) entitled FFT Parallelization for OFDM Systems at Aalborg University (AAU). This report is prepared by group 08gr943 and spans from September 2nd, 2008 to January 5th, 2009. The project is supervised by Yannick Le Moullec, Assistant Professor at AAU, and by Jes Toft Kristensen and Ole Mikkelsen from the company Rohde & Schwarz Technology Center A/S in Aalborg. The report is divided into four parts. These chapters correspond to the introduction of the project, the analysis, the implementation and the conclusion.

Jeremy Leresteux    Jean-Michel Lory

Olivier Le Jacques

Aalborg, January 5th, 2009



    Contents

    Preface 4

    1 Introduction 7

    1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    1.1.1 Long Term Evolution (LTE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    1.1.2 Orthogonal Frequency-Division Multiplexing (OFDM) . . . . . . . . . . . . . . . 10

    1.1.3 Conclusion on the context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    1.2 Project subject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    1.2.1 Fast Fourier Transformation (FFT) . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    1.2.2 Cell Broadband Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    1.2.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.4 Project Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    2 Analysis 16

    2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.2 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.2.1 Design Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.3 Cell BE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    2.3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.3.2 Programming of the CBE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    2.4 FFT algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.4.2 Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.4.3 Cooley-Tukey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4.4 Sørensen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    2.5 Conclusion of the Analysis section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    3 Implementation 32

    3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    3.2 Cooley-Tukey Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    3.2.1 Overall Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    3.2.3 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3 Sørensen Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39



    3.3.1 Overall Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.3.3 Comparison with the CT algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 40

    4 Conclusion & Perspectives 45

    4.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

    4.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    4.2.1 Short term perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    4.2.2 Long term perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    Bibliography 47

    List of Figures 49


Chapter 1 Introduction

    1.1 Context

In 1981, Nordic Mobile Telephony (NMT) led to the commercialization of the first mobile phone (referred to as the 1st Generation¹). By the 29th of November 2007, 3.3 billion mobile phones had been identified worldwide [1]. Most of these phones are GSM phones (2G), but the 3rd Generation phones, which can provide features like web browsing or videoconferencing, approached half a billion devices at the end of September 2007. 3G phones have good capabilities, but a new generation (4G), with even better capabilities including higher bandwidth and more flexibility, is approaching. See Figure 1.1 for a summary of the history of the mobile phone generations.

    1.1.1 Long Term Evolution (LTE)

LTE [2] (Long Term Evolution) is the next major step in mobile radio communication. It is one of the best candidates for the 4th Generation of mobile wireless data transfer. Its development was started in 2004 by 3GPP [3] and several European mobile manufacturers and operators [4].

802.16m WiMAX (Worldwide Interoperability for Microwave Access) is another candidate [5] for the 4G appellation. It is developed by the IEEE and headed by Intel [6]. The last candidate is Ultra Mobile Broadband (UMB), developed by 3GPP2 [7] and headed by Qualcomm (it was decided on November 13, 2008 to stop UMB development in favor of LTE [8]). This project only considers LTE; therefore WiMAX and UMB are disregarded.

LTE's major aim is to improve on 3G UMTS (Universal Mobile Telecommunication System). It has ambitious requirements regarding spectrum efficiency, lowering costs, improving capacity, improving services such as video conferencing and VoIP (Voice over Internet Protocol) communication, reducing latency, and better integration with other standards.

    The 3GPP Release 8 [9] gives what the LTE requirements shall be (only the most significant ones are

    listed here):

    Peak data rate

Instantaneous downlink peak data rate of 100 Mb/s within a 20 MHz downlink spectrum allocation (5 bps/Hz)

¹ Generation: term used to define the technology used in mobile communication. 1G is NMT, 2G is GSM and 3G is UMTS/HSPA.


[Figure 1.1 here: timeline from 1990 to 2014 showing GSM (first call in 1991, 40 kbps, 3.3 billion subscribers), GPRS (2000), 3G UMTS/WCDMA (2001, 384 kbps, 297 million subscribers), EDGE (2003), HSDPA (2005), HSPA (14.4 Mbps, 55 million subscribers), HSUPA (2008), Evolved HSPA (28/40 Mbps) and 4G LTE (planned for 2009, 100 Mbps).]

Figure 1.1: Standardization evolution track, where GSM is Global System for Mobile communications, GPRS is General Packet Radio Service, UMTS is Universal Mobile Telecommunications System, WCDMA is Wideband Code Division Multiple Access, EDGE is Enhanced Data Rates for GSM Evolution, HSPA is High Speed Packet Access, HSDPA is High-Speed Downlink Packet Access, HSUPA is High-Speed Uplink Packet Access and LTE is Long-Term Evolution. Modified from [10].

Instantaneous uplink peak data rate of 50 Mb/s within a 20 MHz uplink spectrum allocation (2.5 bps/Hz)

    Latency

Transition time of less than 100 ms from a camped state, Idle Mode, to an active state

Less than 5 ms in unloaded condition (i.e. a single user with a single data stream) for small IP packets

User capacity and throughput

At least 200 users per cell should be supported in the active state for spectrum allocations up to 5 MHz

Downlink: average user throughput per MHz, 3 to 4 times HSDPA (High-Speed Downlink Packet Access: the 3.5G downlink protocol)

Uplink: average user throughput per MHz, 2 to 3 times HSUPA (High-Speed Uplink Packet Access: the 3.5G uplink protocol)

    Spectrum efficiency

Downlink: In a loaded network, target for spectrum efficiency (bits/sec/Hz/site), 3 to 4 times HSDPA


    Uplink: In a loaded network, target for spectrum efficiency (bits/sec/Hz/site), 2 to 3 times HSUPA

    Coverage

The throughput, spectrum efficiency and mobility targets above should be met for 5 km cells, and with a slight degradation for 30 km cells. Cell ranges up to 100 km should not be precluded.

    Complexity

    Minimize the number of options

    No redundant mandatory features

These characteristics are achieved by the E-UTRA air interface. E-UTRA is the acronym for Evolved Universal Terrestrial Radio Access. It is the successor of the UTRAN/GERAN (GSM EDGE Radio Access Network / UMTS Terrestrial Radio Access Network) 2G/3G air interfaces. Also designed by 3GPP, its requirements are as follows:

    Mobility

E-UTRAN should be optimized for low mobile speeds from 0 to 15 km/h

Higher mobile speeds between 15 and 120 km/h should be supported with high performance

Mobility across the cellular network shall be maintained at speeds from 120 km/h to 350 km/h (or even up to 500 km/h depending on the frequency band)

    Spectrum flexibility

E-UTRA shall operate in spectrum allocations of different sizes, including 1.25 MHz, 1.6 MHz, 2.5 MHz, 5 MHz, 10 MHz, 15 MHz and 20 MHz, in both the uplink and downlink. Operation in paired and unpaired spectrum shall be supported

    Co-existence and Inter-working with 3GPP Radio Access Technology (RAT)

Co-existence in the same geographical area and co-location with GERAN/UTRAN on adjacent channels.

E-UTRAN terminals also supporting UTRAN and/or GERAN operation should be able to support measurement of, and handover from and to, both 3GPP UTRAN and 3GPP GERAN.

The interruption time during a handover of real-time services between E-UTRAN and UTRAN (or GERAN) should be less than 300 msec.

E-UTRA is the air interface which enables the communication between a BTS (Base Transmitter Station) and a UE (User Equipment). The signal modulation used by the BTS and the demodulation used by the UE are slightly different but based on the same technology, namely Frequency-Division Multiplexing. This provides many similarities between them. SC-FDM (Single Carrier Frequency-Division Multiplexing) is used for the transmitter part and OFDM (Orthogonal Frequency-Division Multiplexing) for the receiver part. This has been decided by the 3GPP members and is summarized in Release 8.

    This project is related to the OFDM aspect at the receiver side. Section 1.1.2 gives an overview of

    OFDM fundamentals.


    1.1.2 Orthogonal Frequency-Division Multiplexing (OFDM)

This section is based on the information provided by these two papers [11] [12].

OFDM is a modulation technique which is used in most of the new wireless technologies such as IEEE 802.11a/b/g, 802.16, HiperLAN/2, DVB (digital TV) and DAB [13]. The 3GPP members selected it to be the LTE/E-UTRA downlink protocol, i.e. the system which receives data and communication packets from a transmitter. As indicated at the end of section 1.1.1, the selected uplink protocol, SC-FDM, presents similarities to OFDM, which is why this section only introduces OFDM on the transmitter and receiver sides.

    1.1.2.1 Overview

With standard single-carrier transmitters, the signal is spread over multiple transmission paths and multiple frequencies. Because of the environment (buildings, cars, distance), the signal becomes less powerful and distorted. This phenomenon, called fading, appears when signals are reflected, for example on buildings. The reflected signals arrive at the receiver later than the main signal, which results in distortions, as illustrated in Figure 1.2.


Figure 1.2: Multipath propagation. A transmitted signal is spread over different frequencies and, depending on these frequencies, the obstacles met and the distance covered, the distortion is more or less present. Modified from [12].

These distortions are a major problem when establishing secure high-speed data transfers like those used on 3G UMTS cell phones. OFDM addresses this distortion problem. It does not avoid reflections, but its characteristics make a transmission safer, in the sense that data packets are always present, by permitting multiple signals to be sent over a single radio channel. OFDM is a multi-carrier transmitter/receiver, i.e. it can send/receive signals to/from several users. The next subsections describe the main principles of OFDM on the transmitter and receiver sides.

    1.1.2.2 OFDM Principles

OFDM distributes the data over a large number of carriers at different frequencies. This spacing provides the orthogonality which prevents the receivers from seeing wrong frequencies. As opposed to other multi-carrier techniques, like CDMA, OFDM prevents Inter-Symbol Interference (ISI) by adding a cyclic prefix, which is explained in the section on Inter-Symbol Interference.

One of the key features of OFDM is the IFFT/FFT pair. These two mathematical tools are used here to transform several signals on different carriers from the frequency domain to the time domain in the IFFT (or FFT⁻¹) and from the time domain to the frequency domain in the FFT. See Figure 1.3 for the main parts of an OFDM system.

[Figure 1.3 here: transmitter chain — serial-to-parallel conversion, IFFT, add cyclic prefix, antenna; receiver chain — antenna, remove cyclic prefix, FFT, parallel-to-serial conversion; the input and output signals are in the frequency domain, the modulated signal on the channel is in the time domain.]

    Figure 1.3: Main principle of an OFDM transmitter/receiver.

The Transmitter Figure 1.4 shows a representation of the transmitter. OFDM divides the spectrum into N sub-carriers, each on a different frequency, and each carrying a part of the signal by means of the IFFT (also noted FFT⁻¹). As opposed to FDM, where there is no coordination or synchronisation between the sub-carriers, OFDM links them through the principle of orthogonality. This results in an overlapping of the sub-carriers, see Figure 1.5, where all the sub-carriers can be transmitted simultaneously at tightly spaced frequencies but without Inter-Signal Interference.

[Figure 1.4 here: constellation mapping of s[n], serial-to-parallel conversion into X0 … X(N−1), FFT⁻¹, DACs for the real and imaginary components, mixing with a cosine and a (90°-shifted) sine at fc, and summation into s(t).]

Figure 1.4: Representation of the OFDM transmitter [14]. The digital signal s[n] represents the data to transfer. It is modulated with QPSK, 16-QAM or 64-QAM to create symbols. Then the spectrum goes through an IFFT to transform it into the time domain. The real and imaginary components are converted to the analog domain to modulate a cosine and a sine at the carrier frequency fc. They are then summed into s(t) to be transferred to the receiver via the antenna.

Signals are orthogonal if they are mutually independent of each other. Orthogonality is based on the fact that the product of two different harmonic sub-carriers, sine or cosine waves, integrates to zero over one period. Let us assume two sine sub-carriers of frequencies m and n, both integers, and multiply them together:

f(t) = sin(mωt) sin(nωt)    (1.1)


Using the product-to-sum identity, this yields a sum of two sinusoids of frequencies (m − n) and (m + n):

f(t) = ½ cos((m − n)ωt) − ½ cos((m + n)ωt)    (1.2)

As these two components are sinusoids, their integral over one period is zero whenever m ≠ n:

∫₀^(2π/ω) ½ cos((m − n)ωt) dt − ∫₀^(2π/ω) ½ cos((m + n)ωt) dt = 0    (1.3)

The conclusion is that when two sinusoids of different integer frequencies m and n are multiplied, the area under the product is zero. For all integers n and m with n ≠ m, sin(mx), cos(mx), sin(nx) and cos(nx) are all orthogonal to each other. These frequencies are called harmonics.

Overlapping gives a better spectrum usage than an FDM modulator, which just places each carrier next to the others, resulting in interference between them.

[Figure 1.5 here: FDM places sub-carriers 0 … N−1 side by side with guard spacing; OFDM overlaps them, saving a bandwidth Δf.]

Figure 1.5: Spectrum efficiency difference (Δf) between FDM and OFDM. With OFDM, the signals on each sub-carrier are overlapped but still orthogonal to each other. With FDM, sub-carriers are placed next to each other.

The Receiver OFDM symbols are transmitted over the channel to the receiver on a single frequency. Basically, the receiver performs the same operations as the transmitter, but in the inverse order. By means of an FFT, an approximation of the source signal is retrieved, as illustrated in Figure 1.6.

[Figure 1.6 here: received signal r(t), mixing with a cosine and a (90°-shifted) sine at fc, ADCs, FFT, parallel-to-serial conversion of Y0 … Y(N−1), and symbol detection producing s[n].]

Figure 1.6: Representation of the OFDM receiver [14]. The antenna receives each part of the spectrum as one signal r(t). It is demodulated and, after the cyclic prefix is eliminated with filters, an FFT algorithm transforms the signals back to the frequency domain. Then each symbol is detected to create an approximation of the original data signal.


Inter-Symbol Interference (ISI) As seen in Figure 1.5, the signals are overlapped. This overlapping introduces a problem known as Inter-Symbol Interference (ISI). ISI is the spreading of symbol N−1 onto symbol N due to the overlapping: with the example in Figure 1.5, the last element of symbol 0 is overlapped by the first element of symbol 1 because of the channel.

Spread Delay The spread delay corresponds to the propagation of a transmitted signal onto the next one. It is the echo of the first signal on the second one, as illustrated in Figure 1.7 (a). This physical effect depends on the channel and the distance between the two signals.

To avoid this problem, a gap, called the guard interval, longer than the spread delay is needed. As it is impossible to send nothing, samples from the tail of the symbol are copied to the front, as illustrated in Figure 1.7 (b). This principle, explained in [15], is called the cyclic prefix. In theory, this protective prefix should be added for each sub-carrier, but in practice the OFDM signal is a linear combination, thus only one cyclic prefix is added, as illustrated in Figure 1.7 (c).

[Figure 1.7 here: (a) the spread delay spilling past the guard interval; (b) a copy of the tail of the signal placed in the guard interval; (c) a single cyclic prefix for the combined OFDM signal.]

Figure 1.7: The cyclic prefix, which makes it possible to avoid the ISI problems. (a) shows the spread delay problem. (b) shows the addition of the cyclic prefix in the guard interval according to the theory. (c) shows the addition of the cyclic prefix in practice, due to the linear combination of the OFDM signal.


    1.1.2.3 Advantages

OFDM provides better spectrum flexibility by overlapping the signals on orthogonal frequencies, the harmonics. It is less noise-sensitive than a single-carrier system. Furthermore, the ISI problem is solved thanks to the guard interval and the cyclic prefix.

    1.1.2.4 Drawbacks

OFDM is sensitive to frequency offsets and synchronisation problems, which can destroy the orthogonality of the carriers. Also, after the IFFT, OFDM can produce very high amplitudes, which can lead to large power consumption. This high amplitude, characterized by the Peak-to-Average Power Ratio (PAPR), can be reduced with correction vectors on the transmitted signals, but this adds complexity to the OFDM transmitter.

    1.1.3 Conclusion on the context

The next step in mobile communication, LTE, focuses on the quality and speed of the data transfer. This is realized with the help of the OFDM modulator/demodulator, which is the most widely used solution to problems such as ISI or fading. One of the key features of OFDM is the IFFT/FFT pair.

In this project, the focus is on the receiver side, hence on the FFT block. In section 1.2, the FFT concept is presented by means of three FFT algorithms, and the issue of parallelizing them is introduced.

    1.2 Project subject

    1.2.1 Fast Fourier Transformation (FFT)

The group members have selected three FFT algorithms which will be compared. These three algorithms are presented below:

Radix-2 DIT Fast Fourier Transform (Decimation In Time): This algorithm is chosen because it is the simplest form of the Cooley-Tukey algorithm. Many other algorithms exist which compute the DFT faster than radix-2 (radix-4 and split-radix for example), but it is important for the project to be able to compare the basic algorithm with better algorithms (Sørensen, Edelman) to show the difference in computation and complexity (explained in 2.4.3).

"Sørensen" Fast Fourier Transform (SFFT): The second algorithm under test is a mix of a Cooley-Tukey algorithm, like split-radix, and Horner's polynomial evaluation scheme. It takes into account the fact that not all the outputs are of interest for the final result, so only some chosen outputs are computed. This makes it possible to avoid many operations which are expensive in time and memory. The Sørensen FFT is well known, and the project results can be compared with other studies. It is an interesting algorithm, in terms of complexity and challenge, to implement and compare with other algorithms like the Radix-2 DIT or Edelman.

"Edelman" Fast Fourier Transform: This algorithm computes the DFT approximately, introducing some errors which are minimal compared to the reduction in the number of computations. This kind of algorithm allows increasing the speed of computation in spite of some errors. The Edelman algorithm is useful for parallel computing.

All the algorithms mentioned above are further developed in section 2.4. However, because of a lack of documentation about the Edelman algorithm, it is disregarded in the project.


    1.2.2 Cell Broadband Engine

    The purpose of this project is to examine the implementation of FFT algorithms for the OFDM application

    presented in section 1.1.2 on a multiprocessors platform, namely the Cell Broadband Engine architecture.

    The Cell BE is, for this project, used for:

    The implementation of parallelized FFT algorithms Evaluation of the performance, in particular the execution time, of the implementation of the paral-

    lelized FFT algorithms

The Cell BE is constructed as a heterogeneous processor architecture, with multiple executions and memory transfers active at the same time. This architecture is composed of a processor that contains a dual-threaded PowerPC unit (PPU) and eight simpler processors, the Synergistic Processing Units (SPUs), which are designed to perform calculations, whereas the PowerPC performs control, data management and the scheduling of operations. Each SPU contains a RISC processor and is constructed with two pipelines that can each execute an instruction per cycle. Moreover, the data paths of the arithmetic functional units inside the SPUs are wide (128 bits), allowing the use of Single Instruction Multiple Data (SIMD) instructions. This design produces a processor optimized for computations.

    1.2.3 Parallelization

Parallelization is an important part of this project. Indeed, the OFDM receiver requires an FFT as an integral part of the wireless communication. It is essential that the computation of this FFT be as fast as possible so that the achievable throughput is maximised. To obtain the best performance from the application running on the Cell BE processor, the concurrent use of multiple SPUs is evaluated. The application creates at least as many threads as concurrent SPU contexts are required. Each of these threads runs a single SPU context at a time. With this method, the FFT is parallelized and uses some of the features of the Cell BE to accelerate the computation.

    1.3 Problem Definition

This work seeks to answer the following question:

"How efficient, in terms of computation speed, is the Cell BE processor for the execution of parallelized FFT algorithms?"

    1.4 Project Delimitations

Deep research on LTE and OFDM is not the purpose of this project, nor is a complete mathematical examination of the FFT algorithms. This project focuses on the use of the Cell BE for the FFT, "probably the single most important tool in digital signal processing (DSP)", according to Sørensen and Burrus [16].


Chapter 2 Analysis

    2.1 Overview

The purpose of this chapter is, first, to introduce the design methodology the project group has chosen, in terms of project methodology (A3 design) and the way to parallelize an algorithm following established procedures. Then, this chapter introduces the platform under test, the Cell BE, in part 2.3, followed, in part 2.4, by an analysis of the different chosen FFT algorithms, with an explanation of the reasons for choosing these algorithms.

    2.2 Design Methodology

    2.2.1 Design Model

The design of the model is divided into three parts as in the A3 model [17]: Application, Algorithm and Architecture. First of all, Figure 2.1 shows the generic A3 model. Then, this methodology is applied to the specific project presented in this report, as shown in Figure 2.2.

Application: The application is any system with specifications and constraints. These can be time constraints, power consumption, area problems, etc. It is the main purpose of a project.

Algorithm: At this level, existing algorithms are developed. Special algorithms can be created for the application. The algorithms are optimized from a purely mathematical point of view, i.e. the optimizations are only done on the parts of the algorithms directly related to the application.

Architecture: The mapping of the previous algorithms is realised on the selected platform (DSP, FPGA, Cell BE, etc.). In case of incompatibility between the specifications/constraints of the application and the results, modifications have to be made. On the one hand, if the algorithms are implemented on an established architecture, the program can be modified in its architecture-related parts (bus control, data transfer control, memory allocation, etc.) for the specified architecture. On the other hand, if the algorithms are fixed, then a modification of the architecture (the VHDL program for an FPGA platform, for example) can be done.

Application: In the application domain, a presentation of LTE and OFDM is given in the context section 1.1.


[Figure 2.1 here: the Application level (specifications, constraints) feeding the Algorithm and Architecture levels; algorithmic and architectural constraints and optimizations, mapping of algorithms onto the architecture, comparison of the results against the specifications, and iteration.]

    Figure 2.1: The generic A3 design methodology.

Algorithm: In the algorithm domain, three Fast Fourier Transform algorithms are compared. First of all, an analysis of their derivation is done. Then, the complexity, i.e. the number of computations needed to execute the Fourier transform, of each algorithm is analysed. Finally, the implementation of the algorithms is done in the C language twice: once for sequential execution and a second time for parallel execution.

Architecture: In the architecture domain, the platform used to implement the different algorithms is analysed. The available hardware and the system limitations are studied. Then, it is examined how the compiler is used in order to parallelize programs, and also how to measure the computational cost in terms of resource utilisation, execution speed, etc.

    2.3 Cell BE

    2.3.1 Architecture

This section presents the architecture used throughout the project, the Cell Broadband Engine. According to the A3 design model, this section belongs to the analysis of the architecture, as illustrated in Figure 2.3.

    2.3.1.1 Architecture Overview

The Cell Broadband Engine (CBE) is a multicore processor. It has one Power Processing Element (PPE), a dual-thread PowerPC Architecture core, and eight Synergistic Processing Elements (SPEs), which are SIMD (Single Instruction Multiple Data) processor elements. The communication path for commands and data between all processor elements and all chip controllers for memory access or input/output is provided by the Element Interconnect Bus (EIB) [18, p. 41]. An overview of the architecture is presented in Figure 2.4.

In the Playstation 3, only 6 of the 8 SPEs can be used for computation, because one is reserved by the OS virtualization layer and another has been disabled for wafer yield reasons [19, p. 5]. This means that when running the operating system, 6 SPEs are available for computation, as shown in Figure 2.4.


[Figure: the A3 model instantiated for the project. Application: OFDM receiver, LTE 4G; Algorithms: Radix-2, Edelman, Sørensen; Architecture: Cell BE; the requirements drive the iteration between the three domains.]

Figure 2.2: A3 model for the project.

    2.3.1.2 Power Processing Element (PPE)

    The PPE contains a 64-bit, dual-thread PowerPC Architecture RISC core. It has 32 KB level-1 (L1)

    instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. It can run

    existing PowerPC architecture software and is well-suited for executing system-control code. However

    for this project, it will be used as a managing controller for the SPE threads and it is assumed that the

    PPE is fast enough to manage the threads executing on the SPE. The PPE consists of two main units, the

    PowerPC processor unit (PPU) which performs instruction execution and the PowerPC processor storage

    subsystem (PPSS) which handles memory requests from the PPU and external requests to the PPE from

    SPEs [18, p. 41]. The architecture overview of the PPE is presented in figure 2.5.

In the Playstation 3, the PPE is clocked at 3.2 GHz, so it can theoretically reach 2 × 3.2 = 6.4 GFLOP/s of IEEE-compliant double-precision floating-point performance. It can also reach 4 × 2 × 3.2 = 25.6 GFLOP/s of non-IEEE-compliant single-precision floating-point performance, using the 4-way single instruction multiple data (SIMD) fused multiply-add operation [19, p. 5].

    2.3.1.3 Synergistic Processor Element (SPE)

The SPE is a single instruction multiple data (SIMD) processor element optimized for the data-rich operations (such as the computation of FFT butterflies) allocated to it by the PPE. Each SPE has a Synergistic Processor Unit (SPU), which fetches instructions and data from its 256 KB Local Store (LS) and its single register file of 128 entries, each 128 bits wide. Each SPE has a Direct Memory Access (DMA)


[Figure: same A3 model diagram as Figure 2.2.]

Figure 2.3: A3 model for the project. Highlighted in red: the architecture analyzed in this section.

interface and a channel interface for communicating with its Memory Flow Controller (MFC) and all the other processors (PPE and SPEs). The SPE is intended to run its own program, which resides in the LS, and not to run an operating system [18, p. 63]. The architecture overview of the SPE is presented in Figure 2.6.

The SPU functional unit, as shown in Figure 2.7, consists of a local store (LS), where all instructions and data used by the SPU are stored, a Synergistic Execution Unit (SXU), which executes all the instructions, and an SPU Register File Unit (SRF), which stores all data types, return addresses and results of comparisons. The SXU includes six execution units:

- SPU Odd Fixed-Point Unit (SFS), which executes byte-granularity shift, rotate mask and shuffle operations on quadwords.

- SPU Even Fixed-Point Unit (SFX), which executes arithmetic instructions, logical instructions, word shifts and rotates, floating-point compares, and floating-point reciprocal and reciprocal square-root estimates.

- SPU Floating-Point Unit (SFP), which executes single-precision and double-precision floating-point instructions, integer multiplies and conversions, and byte operations. It can perform fully pipelined single-precision (32-bit) floating-point instructions and partially pipelined double-precision (64-bit) instructions.

- SPU Load and Store Unit (SLS), which executes load and store instructions. It also handles DMA requests to the LS.

- SPU Control Unit (SCN), which fetches and issues instructions to the two pipelines, executes branch instructions, arbitrates access to the LS and register file, and performs other control functions.

- SPU Channel and DMA Unit (SSC), which enables communication, data transfer, and control into and out of the SPU; its functions are shared with the associated DMA controller in the Memory Flow Controller (MFC).


[Figure: block diagram of the Cell Broadband Engine, with the PPE and the eight SPEs attached to the EIB, the MIC connecting to two XIO/RAM channels, and the BEI providing two FlexIO I/O interfaces (IOIF_0 and IOIF_1).]

BEI: Cell Broadband Engine Interface; EIB: Element Interconnect Bus; FlexIO: Rambus FlexIO Bus; IOIF: I/O Interface; MIC: Memory Interface Controller; PPE: PowerPC Processor Element; RAM: Resource Allocation Management; SPE: Synergistic Processor Element; XIO: Rambus XDR I/O (XIO) cell.

Figure 2.4: Architecture overview of the Cell Broadband Processor. The Element Interconnect Bus connects all processor elements and all chip controllers for memory and input/output access. The Cell Broadband Engine has 1 PowerPC processor element and 8 synergistic processor elements. Adapted from [18, p. 37].

The Synergistic Execution Unit (SXU) is divided into an even and an odd pipeline (pipeline 0 and 1, respectively), and it can complete up to two instructions per cycle, one on each pipeline [18, p. 68]. Examining the SXU, the odd pipeline provides the data-moving units and the even pipeline provides the data-processing units. Furthermore, each unit of the SXU has a 128-bit-wide datapath, which gives the capability to use Single Instruction Multiple Data (SIMD) instructions. If the SXU is working with 32-bit-wide data, it can thus perform 4 operations with each instruction.

On the Playstation 3, the SPU runs at a frequency of 3.2 GHz. With 32-bit-wide data, each SPU can thus theoretically provide 2 × 4 × 3.2 = 25.6 GFLOPS (one instruction on each pipeline and 4 operations per instruction). With 6 SPUs available, this yields a total of 153.6 GFLOPS [18, p. 5].

It must be noted that single-precision floating-point operations do not conform to IEEE 754, because of the following differences:

- Truncation is used in rounding.
- Denormal numbers are treated as zero.
- NaNs are interpreted as normalized numbers.

Double-precision floating-point operations do not have this problem [18, p. 68-69].


[Figure: the PPE, containing a PowerPC Processor Unit (PPU) with L1 instruction and data caches, on top of a PowerPC Processor Storage Subsystem (PPSS) with the L2 cache.]

Figure 2.5: Architecture overview of the PPE, which consists of a PowerPC processor unit (PPU) and a PowerPC processor storage subsystem (PPSS). It has 32 KB level-1 (L1) instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. Adapted from [18, p. 49].

[Figure: the SPE, containing a Synergistic Processor Unit (SPU) with its Local Store (LS), and a Memory Flow Controller (MFC) with its DMA controller.]

Figure 2.6: Architecture overview of the SPE, which consists of a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). The SPU has an LS of 256 KB. Adapted from [18, p. 63].


[Figure: the SPU functional unit, with the even pipeline (SFX, SFP) and the odd pipeline (SFS, SLS, SCN, SSC) connected to the Local Store (LS) and the SPU Register File Unit (SRF) inside the Synergistic Execution Unit (SXU).]

Figure 2.7: Architecture overview of the SPU functional unit. The 256 KB Local Store (LS) is filled from the Element Interconnect Bus (EIB) via the MFC. The SXU contains two fixed-point units and a floating-point unit. The odd pipeline takes care of moving data (fetching instructions to the pipelines, loading and storing data between the LS and the register file, which has 128 entries of 128 bits each), while the even pipeline takes care of data processing (arithmetic and logic instructions). Adapted from [18, p. 64].


2.3.1.4 Element Interconnect Bus (EIB)

One of the main components of the Playstation 3 is the EIB, which connects all the components together, including the PPE, the SPEs, main memory and all inputs/outputs. The bus has a bandwidth of 25.6 GB/s (96 bytes per clock cycle) and enables multiple concurrent data transfers [18, p. 42].

    2.3.1.5 Memory Interface Controller (MIC)

The Memory Interface Controller (MIC) provides the interface between the EIB and the physical memory. It supports one or two Rambus extreme data rate (XDR) memory interfaces, which together support between 64 MB and 64 GB of XDR DRAM memory [18, p. 42].

    2.3.1.6 Memory System

The Playstation 3 has dual-channel Rambus extreme data rate (XDR) memory; however, the platform provides a modest amount of 256 MB, of which only 200 MB are available for the Linux OS and the applications [19, p. 7]. The SPU accesses the RAM through the EIB and moves the data to its LS via DMA transfers, with the MFC of the SPU acting as the DMA controller.

    2.3.1.7 Cell Broadband Engine Interface (BEI)

The Cell Broadband Engine Interface (BEI) unit supports I/O interfacing. It manages data transfers between the EIB and I/O devices. The BEI supports two Rambus FlexIO interfaces. One of the two interfaces (IOIF1) supports only a noncoherent I/O interface (IOIF) protocol, which is suitable for I/O devices. The other interface (IOIF0) is software-selectable between the noncoherent IOIF protocol and the memory-coherent Cell Broadband Engine interface protocol [18, p. 42].

2.3.2 Programming the CBE

The programming of the CBE is split into two main tasks: programming the PPE, which manages the utilization of the SPEs, and programming what is executed on the SPEs.

    2.3.2.1 Development platform

The platform used for the project is a PlayStation 3 with a monitor, keyboard, mouse and LAN connection for remote access. The PlayStation 3 is set up with a Linux operating system and a set of development tools:

- Fedora 8, Linux kernel 2.6.23.1-42.fc8
- IBM SDK 3.0 for the CBE architecture, including:
  - gcc compiler toolchain for the CBE (ppu-gcc and spu-gcc ver. 4.1.1)
  - libspe2 - SPE runtime management library ver. 2.2
  - Makefile definitions from the SDK


2.3.2.2 Creating a simple application on an SPE

Generally, applications do not have physical control of the SPEs; the operating system manages these resources. Applications instead use software constructs called SPE contexts, which are a logical representation of an SPE. The SPE Runtime Management Library (libspe) provides all the functions to manage the SPEs, as well as the means for communication and data transfer between the SPEs and the PPE. The flow of running a single SPU program context, as shown in Figure 2.8, is: create an SPE context; load an SPE executable object into the SPE context local store (LS); run the SPE context, whereby the operating system requests the actual scheduling of the context onto a physical SPE; and lastly destroy the SPE context in order to free the memory resources used by the context. It must be noted that running the SPE context is a synchronous call to the operating system, and thus the calling application blocks until the SPE stops executing [20, p. 1]. All functions for SPE context management are described in [20].

[Figure: four steps in sequence - create an SPE context; load an SPE executable object into the SPE context LS; run the SPE context; destroy the SPE context.]

Figure 2.8: The flow for running a simple application using an SPE.
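The four steps can be sketched on the PPE side with libspe2 as follows. This is a minimal sketch, not the project's code: it requires the IBM Cell SDK to build and run, and `fft_spu` is a hypothetical embedded SPE program handle produced by ppu-embedspu.

```c
/* Minimal PPE-side flow for running one SPE context (requires the IBM
   Cell SDK and libspe2; will not build on ordinary hardware).
   fft_spu is a hypothetical embedded SPE executable handle. */
#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>

extern spe_program_handle_t fft_spu;   /* embedded SPE executable */

int main(void)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx;

    /* 1. Create an SPE context. */
    if ((ctx = spe_context_create(0, NULL)) == NULL) {
        perror("spe_context_create"); return EXIT_FAILURE;
    }
    /* 2. Load the SPE executable object into the context's LS. */
    if (spe_program_load(ctx, &fft_spu) != 0) {
        perror("spe_program_load"); return EXIT_FAILURE;
    }
    /* 3. Run the context: this call blocks until the SPE stops. */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
        perror("spe_context_run"); return EXIT_FAILURE;
    }
    /* 4. Destroy the context to free its resources. */
    spe_context_destroy(ctx);
    return EXIT_SUCCESS;
}
```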

2.3.2.3 Creating an application on several SPEs

In order to run faster, the project needs to use multiple SPEs concurrently. To achieve this, the application must create at least as many threads as concurrent SPE contexts are required. The library used for this is libspe2, together with POSIX (Portable Operating System Interface) threads [20, p. 41]. The flow of running an application on several SPEs is shown in Figure 2.9.

Each of these threads may run a single SPE context at a time. If N concurrent contexts are required, it is common to have a main application thread and, beside it, N threads dedicated to SPE context execution. Each dedicated thread issues a request for its context to be run and becomes blocked until the context finishes execution. This blocking is not a problem for the main program thread, because it can still create as many threads as needed. If all SPEs are busy, the contexts are queued up and executed in the same order as their threads were created.

Finally, when all the threads have finished, the main program thread destroys the no longer needed SPE contexts.

    2.3.2.4 Project directory structure

In order to program the Cell Broadband Engine, the source code is arranged into two folders, one for the PPU code and one for the SPU code. Furthermore, to use the makefile definitions supplied by the SDK for producing programs, the line "include $(CELL_TOP)/buildutils/make.footer" has to be included in the makefile. The project directory structure is shown in Figure 2.10.

    2.3.2.5 Program compilation

Building the application for the Cell BE requires several steps, as shown in Figure 2.11.

First, all .c files in the ppu folder are compiled using ppu-gcc for the PPE programs, and all .c files in the spu folder are compiled using spu-gcc for the SPE programs. Next, spu-gcc creates SPE executables from the compiled SPE programs. These executables are embedded into the PPE programs by first creating embedded


[Figure: flow - create N SPE contexts; load an SPE executable object into each context; create N threads; run one SPE context in each thread; stop each thread; wait for all N threads to stop; destroy all N SPE contexts.]

Figure 2.9: The flow of running an application using several SPEs.

PPE images of the SPE executables (using ppu-embedspu), then creating PPE libraries (using ppu-ar), and finally compiling the PPE programs again, merging them with the SPE libraries to obtain the final program FFT (using ppu-gcc).

    2.4 FFT algorithms

This section is an introduction to selected FFT algorithms, with a complete analysis of each algorithm that will be parallelized in Chapter 3. According to the A3 design model, this section belongs to the algorithm domain, as illustrated in Figure 2.12. First of all, the selection of the algorithms is discussed. Then, the different mathematical forms of the algorithms are developed. At last, the computational costs are compared.

An FFT algorithm computes the Discrete Fourier Transform (DFT) with a minimum of operations. Indeed, a direct application of the DFT definition has computational complexity O(n^2); the purpose of FFT algorithms is to split the transform so as to obtain a complexity of O(n log n).


Makefile in the program directory:

    # Subdirectories
    DIRS = ppu spu

    # make.footer
    include $(CELL_TOP)/buildutils/make.footer

Makefile in directory ppu:

    # Target
    PROGRAM_ppu = main

    # make.footer
    include $(CELL_TOP)/buildutils/make.footer

Makefile in directory spu:

    # Target
    PROGRAM_spu = fft_spu

    # make.footer
    include $(CELL_TOP)/buildutils/make.footer

Figure 2.10: Project directory structure, which yields two subfolders: one for the ppu program code and one for the spu program code.

    2.4.1 Overview

Many algorithms exist to compute the FFT. The most common one is called Cooley-Tukey (CT). It uses a divide-and-conquer approach based on recursion: the recursion divides a Discrete Fourier Transform into several smaller DFTs, at the cost of O(n) multiplications per recursion level by twiddle factors, the trigonometric constant coefficients that are multiplied in during the course of the algorithm developed in Section 2.4.2. James Cooley and John Tukey published this method in 1965, but the algorithm had originally been devised by Carl Friedrich Gauss in 1805. The best-known use of the CT algorithm is the division of the transform into two parts of similar size.

    2.4.2 Discrete Fourier Transform

The Discrete Fourier Transform (DFT) presented in [21] is a mathematical tool for digital signal processing (spectral analysis, data compression, partial differential equations, ...), similar to the Continuous Fourier Transform (CFT), which is used for analog signal treatment. The formula is shown below:

X[k] = \sum_{n=0}^{N-1} x[n] \exp\left(\frac{-2\pi j\, n k}{N}\right)   (2.1)

X[k] = \sum_{n=0}^{N-1} x[n] \, W_N^{nk}   (2.2)

W_N^{nk} = \exp\left(\frac{-2\pi j\,(nk)}{N}\right)   (2.3)

where W_N^{nk} is known as the twiddle factor. The time-domain input data x[n] is a finite series of N samples, n = [0, 1, ..., N-1], and is transformed to the frequency-domain signal X[k], where k = [0, 1, ..., N-1].

    2.4.3 Cooley-Tukey

This section presents a theoretical analysis of the radix-2 DIT FFT. First of all, the DFT formula is developed to obtain the radix-2 DIT formula. Then, a data path derivation is shown, in order to optimize the implementation in programming-language code.


[Figure: the compilation flow from the ppu and spu source folders (sharing common.h) through .o object files, SPE executables, embedded PPE images (*-embed.o) and PPE libraries (*.a), to the final program FFT.]

Figure 2.11: Flow of CBE program compilation. First, the .c files of the PPE programs and SPE programs are compiled using ppu-gcc and spu-gcc, respectively. The compiled SPE programs are used to create SPE executables, which are turned into embedded PPE images and PPE libraries, which are finally linked in to obtain the final program FFT.

    2.4.3.1 Radix 2 DIT FFT

This section presents the radix-2 FFT implementation [22] used for testing against the Edelman and Sørensen FFT algorithms. It is used because it is one of the simplest FFT algorithms, which is valuable for two reasons: it is well studied, so it can serve as a baseline for comparison, and it is a good way to get acquainted with the FFT. First, the analytical radix-2 derivation of the DFT is presented.

The radix-2 decimation-in-time rearranges the DFT equation into two parts: a sum over the even-numbered indices n = [0, 2, 4, ..., N-2] and a sum over the odd-numbered indices n = [1, 3, 5, ..., N-1], as in the following equations:

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-\frac{2\pi j}{N}(2m)k} + \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-\frac{2\pi j}{N}(2m+1)k}   (2.4)

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-\frac{2\pi j}{N}(2m)k} + e^{-\frac{2\pi j}{N}k} \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-\frac{2\pi j}{N}(2m)k}   (2.5)

X_k = \mathrm{DFT}_{N/2}[x(0), x(2), \ldots, x(N-2)] + W_N^k \cdot \mathrm{DFT}_{N/2}[x(1), x(3), \ldots, x(N-1)]   (2.6)

X_k = \mathrm{Even}(k) + W_N^k \cdot \mathrm{Odd}(k)   (2.7)


[Figure: same A3 model diagram as Figure 2.2.]

Figure 2.12: A3 model for the project. Highlighted in red: the algorithms analyzed in this section.

where k = [0, 1, ..., N-1]. The previous simplifications show that the radix-2 DIT DFT can be computed as the sum of two length-N/2 DFTs: one over the even indexes, and the other over the odd indexes, the latter multiplied by the twiddle factor W_N^k = e^{-\frac{2\pi j}{N}k}. Whereas the direct DFT computation requires N^2 complex multiplications and N^2 - N complex additions, the radix-2 DIT rearrangement costs only N^2/2 + N complex multiplications and N^2/2 complex additions.

    2.4.3.2 Data path Derivation

One can notice that the radix-2 DIT simplification is recursive. This kind of expression is simple, but not optimal to implement in programming-language code because of memory consumption and scheduling; that is why iterative algorithms are generally preferable.

Another property is described below. The even and odd parts are periodic with period N/2, so Odd(k + N/2) = Odd(k) and Even(k + N/2) = Even(k). In addition, the twiddle factor satisfies W_N^{k+N/2} = -W_N^k. The equation may now be expressed as:

X_k = \mathrm{Even}(k) + W_N^k \cdot \mathrm{Odd}(k)   (2.8)

X_{k+N/2} = \mathrm{Even}(k) - W_N^k \cdot \mathrm{Odd}(k), \quad k = 0, 1, \ldots, \frac{N}{2}-1   (2.9)

The decimation of the data sequence can be repeated again and again until the resulting sequences are reduced to one-point sequences. Thus, for N = 2^n, this decimation can be performed n = \log_2 N times. Therefore, the total number of complex multiplications is reduced to (N/2) \log_2 N and the number of additions to N \log_2 N.


[Figure: the 8-point decimation-in-time butterfly diagram, with bit-reversed inputs x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7), three butterfly stages with twiddle factors W_8^0 to W_8^3 and sign inversions, and natural-order outputs X(0) to X(7).]

Figure 2.13: Eight-point decimation-in-time algorithm.

One can observe that the computation is divided into three stages: four two-point DFTs, then two four-point DFTs, and finally one eight-point DFT. Another important observation concerns the order of the input data: the indices have to be bit-reversed to obtain the correct input sequence for the corresponding natural-order output.

2.4.4 Sørensen

The Sørensen FFT [16] (SFFT) algorithm is used in the project as a test algorithm, like Radix-2 DIT and Edelman. It is also known as Transform Decomposition. Its principle differs from that of standard algorithms, like Radix-2 DIT or Split-Radix, in terms of the number of input and output data points. Standard algorithms assume that both numbers are equal, as seen in Figure 2.13, where all the data are computed. The SFFT proceeds differently: only some output points are said to be of interest, and only these points are computed, as illustrated in Figure 2.14.

Consider the DFT definition (2.2):

X[k] = \sum_{n=0}^{N-1} x[n] \, W_N^{nk}   (2.10)

where k = [0, 1, ..., N-1]. The SFFT supposes that only L output points are of interest. There exist two lengths, P and Q, such that N


[Figure: the transform decomposition structure - an input mapping of x(0) ... x(N-1) into Q subsequences x_0 ... x_{Q-1} of length P, then Q length-P FFTs producing X_0 ... X_{Q-1}, and a recombination with twiddle factors W_N^0 ... W_N^{Q-1} producing the outputs of interest X(0) ... X(L-1).]

Figure 2.14: There are N inputs, but only one output (X(k) in this example) is computed and used for further operation. The way this is done is explained in the following paragraphs. Modified from Sørensen and Burrus, 1993, figure 4 [16].

divided by P defines Q as:

Q = N/P   (2.11)

n = Q n_1 + n_2   (2.12)

with n_1 = [0, ..., P-1] and n_2 = [0, ..., Q-1]. So the DFT equation (2.2) becomes:

X[k] = \sum_{n_2=0}^{Q-1} \sum_{n_1=0}^{P-1} x[Q n_1 + n_2] \, W_N^{(Q n_1 + n_2)k}   (2.13)

X[k] = \sum_{n_2=0}^{Q-1} \left( \sum_{n_1=0}^{P-1} x[Q n_1 + n_2] \, W_P^{n_1 \langle k \rangle_P} \right) W_N^{n_2 k}   (2.14)

where \langle k \rangle_P is k modulo P. Defining the inner sums as length-P DFTs gives:

X[k] = \sum_{n_2=0}^{Q-1} X_{n_2}[\langle k \rangle_P] \, W_N^{n_2 k}   (2.15)

X_{n_2}[j] = \sum_{n_1=0}^{P-1} x[Q n_1 + n_2] \, W_P^{n_1 j}   (2.16)

X_{n_2}[j] = \sum_{n_1=0}^{P-1} x_{n_2}[n_1] \, W_P^{n_1 j}   (2.17)

x_{n_2}[n_1] = x[Q n_1 + n_2]   (2.18)


Equation (2.17) is the equation of a length-P DFT and can be computed with any FFT algorithm, such as Radix-2 DIT or Split-Radix. Sørensen's paper states that the Split-Radix FFT is better in terms of number of operations, but since the Radix-2 FFT has been used previously in the project, it is preferable here for comparison with the previous results.

Equation (2.15) shows that Q FFTs of length P have to be computed, as illustrated in Figure 2.14.

    2.4.4.1 Complexity

The SFFT complexity depends on the number P, which determines Q, the number of length-P FFTs that have to be performed. The complexity then also depends on the complexity of the FFT algorithm used, Radix-2 DIT or Split-Radix.

    2.5 Conclusion of the Analysis section

The Analysis chapter presents the theoretical background of the subject developed in this project. The A3 design methodology is used to organize the project and permits establishing simple, well-defined parts. The application is defined as developing an OFDM receiver for LTE. Then, the algorithm part describes the two FFT algorithms to be used: Radix-2 DIT and Sørensen. The last part corresponds to the architecture on which the algorithms are implemented, namely the Cell Broadband Engine.

The analysis of the Cell Broadband Engine shows a multiprocessor architecture containing one PPE managing the communication between the 6 usable SPEs (out of 8 on the Playstation 3 platform). The instructions and data flow through the Element Interconnect Bus (EIB), which connects the PPE, the SPEs and the memories. The SPUs contain a RISC core and are constructed with two pipelines, each of which can execute an instruction every cycle. Moreover, the datapaths of the arithmetic functional units inside the SPUs are wide (128 bits), allowing the use of Single Instruction Multiple Data (SIMD) instructions. Together, these features produce a processor optimized for computation.

The last part of the Analysis chapter concerns the FFT algorithms. First of all, the FFT (or its inverse, the IFFT, also written FFT^{-1}) is a tool used in digital signal processing for the transformation from the time domain to the frequency domain, where the useful frequencies can be separated from the added noise. In the case of OFDM, the transmitter contains an IFFT, which transforms the digital symbols into the time-domain signal for transmission. The receiver performs the inverse computation to retrieve the transmitted data from among the noise. These operations are time-expensive, and an efficient multicore architecture designed for computation can reduce the computation time. The project group chooses the Radix-2 DIT as the first algorithm to implement on the CBE; it is considered one of the simplest FFT algorithms and is based on the paper by J. Cooley and J. Tukey. The second algorithm to be implemented on the CBE is the Transform Decomposition algorithm, referred to as Sørensen. This algorithm, somewhat more complex than Radix-2 DIT, permits speeding up the computation.

The next chapter deals with the experiments of implementing the FFT algorithms, first on the PPU, then on one SPU and finally on several SPUs. The Radix-2 DIT is the first algorithm used, followed by the Sørensen algorithm.


Chapter 3

Implementation

    3.1 Overview

This chapter puts into practice the theoretical analysis developed in Chapter 2. It contains the results of the tests on one or several processors, with different FFT algorithms. All these results are evaluated, compared and discussed. According to the A3 design model, this chapter belongs to both the Algorithm and Architecture domains, i.e. the mapping of the algorithm onto the architecture, as illustrated in Figure 3.1.

    3.2 Cooley-Tukey Implementation

    3.2.1 Overall Approach

The tests are carried out with the CT algorithm. First of all, Matlab is used to obtain reference results: the fft function is used to verify that the results of the implementations are correct. This verification covers only the computation results; the Matlab computation time is of no interest. As mentioned in Section 2.4.3, the CT algorithm is one of the simplest existing FFT algorithms; therefore it is selected for the initial tests, as its sequential implementation is straightforward. These tests also provide elements of comparison for the subsequent implementations.

Then, various types of tests are performed. All the following tests are carried out 10000 times to ensure that the results are meaningful (since the execution is not fully deterministic, due to architectural and OS hazards). The first test is a sequential execution on the main processor (PPU). The second is also a sequential computation, but on an SPU (without data transfer). These two tests allow seeing the computational difference between the two processors. Then, the parallel implementation on 6 SPUs is performed to evaluate the potential improvement.

Two parameters are evaluated during these tests: the computation time and the number of operations per second. The time measurement is realized with the function gettimeofday [23] and covers the execution of the bit-reverse function, the twiddle factor computation and the butterfly computation. The following formula gives the number of operations per second:

\text{Number of operations per second} = \frac{10^4 \cdot \frac{N}{2} \log_2 N}{\text{total time}}   (3.1)

where N is the length of the FFT and 10^4 is the number of repetitions.


[Figure: same A3 model diagram as Figure 2.2.]

Figure 3.1: A3 model for the project. Highlighted in red: the mapping developed in this section.

This formula corresponds to the complexity of the CT butterflies seen in Section 2.4.3.2. The computation of the twiddle factors and the bit-reverse function does not affect equation (3.1), because no floating-point operations are counted for these functions: bit reverse is only data movement, and the twiddle factor computation involves only cosines and sines, which, in this case, are not counted as floating-point operations.

    3.2.2 Results

A graphical representation of the results is shown in Figure 3.2. This graph shows the computation time of the sequential executions on the PPU and on the SPU. Both are almost linear, which is expected: when the FFT length is doubled, the computation time almost doubles as well. Note also that the SPU computation times are larger than the PPU ones, and that the difference between the two increases with the FFT length.

The graph in Figure 3.3 depicts the computation time of the parallel implementation as a function of the FFT length; here the CT algorithm is parallelized on the 6 SPUs of the Cell BE. The larger the FFT length, the larger the computation time. This is an unexpected result: firstly, the computation time of the parallelized version is larger than that of the sequential one; secondly, the execution time grows with the FFT length. The explanation is that the data transfers between the main storage (RAM) and the local storage (LS) are very long compared to the computation, i.e. data transfers are a bottleneck and the SPUs remain idle for significantly long periods of time. Moreover, no optimizations have been implemented so far.

Finally, the number of operations per second is plotted against the number of processors, as depicted in Figure 3.4. For an FFT length of 1024, it can be observed that the number of operations per second (in MFLOPS) decreases when the number of processors increases.

Considering the results of the previous tests (computation on 6 SPUs, Figure 3.3), this result was expected: much time is spent on data transfers as the number of processors increases, so the number of operations per second (i.e. actual computations) is very low compared to the transfer times.


[Plot: computation time (s) versus N, number of points]

Figure 3.2: Computation time of a sequential radix-2 FFT implemented on the PPU (dashed blue) and one SPU (continuous red) for different FFT lengths (ranging from 4 to 1024).


[Plot: computation time (s) versus N, number of points]

Figure 3.3: Computation time of a parallel radix-2 FFT implemented on 6 SPUs for different FFT lengths, ranging from 4 to 1024.


[Plot: number of operations per second versus N, number of SPUs]

Figure 3.4: Number of operations per second for a parallel radix-2 FFT implemented on different numbers of processors (from 1 to 6).

    3.2.3 Optimizations

Intuitively, one would expect that increasing the parallelism would increase the number of operations per second. However, the opposite effect has been observed in the results described above. Therefore, the group members have decided to evaluate whether it is possible to reduce the computation time by means of several optimization techniques, as described in what follows.

Problem of data transfers: the time for performing data transfers between the PPU and the SPUs is higher than the computation time. Several methods have been used to reduce this time, as described in the following paragraphs.

    3.2.3.1 Deterministic twiddle factors

The twiddle factors are stored as constants on the SPU. If the FFT length is fixed, the twiddle factors are deterministic (they can be computed in advance). Instead of passing them as arguments to the SPU, they are stored in the Local Storage of the SPU. The twiddle factors are complex values with real and imaginary parts. Assuming 32-bit floats and a 1024-point FFT, the size of these data is:

512 (twiddle factors) × 2 (floats: real and imaginary part) × 4 (bytes per float) = 4096 bytes

This is not a problem for the LS, since it is only 4,096 bytes out of the 256 KB available. This technique avoids wasting precious EIB bandwidth.


    3.2.3.2 Double Buffering

One of the methods to transfer data between the PPU and the SPU uses Direct Memory Access (DMA). This section presents a technique called double buffering. To perform computation on the SPU, the program has to transfer data from main storage to the LS using DMA transfers. For example, consider an SPU program that repeats the following steps:

1. Fetch data using DMA from the main storage into the LS buffer B,

2. Wait for the transfer to complete,

3. Compute on the data in buffer B.

This sequence is not efficient because the SPU has to wait for the transfer to complete before it can compute on the data in the buffer, which wastes much time. Figure 3.5 illustrates this scenario.

[Timeline: DMA input and compute alternating serially over the first and second iteration]

Figure 3.5: Serial computation and data transfer. Modified from [24]

This process can be significantly accelerated by using double buffering. Two buffers, B0 and B1, are allocated, allowing computation on one buffer to overlap with the data transfer into the other one. The scheme is shown in Figure 3.6.

Double buffering is achieved by using tag-group identifiers [25]. All transfers involving buffer B0 (respectively B1) are assigned to tag-group ID 0 (respectively ID 1). The software then sets the tag-group mask to include only tag ID 0 (respectively tag ID 1) and requests a conditional tag status update. This ensures that computation does not begin before the transfer into the corresponding buffer is complete. Figure 3.7 shows the resulting execution in time.

    Double buffering is used in the project to transfer the data structure from the PPU to the SPU. This

    structure is described below:

[Flowchart: initiate DMA transfer from EA to LS buffer B0; initiate DMA transfer from EA to LS buffer B1; wait for the DMA transfer to buffer B0 to complete; compute on the data in buffer B0; re-initiate a DMA transfer from EA to LS buffer B0; wait for the DMA transfer to buffer B1 to complete; compute on the data in buffer B1; repeat]

Figure 3.6: Double Buffering scheme. Modified from [24]


[Timeline: DMA input and compute overlapping, buffers B0 and B1 alternating]

Figure 3.7: Parallel computing and transfer. Double buffering is more efficient than the approach presented in Figure 3.5, as the SPU does not have to wait for the data: one part can be computed in buffer B0 while the next data is being DMA-transferred to B1. Modified from [24]

typedef struct complex {
    float real;
    float imag;
} complex_t;

typedef struct {
    complex_t *input;
    complex_t *output;
    complex_t *twiddle;
    int count;
} spe_arg_t;

The structure spe_arg_t is passed as an argument from the PPU to the SPU. While the computation of one butterfly is being performed on the data brought in by the first buffer transfer, the second buffer is transferring the data for the computation of the next butterfly. Although the constant twiddle factors and the double buffering method have been implemented, no significant improvement of the data transfer time has been observed (since the results are the same with or without these two methods, the corresponding numbers are not repeated here).

    3.2.3.3 Large amount of data

After further consideration, the group members hypothesized that, to gain anything from double buffering, a larger amount of data must be transferred per request: the EIB only becomes efficient if it can work for longer durations of time. So, in a new experiment, instead of sending the input data in many small transfers, half of the data was sent to the SPU at once; the calculations then started while the other half was being sent. Although this method has been implemented, no improvement of the computation time has been measured.

    3.2.3.4 Computation of several stages on the same SPU

The goal of all the previous optimizations is to reduce the data transfer time. As shown in Figure 3.8, the first four data points (x(0), x(4), x(2), x(6)) are used together in stages 1 and 2, which means that only one transfer from the PPU to the SPU is necessary to compute these four values in stages 1 and 2. If this method is applied to a 1024-point FFT on 4 SPUs, 256 data points (1024/4) are transferred to each SPU. Each SPU then computes 128 butterflies (2^7) per stage, so each SPU can compute the first seven stages with only one transfer of 256 data points. This optimization is only possible for a power-of-two number of processors. The last three stages are then computed on the 4 SPUs with the method described earlier.


[Diagram: 8-point radix-2 DIT butterfly network; inputs in bit-reversed order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7); outputs X(0) to X(7); twiddle factors W8^0 to W8^3 across stages 1, 2 and 3; upper half mapped to SPU1, lower half to SPU2]

Figure 3.8: Eight-point decimation-in-time algorithm, implemented on two SPUs. The first two stages are computed with only one transfer from the main storage to the LS. Modified from [24]

The results are interesting: by means of this method, the computation time is improved. For a 1024-point FFT, the time without this optimization is 30 ms on 2 SPUs; with it, the computation time is 7 ms. This result shows two things. Firstly, the data transfer time is indeed the problem (which was only a supposition until this point): the time is divided by 4.3 thanks to sending less data. Secondly, the improvement is not sufficient, because the computation time of the parallel implementation is still larger than that of the sequential one. Another algorithm (Sørensen) has been analysed in section 2.4.4; its implementation is developed in the following section. This implementation is better than Radix-2 DIT in terms of computation time, as shown in section 3.3.

3.3 Sørensen Implementation

    3.3.1 Overall Approach

The following tests are carried out on the Sørensen algorithm. The reference results come from Matlab and are the same as those presented in section 3.2. This implementation allows comparing the results with those of the CT algorithm. According to the theoretical analysis in section 2.4.4, the results should be better (in terms of execution time) with Sørensen than with CT radix-2. Indeed, the Sørensen algorithm divides a large FFT into small FFTs, which facilitates the parallelization. Various tests are performed on Sørensen; however, in order to compare the results with those of CT, the same types of tests as those used for CT


are carried out. Two sequential implementations on the PPU are performed: one with Q set to 2 and the other with Q set to 4 (Q being the number of small FFTs, as seen in section 2.4.4). Then, the parallel implementation is tested to see the potential improvement, again with the two values of Q (2 and 4); the parallel implementation is therefore performed on 2 and 4 SPUs. The same parameters as for the CT algorithm are measured. The measurement of the computation time covers the reordering, compute_fft and recombination functions; the function used to measure the time is still gettimeofday [23]. Then, the number of operations per second is evaluated as well, but with a different formula, because the complexity of the computation is not the same as for CT. The formula is given in equation 3.2:

Number of operations per second = (5 × Q × P × log2(P) + 8 × (Q − 1) × L) / total time    (3.2)

where Q is the number of small FFTs, P the number of input data points for each small FFT and L the desired number of output data points.

The number of operations per second only concerns the computation of the small FFTs and the recombination function. The reordering function involves no floating-point computations; it only consists of reordering the data by means of data moves.

    3.3.2 Results

The graph in Figure 3.9 shows the computation time of the sequential execution on the PPU. There are two curves: one (continuous red) for a division of the large FFT into two smaller ones (Q = 2) and another (dashed blue) for a division into four (Q = 4). The execution time for Q = 2 is always smaller than for Q = 4. That is expected, because the complexity depends on the chosen subdivision factor, which defines Q, the number of small FFTs performed: the larger Q is, the larger the number of computations, and for a sequential execution the time increases with the amount of calculation. This explains the behaviour of these measurements. Moreover, the two curves are almost linear, which is also expected, since increasing the number of input data points increases the computation time.

Figure 3.10 shows the computation time of the two parallel implementations (Q = 2 and Q = 4, i.e. 2 and 4 SPUs, respectively) as a function of the FFT length (from 4 to 1024). It appears that the execution time is always larger for the parallelization on 4 SPUs. It can thus be deduced that the problem still comes from the time needed to transfer the data between the PPU and the SPUs, as for the parallel implementation of the CT algorithm. However, the positive aspect in this case is that the computation time becomes almost constant when the FFT length increases (due to the effect of the pipeline). The computation time of the parallel implementations is still always larger than that of the sequential one. Moreover, the 4-SPU execution is slower than the 2-SPU one. This is an expected result, because there are only 2 data transfers for Q = 2 whereas 4 are needed when Q = 4.

3.3.3 Comparison with the CT algorithm

The goal of this section is to compare the results of the Sørensen implementation with the different measurements obtained for the CT Radix-2 DIT implementation. Indeed, although several optimizations have been applied to the CT implementation, the computation times, due to the data transfers between the PPU and the SPUs, remain larger than the sequential execution time. Figure 3.11 shows the parallel implementation of the Sørensen algorithm (Q = 2, i.e. 2 SPUs) and of CT (on 6 SPUs).

Although the parallel implementation of Sørensen also suffers from the data transfer problem, it appears to be better than Radix-2 DIT in terms of computation time. The explanation is simple: in Sørensen's algorithm, the data are transferred once and only once for the computation of one small FFT, whereas for Radix-2 DIT, even with the optimizations, the data are transferred four times (N = 1024), as seen in section 3.2.2. The conclusion is that Sørensen is better suited for parallel implementation, due to its


[Plot: computation time (s) versus N, number of points]

Figure 3.9: Computation time of a sequential Sørensen FFT implemented on the PPU for Q = 2 (continuous red) and Q = 4 (dashed blue) with different FFT lengths (ranging from 4 to 1024).


[Plot: computation time (s) versus N, number of points]

Figure 3.10: Computation time of a parallel Sørensen FFT implemented on 2 SPUs (Q = 2, continuous red) and 4 SPUs (Q = 4, dashed blue) with different FFT lengths (ranging from 4 to 1024).


[Plot: computation time (s) versus N, number of points]

Figure 3.11: Comparison of the computation time of a parallel Sørensen FFT for Q = 2 (i.e. 2 SPUs, dashed green), for Q = 4 (i.e. 4 SPUs, dotted blue) and of a parallel CT Radix-2 DIT FFT on 6 SPUs (continuous red), with FFT lengths from 4 to 1024.


[Plot: number of operations per second versus N, number of points]

Figure 3.12: Comparison of the number of operations per second for a parallel Sørensen FFT (Q = 2, dashed blue) and a parallel Radix-2 DIT FFT on 6 SPUs (continuous red) with different FFT lengths (ranging from 4 to 1024).

design as compared to CT. The time needed for the data transfers appears to be the main limiting factor, but with a better understanding of these transfers, the Sørensen algorithm is most likely readily applicable to this kind of parallel architecture. Another element of comparison is the number of computations per second. Figure 3.12 shows this quantity (in MFLOPS) for the Sørensen implementation on 2 processors and for the CT Radix-2 DIT algorithm on 6 SPUs, as a function of the FFT length (from 4 to 1024).

Please note that this is a different type of measure compared to the execution time (here, larger numbers indicate better performance). The number of floating-point operations per second is larger for the Sørensen implementation than for the Radix-2 one. This can be explained by the fact that the Sørensen algorithm performs more computations (compared to Radix-2): the recombination step (cf. section 3.3.1), performed after the computation of the small FFTs, adds some computations. Furthermore, Figure 3.11 has shown that the Sørensen FFT was faster than CT in terms of computation time. These two explanations, combined together, explain the trend observed in Figure 3.12.


Chapter 4

Conclusion & Perspectives

    4.1 Conclusions

The next step in mobile communication, LTE, focuses on the quality and speed of the data transfer. This is realized with the help of OFDM modulation/demodulation, which is the most widely used solution to problems such as ISI or fading. One of the key components of OFDM is the IFFT/FFT pair. To speed up the data transfer provided by OFDM, an improvement of the computation speed of the IFFT/FFT can be sought. With the latest multiprocessor platforms, the speed-up can be improved even further, provided that the data transfers between the different parts of the architecture are well managed.

The goal of this 9th semester ASPI project is to answer the problem defined in section 1.3 as follows: "How efficient, in terms of computation speed, is the Cell BE processor for the execution of parallelized FFT algorithms?"

First of all, an analysis of the Cell BE has been done to determine whether this multicore architecture can speed up the algorithms of a common digital signal processing tool, the FFT. It appears that the Cell BE, combined with the SIMD method, constitutes a processor optimized for computation, and is therefore able to improve the speed of execution of FFT algorithms. To evaluate the efficiency of parallelized FFT algorithms on the Cell BE processor, two FFT algorithms are used (Radix-2 DIT and Sørensen). The first uses log2(N) stages with N/2 butterflies at each stage, which means that each stage can be parallelized. This algorithm is assumed to be one of the simplest FFT algorithms: it is not the most efficient, but it is the easiest to implement. The Sørensen algorithm splits the FFT into smaller FFTs, so that each smaller FFT can be computed on one processor in a parallel schedule.

Then, during the implementation, these algorithms are first computed using only the PPE of the Cell BE.

Using the algorithms on the PPE alone provides a computation without any parallelization, serving as a reference for the same algorithms in their parallelized versions. Radix-2 DIT is implemented first. The comparison between the PPU-only implementation and the multiple-SPU implementation shows that the data transfers between the PPU and the SPUs waste time, and the results are unexpected, in the sense that they show less efficiency than the unparallelized algorithm. Optimizations, such as the double buffering method, are applied to reduce the data transfer time, but without any improvement. The Sørensen algorithm is implemented next and shows an improvement of the computation time in comparison with the Radix-2 DIT implementation. However, the results of this implementation are still below the theoretical computation power the Cell BE can provide.



    4.2 Perspectives

4.2.1 Short-term perspectives

The short-term perspective for this project concerns the usage of another op