CS 584

Post on 07-Jan-2016


Fast Fourier Transform

Used in many scientific applications

Transforms a periodic signal into the frequency spectrum of the signal

FFT

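Given a sequence <X[0], X[1], …, X[n-1]>

Transform into <Y[0], Y[1], …, Y[n-1]>

Where

$$Y[i] = \sum_{k=0}^{n-1} X[k]\,\omega^{ki}, \quad 0 \le i < n$$

and ω is a primitive n-th root of unity.

Direct evaluation takes O(n^2) operations.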
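As a minimal sketch of the direct evaluation, the double loop below computes every Y[i] straight from the definition, which is where the quadratic cost comes from. It assumes ω = e^(2π√-1/n); the slide does not state the sign of the exponent, and many libraries use the negative one. The name `naive_dft` is illustrative only.

```python
import cmath

def naive_dft(x):
    """Evaluate Y[i] = sum_k X[k] * w**(k*i) directly: n outputs, n terms each."""
    n = len(x)
    # Assumption: w = exp(2*pi*sqrt(-1)/n), a primitive n-th root of unity.
    w = cmath.exp(2j * cmath.pi / n)
    y = []
    for i in range(n):          # n output values ...
        total = 0j
        for k in range(n):      # ... each needing n multiply-adds
            total += x[k] * w ** (k * i)
        y.append(total)
    return y
```

A constant input such as `[1, 1, 1, 1]` puts all of its energy into Y[0], which makes a quick sanity check.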

FFT

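In 1965 Cooley and Tukey showed that the transform could be evaluated in O(n log n) operations by splitting the sum into its even- and odd-indexed terms, resulting in:

$$Y[i] = \sum_{k=0}^{(n/2)-1} X[2k]\,\tilde{\omega}^{ki} + \omega^{i} \sum_{k=0}^{(n/2)-1} X[2k+1]\,\tilde{\omega}^{ki}$$

where \tilde{\omega} = \omega^2 is a primitive (n/2)-th root of unity.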
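The O(n log n) bound follows from a short counting argument, sketched here under the assumption that n is a power of 2 and that combining the two half-size results costs c·n operations for some constant c. Each odd-indexed term contributes a factor ω^{(2k+1)i} = ω^{i} ω^{2ki}, so both halves are (n/2)-point transforms in ω²:

$$
\begin{aligned}
T(n) &= 2\,T(n/2) + c\,n \\
     &= 4\,T(n/4) + 2\,c\,n \\
     &\;\;\vdots \\
     &= n\,T(1) + c\,n \log_2 n = O(n \log n)
\end{aligned}
$$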

FFT

Procedure RecursiveFFT(X, Y, n, ω)
    if (n == 1)
        Y[0] = X[0]
    else
        RecursiveFFT(<X[0], X[2], …, X[n-2]>, <Q[0], Q[1], …, Q[n/2-1]>, n/2, ω^2);
        RecursiveFFT(<X[1], X[3], …, X[n-1]>, <T[0], T[1], …, T[n/2-1]>, n/2, ω^2);
        for i = 0 to n-1
            Y[i] = Q[i mod (n/2)] + ω^i * T[i mod (n/2)];
end
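The procedure can be sketched in Python as follows. This is an illustration rather than the course's reference code: it assumes n is a power of 2 and ω = e^(2π√-1/n) (the sign convention is not given on the slide), and it returns Y instead of filling an output argument. The name `recursive_fft` is hypothetical.

```python
import cmath

def recursive_fft(x, w=None):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if w is None:
        # Assumption: w = exp(2*pi*sqrt(-1)/n); the slide does not fix the sign.
        w = cmath.exp(2j * cmath.pi / n)
    if n == 1:
        return [x[0]]
    q = recursive_fft(x[0::2], w * w)  # even-indexed inputs, root w**2
    t = recursive_fft(x[1::2], w * w)  # odd-indexed inputs, root w**2
    half = n // 2
    y = []
    w_i = 1 + 0j  # running power w**i, updated by one multiplication per step
    for i in range(n):
        y.append(q[i % half] + w_i * t[i % half])
        w_i *= w
    return y
```

Keeping a running product for ω^i avoids a separate exponentiation in every loop iteration; for large n a precomputed table of powers would limit rounding drift.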

Optimization Opportunity

FFT

[Figure: recursion tree of an 8-point FFT on X[0] … X[7]. The rows are labelled top level, 1st, 2nd and 3rd level of recursion, then return to 2nd level, return to 1st level and return to top level. The inputs are split into [0] [2] [4] [6] and [1] [3] [5] [7], then into the leaf order [0] [4] [2] [6] [1] [5] [3] [7]; the return steps combine pairs using powers of ω and produce Y[0] … Y[7] at the top level.]

Something looks familiar?
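The leaf order [0] [4] [2] [6] [1] [5] [3] [7] at the bottom of the recursion is the bit-reversal permutation of 0 … 7. A small sketch, with illustrative function names, computes it two ways: by reversing the bits of each index, and by repeating the even/odd split that RecursiveFFT performs.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def leaf_order(n):
    """Bit-reversed ordering of 0..n-1 (n a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]

def leaf_order_by_recursion(indices):
    """Same ordering, obtained by splitting into even and odd positions."""
    if len(indices) == 1:
        return list(indices)
    return (leaf_order_by_recursion(indices[0::2])
            + leaf_order_by_recursion(indices[1::2]))
```

For n = 8 both functions give 0, 4, 2, 6, 1, 5, 3, 7, matching the figure.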

Parallelization of FFT

Parallelize by looking at the data patterns

Two algorithms: Binary Exchange and Matrix Transpose

Binary Exchange FFT

[Figure: butterfly network of a 16-point FFT. Inputs X[0] … X[15] on the left, outputs Y[0] … Y[15] on the right, with four iterations m = 0, m = 1, m = 2, m = 3.]

Binary Exchange FFT

Data exchange takes place between all pairs of processors whose labels differ in exactly one bit.

One element per processor
    Easy

Multiple elements per processor
    Assign contiguous blocks to processors
    Same algorithm, just exchange blocks
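The exchange pattern for the block distribution can be sketched as below. The code assumes that iteration m pairs indices differing in the m-th most significant bit (an assumption about the butterfly ordering, not stated in the text) and that each of the p processors holds a contiguous block of n/p elements. Function names are illustrative.

```python
def exchange_partner(i, m, log_n):
    """Index paired with i in iteration m: flip the m-th most significant bit."""
    return i ^ (1 << (log_n - 1 - m))

def communication_pattern(n, p):
    """For each iteration, the processor pairs that exchange blocks (empty if local)."""
    log_n = n.bit_length() - 1
    block = n // p  # contiguous block of n/p elements per processor
    pattern = []
    for m in range(log_n):
        pairs = set()
        for i in range(n):
            a = i // block                              # owner of element i
            b = exchange_partner(i, m, log_n) // block  # owner of its partner
            if a != b:
                pairs.add((min(a, b), max(a, b)))
        pattern.append(sorted(pairs))
    return pattern
```

With n = 16 and p = 4, only the first log p = 2 iterations cross processor boundaries (P0 with P2 and P1 with P3, then P0 with P1 and P2 with P3); the last two are purely local.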

Binary Exchange FFT

[Figure: 16-point FFT on four processors. Inputs X[0] … X[15], listed with their 4-bit binary indices 0000 … 1111, are assigned in contiguous blocks to P0, P1, P2 and P3; the index bits carry the labels d and r. Outputs Y[0] … Y[15]; iterations m = 0, m = 1, m = 2, m = 3.]

Binary Exchange FFT

As n increases, so does communication
    Big bandwidth requirement

Powers of ω cannot be precalculated
    ω^i is used at different times on different processors
    Duplicated computation

The Transpose FFT

Assume that sqrt(n) is a power of 2

The data is arranged in a sqrt(n) x sqrt(n) two-dimensional square array

The Transpose FFT

[Figure: the 16 elements laid out as a 4 x 4 array (rows 0 1 2 3, 4 5 6 7, 8 9 10 11, 12 13 14 15), showing the pairing of elements in each iteration: (a) iteration m = 0, (b) iteration m = 1, (c) iteration m = 2, (d) iteration m = 3.]

Parallelization of Transpose FFT

Notice
    First two iterations are columnwise
    Last two iterations are rowwise

Rather than do an exchange
    Transpose the matrix halfway through the algorithm
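The columnwise/rowwise split can be checked with a short sketch. It assumes the n elements are stored row by row in a sqrt(n) x sqrt(n) array and that iteration m pairs indices differing in the m-th most significant bit; both are assumptions about the layout, and `iteration_kinds` is an illustrative name.

```python
import math

def iteration_kinds(n):
    """Classify each FFT iteration as pairing elements within a column or a row."""
    side = math.isqrt(n)          # array is side x side, stored row by row
    log_n = n.bit_length() - 1
    kinds = []
    for m in range(log_n):
        stride = 1 << (log_n - 1 - m)  # distance between paired indices
        same_column = all((i ^ stride) % side == i % side for i in range(n))
        same_row = all((i ^ stride) // side == i // side for i in range(n))
        if same_column:
            kinds.append("columnwise")
        elif same_row:
            kinds.append("rowwise")
        else:
            kinds.append("mixed")
    return kinds
```

For n = 16 the first two iterations pair elements in the same column and the last two pair elements in the same row, so a single transpose in the middle makes every iteration local to a column stripe.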

The Transpose FFT

[Figure: the 4 x 4 array striped across processors P0, P1, P2 and P3. (a) Steps in phase 1 of the transpose algorithm (before transpose). (b) Steps in phase 3 of the transpose algorithm (after transpose).]

The Transpose FFT

Transposing a stripe-partitioned array requires all-to-all communication

Would it be less expensive to just follow through with the algorithm or do the transpose?
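To see why the transpose is an all-to-all operation, the sketch below counts how many elements each processor must send to every other one. It assumes a side x side array striped by rows over p processors, with p dividing side (the column-striped case is symmetric); `transpose_traffic` is an illustrative name.

```python
def transpose_traffic(side, p):
    """traffic[a][b] = number of elements processor a sends to processor b."""
    rows_per_proc = side // p  # processor a owns rows a*rows_per_proc onward
    traffic = [[0] * p for _ in range(p)]
    for r in range(side):
        for c in range(side):
            src = r // rows_per_proc  # owner of element (r, c) before the transpose
            dst = c // rows_per_proc  # owner of position (c, r) after the transpose
            if src != dst:
                traffic[src][dst] += 1
    return traffic
```

Every processor sends (side/p)^2 = n/p^2 elements to each of the other p - 1 processors, so the cost depends on how well the machine handles many small messages compared with the block exchanges of the binary exchange algorithm.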

Which is better?

It depends
    Architecture and amount of data play together to create tradeoffs.

Transpose algorithm is easy to generalize to higher dimensions

Which is better?

[Figure: plot of S against n (n from 0 to 18000, S from 0 to 45) with three curves: binary exchange, 2-D transpose and 3-D transpose.]