CS 584
Fast Fourier Transform
Used in many scientific applications
Transforms a periodic signal into the frequency spectrum of the signal
FFT
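Given a sequence <X[0], X[1], …, X[n-1]>
Transform into <Y[0], Y[1], …, Y[n-1]>, where

$$Y[i] = \sum_{k=0}^{n-1} X[k]\,\omega^{ki}, \qquad 0 \le i < n$$

and ω is a primitive nth root of unity.
Direct evaluation takes O(n^2) operations.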
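The O(n^2) cost comes from evaluating each of the n outputs as an n-term sum. A minimal sketch, assuming ω = e^{2π√-1/n} (the sign of the exponent is an assumption; many libraries use the conjugate) and using the illustrative name `direct_dft`:

```python
import cmath

def direct_dft(x):
    """Evaluate Y[i] = sum_k X[k] * omega**(k*i) directly: n outputs, n terms each."""
    n = len(x)
    # Assumed convention: omega = exp(+2*pi*sqrt(-1)/n).
    omega = cmath.exp(2j * cmath.pi / n)
    return [sum(x[k] * omega ** (k * i) for k in range(n)) for i in range(n)]

if __name__ == "__main__":
    # A unit impulse transforms to a flat spectrum of ones.
    print(direct_dft([1, 0, 0, 0]))
```

The two nested loops (one in the list comprehension, one in `sum`) are the n^2 multiply-adds that the following slides reduce.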
FFT
In 1965 Cooley and Tukey showed that the FFT equation could be evaluated in O(n log n) operations, resulting in:
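
$$Y[i] = \sum_{k=0}^{(n/2)-1} X[2k]\,\tilde{\omega}^{ki} + \omega^{i} \sum_{k=0}^{(n/2)-1} X[2k+1]\,\tilde{\omega}^{ki}$$

where ω̃ = ω^2 is a primitive (n/2)th root of unity.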
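The split can be derived in a few steps. The sketch below assumes n is a power of 2 and that ω is a primitive nth root of unity, and writes ω̃ for ω^2; T(n) is an illustrative name for the operation count.

$$
\begin{aligned}
Y[i] &= \sum_{k=0}^{n-1} X[k]\,\omega^{ki} \\
     &= \sum_{k=0}^{(n/2)-1} X[2k]\,\omega^{2ki} + \sum_{k=0}^{(n/2)-1} X[2k+1]\,\omega^{(2k+1)i} \\
     &= \sum_{k=0}^{(n/2)-1} X[2k]\,\tilde{\omega}^{ki} + \omega^{i} \sum_{k=0}^{(n/2)-1} X[2k+1]\,\tilde{\omega}^{ki}, \qquad \tilde{\omega} = \omega^{2}
\end{aligned}
$$

Each sum is itself a transform of length n/2, and combining them costs a constant amount of work per output, so

$$T(n) = 2\,T(n/2) + \Theta(n) \;\Rightarrow\; T(n) = \Theta(n \log n).$$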
FFT
Procedure RecursiveFFT(X, Y, n, ω)
  if (n == 1)
    Y[0] = X[0]
  else
    RecursiveFFT(<X[0], X[2], …, X[n-2]>, <Q[0], Q[1], …, Q[n/2-1]>, n/2, ω^2);
    RecursiveFFT(<X[1], X[3], …, X[n-1]>, <T[0], T[1], …, T[n/2-1]>, n/2, ω^2);
    for i = 0 to n-1
      Y[i] = Q[i mod (n/2)] + ω^i * T[i mod (n/2)];
end
Optimization Opportunity
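The procedure translates almost line for line into runnable code. This is a sketch rather than a tuned implementation: it assumes n is a power of 2 and ω = e^{2π√-1/n}, returns Y instead of filling an output argument, and the name `recursive_fft` is illustrative.

```python
import cmath

def recursive_fft(x, omega=None):
    """Recursive FFT following the RecursiveFFT pseudocode (n must be a power of 2)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    if omega is None:
        # Assumed convention: omega = exp(+2*pi*sqrt(-1)/n).
        omega = cmath.exp(2j * cmath.pi / n)
    if n == 1:
        return [x[0]]
    # Transform the even- and odd-indexed halves with omega squared.
    q = recursive_fft(x[0::2], omega * omega)
    t = recursive_fft(x[1::2], omega * omega)
    half = n // 2
    # Combine: Y[i] = Q[i mod n/2] + omega^i * T[i mod n/2].
    return [q[i % half] + omega ** i * t[i % half] for i in range(n)]

if __name__ == "__main__":
    print(recursive_fft([1, 1, 1, 1]))  # energy concentrates in Y[0]
```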
FFT
[Figure: recursion tree of RecursiveFFT for n = 8. The inputs X[0] … X[7] are split into even and odd halves at the top level and at the 1st, 2nd, and 3rd levels of recursion: [0] [2] [4] [6] [1] [3] [5] [7] after the first split, then [0] [4] [2] [6] [1] [5] [3] [7]. The partial results are combined on the return to the 2nd level, the 1st level, and the top level, producing Y[0] … Y[7].]
Something looks familiar?
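The index order at the deepest level of the recursion, [0] [4] [2] [6] [1] [5] [3] [7], is the bit-reversed order of 0 through 7. A small sketch that checks this, with illustrative helper names `leaf_order` and `bit_reverse`:

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of a non-negative integer."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def leaf_order(indices):
    """Order in which repeated even/odd splitting reaches the inputs."""
    if len(indices) <= 1:
        return list(indices)
    return leaf_order(indices[0::2]) + leaf_order(indices[1::2])

if __name__ == "__main__":
    print(leaf_order(list(range(8))))              # [0, 4, 2, 6, 1, 5, 3, 7]
    print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```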
Parallelization of FFT
Parallelize by looking at the data patterns
Two algorithms: Binary Exchange and Matrix Transpose
Binary Exchange FFT
[Figure: data-exchange pattern of a 16-point FFT, with inputs X[0] … X[15] on the left, outputs Y[0] … Y[15] on the right, and four iterations m = 0, m = 1, m = 2, m = 3 in between.]
Binary Exchange FFT
Data exchange takes place between all pairs of processors whose labels differ in exactly one bit.
One element per processor: easy
Multiple elements per processor: assign contiguous blocks to processors; same algorithm, just exchange blocks
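The one-element-per-processor case can be simulated sequentially. The sketch below is one possible formulation, not necessarily the one used in the course: it assumes ω = e^{2π√-1/n}, that iteration m pairs labels differing in the m-th most significant bit, and that the results end up in bit-reversed positions. The names `binary_exchange_fft` and `bit_reverse` are illustrative.

```python
import cmath

def bit_reverse(value, width):
    """Reverse the lowest `width` bits of a non-negative integer."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def binary_exchange_fft(x):
    """Simulate n processes holding one element each (n a power of 2)."""
    n = len(x)
    r = n.bit_length() - 1
    if n == 0 or n != 1 << r:
        raise ValueError("length must be a power of 2")
    omega = cmath.exp(2j * cmath.pi / n)  # assumed sign convention
    local = list(x)  # local[p] is the element held by process p
    for m in range(r):
        shift = r - 1 - m
        mask = 1 << shift
        # Each process swaps its value with the partner whose label differs in one bit.
        received = [local[p ^ mask] for p in range(n)]
        updated = []
        for p in range(n):
            if (p & mask) == 0:
                low, high = local[p], received[p]
            else:
                low, high = received[p], local[p]
            # Twiddle exponent: top m+1 label bits reversed, padded with zeros.
            exponent = bit_reverse(p >> shift, m + 1) << shift
            updated.append(low + omega ** exponent * high)
        local = updated
    # Process p ends up holding Y[bit_reverse(p)].
    y = [0j] * n
    for p in range(n):
        y[bit_reverse(p, r)] = local[p]
    return y

if __name__ == "__main__":
    print(binary_exchange_fft([1, 0, 0, 0]))
```

With a contiguous block of n/p elements per processor, only the iterations whose differing bit falls in the processor part of the label would need an exchange; the remaining iterations pair elements inside one block.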
Binary Exchange FFT
[Figure: 16-point binary exchange FFT on four processors P0, P1, P2, P3, each holding a contiguous block of four elements. Inputs X[0] … X[15] are listed with their 4-bit binary labels 0000 through 1111 (bit fields marked d and r), outputs Y[0] … Y[15] on the right, iterations m = 0, m = 1, m = 2, m = 3.]
Binary Exchange FFT
As n increases, so does communication: big bandwidth requirement
Powers of ω cannot be precalculated: ω^i is used at different times on different processors, which leads to duplicated computation
The Transpose FFT
Assume that sqrt(n) is a power of 2
The data is arranged in a sqrt(n) x sqrt(n) two-dimensional square array
The Transpose FFT
[Figure: 16 elements labeled 0 to 15 arranged in a 4 x 4 array (rows 0 1 2 3, 4 5 6 7, 8 9 10 11, 12 13 14 15), showing the pattern of combination in four panels: (a) Iteration m = 0, (b) Iteration m = 1, (c) Iteration m = 2, (d) Iteration m = 3.]
Parallelization of Transpose FFT
Notice: the first two iterations are columnwise and the last two iterations are rowwise
Rather than do an exchange, transpose the matrix halfway through the algorithm
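The idea can be sketched as three phases: column-local iterations, a transpose, then row-local iterations. This is a sequential simulation under stated assumptions, not the course's implementation: one column per processor, ω = e^{2π√-1/n}, sqrt(n) a power of 2, and results left in bit-reversed positions before the final reordering. The names `transpose_fft`, `butterfly`, and `bit_reverse` are illustrative.

```python
import cmath

def bit_reverse(value, width):
    """Reverse the lowest `width` bits of a non-negative integer."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def butterfly(local, positions, m, r, omega):
    """Apply iteration m to one processor's vector; positions[j] is the global
    index of local[j]. A KeyError would mean a partner element is not local."""
    shift = r - 1 - m
    mask = 1 << shift
    where = {g: j for j, g in enumerate(positions)}
    out = []
    for g in positions:
        low = local[where[g & ~mask]]
        high = local[where[g | mask]]
        exponent = bit_reverse(g >> shift, m + 1) << shift
        out.append(low + omega ** exponent * high)
    return out

def transpose_fft(x):
    n = len(x)
    r = n.bit_length() - 1
    if n == 0 or n != 1 << r or r % 2:
        raise ValueError("sqrt(n) must be a power of 2")
    half, side = r // 2, 1 << (r // 2)
    omega = cmath.exp(2j * cmath.pi / n)  # assumed sign convention
    # Phase 1: processor c owns column c; the first half of the iterations are columnwise.
    cols = [[x[row * side + c] for row in range(side)] for c in range(side)]
    for m in range(half):
        cols = [butterfly(cols[c], [row * side + c for row in range(side)], m, r, omega)
                for c in range(side)]
    # Phase 2: transpose, so processor c now owns row c.
    rows = [[cols[col][c] for col in range(side)] for c in range(side)]
    # Phase 3: the remaining iterations are rowwise, hence local again.
    for m in range(half, r):
        rows = [butterfly(rows[c], [c * side + col for col in range(side)], m, r, omega)
                for c in range(side)]
    y = [0j] * n
    for c in range(side):
        for col in range(side):
            y[bit_reverse(c * side + col, r)] = rows[c][col]
    return y

if __name__ == "__main__":
    print(transpose_fft([1] * 16)[:2])
```

Because `butterfly` only looks up partners inside the local vector, the fact that it runs without a KeyError shows that phases 1 and 3 need no communication; all data movement is in the transpose.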
The Transpose FFT
[Figure: the 4 x 4 array of elements 0 to 15 striped across processors P0, P1, P2, P3, in two panels: (a) Steps in phase 1 of the transpose algorithm (before transpose); (b) Steps in phase 3 of the transpose algorithm (after transpose).]
The Transpose FFT
Transposition of a striped partitioned array requires all-to-all communication
Would it be less expensive to just follow through with the algorithm or do the transpose?
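To see why the transpose is all-to-all, one can count how many elements each processor sends to each other processor. A small sketch, assuming a side x side array striped by columns over p processors with side divisible by p; `transpose_traffic` is an illustrative name.

```python
from collections import Counter

def transpose_traffic(side, p):
    """Count elements moved from each source processor to each destination."""
    if side % p:
        raise ValueError("side must be divisible by p")
    width = side // p  # columns per processor
    traffic = Counter()
    for row in range(side):
        for col in range(side):
            source = col // width  # owner of column `col` before the transpose
            dest = row // width    # element (row, col) lands in column `row`
            traffic[(source, dest)] += 1
    return traffic

if __name__ == "__main__":
    for pair, count in sorted(transpose_traffic(4, 4).items()):
        print(pair, count)
```

Every ordered pair of processors appears in the result with (side/p)^2 elements, so each processor has something to send to every other processor.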
Which is better?
It depends
Architecture and amount of data play together to create tradeoffs.
The transpose algorithm is easy to generalize to higher dimensions.
Which is better?
[Figure: plot of S (vertical axis, 0 to 45) against n (horizontal axis, 0 to 18000) with three curves: Binary exchange, 2-D transpose, 3-D transpose.]