Textbook: VLSI ARRAY PROCESSORS, S.Y. Kung (soc.eecs.yuntech.edu.tw/Course/Digital...)
1/34
Textbook: VLSI Array Processors, S.Y. Kung
Prentice-Hall, Inc.; 開發圖書
Instructor: Ching-Long Su (蘇慶龍)
Course: Digital Integrated Circuit Design (數位積體電路設計)
E-mail: [email protected]
2/34
Chapter 4: Systolic Array Processors
3/34
Outline of Chapter 4
4.1 Introduction
4.2 Systolic Array Processors
4.3 Mapping DGs and SFGs to Systolic Arrays
4.4 Performance Analysis and Design Optimization
4.5 Systolic Arrays for the Transitive Closure and Dynamic Programming Problems
4.6 Systolic Design for Artificial Neural Network
4.7 Concluding Remarks
4.8 Problems
4/34
4.1 Introduction
5/34
In Chapter 4:
1. Review the methodology for mapping algorithms onto SFGs.
2. Discuss the cut-set systolization (retiming) method for systolic array design.
6/34
4.2 Systolic Array Processors
7/34
Systolic Array Processor
1. “A systolic system is a network of processors which rhythmically compute and pass data through the system.”
2. A systolic array processor avoids the classic memory-access bottleneck.
3. A systolic array processor can handle both compute-bound and I/O-bound computations.
8/34
Basic Configuration of Systolic Array
[Figure: a Memory block and an array of eight processing elements (PEs).]
9/34
Definition of Systolic Arrays
1. Synchrony: The data are rhythmically computed (timed by a global clock) and passed through the network.
2. Modularity and Regularity
3. Spatial Locality and Temporal Locality
4. Pipelinability
10/34
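Example 1: Systolic Array for Convolution
[Figure: linear array of four PEs with weight labels "W0 W0 W0 W0", a constant 0 input, an interleaved input stream "- u1 - u0", and an interleaved output stream "- y0 - y1".]
Each PE holds a weight Wi, has inputs ain and bin and outputs aout and bout, and computes:
aout = ain
bout = bin + ain * Wi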
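The PE equations above can be exercised with a small cycle-level simulation. This is only a sketch under stated assumptions: the figure's wiring does not survive in the transcript, so the code assumes that the a values move left to right, the partial sums b move right to left, every PE output is registered, and inputs are interleaved with blanks (zeros). The names `pe` and `systolic_convolve` are illustrative and not from the slides.

```python
def pe(a_in, b_in, w):
    """One processing element: a_out = a_in, b_out = b_in + a_in * w."""
    return a_in, b_in + a_in * w


def systolic_convolve(u, w):
    """Simulate a linear array of len(w) PEs, one clock cycle per loop pass.

    Assumed dataflow: a moves right, b moves left, one register per PE
    on each path, and a blank (0) between consecutive input samples.
    """
    k = len(w)
    a_reg = [0] * k                      # registered a_out of each PE
    b_reg = [0] * k                      # registered b_out of each PE
    n_out = len(u) + k - 1               # length of the full convolution
    y = []
    for t in range(2 * n_out):
        # Input enters only on even cycles; odd cycles carry a blank.
        x_in = u[t // 2] if (t % 2 == 0 and t // 2 < len(u)) else 0
        new_a, new_b = [0] * k, [0] * k
        for i in range(k):
            a_in = x_in if i == 0 else a_reg[i - 1]
            b_in = 0 if i == k - 1 else b_reg[i + 1]   # 0 fed at the far end
            new_a[i], new_b[i] = pe(a_in, b_in, w[i])
        a_reg, b_reg = new_a, new_b
        if t % 2 == 0:                   # valid outputs appear on even cycles
            y.append(b_reg[0])
    return y
```

With these assumptions the output on cycle 2n is the sum over i of w[i] * u[n - i], so the array produces one convolution sample every second clock, which matches the interleaved "-" entries in the input and output streams.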
11/34
Example 2: Hexagonal Systolic Array for Band Matrix Multiplication
[Figure: hexagonal systolic array. The band-matrix entries a_ij (a11, a21, a12, a31, a22, a32, a23) and b_ij (b11, b12, b21, b13, b22, b23, b32) are shown at time steps T=1 to T=4; the c_ij stream (c11, c21, c12, c31, c22, c13, c32, c23, c41, c14) is labelled "C out".]
12/34
Properties of Systolic Array
1. Simple and Regular Design
2. Concurrency and Communication
3. Balancing Computation with I/O
13/34
Clock Distribution Scheme for Synchronization of the Systolic Array System:
H-tree Layout for Balancing the Clock Circuit Delay
[Figure: H-tree clock layouts for a Linear Array, a Square Array, and a Hexagonal Array.]
14/34
Systolic vs. SIMD vs. SFG Arrays
[Figure: SIMD Array. Labels: Control Unit (Central Control), Control Bus, Data Bus (Global Communication), three Processing Units, Interconnection Network (Local).]
15/34
Systolic vs. SIMD vs. SFG Arrays
[Figure: Systolic Array. Labels: three Control Units, three Processing Units, Interconnection Network (Local).]
16/34
Systolic vs. SIMD vs. SFG Arrays
[Figure: SFG Array. Labels: three Control Units, three Processing Units, Interconnection Network (Local), Data Bus (Global Communication).]
17/34
4.3 Mapping DGs and SFGs to Systolic Arrays
18/34
Three Stages of the Canonical Mapping Algorithm for Systolic Array Design
1. Derive a (local) DG from the algorithm.
2. Map the DG to an SFG array.
3. Transform the SFG to a systolic array (i.e. retiming).
19/34
The major systolic array design gap is that most SFGs are not given in temporally localized form.
Systolic Array = SFG Array + Pipeline Retiming
20/34
Cut-Set Retiming Procedure
1. Time Scaling: All delays D may be scaled, i.e. D → αD (the I/O rate is down-sampled by the same factor).
2. Delay Transfer: Given a cut-set of the SFG, which partitions the graph into two components, we can group the edges of the cut-set into inbound edges and outbound edges.
21/34
Delay Transfer Rule
[Figure: a Cut separating two components. Each Inbound edge gains k delays (+kD) and the Outbound edge loses k delays (-kD).]
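The two cut-set retiming rules can be sketched in a few lines. This is a minimal illustration, not the textbook's procedure verbatim: the SFG is assumed to be stored as a dictionary from (source, destination) pairs to integer delay counts, "inbound" is taken to mean entering the chosen component, and the function names `scale_delays` and `transfer_delays` are hypothetical.

```python
def scale_delays(edges, alpha):
    """Time scaling: every delay D becomes alpha * D."""
    return {edge: alpha * delay for edge, delay in edges.items()}


def transfer_delays(edges, component, k):
    """Delay transfer across the cut-set that isolates `component`.

    Edges entering `component` (inbound) gain k delays and edges
    leaving it (outbound) lose k delays; other edges are unchanged.
    """
    retimed = {}
    for (src, dst), delay in edges.items():
        if src not in component and dst in component:
            delay += k                   # inbound edge: +kD
        elif src in component and dst not in component:
            delay -= k                   # outbound edge: -kD
        if delay < 0:
            raise ValueError(f"edge {src}->{dst} would get a negative delay")
        retimed[(src, dst)] = delay
    return retimed


# Assumed toy SFG: a two-node loop with one zero-delay edge.
sfg = {("A", "B"): 0, ("B", "A"): 1}
scaled = scale_delays(sfg, 2)                    # delays become 0 and 2
systolic = transfer_delays(scaled, {"B"}, 1)     # delays become 1 and 1
```

In this toy loop the total delay around the cycle is the same before and after the transfer (2 in both cases), while the zero-delay edge disappears, which is the purpose of the retiming.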
22/34
Systolization Procedure
1. Selection of Basic Operation Modules
2. Applying Retiming Rules
3. Combination of Delay and Operation Modules
23/34
Systolization Procedure: Example of Lattice Filters
[Figure: SFG for AR Lattice Filter. Six multiply-add (*+) modules with coefficients ki and delays D; input stream "- - - X2 X1 X0", output stream "Y0 Y1 Y2 - - -". Critical Path = 6 MAC.]
[Figure: Step 1. Time-Rescaled SFG for AR Lattice Filter. Six *+ modules with delays 2D; input stream "- X2 - X1 - X0", output stream "Y0 - Y1 - Y2 -".]
24/34
Systolization Procedure: Example of Lattice Filters
[Figure: Step 2. Retimed SFG for AR Lattice Filter. Delay labels 2D-D and +D; input stream "- X2 - X1 - X0", output stream "Y0 - Y1 - Y2 -".]
[Figure: Step 3. Systolic Array for AR Lattice Filter. Six *+ modules; input stream "- X2 - X1 - X0", output stream "Y0 - Y1 - Y2 -". Critical Path = 2 MAC.]
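The effect of the three steps on the critical path can be checked with a coarse stage-level model. The sketch below is an assumption-laden simplification, not the exact SFG from the figures: the filter is modelled as three stages of two MACs each, with zero-delay edges in one direction and one-delay edges in the other, and the names `retime` and `critical_path` are hypothetical.

```python
MACS_PER_STAGE = 2   # assumed: each lattice stage contains two *+ modules


def retime(edges, component, k):
    """Add k delays to edges entering `component`, remove k from edges leaving."""
    out = {}
    for (src, dst), delay in edges.items():
        if src not in component and dst in component:
            delay += k
        elif src in component and dst not in component:
            delay -= k
        out[(src, dst)] = delay
    return out


def critical_path(stages, edges):
    """Longest chain of stages linked by zero-delay edges, counted in MACs."""
    zero = [edge for edge, delay in edges.items() if delay == 0]

    def longest_from(node, seen):
        best = 0
        for src, dst in zero:
            if src == node and dst not in seen:
                best = max(best, longest_from(dst, seen | {dst}))
        return MACS_PER_STAGE + best

    return max(longest_from(s, {s}) for s in stages)


stages = [1, 2, 3]
sfg = {(1, 2): 0, (2, 3): 0,     # one direction: no delay
       (2, 1): 1, (3, 2): 1}     # other direction: one delay D per stage
before = critical_path(stages, sfg)

rescaled = {edge: 2 * delay for edge, delay in sfg.items()}   # Step 1: D -> 2D
retimed = retime(rescaled, {2, 3}, 1)                          # Step 2: cut between stages 1 and 2
retimed = retime(retimed, {3}, 1)                              # Step 2: cut between stages 2 and 3
after = critical_path(stages, retimed)
```

Under this model every edge ends up with exactly one delay (2D-D on one side of each cut, +D on the other), and the critical path drops from 6 MAC to 2 MAC, in agreement with the figures quoted on the slides.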
25/34
Systolization Procedure: Example of Matrix Multiplication
[Figure: 4×4 array with a delay D at each of the 16 positions; the entries of B (b11 to b44) and of A (a11 to a44) are the input streams.]
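A cycle-level simulation shows how a two-dimensional array of this kind can multiply matrices using only nearest-neighbour transfers. The exact arrangement in the figure is not recoverable from the transcript, so this sketch assumes an output-stationary design: each PE (i, j) accumulates c_ij in place, rows of A enter from the left and columns of B enter from the top, each skewed by one cycle per row or column. The function name `systolic_matmul` is illustrative.

```python
def systolic_matmul(A, B):
    """Multiply two n x n matrices on a simulated n x n systolic array."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]   # a values latched in each PE, moving right
    b_reg = [[0] * n for _ in range(n)]   # b values latched in each PE, moving down
    for t in range(3 * n - 2):            # last product happens at t = 3n - 3
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:                # left boundary: row i of A, delayed i cycles
                    k = t - i
                    a_in = A[i][k] if 0 <= k < n else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:                # top boundary: column j of B, delayed j cycles
                    k = t - j
                    b_in = B[k][j] if 0 <= k < n else 0
                else:
                    b_in = b_reg[i - 1][j]
                C[i][j] += a_in * b_in    # multiply-accumulate in place
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return C
```

With this skew, PE (i, j) sees A[i][k] and B[k][j] together on cycle t = i + j + k, so the whole product completes in 3n - 2 cycles.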
26/34
All systolic arrays obtained from linear projections of the DG can be derived by the following two steps:
1. Map the DG onto an SFG by the SFG projection procedure.
2. Map the SFG onto a systolic array by cut-set retiming.
27/34
Retiming in the Sorting Systolic Arrays: Insertion Sorter
[Figure: DG for sorting over indices (i, j), with nodes m_ij (m11 to m54) and x_ij (x11 to x45), an INPUT edge, and -∞ boundary values; projection and schedule d = s = [1,0]; the resulting linear array of cells m1 to m4 with delays D and inputs x11 to x14.]
28/34
Retiming in the Sorting Systolic Arrays: Selection Sorter
[Figure: the same sorting DG over indices (i, j), with nodes m_ij and x_ij, an INPUT edge, and -∞ boundary values; projection and schedule d = s = [0,1]; the resulting linear array of cells 1 to 4 with inputs x1j to x4j and delays D.]
29/34
Retiming in the Sorting Systolic Arrays: Bubble Sorter
[Figure: the same sorting DG over indices (i, j), with nodes m_ij and x_ij, an INPUT edge, and -∞ boundary values; projection and schedule d = s = [1,1]; the resulting array with delays D, inputs x11, x22, x33, x44, and labels m1, m13, m15, m17.]
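The three sorters differ only in the projection direction d and schedule vector s applied to the same DG. The sketch below illustrates the standard linear mapping (processor = P·index with P orthogonal to d, time = s·index). It rests on assumptions that are not on the slides: a full N×N index grid stands in for the sorting DG, the processor basis P for each d is chosen by hand, and the names `P_FOR_D`, `map_node` and `project` are hypothetical.

```python
# Assumed processor basis P for each projection direction d (P . d = 0).
P_FOR_D = {(1, 0): (0, 1), (0, 1): (1, 0), (1, 1): (1, -1)}


def map_node(node, d, s):
    """Map DG node (i, j) to a (processor, time) pair."""
    i, j = node
    p = P_FOR_D[tuple(d)]
    if p[0] * d[0] + p[1] * d[1] != 0:
        raise ValueError("P must be orthogonal to d")
    processor = p[0] * i + p[1] * j      # nodes along d share a processor
    time = s[0] * i + s[1] * j           # schedule: hyperplanes normal to s
    return processor, time


def project(n, d, s):
    """Return {processor: [firing times]} for an n x n index grid."""
    table = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            processor, time = map_node((i, j), d, s)
            table.setdefault(processor, []).append(time)
    return table


insertion = project(4, (1, 0), (1, 0))   # d = s = [1,0]
selection = project(4, (0, 1), (0, 1))   # d = s = [0,1]
bubble = project(4, (1, 1), (1, 1))      # d = s = [1,1]
```

On this grid the first two projections use four processors each, while d = [1,1] spreads the nodes over seven processors, each active on every second time step. In all three cases no processor is assigned two nodes at the same time, which is the basic validity condition for a schedule.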
30/34
Rotation of Schedule Vector for Insertion Sorter
[Figure: the insertion-sorter DG and linear array (d = s = [1,0]) as on the earlier slide, annotated with the Default Schedule and the Desired Schedule.]
31/34
Bit-Level Systolic Arrays: Example of Inner Product of Two Vectors
The inner product c of two vectors a and b is computed as:
$$c = \sum_{k=1}^{n} a_k b_k$$
Assume that the elements of a and b are m-bit integers.
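32/34
Bit-Level Systolic Arrays: Example of Inner Product of Two Vectors
$$
\begin{aligned}
c &= \sum_{k=1}^{n} a_k \times b_k \\
  &= a_1 \times b_1 + a_2 \times b_2 + \dots + a_n \times b_n \\
  &= \sum_{j=0}^{m-1} a_{1,j} 2^j \times \sum_{j=0}^{m-1} b_{1,j} 2^j + \dots + \sum_{j=0}^{m-1} a_{n,j} 2^j \times \sum_{j=0}^{m-1} b_{n,j} 2^j
\end{aligned}
$$
Top Layer:
$$
\begin{array}{rcccc}
       & a_{1,m-1}         & \cdots & a_{1,1}         & a_{1,0} \\
\times & b_{1,m-1}         & \cdots & b_{1,1}         & b_{1,0} \\
\hline
       & a_{1,m-1} b_{1,0}   & \cdots & a_{1,1} b_{1,0}   & a_{1,0} b_{1,0} \\
       & a_{1,m-1} b_{1,1}   & \cdots & a_{1,1} b_{1,1}   & a_{1,0} b_{1,1} \\
       & \vdots            &        & \vdots          & \vdots \\
       & a_{1,m-1} b_{1,m-1} & \cdots & a_{1,1} b_{1,m-1} & a_{1,0} b_{1,m-1}
\end{array}
$$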
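The bit-level expansion can be checked numerically. The sketch below forms every partial product a_{k,i} b_{k,j}, weights it by 2^(i+j), and sums over all layers k; the result should equal the ordinary inner product. The function names `bits` and `bit_level_inner_product` are illustrative, and the inputs are assumed to be non-negative integers that fit in m bits.

```python
def bits(x, m):
    """Return the m least significant bits of x, index 0 = LSB."""
    return [(x >> j) & 1 for j in range(m)]


def bit_level_inner_product(a, b, m):
    """Inner product of a and b built from single-bit partial products."""
    total = 0
    for a_k, b_k in zip(a, b):               # one layer per element pair
        a_bits, b_bits = bits(a_k, m), bits(b_k, m)
        for i in range(m):
            for j in range(m):
                # Partial product a_{k,i} * b_{k,j} carries weight 2^(i+j).
                total += (a_bits[i] & b_bits[j]) << (i + j)
    return total


a = [3, 5, 7]
b = [2, 4, 6]
result = bit_level_inner_product(a, b, m=3)
```

Each layer contributes an m×m grid of AND terms, which is why the hardware array stacks one such grid per element pair.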
33/34
Bit-Level Systolic Arrays: Example of Inner Product of Two Vectors
Bottom Layer:
$$
\begin{array}{rcccc}
       & a_{n,m-1}         & \cdots & a_{n,1}         & a_{n,0} \\
\times & b_{n,m-1}         & \cdots & b_{n,1}         & b_{n,0} \\
\hline
       & a_{n,m-1} b_{n,0}   & \cdots & a_{n,1} b_{n,0}   & a_{n,0} b_{n,0} \\
       & a_{n,m-1} b_{n,1}   & \cdots & a_{n,1} b_{n,1}   & a_{n,0} b_{n,1} \\
       & \vdots            &        & \vdots          & \vdots \\
       & a_{n,m-1} b_{n,m-1} & \cdots & a_{n,1} b_{n,m-1} & a_{n,0} b_{n,m-1}
\end{array}
$$
34/34
Bit-Level Systolic Arrays: Example of Inner Product of Two Vectors
[Figure: bit-level systolic array for three pairs of 3-bit operands (a0,0 to a2,2 and b0,0 to b2,2), built from half adders (HA) and full adders (FA) with Sum and Carry connections, running from the Top Layer to the Bottom Layer and producing output bits c0 to c4.]
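The HA and FA cells named in the figure are the standard half adder and full adder. A minimal sketch of their logic follows, with the full adder assumed to be built from two half adders and an OR gate (one common construction; the slides do not specify the internal structure). The function names are illustrative.

```python
def half_adder(a, b):
    """Add two bits; return (sum, carry)."""
    return a ^ b, a & b


def full_adder(a, b, c_in):
    """Add three bits; return (sum, carry)."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, c_in)
    return s, c1 | c2        # at most one of c1, c2 can be 1
```

In both cells the outputs satisfy sum + 2 * carry = total of the input bits, which is what allows the partial-product bits of each layer to be accumulated column by column.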