"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation from Auviz Systems
Transcript of "Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation from Auviz Systems
Copyright © 2015 Auviz Systems 1
Nagesh Gupta
12 May 2015
Trade-offs in Implementing Deep Neural
Networks on FPGAs
Auviz Systems
• Startup specializing in implementing & optimizing algorithms on FPGAs
• Offers libraries for different classes of algorithms
  • AuvizCV: optimized OpenCV algorithms
  • AuvizLA: optimized BLAS
  • AuvizDNN: optimized deep neural networks
• Also develops custom algorithms in Computer Vision, Linear Algebra, Deep Learning & Machine Learning
• Available as OpenCL function calls, so software users are shielded from the complexity of using an FPGA
• Visit our booth & see AlexNet running on a Xilinx FPGA!
The Time for Artificial Intelligence &
Machine Learning
• Sources: Cisco/Statista, Facebook research, IT Business Edge
Machine Learning Moving to the Data Center
Performance/watt
Programming model & use model
Microsoft Azure ML: provides Machine Learning as a service on the cloud
IBM Watson at Jeopardy: one of the best demonstrations of Machine Learning
Amazon AWS ML & Google Predictive Analytics: other Machine Learning services on the cloud
Convolutional Neural Networks (CNNs)
• A form of Deep Neural Network used for various “recognition” tasks
• AlexNet [2] is the CNN configuration shown below; it was used to classify 1.2 million images
Components of AlexNet: Convolution layers
• A convolution layer has multiple stages
  • 3D convolutions
  • Activation: using the ReLU function, max(x, 0)
  • Max pooling: sub-sampling function that selects the max value within a neighborhood
[Figure: 3D Convolutions, Activation (ReLU), Sub-sampling (Max pooling)]
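The three stages can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the Auviz implementation: the function names, the nested-list layout [map][row][col], and the tiny example sizes are all assumptions made here.

```python
# Minimal sketch of one convolution-layer stage: 3D convolution, ReLU, max pooling.
# Feature maps are nested lists indexed [map][row][col]; all names are illustrative.

def conv3d(inputs, weights, stride=1):
    """inputs: N x R x C; weights: M x N x K x K; returns M x R' x C' (no padding)."""
    n_in, rows, cols = len(inputs), len(inputs[0]), len(inputs[0][0])
    k = len(weights[0][0])
    out_rows = (rows - k) // stride + 1
    out_cols = (cols - k) // stride + 1
    outputs = []
    for w in weights:                      # one filter per output feature map
        fmap = []
        for r in range(out_rows):
            row = []
            for c in range(out_cols):
                acc = 0.0
                for n in range(n_in):      # sum across all input feature maps
                    for i in range(k):
                        for j in range(k):
                            acc += w[n][i][j] * inputs[n][r * stride + i][c * stride + j]
                row.append(acc)
            fmap.append(row)
        outputs.append(fmap)
    return outputs

def relu(fmaps):
    """Activation: max(x, 0) applied to every value."""
    return [[[max(x, 0.0) for x in row] for row in fmap] for fmap in fmaps]

def max_pool(fmaps, size=2, stride=2):
    """Sub-sampling: keep the max value within each size x size neighborhood."""
    pooled = []
    for fmap in fmaps:
        out_rows = (len(fmap) - size) // stride + 1
        out_cols = (len(fmap[0]) - size) // stride + 1
        pooled.append([[max(fmap[r * stride + i][c * stride + j]
                            for i in range(size) for j in range(size))
                        for c in range(out_cols)] for r in range(out_rows)])
    return pooled

# Tiny example: 1 input feature map (4x4) and 1 filter (2x2).
x = [[[1.0, -2.0, 3.0, 0.0],
      [0.0, 1.0, -1.0, 2.0],
      [2.0, 0.0, 1.0, -3.0],
      [-1.0, 1.0, 0.0, 1.0]]]
w = [[[[1.0, 0.0],
       [0.0, -1.0]]]]
y = max_pool(relu(conv3d(x, w)), size=2, stride=1)
print(y)  # [[[0.0, 2.0], [1.0, 2.0]]]
```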
Dense Layers in AlexNet
• Dense layers are fully connected: each output node is a function of all the input nodes
• The first 2 dense layers can be represented as a matrix-vector multiplication operation
  • Layer 6 has 9216 inputs, which are multiplied with a weight matrix to create 4096 outputs
  • Layer 7 has 4096 inputs, which are multiplied with a different weight matrix to create 4096 outputs
• The output layer uses SoftMax to classify the input image into one of 1000 classes
[Figure: Layer 6, Layer 7, Output layer]
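A dense layer as a matrix-vector product, followed by SoftMax, might look like the sketch below. The toy sizes (3 inputs, 2 hidden nodes, 3 classes) stand in for the 9216, 4096 and 1000 of AlexNet, and the helper names are assumptions of this example.

```python
import math

def dense(weights, inputs):
    """Matrix-vector product: weights is (outputs x inputs), inputs is a vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def softmax(scores):
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes standing in for 9216 -> 4096 -> ... -> 1000.
x = [1.0, 2.0, 3.0]
w_hidden = [[0.5, 0.0, 0.0],
            [0.0, 0.5, 0.0]]                    # 3 inputs -> 2 outputs
w_out = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]                            # 2 inputs -> 3 classes

hidden = dense(w_hidden, x)                     # [0.5, 1.0]
probs = softmax(dense(w_out, hidden))
predicted = probs.index(max(probs))             # class with the highest probability

# Each weight is used once per image, so layer 6 needs one
# multiply-accumulate and one weight fetch per matrix entry.
macs_layer6 = 9216 * 4096                       # 37,748,736
```

Because every weight is used exactly once per image, the number of weight fetches equals the number of multiply-accumulates, which is why the dense layers are dominated by data transfer.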
Implementing 3D Convolutions
• Sequential implementation
  • Implementation follows the convolution equations
  • Resource utilization will be very low, but the latency at 200 MHz will be 22 s for the 2nd layer
• High-level synthesis (HLS) can be used to implement this, as shown in [3]
• Get better performance by parallelizing the implementation
[Figure: Weight matrices, Input feature maps, Output feature maps]
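A rough latency estimate for the sequential case can be sketched as follows. The layer dimensions (27 x 27 output, 256 output maps, 96 input maps, 5 x 5 filters) come from the backup table in this deck; the cycles-per-operation values are hypothetical, since the slides do not state how many cycles one multiply-accumulate takes.

```python
def conv_macs(out_rows, out_cols, out_maps, in_maps, k):
    """Multiply-accumulates in one 3D convolution layer."""
    return out_rows * out_cols * out_maps * in_maps * k * k

def sequential_latency_s(macs, clock_hz, cycles_per_mac):
    """Latency when a single unit performs the operations one after another."""
    return macs * cycles_per_mac / clock_hz

layer2_macs = conv_macs(27, 27, 256, 96, 5)                  # 447,897,600

# Assumption: one multiply-accumulate completes every cycle.
one_per_cycle = sequential_latency_s(layer2_macs, 200e6, 1)  # about 2.24 s

# Assumption: an unpipelined floating-point multiply-accumulate takes ~10 cycles.
ten_cycles = sequential_latency_s(layer2_macs, 200e6, 10)    # about 22.4 s
```

Under these assumptions, the 22 s quoted above is consistent with roughly ten cycles per operation; that reading is an inference, not something the slides confirm.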
Computations vs. Data Transfers in AlexNet
[Figure: chart of Computations and Data transfers, log-scale axis from 1.E+00 to 1.E+09]
• Computation latency, 2nd convolution layer
  • With 512 single-precision floating-point operations in parallel, the 2nd convolution layer takes 2.2 ms to complete at 200 MHz
• Data transfer latency, 2nd convolution layer
  • With 64-bit DDR at 1.3 Gb/s, the single-precision floating-point data fetch latency is around 0.5 ms
3D convolutions require more computations, while the data transfers are higher for the dense layers
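The contrast can be made concrete by dividing computations by data transfers for each layer. The figures below are copied from the backup table at the end of this deck; the layer labels and the ratio itself are just an illustration added here.

```python
# (computations, data transfers) per layer, from the backup table in this deck.
layers = {
    "conv1": (110e6, 255e3),
    "conv2": (448e6, 728e3),
    "conv3": (150e6, 993e3),
    "conv4": (224e6, 1457e3),
    "conv5": (150e6, 959e3),
    "dense6": (38e6, 38e6),
    "dense7": (16e6, 16e6),
    "output": (4e6, 4e6),
}

# Computations performed per value transferred.
ops_per_transfer = {name: ops / words for name, (ops, words) in layers.items()}

for name, ratio in ops_per_transfer.items():
    print(f"{name:>7}: {ratio:8.1f} computations per value transferred")
```

The convolution layers reuse each fetched value hundreds of times, while the dense layers perform one computation per value fetched, so the former are limited by compute and the latter by memory bandwidth.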
3D Convolution: Parallel Implementation
[Figure: 11x11 Weight Matrix X 11x11 Input Feature Map = 1 Output value]
• An 11x11 weight matrix with 3 input feature maps requires 121*3 multiplications and 121*3 additions
• With 363 multiply units and 363 adders, this can be done in 1 cycle
• Each single-precision floating-point operation requires 2-5 DSP blocks and 200-400 LUTs
• Implementing this in parallel will require ~1200 DSPs and ~75000 LUTs
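The resource figures can be approximated with simple arithmetic. The sketch below treats one multiplier plus one adder as a single compute unit, which is an assumption made here; the per-unit costs (3-5 DSPs, 200-400 LUTs) are taken from the ranges quoted in this deck.

```python
def parallel_resources(kernel, in_maps, dsp_per_unit, lut_per_unit):
    """Resources to compute one output value per cycle (hypothetical helper)."""
    units = kernel * kernel * in_maps          # one multiply-add unit per weight
    return units, units * dsp_per_unit, units * lut_per_unit

units, dsp_low, lut_low = parallel_resources(11, 3, dsp_per_unit=3, lut_per_unit=200)
_, dsp_high, lut_high = parallel_resources(11, 3, dsp_per_unit=5, lut_per_unit=400)

print(units)               # 363
print(dsp_low, dsp_high)   # 1089 1815
print(lut_low, lut_high)   # 72600 145200
```

Under this accounting, the low end of the range lands close to the ~1200 DSPs and ~75000 LUTs quoted above.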
Increasing Throughput With Pipelining
• Pipelining is a hardware concept to achieve higher throughput
  • Helpful with complex multi-cycle operations; works by registering intermediate results
• Pipeline 3D convolutions on one dimension & parallelize the other
  • For example, convolve the weight matrix with an input feature map in parallel, and pipeline across the different feature maps
• Zhang et al. [3] convolve a set of input feature maps with a set of weight matrices in parallel and pipeline over the size of the input feature map
[Figure: input feature maps (N x R x C), M weight filters of size N x K x K, and output feature maps (M x R' x C'), with tile sizes Tn, Tm, Tr and Tc]
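The tiling idea can be imitated in software to count pipeline steps. The sketch below is loosely modeled on the loop structure described for [3]: a tile of Tm output maps and Tn input maps is treated as running in parallel, while rows, columns and filter taps are stepped through one at a time. It is a simplification written for this transcript (the Tr and Tc tiles, which size on-chip buffers, are omitted), and the function name and test data are assumptions.

```python
import math

def tiled_conv(inputs, weights, tm, tn):
    """inputs: N x R x C, weights: M x N x K x K. Returns (outputs, pipeline_steps)."""
    n_in, m_out = len(inputs), len(weights)
    k = len(weights[0][0])
    out_rows = len(inputs[0]) - k + 1
    out_cols = len(inputs[0][0]) - k + 1
    out = [[[0.0] * out_cols for _ in range(out_rows)] for _ in range(m_out)]
    steps = 0
    for m0 in range(0, m_out, tm):             # tile of output feature maps
        for n0 in range(0, n_in, tn):          # tile of input feature maps
            for r in range(out_rows):          # pipelined dimensions
                for c in range(out_cols):
                    for i in range(k):
                        for j in range(k):
                            steps += 1         # one pipeline step
                            # In hardware these Tm x Tn multiply-accumulates
                            # would execute in parallel within the step.
                            for m in range(m0, min(m0 + tm, m_out)):
                                for n in range(n0, min(n0 + tn, n_in)):
                                    out[m][r][c] += weights[m][n][i][j] * inputs[n][r + i][c + j]
    return out, steps

# Small integer-valued test data so that sums are exact in floating point.
inputs = [[[float(n + 2 * r + c) for c in range(4)] for r in range(4)] for n in range(3)]
weights = [[[[float((m + n + i + j) % 3 - 1) for j in range(2)] for i in range(2)]
            for n in range(3)] for m in range(4)]

reference, reference_steps = tiled_conv(inputs, weights, tm=4, tn=3)   # everything in one tile
tiled, tiled_steps = tiled_conv(inputs, weights, tm=2, tn=2)

expected_steps = math.ceil(4 / 2) * math.ceil(3 / 2) * 3 * 3 * 2 * 2   # 144
```

The step count grows as ceil(M/Tm) * ceil(N/Tn) * R' * C' * K * K, so larger tiles trade FPGA resources for fewer steps while producing the same result.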
Mapping 3D Convolutions into Matrix Multiplications
• A simple way is to flatten the feature maps and create an array of feature maps; below is an illustration for the first layer of AlexNet
• The weight matrices are flattened, and the input feature maps are rearranged so that each column holds the neighborhood required for one convolution
[Figure: Y = W x X, where Y is the matrix of output feature maps (96 x 3025, with 55 x 55 = 3025), W is the matrix of weight coefficients (96 x 363, with 3 x 11 x 11 = 363), and X is the matrix of input feature maps (363 x 3025)]
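This rearrangement is commonly called im2col. A minimal sketch in plain Python follows; the helper names and the small example sizes are assumptions of this illustration, and a real implementation would use an optimized matrix-multiply routine rather than nested lists.

```python
def im2col(inputs, k, stride=1):
    """Rearrange N x R x C inputs into X with shape (N*k*k) x (R'*C').

    Each column of X holds the neighborhood needed for one output position.
    """
    n_in = len(inputs)
    out_rows = (len(inputs[0]) - k) // stride + 1
    out_cols = (len(inputs[0][0]) - k) // stride + 1
    neighborhoods = [
        [inputs[n][r * stride + i][c * stride + j]
         for n in range(n_in) for i in range(k) for j in range(k)]
        for r in range(out_rows) for c in range(out_cols)
    ]
    return [list(col) for col in zip(*neighborhoods)]   # transpose: neighborhoods become columns

def flatten_weights(weights):
    """Flatten M x N x k x k filters into W with shape M x (N*k*k)."""
    return [[v for fmap in w for row in fmap for v in row] for w in weights]

def matmul(a, b):
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

# Small example: 2 input maps of 4x4, 3 filters of 2x2, stride 2.
inputs = [[[float(n * 16 + r * 4 + c) for c in range(4)] for r in range(4)] for n in range(2)]
weights = [[[[float(m + n - i + j) for j in range(2)] for i in range(2)]
            for n in range(2)] for m in range(3)]

X = im2col(inputs, k=2, stride=2)      # 8 x 4
W = flatten_weights(weights)           # 3 x 8
Y = matmul(W, X)                       # 3 x 4: one row per output feature map

# Shapes for AlexNet layer 1, as given in the illustration above.
layer1_shapes = {"W": (96, 3 * 11 * 11), "X": (3 * 11 * 11, 55 * 55), "Y": (96, 55 * 55)}
```

The rearranged X duplicates input values wherever neighborhoods overlap, which costs memory but turns the whole layer into a single dense matrix multiplication.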
Implementation Challenges
• A larger number of compute units exhausts the FPGA resources
  • Each compute unit takes a few hundred LUTs and 3-5 DSPs
• Data organization must ensure the compute units are performing to the max
  • Need to read a lot of data in parallel
  • Data has to be stored on-chip to enable parallel access
• Routing turns out to be a bigger challenge
  • Proper data organization, architecture & tools are the way to overcome it
[Figure: bits required per cycle (axis 0 to 80000) against parallelism (256, 512, 768); label: Bits per operation]
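A back-of-the-envelope estimate shows why parallel data access becomes the bottleneck. The sketch assumes each compute unit reads two operands (one weight and one input value) per cycle; that operand count is an assumption made here, and the exact accounting behind the chart cannot be recovered from this transcript.

```python
def operand_bits_per_cycle(parallel_units, bits_per_value, operands_per_unit=2):
    """Bits that must be read every cycle to keep all compute units busy."""
    return parallel_units * operands_per_unit * bits_per_value

# Parallelism levels from the chart, for 16-bit and 32-bit data.
table = {units: {bits: operand_bits_per_cycle(units, bits) for bits in (16, 32)}
         for units in (256, 512, 768)}

for units, row in table.items():
    print(units, row)
# 256 {16: 8192, 32: 16384}
# 512 {16: 16384, 32: 32768}
# 768 {16: 24576, 32: 49152}
```

Tens of thousands of bits per cycle cannot come from external memory, which is why the data has to be held on-chip and spread across many memory blocks.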
Using Alternate Data Representations
• Single precision floating point
  • Uses 32 bits to represent each data value
  • Requires more DSPs (3-5) to implement multiply/accumulate
• Fixed point
  • 16-bit fixed point representation would suffice for many applications [4]
  • With stochastic rounding, it performs similarly to single precision floating point representation [5]
• Half precision
  • Uses 16 bits to represent data
  • Significant reduction in routing & overall FPGA resources
• Mixed representation
  • Use fixed point or half precision representation for some layers and single precision representation for others
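Stochastic rounding rounds a value up with probability equal to its fractional remainder, so the rounding error averages out to zero. The sketch below illustrates the idea for signed fixed point; the function name, the Q-format parameters and the saturation behavior are assumptions of this example, not details taken from [5].

```python
import math
import random

def to_fixed(x, frac_bits, total_bits=16, stochastic=False, rng=random):
    """Quantize x to signed fixed point; returns the real value that is represented."""
    scaled = x * (1 << frac_bits)
    lower = math.floor(scaled)
    if stochastic:
        # Round up with probability equal to the fractional remainder.
        q = lower + (1 if rng.random() < scaled - lower else 0)
    else:
        q = math.floor(scaled + 0.5)           # round to nearest
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, q))                    # saturate to the representable range
    return q / (1 << frac_bits)

# With 2 fractional bits the grid spacing is 0.25, so 0.3 lies between 0.25 and 0.5.
nearest = to_fixed(0.3, frac_bits=2)           # always 0.25

rng = random.Random(0)                         # seeded for repeatability
samples = [to_fixed(0.3, frac_bits=2, stochastic=True, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)             # close to 0.3 on average
```

Round-to-nearest always returns 0.25 here and so carries a fixed bias of -0.05, whereas the stochastic version returns 0.25 or 0.5 in proportions that preserve the value in expectation.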
A complete CNN on the FPGA using OpenCL
• OpenCL tools enable software programmers to use the FPGA accelerator without learning hardware methodologies
• Programmer calls OpenCL functions to accelerate on the FPGA
[Figure: Configure & setup, 3D Convolutions, Dense layers, Softmax]
Performance of AlexNet on FPGAs
FPGAs can achieve an impressive 14 images/sec/Watt, compared to high-end GPUs such as the Tesla K40, which can reach 4 images/sec/Watt
Summary
• 3D convolutions are a key part of a CNN and are compute-intensive
• In FPGAs, 3D convolutions can be implemented efficiently with a parallel & pipelined implementation
• FPGA resources (gates & routing) will be the critical factors in achieving a highly parallel implementation
• OpenCL implementation tools, such as Xilinx SDAccel, simplify the implementation task and provide a software flow
• Alternate data representations can be used to reduce the complexity
• Mixed data representations can simplify the computations without compromising on the performance
• FPGAs are capable of delivering high performance at a suitable power profile for the data center
References
• [1] K. Ovtcharov, et al., "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware," Microsoft Research, 2015
• [2] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, 2012
• [3] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," FPGA'2015, 2015
• [4] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Akselrod and S. Talay, "Large-scale FPGA-based convolutional networks," in Machine Learning on Very Large Data Sets, 2011
• [5] S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, "Deep Learning with Limited Numerical Precision," arXiv preprint arXiv:1502.02551, 2015
Nagesh Gupta
12 May 2015
Deep Neural Networks in FPGAs
Computations vs. Data Transfers
Convolution layers
Input size   Input feature maps   Output feature maps   Filter size   Computations   Total data transfer
224 x 224    3                    96                    11x11         110 * 10^6     255 * 10^3
27 x 27      96                   256                   5x5           448 * 10^6     728 * 10^3
13 x 13      256                  384                   3x3           150 * 10^6     993 * 10^3
13 x 13      384                  384                   3x3           224 * 10^6     1457 * 10^3
13 x 13      384                  256                   3x3           150 * 10^6     959 * 10^3
Dense layers
Input data   Weight matrix   Computations   Data transfers
9216         9216 x 4096     38 * 10^6      38 * 10^6
4096         4096 x 4096     16 * 10^6      16 * 10^6
4096         4096 x 1000     4 * 10^6       4 * 10^6
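The table entries can be approximately reproduced from the layer dimensions. The accounting below is an assumption that happens to match the figures: computations are counted as multiply-accumulates, and data transfer as input values plus weights plus output values after pooling. The output sizes (55, 27, 13) and pooled sizes (27, 13, 6) are the usual AlexNet dimensions and do not appear in the table itself.

```python
def conv_counts(in_size, in_maps, out_maps, k, out_size, pooled_size):
    """Return (multiply-accumulates, values transferred) for one convolution layer."""
    weights = out_maps * in_maps * k * k
    macs = out_size * out_size * weights
    transfers = in_size * in_size * in_maps + weights + pooled_size * pooled_size * out_maps
    return macs, transfers

# (input size, input maps, output maps, filter size, output size, size passed to next layer)
conv_layers = [
    (224, 3, 96, 11, 55, 27),
    (27, 96, 256, 5, 27, 13),
    (13, 256, 384, 3, 13, 13),
    (13, 384, 384, 3, 13, 13),
    (13, 384, 256, 3, 13, 6),
]
counts = [conv_counts(*layer) for layer in conv_layers]
macs_millions = [round(m / 1e6) for m, _ in counts]       # [105, 448, 150, 224, 150]
transfers_thousands = [round(t / 1e3) for _, t in counts]  # [255, 728, 993, 1457, 959]

# Dense layers: one multiply-accumulate and one weight fetch per matrix entry.
dense_macs = [9216 * 4096, 4096 * 4096, 4096 * 1000]
```

The data-transfer column matches the table for every layer. The computation column matches for layers 2 to 5; layer 1 comes out near 105 * 10^6 rather than 110 * 10^6 under the 55 x 55 output assumption, and the dense-layer entries in the table are rounded values of the products above.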