"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation from Auviz Systems

Copyright © 2015 Auviz Systems

Nagesh Gupta

12 May 2015

Trade-offs in Implementing Deep Neural Networks on FPGAs


• Startup, specializes in implementing & optimizing algorithms on FPGAs

• Offers libraries of different classes of algorithms

• AuvizCV: optimized OpenCV algorithms

• AuvizLA: optimized BLAS

• AuvizDNN: optimized deep neural networks

• Also develops custom algorithms in computer vision, linear algebra, deep learning & machine learning

• Libraries are available as OpenCL function calls, so software users are shielded from the complexity of using an FPGA

• Visit our booth & see AlexNet running on Xilinx FPGA!

Auviz Systems


The Time for Artificial Intelligence & Machine Learning

• Sources: Cisco/Statista, Facebook research, IT Business Edge


Machine Learning Moving to the Data Center

Performance/watt

Programming model & use model

Microsoft Azure ML provides machine learning as a service in the cloud

IBM Watson on Jeopardy: one of the best demonstrations of machine learning

Amazon AWS ML & Google Predictive Analytics: other machine learning services in the cloud


• CNNs are a form of deep neural network used for various “recognition” tasks

• AlexNet [2] is a CNN configuration that was used to classify 1.2 million images

Convolutional Neural Networks (CNNs)


• A convolution layer has multiple stages

• 3D Convolutions:

• Activation: uses the ReLU function, max(x, 0)

• Max pooling: a sub-sampling function that selects the maximum value within a neighborhood

Components of AlexNet: Convolution layers

[Figure: 3D convolutions, activation (ReLU), sub-sampling (max pooling)]
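
The activation and sub-sampling stages listed above can be sketched in a few lines of plain Python. This is only an illustration: the function names `relu` and `max_pool`, the 4x4 sample feature map, and the 2x2 window with stride 2 are assumptions chosen for the example, not values from the presentation.

```python
def relu(fmap):
    """Apply max(x, 0) to every element of a 2D feature map (list of rows)."""
    return [[max(x, 0.0) for x in row] for row in fmap]


def max_pool(fmap, size, stride):
    """Select the maximum of each size x size neighborhood, stepping by stride."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            window = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(max(window))
        out.append(out_row)
    return out


# Hypothetical 4x4 feature map, as might come out of a convolution stage.
fmap = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, 1.0, 0.0],
    [0.0, 1.5, -1.0, 2.5],
    [-0.5, 3.0, 2.0, -4.0],
]
activated = relu(fmap)
pooled = max_pool(activated, size=2, stride=2)
print(pooled)  # [[4.0, 1.0], [3.0, 2.5]]
```

Both stages are element-wise or window-wise comparisons, so they are cheap next to the multiply-accumulate work of the convolution itself.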


• Dense layers are fully connected: each output node is a function of all the input nodes

• The first 2 dense layers can be represented as a matrix-vector multiplication operation

• Layer 6 has 9216 inputs, which are multiplied by a weight matrix to create 4096 outputs

• Layer 7 has 4096 inputs, which are multiplied by a different weight matrix to create 4096 outputs

• The output layer uses SoftMax to classify the input image into one of 1000 classes

Dense Layers in AlexNet

[Figure: Layer 6, Layer 7, output layer]
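
A dense layer as described above is a matrix-vector product, and the output layer adds a softmax. The following is a minimal sketch under stated assumptions: the toy sizes (3 inputs, 2 hidden nodes, 2 classes) and all weight values are made up to keep the example readable, and any activation between the dense layers is omitted. Only the layer dimensions in the final count come from the slide.

```python
import math


def dense(weights, inputs):
    """Matrix-vector product: each output is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sizes stand in for the 9216 -> 4096 -> 4096 -> 1000 chain on the slide.
x = [1.0, 2.0, 3.0]
w_hidden = [[0.1, 0.2, 0.3], [0.0, -1.0, 1.0]]  # hypothetical weights
w_out = [[1.0, 0.0], [0.0, 1.0]]                # hypothetical weights

hidden = dense(w_hidden, x)
probs = softmax(dense(w_out, hidden))
predicted_class = probs.index(max(probs))

# Multiply-accumulate counts for the dense layer sizes given on the slide.
macs = [9216 * 4096, 4096 * 4096, 4096 * 1000]
print(predicted_class, macs)  # 0 [37748736, 16777216, 4096000]
```

Every weight is used exactly once per image, which is why the dense layers move about as many values as they compute.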


• Sequential implementation

• Implementation follows the convolution equations

• Resource utilization will be very low, but the latency at 200 MHz will be 22 s for the 2nd layer

• High-level synthesis (HLS) can be used to implement this, as shown in [3]

• Better performance can be obtained by parallelizing the implementation

Implementing 3D Convolutions

[Figure: weight matrices, input feature maps, output feature maps]
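
A sequential implementation that follows the convolution equations is essentially a six-deep loop nest performing one multiply-accumulate at a time. The sketch below is an illustration in Python rather than HLS C. The function name `conv3d_sequential`, the array layout, and the tiny test data are assumptions, and, as is usual for CNNs, the kernel is not flipped.

```python
def conv3d_sequential(inputs, weights, stride=1):
    """Direct 3D convolution, one multiply-accumulate at a time.

    inputs:  N feature maps, each R x C      -> inputs[n][r][c]
    weights: M filters, each N x K x K       -> weights[m][n][i][j]
    returns: (M output feature maps, number of multiply-accumulates)
    """
    n_in, rows, cols = len(inputs), len(inputs[0]), len(inputs[0][0])
    n_out, k = len(weights), len(weights[0][0])
    out_rows = (rows - k) // stride + 1
    out_cols = (cols - k) // stride + 1
    outputs = [[[0.0] * out_cols for _ in range(out_rows)] for _ in range(n_out)]
    macs = 0
    for m in range(n_out):                  # each output feature map
        for r in range(out_rows):           # each output row
            for c in range(out_cols):       # each output column
                acc = 0.0
                for n in range(n_in):       # each input feature map
                    for i in range(k):      # filter rows
                        for j in range(k):  # filter columns
                            acc += (weights[m][n][i][j]
                                    * inputs[n][r * stride + i][c * stride + j])
                            macs += 1
                outputs[m][r][c] = acc
    return outputs, macs


# Tiny example: 2 input maps of 3x3 and 1 filter of shape 2x2x2 filled with ones.
inputs = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
weights = [[[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]]
outputs, macs = conv3d_sequential(inputs, weights)
print(outputs)  # [[[16.0, 20.0], [28.0, 32.0]]]
print(macs)     # 32
```

The multiply-accumulate count grows as M x R' x C' x N x K x K, which is what makes a purely sequential design slow and motivates the parallel and pipelined variants.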


[Figure: computations and data transfers, plotted on a log scale from 1.E+00 to 1.E+09]

Computations vs. Data Transfers in AlexNet

• Computation latency, 2nd convolution layer

• With 512 single-precision floating-point operations in parallel, the 2nd convolution layer takes 2.2 ms to complete at 200 MHz

• Data transfer latency, 2nd convolution layer

• With 64-bit DDR at 1.3 Gb/s, the single-precision floating-point data fetch latency is around 0.5 ms

3D convolutions require more computations, while the data transfers are higher for the dense layers
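
The relation between operation count, parallelism, and clock rate can be written as a one-line model. The sketch below is an idealization with assumed names (`compute_latency_s`, `cycles_per_op`). It takes the 448 * 10^6 figure from the deck's table of computations and assumes one operation per unit per cycle. The slide's own figures (22 s sequential, 2.2 ms with 512 operations) rest on details the deck does not spell out, such as cycles per operation and how operations are counted, so the numbers printed here are not expected to reproduce them.

```python
def compute_latency_s(operations, parallel_units, clock_hz, cycles_per_op=1):
    """Idealized latency: each unit completes one operation every cycles_per_op cycles."""
    cycles = operations * cycles_per_op / parallel_units
    return cycles / clock_hz


OPS_LAYER2 = 448 * 10**6  # 2nd convolution layer, from the deck's table
CLOCK_HZ = 200 * 10**6    # 200 MHz, as on the slide

sequential = compute_latency_s(OPS_LAYER2, parallel_units=1, clock_hz=CLOCK_HZ)
parallel = compute_latency_s(OPS_LAYER2, parallel_units=512, clock_hz=CLOCK_HZ)
print(f"{sequential:.2f} s sequential, {parallel * 1e3:.1f} ms with 512 units")
```

Whatever the exact constants, the model shows why parallelism shifts the bottleneck: compute time falls linearly with the number of units, while the data fetch time does not.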


3D Convolution—Parallel Implementation

• An 11x11 weight matrix with 3 input feature maps requires 121*3 multiplications and 121*3 additions

• With 363 multiply units and 363 adders, this can be done in 1 cycle

• The FPGA resources required for each single-precision floating-point operation are 2-5 DSP blocks and 200-400 LUTs

• Implementing this in parallel will require ~1200 DSPs and ~75000 LUTs

[Figure: 11x11 weight matrix x 11x11 input feature map = 1 output value]
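
The resource estimate above is simple multiplication, sketched here with the per-operation ranges quoted on the slide. The helper name `resource_range` is hypothetical, and counting one operator per multiply-add pair (363 units rather than 726) is an assumption about how the slide arrived at its totals.

```python
def resource_range(units, dsp_per_unit=(2, 5), lut_per_unit=(200, 400)):
    """Return ((min_dsp, max_dsp), (min_lut, max_lut)) for `units` operators."""
    dsp = (units * dsp_per_unit[0], units * dsp_per_unit[1])
    lut = (units * lut_per_unit[0], units * lut_per_unit[1])
    return dsp, lut


units = 11 * 11 * 3  # one operator per weight of the 11x11x3 filter
dsp, lut = resource_range(units)
print(units, dsp, lut)  # 363 (726, 1815) (72600, 145200)
```

Under that assumption, the slide's figures of ~1200 DSPs and ~75000 LUTs fall inside the computed ranges.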


Increasing Throughput With Pipelining

• Pipelining is a hardware technique for achieving higher throughput

• It helps with complex multi-cycle operations and works by registering intermediate results

• Pipeline 3D convolutions along one dimension & parallelize the other

• For example, convolve the weight matrix with an input feature map in parallel, and pipeline across the different feature maps

• Zhang et al. [3] convolve a set of input feature maps with a set of weight matrices in parallel, and pipeline over the size of the input feature map

[Figure: input feature maps (N x R x C), M weight filters of size N x K x K, and output feature maps (M x R' x C'), with tile sizes Tm, Tn, Tr, Tc]
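
The tile sizes Tm, Tn, Tr, and Tc in the figure suggest a loop nest that is split into outer tile loops and inner per-tile loops. The sketch below is loosely modeled on that idea and is not the design from [3]. The function names, the loop order, and the integer test data are assumptions, and Python runs everything sequentially, so the comments only mark where hardware would pipeline or unroll.

```python
def conv_direct(inputs, weights):
    """Reference loop nest: out[m][r][c] is the sum over n, i, j (stride 1)."""
    n_in, k = len(inputs), len(weights[0][0])
    out_rows = len(inputs[0]) - k + 1
    out_cols = len(inputs[0][0]) - k + 1
    out = [[[0] * out_cols for _ in range(out_rows)] for _ in weights]
    for m, filt in enumerate(weights):
        for r in range(out_rows):
            for c in range(out_cols):
                out[m][r][c] = sum(filt[n][i][j] * inputs[n][r + i][c + j]
                                   for n in range(n_in)
                                   for i in range(k) for j in range(k))
    return out


def conv_tiled(inputs, weights, tm, tn, tr, tc):
    """Same arithmetic as conv_direct, visited in tiles of Tm x Tn x Tr x Tc."""
    n_in, n_out, k = len(inputs), len(weights), len(weights[0][0])
    out_rows = len(inputs[0]) - k + 1
    out_cols = len(inputs[0][0]) - k + 1
    out = [[[0] * out_cols for _ in range(out_rows)] for _ in range(n_out)]
    for r0 in range(0, out_rows, tr):
        for c0 in range(0, out_cols, tc):
            for m0 in range(0, n_out, tm):
                for n0 in range(0, n_in, tn):
                    # One tile: in hardware the r/c loops would be pipelined and
                    # the m/n loops unrolled into parallel multiply-adds.
                    for r in range(r0, min(r0 + tr, out_rows)):
                        for c in range(c0, min(c0 + tc, out_cols)):
                            for i in range(k):
                                for j in range(k):
                                    for m in range(m0, min(m0 + tm, n_out)):
                                        for n in range(n0, min(n0 + tn, n_in)):
                                            out[m][r][c] += (weights[m][n][i][j]
                                                             * inputs[n][r + i][c + j])
    return out


# Integer data keeps the comparison exact regardless of summation order.
inputs = [[[(n + 1) * (r * 5 + c) for c in range(5)] for r in range(5)]
          for n in range(3)]
weights = [[[[m + n + i - j for j in range(2)] for i in range(2)]
            for n in range(3)] for m in range(4)]
tiled = conv_tiled(inputs, weights, tm=2, tn=2, tr=3, tc=2)
print(tiled == conv_direct(inputs, weights))  # True
```

Tiling changes only the order of the work, not the result. The tile sizes decide how much data must sit on-chip and how many multiply-adds run side by side.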


• A simple way is to flatten the feature maps and create an array of feature maps; below is an illustration for the first layer of AlexNet

• The weight matrices are flattened, and the input feature maps are rearranged so that each column holds the neighborhood required for one convolution

Mapping 3D Convolutions into Matrix Multiplications

[Figure: Y (96 x 3025 matrix of output feature maps) = W (96 x 363 matrix of weight coefficients) x X (363 x 3025 matrix of input feature maps), where 363 = 3 x 11 x 11 and 3025 = 55 x 55]
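
The rearrangement described above is often called im2col. The sketch below shows it on a tiny input so the matrices can be checked by hand. The function names (`im2col`, `flatten_weights`, `matmul`) and the test data are assumptions for illustration. The last lines only recompute the matrix shapes given for the first AlexNet layer.

```python
def im2col(inputs, k, stride=1):
    """Build X: one column per output pixel, holding its N*k*k input values."""
    n_in, rows, cols = len(inputs), len(inputs[0]), len(inputs[0][0])
    out_rows = (rows - k) // stride + 1
    out_cols = (cols - k) // stride + 1
    columns = []
    for r in range(out_rows):
        for c in range(out_cols):
            columns.append([inputs[n][r * stride + i][c * stride + j]
                            for n in range(n_in)
                            for i in range(k) for j in range(k)])
    # Transpose so X has N*k*k rows and out_rows*out_cols columns.
    return [list(row) for row in zip(*columns)]


def flatten_weights(weights):
    """Build W: flatten M filters of shape N x k x k into an M x (N*k*k) matrix."""
    return [[w for fmap in filt for row in fmap for w in row] for filt in weights]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


# Tiny example: 2 input maps of 3x3 and 1 filter of shape 2x2x2 filled with ones.
inputs = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
weights = [[[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]]

X = im2col(inputs, k=2)
W = flatten_weights(weights)
Y = matmul(W, X)
print(len(W), len(W[0]), len(X), len(X[0]))  # 1 8 8 4
print(Y)                                     # [[16.0, 20.0, 28.0, 32.0]]

# Shapes for the first AlexNet layer, as in the illustration.
print((96, 3 * 11 * 11), (3 * 11 * 11, 55 * 55))  # (96, 363) (363, 3025)
```

The price of this mapping is duplication: overlapping neighborhoods are copied into several columns of X, so X holds more values than the original feature maps.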


• A larger number of compute units exhausts the FPGA resources

• Each compute unit takes a few hundred LUTs and 3-5 DSPs

• Data must be organized so that the compute units run at their maximum rate

• Need to read a lot of data in parallel

• Data has to be stored on-chip to enable parallel access

• Routing turns out to be a bigger challenge

• Proper data organization, architecture & tools are the way to overcome it

Implementation Challenges

[Figure: bits required per cycle (0 to 80000) versus parallelism (256, 512, 768) and bits per operation]


• Single-precision floating point

• Uses 32 bits to represent each data value

• Requires more DSPs (3-5) to implement multiply/accumulate

• Fixed point

• A 16-bit fixed-point representation would suffice for many applications [4]

• With stochastic rounding, fixed point performs similarly to single-precision floating point [5]

• Half precision

• Uses 16 bits to represent each data value

• Significant reduction in routing & overall FPGA resources

• Mixed representation

• Use fixed-point or half-precision representation for some layers and single-precision representation for others

Using Alternate Data Representations
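
The difference between round-to-nearest and stochastic rounding in a 16-bit fixed-point format can be illustrated as follows. This is a sketch in the spirit of [5], not the procedure from that paper. The function names, the choice of 8 fractional bits, the sample value, and the fixed random seed are all assumptions.

```python
import math
import random


def to_fixed(value, frac_bits, total_bits=16, stochastic=False, rng=random):
    """Quantize a float to a signed fixed-point integer with frac_bits fractional bits."""
    scaled = value * (1 << frac_bits)
    floor = math.floor(scaled)
    if stochastic:
        # Round up with probability equal to the fractional remainder.
        rounded = floor + (1 if rng.random() < scaled - floor else 0)
    else:
        rounded = math.floor(scaled + 0.5)  # round to nearest
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, rounded))        # saturate on overflow


def to_float(fixed, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return fixed / (1 << frac_bits)


FRAC_BITS = 8
small = 0.001  # smaller than half of the resolution 1/256

nearest = to_fixed(small, FRAC_BITS)
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [to_fixed(small, FRAC_BITS, stochastic=True, rng=rng) for _ in range(20000)]
mean_value = to_float(sum(samples) / len(samples), FRAC_BITS)

print(nearest)     # 0: round-to-nearest always loses this value
print(mean_value)  # close to 0.001 on average
```

Round-to-nearest discards every update smaller than half a step, whereas stochastic rounding preserves such values in expectation, which is the property the slide's claim relies on.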


• OpenCL tools enable software programmers to use the FPGA accelerator without learning hardware methodologies

• The programmer calls OpenCL functions to accelerate on the FPGA

A complete CNN on the FPGA using OpenCL

[Figure: configure & setup, 3D convolutions, dense layers, softmax]


Performance of AlexNet on FPGAs

FPGAs can achieve 14 images/sec/Watt, compared with 4 images/sec/Watt for high-end GPUs such as the Tesla K40


• 3D convolutions are a key part of a CNN and are compute-intensive

• In FPGAs, 3D convolutions can be implemented efficiently with a parallel & pipelined implementation

• FPGA resources (gates & routing) will be the critical factors in achieving a highly parallel implementation

• OpenCL implementation tools, such as Xilinx SDAccel, simplify the implementation task and provide a software flow

• Alternate data representations can be used to reduce the complexity

• Mixed data representations can simplify the computations without compromising performance

• FPGAs are capable of delivering high performance at a power profile suitable for the data center

Summary


• [1] K. Ovtcharov, et al., "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware," Microsoft Research, 2015.

• [2] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, 2012.

• [3] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," FPGA'2015, 2015.

• [4] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Akselrod and S. Talay, "Large-scale FPGA-based Convolutional Networks," in Machine Learning on Very Large Data Sets, 2011.

• [5] S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, "Deep Learning with Limited Numerical Precision," arXiv preprint arXiv:1502.02551, 2015.

References


Nagesh Gupta

12 May 2015

Deep Neural Networks in FPGAs


Computations vs. Data Transfers

Convolution layers

Input size   Input feature maps   Output feature maps   Filter size   Computations   Total data transfer
224 x 224    3                    96                    11x11         110 * 10^6     255 * 10^3
27 x 27      96                   256                   5x5           448 * 10^6     728 * 10^3
13 x 13      256                  384                   3x3           150 * 10^6     993 * 10^3
13 x 13      384                  384                   3x3           224 * 10^6     1457 * 10^3
13 x 13      384                  256                   3x3           150 * 10^6     959 * 10^3

Dense layers

Input data   Weight matrix   Computations   Data transfers
9216         9216 x 4096     38 * 10^6      38 * 10^6
4096         4096 x 4096     16 * 10^6      16 * 10^6
4096         4096 x 1000     4 * 10^6       4 * 10^6
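
The table entries can be approximated from the layer shapes. The sketch below is a reconstruction under assumptions, not the author's method: computations are counted as multiply-accumulates, data transfer as input values plus weights plus output values after pooling, and the output and pooled sizes (55, 27, 13, 6) are the usual AlexNet values rather than numbers from the table. With those assumptions, the results agree with the table after rounding, except for the first layer's computations (about 105 * 10^6 here versus 110 * 10^6 in the table).

```python
def conv_layer_counts(in_size, in_maps, out_maps, k, out_size, pooled_size):
    """Return (multiply-accumulates, values moved) for one convolution layer."""
    macs = out_size * out_size * out_maps * in_maps * k * k
    inputs = in_size * in_size * in_maps
    weights = k * k * in_maps * out_maps
    outputs = pooled_size * pooled_size * out_maps
    return macs, inputs + weights + outputs


# (in_size, in_maps, out_maps, k, out_size, pooled_size); the last two are assumed.
CONV_LAYERS = [
    (224, 3, 96, 11, 55, 27),
    (27, 96, 256, 5, 27, 13),
    (13, 256, 384, 3, 13, 13),
    (13, 384, 384, 3, 13, 13),
    (13, 384, 256, 3, 13, 6),
]
conv_counts = [conv_layer_counts(*layer) for layer in CONV_LAYERS]
for macs, moved in conv_counts:
    print(f"{macs / 1e6:.0f} * 10^6 computations, {moved / 1e3:.0f} * 10^3 values moved")

# Dense layers: one multiply-accumulate and one transferred weight per matrix entry.
dense_counts = [rows * cols for rows, cols in [(9216, 4096), (4096, 4096), (4096, 1000)]]
print(dense_counts)  # [37748736, 16777216, 4096000]
```

The contrast the table draws falls out directly: convolution layers reuse each weight at every output position, while dense layers use each weight once per image.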
