"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation from Auviz Systems


Transcript of "Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation from Auviz Systems

Page 1: Trade-offs in Implementing Deep Neural Networks on FPGAs

Nagesh Gupta

12 May 2015

Page 2: Auviz Systems

• Startup specializing in implementing & optimizing algorithms on FPGAs
• Offers libraries of different classes of algorithms:
  • AuvizCV—optimized OpenCV algorithms
  • AuvizLA—optimized BLAS
  • AuvizDNN—optimized deep neural networks
• Develops custom algorithms in computer vision, linear algebra, deep learning & machine learning
• Available as OpenCL function calls, so software users are abstracted from the complexity of using an FPGA
• Visit our booth & see AlexNet running on a Xilinx FPGA!

Page 3: The Time for Artificial Intelligence & Machine Learning

• Sources: Cisco/Statista, Facebook research, IT Business Edge

Page 4: Machine Learning Moving to the Data Center

• Key considerations: performance/watt, programming model & use model
• Microsoft Azure ML—provides machine learning as a service on the cloud
• IBM Watson at Jeopardy—one of the best demonstrations of machine learning
• Amazon AWS ML & Google Predictive Analytics—other machine learning services on the cloud

Page 5: Convolutional Neural Networks (CNNs)

• A form of deep neural network—used for various "recognition" tasks
• AlexNet [2], a CNN configuration shown below, was used to classify 1.2 million images

Page 6: Components of AlexNet—Convolution Layers

• A convolution layer has multiple stages:
  • 3D convolutions
  • Activation: using the ReLU function, max(x, 0)
  • Max pooling: a sub-sampling function that selects the maximum value within a neighborhood

[Figure: 3D Convolutions → Activation (ReLU) → Sub-sampling (Max pooling)]
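To make the three stages concrete, here is a minimal single-channel C++ sketch of one stage sequence (convolve, ReLU, 2x2 max pool). The function names and fixed sizes are illustrative assumptions, not AlexNet's actual dimensions:

```cpp
#include <algorithm>
#include <vector>

// 2D convolution of one input feature map with one KxK filter (valid region only).
std::vector<float> convolve(const std::vector<float>& in, int w, int h,
                            const std::vector<float>& filt, int k) {
    int ow = w - k + 1, oh = h - k + 1;
    std::vector<float> out(ow * oh, 0.0f);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x)
            for (int fy = 0; fy < k; ++fy)
                for (int fx = 0; fx < k; ++fx)
                    out[y * ow + x] += in[(y + fy) * w + (x + fx)] * filt[fy * k + fx];
    return out;
}

// Activation: ReLU, max(x, 0), applied elementwise.
void relu(std::vector<float>& v) {
    for (float& x : v) x = std::max(x, 0.0f);
}

// Max pooling: select the maximum in each non-overlapping 2x2 neighborhood.
std::vector<float> maxpool2x2(const std::vector<float>& in, int w, int h) {
    int ow = w / 2, oh = h / 2;
    std::vector<float> out(ow * oh);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x)
            out[y * ow + x] = std::max({in[2 * y * w + 2 * x],
                                        in[2 * y * w + 2 * x + 1],
                                        in[(2 * y + 1) * w + 2 * x],
                                        in[(2 * y + 1) * w + 2 * x + 1]});
    return out;
}
```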

Page 7: Dense Layers in AlexNet

• Dense layers are fully connected—each output node is a function of all the input nodes
• The first two dense layers can be represented as matrix-vector multiplication operations:
  • Layer 6 has 9216 inputs, which are multiplied with a weight matrix to create 4096 outputs
  • Layer 7 has 4096 inputs, which are multiplied with a different weight matrix to create 4096 outputs
• The output layer uses SoftMax to classify the input image into one of 1000 classes

[Figure: Layer 6 → Layer 7 → Output layer]
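Since each output node depends on all inputs, the dense layer is exactly a matrix-vector product. A minimal C++ sketch follows; the 9216/4096 dimensions are from the slide, while the function name and row-major weight layout are illustrative assumptions:

```cpp
#include <vector>

// A dense layer as a matrix-vector multiplication.
// For AlexNet's Layer 6, n_in = 9216 and n_out = 4096 (per the slide);
// weights are stored row-major, one row per output node.
std::vector<float> dense(const std::vector<float>& x,
                         const std::vector<float>& W,
                         int n_in, int n_out) {
    std::vector<float> y(n_out, 0.0f);
    for (int o = 0; o < n_out; ++o)      // each output node...
        for (int i = 0; i < n_in; ++i)   // ...is a function of all inputs
            y[o] += W[o * n_in + i] * x[i];
    return y;
}
```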

Page 8: Implementing 3D Convolutions

• Sequential implementation
  • The implementation follows the convolution equations directly
  • Resource utilization will be very low, but the latency at 200 MHz will be 22 s for the 2nd layer
• High-level synthesis (HLS) can be used for the implementation, as shown in [3]
• Better performance comes from parallelizing the implementation—see the sketch below

[Figure: weight matrices convolved with input feature maps to produce output feature maps]
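A sequential implementation is essentially the convolution equations written out as a loop nest, with one multiply-accumulate per inner iteration, which is what makes the latency so long. The C++ sketch below assumes unit stride and valid-region convolution; the variable names are illustrative:

```cpp
// Sequential 3D convolution: M output maps, N input maps, KxK filters.
// out[m][r][c] = sum over n, kr, kc of in[n][r+kr][c+kc] * w[m][n][kr][kc]
// One multiply-accumulate per inner iteration: low resource use, high latency.
void conv3d_sequential(const float* in, const float* w, float* out,
                       int N, int M, int R, int C, int K) {
    int IR = R + K - 1, IC = C + K - 1;        // input map size (valid conv)
    for (int m = 0; m < M; ++m)                // output feature maps
        for (int r = 0; r < R; ++r)            // output rows
            for (int c = 0; c < C; ++c) {      // output columns
                float acc = 0.0f;
                for (int n = 0; n < N; ++n)    // input feature maps
                    for (int kr = 0; kr < K; ++kr)
                        for (int kc = 0; kc < K; ++kc)
                            acc += in[(n * IR + r + kr) * IC + (c + kc)]
                                 * w[((m * N + n) * K + kr) * K + kc];
                out[(m * R + r) * C + c] = acc;
            }
}
```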

Page 9: Computations vs. Data Transfers in AlexNet

[Chart: computations vs. data transfers per AlexNet layer, log scale from 1 to 10^9]

• Computation latency, 2nd convolution layer
  • With 512 parallel single-precision floating-point operations, the 2nd convolution layer takes 2.2 ms to complete at 200 MHz
• Data transfer latency, 2nd convolution layer
  • With a 64-bit DDR interface at 1.3 Gb/s, the single-precision floating-point data fetch latency is around 0.5 ms

3D convolutions require more computations, while the data transfers are higher for the dense layers
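A back-of-envelope check of those latency numbers, as a sketch: it uses the 2nd-layer counts from the backup table on page 20, and lands within a factor of about 2 of the slide's figures, since the slide does not state exactly how operations and bus efficiency are counted:

```cpp
#include <cstdio>

// Back-of-envelope latency model (a sketch, not the presenter's exact model):
// compute latency  = operations / (parallel ops per cycle * clock frequency)
// transfer latency = bytes moved / memory bandwidth
int main() {
    double ops       = 448e6;              // 2nd conv layer computations (page 20)
    double ops_cycle = 512;                // parallel single-precision operations
    double fclk      = 200e6;              // 200 MHz
    double words     = 728e3;              // 2nd conv layer data transfer (page 20)
    double bandwidth = 64 / 8.0 * 1.3e9;   // 64-bit DDR at 1.3 Gb/s per pin

    printf("compute latency  ~ %.1f ms\n", ops / ops_cycle / fclk * 1e3);
    printf("transfer latency ~ %.1f ms\n", words * 4 / bandwidth * 1e3);
    return 0;
}
```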

Page 10: 3D Convolution—Parallel Implementation

[Figure: 11x11 weight matrix × 11x11 input feature map window = 1 output value]

• An 11x11 weight matrix with 3 input feature maps requires 121*3 = 363 multiplications and 363 additions per output value
• With 363 multiply units and 363 adders, this can be done in 1 cycle
• The FPGA resources required for each single-precision floating-point operation are 2-5 DSP blocks and 200-400 LUTs
• Implementing this in parallel will require ~1200 DSPs and ~75,000 LUTs
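In HLS-style C++, that parallelism is typically expressed by fully unrolling the multiply-accumulate loops. The pragmas below are Xilinx HLS directives (consistent with the HLS flow cited in [3]); the function itself is an illustrative sketch:

```cpp
// One output value of AlexNet's first layer: 3 input maps x an 11x11 window.
// Fully unrolling all three loops asks the tools for 363 parallel multipliers
// feeding an adder tree: a 1-cycle result, at the cost (per the slide) of
// roughly ~1200 DSPs and ~75,000 LUTs.
float conv_point(const float win[3][11][11], const float w[3][11][11]) {
#pragma HLS ARRAY_PARTITION variable=win complete dim=0
#pragma HLS ARRAY_PARTITION variable=w complete dim=0
#pragma HLS PIPELINE II=1
    float acc = 0.0f;
    for (int n = 0; n < 3; ++n) {
#pragma HLS UNROLL
        for (int y = 0; y < 11; ++y) {
#pragma HLS UNROLL
            for (int x = 0; x < 11; ++x) {
#pragma HLS UNROLL
                acc += win[n][y][x] * w[n][y][x];
            }
        }
    }
    return acc;
}
```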

Page 11: Increasing Throughput With Pipelining

• Pipelining is a hardware concept for achieving higher throughput
• Helpful with complex multi-cycle operations—works by registering intermediate results
• Pipeline 3D convolutions on one dimension & parallelize the other
  • For example, convolve the weight matrix with an input feature map in parallel, and pipeline across the different feature maps
• Zhang, et al. [3] convolve a set of input feature maps with a set of weight matrices in parallel, and pipeline over the size of the input feature map—see the sketch below

[Figure: tiling from [3]—N input feature maps of size R x C (tile sizes Tn, Tr, Tc), M weight filters of size N x K x K (tile sizes Tm, Tn), producing M output feature maps of size R' x C']
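A loop-level sketch of that structure in HLS-style C++, with Tn/Tm as the tile sizes from [3]; the fixed 32/28/5x5 dimensions and the function name are illustrative assumptions:

```cpp
// Pipelined + parallel convolution tile, after Zhang et al. [3].
// The pixel/filter loops (r, c, kr, kc) are pipelined: one new iteration
// enters per cycle once the pipeline fills. The tile loops (m, n) are
// unrolled into Tm x Tn parallel multiply-accumulate units.
template <int Tn, int Tm>
void conv_tile(const float in[Tn][32][32],   // input feature-map tile
               const float w[Tm][Tn][5][5],  // 5x5 weight tile
               float out[Tm][28][28]) {      // output tile (32 - 5 + 1 = 28)
    for (int r = 0; r < 28; ++r)
        for (int c = 0; c < 28; ++c)
            for (int kr = 0; kr < 5; ++kr)
                for (int kc = 0; kc < 5; ++kc) {
#pragma HLS PIPELINE II=1
                    for (int m = 0; m < Tm; ++m) {
#pragma HLS UNROLL
                        for (int n = 0; n < Tn; ++n) {
#pragma HLS UNROLL
                            out[m][r][c] += w[m][n][kr][kc] * in[n][r + kr][c + kc];
                        }
                    }
                }
}
```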

Page 12: Mapping 3D Convolutions into Matrix Multiplications

• A simple way is to flatten the feature maps to create an array of feature maps—below is an illustration for the first layer of AlexNet
• The weight matrices are flattened, and the input feature maps are rearranged so that each column holds the neighborhood required for one convolution

[Figure: Y = W × X for AlexNet layer 1, where Y is the 96 x 3025 matrix of output feature maps (96 filters, 55 x 55 = 3025 output positions), W is the 96 x 363 matrix of flattened weight coefficients (3 x 11 x 11 = 363), and X is the 363 x 3025 matrix of rearranged input feature maps]
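This rearrangement is commonly called im2col. A minimal C++ sketch (the function name and stride handling are illustrative assumptions) shows how each output column collects one neighborhood, after which the convolution is a plain matrix-matrix multiply:

```cpp
#include <vector>
#include <cstddef>

// im2col sketch: rearrange input maps so each column of X holds the
// neighborhood needed for one output position. For AlexNet layer 1 this
// yields X of size (3*11*11 = 363) x (55*55 = 3025), and the convolution
// becomes Y(96x3025) = W(96x363) * X(363x3025).
std::vector<float> im2col(const std::vector<float>& in, int ch, int w, int h,
                          int k, int stride) {
    int ow = (w - k) / stride + 1, oh = (h - k) / stride + 1;
    std::vector<float> X((std::size_t)ch * k * k * ow * oh);
    for (int c = 0; c < ch; ++c)
        for (int fy = 0; fy < k; ++fy)
            for (int fx = 0; fx < k; ++fx) {
                int row = (c * k + fy) * k + fx;   // row of X for this tap
                for (int y = 0; y < oh; ++y)
                    for (int x = 0; x < ow; ++x)
                        X[(std::size_t)row * ow * oh + y * ow + x] =
                            in[(c * h + y * stride + fy) * w + x * stride + fx];
            }
    return X;
}
```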

Page 13: Implementation Challenges

• A larger number of compute units exhausts the FPGA resources
  • Each compute unit takes a few hundred LUTs and 3-5 DSPs
• Data organization must ensure the compute units are performing to the max
  • A lot of data needs to be read in parallel
  • Data has to be stored on-chip to enable parallel access—see the sketch below
• Routing turns out to be a bigger challenge
  • Proper data organization, architecture & tools are the way to overcome it

[Chart: bits required per cycle vs. parallelism (256, 512, 768 parallel operations), ranging up to ~80,000 bits per cycle]
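On-chip parallel access is typically obtained by partitioning buffers across many block RAMs. A minimal HLS-style sketch, with illustrative sizes and a trivial reduction standing in for the real compute:

```cpp
// Sketch: stage data on-chip and partition it across block RAMs so that many
// words can be read in the same cycle (sizes and the reduction are illustrative).
void compute_stage(const float in[256 * 64], float out[256]) {
    float buf[256][64];
    // Split dim 1 into 256 independent on-chip memories: one read port per
    // compute unit per cycle, instead of one shared port for all of them.
#pragma HLS ARRAY_PARTITION variable=buf complete dim=1
    for (int i = 0; i < 256; ++i)          // burst the data in from DDR once
        for (int j = 0; j < 64; ++j)
            buf[i][j] = in[i * 64 + j];
    for (int j = 0; j < 64; ++j) {         // then feed 256 units in parallel
#pragma HLS PIPELINE II=1
        for (int i = 0; i < 256; ++i) {
#pragma HLS UNROLL
            out[i] = (j == 0) ? buf[i][j] : out[i] + buf[i][j];
        }
    }
}
```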

Page 14: Using Alternate Data Representations

• Single-precision floating point
  • Uses 32 bits to represent each data value
  • Requires more DSPs (3-5) to implement a multiply/accumulate
• Fixed point
  • A 16-bit fixed-point representation suffices for many applications [4]
  • With stochastic rounding, fixed point performs similarly to single-precision floating point [5]
• Half precision
  • Uses 16 bits to represent each data value
  • Significant reduction in routing & overall FPGA resources
• Mixed representation
  • Use fixed-point or half-precision representation for some layers and single-precision representation for others—see the sketch below
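As an illustration of the 16-bit fixed-point option, here is a minimal Q8.8 multiply-accumulate sketch; the Q8.8 split, names, and truncating shift are assumptions, since in practice the format would be chosen per layer from the data's dynamic range (and [5] would replace truncation with stochastic rounding):

```cpp
#include <cstdint>

// Minimal 16-bit fixed-point (Q8.8) sketch: 8 integer bits, 8 fraction bits.
using fix16 = int16_t;

fix16 to_fix(float x)   { return (fix16)(x * 256.0f); }  // float -> Q8.8
float to_float(fix16 x) { return x / 256.0f; }           // Q8.8 -> float

// Multiply in 32 bits, then shift back to Q8.8. A single 16x16 multiply
// fits in one DSP block, versus the 3-5 DSPs needed per single-precision
// floating-point operation.
fix16 fix_mac(fix16 acc, fix16 a, fix16 b) {
    int32_t prod = (int32_t)a * (int32_t)b;  // Q16.16 intermediate
    return (fix16)(acc + (prod >> 8));       // back to Q8.8 (truncating)
}
```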

Page 15: A Complete CNN on the FPGA Using OpenCL

• OpenCL tools enable software programmers to use the FPGA accelerator without learning hardware methodologies
• The programmer calls OpenCL functions to accelerate on the FPGA—see the host-side sketch below

[Figure: Configure & setup → 3D Convolutions → Dense layers → Softmax]
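A host-side sketch of that flow using the standard OpenCL C API. The kernel names ("conv3d", "dense", "softmax") are illustrative placeholders, not AuvizDNN's actual API, and error checking is omitted:

```cpp
#include <CL/cl.h>

// Configure & setup, then run the CNN stages on the FPGA in sequence.
void run_cnn(cl_context ctx, cl_command_queue q, cl_program prog,
             cl_mem image, cl_mem weights, cl_mem result) {
    // Configure & setup: create kernels from the program compiled for the FPGA.
    cl_kernel conv    = clCreateKernel(prog, "conv3d",  nullptr);
    cl_kernel dense   = clCreateKernel(prog, "dense",   nullptr);
    cl_kernel softmax = clCreateKernel(prog, "softmax", nullptr);

    // 3D convolutions.
    clSetKernelArg(conv, 0, sizeof(cl_mem), &image);
    clSetKernelArg(conv, 1, sizeof(cl_mem), &weights);
    clSetKernelArg(conv, 2, sizeof(cl_mem), &result);
    size_t gsz = 1;  // task-style kernel, typical for FPGA accelerators
    clEnqueueNDRangeKernel(q, conv, 1, nullptr, &gsz, nullptr, 0, nullptr, nullptr);

    // Dense layers, then Softmax, reusing the same pattern.
    clSetKernelArg(dense, 0, sizeof(cl_mem), &result);
    clEnqueueNDRangeKernel(q, dense, 1, nullptr, &gsz, nullptr, 0, nullptr, nullptr);
    clSetKernelArg(softmax, 0, sizeof(cl_mem), &result);
    clEnqueueNDRangeKernel(q, softmax, 1, nullptr, &gsz, nullptr, 0, nullptr, nullptr);

    clFinish(q);     // wait for the whole pipeline to drain
}
```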

Page 16: Performance of AlexNet on FPGAs

FPGAs can achieve an impressive 14 images/sec/Watt, compared to high-end GPUs such as the Tesla K40, which reach about 4 images/sec/Watt

Page 17: Summary

• 3D convolutions are a key part of a CNN, and are compute intensive
• In FPGAs, 3D convolutions can be implemented efficiently with a parallel & pipelined implementation
• FPGA resources—gates & routing—will be the critical factors in achieving a highly parallel implementation
• OpenCL implementation tools such as Xilinx SDAccel simplify the implementation task and provide a software flow
• Alternate data representations can be used to reduce complexity
• Mixed data representations can simplify the computations without compromising performance
• FPGAs are capable of delivering high performance at a power profile suitable for the data center

Page 18: References

• [1] K. Ovtcharov et al., "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware," Microsoft Research, 2015
• [2] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, 2012
• [3] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," FPGA '15, 2015
• [4] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Akselrod and S. Talay, "Large-scale FPGA-based Convolutional Networks," in Machine Learning on Very Large Data Sets, 2011
• [5] S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, "Deep Learning with Limited Numerical Precision," arXiv preprint arXiv:1502.02551, 2015

Page 19: Deep Neural Networks in FPGAs

Nagesh Gupta

12 May 2015

Page 20: Computations vs. Data Transfers

Convolution layers:

Input size | Input feature maps | Output feature maps | Filter size | Computations | Total data transfer
224 x 224  | 3                  | 96                  | 11x11       | 110 * 10^6   | 255 * 10^3
27 x 27    | 96                 | 256                 | 5x5         | 448 * 10^6   | 728 * 10^3
13 x 13    | 256                | 384                 | 3x3         | 150 * 10^6   | 993 * 10^3
13 x 13    | 384                | 384                 | 3x3         | 224 * 10^6   | 1457 * 10^3
13 x 13    | 384                | 256                 | 3x3         | 150 * 10^6   | 959 * 10^3

Dense layers:

Input data | Weight matrix | Computations | Data transfers
9216       | 9216 x 4096   | 38 * 10^6    | 38 * 10^6
4096       | 4096 x 4096   | 16 * 10^6    | 16 * 10^6
4096       | 4096 x 1000   | 4 * 10^6     | 4 * 10^6
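The "Computations" column is the multiply-accumulate count: output positions × output maps × filter volume. A quick sketch that reproduces it (layer dimensions from the table; the 55/27/13 output sizes assume AlexNet's strides and pooling, and the results match the table to rounding):

```cpp
#include <cstdio>

int main() {
    // {output width, input maps, output maps, filter size} per conv layer.
    struct { int out_w, in_maps, out_maps, k; } layer[] = {
        {55, 3, 96, 11},    // conv1: 224x224 input, stride 4
        {27, 96, 256, 5},   // conv2
        {13, 256, 384, 3},  // conv3
        {13, 384, 384, 3},  // conv4
        {13, 384, 256, 3},  // conv5
    };
    for (auto& l : layer) {
        long long macs = 1LL * l.out_w * l.out_w * l.out_maps
                       * l.k * l.k * l.in_maps;
        printf("%2dx%2d x %3d output maps: %lld MACs\n",
               l.out_w, l.out_w, l.out_maps, macs);
    }
    return 0;
}
```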