Large Scale Distributed Deep Networks
A survey of the paper from NIPS 2012
Hiroyuki Vincent Yamazaki, Jan 8, [email protected]
What is Deep Learning?
How can distributed computing be applied?
“… We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.”
– Jeff Dean, Google
GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015
What is Deep Learning?
Multi-layered neural networks
Functions that take some input and return some output

f                   Input     Output
AND                 (1, 0)    0
y(x) = 2x + 5       7         19
Object Classifier   (image)   Cat
Speech Recognizer   (audio)   “Hello world”
Neural Networks
Machine learning models, inspired by the human brain
Layered units with weighted connections
Signals are passed between layers: Input layer → Hidden layers → Output layer
Steps
1. Prepare training, validation and test data
2. Define the model and its initial parameters
3. Train using the data to improve the model
Here to train?
(Diagram: Input → f → Output, where f contains the hidden layers between the input and output layers)
Yes, let’s do it
Feed Forward
1. For each unit, compute its weighted sum based on its input
2. Pass the sum to the activation function to get the output of the unit

z = \sum_{i=1}^{n} x_i w_i + b
y = \sigma(z)

z is the weighted sum
n is the number of inputs
x_i is the i-th input
w_i is the weight for x_i
b is the bias term
\sigma is the activation function
y is the output

(Diagram: inputs x_1, x_2 with weights w_1, w_2 and the bias b feed into the weighted sum z, which passes through \sigma to produce y)
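As a minimal sketch of these two steps for a single unit, assuming NumPy and a sigmoid as the (otherwise unspecified) activation function \sigma:

import numpy as np

def sigmoid(z):
    # One common choice for the activation function sigma (an assumption here)
    return 1.0 / (1.0 + np.exp(-z))

def unit_forward(x, w, b):
    # Step 1: z = sum_i x_i * w_i + b, the weighted sum of the inputs plus the bias
    z = np.dot(x, w) + b
    # Step 2: y = sigma(z), the output of the unit
    return sigmoid(z)

# Example with two inputs, matching the diagram
x = np.array([1.0, 0.5])
w = np.array([0.2, -0.4])
b = 0.1
y = unit_forward(x, w, b)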
Loss
3. Given the output from the last layer, compute the loss using the Mean Squared Error (MSE) or the cross entropy
This is the error that we want to minimize

E(W) = \frac{1}{2}(\hat{y} - y)^2

E is the loss/error
W is the weights
\hat{y} is the target values
y is the output values
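A small illustrative snippet of the MSE computation, reusing the NumPy import from the sketch above; the function and variable names are mine, not the paper's:

def mse_loss(y_target, y_output):
    # E(W) = 1/2 * (y_target - y_output)^2, averaged if there are several outputs
    return 0.5 * np.mean((y_target - y_output) ** 2)

error = mse_loss(np.array([1.0]), np.array([0.73]))  # the error we want to minimize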
Back Propagation
4. Compute the gradient of the loss function with respect to the parameters for Stochastic Gradient Descent (SGD)
5. Take a step proportional (scaled by the learning rate) to the negative of the gradient to adjust the weights

\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}
w_{i,t+1} = w_{i,t} + \Delta w_i

\alpha is the learning rate, typically 10^{-1} to 10^{-3}
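A hedged sketch of steps 4 and 5 for the single sigmoid unit above, with the gradient obtained via the chain rule for the squared-error loss; this is a plain single-machine SGD step, not DistBelief's distributed version:

def sgd_step(x, w, b, y_target, alpha=0.01):
    # Forward pass (steps 1-2)
    z = np.dot(x, w) + b
    y = sigmoid(z)
    # Step 4: gradient of E = 1/2 * (y_target - y)^2 with respect to the parameters
    dE_dz = (y - y_target) * y * (1.0 - y)  # chain rule through the sigmoid
    dE_dw = dE_dz * x
    dE_db = dE_dz
    # Step 5: move against the gradient, scaled by the learning rate alpha
    w = w - alpha * dE_dw
    b = b - alpha * dE_db
    return w, b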
Improve the accuracy of the network by iteratively repeating these steps
But it takes time
GoogLeNet, Google, ILSVRC 2014: 22 layers, 5M parameters
AlexNet, NIPS 2012: 7 layers, 650K units, 60M parameters
Yes, train hard
It’s too much
How can distributed computing be applied?
A framework, DistBelief, proposed by researchers at Google in 2012
Here, let me help you with those weights
Asynchrony - Robustness to cope with slow machines and single points of failure
Network Overhead - Manage the amount of data sent across machines
DistBelief
Parallelization: Splitting up the network/model
Model Replication: Processing multiple instances of the network/model asynchronously
DistBelief Parallelization
Split up the network among multiple machines
Speed-up gains for networks with many parameters, up to the point where communication costs dominate
(Figure: the model is partitioned across machines; bold connections require network traffic)
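A toy sketch of the parallelization idea (DistBelief's actual implementation is not public): each "machine" owns a contiguous block of layers, so only the activations at the block boundary would have to cross the network. The layer sizes and ReLU activations are illustrative assumptions.

import numpy as np

def forward_on_machine(layers, activations):
    # Runs the layers owned by one machine; only the returned activations
    # would need to be sent over the network to the next machine.
    for w, b in layers:
        activations = np.maximum(0.0, activations @ w + b)  # ReLU layers
    return activations

rng = np.random.default_rng(0)
machine_0 = [(rng.normal(size=(784, 256)), np.zeros(256)),
             (rng.normal(size=(256, 256)), np.zeros(256))]
machine_1 = [(rng.normal(size=(256, 10)), np.zeros(10))]

x = rng.normal(size=(1, 784))
boundary = forward_on_machine(machine_0, x)      # crosses the machine boundary
output = forward_on_machine(machine_1, boundary)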
DistBelief Model Replication
Two optimization algorithms to achieve asynchrony: Downpour SGD and Sandblaster L-BFGS
Downpour SGD: Online Asynchronous Stochastic Gradient Descent
1. Split the training data into shards and assign a model replica to each data shard
2. For each model replica, fetch the parameters from the centralized sharded parameter server
3. Gradients are computed per model replica and pushed back to the parameter server (a minimal sketch follows after the notes below)
Each data shard stores a subset of the complete training data
Asynchrony: Model replicas and parameter server shards process data independently
Network Overhead: Each machine only needs to communicate with a subset of the parameter server shards
Batch Updates: Performing batch updates and batched push/pull to and from the parameter server also reduces network overhead
AdaGrad: Adaptive learning rates per weight using AdaGrad improve the training results
Stochasticity: Model replicas work with out-of-date parameters; it is not clear how this affects the training
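A minimal single-process sketch of the Downpour SGD loop, with Python threads standing in for model replicas and one object standing in for the sharded parameter server; the AdaGrad-style per-weight learning rate follows the description above. The linear model, toy data, and class names are illustrative assumptions, not DistBelief's API.

import threading
import numpy as np

class ParameterServer:
    # Stands in for the (sharded) parameter server
    def __init__(self, dim, alpha=0.1):
        self.w = np.zeros(dim)
        self.squared_grads = np.zeros(dim)   # AdaGrad accumulator, one per weight
        self.alpha = alpha
        self.lock = threading.Lock()

    def fetch(self):
        # 2. Model replicas fetch the current parameters
        with self.lock:
            return self.w.copy()

    def push_gradient(self, grad):
        # 3. Gradients are pushed back and applied with a per-weight learning rate
        with self.lock:
            self.squared_grads += grad ** 2
            self.w -= self.alpha / (np.sqrt(self.squared_grads) + 1e-8) * grad

def model_replica(server, data_shard, steps=500):
    # 1. Each replica trains only on its own data shard, asynchronously
    rng = np.random.default_rng()
    for _ in range(steps):
        x, y = data_shard[rng.integers(len(data_shard))]
        w = server.fetch()
        grad = (np.dot(x, w) - y) * x        # gradient of a squared-error loss
        server.push_gradient(grad)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
data = [(x, float(np.dot(x, true_w))) for x in rng.normal(size=(200, 2))]
shards = [data[:100], data[100:]]            # one data shard per model replica

server = ParameterServer(dim=2)
replicas = [threading.Thread(target=model_replica, args=(server, s)) for s in shards]
for r in replicas: r.start()
for r in replicas: r.join()
print(server.w)                              # approaches true_w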
Sandblaster L-BFGS: Batch Distributed Parameter Storage and Manipulation
1. Create model replicas
2. Balance the load by dividing computational tasks into smaller subtasks and letting a coordinator assign those subtasks to appropriate shards
Asynchrony: Model replicas and parameter shards process data independently
Network Overhead: Only a single fetch per batch
Distributed Parameter Server: No need for a central parameter server that has to handle all the parameters
Coordinator: A process that balances the load among the shards to prevent slow machines from slowing down or stopping the training (a sketch of this idea follows below)
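A sketch of only the coordinator's load-balancing idea (the L-BFGS math itself is omitted): the batch is cut into many small subtasks held in a queue, and each replica pulls a new subtask as soon as it finishes the previous one, so a slow machine simply ends up completing fewer of them. The queue, task count, and dummy work are assumptions for illustration.

import queue
import threading

def coordinator_demo(num_subtasks=64, num_replicas=4):
    tasks = queue.Queue()
    for i in range(num_subtasks):
        tasks.put(i)                          # e.g. one small slice of the batch each

    completed = [0] * num_replicas

    def replica(idx):
        while True:
            try:
                tasks.get_nowait()            # pull the next small subtask
            except queue.Empty:
                return                        # no work left
            sum(range(10000))                 # stand-in for computing partial gradients
            completed[idx] += 1

    threads = [threading.Thread(target=replica, args=(i,)) for i in range(num_replicas)]
    for t in threads: t.start()
    for t in threads: t.join()
    return completed                          # faster replicas complete more subtasks

print(coordinator_demo())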
Results
Training speed-up is the number of times the parallelized model is faster compared with a regular model running on a single machine (restated as a formula below)
(In the speed-up figure, the numbers in brackets are the number of model replicas)
(In the cost figure, closer to the origin is better, in this case cost efficient in terms of money)
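Restated informally as a formula, with T_1 the training time on a single machine and T_N the training time of the parallelized model:

\text{speed-up} = \frac{T_1}{T_N}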
Conclusion
Significant improvements over single-machine training
DistBelief is CPU-oriented due to the CPU-GPU data transfer overhead
Unfortunately, it adds unit connectivity limitations
If neural networks continue to scale up, distributed computing will become essential
Purpose-built hardware such as Big Sur could address these problems
We are strong together
References
Large Scale Distributed Deep Networks - http://research.google.com/archive/large_deep_networks_nips2012.html
Going Deeper with Convolutions - http://arxiv.org/abs/1409.4842
ImageNet Classification with Deep Convolutional Neural Networks - http://papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012
Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms - http://arxiv.org/abs/1505.04956
GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015 - https://github.com/tensorflow/tensorflow/issues/23
Big Sur, Facebook, Dec 11, 2015 - https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/
Hiroyuki Vincent Yamazaki, Jan 8, [email protected]