keras-blog | Blog with Keras news, tutorials, and demos
kandi X-RAY | keras-blog Summary
Generated with Pelican, hosted on Github pages. To contribute an article: send a Pull Request to the content branch. We welcome anything that would be of interest to Keras users, from a simple tutorial about a specific use case, to an advanced application demo, and everything in between.
Community Discussions
Trending Discussions on keras-blog
QUESTION
I am roughly following this script: fashion-MNIST-sagemaker.
I see that in the notebook ...
ANSWER
Answered 2019-Dec-23 at 11:44
Distributed training is model- and framework-specific. Not all models are easy to distribute, and ML frameworks differ in how easy they make it. It is rarely automatic, even less so with TensorFlow and Keras.
Neural nets are conceptually easy to distribute under the data-parallel paradigm, whereby the gradient computation for a given mini-batch is split among workers, which can be multiple devices in the same host (multi-device) or multiple hosts, each with multiple devices (multi-device multi-host). The D2L.ai course provides an in-depth view of how neural nets are distributed here and here.
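As a rough, hypothetical illustration of that data-parallel idea (not from the original answer; the model, shapes, and the two "workers" below are placeholders), each worker computes gradients on its shard of the mini-batch and the averaged gradient drives a single shared update:

```python
import tensorflow as tf

# Toy model, loss and optimizer; shapes and data are made up for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(0.01)

def shard_gradients(x_shard, y_shard):
    """Compute gradients on one worker's shard of the mini-batch."""
    with tf.GradientTape() as tape:
        loss = loss_fn(y_shard, model(x_shard))
    return tape.gradient(loss, model.trainable_variables)

# Pretend these two shards live on two different devices or hosts.
x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))
grads_a = shard_gradients(x[:4], y[:4])
grads_b = shard_gradients(x[4:], y[4:])

# Average the per-worker gradients, then apply one shared parameter update.
avg_grads = [(ga + gb) / 2.0 for ga, gb in zip(grads_a, grads_b)]
optimizer.apply_gradients(zip(avg_grads, model.trainable_variables))
```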
Keras used to be trivial to distribute in a multi-device, single-host fashion with multi_gpu_model, which will sadly be deprecated in 4 months. In your case, you seem to be referring to a multi-host setup (more than one machine), and that requires writing ad-hoc synchronization code such as the one seen in this official tutorial.
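For reference, the single-host multi-GPU case is nowadays usually handled with tf.distribute.MirroredStrategy, the replacement for the deprecated multi_gpu_model. A minimal sketch, with placeholder data standing in for Fashion-MNIST:

```python
import tensorflow as tf

# Single-host, multi-device data parallelism with MirroredStrategy
# (the usual replacement for the deprecated multi_gpu_model).
strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model and optimizer must be created inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Placeholder data; Keras splits each global batch across the replicas.
x = tf.random.normal((256, 28, 28))
y = tf.random.uniform((256,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=64, epochs=1)
```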
Now let's look at how this relates to SageMaker.
SageMaker comes with 3 options for algorithm development. Using distributed training may require a varying amount of custom work depending on the option you choose:
- The built-in algorithms are a library of 18 pre-written algorithms. Many of them are written to be distributed in single-host multi-GPU or multi-host multi-GPU fashion. With this first option there is nothing to do beyond setting train_instance_count > 1 to distribute over multiple instances.
- The Framework containers (the option you are using) are containers developed for popular frameworks (TensorFlow, PyTorch, Sklearn, MXNet) that provide a pre-written Docker environment in which you can write arbitrary code. With this option, some containers support one-click creation of ephemeral training clusters for distributed training; however, setting train_instance_count greater than one is not enough to distribute the training of your model. It will just run your script on multiple machines. In order to distribute your training, you must write appropriate distribution and synchronization code in your mnist_keras_tf.py script. For some frameworks such code modification is very simple: for TensorFlow and Keras, SageMaker comes with Horovod pre-installed. Horovod is a peer-to-peer, ring-style communication mechanism that requires very little code modification and is highly scalable (initial announcement from Uber, SageMaker doc, SageMaker example, SageMaker blog post). My recommendation would be to try using Horovod to distribute your code; a minimal sketch follows this list. Similarly, in Apache MXNet you can easily create parameter stores to host model parameters in a distributed fashion and sync with them from multiple nodes. MXNet's scalability and ease of distribution are among the reasons Amazon loves it.
- The Bring-Your-Own Container option requires you to write both the Docker container and the algorithm code. In this situation you can of course distribute your training over multiple machines, but you also have to write the machine-to-machine communication code yourself.
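As a hedged illustration of the Horovod route (a minimal sketch rather than the exact code from the SageMaker examples; the model and data below are placeholders), the typical Keras-side changes are: initialize Horovod, pin one GPU per process, scale the learning rate, wrap the optimizer, and broadcast the initial weights from rank 0:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# 1. Initialize Horovod and pin each process to one GPU.
hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# 2. Scale the learning rate by the number of workers and wrap the optimizer.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 3. Broadcast initial weights from rank 0 so all workers start identical.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# Placeholder data standing in for the real training set.
x = tf.random.normal((256, 28, 28))
y = tf.random.uniform((256,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```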
For your specific situation, my recommendation would be to scale horizontally first within a single node with multiple GPUs, over bigger and bigger machine types, because latency and complexity increase drastically when you switch from a single-host to a multi-host context. If truly necessary, use a multi-node setup; things may be easier if that's done with Horovod. In any case, this is still much easier to do with SageMaker, since it manages the creation of ephemeral, billed-per-second clusters with built-in logging, metadata and artifact persistence, and also handles fast training data loading from S3, sharded over training nodes.
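For context, launching such a job from the SageMaker Python SDK looks roughly like the sketch below. Parameter names vary between SDK versions (this follows the v1-era API in use around the time of this answer), and the role, instance type, and S3 path are placeholders:

```python
from sagemaker.tensorflow import TensorFlow

# Sketch of a SageMaker training job; role, instance type and S3 path are
# placeholders. The 'mpi' block asks SageMaker to launch the script under
# Horovod/MPI on every instance.
estimator = TensorFlow(
    entry_point="mnist_keras_tf.py",       # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    framework_version="1.15",
    py_version="py3",
    train_instance_count=2,                # >1 only helps if the script is distributed
    train_instance_type="ml.p3.8xlarge",   # single node with multiple GPUs
    distributions={"mpi": {"enabled": True, "processes_per_host": 4}},
)

estimator.fit("s3://my-bucket/fashion-mnist/")  # placeholder S3 location
```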
Note on the relevance of distributed training: keep in mind that when you distribute over N devices a model that was running fine on one device, you usually grow the batch size by N so that the per-device batch size stays constant and each device keeps busy. This disturbs your model's convergence, because bigger batches mean a less noisy SGD. A common heuristic is to grow the learning rate by N as well (more info in this great paper from Priya Goyal et al.), but this in turn induces instability during the first couple of epochs, so it is sometimes paired with a learning-rate warmup.

Scaling SGD to work well with very large batches is still an active research problem, with new ideas coming up frequently. Reaching good model performance with very large batches sometimes requires ad-hoc research and a fair amount of parameter tuning, occasionally to the point where the extra money spent on figuring out how to distribute well outweighs the benefits of the faster training you eventually manage to run.

A situation where distributed training does make sense is when an individual record represents too much compute to form a big enough physical batch on one device, as seen with big input sizes (e.g. vision over HD pictures) or big parameter counts (e.g. BERT). That said, for models requiring a very big logical batch you don't necessarily have to distribute things physically: you can run N batches sequentially through your single GPU and wait N per-device batches before doing the gradient averaging and parameter update, simulating a GPU that is N times bigger (a clever hack sometimes called gradient accumulation; see the sketch below).
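To make the gradient-accumulation trick concrete, here is a minimal sketch in plain TensorFlow (not from the original answer; the model, data, and accumulation factor N are placeholders):

```python
import tensorflow as tf

# Simulate a batch N times larger than fits on the device by accumulating
# gradients over N micro-batches before a single parameter update.
N = 4  # accumulation steps (placeholder)
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(0.01)

# Placeholder data: N micro-batches of 8 examples each.
x = tf.random.normal((32, 10))
y = tf.random.normal((32, 1))
model(x[:1])  # build the model so trainable_variables exist

accum = [tf.zeros_like(v) for v in model.trainable_variables]
for i in range(N):
    xb, yb = x[i * 8:(i + 1) * 8], y[i * 8:(i + 1) * 8]
    with tf.GradientTape() as tape:
        loss = loss_fn(yb, model(xb))
    grads = tape.gradient(loss, model.trainable_variables)
    accum = [a + g for a, g in zip(accum, grads)]

# One update with the averaged gradient, as if the batch size were 8 * N.
avg = [a / N for a in accum]
optimizer.apply_gradients(zip(avg, model.trainable_variables))
```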
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.