Large-scale Deep Learning by Distributed Training
Nyholm, Juha Oskari (2019)
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2019060414582
Abstract
This thesis was carried out as part of a service development task on distributed deep learning on CSC-provided infrastructure. The aim is to improve the readiness to provide a service for AI researchers who wish to scale out their deep learning workloads and to benefit from the potential speedups of distributed deep learning. The algorithmic challenges in large-scale distributed deep learning concern hardware utilization and model quality, both of which must be addressed in order to truly benefit from scaling out.

In this thesis, we experiment with the Horovod distributed training framework, which enables more efficient resource utilization when scaling out deep learning training. We also experiment with the linear scaling rule and gradual learning rate warmup to address the model quality issues that arise when training with larger batch sizes. In our experiments, we measure the scaling performance of our distributed programs and compare the model quality implications of naïvely scaling out against scaling out with the linear scaling rule and learning rate warmup. Our experiments provide concrete examples of how to apply these tools and methods in the provided execution environment, and they also answer the following research questions:
RQ1: What are the implications of using smaller vs. larger per-GPU-worker batch sizes in terms of scaling performance, training time (wall clock) and efficient resource utilization in the provided execution environment?
RQ2: Does the Horovod distributed training framework provide a speedup in the provided execution environment when scaling out training?
RQ3: What are the model quality implications of larger global batch sizes when utilizing methods such as linear learning rate scaling and gradual learning rate warmup?
The tools and methods utilized in our experiments enabled us to efficiently scale out a 1,000-class image classification problem to 32 GPU workers without degrading the resulting model quality. Minimal sketches of the two key techniques are given below.
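For readers unfamiliar with Horovod, the following is a minimal sketch of how data-parallel training is typically set up with it, here using its PyTorch API. The model, base learning rate, and other specifics are illustrative assumptions, not the exact configuration used in the thesis experiments.

```python
# A minimal sketch of Horovod data-parallel training with PyTorch.
# The model and hyperparameters are illustrative placeholders.
import torch
import horovod.torch as hvd

hvd.init()  # one Horovod process per GPU worker
torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = torch.nn.Linear(10, 2).cuda()  # placeholder model

# Linear scaling rule: scale the base learning rate by the number of workers,
# since the effective global batch size grows with hvd.size().
base_lr = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```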
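Similarly, the sketch below illustrates gradual learning rate warmup combined with the linear scaling rule: the learning rate is ramped from the base value up to the scaled value over the first few epochs, which helps preserve model quality at large global batch sizes. The function name and the 5-epoch warmup length are assumptions for illustration, not values taken from the thesis.

```python
# A minimal sketch of gradual learning rate warmup with the linear
# scaling rule. Hypothetical helper; parameters are illustrative.
def warmup_lr(epoch, base_lr, num_workers, warmup_epochs=5):
    """Return the learning rate for a given (possibly fractional) epoch."""
    target_lr = base_lr * num_workers  # linear scaling rule
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr to target_lr during warmup.
        return base_lr + (target_lr - base_lr) * (epoch / warmup_epochs)
    return target_lr

# Example: with 32 workers and base_lr=0.01, the rate ramps from 0.01
# to 0.32 over the first 5 epochs, then stays at 0.32.
print(warmup_lr(2.5, base_lr=0.01, num_workers=32))
```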