Proceedings: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 33
Issue: No. 1: AAAI-19, IAAI-19, EAAI-20
Track: AAAI Technical Track: Machine Learning
Abstract:
Scalability of distributed deep learning (DL) training with the parameter server (PS) architecture is often communication-constrained in large clusters. Recent efforts use a layer-by-layer strategy to overlap gradient communication with backward computation, reducing the impact of the communication constraint on scalability. However, these approaches cannot be effectively applied to overlapping parameter communication with forward computation. In this paper, we propose and design iBatch, a novel communication approach that batches parameter communication and forward computation so that they overlap with each other. We formulate the batching decision as an optimization problem and solve it with a greedy algorithm to derive the communication and computation batches. We implement iBatch in the open-source DL framework BigDL and evaluate it with various DL workloads. Experimental results show that iBatch improves the scalability of a 72-node cluster by up to 73% over the default PS architecture and 41% over the layer-by-layer strategy.
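The abstract describes a greedy solution to the batching decision: group consecutive layers into batches so that each batch's parameter pulls can be hidden under the forward computation of the previous batch. The following Python sketch illustrates that general idea only; the function name, the batching rule, and the timing values are illustrative assumptions and do not reproduce iBatch's actual optimization formulation or its BigDL implementation.

```python
# Illustrative sketch (not iBatch's actual algorithm): greedily form layer
# batches so that the parameter-communication time of each batch fits under
# the forward-computation time of the batch that precedes it.

def greedy_batches(comm_times, comp_times):
    """Return a list of batches (lists of layer indices).

    comm_times[i]: estimated time to pull layer i's parameters from the PS.
    comp_times[i]: estimated forward-computation time of layer i.
    """
    batches = [[0]]                 # the first layer's parameters cannot be overlapped
    budget = comp_times[0]          # computation time available to hide the next batch's pulls
    pending, pending_comm = [], 0.0 # batch under construction and its accumulated comm time
    for i in range(1, len(comm_times)):
        if pending and pending_comm + comm_times[i] > budget:
            # Communication no longer fits under the previous batch's
            # computation: close the current batch and start a new one.
            batches.append(pending)
            budget = sum(comp_times[j] for j in pending)
            pending, pending_comm = [], 0.0
        pending.append(i)
        pending_comm += comm_times[i]
    if pending:
        batches.append(pending)
    return batches


if __name__ == "__main__":
    comm = [4.0, 1.0, 2.0, 3.0, 1.5]   # hypothetical per-layer pull times (ms)
    comp = [5.0, 2.0, 2.5, 4.0, 1.0]   # hypothetical per-layer forward times (ms)
    print(greedy_batches(comm, comp))  # -> [[0], [1, 2], [3, 4]]
```

In this toy run, layers 1 and 2 are pulled while layer 0 computes, and layers 3 and 4 are pulled while layers 1 and 2 compute, so parameter communication overlaps forward computation except for the first batch.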
DOI: 10.1609/aaai.v33i01.33015289