Scalable Distributed DL Training: Batching Communication and Computation

Shaoqi Wang; Aidi Pi; Xiaobo Zhou

doi:10.1609/aaai.v33i01.33015289

Authors

Shaoqi Wang, University of Colorado, Colorado Springs
Aidi Pi, University of Colorado, Colorado Springs
Xiaobo Zhou, University of Colorado, Colorado Springs


Abstract

The scalability of distributed deep learning (DL) training with a parameter server (PS) architecture is often communication-constrained in large clusters. Recent efforts use a layer-by-layer strategy to overlap gradient communication with backward computation, reducing the impact of the communication constraint on scalability. However, these approaches cannot be effectively applied to the overlap between parameter communication and forward computation. In this paper, we propose and design iBatch, a novel communication approach that batches parameter communication and forward computation so that they overlap with each other. We formulate the batching decision as an optimization problem and solve it with a greedy algorithm to derive the communication and computation batches. We implement iBatch in the open-source DL framework BigDL and evaluate it with various DL workloads. Experimental results show that iBatch improves the scalability of a 72-node cluster by up to 73% over the default PS and 41% over the layer-by-layer strategy.
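The batching idea in the abstract can be illustrated with a toy timing model. The sketch below is not the iBatch formulation or its greedy algorithm, which the abstract does not spell out. It assumes a hypothetical cost model in which each communication batch pays a fixed per-message overhead, parameter pulls are serialized on the network, and the forward computation of a batch starts once its parameters have arrived and the previous batch has finished computing. The function names, the overhead parameter, and the merge heuristic are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def makespan(batches: Sequence[Sequence[int]],
             comm: Sequence[float],
             comp: Sequence[float],
             overhead: float) -> float:
    """Finish time of the forward pass under the assumed toy cost model."""
    comm_done = 0.0  # time at which the latest parameter pull completes
    comp_done = 0.0  # time at which the latest forward computation completes
    for batch in batches:
        # Pulls are serialized; each batch pays one fixed message overhead.
        comm_done += overhead + sum(comm[i] for i in batch)
        # A batch computes only after its parameters arrive and the
        # previous batch has finished computing.
        comp_done = max(comp_done, comm_done) + sum(comp[i] for i in batch)
    return comp_done


def greedy_batches(comm: Sequence[float],
                   comp: Sequence[float],
                   overhead: float) -> Tuple[List[List[int]], float]:
    """Illustrative greedy: merge the adjacent pair that helps most, until none helps."""
    batches: List[List[int]] = [[i] for i in range(len(comm))]
    best = makespan(batches, comm, comp, overhead)
    while len(batches) > 1:
        candidates = []
        for k in range(len(batches) - 1):
            merged = batches[:k] + [batches[k] + batches[k + 1]] + batches[k + 2:]
            candidates.append((makespan(merged, comm, comp, overhead), k, merged))
        # Pick the merge with the smallest makespan; ties go to the earliest pair.
        t, _, merged = min(candidates, key=lambda c: (c[0], c[1]))
        if t >= best:
            break  # no merge improves the makespan
        best, batches = t, merged
    return batches, best


if __name__ == "__main__":
    comm = [1, 1, 1, 1]  # assumed per-layer parameter pull times
    comp = [1, 1, 1, 1]  # assumed per-layer forward computation times
    print(greedy_batches(comm, comp, overhead=2))
```

With these made-up numbers, one message per layer finishes at time 13 and a single message for all layers finishes at time 10, while the greedy grouping [[0, 1, 2], [3]] finishes at time 9. The point is only that intermediate batch sizes can beat both extremes when per-message overhead is significant.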
