InGo: In-Network Aggregation Routing with Batch Size Adjustment for Distributed Training

Jianfeng Bao, Gongming Zhao, Hongli Xu, Haibo Wang, Peng Yang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Distributed training has emerged as a critical application in clusters due to the widespread adoption of AI technology across various domains. However, as distributed training continues to advance, it has become increasingly time-consuming. To address this challenge, researchers have explored leveraging In-Network Aggregation (INA) to expedite distributed model training. Specifically, by harnessing programmable hardware, such as Intel Tofino switches, INA can aggregate gradients within the network, thereby reducing the amount of gradient transmission and accelerating distributed training. However, previous works assume fixed routing selection and batch size, ignoring their impact on model convergence and resulting in extended completion time. To bridge this gap, we propose InGo, a pioneering approach that considers both in-network aggregation routing and batch size adjustment, and provide a rigorous convergence analysis. Then, we formally define the problem of in-network aggregation routing with batch size adjustment, and present an efficient algorithm with bounded approximation factors to solve this problem. Through extensive experiments on both physical platforms and simulated environments, we demonstrate that InGo significantly reduces the completion time by 25.2%-74.7% compared to state-of-the-art solutions.
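The aggregation idea described in the abstract can be illustrated with a toy sketch. This is not InGo's routing or batch-size algorithm, and it is not switch code: the function names `aggregate` and `vectors_reaching_server` and the three-worker example are assumptions made purely for illustration. It only shows why summing gradients at an intermediate node reduces the number of gradient vectors forwarded upstream.

```python
from typing import List


def aggregate(gradients: List[List[float]]) -> List[float]:
    """Element-wise sum of equally sized gradient vectors.

    Stands in for what an aggregating switch would compute
    (hypothetical helper, not part of any real switch API).
    """
    return [sum(values) for values in zip(*gradients)]


def vectors_reaching_server(num_workers: int, use_ina: bool) -> int:
    """Number of gradient vectors forwarded upstream per iteration.

    Without in-network aggregation every worker's vector travels to
    the server; with it, a single aggregated vector does.
    """
    return 1 if use_ina else num_workers


# Three workers, each holding a two-element gradient (made-up values).
worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# The aggregating node forwards one summed vector instead of three.
switch_output = aggregate(worker_grads)
```

In this simplified model the upstream link carries one vector instead of one per worker; real deployments must also handle packetization, limited switch memory, and numeric formats, which the sketch ignores.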

Original language: English
Title of host publication: 2024 IEEE/ACM 32nd International Symposium on Quality of Service, IWQoS 2024
ISBN (Electronic): 9798350350128
State: Published - 2024
Event: 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024 - Guangzhou, China
Duration: Jun 19 2024 to Jun 21 2024

Publication series

Name: IEEE International Workshop on Quality of Service, IWQoS
ISSN (Print): 1548-615X

Conference

Conference: 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024
Country/Territory: China
City: Guangzhou
Period: 6/19/24 to 6/21/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • Batch Size Adjustment
  • Distributed Model Training
  • In-Network Aggregation
  • Programmable Switch

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
