Abstract
Distributed training has emerged as a critical application in clusters due to the widespread adoption of AI technology across various domains. However, as distributed training continues to advance, it has become increasingly time-consuming. To address this challenge, researchers have explored leveraging In-Network Aggregation (INA) to expedite distributed model training. Specifically, by harnessing programmable hardware, such as Intel Tofino switches, INA can aggregate gradients within the network, thereby reducing the volume of gradient traffic and accelerating distributed training. However, previous works assume fixed routing selection and batch size, ignoring their impact on model convergence and resulting in extended completion time. To bridge this gap, we propose InGo, a pioneering approach that considers both in-network aggregation routing and batch size adjustment, and we provide a rigorous convergence analysis. We then formally define the problem of in-network aggregation routing with batch size adjustment and present an efficient algorithm with bounded approximation factors to solve it. Through extensive experiments on both physical platforms and simulated environments, we demonstrate that InGo reduces the completion time by 25.2%-74.7% compared to state-of-the-art solutions.
Original language | English |
---|---|
Title of host publication | 2024 IEEE/ACM 32nd International Symposium on Quality of Service, IWQoS 2024 |
ISBN (Electronic) | 9798350350128 |
State | Published - 2024 |
Event | 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024 - Guangzhou, China; Duration: Jun 19 2024 → Jun 21 2024 |
Publication series
Name | IEEE International Workshop on Quality of Service, IWQoS |
---|---|
ISSN (Print) | 1548-615X |
Conference
Conference | 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024 |
---|---|
Country/Territory | China |
City | Guangzhou |
Period | 6/19/24 → 6/21/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- Batch Size Adjustment
- Distributed Model Training
- In-Network Aggregation
- Programmable Switch
ASJC Scopus subject areas
- Electrical and Electronic Engineering