Abstract
Deep learning (DL) has seen growing datasets, expanding model scales, and a widening range of applications in recent years. There is a notable trend of shifting DL training jobs from local computing units to powerful DL clusters built by cloud providers. These clusters allocate physical training nodes to DL jobs through a process referred to as multi-job placement. Existing multi-job placement strategies fail to achieve efficient resource utilization, efficient DL training, and robustness simultaneously, resulting in poor performance when resources are limited or when some devices behave abnormally. To tackle these challenges, we present CROP, an approach that performs efficient and robust multi-job placement in DL clusters. We formulate the efficient and robust multi-job placement problem as a non-linear program and prove its NP-hardness. To solve this problem, we present an effective submodular-based algorithm with a tight approximation factor of (1-1/e). We evaluate CROP on a small-scale testbed consisting of 8 physical GPUs and in a large-scale simulation employing real-world job traces. Experimental results demonstrate that CROP achieves near-optimal communication overhead while improving the training throughput of the DL cluster by up to 57.5% compared to state-of-the-art solutions.
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE/ACM 33rd International Symposium on Quality of Service, IWQoS 2025 |
| ISBN (Electronic) | 9798331549404 |
| State | Published - 2025 |
| Event | 33rd IEEE/ACM International Symposium on Quality of Service, IWQoS 2025 - Gold Coast, Australia. Duration: Jul 2 2025 → Jul 4 2025 |
Publication series
| Name | IEEE International Workshop on Quality of Service, IWQoS |
|---|---|
| ISSN (Print) | 1548-615X |
Conference
| Conference | 33rd IEEE/ACM International Symposium on Quality of Service, IWQoS 2025 |
|---|---|
| Country/Territory | Australia |
| City | Gold Coast |
| Period | Jul 2 2025 → Jul 4 2025 |
Bibliographical note
Publisher Copyright: © 2025 IEEE.
Funding
This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grants 62372426 and 62132019; the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Grant No. 2023481); the Open Project Program of the Guangxi Key Laboratory of Digital Infrastructure (Grant No. GXDIOP2024003); and the China Mobile "Joint Innovation+" Research Program.
| Funders | Funder number |
|---|---|
| National Natural Science Foundation of China (NSFC) | 62372426, 62132019 |
| Youth Innovation Promotion Association of the Chinese Academy of Sciences | 2023481 |
| Guangxi Key Laboratory of Digital Infrastructure | GXDIOP2024003 |
Keywords
- Communication Awareness
- Deep Learning Cluster
- Multi-job Placement
- Robustness
ASJC Scopus subject areas
- Electrical and Electronic Engineering