CROP: Efficient and Robust Multi-Job Placement in Deep Learning Clusters

Peng Yang, Gongming Zhao, Jing Wen, Hongli Xu, Haibo Wang, Wentao Fan, Xiaohu Xu, Jun Yao

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Deep learning (DL) has seen growing datasets, expanding model scales, and a widening range of applications in recent years. There is a notable trend of shifting DL training jobs from local computing units to powerful DL clusters built by cloud providers. These clusters allocate physical training nodes to DL jobs through a process referred to as multi-job placement. Existing multi-job placement strategies fail to simultaneously achieve high resource utilization, efficient DL training, and robustness, resulting in poor performance when resources are limited or when some devices behave abnormally. To tackle these challenges, we present CROP, an approach that performs efficient and robust multi-job placement in DL clusters. We formulate the efficient and robust multi-job placement problem as a non-linear program and prove its NP-hardness. To solve this problem, we present an effective submodular-based algorithm with a tight approximation factor of (1-1/e). We evaluate CROP on a small-scale testbed consisting of 8 physical GPUs and in a large-scale simulation employing real-world job traces. Experimental results demonstrate that CROP achieves near-optimal communication overhead while improving the training throughput of the DL cluster by up to 57.5% compared to state-of-the-art solutions.
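
For context on the (1-1/e) factor: when a monotone submodular function is maximized under a cardinality constraint, the classic greedy rule (repeatedly add the element with the largest marginal gain) is known to reach that factor. The sketch below illustrates only this generic greedy rule on a toy coverage objective. It is not CROP's algorithm, whose details the abstract does not give, and the names `greedy_submodular`, `coverage`, and `TOY_SETS` are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, Set

# Hypothetical toy data: each candidate "covers" a set of items.
TOY_SETS: Dict[str, Set[int]] = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}


def coverage(selected: Set[Hashable]) -> int:
    """Monotone submodular toy objective: number of distinct items covered."""
    covered: Set[int] = set()
    for name in selected:
        covered |= TOY_SETS[name]
    return len(covered)


def greedy_submodular(
    ground: Iterable[Hashable],
    f: Callable[[Set[Hashable]], float],
    k: int,
) -> Set[Hashable]:
    """Pick up to k elements, each time taking the largest marginal gain."""
    selected: Set[Hashable] = set()
    remaining = set(ground)
    for _ in range(k):
        base = f(selected)
        best, best_gain = None, 0
        # Sorted iteration makes tie-breaking deterministic.
        for e in sorted(remaining):
            gain = f(selected | {e}) - base
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:  # no element adds value; stop early
            break
        selected.add(best)
        remaining.remove(best)
    return selected


chosen = greedy_submodular(TOY_SETS.keys(), coverage, k=2)
```

With `k=2`, the greedy rule first takes `"c"` (four new items) and then `"a"` (three new items), covering all seven items in this toy instance.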

Original language: English
Title of host publication: 2025 IEEE/ACM 33rd International Symposium on Quality of Service, IWQoS 2025
ISBN (Electronic): 9798331549404
State: Published - 2025
Event: 33rd IEEE/ACM International Symposium on Quality of Service, IWQoS 2025 - Gold Coast, Australia
Duration: Jul 2 2025 – Jul 4 2025

Publication series

Name: IEEE International Workshop on Quality of Service, IWQoS
ISSN (Print): 1548-615X

Conference

Conference: 33rd IEEE/ACM International Symposium on Quality of Service, IWQoS 2025
Country/Territory: Australia
City: Gold Coast
Period: 7/2/25 – 7/4/25

Bibliographical note

Publisher Copyright:
© 2025 IEEE.

Funding

This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grants 62372426 and 62132019; the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Grant No. 2023481); the Open Project Program of the Guangxi Key Laboratory of Digital Infrastructure (Grant No. GXDIOP2024003); and the China Mobile "Joint Innovation+" Research Program.

Funders | Funder number
National Natural Science Foundation of China (NSFC) | 62372426, 62132019
Youth Innovation Promotion Association of the Chinese Academy of Sciences | 2023481
Guangxi Key Laboratory of Digital Infrastructure | GXDIOP2024003

    Keywords

    • Communication Awareness
    • Deep Learning Cluster
    • Multi-job Placement
    • Robustness

    ASJC Scopus subject areas

    • Electrical and Electronic Engineering
