Elastic resource sharing for distributed deep learning

Changho Hwang, Taehyun Kim, Sunghyun Kim, Jinwoo Shin, Kyoung Soo Park

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Resource allocation and scheduling strategies for deep learning training (DLT) jobs have a critical impact on their average job completion time (JCT). Unfortunately, traditional algorithms such as Shortest-Remaining-Time-First (SRTF) often perform poorly for DLT jobs. This is because blindly prioritizing only short jobs is suboptimal, and job-level resource preemption is too coarse-grained to effectively mitigate head-of-line blocking. We investigate algorithms that accelerate DLT jobs. Our analysis finds that (1) resource efficiency often matters more than short-job prioritization and (2) greedily allocating resources to existing jobs inflates average JCT due to an overly optimistic view of future resource availability. Inspired by these findings, we propose Apathetic Future Share (AFS), which balances resource efficiency and short-job prioritization while curbing unrealistic optimism in resource allocation. To bring these algorithmic benefits into practice, we also build CoDDL, a DLT system framework that transparently handles automatic job parallelization and efficiently performs frequent share re-adjustments. Our evaluation shows that AFS outperforms Themis, SRTF, and Tiresias-L in average JCT by up to 2.2x, 2.7x, and 3.1x, respectively.
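As background for the SRTF baseline discussed in the abstract, a minimal sketch of Shortest-Remaining-Time-First scheduling on a single shared resource (job names and service times below are hypothetical, not drawn from the paper). It illustrates the job-level preemption the abstract criticizes: the scheduler always runs the one job with the least remaining work, so a long job that holds the resource is fully preempted by each arriving short job.

```python
def srtf_avg_jct(jobs):
    """Simulate SRTF on one resource in unit time steps.

    jobs: list of (name, arrival_time, service_time) tuples.
    Returns the average job completion time (JCT), where a job's JCT
    is its finish time minus its arrival time.
    """
    remaining = {name: svc for name, _, svc in jobs}
    arrival = {name: arr for name, arr, _ in jobs}
    jct = {}
    t = 0
    while remaining:
        ready = [n for n in remaining if arrival[n] <= t]
        if not ready:
            # Resource idles until the next job arrives.
            t = min(arrival[n] for n in remaining)
            continue
        # Job-level preemption: pick the shortest remaining job each step.
        n = min(ready, key=lambda j: remaining[j])
        remaining[n] -= 1
        t += 1
        if remaining[n] == 0:
            del remaining[n]
            jct[n] = t - arrival[n]
    return sum(jct.values()) / len(jct)

# Hypothetical workload: one long job, then two short arrivals.
print(srtf_avg_jct([("long", 0, 10), ("s1", 2, 2), ("s2", 3, 2)]))
```

The short jobs preempt the long one entirely, which is exactly the coarse-grained, all-or-nothing sharing that elastic per-job resource shares (as in AFS/CoDDL) are meant to improve on.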

Original language: English
Title of host publication: Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2021
Publisher: USENIX Association
Pages: 721-735
Number of pages: 15
ISBN (Electronic): 9781939133212
State: Published - 2021
Event: 18th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2021 - Virtual, Online
Duration: 2021 Apr 12 - 2021 Apr 14

Publication series

Name: Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2021

Conference

Conference: 18th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2021
City: Virtual, Online
Period: 21/04/12 - 21/04/14
