A Lightweight Execution Manager for Training TensorFlow Models under the Slurm Queuing System
Abstract
Artificial neural networks are currently the flagship of Machine Learning
and have spread to many fields beyond Computer Science. This kind of computational
model generally needs massive amounts of data and high-performance computing
resources. The availability of graphical processing units is especially relevant. Thus, only
institutional computing platforms and clusters satisfy such a high demand for
computational power and storage resources. These systems rely on resource managers
capable of handling multiple users and computing resources. However, users interested
in working with artificial neural networks, especially those without a background in
Computer Engineering, may not be proficient in system administration. For them, planning their
executions within the framework of a resource manager focused on high-performance
computing is problematic. This work presents S-TFManager, an easy-to-use open-source
web manager for launching and controlling the execution of TensorFlow models consisting
of artificial neural networks in a heterogeneous cluster with a Slurm queuing system. Both
TensorFlow and Slurm are arguably the most widely used tools in their respective fields, so
the proposed tool is of public interest. The tool, written in Python, includes built-in
batching and visualization capabilities, and its simplicity makes it easy to extend.
- Author
- Lupión, Marcos
- Cruz, C. Nicolás
- Romero, Felipe
- Sanjuan, F. Juan
- Ortigosa, M. Pilar
- Date issued
- 2025
- Access rights
- Open access
- ISSN
- 1785-8860
- Language
- en
- Pages
- 16 p.
- Subject keywords
- machine learning, TensorFlow, Slurm, Resource Management
- Version
- Publisher's version
- Identifiers
- DOI: 10.12700/APH.22.3.2025.3.4
- Journal title
- Acta Polytechnica Hungarica
- Journal year
- 2025
- Journal volume
- Vol. 22
- Journal issue
- No. 3
- Type
- Scientific article
- Subject area
- Engineering sciences - multidisciplinary engineering sciences
- Publishing university
- Óbudai Egyetem