A Lightweight Execution Manager for Training TensorFlow Models under the Slurm Queuing System
Lupión, Marcos
Cruz, C. Nicolás
Romero, Felipe
Sanjuan, F. Juan
Ortigosa, M. Pilar
2025-08-11T06:03:52Z
2025
1785-8860
http://hdl.handle.net/20.500.14044/32104
Artificial neural networks are currently the flagship of Machine Learning
and have spread to many fields beyond Computer Science. This kind of computational
model generally needs massive amounts of data and high-performance computing
resources. The availability of graphics processing units is especially relevant. Thus, only
institutional computing platforms and clusters satisfy such a high demand for
computational power and storage resources. These systems rely on resource managers
capable of handling multiple users and computing resources. However, users interested
in working with artificial neural networks, especially those without a background in
Computer Engineering, may not be proficient in system administration. For them, scheduling
executions through a resource manager designed for high-performance
computing is difficult. This work presents S-TFManager, an easy-to-use open-source
web manager for launching and controlling the execution of TensorFlow models consisting
of artificial neural networks in a heterogeneous cluster with a Slurm queuing system. Both
TensorFlow and Slurm are arguably the most widely used tools in their respective fields, so
the proposed tool is of broad interest. The tool, written in Python, includes built-in
batching and visualization capabilities, and its simplicity makes it easy to extend.
PDF
en
Open access
Óbudai Egyetem
Budapest
Engineering sciences - multidisciplinary engineering sciences