Evaluating SLURM simulator with real-machine
SLURM and vice versa
Ana Jokanovic
Barcelona Supercomputing Center (BSC)
Barcelona, Spain
ana.jokanovic@bsc.es
Marco D’Amico
Barcelona Supercomputing Center (BSC)
Barcelona, Spain
marco.damico@bsc.es
Julita Corbalan
Universitat Politecnica de Catalunya
Barcelona, Spain
julita.corbalan@bsc.es
Abstract—Having a precise and fast job scheduler model that resembles the behavior of real-machine job scheduling software is extremely important in the field of job scheduling. The idea behind the SLURM simulator is to preserve the original code of the core SLURM functions while allowing for all the advantages of a simulator. Since 2011, the SLURM simulator has passed through several iterations of improvements at different research centers. In this work, we present our latest improvements to the SLURM simulator and perform the first-ever validation of the simulator on a real machine. In particular, we improved the simulator's performance by about 2.6 times, made the simulator deterministic across runs with the same set-up, and improved its accuracy: its deviation from the real machine is lowered from the previous 12% to at most 1.7%. Finally, we illustrate with several use cases the value of the simulator for job scheduling researchers, SLURM system administrators, and SLURM developers.
Index Terms—SLURM, simulator, job scheduling, workload,
simulation, HPC
I. INTRODUCTION
Optimizing a job scheduler for the best HPC system perfor-
mance and user experience is a complex, multi-dimensional
problem. In today’s HPC systems, SLURM [1] is a widely
used plugin-based job scheduler providing multiple optimiza-
tion options, either through tuning the configuration parame-
ters or by implementing new plugins and new scheduling and
select policies. Setting up all these options in SLURM for
the optimal performance would require a detailed parametric
analysis to understand the effect of multiple, sometimes interdependent factors. A trial-and-error approach on the real machine might be slow or impractical and, most probably, would negatively impact the performance of the system. On the other hand, theoretical models may not include sufficient detail, can be imprecise, and can lead to wrong decisions.
Several years ago, the first version of SLURM simulator was
created by a SLURM system administrator from Barcelona
Supercomputing Center, A. Lucero [2], with the idea to allow
SLURM administrators to do their parametric analysis in the
SLURM code itself without affecting the system performance.
While the idea was promptly accepted by the SLURM community, none of the existing versions of the simulator had, until today, been brought to the level of precision and speed required for correct decisions and practical use.
Our team, as well, enthusiastically accepted the SLURM
simulator, i.e., the latest version that was available at the
moment [3], with the intention of implementing new job
scheduling policies. However, very soon we realized it had
many flaws which made it inaccurate for any serious study
of scheduling policies. Our methodology, which allows us to compare the simulator with the real-machine SLURM, enabled us to quantify this imprecision, and it was significant. In this work, we present the experimental results and the methodology used. They demonstrate that, when running the simulator with the same input and set-up multiple times, the average variation of a job's start time due to the simulator's inconsistency was up to 22 minutes. This variation can significantly influence the evaluation of a scheduling policy. For example, in one of the experiments we detected that, for a job of 1 minute duration, the wait time varied among 10 simulation runs from 92 minutes to 372 minutes, resulting in a variation of the job's slowdown from 93 to 373, i.e., 301%. Also, our experiments demonstrate that the deviation of the system metrics from the real machine was up to 12%. More importantly, we show that the error was clearly related to the simulated system load.
Further, we worked on identifying the causes of the inac-
curacy and low speed in the simulator’s code and introduced
multiple fixes and improvements in the inherited version of
the simulator. The evaluation of these improvements and the
comparison with the previous version is presented in this work,
as well. Our main contributions are:
We removed the random variation from the simulator and
made our simulator deterministic across multiple runs for
the same input and set-up.
We improved the accuracy, i.e., lowered the deviation of the simulator's system metric values from the real-machine ones from the previous 12% to at most 1.7%.
We improved the simulator’s performance by 2.6 times.
We presented our methodology for the evaluation of the
SLURM simulator on the real machine.
We ported the simulator to the latest SLURM version,
17.11.
We implemented converters between Standard Workload
Format (SWF) [4], [5] and the simulator’s input trace and
between SLURM’s completion log and SWF.
Once validated and evaluated regarding accuracy and per-
formance by comparing SLURM simulator results with real-
machine SLURM executions, we have used SLURM simulator
to evaluate SLURM. We have not evaluated SLURM as a
workload manager, but rather its specific configurations or
parts. For instance, we measure the impact of changing the
backfilling interval on system performance metrics, average
slowdown, average wait time and average response time, and
on the scheduler itself in terms of scheduling time. Four
different use cases for the SLURM simulator have been
selected, and in all of them, the impact on system performance metrics and on specific scheduler metrics was evaluated in the simulator.
The paper is organized as follows. Section II gives a
brief description of the SLURM simulator’s environment and
its internals, as well as the description of the problems in
the previous version and the corresponding fixes we did.
Section III explains in detail our validation methodology and
presents the results from the validation experiments, divided into subsections on consistency, i.e., variation across different simulation runs, accuracy, i.e., simulated vs. real-machine results, and performance, i.e., the simulator's execution time vs. the total simulated time. Section IV presents the reverse
analysis. Namely, we present several example use cases of
SLURM simulator for evaluating the real SLURM that can
be of value for anyone dealing with the optimization of the
SLURM performance. Section V gives an overview of the
previous works on SLURM simulator and some discussion of
theoretical models. Section VI concludes the paper and gives
our plans for the future work.
II. SLURM SIMULATOR
SLURM simulator was developed initially at Barcelona
Supercomputing Center [2] and passed several iterations of
improvements at CSCS [6] and at Umea University/Berkeley Lab [3]. Here, we briefly explain the main inputs and outputs,
and main components and characteristics of the SLURM
simulator that have been present in either all of the versions
or in most of them.
A. General description
As shown in Figure 1, the SLURM simulator receives as input the standard SLURM configuration file, slurm.conf, which allows for the specification of the system architecture as well as job scheduler details such as the scheduling and selection policies, and a trace file in the simulator's binary format. We have provided a converter from the Standard Workload Format (SWF) [4], [5] to the simulator's trace format, to enable the simulator to use the vast number of existing real-machine and modeled logs in online repositories such as Feitelson's [7]. The
SLURM simulator generates standard SLURM outputs, such
as SLURM controller daemon’s and SLURM daemon’s logs,
job completion log, SLURM database files, etc.
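As background on the input side of these converters, the sketch below shows how the job fields of an SWF record can be read in C, under the standard 18-field SWF layout; the struct, the function, and the choice of fields are ours for illustration, and the simulator's binary trace format is not shown.

```c
/* Minimal sketch of the SWF-reading side of a converter (hypothetical names).
 * SWF: lines starting with ';' are header comments; each job line has 18
 * whitespace-separated fields (job number, submit time, wait time, run time, ...). */
#include <stdio.h>

typedef struct {
    long id;         /* field 1: job number */
    long submit;     /* field 2: submit time (s) */
    long run_time;   /* field 4: run time (s) */
    int  req_procs;  /* field 8: requested number of processors */
    long req_time;   /* field 9: requested time, i.e., wall clock limit (s) */
} swf_job_t;

static int read_swf_job(FILE *f, swf_job_t *job)
{
    char line[1024];
    while (fgets(line, sizeof(line), f)) {
        long wait, used_procs, cpu, mem, req_mem;
        if (line[0] == ';' || line[0] == '\n')
            continue;                       /* skip SWF header and blank lines */
        if (sscanf(line, "%ld %ld %ld %ld %ld %ld %ld %d %ld %ld",
                   &job->id, &job->submit, &wait, &job->run_time,
                   &used_procs, &cpu, &mem, &job->req_procs,
                   &job->req_time, &req_mem) >= 9)
            return 1;                       /* got one job record */
    }
    return 0;                               /* end of file */
}
```

A real converter would then emit each record in the simulator's trace format and handle the remaining SWF fields it needs.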
Figure 1 shows three main components, i.e., three pro-
cesses of the SLURM simulator, SLURM simulator’s manager,
SLURM controller daemon and SLURM daemon.
Fig. 1. SLURM simulator's processes.
sim mgr is in charge of controlling simulated time, i.e., it increments it by one second in each simulator iteration. It also reads the input trace and submits each job to the SLURM controller when the job's arrival time is reached. The job submission is done using SLURM's sbatch API.
slurmctld is a standard SLURM controller daemon, and all the core functions of the controller, such as the job scheduling and selection policies, are the original SLURM code.
slurmd is a simplified SLURM daemon, since jobs are not actually executed. It receives the job duration
from SLURM controller, sets the job’s end time as a
future event, and notifies the controller when the job’s
end time is reached. One slurmd process is in charge of
all the nodes.
All the time functions are redefined in order to return simulated
time instead of real time.
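As a rough illustration of how such a redefinition can work, the sketch below shadows libc's time() with a version that returns a simulated clock (e.g., via link order or LD_PRELOAD); the variable and helper names are ours, not the simulator's actual code.

```c
/* Generic sketch of redefining a time function to return simulated time.
 * The simulated clock would live in memory shared by all simulator
 * processes; the time manager is the only writer, advancing it once
 * per iteration. */
#include <time.h>

static volatile time_t sim_seconds;   /* simulated clock, in seconds */

/* Shadowed libc time(): callers inside SLURM get simulated time instead. */
time_t time(time_t *out)
{
    if (out)
        *out = sim_seconds;
    return sim_seconds;
}

/* Called by the time manager once per simulation iteration. */
void sim_advance_one_second(void)
{
    sim_seconds++;
}
```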
B. Improvements
Our starting point was the version of Rodrigo et al. [3]. The
latest available version at that time was on SLURM version
14.02. We encountered a set of errors when comparing to real
machine execution. In particular, we found unexpected delays
happening at job arrival, job end, and in the job duration.
The causes of these errors were problems in process synchronization, RPC-related delays, and scheduler calls. Here
we give details of the problems found and the solutions we
provided.
1) Synchronization of the simulator’s processes: Previous
versions of SLURM simulator implemented poor synchro-
nization between processes. In Rodrigo’s simulator, the syn-
chronization is implemented with a semaphore and a shared
variable. The three simulator processes, all of slurmctld's RPCs, and the backfill thread use the semaphore to get exclusive access for editing the shared variable. To coordinate the
simulation, all the processes keep checking a specific value
for the variable to start processing. If the value is not the
right one, they sleep and re-check. Reading the variable is
not protected by the semaphore so, in the case of concurrent
reading and writing, the simulation would get into undefined behavior, i.e., a race condition occurs. This incorrect synchronization produces the loss of simulated seconds and job events happening at the wrong time. We implemented a multi-semaphore synchronization approach (Figure 2).
Fig. 2. New synchronization of the SLURM simulator's processes: sim mgr does its work, posts the simulator semaphore, and waits on the slurm semaphore; slurmd waits on the simulator semaphore, does its work, sends SIM HELPER to slurmctld, and waits for the response; slurmctld processes SIM HELPER, does its work, and responds OK; slurmd then posts the slurm semaphore.
Slurmd
unlocks the first semaphore after it finishes all the message
interchanges with slurmctld and it is consumed by the sim mgr
to pass to the next simulated second. There is no need for a
semaphore between slurmd and slurmctld since it can be done
via RPCs, as sender thread is always blocked until it receives
an OK response from the receiver. The second semaphore is incremented by sim mgr to unlock slurmd after one simulated second has passed and sim mgr has completed the processing of all events for that second, i.e., all new job requests have been sent. The
slurmd, once unlocked, sends a message called SIM HELPER
to the controller, that will respond with an OK message
when it terminates all its pending activities. By using this
synchronization, we were able to coordinate the simulation
processes correctly without losing simulated seconds, and at
the same time we sped up the simulator.
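A minimal sketch of this handshake, assuming POSIX semaphores placed in memory shared by the processes, is shown below; the function names are placeholders for the existing sim mgr and slurmd logic, and the SIM HELPER RPC is reduced to a blocking stub, so this illustrates only the protocol of Figure 2, not the simulator's actual code.

```c
/* Sketch of the two-semaphore handshake between sim_mgr and slurmd
 * (Figure 2). Both sem_t objects are assumed to live in shared memory
 * and to be initialized with sem_init(&sem, 1, 0). Hypothetical names. */
#include <semaphore.h>

/* Placeholders for the real work done by the existing simulator code. */
static void submit_jobs_arriving_now(void)  { /* send this second's sbatch requests */ }
static void trigger_job_end_events(void)    { /* fire job-end events due this second */ }
static void send_sim_helper_and_wait(void)  { /* RPC to slurmctld; blocks until OK   */ }
static void advance_simulated_time(void)    { /* move the simulated clock forward    */ }

sem_t sem_simulator;   /* posted by sim_mgr: "second N is ready for slurmd" */
sem_t sem_slurm;       /* posted by slurmd:  "second N fully processed"     */

void sim_mgr_iteration(void)
{
    submit_jobs_arriving_now();
    sem_post(&sem_simulator);   /* unlock slurmd for this simulated second */
    sem_wait(&sem_slurm);       /* block until slurmd and slurmctld are done */
    advance_simulated_time();
}

void slurmd_iteration(void)
{
    sem_wait(&sem_simulator);   /* wait for sim_mgr to finish submitting */
    trigger_job_end_events();
    send_sim_helper_and_wait(); /* the blocking SIM_HELPER round-trip replaces a third semaphore */
    sem_post(&sem_slurm);       /* allow sim_mgr to advance to the next second */
}
```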
2) RPC exchange related delays: The second main point of
improvement is related to RPC exchange. As we said, RPCs
are blocking, but a special behavior occurs at the job end.
In this scenario slurmd sends a request for terminating a job,
slurmctld responds and it also sends a request for executing an
epilog for the finished job. The slurmd immediately responds
with OK, closing the RPC, thus unlocking slurmctld that was
waiting for a response and, after executing the epilog, it sends
a new message to the controller marking the real job end. If
slurmctld is not aware of the epilogs still pending to arrive, it ends its simulated section, unlocking the simulation to proceed. Epilog messages can then arrive with one or more seconds of delay, increasing the duration of the jobs. The SCSF simulator only partially addressed this problem using sleeps, which slowed down the simulation and altered job durations, especially under high load. Shared counter variables used by the SLURM daemon's threads allow us to eliminate the delays caused by job ends. In short, we wait for the number of arrived epilog messages to be equal to the number of ending jobs in that second. Since the counters are shared by threads,
we implemented read and write locks to control access to the
counters.
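The sketch below illustrates the counter idea with a POSIX read-write lock; the variable and function names are ours, and in the simulator the equivalent logic sits inside slurmd's message-handling threads.

```c
/* Sketch of the shared epilog counters protected by a read-write lock.
 * slurmd only lets the simulated second finish once every job that ended
 * in that second has also delivered its epilog-complete message. */
#include <pthread.h>
#include <unistd.h>

static pthread_rwlock_t epilog_lock = PTHREAD_RWLOCK_INITIALIZER;
static int jobs_ending_this_second;   /* incremented when a job-end RPC is sent   */
static int epilogs_arrived;           /* incremented when an epilog message lands */

void note_job_end(void)               /* writer: a job reached its end time */
{
    pthread_rwlock_wrlock(&epilog_lock);
    jobs_ending_this_second++;
    pthread_rwlock_unlock(&epilog_lock);
}

void note_epilog_complete(void)       /* writer: epilog message arrived via RPC */
{
    pthread_rwlock_wrlock(&epilog_lock);
    epilogs_arrived++;
    pthread_rwlock_unlock(&epilog_lock);
}

void wait_for_epilogs(void)           /* reader: called before closing the second */
{
    for (;;) {
        int done;
        pthread_rwlock_rdlock(&epilog_lock);
        done = (epilogs_arrived >= jobs_ending_this_second);
        pthread_rwlock_unlock(&epilog_lock);
        if (done)
            break;
        usleep(100);                  /* epilog RPCs still in flight; poll briefly */
    }
}
```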
3) Delay in scheduler calls: Rodrigo's version was using a time-triggered FIFO scheduler, more similar to how the backfill scheduler works. We implemented an event-triggered FIFO scheduler, reproducing the behavior of the original SLURM scheduler, that is triggered at a job's arrival and at a job's end. Triggering the scheduler at a job's end increases the efficiency of the scheduler, since it does not leave resources unused or delay the start of new jobs. The same holds at a job's arrival, in the case the job can be run immediately.
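Conceptually, the change amounts to calling the scheduling pass from the job-arrival and job-end handlers instead of from a periodic timer; the sketch below shows that shape with hypothetical function names, not SLURM's internal API.

```c
/* Sketch of event-triggered scheduling: instead of a periodic timer, the
 * FIFO pass runs whenever a job arrives or ends (hypothetical names). */

static void fifo_schedule_pass(void)
{
    /* Walk the pending queue in FIFO order and start every job whose
     * resource request fits in the currently free nodes. */
}

void on_job_arrival(void)
{
    fifo_schedule_pass();   /* the new job may be able to start immediately */
}

void on_job_end(void)
{
    fifo_schedule_pass();   /* hand freed resources to waiting jobs right away */
}
```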
4) Other improvements: An important effort was porting
our improved SLURM simulator version to the latest SLURM
version 17.11. Also, we enabled the simulation to end when the last simulated job ends; previously, it required passing a specific end time. The backfill interval was hardcoded in the simulator's code; we made it editable from slurm.conf as in regular SLURM. We implemented a set of new slurm.conf parameters for different parametric analyses. Also, we did a high number of fixes in the code regarding compatibility with operating systems, execution outside of virtual machine environments, and multiple parallel executions on supercomputer nodes. We created various tools and scripts for launching and controlling the simulation, for converting an input log from SWF to simulator input and an output log to SWF, and for generating different SLURM configurations, automatic data extraction, and analysis.
III. SLURM SIMULATOR VALIDATION
In this section, we intend to show how our improvements are reflected in important aspects such as the accuracy, consistency, and performance of the simulator, and to perform the first real-machine validation of the simulator, which has not been done for any of the previous versions.
Consistency experiments: we evaluate the variability of
the simulator across multiple runs for the same input and
configuration set up. We compare our improved version
with the inherited version of the simulator.
Accuracy experiments: we evaluate the precision of the simulator by comparing it to the real-machine SLURM execution for the same input and configuration set-up. We compare both our improved and the inherited version to the real-machine SLURM using typical system performance metrics.
Performance experiments: we evaluate how fast the simulator's code is by comparing it to executions of the inherited version's code on the same machine and for the same input and configuration set-up. We report execution time and speedup as metrics.
We will compare our improved version with the inherited
version from Rodrigo et al. [3]. We will use the following
notations to distinguish among the versions:
SIM V17 [8]: our improved version, ported to SLURM 17.11.
SIM V14 [8]: our improved version, ported to SLURM 14.02, the same SLURM version as the inherited simulator's version, which is necessary for the comparison of the simulators' performance.
SIM SCSF (V14) [3]: the inherited version, i.e., our starting point, which was on SLURM version 14.02 when we started improving it.
A. Workloads
We will use in total eight workloads for our validation
experiments - four big workloads of 5000 jobs for the experi-
ments on simulator’s consistency and simulator’s performance,
TABLE I
WORKLOAD LOGS' CONFIGURATION PARAMETERS AND PERFORMANCE METRICS VALUES
Log       | Arr. pattern | #jobs | System size | Max job size | Avg wait time (s) | Avg response time (s) | Avg slowdown | Simulated time
big log 1 | ANL          | 5k    | 3456 nodes  | 128 nodes    | 31457             | 40067                 | 932          | 4.4 days
big log 2 | CTC          | 5k    | 3456 nodes  | 128 nodes    | 596               | 9046                  | 8.83         | 4.5 days
big log 3 | KTH          | 5k    | 3456 nodes  | 128 nodes    | 268               | 8958                  | 7.47         | 5.2 days
big log 4 | SDSC         | 5k    | 3456 nodes  | 128 nodes    | 17757             | 26431                 | 567          | 4.1 days
their details are in Table I, and four small workloads of 200
jobs for the experiments on accuracy. In this section, we
describe how we generated these logs, and present a set of
configuration parameters and metrics for each of the logs.
1) Big logs for consistency and performance experiments:
We have generated workload trace files, i.e., logs, using the model proposed by Cirne and Berman in [9], which is based on the analysis of many real workload traces.
This model includes user behavior concerning job submission,
i.e., job arrival time patterns, job sizes according to system
size, system load, etc. We have generated a workload with
5000 jobs in a system with 3456 nodes with 48 CPUs per
node for four different arrival patterns, as given in Table I. In Table I, to give an idea of the system load created by each of the workload logs, we also include a set of performance metric values that we obtained from running these logs in the simulator.
Average wait time: the average time passed between a job's submission and its start.
Average response time: the average time passed between a job's submission and its end.
Average slowdown: the average ratio between a job's response time and its duration, i.e., its execution time. These metrics are formalized below.
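In our own notation (not formulas taken from SLURM or the simulator), for a job j with submit, start, and end times, duration d_j, and a workload of N jobs:

```latex
% Per-job quantities and workload averages (our notation).
\begin{align*}
  \mathrm{wait}_j     &= t^{\mathrm{start}}_j - t^{\mathrm{submit}}_j, &
  \overline{\mathrm{wait}}     &= \tfrac{1}{N}\textstyle\sum_{j=1}^{N} \mathrm{wait}_j,\\
  \mathrm{resp}_j     &= t^{\mathrm{end}}_j   - t^{\mathrm{submit}}_j, &
  \overline{\mathrm{resp}}     &= \tfrac{1}{N}\textstyle\sum_{j=1}^{N} \mathrm{resp}_j,\\
  \mathrm{slowdown}_j &= \frac{\mathrm{resp}_j}{d_j} = \frac{\mathrm{wait}_j + d_j}{d_j}, &
  \overline{\mathrm{slowdown}} &= \tfrac{1}{N}\textstyle\sum_{j=1}^{N} \mathrm{slowdown}_j.
\end{align*}
```

This matches the example in the introduction: a job with a duration of 1 minute and a wait time of 92 minutes has a slowdown of (92 + 1)/1 = 93.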
2) Small logs for accuracy experiments on the real ma-
chine: The intention in this set of experiments is to compare
simulation results with the real-machine results. Therefore, we
had to limit the system size, i.e., the number of nodes we
reserve on the real machine and the log size, i.e., wall clock
time we request for this reservation, to make the experiments
on the real machine feasible. We chose a log size of 200 jobs as the maximum number of jobs from our set of applications that can be executed within 2 h. The wall clock time of 2 h was the upper limit of the fast job queue on our chosen production machine. We had to rely on this queue to get a big enough number of experiments done on the 10-node portion of the real machine in a reasonable time.
The complete step-by-step process of creating a CIRNE-model-based [9] real-applications workload for execution on the real machine is given in the flow chart in Figure 3. First,
we generate the logs using a CIRNE model for the system
size, i.e., the number of nodes that we plan to reserve on
the real machine, and maximum job size being the maximum
number of nodes required by the real applications in the
workload that we are going to use. As in the case of big logs,
we create four logs, for four different arrival patterns: ANL,
CTC, KTH, and SDSC.
Fig. 3. Creation of a real-applications workload based on the CIRNE model (flow chart). CIRNE model input parameters: 200 jobs, 10-node system, 8-node maximum job size, ANL, CTC, KTH or SDSC arrival pattern.
On the other hand, we create a pool of NAS benchmarks with different inputs and job sizes, such that the maximum job size matches the maximum job size in the
CIRNE log and execute each of them on the real machine to
collect their execution times. Then we map each benchmark to the most similar job from the CIRNE log based on its size and duration, and assign the job's arrival time to the benchmark. Since the durations of the benchmarks will not be exactly the same as those of the log's jobs, we make sure the wall clock times of the jobs in the newly generated workload are calculated with a good enough approximation. First, the ratio between the duration
and the wall clock time for each job in the CIRNE log is
calculated and then, it is used to multiply the duration of
its real-benchmark match and get the real benchmark’s wall
clock time. Finally, we sort the real benchmarks by their
newly assigned arrival time and convert this list to a submitter
script. This submitter script is a list of sbatch requests with
the corresponding number of seconds between them to match
the arrival times of the applications. These sbatch requests are
submitted to our SLURM over SLURM environment explained
later on in Section III-C.
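One way to write the wall-clock step described above (our reading of the ratio, not a formula given in the original text): if a CIRNE log job has duration d_log and wall clock limit w_log, and its matched benchmark has measured duration d_real, the benchmark's requested wall clock time w_real is

```latex
% Wall clock limit assigned to a real benchmark matched to a CIRNE log job,
% assuming the ratio is applied as wall clock over duration.
\[
  w_{\mathrm{real}} \;=\; d_{\mathrm{real}} \cdot \frac{w_{\mathrm{log}}}{d_{\mathrm{log}}}
\]
```

i.e., the benchmark inherits the log job's wall-clock-to-duration ratio.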
B. System and job scheduler configuration
Here we explain the set-up of the SLURM configuration file, slurm.conf, for each type of experiment. For our con-
sistency and performance experiments, we use the system
configuration that corresponds to one of the machines to which
we have access. The system size of the machine is 3456
computing nodes; each node has 2 CPUs and each CPU 24
cores. This set-up will be the exact configuration of the system
in our experiments.
For the accuracy experiments, we use the system size of
10 nodes and the same node architecture configuration since
the real-machine experiments will be executed on a 10-node portion of the mentioned reference machine.
Fig. 4. The accuracy experiments flow chart.
The SLURM
and the SLURM simulator are configured to use the linear select policy and the FIFO&backfill job scheduling policies. All the
parameters regarding job scheduling are the same for all the
experiments. The summary of the relevant system and job
scheduler configuration parameters is given in Table II.
TABLE II
RELEVANT SLURM CONFIGURATION FILE PARAMETERS
Configuration parameter | Consistency & Performance | Accuracy
System size (nodes)     | 3456                      | 10
Number of CPUs per node | 48                        | 48
Select policy           | linear                    | linear
Scheduling policy       | backfill                  | backfill
Backfill interval       | 30 s                      | 30 s
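As a concrete illustration of Table II, the relevant portion of slurm.conf could look like the fragment below. This is a minimal sketch using standard SLURM option names; the node names, the hostlist ranges, and the partition name are placeholders, and the actual files used for the experiments may differ.

```
# Scheduling set-up shared by all experiments (sketch, not the exact file used)
SelectType=select/linear
SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30        # backfill interval of 30 s

# Consistency & performance experiments: 3456 nodes, 2 sockets x 24 cores each
NodeName=node[0001-3456] Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 State=IDLE
PartitionName=main Nodes=node[0001-3456] Default=YES MaxTime=INFINITE State=UP
# The accuracy experiments use the same node definition restricted to 10 nodes.
```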
C. Real machine experiments and SLURM over SLURM en-
vironment
Since changing the configuration of the production machine's SLURM by a regular researcher is typically not permitted, we created an environment where we can launch the real SLURM as a job on a portion of N nodes of the production machine. Thus, our SLURM is executed as a production-machine job, in exclusive mode, i.e., the requested nodes are fully allocated to our SLURM, and its daemons do not interfere with other jobs running in the system at the node level. It acts as a regular job scheduler for a sequence of real-application jobs submitted to it, and it is freely configurable by us.
This environment allows unhindered configuration of set up
parameters, and execution of predefined workloads on the real-
machine, i.e., using the same set-up as for the simulators’
versions that we want to compare to. The scheme of the
steps we follow in our real-machine experiments intended for
accuracy evaluation is shown in Figure 4. First, we create the
real-applications workload, as explained in Section III-A2. Then we execute this workload on the real machine, submitting the jobs to our SLURM, and collect the job completion log. This log is converted to the input trace of the simulator and replayed in the simulator. From this simulation, we obtain the simulator's job completion log. From the real-machine log and the simulator's
log, we calculate system performance metrics and perform the
comparison.
D. Consistency results
As we explained in Section II-B, we observed a significant
variation in SIM SCSF results for the same set up across
different runs. Here we quantify this variation by performing simulations of the four big logs on the system size of 3456 nodes.
Figure 5 shows the variation range of start time for each
job across ten different runs for four big workloads in two
simulator's versions. SIM V17 has been improved and gives deterministic results, i.e., zero variation across multiple runs for the same input, whereas the inherited version, SIM SCSF, varies significantly, from up to 108 minutes to up to 277 minutes depending on the workload. In Figure 6 we give
the characterization of the load for each of the workloads in
terms of the total number of requested and busy nodes in
time until the last, i.e., the 5000th job has been submitted.
Figure 7 shows the correlation between the average slope of
the load in time and the average error derived from the results
in Figure 5 for each workload when running SIM SCSF. We
conclude that the higher variation in SIM SCSF is due to
higher system load, i.e., there is a correlation between the
simulated system load and the SIM SCSF simulator's variation, which is consistent with our understanding of the problems
found in the SIM SCSF code.
E. Accuracy results
The same small workload is run on the real machine and
in two versions of the simulator, SIM V17 and SIM SCSF
as explained in Section III-C and in Figure 4. In Figure 8
we compare the system performance metrics obtained from
simulation runs to the system performance metrics obtained
from real-machine runs. The simulator version SIM SCSF was run ten times, and the average results are presented, since, as we saw in Section III-D, this version experiences significant variation across runs. Figure 8 gives the deviation of the simulator's system performance metrics from the real-machine case. It shows that the SIM V17 version is quite close to the real-machine SLURM. It deviates at most 1.7% in any of the system metrics, and in most of the cases, it is below 1%. On the other hand, version SIM SCSF deviates from 6% to 12% from the real-machine SLURM. Since
in Section III-D we already found the correlation between the
system load and the error of the SIM SCSF, we expect this
deviation to be even higher when simulating bigger systems
and workloads.
F. Performance results
We compare the speed of our simulator with the speed of
SIM SCSF simulator. Since SIM SCSF is on SLURM version 14.02, for the comparison we use our improved version on the same SLURM version, SIM V14, to compare the same versions of the code. Also, it is important to mention that we run all the simulators on the same machine and in the same environment; thus our comparison is valid. We
also include the results for the performance of the SIM V17,
since this is our latest contribution to SLURM simulator’s
code. Simulator’s speed is calculated as the ratio of total
Fig. 5. Difference between maximum and minimum start time value for each job in the workload across 10 simulations of the same log and the same configuration set-up. Panels: (a) ANL log, (b) CTC log, (c) KTH log, (d) SDSC log, 5k jobs each. Arrival time of a job in the workload (x-axis); the job's maximum variation in start time across 10 simulations (y-axis). Two simulator versions are compared, SIM V17 (red line) and SIM SCSF (blue line). Big logs of 5000 jobs are used on the system of 3456 nodes, as explained in Section III-A1. There are 5000 jobs in each case, but they arrive in different time spans for different logs; thus, the x-axes are not at the same scale. Note that the y-axes are at the same scale.
Fig. 6. System load. On the x-axis is the arrival time of the jobs, and on the y-axis is the total number of requested and busy nodes. Note that the system load is presented until the last job in the workload arrives in the scheduler queue, and not until the end of the simulation.
simulated time and the simulation execution time. Figures 9a and 9b show the execution time and the speedup of each of the simulator's versions for four different big CIRNE logs, respectively. Our version SIM V14 is 2.3 to 2.6
Fig. 7. Correlation between the average slope of system load vs. arrival
time, derived from Figure 6 and the average difference between maximum
and minimum start time across 10 runs of SIM SCSF, derived from Figure 5.
times faster than the SIM SCSF version depending on the
workload. The speed of the current version of the simulator is rather dependent on the total simulated time, since sim mgr increments the time by one second each iteration; even with a single job, the simulator will take a significant time to execute, i.e.,
Fig. 8. Deviation of different system performance metrics obtained with the SLURM simulator from their respective values obtained with the real-machine SLURM. Panels: (a) ANL log, (b) CTC log, (c) KTH log, (d) SDSC log, 200 jobs each. Two versions of the simulator are evaluated, SIM V17 and SIM SCSF (x-axis); percent of deviation w.r.t. the real-machine results for each of the system performance metrics: average wait time, average response time, and average slowdown (y-axis). Small logs, in the case of the simulation experiments, and equivalent small real-application workloads, in the case of the real-machine experiments, each of 200 jobs, are used on the system of 10 nodes, as explained in Sections III-A2 and III-C and in Figures 3 and 4. Note that the y-axes are at the same scale.
Fig. 9. Comparison of the performance among simulator versions: (a) execution time, (b) speedup, i.e., simulated time (see Table I) divided by execution time. Four different big logs of 5000 jobs on the system of 3456 nodes (x-axis). Three different versions of the simulator: SIM V17, SIM V14, SIM SCSF. The version SIM V14 is necessary here for comparison purposes, since SIM SCSF is on SLURM version 14.02. All the simulators are executed in the same environment on the same machine.
more than necessary. We believe there is room for significant additional improvement in the simulator's speed by allowing sim mgr to change the number of seconds incremented in each iteration. However, this additional improvement is planned for future work. We also include results for two real logs from regular production-machine executions covering a period of over eight months. The two logs, ANL Intrepid and CEA Curie from Feitelson's repository [7], contain a much higher number of jobs, 68936 and 198509, respectively. Also, they are simulated on their original system sizes of 40960 and 5040 nodes. Figure 10 shows that these big traces can be executed in the order of a day and that the speedup is comparable to the one we reported for the 5000-job logs.
IV. SLURM SIMULATOR USE CASES
We present several use cases that show the value of the
simulator for various parametric studies.
A. Use case 1: Impact of backfill interval on scheduler per-
formance
The configuration of job scheduler parameters, such as
backfill interval, may have an important impact on the system
performance and scheduling time. Figure 11a shows the sys-
tem performance metrics for various backfill interval values.
We can see that lowering the backfill interval below 30 s may bring less than 1% of improvement. However, as shown in Figure 11b, this can cause a significant and unnecessary increase
Fig. 10. Performance of the SIM V17 simulator for big real logs as input: (a) execution time, (b) speedup, i.e., simulated time divided by execution time. Two different real big logs, ANL Intrepid and CEA Curie, from Feitelson's repository [7] (x-axis). The ANL Intrepid log contains 68936 jobs and is executed on the original system size of 40960 quad-core nodes. The CEA Curie log contains 198509 jobs from a partition of 5040 nodes, each with 2 sockets and 8 CPUs per socket.
Fig. 11. Impact of the backfill interval value on (a) system performance metrics and (b) backfill scheduler time. Backfill intervals from 7 to 90 s (x-axis). (a) Average wait time and average response time (left y-axis), slowdown (right y-axis). (b) Total time spent executing the backfill scheduler during the entire simulation (y-axis). Big log of 5000 jobs with the ANL arrival pattern on the system of 3456 nodes, simulated on the SIM V17 simulator.
in the total job scheduling time. Thus, this simple experiment on the simulator may allow system administrators to choose a good enough backfill interval for their system.
B. Use case 2: Impact of job queue length on scheduler
performance
Similarly, a system administrator or a researcher can think
of implementing new job scheduler parameters and testing the
system and scheduler performance for different values of these
parameters. As an example, we implemented backfill queue
limit, a parameter that limits the number of jobs in the queue
tested by the backfill scheduler. The analysis of the impact of different values of this parameter on system performance and total backfill time is given in Figure 12a and Figure 12b, respectively.
C. Use case 3: Impact of system size on system performance
metrics
This set of experiments aims to show how a potential
increase or decrease in system size may impact system per-
formance metrics. Running the typical system’s workload in
the simulator may help the system administrators evaluate
the overall cost of the system size change. In our example
experiment in Figure 13, a decrease of the system size by 12.5% or 25% degrades the system performance metrics by around 40-60% and 100-140%, respectively. On the other hand, an increase of the system size by 12.5% or 25% improves the system performance metrics by around 30-40% and 50-70%,
respectively.
D. Use case 4: Evaluation of the job scheduler scalability
The simulator may help to evaluate the scalability of the
SLURM scheduler itself. Namely, we estimate the time spent
in backfill scheduling for different system loads. We achieve
different system loads by running the same workload on the
different system sizes. When system size decreases, the system
becomes more loaded, and vice versa. Figure 14 shows that the total time spent in the backfill scheduler for a workload of more than four days is around 2 minutes; even in the extreme case of a system half the size, the total backfill time reaches only 5.5 minutes. This simple example shows SLURM to be rather scalable, but such an experiment can also be used by job scheduling researchers and SLURM developers to estimate the scalability of new scheduling algorithm implementations.
V. RELATED WORK
Job scheduling evaluation is an important topic since small
changes in the scheduling and resource management can
significantly affect system performance. There are different
Fig. 12. Impact of the backfill queue size on (a) system performance metrics and (b) backfill scheduler time. Number of jobs in the queue checked by the backfill scheduler, from 20 to 100 (x-axis). (a) Average wait time and average response time (left y-axis), slowdown (right y-axis). (b) Total time spent executing the backfill scheduler during the entire simulation (y-axis). Big log of 5000 jobs with the ANL arrival pattern on the system of 3456 nodes, simulated on the SIM V17 simulator.
Fig. 13. Impact of system size on system performance metrics. System size in number of nodes (x-axis). The middle point is the reference system; the two points on the left are system sizes reduced by 12.5% and 25% w.r.t. the reference system, respectively; the two points on the right are system sizes increased by 12.5% and 25% w.r.t. the reference system, respectively. Average wait time and average response time (left y-axis), slowdown (right y-axis). Big log of 5000 jobs with the ANL arrival pattern on the system of 3456 nodes, simulated on the SIM V17 simulator.
methodologies for evaluating the efficiency and effectiveness
of a job scheduler, which we can separate into benchmarks and simulations. Benchmarks assume a real run of workloads in a cluster, with the purpose of evaluating well-known system metrics [10] or specific aspects of the system that the administrator needs to optimize [11], such as the effect of dynamic job scheduling in the context of malleable jobs. However, it is not always possible to stop a production machine to perform this type of evaluation, so usually simulations are more convenient
and practical to perform.
We can further divide job scheduling simulators into general job scheduling simulators and implementation-specific job scheduling simulators.
There are plenty of traditional job scheduler simula-
tors [12] [13] [14] [15]. These simulators are suitable for
a theoretical evaluation of scheduling algorithms, and they
usually include configurations for modeling the different plat-
Fig. 14. Impact of system load on the backfill scheduler scalability and the number of backfilled jobs. System size in number of nodes (x-axis). The middle point is the reference system; the three points on the left are system sizes reduced by 12.5%, 25%, and 50% w.r.t. the reference system, respectively; the three points on the right are system sizes increased by 12.5%, 25%, and 50% w.r.t. the reference system, respectively. Total time spent executing the backfill scheduler during the entire simulation (left y-axis). Total number of jobs scheduled by the backfill scheduler (right y-axis). Big log of 5000 jobs with the ANL arrival pattern on the system of 3456 nodes, simulated on the SIM V17 simulator.
forms, partitions, hardware, energy, and networks. General job
scheduler simulators lack sufficient detail, the characterization of parameters included in production software, and the software architecture that a system administrator might want to optimize to get the most out of a particular machine. Batsim presents some accuracy tests, done after developing an adaptor between Batsim and OAR [16] to allow using OAR schedulers in the simulator, and a submission system that reads and sends 800 job requests to OAR operating on 161 nodes. This methodology is hard to reproduce, since few workload managers permit decoupling the scheduler code from the rest, and the simulation is limited to the scheduler itself, not the whole software infrastructure, including other code parts and implementation-related parameters. Simbatch also presents accuracy tests, implementing models for simulating batch schedulers and running tests of 100 tasks on 5 nodes with OAR. In this case, it is not clear to which level of detail the models represent real job schedulers. In conclusion, standard job scheduling simulators are a good start for the evaluation of a scheduling algorithm, but they give only limited hints about how it will perform on a real machine, and they do not represent an extensive set of tools for system administrators.
On the other hand, implementation-specific simulators keep all the details of a specific job scheduler, maintain its architecture, reuse its source code, and give system administrators the possibility to try different configurations and algorithms with the objective of tuning system performance. To the best of our knowledge, there are three simulators in this category. The first is Qsim [17], an event-based simulator for Cobalt [18], a job scheduler specific to Blue Gene systems. A second example is the Moab [19] scheduler, which implements a simulator mode in which the user can interact with and control simulated time, but it is proprietary software. Flux [20] is a resource management framework that includes a simulator in its code, but both its publications and its documentation lack information about it. There is no published evaluation of the consistency and the accuracy of any of these simulators. The last one is the SLURM simulator, which, before our work, went through several previous improvements. The first version was based on SLURM version 2, developed at Barcelona Supercomputing Center by A. Lucero [2]. In the second one, Trofinoff and Benini [6] from CSCS updated the simulator to SLURM version 14 and brought a series of improvements over it. Our work is built on top of G. Rodrigo's [3] effort, which improved the synchronization and the simulator's speed, together with a set of tools for workload generation, scheduler configuration, and output analysis. None of these simulators satisfied our needs regarding accuracy and consistency.
Finally, Simakov et al. [21] attempted to validate their own SLURM simulator version. Simakov heavily simplified the simulator structure, serializing the code into a single process, the SLURM controller, which can be compiled as a simulator. While this significantly reduced the amount of executed code and the complexity, it lost some of the features that SLURM can offer, e.g., plugins that are used inside the SLURM node daemons, and the original SLURM architecture. Moreover, the paper is weak in the validation part, in which the SLURM code is not validated with real-machine runs but with a SLURM compiled in front-end mode. This mode is typically used by SLURM developers for testing and debugging purposes, and it is the base mechanism used by all SLURM simulators. It allows simulating multiple nodes by routing all messages to the same slurmd, which acts as a front-end. This methodology alters the SLURM architecture and communications, and it overloads a single daemon that is in charge of simulating a high number of nodes, affecting the final results. On our side, we validated the simulator on a real machine, by creating and running real workloads on a supercomputer. In our validation, we reported higher accuracy, in both the real runs and the simulator runs. Our simulator runs are completely deterministic, while it is not clear where the variability in Simakov's simulator comes from, as controlling simulated time and simplifying the SLURM code remove all the sources of outliers, and no real-machine variability model was reported as part of that work. Regarding the simulator code, we reported better speedup while keeping the standard SLURM architecture, which opens more possibilities for system administrators.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we presented our latest improvements of
SLURM simulator, together with the methodology and the
results of the first-ever validation of the simulator on the
real machine. The validation experiments show a performance improvement of 2.6 times compared to the previous version, deterministic results across multiple same-input executions of the simulator, and accuracy at the level of the real machine, with at most 1.7% deviation. Our effort
also includes porting of the simulator to the latest version of
SLURM, 17.11. Besides, we present a selected number of use
cases to illustrate the usefulness of the simulator for SLURM
administrators, researchers, and developers. The parametric
studies of the impact of backfill interval and system size on
system performance might be a use case for the administrators.
Adding new parameters and testing their influence on the system performance, or studying the execution time of existing and new job scheduling algorithms as a function of system load, might be use cases for researchers and developers.
Our next efforts will go in two main directions: introducing
new models into the simulator and enabling support for
heterogeneous jobs.
Since the variability encountered in the previous versions
of the simulator was due to poor implementation of the syn-
chronization, it cannot be justified by the variability of a real-
machine scheduler. Modeling this variability and introducing a
parametrizable real-machine variability model in the simulator
will be one of our future efforts.
Currently, the simulator receives as an input a job duration
that is fixed during the entire simulation. Implementing a
performance model that will enable changing a job’s execution
time depending on the architecture or other factors, such as
sharing resources, is an important task for the future. Similarly, we plan to integrate energy models into the simulator.
Since the simulator is ported to SLURM 17.11, which provides support for heterogeneous jobs, we plan to adapt the simulator's internals and inputs to accept and process heterogeneous job requests.
ACKNOWLEDGMENT
This work is partially supported by the Spanish Govern-
ment through Programa Severo Ochoa (SEV-2015-0493), by
the Spanish Ministry of Science and Technology through
TIN2015-65316-P project, by the Generalitat de Catalunya
(contract 2017-SGR-1414) and from the European Commis-
sion’s Horizon 2020 Programme for research, technological
development and demonstration under Grant Agreement No
754304.
The authors would like to thank previous contributors,
Alejandro Lucero, Massimo Benini and Gonzalo Rodrigo, for
their work and their timely response to our questions.
REFERENCES
[1] M. A. Jette, A. B. Yoo, and M. Grondona, “Slurm: Simple linux utility
for resource management,” in Proceedings of the 9th International
Workshop Job Scheduling Strategies for Parallel Processing (JSSPP).
Springer, Lecture Notes in Computer Science (LNCS), volume 2862,
2003, pp. 44–60.
[2] A. Lucero, “Simulation of batch scheduling using real production-ready
software tools,” in Proceedings of the 5th IBERGRID, 2011.
[3] G. P. Rodrigo, E. Elmroth, P.-O. Ostberg, and L. Ramakrishnan, “Scsf:
A scheduling simulation framework,” in Proceedings of the 21st Inter-
national Workshop Job Scheduling Strategies for Parallel Processing
(JSSPP), 2017.
[4] S. J. Chapin, W. Cirne, D. G. Feitelson, J. P. Jones, S. T. Leutenegger,
U. Schwiegelshohn, W. Smith, and D. Talby, “Benchmarks and standards
for the evaluation of parallel job schedulers,” in Proceedings of the
13th International Workshop Job Scheduling Strategies for Parallel
Processing (JSSPP). Springer-Verlag, 1999, pp. 66–89.
[5] The Standard Workload Format. [Online]. Available: http://www.cs.
huji.ac.il/labs/parallel/workload/
[6] S. Trofinoff and M. Benini, Using and Modifying the BSC Slurm
Workload Simulator, Slurm User Group Meeting 2015. [Online].
Available: https://slurm.schedmd.com/SLUG15/
[7] Logs of Real Parallel Workloads from Production Systems. [Online].
Available: http://www.cs.huji.ac.il/labs/parallel/workload/
[8] BSC Slurm Simulator. [Online]. Available: https://github.com/BSC-RM/slurm_simulator/
[9] W. Cirne and F. Berman, “A comprehensive model of the supercomputer
workload,” in Proceedings of the 4th Annual Workshop on Workload
Characterization, 2001.
[10] A. T. Wong, L. Oliker, W. T. Kramer, T. L. Kaltz, and D. H. Bailey,
“ESP: A system utilization benchmark,” in Supercomputing, ACM/IEEE
2000 Conference. IEEE, 2000, pp. 15–15.
[11] V. Lopez, A. Jokanovic, M. DAmico, M. Garcia, R. Sirvent, and J. Cor-
balan, “Djsb: Dynamic job scheduling benchmark,” in Job Scheduling
Strategies for Parallel Processing: 21st International Workshop, JSSPP
2017, Orlando, FL, USA, June 2, 2017, Revised Selected Papers.
Springer, 2017.
[12] P.-F. Dutot, M. Mercier, M. Poquet, and O. Richard, “Batsim:
a Realistic Language-Independent Resources and Jobs Management
Systems Simulator,” in 20th Workshop on Job Scheduling Strategies
for Parallel Processing, Chicago, United States, May 2016. [Online].
Available: https://hal.archives-ouvertes.fr/hal-01333471
[13] D. Klusáček and H. Rudová, “Alea 2: Job scheduling simulator,” in
Proceedings of the 3rd International ICST Conference on Simulation
Tools and Techniques, ser. SIMUTools ’10. ICST, Brussels, Belgium,
Belgium: ICST (Institute for Computer Sciences, Social-Informatics
and Telecommunications Engineering), 2010, pp. 61:1–61:10. [Online].
Available: https://doi.org/10.4108/ICST.SIMUTOOLS2010.8722
[14] Y. Caniou and J. S. Gay, “Simbatch: An api for simulating and predicting
the performance of parallel resources managed by batch systems,” in
Euro-Par 2008 Workshops - Parallel Processing, E. César, M. Alexander, A. Streit, J. L. Träff, C. Cérin, A. Knüpfer, D. Kranzlmüller, and S. Jha,
Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 223–
234.
[15] H. Casanova, A. Giersch, A. Legrand, M. Quinson, and F. Suter,
“Versatile, scalable, and accurate simulation of distributed applications
and platforms,” Journal of Parallel and Distributed Computing,
vol. 74, no. 10, pp. 2899–2917, Jun. 2014. [Online]. Available:
http://hal.inria.fr/hal-01017319
[16] “Oar resource manager.” [Online]. Available: http://oar.imag.fr/
documentation
[17] W. Tang, Z. Lan, N. Desai, and D. Buettner, “Fault-aware, utility-based
job scheduling on blue gene/p systems,” in 2009 IEEE International
Conference on Cluster Computing and Workshops, Aug 2009, pp. 1–10.
[18] Cobalt website. [Online]. Available: https://www.alcf.anl.gov/
cobalt-scheduler
[19] Maui simulator. [Online]. Available: http://docs.adaptivecomputing.
com/maui/16.0simulations.php
[20] D. H. Ahn, J. Garlick, M. Grondona, D. Lipari, B. Springmeyer, and
M. Schulz, “Flux: A next-generation resource management framework
for large hpc centers,” in 2014 43rd International Conference on Parallel
Processing Workshops, Sept 2014, pp. 9–17.
[21] N. A. Simakov, M. D. Innus, M. D. Jones, R. L. DeLeon, J. P. White,
S. M. Gallo, A. K. Patra, and T. R. Furlani, “A slurm simulator:
Implementation and parametric analysis,” in High Performance Com-
puting Systems. Performance Modeling, Benchmarking, and Simulation,
S. Jarvis, S. Wright, and S. Hammond, Eds. Cham: Springer Interna-
tional Publishing, 2018, pp. 197–217.
... We integrated it into Slurm [14], a Distributed Resource Management System (DRMS), and the Slurm Simulator [15]. Integrating the prediction model in a job scheduling simulator, we made the Slurm Simulator workload and energyaware, capable of calculating energy consumption based on the type of application and not only the hardware like most of the job scheduling simulators. ...
... The evaluation is based on simulations using the BSC Slurm jobs scheduler Simulator [15]. We modeled a workload of 5000 jobs, with a makespan between 10 and 15 days, The number of modeled hardware architectures was limited by the available architectures and permissions needed to collect the necessary data. ...
Preprint
Full-text available
New HPC machines are getting close to the exascale. Power consumption for those machines has been increasing, and researchers are studying ways to reduce it. A second trend is HPC machines' growing complexity, with increasing heterogeneous hardware components and different clusters architectures cooperating in the same machine. We refer to these environments with the term heterogeneous multi-cluster environments. With the aim of optimizing performance and energy consumption in these environments, this paper proposes an Energy-Aware-Multi-Cluster (EAMC) job scheduling policy. EAMC-policy is able to optimize the scheduling and placement of jobs by predicting performance and energy consumption of arriving jobs for different hardware architectures and processor frequencies, reducing workload's energy consumption, makespan, and response time. The policy assigns a different priority to each job-resource combination so that the most efficient ones are favored, while less efficient ones are still considered on a variable degree, reducing response time and increasing cluster utilization. We implemented EAMC-policy in Slurm, and we evaluated a scenario in which two CPU clusters collaborate in the same machine. Simulations of workloads running applications modeled from real-world show a reduction of response time and makespan by up to 25% and 6% while saving up to 20% of total energy consumed when compared to policies minimizing runtime, and by 49%, 26%, and 6% compared to policies minimizing energy.
... We integrated it into Slurm [14], a Distributed Resource Management System (DRMS), and the Slurm Simulator [15]. Integrating the prediction model in a job scheduling simulator, we made the Slurm Simulator workload and energyaware, capable of calculating energy consumption based on the type of application and not only the hardware like most of the job scheduling simulators. ...
... The evaluation is based on simulations using the BSC Slurm jobs scheduler Simulator [15]. We modeled a workload of 5000 jobs, with a makespan between 10 and 15 days, The number of modeled hardware architectures was limited by the available architectures and permissions needed to collect the necessary data. ...
Preprint
Full-text available
New HPC machines are getting close to the exascale. Power consumption for those machines has been increasing, and researchers are studying ways to reduce it. A second trend is HPC machines' growing complexity, with increasing heterogeneous hardware components and different clusters architectures cooperating in the same machine. We refer to these environments with the term heterogeneous multi-cluster environments. With the aim of optimizing performance and energy consumption in these environments, this paper proposes an Energy-Aware-Multi-Cluster (EAMC) job scheduling policy. EAMC-policy is able to optimize the scheduling and placement of jobs by predicting performance and energy consumption of arriving jobs for different hardware architectures and processor frequencies, reducing workload's energy consumption, makespan, and response time. The policy assigns a different priority to each job-resource combination so that the most efficient ones are favored, while less efficient ones are still considered on a variable degree, reducing response time and increasing cluster utilization. We implemented EAMC-policy in Slurm, and we evaluated a scenario in which two CPU clusters collaborate in the same machine. Simulations of workloads running applications modeled from real-world show a reduction of response time and makespan by up to 25% and 6% while saving up to 20% of total energy consumed when compared to policies minimizing runtime, and by 49%, 26%, and 6% compared to policies minimizing energy.
... Other research efforts that are relevant to the present work include the recent advancements in the Slurm Simulator [27]. There are two distinctive versions of the Slurm Simulator Slurm V1 [28,29,30] and Slurm V2 [31,11]. Slurm V2 is extensively simplified compared to Slurm V1, i.e, Slurm V2 serializes the code on a single process, called sim controller. ...
Preprint
Full-text available
HPC users aim to improve their execution times without particular regard for increasing system utilization. On the contrary, HPC operators favor increasing the number of executed applications per time unit and increasing system utilization. This difference in the preferences promotes the following operational model. Applications execute on exclusively-allocated computing resources for a specific time and applications are assumed to utilize the allocated resources efficiently. In many cases, this operational model is inefficient, i.e., applications may not fully utilize their allocated resources. This inefficiency results in increasing application execution time and decreasing system utilization. In this work, we propose a resourceful coordination approach (RCA) that enables the cooperation between, currently independent, batch- and application-level schedulers. RCA enables application schedulers to share their allocated but idle computing resources with other applications through the batch system. The effective system performance (ESP) benchmark is used to assess the proposed approach. The results show that RCA increased system utilization up to 12.6% and decreased system makespan by the same percent without affecting applications' performance.
... 4) Slurm: Simple Linux Utility for Resource Management (SLURM) is a Linux-based compute resource manager that can handle from two to thousands of servers and hundreds of clusters of multiple nodes at a time [12]. SLURM is a very powerful task scheduling system; it provides users with exclusive and/or shared access to cluster resources. ...
Article
High performance computing (HPC) workflows are undergoing tumultuous changes, including an explosion in size and complexity. Despite these changes, most batch job systems still use slow, centralized schedulers. Generalized hierarchical scheduling (GHS) solves many of the challenges that face modern workflows, but GHS has not been widely adopted in HPC. A major difficulty that hinders adoption is the lack of a performance model to aid in configuring GHS for optimal performance on a given application. We propose an analytical performance model of GHS, and we validate our proposed model with four different applications on a moderately-sized system. Our validation shows that our model is extremely accurate at predicting the performance of GHS, explaining 98.7% of the variance (i.e., an R² statistic of 0.987). Our results also support the claim that GHS overcomes scheduling throughput problems; we measured throughput improvements of up to 270× on our moderately-sized system. We then apply our performance model to a pre-exascale system, where our model predicts throughput improvements of four orders of magnitude and provides insight into optimally configuring GHS on next generation systems.
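As a side note on the validation metric quoted above, the R² statistic reports the fraction of variance in the measured values that the model explains. The short sketch below is a generic illustration with made-up numbers, not the authors' model or data; it shows how R² is typically computed from measured and predicted throughputs.

```python
# Generic computation of the coefficient of determination (R^2) between
# measured values and model predictions; the numbers below are made up.
def r_squared(measured, predicted):
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

measured  = [100.0, 220.0, 410.0, 790.0]   # e.g., jobs scheduled per second
predicted = [105.0, 215.0, 400.0, 805.0]
print(f"R^2 = {r_squared(measured, predicted):.3f}")
```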
Chapter
Slurm is an open-source resource manager for HPC that provides high configurability for inhomogeneous resources and job scheduling. Various Slurm parametric settings can significantly influence HPC resource utilization and job wait time; however, in many cases it is hard to judge how these options will affect the overall HPC resource performance. The Slurm simulator can be a very helpful tool to aid parameter selection for a particular HPC resource. Here, we report our implementation of a Slurm simulator and the impact of parameter choice on HPC resource performance. The simulator is based on a real Slurm instance with modifications to allow simulation of historical jobs and to improve the simulation speed. The simulator speed heavily depends on job composition, HPC resource size, and Slurm configuration. For an 8000-core heterogeneous cluster, we achieve about 100 times acceleration; e.g., 20 days can be simulated in 5 hours. Several parameters affecting job placement were studied. Disabling node sharing on our 8000-core cluster showed a 45% increase in the time needed to complete the same workload. For a large system (>6000 nodes) comprised of two distinct sub-clusters, using two separate Slurm controllers and adding node sharing can cut waiting times nearly in half.
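As a quick check of the quoted figure, simulating 20 days (480 hours) of workload in 5 hours corresponds to a speedup of 480 / 5 ≈ 96, i.e., roughly the stated factor of 100.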
Conference Paper
As large-scale computation systems are growing to exascale, Resources and Jobs Management Systems (RJMS) need to evolve to manage this change of scale. However, their study is problematic since they are critical production systems, where experimenting is extremely costly due to downtime and energy costs. Meanwhile, many scheduling algorithms emerging from theoretical studies have not been transferred to production tools for lack of realistic experimental validation. To tackle these problems, we propose Batsim, an extendable, language-independent and scalable RJMS simulator. It allows researchers and engineers to test and compare any scheduling algorithm, using a simple event-based communication interface, which allows different levels of realism. In this paper, we show that Batsim's behaviour matches that of the real RJMS OAR. Our evaluation process was designed with reproducibility in mind, and all the experiment material is freely available.
Conference Paper
Job scheduling on large-scale systems is an increasingly complicated affair, with numerous factors influencing scheduling policy. Addressing these concerns results in sophisticated scheduling policies that can be difficult to reason about. In this paper, we present a general utility-based scheduling framework to balance various scheduling requirements and priorities. It enables system owners to customize scheduling policies under different circumstances without changing the scheduling code. We also develop a fault-aware job allocation strategy for Blue Gene/P systems to address the increasing concern of system failures. We demonstrate the effectiveness of these facilities by means of event-driven simulations with real job traces collected from the production Blue Gene/P system at Argonne National Laboratory.
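To make the notion of a utility-based scheduling policy concrete, the fragment below sketches one possible utility function that trades queue wait time against requested job size; the weights and field names are assumptions for illustration, not the framework proposed in that paper.

```python
# Hypothetical utility function for ordering a job queue: longer-waiting and
# smaller jobs score higher. Weights and field names are illustrative only.
import time

def utility(job, now=None, w_wait=1.0, w_size=0.001):
    now = now or time.time()
    wait_s = now - job["submit_time"]
    return w_wait * wait_s - w_size * job["nodes"] * job["requested_s"]

def order_queue(jobs):
    # Highest utility first; a scheduler would try to start jobs in this order.
    return sorted(jobs, key=utility, reverse=True)

queue = [
    {"id": 1, "submit_time": time.time() - 3600, "nodes": 64, "requested_s": 7200},
    {"id": 2, "submit_time": time.time() - 600,  "nodes": 4,  "requested_s": 1800},
]
print([j["id"] for j in order_queue(queue)])
```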
Conference Paper
The evaluation of parallel job schedulers hinges on the workloads used. It is suggested that this be standardized, in terms of both format and content, so as to ease the evaluation and comparison of different systems. The question remains whether this can encompass both traditional parallel systems and metacomputing systems. This paper is based on a panel on this subject that was held at the workshop, and the ensuing discussion; its authors are both the panel members and participants from the audience. Naturally, not all of us agree with all the opinions expressed here...
Article
The study of parallel and distributed applications and platforms, whether in the cluster, grid, peer-to-peer, volunteer, or cloud computing domain, often mandates empirical evaluation of proposed algorithmic and system solutions via simulation. Unlike direct experimentation via an application deployment on a real-world testbed, simulation enables fully repeatable and configurable experiments for arbitrary hypothetical scenarios. Two key concerns are accuracy (so that simulation results are scientifically sound) and scalability (so that simulation experiments can be fast and memory-efficient). While the scalability of a simulator is easily measured, the accuracy of many state-of-the-art simulators is largely unknown because they have not been sufficiently validated. In this work we describe recent accuracy and scalability advances made in the context of the SimGrid simulation framework. A design goal of SimGrid is that it should be versatile, i.e., applicable across all aforementioned domains. We present quantitative results that show that SimGrid compares favorably to state-of-the-art domain-specific simulators in terms of scalability, accuracy, or the trade-off between the two. An important implication is that, contrary to popular wisdom, striving for versatility in a simulator is not an impediment but instead is conducive to improving both accuracy and scalability.
Article
This work describes Alea 2, a Grid and cluster scheduling simulator designed for the study, testing, and evaluation of various job scheduling techniques. This event-based simulator is able to deal with common problems related to job scheduling, such as the heterogeneity of jobs and resources, and dynamic runtime changes such as the arrival of new jobs or resource failures and restarts. Alea 2 is based on the popular GridSim toolkit [31] and represents a major extension of the Alea simulator developed in 2007 [16]. The extension covers improved design and extended functionality, as well as improved scalability and higher simulation speed. Finally, a new visualization interface was introduced into the simulator. The main part of the simulator is a complex scheduler which incorporates several common scheduling algorithms working on either a queue-based or a schedule (plan)-based principle. Additional data structures are used to maintain information about resource status and objective functions, and for the collection and visualization of the simulation results. Many typical objectives, such as machine usage, average slowdown, or average response time, are included. The paper concludes with an example of Alea 2 execution using a real-life workload, also discussing the scalability of the simulator.
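The objectives listed above (machine usage, average slowdown, average response time) have standard definitions in job scheduling; the snippet below computes them for a toy two-job trace as a generic illustration, not Alea 2 code.

```python
# Standard scheduling metrics over a simple job record list; values are toy data.
jobs = [
    # submit, start, end (seconds since workload start), cores used
    {"submit": 0,  "start": 10,  "end": 110, "cores": 16},
    {"submit": 20, "start": 120, "end": 180, "cores": 32},
]
total_cores = 64
makespan = max(j["end"] for j in jobs) - min(j["submit"] for j in jobs)

response = [j["end"] - j["submit"] for j in jobs]            # wait + run
slowdown = [(j["end"] - j["submit"]) / (j["end"] - j["start"]) for j in jobs]
core_seconds = sum((j["end"] - j["start"]) * j["cores"] for j in jobs)
utilization = core_seconds / (total_cores * makespan)        # machine usage

print(f"avg response time: {sum(response) / len(response):.1f} s")
print(f"avg slowdown:      {sum(slowdown) / len(slowdown):.2f}")
print(f"utilization:       {utilization:.2%}")
```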
Conference Paper
In this paper, we describe Simbatch, an API which offers core functionalities to realistically simulate parallel resources and batch reservation systems. The objective is twofold: to provide a tool that efficiently predicts parallel resource usage based on simulation, and to realistically study Grid scheduling heuristics that may be embedded in a Grid middleware or in a tool that deploys it. Indeed, such predictions can be used in a Grid middleware both for scheduling purposes and to dynamically tune moldable applications as a function of the load of the chosen parallel resource, on behalf of the Grid user. Simbatch simulation experiments show an average error rate under 2% compared to real-life experiments conducted with the OAR batch manager.