Availability
EAR is available on all partitions/nodes.
Running Jobs with EAR
MPI applications
The EAR Library is automatically loaded with MPI applications when EAR is enabled. EAR supports both the mpirun/mpiexec and srun commands.
To enable EAR when launching an MPI application, include the SBATCH options shown below in your job script.
srun
srun is the preferred job launcher when using EAR, as EARL is loaded through a SLURM plugin. You will collect the largest amount of energy metrics when using srun.
Running MPI applications with EARL is automatic on SLURM systems when using srun: all jobs are monitored by EAR and the Library is loaded by default when EAR is enabled in the job script. To run a job with srun and EARL there is no need to load the EAR module. Both Intel MPI and OpenMPI are supported when submitting jobs with sbatch, srun, or salloc.
#!/bin/bash
#SBATCH -p thin
#SBATCH -t 00:30:00
#SBATCH --ntasks=128
#SBATCH --ear=on
#SBATCH --ear-policy=monitoring
#SBATCH --ear-verbose=1

module load 2022
module load foss/2022a

srun myapplication
SLURM job options can also be passed directly to srun when launching a job from the command line.
srun --ear=on --ear-policy=monitoring --ear-verbose=1 -J my_ear_job -N 1 --ntasks=32 myapplication
EARL verbose messages are written to standard error. For jobs using more than 2 or 3 nodes, messages may overwrite each other. If you want EARL messages stored in files instead, set the SLURM_EARL_VERBOSE_PATH environment variable to a folder name; one file per node will be generated with EARL messages.
export SLURM_EARL_VERBOSE_PATH=/home/benjamic/logs
srun --ear=on --ear-policy=monitoring --ear-verbose=1 -J my_ear_job -N 1 --ntasks=32 myapplication
The following option asks for EAR Library metrics to be stored in a CSV file after the application execution. Two files per node will be generated: one with the average/global signature and another with loop signatures. The output files are named <filename>.<nodename>.time.csv for the global signature and <filename>.<nodename>.time.loops.csv for the loop signatures.
srun --ear-user-db=filename -J my_ear_job -N 1 --ntasks=32 myapplication
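As a sketch of post-processing these per-node files, the loop-signature CSVs can be merged into a single file that keeps only one header row. The file names, separator, and column names below are fabricated stand-ins following the <filename>.<nodename>.time.loops.csv pattern described above; real EAR output will contain different columns.

```shell
# Create two stand-in per-node loop files (placeholders for real EAR output).
printf 'JOBID;NODE;CPI\n1;node1;0.52\n' > mydata.node1.time.loops.csv
printf 'JOBID;NODE;CPI\n1;node2;0.61\n' > mydata.node2.time.loops.csv

# Merge all per-node loop files, keeping the CSV header only once.
out=all_loops.csv
first=1
for f in mydata.*.time.loops.csv; do
  if [ "$first" -eq 1 ]; then
    cat "$f" > "$out"
    first=0
  else
    tail -n +2 "$f" >> "$out"   # skip the repeated header line
  fi
done
```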
mpirun
Intel MPI
Recent versions of Intel MPI offer two environment variables that can be used to guarantee correct scheduler integration:
I_MPI_HYDRA_BOOTSTRAP sets the bootstrap server. It must be set to slurm.
I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS sets additional arguments for the bootstrap server. These arguments are passed to slurm.
You can read more about these environment variables in the Intel MPI documentation.
export I_MPI_HYDRA_BOOTSTRAP=slurm
export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ear-policy=monitoring --ear-verbose=1"
mpiexec.hydra -n 10 application
OpenMPI
For OpenMPI and EAR it is highly recommended to use SLURM. When using mpirun, as OpenMPI is not fully coordinated with the scheduler, the EAR Library is not automatically loaded on all the nodes. If mpirun is used, EARL will be disabled and only basic energy metrics will be reported. Bootstrap is an Intel® MPI option but not an OpenMPI option, so for automatic EAR support with OpenMPI srun must be used. In case OpenMPI with mpirun is needed, EAR offers the erun command explained below.
erun
erun is a program that simulates the complete SLURM and EAR SLURM Plugin pipeline. You can launch erun with the --program option to specify the application name and arguments.
mpirun -n 4 /path/to/erun --program="hostname --alias"
In this example, mpirun would launch 4 erun processes. Each erun process would then launch the application hostname with its --alias parameter. You can use as many parameters as you want, but the quotes have to enclose all of them whenever there is more than just the program name. erun simulates on the remote node both the local and remote pipelines for all created processes. It has an internal system to avoid repeating functions that are executed only once per job or node, as SLURM does with its plugins.
> erun --help
This is the list of ERUN parameters:
Usage: ./erun [OPTIONS]

Options:
    --job-id=<arg>    Set the JOB_ID.
    --nodes=<arg>     Sets the number of nodes.
    --program=<arg>   Sets the program to run.
    --clean           Removes the internal files.
SLURM options:
...
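Building on the options listed above, a hypothetical fuller invocation might look as follows; the job ID, node count, and application name are placeholders, not output from a real system:

```shell
# Each erun process simulates the SLURM/EAR plugin pipeline before
# starting the real application (myapplication is a placeholder name).
mpirun -n 8 /path/to/erun --job-id=123456 --nodes=2 --program="myapplication arg1 arg2"

# Remove erun's internal files once the run has finished.
/path/to/erun --clean
```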
MPI4PY
When using MPI with Python applications, the EAR Loader cannot automatically detect the symbols needed to classify the application as Intel MPI or OpenMPI. To specify the MPI flavour, define the SLURM_LOAD_MPI_VERSION environment variable with the value intel or open mpi.
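A minimal job script might look as follows; the partition, module versions, and Python script name are assumptions to adjust to your environment:

```shell
#!/bin/bash
#SBATCH -p thin
#SBATCH -t 00:30:00
#SBATCH --ntasks=32
#SBATCH --ear=on
#SBATCH --ear-policy=monitoring

# The EAR Loader cannot detect the MPI flavour from a Python binary,
# so state it explicitly: "intel" or "open mpi".
export SLURM_LOAD_MPI_VERSION="open mpi"

module load 2022
module load foss/2022a   # assumed toolchain providing OpenMPI and mpi4py

srun python my_mpi4py_app.py   # placeholder script name
```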
Job accounting (eacct)
The eacct command shows accounting information stored in the EAR DB for job (and step) IDs. The command uses EAR's configuration file to determine whether the user running it is privileged, as non-privileged users can only access their own information. The ear module needs to be loaded to use the eacct command. It provides the following options.
Usage: eacct [Optional parameters]
Optional parameters:
  -h  displays this message
  -v  displays current EAR version
  -b  verbose mode for debugging purposes
  -u  specifies the user whose applications will be retrieved. Only available to privileged users. [default: all users]
  -j  specifies the job id and step id to retrieve with the format [jobid.stepid] or the format [jobid1,jobid2,...,jobid_n]. A user can only retrieve its own jobs unless said user is privileged. [default: all jobs]
  -a  specifies the application names that will be retrieved. [default: all app_ids]
  -c  specifies the file where the output will be stored in CSV format. [default: no file]
  -t  specifies the energy_tag of the jobs that will be retrieved. [default: all tags]
  -l  shows the information for each node for each job instead of the global statistics for said job.
  -x  shows the last EAR events. Nodes, job ids, and step ids can be specified as if showing job information.
  -m  prints power signatures regardless of whether mpi signatures are available or not.
  -r  shows the EAR loop signatures. Nodes, job ids, and step ids can be specified as if showing job information.
  -n  specifies the number of jobs to be shown, starting from the most recent one. [default: 20] [to get all jobs use -n all]
  -f  specifies the file where the user-database can be found. If this option is used, the information will be read from the file and not the database.
eacct example usage
The basic usage of eacct retrieves the last 20 applications (by default) of the user executing it. The default behaviour shows data from each job-step, aggregating the values from each node in that job-step. If SLURM is used as the job manager, an sbatch job-step is created with the data from the entire execution. A specific job may be specified with the -j option:
eacct
eacct -j 123456789
eacct -j 123456789.0
eacct -j 175966,175967,175968
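A few more invocations combining -j with the flags documented above (the job and step IDs are placeholders):

```shell
eacct -j 123456789.0 -l           # per-node metrics instead of job-level aggregates
eacct -j 123456789 -c mydata.csv  # store the accounting data in a CSV file
eacct -j 123456789.0 -r           # show the EAR loop signatures for this step
```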
Column values reported by eacct
eacct shows a pre-selected set of columns. Some flags slightly modify the set of columns reported.
Column Name | Value |
---|---|
JOB-STEP | JobID and Step ID. sb is shown for the sbatch. |
USER | Username who executed the job. |
APP=APPLICATION | Job’s name or executable name if job name is not provided. |
POLICY | Energy optimization policy name (MO = Monitoring). |
NODES | Number of nodes which ran the job. |
AVG/DEF/IMC(GHz) | Average CPU frequency, default frequency and average uncore frequency. Includes all the nodes for the step. In KHz. |
TIME(s) | Step execution time, in seconds. |
POWER | Average node power including all the nodes, in Watts. |
GBS | CPU Main memory bandwidth (GB/second). Hint for CPU/Memory bound classification. |
CPI | CPU Cycles per Instruction. Hint for CPU/Memory bound classification. |
ENERGY(J) | Accumulated node energy. Includes all the nodes. In Joules. |
GFLOPS/WATT | CPU GFlops per Watt. Hint for energy efficiency. |
IO(MBs) | IO (read and write) Mega Bytes per second. |
MPI% | Percentage of MPI time over the total execution time. It’s the average including all the processes and nodes. |
G-POW (T/U) | Average GPU power. Accumulated per node and averaged over all the nodes. |
G-FREQ | Average GPU frequency. Per node and average of all the nodes. |
G-UTIL(G/MEM) | GPU utilization and GPU memory utilization. |