Availability
EAR is available on all partitions/nodes.
Running Jobs with EAR
MPI applications
The EAR Library is automatically loaded with MPI applications when EAR is enabled. EAR supports both the mpirun/mpiexec and srun commands.
To enable EAR when launching an MPI application, include the SBATCH options shown below in your job script.
srun
srun is the preferred job launcher when using EAR, as EARL is loaded through a SLURM plugin. You will collect the largest amount of energy metrics when using srun.
Running MPI applications with EARL is automatic on SLURM systems when using srun: all jobs are monitored by EAR and the Library is loaded by default when EAR is enabled in the job script. To run a job with srun and EARL there is no need to load the EAR module. Both Intel MPI and OpenMPI are supported when submitting jobs with sbatch, srun, or salloc.
#!/bin/bash
#SBATCH -p thin
#SBATCH -t 00:30:00
#SBATCH --ntasks=128
#SBATCH --ear=on
#SBATCH --ear-policy=monitoring
#SBATCH --ear-verbose=1

module load 2022
module load foss/2022a

srun myapplication
SLURM job options can also be passed directly to srun when launching a job from the command line.
srun --ear=on --ear-policy=monitoring --ear-verbose=1 -J my_ear_job -N 1 --ntasks=32 myapplication
EARL verbose messages are written to standard error. For jobs using more than 2 or 3 nodes, messages may overwrite each other. If you want EARL messages stored in files instead, set the SLURM_EARL_VERBOSE_PATH environment variable to a folder name; one file per node will be generated with EARL messages.
export SLURM_EARL_VERBOSE_PATH=/home/benjamic/logs
srun --ear=on --ear-policy=monitoring --ear-verbose=1 -J my_ear_job -N 1 --ntasks=32 myapplication
The following option asks for EAR Library metrics to be stored in a CSV file after the application execution. Two files per node will be generated: one with the average/global signature and another with loop signatures. The output files are named <filename>.<nodename>.time.csv for the global signature and <filename>.<nodename>.time.loops.csv for the loop signatures.
srun --ear-user-db=filename -J my_ear_job -N 1 --ntasks=32 myapplication
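As a sketch of post-processing these per-node files, the loop-signature CSVs can be merged into a single file that keeps only one header row. The file names, separator, and column names below are fabricated stand-ins following the <filename>.<nodename>.time.loops.csv pattern described above; real EAR output will contain different columns.

```shell
# Create two stand-in per-node loop files (placeholders for real EAR output).
printf 'JOBID;NODE;CPI\n1;node1;0.52\n' > mydata.node1.time.loops.csv
printf 'JOBID;NODE;CPI\n1;node2;0.61\n' > mydata.node2.time.loops.csv

# Merge all per-node loop files, keeping the CSV header only once.
out=all_loops.csv
first=1
for f in mydata.*.time.loops.csv; do
  if [ "$first" -eq 1 ]; then
    cat "$f" > "$out"
    first=0
  else
    tail -n +2 "$f" >> "$out"   # skip the repeated header line
  fi
done
```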
mpirun
Intel MPI
Recent versions of Intel MPI offer two environment variables that can be used to guarantee correct scheduler integration:
I_MPI_HYDRA_BOOTSTRAP sets the bootstrap server. It must be set to slurm.
I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS sets additional arguments for the bootstrap server. These arguments are passed to slurm.
You can read more about these environment variables in the Intel MPI documentation.
export I_MPI_HYDRA_BOOTSTRAP=slurm
export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ear-policy=monitoring --ear-verbose=1"
mpiexec.hydra -n 10 application
OpenMPI
For OpenMPI and EAR it is highly recommended to use SLURM. When using mpirun, as OpenMPI is not fully coordinated with the scheduler, the EAR Library is not automatically loaded on all the nodes. If mpirun is used, EARL will be disabled and only basic energy metrics will be reported. Bootstrap is an Intel® MPI option but not an OpenMPI option, so for automatic EAR support with OpenMPI srun must be used. In case OpenMPI with mpirun is needed, EAR offers the erun command explained below.
erun
erun is a program that simulates the complete SLURM and EAR SLURM Plugin pipeline. You can launch erun with the --program option to specify the application name and arguments.
mpirun -n 4 /path/to/erun --program="hostname --alias"
In this example, mpirun would launch 4 erun processes. Each erun process would then launch the application hostname with its --alias parameter. You can use as many parameters as you want, but the quotes have to enclose all of them whenever there is more than just the program name. erun simulates on the remote node both the local and remote pipelines for all created processes. It has an internal system to avoid repeating functions that are executed only once per job or node, as SLURM does with its plugins.
> erun --help
This is the list of ERUN parameters:
Usage: ./erun [OPTIONS]

Options:
    --job-id=<arg>    Set the JOB_ID.
    --nodes=<arg>     Sets the number of nodes.
    --program=<arg>   Sets the program to run.
    --clean           Removes the internal files.
SLURM options:
...
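Building on the options listed above, a hypothetical fuller invocation might look as follows; the job ID, node count, and application name are placeholders, not output from a real system:

```shell
# Each erun process simulates the SLURM/EAR plugin pipeline before
# starting the real application (myapplication is a placeholder name).
mpirun -n 8 /path/to/erun --job-id=123456 --nodes=2 --program="myapplication arg1 arg2"

# Remove erun's internal files once the run has finished.
/path/to/erun --clean
```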
MPI4PY
When using MPI with Python applications, the EAR Loader cannot automatically detect the symbols needed to classify the application as Intel MPI or OpenMPI. To specify the MPI flavour, define the SLURM_LOAD_MPI_VERSION environment variable with the value intel or open mpi.
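A minimal job script might look as follows; the partition, module versions, and Python script name are assumptions to adjust to your environment:

```shell
#!/bin/bash
#SBATCH -p thin
#SBATCH -t 00:30:00
#SBATCH --ntasks=32
#SBATCH --ear=on
#SBATCH --ear-policy=monitoring

# The EAR Loader cannot detect the MPI flavour from a Python binary,
# so state it explicitly: "intel" or "open mpi".
export SLURM_LOAD_MPI_VERSION="open mpi"

module load 2022
module load foss/2022a   # assumed toolchain providing OpenMPI and mpi4py

srun python my_mpi4py_app.py   # placeholder script name
```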
Job accounting (eacct)
The eacct command shows accounting information stored in the EAR DB for job (and step) IDs. The command uses EAR's configuration file to determine whether the user running it is privileged, as non-privileged users can only access their own information. The ear module needs to be loaded to use the eacct command. It provides the following options.
Usage: eacct [Optional parameters]
Optional parameters:
  -h  displays this message
  -v  displays current EAR version
  -b  verbose mode for debugging purposes
  -u  specifies the user whose applications will be retrieved. Only available to privileged users. [default: all users]
  -j  specifies the job id and step id to retrieve with the format [jobid.stepid] or the format [jobid1,jobid2,...,jobid_n]. A user can only retrieve its own jobs unless said user is privileged. [default: all jobs]
  -a  specifies the application names that will be retrieved. [default: all app_ids]
  -c  specifies the file where the output will be stored in CSV format. [default: no file]
  -t  specifies the energy_tag of the jobs that will be retrieved. [default: all tags]
  -l  shows the information for each node for each job instead of the global statistics for said job.
  -x  shows the last EAR events. Nodes, job ids, and step ids can be specified as if showing job information.
  -m  prints power signatures regardless of whether mpi signatures are available or not.
  -r  shows the EAR loop signatures. Nodes, job ids, and step ids can be specified as if showing job information.
  -n  specifies the number of jobs to be shown, starting from the most recent one. [default: 20] [to get all jobs use -n all]
  -f  specifies the file where the user-database can be found. If this option is used, the information will be read from the file and not the database.
eacct example usage
The basic usage of eacct retrieves the last 20 applications (by default) of the user executing it. The default behaviour shows data from each job-step, aggregating the values from each node in that job-step. If SLURM is used as the job manager, an sbatch job-step is created with the data from the entire execution. A specific job may be specified with the -j option:
eacct
eacct -j 123456789
eacct -j 123456789.0
eacct -j 175966,175967,175968
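A few more invocations combining -j with the flags documented above (the job and step IDs are placeholders):

```shell
eacct -j 123456789.0 -l           # per-node metrics instead of job-level aggregates
eacct -j 123456789 -c mydata.csv  # store the accounting data in a CSV file
eacct -j 123456789.0 -r           # show the EAR loop signatures for this step
```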
Column values reported by eacct
eacct shows a pre-selected set of columns. Some flags slightly modify the set of columns reported.
Column Name | Value |
---|---|
JOB-STEP | JobID and Step ID. sb is shown for the sbatch. |
USER | Username who executed the job. |
APP=APPLICATION | Job’s name or executable name if job name is not provided. |
POLICY | Energy optimization policy name (MO = Monitoring). |
NODES | Number of nodes which ran the job. |
AVG/DEF/IMC(GHz) | Average CPU frequency, default frequency and average uncore frequency. Includes all the nodes for the step. In KHz. |
TIME(s) | Step execution time, in seconds. |
POWER | Average node power including all the nodes, in Watts. |
GBS | CPU Main memory bandwidth (GB/second). Hint for CPU/Memory bound classification. |
CPI | CPU Cycles per Instruction. Hint for CPU/Memory bound classification. |
ENERGY(J) | Accumulated node energy. Includes all the nodes. In Joules. |
GFLOPS/WATT | CPU GFlops per Watt. Hint for energy efficiency. |
IO(MBs) | IO (read and write) Mega Bytes per second. |
MPI% | Percentage of MPI time over the total execution time. It’s the average including all the processes and nodes. |
G-POW (T/U) | Average GPU power. Accumulated per node and averaged over all the nodes. |
G-FREQ | Average GPU frequency. Per node and average of all the nodes. |
G-UTIL(G/MEM) | GPU utilization and GPU memory utilization. |