AMD uProf (“MICRO-prof”) is a software profiling analysis tool for x86 applications running on Windows, Linux and FreeBSD operating systems and provides event information unique to the AMD “Zen”-based processors and AMD INSTINCT™ MI Series accelerators. AMD uProf enables the developer to better understand the limiters of application performance and evaluate improvements.

Read more about AMD uProf at https://www.amd.com/en/developer/uprof.html


This documentation is based on AMDuProfCLI version 4.1.424.0.

!!!!ATTENTION!!!!

If you do not see a particular hardware event/counter, you probably need special privileges to access it. On Snellius, all you need to do is submit a job with the following constraints:

#SBATCH --constraint=hwperf
#SBATCH --exclusive
  • It must be an exclusive job
  • You will get an allocation with perf_event_paranoid set to 0
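
To confirm from inside the job that the allocation really has the relaxed setting, you can read the kernel parameter directly. The following is a minimal sketch, not an official Snellius script: the helper name paranoid_allows_profiling is made up for illustration, and the threshold (0 or lower) is an assumption based on the value stated above.

```shell
#!/bin/bash
#SBATCH --constraint=hwperf
#SBATCH --exclusive

# Hypothetical helper: succeeds when the given perf_event_paranoid
# level is 0 or lower (assumed to be what the hwperf nodes provide).
paranoid_allows_profiling() {
    [ "$1" -le 0 ]
}

# Standard Linux location of the setting; skipped if it is not readable.
level_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$level_file" ]; then
    level=$(cat "$level_file")
    if paranoid_allows_profiling "$level"; then
        echo "perf_event_paranoid=$level: restricted events/counters should be accessible"
    else
        echo "perf_event_paranoid=$level: resubmit with --constraint=hwperf and --exclusive"
    fi
fi
```

Placing this check at the top of the job script, before the profiling command, makes a missing constraint visible right away in the job output.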

Introduction

AMD uProf is available as a module on Snellius:

module load 2022 
module load AMD-uProf/4.1.424

or

module load 2023
module load AMD-uProf/4.1.424

Example application to play with

All of the examples in this documentation run the matrix multiplication example provided by AMD, "AMDTClassicMatMul.cpp".

To compile this application, use the commands below (PLEASE NOTE: the file path and software versions can change!):

module purge
module load 2022
module load AMD-uProf/4.1.424

# to compile with gcc

module load foss/2022a

g++ /sw/arch/RHEL8/EB_production/2022/software/AMD-uProf/4.1.424/Examples/AMDTClassicMatMul/AMDTClassicMatMul.cpp -o AMDTClassicMatMul 

Basic/Starting commands

Overview

AMDuProfCLI is a command-line tool for AMD uProf Profiler.

Usage: AMDuProfCLI [--version] [--help] COMMAND [<Options>] <PROGRAM> [<ARGS>]

Following are the supported COMMANDs:
  collect       Run the given program and collects the profile samples.
  timechart     Collects the system characteristics like power, thermal and frequency.
  report        Process the profile-data file and generates the profile report.
  translate     Process the raw profile-data files and save those into database files.
  profile       Collects the performance profile data, analysis it and generates the profile report
  info          Displays generic information about system, CPU etc.
  compare,diff  Process multiple profile-data and generates their comparison report.

PROGRAM
  The launch application to be profiled.

ARGS
  The list of arguments for the launch application.

Run 'AMDuProfCLI COMMAND -h' for more information on a specific command.



Get the CPU Topology of AMD Processors.

AMDCpuTopology


Display information about the uProf installation and the processor it is running on

AMDuProfCLI info --system


List the available predefined events that can be used with 'collect --event' option

AMDuProfCLI info --list predefined-events


List the predefined profile configurations that can be used with 'collect --config' option.

AMDuProfCLI info --list collect-configs


List the "system events" available from timechart

AMDuProfCLI timechart --list

Profile an application

The profile command collects the performance profile data for your application, analyzes it, and generates the profile report.

Simplest Example

Let's walk through a simple example of profiling an application to identify the functions where the program is spending most of its time.

We will go from start to finish through profiling the AMDTClassicMatMul.cpp application.

  1. First, load the AMD-uProf module

    module purge
    module load 2022
    module load AMD-uProf/4.1.424
  2. Compile AMDTClassicMatMul.cpp (in this example we will use gcc)

    module load foss/2022a
    
    g++ /sw/arch/RHEL8/EB_production/2022/software/AMD-uProf/4.1.424/Examples/AMDTClassicMatMul/AMDTClassicMatMul.cpp -o AMDTClassicMatMul
  3. To identify where AMDTClassicMatMul is spending most of its time, we will use the Time-based Sampling configuration of AMD-uProf, which is selected with the --config tbp option. The full command is:

     AMDuProfCLI profile --config tbp -o AMD_profile_output --affinity 1 ./AMDTClassicMatMul  

    NOTE!! The --affinity 1 option sets the affinity of the program to core 1.

  4. This creates the profile samples and the report in the output directory that we specified (-o AMD_profile_output). AMD-uProf prints where it is generating the files (among a bunch of other information). In our example the report was generated here:

    Generated report file: /home/user/AMD_profile_output/AMDuProf-AMDTClassicMatMul-TBP_Aug-11-2023_10-45-11/report.csv

    We can simply read it with cat, more, or however you prefer to look at text files.

    cat /home/user/AMD_profile_output/AMDuProf-AMDTClassicMatMul-TBP_Aug-11-2023_10-45-11/report.csv

    The report lists the "10" hottest functions, the event we profiled for (in this case CPU_TIME) and the modules they come from:

    "10 HOTTEST FUNCTIONS (Sort Event - CPU_TIME)"
    FUNCTION,"CPU_TIME",Module
    "classic_multiply_matrices()",1607.0000,"/gpfs/home3/user/AMDTClassicMatMul"
    "random_r",3.0000,"/usr/lib64/libc-2.28.so"
    "initialize_matrices()",2.0000,"/gpfs/home3/user/AMDTClassicMatMul"
    "rand",1.0000,"/usr/lib64/libc-2.28.so"

    This is a simple example, so we only see 4 functions, because the program only uses 4 functions. In a "real world" application you should see many more. In our case, to no surprise, the classic_multiply_matrices() function is the most costly.
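
To put the CPU_TIME numbers from the report in proportion, the hottest-functions section can be turned into percentages with a few lines of awk. This is only a sketch: the helper name hottest_share is made up for illustration, and it assumes the section has exactly the layout shown in the excerpt above (a title line containing "HOTTEST FUNCTIONS", a header line starting with FUNCTION, quoted function names, and a blank line or end of file closing the section). The sample file below just reproduces that excerpt.

```shell
# Hypothetical helper: print each hottest function with its share of the
# summed CPU_TIME samples. Splitting on the double quote keeps commas
# inside function signatures intact.
hottest_share() {
    awk -F'"' '
        /HOTTEST FUNCTIONS/              { in_section = 1; next }  # section title
        in_section && /^[ \t]*FUNCTION,/ { next }                  # column header
        in_section && NF < 4             { in_section = 0 }        # section ends
        in_section {
            gsub(/,/, "", $3)            # ",1607.0000," becomes "1607.0000"
            name[++n] = $2; value[n] = $3; total += $3
        }
        END {
            if (total > 0)
                for (i = 1; i <= n; i++)
                    printf "%s %.1f%%\n", name[i], 100 * value[i] / total
        }
    ' "$1"
}

# Sample input copied from the report excerpt above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"10 HOTTEST FUNCTIONS (Sort Event - CPU_TIME)"
FUNCTION,"CPU_TIME",Module
"classic_multiply_matrices()",1607.0000,"/gpfs/home3/user/AMDTClassicMatMul"
"random_r",3.0000,"/usr/lib64/libc-2.28.so"
"initialize_matrices()",2.0000,"/gpfs/home3/user/AMDTClassicMatMul"
"rand",1.0000,"/usr/lib64/libc-2.28.so"
EOF

hottest_share "$sample"
# classic_multiply_matrices() 99.6%
# random_r 0.2%
# initialize_matrices() 0.1%
# rand 0.1%
rm -f "$sample"
```

For a real run, pass the path printed after "Generated report file:" instead of the sample file.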


You can always use the GUI visualisation tool of AMD-uProf to investigate the profile data more thoroughly. We will not document how to do this here, but you can read how to use the GUI on AMD's website https://www.amd.com/en/developer/uprof.html

This simple example should highlight the basic profiling of an application using AMD-uProf. Of course there is much more to the tool, and you will often find yourself in a situation where you want to profile your application for "deeper" information. We highlight some of the further functionality available to you in the sections below.


Further Details

Time-based Sampling

 Use this configuration to identify where programs are spending time

 AMDuProfCLI profile --config tbp -o AMD_profile_output --affinity 1 ./AMDTClassicMatMul 

Assess Performance

Use this configuration to get an overall assessment of performance and to find potential issues for further investigation.

 AMDuProfCLI profile --config assess -o AMD_profile_output --affinity 1  ./AMDTClassicMatMul 

If you need more detail, use the extended variant of this configuration:

 AMDuProfCLI profile --config assess_ext -o AMD_profile_output --affinity 1  ./AMDTClassicMatMul 


Cache/Memory Analysis

Use this configuration to collect memory accesses for detecting false cache sharing.

 AMDuProfCLI profile --config memory -o AMD_profile_output --affinity 1  ./AMDTClassicMatMul 


Investigate Instruction Access

Use this configuration to find instruction fetches with poor L1 instruction cache locality and poor ITLB (Instruction Translation Lookaside Buffers) behavior.

 AMDuProfCLI profile --config inst_access -o AMD_profile_output --affinity 1  ./AMDTClassicMatMul 


Investigate Data Access

Use this configuration to find data access operations with poor L1 data cache locality and poor DTLB (Data Translation Lookaside Buffers) behavior.

 AMDuProfCLI profile --config data_access -o AMD_profile_output --affinity 1  ./AMDTClassicMatMul 
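
The configurations above can also be run back to back from one script. The sketch below only uses the flags already shown in this section. The helper name run_configs, the DRY_RUN switch and the one-subdirectory-per-configuration layout under AMD_profile_output are assumptions made for illustration, not something AMD-uProf requires. By default it only prints the commands, so you can review them first.

```shell
# Hypothetical helper: print (default) or run one AMDuProfCLI profile command
# per configuration name. Usage: run_configs <program> <config>...
# Assumes the program path contains no spaces.
run_configs() {
    program=$1
    shift
    for config in "$@"; do
        cmd="AMDuProfCLI profile --config $config -o AMD_profile_output/$config --affinity 1 $program"
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$cmd"      # dry run: show the command only
        else
            $cmd             # real run: execute it
        fi
    done
}

run_configs ./AMDTClassicMatMul tbp assess data_access
```

Set DRY_RUN=0 in the environment to actually execute the commands once the printed lines look right.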


Profile an application (System characteristics)

Profile Frequency, Power, Temperature etc.

Profile the power usage of specific cores. Note that it is helpful to set the affinity of your application to the core you are profiling.

AMDuProfCLI timechart --event core=0-3,power -o AMD_profile_output --interval 10 --affinity 1 ./AMDTClassicMatMul 

In this example, we run a serial program that is bound to core 1 and we profile the cores around it (0, 1, 2 and 3). This shows us the energy usage of the core the application is running on, as well as that of the neighbouring cores, which are idle.



Compare two or more application profiles

Let's say, for example, that you are trying to understand the difference between two executables that you have profiled using the techniques above. You can compare the profile data using the command

AMDuProfCLI compare --baseline /tmp/cpuprof-tbp/<BASE-SESSION-DIR> --with /tmp/cpuprof-tbp/<SUCCESSOR-SESSION-DIR> -o /tmp/cpuprof-tbp/

This will generate an easily readable .md (markdown) file in the directory that you supply via the -o flag. The comparison markdown file will highlight the profile data of the two profiles, and the differences between them.
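
Session directory names contain a timestamp (as in the AMDuProf-AMDTClassicMatMul-TBP_Aug-11-2023_10-45-11 example earlier), so typing them by hand is error-prone. The sketch below picks the most recently modified session under an output directory. The helper name latest_session and the directory names out_before, out_after and comparison are hypothetical, and the AMDuProf- prefix is assumed from that one example.

```shell
# Hypothetical helper: print the most recently modified session directory
# (named AMDuProf-...) under the output directory given as $1.
# Assumes directory names without spaces or newlines.
latest_session() {
    newest=$(ls -td "$1"/AMDuProf-*/ 2>/dev/null | head -n 1)
    printf '%s\n' "${newest%/}"    # drop the trailing slash added by the glob
}

# Example use, assuming each executable was profiled into its own output directory:
# AMDuProfCLI compare --baseline "$(latest_session out_before)" \
#     --with "$(latest_session out_after)" -o comparison
```

If no session directory exists, the helper prints an empty line, so check the result before passing it to compare.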

Quick Roofline performance model


To collect the data required for generating a roofline model, you can use the AMDuProfPcm tool:

AMDuProfPcm roofline -X -o ~/classic_roofline.csv -- ./AMDTClassicMatMul

You can plot the result (to a PDF) using the Python plotting script:

 AMDuProfModelling.py -i ~/classic_roofline.csv -o ~/

You will likely need Python for this. A quick `module avail Python` will show you which Python modules are available on Snellius.
