Introduction
AMD uProf is available as a module on Snellius
module load 2022
module load AMD-uProf/4.1.424
or
module load 2023
module load AMD-uProf/4.1.424
Example application to play with
All examples in this documentation use the matrix multiplication example that ships with AMD uProf, "AMDTClassicMatMul.cpp".
To compile this application, run the commands below. Please note that the file path and software versions can change.
module purge
module load 2022
module load AMD-uProf/4.1.424
# to compile with gcc
module load foss/2022a
g++ /sw/arch/RHEL8/EB_production/2022/software/AMD-uProf/4.1.424/Examples/AMDTClassicMatMul/AMDTClassicMatMul.cpp -o AMDTClassicMatMul
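Because the install path changes between software stacks, one option is to derive the example location from wherever the loaded module puts AMDuProfCLI on the PATH instead of hard-coding it. The following is only a sketch: uprof_example_src is a hypothetical helper name, and it assumes the binary lives in <prefix>/bin with the examples in <prefix>/Examples, which may not hold for every installation.

```shell
# Hypothetical helper: print the expected path of the MatMul example source,
# given the full path of the AMDuProfCLI binary.
# Assumption: the binary is in <prefix>/bin and the examples in <prefix>/Examples.
uprof_example_src() {
    cli_path="$1"
    prefix="$(dirname "$(dirname "$cli_path")")"
    printf '%s\n' "$prefix/Examples/AMDTClassicMatMul/AMDTClassicMatMul.cpp"
}

# Only attempt the compile when the module is loaded (AMDuProfCLI is on the PATH).
if command -v AMDuProfCLI >/dev/null 2>&1; then
    src="$(uprof_example_src "$(command -v AMDuProfCLI)")"
    if [ -f "$src" ]; then
        g++ "$src" -o AMDTClassicMatMul
    else
        echo "Example source not found at $src" >&2
    fi
fi
```

If the source is not found at the derived location, fall back to the explicit path for your software stack.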
Basic/Starting commands
Overview
AMDuProfCLI is a command-line tool for AMD uProf Profiler.

Usage: AMDuProfCLI [--version] [--help] COMMAND [<Options>] <PROGRAM> [<ARGS>]

Following are the supported COMMANDs:
  collect       Run the given program and collects the profile samples.
  timechart     Collects the system characteristics like power, thermal and frequency.
  report        Process the profile-data file and generates the profile report.
  translate     Process the raw profile-data files and save those into database files.
  profile       Collects the performance profile data, analysis it and generates the profile report
  info          Displays generic information about system, CPU etc.
  compare,diff  Process multiple profile-data and generates their comparison report.

  PROGRAM       The launch application to be profiled.
  ARGS          The list of arguments for the launch application.

Run 'AMDuProfCLI COMMAND -h' for more information on a specific command.
Get the CPU Topology of AMD Processors.
AMDCpuTopology
Display information about the uProf installation and the processor it is running on
AMDuProfCLI info --system
List the available predefined events that can be used with 'collect --event' option
AMDuProfCLI info --list predefined-events
List the predefined profile configurations that can be used with 'collect --config' option.
AMDuProfCLI info --list collect-configs
List the "system events" available from timechart
AMDuProfCLI timechart --list
Profile an application
The profile command collects the performance profile data for your application, analyzes it, and generates the profile report.
Simplest Example
Let's walk through a simple example of profiling an application to identify the functions where the program spends most of its time.
We will profile the AMDTClassicMatMul.cpp application from start to finish.
First, load the AMD-uProf module
module purge
module load 2022
module load AMD-uProf/4.1.424
Compile AMDTClassicMatMul.cpp (in this example we will use gcc). The version in the path must match the module you loaded.
module load foss/2022a
g++ /sw/arch/RHEL8/EB_production/2022/software/AMD-uProf/4.1.424/Examples/AMDTClassicMatMul/AMDTClassicMatMul.cpp -o AMDTClassicMatMul
To identify where AMDTClassicMatMul.cpp is spending most of its time, we will use the Time-based Sampling configuration of AMD-uProf. This is selected with the --config tbp option. The full command is:
AMDuProfCLI profile --config tbp -o AMD_profile_output --affinity 1 ./AMDTClassicMatMul
Note: the --affinity 1 option sets the affinity of the program to core 1.
This will create the profile samples and report in the output directory that we specified (-o AMD_profile_output). AMD-uProf will tell you where it is generating the files (among other information). In our example it generated the output here:
Generated report file: /home/user/AMD_profile_output/AMDuProf-AMDTClassicMatMul-TBP_Aug-11-2023_10-45-11/report.csv
We can read it with cat, more, or whichever tool you prefer for viewing text files.
cat /home/user/AMD_profile_output/AMDuProf-AMDTClassicMatMul-TBP_Aug-11-2023_10-45-11/report.csv
The report then lists the 10 hottest functions, the event we profiled (in this case CPU_TIME), and the modules they come from:
"10 HOTTEST FUNCTIONS (Sort Event - CPU_TIME)"
FUNCTION,"CPU_TIME",Module
"classic_multiply_matrices()",1607.0000,"/gpfs/home3/user/AMDTClassicMatMul"
"random_r",3.0000,"/usr/lib64/libc-2.28.so"
"initialize_matrices()",2.0000,"/gpfs/home3/user/AMDTClassicMatMul"
"rand",1.0000,"/usr/lib64/libc-2.28.so"
This is a simple example, so we only see 4 functions, because only 4 functions are used. In a "real world" application you should see many more. In our case, to no surprise, the classic_multiply_matrices() function is the most costly.
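To see at a glance how dominant each function is, the hottest-functions table can be converted to percentages of the listed CPU_TIME. The following is a minimal sketch and not part of AMD uProf: hottest_share is a hypothetical helper name, and the script assumes the section layout shown above (a title line containing "HOTTEST FUNCTIONS", one column header row, then data rows until a blank line). It splits on commas, so it would mis-parse function names that themselves contain commas.

```shell
# Hypothetical helper: print each hottest function with its share of the
# summed CPU_TIME values in that section of a report.csv file.
hottest_share() {
    awk -F, '
        /HOTTEST FUNCTIONS/   { in_section = 1; header = 1; next }  # section title
        in_section && header  { header = 0; next }                  # skip column header row
        in_section && NF < 3  { in_section = 0 }                    # blank/short line ends section
        in_section {
            gsub(/"/, "", $1)                                       # strip quotes from the name
            name[++n] = $1; value[n] = $2; total += $2
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%-32s %6.2f%%\n", name[i], (total > 0 ? 100 * value[i] / total : 0)
        }
    ' "$1"
}

# Demonstration on the sample rows from this page, written to a temporary file.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
"10 HOTTEST FUNCTIONS (Sort Event - CPU_TIME)"
FUNCTION,"CPU_TIME",Module
"classic_multiply_matrices()",1607.0000,"/gpfs/home3/user/AMDTClassicMatMul"
"random_r",3.0000,"/usr/lib64/libc-2.28.so"
"initialize_matrices()",2.0000,"/gpfs/home3/user/AMDTClassicMatMul"
"rand",1.0000,"/usr/lib64/libc-2.28.so"
EOF
hottest_share "$sample"
rm -f "$sample"
```

On a real report, pass the path of the generated report.csv as the only argument.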
You can always use the GUI visualisation tool of AMD-uProf to investigate the profile data more thoroughly. We will not document how to do this here, but you can read how to use the GUI on AMD's website https://www.amd.com/en/developer/uprof.html
This simple example should highlight the basic profiling of an application using AMD-uProf. Of course there is much more to the tool, and you will often find yourself in a situation when you want to profile your application for "deeper" information. We highlight some of the further functionality available to you in the sections below.
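Each profiling run creates a new timestamped session directory, so after a few runs it can be tedious to copy the report path by hand. The sketch below picks the most recently modified report.csv under an output directory. latest_report is a hypothetical helper name, and the approach assumes that file modification time reflects the order of your runs and that paths contain no newlines.

```shell
# Hypothetical helper: print the most recently modified report.csv found
# below the output directory given as the first argument.
latest_report() {
    outdir="$1"
    # ls -t sorts newest first; head keeps only the newest entry.
    find "$outdir" -type f -name report.csv -exec ls -t {} + 2>/dev/null | head -n 1
}

# Usage: show the newest report if the output directory from the example exists.
if [ -d AMD_profile_output ]; then
    report="$(latest_report AMD_profile_output)"
    if [ -n "$report" ]; then
        cat "$report"
    fi
fi
```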
Further Details
Time-based Sampling
Use this configuration to identify where programs are spending time
AMDuProfCLI profile --config tbp -o AMD_profile_output --affinity 1 ./AMDTClassicMatMul
Assess Performance
Use this configuration to get an overall assessment of performance and to find potential issues for further investigation.
AMDuProfCLI profile --config assess -o AMD_profile_output --affinity 1 ./AMDTClassicMatMul
If you need a more detailed assessment, use the extended configuration assess_ext:
AMDuProfCLI profile --config assess_ext -o AMD_profile_output --affinity 1 ./AMDTClassicMatMul
Cache/Memory Analysis
Use this configuration to collect memory access data for detecting false cache sharing.
AMDuProfCLI profile --config memory -o AMD_profile_output --affinity 1 ./AMDTClassicMatMul
Investigate Instruction Access
Use this configuration to find instruction fetches with poor L1 instruction cache locality and poor ITLB (Instruction Translation Lookaside Buffers) behavior.
AMDuProfCLI profile --config inst_access -o AMD_profile_output --affinity 1 ./AMDTClassicMatMul
Investigate Data Access
Use this configuration to find data access operations with poor L1 data cache locality and poor DTLB (Data Translation Lookaside Buffers) behavior.
AMDuProfCLI profile --config data_access -o AMD_profile_output --affinity 1 ./AMDTClassicMatMul
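The configurations in this section differ only in the value passed to --config, so they can be run one after another in a loop. The sketch below is one way to do that under stated assumptions: profile_cmd is a hypothetical helper name, and writing each configuration to its own subdirectory of AMD_profile_output is our choice for keeping sessions apart, not something the tool requires. When AMDuProfCLI or the compiled binary is not available, the loop only prints the commands.

```shell
# Hypothetical helper: print the profile command for one configuration,
# using a per-configuration output directory (an assumption of this sketch).
profile_cmd() {
    config="$1"
    printf 'AMDuProfCLI profile --config %s -o AMD_profile_output/%s --affinity 1 ./AMDTClassicMatMul\n' \
        "$config" "$config"
}

for config in tbp assess assess_ext memory inst_access data_access; do
    cmd="$(profile_cmd "$config")"
    if command -v AMDuProfCLI >/dev/null 2>&1 && [ -x ./AMDTClassicMatMul ]; then
        mkdir -p AMD_profile_output   # make sure the parent directory exists
        eval "$cmd"                   # run the profile for this configuration
    else
        echo "$cmd"                   # dry run: only show what would be executed
    fi
done
```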
Profile an application (System characteristics)
Profile Frequency, Power, Temperature etc.
Profile specific core/s power usage. Note it is helpful to set the affinity of your application to the core you are profiling.
AMDuProfCLI timechart --event core=0-3,power -o AMD_profile_output --interval 10 --affinity 1 ./AMDTClassicMatMul
In this example, we run a serial program that is bound to core 1 and we profile the cores around it (0, 1, 2, 3). This shows the power usage of the core running the application, alongside that of the neighbouring cores, which are idle.
Compare two or more application profiles
Let's say, for example, that you are trying to understand the difference between two executables that you have profiled using the techniques above. You can compare the profile data using the command
AMDuProfCLI compare --baseline /tmp/cpuprof-tbp/<BASE-SESSION-DIR> --with /tmp/cpuprof-tbp/<SUCCESSOR-SESSION-DIR> -o /tmp/cpuprof-tbp/
This will generate an easily readable .md (markdown) file in the directory that you supply via the -o flag. The comparison markdown file will highlight the profile data of the two profiles, and the differences between them.
Quick Roofline performance model
To collect the data required for generating a roofline model, you can use the AMDuProfPcm tool:
AMDuProfPcm roofline -X -o ~/classic_roofline.csv -- ./AMDTClassicMatMul
You can plot the result (to a PDF) using the Python plotting script
AMDuProfModelling.py -i ~/classic_roofline.csv -o ~/
You will likely need to load a Python module for this. A quick `module avail Python` will show you which Python modules are available on Snellius.