Thicket Nsight Compute Reader: Thicket Tutorial
Nsight Compute (NCU) is a performance profiler for NVIDIA GPUs. NCU report files do not contain a calltree, but with Caliper's nvtx service we can forward Caliper annotations to NCU. By profiling the same executable with a calltree profiler like Caliper, we can map the NCU data onto the calltree profile and create a Thicket object.
In Section 6, we reproduce some of the analysis and visualizations from the paper:
Olga Pearce, Jason Burmark, Rich Hornung, Befikir Bogale, Ian Lumsden, Michael McKinsey, Dewi Yokelson, David Boehme, Stephanie Brink, Michela Taufer, and Tom Scogland. “RAJA Performance Suite: Performance Portability Analysis with Caliper and Thicket”. SC-W 2024: Workshops of ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis. Performance, Portability & Productivity in HPC. 2024.
1. Import Necessary Packages
The Thicket NCU reader requires an existing installation of Nsight Compute, and the extras/python directory inside the Nsight Compute installation directory must be on the Python module search path (PYTHONPATH). In this notebook, we use sys.path.append to add that directory at runtime. If you are not on a Livermore Computing system, you must change this path to match your install of Nsight Compute.
VERSION NOTICE: This functionality is tested with nsight-compute version 2023.2.2. Your mileage may vary if using a different version.
[1]:
import os
import sys
sys.path.append("/usr/tce/packages/nsight-compute/nsight-compute-2023.2.2/extras/python")
from IPython.display import display
from IPython.display import HTML
import thicket as tt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
display(HTML("<style>.container { width:80% !important; }</style>"))
Seaborn not found, so skipping imports of plotting in thicket.stats
To enable this plotting, install seaborn or thicket[plotting]
Warning: Roundtrip module could not be loaded. Requires jupyter notebook version <= 7.x.
2. The Dataset
The dataset we are using comes from a profile of the RAJA Performance Suite on Lassen. We profile the block_128 tuning of the Base_CUDA, Lambda_CUDA, and RAJA_CUDA variants at two problem sizes: 1 million and 2 million. The calltree profiles come from the Caliper CUDA Activity Profile configuration. By changing the variant argument in the following cell, we can look at NCU data for different variants.
The following are reproducible steps to generate this dataset:
# Example of building
$ . RAJAPerf/scripts/lc-builds/blueos_nvhpc_nvcc_clang_caliper.sh
$ make -j
# Load a CUDA version matching the one used to build RAJAPerf
$ module load nvhpc/24.1-cuda-11.2.0
# Turn off NVIDIA Data Center GPU Manager (DCGM) on Lassen so we can run NCU (NCU errors if DCGM is running)
$ dcgmi profile --pause
# Example run to generate the CUDA Activity Profile
$ CALI_CONFIG=cuda-activity-profile,output.format=cali lrun -n 1 --smpiargs="-disable_gpu_hooks" bin/raja-perf.exe --variants [Base_CUDA OR Lambda_CUDA OR RAJA_CUDA] --tunings block_128 --size [1048576 OR 2097152] --repfact 0.01
# Example run to generate the NCU report
$ CALI_SERVICES_ENABLE=nvtx lrun -n 1 --smpiargs="-disable_gpu_hooks" ncu \
--nvtx --set default \
--export report \
--metrics sm__throughput.avg.pct_of_peak_sustained_elapsed \
--replay-mode application \
bin/raja-perf.exe --variants [Base_CUDA OR Lambda_CUDA OR RAJA_CUDA] --tunings block_128 --size [1048576 OR 2097152] --repfact 0.01
[2]:
# Map all files
ncu_dir = "../data/ncu/"
ncu_report_mapping = {}
variant = "base_cuda"  # OR "lambda_cuda" OR "raja_cuda"
problem_sizes = ["1M", "2M"]
for problem_size in problem_sizes:
    full_path = f"{ncu_dir}{variant}/{problem_size}/"
    ncu_report_mapping[full_path + "report.ncu-rep"] = full_path + "cuda_profile.cali"
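As a quick sanity check, we can print the mapping we just built; each NCU report is paired with the Caliper profile from the same run:

# Each key is an NCU report; each value is its matching Caliper profile
for report, cali in ncu_report_mapping.items():
    print(report, "->", cali)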
3. Read Calltree Profiles into Thicket
The only performance metrics contained in the CUDA Activity Profile are the CPU time, "time", and the GPU time, "time (gpu)".
[3]:
tk_cap = tt.Thicket.from_caliperreader(list(ncu_report_mapping.values()))
tk_cap.dataframe.head(20)
(1/2) Reading Files: 100%|██████████| 2/2 [00:00<00:00, 4.57it/s]
(2/2) Creating Thicket: 100%|██████████| 1/1 [00:00<00:00, 4.75it/s]
[3]:
| node | profile | nid | time | time (gpu) | name |
|---|---|---|---|---|---|
| {'name': 'RAJAPerf', 'type': 'function'} | 457195964 | 23.0 | 0.000615 | NaN | RAJAPerf |
| | 528105777 | 23.0 | 0.000596 | NaN | RAJAPerf |
| {'name': 'Algorithm', 'type': 'function'} | 457195964 | 164.0 | 0.000024 | NaN | Algorithm |
| | 528105777 | 164.0 | 0.000024 | NaN | Algorithm |
| {'name': 'Algorithm_MEMCPY', 'type': 'function'} | 457195964 | 168.0 | 0.000017 | NaN | Algorithm_MEMCPY |
| | 528105777 | 168.0 | 0.000017 | NaN | Algorithm_MEMCPY |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 457195964 | 170.0 | 0.000061 | NaN | cudaDeviceSynchronize |
| | 528105777 | 170.0 | 0.000039 | NaN | cudaDeviceSynchronize |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 457195964 | 169.0 | 0.000031 | NaN | cudaLaunchKernel |
| | 528105777 | 169.0 | 0.000032 | NaN | cudaLaunchKernel |
| {'name': 'void rajaperf::algorithm::memcpy<128ul>(double*, double*, long)', 'type': 'kernel'} | 457195964 | 225.0 | NaN | 0.000051 | void rajaperf::algorithm::memcpy<128ul>(double... |
| | 528105777 | 225.0 | NaN | 0.000031 | void rajaperf::algorithm::memcpy<128ul>(double... |
| {'name': 'Algorithm_MEMSET', 'type': 'function'} | 457195964 | 165.0 | 0.000015 | NaN | Algorithm_MEMSET |
| | 528105777 | 165.0 | 0.000014 | NaN | Algorithm_MEMSET |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 457195964 | 167.0 | 0.000043 | NaN | cudaDeviceSynchronize |
| | 528105777 | 167.0 | 0.000030 | NaN | cudaDeviceSynchronize |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 457195964 | 166.0 | 0.000030 | NaN | cudaLaunchKernel |
| | 528105777 | 166.0 | 0.000029 | NaN | cudaLaunchKernel |
| {'name': 'void rajaperf::algorithm::memset<128ul>(double*, double, long)', 'type': 'kernel'} | 457195964 | 224.0 | NaN | 0.000033 | void rajaperf::algorithm::memset<128ul>(double... |
| | 528105777 | 224.0 | NaN | 0.000020 | void rajaperf::algorithm::memset<128ul>(double... |
4. Add NCU Data
The Thicket add_ncu function takes one required argument and one optional argument. The required argument, ncu_report_mapping, maps each NCU report file to the calltree profile from the corresponding run. The optional argument, chosen_metrics, subselects which NCU performance metrics to add, since a report can contain hundreds of metrics. By default, all metrics are added.
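For illustration, omitting chosen_metrics attaches every metric in the report; a minimal sketch of that call (not executed here, since the next cell subselects three metrics) would be:

# Minimal sketch: attach all NCU metrics (can add hundreds of columns)
tk_cap.add_ncu(ncu_report_mapping=ncu_report_mapping)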
[4]:
# NCU metrics to attach to the Thicket
ncu_metrics = [
    "gpu__time_duration.sum",
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    "smsp__maximum_warps_avg_per_active_cycle",
]
# Add the NCU data to the Thicket
tk_cap.add_ncu(
    ncu_report_mapping=ncu_report_mapping,
    chosen_metrics=ncu_metrics,
)
tk_cap.dataframe.head(20)
Processing action 600/601: 100%|██████████| 601/601 [00:15<00:00, 39.09it/s]
Processing action 600/601: 100%|██████████| 601/601 [00:01<00:00, 375.43it/s]
[4]:
| node | profile | nid | time | time (gpu) | name | gpu__time_duration.sum | sm__throughput.avg.pct_of_peak_sustained_elapsed | smsp__maximum_warps_avg_per_active_cycle |
|---|---|---|---|---|---|---|---|---|
| {'name': 'RAJAPerf', 'type': 'function'} | 457195964 | 23.0 | 0.000615 | NaN | RAJAPerf | NaN | NaN | NaN |
| | 528105777 | 23.0 | 0.000596 | NaN | RAJAPerf | NaN | NaN | NaN |
| {'name': 'Algorithm', 'type': 'function'} | 457195964 | 164.0 | 0.000024 | NaN | Algorithm | NaN | NaN | NaN |
| | 528105777 | 164.0 | 0.000024 | NaN | Algorithm | NaN | NaN | NaN |
| {'name': 'Algorithm_MEMCPY', 'type': 'function'} | 457195964 | 168.0 | 0.000017 | NaN | Algorithm_MEMCPY | NaN | NaN | NaN |
| | 528105777 | 168.0 | 0.000017 | NaN | Algorithm_MEMCPY | NaN | NaN | NaN |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 457195964 | 170.0 | 0.000061 | NaN | cudaDeviceSynchronize | NaN | NaN | NaN |
| | 528105777 | 170.0 | 0.000039 | NaN | cudaDeviceSynchronize | NaN | NaN | NaN |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 457195964 | 169.0 | 0.000031 | NaN | cudaLaunchKernel | NaN | NaN | NaN |
| | 528105777 | 169.0 | 0.000032 | NaN | cudaLaunchKernel | NaN | NaN | NaN |
| {'name': 'void rajaperf::algorithm::memcpy<128ul>(double*, double*, long)', 'type': 'kernel'} | 457195964 | 225.0 | NaN | 0.000051 | void rajaperf::algorithm::memcpy<128ul>(double... | 43232.0 | 6.521123 | 16.0 |
| | 528105777 | 225.0 | NaN | 0.000031 | void rajaperf::algorithm::memcpy<128ul>(double... | 22880.0 | 6.294607 | 16.0 |
| {'name': 'Algorithm_MEMSET', 'type': 'function'} | 457195964 | 165.0 | 0.000015 | NaN | Algorithm_MEMSET | NaN | NaN | NaN |
| | 528105777 | 165.0 | 0.000014 | NaN | Algorithm_MEMSET | NaN | NaN | NaN |
| {'name': 'cudaDeviceSynchronize', 'type': 'function'} | 457195964 | 167.0 | 0.000043 | NaN | cudaDeviceSynchronize | NaN | NaN | NaN |
| | 528105777 | 167.0 | 0.000030 | NaN | cudaDeviceSynchronize | NaN | NaN | NaN |
| {'name': 'cudaLaunchKernel', 'type': 'function'} | 457195964 | 166.0 | 0.000030 | NaN | cudaLaunchKernel | NaN | NaN | NaN |
| | 528105777 | 166.0 | 0.000029 | NaN | cudaLaunchKernel | NaN | NaN | NaN |
| {'name': 'void rajaperf::algorithm::memset<128ul>(double*, double, long)', 'type': 'kernel'} | 457195964 | 224.0 | NaN | 0.000033 | void rajaperf::algorithm::memset<128ul>(double... | 31648.0 | 7.531866 | 16.0 |
| | 528105777 | 224.0 | NaN | 0.000020 | void rajaperf::algorithm::memset<128ul>(double... | 18016.0 | 6.692635 | 16.0 |
5. Visualize the NCU Performance Data on the Calltree
[5]:
print(tk_cap.tree(
    metric_column="sm__throughput.avg.pct_of_peak_sustained_elapsed",
    expand_name=True,
))
_____ _ _ _ _
|_ _| |__ (_) ___| | _____| |_
| | | '_ \| |/ __| |/ / _ \ __|
| | | | | | | (__| < __/ |_
|_| |_| |_|_|\___|_|\_\___|\__| v2024.2.1
nan RAJAPerf
├─ nan Algorithm
│ ├─ nan Algorithm_MEMCPY
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 6.521 void rajaperf::algorithm::memcpy<128ul>(double*, double*, long)
│ └─ nan Algorithm_MEMSET
│ ├─ nan cudaDeviceSynchronize
│ └─ nan cudaLaunchKernel
│ └─ 7.532 void rajaperf::algorithm::memset<128ul>(double*, double, long)
├─ nan Apps
│ ├─ nan Apps_DEL_DOT_VEC_2D
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 26.952 void rajaperf::apps::deldotvec2d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, long*, double, double, long)
│ ├─ nan Apps_EDGE3D
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Apps_ENERGY
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ ├─ 4.291 void rajaperf::apps::energycalc1<128ul>(double*, double*, double*, double*, double*, double*, long)
│ │ ├─ 7.512 void rajaperf::apps::energycalc2<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double, long)
│ │ ├─ 3.977 void rajaperf::apps::energycalc3<128ul>(double*, double*, double*, double*, double*, double*, long)
│ │ ├─ 6.292 void rajaperf::apps::energycalc4<128ul>(double*, double*, double, double, long)
│ │ ├─ 6.911 void rajaperf::apps::energycalc5<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double, double, double, long)
│ │ └─ 8.058 void rajaperf::apps::energycalc6<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double, double, long)
│ ├─ nan Apps_FIR
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 46.164 void rajaperf::apps::fir<128ul>(double*, double*, long, long)
│ ├─ nan Apps_LTIMES
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Apps_LTIMES_NOVIEW
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Apps_MATVEC_3D_STENCIL
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 6.912 void rajaperf::apps::matvec_3d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, long*, long, long)
│ ├─ nan Apps_NODAL_ACCUMULATION_3D
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 7.249 void rajaperf::apps::nodal_accumulation_3d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, long*, long, long)
│ ├─ nan Apps_PRESSURE
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ ├─ 7.532 void rajaperf::apps::pressurecalc1<128ul>(double*, double*, double, long)
│ │ └─ 7.175 void rajaperf::apps::pressurecalc2<128ul>(double*, double*, double*, double*, double, double, double, long)
│ ├─ nan Apps_VOL3D
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 34.543 void rajaperf::apps::vol3d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double, long, long)
│ └─ nan Apps_ZONAL_ACCUMULATION_3D
│ ├─ nan cudaDeviceSynchronize
│ └─ nan cudaLaunchKernel
│ └─ 11.787 void rajaperf::apps::zonal_accumulation_3d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, long*, long, long)
├─ nan Basic
│ ├─ nan Basic_ARRAY_OF_PTRS
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Basic_COPY8
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Basic_DAXPY
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 5.275 void rajaperf::basic::daxpy<128ul>(double*, double*, double, long)
│ ├─ nan Basic_DAXPY_ATOMIC
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 4.733 void rajaperf::basic::daxpy_atomic<128ul>(double*, double*, double, long)
│ ├─ nan Basic_IF_QUAD
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 12.770 void rajaperf::basic::ifquad<128ul>(double*, double*, double*, double*, double*, long)
│ ├─ nan Basic_INIT3
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 4.232 void rajaperf::basic::init3<128ul>(double*, double*, double*, double*, double*, long)
│ ├─ nan Basic_INIT_VIEW1D
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 8.887 void rajaperf::basic::initview1d<128ul>(double*, double, long)
│ ├─ nan Basic_INIT_VIEW1D_OFFSET
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 8.892 void rajaperf::basic::initview1d_offset<128ul>(double*, double, long, long)
│ ├─ nan Basic_MULADDSUB
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 4.515 void rajaperf::basic::muladdsub<128ul>(double*, double*, double*, double*, double*, long)
│ ├─ nan Basic_NESTED_INIT
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 17.655 void rajaperf::basic::nested_init<32ul, 4ul, 1ul>(double*, long, long, long)
│ └─ nan Basic_PI_ATOMIC
│ └─ nan cudaDeviceSynchronize
├─ nan Comm
│ └─ nan Comm_HALO_PACKING
│ ├─ nan cudaDeviceSynchronize
│ ├─ nan cudaLaunchKernel
│ │ ├─ 0.126 void rajaperf::comm::halo_packing_pack<128ul>(double*, int*, double*, long)
│ │ └─ 0.160 void rajaperf::comm::halo_packing_unpack<128ul>(double*, int*, double*, long)
│ └─ nan cudaStreamSynchronize
├─ nan Lcals
│ ├─ nan Lcals_DIFF_PREDICT
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 2.277 void rajaperf::lcals::diff_predict<128ul>(double*, double*, long, long)
│ ├─ nan Lcals_EOS
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 9.121 void rajaperf::lcals::eos<128ul>(double*, double*, double*, double*, double, double, double, long)
│ ├─ nan Lcals_FIRST_DIFF
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 7.521 void rajaperf::lcals::first_diff<128ul>(double*, double*, long)
│ ├─ nan Lcals_FIRST_SUM
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 8.251 void rajaperf::lcals::first_sum<128ul>(double*, double*, long)
│ ├─ nan Lcals_GEN_LIN_RECUR
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ ├─ 4.580 void rajaperf::lcals::genlinrecur1<128ul>(double*, double*, double*, double*, long, long)
│ │ └─ 5.087 void rajaperf::lcals::genlinrecur2<128ul>(double*, double*, double*, double*, long, long)
│ ├─ nan Lcals_HYDRO_1D
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 6.563 void rajaperf::lcals::hydro_1d<128ul>(double*, double*, double*, double, double, double, long)
│ ├─ nan Lcals_HYDRO_2D
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ ├─ 14.008 void rajaperf::lcals::hydro_2d1<32ul, 4ul>(double*, double*, double*, double*, double*, double*, long, long)
│ │ ├─ 9.808 void rajaperf::lcals::hydro_2d2<32ul, 4ul>(double*, double*, double*, double*, double*, double*, double, long, long)
│ │ └─ 6.551 void rajaperf::lcals::hydro_2d3<32ul, 4ul>(double*, double*, double*, double*, double*, double*, double, long, long)
│ ├─ nan Lcals_INT_PREDICT
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 5.497 void rajaperf::lcals::int_predict<128ul>(double*, double, double, double, double, double, double, double, double, long, long)
│ ├─ nan Lcals_PLANCKIAN
│ │ └─ nan cudaDeviceSynchronize
│ └─ nan Lcals_TRIDIAG_ELIM
│ ├─ nan cudaDeviceSynchronize
│ └─ nan cudaLaunchKernel
│ └─ 5.423 void rajaperf::lcals::tridiag_elim<128ul>(double*, double*, double*, double*, long)
├─ nan Polybench
│ ├─ nan Polybench_2MM
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Polybench_3MM
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Polybench_ADI
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Polybench_ATAX
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ ├─ 0.642 void rajaperf::polybench::poly_atax_1<128ul>(double*, double*, double*, double*, long)
│ │ └─ 0.455 void rajaperf::polybench::poly_atax_2<128ul>(double*, double*, double*, long)
│ ├─ nan Polybench_FDTD_2D
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Polybench_FLOYD_WARSHALL
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Polybench_GEMM
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Polybench_GEMVER
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Polybench_GESUMMV
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ └─ 0.595 void rajaperf::polybench::poly_gesummv<128ul>(double*, double*, double*, double*, double, double, long)
│ ├─ nan Polybench_HEAT_3D
│ │ └─ nan cudaDeviceSynchronize
│ ├─ nan Polybench_JACOBI_1D
│ │ ├─ nan cudaDeviceSynchronize
│ │ └─ nan cudaLaunchKernel
│ │ ├─ 10.901 void rajaperf::polybench::poly_jacobi_1D_1<128ul>(double*, double*, long)
│ │ └─ 10.883 void rajaperf::polybench::poly_jacobi_1D_2<128ul>(double*, double*, long)
│ ├─ nan Polybench_JACOBI_2D
│ │ └─ nan cudaDeviceSynchronize
│ └─ nan Polybench_MVT
│ ├─ nan cudaDeviceSynchronize
│ └─ nan cudaLaunchKernel
│ ├─ 0.667 void rajaperf::polybench::poly_mvt_1<128ul>(double*, double*, double*, long)
│ └─ 0.378 void rajaperf::polybench::poly_mvt_2<128ul>(double*, double*, double*, long)
└─ nan Stream
├─ nan Stream_ADD
│ ├─ nan cudaDeviceSynchronize
│ └─ nan cudaLaunchKernel
│ └─ 5.780 void rajaperf::stream::add<128ul>(double*, double*, double*, long)
├─ nan Stream_COPY
│ ├─ nan cudaDeviceSynchronize
│ └─ nan cudaLaunchKernel
│ └─ 6.787 void rajaperf::stream::copy<128ul>(double*, double*, long)
├─ nan Stream_MUL
│ ├─ nan cudaDeviceSynchronize
│ └─ nan cudaLaunchKernel
│ └─ 7.199 void rajaperf::stream::mul<128ul>(double*, double*, double, long)
└─ nan Stream_TRIAD
├─ nan cudaDeviceSynchronize
└─ nan cudaLaunchKernel
└─ 5.789 void rajaperf::stream::triad<128ul>(double*, double*, double*, double, long)
nan cudaDeviceSynchronize
nan cudaFree
nan cudaFreeHost
nan cudaGetDevice
nan cudaGetSymbolAddress
nan cudaHostAlloc
nan cudaLaunchKernel
└─ nan void rajaperf::basic::daxpy<128ul>(double*, double*, double, long)
nan cudaMalloc
nan cudaMallocManaged
nan cudaMemAdvise
nan cudaMemcpy
└─ nan memcpy
nan cudaMemcpyAsync
└─ nan memcpy
nan cudaStreamCreate
Legend (Metric: sm__throughput.avg.pct_of_peak_sustained_elapsed Min: 0.13 Max: 46.16 indices: {'profile': 457195964})
█ 41.56 - 46.16
█ 32.35 - 41.56
█ 23.14 - 32.35
█ 13.94 - 23.14
█ 4.73 - 13.94
█ 0.13 - 4.73
name User code ◀ Only in left graph ▶ Only in right graph
6. Create Instruction Roofline Plots
We can make roofline plots using the metrics we have collected with Nsight Compute. A roofline sets an upper bound on the performance of a kernel as a function of its operational intensity. We will use roofline plots to understand the performance of RAJAPerf kernels.
In this section, we reproduce some of the analysis and visualizations from the paper:
Olga Pearce, Jason Burmark, Rich Hornung, Befikir Bogale, Ian Lumsden, Michael McKinsey, Dewi Yokelson, David Boehme, Stephanie Brink, Michela Taufer, and Tom Scogland. “RAJA Performance Suite: Performance Portability Analysis with Caliper and Thicket”. SC-W 2024: Workshops of ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis. Performance, Portability & Productivity in HPC. 2024.
Instruction roofline models were introduced by Ding et al. to better characterize GPU workloads by looking at instruction intensity. Below, we walk through the creation of instruction roofline models for the Applications (Apps) group of the RAJAPerf kernels.
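To make the bound concrete, here is a minimal sketch (our illustration, not a Thicket or NCU API) of the roofline ceiling, using the HBM bandwidth and peak warp GIPS values that appear in the plotting code later in this notebook:

# Sketch of the roofline bound: attainable performance is capped by the
# minimum of the compute ceiling and bandwidth * instruction intensity.
def attainable_warp_gips(intensity, peak_gips=489.6, hbm_bandwidth_gtxn_s=25.9):
    # intensity: warp instructions per memory transaction
    return min(peak_gips, hbm_bandwidth_gtxn_s * intensity)

attainable_warp_gips(10.0)   # memory-bound region: 259.0 warp GIPS
attainable_warp_gips(100.0)  # compute-bound region: 489.6 warp GIPS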
More references on methodology for roofline models:
Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (April 2009), 65–76. https://doi.org/10.1145/1498765.1498785
Nan Ding, Muaaz Awan, Samuel Williams. 2022. Instruction Roofline: An insightful visual performance model for GPUs. Concurrency and Computation: Practice and Experience https://doi.org/10.1002/cpe.6591
Step 1 is to create a Thicket using a CUDA Activity Profile, and add the NCU data. We use RAJA_CUDA variant data of the block_256 tuning.
Command to generate the Caliper CUDA Activity Profile:
CALI_CONFIG=cuda-activity-profile,output.format=cali lrun -n 1 --smpiargs="-disable_gpu_hooks" bin/raja-perf.exe --variants RAJA_CUDA --tunings block_256 --size 8388608
Command to generate the NCU report with Roofline metrics:
CALI_SERVICES_ENABLE=nvtx lrun -n 1 --smpiargs="-disable_gpu_hooks" ncu \
--nvtx --set default \
--export report \
--metrics sm__sass_thread_inst_executed.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum,l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum,lts__t_sectors_op_read.sum,lts__t_sectors_op_write.sum,lts__t_sectors_op_atom.sum,lts__t_sectors_op_red.sum,dram__sectors_read.sum,dram__sectors_write.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum \
--replay-mode application \
bin/raja-perf.exe --variants RAJA_CUDA --tunings block_256 --size 8388608
[6]:
roofline_cali = ncu_dir + "roofline/block_256_profile.cali"
roofline_ncu = ncu_dir + "roofline/block_256_profile.ncu-rep"
# Unzip files, unless they have already been unzipped
if not (os.path.isfile(roofline_cali) and os.path.isfile(roofline_ncu)):
    import zipfile

    with zipfile.ZipFile(ncu_dir + "roofline/block_256_profile.zip", "r") as zip_ref:
        zip_ref.extractall(ncu_dir + "roofline/")
# Create Thicket and add NCU data
tk_roofline = tt.Thicket.from_caliperreader(roofline_cali)
tk_roofline.add_ncu({roofline_ncu: roofline_cali})
Processing action 3444/3445: 100%|██████████| 3445/3445 [00:42<00:00, 81.94it/s]
Select a subset of kernels to plot on the roofline. Here we are just selecting the Apps kernels.
[7]:
# Options (can be more than one): ["Algorithm", "Apps", "Basic", "Comm", "Lcals", "Polybench", "Stream"]
kernel_types = ["Apps"]
raja_kernel_query = (
    tt.query.Query()
    .match(
        ".",
        lambda row: row["name"].apply(
            lambda x: any([x.startswith(c + "_") for c in kernel_types])
        ).all()
    )
    .rel("*")
)
pruned_th = tk_roofline.query(raja_kernel_query)
Next, we aggregate metrics for kernels with multiple instances by summing over each kernel's instances.
[8]:
# Match parent kernel name, e.g. "Apps_VOL3D"
kernel_query = """
MATCH (".",p)->("*")
WHERE p."name" = "{ker}"
"""
# Match the demangled kernel that contains the NCU measurement
child_kernel_query = """
MATCH (".", p)
WHERE p."depth" = 2
"""
# Columns to aggregate (identifier and inclusive-time columns excluded)
cols = [col for col in pruned_th.dataframe.columns.tolist() if col not in ["nid", "time", "name", "time (inc)", "time (gpu) (inc)"]]
leaves = []
for n in pruned_th.graph.roots:
    kernels = {}
    kernels["name"] = n.frame.get("name")
    tmp = kernel_query.format(ker=n.frame.get("name"))
    ker_th = pruned_th.query(tmp, multi_index_mode="all")
    leaf = ker_th.query(child_kernel_query, multi_index_mode="all")
    # Sum each metric across all kernel instances under this parent
    for col in cols:
        kernels[col] = leaf.dataframe[col].sum()
    leaves.append(kernels)
agg_df = pd.DataFrame(data=leaves)
agg_df
[8]:
| | name | time (gpu) | c2clink__enabled_mask | c2clink__present | device__attribute_architecture | device__attribute_async_engine_count | device__attribute_can_flush_remote_writes | device__attribute_can_map_host_memory | device__attribute_can_tex2d_gather | device__attribute_can_use_64_bit_stream_mem_ops_v1 | ... | sm__sass_thread_inst_executed.sum | smsp__inst_executed.sum | smsp__inst_executed_op_global_ld.sum | smsp__inst_executed_op_global_st.sum | smsp__inst_executed_op_local_ld.sum | smsp__inst_executed_op_local_st.sum | smsp__inst_executed_op_shared_ld.sum | smsp__inst_executed_op_shared_st.sum | smsp__inst_executed_pipe_tensor.sum | smsp__maximum_warps_avg_per_active_cycle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apps_DEL_DOT_VEC_2D | 0.000527 | 0.0 | 0.0 | 320.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.165767e+09 | 3.643023e+07 | 4455496.0 | 262088.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 |
| 1 | Apps_EDGE3D | 0.468591 | 0.0 | 0.0 | 320.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.898933e+11 | 5.934178e+09 | 6587736.0 | 274489.0 | 1.089172e+09 | 715043845.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 2 | Apps_ENERGY | 0.002247 | 0.0 | 0.0 | 1920.0 | 24.0 | 6.0 | 6.0 | 6.0 | 6.0 | ... | 1.811939e+09 | 5.662310e+07 | 5767168.0 | 1572864.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 96.0 |
| 3 | Apps_FIR | 0.000178 | 0.0 | 0.0 | 320.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.115685e+09 | 3.486515e+07 | 4194304.0 | 262144.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
| 4 | Apps_LTIMES | 0.002227 | 0.0 | 0.0 | 320.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 3.340763e+09 | 1.216348e+08 | 18874368.0 | 8388608.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
| 5 | Apps_LTIMES_NOVIEW | 0.002234 | 0.0 | 0.0 | 320.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 2.900361e+09 | 1.042022e+08 | 18874368.0 | 8388608.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
| 6 | Apps_MATVEC_3D_STENCIL | 0.002154 | 0.0 | 0.0 | 320.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.731645e+09 | 5.411398e+07 | 14378100.0 | 261420.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 7 | Apps_NODAL_ACCUMULATION_3D | 0.000556 | 0.0 | 0.0 | 320.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 3.597146e+08 | 1.124110e+07 | 522840.0 | 0.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
| 8 | Apps_PRESSURE | 0.000507 | 0.0 | 0.0 | 640.0 | 8.0 | 2.0 | 2.0 | 2.0 | 2.0 | ... | 5.117051e+08 | 1.599078e+07 | 1048576.0 | 786432.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 32.0 |
| 9 | Apps_VOL3D | 0.000380 | 0.0 | 0.0 | 320.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.317547e+09 | 4.117341e+07 | 6587736.0 | 274489.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| 10 | Apps_ZONAL_ACCUMULATION_3D | 0.000268 | 0.0 | 0.0 | 320.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 4.182726e+08 | 1.307104e+07 | 2352780.0 | 261420.0 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
11 rows × 239 columns
Calculate instruction intensities from the existing NCU metrics.
[9]:
# Warp-level instructions (thread instructions / 32 threads per warp)
agg_df["Warp Instructions"] = agg_df["sm__sass_thread_inst_executed.sum"] / 32
# L1 global and shared memory transactions
agg_df["L1 (GLOBAL)"] = agg_df["l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum"] + agg_df["l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum"]
agg_df["L1 (SHARED)"] = agg_df["l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum"] + agg_df["l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum"]
# Total L1 transactions (shared-memory wavefronts weighted by 4)
agg_df["Total L1 Transactions"] = agg_df["L1 (GLOBAL)"] + (4 * agg_df["L1 (SHARED)"])
# L2 memory transactions (atomics and reductions count toward both reads and writes)
agg_df["L2 Write Transactions"] = agg_df["lts__t_sectors_op_write.sum"] + agg_df["lts__t_sectors_op_atom.sum"] + agg_df["lts__t_sectors_op_red.sum"]
agg_df["L2 Read Transactions"] = agg_df["lts__t_sectors_op_read.sum"] + agg_df["lts__t_sectors_op_atom.sum"] + agg_df["lts__t_sectors_op_red.sum"]
agg_df["Total L2 Transactions"] = agg_df["L2 Read Transactions"] + agg_df["L2 Write Transactions"]
# HBM memory transactions
agg_df["HBM Transactions"] = agg_df["dram__sectors_read.sum"] + agg_df["dram__sectors_write.sum"]
# L1, L2, and HBM instruction intensities
agg_df["L1 Instruction Intensity"] = agg_df["Warp Instructions"] / agg_df["Total L1 Transactions"]
agg_df["L2 Instruction Intensity"] = agg_df["Warp Instructions"] / agg_df["Total L2 Transactions"]
agg_df["HBM Instruction Intensity"] = agg_df["Warp Instructions"] / agg_df["HBM Transactions"]
# Performance in giga warp instructions per second (GIPS)
agg_df["Performance GIPS"] = agg_df["Warp Instructions"] / (agg_df["time (gpu)"] * (10 ** 9))
Plotting Roofline with Matplotlib
The plotting code below was adapted from the following resource: https://gitlab.com/NERSC/roofline-on-nvidia-gpus/-/blob/master/custom-scripts/roofline.py
[14]:
pruned_th_kers = tk_roofline.query(child_kernel_query, multi_index_mode="all")
metrics = ["L1 Instruction Intensity", "L2 Instruction Intensity", "HBM Instruction Intensity"]
c_s = ["red", "blue", "green", "orange", "purple", "cyan", "magenta"]
# Assign one color per kernel group
final_colors = []
for i in agg_df["name"].tolist():
    for j in range(0, len(kernel_types)):
        if i.startswith(kernel_types[j]):
            final_colors.append(c_s[j])
            break  # stop at the first matching group
font = {"size": 15}
plt.rc("font", **font)
colors = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple", "tab:brown", "tab:pink", "tab:gray", "tab:olive", "tab:cyan"]
styles = ["o", "s", "v", "^", "D", ">", "<", "*", "h", "H", "+", "1", "2", "3", "4", "8", "p", "d", "|", "_", ".", ","]
markersize = 10
markerwidth = 2
maxchar = 25

def roofline(LABELS, flag="HBM", data_df=None):
    LABELS = [x[:maxchar] for x in LABELS]
    # Memory and compute ceilings for the target GPU
    bandwidth_hbm = 25.9  # in GTXN/s
    bandwidth_l2 = 93.6  # in GTXN/s
    bandwidth_l1 = 437.5  # in GTXN/s
    if flag == "L1":
        memRoofs = [("L1", bandwidth_l1)]
    elif flag == "L2":
        memRoofs = [("L2", bandwidth_l2)]
    elif flag == "HBM":
        memRoofs = [("HBM", bandwidth_hbm)]
    elif flag == "all":
        memRoofs = [("L1", bandwidth_l1), ("L2", bandwidth_l2), ("HBM", bandwidth_hbm)]
    cmpRoofs = [("GIPS", 489.6)]

    fig = plt.figure(1, figsize=(10, 6))
    plt.clf()
    ax = fig.gca()
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Instruction Intensity [Warp Instructions/transaction]")
    ax.set_ylabel("Performance [Warp GIPS]")

    nx = 10000
    xmin = -3
    xmax = 3
    ymin = 1
    ymax = 1000
    ax.set_xlim(10 ** xmin, 10 ** xmax)
    ax.set_ylim(ymin, ymax)

    ixx = int(nx * 0.02)
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # Find the "elbows" where the bandwidth ceilings meet the compute ceiling
    scomp_x_elbow = []
    scomp_ix_elbow = []
    smem_x_elbow = []
    smem_ix_elbow = []
    x = np.logspace(xmin, xmax, nx)
    for roof in cmpRoofs:
        for ix in range(1, nx):
            if float(memRoofs[0][1] * x[ix]) >= roof[1] and (memRoofs[0][1] * x[ix - 1]) < roof[1]:
                scomp_x_elbow.append(x[ix - 1])
                scomp_ix_elbow.append(ix - 1)
                break
    for roof in memRoofs:
        for ix in range(1, nx):
            if (cmpRoofs[0][1] <= roof[1] * x[ix] and cmpRoofs[0][1] > roof[1] * x[ix - 1]):
                smem_x_elbow.append(x[ix - 1])
                smem_ix_elbow.append(ix - 1)
                break

    # Draw the horizontal compute ceiling(s)
    for i in range(len(cmpRoofs)):
        roof = cmpRoofs[i][1]
        y = np.ones(len(x)) * roof
        ax.plot(x[scomp_ix_elbow[i]:], y[scomp_ix_elbow[i]:], c="k", ls="-", lw="2")

    # Draw the slanted bandwidth ceiling(s)
    for i in range(len(memRoofs)):
        roof = memRoofs[i][1]
        y = x * roof
        ax.plot(x[:smem_ix_elbow[i] + 1], y[:smem_ix_elbow[i] + 1], c="k", ls="-", lw="2")

    # Label the compute ceiling(s)
    for roof in cmpRoofs:
        ax.text(
            x[-ixx],
            roof[1],
            roof[0] + ": " + "{0:.1f}".format(roof[1]) + " Warp GIPS",
            horizontalalignment="right",
            verticalalignment="bottom",
        )

    # Label the bandwidth ceiling(s), rotated to match their slope
    for roof in memRoofs:
        ang = np.arctan(
            np.log10(xlim[1] / xlim[0]) / np.log10(ylim[1] / ylim[0])
            * fig.get_size_inches()[1] / fig.get_size_inches()[0]
        )
        if x[ixx] * roof[1] > ymin:
            ax.text(
                x[ixx],
                x[ixx] * roof[1] * (1 + 0.25 * np.sin(ang) ** 2),
                roof[0] + ": " + "{0:.1f}".format(float(roof[1])) + " GTXN/s",
                horizontalalignment="left",
                verticalalignment="bottom",
                rotation=180 / np.pi * ang,
            )
        else:
            ymin_ix_elbow = list()
            ymin_x_elbow = list()
            for ix in range(1, nx):
                if (ymin <= roof[1] * x[ix] and ymin > roof[1] * x[ix - 1]):
                    ymin_x_elbow.append(x[ix - 1])
                    ymin_ix_elbow.append(ix - 1)
                    break
            ax.text(
                x[ixx + ymin_ix_elbow[0]],
                x[ixx + ymin_ix_elbow[0]] * roof[1] * (1 + 0.25 * np.sin(ang) ** 2) * 1.15,
                roof[0] + ": " + "{0:.1f}".format(float(roof[1])) + " GTXN/s",
                horizontalalignment="left",
                verticalalignment="bottom",
                rotation=180 / np.pi * ang,
            )

    # Plot each kernel at (instruction intensity, performance)
    if flag == "L1":
        ax.scatter(data_df[metrics[0]], data_df["Performance GIPS"], c=final_colors, label="L1", marker=styles[0])
    elif flag == "L2":
        ax.scatter(data_df[metrics[1]], data_df["Performance GIPS"], c=final_colors, label="L2", marker=styles[1])
    elif flag == "HBM":
        ax.scatter(data_df[metrics[2]], data_df["Performance GIPS"], c=final_colors, label="HBM", marker="*")
    elif flag == "all":
        ax.scatter(data_df[metrics[0]], data_df["Performance GIPS"], c=final_colors, label="L1", marker=styles[0])
        ax.scatter(data_df[metrics[1]], data_df["Performance GIPS"], c=final_colors, label="L2", marker=styles[1])
        ax.scatter(data_df[metrics[2]], data_df["Performance GIPS"], c=final_colors, label="HBM", marker="*")

    # Legend mapping colors to kernel groups (identical for every flag)
    custom_labels = [k.split("_")[0] for k in kernel_types]
    custom_handles = [plt.Line2D([0], [0], color=i, lw=2) for i in c_s]
    leg2 = ax.legend(
        custom_handles,
        custom_labels,
        bbox_to_anchor=(0.36, 1),
        title="Kernel Types",
        fontsize="12",
        title_fontsize="small",
    )
    ax.add_artist(leg2)
    ax.legend(loc="upper left")
    plt.show()
[15]:
roofline(LABELS=pruned_th.dataframe["name"].tolist(), flag="all", data_df=agg_df)

[16]:
roofline(LABELS=pruned_th.dataframe["name"].tolist(), flag="L1", data_df=agg_df)

[17]:
roofline(LABELS=pruned_th.dataframe["name"].tolist(), flag="L2", data_df=agg_df)

[18]:
roofline(LABELS=pruned_th.dataframe["name"].tolist(), flag="HBM", data_df=agg_df)
