Thicket Nsight Compute Reader: Thicket Tutorial

Nsight Compute (NCU) is a performance profiler for NVIDIA GPUs. NCU report files do not have a calltree, but with the NVTX Caliper service we can forward Caliper annotations to NCU. By profiling the same executable with a calltree profiler like Caliper, we can map the NCU data to the calltree profile and create a Thicket object.
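The end-to-end flow boils down to two calls, sketched minimally below (the file names are placeholders; the full worked example follows in Sections 3 and 4):

import thicket as tt

# Read the Caliper calltree profile (placeholder file name).
tk = tt.Thicket.from_caliperreader("cuda_profile.cali")

# Attach NCU metrics by mapping each NCU report file to its calltree profile.
tk.add_ncu(ncu_report_mapping={"report.ncu-rep": "cuda_profile.cali"})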

In Section 6, we reproduce some of the analysis and visualizations from the paper:

Olga Pearce, Jason Burmark, Rich Hornung, Befikir Bogale, Ian Lumsden, Michael McKinsey, Dewi Yokelson, David Boehme, Stephanie Brink, Michela Taufer, and Tom Scogland. “RAJA Performance Suite: Performance Portability Analysis with Caliper and Thicket”. SC-W 2024: Workshops of ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis. Performance, Portability & Productivity in HPC. 2024.


1. Import Necessary Packages

The Thicket NCU reader requires an existing installation of Nsight Compute, and the extras/python directory inside the Nsight Compute installation must be on Python's module search path. In this notebook we use sys.path.append to add that directory to sys.path. If you are not on a Livermore Computing system, you must change this path to match your installation of Nsight Compute.

VERSION NOTICE: This functionality is tested with nsight-compute version 2023.2.2. Your mileage may vary if using a different version.

[1]:
import os
import sys

sys.path.append("/usr/tce/packages/nsight-compute/nsight-compute-2023.2.2/extras/python")

from IPython.display import display
from IPython.display import HTML

import thicket as tt

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

display(HTML("<style>.container { width:80% !important; }</style>"))
Seaborn not found, so skipping imports of plotting in thicket.stats
To enable this plotting, install seaborn or thicket[plotting]
Warning: Roundtrip module could not be loaded. Requires jupyter notebook version <= 7.x.

2. The Dataset

The dataset we are using comes from a profile of the RAJA Performance Suite on Lassen. We profile the block_128 tuning of the Base_CUDA, Lambda_CUDA, and RAJA_CUDA variants at two problem sizes: 1 million and 2 million. The calltree profiles come from the CUDA Activity Profile Caliper configuration. By changing the variant argument in the following cell, we can look at the NCU data for different variants.

The following are reproducible steps to generate this dataset:

# Example of building
$ . RAJAPerf/scripts/lc-builds/blueos_nvhpc_nvcc_clang_caliper.sh
$ make -j

# Load CUDA version equal to the CUDA version used to build RAJAPerf
$ module load nvhpc/24.1-cuda-11.2.0

# Turn off NVIDIA Data Center GPU Manager (DCGM) on Lassen so we can run NCU (get an error if it's on)
$ dcgmi profile --pause
# Example run to Generate the CUDA Activity Profile
$ CALI_CONFIG=cuda-activity-profile,output.format=cali lrun -n 1 --smpiargs="-disable_gpu_hooks" bin/raja-perf.exe --variants [Base_CUDA OR Lambda_CUDA OR RAJA_CUDA] --tunings block_128 --size [1048576 OR 2097152] --repfact 0.01

# Example run to Generate the NCU Report
$ CALI_SERVICES_ENABLE=nvtx lrun -n 1 --smpiargs="-disable_gpu_hooks" ncu \
--nvtx --set default \
--export report \
--metrics sm__throughput.avg.pct_of_peak_sustained_elapsed \
--replay-mode application \
bin/raja-perf.exe --variants [Base_CUDA OR Lambda_CUDA OR RAJA_CUDA] --tunings block_128 --size [1048576 OR 2097152] --repfact 0.01
[2]:
# Map all files
ncu_dir = "../data/ncu/"
ncu_report_mapping = {}
variant = "base_cuda" # OR "lambda_cuda" OR "raja_cuda"
problem_sizes = ["1M", "2M"]
for problem_size in problem_sizes:
    full_path = f"{ncu_dir}{variant}/{problem_size}/"
    ncu_report_mapping[full_path + "report.ncu-rep"] = full_path + "cuda_profile.cali"

3. Read Calltree Profiles into Thicket

The only performance metrics contained in the CUDA Activity Profile are the CPU time (the time column) and the GPU time (the time (gpu) column).

[3]:
tk_cap = tt.Thicket.from_caliperreader(list(ncu_report_mapping.values()))
tk_cap.dataframe.head(20)
(1/2) Reading Files: 100%|██████████| 2/2 [00:00<00:00,  4.57it/s]
(2/2) Creating Thicket: 100%|██████████| 1/1 [00:00<00:00,  4.75it/s]
[3]:
nid time time (gpu) name
node profile
{'name': 'RAJAPerf', 'type': 'function'} 457195964 23.0 0.000615 NaN RAJAPerf
528105777 23.0 0.000596 NaN RAJAPerf
{'name': 'Algorithm', 'type': 'function'} 457195964 164.0 0.000024 NaN Algorithm
528105777 164.0 0.000024 NaN Algorithm
{'name': 'Algorithm_MEMCPY', 'type': 'function'} 457195964 168.0 0.000017 NaN Algorithm_MEMCPY
528105777 168.0 0.000017 NaN Algorithm_MEMCPY
{'name': 'cudaDeviceSynchronize', 'type': 'function'} 457195964 170.0 0.000061 NaN cudaDeviceSynchronize
528105777 170.0 0.000039 NaN cudaDeviceSynchronize
{'name': 'cudaLaunchKernel', 'type': 'function'} 457195964 169.0 0.000031 NaN cudaLaunchKernel
528105777 169.0 0.000032 NaN cudaLaunchKernel
{'name': 'void rajaperf::algorithm::memcpy<128ul>(double*, double*, long)', 'type': 'kernel'} 457195964 225.0 NaN 0.000051 void rajaperf::algorithm::memcpy<128ul>(double...
528105777 225.0 NaN 0.000031 void rajaperf::algorithm::memcpy<128ul>(double...
{'name': 'Algorithm_MEMSET', 'type': 'function'} 457195964 165.0 0.000015 NaN Algorithm_MEMSET
528105777 165.0 0.000014 NaN Algorithm_MEMSET
{'name': 'cudaDeviceSynchronize', 'type': 'function'} 457195964 167.0 0.000043 NaN cudaDeviceSynchronize
528105777 167.0 0.000030 NaN cudaDeviceSynchronize
{'name': 'cudaLaunchKernel', 'type': 'function'} 457195964 166.0 0.000030 NaN cudaLaunchKernel
528105777 166.0 0.000029 NaN cudaLaunchKernel
{'name': 'void rajaperf::algorithm::memset<128ul>(double*, double, long)', 'type': 'kernel'} 457195964 224.0 NaN 0.000033 void rajaperf::algorithm::memset<128ul>(double...
528105777 224.0 NaN 0.000020 void rajaperf::algorithm::memset<128ul>(double...
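
To confirm which metric columns a profile provides, list the DataFrame columns directly (plain pandas):

tk_cap.dataframe.columns.tolist()
# -> ['nid', 'time', 'time (gpu)', 'name']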

4. Add NCU Data

The Thicket add_ncu function takes one required argument and one optional argument. The required argument, ncu_report_mapping, is the mapping from each NCU report file to the corresponding calltree profile run. The optional argument, chosen_metrics, allows for a subselection of the NCU performance metrics to add, since there can be hundreds of them. By default, all metrics are added.

[4]:
# Add NCU to thicket
ncu_metrics = [
    "gpu__time_duration.sum",
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    "smsp__maximum_warps_avg_per_active_cycle",
]
# Add in metrics
tk_cap.add_ncu(
    ncu_report_mapping=ncu_report_mapping,
    chosen_metrics=ncu_metrics,
)
tk_cap.dataframe.head(20)
Processing action 600/601: 100%|██████████| 601/601 [00:15<00:00, 39.09it/s]
Processing action 600/601: 100%|██████████| 601/601 [00:01<00:00, 375.43it/s]
[4]:
nid time time (gpu) name gpu__time_duration.sum sm__throughput.avg.pct_of_peak_sustained_elapsed smsp__maximum_warps_avg_per_active_cycle
node profile
{'name': 'RAJAPerf', 'type': 'function'} 457195964 23.0 0.000615 NaN RAJAPerf NaN NaN NaN
528105777 23.0 0.000596 NaN RAJAPerf NaN NaN NaN
{'name': 'Algorithm', 'type': 'function'} 457195964 164.0 0.000024 NaN Algorithm NaN NaN NaN
528105777 164.0 0.000024 NaN Algorithm NaN NaN NaN
{'name': 'Algorithm_MEMCPY', 'type': 'function'} 457195964 168.0 0.000017 NaN Algorithm_MEMCPY NaN NaN NaN
528105777 168.0 0.000017 NaN Algorithm_MEMCPY NaN NaN NaN
{'name': 'cudaDeviceSynchronize', 'type': 'function'} 457195964 170.0 0.000061 NaN cudaDeviceSynchronize NaN NaN NaN
528105777 170.0 0.000039 NaN cudaDeviceSynchronize NaN NaN NaN
{'name': 'cudaLaunchKernel', 'type': 'function'} 457195964 169.0 0.000031 NaN cudaLaunchKernel NaN NaN NaN
528105777 169.0 0.000032 NaN cudaLaunchKernel NaN NaN NaN
{'name': 'void rajaperf::algorithm::memcpy<128ul>(double*, double*, long)', 'type': 'kernel'} 457195964 225.0 NaN 0.000051 void rajaperf::algorithm::memcpy<128ul>(double... 43232.0 6.521123 16.0
528105777 225.0 NaN 0.000031 void rajaperf::algorithm::memcpy<128ul>(double... 22880.0 6.294607 16.0
{'name': 'Algorithm_MEMSET', 'type': 'function'} 457195964 165.0 0.000015 NaN Algorithm_MEMSET NaN NaN NaN
528105777 165.0 0.000014 NaN Algorithm_MEMSET NaN NaN NaN
{'name': 'cudaDeviceSynchronize', 'type': 'function'} 457195964 167.0 0.000043 NaN cudaDeviceSynchronize NaN NaN NaN
528105777 167.0 0.000030 NaN cudaDeviceSynchronize NaN NaN NaN
{'name': 'cudaLaunchKernel', 'type': 'function'} 457195964 166.0 0.000030 NaN cudaLaunchKernel NaN NaN NaN
528105777 166.0 0.000029 NaN cudaLaunchKernel NaN NaN NaN
{'name': 'void rajaperf::algorithm::memset<128ul>(double*, double, long)', 'type': 'kernel'} 457195964 224.0 NaN 0.000033 void rajaperf::algorithm::memset<128ul>(double... 31648.0 7.531866 16.0
528105777 224.0 NaN 0.000020 void rajaperf::algorithm::memset<128ul>(double... 18016.0 6.692635 16.0

5. Visualize the NCU Performance Data on the Calltree

[5]:
print(tk_cap.tree(
    metric_column="sm__throughput.avg.pct_of_peak_sustained_elapsed",
    expand_name=True,
    ))
  _____ _     _      _        _
 |_   _| |__ (_) ___| | _____| |_
   | | | '_ \| |/ __| |/ / _ \ __|
   | | | | | | | (__|   <  __/ |_
   |_| |_| |_|_|\___|_|\_\___|\__|  v2024.2.1

nan RAJAPerf
├─ nan Algorithm
│  ├─ nan Algorithm_MEMCPY
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 6.521 void rajaperf::algorithm::memcpy<128ul>(double*, double*, long)
│  └─ nan Algorithm_MEMSET
│     ├─ nan cudaDeviceSynchronize
│     └─ nan cudaLaunchKernel
│        └─ 7.532 void rajaperf::algorithm::memset<128ul>(double*, double, long)
├─ nan Apps
│  ├─ nan Apps_DEL_DOT_VEC_2D
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 26.952 void rajaperf::apps::deldotvec2d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, long*, double, double, long)
│  ├─ nan Apps_EDGE3D
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Apps_ENERGY
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     ├─ 4.291 void rajaperf::apps::energycalc1<128ul>(double*, double*, double*, double*, double*, double*, long)
│  │     ├─ 7.512 void rajaperf::apps::energycalc2<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double, long)
│  │     ├─ 3.977 void rajaperf::apps::energycalc3<128ul>(double*, double*, double*, double*, double*, double*, long)
│  │     ├─ 6.292 void rajaperf::apps::energycalc4<128ul>(double*, double*, double, double, long)
│  │     ├─ 6.911 void rajaperf::apps::energycalc5<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double, double, double, long)
│  │     └─ 8.058 void rajaperf::apps::energycalc6<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double, double, long)
│  ├─ nan Apps_FIR
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 46.164 void rajaperf::apps::fir<128ul>(double*, double*, long, long)
│  ├─ nan Apps_LTIMES
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Apps_LTIMES_NOVIEW
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Apps_MATVEC_3D_STENCIL
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 6.912 void rajaperf::apps::matvec_3d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, long*, long, long)
│  ├─ nan Apps_NODAL_ACCUMULATION_3D
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 7.249 void rajaperf::apps::nodal_accumulation_3d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, long*, long, long)
│  ├─ nan Apps_PRESSURE
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     ├─ 7.532 void rajaperf::apps::pressurecalc1<128ul>(double*, double*, double, long)
│  │     └─ 7.175 void rajaperf::apps::pressurecalc2<128ul>(double*, double*, double*, double*, double, double, double, long)
│  ├─ nan Apps_VOL3D
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 34.543 void rajaperf::apps::vol3d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double, long, long)
│  └─ nan Apps_ZONAL_ACCUMULATION_3D
│     ├─ nan cudaDeviceSynchronize
│     └─ nan cudaLaunchKernel
│        └─ 11.787 void rajaperf::apps::zonal_accumulation_3d<128ul>(double*, double*, double*, double*, double*, double*, double*, double*, double*, long*, long, long)
├─ nan Basic
│  ├─ nan Basic_ARRAY_OF_PTRS
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Basic_COPY8
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Basic_DAXPY
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 5.275 void rajaperf::basic::daxpy<128ul>(double*, double*, double, long)
│  ├─ nan Basic_DAXPY_ATOMIC
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 4.733 void rajaperf::basic::daxpy_atomic<128ul>(double*, double*, double, long)
│  ├─ nan Basic_IF_QUAD
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 12.770 void rajaperf::basic::ifquad<128ul>(double*, double*, double*, double*, double*, long)
│  ├─ nan Basic_INIT3
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 4.232 void rajaperf::basic::init3<128ul>(double*, double*, double*, double*, double*, long)
│  ├─ nan Basic_INIT_VIEW1D
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 8.887 void rajaperf::basic::initview1d<128ul>(double*, double, long)
│  ├─ nan Basic_INIT_VIEW1D_OFFSET
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 8.892 void rajaperf::basic::initview1d_offset<128ul>(double*, double, long, long)
│  ├─ nan Basic_MULADDSUB
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 4.515 void rajaperf::basic::muladdsub<128ul>(double*, double*, double*, double*, double*, long)
│  ├─ nan Basic_NESTED_INIT
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 17.655 void rajaperf::basic::nested_init<32ul, 4ul, 1ul>(double*, long, long, long)
│  └─ nan Basic_PI_ATOMIC
│     └─ nan cudaDeviceSynchronize
├─ nan Comm
│  └─ nan Comm_HALO_PACKING
│     ├─ nan cudaDeviceSynchronize
│     ├─ nan cudaLaunchKernel
│     │  ├─ 0.126 void rajaperf::comm::halo_packing_pack<128ul>(double*, int*, double*, long)
│     │  └─ 0.160 void rajaperf::comm::halo_packing_unpack<128ul>(double*, int*, double*, long)
│     └─ nan cudaStreamSynchronize
├─ nan Lcals
│  ├─ nan Lcals_DIFF_PREDICT
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 2.277 void rajaperf::lcals::diff_predict<128ul>(double*, double*, long, long)
│  ├─ nan Lcals_EOS
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 9.121 void rajaperf::lcals::eos<128ul>(double*, double*, double*, double*, double, double, double, long)
│  ├─ nan Lcals_FIRST_DIFF
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 7.521 void rajaperf::lcals::first_diff<128ul>(double*, double*, long)
│  ├─ nan Lcals_FIRST_SUM
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 8.251 void rajaperf::lcals::first_sum<128ul>(double*, double*, long)
│  ├─ nan Lcals_GEN_LIN_RECUR
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     ├─ 4.580 void rajaperf::lcals::genlinrecur1<128ul>(double*, double*, double*, double*, long, long)
│  │     └─ 5.087 void rajaperf::lcals::genlinrecur2<128ul>(double*, double*, double*, double*, long, long)
│  ├─ nan Lcals_HYDRO_1D
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 6.563 void rajaperf::lcals::hydro_1d<128ul>(double*, double*, double*, double, double, double, long)
│  ├─ nan Lcals_HYDRO_2D
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     ├─ 14.008 void rajaperf::lcals::hydro_2d1<32ul, 4ul>(double*, double*, double*, double*, double*, double*, long, long)
│  │     ├─ 9.808 void rajaperf::lcals::hydro_2d2<32ul, 4ul>(double*, double*, double*, double*, double*, double*, double, long, long)
│  │     └─ 6.551 void rajaperf::lcals::hydro_2d3<32ul, 4ul>(double*, double*, double*, double*, double*, double*, double, long, long)
│  ├─ nan Lcals_INT_PREDICT
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 5.497 void rajaperf::lcals::int_predict<128ul>(double*, double, double, double, double, double, double, double, double, long, long)
│  ├─ nan Lcals_PLANCKIAN
│  │  └─ nan cudaDeviceSynchronize
│  └─ nan Lcals_TRIDIAG_ELIM
│     ├─ nan cudaDeviceSynchronize
│     └─ nan cudaLaunchKernel
│        └─ 5.423 void rajaperf::lcals::tridiag_elim<128ul>(double*, double*, double*, double*, long)
├─ nan Polybench
│  ├─ nan Polybench_2MM
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Polybench_3MM
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Polybench_ADI
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Polybench_ATAX
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     ├─ 0.642 void rajaperf::polybench::poly_atax_1<128ul>(double*, double*, double*, double*, long)
│  │     └─ 0.455 void rajaperf::polybench::poly_atax_2<128ul>(double*, double*, double*, long)
│  ├─ nan Polybench_FDTD_2D
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Polybench_FLOYD_WARSHALL
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Polybench_GEMM
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Polybench_GEMVER
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Polybench_GESUMMV
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     └─ 0.595 void rajaperf::polybench::poly_gesummv<128ul>(double*, double*, double*, double*, double, double, long)
│  ├─ nan Polybench_HEAT_3D
│  │  └─ nan cudaDeviceSynchronize
│  ├─ nan Polybench_JACOBI_1D
│  │  ├─ nan cudaDeviceSynchronize
│  │  └─ nan cudaLaunchKernel
│  │     ├─ 10.901 void rajaperf::polybench::poly_jacobi_1D_1<128ul>(double*, double*, long)
│  │     └─ 10.883 void rajaperf::polybench::poly_jacobi_1D_2<128ul>(double*, double*, long)
│  ├─ nan Polybench_JACOBI_2D
│  │  └─ nan cudaDeviceSynchronize
│  └─ nan Polybench_MVT
│     ├─ nan cudaDeviceSynchronize
│     └─ nan cudaLaunchKernel
│        ├─ 0.667 void rajaperf::polybench::poly_mvt_1<128ul>(double*, double*, double*, long)
│        └─ 0.378 void rajaperf::polybench::poly_mvt_2<128ul>(double*, double*, double*, long)
└─ nan Stream
   ├─ nan Stream_ADD
   │  ├─ nan cudaDeviceSynchronize
   │  └─ nan cudaLaunchKernel
   │     └─ 5.780 void rajaperf::stream::add<128ul>(double*, double*, double*, long)
   ├─ nan Stream_COPY
   │  ├─ nan cudaDeviceSynchronize
   │  └─ nan cudaLaunchKernel
   │     └─ 6.787 void rajaperf::stream::copy<128ul>(double*, double*, long)
   ├─ nan Stream_MUL
   │  ├─ nan cudaDeviceSynchronize
   │  └─ nan cudaLaunchKernel
   │     └─ 7.199 void rajaperf::stream::mul<128ul>(double*, double*, double, long)
   └─ nan Stream_TRIAD
      ├─ nan cudaDeviceSynchronize
      └─ nan cudaLaunchKernel
         └─ 5.789 void rajaperf::stream::triad<128ul>(double*, double*, double*, double, long)
nan cudaDeviceSynchronize
nan cudaFree
nan cudaFreeHost
nan cudaGetDevice
nan cudaGetSymbolAddress
nan cudaHostAlloc
nan cudaLaunchKernel
└─ nan void rajaperf::basic::daxpy<128ul>(double*, double*, double, long)
nan cudaMalloc
nan cudaMallocManaged
nan cudaMemAdvise
nan cudaMemcpy
└─ nan memcpy
nan cudaMemcpyAsync
└─ nan memcpy
nan cudaStreamCreate

Legend (Metric: sm__throughput.avg.pct_of_peak_sustained_elapsed Min: 0.13 Max: 46.16 indices: {'profile': 457195964})
41.56 - 46.16
32.35 - 41.56
23.14 - 32.35
13.94 - 23.14
4.73 - 13.94
0.13 - 4.73

name User code     Only in left graph     Only in right graph

6. Create Instruction Roofline Plots

We can make roofline plots from the metrics collected with Nsight Compute. The roofline model sets an upper bound on the performance of a kernel depending on its operational intensity. We will use roofline plots to understand the performance of RAJAPerf kernels (a minimal sketch of the roofline bound appears after the references below).

In this section, we reproduce some of the analysis and visualizations from the paper:

Olga Pearce, Jason Burmark, Rich Hornung, Befikir Bogale, Ian Lumsden, Michael McKinsey, Dewi Yokelson, David Boehme, Stephanie Brink, Michela Taufer, and Tom Scogland. “RAJA Performance Suite: Performance Portability Analysis with Caliper and Thicket”. SC-W 2024: Workshops of ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis. Performance, Portability & Productivity in HPC. 2024.

Instruction Roofline Models were introduced by Ding et al. to better characterize GPU workloads by looking at instruction intensity. Below, we walk through the creation of instruction roofline models for the Apps (Applications) group of the RAJAPerf kernels.

More references on methodology for roofline models:

  • Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (April 2009), 65–76. https://doi.org/10.1145/1498765.1498785

  • Nan Ding, Muaaz Awan, Samuel Williams. 2022. Instruction Roofline: An insightful visual performance model for GPUs. Concurrency and Computation: Practice and Experience https://doi.org/10.1002/cpe.6591
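
At its core, a roofline bound is the minimum of the compute ceiling and the product of memory bandwidth and intensity. Here is a minimal sketch of that bound, using the illustrative ceiling and bandwidth values that also appear in the plotting code below:

def roofline_bound(intensity, peak_gips, bandwidth):
    # Attainable performance (warp GIPS) at a given instruction intensity,
    # measured in warp instructions per memory transaction.
    return min(peak_gips, bandwidth * intensity)

# 2 warp instructions/transaction against the HBM roof (25.9 GTXN/s) and a
# 489.6 warp GIPS compute ceiling -> 51.8 GIPS, i.e., memory-bound.
print(roofline_bound(2.0, peak_gips=489.6, bandwidth=25.9))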

Step 1 is to create a Thicket using a CUDA Activity Profile, and add the NCU data. We use RAJA_CUDA variant data of the block_256 tuning.

Command to generate the Caliper CUDA Activity Profile:

CALI_CONFIG=cuda-activity-profile,output.format=cali lrun -n 1 --smpiargs="-disable_gpu_hooks" bin/raja-perf.exe --variants RAJA_CUDA --tunings block_256 --size 8388608

Command to generate the NCU report with Roofline metrics:

CALI_SERVICES_ENABLE=nvtx lrun -n 1 --smpiargs="-disable_gpu_hooks" ncu \
--nvtx --set default \
--export report \
--metrics sm__sass_thread_inst_executed.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum,l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum,lts__t_sectors_op_read.sum,lts__t_sectors_op_write.sum,lts__t_sectors_op_atom.sum,lts__t_sectors_op_red.sum,dram__sectors_read.sum,dram__sectors_write.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum \
--replay-mode application \
bin/raja-perf.exe --variants RAJA_CUDA --tunings block_256 --size 8388608
[6]:
roofline_cali = ncu_dir + "roofline/block_256_profile.cali"
roofline_ncu = ncu_dir + "roofline/block_256_profile.ncu-rep"

# Unzip files, unless they have already been unzipped
if not (os.path.isfile(roofline_cali) and os.path.isfile(roofline_ncu)):
    import zipfile
    with zipfile.ZipFile(ncu_dir + "roofline/block_256_profile.zip", "r") as zip_ref:
        zip_ref.extractall(ncu_dir + "roofline/")

# Create Thicket and add NCU data
tk_roofline = tt.Thicket.from_caliperreader(roofline_cali)
tk_roofline.add_ncu({roofline_ncu: roofline_cali})
Processing action 3444/3445: 100%|██████████| 3445/3445 [00:42<00:00, 81.94it/s]

Select a subset of kernels to plot on the roofline. Here we are just selecting Apps kernels.

[7]:
# Options (can be more than one): ["Algorithm", "Apps", "Basic", "Comm", "Lcals", "Polybench", "Stream"]
kernel_types = ["Apps"]

raja_kernel_query = (
    tt.query.Query()
    .match(
        ".",
        lambda row: row["name"].apply(
            lambda x: any([x.startswith(c + "_") for c in kernel_types])
        ).all()
    )
    .rel("*")
)

pruned_th = tk_roofline.query(raja_kernel_query)

Next, we aggregate the NCU metrics for kernels with multiple instances: for each top-level kernel (e.g., Apps_ENERGY, which launches several device kernels), we sum each chosen metric over its measured child kernels.

[8]:
# Match parent kernel name, e.g. "Apps_VOL3D"
kernel_query = """
MATCH (".",p)->("*")
WHERE p."name" = "{ker}"
"""

# Match demangled kernel that contains NCU measurement.
child_kernel_query = """
MATCH (".", p)
WHERE p."depth" = 2
"""

# Columns to aggregate (everything except the identifying and inclusive-time columns)
cols = [col for col in pruned_th.dataframe.columns.tolist() if col not in ["nid", "time", "name", "time (inc)", "time (gpu) (inc)"]]
leaves = []

for n in pruned_th.graph.roots:
    kernels = {}

    kernels["name"] = n.frame.get("name")

    tmp = kernel_query.format(ker=n.frame.get("name"))
    ker_th = pruned_th.query(tmp, multi_index_mode="all")
    leaf = ker_th.query(child_kernel_query, multi_index_mode="all")
    for col in cols:
        kernels[col] = leaf.dataframe[col].sum()

    leaves.append(kernels)

agg_df = pd.DataFrame(data=leaves)
agg_df
[8]:
name time (gpu) c2clink__enabled_mask c2clink__present device__attribute_architecture device__attribute_async_engine_count device__attribute_can_flush_remote_writes device__attribute_can_map_host_memory device__attribute_can_tex2d_gather device__attribute_can_use_64_bit_stream_mem_ops_v1 ... sm__sass_thread_inst_executed.sum smsp__inst_executed.sum smsp__inst_executed_op_global_ld.sum smsp__inst_executed_op_global_st.sum smsp__inst_executed_op_local_ld.sum smsp__inst_executed_op_local_st.sum smsp__inst_executed_op_shared_ld.sum smsp__inst_executed_op_shared_st.sum smsp__inst_executed_pipe_tensor.sum smsp__maximum_warps_avg_per_active_cycle
0 Apps_DEL_DOT_VEC_2D 0.000527 0.0 0.0 320.0 4.0 1.0 1.0 1.0 1.0 ... 1.165767e+09 3.643023e+07 4455496.0 262088.0 0.000000e+00 0.0 0.0 0.0 0.0 10.0
1 Apps_EDGE3D 0.468591 0.0 0.0 320.0 4.0 1.0 1.0 1.0 1.0 ... 1.898933e+11 5.934178e+09 6587736.0 274489.0 1.089172e+09 715043845.0 0.0 0.0 0.0 2.0
2 Apps_ENERGY 0.002247 0.0 0.0 1920.0 24.0 6.0 6.0 6.0 6.0 ... 1.811939e+09 5.662310e+07 5767168.0 1572864.0 0.000000e+00 0.0 0.0 0.0 0.0 96.0
3 Apps_FIR 0.000178 0.0 0.0 320.0 4.0 1.0 1.0 1.0 1.0 ... 1.115685e+09 3.486515e+07 4194304.0 262144.0 0.000000e+00 0.0 0.0 0.0 0.0 16.0
4 Apps_LTIMES 0.002227 0.0 0.0 320.0 4.0 1.0 1.0 1.0 1.0 ... 3.340763e+09 1.216348e+08 18874368.0 8388608.0 0.000000e+00 0.0 0.0 0.0 0.0 16.0
5 Apps_LTIMES_NOVIEW 0.002234 0.0 0.0 320.0 4.0 1.0 1.0 1.0 1.0 ... 2.900361e+09 1.042022e+08 18874368.0 8388608.0 0.000000e+00 0.0 0.0 0.0 0.0 16.0
6 Apps_MATVEC_3D_STENCIL 0.002154 0.0 0.0 320.0 4.0 1.0 1.0 1.0 1.0 ... 1.731645e+09 5.411398e+07 14378100.0 261420.0 0.000000e+00 0.0 0.0 0.0 0.0 4.0
7 Apps_NODAL_ACCUMULATION_3D 0.000556 0.0 0.0 320.0 4.0 1.0 1.0 1.0 1.0 ... 3.597146e+08 1.124110e+07 522840.0 0.0 0.000000e+00 0.0 0.0 0.0 0.0 16.0
8 Apps_PRESSURE 0.000507 0.0 0.0 640.0 8.0 2.0 2.0 2.0 2.0 ... 5.117051e+08 1.599078e+07 1048576.0 786432.0 0.000000e+00 0.0 0.0 0.0 0.0 32.0
9 Apps_VOL3D 0.000380 0.0 0.0 320.0 4.0 1.0 1.0 1.0 1.0 ... 1.317547e+09 4.117341e+07 6587736.0 274489.0 0.000000e+00 0.0 0.0 0.0 0.0 8.0
10 Apps_ZONAL_ACCUMULATION_3D 0.000268 0.0 0.0 320.0 4.0 1.0 1.0 1.0 1.0 ... 4.182726e+08 1.307104e+07 2352780.0 261420.0 0.000000e+00 0.0 0.0 0.0 0.0 16.0

11 rows × 239 columns

Calculate instruction intensities from the existing NCU metrics.

[9]:
# Warp Instructions
agg_df["Warp Instructions"] = agg_df["sm__sass_thread_inst_executed.sum"] / 32
# L1 Global Count Memory Transactions
agg_df["L1 (GLOBAL)"] = agg_df["l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum"] + agg_df["l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum"]
agg_df["L1 (SHARED)"] = agg_df["l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum"] + agg_df["l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum"]
agg_df["Total L1 Transactions"] = agg_df["L1 (GLOBAL)"] + (4 * agg_df["L1 (SHARED)"])
# L2 Count Memory Transactions
agg_df["L2 Write Transactions"] = agg_df["lts__t_sectors_op_write.sum"] + agg_df["lts__t_sectors_op_atom.sum"] + agg_df["lts__t_sectors_op_red.sum"]
agg_df["L2 Read Transactions"] = agg_df["lts__t_sectors_op_read.sum"] + agg_df["lts__t_sectors_op_atom.sum"] + agg_df["lts__t_sectors_op_red.sum"]
agg_df["Total L2 Transactions"] = agg_df["L2 Read Transactions"] + agg_df["L2 Write Transactions"]
# HBM Count Memory Transactions
agg_df["HBM Transactions"] = agg_df["dram__sectors_read.sum"] +  agg_df["dram__sectors_write.sum"]
# L1, L2, HBM Intensities
agg_df["L1 Instruction Intensity"] = agg_df["Warp Instructions"] / agg_df["Total L1 Transactions"]
agg_df["L2 Instruction Intensity"] = agg_df["Warp Instructions"] / agg_df["Total L2 Transactions"]
agg_df["HBM Instruction Intensity"] = agg_df["Warp Instructions"] / agg_df["HBM Transactions"]
# Performance in GIPS
agg_df["Performance GIPS"] = agg_df["Warp Instructions"] / (agg_df["time (gpu)"] * (10 ** 9))

Plotting Roofline with Matplotlib

The plotting code below was adapted from the following resource: https://gitlab.com/NERSC/roofline-on-nvidia-gpus/-/blob/master/custom-scripts/roofline.py

[14]:
pruned_th_kers = tk_roofline.query(child_kernel_query, multi_index_mode="all")
metrics = ["L1 Instruction Intensity", "L2 Instruction Intensity", "HBM Instruction Intensity"]

c_s = ["red", "blue", "green", "orange", "purple", "cyan", "magenta"]
final_colors = []
for i in agg_df["name"].tolist():
    for j in range(0, len(kernel_types)):
        if i.startswith(kernel_types[j]):
            final_colors.append(c_s[j])
            break  # one color per kernel; stop at the first matching type

font = {"size": 15}
plt.rc("font", **font)

colors = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple", "tab:brown", "tab:pink", "tab:gray", "tab:olive", "tab:cyan"]
styles = ["o", "s", "v", "^", "D", ">", "<", "*", "h", "H", "+", "1", "2", "3", "4", "8", "p", "d", "|", "_", ".", ","]

markersize = 10
markerwidth = 2
maxchar = 25

def roofline(LABELS, flag="HBM", data_df=None):
    LABELS = [x[:maxchar] for x in LABELS]

    bandwidth_hbm = 25.9  # in GTXN/s
    bandwidth_l2 = 93.6  # in GTXN/s
    bandwidth_l1 = 437.5  # in GTXN/s

    if flag == "L1":
        memRoofs = [("L1", bandwidth_l1)]
    elif flag == "L2":
        memRoofs = [("L2", bandwidth_l2)]
    elif flag == "HBM":
        memRoofs = [("HBM", bandiwdth_hbm)]
    elif flag == "all":
        memRoofs = [("L1", bandwidth_l1), ("L2", bandwidth_l2),  ("HBM", bandiwdth_hbm)]

    cmpRoofs = [("GIPS", 489.6)]

    fig = plt.figure(1, figsize=(10,6))
    plt.clf()
    ax = fig.gca()
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Instruction Intensity [Warp Instructions/transaction]")
    ax.set_ylabel("Performance [Warp GIPS]")

    nx   = 10000
    xmin = -3
    xmax = 3
    ymin = 1
    ymax = 1000

    ax.set_xlim(10 ** xmin, 10 ** xmax)
    ax.set_ylim(ymin, ymax)

    ixx = int(nx * 0.02)
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    scomp_x_elbow  = []
    scomp_ix_elbow = []
    smem_x_elbow   = []
    smem_ix_elbow  = []

    x = np.logspace(xmin, xmax, nx)
    for roof in cmpRoofs:
        for ix in range(1, nx):
            if float(memRoofs[0][1] * x[ix]) >= roof[1] and (memRoofs[0][1] * x[ix - 1]) < roof[1]:
                scomp_x_elbow.append(x[ix - 1])
                scomp_ix_elbow.append(ix - 1)
                break

    for roof in memRoofs:
        for ix in range(1, nx):
            if (cmpRoofs[0][1] <= roof[1] * x[ix] and cmpRoofs[0][1] > roof[1] * x[ix - 1]):
                smem_x_elbow.append(x[ix - 1])
                smem_ix_elbow.append(ix - 1)
                break

    for i in range(len(cmpRoofs)):
        roof = cmpRoofs[i][1]
        y = np.ones(len(x)) * roof
        ax.plot(x[scomp_ix_elbow[i]:], y[scomp_ix_elbow[i]:], c="k", ls="-", lw="2")

    for i in range(len(memRoofs)):
        roof = memRoofs[i][1]
        y = x * roof
        ax.plot(x[:smem_ix_elbow[i] + 1], y[:smem_ix_elbow[i] + 1], c="k", ls="-", lw="2")

    for roof in cmpRoofs:
        ax.text(
            x[-ixx],
            roof[1],
            roof[0] + ": " + "{0:.1f}".format(roof[1]) + " Warp GIPS",
            horizontalalignment="right",
            verticalalignment="bottom"
        )

    for roof in memRoofs:
        ang = np.arctan(
            np.log10(xlim[1] / xlim[0]) / np.log10(ylim[1] / ylim[0])
            * fig.get_size_inches()[1] / fig.get_size_inches()[0]
        )
        if x[ixx] * roof[1] > ymin:
            ax.text(
                x[ixx],
                x[ixx] * roof[1] * (1 + 0.25 * np.sin(ang) ** 2),
                roof[0] + ": " + "{0:.1f}".format(float(roof[1])) + " GTXN/s",
                horizontalalignment="left",
                verticalalignment="bottom",
                rotation=180 / np.pi * ang
            )
        else:
            ymin_ix_elbow = []
            ymin_x_elbow = []
            for ix in range(1, nx):
                if (ymin <= roof[1] * x[ix] and ymin > roof[1] * x[ix - 1]):
                    ymin_x_elbow.append(x[ix - 1])
                    ymin_ix_elbow.append(ix - 1)
                    break
            ax.text(
                x[ixx+ymin_ix_elbow[0]],
                x[ixx+ymin_ix_elbow[0]] * roof[1] * (1 + 0.25 * np.sin(ang) ** 2) * 1.15,
                roof[0] + ": " + "{0:.1f}".format(float(roof[1])) + " GTXN/s",
                horizontalalignment="left",
                verticalalignment="bottom",
                rotation=180 / np.pi * ang
            )

    if flag == "L1":
        ax.scatter(data_df[metrics[0]], data_df["Performance GIPS"], c=final_colors, label="L1", marker=styles[0])
    elif flag == "L2":
        ax.scatter(data_df[metrics[1]], data_df["Performance GIPS"], c=final_colors, label="L2", marker=styles[1])
    elif flag == "HBM":
        ax.scatter(data_df[metrics[2]], data_df["Performance GIPS"], c=final_colors, label="HBM", marker="*")

    elif flag == "all":
        ax.scatter(data_df[metrics[0]], data_df["Performance GIPS"], c=final_colors, label="L1", marker=styles[0])
        ax.scatter(data_df[metrics[1]], data_df["Performance GIPS"], c=final_colors, label="L2", marker=styles[1])
        ax.scatter(data_df[metrics[2]], data_df["Performance GIPS"], c=final_colors, label="HBM", marker="*")

    custom_labels = [k.split("_")[0] for k in kernel_types]
    custom_handles = [plt.Line2D([0], [0], color=i, lw=2) for i in c_s]

    if flag == "HBM":
        leg2 = ax.legend(
            custom_handles,
            custom_labels,
            bbox_to_anchor=(0.36, 1),
            title="Kernel Types",
            fontsize="12",          # Adjust label font size (e.g., "small", "medium", "large", or specific size)
            title_fontsize="small"
        )
    else:
        leg2 = ax.legend(
            custom_handles,
            custom_labels,
            bbox_to_anchor=(0.36, 1),
            title="Kernel Types",
            fontsize="12",          # Adjust label font size (e.g., "small", "medium", "large", or specific size)
            title_fontsize="small"
        )

    ax.add_artist(leg2)

    ax.legend(loc="upper left")
    plt.show()
[15]:
roofline(LABELS=pruned_th.dataframe["name"].tolist(), flag="all", data_df=agg_df)
[Figure: instruction roofline for the Apps kernels, all three memory levels (L1, L2, HBM)]
[16]:
roofline(LABELS=pruned_th.dataframe["name"].tolist(), flag="L1", data_df=agg_df)
[Figure: instruction roofline for the Apps kernels, L1 level]
[17]:
roofline(LABELS=pruned_th.dataframe["name"].tolist(), flag="L2", data_df=agg_df)
[Figure: instruction roofline for the Apps kernels, L2 level]
[18]:
roofline(LABELS=pruned_th.dataframe["name"].tolist(), flag="HBM", data_df=agg_df)
[Figure: instruction roofline for the Apps kernels, HBM level]