Using Groupby-Aggregate to Compose Multi-Run Datasets: Thicket Tutorial

Thicket is a Python-based toolkit for Exploratory Data Analysis (EDA) of parallel performance data that enables performance optimization and understanding of applications' performance on supercomputers. It bridges the gap between performance tools that consider only a single instance of a simulation run (e.g., a single platform, measurement tool, or scale) and the actionable insights hidden in multi-dimensional, multi-scale, multi-architecture, and multi-tool performance datasets.

1. Import Necessary Packages

[1]:
from glob import glob
import numpy as np
from IPython.display import display
from IPython.display import HTML

import thicket as th

display(HTML("<style>.container { width:80% !important; }</style>"))
[2]:
# Suppress pandas FutureWarnings (about upcoming pandas 3.0 changes) for now
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

2. Define Dataset Paths and Names

In this example, we load two repeat runs generated on Lassen. We can use glob to find all of the Caliper (.cali) files in a given directory.

[3]:
data = glob("../data/lassen/clang10.0.1_nvcc10.2.89_1048576/**/*.cali", recursive=True)
tk = th.Thicket.from_caliperreader(data, disable_tqdm=True)
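Each .cali file becomes one profile in the thicket, and its run parameters are recorded in the metadata table (tk.metadata, a pandas DataFrame). As a quick sanity check before grouping, we can peek at the columns we are about to group on; a minimal sketch, assuming the variant and tuning metadata columns are present (as they are for these Caliper files):

# Each row of the metadata table corresponds to one loaded profile;
# these two columns are the parameters varied across the runs.
display(tk.metadata[["variant", "tuning"]])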

3. Groupby

Group by the unique combinations of variant and tuning from the metadata table. In general, these will be the parameters you varied across your runs.

After performing the groupby, we can see that each resulting thicket contains multiple profiles. Before we can perform certain composition operations in Thicket, we need to aggregate the performance data (Thicket.dataframe) across those profiles.

[4]:
gb = tk.groupby(["variant", "tuning"])
4  thickets created...
{('Base_CUDA', 'block_1024'): <thicket.thicket.Thicket object at 0xffff35fa1400>,
 ('Base_CUDA', 'block_128'): <thicket.thicket.Thicket object at 0xffff35cf2790>,
 ('Base_CUDA', 'block_256'): <thicket.thicket.Thicket object at 0xffff35d7ef40>,
 ('Base_CUDA', 'block_512'): <thicket.thicket.Thicket object at 0xffff35d2c8e0>}
[5]:
for key, ttk in gb.items():
    print(f"key {key} contains {len(ttk.profile)} profiles")
key ('Base_CUDA', 'block_1024') contains 2 profiles
key ('Base_CUDA', 'block_128') contains 2 profiles
key ('Base_CUDA', 'block_256') contains 2 profiles
key ('Base_CUDA', 'block_512') contains 2 profiles
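The GroupBy object also supports dictionary-style lookup, so a single sub-thicket can be retrieved by its key tuple; a minimal sketch, assuming the object behaves like a dict keyed on the (variant, tuning) tuples (its .items() usage above suggests it does):

# Retrieve the sub-thicket for one (variant, tuning) combination.
ttk_128 = gb[("Base_CUDA", "block_128")]
print(len(ttk_128.profile))  # 2 repeat profiles, as printed above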

4. Aggregation

Using the aggregate_thicket function, we can aggregate each thicket in the groupby object individually. Here we pass np.mean to average the performance data across each group's repeat profiles.

[6]:
gb_agg = {}
for key, ttk in gb.items():
    gb_agg[key] = gb.aggregate_thicket(ttk, np.mean)

display(gb_agg[('Base_CUDA', 'block_128')].dataframe)
node | variant | tuning | nid_mean | Min time/rank_mean | Max time/rank_mean | Avg time/rank_mean | Total time_mean | BlockSize_mean | Bytes/Rep_mean | Flops/Rep_mean | Iterations/Rep_mean | Kernels/Rep_mean | ProblemSize_mean | Reps_mean | spot.channel | name
{'name': 'RAJAPerf', 'type': 'function'} | Base_CUDA | block_128 | 1.0 | 1.779628 | 1.779628 | 1.779628 | 1.779628 | 128.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
{'name': 'Algorithm', 'type': 'function'} | Base_CUDA | block_128 | 10.0 | 0.006809 | 0.006809 | 0.006809 | 0.006809 | 128.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm
{'name': 'Algorithm_MEMCPY', 'type': 'function'} | Base_CUDA | block_128 | 13.0 | 0.002439 | 0.002439 | 0.002439 | 0.002439 | 128.0 | 1.677722e+07 | 0.000000e+00 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm_MEMCPY
{'name': 'Algorithm_MEMSET', 'type': 'function'} | Base_CUDA | block_128 | 12.0 | 0.001705 | 0.001705 | 0.001705 | 0.001705 | 128.0 | 8.388616e+06 | 0.000000e+00 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm_MEMSET
{'name': 'Algorithm_REDUCE_SUM', 'type': 'function'} | Base_CUDA | block_128 | 11.0 | 0.002642 | 0.002642 | 0.002642 | 0.002642 | 128.0 | 8.388616e+06 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 50.0 | regionprofile | Algorithm_REDUCE_SUM
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
{'name': 'Stream_ADD', 'type': 'function'} | Base_CUDA | block_128 | 54.0 | 0.033593 | 0.033593 | 0.033593 | 0.033593 | 128.0 | 2.516582e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_ADD
{'name': 'Stream_COPY', 'type': 'function'} | Base_CUDA | block_128 | 55.0 | 0.042584 | 0.042584 | 0.042584 | 0.042584 | 128.0 | 1.677722e+07 | 0.000000e+00 | 1048576.0 | 1.0 | 1048576.0 | 1800.0 | regionprofile | Stream_COPY
{'name': 'Stream_DOT', 'type': 'function'} | Base_CUDA | block_128 | 56.0 | 0.108153 | 0.108153 | 0.108153 | 0.108153 | 128.0 | 1.677723e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 2000.0 | regionprofile | Stream_DOT
{'name': 'Stream_MUL', 'type': 'function'} | Base_CUDA | block_128 | 57.0 | 0.042611 | 0.042611 | 0.042611 | 0.042611 | 128.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 1800.0 | regionprofile | Stream_MUL
{'name': 'Stream_TRIAD', 'type': 'function'} | Base_CUDA | block_128 | 58.0 | 0.033648 | 0.033648 | 0.033648 | 0.033648 | 128.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD

67 rows × 14 columns
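Because aggregate_thicket takes the reduction function as an argument, the same loop can summarize the spread across repeat runs rather than their average; a minimal sketch, assuming any NumPy reducer with np.mean's call signature is accepted:

# Aggregate with the standard deviation to gauge run-to-run
# variability across the two repeat profiles in each group.
gb_std = {}
for key, ttk in gb.items():
    gb_std[key] = gb.aggregate_thicket(ttk, np.std)

display(gb_std[("Base_CUDA", "block_128")].dataframe)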

Alternatively, we can call agg to aggregate every thicket and compose the results into a single dataframe in one step.

[7]:
tk_agg = gb.agg(np.mean, disable_tqdm=True)

display(tk_agg.dataframe)
node | variant | tuning | nid_mean | Min time/rank_mean | Max time/rank_mean | Avg time/rank_mean | Total time_mean | BlockSize_mean | Bytes/Rep_mean | Flops/Rep_mean | Iterations/Rep_mean | Kernels/Rep_mean | ProblemSize_mean | Reps_mean | spot.channel | name
{'name': 'RAJAPerf', 'type': 'function'} | Base_CUDA | block_1024 | 1.0 | 2.122934 | 2.122934 | 2.122934 | 2.122934 | 1024.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
{'name': 'RAJAPerf', 'type': 'function'} | Base_CUDA | block_128 | 1.0 | 1.779628 | 1.779628 | 1.779628 | 1.779628 | 128.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
{'name': 'RAJAPerf', 'type': 'function'} | Base_CUDA | block_256 | 1.0 | 1.772165 | 1.772165 | 1.772165 | 1.772165 | 256.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
{'name': 'RAJAPerf', 'type': 'function'} | Base_CUDA | block_512 | 1.0 | 1.838314 | 1.838314 | 1.838314 | 1.838314 | 512.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
{'name': 'Algorithm', 'type': 'function'} | Base_CUDA | block_1024 | 11.0 | 0.006371 | 0.006371 | 0.006371 | 0.006371 | 1024.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
{'name': 'Stream_MUL', 'type': 'function'} | Base_CUDA | block_512 | 57.0 | 0.042775 | 0.042775 | 0.042775 | 0.042775 | 512.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 1800.0 | regionprofile | Stream_MUL
{'name': 'Stream_TRIAD', 'type': 'function'} | Base_CUDA | block_1024 | 60.0 | 0.033749 | 0.033749 | 0.033749 | 0.033749 | 1024.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD
{'name': 'Stream_TRIAD', 'type': 'function'} | Base_CUDA | block_128 | 58.0 | 0.033648 | 0.033648 | 0.033648 | 0.033648 | 128.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD
{'name': 'Stream_TRIAD', 'type': 'function'} | Base_CUDA | block_256 | 64.0 | 0.033649 | 0.033649 | 0.033649 | 0.033649 | 256.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD
{'name': 'Stream_TRIAD', 'type': 'function'} | Base_CUDA | block_512 | 58.0 | 0.033713 | 0.033713 | 0.033713 | 0.033713 | 512.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD

268 rows × 14 columns
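Since tk_agg.dataframe is an ordinary pandas DataFrame, standard pandas operations work on the composed result. For example, a minimal sketch that compares the aggregated mean runtime of one kernel across all four block sizes, using the name and Avg time/rank_mean columns shown above:

# Select the Stream_TRIAD rows and compare mean runtime across tunings.
df = tk_agg.dataframe
triad = df[df["name"] == "Stream_TRIAD"]
print(triad["Avg time/rank_mean"])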