Using Groupby-Aggregate to Compose Multi-Run Datasets: Thicket Tutorial
Thicket is a Python-based toolkit for Exploratory Data Analysis (EDA) of parallel performance data that enables performance optimization and understanding of applications' performance on supercomputers. It bridges the performance tool gap between considering only a single instance of a simulation run (e.g., a single platform, measurement tool, or scale) and finding actionable insights in multi-dimensional, multi-scale, multi-architecture, and multi-tool performance datasets.
1. Import Necessary Packages
[1]:
from glob import glob
import numpy as np
from IPython.display import display
from IPython.display import HTML
import thicket as th
display(HTML("<style>.container { width:80% !important; }</style>"))
[2]:
# Disable the Pandas 3 Future Warnings for now
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
2. Define Dataset Paths and Names
In this example, we load two repeat runs per configuration, generated on the Lassen supercomputer. We can use glob to find all of the Caliper (.cali) files in a given directory.
[3]:
data = glob("../data/lassen/clang10.0.1_nvcc10.2.89_1048576/**/*.cali", recursive=True)
tk = th.Thicket.from_caliperreader(data, disable_tqdm=True)
3. Groupby
Groupby the unique combinations of variant and tuning from the metadata table. In general, these will be the parameters you varied in your runs.
After performing the groupby, we can see that each thicket contains multiple profiles. In order to perform certain composition operations in Thicket, we need to aggregate the performance data (Thicket.dataframe).
[4]:
gb = tk.groupby(["variant", "tuning"])
4 thickets created...
{('Base_CUDA', 'block_1024'): <thicket.thicket.Thicket object at 0xffff35fa1400>, ('Base_CUDA', 'block_128'): <thicket.thicket.Thicket object at 0xffff35cf2790>, ('Base_CUDA', 'block_256'): <thicket.thicket.Thicket object at 0xffff35d7ef40>, ('Base_CUDA', 'block_512'): <thicket.thicket.Thicket object at 0xffff35d2c8e0>}
[5]:
for key, ttk in gb.items():
    print(f"key {key} contains {len(ttk.profile)} profiles")
key ('Base_CUDA', 'block_1024') contains 2 profiles
key ('Base_CUDA', 'block_128') contains 2 profiles
key ('Base_CUDA', 'block_256') contains 2 profiles
key ('Base_CUDA', 'block_512') contains 2 profiles
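Thicket's groupby returns a dict-like mapping from metadata-key tuples to sub-thickets, mirroring how pandas groups a table by multiple columns. The behavior can be sketched with a self-contained pandas analogue (the toy metadata table below is illustrative, not the Lassen dataset):

```python
import pandas as pd

# Toy metadata table: two repeat profiles per (variant, tuning) combination,
# mimicking the repeat runs loaded above.
meta = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "variant": ["Base_CUDA"] * 4,
    "tuning": ["block_128", "block_128", "block_256", "block_256"],
})

# Group by the unique (variant, tuning) combinations; keys are tuples,
# just like the keys of the Thicket groupby object.
groups = {key: df for key, df in meta.groupby(["variant", "tuning"])}

for key, df in groups.items():
    print(f"key {key} contains {len(df)} profiles")
# key ('Base_CUDA', 'block_128') contains 2 profiles
# key ('Base_CUDA', 'block_256') contains 2 profiles
```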
4. Aggregation
Using the aggregate_thicket function, we can aggregate each thicket in the groupby object individually.
[6]:
gb_agg = {}
for key, ttk in gb.items():
    gb_agg[key] = gb.aggregate_thicket(ttk, np.mean)
display(gb_agg[('Base_CUDA', 'block_128')].dataframe)
node | variant | tuning | nid_mean | Min time/rank_mean | Max time/rank_mean | Avg time/rank_mean | Total time_mean | BlockSize_mean | Bytes/Rep_mean | Flops/Rep_mean | Iterations/Rep_mean | Kernels/Rep_mean | ProblemSize_mean | Reps_mean | spot.channel | name
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
{'name': 'RAJAPerf', 'type': 'function'} | Base_CUDA | block_128 | 1.0 | 1.779628 | 1.779628 | 1.779628 | 1.779628 | 128.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf |
{'name': 'Algorithm', 'type': 'function'} | Base_CUDA | block_128 | 10.0 | 0.006809 | 0.006809 | 0.006809 | 0.006809 | 128.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm |
{'name': 'Algorithm_MEMCPY', 'type': 'function'} | Base_CUDA | block_128 | 13.0 | 0.002439 | 0.002439 | 0.002439 | 0.002439 | 128.0 | 1.677722e+07 | 0.000000e+00 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm_MEMCPY |
{'name': 'Algorithm_MEMSET', 'type': 'function'} | Base_CUDA | block_128 | 12.0 | 0.001705 | 0.001705 | 0.001705 | 0.001705 | 128.0 | 8.388616e+06 | 0.000000e+00 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm_MEMSET |
{'name': 'Algorithm_REDUCE_SUM', 'type': 'function'} | Base_CUDA | block_128 | 11.0 | 0.002642 | 0.002642 | 0.002642 | 0.002642 | 128.0 | 8.388616e+06 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 50.0 | regionprofile | Algorithm_REDUCE_SUM |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
{'name': 'Stream_ADD', 'type': 'function'} | Base_CUDA | block_128 | 54.0 | 0.033593 | 0.033593 | 0.033593 | 0.033593 | 128.0 | 2.516582e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_ADD |
{'name': 'Stream_COPY', 'type': 'function'} | Base_CUDA | block_128 | 55.0 | 0.042584 | 0.042584 | 0.042584 | 0.042584 | 128.0 | 1.677722e+07 | 0.000000e+00 | 1048576.0 | 1.0 | 1048576.0 | 1800.0 | regionprofile | Stream_COPY |
{'name': 'Stream_DOT', 'type': 'function'} | Base_CUDA | block_128 | 56.0 | 0.108153 | 0.108153 | 0.108153 | 0.108153 | 128.0 | 1.677723e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 2000.0 | regionprofile | Stream_DOT |
{'name': 'Stream_MUL', 'type': 'function'} | Base_CUDA | block_128 | 57.0 | 0.042611 | 0.042611 | 0.042611 | 0.042611 | 128.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 1800.0 | regionprofile | Stream_MUL |
{'name': 'Stream_TRIAD', 'type': 'function'} | Base_CUDA | block_128 | 58.0 | 0.033648 | 0.033648 | 0.033648 | 0.033648 | 128.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD |
67 rows × 14 columns
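Conceptually, aggregate_thicket collapses the repeat profiles within each sub-thicket by applying the given function across profiles for every tree node, suffixing the aggregated columns with the function name. The underlying operation can be sketched with plain pandas (toy numbers, not the RAJAPerf measurements above):

```python
import pandas as pd

# Toy performance data: two repeat profiles measuring the same two regions.
perf = pd.DataFrame({
    "node": ["RAJAPerf", "RAJAPerf", "Stream_ADD", "Stream_ADD"],
    "profile": [1, 2, 1, 2],
    "Total time": [1.78, 1.80, 0.033, 0.035],
})

# Average across profiles per node, and suffix the aggregated columns
# the way Thicket suffixes them with "_mean" (Thicket accepts np.mean;
# the string "mean" is the equivalent pandas spelling).
agg = (perf.groupby("node")[["Total time"]]
           .agg("mean")
           .add_suffix("_mean"))
print(agg)
```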
We can call agg to aggregate and create a composed dataframe in one step.
[7]:
tk_agg = gb.agg(np.mean, disable_tqdm=True)
display(tk_agg.dataframe)
node | variant | tuning | nid_mean | Min time/rank_mean | Max time/rank_mean | Avg time/rank_mean | Total time_mean | BlockSize_mean | Bytes/Rep_mean | Flops/Rep_mean | Iterations/Rep_mean | Kernels/Rep_mean | ProblemSize_mean | Reps_mean | spot.channel | name
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
{'name': 'RAJAPerf', 'type': 'function'} | Base_CUDA | block_1024 | 1.0 | 2.122934 | 2.122934 | 2.122934 | 2.122934 | 1024.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf |
 | | block_128 | 1.0 | 1.779628 | 1.779628 | 1.779628 | 1.779628 | 128.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
 | | block_256 | 1.0 | 1.772165 | 1.772165 | 1.772165 | 1.772165 | 256.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
 | | block_512 | 1.0 | 1.838314 | 1.838314 | 1.838314 | 1.838314 | 512.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
{'name': 'Algorithm', 'type': 'function'} | Base_CUDA | block_1024 | 11.0 | 0.006371 | 0.006371 | 0.006371 | 0.006371 | 1024.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
{'name': 'Stream_MUL', 'type': 'function'} | Base_CUDA | block_512 | 57.0 | 0.042775 | 0.042775 | 0.042775 | 0.042775 | 512.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 1800.0 | regionprofile | Stream_MUL |
{'name': 'Stream_TRIAD', 'type': 'function'} | Base_CUDA | block_1024 | 60.0 | 0.033749 | 0.033749 | 0.033749 | 0.033749 | 1024.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD |
 | | block_128 | 58.0 | 0.033648 | 0.033648 | 0.033648 | 0.033648 | 128.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD
 | | block_256 | 64.0 | 0.033649 | 0.033649 | 0.033649 | 0.033649 | 256.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD
 | | block_512 | 58.0 | 0.033713 | 0.033713 | 0.033713 | 0.033713 | 512.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD
268 rows × 14 columns
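With a composed multi-index dataframe like tk_agg.dataframe, a natural next step is comparing tunings, e.g. finding the fastest block size for a region. A self-contained pandas sketch on toy data (the index and column names mimic the aggregated table above; the numbers are illustrative):

```python
import pandas as pd

# Toy composed table indexed by (node, variant, tuning), shaped like the
# output of gb.agg above.
idx = pd.MultiIndex.from_tuples(
    [("RAJAPerf", "Base_CUDA", t)
     for t in ["block_1024", "block_128", "block_256", "block_512"]],
    names=["node", "variant", "tuning"],
)
df = pd.DataFrame(
    {"Total time_mean": [2.1229, 1.7796, 1.7722, 1.8383]}, index=idx
)

# For each (node, variant), pick the row with the smallest mean total time;
# idxmin returns the full (node, variant, tuning) label of that row.
best = df.groupby(["node", "variant"])["Total time_mean"].idxmin()
print(best)
```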