Using Groupby-Aggregate to Compose Multi-Run Datasets: Thicket Tutorial
Thicket is a Python-based toolkit for Exploratory Data Analysis (EDA) of parallel performance data that enables performance optimization and understanding of applications' performance on supercomputers. It bridges the performance tool gap between considering only a single instance of a simulation run (e.g., a single platform, measurement tool, or scale) and finding actionable insights in multi-dimensional, multi-scale, multi-architecture, and multi-tool performance datasets.
1. Import Necessary Packages
[1]:
from glob import glob
import numpy as np
from IPython.display import display
from IPython.display import HTML
import thicket as th
display(HTML("<style>.container { width:80% !important; }</style>"))
[2]:
# Disable the Pandas 3 Future Warnings for now
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
2. Define Dataset Paths and Names
In this example, we load two repeat runs per configuration, generated on the Lassen supercomputer. We can use glob to find all of the Caliper (.cali) files in a given directory.
[3]:
data = glob("../data/lassen/clang10.0.1_nvcc10.2.89_1048576/**/*.cali", recursive=True)
tk = th.Thicket.from_caliperreader(data, disable_tqdm=True)
3. Groupby
Groupby the unique combinations of variant and tuning from the metadata table. In general, these will be the parameters you varied in your runs.
After performing the groupby, we can see that each thicket contains multiple profiles. In order to perform certain composition operations in Thicket, we need to aggregate the performance data (Thicket.dataframe).
[4]:
gb = tk.groupby(["variant", "tuning"])
4 thickets created...
{('Base_CUDA', 'block_1024'): <thicket.thicket.Thicket object at 0xffff35fa1400>, ('Base_CUDA', 'block_128'): <thicket.thicket.Thicket object at 0xffff35cf2790>, ('Base_CUDA', 'block_256'): <thicket.thicket.Thicket object at 0xffff35d7ef40>, ('Base_CUDA', 'block_512'): <thicket.thicket.Thicket object at 0xffff35d2c8e0>}
[5]:
for key, ttk in gb.items():
    print(f"key {key} contains {len(ttk.profile)} profiles")
key ('Base_CUDA', 'block_1024') contains 2 profiles
key ('Base_CUDA', 'block_128') contains 2 profiles
key ('Base_CUDA', 'block_256') contains 2 profiles
key ('Base_CUDA', 'block_512') contains 2 profiles
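Thicket's groupby returns a dict-like mapping from metadata-key tuples to sub-thickets, mirroring how pandas groups a table by multiple columns. The behavior can be sketched with a self-contained pandas analogue (the toy metadata table below is illustrative, not the Lassen dataset):

```python
import pandas as pd

# Toy metadata table: two repeat profiles per (variant, tuning) combination,
# mimicking the repeat runs loaded above.
meta = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "variant": ["Base_CUDA"] * 4,
    "tuning": ["block_128", "block_128", "block_256", "block_256"],
})

# Group by the unique (variant, tuning) combinations; keys are tuples,
# just like the keys of the Thicket groupby object.
groups = {key: df for key, df in meta.groupby(["variant", "tuning"])}

for key, df in groups.items():
    print(f"key {key} contains {len(df)} profiles")
# key ('Base_CUDA', 'block_128') contains 2 profiles
# key ('Base_CUDA', 'block_256') contains 2 profiles
```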
4. Aggregation
Using the aggregate_thicket function, we can aggregate each thicket in the groupby object individually.
[6]:
gb_agg = {}
for key, ttk in gb.items():
    gb_agg[key] = gb.aggregate_thicket(ttk, np.mean)
display(gb_agg[('Base_CUDA', 'block_128')].dataframe)
node | variant | tuning | nid_mean | Min time/rank_mean | Max time/rank_mean | Avg time/rank_mean | Total time_mean | BlockSize_mean | Bytes/Rep_mean | Flops/Rep_mean | Iterations/Rep_mean | Kernels/Rep_mean | ProblemSize_mean | Reps_mean | spot.channel | name
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
{'name': 'RAJAPerf', 'type': 'function'} | Base_CUDA | block_128 | 1.0 | 1.779628 | 1.779628 | 1.779628 | 1.779628 | 128.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf |
{'name': 'Algorithm', 'type': 'function'} | Base_CUDA | block_128 | 10.0 | 0.006809 | 0.006809 | 0.006809 | 0.006809 | 128.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm |
{'name': 'Algorithm_MEMCPY', 'type': 'function'} | Base_CUDA | block_128 | 13.0 | 0.002439 | 0.002439 | 0.002439 | 0.002439 | 128.0 | 1.677722e+07 | 0.000000e+00 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm_MEMCPY |
{'name': 'Algorithm_MEMSET', 'type': 'function'} | Base_CUDA | block_128 | 12.0 | 0.001705 | 0.001705 | 0.001705 | 0.001705 | 128.0 | 8.388616e+06 | 0.000000e+00 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm_MEMSET |
{'name': 'Algorithm_REDUCE_SUM', 'type': 'function'} | Base_CUDA | block_128 | 11.0 | 0.002642 | 0.002642 | 0.002642 | 0.002642 | 128.0 | 8.388616e+06 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 50.0 | regionprofile | Algorithm_REDUCE_SUM |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
{'name': 'Stream_ADD', 'type': 'function'} | Base_CUDA | block_128 | 54.0 | 0.033593 | 0.033593 | 0.033593 | 0.033593 | 128.0 | 2.516582e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_ADD |
{'name': 'Stream_COPY', 'type': 'function'} | Base_CUDA | block_128 | 55.0 | 0.042584 | 0.042584 | 0.042584 | 0.042584 | 128.0 | 1.677722e+07 | 0.000000e+00 | 1048576.0 | 1.0 | 1048576.0 | 1800.0 | regionprofile | Stream_COPY |
{'name': 'Stream_DOT', 'type': 'function'} | Base_CUDA | block_128 | 56.0 | 0.108153 | 0.108153 | 0.108153 | 0.108153 | 128.0 | 1.677723e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 2000.0 | regionprofile | Stream_DOT |
{'name': 'Stream_MUL', 'type': 'function'} | Base_CUDA | block_128 | 57.0 | 0.042611 | 0.042611 | 0.042611 | 0.042611 | 128.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 1800.0 | regionprofile | Stream_MUL |
{'name': 'Stream_TRIAD', 'type': 'function'} | Base_CUDA | block_128 | 58.0 | 0.033648 | 0.033648 | 0.033648 | 0.033648 | 128.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD |
67 rows × 14 columns
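Conceptually, aggregate_thicket collapses the repeat profiles within each sub-thicket by applying the given function across profiles for every tree node, suffixing the aggregated columns with the function name. The underlying operation can be sketched with plain pandas (toy numbers, not the RAJAPerf measurements above):

```python
import pandas as pd

# Toy performance data: two repeat profiles measuring the same two regions.
perf = pd.DataFrame({
    "node": ["RAJAPerf", "RAJAPerf", "Stream_ADD", "Stream_ADD"],
    "profile": [1, 2, 1, 2],
    "Total time": [1.78, 1.80, 0.033, 0.035],
})

# Average across profiles per node, and suffix the aggregated columns
# the way Thicket suffixes them with "_mean" (Thicket accepts np.mean;
# the string "mean" is the equivalent pandas spelling).
agg = (perf.groupby("node")[["Total time"]]
           .agg("mean")
           .add_suffix("_mean"))
print(agg)
```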
We can call agg to aggregate and create a composed dataframe in one step.
[7]:
tk_agg = gb.agg(np.mean, disable_tqdm=True)
display(tk_agg.dataframe)
node | variant | tuning | nid_mean | Min time/rank_mean | Max time/rank_mean | Avg time/rank_mean | Total time_mean | BlockSize_mean | Bytes/Rep_mean | Flops/Rep_mean | Iterations/Rep_mean | Kernels/Rep_mean | ProblemSize_mean | Reps_mean | spot.channel | name
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
{'name': 'RAJAPerf', 'type': 'function'} | Base_CUDA | block_1024 | 1.0 | 2.122934 | 2.122934 | 2.122934 | 2.122934 | 1024.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf |
 | | block_128 | 1.0 | 1.779628 | 1.779628 | 1.779628 | 1.779628 | 128.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
 | | block_256 | 1.0 | 1.772165 | 1.772165 | 1.772165 | 1.772165 | 256.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
 | | block_512 | 1.0 | 1.838314 | 1.838314 | 1.838314 | 1.838314 | 512.0 | 3.359049e+09 | 6.797544e+09 | 125952040.0 | 160.0 | 1135363.0 | 2500.0 | regionprofile | RAJAPerf
{'name': 'Algorithm', 'type': 'function'} | Base_CUDA | block_1024 | 11.0 | 0.006371 | 0.006371 | 0.006371 | 0.006371 | 1024.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 100.0 | regionprofile | Algorithm |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
{'name': 'Stream_MUL', 'type': 'function'} | Base_CUDA | block_512 | 57.0 | 0.042775 | 0.042775 | 0.042775 | 0.042775 | 512.0 | 1.677722e+07 | 1.048576e+06 | 1048576.0 | 1.0 | 1048576.0 | 1800.0 | regionprofile | Stream_MUL |
{'name': 'Stream_TRIAD', 'type': 'function'} | Base_CUDA | block_1024 | 60.0 | 0.033749 | 0.033749 | 0.033749 | 0.033749 | 1024.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD |
 | | block_128 | 58.0 | 0.033648 | 0.033648 | 0.033648 | 0.033648 | 128.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD
 | | block_256 | 64.0 | 0.033649 | 0.033649 | 0.033649 | 0.033649 | 256.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD
 | | block_512 | 58.0 | 0.033713 | 0.033713 | 0.033713 | 0.033713 | 512.0 | 2.516582e+07 | 2.097152e+06 | 1048576.0 | 1.0 | 1048576.0 | 1000.0 | regionprofile | Stream_TRIAD
268 rows × 14 columns
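With a composed multi-index dataframe like tk_agg.dataframe, a natural next step is comparing tunings, e.g. finding the fastest block size for a region. A self-contained pandas sketch on toy data (the index and column names mimic the aggregated table above; the numbers are illustrative):

```python
import pandas as pd

# Toy composed table indexed by (node, variant, tuning), shaped like the
# output of gb.agg above.
idx = pd.MultiIndex.from_tuples(
    [("RAJAPerf", "Base_CUDA", t)
     for t in ["block_1024", "block_128", "block_256", "block_512"]],
    names=["node", "variant", "tuning"],
)
df = pd.DataFrame(
    {"Total time_mean": [2.1229, 1.7796, 1.7722, 1.8383]}, index=idx
)

# For each (node, variant), pick the row with the smallest mean total time;
# idxmin returns the full (node, variant, tuning) label of that row.
best = df.groupby(["node", "variant"])["Total time_mean"].idxmin()
print(best)
```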