Query Language: Thicket Tutorial

Thicket is a Python-based toolkit for exploratory data analysis (EDA) of parallel performance data. It supports performance optimization and helps users understand how applications perform on supercomputers. It bridges the gap between tools that can consider only a single instance of a simulation run (e.g., single platform, single measurement tool, or single scale) and the need to find actionable insights in multi-dimensional, multi-scale, multi-architecture, and multi-tool performance datasets.

NOTE: An interactive version of this notebook is available in the Binder environment.


1. Import Necessary Packages

To explore the structure and capabilities of Thicket’s components, we begin by importing the necessary packages. These include Python libraries and Thicket’s statistical functions.

[1]:
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from IPython.display import HTML
import hatchet as ht

import thicket as tt

display(HTML("<style>.container { width:80% !important; }</style>"))

2. Read in Performance Profiles

For this notebook, we select profiles generated on lassen, a machine at Lawrence Livermore National Laboratory (LLNL). We create a thicket object from four profiles, one per problem size, that were all generated with the same block size of 128.

[2]:
problem_sizes = [
    "1048576",
    "2097152",
    "4194304",
    "8388608"
]
lassen1 = [f"../data/lassen/clang10.0.1_nvcc10.2.89_{x}/Base_CUDA-block_128.cali" for x in problem_sizes]
lassen2 = ["../data/lassen/clang10.0.1_nvcc10.2.89_1048576/Base_CUDA-block_256.cali"]

# generate thicket(s)
th_lassen = tt.Thicket.from_caliperreader(lassen1)
/opt/conda/lib/python3.9/site-packages/thicket/ensemble.py:319: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  perfdata[col].replace({fill_value: None}, inplace=True)
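
The file lists above follow a single directory pattern, so they can also be built with a small helper. The sketch below is only illustrative: the function name profile_path and the constant DATA_ROOT are hypothetical, and the directory layout is assumed to match the paths used in the cell above.

```python
from pathlib import PurePosixPath

# Assumed data location, copied from the paths used in this notebook.
DATA_ROOT = PurePosixPath("../data/lassen")


def profile_path(problem_size, block_size):
    """Build the path of one Caliper profile (hypothetical helper)."""
    run_dir = f"clang10.0.1_nvcc10.2.89_{problem_size}"
    return str(DATA_ROOT / run_dir / f"Base_CUDA-block_{block_size}.cali")


problem_sizes = ["1048576", "2097152", "4194304", "8388608"]

# Four profiles that share block size 128, one per problem size.
block_128_profiles = [profile_path(size, 128) for size in problem_sizes]

for path in block_128_profiles:
    print(path)
```

The resulting list of strings could then be passed to tt.Thicket.from_caliperreader in the same way as lassen1.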

3. More Information on a Function


You can use Python’s built-in help() function to see the documentation for a given object by typing help(object). For a function, this shows its arguments and, when documented, what it returns. An example is below.

[3]:
help(tt.median)
Help on function median in module thicket.stats.median:

median(thicket, columns=None)
    Calculate the median for each node in the performance data table.

    Designed to take in a thicket, and append one or more columns to the
    aggregated statistics table for the median calculation for each node.

    Arguments:
        thicket (thicket): Thicket object
        columns (list): List of hardware/timing metrics to perform median calculation
            on. Note, if using a columnar joined thicket a list of tuples must be passed
            in with the format (column index, column name).

4. Append Statistical Calculation(s)


We can calculate statistical aggregations per-node in the performance data and append the values to the aggregated statistics table. In the example below, we calculate the per-node median time across 4 profiles and append the median to the statistics table. The new column is called Total time_median.

Why is this important for this notebook?

When the nodes in the performance data table change, the values in the aggregated statistics table may no longer be valid. Therefore, the aggregated statistics table is cleared after a query has been applied. In the examples further down, we recompute an appended column (specifically the median of total time) after each query and use it as the metric to print the call trees.

[4]:
metrics = ["Total time"]
tt.median(th_lassen, columns=metrics)
th_lassen.statsframe.dataframe
/opt/conda/lib/python3.9/site-packages/thicket/stats/median.py:32: FutureWarning: The provided callable <function median at 0xffff4fe9adc0> is currently using DataFrameGroupBy.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.
  df = thicket.dataframe[columns].reset_index().groupby("node").agg(np.median)
[4]:
node                                                  name                  Total time_median
{'name': 'RAJAPerf', 'type': 'function'}              RAJAPerf              5.073698
{'name': 'Algorithm', 'type': 'function'}             Algorithm             0.015522
{'name': 'Algorithm_MEMCPY', 'type': 'function'}      Algorithm_MEMCPY      0.006589
{'name': 'Algorithm_MEMSET', 'type': 'function'}      Algorithm_MEMSET      0.004223
{'name': 'Algorithm_REDUCE_SUM', 'type': 'function'}  Algorithm_REDUCE_SUM  0.004685
...                                                   ...                   ...
{'name': 'Stream_ADD', 'type': 'function'}            Stream_ADD            0.093114
{'name': 'Stream_COPY', 'type': 'function'}           Stream_COPY           0.117469
{'name': 'Stream_DOT', 'type': 'function'}            Stream_DOT            0.182254
{'name': 'Stream_MUL', 'type': 'function'}            Stream_MUL            0.117501
{'name': 'Stream_TRIAD', 'type': 'function'}          Stream_TRIAD          0.093277

64 rows × 2 columns
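
To make the per-node median concrete, the sketch below reproduces the idea with the standard library only. It is not Thicket’s implementation: the rows and their timing values are made up, and the profile labels are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (node, profile, total_time) rows: one row per node per profile.
rows = [
    ("Stream_ADD", "profile_a", 0.25),
    ("Stream_ADD", "profile_b", 0.5),
    ("Stream_ADD", "profile_c", 0.75),
    ("Stream_ADD", "profile_d", 1.0),
    ("Stream_MUL", "profile_a", 2.0),
    ("Stream_MUL", "profile_b", 4.0),
    ("Stream_MUL", "profile_c", 6.0),
    ("Stream_MUL", "profile_d", 8.0),
]

# Group the timings by node, ignoring which profile they came from.
times_by_node = defaultdict(list)
for node, _profile, total_time in rows:
    times_by_node[node].append(total_time)

# One median per node, analogous to the "Total time_median" column.
median_by_node = {node: median(times) for node, times in times_by_node.items()}

for node, value in median_by_node.items():
    print(node, value)
```

With four profiles per node, the median is the mean of the two middle values, so each node ends up with a single summary number regardless of how many profiles were loaded.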

5. Thicket Query Language

Use the Query Language

Thicket’s query language lets users select specific nodes based on the call tree component of a thicket. The performance data and statistics tables are updated as well, so that they contain only the nodes that remain in the call tree.

[5]:
print("Initial call tree:")
print(th_lassen.statsframe.tree("Total time_median"))
Initial call tree:
    __          __       __         __
   / /_  ____ _/ /______/ /_  ___  / /_
  / __ \/ __ `/ __/ ___/ __ \/ _ \/ __/
 / / / / /_/ / /_/ /__/ / / /  __/ /_
/_/ /_/\__,_/\__/\___/_/ /_/\___/\__/  v2024.1.0

5.074 RAJAPerf
├─ 0.016 Algorithm
│  ├─ 0.007 Algorithm_MEMCPY
│  ├─ 0.004 Algorithm_MEMSET
│  └─ 0.005 Algorithm_REDUCE_SUM
├─ 0.436 Apps
│  ├─ 0.020 Apps_DEL_DOT_VEC_2D
│  ├─ 0.112 Apps_ENERGY
│  ├─ 0.011 Apps_FIR
│  ├─ 0.035 Apps_HALOEXCHANGE
│  ├─ 0.007 Apps_HALOEXCHANGE_FUSED
│  ├─ 0.035 Apps_LTIMES
│  ├─ 0.035 Apps_LTIMES_NOVIEW
│  ├─ 0.021 Apps_NODAL_ACCUMULATION_3D
│  ├─ 0.134 Apps_PRESSURE
│  ├─ 0.016 Apps_VOL3D
│  └─ 0.010 Apps_ZONAL_ACCUMULATION_3D
├─ 0.936 Basic
│  ├─ 0.025 Basic_COPY8
│  ├─ 0.047 Basic_DAXPY
│  ├─ 0.047 Basic_DAXPY_ATOMIC
│  ├─ 0.036 Basic_IF_QUAD
│  ├─ 0.080 Basic_INIT3
│  ├─ 0.100 Basic_INIT_VIEW1D
│  ├─ 0.095 Basic_INIT_VIEW1D_OFFSET
│  ├─ 0.056 Basic_MULADDSUB
│  ├─ 0.045 Basic_NESTED_INIT
│  ├─ 0.342 Basic_PI_ATOMIC
│  ├─ 0.004 Basic_PI_REDUCE
│  ├─ 0.004 Basic_REDUCE3_INT
│  ├─ 0.051 Basic_REDUCE_STRUCT
│  └─ 0.004 Basic_TRAP_INT
├─ 1.095 Lcals
│  ├─ 0.178 Lcals_DIFF_PREDICT
│  ├─ 0.064 Lcals_EOS
│  ├─ 0.131 Lcals_FIRST_DIFF
│  ├─ 0.010 Lcals_FIRST_MIN
│  ├─ 0.132 Lcals_FIRST_SUM
│  ├─ 0.151 Lcals_GEN_LIN_RECUR
│  ├─ 0.094 Lcals_HYDRO_1D
│  ├─ 0.065 Lcals_HYDRO_2D
│  ├─ 0.137 Lcals_INT_PREDICT
│  ├─ 0.008 Lcals_PLANCKIAN
│  └─ 0.125 Lcals_TRIDIAG_ELIM
├─ 1.987 Polybench
│  ├─ 0.023 Polybench_2MM
│  ├─ 0.032 Polybench_3MM
│  ├─ 0.068 Polybench_ADI
│  ├─ 0.046 Polybench_ATAX
│  ├─ 0.101 Polybench_FDTD_2D
│  ├─ 1.038 Polybench_FLOYD_WARSHALL
│  ├─ 0.027 Polybench_GEMM
│  ├─ 0.013 Polybench_GEMVER
│  ├─ 0.047 Polybench_GESUMMV
│  ├─ 0.059 Polybench_HEAT_3D
│  ├─ 0.211 Polybench_JACOBI_1D
│  ├─ 0.282 Polybench_JACOBI_2D
│  └─ 0.039 Polybench_MVT
└─ 0.604 Stream
   ├─ 0.093 Stream_ADD
   ├─ 0.117 Stream_COPY
   ├─ 0.182 Stream_DOT
   ├─ 0.118 Stream_MUL
   └─ 0.093 Stream_TRIAD

Legend (Metric: Total time_median Min: 0.00 Max: 5.07)
4.57 - 5.07
3.55 - 4.57
2.54 - 3.55
1.52 - 2.54
0.51 - 1.52
0.00 - 0.51

name User code     Only in left graph     Only in right graph

Example Query 1: Find a Subgraph with a Specific Root

This example shows how to find a subtree starting with a specific root. More specifically, the query in this example finds a subtree rooted at the node with the name “Stream” followed by all nodes down to the leaf nodes.

NOTE: A DeprecationWarning is generated when using “old-style” queries (i.e., queries with QueryMatcher) if you have the newest version of Hatchet installed.

[6]:
query_ex1 = (
    ht.QueryMatcher()
    .match(
        ".",
        lambda row: row["name"].apply(
            lambda x: re.match(
                "Stream", x
            )
            is not None
        ).all()
    )
    .rel("*")
)

# applying the first query on the lassen thicket
th_ex1 = th_lassen.query(query_ex1)
tt.median(th_ex1, columns=["Total time"])
print(th_ex1.statsframe.tree("Total time_median"))
    __          __       __         __
   / /_  ____ _/ /______/ /_  ___  / /_
  / __ \/ __ `/ __/ ___/ __ \/ _ \/ __/
 / / / / /_/ / /_/ /__/ / / /  __/ /_
/_/ /_/\__,_/\__/\___/_/ /_/\___/\__/  v2024.1.0

0.604 Stream
├─ 0.093 Stream_ADD
├─ 0.117 Stream_COPY
├─ 0.182 Stream_DOT
├─ 0.118 Stream_MUL
└─ 0.093 Stream_TRIAD

Legend (Metric: Total time_median Min: 0.09 Max: 0.60)
0.55 - 0.60
0.45 - 0.55
0.35 - 0.45
0.25 - 0.35
0.14 - 0.25
0.09 - 0.14

name User code     Only in left graph     Only in right graph

<ipython-input-6-322ae67271ad>:2: DeprecationWarning: Old-style queries are deprecated as of Hatchet 2023.1.0 and will be removed in the             future. Please use new-style queries instead. For QueryMatcher, the equivalent             new-style queries are hatchet.query.Query for base-syntax queries and             hatchet.query.ObjectQuery for the object-dialect.
  ht.QueryMatcher()
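
The query above combines two steps: “.” matches exactly one node that satisfies the predicate, and “*” matches zero or more nodes below it. The sketch below imitates that behavior on a hand-written toy tree using plain Python. It is only a model of the idea, not how Hatchet or Thicket evaluates queries; CHILDREN, descendants, and subtree_nodes are hypothetical names.

```python
import re

# Hypothetical toy call tree (parent -> children) with names from the tutorial.
CHILDREN = {
    "RAJAPerf": ["Algorithm", "Stream"],
    "Algorithm": ["Algorithm_MEMCPY"],
    "Stream": ["Stream_ADD", "Stream_MUL"],
}


def descendants(node):
    """Return the node plus everything below it, in depth-first order."""
    found = [node]
    for child in CHILDREN.get(node, []):
        found.extend(descendants(child))
    return found


def subtree_nodes(pattern, root="RAJAPerf"):
    """Nodes kept by '.' (one matching node) followed by '*' (anything below)."""
    kept = []
    for node in descendants(root):
        if re.match(pattern, node) is None:
            continue
        for name in descendants(node):
            if name not in kept:
                kept.append(name)
    return kept


print(subtree_nodes("Stream"))
```

Only Stream and the nodes beneath it survive in this model, which mirrors the tree printed above.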

Example Query 2: Find All Paths Ending with a Specific Node

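This example shows how to find all paths in the call tree ending with a specific node. More specifically, the query in this example finds paths ending with a node whose name starts with “Stream”. Because re.match only requires the pattern to match at the beginning of the name, both “Stream” and its children (e.g., “Stream_ADD”) qualify as end nodes.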

[7]:
query_ex2 = (
    ht.QueryMatcher()
    .match("*")
    .rel(
        ".",
        lambda row: row["name"].apply(
            lambda x: re.match(
                "Stream", x
            )
            is not None
        ).all()
    )
)

# applying the second query on the lassen thicket
th_ex2 = th_lassen.query(query_ex2)
tt.median(th_ex2, columns=["Total time"])
print(th_ex2.statsframe.tree("Total time_median"))
<ipython-input-7-d669d2fda245>:2: DeprecationWarning: Old-style queries are deprecated as of Hatchet 2023.1.0 and will be removed in the             future. Please use new-style queries instead. For QueryMatcher, the equivalent             new-style queries are hatchet.query.Query for base-syntax queries and             hatchet.query.ObjectQuery for the object-dialect.
  ht.QueryMatcher()
    __          __       __         __
   / /_  ____ _/ /______/ /_  ___  / /_
  / __ \/ __ `/ __/ ___/ __ \/ _ \/ __/
 / / / / /_/ / /_/ /__/ / / /  __/ /_
/_/ /_/\__,_/\__/\___/_/ /_/\___/\__/  v2024.1.0

5.074 RAJAPerf
└─ 0.604 Stream
   ├─ 0.093 Stream_ADD
   ├─ 0.117 Stream_COPY
   ├─ 0.182 Stream_DOT
   ├─ 0.118 Stream_MUL
   └─ 0.093 Stream_TRIAD

Legend (Metric: Total time_median Min: 0.09 Max: 5.07)
4.58 - 5.07
3.58 - 4.58
2.58 - 3.58
1.59 - 2.58
0.59 - 1.59
0.09 - 0.59

name User code     Only in left graph     Only in right graph

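
The predicate in this query relies on re.match, which succeeds whenever the pattern matches at the start of the string. The short sketch below contrasts this prefix behavior with an exact match using re.fullmatch. The list of names is a small hand-picked sample from the call tree above.

```python
import re

node_names = ["RAJAPerf", "Stream", "Stream_ADD", "Stream_MUL", "Basic_DAXPY"]

# re.match anchors only at the beginning, so it behaves like a prefix test.
prefix_hits = [n for n in node_names if re.match("Stream", n) is not None]

# re.fullmatch requires the whole name to match the pattern.
exact_hits = [n for n in node_names if re.fullmatch("Stream", n) is not None]

print("prefix:", prefix_hits)
print("exact: ", exact_hits)
```

This explains why the printed tree keeps the children of Stream as valid end nodes. If only the node named exactly “Stream” were wanted, re.fullmatch (or a pattern such as "Stream$") could be used in the predicate instead.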

Example Query 3: Find All Paths with Specific Starting and Ending Nodes

This example shows how to find all call paths starting with and ending with specific nodes. More specifically, the query in this example finds paths starting with a node named “Stream” and ending with a node named “Stream_MUL”.

[8]:
query_ex3 = (
    ht.QueryMatcher()
    .match(
        ".",
        lambda row: row["name"].apply(
            lambda x: re.match(
                "Stream", x
            )
            is not None
        ).all()
    )
    .rel("*")
    .rel(
        ".",
        lambda row: row["name"].apply(
            lambda x: re.match(
                "Stream_MUL", x
            )
            is not None
        ).all()
    )
)

# applying the third query on the lassen thicket
th_ex3 = th_lassen.query(query_ex3)
tt.median(th_ex3, columns=["Total time"])
print(th_ex3.statsframe.tree("Total time_median"))
<ipython-input-8-325f62c07381>:2: DeprecationWarning: Old-style queries are deprecated as of Hatchet 2023.1.0 and will be removed in the             future. Please use new-style queries instead. For QueryMatcher, the equivalent             new-style queries are hatchet.query.Query for base-syntax queries and             hatchet.query.ObjectQuery for the object-dialect.
  ht.QueryMatcher()
    __          __       __         __
   / /_  ____ _/ /______/ /_  ___  / /_
  / __ \/ __ `/ __/ ___/ __ \/ _ \/ __/
 / / / / /_/ / /_/ /__/ / / /  __/ /_
/_/ /_/\__,_/\__/\___/_/ /_/\___/\__/  v2024.1.0

0.604 Stream
└─ 0.118 Stream_MUL

Legend (Metric: Total time_median Min: 0.12 Max: 0.60)
0.56 - 0.60
0.46 - 0.56
0.36 - 0.46
0.26 - 0.36
0.17 - 0.26
0.12 - 0.17

name User code     Only in left graph     Only in right graph

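
The query above chains three steps: one node matching the start pattern, any number of intermediate nodes, and one node matching the end pattern. The sketch below models this on a toy tree by enumerating downward paths and keeping those whose first and last nodes match. It is an illustration under simplifying assumptions rather than Hatchet’s matching algorithm, and CHILDREN, downward_paths, and paths_between are hypothetical names.

```python
import re

# Hypothetical toy call tree (parent -> children).
CHILDREN = {
    "RAJAPerf": ["Basic", "Stream"],
    "Basic": ["Basic_DAXPY"],
    "Stream": ["Stream_ADD", "Stream_MUL"],
}


def downward_paths(node):
    """Yield every path that starts at `node` and follows child links."""
    yield (node,)
    for child in CHILDREN.get(node, []):
        for tail in downward_paths(child):
            yield (node,) + tail


def paths_between(start_pattern, end_pattern):
    """Paths whose first node matches start_pattern and last matches end_pattern."""
    every_node = list(CHILDREN) + [c for kids in CHILDREN.values() for c in kids]
    matches = []
    for node in dict.fromkeys(every_node):  # de-duplicate, keep order
        if re.match(start_pattern, node) is None:
            continue
        for path in downward_paths(node):
            # Two separate "." steps need at least two distinct nodes.
            if len(path) >= 2 and re.match(end_pattern, path[-1]) is not None:
                matches.append(path)
    return matches


print(paths_between("Stream", "Stream_MUL"))
```

The single surviving path, Stream followed by Stream_MUL, corresponds to the two-node tree printed above.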

Example Query 4: Find All Nodes for a Particular Software Library

This example shows how to find all call paths representing a specific software library. This example is simply a variant of finding a subtree with a given root (i.e., Example Query 1). The example query below can be adapted to find the nodes for a subset of the MPI library, for example. In our example, we look for subtrees rooted at Polybench_2MM, Basic_DAXPY, and Apps_ENERGY.

[9]:
api_entrypoints = [
    "Polybench_2MM",
    "Basic_DAXPY",
    "Apps_ENERGY",
]

query_ex4 = (
    ht.QueryMatcher()
    .match(
        ".",
        lambda row: row["name"].apply(
            lambda x: x in api_entrypoints
        ).all()
    )
    .rel("*")
)

# applying the fourth query on the lassen thicket
th_ex4 = th_lassen.query(query_ex4)
tt.median(th_ex4, columns=["Total time"])
print(th_ex4.statsframe.tree("Total time_median"))
<ipython-input-9-db118bf78c48>:8: DeprecationWarning: Old-style queries are deprecated as of Hatchet 2023.1.0 and will be removed in the             future. Please use new-style queries instead. For QueryMatcher, the equivalent             new-style queries are hatchet.query.Query for base-syntax queries and             hatchet.query.ObjectQuery for the object-dialect.
  ht.QueryMatcher()
    __          __       __         __
   / /_  ____ _/ /______/ /_  ___  / /_
  / __ \/ __ `/ __/ ___/ __ \/ _ \/ __/
 / / / / /_/ / /_/ /__/ / / /  __/ /_
/_/ /_/\__,_/\__/\___/_/ /_/\___/\__/  v2024.1.0

0.112 Apps_ENERGY
0.047 Basic_DAXPY
0.023 Polybench_2MM

Legend (Metric: Total time_median Min: 0.02 Max: 0.11)
0.10 - 0.11
0.09 - 0.10
0.07 - 0.09
0.05 - 0.07
0.03 - 0.05
0.02 - 0.03

name User code     Only in left graph     Only in right graph

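
Unlike the regular-expression predicates in the earlier examples, this query tests exact membership with the in operator. The sketch below shows the consequence on a few node names taken from the initial call tree: Basic_DAXPY_ATOMIC is not selected even though its name starts with Basic_DAXPY. Using a set instead of a list is an optional choice made here for fast lookups.

```python
import re

# Same entry points as in the query, stored as a set for this sketch.
api_entrypoints = {"Polybench_2MM", "Basic_DAXPY", "Apps_ENERGY"}

# A hand-picked sample of node names from the initial call tree.
node_names = [
    "Apps_ENERGY",
    "Basic_DAXPY",
    "Basic_DAXPY_ATOMIC",
    "Polybench_2MM",
    "Polybench_3MM",
]

# Exact membership: only names that appear verbatim in the set are kept.
exact = [name for name in node_names if name in api_entrypoints]

# Prefix matching, for comparison, would also pick up Basic_DAXPY_ATOMIC.
prefix = [name for name in node_names if re.match("Basic_DAXPY", name)]

print("exact: ", exact)
print("prefix:", prefix)
```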

Example Query 5: Find All Paths through a Specific Node

This example shows how to find all call paths that pass through a specific node. More specifically, the query below finds all paths that pass through a node named “Stream”.

[10]:
query_ex5 = (
    ht.QueryMatcher()
    .match("*")
    .rel(
        ".",
        lambda row: row["name"].apply(
            lambda x: re.match(
                "Stream", x
            )
            is not None
        ).all()
    )
    .rel("*")
)

# applying the fifth query on the lassen thicket
th_ex5 = th_lassen.query(query_ex5)
tt.median(th_ex5, columns=["Total time"])
print(th_ex5.statsframe.tree("Total time_median"))
<ipython-input-10-f64ab722a05a>:2: DeprecationWarning: Old-style queries are deprecated as of Hatchet 2023.1.0 and will be removed in the             future. Please use new-style queries instead. For QueryMatcher, the equivalent             new-style queries are hatchet.query.Query for base-syntax queries and             hatchet.query.ObjectQuery for the object-dialect.
  ht.QueryMatcher()
    __          __       __         __
   / /_  ____ _/ /______/ /_  ___  / /_
  / __ \/ __ `/ __/ ___/ __ \/ _ \/ __/
 / / / / /_/ / /_/ /__/ / / /  __/ /_
/_/ /_/\__,_/\__/\___/_/ /_/\___/\__/  v2024.1.0

5.074 RAJAPerf
└─ 0.604 Stream
   ├─ 0.093 Stream_ADD
   ├─ 0.117 Stream_COPY
   ├─ 0.182 Stream_DOT
   ├─ 0.118 Stream_MUL
   └─ 0.093 Stream_TRIAD

Legend (Metric: Total time_median Min: 0.09 Max: 5.07)
4.58 - 5.07
3.58 - 4.58
2.58 - 3.58
1.59 - 2.58
0.59 - 1.59
0.09 - 0.59

name User code     Only in left graph     Only in right graph

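
A query of the form “*”, “.”, “*” keeps everything above and below a matching node. The sketch below models this on a toy tree by listing the root-to-leaf paths and keeping the nodes on every path that contains a match. It is a simplified illustration, not Hatchet’s matching algorithm, and CHILDREN, root_to_leaf_paths, and nodes_on_paths_through are hypothetical names.

```python
import re

# Hypothetical toy call tree (parent -> children).
CHILDREN = {
    "RAJAPerf": ["Algorithm", "Stream"],
    "Algorithm": ["Algorithm_MEMCPY"],
    "Stream": ["Stream_ADD", "Stream_MUL"],
}


def root_to_leaf_paths(node, prefix=()):
    """Yield each complete path from `node` down to a leaf."""
    path = prefix + (node,)
    children = CHILDREN.get(node, [])
    if not children:
        yield path
        return
    for child in children:
        yield from root_to_leaf_paths(child, path)


def nodes_on_paths_through(pattern, root="RAJAPerf"):
    """Nodes on any root-to-leaf path that contains a node matching `pattern`."""
    kept = []
    for path in root_to_leaf_paths(root):
        if any(re.match(pattern, name) for name in path):
            for name in path:
                if name not in kept:
                    kept.append(name)
    return kept


print(nodes_on_paths_through("Stream"))
```

In this model the Algorithm branch is dropped while RAJAPerf is kept as the ancestor of Stream. As in the output above, the result coincides with that of Example Query 2, because every node below Stream also matches the prefix pattern.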