PPAM ‘24: Composing & Modeling Parallel Sorting Performance Data (Part A): Thicket Tutorial

The parallel sorting dataset consists of 8,747 MPI sorting algorithm performance profiles (collected with Caliper) for 5 different algorithms and 51 implementations. We start with a dataset that includes over 10,000 performance profiles, and we show how to apply various filters and checks on the performance data to remove profiles that do not match our criteria. We use this data to show how we can train models to determine the algorithm from the performance data.

Algorithm	# Performance Profiles	# Implementations
Merge Sort	2,239	15
Sample Sort	2,231	9
Odd-Even Sort	2,034	12
Bitonic Sort	1,652	11
Radix Sort	591	4
Total	8,747	51

1. Import Necessary Packages

Import packages and point to the dataset.

[1]:

from glob import glob

import numpy as np
import zipfile

import thicket as th

DATA_DIR = "../data/parallel-sorting"

Warning: Roundtrip module could not be loaded. Requires jupyter notebook version <= 7.x.

2. Read files into Thicket

First, we download the dataset as a zip file from the Texas Data Respository dataverse and extract the files. Then, we read the files into Thicket.

glob() recursively grabs all Caliper files (.cali) in the data directory.
from_caliperreader() reads the Caliper files into Thicket and fill_perfdata=False will save memory, since we have so many files.

[2]:

# Download the parallel sorting dataset (zip file) from dataverse
! wget "https://dataverse.tdl.org/api/access/datafile/:persistentId?persistentId=doi:10.18738/T8/KY6SPB/IT9TZK" -O ../data/parallel-sorting.zip
# Extract files from zip
with zipfile.ZipFile(DATA_DIR+".zip", "r") as zip_ref:
    zip_ref.extractall(DATA_DIR)

--2024-11-04 16:04:39--  https://dataverse.tdl.org/api/access/datafile/:persistentId?persistentId=doi:10.18738/T8/KY6SPB/IT9TZK
Resolving dataverse.tdl.org (dataverse.tdl.org)... 34.239.31.80
Connecting to dataverse.tdl.org (dataverse.tdl.org)|34.239.31.80|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://dataverse-prod-s3.s3.amazonaws.com/10.18738/T8/KY6SPB/191bea8c1b5-ee1ba7d1a7d0?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27parallel-sorting.zip&response-content-type=application%2Fzip&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20241104T220440Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIAIAGTFFGLXBPHVGXQ%2F20241104%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=3f781e372078df9e7b080764fb5d3956b6e6669495bc7438078b5eaf2016bb4a [following]
--2024-11-04 16:04:40--  https://dataverse-prod-s3.s3.amazonaws.com/10.18738/T8/KY6SPB/191bea8c1b5-ee1ba7d1a7d0?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27parallel-sorting.zip&response-content-type=application%2Fzip&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20241104T220440Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIAIAGTFFGLXBPHVGXQ%2F20241104%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=3f781e372078df9e7b080764fb5d3956b6e6669495bc7438078b5eaf2016bb4a
Resolving dataverse-prod-s3.s3.amazonaws.com (dataverse-prod-s3.s3.amazonaws.com)... 54.231.136.193, 52.217.87.212, 54.231.234.249, ...
Connecting to dataverse-prod-s3.s3.amazonaws.com (dataverse-prod-s3.s3.amazonaws.com)|54.231.136.193|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35869614 (34M) [application/zip]
Saving to: ‘../data/parallel-sorting.zip’

../data/parallel-so 100%[===================>]  34.21M  41.9MB/s    in 0.8s

2024-11-04 16:04:41 (41.9 MB/s) - ‘../data/parallel-sorting.zip’ saved [35869614/35869614]

[3]:

data = glob(f"{DATA_DIR}/**/*.cali", recursive=True)
print(f"Total files: {len(data)}")

# Read caliper files without filling the profile index as it expensive and unnecessary in our case
tk = th.Thicket.from_caliperreader(
    data,
    fill_perfdata=False
)
print(f"DataFrame shape {tk.dataframe.shape}")
print(f"Metadata shape: {tk.metadata.shape}")

Total files: 12916

(1/2) Reading Files: 100%|██████████| 12916/12916 [01:38<00:00, 131.77it/s]
(2/2) Creating Thicket: 100%|██████████| 12915/12915 [02:05<00:00, 102.99it/s]

DataFrame shape (128716, 16)
Metadata shape: (12916, 62)

3. Modify and Filter Metadata Values

Since the dataset we are using is a compilation from many different implementations, there are various labeling inconsistencies in the metadata annotations which we can fix using Thicket. We have defined two dictionaries from manual analysis of the data to achieve this:

META_FIX_DICT is used to enforce consistency in the metadata by replacing inconsistent values.
META_WHITELIST_DICT is used to select the metadata parameters we are looking for from the experiments.

The metadata we reference are the experiment parameters and important identifying metadata. We use these values for processing and removing anomalies, and Algorithm specifically is also used as the class label when modeling:

Experiment Parameters
- InputType - The type of sortedness of the input array.
- Datatype - The datatype of the values in the input array.
- num_procs - Number of parallel processes.
- InputSize - Size of the input array.
Parallel Algorithm Class Label
- Algorithm - The name of the parallel sorting algorithm.
Identifying metadata
- group_num - Unique identifier for different implementations.

[4]:

META_FIX_DICT = {
    "Algorithm": {
        "bitonic_sort": "BitonicSort",
        "merge_sort": "MergeSort",
        "Merge Sort": "MergeSort",
        "odd_even_sort": "OddEvenSort",
        "Merge sort": "MergeSort",
        "Sample Sort": "SampleSort",
        "Bitonic_Sort": "BitonicSort",
        "Merge_Sort": "MergeSort",
        "OddEvenTranspositionSort": "OddEvenSort",
        "Bitonic Sort": "BitonicSort",
        "Mergesort": "MergeSort",
        "mergesort": "MergeSort",
        "oddEven": "OddEvenSort",
        "Odd Even Transposition Sort": "OddEvenSort",
        "RadixSort Sort": "RadixSort",
        "Odd Even Sort": "OddEvenSort",
        "Odd-Even Sort": "OddEvenSort",
        "OddevenSort": "OddEvenSort",
        "oddeven_sort": "OddEvenSort",
        "Radix Sort": "RadixSort",
        "Odd-Even Bubble Sort": "OddEvenSort",
        "Bubble_Sort": "OddEvenSort",
        "Bubblesort": "OddEvenSort",
        "Bubble Sort(Odd/Even)": "OddEvenSort",
        "Bubble/Odd-Even Sort": "OddEvenSort",
        "Parallel Bubble Sort": "OddEvenSort",
        "BubbleSort": "OddEvenSort",
        "Radix": "RadixSort",
        "Bitonic": "BitonicSort",
    },
    "InputType": {
        "perturbed_array": "1%perturbed",
        "sorted_array": "Sorted",
        "random_array": "Random",
        "ascending_array": "Sorted",
        "descending_array": "Reverse",
        "reversed_array": "Reverse",
        "reversedSort": "Reverse",
        "1% Perturbed": "1%perturbed",
        "reverse_sorted": "Reverse",
        "1perturbed": "1%perturbed",
        r"1%%perturbed": "1%perturbed",
        "1 Perturbed": "1%perturbed",
        "1 perturbed": "1%perturbed",
        "Reverse Sorted": "Reverse",
        "1%Perturbed": "1%perturbed",
        "1% perturbation": "1%perturbed",
        "1percentperturbed": "1%perturbed",
        "1 percent noise": "1%perturbed",
        "reverse sorted": "Reverse",
        "sorted_1%_perturbed": "1%perturbed",
        "Reversesorted": "Reverse",
        "ReverseSorted": "Reverse",
        "Reverse_Sorted": "Reverse",
        "ReversedSort": "Reverse",
        "Sorted_1%_perturbed": "1%perturbed",
        "Randomized": "Random",
        "Reversed": "Reverse",
        "reversed": "Reverse",
        "sorted": "Sorted",
        "random": "Random",
        "nearly": "Nearly",
        "reverse": "Reverse",
        " Reverse sorted": "Reverse",
        "Perturbed": "1%perturbed",
        "perturbed": "1%perturbed",
    },
    "Datatype": {
        "integer": "int",
        "Int": "int",
        "Integer": "int",
        "Double": "double",
    },
}

META_WHITELIST_DICT = {
    "InputType": ["Random", "Sorted", "Reverse", "1%perturbed", "Nearly"],
    "Algorithm": [
        "BitonicSort",
        "MergeSort",
        "OddEvenSort",
        "RadixSort",
        "SampleSort",
    ],
    "Datatype": ["int", "float", "double"],
    "num_procs": [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024],
    "InputSize": [65536, 262144, 1048576, 4194304, 16777216, 67108864, 268435456],
}

3A. Modify Metadata Values to Match Grammar

The pandas.DataFrame.replace() function replaces values in the metadata.

[5]:

for meta_col, values in META_FIX_DICT.items():
    tk.metadata[meta_col] = tk.metadata[meta_col].replace(values)

3B. Filter Metadata Values from Whitelist

We use the Thicket.filter_metadata() function to filter any values that are not contained in our metadata whitelist, which leaves performance profiles that contain the desired metadata for removing anomalies and modeling.

Note: This cell can take 10+ minutes to run

[6]:

print(f"Total profiles before: {len(tk.profile)}")
tk = tk.filter_metadata(lambda meta: all([meta[key] in META_WHITELIST_DICT[key] for key in META_WHITELIST_DICT.keys()]))
print(f"Total profiles after: {len(tk.profile)}")

Total profiles before: 12916
Total profiles after: 10624

3C. Filter Duplicate Metadata Values

Duplicate values across all of our experiment parameters indicates that one profile has incorrect metadata, since all of the profiles are single-trial. If we find duplicates of any profile we remove them all, as we cannot assume which profile contains the correct metadata. These occurrences are a result of incorrect manual annotation before generating the profiles.

We can remove duplicate values by using Thicket.groupby() on our experiment parameters except “num_procs”, and then checking if there are any duplicates of “num_procs” using pandas.DataFrame.duplicated(). We then remove the duplicate profiles using Thicket.filter_profile().

Note: This cell can take 10+ minutes to run

[7]:

gb = tk.groupby(["Algorithm", "InputType", "Datatype", "group_num", "InputSize"])
rm_profs = []
for key, ttk in gb.items():
    if ttk.metadata["num_procs"].duplicated().any():
        print(f"Skipping {key} ({len(ttk.profile)} profiles) because it has duplicate num_procs")
        rm_profs += ttk.profile
tk = tk.filter_profile([p for p in tk.profile if p not in set(rm_profs)])
print(f"Total profiles after removing duplicates: {len(tk.profile)}")

Skipping ('RadixSort', 'Random', 'double', 2, 65536) (27 profiles) because it has duplicate num_procs
Skipping ('RadixSort', 'Random', 'double', 2, 262144) (26 profiles) because it has duplicate num_procs
Total profiles after removing duplicates: 10571

4. Selecting Features

In this section, we structure the performance data where each column is a feature, and each row is a feature vector for one performance profile, which is necessary for modeling

4A. Query the Call Tree

For this study, we used “generalized” nodes for annotations. So a given node in the calltree would be annotated by its functionality, communication or computation, and the amount of data it operated on, small or large.

main                 // Top-level function of the program
|_ comm              // Parent for all communication nodes
|    |_ comm_small   // All nodes communicating "small" data
|    |_ comm_large   // All nodes communicating "large" data
|_ comp              // Parent for all computation nodes
     |_ comp_small   // All nodes computing on "small" data
     |_ comp_large   // All nodes computing on "large" data

Not all implementations match this tree 100% correctly. Some implementations include additional nodes, or have generalized nodes at different depths in the calltree, which results in duplicates of the same nodes after composing the Thicket. We will use Thicket.query() to subselect the performance metrics for the generalized nodes that we want to use for modeling. Querying by node name will also combine nodes with the same name at various depths into one node at root depth.

Note: Printing the ``Thicket.tree()`` at this point will show the full calltree, which includes many nodes which are not relevant to our analysis.

[8]:

# Perform query
nodes = [
    "comp",
    "comp_large",
    "comm",
    "comm_large",
    "comp_small",
    "comm_small"
]
ntk_dict = {n: tk.query(
    th.query.Query().match(
        "*",
        lambda row: row["name"].apply(
            lambda tn: tn == n
        ).all()
    )
) for n in nodes}

4B. Compose a New Thicket from the Queried Thickets

We use Thicket.concat_thickets() to compose the Thickets we created from each query. Since many of these Thickets will contain the same profiles in their metadata, we drop duplicate values using pandas.drop_duplicates()

Note: Unlike when we read the files, fill_perfdata is True here. This is so we can later compute the feature “Present” using the None values in the “name” column.

[9]:

# Re-compose quieried Thickets
tk = th.Thicket.concat_thickets(
    thickets=list(ntk_dict.values()),
    fill_perfdata=True,
)
# Drop duplicate profiles in the metadata from concat_thickets
unhashable_cols = ["libraries", "cmdline"] # Can't pass these cols in the check or error will be thrown. Won't change the outcome of the check
tk.metadata = tk.metadata.drop_duplicates(subset=[col for col in tk.metadata.columns if col not in unhashable_cols])

4C. Remove Profiles with Missing Nodes

Since we did not design our models to handle missing data points, we need to remove profiles with missing measurements for our selected nodes using Thicket.filter_profile().

[10]:

# Nodes not considered in the check. They are only used for their presence T/F
not_considered = ["comp_small", "comm_small"]
profiles_per_node = [set(ntk_dict[n].dataframe.index.get_level_values("profile")) for n in ntk_dict.keys() if n not in not_considered]
# Intersection of the profiles
profile_truth = list(profiles_per_node[0].intersection(*profiles_per_node[1:]))
# Filter the Thicket to only contain these profiles
tk = tk.filter_profile(profile_truth)
print(f"Total profiles that contain all data: {len(tk.profile)}")

Total profiles that contain all data: 9406

4D. Computute Additional Features from Performance Data

We compute the “Present” feature and the derived “comp/comm” features using a mixture of pandas functions. The add_root_node() function is used to add the “comp/comm” features to the performance data.

[11]:

metric_cols = [
    "Variance time/rank",
    "Min time/rank",
    "Max time/rank",
    "Avg time/rank",
    "Total time",
]

# Compute Present feature
tk.dataframe["Present"] = tk.dataframe["name"].apply(lambda name: False if name is None else True)

# Compute comp/comm feature
tk.add_root_node(attrs={"name": "comp/comm", "type": "derived"})
tdf = tk.dataframe.loc[tk.get_node("comp"), metric_cols].div(tk.dataframe.loc[tk.get_node("comm"), metric_cols])
# Replace inf with NaN where division by 0 occurred
tdf = tdf.replace({np.inf: np.nan})
for prof in tdf.index:
    tk.dataframe.loc[(tk.get_node("comp/comm"), prof), metric_cols] = tdf.loc[prof]

4E: Define Our Features as Pandas Slices

Here we are essentially defining macros to refer to the features. There needs to be two macros because each macro indexes the data differently.

To subselect the performance data we use a slice generated by either perf_idx() or presence_idx() (they are functions because the node objects can change id’s after certain Thicket operations). We use the Thicket.get_node() function to select node objects.

We can index the performance data with these slices using Thicket.dataframe.loc[perf_idx()] or Thicket.dataframe.loc[presence_idx()].

[12]:

def perf_idx():
    return (
        (
            [
                tk.get_node("comp/comm"),
                tk.get_node("comp_large"),
                tk.get_node("comm_large")
            ]
        ), metric_cols
    )

def presence_idx():
    return (
        (
            [
                tk.get_node("comp_small"),
                tk.get_node("comm_small"),
            ]
        ), [
            "Present"
        ]
    )

4F. Remove Profiles with Missing Metrics

Here we check for any missing data points in any of the profiles for each of the slices we just defined. This check is different from 4C, as we are checking that there are no missing metrics.

any_nan_rows_series will be a series of boolean values for each profile that will be True if there are any missing data points. We use the Thicket.filter_profile() function once again to filter out the profiles with missing data points.

[13]:

print(f"Total profiles before dropping NaNs: {len(tk.profile)}")
nan_profs = []
for idx in [perf_idx(), presence_idx()]:
    any_nan_rows_series = tk.dataframe.loc[idx].isna().apply(lambda x: x.any(), axis=1)
    nan_profs.extend(tk.dataframe.loc[idx][any_nan_rows_series].index.get_level_values("profile").unique())
tk = tk.filter_profile([p for p in tk.profile if p not in nan_profs])
print(f"Total profiles after dropping NaNs: {len(tk.profile)}")

Total profiles before dropping NaNs: 9406
Total profiles after dropping NaNs: 8747

5. Remove Anomalies

In this section, we show how a custom function can be used on a Thicket object. We use the find_outliers function to identify profiles that fall outside certain quantile ranges for a given feature. We use the filter_profile function to filter the outliers returned by find_outliers. This idea can be used to apply custom criteria to the Thicket object, by identifying the profiles we want to remove.

[14]:

def find_outliers(
    tk,
    cols_percs,
):
    """Compute outliers for the combination of Algorithm, InputType, and Datatype.
    Normalize values by num_procs and InputSize.

    Arguments:
        tk (Thicket): The Thicket object.
        cols_percs (dict): Dictionary of columns and their percentiles.

    Returns:
        set: A set of outlier profiles.
    """

    def find_single_outlier_profiles(df, node, col, percs):
        df = df.loc[node]
        upper = df[col].quantile(percs[1])
        lower = df[col].quantile(percs[0])
        return set(
            df[(df[col] > upper) | (df[col] < lower)].index.get_level_values("profile")
        )

    tkc = tk.deepcopy()

    tkc.metadata_columns_to_perfdata(["num_procs", "InputSize"])

    # Normalize the columns by num_procs and InputSize
    tkc.dataframe["np*IS"] = tkc.dataframe["num_procs"] * tkc.dataframe["InputSize"]
    for node, col in cols_percs.keys():
        tkc.dataframe[node, col] = tkc.dataframe.loc[node, col].div(tkc.dataframe.loc[node, "np*IS"])

    single_outlier_profiles = set()
    grouped = tkc.groupby(
        [
            "Algorithm",
            "InputType",
            "Datatype",
        ]
    )

    # Find the outlier profiles
    for alg_inp_dtype, ttk in grouped.items():
        temp_set = set()
        tdf = ttk.dataframe
        if len(tdf) >= 3:
            # Find outliers
            for (node, col), percs in cols_percs.items():
                prfs = find_single_outlier_profiles(tdf, node, col, percs)
                temp_set |= prfs
                single_outlier_profiles |= prfs
            # Uncomment for extra information
            # print(
            #     f"Checked {alg_inp_dtype}. Total outliers {len(temp_set)}/{len(tdf)} ({len(temp_set)/len(tdf)*100:.2f}%)"
            # )
        else:
            raise ValueError(f"Insufficient profiles for {alg_inp_dtype}")

    # find single outlier profiles
    print(
        f"Single outlier profiles: {len(single_outlier_profiles)}/{len(tkc.profile)} ({len(single_outlier_profiles)/len(tkc.profile)*100:.2f}%)"
    )

    return single_outlier_profiles

[15]:

perc=0.975
outlier_profiles = find_outliers(
    tk,
    {
        (tk.get_node("comp_large"), "Min time/rank"): [0, perc],
        (tk.get_node("comp_large"), "Max time/rank"): [0, perc],
        (tk.get_node("comp_large"), "Avg time/rank"): [0, perc],
        (tk.get_node("comp_large"), "Total time"): [0, perc],
        (tk.get_node("comp_large"), "Variance time/rank"): [0, perc],
        (tk.get_node("comm_large"), "Min time/rank"): [0, perc],
        (tk.get_node("comm_large"), "Max time/rank"): [0, perc],
        (tk.get_node("comm_large"), "Avg time/rank"): [0, perc],
        (tk.get_node("comm_large"), "Total time"): [0, perc],
        (tk.get_node("comm_large"), "Variance time/rank"): [0, perc],
    },
)
print(f"Total profiles before dropping outliers: {len(tk.profile)}")
tk = tk.filter_profile([p for p in tk.profile if p not in outlier_profiles])
print(f"Total profiles after dropping outliers: {len(tk.profile)}")

Single outlier profiles: 905/8747 (10.35%)
Total profiles before dropping outliers: 8747
Total profiles after dropping outliers: 7842

6. Write Model Data

Lastly we shuffle the data using pandas.DataFrame.sample() to reduce bias during model training, and pickle the Thicket object, which we will use to pick back up in the next notebook, part B, where we will create classification models using the performance data. Pickling is helpful in this scenario to avoid re-doing the steps in this notebook every time we want to re-run or make adjustments to our models.

[16]:

# Print how many profiles for each sorting algorithm
algs = tk.metadata.reset_index().groupby("Algorithm")
for name, data in algs:
    print(f"Algorithm: {name} has {len(data)} data points")

# Shuffle the data
tk.dataframe = tk.dataframe.sample(frac=1.0)
# Set useful attributes
tk.perf_idx = perf_idx()
tk.presence_idx = presence_idx()
# Write thicket to file
tk.to_pickle("thicket-modeldata.pkl")

Algorithm: BitonicSort has 1508 data points
Algorithm: MergeSort has 1990 data points
Algorithm: OddEvenSort has 1820 data points
Algorithm: RadixSort has 527 data points
Algorithm: SampleSort has 1997 data points