PPAM ‘24: Composing & Modeling Parallel Sorting Performance Data (Part B): Thicket Tutorial
In part B, we use machine learning to predict the parallel algorithm class from the performance data we processed and composed in part A. Running notebook 08A_composing_parallel_sorting_data.ipynb
is necessary to generate the data that we will use in this notebook.
1. Import Necessary Packages
Import packages:
- numpy and pandas for help with data operations.
- sklearn for modeling.
- matplotlib to plot our model's performance statistics.
- thicket to unpickle the Thicket from part A.
- tqdm for modeling progress bars.
[ ]:
import os
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics, svm
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from tqdm import tqdm
import thicket as th
2. Define Modeling Helper Functions
- prep_data first performs standardization scaling on the numerical columns to boost model accuracy (for models that require normally distributed data). Then prep_data converts categorical columns to integer labels using sklearn.preprocessing.LabelEncoder().
- split_X_y splits a dataset into input features (X) and labels (y).
- compute_model_metrics computes model statistics given the true labels, predicted labels, and predicted probabilities. The statistics we compute are accuracy, precision, recall, F1-score, the confusion matrix, and the ROC AUC score.
[2]:
def configure_categorical(model_data, categories):
    for col in categories:
        # Encode any "string" categorical variables
        if model_data[col].dtype == "object":
            # Strings to ints
            model_data.loc[:, [col]] = LabelEncoder().fit_transform(model_data[col])
        else:
            # Anything else to int
            model_data[col] = model_data[col].astype(int)
            model_data.loc[:, [col]] = model_data[col]
        # Set to categorical
        model_data[col] = model_data[col].astype("category")
    return model_data
def prep_data(
    data,
    numerical_columns,
    categorical_columns,
    scaling=False,
):
    # Preprocessing
    if scaling:
        scaler = StandardScaler().set_output(transform="pandas")
        data[numerical_columns] = scaler.fit_transform(data[numerical_columns])
    if len(categorical_columns) > 0:
        data = configure_categorical(data, categorical_columns)
    return data
def split_X_y(data):
    y = pd.get_dummies(data["Algorithm"], dtype=np.float64)
    y_index = y.index
    y_values = y.values.argmax(axis=1)
    y = pd.Series(y_values, index=y_index)
    X = data.drop(columns=["Algorithm"])
    return X, y
def compute_model_metrics(y_true, y_pred, y_proba):
    def ravel(tlist, max_num):
        # One-hot encode each integer label and flatten into a single list
        tarr = []
        for y in tlist:
            t1 = np.zeros(max_num + 1)
            t1[y] = 1
            tarr += t1.tolist()
        return tarr

    def unravel_true_pred(y_true, y_pred):
        max_num = max(max(y_pred), max(y_true))
        unravel_true = ravel(y_true, max_num)
        unravel_pred = ravel(y_pred, max_num)
        return unravel_true, unravel_pred

    acc = metrics.accuracy_score(y_true=y_true, y_pred=y_pred)  # Accuracy
    pre = metrics.precision_score(
        y_true=y_true, y_pred=y_pred, average="weighted", zero_division=0
    )  # Precision
    rec = metrics.recall_score(
        y_true=y_true, y_pred=y_pred, average="weighted", zero_division=0
    )  # Recall
    f1 = metrics.f1_score(
        y_true=y_true, y_pred=y_pred, average="weighted", zero_division=0
    )  # F1
    conf_matrix = metrics.confusion_matrix(
        y_true=y_true, y_pred=y_pred
    )  # Confusion matrix
    unravel_true, unravel_pred = unravel_true_pred(y_true, y_pred)
    y_proba = y_proba.ravel()
    try:  # Case where a class sample didn't make it into the data
        roc_auc_score = metrics.roc_auc_score(
            y_true=unravel_true,
            y_score=y_proba,
            multi_class="ovr",
            average="weighted",
        )
    except ValueError:
        roc_auc_score = 0
    return acc, pre, rec, f1, conf_matrix, roc_auc_score
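As a quick, hypothetical usage sketch of these helpers, the cell below exercises prep_data and split_X_y on a tiny made-up table; the column names and values are invented for illustration and are not part of the tutorial data.
[ ]:
# Hypothetical sketch: run the helpers above on made-up data.
toy = pd.DataFrame({
    "Avg time/rank": [0.1, 0.2, 0.3, 0.4],               # invented numerical feature
    "Present": ["yes", "no", "yes", "no"],                # invented categorical feature
    "Algorithm": ["merge", "sample", "merge", "sample"],  # invented labels
})
toy = prep_data(
    toy,
    numerical_columns=["Avg time/rank"],
    categorical_columns=["Present"],
    scaling=True,
)
X, y = split_X_y(toy)
print(X.dtypes)    # scaled float column plus a "category" column
print(y.tolist())  # integer-encoded algorithm labels, e.g. [0, 1, 0, 1]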
3. Modeling Preparation
3A. Unpickle the Thicket from part A
Running 08A_composing_parallel_sorting_data.ipynb
is necessary before this step.
[3]:
modeldata_file = "thicket-modeldata.pkl"
if not os.path.isfile(modeldata_file):
    raise FileNotFoundError(
        f'You must run notebook "08A_composing_parallel_sorting_data.ipynb" before running this notebook to generate the model data "{modeldata_file}"'
    )
[4]:
tk = th.Thicket.from_pickle("thicket-modeldata.pkl")
3B. Concatenate Features
To match the expected format of the Scikit-learn models, we concatenate the features together such that each row in the model_data DataFrame is a data sample (profile). We can achieve this format by using pd.DataFrame.unstack() to pivot the node index into the column labels, and pd.concat() to concatenate the results.
As a simple example: applying the unstack operation to a MultiIndex DataFrame (node and profile) with two unique values for node and two columns results in a DataFrame with one index level (profile) and 4 columns (2 nodes x 2 columns). We are essentially extending the DataFrame on the column axis.
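The toy cell below sketches this reshaping. The index level names match our data (node and profile), but the node names, profile numbers, and metric values are made up for illustration.
[ ]:
# Toy illustration of unstack: pivot the "node" index level into the columns.
toy_idx = pd.MultiIndex.from_product(
    [["nodeA", "nodeB"], [101, 102]], names=["node", "profile"]
)
toy_df = pd.DataFrame(
    {"Avg time/rank": [1.0, 2.0, 3.0, 4.0], "Max time/rank": [1.5, 2.5, 3.5, 4.5]},
    index=toy_idx,
)
wide = toy_df.unstack(level="node")
print(wide.shape)  # (2, 4): one row per profile, 2 columns x 2 nodes = 4 columns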
[5]:
model_data = pd.concat(
    [
        tk.dataframe.loc[tk.perf_idx].unstack(level="node"),
        tk.dataframe.loc[tk.presence_idx].unstack(level="node"),
        tk.metadata["Algorithm"],
    ],
    axis="columns",
)
3C. Define Categorical and Numerical Columns
We manually define the two categorical “Present” columns we created in notebook 08A and then define the numerical columns, which are by definition the remaining columns in the data.
[6]:
categorical_columns = [
    ("Present", tk.get_node("comp_small")),
    ("Present", tk.get_node("comm_small")),
]
numerical_columns = list(set(model_data.columns) - set(categorical_columns + ["Algorithm"]))
3D. Discretize the Dataset (Optional)
By converting the numerical columns to integer labels (like the categorical columns), we improve accuracy for the SVM significantly. We do this for each numerical feature separately by computing at most n quantiles, where n is the number of samples, and then converting the quantile ranges to integer labels.
[7]:
q = len(model_data)
for i in tqdm(range(len(numerical_columns))):
    col = numerical_columns[i]
    model_data[col] = pd.qcut(
        model_data[col],
        q=q,
        duplicates="drop",
    )
    model_data[col] = OrdinalEncoder(dtype=np.int64).fit_transform(model_data[[col]])
model_data = configure_categorical(model_data, numerical_columns)
100%|██████████| 15/15 [00:04<00:00, 3.43it/s]
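To see concretely what this binning does, here is a small sketch on a made-up series; the values are illustrative only.
[ ]:
# Toy illustration: quantile-bin a made-up series, then map the bins to integer labels.
toy_series = pd.Series([0.01, 0.02, 0.5, 0.51, 10.0, 11.0])
binned = pd.qcut(toy_series, q=len(toy_series), duplicates="drop")
codes = OrdinalEncoder(dtype=np.int64).fit_transform(binned.to_frame())
print(binned.tolist())         # the quantile interval each value fell into
print(codes.ravel().tolist())  # the corresponding integer labels, here [0, 1, 2, 3, 4, 5]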
3E. Define Dictionary of Models
Here we create a dictionary of each of the machine learning models we want to use to classify our algorithm dataset so that we can use them in the step 4 loop.
[8]:
classifiers = {
    "DecisionTree": DecisionTreeClassifier(
        class_weight="balanced",
        min_samples_leaf=1,
    ),
    "RandomForest": RandomForestClassifier(
        class_weight="balanced",
        n_estimators=100,
        bootstrap=False,
    ),
    "SVM": svm.SVC(
        kernel="rbf",
        probability=True,
        class_weight="balanced",
        C=1000,
    ),
}
4. Model Training Loop
For each classifier, we run several trials, and in each trial we:
- Prepare the data with prep_data, which can be configured differently for different models.
- Use KFold cross validation to ensure we are testing on the entire dataset (a minimal sketch of this pattern follows the list).
- Initialize a new model and fit it to the training data set.
- Compute the model statistics and store them in a DataFrame with any other model metadata.
We concatenate all of the model result information into model_results, which we will use to plot the model performance.
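As a minimal sketch of the KFold pattern (with a made-up 10-sample array), note that every sample appears in the test split of exactly one fold:
[ ]:
# Toy illustration: KFold partitions the samples so each one is tested exactly once.
toy_X = np.arange(20).reshape(10, 2)
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(toy_X)):
    print(f"fold {fold}: test indices {test_idx.tolist()}")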
[9]:
model_results = pd.DataFrame()
folds = 10
trials = 3

for model_name in classifiers.keys():
    pbar = tqdm(range(trials))
    for t in pbar:
        metadata_list = []
        mdc = model_data.copy()
        mdc = prep_data(
            data=mdc,
            numerical_columns=numerical_columns,
            categorical_columns=categorical_columns,
            scaling=True,
        )
        kf = KFold(
            n_splits=folds,
            random_state=None,
        )
        for fold, (train_indices, test_indices) in enumerate(kf.split(mdc)):
            pbar.set_description(f"{model_name}: Trial {t+1}/{trials}, Fold {fold+1}/{folds}")
            train_data = mdc.iloc[train_indices]
            test_data = mdc.iloc[test_indices]
            X_train, y_train = split_X_y(train_data)
            X_test, y_test = split_X_y(test_data)
            # Init
            model = classifiers[model_name]
            # Train
            model.fold = fold
            model.fit(X_train, y_train)
            # Compute scores
            y_pred = model.predict(X_test)
            y_proba = model.predict_proba(X_test.astype(np.float32))
            acc, pre, rec, f1, conf_matrix, roc_auc_score = compute_model_metrics(
                y_true=y_test, y_pred=y_pred, y_proba=y_proba
            )
            y_proba = [tuple(i for i in j) for j in y_proba]
            profile_labels = [mdc.index[i] for i in test_indices]
            assert len(profile_labels) == len(test_indices)
            values_dict = {
                # Profile labels
                "profile": profile_labels,
                # Model predictions
                "y_pred": y_pred.tolist(),
                "y_proba": y_proba,
                "y_true": y_test.tolist(),
                # Model performance data
                "classifier": model_name,
                "trial": t + 1,
                "test_acc": acc,
                "test_pre": pre,
                "test_rec": rec,
                "test_f1": f1,
                "test_roc_auc": roc_auc_score,
                "trials": trials,
                "fold": fold,
                "n_fold": folds,
                "num_files": len(tk.profile),
            }
            tdf = pd.DataFrame.from_dict(values_dict)
            metadata_list.append(tdf)
        df = pd.concat(metadata_list)
        model_results = pd.concat([model_results, df])
DecisionTree: Trial 3/3, Fold 10/10: 100%|██████████| 3/3 [00:02<00:00, 1.15it/s]
RandomForest: Trial 3/3, Fold 10/10: 100%|██████████| 3/3 [00:57<00:00, 19.32s/it]
SVM: Trial 3/3, Fold 10/10: 100%|██████████| 3/3 [01:10<00:00, 23.49s/it]
5. Visualize Model Performance
With the _plot_bars() function, we can visualize each classifier's per-fold scores for each model performance statistic. We notice that the SVM and the random forest perform comparably, with the decision tree slightly less accurate than both.
[10]:
def _plot_bars(
    mean_df,
    std_df,
    grouper,
    x_group,
    xlabel=None,
    ylabel=None,
    title=None,
    font=None,
    legend_dict=None,
    legend=None,
    color=None,
    random=False,
    ylim=(0, 1),
    colorbar=None,
    kind="bar",
):
    unstacker = list(set(grouper) - set([x_group]))
    mu_df = mean_df.unstack(level=unstacker)
    su_df = std_df.unstack(level=unstacker)
    if font:
        plt.rcParams.update(font)
    for col in mu_df.columns.get_level_values(0).unique():
        tdf1 = mu_df[col]
        if col == "test_acc" and random:
            tdf1["Random Classifier"] = [1 / num_classes for num_classes in tdf1.index.get_level_values(0)]
        ax = tdf1.plot(kind=kind, yerr=su_df[col], capsize=5, figsize=(10, 5), color=color, legend=legend)
        plt.ylim(ylim)
        if xlabel:
            plt.xlabel(xlabel)
        if ylabel:
            plt.ylabel(ylabel)
        if title:
            plt.title(title)
        plt.grid(False)
        if legend_dict and legend:
            plt.legend(**legend_dict)
        if colorbar is not None:
            plt.colorbar(colorbar, label="Parameter", ax=ax)
        plt.xticks(rotation=90)
        plt.show()
[11]:
# We can optionally join information from the Thicket metadata table to use in analysis of model performance.
model_results = model_results.join(tk.metadata[["Algorithm", "InputSize", "InputType", "num_procs", "group_num", "Datatype"]], on="profile")
for met in ["test_acc", "test_pre", "test_rec", "test_f1", "test_roc_auc"]:
    config = [
        "fold",
        "classifier",
    ]
    mean_df = model_results[[met] + config].groupby(config).mean()
    std_df = model_results[[met] + config].groupby(config).std()
    _plot_bars(
        mean_df=mean_df,
        std_df=std_df,
        grouper=config,
        x_group="fold",
        ylabel=met,
        xlabel="fold",
        legend=True,
        title=f"{met} vs fold",
        ylim=(0.5, 1),
    )
[Output: five bar charts, one per metric (test_acc, test_pre, test_rec, test_f1, test_roc_auc), each showing the per-fold mean and standard deviation for every classifier.]
[ ]: