{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PPAM '24: Composing & Modeling Parallel Sorting Performance Data (Part A): Thicket Tutorial\n",
    "\n",
    "The parallel sorting dataset consists of 8,747 MPI sorting algorithm performance profiles (collected with [Caliper](https://software.llnl.gov/Caliper/)) for 5 different algorithms and 51 implementations.\n",
    "We start with a dataset that includes over 10,000 performance profiles, and we show how to apply various filters and checks on the performance data to remove profiles that do not match our criteria.\n",
    "We use this data to show how we can train models to determine the algorithm from the performance data.\n",
    "\n",
    "\n",
    "| Algorithm | # Performance Profiles | # Implementations |\n",
    "| -------- | ------- | ------- |\n",
    "| Merge Sort | 2,239 | 15 |\n",
    "| Sample Sort | 2,231 | 9 |\n",
    "| Odd-Even Sort | 2,034 | 12 |\n",
    "| Bitonic Sort | 1,652 | 11 |\n",
    "| Radix Sort | 591 | 4 |\n",
    "| **Total** | **8,747** | **51** |\n",
    "\n",
    "## 1. Import Necessary Packages\n",
    "\n",
    "Import packages and point to the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning: Roundtrip module could not be loaded. Requires jupyter notebook version <= 7.x.\n"
     ]
    }
   ],
   "source": [
    "from glob import glob\n",
    "\n",
    "import numpy as np\n",
    "import zipfile\n",
    "\n",
    "import thicket as th\n",
    "\n",
    "DATA_DIR = \"../data/parallel-sorting\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Read files into Thicket\n",
    "\n",
    "First, we download the dataset as a zip file from the [Texas Data Respository dataverse](https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.18738/T8/KY6SPB) and extract the files. Then, we read the files into Thicket.\n",
    "- `glob()` recursively grabs all Caliper files (`.cali`) in the data directory.\n",
    "- `from_caliperreader()` reads the Caliper files into Thicket and `fill_perfdata=False` will save memory, since we have so many files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-11-04 16:04:39--  https://dataverse.tdl.org/api/access/datafile/:persistentId?persistentId=doi:10.18738/T8/KY6SPB/IT9TZK\n",
      "Resolving dataverse.tdl.org (dataverse.tdl.org)... 34.239.31.80\n",
      "Connecting to dataverse.tdl.org (dataverse.tdl.org)|34.239.31.80|:443... connected.\n",
      "HTTP request sent, awaiting response... 303 See Other\n",
      "Location: https://dataverse-prod-s3.s3.amazonaws.com/10.18738/T8/KY6SPB/191bea8c1b5-ee1ba7d1a7d0?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27parallel-sorting.zip&response-content-type=application%2Fzip&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20241104T220440Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIAIAGTFFGLXBPHVGXQ%2F20241104%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=3f781e372078df9e7b080764fb5d3956b6e6669495bc7438078b5eaf2016bb4a [following]\n",
      "--2024-11-04 16:04:40--  https://dataverse-prod-s3.s3.amazonaws.com/10.18738/T8/KY6SPB/191bea8c1b5-ee1ba7d1a7d0?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27parallel-sorting.zip&response-content-type=application%2Fzip&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20241104T220440Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIAIAGTFFGLXBPHVGXQ%2F20241104%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=3f781e372078df9e7b080764fb5d3956b6e6669495bc7438078b5eaf2016bb4a\n",
      "Resolving dataverse-prod-s3.s3.amazonaws.com (dataverse-prod-s3.s3.amazonaws.com)... 54.231.136.193, 52.217.87.212, 54.231.234.249, ...\n",
      "Connecting to dataverse-prod-s3.s3.amazonaws.com (dataverse-prod-s3.s3.amazonaws.com)|54.231.136.193|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 35869614 (34M) [application/zip]\n",
      "Saving to: ‘../data/parallel-sorting.zip’\n",
      "\n",
      "../data/parallel-so 100%[===================>]  34.21M  41.9MB/s    in 0.8s    \n",
      "\n",
      "2024-11-04 16:04:41 (41.9 MB/s) - ‘../data/parallel-sorting.zip’ saved [35869614/35869614]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Download the parallel sorting dataset (zip file) from dataverse\n",
    "! wget \"https://dataverse.tdl.org/api/access/datafile/:persistentId?persistentId=doi:10.18738/T8/KY6SPB/IT9TZK\" -O ../data/parallel-sorting.zip\n",
    "# Extract files from zip\n",
    "with zipfile.ZipFile(DATA_DIR+\".zip\", \"r\") as zip_ref:\n",
    "    zip_ref.extractall(DATA_DIR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total files: 12916\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "(1/2) Reading Files: 100%|██████████| 12916/12916 [01:38<00:00, 131.77it/s]\n",
      "(2/2) Creating Thicket: 100%|██████████| 12915/12915 [02:05<00:00, 102.99it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DataFrame shape (128716, 16)\n",
      "Metadata shape: (12916, 62)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "data = glob(f\"{DATA_DIR}/**/*.cali\", recursive=True)\n",
    "print(f\"Total files: {len(data)}\")\n",
    "\n",
    "# Read caliper files without filling the profile index as it expensive and unnecessary in our case\n",
    "tk = th.Thicket.from_caliperreader(\n",
    "    data,\n",
    "    fill_perfdata=False\n",
    ")\n",
    "print(f\"DataFrame shape {tk.dataframe.shape}\")\n",
    "print(f\"Metadata shape: {tk.metadata.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Modify and Filter Metadata Values\n",
    "\n",
    "Since the dataset we are using is a compilation from many different implementations, there are various labeling inconsistencies in the metadata annotations which we can fix using Thicket. We have defined two dictionaries from manual analysis of the data to achieve this:\n",
    "\n",
    "- `META_FIX_DICT` is used to enforce consistency in the metadata by replacing inconsistent values.\n",
    "- `META_WHITELIST_DICT` is used to select the metadata parameters we are looking for from the experiments.\n",
    "\n",
    "The metadata we reference are the experiment parameters and important identifying metadata. We use these values for processing and removing anomalies, and `Algorithm` specifically is also used as the class label when modeling:\n",
    "\n",
    "- Experiment Parameters\n",
    "    - `InputType` - The type of sortedness of the input array.\n",
    "    - `Datatype` - The datatype of the values in the input array.\n",
    "    - `num_procs` - Number of parallel processes.\n",
    "    - `InputSize` - Size of the input array.\n",
    "- Parallel Algorithm Class Label\n",
    "    - `Algorithm` - The name of the parallel sorting algorithm.\n",
    "- Identifying metadata\n",
    "    - `group_num` - Unique identifier for different implementations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "META_FIX_DICT = {\n",
    "    \"Algorithm\": {\n",
    "        \"bitonic_sort\": \"BitonicSort\",\n",
    "        \"merge_sort\": \"MergeSort\",\n",
    "        \"Merge Sort\": \"MergeSort\",\n",
    "        \"odd_even_sort\": \"OddEvenSort\",\n",
    "        \"Merge sort\": \"MergeSort\",\n",
    "        \"Sample Sort\": \"SampleSort\",\n",
    "        \"Bitonic_Sort\": \"BitonicSort\",\n",
    "        \"Merge_Sort\": \"MergeSort\",\n",
    "        \"OddEvenTranspositionSort\": \"OddEvenSort\",\n",
    "        \"Bitonic Sort\": \"BitonicSort\",\n",
    "        \"Mergesort\": \"MergeSort\",\n",
    "        \"mergesort\": \"MergeSort\",\n",
    "        \"oddEven\": \"OddEvenSort\",\n",
    "        \"Odd Even Transposition Sort\": \"OddEvenSort\",\n",
    "        \"RadixSort Sort\": \"RadixSort\",\n",
    "        \"Odd Even Sort\": \"OddEvenSort\",\n",
    "        \"Odd-Even Sort\": \"OddEvenSort\",\n",
    "        \"OddevenSort\": \"OddEvenSort\",\n",
    "        \"oddeven_sort\": \"OddEvenSort\",\n",
    "        \"Radix Sort\": \"RadixSort\",\n",
    "        \"Odd-Even Bubble Sort\": \"OddEvenSort\",\n",
    "        \"Bubble_Sort\": \"OddEvenSort\",\n",
    "        \"Bubblesort\": \"OddEvenSort\",\n",
    "        \"Bubble Sort(Odd/Even)\": \"OddEvenSort\",\n",
    "        \"Bubble/Odd-Even Sort\": \"OddEvenSort\",\n",
    "        \"Parallel Bubble Sort\": \"OddEvenSort\",\n",
    "        \"BubbleSort\": \"OddEvenSort\",\n",
    "        \"Radix\": \"RadixSort\",\n",
    "        \"Bitonic\": \"BitonicSort\",\n",
    "    },\n",
    "    \"InputType\": {\n",
    "        \"perturbed_array\": \"1%perturbed\",\n",
    "        \"sorted_array\": \"Sorted\",\n",
    "        \"random_array\": \"Random\",\n",
    "        \"ascending_array\": \"Sorted\",\n",
    "        \"descending_array\": \"Reverse\",\n",
    "        \"reversed_array\": \"Reverse\",\n",
    "        \"reversedSort\": \"Reverse\",\n",
    "        \"1% Perturbed\": \"1%perturbed\",\n",
    "        \"reverse_sorted\": \"Reverse\",\n",
    "        \"1perturbed\": \"1%perturbed\",\n",
    "        r\"1%%perturbed\": \"1%perturbed\",\n",
    "        \"1 Perturbed\": \"1%perturbed\",\n",
    "        \"1 perturbed\": \"1%perturbed\",\n",
    "        \"Reverse Sorted\": \"Reverse\",\n",
    "        \"1%Perturbed\": \"1%perturbed\",\n",
    "        \"1% perturbation\": \"1%perturbed\",\n",
    "        \"1percentperturbed\": \"1%perturbed\",\n",
    "        \"1 percent noise\": \"1%perturbed\",\n",
    "        \"reverse sorted\": \"Reverse\",\n",
    "        \"sorted_1%_perturbed\": \"1%perturbed\",\n",
    "        \"Reversesorted\": \"Reverse\",\n",
    "        \"ReverseSorted\": \"Reverse\",\n",
    "        \"Reverse_Sorted\": \"Reverse\",\n",
    "        \"ReversedSort\": \"Reverse\",\n",
    "        \"Sorted_1%_perturbed\": \"1%perturbed\",\n",
    "        \"Randomized\": \"Random\",\n",
    "        \"Reversed\": \"Reverse\",\n",
    "        \"reversed\": \"Reverse\",\n",
    "        \"sorted\": \"Sorted\",\n",
    "        \"random\": \"Random\",\n",
    "        \"nearly\": \"Nearly\",\n",
    "        \"reverse\": \"Reverse\",\n",
    "        \" Reverse sorted\": \"Reverse\",\n",
    "        \"Perturbed\": \"1%perturbed\",\n",
    "        \"perturbed\": \"1%perturbed\",\n",
    "    },\n",
    "    \"Datatype\": {\n",
    "        \"integer\": \"int\",\n",
    "        \"Int\": \"int\",\n",
    "        \"Integer\": \"int\",\n",
    "        \"Double\": \"double\",\n",
    "    },\n",
    "}\n",
    "\n",
    "META_WHITELIST_DICT = {\n",
    "    \"InputType\": [\"Random\", \"Sorted\", \"Reverse\", \"1%perturbed\", \"Nearly\"],\n",
    "    \"Algorithm\": [\n",
    "        \"BitonicSort\",\n",
    "        \"MergeSort\",\n",
    "        \"OddEvenSort\",\n",
    "        \"RadixSort\",\n",
    "        \"SampleSort\",\n",
    "    ],\n",
    "    \"Datatype\": [\"int\", \"float\", \"double\"],\n",
    "    \"num_procs\": [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024],\n",
    "    \"InputSize\": [65536, 262144, 1048576, 4194304, 16777216, 67108864, 268435456],\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3A. Modify Metadata Values to Match Grammar\n",
    "\n",
    "The `pandas.DataFrame.replace()` function replaces values in the metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "for meta_col, values in META_FIX_DICT.items():\n",
    "    tk.metadata[meta_col] = tk.metadata[meta_col].replace(values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3B. Filter Metadata Values from Whitelist\n",
    "\n",
    "We use the `Thicket.filter_metadata()` function to filter any values that are not contained in our metadata whitelist, which leaves performance profiles that contain the desired metadata for removing anomalies and modeling.\n",
    "\n",
    "*Note: This cell can take 10+ minutes to run*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total profiles before: 12916\n",
      "Total profiles after: 10624\n"
     ]
    }
   ],
   "source": [
    "print(f\"Total profiles before: {len(tk.profile)}\")\n",
    "tk = tk.filter_metadata(lambda meta: all([meta[key] in META_WHITELIST_DICT[key] for key in META_WHITELIST_DICT.keys()]))\n",
    "print(f\"Total profiles after: {len(tk.profile)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3C. Filter Duplicate Metadata Values\n",
    "\n",
    "Duplicate values across all of our experiment parameters indicates that one profile has incorrect metadata, since all of the profiles are single-trial. If we find duplicates of any profile we remove them all, as we cannot assume which profile contains the correct metadata. These occurrences are a result of incorrect manual annotation before generating the profiles.\n",
    "\n",
    "We can remove duplicate values by using `Thicket.groupby()` on our experiment parameters except \"num_procs\", and then checking if there are any duplicates of \"num_procs\" using `pandas.DataFrame.duplicated()`. We then remove the duplicate profiles using `Thicket.filter_profile()`.\n",
    "\n",
    "*Note: This cell can take 10+ minutes to run*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Skipping ('RadixSort', 'Random', 'double', 2, 65536) (27 profiles) because it has duplicate num_procs\n",
      "Skipping ('RadixSort', 'Random', 'double', 2, 262144) (26 profiles) because it has duplicate num_procs\n",
      "Total profiles after removing duplicates: 10571\n"
     ]
    }
   ],
   "source": [
    "gb = tk.groupby([\"Algorithm\", \"InputType\", \"Datatype\", \"group_num\", \"InputSize\"])\n",
    "rm_profs = []\n",
    "for key, ttk in gb.items():\n",
    "    if ttk.metadata[\"num_procs\"].duplicated().any():\n",
    "        print(f\"Skipping {key} ({len(ttk.profile)} profiles) because it has duplicate num_procs\")\n",
    "        rm_profs += ttk.profile   \n",
    "tk = tk.filter_profile([p for p in tk.profile if p not in set(rm_profs)])\n",
    "print(f\"Total profiles after removing duplicates: {len(tk.profile)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Selecting Features\n",
    "\n",
    "In this section, we structure the performance data where each column is a feature, and each row is a feature vector for one performance profile, which is necessary for modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4A. Query the Call Tree\n",
    "\n",
    "For this study, we used \"generalized\" nodes for annotations. So a given node in the calltree would be annotated by its functionality, communication or computation, and the amount of data it operated on, small or large.\n",
    "\n",
    "```\n",
    "main                 // Top-level function of the program\n",
    "|_ comm              // Parent for all communication nodes\n",
    "|    |_ comm_small   // All nodes communicating \"small\" data\n",
    "|    |_ comm_large   // All nodes communicating \"large\" data\n",
    "|_ comp              // Parent for all computation nodes\n",
    "     |_ comp_small   // All nodes computing on \"small\" data\n",
    "     |_ comp_large   // All nodes computing on \"large\" data\n",
    "```\n",
    "\n",
    "Not all implementations match this tree 100% correctly. Some implementations include additional nodes, or have generalized nodes at different depths in the calltree, which results in duplicates of the same nodes after composing the Thicket. We will use `Thicket.query()` to subselect the performance metrics for the generalized nodes that we want to use for modeling. Querying by node name will also combine nodes with the same name at various depths into one node at root depth.\n",
    "\n",
    "*Note: Printing the `Thicket.tree()` at this point will show the full calltree, which includes many nodes which are not relevant to our analysis.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform query\n",
    "nodes = [\n",
    "    \"comp\",\n",
    "    \"comp_large\",\n",
    "    \"comm\",\n",
    "    \"comm_large\",\n",
    "    \"comp_small\",\n",
    "    \"comm_small\"\n",
    "]\n",
    "ntk_dict = {n: tk.query(\n",
    "    th.query.Query().match(\n",
    "        \"*\",\n",
    "        lambda row: row[\"name\"].apply(\n",
    "            lambda tn: tn == n\n",
    "        ).all()\n",
    "    )\n",
    ") for n in nodes}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4B. Compose a New Thicket from the Queried Thickets\n",
    "\n",
    "We use `Thicket.concat_thickets()` to compose the Thickets we created from each query. Since many of these Thickets will contain the same profiles in their metadata, we drop duplicate values using `pandas.drop_duplicates()`\n",
    "\n",
    "*Note: Unlike when we read the files, fill_perfdata is True here. This is so we can later compute the feature \"Present\" using the None values in the \"name\" column.* "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Re-compose quieried Thickets\n",
    "tk = th.Thicket.concat_thickets(\n",
    "    thickets=list(ntk_dict.values()),\n",
    "    fill_perfdata=True,\n",
    ")\n",
    "# Drop duplicate profiles in the metadata from concat_thickets\n",
    "unhashable_cols = [\"libraries\", \"cmdline\"] # Can't pass these cols in the check or error will be thrown. Won't change the outcome of the check\n",
    "tk.metadata = tk.metadata.drop_duplicates(subset=[col for col in tk.metadata.columns if col not in unhashable_cols])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4C. Remove Profiles with Missing Nodes\n",
    "\n",
    "Since we did not design our models to handle missing data points, we need to remove profiles with missing measurements for our selected nodes using `Thicket.filter_profile()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total profiles that contain all data: 9406\n"
     ]
    }
   ],
   "source": [
    "# Nodes not considered in the check. They are only used for their presence T/F\n",
    "not_considered = [\"comp_small\", \"comm_small\"]\n",
    "profiles_per_node = [set(ntk_dict[n].dataframe.index.get_level_values(\"profile\")) for n in ntk_dict.keys() if n not in not_considered]\n",
    "# Intersection of the profiles\n",
    "profile_truth = list(profiles_per_node[0].intersection(*profiles_per_node[1:]))\n",
    "# Filter the Thicket to only contain these profiles\n",
    "tk = tk.filter_profile(profile_truth)\n",
    "print(f\"Total profiles that contain all data: {len(tk.profile)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4D. Computute Additional Features from Performance Data\n",
    "\n",
    "We compute the \"Present\" feature and the derived \"comp/comm\" features using a mixture of `pandas` functions. The `add_root_node()` function is used to add the \"comp/comm\" features to the performance data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "metric_cols = [\n",
    "    \"Variance time/rank\",\n",
    "    \"Min time/rank\",\n",
    "    \"Max time/rank\",\n",
    "    \"Avg time/rank\",\n",
    "    \"Total time\",\n",
    "]\n",
    "\n",
    "# Compute Present feature\n",
    "tk.dataframe[\"Present\"] = tk.dataframe[\"name\"].apply(lambda name: False if name is None else True)\n",
    "\n",
    "# Compute comp/comm feature\n",
    "tk.add_root_node(attrs={\"name\": \"comp/comm\", \"type\": \"derived\"})\n",
    "tdf = tk.dataframe.loc[tk.get_node(\"comp\"), metric_cols].div(tk.dataframe.loc[tk.get_node(\"comm\"), metric_cols])\n",
    "# Replace inf with NaN where division by 0 occurred\n",
    "tdf = tdf.replace({np.inf: np.nan})\n",
    "for prof in tdf.index:\n",
    "    tk.dataframe.loc[(tk.get_node(\"comp/comm\"), prof), metric_cols] = tdf.loc[prof]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4E: Define Our Features as Pandas Slices\n",
    "\n",
    "Here we are essentially defining macros to refer to the features. There needs to be two macros because each macro indexes the data differently.\n",
    "\n",
    "To subselect the performance data we use a slice generated by either `perf_idx()` or `presence_idx()` (they are functions because the node objects can change `id`'s after certain Thicket operations). We use the `Thicket.get_node()` function to select node objects.\n",
    "\n",
    "We can index the performance data with these slices using `Thicket.dataframe.loc[perf_idx()]` or `Thicket.dataframe.loc[presence_idx()]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def perf_idx():\n",
    "    return (\n",
    "        (\n",
    "            [\n",
    "                tk.get_node(\"comp/comm\"), \n",
    "                tk.get_node(\"comp_large\"),\n",
    "                tk.get_node(\"comm_large\")\n",
    "            ]\n",
    "        ), metric_cols\n",
    "    )\n",
    "\n",
    "def presence_idx():\n",
    "    return (\n",
    "        (\n",
    "            [\n",
    "                tk.get_node(\"comp_small\"),\n",
    "                tk.get_node(\"comm_small\"),\n",
    "            ]\n",
    "        ), [\n",
    "            \"Present\"\n",
    "        ]\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4F. Remove Profiles with Missing Metrics\n",
    "\n",
    "Here we check for any missing data points in any of the profiles for each of the slices we just defined. This check is different from 4C, as we are checking that there are no missing metrics.\n",
    "\n",
    "`any_nan_rows_series` will be a series of boolean values for each profile that will be `True` if there are any missing data points. We use the `Thicket.filter_profile()` function once again to filter out the profiles with missing data points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total profiles before dropping NaNs: 9406\n",
      "Total profiles after dropping NaNs: 8747\n"
     ]
    }
   ],
   "source": [
    "print(f\"Total profiles before dropping NaNs: {len(tk.profile)}\")\n",
    "nan_profs = []\n",
    "for idx in [perf_idx(), presence_idx()]:\n",
    "    any_nan_rows_series = tk.dataframe.loc[idx].isna().apply(lambda x: x.any(), axis=1)\n",
    "    nan_profs.extend(tk.dataframe.loc[idx][any_nan_rows_series].index.get_level_values(\"profile\").unique())\n",
    "tk = tk.filter_profile([p for p in tk.profile if p not in nan_profs])\n",
    "print(f\"Total profiles after dropping NaNs: {len(tk.profile)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Remove Anomalies \n",
    "In this section, we show how a custom function can be used on a Thicket object. We use the `find_outliers` function to identify profiles that fall outside certain quantile ranges for a given feature. We use the `filter_profile` function to filter the outliers returned by `find_outliers`. This idea can be used to apply custom criteria to the Thicket object, by identifying the profiles we want to remove."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_outliers(\n",
    "    tk,\n",
    "    cols_percs,\n",
    "):\n",
    "    \"\"\"Compute outliers for the combination of Algorithm, InputType, and Datatype.\n",
    "    Normalize values by num_procs and InputSize.\n",
    "\n",
    "    Arguments:\n",
    "        tk (Thicket): The Thicket object.\n",
    "        cols_percs (dict): Dictionary of columns and their percentiles.\n",
    "\n",
    "    Returns:\n",
    "        set: A set of outlier profiles.\n",
    "    \"\"\"\n",
    "\n",
    "    def find_single_outlier_profiles(df, node, col, percs):\n",
    "        df = df.loc[node]\n",
    "        upper = df[col].quantile(percs[1])\n",
    "        lower = df[col].quantile(percs[0])\n",
    "        return set(\n",
    "            df[(df[col] > upper) | (df[col] < lower)].index.get_level_values(\"profile\")\n",
    "        )\n",
    "\n",
    "    tkc = tk.deepcopy()\n",
    "\n",
    "    tkc.metadata_columns_to_perfdata([\"num_procs\", \"InputSize\"])\n",
    "\n",
    "    # Normalize the columns by num_procs and InputSize\n",
    "    tkc.dataframe[\"np*IS\"] = tkc.dataframe[\"num_procs\"] * tkc.dataframe[\"InputSize\"]\n",
    "    for node, col in cols_percs.keys():\n",
    "        tkc.dataframe[node, col] = tkc.dataframe.loc[node, col].div(tkc.dataframe.loc[node, \"np*IS\"])        \n",
    "\n",
    "    single_outlier_profiles = set()\n",
    "    grouped = tkc.groupby(\n",
    "        [\n",
    "            \"Algorithm\",\n",
    "            \"InputType\",\n",
    "            \"Datatype\",\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    # Find the outlier profiles\n",
    "    for alg_inp_dtype, ttk in grouped.items():\n",
    "        temp_set = set()\n",
    "        tdf = ttk.dataframe\n",
    "        if len(tdf) >= 3:\n",
    "            # Find outliers\n",
    "            for (node, col), percs in cols_percs.items():\n",
    "                prfs = find_single_outlier_profiles(tdf, node, col, percs)\n",
    "                temp_set |= prfs\n",
    "                single_outlier_profiles |= prfs\n",
    "            # Uncomment for extra information\n",
    "            # print(\n",
    "            #     f\"Checked {alg_inp_dtype}. Total outliers {len(temp_set)}/{len(tdf)} ({len(temp_set)/len(tdf)*100:.2f}%)\"\n",
    "            # )\n",
    "        else:\n",
    "            raise ValueError(f\"Insufficient profiles for {alg_inp_dtype}\")\n",
    "\n",
    "    # find single outlier profiles\n",
    "    print(\n",
    "        f\"Single outlier profiles: {len(single_outlier_profiles)}/{len(tkc.profile)} ({len(single_outlier_profiles)/len(tkc.profile)*100:.2f}%)\"\n",
    "    )\n",
    "\n",
    "    return single_outlier_profiles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Single outlier profiles: 905/8747 (10.35%)\n",
      "Total profiles before dropping outliers: 8747\n",
      "Total profiles after dropping outliers: 7842\n"
     ]
    }
   ],
   "source": [
    "perc=0.975\n",
    "outlier_profiles = find_outliers(\n",
    "    tk,\n",
    "    {\n",
    "        (tk.get_node(\"comp_large\"), \"Min time/rank\"): [0, perc],\n",
    "        (tk.get_node(\"comp_large\"), \"Max time/rank\"): [0, perc],\n",
    "        (tk.get_node(\"comp_large\"), \"Avg time/rank\"): [0, perc],\n",
    "        (tk.get_node(\"comp_large\"), \"Total time\"): [0, perc],\n",
    "        (tk.get_node(\"comp_large\"), \"Variance time/rank\"): [0, perc],\n",
    "        (tk.get_node(\"comm_large\"), \"Min time/rank\"): [0, perc],\n",
    "        (tk.get_node(\"comm_large\"), \"Max time/rank\"): [0, perc],\n",
    "        (tk.get_node(\"comm_large\"), \"Avg time/rank\"): [0, perc],\n",
    "        (tk.get_node(\"comm_large\"), \"Total time\"): [0, perc],\n",
    "        (tk.get_node(\"comm_large\"), \"Variance time/rank\"): [0, perc],\n",
    "    },\n",
    ")\n",
    "print(f\"Total profiles before dropping outliers: {len(tk.profile)}\")\n",
    "tk = tk.filter_profile([p for p in tk.profile if p not in outlier_profiles])\n",
    "print(f\"Total profiles after dropping outliers: {len(tk.profile)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Write Model Data\n",
    "\n",
    "Lastly we shuffle the data using `pandas.DataFrame.sample()` to reduce bias during model training, and pickle the Thicket object, which we will use to pick back up in the next notebook, part B, where we will create classification models using the performance data. Pickling is helpful in this scenario to avoid re-doing the steps in this notebook every time we want to re-run or make adjustments to our models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Algorithm: BitonicSort has 1508 data points\n",
      "Algorithm: MergeSort has 1990 data points\n",
      "Algorithm: OddEvenSort has 1820 data points\n",
      "Algorithm: RadixSort has 527 data points\n",
      "Algorithm: SampleSort has 1997 data points\n"
     ]
    }
   ],
   "source": [
    "# Print how many profiles for each sorting algorithm\n",
    "algs = tk.metadata.reset_index().groupby(\"Algorithm\")\n",
    "for name, data in algs:\n",
    "    print(f\"Algorithm: {name} has {len(data)} data points\")\n",
    "\n",
    "# Shuffle the data\n",
    "tk.dataframe = tk.dataframe.sample(frac=1.0)\n",
    "# Set useful attributes\n",
    "tk.perf_idx = perf_idx()\n",
    "tk.presence_idx = presence_idx()\n",
    "# Write thicket to file\n",
    "tk.to_pickle(\"thicket-modeldata.pkl\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "tk-3.9.12",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}