diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..c4c4ffc6aa41a89cc618a31d17f6d5924ddf2b10 --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +*.zip diff --git a/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering.ipynb b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..636b74b0a37eba8272bbc6986d9ae5a6480938df --- /dev/null +++ b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering.ipynb @@ -0,0 +1,332 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9ccf7386", + "metadata": {}, + "source": [ + "# $K$-Means Clustering" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "random-contract", + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "from matplotlib import pyplot as plt\n", + "import numpy as np\n", + "\n", + "import importlib\n", + "import helper\n", + "importlib.reload(helper)\n", + "\n", + "from IPython.display import clear_output\n", + "from time import sleep, time" + ] + }, + { + "cell_type": "markdown", + "id": "3ae69504", + "metadata": { + "jp-MarkdownHeadingCollapsed": true, + "tags": [] + }, + "source": [ + "## Introduction\n", + "$K$-Means Clustering is a method from classical machine learning. It is used to find $K$ different groups of similar items in a dataset.\n", + "\n", + "In our case the dataset is a set of $N$ 2-dimensional coordinate vectors $\\vec{x}_1,\\vec{x}_2,\\dots,\\vec{x}_N$. These points form $K < N$ clusters which we would like to find. In order to characterise a cluster we use the cluster centre $\\vec{\\mu}_j$ ($1 \\leq j \\leq K$). *Each* point from the size-$N$ set can be assigned to *one* of these clusters (we will limit ourselves to cases where this indeed is possible)." 
+ ] + }, + { + "cell_type": "markdown", + "id": "e8075aed", + "metadata": {}, + "source": [ + "## Algorithm\n", + "Assigning a point to a cluster works according to the following procedure:\n", + "\n", + "1. **Initialisation**: Randomly choose cluster centres $\\vec{\\mu}_j$ ($1 \\leq j \\leq K$). A simple way to achieve this is to choose them from the set of points $\\{\\vec{x}_i\\}_{i = 1, \\dots, N}$.\n", + "\n", + "2. **Iterations**: \n", + " - For all $i = 1, \\dots, N$ find the cluster centre with position $\\vec{\\mu}_j$ to which $\\vec{x}_i$ has the *smallest* Euclidean distance:\n", + " $$\n", + " c^{(i)} = \\operatorname{argmin}_{j \\in \\{1, \\dots, K\\}} \\left\\|\\vec{x}_i - \\vec{\\mu}_j\\right\\|_2^2,\n", + " $$\n", + " where $\\|\\vec{x}\\|_2 = \\sqrt{x_1^2 + x_2^2}$. $c^{(i)}$ is an integer number from the set $\\{1, \\dots, K\\}$. We use it to assign an index $c^{(i)}$ to each point $\\vec{x}_i$. This index designates the cluster centre to which the $i$th point is closest. Hence, for each of the points we must compute the (squared) distance to *all* cluster centres $\\vec{\\mu}_j$ ($1 \\leq j \\leq K$) and determine the smallest of these distances. The index $j$ of the cluster with the smallest distance to a point with index $i$ is assigned to $c^{(i)}$.\n", + " - After having assigned each point of the set $\\{\\vec{x}_i\\}_{i = 1, \\dots, N}$ re-compute the position of all cluster centres:\n", + " $$\n", + " \\vec{\\mu}_j = \\frac{1}{n_j} \\sum_{\\vec{x}_i\\text{ with }c^{(i)} = j} \\vec{x}_i,\n", + " $$\n", + " where $n_j$ denotes the total number of points for which $c^{(i)} = j$. 
The *new* cluster centre is nothing but the arithmetic mean of all points $\\vec{x}_i$ that were assigned to the previous cluster centre.\n", + " - We compare the set of cluster centres $C^{\\mathrm{old}} = \\{\\vec{\\mu}_1^{\\mathrm{old}}, \\dots, \\vec{\\mu}_K^{\\mathrm{old}} \\}$ from the previous iteration and the current set of cluster centres $C = \\{\\vec{\\mu}_1, \\dots, \\vec{\\mu}_K \\}$. If the cluster centres are pairwise equal (compare those with the same index) we stop the iterations. We have reached a steady state and the algorithm has *converged*." + ] + }, + { + "cell_type": "markdown", + "id": "ba899ad5", + "metadata": {}, + "source": [ + "## Task formulation\n", + "\n", + "Implement the outlined algorithm for the method of $K$-Means Clustering. Stick to the paradigm of *array-oriented programming* as often as possible.\n", + "\n", + "In case you have trouble mapping the algorithm to NumPy commands and functions it can help to first implement it with standard Python only.\n", + "\n", + "The folder `sample-data` contains some sample datasets that you can use to explore the algorithm and your implementation.\n", + "\n", + "*Hint*: It can be helpful to plot the data and the cluster centres determined with your implementation. Have a look at the `make_scatter_plot` function from the `helper.py` module provided with this notebook." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25abbe32", + "metadata": {}, + "outputs": [], + "source": [ + "n_clusters = 2\n", + "dataset = np.loadtxt(f\"sample-data/coords-with-labels-{n_clusters}.dat\", delimiter=\",\")\n", + "coords, labels = dataset.T[:2].T, dataset.T[-1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5bf0a0e2", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots()\n", + "\n", + "helper.make_scatter_plot(\n", + " ax,\n", + " [coords[labels == tt] for tt in range(n_clusters)], \n", + " labels=[f\"cluster {tt}\" for tt in range(n_clusters)],\n", + " markers=[\"o\"] * n_clusters\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21546ed7", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "62ababfe", + "metadata": {}, + "source": [ + "## Implementation of solution" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "european-bookmark", + "metadata": {}, + "outputs": [], + "source": [ + "# return True, if centers have not changed and the algorithm can therefore stop\n", + "def centers_have_not_changed(a, b):\n", + " # Provide your implementation here.\n", + " return np.allclose(a,b)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ahead-antenna", + "metadata": {}, + "outputs": [], + "source": [ + "# return the updated locations of the cluster centers\n", + "def compute_centers(coords, labels, n_centers):\n", + " # Provide your implementation here. 
\n", + " # **HINT**:\n", + " # \n", + " # Use advanced indexing with boolean masks to access\n", + " # all points that have a label corresponding to the \n", + " # index of a cluster center.\n", + " return np.array([coords[labels == idx].mean(axis=0) for idx in range(n_centers)])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "known-travel", + "metadata": {}, + "outputs": [], + "source": [ + "# return the list of *indices* of the cluster centers for the coordinates\n", + "def find_closest_center(coords, coords_center):\n", + " # Provide your implementation here.\n", + " # **HINT**:\n", + " # \n", + " # Use `np.tile()` to augment `coords` and then make use\n", + " # of NumPy's implicit broadcasting capabilities to\n", + " # compute the distance of each point to *all* cluster\n", + " # centers. You might also need to reshape the array.\n", + " # Think about along which *axis* to compute the norm. \n", + " #\n", + " # Then select the *index* of the cluster center with the \n", + " # least distance for each point (look up the \n", + " # `np.argmin()` function).\n", + " n_centers = coords_center.shape[0]\n", + " coords_shifted = np.reshape(\n", + " np.tile(coords, (1, n_centers)) - coords_center.ravel(),\n", + " (coords.shape[0], n_centers, coords.shape[1]),\n", + " )\n", + " return np.argmin(np.linalg.norm(coords_shifted, axis=2), axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "physical-saturday", + "metadata": {}, + "outputs": [], + "source": [ + "# The driver function - You need to change it, as there is an error in it\n", + "# the error is *not* in the visualization part\n", + "def kmeans(coords, n_centers, n_iter, initial_random_state=42,visualize_progres=True,sleep_time=0.5):\n", + " # Initialise the coordinates of the cluster centers\n", + " rng = np.random.RandomState(initial_random_state)\n", + " index = rng.choice(coords.shape[0], n_centers, replace=False)\n", + " \n", + " # Store coords of the center for 
iterations\n", + " coords_center = coords[index, ...].copy()\n", + " coords_center_old = coords_center.copy()\n", + " \n", + " for i in range(n_iter):\n", + " # Find closest center for each point\n", + " ### --> you provide this function ###\n", + " labels = find_closest_center(coords, coords_center)\n", + " if visualize_progres:\n", + " # Visualization of the process\n", + " sleep(sleep_time) \n", + " clear_output(wait=True)\n", + " helper.plot_clustering(n_centers,coords,coords_center,labels)\n", + " \n", + " # Update the centroids\n", + " # INFO: \"...\" in x[...] is a slicing operation called \"ellipsis\". You can learn\n", + " # more about it here: https://stackoverflow.com/questions/118370/how-do-you-use-the-ellipsis-slicing-syntax-in-python\n", + " coords_center_old = coords_center # save old version for testing convergence\n", + " ### --> you provide this solution ###\n", + " # erroneous (the in-place write also changes coords_center_old, which refers to the same array):\n", + " #coords_center[...] = compute_centers(coords, labels, n_centers)\n", + " # correct (rebinds the name to a newly created array):\n", + " coords_center = compute_centers(coords, labels, n_centers)\n", + " # Test for convergence\n", + " ### --> you provide this solution ###\n", + " if centers_have_not_changed(coords_center, coords_center_old):\n", + " if visualize_progres:\n", + " # visualize final state\n", + " sleep(sleep_time)\n", + " clear_output(wait=True)\n", + " helper.plot_clustering(n_centers,coords,coords_center,labels)\n", + " print(\"Finished after %d iterations\"%i)\n", + " break\n", + "\n", + " \n", + " return coords_center, labels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bronze-advocate", + "metadata": {}, + "outputs": [], + "source": [ + "def main(n_clusters, dataset, n_iter=1000):\n", + "# coords, labels = dataset.T[:2].T, dataset.T[-1].astype(int)\n", + " coords = dataset.T[:2].T\n", + " \n", + " coords_center, center_labels = kmeans(\n", + " coords=coords,# the input data (coordinates of the points to be clustered)\n", + " n_centers=n_clusters,# number of 
clusters\n", + " n_iter=n_iter,# maximum number of iterations to perform, if algorithm does not converge before\n", + " #initial_random_state=int(time()),# initial random seed - use a fixed value, if you want to have the same initial state for every execution\n", + " # this is a good random seed to see the bug\n", + " initial_random_state=4321,\n", + " visualize_progres=True,#Turn Off, if you do not want to wait for the visualization\n", + " sleep_time=1 # the sleep time controls the speed of the visualization (lower means faster)\n", + " \n", + " )\n", + " \n", + " print(coords_center)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "middle-planner", + "metadata": {}, + "outputs": [], + "source": [ + "if __name__ == \"__main__\":\n", + " n_clusters = 4 # change this value to test different datasets\n", + " dataset = np.loadtxt(f\"sample-data/coords-with-labels-{n_clusters}.dat\", delimiter=\",\")\n", + " main(n_clusters, dataset)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "surprising-austria", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91f9bc21", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git 
a/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering.ipynb.license b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering.ipynb.license new file mode 100644 index 0000000000000000000000000000000000000000..c207ab8c094a9d18d7c6cb5c9dfbf8913df4aa8a --- /dev/null +++ b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering.ipynb.license @@ -0,0 +1,4 @@ +SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> + +SPDX-License-Identifier: MIT diff --git a/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_stdPython.ipynb b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_stdPython.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..33e38357348697c443aab237abfc8f10036fc6fd --- /dev/null +++ b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_stdPython.ipynb @@ -0,0 +1,318 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "partial-munich", + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "from matplotlib import pyplot as plt\n", + "import numpy as np\n", + "\n", + "import importlib\n", + "import helper\n", + "importlib.reload(helper)\n", + "\n", + "import math\n", + "\n", + "from IPython.display import clear_output\n", + "from time import sleep, time" + ] + }, + { + "cell_type": "markdown", + "id": "honest-mexico", + "metadata": {}, + "source": [ + "## Example of a dataset with 4 clusters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "invalid-baseball", + "metadata": {}, + "outputs": [], + "source": [ + "dataset = np.loadtxt(\"sample-data/coords-with-labels-4.dat\", delimiter=\",\")\n", + "coords, labels = dataset.T[:2].T, dataset.T[-1].astype(int)\n", + "\n", + 
"num_labels = np.unique(labels).size\n", + "coords_by_label = list(coords[labels == tt] for tt in range(num_labels))\n", + "\n", + "coords_center = np.loadtxt(\"sample-data/cluster-centers-4.dat\", delimiter=\",\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "piano-vehicle", + "metadata": {}, + "outputs": [], + "source": [ + "ax1, ax2 = helper.init_figure()\n", + "# Scatter plot of coords without clustering.\n", + "helper.make_scatter_plot(ax1, coords=[coords], labels=[\"\"])\n", + "# Scatter plot of coords assigned to clusters\n", + "helper.make_scatter_plot(\n", + " ax2,\n", + " coords_by_label, \n", + " labels=[f\"cluster {tt}\" for tt in range(num_labels)],\n", + " markers=[\"o\"] * num_labels\n", + ")\n", + "# Plot cluster centers.\n", + "helper.make_scatter_plot(\n", + " ax2,\n", + " coords_center, \n", + " labels=[f\"centeroid {tt}\" for tt in range(num_labels)],\n", + " colors=[\"black\"] * num_labels,\n", + " with_legend=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "collectible-detector", + "metadata": {}, + "source": [ + "## Implementation using standard Python only" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3084ddcb", + "metadata": {}, + "outputs": [], + "source": [ + "# return True, if centers have not changed and the algorithm can therefore stop\n", + "def centers_have_not_changed(a, b):\n", + " # if the center location only changes very little, we also consider it same\n", + " rtol=1e-05\n", + " atol=1e-08\n", + " #has_changed=False\n", + " # Provide your implementation here.\n", + " for point_a,point_b in zip(a,b):\n", + " for coordinate_a, coordinate_b in zip(point_a,point_b):\n", + " if abs(coordinate_a - coordinate_b) >= (atol + rtol * abs(coordinate_b)):\n", + " #has_changed=True\n", + " return False\n", + " return True\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0ae4cfd1", + "metadata": {}, + "outputs": [], + "source": [ + "# return 
the updated locations of the cluster centers\n", + "def compute_centers(coords, labels, n_centers):\n", + " # Provide your implementation here. \n", + " # **HINT**:\n", + " # \n", + " # Use advanced indexing with boolean masks to access\n", + " # all points that have a label corresponding to the \n", + " # index of a cluster center.\n", + " coords_center = []\n", + " # For every cluster we look up all points that are closest to it.\n", + " for ccidx in range(n_centers):\n", + " ccx, ccy = 0, 0\n", + " cluster_size = 0\n", + " # Find all points \"assigned\" to the current cluster center.\n", + " for lc, c in zip(labels, coords):\n", + " cx, cy = c\n", + " if ccidx == lc:\n", + " cluster_size += 1\n", + " ccx += cx\n", + " ccy += cy\n", + " assert cluster_size > 0, \"Error - found cluster size with value 0.\"\n", + " # Remember to divide by the cluster_size since we compute the \n", + " # new cluster centre as the arithmetic mean from the coordinates\n", + " # of all points assigned to it.\n", + " coords_center.append([ccx / cluster_size, ccy / cluster_size])\n", + " return coords_center" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c7f163f", + "metadata": {}, + "outputs": [], + "source": [ + "# return the list of *indices* of the cluster centers for the coordinates\n", + "def find_closest_center(coords, coords_center):\n", + " # Provide your implementation here.\n", + " # **HINT**:\n", + " # \n", + " # Use `np.tile()` to augment `coords` and then make use\n", + " # of NumPy's implicit broadcasting capabilities to\n", + " # compute the distance of each point to *all* cluster\n", + " # centers. You might also need to reshape the array.\n", + " # Think about along which *axis* to compute the norm. 
\n", + " #\n", + " # Then select the *index* of the cluster center with the \n", + " # least distance for each point (look up the \n", + " # `np.argmin()` function).\n", + " labels = []\n", + " # For *all* points search the closest cluster centre.\n", + " for c in coords:\n", + " min_ccidx, min_dist = 100000, 1e+18\n", + " # Test each cluster center ...\n", + " for ccidx, cc in enumerate(coords_center):\n", + " # Squared distance of point to cluster centre.\n", + " dist = sum(r ** 2 for r in (x - y for x, y in zip(c, cc)))\n", + " # Found a new candidate.\n", + " if dist < min_dist:\n", + " min_ccidx, min_dist = ccidx, dist\n", + " # After finishing this loop we have found the closest cluster centre.\n", + " # (Or at least one of the closest, in case several have the same distance.)\n", + " # The *index* of that cluster centre is stored.\n", + " labels.append(min_ccidx)\n", + " return labels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0707a439", + "metadata": {}, + "outputs": [], + "source": [ + "# The driver function is supplied, you do not need to change it\n", + "def kmeans(coords, n_centers, n_iter, initial_random_state=42,visualize_progres=True,sleep_time=0.5):\n", + " # Initialise the coordinates of the cluster centers\n", + " rng = np.random.RandomState(initial_random_state)\n", + " index = rng.choice(coords.shape[0], n_centers, replace=False)\n", + " \n", + " # Store coords of the center for iterations\n", + " coords_center = coords[index, ...].copy()\n", + " coords_center_old = coords_center.copy()\n", + " \n", + " for i in range(n_iter):\n", + " # Find closest center for each point\n", + " ### --> you provide this function ###\n", + " labels = find_closest_center(coords, coords_center)\n", + " if visualize_progres:\n", + " # Visualization of the process\n", + " sleep(sleep_time) \n", + " clear_output(wait=True)\n", + " # for visualization, we have to convert the Python lists back into NumPy arrays\n", + " 
helper.plot_clustering(n_centers,coords,np.asarray(coords_center),np.asarray(labels))\n", + " \n", + " # Update the centroids\n", + " # INFO: \"...\" in x[...] is a slicing operation called \"ellipsis\". You can learn\n", + " # more about it here: https://stackoverflow.com/questions/118370/how-do-you-use-the-ellipsis-slicing-syntax-in-python\n", + " coords_center_old = coords_center # save old version for testing convergence\n", + " ### --> you provide this solution ###\n", + " coords_center = compute_centers(coords, labels, n_centers)\n", + " # Test for convergence\n", + " ### --> you provide this solution ###\n", + " if centers_have_not_changed(coords_center, coords_center_old):\n", + " if visualize_progres:\n", + " # visualize final state\n", + " sleep(sleep_time)\n", + " clear_output(wait=True)\n", + " helper.plot_clustering(n_centers,coords,np.asarray(coords_center),np.asarray(labels))\n", + " print(\"Finished after %d iterations\"%i)\n", + " break\n", + "\n", + " \n", + " return coords_center, labels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "authorized-slovenia", + "metadata": {}, + "outputs": [], + "source": [ + "def main(n_clusters, dataset, n_iter=1000):\n", + "# coords, labels = dataset.T[:2].T, dataset.T[-1].astype(int)\n", + " coords = dataset.T[:2].T\n", + " \n", + " coords_center, center_labels = kmeans(\n", + " coords=coords,# the input data (coordinates of the points to be clustered)\n", + " n_centers=n_clusters,# number of clusters\n", + " n_iter=n_iter,# maximum number of iterations to perform, if algorithm does not converge before\n", + " initial_random_state=int(time()),# initial random seed - use a fixed value, if you want to have the same initial state for every execution\n", + " visualize_progres=True,#Turn Off, if you do not want to wait for the visualization\n", + " sleep_time=0.5 # the sleep time controls the speed of the visualization (lower means faster)\n", + " \n", + " )\n", + " \n", + " 
print(coords_center)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "scientific-compensation", + "metadata": {}, + "outputs": [], + "source": [ + "if __name__ == \"__main__\":\n", + " n_clusters = 4 # change this value to test different datasets\n", + " dataset = np.loadtxt(f\"sample-data/coords-with-labels-{n_clusters}.dat\", delimiter=\",\")\n", + " main(n_clusters, dataset)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "098b1f1d-049d-4459-a518-1b2aef76c40e", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c4b7a8a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_stdPython.ipynb.license b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_stdPython.ipynb.license new file mode 100644 index 0000000000000000000000000000000000000000..c207ab8c094a9d18d7c6cb5c9dfbf8913df4aa8a --- /dev/null +++ b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_stdPython.ipynb.license @@ -0,0 +1,4 @@ +SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> 
+SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> + +SPDX-License-Identifier: MIT diff --git a/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_tasks.ipynb b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_tasks.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..40f98b5b7de4f2afa664b8a62ab49e9792cbe4db --- /dev/null +++ b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_tasks.ipynb @@ -0,0 +1,318 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9ccf7386", + "metadata": {}, + "source": [ + "# $K$-Means Clustering" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "random-contract", + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "from matplotlib import pyplot as plt\n", + "import numpy as np\n", + "\n", + "import importlib\n", + "import helper\n", + "importlib.reload(helper)\n", + "\n", + "from IPython.display import clear_output\n", + "from time import sleep, time" + ] + }, + { + "cell_type": "markdown", + "id": "3ae69504", + "metadata": { + "jp-MarkdownHeadingCollapsed": true, + "tags": [] + }, + "source": [ + "## Introduction\n", + "$K$-Means Clustering is a method from classical machine learning. It is used to find $K$ different groups of similar items in a dataset.\n", + "\n", + "In our case the dataset is a set of $N$ 2-dimensional coordinate vectors $\\vec{x}_1,\\vec{x}_2,\\dots,\\vec{x}_N$. These points form $K < N$ clusters which we would like to find. In order to characterise a cluster we use the cluster centre $\\vec{\\mu}_j$ ($1 \\leq j \\leq K$). *Each* point from the size-$N$ set can be assigned to *one* of these clusters (we will limit ourselves to cases where this indeed is possible)." 
+ ] + }, + { + "cell_type": "markdown", + "id": "e8075aed", + "metadata": {}, + "source": [ + "## Algorithm\n", + "Assigning a point to a cluster works according to the following procedure:\n", + "\n", + "1. **Initialisation**: Randomly choose cluster centres $\\vec{\\mu}_j$ ($1 \\leq j \\leq K$). A simple way to achieve this is to choose them from the set of points $\\{\\vec{x}_i\\}_{i = 1, \\dots, N}$.\n", + "\n", + "2. **Iterations**: \n", + " - For all $i = 1, \\dots, N$ find the cluster centre with position $\\vec{\\mu}_j$ to which $\\vec{x}_i$ has the *smallest* Euclidean distance:\n", + " $$\n", + " c^{(i)} = \\operatorname{argmin}_{j \\in \\{1, \\dots, K\\}} \\left\\|\\vec{x}_i - \\vec{\\mu}_j\\right\\|_2^2,\n", + " $$\n", + " where $\\|\\vec{x}\\|_2 = \\sqrt{x_1^2 + x_2^2}$. $c^{(i)}$ is an integer number from the set $\\{1, \\dots, K\\}$. We use it to assign an index $c^{(i)}$ to each point $\\vec{x}_i$. This index designates the cluster centre to which the $i$th point is closest. Hence, for each of the points we must compute the (squared) distance to *all* cluster centres $\\vec{\\mu}_j$ ($1 \\leq j \\leq K$) and determine the smallest of these distances. The index $j$ of the cluster with the smallest distance to a point with index $i$ is assigned to $c^{(i)}$.\n", + " - After having assigned each point of the set $\\{\\vec{x}_i\\}_{i = 1, \\dots, N}$ re-compute the position of all cluster centres:\n", + " $$\n", + " \\vec{\\mu}_j = \\frac{1}{n_j} \\sum_{\\vec{x}_i\\text{ with }c^{(i)} = j} \\vec{x}_i,\n", + " $$\n", + " where $n_j$ denotes the total number of points for which $c^{(i)} = j$. 
The *new* cluster centre is nothing but the arithmetic mean of all points $\\vec{x}_i$ that were assigned to the previous cluster centre.\n", + " - We compare the set of cluster centres $C^{\\mathrm{old}} = \\{\\vec{\\mu}_1^{\\mathrm{old}}, \\dots, \\vec{\\mu}_K^{\\mathrm{old}} \\}$ from the previous iteration and the current set of cluster centres $C = \\{\\vec{\\mu}_1, \\dots, \\vec{\\mu}_K \\}$. If the cluster centres are pairwise equal (compare those with the same index) we stop the iterations. We have reached a steady state and the algorithm has *converged*." + ] + }, + { + "cell_type": "markdown", + "id": "ba899ad5", + "metadata": {}, + "source": [ + "## Task formulation\n", + "\n", + "Implement the outlined algorithm for the method of $K$-Means Clustering. Stick to the paradigm of *array-oriented programming* as often as possible.\n", + "\n", + "In case you have trouble mapping the algorithm to NumPy commands and functions it can help to first implement it with standard Python only.\n", + "\n", + "The folder `sample-data` contains some sample datasets that you can use to explore the algorithm and your implementation.\n", + "\n", + "*Hint*: It can be helpful to plot the data and the cluster centres determined with your implementation. Have a look at the `make_scatter_plot` function from the `helper.py` module provided with this notebook." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25abbe32", + "metadata": {}, + "outputs": [], + "source": [ + "n_clusters = 2\n", + "dataset = np.loadtxt(f\"sample-data/coords-with-labels-{n_clusters}.dat\", delimiter=\",\")\n", + "coords, labels = dataset.T[:2].T, dataset.T[-1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5bf0a0e2", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots()\n", + "\n", + "helper.make_scatter_plot(\n", + " ax,\n", + " [coords[labels == tt] for tt in range(n_clusters)], \n", + " labels=[f\"cluster {tt}\" for tt in range(n_clusters)],\n", + " markers=[\"o\"] * n_clusters\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21546ed7", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "62ababfe", + "metadata": {}, + "source": [ + "## Implementation of solution" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "european-bookmark", + "metadata": {}, + "outputs": [], + "source": [ + "# return True, if centers have not changed and the algorithm can therefore stop\n", + "def centers_have_not_changed(a, b):\n", + " # Provide your implementation here.\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ahead-antenna", + "metadata": {}, + "outputs": [], + "source": [ + "# return the updated locations of the cluster centers\n", + "def compute_centers(coords, labels, n_centers):\n", + " # Provide your implementation here. 
\n", + " # **HINT**:\n", + " # \n", + " # Use advanced indexing with boolean masks to access\n", + " # all points that have a label corresponding to the \n", + " # index of a cluster center.\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "known-travel", + "metadata": {}, + "outputs": [], + "source": [ + "# return the list of *indices* of the cluster centers for the coordinates\n", + "def find_closest_center(coords, coords_center):\n", + " # Provide your implementation here.\n", + " # **HINT**:\n", + " # \n", + " # Use `np.tile()` to augment `coords` and then make use\n", + " # of NumPy's implicit broadcasting capabilities to\n", + " # compute the distance of each point to *all* cluster\n", + " # centers. You might also need to reshape the array.\n", + " # Think about along which *axis* to compute the norm. \n", + " #\n", + " # Then select the *index* of the cluster center with the \n", + " # least distance for each point (look up the \n", + " # `np.argmin()` function).\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "06d72489", + "metadata": {}, + "source": [ + "## The driver function \n", + "You need to change it, as there is an error in it. The error is *not* in the visualization part." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "physical-saturday", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "def kmeans(coords, n_centers, n_iter, \n", + " initial_random_state=42, \n", + " visualize_progress=True,\n", + " sleep_time=0.5):\n", + " # Initialise the coordinates of the cluster centers\n", + " rng = np.random.RandomState(initial_random_state)\n", + " index = rng.choice(coords.shape[0], n_centers, replace=False)\n", + " \n", + " # Store coords of the center for iterations\n", + " coords_center = coords[index, ...].copy()\n", + " coords_center_old = coords_center.copy()\n", + " \n", + " for i in range(n_iter):\n", + " # Find closest center for each point\n", + " ### --> you provide this function ###\n", + " labels = find_closest_center(coords, coords_center)\n", + " if visualize_progress:\n", + " # Visualization of the process\n", + " sleep(sleep_time) \n", + " clear_output(wait=True)\n", + " helper.plot_clustering(n_centers,coords,coords_center,labels)\n", + " \n", + " # Update the centroids\n", + " # INFO: \"...\" in x[...] is a slicing operation called \"ellipsis\". You can learn\n", + " # more about it here: https://stackoverflow.com/questions/118370/how-do-you-use-the-ellipsis-slicing-syntax-in-python\n", + " coords_center_old = coords_center # save old version for testing convergence\n", + " ### --> you provide this function ###\n", + " coords_center[...] 
= compute_centers(coords, labels, n_centers)\n", + " # Test for convergence\n", + " ### --> you provide this function ###\n", + " if centers_have_not_changed(coords_center, coords_center_old):\n", + " if visualize_progress:\n", + " # visualize final state\n", + " sleep(sleep_time)\n", + " clear_output(wait=True)\n", + " helper.plot_clustering(n_centers,coords,coords_center,labels)\n", + " print(\"Finished after %d iterations\"%i)\n", + " break\n", + " \n", + " return coords_center, labels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bronze-advocate", + "metadata": {}, + "outputs": [], + "source": [ + "def main(n_clusters, dataset, n_iter=1000):\n", + "# coords, labels = dataset.T[:2].T, dataset.T[-1].astype(int)\n", + " coords = dataset.T[:2].T\n", + " \n", + " coords_center, center_labels = kmeans(\n", + " coords=coords,# the input data (coordinates of the points to be clustered)\n", + " n_centers=n_clusters,# number of clusters\n", + " n_iter=n_iter,# maximum number of iterations to perform, if algorithm does not converge before\n", + " #initial_random_state=int(time()),# initial random seed - use a fixed value, if you want to have the same initial state for every execution\n", + " # this is a good random seed to see the bug\n", + " initial_random_state=4321,\n", + " visualize_progress=True,# turn off if you do not want to wait for the visualization\n", + " sleep_time=0.5 # the sleep time controls the speed of the visualization (lower means faster)\n", + " \n", + " )\n", + " \n", + " print(coords_center)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "middle-planner", + "metadata": {}, + "outputs": [], + "source": [ + "if __name__ == \"__main__\":\n", + " n_clusters = 2 # change this value to test different datasets\n", + " dataset = np.loadtxt(f\"sample-data/coords-with-labels-{n_clusters}.dat\", delimiter=\",\")\n", + " main(n_clusters, dataset)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": 
"Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_tasks.ipynb.license b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_tasks.ipynb.license new file mode 100644 index 0000000000000000000000000000000000000000..c207ab8c094a9d18d7c6cb5c9dfbf8913df4aa8a --- /dev/null +++ b/exercises/Numpy_KMeansClustering/NumPy_KMeansClustering_tasks.ipynb.license @@ -0,0 +1,4 @@ +SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> + +SPDX-License-Identifier: MIT diff --git a/exercises/Numpy_KMeansClustering/helper.py b/exercises/Numpy_KMeansClustering/helper.py new file mode 100644 index 0000000000000000000000000000000000000000..a5e45b0a287cd37fe6765a61470f84d4316ea61a --- /dev/null +++ b/exercises/Numpy_KMeansClustering/helper.py @@ -0,0 +1,88 @@ +# SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +# SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) 
<tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> +# +# SPDX-License-Identifier: MIT + +import matplotlib.pyplot as plt +from matplotlib.lines import Line2D + + +def init_figure(figsize=(8, 8)): + _, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=figsize) + return ax1, ax2 + + +# We have not dealt with `matplotlib` (or other packages for plotting data) yet +# but it is quite convenient for the purpose of visualising the results of the +# cluster search. +def make_scatter_plot( + ax, + coords, + labels, + markers=None, + colors=None, + with_legend=False, + figname=None, +): + ax.set_aspect("equal") + ax.minorticks_on() + + if colors is None: + cmap = plt.get_cmap("tab10") + color_list = [cmap(idx) for idx in range(len(labels))] + else: + color_list = colors + + marker_list = ( + list(Line2D.filled_markers)[: len(labels)] if markers is None else markers + ) + + for xy, col, ll, mm in zip(coords, color_list, labels, marker_list): + try: + x, y = xy.transpose() + except AttributeError: + x, y = [c[0] for c in xy], [c[1] for c in xy] + ax.scatter(x, y, s=20, color=col, label=ll, marker=mm) + + if with_legend: + ax.legend(bbox_to_anchor=(1, 1), loc="upper left") + + if figname is not None: + plt.savefig(figname, bbox_inches="tight") + +def plot_clustering(n_clusters,coords,coords_center,center_labels): + fig, ax = plt.subplots() + + # Assigen each point to a cluster. + coords_labelled = list( + coords[center_labels == tt] for tt in range(n_clusters) + ) + # Plot clusters with colors according to which cluster they belong. + make_scatter_plot( + ax, + coords_labelled, + labels=[f"cluster {tt}" for tt in range(n_clusters)], + markers=["o"] * n_clusters + ) + # Plot cluster centers. 
+ make_scatter_plot( + ax, + coords_center, + labels=[f"centeroid {tt}" for tt in range(n_clusters)], + colors=["black"] * n_clusters, + with_legend=True, +# figname="kmeans.pdf" + ) + plt.show() + + + +def read_cluster_data(filename): + """Helper function to read sample datasets.""" + with open(filename, "r", encoding="utf-8") as datafile: + coords, labels = [], [] + for line in datafile: + x, y, l = map(float, line.split(",")) + coords.append([x, y]) + labels.append(int(l)) + return coords, labels diff --git a/exercises/Numpy_KMeansClustering/sample-data/cluster-centers-2.dat b/exercises/Numpy_KMeansClustering/sample-data/cluster-centers-2.dat new file mode 100644 index 0000000000000000000000000000000000000000..448b8cea44a93df9c556d1ac1a864bfa83845125 --- /dev/null +++ b/exercises/Numpy_KMeansClustering/sample-data/cluster-centers-2.dat @@ -0,0 +1,7 @@ +# SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +# SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> +# +# SPDX-License-Identifier: CC0-1.0 + +-2.621797518717023490e+00,9.050606662874209007e+00 +4.737827335215956559e+00,1.994987523048710854e+00 diff --git a/exercises/Numpy_KMeansClustering/sample-data/cluster-centers-3.dat b/exercises/Numpy_KMeansClustering/sample-data/cluster-centers-3.dat new file mode 100644 index 0000000000000000000000000000000000000000..8bc42fcc7d2387f0de5840457a4f1da1bb4abeca --- /dev/null +++ b/exercises/Numpy_KMeansClustering/sample-data/cluster-centers-3.dat @@ -0,0 +1,8 @@ +# SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +# SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, 
<marcel.giar@hpc-hessen.de> +# +# SPDX-License-Identifier: CC0-1.0 + +-2.633232678649361613e+00,9.043569782044549754e+00 +-6.883871789341749370e+00,-6.983984146713130059e+00 +4.747103374180733582e+00,2.010594272771337288e+00 diff --git a/exercises/Numpy_KMeansClustering/sample-data/cluster-centers-4.dat b/exercises/Numpy_KMeansClustering/sample-data/cluster-centers-4.dat new file mode 100644 index 0000000000000000000000000000000000000000..62087c101ff846b14626f9cb75fa90cf24779d47 --- /dev/null +++ b/exercises/Numpy_KMeansClustering/sample-data/cluster-centers-4.dat @@ -0,0 +1,9 @@ +# SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +# SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> +# +# SPDX-License-Identifier: CC0-1.0 + +-6.883871789341752923e+00,-6.983984146713127394e+00 +-2.633232678649360281e+00,9.043569782044546201e+00 +-8.929211039812535944e+00,7.381960674811766765e+00 +4.747103374180734470e+00,2.010594272771337288e+00 diff --git a/exercises/Numpy_KMeansClustering/sample-data/coords-with-labels-2.dat b/exercises/Numpy_KMeansClustering/sample-data/coords-with-labels-2.dat new file mode 100644 index 0000000000000000000000000000000000000000..f486ef8b5b3c218a000a33a8769e238b4b63d8b2 --- /dev/null +++ b/exercises/Numpy_KMeansClustering/sample-data/coords-with-labels-2.dat @@ -0,0 +1,205 @@ +# SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +# SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> +# +# SPDX-License-Identifier: CC0-1.0 + +3.045451177433734280e+00,1.373794660986959126e+00,1.000000000000000000e+00 
+4.962597396566191144e+00,1.145938740388408927e+00,1.000000000000000000e+00 +4.664389010487044018e+00,2.471167975186181920e+00,1.000000000000000000e+00 +-3.571501336778855062e+00,9.487878558833502396e+00,0.000000000000000000e+00 +4.920870703963133863e+00,1.350470164120138206e+00,1.000000000000000000e+00 +6.783822925553426586e+00,2.607088706258743116e+00,1.000000000000000000e+00 +4.753396181479349281e+00,2.635300358461778458e+00,1.000000000000000000e+00 +4.164933525067144870e+00,1.319840451367020107e+00,1.000000000000000000e+00 +-2.955712575119771479e+00,9.870684922521792970e+00,0.000000000000000000e+00 +5.497538459430121094e+00,1.813231153977304944e+00,1.000000000000000000e+00 +-2.448967413111723612e+00,1.147752824068360766e+01,0.000000000000000000e+00 +5.539478711661351973e+00,2.280469204817341389e+00,1.000000000000000000e+00 +-1.106403312116650994e+00,7.612435065406041090e+00,0.000000000000000000e+00 +5.186976217398139077e+00,1.770977031506837829e+00,1.000000000000000000e+00 +1.398611496159028800e+00,9.487820426064421664e-01,1.000000000000000000e+00 +-6.434231119079936168e-01,9.488119049110109060e+00,0.000000000000000000e+00 +4.863971318038518454e+00,1.985762084722526799e+00,1.000000000000000000e+00 +3.633861454728399387e+00,7.589810711529998422e-01,1.000000000000000000e+00 +4.154515288398997974e+00,2.055043823327054486e+00,1.000000000000000000e+00 +3.909512204510964928e+00,2.189628273522707058e+00,1.000000000000000000e+00 +5.321831807523064839e+00,1.662902927347275961e+00,1.000000000000000000e+00 +5.154914103436761152e+00,2.486955634852940911e+00,1.000000000000000000e+00 +-1.043548854131196135e+00,8.788509827711786571e+00,0.000000000000000000e+00 +3.810883825306029316e+00,1.412988643743762429e+00,1.000000000000000000e+00 +-2.185113653657955179e+00,8.629203847782004999e+00,0.000000000000000000e+00 +-3.053580347577932841e+00,9.125208717908186884e+00,0.000000000000000000e+00 +5.144866115208558632e+00,2.838924878110853367e+00,1.000000000000000000e+00 
+-1.686652710949561040e+00,7.793442478227299297e+00,0.000000000000000000e+00 +3.741464164879743315e+00,2.465088855447237659e+00,1.000000000000000000e+00 +-1.696671800658552165e+00,1.037052615676914513e+01,0.000000000000000000e+00 +-2.545023662162701594e+00,1.057892978401232753e+01,0.000000000000000000e+00 +5.803042588383060973e+00,1.983402744960319097e+00,1.000000000000000000e+00 +-3.499733948183438415e+00,8.447988398595549953e+00,0.000000000000000000e+00 +-2.147561598005116146e+00,8.369166373593197150e+00,0.000000000000000000e+00 +-1.695680405683080316e+00,7.783421811764366538e+00,0.000000000000000000e+00 +4.838938531801571408e+00,1.372952806781937429e+00,1.000000000000000000e+00 +-1.366374808537729635e+00,9.766219160885095008e+00,0.000000000000000000e+00 +6.225895652373453437e+00,7.353541851138829522e-01,1.000000000000000000e+00 +-2.422150554814578971e+00,8.715278777732454074e+00,0.000000000000000000e+00 +3.847358097795400944e+00,1.858433242473833014e+00,1.000000000000000000e+00 +-1.031303578311234093e+00,8.496015909924674148e+00,0.000000000000000000e+00 +5.052810290503725987e+00,1.409445131136757290e+00,1.000000000000000000e+00 +4.627632063381186711e+00,1.075915312454900352e+00,1.000000000000000000e+00 +4.996894322193148774e+00,1.280260088680077679e+00,1.000000000000000000e+00 +-2.496195731174843058e+00,1.046782020535563795e+01,0.000000000000000000e+00 +3.814381639435589832e+00,1.651783842287738668e+00,1.000000000000000000e+00 +-2.151410262704466891e+00,9.575070654566555817e+00,0.000000000000000000e+00 +-3.317691225945937905e+00,8.512529084613785102e+00,0.000000000000000000e+00 +-2.249314828804326538e+00,9.796108999975631448e+00,0.000000000000000000e+00 +5.614998569645852200e+00,1.826112302438593460e+00,1.000000000000000000e+00 +2.515983111918294490e+00,1.447414662259971063e+00,1.000000000000000000e+00 +-3.393055059253883066e+00,9.168011234143849109e+00,0.000000000000000000e+00 +-2.624845905440990723e+00,8.713182432609032801e+00,0.000000000000000000e+00 
+-3.109836312971554939e+00,8.722592378405044755e+00,0.000000000000000000e+00 +-1.426146379877473169e+00,1.006808818023322516e+01,0.000000000000000000e+00 +3.712948364650018540e+00,1.913644327878931906e+00,1.000000000000000000e+00 +-2.412120073704709711e+00,9.982931118731210418e+00,0.000000000000000000e+00 +-2.216125149754069046e+00,8.299934710171953611e+00,0.000000000000000000e+00 +4.168840530609778661e+00,2.205219621298368349e+00,1.000000000000000000e+00 +3.658370185180150447e+00,2.435273158204002808e+00,1.000000000000000000e+00 +4.431756585870826548e+00,1.480168749281899121e+00,1.000000000000000000e+00 +3.880746174674403193e+00,2.123563470416939492e+00,1.000000000000000000e+00 +4.737554934776933457e+00,1.200159900085265630e+00,1.000000000000000000e+00 +-2.441669418364826427e+00,7.589537941984865199e+00,0.000000000000000000e+00 +4.525338990975483533e+00,3.210985995914193758e+00,1.000000000000000000e+00 +-4.059861054118883317e+00,9.082849103004349445e+00,0.000000000000000000e+00 +-2.522694847790684314e+00,7.956575199242420737e+00,0.000000000000000000e+00 +5.263998653280256512e+00,2.601515193205012011e+00,1.000000000000000000e+00 +-3.837383671951180908e+00,9.211147364067445054e+00,0.000000000000000000e+00 +-2.165579333484288771e+00,7.251245972835587139e+00,0.000000000000000000e+00 +5.159225350469273330e+00,3.505908596943309696e+00,1.000000000000000000e+00 +-3.522028743387173755e+00,9.328533460793595466e+00,0.000000000000000000e+00 +-1.883530275287744082e+00,8.157128571782038762e+00,0.000000000000000000e+00 +-1.718165676009703269e+00,8.104898673403582166e+00,0.000000000000000000e+00 +6.081152125294217115e+00,5.373075327612926166e-01,1.000000000000000000e+00 +-2.773854456290706150e+00,1.173445529478794036e+01,0.000000000000000000e+00 +-9.299848075453587271e-01,9.781720857351229981e+00,0.000000000000000000e+00 +3.262209468271010326e+00,1.035344644025609107e+00,1.000000000000000000e+00 +-2.177934191649186335e+00,9.989831255320680725e+00,0.000000000000000000e+00 
+-3.110904235282147212e+00,1.086656431270725953e+01,0.000000000000000000e+00 +3.378994881893055968e+00,2.891031630995508195e+00,1.000000000000000000e+00 +5.387172441351363084e+00,2.583539949374197064e+00,1.000000000000000000e+00 +5.465295185216131557e+00,2.786679319941370636e+00,1.000000000000000000e+00 +5.945357643382430446e+00,1.994173525573491146e+00,1.000000000000000000e+00 +4.387310684834941021e+00,7.253865019758825028e-01,1.000000000000000000e+00 +6.954537402901610044e+00,1.059044913489839423e-01,1.000000000000000000e+00 +-5.128942727142494107e+00,9.836188632573545476e+00,0.000000000000000000e+00 +5.906789985414723887e+00,1.265500218321951253e+00,1.000000000000000000e+00 +3.817658440661670038e+00,2.216856895432644414e+00,1.000000000000000000e+00 +3.800156994047325210e+00,1.373777038496709846e+00,1.000000000000000000e+00 +-2.504084166410289303e+00,8.779698994823174729e+00,0.000000000000000000e+00 +-2.409546257965109017e+00,8.510810474082122212e+00,0.000000000000000000e+00 +-2.701558587833872593e+00,9.315833470531934779e+00,0.000000000000000000e+00 +-2.232506823722731237e+00,9.841469377234345117e+00,0.000000000000000000e+00 +4.884845407336824152e+00,1.466226508569602238e+00,1.000000000000000000e+00 +-1.478198100556799233e+00,9.945566247314520325e+00,0.000000000000000000e+00 +-1.987256057435852430e+00,9.311270801431508204e+00,0.000000000000000000e+00 +6.762035033240734627e+00,3.005634944491879068e+00,1.000000000000000000e+00 +-3.211250716930102556e+00,8.686623981600552824e+00,0.000000000000000000e+00 +3.867053621690529575e+00,1.736351077200723125e+00,1.000000000000000000e+00 +3.319645629207458981e+00,3.804628449795085743e+00,1.000000000000000000e+00 +-3.924568365103164425e+00,8.593640805432961827e+00,0.000000000000000000e+00 +6.772912210884367568e+00,2.108188441823011239e-02,1.000000000000000000e+00 +-2.901305776184907703e+00,7.550771180066202959e+00,0.000000000000000000e+00 +-3.580090121113862267e+00,9.496758543441506717e+00,0.000000000000000000e+00 
+4.620862628325412835e+00,9.706403193029231602e-01,1.000000000000000000e+00 +5.593880599721304137e+00,2.624560935246529780e+00,1.000000000000000000e+00 +2.614736249570494220e+00,2.159623998710159754e+00,1.000000000000000000e+00 +5.590302674414151518e+00,1.396266028278328797e+00,1.000000000000000000e+00 +-4.116680857613977729e+00,9.198919986730626164e+00,0.000000000000000000e+00 +5.452740955067061357e+00,2.602798525864344015e+00,1.000000000000000000e+00 +-2.969836394012537628e+00,1.007140835441723681e+01,0.000000000000000000e+00 +3.439582429172324929e+00,1.638668448099783514e+00,1.000000000000000000e+00 +-1.593795505350676045e+00,9.343037237858005994e+00,0.000000000000000000e+00 +6.793061293739658169e+00,1.205822121052682494e+00,1.000000000000000000e+00 +3.821658152994628743e+00,4.065556959626192679e+00,1.000000000000000000e+00 +-2.267235351486716066e+00,7.101005883540523200e+00,0.000000000000000000e+00 +-3.987719613420177556e+00,8.294441919803613672e+00,0.000000000000000000e+00 +-1.770731043057339749e+00,9.185654409388291697e+00,0.000000000000000000e+00 +5.917543732016525837e+00,1.381598295104902174e+00,1.000000000000000000e+00 +-1.922340529252479779e+00,1.120474175400829964e+01,0.000000000000000000e+00 +5.330022827939213670e+00,1.571949212054895684e+00,1.000000000000000000e+00 +6.829681769445773654e+00,1.164871398585580531e+00,1.000000000000000000e+00 +-3.355991341121155269e+00,7.499438903512457344e+00,0.000000000000000000e+00 +-3.348415146275388832e+00,8.705073752347107785e+00,0.000000000000000000e+00 +5.083698264374329590e+00,2.747803737370068777e+00,1.000000000000000000e+00 +-2.336016697201568348e+00,9.399603507927158930e+00,0.000000000000000000e+00 +-3.292450915388987376e+00,8.692224611992646288e+00,0.000000000000000000e+00 +-3.186119623358708797e+00,9.625962417039190200e+00,0.000000000000000000e+00 +5.210769346921268586e+00,3.108735324121330912e+00,1.000000000000000000e+00 +-3.417221698573960964e+00,7.601982426863029829e+00,0.000000000000000000e+00 
+4.531118687771243714e+00,2.374881406039673237e+00,1.000000000000000000e+00 +6.091022444023143301e+00,2.932440510025938973e+00,1.000000000000000000e+00 +-1.350602044045346117e+00,8.193603809846610631e+00,0.000000000000000000e+00 +4.167946970438667798e+00,3.062120280908097847e+00,1.000000000000000000e+00 +4.685450676131915237e+00,1.321569336334914802e+00,1.000000000000000000e+00 +-3.038957826819788988e+00,9.527553561311677299e+00,0.000000000000000000e+00 +3.120508870274087965e+00,1.488935611074480692e+00,1.000000000000000000e+00 +4.645122535946284437e+00,2.020150277705473840e+00,1.000000000000000000e+00 +-4.234115455565783392e+00,8.451998598957349174e+00,0.000000000000000000e+00 +5.512199472948779544e+00,2.156511689679083688e+00,1.000000000000000000e+00 +-2.281737688448620904e+00,1.032142888248074897e+01,0.000000000000000000e+00 +-3.398712052678273476e+00,8.198475843232882809e+00,0.000000000000000000e+00 +-2.300334028047994916e+00,7.054616004318545741e+00,0.000000000000000000e+00 +-2.258704772706873420e+00,9.360734337695296503e+00,0.000000000000000000e+00 +3.191794494730777032e+00,5.657059095641767676e-01,1.000000000000000000e+00 +4.709680921218120098e+00,1.587856087078971745e+00,1.000000000000000000e+00 +6.272290140159736183e+00,5.430283059800993239e-01,1.000000000000000000e+00 +-2.988371860898040300e+00,8.828627151534504947e+00,0.000000000000000000e+00 +4.950786401826105632e+00,3.448525900890284213e+00,1.000000000000000000e+00 +-1.545821493808428482e+00,9.427067055134820350e+00,0.000000000000000000e+00 +4.981634812005260926e+00,3.849340523156618232e+00,1.000000000000000000e+00 +4.324609591587755375e+00,2.732138904433999649e+00,1.000000000000000000e+00 +4.736874801220819720e+00,2.568326709377645400e+00,1.000000000000000000e+00 +4.964045188716543322e+00,1.843026629573047526e+00,1.000000000000000000e+00 +-6.230117218422199787e-01,9.188863941030160021e+00,0.000000000000000000e+00 +-2.732660408378601247e+00,9.728286622290413632e+00,0.000000000000000000e+00 
+-3.483879293280071732e+00,9.801370731940773240e+00,0.000000000000000000e+00 +-3.615532597058778386e+00,7.818079504117650735e+00,0.000000000000000000e+00 +-1.687137463058260067e+00,1.091107911085226867e+01,0.000000000000000000e+00 +-2.450988904606750118e+00,7.871315830367698219e+00,0.000000000000000000e+00 +4.621365700235711138e+00,1.684511045020593567e+00,1.000000000000000000e+00 +-2.417436846517247773e+00,7.026717213597429179e+00,0.000000000000000000e+00 +5.154926522534148958e+00,5.825901174595452758e+00,1.000000000000000000e+00 +-3.189222344631240880e+00,9.246539825359324283e+00,0.000000000000000000e+00 +-3.428621857286553443e+00,1.056422053321586141e+01,0.000000000000000000e+00 +4.488093741192518138e+00,2.561486890425308527e+00,1.000000000000000000e+00 +5.819318956949388166e+00,1.503994031836027201e+00,1.000000000000000000e+00 +4.618977242263953009e+00,2.090497067249514007e+00,1.000000000000000000e+00 +-2.213077345988174294e+00,9.275341400378211532e+00,0.000000000000000000e+00 +4.704158855323564481e+00,8.954249060114258807e-01,1.000000000000000000e+00 +2.926744307137223888e+00,3.327042058106144840e+00,1.000000000000000000e+00 +-2.543909392757993437e+00,7.845608090578789273e+00,0.000000000000000000e+00 +4.199834349531117894e+00,2.103910261226823231e+00,1.000000000000000000e+00 +-4.427968838351791447e+00,8.987772252749104851e+00,0.000000000000000000e+00 +-3.660191200475052753e+00,9.389984146543993049e+00,0.000000000000000000e+00 +-2.851912139579519501e+00,8.212008858976702186e+00,0.000000000000000000e+00 +6.405333076509197809e+00,2.378151394901687699e+00,1.000000000000000000e+00 +-2.978672008987702124e+00,9.556846171784286526e+00,0.000000000000000000e+00 +3.978092371459713394e+00,2.825603018736956074e+00,1.000000000000000000e+00 +5.797989709728168961e+00,2.764832377903667648e+00,1.000000000000000000e+00 +4.422197633000880757e+00,3.071946535927922106e+00,1.000000000000000000e+00 +-2.728869510890262085e+00,9.371398699710068669e+00,0.000000000000000000e+00 
+-3.746148333930832131e+00,7.693829515114044781e+00,0.000000000000000000e+00 +-2.295103878922546414e+00,7.768547349486333076e+00,0.000000000000000000e+00 +-2.035959998479205169e+00,8.941457215541449344e+00,0.000000000000000000e+00 +-2.147802017544336195e+00,1.055232269466429074e+01,0.000000000000000000e+00 +-2.581207744633084111e+00,1.001781902609034525e+01,0.000000000000000000e+00 +3.924575126968133265e+00,2.652767432875407838e+00,1.000000000000000000e+00 +-2.972615315865212438e+00,8.548556374628065058e+00,0.000000000000000000e+00 +3.921434614975665589e+00,1.759722532228884750e+00,1.000000000000000000e+00 +-2.670483334718759316e+00,9.418336985012860652e+00,0.000000000000000000e+00 +-2.743350997776086153e+00,8.780149171249140849e+00,0.000000000000000000e+00 +5.326139026602614734e+00,3.604538127510803491e-01,1.000000000000000000e+00 +-3.700501120255398568e+00,9.670839736832151701e+00,0.000000000000000000e+00 +-2.586299332466854395e+00,9.355438103014964923e+00,0.000000000000000000e+00 +4.050514079283889401e+00,2.822771780961756516e+00,1.000000000000000000e+00 +-2.754585739055620763e+00,8.260549963840832177e+00,0.000000000000000000e+00 +4.715683394421827934e+00,1.296007972428620203e+00,1.000000000000000000e+00 +-2.251647232329985648e+00,8.939840212432153876e+00,0.000000000000000000e+00 diff --git a/exercises/Numpy_KMeansClustering/sample-data/coords-with-labels-3.dat b/exercises/Numpy_KMeansClustering/sample-data/coords-with-labels-3.dat new file mode 100644 index 0000000000000000000000000000000000000000..420385f5a44f81040f2909d027ba961cc78d774e --- /dev/null +++ b/exercises/Numpy_KMeansClustering/sample-data/coords-with-labels-3.dat @@ -0,0 +1,305 @@ +# SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +# SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, 
<marcel.giar@hpc-hessen.de> +# +# SPDX-License-Identifier: CC0-1.0 + +-7.338988090691514365e+00,-7.729953962740738760e+00,1.000000000000000000e+00 +-7.740040556435222818e+00,-7.264665137505772030e+00,1.000000000000000000e+00 +-1.686652710949561040e+00,7.793442478227299297e+00,0.000000000000000000e+00 +4.422197633000880757e+00,3.071946535927922106e+00,2.000000000000000000e+00 +-8.917751726329123940e+00,-7.888195904193350927e+00,1.000000000000000000e+00 +5.497538459430121094e+00,1.813231153977304944e+00,2.000000000000000000e+00 +-2.336016697201568348e+00,9.399603507927158930e+00,0.000000000000000000e+00 +5.052810290503725987e+00,1.409445131136757290e+00,2.000000000000000000e+00 +-2.988371860898040300e+00,8.828627151534504947e+00,0.000000000000000000e+00 +-3.700501120255398568e+00,9.670839736832151701e+00,0.000000000000000000e+00 +-3.110904235282147212e+00,1.086656431270725953e+01,0.000000000000000000e+00 +4.996894322193148774e+00,1.280260088680077679e+00,2.000000000000000000e+00 +-2.300334028047994916e+00,7.054616004318545741e+00,0.000000000000000000e+00 +-3.924568365103164425e+00,8.593640805432961827e+00,0.000000000000000000e+00 +-7.530269760273096580e+00,-7.367234977040642896e+00,1.000000000000000000e+00 +-3.211250716930102556e+00,8.686623981600552824e+00,0.000000000000000000e+00 +-8.507169629034432745e+00,-6.832024646614564212e+00,1.000000000000000000e+00 +2.614736249570494220e+00,2.159623998710159754e+00,2.000000000000000000e+00 +-2.412120073704709711e+00,9.982931118731210418e+00,0.000000000000000000e+00 +-1.922340529252479779e+00,1.120474175400829964e+01,0.000000000000000000e+00 +-1.350602044045346117e+00,8.193603809846610631e+00,0.000000000000000000e+00 +-2.670483334718759316e+00,9.418336985012860652e+00,0.000000000000000000e+00 +5.614998569645852200e+00,1.826112302438593460e+00,2.000000000000000000e+00 +-6.991955240842099961e+00,-7.101079192809169882e+00,1.000000000000000000e+00 +-2.972615315865212438e+00,8.548556374628065058e+00,0.000000000000000000e+00 
+-6.349823013235987190e+00,-5.438540972618046254e+00,1.000000000000000000e+00 +-7.456398521719602712e+00,-6.124718367450190826e+00,1.000000000000000000e+00 +3.821658152994628743e+00,4.065556959626192679e+00,2.000000000000000000e+00 +4.627632063381186711e+00,1.075915312454900352e+00,2.000000000000000000e+00 +-3.398712052678273476e+00,8.198475843232882809e+00,0.000000000000000000e+00 +-3.499733948183438415e+00,8.447988398595549953e+00,0.000000000000000000e+00 +-3.580090121113862267e+00,9.496758543441506717e+00,0.000000000000000000e+00 +-6.049291374607024707e+00,-7.736193419184814069e+00,1.000000000000000000e+00 +-2.295103878922546414e+00,7.768547349486333076e+00,0.000000000000000000e+00 +-8.394818253349821902e+00,-5.513235325831422173e+00,1.000000000000000000e+00 +-2.281737688448620904e+00,1.032142888248074897e+01,0.000000000000000000e+00 +-6.122638574505918641e+00,-7.802274917453572378e+00,1.000000000000000000e+00 +4.884845407336824152e+00,1.466226508569602238e+00,2.000000000000000000e+00 +-6.986657551105827757e+00,-7.915351915695320706e+00,1.000000000000000000e+00 +4.981634812005260926e+00,3.849340523156618232e+00,2.000000000000000000e+00 +5.906789985414723887e+00,1.265500218321951253e+00,2.000000000000000000e+00 +-2.251647232329985648e+00,8.939840212432153876e+00,0.000000000000000000e+00 +-7.367233415223763515e+00,-7.312667781095567143e+00,1.000000000000000000e+00 +4.525338990975483533e+00,3.210985995914193758e+00,2.000000000000000000e+00 +-2.543909392757993437e+00,7.845608090578789273e+00,0.000000000000000000e+00 +-2.147802017544336195e+00,1.055232269466429074e+01,0.000000000000000000e+00 +-6.808060953931877712e+00,-7.357767040041062856e+00,1.000000000000000000e+00 +4.154515288398997974e+00,2.055043823327054486e+00,2.000000000000000000e+00 +-6.542024529076067907e+00,-7.291986559398414336e+00,1.000000000000000000e+00 +6.225895652373453437e+00,7.353541851138829522e-01,2.000000000000000000e+00 
+4.715683394421827934e+00,1.296007972428620203e+00,2.000000000000000000e+00 +-6.887599832467887317e+00,-5.400165454385920327e+00,1.000000000000000000e+00 +-6.513028945054421648e+00,-7.819989379603302204e+00,1.000000000000000000e+00 +-1.031303578311234093e+00,8.496015909924674148e+00,0.000000000000000000e+00 +-5.700330007087443640e+00,-6.812591111865837767e+00,1.000000000000000000e+00 +5.154926522534148958e+00,5.825901174595452758e+00,2.000000000000000000e+00 +-6.485175048772973128e+00,-7.301094074096209141e+00,1.000000000000000000e+00 +-1.545821493808428482e+00,9.427067055134820350e+00,0.000000000000000000e+00 +4.753396181479349281e+00,2.635300358461778458e+00,2.000000000000000000e+00 +-2.969836394012537628e+00,1.007140835441723681e+01,0.000000000000000000e+00 +-6.644012633042704508e+00,-6.109244399388980007e+00,1.000000000000000000e+00 +6.772912210884367568e+00,2.108188441823011239e-02,2.000000000000000000e+00 +5.539478711661351973e+00,2.280469204817341389e+00,2.000000000000000000e+00 +-3.800746382696032377e+00,-5.760534681841369853e+00,1.000000000000000000e+00 +-7.128591339630343526e+00,-5.908538642321591539e+00,1.000000000000000000e+00 +3.741464164879743315e+00,2.465088855447237659e+00,2.000000000000000000e+00 +3.921434614975665589e+00,1.759722532228884750e+00,2.000000000000000000e+00 +-6.168012313062380514e+00,-8.004751685113815185e+00,1.000000000000000000e+00 +-8.583009630506424514e+00,-6.935657292172565214e+00,1.000000000000000000e+00 +-3.571501336778855062e+00,9.487878558833502396e+00,0.000000000000000000e+00 +5.945357643382430446e+00,1.994173525573491146e+00,2.000000000000000000e+00 +-5.821202704301682296e+00,-8.638849079699060241e+00,1.000000000000000000e+00 +-7.579352699143855787e+00,-6.666129682541724222e+00,1.000000000000000000e+00 +-2.035959998479205169e+00,8.941457215541449344e+00,0.000000000000000000e+00 +-2.901305776184907703e+00,7.550771180066202959e+00,0.000000000000000000e+00 
+-6.609170365371431544e+00,-6.930347702725083714e+00,1.000000000000000000e+00 +-8.947069291191146689e+00,-6.969229632788734641e+00,1.000000000000000000e+00 +3.880746174674403193e+00,2.123563470416939492e+00,2.000000000000000000e+00 +-3.109836312971554939e+00,8.722592378405044755e+00,0.000000000000000000e+00 +5.819318956949388166e+00,1.503994031836027201e+00,2.000000000000000000e+00 +-3.522028743387173755e+00,9.328533460793595466e+00,0.000000000000000000e+00 +-2.581207744633084111e+00,1.001781902609034525e+01,0.000000000000000000e+00 +-6.378710003526888883e+00,-7.857664838074497560e+00,1.000000000000000000e+00 +-2.177934191649186335e+00,9.989831255320680725e+00,0.000000000000000000e+00 +5.590302674414151518e+00,1.396266028278328797e+00,2.000000000000000000e+00 +-6.043935079086128148e+00,-8.009816447933564731e+00,1.000000000000000000e+00 +-5.711845129491463169e+00,-6.625688749974733227e+00,1.000000000000000000e+00 +-6.434231119079936168e-01,9.488119049110109060e+00,0.000000000000000000e+00 +6.405333076509197809e+00,2.378151394901687699e+00,2.000000000000000000e+00 +-3.886866991009841232e+00,8.076461088283199530e+00,0.000000000000000000e+00 +-8.549032472272642735e+00,-6.336749400896011686e+00,1.000000000000000000e+00 +-2.545023662162701594e+00,1.057892978401232753e+01,0.000000000000000000e+00 +-6.400647365404878109e+00,-6.546447487988998226e+00,1.000000000000000000e+00 +-1.593795505350676045e+00,9.343037237858005994e+00,0.000000000000000000e+00 +-3.038957826819788988e+00,9.527553561311677299e+00,0.000000000000000000e+00 +-7.433276496498452346e+00,-8.077987485864795758e+00,1.000000000000000000e+00 +-7.947247620533864243e+00,-7.022489078297240006e+00,1.000000000000000000e+00 +-2.249314828804326538e+00,9.796108999975631448e+00,0.000000000000000000e+00 +-7.642886347693787386e+00,-8.684991693940466106e+00,1.000000000000000000e+00 +-6.466192287927569282e+00,-5.003313780717880910e+00,1.000000000000000000e+00 
+4.164933525067144870e+00,1.319840451367020107e+00,2.000000000000000000e+00 +-2.151410262704466891e+00,9.575070654566555817e+00,0.000000000000000000e+00 +4.431756585870826548e+00,1.480168749281899121e+00,2.000000000000000000e+00 +-1.718165676009703269e+00,8.104898673403582166e+00,0.000000000000000000e+00 +-3.348415146275388832e+00,8.705073752347107785e+00,0.000000000000000000e+00 +-2.267235351486716066e+00,7.101005883540523200e+00,0.000000000000000000e+00 +-2.165579333484288771e+00,7.251245972835587139e+00,0.000000000000000000e+00 +-2.258704772706873420e+00,9.360734337695296503e+00,0.000000000000000000e+00 +6.793061293739658169e+00,1.205822121052682494e+00,2.000000000000000000e+00 +-8.278194764970411512e+00,-6.317140356585375649e+00,1.000000000000000000e+00 +5.326139026602614734e+00,3.604538127510803491e-01,2.000000000000000000e+00 +-3.292450915388987376e+00,8.692224611992646288e+00,0.000000000000000000e+00 +-3.317691225945937905e+00,8.512529084613785102e+00,0.000000000000000000e+00 +-2.441669418364826427e+00,7.589537941984865199e+00,0.000000000000000000e+00 +-2.522694847790684314e+00,7.956575199242420737e+00,0.000000000000000000e+00 +4.838938531801571408e+00,1.372952806781937429e+00,2.000000000000000000e+00 +-7.916873345477726254e+00,-7.070448271359555115e+00,1.000000000000000000e+00 +2.926744307137223888e+00,3.327042058106144840e+00,2.000000000000000000e+00 +-8.750419112177125314e+00,-7.231623077317255621e+00,1.000000000000000000e+00 +3.633861454728399387e+00,7.589810711529998422e-01,2.000000000000000000e+00 +5.159225350469273330e+00,3.505908596943309696e+00,2.000000000000000000e+00 +4.863971318038518454e+00,1.985762084722526799e+00,2.000000000000000000e+00 +-1.106403312116650994e+00,7.612435065406041090e+00,0.000000000000000000e+00 +-6.303070228095503325e+00,-6.568859438732410183e+00,1.000000000000000000e+00 +-6.547313179171678321e+00,-7.628596129832500239e+00,1.000000000000000000e+00 +-7.007544782632036728e+00,-7.835650033876372156e+00,1.000000000000000000e+00 
+-6.265460491107845087e+00,-6.122601883228641739e+00,1.000000000000000000e+00 +-3.186119623358708797e+00,9.625962417039190200e+00,0.000000000000000000e+00 +-5.842087246893370889e+00,-7.390125992130693433e+00,1.000000000000000000e+00 +-7.409884809523711091e+00,-7.672982425538291018e+00,1.000000000000000000e+00 +-1.426146379877473169e+00,1.006808818023322516e+01,0.000000000000000000e+00 +-4.427968838351791447e+00,8.987772252749104851e+00,0.000000000000000000e+00 +-2.417436846517247773e+00,7.026717213597429179e+00,0.000000000000000000e+00 +-4.234115455565783392e+00,8.451998598957349174e+00,0.000000000000000000e+00 +-3.987719613420177556e+00,8.294441919803613672e+00,0.000000000000000000e+00 +5.144866115208558632e+00,2.838924878110853367e+00,2.000000000000000000e+00 +4.387310684834941021e+00,7.253865019758825028e-01,2.000000000000000000e+00 +-7.149502126444641448e+00,-7.858873309058253653e+00,1.000000000000000000e+00 +-2.504084166410289303e+00,8.779698994823174729e+00,0.000000000000000000e+00 +-7.393494108487963956e+00,-7.939323115164897970e+00,1.000000000000000000e+00 +-2.978672008987702124e+00,9.556846171784286526e+00,0.000000000000000000e+00 +-2.754585739055620763e+00,8.260549963840832177e+00,0.000000000000000000e+00 +-4.818879266269282979e+00,-5.124768750832742192e+00,1.000000000000000000e+00 +-2.422150554814578971e+00,8.715278777732454074e+00,0.000000000000000000e+00 +3.120508870274087965e+00,1.488935611074480692e+00,2.000000000000000000e+00 +3.924575126968133265e+00,2.652767432875407838e+00,2.000000000000000000e+00 +5.452740955067061357e+00,2.602798525864344015e+00,2.000000000000000000e+00 +-2.232506823722731237e+00,9.841469377234345117e+00,0.000000000000000000e+00 +5.917543732016525837e+00,1.381598295104902174e+00,2.000000000000000000e+00 +-3.189222344631240880e+00,9.246539825359324283e+00,0.000000000000000000e+00 +-3.417221698573960964e+00,7.601982426863029829e+00,0.000000000000000000e+00 
+-4.914902058234880577e+00,-6.844846041304218254e+00,1.000000000000000000e+00 +4.920870703963133863e+00,1.350470164120138206e+00,2.000000000000000000e+00 +-8.413741361886891923e+00,-5.602432771377437781e+00,1.000000000000000000e+00 +-2.213077345988174294e+00,9.275341400378211532e+00,0.000000000000000000e+00 +3.439582429172324929e+00,1.638668448099783514e+00,2.000000000000000000e+00 +4.621365700235711138e+00,1.684511045020593567e+00,2.000000000000000000e+00 +-7.410128338761797551e+00,-7.455927833920626746e+00,1.000000000000000000e+00 +-6.241034732373895721e+00,-8.541629655544905830e+00,1.000000000000000000e+00 +-3.393055059253883066e+00,9.168011234143849109e+00,0.000000000000000000e+00 +-5.128942727142494107e+00,9.836188632573545476e+00,0.000000000000000000e+00 +-7.755245444536027044e+00,-8.262909324240283127e+00,1.000000000000000000e+00 +-5.328475215628746930e+00,-6.764434958983088109e+00,1.000000000000000000e+00 +-5.678413268987325679e+00,-7.288184966297498235e+00,1.000000000000000000e+00 +-2.450988904606750118e+00,7.871315830367698219e+00,0.000000000000000000e+00 +-6.008502487719577623e+00,-7.206133125443788146e+00,1.000000000000000000e+00 +4.168840530609778661e+00,2.205219621298368349e+00,2.000000000000000000e+00 +-2.448967413111723612e+00,1.147752824068360766e+01,0.000000000000000000e+00 +-1.987256057435852430e+00,9.311270801431508204e+00,0.000000000000000000e+00 +-1.695680405683080316e+00,7.783421811764366538e+00,0.000000000000000000e+00 +-1.366374808537729635e+00,9.766219160885095008e+00,0.000000000000000000e+00 +-3.428621857286553443e+00,1.056422053321586141e+01,0.000000000000000000e+00 +3.800156994047325210e+00,1.373777038496709846e+00,2.000000000000000000e+00 +-2.955712575119771479e+00,9.870684922521792970e+00,0.000000000000000000e+00 +3.658370185180150447e+00,2.435273158204002808e+00,2.000000000000000000e+00 +4.664389010487044018e+00,2.471167975186181920e+00,2.000000000000000000e+00 
+5.512199472948779544e+00,2.156511689679083688e+00,2.000000000000000000e+00 +-2.773854456290706150e+00,1.173445529478794036e+01,0.000000000000000000e+00 +6.762035033240734627e+00,3.005634944491879068e+00,2.000000000000000000e+00 +5.263998653280256512e+00,2.601515193205012011e+00,2.000000000000000000e+00 +-6.793037403678369834e+00,-7.035786828668026516e+00,1.000000000000000000e+00 +-3.053580347577932841e+00,9.125208717908186884e+00,0.000000000000000000e+00 +-7.542250950097116657e+00,-6.309510924682787625e+00,1.000000000000000000e+00 +6.272290140159736183e+00,5.430283059800993239e-01,2.000000000000000000e+00 +5.210769346921268586e+00,3.108735324121330912e+00,2.000000000000000000e+00 +-9.351271691278558507e+00,-7.677004848746423527e+00,1.000000000000000000e+00 +4.964045188716543322e+00,1.843026629573047526e+00,2.000000000000000000e+00 +-1.043548854131196135e+00,8.788509827711786571e+00,0.000000000000000000e+00 +1.398611496159028800e+00,9.487820426064421664e-01,2.000000000000000000e+00 +4.199834349531117894e+00,2.103910261226823231e+00,2.000000000000000000e+00 +-6.942306288424441973e+00,-5.924967272774708249e+00,1.000000000000000000e+00 +-5.251011645579978016e+00,-8.260211051490838230e+00,1.000000000000000000e+00 +3.814381639435589832e+00,1.651783842287738668e+00,2.000000000000000000e+00 +-8.486073511408843473e+00,-6.676645957408723575e+00,1.000000000000000000e+00 +3.909512204510964928e+00,2.189628273522707058e+00,2.000000000000000000e+00 +3.319645629207458981e+00,3.804628449795085743e+00,2.000000000000000000e+00 +4.620862628325412835e+00,9.706403193029231602e-01,2.000000000000000000e+00 +6.783822925553426586e+00,2.607088706258743116e+00,2.000000000000000000e+00 +-3.483879293280071732e+00,9.801370731940773240e+00,0.000000000000000000e+00 +-6.697760936092774564e+00,-6.631889006975610457e+00,1.000000000000000000e+00 +-1.770731043057339749e+00,9.185654409388291697e+00,0.000000000000000000e+00 +-2.624845905440990723e+00,8.713182432609032801e+00,0.000000000000000000e+00 
+3.817658440661670038e+00,2.216856895432644414e+00,2.000000000000000000e+00 +4.050514079283889401e+00,2.822771780961756516e+00,2.000000000000000000e+00 +-1.696671800658552165e+00,1.037052615676914513e+01,0.000000000000000000e+00 +4.950786401826105632e+00,3.448525900890284213e+00,2.000000000000000000e+00 +-7.865353237486814031e+00,-6.376063077758102438e+00,1.000000000000000000e+00 +-7.526200075393796318e+00,-7.961657596890341360e+00,1.000000000000000000e+00 +4.736874801220819720e+00,2.568326709377645400e+00,2.000000000000000000e+00 +-2.147561598005116146e+00,8.369166373593197150e+00,0.000000000000000000e+00 +-2.409546257965109017e+00,8.510810474082122212e+00,0.000000000000000000e+00 +-7.844550651731374558e+00,-6.194058133277507316e+00,1.000000000000000000e+00 +6.091022444023143301e+00,2.932440510025938973e+00,2.000000000000000000e+00 +3.378994881893055968e+00,2.891031630995508195e+00,2.000000000000000000e+00 +-6.831105563206443243e+00,-7.711059709686984398e+00,1.000000000000000000e+00 +-5.377270139055242204e+00,-6.806014812856171048e+00,1.000000000000000000e+00 +-6.246845325044985131e+00,-4.609416735471550730e+00,1.000000000000000000e+00 +-6.302555063970729954e+00,-7.083154979318939226e+00,1.000000000000000000e+00 +-3.746148333930832131e+00,7.693829515114044781e+00,0.000000000000000000e+00 +-7.154678888302914430e+00,-9.182030758011531901e+00,1.000000000000000000e+00 +-6.050221609967780800e+00,-9.091244902283831308e+00,1.000000000000000000e+00 +2.515983111918294490e+00,1.447414662259971063e+00,2.000000000000000000e+00 +-7.635977936435573099e+00,-8.302363302873621009e+00,1.000000000000000000e+00 +-8.184096691656122857e+00,-6.210437044445908050e+00,1.000000000000000000e+00 +-2.496195731174843058e+00,1.046782020535563795e+01,0.000000000000000000e+00 +3.847358097795400944e+00,1.858433242473833014e+00,2.000000000000000000e+00 +-7.323920451227381889e+00,-6.502809100231094597e+00,1.000000000000000000e+00 
+-5.192485556078705322e+00,-5.998469836326496107e+00,1.000000000000000000e+00 +4.324609591587755375e+00,2.732138904433999649e+00,2.000000000000000000e+00 +-2.586299332466854395e+00,9.355438103014964923e+00,0.000000000000000000e+00 +-1.687137463058260067e+00,1.091107911085226867e+01,0.000000000000000000e+00 +-5.953449643619628695e+00,-4.970692952805816134e+00,1.000000000000000000e+00 +-2.851912139579519501e+00,8.212008858976702186e+00,0.000000000000000000e+00 +-8.062885703817045169e+00,-8.919341771036046751e+00,1.000000000000000000e+00 +4.685450676131915237e+00,1.321569336334914802e+00,2.000000000000000000e+00 +5.321831807523064839e+00,1.662902927347275961e+00,2.000000000000000000e+00 +-7.531463298953429586e+00,-6.832710921959532335e+00,1.000000000000000000e+00 +4.618977242263953009e+00,2.090497067249514007e+00,2.000000000000000000e+00 +-5.234659477649985959e+00,-7.129145632832324608e+00,1.000000000000000000e+00 +-6.945706989798586584e+00,-8.091125793038402847e+00,1.000000000000000000e+00 +-6.589852334254857169e+00,-4.804708794630507818e+00,1.000000000000000000e+00 +4.962597396566191144e+00,1.145938740388408927e+00,2.000000000000000000e+00 +5.797989709728168961e+00,2.764832377903667648e+00,2.000000000000000000e+00 +-1.883530275287744082e+00,8.157128571782038762e+00,0.000000000000000000e+00 +-5.356503113881612599e+00,-6.341199549591287621e+00,1.000000000000000000e+00 +3.045451177433734280e+00,1.373794660986959126e+00,2.000000000000000000e+00 +5.330022827939213670e+00,1.571949212054895684e+00,2.000000000000000000e+00 +4.645122535946284437e+00,2.020150277705473840e+00,2.000000000000000000e+00 +-6.619904689429787936e+00,-7.784426218380355422e+00,1.000000000000000000e+00 +5.186976217398139077e+00,1.770977031506837829e+00,2.000000000000000000e+00 +-6.508481317779961195e+00,-7.484094779991766977e+00,1.000000000000000000e+00 +4.531118687771243714e+00,2.374881406039673237e+00,2.000000000000000000e+00 
+-7.472021115390139023e+00,-7.744100362955762762e+00,1.000000000000000000e+00 +6.954537402901610044e+00,1.059044913489839423e-01,2.000000000000000000e+00 +6.829681769445773654e+00,1.164871398585580531e+00,2.000000000000000000e+00 +-6.552699817387107828e+00,-7.099210122084810948e+00,1.000000000000000000e+00 +3.712948364650018540e+00,1.913644327878931906e+00,2.000000000000000000e+00 +-3.837383671951180908e+00,9.211147364067445054e+00,0.000000000000000000e+00 +-8.358213436931110962e+00,-5.736355550069017539e+00,1.000000000000000000e+00 +-2.216125149754069046e+00,8.299934710171953611e+00,0.000000000000000000e+00 +-2.732660408378601247e+00,9.728286622290413632e+00,0.000000000000000000e+00 +-1.478198100556799233e+00,9.945566247314520325e+00,0.000000000000000000e+00 +-6.802258883503651710e+00,-7.741393794604210399e+00,1.000000000000000000e+00 +-4.059861054118883317e+00,9.082849103004349445e+00,0.000000000000000000e+00 +5.465295185216131557e+00,2.786679319941370636e+00,2.000000000000000000e+00 +3.978092371459713394e+00,2.825603018736956074e+00,2.000000000000000000e+00 +-6.759331559439370807e+00,-6.365670759217197272e+00,1.000000000000000000e+00 +4.709680921218120098e+00,1.587856087078971745e+00,2.000000000000000000e+00 +5.387172441351363084e+00,2.583539949374197064e+00,2.000000000000000000e+00 +-2.701558587833872593e+00,9.315833470531934779e+00,0.000000000000000000e+00 +-9.299848075453587271e-01,9.781720857351229981e+00,0.000000000000000000e+00 +4.737554934776933457e+00,1.200159900085265630e+00,2.000000000000000000e+00 +4.167946970438667798e+00,3.062120280908097847e+00,2.000000000000000000e+00 +5.154914103436761152e+00,2.486955634852940911e+00,2.000000000000000000e+00 +-6.780294885722044640e+00,-6.128722469904158032e+00,1.000000000000000000e+00 +-5.873334381936829551e+00,-7.457001462799095037e+00,1.000000000000000000e+00 +-7.149034025595828012e+00,-6.162567337479984531e+00,1.000000000000000000e+00 +-3.615532597058778386e+00,7.818079504117650735e+00,0.000000000000000000e+00 
+-6.230117218422199787e-01,9.188863941030160021e+00,0.000000000000000000e+00 +-3.355991341121155269e+00,7.499438903512457344e+00,0.000000000000000000e+00 +3.867053621690529575e+00,1.736351077200723125e+00,2.000000000000000000e+00 +5.083698264374329590e+00,2.747803737370068777e+00,2.000000000000000000e+00 +6.081152125294217115e+00,5.373075327612926166e-01,2.000000000000000000e+00 +3.191794494730777032e+00,5.657059095641767676e-01,2.000000000000000000e+00 +-6.541130783656855741e+00,-7.295397507176748064e+00,1.000000000000000000e+00 +4.704158855323564481e+00,8.954249060114258807e-01,2.000000000000000000e+00 +-6.234251241566122204e+00,-5.511478035743597736e+00,1.000000000000000000e+00 +5.593880599721304137e+00,2.624560935246529780e+00,2.000000000000000000e+00 +4.488093741192518138e+00,2.561486890425308527e+00,2.000000000000000000e+00 +-6.495561742211962475e+00,-6.912804341370039296e+00,1.000000000000000000e+00 +-2.185113653657955179e+00,8.629203847782004999e+00,0.000000000000000000e+00 +4.189813364748857794e+00,2.596019616288230747e+00,2.000000000000000000e+00 +5.803042588383060973e+00,1.983402744960319097e+00,2.000000000000000000e+00 +-2.728869510890262085e+00,9.371398699710068669e+00,0.000000000000000000e+00 +-7.118575238017680107e+00,-7.787673255317544729e+00,1.000000000000000000e+00 +-3.660191200475052753e+00,9.389984146543993049e+00,0.000000000000000000e+00 +3.810883825306029316e+00,1.412988643743762429e+00,2.000000000000000000e+00 +-4.116680857613977729e+00,9.198919986730626164e+00,0.000000000000000000e+00 +-6.861208811961718723e+00,-5.203672281000663702e+00,1.000000000000000000e+00 +-6.010021271045610014e+00,-5.524471734470996154e+00,1.000000000000000000e+00 diff --git a/exercises/Numpy_KMeansClustering/sample-data/coords-with-labels-4.dat b/exercises/Numpy_KMeansClustering/sample-data/coords-with-labels-4.dat new file mode 100644 index 0000000000000000000000000000000000000000..dd089173f784c08f0c5a2437d703834aa8193780 --- /dev/null +++ 
b/exercises/Numpy_KMeansClustering/sample-data/coords-with-labels-4.dat @@ -0,0 +1,405 @@ +# SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +# SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> +# +# SPDX-License-Identifier: CC0-1.0 + +-1.011875710913229298e+01,9.078317097483067144e+00,2.000000000000000000e+00 +-5.128942727142494107e+00,9.836188632573545476e+00,1.000000000000000000e+00 +-9.084070820721954931e+00,7.050799345751032732e+00,2.000000000000000000e+00 +5.614998569645852200e+00,1.826112302438593460e+00,3.000000000000000000e+00 +5.210769346921268586e+00,3.108735324121330912e+00,3.000000000000000000e+00 +-3.292450915388987376e+00,8.692224611992646288e+00,1.000000000000000000e+00 +-2.035959998479205169e+00,8.941457215541449344e+00,1.000000000000000000e+00 +-8.320668736166886958e+00,6.597779102345237234e+00,2.000000000000000000e+00 +-2.422150554814578971e+00,8.715278777732454074e+00,1.000000000000000000e+00 +6.091022444023143301e+00,2.932440510025938973e+00,3.000000000000000000e+00 +-8.667462318508068364e+00,7.139539579146023662e+00,2.000000000000000000e+00 +3.712948364650018540e+00,1.913644327878931906e+00,3.000000000000000000e+00 +-6.956303260160976443e+00,8.668942961653680612e+00,2.000000000000000000e+00 +3.924575126968133265e+00,2.652767432875407838e+00,3.000000000000000000e+00 +-7.367233415223763515e+00,-7.312667781095567143e+00,0.000000000000000000e+00 +-6.831105563206443243e+00,-7.711059709686984398e+00,0.000000000000000000e+00 +-6.378710003526888883e+00,-7.857664838074497560e+00,0.000000000000000000e+00 +-2.624845905440990723e+00,8.713182432609032801e+00,1.000000000000000000e+00 +4.050514079283889401e+00,2.822771780961756516e+00,3.000000000000000000e+00 +-6.589852334254857169e+00,-4.804708794630507818e+00,0.000000000000000000e+00 
+4.167946970438667798e+00,3.062120280908097847e+00,3.000000000000000000e+00 +-8.512194734393892404e+00,6.072409339113399973e+00,2.000000000000000000e+00 +-3.053580347577932841e+00,9.125208717908186884e+00,1.000000000000000000e+00 +1.398611496159028800e+00,9.487820426064421664e-01,3.000000000000000000e+00 +-2.701558587833872593e+00,9.315833470531934779e+00,1.000000000000000000e+00 +-9.628802212081321699e+00,7.794991272634699264e+00,2.000000000000000000e+00 +-6.466192287927569282e+00,-5.003313780717880910e+00,0.000000000000000000e+00 +-2.412120073704709711e+00,9.982931118731210418e+00,1.000000000000000000e+00 +-7.635977936435573099e+00,-8.302363302873621009e+00,0.000000000000000000e+00 +-8.912761185736057357e+00,7.944195013049380805e+00,2.000000000000000000e+00 +-8.582298022322135012e+00,8.306213899444216509e+00,2.000000000000000000e+00 +5.917543732016525837e+00,1.381598295104902174e+00,3.000000000000000000e+00 +-7.809172119310366256e+00,7.796120397911746380e+00,2.000000000000000000e+00 +-1.024583945135383090e+01,6.545706227907827746e+00,2.000000000000000000e+00 +-2.988371860898040300e+00,8.828627151534504947e+00,1.000000000000000000e+00 +-6.168012313062380514e+00,-8.004751685113815185e+00,0.000000000000000000e+00 +6.405333076509197809e+00,2.378151394901687699e+00,3.000000000000000000e+00 +4.525338990975483533e+00,3.210985995914193758e+00,3.000000000000000000e+00 +-4.116680857613977729e+00,9.198919986730626164e+00,1.000000000000000000e+00 +-5.192485556078705322e+00,-5.998469836326496107e+00,0.000000000000000000e+00 +-1.366374808537729635e+00,9.766219160885095008e+00,1.000000000000000000e+00 +-1.039495693015991407e+01,7.929532866844342998e+00,2.000000000000000000e+00 +-8.642482501538328421e+00,6.345150137883670993e+00,2.000000000000000000e+00 +-6.400647365404878109e+00,-6.546447487988998226e+00,0.000000000000000000e+00 +4.189813364748857794e+00,2.596019616288230747e+00,3.000000000000000000e+00 +4.737554934776933457e+00,1.200159900085265630e+00,3.000000000000000000e+00 
+-1.545821493808428482e+00,9.427067055134820350e+00,1.000000000000000000e+00 +-7.844550651731374558e+00,-6.194058133277507316e+00,0.000000000000000000e+00 +-2.851912139579519501e+00,8.212008858976702186e+00,1.000000000000000000e+00 +2.515983111918294490e+00,1.447414662259971063e+00,3.000000000000000000e+00 +-3.038957826819788988e+00,9.527553561311677299e+00,1.000000000000000000e+00 +-8.873316247132972734e+00,9.094323551134213091e+00,2.000000000000000000e+00 +-8.184096691656122857e+00,-6.210437044445908050e+00,0.000000000000000000e+00 +5.465295185216131557e+00,2.786679319941370636e+00,3.000000000000000000e+00 +5.263998653280256512e+00,2.601515193205012011e+00,3.000000000000000000e+00 +-8.728932962031118237e+00,8.049289539397395998e+00,2.000000000000000000e+00 +5.154926522534148958e+00,5.825901174595452758e+00,3.000000000000000000e+00 +-2.773854456290706150e+00,1.173445529478794036e+01,1.000000000000000000e+00 +-9.199293922455090922e+00,8.482852718862950780e+00,2.000000000000000000e+00 +5.052810290503725987e+00,1.409445131136757290e+00,3.000000000000000000e+00 +-5.821202704301682296e+00,-8.638849079699060241e+00,0.000000000000000000e+00 +3.921434614975665589e+00,1.759722532228884750e+00,3.000000000000000000e+00 +-7.902649363488548850e+00,8.595078010492862575e+00,2.000000000000000000e+00 +-6.942306288424441973e+00,-5.924967272774708249e+00,0.000000000000000000e+00 +-1.695680405683080316e+00,7.783421811764366538e+00,1.000000000000000000e+00 +-1.696671800658552165e+00,1.037052615676914513e+01,1.000000000000000000e+00 +-8.062885703817045169e+00,-8.919341771036046751e+00,0.000000000000000000e+00 +-8.549032472272642735e+00,-6.336749400896011686e+00,0.000000000000000000e+00 +-4.818879266269282979e+00,-5.124768750832742192e+00,0.000000000000000000e+00 +-8.127367788432518836e+00,7.767786226984743081e+00,2.000000000000000000e+00 +-3.398712052678273476e+00,8.198475843232882809e+00,1.000000000000000000e+00 
+-6.302555063970729954e+00,-7.083154979318939226e+00,0.000000000000000000e+00 +-8.015157172674095776e+00,7.396840882687106600e+00,2.000000000000000000e+00 +3.867053621690529575e+00,1.736351077200723125e+00,3.000000000000000000e+00 +-8.908493468094658141e+00,5.662561981982712211e+00,2.000000000000000000e+00 +-6.434231119079936168e-01,9.488119049110109060e+00,1.000000000000000000e+00 +-9.565464932460878700e+00,7.076004279947198050e+00,2.000000000000000000e+00 +-8.583009630506424514e+00,-6.935657292172565214e+00,0.000000000000000000e+00 +-6.230117218422199787e-01,9.188863941030160021e+00,1.000000000000000000e+00 +-2.232506823722731237e+00,9.841469377234345117e+00,1.000000000000000000e+00 +-7.154678888302914430e+00,-9.182030758011531901e+00,0.000000000000000000e+00 +-1.593795505350676045e+00,9.343037237858005994e+00,1.000000000000000000e+00 +-8.357318524899296719e+00,7.547406939777834722e+00,2.000000000000000000e+00 +-9.463146334331689502e+00,7.349613965709536956e+00,2.000000000000000000e+00 +-7.128591339630343526e+00,-5.908538642321591539e+00,0.000000000000000000e+00 +5.497538459430121094e+00,1.813231153977304944e+00,3.000000000000000000e+00 +-9.069262286844688603e+00,8.019729280312121844e+00,2.000000000000000000e+00 +-7.410128338761797551e+00,-7.455927833920626746e+00,0.000000000000000000e+00 +-2.336016697201568348e+00,9.399603507927158930e+00,1.000000000000000000e+00 +-7.338988090691514365e+00,-7.729953962740738760e+00,0.000000000000000000e+00 +5.186976217398139077e+00,1.770977031506837829e+00,3.000000000000000000e+00 +-2.450988904606750118e+00,7.871315830367698219e+00,1.000000000000000000e+00 +-9.612116955739583801e+00,6.078868212187286346e+00,2.000000000000000000e+00 +-4.914902058234880577e+00,-6.844846041304218254e+00,0.000000000000000000e+00 +4.164933525067144870e+00,1.319840451367020107e+00,3.000000000000000000e+00 +-9.362848022915784441e+00,7.812897476726621271e+00,2.000000000000000000e+00 
+-6.989371661690665150e+00,8.450087945046460547e+00,2.000000000000000000e+00 +-7.149034025595828012e+00,-6.162567337479984531e+00,0.000000000000000000e+00 +-6.991955240842099961e+00,-7.101079192809169882e+00,0.000000000000000000e+00 +-2.496195731174843058e+00,1.046782020535563795e+01,1.000000000000000000e+00 +-5.700330007087443640e+00,-6.812591111865837767e+00,0.000000000000000000e+00 +3.978092371459713394e+00,2.825603018736956074e+00,3.000000000000000000e+00 +-3.924568365103164425e+00,8.593640805432961827e+00,1.000000000000000000e+00 +-8.917751726329123940e+00,-7.888195904193350927e+00,0.000000000000000000e+00 +-8.358213436931110962e+00,-5.736355550069017539e+00,0.000000000000000000e+00 +-6.265460491107845087e+00,-6.122601883228641739e+00,0.000000000000000000e+00 +5.387172441351363084e+00,2.583539949374197064e+00,3.000000000000000000e+00 +-7.947247620533864243e+00,-7.022489078297240006e+00,0.000000000000000000e+00 +4.431756585870826548e+00,1.480168749281899121e+00,3.000000000000000000e+00 +-3.483879293280071732e+00,9.801370731940773240e+00,1.000000000000000000e+00 +-2.295103878922546414e+00,7.768547349486333076e+00,1.000000000000000000e+00 +-7.914300737429109667e+00,7.138620779055713683e+00,2.000000000000000000e+00 +4.838938531801571408e+00,1.372952806781937429e+00,3.000000000000000000e+00 +5.083698264374329590e+00,2.747803737370068777e+00,3.000000000000000000e+00 +-8.640242995868154807e+00,7.179162503574760379e+00,2.000000000000000000e+00 +-8.245226498667626913e+00,7.013976476184712538e+00,2.000000000000000000e+00 +-6.278243218367215661e+00,7.227463015774053368e+00,2.000000000000000000e+00 +4.531118687771243714e+00,2.374881406039673237e+00,3.000000000000000000e+00 +-9.383246843444959850e+00,7.722659029850774459e+00,2.000000000000000000e+00 +-3.211250716930102556e+00,8.686623981600552824e+00,1.000000000000000000e+00 +-7.531463298953429586e+00,-6.832710921959532335e+00,0.000000000000000000e+00 
+4.199834349531117894e+00,2.103910261226823231e+00,3.000000000000000000e+00 +-8.183962100281952701e+00,7.267938244588248331e+00,2.000000000000000000e+00 +-3.746148333930832131e+00,7.693829515114044781e+00,1.000000000000000000e+00 +-8.660626755702756085e+00,5.988178556788602336e+00,2.000000000000000000e+00 +-6.945706989798586584e+00,-8.091125793038402847e+00,0.000000000000000000e+00 +-7.393494108487963956e+00,-7.939323115164897970e+00,0.000000000000000000e+00 +3.658370185180150447e+00,2.435273158204002808e+00,3.000000000000000000e+00 +-9.351271691278558507e+00,-7.677004848746423527e+00,0.000000000000000000e+00 +-1.043548854131196135e+00,8.788509827711786571e+00,1.000000000000000000e+00 +-4.427968838351791447e+00,8.987772252749104851e+00,1.000000000000000000e+00 +-2.955712575119771479e+00,9.870684922521792970e+00,1.000000000000000000e+00 +-6.547313179171678321e+00,-7.628596129832500239e+00,0.000000000000000000e+00 +-9.551173539313174032e+00,7.429953143190600073e+00,2.000000000000000000e+00 +-7.579352699143855787e+00,-6.666129682541724222e+00,0.000000000000000000e+00 +-9.201939968849865537e+00,7.266577291777635672e+00,2.000000000000000000e+00 +-1.770731043057339749e+00,9.185654409388291697e+00,1.000000000000000000e+00 +-6.697760936092774564e+00,-6.631889006975610457e+00,0.000000000000000000e+00 +-3.186119623358708797e+00,9.625962417039190200e+00,1.000000000000000000e+00 +-6.050221609967780800e+00,-9.091244902283831308e+00,0.000000000000000000e+00 +-3.571501336778855062e+00,9.487878558833502396e+00,1.000000000000000000e+00 +-7.530269760273096580e+00,-7.367234977040642896e+00,0.000000000000000000e+00 +-8.750419112177125314e+00,-7.231623077317255621e+00,0.000000000000000000e+00 +4.950786401826105632e+00,3.448525900890284213e+00,3.000000000000000000e+00 +-9.299848075453587271e-01,9.781720857351229981e+00,1.000000000000000000e+00 +-8.742206979695026803e+00,6.861247626793661070e+00,2.000000000000000000e+00 
+-6.241034732373895721e+00,-8.541629655544905830e+00,0.000000000000000000e+00 +-3.522028743387173755e+00,9.328533460793595466e+00,1.000000000000000000e+00 +-8.458129905630046963e+00,7.934108660782526634e+00,2.000000000000000000e+00 +5.154914103436761152e+00,2.486955634852940911e+00,3.000000000000000000e+00 +-9.850432131896177168e+00,5.668666243632934254e+00,2.000000000000000000e+00 +-2.969836394012537628e+00,1.007140835441723681e+01,1.000000000000000000e+00 +-6.552699817387107828e+00,-7.099210122084810948e+00,0.000000000000000000e+00 +5.906789985414723887e+00,1.265500218321951253e+00,3.000000000000000000e+00 +-6.542024529076067907e+00,-7.291986559398414336e+00,0.000000000000000000e+00 +-2.177934191649186335e+00,9.989831255320680725e+00,1.000000000000000000e+00 +3.880746174674403193e+00,2.123563470416939492e+00,3.000000000000000000e+00 +-2.586299332466854395e+00,9.355438103014964923e+00,1.000000000000000000e+00 +4.387310684834941021e+00,7.253865019758825028e-01,3.000000000000000000e+00 +-2.147561598005116146e+00,8.369166373593197150e+00,1.000000000000000000e+00 +2.614736249570494220e+00,2.159623998710159754e+00,3.000000000000000000e+00 +-2.216125149754069046e+00,8.299934710171953611e+00,1.000000000000000000e+00 +4.964045188716543322e+00,1.843026629573047526e+00,3.000000000000000000e+00 +-8.810009380505549714e+00,7.353279054994448671e+00,2.000000000000000000e+00 +5.144866115208558632e+00,2.838924878110853367e+00,3.000000000000000000e+00 +-8.408709537503424869e+00,7.531210602661814413e+00,2.000000000000000000e+00 +-7.409884809523711091e+00,-7.672982425538291018e+00,0.000000000000000000e+00 +-9.696685536817224005e+00,8.023832794907693966e+00,2.000000000000000000e+00 +-8.205920017580488945e+00,8.296077365125432479e+00,2.000000000000000000e+00 +-8.004405602087105720e+00,7.782702994727140222e+00,2.000000000000000000e+00 +-8.877882910492665758e+00,8.005023612871328353e+00,2.000000000000000000e+00 +-3.348415146275388832e+00,8.705073752347107785e+00,1.000000000000000000e+00 
+-6.802258883503651710e+00,-7.741393794604210399e+00,0.000000000000000000e+00 +-7.007544782632036728e+00,-7.835650033876372156e+00,0.000000000000000000e+00 +-8.413741361886891923e+00,-5.602432771377437781e+00,0.000000000000000000e+00 +4.704158855323564481e+00,8.954249060114258807e-01,3.000000000000000000e+00 +5.803042588383060973e+00,1.983402744960319097e+00,3.000000000000000000e+00 +-2.147802017544336195e+00,1.055232269466429074e+01,1.000000000000000000e+00 +6.225895652373453437e+00,7.353541851138829522e-01,3.000000000000000000e+00 +-7.172853312173433693e+00,8.337892980516834029e+00,2.000000000000000000e+00 +-5.953449643619628695e+00,-4.970692952805816134e+00,0.000000000000000000e+00 +3.120508870274087965e+00,1.488935611074480692e+00,3.000000000000000000e+00 +5.330022827939213670e+00,1.571949212054895684e+00,3.000000000000000000e+00 +4.715683394421827934e+00,1.296007972428620203e+00,3.000000000000000000e+00 +-6.049291374607024707e+00,-7.736193419184814069e+00,0.000000000000000000e+00 +4.620862628325412835e+00,9.706403193029231602e-01,3.000000000000000000e+00 +-2.522694847790684314e+00,7.956575199242420737e+00,1.000000000000000000e+00 +-2.670483334718759316e+00,9.418336985012860652e+00,1.000000000000000000e+00 +-6.010021271045610014e+00,-5.524471734470996154e+00,0.000000000000000000e+00 +-6.759331559439370807e+00,-6.365670759217197272e+00,0.000000000000000000e+00 +3.847358097795400944e+00,1.858433242473833014e+00,3.000000000000000000e+00 +-9.919384297044272714e+00,8.376675768831606916e+00,2.000000000000000000e+00 +-8.278537308704970954e+00,8.404303641053324725e+00,2.000000000000000000e+00 +4.618977242263953009e+00,2.090497067249514007e+00,3.000000000000000000e+00 +-6.861208811961718723e+00,-5.203672281000663702e+00,0.000000000000000000e+00 +-8.947069291191146689e+00,-6.969229632788734641e+00,0.000000000000000000e+00 +-8.996335655214998894e+00,6.896641845551283012e+00,2.000000000000000000e+00 
+-7.323920451227381889e+00,-6.502809100231094597e+00,0.000000000000000000e+00 +-9.787726645467953901e+00,9.955904980336093502e+00,2.000000000000000000e+00 +-8.871081026852008833e+00,6.780098144364938406e+00,2.000000000000000000e+00 +2.926744307137223888e+00,3.327042058106144840e+00,3.000000000000000000e+00 +-2.165579333484288771e+00,7.251245972835587139e+00,1.000000000000000000e+00 +-6.485175048772973128e+00,-7.301094074096209141e+00,0.000000000000000000e+00 +-1.350602044045346117e+00,8.193603809846610631e+00,1.000000000000000000e+00 +-1.922340529252479779e+00,1.120474175400829964e+01,1.000000000000000000e+00 +5.321831807523064839e+00,1.662902927347275961e+00,3.000000000000000000e+00 +-9.411989763516245944e+00,6.776663974258310574e+00,2.000000000000000000e+00 +-3.189222344631240880e+00,9.246539825359324283e+00,1.000000000000000000e+00 +-9.181015350716359436e+00,6.952082049502911865e+00,2.000000000000000000e+00 +-7.916873345477726254e+00,-7.070448271359555115e+00,0.000000000000000000e+00 +-2.249314828804326538e+00,9.796108999975631448e+00,1.000000000000000000e+00 +4.627632063381186711e+00,1.075915312454900352e+00,3.000000000000000000e+00 +5.326139026602614734e+00,3.604538127510803491e-01,3.000000000000000000e+00 +-1.883530275287744082e+00,8.157128571782038762e+00,1.000000000000000000e+00 +5.590302674414151518e+00,1.396266028278328797e+00,3.000000000000000000e+00 +-8.278194764970411512e+00,-6.317140356585375649e+00,0.000000000000000000e+00 +-7.149502126444641448e+00,-7.858873309058253653e+00,0.000000000000000000e+00 +5.593880599721304137e+00,2.624560935246529780e+00,3.000000000000000000e+00 +-3.837383671951180908e+00,9.211147364067445054e+00,1.000000000000000000e+00 +3.909512204510964928e+00,2.189628273522707058e+00,3.000000000000000000e+00 +-5.678413268987325679e+00,-7.288184966297498235e+00,0.000000000000000000e+00 +-2.213077345988174294e+00,9.275341400378211532e+00,1.000000000000000000e+00 
+-3.110904235282147212e+00,1.086656431270725953e+01,1.000000000000000000e+00 +3.319645629207458981e+00,3.804628449795085743e+00,3.000000000000000000e+00 +-8.394818253349821902e+00,-5.513235325831422173e+00,0.000000000000000000e+00 +-3.109836312971554939e+00,8.722592378405044755e+00,1.000000000000000000e+00 +-6.513028945054421648e+00,-7.819989379603302204e+00,0.000000000000000000e+00 +-3.700501120255398568e+00,9.670839736832151701e+00,1.000000000000000000e+00 +-5.328475215628746930e+00,-6.764434958983088109e+00,0.000000000000000000e+00 +-2.978672008987702124e+00,9.556846171784286526e+00,1.000000000000000000e+00 +-6.609170365371431544e+00,-6.930347702725083714e+00,0.000000000000000000e+00 +-2.754585739055620763e+00,8.260549963840832177e+00,1.000000000000000000e+00 +3.821658152994628743e+00,4.065556959626192679e+00,3.000000000000000000e+00 +-9.761561002746914184e+00,5.971838309882369522e+00,2.000000000000000000e+00 +-8.724100107973971063e+00,7.473824676960580504e+00,2.000000000000000000e+00 +-7.456398521719602712e+00,-6.124718367450190826e+00,0.000000000000000000e+00 +-6.264967953386149979e+00,7.382741349513191054e+00,2.000000000000000000e+00 +-8.819893823570616576e+00,7.671104620860374368e+00,2.000000000000000000e+00 +4.753396181479349281e+00,2.635300358461778458e+00,3.000000000000000000e+00 +-7.118575238017680107e+00,-7.787673255317544729e+00,0.000000000000000000e+00 +-2.185113653657955179e+00,8.629203847782004999e+00,1.000000000000000000e+00 +-2.581207744633084111e+00,1.001781902609034525e+01,1.000000000000000000e+00 +-8.824398464723063995e+00,7.299397828388699772e+00,2.000000000000000000e+00 +-1.718165676009703269e+00,8.104898673403582166e+00,1.000000000000000000e+00 +-6.780294885722044640e+00,-6.128722469904158032e+00,0.000000000000000000e+00 +-5.234659477649985959e+00,-7.129145632832324608e+00,0.000000000000000000e+00 +6.081152125294217115e+00,5.373075327612926166e-01,3.000000000000000000e+00 
+-3.417221698573960964e+00,7.601982426863029829e+00,1.000000000000000000e+00 +3.800156994047325210e+00,1.373777038496709846e+00,3.000000000000000000e+00 +3.814381639435589832e+00,1.651783842287738668e+00,3.000000000000000000e+00 +5.945357643382430446e+00,1.994173525573491146e+00,3.000000000000000000e+00 +4.981634812005260926e+00,3.849340523156618232e+00,3.000000000000000000e+00 +-3.886866991009841232e+00,8.076461088283199530e+00,1.000000000000000000e+00 +-6.541130783656855741e+00,-7.295397507176748064e+00,0.000000000000000000e+00 +4.863971318038518454e+00,1.985762084722526799e+00,3.000000000000000000e+00 +-6.122638574505918641e+00,-7.802274917453572378e+00,0.000000000000000000e+00 +-2.901305776184907703e+00,7.550771180066202959e+00,1.000000000000000000e+00 +4.884845407336824152e+00,1.466226508569602238e+00,3.000000000000000000e+00 +-1.148929756502902322e+01,8.415029767421165374e+00,2.000000000000000000e+00 +-9.093304974056865220e+00,8.827515904081391085e+00,2.000000000000000000e+00 +-3.355991341121155269e+00,7.499438903512457344e+00,1.000000000000000000e+00 +3.817658440661670038e+00,2.216856895432644414e+00,3.000000000000000000e+00 +5.452740955067061357e+00,2.602798525864344015e+00,3.000000000000000000e+00 +-8.558359130316189223e+00,6.198033868200326424e+00,2.000000000000000000e+00 +-9.078653154794144697e+00,6.948702107949105589e+00,2.000000000000000000e+00 +-3.800746382696032377e+00,-5.760534681841369853e+00,0.000000000000000000e+00 +-6.043935079086128148e+00,-8.009816447933564731e+00,0.000000000000000000e+00 +-6.793037403678369834e+00,-7.035786828668026516e+00,0.000000000000000000e+00 +-2.543909392757993437e+00,7.845608090578789273e+00,1.000000000000000000e+00 +5.512199472948779544e+00,2.156511689679083688e+00,3.000000000000000000e+00 +6.783822925553426586e+00,2.607088706258743116e+00,3.000000000000000000e+00 +-7.865353237486814031e+00,-6.376063077758102438e+00,0.000000000000000000e+00 +-6.644012633042704508e+00,-6.109244399388980007e+00,0.000000000000000000e+00 
+4.685450676131915237e+00,1.321569336334914802e+00,3.000000000000000000e+00 +-1.687137463058260067e+00,1.091107911085226867e+01,1.000000000000000000e+00 +-6.008502487719577623e+00,-7.206133125443788146e+00,0.000000000000000000e+00 +-2.417436846517247773e+00,7.026717213597429179e+00,1.000000000000000000e+00 +-2.732660408378601247e+00,9.728286622290413632e+00,1.000000000000000000e+00 +4.621365700235711138e+00,1.684511045020593567e+00,3.000000000000000000e+00 +-7.689054430350334535e+00,6.620346490372815751e+00,2.000000000000000000e+00 +-2.728869510890262085e+00,9.371398699710068669e+00,1.000000000000000000e+00 +6.762035033240734627e+00,3.005634944491879068e+00,3.000000000000000000e+00 +-6.246845325044985131e+00,-4.609416735471550730e+00,0.000000000000000000e+00 +-1.012828865637706421e+01,6.028444143435087277e+00,2.000000000000000000e+00 +-1.478198100556799233e+00,9.945566247314520325e+00,1.000000000000000000e+00 +-3.317691225945937905e+00,8.512529084613785102e+00,1.000000000000000000e+00 +-8.782602844347316307e+00,8.417714433969651466e+00,2.000000000000000000e+00 +-8.216517794418813025e+00,5.753298195608246957e+00,2.000000000000000000e+00 +-8.875962459060858123e+00,8.426824797515225285e+00,2.000000000000000000e+00 +5.819318956949388166e+00,1.503994031836027201e+00,3.000000000000000000e+00 +4.709680921218120098e+00,1.587856087078971745e+00,3.000000000000000000e+00 +-8.651560992158932706e+00,6.568139983145380612e+00,2.000000000000000000e+00 +3.378994881893055968e+00,2.891031630995508195e+00,3.000000000000000000e+00 +4.324609591587755375e+00,2.732138904433999649e+00,3.000000000000000000e+00 +-5.873334381936829551e+00,-7.457001462799095037e+00,0.000000000000000000e+00 +-9.272823984068326197e+00,7.014350792030064063e+00,2.000000000000000000e+00 +-2.281737688448620904e+00,1.032142888248074897e+01,1.000000000000000000e+00 +-3.580090121113862267e+00,9.496758543441506717e+00,1.000000000000000000e+00 +3.439582429172324929e+00,1.638668448099783514e+00,3.000000000000000000e+00 
+-9.413965582873784044e+00,7.445532730144064359e+00,2.000000000000000000e+00 +-9.814201009613343629e+00,8.377164712106543121e+00,2.000000000000000000e+00 +3.633861454728399387e+00,7.589810711529998422e-01,3.000000000000000000e+00 +-8.530525987743951433e+00,5.613354522842077365e+00,2.000000000000000000e+00 +-5.251011645579978016e+00,-8.260211051490838230e+00,0.000000000000000000e+00 +4.996894322193148774e+00,1.280260088680077679e+00,3.000000000000000000e+00 +-8.116655692592775750e+00,6.194471144281473940e+00,2.000000000000000000e+00 +5.159225350469273330e+00,3.505908596943309696e+00,3.000000000000000000e+00 +-9.361050777155050184e+00,8.372532141335591760e+00,2.000000000000000000e+00 +-1.153521439957758155e+01,7.269228048980891366e+00,2.000000000000000000e+00 +-3.987719613420177556e+00,8.294441919803613672e+00,1.000000000000000000e+00 +-1.426146379877473169e+00,1.006808818023322516e+01,1.000000000000000000e+00 +-2.441669418364826427e+00,7.589537941984865199e+00,1.000000000000000000e+00 +-9.378087436945371280e+00,6.545218190096390387e+00,2.000000000000000000e+00 +-5.377270139055242204e+00,-6.806014812856171048e+00,0.000000000000000000e+00 +-2.504084166410289303e+00,8.779698994823174729e+00,1.000000000000000000e+00 +-2.448967413111723612e+00,1.147752824068360766e+01,1.000000000000000000e+00 +-8.507169629034432745e+00,-6.832024646614564212e+00,0.000000000000000000e+00 +-4.234115455565783392e+00,8.451998598957349174e+00,1.000000000000000000e+00 +-6.619904689429787936e+00,-7.784426218380355422e+00,0.000000000000000000e+00 +-7.542250950097116657e+00,-6.309510924682787625e+00,0.000000000000000000e+00 +-7.526200075393796318e+00,-7.961657596890341360e+00,0.000000000000000000e+00 +-6.495561742211962475e+00,-6.912804341370039296e+00,0.000000000000000000e+00 +-1.053079238635083037e+01,8.853073234959316196e+00,2.000000000000000000e+00 +-9.948903602101839994e+00,9.075793358922325638e+00,2.000000000000000000e+00 
+4.154515288398997974e+00,2.055043823327054486e+00,3.000000000000000000e+00 +-6.887599832467887317e+00,-5.400165454385920327e+00,0.000000000000000000e+00 +-7.245141129996612861e+00,6.812307239067518339e+00,2.000000000000000000e+00 +-4.059861054118883317e+00,9.082849103004349445e+00,1.000000000000000000e+00 +3.810883825306029316e+00,1.412988643743762429e+00,3.000000000000000000e+00 +-1.987256057435852430e+00,9.311270801431508204e+00,1.000000000000000000e+00 +-1.106403312116650994e+00,7.612435065406041090e+00,1.000000000000000000e+00 +-7.642886347693787386e+00,-8.684991693940466106e+00,0.000000000000000000e+00 +-2.251647232329985648e+00,8.939840212432153876e+00,1.000000000000000000e+00 +4.645122535946284437e+00,2.020150277705473840e+00,3.000000000000000000e+00 +-1.092025716451973238e+01,9.019979283788741142e+00,2.000000000000000000e+00 +-6.986657551105827757e+00,-7.915351915695320706e+00,0.000000000000000000e+00 +-7.592242564138381056e+00,5.250132683090553698e+00,2.000000000000000000e+00 +-2.409546257965109017e+00,8.510810474082122212e+00,1.000000000000000000e+00 +-3.660191200475052753e+00,9.389984146543993049e+00,1.000000000000000000e+00 +-9.449845559627958025e+00,5.916861818650480664e+00,2.000000000000000000e+00 +-9.097919107999615562e+00,5.820379962380597405e+00,2.000000000000000000e+00 +-5.842087246893370889e+00,-7.390125992130693433e+00,0.000000000000000000e+00 +3.045451177433734280e+00,1.373794660986959126e+00,3.000000000000000000e+00 +-3.499733948183438415e+00,8.447988398595549953e+00,1.000000000000000000e+00 +-8.627310289433392398e+00,7.226809803628310824e+00,2.000000000000000000e+00 +6.829681769445773654e+00,1.164871398585580531e+00,3.000000000000000000e+00 +6.772912210884367568e+00,2.108188441823011239e-02,3.000000000000000000e+00 +-9.919391084235908096e+00,7.939458522442967237e+00,2.000000000000000000e+00 +-6.392575777019184002e+00,7.452744097473930296e+00,2.000000000000000000e+00 +-2.258704772706873420e+00,9.360734337695296503e+00,1.000000000000000000e+00 
+3.741464164879743315e+00,2.465088855447237659e+00,3.000000000000000000e+00 +4.422197633000880757e+00,3.071946535927922106e+00,3.000000000000000000e+00 +-7.900043950660013081e+00,6.807478187281329696e+00,2.000000000000000000e+00 +5.539478711661351973e+00,2.280469204817341389e+00,3.000000000000000000e+00 +-9.107216447190840114e+00,6.216997006757033262e+00,2.000000000000000000e+00 +4.736874801220819720e+00,2.568326709377645400e+00,3.000000000000000000e+00 +4.920870703963133863e+00,1.350470164120138206e+00,3.000000000000000000e+00 +-2.972615315865212438e+00,8.548556374628065058e+00,1.000000000000000000e+00 +4.962597396566191144e+00,1.145938740388408927e+00,3.000000000000000000e+00 +-3.428621857286553443e+00,1.056422053321586141e+01,1.000000000000000000e+00 +-1.006045556552795617e+01,8.036521345671090444e+00,2.000000000000000000e+00 +-1.018651317874172335e+01,8.066787009521418028e+00,2.000000000000000000e+00 +-7.740040556435222818e+00,-7.264665137505772030e+00,0.000000000000000000e+00 +-8.486073511408843473e+00,-6.676645957408723575e+00,0.000000000000000000e+00 +-1.067920198796765519e+01,6.043945948763001397e+00,2.000000000000000000e+00 +-6.508481317779961195e+00,-7.484094779991766977e+00,0.000000000000000000e+00 +6.793061293739658169e+00,1.205822121052682494e+00,3.000000000000000000e+00 +-6.303070228095503325e+00,-6.568859438732410183e+00,0.000000000000000000e+00 +-5.356503113881612599e+00,-6.341199549591287621e+00,0.000000000000000000e+00 +-9.174112455926138665e+00,8.992544440788096338e+00,2.000000000000000000e+00 +-3.615532597058778386e+00,7.818079504117650735e+00,1.000000000000000000e+00 +-1.061704800554028871e+01,8.819567226987885533e+00,2.000000000000000000e+00 +-8.566748919440636101e+00,6.046774339678393950e+00,2.000000000000000000e+00 +-8.130575821180535456e+00,6.761056139604435522e+00,2.000000000000000000e+00 +-3.393055059253883066e+00,9.168011234143849109e+00,1.000000000000000000e+00 
+-5.711845129491463169e+00,-6.625688749974733227e+00,0.000000000000000000e+00 +4.488093741192518138e+00,2.561486890425308527e+00,3.000000000000000000e+00 +-6.349823013235987190e+00,-5.438540972618046254e+00,0.000000000000000000e+00 +-7.755245444536027044e+00,-8.262909324240283127e+00,0.000000000000000000e+00 +-7.472021115390139023e+00,-7.744100362955762762e+00,0.000000000000000000e+00 +-2.545023662162701594e+00,1.057892978401232753e+01,1.000000000000000000e+00 +4.168840530609778661e+00,2.205219621298368349e+00,3.000000000000000000e+00 +-7.433276496498452346e+00,-8.077987485864795758e+00,0.000000000000000000e+00 +-9.827932576894591321e+00,7.197735995399055398e+00,2.000000000000000000e+00 +-9.465294814423778291e+00,9.135971473495631656e+00,2.000000000000000000e+00 +6.954537402901610044e+00,1.059044913489839423e-01,3.000000000000000000e+00 +-2.300334028047994916e+00,7.054616004318545741e+00,1.000000000000000000e+00 +5.797989709728168961e+00,2.764832377903667648e+00,3.000000000000000000e+00 +-6.234251241566122204e+00,-5.511478035743597736e+00,0.000000000000000000e+00 +-9.542671447178769029e+00,5.915061619135143722e+00,2.000000000000000000e+00 +-2.267235351486716066e+00,7.101005883540523200e+00,1.000000000000000000e+00 +-8.345009855755121109e+00,7.508359039193576834e+00,2.000000000000000000e+00 +-1.686652710949561040e+00,7.793442478227299297e+00,1.000000000000000000e+00 +-1.031303578311234093e+00,8.496015909924674148e+00,1.000000000000000000e+00 +-8.430075000921538830e+00,5.620939311260862326e+00,2.000000000000000000e+00 +4.664389010487044018e+00,2.471167975186181920e+00,3.000000000000000000e+00 +3.191794494730777032e+00,5.657059095641767676e-01,3.000000000000000000e+00 +-6.808060953931877712e+00,-7.357767040041062856e+00,0.000000000000000000e+00 +6.272290140159736183e+00,5.430283059800993239e-01,3.000000000000000000e+00 +-2.151410262704466891e+00,9.575070654566555817e+00,1.000000000000000000e+00 diff --git a/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb 
b/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..7e6721adaf54289d6706b560439e6a363709cff3 --- /dev/null +++ b/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb @@ -0,0 +1,448 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7114f455", + "metadata": {}, + "source": [ + "# Weather Data " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f361dfb4", + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "from matplotlib import pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "import io\n", + "import urllib.request\n", + "import zipfile\n", + "from pathlib import Path" + ] + }, + { + "cell_type": "markdown", + "id": "70f8451d", + "metadata": {}, + "source": [ + "## Downloading the weather dataset\n", + "\n", + "Use the function `download_and_extract_weatherdata` provided below to download the weather dataset. \n", + "\n", + "This function downloads a ZIP archive and unpacks its contents into the directory `./tmp` (if not specified differently). One of the files contained in the archive is `produkt_tu_stunde_19500101_20211231_01639.txt`. This file contains measurements of weather data (in particular temperature and relative humidity) from 1950 to the end of 2021." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6ec374f", + "metadata": {}, + "outputs": [], + "source": [ + "def download_and_extract_weatherdata(url: str, outdir: Path = Path('tmp')) -> None:\n", + " \"\"\"Download DWD climate data from url and extract it into outdir.\"\"\"\n", + " # Create the output directory for the extracted files\n", + " outdir.mkdir(exist_ok=True)\n", + " \n", + " # Retrieve the ZIP archive from the URL\n", + " response = urllib.request.urlopen(url)\n", + " \n", + " # Extract the ZIP archive into the output directory\n", + " z = zipfile.ZipFile(io.BytesIO(response.read()))\n", + " z.extractall(path=outdir)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62103fc3", + "metadata": {}, + "outputs": [], + "source": [ + "# Download the data and extract.\n", + "# The ZIP archive is fetched from the DWD open data server.\n", + "URL = (\n", + " 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/'\n", + " 'air_temperature/historical/stundenwerte_TU_01639_19500101_20211231_hist.zip'\n", + ")\n", + "TMP_DIRECTORY = Path(\"tmp\")\n", + "download_and_extract_weatherdata(url=URL, outdir=TMP_DIRECTORY)" + ] + }, + { + "cell_type": "markdown", + "id": "f0be4be5", + "metadata": {}, + "source": [ + "## Importing the measurement data\n", + "\n", + "The file `produkt_tu_stunde_19500101_20211231_01639.txt` is a CSV file (although the suffix `.txt` conveys something else).\n", + "\n", + "The single columns of the file have the following headers:\n", + "\n", + "```\n", + "STATIONS_ID;MESS_DATUM;QN_9;TT_TU;RF_TU;eor\n", + "```\n", + "\n", + "Load the content of this file into a `pd.DataFrame` by using a suitable function. Only import the columns \n", + "- `MESS_DATUM` (date of the measurement), \n", + "- `TT_TU` (measured temperature in ${}^{\\circ}\\text{C}$), and \n", + "- `RF_TU` (measured relative humidity). \n", + "\n", + "After having imported the data gather some `info`rmation on the data (e.g. datatypes of columns or memory usage)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f99173c3", + "metadata": {}, + "outputs": [], + "source": [ + "df_weather = pd.read_csv(\n", + " TMP_DIRECTORY / \"produkt_tu_stunde_19500101_20211231_01639.txt\",\n", + " delimiter=\";\",\n", + " usecols=[\"MESS_DATUM\", \"TT_TU\", \"RF_TU\"]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28ec724b", + "metadata": {}, + "outputs": [], + "source": [ + "df_weather.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a9448957", + "metadata": {}, + "outputs": [], + "source": [ + "df_weather.tail()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0377e45a", + "metadata": {}, + "outputs": [], + "source": [ + "df_weather.info(memory_usage=\"deep\")" + ] + }, + { + "cell_type": "markdown", + "id": "040a748e", + "metadata": {}, + "source": [ + "## Prepare the data for further analysis" + ] + }, + { + "cell_type": "markdown", + "id": "b111bfdc", + "metadata": {}, + "source": [ + "The column `MESS_DATUM` contains the date of each measurement in the format `%Y%m%d%H`. The datatype of this column is `int64`.\n", + "\n", + "Create a new `DataFrame` named `df_weather_tweaked` that is based on the original `df_weather` from above.\n", + "\n", + "Make the following modifications to the `df_weather` `DataFrame` to generate a new one that is then assigned to the `df_weather_tweaked` variable.\n", + "\n", + "**Note**: You will be using several methods of the `DataFrame` object. Consider *chaining* the calls to these methods to have a compact way to make the relevant transformations to the `df_weather` `DataFrame`.\n", + "\n", + "\n", + "### Changing the format of the measurement dates\n", + "\n", + "The `\"MESS_DATUM\"` column in the original `DataFrame` contains the dates of measurement as integer values. The format is `%Y%m%d%H` which is meant to represent \"YearMonthDayHour\". We would like to have these in a date-like format. 
\n", + "\n", + "To transfer these integer values to a suitable date format look at the [documentation of the `DataFrame` object](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/09_timeseries.html?highlight=datetime) to find a suitable function for making such a conversion.\n", + "\n", + "Use this function together with the `assign` method of a `DataFrame` instance to modify the \"`MESS_DATUM`\" column in an appropriate manner.\n", + "\n", + "### Rename the column headers\n", + "\n", + "Rename the following columns:\n", + "\n", + "* `\"MESS_DATUM\"` $\\to$ `\"Date of Measurement\"`\n", + "* `\"TT_TU\"` $\\to$ `\"Temperature\"`\n", + "* `\"RF_TU\"` $\\to$ `\"Humidity\"`\n", + "\n", + "\n", + "### Setting a new index \n", + "\n", + "Make the column named `\"Date of Measurement\"` the new index of the new `DataFrame` instance.\n", + "\n", + "### Note\n", + "\n", + "In all the following tasks you are supposed to work with the new modified `df_weather_tweaked`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8030b621", + "metadata": {}, + "outputs": [], + "source": [ + "df_weather_tweaked = (\n", + " df_weather\n", + " # Transform the integer dates to a date-like format\n", + " .assign(\n", + " MESS_DATUM=pd.to_datetime(\n", + " df_weather[\"MESS_DATUM\"].astype('str'), \n", + " format=\"%Y%m%d%H\"\n", + " )\n", + " )\n", + " # Rename the columns\n", + " .rename(\n", + " columns={\n", + " \"MESS_DATUM\": \"Date of Measurement\",\n", + " \"TT_TU\": \"Temperature\",\n", + " \"RF_TU\": \"Humidity\"\n", + " }\n", + " )\n", + " # Set a new index\n", + " .set_index(\"Date of Measurement\")\n", + " .astype(\n", + " {\n", + " \"Temperature\": np.float64,\n", + " \"Humidity\": np.float64\n", + " }\n", + " )\n", + ")\n", + "df_weather_tweaked" + ] + }, + { + "cell_type": "markdown", + "id": "4b681490", + "metadata": {}, + "source": [ + "## Clean up the dataset\n", + "\n", + "When measurements are taken over a long period of 
time it is quite likely that erroneous data sneaks into the dataset. We should remove such data from the `DataFrame`.\n", + "\n", + "Analyse the dataset in a suitable manner to investigate whether it contains measured values for the temperature and the relative humidity that seem unreasonable.\n", + "\n", + "- Plot the distribution of the temperature and the relative humidity. Look for suitable plotting methods under [`pandas.DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html).\n", + "- Determine the smallest (minimal) as well as the largest (maximal) value for each of the data columns.\n", + "- Remove all conspicuously small or large values from the dataset. Make sure not to generate a new `DataFrame` but rather to perform all adjustments with the already-existing one. Afterwards re-check your results to assure all " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2a28e418", + "metadata": {}, + "outputs": [], + "source": [ + "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=\"row\")\n", + "\n", + "ax1.set_xlabel(\"temperature / degree Celsius\")\n", + "df_weather_tweaked.plot.hist(ax=ax1, y=\"Temperature\", bins=50)\n", + "ax2.set_xlabel(\"relative humidity / %\")\n", + "df_weather_tweaked.plot.hist(ax=ax2, y=\"Humidity\", bins=50)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d16a3cfb", + "metadata": {}, + "outputs": [], + "source": [ + "df_weather_tweaked.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3eb31737", + "metadata": {}, + "outputs": [], + "source": [ + "# Drop all rows with a temperature or humidity below -998.9999\n", + "invalid_index = df_weather_tweaked.index[(df_weather_tweaked[\"Temperature\"] < -998.9999) | (df_weather_tweaked[\"Humidity\"] < -998.9999)]\n", + "df_weather_tweaked.drop(invalid_index, inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ee9be9c", + "metadata": {}, + "outputs": [], + "source": [ + 
"df_weather_tweaked.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "546c3127", + "metadata": {}, + "outputs": [], + "source": [ + "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))\n", + "\n", + "ax1.set_xlabel(\"temperature / degree Celsius\")\n", + "ax2.set_xlabel(\"relative humidity / %\")\n", + "\n", + "df_weather_tweaked[\"Temperature\"].value_counts().plot.line(ax=ax1, style=\"o\")\n", + "df_weather_tweaked[\"Humidity\"].value_counts().plot.line(ax=ax2, style=\"s\")" + ] + }, + { + "cell_type": "markdown", + "id": "de80c2cf", + "metadata": {}, + "source": [ + "## Analyse the data" + ] + }, + { + "cell_type": "markdown", + "id": "74dd1b57", + "metadata": {}, + "source": [ + "### Monthly distribution of temperature and humidity\n", + "\n", + "- Group the data by the month in which each measurement was taken. *Hint*: The `index` of the `DataFrame` has a `month` attribute.\n", + "\n", + "- Display the distribution of the temperature and the relative humidity for each month in a [violin plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html). The abscissa must show each month as an integer value while the ordinate must show the values for the temperature or the relative humidity, respectively. *Hint*: In order to extract the data from the subframes of the `DataFrameGroupBy` object you need to iterate over it in a suitable manner." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ceaaacae", + "metadata": {}, + "outputs": [], + "source": [ + "by_month = df_weather_tweaked.groupby(df_weather_tweaked.index.month)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9251d7e", + "metadata": {}, + "outputs": [], + "source": [ + "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 5), sharex=\"col\")\n", + "\n", + "ax2.set_xticks(range(1, 13))\n", + "ax2.set_xlabel(\"month\")\n", + "\n", + "ax1.set_ylabel(\"temperature / deg. 
C\")\n", + "ax1.violinplot([subframe[\"Temperature\"] for _, subframe in by_month]); # added to avoid verbose output\n", + "ax2.set_ylabel(\"rel. humidity / %\")\n", + "ax2.violinplot([subframe[\"Humidity\"] for _, subframe in by_month]); # added to avoid verbose output" + ] + }, + { + "cell_type": "markdown", + "id": "95b85f06", + "metadata": {}, + "source": [ + "### Yearly mean temperature\n", + "\n", + "- Group the data by the year in which the measurements have been conducted. Then use *two* different methods of your choice to compute the mean value of the temperatures in each subframe. The result is the average temperature for each year in the dataset.\n", + "\n", + "- Plot the results for the yearly averaged temperature in a suitable manner." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b750cd9d", + "metadata": {}, + "outputs": [], + "source": [ + "by_year = df_weather_tweaked.groupby(df_weather_tweaked.index.year)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a83a4e4b", + "metadata": {}, + "outputs": [], + "source": [ + "df_by_year_agg = by_year.agg([np.mean])\n", + "df_by_year_apply = by_year.apply(lambda x: x.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed4157a9", + "metadata": {}, + "outputs": [], + "source": [ + "df_by_year_apply[\"Temperature\"].plot.line(style=\"o\", xlabel=\"year\", ylabel=\"temperature / degree Celsius\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "489abbfd", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "toc": { + 
"base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": { + "height": "calc(100% - 180px)", + "left": "10px", + "top": "150px", + "width": "384px" + }, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb.license b/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb.license new file mode 100644 index 0000000000000000000000000000000000000000..c207ab8c094a9d18d7c6cb5c9dfbf8913df4aa8a --- /dev/null +++ b/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb.license @@ -0,0 +1,4 @@ +SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> + +SPDX-License-Identifier: MIT diff --git a/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb b/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..1ad3696e349ee55baf0a967e92e89f71b2ecbc89 --- /dev/null +++ b/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb @@ -0,0 +1,476 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7114f455", + "metadata": {}, + "source": [ + "# Weather Data " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f361dfb4", + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "from matplotlib import pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "import io\n", + "import urllib.request\n", + "import zipfile\n", + "from pathlib import Path" + ] + }, + { + "cell_type": "markdown", + 
"id": "70f8451d", + "metadata": {}, + "source": [ + "## Downloading the weather dataset\n", + "\n", + "Use the function `download_and_extract_weatherdata` provided below to download the weather dataset. \n", + "\n", + "This function downloads a ZIP archive and unpacks its contents into the directory `./tmp` (if not specified differently). One of the files contained in the archive is `produkt_tu_stunde_19500101_20211231_01639.txt`. This file contains measurements of weather data (in particular temperature and relative humidity) from 1950 to the end of 2021." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6ec374f", + "metadata": {}, + "outputs": [], + "source": [ + "def download_and_extract_weatherdata(url: str, outdir: Path = Path('tmp')) -> None:\n", + " \"\"\"Download DWD climate data from url and extract it into outdir.\"\"\"\n", + " # Create the output directory for the extracted files\n", + " outdir.mkdir(exist_ok=True)\n", + " \n", + " # Retrieve the ZIP archive from the URL\n", + " response = urllib.request.urlopen(url)\n", + " \n", + " # Extract the ZIP archive into the output directory\n", + " z = zipfile.ZipFile(io.BytesIO(response.read()))\n", + " z.extractall(path=outdir)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62103fc3", + "metadata": {}, + "outputs": [], + "source": [ + "# Download the data and extract.\n", + "# The ZIP archive is fetched from the DWD open data server.\n", + "URL = (\n", + " 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/'\n", + " 'air_temperature/historical/stundenwerte_TU_01639_19500101_20211231_hist.zip'\n", + ")\n", + "TMP_DIRECTORY = Path(\"tmp\")\n", + "\n", + "download_and_extract_weatherdata(url=URL, outdir=TMP_DIRECTORY)" + ] + }, + { + "cell_type": "markdown", + "id": "f0be4be5", + "metadata": {}, + "source": [ + "## Importing the measurement data\n", + "\n", + "The file `produkt_tu_stunde_19500101_20211231_01639.txt` is a CSV file (although the suffix `.txt` conveys 
something else).\n", + "\n", + "The single columns of the file have the following headers:\n", + "\n", + "```\n", + "STATIONS_ID;MESS_DATUM;QN_9;TT_TU;RF_TU;eor\n", + "```\n", + "\n", + "Load the content of this file into a `pd.DataFrame` by using a suitable function. Only import the columns \n", + "- `MESS_DATUM` (date of the measurement), \n", + "- `TT_TU` (measured temperature in ${}^{\\circ}\\text{C}$), and \n", + "- `RF_TU` (measured relative humidity). \n", + "\n", + "After having imported the data gather some `info`rmation on the data (e.g. datatypes of columns or memory usage)." + ] + }, + { + "cell_type": "markdown", + "id": "06461f7c", + "metadata": {}, + "source": [ + "Import the data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f99173c3", + "metadata": {}, + "outputs": [], + "source": [ + "### YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "65a852ea", + "metadata": {}, + "source": [ + "Inspect the first few lines of the `DataFrame`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28ec724b", + "metadata": {}, + "outputs": [], + "source": [ + "### YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "f05b58c7", + "metadata": {}, + "source": [ + "Inspect the last lines of the `DataFrame`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a9448957", + "metadata": {}, + "outputs": [], + "source": [ + "### YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "379a7374", + "metadata": {}, + "source": [ + "What is the memory usage of the current `DataFrame` instance?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0377e45a", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "040a748e", + "metadata": {}, + "source": [ + "## Prepare the data for further analysis" + ] + }, + { + "cell_type": "markdown", + "id": "b111bfdc", + "metadata": {}, + "source": [ + "The column `MESS_DATUM` contains the date of each measurement in the format `%Y%m%d%H`. The datatype of this column is `int64`.\n", + "\n", + "Create a new `DataFrame` named `df_weather_tweaked` that is based on the original `df_weather` from above.\n", + "\n", + "Make the following modifications to the `df_weather` `DataFrame` to generate a new one that is then assigned to the `df_weather_tweaked` variable.\n", + "\n", + "**Note**: You will be using several methods of the `DataFrame` object. Consider *chaining* the calls to these methods to have a compact way to make the relevant transformations to the `df_weather` `DataFrame`.\n", + "\n", + "\n", + "### Changing the format of the measurement dates\n", + "\n", + "The `\"MESS_DATUM\"` column in the original `DataFrame` contains the dates of measurement as integer values. The format is `%Y%m%d%H` which is meant to represent \"YearMonthDayHour\". We would like to have these in a date-like format. 
\n", + "\n", + "To convert these integer values to a suitable date format, look at the [pandas time series tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/09_timeseries.html?highlight=datetime) to find a suitable function for making such a conversion.\n", + "\n", + "Use this function together with the `assign` method of a `DataFrame` instance to modify the \"`MESS_DATUM`\" column in an appropriate manner.\n", + "\n", + "### Rename the column headers\n", + "\n", + "Rename the following columns:\n", + "\n", + "* `\"MESS_DATUM\"` $\\to$ `\"Date of Measurement\"`\n", + "* `\"TT_TU\"` $\\to$ `\"Temperature\"`\n", + "* `\"RF_TU\"` $\\to$ `\"Humidity\"`\n", + "\n", + "\n", + "### Setting a new index \n", + "\n", + "Make the column named `\"Date of Measurement\"` the new index of the new `DataFrame` instance.\n", + "\n", + "### Note\n", + "\n", + "In all the following tasks you are supposed to work with the new, modified `df_weather_tweaked`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8030b621", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "4b681490", + "metadata": {}, + "source": [ + "## Clean up the dataset\n", + "\n", + "When measurements are taken over a long period of time it is quite likely that erroneous data sneaks into the dataset. We should remove such data from the `DataFrame`.\n", + "\n", + "Analyse the dataset in a suitable manner to check whether the measured values for the temperature and the relative humidity seem reasonable.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "1c6d2689", + "metadata": {}, + "source": [ + "- Plot the distribution of the temperature and the relative humidity. Look for suitable methods of the [`pandas.DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) accessor." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2a28e418", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "d9c3c339", + "metadata": {}, + "source": [ + "- Determine the smallest (minimal) as well as the largest (maximal) value for each of the data columns.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d16a3cfb", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "e63a133a", + "metadata": {}, + "source": [ + "- Remove all conspicuously small or large values from the dataset. Make sure not to generate a new `DataFrame` but rather to perform all adjustments with the already-existing one. Afterwards, re-check your results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3eb31737", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ee9be9c", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "546c3127", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "de80c2cf", + "metadata": {}, + "source": [ + "## Analyse the data" + ] + }, + { + "cell_type": "markdown", + "id": "74dd1b57", + "metadata": {}, + "source": [ + "### Monthly distribution of temperature and humidity\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "7bee5b7d", + "metadata": {}, + "source": [ + "- Group the data by the month in which each measurement has been conducted. 
*Hint*: The `index` of the `DataFrame` has a `month` attribute.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ceaaacae", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "28c23a20", + "metadata": {}, + "source": [ + "- Display the distribution of the temperature and the relative humidity for each month in a [violin plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html). The abscissa must show each month as an integer value while the ordinate must show the values for the temperature or the relative humidity, respectively. *Hint*: In order to extract the data from the subframes of the `DataFrameGroupBy` object you need to iterate over it in a suitable manner." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9251d7e", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "95b85f06", + "metadata": {}, + "source": [ + "### Yearly mean temperature" + ] + }, + { + "cell_type": "markdown", + "id": "44fbb458", + "metadata": {}, + "source": [ + "\n", + "- Group the data by the year in which the measurements have been conducted. Then use *two* different methods of your choice to compute the mean value of the temperatures in each subframe. The result is the average temperature for each year in the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b750cd9d", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a83a4e4b", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "id": "c3000dae", + "metadata": {}, + "source": [ + "- Plot the results for the yearly averaged temperature in a suitable manner." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed4157a9", + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9bcce04b", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": { + "height": "calc(100% - 180px)", + "left": "10px", + "top": "150px", + "width": "384px" + }, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb.license b/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb.license new file mode 100644 index 0000000000000000000000000000000000000000..c207ab8c094a9d18d7c6cb5c9dfbf8913df4aa8a --- /dev/null +++ b/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb.license @@ -0,0 +1,4 @@ +SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> + +SPDX-License-Identifier: MIT diff --git a/slides/Day1.ipynb b/slides/Day1.ipynb new file mode 100644 index 
0000000000000000000000000000000000000000..ed50bae8d42016e63fb438057cd83ff94fe5de87 --- /dev/null +++ b/slides/Day1.ipynb @@ -0,0 +1,3911 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d93c06a4", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# HiPerCH 14 Module 1: Introduction to Python Data Processing tools" + ] + }, + { + "cell_type": "markdown", + "id": "09758895", + "metadata": { + "slideshow": { + "slide_type": "notes" + } + }, + "source": [ + "Notiz: \",\" removes the ? icon" + ] + }, + { + "cell_type": "markdown", + "id": "094990cd", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Content\n", + "- Basic overview of NumPy\n", + " - datatypes\n", + " - array-oriented programming\n", + " - linear algebra\n", + "- Tabulated data: Pandas\n", + " - adding semantic information\n", + " - reading, transforming, and plotting data\n", + " - grouping and aggregation" + ] + }, + { + "cell_type": "markdown", + "id": "7a816a80", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "## Formats\n", + "### Presentation\n", + "- zoom conference with screenshare\n", + "- notebooks available for download\n", + "\n", + "### Small practical demonstrations\n", + "- integrated into presentation\n", + "- ~10 minutes assigned per demo\n", + "\n", + "### take-home exercises\n", + "- simple practical projects\n", + "- demonstration of common methods\n", + "- self-study during or after the workshop" + ] + }, + { + "cell_type": "markdown", + "id": "42c8b5d5", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Agenda for today\n", + "- 09:00 - 12:00 Morning session\n", + " - Introduction to Numpy\n", + " - Arrays, datatypes, array access\n", + "- 12:00 - 13:00 Lunch break\n", + "- 13:00 - 17:00 Afternoon session\n", + " - Broadcasting and universal functions\n", + " - **Hands on Exercises**" + ] + }, + { + "cell_type": 
"markdown", + "id": "9e6a61ae", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "## Agenda for tomorrow\n", + "- 09:00 - 12:00 Morning session\n", + " - Introduction to Pandas\n", + " - Usage of Pandas `Dataframe`s\n", + "- 12:00 - 13:00 Lunch break\n", + "- 13:00 - 17:00 Afternoon session\n", + " - Some more `DataFrame`s\n", + " - **Hands on Exercises**\n" + ] + }, + { + "cell_type": "markdown", + "id": "791069bd", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Why Use Python for scientific computing?\n", + "* Python is easy to learn\n", + "* Fast prototyping\n", + "* Excellent for interactive exploratory work (e.g. Jupyther Notebook)\n", + "* Many of different scientifc modules / libraries available\n", + "* Efficient calculations are possible\n", + "\n", + "$\\Rightarrow$ Python code is glue-code between \"high-performance\" languages (C/C++, Fortran, ...)" + ] + }, + { + "cell_type": "markdown", + "id": "425fd375", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Setting up: environment creation and validation" + ] + }, + { + "cell_type": "markdown", + "id": "b98d9e52", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Environment Creation" + ] + }, + { + "cell_type": "markdown", + "id": "71cd16fb", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Python Anaconda distribution\n", + " \n", + "#### From the command line\n", + "To create a conda environment, execute the following command from the command line:\n", + "```bash\n", + "$ cd /path/to/course/directory # make sure to navigate to the course directory first!\n", + "$ conda env create -f environment.yml\n", + "```\n", + "\n", + "Afterwards, activate it:\n", + "```bash\n", + "$ conda activate scipython\n", + "```\n", + "\n", + "#### From the Anaconda Navigator (e.g. 
Windows)\n", + "Follow the instructions at https://docs.anaconda.com/anaconda/navigator/tutorials/manage-environments/#id6 and use the provided `environment.yml` file." + ] + }, + { + "cell_type": "markdown", + "id": "6c2c10b0", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Python virtual environments\n", + "For creating a Python-native virtual environment, open a terminal emulator and execute the following commands: \n", + "```bash\n", + "$ cd /path/to/course/directory # make sure to navigate to the course directory first!\n", + "$ python3 -m venv .venv # creates a virtual environment\n", + "$ source .venv/bin/activate\n", + "$ pip3 install --upgrade pip\n", + "$ pip3 install -r requirements.txt \n", + "$ jupyter contrib nbextension install --sys-prefix\n", + "$ jupyter-nbextension install rise --py --sys-prefix\n", + "$ jupyter-nbextension enable rise --py --sys-prefix\n", + "\n", + "```\n", + "Please refer to [this link](https://docs.python.org/3/library/venv.html) for how to create a virtual environment on Windows." + ] + }, + { + "cell_type": "markdown", + "id": "0c8c43d2", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Starting the jupyter server\n", + "### From the command line\n", + "Make sure that the course environment is active. In the course directory, start a jupyter server:\n", + "```bash\n", + "$ cd /path/to/course/directory # make sure to navigate to the course directory first!\n", + "$ jupyter notebook # this will open a browser window\n", + "```\n", + "\n", + "### From the anaconda navigator\n", + "Make sure that the course environment is active. Then open a Jupyter notebook from the GUI." 
+ ] + }, + { + "cell_type": "markdown", + "id": "507795f7", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Validation: Are you ready to start?\n", + "\n", + "If you can execute the following cells without error, you are ready to start this module." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "677dee02", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "import scipy\n", + "from matplotlib import pyplot as plt\n", + "\n", + "print(f\"Numpy version : {np.__version__}\")\n", + "print(f\"Scipy version : {scipy.__version__}\")\n", + "print(f\"Pandas version : {pd.__version__}\")" + ] + }, + { + "cell_type": "markdown", + "id": "ca28902c", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Part 1: Numpy\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "e1a6bd4d", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Content\n", + "\n", + "- Introduction to Numpy\n", + "- Datatypes\n", + "- Concept of multi-dimensional arrays\n", + "- Array access\n", + "- Broadcasting\n", + "- Universal functions\n" + ] + }, + { + "cell_type": "markdown", + "id": "7501ca2b", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Content\n", + " - ***Introduction to Numpy***\n", + " - Datatypes\n", + " - Concept of multi-dimensional arrays\n", + " - Array access\n", + " - Broadcasting\n", + " - Universal functions\n" + ] + }, + { + "cell_type": "markdown", + "id": "b420d5ce", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## NumPy (Numerical Python)\n", + "* Open source Python library\n", + "* Multi-dimensional arrays\n", + "* Efficient storing of data\n", + "* Efficient computing (e.g. 
vectorization)\n" + ] + }, + { + "cell_type": "markdown", + "id": "b080c967", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "\n", + "Figure From: https://doi.org/10.1038/s41586-020-2649-2" + ] + }, + { + "cell_type": "markdown", + "id": "da03546b", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* Numpy is the base of many other scientific libraries.\n", + "\n", + "* For example, in astronomy, NumPy was an important part of the software stack used in the discovery of gravitational waves and in the first imaging of a black hole." + ] + }, + { + "cell_type": "markdown", + "id": "5ba975b4", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Numpy VS Standard Python\n", + "\n", + "Hands on: Time the code on the next slides and see where Numpy is faster!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf5059d3", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "myrange = range(10000) # normal python list\n", + "%timeit [i ** 2 for i in myrange]\n", + "a = np.arange(10000) # numpy array\n", + "%timeit a ** 2 # note that we operate on the array like a scalar value!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a1b1d4a", + "metadata": { + "cell_style": "center", + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Example: vector addition\n", + "def python_version(size):\n", + " X = range(size)\n", + " Y = range(size)\n", + " Z = [X[i] + Y[i] for i in range(len(X)) ]\n", + "\n", + "\n", + "def numpy_version(size):\n", + " X = np.arange(size)\n", + " Y = np.arange(size)\n", + " Z = X + Y\n", + "\n", + "%timeit python_version(1000)\n", + "%timeit numpy_version(1000)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "298661a9", + "metadata": { + "slideshow": { + "slide_type": "notes" + } + }, + "outputs": [], + "source": [ + "TODO an example with strings?" + ] + }, + { + "cell_type": "markdown", + "id": "4d94a263", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# So where does the performance benefit come from?" + ] + }, + { + "cell_type": "markdown", + "id": "62f09236", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "### Datatypes\n", + "* Fixed static datatypes for usage of more efficient CPU instructions" + ] + }, + { + "cell_type": "markdown", + "id": "a21e7d4f", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "### Array oriented programming\n", + "* Only work with array objects\n", + "* Offload loops to *compiled* programming language\n", + "* Use the CPUs vector instructions (SSE, AVX, ...) 
instead of loops" + ] + }, + { + "cell_type": "markdown", + "id": "bf7b6ed4", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Content\n", + " - Introduction to Numpy\n", + " - ***Datatypes***\n", + " - Concept of multi-dimensional arrays\n", + " - Array access\n", + " - Broadcasting\n", + " - Universal functions\n" + ] + }, + { + "cell_type": "markdown", + "id": "28074bdf", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Python has 3 built-in numeric datatypes:\n", + "- integers (`int`)\n", + "- floating point numbers (`float`)\n", + "- complex numbers (`complex`)" + ] + }, + { + "cell_type": "markdown", + "id": "a0ee54a1", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## `int`: integer numbers\n", + "* literals: `42`, `1`, `-15`\n", + "* no hard maximum value (e.g. `10**1000` is perfectly valid)\n", + "* dynamic size, memory overhead, *cannot directly use native CPU instructions for basic math* \n", + "* $\\Rightarrow$ very flexible, but often not very efficient\n", + "* typical integer datatypes in C: `int`, `unsigned int`, `long int`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4dd77e54", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# The size of a Python built-in `int` object depends on the value of the integer\n", + "import sys\n", + "\n", + "# sizes are given in bytes\n", + "print(sys.getsizeof(1))\n", + "print(sys.getsizeof(10 ** 1000))\n", + "print(sys.getsizeof(10 ** 10000))" + ] + }, + { + "cell_type": "markdown", + "id": "651b5a9f", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## `float`: floating point numbers\n", + "* floating point number: $\\textrm{sign} * \\textrm{mantissa} * \\textrm{base}^\\textrm{exponent}$, e.g. 
$-1.234\\cdot10^{2}$\n", + "* literals in Python: `0.0`, `314.15`, `-1.5e7` (meaning $-1.5\\cdot10^{7}$)\n", + "* usually implemented as `double` in C (64 bit / 8 byte) \n", + "* thus, limited max, min, eps (see [`sys.float_info`](https://docs.python.org/3/library/sys.html#sys.float_info))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "471ca00e", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# The size of `float` objects is *independent* of the value\n", + "import sys\n", + "\n", + "print(sys.getsizeof(1.0))\n", + "print(sys.getsizeof(1.34e18))\n", + "\n", + "# low level information about precision and internal representation of `float`s\n", + "print(sys.float_info)" + ] + }, + { + "cell_type": "markdown", + "id": "9b2467e7", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## `complex`: complex floating point numbers:\n", + "* literals: `1.0+2.0j`, `1j`\n", + "* `1j**2 == -1`\n", + "* `x.real`, `x.imag` are `float`s" + ] + }, + { + "cell_type": "markdown", + "id": "0eaca8aa", + "metadata": { + "slideshow": { + "slide_type": "notes" + } + }, + "source": [ + "-- comment: j comes from electrical engineering, where I is used for current\n", + "-- https://stackoverflow.com/questions/24812444/why-are-complex-numbers-in-python-denoted-with-j-instead-of-i" + ] + }, + { + "cell_type": "markdown", + "id": "019a151e", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Content\n", + " - Introduction to Numpy\n", + " - Datatypes\n", + " - ***Concept of multi-dimensional arrays***\n", + " - Array access\n", + " - Broadcasting\n", + " - Universal functions\n" + ] + }, + { + "cell_type": "markdown", + "id": "3b4a7b89", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "$N$-dimensional arrays of numerical data are essential for scientific computing where data is often handled by means of 
multidimensional indexed arrays.\n", + "\n", + "- Natural sciences & numerical mathematics\n", + " - Vectors, matrices, tensors\n", + "- Data Science\n", + " - Datasets (e.g. via Pandas), tensors" + ] + }, + { + "cell_type": "markdown", + "id": "f9a4116a", + "metadata": { + "cell_style": "center", + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Use cases\n", + "\n", + "- Linear algebra\n", + " - Matrix-vector multiplication, matrix-matrix multiplication\n", + "- Statistics with large datasets\n", + " - Aggregating data for computing mean, standard deviation, ...\n", + "- Deep learning:\n", + " - Operations involving high-dimensional arrays (\"tensors\")" + ] + }, + { + "cell_type": "markdown", + "id": "e9f6d378", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## What about Python's builtin container types (`list`, `tuple`)?\n", + "\n", + "- Can hold *any type* of Python object\n", + " - (Mostly?) Not suitable for native CPU instructions\n", + " - Agnostic of concept of e.g. a rectangular array\n", + "- Not designed with numerical calculations in mind\n", + "\n", + "Not efficient enough to be used for \"number crunching\"." 
+ ] + }, + { + "cell_type": "markdown", + "id": "230188fa", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## NumPy's [ndarrays](https://numpy.org/doc/stable/reference/arrays.ndarray.html)\n", + "\n", + "- multi-dimensional, fixed-size containers of items of the same *size* and *type*\n", + "- number of dimensions and items \n", + " - `shape`: N-tuple with non-negative integer values describing the sizes of each dimension\n", + "- type of each item\n", + " - data type object (`dtype`)" + ] + }, + { + "cell_type": "markdown", + "id": "9b542fa8", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Usage of Numpy" + ] + }, + { + "cell_type": "markdown", + "id": "e26a3840", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Import NumPy\n", + "Import NumPy into the current namespace, usually with the alias `np` for conciseness." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f092e077", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "id": "5ac42dc5", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Array creation" + ] + }, + { + "cell_type": "markdown", + "id": "4226dbd4", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### From Python's built-in container types" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae65a30e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# from Python lists\n", + "A = np.array([1, 2, 3])\n", + "A, type(A)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d996a1c8", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# from Python tuples\n", + "A = np.array((1, 2, 3))\n", + "A, 
type(A)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f7a19ea3", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Does not work for generators since ndarrays need their size at creation time\n", + "A = np.array(i**2 for i in range(1, 5))\n", + "A\n", + "# We have to use `np.fromiter()` to convert iterator to ndarray type.\n", + "# A = np.fromiter((i ** 2 for i in range(1, 5)), dtype=np.int32)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79a667ca", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# We can also use nested lists\n", + "nested_list = [[1, 2, 3], [10, 20, 30]] \n", + "A = np.array(nested_list)\n", + "print(A)\n", + "print(f\"dimension of A: {A.ndim}\")\n", + "print(f\"shape of A: {A.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f346f93c", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# MIND: sublist must all have the *same* size\n", + "nested_list = [[1] * 3, [2] * 3, [3] * 2]\n", + "print(nested_list)\n", + "np.array(nested_list)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cdfbc8cb", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# A = np.array(range(3))\n", + "A = np.array([[1.0] * 5, [4] * 5])\n", + "\n", + "# metadata sample of an ndarray (these are the attributes of the instances of ndarray class)\n", + "print(\"shape :\", A.shape)\n", + "print(\"size :\", A.size)\n", + "print(\"dtype :\", A.dtype) # we will come to the type later!\n", + "print(\"itemsize:\", A.itemsize) # size is in bytes" + ] + }, + { + "cell_type": "markdown", + "id": "a38c59fb", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### From NumPy built-in functions (factory functions)" + ] + }, + { + "cell_type": "code", 
+ "execution_count": null, + "id": "f18b2f91", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# range(n) generates n integers,\n", + "# starting at 0, up to n-1\n", + "# (this function is similar to the standard python range function)\n", + "A = np.arange(10)\n", + "A" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74fb95b7", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# range with start, stop, step parameters\n", + "# NOTE: start value is included, stop value not\n", + "print(np.arange(3, 11, 1))\n", + "print(np.arange(3, 11, 2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "014c3f70", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Numpy provides some factory methods for array initialisation:\n", + "\n", + "# zero array, here with 2D shape\n", + "array_of_zeros = np.zeros((2, 3))\n", + "print(array_of_zeros)\n", + "\n", + "# create ones array with same shape as zeros\n", + "array_of_ones = np.ones(array_of_zeros.shape)\n", + "print(array_of_ones)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7cff636b", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "array_of_ones = np.ones((3, 3))\n", + "print(array_of_ones)\n", + "# Return an array of zeros with same shape and type as given array.\n", + "array_of_zeros = np.zeros_like(array_of_ones)\n", + "print(array_of_zeros)\n", + "# Return an array with `fill_value` with same shape and type as given array.\n", + "array_of_fives = np.full_like(array_of_ones, fill_value=5)\n", + "print(array_of_fives)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff4b8beb", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Diagonal arrays\n", + "print(np.diag(np.arange(1, 
4), k=-1))\n", + "print(np.eye(3))" + ] + }, + { + "cell_type": "markdown", + "id": "5e72e19f", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "- Numpy arrays can also be read from / stored to disk.\n", + "- But this is not covered in this course.\n", + "- For this we will introduce Pandas tomorrow." + ] + }, + { + "cell_type": "markdown", + "id": "f530026d", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Data type ([dtype](https://numpy.org/doc/stable/reference/arrays.dtypes.html))\n", + "You can explicitly set the underlying numeric data type for an ndarray.\n", + "A suitable numeric data type is needed in order to take advantage of the speed of NumPy.\n", + "\n", + "A (more) complete list of supported data types for `ndarray`s can be found [here](https://numpy.org/doc/stable/user/basics.types.html). The tables also feature the corresponding C datatype.\n", + "\n", + "NOTE: There are *platform-dependent* datatypes (e.g. `np.intc`) where the corresponding C type (e.g. `int`) is also platform-dependent." + ] + }, + { + "cell_type": "markdown", + "id": "2c938caa", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "\n", + "- integer types\n", + " - `np.int64` (default int type on most 64-bit platforms)\n", + " - `np.int32`\n", + " - `np.uint64` (unsigned int)\n", + " - `np.uint8`\n", + " - ...\n" + ] + }, + { + "cell_type": "markdown", + "id": "dacd61f5", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "Floating point-based data types:\n", + "\n", + "- floating point types\n", + " - `np.float64` (aka double precision; matches the precision of Python `float`)\n", + " - `np.float32` (aka single precision)\n", + " - `np.float16` (aka half precision)\n", + " - ...\n", + "- complex types\n", + " - `np.complex64` (2x 32-bit floats; for real and imag. part)\n", + " - `np.complex128` (2x 64-bit floats; for real and imag. 
part)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3886925", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# NumPy deduces the type of the elements from the input. The type can be inspected via the dtype attribute.\n", + "A = np.array([1, 2, 3])\n", + "print(A)\n", + "print(A.dtype)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a01d43ae", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A_float = np.array([1.1, 5.5, 9.9])\n", + "print(A_float)\n", + "print(A_float.dtype)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8409769", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# We can also pass the type explicitly\n", + "A_int64 = np.array(list(map(float, range(4))), dtype=np.int64)\n", + "A_float32 = np.array(list(map(float, range(4))), dtype=np.float32)\n", + "\n", + "print(f\"A_int64 type : {A_int64.dtype}\")\n", + "print(f\"A_float32 type: {A_float32.dtype}\")" + ] + }, + { + "cell_type": "markdown", + "id": "65b112fb", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "- `dtype` can have an impact on the performance of computations with `ndarray`s. \n", + "- In essence, use a *numerical* datatype (e.g. `int` or `float64`) rather than `object` for representing numerical data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb71d2da", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Compare numeric type vs object type\n", + "N = 100000\n", + "%timeit np.arange(N, dtype=object).sum() # Python object\n", + "%timeit np.arange(N, dtype=int).sum() # Python compatible integer (most likely np.int64)" + ] + }, + { + "cell_type": "markdown", + "id": "9dfba808", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Overflows\n", + "\n", + "Example: `np.int8` represents a *signed* integer of size 8 bit $\\Rightarrow$ $2^8 = 256$ possible values!\n", + "\n", + "- First bit is for sign: $\\pm$\n", + "- With the remaining 7 bits, the representable range is $[-2^7, 2^7 - 1] = [-128, 127]$\n", + "\n", + "MIND: Any number *larger* than 127 *cannot* be represented with this type of integer.\n", + "\n", + "*Note*: Fixed-size datatypes are essential for performant calculations and vectorization." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1b63c961", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Overflows\n", + "print(np.arange(0, 256, 1, dtype=np.int8)) # note the \"wrap-around\" after 127\n", + "# print(np.arange(0, 256, 1, dtype=np.uint8)) # works for *unsigned* integer" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "207d3ba2", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# special float values\n", + "# 1.0 / 0.0 # raises ZeroDivisionError exception\n", + "# but what about numpy?\n", + "\n", + "A = np.array([1.0]) / 0.0 # raises no exception (only a RuntimeWarning)\n", + "print(A) # special value np.inf (\"infinity\")\n", + "B = np.array([-1.0]) / 0.0\n", + "print(B) # results in -np.inf\n", + "C = np.array([0.0]) / 0.0\n", + "print(C) # special value np.nan (\"not a number\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "60fbf8c8", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# If desired, we can specify in which cases we want an error to be raised.\n", + "\n", + "# np.seterr(**{'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}) # default\n", + "# np.seterr(all='raise') # raise exceptions on numerical errors" + ] + }, + { + "cell_type": "markdown", + "id": "336d45fe", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Rounding Errors and Machine Epsilon" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8ed2d381", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "np.finfo(np.float32)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "56c6a257", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "np.finfo(np.float64)" + ] + }, + { 
+ "cell_type": "code", + "execution_count": null, + "id": "c5298837", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "x = np.asarray([0.001, 1, 1e10], dtype=np.float32)\n", + "one = np.ones(3, dtype=np.float32)\n", + "y = x + one\n", + "z = y - one\n", + "print(x)\n", + "print(y)\n", + "print(z)\n", + "print(one)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "569d8895", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(x == z) # mathematically, this should be true for all entries\n", + "print(y == x) # mathematically, this should be false for all entries" + ] + }, + { + "cell_type": "markdown", + "id": "09e93162", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Machine Epsilon\n", + "- Computers cannot represent every number exactly\n", + "- Machine epsilon is the distance between 1 and the next larger representable number \n", + "  - i.e. the smallest $\\epsilon$ for which $1+\\epsilon$ is itself representable and $1+\\epsilon >1$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb8be20e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "def geteps(base_val=1, dtype=np.float64):\n", + "    # halve the candidate until adding it to base_val no longer changes the result\n", + "    one, new_eps, eps = dtype(base_val), dtype(base_val), dtype(base_val)\n", + "    while (one + new_eps) > one:\n", + "        eps = new_eps\n", + "        new_eps /= dtype(2)\n", + "    return eps" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0eacb662", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(geteps(dtype=np.float32))\n", + "print(geteps(dtype=np.float64))" + ] + }, + { + "cell_type": "markdown", + "id": "fb64a4fd", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Machine Epsilon\n", + "- The value of machine epsilon changes with the number of bits for a float\n", + "- The spacing between neighbouring representable numbers also gets larger for larger numbers\n", + "- **It is a *relative* 
error**!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33e9df26", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(geteps(1,dtype=np.float64))\n", + "print(geteps(1e6,dtype=np.float64))" + ] + }, + { + "cell_type": "markdown", + "id": "8ac8a4f8", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Number Representation\n", + " - Computers use a binary representation\n", + " - Some numbers that have a finite number of digits in decimal representation have an infinite number of digits in binary representation\n", + " - Example: \n", + " $\\frac{1}{10}=0.1$ in decimal, but $0.000110011\\overline{0011}$ in binary\n", + " - This happens for the same reason that $\\frac{1}{6}=0.1\\overline{6}$ has an infinite number of digits in decimal representation" + ] + }, + { + "cell_type": "markdown", + "id": "768e6b74", + "metadata": { + "slideshow": { + "slide_type": "notes" + } + }, + "source": [ + "- This is why rounding errors larger than machine epsilon are possible" + ] + }, + { + "cell_type": "markdown", + "id": "e0843fbc", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Order of operations\n", + "- Floating-point addition is not associative: the rounding errors of floating-point operations may depend on their execution order" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d2c1b62d", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "x = np.asarray([0.001, 1, 1e10], dtype=np.float32)\n", + "a = np.asarray([1, 1, 1], dtype=np.float32)\n", + "print(x == x + a - a) # executed as (x + a) - a\n", + "print(x == x + (a - a))" + ] + }, + { + "cell_type": "markdown", + "id": "400b1d00", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "- These errors can propagate through multiple steps of a calculation and can even be amplified" + ] + }, + { + "cell_type": "markdown", + "id": "10d62ba5", + 
"metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## GENERAL rules of thumb\n", + "- Multiplication/Division are mostly safe\n", + "- Addition and subtraction can lead to errors:\n", + " - when values of different magnitude are involved, the digits of the smaller one can be lost\n", + " - when subtracting two numbers that are close together, rounding errors are more likely\n", + "\n", + "- More detailed information: https://doi.org/10.1145/103162.103163" + ] + }, + { + "cell_type": "markdown", + "id": "a7d5b7a4", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Comparing floating point numbers with rounding errors\n", + "- Comparing numbers with rounding errors with `==` can lead to wrong assumptions\n", + " - We have seen that the number representation is not exact in some cases\n", + "- numpy offers `isclose` (or `allclose`) for this purpose\n", + "- $a$ and $b$ are considered close if $|a-b| <= atol+rtol*|b|$\n", + "- numpy uses $rtol=10^{-5}$ and $atol=10^{-8}$ by default\n", + " - But you can specify your own tolerances\n", + " - you may need adjust $atol$ if you want to compare numbers close to 0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a21c09a", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "a=np.asarray([1e-6,1,1e6])\n", + "b=np.asarray([1e-6+1e-6,1e-6+1,1e6+1])\n", + "print(a)\n", + "print(b)\n", + "print(np.isclose(a,b))\n", + "print(np.isclose(a,b,rtol=1e-6))\n", + "print(np.isclose(a,b,atol=1e-6))" + ] + }, + { + "cell_type": "markdown", + "id": "8fcd7aea", + "metadata": { + "slideshow": { + "slide_type": "notes" + } + }, + "source": [ + "https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/" + ] + }, + { + "cell_type": "markdown", + "id": "0efcf38d", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Array shape manipulation" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b0e3a9e", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "A = np.arange(10)\n", + "print(A)\n", + "print(A.shape) # the return type is a tuple" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab24e946", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Reshaping\n", + "A = A.reshape((2, 5))\n", + "# A = A.reshape((2, -1)) # make NumPy deduce the last dimension\n", + "\n", + "# * First set of 5 values: row 1\n", + "# * Second set of 5 values: row 2\n", + "print(A)\n", + "print(A.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7025063c", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Transpose\n", + "A = np.arange(10).reshape((5, 2))\n", + "# We make an explicit copy here (more on copies and views later!).\n", + "A_transpose = A.copy().T # or A.copy().transpose(( 1, 0))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1ac5962d", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(A)\n", + "print(A_transpose)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f89634e3", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(f\"shape of A : {A.shape}\")\n", + "print(f\"shape of A_transpose: {A_transpose.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3109ba5", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Transposition also works with N-dimensional arrays\n", + "A = np.ones((3, 4, 5))\n", + "\n", + "print(f\"shape of A: {A.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70fe95ee", + "metadata": { + 
"slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A_transpose = A.copy().transpose() # this reverses the order of the sizes of each dimension\n", + "print(f\"shape of A_transpose: {A_transpose.shape}\")\n", + "A.copy().transpose((2, 1, 0)).shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3496374c", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "from itertools import permutations\n", + "for perm in permutations((0, 1, 2)):\n", + " A_transpose = A.copy().transpose(perm) # Provide a tuple with dimension to permute\n", + " print(f\"shape of A_transpose for permutation {perm}: {A_transpose.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fec034c", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Concatenation and stacking\n", + "A0 = np.zeros(3)\n", + "A1 = np.ones(3)\n", + "print(\"start with two arrays:\", A0, A1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d20cabf3", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Concatenation\n", + "A_concat = np.concatenate((A0, A1), axis=0)\n", + "print(\"concatenate along existing dimension:\")\n", + "print(A_concat)\n", + "print(A_concat.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d224ca7", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Stacking\n", + "A_stack_ax0 = np.stack((A0, A1), axis=0) # stacking along rows\n", + "A_stack_ax1 = np.stack((A0, A1), axis=1) # stacking along columns\n", + "print(\"stack along axis=0:\")\n", + "print(A_stack_ax0)\n", + "print(\"stack along axis=1:\")\n", + "print(A_stack_ax1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e38879b4", + "metadata": { + "slideshow": { + "slide_type": "slide" + } 
+ }, + "outputs": [], + "source": [ + "# Nested arrays can also be \"transformed\" into 1D arrays\n", + "A = np.arange(1, 28).reshape((3, 3, 3))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08888075", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# this returns a 1D copy\n", + "A_flattened = A.flatten()\n", + "print(np.may_share_memory(A, A_flattened))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b4344d8", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# this returns a view whenever possible (we will come to this later)\n", + "A_ravelled = A.ravel() # np.ravel(A) in case you want to use the free function\n", + "print(np.may_share_memory(A, A_ravelled))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad68e206", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# `A.ravel()` is equivalent to `A.reshape((-1,))`\n", + "A.reshape((-1,))" + ] + }, + { + "cell_type": "markdown", + "id": "2316fafd", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Timing analysis: Copies vs. views" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed9b4aa4", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A = np.arange(500000).reshape(5, -1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dca9aefd", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# measure time of summing large array after ...\n", + "# - ... `flatten()`ing the array, and\n", + "# - ... `ravel()`ing the array." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4332dcab", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "%timeit np.sum(A.flatten())\n", + "%timeit np.sum(A.ravel())" + ] + }, + { + "cell_type": "markdown", + "id": "59f610ef", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "We will come back to the difference between a *view* and a *copy* later" + ] + }, + { + "cell_type": "markdown", + "id": "2be9aebd", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Content\n", + " - Introduction to Numpy\n", + " - Datatypes\n", + " - Concept of multi-dimensional arrays\n", + " - ***Array access***\n", + " - Broadcasting\n", + " - Universal functions\n" + ] + }, + { + "cell_type": "markdown", + "id": "359ea28c", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Element access with indexing in 1D" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9291539d", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Access array element\n", + "A = np.array([10, 20, 30])\n", + "A[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fb82fd90", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Change value of array element\n", + "A[1] = 222\n", + "A" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19141d3d", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Reverse indexing also works like with Python lists\n", + "A[-1]" + ] + }, + { + "cell_type": "markdown", + "id": "2ceb94a8", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "### Exercises\n", + "1. Create a numpy array with the current date (year, month, day)\n", + "\n", + "2. 
Index the array to retrieve the year.\n", + "\n", + "3. Replace the year with the year of your birth.\n", + "\n", + "4. Create a NumPy array containing every 7th number from 123 to 456. What is the 9th number in this array? What is the 78th?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf66e38e", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# This field is for the solution of the exercise" + ] + }, + { + "cell_type": "markdown", + "id": "4ec81374", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Elementwise operations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6822f4d7", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "A = np.array([1, 2, 3])\n", + "print(A)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ae54600", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A + 1 # Note that this operation is broadcast over the whole array" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd4ad7ec", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A / 2 # Note that this operation is broadcast over the whole array" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e94e610d", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "B = np.ones((2, 2))\n", + "B" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78fe1c57", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Element-wise addition\n", + "B + B" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bff45d9c", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ 
+ "# Elementwise multiplication. This is *not* matrix-matrix multiplication.\n", + "# This is the Hadamard product: https://en.wikipedia.org/wiki/Hadamard_product_(matrices)\n", + "B * B # element-wise (Hadamard) product" + ] + }, + { + "cell_type": "markdown", + "id": "fb4c4393", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### `np.dot()` and `np.matmul()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "efa62480", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A = np.arange(1, 5).reshape((2, 2))\n", + "B = np.arange(5, 9).reshape((2, 2))\n", + "A, B" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4646f4b", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Compute the matrix-matrix multiplication in 3 equivalent ways\n", + "print(np.dot(A, B)) # np.dot is much more general; prefer np.matmul for matrix-multiplication\n", + "print(np.matmul(A, B)) # implements the semantics of the `@` operator\n", + "print(A @ B)" + ] + }, + { + "cell_type": "markdown", + "id": "4437b478", + "metadata": { + "slideshow": { + "slide_type": "notes" + } + }, + "source": [ + "`matmul` differs from `dot` in two important ways:\n", + "\n", + "- Multiplication by scalars is not allowed.\n", + "- Stacks of matrices are broadcast together as if the matrices were elements.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84358ebf", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Matrix-vector / vector-matrix multiplication\n", + "M = np.arange(1, 5).reshape((2, 2))\n", + "v = np.arange(1, 3)" + ] + }, + { + "cell_type": "markdown", + "id": "21d500cc", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "Compute $\\sum_{j = 1}^N M_{ij} v_j$: Sum along columns 
(`axis = 1`) of the matrix." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6816a443", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(np.dot(M, v))\n", + "print([sum(row * v) for row in M]) # test if `dot()` does what it is supposed to" + ] + }, + { + "cell_type": "markdown", + "id": "13b90b55", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "Compute $\\sum_{i = 1}^N v_i M_{ij}$: Sum along the rows (`axis = 0`) of the matrix." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f4805934", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(np.dot(v, M))\n", + "print([sum(v * row) for row in M.transpose()]) # test if `dot()` does what it is supposed to" + ] + }, + { + "cell_type": "markdown", + "id": "16cc2079", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "For arrays, `np.matmul` and `np.dot` give different results only in higher dimensions (three dimensions and upwards).\n", + "\n", + "See the docs for [`np.matmul`](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html?highlight=matmul#numpy.matmul) and [`np.dot`](https://numpy.org/doc/stable/reference/generated/numpy.dot.html#numpy.dot) for more details." + ] + }, + { + "cell_type": "markdown", + "id": "29d79088", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Computing the dot-product for (complex-valued) vectors: `np.dot()` vs. 
`np.vdot()`" + ] + }, + { + "cell_type": "markdown", + "id": "29c09a65", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "For $\\vec{v}, \\vec{w} \\in \\mathbb{R}^N$: $\\langle v, w \\rangle = \\sum_{i=1}^N v_i w_i$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d7f76bc", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# For real-valued vectors there is *no* difference\n", + "v = np.arange(1, 5)\n", + "print(np.dot(v, v), np.vdot(v, v))" + ] + }, + { + "cell_type": "markdown", + "id": "b13ef4b6", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "For $\\vec{v}, \\vec{w} \\in \\mathbb{C}^N$: $\\langle v, w \\rangle = \\sum_{i=1}^N \\overline{v_i} w_i$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11eb626e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# For complex-valued vectors there is a difference\n", + "v = np.arange(1, 5) + 1j * np.arange(1, 5) # real part plus imaginary part\n", + "print(np.dot(v, v)) # does *not* automatically apply complex conjugation to the 1st argument (use np.dot(v.conj(), v))\n", + "print(np.vdot(v, v)) # does apply complex conjugation to the 1st argument" + ] + }, + { + "cell_type": "markdown", + "id": "13fe7988", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Reductions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9a7cf0e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A = np.array([[1, 2], [3, 4]])\n", + "A" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e477c39a", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Summing all elements in an array\n", + "np.sum(A), A.sum()" + ] + }, + { + "cell_type": "code", + "execution_count": null, 
+ "id": "7f7c5bb2", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Summing along rows (axis=0) and along columns (axis=1)\n", + "np.sum(A, axis=0), np.sum(A, axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d81b0eb0", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Some statistical quantities\n", + "np.mean(A), np.std(A)" + ] + }, + { + "cell_type": "markdown", + "id": "e3920dd3", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Slicing (1D)\n", + "Access range of elements using slice notation: \n", + "\n", + "```\n", + " A[start:stop:step]\n", + "```\n", + "\n", + "defaults: `start=0`, `stop=len(array)`, `step=1` \n", + "second `:` is optional, if default step is used\n", + "\n", + "Remember: Indices start at 0. `stop` is exclusive." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4b44b53", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "A = np.arange(10)\n", + "A" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e80ce38f", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "#equivalent due to default values:\n", + "print(A[0:6:1])\n", + "print(A[0:6:])\n", + "print(A[0:6]) #second ':' is optional if default step used\n", + "print(A[:6])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d45423bc", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A[:-2]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4c26dd8", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A[1::2]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf391c33", + "metadata": { + "slideshow": { + 
"slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A[::-1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4a1472c", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A[3:1:-1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50149cd6", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Assigning to a slice is something we cannot do with a Python list.\n", + "A = np.arange(10)\n", + "A[1::2] = -100\n", + "print(A)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "faa52409", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# We get an error with a Python list.\n", + "A = list(range(10))\n", + "A[1::2] = -100\n", + "# A[1::2] = [-100] * 5 # This implies knowing the length of the slice.\n", + "print(A)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1ea5138b", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "from time import time\n", + "# Example: Prime sieve\n", + "N = 100000\n", + "prime_candidates = np.ones((N,), dtype=np.bool_)\n", + "prime_candidates[0] = False\n", + "prime_candidates[1] = False\n", + "\n", + "tstart = time()\n", + "for i in range(2, N): # for each integer starting from 2 cross out higher multiples\n", + " prime_candidates[2 * i :: i] = False\n", + "print(f\"Time needed: {time() - tstart}\")\n", + " \n", + "# print(prime_candidates)\n", + "# boolean array; if prime_candidates[x] == True, then x is prime" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "018e8135", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "#convert boolean mask to list of integers with list comprehension\n", + "%timeit np.array([x for x in range(N) if prime_candidates[x]]) # list 
comprehension with conditional" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4857233a", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# convert boolean mask to list of integers with NumPy built-in function (much faster)\n", + "%timeit np.nonzero(prime_candidates) # or using a NumPy function" + ] + }, + { + "cell_type": "markdown", + "id": "62a2ece1", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "### Array-oriented programming\n", + "\n", + "- Use `numpy` built-in functions and methods of the `ndarray` class to operate on `ndarray`s\n", + "- *Avoid* using standard Python loop constructs such as \"raw\" for loops or list comprehensions" + ] + }, + { + "cell_type": "markdown", + "id": "7dfbfb0f", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Indexing and slicing in $N$ dimensions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d50f99e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "B = np.array([[1, 2, 3], [10, 20, 30]])\n", + "print(B)\n", + "print(B.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7aef119c", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "B[0, 1] # get value from index along each axis\n", + "# B[0][1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eef08541", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "B[:2, :2] # use slicing along each axis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a262f62f", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "B[:, 1] = [70, 700] # full slice ':' selects all of axis 0; index 1 picks the 2nd column\n", + "B" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "id": "daae0878", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "a = np.arange(1, 49).reshape((8, 6))\n", + "a" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "56309989", + "metadata": {}, + "outputs": [], + "source": [ + "# From first row take two element before the last one\n", + "a[0, 3:5]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac557199", + "metadata": {}, + "outputs": [], + "source": [ + "# Submatrix in upper left corner\n", + "a[:2, :2]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1b46d191", + "metadata": {}, + "outputs": [], + "source": [ + "# Last column\n", + "a[:, -1]\n", + "# a[-1, :]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "621a7337", + "metadata": {}, + "outputs": [], + "source": [ + "# Amore complicated pattern with non-unit steps\n", + "a[2::2, 3::]" + ] + }, + { + "cell_type": "markdown", + "id": "a6b1df82", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Views and copies\n", + "\n", + "When using *slicing* or *transposition* we create *references* to the original data (memory views). **No** copy is made of the original array and stored in memory. We can use `np.may_share_memory()` to check if two arrays share the same memory block. Note however, that this uses heuristics and may give you false positives." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e3fee47", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "A = np.arange(5)\n", + "A" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7aca885", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A_view = A[::2] # slicing! 
This gives us a *view* on this mem location\n", + "print(A_view)\n", + "print(A_view.shape, A.shape) # view has its own metadata" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d6404ba", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Notice how this operation changes the original array!\n", + "A_view[0] = 100\n", + "A" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c5648a3", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "np.shares_memory(A, A_view), np.may_share_memory(A, A_view)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4f38792", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# If this were not a view, we could not manipulate the data\n", + "A[::2] = 100\n", + "A" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32fa43d7", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# If you want a real copy, use the copy method provided by ndarrays\n", + "A = np.arange(5)\n", + "A_copy = A[::2].copy() # we make an explicit copy \n", + "np.shares_memory(A, A_copy)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5327c13", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A_copy[0] = 100" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac181286", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(A.shape, A_copy.shape)\n", + "print(A, A_copy) # check if the original array was modified as well" + ] + }, + { + "cell_type": "markdown", + "id": "6038e1d3", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Copying values into an existing object using [:]" + ] + }, + 
{ + "cell_type": "code", + "execution_count": null, + "id": "4edb042c", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "A = np.array([1, 2, 3])\n", + "B = np.array([4, 5, 6])\n", + "A = B # assign object referenced by B to variable A\n", + "B[0] = 100\n", + "print(A, B)\n", + "print(\"A has same identity as B: \", id(A) == id(B))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "64928398", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "A = np.array([1, 2, 3])\n", + "B = np.array([4, 5, 6])\n", + "old_A_id = id(A)\n", + "A = B.copy() # copy values into new object, then assign this new object to name A\n", + "B[0] = 100\n", + "print(A, B)\n", + "print(\"A has same identity as B: \", id(A) == id(B))\n", + "print(\"A has kept its identity: \", old_A_id == id(A))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01b8a7de", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "A = np.array(range(6)).reshape((3, 2))\n", + "B = np.array(range(6, 12)).reshape((3, 2))\n", + "old_A_id = id(A)\n", + "A[:, :] = B # copy values from B into existing object A\n", + "# syntax A[:] ensures that the assignment goes to the object referenced by A,\n", + "# not to the symbol A\n", + "B[0] = 100\n", + "print(A)\n", + "print(B)\n", + "print(\"A has kept its identity: \", old_A_id == id(A))" + ] + }, + { + "cell_type": "markdown", + "id": "7fc99cb2", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Advanced indexing\n", + "\n", + "*Use of arrays (integer or boolean type) to index other arrays*." 
+ ] + }, + { + "cell_type": "markdown", + "id": "9f4eb590", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "\n", + "- \"Elementary\" indexing (and slices) always returns a *view*.\n", + "- Advanced (\"fancy\") indexing always returns a *copy*." + ] + }, + { + "cell_type": "markdown", + "id": "5d12425e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "- Assignment *to* array with advanced indexing changes original array (just like with normal indexing and slicing).\n", + "- Assignment *from* array with advanced indexing creates copies (and not views like regular slicing)." + ] + }, + { + "cell_type": "markdown", + "id": "404837b7", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Boolean expressions with `ndarray`s" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc042b09", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A = np.array([1, 2, 3])\n", + "B = np.array([3, 2, 1])\n", + "A == B # component-wise comparison" + ] + }, + { + "cell_type": "markdown", + "id": "f8329f0b", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "In boolean expressions with `ndarray`s use *bitwise* operators instead of logical operators:\n", + "\n", + "Operation | Not to use | Use\n", + "------ | -----------|----------\n", + "and | `and` | `&`\n", + "or | `or` | `\\|`\n", + "not | `not` | `~` " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b89d056f", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# The following expressions yield the same result\n", + "print(np.array([not x for x in ((A == B) | ([True, False, False]))]))\n", + "print(~((A == B) | ([True, False, False])))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15b5a088", + "metadata": { + "slideshow": { + 
"slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Boolean masks\n", + "A = np.arange(100000) # change to larger value for time measurement below\n", + "divisible_by_3_mask = (A % 3 == 0)\n", + "# print(A)\n", + "# print(divisible_by_3_mask)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80f77707", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "%timeit A[divisible_by_3_mask]\n", + "%timeit [A[i] for i in range(A.size) if divisible_by_3_mask[i] == True]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f02f53fa", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Assign *from* array with boolean mask!\n", + "A_masked = A[divisible_by_3_mask] # creates a copy!!!\n", + "print(A_masked)\n", + "print(id(A_masked) == id(A))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4059b8c8", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Assign *to* array with boolean mask\n", + "A[divisible_by_3_mask] = -100 # changes original!!!\n", + "print(A)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "932492d7", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# get indices where condition is True\n", + "np.where(divisible_by_3_mask)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19fdda9b", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# integer list indexing\n", + "B = np.arange(4 * 4).reshape(4, 4)\n", + "print(B)\n", + "list_index = [1, 3]\n", + "B[list_index] # or B[list_index, :]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d091bf47", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# 2D: 2 lists, 
each containing indices for axis 0 and axis 1, respectively\n", + "B[[0, 3], [1, 2]], B[0, 1], B[3, 2]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5822143f", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# or, to create a boolean mask from these integer indices:\n", + "mask = np.zeros_like(B, dtype=np.bool_)\n", + "mask[[0, 3], [1, 2]] = 1\n", + "print(mask)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3bf12e7", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "B[mask] = -10000\n", + "print(B)" + ] + }, + { + "cell_type": "markdown", + "id": "5091adea", + "metadata": {}, + "source": [ + "For a visual example of fancy indexing see [here](http://scipy-lectures.org/intro/numpy/array_object.html#fancy-indexing)." + ] + }, + { + "cell_type": "markdown", + "id": "d55d3a8b", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Content\n", + " - Introduction to Numpy\n", + " - Datatypes\n", + " - Concept of multi-dimensional arrays\n", + " - Array access\n", + " - ***Broadcasting***\n", + " - Universal functions\n" + ] + }, + { + "cell_type": "markdown", + "id": "586156e7", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Broadcasting\n", + "[Broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) describes how arrays with *different* shapes are treated during arithmetic operations." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0386670a", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Let's start with a very simple example\n", + "a = np.array([1.0, 2.0, 3.0])\n", + "b = np.array([2.0, 2.0, 2.0])\n", + "a * b # Multiplication means element-wise multiplication" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9504866", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# NumPy *automatically* applies the scalar value to all elements of the ndarray. We can \"think\" of `b` being \n", + "# replicated to an ndarray of the same size as a. NumPy, however, is smart enough *not* to make additional copies.\n", + "a = np.array([1.0, 2.0, 3.0])\n", + "b = 2.0\n", + "a * b" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da58f3b5", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# This also works for higher-dimensional arrays\n", + "A = np.arange(1, 10).reshape((3, 3))\n", + "b = 0.5\n", + "A * b" + ] + }, + { + "cell_type": "markdown", + "id": "687bcf20", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### General broadcasting rules\n", + "NumPy compares the shapes of two arrays element-wise, starting from the *end* of the `shape` tuple and working its way to the beginning. Dimensions are compatible if\n", + "\n", + "1. they are *equal*, or\n", + "2. one of them is 1." + ] + }, + { + "cell_type": "markdown", + "id": "c6ead848", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "\n", + "Arrays can be broadcast into the same shape if one of the following conditions is fulfilled:\n", + "1. Arrays already have exactly the same shape.\n", + "2. Arrays have the same number of dimensions, and the individual dimensions are either of the same length, or of length 1.\n", + "3. 
Arrays of unequal dimensions can have their shape prepended with dimensions of length 1. Then rule 2. applies." + ] + }, + { + "cell_type": "markdown", + "id": "027bc942", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Note\n", + "\n", + "The following examples only serve to deepen the conceptual understanding of broadcasting rules.\n", + "\n", + "This is **not** how NumPy does broadcasting. NumPy is much more memory efficient since it avoids making needless copies of the data." + ] + }, + { + "cell_type": "markdown", + "id": "26fa673b", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Arrays do *not* need to have the same number of dimensions. Then rule 3. applies.\n", + "```\n", + "data (4d array): 4 x 8 x 3 x 8\n", + "factor (1d array): 8 # replace missing dimensions with 1 (1 x 1 x 1 x 8)\n", + "result (4d array): 4 x 8 x 3 x 8\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "40e511a2", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "data = np.random.random((4, 8, 3, 8))\n", + "factor = np.random.random((8,))\n", + "result = factor * data\n", + "result.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea5d63af", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "factor_augmented = np.tile(\n", + " factor.reshape((1, 1, 1, 8)), # Add missing dimensions\n", + " (4, 8, 3, 1) # Augment the data to shape (4, 8, 3, 8)\n", + ")\n", + "print(factor_augmented.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba6e6a46", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Compare sizes of arrays\n", + "print(factor.nbytes)\n", + "print(factor_augmented.nbytes) # NumPy does not create such an array" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "id": "79169f4d", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "np.allclose(data * factor_augmented, result)" + ] + }, + { + "cell_type": "markdown", + "id": "7d52d43a", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "```\n", + "data1 (5d array): 2 x 8 x 1 x 6 x 1\n", + "data2 (3d array): 7 x 1 x 5 # axes with length 1 will be expanded\n", + "Result (5d array): 2 x 8 x 7 x 6 x 5\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7631defe", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Example\n", + "data1 = np.random.random((2, 8, 1, 6, 1))\n", + "data2 = np.random.random((7, 1, 5))\n", + "result = data1 * data2\n", + "print(result.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8fffaf22", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "data1_augmented = np.tile(data1, (1, 1, 7, 1, 5))\n", + "data2_augmented = np.tile(data2.copy().reshape((1, 1) + data2.shape), (2, 8, 1, 6, 1))\n", + "print(data1_augmented.shape)\n", + "print(data2_augmented.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3053479c", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(np.allclose(result, data1 * data2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35cae9b9", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "A = np.arange(3 * 4).reshape((3, 4))\n", + "B = np.array([10, 20, 30, 40])\n", + "print(A)\n", + "print(B)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6fb5ea59", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Broadcasting occurs 
implicitly\n", + "# print(np.tile(B, (3, 1)))\n", + "A + B" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15b1d452", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# explicit broadcast\n", + "A_broadcast, B_broadcast = np.broadcast_arrays(A, B)\n", + "print(\"A:\")\n", + "print(A_broadcast)\n", + "print(\"B:\")\n", + "print(B_broadcast)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bc9a187f", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Methods for adding dimensions\n", + "A = np.arange(3)\n", + "print(A.reshape(1, -1))\n", + "print(A[None, :])\n", + "print(A[np.newaxis, :])\n", + "print(np.newaxis is None) # np.newaxis is an alias for None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "536fab81", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# https://numpy.org/doc/stable/reference/generated/numpy.expand_dims.html\n", + "print(np.expand_dims(A, axis=0))\n", + "print(np.expand_dims(A, axis=1))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "801f988d", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "A = np.arange(3 * 4).reshape((3, 4))\n", + "B = np.array([1, 2, 3])\n", + "print(A.shape, B.shape)\n", + "# This does not work, since A.shape and B.shape are not compatible, \n", + "# because dimensions along axis 0 cannot be matched.\n", + "# A_broadcast, B_broadcast = np.broadcast_arrays(A, B)\n", + "A_broadcast, B_broadcast = np.broadcast_arrays(A, B[:, None]) # adding another dimension helps" + ] + }, + { + "cell_type": "markdown", + "id": "2ba18826", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "Sometimes it can be useful to [manually add another 
axis](https://numpy.org/doc/stable/reference/generated/numpy.expand_dims.html) to leverage broadcasting." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e54e77d", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A = np.zeros((3, 2))\n", + "B = np.arange(3)\n", + "print(A.shape)\n", + "print(B.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9eb94a5b", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A + B # This will fail since last dimensions mismatch: a: 2 vs b: 3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d14fff6", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A + B[:, None]" + ] + }, + { + "cell_type": "markdown", + "id": "893b2ae5", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "TODO Broadcasting Quiz, if still time left" + ] + }, + { + "cell_type": "markdown", + "id": "a6c61ee4", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Content\n", + " - Introduction to Numpy\n", + " - Datatypes\n", + " - Concept of multi-dimensional arrays\n", + " - Array access\n", + " - Broadcasting\n", + " - ***Universal functions***\n" + ] + }, + { + "cell_type": "markdown", + "id": "c31a4456", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## NumPy universal elementwise functions (\"ufuncs\")\n", + "[ufuncs](https://docs.scipy.org/doc/numpy/reference/ufuncs.html ) perform function operations on individual elements of `ndarray`s in an element-by-element manner. They have\n", + "broadcasting built-in." 
+ ] + }, + { + "cell_type": "markdown", + "id": "115ab970", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "Non-exhaustive list of universal functions and their operator equivalent.\n", + "\n", + "Operator | ufunc | Description\n", + "---------| ----------------|----------\n", + "`+` | `np.add()` | addition\n", + "`-` | `np.subtract()` | subtraction\n", + "`*` | `np.multiply()` | multiplication\n", + "`/` | `np.divide()` | division\n", + "`//` | `np.floor_divide()`| floor division\n", + "`**` | `np.power()` | exponentiation\n", + "`%` | `np.mod()` | remainder of division\n", + "\n", + "For more mathematical functions included in NumPy see [here](https://numpy.org/doc/stable/reference/ufuncs.html)." + ] + }, + { + "cell_type": "markdown", + "id": "db3eaeb7", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "These functions generally create a new (temporary) target array. You can use the `out=` parameter to avoid creating temporary output arrays by supplying an existing array." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a1e4b7d", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "A = np.arange(3 * 3).reshape(3, 3)\n", + "np.power(A, 2, out=A) # no temporary array\n", + "A" + ] + }, + { + "cell_type": "markdown", + "id": "30310e19", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "A useful resource on universal functions can be found [here](https://jakevdp.github.io/PythonDataScienceHandbook/02.03-computation-on-arrays-ufuncs.html)." 
+ ] + }, + { + "cell_type": "markdown", + "id": "d3d8ed2b", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Create your own ufuncs from scalar Python functions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba396e41", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# Poor implementation of a function that tests if its argument is prime.\n", + "def is_prime(x):\n", + " \"\"\"Check if input number is a prime number.\"\"\"\n", + " if x < 2:\n", + " return False\n", + " for value in range(2, x): # x is not included in range!\n", + " if x % value == 0:\n", + " return False\n", + " return True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb952cff", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# We can apply this function to an `ndarray`. This will be inefficient when we have to check many numbers.\n", + "[is_prime(x) for x in np.arange(1, 10, dtype=int)]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2a88b8c1", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "# We can create our *own* universal function \n", + "is_prime_ufunc = np.vectorize(is_prime)\n", + "print(is_prime_ufunc.__doc__)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "493eace7", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# let's make a timing analysis\n", + "number_range = np.arange(1, 1000, dtype=int)\n", + "%timeit [is_prime(x) for x in number_range]\n", + "%timeit is_prime_ufunc(number_range)" + ] + }, + { + "cell_type": "markdown", + "id": "123372f3", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Time For Hands On" + ] + }, + { + "cell_type": "markdown", + "id": "403b7e56", + "metadata": { + "slideshow": 
{ + "slide_type": "slide" + } + }, + "source": [ + "## Hands On Exercise\n", + " - Implement the K-means clustering algorithm. \n", + " - See [here](https://en.wikipedia.org/wiki/K-means_clustering) for a description of the algorithm.\n", + " - To explain the algorithm, we will first implement it with standard Python together.\n", + " - Then it's your turn to use Numpy for it.\n", + " - Which implementation is more efficient?\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "ce7e0814", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Summary\n", + "- Concept of numpy arrays\n", + "- Indexing / slicing arrays\n", + "- Difference between a copy and a view\n", + "- Array-oriented programming for better performance" + ] + }, + { + "cell_type": "markdown", + "id": "cd06a649", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "## Agenda for tomorrow\n", + "- 09:00 - 12:00 Morning session\n", + " - Introduction to Pandas\n", + " - Usage of Pandas `DataFrame`s\n", + "- 12:00 - 13:00 Lunch break\n", + "- 13:00 - 17:00 Afternoon session\n", + " - Some more `DataFrame`s\n", + " - **Hands on Exercises**" + ] + }, + { + "cell_type": "markdown", + "id": "6fad04ed", + "metadata": { + "slideshow": { + "slide_type": "notes" + } + }, + "source": [ + "TODO interactive quiz on broadcasting rules?\n", + "\n" + ] + } + ], + "metadata": { + "celltoolbar": "Slideshow", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "rise": { + "controls": true, + "controlsLayout": "edges", + "controlsTutorial": false, + "footer": "<img 
src=hpc-hessen-logo-only.png height=60 width=100>Competence Center for High Performance Computing in Hessen (HKHLR) Tim Jammer, Marcel Giar HiPerCH 2022", + "header": "", + "help": false, + "slideNumber": "c/t", + "theme": "white" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": false, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/slides/Day1.ipynb.license b/slides/Day1.ipynb.license new file mode 100644 index 0000000000000000000000000000000000000000..c207ab8c094a9d18d7c6cb5c9dfbf8913df4aa8a --- /dev/null +++ b/slides/Day1.ipynb.license @@ -0,0 +1,4 @@ +SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> + +SPDX-License-Identifier: MIT diff --git a/slides/Day2_PandasDataFrames.ipynb b/slides/Day2_PandasDataFrames.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..288ef4f76211849868b2ac8055e7f3dfe237e890 --- /dev/null +++ b/slides/Day2_PandasDataFrames.ipynb @@ -0,0 +1,2330 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4d6423b0", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# HiPerCH 14 Module 1: Introduction to Python Data Processing tools" + ] + }, + { + "cell_type": "markdown", + "id": "dbd2f680", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Pandas `DataFrame`s" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f4829e5", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + 
"source": [ + "%matplotlib inline\n", + "\n", + "from matplotlib import pyplot as plt\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "f\"Numpy version: {np.__version__}; Pandas version: {pd.__version__}\"\n", + "\n", + "import importlib\n", + "import utils\n", + "importlib.reload(utils)" + ] + }, + { + "cell_type": "markdown", + "id": "e1376473", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# `DataFrame` Objects\n", + "\n", + "The `pd.DataFrame` class provides a data structure to handle 2-dimensional tabular data. `DataFrame` objects are *size-mutable* and can contain mixed datatypes (e.g. `float`, `int` or `str`). All data columns inside a `DataFrame` share the same `index`." + ] + }, + { + "cell_type": "markdown", + "id": "857d1d3c", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Creating `DataFrame`s" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62875512", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "name = [\"person 1\", \"person 2\", \"person 3\"]\n", + "age = [23, 27, 34] " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2f65441", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Create nested list and pass column names\n", + "df = pd.DataFrame(data=zip(name, age), columns=[\"Name\", \"Age\"])\n", + "df # This gives a nicely formatted output. When using the `print` function the output looks different." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6e0313e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# The same can be achieved by using a `dict`\n", + "df = pd.DataFrame(data={\"Name\": name, \"Age\": age})\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "b81aae00", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "It is also possible to create `DataFrame`s from `Series` objects." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5b8385cc", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "math_grades = pd.Series({\n", + " 'student1': 15,\n", + " 'student2': 11,\n", + " 'student3': 9,\n", + " 'student4': 13,\n", + " 'student5': 12,\n", + " 'student6': 7,\n", + " 'student7': 14\n", + "})\n", + "chemistry_grades = pd.Series({\n", + " 'student1': 10,\n", + " 'student2': 14,\n", + " 'student3': 12,\n", + " 'student4': 8,\n", + " 'student5': 11,\n", + " 'student6': 10,\n", + " 'student7': 12,\n", + " \"student8\": 5 # <-- note the additional entry here\n", + "})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1234ca1c", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df = pd.DataFrame(data={\"Math Grades\": math_grades, \"Chemistry Grades\": chemistry_grades})" + ] + }, + { + "cell_type": "markdown", + "id": "5ae40e4e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "\n", + "Series objects are *matched by index* and missing values are replaced with a default value." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2222bd8e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df # default value is `NaN`" + ] + }, + { + "cell_type": "markdown", + "id": "a220919f", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Exercises (optional)" + ] + }, + { + "cell_type": "markdown", + "id": "552b4899", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "* Given the two iterables `values1` and `values2`, create a `pd.DataFrame` containing both in two different ways. Label the columns `'label1'` and `'label2'`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "267a91a5", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "values1 = np.random.randint(-10, 10, 5)\n", + "values2 = range(5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ffbf28d0", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df_iterables = pd.DataFrame(data=zip(values1, values2), columns=[\"label1\", \"label2\"])\n", + "df_iterables" + ] + }, + { + "cell_type": "markdown", + "id": "00fc0dd4", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "\n", + "* Combine the two `pd.Series` named `series1` and `series2` to a `pd.DataFrame`. Label the columns `'col1'` and `'col2'`. \n", + " * Replace missing values with `0`.\n", + " * Remove rows that contain `NaN` values." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0881ae3", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "series1 = pd.Series(data=range(5), \n", + " index=[f\"{idx}\" for idx in range(5)])\n", + "series2 = pd.Series(data=range(0, 10, 2), \n", + " index=[f\"{idx}\" for idx in range(0, 10, 2)])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "de1b4fbe", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df_from_series = pd.DataFrame({\"col1\": series1, \"col2\": series2})\n", + "df_from_series" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1f76d67e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# df_from_series.replace(np.NaN, 0 )\n", + "# df_from_series.dropna() " + ] + }, + { + "cell_type": "markdown", + "id": "86337fd7", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## What characterises a `DataFrame`?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6fd61f32", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df = pd.DataFrame(data={\"Math Grades\": math_grades, \"Chemistry Grades\": chemistry_grades})" + ] + }, + { + "cell_type": "markdown", + "id": "6f08fae8", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "How many rows and columns are contained in the `DataFrame`? We have seen this attribute when dealing with `ndarray`s ..." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9fc5896", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a12c423b", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Detailed information on the data contained inside the `DataFrame`.\n", + "df.info()" + ] + }, + { + "cell_type": "markdown", + "id": "7f4f886e", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "`DataFrame`s are essentially composed of 3 components. These components can be accessed with specific data attributes.\n", + "\n", + "- Index (`df.index`)\n", + "- Columns (`df.columns`)\n", + "- Body (`df.values`)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e6ed9ab7", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df.index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b0afa504", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ade6b93", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df.values" + ] + }, + { + "cell_type": "markdown", + "id": "de4be0fb", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Data indexing and selection" + ] + }, + { + "cell_type": "markdown", + "id": "fa82625d", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### The Iris flower dataset\n", + "\n", + "<a title=\"w:ru:Денис Анисимов (talk | contribs), Public domain, via Wikimedia Commons\" href=\"https://commons.wikimedia.org/wiki/File:Irissetosa1.jpg\"><img width=\"512\" alt=\"Irissetosa1\" 
src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Irissetosa1.jpg/512px-Irissetosa1.jpg\"></a>\n", + "\n", + "Image taken from: <a href=\"https://commons.wikimedia.org/wiki/File:Irissetosa1.jpg\">w:ru:Денис Анисимов (talk | contribs)</a>, Public domain, via Wikimedia Commons\n", + "\n", + "Attribution for dataset: *Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.*" + ] + }, + { + "cell_type": "markdown", + "id": "673472df", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "The dataset contains measurements of four \"features\" related to the species of Iris flowers:\n", + "* Petal length (\"Bluetenblattlaenge\")\n", + "* Petal width (\"Bluetenblattbreite\")\n", + "* Sepal length (\"Kelchblattlaenge\")\n", + "* Sepal width (\"Kelchblattbreite\")\n", + "\n", + "The species contained in the dataset are:\n", + "\n", + "* Iris setosa\n", + "* Iris virginica\n", + "* Iris versicolor" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f95709e5", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df = utils.download_IRIS()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30c97a17", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Quick check if data looks alright\n", + "# petal - Bluetenblatt\n", + "# sepal - Kelchblatt\n", + "df.head() \n", + "# df.tail()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76e5852d", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "285fad0e", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Column access 
with the `[]` operator.\n", + "df[\"Name\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c7fc248", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# The columns of a DataFrame are `Series` objects.\n", + "type(df[\"Name\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87123e39", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "data_columns = [cname for cname in df.columns if cname != \"Name\"]\n", + "data_columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32a00007", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df[data_columns]" + ] + }, + { + "cell_type": "markdown", + "id": "ce8a4f9e", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "As for `Series` objects the `loc` as well as the `iloc` methods are also available for `DataFrame`s." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "146337e7", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Remember that when using the `loc` method the argument passed to the `[]` operator must be present in `df.index`.\n", + "df.loc[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7bac9da5", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# We can also use slicing with the `loc` method.\n", + "df.loc[0::50].head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75a449ca", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Fancy indexing is also possible.\n", + "df.loc[[0, 50, 100]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7dcc63b4", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# We can combine row and column access with the `loc` method.\n", + "df.loc[:, ['sepal width', 'sepal length']].head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab98b434", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Rows can also be selected with boolean masks.\n", + "mask = (df[\"Name\"] == \"Iris-setosa\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87024f21", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df.loc[mask].head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9eb788b5", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# More complicated boolean masks can be conceived\n", + "mask = (df[\"sepal length\"] > 6.0) & (df[\"petal length\"] > 1.0) # use () for each boolean sub-expression\n", + "df.loc[mask]" 
+ ] + }, + { + "cell_type": "markdown", + "id": "2b97d718", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Exercises (optional)" + ] + }, + { + "cell_type": "markdown", + "id": "b1d0491f", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "* Change all column names to uppercase, e.g.\n", + " * \"petal length\" $\\to$ \"PETAL LENGTH\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86cadb9b", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "77c2b1fc", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "* From the `\"sepal length\"` column retrieve all values that are `> 6` but `< 7`. How often does each of the resulting values occur in this column? (*Hint*: Refer to the [`DataFrame` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for a method to count values.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "905540ad", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "a71c5cd3", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "* In the DataFrame `df`, *simultaneously* access the columns `\"sepal length\"`, `\"petal width\"`, and `\"Name\"` in two different ways.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a346e36", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1b3f2b9", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "f5522a5c", + "metadata": { + "slideshow": { + "slide_type": "subslide" + 
} + }, + "source": [ + "* Compare the following two ways of replacing data in a DataFrame. Do they both work? Why?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c85a31fe", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50f405da", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "eb2bcfe0", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "* Determine the indices in the `DataFrame` that correspond to rows that contain data on the Iris setosa species.\n", + "* Use these indices to delete the corresponding rows from the `DataFrame`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8cc5891d", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "69acd4c9", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "* Sort the rows of the `DataFrame` by the values contained in the columns `\"petal length\"` *and* `\"petal width\"`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4b240056", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "e852cb23", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Reading data into a `DataFrame`" + ] + }, + { + "cell_type": "markdown", + "id": "1c151460", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Pandas can import several common file formats:\n", + "\n", + "- `pd.read_csv`: Read in CSV spreadsheets (`.csv` suffix)\n", + "- `pd.read_excel`: Read in MS Office spreadsheets (`.xls` and `.xlsx` suffix) \n", + "- `pd.read_stata`: Read stata datasets (`.dta` suffix)\n", + "- `pd.read_hdf`: Read HDF datasets (`.hdf` suffix)\n", + "- `pd.read_sql`: Read from SQL database\n", + "\n", + "Other file formats are [supported](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) as well." + ] + }, + { + "cell_type": "markdown", + "id": "2ec29c1d", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Reading CSV files " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "002a168c", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Download the files and write to CSV file.\n", + "from pathlib import Path\n", + "importlib.reload(utils)\n", + "utils.download_IRIS_with_addons(delimiter=\";\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bdb4b736", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Inspect the file content. This command will only work on a UNIX-like operating system.\n", + "! 
head -n 15 tmp_with_addons/iris-data.csv | nl" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bcadd113", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Read the file with Pandas and specify the delimiter symbol as well as the comment symbol.\n", + "df = pd.read_csv(Path(\"tmp_with_addons\") / \"iris-data.csv\", delimiter=\";\", comment='#')\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e4cb505", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# We can limit the number of imported columns by specifying those that we explicitly want to have.\n", + "df = pd.read_csv(Path(\"tmp_with_addons\") / \"iris-data.csv\", \n", + " delimiter=\";\", \n", + " comment=\"#\", \n", + " usecols=[\"Name\", \"sepal length\", \"sepal width\"])\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62801f60", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# When importing data we can specify which data column should become the index in the `DataFrame`.\n", + "df = pd.read_csv(Path(\"tmp_with_addons\") / \"iris-data.csv\", delimiter=\";\", \n", + " comment=\"#\", index_col=\"Name\")\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9cf40bc", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df_tmp1 = df.copy(deep=True)\n", + "df_tmp2 = df.copy(deep=True)" + ] + }, + { + "cell_type": "markdown", + "id": "53834587", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Oftentimes -- when invoking a method of a `DataFrame` object -- a *new* `DataFrame` instance is returned. 
This means that new memory allocations will be made, which can be quite time-consuming and also a waste of precious memory resources." + ] + }, + { + "cell_type": "markdown", + "id": "c79088e2", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "Reset the index of the current `DataFrame`. This is done *out-of-place* and a new instance is returned." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c17dbdd", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df_tmp1.reset_index().set_index(\"sepal length\").head()" + ] + }, + { + "cell_type": "markdown", + "id": "d8f6726f", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "We can use the `inplace` argument to modify the current instance itself." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b244ae21", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# We can use the `inplace` argument to modify the object itself. 
\n", + "df_tmp2.reset_index(inplace=True)\n", + "df_tmp2.set_index(\"sepal length\", inplace=True)\n", + "df_tmp2.head()" + ] + }, + { + "cell_type": "markdown", + "id": "125a396b", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Operations with `DataFrame`s" + ] + }, + { + "cell_type": "markdown", + "id": "4446f2d6", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Arithmetic operations" + ] + }, + { + "cell_type": "markdown", + "id": "5a6a944b", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Mapping between Python arithmetic operators and `DataFrame` methods.\n", + "\n", + "| Python operator | Pandas methods |\n", + "|:---------------:|----------------------------------|\n", + "| `+` | `add()` |\n", + "| `-` | `sub()`, `subtract()` |\n", + "| `*` | `mul()`, `multiply()` |\n", + "| `/` | `truediv()`, `div()`, `divide()` |\n", + "| `//` | `floordiv()` |\n", + "| `%` | `mod()` |\n", + "| `**` | `pow()` |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74e113b4", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "A = pd.DataFrame(np.random.randint(0, 20, (3, 2)), columns=list(\"AB\"))\n", + "B = pd.DataFrame(np.random.randint(0, 20, (3, 3)), columns=list(\"BAC\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a9bb50e8", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Indices of all DataFrames involved in the operation are aligned. 
The order of each index is irrelevant.\n", + "# Data columns not shared by the DataFrames will be filled with `NaN`.\n", + "A + B" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc290da1", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Use the `add` method to specify the fill_value. Note that the `fill_value` will be used in the DataFrame with the\n", + "# *missing* column. The specified `fill_value` is then used in the arithmetic operation. \n", + "# >>> Choose wisely when using the `fill_value` argument <<<\n", + "A.add(B, fill_value=-1000)" + ] + }, + { + "cell_type": "markdown", + "id": "7d5fccee", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "NumPy broadcasting rules apply for `DataFrame`s as well." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dfd28894", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df = pd.DataFrame(np.random.randint(10, size=(3, 4)), columns=list(\"wxyz\"))\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d8b8e06", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Subtract a row.\n", + "df - df.loc[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c0ecfb7", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Call the appropriate method if you want to operate on the columns. We operate along axis=0 (the rows).\n", + "df.sub(df[\"x\"], axis=0)" + ] + }, + { + "cell_type": "markdown", + "id": "8da5ad3d", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "`DataFrame`s can be fed to NumPy `ufunc`s." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55d00536", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "np.exp(df)" + ] + }, + { + "cell_type": "markdown", + "id": "60dfcb8b", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "New columns can be added with arithmetic operations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29bb3975", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df[\"asdf\"] = np.sin( df[\"x\"] + df[\"y\"] )\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "b1381d02", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Methods for operating on `DataFrame`s" + ] + }, + { + "cell_type": "markdown", + "id": "aa5758e3", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Pandas `DataFrame` and `Series` objects have several built-in methods to operate on the data.\n", + "\n", + "- `apply()`: available for *both* `Series` and `DataFrame` objects\n", + "- `transform()`: available for *both* `Series` and `DataFrame` objects\n", + "- `applymap()`: *only* available for `DataFrame` objects (deprecated since pandas 2.1 in favour of `DataFrame.map()`)\n", + "- `map()`: available for `Series` objects (and, since pandas 2.1, for `DataFrame` objects as well)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0545a722", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df = utils.download_IRIS()\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0b8e197", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Get a subset of columns by using regular expressions\n", + "data_columns = df.columns[df.columns.str.match('^(petal|sepal).*(width|length)$')]\n", + "data_columns" + ] + }, + { + "cell_type": "markdown", + "id": "c4ca8241", + 
"metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)\n", + "\n", + "```python\n", + "DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)\n", + "```\n", + "- *applies* a function (callable) along an `axis` of the `DataFrame`\n", + " - `axis=0`: `func` is applied to each column (a `Series` object). This is the default!\n", + " - `axis=1`: `func` is applied to each row\n", + "- return type is inferred from `func`" + ] + }, + { + "cell_type": "markdown", + "id": "ec3f1484", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "The return type of `func` determines the form of the result.\n", + "\n", + "`func` can operate on `Series` objects an perform operations that are supported by these types of objects (e.g. by means of the methods `.min()`, `.max()` or `.mean()`). \n", + "- result can be a scalar value (e.g. `.sum()` which is an aggregation operation)\n", + "- result can be another `Series` object" + ] + }, + { + "cell_type": "markdown", + "id": "440dfc49", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Compute the mean value of each column (this is the default because we do not specify the `axis` argument)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f412f504", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "mean_values = df[data_columns].apply(lambda x: x.mean())\n", + "mean_values # This returns a `Series` object because x.mean() returns a scalar value." + ] + }, + { + "cell_type": "markdown", + "id": "511add01", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Question\n", + "\n", + "How does the result look like if the operate along the rows of the `DataFrame`. This is achieved by using the argument `axis = 1`. 
What is the shape of the resulting object?\n" + ] + }, + { + "cell_type": "markdown", + "id": "ef03fc24", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Now we transform the values in the columns of the `DataFrame`. We define a function that will operate on the `Series` objects that form the columns.\n", + "\n", + "The object resulting from this operation is another `DataFrame` instance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58bdf8e5", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "def scale_to_mm(s):\n", + " return s * 10\n", + "\n", + "df_scaled_to_mm = df[data_columns].apply(scale_to_mm) # This will return a new DataFrame\n", + "df_scaled_to_mm[\"Name\"] = df[\"Name\"]\n", + "df_scaled_to_mm.head()" + ] + }, + { + "cell_type": "markdown", + "id": "233adbba", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Question\n", + "\n", + "How must the above command be changed if we want to operate along the rows of the `DataFrame` instead? Does this also work with the already-defined function or do we have to define a dedicated function?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5dd57a2b", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df[data_columns].apply(scale_to_mm, axis=1).head()" + ] + }, + { + "cell_type": "markdown", + "id": "371d521f", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Experimenting with the `apply()` method" + ] + }, + { + "cell_type": "markdown", + "id": "afb7a126", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Let's generate a large `DataFrame`. We wish to operate on the data with the `apply` method. 
We can do this in two different ways:\n", + "- Operate along the rows (`axis=1`)\n", + "- Operate along the columns (`axis=0`)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "355d3698", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "N_rows, N_cols = 10_000, 500\n", + "data = pd.DataFrame(np.random.random((N_rows, N_cols)), columns=[f\"col{idx}\" for idx in range(N_cols)])" + ] + }, + { + "cell_type": "markdown", + "id": "2d78baa7", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Question \n", + "\n", + "What do you think is faster: Operating along the columns or operating along the rows?\n", + "\n", + "When you have made your decision try to come up with a reason!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "368b5567", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "%timeit data.apply(lambda x: x ** 2, axis=0) # operate along columns\n", + "%timeit data.apply(lambda x: x ** 2, axis=1) # operate along rows" + ] + }, + { + "cell_type": "markdown", + "id": "524759b7", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "The `apply` method operates on `Series` objects. The columns of a `DataFrame` are `Series`. Inside each `Series` data is stored contiguously in memory. Hence operating on the columns is *fast*.\n", + "\n", + "When operating row-wise, a new `Series` object must be generated for *each* row. A buffer must be allocated in memory and the data needs to be copied to that buffer before the `apply` method can operate on it. Since these steps are repeated for every row, this procedure is generally *slower* than operating along the columns." 
+ ] + }, + { + "cell_type": "markdown", + "id": "cb2244cb", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Task (optional)\n", + "\n", + "The names of the Iris species are contained in the column with heading `\"Name\"`. The names follow the pattern:\n", + "\n", + "```\n", + "Iris-<identifier for species>\n", + "```\n", + "\n", + "Remove the dash `-` from the names and just keep the identifier for each species. Use the `apply` method.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79f7a36a", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "3f112be8", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### `transform()`\n", + "\n", + "```python\n", + "DataFrame.transform(func, axis=0, *args, **kwargs)\n", + "```\n", + "\n", + "`func` can either be\n", + "- callable, e.g. `np.exp`\n", + "- list-like, e.g. `[np.sin, np.cos]`\n", + "- dict-like, e.g. `{\"sepal length\": np.sin, \"petal length\": np.cos}`. Application is limited to the column names passed as keys to the `dict`.\n", + "- string, e.g. `\"sqrt\"`\n", + "\n", + "*Note*: This function *transforms*, i.e., when the input value is a `Series`, another (transformed) `Series` is returned. 
Returning a scalar value is not valid (resulting error message will be: `ValueError: Function did not transform\n", + "`)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4848e300", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df[data_columns].transform({\"sepal length\": np.cos, \"petal length\": np.sin}).head()" + ] + }, + { + "cell_type": "markdown", + "id": "baccb2f2", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Task (optional)\n", + "\n", + "Convert the measured values (which are all given in cm units) to mm units by using the `transform` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d19379c2", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "c0910d4c", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Performance considerations" + ] + }, + { + "cell_type": "markdown", + "id": "b7d5536e", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "When operating on columns of a `DataFrame` or on a `DataFrame` *as a whole*, it is oftentimes faster to use vectorised operations instead of column-/row-wise operations.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8432d37", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df = pd.DataFrame(np.random.randn(1_000_000, 3), columns=list(\"abc\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75e45ac5", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "%timeit df.apply(lambda x: x ** 2, axis=0)\n", + "%timeit df ** 2\n", + "%timeit (df.values ** 2) # here we operate on the underlying `ndarray`" + ] + }, + { + "cell_type": "markdown", + "id": 
"dd5b8a17", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### `assign`" + ] + }, + { + "cell_type": "markdown", + "id": "2cd2aaa0", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "The `assign` method adds a new column to a `DataFrame`. It is called on an existing `DataFrame` and returns a new `DataFrame` (that has all columns of the original `DataFrame`) with the new column added.\n", + "\n", + "* Allows to add single as well as multiple columns per call." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "527d0daf", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df_mean = df[data_columns].mean()\n", + "df.assign(\n", + " petal_length_dev_from_mean=lambda x: x[\"petal length\"] - df_mean[\"petal length\"],\n", + " petal_width_dev_from_mean=lambda x: x[\"petal width\"] - df_mean[\"petal width\"],\n", + " sepal_length_dev_from_mean=lambda x: x[\"sepal length\"] - df_mean[\"sepal length\"],\n", + " sepal_width_dev_from_mean=lambda x: x[\"sepal width\"] - df_mean[\"sepal width\"]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "f470a4a6", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Grouping data" + ] + }, + { + "cell_type": "markdown", + "id": "ab7c9102", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Properties of `GroupedBy` objects" + ] + }, + { + "cell_type": "markdown", + "id": "e360d0f1", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "- Oftentimes items in a dataset can be grouped in a certain manner (e.g., if a column contains a value multiple times). 
The Iris dataset, for instance, can be grouped according to the species of each flower.\n", + "\n", + " ```python\n", + " my_dataframe.groupby(by=[\"<column label>\"])\n", + " ```\n", + "- The `DataFrame` is split and entries are grouped according to the values in the column with label `\"<column label>\"`. Once the data has been grouped, operations can be conducted on the items of each group.\n", + "\n", + "*Note*: `DataFrame`s can be [grouped](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) not only according to the entries of a column." + ] + }, + { + "cell_type": "markdown", + "id": "99fce3ac", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "The return type of `groupby()` is *not* another `DataFrame` but rather a `DataFrameGroupBy` object. We can imagine this object to be a grouping of multiple `DataFrame`s.\n", + "\n", + "It is important to understand that such an object essentially is a special *view* on the original `DataFrame`. No computations have been carried out when generating it (lazy evaluation)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72c259e0", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "df = utils.download_IRIS()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "00f1cbca", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# We group the data according to the species of the flowers\n", + "grouped_by_species = df.groupby(by=[\"Name\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "acbfef47", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(type(grouped_by_species))" + ] + }, + { + "cell_type": "markdown", + "id": "12e3ea0f", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "This data structure still knows about the `columns` that were present in the original `DataFrame`. We can use the `[<column-name>]` operation to access the column with the corresponding label in each of the group members (subframes)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74f7786d", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "grouped_by_species[\"sepal length\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab6a8365", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Pandas will access the corresponding column of all subframes and apply the functions passed to the `agg()` method.\n", + "grouped_by_species[\"sepal length\"].agg([np.min, np.max, np.mean])" + ] + }, + { + "cell_type": "markdown", + "id": "3e17a46e", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "We can iterate over the `DataFrameGroupBy` object. Each iteration yields the group name together with the corresponding subframe (a `DataFrame`, or a `Series` if a single column was selected)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a245fca9", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "for (species, subframe) in grouped_by_species:\n", + " print(f\"Subframe for species {species} has shape {subframe.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d110f91", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Call the getter to obtain a `DataFrame`.\n", + "grouped_by_species.get_group(\"Iris-setosa\").head()" + ] + }, + { + "cell_type": "markdown", + "id": "cf9faf2d", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Methods that are not directly implemented for the `DataFrameGroupBy` object are passed to the subframes and executed on these." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db2a75ca", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# The `describe()` method can also be called on the full object but the output would be rather hard to view.\n", + "grouped_by_species[\"sepal length\"].describe() # The return type is a `DataFrame`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2df3b968", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Single methods are available as well. E.g. 
`mean()`, `std()` or `sum()`\n", + "grouped_by_species.mean() # The return type is a `DataFrame`" + ] + }, + { + "cell_type": "markdown", + "id": "dafb2e66", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Operating on `GroupedBy` objects" + ] + }, + { + "cell_type": "markdown", + "id": "9bbd870d", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "`DataFrameGroupBy` objects support `aggregate()`, `filter()`, `transform()` and `apply()` operations.\n", + "\n", + "These methods can be efficiently used to implement a great variety of operations on grouped data." + ] + }, + { + "cell_type": "markdown", + "id": "03cd3096", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### [`aggregate()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html) (or simply `agg()`)\n", + "\n", + "```python\n", + "DataFrameGroupBy.aggregate(func=None, *args, engine=None, \n", + " engine_kwargs=None, **kwargs)\n", + "```\n", + "\n", + "`func` can for example be ...\n", + "- ... a function (Python callable),\n", + "- ... a string specifying a function name (e.g. `\"mean\"`),\n", + "- ... a list of functions or strings (e.g. `[\"std\", np.mean]`),\n", + "- ... a `dict` of column labels and functions to apply (e.g. `{'data1': np.mean}`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32798a44", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Perform some common aggregations within each subframe. 
The output of this method is another `DataFrame`.\n", + "group_agg = grouped_by_species.agg([np.min, np.max, np.mean, np.std])\n", + "group_agg" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "217e1c1f", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# To understand this a bit better consider the following. Note that we limit the output to only one species.\n", + "df.loc[df[\"Name\"] == \"Iris-setosa\", df.columns[:-1]].agg(\n", + " [np.min, \n", + " np.max, \n", + " np.mean, \n", + " np.std]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cfe77e99", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "The resulting output looks somewhat more complicated than what we are used to from `DataFrame`s so far. The column labels are now hierarchical because several aggregation functions were applied to each column." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9de104f9", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "group_agg.columns # This is a so-called `MultiIndex`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4ef5258d", + "metadata": {}, + "outputs": [], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "9fe0ff59", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Exercises (optional)" + ] + }, + { + "cell_type": "markdown", + "id": "e1b12f15", + "metadata": {}, + "source": [ + "### Task 1\n", + "\n", + "Consider the Iris dataset.\n", + "\n", + "* For each of the features compute the mean value as well as the standard deviation.\n", + "* Center the values of a particular feature on its mean value and scale them to have unit variance.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62978f94", + "metadata": {}, + "outputs": [], + "source": [ + "df = utils.download_IRIS()" + ] + }, + { + "cell_type": "markdown", + "id": "5b09bfcb", + "metadata": {}, + "source": [ + "Let us first make a working copy of the `DataFrame` containing the Iris dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15edb74b", + "metadata": {}, + "outputs": [], + "source": [ + "df_tmp = df.copy()" + ] + }, + { + "cell_type": "markdown", + "id": "d7ba0a2a", + "metadata": {}, + "source": [ + "Next, compute the mean value and the standard deviation for all features of the dataset. Computing these quantities does *not* take into account the particular species." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53d45569", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "ae660e8f", + "metadata": {}, + "source": [ + "Now transform each of the features to be centred on the mean value and to have unit variance."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52a96d54", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "f2b49c83", + "metadata": {}, + "source": [ + "### Task 2\n", + "\n", + "Again consider the Iris dataset.\n", + "\n", + "* Group the measured values by the species.\n", + "* Create boxplots for each species for all features.\n", + " * Retrieve the names of the single groups from the `GroupedBy` objects.\n", + " * Get the `DataFrame` for each of the groups from the `GroupedBy` object and call the [`boxplot` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) to create the plot.\n", + " * Use the names in the titles of the plot.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ae51b08", + "metadata": {}, + "outputs": [], + "source": [ + "df = utils.download_IRIS()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "801dd946", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "90b8c49e", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a031db1b", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "celltoolbar": "Slideshow", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "rise": { + "controls": true, + "controlsLayout": "edges", + "controlsTutorial": false, + "footer": "<img src=hpc-hessen-logo-only.png height=60 width=100>Competence Center for High Performance Computing in Hessen (HKHLR) Tim Jammer, Marcel Giar HiPerCH 2022", + 
"header": "", + "help": false, + "slideNumber": "c/t", + "theme": "white" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": false, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/slides/Day2_PandasDataFrames.ipynb.license b/slides/Day2_PandasDataFrames.ipynb.license new file mode 100644 index 0000000000000000000000000000000000000000..c207ab8c094a9d18d7c6cb5c9dfbf8913df4aa8a --- /dev/null +++ b/slides/Day2_PandasDataFrames.ipynb.license @@ -0,0 +1,4 @@ +SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> + +SPDX-License-Identifier: MIT diff --git a/slides/Day2_PandasSeries.ipynb b/slides/Day2_PandasSeries.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..9dec8104e92205f611b670ca6097e2e73ce17fcb --- /dev/null +++ b/slides/Day2_PandasSeries.ipynb @@ -0,0 +1,1749 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# HiPerCH 14 Module 1: Introduction to Python Data Processing tools" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Day 2: Pandas" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## A Python data analysis library and data manipulation tool\n", + "- essential Python library for data analysis\n", + "\n", + "- a \"wrapper\" around numpy\n", + " - basic knowledge of numpy is required for 
this course\n", + " - numpy provides efficiency \"under the hood\"\n", + " - pandas provides lots of ready-made functions for analyzing and plotting data\n", + "\n", + "- \"Excel inside of Python\"\n", + "\n", + "- provides its own data structures\n", + " - `Series` and `DataFrame`s have numerous methods to work on data\n", + " - no need for imperative programming!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "from matplotlib import pyplot as plt\n", + "\n", + "import pandas as pd\n", + "import numpy as np \n", + "\n", + "print(f'Pandas version: {pd.__version__}\\nNumpy version: {np.__version__}')\n", + "\n", + "import importlib\n", + "# import utils" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Pandas `Series` Objects\n", + "\n", + "- essentially `np.ndarray`s with generalized indexing capabilities\n", + "- have an `index`, `values`, a `size`, and a `dtype`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Constructing `Series` from Python objects\n", + "- may use `list`s, `tuple`s, or `dicts`\n", + "- `set` does *not* work since contained data is *unordered*\n", + "- can contain different datatypes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Construct from Python `list` object\n", + "integers = pd.Series(data=[10, 30, 195, 2021]) # data keyword can be omitted since it is the first positional argument\n", + "integers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Series objects have important metadata\n", + 
"integers.values, integers.index, integers.dtype, integers.size" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# When constructing from a `dict` the keys become the index and the values become the value entries.\n", + "ordinal_values = pd.Series({'a': 97, 'b': 98, 'c': 99})\n", + "ordinal_values" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Generating `Series` from numpy arrays\n", + "- fastest to \"stay in the numpy world\"\n", + "- Series neatly wrap themselves around numpy arrays" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "array = np.arange(10, 14)\n", + "integers = pd.Series(data=array)\n", + "integers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "floats = pd.Series(data=np.random.randn(4))\n", + "floats" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Indexing Series\n", + "- *not* recommended: Python-style indexing with `[]` operator\n", + " - unintuitive behavior\n", + " - slicing refers to *numeric* indices\n", + "- use Series methods `.loc`, `.iloc` instead" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Indexing with `.loc`, `.iloc` methods\n", + "- [`loc[<index value>]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.loc.html)\n", + " - access by actual (index) label\n", + " - slices include both end points\n", + "\n", + "- [`iloc[<index value>]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)\n", 
+ " - numeric indexing with integers, from 0\n", + " - slices exclude the end point (as with e.g. ranges)\n", + "- can be used with boolean arrays" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "ordinal_values = pd.Series({'a': 97, 'b': 98, 'c': 99})\n", + "ordinal_values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# We can use the `[]` operator with the `loc` and `iloc` methods.\n", + "ordinal_values.loc['a'], ordinal_values.iloc[1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Slicing\n", + "ordinal_values.loc['a':'c']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "yearly_numbers = pd.Series(data=[137, 214, 195, 271], index=[2014, 2016, 2018, 2020])\n", + "yearly_numbers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers.loc[2020], yearly_numbers.iloc[-1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers.iloc[0:2] # This will *not* work with the `loc` method!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers.loc[2014:2018] # This will *not* work with the `iloc` method!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Setting values\n", + "- values of a `Series` can be modified\n", + "- use `.loc`, `.iloc` for indexing!\n", + " - unintuitive results with \"standard Python\" indices" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers[2014] = 138" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers[2016] = 300.78 # Warning: This is typecast to `int`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers[2018] = \"300\" # Warning: This is typecast to `int`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers[2020] = \"300.7889\" # Warning: Now the *Series* changes!\n", + "yearly_numbers[0:2] = [1.5, 2.2] # This also changes the series!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Do it this way instead\n", + "yearly_numbers = pd.Series(data=[137, 214, 195, 271], index=[2014, 2016, 2018, 2020])\n", + "yearly_numbers.dtype" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers.loc[2014] = 300.78\n", + "yearly_numbers # Setting with `.loc` *always* changes the Series datatype (here: type conversion from `int64` to `float64`)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers.iloc[2] = \"Can also set to a string\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers.loc[2014:2018] = ['this', 'also', 'works']\n", + "yearly_numbers" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Setting the Index\n", + "- `Series` have an `index` as a separate attribute\n", + " - index itself is a numpy array\n", + "- can be inspected and set\n", + " - various data types possible\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers.index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_numbers.index = [0, 2, 4, 6]\n", + "yearly_numbers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], 
+ "source": [ + "floats.values, floats.index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "floats.index = ['a', 'c', 'b', 'd']\n", + "floats.loc['b']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "floats.loc['c':'d']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "We can set the index when we create an instance of the `pd.Series` object. Use the `index` argument of the `pd.Series` constructor for this purpose." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "monthly_numbers = pd.Series([5, 2, 3, 91], index = 'Jan Feb Mar Apr'.split())\n", + "monthly_numbers" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Fancy indexing\n", + "- we can address series by more than one index at once\n", + " - give a sequence of indices we want to pull out for `.loc`, `.iloc`\n", + " - values may repeat" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "monthly_numbers.loc[['Jan', 'Mar']]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "monthly_numbers.iloc[[1, 2, 3, 2, 3, 2, 1]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers.loc[[True, False, True, False]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + 
"slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Operations on Series\n", + "- Series can be added, multiplied, divided, ...\n", + " - operations are performed element-wise\n", + " - with other series: performed by index (*not* the numeric index!)\n", + " - with scalar values: broadcast to all values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "yearly_revenue = pd.Series([4, 20, 69, 420])\n", + "yearly_expenses = pd.Series([1, 33, 7, 57])\n", + "\n", + "yearly_revenue - yearly_expenses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "yearly_revenue + yearly_expenses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_revenue % yearly_expenses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_revenue > yearly_expenses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "yearly_revenue = pd.Series([4, 20, 69, 420], index=[2017, 2018, 2019, 2020])\n", + "yearly_expenses = pd.Series([1, 33, 7, 57], index = [2020, 2018, 2017, 2019])\n", + "\n", + "yearly_revenue - yearly_expenses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "yearly_expenses = pd.Series([1, 33, 7, 57, 120000], index=[2020, 2018, 2017, 2019, 2020])\n", + "yearly_revenue - yearly_expenses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + 
"slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "integers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "integers + 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers ** 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers + 0.2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers < 12" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Boolean Masks\n", + "- an easy way to extract data by condition\n", + " 1. create a boolean mask (same length, entries `True` / `False`)\n", + " 2. 
use with `.loc`\n", + "- careful: Cannot use \"Truthiness\" in place of booleans\n", + " - may need to explicitly compare" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "integers < 10" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers.loc[integers < 12]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "integers % 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers[integers % 2]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers.loc[integers % 2 == 1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers.loc[(integers < 11) | (integers > 12)]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Exercises\n", + "1. Create a `Series` with 8 random integers in the range $[0,7]$. For the `index` use letters \"a\" to \"h\".\n", + "2. Which entry do you get for index \"d\"? \n", + "3. Retrieve the first, the fifth and the last entry of the `Series`.\n", + "4. Retrieve all `Series` entries that are even.\n", + "5. Is the sum of all entries an even or an odd number?\n", + "6. Copy all values into a new `Series` object. For the new object use indices 'fegdachb' (in this order).\n", + "7. What do you get when dividing one `Series` object by the other?"
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Plotting data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Pandas `Series` objects have an interface to Matplotlib that can be conveniently used to generate plots of datasets. The advantage of having a dedicated method for visualising (parts of) the data will become even more apparent when we deal with Pandas `DataFrame` objects." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "- `Series` instances have a [`plot()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html) method that returns a Matplotlib `Axes` object.\n", + " - The `kind` parameter of this method allows you to choose between different *types* of plots (default value is `'line'`). \n", + "- It is also possible to use the `plot` accessor, which offers dedicated methods for certain types of plots (e.g. `pandas.Series.plot.line` or `pandas.Series.plot.bar`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "x_values = np.linspace(-np.pi, np.pi, num=201)\n", + "cos_data = pd.Series(data=np.cos(x_values), index=x_values)\n", + "\n", + "ax = cos_data.plot.line()\n", + "ax.set_xlabel(\"$x$ label\")\n", + "ax.set_ylabel(\"$y$ label\")\n", + "ax.grid()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "The plotting capabilities of `Series` are particularly useful when dealing with categorical data.\n", + "\n", + "We will learn more about this later when we deal with `pd.DataFrame`s."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Datatypes and Missing values" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Series data types\n", + "* internally, `Series` (and indices) use NumPy datatypes\n", + "* important implications for \"big data\":\n", + " * storage requirement differs widely\n", + " * overflow, precision\n", + "* at creation, pandas determines a \"fitting\" dtype\n", + " * only numeric types or \"object\"\n", + "* `Series` are \"flexible\"\n", + " * assignment can *change* the Series data type (and therefore the type of the underlying C array)\n", + " * easy typecasting with `.astype`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "integers32 = pd.Series(np.ones((1000000,)), dtype=np.int32) # use `dtype` to specify the type when creating a `Series`\n", + "integers32.dtype\n", + "# integers.memory_usage(index=False, deep=True) # returned values are in [bytes]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Use the `astype` method for explicit type conversion.\n", + "integers64 = integers32.astype(np.int64)\n", + "integers64.dtype" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Type casts can also happen *implicitly*\n", + "floats64 = integers32 * 1.0\n", + "floats64.dtype" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Implicit type conversion can also happen when changing a single value.\n", + "integers32.loc[0] = 
1.234\n", + "integers32.dtype" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "An implicit or explicit type conversion can *increase* the memory demands of a `Series`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers = pd.Series(np.ones((1000000,)), dtype=np.int32)\n", + "integers.memory_usage(index=False, deep=True) # returned values are in [bytes]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers.astype(np.float64).memory_usage(index=False, deep=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "When casting to a type with a larger \"itemsize\" (i.e. more bits used to represent the numeric type), a reallocation must occur to accommodate the larger memory demand of the underlying C array." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Exercises\n", + "Create three Series with random entries:\n", + "* a Series with 4 integer values and indices 'abcd',\n", + "* a Series with 5 float values and indices 'abcde',\n", + "* and a Series with 6 boolean values and indices 'abcabc'\n", + "\n", + "\n", + "1. multiply these Series pairwise (for each pair of two series). Where and why do you see missing values?\n", + "2. how can you deal with `nan`-values or prevent their creation?"
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Transformations (additional material)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "* Series values and indices are mutable\n", + " * can easily be re-assigned\n", + " * typical operations still create new instances" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "More comprehensive transformations need dedicated methods:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* replace\n", + " * `Series.replace` *ignores* values not found\n", + " * `Series.map` *replaces* values not found with `NaN`\n", + "\n", + "* condense\n", + " * `Series.cumsum` adds progressively\n", + " * `Series.aggregate` (or `Series.agg`) returns a scalar value\n", + " \n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "cell_style": "split", + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* sort\n", + " * `Series.sort_values` sorts by series *values*\n", + " * `Series.sort_index` sorts by series *index*\n", + "* manipulate\n", + " * `Series.apply` uses a single function\n", + " * `Series.transform` uses one or more functions, \"string functions\", or dicts" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Replace and map\n", + "* Replace values with different values according to a replacement rule\n", + "* for the difference, see also https://stackoverflow.com/a/62947436" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### `Series.replace`\n", + "- can utilize strings or regular expressions\n", + "- may give two positional arguments:
replace first with second\n", + "- may also give a mapping (dict or Series)\n", + "- all values not explicitly given are ignored\n", + "\n", + "See `help(pd.Series.replace)` for more details." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "strings = pd.Series('Er sah das Wasser as'.split())\n", + "strings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "strings.replace(to_replace='as', value='an') # replace with string" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "strings.str.replace('as', 'an') # accessing the `str.replace` method" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "integers = pd.Series((0, 10, 20, 30))\n", + "integers.replace(0, 1000) # replace with two values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers.replace({10: 100, 20: 200, 50: 10}) # replace with a dict" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### `Series.map`\n", + "- accepts a Series, dict, or function\n", + " - `Series` with old values in the index\n", + " - `dict` with old values: new values as key-value pairs\n", + " - function with a single argument: similar to `apply` (see below)\n", + "- if a value is not found, replace with `na`\n", + "\n", + "Refer to `help(pd.Series.map)` for more details.\n", + "\n", + "Main difference to `replace`: `map` is applied to each element of a `Series` object while `replace` 
is usually applied to only a few elements." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "integers = pd.Series(range(1, 5))\n", + "integers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "integers.map(lambda x: x ** 2 / (x + 1)) # pass a callable" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Condense\n", + "- `Series.cumsum` cumulates values\n", + "- `Series.mean`, `Series.std` for statistics\n", + "- `Series.all`, `Series.any` for truthiness\n", + "- `Series.agg` with arbitrary functions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### `Series.cumsum`\n", + "- adds up all values up to (and including) a given index in the `Series`\n", + "- sometimes useful in statistics\n", + "- returns a Series of the sum up to each index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "errors = pd.Series((1, 1, 0, 0, 2, 2, 1), index=pd.date_range(start='2021-04-01', periods=7))\n", + "errors" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "errors.cumsum()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### `Series.aggregate`\n", + "* applies a function to a Series\n", + " * returns a single value\n", + "* applies a *list of* functions\n", + " * returns a *Series of* values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type":
"subslide" + } + }, + "outputs": [], + "source": [ + "errors.agg(np.sum) # pass a single callable. Return value is a scalar." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "errors.agg([np.sum, np.std]) # pass a list of callables. The operation then returns a `Series` object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "random_normal = pd.Series(np.random.normal(size=5))\n", + "# We can also pass multiple callables in a list. The operation then returns a `Series` object.\n", + "random_normal.agg([pd.Series.count, pd.Series.mean, pd.Series.std,\n", + " pd.Series.min, pd.Series.max, pd.Series.quantile],\n", + " q=0.25) # keyword arguments to be passed to each function " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### `Series` and statistics\n", + "`pd.Series` objects have a number of methods that can be used to compute statistical quantities such as the mean value, the median or the standard deviation.\n", + "\n", + "The latter deserves some explanation: There is a *difference* between NumPy and Pandas in how the standard deviation (oftentimes denoted as $\\sigma$) is computed:\n", + "\n", + "Generally:\n", + "$$\\mu = \\frac{1}{N} \\sum_{i=1}^N s_i\\quad;\\quad\\sigma = \\sqrt{\\frac{1}{N-\\Delta_{\\text{dof}}} \\sum_{i=1}^N (s_i - \\mu)^2}$$\n", + "- with *delta degrees of freedom* $\\Delta_{\\text{dof}}$: default 1, **differing from [numpy.std](https://numpy.org/doc/stable/reference/generated/numpy.std.html)** (where `ddof=0` by default)\n", + "- pass `ddof=0` for the \"uncorrected\" standard deviation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], +
"source": [ + "random_values = pd.Series(np.random.random((10,)))\n", + "print(f\"Standard deviation with default Pandas behaviour: {random_values.std()}\") # ddof=1 by default\n", + "print(f\"Standard deviation with default NumPy behaviour : {random_values.std(ddof=0)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Exercises\n", + "\n", + "Create a `Series` named `ints` with random integers between 0 and 100.\n", + "\n", + "* what do you get with `ints.replace(ints)`, versus `ints.map(ints)`? Where and why do you get missing values?\n", + "\n", + "* replace all values < 10 and all values > 90 with `np.nan`. What else changes?\n", + "* write functions `sum_odd` and `sum_even`, which sum the odd and even values of a series, respectively. Use `Series.aggregate` to create a new Series with the sum of even values, the sum of odd values, and the sum of all values." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Apply and Transform\n", + "- invoke a function on the values\n", + " - operates on *one row at a time*\n", + " - may provide additional keyword args\n", + "- for the difference, see https://towardsdatascience.com/difference-between-apply-and-transform-in-pandas-242e5cf32705" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### `Series.transform`\n", + "(single Series $\\rightarrow$ multiple results)\n", + "- may use a (numpy or python) function, a 'string function', a list of functions, or a dict\n", + "- cannot use to aggregate Series (result has same length as input)\n", + "- may only use a single Series at a time" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "values = pd.Series(range(10, 40, 10))" + ] + 
}, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "values.transform(np.exp) # transforms the whole `Series` and returns another `Series`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "df_values = values.transform([np.exp, np.sin, np.cos])\n", + "df_values # this is a DataFrame (we will deal with this data structure later)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "# Non-transforming functions produce a ValueError\n", + "def compute_mean(x):\n", + " return x.mean()\n", + "\n", + "values.transform(compute_mean)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### `Series.apply`\n", + "(multiple Series $\\rightarrow$ single result)\n", + "- may *only* use a numpy ufunc, string function, or a Python function\n", + " - cannot always use list or dict\n", + "- may use multiple Series (of a DataFrame) at a time\n", + "- may produce aggregated results\n", + "- may automatically convert the data type" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [], + "source": [ + "values = pd.Series(range(10, 40, 10))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "values.apply(np.exp) # ufunc applied to each value of the Series -> returns another Series" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# Try this with the
`transform()` method and see what happens.\n", + "values.apply('sum') # reduction operations: returns the sum of all values in the Series (a scalar!)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Exercises\n", + "\n", + "Create a `Series` named `ints` with random integers between 0 and 100.\n", + "\n", + "- apply (with `Series.apply`) the list of functions `[np.log, np.exp, 'sqrt', 'square']` to the Series. Inspect the resulting object. Then apply the function `'sum'` to this object, passing the additional argument `axis=1`.\n", + "\n", + "- how can you reach the same final result with a single call to `Series.transform`?" + ] + } + ], + "metadata": { + "celltoolbar": "Slideshow", + "file_extension": ".py", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "mimetype": "text/x-python", + "name": "python", + "npconvert_exporter": "python", + "pygments_lexer": "ipython3", + "rise": { + "controls": true, + "controlsLayout": "edges", + "controlsTutorial": false, + "footer": "<img src=hpc-hessen-logo-only.png height=60 width=100>Competence Center for High Performance Computing in Hessen (HKHLR) Tim Jammer, Marcel Giar HiPerCH 2022", + "header": "", + "help": false, + "slideNumber": "c/t", + "theme": "white" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": false, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": { + "height": "calc(100% - 180px)", + "left": "10px", + "top": "150px", + "width": "384px" + }, + "toc_section_display": true, + 
"toc_window_display": false + }, + "varInspector": { + "cols": { + "lenName": 16, + "lenType": 16, + "lenVar": 40 + }, + "kernels_config": { + "python": { + "delete_cmd_postfix": "", + "delete_cmd_prefix": "del ", + "library": "var_list.py", + "varRefreshCmd": "print(var_dic_list())" + }, + "r": { + "delete_cmd_postfix": ") ", + "delete_cmd_prefix": "rm(", + "library": "var_list.r", + "varRefreshCmd": "cat(var_dic_list()) " + } + }, + "types_to_exclude": [ + "module", + "function", + "builtin_function_or_method", + "instance", + "_Feature" + ], + "window_display": false + }, + "version": 3 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/slides/Day2_PandasSeries.ipynb.license b/slides/Day2_PandasSeries.ipynb.license new file mode 100644 index 0000000000000000000000000000000000000000..c207ab8c094a9d18d7c6cb5c9dfbf8913df4aa8a --- /dev/null +++ b/slides/Day2_PandasSeries.ipynb.license @@ -0,0 +1,4 @@ +SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> + +SPDX-License-Identifier: MIT diff --git a/slides/utils.py b/slides/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b9207ddc54dbb6d7c07b8adc10337bdf26b90c0a --- /dev/null +++ b/slides/utils.py @@ -0,0 +1,53 @@ +# SPDX-FileCopyrightText: © 2021 HPC Core Facility of the Justus-Liebig-University Giessen <philipp.e.risius@theo.physik.uni-giessen.de>,<marcel.giar@physik.jlug.de> +# SPDX-FileCopyrightText: © 2022 Competence Center for High Performance Computing in Hessen (HKHLR) <tim.jammer@hpc-hessen.de>, <marcel.giar@hpc-hessen.de> +# +# SPDX-License-Identifier: MIT + +import urllib.request +from os import makedirs, path +from pathlib import Path + +import pandas as pd + + +def
download_IRIS(url="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/"): + datafile = 'iris.data' + namesfile = 'iris.names' + + output_path = Path('tmp') + output_datafile = output_path / "iris-data.csv" + + makedirs(output_path, exist_ok=True) + + column_names = ["sepal length", "sepal width", 'petal length', 'petal width', "Name"] + if not path.exists(output_datafile): + print(f"Will be downloading Iris dataset...") + with urllib.request.urlopen(url + datafile) as response, open(output_datafile, "w", encoding="utf-8") as out_file: + data = response.read() + out_file.write(",".join(column_names) + "\n") + out_file.write(data.decode('utf-8')) + else: + print(f"No need to download Iris dataset. Data is already present in {output_datafile}.") + + df = pd.read_csv(output_datafile, delimiter=',') + + return df + +def download_IRIS_with_addons(url="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/", + delimiter=None, datafile = 'iris.data', namesfile = 'iris.names'): + output_path = Path("tmp_with_addons") + output_datafile = output_path / "iris-data.csv" + makedirs(output_path, exist_ok=True) + + column_names = ["sepal length", "sepal width", 'petal length', 'petal width', "Name"] + with urllib.request.urlopen(url + datafile) as response, open(output_datafile, "w", encoding="utf-8") as out_file: + data = response.read() + for cname in column_names[:-1]: + out_file.write(f"# {cname} is in [cm]\n") # We use the '#' symbols for comments. + out_file.write("# Species:\n# - Iris Setosa\n# - Iris Versicolour\n# - Iris Virginica\n") + if delimiter is None: + out_file.write(",".join(column_names) + "\n") + out_file.write(data.decode('utf-8')) + else: + out_file.write(f"{delimiter}".join(column_names) + "\n") + out_file.write(data.decode("utf-8").replace(",", delimiter)) \ No newline at end of file