updated the slides to also include solutions for the last exercises

22067e49 · Jammer, Tim · 9bd35732 · 22067e49
Commit 22067e49 authored Sep 13, 2022 by Jammer, Tim
--- a/slides/Day2_PandasDataFrames.ipynb
+++ b/slides/Day2_PandasDataFrames.ipynb
@@ -2213,7 +2213,11 @@
   "id": "53d45569",
   "metadata": {},
   "outputs": [],
-   "source": []
+   "source": [
+    "cols=[\"sepal length\",\"sepal width\",\"petal length\",\"petal width\"]\n",
+    "df_agg= df.loc[:,cols].agg([np.min, np.max, np.mean, np.std])\n",
+    "df_agg"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -2229,7 +2233,10 @@
   "id": "52a96d54",
   "metadata": {},
   "outputs": [],
-   "source": []
+   "source": [
+    "df.loc[:,cols] =((df.loc[:,cols]  - df_agg.loc['mean',:] )/ df_agg.loc['std',:])\n",
+    "df.describe()"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -2263,7 +2270,9 @@
   "id": "801dd946",
   "metadata": {},
   "outputs": [],
-   "source": []
+   "source": [
+    "grouped_by_species = df.groupby(by=[\"Name\"])"
+   ]
  },
  {
   "cell_type": "code",
@@ -2271,7 +2280,12 @@
   "id": "90b8c49e",
   "metadata": {},
   "outputs": [],
-   "source": []
+   "source": [
+    "fig, axs = plt.subplots(1,3)\n",
+    "for ax, (name, group_data) in zip(axs,grouped_by_species):    \n",
+    "    group_data.boxplot(ax=ax)\n",
+    "    ax.set_title(name)"
+   ]
  },
  {
   "cell_type": "code",
@@ -2285,7 +2299,7 @@
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },

 %% Cell type:markdown id:4d6423b0 tags:

 # HiPerCH 14 Module 1:  Introduction to Python Data Processing tools

 %% Cell type:markdown id:dbd2f680 tags:

 # Pandas `DataFrame`s

 %% Cell type:code id:7f4829e5 tags:

 ``` python
 %matplotlib inline

 from matplotlib import pyplot as plt

 import numpy as np
 import pandas as pd

 f"Numpy version: {np.__version__}; Pandas version: {pd.__version__}"

 import importlib
 import utils
 importlib.reload(utils)
 ```

 %% Cell type:markdown id:e1376473 tags:

 # `DataFrame` Objects

 The `pd.DataFrame` class provides a data structure to handle 2-dimensional tabular data. `DataFrame`  objects are *size-mutable* and can contain mixed datatypes (e.g. `float`, `int` or `str`). All data columns inside a `DataFrame` share the same `index`.

 %% Cell type:markdown id:857d1d3c tags:

 ## Creating `DataFrame`s

 %% Cell type:code id:62875512 tags:

 ``` python
 name = ["person 1", "person 2", "person 3"]
 age = [23, 27, 34]
 ```

 %% Cell type:code id:e2f65441 tags:

 ``` python
 # Create nested list and pass column names
 df = pd.DataFrame(data=zip(name, age), columns=["Name", "Age"])
 df # This gives a nicely formatted output. When using the `print` function the output looks different.
 ```

 %% Cell type:code id:d6e0313e tags:

 ``` python
 # The same can be achieved by using a `dict`
 df =  pd.DataFrame(data={"Name": name, "Age": age})
 df
 ```

 %% Cell type:markdown id:b81aae00 tags:

 It is also possible to create `DataFrame`s from `Series` objects.

 %% Cell type:code id:5b8385cc tags:

 ``` python
 math_grades = pd.Series({
    'student1': 15,
    'student2': 11,
    'student3': 9,
    'student4': 13,
    'student5': 12,
    'student6': 7,
    'student7': 14
 })
 chemistry_grades = pd.Series({
    'student1': 10,
    'student2': 14,
    'student3': 12,
    'student4': 8,
    'student5': 11,
    'student6': 10,
    'student7': 12,
    "student8": 5  # <-- note the additional entry here
 })
 ```

 %% Cell type:code id:1234ca1c tags:

 ``` python
 df = pd.DataFrame(data={"Math Grades": math_grades, "Chemistry Grades": chemistry_grades})
 ```

 %% Cell type:markdown id:5ae40e4e tags:


 Series objects are *matched by index* and missing values are replaced with a default value.

 %% Cell type:code id:2222bd8e tags:

 ``` python
 df # default value is `NaN`
 ```

 %% Cell type:markdown id:a220919f tags:

 ## Exercises (optional)

 %% Cell type:markdown id:552b4899 tags:

 * Given the two iterables `values1` and `values2`, create a `pd.DataFrame` containing both in two different ways. Label the columns `'label1'` and `'label2'`.

 %% Cell type:code id:267a91a5 tags:

 ``` python
 values1 = np.random.randint(-10, 10, 5)
 values2 = range(5)
 ```

 %% Cell type:code id:ffbf28d0 tags:

 ``` python
 df_iterables = pd.DataFrame(data=zip(values1, values2), columns=["label1", "label2"])
 df_iterables
 ```

 %% Cell type:markdown id:00fc0dd4 tags:


 * Combine the two `pd.Series` named `series1` and `series2` to a `pd.DataFrame`. Label the columns `'col1'` and `'col2'`.
    * Replace missing values with `0`.
    * Remove rows that contain `NaN` values.

 %% Cell type:code id:e0881ae3 tags:

 ``` python
 series1 = pd.Series(data=range(5),
                    index=[f"{idx}" for idx in range(5)])
 series2 = pd.Series(data=range(0, 10, 2),
                    index=[f"{idx}" for idx in range(0, 10, 2)])
 ```

 %% Cell type:code id:de1b4fbe tags:

 ``` python
 df_from_series = pd.DataFrame({"col1": series1, "col2": series2})
 df_from_series
 ```

 %% Cell type:code id:1f76d67e tags:

 ``` python
 # df_from_series.replace(np.nan, 0)
 # df_from_series.dropna()
 ```

 %% Cell type:markdown id:86337fd7 tags:

 ## What characterises a `DataFrame`?

 %% Cell type:code id:6fd61f32 tags:

 ``` python
 df = pd.DataFrame(data={"Math Grades": math_grades, "Chemistry Grades": chemistry_grades})
 ```

 %% Cell type:markdown id:6f08fae8 tags:

 How many rows and columns are contained in the `DataFrame`? We have seen this attribute when dealing with `ndarray`s ...

 %% Cell type:code id:c9fc5896 tags:

 ``` python
 df.shape
 ```

 %% Cell type:code id:a12c423b tags:

 ``` python
 # Detailed information on the data contained inside the `DataFrame`.
 df.info()
 ```

 %% Cell type:markdown id:7f4f886e tags:

 `DataFrame`s are essentially composed of 3 components. These components can be accessed with specific data attributes.

 - Index (`df.index`)
 - Columns (`df.columns`)
 - Body (`df.values`)
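
As a minimal sketch of how the three components fit together, the following takes a small `DataFrame` apart and rebuilds it from its index, columns and body. The toy data and the names `df_demo` and `df_rebuilt` are made up for illustration and are not part of the course material.

```python
import pandas as pd

# Hypothetical toy data, not part of the course material.
df_demo = pd.DataFrame(
    data={"Math Grades": [15, 11], "Chemistry Grades": [10, 14]},
    index=["student1", "student2"],
)

# Take the DataFrame apart into its three components ...
index, columns, body = df_demo.index, df_demo.columns, df_demo.values

# ... and put it back together again.
df_rebuilt = pd.DataFrame(data=body, index=index, columns=columns)
print(df_rebuilt.equals(df_demo))
```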

 %% Cell type:code id:e6ed9ab7 tags:

 ``` python
 df.index
 ```

 %% Cell type:code id:b0afa504 tags:

 ``` python
 df.columns
 ```

 %% Cell type:code id:7ade6b93 tags:

 ``` python
 df.values
 ```

 %% Cell type:markdown id:de4be0fb tags:

 ## Data indexing and selection

 %% Cell type:markdown id:fa82625d tags:

 ### The Iris flower dataset

 <a title="w:ru:Денис Анисимов (talk | contribs), Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Irissetosa1.jpg"><img width="512" alt="Irissetosa1" src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Irissetosa1.jpg/512px-Irissetosa1.jpg"></a>

 Image taken from: <a href="https://commons.wikimedia.org/wiki/File:Irissetosa1.jpg">w:ru:Денис Анисимов (talk | contribs)</a>, Public domain, via Wikimedia Commons

 Attribution for dataset: *Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.*

 %% Cell type:markdown id:673472df tags:

 The dataset contains measurements of four "features" related to the species of Iris flowers:
 * Petal length ("Bluetenblattlaenge")
 * Petal width ("Bluetenblattbreite")
 * Sepal length ("Kelchblattlaenge")
 * Sepal width ("Kelchblattbreite")

 The species contained in the dataset are:

 * Iris setosa
 * Iris virginica
 * Iris versicolor

 %% Cell type:code id:f95709e5 tags:

 ``` python
 df = utils.download_IRIS()
 ```

 %% Cell type:code id:30c97a17 tags:

 ``` python
 # Quick check if data looks alright
 # petal - Bluetenblatt
 # sepal - Kelchblatt
 df.head()
 # df.tail()
 ```

 %% Cell type:code id:76e5852d tags:

 ``` python
 df.columns
 ```

 %% Cell type:code id:285fad0e tags:

 ``` python
 # Column access with the `[]` operator.
 df["Name"]
 ```

 %% Cell type:code id:0c7fc248 tags:

 ``` python
 # The columns of a DataFrame are `Series` objects.
 type(df["Name"])
 ```

 %% Cell type:code id:87123e39 tags:

 ``` python
 data_columns = [cname for cname in df.columns if cname != "Name"]
 data_columns
 ```

 %% Cell type:code id:32a00007 tags:

 ``` python
 df[data_columns]
 ```

 %% Cell type:markdown id:ce8a4f9e tags:

 As with `Series` objects, the `loc` and `iloc` indexers are also available for `DataFrame`s.
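
The difference between the two is easiest to see with an index that does not start at zero. The sketch below uses a made-up `DataFrame` (`df_demo`) rather than the Iris data: `loc` looks up *labels*, `iloc` looks up *positions*.

```python
import pandas as pd

# Hypothetical toy data with a non-default integer index.
df_demo = pd.DataFrame({"value": [1.5, 2.5, 3.5]}, index=[10, 20, 30])

by_label = df_demo.loc[10, "value"]   # row with label 10
by_position = df_demo.iloc[0, 0]      # first row, first column

# Slices behave differently: `loc` includes the end label,
# `iloc` excludes the end position.
label_slice = df_demo.loc[10:20]
position_slice = df_demo.iloc[0:1]
print(len(label_slice), len(position_slice))
```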

 %% Cell type:code id:146337e7 tags:

 ``` python
 # Remember that when using the `loc` method the argument passed to the `[]` operator must be present in `df.index`.
 df.loc[0]
 ```

 %% Cell type:code id:7bac9da5 tags:

 ``` python
 # We can also use slicing with the `loc` method.
 df.loc[0::50].head()
 ```

 %% Cell type:code id:75a449ca tags:

 ``` python
 # Fancy indexing is also possible.
 df.loc[[0, 50, 100]]
 ```

 %% Cell type:code id:7dcc63b4 tags:

 ``` python
 # We can combine row and column access with the `loc` method.
 df.loc[:, ['sepal width', 'sepal length']].head()
 ```

 %% Cell type:code id:ab98b434 tags:

 ``` python
 # Rows can also be selected with boolean masks.
 mask = (df["Name"] == "Iris-setosa")
 ```

 %% Cell type:code id:87024f21 tags:

 ``` python
 df.loc[mask].head()
 ```

 %% Cell type:code id:9eb788b5 tags:

 ``` python
 # More complicated boolean masks can be conceived
 mask = (df["sepal length"] > 6.0) & (df["petal length"] > 1.0) # use () for each boolean sub-expression
 df.loc[mask]
 ```

 %% Cell type:markdown id:2b97d718 tags:

 ## Exercises (optional)

 %% Cell type:markdown id:b1d0491f tags:

 * Change all column names to uppercase, e.g.
    * "petal length" $\to$ "PETAL LENGTH"

 %% Cell type:code id:86cadb9b tags:

 ``` python
 ```

 %% Cell type:markdown id:77c2b1fc tags:

 * From the `"sepal length"` column retrieve all values that are `> 6` but `< 7`! How often does each of the resulting values occur in this column? (*Hint*: Refer to the [`DataFrame` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for a method to count values.)

 %% Cell type:code id:905540ad tags:

 ``` python
 ```

 %% Cell type:markdown id:a71c5cd3 tags:

 * In the DataFrame `df`, *simultaneously* access the columns `"sepal length"`, `"petal width"`, and `"Name"` in two different ways.

 %% Cell type:code id:4a346e36 tags:

 ``` python
 ```

 %% Cell type:code id:e1b3f2b9 tags:

 ``` python
 ```

 %% Cell type:markdown id:f5522a5c tags:

 * Compare the following two ways of replacing data in a DataFrame. Do they both work? Why?

 %% Cell type:code id:c85a31fe tags:

 ``` python
 ```

 %% Cell type:code id:50f405da tags:

 ``` python
 ```

 %% Cell type:markdown id:eb2bcfe0 tags:

 * Determine the indices in the `DataFrame` that correspond to rows that contain data on the Iris setosa species.
 * Use indices to delete the corresponding rows from the `DataFrame`.

 %% Cell type:code id:8cc5891d tags:

 ``` python
 ```

 %% Cell type:markdown id:69acd4c9 tags:

 * Sort the rows of the `DataFrame` by the values contained in the columns `"petal length"` *and* `"petal width"`.

 %% Cell type:code id:4b240056 tags:

 ``` python
 ```

 %% Cell type:markdown id:e852cb23 tags:

 # Reading data into a `DataFrame`

 %% Cell type:markdown id:1c151460 tags:

 Pandas can import several common file formats:

 - `pd.read_csv`: Read in CSV spreadsheets (`.csv` suffix)
 - `pd.read_excel`: Read in MS Office spreadsheets (`.xls` and `.xlsx` suffix)
 - `pd.read_stata`: Read stata datasets (`.dta` suffix)
 - `pd.read_hdf`: Read HDF datasets (`.hdf` suffix)
 - `pd.read_sql`: Read from SQL database

 Other file formats are [supported](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) as well.
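
As a small, self-contained sketch of `pd.read_csv`, the following parses CSV text from an in-memory buffer instead of a file on disk. The file content is invented for illustration; `io.StringIO` merely stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical file content; an in-memory buffer stands in for a file on disk.
csv_text = """# measurements in cm
Name;sepal length;sepal width
Iris-setosa;5.1;3.5
Iris-virginica;6.3;3.3
"""

# Lines starting with the comment symbol are skipped, fields are split at ";".
df_demo = pd.read_csv(io.StringIO(csv_text), delimiter=";", comment="#")
print(df_demo.dtypes)
```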

 %% Cell type:markdown id:2ec29c1d tags:

 ## Reading CSV files

 %% Cell type:code id:002a168c tags:

 ``` python
 # Download the files and write to CSV file.
 from pathlib import Path
 importlib.reload(utils)
 utils.download_IRIS_with_addons(delimiter=";")
 ```

 %% Cell type:code id:bdb4b736 tags:

 ``` python
 # Inspect the file content. This command will only work on a UNIX-like operating system.
 ! head -n 15 tmp_with_addons/iris-data.csv | nl
 ```

 %% Cell type:code id:bcadd113 tags:

 ``` python
 # Read the file with Pandas and specify the delimiter symbol as well as the comment symbol.
 df = pd.read_csv(Path("tmp_with_addons") / "iris-data.csv", delimiter=";", comment='#')
 df.head()
 ```

 %% Cell type:code id:7e4cb505 tags:

 ``` python
 # We can limit the number of imported columns by specifying those that we explicitly want to have.
 df = pd.read_csv(Path("tmp_with_addons") / "iris-data.csv",
                 delimiter=";",
                 comment="#",
                 usecols=["Name", "sepal length", "sepal width"])
 df.head()
 ```

 %% Cell type:code id:62801f60 tags:

 ``` python
 # When importing data we can specify which data column should become the index in the `DataFrame`.
 df =  pd.read_csv(Path("tmp_with_addons") / "iris-data.csv", delimiter=";",
                  comment="#", index_col="Name")
 df.head()
 ```

 %% Cell type:code id:d9cf40bc tags:

 ``` python
 df_tmp1 = df.copy(deep=True)
 df_tmp2 = df.copy(deep=True)
 ```

 %% Cell type:markdown id:53834587 tags:

 Oftentimes, when invoking a method of a `DataFrame` object, a *new* `DataFrame` instance is returned. This means that new memory is allocated, which can be quite time-consuming and a waste of precious memory resources.
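
A minimal sketch of this behaviour, using a made-up two-row `DataFrame`: `reset_index()` hands back a new object and leaves the one it was called on untouched.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({"Name": ["a", "b"], "Age": [23, 27]}).set_index("Name")

df_reset = df_demo.reset_index()   # returns a new DataFrame

print(df_reset is df_demo)         # two distinct objects
print(list(df_demo.columns))       # the original still uses "Name" as index
print(list(df_reset.columns))      # the copy has "Name" as a regular column
```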

 %% Cell type:markdown id:c79088e2 tags:

 Reset the index of the current `DataFrame`. This is done *out-of-place* and a new instance is returned.

 %% Cell type:code id:7c17dbdd tags:

 ``` python
 df_tmp1.reset_index().set_index("sepal length").head()
 ```

 %% Cell type:markdown id:d8f6726f tags:

 We can use the `inplace` argument to modify the current instance itself.

 %% Cell type:code id:b244ae21 tags:

 ``` python
 # We can use the `inplace` argument to modify the object itself.
 df_tmp2.reset_index(inplace=True)
 df_tmp2.set_index("sepal length", inplace=True)
 df_tmp2.head()
 ```

 %% Cell type:markdown id:125a396b tags:

 # Operations with `DataFrame`s

 %% Cell type:markdown id:4446f2d6 tags:

 ## Arithmetic operations

 %% Cell type:markdown id:5a6a944b tags:

 Mapping between Python arithmetic operators and `DataFrame` methods.

 | Python operator | Pandas methods                   |
 |:---------------:|----------------------------------|
 |       `+`       | `add()`                          |
 |       `-`       | `sub()`, `subtract()`            |
 |       `*`       | `mul()`, `multiply()`            |
 |       `/`       | `truediv()`, `div()`, `divide()` |
 |       `//`      | `floordiv()`                     |
 |       `%`       | `mod()`                          |
 |       `**`      | `pow()`                          |
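
To illustrate the table, the sketch below checks that the operator and the method spelling give the same result, and shows that only the method form accepts extra arguments such as `fill_value`. The frames `left` and `right` are made-up toy data.

```python
import pandas as pd

# Hypothetical toy data with partially overlapping columns.
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [10, 20], "C": [30, 40]})

via_operator = left + right
via_method = left.add(right)

# Both spellings align on index and columns and give the same result.
print(via_operator.equals(via_method))

# Only the method form accepts extra arguments such as `fill_value`.
filled = left.add(right, fill_value=0)
print(filled)
```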

 %% Cell type:code id:74e113b4 tags:

 ``` python
 A = pd.DataFrame(np.random.randint(0, 20, (3, 2)), columns=list("AB"))
 B = pd.DataFrame(np.random.randint(0, 20, (3, 3)), columns=list("BAC"))
 ```

 %% Cell type:code id:a9bb50e8 tags:

 ``` python
 # Indices of all DataFrames involved in the operation are aligned. The order of each index is irrelevant.
 # Data columns not shared by the DataFrames will be filled with `NaN`.
 A + B
 ```

 %% Cell type:code id:dc290da1 tags:

 ``` python
 # Use the `add` method to specify the `fill_value`. It stands in for entries that are missing in only one of the
 # DataFrames (e.g. a *missing* column) and is then used in the arithmetic operation. It must be numeric, not a string.
 # >>> Choose wisely when using the `fill_value` argument <<<
 A.add(B, fill_value=-1000)
 ```

 %% Cell type:markdown id:7d5fccee tags:

 NumPy broadcasting rules apply for `DataFrame`s as well.

 %% Cell type:code id:dfd28894 tags:

 ``` python
 df = pd.DataFrame(np.random.randint(10, size=(3, 4)), columns=list("wxyz"))
 df
 ```

 %% Cell type:code id:9d8b8e06 tags:

 ``` python
 # Subtract a row.
 df - df.loc[0]
 ```

 %% Cell type:code id:9c0ecfb7 tags:

 ``` python
 # Call the appropriate method if you want to operate on the columns. We operate along axis=0 (the rows).
 df.sub(df["x"], axis=0)
 ```

 %% Cell type:markdown id:8da5ad3d tags:

 `DataFrame`s can be fed to NumPy `ufunc`s.

 %% Cell type:code id:55d00536 tags:

 ``` python
 np.exp(df)
 ```

 %% Cell type:markdown id:60dfcb8b tags:

 New columns can be added with arithmetic operations.

 %% Cell type:code id:29bb3975 tags:

 ``` python
 df["asdf"] = np.sin( df["x"] + df["y"] )
 df
 ```

 %% Cell type:markdown id:b1381d02 tags:

 ## Methods for operating on `DataFrame`s

 %% Cell type:markdown id:aa5758e3 tags:

 Pandas `DataFrame` and `Series` objects have several built-in methods to operate on the data.

 - `apply()`: available for *both* `Series` and `DataFrame` objects
 - `transform()`: available for *both* `Series` and `DataFrame` objects
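 - `applymap()`: *only* available for `DataFrame` objects (deprecated since pandas 2.1 in favour of `DataFrame.map()`)
 - `map()`: available for `Series` objects (and, since pandas 2.1, element-wise for `DataFrame` objects)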
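
As a rough side-by-side sketch on made-up data (`df_demo`), the following applies one method of each kind. It only uses `apply()`, `transform()` and `Series.map()`, which behave the same way across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({"a": [1.0, 4.0, 9.0], "b": [16.0, 25.0, 36.0]})

# apply: `func` sees one column at a time; here it reduces each column to a scalar.
col_max = df_demo.apply(lambda col: col.max())

# transform: the result has the same shape as the input.
roots = df_demo.transform(np.sqrt)

# map: element-wise operation on a single Series.
labels = df_demo["a"].map(lambda v: f"{v:.0f} cm")
```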

 %% Cell type:code id:0545a722 tags:

 ``` python
 df = utils.download_IRIS()
 df.head()
 ```

 %% Cell type:code id:a0b8e197 tags:

 ``` python
 # Get a subset of columns by using regular expressions
 data_columns = df.columns[df.columns.str.match('^(petal|sepal).*(width|length)$')]
 data_columns
 ```

 %% Cell type:markdown id:c4ca8241 tags:

 ### [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

 ```python
 DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)
 ```
 - *applies* a function (callable) along an `axis` of the `DataFrame`
    - `axis=0`: `func` is applied to each column (a `Series` object). This is the default!
    - `axis=1`: `func` is applied to each row
 - return type is inferred from `func`

 %% Cell type:markdown id:ec3f1484 tags:

 The return type of `func` determines the form of the result.

 `func` can operate on `Series` objects and perform operations that are supported by these types of objects (e.g. by means of the methods `.min()`, `.max()` or `.mean()`).
 - result can be a scalar value (e.g. `.sum()` which is an aggregation operation)
 - result can be another `Series` object
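
The two cases can be sketched on a small, made-up `DataFrame` (`df_demo`); the lambdas are illustrative stand-ins for `func`.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# `func` returns a scalar per column -> the result is a Series indexed by column label.
reduced = df_demo.apply(lambda col: col.sum())

# `func` returns a Series per column -> the result is a DataFrame.
centred = df_demo.apply(lambda col: col - col.mean())

print(type(reduced).__name__, type(centred).__name__)
```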

 %% Cell type:markdown id:440dfc49 tags:

 Compute the mean value of each column (this is the default because we do not specify the `axis` argument).

 %% Cell type:code id:f412f504 tags:

 ``` python
 mean_values = df[data_columns].apply(lambda x: x.mean())
 mean_values # This returns a `Series` object because x.mean() returns a scalar value.
 ```

 %% Cell type:markdown id:511add01 tags:

 ### Question

 What does the result look like if we operate along the rows of the `DataFrame`? This is achieved by using the argument `axis=1`. What is the shape of the resulting object?

 %% Cell type:markdown id:ef03fc24 tags:

 Now we transform the values in the columns of the `DataFrame`. We define a function that will operate on the `Series` objects that form the columns.

 The object resulting from this operation is another `DataFrame` instance.

 %% Cell type:code id:58bdf8e5 tags:

 ``` python
 def scale_to_mm(s):
    return s * 10

 df_scaled_to_mm = df[data_columns].apply(scale_to_mm) # This will return a new DataFrame
 df_scaled_to_mm["Name"] = df["Name"]
 df_scaled_to_mm.head()
 ```

 %% Cell type:markdown id:233adbba tags:

 ### Question

 How must the above command be changed if we want to operate along the rows of the `DataFrame` instead? Does this also work with the already-defined function or do we have to define a dedicated function?

 %% Cell type:code id:5dd57a2b tags:

 ``` python
 df[data_columns].apply(scale_to_mm, axis=1).head()
 ```

 %% Cell type:markdown id:371d521f tags:

 ### Experimenting with the `apply()` method

 %% Cell type:markdown id:afb7a126 tags:

 Let's generate a large `DataFrame`. We wish to operate on the data with the `apply` method. We can do this in two different ways:
 - Operate along the rows (`axis=1`)
 - Operate along the columns (`axis=0`)

 %% Cell type:code id:355d3698 tags:

 ``` python
 N_rows, N_cols = 10_000, 500
 data = pd.DataFrame(np.random.random((N_rows, N_cols)), columns=[f"col{idx}" for idx in range(N_cols)])
 ```

 %% Cell type:markdown id:2d78baa7 tags:

 ### Question

 What do you think is faster: Operating along the columns or operating along the rows?

 When you have made your decision try to come up with a reason!

 %% Cell type:code id:368b5567 tags:

 ``` python
 %timeit data.apply(lambda x: x ** 2, axis=0) # operate along columns
 %timeit data.apply(lambda x: x ** 2, axis=1) # operate along rows
 ```

 %% Cell type:markdown id:524759b7 tags:

 The `apply` method wants to operate on `Series` objects. The columns of a `DataFrame` are `Series`. Inside each `Series` data is stored contiguously in memory. Hence operating on the columns is *fast*.

 When operating row-wise, a new `Series` object must be generated for *each* row. A buffer must be allocated in memory and the data needs to be copied to that buffer before the `apply` method can operate on it. Since these steps are repeated for each row, this procedure is generally *slower* than operating along the columns.

 %% Cell type:markdown id:cb2244cb tags:

 ### Task (optional)

 The names of the Iris species are contained in the column with heading `"Name"`. The names follow the pattern:

 ```
 Iris-<identifier for species>
 ```

 Remove the dash `-` from the names and just keep the identifier for each species. Use the `apply` method.

 %% Cell type:code id:79f7a36a tags:

 ``` python
 ```

 %% Cell type:markdown id:3f112be8 tags:

 ### `transform()`

 ```python
 DataFrame.transform(func, axis=0, *args, **kwargs)
 ```

 `func` can either be
 - callable, e.g. `np.exp`
 - list-like, e.g. `[np.sin, np.cos]`
 - dict-like, e.g. `{"sepal length": np.sin,  "petal length": np.cos}`. Application is limited to the column names passed as keys to the `dict`.
 - string, e.g. `"sqrt"`

 *Note*: This function *transforms*, i.e., when the input value is a `Series`, another (transformed) `Series` of the same length is returned. Returning a scalar value is not valid (the resulting error message will be: `ValueError: Function did not transform`).
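
A short sketch of these rules on made-up data (`df_demo`): a callable and a `dict` are accepted, while a reducing function is rejected with a `ValueError`. The flag `reduction_rejected` is only an illustrative helper.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({"a": [1.0, 4.0, 9.0], "b": [16.0, 25.0, 36.0]})

# A callable that maps each column to a column of the same length is fine.
roots = df_demo.transform(np.sqrt)

# A dict restricts the transformation to the columns given as keys.
only_a = df_demo.transform({"a": np.sqrt})

# A reducing function returns one scalar per column and is rejected.
try:
    df_demo.transform(lambda col: col.mean())
    reduction_rejected = False
except ValueError:
    reduction_rejected = True
print(reduction_rejected)
```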

 %% Cell type:code id:4848e300 tags:

 ``` python
 df[data_columns].transform({"sepal length": np.cos, "petal length": np.sin}).head()
 ```

 %% Cell type:markdown id:baccb2f2 tags:

 ### Task (optional)

 Convert the measured values (which are all given in cm units) to mm units by using the `transform` method.

 %% Cell type:code id:d19379c2 tags:

 ``` python
 ```

 %% Cell type:markdown id:c0910d4c tags:

 ### Performance considerations

 %% Cell type:markdown id:b7d5536e tags:

 When operating on columns of a `DataFrame` or a `DataFrame` *as a whole* it is oftentimes faster to use vectorised operations instead of column-/row-wise operations.

 %% Cell type:code id:d8432d37 tags:

 ``` python
 df = pd.DataFrame(np.random.randn(1_000_000, 3), columns=list("abc"))
 ```

 %% Cell type:code id:75e45ac5 tags:

 ``` python
 %timeit df.apply(lambda x: x ** 2, axis=0)
 %timeit df ** 2
 %timeit (df.values ** 2) # here we operate on the underlying `ndarray`
 ```

 %% Cell type:markdown id:dd5b8a17 tags:

 ### `assign`

 %% Cell type:markdown id:2cd2aaa0 tags:

 The `assign` method adds a new column to a `DataFrame`. It is called on an existing `DataFrame` and returns a new `DataFrame` (that has all columns of the original `DataFrame`) with the new column added.

 * Allows adding a single column as well as multiple columns per call.
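
A minimal sketch on made-up data: `assign` accepts both callables (evaluated on the `DataFrame` it is called on) and precomputed values, and it leaves the original object unchanged. The column names `area` and `length_mm` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({"length": [1.0, 2.0, 3.0], "width": [0.5, 1.0, 1.5]})

df_new = df_demo.assign(
    area=lambda d: d["length"] * d["width"],  # callable, evaluated on the DataFrame
    length_mm=df_demo["length"] * 10,         # precomputed Series
)

print(list(df_demo.columns))  # the original is unchanged
print(list(df_new.columns))   # the new columns are appended
```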

 %% Cell type:code id:527d0daf tags:

 ``` python
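 df = utils.download_IRIS() # Reload the Iris data; `df` was overwritten in the performance section above.
 df_mean = df[data_columns].mean()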
 df.assign(
    petal_length_dev_from_mean=lambda x: x["petal length"] - df_mean["petal length"],
    petal_width_dev_from_mean=lambda x: x["petal width"] - df_mean["petal width"],
    sepal_length_dev_from_mean=lambda x: x["sepal length"] - df_mean["sepal length"],
    sepal_width_dev_from_mean=lambda x: x["sepal width"] - df_mean["sepal width"]
 )
 ```

 %% Cell type:markdown id:f470a4a6 tags:

 ## Grouping data

 %% Cell type:markdown id:ab7c9102 tags:

 ### Properties of `GroupBy` objects

 %% Cell type:markdown id:e360d0f1 tags:

 - Oftentimes items in a dataset can be grouped in a certain manner (e.g., if a column contains a value multiple times). The Iris dataset, for instance, can be grouped according to the species of each flower.

    ```python
    my_dataframe.groupby(by=["<column label>"])
    ```
 - The `DataFrame` is split and entries are grouped according to the values in the column with `"<column label>"`. Once the data has been grouped, operations can be conducted on the items of each group.

 *Note*: `DataFrame`s can be [grouped](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) not only according to the entries of a column, but also, for example, by a function or a mapping applied to the index.

 %% Cell type:markdown id:99fce3ac tags:

 The return type of `groupby()` is *not* another `DataFrame` but rather a `DataFrameGroupBy` object. We can imagine this object to be a grouping of multiple `DataFrame`s.

 It is important to understand that such an object essentially is a special *view* on the original `DataFrame`. No computations have been carried out when generating it (lazy evaluation).
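
The split-and-operate pattern can be sketched on a tiny, made-up dataset before turning to the Iris data. Creating `grouped` does not compute anything yet; the work happens when `size()` or `mean()` is called.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({
    "Name": ["setosa", "setosa", "virginica", "virginica"],
    "petal length": [1.4, 1.6, 6.0, 5.0],
})

grouped = df_demo.groupby(by="Name")          # no aggregation has happened yet

group_sizes = grouped.size()                  # number of rows per group
group_means = grouped["petal length"].mean()  # one mean value per group
print(group_means)
```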

 %% Cell type:code id:72c259e0 tags:

 ``` python
 df = utils.download_IRIS()
 ```

 %% Cell type:code id:00f1cbca tags:

 ``` python
 # We group the data according to the species of the flowers
 grouped_by_species = df.groupby(by=["Name"])
 ```

 %% Cell type:code id:acbfef47 tags:

 ``` python
 print(type(grouped_by_species))
 ```

 %% Cell type:markdown id:12e3ea0f tags:

 This data structure still knows about the `columns` that were present in the original `DataFrame`. We can use the `[<column-name>]` operation to access the columns with the corresponding label in each of the group members (subframes).

 %% Cell type:code id:74f7786d tags:

 ``` python
 grouped_by_species["sepal length"]
 ```

 %% Cell type:code id:ab6a8365 tags:

 ``` python
 # Pandas will access the corresponding column of all subframes and apply the functions passed to the `agg()` method.
 grouped_by_species["sepal length"].agg([np.min, np.max, np.mean])
 ```

 %% Cell type:markdown id:3e17a46e tags:

 We can iterate over the `DataFrameGroupBy` object. Each iteration yields the group name together with the corresponding subframe, which is a `DataFrame` (or a `Series` if a single column has been selected).

 %% Cell type:code id:a245fca9 tags:

 ``` python
 for (species, subframe) in grouped_by_species:
    print(f"Subframe for species {species} has shape {subframe.shape}")
 ```

 %% Cell type:code id:4d110f91 tags:

 ``` python
 # Call the getter to obtain a `DataFrame`.
 grouped_by_species.get_group("Iris-setosa").head()
 ```

 %% Cell type:markdown id:cf9faf2d tags:

 Methods that are not directly implemented for the `DataFrameGroupBy` object are passed to the subframes and executed on these.

 %% Cell type:code id:db2a75ca tags:

 ``` python
 # The `describe()` method can also be called on the full object but the output would be rather hard to view.
 grouped_by_species["sepal length"].describe() # The return type is a `DataFrame`
 ```

 %% Cell type:code id:2df3b968 tags:

 ``` python
 # Single methods are available as well. E.g. `mean()`, `std()` or `sum()`
 grouped_by_species.mean() # The return type is a `DataFrame`
 ```

 %% Cell type:markdown id:dafb2e66 tags:

 ### Operating on `GroupBy` objects

 %% Cell type:markdown id:9bbd870d tags:

 `DataFrameGroupBy` objects support `aggregate()`, `filter()`, `transform()` and `apply()` operations.

 These methods can be efficiently used to implement a great variety of operations on grouped data.

 %% Cell type:markdown id:03cd3096 tags:

 #### [`aggregate()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html) (or simply `agg()`)

 ```python
 DataFrameGroupBy.aggregate(func=None, *args, engine=None,
                           engine_kwargs=None, **kwargs)
 ```

 `func` can for example be ...
 - ... a function (Python callable),
 - ... a string specifying a function name (e.g. `"mean"`),
 - ... a list of functions or strings (e.g. `["std", np.mean]`),
 - ... a `dict` of column labels and functions to apply (e.g. `{'data1': np.mean}`).
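
The different forms of `func` can be sketched on made-up data as follows. The column names `data1` and `data2` follow the placeholder used in the list above; the sketch passes function names as strings throughout.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({
    "Name": ["setosa", "setosa", "virginica", "virginica"],
    "data1": [1.0, 3.0, 5.0, 7.0],
    "data2": [2.0, 2.0, 4.0, 8.0],
})
grouped = df_demo.groupby("Name")

# One function (given as string) for all columns.
by_string = grouped.agg("mean")

# Several functions -> hierarchical column labels.
by_list = grouped.agg(["min", "max"])

# Different functions for different columns.
by_dict = grouped.agg({"data1": "mean", "data2": "max"})
```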

 %% Cell type:code id:32798a44 tags:

 ``` python
 # Perform some common aggregations within each subframe. The output of this method is another `DataFrame`.
 group_agg = grouped_by_species.agg([np.min, np.max, np.mean, np.std])
 group_agg
 ```

 %% Cell type:code id:217e1c1f tags:

 ``` python
 # To understand this a bit better consider the following. Note that we limit the output to only one species.
 df.loc[df["Name"] == "Iris-setosa", df.columns[:-1]].agg(
    [np.min,
     np.max,
     np.mean,
     np.std]
 )
 ```

 %% Cell type:markdown id:cfe77e99 tags:

 The resulting output looks somewhat more complicated than what we are used to from `DataFrame`s so far. The column labels are now hierarchical because several aggregation functions are applied to each column of the grouped data.
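
Hierarchical column labels can be handled in a few simple ways. The sketch below uses made-up data (`df_demo`, `agg_demo`) and shows selecting by tuple, selecting by the first level only, and flattening the labels; the `"_"` separator is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({
    "Name": ["setosa", "setosa", "virginica", "virginica"],
    "sepal length": [5.0, 5.2, 6.4, 6.6],
})
agg_demo = df_demo.groupby("Name").agg(["min", "max"])

# A tuple (column label, function name) selects a single column.
sepal_max = agg_demo[("sepal length", "max")]

# Selecting only the first level returns a DataFrame with the remaining level as columns.
sepal_stats = agg_demo["sepal length"]

# Joining both levels yields flat, single-level column labels.
flat = agg_demo.copy()
flat.columns = ["_".join(labels) for labels in agg_demo.columns]
print(list(flat.columns))
```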

 %% Cell type:code id:9de104f9 tags:

 ``` python
 group_agg.columns # This is a so-called `MultiIndex`.
 ```

 %% Cell type:code id:4ef5258d tags:

 ``` python
 df
 ```

 %% Cell type:markdown id:9fe0ff59 tags:

 ## Exercises (optional)

 %% Cell type:markdown id:e1b12f15 tags:

 ### Task 1

 Consider the Iris dataset.

 * For each of the features compute the mean value as well as the standard deviation.
 * Center the values of a particular feature on the mean values and scale them to have unit variance.


 %% Cell type:code id:62978f94 tags:

 ``` python
 df = utils.download_IRIS()
 ```

 %% Cell type:markdown id:5b09bfcb tags:

 Let us first make a working copy of the `DataFrame` containing the data on the Iris dataset.

 %% Cell type:code id:15edb74b tags:

 ``` python
 df_tmp = df.copy()
 ```

 %% Cell type:markdown id:d7ba0a2a tags:

 Next, compute the mean value and the standard deviation for all features of the dataset. Computing these quantities does *not* take the particular species into account.

 %% Cell type:code id:53d45569 tags:

 ``` python
+cols=["sepal length","sepal width","petal length","petal width"]
+df_agg= df.loc[:,cols].agg([np.min, np.max, np.mean, np.std])
+df_agg
 ```

 %% Cell type:markdown id:ae660e8f tags:

 Now transform each of the features to be centred on the mean value and to have unit variance.
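
Written out, the transformation of the value $x_{ij}$ (row $i$, feature $j$) reads as follows. The symbols $\bar{x}_j$ and $s_j$ are introduced here only for illustration and denote the mean and the standard deviation of feature $j$; $s_j$ is assumed to be the sample standard deviation, which is what the pandas `std()` method computes by default (`ddof=1`).

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}
$$

After this step each feature has mean $0$ and standard deviation $1$.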

 %% Cell type:code id:52a96d54 tags:

 ``` python
+df.loc[:,cols] =((df.loc[:,cols]  - df_agg.loc['mean',:] )/ df_agg.loc['std',:])
+df.describe()
 ```

 %% Cell type:markdown id:f2b49c83 tags:

 ### Task 2

 Again consider the Iris dataset.

 * Group the measured values by the species.
 * Create boxplots for each species for all features.
    * Retrieve the names of the single groups from the `GroupBy` object.
    * Get the `DataFrame` for each of the groups from the `GroupBy` object and call the [`boxplot` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) to create the plot.
    * Use the names in the titles of the plot.

 %% Cell type:code id:7ae51b08 tags:

 ``` python
 df = utils.download_IRIS()
 ```

 %% Cell type:code id:801dd946 tags:

 ``` python
+grouped_by_species = df.groupby(by=["Name"])
 ```

 %% Cell type:code id:90b8c49e tags:

 ``` python
+fig, axs = plt.subplots(1,3)
+for ax, (name, group_data) in zip(axs,grouped_by_species):
+    group_data.boxplot(ax=ax)
+    ax.set_title(name)
 ```

 %% Cell type:code id:a031db1b tags:

 ``` python
 ```