Commit 22067e49 authored by Jammer, Tim

updated the slides to also include solutions for the last exercises

parent 9bd35732
%% Cell type:markdown id:4d6423b0 tags:
# HiPerCH 14 Module 1: Introduction to Python Data Processing tools
%% Cell type:markdown id:dbd2f680 tags:
# Pandas `DataFrame`s
%% Cell type:code id:7f4829e5 tags:
``` python
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
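import pandas as pd
# Print explicitly: a bare expression is only displayed when it is the last statement of a cell.
print(f"Numpy version: {np.__version__}; Pandas version: {pd.__version__}")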
import importlib
import utils
importlib.reload(utils)
```
%% Cell type:markdown id:e1376473 tags:
# `DataFrame` Objects
The `pd.DataFrame` class provides a data structure to handle 2-dimensional tabular data. `DataFrame` objects are *size-mutable* and can contain mixed datatypes (e.g. `float`, `int` or `str`). All data columns inside a `DataFrame` share the same `index`.
%% Cell type:markdown id:857d1d3c tags:
## Creating `DataFrame`s
%% Cell type:code id:62875512 tags:
``` python
name = ["person 1", "person 2", "person 3"]
age = [23, 27, 34]
```
%% Cell type:code id:e2f65441 tags:
``` python
# Create nested list and pass column names
df = pd.DataFrame(data=zip(name, age), columns=["Name", "Age"])
df # This gives a nicely formatted output. When using the `print` function the output looks different.
```
%% Cell type:code id:d6e0313e tags:
``` python
# The same can be achieved by using a `dict`
df = pd.DataFrame(data={"Name": name, "Age": age})
df
```
%% Cell type:markdown id:b81aae00 tags:
It is also possible to create `DataFrame`s from `Series` objects.
%% Cell type:code id:5b8385cc tags:
``` python
math_grades = pd.Series({
'student1': 15,
'student2': 11,
'student3': 9,
'student4': 13,
'student5': 12,
'student6': 7,
'student7': 14
})
chemistry_grades = pd.Series({
'student1': 10,
'student2': 14,
'student3': 12,
'student4': 8,
'student5': 11,
'student6': 10,
'student7': 12,
"student8": 5 # <-- note the additional entry here
})
```
%% Cell type:code id:1234ca1c tags:
``` python
df = pd.DataFrame(data={"Math Grades": math_grades, "Chemistry Grades": chemistry_grades})
```
%% Cell type:markdown id:5ae40e4e tags:
Series objects are *matched by index* and missing values are replaced with a default value.
%% Cell type:code id:2222bd8e tags:
``` python
df # default value is `NaN`
```
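The index matching described above can be made explicit with a minimal sketch. The two `Series` and the column names `left` and `right` are made up for illustration and are not part of the course material.

```python
import pandas as pd

# Two illustrative Series whose indices only partly overlap.
s1 = pd.Series({"a": 1, "b": 2, "c": 3})
s2 = pd.Series({"b": 20, "c": 30, "d": 40})

# Rows are matched by index label; a label missing in one Series yields NaN.
combined = pd.DataFrame({"left": s1, "right": s2})
print(combined)

# Two common ways of dealing with the missing entries afterwards.
filled = combined.fillna(0)    # replace NaN with 0
complete = combined.dropna()   # keep only rows without NaN
```

Note that a column which receives `NaN` is stored as `float`, even if the original `Series` contained integers.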
%% Cell type:markdown id:a220919f tags:
## Exercises (optional)
%% Cell type:markdown id:552b4899 tags:
* Given the two iterables `values1` and `values2`, create a `pd.DataFrame` containing both in two different ways. Label the columns `'label1'` and `'label2'`.
%% Cell type:code id:267a91a5 tags:
``` python
values1 = np.random.randint(-10, 10, 5)
values2 = range(5)
```
%% Cell type:code id:ffbf28d0 tags:
``` python
df_iterables = pd.DataFrame(data=zip(values1, values2), columns=["label1", "label2"])
df_iterables
```
%% Cell type:markdown id:00fc0dd4 tags:
* Combine the two `pd.Series` named `series1` and `series2` to a `pd.DataFrame`. Label the columns `'col1'` and `'col2'`.
* Replace missing values with `0`.
* Remove rows that contain `NaN` values.
%% Cell type:code id:e0881ae3 tags:
``` python
series1 = pd.Series(data=range(5),
index=[f"{idx}" for idx in range(5)])
series2 = pd.Series(data=range(0, 10, 2),
index=[f"{idx}" for idx in range(0, 10, 2)])
```
%% Cell type:code id:de1b4fbe tags:
``` python
df_from_series = pd.DataFrame({"col1": series1, "col2": series2})
df_from_series
```
%% Cell type:code id:1f76d67e tags:
``` python
# df_from_series.replace(np.nan, 0)  # `np.NaN` was removed in NumPy 2.0; `df_from_series.fillna(0)` also works
# df_from_series.dropna()
```
%% Cell type:markdown id:86337fd7 tags:
## What characterises a `DataFrame`?
%% Cell type:code id:6fd61f32 tags:
``` python
df = pd.DataFrame(data={"Math Grades": math_grades, "Chemistry Grades": chemistry_grades})
```
%% Cell type:markdown id:6f08fae8 tags:
How many rows and columns are contained in the `DataFrame`? The `shape` attribute, which we have already seen when dealing with `ndarray`s, tells us.
%% Cell type:code id:c9fc5896 tags:
``` python
df.shape
```
%% Cell type:code id:a12c423b tags:
``` python
# Detailed information on the data contained inside the `DataFrame`.
df.info()
```
%% Cell type:markdown id:7f4f886e tags:
`DataFrame`s are essentially composed of 3 components. These components can be accessed through specific data attributes.
- Index (`df.index`)
- Columns (`df.columns`)
- Body (`df.values`)
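
As a minimal sketch of how the three components fit together, the following takes a small, made-up `DataFrame` apart and reassembles it. The name `df_small` and its contents are illustrative assumptions.

```python
import pandas as pd

df_small = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]},
                        index=["r1", "r2", "r3"])

index = df_small.index        # row labels
columns = df_small.columns    # column labels
body = df_small.to_numpy()    # 2-D ndarray holding the same data as `df_small.values`

# Reassemble the three parts into a new DataFrame.
rebuilt = pd.DataFrame(body, index=index, columns=columns)
print(rebuilt.equals(df_small))
```

If the columns had different datatypes, the body would be upcast to a common datatype and the round trip would no longer preserve the original column types.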
%% Cell type:code id:e6ed9ab7 tags:
``` python
df.index
```
%% Cell type:code id:b0afa504 tags:
``` python
df.columns
```
%% Cell type:code id:7ade6b93 tags:
``` python
df.values
```
%% Cell type:markdown id:de4be0fb tags:
## Data indexing and selection
%% Cell type:markdown id:fa82625d tags:
### The Iris flower dataset
<a title="w:ru:Денис Анисимов (talk | contribs), Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Irissetosa1.jpg"><img width="512" alt="Irissetosa1" src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Irissetosa1.jpg/512px-Irissetosa1.jpg"></a>
Image taken from: <a href="https://commons.wikimedia.org/wiki/File:Irissetosa1.jpg">w:ru:Денис Анисимов (talk | contribs)</a>, Public domain, via Wikimedia Commons
Attribution for dataset: *Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.*
%% Cell type:markdown id:673472df tags:
The dataset contains measurements of four "features" related to the species of Iris flowers:
* Petal length ("Bluetenblattlaenge")
* Petal width ("Bluetenblattbreite")
* Sepal length ("Kelchblattlaenge")
* Sepal width ("Kelchblattbreite")
The species contained in the dataset are:
* Iris setosa
* Iris virginica
* Iris versicolor
%% Cell type:code id:f95709e5 tags:
``` python
df = utils.download_IRIS()
```
%% Cell type:code id:30c97a17 tags:
``` python
# Quick check if data looks alright
# petal - Bluetenblatt
# sepal - Kelchblatt
df.head()
# df.tail()
```
%% Cell type:code id:76e5852d tags:
``` python
df.columns
```
%% Cell type:code id:285fad0e tags:
``` python
# Column access with the `[]` operator.
df["Name"]
```
%% Cell type:code id:0c7fc248 tags:
``` python
# The columns of a DataFrame are `Series` objects.
type(df["Name"])
```
%% Cell type:code id:87123e39 tags:
``` python
data_columns = [cname for cname in df.columns if cname != "Name"]
data_columns
```
%% Cell type:code id:32a00007 tags:
``` python
df[data_columns]
```
%% Cell type:markdown id:ce8a4f9e tags:
As with `Series` objects, the `loc` (label-based) and `iloc` (position-based) indexers are also available for `DataFrame`s.
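The Iris data uses the default integer index, for which labels and positions coincide. The following sketch uses a made-up `DataFrame` (`df_demo`, an assumption for illustration) whose index labels differ from the row positions, so the difference between the two kinds of access becomes visible.

```python
import pandas as pd

df_demo = pd.DataFrame({"value": [1.5, 2.5, 3.5]}, index=[10, 20, 30])

first_by_label = df_demo.loc[10, "value"]   # row with index label 10
first_by_position = df_demo.iloc[0, 0]      # first row, first column

label_slice = df_demo.loc[10:20]      # label slices include both end points
position_slice = df_demo.iloc[0:1]    # position slices exclude the end point
print(len(label_slice), len(position_slice))
```

With this index, `df_demo.loc[0]` would raise a `KeyError` because `0` is not a label in `df_demo.index`.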
%% Cell type:code id:146337e7 tags:
``` python
# Remember that when using `loc` the argument passed to the `[]` operator must be present in `df.index`.
df.loc[0]
```
%% Cell type:code id:7bac9da5 tags:
``` python
# We can also use slicing with the `loc` method.
df.loc[0::50].head()
```
%% Cell type:code id:75a449ca tags:
``` python
# Fancy indexing is also possible.
df.loc[[0, 50, 100]]
```
%% Cell type:code id:7dcc63b4 tags:
``` python
# We can combine row and column access with the `loc` method.
df.loc[:, ['sepal width', 'sepal length']].head()
```
%% Cell type:code id:ab98b434 tags:
``` python
# Rows can also be selected with boolean masks.
mask = (df["Name"] == "Iris-setosa")
```
%% Cell type:code id:87024f21 tags:
``` python
df.loc[mask].head()
```
%% Cell type:code id:9eb788b5 tags:
``` python
# More complicated boolean masks can be conceived
mask = (df["sepal length"] > 6.0) & (df["petal length"] > 1.0) # use () for each boolean sub-expression
df.loc[mask]
```
%% Cell type:markdown id:2b97d718 tags:
## Exercises (optional)
%% Cell type:markdown id:b1d0491f tags:
* Change all column names to uppercase, e.g.
* "petal length" $\to$ "PETAL LENGTH"
%% Cell type:code id:86cadb9b tags:
``` python
```
%% Cell type:markdown id:77c2b1fc tags:
* From the `"sepal length"` column retrieve all values that are `> 6` but `< 7`! How often does each of the resulting values occur in this column? (*Hint*: Refer to the [`DataFrame` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for a method to count values.)
%% Cell type:code id:905540ad tags:
``` python
```
%% Cell type:markdown id:a71c5cd3 tags:
* In the DataFrame `df`, *simultaneously* access the columns `"sepal length"`, `"petal width"`, and `"Name"` in two different ways.
%% Cell type:code id:4a346e36 tags:
``` python
```
%% Cell type:code id:e1b3f2b9 tags:
``` python
```
%% Cell type:markdown id:f5522a5c tags:
* Compare the following two ways of replacing data in a DataFrame. Do they both work? Why?
%% Cell type:code id:c85a31fe tags:
``` python
```
%% Cell type:code id:50f405da tags:
``` python
```
%% Cell type:markdown id:eb2bcfe0 tags:
* Determine the indices in the `DataFrame` that correspond to rows that contain data on the Iris setosa species.
* Use indices to delete the corresponding rows from the `DataFrame`.
%% Cell type:code id:8cc5891d tags:
``` python
```
%% Cell type:markdown id:69acd4c9 tags:
* Sort the rows of the `DataFrame` by the values contained in the columns `"petal length"` *and* `"petal width"`.
%% Cell type:code id:4b240056 tags:
``` python
```
%% Cell type:markdown id:e852cb23 tags:
# Reading data into a `DataFrame`
%% Cell type:markdown id:1c151460 tags:
Pandas can import several common file formats:
- `pd.read_csv`: Read in CSV spreadsheets (`.csv` suffix)
- `pd.read_excel`: Read in MS Office spreadsheets (`.xls` and `.xlsx` suffix)
- `pd.read_stata`: Read stata datasets (`.dta` suffix)
- `pd.read_hdf`: Read HDF datasets (`.hdf` suffix)
- `pd.read_sql`: Read from SQL database
Other file formats are [supported](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) as well.
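
The readers accept file-like objects as well as paths. A minimal sketch for `pd.read_csv`, assuming a small semicolon-separated text with a comment line (the content of `raw` is made up and only mimics the Iris file):

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file; the first line is a comment.
raw = io.StringIO(
    "# measurements in cm\n"
    "sepal length;sepal width;Name\n"
    "5.1;3.5;Iris-setosa\n"
    "7.0;3.2;Iris-versicolor\n"
)

# `delimiter` sets the column separator, `comment` marks text to be skipped.
df_csv = pd.read_csv(raw, delimiter=";", comment="#")
print(df_csv)
```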
%% Cell type:markdown id:2ec29c1d tags:
## Reading CSV files
%% Cell type:code id:002a168c tags:
``` python
# Download the files and write to CSV file.
from pathlib import Path
importlib.reload(utils)
utils.download_IRIS_with_addons(delimiter=";")
```
%% Cell type:code id:bdb4b736 tags:
``` python
# Inspect the file content. This command will only work on a UNIX-like operating system.
! head -n 15 tmp_with_addons/iris-data.csv | nl
```
%% Cell type:code id:bcadd113 tags:
``` python
# Read the file with Pandas and specify the delimiter symbol as well as the symbol that marks comments.
df = pd.read_csv(Path("tmp_with_addons") / "iris-data.csv", delimiter=";", comment='#')
df.head()
```
%% Cell type:code id:7e4cb505 tags:
``` python
# We can limit the number of imported columns by specifying those that we explicitly want to have.
df = pd.read_csv(Path("tmp_with_addons") / "iris-data.csv",
delimiter=";",
comment="#",
usecols=["Name", "sepal length", "sepal width"])
df.head()
```
%% Cell type:code id:62801f60 tags:
``` python
# When importing data we can specify which data column should become the index in the `DataFrame`.
df = pd.read_csv(Path("tmp_with_addons") / "iris-data.csv", delimiter=";",
comment="#", index_col="Name")
df.head()
```
%% Cell type:code id:d9cf40bc tags:
``` python
df_tmp1 = df.copy(deep=True)
df_tmp2 = df.copy(deep=True)
```
%% Cell type:markdown id:53834587 tags:
Oftentimes, when invoking a method of a `DataFrame` object, a *new* `DataFrame` instance is returned. This means that new memory allocations will be made, which can be quite time-consuming and also a waste of precious memory resources.
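A minimal sketch of this behaviour, using a made-up two-row `DataFrame` (`df_orig` is an illustrative name): the method returns a new object and leaves the original untouched.

```python
import pandas as pd

df_orig = pd.DataFrame({"Name": ["a", "b"], "value": [1, 2]})

# `set_index` works out-of-place by default and returns a new DataFrame.
df_new = df_orig.set_index("Name")

print(df_new is df_orig)            # two distinct objects
print("Name" in df_orig.columns)    # the original still has its "Name" column
```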
%% Cell type:markdown id:c79088e2 tags:
Reset the index of the current `DataFrame`. This is done *out-of-place* and a new instance is returned.
%% Cell type:code id:7c17dbdd tags:
``` python
df_tmp1.reset_index().set_index("sepal length").head()
```
%% Cell type:markdown id:d8f6726f tags:
We can use the `inplace` argument to modify the current instance itself.
%% Cell type:code id:b244ae21 tags:
``` python
# We can use the `inplace` argument to modify the object itself.
df_tmp2.reset_index(inplace=True)
df_tmp2.set_index("sepal length", inplace=True)
df_tmp2.head()
```
%% Cell type:markdown id:125a396b tags:
# Operations with `DataFrame`s
%% Cell type:markdown id:4446f2d6 tags:
## Arithmetic operations
%% Cell type:markdown id:5a6a944b tags:
Mapping between Python arithmetic operators and `DataFrame` methods.
| Python operator | Pandas methods |
|:---------------:|----------------------------------|
| `+` | `add()` |
| `-` | `sub()`, `subtract()` |
| `*` | `mul()`, `multiply()` |
| `/` | `truediv()`, `div()`, `divide()` |
| `//` | `floordiv()` |
| `%` | `mod()` |
| `**` | `pow()` |
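
To illustrate the mapping in the table, the sketch below compares the `+` operator with the `add()` method on two small, made-up `DataFrame`s (`left` and `right` are illustrative names) whose columns only partly overlap.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [10, 20], "C": [30, 40]})

sum_operator = left + right    # columns "A" and "C" exist in one frame only -> NaN
sum_method = left.add(right)   # same result as the operator

# With `fill_value`, entries missing in one frame are replaced before adding.
sum_filled = left.add(right, fill_value=0)
print(sum_filled)
```

The method form is mainly useful because it accepts extra arguments such as `fill_value` and `axis`, which the operator cannot take.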
%% Cell type:code id:74e113b4 tags:
``` python
A = pd.DataFrame(np.random.randint(0, 20, (3, 2)), columns=list("AB"))
B = pd.DataFrame(np.random.randint(0, 20, (3, 3)), columns=list("BAC"))
```
%% Cell type:code id:a9bb50e8 tags:
``` python
# Indices of all DataFrames involved in the operation are aligned. The order of each index is irrelevant.
# Data columns not shared by the DataFrames will be filled with a special value.
A + B
```
%% Cell type:code id:dc290da1 tags:
``` python
# Use the `add` method to specify the `fill_value`. Note that the `fill_value` is substituted for the entries of the
# column that is *missing* in one of the DataFrames. The arithmetic operation is then carried out with that value.
# >>> Choose wisely when using the `fill_value` argument <<<
A.add(B, fill_value=-1000)  # `fill_value` must be a number, not a string
```
%% Cell type:markdown id:7d5fccee tags:
NumPy broadcasting rules apply for `DataFrame`s as well.
%% Cell type:code id:dfd28894 tags:
``` python
df = pd.DataFrame(np.random.randint(10, size=(3, 4)), columns=list("wxyz"))
df
```
%% Cell type:code id:9d8b8e06 tags:
``` python
# Subtract a row.
df - df.loc[0]
```
%% Cell type:code id:9c0ecfb7 tags:
``` python
# Call the appropriate method if you want to operate on the columns. We operate along axis=0 (the rows).
df.sub(df["x"], axis=0)
```
%% Cell type:markdown id:8da5ad3d tags:
`DataFrame`s can be fed to Numpy `ufunc`s.
%% Cell type:code id:55d00536 tags:
``` python
np.exp(df)
```
%% Cell type:markdown id:60dfcb8b tags:
New columns can be added with arithmetic operations.
%% Cell type:code id:29bb3975 tags:
``` python
df["asdf"] = np.sin( df["x"] + df["y"] )
df
```
%% Cell type:markdown id:b1381d02 tags:
## Methods for operating on `DataFrame`s
%% Cell type:markdown id:aa5758e3 tags:
Pandas `DataFrame` and `Series` objects have several built-in methods to operate on the data.
- `apply()`: available for *both* `Series` and `DataFrame` objects
- `transform()`: available for *both* `Series` and `DataFrame` objects
- `applymap()`: *only* available for `DataFrame` objects (deprecated since pandas 2.1 in favour of `DataFrame.map()`)
- `map()`: available for `Series` objects (and, since pandas 2.1, also for `DataFrame` objects as the replacement for `applymap()`)
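
The following sketch contrasts three of these methods on a small, made-up `DataFrame` (`df_demo` and its values are assumptions for illustration). `applymap()` is left out because its name depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, 4.0, 9.0], "b": [16.0, 25.0, 36.0]})

# apply: the function receives each column as a Series; here it aggregates.
col_max = df_demo.apply(lambda col: col.max())

# transform: the result has the same shape as the input.
roots = df_demo.transform(np.sqrt)

# map: element-wise operation on a single Series.
labels = df_demo["a"].map(lambda v: f"{v:.0f} cm")
print(labels.tolist())
```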
%% Cell type:code id:0545a722 tags:
``` python
df = utils.download_IRIS()
df.head()
```
%% Cell type:code id:a0b8e197 tags:
``` python
# Get a subset of columns by using regular expressions
data_columns = df.columns[df.columns.str.match('^(petal|sepal).*(width|length)$')]
data_columns
```
%% Cell type:markdown id:c4ca8241 tags:
### [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
```python
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)
```
- *applies* a function (callable) along an `axis` of the `DataFrame`
- `axis=0`: `func` is applied to each column (a `Series` object). This is the default!
- `axis=1`: `func` is applied to each row
- return type is inferred from `func`
%% Cell type:markdown id:ec3f1484 tags:
The return type of `func` determines the form of the result.
`func` can operate on `Series` objects and perform operations that are supported by these types of objects (e.g. by means of the methods `.min()`, `.max()` or `.mean()`).
- result can be a scalar value (e.g. `.sum()` which is an aggregation operation)
- result can be another `Series` object
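
A minimal sketch of both cases for the default `axis=0`, using a made-up `DataFrame` (`df_demo` is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# `func` returns a scalar per column -> the overall result is a Series.
col_sums = df_demo.apply(lambda col: col.sum())

# `func` returns a Series per column -> the overall result is a DataFrame.
centred = df_demo.apply(lambda col: col - col.mean())

print(type(col_sums).__name__, type(centred).__name__)
```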
%% Cell type:markdown id:440dfc49 tags:
Compute the mean value of each column (this is the default because we do not specify the `axis` argument).
%% Cell type:code id:f412f504 tags:
``` python
mean_values = df[data_columns].apply(lambda x: x.mean())
mean_values # This returns a `Series` object because x.mean() returns a scalar value.
```
%% Cell type:markdown id:511add01 tags:
### Question
What does the result look like if we operate along the rows of the `DataFrame`? This is achieved by passing the argument `axis=1`. What is the shape of the resulting object?
%% Cell type:markdown id:ef03fc24 tags:
Now we transform the values in the columns of the `DataFrame`. We define a function that will operate on the `Series` objects that form the columns.
The object resulting from this operation is another `DataFrame` instance.
%% Cell type:code id:58bdf8e5 tags:
``` python
def scale_to_mm(s):
    return s * 10
df_scaled_to_mm = df[data_columns].apply(scale_to_mm) # This will return a new DataFrame
df_scaled_to_mm["Name"] = df["Name"]
df_scaled_to_mm.head()
```
%% Cell type:markdown id:233adbba tags:
### Question
How must the above command be changed if we want to operate along the rows of the `DataFrame` instead? Does this also work with the already-defined function or do we have to define a dedicated function?
%% Cell type:code id:5dd57a2b tags:
``` python
df[data_columns].apply(scale_to_mm, axis=1).head()
```
%% Cell type:markdown id:371d521f tags:
### Experimenting with the `apply()` method
%% Cell type:markdown id:afb7a126 tags:
Let's generate a large `DataFrame`. We wish to operate on the data with the `apply` method. We can do this in two different ways:
- Operate along the rows (`axis=1`)
- Operate along the columns (`axis=0`)
%% Cell type:code id:355d3698 tags:
``` python
N_rows, N_cols = 10_000, 500
data = pd.DataFrame(np.random.random((N_rows, N_cols)), columns=[f"col{idx}" for idx in range(N_cols)])
```
%% Cell type:markdown id:2d78baa7 tags:
### Question
What do you think is faster: Operating along the columns or operating along the rows?
When you have made your decision try to come up with a reason!
%% Cell type:code id:368b5567 tags:
``` python
%timeit data.apply(lambda x: x ** 2, axis=0) # operate along columns
%timeit data.apply(lambda x: x ** 2, axis=1) # operate along rows
```
%% Cell type:markdown id:524759b7 tags:
The `apply` method operates on `Series` objects. The columns of a `DataFrame` are `Series`. Inside each `Series` the data is stored contiguously in memory. Hence operating on the columns is *fast*.
When operating row-wise, a new `Series` object must be generated for *each* row. A buffer must be allocated in memory and the data needs to be copied to that buffer before `apply` can operate on it. Since these steps are repeated for every row, this procedure is generally *slower* than operating along the columns.
%% Cell type:markdown id:cb2244cb tags:
### Task (optional)
The names of the Iris species are contained in the column with heading `"Name"`. The names follow the pattern:
```
Iris-<identifier for species>
```
Remove the dash `-` from the names and just keep the identifier for each species. Use the `apply` method.
%% Cell type:code id:79f7a36a tags:
``` python
```
%% Cell type:markdown id:3f112be8 tags:
### `transform()`
```python
DataFrame.transform(func, axis=0, *args, **kwargs)
```
`func` can either be
- callable, e.g. `np.exp`
- list-like, e.g. `[np.sin, np.cos]`
- dict-like, e.g. `{"sepal length": np.sin, "petal length": np.cos}`. Application is limited to the column names passed as keys of the `dict`.
- string, e.g. `"sqrt"`
*Note*: This function *transforms*, i.e., when the input value is a `Series`, another (transformed) `Series` is returned. Returning a scalar value is not valid (the resulting error message will be: `ValueError: Function did not transform`).
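The note can be checked with a minimal sketch on a made-up `DataFrame` (`df_demo` is an illustrative name): a shape-preserving function is accepted, whereas an aggregating function is rejected with a `ValueError`.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Valid: every column is mapped to a column of the same length.
doubled = df_demo.transform(lambda col: col * 2)

# Invalid: the function reduces each column to a single scalar.
try:
    df_demo.transform(lambda col: col.sum())
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```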
%% Cell type:code id:4848e300 tags:
``` python
df[data_columns].transform({"sepal length": np.cos, "petal length": np.sin}).head()
```
%% Cell type:markdown id:baccb2f2 tags:
### Task (optional)
Convert the measured values (which are all given in cm units) to mm units by using the `transform` method.
%% Cell type:code id:d19379c2 tags:
``` python
```
%% Cell type:markdown id:c0910d4c tags:
### Performance considerations
%% Cell type:markdown id:b7d5536e tags:
When operating on columns of a `DataFrame` or on a `DataFrame` *as a whole*, it is oftentimes faster to use vectorised operations instead of column-/row-wise operations.
%% Cell type:code id:d8432d37 tags:
``` python
df = pd.DataFrame(np.random.randn(1_000_000, 3), columns=list("abc"))
```
%% Cell type:code id:75e45ac5 tags:
``` python
%timeit df.apply(lambda x: x ** 2, axis=0)
%timeit df ** 2
%timeit (df.values ** 2) # here we operate on the underlying `ndarray`
```
%% Cell type:markdown id:dd5b8a17 tags:
### `assign`
%% Cell type:markdown id:2cd2aaa0 tags:
The `assign` method adds new columns to a `DataFrame`. It is called on an existing `DataFrame` and returns a new `DataFrame` that has all columns of the original `DataFrame` plus the new ones.
* It allows adding one or several columns per call.
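
A minimal sketch on a made-up `DataFrame` (`df_demo` and the column names are assumptions for illustration). New columns can be given either as a callable that receives the `DataFrame`, or directly as values.

```python
import pandas as pd

df_demo = pd.DataFrame({"length": [1.0, 2.0, 3.0], "width": [0.5, 1.0, 1.5]})

df_ext = df_demo.assign(
    area=lambda d: d["length"] * d["width"],   # callable, evaluated on the DataFrame
    length_mm=df_demo["length"] * 10,          # precomputed Series
)

print(list(df_demo.columns))   # the original is unchanged
print(list(df_ext.columns))
```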
%% Cell type:code id:527d0daf tags:
``` python
df = utils.download_IRIS()  # reload the Iris data; `df` was overwritten with random numbers above
df_mean = df[data_columns].mean()
df.assign(
petal_length_dev_from_mean=lambda x: x["petal length"] - df_mean["petal length"],
petal_width_dev_from_mean=lambda x: x["petal width"] - df_mean["petal width"],
sepal_length_dev_from_mean=lambda x: x["sepal length"] - df_mean["sepal length"],
sepal_width_dev_from_mean=lambda x: x["sepal width"] - df_mean["sepal width"]
)
```
%% Cell type:markdown id:f470a4a6 tags:
## Grouping data
%% Cell type:markdown id:ab7c9102 tags:
### Properties of `GroupBy` objects
%% Cell type:markdown id:e360d0f1 tags:
- Oftentimes items in a dataset can be grouped in a certain manner (e.g., if a column contains a value multiple times). The Iris dataset, for instance, can be grouped according to the species of each flower.
```python
my_dataframe.groupby(by=["<column label>"])
```
- The `DataFrame` is split and entries are grouped according to the values in the column with the label `"<column label>"`. Once the data has been grouped, operations can be conducted on the items of each group.
*Note*: `DataFrame`s can be [grouped](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) by more than just the entries of a column, e.g. by index levels, functions or mappings.
%% Cell type:markdown id:99fce3ac tags:
The return type of `groupby()` is *not* another `DataFrame` but rather a `DataFrameGroupBy` object. We can imagine this object to be a grouping of multiple `DataFrame`s.
It is important to understand that such an object essentially is a special *view* on the original `DataFrame`. No computations have been carried out when generating it (lazy evaluation).
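A minimal sketch of the split/apply/combine idea on a made-up `DataFrame` (`df_demo` and its column names are illustrative assumptions). Creating the grouped object is cheap; the actual work happens when a method such as `size()` or `mean()` is called.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "species": ["a", "b", "a", "b"],
    "length": [1.0, 2.0, 3.0, 4.0],
})

grouped = df_demo.groupby("species")      # no aggregation has been computed yet

group_sizes = grouped.size()              # number of rows per group
group_means = grouped["length"].mean()    # mean of "length" within each group
print(group_means)
```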
%% Cell type:code id:72c259e0 tags:
``` python
df = utils.download_IRIS()
```
%% Cell type:code id:00f1cbca tags:
``` python
# We group the data according to the species of the flowers
grouped_by_species = df.groupby(by="Name")  # a plain label (not a list) keeps the group keys as plain strings
```
%% Cell type:code id:acbfef47 tags:
``` python
print(type(grouped_by_species))
```
%% Cell type:markdown id:12e3ea0f tags:
This data structure still knows about the `columns` that were present in the original `DataFrame`. We can use the `[<column-name>]` operation to access the columns with the corresponding label in each of the group members (subframes).
%% Cell type:code id:74f7786d tags:
``` python
grouped_by_species["sepal length"]
```
%% Cell type:code id:ab6a8365 tags:
``` python
# Pandas will access the corresponding column of all subframes and apply the functions passed to the `agg()` method.
grouped_by_species["sepal length"].agg([np.min, np.max, np.mean])
```
%% Cell type:markdown id:3e17a46e tags:
We can iterate over the `DataFrameGroupBy` object, where each subframe is returned as a `Series` or a `DataFrame`.
%% Cell type:code id:a245fca9 tags:
``` python
for (species, subframe) in grouped_by_species:
    print(f"Subframe for species {species} has shape {subframe.shape}")
```
%% Cell type:code id:4d110f91 tags:
``` python
# Call the getter to obtain a `DataFrame`.
grouped_by_species.get_group("Iris-setosa").head()
```
%% Cell type:markdown id:cf9faf2d tags:
Methods that are not directly implemented for the `DataFrameGroupBy` object are passed to the subframes and executed on these.
%% Cell type:code id:db2a75ca tags:
``` python
# The `describe()` method can also be called on the full object but the output would be rather hard to view.
grouped_by_species["sepal length"].describe() # The return type is a `DataFrame`
```
%% Cell type:code id:2df3b968 tags:
``` python
# Single methods are available as well. E.g. `mean()`, `std()` or `sum()`
grouped_by_species.mean() # The return type is a `DataFrame`
```
%% Cell type:markdown id:dafb2e66 tags:
### Operating on `GroupBy` objects
%% Cell type:markdown id:9bbd870d tags:
`DataFrameGroupBy` objects support `aggregate()`, `filter()`, `transform()` and `apply()` operations.
These methods can be efficiently used to implement a great variety of operations on grouped data.
%% Cell type:markdown id:03cd3096 tags:
#### [`aggregate()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html) (or simply `agg()`)
```python
DataFrameGroupBy.aggregate(func=None, *args, engine=None,
engine_kwargs=None, **kwargs)
```
`func` can for example be ...
- ... a function (Python callable),
- ... a string specifying a function name (e.g. `"mean"`),
- ... a list of functions or strings (e.g. `["std", np.mean]`),
- ... a `dict` of column labels and functions to apply (e.g. `{'data1': np.mean}`).
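
The different forms of `func` can be compared on a small, made-up `DataFrame` (`df_demo` and its columns are assumptions for illustration). Strings are used here instead of NumPy functions, which is one possible choice among the forms listed above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 4.0],
    "width": [10.0, 30.0, 20.0, 40.0],
})
grouped = df_demo.groupby("species")

by_string = grouped["length"].agg("mean")          # one function -> Series
by_list = grouped["length"].agg(["min", "max"])    # several functions -> DataFrame
by_dict = grouped.agg({"length": "mean", "width": "max"})  # per-column functions
print(by_dict)
```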
%% Cell type:code id:32798a44 tags:
``` python
# Perform some common aggregations within each subframe. The output of this method is another `DataFrame`.
group_agg = grouped_by_species.agg([np.min, np.max, np.mean, np.std])
group_agg
```
%% Cell type:code id:217e1c1f tags:
``` python
# To understand this a bit better consider the following. Note that we limit the output to only one species.
df.loc[df["Name"] == "Iris-setosa", df.columns[:-1]].agg(
[np.min,
np.max,
np.mean,
np.std]
)
```
%% Cell type:markdown id:cfe77e99 tags:
The resulting output looks somewhat more complicated than what we are used to from `DataFrame`s so far. The column labels are now hierarchical due to the grouping.
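A minimal sketch of how such hierarchical column labels can be handled, using a made-up `DataFrame` (`df_demo` is an illustrative name). Flattening the labels is just one option and is not prescribed by the course material.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "species": ["a", "b", "a", "b"],
    "length": [1.0, 2.0, 3.0, 4.0],
})
agg = df_demo.groupby("species").agg(["min", "max"])

one_column = agg[("length", "max")]   # a tuple selects a single column -> Series
sub_frame = agg["length"]             # the first level selects a sub-DataFrame

# Join the two levels into plain string labels.
flat = agg.copy()
flat.columns = ["_".join(labels) for labels in agg.columns]
print(list(flat.columns))
```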
%% Cell type:code id:9de104f9 tags:
``` python
group_agg.columns # This is a so-called `MultiIndex`.
```
%% Cell type:code id:4ef5258d tags:
``` python
df
```
%% Cell type:markdown id:9fe0ff59 tags:
## Exercises (optional)
%% Cell type:markdown id:e1b12f15 tags:
### Task 1
Consider the Iris dataset.
* For each of the features compute the mean value as well as the standard deviation.
* Center the values of a particular feature on the mean values and scale them to have unit variance.
%% Cell type:code id:62978f94 tags:
``` python
df = utils.download_IRIS()
```
%% Cell type:markdown id:5b09bfcb tags:
Let us first make a working copy of the `DataFrame` containing the data on the Iris dataset.
%% Cell type:code id:15edb74b tags:
``` python
df_tmp = df.copy()
```
%% Cell type:markdown id:d7ba0a2a tags:
Next, compute the mean value and the standard deviation for all features of the dataset. Computing these quantities does *not* take into account the particular species.
%% Cell type:code id:53d45569 tags:
``` python
cols = ["sepal length", "sepal width", "petal length", "petal width"]
df_agg = df.loc[:, cols].agg([np.min, np.max, np.mean, np.std])
df_agg
```
%% Cell type:markdown id:ae660e8f tags:
Now transform each of the features to be centred on the mean value and to have unit variance.
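The transformation is $z = (x - \bar{x}) / s$, where $\bar{x}$ is the mean and $s$ the standard deviation of a feature. A minimal sketch on made-up numbers (`values` is an illustrative name), including a check of the result:

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [10.0, 20.0, 30.0, 40.0],
})

# Column-wise mean and standard deviation are broadcast over the rows.
z = (values - values.mean()) / values.std()

print(np.allclose(z.mean(), 0.0), np.allclose(z.std(), 1.0))
```

Pandas computes the sample standard deviation (`ddof=1`) by default, so the scaled columns have unit *sample* variance.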
%% Cell type:code id:52a96d54 tags:
``` python
# Operate on the working copy so that the original `df` stays untouched.
df_tmp.loc[:, cols] = (df_tmp.loc[:, cols] - df_agg.loc["mean", :]) / df_agg.loc["std", :]
df_tmp.describe()
```
%% Cell type:markdown id:f2b49c83 tags:
### Task 2
Again consider the Iris dataset.
* Group the measured values by the species.
* Create boxplots for each species for all features.
* Retrieve the names of the single groups from the `GroupBy` object.
* Get the `DataFrame` for each of the groups from the `GroupBy` object and call the [`boxplot` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) to create the plot.
* Use the names in the titles of the plot.
%% Cell type:code id:7ae51b08 tags:
``` python
df = utils.download_IRIS()
```
%% Cell type:code id:801dd946 tags:
``` python
grouped_by_species = df.groupby(by="Name")  # a plain label keeps the group names as plain strings
```
%% Cell type:code id:90b8c49e tags:
``` python
fig, axs = plt.subplots(1, 3)
for ax, (name, group_data) in zip(axs, grouped_by_species):
    group_data.boxplot(ax=ax)
    ax.set_title(name)
```
%% Cell type:code id:a031db1b tags:
``` python
```