The `pd.DataFrame` class provides a data structure to handle 2-dimensional tabular data. `DataFrame` objects are *size-mutable* and can contain mixed datatypes (e.g. `float`, `int` or `str`). All data columns inside a `DataFrame` share the same `index`.
Series objects are *matched by index* and missing values are replaced with a default value.
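
As a minimal sketch of this alignment, two `Series` with only partially overlapping indices can be combined as follows. The names `s1`, `s2` and `aligned` as well as the values are made up for illustration.

```python
import pandas as pd

# Two Series whose indices overlap only partially (illustrative data).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "d"])

# The resulting index is the union of both indices; entries that are
# missing in one of the Series are filled with NaN.
aligned = pd.DataFrame({"s1": s1, "s2": s2})
print(aligned)
```

Only the label `"b"` is present in both `Series`, so it is the only row without a missing value.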
%% Cell type:code id:2222bd8e tags:
``` python
df  # default value is `NaN`
```
%% Cell type:markdown id:a220919f tags:
## Exercises (optional)
%% Cell type:markdown id:552b4899 tags:
* Given the two iterables `values1` and `values2`, create a `pd.DataFrame` containing both in two different ways. Label the columns `'label1'` and `'label2'`.
How many rows and columns are contained in the `DataFrame`? We have seen this attribute when dealing with `ndarray`s ...
%% Cell type:code id:c9fc5896 tags:
``` python
df.shape
```
%% Cell type:code id:a12c423b tags:
``` python
# Detailed information on the data contained inside the `DataFrame`.
df.info()
```
%% Cell type:markdown id:7f4f886e tags:
`DataFrame`s are essentially composed of 3 components. These components can be accessed with specific data attributes.
- Index (`df.index`)
- Columns (`df.columns`)
- Body (`df.values`)
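
The three components can be inspected on a small hand-made example. This is only a sketch; the name `small`, the labels and the values are arbitrary.

```python
import pandas as pd

# Small illustrative DataFrame; labels and values are arbitrary.
small = pd.DataFrame(
    {"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]},
    index=["x", "y", "z"],
)

print(small.index)    # row labels
print(small.columns)  # column labels
print(small.values)   # body as a NumPy array
print(small.shape)    # (number of rows, number of columns)
```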
%% Cell type:code id:e6ed9ab7 tags:
``` python
df.index
```
%% Cell type:code id:b0afa504 tags:
``` python
df.columns
```
%% Cell type:code id:7ade6b93 tags:
``` python
df.values
```
%% Cell type:markdown id:de4be0fb tags:
## Data indexing and selection
%% Cell type:markdown id:fa82625d tags:
### The Iris flower dataset
<a title="w:ru:Денис Анисимов (talk | contribs), Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Irissetosa1.jpg"><img width="512" alt="Irissetosa1" src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Irissetosa1.jpg/512px-Irissetosa1.jpg"></a>
Image taken from: <a href="https://commons.wikimedia.org/wiki/File:Irissetosa1.jpg">w:ru:Денис Анисимов (talk | contribs)</a>, Public domain, via Wikimedia Commons
Attribution for dataset: *Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.*
%% Cell type:markdown id:673472df tags:
The dataset contains measurements of four "features" related to the species of Iris flowers:
* Petal length ("Bluetenblattlaenge")
* Petal width ("Bluetenblattbreite")
* Sepal length ("Kelchblattlaenge")
* Sepal width ("Kelchblattbreite")
The species contained in the dataset are:
* Iris setosa
* Iris virginica
* Iris versicolor
%% Cell type:code id:f95709e5 tags:
``` python
df = utils.download_IRIS()
```
%% Cell type:code id:30c97a17 tags:
``` python
# Quick check if data looks alright
# petal - Bluetenblatt
# sepal - Kelchblatt
df.head()
# df.tail()
```
%% Cell type:code id:76e5852d tags:
``` python
df.columns
```
%% Cell type:code id:285fad0e tags:
``` python
# Column access with the `[]` operator.
df["Name"]
```
%% Cell type:code id:0c7fc248 tags:
``` python
# The columns of a DataFrame are `Series` objects.
type(df["Name"])
```
Oftentimes, invoking a method of a `DataFrame` object returns a *new* `DataFrame` instance. This means that new memory is allocated, which can be quite time-consuming and also wastes precious memory resources.
%% Cell type:markdown id:c79088e2 tags:
Reset the index of the current `DataFrame`. This is done *out-of-place* and a new instance is returned.
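
A small sketch of what "out-of-place" means here, using a made-up `DataFrame` called `toy` (not the Iris data): the returned object carries the new index, while the original object keeps its old one.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])

# `reset_index` returns a new DataFrame; the old index becomes a column.
reset = toy.reset_index()

print(reset)
print(toy.index.tolist())  # the original DataFrame still has its old index
```

If the result should replace the original, it has to be assigned back to the same name.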
- The `apply` method *applies* a function (callable) along an `axis` of the `DataFrame`:
    - `axis=0`: `func` is applied to each column (a `Series` object). This is the default!
    - `axis=1`: `func` is applied to each row
- The return type is inferred from `func`
%% Cell type:markdown id:ec3f1484 tags:
The return type of `func` determines the form of the result.
`func` can operate on `Series` objects and perform operations that are supported by these types of objects (e.g. by means of the methods `.min()`, `.max()` or `.mean()`).
- result can be a scalar value (e.g. `.sum()` which is an aggregation operation)
- result can be another `Series` object
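
The two cases can be sketched on a toy `DataFrame` (names and values are illustrative only). Both calls use the default `axis=0`, so `func` receives one column at a time.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [10.0, 20.0, 40.0]})

# `func` returns a scalar per column -> the result is a Series
# indexed by the column labels.
col_range = toy.apply(lambda s: s.max() - s.min())

# `func` returns a Series per column -> the result is a DataFrame
# with the same shape as the input.
doubled = toy.apply(lambda s: s * 2)

print(type(col_range).__name__, col_range.shape)
print(type(doubled).__name__, doubled.shape)
```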
%% Cell type:markdown id:440dfc49 tags:
Compute the mean value of each column (this is the default because we do not specify the `axis` argument).
``` python
mean_values = df[data_columns].apply(lambda x: x.mean())
mean_values  # This returns a `Series` object because x.mean() returns a scalar value.
```
%% Cell type:markdown id:511add01 tags:
### Question
What does the result look like if we operate along the rows of the `DataFrame`? This is achieved by using the argument `axis=1`. What is the shape of the resulting object?
%% Cell type:markdown id:ef03fc24 tags:
Now we transform the values in the columns of the `DataFrame`. We define a function that will operate on the `Series` objects that form the columns.
The object resulting from this operation is another `DataFrame` instance.
%% Cell type:code id:58bdf8e5 tags:
``` python
def scale_to_mm(s):
    # Measurements are given in cm; 1 cm = 10 mm.
    return s * 10
df_scaled_to_mm = df[data_columns].apply(scale_to_mm) # This will return a new DataFrame
df_scaled_to_mm["Name"] = df["Name"]
df_scaled_to_mm.head()
```
%% Cell type:markdown id:233adbba tags:
### Question
How must the above command be changed if we want to operate along the rows of the `DataFrame` instead? Does this also work with the already-defined function or do we have to define a dedicated function?
Let's generate a large `DataFrame`. We wish to operate on the data with the `apply` method. We can do this in two different ways:
- Operate along the rows (`axis=1`)
- Operate along the columns (`axis=0`)
%% Cell type:code id:355d3698 tags:
``` python
N_rows, N_cols = 10_000, 500
data = pd.DataFrame(np.random.random((N_rows, N_cols)), columns=[f"col{idx}" for idx in range(N_cols)])
```
%% Cell type:markdown id:2d78baa7 tags:
### Question
What do you think is faster: Operating along the columns or operating along the rows?
When you have made your decision try to come up with a reason!
%% Cell type:code id:368b5567 tags:
``` python
%timeit data.apply(lambda x: x ** 2, axis=0) # operate along columns
%timeit data.apply(lambda x: x ** 2, axis=1) # operate along rows
```
%% Cell type:markdown id:524759b7 tags:
The `apply` method operates on `Series` objects. The columns of a `DataFrame` are `Series`, and inside each `Series` the data is stored contiguously in memory. Hence, operating on the columns is *fast*.
When operating row-wise, a new `Series` object must be generated for *each* row: a buffer must be allocated in memory and the data needs to be copied to that buffer before `apply` can operate on it. Since these steps are repeated for every row, this procedure is generally *slower* than operating along the columns.
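
Outside of a notebook, where `%timeit` is not available, the comparison can be sketched with `time.perf_counter`. The `DataFrame` below is deliberately smaller than the one above so that the sketch runs quickly; the name `small_data` and the sizes are arbitrary, and the measured times depend on the machine.

```python
import time

import numpy as np
import pandas as pd

# Smaller than the DataFrame above so that the sketch runs quickly.
rng = np.random.default_rng(seed=0)
small_data = pd.DataFrame(rng.random((1_000, 50)))

start = time.perf_counter()
by_columns = small_data.apply(lambda x: x ** 2, axis=0)
t_columns = time.perf_counter() - start

start = time.perf_counter()
by_rows = small_data.apply(lambda x: x ** 2, axis=1)
t_rows = time.perf_counter() - start

# Both variants compute the same values; only the cost differs.
print(f"axis=0: {t_columns:.4f} s, axis=1: {t_rows:.4f} s")
print(np.allclose(by_columns.to_numpy(), by_rows.to_numpy()))
```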
%% Cell type:markdown id:cb2244cb tags:
### Task (optional)
The names of the Iris species are contained in the column with heading `"Name"`. The names follow the pattern:
```
Iris-<identifier for species>
```
Remove the dash `-` from the names and just keep the identifier for each species. Use the `apply` method.
- dict-like, e.g. `{"sepal length": np.sin, "petal length": np.cos}`. Application is limited to the column names passed as keys of the `dict`.
- string, e.g. `"sqrt"`
*Note*: This function *transforms*, i.e., when the input value is a `Series`, another (transformed) `Series` is returned. Returning a scalar value is not valid (the resulting error message will be: `ValueError: Function did not transform`).
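
The following sketch illustrates these forms of the argument on a toy `DataFrame` (the name `toy` and its values are made up). It also shows that a function which aggregates to a scalar is rejected; the exact wording of the error message may depend on the pandas version.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 4.0, 9.0], "b": [0.0, 1.0, 2.0]})

# String: resolved to the function of that name.
roots = toy.transform("sqrt")

# Dict-like: only the columns named as keys are transformed and returned.
only_a = toy.transform({"a": np.sqrt})

# A function that aggregates to a scalar is rejected.
try:
    toy.transform(lambda s: s.sum())
    raised = False
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
```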
Convert the measured values (which are all given in cm units) to mm units by using the `transform` method.
%% Cell type:code id:d19379c2 tags:
``` python
```
%% Cell type:markdown id:c0910d4c tags:
### Performance considerations
%% Cell type:markdown id:b7d5536e tags:
When operating on columns of a `DataFrame` or on a `DataFrame` *as a whole*, it is oftentimes faster to use vectorised operations instead of column-/row-wise operations.
``` python
%timeit (df.values ** 2) # here we operate on the underlying `ndarray`
```
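
As a sketch (with a made-up, purely numeric `DataFrame` called `frame`), the three ways of squaring all values give the same numbers. They differ only in how the work is dispatched.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
frame = pd.DataFrame(rng.random((100, 5)))

via_apply = frame.apply(lambda x: x ** 2)  # one call of the lambda per column
vectorised = frame ** 2                    # a single operation on the whole frame
via_numpy = frame.to_numpy() ** 2          # operating on the underlying ndarray

print(np.allclose(via_apply, vectorised))
print(np.allclose(vectorised.to_numpy(), via_numpy))
```

Note that the last variant returns a plain `ndarray`, so the index and the column labels are lost.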
%% Cell type:markdown id:dd5b8a17 tags:
### `assign`
%% Cell type:markdown id:2cd2aaa0 tags:
The `assign` method adds a new column to a `DataFrame`. It is called on an existing `DataFrame` and returns a new `DataFrame` (that has all columns of the original `DataFrame`) with the new column added.
* Allows adding a single column as well as multiple columns per call.
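
A minimal sketch of `assign` on a toy `DataFrame` (the names `toy`, `extended`, `area` and `length_mm` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"length": [1.0, 2.0], "width": [3.0, 4.0]})

# Add two columns in one call. A callable receives the DataFrame that
# `assign` is called on; keyword names become the new column labels.
extended = toy.assign(
    area=lambda d: d["length"] * d["width"],
    length_mm=toy["length"] * 10,
)

print(extended.columns.tolist())
print(toy.columns.tolist())  # the original DataFrame is unchanged
```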
- Oftentimes items in a dataset can be grouped in a certain manner (e.g., if a column contains a value multiple times). The Iris dataset, for instance, can be grouped according to the species of each flower.
```python
my_dataframe.groupby(by=["<column label>"])
```
- The `DataFrame` is split and entries are grouped according to the values in the column labelled `"<column label>"`. Once the data has been grouped, operations can be carried out on the items of each group.
*Note*: `DataFrame`s can be [grouped](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) by more than just the entries of a column.
%% Cell type:markdown id:99fce3ac tags:
The return type of `groupby()` is *not* another `DataFrame` but rather a `DataFrameGroupBy` object. We can imagine this object to be a grouping of multiple `DataFrame`s.
It is important to understand that such an object essentially is a special *view* on the original `DataFrame`. No computations have been carried out when generating it (lazy evaluation).
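
The idea can be sketched with a toy `DataFrame` (names and values are made up and unrelated to the Iris data). Grouping itself is cheap; the values are only computed once an operation such as `mean()` is requested.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b"],
        "length": [1.0, 3.0, 10.0, 20.0],
    }
)

grouped = toy.groupby(by="species")
print(type(grouped).__name__)

# Iterating yields (group key, sub-DataFrame) pairs.
for key, sub_frame in grouped:
    print(key, sub_frame.shape)

# The computation only happens once an operation is requested.
means = grouped["length"].mean()
print(means)
```

Here a single label is passed as `by`. Passing a list such as `by=["species"]` also works, but recent pandas versions then yield the group key as a one-element tuple during iteration.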
%% Cell type:code id:72c259e0 tags:
``` python
df = utils.download_IRIS()
```
%% Cell type:code id:00f1cbca tags:
``` python
# We group the data according to the species of the flowers
grouped_by_species = df.groupby(by=["Name"])
```
%% Cell type:code id:acbfef47 tags:
``` python
print(type(grouped_by_species))
```
%% Cell type:markdown id:12e3ea0f tags:
This data structure still knows about the `columns` that were present in the original `DataFrame`. We can use the `[<column-name>]` operation to access the columns with the corresponding label in each of the group members (subframes).
%% Cell type:code id:74f7786d tags:
``` python
grouped_by_species["sepal length"]
```
%% Cell type:code id:ab6a8365 tags:
``` python
# Pandas will access the corresponding column of all subframes and apply the functions passed to the `agg()` method.
```
The resulting output looks somewhat more complicated than what we are used to from `DataFrame`s so far. The column labels are now hierarchical due to the grouping.
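
A small sketch of how such hierarchical column labels arise, using a toy `DataFrame` and two arbitrarily chosen aggregation functions (this is not necessarily the call used for the Iris data):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b"],
        "length": [1.0, 3.0, 10.0, 20.0],
        "width": [0.5, 0.7, 2.0, 4.0],
    }
)

# Two aggregation functions applied to every non-grouping column.
toy_agg = toy.groupby(by="species").agg(["min", "max"])
print(toy_agg)
print(toy_agg.columns.tolist())

# An entry is addressed by a (column label, function name) tuple.
print(toy_agg.loc["a", ("length", "max")])
```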
%% Cell type:code id:9de104f9 tags:
``` python
group_agg.columns # This is a so-called `MultiIndex`.
```
%% Cell type:code id:4ef5258d tags:
``` python
df
```
%% Cell type:markdown id:9fe0ff59 tags:
## Exercises (optional)
%% Cell type:markdown id:e1b12f15 tags:
### Task 1
Consider the Iris dataset.
* For each of the features compute the mean value as well as the standard deviation.
* Center the values of a particular feature on the mean values and scale them to have unit variance.
%% Cell type:code id:62978f94 tags:
``` python
df = utils.download_IRIS()
```
%% Cell type:markdown id:5b09bfcb tags:
Let us first make a working copy of the `DataFrame` containing the data on the Iris dataset.
%% Cell type:code id:15edb74b tags:
``` python
df_tmp = df.copy()
```
%% Cell type:markdown id:d7ba0a2a tags:
Next, compute the mean value and the standard deviation for all features of the dataset. Computing these quantities does *not* take into account the particular species.
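
The centering and scaling step can be sketched on a single toy column (the name `feature` and the values are made up). Note that `Series.std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

toy = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})

mean = toy["feature"].mean()
std = toy["feature"].std()  # sample standard deviation (ddof=1) by default

# Center on the mean and scale to unit variance.
standardised = (toy["feature"] - mean) / std
print(standardised.tolist())
```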
* Create boxplots for each species for all features.
* Retrieve the names of the individual groups from the `DataFrameGroupBy` object.
* Get the `DataFrame` for each of the groups from the `DataFrameGroupBy` object and call the [`boxplot` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) to create the plot.
* Use the names in the titles of the plot.
%% Cell type:code id:7ae51b08 tags:
``` python
df = utils.download_IRIS()
```
%% Cell type:code id:801dd946 tags:
``` python
grouped_by_species = df.groupby(by=["Name"])
```
%% Cell type:code id:90b8c49e tags:
``` python
fig, axs = plt.subplots(1,3)
for ax, (name, group_data) in zip(axs,grouped_by_species):