Commit 3e8538d4 authored by Ulrich Kerzel

selections in pandas dataframes

parent 477152ab
......@@ -236,7 +236,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [
{
......
......@@ -31,7 +31,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"metadata": {},
"outputs": [
{
......@@ -40,7 +40,7 @@
"7"
]
},
"execution_count": 1,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
......
%% Cell type:markdown id: tags:
# Flow Control
So far we have encountered variables, basic data types (such as int, float, string, bool) and more complex data structures (list, dictionaries, etc.).
To write useful programs, we need to check for conditions and act depending on whether or not a condition is met, or perform some calculations repeatedly.
Therefore, we now look at the elements that control the flow of our programs:
* ```if``` statements
* ```for``` loops
* ```while``` loops
* iterating over lists or dictionaries, etc.
## Checking for conditions
### If statement
The ``if`` statement is the simplest way to conditionally execute some code (or not).
The general syntax is:
```
if condition:
    action
```
The ```condition``` needs to evaluate to either ```True``` or ```False```, i.e. a boolean value.
For two variables ```a``` and ```b```, the fundamental comparisons are:
* equal: ```a == b```.
Note the double ```==```: a single ```=``` is used for assignments, so we need the double ```==``` for a comparison
* not equal: ```a != b```
* less than: ```a < b```, or less than or equal: ```a <= b```
* greater than: ```a > b```, or greater than or equal: ```a >= b```
%% Cell type:code id: tags:
``` python
a = 1
b = 2
# First we can look what the condition might be:
print(a > b)
# now do a conditional:
if a > b:
    print('a is greater than b')
print('--------')
```
%% Output
False
--------
%% Cell type:markdown id: tags:
... where we note that the statement in the conditional was not executed.
We also note a key concept in Python: the code in the conditional statement is ***indented***. Unlike other programming languages, Python does not use, for example, brackets to indicate which parts of the code belong together, but indentation.
> **Note**
>
> Code that belongs together has the same level of indentation.
This has the benefit that the code is much more readable, as it forces us to write the code such that parts that belong together also have the same level of indentation. It is also a source of confusing bugs if we accidentally get the indentation wrong...
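As a small sketch of how indentation decides what belongs to the conditional, the example below records which statements run. The list ```messages``` is a hypothetical helper introduced only for this illustration.

```python
a = 1
b = 2

messages = []  # hypothetical helper list, only used to record what runs

if a > b:
    messages.append('inside the if-block')       # indented: runs only if a > b
    messages.append('also inside the if-block')  # same level: same block
messages.append('after the if-block')            # not indented: always runs

print(messages)
```

Since ```a > b``` is ```False``` here, only the statement that is not indented is executed. Indenting the last line by four spaces would make it part of the conditional as well.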
To be more flexible, we can test for more than one condition using ```elif``` (else-if) and then finally ```else``` as a "catch-all" for all conditions that we have not met so far.
%% Cell type:code id: tags:
``` python
a = 10
if a > 100:
    print('a is greater than 100')
elif a > 50:
    print('a is greater than 50')
elif a > 10:
    print('a is greater than 10')
else:
    print('none applies')
```
%% Output
none applies
%% Cell type:code id: tags:
``` python
# if we only have one condition to test, we can write a short one-liner
a = 1
b = 2
print ('a is greater than b') if a > b else print ('b is greater than a')
```
%% Output
b is greater than a
%% Cell type:markdown id: tags:
We can also have more than one condition and combine them using ```and``` , ```or```.
%% Cell type:code id: tags:
``` python
a = 10
b = 15
if (a > 10) and (b < 20):
    print('condition met')
```
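The cell above prints nothing, because ```a > 10``` is ```False``` for ```a = 10```. As a brief sketch with the same values, we can compare ```and``` with ```or```. The ```not``` operator, which inverts a boolean value, is shown as well, although it has not been introduced in the text so far.

```python
a = 10
b = 15

# 'and' requires both conditions to hold; a > 10 is False here
both = (a > 10) and (b < 20)
# 'or' requires at least one condition to hold; b < 20 is True
either = (a > 10) or (b < 20)
# 'not' inverts a boolean value
neither = not either

print(both, either, neither)
```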
%% Cell type:markdown id: tags:
We can also nest ```if``` statements, i.e. have ```if``` statements within ```if``` statements.
**Exercise**
Write a nested ```if``` statement checking if the value of the variable ```a``` is above 25, and if yes, if it is also above 30 or not.
%% Cell type:code id: tags:
``` python
a = 27
# ... your code here ...
```
%% Cell type:markdown id: tags:
---
## for loops
Quite often we want to execute the same code a fixed number of times. For example, we want to execute the code five times or we want to look at all elements of a list, a dictionary or even a string.
In this case, we can use the ```for``` loop.
The general syntax is
```
for variable in list:
    do something
else:
    do something else
```
(where typically we do not need the ```else``` statement.)
Again, note the indentations that define which part of the code belongs together.
If we want to run the code a certain number of times, we can use the ```range(start, stop, step)``` function, where ```start``` specifies the number we want to start from, ```stop``` the final number (excluding this value) and ```step``` the step size. The step size is optional and assumed to be 1 if we do not specify it.
%% Cell type:code id: tags:
``` python
for i in range(0, 5, 1):
    print('The value of i is now {}'.format(i))
```
%% Output
The value of i is now 0
The value of i is now 1
The value of i is now 2
The value of i is now 3
The value of i is now 4
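To illustrate the optional arguments of ```range```, the following sketch converts a few ranges to lists so that we can see all values at once. The variable names are chosen for this illustration only.

```python
# Only 'stop' given: start defaults to 0 and step to 1
only_stop = list(range(5))
# 'start' and 'stop' given: step defaults to 1
start_stop = list(range(2, 5))
# A step size of 2 skips every other value
every_second = list(range(0, 10, 2))
# A negative step counts downwards; 'stop' is still excluded
countdown = list(range(5, 0, -1))

print(only_stop)
print(start_stop)
print(every_second)
print(countdown)
```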
%% Cell type:code id: tags:
``` python
# We can iterate over a string as well:
my_string = 'I love python'
for i in my_string:
    print(i)
```
%% Output
I
l
o
v
e
p
y
t
h
o
n
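Lists and dictionaries can be iterated over in the same way. The sketch below uses made-up example data (```fruits``` and ```prices```). It also uses ```enumerate``` to obtain the position of each element, and ```.items()``` to obtain the key and the value of each dictionary entry together.

```python
fruits = ['apple', 'banana', 'cherry']  # made-up example data

# Iterating over a list yields its elements one by one
for fruit in fruits:
    print(fruit)

# enumerate() additionally yields the position of each element
for position, fruit in enumerate(fruits):
    print('{}: {}'.format(position, fruit))

prices = {'apple': 0.5, 'banana': 0.25}  # made-up example data

# Iterating over a dictionary yields its keys
for name in prices:
    print(name)

# .items() yields (key, value) pairs
for name, price in prices.items():
    print('{} costs {}'.format(name, price))
```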
%% Cell type:markdown id: tags:
In some cases we may need to end the execution of the ```for``` loop early. There are two ways to do this:
* ```break```: this exits the loop and returns to the code outside the loop
* ```continue```: do not execute the rest of the current iteration in the loop but start with the next iteration
%% Cell type:code id: tags:
``` python
# Note here that we go through all values in the loop (0,1,2,3,4) but only print if the value is not equal to 2
for i in range(0, 5, 1):
    if i == 2:
        continue
    print('The value of i is now {}'.format(i))
```
%% Output
The value of i is now 0
The value of i is now 1
The value of i is now 3
The value of i is now 4
%% Cell type:code id: tags:
``` python
# Note here we abort the loop when we reach the value 2
for i in range(0, 5, 1):
    if i == 2:
        break
    print('The value of i is now {}'.format(i))
```
%% Output
The value of i is now 0
The value of i is now 1
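The ```else``` clause from the general ```for``` syntax is closely related to ```break```: it is executed only if the loop runs to the end without hitting a ```break```. The following is a minimal sketch. The function ```first_negative``` is a hypothetical helper written for this illustration.

```python
def first_negative(values):
    """Return the first negative number in values, or None if there is none."""
    for value in values:
        if value < 0:
            result = value
            break  # leave the loop early; the else-clause is skipped
    else:
        # Only reached if the loop finished without a break
        result = None
    return result

print(first_negative([3, -1, 4]))
print(first_negative([1, 2, 3]))
```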
%% Cell type:markdown id: tags:
### Exercise: Fibonacci Series
The [Fibonacci Numbers](https://en.wikipedia.org/wiki/Fibonacci_number) are a sequence where the current number is the sum of the two preceding ones, i.e. $F_n = F_{n-1} + F_{n-2}$. The first two numbers are $F_1 = 0$ and $F_2 = 1$. Therefore, the next number is $F_3 = 0 + 1 = 1$.
Write a ```for``` loop to compute the first 10 numbers of the Fibonacci series and then print the series.
The output should be:
```
The Fibonacci numbers are: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```
%% Cell type:code id: tags:
``` python
# ... your code here ....
```
%% Cell type:markdown id: tags:
## While Loops
The for loop is pre-defined in the sense that the number of times the loop is executed is defined beforehand. If we loop, for example, over a dictionary or a list, we can access the individual elements and work with them.
However, in many other cases we want to continue the execution until a suitable condition is met. In these cases, we use a ```while``` loop.
The general syntax is:
```
while <Statement is true>:
    do something
else:
    do something else
```
As with the ```for``` loop or the ```if``` statement, the ```else``` clause is optional.
%% Cell type:code id: tags:
``` python
i = 0
while i <= 10:
    print('The value of i is now {}'.format(i))
    i = i + 1
```
%% Output
The value of i is now 0
The value of i is now 1
The value of i is now 2
The value of i is now 3
The value of i is now 4
The value of i is now 5
The value of i is now 6
The value of i is now 7
The value of i is now 8
The value of i is now 9
The value of i is now 10
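The counting loop above could also be written as a ```for``` loop. A ```while``` loop is more natural when we do not know in advance how many iterations are needed. As a sketch with made-up numbers, we halve a value until it drops below 1 and count the steps. The optional ```else``` clause runs once the condition has become ```False```.

```python
value = 100.0  # made-up starting value
steps = 0

while value >= 1:
    value = value / 2
    steps = steps + 1
else:
    # Executed once the condition is False (and no break occurred)
    print('The condition is no longer met')

print('Needed {} steps, final value {}'.format(steps, value))
```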
%% Cell type:markdown id: tags:
Again, we can terminate the execution of the loop early with ```break``` and ```continue```.
***Note***
It is quite important to think about what will happen to the loop if we use these statements.
For example, in the code below we first increase ```i```, then do the check and then print the value, whereas above we first printed the value and then increased it by 1. We observe that, indeed, the value 2 is not printed, but the loop now prints the values 1,...,11 instead of 0,...,10. However, if we were to place the statements in a different order, we would find that either there is no effect (the execution only skips over what comes after ```continue```) or we end up with an infinite loop.
Using these statements can be quite tricky and you may introduce subtle bugs or unwanted behaviour with them...
***Exercise***
Try and see what the effect is of using a different order of statements inside the loop.
%% Cell type:code id: tags:
``` python
i = 0
while i <= 10:
    i = i + 1
    if i == 2:
        continue
    print('The value of i is now {}'.format(i))
```
%% Output
The value of i is now 1
The value of i is now 3
The value of i is now 4
The value of i is now 5
The value of i is now 6
The value of i is now 7
The value of i is now 8
The value of i is now 9
The value of i is now 10
The value of i is now 11
%% Cell type:code id: tags:
``` python
i = 0
while i <= 10:
    print('The value of i is now {}'.format(i))
    i = i + 1
    if i == 2:
        break
```
%% Output
The value of i is now 0
The value of i is now 1
%% Cell type:markdown id: tags:
### Exercise
Rewrite the Fibonacci Series as a ```while``` loop, terminating after the list is 10 elements long.
(Alternatively, until the series arrives at the value of 34.)
%% Cell type:code id: tags:
``` python
# ... your code here ...
```
......
......@@ -84,7 +84,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 4,
"metadata": {},
"outputs": [
{
......
%% Cell type:markdown id: tags:
# Pandas
[Pandas](https://pandas.pydata.org) is one of the most important modules in the toolkit of a Data Scientist.
It focuses on the analysis of timeseries or structured data, i.e., data that can be represented by a sequence of events or a table.
Pandas provides a wide range of convenient functions to access, manipulate, and analyse such data.
As with the other modules, there is a commonly used abbreviation and we typically use ```pd``` for pandas.
More importantly, the underlying data format called "dataframe" has become the de-facto standard to exchange structured data, and many packages take dataframes as input.
In the following, we will use the iris dataset, which is one of the simplest datasets that are frequently used to demonstrate data science approaches.
It contains data about 3 different types of the [iris flower](http://en.wikipedia.org/wiki/Iris_(plant)):
* Setosa,
* Versicolour and
* Virginica
The dataset was [originally introduced](http://en.wikipedia.org/wiki/Iris_flower_data_set) by Sir Ronald Fisher in 1936 as an example for discriminant analysis and contains the following features (measured in cm):
* Sepal Length,
* Sepal Width,
* Petal Length and
* Petal Width.
The data are often included as demo datasets in various data science packages and are also available on public repositories such as the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris) entry on the UCI Machine Learning Repository. In this case, we use the copy from the [data archive in Seaborn](https://github.com/mwaskom/seaborn-data), which is a copy of the UCI repository, but with some added information such as a description, in the data, of what the columns mean.
The data looks like:
```
,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
...
```
The columns have the following meaning:
* The first column (without name) is a running index
* Then there are four columns with descriptive variables ("features")
* The final variable is the species of the iris flower.
The data are stored in CSV format (comma separated values) which is very common for small (-ish) structured data. For larger files, we would typically use more efficient file formats such as [Apache Parquet](https://parquet.apache.org/)
In a first step, we read the contents of the file and store the data in a new dataframe. As mentioned when we first worked with files in Python, we do not *usually* do this manually, but use one of the many convenient functions that are already provided. In our case, Pandas knows how to read CSV files using the [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.
Since we know that the first column (starting to count from zero) is an index, we tell pandas this.
%% Cell type:code id: tags:
``` python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
iris_df = pd.read_csv('iris.csv', index_col=0)
```
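To see what ```index_col=0``` changes without having the file at hand, the sketch below reads the first lines of the data shown above from a string instead of from ```iris.csv```. The use of ```io.StringIO``` as a stand-in for the file is an assumption made only for this illustration.

```python
import io

import pandas as pd

# The first lines of the CSV as shown above, held in memory instead of a file
csv_text = """,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
"""

# With index_col=0, the unnamed first column becomes the index of the dataframe
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
# Without it, pandas keeps that column as regular data and creates a new index
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))
print(list(without_index.columns))
```

Without ```index_col```, the running index from the file shows up as an extra data column with a placeholder name, which we would then have to remove or ignore later.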
%% Cell type:markdown id: tags:
We can then print the first few lines of the dataframe to become familiar with its structure
%% Cell type:code id: tags:
``` python
iris_df.head()
```
%% Output
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
%% Cell type:markdown id: tags:
We can look at the frequency with which a certain value appears for one of the variables. \
These frequency plots are called ```histograms```.
For illustration, we use two variables in plots next to each other. \
This means that the "axis" in which we interact with the plot is no longer a single variable, but now an array. The left (first) plot can be accessed with ```ax[0]```, and correspondingly for the right plot ```ax[1]```. When making the plot, we then tell Seaborn where to put the histogram.
%% Cell type:code id: tags:
``` python
fig, ax = plt.subplots(ncols=2,figsize=(7, 2))
sns.histplot(data=iris_df, x='sepal_length', ax = ax[0])
sns.histplot(data=iris_df, x='petal_length', ax = ax[1])
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
Note that Seaborn added "count" on the y-axis automatically as we count the frequency of occurrences in a histogram.
We would probably want to know how this depends on the species of the iris flower. \
Here, Seaborn becomes more convenient to use - we can do the same thing with matplotlib but it is not quite as convenient. By adding the parameter ```hue```, Seaborn splits the histogram by the type of flower we consider in our data, adds a separate colour to each of them and adds a legend explaining what is what.
%% Cell type:code id: tags:
``` python
fig, ax = plt.subplots(ncols=1,figsize=(7, 2))
sns.histplot(data=iris_df, x='sepal_length', hue='species')
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
We also want to know how two variables behave when we compare them to each other. We can do this with a "scatterplot" where each data-point from two variables is "scattered" (hence the name) in the x-y plane. We can also use ```relplot``` for the same effect (but with different options).
If we have a large dataset, we can use ```histplot``` again and pass two variables for ```x``` and ```y```. Then, we do not plot individual data points but two-dimensional histograms.
%% Cell type:code id: tags:
``` python
fig, ax = plt.subplots(ncols=1,figsize=(5, 5))
sns.scatterplot(data=iris_df, x="sepal_length", y="sepal_width", hue='species')
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
We will also want to look at all combinations of variables. Seaborn provides a convenient function for this, the ```pairplot```. This is a matrix of all combinations of scatterplots from all (or: a selection of) variables. We place the frequency plot (histogram) of each variable on the diagonal. \
For large datasets, we can use ```kind='hist'``` to use histograms instead of scatterplots. \
Using the ```height``` parameter, we can set the height of each individual plot and use this to control the overall size.
For more details, see the [pairplot documentation](https://seaborn.pydata.org/generated/seaborn.pairplot.html).
%% Cell type:code id: tags:
``` python
sns.pairplot(data=iris_df, hue="species", kind='scatter', diag_kind='hist', height = 1.5)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
# Selections
In many cases, we want to select a sub-part of the data and work with this.
We can access individual variables by using their name as strings:
%% Cell type:code id: tags:
``` python
iris_df['petal_width']
```
%% Output
0 0.2
1 0.2
2 0.2
3 0.2
4 0.2
...
145 2.3
146 1.9
147 2.0
148 2.3
149 1.8
Name: petal_width, Length: 150, dtype: float64
%% Cell type:markdown id: tags:
We can then write statements that evaluate to ```True``` or ```False``` to make selections.
The general syntax is:
```
df[ df['variable_name'] <comparison>]
```
%% Cell type:markdown id: tags:
I.e. we access the dataframe (the "outer" ```df[...]```) and then operate on it with a statement that evaluates to ```True``` or ```False```.
In this statement, we can then again use the variables with conditions.
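The following sketch makes the two steps visible on a small, made-up dataframe (```small_df``` and its values are hypothetical): the comparison produces one ```True``` or ```False``` per row, and indexing the dataframe with that result keeps only the rows marked ```True```.

```python
import pandas as pd

# Made-up toy data, only for illustration
small_df = pd.DataFrame({
    'petal_width': [0.2, 0.2, 2.3, 1.9],
    'label': ['a', 'b', 'c', 'd'],
})

# Step 1: the comparison is evaluated for every row
mask = small_df['petal_width'] < 1
print(mask)

# Step 2: the dataframe keeps the rows where the mask is True
selected = small_df[mask]
print(selected)
```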
If we want to combine different statements (that each evaluate to ```True``` or ```False```), we use the element-wise operators ```&``` (and) or ```|``` (or) instead of the Python keywords ```and``` and ```or```. Note that if we do this, we should put each statement in brackets, e.g. ```(statement 1) & (statement 2)```.
Example: we select the "blue" (setosa) species by requiring that the sepal length is less than 6 and at the same time the petal width is less than 1.
%% Cell type:code id: tags:
``` python
df_selected = iris_df[ (iris_df['sepal_length'] < 6) & (iris_df['petal_width'] < 1) ]
df_selected.head()
```
%% Output
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
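As a complement to the ```&``` selection above, the sketch below uses ```|``` (or) on a small, made-up dataframe (```toy_df``` and its values are hypothetical). It also uses ```~``` to negate a condition, which is not covered in the text above.

```python
import pandas as pd

# Made-up toy data, only for illustration
toy_df = pd.DataFrame({
    'sepal_length': [5.1, 6.4, 7.0, 4.9],
    'petal_width': [0.2, 1.5, 1.4, 0.2],
})

# Rows where at least one of the two conditions holds
either = toy_df[(toy_df['sepal_length'] > 6.5) | (toy_df['petal_width'] < 1)]

# Rows where the condition does NOT hold
negated = toy_df[~(toy_df['petal_width'] < 1)]

print(either)
print(negated)
```

The brackets around each condition matter because ```&``` and ```|``` bind more tightly than comparisons such as ```<```. Without them, Python would try to evaluate the operator before the comparisons.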
%% Cell type:code id: tags:
``` python
sns.pairplot(data=df_selected, hue="species", kind='scatter', diag_kind='hist', height = 1.5)
plt.show()
```
%% Output
......