"In this simple example we investigate how the way we record data can have a large impact on the conclusions we can draw from it.\n",
"\n",
"We use two simple dice as an example and simulate rolling them with NumPy's random number generator: ```np.random.choice``` returns one of the values listed in its first argument at random."
],
"metadata": {
"id": "gCrmxVOcGaD6"
}
]
},
{
"cell_type": "code",
...
...
@@ -42,27 +28,20 @@
},
{
"cell_type": "markdown",
"metadata": {
"id": "dL0XzGaak6DB"
},
"source": [
"# Two (simple) Dice\n",
"\n",
"Imagine we have two six-sided dice and roll them.\n",
"Whether we roll them one after the other or both at the same time,\n",
"we expect each number to appear with the same frequency and the two dice to be independent of each other."
"Now we do the same thing but only record the numbers if at least one of the dice shows a 5 or a 6."
],
"metadata": {
"id": "TVfjYZFd1k1A"
}
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"id": "fAlvEKJM17AL"
},
"outputs": [],
"source": [
"dice_1 = []\n",
"dice_2 = []\n",
...
...
@@ -180,34 +171,27 @@
"    if d_1 == 5 or d_1 == 6 or d_2 == 5 or d_2 == 6:\n",
"        dice_1.append(d_1[0])\n",
"        dice_2.append(d_2[0])"
],
"metadata": {
"id": "fAlvEKJM17AL"
},
"execution_count": 79,
"outputs": []
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "iUXf1azM20LE"
},
"source": [
"Now we compute the correlation between the two sets of numbers.\n",
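"The two numbers are now clearly correlated, and the correlation is negative: if one die shows a low number, the other one must show a 5 or a 6 for the roll to be recorded.\n",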
"\n",
"This is not surprising, because we changed the way we recorded the data:\n",
"We only record the outcomes of the two dice if at least one of them shows a 5 or a 6.\n",
"Hence, we expect that we observe these two numbers in the final set of recorded values much more frequently than any other value - and we do see this in the plot below.\n",
"All other numbers occur as well - but we do not observe them as frequently as we see the numbers 5 and 6."
"When we think about this in sequence (we change the way we record the data, and then we observe a change in the outcome), it seems obvious what is happening and why.\n",
"\n",
...
...
@@ -305,10 +295,21 @@
"With our knowledge \"behind the scenes\" we know that these effects are just artefacts from the way we obtain the data.\n",
"\n",
"This highlights the importance of understanding not only where the data we use comes from, but also how it was recorded and which potential issues may arise from this setup."
]
}
],
"metadata": {
"id": "9cBsr_lGFAT4"
}
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
]
},
"nbformat": 4,
"nbformat_minor": 0
}
%% Cell type:markdown id: tags:
# Spurious Correlation
In this simple example we investigate how the way we record data can have a large impact on the conclusions we can draw from it.
We use two simple dice as an example and simulate rolling them with NumPy's random number generator: ```np.random.choice``` returns one of the values listed in its first argument at random.
%% Cell type:code id: tags:
```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```
%% Cell type:markdown id: tags:
# Two (simple) Dice
Imagine we have two six-sided dice and roll them.
Whether we roll them one after the other or both at the same time,
we expect each number to appear with the same frequency and the two dice to be independent of each other.
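As a quick check of this expectation, the following minimal sketch rolls both dice many times and keeps every roll. The names `n_rolls`, `all_1` and `all_2`, the fixed seed and the sample size are assumptions made for this illustration and do not come from the notebook.
%% Cell type:code id: tags:
```python
import numpy as np

np.random.seed(0)          # fixed seed, assumed here only for reproducibility
faces = [1, 2, 3, 4, 5, 6]
n_rolls = 100_000          # hypothetical sample size

# Roll both dice n_rolls times and keep every single roll
all_1 = np.random.choice(faces, size=n_rolls)
all_2 = np.random.choice(faces, size=n_rolls)

# Relative frequency of each face of the first die
freq_1 = np.bincount(all_1, minlength=7)[1:] / n_rolls

# Pearson correlation between the two dice
corr_all = np.corrcoef(all_1, all_2)[0, 1]

print(freq_1)
print(corr_all)
```
%% Cell type:markdown id: tags:
With this many rolls, each relative frequency should lie close to 1/6 and the correlation close to 0.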
Now we do the same thing but only record the numbers if at least one of the dice shows a 5 or a 6.
%% Cell type:code id: tags:
```
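n_random = 10000  # number of simulated rolls; assumed value, reuse the notebook's own if defined earlier

dice_1 = []
dice_2 = []
for i in range(n_random):
    # roll each die once; np.random.choice returns a length-1 array here
    d_1 = np.random.choice([1, 2, 3, 4, 5, 6], 1)[0]
    d_2 = np.random.choice([1, 2, 3, 4, 5, 6], 1)[0]
    # record the roll only if at least one die shows a 5 or a 6
    if d_1 >= 5 or d_2 >= 5:
        dice_1.append(d_1)
        dice_2.append(d_2)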
```
%% Cell type:markdown id: tags:
Now we compute the correlation between the two sets of numbers.
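The two numbers are now clearly correlated, and the correlation is negative: if one die shows a low number, the other one must show a 5 or a 6 for the roll to be recorded.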
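The size of this effect can also be worked out exactly, without any simulation. The sketch below enumerates all 36 equally likely outcomes of two fair dice, keeps only those that the recording rule would keep, and computes the Pearson correlation over them. The names `outcomes`, `recorded` and `exact_corr` are chosen for this illustration only.
%% Cell type:code id: tags:
```python
import itertools
import numpy as np

# All 36 equally likely outcomes of two fair six-sided dice
outcomes = np.array(list(itertools.product(range(1, 7), repeat=2)))

# Keep only the outcomes that would be recorded:
# at least one of the two dice shows a 5 or a 6
recorded = outcomes[(outcomes >= 5).any(axis=1)]

# Pearson correlation over the recorded outcomes (all equally likely)
exact_corr = np.corrcoef(recorded[:, 0], recorded[:, 1])[0, 1]

print(len(recorded))
print(round(exact_corr, 3))
```
%% Cell type:markdown id: tags:
The enumeration keeps 20 of the 36 outcomes and gives a correlation of about -0.51, so a value computed from simulated rolls should fluctuate around this number.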
This is not surprising, because we changed the way we recorded the data:
We only record the outcomes of the two dice if at least one of them shows a 5 or a 6.
Hence, we expect that we observe these two numbers in the final set of recorded values much more frequently than any other value - and we do see this in the plot below.
All other numbers occur as well - but we do not observe them as frequently as we see the numbers 5 and 6.
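To put numbers on this, note that of the 20 recorded outcomes, a given die shows a 5 in six of them, a 6 in six of them, and each of the numbers 1 to 4 in only two. The following sketch checks these shares (0.3 for 5 and 6, 0.1 for each other number) by simulation. The vectorised selection and the names `roll_1`, `roll_2`, `keep` and `face_freq` are assumptions made for this illustration, not the notebook's own code.
%% Cell type:code id: tags:
```python
import numpy as np

np.random.seed(1)          # fixed seed, assumed here only for reproducibility
faces = [1, 2, 3, 4, 5, 6]
n_rolls = 100_000          # hypothetical sample size

roll_1 = np.random.choice(faces, size=n_rolls)
roll_2 = np.random.choice(faces, size=n_rolls)

# Recording rule: keep a roll only if at least one die shows a 5 or a 6
keep = (roll_1 >= 5) | (roll_2 >= 5)
kept_1 = roll_1[keep]

# Share of each face among the recorded values of the first die
face_freq = np.bincount(kept_1, minlength=7)[1:] / kept_1.size

print(face_freq)
```
%% Cell type:markdown id: tags:
The same shares hold for the second die, since the recording rule treats both dice symmetrically.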
When we think about this in sequence (we change the way we record the data, and then we observe a change in the outcome), it seems obvious what is happening and why.
However, in other situations, we start from a very different premise:
We obtain the data (or are given the data) and we may not know how the data were acquired.
For example, imagine a (novice) data scientist who is given the data we have generated above and is tasked with building a prediction model or using machine learning based on these data. In this case, they could easily conclude that the two variables are correlated and that their distributions are skewed towards high values.
With our knowledge "behind the scenes" we know that these effects are just artefacts from the way we obtain the data.
This highlights the importance of understanding not only where the data we use comes from, but also how it was recorded and which potential issues may arise from this setup.