%% Cell type:markdown id: tags:
# Regression
In this example, we use the [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality) for a first, simple regression task.
These data consist of a range of features, such as acidity, sugar level, pH value, and alcohol content, and the aim is to predict a quality score between 0 and 10.
The data originate from the paper: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009. [Wine paper](https://www.sciencedirect.com/science/article/pii/S0167923609001377)
Note that the score takes integer values - we could therefore treat this either as a regression problem (as we do here) or as a classification problem with (up to) 10 different classes. The original paper considered a regression approach - we will therefore follow in their footsteps and treat this problem as a regression task.
We focus on red wines in this example - feel free to explore the white wines as an exercise.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
```
%% Cell type:markdown id: tags:
# Data
We can load the data directly from the public repository (or from the local backup).
%% Cell type:code id: tags:
``` python
df = pd.read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
sep=';')
df.head(5)
```
%% Output
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
3 11.2 0.28 0.56 1.9 0.075
4 7.4 0.70 0.00 1.9 0.076
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 11.0 34.0 0.9978 3.51 0.56
1 25.0 67.0 0.9968 3.20 0.68
2 15.0 54.0 0.9970 3.26 0.65
3 17.0 60.0 0.9980 3.16 0.58
4 11.0 34.0 0.9978 3.51 0.56
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
%% Cell type:code id: tags:
``` python
df.describe()
```
%% Output
fixed acidity volatile acidity citric acid residual sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
chlorides free sulfur dioxide total sulfur dioxide density \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 0.087467 15.874922 46.467792 0.996747
std 0.047065 10.460157 32.895324 0.001887
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 22.000000 0.995600
50% 0.079000 14.000000 38.000000 0.996750
75% 0.090000 21.000000 62.000000 0.997835
max 0.611000 72.000000 289.000000 1.003690
pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 3.311113 0.658149 10.422983 5.636023
std 0.154386 0.169507 1.065668 0.807569
min 2.740000 0.330000 8.400000 3.000000
25% 3.210000 0.550000 9.500000 5.000000
50% 3.310000 0.620000 10.200000 6.000000
75% 3.400000 0.730000 11.100000 6.000000
max 4.010000 2.000000 14.900000 8.000000
%% Cell type:markdown id: tags:
# Exploratory Data Analysis
First, we want to get a feel for the data and look at the various variables.
In this case, all variables are already in numeric format.
## Univariate analysis
First, we look at some individual variables to get a feel for their distribution.
Let's start by looking at the target variable ("quality"). As we can see, not all scores are used; there are very few good or bad wines, and most wines are average.
***Exercise***
Explore some more variables
%% Cell type:code id: tags:
``` python
g = sns.histplot(data=df, x='quality',bins=20)
plt.ylabel('count', size = 20)
plt.xlabel('Wine Quality', size = 20)
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
##
## explore some more variables
##
```
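%% Cell type:markdown id: tags:
As one possible starting point, here is a minimal sketch plotting the distribution of the alcohol content; any other numeric column of the dataframe can be substituted.
%% Cell type:code id: tags:
``` python
# Example: distribution of a single feature (here: the alcohol content).
# Swap 'alcohol' for any other column name in df to explore further.
sns.histplot(data=df, x='alcohol', bins=30)
plt.xlabel('Alcohol content', size=20)
plt.ylabel('count', size=20)
plt.show()
```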
%% Cell type:markdown id: tags:
## Explore correlations
Machine learning essentially exploits linear and non-linear correlations between the features and the target (label).
Next, we want to get a feel for how the variables are correlated.
We can first print the correlation table of the variables and then plot the behaviour of individual pairs of variables.
To look at all combinations (or a subset), we can use the [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) from Seaborn, or plot a few individual combinations.
%% Cell type:code id: tags:
``` python
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt=".1f", cmap='seismic')
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
As we can see, some variables are strongly correlated with each other, but only three variables have a high correlation with our target: volatile acidity, sulphates, and alcohol. We therefore suspect that these variables are the most important for the model.
%% Cell type:code id: tags:
``` python
sns.scatterplot(data=df, x='alcohol', y='volatile acidity', hue='quality' )
plt.xlabel('Alcohol content', size = 20)
plt.ylabel('Volatile acidity', size = 20)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
***Exercise***
Explore some more correlations between the variables and the target.
%% Cell type:code id: tags:
``` python
##
## explore some more correlations using plots for two variables like above or a pairplot
##
```
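%% Cell type:markdown id: tags:
As one possible sketch, we can restrict the pairplot to the three features most correlated with the target to keep the figure readable:
%% Cell type:code id: tags:
``` python
# Pairplot of the three most relevant features, coloured by quality score.
cols = ['volatile acidity', 'sulphates', 'alcohol', 'quality']
sns.pairplot(df[cols], hue='quality')
plt.show()
```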
%% Cell type:markdown id: tags:
A contour plot in the pair-plot for the 3 main variables shows some dependency - but overall, a visual inspection reveals no obvious 1:1 relationship.
%% Cell type:markdown id: tags:
# Machine Learning
We now use a machine learning model to predict the quality index.
The model is trained on the training data and then evaluated on predictions made from the independent test data.
We follow the typical Scikit-Learn approach of:
- create an instance of the model
- call the ```fit``` method for the training data
- call the ```predict``` method for the test data.
As an example, we will use the [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
However, first we need to define the training/test data, the features and the labels.
The target (label) is the last column (quality) in the data-frame.
%% Cell type:code id: tags:
``` python
###
### Your code here
###
train_cols = # ...
label = # ...
X = df[train_cols]
y = df[label]
# split into training and test sample
X_train, X_test, y_train, y_test = # ....
```
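%% Cell type:markdown id: tags:
A minimal sketch of one possible solution, assuming all columns except the last are used as features and the default 75/25 split of ```train_test_split``` (with a fixed ```random_state``` for reproducibility):
%% Cell type:code id: tags:
``` python
# possible solution sketch: all columns except 'quality' are features
train_cols = df.columns[:-1]
label = 'quality'
X = df[train_cols]
y = df[label]
# split into training and test sample (25% test by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```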
%% Cell type:code id: tags:
``` python
#define and fit the model
model = RandomForestRegressor(max_depth=20, random_state=0, n_estimators=15)
model.fit(X_train, y_train)
```
%% Output
RandomForestRegressor(max_depth=20, n_estimators=15, random_state=0)
%% Cell type:markdown id: tags:
Then, we obtain the model's predictions on the test data.
For convenience, we make a copy of the test data and append the predictions, together with the true values, as the last columns of the dataframe.
%% Cell type:code id: tags:
``` python
predictions = X_test.copy()
y_hat = model.predict(predictions)
predictions.loc[:,'y_hat'] = y_hat
predictions.loc[:,'y'] = y_test
predictions.head(5)
```
%% Output
fixed acidity volatile acidity citric acid residual sugar chlorides \
1552 6.3 0.680 0.01 3.7 0.103
1333 9.1 0.775 0.22 2.2 0.079
660 7.2 0.520 0.07 1.4 0.074
1294 8.2 0.635 0.10 2.1 0.073
1339 7.5 0.510 0.02 1.7 0.084
free sulfur dioxide total sulfur dioxide density pH sulphates \
1552 32.0 54.0 0.99586 3.51 0.66
1333 12.0 48.0 0.99760 3.18 0.51
660 5.0 20.0 0.99730 3.32 0.81
1294 25.0 60.0 0.99638 3.29 0.75
1339 13.0 31.0 0.99538 3.36 0.54
alcohol y_hat y
1552 11.3 5.933333 6
1333 9.6 5.333333 5
660 9.6 5.733333 6
1294 10.9 6.066667 6
1339 10.5 6.000000 6
%% Cell type:markdown id: tags:
# Evaluation
### Global metrics
We now look at a few evaluation metrics.
First, we look at some global properties, such as the mean absolute deviation (MAD) and the mean squared error (MSE).
%% Cell type:code id: tags:
``` python
print('MAD: {0:0.2f}, MSE {1:0.2f}'.format(metrics.mean_absolute_error(y_test, y_hat),
metrics.mean_squared_error(y_test, y_hat)))
```
%% Output
MAD: 0.44, MSE 0.39
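%% Cell type:markdown id: tags:
Optionally, the coefficient of determination R² summarises the fit independently of the scale of the target; a minimal sketch using ```metrics.r2_score```:
%% Cell type:code id: tags:
``` python
# R^2 compares the model against a baseline that always predicts the mean:
# 1.0 is a perfect fit, 0.0 is no better than predicting the mean quality.
print('R^2: {0:0.2f}'.format(metrics.r2_score(y_test, y_hat)))
```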
%% Cell type:markdown id: tags:
### Residuals
We now look at the residuals of the predictions. Ideally, we would expect "Gaussian noise": the predictions will not be perfect, but there should be no trend, and the deviations from the true values (in the test data) should be both as small as possible and randomly distributed.
%% Cell type:code id: tags:
``` python
# requires scikit-learn >= 1.2 for PredictionErrorDisplay
display = metrics.PredictionErrorDisplay(y_true=y_test, y_pred=y_hat)
display.plot()
plt.show()
```
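%% Cell type:markdown id: tags:
If ```PredictionErrorDisplay``` is not available in the installed scikit-learn version, a minimal sketch of the same check by hand, plotting the distribution of the residuals:
%% Cell type:code id: tags:
``` python
# residuals: difference between true and predicted quality scores;
# for a well-behaved model these scatter symmetrically around zero
residuals = y_test - y_hat
sns.histplot(x=residuals, bins=30)
plt.xlabel('y - y_hat', size=20)
plt.show()
```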
%% Cell type:markdown id: tags:
### Profile Plot
We can also create a profile plot that illustrates how well the predictions work in various regions. Ideally, the data points should be compatible with the diagonal line for all predictions.
%% Cell type:code id: tags:
``` python
import profile_plot

fig, ax = plt.subplots(1, 1)
out = profile_plot.pplot(y_test, y_hat, n_bins=10, yerr='var', ax=ax)
plt.show()
```
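%% Cell type:markdown id: tags:
Since ```profile_plot``` is a local helper module, here is a minimal sketch of the same idea using only numpy and matplotlib (the exact behaviour of ```profile_plot.pplot``` may differ): bin the true quality scores and compare the mean prediction per bin against the diagonal.
%% Cell type:code id: tags:
``` python
# manual profile plot: mean and spread of the predictions in bins of the
# true quality score; a perfect model would lie on the diagonal
bin_edges = np.arange(2.5, 9.5, 1.0)            # one bin per integer score
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
idx = np.digitize(y_test, bin_edges) - 1
means, stds, cs = [], [], []
for i, c in enumerate(centers):
    sel = y_hat[idx == i]
    if len(sel) > 0:                            # skip empty bins
        means.append(sel.mean())
        stds.append(sel.std())
        cs.append(c)
plt.errorbar(cs, means, yerr=stds, fmt='o', label='mean prediction')
plt.plot([3, 8], [3, 8], 'k--', label='diagonal')
plt.xlabel('true quality', size=20)
plt.ylabel('predicted quality', size=20)
plt.legend()
plt.show()
```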
%% Cell type:markdown id: tags:
Think about what this means for our choice of modelling approach and for the model itself, and what we would need to consider to improve the results.