"As we can see, some variables are strongly correlated with each other but only three variables have a high correlation with out target: volatile acidity sulphates and alcohol. We will therefore suspect that these variables are the most important for the model."
"As we can see, some variables are strongly correlated with each other but only three variables have a high correlation with out target: volatile acidity, sulphates, and alcohol. We will therefore suspect that these variables are the most important for the model."
]
},
{
...
...
@@ -622,7 +622,7 @@
"metadata": {},
"source": [
"Then, we obtain the predictions on the test data for the model.\n",
"For convenience, we make a copy of the test data and append the predictions, together with the true values, as the last column to the dataframe"
"For convenience, we make a copy of the test data and append the predictions, together with the true values, as the last column to the dataframe."
]
},
{
...
...
%% Cell type:markdown id: tags:
# Regression
In this example, we use the [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality) for a first, simple regression task.
These data consist of a range of features, such as acidity, sugar level, pH value, and alcohol content, and the aim is to predict a quality score in the range (0, 10).
The data originate from the paper: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009. [Wine paper](https://www.sciencedirect.com/science/article/pii/S0167923609001377)
Note that the score is based on integer values - therefore, we could treat this either as a regression problem (like we do here) or as a classification problem with (up to) 10 different classes. The original published paper considered a regression approach - we will therefore follow in their footsteps and treat this problem as a regression task.
We focus on red wines here in this example - feel free to explore the white wines as an exercise.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
```
%% Cell type:markdown id: tags:
# Data
We can load the data directly from the public repository (or from the local backup).
First, we want to get a feel for the data and look at the various variables.
In this case, all variables are already in numeric format.
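A minimal sketch of the loading step: the UCI file is semicolon-separated, so `read_csv` needs `sep=';'`. A few hypothetical sample rows stand in here for the repository URL (or local backup path), so the snippet is self-contained:

``` python
from io import StringIO

import pandas as pd

# Hypothetical sample rows illustrating the semicolon-separated format of the
# wine-quality file; in the notebook, pass the repository URL (or the local
# backup path) to read_csv instead of this StringIO.
sample = StringIO(
    "fixed acidity;volatile acidity;alcohol;quality\n"
    "7.4;0.70;9.4;5\n"
    "7.8;0.88;9.8;5\n"
)
df = pd.read_csv(sample, sep=';')  # the UCI file uses ';' as separator
print(df.dtypes)  # all columns are already numeric
```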
## Univariate analysis
First, we look at some individual variables to get a feel for their distribution.
Let's start by looking at the target variable ("quality") - as we can see, not all scores are used: there are very few good or bad wines, and most wines are average.
***Exercise***
Explore some more variables
%% Cell type:code id: tags:
``` python
g=sns.histplot(data=df,x='quality',bins=20)
plt.ylabel('count',size=20)
plt.xlabel('Wine Quality',size=20)
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
##
## explore some more variables
##
```
%% Cell type:markdown id: tags:
## Explore correlations
Machine learning essentially exploits linear and non-linear correlations between the features and the target (label).
Next, we want to get a feel for how the variables are correlated.
We can first print the correlation table of the variables and then plot the behaviour of individual pairs of variables.
To look at all combinations (or a subset), we can use the [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) from Seaborn, or plot a few individual combinations.
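The correlation table itself can be sketched as follows; a small synthetic `df` (random alcohol values plus a noisy, rounded score) stands in here so the snippet is self-contained, while in the notebook `df` is the loaded red-wine data:

``` python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the red-wine dataframe, so the snippet runs on
# its own; in the notebook, df is the loaded data.
rng = np.random.default_rng(0)
alcohol = rng.normal(10.4, 1.0, 200)
quality = np.round(0.5 * alcohol + rng.normal(0.0, 0.5, 200))
df = pd.DataFrame({'alcohol': alcohol, 'quality': quality})

# Correlation of every variable with the target, sorted by strength.
corr = df.corr()
print(corr['quality'].abs().sort_values(ascending=False))
```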
As we can see, some variables are strongly correlated with each other, but only three variables have a high correlation with our target: volatile acidity, sulphates, and alcohol. We therefore suspect that these variables are the most important for the model.
***Exercise***
Explore some more correlations between the variables and the target.
%% Cell type:code id: tags:
``` python
##
## explore some more correlations using plots for two variables like above or a pairplot
##
```
%% Cell type:markdown id: tags:
A contour plot in the pair-plot for the 3 main variables shows some dependency - but overall, a visual inspection shows that there is no obvious 1:1 relationship.
%% Cell type:markdown id: tags:
# Machine Learning
We now use a machine learning model to predict the quality index.
The model is trained on the training data and then evaluated on predictions made from the independent test data.
We follow the typical Scikit-Learn approach of:
- create an instance of the model
- call the ```fit``` method for the training data
- call the ```predict``` method for the test data.
As an example, we will use the [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
However, first we need to define the training/test data, the features and the labels.
The target (label) is the last column (quality) in the data-frame.
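The three steps above can be sketched as follows. The dataframe here is a hypothetical stand-in (random columns named after the three strongest features, with the target as the last column), since the real `df` is built earlier in the notebook:

``` python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in the notebook, df is the red-wine dataframe
# and the target ('quality') is its last column.
rng = np.random.default_rng(42)
X_all = pd.DataFrame(rng.normal(size=(300, 3)),
                     columns=['volatile acidity', 'sulphates', 'alcohol'])
y_all = 2.0 * X_all['alcohol'] + rng.normal(0.0, 0.3, 300)

# Split into independent training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)  # 1. create the model
model.fit(X_train, y_train)     # 2. fit on the training data
y_pred = model.predict(X_test)  # 3. predict on the test data
```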
We now look at the residuals of the predictions. Ideally, we would expect "Gaussian noise": We do expect that the predictions are not perfect, but there should be no trend and the deviations from the true values (in the test data) should be both as small as possible, as well as randomly distributed.
We can also do a profile plot that illustrates how well the predictions work in various regions. Ideally, the data points on the model should be compatible with the diagonal line for all predictions.
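Both checks can be sketched as follows, using hypothetical stand-in arrays in place of the notebook's test-set truth and predictions (here the "predictions" deviate from the truth by near-Gaussian noise, so the residuals behave as we would hope):

``` python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the notebook's y_test and predictions.
rng = np.random.default_rng(1)
y_true = rng.integers(3, 9, 200).astype(float)
y_pred = y_true + rng.normal(0.0, 0.5, 200)
residuals = y_pred - y_true

# Residual histogram: should be centred on zero with no obvious skew.
plt.hist(residuals, bins=30)
plt.xlabel('prediction - true value')
plt.ylabel('count')
plt.show()

# Profile-style check: predictions against true values should scatter
# around the diagonal in all regions.
plt.scatter(y_true, y_pred, s=10)
plt.plot([3, 9], [3, 9], 'k--')  # diagonal reference line
plt.xlabel('true quality')
plt.ylabel('predicted quality')
plt.show()

print(f"mean residual: {residuals.mean():.3f}, std: {residuals.std():.3f}")
```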