Commit f0432903 authored by Ulrich Kerzel

add new notebook for lin-reg with sklearn

parent 7c547095
Showing with 1063 additions and 312 deletions
{
"python.analysis.typeCheckingMode": "basic"
}
\ No newline at end of file
@@ -431,7 +431,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.6"
+"version": "3.10.12"
 },
 "vscode": {
 "interpreter": {
@@ -326,8 +326,8 @@
 {
 "data": {
 "text/html": [
-"<!-- http://127.0.0.1:7001/139927338527968/ -->\n",
-"<iframe src=\"http://127.0.0.1:7001/139927338527968/\" width=100% height=800 frameBorder=\"0\"></iframe>"
+"<!-- http://127.0.0.1:7001/140319794450432/ -->\n",
+"<iframe src=\"http://127.0.0.1:7001/140319794450432/\" width=100% height=800 frameBorder=\"0\"></iframe>"
 ]
 },
 "metadata": {},
@@ -369,8 +369,8 @@
 {
 "data": {
 "text/html": [
-"<!-- http://127.0.0.1:7001/139928798281536/ -->\n",
-"<iframe src=\"http://127.0.0.1:7001/139928798281536/\" width=100% height=800 frameBorder=\"0\"></iframe>"
+"<!-- http://127.0.0.1:7001/140319797370000/ -->\n",
+"<iframe src=\"http://127.0.0.1:7001/140319797370000/\" width=100% height=800 frameBorder=\"0\"></iframe>"
 ]
 },
 "metadata": {},
@@ -401,7 +401,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.6"
+"version": "3.10.12"
 },
 "vscode": {
 "interpreter": {
%% Cell type:markdown id: tags:
# Image classification with PyTorch
In this exercise we start to build our first (deep learning) model to analyse images.
We use the famous MNIST data that was originally developed by [Yann LeCun: "THE MNIST DATABASE of handwritten digits". Yann LeCun, Courant Institute, NYU; Corinna Cortes, Google Labs, New York; Christopher J.C. Burges, Microsoft Research, Redmond.](http://yann.lecun.com/exdb/mnist/)
The dataset contains about 70,000 images of hand-written digits 0-9, 60,000 of which are typically used for training and the rest for testing.
The images are normalised to a size of 28x28 pixels and anti-aliased, which leads to the grayscale values that we can see in the data (as opposed to a pure b/w image).
These data have the advantage that they are often included in the tools, and they are small enough that we can train a neural network in reasonable time even without using GPUs.
A few years ago, the fashion company [Zalando](https://zalando.de) released a variant called [fashion-mnist](https://github.com/zalandoresearch/fashion-mnist) that can be used as a "drop-in" replacement, i.e. it has very similar properties: the same number of images for training and testing, the same size in terms of pixels, grayscale images, 10 classes, etc.
However, instead of showing just numbers, clothing items were chosen (trousers, shirts, etc.). The main motivation was that the original MNIST data are too easy for modern systems and can no longer really be used to develop modern algorithms.
However, in our case we stick with the original data, as we want to explore how to set up a system and reduce our dependency on GPUs for now.
We will also make use of the [TorchVision](https://pytorch.org/vision/stable/index.html) library. This provides convenient access to a number of datasets (such as MNIST), but also pre-trained models or commonly used image augmentation methods.
%% Cell type:code id: tags:
``` python
import torch
from torch import nn
from torch.utils.data import DataLoader
import torch.nn.functional as F
import torchvision as tv
import matplotlib.pyplot as plt
import numpy as np
```
%% Output
/home/kerzel/.cache/pypoetry/virtualenvs/datascienceintro-eVBNPtpL-py3.10/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
%% Cell type:markdown id: tags:
# Data access
Now we download the data using the convenience function from TorchVision for the [MNIST](https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html#torchvision.datasets.MNIST) data. It is essentially a wrapper function that downloads the data from Yann LeCun's webpage and adds it to the local directory.
%% Cell type:code id: tags:
``` python
# Download training data from open datasets.
training_data = tv.datasets.MNIST(
root="data/mnist",
train=True,
download=True,
transform=tv.transforms.ToTensor(),
)
# Download test data from open datasets.
test_data = tv.datasets.MNIST(
root="data/mnist",
train=False,
download=True,
transform=tv.transforms.ToTensor(),
)
```
%% Cell type:markdown id: tags:
Next, we need to make these data accessible to PyTorch.
This can be done using the [DataLoader](https://pytorch.org/docs/stable/data.html) provided by PyTorch. This function provides some common functionality when dealing with the data.
In particular, we will need to repeatedly access all images in the dataset and loop over them as the training progresses.
Further, experience shows that it is helpful to split the training data into smaller chunks (or batches) to speed up training. This is more efficient than updating the parameters of the network after each single image, or waiting until we have processed the entire dataset in one training loop (or: epoch).
Here we will use a batch size of 64 (images) which works well - however, this is a free parameter and different values may work better in other situations.
After we have defined the DataLoaders for access to the data, we print an example. As we would normally loop over the data, we set the loop up here as well, but terminate it after the first image.
The array/vector/tensor ```X``` contains the images. Therefore, its dimensions reflect the batches of images we process:
- In each iteration, we train/test the network in batches of 64 images
- We have grayscale images (only one colour channel)
- The picture is 28x28 pixels high/wide
The array/vector/tensor ```y``` contains the corresponding true labels for each image (supervised training).
Note the grayscale values due to the anti-aliasing that was applied when creating the data.
%% Cell type:code id: tags:
``` python
batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

# define the loop over the test data (even if we only use one image)
for X, y in test_dataloader:
    print("Shape of X [N images per batch, # colours, height, width]: ", X.shape)
    print("Shape of y: ", y.shape, y.dtype)

    # print an example
    # since PyTorch tensors are of the form [#Channel, #pix X, #pix Y], we need to
    # transform back into image format
    index = 0
    plt.imshow(tv.transforms.ToPILImage()(X[index]), cmap='Greys')
    print('True label: {}'.format(y[index]))

    break
```
%% Output
Shape of X [N images per batch, # colours, height, width]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
True label: 7
%% Cell type:markdown id: tags:
# Network definition
Before we can train the network, we need to first setup the various components.
In particular, we need to define the architecture of the neural network, as well as the optimiser and other parameters that we use to train the network.
We start by defining the neural network.
In PyTorch, we do so by defining a class inheriting from ```nn.Module```. We need to do two main steps:
- In the ```__init__()``` method: We need to define the components we want to use in our neural network model. Here we can make use of the many building blocks the PyTorch library provides us with and create the respective instances we need.
- In the ```forward()``` method: Here we define how the model is built from the various components defined in the ```__init__()``` method. This defines the *forward* pass of the network (from input to output), hence the name. The backward pass, with the update of the network parameters using back-propagation, is then handled automatically by PyTorch. Essentially, we need to "chain" the components together to define our model: the data (i.e. our images) enter the network through the input layer, pass through all intermediate layers and then reach the output layer, which provides us, for example, with the predictions for the labels (i.e. here, which number the image shows).
We also need to make sure that we move the model to the GPU if we have one available (together with the data).
Generally, when building models working on images we use
- [Convolutional](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) layers for learnable image processing "filters"
- [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers, e.g. for classification of the labels using all the features learned by the previous (convolutional) layers.
- [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) or others as an activation function.
- [DropOut](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) or [Dropout2D](https://pytorch.org/docs/stable/generated/torch.nn.Dropout2d.html) for regularisation.
- [MaxPool2D](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html) as the pooling layer
When defining the model, we have two hard constraints:
- The input layer needs to be able to handle the 28x28 pixel grayscale images
- We need 10 nodes in the output layer (one for each digit 0-9).
How we set the intermediate layers up is, essentially, up to us and this is where the hard work of building and optimising a model comes in.
In a first step, we build a very simple network without convolutional layers. Therefore, we do not expect it to be very good or performant, but it will allow us to "learn the ropes" of how to train a neural network.
Since we only use "fully-connected" (or: "dense") layers, we first need to "flatten" the 2D image information into a long array of 28*28 numbers.
This is then passed into the input layer ```fc1```. We choose ```512``` output nodes for this layer. This is passed into the next layer, where we also choose ```512``` output nodes, before going to the output layer with ```10``` output nodes. In between, we use ```ReLU``` as the activation function.
***Exercise***
Upgrade the model to use convolutional layers, e.g. two ```Conv2d``` layers with a ```(3,3)``` kernel, ```(1,1)``` stride, and 32 and 64 output channels. Then also use ```MaxPool2d``` after the convolutional layers and dropout in the "dense" part of the network for regularisation (a possible sketch is shown after the simple model below).
Note how we use the [Functional](https://pytorch.org/docs/stable/nn.functional.html) API to pass the data through the various elements, e.g. the activation function.
%% Cell type:code id: tags:
``` python
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

# Define a simple model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28*28, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x

model = NeuralNetwork().to(device)
print(model)
```
%% Output
Using cuda device
NeuralNetwork(
(flatten): Flatten(start_dim=1, end_dim=-1)
(fc1): Linear(in_features=784, out_features=512, bias=True)
(fc2): Linear(in_features=512, out_features=512, bias=True)
(fc3): Linear(in_features=512, out_features=10, bias=True)
)
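%% Cell type:markdown id: tags:
As a possible starting point for the exercise above, here is a minimal sketch of a convolutional variant (an illustration, not the reference solution): two ```Conv2d``` layers with a ```(3,3)``` kernel and ```(1,1)``` stride producing 32 and 64 channels, followed by max-pooling, and dropout in the dense part. The remaining hyper-parameters (128 hidden nodes, dropout rate 0.25) are assumptions.
%% Cell type:code id: tags:
``` python
# Sketch of a convolutional model for the exercise (hyper-parameters are assumptions)
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # two convolutional layers: 1 -> 32 -> 64 channels, 3x3 kernels, stride 1
        self.conv1 = nn.Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
        self.conv2 = nn.Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
        self.dropout = nn.Dropout(0.25)
        # after two 3x3 convolutions (28 -> 26 -> 24) and 2x2 max-pooling (24 -> 12),
        # the feature map has shape 64 x 12 x 12
        self.fc1 = nn.Linear(64 * 12 * 12, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)

conv_model = ConvNet().to(device)
print(conv_model)
```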
%% Cell type:markdown id: tags:
Next, we need to define the further parameters of the network.
Two very important ingredients are:
- the optimiser that is used to adjust the network weights during training
- the loss function
The loss function allows us to quantify how well the network performs and how "good" the predictions are, i.e. how close the predicted label is to the true label. Since we work on a classification task, we use the [cross-entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).
Instead of stochastic gradient descent [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html), we will use the [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html?highlight=adam#torch.optim.Adam) optimiser, which is more efficient. We could also use, e.g., [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html?highlight=adam#torch.optim.AdamW) as a modern variant.
We need to pass the model parameters (i.e. the network weights) to the optimiser. This essentially "tells" the optimiser which parameters need to be tuned during training.
%% Cell type:code id: tags:
``` python
##
## Optimizer and Loss Function
##
loss_fn = nn.CrossEntropyLoss()
# Choose the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```
%% Cell type:markdown id: tags:
# Network training
Here we define the functions that run over the data and either train the network or evaluate it on the test data.
The main skeleton is the same:
- we loop over the data in batches
- move the data to the device (CPU or GPU)
- calculate the predictions with the current state of the network (i.e. a forward pass)
- determine the quality of the prediction (i.e. the value of the loss function).
For the training loop, we then need to tell PyTorch to do the backward propagation and update the network weights.
Note:
- We need to tell PyTorch explicitly if we train the model, so that the weights can get updated: ```model.train()```. Otherwise, we switch to evaluation mode ```model.eval()```.
- Our network has 10 output nodes, i.e. each node tells us how likely it is that the image corresponds to the label represented by that node. We need the node with the highest value to identify the most likely classification, which we can do with ```argmax```.
%% Cell type:code id: tags:
``` python
#
# Training loop
#
def train_epoch(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    running_loss = 0.0

    # put the model into training mode
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # need to send the data tensors explicitly to the device we want to
        # train on. E.g. when we create the tensors (and load the data),
        # they get created on the CPU; if we want to train on the GPU,
        # we need to move them there explicitly.
        X, y = X.to(device), y.to(device)

        #
        # Forward pass - compute the output of the network
        # (do not call forward() directly), "apply" the model on the data,
        # compute the prediction error
        #
        pred = model(X)
        loss = loss_fn(pred, y)

        #
        # Backpropagation
        #
        # By default, the gradients accumulate, so we need to
        # reset them explicitly at each new training step
        optimizer.zero_grad()
        # Now we need to calculate the gradients in all layers
        # in the backward pass.
        # (This method belongs to the loss tensor, which is the root
        # of the computational graph.)
        loss.backward()
        # After we have calculated the gradients, we take a step
        # in the direction they suggest.
        # (This method belongs to the optimizer, which holds the parameters.)
        optimizer.step()

        # record the loss
        loss_item = loss.item()
        current = batch * len(X)
        running_loss += loss_item
        if batch % 100 == 0:
            print(f"loss: {loss_item:>7f} [{current:>5d}/{size:>5d}]")

    return running_loss/size


#
# Test Loop
#
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)

    # put the model into evaluation mode.
    # some aspects (e.g. dropout, etc.) behave differently during
    # training and inference/evaluation
    model.eval()
    test_loss, correct = 0, 0

    # PyTorch stores gradient information by default -
    # during inference we no longer need this as we won't update the
    # parameters anymore. Therefore, we can use "no_grad()" to switch that off.
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
    return correct
```
%% Cell type:markdown id: tags:
Now we use these functions for the actual training.
%% Cell type:code id: tags:
``` python
##
## The actual training/evaluation
##
epochs = 5

loss_values = []
accuracy_values = []

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    loss = train_epoch(train_dataloader, model, loss_fn, optimizer)
    loss_values.append(loss)
    accuracy = test(test_dataloader, model, loss_fn)
    accuracy_values.append(accuracy)
print("Done!")
```
%% Output
Epoch 1
-------------------------------
loss: 2.295028 [ 0/60000]
loss: 0.275703 [ 6400/60000]
loss: 0.188599 [12800/60000]
loss: 0.266648 [19200/60000]
loss: 0.154783 [25600/60000]
loss: 0.308687 [32000/60000]
loss: 0.121409 [38400/60000]
loss: 0.231967 [44800/60000]
loss: 0.269322 [51200/60000]
loss: 0.164584 [57600/60000]
Test Error:
Accuracy: 95.6%, Avg loss: 0.140970
Epoch 2
-------------------------------
loss: 0.064725 [ 0/60000]
loss: 0.066647 [ 6400/60000]
loss: 0.092921 [12800/60000]
loss: 0.104935 [19200/60000]
loss: 0.066522 [25600/60000]
loss: 0.107404 [32000/60000]
loss: 0.054125 [38400/60000]
loss: 0.118254 [44800/60000]
loss: 0.120747 [51200/60000]
loss: 0.099537 [57600/60000]
Test Error:
Accuracy: 96.4%, Avg loss: 0.119036
Epoch 3
-------------------------------
loss: 0.027532 [ 0/60000]
loss: 0.058633 [ 6400/60000]
loss: 0.061383 [12800/60000]
loss: 0.081768 [19200/60000]
loss: 0.034236 [25600/60000]
loss: 0.077802 [32000/60000]
loss: 0.035839 [38400/60000]
loss: 0.087089 [44800/60000]
loss: 0.073838 [51200/60000]
loss: 0.044092 [57600/60000]
Test Error:
Accuracy: 97.6%, Avg loss: 0.087102
Epoch 4
-------------------------------
loss: 0.020311 [ 0/60000]
loss: 0.062589 [ 6400/60000]
loss: 0.069213 [12800/60000]
loss: 0.100371 [19200/60000]
loss: 0.041693 [25600/60000]
loss: 0.034252 [32000/60000]
loss: 0.017241 [38400/60000]
loss: 0.060870 [44800/60000]
loss: 0.092723 [51200/60000]
loss: 0.042284 [57600/60000]
Test Error:
Accuracy: 96.9%, Avg loss: 0.127436
Epoch 5
-------------------------------
loss: 0.032525 [ 0/60000]
loss: 0.010098 [ 6400/60000]
loss: 0.069154 [12800/60000]
loss: 0.007468 [19200/60000]
loss: 0.008562 [25600/60000]
loss: 0.039671 [32000/60000]
loss: 0.023815 [38400/60000]
loss: 0.006657 [44800/60000]
loss: 0.048527 [51200/60000]
loss: 0.026487 [57600/60000]
Test Error:
Accuracy: 96.4%, Avg loss: 0.142745
Done!
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(2, sharex=True)
axes[0].set_ylabel("Loss", fontsize=14)
axes[0].plot(loss_values)
axes[1].set_ylabel("Accuracy", fontsize=14)
axes[1].set_xlabel("Iteration", fontsize=14)
axes[1].plot(accuracy_values)
plt.show()
```
%% Output
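%% Cell type:markdown id: tags:
As a quick sanity check (a sketch, not part of the original notebook), we can run the trained model on a single test image in evaluation mode and compare the predicted label with the truth.
%% Cell type:code id: tags:
``` python
# predict the label of the first image in a test batch (sanity check)
model.eval()
with torch.no_grad():
    X, y = next(iter(test_dataloader))
    X, y = X.to(device), y.to(device)
    pred = model(X)
    predicted_label = pred[0].argmax().item()

# move the image back to the CPU for plotting
plt.imshow(tv.transforms.ToPILImage()(X[0].cpu()), cmap='Greys')
print(f'Predicted label: {predicted_label}, true label: {y[0].item()}')
```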
%% Cell type:markdown id: tags:
Even with this very simple fully-connected network, we can achieve an accuracy of 96%-97%. This isn't bad as such; indeed, in many applications this would be considered phenomenal. However, the best classifiers achieve around 99.8%: the code at [MNIST-0.17](https://github.com/Matuzas77/MNIST-0.17) has an error rate of only 0.17%.
This illustrates why we use MNIST for learning how to use neural networks for image processing and how to set the system up - but the data are not suitable (anymore) to develop performant neural network architectures.
%% Cell type:markdown id: tags:
# Introduction to PyTorch
Modern deep learning typically builds on one of the major libraries that allow us to efficiently build models without having to deal with the underlying infrastructure and computations.
In particular, they free us from the burden of calculating all the mathematical formulae (mostly linear algebra) to train the network, and to do so efficiently on modern GPUs.
Mostly, two libraries are used: [PyTorch](https://pytorch.org/), developed by Meta/Facebook, or [TensorFlow](https://www.tensorflow.org/), developed by Alphabet/Google. As you can see, a lot of AI research is no longer driven by academia, but by industry. Each of these libraries has its strengths and weaknesses. However, the choice will not limit you in what you can do (only, perhaps, in how easy it is to implement something). We will choose PyTorch in this course as it has emerged in recent years as the most used package in research, both in academia and industry.
Before we use this library to do any deep learning, we will first cover the basics - using the humble linear regression as an example. Of course, this is efficiently provided by [Linear Regression in Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html); however, doing it in PyTorch serves as an example to:
- see what is going on step by step
- remind ourselves that, in the end, these libraries are also powerful optimisation libraries.
%% Cell type:code id: tags:
```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.optim as optim
```
%% Output
/home/kerzel/.cache/pypoetry/virtualenvs/datascienceintro-eVBNPtpL-py3.10/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
%% Cell type:markdown id: tags:
# Generate data
We will generate some dummy data. Essentially, we want to fit a straight line: we generate data according to a straight line and add some white (Gaussian) noise, scaled by a factor $\epsilon$:
$$
y = m \cdot x + b + \epsilon \cdot \mathcal{N}(0,1)
$$
%% Cell type:code id: tags:
```
# true parameters of the straight line and scale of the Gaussian noise
m_true = 3
b_true = 1
eps = 0.1

n_samples = 100
n_features = 1

# x is drawn uniformly from [0, 1); the noise is standard normal, scaled by eps.
# (The second argument of rand/randn is the number of columns, i.e. we create
# column vectors of shape (n_samples, 1).)
x = np.random.rand(n_samples, n_features)
y = m_true * x + b_true + eps * np.random.randn(n_samples, n_features)
```
%% Cell type:code id: tags:
```
plt.scatter(x,y)
plt.xlabel('x', fontsize=15)
plt.ylabel('y', fontsize=15)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
# Initialise
Before we initialise PyTorch, we need to find out if we have a CPU or a GPU available. For this simple example, a CPU is sufficient - but if we do have a GPU, we need to "move" all data and models to it.
We do this by checking whether CUDA is available (not just installed), i.e. whether an NVIDIA GPU and the corresponding drivers are present.
%% Cell type:code id: tags:
```
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
```
%% Output
cpu
%% Cell type:markdown id: tags:
Internally, PyTorch works with [Torch.Tensor](https://pytorch.org/docs/stable/tensors.html) as numeric data types, see also [PyTorch: Tensor](https://pytorch.org/tutorials/beginner/examples_tensor/polynomial_tensor.html).
In a sense, they are quite similar to a multidimensional [NumPy Array](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html). One major difference is that these tensors can be used both on CPU and on GPU. We can also specify the data type.
Therefore, to be able to use PyTorch, we need to convert our data to these tensors. Since we started from NumPy, we can use a convenience function provided by PyTorch to do this.
We need to do the following steps:
- convert the data from numpy array to PyTorch tensor
- specify the data type
- move the data to the device we want to run on.
The latter step is a crucial detail if we want to make use of, for example, a GPU: If we forget to move either the data or the model to the GPU, we cannot utilise it.
Note that we can chain the commands to make the code a bit more compact.
%% Cell type:code id: tags:
```
#
# convert the NumPy arrays with the data from above to PyTorch tensors
#
x_tensor = torch.from_numpy(x).float().to(device)
y_tensor = torch.from_numpy(y).float().to(device)
print(x_tensor.shape,y_tensor.shape)
print(x_tensor.type(),y_tensor.type())
```
%% Output
torch.Size([100, 1]) torch.Size([100, 1])
torch.FloatTensor torch.FloatTensor
%% Cell type:markdown id: tags:
In this case where we want to fit a linear regression model, we have two variables that we need to fit: the slope and the intercept of the line.
Therefore, we need to define two variables that we use in the fitting.
We initialise these variables using a standard Gaussian, so that they have well-defined, non-pathological values. We can do so using [torch.randn](https://pytorch.org/docs/stable/generated/torch.randn.html).
We need to specify a few parameters:
- ```size```: the size of the array. Here: 1, as we need a single number
- ```requires_grad = True```: to be able to use this variable in the minimisation/fitting procedure
- ```dtype```: here we use ```torch.float``` as all our variables are floating point numbers
- ```device```: we need to make sure that we "move" the variable to the device we are using, i.e. either a CPU or a GPU.
Finally, we print the tensors. Note that this gives us quite a bit of information: They are indeed of the datatype "tensor", hold a single value in array shape and are used during optimisation.
We can get the shape of the array also directly.
%% Cell type:code id: tags:
```
#
# define Torch variables - if a GPU is available, use this for the computation
# indicate that these variables will be optimized using gradient descent
#
m = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(m, b)
print(m.shape, b.shape)
```
%% Output
tensor([-0.4388], requires_grad=True) tensor([-1.1128], requires_grad=True)
torch.Size([1]) torch.Size([1])
%% Cell type:markdown id: tags:
# Model fit
Now we need to fit the model, i.e. our straight line.
We need the following ingredients to do so:
- an optimiser with suitable parameters
- an error metric
The optimizer is the algorithm that determines how to change the values of the parameters in our model (here: slope $m$ and intercept $b$) such that the resulting straight line fits "better" to the data. How good the fit is, is quantified by a loss (or: error) function.
In our case, we use [Stochastic Gradient Descent](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) as the optimiser. This is one of the simpler algorithms - it will work well in this case, but we would choose others in more involved applications. This optimiser needs at least the following parameters:
- learn rate: how big the steps are in each iteration when estimating new parameters
- number of epochs: the number of times we want to run the optimiser. Unfortunately, we cannot just say "run until it's done" but need to specify, in the simplest approach, how long we want to run for and then check whether this was sufficient.
Further, we use the mean squared error (MSE) as the loss function, i.e. the average squared difference between the predicted value ($\hat{y}_i$) and the true value ($y_i$):
$$
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2
$$
In this simple example, we do everything manually and step through the fitting. Later, when we define deep learning models, much of this will be done by helper functions.
%% Cell type:code id: tags:
```
#
# Fit the model
#
learn_rate = 1e-1
n_epochs = 1000

# define the optimiser used to learn the parameters
# (stochastic gradient descent)
optimiser = optim.SGD([m, b], lr=learn_rate)

loss_array = []

# training loop
for epoch in range(n_epochs):
    # prediction
    y_hat = m * x_tensor + b

    # use MSE as loss function
    error = y_tensor - y_hat
    loss = (error ** 2).mean()
    loss_array.append(loss.detach().numpy())

    # this computes the gradient of the loss function
    # i.e. d(loss)/d(parameter)
    loss.backward()

    # use the gradient to update the values of the parameters
    optimiser.step()

    # clear the accumulated gradients before the next iteration
    optimiser.zero_grad()

print('slope:', m, 'true slope: ', m_true)
print('intercept: ', b, 'true intercept: ', b_true)
```
%% Output
slope: tensor([3.0210], requires_grad=True) true slope: 3
intercept: tensor([0.9898], requires_grad=True) true intercept: 1
%% Cell type:markdown id: tags:
Next, we look at the values of the loss function during the optimisation.
As we can see, starting from the random values above, we make a big jump in the first few iterations, then the progress is much smaller.
However, looking at the plot on a log-scale, we can see that after the initial jump in the first iterations, there is a steady decline over the next 200 or so iterations, and only after roughly 300 iterations do we start to run into a plateau.
%% Cell type:code id: tags:
```
plt.plot(loss_array)
plt.yscale('log')
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
Finally, we compare the fitted model with our original data, as well as the "true" slope and intercept we have generated the data from.
Note that in order to be able to do this, we first need to "de-tensorfy" the variables. Behind the scenes, PyTorch uses a computational graph to do all calculations. If we want to use the variables, e.g. for plotting, we need to [detach](https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html) them from this graph. Further, we need to convert them back into NumPy format to be able to plot the data.
For the plot, we create a number of points on the $x$-axis (```np.linspace```) and then create a straight line: $y = m \cdot x + b$.
%% Cell type:code id: tags:
```
plt.scatter(x,y)
x_space = np.linspace(np.min(x), np.max(x))
plt.plot(x_space, m.detach().numpy()*x_space + b.detach().numpy(), label='Fit')
plt.plot(x_space, m_true*x_space +b_true, label='Truth')
plt.legend()
plt.xlabel('x', fontsize = 20)
plt.ylabel('y', fontsize = 20)
plt.show()
```
%% Output
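%% Cell type:markdown id: tags:
As a cross-check (a sketch that assumes scikit-learn is installed in the environment), we can fit the same data with the [Linear Regression in Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) mentioned above, which solves the least-squares problem directly, and compare the result with the gradient-descent fit.
%% Cell type:code id: tags:
```
# cross-check with scikit-learn's closed-form least-squares fit
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(x, y)   # x, y are the NumPy arrays generated above

print('sklearn slope:    ', lin_reg.coef_[0])
print('sklearn intercept:', lin_reg.intercept_)
```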
%% Cell type:markdown id: tags:
# Regression
In this example, we use the [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality) for a first, simple regression task.
These data consist of a range of features such as acidity or sugar levels, pH value, alcohol content, etc. and the aim is to predict a quality score in the range (0,10).
The data originate from the paper: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009. [Wine paper](https://www.sciencedirect.com/science/article/pii/S0167923609001377)
Note that the score is based on integer values - therefore, we could treat this either as a regression problem (like we do here), or as a classification problem with (up to) 10 different classes. The original paper considered a regression approach - we will therefore follow in their footsteps and treat this problem as a regression task.
We focus on red wines in this example - feel free to explore the white wines as an exercise.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
```
%% Cell type:markdown id: tags:
# Data
We can load the data directly from the public repository (or from a local backup).
%% Cell type:code id: tags:
``` python
df = pd.read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
sep=';')
df.head(5)
```
%% Output
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
3 11.2 0.28 0.56 1.9 0.075
4 7.4 0.70 0.00 1.9 0.076
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 11.0 34.0 0.9978 3.51 0.56
1 25.0 67.0 0.9968 3.20 0.68
2 15.0 54.0 0.9970 3.26 0.65
3 17.0 60.0 0.9980 3.16 0.58
4 11.0 34.0 0.9978 3.51 0.56
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
%% Cell type:code id: tags:
``` python
df.describe()
```
%% Output
fixed acidity volatile acidity citric acid residual sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
chlorides free sulfur dioxide total sulfur dioxide density \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 0.087467 15.874922 46.467792 0.996747
std 0.047065 10.460157 32.895324 0.001887
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 22.000000 0.995600
50% 0.079000 14.000000 38.000000 0.996750
75% 0.090000 21.000000 62.000000 0.997835
max 0.611000 72.000000 289.000000 1.003690
pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 3.311113 0.658149 10.422983 5.636023
std 0.154386 0.169507 1.065668 0.807569
min 2.740000 0.330000 8.400000 3.000000
25% 3.210000 0.550000 9.500000 5.000000
50% 3.310000 0.620000 10.200000 6.000000
75% 3.400000 0.730000 11.100000 6.000000
max 4.010000 2.000000 14.900000 8.000000
%% Cell type:markdown id: tags:
# Exploratory Data Analysis
First, we want to get a feel for the data and look at the various variables.
In this case, all variables are already in numeric format.
## Univariate analysis
First, we look at some individual variables to get a feel for their distribution.
Let's start by looking at the target variable ("quality"). As we can see, not all scores are used: there are very few very good or very bad wines, and most wines are average.
***Exercise***
Explore some more variables
%% Cell type:code id: tags:
``` python
g = sns.histplot(data=df, x='quality',bins=20)
plt.ylabel('count', size = 20)
plt.xlabel('Wine Quality', size = 20)
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
##
## explore some more variables
##
```
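%% Cell type:markdown id: tags:
As one possible example for the exercise above (a sketch; the choice of variable is an assumption), we can look at the distribution of the alcohol content in the same way as for the quality score.
%% Cell type:code id: tags:
``` python
# example for the exercise: distribution of the alcohol content
g = sns.histplot(data=df, x='alcohol', bins=20)
plt.ylabel('count', size=20)
plt.xlabel('Alcohol content', size=20)
plt.show()
```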
%% Cell type:markdown id: tags:
## Explore correlations
Machine learning essentially exploits linear and non-linear correlations between the features and the target (label).
Next, we want to get a feel for how the variables are correlated.
We can first print the correlation table of the variables and then plot the behaviour of individual pairs of variables.
To look at all combinations (or a subset), we can use the [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) from Seaborn, or plot a few individual combinations.
%% Cell type:code id: tags:
``` python
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt=".1f", cmap='seismic')
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
As we can see, some variables are strongly correlated with each other, but only three variables have a sizeable correlation with our target: volatile acidity, sulphates, and alcohol. We therefore suspect that these variables are the most important for the model.
%% Cell type:code id: tags:
``` python
sns.scatterplot(data=df, x='alcohol', y='volatile acidity', hue='quality' )
plt.xlabel('Alcohol content', size = 20)
plt.ylabel('Volatile acidity', size = 20)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
***Exercise***
Explore some more correlations between the variables and the target.
%% Cell type:code id: tags:
``` python
##
## explore some more correlations using plots of two variables like above, or a pairplot
##
```
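%% Cell type:markdown id: tags:
One possible way to approach the exercise (a sketch; the choice of variables follows the correlation matrix above) is a pair plot of the three variables most correlated with the target, using KDE contours for the off-diagonal panels.
%% Cell type:code id: tags:
``` python
# pair plot of the three variables most correlated with the quality score
sns.pairplot(df[['alcohol', 'volatile acidity', 'sulphates', 'quality']],
             hue='quality', kind='kde', corner=True)
plt.show()
```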
%% Cell type:markdown id: tags:
A contour plot in the pair plot for the three main variables shows some dependency - but overall, visual inspection reveals no obvious 1:1 relationship.
%% Cell type:markdown id: tags:
# Machine Learning
We now use a machine learning model to predict the quality index.
The model is trained on the training data and then evaluated on predictions made from the independent test data.
We follow the typical Scikit-Learn approach of:
- create an instance of the model
- call the ```fit``` method for the training data
- call the ```predict``` method for the test data.
As an example, we will use the [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
However, first we need to define the training/test data, the features and the labels.
The target (label) is the last column (quality) in the data-frame.
%% Cell type:code id: tags:
``` python
###
### Your code here
###
train_cols = # ...
label = # ...
X = df[train_cols]
y = df[label]
# split into training and test sample
X_train, X_test, y_train, y_test = # ....
```
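%% Cell type:markdown id: tags:
One possible way to fill in the cell above (a sketch): use all columns except ```quality``` as features, ```quality``` as the label, and ```train_test_split``` for the split. The split fraction and random seed below are assumptions; the exact split used to produce the outputs further down is not recorded, so those numbers may differ.
%% Cell type:code id: tags:
``` python
# one possible solution (split parameters are assumptions)
train_cols = df.columns[:-1]   # all feature columns, i.e. everything except 'quality'
label = 'quality'              # the target is the last column

X = df[train_cols]
y = df[label]

# split into training and test sample (here: 75% / 25%, fixed random seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```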
%% Cell type:code id: tags:
``` python
#define and fit the model
model = RandomForestRegressor(max_depth=20, random_state=0, n_estimators=15)
model.fit(X_train, y_train)
```
%% Output
RandomForestRegressor(max_depth=20, n_estimators=15, random_state=0)
%% Cell type:markdown id: tags:
Then, we obtain the predictions on the test data for the model.
For convenience, we make a copy of the test data and append the predictions, together with the true values, as the last columns of the dataframe.
%% Cell type:code id: tags:
``` python
predictions = X_test.copy()
y_hat = model.predict(predictions)
predictions.loc[:,'y_hat'] = y_hat
predictions.loc[:,'y'] = y_test
predictions.head(5)
```
%% Output
fixed acidity volatile acidity citric acid residual sugar chlorides \
1552 6.3 0.680 0.01 3.7 0.103
1333 9.1 0.775 0.22 2.2 0.079
660 7.2 0.520 0.07 1.4 0.074
1294 8.2 0.635 0.10 2.1 0.073
1339 7.5 0.510 0.02 1.7 0.084
free sulfur dioxide total sulfur dioxide density pH sulphates \
1552 32.0 54.0 0.99586 3.51 0.66
1333 12.0 48.0 0.99760 3.18 0.51
660 5.0 20.0 0.99730 3.32 0.81
1294 25.0 60.0 0.99638 3.29 0.75
1339 13.0 31.0 0.99538 3.36 0.54
alcohol y_hat y
1552 11.3 5.933333 6
1333 9.6 5.333333 5
660 9.6 5.733333 6
1294 10.9 6.066667 6
1339 10.5 6.000000 6
%% Cell type:markdown id: tags:
# Evaluation
### Global metrics
We now look at a few evaluation metrics.
First, we look at some global properties, such as the mean absolute error (MAD) and the mean squared error (MSE).
%% Cell type:code id: tags:
``` python
print('MAD: {0:0.2f}, MSE {1:0.2f}'.format(metrics.mean_absolute_error(y_test, y_hat),
metrics.mean_squared_error(y_test, y_hat)))
```
%% Output
MAD: 0.44, MSE 0.39
%% Cell type:markdown id: tags:
### Residuals
We now look at the residuals of the predictions. Ideally, we would expect "Gaussian noise": the predictions will not be perfect, but there should be no trend, and the deviations from the true values (in the test data) should be as small as possible and randomly distributed.
%% Cell type:code id: tags:
``` python
# The PredictionErrorDisplay helper (available in scikit-learn >= 1.2) can draw this directly:
#display = metrics.PredictionErrorDisplay(y_true=y_test, y_pred=y_hat)
#display.plot()
#plt.show()
```
%% Cell type:code id: tags:
``` python
```
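%% Cell type:markdown id: tags:
If the display helper is not available, a simple residual plot can be sketched directly with matplotlib (a sketch; ```y_test``` and ```y_hat``` come from the cells above): plot the residuals against the predicted values and look for structure - ideally the points scatter randomly around zero.
%% Cell type:code id: tags:
``` python
# simple residual plot: residuals vs. predicted quality
residuals = y_test - y_hat
plt.scatter(y_hat, residuals, alpha=0.5)
plt.axhline(0.0, color='black', linestyle='--')
plt.xlabel('Predicted quality', fontsize=15)
plt.ylabel('Residual (y - y_hat)', fontsize=15)
plt.show()
```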
%% Cell type:markdown id: tags:
### Profile Plot
We can also do a profile plot that illustrates how well the predictions work in various regions. Ideally, the data points on the model should be compatible with the diagonal line for all predictions.
%% Cell type:code id: tags:
``` python
import profile_plot
#fig, ax = plt.subplots(1, 1)
#out = profile_plot.pplot(y_test, y_hat, n_bins=10, yerr='var', ax=ax)
#plt.show()
```
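%% Cell type:markdown id: tags:
If the ```profile_plot``` helper module is not available, a rough stand-in can be built by hand (a sketch, not the module's implementation): bin the true quality values and show the mean and spread of the predictions in each bin against the diagonal.
%% Cell type:code id: tags:
``` python
# rough profile plot: mean prediction (with spread) per bin of the true quality
bins = np.linspace(y_test.min(), y_test.max(), 10)
bin_idx = np.digitize(y_test, bins)

bin_centres = [y_test[bin_idx == i].mean() for i in np.unique(bin_idx)]
pred_means  = [y_hat[bin_idx == i].mean() for i in np.unique(bin_idx)]
pred_stds   = [y_hat[bin_idx == i].std() for i in np.unique(bin_idx)]

plt.errorbar(bin_centres, pred_means, yerr=pred_stds, fmt='o', label='model')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', label='ideal')
plt.xlabel('True quality', fontsize=15)
plt.ylabel('Predicted quality', fontsize=15)
plt.legend()
plt.show()
```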
%% Cell type:markdown id: tags:
Think about what this means for our choice of modelling approach, for the model, and what we would need to think about to improve it.