Commit f0432903 authored by Ulrich Kerzel

add new notebook for lin-reg with sklearn

parent 7c547095
Showing with 1063 additions and 312 deletions
{
"python.analysis.typeCheckingMode": "basic"
}
\ No newline at end of file
@@ -431,7 +431,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.6"
+"version": "3.10.12"
 },
 "vscode": {
 "interpreter": {
@@ -326,8 +326,8 @@
 {
 "data": {
 "text/html": [
-"<!-- http://127.0.0.1:7001/139927338527968/ -->\n",
-"<iframe src=\"http://127.0.0.1:7001/139927338527968/\" width=100% height=800 frameBorder=\"0\"></iframe>"
+"<!-- http://127.0.0.1:7001/140319794450432/ -->\n",
+"<iframe src=\"http://127.0.0.1:7001/140319794450432/\" width=100% height=800 frameBorder=\"0\"></iframe>"
 ]
 },
 "metadata": {},
@@ -369,8 +369,8 @@
 {
 "data": {
 "text/html": [
-"<!-- http://127.0.0.1:7001/139928798281536/ -->\n",
-"<iframe src=\"http://127.0.0.1:7001/139928798281536/\" width=100% height=800 frameBorder=\"0\"></iframe>"
+"<!-- http://127.0.0.1:7001/140319797370000/ -->\n",
+"<iframe src=\"http://127.0.0.1:7001/140319797370000/\" width=100% height=800 frameBorder=\"0\"></iframe>"
 ]
 },
 "metadata": {},
@@ -401,7 +401,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.6"
+"version": "3.10.12"
 },
 "vscode": {
 "interpreter": {
%% Cell type:markdown id: tags:
# Image classification with PyTorch
In this exercise we start to build our first (deep learning) model to analyse images.
We use the famous MNIST data that was originally developed by [Yann LeCun: "THE MNIST DATABASE of handwritten digits". Yann LeCun, Courant Institute, NYU; Corinna Cortes, Google Labs, New York; Christopher J.C. Burges, Microsoft Research, Redmond.](http://yann.lecun.com/exdb/mnist/)
The dataset contains about 70,000 images of hand-written digits 0-9, 60,000 of which are typically used for training and the rest for testing.
The images are normalised to a size of 28x28 pixels and anti-aliased, which leads to the grayscale values that we can see in the data (as opposed to a pure b/w image).
These data have the advantage that they are often included in the tools, and they are small enough that we can train a neural network in reasonable time even without using GPUs.
A few years ago, the fashion company [Zalando](https://zalando.de) released a variant called [fashion-mnist](https://github.com/zalandoresearch/fashion-mnist) that can be used as a "drop-in" replacement, i.e. it has very similar properties: the same number of images for training and testing, the same size in terms of pixels, grayscale images, 10 classes, etc.
However, instead of showing just numbers, clothing items were chosen (trousers, shirts, etc.). The main motivation was that the original MNIST data are too easy for modern systems and can no longer really be used to develop modern algorithms.
However, in our case we stick with the original data, as we want to explore how to set up a system and reduce our dependency on GPUs for now.
We will also make use of the [TorchVision](https://pytorch.org/vision/stable/index.html) library. This provides convenient access to a number of datasets (such as MNIST), but also pre-trained models or commonly used image augmentation methods.
%% Cell type:code id: tags:
``` python
import torch
from torch import nn
from torch.utils.data import DataLoader
import torch.nn.functional as F
import torchvision as tv
import matplotlib.pyplot as plt
import numpy as np
```
%% Output
/home/kerzel/.cache/pypoetry/virtualenvs/datascienceintro-eVBNPtpL-py3.10/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
%% Cell type:markdown id: tags:
# Data access
Now we download the data using the convenience function from TorchVision for the [MNIST](https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html#torchvision.datasets.MNIST) data. It is essentially a wrapper function that downloads the data from Yann LeCun's webpage and adds it to the local directory.
%% Cell type:code id: tags:
``` python
# Download training data from open datasets.
training_data = tv.datasets.MNIST(
root="data/mnist",
train=True,
download=True,
transform=tv.transforms.ToTensor(),
)
# Download test data from open datasets.
test_data = tv.datasets.MNIST(
root="data/mnist",
train=False,
download=True,
transform=tv.transforms.ToTensor(),
)
```
%% Cell type:markdown id: tags:
Next, we need to make these data accessible to PyTorch.
This can be done using the [DataLoader](https://pytorch.org/docs/stable/data.html) provided by PyTorch. This function provides some common functionality when dealing with the data.
In particular, we will need to repeatedly access all images in the dataset and loop over them as the training progresses.
Further, experience shows that it is helpful to split the training data into smaller chunks (or batches) to speed up training. This is more efficient than updating the parameters of the network after each single image, or waiting until we have processed the entire dataset in one training loop (or: epoch).
Here we will use a batch size of 64 (images) which works well - however, this is a free parameter and different values may work better in other situations.
After we have defined the DataLoaders for access to the data, we print an example. As we would normally loop over the data, we set the loop up here as well, but terminate it after the first image.
The array/vector/tensor ```X``` contains the images. Therefore, its dimensions reflect the batches of images we process:
- In each iteration, we train/test the network in batches of 64 images
- We have grayscale images (only one colour channel)
- The picture is 28x28 pixels high/wide
The array/vector/tensor ```y``` contains the corresponding true labels for each image (supervised training).
Note the grayscale values due to the anti-aliasing that was applied when creating the data.
%% Cell type:code id: tags:
``` python
batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

# define the loop over the test data (even if we only use one image)
for X, y in test_dataloader:
    print("Shape of X [N images per batch, # colours, height, width]: ", X.shape)
    print("Shape of y: ", y.shape, y.dtype)

    # print an example
    # since PyTorch tensors are of the form [#Channel, #pix X, #pix Y], we need to
    # transform back into image format
    index = 0
    plt.imshow(tv.transforms.ToPILImage()(X[index]), cmap='Greys')
    print('True label: {}'.format(y[index]))

    break
```
%% Output
Shape of X [N images per batch, # colours, height, width]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
True label: 7
%% Cell type:markdown id: tags:
# Network definition
Before we can train the network, we need to first setup the various components.
In particular, we need to define the architecture of the neural network, as well as the optimiser and other parameters that we use to train the network.
We start by defining the neural network.
In PyTorch, we do so by defining a class inheriting from ```nn.Module```. We need to do two main steps:
- In the ```__init__()``` method: We need to define the components we want to use in our neural network model. Here we can make use of the many building blocks the PyTorch library provides us with and create the respective instances we need.
- In the ```forward()``` method: Here we define how the model is built from the various components defined in the ```__init__()``` method. This defines the *forward* pass of the network (from input to output), hence the name. The backward pass, with the update of the network parameters using back-propagation, is then handled automatically by PyTorch. Essentially, we need to "chain" the components together to define our model: the data (i.e. our images) enter the network through the input layer, pass through all intermediate layers and then reach the output layer, which provides us, for example, with the predictions for the labels (i.e. here, which number the image shows).
We also need to make sure that we move the model to the GPU if we have one available (together with the data).
Generally, when building models working on images we use
- [Convolutional](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) layers for learnable image processing "filters"
- [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers, e.g. for classification of the labels using all the features learned by the previous (convolutional) layers.
- [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) or others as an activation function.
- [DropOut](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) or [Dropout2D](https://pytorch.org/docs/stable/generated/torch.nn.Dropout2d.html) for regularisation.
- [MaxPool2D](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html) as the pooling layer
When defining the model, we have two hard constraints:
- The input layer needs to be able to handle the 28x28 pixel grayscale images
- We need 10 nodes in the output layer (one for each digit 0-9).
How we set the intermediate layers up is, essentially, up to us and this is where the hard work of building and optimising a model comes in.
In a first step, we build a very simple network without convolutional layers. Therefore, we do not expect it to be very good or performant, but it will allow us to "learn the ropes" of how to train a neural network.
Since we only use "fully-connected" (or: "dense") layers, we first need to "flatten" the 2D image information into a long array of 28*28 numbers.
This is then passed into the input layer ```fc1```. We choose ```512``` output nodes for this layer. This is passed into the next layer, where we also choose ```512``` output nodes, before going to the output layer with ```10``` output nodes. In between, we use ```ReLU``` as the activation function.
***Exercise***
Upgrade the model to use convolutional layers, e.g. two ```Conv2d``` layers with a ```(3,3)``` kernel, ```(1,1)``` stride, and 32 and 64 output channels. Then also use ```MaxPool2d``` after the convolutional layers and dropout in the "dense" part of the network for regularisation (a possible sketch is shown after the simple model below).
Note how we use the [Functional](https://pytorch.org/docs/stable/nn.functional.html) API to pass the data through the various elements, e.g. the activation function.
%% Cell type:code id: tags:
``` python
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

# Define a simple model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28*28, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x

model = NeuralNetwork().to(device)
print(model)
```
%% Output
Using cuda device
NeuralNetwork(
(flatten): Flatten(start_dim=1, end_dim=-1)
(fc1): Linear(in_features=784, out_features=512, bias=True)
(fc2): Linear(in_features=512, out_features=512, bias=True)
(fc3): Linear(in_features=512, out_features=10, bias=True)
)
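%% Cell type:markdown id: tags:
As a possible starting point for the exercise above, here is a minimal sketch of a convolutional variant (an illustration, not the reference solution): two ```Conv2d``` layers with a ```(3,3)``` kernel and ```(1,1)``` stride producing 32 and 64 channels, followed by max-pooling, and dropout in the dense part. The remaining hyper-parameters (128 hidden nodes, dropout rate 0.25) are assumptions.
%% Cell type:code id: tags:
``` python
# Sketch of a convolutional model for the exercise (hyper-parameters are assumptions)
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # two convolutional layers: 1 -> 32 -> 64 channels, 3x3 kernels, stride 1
        self.conv1 = nn.Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
        self.conv2 = nn.Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
        self.dropout = nn.Dropout(0.25)
        # after two 3x3 convolutions (28 -> 26 -> 24) and 2x2 max-pooling (24 -> 12),
        # the feature map has shape 64 x 12 x 12
        self.fc1 = nn.Linear(64 * 12 * 12, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)

conv_model = ConvNet().to(device)
print(conv_model)
```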
%% Cell type:markdown id: tags:
Next, we need to define the further parameters of the network.
Two very important ingredients are:
- the optimiser that is used to adjust the network weights during training
- the loss function
The loss function allows us to quantify how well the network performs and how "good" the predictions are, i.e. how close the predicted label is to the true label. Since we work on a classification task, we use the [cross-entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).
Instead of stochastic gradient descent [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html), we will use the [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html?highlight=adam#torch.optim.Adam) optimiser, which is more efficient. We could also use, e.g., [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html?highlight=adam#torch.optim.AdamW) as a modern variant.
We need to pass the model parameters (i.e. the network weights) to the optimiser. This essentially "tells" the optimiser which parameters need to be tuned during training.
%% Cell type:code id: tags:
``` python
##
## Optimizer and Loss Function
##
loss_fn = nn.CrossEntropyLoss()
# Choose the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```
%% Cell type:markdown id: tags:
# Network training
Here we define the functions that run over the data and either train the network or evaluate it on the test data.
The main skeleton is the same:
- we loop over the data in batches
- move the data to the device (CPU or GPU)
- calculate the predictions with the current state of the network (i.e. a forward pass)
- determine the quality of the prediction (i.e. the value of the loss function).
For the training loop, we then need to tell PyTorch to do the backward propagation and update the network weights.
Note:
- We need to tell PyTorch explicitly if we train the model, so that the weights can get updated: ```model.train()```. Otherwise, we switch to evaluation mode ```model.eval()```.
- Our network has 10 output nodes, i.e. each node tells us how likely it is that the image corresponds to the label represented by that node. We need the node with the highest value to identify the most likely classification, which we can do with ```argmax```.
%% Cell type:code id: tags:
``` python
#
# Training loop
#
def train_epoch(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    running_loss = 0.0

    # put the model into training mode
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # need to send the data tensors explicitly to the device we want to
        # train on. E.g. when we create the tensors (and load the data),
        # they get created on the CPU; if we want to train on the GPU,
        # we need to move them there explicitly.
        X, y = X.to(device), y.to(device)

        #
        # Forward pass - compute the output of the network
        # (do not call forward() directly), "apply" the model on the data,
        # compute the prediction error
        #
        pred = model(X)
        loss = loss_fn(pred, y)

        #
        # Backpropagation
        #
        # By default, the gradients accumulate, so we need to
        # reset them explicitly at each new training step
        optimizer.zero_grad()
        # Now we need to calculate the gradients in all layers
        # in the backward pass.
        # (This method belongs to the loss tensor, which is the root
        # of the computational graph.)
        loss.backward()
        # After we have calculated the gradients, we take a step
        # in the direction they suggest.
        # (This method belongs to the optimizer, which holds the parameters.)
        optimizer.step()

        # record the loss
        loss_item = loss.item()
        current = batch * len(X)
        running_loss += loss_item
        if batch % 100 == 0:
            print(f"loss: {loss_item:>7f} [{current:>5d}/{size:>5d}]")

    return running_loss/size


#
# Test Loop
#
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)

    # put the model into evaluation mode.
    # some aspects (e.g. dropout, etc.) behave differently during
    # training and inference/evaluation
    model.eval()
    test_loss, correct = 0, 0

    # PyTorch stores gradient information by default -
    # during inference we no longer need this as we won't update the
    # parameters anymore. Therefore, we can use "no_grad()" to switch that off.
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
    return correct
```
%% Cell type:markdown id: tags:
Now we use these functions for the actual training.
%% Cell type:code id: tags:
``` python
##
## The actual training/evaluation
##
epochs = 5

loss_values = []
accuracy_values = []

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    loss = train_epoch(train_dataloader, model, loss_fn, optimizer)
    loss_values.append(loss)
    accuracy = test(test_dataloader, model, loss_fn)
    accuracy_values.append(accuracy)
print("Done!")
```
%% Output
Epoch 1
-------------------------------
loss: 2.295028 [ 0/60000]
loss: 0.275703 [ 6400/60000]
loss: 0.188599 [12800/60000]
loss: 0.266648 [19200/60000]
loss: 0.154783 [25600/60000]
loss: 0.308687 [32000/60000]
loss: 0.121409 [38400/60000]
loss: 0.231967 [44800/60000]
loss: 0.269322 [51200/60000]
loss: 0.164584 [57600/60000]
Test Error:
Accuracy: 95.6%, Avg loss: 0.140970
Epoch 2
-------------------------------
loss: 0.064725 [ 0/60000]
loss: 0.066647 [ 6400/60000]
loss: 0.092921 [12800/60000]
loss: 0.104935 [19200/60000]
loss: 0.066522 [25600/60000]
loss: 0.107404 [32000/60000]
loss: 0.054125 [38400/60000]
loss: 0.118254 [44800/60000]
loss: 0.120747 [51200/60000]
loss: 0.099537 [57600/60000]
Test Error:
Accuracy: 96.4%, Avg loss: 0.119036
Epoch 3
-------------------------------
loss: 0.027532 [ 0/60000]
loss: 0.058633 [ 6400/60000]
loss: 0.061383 [12800/60000]
loss: 0.081768 [19200/60000]
loss: 0.034236 [25600/60000]
loss: 0.077802 [32000/60000]
loss: 0.035839 [38400/60000]
loss: 0.087089 [44800/60000]
loss: 0.073838 [51200/60000]
loss: 0.044092 [57600/60000]
Test Error:
Accuracy: 97.6%, Avg loss: 0.087102
Epoch 4
-------------------------------
loss: 0.020311 [ 0/60000]
loss: 0.062589 [ 6400/60000]
loss: 0.069213 [12800/60000]
loss: 0.100371 [19200/60000]
loss: 0.041693 [25600/60000]
loss: 0.034252 [32000/60000]
loss: 0.017241 [38400/60000]
loss: 0.060870 [44800/60000]
loss: 0.092723 [51200/60000]
loss: 0.042284 [57600/60000]
Test Error:
Accuracy: 96.9%, Avg loss: 0.127436
Epoch 5
-------------------------------
loss: 0.032525 [ 0/60000]
loss: 0.010098 [ 6400/60000]
loss: 0.069154 [12800/60000]
loss: 0.007468 [19200/60000]
loss: 0.008562 [25600/60000]
loss: 0.039671 [32000/60000]
loss: 0.023815 [38400/60000]
loss: 0.006657 [44800/60000]
loss: 0.048527 [51200/60000]
loss: 0.026487 [57600/60000]
Test Error:
Accuracy: 96.4%, Avg loss: 0.142745
Done!
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(2, sharex=True)
axes[0].set_ylabel("Loss", fontsize=14)
axes[0].plot(loss_values)
axes[1].set_ylabel("Accuracy", fontsize=14)
axes[1].set_xlabel("Iteration", fontsize=14)
axes[1].plot(accuracy_values)
plt.show()
```
%% Output
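%% Cell type:markdown id: tags:
As a quick sanity check (a sketch, not part of the original notebook), we can run the trained model on a single test image in evaluation mode and compare the predicted label with the truth.
%% Cell type:code id: tags:
``` python
# predict the label of the first image in a test batch (sanity check)
model.eval()
with torch.no_grad():
    X, y = next(iter(test_dataloader))
    X, y = X.to(device), y.to(device)
    pred = model(X)
    predicted_label = pred[0].argmax().item()

# move the image back to the CPU for plotting
plt.imshow(tv.transforms.ToPILImage()(X[0].cpu()), cmap='Greys')
print(f'Predicted label: {predicted_label}, true label: {y[0].item()}')
```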
%% Cell type:markdown id: tags:
Even with this very simple fully-connected network, we can achieve an accuracy of 96%-97%. This isn't bad as such; indeed, in many applications this would be considered phenomenal. However, the best classifiers achieve around 99.8%: the code at [MNIST-0.17](https://github.com/Matuzas77/MNIST-0.17) has an error rate of only 0.17%.
This illustrates why we use MNIST for learning how to use neural networks for image processing and how to set the system up - but the data are not suitable (anymore) to develop performant neural network architectures.
%% Cell type:markdown id: tags:
# Introduction to PyTorch
Modern deep learning typically builds on one of the major libraries that allow us to efficiently build models without having to deal with the underlying infrastructure and computations.
In particular, they free us from the burden of calculating all the mathematical formulae (mostly linear algebra) to train the network, and to do so efficiently on modern GPUs.
Mostly, two libraries are used: [PyTorch](https://pytorch.org/), developed by Meta/Facebook, or [TensorFlow](https://www.tensorflow.org/), developed by Alphabet/Google. As you can see, a lot of AI research is no longer driven by academia, but by industry. Each of these libraries has its strengths and weaknesses. However, the choice will not limit you in what you can do (only, perhaps, in how easy it is to implement something). We will choose PyTorch in this course as it has emerged in recent years as the most used package in research, both in academia and industry.
Before we use this library to do any deep learning, we will first cover the basics - using the humble linear regression as an example. Of course, this is efficiently provided by [Linear Regression in Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html); however, doing it in PyTorch serves as an example to:
- see what is going on step by step
- remind ourselves that, in the end, these libraries are also powerful optimisation libraries.
%% Cell type:code id: tags:
```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.optim as optim
```
%% Output
/home/kerzel/.cache/pypoetry/virtualenvs/datascienceintro-eVBNPtpL-py3.10/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
%% Cell type:markdown id: tags:
# Generate data
We will generate some dummy data. Essentially, we want to fit a straight line: we generate data according to a straight line and add some white (Gaussian) noise, scaled by a factor $\epsilon$:
$$
y = m \cdot x + b + \epsilon \cdot \mathcal{N}(0,1)
$$
%% Cell type:code id: tags:
```
# true parameters of the straight line and scale of the Gaussian noise
m_true = 3
b_true = 1
eps = 0.1

n_samples = 100
n_features = 1

# x is drawn uniformly from [0, 1); the noise is standard normal, scaled by eps.
# (The second argument of rand/randn is the number of columns, i.e. we create
# column vectors of shape (n_samples, 1).)
x = np.random.rand(n_samples, n_features)
y = m_true * x + b_true + eps * np.random.randn(n_samples, n_features)
```
%% Cell type:code id: tags:
```
plt.scatter(x,y)
plt.xlabel('x', fontsize=15)
plt.ylabel('y', fontsize=15)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
# Initialise
Before we initialise PyTorch, we need to find out if we have a CPU or a GPU available. For this simple example, a CPU is sufficient - but if we do have a GPU, we need to "move" all data and models to it.
We do this by checking whether CUDA is available (not just installed), i.e. whether an NVIDIA GPU and the corresponding drivers are present.
%% Cell type:code id: tags:
```
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
```
%% Output
cpu
%% Cell type:markdown id: tags:
Internally, PyTorch works with [Torch.Tensor](https://pytorch.org/docs/stable/tensors.html) as numeric data types, see also [PyTorch: Tensor](https://pytorch.org/tutorials/beginner/examples_tensor/polynomial_tensor.html).
In a sense, they are quite similar to a multidimensional [NumPy Array](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html). One major difference is that these tensors can be used both on CPU and on GPU. We can also specify the data type.
Therefore, to be able to use PyTorch, we need to convert our data to these tensors. Since we started from NumPy, we can use a convenience function provided by PyTorch to do this.
We need to do the following steps:
- convert the data from numpy array to PyTorch tensor
- specify the data type
- move the data to the device we want to run on.
The latter step is a crucial detail if we want to make use of, for example, a GPU: If we forget to move either the data or the model to the GPU, we cannot utilise it.
Note that we can chain the commands to make the code a bit more compact.
%% Cell type:code id: tags:
```
#
# convert the NumPy arrays with the data from above to PyTorch tensors
#
x_tensor = torch.from_numpy(x).float().to(device)
y_tensor = torch.from_numpy(y).float().to(device)
print(x_tensor.shape,y_tensor.shape)
print(x_tensor.type(),y_tensor.type())
```
%% Output
torch.Size([100, 1]) torch.Size([100, 1])
torch.FloatTensor torch.FloatTensor
%% Cell type:markdown id: tags:
In this case where we want to fit a linear regression model, we have two variables that we need to fit: the slope and the intercept of the line.
Therefore, we need to define two variables that we use in the fitting.
We initialise these variables using a standard Gaussian, so that they have well-defined, non-pathological values. We can do so using [torch.randn](https://pytorch.org/docs/stable/generated/torch.randn.html).
We need to specify a few parameters:
- ```size```: the size of the array. Here: 1, as we need a single number
- ```requires_grad = True```: to be able to use this variable in the minimisation/fitting procedure
- ```dtype```: here we use ```torch.float``` as all our variables are floating point numbers
- ```device```: we need to make sure that we "move" the variable to the device we are using, i.e. either a CPU or a GPU.
Finally, we print the tensors. Note that this gives us quite a bit of information: They are indeed of the datatype "tensor", hold a single value in array shape and are used during optimisation.
We can get the shape of the array also directly.
%% Cell type:code id: tags:
```
#
# define Torch variables - if a GPU is available, use this for the computation
# indicate that these variables will be optimized using gradient descent
#
m = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(m, b)
print(m.shape, b.shape)
```
%% Output
tensor([-0.4388], requires_grad=True) tensor([-1.1128], requires_grad=True)
torch.Size([1]) torch.Size([1])
%% Cell type:markdown id: tags:
# Model fit
Now we need to fit the model, i.e. our straight line.
We need the following ingredients to do so:
- an optimiser with suitable parameters
- an error metric
The optimizer is the algorithm that determines how to change the values of the parameters in our model (here: slope $m$ and intercept $b$) such that the resulting straight line fits "better" to the data. How good the fit is, is quantified by a loss (or: error) function.
In our case, we use [Stochastic Gradient Descent](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) as the optimiser. This is one of the simpler algorithms - it will work well in this case, but we would choose others in more involved applications. This optimiser needs at least the following parameters:
- learn rate: how big the steps are in each iteration when estimating new parameters
- number of epochs: the number of times we want to run the optimiser. Unfortunately, we cannot just say "run until it's done" but need to specify, in the simplest approach, how long we want to run for and then check whether this was sufficient.
Further, we use the mean squared error (MSE) as the loss function, i.e. the average squared difference between the predicted value ($\hat{y}_i$) and the true value ($y_i$):
$$
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2
$$
In this simple example, we do everything manually and step through the fitting. Later, when we define deep learning models, much of this will be done by helper functions.
%% Cell type:code id: tags:
```
#
# Fit the model
#
learn_rate = 1e-1
n_epochs = 1000

# define the optimiser used to learn the parameters
# (stochastic gradient descent)
optimiser = optim.SGD([m, b], lr=learn_rate)

loss_array = []

# training loop
for epoch in range(n_epochs):
    # prediction
    y_hat = m * x_tensor + b

    # use MSE as loss function
    error = y_tensor - y_hat
    loss = (error ** 2).mean()
    loss_array.append(loss.detach().numpy())

    # this computes the gradient of the loss function
    # i.e. d(loss)/d(parameter)
    loss.backward()

    # use the gradient to update the values of the parameters
    optimiser.step()

    # clear the accumulated gradients before the next iteration
    optimiser.zero_grad()

print('slope:', m, 'true slope: ', m_true)
print('intercept: ', b, 'true intercept: ', b_true)
```
%% Output
slope: tensor([3.0210], requires_grad=True) true slope: 3
intercept: tensor([0.9898], requires_grad=True) true intercept: 1
%% Cell type:markdown id: tags:
Next, we look at the values of the loss function during the optimisation.
As we can see, starting from the random values above, we make a big jump in the first few iterations, then the progress is much smaller.
However, looking at the plot on a log-scale, we can see that after the initial jump in the first iterations, there is a steady decline over the next 200 or so iterations, and only after roughly 300 iterations do we start to run into a plateau.
%% Cell type:code id: tags:
```
plt.plot(loss_array)
plt.yscale('log')
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
Finally, we compare the fitted model with our original data, as well as the "true" slope and intercept we have generated the data from.
Note that in order to be able to do this, we first need to "de-tensorfy" the variables. Behind the scenes, PyTorch uses a computational graph to do all calculations. If we want to use the variables, e.g. for plotting, we need to [detach](https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html) them from this graph. Further, we need to convert them back into NumPy format to be able to plot the data.
For the plot, we create a number of points on the $x$-axis (```np.linspace```) and then create a straight line: $y = m \cdot x + b$.
%% Cell type:code id: tags:
```
plt.scatter(x,y)
x_space = np.linspace(np.min(x), np.max(x))
plt.plot(x_space, m.detach().numpy()*x_space + b.detach().numpy(), label='Fit')
plt.plot(x_space, m_true*x_space +b_true, label='Truth')
plt.legend()
plt.xlabel('x', fontsize = 20)
plt.ylabel('y', fontsize = 20)
plt.show()
```
%% Output
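%% Cell type:markdown id: tags:
As a cross-check (a sketch that assumes scikit-learn is installed in the environment), we can fit the same data with the [Linear Regression in Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) mentioned above, which solves the least-squares problem directly, and compare the result with the gradient-descent fit.
%% Cell type:code id: tags:
```
# cross-check with scikit-learn's closed-form least-squares fit
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(x, y)   # x, y are the NumPy arrays generated above

print('sklearn slope:    ', lin_reg.coef_[0])
print('sklearn intercept:', lin_reg.intercept_)
```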
%% Cell type:markdown id: tags:
# Regression
In this example, we use the [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality) for a first, simple regression task.
These data consist of a range of features such as acidity or sugar levels, pH value, alcohol content, etc. and the aim is to predict a quality score in the range (0,10).
The data originate from the paper: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009. [Wine paper](https://www.sciencedirect.com/science/article/pii/S0167923609001377)
Note that the score is based on integer values - therefore, we could treat this either as a regression problem (like we do here), or as a classification problem with (up to) 10 different classes. The original paper considered a regression approach - we will therefore follow in their footsteps and treat this problem as a regression task.
We focus on red wines in this example - feel free to explore the white wines as an exercise.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
```
%% Cell type:markdown id: tags:
# Data
We can load the data directly from the public repository (or from a local backup).
%% Cell type:code id: tags:
``` python
df = pd.read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
sep=';')
df.head(5)
```
%% Output
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
3 11.2 0.28 0.56 1.9 0.075
4 7.4 0.70 0.00 1.9 0.076
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 11.0 34.0 0.9978 3.51 0.56
1 25.0 67.0 0.9968 3.20 0.68
2 15.0 54.0 0.9970 3.26 0.65
3 17.0 60.0 0.9980 3.16 0.58
4 11.0 34.0 0.9978 3.51 0.56
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
%% Cell type:code id: tags:
``` python
df.describe()
```
%% Output
fixed acidity volatile acidity citric acid residual sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
chlorides free sulfur dioxide total sulfur dioxide density \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 0.087467 15.874922 46.467792 0.996747
std 0.047065 10.460157 32.895324 0.001887
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 22.000000 0.995600
50% 0.079000 14.000000 38.000000 0.996750
75% 0.090000 21.000000 62.000000 0.997835
max 0.611000 72.000000 289.000000 1.003690
pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 3.311113 0.658149 10.422983 5.636023
std 0.154386 0.169507 1.065668 0.807569
min 2.740000 0.330000 8.400000 3.000000
25% 3.210000 0.550000 9.500000 5.000000
50% 3.310000 0.620000 10.200000 6.000000
75% 3.400000 0.730000 11.100000 6.000000
max 4.010000 2.000000 14.900000 8.000000
%% Cell type:markdown id: tags:
# Exploratory Data Analysis
First, we want to get a feel for the data and look at the various variables.
In this case, all variables are already in numeric format.
## Univariate analysis
First, we look at some individual variables to get a feel for their distribution.
Let's start by looking at the target variable ("quality"). As we can see, not all scores are used: there are very few very good or very bad wines, and most wines are average.
***Exercise***
Explore some more variables
%% Cell type:code id: tags:
``` python
g = sns.histplot(data=df, x='quality',bins=20)
plt.ylabel('count', size = 20)
plt.xlabel('Wine Quality', size = 20)
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
##
## explore some more variables
##
```
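%% Cell type:markdown id: tags:
As one possible example for the exercise above (a sketch; the choice of variable is an assumption), we can look at the distribution of the alcohol content in the same way as for the quality score.
%% Cell type:code id: tags:
``` python
# example for the exercise: distribution of the alcohol content
g = sns.histplot(data=df, x='alcohol', bins=20)
plt.ylabel('count', size=20)
plt.xlabel('Alcohol content', size=20)
plt.show()
```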
%% Cell type:markdown id: tags:
## Explore correlations
Machine learning essentially exploits linear and non-linear correlations between the features and the target (label).
Next, we want to get a feel for how the variables are correlated.
We can first print the correlation table of the variables and then plot the behaviour of individual pairs of variables.
To look at all combinations (or a subset), we can use the [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) from Seaborn, or plot a few individual combinations.
%% Cell type:code id: tags:
``` python
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt=".1f", cmap='seismic')
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
As we can see, some variables are strongly correlated with each other, but only three variables have a sizeable correlation with our target: volatile acidity, sulphates, and alcohol. We therefore suspect that these variables are the most important for the model.
%% Cell type:code id: tags:
``` python
sns.scatterplot(data=df, x='alcohol', y='volatile acidity', hue='quality' )
plt.xlabel('Alcohol content', size = 20)
plt.ylabel('Volatile acidity', size = 20)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
***Exercise***
Explore some more correlations between the variables and the target.
%% Cell type:code id: tags:
``` python
##
## explore some more correlations using plots of two variables like above, or a pairplot
##
```
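%% Cell type:markdown id: tags:
One possible way to approach the exercise (a sketch; the choice of variables follows the correlation matrix above) is a pair plot of the three variables most correlated with the target, using KDE contours for the off-diagonal panels.
%% Cell type:code id: tags:
``` python
# pair plot of the three variables most correlated with the quality score
sns.pairplot(df[['alcohol', 'volatile acidity', 'sulphates', 'quality']],
             hue='quality', kind='kde', corner=True)
plt.show()
```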
%% Cell type:markdown id: tags:
A contour plot in the pair plot for the three main variables shows some dependency - but overall, visual inspection reveals no obvious 1:1 relationship.
%% Cell type:markdown id: tags:
# Machine Learning
We now use a machine learning model to predict the quality index.
The model is trained on the training data and then evaluated on predictions made from the independent test data.
We follow the typical Scikit-Learn approach of:
- create an instance of the model
- call the ```fit``` method for the training data
- call the ```predict``` method for the test data.
As an example, we will use the [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
However, first we need to define the training/test data, the features and the labels.
The target (label) is the last column (quality) in the data-frame.
%% Cell type:code id: tags:
``` python
###
### Your code here
###
train_cols = # ...
label = # ...
X = df[train_cols]
y = df[label]
# split into training and test sample
X_train, X_test, y_train, y_test = # ....
```
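%% Cell type:markdown id: tags:
One possible way to fill in the cell above (a sketch): use all columns except ```quality``` as features, ```quality``` as the label, and ```train_test_split``` for the split. The split fraction and random seed below are assumptions; the exact split used to produce the outputs further down is not recorded, so those numbers may differ.
%% Cell type:code id: tags:
``` python
# one possible solution (split parameters are assumptions)
train_cols = df.columns[:-1]   # all feature columns, i.e. everything except 'quality'
label = 'quality'              # the target is the last column

X = df[train_cols]
y = df[label]

# split into training and test sample (here: 75% / 25%, fixed random seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```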
%% Cell type:code id: tags:
``` python
#define and fit the model
model = RandomForestRegressor(max_depth=20, random_state=0, n_estimators=15)
model.fit(X_train, y_train)
```
%% Output
RandomForestRegressor(max_depth=20, n_estimators=15, random_state=0)
%% Cell type:markdown id: tags:
Then, we obtain the predictions on the test data for the model.
For convenience, we make a copy of the test data and append the predictions, together with the true values, as the last columns of the dataframe.
%% Cell type:code id: tags:
``` python
predictions = X_test.copy()
y_hat = model.predict(predictions)
predictions.loc[:,'y_hat'] = y_hat
predictions.loc[:,'y'] = y_test
predictions.head(5)
```
%% Output
fixed acidity volatile acidity citric acid residual sugar chlorides \
1552 6.3 0.680 0.01 3.7 0.103
1333 9.1 0.775 0.22 2.2 0.079
660 7.2 0.520 0.07 1.4 0.074
1294 8.2 0.635 0.10 2.1 0.073
1339 7.5 0.510 0.02 1.7 0.084
free sulfur dioxide total sulfur dioxide density pH sulphates \
1552 32.0 54.0 0.99586 3.51 0.66
1333 12.0 48.0 0.99760 3.18 0.51
660 5.0 20.0 0.99730 3.32 0.81
1294 25.0 60.0 0.99638 3.29 0.75
1339 13.0 31.0 0.99538 3.36 0.54
alcohol y_hat y
1552 11.3 5.933333 6
1333 9.6 5.333333 5
660 9.6 5.733333 6
1294 10.9 6.066667 6
1339 10.5 6.000000 6
%% Cell type:markdown id: tags:
# Evaluation
### Global metrics
We now look at a few evaluation metrics.
First, we look at some global properties, such as the mean absolute error (MAD) and the mean squared error (MSE).
%% Cell type:code id: tags:
``` python
print('MAD: {0:0.2f}, MSE {1:0.2f}'.format(metrics.mean_absolute_error(y_test, y_hat),
metrics.mean_squared_error(y_test, y_hat)))
```
%% Output
MAD: 0.44, MSE 0.39
%% Cell type:markdown id: tags:
### Residuals
We now look at the residuals of the predictions. Ideally, we would expect "Gaussian noise": the predictions will not be perfect, but there should be no trend, and the deviations from the true values (in the test data) should be as small as possible and randomly distributed.
%% Cell type:code id: tags:
``` python
# The PredictionErrorDisplay helper (available in scikit-learn >= 1.2) can draw this directly:
#display = metrics.PredictionErrorDisplay(y_true=y_test, y_pred=y_hat)
#display.plot()
#plt.show()
```
%% Cell type:code id: tags:
``` python
```
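%% Cell type:markdown id: tags:
If the display helper is not available, a simple residual plot can be sketched directly with matplotlib (a sketch; ```y_test``` and ```y_hat``` come from the cells above): plot the residuals against the predicted values and look for structure - ideally the points scatter randomly around zero.
%% Cell type:code id: tags:
``` python
# simple residual plot: residuals vs. predicted quality
residuals = y_test - y_hat
plt.scatter(y_hat, residuals, alpha=0.5)
plt.axhline(0.0, color='black', linestyle='--')
plt.xlabel('Predicted quality', fontsize=15)
plt.ylabel('Residual (y - y_hat)', fontsize=15)
plt.show()
```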
%% Cell type:markdown id: tags:
### Profile Plot
We can also do a profile plot that illustrates how well the predictions work in various regions. Ideally, the data points on the model should be compatible with the diagonal line for all predictions.
%% Cell type:code id: tags:
``` python
import profile_plot
#fig, ax = plt.subplots(1, 1)
#out = profile_plot.pplot(y_test, y_hat, n_bins=10, yerr='var', ax=ax)
#plt.show()
```
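%% Cell type:markdown id: tags:
If the ```profile_plot``` helper module is not available, a rough stand-in can be built by hand (a sketch, not the module's implementation): bin the true quality values and show the mean and spread of the predictions in each bin against the diagonal.
%% Cell type:code id: tags:
``` python
# rough profile plot: mean prediction (with spread) per bin of the true quality
bins = np.linspace(y_test.min(), y_test.max(), 10)
bin_idx = np.digitize(y_test, bins)

bin_centres = [y_test[bin_idx == i].mean() for i in np.unique(bin_idx)]
pred_means  = [y_hat[bin_idx == i].mean() for i in np.unique(bin_idx)]
pred_stds   = [y_hat[bin_idx == i].std() for i in np.unique(bin_idx)]

plt.errorbar(bin_centres, pred_means, yerr=pred_stds, fmt='o', label='model')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', label='ideal')
plt.xlabel('True quality', fontsize=15)
plt.ylabel('Predicted quality', fontsize=15)
plt.legend()
plt.show()
```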
%% Cell type:markdown id: tags:
Think about what this means for our choice of modelling approach, for the model, and what we would need to think about to improve it.