%% Cell type:markdown id: tags:
# Regression
In this example, we use the public dataset [California Housing](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) that was first described in the paper: Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297.
The data are about prices for houses in California, USA and the target variable is the natural logarithm of the median house price.
We will use this example to explore how we can utilise hierarchical features and build a model pipeline in Scikit-Learn.
The data can be obtained from the web-page above or, as this is a popular training dataset, using the convenience function [fetch_california_housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) provided by scikit-learn. We will use a local copy in this exercise that has been obtained using this function.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn import metrics
from sklearn import base
from sklearn import neighbors
from sklearn import pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
import profile_plot
```
%% Cell type:markdown id: tags:
# Data
First, we load a local copy of the data that was obtained earlier with the convenience function mentioned above.
Note that the first column is just the dataset index of each record - we need to drop it before we build our features later.
%% Cell type:code id: tags:
``` python
##
## use convenience function to access data and save a local copy.
##
#df = fetch_california_housing(return_X_y=False, as_frame=True).frame
#df.to_csv('california_housing.csv')
##
## here we load the local copy to avoid internet traffic
##
df = pd.read_csv('data/housing/california_housing.csv')
df.head(5)
```
%% Output
Unnamed: 0 MedInc HouseAge AveRooms AveBedrms Population AveOccup \
0 0 8.3252 41.0 6.984127 1.023810 322.0 2.555556
1 1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842
2 2 7.2574 52.0 8.288136 1.073446 496.0 2.802260
3 3 5.6431 52.0 5.817352 1.073059 558.0 2.547945
4 4 3.8462 52.0 6.281853 1.081081 565.0 2.181467
Latitude Longitude MedHouseVal
0 37.88 -122.23 4.526
1 37.86 -122.22 3.585
2 37.85 -122.24 3.521
3 37.85 -122.25 3.413
4 37.85 -122.25 3.422
%% Cell type:code id: tags:
``` python
df = df.drop(columns=['Unnamed: 0'])
df.head(5)
```
%% Output
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
Longitude MedHouseVal
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
%% Cell type:code id: tags:
``` python
df.describe()
```
%% Output
MedInc HouseAge AveRooms AveBedrms Population \
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744
std 1.899822 12.585558 2.474173 0.473911 1132.462122
min 0.499900 1.000000 0.846154 0.333333 3.000000
25% 2.563400 18.000000 4.440716 1.006079 787.000000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000
max 15.000100 52.000000 141.909091 34.066667 35682.000000
AveOccup Latitude Longitude MedHouseVal
count 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.070655 35.631861 -119.569704 2.068558
std 10.386050 2.135952 2.003532 1.153956
min 0.692308 32.540000 -124.350000 0.149990
25% 2.429741 33.930000 -121.800000 1.196000
50% 2.818116 34.260000 -118.490000 1.797000
75% 3.282261 37.710000 -118.010000 2.647250
max 1243.333333 41.950000 -114.310000 5.000010
%% Cell type:markdown id: tags:
# Exploratory Data Analysis
Before we start with the modelling, we want to get a feel for the data and look at some variables.
We start with some univariate plots.
First, the target (MedHouseVal). This is the natural log of the median house value in a given area.
***Discuss***
What do you notice about the target variable and what will this likely mean for the development of a model?
%% Cell type:code id: tags:
``` python
##
## your code here
##
```
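%% Cell type:markdown id: tags:
One possible sketch for this cell (your own solution may well look different): a histogram of the target column `MedHouseVal`, analogous to the plots used later in this notebook; the bin count is an arbitrary choice.
%% Cell type:code id: tags:
``` python
##
## sketch: univariate distribution of the target variable
## (bin count is an arbitrary choice)
##
sns.histplot(data=df, x='MedHouseVal', bins=120)
plt.xlabel('MedHouseVal', size=20)
plt.ylabel('count', size=20)
plt.show()
```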
%% Cell type:markdown id: tags:
***Discuss***\
What does this mean for the model development?
We now look at a few other plots.
%% Cell type:code id: tags:
``` python
##
## your code here
##
```
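%% Cell type:markdown id: tags:
As a possible sketch for "a few other plots", the pandas `hist` method draws a histogram of every numerical column in one go; figure size and bin count are arbitrary choices.
%% Cell type:code id: tags:
``` python
##
## sketch: univariate histograms of all numerical columns
##
df.hist(figsize=(12, 8), bins=50)
plt.tight_layout()
plt.show()
```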
%% Cell type:code id: tags:
``` python
##
## also do some 2D scatter plots or similar
##
```
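%% Cell type:markdown id: tags:
A possible sketch for a 2D view: a scatter plot of median income against the target, both of which are columns of the frame above; the point size is an arbitrary choice.
%% Cell type:code id: tags:
``` python
##
## sketch: 2D scatter plot of one feature against the target
##
sns.scatterplot(data=df, x='MedInc', y='MedHouseVal', s=2)
plt.xlabel('MedInc', size=20)
plt.ylabel('MedHouseVal', size=20)
plt.show()
```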
%% Cell type:markdown id: tags:
### Map data
The dataset also contains the latitude and longitude of the properties.
First, we make a scatter-plot to explore this data:
%% Cell type:code id: tags:
``` python
sns.scatterplot(data=df, x='Latitude', y='Longitude', hue='MedHouseVal', s=0.3 )
plt.xlabel('Latitude', size = 20)
plt.ylabel('Longitude', size = 20)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
As we can see, the houses are located in areas we visually recognise as California, with areas such as Los Angeles or San Francisco densely populated and the desert areas almost empty. The more expensive houses are also located in the popular cities, just as we would expect.
However, we need to think about how we can make use of this information.
%% Cell type:markdown id: tags:
# Machine Learning Model
We start with a base model that contains the numerical features we can use straight away (i.e. without latitude and longitude).
We follow the typical Scikit-Learn approach of:
- create an instance of the model
- call the ```fit``` method for the training data
- call the ```predict``` method for the test data.
As an example, we will use the [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html).
However, first we need to define the training/test data, the features and the labels.
The target (label) is the last column in the data-frame.
%% Cell type:code id: tags:
``` python
train_cols = ['MedInc', 'HouseAge','AveRooms','AveBedrms','Population','AveOccup']
label = df.columns[-1]
X = df[train_cols]
y = df[label]
# split into training and test sample
###
### Your code here
###
#X_train, X_test, y_train, y_test = # ....
```
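%% Cell type:markdown id: tags:
A minimal sketch for the split, assuming a 25% test fraction and a fixed random seed (both arbitrary choices):
%% Cell type:code id: tags:
``` python
##
## sketch: hold out 25% of the data as a test sample
## (test_size and random_state are arbitrary choices)
##
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```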
%% Cell type:code id: tags:
``` python
###
### Your code here
###
#model = ...
```
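%% Cell type:markdown id: tags:
A minimal sketch for the model cell, using the `HistGradientBoostingRegressor` imported above with its default hyperparameters:
%% Cell type:code id: tags:
``` python
##
## sketch: create the regressor and fit it on the training sample
##
model = HistGradientBoostingRegressor()
model.fit(X_train, y_train)
```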
%% Cell type:markdown id: tags:
Then, we obtain the model's predictions on the test data.
For convenience, we make a copy of the test data and append the predictions, together with the true values, as the last columns of the dataframe.
%% Cell type:code id: tags:
``` python
predictions = X_test.copy()
y_hat = model.predict(predictions)
predictions.loc[:,'y_hat'] = y_hat
predictions.loc[:,'y'] = y_test
predictions.head(5)
```
%% Output
MedInc HouseAge AveRooms AveBedrms Population AveOccup y_hat \
12277 2.8783 5.0 5.821086 1.172524 798.0 2.549521 1.206398
14831 2.8089 30.0 4.400000 1.079339 1997.0 3.300826 1.599351
14099 2.2287 29.0 3.707865 1.114607 882.0 1.982022 1.911623
11269 3.6531 18.0 5.642638 1.064417 1820.0 2.791411 1.572099
16478 3.3929 28.0 5.673961 1.065646 1365.0 2.986871 1.359892
y
12277 1.052
14831 1.206
14099 1.400
11269 1.435
16478 1.130
%% Cell type:markdown id: tags:
Now we plot the distribution of predictions and true values, together with some global metrics.
%% Cell type:code id: tags:
``` python
print('MAE: {0:0.2f}, MSE: {1:0.2f}'.format(metrics.mean_absolute_error(y_test, y_hat),
metrics.mean_squared_error(y_test, y_hat)))
sns.histplot(data=predictions, x='y',bins=120)
sns.histplot(data=predictions, x='y_hat',bins=120)
plt.xlabel('ln(property value)', size = 20)
plt.ylabel('count', size = 20)
plt.show()
```
%% Cell type:markdown id: tags:
We also look at the residuals and the profile plot.
%% Cell type:code id: tags:
``` python
display = metrics.PredictionErrorDisplay(y_true=y_test, y_pred=y_hat)
display.plot()
plt.show()
```
%% Cell type:markdown id: tags:
What do we notice about the residuals?
%% Cell type:code id: tags:
``` python
fig, ax = plt.subplots(1, 1)
out = profile_plot.pplot(y_test, y_hat, n_bins=50, yerr='var', ax=ax)
plt.show()
```
%% Cell type:markdown id: tags:
What do we notice about the profile plot?
%% Cell type:markdown id: tags:
## Feature Engineering
Next, we want to exploit the information from the latitude and longitude. Individually, these variables are likely not very helpful - but we have already seen in our exploratory data analysis that the combination of these two variables has quite some predictive power that can be well explained using our domain knowledge, i.e. what we know or can reason about the distribution of (expensive) houses in California.
To do so, we need to define a **hierarchical** model: We first build a model just on latitude and longitude, and then use the output of this model in our more advanced model.
This means that we need to *transform* the data.
We can do so by building our own class that creates such a feature and then use this in a *pipeline* to pass it on to our final model.
As an example, we use the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html).
This class takes the columns that should go into the new feature as input and adds a new column to the dataframe. Since we no longer use the original columns (their information is now taken into account), we remove them from the dataframe.
%% Cell type:code id: tags:
``` python
class KnearestNeigbourAsFeature(base.BaseEstimator):
    """Replace a set of columns by the prediction of a k-nearest-neighbour regressor fitted on them."""

    def __init__(self, columns):
        self.columns = columns
        self.subest_ = neighbors.KNeighborsRegressor(n_neighbors=25)

    def fit(self, X, y=None):
        # fit the internal k-NN regressor on the selected columns only
        Xtemp = X[self.columns]
        self.subest_.fit(Xtemp, y)
        return self

    def transform(self, X):
        # add the k-NN prediction as a new feature ...
        pred = self.subest_.predict(X[self.columns])
        X['KNN'] = pred
        # ... and drop the original columns, whose information is now encoded in 'KNN'
        for col in self.columns:
            del X[col]
        return X
```
%% Cell type:markdown id: tags:
Now we need to use this class to build an intermediate feature and pass this on to the final model.
We could do this manually, however, the [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) functionality of Scikit-Learn allows us to do this in a single step.
This has the added benefit that we do not need to ensure that all variables are treated properly in the process ourselves, but can leave this to the "magic" of the pipeline.
%% Cell type:code id: tags:
``` python
#now we take all columns
train_cols = df.columns[0:-1]
label = df.columns[-1]
X = df[train_cols]
y = df[label]
# split into training and test sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# define the intermediate (hierarchical) feature using our new class
columns = ["Longitude", "Latitude"]
k_nearest = KnearestNeigbourAsFeature(columns)
# add this into a pipeline with our model
pipe = pipeline.Pipeline(
[('k-nearest', k_nearest),
('regression', model)])
# fit the whole pipeline - this can be done exactly like for one model.
pipe.fit(X_train, y_train)
```
%% Output
Pipeline(steps=[('k-nearest',
KnearestNeigbourAsFeature(columns=['Longitude', 'Latitude'])),
('regression', HistGradientBoostingRegressor())])
%% Cell type:markdown id: tags:
Now we evaluate the improved model in the same way as before.
%% Cell type:code id: tags:
``` python
predictions = X_test.copy()
y_hat = pipe.predict(predictions)
predictions.loc[:,'y_hat'] = y_hat
predictions.loc[:,'y'] = y_test
predictions.head(5)
```
%% Output
MedInc HouseAge AveRooms AveBedrms Population AveOccup KNN \
15824 2.7173 52.0 4.245161 1.129032 935.0 2.010753 2.998960
13208 2.7321 31.0 4.703390 1.114407 1662.0 3.521186 1.335040
15072 3.9630 28.0 5.427894 0.976281 2672.0 2.535104 1.719960
9012 5.8735 35.0 5.811639 1.056662 1521.0 2.329250 4.095242
132 7.4882 35.0 6.467269 1.038375 1072.0 2.419865 3.824682
y_hat y
15824 3.201904 3.000
13208 1.189337 1.043
15072 1.740126 1.644
9012 4.446608 4.481
132 4.434946 3.193
%% Cell type:code id: tags:
``` python
print('MAE: {0:0.2f}, MSE: {1:0.2f}'.format(metrics.mean_absolute_error(y_test, y_hat),
metrics.mean_squared_error(y_test, y_hat)))
sns.histplot(data=predictions, x='y',bins=120)
sns.histplot(data=predictions, x='y_hat',bins=120)
plt.xlabel('ln(property value)', size = 20)
plt.ylabel('count', size = 20)
plt.show()
```
%% Cell type:markdown id: tags:
We can see that the model has improved considerably with this new information.
What do we notice?
%% Cell type:code id: tags:
``` python
display = metrics.PredictionErrorDisplay(y_true=y_test, y_pred=y_hat)
display.plot()
plt.show()
```
%% Cell type:markdown id: tags:
What do we notice?
%% Cell type:code id: tags:
``` python
fig, ax = plt.subplots(1, 1)
out = profile_plot.pplot(y_test, y_hat, n_bins=50, yerr='var', ax=ax)
plt.show()
```
%% Cell type:markdown id: tags:
What do we notice?
**Discuss**
- What have we achieved?
- What did the new hierarchical feature do?
- What are the remaining issues and how could we tackle them?