"The data are about prices for houses in California, USA and the target variable is the natural logarithm of the median house price.\n",
"\n",
"We will use this example to explore how we an utilise hierarchical features and building a model pipeline in Scikit-Learn.\n",
"We will use this example to explore how we can utilise hierarchical features and build a model pipeline in Scikit-Learn.\n",
"\n",
"The data can be obtained from the web-page above or, as this is a popular training dataset, using the convenience function [fetch_california_housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) provided by scikit-learn. We will use a local copy in this exercise that has been obtained using this function."
]
...
...
@@ -592,7 +592,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the houses are located in areas we visually recognise as Californa, with areas such as Los Angeles or San Francisco densely populted and the deseart areas almost empty. The more expensive houses are also located in the popular cities, just as we would expect.\n",
"As we can see, the houses are located in areas we visually recognise as Californa, with areas such as Los Angeles or San Francisco densely populated and the deseart areas almost empty. The more expensive houses are also located in the popular cities, just as we would expect.\n",
"\n",
"However, we need to think about how we can make use of this information.\n"
]
...
...
@@ -604,7 +604,7 @@
"source": [
"# Machine Learning Model\n",
"\n",
"We start with a base model that contains the numerical features we can use straight away (i.e. without latitude and longitude)\n",
"We start with a base model that contains the numerical features we can use straight away (i.e. without latitude and longitude).\n",
"\n",
"We follow the typical Scikit-Learn approach of:\n",
"- create an instance of the model\n",
...
...
@@ -654,7 +654,7 @@
"metadata": {},
"source": [
"Then, we obtain the predictions on the test data for the model.\n",
"For convenience, we make a copy of the test data and append the predictions, together with the true values, as the last column to the dataframe"
"For convenience, we make a copy of the test data and append the predictions, together with the true values, as the last column to the dataframe."
]
},
{
...
...
%% Cell type:markdown id: tags:
# Regression
In this example, we use the public dataset [California Housing](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) that was first described in the paper: Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297.
The data are about prices for houses in California, USA and the target variable is the natural logarithm of the median house price.
We will use this example to explore how we can utilise hierarchical features and build a model pipeline in Scikit-Learn.
The data can be obtained from the web-page above or, as this is a popular training dataset, using the convenience function [fetch_california_housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) provided by scikit-learn. We will use a local copy in this exercise that has been obtained using this function.
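As a rough sketch of how such a local copy could be produced, the convenience function can be wrapped as below. The helper name `load_california_housing` is an assumption made for illustration and is not part of the original notebook. Note also that `fetch_california_housing` returns the median house value in units of 100,000 US dollars, so it is worth checking whether the local copy stores the raw value or its logarithm.

```python
from sklearn.datasets import fetch_california_housing


def load_california_housing():
    """Return the California Housing data as a single pandas DataFrame.

    Hypothetical helper: the first call downloads the data and caches it
    locally, so it needs network access once.
    """
    # as_frame=True makes the returned Bunch carry a DataFrame in `.frame`,
    # with the feature columns first and the target as the last column.
    bunch = fetch_california_housing(as_frame=True)
    return bunch.frame
```

Calling `load_california_housing().to_csv("housing.csv", index=False)` would then write a local copy; the file name here is only a placeholder.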
As we can see, the houses are located in an area we visually recognise as California: regions such as Los Angeles and San Francisco are densely populated, while the desert areas are almost empty. The more expensive houses are also located in the popular cities, just as we would expect.
However, we need to think about how we can make use of this information.
%% Cell type:markdown id: tags:
# Machine Learning Model
We start with a base model that contains the numerical features we can use straight away (i.e. without latitude and longitude).
We follow the typical Scikit-Learn approach of:
- create an instance of the model
- call the ```fit``` method for the training data
- call the ```predict``` method for the test data.
As an example, we will use the [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html).
However, first we need to define the training/test data, the features and the labels.
The target (label) is the last column in the dataframe.
Next, we want to exploit the information in the latitude and longitude. Individually, these variables are unlikely to be very helpful, but we have already seen in our exploratory data analysis that their combination has considerable predictive power. This can be explained with our domain knowledge, i.e. what we know or can reason about the distribution of (expensive) houses in California.
To do so, we need to define a **hierarchical** model: we first build a model just on latitude and longitude, and then use the output of this model as a feature in our more advanced model.
This means that we need to *transform* the data.
We can do so by building our own class that creates such a feature and then using it in a *pipeline* that passes the result on to our final model.
As an example, we use the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html).
This class takes the columns that should go into the new feature as input and adds a new column to the dataframe. Since we no longer use the original columns (their information is now captured by the new feature), we remove them from the dataframe.
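A minimal sketch of such a class is shown below. The class name `NeighbourTargetFeature`, its parameter names and the default of 10 neighbours are assumptions for illustration, not the notebook's actual implementation. The demonstration uses synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsRegressor


class NeighbourTargetFeature(BaseEstimator, TransformerMixin):
    """Replace `columns` by the prediction of a k-nearest-neighbours model."""

    def __init__(self, columns, new_column="neighbour_estimate", n_neighbors=10):
        self.columns = columns
        self.new_column = new_column
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # Fit the intermediate model on the selected columns only.
        self.knn_ = KNeighborsRegressor(n_neighbors=self.n_neighbors)
        self.knn_.fit(X[self.columns], y)
        return self

    def transform(self, X):
        # Work on a copy, add the intermediate prediction as a new column,
        # then drop the columns it was built from.
        X = X.copy()
        X[self.new_column] = self.knn_.predict(X[self.columns])
        return X.drop(columns=self.columns)


# Synthetic demonstration data (hypothetical values).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "MedInc": rng.uniform(1, 10, 200),
    "Latitude": rng.uniform(32, 42, 200),
    "Longitude": rng.uniform(-124, -114, 200),
})
y = pd.Series(rng.normal(size=200))

transformer = NeighbourTargetFeature(columns=["Latitude", "Longitude"])
X_new = transformer.fit_transform(X, y)
```

One caveat of this simple design: on the training rows, each point is among its own nearest neighbours, so the new feature looks somewhat more accurate there than it will on unseen data.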
Now we need to use this class to build an intermediate feature and pass this on to the final model.
We could do this manually; however, the [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) functionality of Scikit-Learn allows us to do this in a single step.
This has the added benefit that we do not need to ensure ourselves that every variable is transformed consistently during training and prediction, but can leave this to the "magic" of the pipeline.