"These scatter plots allow us to explore how the a single feature contributes to the predictions in more detail. Each point in the scatter-plot is a single prediction.\n",
"\n",
"The x-axis show the numerical value of the feature we want to test (e.g. alcohol content in this example), the y-axis the corresponding Shapley value for this feature.\n",
"In this example we can see that the importance raises with increasing levels of alcohol "
"In this example we can see that the importance raises with increasing levels of alcohol. "
]
},
{
...
...
@@ -450,7 +450,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, the individual features are shown with an indication how much they influence the deviatoin from the base value for in this event. Features in red indicate that they push the value up, features in blue push the value down. The relative size indicates by how much."
"Then, the individual features are shown with an indication how much they influence the deviation from the base value for this event. Features in red indicate that they push the value up, features in blue push the value down. The relative size indicates by how much."
]
},
{
...
...
%% Cell type:markdown id: tags:
# Explainable AI - Shapley Values
In this example, we use the [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality).
We have already explored this dataset in a dedicated exercise for regression, so we won't discuss the exploratory data analysis or model considerations here.
Instead, we will focus on how to make predictions explainable *post hoc*, i.e., after they have been made. This means we can use any (black box) prediction model and then use Shapley values to explore how the features contribute to its predictions, both at a global level and for individual predictions.
Shapley values originate from game theory and we can use them to analyze how much a given feature contributes to the prediction.
We can interpret the features as "players" that work together to achieve an outcome, in our case, the prediction of a machine learning model.
Just like in, say, a football game, the "players" have different strengths, and the team is not just a set of players: each player contributes in relation to the other players on the "team" (our model).
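For reference, the Shapley value $\phi_i$ of feature $i$ averages the feature's marginal contribution over all possible coalitions $S$ of the remaining features:
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left( v(S \cup \{i\}) - v(S) \right),$$
where $N$ is the set of all features and $v(S)$ is the prediction obtained when only the features in $S$ are used.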
In this example, we will use the [SHAP](https://shap.readthedocs.io/en/latest/) package for this.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
import shap
```
%% Cell type:markdown id: tags:
# Data and model
Load the data from the UCI archive, define the model, and run some predictions.
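Most of that code is not reproduced in this section; the sketch below shows one way it could look, assuming the red-wine CSV from the UCI archive, a `RandomForestRegressor`, and a `TreeExplainer` (the names `X_train`, `X_test`, `predictions`, `explainer`, and `shap_values` are reused by the cells below).
%% Cell type:code id: tags:
``` python
# Sketch only: URL, split, and hyperparameters are assumptions, not the original setup.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")

X = data.drop(columns="quality")
y = data["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Later cells index `predictions` like a feature DataFrame, so we assume it holds the test features
predictions = X_test.copy()

# TreeExplainer is a natural choice for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# Global view: mean absolute SHAP value per feature
shap.summary_plot(shap_values, X_train, plot_type="bar")
```
%% Cell type:markdown id: tags: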
The most important features are alcohol content, sulphates, and volatile acidity.
Recalling our previous exercise, these are also the features that are most highly correlated with the label / target variable, so this explanation fits nicely with our expectations.
We can look at this in more detail by exploring the impact of low or high values of these features:
%% Cell type:code id: tags:
``` python
shap.summary_plot(shap_values, X_train)
```
%% Output
No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored
%% Cell type:markdown id: tags:
Next, we look at [dependence plots](https://slundberg.github.io/shap/notebooks/plots/dependence_plot.html).
These scatter plots allow us to explore in more detail how a single feature contributes to the predictions. Each point in the scatter plot is a single prediction.
The x-axis shows the numerical value of the feature we want to test (e.g. alcohol content in this example), the y-axis shows the corresponding Shapley value for this feature.
In this example we can see that the importance rises with increasing levels of alcohol.
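The corresponding plotting call is not shown above; a minimal sketch, assuming the feature column is named `alcohol`:
%% Cell type:code id: tags:
``` python
# Dependence plot for a single feature; interaction_index=None keeps a single colour
shap.dependence_plot("alcohol", shap_values, X_train, interaction_index=None)
```
%% Cell type:markdown id: tags: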
We can also add a second feature variable to this plot.
This allows us to analyse whether there are any interactions between these two features. Such an interaction would show up as a distinct pattern in this plot.
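Again as a sketch, the second feature is passed via `interaction_index` ("sulphates" is an assumed choice here):
%% Cell type:code id: tags:
``` python
# Colour the alcohol dependence plot by a second (assumed) feature to look for interactions
shap.dependence_plot("alcohol", shap_values, X_train, interaction_index="sulphates")
```
%% Cell type:markdown id: tags: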
Further, we can look at all the interactions between the features. The diagonal shows the same plots as above, while the off-diagonal plots show if and how any pair of features interacts in the prediction.
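With a `TreeExplainer`, one way to obtain this matrix of plots is via SHAP interaction values, sketched here:
%% Cell type:code id: tags:
``` python
# One matrix per sample: main effects on the diagonal, pairwise interactions off it
shap_interaction_values = explainer.shap_interaction_values(X_train)
shap.summary_plot(shap_interaction_values, X_train)
```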
%% Cell type:markdown id: tags:
# Local interpretation
Next, we look at individual predictions. The ```base value``` is the mean prediction across the training data (this is all the model sees during training). The prediction for the current sample is denoted by ```f(x)```.
We can interpret this in the following way:
The predictions are split into the average (base) value plus the deviations that originate from the values of the features for a specific event and their interactions.
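In other words, the Shapley values add up exactly to the difference between the individual prediction and the base value:
$$f(x) = \underbrace{E[f(X)]}_{\text{base value}} + \sum_i \phi_i(x)$$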
Then, the individual features are shown with an indication of how much they influence the deviation from the base value for this event. Features in red indicate that they push the value up, features in blue push the value down. The relative size indicates by how much.
%% Cell type:code id: tags:
``` python
sample_id = 0
# Here we use the test data again.
# Note that the base value will then differ from the mean prediction, as the
# SHAP explainer has not seen the test data and we expect some statistical fluctuations.
shap_values = explainer.shap_values(predictions)
# initialise JavaScript for the visualization
shap.initjs()
shap.force_plot(explainer.expected_value,
                shap_values[sample_id],
                predictions.iloc[[sample_id]])
```
%% Output
<shap.plots._force.AdditiveForceVisualizer at 0x7f09488876a0>