Commit d4bb182e authored by Tom Reclik

Corrected typos

parent c7f25e7c
1 merge request: !1 Corrections trk
%% Cell type:markdown id: tags:
# Explainable AI: Explainable Boosting Machine
In this example we use the white-box model "EBM - Explainable Boosting Machine" from the package [InterpretML](https://github.com/interpretml/interpret) by Microsoft. The package supports a range of explainable AI tools, from white-box models to explainers for black-box models such as Shapley values, LIME, partial dependence plots and others.
EBM is based on Generalized Additive Models with Pairwise Interactions (GA2M) by Lou et al. ([Accurate Intelligible Models with Pairwise Interactions](https://www.cs.cornell.edu/~yinlou/papers/lou-kdd13.pdf)).
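For orientation, the GA2M model class can be written as below. The notation is ours rather than the notebook's: $g$ is the link function (the logit for binary classification), $\beta_0$ the intercept, $f_j$ the per-feature shape functions and $f_{jk}$ the pairwise interaction terms.

$$
g\bigl(\mathbb{E}[y]\bigr) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{j<k} f_{jk}(x_j, x_k)
$$

Because each term depends on at most two features, every $f_j$ can be plotted as a curve and every $f_{jk}$ as a heatmap, which is what makes the model inspectable.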
We will use the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult), which poses a binary classification task: predict whether or not a person makes more than 50k USD per year. The data are taken from a 1994 census and were first discussed in the paper [Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996](https://www.academia.edu/download/40088603/Scaling_Up_the_Accuracy_of_Naive-Bayes_C20151116-5477-1fw84ob.pdf).
The data have a number of categorical and numerical features.
We can access the data directly from the archive (or use a local copy).
Since we already discussed this example when we looked at classification, we won't do any exploratory data analysis here.
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
# white-box model EBM
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
# general parameters
n_splits = 3
random_state = 1337
```
%% Cell type:markdown id: tags:
## Data Access
Read in the data directly from the archive.
%% Cell type:code id: tags:
``` python
# The file has no header row, so the column names are assigned by hand.
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None)
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]
# All columns except the last are features; the last column is the label.
train_cols = df.columns[0:-1]
label = df.columns[-1]
X = df[train_cols]
y = df[label]
```
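%% Cell type:markdown id: tags:
One detail worth knowing about this file: fields are separated by a comma followed by a space, and unknown values are written as `?`. Read with the default settings, the text columns therefore keep a leading space (for example `" <=50K"`). The sketch below shows one possible way to handle this with the `skipinitialspace` and `na_values` options of `pd.read_csv`. It is not part of the original notebook, and the two sample rows and the reduced column list are made up for illustration.

```python
import io

import pandas as pd

# Illustrative rows in the same ", "-separated layout as adult.data
# (reduced to four columns; the second row is invented).
sample_text = "39, State-gov, Bachelors, <=50K\n54, ?, Some-college, >50K\n"
columns = ["Age", "WorkClass", "Education", "Income"]

# Default parsing keeps the space after each comma in text columns.
raw = pd.read_csv(io.StringIO(sample_text), header=None, names=columns)

# skipinitialspace drops that space; na_values turns "?" into NaN.
clean = pd.read_csv(
    io.StringIO(sample_text),
    header=None,
    names=columns,
    skipinitialspace=True,
    na_values="?",
)

print(repr(raw.loc[0, "Income"]))
print(repr(clean.loc[0, "Income"]))
print(clean["WorkClass"].isna().sum())
```

Whether to apply this is a judgment call: the model trains on the strings as loaded either way, but the cleaned labels are easier to read in the explanation widgets.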
%% Cell type:code id: tags:
``` python
df.head(5)
```
%% Output
       Age          WorkClass  fnlwgt   Education  EducationNum  \
    0   39          State-gov   77516   Bachelors            13
    1   50   Self-emp-not-inc   83311   Bachelors            13
    2   38            Private  215646     HS-grad             9
    3   53            Private  234721        11th             7
    4   28            Private  338409   Bachelors            13
             MaritalStatus          Occupation    Relationship    Race   Gender  \
    0        Never-married        Adm-clerical   Not-in-family   White     Male
    1   Married-civ-spouse     Exec-managerial         Husband   White     Male
    2             Divorced   Handlers-cleaners   Not-in-family   White     Male
    3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male
    4   Married-civ-spouse      Prof-specialty            Wife   Black   Female
       CapitalGain  CapitalLoss  HoursPerWeek   NativeCountry  Income
    0         2174            0            40   United-States   <=50K
    1            0            0            13   United-States   <=50K
    2            0            0            40   United-States   <=50K
    3            0            0            40   United-States   <=50K
    4            0            0            40            Cuba   <=50K
%% Cell type:markdown id: tags:
First, we split the data into a training and a test dataset, holding out 25% of the data for testing.
%% Cell type:code id: tags:
``` python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=random_state)
```
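%% Cell type:markdown id: tags:
The split above is purely random. If the class proportions should be identical in the training and test parts, `train_test_split` also accepts a `stratify` argument. The following is a minimal sketch on made-up labels (76 low-income and 24 high-income cases, chosen only for illustration), not a change to the notebook's split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels; the features are just row indices.
labels = ["<=50K"] * 76 + [">50K"] * 24
features = [[i] for i in range(len(labels))]

# stratify=labels keeps the class ratio the same in both parts.
feat_train, feat_test, lab_train, lab_test = train_test_split(
    features,
    labels,
    test_size=0.25,
    random_state=1337,
    stratify=labels,
)

print(Counter(lab_train))
print(Counter(lab_test))
```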
%% Cell type:markdown id: tags:
Now we create an instance of the model and train it.
Note that the explainable boosting classifier can work directly on the categorical variables as text, i.e. we do not need to transform them to a numerical representation.
%% Cell type:code id: tags:
``` python
model = ExplainableBoostingClassifier(random_state=random_state)
model.fit(X_train, y_train)
```
%% Output
    ExplainableBoostingClassifier(random_state=1337)
%% Cell type:markdown id: tags:
# Global Interpretation
First, we look at the global behaviour of the model, i.e. what it has learned across the whole training set.
In particular, the "summary" page shows the importance of each feature.
We can then dive into individual features.
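To make "importance" concrete: as far as we understand the InterpretML defaults, a term's overall importance is the mean absolute value of its contribution across the training samples. The sketch below reproduces that calculation in plain Python on invented contribution values (in log-odds); the feature names are taken from the dataset, but the numbers are not model output.

```python
# Invented per-sample contributions (log-odds) for three features.
contributions = {
    "Age": [0.4, -0.2, 0.1, -0.5],
    "CapitalGain": [1.5, -0.1, -0.1, -0.1],
    "HoursPerWeek": [0.2, 0.0, -0.3, 0.1],
}


def mean_absolute(scores):
    """Average magnitude of a term's contributions, ignoring the sign."""
    return sum(abs(score) for score in scores) / len(scores)


importance = {name: mean_absolute(scores) for name, scores in contributions.items()}

# Sort feature names from most to least important.
ranking = sorted(importance, key=importance.get, reverse=True)

for name in ranking:
    print(f"{name}: {importance[name]:.2f}")
```

A feature can rank highly because it matters a little for everyone or a lot for a few cases, so it is worth checking the per-feature plot as well as the summary.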
%% Cell type:code id: tags:
``` python
global_explanation = model.explain_global()
show(global_explanation)
```
%% Output
%% Cell type:markdown id: tags:
# Local Interpretation
As this is a "white box" model, we can look at the details of individual predictions.
In the interactive widget, we can select individual cases and then analyse how the features contribute to the classification result.
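The reason a single prediction can be decomposed this way is that the model is additive: the score of one case is the intercept plus one contribution per term, and the logistic function turns that score into a probability. A minimal sketch with invented numbers (the intercept and per-term scores below are hypothetical, not taken from the fitted model):

```python
import math

# Hypothetical intercept and per-term contributions (log-odds) for one person.
intercept = -1.2
term_scores = {
    "Age": 0.4,
    "MaritalStatus": 0.9,
    "CapitalGain": -0.1,
    "HoursPerWeek": 0.2,
}

# Additive model: the logit is simply the sum of all parts.
logit = intercept + sum(term_scores.values())

# The logistic function maps the logit to the probability of ">50K".
probability = 1.0 / (1.0 + math.exp(-logit))
predicted = ">50K" if probability >= 0.5 else "<=50K"

print(f"logit={logit:.2f} probability={probability:.3f} predicted={predicted}")
```

Each bar in the local explanation corresponds to one of these per-term contributions, so their signs show which features pushed the prediction up or down.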
%% Cell type:code id: tags:
``` python
# Explain the predictions for the first five test cases
local_explanation = model.explain_local(X_test.iloc[0:5], y_test.iloc[0:5])
```
%% Cell type:code id: tags:
``` python
show(local_explanation)
```
%% Output