Corrected typos

Commit d4bb182e authored Mar 29, 2023 by Tom Reclik
--- a/datascienceintro/ExplainableAI_Classification_Adult_EBM.ipynb
+++ b/datascienceintro/ExplainableAI_Classification_Adult_EBM.ipynb
@@ -15,7 +15,7 @@
        "EBM is based on Generalized Additive Models with Pairwise Interactions (GA2M) by Lou et al. ([Accurate Intelligible Models with Pairwise Interactions](https://www.cs.cornell.edu/~yinlou/papers/lou-kdd13.pdf))\n",
        "\n",
        "\n",
-        "We will use the [adult](https://archive.ics.uci.edu/ml/datasets/adult) that focuses on a (binary) classification task whether or not a person makes more than 50k USD per year. The data are taken from a 1994 census and were first discussed in the paper [Ron Kohavi, \"Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid\", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996](https://www.academia.edu/download/40088603/Scaling_Up_the_Accuracy_of_Naive-Bayes_C20151116-5477-1fw84ob.pdf)\n",
+        "We will use the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult) that focuses on a (binary) classification task whether or not a person makes more than 50k USD per year. The data are taken from a 1994 census and were first discussed in the paper [Ron Kohavi, \"Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid\", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996](https://www.academia.edu/download/40088603/Scaling_Up_the_Accuracy_of_Naive-Bayes_C20151116-5477-1fw84ob.pdf)\n",
        "\n",
        "The data have a number of categorial and numerical features.\n",
        "We can access the data directly from the archive (or use a local copy).\n",
@@ -348,7 +348,7 @@
        "\n",
        "As this is a \"white box\" model, we can look at the details of individual predictions.\n",
        "\n",
-        "In the interactive widget, we can select individual cases and then analyse how the features contribute to the classification result\n"
+        "In the interactive widget, we can select individual cases and then analyse how the features contribute to the classification result.\n"
      ]
    },
    {

 %% Cell type:markdown id: tags:
 # Explainable AI: Explainable Boosting Machine
 In this example we use the white-box model "EBM - Explainable Boosting Machine" from the package [InterpretML](https://github.com/interpretml/interpret) by Microsoft. The package supports a range of explainable AI tools, from white-box models to explainers for black-box models such as Shapley values, LIME, partial dependence plots and others.
 EBM is based on Generalized Additive Models with Pairwise Interactions (GA2M) by Lou et al. ([Accurate Intelligible Models with Pairwise Interactions](https://www.cs.cornell.edu/~yinlou/papers/lou-kdd13.pdf)).
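For orientation, the GA2M model form referred to above can be sketched as follows. The notation is ours and not taken from the notebook: $g$ is the link function (the logit for binary classification), $\beta_0$ an intercept, $f_j$ a one-dimensional shape function per feature, and $f_{ij}$ an interaction term for each pair in a selected set $S$.

$$
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{(i,j) \in S} f_{ij}(x_i, x_j)
$$

Because every term depends on at most two features, each one can be plotted and inspected on its own, which is what makes the model a white box.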
-We will use the [adult](https://archive.ics.uci.edu/ml/datasets/adult) that focuses on a (binary) classification task whether or not a person makes more than 50k USD per year. The data are taken from a 1994 census and were first discussed in the paper [Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996](https://www.academia.edu/download/40088603/Scaling_Up_the_Accuracy_of_Naive-Bayes_C20151116-5477-1fw84ob.pdf)
+We will use the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult) that focuses on a (binary) classification task whether or not a person makes more than 50k USD per year. The data are taken from a 1994 census and were first discussed in the paper [Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996](https://www.academia.edu/download/40088603/Scaling_Up_the_Accuracy_of_Naive-Bayes_C20151116-5477-1fw84ob.pdf)
 The data have a number of categorical and numerical features.
 We can access the data directly from the archive (or use a local copy).
 Since we already discussed this example when we looked at classification, we won't do any exploratory data analysis here.
 %% Cell type:code id: tags:
 ``` python
 import pandas as pd
 from sklearn.model_selection import train_test_split
 import matplotlib.pyplot as plt
 import seaborn as sns
 # white-box model EBM
 from interpret.glassbox import ExplainableBoostingClassifier
 from interpret import show
 # general parameters
 n_splits = 3
 random_state = 1337
 ```
 %% Cell type:markdown id: tags:
 ## Data Access
 Read in data directly from the archive.
 %% Cell type:code id: tags:
 ``` python
 df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None)
 df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
 ]
 train_cols = df.columns[0:-1]
 label = df.columns[-1]
 X = df[train_cols]
 y = df[label]
 ```
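The raw `adult.data` file separates fields with a comma followed by a space and marks missing values with `?`, so string values read as above keep a leading blank (for example `" <=50K"`). A minimal sketch of a cleaner read, assuming that file layout, is shown below. The two in-memory rows are a hand-typed stand-in so the snippet runs offline; the second row is purely illustrative and not taken from the dataset.

```python
import io

import pandas as pd

# Hand-typed stand-in for adult.data (same field layout); the second row is illustrative.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 60, United-States, >50K\n"
)

columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income",
]

df_clean = pd.read_csv(
    sample,
    header=None,
    names=columns,
    skipinitialspace=True,  # drop the blank that follows each comma
    na_values="?",          # treat "?" as a missing value
)

print(df_clean[["WorkClass", "Occupation", "Income"]])
```

To apply the same options to the real file, the URL from the cell above can be passed in place of `sample`.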
 %% Cell type:code id: tags:
 ``` python
 df.head(5)
 ```
 %% Output
       Age          WorkClass  fnlwgt   Education  EducationNum  \
    0   39          State-gov   77516   Bachelors            13
    1   50   Self-emp-not-inc   83311   Bachelors            13
    2   38            Private  215646     HS-grad             9
    3   53            Private  234721        11th             7
    4   28            Private  338409   Bachelors            13
             MaritalStatus          Occupation    Relationship    Race   Gender  \
    0        Never-married        Adm-clerical   Not-in-family   White     Male
    1   Married-civ-spouse     Exec-managerial         Husband   White     Male
    2             Divorced   Handlers-cleaners   Not-in-family   White     Male
    3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male
    4   Married-civ-spouse      Prof-specialty            Wife   Black   Female
       CapitalGain  CapitalLoss  HoursPerWeek   NativeCountry  Income
    0         2174            0            40   United-States   <=50K
    1            0            0            13   United-States   <=50K
    2            0            0            40   United-States   <=50K
    3            0            0            40   United-States   <=50K
    4            0            0            40            Cuba   <=50K
 %% Cell type:markdown id: tags:
 First, we split the data into training and test sets, holding out 25% of the rows for testing.
 %% Cell type:code id: tags:
 ``` python
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=random_state)
 ```
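When the two classes are not equally frequent, a purely random split can leave slightly different class shares in the training and test sets. A possible variant, sketched here on synthetic stand-in data (the arrays `X_demo` and `y_demo` are hypothetical and not part of the notebook), passes `stratify` so both parts keep the same share of each class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 24% of them in the positive class.
rng = np.random.default_rng(1337)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array(["<=50K"] * 760 + [">50K"] * 240)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.25,
    random_state=1337,
    stratify=y_demo,  # keep the class proportions equal in both parts
)

share_train = np.mean(y_tr == ">50K")
share_test = np.mean(y_te == ">50K")
print(f"positive share: train={share_train:.3f}, test={share_test:.3f}")
```

In the notebook itself this would correspond to adding `stratify=y` to the call above.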
 %% Cell type:markdown id: tags:
 Now we create an instance of the model and train it.
 Note that the explainable boosting classifier can work directly on categorical variables stored as text, i.e. we do not need to transform them into a numerical representation.
 %% Cell type:code id: tags:
 ``` python
 model = ExplainableBoostingClassifier(random_state=random_state)
 model.fit(X_train, y_train)
 ```
 %% Output
    ExplainableBoostingClassifier(random_state=1337)
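Before interpreting the model, it can be useful to check how well it predicts on the held-out data. The helper below is a minimal sketch, not part of the original notebook: `evaluate_classifier` is a hypothetical name, and it only assumes a fitted binary classifier that offers `predict` and `predict_proba` in the scikit-learn style.

```python
from sklearn.metrics import accuracy_score, roc_auc_score


def evaluate_classifier(fitted_model, X_eval, y_eval):
    """Return accuracy and ROC AUC of a fitted binary classifier."""
    predicted_labels = fitted_model.predict(X_eval)
    # Column 1 of predict_proba refers to the second entry of fitted_model.classes_.
    positive_scores = fitted_model.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, predicted_labels),
        "roc_auc": roc_auc_score(y_eval, positive_scores),
    }
```

With the objects defined above, a call would look like `evaluate_classifier(model, X_test, y_test)`.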
 %% Cell type:markdown id: tags:
 # Global Interpretation
 First, we look at the global features of the model.
 In particular, the "summary" page will show us the importance of each feature.
 We can then dive into individual features.
 %% Cell type:code id: tags:
 ``` python
 global_explanation = model.explain_global()
 show(global_explanation)
 ```
 %% Output
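To give a feeling for what the summary ranking expresses, the sketch below computes an importance per term as the mean absolute contribution over a set of samples. We assume this corresponds to the default importance measure in InterpretML; the term names are borrowed from the dataset, but the contribution values are made up for illustration.

```python
import numpy as np

# Hypothetical per-sample contributions (rows: samples, columns: terms).
term_names = ["Age", "HoursPerWeek", "EducationNum"]
contributions = np.array([
    [0.5, -0.1, 0.0],
    [-0.3, 0.2, 0.1],
    [0.4, -0.3, -0.1],
    [-0.8, 0.0, 0.2],
])

# Importance of a term: mean absolute contribution across the samples.
importances = np.abs(contributions).mean(axis=0)

# Sort the terms from most to least important.
ranking = sorted(zip(term_names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

In recent InterpretML releases, `model.term_importances()` should return the corresponding values for the fitted model; we have not verified this against the version used in this notebook.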
 %% Cell type:markdown id: tags:
 # Local Interpretation
 As this is a "white box" model, we can look at the details of individual predictions.
-In the interactive widget, we can select individual cases and then analyse how the features contribute to the classification result
+In the interactive widget, we can select individual cases and then analyse how the features contribute to the classification result.
 %% Cell type:code id: tags:
 ``` python
 # Explain the predictions for the first five test cases
 local_explanation = model.explain_local(X_test.iloc[0:5], y_test.iloc[0:5])
 ```
 %% Cell type:code id: tags:
 ``` python
 show(local_explanation)
 ```
 %% Output
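The bars in the local view can be read as additive contributions on the log-odds scale. As a minimal sketch with made-up numbers (the intercept and the contributions below are hypothetical, not values from the fitted model), the predicted probability follows from summing the intercept and all contributions and applying the logistic function.

```python
import math


def predict_from_contributions(intercept, contributions):
    """Combine additive term contributions into a logit and a probability."""
    logit = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return logit, probability


# Hypothetical values for a single case.
example_intercept = -1.2
example_contributions = {"Age": 0.4, "CapitalGain": 1.5, "HoursPerWeek": -0.2}

logit, probability = predict_from_contributions(example_intercept, example_contributions)
print(f"logit={logit:.2f}, probability={probability:.3f}")
```

A positive contribution pushes the prediction towards the positive class and a negative one away from it, so the sign and size of each bar can be interpreted directly.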