%% Cell type:markdown id: tags:
# Decision Tree
In this example, we look at the performance of a simple [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from the Scikit-Learn package.
We will use the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult), which focuses on a (binary) classification task: whether or not a person makes more than 50k USD per year. The data are taken from a 1994 census and were first discussed in the paper [Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996](https://staff.icar.cnr.it/manco/Teaching/2005/datamining/articoli/nbtree.pdf).
The data contain a number of categorical and numerical features.
We can access the data directly from the archive (or use a local copy).
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
import seaborn as sns
```
%% Cell type:markdown id: tags:
## Data Access
Read in data directly from the archive.
%% Cell type:code id: tags:
``` python
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None)
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]
```
%% Cell type:markdown id: tags:
We take a first look at the data.
One thing we notice is that the categorical data are not in numerical format - this is helpful for understanding the data, but few algorithms can use the data as they are.
Further, the target column ("Income") is text as well, which we need to transform into a numerical representation (0, 1).
%% Cell type:code id: tags:
``` python
df.head(5)
```
%% Output
Age WorkClass fnlwgt Education EducationNum \
0 39 State-gov 77516 Bachelors 13
1 50 Self-emp-not-inc 83311 Bachelors 13
2 38 Private 215646 HS-grad 9
3 53 Private 234721 11th 7
4 28 Private 338409 Bachelors 13
MaritalStatus Occupation Relationship Race Gender \
0 Never-married Adm-clerical Not-in-family White Male
1 Married-civ-spouse Exec-managerial Husband White Male
2 Divorced Handlers-cleaners Not-in-family White Male
3 Married-civ-spouse Handlers-cleaners Husband Black Male
4 Married-civ-spouse Prof-specialty Wife Black Female
CapitalGain CapitalLoss HoursPerWeek NativeCountry Income
0 2174 0 40 United-States <=50K
1 0 0 13 United-States <=50K
2 0 0 40 United-States <=50K
3 0 0 40 United-States <=50K
4 0 0 40 Cuba <=50K
%% Cell type:markdown id: tags:
## Exploratory Data Analysis
As a first step, we look at the variables to understand how the target we want to predict is correlated with the features.
The function ```displot``` from the [Seaborn](https://seaborn.pydata.org/) package can handle the text values from the data directly.
***Exercise***
Explore the influence of the various variables on the target, e.g. the education level, gender or race.
Try to assess whether the behaviour you observe matches your expectations.
%% Cell type:code id: tags:
``` python
sns.displot(data=df,y='Education', hue='Income')
plt.xlabel('count', size=20)
plt.ylabel('Education', size=20)
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
# Some other plot
```
%% Cell type:code id: tags:
``` python
# another plot
```
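%% Cell type:markdown id: tags:
If you are unsure where to start with the exercise cells above, the following minimal sketch looks at the income distribution split by gender; the choice of the ```Gender``` column is just one example, and any other categorical column (e.g. ```Race``` or ```MaritalStatus```) can be substituted.
%% Cell type:code id: tags:
``` python
# A possible example for the exercise above: income distribution by gender.
# Swap 'Gender' for any other categorical column to explore further.
sns.displot(data=df, y='Gender', hue='Income')
plt.xlabel('count', size=20)
plt.ylabel('Gender', size=20)
plt.show()
```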
%% Cell type:markdown id: tags:
We can also look at variables with (continuous) numerical values.
These values are binned into a histogram, which allows us to examine the behaviour of the feature variable with respect to the target.
If you look at the variable ```HoursPerWeek```, using 20 bins works quite well.
Note the strong peak at 40 hours (the standard work week), which is why we use a log scale for the ```y```-axis.
%% Cell type:code id: tags:
``` python
g = sns.histplot(data=df, x='HoursPerWeek', hue='Income',bins=20)
g.set_yscale("log")
plt.ylabel('count', size = 20)
plt.xlabel('Hours per week', size = 20)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
## Data Preparation
In order to work with the data further, we need to convert the text in the categorical variables into numerical values.
Pandas provides a datatype ```category``` which we can use for this purpose.
As a first step, we need to change the variable type of the respective columns to this datatype.
Then, in the next step, we can use ```.cat.codes``` to access a numerical representation.
An alternative would be to use "One-Hot-Encoding". Pandas provides a utility function for this called [get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).
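As a brief illustration of that alternative (not used in the rest of this notebook), one-hot encoding with ```get_dummies``` could look like the following sketch; the name ```df_onehot``` is only used here for illustration.
%% Cell type:code id: tags:
``` python
# Sketch only: one-hot encode the categorical columns instead of using .cat.codes.
# Each category value becomes its own 0/1 column; this frame is not used further below.
df_onehot = pd.get_dummies(df, columns=['WorkClass', 'Education', 'MaritalStatus',
                                        'Occupation', 'Relationship', 'Race',
                                        'Gender', 'NativeCountry'])
print(df_onehot.shape)
```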
%% Cell type:code id: tags:
``` python
# Unique categories for the column "WorkClass"
df['WorkClass'].unique().tolist()
```
%% Output
[' State-gov',
' Self-emp-not-inc',
' Private',
' Federal-gov',
' Local-gov',
' ?',
' Self-emp-inc',
' Without-pay',
' Never-worked']
%% Cell type:code id: tags:
``` python
# Change the data type.
df = df.astype({'WorkClass': 'category', 'Education': 'category', 'MaritalStatus': 'category',
                'Occupation': 'category', 'Relationship': 'category', 'Race': 'category',
                'Gender': 'category', 'NativeCountry': 'category'})
```
%% Cell type:code id: tags:
``` python
# see how this has worked
df['Education'].dtype
```
%% Output
CategoricalDtype(categories=[' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th',
' 7th-8th', ' 9th', ' Assoc-acdm', ' Assoc-voc',
' Bachelors', ' Doctorate', ' HS-grad', ' Masters',
' Preschool', ' Prof-school', ' Some-college'],
ordered=False)
%% Cell type:code id: tags:
``` python
# access the numerical representation
df['Education'].cat.codes
```
%% Output
0 9
1 9
2 11
3 1
4 9
..
32556 7
32557 11
32558 11
32559 11
32560 11
Length: 32561, dtype: int8
%% Cell type:code id: tags:
``` python
# select all columns with the datatype "category"
cat_columns = df.select_dtypes(['category']).columns
print(cat_columns)
```
%% Output
Index(['WorkClass', 'Education', 'MaritalStatus', 'Occupation', 'Relationship',
'Race', 'Gender', 'NativeCountry'],
dtype='object')
%% Cell type:code id: tags:
``` python
#convert all text to numerical values
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
```
%% Cell type:code id: tags:
``` python
df.head(5)
```
%% Output
Age WorkClass fnlwgt Education EducationNum MaritalStatus Occupation \
0 39 7 77516 9 13 4 1
1 50 6 83311 9 13 2 4
2 38 4 215646 11 9 0 6
3 53 4 234721 1 7 2 6
4 28 4 338409 9 13 2 10
Relationship Race Gender CapitalGain CapitalLoss HoursPerWeek \
0 1 4 1 2174 0 40
1 0 4 1 0 0 13
2 1 4 1 0 0 40
3 0 2 1 0 0 40
4 5 2 0 0 0 40
NativeCountry Income
0 39 <=50K
1 39 <=50K
2 39 <=50K
3 39 <=50K
4 5 <=50K
%% Cell type:markdown id: tags:
As the next step, we need to separate the dataframe into the features we want to use in the machine learning approach (```X```) and the corresponding labels (```y```).
Since the label is also represented by text, we first need to convert it into a numerical representation (0 and 1).
Note:
- we know that the target/label is in the last column, i.e. all other columns are the features we want to use
- we can use a ```lambda``` function together with the ```apply``` method to transform the values for the target/labels; ```apply``` processes each data record (row) in bulk
***Exercise:***
Fill in the code below.
%% Cell type:code id: tags:
``` python
# the target column is the last column (Income)
train_cols = # ...
label = # ...
X = df[train_cols]
#Turning response into 0 and 1
y = #...
```
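%% Cell type:markdown id: tags:
One possible way to fill in the blanks is sketched below. It assumes, as noted above, that the target is the last column, and uses ```apply``` with a ```lambda``` to map the income strings to 0 and 1. Note that the raw strings carry a leading space (e.g. ```' >50K'```), which we strip before comparing.
%% Cell type:code id: tags:
``` python
# One possible solution sketch for the exercise above.
train_cols = df.columns[0:-1]   # all columns except the last one (the target)
label = df.columns[-1]          # 'Income'
X = df[train_cols]
# Turning the response into 0 and 1 (the raw strings contain a leading space)
y = df[label].apply(lambda x: 1 if x.strip() == '>50K' else 0)
```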
%% Cell type:markdown id: tags:
To follow best practices, we split the data into separate samples that we use for training and evaluation of our model.
The test data are only used in the evaluation to be able to verify the performance on an independent sample.
%% Cell type:code id: tags:
``` python
# split into training and test sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
```
%% Cell type:markdown id: tags:
Here we set up our model. In this example we use a simple decision tree.
The parameter ```random_state=0``` makes the results reproducible, see the [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for details.
Since the dataset is quite small, the default values already result in a very good model. Here, we increase the minimum number of samples per leaf to reduce the flexibility and prevent splits that would let the tree learn individual records by heart. This should limit overtraining a bit - and is also needed to demonstrate the behaviour of predicting probabilities in this example.
We do this by passing the parameter ```min_samples_leaf = 10```.
The general procedure to set up a classifier is to:
- define the classifier
- call the ```fit``` method passing the training features (```X```), and labels (```y```)
***Exercise***
Define and train a ```DecisionTreeClassifier```
%% Cell type:code id: tags:
``` python
#model = ....
# ....
```
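%% Cell type:markdown id: tags:
A minimal sketch of what the exercise asks for, using the parameters discussed above:
%% Cell type:code id: tags:
``` python
# One possible solution sketch: define the decision tree and fit it on the training data.
model = DecisionTreeClassifier(random_state=0, min_samples_leaf=10)
model.fit(X_train, y_train)
```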
%% Cell type:markdown id: tags:
We can plot the model. As you can see, we cannot really "see" much: the tree is quite complex. Therefore, while in principle such a decision tree is explainable in the sense that we can follow each decision path, even this fairly simple model is already complex enough that we cannot really gain many insights from looking at the tree itself.
%% Cell type:code id: tags:
``` python
plot_tree(model)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
## Predictions
We now use the model on the test data to derive predictions. We take a (deep) copy of the test data to be able to manipulate the dataframe and not touch the original test data.
We access the predictions with
- ```model.predict(X_test)``` for a binary (0/1) output
- ```model.predict_proba(X_test)``` for the probability for each class.
```predict_proba``` returns an array for each prediction. Since we only have two classes in this example, the first value for each prediction is for the first class (```y=0```), the second for the other class (```y=1```).
In case we had more classes, there would be more numbers and we could, for example, assign the class with the highest probability and/or require that each assignment also exceeds a certain threshold.
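As a small illustration of that multi-class case (purely hypothetical here, since this dataset only has two classes), such an assignment with an arbitrary probability threshold of 0.8 could be sketched as follows:
``` python
import numpy as np

# Hypothetical sketch: pick the most probable class for each sample,
# but only accept the assignment if its probability exceeds a threshold.
proba = model.predict_proba(X_test)     # shape: (n_samples, n_classes)
best_class = np.argmax(proba, axis=1)   # index of the most probable class
best_proba = np.max(proba, axis=1)      # probability of that class
accepted = best_proba >= 0.8            # True where the assignment is confident enough
```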
For convenience, we append the predictions (and the true labels) to the copy of the test data. This allows us to look at a few values and also collects all information in a single dataframe.
%% Cell type:code id: tags:
``` python
predictions = X_test.copy()
y_hat = model.predict(predictions)
y_hat_proba = model.predict_proba(predictions)
predictions.loc[:,'y_hat'] = y_hat
predictions.loc[:,'y_hat_proba_class1'] = y_hat_proba[:,1]
predictions.loc[:,'y'] = y_test
```
%% Cell type:code id: tags:
``` python
predictions.head(15)
```
%% Output
Age WorkClass fnlwgt Education EducationNum MaritalStatus \
8236 42 4 52781 15 10 2
2152 21 0 300812 15 10 4
4490 29 4 107458 9 13 4
12833 36 4 240323 15 10 6
19947 34 4 269723 11 9 0
7169 54 4 94055 9 13 2
19603 44 4 216907 11 9 0
19013 33 4 379798 11 9 2
23049 65 4 171584 9 13 4
32101 45 4 174794 9 13 5
25960 27 4 159897 11 9 4
24855 25 4 44363 11 9 4
5029 17 4 121037 2 8 4
8643 43 0 116632 15 10 2
15569 29 4 1268339 11 9 3
Occupation Relationship Race Gender CapitalGain CapitalLoss \
8236 13 0 4 1 0 0
2152 0 3 4 1 0 0
4490 12 1 4 1 0 0
12833 1 4 2 0 0 0
19947 4 4 4 0 2977 0
7169 8 0 4 1 0 0
19603 8 1 4 1 0 0
19013 3 0 4 1 0 0
23049 4 1 4 1 0 0
32101 10 4 4 0 0 0
25960 4 1 2 0 0 0
24855 6 1 4 1 0 0
5029 12 3 4 0 0 0
8643 0 0 4 1 0 0
15569 13 3 2 1 0 0
HoursPerWeek NativeCountry y_hat y_hat_proba_class1 y
8236 40 39 1 0.578947 0
2152 30 39 0 0.000000 0
4490 50 39 0 0.230769 0
12833 40 39 0 0.000000 0
19947 50 39 0 0.000000 0
7169 40 39 0 0.461538 0
19603 37 39 0 0.000000 0
19013 40 39 0 0.333333 0
23049 40 39 0 0.000000 0
32101 56 11 0 0.166667 0
25960 37 39 0 0.000000 0
24855 35 39 0 0.000000 0
5029 15 39 0 0.000000 0
8643 45 39 1 0.727273 1
15569 40 39 0 0.000000 0
%% Cell type:markdown id: tags:
## Evaluation
Now we evaluate how well our classifier works.
A good way to visualise this is the confusion matrix for the binary labels and predictions.
This allows us to see if the predictions are generally quite good (most entries on the diagonal line) and how many off-diagonal elements we have that indicate wrong class assignments.
Here, we normalise the values displayed in the confusion matrix to the total number of entries, so that the matrix shows relative proportions.
%% Cell type:code id: tags:
``` python
cm = metrics.confusion_matrix(y_test, y_hat,normalize='all')
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                      display_labels=['<50k', '>50k'])
disp.plot()
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
Scikit-Learn also provides a summary report with the most important metrics:
%% Cell type:code id: tags:
``` python
print(metrics.classification_report(y_test, y_hat, target_names=['<50k', '>50k']))
```
%% Output
precision recall f1-score support
<50k 0.89 0.92 0.90 6213
>50k 0.70 0.63 0.66 1928
accuracy 0.85 8141
macro avg 0.79 0.77 0.78 8141
weighted avg 0.84 0.85 0.84 8141
%% Cell type:markdown id: tags:
A key plot to understand the performance is the ROC curve.
Here, we need the predicted probabilities; the curve is constructed by scanning over thresholds on these probabilities.
A model that is only as good as random guessing would lie on the diagonal; the ideal point is (0,1), where all predictions are perfect.
%% Cell type:code id: tags:
``` python
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_hat_proba[:,1])
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='Decision Tree')
display.plot()
# diagonal line
plt.plot([0, 1], [0, 1], "k--", label="random guess (AUC = 0.5)")
plt.legend()
plt.show()
```
%% Output