Copyright 2019-2020 Patrick Hall and the H2O.ai team
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
DISCLAIMER: This notebook is not legal compliance advice.
Part 2: Residual analysis for model debugging
"All models are wrong but some are useful" -- George Box, 1978
This notebook uses variants of residual analysis to find error mechanisms and security vulnerabilities in a trained XGBoost model, and to assess its stability and fairness. It begins by loading the UCI credit card default data and training an interpretable, monotonically constrained XGBoost gradient boosting machine (GBM). (Pearson correlation with the prediction target is used to determine the direction of the monotonicity constraint for each input variable.) After the model is trained, its logloss residuals are analyzed and explained thoroughly, and the constrained GBM is compared to a benchmark linear model. These model debugging exercises uncover several accuracy, drift, and security problems, such as over-emphasis of important variables and strong signal remaining in model residuals. Several remediation mechanisms are proposed, including missing value injection during training, additional data collection, and use of assertions to correct known problems during scoring.
import numpy as np # array, vector, matrix calculations
import os # file-handling
import pandas as pd # DataFrame handling
import shap # for consistent, signed variable importance measurements
import xgboost as xgb # gradient boosting machines (GBMs)
import subprocess # manage external processes
import h2o # import h2o python bindings to java server
from h2o.backend import H2OLocalServer # for plotting local tree in-notebook
from h2o.estimators.random_forest import H2ORandomForestEstimator # for single tree
from h2o.estimators.glm import H2OGeneralizedLinearEstimator # for benchmark model
from h2o.grid.grid_search import H2OGridSearch # for benchmark model
# plotting ###########
import matplotlib.pyplot as plt # general plotting
from matplotlib.lines import Line2D # necessary for custom legends
import seaborn as sns # facet grid for residuals
from mpl_toolkits import mplot3d # 3-D scatterplots
import matplotlib; matplotlib.rcParams.update({'font.size': 40}) # set legible font size
# enables display of plots in notebook
# in-notebook display
from IPython.display import Image
from IPython.display import display
%matplotlib inline
pd.options.display.max_columns = 999 # enable display of all dataframe columns in notebook
SEED = 12345 # global random seed
np.random.seed(SEED) # set random seed for reproducibility
h2o.init() # start h2o java server
Checking whether there is an H2O instance running at http://localhost:54321 .
.... not found. Attempting to start a local H2O server... Java Version: openjdk version "1.8.0_232"; OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-0ubuntu1~16.04.1-b09); OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode) Starting server from /home/patrickh/workspace/interpretable_machine_learning_with_python/env_iml/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar Ice root: /tmp/tmpb5qehacu JVM stdout: /tmp/tmpb5qehacu/h2o_patrickh_started_from_python.out JVM stderr: /tmp/tmpb5qehacu/h2o_patrickh_started_from_python.err Server is running at http://127.0.0.1:54321 Connecting to H2O server at http://127.0.0.1:54321 ... successful. Warning: Your H2O cluster version is too old (4 months and 13 days)! Please download and install the latest version from http://h2o.ai/download/
H2O cluster uptime: | 01 secs |
H2O cluster timezone: | America/New_York |
H2O data parsing timezone: | UTC |
H2O cluster version: | 3.26.0.3 |
H2O cluster version age: | 4 months and 13 days !!! |
H2O cluster name: | H2O_from_python_patrickh_safz5l |
H2O cluster total nodes: | 1 |
H2O cluster free memory: | 3.462 Gb |
H2O cluster total cores: | 8 |
H2O cluster allowed cores: | 8 |
H2O cluster status: | accepting new members, healthy |
H2O connection url: | http://127.0.0.1:54321 |
H2O connection proxy: | None |
H2O internal security: | False |
H2O API Extensions: | Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 |
Python version: | 3.6.3 final |
UCI credit card default data: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
The UCI credit card default data contains demographic and payment information about credit card customers in Taiwan in the year 2005. The data set contains 23 input variables:
LIMIT_BAL: Amount of given credit (NT dollar)
SEX: 1 = male; 2 = female
EDUCATION: 1 = graduate school; 2 = university; 3 = high school; 4 = others
MARRIAGE: 1 = married; 2 = single; 3 = others
AGE: Age in years
PAY_0, PAY_2 - PAY_6: History of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
BILL_AMT1 - BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005.
PAY_AMT1 - PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005.
These 23 input variables are used to predict the target variable: whether or not a customer defaulted on their credit card bill in late 2005. Because XGBoost accepts only numeric inputs, all variables will be treated as numeric.
The credit card default data is available as an .xls
file. Pandas reads .xls
files automatically, so it's used to load the credit card default data and give the prediction target a shorter name: DEFAULT_NEXT_MONTH
.
# import XLS file
path = 'default_of_credit_card_clients.xls'
data = pd.read_excel(path,
skiprows=1) # skip the first row of the spreadsheet
# remove spaces from target column name
data = data.rename(columns={'default payment next month': 'DEFAULT_NEXT_MONTH'})
The shorthand name y
is assigned to the prediction target. X
is assigned to all other input variables in the credit card default data except the row identifier, ID
.
# assign target and inputs for GBM
y = 'DEFAULT_NEXT_MONTH'
X = [name for name in data.columns if name not in [y, 'ID']]
print('y =', y)
print('X =', X)
y = DEFAULT_NEXT_MONTH X = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
The Pandas describe()
function displays a brief description of the credit card default data. The input variables SEX
, EDUCATION
, MARRIAGE
, PAY_0
-PAY_6
, and the prediction target DEFAULT_NEXT_MONTH
, are really categorical variables, but they have already been encoded into meaningful numeric, integer values, which is great for XGBoost. Also, there are no missing values in this dataset.
data[X + [y]].describe() # display descriptive statistics for all columns
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 3.000000e+04 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 3.000000e+04 | 30000.00000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | 167484.322667 | 1.603733 | 1.853133 | 1.551867 | 35.485500 | -0.016700 | -0.133767 | -0.166200 | -0.220667 | -0.266200 | -0.291100 | 51223.330900 | 49179.075167 | 4.701315e+04 | 43262.948967 | 40311.400967 | 38871.760400 | 5663.580500 | 5.921163e+03 | 5225.68150 | 4826.076867 | 4799.387633 | 5215.502567 | 0.221200 |
std | 129747.661567 | 0.489129 | 0.790349 | 0.521970 | 9.217904 | 1.123802 | 1.197186 | 1.196868 | 1.169139 | 1.133187 | 1.149988 | 73635.860576 | 71173.768783 | 6.934939e+04 | 64332.856134 | 60797.155770 | 59554.107537 | 16563.280354 | 2.304087e+04 | 17606.96147 | 15666.159744 | 15278.305679 | 17777.465775 | 0.415062 |
min | 10000.000000 | 1.000000 | 0.000000 | 0.000000 | 21.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -165580.000000 | -69777.000000 | -1.572640e+05 | -170000.000000 | -81334.000000 | -339603.000000 | 0.000000 | 0.000000e+00 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 50000.000000 | 1.000000 | 1.000000 | 1.000000 | 28.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 3558.750000 | 2984.750000 | 2.666250e+03 | 2326.750000 | 1763.000000 | 1256.000000 | 1000.000000 | 8.330000e+02 | 390.00000 | 296.000000 | 252.500000 | 117.750000 | 0.000000 |
50% | 140000.000000 | 2.000000 | 2.000000 | 2.000000 | 34.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 22381.500000 | 21200.000000 | 2.008850e+04 | 19052.000000 | 18104.500000 | 17071.000000 | 2100.000000 | 2.009000e+03 | 1800.00000 | 1500.000000 | 1500.000000 | 1500.000000 | 0.000000 |
75% | 240000.000000 | 2.000000 | 2.000000 | 2.000000 | 41.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 67091.000000 | 64006.250000 | 6.016475e+04 | 54506.000000 | 50190.500000 | 49198.250000 | 5006.000000 | 5.000000e+03 | 4505.00000 | 4013.250000 | 4031.500000 | 4000.000000 | 0.000000 |
max | 1000000.000000 | 2.000000 | 6.000000 | 3.000000 | 79.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 964511.000000 | 983931.000000 | 1.664089e+06 | 891586.000000 | 927171.000000 | 961664.000000 | 873552.000000 | 1.684259e+06 | 896040.00000 | 621000.000000 | 426529.000000 | 528666.000000 | 1.000000 |
Monotonic relationships are much easier to explain to colleagues, bosses, customers, and regulators than more complex, non-monotonic relationships. Monotonicity constraints may also prevent overfitting and excess error due to variance on new data.
To train a transparent monotonic classifier, constraints must be supplied to XGBoost that determine whether the learned relationship between an input variable and the prediction target DEFAULT_NEXT_MONTH
will be increasing for increases in an input variable or decreasing for increases in an input variable. Pearson correlation provides a linear measure of the direction of the relationship between each input variable and the target. If the pair-wise Pearson correlation between an input and DEFAULT_NEXT_MONTH
is positive, it will be constrained to have an increasing relationship with the predictions for DEFAULT_NEXT_MONTH
. If the pair-wise Pearson correlation is negative, the input will be constrained to have a decreasing relationship with the predictions for DEFAULT_NEXT_MONTH
.
Constraints are supplied to XGBoost in the form of a Python tuple with length equal to the number of inputs. Each item in the tuple is associated with an input variable based on its index in the tuple. The first constraint in the tuple is associated with the first variable in the training data, the second constraint in the tuple is associated with the second variable in the training data, and so on. The constraints themselves take the form of a 1 for a positive relationship and a -1 for a negative relationship.
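As a minimal sketch of that tuple format, using a hypothetical three-column toy dataset (not the credit card data), the sign of each input's correlation with the target can be collected into the constraint tuple like so:

```python
import numpy as np
import pandas as pd

# hypothetical toy data: x1 rises with the target, x2 falls with it
toy = pd.DataFrame({'x1': [1, 2, 3, 4],
                    'x2': [9, 7, 5, 1],
                    'target': [0, 0, 1, 1]})

# last column of the correlation matrix holds each input's
# Pearson correlation with the target; drop the target-target entry
signs = np.sign(toy.corr()['target'].values[:-1])
mono_constraints = tuple(int(i) for i in signs)

print(mono_constraints)  # (1, -1): x1 constrained increasing, x2 decreasing
```

The first entry applies to the first training column, the second to the second, matching the positional convention described above.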
The Pandas .corr()
function returns the pair-wise Pearson correlation between variables in a Pandas DataFrame. Because DEFAULT_NEXT_MONTH
is the last column in the data
DataFrame, the last column of the Pearson correlation matrix indicates the direction of the linear relationship between each input variable and the prediction target, DEFAULT_NEXT_MONTH
. According to the calculated values, as a customer's balance limit (LIMIT_BAL
), bill amounts (BILL_AMT1
-BILL_AMT6
), and payment amounts (PAY_AMT1
-PAY_AMT6
) increase, their probability of default tends to decrease. However as a customer's number of late payments increase (PAY_0
, PAY_2
- PAY_6
), their probability of default usually increases. In general, the Pearson correlation values make sense, and they will be used to ensure that the modeled relationships will make sense as well. (Pearson correlation values between the target variable, DEFAULT_NEXT_MONTH, and each input variable are displayed directly below.)
# displays last column of Pearson correlation matrix as Pandas DataFrame
pd.DataFrame(data[X + [y]].corr()[y]).iloc[:-1]
DEFAULT_NEXT_MONTH | |
---|---|
LIMIT_BAL | -0.153520 |
SEX | -0.039961 |
EDUCATION | 0.028006 |
MARRIAGE | -0.024339 |
AGE | 0.013890 |
PAY_0 | 0.324794 |
PAY_2 | 0.263551 |
PAY_3 | 0.235253 |
PAY_4 | 0.216614 |
PAY_5 | 0.204149 |
PAY_6 | 0.186866 |
BILL_AMT1 | -0.019644 |
BILL_AMT2 | -0.014193 |
BILL_AMT3 | -0.014076 |
BILL_AMT4 | -0.010156 |
BILL_AMT5 | -0.006760 |
BILL_AMT6 | -0.005372 |
PAY_AMT1 | -0.072929 |
PAY_AMT2 | -0.058579 |
PAY_AMT3 | -0.056250 |
PAY_AMT4 | -0.056827 |
PAY_AMT5 | -0.055124 |
PAY_AMT6 | -0.053183 |
The last column of the Pearson correlation matrix is transformed from a numeric column in a Pandas DataFrame into a Python tuple of 1
s and -1
s that will be used to specify monotonicity constraints for each input variable in XGBoost. If the Pearson correlation between an input variable and DEFAULT_NEXT_MONTH
is positive, a positive monotonic relationship constraint is specified for that variable using 1
. If the correlation is negative, a negative monotonic constraint is specified using -1
. (Specifying 0
indicates that no constraints should be used.) The resulting tuple will be passed to XGBoost when the GBM model is trained.
# creates a tuple in which positive correlation values are assigned a 1
# and negative correlation values are assigned a -1
mono_constraints = tuple([int(i) for i in np.sign(data[X + [y]].corr()[y].values[:-1])])
# (-1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)
XGBoost is a very accurate, open source GBM library for regression and classification tasks. XGBoost can learn complex relationships between input variables and a target variable, but here the monotone_constraints
tuning parameter is used to enforce monotonicity between inputs and the prediction for DEFAULT_NEXT_MONTH
.
XGBoost is available from: https://github.com/dmlc/xgboost and the implementation of XGBoost is described in detail here: http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf.
The credit card default data is split into training and test sets to monitor and prevent overtraining. Reproducibility is another important factor in creating trustworthy models, and randomly splitting datasets can introduce randomness in model predictions and other results. A random seed is used here to ensure the data split is reproducible.
split_ratio = 0.70 # 70%/30% train/test split
# execute split
split = np.random.rand(len(data)) < split_ratio
train = data[split]
test = data[~split]
# summarize split
print('Train data rows = %d, columns = %d' % (train.shape[0], train.shape[1]))
print('Test data rows = %d, columns = %d' % (test.shape[0], test.shape[1]))
Train data rows = 20946, columns = 25 Test data rows = 9054, columns = 25
To train an XGBoost classifier, the training and test data must be converted from Pandas DataFrames into XGBoost's internal DMatrix format. The DMatrix()
function in the XGBoost package is used to convert the data. Many XGBoost tuning parameters must be specified as well. Typically a grid search would be performed to identify the best parameters for a given modeling task. For brevity's sake, a previously-discovered set of good tuning parameters are specified here. Notice that the monotonicity constraints are passed to XGBoost using the monotone_constraints
parameter. Because gradient boosting methods typically resample training data, an additional random seed is also specified for XGBoost using the seed
parameter to create reproducible predictions, error rates, and variable importance values. To calculate Shapley contributions to logloss in later sections of the notebook, early stopping is not used because it may be unsupported for calculating accurate contributions to logloss. Hence, a previously discovered best number of trees is used in this notebook.
# XGBoost uses its own DMatrix data structure, not Numpy arrays or Pandas DataFrames
dtrain = xgb.DMatrix(train[X], train[y])
dtest = xgb.DMatrix(test[X], test[y])
# used to calibrate predictions to mean of y
base_y = train[y].mean()
# tuning parameters
params = {
'objective': 'binary:logistic', # produces 0-1 probabilities for binary classification
'booster': 'gbtree', # base learner will be decision tree
'eval_metric': 'logloss', # monitor training progress using logloss
'eta': 0.08, # learning rate
'subsample': 0.9, # use 90% of rows in each decision tree
'colsample_bytree': 0.9, # use 90% of columns in each decision tree
'max_depth': 15, # allow decision trees to grow to depth of 15
'monotone_constraints':mono_constraints, # 1 = increasing relationship, -1 = decreasing relationship
'base_score': base_y, # calibrate predictions to mean of y
'seed': SEED # set random seed for reproducibility
}
# watchlist is used to monitor logloss on train and eval data
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
# train model
xgb_model = xgb.train(params, # set tuning parameters from above
dtrain, # training data
62, # previously discovered best number of trees; logloss SHAP may not account for early stopping
evals=watchlist, # monitor logloss on train and eval data
verbose_eval=True) # display iteration progress
[0] train-logloss:0.517487 eval-logloss:0.51404 [1] train-logloss:0.504097 eval-logloss:0.500144 [2] train-logloss:0.49331 eval-logloss:0.489053 [3] train-logloss:0.48449 eval-logloss:0.480157 [4] train-logloss:0.476882 eval-logloss:0.472429 [5] train-logloss:0.470593 eval-logloss:0.46606 [6] train-logloss:0.465221 eval-logloss:0.460639 [7] train-logloss:0.460893 eval-logloss:0.456227 [8] train-logloss:0.456987 eval-logloss:0.452346 [9] train-logloss:0.453662 eval-logloss:0.449056 [10] train-logloss:0.450854 eval-logloss:0.446329 [11] train-logloss:0.448211 eval-logloss:0.443774 [12] train-logloss:0.445944 eval-logloss:0.441601 [13] train-logloss:0.444034 eval-logloss:0.439837 [14] train-logloss:0.442334 eval-logloss:0.438174 [15] train-logloss:0.440674 eval-logloss:0.436801 [16] train-logloss:0.439326 eval-logloss:0.435491 [17] train-logloss:0.438064 eval-logloss:0.434513 [18] train-logloss:0.436879 eval-logloss:0.433658 [19] train-logloss:0.4357 eval-logloss:0.432662 [20] train-logloss:0.434803 eval-logloss:0.431955 [21] train-logloss:0.433965 eval-logloss:0.431188 [22] train-logloss:0.433221 eval-logloss:0.430624 [23] train-logloss:0.432498 eval-logloss:0.430157 [24] train-logloss:0.431834 eval-logloss:0.429738 [25] train-logloss:0.431169 eval-logloss:0.429235 [26] train-logloss:0.430642 eval-logloss:0.428901 [27] train-logloss:0.430144 eval-logloss:0.428576 [28] train-logloss:0.429715 eval-logloss:0.428276 [29] train-logloss:0.429267 eval-logloss:0.427921 [30] train-logloss:0.428899 eval-logloss:0.427618 [31] train-logloss:0.42856 eval-logloss:0.427503 [32] train-logloss:0.428243 eval-logloss:0.427278 [33] train-logloss:0.427974 eval-logloss:0.427111 [34] train-logloss:0.427719 eval-logloss:0.427036 [35] train-logloss:0.427457 eval-logloss:0.426976 [36] train-logloss:0.427155 eval-logloss:0.426925 [37] train-logloss:0.426861 eval-logloss:0.426893 [38] train-logloss:0.426616 eval-logloss:0.426824 [39] train-logloss:0.42634 eval-logloss:0.426717 [40] 
train-logloss:0.426117 eval-logloss:0.426629 [41] train-logloss:0.425885 eval-logloss:0.426643 [42] train-logloss:0.425639 eval-logloss:0.426595 [43] train-logloss:0.425484 eval-logloss:0.426524 [44] train-logloss:0.425296 eval-logloss:0.426502 [45] train-logloss:0.425151 eval-logloss:0.426528 [46] train-logloss:0.42499 eval-logloss:0.426503 [47] train-logloss:0.424875 eval-logloss:0.426499 [48] train-logloss:0.424652 eval-logloss:0.426443 [49] train-logloss:0.424516 eval-logloss:0.426425 [50] train-logloss:0.424296 eval-logloss:0.426354 [51] train-logloss:0.42416 eval-logloss:0.426311 [52] train-logloss:0.424011 eval-logloss:0.426276 [53] train-logloss:0.423911 eval-logloss:0.426338 [54] train-logloss:0.423786 eval-logloss:0.426309 [55] train-logloss:0.423656 eval-logloss:0.426288 [56] train-logloss:0.423537 eval-logloss:0.426259 [57] train-logloss:0.423442 eval-logloss:0.426248 [58] train-logloss:0.423331 eval-logloss:0.426241 [59] train-logloss:0.423219 eval-logloss:0.426204 [60] train-logloss:0.4231 eval-logloss:0.426169 [61] train-logloss:0.422985 eval-logloss:0.426156
Cutoffs affect model characteristics, sometimes drastically, and should always be chosen with care. In this notebook, the cutoff is selected by maximizing the model's F1 statistic, a standard approach that could likely be improved upon.
A new column called p_DEFAULT_NEXT_MONTH
will be created to hold the model's predicted probability. P-R AUC stands for precision-recall area under the curve; the precision-recall curve is an established model assessment technique that helps choose a probability cutoff.
# shortcut name
yhat = 'p_DEFAULT_NEXT_MONTH'
# copy test data and reset index
test_yhat = test.copy(deep=True)
test_yhat.reset_index(drop=True, inplace=True) # Pandas joins are weird otherwise
test_yhat[yhat] = pd.DataFrame(xgb_model.predict(xgb.DMatrix(test[X])))
Precision refers to the proportion of correct default predictions out of all of the model's predictions of default. Recall refers to the proportion of actual defaults that the model predicts correctly. F1 is the harmonic mean of these two quantities, so maximizing F1 selects a cutoff that balances precision and recall. The function below calculates precision, recall, and F1 at a number of probability cutoffs between 0 and 1. Results are displayed below.
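As a quick worked example with toy confusion-matrix counts (hypothetical numbers, not from this model):

```python
# toy confusion-matrix counts (hypothetical, for illustration only)
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)                  # 8 / 10 = 0.800
recall = tp / (tp + fn)                     # 8 / 12 = 0.667
f1 = 2 / ((1 / precision) + (1 / recall))   # harmonic mean = 0.727

print('precision = %.3f, recall = %.3f, f1 = %.3f' % (precision, recall, f1))
```

The same arithmetic is applied at each candidate cutoff in the function below.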
def get_prauc(frame, y, yhat, pos=1, neg=0, res=0.01):
""" Calculates precision, recall, and f1 for a pandas dataframe of y
and yhat values.
Args:
frame: Pandas dataframe of actual (y) and predicted (yhat) values.
y: Name of actual value column.
yhat: Name of predicted value column.
pos: Primary target value, default 1.
neg: Secondary target value, default 0.
res: Resolution by which to loop through cutoffs, default 0.01.
Returns:
Pandas dataframe of precision, recall, and f1 values.
"""
frame_ = frame.copy(deep=True) # don't destroy original data
dname = 'd_' + str(y) # column for predicted decisions
eps = 1e-20 # for safe numerical operations
# init p-r roc frame
prroc_frame = pd.DataFrame(columns=['cutoff', 'recall', 'precision', 'f1'])
# loop through cutoffs to create p-r roc frame
for cutoff in np.arange(0, 1 + res, res):
# binarize decision to create confusion matrix values
frame_[dname] = np.where(frame_[yhat] > cutoff , 1, 0)
# calculate confusion matrix values
tp = frame_[(frame_[dname] == pos) & (frame_[y] == pos)].shape[0]
fp = frame_[(frame_[dname] == pos) & (frame_[y] == neg)].shape[0]
tn = frame_[(frame_[dname] == neg) & (frame_[y] == neg)].shape[0]
fn = frame_[(frame_[dname] == neg) & (frame_[y] == pos)].shape[0]
# calculate precision, recall, and f1
recall = (tp + eps)/((tp + fn) + eps)
precision = (tp + eps)/((tp + fp) + eps)
f1 = 2/((1/(recall + eps)) + (1/(precision + eps)))
# add new values to frame
prroc_frame = prroc_frame.append({'cutoff': cutoff,
'recall': recall,
'precision': precision,
'f1': f1},
ignore_index=True)
# housekeeping
del frame_
return prroc_frame
# calculate and display recall and precision
prauc_frame = get_prauc(test_yhat, y, yhat)
prauc_frame.style.set_caption('Recall and Precision')
cutoff | recall | precision | f1 | |
---|---|---|---|---|
0 | 0 | 1 | 0.219351 | 0.359783 |
1 | 0.01 | 0.999496 | 0.219361 | 0.359764 |
2 | 0.02 | 0.998993 | 0.220004 | 0.360596 |
3 | 0.03 | 0.996475 | 0.221241 | 0.362089 |
4 | 0.04 | 0.993454 | 0.22314 | 0.364426 |
5 | 0.05 | 0.988419 | 0.225892 | 0.367741 |
6 | 0.06 | 0.982377 | 0.231545 | 0.37476 |
7 | 0.07 | 0.968781 | 0.239125 | 0.383573 |
8 | 0.08 | 0.950151 | 0.248355 | 0.393781 |
9 | 0.09 | 0.929003 | 0.259166 | 0.405272 |
10 | 0.1 | 0.909366 | 0.271579 | 0.418249 |
11 | 0.11 | 0.895267 | 0.284799 | 0.43213 |
12 | 0.12 | 0.873112 | 0.298451 | 0.444844 |
13 | 0.13 | 0.856495 | 0.309893 | 0.455117 |
14 | 0.14 | 0.836858 | 0.322093 | 0.465155 |
15 | 0.15 | 0.813696 | 0.337018 | 0.476626 |
16 | 0.16 | 0.79003 | 0.356429 | 0.491234 |
17 | 0.17 | 0.76435 | 0.373616 | 0.501901 |
18 | 0.18 | 0.73565 | 0.387533 | 0.507644 |
19 | 0.19 | 0.71853 | 0.402766 | 0.516187 |
20 | 0.2 | 0.70141 | 0.420592 | 0.525859 |
21 | 0.21 | 0.68429 | 0.43488 | 0.531794 |
22 | 0.22 | 0.664149 | 0.449864 | 0.536397 |
23 | 0.23 | 0.644512 | 0.466813 | 0.541455 |
24 | 0.24 | 0.627392 | 0.486529 | 0.548054 |
25 | 0.25 | 0.607754 | 0.503126 | 0.550513 |
26 | 0.26 | 0.594663 | 0.511698 | 0.55007 |
27 | 0.27 | 0.577039 | 0.519257 | 0.546625 |
28 | 0.28 | 0.566465 | 0.530911 | 0.548112 |
29 | 0.29 | 0.556898 | 0.542955 | 0.549838 |
30 | 0.3 | 0.542296 | 0.556014 | 0.54907 |
31 | 0.31 | 0.521652 | 0.569231 | 0.544404 |
32 | 0.32 | 0.506546 | 0.583865 | 0.542464 |
33 | 0.33 | 0.493958 | 0.591677 | 0.538419 |
34 | 0.34 | 0.484391 | 0.603892 | 0.53758 |
35 | 0.35 | 0.475831 | 0.612047 | 0.535411 |
36 | 0.36 | 0.468781 | 0.620253 | 0.533983 |
37 | 0.37 | 0.458711 | 0.627843 | 0.530113 |
38 | 0.38 | 0.450151 | 0.630465 | 0.525264 |
39 | 0.39 | 0.443102 | 0.634921 | 0.521945 |
40 | 0.4 | 0.435549 | 0.639793 | 0.518274 |
41 | 0.41 | 0.427492 | 0.643182 | 0.513612 |
42 | 0.42 | 0.420443 | 0.647287 | 0.509768 |
43 | 0.43 | 0.414904 | 0.652932 | 0.507389 |
44 | 0.44 | 0.41289 | 0.657051 | 0.507112 |
45 | 0.45 | 0.406848 | 0.663928 | 0.504527 |
46 | 0.46 | 0.400302 | 0.665272 | 0.499843 |
47 | 0.47 | 0.397281 | 0.668078 | 0.498263 |
48 | 0.48 | 0.393756 | 0.671245 | 0.49635 |
49 | 0.49 | 0.388218 | 0.673362 | 0.492494 |
50 | 0.5 | 0.3857 | 0.676081 | 0.491183 |
51 | 0.51 | 0.380665 | 0.681695 | 0.48853 |
52 | 0.52 | 0.376133 | 0.682815 | 0.485065 |
53 | 0.53 | 0.372608 | 0.689655 | 0.483818 |
54 | 0.54 | 0.365559 | 0.692088 | 0.478418 |
55 | 0.55 | 0.361027 | 0.695441 | 0.475307 |
56 | 0.56 | 0.351964 | 0.698302 | 0.468028 |
57 | 0.57 | 0.340886 | 0.697219 | 0.457897 |
58 | 0.58 | 0.328298 | 0.702586 | 0.447495 |
59 | 0.59 | 0.316213 | 0.704826 | 0.436566 |
60 | 0.6 | 0.306143 | 0.71028 | 0.427868 |
61 | 0.61 | 0.294562 | 0.711679 | 0.416667 |
62 | 0.62 | 0.283484 | 0.719949 | 0.406792 |
63 | 0.63 | 0.266868 | 0.723056 | 0.389849 |
64 | 0.64 | 0.249748 | 0.727273 | 0.371814 |
65 | 0.65 | 0.240685 | 0.728659 | 0.361847 |
66 | 0.66 | 0.226586 | 0.725806 | 0.345357 |
67 | 0.67 | 0.210473 | 0.733333 | 0.327074 |
68 | 0.68 | 0.199899 | 0.729779 | 0.313834 |
69 | 0.69 | 0.188318 | 0.740594 | 0.300281 |
70 | 0.7 | 0.179758 | 0.74375 | 0.289538 |
71 | 0.71 | 0.167674 | 0.746637 | 0.273849 |
72 | 0.72 | 0.159617 | 0.754762 | 0.263508 |
73 | 0.73 | 0.147029 | 0.750643 | 0.245895 |
74 | 0.74 | 0.129406 | 0.762611 | 0.221266 |
75 | 0.75 | 0.113797 | 0.773973 | 0.19842 |
76 | 0.76 | 0.101712 | 0.765152 | 0.179556 |
77 | 0.77 | 0.0906344 | 0.772532 | 0.162235 |
78 | 0.78 | 0.0795569 | 0.78607 | 0.14449 |
79 | 0.79 | 0.0704935 | 0.79096 | 0.12945 |
80 | 0.8 | 0.0589124 | 0.774834 | 0.109499 |
81 | 0.81 | 0.0488419 | 0.76378 | 0.0918126 |
82 | 0.82 | 0.0453172 | 0.769231 | 0.085592 |
83 | 0.83 | 0.0357503 | 0.747368 | 0.0682364 |
84 | 0.84 | 0.0312185 | 0.765432 | 0.0599903 |
85 | 0.85 | 0.0231621 | 0.807018 | 0.0450318 |
86 | 0.86 | 0.020141 | 0.833333 | 0.0393314 |
87 | 0.87 | 0.0166163 | 0.846154 | 0.0325926 |
88 | 0.88 | 0.0120846 | 0.888889 | 0.023845 |
89 | 0.89 | 0.00906344 | 0.857143 | 0.0179372 |
90 | 0.9 | 0.00654582 | 0.8125 | 0.012987 |
91 | 0.91 | 0.00553877 | 0.846154 | 0.0110055 |
92 | 0.92 | 0.00352467 | 0.875 | 0.00702106 |
93 | 0.93 | 0.00251762 | 0.833333 | 0.00502008 |
94 | 0.94 | 0.00151057 | 1 | 0.00301659 |
95 | 0.95 | 5.03525e-24 | 1 | 2.00101e-20 |
96 | 0.96 | 5.03525e-24 | 1 | 2.00101e-20 |
97 | 0.97 | 5.03525e-24 | 1 | 2.00101e-20 |
98 | 0.98 | 5.03525e-24 | 1 | 2.00101e-20 |
99 | 0.99 | 5.03525e-24 | 1 | 2.00101e-20 |
100 | 1 | 5.03525e-24 | 1 | 2.00101e-20 |
For this model, the best F1 value is found at a probability cutoff of 0.25.
gbm_cut = prauc_frame.loc[prauc_frame['f1'].idxmax(), 'cutoff'] # value associated w/ index of max. F1
print('Best F1 threshold: %.2f' % gbm_cut)
Best F1 threshold: 0.25
A P-R AUC curve is a curve formed by plotting precision and recall across different cutoffs. (P-R AUC curves are often a bit more robust to imbalanced target classes than ROC curves.) This curve shows some odd behavior for high precision and low recall, potentially indicating instability in model predictions, but is smooth and mostly well-behaved where the cutoff is selected. Generally a smooth tradeoff between precision and recall is preferred as different cutoffs are considered.
title_ = 'P-R Curve: Best F1 Threshold = ' + str(gbm_cut)
ax = prauc_frame.plot(x='recall', y='precision', kind='scatter', title=title_, xlim=[0,1])
_ = ax.axvline(gbm_cut, color='magenta')
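Though only the curve itself is plotted here, the area under it can be approximated by trapezoidal integration over the recall axis. A rough sketch with hypothetical (recall, precision) points, such as would come from a prauc_frame-like table sorted by recall:

```python
import numpy as np

# hypothetical (recall, precision) points, ordered by recall;
# a real prauc_frame would supply these columns
recall = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
precision = np.array([1.0, 0.9, 0.8, 0.6, 0.4])

# trapezoidal approximation of the P-R area under the curve
pr_auc = float(np.sum((recall[1:] - recall[:-1]) *
                      (precision[1:] + precision[:-1]) / 2.0))
print('approximate P-R AUC: %.3f' % pr_auc)  # 0.750 for these toy points
```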
Once a cutoff has been established, a confusion matrix can be created. A confusion matrix is a basic model assessment technique that shows how often a model is correct and incorrect, and how it is incorrect, e.g., type I (false positive) and type II (false negative) errors.
def get_confusion_matrix(frame, y, yhat, by=None, level=None, cutoff=0.5):
""" Creates confusion matrix from pandas dataframe of y and yhat values,
can be sliced by a variable and level.
Args:
frame: Pandas dataframe of actual (y) and predicted (yhat) values.
y: Name of actual value column.
yhat: Name of predicted value column.
by: By variable to slice frame before creating confusion matrix,
default None.
level: Value of by variable to slice frame before creating confusion
matrix, default None.
cutoff: Cutoff threshold for confusion matrix, default 0.5.
Returns:
Confusion matrix as pandas dataframe.
"""
# determine levels of target (y) variable
# sort for consistency
level_list = list(frame[y].unique())
level_list.sort(reverse=True)
# init confusion matrix
cm_frame = pd.DataFrame(columns=['actual: ' + str(i) for i in level_list],
index=['predicted: ' + str(i) for i in level_list])
# don't destroy original data
frame_ = frame.copy(deep=True)
# convert numeric predictions to binary decisions using cutoff
dname = 'd_' + str(y)
frame_[dname] = np.where(frame_[yhat] > cutoff , 1, 0)
# slice frame
if (by is not None) and (level is not None):
frame_ = frame_[frame_[by] == level]
# calculate size of each confusion matrix value
for i, lev_i in enumerate(level_list):
for j, lev_j in enumerate(level_list):
cm_frame.iat[j, i] = frame_[(frame_[y] == lev_i) &
(frame_[dname] == lev_j)].shape[0]
# i,j vs. j,i - nasty little bug updated 8/30/19
return cm_frame
cm = get_confusion_matrix(test_yhat, y, yhat, cutoff=gbm_cut)
cm.style.set_caption('Confusion Matrix')
| | actual: 1 | actual: 0 |
|---|---|---|
| predicted: 1 | 1207 | 1192 |
| predicted: 0 | 779 | 5876 |
The confusion matrix shows that the GBM model is more accurate than not, because the true positive and true negative cells contain the largest values by far. But the GBM model seems to make a large number of type I errors, or false positive predictions. False positives are a known disparity issue because, for complex reasons, many credit scoring and other models tend to overestimate the likelihood that members of non-reference groups - typically anyone but white males - will default. The large number of false positives could indicate that many deserving people will not receive loans they could actually pay back, according to this model. This is a potential pathology that should definitely be investigated further.
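For concreteness, the basic error rates can be computed directly from the cells of the printed matrix:

```python
# cell values copied from the confusion matrix above
tp, fp = 1207, 1192   # predicted: 1 row
fn, tn = 779, 5876    # predicted: 0 row

false_positive_rate = fp / (fp + tn)       # type I error rate
false_negative_rate = fn / (fn + tp)       # type II error rate
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(round(false_positive_rate, 3),
      round(false_negative_rate, 3),
      round(accuracy, 3))   # 0.169 0.392 0.782
```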
Because special functionality in XGBoost to find Shapley contributions to logloss will be used later, logloss residuals are used in several places in this notebook.
# shortcut name
resid = 'r_DEFAULT_NEXT_MONTH'
# calculate logloss residuals
test_yhat[resid] = -test_yhat[y]*np.log(test_yhat[yhat]) -\
(1 - test_yhat[y])*np.log(1 - test_yhat[yhat])
# check that logloss is calculated correctly
# should match eval-logloss above
print('Mean logloss residual: %.6f' % test_yhat[resid].mean())
Mean logloss residual: 0.426156
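As a single-row sanity check of the residual formula, an actual defaulter (y = 1) scored at roughly 0.008 probability of default should carry a logloss residual of about 4.8, matching the top outlier shown below:

```python
import numpy as np

y_i, p_i = 1, 0.008015   # values from the top-residual row below
r_i = -y_i * np.log(p_i) - (1 - y_i) * np.log(1 - p_i)
print(round(r_i, 3))     # 4.826
```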
Plotting residuals is almost always a good idea. You can see all of your model's predictions in two dimensions!
# initialize figure
fig, ax_ = plt.subplots(figsize=(8, 8))
# plot groups with appropriate color
color_list = ['royalblue', 'magenta']
c_idx = 0
groups = test_yhat.groupby(y) # define groups for levels of the target, y
for name, group in groups:
ax_.plot(group.p_DEFAULT_NEXT_MONTH, group.r_DEFAULT_NEXT_MONTH,
label=' '.join([y, str(name)]),
marker='o', linestyle='', color=color_list[c_idx], alpha=0.3)
c_idx += 1
# annotate plot
_ = plt.xlabel(yhat)
_ = plt.ylabel(resid)
_ = ax_.legend(loc=1)
_ = plt.title('Global Logloss Residuals')
Some high-magnitude outlying residuals are visible. Who are these customers? Why is the model so wrong about them? And are they somehow exerting undue influence on other predictions? The model could be retrained without these individuals and retested as a potential remediation strategy. From a security standpoint, it can be interesting to examine insider attacks among the false negatives. These are people who were issued a loan but did not pay it back. Are any of them employees, contractors, consultants, or their associates? Could they have poisoned the model's training data to receive the loan? Could they have changed the model's production scoring code to receive the loan?
# sort by residual and display
test_yhat_sorted = test_yhat.\
sort_values(by=resid, ascending=False).\
reset_index(drop=True)
test_yhat_sorted.head() # large positive residuals
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | p_DEFAULT_NEXT_MONTH | r_DEFAULT_NEXT_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 983 | 500000 | 1 | 1 | 2 | 36 | -2 | -2 | -2 | -2 | -2 | -2 | 45106 | 81264 | 18122 | 27229 | 21462 | 27911 | 81690 | 18225 | 27365 | 21570 | 28050 | 17397 | 1 | 0.008015 | 4.826454 |
1 | 25772 | 350000 | 2 | 1 | 1 | 33 | 0 | -1 | -1 | -1 | -1 | -1 | 82964 | 68532 | 17926 | 17966 | 30741 | 31088 | 68940 | 18018 | 18058 | 30897 | 31244 | 88461 | 1 | 0.014396 | 4.240807 |
2 | 11571 | 200000 | 1 | 2 | 2 | 30 | -2 | -2 | -2 | -2 | -2 | -2 | 48492 | 49934 | 24753 | 123439 | 132269 | 129224 | 20294 | 24891 | 125171 | 17816 | 26269 | 4349 | 1 | 0.022704 | 3.785196 |
3 | 2561 | 310000 | 2 | 1 | 2 | 32 | -2 | -2 | -2 | -2 | -2 | -2 | 20138 | 8267 | 65993 | 8543 | 1695 | 750 | 8267 | 66008 | 8543 | 1695 | 750 | 7350 | 1 | 0.023283 | 3.760035 |
4 | 16209 | 360000 | 2 | 1 | 1 | 35 | -2 | -1 | 0 | -1 | -1 | -1 | 94657 | 34529 | 106276 | 73331 | 7759 | 31840 | 45000 | 100000 | 73427 | 7759 | 31840 | 12577 | 1 | 0.026504 | 3.630469 |
High positive residuals seem to be caused by people with excellent payment behavior who suddenly default.
test_yhat_sorted.tail()
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | p_DEFAULT_NEXT_MONTH | r_DEFAULT_NEXT_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9049 | 28717 | 340000 | 2 | 1 | 3 | 42 | -1 | -1 | -1 | -1 | -1 | 0 | 139808 | 176743 | 34402 | 205931 | 265431 | 270237 | 873552 | 1215471 | 889043 | 621000 | 20000 | 145000 | 0 | 0.010588 | 0.010644 |
9050 | 14136 | 500000 | 1 | 1 | 2 | 43 | -1 | -1 | -1 | -1 | -1 | -1 | 37575 | 105652 | 51072 | 77094 | 77669 | 42059 | 107243 | 52090 | 78372 | 78949 | 42985 | 62793 | 0 | 0.009773 | 0.009821 |
9051 | 28744 | 440000 | 2 | 2 | 2 | 29 | -1 | -1 | -1 | -1 | 0 | 0 | 23147 | 88848 | 42045 | 400321 | 229680 | 265404 | 70554 | 45213 | 400972 | 5456 | 100267 | 7530 | 0 | 0.007182 | 0.007207 |
9052 | 3775 | 500000 | 2 | 1 | 2 | 32 | -1 | -1 | -1 | -1 | -1 | -1 | 103880 | 39356 | 301441 | 37945 | 104491 | 35234 | 39560 | 302961 | 38139 | 104673 | 35387 | 177258 | 0 | 0.007151 | 0.007176 |
9053 | 23477 | 480000 | 2 | 1 | 2 | 32 | -2 | -2 | -2 | -2 | -2 | -2 | 11872 | 38933 | 23479 | 52177 | 54005 | 53853 | 40000 | 23479 | 52209 | 54005 | 54500 | 42321 | 0 | 0.006011 | 0.006030 |
The model seems to have the easiest time making correct decisions about customers with good payment behavior who sustain that behavior.
PAY_0
To take residual analysis further, try plotting residuals by values of important variables or across bins of important variables.
# some seaborn configs
sns.set(font_scale=0.9) # legible font size
sns.set_style('whitegrid', {'axes.grid': False}) # white background, no grid in plots
sns.set_palette(sns.color_palette(["#4169e1", "#ff00ff"])) # consistent colors
# facet grid of residuals by PAY_0
sorted_ = test_yhat.sort_values(by='PAY_0') # sort for better layout of by-groups
g = sns.FacetGrid(sorted_, col='PAY_0', hue=y, col_wrap=4) # init grid
_ = g.map(plt.scatter, yhat, resid, alpha=0.4) # plot points
_ = g.add_legend(bbox_to_anchor=(0.82, 0.2)) # legend
Here, a problem with the model is visible. For people whose most recent repayment status, i.e. PAY_0 value, is good (PAY_0 <= 1), the model appears to struggle to predict default. When the model predicts this type of customer will not default, and they do, large residuals result. For people with late most recent payments (PAY_0 > 1), the model appears to struggle to predict on-time payment. When the model predicts people with late most recent payments will default, and then they don't, this also causes large residuals. A mechanism for error has been identified: over-emphasis of a single feature, PAY_0. These problems could become worse if market conditions change, such as in a recession, where more people with good payment behavior could default suddenly.
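One remediation idea for the over-emphasis of PAY_0 is to inject missing values into that feature during training, so the model cannot lean on it so heavily. A minimal sketch on a stand-in training frame, relying on XGBoost's native handling of NaN; the 20% corruption rate is an arbitrary assumption that would need tuning:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)

# stand-in training frame with a PAY_0-like column
train_ = pd.DataFrame({'PAY_0': rng.integers(-2, 9, size=1000).astype(float)})

# corrupt ~20% of PAY_0 values with NaN before training (rate is an assumption)
mask = rng.random(len(train_)) < 0.2
train_.loc[mask, 'PAY_0'] = np.nan
print(train_['PAY_0'].isna().mean())   # roughly 0.2
```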
SEX
While not a frontline fairness tool, it can be interesting to examine residuals across demographic variables to get a sense of whether the model's errors differ across these groups.
# facet grid of residuals by SEX
sorted_ = test_yhat.sort_values(by='SEX') # sort for better layout of by-groups
g = sns.FacetGrid(sorted_, col='SEX', hue=y, col_wrap=4) # init grid
_ = g.map(plt.scatter, yhat, resid, alpha=0.4) # plot points
_ = g.add_legend(bbox_to_anchor=(0.6, 0.5)) # legend
The residual profile is not dramatically different for men (SEX = 1) and women (SEX = 2). The largest-magnitude residuals appear for men and for false positive predictions. This does not ensure the model is not discriminatory, but it is a potentially interesting method to augment results from a more rigorous fairness analysis, such as: https://github.com/jphall663/interpretable_machine_learning_with_python/blob/master/dia.ipynb.
The analysis in this section applies numerous simple metrics for the model's accuracy and error across the values of important or demographic variables to discover potential blind spots in the model's predictions.
These are the metrics that will be examined across important and demographic variables. They are based on the confusion matrix, but other metrics would likely be useful as well.
metric_dict = {
#### overall performance
'Prevalence': '(tp + fn) / (tp + tn + fp + fn)', # how much default actually happens for this group
#'Adverse Impact': '(tp + fp) / (tp + tn + fp + fn)', # how often the model predicted default for each group
'Accuracy': '(tp + tn) / (tp + tn + fp + fn)', # how often the model predicts default and non-default correctly for this group
#### predicting default will happen
# (correctly)
'True Positive Rate': 'tp / (tp + fn)', # out of the people in the group *that did* default, how many the model predicted *correctly* would default
'Precision': 'tp / (tp + fp)', # out of the people in the group the model *predicted* would default, how many the model predicted *correctly* would default
#### predicting default won't happen
# (correctly)
'Specificity': 'tn / (tn + fp)', # out of the people in the group *that did not* default, how many the model predicted *correctly* would not default
'Negative Predicted Value': 'tn / (tn + fn)', # out of the people in the group the model *predicted* would not default, how many the model predicted *correctly* would not default
#### analyzing errors - type I
# false accusations
'False Positive Rate': 'fp / (tn + fp)', # out of the people in the group *that did not* default, how many the model predicted *incorrectly* would default
'False Discovery Rate': 'fp / (tp + fp)', # out of the people in the group the model *predicted* would default, how many the model predicted *incorrectly* would default
#### analyzing errors - type II
# costly omissions
'False Negative Rate': 'fn / (tp + fn)', # out of the people in the group *that did* default, how many the model predicted *incorrectly* would not default
'False Omissions Rate':'fn / (tn + fn)' # out of the people in the group the model *predicted* would not default, how many the model predicted *incorrectly* would not default
}
A number of small, tightly coupled utility functions are used to calculate and display the error metrics across the values of important or demographic variables.
# small utility functions
# all tightly coupled to global names and data structures !!
def cm_exp_parser(expression):
""" Translates abbreviated metric expressions from metric_dict
into executable Python statements.
Arg:
expression: Error metric expression from metric_dict.
Returns:
Python statements based on predefined metrics in metric_dict.
"""
# tp | fp cm_dict[level].iat[0, 0] | cm_dict[level].iat[0, 1]
# ------- ==> --------------------------------------------
# fn | tn cm_dict[level].iat[1, 0] | cm_dict[level].iat[1, 1]
expression = expression.replace('tp', '(cm_dict[level].iat[0, 0] + eps)')\
.replace('fp', '(cm_dict[level].iat[0, 1] + eps)')\
.replace('fn', '(cm_dict[level].iat[1, 0] + eps)')\
.replace('tn', '(cm_dict[level].iat[1, 1] + eps)')
return expression
################################################################################
def get_cm_dict(name, cutoff):
""" Loops through levels of named variable and calculates confusion
matrices per level; uses dynamically generated entities to reduce
code duplication.
Args:
name: Name of variable for which to calculate confusion matrices.
cutoff: Cutoff threshold for confusion matrices.
Returns:
Dictionary of confusion matrices.
"""
levels = sorted(list(test_yhat[name].unique())) # get levels
cm_dict = {} # init dict to store confusion matrices per level
for level in levels:
# dynamically name confusion matrices by level
# coerce to proper python names
cm_name = '_' + str(level).replace('-', 'm') + '_cm'
# dynamically calculate confusion matrices by level
code = cm_name + ''' = get_confusion_matrix(test_yhat,
y,
yhat,
by=name,
level=level,
cutoff=cutoff)'''
exec(code)
exec('cm_dict[level] = ' + cm_name) # store in dict
return cm_dict
################################################################################
def get_metrics_frame(name):
""" Loops through levels of named variable and metrics to calculate each
error metric per each level of the variable; uses dynamically generated
entities to reduce code duplication.
Arg:
name: Name of variable for which to calculate error metrics.
Return:
Pandas DataFrame of error metrics.
"""
levels = sorted(list(test_yhat[name].unique())) # get levels
metrics_frame = pd.DataFrame(index=levels) # init Pandas frame for metrics
eps = 1e-20 # for safe numerical operations
# nested loop through:
# - levels
# - metrics
for level in levels:
for metric in metric_dict.keys():
# parse metric expressions into executable pandas statements
expression = cm_exp_parser(metric_dict[metric])
# dynamically evaluate metrics to avoid code duplication
metrics_frame.loc[level, metric] = eval(expression)
# display results
return metrics_frame
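As a standalone check of this string-rewriting-plus-eval pattern, the parser's output can be evaluated against a toy confusion matrix. The function body below is copied from cm_exp_parser above so the snippet runs on its own, and the cell values are taken from the global confusion matrix earlier in the notebook:

```python
import pandas as pd

def cm_exp_parser(expression):
    """Rewrite tp/fp/fn/tn abbreviations into confusion matrix cell
    lookups (body copied from cm_exp_parser above)."""
    return expression.replace('tp', '(cm_dict[level].iat[0, 0] + eps)')\
                     .replace('fp', '(cm_dict[level].iat[0, 1] + eps)')\
                     .replace('fn', '(cm_dict[level].iat[1, 0] + eps)')\
                     .replace('tn', '(cm_dict[level].iat[1, 1] + eps)')

# toy confusion matrix using the global counts from earlier: tp, fp / fn, tn
cm_dict = {1: pd.DataFrame([[1207, 1192], [779, 5876]])}
level, eps = 1, 1e-20

accuracy = eval(cm_exp_parser('(tp + tn) / (tp + tn + fp + fn)'))
print(round(accuracy, 3))   # 0.782
```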
PAY_0
name = 'PAY_0'
cm_dict = get_cm_dict(name, gbm_cut) # get dict of confusion matrices
# print formatted error metrics frame: precision, title, colors
PAY_0_metrics = get_metrics_frame(name)
PAY_0_metrics['PAY_0'] = [-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8]
PAY_0_metrics.set_index('PAY_0', inplace=True)
PAY_0_metrics.round(3).\
style.set_caption('Error Metrics for ' + name).\
background_gradient(cmap=sns.diverging_palette(-20, 260, n=7, as_cmap=True),
axis=1)
Prevalence | Accuracy | True Positive Rate | Precision | Specificity | Negative Predicted Value | False Positive Rate | False Discovery Rate | False Negative Rate | False Omissions Rate | |
---|---|---|---|---|---|---|---|---|---|---|
PAY_0 | ||||||||||
-2 | 0.124 | 0.857 | 0.119 | 0.3 | 0.961 | 0.885 | 0.039 | 0.7 | 0.881 | 0.115 |
-1 | 0.168 | 0.805 | 0.267 | 0.383 | 0.913 | 0.861 | 0.087 | 0.617 | 0.733 | 0.139 |
0 | 0.121 | 0.864 | 0.143 | 0.345 | 0.963 | 0.891 | 0.037 | 0.655 | 0.857 | 0.109 |
1 | 0.325 | 0.457 | 0.93 | 0.368 | 0.229 | 0.871 | 0.771 | 0.632 | 0.07 | 0.129 |
2 | 0.709 | 0.709 | 1 | 0.709 | 0 | 0.5 | 1 | 0.291 | 0 | 0.5 |
3 | 0.748 | 0.748 | 1 | 0.748 | 0 | 0.5 | 1 | 0.252 | 0 | 0.5 |
4 | 0.571 | 0.571 | 1 | 0.571 | 0 | 0.5 | 1 | 0.429 | 0 | 0.5 |
5 | 0.444 | 0.444 | 1 | 0.444 | 0 | 0.5 | 1 | 0.556 | 0 | 0.5 |
6 | 0.25 | 0.25 | 1 | 0.25 | 0 | 0.5 | 1 | 0.75 | 0 | 0.5 |
7 | 0.5 | 0.5 | 1 | 0.5 | 0 | 0.5 | 1 | 0.5 | 0 | 0.5 |
8 | 0.75 | 0.75 | 1 | 0.75 | 0 | 0.5 | 1 | 0.25 | 0 | 0.5 |
As identified by basic residual analysis, it can be seen here that the model behaves extremely differently for customers with PAY_0 <= 1 and PAY_0 > 1, likely due to sparsity of data in the PAY_0 > 1 range. Again, this may be acceptable behavior today, but what if market conditions drift, these types of behavior become more common, and model mistakes become more costly in the future?
SEX
name = 'SEX'
cm_dict = get_cm_dict(name, gbm_cut) # get dict of confusion matrices
# print formatted error metrics frame: precision, title, colors
sex_metrics = get_metrics_frame(name)
sex_metrics['SEX'] = ['Male', 'Female']
sex_metrics.set_index('SEX', inplace=True)
sex_metrics.round(3).\
style.set_caption('Error Metrics for ' + name).\
background_gradient(cmap=sns.diverging_palette(-20, 260, n=7, as_cmap=True),
axis=1)
Prevalence | Accuracy | True Positive Rate | Precision | Specificity | Negative Predicted Value | False Positive Rate | False Discovery Rate | False Negative Rate | False Omissions Rate | |
---|---|---|---|---|---|---|---|---|---|---|
SEX | ||||||||||
Male | 0.235 | 0.773 | 0.655 | 0.513 | 0.809 | 0.884 | 0.191 | 0.487 | 0.345 | 0.116 |
Female | 0.209 | 0.788 | 0.573 | 0.495 | 0.845 | 0.882 | 0.155 | 0.505 | 0.427 | 0.118 |
Unlike PAY_0, different types of error and accuracy measures appear fairly aligned for men and women.
Benchmark models are an excellent model debugging tool. They can be used at training time to understand how a new model differs from an established, trusted model. They can also be used at scoring time to understand if a newer or more complex model is giving different predictions from a previously deployed trusted model or simpler model. If a prediction from a new model is too different from a prediction from a trusted model, this could be indicative of potential accuracy, fairness, or security problems.
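A minimal sketch of such a scoring-time discrepancy check, with hypothetical prediction vectors and an arbitrary disagreement threshold of 0.2:

```python
import numpy as np

# hypothetical predictions from a new model and a trusted benchmark
p_new = np.array([0.10, 0.45, 0.80, 0.30])
p_benchmark = np.array([0.12, 0.20, 0.78, 0.33])

# flag rows where the two models disagree by more than the threshold
flagged = np.abs(p_new - p_benchmark) > 0.2
print(np.where(flagged)[0])   # row indices to route to review
```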
Instead of using a pre-existing model, a stable penalized GLM will be used as a benchmark in this notebook.
def glm_grid(X, y, train, valid):
""" Wrapper function for penalized GLM with alpha and lambda search.
:param X: List of inputs.
:param y: Name of target variable.
:param train: Name of training H2OFrame.
:param valid: Name of validation H2OFrame.
    :return: Best H2O model from H2OGeneralizedLinearEstimator.
    """
    alpha_opts = [0.01, 0.25, 0.5, 0.99] # always keep some L2
    # define search criteria
    # i.e. over alpha
    # lambda search handled by lambda_search param below
hyper_parameters = {'alpha': alpha_opts}
# initialize grid search
grid = H2OGridSearch(
H2OGeneralizedLinearEstimator(
family="binomial",
lambda_search=True,
seed=SEED),
hyper_params=hyper_parameters)
# run grid search
grid.train(y=y,
x=X,
training_frame=train,
validation_frame=valid)
# select best model from grid search
return grid.get_grid()[0]
best_glm = glm_grid(X, y, h2o.H2OFrame(train), h2o.H2OFrame(test))
print('Best penalized GLM mean logloss residual: %.6f' %
best_glm.logloss(valid=True))
glm_cut = best_glm.F1(valid=True)[0][0] # get GLM cutoff to create decisions
print('At threshold: %.6f' % glm_cut)
Best penalized GLM mean logloss residual: 0.461254
At threshold: 0.300562
One interesting place to start comparing a more complex model to a simpler model is when the simple model is right and the complex model is wrong.
# copy test data
test_yhat2 = test_yhat.copy(deep=True)
# create columns for gbm and glm preds
test_yhat2.rename(columns={yhat: 'p_gbm_DEFAULT_NEXT_MONTH'}, inplace=True)
test_yhat2['p_glm_DEFAULT_NEXT_MONTH'] = best_glm.\
predict(h2o.H2OFrame(test))['p1'].\
as_data_frame()
# create columns for gbm and glm decisions (i.e. apply cutoff)
test_yhat2['gbm_DECISION'] = 0
test_yhat2.loc[test_yhat2['p_gbm_DEFAULT_NEXT_MONTH'] > gbm_cut,
'gbm_DECISION'] = 1
test_yhat2['glm_DECISION'] = 0
test_yhat2.loc[test_yhat2['p_glm_DEFAULT_NEXT_MONTH'] > glm_cut,
'glm_DECISION'] = 1
# create columns for gbm and glm wrong decisions
test_yhat2['gbm_WRONG'] = 0
test_yhat2.loc[test_yhat2[y] != test_yhat2['gbm_DECISION'], 'gbm_WRONG'] = 1
test_yhat2['glm_WRONG'] = 0
test_yhat2.loc[test_yhat2[y] != test_yhat2['glm_DECISION'], 'glm_WRONG'] = 1
# create a subset of preds where gbm is wrong, but glm is right
gbm_wrong = test_yhat2.loc[(test_yhat2['gbm_WRONG'] == 1) &
(test_yhat2['glm_WRONG'] == 0)]
gbm_wrong[X + [y, 'p_glm_DEFAULT_NEXT_MONTH', 'p_gbm_DEFAULT_NEXT_MONTH']].head()
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | p_glm_DEFAULT_NEXT_MONTH | p_gbm_DEFAULT_NEXT_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 360000 | 2 | 1 | 1 | 49 | 1 | -2 | -2 | -2 | -2 | -2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.278097 | 0.338660 |
8 | 180000 | 2 | 1 | 2 | 29 | 1 | -2 | -2 | -2 | -2 | -2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.238463 | 0.297464 |
22 | 200000 | 2 | 1 | 2 | 34 | -1 | 3 | 2 | 2 | 2 | 2 | 1587 | 1098 | 782 | 1166 | 700 | 1414 | 0 | 0 | 700 | 0 | 1200 | 0 | 0 | 0.227727 | 0.524791 |
25 | 130000 | 2 | 3 | 2 | 29 | 1 | -2 | -2 | -1 | 2 | -1 | -190 | -9850 | -9850 | 10311 | 10161 | 7319 | 0 | 0 | 20161 | 0 | 7319 | 13899 | 0 | 0.240743 | 0.303131 |
27 | 290000 | 2 | 1 | 2 | 37 | 1 | -2 | -1 | -1 | -1 | -1 | 0 | 0 | 3155 | 0 | 2359 | 0 | 0 | 3155 | 0 | 2359 | 0 | 0 | 0 | 0.261741 | 0.253799 |
For the subset of customers where the more complex GBM model is wrong and the simpler GLM is right, the predictions of the two models appear to be inversely correlated.
# custom legend
custom = [Line2D([0], [0], color='royalblue', lw=2),
Line2D([0], [0], color='deeppink', lw=2),
Line2D([0], [0], marker='o', color='w',
markerfacecolor='orange', markersize=3)]
# init plot
fig, ax = plt.subplots(figsize=(7, 5))
# plot sorted actuals
# double index reset orders index by sort variable and
# brings index into frame for plotting
_ = gbm_wrong[[y,'p_gbm_DEFAULT_NEXT_MONTH']].\
sort_values(by='p_gbm_DEFAULT_NEXT_MONTH').\
reset_index(drop=True).\
reset_index().\
plot(kind='scatter', x='index', y=y, color='orange', s=3, ax=ax,
legend=False)
# plot sorted gbm and glm preds
_ = gbm_wrong[['p_glm_DEFAULT_NEXT_MONTH', 'p_gbm_DEFAULT_NEXT_MONTH']].\
sort_values(by='p_gbm_DEFAULT_NEXT_MONTH').\
reset_index(drop=True).\
plot(ax=ax, legend=False,
title='Ranked Predictions for Correct GLM and Incorrect GBM')
# annotate plot
_ = ax.legend(custom, ['p_glm_DEFAULT_NEXT_MONTH', 'p_gbm_DEFAULT_NEXT_MONTH',
y])
_ = ax.set_xlabel('Ranked Row Index')
For a range of probabilities between ~0.2 and ~0.6, there exists a group of customers for whom the GLM gives more correct predictions than the more complex GBM. In the plot above, the orange points represent the known target labels, the pink line is the sorted GBM predictions, and the blue line is the GLM predictions for the same customers and target labels. For this group of people, the GLM is obviously better able to represent some attribute of the customers' demographics or behaviors. Can the differences between this group and the rest of the customers be identified and leveraged to make better predictions?
gbm_wrong.describe()
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | p_gbm_DEFAULT_NEXT_MONTH | r_DEFAULT_NEXT_MONTH | p_glm_DEFAULT_NEXT_MONTH | gbm_DECISION | glm_DECISION | gbm_WRONG | glm_WRONG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.000000 | 502.0 | 502.0 |
mean | 13870.175299 | 160896.414343 | 1.551793 | 1.872510 | 1.484064 | 36.705179 | 0.123506 | -0.633466 | -0.444223 | -0.474104 | -0.575697 | -0.641434 | 19479.758964 | 19197.115538 | 18623.701195 | 18039.635458 | 18077.874502 | 17293.533865 | 1800.513944 | 1860.097610 | 1525.557769 | 1956.577689 | 2736.480080 | 2585.035857 | 0.019920 | 0.306885 | 0.394727 | 0.230274 | 0.980080 | 0.019920 | 1.0 | 0.0 |
std | 8502.509341 | 119357.031737 | 0.497806 | 0.859188 | 0.538669 | 8.931807 | 0.977126 | 1.452204 | 1.593361 | 1.622671 | 1.527279 | 1.506777 | 56825.032169 | 54988.156119 | 53770.733333 | 52307.534351 | 51894.923914 | 48606.147624 | 4801.440935 | 8087.762757 | 5574.953571 | 11037.034091 | 12441.992849 | 13390.459737 | 0.139866 | 0.053974 | 0.178867 | 0.062770 | 0.139866 | 0.139866 | 0.0 | 0.0 |
min | 19.000000 | 10000.000000 | 1.000000 | 1.000000 | 0.000000 | 22.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -9802.000000 | -9850.000000 | -9850.000000 | -4894.000000 | -37594.000000 | -21295.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.179299 | 0.287795 | 0.050347 | 0.000000 | 0.000000 | 1.0 | 0.0 |
25% | 6986.500000 | 50000.000000 | 1.000000 | 1.000000 | 1.000000 | 30.000000 | -1.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.269304 | 0.318116 | 0.190901 | 1.000000 | 0.000000 | 1.0 | 0.0 |
50% | 13143.000000 | 150000.000000 | 2.000000 | 2.000000 | 1.000000 | 35.000000 | 0.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 414.500000 | 393.000000 | 390.000000 | 355.000000 | 323.000000 | 390.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.297946 | 0.355270 | 0.251531 | 1.000000 | 0.000000 | 1.0 | 0.0 |
75% | 20930.250000 | 230000.000000 | 2.000000 | 2.000000 | 2.000000 | 43.000000 | 1.000000 | 0.000000 | 1.500000 | 0.000000 | 0.000000 | 0.000000 | 8009.750000 | 9862.750000 | 9885.250000 | 9400.250000 | 9752.250000 | 9392.000000 | 1601.500000 | 1249.000000 | 804.750000 | 981.000000 | 1153.500000 | 1070.000000 | 0.000000 | 0.321003 | 0.395403 | 0.276952 | 1.000000 | 0.000000 | 1.0 | 0.0 |
max | 29956.000000 | 710000.000000 | 2.000000 | 6.000000 | 3.000000 | 63.000000 | 2.000000 | 3.000000 | 3.000000 | 7.000000 | 6.000000 | 5.000000 | 455293.000000 | 441048.000000 | 433541.000000 | 425538.000000 | 419080.000000 | 381132.000000 | 59068.000000 | 97225.000000 | 80677.000000 | 180000.000000 | 202317.000000 | 164047.000000 | 1.000000 | 0.564783 | 1.718701 | 0.346547 | 1.000000 | 1.000000 | 1.0 | 0.0 |
If this group of people can be isolated, either by descriptive statistics or by more sophisticated means, the training process could be adapted to fix these errors, or another model could be used at scoring time to create more accurate predictions. Even if the group cannot be isolated, the two models' predictions could potentially be blended in this range of predicted probabilities.
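A minimal sketch of blending in the problematic range, with hypothetical probability vectors; the ~0.2-0.6 window comes from the plot above, and the equal weighting is an assumption:

```python
import numpy as np

p_gbm = np.array([0.10, 0.35, 0.55, 0.90])   # hypothetical GBM probabilities
p_glm = np.array([0.15, 0.25, 0.30, 0.85])   # hypothetical GLM probabilities

# average the two models only where the GBM prediction falls in ~0.2-0.6
in_range = (p_gbm > 0.2) & (p_gbm < 0.6)
p_blend = np.where(in_range, (p_gbm + p_glm) / 2.0, p_gbm)
print(p_blend)   # rows 1 and 2 are blended; rows 0 and 3 keep the GBM value
```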
A novel and very interesting application of post-hoc explanation techniques is the in-depth investigation of model residuals. Below, post-hoc explanations of predictions and residuals are used to further debug the model's predictions.
A fairly recent addition to the shap package is the ability to calculate Shapley contributions to logloss. From this potentially very time-consuming calculation, an accurate estimate of each variable's contribution to the model's error for each row can be obtained. This notebook ships with pre-calculated values, but you can calculate them too; it might take a few hours, though.
# init explainer object for access to explainer.expected_value() function below
# expected value is a function for loss, not a constant
explainer = shap.TreeExplainer(xgb_model, test_yhat[X],
feature_dependence='independent', # necessary for logloss
model_output='logloss') # necessary for logloss
# load if pre-calculated
if os.path.isfile('shap_error_values.csv'):
shap_error_values = np.loadtxt('shap_error_values.csv', delimiter=',') # load
print('Pre-calculated Shapley values loaded from disk.') # print confirmation
# else, calculate
else:
xgb_model.set_attr(objective='binary:logistic')
shap_error_values = explainer.shap_values(test_yhat[X], y=test_yhat[y]) # long step (hours!)
np.savetxt('shap_error_values.csv', shap_error_values, delimiter=',') # save for immediate reuse later
Pre-calculated Shapley values loaded from disk.
shap_values = xgb_model.predict(dtest, pred_contribs=True)
Once the local Shapley contributions to the predictions and logloss are known, they can be aggregated by their mean absolute value to create global importance metrics.
# init two pane plot
fig, (ax0, ax1) = plt.subplots(ncols=2, sharey=False, figsize=(7, 5))
plt.subplots_adjust(left=0.0, right=1.2, wspace=0.4)
# plot global abs mean Shapley contribs to predictions
global_shap_values_pd = pd.DataFrame(np.abs(shap_values[:, :-1]).mean(axis=0),
index=X, columns=['Pred. Contribs.'])
_ = global_shap_values_pd.sort_values(by='Pred. Contribs.').\
plot(kind='barh', ax=ax0, legend=False, color='royalblue',
title='Global Shapley Contributions to Prediction')
_ = ax0.set_xlabel('mean(|SHAP Contrib.|)')
# plot global abs mean Shapley contribs to logloss
global_shap_error_values_pd = pd.DataFrame(np.abs(shap_error_values).mean(axis=0),
index=X, columns=['Loss Contribs.'])
_ = global_shap_error_values_pd.sort_values(by='Loss Contribs.').\
plot(kind='barh', ax=ax1, legend=False, color='royalblue',
title='Global Shapley Contributions to Logloss')
_ = ax1.set_xlabel('mean(|SHAP Contrib.|)')
Some features are more important to the logloss than to the predictions! These include PAY_3, PAY_2, and BILL_AMT1. The model could potentially be retrained without these features.
Shapley contributions to predictions are reliably available at scoring time. It may be possible to understand what Shapley contributions look like for correct model decisions and for incorrect model decisions and take remediation steps in real-time before issuing model predictions.
# init two pane plot
fig, (ax0, ax1) = plt.subplots(ncols=2, sharey=True)
plt.subplots_adjust(left=0.0, right=1.5, wspace=0.1)
# convert shap_values into Pandas DataFrame
# merge residuals onto this frame
local_shap_values_pd = pd.DataFrame(shap_values[:, :-1], columns=X)
local_shap_values_pd[resid] = test_yhat[resid]
# plot mean shap values of the 1,000 lowest residual customers
_ = local_shap_values_pd.sort_values(by=resid).\
reset_index(drop=True).\
iloc[:1000, :-1].mean(axis=0).\
plot(kind='barh', color='royalblue', ax=ax0,
title='Mean Low Residual Pred. Contribs.')
_ = ax0.set_xlabel('Pred. Contribs.')
# plot mean shap values of the 1,000 highest residual customers
_ = local_shap_values_pd.sort_values(by=resid).\
reset_index(drop=True).\
iloc[-1000:, :-1].mean(axis=0).\
plot(kind='barh', color='royalblue', ax=ax1,
title='Mean High Residual Pred. Contribs.')
_ = ax1.set_xlabel('Pred. Contribs.')
Here, large differences in average Shapley values are visible between low- and high-residual model predictions. On average, correct model decisions have negative Shapley values, while incorrect decisions have small-magnitude Shapley values. If, when issuing a prediction, the Shapley values are low magnitude, the row could be diverted to another model, say the linear benchmark model, or sent for human review. Input data, particularly features with large global or local Shapley contributions to loss, could also be corrupted with missing values in hopes of attaining a better model prediction.
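A sketch of that diversion logic at scoring time; the 0.5 magnitude threshold and the helper name route_prediction are hypothetical and would need tuning against real contribution distributions:

```python
import numpy as np

def route_prediction(shap_row, p_gbm, p_backup, min_magnitude=0.5):
    """Return the GBM prediction unless the row's total Shapley
    contribution magnitude is suspiciously weak, then fall back."""
    if np.abs(shap_row).sum() < min_magnitude:
        return p_backup   # weak local signal: use benchmark model instead
    return p_gbm

weak = np.array([0.05, -0.10, 0.02])    # low-magnitude contributions
strong = np.array([0.90, -0.40, 0.30])  # high-magnitude contributions
print(route_prediction(weak, 0.70, 0.40),
      route_prediction(strong, 0.70, 0.40))   # 0.4 0.7
```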
Shapley contributions to predictions can also be plotted against residual values, in hopes of understanding if certain model behaviors lead to higher errors.
# init three pane plot
fig, (ax0, ax1, ax2) = plt.subplots(ncols=3, sharey=False, figsize=(7, 5))
plt.subplots_adjust(left=0.0, right=2, wspace=0.4)
# plot shap vs. important variable PAY_0
_ = local_shap_values_pd.plot(kind='scatter', y=resid, x='PAY_0', color='royalblue',
title='PAY_0 Pred. Contribs. vs. ' + resid, ax=ax0)
_ = ax0.set_xlabel('PAY_0 Pred. Contribs.')
# plot shap vs. important variable PAY_3
_ = local_shap_values_pd.plot(kind='scatter', y=resid, x='PAY_3', color='royalblue',
title='PAY_3 Pred. Contribs. vs. ' + resid, ax=ax1)
_ = ax1.set_xlabel('PAY_3 Pred. Contribs.')
# plot shap vs. demographic variable SEX
_ = local_shap_values_pd.plot(kind='scatter', y=resid, x='SEX', color='royalblue',
title='SEX Pred. Contribs. vs. ' + resid, ax=ax2)
_ = ax2.set_xlabel('SEX Pred. Contribs.')
Here it appears there are no single Shapley contributions that can be associated with only high residuals. Every extremely high residual value also has low residual values for the same Shapley value. However, it can be seen that some Shapley values are only associated with low residuals, for instance 0 < PAY_0 Contribs. < 0.5 or PAY_3 Contribs. < -0.2. If trying to detect scoring problems in real-time, Shapley values in these ranges for these features could indicate that these predictions are likely to be correct. Also, in terms of local feature importance, the model is not treating SEX very differently -- a good sign from a fairness perspective, but not a definitive test. From a security perspective, high residuals directly above low residuals are potentially interesting. These are individuals the model usually gets right but, for some reason, got wrong in this case. That could indicate manipulation of the training data or model scoring logic in rare cases.
If relationships between Shapley contributions and dependably high residuals cannot be found in two-dimensions, they may be visible in three dimensions when two features interact.
# init 3-D plot
fig = plt.figure(figsize=(10, 8))
ax = plt.axes(projection='3d')
# z-axis: residuals values
_z = local_shap_values_pd[resid].values
ax.set_zlabel(resid)
# x-axis: PAY_2 values
_x = local_shap_values_pd['PAY_2'].values
ax.set_xlabel('\nPAY_2 Pred. Contribs.')
# y-axis: PAY_3 values
_y = local_shap_values_pd['PAY_3'].values
ax.set_ylabel('\nPAY_3 Pred. Contribs.')
# plot and annotate
_ = ax.scatter3D(_x, _y, _z, color='royalblue')
fig.suptitle('PAY_2 and PAY_3 Pred. Contribs vs. ' + resid)
plt.tight_layout()
Here, two features that are globally more important to logloss than to the model predictions are plotted against model residuals. As in the plots directly above, this view can be used to find potentially interesting regions of purely high or low residuals, differences in model behavior across demographic segments, or strange patterns indicating possible manipulation of data or model logic. Cluster profiles of this space seem particularly interesting.
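The closing remark about cluster profiles could be explored with a simple clustering of this Shapley space. Below is a sketch with scikit-learn on synthetic stand-in contributions; the blob locations and column interpretation are illustrative assumptions, not results from the notebook:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(12345)
# stand-in for per-row Shapley loss contributions of two features,
# e.g. PAY_2 and PAY_3: one low-residual blob and one high-residual blob
shap_space = np.vstack([rng.normal(-0.2, 0.05, (100, 2)),
                        rng.normal(0.5, 0.1, (20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=12345).fit(shap_space)

# profile each cluster by its mean contributions; in practice these
# profiles would be cross-referenced with residuals and demographics
for label in range(2):
    centroid = shap_space[km.labels_ == label].mean(axis=0)
    print('cluster', label, 'mean contribs:', centroid.round(2))
```

Clusters of high-contribution, high-residual rows found this way could then be inspected for shared characteristics or signs of manipulation.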
Much like individual Shapley contributions to predictions, individual Shapley contributions to logloss show exactly how much a variable contributes to the logloss for any individual row.
# select highest residual individual
row_id = test_yhat_sorted.loc[0, 'ID']
row = test_yhat[test_yhat['ID'] == row_id]
# match to their shap loss contribs
# create Pandas DataFrame
# display
s_df = pd.DataFrame(shap_error_values[row.index[0], :].reshape(23, 1),
columns=['Loss Contribs.'], index=X)
s_df.sort_values(by='Loss Contribs.', inplace=True, ascending=False)
s_df
Feature | Loss Contribs. |
---|---|
PAY_AMT1 | 0.611533 |
LIMIT_BAL | 0.403705 |
PAY_AMT2 | 0.352150 |
PAY_0 | 0.311651 |
PAY_3 | 0.272713 |
PAY_AMT3 | 0.184478 |
PAY_6 | 0.147152 |
PAY_2 | 0.128540 |
BILL_AMT1 | 0.123718 |
MARRIAGE | 0.100163 |
PAY_5 | 0.092106 |
PAY_AMT5 | 0.086318 |
PAY_AMT6 | 0.075505 |
PAY_4 | 0.067393 |
BILL_AMT2 | 0.064727 |
EDUCATION | 0.051236 |
PAY_AMT4 | 0.046066 |
BILL_AMT4 | 0.022856 |
BILL_AMT6 | 0.011413 |
BILL_AMT5 | 0.006382 |
BILL_AMT3 | 0.000471 |
AGE | -0.050907 |
SEX | -0.073163 |
For this high-residual individual, PAY_AMT1, LIMIT_BAL, and PAY_AMT2 have the largest positive impact on the model error.
Just to make sure Shapley is locally accurate, it can be shown that the Shapley contributions sum directly to the logloss for this individual.
# should match residual value for row ... very close
print('Total Shapley contributions to logloss: %.6f' %
(s_df['Loss Contribs.'].sum() + explainer.expected_value(row[y].values[0])))
Total Shapley contributions to logloss: 4.826473
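As an independent check on that number, the logloss of a single binary observation is -(y·log(p) + (1 - y)·log(1 - p)); computing it directly from this individual's predicted probability (p ≈ 0.008015, with an actual default, y = 1) reproduces essentially the same value:

```python
from math import log

p = 0.008015  # GBM's predicted probability of default for this individual
y = 1         # this individual actually did default

# logloss for one binary observation: -(y*log(p) + (1 - y)*log(1 - p))
logloss = -(y * log(p) + (1 - y) * log(1 - p))
print('Logloss computed directly: %.4f' % logloss)  # ~4.8264
```

The tiny discrepancy from the Shapley sum is due to rounding of the displayed probability.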
# plot barchart of top five loss contribs
ax = s_df.plot(kind='bar',
title='Local Feature Contributions to Logloss\n',
color='royalblue',
legend=False)
# annotate
_ = ax.set_ylabel('Loss Contribs.')
Just like reason codes, or simple written explanations, can be generated for each model prediction, reason codes can also be generated for each prediction's logloss by combining the ranking provided by Shapley values with the individual row values.
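A minimal sketch of assembling such reason codes follows. The sentence template is an assumption, and the toy numbers simply echo the loss-contribution table and row values above:

```python
def reason_codes(loss_contribs, row, top_n=5):
    """Pair the top-N Shapley loss contributions with the row's
    actual values to form simple written explanations."""
    ranked = sorted(loss_contribs.items(), key=lambda kv: kv[1], reverse=True)
    return ['%s = %s contributed %.3f to logloss' % (name, row[name], contrib)
            for name, contrib in ranked[:top_n]]

# toy values echoing the tables above
contribs = {'PAY_AMT1': 0.612, 'LIMIT_BAL': 0.404, 'PAY_AMT2': 0.352,
            'PAY_0': 0.312, 'PAY_3': 0.273, 'SEX': -0.073}
row = {'PAY_AMT1': 81690, 'LIMIT_BAL': 500000, 'PAY_AMT2': 18225,
       'PAY_0': -2, 'PAY_3': -2, 'SEX': 1}

for code in reason_codes(contribs, row):
    print(code)
```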
row # helps understand reason codes
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT_NEXT_MONTH | p_DEFAULT_NEXT_MONTH | r_DEFAULT_NEXT_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 983 | 500000 | 1 | 1 | 2 | 36 | -2 | -2 | -2 | -2 | -2 | -2 | 45106 | 81264 | 18122 | 27229 | 21462 | 27911 | 81690 | 18225 | 27365 | 21570 | 28050 | 17397 | 1 | 0.008015 | 4.826454 |
For this individual, the top five characteristics contributing to the logloss residual are PAY_AMT1, LIMIT_BAL, PAY_AMT2, PAY_0, and PAY_3.
All of the variables contributing to high logloss for this individual have values that would be considered positive attributes: large payments, a high credit limit, and good repayment statuses. Essentially, the model is surprised this person defaulted. This could mean that the model is too reliant on repayment statuses, as was seen by plotting residuals against PAY_0 values, or it may indicate the model requires more information to make good decisions, likely in the form of additional input variables.
In general, when examining local contributions to logloss, one might consider whether local contributions to predictions and errors are parsimonious or might indicate tampering with the data or model scoring logic or whether protected variables are weighing heavily in a model decision. Also, and although tedious and time-consuming today, one could use sensitivity analysis to understand how changing model inputs could affect model outcomes for different individuals.
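Such a one-at-a-time sensitivity analysis can be sketched in a few lines. The `score` function here is a toy stand-in, not the trained GBM, and the perturbation values are illustrative:

```python
from math import exp

def sensitivity(score, row, feature, deltas):
    """One-at-a-time sensitivity analysis: perturb a single feature
    and record how the model's score changes relative to baseline."""
    base = score(row)
    out = {}
    for d in deltas:
        perturbed = dict(row, **{feature: row[feature] + d})
        out[d] = score(perturbed) - base
    return out

# toy stand-in scoring function (NOT the trained GBM): probability of
# default rises monotonically with PAY_0, mimicking the constrained model
score = lambda r: 1 / (1 + exp(-(0.8 * r['PAY_0'] - 1)))

print(sensitivity(score, {'PAY_0': 0}, 'PAY_0', [-1, 1, 2]))
```

The same loop, pointed at the real scoring function, shows how much an individual's outcome would change under small, plausible changes to their inputs.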
A time-tested method of determining if a model is fundamentally biased, either missing data or too simple in form, is to determine whether strong patterns exist in the model residuals. One way to determine if these patterns exist is to model the residuals themselves. When the model used is a decision tree or rule-based model, this opens up the possibility of automatically generating rules that indicate when or why a model might be wrong.
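Before turning to the notebook's h2o implementation, the idea can be sketched with scikit-learn on synthetic data. The feature names and the planted error pattern below are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(12345)
# two synthetic repayment-status-like features
X_syn = rng.integers(-2, 9, size=(500, 2)).astype(float)
# synthetic logloss residuals with a planted pattern:
# residuals jump when the first feature is >= 2
resid_syn = 0.05 + 0.4 * (X_syn[:, 0] >= 2) + rng.normal(0, 0.01, 500)

# shallow tree fit to the residuals themselves
tree = DecisionTreeRegressor(max_depth=4, random_state=12345)
tree.fit(X_syn, resid_syn)

# human-readable rules indicating when the model is likely to be wrong
print(export_text(tree, feature_names=['PAY_0', 'PAY_5']))
```

The printed rules recover the planted pattern, and in a real workflow each high-residual leaf becomes a candidate model assertion.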
This set of small functions is used to train decision trees on logloss residuals and display those trees.
# small utility functions
# all tightly coupled to global names and data structures !!
import h2o # h2o Python API
from h2o.backend import H2OLocalServer # for locating the h2o jar
from h2o.estimators.random_forest import H2ORandomForestEstimator # single decision trees via a 1-tree random forest
def train_cv_dt(model_id, frame):
""" Utility function to train decision trees.
Args:
model_id: h2o model identifier.
frame: Pandas DataFrame containing X and yhat on which to train
the decision trees.
Returns:
Path to saved model MOJO (Java scoring artifact),
model as h2o model object.
"""
# initialize single tree model
tree = H2ORandomForestEstimator(ntrees=1, # use only one tree
sample_rate=1, # use all rows in that tree
mtries=-2, # use all columns in that tree's split search
max_depth=4, # shallow trees are easier to understand
seed=SEED, # set random seed for reproducibility
nfolds=3, # cross-validation for stability and ...
# only way to get metrics for 1 tree in h2o
model_id=model_id) # gives MOJO artifact a recognizable name
# train single tree model
tree.train(x=X, y=resid, training_frame=h2o.H2OFrame(frame))
# persist MOJO (compiled Java representation of trained model)
# from which to generate plot of tree
mojo_path = tree.download_mojo(path='.')
print('Generated MOJO path:\n', mojo_path)
return mojo_path, tree
################################################################################
#
def get_gv(title, model_id, mojo_path):
""" Utility function to generate graphviz dot file from h2o MOJO using
a subprocess.
Args:
title: Title for displayed decision tree.
model_id: h2o model identifier.
mojo_path: Path to saved model MOJO (Java scoring artifact);
generated by train_cv_dt function above.
"""
# locate h2o jar
hs = H2OLocalServer()
h2o_jar_path = hs._find_jar()
print('Discovered H2O jar path:\n', h2o_jar_path)
# construct command line call to generate graphviz version of
# tree, see for more information:
# http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
gv_file_name = model_id + '.gv'
gv_args = str('-cp ' + h2o_jar_path +
' hex.genmodel.tools.PrintMojo --tree 0 -i '
+ mojo_path + ' -o').split()
gv_args.insert(0, 'java')
gv_args.append(gv_file_name)
if title is not None:
gv_args = gv_args + ['--title', title]
# call constructed command
print()
print('Calling external process ...')
print(' '.join(gv_args))
# if the line below is failing for you, try instead:
# _ = subprocess.call(gv_args, shell=True)
_ = subprocess.call(gv_args)
################################################################################
def get_png(model_id):
""" Utility function to generate PNGs from .dots using a subprocess.
Arg:
model_id: h2o model identifier.
"""
gv_file_name = model_id + '.gv'
# construct call to generate PNG from
# graphviz representation of the tree
png_file_name = model_id + '.png'
png_args = str('dot -Tpng ' + gv_file_name + ' -o ' + png_file_name)
png_args = png_args.split()
# call
print('Calling external process ...')
print(' '.join(png_args))
# if the line below is failing for you, try instead:
# _ = subprocess.call(png_args, shell=True)
_ = subprocess.call(png_args)
# train trees
mojo_path0, tree0 = train_cv_dt('tree0', test_yhat[test_yhat[y] == 0])
mojo_path1, tree1 = train_cv_dt('tree1', test_yhat[test_yhat[y] == 1])
# shutdown h2o
h2o.cluster().shutdown()
Parse progress: 100%
drf Model Build progress: 100%
Generated MOJO path:
 /home/patrickh/workspace/interpretable_machine_learning_with_python/tree0.zip
Parse progress: 100%
drf Model Build progress: 100%
Generated MOJO path:
 /home/patrickh/workspace/interpretable_machine_learning_with_python/tree1.zip
H2O session _sid_8235 closed.
DEFAULT_NEXT_MONTH = 0 logloss residuals
# cv metrics for y = 0 tree
tree0.cross_validation_metrics_summary().as_data_frame()
 | metric | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid |
---|---|---|---|---|---|---|
0 | mae | 0.055455755 | 0.0023644632 | 0.052755732 | 0.057156604 | 0.056454934 |
1 | mean_residual_deviance | 0.006993064 | 4.908051E-4 | 0.006517125 | 0.0074974936 | 0.0069645736 |
2 | mse | 0.006993064 | 4.908051E-4 | 0.006517125 | 0.0074974936 | 0.0069645736 |
3 | r2 | 0.88753355 | 0.012567164 | 0.8943817 | 0.8730297 | 0.8951892 |
4 | residual_deviance | 0.006993064 | 4.908051E-4 | 0.006517125 | 0.0074974936 | 0.0069645736 |
5 | rmse | 0.08359027 | 0.0029320545 | 0.08072871 | 0.08658807 | 0.08345402 |
6 | rmsle | 0.0591188 | 0.0025942866 | 0.05636424 | 0.061515696 | 0.059476465 |
The residuals for non-default decisions can be fit relatively well with a single shallow tree and with stable errors across three folds.
DEFAULT_NEXT_MONTH = 1 logloss residuals
# cv metrics for y = 1 tree
tree1.cross_validation_metrics_summary().as_data_frame()
 | metric | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid |
---|---|---|---|---|---|---|
0 | mae | 0.16745956 | 0.003669606 | 0.16916788 | 0.16324724 | 0.16996355 |
1 | mean_residual_deviance | 0.0674532 | 0.0036559089 | 0.06981851 | 0.069298685 | 0.063242406 |
2 | mse | 0.0674532 | 0.0036559089 | 0.06981851 | 0.069298685 | 0.063242406 |
3 | r2 | 0.88707805 | 0.008536131 | 0.8798821 | 0.8848426 | 0.89650947 |
4 | residual_deviance | 0.0674532 | 0.0036559089 | 0.06981851 | 0.069298685 | 0.063242406 |
5 | rmse | 0.25965294 | 0.007094731 | 0.26423192 | 0.26324642 | 0.25148043 |
6 | rmsle | 0.099633954 | 0.003624949 | 0.103740685 | 0.09828153 | 0.09687965 |
The residuals for default decisions can be fit relatively well with a single shallow tree and with stable errors across three folds.
Generate .dot files for visualizing decision trees
get_gv('LogLoss Residual Tree (' + y + '=0)', 'tree0', mojo_path0) # .dot for y=0 tree
get_gv('LogLoss Residual Tree (' + y + '=1)', 'tree1', mojo_path1) # .dot for y=1 tree
Discovered H2O jar path:
 /home/patrickh/workspace/interpretable_machine_learning_with_python/env_iml/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
Calling external process ...
java -cp /home/patrickh/workspace/interpretable_machine_learning_with_python/env_iml/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar hex.genmodel.tools.PrintMojo --tree 0 -i /home/patrickh/workspace/interpretable_machine_learning_with_python/tree0.zip -o tree0.gv --title LogLoss Residual Tree (DEFAULT_NEXT_MONTH=0)
Discovered H2O jar path:
 /home/patrickh/workspace/interpretable_machine_learning_with_python/env_iml/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
Calling external process ...
java -cp /home/patrickh/workspace/interpretable_machine_learning_with_python/env_iml/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar hex.genmodel.tools.PrintMojo --tree 0 -i /home/patrickh/workspace/interpretable_machine_learning_with_python/tree1.zip -o tree1.gv --title LogLoss Residual Tree (DEFAULT_NEXT_MONTH=1)
Convert .dot files into .pngs
get_png('tree0') # .png for y=0 tree
get_png('tree1') # .png for y=1 tree
Calling external process ...
dot -Tpng tree0.gv -o tree0.png
Calling external process ...
dot -Tpng tree1.gv -o tree1.png
DEFAULT_NEXT_MONTH = 0 logloss residuals
display(Image('tree0.png'))
This relatively accurate tree shows that strong patterns likely exist in the model residuals for non-default decisions. The highest residuals appear to occur when: PAY_0 >= 1.5 and PAY_5 >= 1 and BILL_AMT2 < 18776.5 and PAY_4 >= 1. This scenario could easily be converted into a model assertion for use when scoring new data. When these conditions are present, the model is likely to be wrong, and action can be taken, such as scoring this individual with another model, diverting them to a human case worker, simply predicting the mean value of DEFAULT_NEXT_MONTH, or injecting missing values into their data in hopes of issuing a better prediction.
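A model assertion built from that extracted rule might look like the following sketch; the function name and toy rows are illustrative, and in practice the fallback action would be one of the remediations just listed:

```python
def high_residual_assertion(row):
    """Flag rows matching the high-residual leaf found by the y=0
    residual tree: for these rows, the GBM score is likely to be wrong."""
    return (row['PAY_0'] >= 1.5 and row['PAY_5'] >= 1 and
            row['BILL_AMT2'] < 18776.5 and row['PAY_4'] >= 1)

# toy scoring rows
risky = {'PAY_0': 2, 'PAY_5': 2, 'BILL_AMT2': 1000, 'PAY_4': 1}
safe = {'PAY_0': 0, 'PAY_5': 0, 'BILL_AMT2': 50000, 'PAY_4': 0}

print(high_residual_assertion(risky))  # True  -> divert to a fallback
print(high_residual_assertion(safe))   # False -> trust the GBM score
```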
Since these residuals relate to individuals being awarded a loan, another potentially important pattern to be aware of in the residuals would be if employees, contractors, or consultants were consistently being awarded loans, particularly by high-residual model decisions. Such a pattern could be indicative of data poisoning or tampering with model scoring logic. It seems possible that these tree rules could also be used to create adversarial examples for white-hat hacking purposes.
DEFAULT_NEXT_MONTH = 1 logloss residuals
display(Image('tree1.png'))
This relatively accurate tree shows that strong patterns likely also exist in the model residuals for default decisions. Rules of this tree could be used to create model assertions as well, and because the decisions of this tree result in denial of credit, any rules relating demographic variables to large residuals may be indicative of sociological bias in model predictions.
This notebook uses techniques related to residual analysis to:
- Visualize residuals to learn about outlying, wrong predictions and to see that the trained model is too dependent on one input variable: PAY_0.
- Examine multiple types of errors across the levels of categorical variables, finding that the model behaves poorly for PAY_0 >= 2 but seems to have roughly similar error characteristics across men and women.
- Compare monotonic GBM predictions to simpler linear model predictions, discovering a group of customers for which the simpler linear model performs better than the GBM.
- Analyze residuals with post-hoc explainable AI techniques, using global and local Shapley contributions to logloss to understand which features drive model error.
Once these kinds of model bugs have been identified, new options exist to remediate them during either training or scoring. This model was found to be too dependent on PAY_0, and there was remaining signal in the model residuals to train accurate decision trees. It seems there is additional information that would lessen this model's over-emphasis of PAY_0 and perhaps allow the constrained GBM to make more accurate decisions in general. Local contributions to logloss were also calculated. Is it possible to retrain the model with this information in some kind of novel row and column boosting scheme? Can the Shapley prediction contributions and the decision tree rules for residuals be used to create model assertions automatically? More importantly than this one toy case, can these strategies be applied to find accuracy, fairness, or security bugs in different types of models?