Dependent Variable
Independent Variables
This is a simple test to showcase the toolchain for SHAP
Dependent Variable
Independent Variables
Word-level examination
From Wich, Bauer and Groh (2020 WOAH)
Dependent Variable
Independent Variables
From the web appendix of Chernozhukov et al. (2017 AER)
Is there gender bias in annual compensation?
## Job.Titles Department Full.Time \
## 0 SERGEANT POLICE 1
## 1 POLICE OFFICER (ASSIGNED AS DETECTIVE) POLICE 1
## 2 Other GENERAL SERVICES 1
## 3 Other WATER MGMNT 1
## 4 Other TRANSPORTN 1
## ... ... ... ...
## 33581 POLICE OFFICER POLICE 1
## 33582 POLICE OFFICER POLICE 1
## 33583 POLICE OFFICER POLICE 1
## 33584 POLICE OFFICER POLICE 1
## 33585 Other Other 1
##
## Salaried Female
## 0 1 0.0
## 1 1 1.0
## 2 1 1.0
## 3 1 0.0
## 4 0 0.0
## ... ... ...
## 33581 1 1.0
## 33582 1 1.0
## 33583 1 0.0
## 33584 1 0.0
## 33585 1 0.0
##
## [33586 rows x 5 columns]
prefix= lets us name the columns of the output. We did this in Session 2
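As a sketch of that step (assuming the encodings come from pd.get_dummies; the prefix labels 'Job' and 'Dept' are illustrative, not necessarily the ones used here):
# One-hot encode the two categorical columns; prefix= names the resulting columns
one_hot1 = pd.get_dummies(df['Job.Titles'], prefix='Job')
one_hot2 = pd.get_dummies(df['Department'], prefix='Dept')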
vars = one_hot1.columns.tolist() + \
one_hot2.columns.tolist() + \
['Full.Time', 'Salaried', 'Female']
dtrain = xgb.DMatrix(df[vars], label=df['Salary'], feature_names=vars)
param = {
'booster': 'gbtree', # default -- tree based
'nthread': 8, # number of threads to use for parallel processing
'objective': 'reg:squarederror', # squared-error regression objective
'eval_metric': 'rmse', # evaluation metric: RMSE
'eta': 0.3, # shrinkage; [0, 1], default 0.3
'max_depth': 6, # maximum depth of each tree; default 6
'gamma': 0, # set above 0 to prune trees, [0, inf], default 0
'min_child_weight': 1, # higher leads to more pruning of trees, [0, inf], default 1
'subsample': 1, # Randomly subsample rows if in (0, 1), default 1
}
num_round = 30
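A minimal sketch of the training step and the SHAP setup that the next slides rely on (assuming the shap package is installed; bst, explainer, and shap_values are illustrative names):
import shap
# Train the gradient-boosted tree model for 30 rounds
bst = xgb.train(param, dtrain, num_boost_round=num_round)
# TreeExplainer computes exact SHAP values for tree ensembles
explainer = shap.TreeExplainer(bst)
shap_values = explainer.shap_values(df[vars])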
Here we see that having Female=0 was the fourth most influential feature in the model, and that it led to a higher predicted salary
Here we see that having Female=1 was the second most influential feature in the model, and that it led to a lower predicted salary
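A hedged sketch of how one of these individual explanations can be drawn, reusing the explainer above (the row index is illustrative):
# Force plot: decomposes one employee's predicted salary into feature contributions
i = 0  # illustrative observation
shap.force_plot(explainer.expected_value, shap_values[i, :], df[vars].iloc[i, :])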
Aims to provide an explanation of the importance of model inputs in explaining model output
SHAP: SHapley Additive exPlanations
Remember that our model is nonparametric! A feature's SHAP contribution can have different signs across observations even when the feature's value is the same, because of interaction effects
This may not be useful for XGBoost since it already has an importance metric, but many other models lack it
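For model-wide importance, a one-line sketch using the same SHAP values computed above:
# Summary plot: ranks features by mean absolute SHAP value across all observations
shap.summary_plot(shap_values, df[vars])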
How does political bias in data impact hate speech classification?
From Wich, Bauer and Groh (2020 WOAH)
Left-wing text induces a statistically significant divergence in model performance when more than 2/3 of the text is left-wing
Right-wing text induces a statistically significant divergence in model performance when only 1/3 of the text is right-wing
SHAP will weight the extent to which a word indicates the presence of hate speech, in [conditional] expectation
Impact or overlap with methodological work by Susan Athey, Matthew Gentzkow, Trevor Hastie, Guido Imbens, Matt Taddy, and Stefan Wager
And repeat. Bootstrap this out and take the mean or median of the estimators
There are a lot of older methods that try to address this, though incompletely
\[ \begin{aligned} Y &= g_0(T, C) + \varepsilon_1\\ T &= m_0(C) + \varepsilon_2 \end{aligned} \]
We know these assumptions aren’t true!
How can we estimate a more general form for \(g_0\) and \(m_0\)?
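To make the idea concrete, here is a minimal cross-fitting sketch for the simpler partially linear case \(Y = \theta T + g_0(C) + \varepsilon\), using random forests for the nuisance functions. The function and variable names are illustrative, not the DoubleML implementation; Y, T, and C are assumed to be NumPy arrays.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr_sketch(Y, T, C, n_folds=5, seed=0):
    # Cross-fitted, residual-on-residual estimate of a constant treatment effect
    num, den = 0.0, 0.0
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(C):
        # Nuisance functions are fit on the training folds only
        y_hat = RandomForestRegressor(random_state=seed).fit(C[train], Y[train]).predict(C[test])
        t_hat = RandomForestRegressor(random_state=seed).fit(C[train], T[train]).predict(C[test])
        u = Y[test] - y_hat   # residualized outcome
        v = T[test] - t_hat   # residualized treatment
        # Neyman-orthogonal moment: regress residualized outcome on residualized treatment
        num += np.sum(v * u)
        den += np.sum(v * v)
    return num / den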
Problem: How does 401k participation impact wealth?
## nifa net_tfa tw age inc fsize educ db marr \
## 0 0.0 0.0 4500.0 47 6765.0 2 8 0 0
## 1 6215.0 1015.0 22390.0 36 28452.0 1 16 0 0
## 2 0.0 -2000.0 -2000.0 37 3300.0 6 12 1 0
## 3 15000.0 15000.0 155000.0 58 52590.0 2 16 0 1
## 4 0.0 0.0 58000.0 32 21804.0 1 11 0 0
## ... ... ... ... ... ... ... ... .. ...
## 9910 98498.0 98858.0 157858.0 52 73920.0 1 16 1 0
## 9911 287.0 6230.0 15730.0 41 42927.0 4 14 0 1
## 9912 99.0 6099.0 7406.0 40 23619.0 2 16 1 0
## 9913 0.0 -32.0 2468.0 47 14280.0 4 6 1 0
## 9914 4000.0 5000.0 8857.0 33 11112.0 1 14 0 0
##
## twoearn e401 p401 pira hown
## 0 0 0 0 0 1
## 1 0 0 0 0 1
## 2 0 0 0 0 0
## 3 1 0 0 0 1
## 4 0 0 0 0 1
## ... ... ... ... ... ...
## 9910 0 1 1 0 1
## 9911 1 1 1 1 1
## 9912 0 1 0 1 0
## 9913 0 1 1 0 0
## 9914 0 1 1 0 0
##
## [9915 rows x 14 columns]
## ================== DoubleMLData Object ==================
##
## ------------------ Data summary ------------------
## Outcome variable: net_tfa
## Treatment variable(s): ['e401']
## Covariates: ['nifa', 'tw', 'age', 'inc', 'fsize', 'educ', 'db', 'marr', 'twoearn', 'p401', 'pira', 'hown']
## Instrument variable(s): None
## No. Observations: 9915
##
## ------------------ DataFrame info ------------------
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 9915 entries, 0 to 9914
## Columns: 14 entries, nifa to hown
## dtypes: float32(4), int8(10)
## memory usage: 329.2 KB
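A minimal sketch of how the DoubleMLData object above could be constructed, assuming df is the DataFrame printed earlier (the column lists mirror the summary output):
import doubleml as dml
df_dml = dml.DoubleMLData(df, y_col='net_tfa', d_cols='e401',
                          x_cols=['nifa', 'tw', 'age', 'inc', 'fsize', 'educ',
                                  'db', 'marr', 'twoearn', 'p401', 'pira', 'hown'])
print(df_dml)
The nuisance learners g_0 and m_0 used below are set up in the fully worked example at the end of these slides.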
# Fix the random number generator for replicability
np.random.seed(1234)
# Run the model
dml_model_irm = dml.DoubleMLIRM(df_dml, g_0, m_0)
# Output the model's findings
print(dml_model_irm.fit())
## ================== DoubleMLIRM Object ==================
##
## ------------------ Data summary ------------------
## Outcome variable: net_tfa
## Treatment variable(s): ['e401']
## Covariates: ['nifa', 'tw', 'age', 'inc', 'fsize', 'educ', 'db', 'marr', 'twoearn', 'p401', 'pira', 'hown']
## Instrument variable(s): None
## No. Observations: 9915
##
## ------------------ Score & algorithm ------------------
## Score function: ATE
## DML algorithm: dml2
##
## ------------------ Resampling ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: True
##
## ------------------ Fit summary ------------------
## coef std err t P>|t| 2.5 % 97.5 %
## e401 3320.43343 383.604082 8.655887 4.890947e-18 2568.583245 4072.283614
# Run the model
dml_model_irm_ATTE = dml.DoubleMLIRM(df_dml, g_0, m_0, score='ATTE')
# Output the model's findings
print(dml_model_irm_ATTE.fit())
## ================== DoubleMLIRM Object ==================
##
## ------------------ Data summary ------------------
## Outcome variable: net_tfa
## Treatment variable(s): ['e401']
## Covariates: ['nifa', 'tw', 'age', 'inc', 'fsize', 'educ', 'db', 'marr', 'twoearn', 'p401', 'pira', 'hown']
## Instrument variable(s): None
## No. Observations: 9915
##
## ------------------ Score & algorithm ------------------
## Score function: ATTE
## DML algorithm: dml2
##
## ------------------ Resampling ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: True
##
## ------------------ Fit summary ------------------
## coef std err t P>|t| 2.5 % 97.5 %
## e401 10081.312662 392.074708 25.712734 8.421563e-146 9312.860354 10849.764969
The default DML algorithm is dml2; you can switch to dml1 using dml_procedure='dml1'
dml1 follows the math in these slides: it solves the moment condition on each fold separately and averages the resulting estimates
dml2 instead solves for the average of the moment condition being equal to zero overall, across all folds
Setting n_rep=100 repeats the sample splitting 100 times and aggregates the results across repetitions
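A hedged sketch of these options on the same model (parameter names follow the DoubleML releases that expose dml_procedure):
# Same model, but fold-wise estimation (dml1) and 100 repeated sample splits
np.random.seed(1234)
dml_model_dml1 = dml.DoubleMLIRM(df_dml, g_0, m_0,
                                 dml_procedure='dml1', n_rep=100)
print(dml_model_dml1.fit())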
SHAP
DoubleML
Python
R
# Fully worked out DoubleML model using Histogram-based GBM
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
# set up the data
df = pd.read_stata('../Data/S5_sipp1991.dta')
y = 'net_tfa'
treat = 'e401'
controls = [x for x in df.columns.tolist() if x not in [y, treat]]
df_dml3 = dml.DoubleMLData(df, y_col=y, d_cols=treat, x_cols=controls)
# Set up the nonparametric nuisance functions: g_0 for the outcome, m_0 for the treatment
g_0 = HistGradientBoostingRegressor(loss='least_squares',
                                    learning_rate=0.01,
                                    max_iter=1000,
                                    max_depth=2,
                                    early_stopping=False)
m_0 = HistGradientBoostingClassifier(loss='binary_crossentropy',
                                     learning_rate=0.01,
                                     max_iter=1000,
                                     max_depth=2,
                                     early_stopping=False)
np.random.seed(1234)
dml_model_ex_irm = dml.DoubleMLIRM(df_dml3, g_0, m_0, n_folds=5, n_rep=100)
print(dml_model_ex_irm.fit())