Session 5: Economics Approaches to Machine Learning


2021 August 12


Dr. Richard M. Crowley

http://rmc.link/

Main applications

Bias, #1: Quantifying bias in wages

  • Based on the City of Chicago wage data set

Dependent Variable

  • Annual salary

Independent Variables

  • Job title
  • Department
  • Full time / part time
  • Salaried or hourly
  • Female

This is a simple test to showcase the toolchain for SHAP

Bias, #2: Political bias in hate speech classification

Dependent Variable

  • Offensive speech

Independent Variables

  • “Non-offensive” tweets from left-wing/right-wing/neutral groups
  • Non-offensive tweets from GermEval 1 & 2
  • Tweet topics

Word-level examination

From Wich, Bauer and Groh (2020 WOAH)

Causal ML: Quantifying the impact of 401(k)s on wealth

  • An illustrative implementation of using Double ML for causality
  • The key motivator for the method is the endogeneity of the treatment: simple linear controls may be insufficient to claim causality

Dependent Variable

  • Net financial assets

Independent Variables

  • Treatment: 401(k) eligibility
  • Age
  • Income
  • Family size
  • Years of education
  • Marital status
  • Two-earner status indicator
  • Defined benefit pension indicator
  • IRA participation
  • Home ownership indicator

From the web appendix of Chernozhukov et al. (2017 AER)

Introduction to Bias using SHAP

An example of quantifying bias

  • Data: City of Chicago salaries
    • 33,586 employees
  • Trained using a simple XGBoost model
  • Features:
    • Job title
    • Department
    • Full time / part time
    • Salaried or hourly
    • Female

Is there gender bias in annual compensation?

The data

vars = ['Job.Titles', 'Department', 'Full.Time', 'Salaried', 'Female']
df[vars]
##                                    Job.Titles        Department  Full.Time  \
## 0                                    SERGEANT            POLICE          1   
## 1      POLICE OFFICER (ASSIGNED AS DETECTIVE)            POLICE          1   
## 2                                       Other  GENERAL SERVICES          1   
## 3                                       Other       WATER MGMNT          1   
## 4                                       Other        TRANSPORTN          1   
## ...                                       ...               ...        ...   
## 33581                          POLICE OFFICER            POLICE          1   
## 33582                          POLICE OFFICER            POLICE          1   
## 33583                          POLICE OFFICER            POLICE          1   
## 33584                          POLICE OFFICER            POLICE          1   
## 33585                                   Other             Other          1   
## 
##        Salaried  Female  
## 0             1     0.0  
## 1             1     1.0  
## 2             1     1.0  
## 3             1     0.0  
## 4             0     0.0  
## ...         ...     ...  
## 33581         1     1.0  
## 33582         1     1.0  
## 33583         1     0.0  
## 33584         1     0.0  
## 33585         1     0.0  
## 
## [33586 rows x 5 columns]

One hot encoding categorical data

  • Pandas has a function for this, pd.get_dummies()
    • prefix= lets us name the columns of the output
  • As pd.get_dummies() outputs a new data frame only containing the new columns, we need to join them back
one_hot1 = pd.get_dummies(df['Job.Titles'], prefix='Job.Titles')
one_hot2 = pd.get_dummies(df['Department'], prefix='Department')

df = df.join(one_hot1)
df = df.join(one_hot2)

Prepping XGBoost

We did this in Session 2

vars = one_hot1.columns.tolist() + \
       one_hot2.columns.tolist() + \
       ['Full.Time', 'Salaried', 'Female']
dtrain = xgb.DMatrix(df[vars], label=df['Salary'], feature_names=vars)
param = {
    'booster': 'gbtree',             # default -- tree based
    'nthread': 8,                    # number of threads to use for parallel processing
    'objective': 'reg:squarederror', # squared error loss for regression
    'eval_metric': 'rmse',           # evaluation metric: RMSE
    'eta': 0.3,                      # shrinkage; [0, 1], default 0.3
    'max_depth': 6,                  # maximum depth of each tree; default 6
    'gamma': 0,                      # set above 0 to prune trees, [0, inf], default 0
    'min_child_weight': 1,           # higher leads to more pruning of trees, [0, inf], default 1
    'subsample': 1,                  # Randomly subsample rows if in (0, 1), default 1
}
num_round=30

Building our model and prepping SHAP

model_xgb = xgb.train(param, dtrain, num_round)
  • Since XGBoost is a tree-based model, we will use SHAP’s shap.TreeExplainer() function to analyze the model
  • Since we only have in-sample data, we will compute SHAP on the same data the XGBoost model was fit to
  • We will also prepare a small sample for more CPU-intensive analyses
explainer = shap.TreeExplainer(model_xgb)
shap_values = explainer(df[vars])

df_small = df.sample(frac=0.01)
shap_values_small = explainer(df_small[vars])

Explaining a single observation

shap.plots.waterfall(shap_values[0])

An image of a SHAP waterfall

Here we see that having Female=0 was the fourth most influential feature in the model, and that it led to a higher predicted salary

Explaining a single observation

shap.plots.waterfall(shap_values[2])

An image of a SHAP waterfall

Here we see that having Female=1 was the second most influential feature in the model, and that it led to a lower predicted salary

What exactly is SHAP?

Aims to explain the importance of model inputs in determining model output

  • Game theoretic and theory driven
  • Unifies six other methods that tried to address this problem
  • It is a model itself, a model to explain models
  • Provides a simple to understand output

SHAP: SHapley Additive exPlanations

  • Based on Shapley, 1953, “A value for n-person games.”
  • SHAP itself is from Lundberg and Lee (2017)

Principles of SHAP

  1. Local accuracy (formalized after this list)
    • The explanation model's output matches the original model's output for the observation being explained
  2. Missingness
    • SHAP only uses data the original model had access to
    • If data was missing from the original model, SHAP won’t use it
  3. Consistency
    • Akin to transitivity conditions in utility theory (Savage Axioms)
      • But instead of “utility,” we have “simplified model’s input’s contribution”
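
Local accuracy has a simple algebraic statement in Lundberg and Lee (2017): for a given observation \(x\) with \(M\) features, the individual feature attributions \(\phi_i\) sum exactly to the model's prediction,

\[ f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i \]

where \(\phi_0\) is the base value, i.e., the average model output over the data.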

Intuition of SHAP

An image explaining SHAP

  • SHAP is defined by a series of [conditional] expectation of the impact of an input
  • For linear models, order of selecting inputs has no effect
  • For nonlinear models, SHAP averages inputs’ conditional expected impact over all possible orderings (formalized below)
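
The underlying computation is the Shapley value from cooperative game theory: the attribution for input \(i\) averages its marginal contribution over every subset \(S\) of the full feature set \(F\),

\[ \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\left(|F|-|S|-1\right)!}{|F|!} \left[ f_{S \cup \{i\}}\left(x_{S \cup \{i\}}\right) - f_S\left(x_S\right) \right] \]

where \(f_S\) denotes the model's [conditional] expected output using only the features in \(S\).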

Charting with SHAP

A more concise point visualization

shap.plots.force(shap_values[1])

An image of a SHAP force plot

Aggregating across the data

N=300
shap.plots.force(explainer.expected_value, shap_values.sample(N).values, feature_names=vars)

An image of an aggregated SHAP force plot

Seeing more variables’ impact

  • A “Decision plot” uses a line chart to show the impact of more measures across the data
shap.decision_plot(
  explainer.expected_value,
  explainer.shap_values(df_small[df_small.Female==1][vars]),
  feature_names=vars)

An image of a SHAP decision plot (Female == 1)

shap.decision_plot(
  explainer.expected_value,
  explainer.shap_values(df_small[df_small.Female==0][vars]),
  feature_names=vars)

An image of a SHAP decision plot (Female == 0)

Aggregate analysis of an individual variable

  • If we want to see the full impact of “Female” on outcomes in our data, a scatter plot is useful
shap.plots.scatter(shap_values[:,"Female"], color=shap_values)

An image of a SHAP scatter plot

Remember that our model is nonparametric! Due to interaction effects, the sign of a variable's impact can differ across observations even when the variable itself doesn't change

Multiple scatterplots at once: Bee swarm

  • If you want a concise way to present multiple variables, the bee swarm plot can be useful
shap.plots.beeswarm(shap_values)

An image of a SHAP bee swarm plot

Importance plot

  • Lastly, we can replicate XGBoost’s importance plot using mean \(|\text{SHAP}|\) values
shap.plots.bar(shap_values)

An image of a SHAP importance plot

This may not be useful for XGBoost since it already has an importance metric, but many other models lack it

Addendum: Using R

  • If you are working explicitly with XGBoost, there is a great SHAPforxgboost package
  • To interface with the python shap package, you can use shapper
  • There is also shapr, though it isn’t as full-featured.

SHAP for hate speech bias

Paper background

How does political bias in data impact hate speech classification?

  • Baseline data is often a critical issue in measure construction
    • This paper takes a strong stance in demonstrating this, by using a fixed, unbiased sample of hate speech content
    • E.g., the “1” class is not impacted by any political bias, only the “0” class
  • The authors aim to show how using a politically biased non-offensive baseline can induce bias in hate speech classification models

From Wich, Bauer and Groh (2020 WOAH)

The models

  • The authors construct hate speech detection models using a combination of four corpuses
    1. A baseline model from GermEval 1 & 2 that is politically neutral
    2. A set of politically left-wing tweets
    3. A set of politically right-wing tweets
  • In order to see the effect of political leaning on the model, they also run the model on mixtures of corpuses that are 1/3 or 2/3 neutral, with the remaining text from one of the non-neutral corpuses

Left-wing text induces a statistically significant divergence in model performance when more than 2/3 of the text is left-wing

Right-wing text induces a statistically significant divergence in model performance when only 1/3 of the text is right-wing

Applying SHAP to the models

  • Same workflow as we did, except tailored for a neural network
    • Just replace the TreeExplainer with DeepExplainer
  • Conceptually, SHAP will behave the same across any nonlinear model
  • Since their data is word-level, the features fed to SHAP will be one hot encoded vectors of words

SHAP will weight the extent to which a word indicates the presence of hate speech, in [conditional] expectation
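
As a rough sketch of that workflow (all names here are hypothetical placeholders, not the authors' code: model is a trained Keras classifier and X_train/X_test are one hot encoded word matrices):

import numpy as np
import shap

# Background sample used by DeepExplainer to integrate out absent features
background = X_train[np.random.choice(X_train.shape[0], 100, replace=False)]

# Same workflow as shap.TreeExplainer, but for a neural network
explainer = shap.DeepExplainer(model, background)

# Word-level SHAP values for a few held-out tweets; each value weights how much
# a word pushes the prediction toward the offensive class
shap_values = explainer.shap_values(X_test[:10])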

Examples of bias with SHAP

Figure 3 of Wich et al.

  1. @user @user Of course, all do-gooders say “yes,” because they know that it won’t happen.
    • Tagged categorization: Offensive
  2. If the people had the right to elect the chancellor directly, Merkel would have been history a long time ago.
    • Tagged categorization: Not offensive

What else could this paper have done?

  1. Leverage the topic model to show if bias is generally pervasive when using biased corpuses
    • Or perhaps bias creeps in only in certain contexts
    • How? Examine SHAP at a per-topic level
  2. Quantify the extent of bias
    • They already quantified the impact on model accuracy, but inaccuracy doesn’t directly imply bias
    • How? Examine SHAP at the corpus level

Double ML: Theory

Background

  • There are a number of relevant papers published in economics in recent years developing and using Double ML
  • The method is developed largely from:
    • Chernozhukov et al. (2017 AER), “Double/debiased/Neyman machine learning of treatment effects”
    • Chernozhukov et al. (2018 Econometrics J), “Double/debiased machine learning for treatment and structural parameters.”

Impact or overlap with methodological work by Susan Athey, Matthew Gentzkow, Trevor Hastie, Guido Imbens, Matt Taddy, and Stefan Wager

What is Double ML?

  1. Split your sample as you would for \(K\)-fold cross validation, into sets \(\{I_k\}_{k\in \{1,...,K\}}\)
    • \(K\) samples of \(N/K\) observations each
    • Let \(I_k^c = \cup \{I_j\}_{j\ne k}\)
  2. Construct \(K\) estimators using a machine learning estimator over nuisance parameters (e.g., controls) applied to the data \(I_k^c\)
  3. Average the \(K\) estimators to obtain a final estimator
    • This average estimator is approximately unbiased and normally distributed
    • The estimator is also asymptotically efficient

And repeat: bootstrap the sample splitting out and take the mean or median of the resulting estimators. A minimal sketch of steps 1 through 3 follows below.
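
The sketch below illustrates the cross-fitting logic for the partially linear model described later in this session, assuming numpy arrays y, d, and X and using random forests as the nuisance estimator; it is purely illustrative, not the DoubleML package's implementation:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor

def dml1_plr(y, d, X, K=5, seed=1234):
    thetas = []
    for train, test in KFold(K, shuffle=True, random_state=seed).split(X):
        # Step 2: fit the nuisance functions on the complement sample I_k^c
        l_hat = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
        m_hat = RandomForestRegressor(random_state=seed).fit(X[train], d[train])
        # Residualize the held-out fold I_k (partialling-out, a la Frisch-Waugh)
        u = y[test] - l_hat.predict(X[test])  # outcome residual
        v = d[test] - m_hat.predict(X[test])  # treatment residual
        thetas.append(np.dot(v, u) / np.dot(v, v))
    # Step 3: average the K fold-level estimators
    return np.mean(thetas)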

Where Double ML excels: Endogenous treatment

  • Suppose a policy affects a subset of individuals (people, corporations, etc.)
  • Suppose individuals have the ability to alter their treatment status
    • E.g., state laws (individuals can move between states), labor laws, etc.
  • Linear controls may be insufficient to claim causality of the treatment on anything

There are a lot of older methods that try to address this, though incompletely

  1. Linear controls
  2. Propensity score adjustments (e.g., weighting)
  3. Matching methods
  4. “doubly-robust” estimators

Why is machine learning needed?

  • Suppose the true form of the specification is as follows
    • \(T\) is a treatment indicator, \(C\) is a vector of controls

\[ \begin{aligned} Y &= g_0(T, C) + \varepsilon_1\\ T &= m_0(C) + \varepsilon_2 \end{aligned} \]

  • We often assume \(g_0\) to be something like \(\alpha + \theta_0\ T + \gamma \cdot C\)
  • We often assume \(m_0\) to be a constant (i.e., assume that \(T\) is exogenous)

We know these assumptions aren’t true!

Why is machine learning needed?

How can we estimate a more general form for \(g_0\) and \(m_0\)?

  • We could use a more flexible econometric approach, such as including interactions between \(T\) and \(C\)
    • This is still very restrictive – purely linear
  • We could include transformations of \(C\) and its interactions
    • This is still restrictive – \(T\) is additively separable
  • We could use a nonparametric estimator!
    • This is where machine learning is very useful: efficient and reasonably accurate nonparametric estimation
      • LASSO, random forest, XGBoost, etc.

Model variants

  • The models described in the last few slides are referred to as the “Interactive regression model” or IRM
  • If you can separate your treatment effect from the controls but suspect nonlinear effects of controls, the “Partially linear regression model” or PLR is appropriate
    • Solves \(Y = \theta_0 T + g_0(C) + \varepsilon_1\) and \(T = m_0(C) + \varepsilon_2\) (sketched in code below)
  • There are also instrumental variable variants of both IRM and PLR
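
As a sketch, fitting the PLR variant with the DoubleML library looks like the following (assuming a DoubleMLData object df_dml as constructed later in these slides; the random forest learners are hypothetical choices):

import doubleml as dml
from sklearn.ensemble import RandomForestRegressor

ml_g = RandomForestRegressor()  # learner for g_0(C) in the outcome equation
ml_m = RandomForestRegressor()  # learner for m_0(C) in the treatment equation

dml_model_plr = dml.DoubleMLPLR(df_dml, ml_g, ml_m)
print(dml_model_plr.fit())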

Reconciling these slides’ notation with the papers

  • These slides use a somewhat simpler/accounting-oriented notation.
  • Reconciliation from slides to papers:
    • \(T\) is \(D\)
    • \(C\) is \(X\)
    • \(\varepsilon_1\) is \(U\) or \(\zeta\) depending on the paper
    • \(\varepsilon_2\) is \(V\)

Implementing DoubleML

Walking through an implementation of DoubleML

Problem: How does 401k participation impact wealth?

  • This problem is walked through in Chernozhukov et al. (2017 AER, Web Appendix)
    • The R code for the AER paper is available from AER as well
      • Quite clean code at that!
  • We will implement this in python using the DoubleML library
    • Chernozhukov was involved in its development

Importing the data

  • Conveniently, the data is available from the DoubleML package
# Grab the dataset
import doubleml as dml
import doubleml.datasets
df = dml.datasets.fetch_401K('DataFrame')
df
##          nifa  net_tfa        tw  age      inc  fsize  educ  db  marr  \
## 0         0.0      0.0    4500.0   47   6765.0      2     8   0     0   
## 1      6215.0   1015.0   22390.0   36  28452.0      1    16   0     0   
## 2         0.0  -2000.0   -2000.0   37   3300.0      6    12   1     0   
## 3     15000.0  15000.0  155000.0   58  52590.0      2    16   0     1   
## 4         0.0      0.0   58000.0   32  21804.0      1    11   0     0   
## ...       ...      ...       ...  ...      ...    ...   ...  ..   ...   
## 9910  98498.0  98858.0  157858.0   52  73920.0      1    16   1     0   
## 9911    287.0   6230.0   15730.0   41  42927.0      4    14   0     1   
## 9912     99.0   6099.0    7406.0   40  23619.0      2    16   1     0   
## 9913      0.0    -32.0    2468.0   47  14280.0      4     6   1     0   
## 9914   4000.0   5000.0    8857.0   33  11112.0      1    14   0     0   
## 
##       twoearn  e401  p401  pira  hown  
## 0           0     0     0     0     1  
## 1           0     0     0     0     1  
## 2           0     0     0     0     0  
## 3           1     0     0     0     1  
## 4           0     0     0     0     1  
## ...       ...   ...   ...   ...   ...  
## 9910        0     1     1     0     1  
## 9911        1     1     1     1     1  
## 9912        0     1     0     1     0  
## 9913        0     1     1     0     0  
## 9914        0     1     1     0     0  
## 
## [9915 rows x 14 columns]

Using your own data

  • We can also do this manually, by importing the Stata file from AER
  • We then need to prep the data into the format DoubleML expects
    • This is fairly straightforward: just define our Y, treatment, and control variables
df = pd.read_stata('../../Data/S5_sipp1991.dta')

y = 'net_tfa'
treat = 'e401'
controls = [x for x in df.columns.tolist() if x not in [y, treat]]

df_dml = dml.DoubleMLData(df, y_col=y, d_cols=treat, x_cols=controls)

What is the data format used by DoubleML?

print(df_dml)
## ================== DoubleMLData Object ==================
## 
## ------------------ Data summary      ------------------
## Outcome variable: net_tfa
## Treatment variable(s): ['e401']
## Covariates: ['nifa', 'tw', 'age', 'inc', 'fsize', 'educ', 'db', 'marr', 'twoearn', 'p401', 'pira', 'hown']
## Instrument variable(s): None
## No. Observations: 9915
## 
## ------------------ DataFrame info    ------------------
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 9915 entries, 0 to 9914
## Columns: 14 entries, nifa to hown
## dtypes: float32(4), int8(10)
## memory usage: 329.2 KB
  • Pandas dataframe
  • A pre-specified outcome variable
  • One or more treatment indicators
  • One or more controls
  • Optional instruments

Set up the Nuisance functions

  • Recall that there are two functions, \(m_0\) and \(g_0\), that need to be estimated for this method
  • We can specify any form for these that we want, so long as they are consistent with Scikit-learn

\(g_0\): Continuous GBM

from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

g_0 = GradientBoostingRegressor(
  loss='ls',
  learning_rate=0.01,
  n_estimators=1000,
  subsample=0.5,
  max_depth=2
  )

\(m_0\): Binary GBM

m_0 = GradientBoostingClassifier(
  loss='exponential',
  learning_rate=0.01,
  n_estimators=1000,
  subsample=0.5,
  max_depth=2
  )

Run the DML model: Average Treatment Effects

# Fix the random number generator for replicability
np.random.seed(1234)
# Run the model
dml_model_irm = dml.DoubleMLIRM(df_dml, g_0, m_0)
# Output the model's findings
print(dml_model_irm.fit())
## ================== DoubleMLIRM Object ==================
## 
## ------------------ Data summary      ------------------
## Outcome variable: net_tfa
## Treatment variable(s): ['e401']
## Covariates: ['nifa', 'tw', 'age', 'inc', 'fsize', 'educ', 'db', 'marr', 'twoearn', 'p401', 'pira', 'hown']
## Instrument variable(s): None
## No. Observations: 9915
## 
## ------------------ Score & algorithm ------------------
## Score function: ATE
## DML algorithm: dml2
## 
## ------------------ Resampling        ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: True
## 
## ------------------ Fit summary       ------------------
##             coef     std err         t         P>|t|        2.5 %       97.5 %
## e401  3320.43343  383.604082  8.655887  4.890947e-18  2568.583245  4072.283614

Run the DML model: ATTE

  • ATTE: Average Treatment Effects of the Treated
# Run the model
dml_model_irm_ATTE = dml.DoubleMLIRM(df_dml, g_0, m_0, score='ATTE')
# Output the model's findings
print(dml_model_irm_ATTE.fit())
## ================== DoubleMLIRM Object ==================
## 
## ------------------ Data summary      ------------------
## Outcome variable: net_tfa
## Treatment variable(s): ['e401']
## Covariates: ['nifa', 'tw', 'age', 'inc', 'fsize', 'educ', 'db', 'marr', 'twoearn', 'p401', 'pira', 'hown']
## Instrument variable(s): None
## No. Observations: 9915
## 
## ------------------ Score & algorithm ------------------
## Score function: ATTE
## DML algorithm: dml2
## 
## ------------------ Resampling        ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: True
## 
## ------------------ Fit summary       ------------------
##               coef     std err          t          P>|t|        2.5 %         97.5 %  
## e401  10081.312662  392.074708  25.712734  8.421563e-146  9312.860354   10849.764969

Other twists on the model

  1. Change the DML algorithm
    • Our models used dml2
    • You can switch to dml1 using dml_procedure='dml1'
    • dml1 follows the math in these slides
      • Solve for a condition equal to zero for each fold's estimator, and then average the estimators
      • dml2 instead solves for the average of the condition being equal to zero overall
  2. Run multiple iterations of the model
    • The paper uses 100 iterations; emulate this by adding n_rep=100 (see the sketch after this list)
  3. Change the machine learning models fed to the DoubleML model
    • An example of using “Histogram-based Gradient Boosting” is in the Jupyter notebook
      • This is a much faster GBM-like model
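
A quick sketch combining the first two twists, reusing the data and nuisance learners from earlier (dml_procedure and n_rep are keyword arguments in the DoubleML version used for these slides):

# Fix the random number generator for replicability
np.random.seed(1234)
# DML1 algorithm with 100 repeated sample splits
dml_model_alt = dml.DoubleMLIRM(df_dml, g_0, m_0, dml_procedure='dml1', n_rep=100)
print(dml_model_alt.fit())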

Addendum: Using R

  • The DoubleML package is available in R as well

Conclusion

Wrap-up

SHAP

  • A flexible model for understanding how other models use data
  • Many visualization tools for different situations:
    • Individual observations
    • Individual measures
    • Aggregations over all observations
    • Importance plots

DoubleML

  • Leveraging the nonparametric nature of ML models to improve causal inference
  • Easy to gauge ATE and ATTE
  • Extendable to instrumental variable problems

Packages used for these slides

Python

  • doubleml
  • numpy
  • pandas
  • scikit-learn
  • shap
  • xgboost

R

  • kableExtra
  • knitr
  • reticulate
  • revealjs

References

  • Blevins, Cameron, and Lincoln Mullen. “Jane, John… Leslie? A Historical Method for Algorithmic Gender Prediction.” DHQ: Digital Humanities Quarterly 9, no. 3 (2015).
  • Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. “Double/debiased/Neyman machine learning of treatment effects.” American Economic Review 107, no. 5 (2017): 261-65.
  • Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. “Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal 21, no. 1 (2018): C1-C68.
  • Lundberg, Scott, and Su-In Lee. “A unified approach to interpreting model predictions.” arXiv preprint, arXiv:1705.07874 (2017).
  • Shapley, Lloyd. “A value for n-person Games.” Ann. Math. Study 28, Contributions to the Theory of Games, ed. by HW Kuhn, and AW Tucker (1953): 307-317.
  • Wich, Maximilian, Jan Bauer, and Georg Groh. “Impact of politically biased data on hate speech classification.” In Proceedings of the Fourth Workshop on Online Abuse and Harms, pp. 54-64. 2020.

Custom code

# Fully worked out DoubleML model using Histogram-based GBM

import numpy as np
import pandas as pd
import doubleml as dml
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor

# set up the data
df = pd.read_stata('../Data/S5_sipp1991.dta')

y = 'net_tfa'
treat = 'e401'
controls = [x for x in df.columns.tolist() if x not in [y, treat]]

df_dml3 = dml.DoubleMLData(df, y_col=y, d_cols=treat, x_cols=controls)

# set up the nonparametric nuisance functions
g_0 = HistGradientBoostingRegressor(loss='least_squares',
                                    learning_rate=0.01,
                                    max_iter=1000,
                                    max_depth=2,
                                    early_stopping=False
                                   )
m_0 = HistGradientBoostingClassifier(loss='binary_crossentropy',
                                     learning_rate=0.01,
                                     max_iter=1000,
                                     max_depth=2,
                                     early_stopping=False
                                    )


np.random.seed(1234)
dml_model_ex_irm = dml.DoubleMLIRM(df_dml3, g_0, m_0, n_folds=5, n_rep=100)
print(dml_model_ex_irm.fit())