ACCT 420: Machine Learning and AI


Session 9


Dr. Richard M. Crowley

Front matter

Learning objectives

  • Theory:
    • Ensembling
    • Ethics
  • Application:
    • Varied
  • Methodology:
    • Any

Ensembles

What are ensembles?

  • Ensembles are models made out of models
  • Ex.: You train 3 models using different techniques, and each seems to work well in certain cases and poorly in others
    • If you use the models in isolation, then any of them would do an OK (but not great) job
    • If you make a model using all three, you can get better performance if their strengths all shine through
  • Ensembles range from simple to complex
    • Simple: a (weighted) average of a few models’ predictions

When are ensembles useful?

  1. You have multiple models that are all decent, but none are great
    • And, ideally, the models’ predictions are not highly correlated
  2. You have a really good model and a bunch of mediocre models
    • And, ideally, the mediocre models are not highly correlated
  3. You really need to get just a bit more accuracy/less error out of the model, and you have some other models lying around
  4. You want a more stable model
    • Ensembling helps to stabilize predictions by limiting the effect that errors or outliers produced by any one model can have on the prediction
    • Think: Diversification (like in finance)
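
The diversification intuition can be checked with a small simulation. This is a minimal sketch on made-up data, assuming three hypothetical models whose errors are independent and equally noisy; none of the names below come from the course data.

```r
set.seed(420)  # arbitrary seed, for reproducibility only

n <- 10000
truth <- rnorm(n)

# Three hypothetical models: each is the truth plus its own independent noise
pred_a <- truth + rnorm(n, sd = 1)
pred_b <- truth + rnorm(n, sd = 1)
pred_c <- truth + rnorm(n, sd = 1)

# Simple ensemble: equal-weight average of the three predictions
pred_ens <- (pred_a + pred_b + pred_c) / 3

rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

errors <- c(A = rmse(pred_a, truth),
            B = rmse(pred_b, truth),
            C = rmse(pred_c, truth),
            Ensemble = rmse(pred_ens, truth))
print(round(errors, 3))
```

With independent errors, the ensemble's RMSE should land near 1/sqrt(3) of a single model's. The more correlated the models' errors are, the smaller this gain becomes.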

A simple ensemble (averaging)

  • For continuous predictions, simple averaging is viable
    • Often you may want to weight the best model a bit higher
  • For binary or categorical predictions, consider averaging ranks
    • i.e., instead of using a probability from a logit, use ranks 1, 2, 3, etc.
    • Ranks average a bit better, as scores from binary models (particularly models evaluated with measures like AUC) can have extremely different variances across models
      • In that case, an average of the raw scores is really just the most volatile model’s prediction…
        • Not much of an ensemble
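
A minimal sketch of why rank averaging can help, using two made-up score vectors (`score_tight` and `score_wide` are hypothetical, not models from this course). The weights `w` are an assumption; equal weights are used here.

```r
set.seed(420)  # arbitrary seed
n <- 200

# Two hypothetical models scoring the same n observations:
# one with tightly clustered scores, one with widely spread scores
score_tight <- plogis(rnorm(n, mean = -3, sd = 0.2))
score_wide  <- plogis(rnorm(n, mean = -3, sd = 2))

w <- c(0.5, 0.5)  # assumed ensemble weights

# Option 1: average the raw scores
avg_prob <- w[1] * score_tight + w[2] * score_wide

# Option 2: convert each model's scores to ranks scaled to (0, 1], then average
to_rank <- function(x) rank(x, ties.method = "average") / length(x)
rank_tight <- to_rank(score_tight)
rank_wide  <- to_rank(score_wide)
avg_rank <- w[1] * rank_tight + w[2] * rank_wide

# How closely does each ensemble track each input model?
cor_prob <- c(tight = cor(avg_prob, score_tight), wide = cor(avg_prob, score_wide))
cor_rank <- c(tight = cor(avg_rank, rank_tight),  wide = cor(avg_rank, rank_wide))
print(round(rbind(cor_prob, cor_rank), 3))
```

The raw-score average should track the wide-variance model almost exactly, while the rank average gives both models an equal say.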

A more complex ensemble (voting model)

  • If you have a model that is very good at predicting a binary outcome, ensembling can still help
    • This is particularly true when you have other models that capture different aspects of the problem
  • Let the other models vote against the best model, and use their prediction if they are above some threshold of agreement
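
A minimal sketch of such a voting rule on made-up predictions. The classification cutoff of 0.5 and the 75% agreement threshold are assumptions chosen for illustration; in practice both would be tuned.

```r
# Hypothetical predicted probabilities for five observations
pred_best <- c(0.80, 0.30, 0.60, 0.20, 0.55)           # the strongest model
pred_others <- cbind(m1 = c(0.70, 0.65, 0.20, 0.10, 0.30),
                     m2 = c(0.90, 0.70, 0.30, 0.40, 0.20),
                     m3 = c(0.60, 0.80, 0.10, 0.30, 0.70))

cutoff <- 0.5           # assumed cutoff for turning probabilities into classes
vote_threshold <- 0.75  # assumed share of other models needed to override

best_class  <- as.integer(pred_best > cutoff)
other_class <- (pred_others > cutoff) * 1L

# Share of the other models that disagree with the best model, per observation
share_against <- rowMeans(other_class != best_class)

# Flip the best model's call only when enough of the others vote against it
final_class <- ifelse(share_against >= vote_threshold, 1L - best_class, best_class)

print(data.frame(best_class, share_against, final_class))
```

Here observations 2 and 3 are overridden (all three other models disagree), while observation 5 is not (only two of three disagree).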

A lot more complex ensemble

  • Stacking models (2 layers)
    1. Train models on subsets of the training data and apply each to the data it didn’t see
    2. Train models across the full training data (like normal)
    3. Train a new model on the predictions from all the other models
  • Blending (similar to stacking)
    • Like stacking, but the first layer is only on a small sample of the training data, instead of across all partitions of the training data
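
The three stacking steps can be sketched in base R as follows. This is an illustrative sketch on simulated data, assuming two simple logit specifications as the first-layer models and a logit as the second layer; the models, fold count, and variable names are all hypothetical.

```r
set.seed(420)  # arbitrary seed
n <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x1 + x2^2 - 0.5 * x1 * x2))
dat <- data.frame(y, x1, x2)

# Two hypothetical first-layer models, each capturing a different aspect
base_fits <- list(
  a = function(d) glm(y ~ x1 + x2, data = d, family = binomial),
  b = function(d) glm(y ~ I(x2^2) + x1:x2, data = d, family = binomial)
)

# Step 1: train on k-1 folds and predict the held-out fold (out-of-fold predictions)
k <- 5
fold <- sample(rep(1:k, length.out = n))
oof <- matrix(NA_real_, n, length(base_fits),
              dimnames = list(NULL, names(base_fits)))
for (f in 1:k) {
  held_out <- fold == f
  for (m in names(base_fits)) {
    fit <- base_fits[[m]](dat[!held_out, ])
    oof[held_out, m] <- predict(fit, newdata = dat[held_out, ], type = "response")
  }
}

# Step 2: train each first-layer model on the full training data
full_fits <- lapply(base_fits, function(f) f(dat))

# Step 3: train the second-layer model on the out-of-fold predictions
meta_fit <- glm(y ~ a + b, data = data.frame(y = dat$y, oof), family = binomial)

# Scoring new data: full first-layer fits feed the second-layer model
new_dat <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
layer1 <- as.data.frame(lapply(full_fits, predict,
                               newdata = new_dat, type = "response"))
stacked_pred <- predict(meta_fit, newdata = layer1, type = "response")
print(round(stacked_pred, 3))
```

Using out-of-fold predictions in step 3 keeps the second layer from learning on predictions the first layer made for data it had already seen.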

Simple ensemble example

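library(dplyr)  # provides %>%, select(), and slice()

df <- readRDS('../../Data/Session_6_models.rds')
# html_df() is a table-formatting helper that is not defined in this section
head(df) %>% select(-pred_F, -pred_S) %>% slice(1:2) %>% html_df()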
Test  AAER  pred_FS    pred_BCE   pred_lmin  pred_l1se  pred_xgb
0     0     0.0395418  0.0661011  0.0301550  0.0296152  0.0478672
0     0     0.0173693  0.0344585  0.0328011  0.0309861  0.0616048
library(xgboost)

# Prep data
train_x <- model.matrix(AAER ~ ., data=df[df$Test==0,-1])[,-1]
train_y <- model.frame(AAER ~ ., data=df[df$Test==0,])[,"AAER"]
test_x <- model.matrix(AAER ~ ., data=df[df$Test==1,-1])[,-1]
test_y <- model.frame(AAER ~ ., data=df[df$Test==1,])[,"AAER"]

set.seed(468435)  #for reproducibility
xgbCV <- xgb.cv(max_depth=5, eta=0.10, gamma=5, min_child_weight = 4,
                subsample = 0.57, objective = "binary:logistic", data=train_x,
                label=train_y, nrounds=100, eval_metric="auc", nfold=10,
                stratified=TRUE, verbose=0)

fit_ens <- xgboost(params=xgbCV$params, data = train_x, label = train_y,
                   nrounds = which.max(xgbCV$evaluation_log$test_auc_mean),
                   verbose = 0)

Simple ensemble results

aucs  # Out of sample
##           Ensemble        Logit (BCE) Lasso (lambda.min) 
##          0.8271003          0.7599594          0.7290185 
##            XGBoost 
##          0.8083503
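
The code that builds `aucs` is not shown in these slides. As a rough illustration of what an out-of-sample AUC measures, here is a minimal base R sketch using the rank-sum (Mann-Whitney) formula; `auc_rank` is a hypothetical helper, not the function used to produce the numbers above.

```r
# AUC as the probability that a randomly chosen positive case
# gets a higher score than a randomly chosen negative case
auc_rank <- function(score, label) {
  r <- rank(score)             # average ranks, so ties count as half
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy check: 3 of the 4 positive-negative pairs are ordered correctly
label <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)
toy_auc <- auc_rank(score, label)
print(toy_auc)  # 0.75
```

Applied to the data above, one would pass a prediction column (for example `pred_xgb`) and `AAER` for the rows where `Test == 1`.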

What drives the ensemble?

# Wrap the training data so the feature names can be recovered
xgb.train.data <- xgb.DMatrix(train_x, label = train_y, missing = NA)
col_names <- attr(xgb.train.data, ".Dimnames")[[2]]
imp <- xgb.importance(col_names, fit_ens)
# Variable importance
xgb.plot.importance(imp)

Practicalities

  • Methods like stacking or blending are much more complex than a simple averaging or voting based ensemble
    • But in practice they perform slightly better

Recall the tradeoff between complexity and accuracy!

  • As such, we may not prefer the complex ensemble in practice, unless we only care about accuracy

Example: In 2009, Netflix awarded a $1M prize to the BellKor’s Pragmatic Chaos team for beating Netflix’s own user preference algorithm by >10%. The algorithm was so complex that Netflix never used it. It instead used a simpler algorithm with an 8% improvement.

[Geoff Hinton’s] Dark knowledge

  • Complex ensembles work well
  • Complex ensembles are exceedingly computationally intensive
    • This is bad for running on small or constrained devices (like phones)

  • We can (almost) always create a simple model that approximates the complex model
    • Interpret the above literally – we can train a model to fit the model

  • Train the simple model not on the actual DV from the training data, but on the best algorithm’s (softened) prediction for the training data
  • Somewhat surprisingly, this new, simple algorithm can work almost as well as the full thing!
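
A minimal sketch of the idea on simulated data. Everything here is an assumption for illustration: the "complex" model is just a flexible logit standing in for a real ensemble, the temperature `temp` used to soften its predictions is arbitrary, and the "simple" model is a plain logit.

```r
set.seed(420)  # arbitrary seed
n <- 2000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x1 - x2 + 0.5 * x1 * x2))
dat <- data.frame(y, x1, x2)

# "Complex" model: a stand-in for a large ensemble (here, a flexible logit)
teacher <- glm(y ~ poly(x1, 3) * poly(x2, 3), data = dat, family = binomial)

# Soften the complex model's predictions: divide its logits by a temperature > 1
temp <- 2  # assumed temperature
teacher_logit <- predict(teacher, type = "link")
dat$soft_target <- plogis(teacher_logit / temp)

# "Simple" model: fit to the softened predictions, not to the actual DV y
# (quasibinomial allows a fractional response between 0 and 1)
student <- glm(soft_target ~ x1 + x2, data = dat, family = quasibinomial)

# Undo the temperature when predicting, then compare to the complex model
student_prob <- plogis(predict(student, type = "link") * temp)
teacher_prob <- plogis(teacher_logit)
agreement <- cor(student_prob, teacher_prob)
print(round(agreement, 3))
```

A high correlation indicates the simple model reproduces most of what the complex model learned, at a fraction of the parameter count.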

An example of this dark knowledge

  • Google’s full model for interpreting human speech is >100GB
    • As of October 2019
  • In Google’s Pixel 4 phone, they have human speech interpretation running locally on the phone
    • Not in the cloud, as it works on other Android phones

How did they do this?

  • They can approximate the output of the complex speech model using a 0.5GB model
  • 0.5GB isn’t small, but it’s small enough to run on a phone

Learning more about Ensembling

There are many ways to ensemble, and there is no specific guide as to what is best. It may prove useful in the group project, however.

Ethics: Fairness

In class reading with case

  • From DataRobot’s Colin Priest:
  • The four points:
    1. Data can be correlated with features that are illegal to use
    2. Check for features that could lead to ethical or reputational problems
    3. “An AI only knows what it is taught”
    4. Entrenched bias in data can lead to biased algorithms

What was the issue here? Where might similar issues crop up in business?

Examples of Reputational damage

Fairness is complex!

  • There are many different (and disparate) definitions of fairness
  • For instance, in the court system example:
    • If an algorithm is equally accurate across groups (in the sense of positive predictive value), but base rates of the outcome differ across groups, then true positive and false positive rates cannot both be equal across groups!
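
A worked example of this tension, reading "accuracy" as positive predictive value (PPV), which is an assumption about the intended meaning. The identity FPR = p/(1-p) × (1-PPV)/PPV × TPR follows from the definitions of the three rates, where p is the group's base rate. The two groups and their numbers are hypothetical.

```r
# False positive rate implied by a group's base rate, PPV, and true positive rate
implied_fpr <- function(base_rate, ppv, tpr) {
  base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr
}

# Two hypothetical groups: identical PPV and TPR, different base rates
groups <- data.frame(group = c("A", "B"),
                     base_rate = c(0.3, 0.5),
                     ppv = 0.6,
                     tpr = 0.6)
groups$fpr <- with(groups, implied_fpr(base_rate, ppv, tpr))
print(groups)
```

Holding PPV and the true positive rate equal forces the false positive rate to be about 0.17 for group A and 0.40 for group B, so the higher-base-rate group bears more false positives.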

Fairness requires considering different perspectives and identifying which of them matter most ethically

How could the previous examples be avoided?

  • Filtering data used for learning the algorithms
    • Microsoft should have been more careful about the language used to retrain Tay over time
      • Particularly given that the AI was trained on public information on Twitter, where coordination against it would be simple
  • Filtering output of the algorithms
    • Coca-Cola could check the text for content that is likely racist, classist, sexist, etc.
  • Google may have been able to avoid this by using a training dataset that was sensitive to potential problems
    • For instance, using a balanced data set across races
    • As an interim measure, they removed searching for gorillas and its associated label from the app

Examining ethics for algorithms

  1. Understanding the problem and its impact
    • Think about the effects the algorithm will have!
      • Will it drastically affect lives? If yes, exercise more care!
    • Think about what you might expect to go wrong
      • What biases might you expect?
      • What biases might be in the data?
      • What biases do people doing the same task exhibit?
  2. Manual inspection
    • Check model outputs against problematic indicators
    • Test the algorithm before putting it into production
  3. Use methods like SHAP to explain models
  4. Use purpose-built tools

Areas where ethics is particularly important

  • Anything that impacts people’s livelihoods
    • Legal systems
    • Healthcare and insurance systems
    • Hiring and HR systems
    • Finance systems like credit scoring
    • Education
  • Anything where failure is catastrophic

A good article of examples of the above: Algorithms are great and all, but they can also ruin lives

Research on fairness

Compares a variety of unintended associations (top) and intended associations (bottom) across Global Vectors (GloVe) and Universal Sentence Encoder (USE) embeddings

Ethical implications: Case

  • In Chicago, IL, USA, police are using a system to rank arrested individuals, and they use that rank for proactive policing
  • Read about the system here: rmc.link/420class9-2

What risks does such a system pose?

How would you feel if a similar system was implemented in Singapore?

Other references

Ethics: Data security

A motivating example

[Withheld from all public copies]

Anonymized data

  • Generally we anonymize data because, while the data itself is broadly useful, providing full information could harm others or oneself
  • Examples:
    • Studying criminal behavior can create a list of people with potentially uncaught criminal offenses
      • If one retains a list of identities, then there is an ethical dilemma:
        • Protect study participants by withholding the list
        • Provide the list to the government
          • This harms future knowledge generation by sowing distrust
      • Solution: Anonymous by design
    • Website or app user behavior data

What could go wrong if the Uber data wasn’t anonymized?

Anonymization is tricky

Both Allman & Paxson, and Partridge warn against relying on the anonymisation of data since deanonymisation techniques are often surprisingly powerful. Robust anonymisation of data is difficult, particularly when it has high dimensionality, as the anonymisation is likely to lead to an unacceptable level of data loss [3]. – TPHCB 2017

  • There are natural limits to anonymization, particularly when there is a limited number of potential participants in the data
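
One rough way to see these limits is to count how many records share each combination of seemingly harmless attributes. The sketch below uses a made-up table; the column names and the choice of quasi-identifiers are assumptions for illustration.

```r
# Hypothetical "anonymized" records: names removed, other attributes kept
records <- data.frame(
  postal_area = c("A", "A", "A", "B", "B", "C"),
  birth_year  = c(1990, 1990, 1985, 1990, 1990, 1972),
  gender      = c("F", "F", "M", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Attributes that an outsider could plausibly match against other data
quasi_ids <- c("postal_area", "birth_year", "gender")

# For each record, count how many records share its attribute combination
key <- interaction(records[quasi_ids], drop = TRUE)
records$group_size <- ave(rep(1L, nrow(records)), key, FUN = sum)

# The smallest group size is the dataset's k in "k-anonymity" terms
k_anonymity <- min(records$group_size)
print(records)
print(k_anonymity)
```

Any record with a group size of 1 is unique on these attributes alone, so removing the name did little to hide that person. With few participants or many columns, such unique combinations become the norm.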

Responsibilities generating data

  • Keep users as unidentifiable as feasible
  • If you need to record people’s private information, make sure they know
    • This is called informed consent
  • If you are recording sensitive information, consider not keeping identities at all
    • Create a new, unique identifier (if needed)
    • Maintain as little identifying information as necessary
    • Consider using encryption if sensitive data is retained
    • Otherwise, the data can unintentionally lead to infringements of human rights if it is used in unintended ways
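
A minimal sketch of replacing identities with a new, unique identifier. The data, column names, and ID format are made up for illustration, and this alone does not guarantee anonymity.

```r
set.seed(420)  # arbitrary seed

# Hypothetical raw data containing direct identifiers
raw <- data.frame(
  name     = c("Alice Tan", "Bob Lim", "Alice Tan", "Chen Wei"),
  email    = c("alice@example.com", "bob@example.com",
               "alice@example.com", "chen@example.com"),
  response = c(3, 5, 4, 2),
  stringsAsFactors = FALSE
)

# Assign each distinct person a random ID that carries no identifying information
people <- unique(raw$name)
id_map <- setNames(sample(seq_along(people)), people)
raw$participant_id <- sprintf("P%03d", id_map[as.character(raw$name)])

# Keep only what the analysis needs; drop the direct identifiers
analysis_data <- raw[, c("participant_id", "response")]

# The mapping links IDs back to people: discard it if it is not needed,
# or store it separately and encrypted if it is
rm(id_map)

print(analysis_data)
```

The same person keeps the same ID across rows, so repeated observations can still be linked for analysis without storing who they are.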

Human rights

  • Recall the drug users example
    • If data was collected without their consent, and if it was not anonymized perfectly, then this could lead to leaking of drug users’ information to others

What risks does this pose? Consider contexts outside Singapore as well.

A novel approach using a GAN

How the GAN works

  • There are 2 general components to a GAN:
    1. A generative network: Its goal is to generate something from some data
      • In this case, it generates a dataset with good predictability for the “predictor” network AND bad predictability for the “adversarial” network
    2. An adversarial network: Its goal is to thwart the generative network
      • In this case, it tries to determine the writers’ identities

By iterating repeatedly, the generative network can find a strategy that can generally circumvent the adversarial network
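
One way to write down this tug-of-war is as a single objective. The notation below is an assumption made for illustration, not the formulation from the underlying paper: G is the generative network, P the predictor, A the adversarial network, x the raw data, y the outcome to predict, s the writer's identity, and λ a hypothetical weight trading off the two goals.

```latex
\min_{G,\,P}\ \max_{A}\quad
  \mathcal{L}_{\text{pred}}\bigl(P(G(x)),\, y\bigr)
  \;-\; \lambda\, \mathcal{L}_{\text{adv}}\bigl(A(G(x)),\, s\bigr)
```

The adversarial network tries to make its identification loss small, while the generative network is rewarded both when the predictor's loss is small and when the adversarial network's loss is large.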

Responsibilities using data

“The collection, or use, of a dataset of illicit origin to support research can be advantageous. For example, legitimate access to data may not be possible, or the reuse of data of illicit origin is likely to require fewer resources than collecting data again from scratch. In addition, the sharing and reuse of existing datasets aids reproducibility, an important scientific goal. The disadvantage is that ethical and legal questions may arise as a result of the use of such data” (source)

Responsibilities using data

  • Respect for persons
    • Individuals should be treated as autonomous agents
      • People are people
    • Those without autonomy should be protected
  • Beneficence
    1. Do not harm (ideally)
    2. Maximize possible benefits and minimize possible harms
      • This can be a natural source of conflict
  • Justice
    • Benefits and risks should flow to the same groups – don’t use unwilling or disadvantaged groups who won’t receive any benefit

For experiments, see The Belmont Report; for electronic data, see The Menlo Report

End matter

Recap

Today, we:

  • Learned about combining models to create an even better model
    • And the limits to this as pointed out by Geoff Hinton
  • Discussed the potential ethical issues surrounding:
    • AI algorithms
    • Data creation
    • Data usage

For next week

  • For the next 2 weeks:
    • We will talk about neural networks and vector methods (which are generally neural network based)
      • These are important tools underpinning a lot of recent advancements
      • We will take a look at some of the advancements, and the tools that underpin them
    • If you would like to be well prepared, there is a nice introductory article here (8 parts though)
      • Part 1 is good enough for next week, but part 2 is also useful
      • For those very interested in machine learning, parts 3 through 8 are also great, but more technical and targeted at specific applications like facial recognition and machine translation
    • Keep working on the group project

Fun machine learning examples

Packages used for these slides