Ex.: You train 3 models using different techniques, and each seems to work well in certain cases and poorly in others
If you use the models in isolation, any of them would do an OK (but not great) job
If you make a model using all three, you can get better performance if their strengths all shine through
Ensembles range from simple to complex
Simple: a (weighted) average of a few models’ predictions
When are ensembles useful?
You have multiple models that are all decent, but none are great
And, ideally, the models’ predictions are not highly correlated
When are ensembles useful?
You have a really good model and a bunch of mediocre models
And, ideally, the mediocre models are not highly correlated
When are ensembles useful?
You really need to get just a bit more accuracy/less error out of the model, and you have some other models lying around
You want a more stable model
It helps stabilize predictions by limiting the effect that any one model’s errors or outliers have on the final prediction
Think: Diversification (like in finance)
A simple ensemble (averaging)
For continuous predictions, simple averaging is viable
Often you may want to weight the best model a bit higher
For binary or categorical predictions, consider averaging ranks
i.e., instead of using a probability from a logit, use ranks 1, 2, 3, etc.
Ranks average a bit better, as scores on binary models (particularly when evaluated with measures like AUC) can have extremely different variances across models
In which case the ensemble is really just the most volatile model’s prediction…
Not much of an ensemble
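A minimal sketch of these averaging options; the three prediction vectors below are made up, standing in for three models scored on the same observations:

```python
# A minimal sketch of simple, weighted, and rank averaging.
# The three prediction vectors are hypothetical stand-ins for three
# models' scores on the same test observations.
import numpy as np
from scipy.stats import rankdata

pred_a = np.array([0.91, 0.10, 0.40, 0.75])  # e.g., probabilities from a logit
pred_b = np.array([0.55, 0.05, 0.52, 0.61])
pred_c = np.array([0.88, 0.30, 0.45, 0.70])

# Simple average
simple_avg = (pred_a + pred_b + pred_c) / 3

# Weighted average, giving the best model (say, model A) a higher weight
weighted_avg = 0.5 * pred_a + 0.25 * pred_b + 0.25 * pred_c

# Rank averaging: convert each model's scores to ranks first, so a model
# with unusually spread-out scores doesn't dominate the average
rank_avg = np.mean([rankdata(p) for p in (pred_a, pred_b, pred_c)], axis=0)
```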
A more complex ensemble (voting model)
If you have a model that is very good at predicting a binary outcome, ensembling can still help
This is particularly true when you have other models that capture different aspects of the problem
Let the other models vote against the best model, and use their prediction if they are above some threshold of agreement
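A minimal sketch of this voting idea, using made-up class predictions; the threshold of two dissenting models (a majority of the three weaker models) is an illustrative choice:

```python
import numpy as np

best = np.array([1, 0, 1, 0, 1])        # the strong model's class predictions
others = np.array([[0, 0, 1, 1, 0],     # three weaker models that may capture
                   [0, 1, 1, 1, 0],     # different aspects of the problem
                   [0, 0, 0, 1, 1]])

votes_against = (others != best).sum(axis=0)  # how many models disagree
overruled = votes_against >= 2                # threshold of agreement

# Keep the strong model's call unless enough models vote against it
final = np.where(overruled, 1 - best, best)
```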
A lot more complex ensemble
Stacking models (2 layers)
Train models on subsets (folds) of the training data
Make predictions with each model on the folds it wasn’t trained on
Train a new model that takes those predictions as inputs (and optionally the dataset as well)
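A minimal sketch of 2-layer stacking with scikit-learn; the synthetic data and the particular base and meta learners are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)

base_models = [RandomForestClassifier(random_state=0),
               GaussianNB(),
               LogisticRegression(max_iter=1000)]

# Out-of-fold predictions: each model only predicts on folds it wasn't
# trained on, so the second layer doesn't see leaked training labels
oof_preds = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Layer 2: a new model that takes the base models' predictions as inputs
meta_model = LogisticRegression().fit(oof_preds, y)
```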
Blending (similar to stacking)
Like stacking, but using predictions on a hold-out sample instead of out-of-fold predictions (so every model predicts on the same data)
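Blending differs mainly in where the layer-2 training data comes from; continuing the stacking sketch above (reusing X, y, and base_models):

```python
from sklearn.model_selection import train_test_split

# Fit the base models on one split, predict on a common hold-out sample
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

hold_preds = np.column_stack([
    m.fit(X_fit, y_fit).predict_proba(X_hold)[:, 1] for m in base_models
])

# The blender trains only on the hold-out predictions
blender = LogisticRegression().fit(hold_preds, y_hold)
```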
Simple ensemble example
Ensembling all our fraud detection models from Session 6
Methods like stacking or blending are much more complex than a simple averaging or voting based ensemble
But in practice they often perform only slightly better
Recall the tradeoff between complexity and accuracy!
As such, we may not prefer the complex ensemble in practice, unless we only care about accuracy
Example: In 2009, Netflix awarded a $1M prize to the BellKor’s Pragmatic Chaos team for beating Netflix’s own user preference algorithm by >10%. The winning algorithm was so complex that Netflix never used it; instead, it used a simpler algorithm with an 8% improvement.
[Geoff Hinton’s] Dark knowledge
Complex ensembles work well
Complex ensembles are exceedingly computationally intensive
This is bad for running on small or constrained devices (like phones)
Dark knowledge
We can (almost) always create a simple model that approximates the complex model
Interpret the above literally – we can train a model to fit the model
Dark knowledge
Train the simple model not on the actual DV from the training data, but on the best algorithm’s (softened) prediction for the training data
Somewhat surprisingly, this new, simple algorithm can work almost as well as the full thing!
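A minimal sketch of the idea with scikit-learn; the models here are illustrative (in the neural-network setting, “softened” usually means predictions from a higher-temperature softmax):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=5000, random_state=0)

# "Teacher": a big, accurate, expensive model
teacher = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
soft_targets = teacher.predict_proba(X)[:, 1]   # the teacher's softened scores

# "Student": a small, cheap model trained to mimic the teacher's scores
# rather than the original 0/1 labels
student = DecisionTreeRegressor(max_depth=5).fit(X, soft_targets)
```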
An example of this dark knowledge
Google’s full model for interpreting human speech is >100GB
As of October 2019
In Google’s Pixel 4 phone, they have human speech interpretation running locally on the phone
Not in the cloud like it works on any other Android phone
How did they do this?
They can approximate the output of the complex speech model using a 0.5GB model
0.5GB isn’t small, but it’s small enough to run on a phone
If an algorithm has the same accuracy across groups, but base rates differ across groups, then its true positive and false positive rates must differ across groups! (See the illustration below.)
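A small numeric illustration with made-up rates: if both groups kept the same TPR and FPR, their accuracies would have to differ whenever their base rates differ (unless the model is no better than chance):

```python
tpr, fpr = 0.80, 0.05            # suppose both groups shared these rates

for base_rate in (0.50, 0.10):   # outcome common in one group, rare in the other
    accuracy = base_rate * tpr + (1 - base_rate) * (1 - fpr)
    print(f"base rate {base_rate:.2f} -> accuracy {accuracy:.3f}")
# base rate 0.50 -> accuracy 0.875
# base rate 0.10 -> accuracy 0.935
# So equalizing accuracy across the groups forces TPR and/or FPR to differ.
```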
Fairness requires considering different perspectives and identifying which of them matter most ethically
How could the previous examples be avoided?
Filtering data used for learning the algorithms
Microsoft should have been more careful about the language used to retrain Tay over time
Particularly given that the AI was trained on public information on Twitter, where coordination against it would be simple
Filtering output of the algorithms
Coca-Cola could have checked the text for content that is likely racist, classist, sexist, etc.
Google may have been able to avoid this by using a training dataset that was sensitive to potential problems
For instance, using a balanced data set across races
As an interim measure, Google removed searching for gorillas, and the associated label, from the app
Examining ethics for algorithms
Understanding the problem and its impact
Think about the effects the algorithm will have!
Will it drastically affect lives? If yes, exercise more care!
Think about what you might expect to go wrong
What biases might you expect or might find in the data?
What biases do people doing the same task exhibit?
Manual inspection
Check model outputs against problematic indicators
Test the algorithm before putting it into production
What could go wrong if the Uber data wasn’t anonymized?
Anonymization is tricky
Both Allman & Paxson, and Partridge warn against relying on the anonymisation of data since deanonymisation techniques are often surprisingly powerful. Robust anonymisation of data is difficult, particularly when it has high dimensionality, as the anonymisation is likely to lead to an unacceptable level of data loss [3]. – TPHCB 2017
There are natural limits to anonymization, particularly when there is a limited amount of potential participants in the data
If you need to record people’s private information, make sure they know
This is called informed consent
If you are recording sensitive information, consider not keeping identities at all
Create a new, unique identifier (if needed)
Maintain as little identifying information as necessary
Consider using encryption if sensitive data is retained
Can unintentionally lead to infringements of human rights if the data is used in unintended ways
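A minimal sketch of the “create a new, unique identifier” idea above; the field names, salt handling, and hash choice are illustrative, and a real system still needs key management and a threat model:

```python
import hashlib
import secrets

SALT = secrets.token_hex(16)  # keep secret (or discard it to make the
                              # mapping effectively one-way)

def pseudonym(identifier: str) -> str:
    """Replace a direct identifier with a salted hash."""
    return hashlib.sha256((SALT + identifier).encode()).hexdigest()[:16]

# Hypothetical records with a direct identifier and one analysis field
records = [{"person_id": "S1234567A", "visits": 3},
           {"person_id": "S7654321Z", "visits": 1}]

# Keep only the pseudonymous ID and the minimum fields needed
safe_records = [{"id": pseudonym(r["person_id"]), "visits": r["visits"]}
                for r in records]
```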
Informed consent
When working with data about people, they should be informed of this and consent to the research, unless the data is publicly available
From SMU’s IRB Handbook: (2017 SEP 18 version)
“Informed consent: Respect for persons requires that participants, to the degree that they are capable, be given the opportunity to make their own judgments and choices. When researchers seek participants’ participation in research studies, they provide them the opportunity to make their own decisions to participate or not by ensuring that the following adequate standards for informed consent are satisfied:
Information: Participants are given sufficient information about the research study, e.g., research purpose, study procedures, risks, benefits, confidentiality of participants’ data.
Comprehension: The manner and context in which information is conveyed allows sufficient comprehension. The information is organized for easy reading and the language is easily comprehended by the participants.
Voluntariness: The manner in which researchers seek informed consent from the participants to participate in the research study must be free from any undue influence or coercion. Under such circumstances, participants are aware that they are not obliged to participate in the research study and their participation is on a voluntary basis.”
Also, note Singapore’s Personal Data Protection Act (PDPA)
Human rights
Recall the drug users example
If the data was collected without their consent, and was not perfectly anonymized, it could leak drug users’ information to others
What risks does this pose? Consider contexts outside Singapore as well.
Responsibilities using data
“The collection, or use, of a dataset of illicit origin to support research can be advantageous. For example, legitimate access to data may not be possible, or the reuse of data of illicit origin is likely to require fewer resources than collecting data again from scratch. In addition, the sharing and reuse of existing datasets aids reproducibility, an important scientific goal. The disadvantage is that ethical and legal questions may arise as a result of the use of such data” (source)
Responsibilities using data
Respect for persons
Individuals should be treated as autonomous agents
People are people
Those without autonomy should be protected
Beneficence
Do not harm (ideally)
Maximize possible benefits and minimize possible harms
This can be a natural source of conflict
Justice
Benefits and risks should flow to the same groups – don’t use unwilling or disadvantaged groups who won’t receive any benefit
Part 1 is good enough for next week, but part 2 is also useful
For those very interested in machine learning, parts 3 through 8 are also great, but more technical and targeted at specific applications like facial recognition and machine translation