df <- readRDS('../../Data/Session_6_models.rds')
#head(df) %>% select(-pred_F, -pred_S) %>% slice(1:2) %>% html_df()
library(xgboost)
# Prep data
train_x <- model.matrix(AAER ~ ., data=df[df$Test==0,-1])[,-1]
train_y <- model.frame(AAER ~ ., data=df[df$Test==0,])[,"AAER"]
test_x <- model.matrix(AAER ~ ., data=df[df$Test==1,-1])[,-1]
test_y <- model.frame(AAER ~ ., data=df[df$Test==1,])[,"AAER"]
set.seed(468435) #for reproducibility
# 10-fold stratified CV to pick the number of boosting rounds by test AUC
xgbCV <- xgb.cv(max_depth=5, eta=0.10, gamma=5, min_child_weight = 4,
                subsample = 0.5, objective = "binary:logistic", data=train_x,
                label=train_y, nrounds=100, eval_metric="auc", nfold=10,
                stratified=TRUE, verbose=0)
# Refit on all training data, using the round with the highest mean CV AUC
fit_ens <- xgboost(params=xgbCV$params, data = train_x, label = train_y,
                   nrounds = which.max(xgbCV$evaluation_log$test_auc_mean),
                   verbose = 0)
## [15:09:06] WARNING: amalgamation/../src/learner.cc:516:
## Parameters: { silent } might not be used.
##
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.
##           Ensemble        Logit (BCE) Lasso (lambda.min)            XGBoost
##          0.8169036          0.7599594          0.7290185          0.8083503
Recall the tradeoff between complexity and accuracy!
Example: In 2009, Netflix awarded a $1M prize to the BellKor’s Pragmatic Chaos team for beating Netflix’s own user preference algorithm by >10%. The algorithm was so complex that Netflix never used it. It instead used a simpler algorithm with an 8% improvement.
Dark knowledge
How did they do this?
There are many ways to ensemble, and there is no specific guide as to which is best. Ensembling may still prove useful in the group project.
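One of the simplest ways to ensemble is to average the predicted probabilities of several models. The sketch below is only an illustration on simulated data (not the AAER data, and not the XGBoost model `fit_ens` above). The names `fit_a`, `fit_b`, and `auc_rank` are hypothetical, and the AUC is computed by hand with the rank-sum formula rather than with a package.

```r
# Illustrative only: simulated data, not the AAER data used above
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-2 + x1 + x2))  # outcome depends on both signals
d  <- data.frame(y = y, x1 = x1, x2 = x2)
train <- 1:1000
test  <- 1001:2000

# Two deliberately weak models, each seeing only one of the two signals
fit_a <- glm(y ~ x1, data = d[train, ], family = binomial)
fit_b <- glm(y ~ x2, data = d[train, ], family = binomial)
p_a <- predict(fit_a, d[test, ], type = "response")
p_b <- predict(fit_b, d[test, ], type = "response")

# Simplest possible ensemble: average the predicted probabilities
p_ens <- (p_a + p_b) / 2

# AUC via the rank-sum (Mann-Whitney) formula; hypothetical helper
auc_rank <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

aucs <- c(model_a  = auc_rank(p_a, d$y[test]),
          model_b  = auc_rank(p_b, d$y[test]),
          ensemble = auc_rank(p_ens, d$y[test]))
round(aucs, 3)
```

Averaging helps here because the two models capture different information. When the component models are highly correlated, the gain is usually much smaller.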
What was the issue in the case? Where might similar issues crop up in business? Any examples?
Fairness requires considering different perspectives and identifying which of them matter most ethically.
library('SHAPforxgboost')
#https://liuyanguu.github.io/post/2019/07/18/visualization-of-shap-for-xgboost/
shap_values <- shap.values(xgb_model = fit_xgb, X_train = train_x)
shap_values$mean_shap_score
## Salaried
## 15541.9012
## Full.Time
## 14542.4617
## Job.TitlesPOLICE OFFICER
## 5415.6150
## Female
## 3823.1335
## DepartmentOther
## 2477.2842
## DepartmentPOLICE
## 1525.8945
## Job.TitlesSERGEANT
## 1021.7853
## Job.TitlesPOLICE OFFICER (ASSIGNED AS DETECTIVE)
## 399.4674
## Job.TitlesOther
## 395.8012
## DepartmentWATER MGMNT
## 390.2932
## Job.TitlesMOTOR TRUCK DRIVER
## 369.7115
## DepartmentOEMC
## 242.1982
## DepartmentSTREETS & SAN
## 89.9351
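The scores printed above rank features by how much they move predictions on average. A minimal sketch of that aggregation on a made-up SHAP matrix follows. The numbers and the names `shap_mat` and `mean_abs_shap` are hypothetical, and the sketch assumes that `mean_shap_score` is the column mean of absolute SHAP values, sorted in decreasing order.

```r
# Hypothetical SHAP matrix: one row per observation, one column per feature.
# Each entry is that feature's contribution to that observation's prediction.
shap_mat <- matrix(c( 2.0, -0.5,  0.1,
                     -3.0,  0.5, -0.1,
                      1.0, -1.5,  0.4),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(NULL, c("f1", "f2", "f3")))

# Average magnitude of each feature's contribution, ignoring sign
mean_abs_shap <- sort(colMeans(abs(shap_mat)), decreasing = TRUE)
mean_abs_shap
```

Because signs are dropped, a high score says that a feature matters, not whether it raises or lowers the prediction.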
A good article with examples of the above: “Algorithms are great and all, but they can also ruin lives”
Compares a variety of unintended associations (top) and intended associations (bottom) across Global Vectors (GloVe) and the Universal Sentence Encoder (USE).
What risks does such a system pose?
How would you feel if a similar system were implemented in Singapore?
[Withheld from all public copies]
What could go wrong if the Uber data wasn’t anonymized?
Both Allman & Paxson, and Partridge warn against relying on the anonymisation of data since deanonymisation techniques are often surprisingly powerful. Robust anonymisation of data is difficult, particularly when it has high dimensionality, as the anonymisation is likely to lead to an unacceptable level of data loss [3]. – TPHCB 2017
Also note Singapore’s Personal Data Protection Act (PDPA).
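To see why high dimensionality makes anonymisation hard, one can count how many records become unique as more attributes are combined. The sketch below uses simulated data with hypothetical column names (`zip`, `age`, `gender`, `pickup_hour`) and a hypothetical helper `share_unique`. It illustrates the idea only and is not a method taken from the sources quoted above.

```r
# Simulated, illustrative records with no names attached
set.seed(42)
n <- 1000
rides <- data.frame(
  zip         = sample(1:20, n, replace = TRUE),
  age         = sample(18:70, n, replace = TRUE),
  gender      = sample(c("F", "M"), n, replace = TRUE),
  pickup_hour = sample(0:23, n, replace = TRUE)
)

# Share of records that are the only one with their combination of values
share_unique <- function(df, cols) {
  key <- do.call(paste, c(df[cols], sep = "|"))
  mean(table(key)[key] == 1)
}

# Add one attribute at a time and recompute the share of unique records
cols <- names(rides)
uniq <- sapply(seq_along(cols), function(k) share_unique(rides, cols[1:k]))
names(uniq) <- sapply(seq_along(cols),
                      function(k) paste(cols[1:k], collapse = "+"))
round(uniq, 3)
```

Each added attribute can only split groups further, so the share of unique records never falls. Anyone who knows those few attributes about a person from another source could single out that person's record, even with the name removed.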
What risks does this pose? Consider contexts outside Singapore as well.
By iterating repeatedly, the generative network can find a strategy that generally circumvents the discriminative network.
“The collection, or use, of a dataset of illicit origin to support research can be advantageous. For example, legitimate access to data may not be possible, or the reuse of data of illicit origin is likely to require fewer resources than collecting data again from scratch. In addition, the sharing and reuse of existing datasets aids reproducibility, an important scientific goal. The disadvantage is that ethical and legal questions may arise as a result of the use of such data” (source)
For experiments, see The Belmont Report; for electronic data, see The Menlo Report
Today, we: