Main question
How can we detect if a firm is involved in a major instance of missreporting?
- This is a pure forensic analytics question
- “Major instance of misreporting” will be implemented using AAERs
Dr. Richard M. Crowley
Misstatements that affect firms’ accounting statements and were done seemingly intentionally by management or other employees at the firm.
In the US:
Why did the SEC release this AAER regarding Sanofi?
How can we detect if a firm is involved in a major instance of missreporting?
All of these are discussed in a Brown, Crowley and Elliott (2018) – I will refer to the paper as BCE for short
bigNaudit
, midNaudit
)rsst_acc
)ni_at
)sentlen_u
)Topic_
prepended)Pulled from BCE’s working files
Test
What potential issues are there with our usual training and testing strategy?
df %>%
group_by(year) %>%
mutate(total_AAERS = sum(AAER), total_observations=n()) %>%
slice(1) %>%
ungroup() %>%
select(year, total_AAERS, total_observations) %>%
html_df
year | total_AAERS | total_observations |
---|---|---|
1999 | 46 | 2195 |
2000 | 50 | 2041 |
2001 | 43 | 2021 |
2002 | 50 | 2391 |
2003 | 57 | 2936 |
2004 | 49 | 2843 |
246 AAERs in the training data, 401 total variables…
fit_1990s <- glm(AAER ~ ebit + ni_revt + ni_at + log_lt + ltl_at + lt_seq +
lt_at + act_lct + aq_lct + wcap_at + invt_revt + invt_at +
ni_ppent + rect_revt + revt_at + d_revt + b_rect + b_rect +
r_gp + b_gp + gp_at + revt_m_gp + ch_at + log_at +
ppent_at + wcap,
data=df[df$Test==0,],
family=binomial)
summary(fit_1990s)
##
## Call:
## glm(formula = AAER ~ ebit + ni_revt + ni_at + log_lt + ltl_at +
## lt_seq + lt_at + act_lct + aq_lct + wcap_at + invt_revt +
## invt_at + ni_ppent + rect_revt + revt_at + d_revt + b_rect +
## b_rect + r_gp + b_gp + gp_at + revt_m_gp + ch_at + log_at +
## ppent_at + wcap, family = binomial, data = df[df$Test ==
## 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1391 -0.2275 -0.1661 -0.1190 3.6236
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.660e+00 8.336e-01 -5.591 2.26e-08 ***
## ebit -3.564e-04 1.094e-04 -3.257 0.00112 **
## ni_revt 3.664e-02 3.058e-02 1.198 0.23084
## ni_at -3.196e-01 2.325e-01 -1.374 0.16932
## log_lt 1.494e-01 3.409e-01 0.438 0.66118
## ltl_at -2.306e-01 7.072e-01 -0.326 0.74438
## lt_seq -2.826e-05 4.567e-04 -0.062 0.95067
## lt_at -8.559e-01 9.270e-01 -0.923 0.35586
## act_lct 1.401e-01 7.005e-02 2.000 0.04546 *
## aq_lct -1.751e-01 9.156e-02 -1.912 0.05588 .
## wcap_at -8.488e-01 6.487e-01 -1.308 0.19075
## invt_revt -9.165e-01 9.679e-01 -0.947 0.34371
## invt_at 2.350e+00 1.033e+00 2.276 0.02286 *
## ni_ppent 2.703e-03 1.203e-02 0.225 0.82223
## rect_revt 6.587e-02 7.427e-02 0.887 0.37511
## revt_at -4.790e-01 1.734e-01 -2.763 0.00572 **
## d_revt -7.848e-04 1.586e-03 -0.495 0.62079
## b_rect 8.212e-02 1.569e-01 0.524 0.60060
## r_gp -1.055e-04 6.623e-04 -0.159 0.87345
## b_gp 1.266e-01 1.556e-01 0.813 0.41605
## gp_at 1.031e-01 3.512e-01 0.293 0.76921
## revt_m_gp 2.621e-05 1.094e-05 2.396 0.01655 *
## ch_at -5.356e-01 7.909e-01 -0.677 0.49829
## log_at 2.687e-01 3.483e-01 0.771 0.44045
## ppent_at -2.956e+00 4.798e-01 -6.161 7.23e-10 ***
## wcap 7.009e-05 5.187e-05 1.351 0.17658
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2118.6 on 10268 degrees of freedom
## Residual deviance: 1928.8 on 10243 degrees of freedom
## (1315 observations deleted due to missingness)
## AIC: 1980.8
##
## Number of Fisher Scoring iterations: 12
## In sample AUC Out of sample AUC
## 0.7483132 0.7292981
Based on Dechow, Ge, Larson and Sloan (2011)
fit_2011 <- glm(AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
soft_assets + pct_chg_cashsales + chg_roa + issuance +
oplease_dum + book_mkt + lag_sdvol + merger + bigNaudit +
midNaudit + cffin + exfin + restruct,
data=df[df$Test==0,],
family=binomial)
summary(fit_2011)
##
## Call:
## glm(formula = AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
## soft_assets + pct_chg_cashsales + chg_roa + issuance + oplease_dum +
## book_mkt + lag_sdvol + merger + bigNaudit + midNaudit + cffin +
## exfin + restruct, family = binomial, data = df[df$Test ==
## 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8434 -0.2291 -0.1658 -0.1196 3.2614
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.1474558 0.5337491 -13.391 < 2e-16 ***
## logtotasset 0.3214322 0.0355467 9.043 < 2e-16 ***
## rsst_acc -0.2190095 0.3009287 -0.728 0.4667
## chg_recv 1.1020740 1.0590837 1.041 0.2981
## chg_inv 0.0389504 1.2507142 0.031 0.9752
## soft_assets 2.3094551 0.3325731 6.944 3.81e-12 ***
## pct_chg_cashsales -0.0006912 0.0108771 -0.064 0.9493
## chg_roa -0.2697984 0.2554262 -1.056 0.2908
## issuance 0.1443841 0.3187606 0.453 0.6506
## oplease_dum -0.2029022 0.1970597 -1.030 0.3032
## book_mkt 0.0150113 0.0110652 1.357 0.1749
## lag_sdvol 0.0517368 0.0555338 0.932 0.3515
## merger 0.3529102 0.1511729 2.334 0.0196 *
## bigNaudit -0.1998454 0.3598283 -0.555 0.5786
## midNaudit -0.4894029 0.5118388 -0.956 0.3390
## cffin 0.4557560 0.3435253 1.327 0.1846
## exfin -0.0053608 0.0393354 -0.136 0.8916
## restruct 0.3827721 0.1471729 2.601 0.0093 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2381.9 on 11583 degrees of freedom
## Residual deviance: 2189.7 on 11566 degrees of freedom
## AIC: 2225.7
##
## Number of Fisher Scoring iterations: 7
## In sample AUC Out of sample AUC
## 0.7445378 0.6849225
From a variety of papers
fit_2000s <- glm(AAER ~ bullets + headerlen + newlines + alltags +
processedsize + sentlen_u + wordlen_s + paralen_s +
repetitious_p + sentlen_s + typetoken + clindex + fog +
active_p + passive_p + lm_negative_p + lm_positive_p +
allcaps + exclamationpoints + questionmarks,
data=df[df$Test==0,],
family=binomial)
summary(fit_2000s)
##
## Call:
## glm(formula = AAER ~ bullets + headerlen + newlines + alltags +
## processedsize + sentlen_u + wordlen_s + paralen_s + repetitious_p +
## sentlen_s + typetoken + clindex + fog + active_p + passive_p +
## lm_negative_p + lm_positive_p + allcaps + exclamationpoints +
## questionmarks, family = binomial, data = df[df$Test == 0,
## ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9604 -0.2244 -0.1984 -0.1749 3.2318
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.662e+00 3.143e+00 -1.801 0.07165 .
## bullets -2.635e-05 2.625e-05 -1.004 0.31558
## headerlen -2.943e-04 3.477e-04 -0.846 0.39733
## newlines -4.821e-05 1.220e-04 -0.395 0.69271
## alltags 5.060e-08 2.567e-07 0.197 0.84376
## processedsize 5.709e-06 1.287e-06 4.435 9.19e-06 ***
## sentlen_u -3.790e-02 6.897e-02 -0.550 0.58259
## wordlen_s 1.278e-01 1.199e+00 0.107 0.91510
## paralen_s -4.808e-02 3.052e-02 -1.576 0.11514
## repetitious_p -1.673e+00 1.665e+00 -1.005 0.31508
## sentlen_s -2.098e-02 2.222e-02 -0.944 0.34518
## typetoken -3.609e-01 1.729e+00 -0.209 0.83469
## clindex 2.744e-01 1.519e-01 1.806 0.07085 .
## fog -1.839e-02 1.322e-01 -0.139 0.88935
## active_p -4.321e-01 1.459e+00 -0.296 0.76711
## passive_p 7.321e-02 3.204e+00 0.023 0.98177
## lm_negative_p 1.356e+01 1.304e+01 1.039 0.29875
## lm_positive_p -9.911e+01 3.842e+01 -2.580 0.00989 **
## allcaps -4.646e-04 1.835e-04 -2.532 0.01136 *
## exclamationpoints -3.520e-01 1.944e-01 -1.811 0.07010 .
## questionmarks 3.371e-03 2.816e-02 0.120 0.90471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2381.9 on 11583 degrees of freedom
## Residual deviance: 2329.2 on 11563 degrees of freedom
## AIC: 2371.2
##
## Number of Fisher Scoring iterations: 10
## In sample AUC Out of sample AUC
## 0.6377783 0.6295414
Why is it appropriate to combine the 2011 model with the 2000s model?
fit_2000f <- glm(AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
soft_assets + pct_chg_cashsales + chg_roa + issuance +
oplease_dum + book_mkt + lag_sdvol + merger + bigNaudit +
midNaudit + cffin + exfin + restruct + bullets + headerlen +
newlines + alltags + processedsize + sentlen_u + wordlen_s +
paralen_s + repetitious_p + sentlen_s + typetoken +
clindex + fog + active_p + passive_p + lm_negative_p +
lm_positive_p + allcaps + exclamationpoints + questionmarks,
data=df[df$Test==0,],
family=binomial)
summary(fit_2000f)
##
## Call:
## glm(formula = AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
## soft_assets + pct_chg_cashsales + chg_roa + issuance + oplease_dum +
## book_mkt + lag_sdvol + merger + bigNaudit + midNaudit + cffin +
## exfin + restruct + bullets + headerlen + newlines + alltags +
## processedsize + sentlen_u + wordlen_s + paralen_s + repetitious_p +
## sentlen_s + typetoken + clindex + fog + active_p + passive_p +
## lm_negative_p + lm_positive_p + allcaps + exclamationpoints +
## questionmarks, family = binomial, data = df[df$Test == 0,
## ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9514 -0.2237 -0.1596 -0.1110 3.3882
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.634e+00 3.415e+00 -0.479 0.63223
## logtotasset 3.437e-01 3.921e-02 8.766 < 2e-16 ***
## rsst_acc -2.123e-01 2.995e-01 -0.709 0.47844
## chg_recv 1.022e+00 1.055e+00 0.969 0.33274
## chg_inv -9.464e-02 1.226e+00 -0.077 0.93845
## soft_assets 2.573e+00 3.387e-01 7.598 3.01e-14 ***
## pct_chg_cashsales -1.134e-03 1.103e-02 -0.103 0.91811
## chg_roa -2.696e-01 2.470e-01 -1.092 0.27504
## issuance 1.471e-01 3.220e-01 0.457 0.64777
## oplease_dum -2.859e-01 2.020e-01 -1.416 0.15691
## book_mkt 1.531e-02 1.115e-02 1.373 0.16967
## lag_sdvol 6.147e-02 5.418e-02 1.135 0.25655
## merger 3.699e-01 1.536e-01 2.408 0.01604 *
## bigNaudit -1.992e-01 3.638e-01 -0.548 0.58392
## midNaudit -4.839e-01 5.139e-01 -0.942 0.34635
## cffin 4.608e-01 3.599e-01 1.280 0.20044
## exfin -1.235e-03 5.907e-02 -0.021 0.98332
## restruct 3.035e-01 1.555e-01 1.952 0.05099 .
## bullets -1.958e-05 2.519e-05 -0.777 0.43692
## headerlen -5.792e-04 4.038e-04 -1.434 0.15151
## newlines -7.423e-05 1.294e-04 -0.574 0.56615
## alltags -1.338e-07 2.701e-07 -0.495 0.62036
## processedsize 4.111e-06 1.446e-06 2.844 0.00446 **
## sentlen_u -5.086e-02 6.950e-02 -0.732 0.46433
## wordlen_s -1.618e+00 1.303e+00 -1.241 0.21450
## paralen_s -3.934e-02 2.923e-02 -1.346 0.17834
## repetitious_p -1.433e+00 1.658e+00 -0.864 0.38740
## sentlen_s -1.690e-02 2.212e-02 -0.764 0.44502
## typetoken -1.216e+00 1.788e+00 -0.680 0.49643
## clindex 9.620e-02 1.521e-01 0.632 0.52718
## fog 1.063e-02 1.339e-01 0.079 0.93675
## active_p 6.259e-01 1.435e+00 0.436 0.66271
## passive_p -5.848e+00 3.796e+00 -1.541 0.12337
## lm_negative_p 2.012e+01 1.220e+01 1.649 0.09906 .
## lm_positive_p -3.700e+01 3.864e+01 -0.958 0.33825
## allcaps -4.721e-04 1.912e-04 -2.469 0.01354 *
## exclamationpoints -4.510e-01 2.024e-01 -2.228 0.02586 *
## questionmarks 4.096e-03 2.853e-02 0.144 0.88586
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2381.9 on 11583 degrees of freedom
## Residual deviance: 2143.0 on 11546 degrees of freedom
## AIC: 2219
##
## Number of Fisher Scoring iterations: 10
## In sample AUC Out of sample AUC
## 0.7664115 0.7147021
Why use document content?
BCE_eq = as.formula(paste("AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
soft_assets + pct_chg_cashsales + chg_roa + issuance +
oplease_dum + book_mkt + lag_sdvol + merger + bigNaudit +
midNaudit + cffin + exfin + restruct + bullets + headerlen +
newlines + alltags + processedsize + sentlen_u + wordlen_s +
paralen_s + repetitious_p + sentlen_s + typetoken +
clindex + fog + active_p + passive_p + lm_negative_p +
lm_positive_p + allcaps + exclamationpoints + questionmarks + ",
paste(paste0("Topic_",1:30,"_n_oI"), collapse=" + "), collapse=""))
fit_BCE <- glm(BCE_eq,
data=df[df$Test==0,],
family=binomial)
summary(fit_BCE)
##
## Call:
## glm(formula = BCE_eq, family = binomial, data = df[df$Test ==
## 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0887 -0.2212 -0.1478 -0.0940 3.5401
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.032e+00 3.872e+00 -2.074 0.03806 *
## logtotasset 3.879e-01 4.554e-02 8.519 < 2e-16 ***
## rsst_acc -1.938e-01 3.055e-01 -0.634 0.52593
## chg_recv 8.581e-01 1.071e+00 0.801 0.42296
## chg_inv -2.607e-01 1.223e+00 -0.213 0.83119
## soft_assets 2.555e+00 3.796e-01 6.730 1.7e-11 ***
## pct_chg_cashsales -1.976e-03 6.997e-03 -0.282 0.77767
## chg_roa -2.532e-01 2.786e-01 -0.909 0.36354
## issuance 9.692e-02 3.269e-01 0.296 0.76687
## oplease_dum -3.451e-01 2.097e-01 -1.645 0.09989 .
## book_mkt 1.361e-02 1.151e-02 1.183 0.23692
## lag_sdvol 4.546e-02 5.709e-02 0.796 0.42589
## merger 3.224e-01 1.572e-01 2.051 0.04027 *
## bigNaudit -2.010e-01 3.711e-01 -0.542 0.58804
## midNaudit -4.641e-01 5.195e-01 -0.893 0.37169
## cffin 6.024e-01 3.769e-01 1.598 0.10998
## exfin -6.738e-03 4.063e-02 -0.166 0.86831
## restruct 1.915e-01 1.593e-01 1.202 0.22920
## bullets -1.612e-05 2.526e-05 -0.638 0.52334
## headerlen -3.956e-04 3.722e-04 -1.063 0.28776
## newlines -1.038e-04 1.333e-04 -0.779 0.43623
## alltags -1.338e-07 2.772e-07 -0.483 0.62934
## processedsize 4.178e-06 1.477e-06 2.828 0.00468 **
## sentlen_u -7.131e-02 7.881e-02 -0.905 0.36553
## wordlen_s 4.413e-01 1.430e+00 0.309 0.75761
## paralen_s -4.584e-02 2.976e-02 -1.540 0.12356
## repetitious_p -1.525e+00 1.780e+00 -0.857 0.39152
## sentlen_s -7.700e-04 2.238e-02 -0.034 0.97255
## typetoken 5.313e-02 1.934e+00 0.027 0.97809
## clindex 6.406e-02 1.669e-01 0.384 0.70116
## fog 7.432e-02 1.590e-01 0.467 0.64018
## active_p -3.159e-01 1.628e+00 -0.194 0.84620
## passive_p -6.714e+00 4.192e+00 -1.602 0.10926
## lm_negative_p 3.558e+00 1.422e+01 0.250 0.80240
## lm_positive_p -8.906e+01 4.641e+01 -1.919 0.05497 .
## allcaps -4.189e-04 1.916e-04 -2.186 0.02878 *
## exclamationpoints -4.685e-01 2.143e-01 -2.187 0.02878 *
## questionmarks 4.424e-03 2.876e-02 0.154 0.87774
## Topic_1_n_oI 3.157e+01 1.137e+02 0.278 0.78136
## Topic_2_n_oI 6.382e+01 6.784e+01 0.941 0.34684
## Topic_3_n_oI -1.853e+02 1.457e+02 -1.272 0.20335
## Topic_4_n_oI -4.247e+01 7.460e+01 -0.569 0.56919
## Topic_5_n_oI -8.240e+00 8.061e+01 -0.102 0.91858
## Topic_6_n_oI -5.072e+02 1.793e+02 -2.829 0.00468 **
## Topic_7_n_oI 1.994e+01 8.192e+01 0.243 0.80772
## Topic_8_n_oI -9.620e+01 8.419e+01 -1.143 0.25321
## Topic_9_n_oI 5.963e+01 7.064e+01 0.844 0.39863
## Topic_10_n_oI -4.641e+01 6.705e+01 -0.692 0.48885
## Topic_11_n_oI -1.415e+02 7.384e+01 -1.916 0.05536 .
## Topic_12_n_oI 2.147e-01 8.441e+01 0.003 0.99797
## Topic_13_n_oI -1.420e+02 2.612e+02 -0.543 0.58686
## Topic_14_n_oI 2.086e+01 7.840e+01 0.266 0.79019
## Topic_15_n_oI -3.519e+01 6.332e+01 -0.556 0.57840
## Topic_16_n_oI 1.629e+01 1.156e+02 0.141 0.88793
## Topic_17_n_oI -5.188e+01 8.901e+01 -0.583 0.56000
## Topic_18_n_oI 2.239e+01 9.340e+01 0.240 0.81058
## Topic_19_n_oI -1.071e+02 8.063e+01 -1.328 0.18422
## Topic_20_n_oI -4.885e+01 9.701e+01 -0.504 0.61455
## Topic_21_n_oI -1.515e+02 9.116e+01 -1.662 0.09654 .
## Topic_22_n_oI -6.818e+00 6.472e+01 -0.105 0.91610
## Topic_23_n_oI -1.226e+01 6.965e+01 -0.176 0.86031
## Topic_24_n_oI -6.545e+01 9.523e+01 -0.687 0.49189
## Topic_25_n_oI -6.863e+01 7.094e+01 -0.967 0.33336
## Topic_26_n_oI -2.182e+00 6.580e+01 -0.033 0.97354
## Topic_27_n_oI 9.374e+00 1.418e+02 0.066 0.94729
## Topic_28_n_oI 5.882e+00 7.937e+01 0.074 0.94092
## Topic_29_n_oI -4.209e+01 6.670e+01 -0.631 0.52798
## Topic_30_n_oI -4.605e+02 2.672e+02 -1.724 0.08477 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2381.9 on 11583 degrees of freedom
## Residual deviance: 2069.5 on 11516 degrees of freedom
## AIC: 2205.5
##
## Number of Fisher Scoring iterations: 11
## In sample AUC Out of sample AUC
## 0.7941841 0.7599594
## 1990s 2011 2000s 2000s + 2011 BCE
## 0.7483132 0.7445378 0.6377783 0.7664115 0.7941841
Great if you have way too many inputs in your model
y
vector and an x
matrix instead of our usual y ~ x
formula
model.matrix()
~
, and your datax
as.matrix()
on a data frame of your input variables"binomial"
instead of binomial
Ridge regression
Elastic net regression
glmnet()
library(glmnet)
x <- model.matrix(BCE_eq, data=df[df$Test==0,])[,-1] # [,-1] to remove intercept
y <- model.frame(BCE_eq, data=df[df$Test==0,])[,"AAER"]
fit_LASSO <- glmnet(x=x, y=y,
family = "binomial",
alpha = 1 # Specifies LASSO. alpha = 0 is ridge
)
plot(fit_LASSO)
print(fit_LASSO)
##
## Call: glmnet(x = x, y = y, family = "binomial", alpha = 1)
##
## Df %Dev Lambda
## [1,] 0 1.312e-13 1.433e-02
## [2,] 1 8.060e-03 1.305e-02
## [3,] 1 1.461e-02 1.189e-02
## [4,] 1 1.995e-02 1.084e-02
## [5,] 2 2.471e-02 9.874e-03
## [6,] 2 3.219e-02 8.997e-03
## [7,] 2 3.845e-02 8.197e-03
## [8,] 2 4.371e-02 7.469e-03
## [9,] 2 4.813e-02 6.806e-03
## [10,] 3 5.224e-02 6.201e-03
## [11,] 3 5.591e-02 5.650e-03
## [12,] 4 5.906e-02 5.148e-03
## [13,] 4 6.249e-02 4.691e-03
## [14,] 5 6.573e-02 4.274e-03
## [15,] 7 6.894e-02 3.894e-03
## [16,] 8 7.224e-02 3.548e-03
## [17,] 10 7.522e-02 3.233e-03
## [18,] 12 7.834e-02 2.946e-03
## [19,] 15 8.156e-02 2.684e-03
## [20,] 15 8.492e-02 2.446e-03
## [21,] 15 8.780e-02 2.229e-03
## [22,] 15 9.026e-02 2.031e-03
## [23,] 18 9.263e-02 1.850e-03
## [24,] 20 9.478e-02 1.686e-03
## [25,] 22 9.689e-02 1.536e-03
## [26,] 25 9.892e-02 1.400e-03
## [27,] 28 1.010e-01 1.275e-03
## [28,] 30 1.033e-01 1.162e-03
## [29,] 31 1.054e-01 1.059e-03
## [30,] 33 1.074e-01 9.647e-04
## [31,] 36 1.093e-01 8.790e-04
## [32,] 37 1.111e-01 8.009e-04
## [33,] 39 1.126e-01 7.297e-04
## [34,] 42 1.142e-01 6.649e-04
## [35,] 43 1.157e-01 6.058e-04
## [36,] 44 1.171e-01 5.520e-04
## [37,] 47 1.185e-01 5.030e-04
## [38,] 48 1.197e-01 4.583e-04
## [39,] 48 1.208e-01 4.176e-04
## [40,] 48 1.218e-01 3.805e-04
## [41,] 51 1.227e-01 3.467e-04
## [42,] 52 1.235e-01 3.159e-04
## [43,] 54 1.243e-01 2.878e-04
## [44,] 54 1.249e-01 2.623e-04
## [45,] 54 1.255e-01 2.390e-04
## [46,] 55 1.260e-01 2.177e-04
## [47,] 56 1.265e-01 1.984e-04
## [48,] 58 1.270e-01 1.808e-04
## [49,] 60 1.274e-01 1.647e-04
## [50,] 61 1.279e-01 1.501e-04
## [51,] 62 1.282e-01 1.367e-04
## [52,] 63 1.286e-01 1.246e-04
## [53,] 65 1.289e-01 1.135e-04
## [54,] 66 1.292e-01 1.034e-04
## [55,] 66 1.294e-01 9.425e-05
## [56,] 66 1.297e-01 8.588e-05
## [57,] 66 1.298e-01 7.825e-05
## [58,] 66 1.300e-01 7.130e-05
## [59,] 66 1.302e-01 6.496e-05
## [60,] 66 1.303e-01 5.919e-05
## [61,] 67 1.304e-01 5.393e-05
## [62,] 67 1.305e-01 4.914e-05
## [63,] 67 1.306e-01 4.478e-05
## [64,] 67 1.306e-01 4.080e-05
## [65,] 67 1.307e-01 3.717e-05
## [66,] 67 1.308e-01 3.387e-05
## [67,] 67 1.308e-01 3.086e-05
## [68,] 67 1.309e-01 2.812e-05
## [69,] 67 1.309e-01 2.562e-05
## [70,] 67 1.309e-01 2.335e-05
## [71,] 67 1.310e-01 2.127e-05
## [72,] 67 1.310e-01 1.938e-05
## [73,] 67 1.310e-01 1.766e-05
## [74,] 67 1.310e-01 1.609e-05
## [75,] 67 1.310e-01 1.466e-05
#coef(fit_LASSO, s=0.002031)
coefplot(fit_LASSO, lambda=0.002031, sort='magnitude')
# na.pass has model.matrix retain NA values (so the # of rows is constant)
xp <- model.matrix(BCE_eq, data=df, na.action='na.pass')[,-1]
# s= specifies the version of the model to use
pred <- predict(fit_LASSO, xp, type="response", s = 0.002031)
## In sample AUC Out of sample AUC
## 0.7593828 0.7239785
"lambda.min"
: The best performing model"lambda.1se"
: The simplest model within 1 standard error of "lambda.min"
# Cross validation
set.seed(697435) #for reproducibility
cvfit = cv.glmnet(x=x, y=y,family = "binomial", alpha = 1, type.measure="auc")
plot(cvfit)
cvfit$lambda.min
## [1] 0.00139958
cvfit$lambda.1se
## [1] 0.002684268
lambda.min
lambda.1se
# s= specifies the version of the model to use
pred <- predict(cvfit, xp, type="response", s = "lambda.min")
pred2 <- predict(cvfit, xp, type="response", s = "lambda.1se")
## In sample AUC, lambda.min Out of sample AUC, lambda.min
## 0.7665463 0.7330212
## In sample AUC, lambda.1se Out of sample AUC, lambda.1se
## 0.7509946 0.7124231
What other data could we use to predict corporate fraud?