ACCT 420: Logistic Regression for Corporate Fraud


Session 7


Dr. Richard M. Crowley

Front matter

Learning objectives

  • Theory:
    • Economics
    • Psychology
  • Application:
    • Predicting fraud contained in annual reports
  • Methodology:
    • Logistic regression
    • LASSO

Datacamp

  • Explore on your own
  • No specific required class this week
  • We will start having some assigned chapters after the break
    • I’ve posted them already, so you can work on them at your leisure

Corporate/Securities Fraud

Traditional accounting fraud

  1. A company is underperforming
  2. Management cooks up some scheme to increase earnings
    • Worldcom (1999-2001)
      • Fake revenue entries
      • Capitalizing line costs (should be expensed)
    • Olympus (late 1980s-2011): Hide losses in a separate entity
      • “Tobashi scheme”
    • Wells Fargo (2011-2018?)
      • Fake/duplicate customers and transactions
  3. Create accounting statements using the fake information

Reversing it

  1. A company is overperforming
  2. Management cooks up a scheme to “save up” excess performance for a rainy day
  3. Recognize revenue/earnings when needed in the future to hit earnings targets

Other accounting fraud types

  • Suprema Specialties (1998-2001)
    • Round-tripping: Transactions to inflate revenue that have no substance
  • Bribery
    • Keppel O&M (2001-2014): $55M USD in bribes to Brazilian officials for contracts
    • Baker Hughes (2001, 2007): Payments to officials in Indonesia, and possibly to Brazil and India (2001) and to officials in Angola, Indonesia, Nigeria, Russia, and Uzbekistan (2007)
  • ZZZZ Best (1982-1987): Fake the whole company, get funding from insurance fraud, theft, credit card fraud, and fake contracts
    • Also faked a real project to get a clean audit to take the company public

Other securities fraud types

Some of the more interesting cases

What will we look at today?

Misstatements that affect firms’ accounting statements and were done seemingly intentionally by management or other employees at the firm.

How do misstatements come to light?

  1. The company/management admits to it publicly
  2. A government entity forces the company to disclose
    • In more egregious cases, government agencies may disclose the fraud publicly as well
  3. Investors sue the firm, forcing disclosure

Where are these disclosed?

In the US:

  1. 10-K/A filings (/A means amendment)
  2. In a note inside a 10-K filing
    • These are sometimes referred to as “little r” restatements
  3. SEC AAERs: Accounting and Auditing Enforcement Releases
    • Generally highlight larger or more important cases
    • Written by the SEC, not the company

AAERs

  • Today we will examine these AAERs
    • Using a proprietary data set of >1,000 such releases
  • To get a sense of the data we’re working with, read the Summary section (starting on page 2) of this AAER against Sanofi

Why did the SEC release this AAER regarding Sanofi?

Predicting Fraud

Main question

How can we detect if a firm is involved in a major instance of misreporting?

  • This is a pure forensic analytics question
  • “Major instance of misreporting” will be implemented using AAERs

Approaches

  • In these slides, I’ll walk through the primary detection methods since the 1990s, up to currently used methods
  • 1990s: Financials and financial ratios
    • Follow up in 2011
  • Late 2000s/early 2010s: Characteristics of firms’ disclosures
  • Mid 2010s: More holistic text-based measures of disclosures
    • This will tie to next lesson where we will explore how to work with text

All of these are discussed in Brown, Crowley, and Elliott (2018) – I will refer to the paper as BCE for short

The data

  • I have provided some preprocessed data, sanitized of AAER data (which is partially public, partially proprietary)
  • It contains 399 variables
    • From Compustat, CRSP, and the SEC (which I personally collected)
    • Many precalculated measures including:
      • Firm characteristics, such as auditor type (bigNaudit, midNaudit)
      • Financial measures, such as total accruals (rsst_acc)
      • Financial ratios, such as ROA (ni_at)
      • Annual report characteristics, such as the mean sentence length (sentlen_u)
      • Machine learning based content analysis (everything with Topic_ prepended)

Pulled from BCE’s working files

Training and Testing

  • Testing and training splits are already set up via the variable Test
    • Training is annual reports released in 2003 through 2007
    • Testing is annual reports released in 2008
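As a sketch of how such a split indicator can be constructed (synthetic df_toy here, not the course data set):

```r
# Hypothetical example: build a Test indicator from fiscal year --
# train on 2003-2007 filings, test on 2008 filings
df_toy <- data.frame(year = rep(2003:2008, each = 2))
df_toy$Test <- ifelse(df_toy$year == 2008, 1, 0)
table(df_toy$Test)  # 10 training rows, 2 testing rows
```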

What potential issues are there with our usual training and testing strategy?

Censoring

  • Censoring training data helps to emulate historical situations
    • Build an algorithm using only the data that was available at the time a decision would need to have been made
  • Do not censor the testing data
    • Testing emulates where we want to make an optimal choice in real life
      • We want to find frauds regardless of how well hidden they are!

Event frequency

  • Very low event frequencies can make things tricky
df %>%
  group_by(year) %>%
  mutate(total_AAERS = sum(AAER), total_observations = n()) %>%
  slice(1) %>%  # keep one row per year
  ungroup() %>%
  select(year, total_AAERS, total_observations) %>%
  html_df  # html_df: table display helper used in these slides
year  total_AAERS  total_observations
1999           46                2195
2000           50                2041
2001           43                2021
2002           50                2391
2003           57                2936
2004           49                2843

246 AAERs in the training data, 401 total variables…

Dealing with infrequent events

  • A few ways to handle this
    1. Very careful model selection (keep it sufficiently simple)
    2. Sophisticated degenerate variable identification criterion + simulation to implement complex models that are just barely simple enough
      • The main method in BCE
    3. Automated methodologies for paring down models
      • We’ll discuss using LASSO for this at the end of class
        • Also implemented in BCE

1990s approach

The 1990s model

  • Many financial measures and ratios can help to predict fraud
  • EBIT
  • Earnings / revenue
  • ROA
  • Log of liabilities
  • Liabilities / equity
  • Liabilities / assets
  • Quick ratio
  • Working capital / assets
  • Inventory / revenue
  • Inventory / assets
  • Earnings / PP&E
  • A/R / revenue
  • Change in revenue
  • Change in A/R + 1
  • >10% change in A/R
  • Change in gross profit + 1
  • >10% change in gross profit
  • Gross profit / assets
  • Revenue minus gross profit
  • Cash / assets
  • Log of assets
  • PP&E / assets
  • Working capital

Approach

fit_1990s <- glm(AAER ~ ebit + ni_revt + ni_at + log_lt + ltl_at + lt_seq +
                   lt_at + act_lct + aq_lct + wcap_at + invt_revt + invt_at +
                   ni_ppent + rect_revt + revt_at + d_revt + b_rect + b_rect +  # duplicated b_rect; R keeps only one
                   r_gp + b_gp + gp_at + revt_m_gp + ch_at + log_at +
                   ppent_at + wcap,
                 data=df[df$Test==0,],
                 family=binomial)
summary(fit_1990s)
## 
## Call:
## glm(formula = AAER ~ ebit + ni_revt + ni_at + log_lt + ltl_at + 
##     lt_seq + lt_at + act_lct + aq_lct + wcap_at + invt_revt + 
##     invt_at + ni_ppent + rect_revt + revt_at + d_revt + b_rect + 
##     b_rect + r_gp + b_gp + gp_at + revt_m_gp + ch_at + log_at + 
##     ppent_at + wcap, family = binomial, data = df[df$Test == 
##     0, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1391  -0.2275  -0.1661  -0.1190   3.6236  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.660e+00  8.336e-01  -5.591 2.26e-08 ***
## ebit        -3.564e-04  1.094e-04  -3.257  0.00112 ** 
## ni_revt      3.664e-02  3.058e-02   1.198  0.23084    
## ni_at       -3.196e-01  2.325e-01  -1.374  0.16932    
## log_lt       1.494e-01  3.409e-01   0.438  0.66118    
## ltl_at      -2.306e-01  7.072e-01  -0.326  0.74438    
## lt_seq      -2.826e-05  4.567e-04  -0.062  0.95067    
## lt_at       -8.559e-01  9.270e-01  -0.923  0.35586    
## act_lct      1.401e-01  7.005e-02   2.000  0.04546 *  
## aq_lct      -1.751e-01  9.156e-02  -1.912  0.05588 .  
## wcap_at     -8.488e-01  6.487e-01  -1.308  0.19075    
## invt_revt   -9.165e-01  9.679e-01  -0.947  0.34371    
## invt_at      2.350e+00  1.033e+00   2.276  0.02286 *  
## ni_ppent     2.703e-03  1.203e-02   0.225  0.82223    
## rect_revt    6.587e-02  7.427e-02   0.887  0.37511    
## revt_at     -4.790e-01  1.734e-01  -2.763  0.00572 ** 
## d_revt      -7.848e-04  1.586e-03  -0.495  0.62079    
## b_rect       8.212e-02  1.569e-01   0.524  0.60060    
## r_gp        -1.055e-04  6.623e-04  -0.159  0.87345    
## b_gp         1.266e-01  1.556e-01   0.813  0.41605    
## gp_at        1.031e-01  3.512e-01   0.293  0.76921    
## revt_m_gp    2.621e-05  1.094e-05   2.396  0.01655 *  
## ch_at       -5.356e-01  7.909e-01  -0.677  0.49829    
## log_at       2.687e-01  3.483e-01   0.771  0.44045    
## ppent_at    -2.956e+00  4.798e-01  -6.161 7.23e-10 ***
## wcap         7.009e-05  5.187e-05   1.351  0.17658    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2118.6  on 10268  degrees of freedom
## Residual deviance: 1928.8  on 10243  degrees of freedom
##   (1315 observations deleted due to missingness)
## AIC: 1980.8
## 
## Number of Fisher Scoring iterations: 12

ROC

##     In sample AUC Out of sample AUC 
##         0.7483132         0.7292981
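The AUC values above can be computed with any ROC package; as a package-free sketch on synthetic data (the names auc, x, y, test, and fit here are illustrative, not from df):

```r
# AUC via the rank (Wilcoxon) formulation -- equivalent to the area
# under the ROC curve, with no package dependencies
auc <- function(y, p) {
  r <- rank(p)  # average ranks handle ties
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(42)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3 + x))   # rare event, like AAERs
test <- rep(c(0, 1), each = n / 2)  # first half train, second half test
fit <- glm(y ~ x, family = binomial, subset = test == 0)
p <- predict(fit, data.frame(x = x), type = "response")
c(`In sample AUC`     = auc(y[test == 0], p[test == 0]),
  `Out of sample AUC` = auc(y[test == 1], p[test == 1]))
```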

The 2011 follow up

The 2011 model

  • Log of assets
  • Total accruals
  • % change in A/R
  • % change in inventory
  • % soft assets
  • % change in sales from cash
  • % change in ROA
  • Indicator for stock/bond issuance
  • Indicator for operating leases
  • BV equity / MV equity
  • Lag of stock return minus value weighted market return
  • Below are BCE’s additions
  • Indicator for mergers
  • Indicator for Big N auditor
  • Indicator for medium size auditor
  • Total financing raised
  • Net amount of new capital raised
  • Indicator for restructuring

Based on Dechow, Ge, Larson and Sloan (2011)

The model

fit_2011 <- glm(AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
                  soft_assets + pct_chg_cashsales + chg_roa + issuance +
                  oplease_dum + book_mkt + lag_sdvol + merger + bigNaudit +
                  midNaudit + cffin + exfin + restruct,
                 data=df[df$Test==0,],
                 family=binomial)
summary(fit_2011)
## 
## Call:
## glm(formula = AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv + 
##     soft_assets + pct_chg_cashsales + chg_roa + issuance + oplease_dum + 
##     book_mkt + lag_sdvol + merger + bigNaudit + midNaudit + cffin + 
##     exfin + restruct, family = binomial, data = df[df$Test == 
##     0, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8434  -0.2291  -0.1658  -0.1196   3.2614  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.1474558  0.5337491 -13.391  < 2e-16 ***
## logtotasset        0.3214322  0.0355467   9.043  < 2e-16 ***
## rsst_acc          -0.2190095  0.3009287  -0.728   0.4667    
## chg_recv           1.1020740  1.0590837   1.041   0.2981    
## chg_inv            0.0389504  1.2507142   0.031   0.9752    
## soft_assets        2.3094551  0.3325731   6.944 3.81e-12 ***
## pct_chg_cashsales -0.0006912  0.0108771  -0.064   0.9493    
## chg_roa           -0.2697984  0.2554262  -1.056   0.2908    
## issuance           0.1443841  0.3187606   0.453   0.6506    
## oplease_dum       -0.2029022  0.1970597  -1.030   0.3032    
## book_mkt           0.0150113  0.0110652   1.357   0.1749    
## lag_sdvol          0.0517368  0.0555338   0.932   0.3515    
## merger             0.3529102  0.1511729   2.334   0.0196 *  
## bigNaudit         -0.1998454  0.3598283  -0.555   0.5786    
## midNaudit         -0.4894029  0.5118388  -0.956   0.3390    
## cffin              0.4557560  0.3435253   1.327   0.1846    
## exfin             -0.0053608  0.0393354  -0.136   0.8916    
## restruct           0.3827721  0.1471729   2.601   0.0093 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2381.9  on 11583  degrees of freedom
## Residual deviance: 2189.7  on 11566  degrees of freedom
## AIC: 2225.7
## 
## Number of Fisher Scoring iterations: 7

ROC

##     In sample AUC Out of sample AUC 
##         0.7445378         0.6849225

Late 2000s/early 2010s approach

The late 2000s/early 2010s model

  • Log of # of bullet points + 1
  • # of characters in file header
  • # of excess newlines
  • Amount of html tags
  • Length of cleaned file, characters
  • Mean sentence length, words
  • S.D. of word length
  • S.D. of paragraph length (sentences)
  • Word choice variation
  • Readability
    • Coleman Liau Index
    • Fog Index
  • % active voice sentences
  • % passive voice sentences
  • # of all cap words
  • # of !
  • # of ?

From a variety of papers

Theory

  • Generally pulled from the communications literature
    • Sometimes ad hoc
  • The main idea:
    • Companies that are misreporting probably write their annual report differently

The late 2000s/early 2010s model

fit_2000s <- glm(AAER ~ bullets + headerlen + newlines + alltags +
                   processedsize + sentlen_u + wordlen_s + paralen_s +
                   repetitious_p + sentlen_s + typetoken + clindex + fog +
                   active_p + passive_p + lm_negative_p + lm_positive_p +
                   allcaps + exclamationpoints + questionmarks,
                 data=df[df$Test==0,],
                 family=binomial)
summary(fit_2000s)
## 
## Call:
## glm(formula = AAER ~ bullets + headerlen + newlines + alltags + 
##     processedsize + sentlen_u + wordlen_s + paralen_s + repetitious_p + 
##     sentlen_s + typetoken + clindex + fog + active_p + passive_p + 
##     lm_negative_p + lm_positive_p + allcaps + exclamationpoints + 
##     questionmarks, family = binomial, data = df[df$Test == 0, 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9604  -0.2244  -0.1984  -0.1749   3.2318  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -5.662e+00  3.143e+00  -1.801  0.07165 .  
## bullets           -2.635e-05  2.625e-05  -1.004  0.31558    
## headerlen         -2.943e-04  3.477e-04  -0.846  0.39733    
## newlines          -4.821e-05  1.220e-04  -0.395  0.69271    
## alltags            5.060e-08  2.567e-07   0.197  0.84376    
## processedsize      5.709e-06  1.287e-06   4.435 9.19e-06 ***
## sentlen_u         -3.790e-02  6.897e-02  -0.550  0.58259    
## wordlen_s          1.278e-01  1.199e+00   0.107  0.91510    
## paralen_s         -4.808e-02  3.052e-02  -1.576  0.11514    
## repetitious_p     -1.673e+00  1.665e+00  -1.005  0.31508    
## sentlen_s         -2.098e-02  2.222e-02  -0.944  0.34518    
## typetoken         -3.609e-01  1.729e+00  -0.209  0.83469    
## clindex            2.744e-01  1.519e-01   1.806  0.07085 .  
## fog               -1.839e-02  1.322e-01  -0.139  0.88935    
## active_p          -4.321e-01  1.459e+00  -0.296  0.76711    
## passive_p          7.321e-02  3.204e+00   0.023  0.98177    
## lm_negative_p      1.356e+01  1.304e+01   1.039  0.29875    
## lm_positive_p     -9.911e+01  3.842e+01  -2.580  0.00989 ** 
## allcaps           -4.646e-04  1.835e-04  -2.532  0.01136 *  
## exclamationpoints -3.520e-01  1.944e-01  -1.811  0.07010 .  
## questionmarks      3.371e-03  2.816e-02   0.120  0.90471    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2381.9  on 11583  degrees of freedom
## Residual deviance: 2329.2  on 11563  degrees of freedom
## AIC: 2371.2
## 
## Number of Fisher Scoring iterations: 10

ROC

##     In sample AUC Out of sample AUC 
##         0.6377783         0.6295414

Combining the 2000s and 2011 models

Why is it appropriate to combine the 2011 model with the 2000s model?

  • 2011 model: Parsimonious financial model
  • 2000s model: Textual characteristics

The model

fit_2000f <- glm(AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
                   soft_assets + pct_chg_cashsales + chg_roa + issuance +
                   oplease_dum + book_mkt + lag_sdvol + merger + bigNaudit +
                   midNaudit + cffin + exfin + restruct + bullets + headerlen +
                   newlines + alltags + processedsize + sentlen_u + wordlen_s +
                   paralen_s + repetitious_p + sentlen_s + typetoken +
                   clindex + fog + active_p + passive_p + lm_negative_p +
                   lm_positive_p + allcaps + exclamationpoints + questionmarks,
                 data=df[df$Test==0,],
                 family=binomial)
summary(fit_2000f)
## 
## Call:
## glm(formula = AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv + 
##     soft_assets + pct_chg_cashsales + chg_roa + issuance + oplease_dum + 
##     book_mkt + lag_sdvol + merger + bigNaudit + midNaudit + cffin + 
##     exfin + restruct + bullets + headerlen + newlines + alltags + 
##     processedsize + sentlen_u + wordlen_s + paralen_s + repetitious_p + 
##     sentlen_s + typetoken + clindex + fog + active_p + passive_p + 
##     lm_negative_p + lm_positive_p + allcaps + exclamationpoints + 
##     questionmarks, family = binomial, data = df[df$Test == 0, 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9514  -0.2237  -0.1596  -0.1110   3.3882  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.634e+00  3.415e+00  -0.479  0.63223    
## logtotasset        3.437e-01  3.921e-02   8.766  < 2e-16 ***
## rsst_acc          -2.123e-01  2.995e-01  -0.709  0.47844    
## chg_recv           1.022e+00  1.055e+00   0.969  0.33274    
## chg_inv           -9.464e-02  1.226e+00  -0.077  0.93845    
## soft_assets        2.573e+00  3.387e-01   7.598 3.01e-14 ***
## pct_chg_cashsales -1.134e-03  1.103e-02  -0.103  0.91811    
## chg_roa           -2.696e-01  2.470e-01  -1.092  0.27504    
## issuance           1.471e-01  3.220e-01   0.457  0.64777    
## oplease_dum       -2.859e-01  2.020e-01  -1.416  0.15691    
## book_mkt           1.531e-02  1.115e-02   1.373  0.16967    
## lag_sdvol          6.147e-02  5.418e-02   1.135  0.25655    
## merger             3.699e-01  1.536e-01   2.408  0.01604 *  
## bigNaudit         -1.992e-01  3.638e-01  -0.548  0.58392    
## midNaudit         -4.839e-01  5.139e-01  -0.942  0.34635    
## cffin              4.608e-01  3.599e-01   1.280  0.20044    
## exfin             -1.235e-03  5.907e-02  -0.021  0.98332    
## restruct           3.035e-01  1.555e-01   1.952  0.05099 .  
## bullets           -1.958e-05  2.519e-05  -0.777  0.43692    
## headerlen         -5.792e-04  4.038e-04  -1.434  0.15151    
## newlines          -7.423e-05  1.294e-04  -0.574  0.56615    
## alltags           -1.338e-07  2.701e-07  -0.495  0.62036    
## processedsize      4.111e-06  1.446e-06   2.844  0.00446 ** 
## sentlen_u         -5.086e-02  6.950e-02  -0.732  0.46433    
## wordlen_s         -1.618e+00  1.303e+00  -1.241  0.21450    
## paralen_s         -3.934e-02  2.923e-02  -1.346  0.17834    
## repetitious_p     -1.433e+00  1.658e+00  -0.864  0.38740    
## sentlen_s         -1.690e-02  2.212e-02  -0.764  0.44502    
## typetoken         -1.216e+00  1.788e+00  -0.680  0.49643    
## clindex            9.620e-02  1.521e-01   0.632  0.52718    
## fog                1.063e-02  1.339e-01   0.079  0.93675    
## active_p           6.259e-01  1.435e+00   0.436  0.66271    
## passive_p         -5.848e+00  3.796e+00  -1.541  0.12337    
## lm_negative_p      2.012e+01  1.220e+01   1.649  0.09906 .  
## lm_positive_p     -3.700e+01  3.864e+01  -0.958  0.33825    
## allcaps           -4.721e-04  1.912e-04  -2.469  0.01354 *  
## exclamationpoints -4.510e-01  2.024e-01  -2.228  0.02586 *  
## questionmarks      4.096e-03  2.853e-02   0.144  0.88586    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2381.9  on 11583  degrees of freedom
## Residual deviance: 2143.0  on 11546  degrees of freedom
## AIC: 2219
## 
## Number of Fisher Scoring iterations: 10

ROC

##     In sample AUC Out of sample AUC 
##         0.7664115         0.7147021

The BCE model

The BCE approach

  • Retain the variables from the other regressions
  • Add in a machine-learning based measure quantifying how much documents talked about different topics common across all filings
    • Learned on just the 1999-2003 filings

What the topics look like

Theory behind the BCE model

Why use document content?

The model

BCE_eq = as.formula(paste("AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
  soft_assets + pct_chg_cashsales + chg_roa + issuance +
  oplease_dum + book_mkt + lag_sdvol + merger + bigNaudit +
  midNaudit + cffin + exfin + restruct + bullets + headerlen +
  newlines + alltags + processedsize + sentlen_u + wordlen_s +
  paralen_s + repetitious_p + sentlen_s + typetoken +
  clindex + fog + active_p + passive_p + lm_negative_p +
  lm_positive_p + allcaps + exclamationpoints + questionmarks + ",
  paste(paste0("Topic_",1:30,"_n_oI"), collapse=" + "), collapse=""))
fit_BCE <- glm(BCE_eq,
               data=df[df$Test==0,],
               family=binomial)
summary(fit_BCE)
## 
## Call:
## glm(formula = BCE_eq, family = binomial, data = df[df$Test == 
##     0, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0887  -0.2212  -0.1478  -0.0940   3.5401  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -8.032e+00  3.872e+00  -2.074  0.03806 *  
## logtotasset        3.879e-01  4.554e-02   8.519  < 2e-16 ***
## rsst_acc          -1.938e-01  3.055e-01  -0.634  0.52593    
## chg_recv           8.581e-01  1.071e+00   0.801  0.42296    
## chg_inv           -2.607e-01  1.223e+00  -0.213  0.83119    
## soft_assets        2.555e+00  3.796e-01   6.730  1.7e-11 ***
## pct_chg_cashsales -1.976e-03  6.997e-03  -0.282  0.77767    
## chg_roa           -2.532e-01  2.786e-01  -0.909  0.36354    
## issuance           9.692e-02  3.269e-01   0.296  0.76687    
## oplease_dum       -3.451e-01  2.097e-01  -1.645  0.09989 .  
## book_mkt           1.361e-02  1.151e-02   1.183  0.23692    
## lag_sdvol          4.546e-02  5.709e-02   0.796  0.42589    
## merger             3.224e-01  1.572e-01   2.051  0.04027 *  
## bigNaudit         -2.010e-01  3.711e-01  -0.542  0.58804    
## midNaudit         -4.641e-01  5.195e-01  -0.893  0.37169    
## cffin              6.024e-01  3.769e-01   1.598  0.10998    
## exfin             -6.738e-03  4.063e-02  -0.166  0.86831    
## restruct           1.915e-01  1.593e-01   1.202  0.22920    
## bullets           -1.612e-05  2.526e-05  -0.638  0.52334    
## headerlen         -3.956e-04  3.722e-04  -1.063  0.28776    
## newlines          -1.038e-04  1.333e-04  -0.779  0.43623    
## alltags           -1.338e-07  2.772e-07  -0.483  0.62934    
## processedsize      4.178e-06  1.477e-06   2.828  0.00468 ** 
## sentlen_u         -7.131e-02  7.881e-02  -0.905  0.36553    
## wordlen_s          4.413e-01  1.430e+00   0.309  0.75761    
## paralen_s         -4.584e-02  2.976e-02  -1.540  0.12356    
## repetitious_p     -1.525e+00  1.780e+00  -0.857  0.39152    
## sentlen_s         -7.700e-04  2.238e-02  -0.034  0.97255    
## typetoken          5.313e-02  1.934e+00   0.027  0.97809    
## clindex            6.406e-02  1.669e-01   0.384  0.70116    
## fog                7.432e-02  1.590e-01   0.467  0.64018    
## active_p          -3.159e-01  1.628e+00  -0.194  0.84620    
## passive_p         -6.714e+00  4.192e+00  -1.602  0.10926    
## lm_negative_p      3.558e+00  1.422e+01   0.250  0.80240    
## lm_positive_p     -8.906e+01  4.641e+01  -1.919  0.05497 .  
## allcaps           -4.189e-04  1.916e-04  -2.186  0.02878 *  
## exclamationpoints -4.685e-01  2.143e-01  -2.187  0.02878 *  
## questionmarks      4.424e-03  2.876e-02   0.154  0.87774    
## Topic_1_n_oI       3.157e+01  1.137e+02   0.278  0.78136    
## Topic_2_n_oI       6.382e+01  6.784e+01   0.941  0.34684    
## Topic_3_n_oI      -1.853e+02  1.457e+02  -1.272  0.20335    
## Topic_4_n_oI      -4.247e+01  7.460e+01  -0.569  0.56919    
## Topic_5_n_oI      -8.240e+00  8.061e+01  -0.102  0.91858    
## Topic_6_n_oI      -5.072e+02  1.793e+02  -2.829  0.00468 ** 
## Topic_7_n_oI       1.994e+01  8.192e+01   0.243  0.80772    
## Topic_8_n_oI      -9.620e+01  8.419e+01  -1.143  0.25321    
## Topic_9_n_oI       5.963e+01  7.064e+01   0.844  0.39863    
## Topic_10_n_oI     -4.641e+01  6.705e+01  -0.692  0.48885    
## Topic_11_n_oI     -1.415e+02  7.384e+01  -1.916  0.05536 .  
## Topic_12_n_oI      2.147e-01  8.441e+01   0.003  0.99797    
## Topic_13_n_oI     -1.420e+02  2.612e+02  -0.543  0.58686    
## Topic_14_n_oI      2.086e+01  7.840e+01   0.266  0.79019    
## Topic_15_n_oI     -3.519e+01  6.332e+01  -0.556  0.57840    
## Topic_16_n_oI      1.629e+01  1.156e+02   0.141  0.88793    
## Topic_17_n_oI     -5.188e+01  8.901e+01  -0.583  0.56000    
## Topic_18_n_oI      2.239e+01  9.340e+01   0.240  0.81058    
## Topic_19_n_oI     -1.071e+02  8.063e+01  -1.328  0.18422    
## Topic_20_n_oI     -4.885e+01  9.701e+01  -0.504  0.61455    
## Topic_21_n_oI     -1.515e+02  9.116e+01  -1.662  0.09654 .  
## Topic_22_n_oI     -6.818e+00  6.472e+01  -0.105  0.91610    
## Topic_23_n_oI     -1.226e+01  6.965e+01  -0.176  0.86031    
## Topic_24_n_oI     -6.545e+01  9.523e+01  -0.687  0.49189    
## Topic_25_n_oI     -6.863e+01  7.094e+01  -0.967  0.33336    
## Topic_26_n_oI     -2.182e+00  6.580e+01  -0.033  0.97354    
## Topic_27_n_oI      9.374e+00  1.418e+02   0.066  0.94729    
## Topic_28_n_oI      5.882e+00  7.937e+01   0.074  0.94092    
## Topic_29_n_oI     -4.209e+01  6.670e+01  -0.631  0.52798    
## Topic_30_n_oI     -4.605e+02  2.672e+02  -1.724  0.08477 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2381.9  on 11583  degrees of freedom
## Residual deviance: 2069.5  on 11516  degrees of freedom
## AIC: 2205.5
## 
## Number of Fisher Scoring iterations: 11

ROC

##     In sample AUC Out of sample AUC 
##         0.7941841         0.7599594
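As an aside, BCE_eq above was assembled by pasting the 30 topic variable names into a formula string; the same trick in miniature:

```r
# Build a right-hand side programmatically, then convert to a formula
rhs <- paste(paste0("Topic_", 1:3, "_n_oI"), collapse = " + ")
f <- as.formula(paste("AAER ~ soft_assets +", rhs))
rhs  # "Topic_1_n_oI + Topic_2_n_oI + Topic_3_n_oI"
```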

Comparison across all models

##        1990s         2011        2000s 2000s + 2011          BCE 
##    0.7483132    0.7445378    0.6377783    0.7664115    0.7941841

Simplifying models with LASSO

What is LASSO?

  • Least Absolute Shrinkage and Selection Operator
    • Least absolute: uses an error term like |\varepsilon|
    • Shrinkage: it will make coefficients smaller
      • Less sensitive → fewer overfitting issues
    • Selection: it will completely remove some variables
      • Fewer variables → fewer overfitting issues
  • Sometimes called L^1 regularization
    • L^1 means 1-dimensional distance, i.e., |\varepsilon|

Great if you have way too many inputs in your model

  • This is how we can, in theory, put more variables in our model than data points

How does it work?

\min_{\beta\in \mathbb{R}^p} \left\{\frac{1}{N}\left\lVert\varepsilon\right\rVert_2^2 + \lambda \left\lVert\beta\right\rVert_1\right\}
  • Add an additional penalty term that is increasing in the absolute value of each \beta
    • Incentivizes lower \betas, shrinking them
  • The selection part is explainable geometrically
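One way to see both effects algebraically: under the simplifying assumption of orthonormal predictors (a special case, not the general setting), the LASSO solution soft-thresholds each OLS coefficient, shrinking it by \lambda and zeroing it out entirely once it is small enough:

```latex
\hat{\beta}_j^{\text{LASSO}} = \operatorname{sign}\left(\hat{\beta}_j^{\text{OLS}}\right)\left(\left|\hat{\beta}_j^{\text{OLS}}\right| - \lambda\right)_+
```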

Package for LASSO: glmnet

  1. All of its regression commands expect a y vector and an x matrix instead of our usual y ~ x formula
    • R has a helper function to convert a formula to a matrix: model.matrix()
      • Supply it the right hand side of the equation, starting with ~, and your data
      • It outputs the matrix x
    • Alternatively, use as.matrix() on a data frame of your input variables
  2. Its family argument should be specified in quotes, i.e., "binomial" instead of binomial
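A minimal base-R illustration of that conversion (toy data frame d, not the course data):

```r
# model.matrix turns a formula + data frame into the numeric x matrix
# glmnet expects; factor columns become dummy variables automatically
d <- data.frame(y = c(0, 1, 0, 1), x1 = 1:4,
                big4 = c("yes", "no", "yes", "no"))
x <- model.matrix(~ x1 + big4, data = d)[, -1]  # [,-1] drops the intercept
y <- d$y
colnames(x)  # "x1" "big4yes"
```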

What else can the package do?

Ridge regression

  • Similar to LASSO, but with an L^2 penalty (Euclidean norm)

Elastic net regression

How to run a LASSO

  • To run a simple LASSO model, use glmnet()
  • Let’s LASSO the BCE model
library(glmnet)
x <- model.matrix(BCE_eq, data=df[df$Test==0,])[,-1]  # [,-1] to remove intercept
y <- model.frame(BCE_eq, data=df[df$Test==0,])[,"AAER"]
fit_LASSO <- glmnet(x=x, y=y,
                    family = "binomial",
                    alpha = 1  # Specifies LASSO.  alpha = 0 is ridge
                    )

Visualizing LASSO

plot(fit_LASSO)

What’s under the hood?

print(fit_LASSO)
## 
## Call:  glmnet(x = x, y = y, family = "binomial", alpha = 1) 
## 
##       Df      %Dev    Lambda
##  [1,]  0 1.312e-13 1.433e-02
##  [2,]  1 8.060e-03 1.305e-02
##  [3,]  1 1.461e-02 1.189e-02
##  [4,]  1 1.995e-02 1.084e-02
##  [5,]  2 2.471e-02 9.874e-03
##  [6,]  2 3.219e-02 8.997e-03
##  [7,]  2 3.845e-02 8.197e-03
##  [8,]  2 4.371e-02 7.469e-03
##  [9,]  2 4.813e-02 6.806e-03
## [10,]  3 5.224e-02 6.201e-03
## [11,]  3 5.591e-02 5.650e-03
## [12,]  4 5.906e-02 5.148e-03
## [13,]  4 6.249e-02 4.691e-03
## [14,]  5 6.573e-02 4.274e-03
## [15,]  7 6.894e-02 3.894e-03
## [16,]  8 7.224e-02 3.548e-03
## [17,] 10 7.522e-02 3.233e-03
## [18,] 12 7.834e-02 2.946e-03
## [19,] 15 8.156e-02 2.684e-03
## [20,] 15 8.492e-02 2.446e-03
## [21,] 15 8.780e-02 2.229e-03
## [22,] 15 9.026e-02 2.031e-03
## [23,] 18 9.263e-02 1.850e-03
## [24,] 20 9.478e-02 1.686e-03
## [25,] 22 9.689e-02 1.536e-03
## [26,] 25 9.892e-02 1.400e-03
## [27,] 28 1.010e-01 1.275e-03
## [28,] 30 1.033e-01 1.162e-03
## [29,] 31 1.054e-01 1.059e-03
## [30,] 33 1.074e-01 9.647e-04
## [31,] 36 1.093e-01 8.790e-04
## [32,] 37 1.111e-01 8.009e-04
## [33,] 39 1.126e-01 7.297e-04
## [34,] 42 1.142e-01 6.649e-04
## [35,] 43 1.157e-01 6.058e-04
## [36,] 44 1.171e-01 5.520e-04
## [37,] 47 1.185e-01 5.030e-04
## [38,] 48 1.197e-01 4.583e-04
## [39,] 48 1.208e-01 4.176e-04
## [40,] 48 1.218e-01 3.805e-04
## [41,] 51 1.227e-01 3.467e-04
## [42,] 52 1.235e-01 3.159e-04
## [43,] 54 1.243e-01 2.878e-04
## [44,] 54 1.249e-01 2.623e-04
## [45,] 54 1.255e-01 2.390e-04
## [46,] 55 1.260e-01 2.177e-04
## [47,] 56 1.265e-01 1.984e-04
## [48,] 58 1.270e-01 1.808e-04
## [49,] 60 1.274e-01 1.647e-04
## [50,] 61 1.279e-01 1.501e-04
## [51,] 62 1.282e-01 1.367e-04
## [52,] 63 1.286e-01 1.246e-04
## [53,] 65 1.289e-01 1.135e-04
## [54,] 66 1.292e-01 1.034e-04
## [55,] 66 1.294e-01 9.425e-05
## [56,] 66 1.297e-01 8.588e-05
## [57,] 66 1.298e-01 7.825e-05
## [58,] 66 1.300e-01 7.130e-05
## [59,] 66 1.302e-01 6.496e-05
## [60,] 66 1.303e-01 5.919e-05
## [61,] 67 1.304e-01 5.393e-05
## [62,] 67 1.305e-01 4.914e-05
## [63,] 67 1.306e-01 4.478e-05
## [64,] 67 1.306e-01 4.080e-05
## [65,] 67 1.307e-01 3.717e-05
## [66,] 67 1.308e-01 3.387e-05
## [67,] 67 1.308e-01 3.086e-05
## [68,] 67 1.309e-01 2.812e-05
## [69,] 67 1.309e-01 2.562e-05
## [70,] 67 1.309e-01 2.335e-05
## [71,] 67 1.310e-01 2.127e-05
## [72,] 67 1.310e-01 1.938e-05
## [73,] 67 1.310e-01 1.766e-05
## [74,] 67 1.310e-01 1.609e-05
## [75,] 67 1.310e-01 1.466e-05

One of the 100 models

#coef(fit_LASSO, s=0.002031)
coefplot(fit_LASSO, lambda=0.002031, sort='magnitude')

How does this perform?

# na.pass has model.matrix retain NA values (so the # of rows is constant)
xp <- model.matrix(BCE_eq, data=df, na.action='na.pass')[,-1]
# s= specifies the version of the model to use
pred <- predict(fit_LASSO, xp, type="response", s = 0.002031)

##     In sample AUC Out of sample AUC 
##         0.7593828         0.7239785

Automating model selection

  • LASSO seems nice, but picking between the 100 models is tough!
  • glmnet also includes k-fold cross validation (default: k = 10)
    1. Randomly splits the data into k groups
    2. Runs the algorithm on k-1 of the groups (90% of the data when k = 10)
    3. Determines the best model
    4. Repeats steps 2 and 3 k-1 more times
    5. Uses the best overall model across all k hold-out samples
  • It gives 2 model options:
    • "lambda.min": The best performing model
    • "lambda.1se": The simplest model within 1 standard error of "lambda.min"
      • This is the better choice if you are concerned about overfitting

Running a cross validated model

# Cross validation
set.seed(697435)  # for reproducibility
cvfit <- cv.glmnet(x = x, y = y, family = "binomial", alpha = 1, type.measure = "auc")
plot(cvfit)

cvfit$lambda.min
## [1] 0.00139958
cvfit$lambda.1se
## [1] 0.002684268

Models

lambda.min

lambda.1se
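To see which variables each choice actually keeps, pull the nonzero coefficients out of the cv.glmnet fit. A standalone sketch on synthetic data (x_toy, y_toy, cv_toy are hypothetical stand-ins for our x, y, cvfit):

```r
library(glmnet)

set.seed(1)
x_toy <- matrix(rnorm(200 * 10), ncol = 10,
                dimnames = list(NULL, paste0("v", 1:10)))
y_toy <- rbinom(200, 1, plogis(x_toy[, 1] - x_toy[, 2]))
cv_toy <- cv.glmnet(x = x_toy, y = y_toy, family = "binomial", alpha = 1)

co <- coef(cv_toy, s = "lambda.1se")          # sparse coefficient vector
selected <- rownames(co)[as.vector(co) != 0]  # variables LASSO kept
selected
```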

CV LASSO performance

# s= specifies the version of the model to use
pred <- predict(cvfit, xp, type="response", s = "lambda.min")
pred2 <- predict(cvfit, xp, type="response", s = "lambda.1se")

##     In sample AUC, lambda.min Out of sample AUC, lambda.min 
##                     0.7665463                     0.7330212 
##     In sample AUC, lambda.1se Out of sample AUC, lambda.1se 
##                     0.7509946                     0.7124231

Drawbacks of LASSO

  1. No p-values on coefficients
    • Simple solution – run the resulting model with glm()
    • A direct solution exists, but only when using family="gaussian":
      • Run the LASSO using the lars package
        • m <- lars(x=x, y=y, type="lasso")
      • Then test coefficients using the covTest package
        • covTest(m, x, y)
  2. Generally worse in-sample performance
  3. Sometimes worse out-of-sample performance (in the short run)
    • BUT: predictions will be more stable

Wrap up

Predicting fraud

What other data could we use to predict corporate fraud?

  • What is the reason that this event or data would be useful for prediction?
    • I.e., how does it fit into your mental model?
  • What if we were…
    • Auditors?
    • Internal auditors?
    • Regulators?
    • Investors?

End matter

For next week

  • Next week:
    • Break week 😃
  • For two weeks from now:
    • Third individual assignment
      • On binary prediction
      • Finish by the end of Thursday
      • Can be done in pairs
      • Submit on eLearn
    • Datacamp
      • Practice a bit more to keep up to date
        • Using R more will make it more natural

Packages used for these slides