Session 1: Workflow and ML Regression

Getting started

First, we will load the packages we need for these exercises.
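A minimal set of imports covering the tools used in this session might look like the following; the exact list in the notebook may differ (e.g., it may also load seaborn or other helpers).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import (Lasso, LassoCV, ElasticNetCV,
                                  LogisticRegression, LogisticRegressionCV)
```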

Next, we need to import the dataset for the session's exercises.

While the original csv file is ~38MB, the gzipped version is only 17MB. Using gzipped (or zipped) files is a good way to save storage space while working with csv files in Python (and in R as well).
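pandas can read a gzipped csv directly, inferring the compression from the file extension. The path below is a placeholder for the session's actual data file.

```python
# pandas infers the compression from the .gz extension, so a gzipped csv
# loads just like an uncompressed one.  The file path is a placeholder.
df = pd.read_csv('data/session_1_data.csv.gz')
df.head()
```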

We will also define some lists for convenience later:

Below I have also defined some custom functions. Don't worry about these until we get to using them.

Splitting samples into training and testing

There are multiple ways to do this. The most robust is to separate the training and testing samples by time. We can do this using the techniques taught in part 1.

The second approach is to randomly assign observations to the training and testing samples. This is a reasonable choice if predictive performance is not the primary goal of the analysis, or if all of the data is from the same point in time. This can be done easily with scikit-learn. Note that scikit-learn expects the DVs and IVs to be in separate dataframes, so we will need to separate them first.
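As a sketch of the random split (the column name and the vars_topic list are stand-ins for the actual DV and IV names):

```python
# Separate the DV from the IVs, then randomly assign 80% of observations
# to training and 20% to testing.  Names here are placeholders.
y = df['sdvol1']         # DV: future stock return volatility
X = df[vars_topic]       # IVs: e.g., a list of topic measure columns
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
```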

Regressions with statsmodels

Linear regression

First, we will try a simple linear regression using statsmodels to replicate Brown, Crowley and Elliott's (2020) (henceforth BCE) replication of Bao and Datta (2014). This is simply a linear regression of future stock return volatility, sdvol1, on the topic measures from the BCE study.

We will use this as a stepping stone to get to the LASSO model.

Like with R, statsmodels' equation interface is quite flexible. For instance, we can apply functions like np.log() inside the equation, we can store the equation in a variable and simply pass the variable instead, or we can build the equation programmatically on the fly (as we will see with the last model).

Here we define the formula programmatically so as to keep things both succinct and easy to extend.
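A sketch of how this might look, assuming a train dataframe and a vars_topic list of IV names like those defined earlier:

```python
# Build the formula string from the list of topic measures, then fit OLS.
formula = 'sdvol1 ~ ' + ' + '.join(vars_topic)
fit_ols = smf.ols(formula=formula, data=train).fit()
fit_ols.summary()
```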

Logistic regression

Statsmodels supports many other types of regression, including logistic, probit, and Poisson, among others. A full list of supported types for your installed version of statsmodels can be found by running the below code:
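One simple way to do this is to inspect the public attributes of the formula API:

```python
# List the model constructors exposed by statsmodels' formula interface
# (ols, logit, probit, poisson, glm, ...) in your installed version.
import statsmodels.formula.api as smf
print([f for f in dir(smf) if not f.startswith('_')])
```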

For the logistic regression example, we will run a single window of the intentional restatement prediction test from BCE. All we need to do is use a different formula function from statsmodels and update our formula to match our desired equation.
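For example, assuming a binary intentional-restatement indicator (here called Restate_Int as a placeholder) and a list of IV names, the logistic model might look like:

```python
# Same pattern as the OLS model, but with smf.logit() and a binary DV.
formula_logit = 'Restate_Int ~ ' + ' + '.join(vars_bce)
fit_logit = smf.logit(formula=formula_logit, data=train).fit()
fit_logit.summary()
```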

Estimating how well a model works

For in-sample prediction, we can apply the .predict() method to our current data. For out-of-sample prediction, we can likewise apply .predict() to our testing data.
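For instance, with the fitted models from above (and train/test dataframes as placeholders):

```python
# In-sample predictions use the data the model was fit on;
# out-of-sample predictions use the held-out testing data.
Y_hat_train = fit_ols.predict(train)   # in-sample
Y_hat_test  = fit_ols.predict(test)    # out-of-sample
```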

What we really want are quantitative measures of how well a model works. For that, scikit-learn is very handy. In sklearn.metrics there are many measures available, including Mean Squared Error (MSE, the square of RMSE), Mean Absolute Error (MAE), NDCG@k, and ROC AUC.
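As an illustration, again using the placeholder names from above:

```python
# Continuous DV: MSE and MAE; binary DV: ROC AUC on predicted probabilities.
mse = metrics.mean_squared_error(test['sdvol1'], Y_hat_test)
mae = metrics.mean_absolute_error(test['sdvol1'], Y_hat_test)
auc = metrics.roc_auc_score(test['Restate_Int'], fit_logit.predict(test))
print(f'MSE: {mse:.4f}  MAE: {mae:.4f}  ROC AUC: {auc:.4f}')
```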

We will check in-sample and out-of-sample scores for the following:

As seen above, performance out-of-sample is quite similar to performance in-sample for predicting stock market volatility. This is a good sign for our model.

We can also plot out an ROC curve by calculating the True Positive rate and False Positive Rate. We can use metrics.roc_curve() to calculate these. Then we can display the curve using metrics.RocCurveDisplay(). Note that the function metrics.plot_roc_curve() used to exist, and worked with models run in scikit-learn itself. This is now rolled into metrics.RocCurveDisplay().
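A sketch of that workflow, using the logistic model's predicted probabilities from above:

```python
# Compute the FPR/TPR pairs, then hand them to RocCurveDisplay for plotting.
fpr, tpr, thresholds = metrics.roc_curve(test['Restate_Int'],
                                         fit_logit.predict(test))
disp = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=metrics.auc(fpr, tpr))
disp.plot()
plt.show()
```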

Using machine learning for regression

One of the simplest machine learning approaches to implement is LASSO. LASSO is available in scikit-learn as part of sklearn.linear_model. LASSO is not far removed from the OLS and logistic regressions we have seen -- the only difference is that LASSO adds an L1-norm penalty on the model coefficients. More specifically, the penalty is the sum of the absolute values of the coefficients multiplied by a penalty parameter.
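Concretely, in scikit-learn's parameterization the linear LASSO solves

$$\min_{\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 \;+\; \alpha\sum_{j=1}^{p}\left|\beta_j\right|,$$

where $\alpha$ is the penalty parameter; setting $\alpha=0$ recovers OLS. The logistic version instead uses an inverse penalty strength $C$, so a smaller $C$ means heavier shrinkage.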

Because of this new penalty parameter, we can no longer solve simply by minimizing the sum of squared errors! Instead, we also need to provide a value for the parameter, or ideally optimize it.

We will start by re-running our above analyses with pre-specified penalty values: $\alpha=0.1$ for the linear model and $C=0.1$ for the logistic model.
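A minimal sketch of those two fits, assuming the training data from earlier (y_train_binary is a placeholder for the restatement indicator):

```python
# Linear LASSO with a fixed penalty alpha.
reg_lasso = Lasso(alpha=0.1)
reg_lasso.fit(X_train, y_train)

# L1-penalized logistic regression; note that C is an inverse penalty,
# so C=0.1 is a fairly heavy penalty.  liblinear supports the L1 penalty.
reg_logit_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
reg_logit_l1.fit(X_train, y_train_binary)
```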

Linear regression with LASSO

Unfortunately, scikit-learn's functions lack a tidy summary of the results, so we need to code our own. Below are two possibilities. The first simply lists the variables and their respective coefficients. The second summarizes the same output, but plots the coefficients from most negative to most positive, aiming to make interpretation a bit easier. This is more or less a reworking of the glmnet integration in R's coefplot library. The code for this function is at the top of this file.
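A rough stand-in for those two summaries (the notebook's own coefplot-style function is more polished) might be:

```python
# 1. Plain listing of each IV and its LASSO coefficient.
coefs = pd.Series(reg_lasso.coef_, index=X_train.columns)
print(coefs)

# 2. Plot only the surviving (non-zero) coefficients, sorted from most
#    negative to most positive, for easier interpretation.
coefs[coefs != 0].sort_values().plot.barh(figsize=(6, 8))
plt.xlabel('Coefficient')
plt.show()
```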

Logistic Regression with LASSO

As with the linear LASSO, scikit-learn lacks a tidy summary for the logistic LASSO, so we reuse the same two approaches from above: a simple listing of the coefficients and the sorted coefficient plot.

Note how many variables were dropped in the above regression! This is a manifestation of the "Selection" part of LASSO.

Optimizing the LASSO parameter -- Linear regression

Cross-validation is built right into scikit-learn. We can simply call the correspondingly tweaked function, tell it how many folds we want, and let it optimize!
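For the linear case this is LassoCV; a minimal example with 5 folds:

```python
# 5-fold cross-validated LASSO; the alpha grid is chosen automatically.
reg_lasso_cv = LassoCV(cv=5)
reg_lasso_cv.fit(X_train, y_train)
print('Optimal alpha:', reg_lasso_cv.alpha_)
```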

For linear LASSO, the cross-validation approach optimizes $R^2$ (equivalently, it minimizes mean squared error). If this is not your target, then you would need to build your own CV routine (e.g., using scikit-learn's built-in GridSearchCV).

To see what exactly was going on in the CV routine above, we can replicate the LASSO output at each alpha it tested. The below code does this and then plots a graph showing how, as the penalty alpha increases, coefficients converge to 0.
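A simplified sketch of that replication (the notebook's own code and plots are more complete):

```python
# Refit the LASSO at each alpha from the CV grid and collect the coefficients.
coef_path = pd.DataFrame(
    [Lasso(alpha=a).fit(X_train, y_train).coef_ for a in reg_lasso_cv.alphas_],
    index=reg_lasso_cv.alphas_, columns=X_train.columns)

# Each line is one coefficient shrinking toward zero as the penalty grows.
coef_path.plot(legend=False, logx=True, figsize=(8, 5))
plt.xlabel('alpha (log scale)')
plt.ylabel('Coefficient value')
plt.show()
```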

To this end, I have provided two custom functions in this notebook: lasso_coefpath() and lasso_scorepath(). The first function allows you to visualize how the coefficients on each IV change as the penalty changes. The second function allows you to see how the optimized score changes as the penalty changes. The functions work for both linear and logistic CV LASSO.

Note: The below code takes quite a while to run. This is because the CV routine gains from running the computation in parallel, a benefit we do not get when fitting individual alphas one at a time. The warm_start=True parameter in the lasso_coefpath() definition attempts to reuse some of that overlap, saving roughly half the computation time.

Optimizing the LASSO parameter -- Logistic regression

The CV LASSO for logistic regression in scikit-learn is a bit more flexible than the linear LASSO. In particular, this function allows us to specify which score we would like to optimize. If the scoring= parameter is left off, it defaults to accuracy. However, since we are using ROC AUC to measure performance, we can specify that instead using scoring="roc_auc". A full list of scoring options is available in the scikit-learn documentation.
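For instance (again with the placeholder binary target):

```python
# CV L1-penalized logistic regression, selecting C by out-of-fold ROC AUC.
reg_logit_cv = LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear',
                                    scoring='roc_auc')
reg_logit_cv.fit(X_train, y_train_binary)
print('Optimal C:', reg_logit_cv.C_)
```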

Elastic nets with cross validation

Once you have the hang of LASSO, branching out to elastic net or ridge is very straightforward, as the functions for these follow more-or-less the same format. The only addition is needing to specify levels for l1_ratio: 1 is LASSO, 0 is Ridge, and any value in-between is elastic net. You can specify a list of values and let the cross-validation figure out which is best.
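A sketch for the linear case with ElasticNetCV (the l1_ratio grid here is just an illustrative choice):

```python
# Let cross-validation pick both the penalty strength and the LASSO/ridge mix.
reg_en_cv = ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1])
reg_en_cv.fit(X_train, y_train)
print('alpha:', reg_en_cv.alpha_, '  l1_ratio:', reg_en_cv.l1_ratio_)
```

For the logistic case, LogisticRegressionCV takes penalty='elasticnet' with solver='saga' and an l1_ratios= list instead.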

CV elastic net for OLS regression

CV elastic net for logistic regression

Final step: Comparing out of sample performance -- Logistic regression

From the graphs, we can see that both elastic net and LASSO outperform the plain logistic regression. The two are comparable to one another, each with an AUC of 0.66.