# Run this in the R Console inside RStudio
install.packages(c("tidyverse","plotly"))
Datacamp is optional. If you find the coding difficult in today’s lesson, you should go through the suggested datacamp chapters
# Run this in the R Console inside RStudio
install.packages(c("tidyverse","plotly"))
The format will generally all be filled out – you will just add to it, answer questions, analyze data, and explain your work. Instructions and hints are in the same file
#
and ##
, respectively```{r}
and end with ```
`r
and ending with `
*
or _
around text**
around text
**bold text**
becomes bold text
$
to use LaTeX notation
$\frac{revt}{at}$
becomes \(\frac{revt}{at}\)
$$
>
How can we predict revenue for a company, leveraging data about that company, related companies, and macro factors
Photo courtesy of Wikimedia Commons
\[ \hat{y}=\alpha + \beta \hat{x} + \varepsilon \]
I will refer to this as an OLS model – Ordinary Least Square regression
Let’s predict UOL’s revenue for 2016
lm()
~
is used in place of an equals sign
+
log()
*
A*B
includes: A
, B
, and A times B
in the model:
A:B
only includes A times B
in the model# Example:
lm(revt ~ at, data = uol)
Call:
lm(formula = revt ~ at, data = uol)
Residuals:
Min 1Q Median 3Q Max
-362.48 -141.73 -33.20 61.29 951.62
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.069230 75.749121 0.674 0.506
at 0.112330 0.007174 15.657 9.41e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 240.8 on 26 degrees of freedom
Multiple R-squared: 0.9041, Adjusted R-squared: 0.9004
F-statistic: 245.1 on 1 and 26 DF, p-value: 9.408e-15
$1 more in assets leads to $0.12 more revenue
library(tidyverse)
# uolg is defined in the class session code file
uolg %>% ggplot(aes(y=revt, x=at)) +
geom_point(aes(shape=point, group=point)) +
scale_shape_manual(values=c(NA,18)) +
geom_smooth(method="lm", se=FALSE) +
geom_errorbarh(aes(xmax=xright, xmin = xleft)) +
geom_errorbar(aes(ymax=ytop, ymin = ybottom)) +
theme(legend.position="none") + xlim(0, 20000) + ylim(0, 2500)
Abstracting a problem
If we don’t want to factor in firm size, we can use ratios to abstract away from it!
\[ \Delta x_t = \frac{x_t}{x_{t-1}} - 1 \]
lag()
function# tidyverse
uol <- uol %>%
mutate(revt_growth1 = revt / lag(revt) - 1)
# R way
uol$revt_growth2 = uol$revt / c(NA, uol$revt[-length(uol$revt)]) - 1
# Check that both ways are equivalent
identical(uol$revt_growth1, uol$revt_growth2)
[1] TRUE
You can use whichever you are comfortable with
mutate()
adds variables to an existing data frame
across()
applies a transformation to multiple columns in a data framestarts_with()
or ends_with()
to pick columns by pattern# Make the other needed change
uol <- uol %>%
mutate(at_growth = at / lag(at) - 1) %>% # Calculate asset growth
rename(revt_growth = revt_growth1) # Rename for readability
# Run the OLS model
mod2 <- lm(revt_growth ~ at_growth, data = uol)
summary(mod2)
Call:
lm(formula = revt_growth ~ at_growth, data = uol)
Residuals:
Min 1Q Median 3Q Max
-0.57261 -0.13261 -0.00151 0.15371 0.42832
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08725 0.05569 1.567 0.1298
at_growth 0.57277 0.29580 1.936 0.0642 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2496 on 25 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.1304, Adjusted R-squared: 0.09564
F-statistic: 3.749 on 1 and 25 DF, p-value: 0.06421
\[ \hat{y}=\alpha + \beta_1 \hat{x}_1 + \beta_2 \hat{x}_2 + \ldots + \varepsilon \]
We have… 464 variables from Compustat Global alone!
Now what?
Building a model requires careful thought!
What makes sense to add to our model?
We will use test statistics to test the hypotheses
\[ \begin{align*} \text{Theory:}\quad y &= \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \varepsilon \\ \text{Data:}\quad \hat{y} &= \alpha + \beta_1 \hat{x}_1 + \beta_2 \hat{x}_2 + \ldots + \hat{\varepsilon} \end{align*} \]
Models designed under a frequentist approach can only answer the question of “does this matter?”
Linear models are very flexible
Simple OLS measures a simple linear relationship between 1 input and 1 output
OLS measures simple linear relationships between a set of inputs and 1 output
IV/2SLS models linear relationships where the effect of some \(x_i\) on \(y\) may be confounded by outside factors.
SUR models systems with related error terms
3SLS models systems of equations with related outputs
SEM can model abstract and multi-level relationships
Pick what fits your problem!
There are many more model types though!
Own knowledge
Or: Why do we like simpler models so much?
An overfitted model works really well on its own data, and quite poorly on new data
“The P value is defined as the probability under the assumption of no effect or no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.”
– Dahiru 2008
All of these have p-values, except for AIC
There are many ways to measure a model, each with their own merits. They don’t always agree, and it’s on us to pick a reasonable measure.
\(A \rightarrow B\)
\[ A \rightarrow B \text{ or } A \leftarrow B ?\\ A_t \rightarrow B_{t+1} \]
\[ Revenue_{t+1} = Revenue_t + \ldots \]
It would be quite difficult for \(Revenue_{t+1}\) to cause \(Revenue_t\)
\[ A_t \rightarrow B_{t+1}? \quad \text{OR} \quad C \rightarrow A_t\text{ and }C \rightarrow B_{t+1}? \]
anova(mod2, mod3, test="Chisq")
Analysis of Variance Table
Model 1: revt_growth ~ at_growth
Model 2: revt_growth ~ act_growth + che_growth + lct_growth
Res.Df RSS Df Sum of Sq Pr(>Chi)
1 25 1.5580
2 23 1.2344 2 0.32359 0.04906 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A bit better at \(p<0.05\)
Adding more data usually helps improve predictions
Problems with our original question
What this revised question does better
uol <- uol %>% mutate(revt_lead = lead(revt)) # From dplyr
forecast1 <- lm(revt_lead ~ act + che + lct, data=uol)
library(broom) # Lets us view bigger regression outputs in a tidy fashion
tidy(forecast1) # Present regression output
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 235. 139. 1.69 0.104
2 act 0.548 0.145 3.77 0.000999
3 che -0.181 0.322 -0.561 0.580
4 lct -0.0700 0.242 -0.289 0.775
glance(forecast1) # Present regression statistics
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.826 0.803 337. 36.4 0.00000000675 3 -193. 397. 403.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
This model is ok, but we can do better.
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 148. 119. 1.24 0.228
2 revt 1.63 0.311 5.22 0.0000414
3 act 0.317 0.165 1.92 0.0687
4 che 0.124 0.322 0.384 0.705
5 lct -0.189 0.193 -0.981 0.338
6 dp -3.66 3.39 -1.08 0.293
7 ebit -3.63 0.995 -3.65 0.00159
glance(forecast2)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.929 0.907 231. 43.4 1.97e-10 6 -181. 379. 389.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
anova(forecast1, forecast2, test="Chisq")
Analysis of Variance Table
Model 1: revt_lead ~ act + che + lct
Model 2: revt_lead ~ revt + act + che + lct + dp + ebit
Res.Df RSS Df Sum of Sq Pr(>Chi)
1 23 2616067
2 20 1071637 3 1544429 2.439e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is better (Adj. \(R^2\), \(\chi^2\), AIC).
forecast3 <-
lm(revt_lead ~ revt + act + che + lct + dp + ebit,
data=df_clean[df_clean$fic=="SGP",])
tidy(forecast3)
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0134 7.86 0.00170 9.99e- 1
2 revt 0.652 0.0555 11.7 2.00e-27
3 act 0.154 0.0306 5.03 7.48e- 7
4 che 0.234 0.0807 2.90 3.98e- 3
5 lct 0.0768 0.0575 1.34 1.82e- 1
6 dp 1.63 0.748 2.17 3.04e- 2
7 ebit -0.802 0.206 -3.90 1.15e- 4
glance(forecast3)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.884 0.883 123. 488. 5.63e-176 6 -2427. 4870. 4902.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Lower adjusted \(R^2\) – This is worse? Why?
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 357. 666. 0.536 5.92e- 1
2 revt 1.03 0.00599 171. 0
3 act -0.0307 0.00602 -5.11 3.33e- 7
4 che 0.0274 0.0116 2.35 1.86e- 2
5 lct 0.0701 0.00919 7.63 2.78e-14
6 dp 0.237 0.166 1.42 1.55e- 1
7 ebit 0.0319 0.0490 0.651 5.15e- 1
glance(forecast4)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.948 0.948 46353. 15089. 0 6 -60617. 121249. 121302.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Higher adjusted \(R^2\) – better!
Why is the UOL model better than the Singapore model?
Data frames are usually wide panels
expectations <- read_csv("../../Data/Session_2-Macro.csv",
skip=10, na="na") %>% # Needed to load file
filter(row_number() < 21) %>% # Drop the footer
rename(industry=`Data Series`) %>% # Rename column
pivot_longer(!industry, names_to='yearQ',
values_to='fin_sentiment') %>% # Cast wide to long
mutate(year = as.numeric(substr(yearQ, 1, 4))) %>% # split out year
mutate(quarter = as.numeric(substr(yearQ, 6, 6))) %>% # split out quarter
select(-yearQ) # Remove measure
# extract out Q1, finance only
expectations_re <- expectations %>%
filter(quarter == 1, # Keep only the Q1
industry == "Real Estate") # Keep only real estate
Casting between data frame shapes
The pivot_wider()
and pivot_longer()
functions work well. See the dplyr documentation for more details. In the code above, the first argument is the columns to turn into rows. !industry
means all columns except industry
. names_to
(value_to
) specifies the variable name to contain the column names (data) after transforming to long.
*_join()
commands
left_join()
for merging a dataset into anotherinner_join()
for keeping only matched observationsouter_join()
for making all possible combinationsarrange()
command is easy to use
Merge in the finance sentiment data to our accounting data
# subset out our Singaporean data, since our macro data is Singapore-specific
df_SG <- df_clean %>% filter(fic == "SGP")
# Create year in df_SG (date is given by datadate as YYYYMMDD)
df_SG$year = round(df_SG$datadate / 10000, digits=0)
# Combine datasets
# Notice how it automatically figures out to join by "year"
df_SG_macro <- left_join(df_SG, expectations_re[,c("year","fin_sentiment")])
macro1 <- lm(revt_lead ~ revt + act + che + lct + dp + ebit + fin_sentiment,
data=df_SG_macro)
library(broom)
tidy(macro1)
# A tibble: 8 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.119 8.00 0.0149 9.88e- 1
2 revt 0.652 0.0563 11.6 1.01e-26
3 act 0.155 0.0316 4.90 1.41e- 6
4 che 0.231 0.0823 2.81 5.23e- 3
5 lct 0.0755 0.0582 1.30 1.96e- 1
6 dp 1.63 0.761 2.15 3.25e- 2
7 ebit -0.804 0.208 -3.86 1.35e- 4
8 fin_sentiment 0.0174 0.177 0.0980 9.22e- 1
It isn’t significant. Why is this?
Why isn’t the macro data significant?
fin_sentiment
is a constant scale…
df_SG_macro %>%
ggplot(aes(y=revt_lead,
x=fin_sentiment)) +
geom_point()
df_SG_macro %>%
ggplot(aes(y=revt_lead,
x=fin_sentiment * revt)) +
geom_point()
macro3 <-
lm(revt_lead ~ revt + act + che + lct + dp + ebit + fin_sentiment:revt,
data=df_SG_macro)
tidy(macro3)
# A tibble: 8 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.83 7.91 0.231 8.18e- 1
2 revt 0.655 0.0556 11.8 1.63e-27
3 act 0.133 0.0316 4.21 3.21e- 5
4 che 0.267 0.0821 3.25 1.24e- 3
5 lct 0.0619 0.0577 1.07 2.84e- 1
6 dp 1.94 0.757 2.57 1.06e- 2
7 ebit -0.804 0.206 -3.90 1.12e- 4
8 revt:fin_sentiment -0.00175 0.000596 -2.94 3.51e- 3
glance(macro3)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.887 0.885 123. 421. 1.28e-173 7 -2388. 4794. 4830.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Ensure that we use the same data (fin_sentiment is missing in 1994)
baseline <-
lm(revt_lead ~ revt + act + che + lct + dp + ebit,
data=df_SG_macro[!is.na(df_SG_macro$fin_sentiment),])
glance(baseline)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.884 0.882 124. 480. 3.97e-173 6 -2392. 4801. 4832.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(macro3)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.887 0.885 123. 421. 1.28e-173 7 -2388. 4794. 4830.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Adjusted \(R^2\) and AIC are slightly better with macro data
anova(baseline, macro3, test="Chisq")
Analysis of Variance Table
Model 1: revt_lead ~ revt + act + che + lct + dp + ebit
Model 2: revt_lead ~ revt + act + che + lct + dp + ebit + fin_sentiment:revt
Res.Df RSS Df Sum of Sq Pr(>Chi)
1 377 5799708
2 376 5669613 1 130094 0.003311 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Macro model definitely fits better than the baseline model!
Interpretating the macro variable
fin_sentiment
(36.1 points)
Building a model requires careful thought!
This is where having accounting and business knowledge comes in!
What other macro data would you like to add to this model?
p_uol <- predict(forecast2, uol[uol$fyear==2021,])
p_base <- predict(baseline,
df_SG_macro[df_SG_macro$isin=="SG1S83002349" & df_SG_macro$fyear==2021,])
p_macro <- predict(macro3,
df_SG_macro[df_SG_macro$isin=="SG1S83002349" & df_SG_macro$fyear==2021,])
p_world <- predict(forecast4,
df_clean[df_clean$isin=="SG1S83002349" & df_clean$fyear==2021,])
preds <- c(p_uol, p_base, p_macro, p_world)
names(preds) <- c("UOL 2022 UOL", "UOL 2022 Base", "UOL 2022 Macro",
"UOL 2022 World")
preds
UOL 2022 UOL UOL 2022 Base UOL 2022 Macro UOL 2022 World
3608.571 2834.237 2745.834 3136.901
# series vectors calculated here -- See appendix
rmse <- function(v1, v2) {
sqrt(mean((v1 - v2)^2, na.rm=T))
}
rmse <- c(rmse(actual_series, uol_series), rmse(actual_series, base_series),
rmse(actual_series, macro_series), rmse(actual_series, world_series))
names(rmse) <- c("UOL 2018 UOL", "UOL 2018 Base", "UOL 2018 Macro", "UOL 2018 World")
rmse
UOL 2018 UOL UOL 2018 Base UOL 2018 Macro UOL 2018 World
199.2242 274.2474 266.2979 455.7594
Why is UOL the best for in sample?
UOL is trained to minimize variation only in that context. It is potentially overfitted, meaning it won’t predict well out of sample. Out of sample prediction is much more useful than in sample, however.
UOL posted a $3.2B in revenue in 2022.
preds
UOL 2022 UOL UOL 2022 Base UOL 2022 Macro UOL 2022 World
3608.571 2834.237 2745.834 3136.901
Why is the global model better?
# Graph showing squared error (slide 4.6)
uolg <- uol[,c("at","revt")]
uolg$resid <- mod1$residuals
uolg$xleft <- ifelse(uolg$resid < 0,uolg$at,uolg$at - uolg$resid)
uolg$xright <- ifelse(uolg$resid < 0,uolg$at - uolg$resid, uol$at)
uolg$ytop <- ifelse(uolg$resid < 0,uolg$revt - uolg$resid,uol$revt)
uolg$ybottom <- ifelse(uolg$resid < 0,uolg$revt, uolg$revt - uolg$resid)
uolg$point <- TRUE
uolg2 <- uolg
uolg2$point <- FALSE
uolg2$at <- ifelse(uolg$resid < 0,uolg2$xright,uolg2$xleft)
uolg2$revt <- ifelse(uolg$resid < 0,uolg2$ytop,uolg2$ybottom)
uolg <- rbind(uolg, uolg2)
uolg %>% ggplot(aes(y=revt, x=at, group=point)) +
geom_point(aes(shape=point)) +
scale_shape_manual(values=c(NA,18)) +
geom_smooth(method="lm", se=FALSE) +
geom_errorbarh(aes(xmax=xright, xmin = xleft)) +
geom_errorbar(aes(ymax=ytop, ymin = ybottom)) +
theme(legend.position="none")
# Chart of mean revt_lead for Singaporean firms (slide 12.6)
df_clean %>% # Our data frame
filter(fic=="SGP") %>% # Select only Singaporean firms
group_by(isin) %>% # Group by firm
mutate(mean_revt_lead=mean(revt_lead, na.rm=T)) %>% # Determine each firm's mean revenue (lead)
slice(1) %>% # Take only the first observation for each group
ungroup() %>% # Ungroup (we don't need groups any more)
ggplot(aes(x=mean_revt_lead)) + # Initialize plot and select data
geom_histogram(aes(y = ..density..)) + # Plots the histogram as a density so that geom_density is visible
geom_density(alpha=.4, fill="#FF6666") # Plots smoothed density