Dr. Richard M. Crowley
# Run this in the R Console inside RStudio
install.packages(c("tidyverse","plotly","tufte","reshape2"))
The format will generally all be filled out – you will just add to it, answer questions, analyze data, and explain your work. Instructions and hints are in the same file
#
and ##
, respectively```{r}
and end with ```
`r `
and ending with `
*
or _
around text**
around text
**bold text**
becomes bold text$
to use LaTeX notation
$\frac{revt}{at}$
becomes \(\frac{revt}{at}\)$$
>
How can we predict revenue for a company, leveraging data about that company, related companies, and macro factors
\[ \hat{y}=\alpha + \beta \hat{x} + \varepsilon \]
I will refer to this as an OLS model – Ordinary Least Square regression
Let’s predict UOL’s revenue for 2016
# revt: Revenue, at: Assets
summary(uol[,c("revt", "at")])
## revt at
## Min. : 94.78 Min. : 1218
## 1st Qu.: 193.41 1st Qu.: 3044
## Median : 427.44 Median : 3478
## Mean : 666.38 Mean : 5534
## 3rd Qu.:1058.61 3rd Qu.: 7939
## Max. :2103.15 Max. :19623
~
is used in place of an equals sign
+
log()
*
A*B
includes, A, B, and A times B in the model:
A:B
just includes A times B in the model# Example:
lm(revt ~ at, data = uol)
mod1 <- lm(revt ~ at, data = uol)
summary(mod1)
##
## Call:
## lm(formula = revt ~ at, data = uol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -295.01 -101.29 -41.09 47.17 926.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.831399 67.491305 -0.205 0.839
## at 0.122914 0.009678 12.701 6.7e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.2 on 27 degrees of freedom
## Multiple R-squared: 0.8566, Adjusted R-squared: 0.8513
## F-statistic: 161.3 on 1 and 27 DF, p-value: 6.699e-13
\[ \Delta x_t = \frac{x_t}{x_{t-1}} - 1 \]
# tidyverse
uol <- uol %>%
mutate(revt_growth1 = revt / lag(revt) - 1)
# R way
uol$revt_growth2 = uol$revt / c(NA, uol$revt[-length(uol$revt)]) - 1
identical(uol$revt_growth1, uol$revt_growth2)
## [1] TRUE
You can use whichever you are comfortable with
mutate_all()
, mutate_at()
, mutate_if()
mutate_all()
applies a transformation to all values in a data frame and adds these to the data framemutate_at()
does this for a set of specified variablesmutate_if()
transforms all variables matching a condition
is.numeric
# Make the other needed change
uol <- uol %>%
mutate(at_growth = at / lag(at) - 1) %>% # Calculate asset growth
rename(revt_growth = revt_growth1) # Rename for readability
# Run the OLS model
mod2 <- lm(revt_growth ~ at_growth, data = uol)
summary(mod2)
##
## Call:
## lm(formula = revt_growth ~ at_growth, data = uol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57736 -0.10534 -0.00953 0.15132 0.42284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09024 0.05620 1.606 0.1204
## at_growth 0.53821 0.27717 1.942 0.0631 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2444 on 26 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1267, Adjusted R-squared: 0.09307
## F-statistic: 3.771 on 1 and 26 DF, p-value: 0.06307
\[ \hat{y}=\alpha + \beta_1 \hat{x}_1 + \beta_2 \hat{x}_2 + \ldots + \varepsilon \]
We have… 464 variables from Compustat Global alone!
Now what?
Building a model requires careful thought!
What makes sense to add to our model?
A specific test is one of an infinite number of replications
Focus on distributions and beliefs
A separate school of statistics thought
This is why we use more than 1 data point
detector <- function() {
dice <- sample(1:6, size=2, replace=TRUE)
if (sum(dice) == 12) {
"exploded"
} else {
"still there"
}
}
experiment <- replicate(1000,detector())
# p value
p <- sum(experiment == "still there") / 1000
if (p < 0.05) {
paste("p-value: ", p, "-- Fail to reject H_A, sun appears to have exploded")
} else {
paste("p-value: ", p, "-- Reject H_A that sun exploded")
}
## [1] "p-value: 0.97 -- Reject H_A that sun exploded"
Frequentist: The sun didn’t explode
\[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} \]
\[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} = \frac{\frac{35}{36}\times\sim 0}{\frac{1}{36}} = 35\times \sim 0 \approx 0 \]
Bayesian: The sun didn’t explode
We will use both to some extent – for our purposes, we will not debate the merits of either school of thought, but use tools derived from both
There are many ways to measure a model, each with their own merits. They don’t always agree, and it’s on us to pick a reasonable measure.
We will use test statistics to test the hypotheses
\[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \varepsilon \]
\[ \hat{y} = \alpha + \beta_1 \hat{x}_1 + \beta_2 \hat{x}_2 + \ldots + \hat{\varepsilon} \]
Models designed under a frequentist approach can only answer the question of “does this matter?”
Linear models are very flexible
Simple OLS measures a simple linear relationship between an input and an output
OLS measures simple linear relationships between a set of inputs and one output
IV/2SLS models linear relationships where the effect of some \(x_i\) on \(y\) may be confounded by outside factors.
SUR models systems with related error terms
3SLS models systems of equations with related outputs
SEM can model abstract and multi-level relationships
Pick what fits your problem!
There are many more model types though!
Own knowledge
Or: Why do we like simpler models so much?
An overfitted model works really well on its own data, and quite poorly on new data
“The P value is defined as the probability under the assumption of no effect or no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.”
– Dahiru 2008
All of these have p-values, except for AIC
\(A \rightarrow B\)
\(A \rightarrow B\) or \(A \leftarrow B\)?
\(A_t \rightarrow B_{t+1}\)
\[ Revenue_{t+1} = Revenue_t + \ldots \]
\(A_t \rightarrow B_{t+1}\)? \(\quad\) OR \(\quad\) \(C \rightarrow A_t\) and \(C \rightarrow B_{t+1}\)?
anova(mod2, mod3, test="Chisq")
## Analysis of Variance Table
##
## Model 1: revt_growth ~ at_growth
## Model 2: revt_growth ~ lct_growth + che_growth + ebit_growth
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 26 1.5534
## 2 24 1.1918 2 0.36168 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A bit better at \(p<0.05\)
Adding more data usually helps improve predictions
# Ensure firms have at least $1M (local currency), and have revenue
# df contains all real estate companies excluding North America
df_clean <- filter(df, df$at>1, df$revt>0)
# We cleaned out 578 observations!
print(c(nrow(df), nrow(df_clean)))
## [1] 5161 4583
# Another useful cleaning funtion:
# Replaces NaN, Inf, and -Inf with NA for all numeric variables in the data!
df_clean <- df_clean %>%
mutate_if(is.numeric, list(~replace(., !is.finite(.), NA)))
uol <- uol %>% mutate(revt_lead = lead(revt)) # From dplyr
forecast1 <- lm(revt_lead ~ lct + che + ebit, data=uol)
library(broom) # Lets us view bigger regression outputs in a tidy fashion
tidy(forecast1) # Present regression output
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 87.4 124. 0.707 0.486
## 2 lct 0.213 0.291 0.731 0.472
## 3 che 0.112 0.349 0.319 0.752
## 4 ebit 2.49 1.03 2.42 0.0236
glance(forecast1) # Present regression statistics
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.655 0.612 357. 15.2 9.39e-6 4 -202. 414. 421.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
This model is ok, but we can do better.
forecast2 <-
lm(revt_lead ~ revt + act + che + lct + dp + ebit , data=uol)
tidy(forecast2)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 15.6 97.0 0.161 0.874
## 2 revt 1.49 0.414 3.59 0.00174
## 3 act 0.324 0.165 1.96 0.0629
## 4 che 0.0401 0.310 0.129 0.898
## 5 lct -0.198 0.179 -1.10 0.283
## 6 dp 3.63 5.42 0.669 0.511
## 7 ebit -3.57 1.36 -2.62 0.0161
glance(forecast2)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.903 0.875 203. 32.5 1.41e-9 7 -184. 385. 396.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
anova(forecast1, forecast2, test="Chisq")
## Analysis of Variance Table
##
## Model 1: revt_lead ~ lct + che + ebit
## Model 2: revt_lead ~ revt + act + che + lct + dp + ebit
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 24 3059182
## 2 21 863005 3 2196177 1.477e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is better (Adj. \(R^2\), \(\chi^2\), AIC).
Data frames are usually wide panels
# Note the group_by -- without it, lead() will pull from the subsequent firm!
# ungroup() tells R that we finished grouping
df_clean <- df_clean %>%
group_by(isin) %>%
mutate(revt_lead = lead(revt)) %>%
ungroup()
forecast3 <-
lm(revt_lead ~ revt + act + che + lct + dp + ebit , data=df_clean[df_clean$fic=="SGP",])
tidy(forecast3)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 25.0 13.2 1.89 5.95e- 2
## 2 revt 0.505 0.0762 6.63 1.43e-10
## 3 act -0.0999 0.0545 -1.83 6.78e- 2
## 4 che 0.494 0.155 3.18 1.62e- 3
## 5 lct 0.396 0.0860 4.60 5.95e- 6
## 6 dp 4.46 1.55 2.88 4.21e- 3
## 7 ebit -0.951 0.271 -3.51 5.18e- 4
glance(forecast3)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.844 0.841 210. 291. 2.63e-127 7 -2237. 4489.
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
Lower adjusted \(R^2\) – This is worse? Why?
forecast4 <-
lm(revt_lead ~ revt + act + che + lct + dp + ebit , data=df_clean)
tidy(forecast4)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 222. 585. 0.379 7.04e- 1
## 2 revt 0.997 0.00655 152. 0.
## 3 act -0.00221 0.00547 -0.403 6.87e- 1
## 4 che -0.150 0.0299 -5.02 5.36e- 7
## 5 lct 0.0412 0.0113 3.64 2.75e- 4
## 6 dp 1.52 0.184 8.26 1.89e-16
## 7 ebit 0.308 0.0650 4.74 2.25e- 6
glance(forecast4)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.944 0.944 36459. 11299. 0 7 -47819. 95654.
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
Higher adjusted \(R^2\) – better!
Why is 1 model better while the other model is worse?
Statistical noise is random error in the data
Noise is OK, but the more we remove, the better!
forecast3.1 <-
lm(revt_lead ~ revt + act + che + lct + dp + ebit + factor(isin),
data=df_clean[df_clean$fic=="SGP",])
# n=7 to prevent outputting every fixed effect
print(tidy(forecast3.1), n=15)
## # A tibble: 27 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.58 39.4 0.0401 0.968
## 2 revt 0.392 0.0977 4.01 0.0000754
## 3 act -0.0538 0.0602 -0.894 0.372
## 4 che 0.304 0.177 1.72 0.0869
## 5 lct 0.392 0.0921 4.26 0.0000276
## 6 dp 4.71 1.73 2.72 0.00687
## 7 ebit -0.851 0.327 -2.60 0.00974
## 8 factor(isin)SG1AA6000000 218. 76.5 2.85 0.00463
## 9 factor(isin)SG1AD8000002 -11.7 67.4 -0.174 0.862
## 10 factor(isin)SG1AE2000006 4.02 79.9 0.0503 0.960
## 11 factor(isin)SG1AG0000003 -13.6 61.1 -0.223 0.824
## 12 factor(isin)SG1BG1000000 -0.901 69.5 -0.0130 0.990
## 13 factor(isin)SG1BI9000008 7.76 64.3 0.121 0.904
## 14 factor(isin)SG1DE5000007 -10.8 61.1 -0.177 0.860
## 15 factor(isin)SG1EE1000009 -6.90 66.7 -0.103 0.918
## # ... with 12 more rows
glance(forecast3.1)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.856 0.844 208. 69.4 1.15e-111 27 -2223. 4502.
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
anova(forecast3, forecast3.1, test="Chisq")
## Analysis of Variance Table
##
## Model 1: revt_lead ~ revt + act + che + lct + dp + ebit
## Model 2: revt_lead ~ revt + act + che + lct + dp + ebit + factor(isin)
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 324 14331633
## 2 304 13215145 20 1116488 0.1765
This isn’t much different. Why? There is another source of noise within Singapore real estate companies
library(lfe)
forecast3.2 <-
felm(revt_lead ~ revt + act + che + lct + dp + ebit | factor(isin),
data=df_clean[df_clean$fic=="SGP",])
summary(forecast3.2)
##
## Call:
## felm(formula = revt_lead ~ revt + act + che + lct + dp + ebit | factor(isin), data = df_clean[df_clean$fic == "SGP", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1181.88 -23.25 -1.87 18.03 1968.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## revt 0.39200 0.09767 4.013 7.54e-05 ***
## act -0.05382 0.06017 -0.894 0.37181
## che 0.30370 0.17682 1.718 0.08690 .
## lct 0.39209 0.09210 4.257 2.76e-05 ***
## dp 4.71275 1.73168 2.721 0.00687 **
## ebit -0.85080 0.32704 -2.602 0.00974 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 208.5 on 304 degrees of freedom
## (29 observations deleted due to missingness)
## Multiple R-squared(full model): 0.8558 Adjusted R-squared: 0.8435
## Multiple R-squared(proj model): 0.7806 Adjusted R-squared: 0.7618
## F-statistic(full model):69.41 on 26 and 304 DF, p-value: < 2.2e-16
## F-statistic(proj model): 180.3 on 6 and 304 DF, p-value: < 2.2e-16
What else could we do to improve our prediction model?
# Graph showing squared error (slide 4.6)
uolg <- uol[,c("at","revt")]
uolg$resid <- mod1$residuals
uolg$xleft <- ifelse(uolg$resid < 0,uolg$at,uolg$at - uolg$resid)
uolg$xright <- ifelse(uolg$resid < 0,uolg$at - uolg$resid, uol$at)
uolg$ytop <- ifelse(uolg$resid < 0,uolg$revt - uolg$resid,uol$revt)
uolg$ybottom <- ifelse(uolg$resid < 0,uolg$revt, uolg$revt - uolg$resid)
uolg$point <- TRUE
uolg2 <- uolg
uolg2$point <- FALSE
uolg2$at <- ifelse(uolg$resid < 0,uolg2$xright,uolg2$xleft)
uolg2$revt <- ifelse(uolg$resid < 0,uolg2$ytop,uolg2$ybottom)
uolg <- rbind(uolg, uolg2)
uolg %>% ggplot(aes(y=revt, x=at, group=point)) +
geom_point(aes(shape=point)) +
scale_shape_manual(values=c(NA,18)) +
geom_smooth(method="lm", se=FALSE) +
geom_errorbarh(aes(xmax=xright, xmin = xleft)) +
geom_errorbar(aes(ymax=ytop, ymin = ybottom)) +
theme(legend.position="none")
# Chart of mean revt_lead for Singaporean firms (slide 12.6)
df_clean %>% # Our data frame
filter(fic=="SGP") %>% # Select only Singaporean firms
group_by(isin) %>% # Group by firm
mutate(mean_revt_lead=mean(revt_lead, na.rm=T)) %>% # Determine each firm's mean revenue (lead)
slice(1) %>% # Take only the first observation for each group
ungroup() %>% # Ungroup (we don't need groups any more)
ggplot(aes(x=mean_revt_lead)) + # Initialize plot and select data
geom_histogram(aes(y = ..density..)) + # Plots the histogram as a density so that geom_density is visible
geom_density(alpha=.4, fill="#FF6666") # Plots smoothed density