The format will generally all be filled out – you will just add to it, answer questions, analyze data, and explain your work. Instructions and hints are in the same file
For each assignment, you will have until 2 classes from release to complete it (2 weeks)
#
and ##
, respectively```{r}
and end with ```
`r
and ending with `
*
or _
around text**
around text
**bold text**
becomes bold text$
to use LaTeX notation
$\frac{revt}{at}$
becomes \(\frac{revt}{at}\)$$
>
How can we predict revenue for a company, leveraging data about that company, related companies, and macro factors
\[ \hat{y}=\alpha + \beta \hat{x} + \varepsilon \]
I will refer to this as an OLS model – Ordinary Least Square regression
Let’s predict UOL’s revenue for 2016
## revt at
## Min. : 94.78 Min. : 1218
## 1st Qu.: 193.41 1st Qu.: 3044
## Median : 427.44 Median : 3478
## Mean : 666.38 Mean : 5534
## 3rd Qu.:1058.61 3rd Qu.: 7939
## Max. :2103.15 Max. :19623
lm()
~
is used in place of an equals sign
+
log()
*
A*B
includes, A, B, and A times B in the model:
A:B
just includes A times B in the model##
## Call:
## lm(formula = revt ~ at, data = uol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -295.01 -101.29 -41.09 47.17 926.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.831399 67.491305 -0.205 0.839
## at 0.122914 0.009678 12.701 6.7e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.2 on 27 degrees of freedom
## Multiple R-squared: 0.8566, Adjusted R-squared: 0.8513
## F-statistic: 161.3 on 1 and 27 DF, p-value: 6.699e-13
$1 more in assets leads to $0.12 more revenue
\[ \Delta x_t = \frac{x_t}{x_{t-1}} - 1 \]
package:tidyverse
’s package:dplyr
lag()
function# tidyverse
uol <- uol %>%
mutate(revt_growth1 = revt / lag(revt) - 1)
# R way
uol$revt_growth2 = uol$revt / c(NA, uol$revt[-length(uol$revt)]) - 1
identical(uol$revt_growth1, uol$revt_growth2)
## [1] TRUE
You can use whichever you are comfortable with
mutate()
adds variables to an existing data frame
mutate_all()
, mutate_at()
, mutate_if()
mutate_all()
applies a transformation to all values in a data frame and adds these to the data framemutate_at()
does this for a set of specified variablesmutate_if()
transforms all variables matching a condition
is.numeric
# Make the other needed change
uol <- uol %>%
mutate(at_growth = at / lag(at) - 1) %>% # Calculate asset growth
rename(revt_growth = revt_growth1) # Rename for readability
# Run the OLS model
mod2 <- lm(revt_growth ~ at_growth, data = uol)
summary(mod2)
##
## Call:
## lm(formula = revt_growth ~ at_growth, data = uol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57736 -0.10534 -0.00953 0.15132 0.42284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09024 0.05620 1.606 0.1204
## at_growth 0.53821 0.27717 1.942 0.0631 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2444 on 26 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1267, Adjusted R-squared: 0.09307
## F-statistic: 3.771 on 1 and 26 DF, p-value: 0.06307
\[ \hat{y}=\alpha + \beta_1 \hat{x}_1 + \beta_2 \hat{x}_2 + \ldots + \varepsilon \]
We have… 464 variables from Compustat Global alone!
Now what?
Building a model requires careful thought!
What makes sense to add to our model?
That’s a lot!
# lct: short term liabilities, che: cash and equivalents, ebit: EBIT
uol <- uol %>%
mutate_at(vars(lct, che, ebit), list(growth = ~. / lag(.) - 1)) # Calculate 3 growths
mod3 <- lm(revt_growth ~ lct_growth + che_growth + ebit_growth, data=uol)
summary(mod3)
##
## Call:
## lm(formula = revt_growth ~ lct_growth + che_growth + ebit_growth,
## data = uol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46531 -0.15097 0.00205 0.17601 0.31997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.07498 0.04915 1.526 0.14018
## lct_growth 0.23482 0.07319 3.209 0.00376 **
## che_growth -0.11561 0.09227 -1.253 0.22230
## ebit_growth 0.03808 0.02208 1.724 0.09751 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2228 on 24 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.33, Adjusted R-squared: 0.2462
## F-statistic: 3.94 on 3 and 24 DF, p-value: 0.02033
A specific test is one of an infinite number of replications
Focus on distributions and beliefs
A separate school of statistics thought
This is why we use more than 1 data point
detector <- function() {
dice <- sample(1:6, size=2, replace=TRUE)
if (sum(dice) == 12) {
"exploded"
} else {
"still there"
}
}
experiment <- replicate(1000,detector())
# p value
p <- sum(experiment == "still there") / 1000
if (p < 0.05) {
paste("p-value: ", p, "-- Fail to reject H_A, sun appears to have exploded")
} else {
paste("p-value: ", p, "-- Reject H_A that sun exploded")
}
## [1] "p-value: 0.972 -- Reject H_A that sun exploded"
Frequentist: The sun didn’t explode
\[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} \]
\[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} = \frac{\frac{35}{36}\times\sim 0}{\frac{1}{36}} = 35\times \sim 0 \approx 0 \]
Bayesian: The sun didn’t explode
We will use both to some extent – for our purposes, we will not debate the merits of either school of thought, but we will use tools derived from both
There are many ways to measure a model, each with their own merits. They don’t always agree, and it’s on us to pick a reasonable measure.
We will use test statistics to test the hypotheses
\[ \begin{align*} \text{Theory:}\quad y &= \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \varepsilon \\ \text{Data:}\quad \hat{y} &= \alpha + \beta_1 \hat{x}_1 + \beta_2 \hat{x}_2 + \ldots + \hat{\varepsilon} \end{align*} \]
Models designed under a frequentist approach can only answer the question of “does this matter?”
Linear models are very flexible
Simple OLS measures a simple linear relationship between an input and an output
OLS measures simple linear relationships between a set of inputs and one output
IV/2SLS models linear relationships where the effect of some \(x_i\) on \(y\) may be confounded by outside factors.
SUR models systems with related error terms
3SLS models systems of equations with related outputs
SEM can model abstract and multi-level relationships
Pick what fits your problem!
There are many more model types though!
Own knowledge
Or: Why do we like simpler models so much?
An overfitted model works really well on its own data, and quite poorly on new data
“The P value is defined as the probability under the assumption of no effect or no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.”
– Dahiru 2008
All of these have p-values, except for AIC
\(A \rightarrow B\)
\(A \rightarrow B\) or \(A \leftarrow B\)?
\(A_t \rightarrow B_{t+1}\)
\[ Revenue_{t+1} = Revenue_t + \ldots \]
\(A_t \rightarrow B_{t+1}\)? \(\quad\) OR \(\quad\) \(C \rightarrow A_t\) and \(C \rightarrow B_{t+1}\)?
## Analysis of Variance Table
##
## Model 1: revt_growth ~ at_growth
## Model 2: revt_growth ~ lct_growth + che_growth + ebit_growth
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 26 1.5534
## 2 24 1.1918 2 0.36168 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A bit better at \(p<0.05\)
Adding more data usually helps improve predictions
# Ensure firms have at least $1M (local currency), and have revenue
# df contains all real estate companies excluding North America
df_clean <- df %>%
filter(at>1, revt>0)
# We cleaned out 578 observations!
print(c(nrow(df), nrow(df_clean)))
## [1] 5161 4583
uol <- uol %>% mutate(revt_lead = lead(revt)) # From dplyr
forecast1 <- lm(revt_lead ~ lct + che + ebit, data=uol)
library(broom) # Lets us view bigger regression outputs in a tidy fashion
tidy(forecast1) # Present regression output
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 87.4 124. 0.707 0.486
## 2 lct 0.213 0.291 0.731 0.472
## 3 che 0.112 0.349 0.319 0.752
## 4 ebit 2.49 1.03 2.42 0.0236
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.655 0.612 357. 15.2 9.39e-6 3 -202. 414. 421.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
This model is ok, but we can do better.
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 15.6 97.0 0.161 0.874
## 2 revt 1.49 0.414 3.59 0.00174
## 3 act 0.324 0.165 1.96 0.0629
## 4 che 0.0401 0.310 0.129 0.898
## 5 lct -0.198 0.179 -1.10 0.283
## 6 dp 3.63 5.42 0.669 0.511
## 7 ebit -3.57 1.36 -2.62 0.0161
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.903 0.875 203. 32.5 1.41e-9 6 -184. 385. 396.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## Analysis of Variance Table
##
## Model 1: revt_lead ~ lct + che + ebit
## Model 2: revt_lead ~ revt + act + che + lct + dp + ebit
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 24 3059182
## 2 21 863005 3 2196177 1.477e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is better (Adj. \(R^2\), \(\chi^2\), AIC).
Data frames are usually wide panels
# Note the group_by -- without it, lead() will pull from the subsequent firm!
# ungroup() tells R that we finished grouping
df_clean <- df_clean %>%
group_by(isin) %>%
mutate(revt_lead = lead(revt)) %>%
ungroup()
forecast3 <-
lm(revt_lead ~ revt + act + che + lct + dp + ebit,
data=df_clean[df_clean$fic=="SGP",])
tidy(forecast3)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 25.0 13.2 1.89 5.95e- 2
## 2 revt 0.505 0.0762 6.63 1.43e-10
## 3 act -0.0999 0.0545 -1.83 6.78e- 2
## 4 che 0.494 0.155 3.18 1.62e- 3
## 5 lct 0.396 0.0860 4.60 5.95e- 6
## 6 dp 4.46 1.55 2.88 4.21e- 3
## 7 ebit -0.951 0.271 -3.51 5.18e- 4
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.844 0.841 210. 291. 2.63e-127 6 -2237. 4489. 4519.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Lower adjusted \(R^2\) – This is worse? Why?
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 222. 585. 0.379 7.04e- 1
## 2 revt 0.997 0.00655 152. 0.
## 3 act -0.00221 0.00547 -0.403 6.87e- 1
## 4 che -0.150 0.0299 -5.02 5.36e- 7
## 5 lct 0.0412 0.0113 3.64 2.75e- 4
## 6 dp 1.52 0.184 8.26 1.89e-16
## 7 ebit 0.308 0.0650 4.74 2.25e- 6
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.944 0.944 36459. 11299. 0 6 -47819. 95654. 95705.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Higher adjusted \(R^2\) – better!
Why is the UOL model better than the Singapore model?
Different sources of noise, amounts of data
Statistical noise is random error in the data
Noise is OK, but the more we remove, the better!
forecast3.1 <-
lm(revt_lead ~ revt + act + che + lct + dp + ebit + factor(isin),
data=df_clean[df_clean$fic=="SGP",])
# n=7 to prevent outputting every fixed effect
print(tidy(forecast3.1), n=15)
## # A tibble: 27 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.58 39.4 0.0401 0.968
## 2 revt 0.392 0.0977 4.01 0.0000754
## 3 act -0.0538 0.0602 -0.894 0.372
## 4 che 0.304 0.177 1.72 0.0869
## 5 lct 0.392 0.0921 4.26 0.0000276
## 6 dp 4.71 1.73 2.72 0.00687
## 7 ebit -0.851 0.327 -2.60 0.00974
## 8 factor(isin)SG1AA6000000 218. 76.5 2.85 0.00463
## 9 factor(isin)SG1AD8000002 -11.7 67.4 -0.174 0.862
## 10 factor(isin)SG1AE2000006 4.02 79.9 0.0503 0.960
## 11 factor(isin)SG1AG0000003 -13.6 61.1 -0.223 0.824
## 12 factor(isin)SG1BG1000000 -0.901 69.5 -0.0130 0.990
## 13 factor(isin)SG1BI9000008 7.76 64.3 0.121 0.904
## 14 factor(isin)SG1DE5000007 -10.8 61.1 -0.177 0.860
## 15 factor(isin)SG1EE1000009 -6.90 66.7 -0.103 0.918
## # ... with 12 more rows
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.856 0.844 208. 69.4 1.15e-111 26 -2223. 4502. 4609.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## Analysis of Variance Table
##
## Model 1: revt_lead ~ revt + act + che + lct + dp + ebit
## Model 2: revt_lead ~ revt + act + che + lct + dp + ebit + factor(isin)
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 324 14331633
## 2 304 13215145 20 1116488 0.1765
This isn’t much different. Why? There is another source of noise within Singapore real estate companies
package:lfe
has felm()
: fixed effects linear model
library(lfe)
forecast3.2 <-
felm(revt_lead ~ revt + act + che + lct + dp + ebit | factor(isin),
data=df_clean[df_clean$fic=="SGP",])
summary(forecast3.2)
##
## Call:
## felm(formula = revt_lead ~ revt + act + che + lct + dp + ebit | factor(isin), data = df_clean[df_clean$fic == "SGP", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1181.88 -23.25 -1.87 18.03 1968.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## revt 0.39200 0.09767 4.013 7.54e-05 ***
## act -0.05382 0.06017 -0.894 0.37181
## che 0.30370 0.17682 1.718 0.08690 .
## lct 0.39209 0.09210 4.257 2.76e-05 ***
## dp 4.71275 1.73168 2.721 0.00687 **
## ebit -0.85080 0.32704 -2.602 0.00974 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 208.5 on 304 degrees of freedom
## (29 observations deleted due to missingness)
## Multiple R-squared(full model): 0.8558 Adjusted R-squared: 0.8435
## Multiple R-squared(proj model): 0.7806 Adjusted R-squared: 0.7618
## F-statistic(full model):69.41 on 26 and 304 DF, p-value: < 2.2e-16
## F-statistic(proj model): 180.3 on 6 and 304 DF, p-value: < 2.2e-16
What else could we do to improve our prediction model?
package:broom
package:DT
package:knitr
package:lfe
package:magrittr
package:plotly
package:revealjs
package:tidyverse
# Graph showing squared error (slide 4.6)
uolg <- uol[,c("at","revt")]
uolg$resid <- mod1$residuals
uolg$xleft <- ifelse(uolg$resid < 0,uolg$at,uolg$at - uolg$resid)
uolg$xright <- ifelse(uolg$resid < 0,uolg$at - uolg$resid, uol$at)
uolg$ytop <- ifelse(uolg$resid < 0,uolg$revt - uolg$resid,uol$revt)
uolg$ybottom <- ifelse(uolg$resid < 0,uolg$revt, uolg$revt - uolg$resid)
uolg$point <- TRUE
uolg2 <- uolg
uolg2$point <- FALSE
uolg2$at <- ifelse(uolg$resid < 0,uolg2$xright,uolg2$xleft)
uolg2$revt <- ifelse(uolg$resid < 0,uolg2$ytop,uolg2$ybottom)
uolg <- rbind(uolg, uolg2)
uolg %>% ggplot(aes(y=revt, x=at, group=point)) +
geom_point(aes(shape=point)) +
scale_shape_manual(values=c(NA,18)) +
geom_smooth(method="lm", se=FALSE) +
geom_errorbarh(aes(xmax=xright, xmin = xleft)) +
geom_errorbar(aes(ymax=ytop, ymin = ybottom)) +
theme(legend.position="none")
# Chart of mean revt_lead for Singaporean firms (slide 12.6)
df_clean %>% # Our data frame
filter(fic=="SGP") %>% # Select only Singaporean firms
group_by(isin) %>% # Group by firm
mutate(mean_revt_lead=mean(revt_lead, na.rm=T)) %>% # Determine each firm's mean revenue (lead)
slice(1) %>% # Take only the first observation for each group
ungroup() %>% # Ungroup (we don't need groups any more)
ggplot(aes(x=mean_revt_lead)) + # Initialize plot and select data
geom_histogram(aes(y = ..density..)) + # Plots the histogram as a density so that geom_density is visible
geom_density(alpha=.4, fill="#FF6666") # Plots smoothed density
Add an illustration of fixed effects