Dr. Richard M. Crowley
Getting familiar with forecasting using real data and R
Higher level financial forecasting, detection, and AI/ML
Datacamp’s tutorials teach R from the ground up, and are mandatory unless you can already code in R.
In class
Out of class
*Almost every class will touch on each of these three aspects
Oxford: The systematic computational analysis of data or statistics
Webster: The method of logical analysis
Gartner: catch-all term for a variety of different business intelligence […] and application-related initiatives
Simply put: Answering questions using data
How will Analytics/AI/ML change society and the accounting profession?
This is where the “educated” comes in
These are not mutually exclusive. Forensic analytics can be used for forecasting!
53% of companies were using big data in a 2017 survey!
Programming in R provides a way of talking with the computer to make it do what you want it to do
#
are comments
# Addition uses '+'
1 + 1
## [1] 2
# Subtraction uses '-'
2 - 1
## [1] 1
# Multiplication uses '*'
3 * 3
## [1] 9
# Division uses '/'
4 / 2
## [1] 2
x ^ y
# Exponentiation uses '^'
5 ^ 5
## [1] 3125
# Modulus (aka the remainder) uses '%%'
46 %% 6
## [1] 4
# Integer division uses '%/%'
46 %/% 6
## [1] 7
<-
command# Store 2 in 'x'
x <- 2
# Check the value of x
x
## [1] 2
# Store arithmetic in y
y <- x * 2
# Check the value of y
y
## [1] 4
y <- 2 * x
x
and y
remain unchanged!# Previous value of x and y
x
## [1] 2
y
## [1] 4
# Change x, then recheck the value
# of x and y
x <- 200
x
## [1] 200
y
## [1] 4
Set a variable
growth
to the amount of Singtel’s earnings growth percent in 2018
# Data from Singtel's earnings reports, in Millions of SGD
singtel_2017 <- 3831.0
singtel_2018 <- 5430.3
# Compute growth
growth <- singtel_2018 / singtel_2017 - 1
# Check the value of growth
growth
## [1] 0.4174628
Scaling this up will give use a lot more value
TRUE
or FALSE
T
and F
company_name <- "Google" # character
company_name
## [1] "Google"
company_name <- 'Google' # character
company_name
## [1] "Google"
tech_firm <- TRUE # boolean
tech_firm
## [1] TRUE
earnings <- 12662 # numeric, $M USD
earnings
## [1] 12662
Examples:
\[ \begin{matrix} \left(\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}\right) & \text{or} & \left(\begin{matrix} 1 & 2 & 3 & 4 \end{matrix}\right) \end{matrix} \]
A row (or column) of data
# Calculating proit margin for all public US tech firms
# 715 tech firms in Compustat with >1M sales in 2017
# Data:
# earnings_2017: vector of earnings, $M USD
# revenue_2017: vector of revenue, $M USD
# names_2017: a vector of tickers (strings)
# Namining the vectors
names(earnings_2017) <- names_2017
names(revenue_2017) <- names_2017
earnings_2017[1:6]
## AVX CORP BK TECHNOLOGIES ADVANCED MICRO DEVICES
## 4.910 -3.626 43.000
## ASM INTERNATIONAL NV SKYWORKS SOLUTIONS INC ANALOG DEVICES
## 543.878 1010.200 727.259
revenue_2017[1:6]
## AVX CORP BK TECHNOLOGIES ADVANCED MICRO DEVICES
## 1562.474 39.395 5329.000
## ASM INTERNATIONAL NV SKYWORKS SOLUTIONS INC ANALOG DEVICES
## 886.503 3651.400 5107.503
# Summarizing vectors
summary(earnings_2017)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4307.49 -15.98 1.84 296.84 91.36 48351.00
summary(revenue_2017)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.06 102.62 397.57 3023.78 1531.59 229234.00
# Calculating profit margin
margin <- earnings_2017 / revenue_2017
summary(margin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -13.97960 -0.10253 0.01353 -0.10967 0.09295 1.02655
# Worst, midpoint, and best profit margin firms in 2017. Our names carried over :)
margin[order(margin)][c(1,length(margin)/2,length(margin))]
## HELIOS AND MATHESON ANALYTIC NLIGHT INC
## -13.97960161 0.01325588
## CCUR HOLDINGS INC
## 1.02654899
Example:
\[ \left(\begin{matrix} 1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8\\ 9 & 10 & 11 & 12 \end{matrix}\right) \]
A rows and columns of data
matrix_name[rows,columns]
columns <- c("Google", "Microsoft",
"Goldman")
rows <- c("Earnings","Revenue")
firm_data <- matrix(data=
c(12662, 21204, 4286, 110855,
89950, 42254), nrow=2)
# Equivalent:
# matrix(data=c(12662, 21204, 4286,
# 110855, 89950, 42254), ncol=3)
# Apply names
rownames(firm_data) <- rows
colnames(firm_data) <- columns
# Print the matrix
firm_data
## Google Microsoft Goldman
## Earnings 12662 4286 89950
## Revenue 21204 110855 42254
firm_data[2, 3]
## [1] 42254
firm_data[, c("Google", "Microsoft")]
## Google Microsoft
## Earnings 12662 4286
## Revenue 21204 110855
firm_data[1,]
## Google Microsoft Goldman
## 12662 4286 89950
firm_data["Revenue", "Goldman"]
## [1] 42254
# Preloaded: industry codes as indcode (vector)
# - GICS codes: 40=Financials, 45=Information Technology
# - See: https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard
# Preloaded: JPMorgan data as jpdata (vector)
mat <- rbind(firm_data,indcode) # Add a row
rownames(mat)[3] <- "Industry" # Name the new row
mat
## Google Microsoft Goldman
## Earnings 12662 4286 89950
## Revenue 21204 110855 42254
## Industry 45 45 40
mat <- cbind(firm_data,jpdata) # Add a column
colnames(mat)[4] <- "JPMorgan" # Name the new column
mat
## Google Microsoft Goldman JPMorgan
## Earnings 12662 4286 89950 17370
## Revenue 21204 110855 42254 115475
# Ignore this code for now...
model <- summary(lm(earnings ~ revenue, data=tech_df))
#Note that this function is hiding something...
model
##
## Call:
## lm(formula = earnings ~ revenue, data = tech_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16045.0 20.0 141.6 177.1 12104.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.837e+02 4.491e+01 -4.091 4.79e-05 ***
## revenue 1.589e-01 3.564e-03 44.585 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1166 on 713 degrees of freedom
## Multiple R-squared: 0.736, Adjusted R-squared: 0.7356
## F-statistic: 1988 on 1 and 713 DF, p-value: < 2.2e-16
[[index]]
[[c()]]
will drill through lists, as opposed to pulling multiple values$
model["r.squared"]
## $r.squared
## [1] 0.7360059
model[["r.squared"]]
## [1] 0.7360059
model$r.squared
## [1] 0.7360059
earnings <- c(12662, 21204, 4286)
company <- c("Google", "Microsoft", "Goldman")
names(earnings) <- company
earnings["Google"]
## Google
## 12662
earnings[["Google"]]
## [1] 12662
#Can't use $ with vectors
str(model)
## List of 11
## $ call : language lm(formula = earnings ~ revenue, data = tech_df)
## $ terms :Classes 'terms', 'formula' language earnings ~ revenue
## .. ..- attr(*, "variables")= language list(earnings, revenue)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "earnings" "revenue"
## .. .. .. ..$ : chr "revenue"
## .. ..- attr(*, "term.labels")= chr "revenue"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(earnings, revenue)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "earnings" "revenue"
## $ residuals : Named num [1:715] -59.7 173.8 -620.2 586.7 613.6 ...
## ..- attr(*, "names")= chr [1:715] "1" "2" "3" "4" ...
## $ coefficients : num [1:2, 1:4] -1.84e+02 1.59e-01 4.49e+01 3.56e-03 -4.09 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "revenue"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:2] FALSE FALSE
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "revenue"
## $ sigma : num 1166
## $ df : int [1:3] 2 713 2
## $ r.squared : num 0.736
## $ adj.r.squared: num 0.736
## $ fstatistic : Named num [1:3] 1988 1 713
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:2, 1:2] 1.48e-03 -2.83e-08 -2.83e-08 9.35e-12
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "revenue"
## .. ..$ : chr [1:2] "(Intercept)" "revenue"
## - attr(*, "class")= chr "summary.lm"
Like a matrix:
[]
Like a list:
$
Columns \(\approx\) variables, e.g., earnings
Rows \(\approx\) observations, e.g., Google in 2017
There are three schools of thought on this
tibble
package:data.table
)
data.table
Cast either to a
data.frame
usingas.data.frame()
Note: Base R methods are explained in the R Supplement
library(tidyverse) # Imports most tidy packages
# Base R data import -- stringsAsFactors is important here
df <- read.csv("../../Data/Session_1-2.csv", stringsAsFactors=FALSE)
df <- subset(df, fyear == 2017 & !is.na(revt) & !is.na(ni) &
revt > 1 & gsector == 45)
df$margin = df$ni / df$revt
summary(df)
## gvkey datadate fyear indfmt
## Min. : 1072 Min. :20170630 Min. :2017 Length:715
## 1st Qu.: 20231 1st Qu.:20171231 1st Qu.:2017 Class :character
## Median : 33232 Median :20171231 Median :2017 Mode :character
## Mean : 79699 Mean :20172029 Mean :2017
## 3rd Qu.:148393 3rd Qu.:20171231 3rd Qu.:2017
## Max. :315629 Max. :20180430 Max. :2017
##
## consol popsrc datafmt
## Length:715 Length:715 Length:715
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## tic conm curcd
## Length:715 Length:715 Length:715
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## ni revt cik
## Min. :-4307.49 Min. : 1.06 Min. : 2186
## 1st Qu.: -15.98 1st Qu.: 102.62 1st Qu.: 887604
## Median : 1.84 Median : 397.57 Median :1102307
## Mean : 296.84 Mean : 3023.78 Mean :1086969
## 3rd Qu.: 91.36 3rd Qu.: 1531.59 3rd Qu.:1405497
## Max. :48351.00 Max. :229234.00 Max. :1725579
## NA's :3
## costat gind gsector gsubind
## Length:715 Min. :451010 Min. :45 Min. :45101010
## Class :character 1st Qu.:451020 1st Qu.:45 1st Qu.:45102020
## Mode :character Median :451030 Median :45 Median :45103020
## Mean :451653 Mean :45 Mean :45165290
## 3rd Qu.:452030 3rd Qu.:45 3rd Qu.:45203012
## Max. :453010 Max. :45 Max. :45301020
##
## margin
## Min. :-13.97960
## 1st Qu.: -0.10253
## Median : 0.01353
## Mean : -0.10967
## 3rd Qu.: 0.09295
## Max. : 1.02655
##
# Tidy import
df <- read_csv("../../Data/Session_1-2.csv") %>%
filter(fyear == 2017, # fiscal year
!is.na(revt), # revenue not missing
!is.na(ni), # net income not missing
revt > 1, # at least 1M USD in revenue
gsector == 45) %>% # tech firm
mutate(margin = ni/revt) # profit margin
summary(df)
## gvkey datadate fyear indfmt
## Length:715 Min. :20170630 Min. :2017 Length:715
## Class :character 1st Qu.:20171231 1st Qu.:2017 Class :character
## Mode :character Median :20171231 Median :2017 Mode :character
## Mean :20172029 Mean :2017
## 3rd Qu.:20171231 3rd Qu.:2017
## Max. :20180430 Max. :2017
## consol popsrc datafmt
## Length:715 Length:715 Length:715
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## tic conm curcd
## Length:715 Length:715 Length:715
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## ni revt cik
## Min. :-4307.49 Min. : 1.06 Length:715
## 1st Qu.: -15.98 1st Qu.: 102.62 Class :character
## Median : 1.84 Median : 397.57 Mode :character
## Mean : 296.84 Mean : 3023.78
## 3rd Qu.: 91.36 3rd Qu.: 1531.59
## Max. :48351.00 Max. :229234.00
## costat gind gsector gsubind
## Length:715 Min. :451010 Min. :45 Min. :45101010
## Class :character 1st Qu.:451020 1st Qu.:45 1st Qu.:45102020
## Mode :character Median :451030 Median :45 Median :45103020
## Mean :451653 Mean :45 Mean :45165290
## 3rd Qu.:452030 3rd Qu.:45 3rd Qu.:45203012
## Max. :453010 Max. :45 Max. :45301020
## margin
## Min. :-13.97960
## 1st Qu.: -0.10253
## Median : 0.01353
## Mean : -0.10967
## 3rd Qu.: 0.09295
## Max. : 1.02655
Pipe notation is never necessary and not built in to R
%>%
pipe is loaded with tidyverse%>%
Left %>% Right(arg2, ...)
is the same as Right(Left, arg2, ...)
Piping can drastically improve code readability
%<>%
Left %<>% Right(arg2, ...)
is the same as Left <- Right(Left, arg2, ...)
Note how unreadable this gets (but output is the same)
df <- mutate(
filter(
read_csv("../../Data/Session_1-2.csv"),
fyear == 2017, # fiscal year
!is.na(revt), # revenue not missing
!is.na(ni), # net income not missing
revt > 1, # at least 1M USD in revenue
gsector == 45), # tech firm
margin = ni/revt) # profit margin
summary(df)
## gvkey datadate fyear indfmt
## Length:715 Min. :20170630 Min. :2017 Length:715
## Class :character 1st Qu.:20171231 1st Qu.:2017 Class :character
## Mode :character Median :20171231 Median :2017 Mode :character
## Mean :20172029 Mean :2017
## 3rd Qu.:20171231 3rd Qu.:2017
## Max. :20180430 Max. :2017
## consol popsrc datafmt
## Length:715 Length:715 Length:715
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## tic conm curcd
## Length:715 Length:715 Length:715
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## ni revt cik
## Min. :-4307.49 Min. : 1.06 Length:715
## 1st Qu.: -15.98 1st Qu.: 102.62 Class :character
## Median : 1.84 Median : 397.57 Mode :character
## Mean : 296.84 Mean : 3023.78
## 3rd Qu.: 91.36 3rd Qu.: 1531.59
## Max. :48351.00 Max. :229234.00
## costat gind gsector gsubind
## Length:715 Min. :451010 Min. :45 Min. :45101010
## Class :character 1st Qu.:451020 1st Qu.:45 1st Qu.:45102020
## Mode :character Median :451030 Median :45 Median :45103020
## Mean :451653 Mean :45 Mean :45165290
## 3rd Qu.:452030 3rd Qu.:45 3rd Qu.:45203012
## Max. :453010 Max. :45 Max. :45301020
## margin
## Min. :-13.97960
## 1st Qu.: -0.10253
## Median : 0.01353
## Mean : -0.10967
## 3rd Qu.: 0.09295
## Max. : 1.02655
Many useful functions are highlighted in the R Supplement
# Install the tidyverse package from inside R
install.packages("tidyverse")
# Load the package
library(tidyverse)
# To see a help page for a function (such as data.frame()) run either of:
help(data.frame)
?data.frame
# To see the arguments a function takes, run:
args(data.frame)
## function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
## fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())
## NULL
my_func <- function(agruments) {code}
Simple function: Add 2 to a number
add_two <- function(n) {
n + 2
}
add_two(500)
## [1] 502
mult_together <- function(n1, n2=0, square=FALSE) {
if (!square) {
n1 * n2
} else {
n1 * n1
}
}
mult_together(5,6)
## [1] 30
mult_together(5,6,square=TRUE)
## [1] 25
mult_together(5,square=TRUE)
## [1] 25
FXRate <- function(from="USD", to="SGD", dt=Sys.Date()) {
options("getSymbols.warning4.0"=FALSE)
require(quantmod)
data <- getSymbols(paste0(from, "/", to), from=dt-1, to=dt, src="oanda", auto.assign=F)
return(data[[1]])
}
date()
## [1] "Sun Aug 18 16:31:56 2019"
FXRate(from="USD", to="SGD") # Today's SGD to USD rate
## [1] 1.38463
FXRate(from="SGD", to="CNY") # Today's SGD to CNY rate
## [1] 5.086488
FXRate(from="USD", to="SGD", dt=Sys.Date()-90) # Last quarter's SGD to USD rate
## [1] 1.378014
# Custom code to use Google Ngrams data
library('ngramr')
ngd <- c("(Analytics + analytics)", "(Machine learning + machine learning)")
ggram(ngd, year_start=1960, geom = "area", google_theme=F, smoothing = 3) + theme(legend.position="bottom", legend.direction="horizontal")