The above all use some sort of Neural Network or text analytics approach.
Getting familiar with forecasting using real data and R
Higher level financial forecasting, detection, and AI/ML
Datacamp’s tutorials teach R from the ground up, and are mandatory unless you can already code in R.
In class
Out of class
All sessions will be recorded to provide flexibility for anyone missing class to still see the material. It also allows you to easily review the class material.
Groups will be randomized each class session to encourage you to meet each other. Once group project groups are set, breakout sessions will be with your group project group.
*Almost every class will touch on each of these three aspects
Oxford: The systematic computational analysis of data or statistics
Webster: The method of logical analysis
Gartner: catch-all term for a variety of different business intelligence […] and application-related initiatives
Simply put: Answering questions using data
How will Analytics/AI/ML change society and the accounting profession?
This is where the “educated” comes in
These are not mutually exclusive. Forensic analytics can be used for forecasting!
53% of companies were using big data in a 2017 survey!
Programming in R provides a way of talking with the computer to make it do what you want it to do
#
are comments
x ^ y
<-
commandy <- 2 * x
x
and y
remain unchanged!Set a variable
growth
to the amount of Singtel’s earnings growth percent in 2018
# Data from Singtel's earnings reports, in Millions of SGD
singtel_2017 <- 3831.0
singtel_2018 <- 5430.3
# Compute growth
growth <- singtel_2018 / singtel_2017 - 1
# Check the value of growth
growth
## [1] 0.4174628
Scaling this up will give use a lot more value
TRUE
or FALSE
T
and F
Examples:
\[ \begin{matrix} \left(\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}\right) & \text{or} & \left(\begin{matrix} 1 & 2 & 3 & 4 \end{matrix}\right) \end{matrix} \]
A row (or column) of data
# Calculating profit margin for all public US tech firms
# 715 tech firms in Compustat with >1M sales in 2017
# Data:
# earnings_2017: vector of earnings, $M USD
# revenue_2017: vector of revenue, $M USD
# names_2017: a vector of tickers (strings)
# Namining the vectors
names(earnings_2017) <- names_2017
names(revenue_2017) <- names_2017
earnings_2017[1:6]
## AVX CORP BK TECHNOLOGIES ADVANCED MICRO DEVICES
## 4.910 -3.626 43.000
## ASM INTERNATIONAL NV SKYWORKS SOLUTIONS INC ANALOG DEVICES
## 543.878 1010.200 727.259
## AVX CORP BK TECHNOLOGIES ADVANCED MICRO DEVICES
## 1562.474 39.395 5329.000
## ASM INTERNATIONAL NV SKYWORKS SOLUTIONS INC ANALOG DEVICES
## 886.503 3651.400 5107.503
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4307.49 -15.98 1.84 296.84 91.36 48351.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.06 102.62 397.57 3023.78 1531.59 229234.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -13.97960 -0.10253 0.01353 -0.10967 0.09295 1.02655
Example:
\[ \left(\begin{matrix} 1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8\\ 9 & 10 & 11 & 12 \end{matrix}\right) \]
A rows and columns of data
matrix_name[rows,columns]
columns <- c("Google", "Microsoft",
"Goldman")
rows <- c("Earnings","Revenue")
firm_data <- matrix(data=
c(12662, 21204, 4286, 110855,
89950, 42254), nrow=2)
# Equivalent:
# matrix(data=c(12662, 21204, 4286,
# 110855, 89950, 42254), ncol=3)
# Apply names
rownames(firm_data) <- rows
colnames(firm_data) <- columns
# Print the matrix
firm_data
## Google Microsoft Goldman
## Earnings 12662 4286 89950
## Revenue 21204 110855 42254
## [1] 42254
## Google Microsoft
## Earnings 12662 4286
## Revenue 21204 110855
## Google Microsoft Goldman
## 12662 4286 89950
## [1] 42254
# Preloaded: industry codes as indcode (vector)
# - GICS codes: 40=Financials, 45=Information Technology
# - See: https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard
# Preloaded: JPMorgan data as jpdata (vector)
mat <- rbind(firm_data,indcode) # Add a row
rownames(mat)[3] <- "Industry" # Name the new row
mat
## Google Microsoft Goldman
## Earnings 12662 4286 89950
## Revenue 21204 110855 42254
## Industry 45 45 40
mat <- cbind(firm_data,jpdata) # Add a column
colnames(mat)[4] <- "JPMorgan" # Name the new column
mat
## Google Microsoft Goldman JPMorgan
## Earnings 12662 4286 89950 17370
## Revenue 21204 110855 42254 115475
# Ignore this code for now...
model <- summary(lm(earnings ~ revenue, data=tech_df))
#Note that this function is hiding something...
model
##
## Call:
## lm(formula = earnings ~ revenue, data = tech_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16045.0 20.0 141.6 177.1 12104.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.837e+02 4.491e+01 -4.091 4.79e-05 ***
## revenue 1.589e-01 3.564e-03 44.585 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1166 on 713 degrees of freedom
## Multiple R-squared: 0.736, Adjusted R-squared: 0.7356
## F-statistic: 1988 on 1 and 713 DF, p-value: < 2.2e-16
[[index]]
[[c()]]
will drill through lists, as opposed to pulling multiple values$
## $r.squared
## [1] 0.7360059
## [1] 0.7360059
## [1] 0.7360059
## List of 11
## $ call : language lm(formula = earnings ~ revenue, data = tech_df)
## $ terms :Classes 'terms', 'formula' language earnings ~ revenue
## .. ..- attr(*, "variables")= language list(earnings, revenue)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "earnings" "revenue"
## .. .. .. ..$ : chr "revenue"
## .. ..- attr(*, "term.labels")= chr "revenue"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(earnings, revenue)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "earnings" "revenue"
## $ residuals : Named num [1:715] -59.7 173.8 -620.2 586.7 613.6 ...
## ..- attr(*, "names")= chr [1:715] "1" "2" "3" "4" ...
## $ coefficients : num [1:2, 1:4] -1.84e+02 1.59e-01 4.49e+01 3.56e-03 -4.09 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "revenue"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:2] FALSE FALSE
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "revenue"
## $ sigma : num 1166
## $ df : int [1:3] 2 713 2
## $ r.squared : num 0.736
## $ adj.r.squared: num 0.736
## $ fstatistic : Named num [1:3] 1988 1 713
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:2, 1:2] 1.48e-03 -2.83e-08 -2.83e-08 9.35e-12
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "revenue"
## .. ..$ : chr [1:2] "(Intercept)" "revenue"
## - attr(*, "class")= chr "summary.lm"
Like a matrix:
[]
Like a list:
$
Columns \(\approx\) variables, e.g., earnings
Rows \(\approx\) observations, e.g., Google in 2017
There are three schools of thought on this
tibble
data.table
Cast either to a
data.frame
using as.data.frame()
Note: Base R methods are explained in the R Supplement
library(tidyverse) # Imports most tidy packages
# Base R data import -- stringsAsFactors is important here
df <- read.csv("../../Data/Session_1-2.csv", stringsAsFactors=FALSE)
df <- subset(df, fyear == 2017 & !is.na(revt) & !is.na(ni) &
revt > 1 & gsector == 45)
df$margin = df$ni / df$revt
summary(df)
## gvkey datadate fyear indfmt
## Min. : 1072 Min. :20170630 Min. :2017 Length:715
## 1st Qu.: 20231 1st Qu.:20171231 1st Qu.:2017 Class :character
## Median : 33232 Median :20171231 Median :2017 Mode :character
## Mean : 79699 Mean :20172029 Mean :2017
## 3rd Qu.:148393 3rd Qu.:20171231 3rd Qu.:2017
## Max. :315629 Max. :20180430 Max. :2017
##
## consol popsrc datafmt tic
## Length:715 Length:715 Length:715 Length:715
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## conm curcd ni revt
## Length:715 Length:715 Min. :-4307.49 Min. : 1.06
## Class :character Class :character 1st Qu.: -15.98 1st Qu.: 102.62
## Mode :character Mode :character Median : 1.84 Median : 397.57
## Mean : 296.84 Mean : 3023.78
## 3rd Qu.: 91.36 3rd Qu.: 1531.59
## Max. :48351.00 Max. :229234.00
##
## cik costat gind gsector
## Min. : 2186 Length:715 Min. :451010 Min. :45
## 1st Qu.: 887604 Class :character 1st Qu.:451020 1st Qu.:45
## Median :1102307 Mode :character Median :451030 Median :45
## Mean :1086969 Mean :451653 Mean :45
## 3rd Qu.:1405497 3rd Qu.:452030 3rd Qu.:45
## Max. :1725579 Max. :453010 Max. :45
## NA's :3
## gsubind margin
## Min. :45101010 Min. :-13.97960
## 1st Qu.:45102020 1st Qu.: -0.10253
## Median :45103020 Median : 0.01353
## Mean :45165290 Mean : -0.10967
## 3rd Qu.:45203012 3rd Qu.: 0.09295
## Max. :45301020 Max. : 1.02655
##
# Tidy import
df <- read_csv("../../Data/Session_1-2.csv") %>%
filter(fyear == 2017, # fiscal year
!is.na(revt), # revenue not missing
!is.na(ni), # net income not missing
revt > 1, # at least 1M USD in revenue
gsector == 45) %>% # tech firm
mutate(margin = ni/revt) # profit margin
summary(df)
## gvkey datadate fyear indfmt
## Length:715 Min. :20170630 Min. :2017 Length:715
## Class :character 1st Qu.:20171231 1st Qu.:2017 Class :character
## Mode :character Median :20171231 Median :2017 Mode :character
## Mean :20172029 Mean :2017
## 3rd Qu.:20171231 3rd Qu.:2017
## Max. :20180430 Max. :2017
## consol popsrc datafmt tic
## Length:715 Length:715 Length:715 Length:715
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## conm curcd ni revt
## Length:715 Length:715 Min. :-4307.49 Min. : 1.06
## Class :character Class :character 1st Qu.: -15.98 1st Qu.: 102.62
## Mode :character Mode :character Median : 1.84 Median : 397.57
## Mean : 296.84 Mean : 3023.78
## 3rd Qu.: 91.36 3rd Qu.: 1531.59
## Max. :48351.00 Max. :229234.00
## cik costat gind gsector
## Length:715 Length:715 Min. :451010 Min. :45
## Class :character Class :character 1st Qu.:451020 1st Qu.:45
## Mode :character Mode :character Median :451030 Median :45
## Mean :451653 Mean :45
## 3rd Qu.:452030 3rd Qu.:45
## Max. :453010 Max. :45
## gsubind margin
## Min. :45101010 Min. :-13.97960
## 1st Qu.:45102020 1st Qu.: -0.10253
## Median :45103020 Median : 0.01353
## Mean :45165290 Mean : -0.10967
## 3rd Qu.:45203012 3rd Qu.: 0.09295
## Max. :45301020 Max. : 1.02655
Pipe notation is never necessary and not built in to R
%>%
pipe is loaded with tidyverse%>%
Left %>% Right(arg2, ...)
is the same as Right(Left, arg2, ...)
Piping can drastically improve code readability
%<>%
Left %<>% Right(arg2, ...)
is the same as Left <- Right(Left, arg2, ...)
Note how unreadable this gets (but output is the same)
df <- mutate(
filter(
read_csv("../../Data/Session_1-2.csv"),
fyear == 2017, # fiscal year
!is.na(revt), # revenue not missing
!is.na(ni), # net income not missing
revt > 1, # at least 1M USD in revenue
gsector == 45), # tech firm
margin = ni/revt) # profit margin
summary(df)
## gvkey datadate fyear indfmt
## Length:715 Min. :20170630 Min. :2017 Length:715
## Class :character 1st Qu.:20171231 1st Qu.:2017 Class :character
## Mode :character Median :20171231 Median :2017 Mode :character
## Mean :20172029 Mean :2017
## 3rd Qu.:20171231 3rd Qu.:2017
## Max. :20180430 Max. :2017
## consol popsrc datafmt tic
## Length:715 Length:715 Length:715 Length:715
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## conm curcd ni revt
## Length:715 Length:715 Min. :-4307.49 Min. : 1.06
## Class :character Class :character 1st Qu.: -15.98 1st Qu.: 102.62
## Mode :character Mode :character Median : 1.84 Median : 397.57
## Mean : 296.84 Mean : 3023.78
## 3rd Qu.: 91.36 3rd Qu.: 1531.59
## Max. :48351.00 Max. :229234.00
## cik costat gind gsector
## Length:715 Length:715 Min. :451010 Min. :45
## Class :character Class :character 1st Qu.:451020 1st Qu.:45
## Mode :character Mode :character Median :451030 Median :45
## Mean :451653 Mean :45
## 3rd Qu.:452030 3rd Qu.:45
## Max. :453010 Max. :45
## gsubind margin
## Min. :45101010 Min. :-13.97960
## 1st Qu.:45102020 1st Qu.: -0.10253
## Median :45103020 Median : 0.01353
## Mean :45165290 Mean : -0.10967
## 3rd Qu.:45203012 3rd Qu.: 0.09295
## Max. :45301020 Max. : 1.02655
Many useful functions are highlighted in the R Supplement
# Install the tidyverse package from inside R
install.packages("tidyverse")
# Load the package
library(tidyverse)
# To see a help page for a function (such as data.frame()) run either of:
help(data.frame)
?data.frame
## function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
## fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())
## NULL
my_func <- function(agruments) {code}
Simple function: Add 2 to a number
## [1] 502
mult_together <- function(n1, n2=0, square=FALSE) {
if (!square) {
n1 * n2
} else {
n1 * n1
}
}
mult_together(5,6)
## [1] 30
## [1] 25
## [1] 25
FXRate <- function(from="USD", to="SGD", dt=Sys.Date()) {
options("getSymbols.warning4.0"=FALSE)
require(quantmod)
data <- getSymbols(paste0(from, "/", to), from=dt-3, to=dt, src="oanda", auto.assign=F)
return(data[[1]])
}
date()
## [1] "Tue Aug 18 19:13:56 2020"
## [1] 1.370855
## [1] 5.070048
## [1] 1.42738