What is the common thread?
All of the above use text-based data paired with AI/ML algorithms. A secondary thread is the importance of content, while some papers also push for better causality in research.
Getting familiar with forecasting using real data and R
Higher level financial forecasting, detection, and AI/ML
Datacamp’s tutorials teach R from the ground up, and are mandatory unless you can already code in R.
In class
Out of class
Expect one class session to be cancelled in mid-to-late October
All sessions will be recorded to provide flexibility for anyone missing class to still see the material. It also allows you to easily review the class material.
Background: Mariia Sennikova, Zoom Faces, 2021-2022.
Background: Mariia Sennikova, Zoom Faces, 2021-2022.
Before project groups are settled, groups will be randomized each class session to encourage you to meet each other. Once group project groups are set, breakout sessions will be with your group project group.
Background: Mariia Sennikova, Zoom Faces, 2021-2022.
*Almost every class will touch on each of these three aspects
Oxford: The systematic computational analysis of data or statistics
Webster: The method of logical analysis
Gartner: catch-all term for a variety of different business intelligence […] and application-related initiatives
Simply put: Answering questions using data
How will Analytics/AI/ML change society and the accounting profession?
This is where the “educated” comes in
These are not mutually exclusive. Forensic analytics can be used for forecasting!
53% of companies were using big data in a 2017 survey!
What do the spikes indicate?
To learn more, check out these slides
Programming in R provides a way of talking with the computer to make it do what you want it to do
#
are comments
# Addition uses '+'
1 + 1
[1] 2
# Subtraction uses '-'
2 - 1
[1] 1
# Multiplication uses '*'
3 * 3
[1] 9
# Division uses '/'
4 / 2
[1] 2
x ^ y
<-
operator# Store 2 in 'x'
x <- 2
# Check the value of x
x
[1] 2
# Store arithmetic in y
y <- x * 2
# Check the value of y
y
[1] 4
Set a variable
growth
to the amount of Singtel’s earnings growth percent in 2018
# Data from Singtel's earnings reports, in Millions of SGD
singtel_2017 <- 3831.0
singtel_2018 <- 5430.3
# Compute growth
growth <- singtel_2018 / singtel_2017 - 1
# Check the value of growth
growth
[1] 0.4174628
Scaling this up will give use a lot more value
TRUE
or FALSE
T
and F
company_name <- "Google" # character
company_name
[1] "Google"
company_name <- 'Google' # character
company_name
[1] "Google"
tech_firm <- TRUE # boolean
tech_firm
[1] TRUE
earnings <- 12662 # numeric, $M USD
earnings
[1] 12662
c()
– holds only 1 typematrix()
– holds only 1 typelist()
– holds anything (including other structures)data.frame()
– holds different types by columnExamples:
\[ \begin{matrix} \left(\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}\right) & \text{or} & \left(\begin{matrix} 1 & 2 & 3 & 4 \end{matrix}\right) \end{matrix} \]
A row (or column) of data
# Calculating profit margin for all public US tech firms
# 715 tech firms in Compustat with >1M sales in 2017
# Data:
# earnings_2017: vector of earnings, $M USD
# revenue_2017: vector of revenue, $M USD
# names_2017: a vector of tickers (strings)
# Naming the vectors
names(earnings_2017) <- names_2017
names(revenue_2017) <- names_2017
earnings_2017[1:6]
AVX CORP BK TECHNOLOGIES ADVANCED MICRO DEVICES
4.910 -3.626 43.000
ASM INTERNATIONAL NV SKYWORKS SOLUTIONS INC ANALOG DEVICES
543.878 1010.200 727.259
revenue_2017[1:6]
AVX CORP BK TECHNOLOGIES ADVANCED MICRO DEVICES
1562.474 39.395 5329.000
ASM INTERNATIONAL NV SKYWORKS SOLUTIONS INC ANALOG DEVICES
886.503 3651.400 5107.503
# Summarizing vectors
summary(earnings_2017)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4307.49 -15.98 1.84 296.84 91.36 48351.00
summary(revenue_2017)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.06 102.62 397.57 3023.78 1531.59 229234.00
# Calculating profit margin
margin <- earnings_2017 / revenue_2017
summary(margin)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.97960 -0.10253 0.01353 -0.10967 0.09295 1.02655
Example:
\[ \left(\begin{matrix} 1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8\\ 9 & 10 & 11 & 12 \end{matrix}\right) \]
A rows and columns of data
matrix_name[rows,columns]
columns <- c("Google", "Microsoft", "Goldman")
rows <- c("Earnings","Revenue")
firm_data <- matrix(data=c(12662, 21204, 4286,
110855, 89950, 42254), nrow=2)
# Equivalent:
# matrix(data=c(12662, 21204, 4286,
# 110855, 89950, 42254), ncol=3)
# Apply names
rownames(firm_data) <- rows
colnames(firm_data) <- columns
# Print the matrix
firm_data
Google Microsoft Goldman
Earnings 12662 4286 89950
Revenue 21204 110855 42254
firm_data[2, 3]
[1] 42254
firm_data[, c("Google", "Microsoft")]
Google Microsoft
Earnings 12662 4286
Revenue 21204 110855
firm_data[1,]
Google Microsoft Goldman
12662 4286 89950
firm_data["Revenue", "Goldman"]
[1] 42254
rbind()
cbind()
# Preloaded: industry codes as indcode (vector)
# - GICS codes: 40=Financials, 45=Information Technology
# - See: https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard
# Preloaded: JPMorgan data as jpdata (vector)
mat <- rbind(firm_data,indcode) # Add a row
rownames(mat)[3] <- "Industry" # Name the new row
mat
Google Microsoft Goldman
Earnings 12662 4286 89950
Revenue 21204 110855 42254
Industry 45 45 40
mat <- cbind(firm_data,jpdata) # Add a column
colnames(mat)[4] <- "JPMorgan" # Name the new column
mat
Google Microsoft Goldman JPMorgan
Earnings 12662 4286 89950 17370
Revenue 21204 110855 42254 115475
# Ignore this code for now...
model <- summary(lm(earnings ~ revenue, data=tech_df))
#Note that this function is hiding something...
model
Call:
lm(formula = earnings ~ revenue, data = tech_df)
Residuals:
Min 1Q Median 3Q Max
-16045.0 20.0 141.6 177.1 12104.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.837e+02 4.491e+01 -4.091 4.79e-05 ***
revenue 1.589e-01 3.564e-03 44.585 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1166 on 713 degrees of freedom
Multiple R-squared: 0.736, Adjusted R-squared: 0.7356
F-statistic: 1988 on 1 and 713 DF, p-value: < 2.2e-16
[[index]]
[[c()]]
will drill through lists, as opposed to pulling multiple values$
model["r.squared"]
$r.squared
[1] 0.7360059
model[["r.squared"]]
[1] 0.7360059
model$r.squared
[1] 0.7360059
str()
will tell us what’s in this liststr(model)
List of 11
$ call : language lm(formula = earnings ~ revenue, data = tech_df)
$ terms :Classes 'terms', 'formula' language earnings ~ revenue
.. ..- attr(*, "variables")= language list(earnings, revenue)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "earnings" "revenue"
.. .. .. ..$ : chr "revenue"
.. ..- attr(*, "term.labels")= chr "revenue"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(earnings, revenue)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "earnings" "revenue"
$ residuals : Named num [1:715] -59.7 173.8 -620.2 586.7 613.6 ...
..- attr(*, "names")= chr [1:715] "1" "2" "3" "4" ...
$ coefficients : num [1:2, 1:4] -1.84e+02 1.59e-01 4.49e+01 3.56e-03 -4.09 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "revenue"
.. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
$ aliased : Named logi [1:2] FALSE FALSE
..- attr(*, "names")= chr [1:2] "(Intercept)" "revenue"
$ sigma : num 1166
$ df : int [1:3] 2 713 2
$ r.squared : num 0.736
$ adj.r.squared: num 0.736
$ fstatistic : Named num [1:3] 1988 1 713
..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
$ cov.unscaled : num [1:2, 1:2] 1.48e-03 -2.83e-08 -2.83e-08 9.35e-12
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "revenue"
.. ..$ : chr [1:2] "(Intercept)" "revenue"
- attr(*, "class")= chr "summary.lm"
Like a matrix:
[]
Like a list:
$
Columns \(\approx\) variables, e.g., earnings
Rows \(\approx\) observations, e.g., Google in 2017
tibble
data.table
title.optional
You can case tibble
and data.table
objects to a data.frame
using as.data.frame()
Note: Base R methods are explained in the R Supplement
library(tidyverse) # Imports most tidy packages
# Base R data import -- stringsAsFactors is important here
df <- read.csv("../../Data/Session_1-2.csv", stringsAsFactors=FALSE)
df <- subset(df, fyear == 2017 & !is.na(revt) & !is.na(ni) &
revt > 1 & gsector == 45)
df$margin = df$ni / df$revt
summary(df)
gvkey datadate fyear indfmt
Min. : 1072 Min. :20170630 Min. :2017 Length:715
1st Qu.: 20231 1st Qu.:20171231 1st Qu.:2017 Class :character
Median : 33232 Median :20171231 Median :2017 Mode :character
Mean : 79699 Mean :20172029 Mean :2017
3rd Qu.:148393 3rd Qu.:20171231 3rd Qu.:2017
Max. :315629 Max. :20180430 Max. :2017
consol popsrc datafmt tic
Length:715 Length:715 Length:715 Length:715
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
conm curcd ni revt
Length:715 Length:715 Min. :-4307.49 Min. : 1.06
Class :character Class :character 1st Qu.: -15.98 1st Qu.: 102.62
Mode :character Mode :character Median : 1.84 Median : 397.57
Mean : 296.84 Mean : 3023.78
3rd Qu.: 91.36 3rd Qu.: 1531.59
Max. :48351.00 Max. :229234.00
cik costat gind gsector
Min. : 2186 Length:715 Min. :451010 Min. :45
1st Qu.: 887604 Class :character 1st Qu.:451020 1st Qu.:45
Median :1102307 Mode :character Median :451030 Median :45
Mean :1086969 Mean :451653 Mean :45
3rd Qu.:1405497 3rd Qu.:452030 3rd Qu.:45
Max. :1725579 Max. :453010 Max. :45
NA's :3
gsubind margin
Min. :45101010 Min. :-13.97960
1st Qu.:45102020 1st Qu.: -0.10253
Median :45103020 Median : 0.01353
Mean :45165290 Mean : -0.10967
3rd Qu.:45203012 3rd Qu.: 0.09295
Max. :45301020 Max. : 1.02655
# Tidy import
df <- read_csv("../../Data/Session_1-2.csv") %>%
filter(fyear == 2017, # fiscal year
!is.na(revt), # revenue not missing
!is.na(ni), # net income not missing
revt > 1, # at least 1M USD in revenue
gsector == 45) %>% # tech firm
mutate(margin = ni/revt) # profit margin
summary(df)
gvkey datadate fyear indfmt
Length:715 Min. :20170630 Min. :2017 Length:715
Class :character 1st Qu.:20171231 1st Qu.:2017 Class :character
Mode :character Median :20171231 Median :2017 Mode :character
Mean :20172029 Mean :2017
3rd Qu.:20171231 3rd Qu.:2017
Max. :20180430 Max. :2017
consol popsrc datafmt tic
Length:715 Length:715 Length:715 Length:715
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
conm curcd ni revt
Length:715 Length:715 Min. :-4307.49 Min. : 1.06
Class :character Class :character 1st Qu.: -15.98 1st Qu.: 102.62
Mode :character Mode :character Median : 1.84 Median : 397.57
Mean : 296.84 Mean : 3023.78
3rd Qu.: 91.36 3rd Qu.: 1531.59
Max. :48351.00 Max. :229234.00
cik costat gind gsector
Length:715 Length:715 Min. :451010 Min. :45
Class :character Class :character 1st Qu.:451020 1st Qu.:45
Mode :character Mode :character Median :451030 Median :45
Mean :451653 Mean :45
3rd Qu.:452030 3rd Qu.:45
Max. :453010 Max. :45
gsubind margin
Min. :45101010 Min. :-13.97960
1st Qu.:45102020 1st Qu.: -0.10253
Median :45103020 Median : 0.01353
Mean :45165290 Mean : -0.10967
3rd Qu.:45203012 3rd Qu.: 0.09295
Max. :45301020 Max. : 1.02655
arrange()
group_by()
ungroup()
once you are doneselect()
Pipe notation is never necessary and not built in to R
%>%
pipe is loaded with tidyverse
%>%
Left %>% Right(arg2, ...)
is the same as Right(Left, arg2, ...)
Why Pipes?
Piping can drastically improve code readability. Piped code tells you the list of operations applied to a single starting object, such as a data frame.
df <- mutate(
filter(
read_csv("../../Data/Session_1-2.csv"),
fyear == 2017, # fiscal year
!is.na(revt), # revenue not missing
!is.na(ni), # net income not missing
revt > 1, # at least 1M USD in revenue
gsector == 45), # tech firm
margin = ni/revt) # profit margin
summary(df)
gvkey datadate fyear indfmt
Length:715 Min. :20170630 Min. :2017 Length:715
Class :character 1st Qu.:20171231 1st Qu.:2017 Class :character
Mode :character Median :20171231 Median :2017 Mode :character
Mean :20172029 Mean :2017
3rd Qu.:20171231 3rd Qu.:2017
Max. :20180430 Max. :2017
consol popsrc datafmt tic
Length:715 Length:715 Length:715 Length:715
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
conm curcd ni revt
Length:715 Length:715 Min. :-4307.49 Min. : 1.06
Class :character Class :character 1st Qu.: -15.98 1st Qu.: 102.62
Mode :character Mode :character Median : 1.84 Median : 397.57
Mean : 296.84 Mean : 3023.78
3rd Qu.: 91.36 3rd Qu.: 1531.59
Max. :48351.00 Max. :229234.00
cik costat gind gsector
Length:715 Length:715 Min. :451010 Min. :45
Class :character Class :character 1st Qu.:451020 1st Qu.:45
Mode :character Mode :character Median :451030 Median :45
Mean :451653 Mean :45
3rd Qu.:452030 3rd Qu.:45
Max. :453010 Max. :45
gsubind margin
Min. :45101010 Min. :-13.97960
1st Qu.:45102020 1st Qu.: -0.10253
Median :45103020 Median : 0.01353
Mean :45165290 Mean : -0.10967
3rd Qu.:45203012 3rd Qu.: 0.09295
Max. :45301020 Max. : 1.02655
Many useful functions are highlighted in the R Supplement
# Install the tidyverse package from inside R
install.packages("tidyverse")
# Load the package
library(tidyverse)
# To see a help page for a function (such as data.frame()) run either of:
help(data.frame)
?data.frame
# To see the arguments a function takes, run:
args(data.frame)
function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
fix.empty.names = TRUE, stringsAsFactors = FALSE)
NULL
function()
function!
my_func <- function(agruments) {code}
Simple function: Add 2 to a number
add_two <- function(n) {
n + 2
}
add_two(500)
[1] 502
mult_together <- function(n1, n2=0, square=FALSE) {
if (!square) {
n1 * n2
} else {
n1 * n1
}
}
mult_together(5,6)
[1] 30
mult_together(5,6,square=TRUE)
[1] 25
mult_together(5,square=TRUE)
[1] 25
FXRate <- function(from="USD", to="SGD", dt=Sys.Date()) {
options("getSymbols.warning4.0"=FALSE)
if (!require(quantmod)) return("Install the quantmod package for this function to work")
data <- getSymbols(paste0(from, "/", to), from=dt-3, to=dt, src="oanda", auto.assign=F)
return(data[[1]])
}
date()
[1] "Sat Jul 22 18:27:02 2023"
FXRate(from="USD", to="SGD") # Today's SGD to USD rate
[1] 1.32522
FXRate(from="SGD", to="CNY") # Today's SGD to CNY rate
[1] 5.44584
FXRate(from="USD", to="SGD", dt=Sys.Date()-90) # Last quarter's SGD to USD rate
[1] 1.33312
A picture walkthrough is on the next set of slides