For reading large files, readr is your friend
For saving intermediate results, saveRDS() + readRDS() are your friends
If you look at the code file for this lesson, you’ll see this used extensively
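A minimal sketch of this pattern, with hypothetical file paths:

library(readr)

# readr reads large delimited files quickly
df <- read_csv("../../Data/some_big_file.csv")  # hypothetical file
# Cache an expensive intermediate result to disk...
saveRDS(df, "df_cached.rds")  # hypothetical file name
# ...and reload it in a later session without recomputing
df <- readRDS("df_cached.rds")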
summary(corp)
Text Types Tokens Sentences
1 0000002178-21-000034.txt 3906 42087 1352
2 0000002969-21-000055.txt 4848 57425 1863
3 0000003499-21-000005.txt 3413 32839 989
4 0000003570-21-000039.txt 5092 70180 1725
5 0000004127-21-000058.txt 4417 40081 1106
6 0000004281-21-000049.txt 5351 71989 2119
7 0000004457-21-000040.txt 3107 22717 785
8 0000004904-21-000010.txt 7444 160570 4711
9 0000004962-21-000013.txt 5805 82050 2155
10 0000004969-21-000009.txt 3406 35469 960
11 0000004977-21-000047.txt 5782 91119 2928
12 0000005513-21-000015.txt 5953 108414 3193
13 0000006201-21-000014.txt 5870 127350 3423
14 0000006281-21-000294.txt 4794 56351 1631
15 0000006845-21-000010.txt 3726 32487 1023
16 0000006951-21-000043.txt 4182 48369 1354
17 0000006955-21-000031.txt 4028 40131 1149
18 0000007039-21-000036.txt 3291 24652 750
19 0000007084-21-000008.txt 4849 57123 1597
20 0000007332-21-000016.txt 5395 79034 2194
21 0000007789-21-000018.txt 5806 88614 2541
22 0000008063-21-000007.txt 4202 45559 1512
23 0000008670-21-000027.txt 4669 44530 1285
24 0000008858-21-000036.txt 3238 29094 851
25 0000008868-21-000004.txt 5242 77291 2204
26 0000008947-21-000052.txt 3955 40226 1235
27 0000009346-21-000024.txt 3742 42873 1202
28 0000009984-21-000063.txt 4980 62694 2009
29 0000010048-21-000026.txt 4327 57000 1550
30 0000010254-21-000045.txt 4542 47180 1332
31 0000010795-21-000091.txt 4853 54704 1572
32 0000011544-21-000013.txt 4591 59531 1864
33 0000012208-21-000012.txt 4335 43967 1329
34 0000012927-21-000011.txt 5111 69379 2205
35 0000013239-21-000005.txt 6194 97014 2970
36 0000014272-21-000066.txt 6202 75439 2369
37 0000014693-21-000091.txt 4556 44208 1449
38 0000014707-21-000010.txt 4575 52736 1623
39 0000014846-21-000008.txt 3763 38589 1053
40 0000014930-21-000029.txt 4125 33569 997
41 0000015615-21-000006.txt 5473 79158 2292
42 0000016099-21-000047.txt 3822 45783 1411
43 0000016732-21-000103.txt 4121 47046 1473
44 0000016875-21-000008.txt 5664 67426 2220
45 0000016918-21-000060.txt 4496 56444 1581
46 0000017313-21-000075.txt 4677 80835 2203
47 0000017843-21-000030.txt 4123 50058 1569
48 0000018230-21-000063.txt 5125 63814 2071
49 0000018255-21-000004.txt 4368 36198 999
50 0000018349-21-000014.txt 5768 79631 2186
51 0000018926-21-000017.txt 5534 70434 1999
52 0000019584-21-000003.txt 2846 22490 746
53 0000019612-21-000031.txt 5655 83426 2444
54 0000019617-21-000236.txt 7701 173810 4731
55 0000020212-21-000019.txt 5249 64176 1783
56 0000020286-21-000013.txt 5624 81960 2717
57 0000021076-21-000016.txt 3316 24576 626
58 0000021175-21-000022.txt 4371 61659 1830
59 0000021344-21-000008.txt 5876 94705 2783
60 0000021510-21-000060.txt 5978 80631 2384
61 0000021665-21-000007.txt 4443 55593 1537
62 0000022356-21-000015.txt 5285 71881 2331
63 0000022444-21-000126.txt 4349 42979 1323
64 0000023194-21-000015.txt 3970 32748 1032
65 0000023197-21-000082.txt 5748 78652 2332
66 0000024090-21-000012.txt 5025 63463 1991
67 0000024491-21-000011.txt 4592 51961 1559
68 0000024545-21-000004.txt 6056 93506 2651
69 0000025232-21-000008.txt 4250 47677 1344
70 0000026172-21-000012.txt 5008 59732 1824
71 0000026324-21-000007.txt 4309 41245 1315
72 0000027419-21-000010.txt 3545 26796 864
73 0000027904-21-000003.txt 5330 71448 2558
74 0000027996-21-000033.txt 4710 63000 1955
75 0000028412-21-000054.txt 5421 76597 2245
76 0000028823-21-000059.txt 5324 59713 1638
77 0000028917-21-000066.txt 3116 22366 725
78 0000029332-21-000015.txt 3898 35472 1242
79 0000029644-21-000029.txt 3665 31719 1026
80 0000029905-21-000010.txt 4926 55720 1652
81 0000029915-21-000003.txt 3417 29762 918
82 0000029989-21-000004.txt 2727 24447 770
83 0000030625-21-000034.txt 5328 67204 1956
84 0000030697-21-000002.txt 5041 69427 1918
85 0000031791-21-000003.txt 5814 75407 2093
86 0000032604-21-000038.txt 4490 38969 1220
87 0000033185-21-000025.txt 5369 72460 2351
88 0000033213-21-000006.txt 5471 74141 2102
89 0000034067-21-000015.txt 4795 47094 1491
90 0000034088-21-000012.txt 4820 49954 1686
91 0000034563-21-000061.txt 4048 33522 980
92 0000034782-21-000036.txt 5135 55219 1807
93 0000034903-21-000020.txt 4412 49681 1361
94 0000035214-21-000006.txt 3946 45199 1335
95 0000035527-21-000100.txt 6929 143059 3952
96 0000036047-21-000018.txt 4841 60926 1680
97 0000036966-21-000012.txt 7196 120376 3636
98 0000037472-21-000023.txt 2881 21127 640
99 0000037785-21-000024.txt 5440 64611 2072
100 0000037808-21-000011.txt 6030 84205 2519
# Uses ~20GB of RAM... Break corp into chunks if RAM constrained
corp_FOG <- textstat_readability(corp, "FOG")
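If you are RAM constrained, a sketch of the chunking approach mentioned above (the 1,000-document chunk size is arbitrary):

# Compute readability 1,000 documents at a time, then combine the pieces
chunk_ids <- split(seq_len(ndoc(corp)), ceiling(seq_len(ndoc(corp)) / 1000))
corp_FOG <- do.call(rbind, lapply(chunk_ids, function(ids) {
  textstat_readability(corp[ids], "FOG")
}))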
corp_FOG %>%
head() %>%
html_df()
document | FOG |
---|---|
0000002178-21-000034.txt | 21.11264 |
0000002969-21-000055.txt | 22.01396 |
0000003499-21-000005.txt | 21.81568 |
0000003570-21-000039.txt | 24.91956 |
0000004127-21-000058.txt | 23.87785 |
0000004281-21-000049.txt | 22.83374 |
Recall that Microsoft’s annual report had a Fog index of 20.88
df_SIC <- read.csv('../../Data/Session_8-Filings2021.csv') %>%
select(accession, regsic) %>%
mutate(accession=paste0(accession, ".txt")) %>%
rename(document=accession) %>%
mutate(industry = case_when(
regsic >=0100 & regsic <= 0999 ~ "Agriculture",
regsic >=1000 & regsic <= 1499 ~ "Mining",
regsic >=1500 & regsic <= 1799 ~ "Construction",
regsic >=2000 & regsic <= 3999 ~ "Manufacturing",
regsic >=4000 & regsic <= 4999 ~ "Utilities",
regsic >=5000 & regsic <= 5199 ~ "Wholesale Trade",
regsic >=5200 & regsic <= 5999 ~ "Retail Trade",
regsic >=6000 & regsic <= 6799 ~ "Finance",
regsic >=7000 & regsic <= 8999 ~ "Services",
regsic >=9100 & regsic <= 9999 ~ "Public Admin" )) %>%
group_by(document) %>%
slice(1) %>%
ungroup()
corp_FOG <- corp_FOG %>% left_join(df_SIC)
document | FOG | regsic | industry |
---|---|---|---|
0000002178-21-000034.txt | 21.11264 | 5172 | Wholesale Trade |
0000002969-21-000055.txt | 22.01396 | 2810 | Manufacturing |
0000003499-21-000005.txt | 21.81568 | 6798 | Finance |
0000003570-21-000039.txt | 24.91956 | 4924 | Utilities |
0000004127-21-000058.txt | 23.87785 | 3674 | Manufacturing |
0000004281-21-000049.txt | 22.83374 | 3350 | Manufacturing |
quanteda bonus: References across text (Global warming)
corp_tokens <- tokens(corp) # This takes a couple hours to run
# kwic() is very fast to run though
kwic(corp_tokens, pattern = phrase("global warming"), window = 3) %>%
  as_tibble() %>%
  mutate(text = paste(pre, keyword, post)) %>%
  left_join(select(df_SIC, document, industry), by = c("docname" = "document")) %>%
  select(docname, text) %>%
  sample_n(100) %>%
  datatable(options = list(pageLength = 5), rownames = FALSE)
quanteda bonus: Mentions by industry
quanteda bonus: References across text (COVID-19)
corp_tokens <- tokens(corp) # This takes a couple hours to run
# kwic() is very fast to run though
kwic(corp_tokens, pattern = phrase(c("COVID-19", "coronavirus")), window = 3) %>%
  as_tibble() %>%
  mutate(text = paste(pre, keyword, post)) %>%
  left_join(select(df_SIC, document, industry), by = c("docname" = "document")) %>%
  select(docname, text) %>%
  sample_n(100) %>%
  datatable(options = list(pageLength = 5), rownames = FALSE)
quanteda bonus: Mentions by industry
Solutions
This is our end goal, but we’ll work our way up
Think about the structure of a matrix where rows are document names and columns are individual words. How much of this matrix will be 0s?
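The vast majority – document-feature matrices are extremely sparse. Once the dfm (tdm below) is built, quanteda can report the exact share:

# Proportion of cells in the document-feature matrix that are zero
sparsity(tdm)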
Use dfm() to construct the document-feature matrix
Pipe the dfm() output to dfm_remove() to remove stopwords
  remove=stopwords() for a simple list
  remove=stopwords(source='smart') for a more thorough list
  The stopwords() function is provided by the stopwords package, and actually supports over 50 languages, including Chinese, English, Hindi, and Malay, e.g., remove=stopwords("zh", source="stopwords-iso")
  remove=c(...): you can supply your own list of stop words to remove
dfm_trim() can filter out infrequent terms
These functions work on tokens() output as well
Use tokens_wordstem() for stemming
tokens() has the options remove_punct=T and remove_numbers=T too
# Simplest way
tdm <- dfm(corp_tokens)
# With stopwords
tdm <- dfm(corp_tokens) %>%
  dfm_remove(stopwords(source='smart'))

# With stopwords and stemming -> Used in next slides
# 683M elements in the output
corp_tokens2 <- tokens(corp_tokens, remove_punct=TRUE, remove_numbers=TRUE) %>%
  tokens_wordstem()
tdm <- dfm(corp_tokens2) %>%
  dfm_remove(stopwords(source='smart')) %>%
  dfm_trim(min_termfreq=10, termfreq_type = "count")
# adding industry to the tdm
docs <- docnames(corp)
docs <- data.frame(document=docs)
docs <- docs %>% left_join(df_SIC)
docvars(tdm, field="industry") <- docs$industry
topfeatures(tdm, n=5, groups="industry")
$Agriculture
compani $ oper financi year
9223 8862 6112 5829 5317
$Construction
$ oper compani financi million
14917 11563 11447 11268 10931
$Finance
loan compani busi $ financi
466138 450468 424405 423439 360063
$Manufacturing
product compani $ includ financi
690259 536844 498176 411262 362766
$Mining
oper compani $ financi oil
68529 66552 66374 48157 45090
$`Retail Trade`
compani $ oper financi includ
60697 60076 53995 46448 44738
$Services
compani $ oper financi busi
250136 234430 190355 188140 181628
$Utilities
$ oper compani includ million
109619 96571 88532 80891 79997
$`Wholesale Trade`
compani $ oper financi product
32805 30518 23833 23060 21674
This isn’t very informative – the same common words top every industry
The intuition behind tf-idf: \[ \frac{\text{How many times a word is in the document}}{\text{How many documents the word is in}} \]
Use dfm_tfidf() from quanteda, which with the options below computes
\[ \frac{f_{w,d}}{f_d} \cdot -\log_2\left(\frac{n_w}{N}\right) \]
where \(f_{w,d}\) is the count of word \(w\) in document \(d\), \(f_d\) is the total number of words in \(d\), \(n_w\) is the number of documents containing \(w\), and \(N\) is the total number of documents
tfidf_mat <- dfm_tfidf(tdm, base=2, scheme_tf="prop")
topfeatures(tfidf_mat, n=5, groups=industry)
$Agriculture
cannabi prc avocado yew uspb
0.2668476 0.2599917 0.2108610 0.1990909 0.1921867
$Construction
homebuild 2020-12-31 2019-12-31 home ck1723866
0.4848714 0.2985789 0.2360784 0.2351432 0.2049281
$Finance
mortgag fargo ab 2020-12-31 2019-12-31
22.799752 14.987289 13.155708 11.641575 7.004365
$Manufacturing
clinic fda trial 2020-12-31 patient
12.176848 8.397263 8.002860 6.812589 6.555764
$Mining
oil gas drill miner gold
3.567497 3.309704 2.865018 1.945640 1.350271
$`Retail Trade`
restaur 2021-01-30 merchandis franchise store
2.1737584 1.1272627 1.0692899 0.9448629 0.8498510
$Services
2020-12-31 game 2019-12-31 client 2020-01-01
2.650871 2.199351 1.852566 1.701759 1.451234
$Utilities
2020-12-31 2019-12-31 gas 2020-01-01 2018-12-31
2.8123115 2.0785280 1.5214459 1.2668153 0.8849584
$`Wholesale Trade`
2020-12-31 2019-12-31 graybar 2021-09-30 hsiao
0.4319546 0.3064838 0.2765405 0.2624720 0.2567148
These terms are often more meaningful
What type of algorithm is LDA?
{stmBrowser}: package for visualization (on Github)
{lda}: A somewhat rigid package with difficult setup syntax, but it plays nicely with the great LDAvis package for visualizing models. Supported by quanteda.
{topicmodels}: An extensible topic modeling framework that plays nicely with quanteda
# quanteda's conversion for the stm package
out <- convert(tdm, to = 'stm')
# quanteda's conversion for the lda package
# out <- convert(tdm, to = 'lda')
# quanteda's conversion for the topicmodels package
# out <- convert(tdm, to = 'topicmodels')
out$documents: Index numbers for each word, with counts, per document
out$vocab: Words and their index numbers
out$meta: a data frame of information from the corpus (industry)
out$documents[[1]][,386:390]
[,1] [,2] [,3] [,4] [,5]
[1,] 23097 23101 23124 23144 23153
[2,] 2 2 1 3 89
out$vocab[c(out$documents[[1]][,386:390][1,])]
[1] "consult" "consum" "consumpt" "contamin" "content"
Use the stm() function from the stm package
We need to specify K, the number of topics we want. I’ll use 10 for simplicity, but often we need more to neatly categorize the text
  K=100 is a popular choice when we are using the output of LDA as an input to another model
  K=31 has also been used, as that captures the most restatements in-sample
labelTopics(topics)
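The fitted object topics used above isn’t created in the code shown on these slides; a minimal sketch of the presumable fitting call, using stm’s defaults:

library(stm)

# Fit a 10-topic model on the converted dfm; K matches the discussion above
topics <- stm(out$documents, out$vocab, K = 10)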
Topic 1 Top Words:
Highest Prob: 2020-12-31, 2019-12-31, 2020-01-01, 2018-12-31, 2019-01-01, 2018-01-01, decemb
FREX: nnn:operatingleasememb, vtr:seniorshousingcommunitiesmemb, fcpt:olivegardenmemb, wpc:realestatesubjecttooperatingleasememb, exc:exelongenerationcollcmemb, ess:unencumberedapartmentcommunitiesmemb, kim:shoppingcentermemb
Lift: adc:seniorunsecureddebtmemb, aegco, aep:amortizationofdeferredcostsmemb, aep:changesinfundedstatusmemb, aep:excessaditthatisnotsubjecttoratenormalizationrequirementsmemb, aep:ohiopowercomemb, aep:publicservicecoofoklahomamemb
Score: 2020-12-31, 2019-12-31, 2020-01-01, 2018-12-31, 2019-01-01, 2018-01-01, nnn:operatingleasememb
Topic 2 Top Words:
Highest Prob: servic, loan, mortgag, exhibit, report, bank, nation
FREX: corelog, pentalpha, dbtca, dbntc, lnr, ncmslt, cwcapit
Lift: ikb, -1122, #39, #41, 2013-c10, 2013-c11, 2013-c12
Score: mortgag, pentalpha, dbtca, dbntc, fargo, cwcapit, corelog
Topic 3 Top Words:
Highest Prob: servic, oper, busi, result, includ, financi, custom
FREX: telesat, telesat’, zoominfo, apus, hcn, sciplay, airtoken
Lift: 001-32877, 704game, 90-10, accd:commonstockwarrantsmemb, acuriti, addus, adt:firstliencreditagreementmemb
Score: custom, revenu, cost, increas, softwar, client, game
Topic 4 Top Words:
Highest Prob: compani, $, stock, share, financi, common, year
FREX: ncm, hemp-deriv, auctus, yew, zander, appliancesmart, indus’s
Lift: 280e, adnani, ankang, baod, bellridg, biofirst, biohealth
Score: company’, cannabi, compani, mr, penni, board, stock
Topic 5 Top Words:
Highest Prob: invest, compani, interest, $, ani, asset, valu
FREX: uscf, pre-incent, ck11, bdcs, cik22, unl, bno
Lift: euroccp, sdci, -0.5x, -1x, -2x, 001-32975, 001-34057
Score: invest, reit, adviser, ck11, uscf, tenant, bdc
Topic 6 Top Words:
Highest Prob: oper, $, million, cost, includ, financi, gas
FREX: ferc, entergi, ngls, cleco, ameren, sce, pbf
Lift: _________________________________________________________________________________________, —pge, #2016-n01, 000-00404, 001-05324, 001-37419, 1-06196
Score: oil, gas, drill, ferc, cost, ngls, energi
Topic 7 Top Words:
Highest Prob: product, develop, clinic, ani, approv, includ, requir
FREX: preclin, nda, tumor, cgmp, post-market, biosimilar, placebo
Lift: —licens, 0.3mg, 03-133, 1-caart, 1-gram, 11-point, 16-posit
Score: clinic, fda, patent, preclin, patient, nda, trial
Topic 8 Top Words:
Highest Prob: busi, combin, share, ani, initi, compani, public
FREX: post-busi, founder, pre-initi, kaleyra, blank, consumm, worthless
Lift: 14e-5, 3—initial, 340-10-s99-1, 5am’s, 7gc, acpgp, alberton’
Score: warrant, combin, busi, target, initi, founder, redempt
Topic 9 Top Words:
Highest Prob: $, million, compani, oper, product, includ, financi
FREX: 2019-12-29, 2019-12-28, 2021-01-02, 2018-12-30, 2021-10-31, 2021-01-30, 2018-12-29
Lift: 000-06920, 1-12936, 2015-10-27, 2017-11-26, 2018-04-28, 2018-05-27, 2018-11-25
Score: us-gaap:pensionplansdefinedbenefitmemb, custom, cost, 2021-01-02, 2019-12-29, 2019-12-28, product
Topic 10 Top Words:
Highest Prob: loan, $, financi, compani, bank, loss, rate
FREX: fhlb, us-gaap:residentialportfoliosegmentmemb, tdrs, cet1, cblr, alll, four-famili
Lift: 000-32891, 001-33572, 001-33912, 001-34042, 001-35593, 09-4, 30-59
Score: us-gaap:commercialportfoliosegmentmemb, fhlb, us-gaap:consumerportfoliosegmentmemb, loan, us-gaap:commercialrealestateportfoliosegmentmemb, mortgag, rate
Highest Prob is a straightforward measure to interpret
out$meta$industry <- factor(out$meta$industry)

doc_topics = data.frame(document=names(out$documents),
                        industry=out$meta$industry,
                        topic=1,
                        weight=topics$theta[,1])
for (i in 2:10) {
  temp = data.frame(document=names(out$documents),
                    industry=out$meta$industry,
                    topic=i,
                    weight=topics$theta[,i])
  doc_topics = rbind(doc_topics, temp)
}
# Proportional topics (%)
doc_topics <- doc_topics %>%
group_by(document) %>%
mutate(topic_prop = weight / sum(weight)) %>%
ungroup()
# Manually label topics
topic_labels = data.frame(topic = 1:10,
topic_name = c('XBRL', 'Banking', 'Services', 'Equity',
'Investment', 'Energy', 'R&D',
'Compensation', 'Financial', 'Debt'))
doc_topics <- doc_topics %>% left_join(topic_labels)
{stm}’s toLDAvis() function
# Code to generate LDAvis
toLDAvis(topics, out$documents, R=10)
{stmBrowser}’s stmBrowser() function
# Code to generate stmBrowser
stmBrowser(topics, data=data.frame(text=names(out$documents),
                                   industry=out$meta$industry),
           c('industry'), text='text')
# A tibble: 10 × 6
document industry topic weight topic_prop topic_name
<chr> <fct> <dbl> <dbl> <dbl> <chr>
1 0001564590-21-039151.txt Services 1 0.00488 0.00488 XBRL
2 0001564590-21-039151.txt Services 2 0.0000168 0.0000168 Banking
3 0001564590-21-039151.txt Services 3 0.814 0.814 Services
4 0001564590-21-039151.txt Services 4 0.000219 0.000219 Equity
5 0001564590-21-039151.txt Services 5 0.000164 0.000164 Investment
6 0001564590-21-039151.txt Services 6 0.0000879 0.0000879 Energy
7 0001564590-21-039151.txt Services 7 0.00116 0.00116 R&D
8 0001564590-21-039151.txt Services 8 0.000330 0.000330 Compensation
9 0001564590-21-039151.txt Services 9 0.177 0.177 Financial
10 0001564590-21-039151.txt Services 10 0.00158 0.00158 Debt
doc_topics %>%
filter(document=='0001564590-21-039151.txt' |
document=='0000320193-21-000105.txt') %>%
mutate(Company=ifelse(document=='0001564590-21-039151.txt', 'MSFT','AAPL')) %>%
ggplot(aes(x=factor(topic_name), y=topic_prop, fill=factor(topic_name))) +
geom_col() + facet_wrap(~Company) +
theme(axis.text.x=element_blank(),axis.ticks.x = element_blank())
How might we leverage LDA (or other topic modeling methods) to improve and simplify analytics?
We’ll build a different classification system, based on what firms discuss in their annual reports
\[ \min_{C_1,\ldots,C_K} \sum_{k=1}^{K} \sum_{x_i\in C_k} \left\lVert x_i - \mu_k\right\rVert^2 \]
library(tidyr)
# Reshape to one row per document, one column per topic
wide_topics <- spread(doc_topics[,c(1,2,5,6)], topic_name, topic_prop)
# Note: dropping XBRL here
mat <- wide_topics[,3:11]
mat[,1:6] %>% head(n=3) %>% html_df(highlight_cols = c())
Banking | Compensation | Debt | Energy | Equity | Financial |
---|---|---|---|---|---|
0.0000806 | 0.0007570 | 0.0045723 | 0.6573965 | 0.0012891 | 0.2155210 |
0.0000057 | 0.0000372 | 0.0000445 | 0.1259373 | 0.0000565 | 0.8653703 |
0.0372616 | 0.0004645 | 0.0083611 | 0.1996815 | 0.0501601 | 0.0380236 |
set.seed(6845868)
clusters <- kmeans(mat, 9)
# Add clusters back into our data
wide_topics$kmean <- clusters$cluster
clusters$cluster %>% head()
[1] 4 3 5 4 3 3
kmeans() is built into R – no packages needed
cbind(as.data.frame(clusters$center), data.frame(kmean=1:9)) %>%
  gather("Topics", "weights", -kmean) %>%
  ggplot(aes(x=factor(Topics), y=weights, fill=factor(Topics))) +
  geom_col() +
  facet_wrap(~kmean) +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
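The plots below use umap1 and umap2 coordinates that aren’t computed in the code shown; a sketch of one way to add them, assuming the {uwot} package:

library(uwot)

# Project the 9 topic weights down to 2 dimensions for visualization
um <- umap(as.matrix(mat))  # returns an n x 2 matrix by default
wide_topics$umap1 <- um[, 1]
wide_topics$umap2 <- um[, 2]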
Colored by SIC codes
ggplot(wide_topics,
aes(x = umap1, y = umap2,
color = industry)) +
geom_point(alpha = 0.3) +
theme_bw()
Colored by kmeans
ggplot(wide_topics,
aes(x = umap1, y = umap2,
color = factor(kmean))) +
geom_point(alpha = 0.3) +
theme_bw()
ggplot(wide_topics, aes(x=kmean)) + geom_bar() + facet_wrap(~factor(industry))
wide_topics$dist <- sqrt(rowSums(abs(mat - fitted(clusters)))) # Distance from assigned cluster center (square root of L1 distance)
wide_topics[,c(1,2,4,8,16)] %>% arrange(desc(dist)) %>% slice(1:5) %>% html_df()
document | industry | Compensation | Financial | dist |
---|---|---|---|---|
0001104659-21-044134.txt | Finance | 0.9995782 | 2.8e-06 | 1.156605 |
0001104659-21-043939.txt | Finance | 0.9995650 | 3.4e-06 | 1.156594 |
0001140361-21-010304.txt | Finance | 0.9995592 | 3.7e-06 | 1.156589 |
0001193125-21-098450.txt | Services | 0.9995478 | 3.4e-06 | 1.156579 |
0001193125-21-102380.txt | Finance | 0.9995064 | 3.9e-06 | 1.156544 |
We are a blank check company incorporated on ________ as a Cayman Islands exempted company for the purpose of effecting a merger, share exchange, asset acquisition, share purchase, reorganization or similar business combination with one or more businesses or entities (a “Business Combination”). All 5 files…
wide_topics[,c(1,2,4,8,16)] %>%
filter(industry!="Finance") %>%
arrange(desc(dist)) %>%
mutate(id=1:n()) %>%
select(id,everything()) %>%
slice(1:7) %>%
html_df()
id | document | industry | Compensation | Financial | dist |
---|---|---|---|---|---|
1 | 0001193125-21-098450.txt | Services | 0.9995478 | 3.40e-06 | 1.156579 |
2 | 0001193125-21-092793.txt | Manufacturing | 0.9988654 | 1.19e-05 | 1.155990 |
3 | 0001193125-21-100874.txt | Services | 0.9963810 | 1.17e-05 | 1.153839 |
4 | 0001140361-21-010411.txt | Services | 0.9778598 | 2.05e-05 | 1.153371 |
5 | 0001104659-21-031725.txt | Manufacturing | 0.9770098 | 2.31e-05 | 1.152625 |
6 | 0001213900-21-013228.txt | Manufacturing | 0.9944066 | 5.28e-05 | 1.152517 |
7 | 0001213900-21-010315.txt | Services | 0.9941140 | 4.19e-05 | 1.151873 |
All: Yet more SPACs, just with the wrong industry in their filings…
How many SPACs are there?
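One rough way to count, leaning on the SPAC-heavy topic seen above (the 0.5 cutoff is arbitrary):

# Filings that load heavily on the blank-check/SPAC-style topic
sum(wide_topics$Compensation > 0.5, na.rm = TRUE)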
wide_topics[,c(1,2,4,8,16)] %>%
filter(industry!="Finance", Compensation < 0.5) %>%
arrange(desc(dist)) %>%
mutate(id=1:n()) %>%
select(id,everything()) %>%
slice(1,2,3,8,9,10) %>%
html_df()
id | document | industry | Compensation | Financial | dist |
---|---|---|---|---|---|
1 | 0001731122-21-000373.txt | Construction | 0.4180281 | 0.0405295 | 1.0988544 |
2 | 0001628280-21-000722.txt | Construction | 0.0004643 | 0.2485153 | 1.0749207 |
3 | 0001654954-21-004244.txt | Services | 0.0120396 | 0.2998639 | 1.0438606 |
8 | 0001410578-21-000612.txt | Construction | 0.0092240 | 0.3515138 | 1.0067509 |
9 | 0001564590-21-009825.txt | Services | 0.0003849 | 0.1270724 | 0.9992496 |
10 | 0001712923-21-000017.txt | Services | 0.0101671 | 0.0004910 | 0.9990207 |
Text based industry classification using 10-Ks has been shown to be quite viable, such as in work by Hoberg and Phillips.
What else could we use clustering to solve?
Recall that some filings’ industry measure was NA – we can use kNN on the topic weights to fill these in
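The train/label split isn’t shown in the slides’ code; a plausible sketch:

# 'train': filings with a known industry; 'label': filings where industry is NA
train <- wide_topics %>% filter(!is.na(industry))
label <- wide_topics %>% filter(is.na(industry))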
library(caret)
trControl <- trainControl(method='cv', number=20)
tout <- train(industry ~ .,
              method = 'knn',
              tuneGrid = expand.grid(k=1:20),
              trControl = trControl,
              metric = "Accuracy",
              data = train[,c(2:11)])
tout
k-Nearest Neighbors
6742 samples
9 predictor
9 classes: 'Agriculture', 'Construction', 'Finance', 'Manufacturing', 'Mining', 'Retail Trade', 'Services', 'Utilities', 'Wholesale Trade'
No pre-processing
Resampling: Cross-Validated (20 fold)
Summary of sample sizes: 6403, 6404, 6406, 6404, 6405, 6407, ...
Resampling results across tuning parameters:
k Accuracy Kappa
1 0.7097283 0.6108027
2 0.7049892 0.6046680
3 0.7312486 0.6373751
4 0.7450382 0.6538656
5 0.7471123 0.6556158
6 0.7560251 0.6668534
7 0.7561713 0.6664935
8 0.7528979 0.6618832
9 0.7560264 0.6655262
10 0.7555822 0.6646337
11 0.7561713 0.6651377
12 0.7589815 0.6687527
13 0.7551291 0.6631862
14 0.7583906 0.6676277
15 0.7594323 0.6688836
16 0.7606157 0.6705087
17 0.7591355 0.6684971
18 0.7576532 0.6664638
19 0.7582510 0.6672672
20 0.7589906 0.6682744
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 16.
Visualizing accuracy by k
ggplot(tout$results, aes(x=k, y=Accuracy)) +
  geom_line() +
  geom_ribbon(aes(ymin=Accuracy - AccuracySD*1.96,
                  ymax=Accuracy + AccuracySD*1.96), alpha=0.2) +
  geom_vline(xintercept=16, color="blue") +
  xlab("k, optimal = 16")
label$industry_pred <- predict(tout, label)
label[,c("document", "industry_pred")] %>%
  head() %>%
  html_df()
document | industry_pred |
---|---|
0000017313-21-000075.txt | Finance |
0000052827-21-000035.txt | Manufacturing |
0000820027-21-000014.txt | Finance |
0000837465-21-000003.txt | Manufacturing |
0001017386-21-000166.txt | Services |
0001047469-21-000783.txt | Finance |
Today, we:
Processed a large corpus of annual reports with quanteda
Measured readability with textstat_readability()
Found industry-distinctive language with tf-idf
Extracted topics with LDA via the stm package
Grouped filings with k-means and filled in missing industry labels with kNN