Extracting meaningful information from text
Natural language processing (NLP) is the field devoted to getting computers to understand human language
Consider the following situation:
You have a collection of 1 million sentences, and you want to know which of them are relevant to accounting
Manual content analysis
Indexes
Readability
Automation
Dictionaries take the helm
Applying financial dictionaries “without modification to other media such as earnings calls and social media is likely to be problematic” (Loughran and McDonald 2016)
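A minimal Python sketch of the dictionary approach: count how often words from a word list appear in a document. The word list `NEGATIVE_WORDS` and the helper names `tokenize` and `dictionary_share` are hypothetical placeholders for illustration, not the Loughran and McDonald dictionary; a real application would load a published list and check that it suits the medium being analyzed.

```python
import re
from collections import Counter

# Hypothetical toy word list for illustration only; not a published dictionary.
NEGATIVE_WORDS = {"loss", "losses", "impairment", "decline", "adverse"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def dictionary_share(text, word_list):
    """Return (matches, total tokens, share of tokens found in word_list)."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    matches = sum(n for word, n in counts.items() if word in word_list)
    share = matches / total if total else 0.0
    return matches, total, share

sample = "The company reported a loss due to an impairment, but losses may decline."
matches, total, share = dictionary_share(sample, NEGATIVE_WORDS)
print(matches, total, round(share, 3))
```

In the sample, "losses may decline" is good news, yet both words are counted as negative. A plain count ignores context, which is one reason a word list needs to be checked against the text it is applied to.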
Fragmentation and new methods
A lot of choices
# Topics generated using R's stm package
labelTopics(topics)
## Topic 1 Top Words:
## Highest Prob: properti, oper, million, decemb, compani, interest, leas
## FREX: ffo, efih, efh, tenant, hotel, casino, guc
## Lift: aliansc, baluma, change-of-ownership, crj700s, directly-reimburs, escena, hhmk
## Score: reit, hotel, game, ffo, tenant, casino, efih
## Topic 2 Top Words:
## Highest Prob: compani, stock, share, common, financi, director, offic
## FREX: prc, asher, shaanxi, wfoe, eit, hubei, yew
## Lift: aagc, abramowitz, accello, akash, alix, alkam, almati
## Score: prc, compani, penni, stock, share, rmb, director
## Topic 3 Top Words:
## Highest Prob: product, develop, compani, clinic, market, includ, approv
## FREX: dose, preclin, nda, vaccin, oncolog, anda, fdas
## Lift: 1064nm, 12-001hr, 25-gaug, 2ml, 3shape, 503b, 600mg
## Score: clinic, fda, preclin, dose, patent, nda, product
## Topic 4 Top Words:
## Highest Prob: invest, fund, manag, market, asset, trade, interest
## FREX: uscf, nfa, unl, uga, mlai, bno, dno
## Lift: a-1t, aion, apx-endex, bessey, bolduc, broyhil, buran
## Score: uscf, fhlbank, rmbs, uga, invest, mlai, ung
## Topic 5 Top Words:
## Highest Prob: servic, report, file, program, provid, network, requir
## FREX: echostar, fcc, fccs, telesat, ilec, starz, retransmiss
## Lift: 1100-n, 2-usb, 2011-c1, 2012-ccre4, 2013-c9, aastra, accreditor
## Score: entergi, fcc, echostar, wireless, broadcast, video, cabl
## Topic 6 Top Words:
## Highest Prob: loan, bank, compani, financi, decemb, million, interest
## FREX: nonaccru, oreo, tdrs, bancorp, fdic, charge-off, alll
## Lift: 100bp, 4-famili, acnb, acquired-impair, amerihom, ameriserv, annb
## Score: fhlb, loan, bank, mortgag, risk-weight, tdrs, nonaccru
## Topic 7 Top Words:
## Highest Prob: compani, million, oper, financi, revenu, result, includ
## FREX: vmware, imax, franchise, merchandis, affinion, exhibitor, softwar
## Lift: 4.75x, 9corpor, accessdm, acvc, adtech, adxpos, aecsoft
## Score: million, product, restaur, custom, game, video, merchandis
## Topic 8 Top Words:
## Highest Prob: compani, insur, million, loss, financi, invest, rate
## FREX: policyhold, reinsur, lae, dac, annuiti, ambac, cede
## Lift: agcpl, ahccc, amcareco, argoglob, asil, connecticut-domicil, feinsod
## Score: reinsur, policyhold, lae, onebeacon, insur, million, dac
## Topic 9 Top Words:
## Highest Prob: million, compani, oper, financi, cost, product, decemb
## FREX: wafer, alcoa, pepco, dpl, nstar, usec, kcsm
## Lift: 1.5mw, 11-hour, 1ynanomet, 3dfx, 3ms, 3pd, 40g
## Score: million, product, ameren, cleco, manufactur, wafer, postretir
## Topic 10 Top Words:
## Highest Prob: gas, oper, oil, natur, million, cost, decemb
## FREX: ngl, ngls, oneok, mgp, permian, qep, wes
## Lift: 12asset, 1businesscommerci, 94-mile, aivh, amargo, amopp, angell
## Score: gas, drill, oil, ngl, crude, unithold, ngls
“The prevalence of polysemes in English – words that have multiple meanings – makes an absolute mapping of specific words into financial sentiment impossible.” – Loughran and McDonald (2011)
“[…] The use of word lists derived outside the context of business applications has the potential for errors that are not simply noise and can serve as unintended measures of industry, firm, or time period. The computational linguistics literature has long emphasized the importance of developing categorization procedures in the context of the problem being studied (e.g., Berelson [1952]).” – Loughran and McDonald (2016)
“There are problems with the face validity of the accounting readability studies. Accounting researchers have, in general, assumed that the readability formulas measure not only readability but also understandability. Indeed, readability and understandability have often been used interchangeably, the assumption being they are synonymous. However, although these concepts are related, they do differ.” – Jones and Shoemaker (1994 JAL)
The literature has not yet addressed these concerns.
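To make concrete what a readability formula of the kind discussed above measures, here is a minimal Python sketch of the Gunning Fog index: 0.4 × (average words per sentence + percentage of words with three or more syllables). The syllable counter is a crude vowel-group heuristic assumed for illustration, and the function names are hypothetical; applied work typically uses more careful syllable counts.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fog_index(text):
    """Gunning Fog index: 0.4 * (words per sentence + 100 * complex word share)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those with three or more (estimated) syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    words_per_sentence = len(words) / len(sentences)
    complex_share = 100 * len(complex_words) / len(words)
    return 0.4 * (words_per_sentence + complex_share)

simple = "We sold more goods. Sales went up."
dense = "Consolidated revenues increased following strategic acquisitions."
print(round(fog_index(simple), 2), round(fog_index(dense), 2))
```

The formula only sees sentence length and word length. It has no notion of whether the reader knows what the words mean, which is why readability and understandability can diverge.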
Tailor-made measures
Python:
spaCy
gensim
NLTK, spaCy, or hand-code using Counter() (super fast)
scikit-learn, keras, or pytorch
NLTK, spaCy
R:
stm + quanteda + convert(dfm, to='stm')
tidytext
caret, e1071, or keras