Understanding Sentiment Through Financial Context

With Franco Wong


November 2022


Dr. Richard M. Crowley

Slides: https://rmc.link/Rotman \(\cdot\) @prof_rmc


Motivation

Motivation

What is sentiment?

  • More explicitly: What does text sentiment measure in a financial context?
  • More pointedly: What is the theoretical construct underlying sentiment?
    • This is a tricky question:
      • Sentiment from a dictionary is applied in a bag-of-words fashion
      • Most words in financial sentiment are difficult to interpret in isolation

Sentiment words’ underlying context is the key to answering the above questions.


What do we mean by context?

Cambridge dictionary: “The text or speech that comes immediately before and after a particular phrase or piece of text and helps to explain its meaning”

  • What is the surrounding text around the word of interest?
    • E.g., is the text about accounting policies, operating performance, contracting, etc.?

A motivating example

  • If asked what a single sentiment word means in a document, a logical approach is to examine the text it came from

Word: Loss, LM sentiment: Negative

  • Actual extractions from 10-K MD&As:
    1. “[…]net loss was due to $4.7 million goodwill write off[…]”
      • Makes sense
    2. “[…]loss ratio decreased to 43% as result of segment’s adherence to underwriting guidelines to claims[…]”
      • Opposite…
    3. “[…]functional coffees dedicated to weight loss[…]”
      • Unrelated…
    4. “[…]loan loss provision totaled $4.6 million in 2008 compared[…]”
      • Loan loss provision is a frequent mishit

A methodological contribution of our study is asking this en masse to derive context

What we do

  • Method
    1. Identify any clause of text that appears to contain content (from a linguistic perspective)
    2. Identify all contexts in the text using an unsupervised approach
    3. Assign each identified clause to a context
  • Empirics
    1. How does dictionary sentiment depend on context?
    2. Do prior results using financial sentiment hold across contexts?
      • How stable is the context dependency of sentiment across DVs?
    3. Replicate results using other sentiment approaches

Why? To understand what financial sentiment captures and if it is empirically consistent

Main results

  1. Only a few key contexts drive each financial sentiment result
    • Aggregation to document-level sentiment adds a lot of noise
  2. Sentiment, at the context level, often contradicts prior results
    • Aggregation removes nuance from our understanding
  3. Different contexts drive prediction for different DVs
    • Sentiment captures different empirical constructs in different regressions
  4. The above results hold across 2 other financial sentiment dictionaries
    • Our results are not unique to the LM dictionary
  5. The above results hold using a neural network-based sentiment measure
    • Bag-of-words isn’t the problem – financial sentiment, as a construct, likely is

Punchline: Sentiment should be measured on fine-grained contexts, not full documents

  • In other words, a precise matching between the text used and the economic question examined is needed

Measuring context

The idea

  • Our goal is to replicate a natural approach that one would take to identify contexts by hand:
    1. Take a reference clause
    2. Look to see what the clause is about (the “context”)
    3. Assign the clauses into logical groupings of contexts
    4. After: Interpret sentiment of a clause within context

In order to better understand context and its link to sentiment, we will examine a broad set of contexts spanning all MD&A content

Illustration of extracting context

“The company’s earnings increased by 5% due to an improvement in operating efficiency.”

  • (company; has; earnings)
  • (company’s earnings; increased by; 5%)
  • (company’s earnings; increased due; improved operating efficiency)
  • (company’s earnings; increased due; operating efficiency)


Extracting context

Automating with Stanford Open IE

  • Open IE is an open information extraction algorithm
  • Generates triples of context of the form (subject; relation verb; object)
  • Multi-step algorithm:
    1. Creates the dependency parse tree
    2. Resolves any co-references (“it,” “her,” etc.)
    3. Determines clause boundaries (multinomial logistic model)
    4. Determines triples within each clause (linguistic patterns)

This nets 179,703,756 extractions

Cutting this down a bit

  • Some are superfluous as we saw earlier.
    • More likely to happen with longer sentences
  • Approach: Keep the shortest extractions such that…
    1. We cover as much of the sentence as possible without having nested extractions
    2. We don’t drop words from LM
    3. We don’t drop accounting content

This cuts out 73% \(\Rightarrow\) still have 48,576,229 extractions

“Accounting content”

Some shared words: collateral, specialist, hedge, debit, inventory

We preprocess the above to make them machine-usable

  1. Drop any words with fewer than 4 characters, e.g., “A”
  2. Remove stopwords from the dictionary
  3. Restrict to only single words (to match LM structure), manually review unmatched phrases
  4. Inflect words using the python word_forms library
    • E.g., given “periodicity,” also match “periodicities” and “periodic”
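The first three filtering steps above can be sketched in a few lines of Python. The `preprocess_dictionary` helper and its stopword list are hypothetical stand-ins, and the inflection step via the word_forms library is omitted:

```python
# Sketch of the dictionary preprocessing steps (hypothetical helper names;
# the actual pipeline also inflects words with the word_forms library).
STOPWORDS = {"the", "and", "for", "with"}  # stand-in stopword list

def preprocess_dictionary(terms):
    """Filter a raw accounting dictionary down to usable single words."""
    kept = set()
    for term in terms:
        words = term.lower().split()
        if len(words) != 1:       # step 3: restrict to single words
            continue
        word = words[0]
        if len(word) < 4:         # step 1: drop words under 4 characters
            continue
        if word in STOPWORDS:     # step 2: drop stopwords
            continue
        kept.add(word)
    return kept

print(preprocess_dictionary(["Hedge", "the", "debit", "loan loss provision", "A"]))
```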

Matching extractions together

  • Two step approach:
    1. Map all extractions to a 512 dimension vector space that represents underlying meaning using Universal Sentence Encoder (USE; Cer et al. 2018)
      • We mask out certain tokens that USE tends to focus on too much
        • Dates, times, dollar amounts, percentages, quantities, and ordinals
    2. Cluster within the 512-dim vector space with Mini-Batch K-means (Sculley 2010)
      • Optimize this with Tibshirani et al. 2001’s Gap statistic
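A minimal sketch of this two-step pipeline, assuming scikit-learn is available. Random vectors stand in for the 512-dimension USE embeddings (loading USE requires tensorflow_hub and a model download), and `mask_tokens` is a simplified, hypothetical version of the masking step:

```python
import re
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def mask_tokens(clause):
    """Replace dollar amounts and percentages with placeholder tokens,
    roughly as described above (the paper also masks dates, times, etc.)."""
    clause = re.sub(r"\$[\d,.]+(?:\s*(?:million|billion))?", "<MONEY>", clause)
    clause = re.sub(r"[\d.]+%", "<PERCENT>", clause)
    return clause

clauses = ["net loss was due to $4.7 million goodwill write off",
           "loss ratio decreased to 43% as a result of underwriting guidelines"]
masked = [mask_tokens(c) for c in clauses]

# Stand-in for Universal Sentence Encoder output (one 512-dim vector per clause)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(masked), 512))

# Step 2: cluster in the 512-dim space
km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(embeddings)
```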

How does USE work?

  • Input: Words mapped to embeddings, word order
  • Processing: Transformer-based neural network (uses “attention”)
  • Output: A sentence embedding of 512 dimensions

USE abstracts away from word choice!

How does Mini-Batch K-means work?

  • Input: USE vectors
  • Algorithm: K-means (in principle)
  • Output: Cluster centers and convergence info (Inertia)

Why Mini-Batch?

  • Mini-Batch K-means is an online version of K-means
    • Online means we can iteratively add new data to update the model
  • I.e., instead of feeding 48+M observations at once, we can do 1M at a time
    • Much more memory efficient
      • K-means is a one shot algorithm, and thus processes all data at once
        • It would need multiple 48M x 512 matrices – ~300GB in total
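The online updating can be sketched with scikit-learn's `MiniBatchKMeans.partial_fit`, feeding one batch at a time rather than the full matrix; batch sizes here are toy-scale stand-ins for the 1M-row batches described above:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
km = MiniBatchKMeans(n_clusters=5, random_state=0)

# Each loop iteration mimics loading one batch of USE vectors from disk,
# so only one batch needs to be in memory at a time
for _ in range(10):
    batch = rng.normal(size=(1000, 512))  # stand-in for 1M real vectors
    km.partial_fit(batch)

print(km.cluster_centers_.shape)  # (5, 512)
```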

How does the Gap statistic work?

  • Let…
    • \(k\) be the number of clusters,
    • \(B\) the number of simulated samples
    • \(W_k\) be the K-Means inertia score on actual data
    • \(W_{k,r}^*\) be the K-Means inertia score for iteration \(r\) with synthetic data
    • \(\bar{l}\) be the average of the \(\text{log}\left(W_{k,r}^*\right)\)s

\[ \begin{aligned} Gap(k) &= \left(\frac{1}{B}\right) \sum_{r=1}^{B} \text{log}\left(W_{k,r}^*\right)-\text{log}\left(W_k\right),\text{ and}\\ s_k &= sd_k \sqrt{1+\frac{1}{B}},\text{ where }sd_k = \sqrt{\left(\frac{1}{B}\right)\sum_{r=1}^{B}\left\{\text{log}\left(W_{k,r}^*\right)-\bar{l}\right\}^2} \end{aligned} \]

  • Select the lowest \(k\) such that \(Gap(k) \ge Gap(k+1) - s_{k+1}\)

I.e., select the lowest \(k\) s.t. the log-scaled error removed by clustering on real data at \(k\) is no worse than 1 SD below the log-scaled error removed at \(k+1\)
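A minimal sketch of the Gap computation on toy 2-D data, assuming scikit-learn. Reference samples are drawn uniformly over the data's bounding box, one of the reference distributions suggested by Tibshirani et al. (2001):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, B=10, seed=0):
    """Return Gap(k) and s_k for data X, using B uniform reference samples."""
    rng = np.random.default_rng(seed)
    # log(W_k): log inertia on the actual data
    log_Wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    # log(W*_{k,r}): log inertia on uniform draws over X's bounding box
    log_Wk_star = np.array([
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(
            rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)).inertia_)
        for _ in range(B)
    ])
    gap = log_Wk_star.mean() - log_Wk                # Gap(k)
    s_k = log_Wk_star.std() * np.sqrt(1 + 1 / B)     # simulation-error correction
    return gap, s_k

# Toy data: two tight, well-separated clusters, so Gap(2) should exceed Gap(1)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(10, 0.5, size=(50, 2))])
gap1, _ = gap_statistic(X, 1)
gap2, s2 = gap_statistic(X, 2)
# Selection rule: pick the lowest k with Gap(k) >= Gap(k+1) - s_{k+1}
```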

Gap simulation

13 is the lowest \(k\) s.t. the log-scaled error removed by clustering on real data at \(k\) is no worse than 1 SD below the log-scaled error removed at \(k+1\) (red circle)

Caveat from Tibshirani et al. (2001): the initial optimal \(k\) may be too small in more varied text; we use the first \(k\) that is optimal compared to \(k+1\) and \(k+2\): 137 (blue circle)

Context examples

Accounting

  • Accounting policies
    • Accounting assumptions
    • Revenue Recognition
    • Tax
  • Accounting standards
    • Accounting standards
    • New accounting standard
  • General and Balance sheet discussion
    • Cash flows
    • Deferred tax
  • Income statement discussion
    • Accounting losses
    • Depreciation and amortization

Business operations

  • Debt, Equity, and Investment
    • Financing
    • Loans
  • Expectations and future
    • Management expectations
    • Risk factor disclosures
  • Macroeconomics
    • Interest rates
    • Market risk
  • Operations
    • Growth
  • Structure
    • Subsidiaries

Context examples

Changes

  • Changes in sales
  • Changes in expenses
  • Changes in operating measures
  • Declines in value or performance
  • Increase in expenses
  • Increases in income or revenue

Ungrouped

  • Grammatical patterns
    • Company information with name
    • Modal weak statements
  • Timeframes
    • Dates
    • Reporting periods
  • Unrelated statements
    • Unrelated statements 1-6
  • Unrelated statements with specific words
    • “Company” + unrelated statements 1-4

Validation work

  1. Overlap of original extractions with accounting dictionaries:
    • 95.2% contain at least 1 word in Campbell Harvey’s dictionary
    • 84.8% contain at least 1 word in the NYSSCPA dictionary
  2. Intrusion task
    • Take 3 clauses from 1 context and an “intruder” from another
    • Can you tell which is the intruder?
      • E.g.:
        1. average market rate is in effect
        2. price swings are due to commodity costs
        3. net sales impact is in same store sales
        4. Volatility is in commodity prices
    • 4 RAs average 86% on the task; 500 questions each
  3. Regress MD&A sentiment on clusters conditional on sentiment
    • 82.3% (68.6%) of variation captured for negative (positive) sentiment

Empirical approach

Data

  • All 10-K and 10-K405 MD&A sections to build the text model
    • 107,596 MD&As
    • 48,576,229 extractions
  • Only MD&As meeting additional data requirements are used for empirical tests
    • 35,362 MD&As
    • 22,669,186 extractions
  • Loughran McDonald sentiment from their 10X File summaries file
  • MD&A LM sentiment based on the 10-K parser from Brown, Crowley and Elliott (2020) (BCE)
    • The BCE parser has Pearson correlations \(>80\%\) for full text sentiment measures with LM
  • Accounting data from Compustat
  • Stock data from CRSP
  • Material weaknesses from Audit Analytics

Empirics sketch

Three regression structures used throughout

  1. To examine how sentiment relates to context
    • \(Sentiment_{f,t} = \alpha + \sum_{i=1}^{137}\beta_i Context_{i,f,t} + \gamma \cdot Controls_{f,t} + \delta \cdot Industry~FE+\varepsilon\)
      • Run using a LASSO regression
  2. To replicate results from Loughran McDonald (2011)
    • \(DV_{f,t} = \alpha + \beta_0 Sentiment_{f,t} + \gamma \cdot Controls_{f,t} + \delta \cdot Industry~FE+\varepsilon\)
      • Run using a linear regression
  3. To partition the replication on context
    • \(DV_{f,t} = \alpha + \sum_{i=1}^{137}\beta_i Sentiment_{Context,i,f,t} + \gamma \cdot Controls_{f,t} + \delta \cdot Industry~FE+\varepsilon\)
      • Run using a LASSO regression

LASSO regression

  • Least Absolute Shrinkage and Selection Operator

LASSO replaces the standard OLS minimization problem of \(\min_{\beta,\gamma,\delta} \frac{1}{N} |\varepsilon|_2^2\) with:

\[ \min_{\beta,\gamma,\delta} \frac{1}{N} |\varepsilon|_2^2 + \lambda \sum_{b\in \{\beta, \gamma, \delta\}} \left|b\right|_1 \]

  • \(\lambda\) is a penalty term that needs to be optimized
    • We do this with 10-fold cross-validation
  • LASSO is the same as adding an \(L^1\) penalty to a regression!
    • Standard technique for dealing with overly high VIFs

Derive p-values using Post-LASSO estimator (Belloni and Chernozhukov 2013)
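A sketch of this step with scikit-learn, on synthetic data where only the first 3 of 50 regressors carry signal: `LassoCV` picks the penalty (\(\lambda\), scikit-learn's `alpha`) by 10-fold cross-validation, and the Post-LASSO re-fit is an unpenalized OLS on the selected regressors (the p-value machinery of Belloni and Chernozhukov 2013 is omitted):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

# Synthetic data: only the first 3 of 50 regressors matter
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
beta = np.zeros(50)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=500)

# LASSO with the penalty chosen by 10-fold cross-validation
lasso = LassoCV(cv=10, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # regressors the L1 penalty keeps

# Post-LASSO: re-estimate OLS on the selected set only
post = LinearRegression().fit(X[:, selected], y)
```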

Results

Sentiment regressed on context

92 (79) contexts drive negative (positive) sentiment

  • Some uniformly drive both positive and negative sentiment
    • E.g., “cautionary statements” and “reduction in accounts”
  • Some only drive negative sentiment
    • E.g., “accounting losses” and “risk factor disclosures”
  • Some only drive positive sentiment
    • E.g., “increases in performance” and “tax”
  • Some drive a lack of sentiment
    • E.g., “depreciation and amortization” and “credit facilities”

Because more coefficients’ signs match our intuition for negative sentiment, we argue that negative sentiment is more tied to context.

Filing period excess return

Prediction: Positive relation between sentiment and return

  • Expected signs: Negative for negative sentiment, positive for positive sentiment
  • Replication: Expected result for negative sentiment, null result for positive sentiment
  • Contexts: Mixed findings, both sentiments drive results in both directions
  • Double LASSO: Results are consistent

Filing period abnormal volume

Prediction: More sentiment (either), higher volume

  • Expected signs: Positive for both
  • Replication: Opposite result for negative sentiment, null result for positive sentiment
  • Contexts: Mixed findings, but mostly in line with predictions
  • Double LASSO: Results are consistent

Post-filing return volatility

Prediction: More sentiment (either), higher volatility

  • Expected signs: Positive for both
  • Replication: Expected result for negative sentiment, null result for positive sentiment
  • Contexts: Mixed findings, both sentiments drive results in both directions
  • Double LASSO: Results are consistent

Future material weakness

Prediction: Inverse relation between sentiment and Material weaknesses

  • Expected signs: Positive for negative sentiment, negative for positive sentiment
  • Replication: Null result for negative sentiment, expected result for positive sentiment
  • Contexts: Mixed findings, both sentiments drive results in both directions
  • Double LASSO: Results are consistent

Falsification test

Randomly assign each clause to one of 137 groups using a uniform distribution

  • Table 4 replication
    • Context drives 31.5% of the variation in negative sentiment
    • Context drives 10.8% of the variation in positive sentiment
  • Tables 5 through 8 replication
    • All falsification tests have fewer significant coefficients on the context measures than our main results
    • All falsification tests have lower adjusted \(R^2\) than our main results

Our main results are unlikely to be driven by disaggregation in general

  • Context is meaningful for sentiment

Construct validity of sentiment

Is sentiment a consistent construct? It doesn’t appear to be.

  • Negative sentiment:
    • No context always loads
    • “Discussion of accounting procedures” and “decreases in expenses or performance” load 3/4 of the time
    • 13 contexts significant only twice
    • 35 contexts significant only once
  • Positive sentiment:
    • No context always loads
    • “decrease” + unrelated statements loads 3/4 of the time
    • 4 contexts significant only twice
    • 43 contexts significant only once

This appears to violate how we approach sentiment empirically

  • Aggregation is likely a problem

Other sentiment measures

Results are the same with the Henry (2008) and Harvard General Inquirer dictionaries

  • The problem we document is not due to the LM dictionary’s construction

Results are the same using FinBERT

  • This means that bag-of-words isn’t the source of the problem
  • It also means that the problem source likely isn’t classification accuracy

Aggregation is very likely to be the source of the problem

Simulating aggregation: Negative sentiment

Simulating aggregation: Positive sentiment

Conclusion

Wrap-up

  • What do we do?
    1. We build a measure of context within annual report MD&As
    2. We regress sentiment on context
    3. We replicate sentiment results by partitioning sentiment by context
  • What do we find?
    1. Sentiment relies more on some contexts than others
    2. Context matters when regressing on sentiment
      • Some contexts behave as expected for sentiment, many others do not!
    3. The regression DV matters
      • Sentiment results are driven by different contexts for different DVs

Takeaways

  1. Sentiment, at the document level, is not a consistent construct.
  2. Sentiment should be analyzed on fine-grained contexts.
    • At a level where there is a clear economic link

Thanks!


Richard M. Crowley
Singapore Management University
https://rmc.link/
@prof_rmc

Packages used for these slides

  • dplyr
  • ggplot
  • gridExtra
  • kableExtra
  • knitr
  • revealjs

Other tables

Table 1

Table 2, Panels A and B

Table 2, Panels C and D

Table 3

What contexts are high in both sentiments?

What contexts skew towards negative sentiment?

What contexts skew towards positive sentiment?

What contexts are low in both sentiments?

Full Table 5
