Lost in Aggregation: Measuring Sentiment at Document and Context Levels

Dr. Richard M. Crowley

With Franco Wong

rcrowley@smu.edu.sg / @prof_rmc / https://rmc.link/
Slides @ rmc.link/CUHKSZ

Research question and background

Motivation

How does context affect measurement of text sentiment?

More pointedly: Empirically, what does sentiment capture when applied to financial documents?
This is a tricky question:
- Most papers measure sentiment by applying dictionaries to full documents
  - But what actually drives the results then?
- ML-based sentiment is often calculated sentence-level, but most studies aggregate this to document level

Key issue

In both cases, aggregating to document level makes it difficult to see what, exactly, drives the results

It is unlikely that sentiment is uniformly distributed across documents

How do we approach this?

We argue that the context underlying sentiment-laden discussion is key to understanding its measurement

Cambridge dictionary on Context
- “The text or speech that comes immediately before and after a particular phrase or piece of text and helps to explain its meaning”
- E.g., is the text about accounting policies, operating performance, contracting, …
Consider the word loss:
1. “[…]net loss increased $3.0 million[…]”
2. “[…]net loss was due to $4.7 million goodwill write off[…]”
3. “[…]loss ratio decreased to 43% as result of segment’s adherence to underwriting guidelines[…]”
4. “[…]loan loss provision totaled $4.6 million in 2008 compared[…]”

The surrounding text helps us to understand:
1) What type of loss we are measuring;
2) How serious the implications of the loss are (e.g., transient, persistent, irrelevant)
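
To make the point concrete, here is a minimal sketch of what a dictionary approach sees in the four clauses above. The two-word list is a hypothetical stand-in for a full negative-word dictionary, not the LM dictionary itself.

```python
import re

# Hypothetical stand-in for a negative-word list; a real study would load
# the full Loughran-McDonald dictionary instead.
NEGATIVE_WORDS = {"loss", "losses"}

clauses = [
    "net loss increased $3.0 million",
    "net loss was due to $4.7 million goodwill write off",
    "loss ratio decreased to 43% as result of segment's adherence to underwriting guidelines",
    "loan loss provision totaled $4.6 million in 2008 compared",
]

def negative_hits(text):
    """Count tokens in `text` that appear in the negative-word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(token in NEGATIVE_WORDS for token in tokens)

hits = [negative_hits(c) for c in clauses]
```

Every clause gets the same count of one negative word, even though the four losses differ in type and seriousness. The dictionary alone cannot tell them apart; the surrounding words can.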

What do we ask?

Do prior results using financial sentiment hold across contexts?
- If not, why?
Are prior results for different outcomes derived from the same underlying contexts?
- Or is more care needed in explaining why sentiment explains phenomena?

Why?

To understand what financial sentiment captures
To understand if financial sentiment is empirically consistent
To provide methodological suggestions to more cleanly measure sentiment in future studies

Empirical setting: Use 10-K MD&A text to predict…
1. Filing period excess return
2. Future material weaknesses
3. Post-event return volatility
4. Filing period abnormal volume

Main results

Sentiment, at the context level, often contradicts prior results using the LM dictionary
- Broadly, aggregation removes nuance from our understanding
This pattern holds across other dictionary and machine learning sentiment measures
- Henry (2008) and Harvard GI dictionaries; FinBERT model
Different contexts drive prediction for different outcomes
- Sentiment captures different empirical constructs across regressions
We provide additional evidence on how aggregation leads to the observed outcomes
We illustrate an example of using context-based sentiment to study Critical Accounting Policy (CAP) disclosure

Punchline

Sentiment should be measured on fine-grained contexts, not full documents

In other words, a precise matching between the text used and the economic question examined is needed

Methodology: Measuring context

The idea

Our goal is to replicate a natural approach one would take to identify contexts by hand:
1. Take a reference clause
2. Look to see what the clause is about (the “context”)
3. Assign the clauses into logical groupings of contexts
4. After: Interpret sentiment of a clause within context

In order to better understand context and its link to sentiment, we will examine a broad set of contexts spanning all MD&A content

Implementation
- Step 1a: Clause extraction and reconstruction (OpenIE)
- Step 1b: Filtering overlapping clauses
- Step 2: Extracting a numeric representation of the context (USE)
- Step 3: Clustering into contexts (MB K-means + Gap statistic)

Examples of contexts

Accounting

Policies: Assumptions, Revenue Recognition, Tax, Cautionary Statements
Standards: Standards, New standards
General or B/S: Cash flow, Deferred tax
Income statement discussion: Accounting losses, Depreciation and amortization

Business operations

Debt, Equity, and Investment: Financing, Loans
Expectations and future: Management expectations, Risk factor disclosures
Macroeconomics: Interest rates, Market risk
Operations: Growth, Customers, Products
Structure: Subsidiaries, Partnerships

Changes

Changes in: sales, expenses, operating measures
Declines in: value or performance
Increase in: expenses, income or revenue

Ungrouped

Grammatical patterns
Timeframes
Unrelated statements
Unrelated statements with specific words

What clauses are in the contexts?

Accounting assumptions

“Option pricing models require input of highly subjective assumptions particularly for expected stock price volatility”
“Weighted average assumptions determine net periodic pension benefit expense”

Growth

“Growth was partially offset by closure”
“Diamond ’s capital expenditure budget is Diamond ’s highest at approximately $ 250 million with much related to internal growth activities comprised of expansions of facilities”

Deferred tax

“Adtalem recognizes future tax benefits associated with tax loss as deferred tax assets”
“Company fully impaired deferred tax asset resulting in 5 % effective tax benefit rate”

Market risk

“We are exposed to market risk related to interest rate risk on investment of cash in securities with original maturities”
“Currency gains related to market risk”

Step 1a: Extracting clauses

Automating with Stanford Open IE

Open IE is an open information extraction algorithm
Generates triples of context of the form (subject; relation verb; object)
Multi-step algorithm:
1. Creates the dependency parse tree
2. Resolves any co-references (“it,” “her,” etc.)
3. Determines clause boundaries (multinomial logistic model)
4. Determines triples within each clause (linguistic patterns)

This nets 179,703,756 extractions which can be formed back into clauses
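
As a rough illustration of forming an extraction back into a clause, the sketch below parses a triple written as "(subject; relation; object)" and joins its parts. The function names and the simple join rule are assumptions for illustration; this is not the Stanford Open IE API.

```python
def parse_triple(text):
    """Split a '(subject; relation; object)' string into its three parts."""
    inner = text.strip().lstrip("(").rstrip(")")
    parts = [part.strip() for part in inner.split(";")]
    if len(parts) != 3:
        raise ValueError("expected exactly three ';'-separated parts")
    return tuple(parts)

def triple_to_clause(subject, relation, obj):
    """Join the parts of a triple into a single clause string."""
    return " ".join(part for part in (subject, relation, obj) if part)

triple = parse_triple("(company's earnings; increased by; 5%)")
clause = triple_to_clause(*triple)
```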

Step 1b: Cutting this down a bit

Some clauses are superfluous as we saw earlier
Approach: Keep the shortest clauses such that…
1. We cover as much of the sentence as possible without having nested clauses
2. We don’t drop words from LM
3. We don’t drop accounting content

This cuts out 73% of clauses $\Rightarrow$ still have 48,576,229 clauses
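
One possible greedy reading of this rule is sketched below. It is not the authors' implementation: it treats clauses as unordered sets of tokens, prefers shorter clauses, and keeps a longer clause that nests a kept one only when it carries a protected word (LM or accounting) that no kept clause covers yet. The protected word list in the example is hypothetical.

```python
def filter_clauses(clauses, protected_words):
    """Greedy sketch: keep short clauses, skip clauses that nest a kept one,
    unless they add a protected word that no kept clause covers yet."""
    protected = {w.lower() for w in protected_words}
    kept = []        # list of (clause text, token set) pairs
    covered = set()  # protected words already present in kept clauses
    for text in sorted(clauses, key=lambda c: len(c.split())):
        tokens = set(text.lower().split())
        nests_kept = any(kept_tokens <= tokens for _, kept_tokens in kept)
        new_protected = (tokens & protected) - covered
        if not nests_kept or new_protected:
            kept.append((text, tokens))
            covered |= tokens & protected
    return [text for text, _ in kept]

candidates = [
    "company's earnings increased due improved operating efficiency",
    "company's earnings increased due operating efficiency",
    "company has earnings",
]
without_protection = filter_clauses(candidates, set())
with_protection = filter_clauses(candidates, {"improved"})
```

With no protected words, the longer nested clause is dropped. Once "improved" is protected, it is retained so the word is not lost.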

Accounting content

Harvey’s hypertextual finance glossary
“The largest financial glossary on the Internet”
Some words unique to this dictionary:
- demonetization, boilerplate, deductible

NYSSCPA’s Accounting Terminology Guide
“Over 1,000 Accounting and Finance Terms”
Some words unique to this dictionary:
- GASB, MD&A, periodicity

Some shared words: collateral, specialist, hedge, debit, inventory

Step 2: Getting a numeric representation

Map all clauses to a 512-dimension vector space that represents underlying meaning
- Universal Sentence Encoder (USE; Cer et al. 2018)
- We mask out certain tokens that USE tends to focus on too much
  - Dates, times, dollar amounts, percentages, quantities, and ordinals
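
A minimal sketch of this kind of masking with regular expressions follows. The patterns and placeholder strings are assumptions chosen for illustration; they cover only simple cases (four-digit years, plain dollar amounts) and are not the masks used in the paper.

```python
import re

NUM = r"\d+(?:,\d{3})*(?:\.\d+)?"

# Order matters: dollar amounts and percentages are masked before bare numbers.
MASKS = [
    (re.compile(r"\$\s?" + NUM + r"(?:\s(?:thousand|million|billion))?", re.I), "<MONEY>"),
    (re.compile(NUM + r"\s?%"), "<PERCENT>"),
    (re.compile(r"\b(?:19|20)\d{2}\b"), "<DATE>"),
    (re.compile(r"\b\d+(?:st|nd|rd|th)\b", re.I), "<ORDINAL>"),
    (re.compile(r"\b" + NUM + r"\b"), "<QUANTITY>"),
]

def mask_tokens(text):
    """Replace numeric detail with placeholders before embedding a clause."""
    for pattern, placeholder in MASKS:
        text = pattern.sub(placeholder, text)
    return text

masked = mask_tokens("loan loss provision totaled $4.6 million in 2008 compared")
```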

How does USE work?

Input: Clauses’ words and word order
Processing: Transformer-based NN
- Uses “attention” like BERT, etc.
Output: A 512-dim vector per clause

USE abstracts from word choice!

Step 3: Clustering to contexts

We cluster within the 512-dim vector space with Mini-Batch K-means (Sculley 2010)
- Mini-Batch K-means is an online version of K-means
  - Output closely approximates K-means, but the process is more memory-friendly
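
For intuition, here is a small pure-Python sketch of the mini-batch update with per-center learning rates described in Sculley (2010). The 2-D toy points, fixed starting centers, and parameter defaults are assumptions for illustration; at scale one would use a library implementation on the 512-dim vectors.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mini_batch_kmeans(points, init_centers, batch_size=4, n_iter=50, seed=0):
    """Mini-batch K-means: only one small batch is touched per iteration."""
    rng = random.Random(seed)
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # Assign every point in the batch to its nearest center first ...
        nearest = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in batch
        ]
        # ... then move each center toward its points with step size 1/count.
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = mini_batch_kmeans(points, init_centers=[(0, 0), (10, 10)])
```

Because each step only needs the current batch and the running counts, the full data never has to sit in memory at once.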

Optimizing with Gap statistic

Gap statistic (Tibshirani et al. 2001) is a simulation approach to supervising clustering
Goal: Select the lowest $k$ by comparing the informativeness of clustering on real data vs. synthetic data
- Compare informativeness at $k$ vs. at $k+1$, look for a gap $<$ 1 S.D.
- Caveat: Optimal $k$ may be too small in more varied text; thus we compare $k$ to $k+1$ and $k+2$

13 is the lowest $k$ vs $k+1$ (red circle)
137 is the lowest $k$ vs $k+1$ and $k+2$ (blue circle)

137 contexts in the data

Empirical approach

Data

All 10-K and 10-K405 MD&A sections to build the text model
- 107,596 MD&As
- 48,576,229 extractions
For empirical tests: only MD&As that meet a number of sample requirements
- 35,362 MD&As
- 22,669,186 extractions
Loughran and McDonald sentiment from their 10X file summaries
MD&A LM sentiment based on the 10-K parser from Brown et al. (2020) (BCE)
- The BCE parser has Pearson correlations $>80\%$ for full text sentiment measures with LM
Accounting data from Compustat
Stock data from CRSP
Material weaknesses from Audit Analytics

Empirics sketch

To replicate results from Loughran and McDonald (2011): OLS

\[ DV_{f,t} = \alpha + \beta_0 Sentiment_{f,t} + \gamma \cdot Controls_{f,t} + \delta \cdot Industry~FE+\varepsilon \]

To partition the replication on context: LASSO

\[ \begin{aligned} DV_{f,t} &= \alpha + \sum_{i=1}^{137}\beta_i Sentiment_{Context,i,f,t} + \gamma \cdot Controls_{f,t} \\ &+ \delta \cdot Industry~FE+\varepsilon \end{aligned} \]

$Sentiment_{Context,i,f,t}$ is the percent of clauses with a given sentiment within the set of clauses from $Context_i$
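
The measure can be sketched as follows for a single MD&A, assuming each clause arrives as a (context id, sentiment flag) pair. The input layout and the choice to score absent contexts as 0 are assumptions for illustration.

```python
from collections import defaultdict

def context_sentiment(clauses, n_contexts):
    """Percent of clauses flagged with a sentiment, per context, for one MD&A.

    `clauses` is a list of (context_id, has_sentiment) pairs.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for context_id, has_sentiment in clauses:
        totals[context_id] += 1
        flagged[context_id] += int(has_sentiment)
    # Contexts that do not appear in the document get 0.0 (an assumption).
    return [
        100.0 * flagged[i] / totals[i] if totals[i] else 0.0
        for i in range(n_contexts)
    ]

measures = context_sentiment([(0, True), (0, False), (2, True)], n_contexts=3)
```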

Practical issues with 137 IVs

137 text-derived measures means multicollinearity could flip coefficient signs and drop adjusted $R^2$

Solution 1: LASSO

Replace OLS problem of $\min_{\beta,\gamma,\delta} \frac{1}{N} |\varepsilon|_2^2$ with:

$\min_{\beta,\gamma,\delta} \frac{1}{N} |\varepsilon|_2^2 + \lambda \sum_{b\in \{\beta, \gamma, \delta\}} \left|b\right|_1$

Optimize $\lambda$ with 10-fold cross-validation
LASSO is also called $L^1$ regularization
- Standard technique for dealing with high VIFs
Derive p-values using Post-LASSO estimator
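
To show how the penalty sets small coefficients exactly to zero, here is a minimal coordinate-descent sketch for the objective above. It assumes no intercept and a fixed $\lambda$ (no cross-validation), and the toy data are made up for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/N) * ||y - X b||_2^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(n_sweeps):
        for j in range(p):
            column = [row[j] for row in X]
            z = sum(x * x for x in column) / n
            if z == 0.0:
                continue
            # Correlation of column j with the residual that excludes b[j].
            rho = sum(x * (r + x * b[j]) for x, r in zip(column, residual)) / n
            new_bj = soft_threshold(rho, lam / 2.0) / z
            residual = [r - x * (new_bj - b[j]) for x, r in zip(column, residual)]
            b[j] = new_bj
    return b

# Toy data: y = 2 * x1 + 0.1 * x2 with orthogonal columns.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2.1, -1.9, 1.9, -2.1]
b = lasso_coordinate_descent(X, y, lam=0.5)
```

The strong regressor is shrunk but kept, while the weak one is dropped entirely. This is why a post-LASSO step is then used for inference.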

Much less worry about multicollinearity

But some worry about dropping causal links

Solution 2: Double LASSO

Determine causal links with $1 + \#\text{IV}$ LASSO regressions
Estimate result using post-LASSO
Ensures statistically significant links aren’t dropped

Avoids biasing against finding significant coefficients

If we find that only a few contexts matter, it’s because the remaining contexts really don’t predict the outcome

Complementary measures

LASSO limits the possibility of observing incorrect signs due to multicollinearity
Double LASSO ensures consistent coefficient inclusion in the models

Main results

Filing period excess return

Prediction: Positive relation between sentiment and return

Expected signs: Negative for negative sentiment, positive for positive sentiment
Replication: Expected result for negative sentiment, null result for positive sentiment
Contexts: Mixed findings, both sentiments drive results in both directions
Double LASSO: Results are consistent

Other sentiment measures

Results are the same with the Henry (2008) and Harvard General Inquirer dictionaries

The problem we document is not due to the LM dictionary’s construction

Results are the same using FinBERT

This helps to rule out dictionaries as the source of the problem
It also means that the problem source likely isn’t classification accuracy
- As FinBERT is best-in-class, presently

Results point to document level aggregation being problematic

Future material weakness

Prediction: Inverse relation between sentiment and Material weaknesses

Expected signs: Positive for negative sentiment, negative for positive sentiment
Replication: Null result for negative sentiment, expected result for positive sentiment
Contexts: Mixed findings, both sentiments drive results in both directions
Double LASSO: Results are consistent

Post-filing return volatility

Prediction: More sentiment (either), higher volatility

Expected signs: Positive for both
Replication: Expected result for negative sentiment, null result for positive sentiment
Contexts: Mixed findings, both sentiments drive results in both directions
Double LASSO: Results are consistent

Filing period abnormal volume

Prediction: More sentiment (either), higher volume

Expected signs: Positive for both
Replication: Opposite result for negative sentiment, null result for positive sentiment
Contexts: Mixed findings, but mostly in line with predictions
Double LASSO: Results are consistent

Construct validity of sentiment

Is sentiment a consistent construct? It doesn’t appear to be.

Negative sentiment:
- No context always loads
- “Discussion of accounting procedures” and “decreases in expenses or performance” load 3/4 of the time
- 13 contexts significant only twice
- 35 contexts significant only once

Positive sentiment:
- No context always loads
- 1 loads 3/4 of the time
- 4 contexts significant only twice
- 43 contexts significant only once

Violation of empirical assumptions

Different portions of 10-K MD&As drive results for different DVs. Aggregation masks this problem.

Ruling out noise from context measurement

Distribution of context and negative sentiment

Losses and poor performance are strongly tied to negative sentiment
Uncertainty-related discussions, such as Risk factor disclosures, are also frequently negative

Distribution of context and positive sentiment

Tax and Accounting standards being positive is largely driven by the word “effective”
- Dropping this word doesn’t affect our results though
Some inherently positive discussion, such as Growth, correlates with positive discussion
Discussion of Losses is rarely positive

Validation: Do the contexts behave as expected?

Intrusion task
- Take 3 clauses from 1 context and an “intruder” from another, e.g.:
  1. average market rate is in effect
  2. price swings are due to commodity costs
  3. net sales impact is in same store sales
  4. Volatility is in commodity prices
- 4 RAs average 86% on the task; 500 questions each
  - This is a very high score on the task!
Overlap of original extractions with accounting dictionaries:
- 95.2% contain at least 1 word in Campbell Harvey’s dictionary
- 84.8% contain at least 1 word in the NYSSCPA dictionary
Regress MD&A sentiment on clusters conditional on sentiment
- 82.3% (68.6%) of variation captured for negative (positive) sentiment

Falsification test

Randomly assign each clause to one of 137 groups using a uniform distribution
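
A minimal sketch of the random assignment, assuming clauses are indexed 0 to N-1 and groups are labeled 0 to 136. The seed is only there to make the sketch reproducible.

```python
import random

def random_contexts(n_clauses, n_groups=137, seed=0):
    """Assign each clause to a placebo context drawn uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(n_groups) for _ in range(n_clauses)]

placebo = random_contexts(1000)
```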

Rerun all results (all 4 DVs, all 4 sentiment methods)
All falsification tests have:
- Fewer significant coefficients for negative sentiment
- Fewer significant coefficients for positive sentiment
- Lower adjusted $R^2$

Our main results are unlikely to be driven by disaggregation in general

Context is likely meaningful for sentiment

Simulating aggregation

Simulation setup

We simulate randomly aggregating our 137 contexts into a single measure containing $n\in[2,136]$ contexts
We then rerun our main regression of filing period excess return on sentiment at varying levels of aggregation
We randomly repeat the above process 1,000 times for each $n$
- We have also done this with the other 3 DVs – results are consistent
In addition, we present simulations focussed on only aggregating business-relevant contexts and only ungrouped contexts
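
One draw of this simulation could look like the sketch below for a single MD&A. Pooling by clause counts is an assumption about how the aggregated measure is formed, and the function name and inputs are hypothetical.

```python
import random

def aggregate_random_contexts(flagged, totals, n, rng):
    """Pool `n` randomly chosen contexts into one sentiment measure.

    `flagged[i]` and `totals[i]` are clause counts for context i in one MD&A.
    Returns (pooled measure, dict of measures for contexts left separate).
    """
    chosen = set(rng.sample(range(len(totals)), n))
    pooled_flagged = sum(flagged[i] for i in chosen)
    pooled_total = sum(totals[i] for i in chosen)
    pooled = 100.0 * pooled_flagged / pooled_total if pooled_total else 0.0
    separate = {
        i: (100.0 * flagged[i] / totals[i] if totals[i] else 0.0)
        for i in range(len(totals))
        if i not in chosen
    }
    return pooled, separate

rng = random.Random(1)
totals = [10] * 137
flagged = [i % 3 for i in range(137)]
pooled, separate = aggregate_random_contexts(flagged, totals, n=5, rng=rng)
```

Repeating this draw many times for each $n$ and rerunning the regression traces out how coefficients change as more contexts are pooled.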

Simulating aggregation: Negative sentiment

What does this simulation show us?

Aggregation suppresses the mixed results found at low levels of aggregation
The negative sign on full document sentiment is driven mostly by ungrouped contexts

Simulating aggregation: Positive sentiment

What does this simulation show us?

Aggregation suppresses the mixed results found at low levels of aggregation
Inclusion of ungrouped contexts suppresses the expected positive coefficient on positive sentiment at document level

Example: Critical Accounting Policy disclosure

Example application: CAP Disclosure

Critical Accounting Policy (CAP) disclosure…
- May reduce information asymmetry and predict future financing
- May decrease financial statement reliability or increase perceived risk
- Empirically, can both increase and decrease short-run stock returns
Mechanism revolves around whether CAP disclosures increase or decrease FS reliability

Our argument

Positively phrased CAP disclosures may, on average, decrease reliability
Negatively phrased CAP disclosures may, on average, increase reliability

$H_0$: CAP disclosures with negative sentiment have no impact on short run stock return.
$H_A$: CAP disclosures with negative sentiment positively impact short run stock return.

Example application: CAP Disclosure

Traditional approach

Fail to reject $H_0$
No main effect of CAP Disclosure

Context approach

Results strongly support $H_A$
No results for positive sentiment

Conclusion

Findings and takeaways

Sentiment, at the document level, is not a consistent measure.
- Consistent phenomenon regardless of sentiment implementation
Sentiment results are driven by different contexts for different DVs

Takeaways

The context on which sentiment is measured matters
The dependent variable being regressed on sentiment also matters

What should we, as researchers, do then?

For most papers

Focus on one context in documents
Match to theory
Example: Hassan et al. (2019 QJE)

Need a broad set of discussion?

Our context measure offers a solution
Unsupervised, automated, replicable
Works for any document type

Thanks!

Dr. Richard M. Crowley
rcrowley@smu.edu.sg
@prof_rmc
rmc.link/

Packages used for these slides

dplyr
ggplot2
kableExtra
knitr
quarto
- material-icons
revealjs
scales

Methodology slides

Illustration of extracting clauses

“The company’s earnings increased by 5% due to an improvement in operating efficiency.”

(company; has; earnings)
(company’s earnings; increased by; 5%)
(company’s earnings; increased due; improved operating efficiency)
(company’s earnings; increased due; operating efficiency)

How does the Gap statistic work?

Let $k$ be the number of clusters,
$B$ the number of simulated samples,
$W_k$ be the K-Means inertia score on actual data,
$W_{k,r}^*$ be the K-Means inertia score for iteration $r$ with synthetic data,
$\bar{l}$ be the average of the $\text{log}\left(W_{k,r}^*\right)$ scores

\[ \begin{aligned} Gap(k) &= \left(\frac{1}{B}\right) \sum_{r=1}^{B} \text{log}\left(W_{k,r}^*\right)-\text{log}\left(W_k\right) \text{ and}\\ s_k &= sd_k \sqrt{1+\frac{1}{B}},\text{ where }sd_k = \sqrt{\left(\frac{1}{B}\right)\sum_{r=1}^{B}\left\{\text{log}\left(W_{k,r}^*\right)-\bar{l}\right\}^2} \end{aligned} \]

Select the lowest $k$ such that $Gap(k) \ge Gap(k+1) - s_{k+1}$

I.e., select the lowest $k$ s.t. the log-scaled error removed by clustering on real data at $k$ is no worse than 1 SD below the log-scaled error removed at $k+1$
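
The selection rule can be sketched as below, taking $\bar{l}$ as the mean of the log synthetic inertias as in Tibshirani et al. (2001). The `lookahead` parameter is an assumption about how the comparison against both $k+1$ and $k+2$ is applied, and the Gap values in the example are made up.

```python
import math

def gap_and_s(w_k, w_star):
    """Gap(k) and s_k from the real-data inertia and B synthetic inertias."""
    logs = [math.log(w) for w in w_star]
    b = len(logs)
    l_bar = sum(logs) / b
    gap = l_bar - math.log(w_k)
    sd = math.sqrt(sum((v - l_bar) ** 2 for v in logs) / b)
    return gap, sd * math.sqrt(1 + 1 / b)

def select_k(gaps, s, lookahead=1):
    """Lowest k with Gap(k) >= Gap(k+j) - s_{k+j} for every j in 1..lookahead.

    `gaps` and `s` are dicts keyed by k.
    """
    for k in sorted(gaps):
        ahead = [k + j for j in range(1, lookahead + 1)]
        if all(a in gaps for a in ahead) and all(
            gaps[k] >= gaps[a] - s[a] for a in ahead
        ):
            return k
    return None

gaps = {1: 0.10, 2: 0.50, 3: 0.52, 4: 0.80, 5: 0.81, 6: 0.80}
s = {k: 0.05 for k in gaps}
k_one_ahead = select_k(gaps, s)
k_two_ahead = select_k(gaps, s, lookahead=2)
```

In this made-up example the stricter two-step comparison skips an early plateau and settles on a larger $k$.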

Double LASSO

Drawbacks of handling multicollinearity

In doing so, one may remove variables that are potentially causally important. This can lead to questions on the validity of inferences derived from LASSO-based coefficients.

Solution: Double LASSO (Belloni, Chernozhukov and Hansen 2014 JEP)

\[ \begin{aligned} (1)&\qquad DV_{f,t} = \alpha + \sum_{i=1}^{137}\beta_i Sentiment_{Context,i,f,t} + \gamma \cdot Controls_{f,t} + \delta \cdot Industry~FE+\varepsilon\\ (2)&\qquad Sentiment_{Context,i,f,t} = \alpha + \sum_{j\ne i}\beta_j Sentiment_{Context,j,f,t} + [...]\\ (3)&\qquad DV_{f,t} = \alpha + \sum_{i\in S}\beta_i Sentiment_{Context,i,f,t} + \gamma \cdot Controls_{f,t} + \delta \cdot Industry~FE+\varepsilon\\ & \qquad S = \{i \in [1,137]~\text{ s.t. } 0 < \sum_{\beta_i\text{ from }\{(1), (2)\}} I(\beta_i \ne 0)\} \end{aligned} \]

Run 138 LASSO regressions to determine significant links between outcome or IVs and IVs
Run a post-LASSO OLS keeping only variables that were selected in at least one of the LASSO regressions
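
The bookkeeping behind the final regression can be sketched as a set union, assuming each LASSO run has already been reduced to the set of variables with nonzero coefficients. The function name and input layout are hypothetical.

```python
def double_lasso_support(selected_for_outcome, selected_for_each_iv):
    """Union of variables with a nonzero coefficient in any LASSO run.

    `selected_for_outcome`: set of IV indices kept in regression (1).
    `selected_for_each_iv`: dict mapping each IV index to the set of other
    IVs kept when that IV is the dependent variable, as in regression (2).
    """
    support = set(selected_for_outcome)
    for kept in selected_for_each_iv.values():
        support |= set(kept)
    return sorted(support)

support = double_lasso_support({1, 4}, {1: {2}, 2: set(), 3: {4}})
```

Variable 2 does not predict the outcome directly but is retained because it predicts another IV, which is the case a single LASSO would miss.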

Full tables

Table 1

Table 2

Full Table 3

Table 4 Panels A and B

Table 4 Panels C and D

Sentiment regressed on context

92 (79) contexts drive negative (positive) sentiment

Some uniformly drive both positive and negative sentiment
- “Cautionary statements” and “reduction in accounts”
Some only drive negative sentiment
- “Accounting losses” and “risk factor disclosures”
Some only drive positive sentiment
- “Increases in performance” and “tax”
Some drive a lack of sentiment
- “Depreciation and amortization” and “credit facilities”

As more coefficients’ signs match our intuition for negative sentiment, we argue that negative sentiment is more tied to context.

What contexts are high in both sentiments?

What contexts skew towards negative sentiment?

What contexts skew towards positive sentiment?

What contexts are low in both sentiments?
