Lost in Aggregation: Measuring Sentiment at Document and Context Levels

Dr. Richard M. Crowley

rcrowley@smu.edu.sg / @prof_rmc / https://rmc.link/
Slides @ rmc.link/CUHKSZ

Research question and background

Motivation

How does context affect measurement of text sentiment?

  • More pointedly: Empirically, what does sentiment capture when applied to financial documents?
  • This is a tricky question:
    • Most papers measure sentiment by applying dictionaries to full documents
      • But what actually drives the results then?
    • ML-based sentiment is often calculated sentence-level, but most studies aggregate this to document level

Key issue

In both cases, aggregating to document level makes it difficult to see what, exactly, drives the results

  • It is unlikely that sentiment is uniformly distributed across documents

How do we approach this?

We argue that the context underlying sentiment-laden discussion is key to understanding its measurement

  • Cambridge dictionary on Context
    • “The text or speech that comes immediately before and after a particular phrase or piece of text and helps to explain its meaning”
    • E.g., is the text about accounting policies, operating performance, contracting, …
  • Consider the word loss:
    1. “[…]net loss increased $3.0 million[…]”
    2. “[…]net loss was due to $4.7 million goodwill write off[…]”
    3. “[…]loss ratio decreased to 43% as result of segment’s adherence to underwriting guidelines[…]”
    4. “[…]loan loss provision totaled $4.6 million in 2008 compared[…]”

The surrounding text helps us to understand:
1) What type of loss we are measuring;
2) How serious the implications of the loss are (e.g., transient, persistent, irrelevant)

What do we ask?

  1. Do prior results using financial sentiment hold across contexts?
    • If not, why?
  2. Are prior results for different outcomes derived from the same underlying contexts?
    • Or is more care needed in explaining why sentiment explains phenomena?

Why?

  1. To understand what financial sentiment captures
  2. To understand if financial sentiment is empirically consistent
  3. To provide methodological suggestions to more cleanly measure sentiment in future studies
  • Empirical setting: Use 10-K MD&A text to predict…
    1. Filing period excess return
    2. Future material weaknesses
    3. Post-event return volatility
    4. Filing period abnormal volume

Main results

  1. Sentiment, at the context level, often contradicts prior results using the LM dictionary
    • Broadly, aggregation removes nuance from our understanding
  2. This pattern holds across other dictionary and machine learning sentiment measures
    • Henry (2008) and Harvard GI dictionaries; FinBERT model
  3. Different contexts drive prediction for different outcomes
    • Sentiment captures different empirical constructs across regressions
  4. We provide additional evidence on how aggregation leads to the observed outcomes
  5. We illustrate an example of using context-based sentiment to study Critical Accounting Policy (CAP) disclosure

Punchline

Sentiment should be measured on fine-grained contexts, not full documents

  • In other words, a precise matching between the text used and the economic question examined is needed

Methodology: Measuring context

The idea

  • Our goal is to replicate a natural approach one would take to identify contexts by hand:
    1. Take a reference clause
    2. Look to see what the clause is about (the “context”)
    3. Assign the clauses into logical groupings of contexts
    4. After: Interpret sentiment of a clause within context

In order to better understand context and its link to sentiment, we will examine a broad set of contexts spanning all MD&A content

  • Implementation
    • Step 1a: Clause extraction and reconstruction (OpenIE)
    • Step 1b: Filtering overlapping clauses
    • Step 2: Extracting a numeric representation of the context (USE)
    • Step 3: Clustering into contexts (MB K-means + Gap statistic)

Examples of contexts

Accounting

  • Policies: Assumptions, Revenue Recognition, Tax, Cautionary Statements
  • Standards: Standards, New standards
  • General or B/S: Cash flow, Deferred tax
  • Income statement discussion: Accounting losses, Depreciation and amortization

Business operations

  • Debt, Equity, and Investment: Financing, Loans
  • Expectations and future: Management expectations, Risk factor disclosures
  • Macroeconomics: Interest rates, Market risk
  • Operations: Growth, Customers, Products
  • Structure: Subsidiaries, Partnerships


Changes

  • Changes in: sales, expenses, operating measures
  • Declines in: value or performance
  • Increase in: expenses, income or revenue

Ungrouped

  • Grammatical patterns
  • Timeframes
  • Unrelated statements
  • Unrelated statements with specific words

What clauses are in the contexts?

Accounting assumptions

  1. “Option pricing models require input of highly subjective assumptions particularly for expected stock price volatility”
  2. “Weighted average assumptions determine net periodic pension benefit expense”

Growth

  1. “Growth was partially offset by closure”
  2. “Diamond ’s capital expenditure budget is Diamond ’s highest at approximately $ 250 million with much related to internal growth activities comprised of expansions of facilities”


Deferred tax

  1. “Adtalem recognizes future tax benefits associated with tax loss as deferred tax assets”
  2. “Company fully impaired deferred tax asset resulting in 5 % effective tax benefit rate”

Market risk

  1. “We are exposed to market risk related to interest rate risk on investment of cash in securities with original maturities”
  2. “Currency gains related to market risk”

Step 1a: Extracting clauses

Automating with Stanford Open IE

  • Open IE is an open information extraction algorithm
  • Generates triples of context of the form (subject; relation verb; object)
  • Multi-step algorithm:
    1. Creates the dependency parse tree
    2. Resolves any co-references (“it,” “her,” etc.)
    3. Determines clause boundaries (multinomial logistic model)
    4. Determines triples within each clause (linguistic patterns)

This nets 179,703,756 extractions which can be formed back into clauses

Step 1b: Cutting this down a bit

  • Some clauses are superfluous as we saw earlier
  • Approach: Keep the shortest clauses such that…
    1. We cover as much of the sentence as possible without having nested clauses
    2. We don’t drop words from LM
    3. We don’t drop accounting content

This cuts out 73% of clauses \(\Rightarrow\) still have 48,576,229 clauses

Accounting content

Some shared words: collateral, specialist, hedge, debit, inventory

Step 2: Getting a numeric representation

  1. Map all clauses to a 512-dimension vector space that represents underlying meaning
    • Universal Sentence Encoder (USE; Cer et al. 2018)
    • We mask out certain tokens that USE tends to focus on too much
      • Dates, times, dollar amounts, percentages, quantities, and ordinals
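
The masking step can be sketched with a few regular expressions. This is only an illustration: the patterns and the placeholder tokens (`<MONEY>`, `<PERCENT>`, etc.) are assumptions, not the rules actually used, and they cover just the simplest cases of each category.

```python
import re

# Hypothetical patterns and placeholder tokens (assumptions for illustration).
# Order matters: specific patterns run before the generic quantity pattern.
MASKS = [
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "<TIME>"),
    (re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?(?:\s(?:thousand|million|billion))?", re.I), "<MONEY>"),
    (re.compile(r"\d[\d,]*(?:\.\d+)?\s?%"), "<PERCENT>"),
    (re.compile(r"\b\d+(?:st|nd|rd|th)\b", re.I), "<ORDINAL>"),
    (re.compile(r"\b(?:19|20)\d{2}\b"), "<DATE>"),       # bare years only
    (re.compile(r"\b\d[\d,]*(?:\.\d+)?\b"), "<QUANTITY>"),
]


def mask_clause(clause: str) -> str:
    """Replace numeric tokens with placeholders before embedding the clause."""
    for pattern, token in MASKS:
        clause = pattern.sub(token, clause)
    return clause


masked = mask_clause("loan loss provision totaled $4.6 million in 2008 compared")
print(masked)
```

Masking before embedding keeps clauses that differ only in their numbers close together in the vector space.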

How does USE work?

  • Input: Clauses’ words and word order
  • Processing: Transformer-based NN
    • Uses “attention” like BERT, etc.
  • Output: A 512-dim vector per clause

USE abstracts from word choice!

Step 3: Clustering to contexts

  • We cluster within the 512-dim vector space with Mini-Batch K-means (Sculley 2010)
    • Mini-Batch K-means is an online version of K-means
      • Output closely approximates K-means, but the process is more memory-friendly
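
A minimal pure-Python sketch of the mini-batch update may help show why the method is memory-friendly: only one small batch is touched per step, and each center moves toward its assigned points with a per-center learning rate of 1/count. The toy 2-dimensional points, the batch size, and the function names are assumptions for illustration; the actual clustering runs on 512-dimension vectors.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
    )


def mini_batch_kmeans(points, init_centers, batch_size=2, n_iter=50, seed=0):
    """Mini-batch K-means with per-center learning rates (toy sketch)."""
    rng = random.Random(seed)
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # Assign the whole batch first, then update the centers.
        assigned = [nearest(x, centers) for x in batch]
        for x, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as a center sees more points
            centers[j] = [(1 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]
    return centers


# Toy data: two well-separated groups (illustrative values only).
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
toy_centers = mini_batch_kmeans(toy_points, [(0.0, 0.0), (10.0, 10.0)])
print(toy_centers)
```

In practice a library implementation would be used instead of this loop; the sketch only shows the update rule.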

Optimizing with Gap statistic

  • Gap statistic (Tibshirani et al. 2001) is a simulation approach to choosing the number of clusters
  • Goal: Select the lowest \(k\) by comparing the informativeness of clustering on real data vs. synthetic data
    • Compare informativeness at \(k\) vs. at \(k+1\), look for a gap \(<\) 1 S.D.
    • Caveat: Optimal \(k\) may be too small in more varied text; thus we compare \(k\) to \(k+1\) and \(k+2\)
  • 13 is the lowest \(k\) vs \(k+1\) (red circle)
  • 137 is the lowest \(k\) vs \(k+1\) and \(k+2\) (blue circle)

137 contexts in the data

Empirical approach

Data

  • All 10-K and 10-K405 MD&A sections to build the text model
    • 107,596 MD&As
    • 48,576,229 extractions
  • For empirical tests, only MD&As that meet the data requirements
    • 35,362 MD&As
    • 22,669,186 extractions
  • Loughran and McDonald (LM) sentiment from their 10X summaries file
  • MD&A LM sentiment based on the 10-K parser from Brown et al. (2020) (BCE)
    • The BCE parser has Pearson correlations \(>80\%\) for full text sentiment measures with LM
  • Accounting data from Compustat
  • Stock data from CRSP
  • Material weaknesses from Audit Analytics

Empirics sketch

  • To replicate results from Loughran and McDonald (2011): OLS

\[ DV_{f,t} = \alpha + \beta_0 Sentiment_{f,t} + \gamma \cdot Controls_{f,t} + \delta \cdot Industry~FE+\varepsilon \]

  • To partition the replication on context: LASSO

\[ \begin{aligned} DV_{f,t} &= \alpha + \sum_{i=1}^{137}\beta_i Sentiment_{Context,i,f,t} + \gamma \cdot Controls_{f,t} \\ &+ \delta \cdot Industry~FE+\varepsilon \end{aligned} \]

\(Sentiment_{Context,i,f,t}\) is the percent of clauses with a given sentiment within the set of clauses from \(Context_i\)
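
As a rough sketch of this measure for a single filing: the input is a list of (context id, sentiment flag) pairs, one per clause. How a clause gets its flag (e.g., containing a word from a sentiment list) is taken as given here, and the function name is hypothetical.

```python
from collections import defaultdict


def context_sentiment(clauses):
    """Percent of clauses flagged with a sentiment, computed within each context.

    clauses: iterable of (context_id, has_sentiment) pairs for one filing.
    Returns {context_id: percent of that context's clauses that are flagged}.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for context_id, has_sentiment in clauses:
        totals[context_id] += 1
        flagged[context_id] += int(has_sentiment)
    return {c: 100.0 * flagged[c] / totals[c] for c in totals}


# Toy filing: four clauses in context 3 (one flagged), one clause in context 7.
filing = [(3, True), (3, False), (7, False), (3, False), (3, False)]
measure = context_sentiment(filing)
print(measure)
```

Contexts with no clauses in a filing do not appear in the output; how to code those observations is a separate design choice.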

Practical issues with 137 IVs

  • With 137 text-derived measures, multicollinearity could flip coefficient signs and lower adjusted \(R^2\)

Solution 1: LASSO

  • Replace OLS problem of \(\min_{\beta,\gamma,\delta} \frac{1}{N} \lVert\varepsilon\rVert_2^2\) with:

\(\min_{\beta,\gamma,\delta} \frac{1}{N} \lVert\varepsilon\rVert_2^2 + \lambda \sum_{b\in \{\beta, \gamma, \delta\}} \lVert b\rVert_1\)

  • Optimize \(\lambda\) with 10-fold cross-validation
  • LASSO is also called \(L^1\) regularization
    • Standard technique for dealing with high VIFs
  • Derive p-values using Post-LASSO estimator
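
To make the penalized objective concrete, here is a minimal coordinate-descent sketch that minimizes \(\frac{1}{N}\lVert y - Xb\rVert_2^2 + \lambda \lVert b\rVert_1\) for a fixed \(\lambda\). It is an illustration under simplifying assumptions (no intercept, no cross-validation, toy data), not the estimation code behind the results.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/N) * ||y - X b||_2^2 + lam * ||b||_1.

    X is a list of rows. No intercept is fitted, so center the data first.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # (1/N) * x_j' (partial residual leaving out coordinate j)
            z = 0.0    # (1/N) * x_j' x_j
            for i in range(n):
                fit_without_j = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            # With this scaling of the objective, the threshold is lam / 2.
            b[j] = soft_threshold(rho, lam / 2) / z if z > 0 else 0.0
    return b


# Toy data: orthogonal regressors, y = 2.0 * x1 + 0.1 * x2.
X_toy = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y_toy = [2.1, 1.9, -1.9, -2.1]
b_penalized = lasso(X_toy, y_toy, lam=1.0)    # weak regressor is dropped
b_unpenalized = lasso(X_toy, y_toy, lam=0.0)  # reduces to least squares
print(b_penalized, b_unpenalized)
```

The penalized fit shrinks the strong coefficient and sets the weak one exactly to zero. A post-LASSO step would then refit OLS on the variables with nonzero coefficients to undo the shrinkage.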

Much less worry about multicollinearity

  • But some worry about dropping causal links

Solution 2: Double LASSO

  • Determine causal links with \(1 + \#\text{IV}\) LASSO regressions
  • Estimate result using post-LASSO
  • Ensures statistically significant links aren’t dropped

Avoids biasing against finding significant coefficients

  • If we find that only a few contexts matter, it’s because the remaining contexts really don’t predict the outcome

Complementary measures

  • LASSO limits the possibility of observing incorrect signs due to multicollinearity
  • Double LASSO ensures consistent coefficient inclusion in the models

Main results

Filing period excess return

Prediction: Positive relation between sentiment and return

  • Expected signs: Negative for negative sentiment, positive for positive sentiment
  • Replication: Expected result for negative sentiment, null result for positive sentiment
  • Contexts: Mixed findings, both sentiments drive results in both directions
  • Double LASSO: Results are consistent

Other sentiment measures

Results are the same with the Henry (2008) and Harvard General Inquirer dictionaries

  • The problem we document is not due to the LM dictionary’s construction

Results are the same using FinBERT

  • This helps to rule out dictionaries as the source of the problem
  • It also means that the problem source likely isn’t classification accuracy
    • As FinBERT is best-in-class, presently

Results point to document level aggregation being problematic

Future material weakness

Prediction: Inverse relation between sentiment and Material weaknesses

  • Expected signs: Positive for negative sentiment, negative for positive sentiment
  • Replication: Null result for negative sentiment, expected result for positive sentiment
  • Contexts: Mixed findings, both sentiments drive results in both directions
  • Double LASSO: Results are consistent

Post-filing return volatility

Prediction: More sentiment (either), higher volatility

  • Expected signs: Positive for both
  • Replication: Expected result for negative sentiment, null result for positive sentiment
  • Contexts: Mixed findings, both sentiments drive results in both directions
  • Double LASSO: Results are consistent

Filing period abnormal volume

Prediction: More sentiment (either), higher volume

  • Expected signs: Positive for both
  • Replication: Opposite result for negative sentiment, null result for positive sentiment
  • Contexts: Mixed findings, but mostly in line with predictions
  • Double LASSO: Results are consistent

Construct validity of sentiment

Is sentiment a consistent construct? It doesn’t appear to be.

  • Negative sentiment:
    • No context always loads
    • “Discussion of accounting procedures” and “decreases in expenses or performance” load 3/4 of the time
    • 13 contexts significant only twice
    • 35 contexts significant only once
  • Positive sentiment:
    • No context always loads
    • 1 loads 3/4 of the time
    • 4 contexts significant only twice
    • 43 contexts significant only once

Violation of empirical assumptions

Different portions of 10-K MD&As drive results for different DVs. Aggregation masks this problem.

Ruling out noise from context measurement

Distribution of context and negative sentiment

  • Losses and poor performance are strongly tied to negative sentiment
  • Uncertainty-related discussions, such as Risk factor disclosures, are also frequently negative

Distribution of context and positive sentiment

  • Tax and Accounting standards being positive is largely driven by the word “effective”
    • Dropping this word doesn’t affect our results though
  • Some inherently positive discussion, such as Growth, correlates with positive sentiment
  • Discussion of Losses is rarely positive

Validation: Do the contexts behave as expected?

  1. Intrusion task
    • Take 3 clauses from 1 context and an “intruder” from another, e.g.:
      1. average market rate is in effect
      2. price swings are due to commodity costs
      3. net sales impact is in same store sales
      4. Volatility is in commodity prices
    • 4 RAs average 86% on the task; 500 questions each
      • This is a very high score on the task!
  2. Overlap of original extractions with accounting dictionaries:
    • 95.2% contain at least 1 word in Campbell Harvey’s dictionary
    • 84.8% contain at least 1 word in the NYSSCPA dictionary
  3. Regress MD&A sentiment on clusters conditional on sentiment
    • 82.3% (68.6%) of variation captured for negative (positive) sentiment

Falsification test

Randomly assign each clause to one of 137 groups using a uniform distribution

  • Rerun all results (all 4 DVs, all 4 sentiment methods)
  • All falsification tests have:
    • Fewer significant coefficients for negative sentiment
    • Fewer significant coefficients for positive sentiment
    • Lower adjusted \(R^2\)
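
The random assignment behind this falsification test can be sketched in a few lines. The seed and function name are assumptions for illustration; only the uniform draw over 137 groups comes from the slide.

```python
import random
from collections import Counter


def random_contexts(n_clauses, n_groups=137, seed=0):
    """Assign each clause to one of n_groups placebo contexts, uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the placebo assignment is replicable
    return [rng.randrange(n_groups) for _ in range(n_clauses)]


placebo = random_contexts(10_000)
group_sizes = Counter(placebo)
print(len(group_sizes), min(group_sizes.values()), max(group_sizes.values()))
```

The placebo groups keep the number of regressors at 137 while removing any link between a clause's meaning and its group.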

Our main results are unlikely to be driven by disaggregation in general

  • Context is likely meaningful for sentiment

Simulating aggregation

Simulation setup

  • We simulate randomly aggregating our 137 contexts into a single measure containing \(n\in[2,136]\) contexts
  • We then rerun our main regression of filing period excess return on sentiment at varying levels of aggregation
  • We randomly repeat the above process 1,000 times for each \(n\)
    • We have also done this with the other 3 DVs – results are consistent
  • In addition, we present simulations focussed on only aggregating business-relevant contexts and only ungrouped contexts

Simulating aggregation: Negative sentiment

What does this simulation show us?

  • Aggregation suppresses the mixed results found at low levels of aggregation
  • The negative sign on full document sentiment is driven mostly by ungrouped contexts

Simulating aggregation: Positive sentiment

What does this simulation show us?

  • Aggregation suppresses the mixed results found at low levels of aggregation
  • Inclusion of ungrouped contexts suppresses the expected positive coefficient on positive sentiment at document level

Example: Critical Accounting Policy disclosure

Example application: CAP Disclosure

  • Critical Accounting Policy (CAP) disclosure
    • May reduce information asymmetry and predict future financing
    • May decrease financial statement reliability or increase perceived risk
    • Empirically, can both increase and decrease short-run stock returns
  • Mechanism revolves around whether CAP disclosures increase or decrease FS reliability

Our argument

  • Positively phrased CAP disclosures may, on average, decrease reliability
  • Negatively phrased CAP disclosures may, on average, increase reliability

\(H_0\): CAP disclosures with negative sentiment have no impact on short-run stock return. \(H_A\): CAP disclosures with negative sentiment positively impact short-run stock return.

Example application: CAP Disclosure

Traditional approach

  • Fail to reject \(H_0\)
  • No main effect of CAP Disclosure

Context approach

  • Results strongly support \(H_A\)
  • No results for positive sentiment

Conclusion

Findings and takeaways

  • Sentiment, at the document level, is not a consistent measure.
    • Consistent phenomenon regardless of sentiment implementation
  • Sentiment results are driven by different contexts for different DVs

Takeaways

  1. The context on which sentiment is measured matters
  2. The dependent variable being regressed on sentiment also matters

What should we, as researchers, do then?

For most papers

  • Focus on one context in documents
  • Match to theory
  • Example: Hassan et al. (2019 QJE)

Need a broad set of discussion?

  • Our context measure offers a solution
  • Unsupervised, automated, replicable
  • Works for any document type

Thanks!



Packages used for these slides

  • dplyr
  • ggplot2
  • kableExtra
  • knitr
  • quarto
    • material-icons
  • revealjs
  • scales

Methodology slides

Illustration of extracting clauses

“The company’s earnings increased by 5% due to an improvement in operating efficiency.”

  • (company; has; earnings)
  • (company’s earnings; increased by; 5%)
  • (company’s earnings; increased due; improved operating efficiency)
  • (company’s earnings; increased due; operating efficiency)
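
Forming triples like these back into clauses can be sketched as a simple join of the three parts. This is a simplified illustration (the `Triple` type and function name are hypothetical), not the reconstruction procedure used in the paper.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def triple_to_clause(triple: Triple) -> str:
    """Join the non-empty parts of a triple into a flat clause string."""
    return " ".join(part.strip() for part in triple if part.strip())


triples = [
    Triple("company", "has", "earnings"),
    Triple("company's earnings", "increased by", "5%"),
    Triple("company's earnings", "increased due", "improved operating efficiency"),
    Triple("company's earnings", "increased due", "operating efficiency"),
]
clauses = [triple_to_clause(t) for t in triples]
print(clauses)
```

Note that the last two clauses overlap heavily, which is the kind of redundancy that a later filtering step would need to handle.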

How does the Gap statistic work?

  • Let \(k\) be the number of clusters,
  • \(B\) the number of simulated samples,
  • \(W_k\) be the K-Means inertia score on actual data,
  • \(W_{k,r}^*\) be the K-Means inertia score for iteration \(r\) with synthetic data,
  • \(\bar{l}\) be the average of the \(\text{log}\left(W_{k,r}^*\right)\) scores

\[ \begin{aligned} Gap(k) &= \left(\frac{1}{B}\right) \sum_{r=1}^{B} \text{log}\left(W_{k,r}^*\right)-\text{log}\left(W_k\right) \text{ and}\\ s_k &= sd_k \sqrt{1+\frac{1}{B}},\text{ where }sd_k = \sqrt{\left(\frac{1}{B}\right)\sum_{r=1}^{B}\left\{\text{log}\left(W_{k,r}^*\right)-\bar{l}\right\}^2} \end{aligned} \]

  • Select the lowest \(k\) such that \(Gap(k) \ge Gap(k+1) - s_{k+1}\)

I.e., select the lowest \(k\) s.t. the log-scaled error removed by clustering on real data at \(k\) is no worse than 1 SD below the log-scaled error removed at \(k+1\)
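
The selection rule can be sketched as follows, taking inertia scores as inputs. Here \(\bar{l}\) is computed as the mean of the log synthetic inertia scores, as in Tibshirani et al. (2001). The `lookahead` argument is an assumed reading of the "\(k+1\) and \(k+2\)" variant (the same inequality must also hold against \(k+2\)); the toy Gap values are made up for illustration.

```python
import math


def gap_and_s(w_real, w_synth):
    """Gap(k) and s_k from the real-data inertia and B synthetic-data inertias."""
    logs = [math.log(w) for w in w_synth]
    n_sims = len(logs)
    l_bar = sum(logs) / n_sims                 # mean of the log synthetic inertias
    gap = l_bar - math.log(w_real)
    sd = math.sqrt(sum((l - l_bar) ** 2 for l in logs) / n_sims)
    return gap, sd * math.sqrt(1 + 1 / n_sims)


def select_k(gap, s, lookahead=1):
    """Lowest k with Gap(k) >= Gap(k+j) - s_{k+j} for every j in 1..lookahead.

    gap, s: dicts mapping k to Gap(k) and s_k. Returns None if no k qualifies.
    """
    for k in sorted(gap):
        ahead = range(k + 1, k + lookahead + 1)
        if not all(m in gap and m in s for m in ahead):
            continue  # not enough larger k values to compare against
        if all(gap[k] >= gap[m] - s[m] for m in ahead):
            return k
    return None


# Toy values (not from the paper): the two rules pick different k.
toy_gap = {1: 0.10, 2: 0.50, 3: 0.55, 4: 0.80, 5: 0.82, 6: 0.83}
toy_s = {k: 0.10 for k in toy_gap}
k_one_step = select_k(toy_gap, toy_s)               # compares k with k+1
k_two_step = select_k(toy_gap, toy_s, lookahead=2)  # compares k with k+1 and k+2
print(k_one_step, k_two_step)
```

In the toy data the one-step rule stops at an early plateau, while the two-step rule continues to a larger \(k\), mirroring the caveat about more varied text.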

Double LASSO

Drawbacks of handling multicollinearity

In doing so, one may remove variables that are potentially causally important. This can lead to questions on the validity of inferences derived from LASSO-based coefficients.

  • Solution: Double LASSO (Belloni, Chernozhukov and Hansen 2014 JEP)

\[ \begin{aligned} (1)&\qquad DV_{f,t} = \alpha + \sum_{i=1}^{137}\beta_i Sentiment_{Context,i,f,t} + \gamma \cdot Controls_{f,t} + \delta \cdot Industry~FE+\varepsilon\\ (2)&\qquad Sentiment_{Context,i,f,t} = \alpha + \sum_{j\ne i}\beta_j Sentiment_{Context,j,f,t} + [...]\\ (3)&\qquad DV_{f,t} = \alpha + \sum_{i\in S}\beta_i Sentiment_{Context,i,f,t} + \gamma \cdot Controls_{f,t} + \delta \cdot Industry~FE+\varepsilon\\ & \qquad S = \{i \in [1,137]~\text{ s.t. } 0 < \sum_{\beta_i\text{ from }\{(1), (2)\}} I(\beta_i \ne 0)\} \end{aligned} \]

  1. Run 138 LASSO regressions (1 for the outcome plus 1 for each of the 137 IVs) to determine which IVs are linked to the outcome or to other IVs
  2. Run a post-LASSO OLS keeping only the variables selected in at least one of the LASSO regressions

Full tables

Table 1

Table 2

Full Table 3

Table 4 Panels A and B

Table 4 Panels C and D

Sentiment regressed on context

Sentiment regressed on context

92 (79) contexts drive negative (positive) sentiment

  • Some uniformly drive both positive and negative sentiment
    • “Cautionary statements” and “reduction in accounts”
  • Some only drive negative sentiment
    • “Accounting losses” and “risk factor disclosures”
  • Some only drive positive sentiment
    • “Increases in performance” and “tax”
  • Some drive a lack of sentiment
    • “Depreciation and amortization” and “credit facilities”

As more coefficients’ signs match to our intuition for negative sentiment, we argue that negative sentiment is more tied to context.

What contexts are high in both sentiments?

What contexts skew towards negative sentiment?

What contexts skew towards positive sentiment?

What contexts are low in both sentiments?