ML for SS: Transformers

Dr. Richard M. Crowley

https://rmc.link/

Overview

Papers

Paper 1: Crowley and Wong (2025 Working)

  • Takes topic modeling to a more granular (sentence) level using BERTopic

Paper 2: Fritsch, Zhang, and Zheng (2025 JAR)

  • Identifying hard vs soft climate disclosure using a BERT model

Paper 3: Assael et al. (2022 Nature)

  • A more complex, but well motivated, use case for transformers

Technical Discussion: BERT

Python

  • Implementing BERTopic
    • Grootendorst (2022)
  • Implementing FinBERT
    • Huang, Wang, and Yang (2023 CAR)
  • Training a BERT model using fine-tuning applied to FinBERT

Python is by far the easiest option for these topics.

There is code for all of the examples either on eLearn/my website or Google Colab. Data is on eLearn.

Main application: Analyzing Wall Street Journal articles

  • On eLearn you will find a full month of the WSJ in text format

Tasks

  • Apply a topic model to the documents at the sentence level
  • Calculate the sentiment of any sentence
  • Play around with BERT models
    • Determine author gender based on the first 500 words of an article

Transformers

What are neural networks?

  • The phrase neural network is thrown around almost like a buzzword
  • Neural networks are actually a specific class of algorithms
    • There are many implementations with different primary uses

What are neural networks?

  • Originally, the goal was to construct an algorithm that behaves like a human brain
    • Thus the name
  • Current methods don’t quite reflect human brains, however:
    1. We don’t fully understand how our brains work, which makes replication rather difficult
    2. Most neural networks are constructed for specialized tasks (not general tasks)
    3. Some (but not all) neural networks use tools our brain may not have

What are neural networks?

  • Neural networks are a method by which a computer can learn from observational data
  • In practice:
    • They have been known since the 1950s (perceptrons)
    • They were not computationally worthwhile until the mid-2000s
    • They can be used to construct algorithms that, at times, perform better than humans themselves
      • But these algorithms are often quite computationally intense, complex, and difficult to understand
    • Much work has been and is being done to make them more accessible

Types of neural networks

  • There are a lot of neural network types
  • Some of the more interesting ones which we will see or have seen:
    • RNN: Recurrent Neural Network
    • LSTM: Long/Short Term Memory
    • CNN: Convolutional Neural Network
    • DAN: Deep Averaging Network
    • GAN: Generative Adversarial Network
  • Others worth noting
    • VAE (Variational Autoencoder): Generating new data from datasets
  • Not in the Zoo, but of note:
    • Transformer: the architecture underlying BERT and GPT

RNN: Recurrent NN

  • Recurrent neural networks embed a history of information in the network
    • The previous computation affects the next one
    • Leads to a short-term memory
  • Used for speech recognition, image captioning, anomaly detection, and many others

LSTM: Long Short Term Memory

  • LSTM improves the long-term memory of the network while explicitly modeling a short-term memory
  • Used wherever RNNs are used, and then some
    • Ex.: Seq2seq (machine translation)

CNN: Convolutional NN

  • Networks that excel at object detection (in images)
  • Can be applied to other data as well
  • Ex.: Inception-v3

DAN: Deep Averaging Network

  • DANs are networks that simply average their inputs
  • Averaged inputs are then processed a few times
  • These networks have found a home in NLP

GAN: Generative Adversarial Network

  • Feature two networks working against each other
  • Many novel uses
    • Ex.: Anonymizing clinical trial data by simulating an attack on the dataset
    • Ex.: Aging images

VAE: Variational Autoencoder

  • An autoencoder (AE) is an algorithm that can recreate input data
  • Variational means this type of AE can vary other aspects to generate completely new output
  • Like a simpler, noisier GAN

Transformer

  • Like RNNs and LSTMs, transformers process sequences, but they rely on attention rather than recurrence
  • Currently being applied to solve many types of problems
  • Examples: BERT, GPT-3, XLNet, RoBERTa, ChatGPT

More than words

Why might we want to go beyond words?

Transformers: How we use ML to go beyond words

  • Note that some models only keep the encoder stack on the left (BERT), and some only keep the decoder stack on the right (GPT). Some use both (T5).
    • Encoder builds embeddings
    • Decoder generates language

A simple approach for sentences: USE

  • USE is based on DAN (Deep Averaging Networks) and Transformers
    • There are variants using each
      • DAN is faster
      • Transformer is more accurate
    • USE learns the meaning of sentences via words’ implied meanings
  • Learn more: Original paper and TensorFlow site
  • In practice, it works quite well
    • But BERT quickly took over due to its fine-tuning capabilities

Example of USE in daily life

Recall that USE focuses on representing sentence-length chunks of text

Demo: Comparing sentence meaning with USE

An approach for slightly longer text: BERT

Bidirectional Encoder Representations from Transformers

  • BERT handles text up to 512 tokens
    • A good fit for sentences or up to small paragraphs
  • It comes pretrained on Wikipedia and a corpus of books
  • Many other variants trained on other text exist
    • FinBERT on 10-K and 10-Q filings, Earnings calls, and analyst reports
  • Surprisingly easy to use, thanks to platforms like HuggingFace 🤗 and sentence_transformers

Why use BERT over USE?

USE is just an embedding. BERT provides the ability to fine-tune the embeddings to map to your own classification through a supervised step.

How are models like this trained?

  1. Masked language modeling: Similar to CBOW, but using the whole sentence
  2. Next sentence prediction: Feed it pairs of sentences, half correct and half incorrect
  3. Do both! BERT is a 50/50 mix

Next Sentence Prediction

  • Correct: “Regulatory changes and requirements continue to create uncertainties for Citi and its businesses.” \(\Rightarrow\) “While the U.S. economy continues to improve, it remains susceptible to global events and volatility.”
  • Incorrect: “Regulatory changes and requirements continue to create uncertainties for Citi and its businesses.” \(\Rightarrow\) “Citigroup expenses increased 14% versus 2013 to $55.1 billion.”

What can we do with these embeddings?

  1. Direct measurement
    • Similarity of vectors \(\Rightarrow\) similarity of meaning
      • See Crowley, Huang and Lu (2025) for an example
  2. Include the embeddings in a regression or penalized regression
    • See Brown, Crowley and Elliott (2020) for an example applied to LDA
  3. Cluster within the embedding space to determine groupings of meaning
    • Done in Crowley and Wong (2025) using BERTopic
    • Also done by using kmeans + USE
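
For option 1, "similarity of vectors" is usually measured with cosine similarity. A toy sketch with placeholder vectors (real BERT embeddings would have 768+ dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real sentence embeddings
v1 = np.array([0.9, 0.1, 0.0, 0.2])
v2 = np.array([0.8, 0.2, 0.1, 0.3])  # similar direction to v1
v3 = np.array([0.0, 0.9, 0.1, 0.0])  # different direction

print(cosine_similarity(v1, v2))  # close to 1: similar meaning
print(cosine_similarity(v1, v3))  # closer to 0: dissimilar
```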

BERTopic

What is BERTopic?

  • BERTopic is a BERT embedding oriented approach to topic modeling

Benefits

  • Uses more than just word co-occurrence
  • Can assign topics at the sentence level
  • Easy to implement

Drawbacks

  • Slow (compared to LDA)
  • Can’t handle larger documents
    • Max determined by embedding, usually 512 tokens
  • Each document can only have one topic

Step 1: Embeddings

  • BERT-based embeddings are sentence level
  • They represent the location of a sentence in a high-dimensional space
  • The model supports any pretrained BERT embedding
    • There are many pretrained embeddings
    • Use the sentence_transformers package to import them
    • Multilingual models also exist

Step 1.5: Dimensionality reduction

  • We can compress the dimensionality of the embedding space
  • The embedding space can be very sparse within the context of our corpus

Compress dimensionality without losing [much] information with UMAP

  • The n_components parameter lets you pick the number of dimensions to use
  • Note: UMAP has a random initialization: specify random_state to get a replicable model

Step 2: Clustering with HDBSCAN

Images courtesy of hdbscan docs

Step 3: Topic representation

  • This is done ex post, after the topic model is fit
    • Purely to explain the clusters
  • Relies on a class-based TF-IDF method
  • Like for LDA, we can drop stopwords and infrequent terms to make it more understandable
  • We can also use n-grams
  • There is also a ChatGPT option

Unless using ChatGPT, the representation is bag-of-words

Implementation

See the python code file for an example of implementing BERTopic on 1K and 5K WSJ sentences.

FinBERT

Using FinBERT

  • FinBERT is available via Hugging Face:
    • Via the transformers package
  • Demos are available via my Colab shares:
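
A minimal sketch of off-the-shelf use, assuming the yiyanghkust/finbert-tone checkpoint on Hugging Face (the tone model released with the FinBERT paper; it downloads on first run):

```python
from transformers import pipeline

# FinBERT tone classifier; labels are Positive / Negative / Neutral
nlp = pipeline("text-classification", model="yiyanghkust/finbert-tone")

results = nlp([
    "Growth is strong and we have plenty of liquidity.",
    "There is a shortage of capital, and we need extra financing.",
])
for r in results:
    print(r["label"], round(r["score"], 3))
```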

What is transfer learning?

  • It is a method of training an algorithm on one domain and then applying the algorithm on another domain
  • It is useful when…
    • You don’t have enough data for your primary task
      • And you have enough for a related task
    • You want to augment an existing model with even more data

How can transfer learning be useful?

  • Simple neural networks are easy enough to train yourself using standard hardware
  • Some more complex models, such as BERT models, are too complex to train fully on consumer hardware
    • E.g., the model in the FinBERT paper was trained on an NVIDIA DGX-1 (128 GB of GPU memory)

As researchers, most of us won’t have sufficient hardware to train these models

  • Transfer learning represents an efficient middle ground
    • Take a pretrained model and retrain it for a specific domain
      • Neural networks often need much less data to retrain than to train initially
      • This retraining is called fine-tuning

Fine tuning

  • We can use the existing FinBERT model without its built-in classification
  • We can feed it our own documents along with a new classification to replicate
  • The amount of data needed depends on:
    • Task complexity
    • Classification noise
    • Classification balance

As few as 1,000 documents can be enough; often 10,000-30,000 documents is ideal

Conclusion

Wrap-up

BERT

  • Good for custom measurement of complex, supervised tasks applied to text data
  • Plenty of off the shelf models available
  • Not infeasible to train your own as well

BERTopic

  • Good for when you need topics assigned more granularly
  • Sacrifices a bit of theoretical/statistical grounding in the process

Packages used for these slides

Python

  • bertopic
  • datasets
  • evaluate
  • hdbscan
  • matplotlib
  • numpy
  • pandas
  • scikit-learn
  • seaborn
  • sentence_transformers
  • spacy
  • torch
  • transformers
  • umap-learn

References

  • Assael, Yannis, Thea Sommerschield, Brendan Shillingford, Mahyar Bordbar, John Pavlopoulos, Marita Chatzipanagiotou, Ion Androutsopoulos, Jonathan Prag, and Nando de Freitas. 2022. “Restoring and attributing ancient texts using deep neural networks.” Nature 603 (7900): 280-283.
  • Brown, Nerissa C., Richard M. Crowley, and W. Brooke Elliott. 2020. “What Are You Saying? Using Topic to Detect Financial Misreporting.” Journal of Accounting Research.
  • Crowley, Richard M., Wenli Huang, and Hai Lu. 2025. “Executive Tweets.” SMU Working Paper.
  • Crowley, Richard M., and M. H. Wong. 2025. “Understanding sentiment through context.” SMU Working Paper 4316229.
  • Fritsch, Felix, Qi Zhang, and Xiang Zheng. 2025. “Responding to Climate Change Crises: Firms’ Trade-Offs.” Journal of Accounting Research.
  • Grootendorst, Maarten. 2022. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure.” arXiv preprint arXiv:2203.05794.
  • Huang, Allen H., Hui Wang, and Yi Yang. 2023. “FinBERT: A Large Language Model for Extracting Information from Financial Text.” Contemporary Accounting Research 40 (2): 806-841.