ML for SS: Transformers

Dr. Richard M. Crowley

https://rmc.link/

Overview

Papers

Paper 1: Crowley and Wong (2025 Working)

  • Takes topic modeling to a more granular (sentence) level using BERTopic

Paper 2: Fritsch, Zhang, and Zheng (2025 JAR)

  • Identifying hard vs soft climate disclosure using a BERT model

Paper 3: Assael et al. (2022 Nature)

  • A more complex, but well motivated, use case for transformers

Technical Discussion: BERT

Python

  • Implementing BERTopic
    • Grootendorst (2022)
  • Implementing FinBERT
    • Huang, Wang, and Yang (2023 CAR)
  • Training a BERT model using fine-tuning applied to FinBERT

Python is by far the easiest option for these topics.

There is code for all of the examples either on eLearn/my website or Google Colab. Data is on eLearn.

Main application: Analyzing Wall Street Journal articles

  • On eLearn you will find a full month of the WSJ in text format

Tasks

  • Apply a topic model to the documents at the sentence level
  • Calculate the sentiment of any sentence
  • Play around with BERT models
    • Determine author gender based on the first 500 words of an article

Transformers

What are neural networks?

  • The phrase neural network is thrown around almost like a buzzword
  • Neural networks are actually a specific class of algorithms
    • There are many implementations with different primary uses

What are neural networks?

  • Originally, the goal was to construct an algorithm that behaves like a human brain
    • Thus the name
  • Current methods don’t quite reflect human brains, however:
    1. We don’t fully understand how our brains work, which makes replication rather difficult
    2. Most neural networks are constructed for specialized tasks (not general tasks)
    3. Some (but not all) neural networks use tools our brain may not have

What are neural networks?

  • Neural networks are a method by which a computer can learn from observational data
  • In practice:
    • They have been known since the 1950s (perceptrons)
    • They were not computationally worthwhile until the mid-2000s
    • They can be used to construct algorithms that, at times, perform better than humans themselves
      • But these algorithms are often quite computationally intense, complex, and difficult to understand
    • Much work has been and is being done to make them more accessible

Types of neural networks

  • There are a lot of neural network types
  • Some of the more interesting ones which we will see or have seen:
    • RNN: Recurrent Neural Network
    • LSTM: Long/Short Term Memory
    • CNN: Convolutional Neural Network
    • DAN: Deep Averaging Network
    • GAN: Generative Adversarial Network
  • Others worth noting
    • VAE (Variational Autoencoder): Generating new data from datasets
  • Not in the Zoo, but of note:
    • Transformer: the architecture underlying BERT and GPT

RNN: Recurrent NN

  • Recurrent neural networks embed a history of information in the network
    • The previous computation affects the next one
    • Leads to a short-term memory
  • Used for speech recognition, image captioning, anomaly detection, and many others

LSTM: Long Short Term Memory

  • LSTM improves the long-term memory of the network while explicitly modeling a short-term memory
  • Used wherever RNNs are used, and then some
    • Ex.: Seq2seq (machine translation)

CNN: Convolutional NN

  • Networks that excel at object detection (in images)
  • Can be applied to other data as well
  • Ex.: Inception-v3

DAN: Deep Averaging Network

  • DANs are networks that simply average their inputs
  • Averaged inputs are then processed a few times
  • These networks have found a home in NLP

GAN: Generative Adversarial Network

  • Feature two networks working against each other
  • Many novel uses
    • Ex.: Anonymizing clinical trial data by simulating an attack on the dataset
    • Ex.: Aging images

VAE: Variational Autoencoder

  • An autoencoder (AE) is an algorithm that can recreate input data
  • Variational means this type of AE can vary other aspects to generate completely new output
  • Like a simpler, noisier GAN

Transformer

  • Like RNNs and LSTMs, transformers process sequences, but they rely on attention rather than recurrence
  • Currently being applied to solve many types of problems
  • Examples: BERT, GPT-3, XLNet, RoBERTa, ChatGPT

More than words

Why might we want to go beyond words?

Transformers: How we use ML to go beyond words

  • Note that some models only keep the encoder stack on the left (BERT), and some only keep the decoder stack on the right (GPT). Some use both (T5).
    • Encoder builds embeddings
    • Decoder generates language

A simple approach for sentences: USE

  • USE is based on DAN (Deep Averaging Networks) and Transformers
    • There are variants using each
      • DAN is faster
      • Transformer is more accurate
    • USE learns the meaning of sentences via words’ implied meanings
  • Learn more: Original paper and TensorFlow site
  • In practice, it works quite well
    • But BERT quickly took over due to its fine-tuning capabilities

Example of USE in daily life

Recall that USE focuses on representing sentence-length chunks of text

Demo: Comparing sentence meaning with USE

An approach for slightly longer text: BERT

Bidirectional Encoder Representations from Transformers

  • BERT handles text up to 512 tokens
    • A good fit for sentences or up to small paragraphs
  • It comes pretrained on Wikipedia and a corpus of books
  • Many other variants trained on other text exist
    • FinBERT on 10-K and 10-Q filings, Earnings calls, and analyst reports
  • Surprisingly easy to use, thanks to platforms like HuggingFace 🤗 and sentence_transformers

Why use BERT over USE?

USE is just an embedding. BERT provides the ability to fine-tune the embeddings to map to your own classification through a supervised step.

How are models like this trained?

  1. Masked language modeling: Similar to CBOW, but using the whole sentence
  2. Next sentence prediction: Feed it pairs of sentences, half correct and half incorrect
  3. Do both! BERT is a 50/50 mix

Next Sentence Prediction

  • Correct: “Regulatory changes and requirements continue to create uncertainties for Citi and its businesses.” \(\Rightarrow\) “While the U.S. economy continues to improve, it remains susceptible to global events and volatility.”
  • Incorrect: “Regulatory changes and requirements continue to create uncertainties for Citi and its businesses.” \(\Rightarrow\) “Citigroup expenses increased 14% versus 2013 to $55.1 billion.”

What can we do with these embeddings?

  1. Direct measurement
    • Similarity of vectors \(\Rightarrow\) similarity of meaning
      • See Crowley, Huang and Lu (2025) for an example
  2. Include the embeddings in a regression or penalized regression
    • See Brown, Crowley and Elliott (2020) for an example applied to LDA
  3. Cluster within the embedding space to determine groupings of meaning
    • Done in Crowley and Wong (2025) using BERTopic
    • Also done by using kmeans + USE
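
For option 1, "similarity of vectors" is usually measured with cosine similarity. A toy sketch with placeholder vectors (real BERT embeddings would have 768+ dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real sentence embeddings
v1 = np.array([0.9, 0.1, 0.0, 0.2])
v2 = np.array([0.8, 0.2, 0.1, 0.3])  # similar direction to v1
v3 = np.array([0.0, 0.9, 0.1, 0.0])  # different direction

print(cosine_similarity(v1, v2))  # close to 1: similar meaning
print(cosine_similarity(v1, v3))  # closer to 0: dissimilar
```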

BERTopic

What is BERTopic?

  • BERTopic is a BERT embedding oriented approach to topic modeling

Benefits

  • Uses more than just word co-occurrence
  • Can assign topics at the sentence level
  • Easy to implement

Drawbacks

  • Slow (compared to LDA)
  • Can’t handle larger documents
    • Max determined by embedding, usually 512 tokens
  • Each document can only have one topic

Step 1: Embeddings

  • BERT-based embeddings are sentence level
  • They represent the location of a sentence in a high-dimensional space
  • The model supports any pretrained BERT embedding
    • There are many pretrained embeddings
    • Use the sentence_transformers package to import them
    • Multilingual models also exist

Step 1.5: Dimensionality reduction

  • We can compress the dimensionality of the embedding space
  • The embedding space can be very sparse within the context of our corpus

Compress dimensionality without losing [much] information with UMAP

  • The n_components parameter lets you pick the number of dimensions to use
  • Note: UMAP has a random initialization: specify random_state to get a replicable model

Step 2: Clustering with HDBSCAN

Images courtesy of hdbscan docs

Step 3: Topic representation

  • This is done ex post, after the topic model is fit
    • Purely to explain the clusters
  • Relies on a class-based TF-IDF method
  • Like for LDA, we can drop stopwords and infrequent terms to make it more understandable
  • We can also use n-grams
  • There is also a ChatGPT option

Unless using ChatGPT, the representation is bag-of-words

Implementation

See the python code file for an example of implementing BERTopic on 1K and 5K WSJ sentences.

FinBERT

Using FinBERT

  • FinBERT is available via Hugging Face:
    • Via the transformers package
  • Demos are available via my Colab shares:
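
A minimal sketch of off-the-shelf use, assuming the yiyanghkust/finbert-tone checkpoint on Hugging Face (the tone model released with the FinBERT paper; it downloads on first run):

```python
from transformers import pipeline

# FinBERT tone classifier; labels are Positive / Negative / Neutral
nlp = pipeline("text-classification", model="yiyanghkust/finbert-tone")

results = nlp([
    "Growth is strong and we have plenty of liquidity.",
    "There is a shortage of capital, and we need extra financing.",
])
for r in results:
    print(r["label"], round(r["score"], 3))
```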

What is transfer learning?

  • It is a method of training an algorithm on one domain and then applying the algorithm on another domain
  • It is useful when…
    • You don’t have enough data for your primary task
      • And you have enough for a related task
    • You want to augment an existing model with even more data

How can transfer learning be useful?

  • Simple neural networks are easy enough to train yourself using standard hardware
  • Some more complex models, such as BERT models, are too complex to train fully on consumer hardware
    • E.g., the model in the FinBERT paper was trained on an NVIDIA DGX-1 (128 GB of GPU memory)

As researchers, most of us won’t have sufficient hardware to train these models

  • Transfer learning represents an efficient middle ground
    • Take a pretrained model and retrain it for a specific domain
      • Neural networks often need much less data to retrain than to train initially
      • This retraining is called fine-tuning

Fine tuning

  • We can use the existing FinBERT model without its built-in classification
  • We can feed it our own documents along with a new classification to replicate
  • The amount of data needed depends on:
    • Task complexity
    • Classification noise
    • Classification balance

As few as 1,000 documents can be enough; often 10,000-30,000 documents is ideal

Conclusion

Wrap-up

BERT

  • Good for custom measurement of complex, supervised tasks applied to text data
  • Plenty of off the shelf models available
  • Not infeasible to train your own as well

BERTopic

  • Good for when you need topics assigned more granularly
  • Sacrifices a bit of theoretical/statistical grounding in the process

Packages used for these slides

Python

  • bertopic
  • datasets
  • evaluate
  • hdbscan
  • matplotlib
  • numpy
  • pandas
  • scikit-learn
  • seaborn
  • sentence_transformers
  • spacy
  • torch
  • transformers
  • umap-learn

References

  • Assael, Yannis, Thea Sommerschield, Brendan Shillingford, Mahyar Bordbar, John Pavlopoulos, Marita Chatzipanagiotou, Ion Androutsopoulos, Jonathan Prag, and Nando de Freitas. 2022. “Restoring and attributing ancient texts using deep neural networks.” Nature 603 (7900): 280-283.
  • Brown, Nerissa C., Richard M. Crowley, and W. Brooke Elliott. 2020. “What Are You Saying? Using Topic to Detect Financial Misreporting.” Journal of Accounting Research.
  • Crowley, Richard M., Wenli Huang, and Hai Lu. 2025. “Executive Tweets.” SMU Working Paper.
  • Crowley, Richard M., and M. H. Wong. 2025. “Understanding sentiment through context.” SMU Working Paper 4316229.
  • Fritsch, Felix, Qi Zhang, and Xiang Zheng. 2025. “Responding to Climate Change Crises: Firms’ Trade-Offs.” Journal of Accounting Research.
  • Grootendorst, Maarten. 2022. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure.” arXiv preprint arXiv:2203.05794.
  • Huang, Allen H., Hui Wang, and Yi Yang. 2023. “FinBERT: A Large Language Model for Extracting Information from Financial Text.” Contemporary Accounting Research 40 (2): 806-841.