ML for SS: Topic modeling

Dr. Richard M. Crowley

https://rmc.link/

Overview

Papers

Brown, Crowley, and Elliott (2020 JAR)

  • A rigorous implementation of LDA
    • Time-varying topics, out-of-sample prediction, and experimental validation

Roberts et al. (2014 AJPS)

  • Demonstrates an interesting variant of LDA (STM) that can help with identifying differences in information across groups or conditions

Dieng, Ruiz, and Blei (2020 TACL)

  • ETM: Combines word2vec and LDA such that words and topics exist in the same embedding space

Technical Discussion: Topic modeling packages

Python

R

  • LDA
    • {stm} can do a lot more than just standard LDA
    • {lda} and {topicmodels} both play nicely with {quanteda}
    • {mallet} gives an interface to the venerable MALLET Java package, capable of more advanced topic modeling
  • STM
    • The {stm} package is your go-to option
  • ETM
    • The {topicmodels.etm} package is quite nice

R is very good here; Python is good for basic LDA models.

There is a fully worked-out solution using Python and R; the data is on eLearn.

Main application: Analyzing Wall Street Journal articles

  • On eLearn you will find a full month of the WSJ in text format

Tasks

  • Apply a topic model to the documents
  • Analyze the effect of gender on topic discussion using STM
  • Apply ETM to the documents to see how words and topics relate

LDA

What is LDA?

  • Latent Dirichlet Allocation
  • One of the most popular methods in the field of topic modeling
  • LDA is a Bayesian method of assessing the content of a document
  • LDA assumes there are a set of topics in each document, and that this set follows a Dirichlet prior for each document
    • Words within topics also have a Dirichlet prior

From Blei, Ng, and Jordan (2003). More details from the creator

An example of LDA

How does it work?

  1. Reads all the documents
    • Calculates counts of each word within the document, tied to a specific ID used across all documents
  2. Uses variation in words within and across documents to infer topics
    • By using a Gibbs sampler to simulate the underlying distributions
      • An MCMC method
  • It’s quite complicated in the background, but it boils down to a system where generating a document follows a couple of rules:
    1. Topics in a document follow a multinomial/categorical distribution
    2. Words in a topic follow a multinomial/categorical distribution
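The two rules above can be sketched directly in numpy: draw a topic mixture from a Dirichlet, then draw each word's topic and each word from categorical distributions. This is a toy generative sketch with made-up sizes, not how gensim estimates anything:

```python
import numpy as np

rng = np.random.default_rng(42)
n_topics, n_words, doc_len = 3, 8, 20

# Dirichlet priors: alpha over topics (per document), eta over words (per topic)
alpha = np.full(n_topics, 0.1)
eta = np.full(n_words, 0.1)

# Each topic is a categorical distribution over the vocabulary
beta = rng.dirichlet(eta, size=n_topics)      # shape (n_topics, n_words)

# Generate one document: topic mixture, then a topic and a word per slot
theta = rng.dirichlet(alpha)                  # topic shares for this document
topics = rng.choice(n_topics, size=doc_len, p=theta)
doc = np.array([rng.choice(n_words, p=beta[z]) for z in topics])
```

Estimation runs this story in reverse: given only `doc`, the Gibbs sampler infers plausible `theta` and `beta`.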

Implementing LDA in Python

  • The best Python package for this is gensim
    • As long as your data fits in memory comfortably, it is easy to use
    • If not, you will need to construct a generator to pass to it, which is more complex
      • The code file for this session has an example of this!
  • In terms of computation time, you will likely spend more time prepping your text than running the LDA model
  • An implementation in R is also in the code file – R is pretty easy here as well
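For the out-of-memory case, the generator approach looks roughly like this: any object whose `__iter__` yields bag-of-words vectors can be passed to `LdaModel` as the corpus, so the file is re-read on each pass instead of held in RAM. The file path and one-document-per-line format here are assumptions for illustration:

```python
class StreamingCorpus:
    """Stream bag-of-words vectors one document at a time.

    `dictionary` is a fitted gensim corpora.Dictionary (anything with a
    doc2bow method works); `path` points to a file with one document per line.
    """
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        # Re-opened on every pass, so memory use stays flat
        with open(self.path) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())
```

Usage sketch: `gensim.models.ldamodel.LdaModel(StreamingCorpus('wsj.txt', words), id2word=words, num_topics=10)`, where `words` is a fitted `Dictionary`.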

Prepping text

  • We will take a more thorough approach using spaCy for preprocessing
    • Remove stopwords using spaCy
    • Remove numbers, symbols, and punctuation based on a neural network dependency parser
    • Lemmatize words based on the word and its POS tags
  • If accuracy is less important or your computer can’t handle spaCy’s approach, another approach is:
    • Use a regex or NLTK to tokenize into words
    • Use the stop-words package or NLTK to get a list of stopwords
      • Filter them out using a list comprehension
        • doc = [w for w in doc if w not in stopwords]
    • Apply a word-based lemmatizer from NLTK such as WordNet
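Strung together, that lighter pipeline is only a few lines. The sketch below inlines a tiny stand-in stopword list so it runs standalone; in practice you would substitute NLTK's `stopwords.words('english')` and its WordNet lemmatizer:

```python
import re

# Stand-in stopword list; swap in NLTK's stopwords.words('english')
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}

def clean(text):
    doc = re.findall(r"[a-z']+", text.lower())    # regex-based tokenization
    doc = [w for w in doc if w not in STOPWORDS]  # stopword filtering
    # Here you would lemmatize each word, e.g. with NLTK's WordNetLemmatizer
    return doc

print(clean("The market is in a slump"))  # ['market', 'slump']
```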

Running the LDA model

import gensim
import pickle

# articles contains all of our cleaned WSJ articles
# doc_names contains the articles' identifiers

# Prepare the needed parts for gensim's LDA implementation
words = gensim.corpora.Dictionary(articles)
words.filter_extremes(no_below=3, no_above=0.5)
words.filter_tokens(bad_ids=[words.token2id['_']])  # '_' is not treated as a symbol by spaCy
corpus = [words.doc2bow(doc) for doc in articles]

# Save the intermediate data -- useful if we want to tweak model parameters and re-run later
with open('../../Data/corpus_WSJ.pkl', 'wb') as f:
    pickle.dump([corpus, words, doc_names], f, protocol=pickle.HIGHEST_PROTOCOL)

# Run the model
lda = gensim.models.ldamodel.LdaModel(corpus, id2word=words, num_topics=10, passes=5,
                                      update_every=5, alpha='auto', eta='auto')
# Save the output
lda.save('../../Data/lda_WSJ')

Examining the LDA model

  1. Load in the LDA model along with the corpus structure and the document names
    • No need to do this if the model is still in memory
lda = gensim.models.ldamodel.LdaModel.load('../../Data/lda_WSJ')
with open('../../Data/corpus_WSJ.pkl', 'rb') as f:
    corpus, words, doc_names = pickle.load(f)
  2. Examine a topic
# Parameters: topic number, number of words
lda.show_topic(0, 10)
[('quarter', 0.0074630263), ('sale', 0.006757144), ('share', 0.0059989654), ('stock', 0.005518357),
 ('market', 0.005500558), ('price', 0.0053370525), ('fall', 0.0046162326), ('u.s.', 0.004529298),
 ('expect', 0.0040102066), ('maker', 0.0039168703)]

Note the weights associated with the words – some words are more meaningful than others

Examining the LDA model

  1. See the top words in each topic
for i in range(10):
    top = lda.show_topic(i, 10)
    top_words = [w for w, _ in top]
    print('{}: {}'.format(i, ' '.join(top_words)))
0: quarter sale share stock market price fall u.s. expect maker
1: obama sen. state u.s. mccain president government tax people campaign
2: employee like work ms. old people include u.s. come '
3: bank government financial market capital u.s. plan loan & credit
4: fund business market people service inc. like product money investor
5: oil price market tax high energy business gas share financial
6: china state government work week imf report plan economic u.s.
7: market price like sell stock inc. people credit call city
8: share u.s. price wine like early market play store music
9: market rate financial bank u.s. stock investor economic fall economy

Examining the LDA model

  • The pyLDAvis package produces a nice interactive map of the topics
ldavis = pyLDAvis.gensim_models.prepare(lda, corpus, words, sort_topics=False)
pyLDAvis.display(ldavis)


STM

STM

  • STM (Structural Topic Modeling) adds two elements to the standard LDA approach:
    1. Covariates can be included in determining the distribution of topics overall (“prevalence”)
    2. Covariates can be included in determining the weights of words within topics (“content”)

This allows us to better examine the impact of characteristics on textual content

A worked out example is in the R code file

Preparation: covariates

  1. We will use author gender as the covariate for STM.
    • Use the gender package to infer author gender from first names.
    • We do have to drop:
      • Articles with no given author
      • Articles with multiple authors
      • Articles whose author’s gender is ambiguous based on the name
genders <- gender(df$author) %>%
  group_by(name) %>%
  slice(1) %>%
  ungroup() %>%
  mutate(female = if_else(proportion_male > 0.5, 0, 1)) %>%
  rename(author=name) %>%
  select(author, female)
df <- df %>% left_join(genders) %>%
  filter(!is.na(female))

Preparation: Text

  • Next, we need to process the text into word counts
corp <- corpus(df)  # Convert from data frame to corpus
toks <- tokens(corp, remove_punct=T)  # Tokenize, removing punctuation
toks <- tokens_select(toks, pattern = stopwords("en"), selection = "remove")  # Remove stopwords
tdm <- dfm(toks)  # Document-feature matrix
out <- convert(tdm, to = 'stm')  # Outputs all needed data for the stm package

A basic LDA model in STM

  • Using the stm package, we can construct a standard LDA model:
topics_lda <- stm(out$documents, out$vocab, K=5)
labelTopics(topics_lda)
Topic 1 Top Words:
     Highest Prob: $, company, said, mr, new, says, million 
     FREX: wi-fi, fda, postal, iphone, verizon, medtronic, nextel 
     Lift: 2000l, add-on, airfare, airtran, android, apollo's, areva's 
     Score: wi-fi, medtronic, software, postal, zimbra, valve, ticketmaster 
Topic 2 Top Words:
     Highest Prob: says, one, can, new, like, mr, people 
     FREX: nets, sushi, omegaven, dante, pumpkin, seaweed, salmon 
     Lift: 03-3535-3600, 03-3547-6807, 07, 08, 1-15, 1-inch, 1-ranked 
     Score: sushi, dante, omegaven, pumpkin, nets, microbes, raptors 
Topic 3 Top Words:
     Highest Prob: $, said, billion, market, year, investors, financial 
     FREX: euro, yields, money-market, ratio, nyse, etf, etfs 
     Lift: 0.07, 0.16, 0.27, 0.36, 0.8, 0.86, 1.10 
     Score: stocks, earnings, third-quarter, investors, yen, 3100, billion 
Topic 4 Top Words:
     Highest Prob: mr, said, one, new, says, sen, like 
     FREX: wayne, gop, manchin, woodson, phillies, erdogan, polling 
     Lift: 11th, 13-year-old, 14th, 17.2, 1904, 1918, 1942 
     Score: obama, sen, manchin, mccain, gop, eyman, woodson 
Topic 5 Top Words:
     Highest Prob: said, tax, mr, government, $, u.s, new 
     FREX: eu, deduction, mustaf, 9-9-9, strauss-kahn, nato, cain's 
     Lift: 16.5, 30-day, 777, abated, abduction, abdul, agreement's 
     Score: obama, mccain, imf, sen, eu, mustaf, volcker 

We can also visualize the model using the {stmBrowser} package, similar to pyLDAvis in Python

STM model with prevalence

topics_stm <- stm(out$documents, out$vocab, K=5,
                  prevalence= ~ female, data = out$meta)

labelTopics(topics_stm)
Topic 1 Top Words:
     Highest Prob: $, company, said, mr, new, says, million 
     FREX: wi-fi, fda, postal, iphone, verizon, medtronic, nextel 
     Lift: 1.99, 110,000, 15-second, 190, 19a, 2000l, 4s 
     Score: wi-fi, medtronic, software, postal, zimbra, valve, ticketmaster 
Topic 2 Top Words:
     Highest Prob: says, one, can, new, like, mr, people 
     FREX: nets, sushi, omegaven, dante, pumpkin, seaweed, salmon 
     Lift: 07, 08, 1-15, 1-ranked, 1-year-old, 1,020-square, 1,121 
     Score: sushi, dante, omegaven, pumpkin, nets, microbes, raptors 
Topic 3 Top Words:
     Highest Prob: $, said, billion, market, year, investors, financial 
     FREX: euro, yields, money-market, ratio, nyse, etf, etfs 
     Lift: abbett, abbett's, aig's, bearish, blackrock, citic, closed-end 
     Score: stocks, earnings, third-quarter, investors, yen, 3100, billion 
Topic 4 Top Words:
     Highest Prob: mr, said, one, new, says, sen, like 
     FREX: wayne, gop, manchin, woodson, phillies, erdogan, polling 
     Lift: 0-and-162, 0-for-4, 05, 1-8, 1,000-seat, 1,270, 1,421-seat 
     Score: obama, sen, manchin, mccain, gop, eyman, woodson 
Topic 5 Top Words:
     Highest Prob: said, tax, mr, government, $, u.s, new 
     FREX: eu, deduction, mustaf, 9-9-9, strauss-kahn, nato, cain's 
     Lift: 16.5, abated, airstrikes, alt-a, alty, asif, aviv 
     Score: obama, mccain, imf, sen, eu, mustaf, fannie

STM Prevalence model: Effect of gender

predict_topics <- estimateEffect(
  formula = 1:5 ~ female,
  stmobj = topics_stm,
  metadata = out$meta,
  uncertainty = "Global")

plot(predict_topics, covariate = "female",
     topics = 1:5, model = topics_stm,
     method = "difference",
     cov.value1 = "female",
     cov.value2 = "male",
     xlab = "More male leaning ... More female leaning",
     main = "Topics by Gender in WSJ",
     xlim = c(-.3, .3), labeltype = "custom",
     custom.labels = 1:5)

One topic is more prevalent among male writers, and one topic is more prevalent among female writers.

STM: Content model

topics_stm2 <- stm(out$documents, out$vocab, K=5,
                   content = ~ female, data = out$meta)
labelTopics(topics_stm2)
Topic Words:
 Topic 1: moonves, royalty, devise, t-mobile, jha, playstation, ipads 
 Topic 2: bloomingdale's, eats, interception, chairs, beta, quarterback, frankly 
 Topic 3: inflation-protected, repurchase, resource-rich, cusip, low-yielding, fortis, zone's 
 Topic 4: sing, biographies, libretto, brotherhood, play's, sidney, g.i 
 Topic 5: deductibles, army's, upper-income, cftc, holtz-eakin, undermined, pre-existing 
 
 Covariate Words:
 Group 0: responds, returning, vs, raymond, apache, venezuela, night's 
 Group 1: theories, spate, pink, oppenheimer, guest, disability, unreliable 
 
 Topic-Covariate Interactions:
 Topic 1, Group 0: set-top, landfills, carbon-credit, methane, landfill, fintura, pavelek 
 Topic 1, Group 1: twingo, haitian, r2g, dena, pop-up, kiwisweat, graf 
 
 Topic 2, Group 0: second-best, grosvenor, intertwined, hauck, sterile, saarinen, poirier 
 Topic 2, Group 1: chablis, brie, hosiery, thigh-highs, instructions, husbands, paducah 
 
 Topic 3, Group 0: wholesalers, 24.9, near-collapse, overvalued, haarde, simor, vip 
 Topic 3, Group 1: saskatoon, salmar, watanabe, aquaculture, kimberly-clark, amex, issuer 
 
 Topic 4, Group 0: atwater, lenin, websites, counterinsurgency, petraeus, sharmaidze, woodruff 
 Topic 4, Group 1: manchin, grooves, parris, 17.2, holocaust, resistance, batali 
 
 Topic 5, Group 0: gaza, qaeda, pakistan's, strauss-kahn, deductions, guantanamo, inequality 
 Topic 5, Group 1: kabi, omegaven, puder, fresenius, gura, mustaf, hashim

Notes on Content models

The model is useful, as it allows us to observe deviations in word choices across covariates. However, the modeling for this is much more complex, and thus even this simple model takes ~75 minutes on a strong CPU.

STM Content model: Effect of gender

predict_topics <- estimateEffect(
  formula = 1:5 ~ female,
  stmobj = topics_stm2,
  metadata = out$meta,
  uncertainty = "Global")

plot(predict_topics, covariate = "female",
     topics = 1:5, model = topics_stm2,
     method = "difference",
     cov.value1 = "female",
     cov.value2 = "male",
     xlab = "More male leaning ... More female leaning",
     main = "Topics by Gender in WSJ",
     xlim = c(-.3, .3), labeltype = "custom",
     custom.labels = 1:5)

After adjusting for differing word usage within topics, one topic is still more prevalent among male writers, and one is still more prevalent among female writers.

ETM

LDA as an embedding model?

  • Think back to how we defined embedding models last session
  • In what ways is LDA an embedding? And how is it different?

ETM: All in on embeddings

  • LDA is an embedding of topics
    • Words are distributed (not embedded) within topics
    • We can robustly compare words within topics, but not across topics
  • word2vec is an embedding of words
    • We can directly compare words

What if we built an LDA model on top of word embeddings instead of words?

  • Key benefits:
    1. The word embedding space allows us to compare words
    2. We can embed the topics in the same embedding space
      • We can compare words and topics, not just words within topics
    3. We can thus derive distances and probabilities across all relations
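Numerically, benefit 3 is just linear algebra: topic vectors and word vectors share one space, so each topic's word distribution is a softmax over topic-word inner products, and any word-word or topic-word similarity is a cosine. A toy numpy illustration with random vectors (the sizes are made up, echoing the worked example later in these slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim, n_topics = 5394, 25, 5

word_emb = rng.normal(size=(n_words, dim))    # word vectors (as from word2vec)
topic_emb = rng.normal(size=(n_topics, dim))  # topics embedded in the same space

# Topic-word distribution: row-wise softmax over topic-word inner products
logits = topic_emb @ word_emb.T                          # (n_topics, n_words)
beta = np.exp(logits - logits.max(axis=1, keepdims=True))
beta /= beta.sum(axis=1, keepdims=True)

# Shared space means any word-word or topic-word pair is comparable
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(topic_emb[0], word_emb[0])
```

In the real model the topic embeddings are learned so that `beta` reproduces the observed word counts; here they are random purely to show the mechanics.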

Prerequisites

  1. You need to set up the following packages:
    • {torch} for training the ETM (on GPU too!)
    • {topicmodels.etm} for the ETM model
    • {word2vec} and {doc2vec} for embedding spaces
    • Optional: {udpipe} for text cleaning
  2. For visualization, you need some additional packages: {uwot} for the UMAP projection and {textplot} for plotting

Step 1: Build an embedding space

  • ETM just uses word2vec as the base
  • We can build a new word2vec space
  • We can import an existing word2vec model

Building a word2vec model in R

w2v <- word2vec(x = df$text,
                dim = 25,
                type = "skip-gram",
                iter = 10,
                min_count = 5,
                threads = 2)
embeddings <- as.matrix(w2v)

Importing a word2vec model in R

model_w2v <- read.word2vec(
  file = word2vec_file,
  normalize = TRUE)
embeddings2 <- as.matrix(model_w2v)

Word2vec examples: Nearest words

Our model

predict(w2v, newdata = c("WSJ", "King"),
        type = "nearest", top_n = 5)
$WSJ
  term1   term2 similarity rank
1   WSJ  Editor  0.9637138    1
2   WSJ     See  0.9637008    2
3   WSJ Letters  0.9555249    3
4   WSJ      Is  0.9202298    4
5   WSJ    2014  0.9158489    5

$King
  term1    term2 similarity rank
1  King   Luther  0.9325169    1
2  King Heritage  0.9146810    2
3  King       Jr  0.9072557    3
4  King     Fred  0.9005683    4
5  King  Riordan  0.8990958    5

Off the shelf model

predict(model_w2v, newdata = c("WSJ", "King"),
        type = "nearest", top_n = 5)
$WSJ
  term1               term2 similarity rank
1   WSJ Wall_Street_Journal  0.8937571    1
2   WSJ                 NYT  0.8836945    2
3   WSJ        Businessweek  0.8325805    3
4   WSJ       Bloomberg.com  0.8274593    4
5   WSJ            Dealbook  0.8149146    5

$King
  term1                term2 similarity rank
1  King              Jackson  0.7298183    1
2  King               Prince  0.7284457    2
3  King             Tupou_V.  0.7275184    3
4  King                 KIng  0.7230150    4
5  King e_mail_robert.king_@  0.7192793    5

Word2vec examples: computations

Our model

vector  <- (embeddings["Xbox", ] - 
  embeddings["Microsoft", ] + 
  embeddings["Sony", ])
predict(w2v, vector,
        type = "nearest", top_n = 10)
          term similarity rank
1         Sony  0.9856522    1
2         Xbox  0.9824312    2
3         Live  0.9694299    3
4          Wii  0.9692171    4
5  PlayStation  0.9550805    5
6  installment  0.9504678    6
7        brisk  0.9422508    7
8      bottler  0.9422448    8
9     consoles  0.9387657    9
10 interactive  0.9379721   10

Off the shelf model

vector  <- (embeddings2["Xbox", ] -
            embeddings2["Microsoft", ] +
            embeddings2["Sony", ])
predict(model_w2v, vector,
        type = "nearest", top_n = 10)
                   term similarity rank
1                   PS2  0.9753843    1
2           Playstation  0.9708967    2
3                   PSP  0.9639490    3
4                   Wii  0.9580098    4
5  PlayStation_Portable  0.9542965    5
6           Nintendo_DS  0.9442139    6
7                  XBox  0.9394975    7
8           Wii_console  0.9381446    8
9                 PSPgo  0.9374237    9
10             Sony_PSP  0.9338397   10

Step 2: Preparing to run the ETM

  • Before we can run the model, we need to preprocess our documents into document term matrices like for LDA
# split apart the documents
dtm   <- strsplit.data.frame(df, group = "doc_id", term = "text", split = " ")
# calculate bag-of-words word frequencies
dtm   <- document_term_frequencies(dtm)
# convert to a document term matrix
dtm   <- document_term_matrix(dtm)
# remove terms with relatively minimal TF-IDF weights
dtm   <- dtm_remove_tfidf(dtm, prob = 0.50)
  • Then, we need to ensure the words in our embedding space and our documents are identical
    • Important for matrix multiplication between the topic vectors and word2vec embeddings
# Find the shared vocabulary across the models.
vocab <- intersect(rownames(embeddings), colnames(dtm))
# Subset the embedding matrix to our shared vocabulary.
emb <- dtm_conform(embeddings, rows = vocab)
# Subset the DTM to our shared vocabulary.
dtm <- dtm_conform(dtm, columns = vocab)
  • Matrix dimensions: dtm of (1195, 5394) and embedding space of (5394, 25)
    • 1,195 documents, 5,394 words, 25 dimensional word2vec embedding

Running the ETM

set.seed(1234)
torch_manual_seed(4321)
model     <- ETM(k = 5, dim = 100, embeddings = emb)
optimizer <- optim_adam(params = model$parameters,
                        lr = 0.005,
                        weight_decay = 0.0000012)
loss      <- model$fit(data = dtm, optimizer = optimizer,
                       epoch = 40, batch_size = 1000)

plot(loss)

Examining the topics

terminology  <- predict(model, type = "terms", top_n = 10)
terminology
[[1]]
         term        beta rank
2148    banks 0.001592649    1
2134      GBP 0.001560317    2
3027      PLC 0.001510595    3
2398      oil 0.001493404    4
2419 reserves 0.001380166    5
2944    Korea 0.001339821    6
2424  exports 0.001333484    7
2350     auto 0.001318377    8
1229 producer 0.001309911    9
4290  credits 0.001305023   10

[[2]]
           term        beta rank
2617     Google 0.002803810    1
2148      banks 0.002437610    2
1923    players 0.002412275    3
1814      Yahoo 0.002407836    4
4615    revenue 0.002303097    5
2505 regulators 0.002173478    6
1232     profit 0.002033646    7
2961      cards 0.001918313    8
1347   carriers 0.001894356    9
2564     Senate 0.001884263   10

[[3]]
          term        beta rank
3205      army 0.002488749    1
2138        GM 0.002014414    2
3689 coalition 0.002004599    3
2939      drug 0.001941700    4
2366    troops 0.001833640    5
1911  military 0.001746292    6
1255     truck 0.001731683    7
2921     labor 0.001713173    8
1861     civil 0.001656562    9
4954   Iranian 0.001630007   10

[[4]]
           term        beta rank
3098 securities 0.006724702    1
1886      loans 0.006241893    2
3037   Treasury 0.005707681    3
1942        Fed 0.005643071    4
2219    Reserve 0.005500115    5
1983      funds 0.005183956    6
2423  mortgages 0.004702089    7
2395      bonds 0.004577343    8
2148      banks 0.004513196    9
2951       bond 0.004287204   10

[[5]]
         term        beta rank
2114   income 0.013995818    1
2583      tax 0.011442363    2
2341    taxes 0.009401063    3
2885   Social 0.005688160    4
1029 patients 0.004278495    5
2092 expenses 0.004115332    6
4290  credits 0.003999760    7
3115  genetic 0.003701571    8
4919 coverage 0.003448279    9
2178  disease 0.003358758   10

This is pretty clean! Especially for such a small corpus

Applying topics to documents

newdata <- head(dtm, n = 5)
scores  <- predict(model, newdata, type = "topics")
scores
         [,1]       [,2]       [,3]      [,4]       [,5]
6  0.11162491 0.05244913 0.06140492 0.6977346 0.07678639
8  0.08471944 0.05569922 0.03959323 0.7797222 0.04026586
10 0.07642368 0.56824833 0.07757105 0.1039151 0.17384183
11 0.06825889 0.03992265 0.02839619 0.8342549 0.02916736
12 0.12825677 0.08921261 0.07952920 0.6440677 0.05893374

Visualizing the embedding space

  • We can use UMAP to visualize the 25-dim embedding space
  • Since ETM embeds the topics and words in the same space, we can see their relationship
manifolded <- summary(model, type = "umap",
                      n_components = 2,
                      metric = "cosine",
                      n_neighbors = 15, 
                      fast_sgd = FALSE,
                      n_threads = 2,
                      verbose = TRUE)
space      <- subset(manifolded$embed_2d,
  cluster %in% c(1, 2, 3, 4, 5) & rank <= 5)
textplot_embedding_2d(space,
  title = "ETM topics",
  subtitle = "embedded in 2D using UMAP",
  encircle = FALSE, points = TRUE)

What if we use off the shelf embeddings?

set.seed(1234)
torch_manual_seed(4321)
# emb2: embeddings2 conformed to the vocabulary shared with dtm, as done for emb
model     <- ETM(k = 5, dim = 100, embeddings = emb2)
optimizer <- optim_adam(params = model$parameters,
                        lr = 0.005,
                        weight_decay = 0.0000012)
loss      <- model$fit(data = dtm, optimizer = optimizer,
                       epoch = 40, batch_size = 1000)

terminology  <- predict(model, type = "terms", top_n = 10)
terminology
[[1]]
        term        beta rank
9925    Sen. 0.023856310    1
5036    Gov. 0.006392638    2
3186    fish 0.005434623    3
8419   music 0.004921591    4
5413   court 0.004710632    5
14119    Dr. 0.004453861    6
764    Apple 0.004127380    7
1262  Google 0.004002580    8
1730    city 0.003943428    9
9916    drug 0.003422471   10

[[2]]
          term        beta rank
2011     banks 0.014750409    1
4647     Obama 0.012925205    2
4149    McCain 0.007163467    3
13807      oil 0.005454842    4
5878       gas 0.004044053    5
5400  Treasury 0.003656057    6
7049    dollar 0.003606206    7
959      Texas 0.003507924    8
13493    loans 0.003141974    9
7336     Palin 0.002974555   10

[[3]]
            term        beta rank
9925        Sen. 0.017854473    1
4647       Obama 0.014861688    2
4149      McCain 0.014316648    3
7698       China 0.011084058    4
3189      energy 0.010519576    5
6473     Chinese 0.007447055    6
13876 production 0.006961744    7
12556       gold 0.006211172    8
3119      Russia 0.004538346    9
3590     Russian 0.004521761   10

[[4]]
           term        beta rank
2011      banks 0.018051842    1
5876      funds 0.017585717    2
10927       tax 0.016355807    3
8115     shares 0.015699711    4
5629  employees 0.009731499    5
11241    stocks 0.008651924    6
1308   earnings 0.007395790    7
13851       net 0.006663597    8
7132     profit 0.006529912    9
13493     loans 0.006187856   10

[[5]]
          term        beta rank
10927      tax 0.004361523    1
7049    dollar 0.004206724    2
2814      fuel 0.004036348    3
2375  November 0.003627720    4
5936     Wayne 0.003560776    5
12567     loan 0.003521387    6
9347     board 0.003484925    7
1730      city 0.003087238    8
2898      card 0.003062016    9
6178  division 0.002969853   10

Conclusion

Wrap-up

LDA models work well as measures and can capture meaningful variation in text

  • Provides document-level insight into content distribution

STM provides more power for analyses interested in whether textual content differs across groups or treatments

  • Useful for social science research

ETM provides several advantages over LDA

  • Coherent topic models over smaller data sets
  • Allows us to compare words and topics within an embedding space

Packages used for these slides

Python

  • gensim
  • matplotlib
  • numpy
  • pandas
  • pyLDAvis
  • seaborn
  • spacy

R

  • dplyr
  • gender
  • quanteda
  • stm
  • stmBrowser
  • textplot
  • topicmodels.etm
  • torch
  • udpipe
  • word2vec

References

  • Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet allocation.” Journal of Machine Learning Research 3 (2003): 993-1022.
  • Brown, Nerissa C., Richard M. Crowley, and W. Brooke Elliott. “What are you saying? Using topic to detect financial misreporting.” Journal of Accounting Research 58, no. 1 (2020): 237-291.
  • Cer, Daniel, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant et al. “Universal sentence encoder.” arXiv preprint arXiv:1803.11175 (2018).
  • Dieng, Adji B., Francisco J. R. Ruiz, and David M. Blei. “Topic modeling in embedding spaces.” Transactions of the Association for Computational Linguistics 8 (2020): 439-453.
  • Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. “Structural topic models for open-ended survey responses.” American Journal of Political Science 58, no. 4 (2014): 1064-1082.