Session 6: Embeddings and Topic Modeling

Getting started

First, we will load the packages we need for these exercises.

To start, we need to load in the text data from last session and process it again.

A lot of the work in text analytics typically goes into cleaning up the documents. In the case of the above, we probably want the 119 articles separated out such that each article is an element of a list. To do this, we can use some basic looping and conditional statements (a sketch of the splitting loop follows the list below). Note three key observations:

  1. Each article starts with the phrase Full text:. This phrase is unlikely to appear in the article text itself, so it can serve as a delimiter for the start of an article.
  2. Articles end with one of the following tags: Company / organization:, Credit:, or Copyright:
  3. One article is missing; its entry just says Full text: Not available
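
A minimal sketch of that splitting logic is shown below. Note that raw_text is a placeholder for the full file contents loaded as a single string, and the exact start/end tags may need adjusting to match your export.

    # A sketch of splitting the raw export into one article per list element.
    # raw_text is assumed to hold the full file contents as a single string.
    articles = []
    current = []
    in_article = False

    for line in raw_text.splitlines():
        if line.startswith('Full text:'):
            # 'Full text:' marks the start of a new article
            if 'Not available' in line:
                continue  # skip the one missing article
            in_article = True
            current = [line.replace('Full text:', '', 1).strip()]
        elif line.startswith(('Company / organization:', 'Credit:', 'Copyright:')):
            # Any of these tags marks the end of the current article
            if in_article:
                articles.append(' '.join(current))
                in_article = False
                current = []
        elif in_article:
            current.append(line.strip())

    print(len(articles))  # ideally 119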

Given the above list, we are now in good shape to do whatever analysis we need.

Additional functions for USE analysis

Word2vec

Analogies in Word2Vec

A common way of demonstrating the power of word vectors is to show how they handle analogies. Consider the following:

Man is to king, as woman is to ___

Algorithmically, we can represent this as King - man + woman = ?.
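
A sketch of how this query might look in gensim, assuming the pretrained word2vec-google-news-300 model that is used again in the exercises below:

    import gensim.downloader as api

    # Load the pretrained Google News vectors (a large download on first use)
    base_w2v = api.load('word2vec-google-news-300')

    # King - man + woman: 'king' and 'woman' are added, 'man' is subtracted
    base_w2v.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)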

As we can see above, the algorithm correctly guesses Queen.

The sleight of hand of Word2Vec analogies

Generally, word2vec implementations do not allow the model to return one of the input words in the output. If we instead compare the raw distances between King and the analogy vector versus Queen and the analogy vector, we will, perhaps surprisingly, find that word2vec really thinks the following:

Man is to King, as woman is to King.

That being said, it isn't all sleight of hand. Queen is not the closest word to King in the word2vec structure, yet it is the second guess for the analogy. As such, we can see that the algorithm does encapsulate some meaning.
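
One way to see this for yourself is to build the analogy vector by hand and compare raw cosine similarities, reusing base_w2v from the sketch above:

    import numpy as np

    # Build the raw analogy vector: king - man + woman
    analogy = base_w2v['king'] - base_w2v['man'] + base_w2v['woman']

    def cos_sim(a, b):
        # Cosine similarity between two vectors
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Without excluding the input words, 'king' sits closer to the
    # analogy vector than 'queen' does
    print(cos_sim(analogy, base_w2v['king']))
    print(cos_sim(analogy, base_w2v['queen']))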

Exercise -- Try it out!

If you have Tensorflow installed, you can run this locally. If not, you can run it at https://rmc.link/colab_w2v

Exercise 1: Importing word2vec

As there is only one pretrained English word2vec model available through gensim's downloader, word2vec-google-news-300, we will use that model.

Note: Running the cell below will download ~1.7GB of data if you haven't already downloaded it previously. The data will be stored in ~/gensim-data/ (where ~ represents your home folder). You can safely delete the data when you are done experimenting.
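
A sketch of the import and download step:

    import gensim.downloader as api

    # Downloads the model on the first run, then loads it from the local cache
    base_w2v = api.load('word2vec-google-news-300')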

Exercise 2: Exploring word similarity in word2vec

To gain some familiarity with how word2vec functions, pick a few words and run the base_w2v.most_similar() function with those words (once per word). This will list the 10 closest words to the word you chose, based on how those words were used in Google News. Pay attention to the different linkages it makes (some you may have expected, some you may not have!). If you have no idea where to start, 'Enron' is an interesting one.

If you get an error, it just means your word wasn't in the data -- pick a new word.

Note: Some phrases are also in the data. You can try a phrase by replacing any spaces with underscores (_).

Next, take a look at how word2vec captures antonyms. Here you will most likely find that word2vec is a bit shaky, offering up words that are either rather odd or nonsensical. An interesting one to try here is Accounting (note: the capitalization matters!).
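
For reference, these calls look something like the following (the example words here are only suggestions):

    # Ten nearest neighbors of a single word
    base_w2v.most_similar('Enron')

    # Phrases use underscores in place of spaces
    base_w2v.most_similar('New_York')

    # Capitalization matters: 'Accounting' and 'accounting' are different tokens
    base_w2v.most_similar('Accounting')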

Exercise 3: Analogies with word2vec

To input an analogy, we make use of the positive= and negative= parameters.

For an analogy of the form A : B :: C : ?, we would specify positive=['B', 'C'] and negative=['A'].

Try to come up with an analogy, and see whether your predicted word ends up in word2vec's top 10! An example is given below.
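
For instance, one analogy of this form (a placeholder standing in for the original example) could be:

    # Paris : France :: Tokyo : ?
    # Here A = 'Paris', B = 'France', and C = 'Tokyo'
    base_w2v.most_similar(positive=['France', 'Tokyo'], negative=['Paris'], topn=10)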

LDA

Preparations

Setup for longer documents -- not needed for our data, but useful to have

Reading a corpus efficiently can be tricky. A good practice is to construct an iterator: rather than loading everything up front, it loads your data on the fly as it is needed for processing, which can save a lot of RAM. sorted() is used to wrap glob.glob() to ensure a consistent file order across calls.
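
A minimal sketch of such an iterator, assuming one plain-text document per file (the corpus_files/ path and the whitespace tokenization are placeholders):

    import glob

    class CorpusReader:
        """Yields one tokenized document at a time instead of loading everything."""

        def __init__(self, pattern):
            # sorted() ensures a consistent file order across calls
            self.files = sorted(glob.glob(pattern))

        def __iter__(self):
            for path in self.files:
                with open(path, encoding='utf-8') as f:
                    # Naive whitespace tokenization as a placeholder
                    yield f.read().lower().split()

    docs = CorpusReader('corpus_files/*.txt')
    for tokens in docs:
        pass  # each pass over docs re-reads the files on the fly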

Cleaning documents using spaCy

Cleaning up our articles for gensim

Needed for STM later

The cell below converts the data to tidytext format.

Building the gensim dictionary and corpus
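
A sketch of this step with gensim, assuming cleaned_docs is the list of token lists produced by the spaCy cleaning step above (the filtering thresholds are illustrative; 10 topics matches the labels below):

    from gensim import corpora, models

    # Map each token to an integer id
    dictionary = corpora.Dictionary(cleaned_docs)

    # Trim very rare and very common tokens (thresholds are illustrative)
    dictionary.filter_extremes(no_below=5, no_above=0.5)

    # Bag-of-words representation of each document
    corpus = [dictionary.doc2bow(doc) for doc in cleaned_docs]

    # Fit a 10-topic LDA model
    lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary,
                          passes=10, random_state=42)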

Examining the LDA model
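
To see the top words in each topic, continuing from the lda model sketched above:

    # Print the highest-weighted words for each of the 10 topics
    for topic_id, words in lda.print_topics(num_topics=10, num_words=10):
        print(topic_id, words)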

For the sake of exposition, I will label the above:

  1. Politics
  2. Government services
  3. Politics
  4. Sports
  5. Working life
  6. Economics
  7. Foreign investment
  8. NYC
  9. Local government
  10. Foreign policy

Applying the topics to our documents
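
A sketch of scoring each document with the fitted model:

    # Per-document topic distributions from the fitted LDA model
    doc_topics = [lda.get_document_topics(bow, minimum_probability=0.0)
                  for bow in corpus]

    # Each entry is a list of (topic_id, weight) pairs summing to roughly 1
    print(doc_topics[0])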

Normalization

Using a Universal Sentence Encoder (USE) model

We can grab the model from Tensorflow Hub, which makes this quite easy to use.
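
A sketch of loading the encoder and embedding a few texts (the module URL below is the standard USE v4 address on TensorFlow Hub):

    import tensorflow_hub as hub

    # Load the Universal Sentence Encoder (cached locally after the first download)
    use = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')

    # Texts of very different lengths all come out as 512-dimensional vectors
    embeddings = use(['A short sentence.',
                      'A much longer passage of text, such as a full news article, '
                      'still comes out as a single fixed-length vector.'])

    print(embeddings.shape)  # (2, 512)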

Note how these embeddings are all the same shape. Regardless of the length of the text, USE will compress it into a 512-dimensional space.

Exercise -- Try it out yourself!

If you have Tensorflow installed, you can run this locally. If not, you can run it at https://rmc.link/colab_use