Session 5: Linguistics

Getting started

First, we will load the packages we need for these exercises.
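
The exact package list depends on what we cover below; the following is a minimal sketch covering the libraries used throughout this session (NLTK, SpaCy, requests, and BeautifulSoup).

    import nltk                      # tokenization, corpora, and PoS taggers
    import spacy                     # pretrained pipelines for parsing and NER
    import requests                  # downloading web pages and PDFs later on
    from bs4 import BeautifulSoup    # parsing the HTML we download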

To start, we need to load in the text data from last session and process it again.

A lot of the work in text analytics typically goes into cleaning up the documents. In the case of the text above, we probably want the 119 articles separated out such that each article is an element of a list. To do this, we can use some basic looping and conditional statements (see the sketch after this list). Note three key points:

  1. Each article starts with the phrase Full text:. This is unlikely to be used in the article text itself, so it can serve as a delimiter for the start of an article.
  2. Articles end with one of the following tags: Company / organization:, Credit:, or Copyright:
  3. One article has no body at all, containing only Full text: Not available
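
Putting those observations together, a splitting loop might look like the following sketch. It assumes the raw file has already been read into a list of lines called lines; the variable names here are illustrative rather than taken from last session's code.

    # Split the raw dump into one string per article, using the delimiters noted above
    articles = []
    current = None

    for line in lines:
        if line.startswith('Full text:'):
            # Start of a new article; skip the lone "Full text: Not available" entry
            if 'Not available' in line:
                current = None
            else:
                current = [line.replace('Full text:', '').strip()]
        elif line.startswith(('Company / organization:', 'Credit:', 'Copyright:')):
            # End of the current article
            if current is not None:
                articles.append(' '.join(current))
                current = None
        elif current is not None:
            current.append(line.strip())

    len(articles)  # expect roughly 119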

Given the resulting list, we are now in good shape to run whatever analysis we need.

Parsing with NLTK

For NLTK, we will take a look at simple part-of-speech (PoS) tagging. This involves two steps, sketched below:

  1. Tokenize the documents
  2. Apply PoS tags
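
A minimal sketch of the first step, assuming the cleaned articles from above are stored in the list articles:

    # Tokenize every article into a list of word tokens
    nltk.download('punkt')  # tokenizer models, only needed once
    tokenized_articles = [nltk.word_tokenize(article) for article in articles]
    tokenized_articles[0][:10]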

Next, we need to import a PoS tagger. One such tagger can be trained on the Brown corpus, a collection of American English text spanning several genres (including news) that was manually labeled with parts of speech. NLTK makes this data available for us for free, and we can import it from nltk.corpus once we have separately downloaded it.

Constructing a tagger is then as simple as importing a tagger class and passing the tagged corpus to it.
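
For example, a unigram tagger trained on the Brown corpus might be built as follows; this is a sketch, and the exact tagger class used in class may differ.

    from nltk.corpus import brown

    nltk.download('brown')  # only needed the first time
    # Train a unigram tagger: each word gets its most common Brown-corpus tag
    brown_tagger = nltk.UnigramTagger(brown.tagged_sents())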

Next, we can apply the built tagger using the .tag() method. As we can see, the below does a reasonable job, but misses certain words that were not in the training data, such as "Hedge".
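
A sketch of applying it to the first tokenized article; words absent from the training data come back with a tag of None:

    tagged = brown_tagger.tag(tokenized_articles[0])
    tagged[:10]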

Another possible tagger is the averaged perceptron tagger behind nltk.pos_tag(). This is a simple neural-network-style approach to PoS tagging. Note that it is able to give a label to the word "Hedge".
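
Using it is a one-liner, since the pretrained model ships with NLTK:

    nltk.download('averaged_perceptron_tagger')  # pretrained model, only needed once
    tagged_p = nltk.pos_tag(tokenized_articles[0])
    tagged_p[:10]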

We can also work directly with bigrams in NLTK. The nltk.bigrams function is helpful here, as it works on any sequence, including tokens that have already been PoS tagged. We can then filter on whatever conditions we are interested in.

The below example uses bigrams to show which words most commonly precede a noun of any sort.
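
A sketch of that bigram-based count, using the perceptron-tagged tokens from above and treating any NN* tag as a noun:

    from collections import Counter

    preceding = Counter()
    for (w1, t1), (w2, t2) in nltk.bigrams(tagged_p):
        if t2.startswith('NN'):          # the second token is some kind of noun
            preceding[w1.lower()] += 1   # count the word that precedes it

    preceding.most_common(10)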

We can also examine the ways in which words are used in these articles. Below we will use the Penn Treebank, a hand-annotated part-of-speech corpus built from WSJ articles themselves. We will then train a simple tagger on this data and apply it across all documents.
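
A sketch of that training step, using the Penn Treebank sample that ships with NLTK:

    from nltk.corpus import treebank

    nltk.download('treebank')  # only needed the first time
    wsj_tagger = nltk.UnigramTagger(treebank.tagged_sents())
    # Tag every tokenized article with the WSJ-trained tagger
    tagged_articles = [wsj_tagger.tag(tokens) for tokens in tokenized_articles]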

We can also explore the frequency of usage of different words within parts of speech. The cfd_p2w object we created earlier will facilitate this.
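
For reference, a conditional frequency distribution like cfd_p2w (mapping each PoS tag to the words that carry it) could be built as below; the exact construction used earlier may differ.

    # Condition on the tag, sample the (lowercased) word
    cfd_p2w = nltk.ConditionalFreqDist(
        (tag, word.lower())
        for article in tagged_articles
        for word, tag in article
        if tag is not None
    )

    cfd_p2w['VBD'].most_common(10)  # e.g., most common past-tense verbs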

The one issue with a simple unigram tagger is that it ignores context -- it simply applies the most common tag for a word in the training data. A bigram tagger can help with this, but it will only match exact bigrams, and thus will not match much of our text. As such, we can chain these two taggers into one, matching on bigrams first and then filling in the gaps with unigrams.
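
A sketch of the chained tagger, using the backoff argument to fall back to the unigram tagger whenever the bigram tagger has no match:

    train_sents = treebank.tagged_sents()
    # Bigram tagger first, falling back to the unigram (WSJ-trained) tagger from above
    chained_tagger = nltk.BigramTagger(train_sents, backoff=wsj_tagger)

    tagged_articles = [chained_tagger.tag(tokens) for tokens in tokenized_articles]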

A neat feature of this is that a word can take on multiple parts of speech, depending on context, as demonstrated below.

Parsing documents using SpaCy

SpaCy is a bit more flexible than NLTK, as it provides rather large, fully pretrained models as opposed to training corpora. The models themselves usually work quite well, but it is important to pin a set version of the models for a project, as SpaCy updates them fairly frequently, sometimes introducing breaking changes.

The below code will load in the default English model and process all of our articles. On a fairly fast computer, this takes around 35 seconds, which is much slower than NLTK.
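
A sketch of that step, using SpaCy's standard small English model (install it first with python -m spacy download en_core_web_sm if it is not already present):

    import spacy

    nlp = spacy.load('en_core_web_sm')
    # Process every article; nlp.pipe() batches the work for efficiency
    docs = list(nlp.pipe(articles))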

At first blush, it doesn't look like SpaCy has done anything with our document! Printing the first processed document reads the same as before processing.

However, SpaCy actually did quite a bit under the hood. For instance, it has completed a sentence segmentation task. We can extract sentences using the generator provided by .sents. The code below extracts one potentially interesting sentence.
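
For instance (the sentence index here is arbitrary, chosen only for illustration):

    sentences = list(docs[0].sents)  # .sents is a generator, so wrap it in list()
    sentences[0]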

SpaCy has also assigned parts of speech to every word, computed a dependency tree (which lets you see which words are related to which other words), and identified all the entities in the sentence for us.

We can extract entities using the .ents property.
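
For the sentence extracted above, that looks something like:

    for ent in sentences[0].ents:
        print(ent.text, ent.label_)  # each entity is a tagged Span, not plain text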

As usual with SpaCy, the above isn't actually text -- those elements are all neatly tagged!

As a couple of bonuses, it also computes noun chunks and lemmas for us.
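
Both are available directly on the processed documents:

    print(list(docs[0].noun_chunks)[:5])             # noun chunks
    print([token.lemma_ for token in docs[0]][:10])  # lemma for each token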

Scaling up entity counting

Finding entities in just one sentence is a neat trick, but not especially useful. However, SpaCy already calculated entities for every document back in the nlp.pipe() command, so we can scale this up with minimal extra computing power.
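
A sketch of counting entities across every processed document; filtering by ent.label_ (keeping only ORG entities, say) is an assumption about what we might care about, and can simply be dropped.

    from collections import Counter

    entity_counts = Counter(
        ent.text
        for doc in docs
        for ent in doc.ents
        if ent.label_ == 'ORG'   # remove this condition to count every entity type
    )

    entity_counts.most_common(10)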

Parsing HTML

The above is pretty good, and will parse most files correctly already. However, there is a second format that FASB likes to use, which is exhibited by Update No. 2014-09.

The below code picks out this entry.

As we can see, this update contains a number of <li> tags, each with a URL of its own. In this case, we want to parse them all, but tie them back to the same standard.

Also, note that since this format is interspersed with normally formatted entries, we cannot reliably use the approach used for 2021 in other years. Instead, we should find all list items (<li> tags) and check each one's format before extracting the URL and name, as sketched below.
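
A hedged sketch of that approach; the exact structure of the FASB page is an assumption here, so the checks may need adjusting (html stands in for the downloaded page source):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')

    links = []
    for li in soup.find_all('li'):
        a = li.find('a', href=True)
        if a is None:
            continue                 # not a link-bearing list item; skip it
        links.append((a.get_text(strip=True), a['href']))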

Final parser

At this point, you might expect to be able to toss the list of URLs to your favorite downloader (such as wget) and grab them all. Unfortunately, that is not the case, as each URL above leads to a disclaimer page asking you to accept the terms of service.

This is easy to get around: after you accept, FASB's webpage just adds &acceptedDisclaimer=true to the end of the URL. We can add this ourselves with a list comprehension.

We also need to swap out part of the URL: while our URLs contain page/document?pdf=, the actual PDF URLs use Page/ShowPdf?path= instead. This swap can be handled in the same list comprehension, as shown below.
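
Both fixes can be applied in a single list comprehension over the (name, URL) pairs collected above:

    pdf_urls = [
        href.replace('page/document?pdf=', 'Page/ShowPdf?path=') + '&acceptedDisclaimer=true'
        for name, href in links
    ]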

Grabbing the files

There is a second issue with downloading the FASB files which we now have to address. When you open one of the links we found in your browser, you appear to land on a PDF document as expected. However, this isn't actually a PDF file: it is a PDF embedded in an iframe on a webpage. As such, when downloading we need to scrape the actual URL of the PDF first!

The below code solves this issue with two calls to requests: the first call fetches the page containing the iframe and determines the URL of the embedded PDF, and the second call downloads the PDF itself.
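
A sketch of that two-step download; the iframe lookup is an assumption about the page structure, and urljoin handles the case where the iframe's src is a relative URL.

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def fetch_fasb_pdf(url, out_path):
        # First call: fetch the wrapper page and find the iframe's src
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        pdf_url = urljoin(url, soup.find('iframe')['src'])
        # Second call: download the actual PDF
        pdf = requests.get(pdf_url)
        with open(out_path, 'wb') as f:
            f.write(pdf.content)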

What if you want to parse out the PDFs now?

There are a lot of libraries to do this in Python, such as pdfminer, PyPDF2, tika, pdfplumber, etc.

My preferred tools are pdfminer and textract. The pdfminer package is quite powerful for working with PDF files and has fairly good text extraction. The textract package is also good at this, but is broader in scope, supporting Microsoft Word and PowerPoint files as well, along with being able to OCR images and do speech-to-text for audio. You can actually plug pdfminer.six into textract to make use of both as well.
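
For example, pulling the text out of one downloaded file with pdfminer.six's high-level helper (the filename here is hypothetical):

    from pdfminer.high_level import extract_text

    text = extract_text('fasb_update.pdf')  # hypothetical local filename
    text[:500]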