The data was originally encoded as a zip of folders (users) of folders (Outlook folders) of numbered text files (emails). A Python script was used to extract all metadata from the emails, clean the To and From fields, strip newlines out of the text, and write everything to a CSV.
library(tidyverse) # data wrangling and plotting
library(quanteda)  # text analysis
library(DT)        # interactive tables
library(stringr)   # string handling (also attached by tidyverse)
Here I will use a version of the Enron data taken straight from the US government’s dump of Enron’s Outlook files. There is also a dump available on Kaggle at this link, but it is not quite as clean to work with.
# Read in data
df <- read_csv('/media/Data/Data/Enron_Emails/enron_emails.csv')
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Mime-Version` = col_double()
## )
## See spec(...) for full column specifications.
# Convert date from default Outlook format to a proper date
df$Date <- as.Date(substr(df$Date,1, str_length(df$Date)-3), format="%a, %d %b %Y")
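# For illustration, the cleaning applied to a single made-up header value
# (a hypothetical example, not a row from the data):
raw_date <- "Wed, 18 Oct 2001 08:26:00 -0700 (PDT)"
as.Date(substr(raw_date, 1, str_length(raw_date) - 3), format = "%a, %d %b %Y")
# Should yield "2001-10-18" -- as.Date() ignores trailing text past the format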
# Trim dates -- some dates are malformed (leading to a missing year), and some
# are after the investigation started on October 19, 2001
df <- df %>% filter(Date > as.Date("1990-01-01"), Date < as.Date("2001-10-19"))
nrow(df)
## [1] 398836
head(as.data.frame(df %>% select(Date, From, To)))
## Date From
## 1 2001-10-18 exchange.administrator@enron.com
## 2 2001-10-16 continental_airlines_inc@coair.rsc01.com
## 3 2001-05-31 eddie.zhang@enron.com
## 4 2001-05-31 40enron@enron.com
## 5 2001-06-08 jason.jones@enron.com
## 6 2001-06-08 40enron@enron.com
## To
## 1 matt.smith@enron.com
## 2 matt.smith@enron.com
## 3 tara.piazze@enron.com, matt.smith@enron.com
## 4 <NA>
## 5 colin.tonks@enron.com, d.alexander@enron.com, lindon.chiu@enron.com,
## 6 <NA>
Next, let’s examine some of the more distinctive terms tied to Enron’s fraud, namely the names of some of its biggest Special Purpose Vehicles (SPVs); see this Guardian article. A list of terms to check:
df_spv <- df %>% filter(str_detect(str_to_lower(text), "spv"))
df_chewco <- df %>% filter(str_detect(str_to_lower(text), "chewco"))
df_jedi <- df %>% filter(str_detect(str_to_lower(text), "jedi"))
df_obi <- df %>% filter(str_detect(str_to_lower(text), "obi-1"))
Next, let’s look at excerpts from some of these emails.
For SPVs:
df_spv$text %>%
str_extract_all(regex("([^\\s]+\\s+){0,10}spv.*?(\\s+[^\\s]+){10}", multiline=TRUE, ignore_case=TRUE)) %>%
unlist() %>%
unique() %>%
as_tibble() %>%
DT::datatable()
For Chewco:
df_chewco$text %>%
str_extract_all(regex("([^\\s]+\\s+){0,10}chewco.*?(\\s+[^\\s]+){10}", multiline=TRUE, ignore_case=TRUE)) %>%
unlist() %>%
unique() %>%
as_tibble() %>%
DT::datatable()
For Jedi:
df_jedi$text %>%
str_extract_all(regex("([^\\s]+\\s+){0,10}jedi.*?(\\s+[^\\s]+){10}", multiline=TRUE, ignore_case=TRUE)) %>%
unlist() %>%
unique() %>%
as_tibble() %>%
DT::datatable()
There are no hits for Obi-1, however, so we will skip that SPV.
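As an aside, since the three excerpt pipelines above differ only in the keyword, we could wrap the pattern in a small helper. This is just a sketch -- keyword_context is a name introduced here, not a function from any package:
keyword_context <- function(texts, keyword, n_words = 10) {
  # Capture up to n_words words before the keyword and n_words words after it
  pattern <- regex(
    paste0("([^\\s]+\\s+){0,", n_words, "}", keyword, ".*?(\\s+[^\\s]+){", n_words, "}"),
    multiline = TRUE, ignore_case = TRUE)
  texts %>%
    str_extract_all(pattern) %>%
    unlist() %>%
    unique()
}
# e.g., keyword_context(df_jedi$text, "jedi") reproduces the Jedi excerpts above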
Next, we will take a look at who talked with whom at Enron, to understand a bit more about the nature of these emails. Before we do that, though, we need to understand the data’s structure: an email can be sent to multiple people at once, so we need to know how many recipients each field can hold before we can build a network.
# We need to understand how the network is structured first
maxFrom <- max(str_count(df$From, ","), na.rm=T)+1
maxTo <- max(str_count(df$To, ","), na.rm=T)+1
maxCc <- max(str_count(df$Cc, ","), na.rm=T)+1
maxBcc <- max(str_count(df$Bcc, ","), na.rm=T)+1
# Note: the combined count only considers emails where To, Cc, and Bcc are all non-missing
maxToAll <- max(str_count(df$Bcc, ",") + str_count(df$Cc, ",") + str_count(df$To, ","), na.rm=T)+3
print(paste0("From:", maxFrom, "; To:", maxTo, "; Cc:", maxCc, "; Bcc:", maxBcc, "; ToAll:", maxToAll))
## [1] "From:1; To:6; Cc:6; Bcc:5; ToAll:11"
For any given email there is only 1 sender (as it should be), but up to 6 recipients in the To field, 6 in the Cc field, 5 in the Bcc field, and up to 11 across all three fields combined. As such, we need a significant amount of preprocessing to get a usable structure out of this data.
library(tidyr) # for tidying the data
# Use separate() to build additional columns for those with multiple emails
# We will do this on a separate copy of df so as to keep a clean copy of df
# for later analysis
df2 <- df
df2 <- separate(df2, To, paste0("To.", 1:maxTo), sep=", ", fill="right")
df2 <- separate(df2, Cc, paste0("Cc.", 1:maxCc), sep=", ", fill="right")
df2 <- separate(df2, Bcc, paste0("Bcc.", 1:maxBcc), sep=", ", fill="right")
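To see what separate() is doing here, consider a two-row toy example (hypothetical addresses):
tibble(To = c("a@enron.com, b@enron.com", "c@enron.com")) %>%
  separate(To, paste0("To.", 1:2), sep = ", ", fill = "right")
# Row 1 splits across To.1 and To.2; row 2 gets To.2 = NA thanks to fill = "right"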
To visualize the network of emails, we will use the visNetwork R package, which creates interactive network visualizations.
library(visNetwork)
nodes <- df2 %>%
select(starts_with("Bcc."), starts_with("Cc."), starts_with("To."), From) %>%
sapply(unique) %>%
unlist() %>%
unique()
enron_nodes <- nodes[str_detect(nodes, "enron\\.com")]
print(paste0("There are ", length(nodes), " emails in the data, including ", length(enron_nodes), " from within Enron."))
## [1] "There are 31710 emails in the data, including 14564 from within Enron."
That’s a lot of addresses. We’ll have to trim this down later. For now, though, let’s build a full network graph. We have the nodes already, i.e., the email addresses involved. Now we need the edges: who talked to whom, and how many times. For visNetwork, that means a data frame with two columns, from and to (where to comes from a To, Cc, or Bcc column). We will also later count how many times each pair occurs. Note: !! is a shorthand trick to get dplyr to use dynamically created variable names.
# Build a full edge network
edges <- df2 %>%
select(From, To.1) %>%
rename(from=From, to=To.1)
for(var in c(paste0("To.", 2:maxTo), paste0("Cc.", 1:maxCc), paste0("Bcc.", 1:maxBcc))) {
temp <- df2 %>%
select(From, !!var) %>%
rename(from=From, to=!!var)
edges <- rbind(edges, temp)
}
# Drop any rows with NA -- ~85% of the data
edges <- edges[!is.na(edges$to) & !is.na(edges$from),]
print(paste0("There are ", nrow(edges), " in the data, of which ", nrow(unique(edges)), " are unique."))
## [1] "There are 792284 in the data, of which 95517 are unique."
Let’s start by combining identical (to, from) pairs. We’ll create a new variable, count, to keep track of the number of occurrences of each pair.
edges <- edges %>%
group_by(to, from) %>%
mutate(count=n()) %>%
slice(1) %>%
ungroup()
summary(edges$count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 8.295 6.000 8358.000
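Which pair accounts for that maximum of 8358? Sorting the edge list makes this easy to check:
edges %>%
  arrange(desc(count)) %>%
  select(from, to, count) %>%
  head(3)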
The most common is… Vince Kaminski emailing himself, 8358 times. Let’s also remove edges whose sender and receiver are identical, since a self-email isn’t communication between people. This will remove ~1000 edges.
edges <- edges[edges$to != edges$from,]
summary(edges$count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 8.139 6.000 5704.000
To visualize this network, we’ll need to significantly shrink this down. We’ll start by determining the most prolific emailers in the network.
nodes <- data.frame(id=nodes, title=nodes, stringsAsFactors = F)
nodes <- nodes %>%
left_join(edges[, c("to", "count")], by = c("id" = "to")) %>%
rename(to_count=count) %>%
group_by(id) %>%
mutate(to_count = sum(to_count, na.rm=T)) %>%
slice(1) %>%
ungroup() %>%
left_join(edges[, c("from", "count")], by = c("id" = "from")) %>%
rename(from_count=count) %>%
group_by(id) %>%
mutate(from_count = sum(from_count, na.rm=T)) %>%
slice(1) %>%
ungroup() %>%
mutate(count = to_count + from_count) # note: the sums above already handled NAs
nodes$key_person <- ifelse(nodes$id %in% c('kenneth.lay@enron.com', 'jeff.skilling@enron.com', 'andrew.fastow@enron.com'), 1, 0)
We’ll take the 3 executives in the data, along with the other most prolific accounts, up to 100 accounts total. We will also scale edge width by the number of emails between individuals, and group individuals by their roles within Enron. To limit the number of edges, we will require more than 10 emails for an edge to be plotted. I have also included 2 ex post groupings: 1 other key player in the fraud (Greg Whalley), and 1 internal whistleblower, who is none other than Vince Kaminski (the guy who was emailing himself so frequently).
nodes_f <- nodes %>%
arrange(-key_person, -count) %>%
head(n=100) %>%
mutate(group = ifelse(key_person == 1, "Executives",
ifelse(id %in% c('greg.whalley@enron.com'), "Other key player",
ifelse(id %in% c('vkaminski@aol.com','vince.kaminski@enron.com'), 'Internal whistleblower',
ifelse(str_detect(id,'enron'), 'Others in Enron', 'Not in Enron')))))
nodes_i <- nodes_f[,c("id")]
nodes_i$keep <- 1
edges_f <- edges %>%
left_join(nodes_i, by=c("to" = "id")) %>%
rename(keep_to=keep) %>%
left_join(nodes_i, by=c("from" = "id")) %>%
mutate(keep = keep * keep_to) %>%
filter(keep == 1) %>%
select(-keep, -keep_to) %>%
filter(count > 10) %>%
mutate(value = log(count) + 1) %>%
mutate(title = paste(count, "emails")) %>%
select(-count)
Now we can plot the network. Take a look at its structure, and at how significantly the relevant players’ networks overlap. Likewise, notice how frequently Vince Kaminski was in contact with those individuals, as well as with his own personal account (5,400 times).
visNetwork(nodes_f, edges_f) %>%
visEdges(arrows = "to") %>%
visGroups(groupname = "Executives", color = list(background = "#75A0DA", border = "#5590BA")) %>%
visGroups(groupname = "Other key player", color = list(background = "#97C2FC", border = "#77A2DC")) %>%
visGroups(groupname = "Internal whistleblower", color = list(background = "#FFFF7F", border = "#DDDD5D")) %>%
visGroups(groupname = "Others in Enron", color = list(background = "#FCC297", border = "#DCA277")) %>%
visGroups(groupname = "Not in Enron", color = list(background = "#C2FC97", border = "#A2DC77")) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = FALSE, selectedBy ="group") %>%
visLegend()
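Since visNetwork produces an htmlwidget, the interactive graph can also be saved as a standalone file for sharing (a sketch; the file name here is arbitrary):
net <- visNetwork(nodes_f, edges_f) # or the fully styled pipeline above
htmlwidgets::saveWidget(net, "enron_network.html", selfcontained = TRUE)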
Here, we’ll take a look at word usage. The first step is to create a document-feature matrix of terms.
corp <- corpus(df, text_field="text")
tdm <- dfm(corp,
stem=FALSE, # Not stemming, as specific word variants may be helpful here
remove=stopwords(source='smart'),
remove_punct=TRUE,
remove_numbers=TRUE) %>%
dfm_trim(min_termfreq=2, termfreq_type = "count") # 2 so as to keep rare words
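Note that the dfm() call above uses the older quanteda (v1/v2) interface. In quanteda 3 and later, the tokenization options moved to tokens(), so a roughly equivalent pipeline -- a sketch, untested against this exact corpus -- would be:
tdm <- corp %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stopwords(source = "smart")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 2, termfreq_type = "count")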
It is natural to suspect that someone at the top was in charge of the fraud. Let’s make indicators for the key people at Enron.
Kenneth Lay: user lay-k; email kenneth.lay@enron.com
Jeff Skilling: user skilling-j; email jeff.skilling@enron.com
Andrew Fastow: not in the system at the time of the data dump, sadly, but his email, andrew.fastow@enron.com, lives on in others’ inboxes
Note that the code below flags Lay as CEO, Skilling as CFO, and Fastow as COO; strictly speaking, Skilling was Enron’s CEO and Fastow its CFO, so treat the flag names as shorthand for the three men.
docvars(tdm, field="CEO_flag") <- ifelse(
tdm@docvars$From =='kenneth.lay@enron.com' | tdm@docvars$To =='kenneth.lay@enron.com' | tdm@docvars$user == 'lay-k',1,0)
docvars(tdm, field="CFO_flag") <- ifelse(
tdm@docvars$From =='jeff.skilling@enron.com' | tdm@docvars$To =='jeff.skilling@enron.com' | tdm@docvars$user == 'skilling-j',1,0)
docvars(tdm, field="COO_flag") <- ifelse(
tdm@docvars$From =='andrew.fastow@enron.com' | tdm@docvars$To =='andrew.fastow@enron.com',1,0)
docvars(tdm, field="Person_wrote") <- ifelse(
tdm@docvars$From =='kenneth.lay@enron.com','CEO',
ifelse(tdm@docvars$From =='jeff.skilling@enron.com','CFO',
ifelse(tdm@docvars$From =='andrew.fastow@enron.com','COO',"Other")))
topfeatures(tdm, n=20, groups = "Person_wrote")
## $Other
## cn enron hou ect@ect na subject
## 675114 657724 552031 483243 419856 325014
## ou recipients enron@enron pm cc power
## 321935 315444 288245 257599 254956 223492
## corp energy e-mail 3d http time
## 206838 190957 172216 142064 140057 135393
## gas forwarded
## 133902 130422
##
## $CEO
## cn enron recipients ou na notesaddr
## 12068 5489 5484 5419 5367 1083
## john eu david michael james mark
## 138 119 117 100 94 82
## jeff jr program scott chris kevin
## 78 73 70 60 56 53
## jennifer martin
## 51 50
##
## $CFO
## cn enron na ou
## 542 385 279 239
## recipients jeff enron@enron skilling
## 239 211 154 143
## corp subject cc ect@ect
## 126 110 93 88
## hou pm notesaddr survey
## 81 72 64 62
## company business enron@enronxgate lon
## 56 56 52 50
##
## $COO
## ljm standard mark andy jeff
## 3 2 2 2 2
## bidder deleted enron list organization
## 2 1 1 1 1
## company management back employees things
## 1 1 1 1 1
## contribution business hours put critical
## 1 1 1 1 1
tfidf_mat <- dfm_tfidf(tdm, base=2, scheme_tf="prop")
topfeatures(tfidf_mat, n=20, groups = "Person_wrote")
## $Other
## hou cn ect@ect enron pm attached
## 7957.879 7706.878 7382.475 5659.278 4785.167 4676.662
## call enron@enron na vince deal fyi
## 4637.330 4554.799 4343.520 4270.360 4238.473 4152.848
## kay time cc recipients corp ou
## 3943.646 3689.574 3670.498 3654.376 3626.297 3613.147
## good http
## 3610.848 3513.862
##
## $CEO
## rosie cn linda recipients misha ou
## 11.2095242 9.4462798 5.1205588 4.1603302 3.9637302 3.9546082
## na ken siegel enron likewise christie
## 2.5338161 2.3608597 1.7359353 1.7135776 1.6042161 1.3730036
## hasan notesaddr glad lay hear lot
## 1.2924286 1.2866592 1.0919749 1.0369263 0.9526306 0.8934625
## chance jamail
## 0.8875167 0.8762796
##
## $CFO
## skilling jeff xms test srs expense
## 15.300488 12.201614 7.498691 6.187320 5.403390 5.187414
## unable corp forward asap sherri number
## 5.009980 4.893210 4.795625 4.695302 4.677416 4.426123
## e.on enron@enron beard tully gold@ect enron
## 4.421360 4.127320 4.060765 3.908165 3.800270 3.487779
## stadium joannie
## 3.378286 3.316170
##
## $COO
## trapped flood inside things
## 2.8393771 2.5314140 1.7756511 1.1857481
## ljm bidder discern innumerable
## 1.0162918 0.5655244 0.4173776 0.4160590
## standard andy tier watched
## 0.3849606 0.3653586 0.3414658 0.3313931
## misunderstanding appropriately disagree inclusion
## 0.3215591 0.3198636 0.3191625 0.3083020
## disadvantage glisan ubs arms
## 0.3024561 0.2905278 0.2806483 0.2755009
Nothing too interesting yet, except perhaps the COO’s mentions of LJM, which stand out in light of this and this.
Note: due to the size of the text corpus, we will not attempt a properly specified LDA model, which would likely need more than 50 topics and would take well over a day to run. We will, however, take a quick look at a smaller 10-topic model, which still takes around 3 hours to run on a fast computer.
Next, we convert the document-feature matrix into the input format the STM algorithm expects.
# A bunch of emails had no content, like forwards and such. These get dropped here
out <- convert(tdm, to = 'stm')
# Lost 10140 emails
# Since the above took a while to run, we'll save a point here for replicating in the future
saveRDS(out, '/media/Data/Data/Enron_Emails/pre-LDA.rds') # 144.3MB
Then we run the STM algorithm.
library(stm)
topics <- stm(out$documents, out$vocab, K=10)
saveRDS(topics, '/media/Data/Data/Enron_Emails/LDA_topics.rds')
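If we wanted a more principled choice of K, stm also provides searchK(), which fits candidate models and compares held-out likelihood and other diagnostics. On a corpus this size it would be very expensive, so this is only a sketch:
k_search <- searchK(out$documents, out$vocab, K = c(10, 25, 50))
plot(k_search)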
Next, we’ll take a look at the topics.
library(stm)
## stm v1.3.3 (2018-1-26) successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
topics <- readRDS('/media/Data/Data/Enron_Emails/LDA_topics.rds')
out <- readRDS('/media/Data/Data/Enron_Emails/pre-LDA.rds')
labelTopics(topics)
## Topic 1 Top Words:
## Highest Prob: e-mail, pete.davis@enron.com, final, error, schedule, mark.guzman@enron.com, ryan.slinger@enron.com
## FREX: pete.davis@enron.com, jbryson@enron.com, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, bill.williams.iii@enron.com, dporter3@enron.com, craig.dean@enron.com, eric.linder@enron.com
## Lift: _316b0b4b, _5b01220d, _a5feb6fd.f190b7d2, _b7ec8e15.52335ec2, _d288d5b4, _nextpart_000_0021_01c0d23e, _nextpart_000_003d_01c04af0
## Score: pete.davis@enron.com, jbryson@enron.com, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, dbcaps97data, leaf.harasin@enron.com, mkt_type, trans_date
## Topic 2 Top Words:
## Highest Prob: 3d, http, image, company, million, online, free
## FREX: m.asp, font, fares, www.fool.com, specials, venturewire, href
## Lift: _________________________________________t, _______________________________________a, _______________________________________c, _______________________________________d, ______________________________________sponsored, _____________________________________m, _____________________________________sponsored
## Score: 3d, image, m.asp, www.fool.com, venturewire, http, imds
## Topic 3 Top Words:
## Highest Prob: cn, enron, na, ou, recipients, notesaddr, john
## FREX: cn, recipients, notesaddr, recipients@enron.com, liz.taylor@enron.com, kinser, errigo
## Lift: @school, 0.25p, 030s, 0711rs@classic.msn.com, 07average, 07daily, 07enrononline
## Score: cn, recipients, ou, notesaddr, na, enron, non-privileged
## Topic 4 Top Words:
## Highest Prob: hou, ect@ect, enron@enron, corp, cc, subject, pm
## FREX: ect@ect, enron@enronxgate, communications@enron, taffy, gpgfin, 3801a, viverito
## Lift: _0c5551f0, _154d7df3, _global, _nextpart_000_0082_01c044e2, _scully@tdbank-us.tdsecurities.com, @bear.com, @marathon-com.com
## Score: ect@ect, hou, enron@enron, enron@enronxgate, ees@ees, enron_development@enron_development, corp
## Topic 5 Top Words:
## Highest Prob: power, energy, california, state, electricity, market, gas
## FREX: conservation, mseb, maharashtra, democrats, newswires, californians, ratepayers
## Lift: _a, _bill_hits_california, _both, _calpine, _critics, _et_al, _his
## Score: electricity, state's, california, generators, california's, utilities, blackouts
## Topic 6 Top Words:
## Highest Prob: enron, business, group, meeting, trading, risk, management
## FREX: quicklinks, mid-market, ehs, apachi, ooc, egep, metgas
## Lift: _95cf86a0, _really_, _very_, @media4, #96156-v1-r0010002_wood_ruling_, #sbm, 0.4mm
## Score: enron, ena, trading, global, risk, ebs, ews
## Topic 7 Top Words:
## Highest Prob: notes, text, content-type, mime-version, plain, charset, content-transfer-encoding
## FREX: mann-k, kmann.nsf, ccampbell@kslaw.com, mjones7@txu.com, ggreen2@txu.com, daren.j.farmer@enron.com, jkeffer@kslaw.com
## Lift: _______yes, ______no, _1d46d028.58393ad0, _1d46d028.59383bd1, _2d75a66a, _5f071567.5e3f112d, _633898b2
## Score: us-ascii, 7bit, x-bcc, x-cc, x-filename, x-folder, x-from
## Topic 8 Top Words:
## Highest Prob: pm, information, time, contact, email, scheduled, access
## FREX: backout, pep, isc, on-call, clickathome, espeak, xms
## Lift: _____yes, ____no, _nextpart_000_0011_01c0beb4, _nextpart_002_01bf8927.768c4fa0, _plans, _strogen@dom.com, _v1.0_1qxblr_2i.fv0
## Score: outages, backout, sat, pt, environments, password, ct
## Topic 9 Top Words:
## Highest Prob: vince, pm, subject, time, kaminski, good, omni
## FREX: omni, 3dhou, 3dect, vince.j.kaminski@enron.com, 3djohn, qb, updatedby
## Lift: ___kai___, _326474e3, _3e66d8ff, _nextpart_000_0005_01c09a9a.b862d5e0, _nextpart_000_0015_01c0bd4a, _nextpart_000_01bf94d7, _nextpart_000_0454_01bf7fc7
## Score: omni, 3dect, 3djohn, 3dhou, vince, omni_viewicon, omniappointmenttype
## Topic 10 Top Words:
## Highest Prob: gas, deal, subject, agreement, contract, cc, price
## FREX: dth, prebon, shipper, tco, amerex, cgas, mlokay.nsf
## Lift: ______________________________to___________________________________, _______and, __5__, _ash@dom.com, _oliverio@dom.com, _section, _units
## Score: deal, mmbtu, dth, deals, gas, ces, ena
These are rather noisy due to running such a small topic model. Instead of doing a deep dive into the topics, we will stick to a high-level analysis.
We’ll combine the topic data with our metadata on each email. This will allow us to examine various aspects of the emails, such as what each executive at Enron most frequently emailed about.
out$meta$Person_wrote <- factor(out$meta$Person_wrote)
out$meta$Person_wrote <- addNA(out$meta$Person_wrote)
doc_index = data.frame(document=names(out$documents))
doc_index = cbind(doc_index, out$meta)
doc_topics = data.frame(document=names(out$documents),
topic=1,
weight=topics$theta[,1])
for (i in 2:10) {
temp = data.frame(document=names(out$documents),
topic=i,
weight=topics$theta[,i])
doc_topics = rbind(doc_topics, temp)
}
doc_topics <- doc_topics %>% left_join(doc_index)
## Joining, by = "document"
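The loop above can equivalently be written as a single pivot. A sketch, assuming tidyr >= 1.1 for names_transform:
doc_topics_alt <- topics$theta %>%
  as.data.frame() %>%
  setNames(1:10) %>%
  mutate(document = names(out$documents)) %>%
  pivot_longer(-document, names_to = "topic", values_to = "weight",
               names_transform = list(topic = as.integer)) %>%
  left_join(doc_index)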
# Proportional topics (%)
# (Each row of theta already sums to 1, so this is a safeguard more than a transformation)
doc_topics <- doc_topics %>%
group_by(document) %>%
mutate(topic_prop = weight / sum(weight)) %>%
ungroup()
Our first analysis is topics by executive:
doc_topics %>%
group_by(Person_wrote, topic) %>%
mutate(topic_prop = mean(topic_prop)) %>%
slice(1) %>%
ungroup() %>%
ggplot(aes(x=factor(topic), y=topic_prop, fill=factor(topic))) +
geom_col() + facet_wrap(~Person_wrote) +
theme(axis.text.x=element_blank(),axis.ticks.x = element_blank())
Above, we see that the three executives talk about rather different topics than the rest of the employees at Enron. In particular, topic 6 features much more heavily for all three of them, and topic 9 a bit more for the CFO and COO. For the CEO, we also see a high concentration of topic 3.
Next, let’s see which topics contain our words of interest:
# tidytext quickly pulls out each word's per-topic weights
library(tidytext)
topics %>%
tidy() %>%
mutate(keyword = ifelse(term == 'jedi', 'jedi', ifelse(term == 'spv', 'spv', ifelse(term == 'chewco', 'chewco', 'other')))) %>%
group_by(keyword, topic) %>%
mutate(beta_mean = mean(beta)) %>%
slice(1) %>%
ungroup() %>%
ggplot(aes(x=factor(topic), y=beta_mean, fill=factor(topic))) +
geom_col() + facet_wrap(~keyword) +
theme(axis.ticks.x = element_blank())
Here, we see that two of the words we are most interested in are concentrated in topic 6, much like the discussion by the executives. This suggests that further investigation should dig deeper into the executives’ discussions, as they seem to discuss matters revolving around these SPVs much more than everyone else does.
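A natural follow-up, not run here, would be to test that impression formally. stm’s estimateEffect() regresses topic prevalence on document metadata, so we could check whether Person_wrote predicts topic 6. A sketch, assuming the factor (with its added NA level) is accepted as-is:
eff <- estimateEffect(c(6) ~ Person_wrote, topics, metadata = out$meta)
summary(eff)
plot(eff, covariate = "Person_wrote", topics = 6, method = "pointestimate")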