The data was originally encoded as a zip of folders (users) of folders (Outlook folders) of numbered text files (emails). A Python script was used to extract all metadata from the emails, clean the To and From fields, strip newlines out of the text, and write everything to a CSV.
library(tidyverse) # data wrangling and plotting
library(quanteda)  # text analysis
library(DT)        # interactive tables
library(stringr)   # string handling (also attached by tidyverse)
Here I will use a version of the Enron data taken straight from the US government’s dump of Enron’s Outlook files. There is also a dump available on Kaggle at this link, but it is not quite as clean to work with.
# Read in data
df <- read_csv('/media/Data/Data/Enron_Emails/enron_emails.csv')
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Mime-Version` = col_double()
## )
## See spec(...) for full column specifications.
# Convert date from default Outlook format to a proper date
df$Date <- as.Date(substr(df$Date,1, str_length(df$Date)-3), format="%a, %d %b %Y")
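# For illustration, the cleaning applied to a single made-up header value
# (a hypothetical example, not a row from the data):
raw_date <- "Wed, 18 Oct 2001 08:26:00 -0700 (PDT)"
as.Date(substr(raw_date, 1, str_length(raw_date) - 3), format = "%a, %d %b %Y")
# Should yield "2001-10-18" -- as.Date() ignores trailing text past the format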
# Trim dates -- some dates are malformed (leading to a missing year), and some
# are after the investigation started on October 19, 2001
df <- df %>% filter(Date > as.Date("1990-01-01"), Date < as.Date("2001-10-19"))
nrow(df)
## [1] 398836
head(as.data.frame(df %>% select(Date, From, To)))
## Date From
## 1 2001-10-18 exchange.administrator@enron.com
## 2 2001-10-16 continental_airlines_inc@coair.rsc01.com
## 3 2001-05-31 eddie.zhang@enron.com
## 4 2001-05-31 40enron@enron.com
## 5 2001-06-08 jason.jones@enron.com
## 6 2001-06-08 40enron@enron.com
## To
## 1 matt.smith@enron.com
## 2 matt.smith@enron.com
## 3 tara.piazze@enron.com, matt.smith@enron.com
## 4 <NA>
## 5 colin.tonks@enron.com, d.alexander@enron.com, lindon.chiu@enron.com,
## 6 <NA>
Next, let’s examine some of the more distinctive terms tied to Enron’s fraud, namely the names of some of its biggest Special Purpose Vehicles (SPVs); see this Guardian article. A list of terms to check:
df_spv <- df %>% filter(str_detect(str_to_lower(text), "spv"))
df_chewco <- df %>% filter(str_detect(str_to_lower(text), "chewco"))
df_jedi <- df %>% filter(str_detect(str_to_lower(text), "jedi"))
df_obi <- df %>% filter(str_detect(str_to_lower(text), "obi-1"))
Next, let’s look at excerpts from some of these emails.
For SPVs:
df_spv$text %>%
str_extract_all(regex("([^\\s]+\\s+){0,10}spv.*?(\\s+[^\\s]+){10}", multiline=TRUE, ignore_case=TRUE)) %>%
unlist() %>%
unique() %>%
as_tibble() %>%
DT::datatable()
For Chewco:
df_chewco$text %>%
str_extract_all(regex("([^\\s]+\\s+){0,10}chewco.*?(\\s+[^\\s]+){10}", multiline=TRUE, ignore_case=TRUE)) %>%
unlist() %>%
unique() %>%
as_tibble() %>%
DT::datatable()
For Jedi:
df_jedi$text %>%
str_extract_all(regex("([^\\s]+\\s+){0,10}jedi.*?(\\s+[^\\s]+){10}", multiline=TRUE, ignore_case=TRUE)) %>%
unlist() %>%
unique() %>%
as_tibble() %>%
DT::datatable()
There are no hits for Obi-1, however, so we will skip that SPV.
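As an aside, since the three excerpt pipelines above differ only in the keyword, we could wrap the pattern in a small helper. This is just a sketch -- keyword_context is a name introduced here, not a function from any package:
keyword_context <- function(texts, keyword, n_words = 10) {
  # Capture up to n_words words before the keyword and n_words words after it
  pattern <- regex(
    paste0("([^\\s]+\\s+){0,", n_words, "}", keyword, ".*?(\\s+[^\\s]+){", n_words, "}"),
    multiline = TRUE, ignore_case = TRUE)
  texts %>%
    str_extract_all(pattern) %>%
    unlist() %>%
    unique()
}
# e.g., keyword_context(df_jedi$text, "jedi") reproduces the Jedi excerpts above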
Next, we will take a look at who talked with whom at Enron, to understand a bit more about the nature of these emails. Before we do that, though, we need to understand the data’s structure: an email can be sent to multiple people at once, so we need to know how many recipients each field can hold before we can build a network.
# We need to understand how the network is structured first
maxFrom <- max(str_count(df$From, ","), na.rm=T)+1
maxTo <- max(str_count(df$To, ","), na.rm=T)+1
maxCc <- max(str_count(df$Cc, ","), na.rm=T)+1
maxBcc <- max(str_count(df$Bcc, ","), na.rm=T)+1
# Note: the combined count only considers emails where To, Cc, and Bcc are all non-missing
maxToAll <- max(str_count(df$Bcc, ",") + str_count(df$Cc, ",") + str_count(df$To, ","), na.rm=T)+3
print(paste0("From:", maxFrom, "; To:", maxTo, "; Cc:", maxCc, "; Bcc:", maxBcc, "; ToAll:", maxToAll))
## [1] "From:1; To:6; Cc:6; Bcc:5; ToAll:11"
For any given email there is only 1 sender (as it should be), but up to 6 recipients in the To field, 6 in the Cc field, 5 in the Bcc field, and up to 11 across all three fields combined. As such, we need a significant amount of preprocessing to get a usable structure out of this data.
library(tidyr) # for tidying the data
# Use separate() to build additional columns for those with multiple emails
# We will do this on a separate copy of df so as to keep a clean copy of df
# for later analysis
df2 <- df
df2 <- separate(df2, To, paste0("To.", 1:maxTo), sep=", ", fill="right")
df2 <- separate(df2, Cc, paste0("Cc.", 1:maxCc), sep=", ", fill="right")
df2 <- separate(df2, Bcc, paste0("Bcc.", 1:maxBcc), sep=", ", fill="right")
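To see what separate() is doing here, consider a two-row toy example (hypothetical addresses):
tibble(To = c("a@enron.com, b@enron.com", "c@enron.com")) %>%
  separate(To, paste0("To.", 1:2), sep = ", ", fill = "right")
# Row 1 splits across To.1 and To.2; row 2 gets To.2 = NA thanks to fill = "right"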
To visualize the network of emails, we will use the visNetwork R package, which creates interactive network visualizations.
library(visNetwork)
nodes <- df2 %>%
select(starts_with("Bcc."), starts_with("Cc."), starts_with("To."), From) %>%
sapply(unique) %>%
unlist() %>%
unique()
enron_nodes <- nodes[str_detect(nodes, "enron\\.com")]
print(paste0("There are ", length(nodes), " emails in the data, including ", length(enron_nodes), " from within Enron."))
## [1] "There are 31710 emails in the data, including 14564 from within Enron."
That’s a lot of addresses. We’ll have to trim this down later. For now, though, let’s build a full network graph. We have the nodes already, i.e., the email addresses involved. Now we need the edges: who talked to whom, and how many times. For visNetwork, that means a data frame with two columns, from and to (where to comes from a To, Cc, or Bcc column). We will also later count how many times each pair occurs. Note: !! is a shorthand trick to get dplyr to use dynamically created variable names.
# Build a full edge network
edges <- df2 %>%
select(From, To.1) %>%
rename(from=From, to=To.1)
for(var in c(paste0("To.", 2:maxTo), paste0("Cc.", 1:maxCc), paste0("Bcc.", 1:maxBcc))) {
temp <- df2 %>%
select(From, !!var) %>%
rename(from=From, to=!!var)
edges <- rbind(edges, temp)
}
# Drop any rows with NA -- ~85% of the data
edges <- edges[!is.na(edges$to) & !is.na(edges$from),]
print(paste0("There are ", nrow(edges), " in the data, of which ", nrow(unique(edges)), " are unique."))
## [1] "There are 792284 in the data, of which 95517 are unique."
Let’s start by combining identical (to, from) pairs. We’ll create a new variable, count, to keep track of the number of occurrences of each pair.
edges <- edges %>%
group_by(to, from) %>%
mutate(count=n()) %>%
slice(1) %>%
ungroup()
summary(edges$count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 8.295 6.000 8358.000
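Which pair accounts for that maximum of 8358? Sorting the edge list makes this easy to check:
edges %>%
  arrange(desc(count)) %>%
  select(from, to, count) %>%
  head(3)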
The most common is… Vince Kaminski emailing himself, 8358 times. Let’s also remove edges whose sender and receiver are identical, since a self-email isn’t communication between people. This will remove ~1000 edges.
edges <- edges[edges$to != edges$from,]
summary(edges$count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 8.139 6.000 5704.000
To visualize this network, we’ll need to significantly shrink this down. We’ll start by determining the most prolific emailers in the network.
nodes <- data.frame(id=nodes, title=nodes, stringsAsFactors = F)
nodes <- nodes %>%
left_join(edges[, c("to", "count")], by = c("id" = "to")) %>%
rename(to_count=count) %>%
group_by(id) %>%
mutate(to_count = sum(to_count, na.rm=T)) %>%
slice(1) %>%
ungroup() %>%
left_join(edges[, c("from", "count")], by = c("id" = "from")) %>%
rename(from_count=count) %>%
group_by(id) %>%
mutate(from_count = sum(from_count, na.rm=T)) %>%
slice(1) %>%
ungroup() %>%
mutate(count = to_count + from_count) # note: the sums above already handled NAs
nodes$key_person <- ifelse(nodes$id %in% c('kenneth.lay@enron.com', 'jeff.skilling@enron.com', 'andrew.fastow@enron.com'), 1, 0)
We’ll take the 3 executives in the data, along with the other most prolific accounts, up to 100 accounts total. We will also scale edge width by the number of emails between individuals, and group individuals by their roles within Enron. To limit the number of edges, we will require more than 10 emails for an edge to be plotted. I have also included 2 ex post groupings: 1 other key player in the fraud (Greg Whalley), and 1 internal whistleblower, who is none other than Vince Kaminski (the guy who was emailing himself so frequently).
nodes_f <- nodes %>%
arrange(-key_person, -count) %>%
head(n=100) %>%
mutate(group = ifelse(key_person == 1, "Executives",
ifelse(id %in% c('greg.whalley@enron.com'), "Other key player",
ifelse(id %in% c('vkaminski@aol.com','vince.kaminski@enron.com'), 'Internal whistleblower',
ifelse(str_detect(id,'enron'), 'Others in Enron', 'Not in Enron')))))
nodes_i <- nodes_f[,c("id")]
nodes_i$keep <- 1
edges_f <- edges %>%
left_join(nodes_i, by=c("to" = "id")) %>%
rename(keep_to=keep) %>%
left_join(nodes_i, by=c("from" = "id")) %>%
mutate(keep = keep * keep_to) %>%
filter(keep == 1) %>%
select(-keep, -keep_to) %>%
filter(count > 10) %>%
mutate(value = log(count) + 1) %>%
mutate(title = paste(count, "emails")) %>%
select(-count)
Now we can plot the network. Take a look at its structure, and at how significantly the relevant players’ networks overlap. Likewise, notice how frequently Vince Kaminski was in contact with those individuals, as well as with his own personal account (5,400 times).
visNetwork(nodes_f, edges_f) %>%
visEdges(arrows = "to") %>%
visGroups(groupname = "Executives", color = list(background = "#75A0DA", border = "#5590BA")) %>%
visGroups(groupname = "Other key player", color = list(background = "#97C2FC", border = "#77A2DC")) %>%
visGroups(groupname = "Internal whistleblower", color = list(background = "#FFFF7F", border = "#DDDD5D")) %>%
visGroups(groupname = "Others in Enron", color = list(background = "#FCC297", border = "#DCA277")) %>%
visGroups(groupname = "Not in Enron", color = list(background = "#C2FC97", border = "#A2DC77")) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = FALSE, selectedBy ="group") %>%
visLegend()
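Since visNetwork produces an htmlwidget, the interactive graph can also be saved as a standalone file for sharing (a sketch; the file name here is arbitrary):
net <- visNetwork(nodes_f, edges_f) # or the fully styled pipeline above
htmlwidgets::saveWidget(net, "enron_network.html", selfcontained = TRUE)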
Here, we’ll take a look at word usage. The first step is to create a document-feature matrix of terms.
corp <- corpus(df, text_field="text")
tdm <- dfm(corp,
stem=FALSE, # Not stemming, as specific word variants may be helpful here
remove=stopwords(source='smart'),
remove_punct=TRUE,
remove_numbers=TRUE) %>%
dfm_trim(min_termfreq=2, termfreq_type = "count") # 2 so as to keep rare words
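Note that the dfm() call above uses the older quanteda (v1/v2) interface. In quanteda 3 and later, the tokenization options moved to tokens(), so a roughly equivalent pipeline -- a sketch, untested against this exact corpus -- would be:
tdm <- corp %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stopwords(source = "smart")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 2, termfreq_type = "count")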
It is natural to suspect that someone at the top was in charge of the fraud. Let’s make indicators for the key people at Enron.
Kenneth Lay: user lay-k; email kenneth.lay@enron.com
Jeff Skilling: user skilling-j; email jeff.skilling@enron.com
Andrew Fastow: not in the system at the time of the data dump, sadly, but his email, andrew.fastow@enron.com, lives on in others’ inboxes
Note that the code below flags Lay as CEO, Skilling as CFO, and Fastow as COO; strictly speaking, Skilling was Enron’s CEO and Fastow its CFO, so treat the flag names as shorthand for the three men.
docvars(tdm, field="CEO_flag") <- ifelse(
tdm@docvars$From =='kenneth.lay@enron.com' | tdm@docvars$To =='kenneth.lay@enron.com' | tdm@docvars$user == 'lay-k',1,0)
docvars(tdm, field="CFO_flag") <- ifelse(
tdm@docvars$From =='jeff.skilling@enron.com' | tdm@docvars$To =='jeff.skilling@enron.com' | tdm@docvars$user == 'skilling-j',1,0)
docvars(tdm, field="COO_flag") <- ifelse(
tdm@docvars$From =='andrew.fastow@enron.com' | tdm@docvars$To =='andrew.fastow@enron.com',1,0)
docvars(tdm, field="Person_wrote") <- ifelse(
tdm@docvars$From =='kenneth.lay@enron.com','CEO',
ifelse(tdm@docvars$From =='jeff.skilling@enron.com','CFO',
ifelse(tdm@docvars$From =='andrew.fastow@enron.com','COO',"Other")))
topfeatures(tdm, n=20, groups = "Person_wrote")
## $Other
## cn enron hou ect@ect na subject
## 675114 657724 552031 483243 419856 325014
## ou recipients enron@enron pm cc power
## 321935 315444 288245 257599 254956 223492
## corp energy e-mail 3d http time
## 206838 190957 172216 142064 140057 135393
## gas forwarded
## 133902 130422
##
## $CEO
## cn enron recipients ou na notesaddr
## 12068 5489 5484 5419 5367 1083
## john eu david michael james mark
## 138 119 117 100 94 82
## jeff jr program scott chris kevin
## 78 73 70 60 56 53
## jennifer martin
## 51 50
##
## $CFO
## cn enron na ou
## 542 385 279 239
## recipients jeff enron@enron skilling
## 239 211 154 143
## corp subject cc ect@ect
## 126 110 93 88
## hou pm notesaddr survey
## 81 72 64 62
## company business enron@enronxgate lon
## 56 56 52 50
##
## $COO
## ljm standard mark andy jeff
## 3 2 2 2 2
## bidder deleted enron list organization
## 2 1 1 1 1
## company management back employees things
## 1 1 1 1 1
## contribution business hours put critical
## 1 1 1 1 1
tfidf_mat <- dfm_tfidf(tdm, base=2, scheme_tf="prop")
topfeatures(tfidf_mat, n=20, groups = "Person_wrote")
## $Other
## hou cn ect@ect enron pm attached
## 7957.879 7706.878 7382.475 5659.278 4785.167 4676.662
## call enron@enron na vince deal fyi
## 4637.330 4554.799 4343.520 4270.360 4238.473 4152.848
## kay time cc recipients corp ou
## 3943.646 3689.574 3670.498 3654.376 3626.297 3613.147
## good http
## 3610.848 3513.862
##
## $CEO
## rosie cn linda recipients misha ou
## 11.2095242 9.4462798 5.1205588 4.1603302 3.9637302 3.9546082
## na ken siegel enron likewise christie
## 2.5338161 2.3608597 1.7359353 1.7135776 1.6042161 1.3730036
## hasan notesaddr glad lay hear lot
## 1.2924286 1.2866592 1.0919749 1.0369263 0.9526306 0.8934625
## chance jamail
## 0.8875167 0.8762796
##
## $CFO
## skilling jeff xms test srs expense
## 15.300488 12.201614 7.498691 6.187320 5.403390 5.187414
## unable corp forward asap sherri number
## 5.009980 4.893210 4.795625 4.695302 4.677416 4.426123
## e.on enron@enron beard tully gold@ect enron
## 4.421360 4.127320 4.060765 3.908165 3.800270 3.487779
## stadium joannie
## 3.378286 3.316170
##
## $COO
## trapped flood inside things
## 2.8393771 2.5314140 1.7756511 1.1857481
## ljm bidder discern innumerable
## 1.0162918 0.5655244 0.4173776 0.4160590
## standard andy tier watched
## 0.3849606 0.3653586 0.3414658 0.3313931
## misunderstanding appropriately disagree inclusion
## 0.3215591 0.3198636 0.3191625 0.3083020
## disadvantage glisan ubs arms
## 0.3024561 0.2905278 0.2806483 0.2755009
Nothing too interesting yet, except perhaps the COO’s mentions of LJM, which stand out in light of this and this.
Note: due to the size of the text corpus, we will not attempt a properly specified LDA model, which would likely need more than 50 topics and would take well over a day to run. We will, however, take a quick look at a smaller 10-topic model, which still takes around 3 hours to run on a fast computer.
Next, we convert the document-feature matrix into the input format the STM algorithm expects.
# A bunch of emails had no content, like forwards and such. These get dropped here
out <- convert(tdm, to = 'stm')
# Lost 10140 emails
# Since the above took a while to run, we'll save a point here for replicating in the future
saveRDS(out, '/media/Data/Data/Enron_Emails/pre-LDA.rds') # 144.3MB
Then we run the STM algorithm.
library(stm)
topics <- stm(out$documents, out$vocab, K=10)
saveRDS(topics, '/media/Data/Data/Enron_Emails/LDA_topics.rds')
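If we wanted a more principled choice of K, stm also provides searchK(), which fits candidate models and compares held-out likelihood and other diagnostics. On a corpus this size it would be very expensive, so this is only a sketch:
k_search <- searchK(out$documents, out$vocab, K = c(10, 25, 50))
plot(k_search)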
Next, we’ll take a look at the topics.
library(stm)
## stm v1.3.3 (2018-1-26) successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
topics <- readRDS('/media/Data/Data/Enron_Emails/LDA_topics.rds')
out <- readRDS('/media/Data/Data/Enron_Emails/pre-LDA.rds')
labelTopics(topics)
## Topic 1 Top Words:
## Highest Prob: e-mail, pete.davis@enron.com, final, error, schedule, mark.guzman@enron.com, ryan.slinger@enron.com
## FREX: pete.davis@enron.com, jbryson@enron.com, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, bill.williams.iii@enron.com, dporter3@enron.com, craig.dean@enron.com, eric.linder@enron.com
## Lift: _316b0b4b, _5b01220d, _a5feb6fd.f190b7d2, _b7ec8e15.52335ec2, _d288d5b4, _nextpart_000_0021_01c0d23e, _nextpart_000_003d_01c04af0
## Score: pete.davis@enron.com, jbryson@enron.com, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, dbcaps97data, leaf.harasin@enron.com, mkt_type, trans_date
## Topic 2 Top Words:
## Highest Prob: 3d, http, image, company, million, online, free
## FREX: m.asp, font, fares, www.fool.com, specials, venturewire, href
## Lift: _________________________________________t, _______________________________________a, _______________________________________c, _______________________________________d, ______________________________________sponsored, _____________________________________m, _____________________________________sponsored
## Score: 3d, image, m.asp, www.fool.com, venturewire, http, imds
## Topic 3 Top Words:
## Highest Prob: cn, enron, na, ou, recipients, notesaddr, john
## FREX: cn, recipients, notesaddr, recipients@enron.com, liz.taylor@enron.com, kinser, errigo
## Lift: @school, 0.25p, 030s, 0711rs@classic.msn.com, 07average, 07daily, 07enrononline
## Score: cn, recipients, ou, notesaddr, na, enron, non-privileged
## Topic 4 Top Words:
## Highest Prob: hou, ect@ect, enron@enron, corp, cc, subject, pm
## FREX: ect@ect, enron@enronxgate, communications@enron, taffy, gpgfin, 3801a, viverito
## Lift: _0c5551f0, _154d7df3, _global, _nextpart_000_0082_01c044e2, _scully@tdbank-us.tdsecurities.com, @bear.com, @marathon-com.com
## Score: ect@ect, hou, enron@enron, enron@enronxgate, ees@ees, enron_development@enron_development, corp
## Topic 5 Top Words:
## Highest Prob: power, energy, california, state, electricity, market, gas
## FREX: conservation, mseb, maharashtra, democrats, newswires, californians, ratepayers
## Lift: _a, _bill_hits_california, _both, _calpine, _critics, _et_al, _his
## Score: electricity, state's, california, generators, california's, utilities, blackouts
## Topic 6 Top Words:
## Highest Prob: enron, business, group, meeting, trading, risk, management
## FREX: quicklinks, mid-market, ehs, apachi, ooc, egep, metgas
## Lift: _95cf86a0, _really_, _very_, @media4, #96156-v1-r0010002_wood_ruling_, #sbm, 0.4mm
## Score: enron, ena, trading, global, risk, ebs, ews
## Topic 7 Top Words:
## Highest Prob: notes, text, content-type, mime-version, plain, charset, content-transfer-encoding
## FREX: mann-k, kmann.nsf, ccampbell@kslaw.com, mjones7@txu.com, ggreen2@txu.com, daren.j.farmer@enron.com, jkeffer@kslaw.com
## Lift: _______yes, ______no, _1d46d028.58393ad0, _1d46d028.59383bd1, _2d75a66a, _5f071567.5e3f112d, _633898b2
## Score: us-ascii, 7bit, x-bcc, x-cc, x-filename, x-folder, x-from
## Topic 8 Top Words:
## Highest Prob: pm, information, time, contact, email, scheduled, access
## FREX: backout, pep, isc, on-call, clickathome, espeak, xms
## Lift: _____yes, ____no, _nextpart_000_0011_01c0beb4, _nextpart_002_01bf8927.768c4fa0, _plans, _strogen@dom.com, _v1.0_1qxblr_2i.fv0
## Score: outages, backout, sat, pt, environments, password, ct
## Topic 9 Top Words:
## Highest Prob: vince, pm, subject, time, kaminski, good, omni
## FREX: omni, 3dhou, 3dect, vince.j.kaminski@enron.com, 3djohn, qb, updatedby
## Lift: ___kai___, _326474e3, _3e66d8ff, _nextpart_000_0005_01c09a9a.b862d5e0, _nextpart_000_0015_01c0bd4a, _nextpart_000_01bf94d7, _nextpart_000_0454_01bf7fc7
## Score: omni, 3dect, 3djohn, 3dhou, vince, omni_viewicon, omniappointmenttype
## Topic 10 Top Words:
## Highest Prob: gas, deal, subject, agreement, contract, cc, price
## FREX: dth, prebon, shipper, tco, amerex, cgas, mlokay.nsf
## Lift: ______________________________to___________________________________, _______and, __5__, _ash@dom.com, _oliverio@dom.com, _section, _units
## Score: deal, mmbtu, dth, deals, gas, ces, ena
These are rather noisy due to running such a small topic model. Instead of doing a deep dive into the topics, we will stick to a high-level analysis.
We’ll combine the topic data with our metadata on each email. This will allow us to examine various aspects of the emails, such as what each executive at Enron most frequently emailed about.
out$meta$Person_wrote <- factor(out$meta$Person_wrote)
out$meta$Person_wrote <- addNA(out$meta$Person_wrote)
doc_index = data.frame(document=names(out$documents))
doc_index = cbind(doc_index, out$meta)
doc_topics = data.frame(document=names(out$documents),
topic=1,
weight=topics$theta[,1])
for (i in 2:10) {
temp = data.frame(document=names(out$documents),
topic=i,
weight=topics$theta[,i])
doc_topics = rbind(doc_topics, temp)
}
doc_topics <- doc_topics %>% left_join(doc_index)
## Joining, by = "document"
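The loop above can equivalently be written as a single pivot. A sketch, assuming tidyr >= 1.1 for names_transform:
doc_topics_alt <- topics$theta %>%
  as.data.frame() %>%
  setNames(1:10) %>%
  mutate(document = names(out$documents)) %>%
  pivot_longer(-document, names_to = "topic", values_to = "weight",
               names_transform = list(topic = as.integer)) %>%
  left_join(doc_index)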
# Proportional topics (%)
# (Each row of theta already sums to 1, so this is a safeguard more than a transformation)
doc_topics <- doc_topics %>%
group_by(document) %>%
mutate(topic_prop = weight / sum(weight)) %>%
ungroup()
Our first analysis is topics by executive:
doc_topics %>%
group_by(Person_wrote, topic) %>%
mutate(topic_prop = mean(topic_prop)) %>%
slice(1) %>%
ungroup() %>%
ggplot(aes(x=factor(topic), y=topic_prop, fill=factor(topic))) +
geom_col() + facet_wrap(~Person_wrote) +
theme(axis.text.x=element_blank(),axis.ticks.x = element_blank())
Above, we see that the three executives talk about rather different topics than the rest of the employees at Enron. In particular, topic 6 features much more heavily for all three of them, and topic 9 a bit more for the CFO and COO. For the CEO, we also see a high concentration of topic 3.
Next, let’s see which topics contain our words of interest:
# tidytext quickly pulls out each word's per-topic weights
library(tidytext)
topics %>%
tidy() %>%
mutate(keyword = ifelse(term == 'jedi', 'jedi', ifelse(term == 'spv', 'spv', ifelse(term == 'chewco', 'chewco', 'other')))) %>%
group_by(keyword, topic) %>%
mutate(beta_mean = mean(beta)) %>%
slice(1) %>%
ungroup() %>%
ggplot(aes(x=factor(topic), y=beta_mean, fill=factor(topic))) +
geom_col() + facet_wrap(~keyword) +
theme(axis.ticks.x = element_blank())
Here, we see that two of the words we are most interested in are concentrated in topic 6, much like the discussion by the executives. This suggests that further investigation should dig deeper into the executives’ discussions, as they seem to discuss matters revolving around these SPVs much more than everyone else does.
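A natural follow-up, not run here, would be to test that impression formally. stm’s estimateEffect() regresses topic prevalence on document metadata, so we could check whether Person_wrote predicts topic 6. A sketch, assuming the factor (with its added NA level) is accepted as-is:
eff <- estimateEffect(c(6) ~ Person_wrote, topics, metadata = out$meta)
summary(eff)
plot(eff, covariate = "Person_wrote", topics = 6, method = "pointestimate")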