A Delicious Analysis! (aka topic modelling using recipes)

A few months ago, I saw a link on Twitter to an awesome graph charting the similarities of different foods based on their flavour compounds, in addition to their prevalence in recipes (see the whole study, The Flavor Network and the Principles of Food Pairing).  I thought this was really neat and became interested in potentially using the data for something slightly different: to figure out which ingredients tended to correlate across recipes.  I emailed one of the authors, Yong-Yeol Ahn, who is a real mensch by the way, and he let me know that the raw recipe data is readily available on his website!

Given my goal of looking for which ingredients correlate across recipes, I figured this would be the perfect opportunity to use topic modelling (here I use Latent Dirichlet Allocation or LDA).  Usually in topic modelling you have a lot of filtering to do.  Not so with these recipe data, where all the words (ingredients) involved in the corpus are of potential interest, and there aren’t even any punctuation marks!  The topics coming out of the analysis would represent clusters of ingredients that co-occur with one another across recipes, and would possibly teach me something about cooking (of which I know precious little!).

All my code is at the bottom, so all you’ll find up here are graphs and my textual summary.  The first thing I did was to put the 3 raw recipe files together using python.  Each file consisted of one recipe per line, with the cuisine of the recipe as the first entry on the line, and all other entries (the ingredients) separated by tab characters.  In my python script, I separated out the cuisines from the ingredients, and created two files, one for the recipes, and one for the cuisines of the recipes.

Then I loaded up the recipes into R and got word/ingredient counts.  As you can see below, the 3 most popular ingredients were egg, wheat, and butter.  It makes sense, considering the fact that roughly 70% of all the recipes fall under the “American” cuisine.  I did this analysis for novelty’s sake, and so I figured I would take those ingredients out of the running before I continued on.  Egg makes me fart, wheat is not something I have at home in its raw form, and butter isn’t important to me for the purpose of this analysis!

Recipe Popularity of Top 30 Ingredients

Here are the top ingredients without the three filtered out ones:

Recipe Popularity of Top 30 Ingredients - No Egg Wheat or Butter

Finally, I ran the LDA, extracting 50 topics, and the top 5 most characteristic ingredients of each topic.  You can see the full complement of topics at the bottom of my post, but I thought I’d review some that I find intriguing.  You will, of course, find other topics intriguing, or some to be bizarre and inappropriate (feel free to tell me in the comment section).  First, topic 4:

[1] "tomato"  "garlic"  "oregano" "onion"   "basil"

Here’s a cluster of ingredients that seems decidedly Italian.  The ingredients seem to make perfect sense together, and so I think I’ll try them together next time I’m making pasta (although I don’t like tomatoes in their original form, just tomato sauce).

Next, topic 19:

[1] "vanilla" "cream"   "almond"  "coconut" "oat"

This one caught my attention, and I’m curious whether the ingredients even make sense together.  Vanilla and cream make sense… Adding coconut would seem to make sense as well.  Almond would give it that extra crunch (unless it’s almond milk!).  I don’t know whether it would be tasty, however, so I’ll probably pass this one by.

Next, topic 20:

[1] "onion"         "black_pepper"  "vegetable_oil" "bell_pepper"   "garlic"

This one looks tasty!  I like spicy foods and so putting black pepper in with onion, garlic and bell pepper sounds fun to me!

Next, topic 23:

[1] "vegetable_oil" "soy_sauce"     "sesame_oil"    "fish"          "chicken"

Now we’re into the meaty zone!  I’m all for putting sauces/oils onto meats, but putting vegetable oil, soy sauce, and sesame oil together does seem like overkill.  I wonder whether soy sauce shows up with vegetable oil or sesame oil separately in recipes, rather than linking them all together in the same recipes.  I’ve always liked the extra salty flavour of soy sauce, even though I know it’s horrible for you as it has MSG in it.  I wonder what vegetable oil, soy sauce, and chicken would taste like.  Something to try, for sure!
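
Out of curiosity, here is how one could check that (just a sketch, not part of the original analysis): assuming the recipes.m ingredient matrix built in the code at the bottom of this post is in memory, count how often soy sauce lands in the same recipe as each oil.

# Sketch only: recipes.m is the recipe-by-ingredient matrix created in the
# code at the bottom of this post (recipes.m = as.matrix(recipes.dtm.filtered))
soy = recipes.m[, "soy_sauce"] > 0
veg = recipes.m[, "vegetable_oil"] > 0
ses = recipes.m[, "sesame_oil"] > 0
sum(soy & veg & ses)   # recipes with all three together
sum(soy & veg & !ses)  # soy sauce with vegetable oil but not sesame oil
sum(soy & ses & !veg)  # soy sauce with sesame oil but not vegetable oil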

Now, topic 26:

[1] "cumin"      "coriander"  "turmeric"   "fenugreek"  "lemongrass"

These are a whole lot of spices that I never use on my food.  Not for lack of wanting, but rather out of ignorance and laziness.  One of my co-workers recently commented that cumin adds a really nice flavour to food (I think she called it “Middle Eastern”).  I’ve never heard a thing about the other spices here, but why not try them out!

Next, topic 28:

[1] "onion"       "vinegar"     "garlic"      "lemon_juice" "ginger"

I tend to find that anything with an intense flavour can be very appetizing for me.  Spices, vinegar, and anything citrusy are what really register on my tongue.  So, this topic does look very interesting to me, probably as a topping or a sauce.  It’s interesting that ginger shows up here, as it neutralizes other flavours, so I wonder whether I’d include it in any sauce that I make.

Last one!  Topic 41:

[1] "vanilla"  "cocoa"    "milk"     "cinnamon" "walnut"

These look like the kinds of ingredients for a nice drink of some sort (would you crush the walnuts?  I’m not sure!)
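
If I wanted to pull out actual recipes that fall mostly under this topic and peek at their ingredient lists, the posterior() function from the topicmodels package (the same approach used in the Enron analyses below) would do it.  A rough sketch, assuming recipes.lda and recipes.new from the code at the bottom of this post:

# Sketch only: score every recipe against every topic, then peek at a few
# recipes that are dominated by topic 41
recipe.topic.scores = posterior(recipes.lda)$topics
topic41.ids = rownames(recipe.topic.scores)[recipe.topic.scores[,41] > .95]
# The row names are the document ids, which line up with the lines of
# 'recipes combined.tsv' that were read into recipes.new
head(recipes.new[as.numeric(topic41.ids)], 3)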

Well, I hope you enjoyed this as much as I did!  It’s not a perfect analysis, but it definitely is a delicious one 🙂  Again, feel free to leave any comments about any of the ingredient combinations, or questions that you think could be answered with a different analysis!


import os
import re
rfiles = os.listdir('.')
rc = []
for f in rfiles:
    if '.txt' in f:
        # The recipes come in 3 txt files consisting of 1 recipe per line, the
        # cuisine of the recipe as the first entry in the line, and all subsequent
        # ingredient entries separated by a tab
        infile = open(f, 'r')
        rc.append(infile.read())
        infile.close()
all_rs = '\n'.join(rc)
line_pat = re.compile('[A-Za-z]+\t.+\n')
recipe_lines = line_pat.findall(all_rs)
new_recipe_lines = []
cuisine_lines = []
for n,r in enumerate(recipe_lines):
    # First we find the cuisine of the recipe
    cuisine = r[:r.find('\t')]
    # Then we append the ingredients without the cuisine
    # (stripping only the first occurrence, in case an ingredient shares its name)
    new_recipe_lines.append(recipe_lines[n].replace(cuisine, '', 1))
    # I saved the cuisines to a different list in case I want to do some
    # cuisine analysis later
    cuisine_lines.append(cuisine + '\n')
outfile1 = open('recipes combined.tsv', 'w')
outfile1.write(''.join(new_recipe_lines))
outfile1.close()
outfile2 = open('cuisines.csv', 'w')
outfile2.write(''.join(cuisine_lines))
outfile2.close()


library(tm)
library(topicmodels)
recipes = readLines('recipes combined.tsv')
# Once I read it into R, I have to get rid of the \t
# characters so that it's more acceptable to the tm package
recipes.new = apply(as.matrix(recipes), 1, function (x) gsub('\t',' ', x))
recipes.corpus = Corpus(VectorSource(recipes.new))
recipes.dtm = DocumentTermMatrix(recipes.corpus)
# Now I filter out any terms that have shown up in less than 10 documents
recipes.dict = Dictionary(findFreqTerms(recipes.dtm,10))
recipes.dtm.filtered = DocumentTermMatrix(recipes.corpus, list(dictionary = recipes.dict))
# Here I get a count of number of ingredients in each document
# with the intent of deleting any documents with 0 ingredients
ingredient.counts = apply(recipes.dtm.filtered, 1, function (x) sum(x))
recipes.dtm.filtered = recipes.dtm.filtered[ingredient.counts > 0,]
# Here i get some simple ingredient frequencies so that I can plot them and decide
# which I'd like to filter out
recipes.m = as.matrix(recipes.dtm.filtered)
popularity.of.ingredients = sort(colSums(recipes.m), decreasing=TRUE)
popularity.of.ingredients = data.frame(ingredients = names(popularity.of.ingredients), num_recipes=popularity.of.ingredients)
popularity.of.ingredients$ingredients = reorder(popularity.of.ingredients$ingredients, popularity.of.ingredients$num_recipes)
library(ggplot2)
ggplot(popularity.of.ingredients[1:30,], aes(x=ingredients, y=num_recipes)) + geom_point(size=5, colour="red") + coord_flip() +
ggtitle("Recipe Popularity of Top 30 Ingredients") +
theme(axis.text.x=element_text(size=13,face="bold", colour="black"), axis.text.y=element_text(size=13,colour="black",
face="bold"), axis.title.x=element_text(size=14, face="bold"), axis.title.y=element_text(size=14,face="bold"),
plot.title=element_text(size=24,face="bold"))
# Having found wheat, egg, and butter to be the three most frequent ingredients
# (and not caring too much about them as ingredients in general) I remove them
# from the corpus and rebuild the document term matrix
recipes.corpus = tm_map(recipes.corpus, removeWords, c("wheat","egg","butter"))
recipes.dtm.final = DocumentTermMatrix(recipes.corpus, list(dictionary = recipes.dict))
# Drop any recipes left with zero ingredients after removing those three
recipes.dtm.final = recipes.dtm.final[rowSums(as.matrix(recipes.dtm.final)) > 0,]
# Finally, I run the LDA on the final matrix and extract the 5 most
# characteristic ingredients in each topic… yummy!
recipes.lda = LDA(recipes.dtm.final, 50)
t = terms(recipes.lda,5)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9
[1,] "onion" "pepper" "milk" "tomato" "olive_oil" "milk" "milk" "tomato" "garlic"
[2,] "rice" "vinegar" "vanilla" "garlic" "garlic" "nutmeg" "pepper" "cayenne" "cream"
[3,] "cayenne" "onion" "cocoa" "oregano" "onion" "vanilla" "yeast" "olive_oil" "vegetable_oil"
[4,] "chicken_broth" "tomato" "onion" "onion" "black_pepper" "cinnamon" "potato" "garlic" "pepper"
[5,] "olive_oil" "milk" "cane_molasses" "basil" "vinegar" "cream" "lemon_juice" "pepper" "milk"
Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16 Topic 17
[1,] "milk" "soy_sauce" "vegetable_oil" "onion" "milk" "tamarind" "milk" "vegetable_oil"
[2,] "cream" "scallion" "milk" "black_pepper" "cinnamon" "onion" "vanilla" "pepper"
[3,] "vanilla" "sesame_oil" "pepper" "vinegar" "onion" "garlic" "cream" "cream"
[4,] "cane_molasses" "cane_molasses" "cane_molasses" "bell_pepper" "cayenne" "corn" "vegetable_oil" "black_pepper"
[5,] "cinnamon" "roasted_sesame_seed" "cinnamon" "bacon" "olive_oil" "vinegar" "garlic" "mustard"
Topic 18 Topic 19 Topic 20 Topic 21 Topic 22 Topic 23 Topic 24 Topic 25 Topic 26
[1,] "cane_molasses" "vanilla" "onion" "garlic" "onion" "vegetable_oil" "onion" "cream" "cumin"
[2,] "onion" "cream" "black_pepper" "cane_molasses" "garlic" "soy_sauce" "garlic" "tomato" "coriander"
[3,] "vinegar" "almond" "vegetable_oil" "vinegar" "tomato" "sesame_oil" "cane_molasses" "chicken" "turmeric"
[4,] "olive_oil" "coconut" "bell_pepper" "black_pepper" "olive_oil" "fish" "tomato" "lemon_juice" "fenugreek"
[5,] "pepper" "oat" "garlic" "soy_sauce" "basil" "chicken" "vegetable_oil" "black_pepper" "lemongrass"
Topic 27 Topic 28 Topic 29 Topic 30 Topic 31 Topic 32 Topic 33 Topic 34 Topic 35
[1,] "onion" "onion" "onion" "onion" "vanilla" "garlic" "onion" "onion" "garlic"
[2,] "garlic" "vinegar" "celery" "pepper" "milk" "onion" "pepper" "garlic" "basil"
[3,] "black_pepper" "garlic" "chicken" "garlic" "garlic" "vegetable_oil" "garlic" "vegetable_oil" "pepper"
[4,] "tomato" "lemon_juice" "vegetable_oil" "parsley" "cinnamon" "cayenne" "black_pepper" "black_pepper" "tomato"
[5,] "olive_oil" "ginger" "carrot" "olive_oil" "cream" "beef" "beef" "chicken" "olive_oil"
Topic 36 Topic 37 Topic 38 Topic 39 Topic 40 Topic 41 Topic 42 Topic 43 Topic 44
[1,] "onion" "onion" "onion" "cayenne" "garlic" "vanilla" "vanilla" "scallion" "milk"
[2,] "garlic" "garlic" "cream" "garlic" "onion" "cocoa" "cane_molasses" "garlic" "tomato"
[3,] "cayenne" "black_pepper" "tomato" "ginger" "bell_pepper" "milk" "cocoa" "ginger" "garlic"
[4,] "vegetable_oil" "lemon_juice" "cane_molasses" "rice" "olive_oil" "cinnamon" "oat" "soybean" "vegetable_oil"
[5,] "oregano" "scallion" "milk" "onion" "milk" "walnut" "milk" "pepper" "cream"
Topic 45 Topic 46 Topic 47 Topic 48 Topic 49 Topic 50
[1,] "onion" "cream" "pepper" "cream" "milk" "olive_oil"
[2,] "cream" "black_pepper" "vegetable_oil" "tomato" "vanilla" "tomato"
[3,] "black_pepper" "chicken_broth" "garlic" "beef" "lard" "parmesan_cheese"
[4,] "milk" "vegetable_oil" "onion" "garlic" "cocoa" "lemon_juice"
[5,] "cinnamon" "garlic" "olive_oil" "carrot" "cane_molasses" "garlic"

Enron Email Corpus Topic Model Analysis Part 2 – This Time with Better regex

After posting my analysis of the Enron email corpus, I realized that the regex patterns I set up to capture and filter out the cautionary/privacy messages at the bottoms of people’s emails were not working.  Let’s have a look at my revised python code for processing the corpus:


docs = []
from os import listdir, chdir
import re
# Here's the section where I try to filter useless stuff out.
# Notice near the end all of the regex patterns where I've called
# "re.DOTALL". This is pretty key here. What it means is that the
# .+ I have referenced within the regex pattern should be able to
# pick up alphanumeric characters, in addition to newline characters
# (\n). Since I did not have this in the first version, the cautionary/
# privacy messages people were pasting at the ends of their emails
# were not getting filtered out and were being entered into the
# LDA analysis, putting noise in the topics that were modelled.
email_pat = re.compile(".+@.+")
to_pat = re.compile("To:.+\n")
cc_pat = re.compile("cc:.+\n")
subject_pat = re.compile("Subject:.+\n")
from_pat = re.compile("From:.+\n")
sent_pat = re.compile("Sent:.+\n")
received_pat = re.compile("Received:.+\n")
ctype_pat = re.compile("Content-Type:.+\n")
reply_pat = re.compile("Reply- Organization:.+\n")
date_pat = re.compile("Date:.+\n")
xmail_pat = re.compile("X-Mailer:.+\n")
mimver_pat = re.compile("MIME-Version:.+\n")
dash_pat = re.compile("--+.+--+", re.DOTALL)
star_pat = re.compile('\*\*+.+\*\*+', re.DOTALL)
uscore_pat = re.compile(" __+.+__+", re.DOTALL)
equals_pat = re.compile("==+.+==+", re.DOTALL)
# (the below is the same note as before)
# The enron emails are in 151 directories, one for each senior management
# employee whose email account was entered into the dataset.
# The task here is to go into each folder, and enter each
# email text file into one long nested list.
# I've used readlines() to read in the emails because read()
# didn't seem to work with these email files.
chdir("/home/inkhorn/enron")
names = [d for d in listdir(".") if "." not in d]
for name in names:
    chdir("/home/inkhorn/enron/%s" % name)
    subfolders = listdir('.')
    sent_dirs = [n for n, sf in enumerate(subfolders) if "sent" in sf]
    sent_dirs_words = [subfolders[i] for i in sent_dirs]
    for d in sent_dirs_words:
        chdir('/home/inkhorn/enron/%s/%s' % (name,d))
        file_list = listdir('.')
        docs.append([" ".join(open(f, 'r').readlines()) for f in file_list if "." in f])
# (the below is the same note as before)
# Here i go into each email from each employee, try to filter out all the useless stuff,
# then paste the email into one long flat list. This is probably inefficient, but oh well – python
# is pretty fast anyway!
docs_final = []
for subfolder in docs:
    for email in subfolder:
        if ".nsf" in email:
            etype = ".nsf"
        elif ".pst" in email:
            etype = ".pst"
        email_new = email[email.find(etype)+4:]
        email_new = to_pat.sub('', email_new)
        email_new = cc_pat.sub('', email_new)
        email_new = subject_pat.sub('', email_new)
        email_new = from_pat.sub('', email_new)
        email_new = sent_pat.sub('', email_new)
        email_new = received_pat.sub('', email_new)
        email_new = email_pat.sub('', email_new)
        email_new = ctype_pat.sub('', email_new)
        email_new = reply_pat.sub('', email_new)
        email_new = date_pat.sub('', email_new)
        email_new = xmail_pat.sub('', email_new)
        email_new = mimver_pat.sub('', email_new)
        email_new = dash_pat.sub('', email_new)
        email_new = star_pat.sub('', email_new)
        email_new = uscore_pat.sub('', email_new)
        email_new = equals_pat.sub('', email_new)
        docs_final.append(email_new)
# (the below is the same note as before)
# Here I proceed to dump each and every email into about 126 thousand separate
# txt files in a newly created 'data' directory. This gets it ready for entry into a Corpus using the tm (textmining)
# package from R.
for n, doc in enumerate(docs_final):
    outfile = open("/home/inkhorn/enron/data/%s.txt" % n,'w')
    outfile.write(doc)
    outfile.close()

As I did not change the R code since the last post, let’s have a look at the results:

terms(lda.model,20)
      Topic 1   Topic 2   Topic 3     Topic 4   
 [1,] "enron"   "time"    "pleas"     "deal"    
 [2,] "busi"    "thank"   "thank"     "gas"     
 [3,] "manag"   "day"     "attach"    "price"   
 [4,] "meet"    "dont"    "email"     "contract"
 [5,] "market"  "call"    "enron"     "power"   
 [6,] "compani" "week"    "agreement" "market"  
 [7,] "vinc"    "look"    "fax"       "chang"   
 [8,] "report"  "talk"    "call"      "rate"    
 [9,] "time"    "hope"    "copi"      "trade"   
[10,] "energi"  "ill"     "file"      "day"     
[11,] "inform"  "tri"     "messag"    "month"   
[12,] "pleas"   "bit"     "inform"    "compani" 
[13,] "trade"   "guy"     "phone"     "energi"  
[14,] "risk"    "night"   "send"      "transact"
[15,] "discuss" "friday"  "corp"      "product" 
[16,] "regard"  "weekend" "kay"       "term"    
[17,] "team"    "love"    "review"    "custom"  
[18,] "plan"    "item"    "receiv"    "cost"    
[19,] "servic"  "email"   "question"  "thank"   
[20,] "offic"   "peopl"   "draft"     "purchas"

One at a time, I will try to interpret what each topic is trying to describe:

  1. This one appears to be a business process topic, containing a lot of general business terms, with a few even relating to meetings.
  2. Similar to the last model that I derived, this topic has a lot of time related words in it such as: time, day, week, night, friday, weekend.  I’ll be interested to see if this is another business meeting/interview/social meeting topic, or whether it describes something more social.
  3. Hrm, this topic seems to contain a lot of general terms used when we talk about communication: email, agreement, fax, call, message, inform, phone, send, review, question.  It even has please and thank you!  I suppose it’s very formal and you could perhaps interpret this as professional sounding administrative emails.
  4. This topic seems to be another case of emails containing a lot of ‘shop talk’.

Okay, let’s see if we can find some examples for each topic:

sample(which(df.emails.topics$"1" > .95),3)
[1] 27771 45197 27597

enron[[27771]]

 Christi's call.
 
  
     
 
 	Christi has asked me to schedule the above meeting/conference call.  September 11th (p.m.) seems to be the best date.  Question:  Does this meeting need to be a 1/2 day meeting?  Christi and I were wondering.
 
 	Give us your thoughts.

Yup, business process, meeting. This email fits the bill! Next!

enron[[45197]]

 
 Bob, 
 
 I didn't check voice mail until this morning (I don't have a blinking light.  
 The assistants pick up our lines and amtel us when voice mails have been 
 left.)  Anyway, with the uncertainty of the future business under the Texas 
 Desk, the following are my goals for the next six months:
 
 1)  Ensure a smooth transition of HPL to AEP, with minimal upsets to Texas 
 business.
 2)  Develop operations processes and controls for the new Texas Desk.   
 3)  Develop a replacement
  a.  Strong push to improve Liz (if she remains with Enron and )
  b.  Hire new person, internally or externally
 4)  Assist in develop a strong logisitcs team.  With the new business, we 
 will need strong performers who know and accept their responsibilites.
 
 1 and 2 are open-ended.  How I accomplish these goals and what they entail 
 will depend how the Texas Desk (if we have one) is set up and what type of 
 activity the desk will be invovled in, which is unknown to me at this time.  
 I'm sure as we get further into the finalization of the sale, additional and 
 possibly more urgent goals will develop.  So, in short, who knows what I need 
 to do.
 
 D

This one also seems to fit the bill. “D” here is writing about his/her goals for the next six months and considers briefly how to accomplish them. Not heavy into the content of the business, so I’m happy here. On to topic 2:

sample(which(df.emails.topics$"2" > .95),3)
[1] 50356 22651 19259

enron[[50356]]

I agree it is Matt, and  I believe he has reviewed this tax stuff (or at 
 least other turbine K's) before.  His concern will be us getting some amount 
 of advance notice before title transfer (ie, delivery).  Obviously, he might 
 have some other comments as well.  I'm happy to send him the latest, or maybe 
 he can access the site?
 
 Kay
 
 
    
  
 Given that the present form of GE world hunger seems to be more domestic than 
 international it would appear that Matt Gockerman would be a good choice for 
 the Enron- GE tax discussion.  Do you want to contact him or do you want me 
 to.   I would be interested in listening in on the conversation for 
 continuity. 

Here, the conversants seem to be talking about having a phone conversation with “Matt” to get his ideas on a tax discussion. This fits in with the meeting theme. Next!

enron[[22651]]

 LOVE
 HONEY PIE

Well, that was pretty social, wasn’t it? 🙂 Okay one more from the same topic:

enron[[19259]]

  Mime-Version: 1.0
  Content-Transfer-Encoding: 7bit
 X- X- X- X-b X-Folder: \ExMerge - Giron, Darron C.\Sent Items
 X-Origin: GIRON-D
 X-FileName: darron giron 6-26-02.PST
 
 Sorry.  I've got a UBS meeting all day.  Catch you later.  I was looking forward to the conversation.
 
 DG
 
  
     
 It seems everyone agreed to Ninfa's.  Let's meet at 11:45; let me know if a
 different time is better.  Ninfa's is located in the tunnel under the JP
 Morgan Chase Tower at 600 Travis.  See you there.
 
 Schroeder

Woops, header info that I didn’t manage to filter out :(. Anyway, DG writes about an impending conversation, and Schroeder writes about a specific time for their meeting. This fits! Next topic!

sample(which(df.emails.topics$"3" > .95),3)
[1] 24147 51673 29717

enron[[24147]]

Kaye:  Can you please email the prior report to me?  Thanks.
 
 Sara Shackleton
 Enron North America Corp.
 1400 Smith Street, EB 3801a
 Houston, Texas  77002
 713-853-5620 (phone)
 713-646-3490 (fax)


 	04/10/2001 05:56 PM
 			  		  
 
 At Alan's request, please provide to me by e-mail (with a  Thursday of this week your suggested changes to the March 2001 Monthly 
 Report, so that we can issue the April 2001 Monthly Report by the end of this 
 week.  Thanks for your attention to this matter.
 
 Nita

This one definitely fits in with the professional sounding administrative emails interpretation. Emailing reports and such. Next!

 I believe this was intended for Susan Scott with ETS...I'm with Nat Gas trading.
 
 Thanks
 
 
 
     
 FYI...another executed capacity transaction on EOL for Transwestern.
 
  
     
 This message is to confirm your EOL transaction with Transwestern Pipeline Company.
 You have successfully acquired the package(s) listed below.  If you have questions or
 concerns regarding the transaction(s), please call Michelle Lokay at (713) 345-7932
 prior to placing your nominations for these volumes.
 
 Product No.:	39096
 Time Stamp:	3/27/01	09:03:47 am
 Product Name:	US PLCapTW Frm CenPool-OasisBlock16
  
 Shipper Name:  E Prime, Inc.
 
 Volume:	10,000 Dth/d  
 					
 Rate:	$0.0500 /dth 1-part rate (combined  Res + Com) 100% Load Factor
 		+ applicable fuel and unaccounted for
 	
 TW K#: 27548		
 
 Effective  
 Points:	RP- (POI# 58649)  Central Pool      10,000 Dth/d
 		DP- (POI# 8516)   Oasis Block 16  10,000 Dth/d
 
 Alternate Point(s):  NONE
 
 
 Note:     	In order to place a nomination with this agreement, you must log 
 	            	off the TW system and then log back on.  This action will update
 	            	the agreement's information on your PC and allow you to place
 		nominations under the agreement number shown above.
 
 Contact Info:		Michelle Lokay
 	 			Phone (713) 345-7932
               			Fax       (713) 646-8000

Rather long, but even the short part at the beginning falls under the right category for this topic! Okay, let’s look at the final topic:

sample(which(df.emails.topics$"4" > .95),3)
[1] 39100  31681  6427

enron[[39100]]

 Randy, your proposal is fine by me.  Jim

Hrm, this is supposed to be a ‘business content’ topic, so I suppose I can see why this email was classified as such. It doesn’t take long to go from ‘proposal’ to ‘contract’ if you free associate, right? Next!

enron[[31681]]

 Attached is the latest version of the Wildhorse Entrada Letter.  Please 
 review.  I reviewed the letter with Jim Osborne and Ken Krisa yesterday and 
 should get their comments today.  My plan is to Fedex to Midland for Ken's 
 signature tomorrow morning and from there it will got to Wildhorse.  

This one makes me feel a little better, referencing a specific business letter that the emailer probably wants the emailed person to see. Let’s find one more for good luck:

enron[[6427]]

 At a ratio of 10:1, you should have your 4th one signed and have the fifth 
 one on the way...
 
  	09/19/2000 05:40 PM
   		  		  
 ONLY 450!  Why, I thought you guys hit 450 a long time ago.
 
 Marie Heard
 Senior Legal Specialist
 Enron Broadband Services
 Phone:  (713) 853-3907
 Fax:  (713) 646-8537

 	09/19/00 05:34 PM
		  		  		  
 Well, I do believe this makes 450!  A nice round number if I do say so myself!

 	Susan Bailey
 	09/19/2000 05:30 PM

 We have received an executed Master Agreement:
 
 
 Type of Contract:  ISDA Master Agreement (Multicurrency-Cross Border)
 
 Effective  
 Enron Entity:   Enron North America Corp.
 
 Counterparty:   Arizona Public Service Company
 
 Transactions Covered:  Approved for all products with the exception of: 
 Weather
           Foreign Exchange
           Pulp & Paper
 
 Special Note:  The Counterparty has three (3) Local Business Days after the 
 receipt of a Confirmation from ENA to accept or dispute the Confirmation.  
 Also, ENA is the Calculation Agent unless it should become a Defaulting 
 Party, in which case the Counterparty shall be the Calculation Agent.
 
 Susan S. Bailey
 Enron North America Corp.
 1400 Smith Street, Suite 3806A
 Houston, Texas 77002
 Phone: (713) 853-4737
 Fax: (713) 646-3490

That one was very long, but there’s definitely some good business content in it (along with some happy banter about the contract that I guess was acquired).

All in all, I’d say that fixing those regex patterns that were supposed to filter out the caution/privacy messages at the ends of people’s emails was a big boon to the LDA analysis here.

Let that be a lesson: half the battle in LDA is in filtering out the noise!

A Rather Nosy Topic Model Analysis of the Enron Email Corpus

Having only ever played with Latent Dirichlet Allocation using gensim in python, I was very interested to see a nice example of this kind of topic modelling in R.  Whenever I see a really cool analysis done, I get the urge to do it myself.  What better corpus to do topic modelling on than the Enron email dataset?!?!?  Let me tell you, this thing is a monster!  According to the website I got it from, it contains about 500k messages, coming from 151 mostly senior management users, and is organized into user folders.  I didn’t want to include everything in my analysis, so I decided that I would only look at messages contained within the “sent” or “sent items” folders.

Being a big advocate of R, I really, really tried to do all of the processing and analysis in R, but it was just too difficult and was taking up more time than I wanted.  So I dusted off my python skills (thank you grad school!) and did the bulk of the data processing/preparation in python, and the text mining in R.  Following is the code (hopefully well enough commented) that I used to process the corpus in python:


docs = []
from os import listdir, chdir
import re
# Here's my attempt at coming up with regular expressions to filter out
# parts of the enron emails that I deem as useless.
email_pat = re.compile(".+@.+")
to_pat = re.compile("To:.+\n")
cc_pat = re.compile("cc:.+\n")
subject_pat = re.compile("Subject:.+\n")
from_pat = re.compile("From:.+\n")
sent_pat = re.compile("Sent:.+\n")
received_pat = re.compile("Received:.+\n")
ctype_pat = re.compile("Content-Type:.+\n")
reply_pat = re.compile("Reply- Organization:.+\n")
date_pat = re.compile("Date:.+\n")
xmail_pat = re.compile("X-Mailer:.+\n")
mimver_pat = re.compile("MIME-Version:.+\n")
contentinfo_pat = re.compile("—————————————-.+—————————————-")
forwardedby_pat = re.compile("———————-.+———————-")
caution_pat = re.compile('''\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*.+\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*''')
privacy_pat = re.compile(" _______________________________________________________________.+ _______________________________________________________________")
# The enron emails are in 151 directories, one for each senior management
# employee whose email account was entered into the dataset.
# The task here is to go into each folder, and enter each
# email text file into one long nested list.
# I've used readlines() to read in the emails because read()
# didn't seem to work with these email files.
chdir("/home/inkhorn/enron")
names = [d for d in listdir(".") if "." not in d]
for name in names:
    chdir("/home/inkhorn/enron/%s" % name)
    subfolders = listdir('.')
    sent_dirs = [n for n, sf in enumerate(subfolders) if "sent" in sf]
    sent_dirs_words = [subfolders[i] for i in sent_dirs]
    for d in sent_dirs_words:
        chdir('/home/inkhorn/enron/%s/%s' % (name,d))
        file_list = listdir('.')
        docs.append([" ".join(open(f, 'r').readlines()) for f in file_list if "." in f])
# Here i go into each email from each employee, try to filter out all the useless stuff,
# then paste the email into one long flat list. This is probably inefficient, but oh well – python
# is pretty fast anyway!
docs_final = []
for subfolder in docs:
    for email in subfolder:
        if ".nsf" in email:
            etype = ".nsf"
        elif ".pst" in email:
            etype = ".pst"
        email_new = email[email.find(etype)+4:]
        email_new = to_pat.sub('', email_new)
        email_new = cc_pat.sub('', email_new)
        email_new = subject_pat.sub('', email_new)
        email_new = from_pat.sub('', email_new)
        email_new = sent_pat.sub('', email_new)
        email_new = email_pat.sub('', email_new)
        if "-----Original Message-----" in email_new:
            email_new = email_new.replace("-----Original Message-----","")
        email_new = ctype_pat.sub('', email_new)
        email_new = reply_pat.sub('', email_new)
        email_new = date_pat.sub('', email_new)
        email_new = xmail_pat.sub('', email_new)
        email_new = mimver_pat.sub('', email_new)
        email_new = contentinfo_pat.sub('', email_new)
        email_new = forwardedby_pat.sub('', email_new)
        email_new = caution_pat.sub('', email_new)
        email_new = privacy_pat.sub('', email_new)
        docs_final.append(email_new)
# Here I proceed to dump each and every email into about 126 thousand separate
# txt files in a newly created 'data' directory. This gets it ready for entry into a Corpus using the tm (textmining)
# package from R.
for n, doc in enumerate(docs_final):
    outfile = open("/home/inkhorn/enron/data/%s.txt" % n,'w')
    outfile.write(doc)
    outfile.close()

After having seen python’s performance in rifling through these enron emails, I was very impressed!  It was very agile in creating a directory with the largest number of files I’d ever seen on my computer!

Okay, so now I had a directory filled with a whole lot of text files.  The next step was to bring them into R so that I could submit them to the LDA.  Following is the R code that I used:


library(stringr)
library(plyr)
library(tm)
library(tm.plugin.mail)
library(SnowballC)
library(topicmodels)
# At this point, the python script should have been run,
# creating about 126 thousand txt files. I was very much afraid
# to import that many txt files into the tm package in R (my computer only
# runs on 8GB of RAM), so I decided to mark 60k of them for a sample, and move the
# rest of them into a separate directory
email_txts = list.files('data/')
email_txts_sample = sample(email_txts, 60000)
email_rename = data.frame(orig=email_txts_sample, new=sub(".txt",".rxr", email_txts_sample))
file.rename(str_c('data/',email_rename$orig), str_c('data/',email_rename$new))
# At this point, all of the non-sampled emails (labelled .txt, not .rxr)
# need to go into a different directory. I created a directory that I called
# nonsampled/ and moved the files there via the terminal command "mv *.txt nonsampled/".
# It's very important that you don't try to do this via a file explorer, windows or linux,
# as the act of trying to display that many file icons is apparently very difficult for a regular machine :$
enron = Corpus(DirSource("/home/inkhorn/enron/data"))
extendedstopwords=c("a","about","above","across","after","MIME Version","forwarded","again","against","all","almost","alone","along","already","also","although","always","am","among","an","and","another","any","anybody","anyone","anything","anywhere","are","area","areas","aren't","around","as","ask","asked","asking","asks","at","away","b","back","backed","backing","backs","be","became","because","become","becomes","been","before","began","behind","being","beings","below","best","better","between","big","both","but","by","c","came","can","cannot","can't","case","cases","certain","certainly","clear","clearly","come","could","couldn't","d","did","didn't","differ","different","differently","do","does","doesn't","doing","done","don't","down","downed","downing","downs","during","e","each","early","either","end","ended","ending","ends","enough","even","evenly","ever","every","everybody","everyone","everything","everywhere","f","face","faces","fact","facts","far","felt","few","find","finds","first","for","four","from","full","fully","further","furthered","furthering","furthers","g","gave","general","generally","get","gets","give","given","gives","go","going","good","goods","got","great","greater","greatest","group","grouped","grouping","groups","h","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","her","here","here's","hers","herself","he's","high","higher","highest","him","himself","his","how","however","how's","i","i'd","if","i'll","i'm","important","in","interest","interested","interesting","interests","into","is","isn't","it","its","it's","itself","i've","j","just","k","keep","keeps","kind","knew","know","known","knows","l","large","largely","last","later","latest","least","less","let","lets","let's","like","likely","long","longer","longest","m","made","make","making","man","many","may","me","member","members","men","might","more","most","mostly","mr","mrs","much","must","mustn't","my","myself","n","necessary","need","needed","needing","needs","never","new","newer","newest","next","no","nobody","non","noone","nor","not","nothing","now","nowhere","number","numbers","o","of","off","often","old","older","oldest","on","once","one","only","open","opened","opening","opens","or","order","ordered","ordering","orders","other","others","ought","our","ours","ourselves","out","over","own","p","part","parted","parting","parts","per","perhaps","place","places","point","pointed","pointing","points","possible","present","presented","presenting","presents","problem","problems","put","puts","q","quite","r","rather","really","right","room","rooms","s","said","same","saw","say","says","second","seconds","see","seem","seemed","seeming","seems","sees","several","shall","shan't","she","she'd","she'll","she's","should","shouldn't","show","showed","showing","shows","side","sides","since","small","smaller","smallest","so","some","somebody","someone","something","somewhere","state","states","still","such","sure","t","take","taken","than","that","that's","the","their","theirs","them","themselves","then","there","therefore","there's","these","they","they'd","they'll","they're","they've","thing","things","think","thinks","this","those","though","thought","thoughts","three","through","thus","to","today","together","too","took","toward","turn","turned","turning","turns","two","u","under","until","up","upon","us","use","used","uses","v","very","w","want","wanted","wanting","wants","was","wasn't","way","ways","we","we'd","well","we'll","wells","went","were","we're","weren't","we've","what","what's","when","when's","where","where's","whether","which","while","who","whole","whom","who's","whose","why","why's","will","with","within","without","won't","work","worked","working","works","would","wouldn't","x","y","year","years","yes","yet","you","you'd","you'll","young","younger","youngest","your","you're","yours","yourself","yourselves","you've","z")
dtm.control = list(
tolower = T,
removePunctuation = T,
removeNumbers = T,
stopwords = c(stopwords("english"),extendedstopwords),
stemming = T,
wordLengths = c(3,Inf),
weighting = weightTf)
dtm = DocumentTermMatrix(enron, control=dtm.control)
dtm = removeSparseTerms(dtm,0.999)
dtm = dtm[rowSums(as.matrix(dtm))>0,]
k = 4
# Beware: this step takes a lot of patience! My computer was chugging along for probably 10 or so minutes before it completed the LDA here.
lda.model = LDA(dtm, k)
# This enables you to examine the words that make up each topic that was calculated. Bear in mind that I've chosen to stem all words possible in this corpus, so some of the words output will look a little weird.
terms(lda.model,20)
# Here I construct a dataframe that scores each document according to how closely its content
# matches up with each topic. The closer the score is to 1, the more strongly its content matches
# up with that particular topic.
emails.topics = posterior(lda.model, dtm)$topics
df.emails.topics = as.data.frame(emails.topics)
df.emails.topics = cbind(email=as.character(rownames(df.emails.topics)),
df.emails.topics, stringsAsFactors=F)

Phew, that took a lot of computing power! Now that it’s done, let’s look at the results of the terms(lda.model,20) command from the code above:

      Topic 1   Topic 2     Topic 3      Topic 4     
 [1,] "time"    "thank"     "market"     "email"     
 [2,] "vinc"    "pleas"     "enron"      "pleas"     
 [3,] "week"    "deal"      "power"      "messag"    
 [4,] "thank"   "enron"     "compani"    "inform"    
 [5,] "look"    "attach"    "energi"     "receiv"    
 [6,] "day"     "chang"     "price"      "intend"    
 [7,] "dont"    "call"      "gas"        "copi"      
 [8,] "call"    "agreement" "busi"       "attach"    
 [9,] "meet"    "question"  "manag"      "recipi"    
[10,] "hope"    "fax"       "servic"     "enron"     
[11,] "talk"    "america"   "rate"       "confidenti"
[12,] "ill"     "meet"      "trade"      "file"      
[13,] "tri"     "mark"      "provid"     "agreement" 
[14,] "night"   "kay"       "issu"       "thank"     
[15,] "friday"  "corp"      "custom"     "contain"   
[16,] "peopl"   "trade"     "california" "address"   
[17,] "bit"     "ena"       "oper"       "contact"   
[18,] "guy"     "north"     "cost"       "review"    
[19,] "love"    "discuss"   "electr"     "parti"     
[20,] "houston" "regard"    "report"     "contract"

Here’s where some really subjective interpretation is required, just like in PCA analysis.  Let’s try to interpret the topics, one at a time:

  1. I see a lot of words related to time in this topic, and then I see the word ‘meet’.  I’ll call this the meeting (business or otherwise) topic!
  2. I’m not sure how to interpret this second topic, so perhaps I’ll chalk it up to noise in my analysis (one rough way to check this is sketched just after this list)!
  3. This topic contains a lot of ‘business content’ words, so it appears to be a kind of ‘talking shop’ topic.
  4. This topic, while still pretty ‘businessy’, appears to be less about the content of the business and more about the processes, or perhaps legalities of the business.
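
One rough way to check whether topic 2 is mostly noise (just a sketch, using the df.emails.topics dataframe constructed in the code above): tabulate which topic each email loads on most strongly, and how many emails load almost entirely on each topic.  A topic that rarely dominates any email is a decent candidate for noise.

# Sketch only: columns 2 to 5 of df.emails.topics hold the scores for topics 1-4
dominant.topic = apply(df.emails.topics[, 2:5], 1, which.max)
table(dominant.topic)
# How many emails load almost entirely (> .95) on each topic
colSums(df.emails.topics[, 2:5] > .95)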

For each of the sensible topics (1,3,4), let’s bring up some emails that scored highly on these topics to see if the analysis makes sense:

sample(which(df.emails.topics$"1" > .95), 10)
 [1] 53749 32102 16478 36204 29296 29243 47654 38733 28515 53254
enron[[32102]]

 I will be out of the office next week on Spring Break. Can you participate on 
 this call? Please report what is said to Christian Yoder 503-464-7845 or 
 Steve Hall 503-4647795

 	03/09/2001 05:48 PM

 I don't know, but I will check with our client.

 Our client Avista Energy has received the communication, below, from the ISO
 regarding withholding of payments to creditors of monies the ISO has
 received from PG&E.  We are interested in whether any of your clients have
 received this communication, are interested in this issue and, if so,
 whether you have any thoughts about how to proceed.

 You are invited to participate in a conference call to discuss this issue on
 Monday, March 12, at 10:00 a.m.

 Call-in number: (888) 320-6636
 Host: Pritchard
 Confirmation number: 1827-1922

 Diane Pritchard
 Morrison & Foerster LLP
 425 Market Street
 San Francisco, California 94105
 (415) 268-7188

So this one isn’t a business meeting in the physical sense, but is a conference call, which still falls under the general category of meetings.

enron[[29243]]
 Hey Fritz.  I am going to send you an email that attaches a referral form to your job postings.  In addition, I will also personally tell the hiring manager that I have done this and I can also give him an extra copy of youe resume.  Hopefully we can get something going here....

 Tori,

 I received your name from Diane Hoyuela. You and I spoke
 back in 1999 about the gas industry. I tried briefly back
 in 1999 and found few opportunities during de-regulations
 first few steps. Well,...I'm trying again. I've been
 applying for a few job openings at Enron and was wondering
 if you could give me an internal referral. Also, any advice
 on landing a position at Enron or in general as a scheduler
 or analyst.
 Last week I applied for these positions at Enron; gas
 scheduler 110360, gas analyst 110247, and book admin.
 110129. I have a pretty good understanding of the gas
 market.

 I've attached my resume for you. Congrats. on the baby!
 I'll give you a call this afternoon to follow-up, I know
 mornings are your time.
 Regards,

 Fritz Hiser

 __________________________________________________
 Do You Yahoo!?
 Get email alerts & NEW webcam video instant messaging with Yahoo! Messenger. http://im.yahoo.com

That one obviously shows someone who was trying to get a job at Enron and wanted to call “this afternoon to follow-up”. Again, a ‘call’ rather than a physical meeting.

Finally,

enron[[29296]]

 Susan,

 Well you have either had a week from hell so far or its just taking you time
 to come up with some good bs.  Without being too forward I will be in town
 next Friday and wanted to know if you would like to go to dinner or
 something.  At least that will give us a chance to talk face to face.  If
 your busy don't worry about it I thought I would just throw it out there.

 I'll keep this one short and sweet since the last one was rather lengthy.
 Hope this Thursday is a little better then last week.

 Kyle

 _________________________________________________________________________
 Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com.

 Share information about yourself, create your own public profile at
 http://profiles.msn.com.

Ahh, here’s a particularly juicy one. Kyle here wants to go to dinner, “or something” (heh heh heh) with Susan to get a chance to talk face to face with her. Finally, a physical meeting (maybe very physical…) lumped into a category with other business meetings in person or on the phone.

Okay, now let’s switch to topic 3, the “business content” topic.

sample(which(df.emails.topics$"3" > .95), 10)
 [1] 40671 26644  5398 52918 37708  5548 15167 56149 47215 26683

enron[[40671]]

 Please change the counterparty on deal 806589 from TP2 to TP3 (sorry about that).

Okay, that seems fairly in the realm of business content, but I don’t know what the heck it means. Let’s try another one:

enron[[5548]]

Phillip, Scott, Hunter, Tom and John -

 Just to reiterate the new trading guidelines on PG&E Energy Trading:

 1.  Both financial and physical trading are approved, with a maximum tenor of 18 months

 2.  Approved entities are:	PG&E Energy Trading - Gas Corporation
 				PG&E Energy Trading - Canada Corporation

 				NO OTHER PG&E ENTITIES ARE APPROVED FOR TRADING

 3.  Both EOL and OTC transactions are OK

 4.  Please call Credit (ext. 31803) with details on every OTC transaction.  We need to track all new positions with PG&E Energy Trading on an ongoing basis.  Please ask the traders and originators on your desks to notify us with the details on any new transactions immediately upon execution.  For large transactions (greater than 2 contracts/day or 5 BCF total), please call for approval before transacting.

 Thanks for your assistance; please call me (ext. 53923) or Russell Diamond (ext. 57095) if you have any questions.

 Jay

That one is definitely oozing with business content. Note the terms such as “Energy Trading”, and “Gas Corporation”, etc. Finally, one more:

enron[[26683]]

Hi Kathleen, Randy, Chris, and Trish,

 Attached is the text of the August issue of The Islander.  The headings will
 be lined up when Trish adds the art and ads.  A calendar, also, which is in
 the next e-mail.

 I'll appreciate your comments by the end of tomorrow, Monday.

 There are open issues which I sure hope get resolved before printing:

 1.  I'm waiting for a reply from Mike Bass regarding tenses on the Home Depot
 article.  Don't know if there's one developer or more and what the name(s)
 is/are.

 2.  Didn't hear back from Ted Weir regarding minutes for July's water board
 meeting.  I think there are 2 meetings minutes missed, 6/22 and July.

 3.  Waiting to hear back from Cheryl Hanks about the 7/6 City Council and 6/7
 BOA meetings minutes.

 4.  Don't know the name of the folks who were honored with Yard of the Month.
  They're at 509 Narcissus.

 I'm not feeling very good about the missing parts but need to move on
 schedule!  I'm also looking for a good dictionary to check the spellings of
 ettouffe, tree-house and orneryness.  (Makes me feel kind of ornery, come to
 think about it!)

 Please let me know if you have revisions.  Hope your week is starting out
 well.

 'Nita

Alright, this one seems to be a mix between business content and process. So I can see how it was lumped into this topic, but it doesn’t quite have the perfection that I would like.

Finally, let’s move on to topic 4, which appeared to be a ‘business process’ topic to me. I’m suspicious of this topic, as I don’t think I successfully filtered out everything that I wanted to:

sample(which(df.emails.topics$"4" > .95), 10)
 [1] 51205  5129 48826 51214 55337 15843 52543 11978 48337  2609

enron[[5129]]

very funny today...during the free fall, couldn't price jv and xh low enough 
 on eol, just kept getting cracked.  when we stabilized, customers came in to 
 buy and couldnt price it high enough.  winter versus apr went from +23 cents 
 when we were at the bottom to +27 when april rallied at the end even though 
 it should have tightened theoretically.  however, april is being supported 
 just off the strip.  getting word a lot of utilities are going in front of 
 the puc trying to get approval for hedging programs this year.  

 hey johnny. hope all is well. what u think hrere? utuilites buying this break
 down? charts look awful but 4.86 ish is next big level.
 jut back from skiing in co, fun but took 17 hrs to get home and a 1.5 days to
 get there cuz of twa and weather.

Hrm, this one appears to be some ‘shop talk’, and isn’t too general. I’m not sure how this applies to the topic 4 words. Let’s try another one:

enron[[55337]]

Fran, do you have an updated org chart that I could send to the Measurement group?
 	Thanks. Lynn

    Cc:	Estalee Russi

 Lynn,

 Attached are the org charts for ETS Gas Logistics:

 Have a great weekend.  Thanks!

 Miranda

Here we go. This one seems to fall much more into the ‘business process’ realm. Let’s see if I can find another good example:

enron[[11978]]

 Bill,

 As per our conversation today, I am sending you an outline of what we intend to be doing in Ercot and in particular on the real-time desk. For 2002 Ercot is split into 4 zones with TCRs between 3 of the zones. The zones are fairly diverse from a supply/demand perspective. Ercot has an average load of 38,000 MW, a peak of 57,000 MW with a breakdown of 30% industrial, 30% commercial and 40% residential. There are already several successful aggregators that are looking to pass on their wholesale risk to a credit-worthy QSE (Qualified Scheduling Entity). 

 Our expectation is that we will be a fully qualified QSE by mid-March with the APX covering us up to that point. Our initial on-line products will include a bal day and next day financial product. (There is no day ahead settlement in this market). There are more than 10 industrial loads with greater than 150 MW concentrated at single meters offering good opportunities for real-time optimization. Our intent is to secure one of these within the next 2 months.

 I have included some price history to show the hourly volatility and a business plan to show the scope of the opportunity. In addition, we have very solid analytics that use power flow simulations to map out expected outcomes in the real-time market.

 The initial job opportunity will involve an analysis of the real-time market as it stands today with a view to trading around our information. This will also drive which specific assets we approach to manage. As we are loosely combining our Texas gas and Ercot power desks our information flow will be superior and I believe we will have all the tools needed for a successful real-time operation.

 Let me know if you have any further questions.

 Thanks,

 Doug

Again, I seem to have found an email that straddles the boundary between business process and business content. Okay, I guess this topic isn’t the clearest in describing each of the examples that I found!

Overall, I probably could have done a bit more to filter out the useless stuff to construct topics that were better in describing the examples that they represent. Also, I’m not sure if I should be surprised or not that I didn’t pick up some sort of ‘social banter’ topic, where people were emailing about non-business topics. I suppose that social banter emails might be less predictable in their content, but maybe somebody much smarter than I am can tell me the answer 🙂
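
If I wanted to chase that question, one simple (and admittedly crude) thing to try would be refitting the model with more topics and scanning the term lists for something that looks chatty rather than businessy.  Just a sketch, re-using the dtm built above, with the number of topics picked arbitrarily for illustration:

# Sketch only: a larger k might separate social chit-chat from shop talk.
# Fair warning: this takes even longer to run than the k = 4 model above.
k.bigger = 12
lda.model.bigger = LDA(dtm, k.bigger)
terms(lda.model.bigger, 20)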

If you know how I can significantly ramp up the quality of this analysis, feel free to contribute your comments!