Teaching a Class of Undergrads, RStudio Server, and My Ubuntu Machine

I was chatting about public speaking with my brother, who is a Lecturer in the Faculty of Pharmacy at UofT, when he offered me the opportunity to come to his class and teach about R.  Always eager to spread the analytical goodness, I said yes!  The class is this Friday, and I am excited.

For this class I’ll be making use of RStudio Server, rather than having to get R onto some 30 individual machines.  Furthermore, I’ll be using an installation of RStudio Server on my own home machine.  It gives me more control and the convenience of configuring things late at night when I have the time to.

While playing around with the server on my computer (connecting via my own browser), I noticed that a new package library gets built for each user you create.  That’s too bad as far as this class is concerned, because it would have been neat for everyone to be able to use additional packages like ggplot2 and dplyr, but this is an extremely beginner-level class anyway.
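
Still, if I decide everyone should have ggplot2 and dplyr, one option would be to install them once into R’s site-wide library from an admin session, so that every account picks them up.  Here’s a minimal sketch of that idea (the path and package list are my own assumptions, not something I’ve tested on the class setup yet):

# Sketch only: install shared packages into the site library so that every
# RStudio Server account can load them. Run this from an R session started
# with sudo so it can write there (on Ubuntu the site library is usually
# /usr/local/lib/R/site-library).
site_lib = .Library.site[1]
install.packages(c("ggplot2", "dplyr"), lib = site_lib, repos = "https://cran.r-project.org")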

I’ve signed up for a dynamic DNS host name from no-ip.com, and have set the port forwarding on my router accordingly, so that seems to be working just fine.  I just hope that nothing goes wrong.  I also need to remember to create enough accounts on my Ubuntu machine to accommodate all the students, which will be a small pain in the you-know-what, but oh well.
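
Rather than clicking through the user admin tool thirty times, I’ll probably script the account creation.  Below is a rough sketch of the idea; the user names and starter password are placeholders, and it assumes the R session itself was started with sudo:

# Sketch only: batch-create student accounts by shelling out to the standard
# Ubuntu user-management commands. Names and password here are placeholders.
students = sprintf("student%02d", 1:30)
for (s in students) {
  system(sprintf("useradd -m %s", s))                    # -m creates a home directory
  system(sprintf("echo '%s:ChangeMe123' | chpasswd", s)) # set a throwaway starting password
}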

As for the data side of things, I’ve compiled some mildly interesting data on drug-related deaths by council area in Scotland, along with geographical coordinates and levels of crime, employment, education, income and health.  I only have an hour, so we’ll see how much I can cover!  Wish me luck.  If you have any advice, I’d be happy to hear it.  I’ve already been told to start with graphics 🙂
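
Since the advice is to start with graphics, the opening example will probably look something like the sketch below (the file name and column names are placeholders until I finalize the dataset):

# Sketch only: the kind of first plots I have in mind for the class.
# The csv name and column names are placeholders, not the real dataset.
scot = read.csv("scotland_drug_deaths.csv")
hist(scot$drug.deaths, main = "Drug-related deaths by council area", xlab = "Deaths")
plot(drug.deaths ~ unemployment, data = scot,
     xlab = "Unemployment level", ylab = "Drug-related deaths")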

Nuclear vs Green Energy: Share the Wealth or Get Your Own?

Thanks to Ontario Open Data, a survey dataset was recently made public containing people’s responses to questions about Ontario’s Long Term Energy Plan (LTEP).  The survey did fairly well in terms of raw response numbers, with 7,889 responses in total (although who knows how many people it was sent to!).  As you’ll see in later images in this post, two major goals of Ontario’s LTEP are to eliminate power generation from coal and to maximize savings by encouraging lots of conservation.

For now though, I’ll introduce my focus of interest: Which energy sources did survey respondents think should be relied on in the future, and how does that correlate with their views on energy management/sharing?

As you can see in the graph below, survey respondents were given a 7-point scale and asked to use it to rate the importance of different energy source options (the scale has been flipped so that 7 is the most important and 1 is the least).  Perhaps it’s my ignorance of this whole discussion, but it surprised me that 76% of respondents rated Nuclear power as at least a 5/7 in importance!  Nuclear power?  But what about Chernobyl and Fukushima?  To be fair, although terribly dramatic and devastating, those were isolated incidents, and measures have been taken to ensure our current nuclear reactors are and will be disaster-safe.  Realistically, I think most people don’t think about those things!  A few other things to notice here: conservation does have its adherents, with 37% giving a positive response, while green energy has surprisingly (and perhaps saddeningly) few adherents, proportionately speaking.

Survey: Importance of Energy Sources

After staring at this graph for a while, I had the idea to see what interesting differences I could find between people who supported Nuclear energy versus those who support Green energy.  What I found is quite striking in its consistency:

  1. Those who believe Nuclear energy is important for Ontario’s future mix of energy sources seem to be more confident that there’s enough energy to share between regions and that independence in power generation is not entirely necessary.
  2. On the flip side, those who believe Green energy is important for Ontario’s future mix of energy sources seem to be more confident that there isn’t enough energy to share between regions and that independence in power generation should be nurtured.

See for yourself in the following graphs:

Survey: Regions Should Make Conservation their First Priority

Survey: Self Sustaining Regions

Survey: Region Responsible for Growing Demand

Survey: Regions buy Power

Does this make sense in light of actual facts?  The graph below comes from a very digestible page set up by the Ontario Ministry of Energy to communicate its Long Term Energy Plan.  As it makes pretty obvious, Nuclear energy accounts for over half of energy production in Ontario in 2013, whereas the newer green energy sources (Solar, Bioenergy and Wind, as opposed to Hydro) amount to about 5%.  In its forecast for 2032, the Ministry is hopeful that these newer sources will account for 13% of energy production in Ontario.  That’s still not the lion’s share of energy, but add it to the 22% accounted for by Hydro and you get 35% of all energy production, which admittedly isn’t bad!  Still, I wonder what people were thinking of when they saw “Green energy” on the survey.  If the newer sources, then I think what is going on here is that people who advocate for Green energy sources such as wind and solar have some idea of how difficult it is to power a land mass such as Ontario with these kinds of power stations.  People advocating for Nuclear, on the other hand, are either blissfully ignorant or simply understand that Nuclear power plants are able to serve a wider area.

MOE: Screenshot from 2013-12-08 13:28:04

MOE: Screenshot from 2013-12-08 13:41:06

All of this being said, as you can see in the image above, the Ontario Provincial Government actually wants to *reduce* our province’s reliance on Nuclear energy over the next 5 years, and in fact it will not be building new reactors.  I contacted Mark Smith, Senior Media Relations Coordinator at the Ontario Ministry of Energy, to ask him to comment on the role of Nuclear energy in the long run.  Following are some tidbits that he shared with me over email:

Over the past few months, we have had extensive consultations as part of our review of Ontario’s Long Term Energy Plan (LTEP). There is a strong consensus that now is not the right time to build new nuclear.

Ontario is currently in a comfortable supply situation and it does not require the additional power.

We will continue to monitor the demand and supply situation and look at building new nuclear in the future, if the need arises.

Nuclear power has been operating safely in our province for over 40 years, and is held to the strictest regulations and safety requirements to ensure that the continued operation of existing facilities, and any potential new build are held to the highest standards.

We will continue with our nuclear refurbishment plans for which there was strong province-wide support during the LTEP consultations.

During refurbishment, both OPG and Bruce Power will be subject to the strictest possible oversight to ensure safety, reliable supply and value for ratepayers.

Nuclear refurbishments will create thousands of jobs and extend the lives of our existing fleet for another 25-30 years, sustaining thousands of highly-skilled and high-paying jobs.

The nuclear sector will continue be a vital, innovative part of Ontario, creating new technology which is exported around the world.

Well, even Mr. Mark Smith seems confident about Nuclear energy!  I tried to contact the David Suzuki Foundation to see if they’d have anything to say on the role of Green Energy in Ontario’s future, but they were unavailable for comment.

Well, there you have it!  Despite confidence in Nuclear energy as a viable source for the future, the province will be increasing its investments in both Green energy and conservation!  Here’s hoping for an electric decade to come 🙂

(P.S. As usual, the R code follows)


ltep = read.csv("ltep-survey-results-all.csv")
library(likert)
library(ggthemes)
# Here I flip the scoring
ltep[,13:19] = sapply(ltep[,13:19], function (x) 8 - x)
deal.w.esources = likert(ltep[,13:19])
summary(deal.w.esources)
plot(deal.w.esources, text.size=6, text.color="black") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold")) + ggtitle("What guidelines should Ontario use\n for its future mix of energy sources?")
library(plyr)
# following is a lot of boring code where I
# generate numerous data frames that contain percent of
# respondents who agree with the statement by their rated
# importance of Nuclear vs. Green Energy
# Here's the nuclear section
self.sustaining.region.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$A.region.should.be.responsible.for.generating.at.least.some.of.its.own.power. == "Agree", na.rm=TRUE))
self.sustaining.region.by.nuke = self.sustaining.region.by.nuke[1:7,]
region.buy.power.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$Regions.should.have.the.option.to.buy.all.of.their.power.from.sources.in.another.region..if.the.power.is.available. == "Agree", na.rm=TRUE))
region.buy.power.by.nuke = region.buy.power.by.nuke[1:7,]
region.resp.for.growing.demand.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$If.a.region.s.power.demand.is.growing..it.should.be.responsible.for.building.the.generation.or.transmission.to.supply.that.it.needs. == "Agree", na.rm=TRUE))
region.resp.for.growing.demand.by.nuke = region.resp.for.growing.demand.by.nuke[1:7,]
regions.make.cons.first.priority.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$Regions.should.make.conservation.their.first.priority.to.reduce.the.need.for.new.supply. == "Agree", na.rm=TRUE))
regions.make.cons.first.priority.by.nuke = regions.make.cons.first.priority.by.nuke[1:7,]
# Here's the green energy section
self.sustaining.region.by.green = ddply(ltep, .(Green.energy..e.g...wind..solar..is.our.best.option.), function (x) mean(x$A.region.should.be.responsible.for.generating.at.least.some.of.its.own.power. == "Agree", na.rm=TRUE))
self.sustaining.region.by.green = self.sustaining.region.by.green[1:7,]
region.buy.power.by.green = ddply(ltep, .(Green.energy..e.g...wind..solar..is.our.best.option.), function (x) mean(x$Regions.should.have.the.option.to.buy.all.of.their.power.from.sources.in.another.region..if.the.power.is.available. == "Agree", na.rm=TRUE))
region.buy.power.by.green = region.buy.power.by.green[1:7,]
region.resp.for.growing.demand.by.green = ddply(ltep, .(Green.energy..e.g...wind..solar..is.our.best.option.), function (x) mean(x$If.a.region.s.power.demand.is.growing..it.should.be.responsible.for.building.the.generation.or.transmission.to.supply.that.it.needs. == "Agree", na.rm=TRUE))
region.resp.for.growing.demand.by.green = region.resp.for.growing.demand.by.green[1:7,]
regions.make.cons.first.priority.by.green = ddply(ltep, .(Green.energy..e.g...wind..solar..is.our.best.option.), function (x) mean(x$Regions.should.make.conservation.their.first.priority.to.reduce.the.need.for.new.supply. == "Agree", na.rm=TRUE))
regions.make.cons.first.priority.by.green = regions.make.cons.first.priority.by.green[1:7,]
# in this section I alternate between getting each data frame ready for plotting
# and obviously plotting the data frame. I liked the yellow and green that I used in the plot
# to represent Nuclear vs. Green energy 🙂
library(ggplot2)
library(scales)
self.sustaining.region = rbind(data.frame(Importance=seq(1,7),Pct.Agree=self.sustaining.region.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=self.sustaining.region.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(self.sustaining.region, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("A Region Should be Responsible for Generating at Least Some of Its Own Power\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))
region.buy.power = rbind(data.frame(Importance=seq(1,7),Pct.Agree=region.buy.power.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=region.buy.power.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(region.buy.power, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("Regions Should Have the Option to Buy All of their Power\nfrom Sources in Another Region if Available\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))
region.resp.for.growing.demand = rbind(data.frame(Importance=seq(1,7),Pct.Agree=region.resp.for.growing.demand.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=region.resp.for.growing.demand.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(region.resp.for.growing.demand, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("If a Region's Power Demand is Growing\nIt Should be Responsible for Building the Transmission/Supply it Needs\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))
regions.make.cons.first.priority = rbind(data.frame(Importance=seq(1,7),Pct.Agree=regions.make.cons.first.priority.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=regions.make.cons.first.priority.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(regions.make.cons.first.priority, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("Regions Should Make Conservation Their First Priority \nto Reduce the Need for New Supply\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))

Enron Email Corpus Topic Model Analysis Part 2 – This Time with Better regex

After posting my analysis of the Enron email corpus, I realized that the regex patterns I had set up to capture and filter out the cautionary/privacy messages at the bottoms of people’s emails were not working.  Let’s have a look at my revised python code for processing the corpus:


docs = []
from os import listdir, chdir
import re
# Here's the section where I try to filter useless stuff out.
# Notice near the end all of the regex patterns where I've passed
# "re.DOTALL". This is pretty key here. It means that the .+ I have
# referenced within the regex pattern will match newline characters
# (\n) in addition to everything else. Since I did not have this in
# the first version, the cautionary/privacy messages people were
# pasting at the ends of their emails were not getting filtered out
# and were being entered into the LDA analysis, putting noise in the
# topics that were modelled.
email_pat = re.compile(".+@.+")
to_pat = re.compile("To:.+\n")
cc_pat = re.compile("cc:.+\n")
subject_pat = re.compile("Subject:.+\n")
from_pat = re.compile("From:.+\n")
sent_pat = re.compile("Sent:.+\n")
received_pat = re.compile("Received:.+\n")
ctype_pat = re.compile("Content-Type:.+\n")
reply_pat = re.compile("Reply- Organization:.+\n")
date_pat = re.compile("Date:.+\n")
xmail_pat = re.compile("X-Mailer:.+\n")
mimver_pat = re.compile("MIME-Version:.+\n")
dash_pat = re.compile("--+.+--+", re.DOTALL)
star_pat = re.compile('\*\*+.+\*\*+', re.DOTALL)
uscore_pat = re.compile(" __+.+__+", re.DOTALL)
equals_pat = re.compile("==+.+==+", re.DOTALL)
# (the below is the same note as before)
# The enron emails are in 151 directories, one for each senior management
# employee whose email account was entered into the dataset.
# The task here is to go into each folder, and enter each
# email text file into one long nested list.
# I've used readlines() to read in the emails because read()
# didn't seem to work with these email files.
chdir("/home/inkhorn/enron")
names = [d for d in listdir(".") if "." not in d]
for name in names:
    chdir("/home/inkhorn/enron/%s" % name)
    subfolders = listdir('.')
    sent_dirs = [n for n, sf in enumerate(subfolders) if "sent" in sf]
    sent_dirs_words = [subfolders[i] for i in sent_dirs]
    for d in sent_dirs_words:
        chdir('/home/inkhorn/enron/%s/%s' % (name,d))
        file_list = listdir('.')
        docs.append([" ".join(open(f, 'r').readlines()) for f in file_list if "." in f])
# (the below is the same note as before)
# Here i go into each email from each employee, try to filter out all the useless stuff,
# then paste the email into one long flat list. This is probably inefficient, but oh well – python
# is pretty fast anyway!
docs_final = []
for subfolder in docs:
    for email in subfolder:
        if ".nsf" in email:
            etype = ".nsf"
        elif ".pst" in email:
            etype = ".pst"
        email_new = email[email.find(etype)+4:]
        email_new = to_pat.sub('', email_new)
        email_new = cc_pat.sub('', email_new)
        email_new = subject_pat.sub('', email_new)
        email_new = from_pat.sub('', email_new)
        email_new = sent_pat.sub('', email_new)
        email_new = received_pat.sub('', email_new)
        email_new = email_pat.sub('', email_new)
        email_new = ctype_pat.sub('', email_new)
        email_new = reply_pat.sub('', email_new)
        email_new = date_pat.sub('', email_new)
        email_new = xmail_pat.sub('', email_new)
        email_new = mimver_pat.sub('', email_new)
        email_new = dash_pat.sub('', email_new)
        email_new = star_pat.sub('', email_new)
        email_new = uscore_pat.sub('', email_new)
        email_new = equals_pat.sub('', email_new)
        docs_final.append(email_new)
# (the below is the same note as before)
# Here I proceed to dump each and every email into about 126 thousand separate
# txt files in a newly created 'data' directory. This gets it ready for entry into a Corpus using the tm (textmining)
# package from R.
for n, doc in enumerate(docs_final):
    outfile = open("/home/inkhorn/enron/data/%s.txt" % n,'w')
    outfile.write(doc)
    outfile.close()

As I did not change the R code since the last post, let’s have a look at the results:

terms(lda.model,20)
      Topic 1   Topic 2   Topic 3     Topic 4   
 [1,] "enron"   "time"    "pleas"     "deal"    
 [2,] "busi"    "thank"   "thank"     "gas"     
 [3,] "manag"   "day"     "attach"    "price"   
 [4,] "meet"    "dont"    "email"     "contract"
 [5,] "market"  "call"    "enron"     "power"   
 [6,] "compani" "week"    "agreement" "market"  
 [7,] "vinc"    "look"    "fax"       "chang"   
 [8,] "report"  "talk"    "call"      "rate"    
 [9,] "time"    "hope"    "copi"      "trade"   
[10,] "energi"  "ill"     "file"      "day"     
[11,] "inform"  "tri"     "messag"    "month"   
[12,] "pleas"   "bit"     "inform"    "compani" 
[13,] "trade"   "guy"     "phone"     "energi"  
[14,] "risk"    "night"   "send"      "transact"
[15,] "discuss" "friday"  "corp"      "product" 
[16,] "regard"  "weekend" "kay"       "term"    
[17,] "team"    "love"    "review"    "custom"  
[18,] "plan"    "item"    "receiv"    "cost"    
[19,] "servic"  "email"   "question"  "thank"   
[20,] "offic"   "peopl"   "draft"     "purchas"

One at a time, I will try to interpret what each topic is trying to describe:

  1. This one appears to be a business process topic, containing a lot of general business terms, with a few even relating to meetings.
  2. Similar to the last model that I derived, this topic has a lot of time related words in it such as: time, day, week, night, friday, weekend.  I’ll be interested to see if this is another business meeting/interview/social meeting topic, or whether it describes something more social.
  3. Hrm, this topic seems to contain a lot of general terms used when we talk about communication: email, agreement, fax, call, message, inform, phone, send, review, question.  It even has please and thank you!  I suppose it’s very formal and you could perhaps interpret this as professional sounding administrative emails.
  4. This topic seems to be another case of emails containing a lot of ‘shop talk’.

Okay, let’s see if we can find some examples for each topic:

sample(which(df.emails.topics$"1" > .95),3)
[1] 27771 45197 27597

enron[[27771]]

 Christi's call.
 
  
     
 
 	Christi has asked me to schedule the above meeting/conference call.  September 11th (p.m.) seems to be the best date.  Question:  Does this meeting need to be a 1/2 day meeting?  Christi and I were wondering.
 
 	Give us your thoughts.

Yup, business process, meeting. This email fits the bill! Next!

enron[[45197]]

 
 Bob, 
 
 I didn't check voice mail until this morning (I don't have a blinking light.  
 The assistants pick up our lines and amtel us when voice mails have been 
 left.)  Anyway, with the uncertainty of the future business under the Texas 
 Desk, the following are my goals for the next six months:
 
 1)  Ensure a smooth transition of HPL to AEP, with minimal upsets to Texas 
 business.
 2)  Develop operations processes and controls for the new Texas Desk.   
 3)  Develop a replacement
  a.  Strong push to improve Liz (if she remains with Enron and )
  b.  Hire new person, internally or externally
 4)  Assist in develop a strong logisitcs team.  With the new business, we 
 will need strong performers who know and accept their responsibilites.
 
 1 and 2 are open-ended.  How I accomplish these goals and what they entail 
 will depend how the Texas Desk (if we have one) is set up and what type of 
 activity the desk will be invovled in, which is unknown to me at this time.  
 I'm sure as we get further into the finalization of the sale, additional and 
 possibly more urgent goals will develop.  So, in short, who knows what I need 
 to do.
 
 D

This one also seems to fit the bill. “D” here is writing about his/her goals for the next six months and considers briefly how to accomplish them. Not heavy into the content of the business, so I’m happy here. On to topic 2:

sample(which(df.emails.topics$"2" > .95),3)
[1] 50356 22651 19259

enron[[50356]]

I agree it is Matt, and  I believe he has reviewed this tax stuff (or at 
 least other turbine K's) before.  His concern will be us getting some amount 
 of advance notice before title transfer (ie, delivery).  Obviously, he might 
 have some other comments as well.  I'm happy to send him the latest, or maybe 
 he can access the site?
 
 Kay
 
 
    
  
 Given that the present form of GE world hunger seems to be more domestic than 
 international it would appear that Matt Gockerman would be a good choice for 
 the Enron- GE tax discussion.  Do you want to contact him or do you want me 
 to.   I would be interested in listening in on the conversation for 
 continuity. 

Here, the conversants seem to be talking about having a phone conversation with “Matt” to get his ideas on a tax discussion. This fits in with the meeting theme. Next!

enron[[22651]]

 LOVE
 HONEY PIE

Well, that was pretty social, wasn’t it? 🙂 Okay one more from the same topic:

enron[[19259]]

  Mime-Version: 1.0
  Content-Transfer-Encoding: 7bit
 X- X- X- X-b X-Folder: \ExMerge - Giron, Darron C.\Sent Items
 X-Origin: GIRON-D
 X-FileName: darron giron 6-26-02.PST
 
 Sorry.  I've got a UBS meeting all day.  Catch you later.  I was looking forward to the conversation.
 
 DG
 
  
     
 It seems everyone agreed to Ninfa's.  Let's meet at 11:45; let me know if a
 different time is better.  Ninfa's is located in the tunnel under the JP
 Morgan Chase Tower at 600 Travis.  See you there.
 
 Schroeder

Woops, header info that I didn’t manage to filter out :(. Anyway, DG writes about an impending conversation, and Schroeder writes about a specific time for their meeting. This fits! Next topic!

sample(which(df.emails.topics$"3" > .95),3)
[1] 24147 51673 29717

enron[[24147]]

Kaye:  Can you please email the prior report to me?  Thanks.
 
 Sara Shackleton
 Enron North America Corp.
 1400 Smith Street, EB 3801a
 Houston, Texas  77002
 713-853-5620 (phone)
 713-646-3490 (fax)


 	04/10/2001 05:56 PM
 			  		  
 
 At Alan's request, please provide to me by e-mail (with a  Thursday of this week your suggested changes to the March 2001 Monthly 
 Report, so that we can issue the April 2001 Monthly Report by the end of this 
 week.  Thanks for your attention to this matter.
 
 Nita

This one definitely fits in with the professional sounding administrative emails interpretation. Emailing reports and such. Next!

 I believe this was intended for Susan Scott with ETS...I'm with Nat Gas trading.
 
 Thanks
 
 
 
     
 FYI...another executed capacity transaction on EOL for Transwestern.
 
  
     
 This message is to confirm your EOL transaction with Transwestern Pipeline Company.
 You have successfully acquired the package(s) listed below.  If you have questions or
 concerns regarding the transaction(s), please call Michelle Lokay at (713) 345-7932
 prior to placing your nominations for these volumes.
 
 Product No.:	39096
 Time Stamp:	3/27/01	09:03:47 am
 Product Name:	US PLCapTW Frm CenPool-OasisBlock16
  
 Shipper Name:  E Prime, Inc.
 
 Volume:	10,000 Dth/d  
 					
 Rate:	$0.0500 /dth 1-part rate (combined  Res + Com) 100% Load Factor
 		+ applicable fuel and unaccounted for
 	
 TW K#: 27548		
 
 Effective  
 Points:	RP- (POI# 58649)  Central Pool      10,000 Dth/d
 		DP- (POI# 8516)   Oasis Block 16  10,000 Dth/d
 
 Alternate Point(s):  NONE
 
 
 Note:     	In order to place a nomination with this agreement, you must log 
 	            	off the TW system and then log back on.  This action will update
 	            	the agreement's information on your PC and allow you to place
 		nominations under the agreement number shown above.
 
 Contact Info:		Michelle Lokay
 	 			Phone (713) 345-7932
               			Fax       (713) 646-8000

Rather long, but even the short part at the beginning falls under the right category for this topic! Okay, let’s look at the final topic:

sample(which(df.emails.topics$"4" > .95),3)
[1] 39100  31681  6427

enron[[39100]]

 Randy, your proposal is fine by me.  Jim

Hrm, this is supposed to be a ‘business content’ topic, so I suppose I can see why this email was classified as such. It doesn’t take long to go from ‘proposal’ to ‘contract’ if you free associate, right? Next!

enron[[31681]]

 Attached is the latest version of the Wildhorse Entrada Letter.  Please 
 review.  I reviewed the letter with Jim Osborne and Ken Krisa yesterday and 
 should get their comments today.  My plan is to Fedex to Midland for Ken's 
 signature tomorrow morning and from there it will got to Wildhorse.  

This one makes me feel a little better, referencing a specific business letter that the emailer probably wants the emailed person to see. Let’s find one more for good luck:

enron[[6427]]

 At a ratio of 10:1, you should have your 4th one signed and have the fifth 
 one on the way...
 
  	09/19/2000 05:40 PM
   		  		  
 ONLY 450!  Why, I thought you guys hit 450 a long time ago.
 
 Marie Heard
 Senior Legal Specialist
 Enron Broadband Services
 Phone:  (713) 853-3907
 Fax:  (713) 646-8537

 	09/19/00 05:34 PM
		  		  		  
 Well, I do believe this makes 450!  A nice round number if I do say so myself!

 	Susan Bailey
 	09/19/2000 05:30 PM

 We have received an executed Master Agreement:
 
 
 Type of Contract:  ISDA Master Agreement (Multicurrency-Cross Border)
 
 Effective  
 Enron Entity:   Enron North America Corp.
 
 Counterparty:   Arizona Public Service Company
 
 Transactions Covered:  Approved for all products with the exception of: 
 Weather
           Foreign Exchange
           Pulp & Paper
 
 Special Note:  The Counterparty has three (3) Local Business Days after the 
 receipt of a Confirmation from ENA to accept or dispute the Confirmation.  
 Also, ENA is the Calculation Agent unless it should become a Defaulting 
 Party, in which case the Counterparty shall be the Calculation Agent.
 
 Susan S. Bailey
 Enron North America Corp.
 1400 Smith Street, Suite 3806A
 Houston, Texas 77002
 Phone: (713) 853-4737
 Fax: (713) 646-3490

That one was very long, but there’s definitely some good business content in it (along with some happy banter about the contract that I guess was acquired).

All in all, I’d say that fixing those regex patterns that were supposed to filter out the caution/privacy messages at the ends of peoples’ emails was a big boon to the LDA analysis here.

Let that be a lesson: half the battle in LDA is in filtering out the noise!

A Rather Nosy Topic Model Analysis of the Enron Email Corpus

Having only ever played with Latent Dirichlet Allocation using gensim in python, I was very interested to see a nice example of this kind of topic modelling in R.  Whenever I see a really cool analysis done, I get the urge to do it myself.  What better corpus to do topic modelling on than the Enron email dataset?!?!?  Let me tell you, this thing is a monster!  According to the website I got it from, it contains about 500k messages, coming from 151 mostly senior management users and is organized into user folders.  I didn’t want to accept everything into my analysis, so I made the decision that I would only look into messages contained within the “sent” or “sent items” folders.

Being a large advocate of R, I really really tried to do all of the processing and analysis in R, but it was just too difficult and was taking up more time than I wanted.  So I dusted off my python skills (thank you grad school!) and did the bulk of the data processing/preparation in python, and the text mining in R.  Following is the code (hopefully well enough commented) that I used to process the corpus in python:


docs = []
from os import listdir, chdir
import re
# Here's my attempt at coming up with regular expressions to filter out
# parts of the enron emails that I deem as useless.
email_pat = re.compile(".+@.+")
to_pat = re.compile("To:.+\n")
cc_pat = re.compile("cc:.+\n")
subject_pat = re.compile("Subject:.+\n")
from_pat = re.compile("From:.+\n")
sent_pat = re.compile("Sent:.+\n")
received_pat = re.compile("Received:.+\n")
ctype_pat = re.compile("Content-Type:.+\n")
reply_pat = re.compile("Reply- Organization:.+\n")
date_pat = re.compile("Date:.+\n")
xmail_pat = re.compile("X-Mailer:.+\n")
mimver_pat = re.compile("MIME-Version:.+\n")
contentinfo_pat = re.compile("----------------------------------------------.+----------------------------------------------")
forwardedby_pat = re.compile("----------------------.+----------------------")
caution_pat = re.compile('''\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*.+\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*''')
privacy_pat = re.compile(" _______________________________________________________________.+ _______________________________________________________________")
# The enron emails are in 151 directories, one for each senior management
# employee whose email account was entered into the dataset.
# The task here is to go into each folder, and enter each
# email text file into one long nested list.
# I've used readlines() to read in the emails because read()
# didn't seem to work with these email files.
chdir("/home/inkhorn/enron")
names = [d for d in listdir(".") if "." not in d]
for name in names:
    chdir("/home/inkhorn/enron/%s" % name)
    subfolders = listdir('.')
    sent_dirs = [n for n, sf in enumerate(subfolders) if "sent" in sf]
    sent_dirs_words = [subfolders[i] for i in sent_dirs]
    for d in sent_dirs_words:
        chdir('/home/inkhorn/enron/%s/%s' % (name,d))
        file_list = listdir('.')
        docs.append([" ".join(open(f, 'r').readlines()) for f in file_list if "." in f])
# Here i go into each email from each employee, try to filter out all the useless stuff,
# then paste the email into one long flat list. This is probably inefficient, but oh well – python
# is pretty fast anyway!
docs_final = []
for subfolder in docs:
    for email in subfolder:
        if ".nsf" in email:
            etype = ".nsf"
        elif ".pst" in email:
            etype = ".pst"
        email_new = email[email.find(etype)+4:]
        email_new = to_pat.sub('', email_new)
        email_new = cc_pat.sub('', email_new)
        email_new = subject_pat.sub('', email_new)
        email_new = from_pat.sub('', email_new)
        email_new = sent_pat.sub('', email_new)
        email_new = email_pat.sub('', email_new)
        if "-----Original Message-----" in email_new:
            email_new = email_new.replace("-----Original Message-----","")
        email_new = ctype_pat.sub('', email_new)
        email_new = reply_pat.sub('', email_new)
        email_new = date_pat.sub('', email_new)
        email_new = xmail_pat.sub('', email_new)
        email_new = mimver_pat.sub('', email_new)
        email_new = contentinfo_pat.sub('', email_new)
        email_new = forwardedby_pat.sub('', email_new)
        email_new = caution_pat.sub('', email_new)
        email_new = privacy_pat.sub('', email_new)
        docs_final.append(email_new)
# Here I proceed to dump each and every email into about 126 thousand separate
# txt files in a newly created 'data' directory. This gets it ready for entry into a Corpus using the tm (textmining)
# package from R.
for n, doc in enumerate(docs_final):
    outfile = open("/home/inkhorn/enron/data/%s.txt" % n,'w')
    outfile.write(doc)
    outfile.close()

After having seen python’s performance in rifling through these enron emails, I was very impressed!  It was very agile in creating a directory with the largest number of files I’d ever seen on my computer!

Okay, so now I had a directory filled with a whole lot of text files.  The next step was to bring them into R so that I could submit them to the LDA.  Following is the R code that I used:


library(stringr)
library(plyr)
library(tm)
library(tm.plugin.mail)
library(SnowballC)
library(topicmodels)
# At this point, the python script should have been run,
# creating about 126 thousand txt files. I was very much afraid
# to import that many txt files into the tm package in R (my computer only
# runs on 8GB of RAM), so I decided to mark 60k of them for a sample, and move the
# rest of them into a separate directory
email_txts = list.files('data/')
email_txts_sample = sample(email_txts, 60000)
email_rename = data.frame(orig=email_txts_sample, new=sub(".txt",".rxr", email_txts_sample))
file.rename(str_c('data/',email_rename$orig), str_c('data/',email_rename$new))
# At this point, all of the non-sampled emails (labelled .txt, not .rxr)
# need to go into a different directory. I created a directory that I called
# nonsampled/ and moved the files there via the terminal command "mv *.txt nonsampled/".
# It's very important that you don't try to do this via a file explorer, windows or linux,
# as the act of trying to display that many file icons is apparently very difficult for a regular machine :$
enron = Corpus(DirSource("/home/inkhorn/enron/data"))
extendedstopwords=c("a","about","above","across","after","MIME Version","forwarded","again","against","all","almost","alone","along","already","also","although","always","am","among","an","and","another","any","anybody","anyone","anything","anywhere","are","area","areas","aren't","around","as","ask","asked","asking","asks","at","away","b","back","backed","backing","backs","be","became","because","become","becomes","been","before","began","behind","being","beings","below","best","better","between","big","both","but","by","c","came","can","cannot","can't","case","cases","certain","certainly","clear","clearly","come","could","couldn't","d","did","didn't","differ","different","differently","do","does","doesn't","doing","done","don't","down","downed","downing","downs","during","e","each","early","either","end","ended","ending","ends","enough","even","evenly","ever","every","everybody","everyone","everything","everywhere","f","face","faces","fact","facts","far","felt","few","find","finds","first","for","four","from","full","fully","further","furthered","furthering","furthers","g","gave","general","generally","get","gets","give","given","gives","go","going","good","goods","got","great","greater","greatest","group","grouped","grouping","groups","h","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","her","here","here's","hers","herself","he's","high","higher","highest","him","himself","his","how","however","how's","i","i'd","if","i'll","i'm","important","in","interest","interested","interesting","interests","into","is","isn't","it","its","it's","itself","i've","j","just","k","keep","keeps","kind","knew","know","known","knows","l","large","largely","last","later","latest","least","less","let","lets","let's","like","likely","long","longer","longest","m","made","make","making","man","many","may","me","member","members","men","might","more","most","mostly","mr","mrs","much","must","mustn't","my","myself","n","necessary","need","needed","needing","needs","never","new","newer","newest","next","no","nobody","non","noone","nor","not","nothing","now","nowhere","number","numbers","o","of","off","often","old","older","oldest","on","once","one","only","open","opened","opening","opens","or","order","ordered","ordering","orders","other","others","ought","our","ours","ourselves","out","over","own","p","part","parted","parting","parts","per","perhaps","place","places","point","pointed","pointing","points","possible","present","presented","presenting","presents","problem","problems","put","puts","q","quite","r","rather","really","right","room","rooms","s","said","same","saw","say","says","second","seconds","see","seem","seemed","seeming","seems","sees","several","shall","shan't","she","she'd","she'll","she's","should","shouldn't","show","showed","showing","shows","side","sides","since","small","smaller","smallest","so","some","somebody","someone","something","somewhere","state","states","still","such","sure","t","take","taken","than","that","that's","the","their","theirs","them","themselves","then","there","therefore","there's","these","they","they'd","they'll","they're","they've","thing","things","think","thinks","this","those","though","thought","thoughts","three","through","thus","to","today","together","too","took","toward","turn","turned","turning","turns","two","u","under","until","up","upon","us","use","used","uses","v","very","w","want","wanted","wanting","wants","was","wasn't","way","ways","we","we'd","well","we'll","wells","went","were","we're","weren't","we've","what","what's","when","when's","where","where's","whether","which","while","who","whole","whom","who's","whose","why","why's","will","with","within","without","won't","work","worked","working","works","would","wouldn't","x","y","year","years","yes","yet","you","you'd","you'll","young","younger","youngest","your","you're","yours","yourself","yourselves","you've","z")
dtm.control = list(
tolower = T,
removePunctuation = T,
removeNumbers = T,
stopwords = c(stopwords("english"),extendedstopwords),
stemming = T,
wordLengths = c(3,Inf),
weighting = weightTf)
dtm = DocumentTermMatrix(enron, control=dtm.control)
dtm = removeSparseTerms(dtm,0.999)
dtm = dtm[rowSums(as.matrix(dtm))>0,]
k = 4
# Beware: this step takes a lot of patience! My computer was chugging along for probably 10 or so minutes before it completed the LDA here.
lda.model = LDA(dtm, k)
# This enables you to examine the words that make up each topic that was calculated. Bear in mind that I've chosen to stem all words possible in this corpus, so some of the words output will look a little weird.
terms(lda.model,20)
# Here I construct a dataframe that scores each document according to how closely its content
# matches up with each topic. The closer the score is to 1, the more likely its content matches
# up with that particular topic.
emails.topics = posterior(lda.model, dtm)$topics
df.emails.topics = as.data.frame(emails.topics)
df.emails.topics = cbind(email=as.character(rownames(df.emails.topics)),
df.emails.topics, stringsAsFactors=F)

Phew, that took a lot of computing power! Now that it’s done, let’s look at the results of the terms(lda.model,20) call near the end of the above code:

      Topic 1   Topic 2     Topic 3      Topic 4     
 [1,] "time"    "thank"     "market"     "email"     
 [2,] "vinc"    "pleas"     "enron"      "pleas"     
 [3,] "week"    "deal"      "power"      "messag"    
 [4,] "thank"   "enron"     "compani"    "inform"    
 [5,] "look"    "attach"    "energi"     "receiv"    
 [6,] "day"     "chang"     "price"      "intend"    
 [7,] "dont"    "call"      "gas"        "copi"      
 [8,] "call"    "agreement" "busi"       "attach"    
 [9,] "meet"    "question"  "manag"      "recipi"    
[10,] "hope"    "fax"       "servic"     "enron"     
[11,] "talk"    "america"   "rate"       "confidenti"
[12,] "ill"     "meet"      "trade"      "file"      
[13,] "tri"     "mark"      "provid"     "agreement" 
[14,] "night"   "kay"       "issu"       "thank"     
[15,] "friday"  "corp"      "custom"     "contain"   
[16,] "peopl"   "trade"     "california" "address"   
[17,] "bit"     "ena"       "oper"       "contact"   
[18,] "guy"     "north"     "cost"       "review"    
[19,] "love"    "discuss"   "electr"     "parti"     
[20,] "houston" "regard"    "report"     "contract"

Here’s where some really subjective interpretation is required, just like in PCA analysis.  Let’s try to interpret the topics, one at a time:

  1. I see a lot of words related to time in this topic, and then I see the word ‘meet’.  I’ll call this the meeting (business or otherwise) topic!
  2. I’m not sure how to interpret this second topic, so perhaps I’ll chalk it up to noise in my analysis!
  3. This topic contains a lot of ‘business content’ words, so it appears to be a kind of ‘talking shop’ topic.
  4. This topic, while still pretty ‘businessy’, appears to be less about the content of the business and more about the processes, or perhaps legalities of the business.

For each of the sensible topics (1,3,4), let’s bring up some emails that scored highly on these topics to see if the analysis makes sense:

sample(which(df.emails.topics$"1" > .95), 10)
 [1] 53749 32102 16478 36204 29296 29243 47654 38733 28515 53254
enron[[32102]]

 I will be out of the office next week on Spring Break. Can you participate on 
 this call? Please report what is said to Christian Yoder 503-464-7845 or 
 Steve Hall 503-4647795

 	03/09/2001 05:48 PM

 I don't know, but I will check with our client.

 Our client Avista Energy has received the communication, below, from the ISO
 regarding withholding of payments to creditors of monies the ISO has
 received from PG&E.  We are interested in whether any of your clients have
 received this communication, are interested in this issue and, if so,
 whether you have any thoughts about how to proceed.

 You are invited to participate in a conference call to discuss this issue on
 Monday, March 12, at 10:00 a.m.

 Call-in number: (888) 320-6636
 Host: Pritchard
 Confirmation number: 1827-1922

 Diane Pritchard
 Morrison & Foerster LLP
 425 Market Street
 San Francisco, California 94105
 (415) 268-7188

So this one isn’t a business meeting in the physical sense, but is a conference call, which still falls under the general category of meetings.

enron[[29243]]
 Hey Fritz.  I am going to send you an email that attaches a referral form to your job postings.  In addition, I will also personally tell the hiring manager that I have done this and I can also give him an extra copy of youe resume.  Hopefully we can get something going here....

 Tori,

 I received your name from Diane Hoyuela. You and I spoke
 back in 1999 about the gas industry. I tried briefly back
 in 1999 and found few opportunities during de-regulations
 first few steps. Well,...I'm trying again. I've been
 applying for a few job openings at Enron and was wondering
 if you could give me an internal referral. Also, any advice
 on landing a position at Enron or in general as a scheduler
 or analyst.
 Last week I applied for these positions at Enron; gas
 scheduler 110360, gas analyst 110247, and book admin.
 110129. I have a pretty good understanding of the gas
 market.

 I've attached my resume for you. Congrats. on the baby!
 I'll give you a call this afternoon to follow-up, I know
 mornings are your time.
 Regards,

 Fritz Hiser

 __________________________________________________
 Do You Yahoo!?
 Get email alerts & NEW webcam video instant messaging with Yahoo! Messenger. http://im.yahoo.com

That one obviously shows someone who was trying to get a job at Enron and wanted to call “this afternoon to follow-up”. Again, a ‘call’ rather than a physical meeting.

Finally,

enron[[29296]]

 Susan,

 Well you have either had a week from hell so far or its just taking you time
 to come up with some good bs.  Without being too forward I will be in town
 next Friday and wanted to know if you would like to go to dinner or
 something.  At least that will give us a chance to talk face to face.  If
 your busy don't worry about it I thought I would just throw it out there.

 I'll keep this one short and sweet since the last one was rather lengthy.
 Hope this Thursday is a little better then last week.

 Kyle

 _________________________________________________________________________
 Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com.

 Share information about yourself, create your own public profile at
 http://profiles.msn.com.

Ahh, here’s a particularly juicy one. Kyle here wants to go to dinner, “or something” (heh heh heh) with Susan to get a chance to talk face to face with her. Finally, a physical meeting (maybe very physical…) lumped into a category with other business meetings in person or on the phone.

Okay, now let’s switch to topic 3, the “business content” topic.

sample(which(df.emails.topics$"3" > .95), 10)
 [1] 40671 26644  5398 52918 37708  5548 15167 56149 47215 26683

enron[[40671]]

 Please change the counterparty on deal 806589 from TP2 to TP3 (sorry about that).

Okay, that seems fairly in the realm of business content, but I don’t know what the heck it means. Let’s try another one:

enron[[5548]]

Phillip, Scott, Hunter, Tom and John -

 Just to reiterate the new trading guidelines on PG&E Energy Trading:

 1.  Both financial and physical trading are approved, with a maximum tenor of 18 months

 2.  Approved entities are:	PG&E Energy Trading - Gas Corporation
 				PG&E Energy Trading - Canada Corporation

 				NO OTHER PG&E ENTITIES ARE APPROVED FOR TRADING

 3.  Both EOL and OTC transactions are OK

 4.  Please call Credit (ext. 31803) with details on every OTC transaction.  We need to track all new positions with PG&E Energy Trading on an ongoing basis.  Please ask the traders and originators on your desks to notify us with the details on any new transactions immediately upon execution.  For large transactions (greater than 2 contracts/day or 5 BCF total), please call for approval before transacting.

 Thanks for your assistance; please call me (ext. 53923) or Russell Diamond (ext. 57095) if you have any questions.

 Jay

That one is definitely oozing with business content. Note the terms such as “Energy Trading”, and “Gas Corporation”, etc. Finally, one more:

enron[[26683]]

Hi Kathleen, Randy, Chris, and Trish,

 Attached is the text of the August issue of The Islander.  The headings will
 be lined up when Trish adds the art and ads.  A calendar, also, which is in
 the next e-mail.

 I'll appreciate your comments by the end of tomorrow, Monday.

 There are open issues which I sure hope get resolved before printing:

 1.  I'm waiting for a reply from Mike Bass regarding tenses on the Home Depot
 article.  Don't know if there's one developer or more and what the name(s)
 is/are.

 2.  Didn't hear back from Ted Weir regarding minutes for July's water board
 meeting.  I think there are 2 meetings minutes missed, 6/22 and July.

 3.  Waiting to hear back from Cheryl Hanks about the 7/6 City Council and 6/7
 BOA meetings minutes.

 4.  Don't know the name of the folks who were honored with Yard of the Month.
  They're at 509 Narcissus.

 I'm not feeling very good about the missing parts but need to move on
 schedule!  I'm also looking for a good dictionary to check the spellings of
 ettouffe, tree-house and orneryness.  (Makes me feel kind of ornery, come to
 think about it!)

 Please let me know if you have revisions.  Hope your week is starting out
 well.

 'Nita

Alright, this one seems to be a mix between business content and process. So I can see how it was lumped into this topic, but it doesn’t quite have the perfection that I would like.

Finally, let’s move on to topic 4, which appeared to be a ‘business process’ topic to me. I’m suspicious of this topic, as I don’t think I successfully filtered out everything that I wanted to:

sample(which(df.emails.topics$"4" > .95), 10)
 [1] 51205  5129 48826 51214 55337 15843 52543 11978 48337  2609

enron[[5129]]

very funny today...during the free fall, couldn't price jv and xh low enough 
 on eol, just kept getting cracked.  when we stabilized, customers came in to 
 buy and couldnt price it high enough.  winter versus apr went from +23 cents 
 when we were at the bottom to +27 when april rallied at the end even though 
 it should have tightened theoretically.  however, april is being supported 
 just off the strip.  getting word a lot of utilities are going in front of 
 the puc trying to get approval for hedging programs this year.  

 hey johnny. hope all is well. what u think hrere? utuilites buying this break
 down? charts look awful but 4.86 ish is next big level.
 jut back from skiing in co, fun but took 17 hrs to get home and a 1.5 days to
 get there cuz of twa and weather.

Hrm, this one appears to be some ‘shop talk’, and isn’t too general. I’m not sure how this applies to the topic 4 words. Let’s try another one:

enron[[55337]]

Fran, do you have an updated org chart that I could send to the Measurement group?
 	Thanks. Lynn

    Cc:	Estalee Russi

 Lynn,

 Attached are the org charts for ETS Gas Logistics:

 Have a great weekend.  Thanks!

 Miranda

Here we go. This one seems to fall much more into the ‘business process’ realm. Let’s see if I can find another good example:

enron[[11978]]

 Bill,

 As per our conversation today, I am sending you an outline of what we intend to be doing in Ercot and in particular on the real-time desk. For 2002 Ercot is split into 4 zones with TCRs between 3 of the zones. The zones are fairly diverse from a supply/demand perspective. Ercot has an average load of 38,000 MW, a peak of 57,000 MW with a breakdown of 30% industrial, 30% commercial and 40% residential. There are already several successful aggregators that are looking to pass on their wholesale risk to a credit-worthy QSE (Qualified Scheduling Entity). 

 Our expectation is that we will be a fully qualified QSE by mid-March with the APX covering us up to that point. Our initial on-line products will include a bal day and next day financial product. (There is no day ahead settlement in this market). There are more than 10 industrial loads with greater than 150 MW concentrated at single meters offering good opportunities for real-time optimization. Our intent is to secure one of these within the next 2 months.

 I have included some price history to show the hourly volatility and a business plan to show the scope of the opportunity. In addition, we have very solid analytics that use power flow simulations to map out expected outcomes in the real-time market.

 The initial job opportunity will involve an analysis of the real-time market as it stands today with a view to trading around our information. This will also drive which specific assets we approach to manage. As we are loosely combining our Texas gas and Ercot power desks our information flow will be superior and I believe we will have all the tools needed for a successful real-time operation.

 Let me know if you have any further questions.

 Thanks,

 Doug

Again, I seem to have found an email that straddles the boundary between business process and business content. Okay, I guess this topic isn’t the clearest in describing each of the examples that I found!

Overall, I probably could have done a bit more to filter out the useless stuff to construct topics that were better in describing the examples that they represent. Also, I’m not sure if I should be surprised or not that I didn’t pick up some sort of ‘social banter’ topic, where people were emailing about non-business topics. I suppose that social banter emails might be less predictable in their content, but maybe somebody much smarter than I am can tell me the answer 🙂

If you know how I can significantly ramp up the quality of this analysis, feel free to contribute your comments!

When did “How I Met Your Mother” become less legen.. wait for it…

…dary!  Or, as you’ll see below, when did it become slightly less legendary?  The analysis in this post was inspired by DiffusePrioR’s analysis of when The Simpsons became less Cromulent. When I read his post a while back, I thought it was pretty cool and told myself that I would use his method on another relevant dataset in the future.

Enter IMDB average user ratings of every episode of How I Met Your Mother (up to season 9, episode 5).  Once I brought up the page showing rated episodes by date, I simply copied and pasted the table into LibreOffice Calc, saved it as a csv, and then loaded it into R.  As you can see in DiffusePrioR’s post (and also in mine), the purpose of the changepoint package is to find changepoints, or abrupt baseline shifts, in your data.

Now, some R code:

library(changepoint)
library(ggplot2)
library(zoo)

himym = read.csv("himym.csv")
mean2.pelt = cpt.mean(himym$Rating, method="PELT")
plot(mean2.pelt, type='l', cpt.col='red',xlab="Episode",
     ylab='IMDB Avg Rating', cpt.width=2, main="Changepoints in IMDB Ratings of 'How I Met Your Mother'",ylim=c(0,10), xlim=c(0,200))

Changepoints in IMDB Ratings of HIMYM
Above we see that the changepoint method I used has detected two baselines for the rating data.  One baseline has its value at 8.1, and the other has its value at 7.6.  The first thing you have to admit here is that despite a downwards shift in the average rating, How I Met Your Mother still seems to be pretty popular and held in high regard.  That being said, I do feel as though the earlier episodes were better, and apparently the masses of IMDB users rating each episode generally agree!  So, what’s the first episode that marked the abrupt downwards shift in average rating for the show?
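
If you’d rather not read the shift off the plot, the fitted changepoint object can also be queried directly for where the change happens and what the segment means are.  A quick sketch using the same fitted object:

# Pull the detected changepoint location(s) and the per-segment mean ratings
# straight from the fitted cpt object rather than eyeballing the plot.
cpts(mean2.pelt)             # index/indices where the baseline shifts
param.est(mean2.pelt)$mean   # estimated mean rating within each segment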

himym[161,1:2]
    Ep.Num   Ep.Title
161    8.1 Farhampton

Ah, Farhampton, Season 8, Episode 1.  That seems like a natural breakpoint!  But the low score doesn’t even come until Season 8 Episode 2.  Let’s see what people have to say about episodes 1 and 2 on IMDB (forgive their punctuation, spelling, and grammar):

“Boring return I thought as once again the writing was horrible, the show seems to want to go on forever and also some of the actors in the show deserve better then what they are getting week to week. Even at the end it seems we finally see the mother of the show but we do not see her face.”

“The first episode however.was not funny at all, the thrill of maybe finding who is the mother made me think that it was not too late, but then again the joke is on me. This second episode (The Pre-Nup) only made me think that this show will not find a good finale, using all the clichés on the book of the sitcom. I just don’t care about who is the mother anymore, I just want a honesty last season, showing the characters as human beings that learn with their mistakes and move on with their lifes.”

So it appears that some people were expecting the pace to quicken in Season 8, but were disappointed.  That being said, if you look back at the graph at the beginning of my post, you'll see that although Season 8 was the start of abruptly lower ratings, the average ratings had been dropping very slowly from well before that point.  Given that the rate of change is so slow, perhaps the changepoint package isn't sensitive enough to detect the "beginning of the end" for our purposes.

That’s why I decided to bin all How I Met Your Mother episodes into groups of 10 successive episodes each, and then calculate the proportion of episodes in each bin which have an average rating under 8.0.  You can see the graph below:

eps.under.8 = rollapply(himym$Rating, width=10, function (x) mean(x < 8), by=10, partial=TRUE)
ep.nums = paste(((seq(1,19)-1)*10)+1, ((seq(1,19)-1)*10)+10, sep="-")
ep.nums[19] = "181-189"
ep.grouped.ratings = data.frame(ep.nums, eps.under.8)

ggplot(ep.grouped.ratings, aes(x=reorder(ep.nums, seq(1,19)), y=eps.under.8, group=1)) +
  geom_line(size=3, colour="dark red") +
  scale_y_continuous(name="Proportion of Episodes/10 With a Below 8.0 Avg Rating") +
  scale_x_discrete(name="Episode Number Grouping") +
  theme(axis.text.x=element_text(angle=-90, vjust=.5, colour="black", size=14), axis.text.y=element_text(colour="black", size=14), axis.title.x=element_text(size=16, face='bold'), axis.title.y=element_text(size=16, face="bold"), plot.title=element_text(size=20, face="bold")) +
  geom_vline(xintercept=10, linetype='longdash') +
  ggtitle("How I Met Your Mother Ratings, Every 10 Episodes")

Grouped HIMYM Ratings

In the above graph, you can see that because I've set such a high threshold (8.0), the results look much more dramatic.  Up until episodes 71-80, the show enjoyed a very high proportion of ratings of 8.0 or over (80% for 6 of the episode groups, 90% in one, and 100% in the first!).  It isn't really until episodes 91-100, though, that you notice the ratings failing to stay above 8.0.  Which season and episodes were those?

himym[91:100,1:2]
    Ep.Num                           Ep.Title
91    5.30                          Robin 101
92    5.40              The Sexless Innkeeper
93    5.50                   Duel Citizenship
94    5.60                           Bagpipes
95    5.70                    The Rough Patch
96    5.80                       The Playbook
97    5.90 Slapsgiving 2: Revenge of the Slap
98    5.10                         The Window
99    5.11                Last Cigarette Ever
100   5.12                    Girls Vs. Suits

Well, now we have some evidence that quality started to go down earlier on in the show (although bear in mind we are still talking about a highly rated show here!).  Again, let’s go to IMDB comments to see what we can glean about these episodes:

(in reference to “The Sexless Inkeeper”) “So after waiting 9/10 months for Robin and Barney to finally stop looking for reasons not to admit their feelings to each other, Lily is euphoric when the two finally do concede that they love each other. And how does she celebrate this?, she invites them to an incredibly awkward, incredibly forward night that displays EVERYTHING that Barney and Robin hate about relationships and gives the couple a massive reason NOT to be together. Lily and Marshall then react, whence realising Robin and The Barnicle didn’t enjoy the evening, as if they are the couple that has been mistreated. Lily then dumps Robin and Barney without so much as a phone call and plays right into the pair’s issues of abandonment.”

(in reference to “Last Cigarette Ever”) “… “Last Cigarette Ever” is the worst one. There are many reasons why I feel this way, but the two big reasons are: 1. It made no sense. Up to this point, I believe only Robin had ever been shown smoking a cigarette. Now all of a sudden, they all are smokers. I’m not going to pretend that the writers haven’t made mistakes before, but this was a little ridiculous. 2. It felt like a half-hour Marlboro ad. I think the intent was to discourage smoking, but I personally felt like it did the opposite. Awful episode in an otherwise very good series.”

I’m finding there don’t tend to be many user reviews of episodes on IMDB, so this next one is from metacritic.com:

“How i met your mother is starting to annoy everyone as the jokes are repetitive, Ted, Marshall & Lilly have started to get boring. There were only 5-6 good episodes in the season which includes the PLAYBOOK. The story line and the characters seem to be awfully close to NBC’s ”F.R.I.E.N.D.S”, but friends steals the show with better writing and acting which HIMYM seems to lack. Only Neil’s role as Barney has helped CBS to keep this show on air for 5 seasons.”

In the end, it's hard to attribute shifts in average rating to these few comments, but at least they do highlight some of the same complaints I've had or heard about the show (although I didn't care so much about the cigarette episode!!).  All we can say here is that the ratings started to decline around the start of Season 5, and suffered a sharper (although still not grand-scale) decline at the beginning of Season 8.

Big and small daycares in Toronto by building type, mapped using RGoogleMaps and Toronto Open Data

Before my daughter was born, I thought that my wife and I would have to send her to a licensed child care centre somewhere in Toronto.  I had heard over and over how long of a waiting list I should expect the centre to have, and so we’d better get her registered nice and early!  Well, it turns out that we found an excellent unlicensed home day care which she’s been at for two years now.  So when I recently went on Toronto Open Data’s website and found a dataset of licensed child care centres throughout Toronto, I thought I might have a fun time analyzing a topic that I thankfully have not had to deal with thus far!

If you look in the dataset (or in the documentation for the dataset) you’ll see that it contains names, addresses, phone info, building type, number of spaces in the daycare (broken down into age categories, and then totalled up) and unprojected latitude/longitude coordinates.  This dataset literally begged me to map it, but it also begged me to use one of the number of spaces variables in a map as well!

The process which I used to create the maps is very similar to the maps I made when I was analyzing the Toronto Casino Feedback Form (Survey), except in the maps I’ve put in this post, the dots are bigger or smaller depending on the percentiles of a quantitative variable (in this case the total number of spaces in the child care centres of a particular building type).  You can find the R code I used to generate these maps and stats at the bottom of this post.
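To make the dot-sizing idea concrete before the full code: each centre's total space count gets converted to a percentile rank with ecdf(), and that rank scales the plotted point.  A stripped-down sketch with made-up numbers:

# Toy example: dot sizes from percentile ranks (space counts are made up)
spaces = c(15, 40, 74, 120, 217)
Fn = ecdf(spaces)   # empirical cumulative distribution function
Fn(spaces) * 4      # i.e. 0.8 1.6 2.4 3.2 4.0, used as the cex point sizes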

This is meant strictly as an exploratory exercise.  To keep things clear, I've created multiple maps, where each map shows locations of child care centres of one building type (e.g. Places of Worship, Public Elementary Schools, High Rise Buildings, etc.).  As I describe each map, I'll also refer to descriptive stats that I've calculated and displayed at the bottom of this post (after all of the R code).  If you look at the table at the bottom of this post, you'll notice there were more building types than I've mapped here.  That's because I didn't feel like mapping everything, only some of the most popular ones 🙂

Public Elementary Schools are by far the most popular type of building in which the licensed child care centres in this dataset are found (279 centres, according to the data).  Looking at the map below, you can see that there is a very dense cluster of public elementary school child care centres in the core of the GTA (downtown Toronto and North York).  As you go west towards Etobicoke and Rexdale, you definitely see fewer centres, and then in Scarborough you also see few centres, but they appear to be more dispersed and less clustered than in the other areas.  There's a lot of variability in the number of spaces in these child care centres, ranging from a minimum of 15 to a maximum of 217, with the average public elementary school centre having around 74 spaces.

Public Elementary Schools

Places of Worship, as you can see below, are far less numerous than their public elementary school counterparts, with only 116 registered in this dataset.  The first thing that I noticed with this map is that most of the small dots (that is, the smaller child care centres in places of worship) seem to fall in the south of the GTA, rather than the north.  I suppose that makes sense to me in an analogical kind of way.  In the north of the GTA (where I live) a lot of the businesses are big chains that seek to serve as many people as possible, whereas downtown there are a lot of smaller businesses that serve a niche market.  Perhaps it's a similar story with child care centres in places of worship as well.

On a side note, the vast majority of the places of worship mentioned in this dataset were of Christian or Catholic denominations.  I was perhaps surprised not to find too many synagogues in there, but that might just be my bias speaking!

Child care centres in places of worship ranged from having a minimum of 8 spaces to a maximum of 167 with an average of about 48 spaces per centre.
places of worship

High Rise child care centres seem to show a pretty distinctive geographical pattern, as you can see below!  They seem to be either in the east of Toronto, near or beyond Highway 404, or in the west of Toronto, most of them beyond Allen Road/Dufferin Street.  I wonder what accounts for what looks like a hole in the middle!?  Also, you'll notice that many of the smaller high rise child care centres are in the east of Toronto, rather than the west.  High rise child care centres range from having a minimum of 20 spaces to a maximum of 145, with an average of about 69 spaces per centre.  You'll notice that the minimum number of spaces is higher than in the other categories, probably because these centres are serving many residents in their own high rise building!
High Rises

Purpose Buildings, or buildings that were created with the child care centre in mind, are fairly sparse throughout Toronto, with only 58 in the dataset.  In terms of clusters here, it almost looks like you could delineate 4 clusters of buildings: North, South, East, and West.  Purpose buildings range from having a minimum of 20 spaces to a maximum of 165, with an average of about 72 spaces per centre (the median, however, is 60, suggesting that there are a few really big ones in there relative to all the rest).  Outliers aside, it seems that purpose buildings are like high rise daycares, in that they are meant for higher capacity than other centres.
Purpose Buildings

Community and Recreation Centres with child care seem to be laid out almost in a circle around the GTA.  The obvious exception is Scarborough, which seems to have very few community and recreation centres with child care compared to what's going on in the west.  Perhaps mirroring the phenomenon we saw with child care centres in places of worship, a lot of the smaller community and recreation centres with child care are in the south, whereas the north is the domain of bigger centres.  These centres range from having a minimum of 13 spaces to a maximum of 146, with an average of about 64 spaces per centre.
Community and Recreation Centres

Although child care centres in Houses seem to show a very random looking pattern, I can't help but notice that there are more than a few centres in close proximity to the GO Train tracks emanating from Union Station.  Perhaps there's an interesting story there, or maybe I'm just seeing patterns that don't exactly mean anything (after all, there are just 38 of these places registered in the dataset!).  Child care centres in houses range from having a minimum of 10 spaces to a maximum of 116, with an average of about 50 spaces per centre.  I do have to wonder how exactly these houses fit so many kids.  Seeing as we are living in a post-Google-Street-View world, you can just look at whatever house you want in living colour by typing in the address.  You can see how big one of these houses is (it has 87 spaces!) below the following map.
Houses

Google Street View of the house with 87 spaces
Wow!  It’s not a full picture, but you get the sense that the house really is quite big!

Well, so concludes my foray into the world of licensed child care centres.  If you have any commentary to add regarding these results, or can show me a better way of mapping them (although I do like RGoogleMaps), then by all means leave me a comment!  R code is shared below.


library(ff)
library(ffbase)
library(RgoogleMaps)
library(plyr)
addTrans <- function(color,trans)
{
  # This function adds transparency to a color.
  # Define transparency with an integer between 0 and 255:
  # 0 being fully transparent and 255 being fully visible.
  # Works with either color and trans a vector of equal length,
  # or one of the two of length 1.
  if (length(color)!=length(trans)&!any(c(length(color),length(trans))==1)) stop("Vector lengths not correct")
  if (length(color)==1 & length(trans)>1) color <- rep(color,length(trans))
  if (length(trans)==1 & length(color)>1) trans <- rep(trans,length(color))
  # Convert an integer from 0-255 into a two-character hex string
  num2hex <- function(x)
  {
    hex <- unlist(strsplit("0123456789ABCDEF",split=""))
    return(paste(hex[(x-x%%16)/16+1],hex[x%%16+1],sep=""))
  }
  rgb <- rbind(col2rgb(color),trans)
  res <- paste("#",apply(apply(rgb,2,num2hex),2,paste,collapse=""),sep="")
  return(res)
}
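# Read the child care centres and a postal code / coordinate lookup with ff, then merge Lat/Long onto the centres by postal code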
childcare = read.csv.ffdf(file="child-care.csv", first.rows=500,next.rows=500,colClasses=NA,header=TRUE)
pcodes = read.csv.ffdf(file="zipcodeset.txt", first.rows=50000, next.rows=50000, colClasses=NA, header=FALSE)
childcare$PCODE_R = as.ff(as.factor(sub(" ","", childcare[,"PCODE"])))
names(pcodes) = c("PCODE","Lat","Long","City","Prov")
childcare = merge(childcare, as.ffdf(pcodes[,1:3]), by.x="PCODE_R", by.y="PCODE", all.x=TRUE)
childcare.gc = subset(childcare, !is.na(Lat))
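# Pull out each building type that gets its own map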
childcare.worship = subset(childcare.gc, bldg_type == "Place of Worship")
childcare.house = subset(childcare.gc, bldg_type == "House")
childcare.community = subset(childcare.gc, bldg_type == "Community/Recreation Centre")
childcare.pschool = subset(childcare.gc, bldg_type == "Public Elementary School")
childcare.highrise = subset(childcare.gc, bldg_type == "High Rise Apartment")
childcare.purpose = subset(childcare.gc, bldg_type == "Purpose Built")
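# For each building type: convert TOTSPACE to percentile ranks with ecdf() and use them to scale the dot sizes on a static Google map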
Fn = ecdf(childcare.worship[,"TOTSPACE"])
childcare.worship$TOTSPACE.pct = as.ff(Fn(childcare.worship[,"TOTSPACE"]))
mymap = MapBackground(lat=childcare.worship[,"Lat"], lon=childcare.worship[,"Long"])
PlotOnStaticMap(mymap, childcare.worship[,"Lat"], childcare.worship[,"Long"], cex=childcare.worship[,"TOTSPACE.pct"]*4, pch=21, bg=addTrans("purple",100))
Fn = ecdf(childcare.house[,"TOTSPACE"])
childcare.house$TOTSPACE.pct = as.ff(Fn(childcare.house[,"TOTSPACE"]))
mymap = MapBackground(lat=childcare.house[,"Lat"], lon=childcare.house[,"Long"])
PlotOnStaticMap(mymap, childcare.house[,"Lat"], childcare.house[,"Long"], cex=childcare.house[,"TOTSPACE.pct"]*4, pch=21, bg=addTrans("purple",100))
Fn = ecdf(childcare.community[,"TOTSPACE"])
childcare.community$TOTSPACE.pct = as.ff(Fn(childcare.community[,"TOTSPACE"]))
mymap = MapBackground(lat=childcare.community[,"Lat"], lon=childcare.community[,"Long"])
PlotOnStaticMap(mymap, childcare.community[,"Lat"], childcare.community[,"Long"], cex=childcare.community[,"TOTSPACE.pct"]*4, pch=21, bg=addTrans("purple",100))
Fn = ecdf(childcare.pschool[,"TOTSPACE"])
childcare.pschool$TOTSPACE.pct = as.ff(Fn(childcare.pschool[,"TOTSPACE"]))
mymap = MapBackground(lat=childcare.pschool[,"Lat"], lon=childcare.pschool[,"Long"])
PlotOnStaticMap(mymap, childcare.pschool[,"Lat"], childcare.pschool[,"Long"], cex=childcare.pschool[,"TOTSPACE.pct"]*4, pch=21, bg=addTrans("purple",100))
Fn = ecdf(childcare.highrise[,"TOTSPACE"])
childcare.highrise$TOTSPACE.pct = as.ff(Fn(childcare.highrise[,"TOTSPACE"]))
mymap = MapBackground(lat=childcare.highrise[,"Lat"], lon=childcare.highrise[,"Long"])
PlotOnStaticMap(mymap, childcare.highrise[,"Lat"], childcare.highrise[,"Long"], cex=childcare.highrise[,"TOTSPACE.pct"]*4, pch=21, bg=addTrans("purple",100))
Fn = ecdf(childcare.purpose[,"TOTSPACE"])
childcare.purpose$TOTSPACE.pct = as.ff(Fn(childcare.purpose[,"TOTSPACE"]))
mymap = MapBackground(lat=childcare.purpose[,"Lat"], lon=childcare.purpose[,"Long"])
PlotOnStaticMap(mymap, childcare.purpose[,"Lat"], childcare.purpose[,"Long"], cex=childcare.purpose[,"TOTSPACE.pct"]*4, pch=21, bg=addTrans("purple",100))
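# Descriptive stats (min/mean/median/max spaces and number of centres) by building type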
space.by.bldg_type = ddply(as.data.frame(childcare.gc), .(bldg_type), function (x) c(min.space = min(x[,"TOTSPACE"], na.rm=TRUE), average.space = mean(x[,"TOTSPACE"], na.rm=TRUE), median.space = median(x[,"TOTSPACE"], na.rm=TRUE), max.space = max(x[,"TOTSPACE"], na.rm=TRUE), tot_daycares = sum(!is.na(x[,"TOTSPACE"]))))
space.by.bldg_type = space.by.bldg_type[order(-space.by.bldg_type$tot_daycares),]
bldg_type min.space average.space median.space max.space tot_daycares
18 Public Elementary School 15 74.19355 69.0 217 279
17 Place of Worship 8 48.46552 44.0 167 116
16 Other 14 51.17647 48.5 160 102
1 Catholic Elementary School 16 51.50000 49.5 112 76
9 High Rise Apartment 20 68.56522 62.0 145 69
22 Purpose Built 20 72.48276 59.5 165 58
8 Community/Recreation Centre 13 63.73333 60.0 146 45
11 House 10 49.84211 44.5 116 38
6 Commercial Building 16 55.95833 51.5 129 24
15 Office Building 20 69.69565 64.0 162 23
20 Public High School 16 42.36842 41.0 60 19
21 Public School (Closed) 22 70.26667 56.0 180 15
4 Church 13 51.90909 46.0 148 11
19 Public Elementary School (French) 36 84.71429 70.0 167 7
23 Synagogue 24 64.00000 61.0 108 7
7 Community College/University 15 55.16667 59.5 78 6
14 Low Rise Apartment 15 56.00000 62.0 92 6
2 Catholic Elementary School(French) 39 81.20000 76.0 130 5
5 City owned Community/Recreation Centre 28 65.80000 62.0 103 5
3 Catholic High School 36 51.50000 54.0 62 4
12 HUMSRV 45 52.00000 52.0 59 2
13 Industrial Building 45 109.00000 109.0 173 2
26 Private Elementary School 20 154.50000 154.5 289 2
10 Hospital/Health Centre 25 25.00000 25.0 25 1
24 109 109.00000 109.0 109 1
25 Coomunity/Recreation Centre 156 156.00000 156.0 156 1
27 Public Middle School 10 10.00000 10.0 10 1

Who uses E-Bikes in Toronto? Fun with Recursive Partitioning Trees and Toronto Open Data

I found a fun survey released on the Toronto Open Data website that investigates the travel/commuting behaviour of Torontonians, with a special focus on E-bikes.  When I opened up the file, I found various demographic information, in addition to a question asking people their most frequently used mode of transportation.  Exactly 2,238 people responded to this survey, of which 194 were frequent E-bike users.  I figured that's enough to do some data mining, and an especially fun opportunity to use recursive partitioning trees to do it!

Following is the code I used (notice in the model statements that I focus specifically on E-bike users versus everyone else):


library(rpart)
library(plyr)
library(rpart.plot)
ebike = read.csv("E-Bike_Survey_Responses.csv")
# This next part is strictly to change any blank responses into NAs
ebike[,2:10][ebike[,2:10] == ''] = NA
# In this section we use mapvalues from the plyr package to get rid of blanks, but also
# to reduce the number of values in each factor that we use.
# (note: mapvalues needs its 'from' and 'to' vectors to be the same length)
ebike$Sex = mapvalues(ebike$Sex, c('', levels(ebike$Sex)[c(1,2,6)]), rep("Other", 4))
ebike$Health = mapvalues(ebike$How.would.you.describe.your.level.of.physical.health., c('', levels(ebike$How.would.you.describe.your.level.of.physical.health.)[c(1,4,5,6,12,13)]), c(NA, rep("Other", 6)))
ebike$Edu = mapvalues(ebike[,5], c('', levels(ebike[,5])[c(1,4,8,14,23)]), c(NA, rep('Other', 5)))
ebike$Income = mapvalues(ebike[,6], '', NA)
ebike$Age = mapvalues(ebike[,2], '', NA)
# People put a lot of varying answers in here, but the categories I've chosen here can be found in most of them.
ebike$transport = factor(ifelse(grepl("bicycle",ebike[,11]),"Bicycle",
ifelse(grepl("e-bike", ebike[,11]), "E-bike",
ifelse(grepl("car", ebike[,11]), "Car",
ifelse(grepl("transit", ebike[,11]), "Transit","Other")))))
# Here we ask R to make two trees based first on Sex Health and Age (they seem like they go together)
# then based on education and income. You can try to put them together, but you will find that only some are
# chosen as the most significant for the classification. Therefore, keeping them apart describes for us
# E-bike users on separate dimensions.
b = rpart(transport == "E-bike"~ Sex + Health + Age, data=ebike)
c = rpart(transport == "E-bike" ~ Edu + Income, data=ebike)
# And here we plot the two Partition Tree models. I like seeing the factor label
# values in their entirety, so I've chosen a large enough number for the 'faclen' argument
# in each call to rpart.plot
rpart.plot(b, type=1,extra=1, varlen=0, faclen=10)
rpart.plot(c, type=1,extra=1, varlen=0, faclen=20)

Here is the first tree, based on Sex, Health, and Age (remember that the factor levels shown are not the only ones; when you look at the "no" side of the tree, you are examining the proportion of E-bike users described by the factor levels not shown):

Health and Age Tree
As you can see, only Health and Age came out as significantly discriminating between E-bike users and everyone else.  What this tree is telling us is that it's not the people in "Excellent, Good, Very good" health who are likely to use E-bikes, but rather those in the un-shown part of the Health spectrum: "Other,Fairly good,Poor".  That's interesting in and of itself.  It also seems that people who are in Excellent or Very Good health are more likely (44%) to be riding Bicycles than people at other levels of health (23%).  That makes sense!  You're not going to choose something effortful if your health isn't great.

We also see a very interesting finding that it is in fact the 50 – 64 year olds (whose health isn’t great) who are more likely to be riding an E-bike compared to people of all other age groups!

Here’s the second tree based on Education and Income:

Education and Income Tree
Here we see that it's not the university-educated respondents who are more likely to ride E-bikes, but rather people with a "College or trade school diploma,High school diploma".  Interesting!!  Further, amongst those who aren't university educated, it's those who say they make less than $80,000 in income who are more likely to ride E-bikes.

So now we have an interesting picture emerging, with two parallel descriptions of who is most likely to ride E-bikes:

1) 50 – 64 year olds in not the greatest of health
and
2) Non-university-educated folks with less than $80,000 in income.

Toronto, these are your E-bike users!

sapply is my new friend!

I’ve written previously about how the apply function is a major workhorse in many of my work projects. What I didn’t know is how handy the sapply function can be!

There are a couple of cases so far where I’ve found that sapply really comes in handy for me:

1) If I want to quickly see some descriptive stats for multiple columns in my dataframe. For example,

sapply(mydf[,10:20], median, na.rm=TRUE)

would show me the medians of columns 10 through 20, displaying the column names above each median value.

2) If I want to apply the same function to multiple vectors in my dataframe, modifying them in place. I oftentimes have count variables that have NA values in place of zeros. I made a “zerofy” function to add zeros into a vector that lacks them. So, if I want to use my function to modify these count columns, I can do the following:

mydf[,30:40] = sapply(mydf[,30:40], zerofy)

Which then replaces the original data in columns 30 through 40 with the modified data! Handy!
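The zerofy function itself isn't shown here; a minimal sketch of what such a helper might look like, assuming it just swaps NAs for zeros:

# Hypothetical zerofy helper: replace NA values in a count vector with 0
zerofy = function(x) {
  x[is.na(x)] = 0
  x
}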

Package sqldf eases the multivariable sorting pain

This will be a quick one.  I was trying to sort my dataframe so that it went in ascending order on one variable and descending order on another variable.  This was really REALLY bothersome to try to figure out with base R functions.  Then I remembered sqldf!

# Assuming dataframe named 'mydf' and 'V1' and 'V2' are your variables you want to sort on in opposite directions
library(sqldf)
mydf.sorted = sqldf("SELECT * FROM mydf ORDER BY V1 ASC, V2 DESC")

That’s all! No pain, and lots of gain. Show me how to do this so simply with base R and I’ll be impressed.

EDIT:

After many comments on this post, it’s become painfully obvious to me that I wasn’t using the right search terms (or looking at that negative sign on the second argument to the order function) when searching for a base R solution to this problem. Well, we all make mistakes, right? I hope this post has still been helpful to others, and it isn’t a completely redundant internet resource on this topic!
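For the record, the base R version readers pointed out looks something like this (assuming V2 is numeric, so the minus sign can flip its sort direction):

# Base R: ascending on V1, descending on V2 (V2 must be numeric)
mydf.sorted = mydf[order(mydf$V1, -mydf$V2), ]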

Estimating Ages from First Names Part 2 – Using Some Morbid Test Data

In my last post, I wrote about how I compiled a US Social Security Administration data set into something usable in R, and mentioned some issues scaling it up to be usable for bigger datasets.  I also mentioned the need for data to test out the accuracy of my estimates.  First, I'll show you how I prepped the dataset so that it became more scalable (for the code that got us here, see my last post):

name_data_wavgpop_unisex = ddply(name_data, .(Name), function (x) sum(x$Rel_Pop*as.numeric(x$Year))/sum(x$Rel_Pop))
name_data_wavgpop_unisex$V1 = round(name_data_wavgpop_unisex$V1,0)

Above I've taken a different tack to predicting expected year of birth according to name than I started out with in my last post. Here I'm using the relative popularity of the names in each year as weights for each year value. Multiplying the weights by the years, I get a weighted average of Year that serves as the predicted year of birth. Then I round off the predictions to the nearest integer and continue on my way. Also, because test data doesn't seem to come packaged with gender info, I've constructed the weighted averages using all relative popularity values for each name, regardless of whether or not that name has been used for both sexes (e.g. "Jordan").
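As a toy illustration of that weighting, with completely made-up popularity values, a name used mostly in 1985 gets pulled toward 1985:

# Made-up example: relative popularity 0.2 in 1950 and 0.8 in 1985
years = c(1950, 1985)
rel_pop = c(0.2, 0.8)
sum(rel_pop * years) / sum(rel_pop)   # 1978, the popularity-weighted birth year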

Now enter the test data. I've discovered that the easiest way of getting real names and ages off the internet is by looking for lists of victims of some horrible tragedy. The biggest such list of victims I could find was a list of 9/11 victims.  It's not exactly formatted for easy analysis, and I was too lazy to get the data programmatically, so I just copy-pasted the names and ages from the first 4 lists on the page (all from either American Airlines or United Airlines) into LibreOffice Calc, for a total of 285 observations.  I then extracted the first names, and imported the first names and ages into R.

worldtrade = read.csv("world trade.csv")
worldtrade.ages = sqldf("SELECT a.*, b.V1 as Year FROM [worldtrade] AS a LEFT JOIN name_data_wavgpop_unisex AS b on a.Name == b.Name")
worldtrade.ages$Pred.Age = 2001 - as.numeric(worldtrade.ages$Year)

As you can see, I opted to use sqldf to append the appropriate predicted birth years for each name on the list I imported. I then got the predicted ages by subtracting each predicted birth year from 2001. Finally, let’s have a look at the resulting model fit (showing how close each predicted age was to the real age of the victim at the time of death):

Predicted vs Real Ages for 911 Victims

As you can see, it’s not a tight prediction in the least bit.  According to model fit statistics, there’s an adjusted r-squared of 14.6% and a residual standard error of 15.58 years.  You can also see from the scatter plot that the prediction doesn’t become reasonably linear until about age 30 and onwards.  Overall, I’d say it’s not too impressive, and I’d imagine it’s even worse for predicting who’s under 10 years old!
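Those fit statistics would come from something like a simple regression of real age on predicted age; here's a minimal sketch, assuming the victims' real ages sit in a column called Age (my guess, since the column name isn't shown):

# Regress real age on predicted age to get an adjusted R-squared and
# residual standard error (the 'Age' column name is an assumption)
age.fit = lm(Age ~ Pred.Age, data=worldtrade.ages)
summary(age.fit)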

Well, this was fun (if only a little disappointing).  That’s statistics for you – sometimes it confirms, and sometimes it humbles you.  If you think you can show me a better way of using this name trending data to better predict ages than what I’ve done here, feel free to show me!