KDD Cup 2015: The story of how I built hundreds of predictive models….And got so close, yet so far away from 1st place!

The challenge from the KDD Cup this year was to use their data relating to student enrollment in online MOOCs to predict who would drop out vs who would stay.

The short story is that using H2O and a lot of my free time, I trained several hundred GBM models looking for the final one which eventually got me an AUC score of 0.88127 on the KDD Cup leaderboard and at the time of this writing landed me in 120th place. My score is 2.6% away from 1st place, but there are 119 people above me!

Here are the main characters of this story:

MySQL Workbench

It started with my obsessive drive to find an analytics project to work on. I happened upon the KDD Cup 2015 competition and decided to give it a go. It had the characteristics of a project that I wanted to get into:

1) I could use it to practice my SQL skills
2) The data set was of a moderate size (training table was 120,542 records, log info table was 8,151,053 records!)
3) It looked like it would require some feature engineering
4) I like predictive modeling competitions 🙂

Once I had loaded up the data into a mariadb database, I had to come to decisions about how I would use the info in each table. Following were my thought processes for each table:

enrollment_train / enrollment_test
Columns: enrollment_id, username, course_id

Simply put, from this table I extracted the number of courses each student (username) was enrolled in, and also the number of students enrolled in each course (course_id).

log_train / log_test
Columns: enrollment_id, tstamp, source, logged_event, object

There were a few items of information that I decided to extract from this table:

1) Number of times each particular event was logged for every enrollment_id
2) Average timestamp for each event for each enrollment_id
3) Min and Max timestamp for each event for each enrollment_id
4) Total time elapsed from the first to the last instance of each event for each enrollment_id
5) Overall average timestamp for each enrollment_id

Contrary to what you might think, the object field does not seem to link up with the object table.

Columns: course_id, module_id, category, children, tstart

From this table I extracted a count of course components by course_id and also the number of ‘children’ per course_id. I assume these are relational references but am not sure what in the data set these child IDs refer to.

Columns: enrollment_id, dropped_out

I didn’t extract anything special out of this table, but used it as the table to which all other SQL views that I had created were linked.

If you’d like to see the SQL code I used to prepare the tables, views, and the final output table I used to train the model, see my github repo for this project.

Import into R and Feature Engineering

Once I imported the data into R through RODBC, you’ll see in the code that my feature engineering was essentially a desperate fishing expedition where I tried a whole lot of stuff.  I didn’t even end up using everything that I had engineered through my R code, but as my final model included 35 variables, I wasn’t suffering any severe lack! If you download the KDD Cup 2015 data and are having a look around, feel free to let me know if I’ve missed any important variables!

H2O, Model Tuning, and Training of The Final Model

This is the part where I managed to train hundreds of models! I don’t think this would have been feasible just using plain R on my computer alone (I have 8GB of RAM and an 8 core AMD processor). For these tasks I turned to H2O. For those who don’t know, H2O is a java based analytical interface for cloud computing that is frankly very easy and beneficial to set up when all you have at your disposal is one computer. I say beneficial for one reason: my computer chokes when trying to train ensemble models on even moderate sized data sets. Through H2O, I’m able to get it done without watching the RAM meter on my system monitor shoot all the way up to full capacity!! What you’ll notice in my R code is that R is able to interface with H2O in such a way that once I passed the dataframe with the training data to H2O, it was H2O that handled the modeling from there, and sends info back to R when available or requested (e.g. while you’re training a model, it gives you a cute text-based progress bar automatically!). More on this soon.

Before I show some results, I want to talk about my model tuning algorithm. Let’s look at the relevant code, then I’ll break it down verbally.

ntree = seq(100,500,100)
balance_class = c(TRUE,FALSE)
learn_rate = seq(.05,.4,.05)

parameters = list(ntree = c(), balance_class = c(), learn_rate = c(), r2 = c(), min.r2 = c(), max.r2 = c(), acc = c(), min.acc = c(), max.acc = c(), AUC = c(), min.AUC = c(), max.AUC = c())
n = 1

mooc.hex = as.h2o(localH2O, mooc[,c("enrollment_id","dropped_out_factor",x.names)])
for (trees in ntree) {
  for (c in balance_class) {
    for (rate in learn_rate) {
      r2.temp = c(NA,NA,NA)
      acc.temp = c(NA,NA,NA)
      auc.temp = c(NA,NA,NA)
      for (i in 1:3) {
        mooc.hex.split = h2o.splitFrame(mooc.hex, ratios=.8)   
        train.gbm = h2o.gbm(x = x.names, y = "dropped_out_factor",  training_frame = mooc.hex.split[[1]],
                            validation_frame = mooc.hex.split[[2]], ntrees = trees, balance_classes = c, learn_rate = rate)
        r2.temp[i] = train.gbm@model$validation_metrics@metrics$r2
        acc.temp[i] = train.gbm@model$validation_metrics@metrics$max_criteria_and_metric_scores[4,3]
        auc.temp[i] = train.gbm@model$validation_metrics@metrics$AUC
    parameters$ntree[n] = trees
    parameters$balance_class[n] = c
    parameters$learn_rate[n] = rate
    parameters$r2[n] = mean(r2.temp)
    parameters$min.r2[n] = min(r2.temp)
    parameters$max.r2[n] = max(r2.temp)
    parameters$acc[n] = mean(acc.temp)
    parameters$min.acc[n] = min(acc.temp)
    parameters$max.acc[n] = max(acc.temp)
    parameters$AUC[n] = mean(auc.temp)
    parameters$min.AUC[n] = min(auc.temp)
    parameters$max.AUC[n] = max(auc.temp)
    n = n+1

parameters.df = data.frame(parameters)

The model that I decided to use is my usual favourite, gradient boosting machines (h2o.gbm is the function you use to train a gbm model through H2O). As such, the 3 hyperparameters which I chose to vary and evaluate in the model tuning process were number of trees, whether or not to balance the outcome classes through over/undersampling, and the learning rate. As you can see above, I wanted to try out numerous values for each hyperparameter, making 5 values for number of trees, 2 values for balance classes, and 8 values for learning rate, totalling 80 possible combinations of all 3 hyperparameter values together. Furthermore, I wanted to try out each combination of hyperparemeter values on 3 random samples of the training data. So, 3 samples of each one of 80 combinations is equal to 240 models trained and validated with the aim of selecting the one with the best area under the curve (AUC). As you can see, each time I trained a model, I saved and summarised the validation stats in a growing list which I ultimately converted to a data.frame and called called parameters.df

The best hyperparameters, according to these validation stats which I collected, are:

– ntree = 500
– balance_class = FALSE
– learn_rate = .05

You can see a very nice summary of how validation set performance changed depending on the values of all of these parameters in the image below (the FALSE and TRUE over the two facets refer to the balance_class values.

AUC by Tuning Parameters

Have a look at my validation data model summary output from the H2O package below:

H2OBinomialMetrics: gbm
** Reported on validation data. **

MSE:  0.06046745
R^2:  0.102748
LogLoss:  0.2263847
AUC:  0.7542866
Gini:  0.5085732

Confusion Matrix for F1-optimal threshold:
            dropped out stayed    Error         Rate
dropped out       21051   1306 0.058416  =1306/22357
stayed             1176    576 0.671233   =1176/1752
Totals            22227   1882 0.102949  =2482/24109

Maximum Metrics:
                      metric threshold    value        idx
1                     max f1  0.170555 0.317006 198.000000
2                     max f2  0.079938 0.399238 282.000000
3               max f0point5  0.302693 0.343008 134.000000
4               max accuracy  0.612984 0.929321  48.000000
5              max precision  0.982246 1.000000   0.000000
6           max absolute_MCC  0.170555 0.261609 198.000000
7 max min_per_class_accuracy  0.061056 0.683410 308.000000

The first statistic that my eyes were drawn to when I saw this output was the R^2 statistic. It looks quite low and I’m not even sure why. That being said, status in the KDD Cup 2015 competition is measured in AUC, and here you can see that it is .75 on my validation data. Next, have a look at the confusion matrix. You can see in the Error column that the model did quite well predicting who would drop out (naturally, in my opinion), but did not do so well figuring out who would stay. The overall error rate on the validation data is 10%, but I’m still not so happy about the high error rate as it pertains to those who stayed in the MOOC.

So this was all well and good (and was what got me my highest score yet according to the KDD Cup leaderboard) but what if I could get better performance with fewer variables? I took a look at my variable importances and decided to see what would happen if I eliminate the variables with the lowest importance scores one by one until I reach the variable with the 16th lowest importance score. Here’s the code I used:

varimps = data.frame(h2o.varimp(train.gbm))
variable.set = list(nvars = c(), AUC = c(), min.AUC = c(), max.AUC = c())

mooc.hex = as.h2o(localH2O, mooc[,c("enrollment_id","dropped_out_factor",x.names)])
n = 1
for (i in seq(35,20)) {
  auc.temp = c(NA,NA,NA)
  x.names.new = setdiff(x.names, varimps$variable[i:dim(varimps)[1]])
  for (j in 1:3) {
        mooc.hex.split = h2o.splitFrame(mooc.hex, ratios=.8)  
        train.gbm.smaller = h2o.gbm(x = x.names.new, y = "dropped_out_factor",  training_frame = mooc.hex.split[[1]],
                            validation_frame = mooc.hex.split[[2]], ntrees = 500, balance_classes = FALSE, learn_rate = .05)
        auc.temp[j] = train.gbm.smaller@model$validation_metrics@metrics$AUC
    variable.set$AUC[n] = mean(auc.temp)
    variable.set$min.AUC[n] = min(auc.temp)
    variable.set$max.AUC[n] = max(auc.temp)
    variable.set$nvars[n] = i-1
    n = n + 1

variable.set.df = data.frame(variable.set)

You can see that it’s a similar algorithm as what I used to do the model tuning. I moved up the variable importance list from the bottom, one variable at a time, and progressively eliminated more variables. I trained 3 models for each new number of variables, each on a random sample of the data, and averaged the AUCs from those models (totalling 48 models). See the following graph for the result:

AUC by num vars

As you can see, even though the variables I eliminated were of the lowest importance, they were still contributing something positive to the model. This goes to show how well GBM performs with variables that could be noisy.

Now let’s look at the more important variables according to H2O:

                           variable relative_importance scaled_importance   percentage
1                 num_logged_events        48481.160156      1.000000e+00 5.552562e-01
2     DAYS_problem_total_etime_unix        11651.416992      2.403288e-01 1.334440e-01
3                      days.in.mooc         6495.756348      1.339852e-01 7.439610e-02
4      DAYS_access_total_etime_unix         3499.054443      7.217349e-02 4.007478e-02
5                         avg_month         3019.399414      6.227985e-02 3.458127e-02
6                           avg_day         1862.299316      3.841285e-02 2.132897e-02
7                    Pct_sequential         1441.578247      2.973481e-02 1.651044e-02
8    DAYS_navigate_total_etime_unix          969.427734      1.999597e-02 1.110289e-02
9                       num_courses          906.499451      1.869797e-02 1.038217e-02
10                      Pct_problem          858.774353      1.771357e-02 9.835569e-03
11                     num_students          615.350403      1.269257e-02 7.047627e-03

Firstly, we see that the number of logged events was the most important variable for predicting drop-out. I guess the more active they are, the less likely they are to drop out. Let’s see a graph:

MOOC dropout by num logged events

Although a little bit messy because I did not bin the num_logged_events variable, we see that this is exactly the case that those students who were more active online were less likely to drop out.

Next, we see a few variables regarding the days spent doing something. They seem to follow similar patterns, so the image I’ll show you below involves the days.in.mooc variable. This is simply how many days passed from the logging of the first event to the last.

MOOC dropouts by days in mooc

Here we see a very steady decrease in probability of dropping out where those who spent very little time from their first to their last interaction with the MOOC are the most likely to drop out, whereas those who spend more time with it are obviously less likely.

Next, let’s look at the avg_month and avg_day variables. These were calculated by taking the average timestamp of all events for each person enrolled in each course and then extracting the month and then the day from that timestamp. Essentially, when, on average, did they tend to do that course.

MOOC dropout by avg month and day

Interestingly, most months seem to exhibit a downward pattern, whereby if the person tended to have their interactions with the MOOC near the end of the month, then they were less likely to drop out, but if they had their interactions near the beginning they were more likely to drop out. This applied to February, May, June, November, and December. The reverse seems to be true for July and maybe October. January maybe applies to the second list.

The last two plots I’ll show you relate to num_courses and num_students, in other words, how many courses each student is taking and how many students are in each course.

MOOC dropouts by # courses per student

MOOC dropout by course popularity

The interesting result here is that it’s only those students who were super committed (taking more than 20 courses in the period captured by the data) who appeared significantly less likely to drop out than those who were taking fewer courses.

Finally, you can see that as the number of students enrolled in a course went up, the overall drop-out rate decreased. Popular courses retain students!


This was fun! I was amazed by how obsessed I became on account of this competition. I’m disappointed that I couldn’t think of something to bridge the 2.6% gap between me and first place, but the point of this was to practice, to learn something new, and have fun. I hope you enjoyed it too!

Contraceptive Choice in Indonesia

I wanted yet another opportunity to get to use the fabulous caret package, but also to finally give plot.ly a try.  To scratch both itches, I dipped into the UCI machine learning library yet again and came up with a survey data set on the topic of contraceptive choice in Indonesia.  This was an interesting opportunity for me to learn about a far-off place while practicing some fun data skills.

According to recent estimates, Indonesia is home to some 250 million individuals and over the years, thanks to government intervention, has had its fertility rate slowed down from well over 5 births per woman, to a current value of under 2.4 births per woman.  Despite this slow down, Indonesia is not generating enough jobs to satisfy the population.  Almost a fifth of their youth labour force (aged 15-24) are unemployed (19.6%), a whole 6.3% more than a recent estimate of the youth unemployment rate in Canada (13.3%).  When you’re talking about a country with 250 million individuals (with approximately 43.4 million 15-24 year olds), that’s the difference between about 5.8 million unemployed and 8.5 million unemployed teenagers/young adults.  The very idea is frightening!  Hence the government’s focus (decades ago and now) on promoting contraceptive method use to its population.

That is the spirit behind the survey data we’ll look at today!  First, download the data from my plot.ly account and load it into R.


cmc = read.csv("cmc.csv", colClasses = c(Wife.Edu = "factor", Husband.Edu = "factor", Wife.Religion = "factor", 
                                         Wife.Working = "factor", Husband.Occu = "factor", Std.of.Living = "factor",
                                         Media.Exposure = "factor", Contraceptive.Method.Used = "factor"))

levels(cmc$Contraceptive.Method.Used) = c("None","Short-Term","Long-Term")

      None Short-Term  Long-Term 
       629        333        511 


      None Short-Term  Long-Term 
 0.4270197  0.2260692  0.3469111 

Everything in this data set is stored as a number, so the first thing I do is to define as factors what the documentation suggests are factors.  Then, we see the numeric breakdown of how many women fell into each contraceptive method category.  It’s 1473 responses overall, and ‘None’ is the largest category, although it is not hugely different from the others (although a chi squared test does tell me that the proportions are significantly different from one another).

Next up, let’s set up the cross validation and training vs testing indices, then train a bunch of models, use a testing sample to compare performance, and then we’ll see some graphs

# It's training time!
control = trainControl(method = "cv")
in_train = createDataPartition(cmc$Contraceptive.Method.Used, p=.75, list=FALSE)

contraceptive.model = train(Contraceptive.Method.Used ~ ., data=cmc, method="rf", metric="Kappa",
                            trControl=control, subset=in_train)

contraceptive.model.gbm = train(Contraceptive.Method.Used ~ ., data=cmc, method="gbm", metric="Kappa",
                            trControl=control, subset=in_train, verbose=FALSE)

contraceptive.model.svm = train(Contraceptive.Method.Used ~ ., data=cmc, method="svmRadial", metric="Kappa",
                                preProc = c('center','scale'), trControl=control, subset=in_train)

contraceptive.model.c50 = train(Contraceptive.Method.Used ~ ., data=cmc, method="C5.0", metric="Kappa",
                                trControl=control, subset=in_train, verbose=FALSE)

dec = expand.grid(.decay=c(0,.0001,.01,.05,.10))
control.mreg = trainControl(method = "cv")
contraceptive.model.mreg = train(Contraceptive.Method.Used ~ ., data=cmc, method="multinom", metric="Kappa",
                                  trControl=control.mreg, tuneGrid = dec, subset=in_train, verbose=FALSE)

# And now it's testing time...
cmc.test = cmc[-in_train,]

cmc.test = cmc.test %>% mutate(
  rf.class = predict(contraceptive.model, cmc.test, type="raw"),
  gbm.class = predict(contraceptive.model.gbm, cmc.test, type="raw"),
  svm.class = predict(contraceptive.model.svm, cmc.test, type="raw"),
  mreg.class = predict(contraceptive.model.mreg, cmc.test, type="raw"),
  c50.class = predict(contraceptive.model.c50, cmc.test, type="raw"))

# Here I'm setting up a matrix to host some performance statistics from each of the models
cmatrix.metrics = data.frame(model = c("rf", "gbm", "svm", "mreg","c50"), kappa = rep(NA,5), sensitivity1 = rep(NA,5), sensitivity2 = rep(NA,5), sensitivity3 = rep(NA,5), specificity1 = rep(NA,5), specificity2 = rep(NA,5), specificity3 = rep(NA,5))

# For each of the models, I use confusionMatrix to give me the performance statistics that I want
for (i in 11:15) {
  cmatrix.metrics[i-10,"kappa"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$overall[2][[1]]
  cmatrix.metrics[i-10, "sensitivity1"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[1,1]
  cmatrix.metrics[i-10, "sensitivity2"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[2,1]
  cmatrix.metrics[i-10, "sensitivity3"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[3,1]
  cmatrix.metrics[i-10, "specificity1"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[1,2]
  cmatrix.metrics[i-10, "specificity2"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[2,2]
  cmatrix.metrics[i-10, "specificity3"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[3,2]

# Now I transform my cmatrix.metrics matrix into a long format suitable for ggplot, graph it, and then post it to plot.ly

cmatrix.long = melt(cmatrix.metrics, id.vars=1, measure.vars=c(2:8))

ggplot(cmatrix.long, aes(x=model, y=value)) + geom_point(stat="identity", colour="blue", size=3) + facet_grid(~variable) + theme(axis.text.x=element_text(angle=90,vjust=0.5, colour="black",size=12), axis.text.y=element_text(colour="black", size=12), strip.text=element_text(size=12), axis.title.x=element_text(size=14,face="bold"), axis.title.y=element_text(size=14, face="bold")) + ggtitle("Performance Stats for Contraceptive Choice Models") + scale_x_discrete("Model") + scale_y_continuous("Value")

py = plotly()

And behold the beautiful graph (after some tinkering in the plot.ly interface, of course)

performance_stats_for_contraceptive_choice_modelsPlease note, the 1/2/3 next to sensitivity and specificity refer to the contraceptive use classes, ‘None’, ‘Short-term’, ‘Long-term’.

Firstly, the kappa stats are telling us that after you factor into accuracy the results you would expect by chance, the models don’t add a humongous level of predictive power.  So, we won’t collect our nobel prize on this one!  The classes are evidently more difficult to positively identify than they are to negatively identify, as evidenced by the lower sensitivity scores than specificity scores.  Interestingly, class 1 (users of NO contraceptive methods) was the most positively identifiable.

Secondly, it looks like we have a list of the top 3 performers in terms of kappa, sensitivity and specificity (in no particular order because I they don’t appear that different):

  • gbm (gradient boosting machine)
  • svm (support vector machine)
  • c50

Now that we have a subset of models, let’s see what they have to say about variable importance.  Once we see what they have to say, we’ll do some more fun graphing to see the variables in action.

gbm.imp = varImp(contraceptive.model.gbm)
svm.imp = varImp(contraceptive.model.svm)
c50.imp = varImp(contraceptive.model.c50)

I haven’t used ggplot here because I decided to try copy pasting the tables directly into plot.ly for the heck of it. Let’s have a look at variable importance according to the three models now:



contraceptive_method_model_-_svm_variable_importanceOne commonality that you see throughout all of these graphs is that the wife’s age, her education, and the number of children born to her are among the most predictive in terms of classifying where she will fall on the contraceptive choice spectrum provided in the survey.  Given that result, let’s see some graphs that plot contraceptive choice against the aforementioned 3 predictive variables:

plot(Contraceptive.Method.Used ~ Wife.Age, data=cmc, main="Contraceptive Method Used According to Age of Wife")

cmethod.by.kids = melt(dcast(cmc, Num.Kids ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4), variable.name="Contraceptive.Method.Used", value.name="Num.Records")

ggplot(cmethod.by.kids, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Number of Kids")
ggplot(cmethod.by.kids, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Number of Kids")

cmethod.by.wife.edu = melt(dcast(cmc, Wife.Edu ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4), variable.name="Contraceptive.Method.Used", value.name="Num.Records")

ggplot(cmethod.by.wife.edu, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Education Level of Wife")
ggplot(cmethod.by.wife.edu, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Education Level of Wife")

And now the graphs:

Contraceptive Choice by Age of Wife

Okay so this isn’t a plotly graph, but I do rather like the way this mosaic plot conveys the information.  The insight here is after a certain age (around age 36) the women seem to be less inclined to report the use of long-term contraceptive methods, and more inclined to report no contraceptive use at all.  Could it be that the older women feel that they are done with child bearing, and so do not feel the need for contraception anymore?  Perhaps someone more knowledgeable on the matter could enlighten me!


contraceptive_choice_by_education_level_of_wife prop

Here’s a neat one, showing us the perhaps not-so-revelatory result that as the education level of the wife increases, the likelihood that she will report no contraception diminishes.  Here it is in fact the short term contraceptive methods where we see the biggest increase in likelihood as the education level increases.  I don’t think you can cite education in and of itself that causes women to choose contraception, because perhaps it’s higher socio-economic status which leads them to pursue higher education which leads them to choose contraception.  I’m not sure this survey will answer that question, however!


contraceptive_choice_by_number_of_kids prop

Finally, we have number of kids, which exhibits an odd relationship with contraceptive choice. It appears as though at least half of the women with 1 child reported no contraception, but that proportion goes down when you look at women with 3 children.  After that, women are more and more likely to cite no contraception, likely reflecting their age.

Conclusion and observations:

As I mentioned earlier, I don’t expect to pick up any nobel prizes from this analysis.  Substantively, the most interesting thing that came out of this analysis for me is that it was stage of life factors (# kids and age) in addition to the wife’s education (and possibly income, but that wasn’t measured) which formed the most predictive variables in the classification of who uses which contraceptive methods.  Naively, I expected wife’s religion to be amongst the most predictive.

According to UNICEF, 6/10 drop-outs in primary school are girls.  That increases to 7/10 in secondary school.  Stop that trend from happening, and then who knows what improvements might result?  Perhaps continuing to invest in girls’ education will help lay the foundation for the later pursuit of higher education and the slowing down of their population expansion.

Lastly, I have some comments about plotly:

  1. It was a big buzz kill to discover that I couldn’t embed my plotly plots into my wordpress blog.  Bummer.
  2. The plots that did result were very nice looking!
  3. I found myself getting impatient figuring out where to click to customize my plots to my liking.  There’s something about going from a command line to a gui which is odd for me!
  4. I initially thought it was cool that I could save a theme for my plot once I customized it to my liking.  I was disappointed to learn that the theme was not saved as an absolute set of visual characteristics.  For some reason those characteristics changed depending on the initial graph I imported from R and I could not just apply it and be done with it.
  5. I found myself wondering if I there was a better way of getting my R graphs into plotly than the proscribed py$ggplotly() method.  It’s not my biggest complaint, but somehow I’d rather just have one command to batch upload all of my graphs.

I’ll be watching plotly as it evolves and hope to see it improve even more than it has already!  Good luck!

Ontario First Nations Libraries Compared Using Ontario Open Data

I recently downloaded a very cool dataset on Ontario libraries from the Ontario Open Data Catalogue.  The dataset contains 142 columns of information describing 386 libraries in Ontario, representing a fantastically massive data collection effort for such important cultural institutions (although the most recent information available is as of 2010).  One column which particularly caught my interest was “Library Service Type”, which breaks the libraries down into:

  • Public or Union Library (247)
  • LSB Library (4)
  • First Nations Library (43)
  • County, County co-operative or Regional Municipality Library (13)
  • Contracting Municipality (49)
  • Contracting LSB (14)

I saw the First Nations Library type and thought it would be really educational for me to compare First Nations libraries against all the other library types combined and see how they compare based on some interesting indicators.  To make these comparisons in this post, I use a few violin plots; where you see more bulkiness in the plot, it tells you that the value on the y axis is more likely for a library compared to the thinner parts.

Our first comparison, shown below, reveals that local population sizes are a LOT more variable amongst the “Other” library types compared to First Nations libraries.  From first to third quartile, First Nations libraries tend to have around 250 to 850 local residents, whereas Other libraries tend to have around 1,110 to 18,530 local residents!

Local Population Sizes by Library Type

             isFN.Library 0%    25%  50%   75%    100%
1         Other Libraries 28 1113.5 5079 18529 2773000
2 First Nations Libraries 55  254.5  421   857   11297

Considering the huge difference in the population sizes that these libraries were made to serve, comparisons between library types need to be weighted according to those sizes, so that the comparisons are made proportionate.  In that spirit, the next plot compares the distribution of the number of cardholders per resident by library type.  Thinking about this metric for a moment, it’s possible that a person not living in the neighbourhood of the library can get a card there.  If all the residents of the library’s neighbourhood have a card, and there are people outside of that neighbourhood with cards, then a library could have over 1 cardholder per resident.

Looking at the plot, a couple of things become apparent: Firstly, First Nations libraries appear more likely to to be overloaded with cardholders (more cardholders than there are local residents, 14% of First Nations libraries, vs. 4% of Other libraries).  On the lower end of the spectrum, First Nations libraries show a slight (non-significant) tendency of having fewer cardholders per resident than Other libraries.

Cardholders per Resident by Library Type_

             isFN.Library 0%  25%  50%  75% 100%
1         Other Libraries  0 0.20 0.37 0.55  2.1
2 First Nations Libraries  0 0.19 0.32 0.77  2.8

Next we’ll look at a very interesting metric, because it looks so different when you compare it in its raw form to when you compare it in proportion to population size.  The plot below shows the distribution of English titles in circulation by library type.  It shouldn’t be too surprising that Other libraries, serving population sizes ranging from small to VERY large, also vary quite widely in the number of English titles in circulation (ranging from around 5,600 to 55,000, from first to third quartile).  On the other hand we have First Nations libraries, serving smaller population sizes, varying a lot less in this regard (from around 1,500 to 5,600 from first to third quartile).
Num English Titles in Circulation by Library Type

             isFN.Library 0%    25%   50%   75%   100%
1         Other Libraries  0 5637.5 21054 54879 924635
2 First Nations Libraries  0 1500.0  3800  5650  25180

Although the above perspective reveals that First Nations libraries tend to have considerably fewer English titles in circulation, things look pretty different when you weight this metric by the local population size.  Here, the plot for First Nations libraries looks very much like a Hershey’s Kiss, whereas the Other libraries plot looks a bit like a toilet plunger.  In other words, First Nations libraries tend to have more English titles in circulation per resident than Other libraries.  This doesn’t say anything about the quality of those books available in First Nations libraries.  For that reason, it would be nice to have a measure even as simple as median/average age/copyright date of the books in the libraries to serve as a rough proxy for the quality of the books sitting in each library.  That way, we’d know whether the books in these libraries are up to date, or antiquated.
English Titles in Circulation per Resident by Library Type

             isFN.Library 0%       25%      50%       75%      100%
1         Other Libraries  0 0.9245169 2.698802  5.179767 119.61462
2 First Nations Libraries  0 2.0614922 7.436399 13.387416  51.14423

For the next plot, I took all of the “per-person” values, and normed them.  That is to say, for any given value on the variables represented below, I subtracted from that value the minimum possible value, and then divided the result by the range of values on that measure.  Thus, any and all values close to 1 are the higher values, and those closer to 0 are the lower values.  I then took the median value (by library type) for each measure, and plotted below.  Expressed this way, flawed though it may be, we see that First Nations Libraries tend to spend more money per local resident, across areas, than Other libraries.  The revenue side looks a bit different.  While they tend to get more revenue per local resident, they appear to generate less self-generated revenue, get fewer donations, and get less money in local operating grants, all in proportion to the number of local residents.  The three areas where they are excelling (again, this is a median measure) are total operating revenue, provincial operating funding, and especially project grants.
Normed Costs and Revenues by Library Type
Here I decided to zero in on the distributional differences in net profit per resident by library type.  Considering that libraries are non-profit institutions, you would expect to see something similar to the plot shown for “Other” libraries, where the overwhelming majority are at or around the zero line.  It’s interesting to me then, especially since I work with non-profit institutions, to see the crazy variability in the First Nations libraries plot.  The upper end of this appears to be from some outrageously high outliers, so I decided to take them out and replot.
Net Profit per Resident Population
In the plot below, I’ve effectively zoomed in, and can see that there do seem to be more libraries showing a net loss, per person, than those in the net gain status.
Normed Costs and Revenues by Library Type - Zoomed In

             isFN.Library      0%    25%   50%  75%   100%
1         Other Libraries -149.87  -0.49  0.00 1.16  34.35
2 First Nations Libraries  -76.55 -17.09 -0.88 0.40 250.54

I wanted to see this net profit per person measure mapped out across Ontario, so I used the wonderful ggmap package, which to my delight is Canadian friendly!  Go Canada!  In this first map, we see that First Nations libraries in Southern Ontario (the part of Ontario that looks like the head of a dragon) seem to be “okay” on this measure, with one library at the “neck” of the dragon seeming to take on a little more red of a shade, one further west taking on a very bright green, and a few closer to Manitoba appearing to be the worst.

Net Profit per Local Resident Amongst First Nations LibrariesTo provide more visual clarity on these poorly performing libraries, I took away all libraries at or above zero on this measure.  Now there are fewer distractions, and it’s easier to see the worst performers.

Net Profit per Local Resident Amongst First Nations Libraries - in the red

Finally, as a sanity check, I re-expressed the above measure into a ratio of total operating revenue to total operating expenditure to see if the resulting geographical pattern was similar enough.  Anything taking on a value of less than 1 is spending more than they are making in revenue, and are thus “in the red”.  While there are some differences in how the colours are arrayed across Ontario, the result is largely the same.Operating Revenue to Cost Ratio Amongst First Nations Libraries

Finally, I have one last graph that does seem to show a good-news story.  When I looked at the ratio of annual program attendance to local population size, I found that First Nations libraries seem to attract more people every year, proportionate to population size, compared to Other libraries!  This might have something to do with the draw of a cultural institution in a small community, but feel free to tell me some first hand stories either running against this result, or confirming it if you will:

Annual Program Attendance by Library Type


          isFN.Library    0%   25%   50%   75%   100%
1         Other Libraries  0 0.018 0.155 0.307  8.8017
2 First Nations Libraries  0 0.113 0.357 2.686  21.361

That’s it for now! If you have any questions, or ideas for further analysis, don’t hesitate to drop me a line 🙂

As a final note, I think that it’s fantastic that this data collection was done, but the fact that the most recent data available is as of 2010 is very tardy.  What happened here?  Libraries are so important across the board, so please, Ontario provincial government, keep up the data collection efforts!

libraries = read.csv("ontario_library_stats_2010.csv")
libraries$isFN = ifelse(libraries$Library.Service.Type == "First Nations Library",1,0)
# Here we create the 'proportionate' versions of all the variables
libraries[,143:265] = sapply(libraries[,20:142], function (x) x/libraries[,13])
names(libraries)[143:265] = paste(names(libraries)[20:142], "P",sep=".")
libraries$cardholders.per.resident = libraries$X..of.Active.Library.Cardholders / libraries$Population..Resident.
libraries[,269:391] = sapply(libraries[,143:265], function (x) (xmin(x, na.rm=TRUE))/(max(x,na.rm=TRUE) min(x,na.rm=TRUE)))
# here we gather a list of proportionate revenue vs expenditure data,
# figure out which columns show a significant group difference between
# Other libraries and First Nations Libraries, and then plot the medians
# from each group!
results.rev = c()
for (i in 23:50 + 123) {
results.rev = rbind(results.rev, data.frame(variable = names(libraries)[i],
position = i,
pval = kruskal.test(libraries[,i] ~ libraries$isFN)$p.value,
med.FN = median(libraries[which(libraries$isFN == 1),i+126], na.rm=TRUE),
med.OTHER = median(libraries[which(libraries$isFN == 0),i+126], na.rm=TRUE)))
results.rev = subset(results.rev, !(med.FN == 0 & med.OTHER == 0) & pval < .05)
ggplot(results.rev) + geom_point(aes(x=variable, y=med.FN), colour="red",size=6, alpha=.5) + geom_point(aes(x=variable, y=med.OTHER), colour="black",size=6, alpha=.5) + coord_flip() + scale_y_continuous(name="Normed Per-Resident Value \n(x-min(x)/max(x)-min(x))", limits=c(0,.25)) + scale_x_discrete(name="") + theme(axis.text.x=element_text(size=14, colour="black"), axis.text.y=element_text(size=14, colour="black"), axis.title.x=element_text(size=17), plot.title=element_text(size=17,face="bold")) + ggtitle("Costs and Revenues by Library Type\n(Red = First Nations, Black = Other)")
# Now for a whole whack of graphing!
ggplot(libraries, aes(factor(isFN, labels=c("Other Libraries", "First Nations Libraries")), cardholders.per.resident)) + geom_violin(fill="darkgreen") + scale_x_discrete(name="Library Type") + ggtitle("Cardholders per Resident Population by Library Type")
cardholders.by.FN = ddply(libraries, .(isFN.Library = factor(isFN, labels=c("Other Libraries", "First Nations Libraries"))), function (x) round(quantile(x$cardholders.per.resident, na.rm=TRUE),2))
libraries$rev.minus.cost = libraries$Total.Operating.Revenues.P libraries$Total.Operating.Expenditures.P
ggplot(libraries, aes(factor(isFN, labels=c("Other Libraries", "First Nations Libraries")), rev.minus.cost)) + geom_violin(fill="darkgreen") + scale_x_discrete(name="Library Type") + scale_y_continuous(name="Net Profit per Resident Population", limits=c(50,50) ) + ggtitle("Net Profit per Resident Population by Library Type\n(Zoomed in)")
netprofit.by.FN = ddply(libraries, .(isFN.Library = factor(isFN, labels=c("Other Libraries", "First Nations Libraries"))), function (x) round(quantile(x$rev.minus.cost, na.rm=TRUE),2))
ggplot(libraries, aes(x=factor(isFN, labels=c("Other Libraries", "First Nations Libraries")), y=libraries[,13])) + geom_violin(fill="darkgreen") + scale_x_discrete(name="Library Type") + scale_y_log10(name="Resident Population Size\n(log scaled)", breaks=c(25,50,100,250,500,1000,2500,5000,10000,25000,50000,100000,250000,500000, 1000000,2500000), labels=comma) + ggtitle("Distribution of Local Population Sizes by Library Type")
pop.by.FN = ddply(libraries, .(isFN.Library = factor(isFN, labels=c("Other Libraries", "First Nations Libraries"))), function (x) quantile(x$Population..Resident., na.rm=TRUE))
ggplot(libraries, aes(x=factor(isFN, labels=c("Other Libraries", "First Nations Libraries")), y=libraries[,66])) + geom_violin(fill="darkgreen") + scale_x_discrete(name="Library Type") + scale_y_log10(name="# English Titles in Circulation\n(log scaled)", breaks=c(25,50,100,250,500,1000,2500,5000,10000,25000,50000,100000,165000,250000,500000), labels=comma) + ggtitle("Distribution of English Titles in Circulation by Library Type")
raw.eng.titles.by.FN = ddply(libraries, .(isFN.Library = factor(isFN, labels=c("Other Libraries", "First Nations Libraries"))), function (x) quantile(x$Titles.Held..Circulating…English.Language., na.rm=TRUE))
ggplot(libraries, aes(x=factor(isFN, labels=c("Other Libraries", "First Nations Libraries")), y=libraries[,189])) + geom_violin(fill="darkgreen") + scale_x_discrete(name="Library Type") + scale_y_continuous(name="# English Titles in Circulation / Local Pop. Size") + ggtitle("Distribution of English Titles in Circulation per Resident by Library Type")
eng.titles.by.FN = ddply(libraries, .(isFN.Library = factor(isFN, labels=c("Other Libraries", "First Nations Libraries"))), function (x) quantile(x$Titles.Held..Circulating…English.Language..P, na.rm=TRUE))
ggplot(libraries, aes(x=factor(isFN, labels=c("Other Libraries", "First Nations Libraries")), y=libraries[,252])) + geom_violin(fill="darkgreen") + scale_x_discrete(name="Library Type") + scale_y_continuous(name="Annual Program Attendance / Local Pop. Size") + ggtitle("Distribution of Annual Program Attendance \nper Resident by Library Type")
prog.attendance.by.FN = ddply(libraries, .(isFN.Library = factor(isFN, labels=c("Other Libraries", "First Nations Libraries"))), function (x) quantile(x$Annual.program.attendance.P, na.rm=TRUE))
# Now for the mapping!
# zipcodeset taken from geocoder.ca
postals = read.csv("zipcodeset.txt", header=FALSE)
names(postals) = c("Postal", "Lat","Lon","City", "Prov")
libraries = merge(libraries, postals[,1:3], by.x="Postal.Code", by.y="Postal", all.x=TRUE)
libraries.fn = subset(libraries, isFN == 1)
ontario = qmap("ontario", zoom=5)
ontario + geom_point(aes(x=Lon, y=Lat, colour=rev.minus.cost), data=libraries.fn, alpha=.7, size=6) + scale_colour_continuous(low="red", high="green", name="Net Profit per \nResident Population") + ggtitle("Net Profit per Local Resident\nAmongst First Nations Libraries") + theme(plot.title=element_text(size=20,face="bold"))
ontario + geom_point(aes(x=Lon, y=Lat, colour=rev.minus.cost), data=subset(libraries.fn, rev.minus.cost < 0), alpha=.7, size=6) + scale_colour_continuous(low="darkred", high="pink", name="Net Profit per \nResident Population") + ggtitle("Net Profit per Local Resident\nAmongst First Nations Libraries\n(Only those with a net loss)") + theme(plot.title=element_text(size=20,face="bold"))
ontario + geom_point(aes(x=Lon, y=Lat, colour=Total.Operating.Revenues/Total.Operating.Expenditures), data=libraries.fn, alpha=.7, size=6) + scale_colour_continuous(low="red", high="green", name="Operating Revenue \nto Cost Ratio") + ggtitle("Operating Revenue to Cost Ratio\nAmongst First Nations Libraries") + theme(plot.title=element_text(size=20,face="bold"))

Nuclear vs Green Energy: Share the Wealth or Get Your Own?

Thanks to Ontario Open Data, a survey dataset was recently made public containing peoples’ responses to questions about Ontario’s Long Term Energy Plan (LTEP).  The survey did fairly well in terms of raw response numbers, with 7,889 responses in total (although who knows how many people it was sent to!).  As you’ll see in later images in this post, two major goals of Ontario’s LTEP is to eliminate power generation from coal, and to maximize savings by encouraging lots of conservation.

For now though, I’ll introduce my focus of interest: Which energy sources did survey respondents think should be relied on in the future, and how does that correlate with their views on energy management/sharing?

As you can see in the graph below, survey respondents were given a 7 point scale and asked to use it to rate the importance of different energy source options (scale has been flipped so that 7 is the most important and 1 is the least).  Perhaps it’s my ignorance of this whole discussion, but it surprised me that 76% of respondents rated Nuclear power as at least a 5/7 on a scale of importance!  Nuclear power?  But what about Chernobyl and Fukushima?  To be fair, although terribly dramatic and devastating, those were isolated incidents.  Also, measures have been taken to ensure our current nuclear reactors are and will be disaster safe.  Realistically, I think most people don’t think about those things!  A few other things to notice here: conservation does have its adherents, with 37% giving a positive response.  Also, I think it was surprising (and perhaps saddening) to see that green energy has so few adherents, proportionately speaking.

Survey: Importance of Energy Sources

After staring at this graph for a while, I had the idea to see what interesting differences I could find between people who supported Nuclear energy versus those who support Green energy.  What I found is quite striking in its consistency:

  1. Those who believe Nuclear energy is important for Ontario’s future mix of energy sources seem to be more confident that there’s enough energy to share between regions and that independence in power generation is not entirely necessary.
  2. On the flip side, those who believe Green energy is important for Ontario’s future mix of energy sources seem to be more confident that there isn’t enough energy to share between regions and that independence in power generation should be nurtured.

See for yourself in the following graphs:

Survey: Regions Should Make Conservation their First Priority

Survey: Self Sustaining Regions

Survey: Region Responsible for Growing Demand

Survey: Regions buy Power

Does this make sense in light of actual facts?  The graph below comes from a very digestible page set up by the Ontario Ministry of Energy to communicate its Long Term Energy Plan.  As they make pretty obvious, Nuclear energy accounts for over half of energy production in Ontario in 2013, whereas the newer green energy sources (Solar, Bioenergy, Wind vs. Hydro ) amount to about 5%.  In their forecast for 2032, they are hopeful that they will account for 13% of energy production in Ontario.  Still not the lion’s share of energy, but if you add that to the 22% accounted for by Hydro, then you get 35% of all energy production, which admittedly isn’t bad!  Still, I wonder what people were thinking of when they saw “Green energy” on the survey.  If the new sources, then I think what is going on here is that perhaps people who advocate for Green energy sources such as wind and solar have an idea how difficult it is to power a land mass such as Ontario with these kinds of power stations.  People advocating for Nuclear, on the other hand, are either blissfully ignorant, or simply understand that Nuclear power plants are able to serve a wider area.
MOE: Screenshot from 2013-12-08 13:28:04

MOE: Screenshot from 2013-12-08 13:41:06

All of this being said, as you can see in the image above, the Ontario Provincial Government actually wants to *reduce* our province’s reliance on Nuclear energy in the next 5 years, and in fact they will not be building new reactors.  I contacted Mark Smith, Senior Media Relations Coordinator of the Ontario Ministry of Energy to ask him to comment about the role of Nuclear energy in the long run.  Following are some tidbits that he shared with me over email:

Over the past few months, we have had extensive consultations as part of our review of Ontario’s Long Term Energy Plan (LTEP). There is a strong consensus that now is not the right time to build new nuclear.

Ontario is currently in a comfortable supply situation and it does not require the additional power.

We will continue to monitor the demand and supply situation and look at building new nuclear in the future, if the need arises.

Nuclear power has been operating safely in our province for over 40 years, and is held to the strictest regulations and safety requirements to ensure that the continued operation of existing facilities, and any potential new build are held to the highest standards.

We will continue with our nuclear refurbishment plans for which there was strong province-wide support during the LTEP consultations.

During refurbishment, both OPG and Bruce Power will be subject to the strictest possible oversight to ensure safety, reliable supply and value for ratepayers.

Nuclear refurbishments will create thousands of jobs and extend the lives of our existing fleet for another 25-30 years, sustaining thousands of highly-skilled and high-paying jobs.

The nuclear sector will continue be a vital, innovative part of Ontario, creating new technology which is exported around the world.

Well, even Mr. Mark Smith seems confident about Nuclear energy!  I tried to contact the David Suzuki Foundation to see if they’d have anything to say on the role of Green Energy in Ontario’s future, but they were unavailable for comment.

Well, there you have it!  Despite confidence in Nuclear energy as a viable source for the future, the province will be increasing its investments in both Green energy and conservation!  Here’s hoping for an electric following decade 🙂

(P.S. As usual, the R code follows)

ltep = read.csv("ltep-survey-results-all.csv")
# Here I flip the scoring
ltep[,13:19] = sapply(ltep[,13:19], function (x) 8 x)
deal.w.esources = likert(ltep[,13:19])
plot(deal.w.esources, text.size=6, text.color="black") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold")) + ggtitle("What guidelines should Ontario use\n for its future mix of energy sources?")
# following is a lot of boring code where I
# generate numerous data frames that contain percent of
# respondents who agree with the statement by their rated
# importance of Nuclear vs. Green Energy
# Here's the nuclear section
self.sustaining.region.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$A.region.should.be.responsible.for.generating.at.least.some.of.its.own.power. == "Agree", na.rm=TRUE))
self.sustaining.region.by.nuke = self.sustaining.region.by.nuke[1:7,]
region.buy.power.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$Regions.should.have.the.option.to.buy.all.of.their.power.from.sources.in.another.region..if.the.power.is.available. == "Agree", na.rm=TRUE))
region.buy.power.by.nuke = region.buy.power.by.nuke[1:7,]
region.resp.for.growing.demand.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$If.a.region.s.power.demand.is.growing..it.should.be.responsible.for.building.the.generation.or.transmission.to.supply.that.it.needs. == "Agree", na.rm=TRUE))
region.resp.for.growing.demand.by.nuke = region.resp.for.growing.demand.by.nuke[1:7,]
regions.make.cons.first.priority.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$Regions.should.make.conservation.their.first.priority.to.reduce.the.need.for.new.supply. == "Agree", na.rm=TRUE))
regions.make.cons.first.priority.by.nuke = regions.make.cons.first.priority.by.nuke[1:7,]
# Here's the green energy section
self.sustaining.region.by.green = ddply(ltep, .(Green.energy..e.g…wind..solar..is.our.best.option.), function (x) mean(x$A.region.should.be.responsible.for.generating.at.least.some.of.its.own.power. == "Agree", na.rm=TRUE))
self.sustaining.region.by.green = self.sustaining.region.by.green[1:7,]
region.buy.power.by.green = ddply(ltep, .(Green.energy..e.g…wind..solar..is.our.best.option.), function (x) mean(x$Regions.should.have.the.option.to.buy.all.of.their.power.from.sources.in.another.region..if.the.power.is.available. == "Agree", na.rm=TRUE))
region.buy.power.by.green= region.buy.power.by.green[1:7,]
region.resp.for.growing.demand.by.green = ddply(ltep, .(Green.energy..e.g…wind..solar..is.our.best.option.), function (x) mean(x$If.a.region.s.power.demand.is.growing..it.should.be.responsible.for.building.the.generation.or.transmission.to.supply.that.it.needs. == "Agree", na.rm=TRUE))
region.resp.for.growing.demand.by.green = region.resp.for.growing.demand.by.green[1:7,]
regions.make.cons.first.priority.by.green = ddply(ltep, .(Green.energy..e.g…wind..solar..is.our.best.option.), function (x) mean(x$Regions.should.make.conservation.their.first.priority.to.reduce.the.need.for.new.supply. == "Agree", na.rm=TRUE))
regions.make.cons.first.priority.by.green = regions.make.cons.first.priority.by.green[1:7,]
# in this section I alternate between getting each data frame ready for plotting
# and obviously plotting the data frame. I liked the yellow and green that I used in the plot
# to represent Nuclear vs. Green energy 🙂
self.sustaining.region = rbind(data.frame(Importance=seq(1,7),Pct.Agree=self.sustaining.region.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=self.sustaining.region.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(self.sustaining.region, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("A Region Should be Responsible for Generating at Least Some of Its Own Power\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))
region.buy.power = rbind(data.frame(Importance=seq(1,7),Pct.Agree=region.buy.power.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=region.buy.power.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(region.buy.power, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("Regions Should Have the Option to Buy All of their Power\nfrom Sources in Another Region if Available\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))
region.resp.for.growing.demand = rbind(data.frame(Importance=seq(1,7),Pct.Agree=region.resp.for.growing.demand.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=region.resp.for.growing.demand.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(region.resp.for.growing.demand, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("If a Region's Power Demand is Growing\nIt Should be Responsible for Building the Transmission/Supply it Needs\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))
regions.make.cons.first.priority = rbind(data.frame(Importance=seq(1,7),Pct.Agree=regions.make.cons.first.priority.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=regions.make.cons.first.priority.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(regions.make.cons.first.priority, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("Regions Should Make Conservation Their First Priority \nto Reduce the Need for New Supply\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))

view raw


hosted with ❤ by GitHub

When did “How I Met Your Mother” become less legen.. wait for it…

…dary!  Or, as you’ll see below, when did it become slightly less legendary?  The analysis in this post was inspired by DiffusePrioR’s analysis of when The Simpsons became less Cromulent. When I read his post a while back, I thought it was pretty cool and told myself that I would use his method on another relevant dataset in the future.

Enter IMDB average user ratings of every episode of How I Met Your Mother (until season 9, episode 5).  Once I brought up the page showing rated episodes by date, I simply copy and pasted the table into LibreOffice Calc, saved it as a csv, and then loaded it into R.  As you can see in DiffusePrioR’s post (and also in mine), the purpose of the changepoint package is to find changepoints, or abrupt baseline shifts, in your data.

Now, some R code:


himym = read.csv("himym.csv")
mean2.pelt = cpt.mean(himym$Rating, method="PELT")
plot(mean2.pelt, type='l', cpt.col='red',xlab="Episode",
     ylab='IMDB Avg Rating', cpt.width=2, main="Changepoints in IMDB Ratings of 'How I Met Your Mother'",ylim=c(0,10), xlim=c(0,200))

Changepoints in IMDB Ratings of HIMYM
Above we see that the changepoint method I used has detected two baselines for the rating data.  One baseline has its value at 8.1, and the other has its value at 7.6.  The first thing you have to admit here is that despite a downwards shift in the average rating, How I Met Your Mother still seems to be pretty popular and held in high regard.  That being said, I do feel as though the earlier episodes were better, and apparently the masses of IMDB users rating each episode generally agree!  So, what’s the first episode that marked the abrupt downwards shift in average rating for the show?

    Ep.Num   Ep.Title
161    8.1 Farhampton

Ah, Farhampton, Season 8, Episode 1.  That seems like a natural breakpoint!  But the low score doesn’t even come until Season 8 Episode 2.  Let’s see what people have to say about episodes 1 and 2 on IMDB (forgive their punctuation, spelling, and grammar):

“Boring return I thought as once again the writing was horrible, the show seems to want to go on forever and also some of the actors in the show deserve better then what they are getting week to week. Even at the end it seems we finally see the mother of the show but we do not see her face.”

“The first episode however.was not funny at all, the thrill of maybe finding who is the mother made me think that it was not too late, but then again the joke is on me. This second episode (The Pre-Nup) only made me think that this show will not find a good finale, using all the clichés on the book of the sitcom. I just don’t care about who is the mother anymore, I just want a honesty last season, showing the characters as human beings that learn with their mistakes and move on with their lifes.”

So it appears that some people were expecting the pace to quicken in Season 8, but were disappointed.  That being said, if you look back at the graph at the beginning of my post, you’ll see that although Season 8 was the start of abruptly lower ratings, the average ratings seemed to be dropping very slowly from well before that point.  Given that the rate of change is so slow, perhaps detecting the “beginning of the end” using the changepoints package is not sensitive enough for our purpose.

That’s why I decided to bin all How I Met Your Mother episodes into groups of 10 successive episodes each, and then calculate the proportion of episodes in each bin which have an average rating under 8.0.  You can see the graph below:

eps.under.8 = rollapply(himym$Rating, width=10, function (x) mean(x < 8), by=10, partial=TRUE)
ep.nums = paste(((seq(1,19)-1)*10)+1, ((seq(1,19)-1)*10)+10, sep="-")
ep.nums[19] = "181-189"
ep.grouped.ratings = data.frame(ep.nums, eps.under.8)

ggplot(ep.grouped.ratings, aes(x=reorder(ep.nums, seq(1,19)), y=eps.under.8, group=1)) + geom_line(size=3, colour="dark red") + scale_y_continuous(name="Proportion of Epsidoes/10 With a Below 8.0 Avg Rating") + scale_x_discrete(name="Episode Number Grouping") + theme(axis.text.x=element_text(angle=-90, vjust=.5, colour="black", size=14), axis.text.y=element_text(colour="black", size=14), axis.title.x=element_text(size=16, face='bold'), axis.title.y=element_text(size=16, face="bold"), plot.title=element_text(size=20,face="bold")) + geom_vline(xintercept = 10,linetype='longdash') + ggtitle("How I Met Your Mother Ratings, Every 10 Episodes")

Grouped HIMYM Ratings

In the above graph, you can see that because I’ve set such a high threshold (8.0), the results become much more dramatic looking.  Until episodes 71-80, the show enjoyed a very high proportion of ratings of 8.0 or over (80% for 6 episode groups, 90% in one, and 100% in the first!).  It looks as though it wasn’t until episodes 91-100 that you can really notice that ratings were not staying so well over 8.0.  Which season and episodes were those?

    Ep.Num                           Ep.Title
91    5.30                          Robin 101
92    5.40              The Sexless Innkeeper
93    5.50                   Duel Citizenship
94    5.60                           Bagpipes
95    5.70                    The Rough Patch
96    5.80                       The Playbook
97    5.90 Slapsgiving 2: Revenge of the Slap
98    5.10                         The Window
99    5.11                Last Cigarette Ever
100   5.12                    Girls Vs. Suits

Well, now we have some evidence that quality started to go down earlier on in the show (although bear in mind we are still talking about a highly rated show here!).  Again, let’s go to IMDB comments to see what we can glean about these episodes:

(in reference to “The Sexless Inkeeper”) “So after waiting 9/10 months for Robin and Barney to finally stop looking for reasons not to admit their feelings to each other, Lily is euphoric when the two finally do concede that they love each other. And how does she celebrate this?, she invites them to an incredibly awkward, incredibly forward night that displays EVERYTHING that Barney and Robin hate about relationships and gives the couple a massive reason NOT to be together. Lily and Marshall then react, whence realising Robin and The Barnicle didn’t enjoy the evening, as if they are the couple that has been mistreated. Lily then dumps Robin and Barney without so much as a phone call and plays right into the pair’s issues of abandonment.”

(in reference to “Last Cigarette Ever”) “… “Last Cigarette Ever” is the worst one. There are many reasons why I feel this way, but the two big reasons are: 1. It made no sense. Up to this point, I believe only Robin had ever been shown smoking a cigarette. Now all of a sudden, they all are smokers. I’m not going to pretend that the writers haven’t made mistakes before, but this was a little ridiculous. 2. It felt like a half-hour Marlboro ad. I think the intent was to discourage smoking, but I personally felt like it did the opposite. Awful episode in an otherwise very good series.”

I’m finding there don’t tend to be many user reviews of episodes on IMDB, so this next one is from metacritic.com:

“How i met your mother is starting to annoy everyone as the jokes are repetitive, Ted, Marshall & Lilly have started to get boring. There were only 5-6 good episodes in the season which includes the PLAYBOOK. The story line and the characters seem to be awfully close to NBC’s ”F.R.I.E.N.D.S”, but friends steals the show with better writing and acting which HIMYM seems to lack. Only Neil’s role as Barney has helped CBS to keep this show on air for 5 seasons.”

In the end, it’s hard to attribute shifts in average rating to these few comments, but they at least they do highlight some of the same complaints I’ve had or heard about the show (although I didn’t care so much about the cigarette episode!!).  All we can say here is that the ratings started to decline around the start of season 5, and suffered a sharper (although still not grand scale) decline at the beginning of season 8.

Estimating Ages from First Names Part 2 – Using Some Morbid Test Data

In my last post, I wrote about how I compiled a US Social Security Agency data set into something usable in R, and mentioned some issues scaling it up to be usable for bigger datasets.  I also mentioned the need for data to test out the accuracy of my estimates.  First, I’ll show you how I prepped the dataset that it became more scalable (for the code that got us here, see my last post):

name_data_wavgpop_unisex = ddply(name_data, .(Name), function (x) sum(x$Rel_Pop*as.numeric(x$Year))/sum(x$Rel_Pop))
name_data_wavgpop_unisex$V1 = round(name_data_wavgpop_unisex$V1,0)

Above I’ve taken a different tactic to predicting expected year of birth according to name than I started out with in my last post. Here I’m using the relative popularity of the names in each year as weights for each year value. Multiplying them by the years, I get a weighted average of Year that gives me predicted year of birth. Then I round off the predictions to the nearest integer and continue on my way. Also, because test data doesn’t seem to come packaged with gender info, I’ve constructed the weighted averages using all relative popularity values for each name, regardless of whether or not that name has been used for both sexes (a.k.a. “Jordan”).

Now enter the test data. I’ve discovered that the easiest way of getting real names and ages off the internet is by looking for lists of victims of some horrible tragedy. The biggest such list of victims I could find was a list of 9/11 victims.  It’s not exactly formatted for easy analysis, and I was too lazy to get the data programatically, so I just copy-pasted into LibreOffice Calc the names and ages from the first 4 lists on the page (all from either American Airlines, or United Airlines) for a total of 285 observations.  I then extracted the first names, and then imported the first names and ages into R.

worldtrade = read.csv("world trade.csv")
worldtrade.ages = sqldf("SELECT a.*, b.V1 as Year FROM [worldtrade] AS a LEFT JOIN name_data_wavgpop_unisex AS b on a.Name == b.Name")
worldtrade.ages$Pred.Age = 2001 - as.numeric(worldtrade.ages$Year)

As you can see, I opted to use sqldf to append the appropriate predicted birth years for each name on the list I imported. I then got the predicted ages by subtracting each predicted birth year from 2001. Finally, let’s have a look at the resulting model fit (showing how close each predicted age was to the real age of the victim at the time of death):

Predicted vs Real Ages for 911 Victims




As you can see, it’s not a tight prediction in the least bit.  According to model fit statistics, there’s an adjusted r-squared of 14.6% and a residual standard error of 15.58 years.  You can also see from the scatter plot that the prediction doesn’t become reasonably linear until about age 30 and onwards.  Overall, I’d say it’s not too impressive, and I’d imagine it’s even worse for predicting who’s under 10 years old!

Well, this was fun (if only a little disappointing).  That’s statistics for you – sometimes it confirms, and sometimes it humbles you.  If you think you can show me a better way of using this name trending data to better predict ages than what I’ve done here, feel free to show me!

Which Torontonians Want a Casino? Survey Analysis Part 2

In my last post I said that I would try to investigate the question of who actually does want a casino, and whether place of residence is a factor in where they want the casino to be built.  So, here goes something:

The first line of attack in this blog post is to distinguish between people based on their responses to the third question on the survey, the one asking people to rate the importance of a long list of issues.  When I looked at this list originally, I knew that I would want to reduce the dimensionality using PCA.

issues.pca = principal(casino[,8:23], 3, rotate="varimax",scores=TRUE)

The PCA resulted in the 3 components listed in the table below.  The first component had variables loading on to it that seemed to relate to the casino being a big attraction with lots of features, so I named it “Go big or Go Home”.  On the second component there seemed to be variables loading on to it that related to technical details, while the third component seemed to have variables loading on to it that dealt with social or environmental issues.

Go Big or Go Home Concerned with Technical Details Concerned with Social/Environmental Issues or not Issue/Concern
Q3_A 0.181 0.751 Design of the facility
Q3_B 0.366 0.738 Employment Opportunities
Q3_C 0.44 0.659 Entertainment and cultural activities
Q3_D 0.695 0.361 Expanded convention facilities
Q3_E 0.701 0.346 Integration with surrounding areas
Q3_F 0.808 0.266 New hotel accommodations
Q3_G -0.117 0.885 Problem gambling & health concerns
Q3_H 0.904 Public safety and social concerns
Q3_I 0.254 0.716 Public space
Q3_J 0.864 0.218 Restaurants
Q3_K 0.877 0.157 Retail
Q3_L 0.423 0.676 -0.1 Revenue for the city
Q3_M 0.218 0.703 0.227 Support for local businesses
Q3_N 0.647 0.487 -0.221 Tourist attraction
Q3_O 0.118 0.731 Traffic concerns
Q3_P 0.497 0.536 0.124 Training and career development

Once I was satisfied that I had a decent understanding of what the PCA was telling me, I loaded the component scores into the original dataframe.

casino[,110:112] = issues.pca$scores
names(casino)[110:112] = c("GoBigorGoHome","TechnicalDetails","Soc.Env.Issues")

In order to investigate the question of who wants a casino and where, I decided to use question 6 as a dependent variable (the one asking where they would want it built, if one were to be built) and the PCA components as independent variables.  This is a good question to use, because the answer options, if you remember, are “Toronto”, “Adjacent Municipality” and “Neither”.  My approach was to model each response individually using logistic regression.

casino$Q6[casino$Q6 == ""] = NA
casino$Q6 = factor(casino$Q6, levels=c("Adjacent Municipality","City of Toronto","Neither"))

adj.mun = glm(casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
toronto = glm(casino$Q6 == "City of Toronto" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
neither = glm(casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))

Following are the summaries of each GLM:

glm(formula = casino$Q6 == "City of Toronto" ~ GoBigorGoHome +
TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
data = casino)
Deviance Residuals:
Min 1Q Median 3Q Max
3.6426 0.4745 0.1156 0.4236 3.4835
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.58707 0.04234 37.48 <2e-16 ***
GoBigorGoHome 1.76021 0.03765 46.75 <2e-16 ***
TechnicalDetails 1.77155 0.05173 34.24 <2e-16 ***
Soc.Env.Issues 1.63057 0.04262 38.26 <2e-16 ***
Signif. codes: 0***0.001**0.01* ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13537 on 10365 degrees of freedom
Residual deviance: 6818 on 10362 degrees of freedom
(7400 observations deleted due to missingness)
AIC: 6826
Number of Fisher Scoring iterations: 6

Adjacent municipality:

glm(formula = casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome +
TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
data = casino)
Deviance Residuals:
Min 1Q Median 3Q Max
1.0633 0.7248 0.5722 0.3264 2.7136
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.45398 0.02673 54.394 < 2e-16 ***
GoBigorGoHome 0.41989 0.02586 16.239 < 2e-16 ***
TechnicalDetails 0.18764 0.02612 7.183 6.82e-13 ***
Soc.Env.Issues 0.52325 0.03221 16.243 < 2e-16 ***
Signif. codes: 0***0.001**0.01* ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10431.8 on 10365 degrees of freedom
Residual deviance: 9756.4 on 10362 degrees of freedom
(7400 observations deleted due to missingness)
AIC: 9764.4
Number of Fisher Scoring iterations: 5

Neither location:

glm(formula = casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails +
Soc.Env.Issues, family = binomial(logit), data = casino)
Deviance Residuals:
Min 1Q Median 3Q Max
2.4090 0.7344 0.3934 0.8966 2.7194
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.22987 0.02415 9.517 <2e-16 ***
GoBigorGoHome 0.85050 0.02462 34.549 <2e-16 ***
TechnicalDetails 1.00182 0.02737 36.597 <2e-16 ***
Soc.Env.Issues 0.69707 0.02584 26.972 <2e-16 ***
Signif. codes: 0***0.001**0.01* ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 14215 on 10365 degrees of freedom
Residual deviance: 10557 on 10362 degrees of freedom
(7400 observations deleted due to missingness)
AIC: 10565
Number of Fisher Scoring iterations: 4

And here is a quick summary of the above GLM information:
Summary of Casino GLMs

Judging from these results, it looks like those who want a casino in Toronto don’t focus on the big social/environmental issues surrounding the casino, but do focus on the flashy and non-flashy details and benefits alike.  Those who want a casino outside of Toronto do care about the social/environmental issues, don’t care as much about the flashy details, but do have a focus on some of the non-flashy details.  Finally, those not wanting a casino in either location care about the social/environmental issues, but don’t care about any of the details.

Here’s where the issue of location comes into play.  When I look at the summary for the GLM that predicts who wants a casino in an adjacent municipality, I get the feeling that it’s picking up people living in the down-town core who just don’t think the area can handle a casino.  In other words, I think there might be a “not in my backyard!” effect.

The first inkling that this might be the case comes from an article from the Martin Prosperity Institute (MPI), who analyzed the same data set, and managed to get a very nice looking heat map-map of the responses to the first question on the survey, asking people how they feel about having a new casino in Toronto.  From this map, it does look like people in Downtown Toronto are feeling pretty negative about a new casino, whereas those in the far east and west of Toronto are feeling better about it.

My next evidence comes from the cities uncovered by geocoding the responses in the data set.  I decided to create a very simple indicator variable, distinguishing those for whom the “City” is Toronto, and those for whom the city is anything else.  I like this better than the MPI analysis, because it looks at peoples’ attitudes towards a casino both inside and outside of Toronto (rather than towards the concept of a new Casino in Toronto).  If there really is a “not in my backyard!” effect, I would expect to see evidence that those in Toronto are more disposed towards a casino in an adjacent municipality, and that those from outside of Toronto are more disposed towards a casino inside Toronto!  Here we go:

ffload(file="casino", overwrite=TRUE)
casino.orig$Outside.of.Toronto = as.ff(ifelse(casino.orig[,"City"] == "Toronto",0,1))
casino.in.toronto = glm(casino.orig[,"Q6"] == "City of Toronto" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
casino.outside.toronto = glm(casino.orig[,"Q6"] == "Adjacent Municipality" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
glm(formula = casino.orig[, "Q6"] == "City of Toronto" ~ Outside.of.Toronto,
family = binomial(logit), data = casino.orig)
Deviance Residuals:
Min 1Q Median 3Q Max
0.9132 0.9132 0.7205 1.4669 1.7179
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.21605 0.02600 46.77 <2e-16 ***
Outside.of.Toronto 0.55712 0.03855 14.45 <2e-16 ***
Signif. codes: 0***0.001**0.01* ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16278 on 13881 degrees of freedom
Residual deviance: 16070 on 13880 degrees of freedom
(3884 observations deleted due to missingness)
AIC: 16074
Number of Fisher Scoring iterations: 4
glm(formula = casino.orig[, "Q6"] == "Adjacent Municipality" ~
Outside.of.Toronto, family = binomial(logit), data = casino.orig)
Deviance Residuals:
Min 1Q Median 3Q Max
0.7280 0.7280 0.5554 0.5554 1.9726
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19254 0.02583 46.16 <2e-16 ***
Outside.of.Toronto 0.59879 0.04641 12.90 <2e-16 ***
Signif. codes: 0***0.001**0.01* ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13786 on 13881 degrees of freedom
Residual deviance: 13611 on 13880 degrees of freedom
(3884 observations deleted due to missingness)
AIC: 13615
Number of Fisher Scoring iterations: 4
casino.loc.by.city.res = as.matrix(table(casino.orig[,"Outside.of.Toronto"], casino.orig[,"Q6"])[,2:4], dimnames=list(c("Those Living Inside the City of Toronto", "Those Living Outside the City of Toronto"),colnames(casino.loc.by.city.res)))
rownames(casino.loc.by.city.res) = c("Those Living Inside the City of Toronto", "Those Living Outside the City of Toronto")
casino.loc.by.city.res = melt(prop.table(casino.loc.by.city.res,1))
names(casino.loc.by.city.res) = c("City of Residence","Where to Build the Casino","value")
ggplot(casino.loc.by.city.res, aes(x=casino.loc.by.city.res$"Where to Build the Casino", y=value, fill=casino.loc.by.city.res$"City of Residence")) + geom_bar(position="dodge") + scale_y_continuous(labels=percent) + theme_wsj() + scale_fill_discrete(name="City of Residence") + ggtitle("If a casino is built, where would you prefer to see it located?")

view raw


hosted with ❤ by GitHub

Where located by city of residenceAs you can see here, those from the outside of Toronto are more likely to suggest building a casino in Toronto compared with those from the inside, and less likely to suggest building a casino in an adjacent municipality (with the reverse being true about those from the inside of Toronto).

That being said, when you do the comparison within city of residence (instead of across it like I just did), those from the inside of Toronto seem equally likely to suggest that the casino be built in our outside of the city, whereas those outside are much more likely to suggest building the casino inside Toronto than outside.  So, depending on how you view this graph, you might only say there’s evidence for a “not in my backyard!” effect for those living outside of Toronto.

As a final note, I’ll remind you that although these analyses point to which Torontonians do want a new casino, the fact from this survey remains that about 71% of respondents are unsupportive of a casino in Toronto, and 53% don’t want a casino built in either Toronto or and adjacent municipality.  I really have to wonder if they’re still going to go ahead with it!

Multiple Classification and Authorship of the Hebrew Bible

Sitting in my synagogue this past Saturday, I started thinking about the authorship analysis that I did using function word counts from texts authored by Shakespeare, Austen, etc.  I started to wonder if I could do something similar with the component books of the Torah (Hebrew bible).

A very cursory reading of the Documentary Hypothesis indicates that the only books of the Torah supposed to be authored by one person each were Vayikra (Leviticus) and Devarim (Deuteronomy).  The remaining three appear to be a confusing hodgepodge compilation from multiple authors.  I figured that if I submitted the books of the Torah to a similar analysis, and if the Documentary Hypothesis is spot-on, then the analysis should be able to accurately classify only Vayikra and Devarim.

The theory with the following analysis (taken from the English textual world, of course) seems to be this: When someone writes a book, they write with a very particular style.  If you are going to be able to detect that style, statistically, it is convenient to detect it using function words.  Function words (“and”, “also”, “if”, “with”, etc) need to be used regardless of content, and therefore should show up throughout the text being analyzed.  Each author uses a distinct number/proportion of each of these function words, and therefore are distinguishable based on their profile of usage.

With that in mind, I started my journey.  The first steps were to find an online source of Torah text that I could easily scrape for word counts, and then to figure out which hebrew function words to look for.  For the Torah text, I relied on the inimitable Chabad.org.  They hired good rational web developer(s) to make their website, and so looping through each perek (chapter) of the Torah was a matter of copying and pasting html page numbers from their source code.

Several people told me that I’d be wanting for function words in Hebrew, as there are not as many as in English.  However, I found a good 32 of them, as listed below:

Transliteration Hebrew Function Word Rough Translation Word Count
al עַל Upon 1262
el אֶל To 1380
asher אֲשֶׁר That 1908
ca_asher כַּאֲשֶׁר As 202
et אֶת (Direct object marker) 3214
ki כִּי For/Because 1030
col וְכָל + כָּל + לְכָל + בְּכָל + כֹּל All 1344
ken כֵּן Yes/So 124
lachen לָכֵן Therefore 6
hayah_and_variants תִּהְיֶינָה + תִּהְיֶה + וְהָיוּ + הָיוּ + יִהְיֶה + וַתְּהִי + יִּהְיוּ + וַיְהִי + הָיָה Be 819
ach אַךְ But 64
byad בְּיַד By 32
gam גַם Also/Too 72
mehmah מֶה + מָה What 458
haloh הֲלֹא Was not? 17
rak רַק Only 59
b_ad בְּעַד For the sake of 5
loh לֹא No/Not 1338
im אִם If 332
al2 אַל Do not 217
ele אֵלֶּה These 264
haheehoo הַהִוא + הַהוּא That 121
ad עַד Until 324
hazehzot הַזֶּה + הַזֹּאת + זֶה + זֹאת This 474
min מִן From 274
eem עִם With 80
mi מִי Who 703
oh אוֹ Or 231
maduah מַדּוּעַ Why 10
etzel אֵצֶל Beside 6
heehoo הִוא + הוּא + הִיא Him/Her/It 653
az אָז Thus 49

This list is not exhaustive, but definitely not small!  My one hesitation when coming up with this list surrounds the Hebrew word for “and”.  “And” takes the form of a single letter that attaches to the beginning of a word (a “vav” marked with a different vowel sound depending on its context), which I was afraid to try to extract because I worried that if I tried to count it, I would mistakenly count other vav’s that were a valid part of a word with a different meaning.  It’s a very frequent word, as you can imagine, and its absence might very well affect my subsequent analyses.

Anyhow, following is the structure of Torah:

‘Chumash’ / Book Number of Chapters
‘Bereishit’ / Genesis 50
‘Shemot’ / Exodus 40
‘Vayikra’ / Leviticus 27
‘Bamidbar’ / Numbers 36
‘Devarim’ / Deuteronomy 34

Additionally, I’ve included a faceted histogram below showing the distribution of word-counts per chapter by chumash/book of the Torah:

m = ggplot(torah, aes(x=wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

Word Count Dist by Chumash

You can see that the books are not entirely different in terms of word counts of the component chapters, except for the possibility of Vayikra, which seems to tend towards the shorter chapters.

After making a Python script to count the above words within each chapter of each book, I loaded it up into R and split it into a training and testing sample:

torah$randu = runif(187, 0,1)
torah.train = torah[torah$randu <= .4,] torah.test = torah[torah$randu > .4,]

For this analysis, it seemed that using Random Forests made the most sense.  However, I wasn’t quite sure if I should use the raw counts, or proportions, so I tried both. After whittling down the variables in both models, here are the final training model definitions:

torah.rf = randomForest(chumash ~ al + el + asher + caasher + et + ki + hayah + gam + mah + loh + haheehoo + oh + heehoo, data=torah.train, ntree=5000, importance=TRUE, mtry=8)

torah.rf.props = randomForest(chumash ~ al_1 + el_1 + asher_1 + caasher_1 + col_1 + hayah_1 + gam_1 + mah_1 + loh_1 + im_1 + ele_1 + mi_1 + oh_1 + heehoo_1, data=torah.train, ntree=5000, importance=TRUE, mtry=8)

As you can see, the final models were mostly the same, but with a few differences. Following are the variable importances from each Random Forests model:

> importance(torah.rf)

 Word MeanDecreaseAccuracy MeanDecreaseGini
hayah 31.05139 5.979567
heehoo 20.041149 4.805793
loh 18.861843 6.244709
mah 18.798385 4.316487
al 16.85064 5.038302
caasher 15.101464 3.256955
et 14.708421 6.30228
asher 14.554665 5.866929
oh 13.585169 2.38928
el 13.010169 5.605561
gam 5.770484 1.652031
ki 5.489 4.005724
haheehoo 2.330776 1.375457

> importance(torah.rf.props)

Word MeanDecreaseAccuracy MeanDecreaseGini
asher_1 37.074235 6.791851
heehoo_1 29.87541 5.544782
al_1 26.18609 5.365927
el_1 17.498034 5.003144
col_1 17.051121 4.530049
hayah_1 16.512206 5.220164
loh_1 15.761723 5.157562
ele_1 14.795885 3.492814
mi_1 12.391427 4.380047
gam_1 12.209273 1.671199
im_1 11.386682 2.651689
oh_1 11.336546 1.370932
mah_1 9.133418 3.58483
caasher_1 5.135583 2.059358

It’s funny that the results, from a raw numbers perspective, show that hayah, the hebrew verb for “to be”, shows at the top of the list.  That’s the same result as in the Shakespeare et al. analysis!  Having established that all variables in each model had some kind of an effect on the classification, the next task was to test each model on the testing sample, and see how well each chumash/book of the torah could be classified by that model:

> torah.test$pred.chumash = predict(torah.rf, torah.test, type="response")
> torah.test$pred.chumash.props = predict(torah.rf.props, torah.test, type="response")

> xtabs(~torah.test$chumash + torah.test$pred.chumash)
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            4            5          2         8          7
       'Bereishit'           1           14          1        14          2
       'Devarim'             1            2         17         0          1
       'Shemot'              2            4          4         9          2
       'Vayikra'             5            0          4         0          5

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash),1)
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.15384615   0.19230769 0.07692308 0.30769231 0.26923077
       'Bereishit'  0.03125000   0.43750000 0.03125000 0.43750000 0.06250000
       'Devarim'    0.04761905   0.09523810 0.80952381 0.00000000 0.04761905
       'Shemot'     0.09523810   0.19047619 0.19047619 0.42857143 0.09523810
       'Vayikra'    0.35714286   0.00000000 0.28571429 0.00000000 0.35714286

> xtabs(~torah.test$chumash + torah.test$pred.chumash.props)
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            0            5          0        13          8
       'Bereishit'           1           16          0        13          2
       'Devarim'             0            2         11         4          4
       'Shemot'              1            4          2        13          1
       'Vayikra'             3            3          0         0          8

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash.props),1)
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.00000000   0.19230769 0.00000000 0.50000000 0.30769231
       'Bereishit'  0.03125000   0.50000000 0.00000000 0.40625000 0.06250000
       'Devarim'    0.00000000   0.09523810 0.52380952 0.19047619 0.19047619
       'Shemot'     0.04761905   0.19047619 0.09523810 0.61904762 0.04761905
       'Vayikra'    0.21428571   0.21428571 0.00000000 0.00000000 0.57142857

So, from the perspective of the raw number of times each function word was used, Devarim, or Deuteronomy, seemed to be the most internally consistent, with 81% of the chapters in the testing sample correctly classified. Interestingly, from the perspective of the proportion of times each function word was used, we see that Devarim, Shemot, and Vayikra (Deuteronomy, Exodus, and Leviticus) had over 50% of their chapters correctly classified in the training sample.

I’m tempted to say here, at the least, that there is evidence that at least Devarim was written with one stylistic framework in mind, and potentially one distinct author. From a proportion point of view, it appears that Shemot and Vayikra also show an internal consistency suggestive of close to one stylistic framework, or possibly a distinct author for each book. I’m definitely skeptical of my own analysis, but what do you think?

The last part of this analysis comes from a suggestion given to me by a friend, which was that once I modelled the unique profiles of function words within each book of the Torah, I should use that model on some post-Biblical texts.  Apparently one idea is that the “Deuteronomist Source” was also behind the writing of Joshua, Judges, and Kings.  If the same author was behind all four books, then when I train my model on these books, they should tend to be classified by my model as Devarim/Deuteronomy, moreso than other books.

As above, below I show the distribution of word count by book, for comparison’s sake:

> m = ggplot(neviim, aes(wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

Word Count Dist by Prophets Book

Interestingly, it seems as though chapters in these particular post-Biblical texts seem to be a bit longer, on average, than those in the Torah.

Next, I gathered counts of the same function words in Joshua, Judges, and Kings as I had for the 5 books of the Torah, and tested my random forests Torah model on them.  As you can see below, the result was anything but clear on that matter:

> xtabs(~neviim$chumash + neviim$pred.chumash)
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           3            7          7         6          1
      'Judges'           2           11          2         6          0
      'Kings'            0            8          3        10          1

> xtabs(~neviim$chumash + neviim$pred.chumash.props)
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           2            8          6         7          1
      'Judges'           0            9          2         9          1
      'Kings'            0            7          6         7          2

I didn’t even bother to re-express this table into fractions, because it’s quite clear that each  book of the prophets that I analyzed didn’t seem to be clearly classified in any one category.  Looking at these tables, there doesn’t yet seem to me to be any evidence, from this analysis, that whoever authored Devarim/Deuteronomy also authored these post-biblical texts.


I don’t think that this has been a full enough analysis.  There are a few things in it that bother me, or make me wonder, that I’d love input on.  Let me list those things:

  1. As mentioned above, I’m missing the inclusion of the Hebrew “and” in this analysis.  I’d like to know how to extract counts of the Hebrew “and” without extracting counts of the Hebrew letter “vav” where it doesn’t signify “and”.
  2. Similar to my exclusion of “and”, there are a few one letter prepositions that I have not included as individual predictor variables.  Namely ל, ב, כ, מ, signifying “to”, “in”/”with”, “like”, and “from”.  How do I count these without counting the same letters that begin a different word and don’t mean the same thing?
  3. Is it more valid to consider the results of my analyses that were done on the word frequencies as proportions (individual word count divided by total number of words in the chapter), or are both valid?
  4. Does a list exist somewhere that details, chapter-by-chapter, which source is believed to have written the Torah text, according to the Documentary Hypothesis, or some more modern incarnation of the Hypothesis?  I feel that if I were able to categorize the chapters specifically, rather than just attributing them to the whole book (as a proxy of authorship), then the analysis might be a lot more interesting.

All that being said, I’m intrigued that when you look at the raw number of how often the function words were used, Devarim/Deuteronomy seems to be the most internally consistent.  If you’d like, you can look at the python code that I used to scrape the chabad.org website here: python code for scraping, although please forgive the rudimentary coding!  You can get the dataset that I collected for the Torah word counts here: Torah Data Set, and the data set that I collected for the Prophetic text word counts here: Neviim data set.  By all means, do the analysis yourself and tell me how to do it better 🙂

ggplot2: Creating a custom plot with two different geoms

This past week for work I had to create some plots to show the max, min, and median of a measure across the levels of a qualitative variable, and show the max and min of the same variable within a subgroup of the dataset.  To illustrate what I mean, I took a fun dataset from the data and story library and recreated the plot that I made at work.
The data are supposed to represent the results of a study done on how long it took subjects to complete a pen and paper maze while they were either smelling a floral scent or not.  Other than the time to completion (the main variable of interest), some other variables are recorded, like sex and whether or not the subject is a smoker.
 I represent the min, max, and median completion time that it took men and women to complete the maze with the “crossbar” geom, which show obvious overlap in performance across the genders (however it’s apparent that men were a bit more variable in their performance than women).  The seemingly interesting result is represented by the dots.  The dots show the min and max values for the smokers within each gender.  While smoking doesn’t appear to have an effect on performance amongst the women, it seems to be correlated with much slower performance amongst the men in the study.  Then again, I just re-checked the data, which show that there are only 2 male smokers and 6 female smokers, so the comparison seems pretty unreliable.
Stepping away from the study itself, what I like here is that you can call up several geoms in the same plot, passing different data subsets to each geom.  It allows for useful customization so that you can tackle problems that aren’t so cut and dried.
Finally, I realize that before putting this kind of graph into a report, I really should create a little legend that shows the reader that the ends of the boxes are max and mins, and the middle lines represent medians, and the dots represent max and mins of the subset.
Following is the code and the plot I created:

scents = read.table("clipboard",header=TRUE,sep="\t")
strial3.by.sex.wide = ddply(scents, 'Sex', function (x) quantile(x$S.Trial.3, c(0,.5,1), na.rm=TRUE))
strial3.by.sex.smokers = melt(ddply(subset(scents,Smoker == "Y") , 'Sex', function (x) quantile(x$S.Trial.3, c(0,1), na.rm=TRUE)),variable.name="Percentile",value.name="Time")
ggplot() + geom_crossbar(data=strial3.by.sex.wide, aes(x=Sex, y=strial3.by.sex.wide$"50%", ymin=strial3.by.sex.wide$"0%", ymax=strial3.by.sex.wide$"100%"),fill="#bcc927",width=.75) +
geom_point(data=strial3.by.sex.smokers, aes(x=Sex, y=Time, stat="identity"), size=3)
+ opts(legend.title = theme_text(size=10, face="bold"), legend.text = theme_text(size=10),
axis.text.x=theme_text(size=10), axis.text.y=theme_text(size=10,hjust=1), axis.title.x=theme_text(size=12,face="bold"), axis.title.y=theme_text(size=12, angle=90,
face="bold")) + scale_y_continuous(name="Time to Completion")

Bar Graph Colours That Work Well

Ever since I started using ggplot2 more often at work in order to do graphs, I’ve realized something about the use of colour in bar graphs vs. dot plots: When I’m looking at a graph displayed on the brilliant Viewsonic monitor I’m using at work, the same relatively intense colours that work well in a dot plot start to bother me in a bar graph.  Take the bar graph immediately below for example.  The colour choice is not a bad one, but there’s something about the intensity of the colours that makes me want to find a new set of colours somewhat more soothing to my eyes.

The first resource I found was a “Color Encyclopedia” website called Color Hex and started looking for colours that seemed more soothing and could be used to compare 3 bars against one another in a bar graph.  You can search for colour names, colours according to their hexadecimal values, or even browse their list of “web safe colors”.  I stumbled upon the particular purple displayed in the graph below, and it simply gave me the other colours in the triad as suggestions.

Looking at this triad of colours, I’m actually quite pleased, but I still didn’t really understand why these colours worked, and how to select a new triad that didn’t bother me.  I shuffled through many different colours on the Color Hex website, and nothing else seemed to work with me as I wasn’t selecting colours based on any theory.

Then I stumbled upon an article by the good people at Perceptual Edge.  They seemed to confirm my earlier statement about the same intense colours working well when used to colour dot plots not working so well in bar plots.   Their solution is a simple one: choose from a list of colours of medium intensity.  On page 6 of the document linked above, they offer 8 different hues that look nice in a bar plot.  All I had to do to use these colours in the below plots was take a screenshot of the document, bring it into Inkscape, and hover the eye-dropper tool over the colours to get the hexadecimal colour values.  If you’re interested in using the values, I typed them out at the bottom of my post.  Now take a look at the graphs below:

The two graphs above follow the same principle that I had unknowingly touched upon when I chose the colours from the Color Hex website: stick with medium intensity, and your eyes won’t be jarred by the colour contrast.  I like that!

Anyway, below I show you the code I used to manually input the hexadecimal colour values into my ggplot bar graphs, and the list of 8 hexadecimal colour values corresponding with the colour boxes on page 6 of the Perceptual Edge document.  The variables a, c, and b were just variable names from a mock data frame that I cooked up for the purpose of the plots.

> colours = c(“#599ad3”, “#f9a65a”, “#9e66ab”)
> ggplot(e, aes(x=a, y=c, fill=b, stat=”identity”)) + geom_bar(position=”dodge”) + coord_flip()+scale_fill_manual(values=colours)