KDD Cup 2015: The story of how I built hundreds of predictive models… and got so close, yet so far away from 1st place!

The challenge from the KDD Cup this year was to use their data relating to student enrollment in online MOOCs to predict who would drop out vs who would stay.

The short story is that using H2O and a lot of my free time, I trained several hundred GBM models looking for the final one which eventually got me an AUC score of 0.88127 on the KDD Cup leaderboard and at the time of this writing landed me in 120th place. My score is 2.6% away from 1st place, but there are 119 people above me!

Here are the main characters of this story:

MariaDB
MySQL Workbench
R
H2O

It started with my obsessive drive to find an analytics project to work on. I happened upon the KDD Cup 2015 competition and decided to give it a go. It had the characteristics of a project that I wanted to get into:

1) I could use it to practice my SQL skills
2) The data set was of a moderate size (training table was 120,542 records, log info table was 8,151,053 records!)
3) It looked like it would require some feature engineering
4) I like predictive modeling competitions 🙂

Once I had loaded the data into a MariaDB database, I had to decide how I would use the info in each table. Here was my thought process for each table:

enrollment_train / enrollment_test
Columns: enrollment_id, username, course_id

Simply put, from this table I extracted the number of courses each student (username) was enrolled in, and also the number of students enrolled in each course (course_id).

log_train / log_test
Columns: enrollment_id, tstamp, source, logged_event, object

There were a few items of information that I decided to extract from this table:

1) Number of times each particular event was logged for every enrollment_id
2) Average timestamp for each event for each enrollment_id
3) Min and Max timestamp for each event for each enrollment_id
4) Total time elapsed from the first to the last instance of each event for each enrollment_id
5) Overall average timestamp for each enrollment_id
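
I did all of this in SQL (the repo I link to further down has the actual views), but roughly equivalent aggregations in R with dplyr would look like the following sketch, assuming log_train is loaded as a data frame with a numeric tstamp column:

library(dplyr)

# Per-enrollment, per-event aggregations (items 1 through 4 above)
event_features = log_train %>%
  group_by(enrollment_id, logged_event) %>%
  summarise(n_events   = n(),
            avg_tstamp = mean(tstamp),
            min_tstamp = min(tstamp),
            max_tstamp = max(tstamp),
            etime      = max(tstamp) - min(tstamp))

# Overall average timestamp per enrollment (item 5 above)
enrollment_avg = log_train %>%
  group_by(enrollment_id) %>%
  summarise(overall_avg_tstamp = mean(tstamp))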

Contrary to what you might think, the object field does not seem to link up with the object table.

object
Columns: course_id, module_id, category, children, tstart

From this table I extracted a count of course components by course_id and also the number of ‘children’ per course_id. I assume these are relational references but am not sure what in the data set these child IDs refer to.

truth_train
Columns: enrollment_id, dropped_out

I didn’t extract anything special out of this table, but used it as the table to which all other SQL views that I had created were linked.

If you’d like to see the SQL code I used to prepare the tables, views, and the final output table I used to train the model, see my github repo for this project.

Import into R and Feature Engineering

Once I imported the data into R through RODBC, you’ll see in the code that my feature engineering was essentially a desperate fishing expedition where I tried a whole lot of stuff. I didn’t even end up using everything that I had engineered through my R code, but since my final model included 35 variables, I wasn’t suffering any severe lack! If you download the KDD Cup 2015 data and have a look around, feel free to let me know if I’ve missed any important variables!

H2O, Model Tuning, and Training of The Final Model

This is the part where I managed to train hundreds of models! I don’t think this would have been feasible using plain R on my computer alone (I have 8GB of RAM and an 8-core AMD processor). For these tasks I turned to H2O. For those who don’t know, H2O is a Java-based analytics platform that is frankly very easy and beneficial to set up even when all you have at your disposal is one computer. I say beneficial for one reason: my computer chokes when trying to train ensemble models on even moderately sized data sets. Through H2O, I’m able to get it done without watching the RAM meter on my system monitor shoot all the way up to full capacity! What you’ll notice in my R code is that R interfaces with H2O in such a way that once I passed the dataframe with the training data to H2O, it was H2O that handled the modeling from there, sending info back to R when available or requested (e.g. while you’re training a model, it gives you a cute text-based progress bar automatically!). More on this soon.
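
For reference, the H2O setup on my end amounted to little more than the following (a minimal sketch; the memory value is illustrative and should be adjusted to your machine):

library(h2o)
# Start (or attach to) a local H2O cluster; localH2O is the connection object
# referenced in the code below.
localH2O = h2o.init(nthreads = -1, max_mem_size = "6g")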

Before I show some results, I want to talk about my model tuning algorithm. Let’s look at the relevant code, then I’ll break it down verbally.

ntree = seq(100,500,100)
balance_class = c(TRUE,FALSE)
learn_rate = seq(.05,.4,.05)

parameters = list(ntree = c(), balance_class = c(), learn_rate = c(), r2 = c(), min.r2 = c(), max.r2 = c(), acc = c(), min.acc = c(), max.acc = c(), AUC = c(), min.AUC = c(), max.AUC = c())
n = 1

mooc.hex = as.h2o(localH2O, mooc[,c("enrollment_id","dropped_out_factor",x.names)])
for (trees in ntree) {
  for (c in balance_class) {
    for (rate in learn_rate) {
      r2.temp = c(NA,NA,NA)
      acc.temp = c(NA,NA,NA)
      auc.temp = c(NA,NA,NA)
      for (i in 1:3) {
        
        mooc.hex.split = h2o.splitFrame(mooc.hex, ratios=.8)   
        train.gbm = h2o.gbm(x = x.names, y = "dropped_out_factor",  training_frame = mooc.hex.split[[1]],
                            validation_frame = mooc.hex.split[[2]], ntrees = trees, balance_classes = c, learn_rate = rate)
        r2.temp[i] = train.gbm@model$validation_metrics@metrics$r2
        acc.temp[i] = train.gbm@model$validation_metrics@metrics$max_criteria_and_metric_scores[4,3]
        auc.temp[i] = train.gbm@model$validation_metrics@metrics$AUC
      }
      parameters$ntree[n] = trees
      parameters$balance_class[n] = c
      parameters$learn_rate[n] = rate
      parameters$r2[n] = mean(r2.temp)
      parameters$min.r2[n] = min(r2.temp)
      parameters$max.r2[n] = max(r2.temp)
      parameters$acc[n] = mean(acc.temp)
      parameters$min.acc[n] = min(acc.temp)
      parameters$max.acc[n] = max(acc.temp)
      parameters$AUC[n] = mean(auc.temp)
      parameters$min.AUC[n] = min(auc.temp)
      parameters$max.AUC[n] = max(auc.temp)
      n = n+1
    }
  }
}


parameters.df = data.frame(parameters)
parameters.df[which.max(parameters.df$AUC),]

The model that I decided to use is my usual favourite, gradient boosting machines (h2o.gbm is the function you use to train a GBM model through H2O). As such, the 3 hyperparameters which I chose to vary and evaluate in the model tuning process were the number of trees, whether or not to balance the outcome classes through over/undersampling, and the learning rate. As you can see above, I wanted to try out numerous values for each hyperparameter: 5 values for the number of trees, 2 values for balance classes, and 8 values for the learning rate, totalling 80 possible combinations of all 3 hyperparameter values together. Furthermore, I wanted to try out each combination of hyperparameter values on 3 random samples of the training data. So, 3 samples of each of the 80 combinations comes to 240 models trained and validated, with the aim of selecting the one with the best area under the curve (AUC). As you can see, each time I trained a model, I saved and summarised the validation stats in a growing list, which I ultimately converted to a data.frame and called parameters.df.

The best hyperparameters, according to these validation stats which I collected, are:

– ntree = 500
– balance_class = FALSE
– learn_rate = .05

You can see a very nice summary of how validation set performance changed depending on the values of all of these parameters in the image below (the FALSE and TRUE over the two facets refer to the balance_class values).

AUC by Tuning Parameters
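
The plot comes straight out of parameters.df; something along these lines (a sketch, not necessarily my exact plotting code) will reproduce it:

library(ggplot2)
# Mean validation AUC as a function of learning rate, one line per ntree value,
# faceted by whether or not the outcome classes were balanced
ggplot(parameters.df, aes(x = learn_rate, y = AUC, colour = factor(ntree))) +
  geom_line() + geom_point() +
  facet_wrap(~ balance_class) +
  labs(colour = "ntree", y = "Mean validation AUC")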

Have a look at my validation data model summary output from the H2O package below:

H2OBinomialMetrics: gbm
** Reported on validation data. **

MSE:  0.06046745
R^2:  0.102748
LogLoss:  0.2263847
AUC:  0.7542866
Gini:  0.5085732

Confusion Matrix for F1-optimal threshold:
            dropped out stayed    Error         Rate
dropped out       21051   1306 0.058416  =1306/22357
stayed             1176    576 0.671233   =1176/1752
Totals            22227   1882 0.102949  =2482/24109

Maximum Metrics:
                      metric threshold    value        idx
1                     max f1  0.170555 0.317006 198.000000
2                     max f2  0.079938 0.399238 282.000000
3               max f0point5  0.302693 0.343008 134.000000
4               max accuracy  0.612984 0.929321  48.000000
5              max precision  0.982246 1.000000   0.000000
6           max absolute_MCC  0.170555 0.261609 198.000000
7 max min_per_class_accuracy  0.061056 0.683410 308.000000

The first statistic that my eyes were drawn to when I saw this output was the R^2 statistic. It looks quite low and I’m not even sure why. That being said, standing in the KDD Cup 2015 competition is measured by AUC, and here you can see that it is .75 on my validation data. Next, have a look at the confusion matrix. You can see in the Error column that the model did quite well predicting who would drop out (naturally, in my opinion), but did not do so well figuring out who would stay. The overall error rate on the validation data is 10%, but I’m still not so happy about the high error rate as it pertains to those who stayed in the MOOC.
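
For completeness, the final step looked roughly like the sketch below: train one GBM on the full training frame with those winning hyperparameters, then score the official test set. Here, mooc.test stands in for a hypothetical data frame holding the test-set features, prepared in the same way as the training data, and I’ve omitted the exact submission formatting:

final.gbm = h2o.gbm(x = x.names, y = "dropped_out_factor", training_frame = mooc.hex,
                    ntrees = 500, balance_classes = FALSE, learn_rate = .05)

# h2o.predict returns the predicted class plus one probability column per class;
# the dropout probability, paired with enrollment_id, is what gets submitted
mooc.test.hex = as.h2o(localH2O, mooc.test[, c("enrollment_id", x.names)])
preds = as.data.frame(h2o.predict(final.gbm, mooc.test.hex))
write.csv(cbind(enrollment_id = mooc.test$enrollment_id, preds),
          "kdd_predictions.csv", row.names = FALSE)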

So this was all well and good (and was what got me my highest score yet according to the KDD Cup leaderboard), but what if I could get better performance with fewer variables? I took a look at my variable importances and decided to see what would happen if I eliminated the variables with the lowest importance scores one by one, until I reached the variable with the 16th lowest importance score. Here’s the code I used:

varimps = data.frame(h2o.varimp(train.gbm))
variable.set = list(nvars = c(), AUC = c(), min.AUC = c(), max.AUC = c())

mooc.hex = as.h2o(localH2O, mooc[,c("enrollment_id","dropped_out_factor",x.names)])
n = 1
for (i in seq(35,20)) {
  auc.temp = c(NA,NA,NA)
  x.names.new = setdiff(x.names, varimps$variable[i:dim(varimps)[1]])
  for (j in 1:3) {
    mooc.hex.split = h2o.splitFrame(mooc.hex, ratios=.8)
    train.gbm.smaller = h2o.gbm(x = x.names.new, y = "dropped_out_factor", training_frame = mooc.hex.split[[1]],
                                validation_frame = mooc.hex.split[[2]], ntrees = 500, balance_classes = FALSE, learn_rate = .05)
    auc.temp[j] = train.gbm.smaller@model$validation_metrics@metrics$AUC
  }
  variable.set$AUC[n] = mean(auc.temp)
  variable.set$min.AUC[n] = min(auc.temp)
  variable.set$max.AUC[n] = max(auc.temp)
  variable.set$nvars[n] = i-1
  n = n + 1
}

variable.set.df = data.frame(variable.set)

You can see that it’s a similar algorithm to the one I used for the model tuning. I moved up the variable importance list from the bottom, one variable at a time, and progressively eliminated more variables. I trained 3 models for each new number of variables, each on a random sample of the data, and averaged the AUCs from those models (totalling 48 models). See the following graph for the result:

AUC by num vars

As you can see, even though the variables I eliminated were of the lowest importance, they were still contributing something positive to the model. This goes to show how well GBM performs with variables that could be noisy.

Now let’s look at the more important variables according to H2O:

                           variable relative_importance scaled_importance   percentage
1                 num_logged_events        48481.160156      1.000000e+00 5.552562e-01
2     DAYS_problem_total_etime_unix        11651.416992      2.403288e-01 1.334440e-01
3                      days.in.mooc         6495.756348      1.339852e-01 7.439610e-02
4      DAYS_access_total_etime_unix         3499.054443      7.217349e-02 4.007478e-02
5                         avg_month         3019.399414      6.227985e-02 3.458127e-02
6                           avg_day         1862.299316      3.841285e-02 2.132897e-02
7                    Pct_sequential         1441.578247      2.973481e-02 1.651044e-02
8    DAYS_navigate_total_etime_unix          969.427734      1.999597e-02 1.110289e-02
9                       num_courses          906.499451      1.869797e-02 1.038217e-02
10                      Pct_problem          858.774353      1.771357e-02 9.835569e-03
11                     num_students          615.350403      1.269257e-02 7.047627e-03

Firstly, we see that the number of logged events was the most important variable for predicting drop-out. I guess the more active they are, the less likely they are to drop out. Let’s see a graph:

MOOC dropout by num logged events

Although it’s a little bit messy because I did not bin the num_logged_events variable, we see that this is exactly the case: students who were more active online were less likely to drop out.

Next, we see a few variables regarding the days spent doing something. They seem to follow similar patterns, so the image I’ll show you below involves the days.in.mooc variable. This is simply how many days passed from the logging of the first event to the last.

MOOC dropouts by days in mooc

Here we see a very steady decrease in the probability of dropping out: those who let very little time pass between their first and last interactions with the MOOC were the most likely to drop out, whereas those who spent more time with it were much less likely to.

Next, let’s look at the avg_month and avg_day variables. These were calculated by taking the average timestamp of all events for each person enrolled in each course and then extracting the month and the day from that timestamp. Essentially: when, on average, did they tend to engage with that course?
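
In R terms the calculation was essentially the sketch below (building on the hypothetical overall_avg_tstamp column sketched earlier, i.e. the per-enrollment average unix timestamp):

# Turn the average unix timestamp into a date-time, then pull out the month and the day
avg_time                 = as.POSIXct(enrollment_avg$overall_avg_tstamp, origin = "1970-01-01")
enrollment_avg$avg_month = as.numeric(format(avg_time, "%m"))
enrollment_avg$avg_day   = as.numeric(format(avg_time, "%d"))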

MOOC dropout by avg month and day

Interestingly, most months seem to exhibit a downward pattern, whereby if the person tended to have their interactions with the MOOC near the end of the month, then they were less likely to drop out, but if they had their interactions near the beginning they were more likely to drop out. This applied to February, May, June, November, and December. The reverse seems to be true for July and maybe October, and January may belong with that second group as well.

The last two plots I’ll show you relate to num_courses and num_students, in other words, how many courses each student is taking and how many students are in each course.

MOOC dropouts by # courses per student

MOOC dropout by course popularity

The interesting result here is that it’s only those students who were super committed (taking more than 20 courses in the period captured by the data) who appeared significantly less likely to drop out than those who were taking fewer courses.

Finally, you can see that as the number of students enrolled in a course went up, the overall drop-out rate decreased. Popular courses retain students!

Conclusion

This was fun! I was amazed by how obsessed I became with this competition. I’m disappointed that I couldn’t think of something to bridge the 2.6% gap between me and first place, but the point of this was to practice, to learn something new, and to have fun. I hope you enjoyed it too!

An R Enthusiast Goes Pythonic!

I’ve spent so many years using and broadcasting my love for R and using Python quite minimally. Having read recently about machine learning in Python, I decided to take on a fun little ML project using Python from start to finish.

What follows below takes advantage of a neat dataset from the UCI Machine Learning Repository. The data contain the math test performance of 395 students in 2 Portuguese schools. What’s neat about this data set is that in addition to grades on the students’ 3 math tests, they managed to collect a whole whack of demographic (and some behavioural) variables as well. That led me to the question of how well you can predict final math test performance based on demographics and behaviour alone. In other words, who is likely to do well, and who is likely to tank?

I have to admit before I continue that I initially intended to do this analysis in Python alone, but I actually felt lost three quarters of the way through and just did the whole darned thing in R. Once I had completed the analysis in R to my liking, I went back to my Python analysis and continued until I finished to my reasonable satisfaction. For that reason, for each step in the analysis, I will show you the code I used in Python, the results, and then the same thing in R. Do not treat this as a comparison of Python’s machine learning capabilities versus R per se. Please treat this as a comparison of my understanding of how to do machine learning in Python versus R!

Without further ado, let’s start with some import statements in Python and library statements in R:

#Python Code
from pandas import *
from matplotlib import *
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
%matplotlib inline # I did this in ipython notebook, this makes the graphs show up inline in the notebook.
import statsmodels.formula.api as smf
from scipy import stats
from numpy.random import uniform
from numpy import arange
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
mat_perf = read_csv('/home/inkhorn/Student Performance/student-mat.csv', delimiter=';')

I’d like to comment on the number of import statements I found myself writing in this python script. Eleven!! Is that even normal? Note the smaller number of library statements in my R code block below:

#R Code
library(ggplot2)
library(dplyr)
library(ggthemr)
library(caret)
ggthemr('flat') # I love ggthemr!
mat_perf = read.csv('student-mat.csv', sep = ';')

Now let’s do a quick plot of our target variable, scores on the students’ final math test, named ‘G3’.

#Python Code
sns.set_palette("deep", desat=.6)
sns.set_context(context='poster', font_scale=1)
sns.set_context(rc={"figure.figsize": (8, 4)})
plt.hist(mat_perf.G3)
plt.xticks(range(0,22,2))

Distribution of Final Math Test Scores (“G3”)
Python Hist - G3

That looks pretty pleasing to my eyes. Now let’s see the code for the same thing in R (I know, the visual theme is different. So sue me!)

#R Code
ggplot(mat_perf) + geom_histogram(aes(x=G3), binwidth=2)

Hist - G3

You’ll notice that I didn’t need to tweak any palette or font size parameters for the R plot, because I used the very fun ggthemr package. You choose the visual theme you want, declare it early on, and then all subsequent plots will share the same theme! There is one command I’ve hidden, however, that modifies the figure height and width (I set the figure size using rmarkdown; otherwise I would have sized it manually using the export menu in the plot frame in RStudio). I think both plots look pretty nice, although I’m very partial to working with ggthemr!

Univariate estimates of variable importance for feature selection

Below, what I’ve done in both languages is to cycle through each variable in the dataset (excepting prior test scores), insert the variable name into a dictionary/list, and get a measure of how predictive that variable is, alone, of the final math test score (variable G3). If the variable is qualitative then I get an F score from an ANOVA, and if it’s quantitative then I get a t score from the regression.

In Python, both cases are handled by the ols function from the statsmodels package. In R, I’ve achieved this using the aov function for qualitative variables and the lm function for quantitative ones. The numerical outcome, as you’ll see from the graphs, is the same.

#Python Code
test_stats = {'variable': [], 'test_type' : [], 'test_value' : []}

for col in mat_perf.columns[:-3]:
    test_stats['variable'].append(col)
    if mat_perf[col].dtype == 'O':
        # Do ANOVA
        aov = smf.ols(formula='G3 ~ C(' + col + ')', data=mat_perf, missing='drop').fit()
        test_stats['test_type'].append('F Test')
        test_stats['test_value'].append(round(aov.fvalue,2))
    else:
        # Do correlation
        print col + '\n'
        model = smf.ols(formula='G3 ~ ' + col, data=mat_perf, missing='drop').fit()
        value = round(model.tvalues[1],2)
        test_stats['test_type'].append('t Test')
        test_stats['test_value'].append(value)

test_stats = DataFrame(test_stats)
test_stats.sort(columns='test_value', ascending=False, inplace=True)

#R Code
test.stats = list(test.type = c(), test.value = c(), variable = c())

for (i in 1:30) {
  test.stats$variable[i] = names(mat_perf)[i]
  if (is.factor(mat_perf[,i])) {
    anova = summary(aov(G3 ~ mat_perf[,i], data=mat_perf))
    test.stats$test.type[i] = "F test"
    test.stats$test.value[i] = unlist(anova)[7]
  }
  else {
    reg = summary(lm(G3 ~ mat_perf[,i], data=mat_perf))
    test.stats$test.type[i] = "t test"
    test.stats$test.value[i] = reg$coefficients[2,3]
  }

}

test.stats.df = arrange(data.frame(test.stats), desc(test.value))
test.stats.df$variable = reorder(test.stats.df$variable, -test.stats.df$test.value)

And now for the graphs. Again you’ll see a bit more code for the Python graph vs the R graph. Perhaps someone will be able to show me code that doesn’t involve as many lines, or maybe it’s just the way things go with graphing in Python. Feel free to educate me 🙂

#Python Code
f, (ax1, ax2) = plt.subplots(2,1, figsize=(48,18), sharex=False)
sns.set_context(context='poster', font_scale=1)
sns.barplot(x='variable', y='test_value', data=test_stats.query("test_type == 'F Test'"), hline=.1, ax=ax1, x_order=[x for x in test_stats.query("test_type == 'F Test'")['variable']])
ax1.set_ylabel('F Values')
ax1.set_xlabel('')

sns.barplot(x='variable', y='test_value', data=test_stats.query("test_type == 't Test'"), hline=.1, ax=ax2, x_order=[x for x in test_stats.query("test_type == 't Test'")['variable']])
ax2.set_ylabel('t Values')
ax2.set_xlabel('')

sns.despine(bottom=True)
plt.tight_layout(h_pad=3)

Python Bar Plot - Univariate Estimates of Variable Importance

#R Code
ggplot(test.stats.df, aes(x=variable, y=test.value)) +
  geom_bar(stat="identity") +
  facet_grid(.~test.type ,  scales="free", space = "free") +
  theme(axis.text.x = element_text(angle = 45, vjust=.75, size=11))

Bar plot - Univariate Estimates of Variable Importance

As you can see, the estimates that I generated in both languages were thankfully the same. My next thought was to use only those variables with a test value (F or t) of 3.0 or higher in absolute value. What you’ll see below is that this led to a pretty severe decrease in predictive power compared to being liberal with feature selection.

In reality, the feature selection I use below shouldn’t be necessary at all given the size of the data set vs the number of predictors, and the statistical method that I’m using to predict grades (random forest). What’s more, my feature selection method in fact led me to reject certain variables which I later found to be important in my expanded models! For this reason it would be nice to have a scalable multivariate feature selection method in my tool belt (I’ve been reading a bit about Boruta but am skeptical about how well it scales up).
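
If I were to try it, the call would look something like this sketch (illustrative only; I haven’t benchmarked it here, and I’m relying on column 33 of mat_perf being the G3 target):

#R Code
library(Boruta)
# Multivariate feature selection over the 30 demographic/behavioural predictors
bor = Boruta(G3 ~ ., data = mat_perf[, c(1:30, 33)], doTrace = 1)
getSelectedAttributes(bor)  # the variables Boruta deems important

Enough blathering, and on with the model training: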

Training the First Random Forest Model

#Python code
usevars =  [x for x in test_stats.query("test_value >= 3.0 | test_value <= -3.0")['variable']]
mat_perf['randu'] = [uniform(0,1) for x in range(0, mat_perf.shape[0])]

mp_X = mat_perf[usevars]
mp_X_train = mp_X[mat_perf['randu'] <= .67]
mp_X_test = mp_X[mat_perf['randu'] > .67]

mp_Y_train = mat_perf.G3[mat_perf['randu'] <= .67]
mp_Y_test = mat_perf.G3[mat_perf['randu'] > .67]

# for the training set
cat_cols = [x for x in mp_X_train.columns if mp_X_train[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_train[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_train = concat([mp_X_train, new_cols], axis=1)

# for the testing set
cat_cols = [x for x in mp_X_test.columns if mp_X_test[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_test[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_test = concat([mp_X_test, new_cols], axis=1)

mp_X_train.drop(cat_cols, inplace=True, axis=1)
mp_X_test.drop(cat_cols, inplace=True, axis=1)

rf = RandomForestRegressor(bootstrap=True,
           criterion='mse', max_depth=2, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0)
rf.fit(mp_X_train, mp_Y_train)

After I got past the part where I constructed the training and testing sets (with “unimportant” variables filtered out) I ran into a real annoyance. I learned that categorical variables need to be converted to dummy variables before you do the modeling (where each level of the categorical variable gets its own variable containing 1s and 0s; 1 means that the level was present in that row and 0 means that it was not; so-called “one-hot encoding”). I suppose you could argue that this puts less computational demand on the modeling procedures, but when you’re dealing with tree based ensembles I think this is a drawback. Let’s say you have a categorical variable with 5 levels, “a” through “e”. It just so happens that when you compare a split on that categorical variable where “abc” is on one side and “de” is on the other side, there is a very significant difference in the dependent variable. How is one-hot encoding going to capture that? And then, your dataset which had a certain number of columns now has 5 additional dummy columns due to the encoding. “Blah” I say!

Anyway, as you can see above, I used the get_dummies function in order to do the one-hot encoding. Also, you’ll see that I’ve assigned two thirds of the data to the training set and one third to the testing set. Now let’s see the same steps in R:

#R Code
keep.vars = match(filter(test.stats.df, abs(test.value) >= 3)$variable, names(mat_perf))
ctrl = trainControl(method="repeatedcv", number=10, selectionFunction = "oneSE")
mat_perf$randu = runif(395)
test = mat_perf[mat_perf$randu > .67,]
trf = train(mat_perf[mat_perf$randu <= .67,keep.vars], mat_perf$G3[mat_perf$randu <= .67],
            method="rf", metric="RMSE", data=mat_perf,
            trControl=ctrl, importance=TRUE)

Wait a minute. Did I really just train a Random Forest model in R, do cross validation, and prepare a testing data set with 5 commands!?!? That was a lot easier than doing these preparations and not doing cross validation in Python! I did in fact try to figure out cross validation in sklearn, but then I was having problems accessing variable importances after. I do like the caret package 🙂 Next, let’s see how each of the models did on their testing set:

Testing the First Random Forest Model

#Python Code
y_pred = rf.predict(mp_X_test)
sns.set_context(context='poster', font_scale=1)
first_test = DataFrame({"pred.G3.keepvars" : y_pred, "G3" : mp_Y_test})
sns.lmplot("G3", "pred.G3.keepvars", first_test, size=7, aspect=1.5)
print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred)[0]**2
print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred))

Python Scatter Plot - First Model Pred vs Actual

R^2 value of 0.104940038879
RMSE of 4.66552400292

Here, as in all cases when making a prediction using sklearn, I use the predict method to generate the predicted values from the model using the testing set, and then plot the predictions (“pred.G3.keepvars”) vs the actual values (“G3”) using the lmplot function. I like the syntax that the lmplot function from the seaborn package uses, as it is simple and familiar to me from the R world (the arguments consist of the X variable, the Y variable, the dataset name, and other aesthetic arguments). As you can see from the graph above and from the R^2 value, this model kind of sucks. Another thing I like here is the quality of the graph that seaborn outputs. It’s nice! It looks pretty modern, the text is very readable, and nothing looks edgy or pixelated in the plot. Okay, now let’s look at the code and output in R, using the same predictors.

#R Code
test$pred.G3.keepvars = predict(trf, test, "raw")
cor.test(test$G3, test$pred.G3.keepvars)$estimate[[1]]^2
summary(lm(test$G3 ~ test$pred.G3.keepvars))$sigma
ggplot(test, aes(x=G3, y=pred.G3.keepvars)) + geom_point() + stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

Scatter Plot - First Model Pred vs Actual

R^2 value of 0.198648
RMSE of 4.148194

Well, it looks like this model sucks a bit less than the Python one. Quality-wise, the plot looks super nice (thanks again, ggplot2 and ggthemr!) although by default the alpha parameter is not set to account for overplotting. The docs page for ggplot2 suggests setting alpha=.05, but for this particular data set, setting it to .5 seems to be better.
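
For example, the overplotting-friendly version of the plot above is simply:

#R Code
ggplot(test, aes(x=G3, y=pred.G3.keepvars)) + geom_point(alpha=.5) +
  stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))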

Finally for this section, let’s look at the variable importances generated for each training model:

#Python Code
importances = DataFrame({'cols':mp_X_train.columns, 'imps':rf.feature_importances_})
print importances.sort(['imps'], ascending=False)

             cols      imps
3        failures  0.641898
0            Medu  0.064586
10          sex_F  0.043548
19  Mjob_services  0.038347
11          sex_M  0.036798
16   Mjob_at_home  0.036609
2             age  0.032722
1            Fedu  0.029266
15   internet_yes  0.016545
6     romantic_no  0.013024
7    romantic_yes  0.011134
5      higher_yes  0.010598
14    internet_no  0.007603
4       higher_no  0.007431
12        paid_no  0.002508
20   Mjob_teacher  0.002476
13       paid_yes  0.002006
18     Mjob_other  0.001654
17    Mjob_health  0.000515
8       address_R  0.000403
9       address_U  0.000330

#R Code
varImp(trf)

## rf variable importance
## 
##          Overall
## failures 100.000
## romantic  49.247
## higher    27.066
## age       17.799
## Medu      14.941
## internet  12.655
## sex        8.012
## Fedu       7.536
## Mjob       5.883
## paid       1.563
## address    0.000

My first observation is that it was obviously easier for me to get the variable importances in R than it was in Python. Next, you’ll certainly see the symptom of the dummy coding I had to do for the categorical variables. That’s no fun, but we’ll survive through this example analysis, right? Now let’s look at which variables made it to the top:

Whereas failures, mother’s education level, sex, and mother’s job made it to the top of the list for the Python model, the R model’s top 4 (failures, romantic, higher, and age) shared only failures with it.

With the understanding that the variable selection method that I used was inappropriate, let’s move on to training a Random Forest model using all predictors except the prior 2 test scores. Since I’ve already commented above on my thoughts about the various steps in the process, I’ll comment only on the differences in results in the remaining sections.

Training and Testing the Second Random Forest Model

#Python Code

#aav = almost all variables
mp_X_aav = mat_perf[mat_perf.columns[0:30]]
mp_X_train_aav = mp_X_aav[mat_perf['randu'] <= .67]
mp_X_test_aav = mp_X_aav[mat_perf['randu'] > .67]

# for the training set
cat_cols = [x for x in mp_X_train_aav.columns if mp_X_train_aav[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_train_aav[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_train_aav = concat([mp_X_train_aav, new_cols], axis=1)
    
# for the testing set
cat_cols = [x for x in mp_X_test_aav.columns if mp_X_test_aav[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_test_aav[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_test_aav = concat([mp_X_test_aav, new_cols], axis=1)

mp_X_train_aav.drop(cat_cols, inplace=True, axis=1)
mp_X_test_aav.drop(cat_cols, inplace=True, axis=1)

rf_aav = RandomForestRegressor(bootstrap=True, 
           criterion='mse', max_depth=2, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0)
rf_aav.fit(mp_X_train_aav, mp_Y_train)

y_pred_aav = rf_aav.predict(mp_X_test_aav)
second_test = DataFrame({"pred.G3.almostallvars" : y_pred_aav, "G3" : mp_Y_test})
sns.lmplot("G3", "pred.G3.almostallvars", second_test, size=7, aspect=1.5)
print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred_aav)[0]**2
print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred_aav))

Python Scatter Plot - Second Model Pred vs Actual

R^2 value of 0.226587731888
RMSE of 4.3338674965

Compared to the first Python model, the R^2 on this one is more than double (the first R^2 was .10494) and the RMSE is 7.1% lower (the first was 4.66552). The predicted vs actual plot confirms that the predictions still don’t look fantastic compared to the actuals, which is probably the main reason why the RMSE hasn’t decreased by much more. Now to the R code using the same predictors:

#R code
trf2 = train(mat_perf[mat_perf$randu <= .67,1:30], mat_perf$G3[mat_perf$randu <= .67],
            method="rf", metric="RMSE", data=mat_perf,
            trControl=ctrl, importance=TRUE)
test$pred.g3.almostallvars = predict(trf2, test, "raw")
cor.test(test$G3, test$pred.g3.almostallvars)$estimate[[1]]^2
summary(lm(test$G3 ~ test$pred.g3.almostallvars))$sigma
ggplot(test, aes(x=G3, y=pred.g3.almostallvars)) + geom_point() + 
  stat_smooth() + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

Scatter Plot - Second Model Pred vs Actual

R^2 value of 0.3262093
RMSE of 3.8037318

Compared to the first R model, the R^2 on this one is approximately 1.64 times the first (the first R^2 was .19865) and the RMSE is 8.3% lower (the first was 4.148194). Although this particular model is indeed doing better at predicting values in the test set than the one built in Python using the same variables, I would still hesitate to assume that the process is inherently better for this data set. Due to the randomness inherent in Random Forests, one run of the training could be lucky enough to give results like the above, whereas other times the results might even be slightly worse than what I managed to get in Python. I confirmed this, and in fact most additional runs of this model in R resulted in an R^2 of around .20 and an RMSE of around 4.2.
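
The check I ran amounted to something like the sketch below (re-training the same caret model a handful of times and re-computing the test-set R^2; the number of repeats is arbitrary):

#R Code
r2.runs = replicate(5, {
  trf2.rerun = train(mat_perf[mat_perf$randu <= .67, 1:30], mat_perf$G3[mat_perf$randu <= .67],
                     method = "rf", metric = "RMSE", trControl = ctrl, importance = TRUE)
  cor.test(test$G3, predict(trf2.rerun, test, "raw"))$estimate[[1]]^2
})
summary(r2.runs)  # spread of R^2 across the repeated runs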

Again, let’s look at the variable importances generated for each training model:

#Python Code
importances_aav = DataFrame({'cols':mp_X_train_aav.columns, 'imps':rf_aav.feature_importances_})
print importances_aav.sort(['imps'], ascending=False)

                 cols      imps
5            failures  0.629985
12           absences  0.057430
1                Medu  0.037081
41      schoolsup_yes  0.036830
0                 age  0.029672
23       Mjob_at_home  0.029642
16              sex_M  0.026949
15              sex_F  0.026052
40       schoolsup_no  0.019097
26      Mjob_services  0.016354
55       romantic_yes  0.014043
51         higher_yes  0.012367
2                Fedu  0.011016
39     guardian_other  0.010715
37    guardian_father  0.006785
8               goout  0.006040
11             health  0.005051
54        romantic_no  0.004113
7            freetime  0.003702
3          traveltime  0.003341

#R Code
varImp(trf2)

## rf variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##            Overall
## absences    100.00
## failures     70.49
## schoolsup    47.01
## romantic     32.20
## Pstatus      27.39
## goout        26.32
## higher       25.76
## reason       24.02
## guardian     22.32
## address      21.88
## Fedu         20.38
## school       20.07
## traveltime   20.02
## studytime    18.73
## health       18.21
## Mjob         17.29
## paid         15.67
## Dalc         14.93
## activities   13.67
## freetime     12.11

Now in both cases we’re seeing that absences and failures are considered the top 2 most important variables for predicting final math exam grade. It makes sense to me, but frankly it is a little sad that the two most important variables are so negative 😦 On to the third Random Forest model, containing everything from the second with the addition of the students’ marks on their second math exam!

Training and Testing the Third Random Forest Model

#Python Code

#allv = all variables (except G1)
allvars = range(0,30)
allvars.append(31)
mp_X_allv = mat_perf[mat_perf.columns[allvars]]
mp_X_train_allv = mp_X_allv[mat_perf['randu'] <= .67]
mp_X_test_allv = mp_X_allv[mat_perf['randu'] > .67]

# for the training set
cat_cols = [x for x in mp_X_train_allv.columns if mp_X_train_allv[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_train_allv[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_train_allv = concat([mp_X_train_allv, new_cols], axis=1)
    
# for the testing set
cat_cols = [x for x in mp_X_test_allv.columns if mp_X_test_allv[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_test_allv[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_test_allv = concat([mp_X_test_allv, new_cols], axis=1)

mp_X_train_allv.drop(cat_cols, inplace=True, axis=1)
mp_X_test_allv.drop(cat_cols, inplace=True, axis=1)

rf_allv = RandomForestRegressor(bootstrap=True, 
           criterion='mse', max_depth=2, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0)
rf_allv.fit(mp_X_train_allv, mp_Y_train)

y_pred_allv = rf_allv.predict(mp_X_test_allv)
third_test = DataFrame({"pred.G3.plusG2" : y_pred_allv, "G3" : mp_Y_test})
sns.lmplot("G3", "pred.G3.plusG2", third_test, size=7, aspect=1.5)
print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred_allv)[0]**2
print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred_allv))

Python Scatter Plot - Third Model Pred vs Actual

R^2 value of 0.836089929903
RMSE of 2.11895794845

Obviously we have added a highly predictive piece of information here by adding the grades from their second math exam (variable name “G2”). I was reluctant to add this variable at first because predicting test marks with previous test marks prevents the model from being useful earlier in the year, before those tests have been administered. However, I did want to see what the model would look like when I included it anyway! Now let’s see how predictive these variables were when I put them into a model in R:

#R Code
trf3 = train(mat_perf[mat_perf$randu <= .67,c(1:30,32)], mat_perf$G3[mat_perf$randu <= .67], 
             method="rf", metric="RMSE", data=mat_perf, 
             trControl=ctrl, importance=TRUE)
test$pred.g3.plusG2 = predict(trf3, test, "raw")
cor.test(test$G3, test$pred.g3.plusG2)$estimate[[1]]^2
summary(lm(test$G3 ~ test$pred.g3.plusG2))$sigma
ggplot(test, aes(x=G3, y=pred.g3.plusG2)) + geom_point() + 
  stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

Scatter Plot - Third Model Pred vs Actual

R^2 value of 0.9170506
RMSE of 1.3346087

Well, it appears that yet again we have a case where the R model has fared better than the Python model. I find it notable that when you look at the scatterplot for the Python model you can see what look like steps in the points as you scan your eyes from the bottom-left part of the trend line to the top-right part. It appears that the Random Forest model in R has benefitted from the tuning process, and as a result its residuals are more homoscedastic and sit visibly closer to the regression line than the Python model’s. I still wonder how much more similar these results would be if I had carried out the Python analysis by tuning while cross validating, like I did in R!

For the last time, let’s look at the variable importances generated for each training model:

#Python Code
importances_allv = DataFrame({'cols':mp_X_train_allv.columns, 'imps':rf_allv.feature_importances_})
print importances_allv.sort(['imps'], ascending=False)

                 cols      imps
13                 G2  0.924166
12           absences  0.075834
14          school_GP  0.000000
25        Mjob_health  0.000000
24       Mjob_at_home  0.000000
23          Pstatus_T  0.000000
22          Pstatus_A  0.000000
21        famsize_LE3  0.000000
20        famsize_GT3  0.000000
19          address_U  0.000000
18          address_R  0.000000
17              sex_M  0.000000
16              sex_F  0.000000
15          school_MS  0.000000
56       romantic_yes  0.000000
27      Mjob_services  0.000000
11             health  0.000000
10               Walc  0.000000
9                Dalc  0.000000
8               goout  0.000000

#R Code
varImp(trf3)

## rf variable importance
## 
##   only 20 most important variables shown (out of 31)
## 
##            Overall
## G2         100.000
## absences    33.092
## failures     9.702
## age          8.467
## paid         7.591
## schoolsup    7.385
## Pstatus      6.604
## studytime    5.963
## famrel       5.719
## reason       5.630
## guardian     5.278
## Mjob         5.163
## school       4.905
## activities   4.532
## romantic     4.336
## famsup       4.335
## traveltime   4.173
## Medu         3.540
## Walc         3.278
## higher       3.246

Now this is VERY telling, and gives me insight as to why the scatterplot from the Python model had that staircase quality to it. The R model is taking into account way more variables than the Python model. G2 obviously takes the cake in both models, but I suppose it overshadowed everything else by so much in the Python model that it just didn’t find any use for any variable other than absences.

Conclusion

This was fun! For all the work I did in Python, I used IPython Notebook. Being an avid RStudio user, I’m not used to web-browser based interactive coding like what IPython Notebook provides. I discovered that I enjoy it and found it useful for laying out the information that I was using to write this blog post (I also laid out the R part of this analysis in RMarkdown for that same reason). What I did not like about IPython Notebook is that when you close it/shut it down/then later reinitialize it, all of the objects that form your data and analysis are gone and all you have left are the results. You must then re-run all of your code so that your objects are resident in memory again. It would be nice to have some kind of convenience function to save everything to disk so that you can reload at a later time.

I found myself stumbling a lot trying to figure out which Python packages to use for each particular purpose, and I tended to get easily frustrated. I had to keep reminding myself that the learning curve is similar to the one I faced when I was learning R. This frustration should not be a deterrent to picking Python up and learning how to do machine learning in it. Another part of my frustration was not being able to get variable importances from my Random Forest models in Python when I was building them using cross validation and grid searches. If you have a link to share with me that shows an example of this, I’d be happy to read it.

I liked seaborn and I think if I spend more time with it then perhaps it could serve as a decent alternative to graphing in ggplot2. That being said, I’ve spent so much time using ggplot2 that sometimes I wonder if there is anything out there that rivals its flexibility and elegance!

The issue I mentioned above with categorical variables is annoying and it really makes me wonder if using a Tree based R model would intrinsically be superior due to its automatic handling of categorical variables compared with Python, where you need to one-hot encode these variables.

All in all, I hope this was as useful and educational for you as it was for me. It’s important to step outside of your comfort zone every once in a while 🙂

Predicting Mobile Phone Prices

Recently a colleague of mine showed me a nauseating interactive scatterplot that plots mobile phones according to two dimensions of the user’s choice from a list of possible dimensions. Although the interactive visualization was offensive to my tastes, the JSON data behind the visualization was intriguing. It was easy enough to get the data behind it (see this link if you want an up to date copy, and be sure to take out the “data=” from the start of the file! I pulled this data around noon on March 23rd.) so that I could start asking a simple question: which of the available factors provided in the dataset were the most predictive of full mobile phone price?

I’ll present the graphs and the predictive model first, and then the code later on:

Price by OS and Brand:

Often when investigating a topic using data, we confirm things that we already knew to be true.  This is certainly the case here with price by OS and brand.  From the below boxplots we see that the bulk of iOS devices tend to be the most expensive, and that brand-wise Apple, Google, and Samsung seem to stick out.

Mobile Phone Price by Operating System

Mobile Phone Prices by Brand

Price by Storage Capacity, RAM, and SD Card Capacity:

Storage capacity is perhaps the least surprising factor to find having such a sharply positive correlation with price. I think what is more surprising to me is that there aren’t more gradations of storage capacity in the higher range past 50 gigabytes. I’m guessing this is because the bulk of these phones (bearing in mind roughly 90% of them are in fact smart phones) are geared towards lower income folks. Can you guess which phones occupy the top-right-most position on the first graph? If your answer involved the iPhone 6 then you’re right on two counts!

As you can see, the relationship between RAM and price is pretty linear (with phones costing $171.54 more for each additional gigabyte of RAM), and the relationship between SD card capacity and price is linear past the large group of phones with 0 SD card capacity (with phones costing $3.64 more for each additional gigabyte of SD card capacity).

Price by Storage Capacity

Price by RAM

Price by SD Card
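
Dollar-per-gigabyte figures like the ones quoted above come from simple univariate regression slopes, along these lines (a sketch; the SD card slope is computed here only on phones with non-zero capacity, to match the upward-sloping part of the plot):

# Slope of price on RAM (~ $171.54 per additional GB of RAM)
coef(lm(price ~ att_ram, data = cp))["att_ram"]
# Slope of price on SD card capacity among phones that have one (~ $3.64 per GB)
coef(lm(price ~ att_sd_card, data = subset(cp, att_sd_card > 0)))["att_sd_card"]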

Price by Screen Size, Battery, and Weight:

The next factors that I think one would naturally think of when considering the price of a mobile phone are all related to how big the thing is. Smart phones these days have a lot of physical presence just by dint of their screen size alone. Add to the large screen size the batteries that are used to support such generous displays and you also get an impressive variety of weights to these phones.

In fact, for every additional inch of screen size to these phones, you can expect an additional .81504 ounces and 565.11 mAh of battery capacity. My own humble little smartphone (an HTC Desire 601) happens to be on the smaller and lighter side of the spectrum as far as screen size and weight goes (4.5 inches screen size, or 33rd percentile; 4.59 ounces or 26th percentile) but happens to have a pretty generous battery capacity all things considered (2100 mAh, or 56th percentile).

While positive correlations can be seen between price and all 3 of these factors, battery was the most correlated with price, followed by screen size and then weight. There’s obviously a lot of variability in price when you look at the phones with the bigger screen sizes, as they probably tend to come packed with a variety of premium extra features that can be used to jack up the price.

Price by Screen Size

Price by Battery

Price by Weight
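
That ordering is easy to check directly with a quick sketch like this:

# Correlation of price with battery capacity, screen size, and weight
sapply(cp[, c("att_battery_mah", "att_screen_size", "att_weight")],
       function(x) cor(x, cp$price, use = "complete.obs"))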

Putting it all together in a model:
Finally, let’s lump all of the factors provided in the data set into a model, and see how well it performs on a testing sample. I decided on an 80/20 training/testing split, and am of course using Max Kuhn’s fabulous caret package to do the dirty work. I ran a gbm model, shown below, and managed to get an R squared of 60.4% in the training sample, so not bad.

Stochastic Gradient Boosting 

257 samples
 23 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 173, 173, 171, 171, 172, 171, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE      Rsquared   RMSE SD   Rsquared SD
  1                   50      150.1219  0.5441107  45.36781  0.1546993  
  1                  100      147.5400  0.5676971  46.03555  0.1528225  
  1                  150      146.3710  0.5803005  45.00296  0.1575795  
  2                   50      144.0657  0.5927624  45.46212  0.1736994  
  2                  100      143.7181  0.6036983  44.80662  0.1787351  
  2                  150      143.4850  0.6041207  45.57357  0.1760428  
  3                   50      148.4914  0.5729182  45.27579  0.1903465  
  3                  100      148.5363  0.5735842  43.41793  0.1746064  
  3                  150      148.8497  0.5785677  43.39338  0.1781990  

Tuning parameter 'shrinkage' was held constant at a value of 0.1
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were n.trees = 150, interaction.depth = 2 and shrinkage = 0.1.

Now let’s look at the terms that came out as the most significant in the chosen model. Below we see some unsurprising findings! Storage, battery, weight, RAM, and whether or not the phone uses iOS make up the top 5. I guess I’m surprised that screen size was not higher up in the priority list, but at least it got in 6th place!

gbm variable importance

  only 20 most important variables shown (out of 41)

                  Overall
att_storage      100.0000
att_battery_mah   59.7597
att_weight        46.5410
att_ram           27.5871
att_osiOS         26.9977
att_screen_size   21.1106
att_sd_card       20.1130
att_brandSamsung   9.1220

Finally, let’s look at how our model did in the testing sample. Below I’ve shown you a plot of actual versus predicted price values. The straight line is what we would expect to see if there were a perfect correlation between the two (obviously not!!) while the smoothed line is the trend that we actually do see in the scatter plot. Considering the high R squared in the testing sample of 57% (not too far off from the training sample), it’s of course a nice confirmation of the utility of this model to see the smooth line following that perfect prediction line, but I won’t be calling up Rogers Wireless with this magical model just yet!

Price by Predicted Price

In fact, before I close off this post, it would be remiss of me not to investigate a couple of cases in this final graph that look like outliers. The one on the bottom right, and the one on the top left.

The one on the bottom right happens to be a Sony Xperia Z3v Black with 32GB of storage space. What I learned from checking into this is that since the pricing data on the source website is pulled from amazon.com, sometimes instead of pulling the full regular price, it happens to pull the data on a day when a special sale or service agreement price is listed. When I pulled the data, the Xperia was listed at a price of $29.99. Today, on April 6th, the price that you would get if you looked it up through the source website is .99! Interestingly, my model had predicted a full price of $632.17, which was not very far off from the full price of $599.99 that you can see if you go on the listing on amazon.com. Not bad!

Now, how about the phone that cost so much but that the model said shouldn’t? This phone was none other than the Black LG 3960 Google Nexus 4 Unlocked GSM Phone with 16GB of Storage space. The price I pulled that day was a whopping $699.99 but the model only predicted a price of $241.86! Considering the specs on this phone, the only features that really seem to measure up are the storage (16GB is roughly in the 85th percentile for smart phones) and the RAM (2 GB is roughly in the 93rd percentile for smart phones). Overall though, the model can’t account for any other qualities that Google might have imbued into this phone that were not measured by the source website. Hence, this is a formidable model outlier!

If you take out the Sony Xperia that I mentioned first, the Adjusted R squared value goes up from 57% to 74%, and the Residual Standard Error decreases from $156 to $121. That’s a lot of influence for just one outlier that we found to be based on data quality alone. Wow!
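
Numbers like those come from refitting the simple actual-versus-predicted regression on the testing sample with and without that phone, along the lines of this sketch (identifying the outlier by its name here is just for illustration):

# Fit of actual price on predicted price, with and without the mispriced Xperia
summary(lm(price ~ pred.price, data = cp.test))
summary(lm(price ~ pred.price, data = subset(cp.test, !grepl("Xperia Z3v", name))))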

Reflecting on this exercise, the one factor that I wish had been collected is processor speed. I’m curious how much that would factor into pricing decisions, but alas this information was unavailable.

Anyway, this was fun, and I hope not too boring for you, the readers. Thanks for reading!!


library(jsonlite)
cp = fromJSON(txt = "Cell Phone Data.txt", simplifyDataFrame = TRUE)
num.atts = c(4,9,11,12,13,14,15,16,18,22)
cp[,num.atts] = sapply(cp[,num.atts], function (x) as.numeric(x))
cp$aspect.ratio = cp$att_pixels_y / cp$att_pixels_x
cp$isSmartPhone = ifelse(grepl("smart|iphone|blackberry", cp$name, ignore.case=TRUE) == TRUE | cp$att_screen_size >= 4, "Yes", "No")
library(ggplot2)
library(ggthemr)
library(scales)
ggthemr("camoflauge")
ggplot(cp, aes(x=att_brand, y=price)) + geom_boxplot() + ggtitle("Mobile Phone Price by Brand") + theme(axis.text.x=element_text(angle=90, size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_discrete("Brand")
ggplot(cp, aes(x=att_weight, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Weight") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Weight (oz)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=att_screen_size, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Screen Size") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Screen Size (in)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=att_ram, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Amount of RAM") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("RAM (gb)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=att_sd_card, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by SD Card Capacity") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("SD Card Capacity (gb)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=ifelse(cp$att_dual_sim == 1, "Yes", "No"), y=price)) + geom_boxplot() + ggtitle("Mobile Phone Price by Dual Sim") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_discrete("Has Dual Sim Card?")
ggplot(cp, aes(x=att_storage, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Storage Capacity") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Storage Capacity (gb)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=att_battery_mah, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Battery Capacity") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Battery Capacity (mAh)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=aspect.ratio, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Aspect Ratio") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Aspect Ratio (Y Pixels / X Pixels)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=isSmartPhone, y=price)) + geom_boxplot() + ggtitle("Mobile Phone Price by Smart Phone Status") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_discrete("Is it a Smart Phone?")
ggplot(cp, aes(x=att_os, y=price)) + geom_boxplot() + ggtitle("Mobile Phone Price by Operating System") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_discrete("Operating System")
library(caret)
control = trainControl(method="cv")
in_train = createDataPartition(cp$price, p=.8, list=FALSE)
model.gbm = train(price ~ att_brand + att_weight + att_screen_size +
att_ram + att_sd_card + att_dual_sim +
att_storage + att_battery_mah + att_os, data=cp,
method="gbm", trControl=control, verbose=FALSE, subset=in_train)
cp$att_brand = factor(cp$att_brand)  # re-factor the brand column to drop any unused levels
cp.test = cp[-in_train,]             # evaluate on the rows held out of training
cp.test = subset(cp.test, att_brand != 'TOTO')
cp.test = na.omit(cp.test)
cp.test$pred.price = predict(model.gbm, cp.test)
ggplot(cp.test, aes(x=pred.price, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Predicted Price") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Predicted Price", labels=dollar) + geom_abline(intercept=0, slope=1, colour="yellow") + stat_smooth(se=FALSE)
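
As a quick numeric companion to the predicted-vs-actual plot above, you could also compute the test-set RMSE; here’s a minimal sketch using the cp.test data frame and pred.price column defined above:

# Root mean squared error of the GBM price predictions on the held-out rows
sqrt(mean((cp.test$price - cp.test$pred.price)^2, na.rm=TRUE))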

Contraceptive Choice in Indonesia

I wanted yet another opportunity to use the fabulous caret package, and also to finally give plot.ly a try. To scratch both itches, I dipped into the UCI Machine Learning Repository yet again and came up with a survey data set on the topic of contraceptive choice in Indonesia. This was an interesting opportunity for me to learn about a far-off place while practicing some fun data skills.

According to recent estimates, Indonesia is home to some 250 million people and, thanks to government intervention over the years, has seen its fertility rate fall from well over 5 births per woman to under 2.4 births per woman today. Despite this slowdown, Indonesia is not generating enough jobs to satisfy its population. Almost a fifth of its youth labour force (aged 15-24) is unemployed (19.6%), a full 6.3 percentage points more than a recent estimate of the youth unemployment rate in Canada (13.3%). When you’re talking about a country of 250 million (with approximately 43.4 million 15-24 year olds), that’s the difference between about 5.8 million and 8.5 million unemployed teenagers and young adults. The very idea is frightening! Hence the government’s focus, decades ago and now, on promoting contraceptive method use to its population.
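
(A quick back-of-the-envelope check of those figures, using only the numbers quoted above:)

# Rough arithmetic check of the unemployment figures mentioned above
youth.pop = 43.4e6      # approximate number of Indonesians aged 15-24
youth.pop * 0.133       # ~5.8 million unemployed at Canada's 13.3% youth rate
youth.pop * 0.196       # ~8.5 million unemployed at Indonesia's 19.6% youth rate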

That is the spirit behind the survey data we’ll look at today! First, download the data from my plot.ly account and load it into R.

library(plotly)
library(caret)
library(dplyr)
library(reshape2)
library(scales)

cmc = read.csv("cmc.csv", colClasses = c(Wife.Edu = "factor", Husband.Edu = "factor", Wife.Religion = "factor", 
                                         Wife.Working = "factor", Husband.Occu = "factor", Std.of.Living = "factor",
                                         Media.Exposure = "factor", Contraceptive.Method.Used = "factor"))

levels(cmc$Contraceptive.Method.Used) = c("None","Short-Term","Long-Term")

table(cmc$Contraceptive.Method.Used)
      None Short-Term  Long-Term 
       629        333        511 

prop.table(table(cmc$Contraceptive.Method.Used))

      None Short-Term  Long-Term 
 0.4270197  0.2260692  0.3469111 

Everything in this data set is stored as a number, so the first thing I do is define as factors the columns that the documentation says are factors. Then we see the numeric breakdown of how many women fell into each contraceptive method category. There are 1,473 responses overall, and ‘None’ is the largest category, although it is not hugely larger than the others (a chi-squared test does tell me that the three proportions are significantly different from one another).
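
For the curious, that chi-squared test can be run directly on the frequency table; a minimal sketch of what I mean:

# Goodness-of-fit test: are the three contraceptive-method proportions equal?
chisq.test(table(cmc$Contraceptive.Method.Used))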

Next up, let’s set up the cross-validation scheme and the training vs. testing indices, train a bunch of models, use the testing sample to compare their performance, and then look at some graphs.

# It's training time!
control = trainControl(method = "cv")
in_train = createDataPartition(cmc$Contraceptive.Method.Used, p=.75, list=FALSE)

contraceptive.model = train(Contraceptive.Method.Used ~ ., data=cmc, method="rf", metric="Kappa",
                            trControl=control, subset=in_train)

contraceptive.model.gbm = train(Contraceptive.Method.Used ~ ., data=cmc, method="gbm", metric="Kappa",
                            trControl=control, subset=in_train, verbose=FALSE)

contraceptive.model.svm = train(Contraceptive.Method.Used ~ ., data=cmc, method="svmRadial", metric="Kappa",
                                preProc = c('center','scale'), trControl=control, subset=in_train)

contraceptive.model.c50 = train(Contraceptive.Method.Used ~ ., data=cmc, method="C5.0", metric="Kappa",
                                trControl=control, subset=in_train, verbose=FALSE)


dec = expand.grid(.decay=c(0,.0001,.01,.05,.10))
control.mreg = trainControl(method = "cv")
contraceptive.model.mreg = train(Contraceptive.Method.Used ~ ., data=cmc, method="multinom", metric="Kappa",
                                  trControl=control.mreg, tuneGrid = dec, subset=in_train, verbose=FALSE)

# And now it's testing time...
cmc.test = cmc[-in_train,]

cmc.test = cmc.test %>% mutate(
  rf.class = predict(contraceptive.model, cmc.test, type="raw"),
  gbm.class = predict(contraceptive.model.gbm, cmc.test, type="raw"),
  svm.class = predict(contraceptive.model.svm, cmc.test, type="raw"),
  mreg.class = predict(contraceptive.model.mreg, cmc.test, type="raw"),
  c50.class = predict(contraceptive.model.c50, cmc.test, type="raw"))

# Here I'm setting up a matrix to host some performance statistics from each of the models
cmatrix.metrics = data.frame(model = c("rf", "gbm", "svm", "mreg","c50"), kappa = rep(NA,5), sensitivity1 = rep(NA,5), sensitivity2 = rep(NA,5), sensitivity3 = rep(NA,5), specificity1 = rep(NA,5), specificity2 = rep(NA,5), specificity3 = rep(NA,5))

# For each of the models, I use confusionMatrix to give me the performance statistics that I want
for (i in 11:15) {
  cmatrix.metrics[i-10,"kappa"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$overall[2][[1]]
  cmatrix.metrics[i-10, "sensitivity1"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[1,1]
  cmatrix.metrics[i-10, "sensitivity2"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[2,1]
  cmatrix.metrics[i-10, "sensitivity3"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[3,1]
  cmatrix.metrics[i-10, "specificity1"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[1,2]
  cmatrix.metrics[i-10, "specificity2"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[2,2]
  cmatrix.metrics[i-10, "specificity3"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[3,2]
}
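# Illustrative aside, not part of the original script: the same loop could index the
# prediction columns by name rather than by position, which is a bit less fragile, e.g.
#   pred.cols = c("rf.class", "gbm.class", "svm.class", "mreg.class", "c50.class")
#   for (i in seq_along(pred.cols)) {
#     cm = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[[pred.cols[i]]])
#     cmatrix.metrics[i, "kappa"] = cm$overall["Kappa"]
#     cmatrix.metrics[i, c("sensitivity1","sensitivity2","sensitivity3")] = cm$byClass[, "Sensitivity"]
#     cmatrix.metrics[i, c("specificity1","specificity2","specificity3")] = cm$byClass[, "Specificity"]
#   }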

# Now I transform my cmatrix.metrics matrix into a long format suitable for ggplot, graph it, and then post it to plot.ly

cmatrix.long = melt(cmatrix.metrics, id.vars=1, measure.vars=c(2:8))

ggplot(cmatrix.long, aes(x=model, y=value)) + geom_point(stat="identity", colour="blue", size=3) + facet_grid(~variable) + theme(axis.text.x=element_text(angle=90,vjust=0.5, colour="black",size=12), axis.text.y=element_text(colour="black", size=12), strip.text=element_text(size=12), axis.title.x=element_text(size=14,face="bold"), axis.title.y=element_text(size=14, face="bold")) + ggtitle("Performance Stats for Contraceptive Choice Models") + scale_x_discrete("Model") + scale_y_continuous("Value")

set_credentials_file("my.username","my.password")
py = plotly()
py$ggplotly()

And behold the beautiful graph (after some tinkering in the plot.ly interface, of course)

Performance Stats for Contraceptive Choice Models

Please note: the 1/2/3 next to sensitivity and specificity refer to the contraceptive use classes ‘None’, ‘Short-Term’, and ‘Long-Term’.

Firstly, the kappa stats are telling us that once you account for the accuracy you would expect by chance, the models don’t add a humongous amount of predictive power. So, we won’t be collecting a Nobel Prize for this one! The classes are evidently more difficult to positively identify than they are to negatively identify, as evidenced by the sensitivity scores being lower than the specificity scores. Interestingly, class 1 (users of NO contraceptive method) was the most positively identifiable.
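
If you’re curious what the kappa statistic is doing under the hood, here’s a minimal sketch that computes it by hand for the random forest predictions (rf.tab is just a name I’m making up for the cross-tabulation):

# Cohen's kappa computed manually from the random forest's confusion table
rf.tab = table(cmc.test$Contraceptive.Method.Used, cmc.test$rf.class)
observed.acc = sum(diag(rf.tab)) / sum(rf.tab)
# Accuracy expected by chance, given the row and column marginals
expected.acc = sum(rowSums(rf.tab) * colSums(rf.tab)) / sum(rf.tab)^2
(observed.acc - expected.acc) / (1 - expected.acc)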

Secondly, it looks like we have a list of the top 3 performers in terms of kappa, sensitivity, and specificity (in no particular order, because they don’t appear that different):

  • gbm (gradient boosting machine)
  • svm (support vector machine)
  • c50

Now that we have a subset of models, let’s see what they have to say about variable importance. Once we do, we’ll do some more fun graphing to see the variables in action.

gbm.imp = varImp(contraceptive.model.gbm)
svm.imp = varImp(contraceptive.model.svm)
c50.imp = varImp(contraceptive.model.c50)

I haven’t used ggplot here because I decided to try copy-pasting the tables directly into plot.ly for the heck of it. Let’s have a look at variable importance according to the three models now:
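
If you’d rather stay inside R, the varImp objects above can also be plotted directly with caret’s plot method; a quick sketch:

# Dotplot of the GBM model's top variables, straight from caret
plot(gbm.imp, top = 10)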

Contraceptive Choice Model: GBM Variable Importance

Contraceptive Choice Model: C5.0 Variable Importance

Contraceptive Method Model: SVM Variable Importance

One commonality you see throughout all of these graphs is that the wife’s age, her education, and the number of children born to her are among the most predictive variables for classifying where she will fall on the contraceptive choice spectrum provided in the survey. Given that result, let’s see some graphs that plot contraceptive choice against those 3 predictive variables:

plot(Contraceptive.Method.Used ~ Wife.Age, data=cmc, main="Contraceptive Method Used According to Age of Wife")

cmethod.by.kids = melt(dcast(cmc, Num.Kids ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4), variable.name="Contraceptive.Method.Used", value.name="Num.Records")

ggplot(cmethod.by.kids, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Number of Kids")
ggplot(cmethod.by.kids, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Number of Kids")

cmethod.by.wife.edu = melt(dcast(cmc, Wife.Edu ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4), variable.name="Contraceptive.Method.Used", value.name="Num.Records")

ggplot(cmethod.by.wife.edu, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Education Level of Wife")
ggplot(cmethod.by.wife.edu, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Education Level of Wife")

And now the graphs:

Contraceptive Choice by Age of Wife

Okay, so this isn’t a plotly graph, but I do rather like the way this mosaic plot conveys the information. The insight here is that after a certain age (around 36), the women seem to be less inclined to report the use of long-term contraceptive methods, and more inclined to report no contraceptive use at all. Could it be that the older women feel they are done with child bearing, and so no longer feel the need for contraception? Perhaps someone more knowledgeable on the matter could enlighten me!

Contraceptive Choice by Education Level of Wife

Contraceptive Choice by Education Level of Wife (proportions)

Here’s a neat one, showing the perhaps not-so-revelatory result that as the education level of the wife increases, the likelihood that she reports no contraception diminishes. In fact, it is the short-term contraceptive methods that show the biggest increase in likelihood as the education level increases. I don’t think you can say that education in and of itself causes women to choose contraception; perhaps it’s higher socio-economic status that leads them both to pursue higher education and to choose contraception. I’m not sure this survey will answer that question, however!

Contraceptive Choice by Number of Kids

Contraceptive Choice by Number of Kids (proportions)

Finally, we have number of kids, which exhibits an odd relationship with contraceptive choice. It appears as though at least half of the women with 1 child reported no contraception, but that proportion goes down when you look at women with 3 children.  After that, women are more and more likely to cite no contraception, likely reflecting their age.

Conclusion and observations:

As I mentioned earlier, I don’t expect to pick up any Nobel Prizes for this analysis. Substantively, the most interesting thing that came out of it for me is that it was stage-of-life factors (number of kids and age), in addition to the wife’s education (and possibly income, but that wasn’t measured), which formed the most predictive variables in the classification of who uses which contraceptive methods. Naively, I had expected the wife’s religion to be amongst the most predictive.

According to UNICEF, 6 out of 10 primary school drop-outs are girls, and that rises to 7 out of 10 in secondary school. Stop that trend from happening, and who knows what improvements might result? Perhaps continuing to invest in girls’ education will help lay the foundation for the later pursuit of higher education and a further slowing of population growth.

Lastly, I have some comments about plotly:

  1. It was a big buzz kill to discover that I couldn’t embed my plotly plots into my WordPress blog. Bummer.
  2. The plots that did result were very nice looking!
  3. I found myself getting impatient figuring out where to click to customize my plots to my liking. There’s something about going from a command line to a GUI which is odd for me!
  4. I initially thought it was cool that I could save a theme for my plot once I customized it to my liking.  I was disappointed to learn that the theme was not saved as an absolute set of visual characteristics.  For some reason those characteristics changed depending on the initial graph I imported from R and I could not just apply it and be done with it.
  5. I found myself wondering if there was a better way of getting my R graphs into plotly than the prescribed py$ggplotly() method. It’s not my biggest complaint, but somehow I’d rather just have one command to batch upload all of my graphs.

I’ll be watching plotly as it evolves and hope to see it improve even more than it has already! Good luck!

Nuclear vs Green Energy: Share the Wealth or Get Your Own?

Thanks to Ontario Open Data, a survey dataset was recently made public containing people’s responses to questions about Ontario’s Long Term Energy Plan (LTEP). The survey did fairly well in terms of raw response numbers, with 7,889 responses in total (although who knows how many people it was sent to!). As you’ll see in later images in this post, two major goals of Ontario’s LTEP are to eliminate power generation from coal and to maximize savings by encouraging lots of conservation.

For now though, I’ll introduce my focus of interest: Which energy sources did survey respondents think should be relied on in the future, and how does that correlate with their views on energy management/sharing?

As you can see in the graph below, survey respondents were given a 7-point scale and asked to use it to rate the importance of different energy source options (the scale has been flipped so that 7 is the most important and 1 is the least). Perhaps it’s my ignorance of this whole discussion, but it surprised me that 76% of respondents rated Nuclear power as at least a 5/7 on the importance scale! Nuclear power? But what about Chernobyl and Fukushima? To be fair, although terribly dramatic and devastating, those were isolated incidents, and measures have been taken to ensure our current nuclear reactors are and will be disaster-safe. Realistically, I think most people just don’t think about those things! A few other things to notice here: conservation does have its adherents, with 37% giving a positive response. Also, I think it was surprising (and perhaps saddening) to see that green energy has so few adherents, proportionately speaking.

Survey: Importance of Energy Sources

After staring at this graph for a while, I had the idea to see what interesting differences I could find between people who support Nuclear energy and those who support Green energy. What I found is quite striking in its consistency:

  1. Those who believe Nuclear energy is important for Ontario’s future mix of energy sources seem to be more confident that there’s enough energy to share between regions and that independence in power generation is not entirely necessary.
  2. On the flip side, those who believe Green energy is important for Ontario’s future mix of energy sources seem to be more confident that there isn’t enough energy to share between regions and that independence in power generation should be nurtured.

See for yourself in the following graphs:

Survey: Regions Should Make Conservation their First Priority

Survey: Self Sustaining Regions

Survey: Region Responsible for Growing Demand

Survey: Regions buy Power

Does this make sense in light of actual facts? The graph below comes from a very digestible page set up by the Ontario Ministry of Energy to communicate its Long Term Energy Plan. As they make pretty obvious, Nuclear energy accounted for over half of energy production in Ontario in 2013, whereas the newer green energy sources (Solar, Bioenergy, and Wind, as opposed to Hydro) amounted to about 5%. In their forecast for 2032, they are hopeful that these will account for 13% of energy production in Ontario. Still not the lion’s share of energy, but if you add that to the 22% accounted for by Hydro, then you get 35% of all energy production, which admittedly isn’t bad! Still, I wonder what people were thinking of when they saw “Green energy” on the survey. If the newer sources, then I think what is going on here is that people who advocate for Green energy sources such as wind and solar have an idea of how difficult it is to power a land mass such as Ontario with these kinds of power stations. People advocating for Nuclear, on the other hand, either are blissfully ignorant, or simply understand that Nuclear power plants are able to serve a wider area.
MOE: Screenshot from 2013-12-08 13:28:04

MOE: Screenshot from 2013-12-08 13:41:06

All of this being said, as you can see in the image above, the Ontario provincial government actually wants to *reduce* the province’s reliance on Nuclear energy over the next 5 years, and in fact will not be building new reactors. I contacted Mark Smith, Senior Media Relations Coordinator at the Ontario Ministry of Energy, to ask him to comment on the role of Nuclear energy in the long run. Following are some tidbits that he shared with me over email:

Over the past few months, we have had extensive consultations as part of our review of Ontario’s Long Term Energy Plan (LTEP). There is a strong consensus that now is not the right time to build new nuclear.

Ontario is currently in a comfortable supply situation and it does not require the additional power.

We will continue to monitor the demand and supply situation and look at building new nuclear in the future, if the need arises.

Nuclear power has been operating safely in our province for over 40 years, and is held to the strictest regulations and safety requirements to ensure that the continued operation of existing facilities, and any potential new build are held to the highest standards.

We will continue with our nuclear refurbishment plans for which there was strong province-wide support during the LTEP consultations.

During refurbishment, both OPG and Bruce Power will be subject to the strictest possible oversight to ensure safety, reliable supply and value for ratepayers.

Nuclear refurbishments will create thousands of jobs and extend the lives of our existing fleet for another 25-30 years, sustaining thousands of highly-skilled and high-paying jobs.

The nuclear sector will continue be a vital, innovative part of Ontario, creating new technology which is exported around the world.

Well, even Mr. Mark Smith seems confident about Nuclear energy! I tried to contact the David Suzuki Foundation to see if they’d have anything to say on the role of Green energy in Ontario’s future, but they were unavailable for comment.

Well, there you have it! Despite confidence in Nuclear energy as a viable source for the future, the province will be increasing its investments in both Green energy and conservation! Here’s hoping for an electric following decade 🙂

(P.S. As usual, the R code follows)


ltep = read.csv("ltep-survey-results-all.csv")
library(likert)
library(ggthemes)
# Here I flip the scoring
ltep[,13:19] = sapply(ltep[,13:19], function (x) 8 - x)
deal.w.esources = likert(ltep[,13:19])
summary(deal.w.esources)
plot(deal.w.esources, text.size=6, text.color="black") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold")) + ggtitle("What guidelines should Ontario use\n for its future mix of energy sources?")
library(plyr)
# following is a lot of boring code where I
# generate numerous data frames that contain percent of
# respondents who agree with the statement by their rated
# importance of Nuclear vs. Green Energy
# Here's the nuclear section
self.sustaining.region.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$A.region.should.be.responsible.for.generating.at.least.some.of.its.own.power. == "Agree", na.rm=TRUE))
self.sustaining.region.by.nuke = self.sustaining.region.by.nuke[1:7,]
region.buy.power.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$Regions.should.have.the.option.to.buy.all.of.their.power.from.sources.in.another.region..if.the.power.is.available. == "Agree", na.rm=TRUE))
region.buy.power.by.nuke = region.buy.power.by.nuke[1:7,]
region.resp.for.growing.demand.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$If.a.region.s.power.demand.is.growing..it.should.be.responsible.for.building.the.generation.or.transmission.to.supply.that.it.needs. == "Agree", na.rm=TRUE))
region.resp.for.growing.demand.by.nuke = region.resp.for.growing.demand.by.nuke[1:7,]
regions.make.cons.first.priority.by.nuke = ddply(ltep, .(Nuclear.power.is.our.best.option.), function (x) mean(x$Regions.should.make.conservation.their.first.priority.to.reduce.the.need.for.new.supply. == "Agree", na.rm=TRUE))
regions.make.cons.first.priority.by.nuke = regions.make.cons.first.priority.by.nuke[1:7,]
# Here's the green energy section
self.sustaining.region.by.green = ddply(ltep, .(Green.energy..e.g...wind..solar..is.our.best.option.), function (x) mean(x$A.region.should.be.responsible.for.generating.at.least.some.of.its.own.power. == "Agree", na.rm=TRUE))
self.sustaining.region.by.green = self.sustaining.region.by.green[1:7,]
region.buy.power.by.green = ddply(ltep, .(Green.energy..e.g...wind..solar..is.our.best.option.), function (x) mean(x$Regions.should.have.the.option.to.buy.all.of.their.power.from.sources.in.another.region..if.the.power.is.available. == "Agree", na.rm=TRUE))
region.buy.power.by.green = region.buy.power.by.green[1:7,]
region.resp.for.growing.demand.by.green = ddply(ltep, .(Green.energy..e.g...wind..solar..is.our.best.option.), function (x) mean(x$If.a.region.s.power.demand.is.growing..it.should.be.responsible.for.building.the.generation.or.transmission.to.supply.that.it.needs. == "Agree", na.rm=TRUE))
region.resp.for.growing.demand.by.green = region.resp.for.growing.demand.by.green[1:7,]
regions.make.cons.first.priority.by.green = ddply(ltep, .(Green.energy..e.g...wind..solar..is.our.best.option.), function (x) mean(x$Regions.should.make.conservation.their.first.priority.to.reduce.the.need.for.new.supply. == "Agree", na.rm=TRUE))
regions.make.cons.first.priority.by.green = regions.make.cons.first.priority.by.green[1:7,]
# in this section I alternate between getting each data frame ready for plotting
# and obviously plotting the data frame. I liked the yellow and green that I used in the plot
# to represent Nuclear vs. Green energy 🙂
library(ggplot2)
library(scales)
self.sustaining.region = rbind(data.frame(Importance=seq(1,7),Pct.Agree=self.sustaining.region.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=self.sustaining.region.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(self.sustaining.region, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("A Region Should be Responsible for Generating at Least Some of Its Own Power\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))
region.buy.power = rbind(data.frame(Importance=seq(1,7),Pct.Agree=region.buy.power.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=region.buy.power.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(region.buy.power, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("Regions Should Have the Option to Buy All of their Power\nfrom Sources in Another Region if Available\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))
region.resp.for.growing.demand = rbind(data.frame(Importance=seq(1,7),Pct.Agree=region.resp.for.growing.demand.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=region.resp.for.growing.demand.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(region.resp.for.growing.demand, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("If a Region's Power Demand is Growing\nIt Should be Responsible for Building the Transmission/Supply it Needs\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))
regions.make.cons.first.priority = rbind(data.frame(Importance=seq(1,7),Pct.Agree=regions.make.cons.first.priority.by.nuke$V1, Energy=rep("Nuclear Energy",7)), data.frame(Importance=seq(1,7),Pct.Agree=regions.make.cons.first.priority.by.green$V1, Energy=rep("Green Energy", 7)))
ggplot(regions.make.cons.first.priority, aes(x=as.factor(Importance), y=Pct.Agree, fill=Energy)) + geom_bar(stat="identity",width=.5) + scale_y_continuous(labels=percent, name="Percent Who Agree") + scale_x_discrete(name="Rated Level of Importance (1-7 = Low to High)") + facet_grid(~Energy) + scale_fill_manual(values=c("khaki1","forest green")) + ggtitle("Regions Should Make Conservation Their First Priority \nto Reduce the Need for New Supply\n(Agreement by Rated Importance of Nuclear vs. Green Energy)") + theme(axis.text.x=element_text(colour="black", face="bold", size=14), axis.text.y=element_text(colour="black", face="bold", size=14), axis.title.x=element_text(colour="black", face="bold", size=14), axis.title.y=element_text(colour="black", face="bold", size=14), plot.title=element_text(size=18, face="bold"), strip.text=element_text(size=14), legend.title=element_text(size=14), legend.text=element_text(size=14))


When did “How I Met Your Mother” become less legen.. wait for it…

…dary! Or, as you’ll see below, when did it become slightly less legendary? The analysis in this post was inspired by DiffusePrioR’s analysis of when The Simpsons became less Cromulent. When I read his post a while back, I thought it was pretty cool and told myself that I would use his method on another relevant dataset in the future.

Enter IMDB average user ratings of every episode of How I Met Your Mother (up to season 9, episode 5). Once I brought up the page showing rated episodes by date, I simply copied and pasted the table into LibreOffice Calc, saved it as a csv, and then loaded it into R. As you can see in DiffusePrioR’s post (and also in mine), the purpose of the changepoint package is to find changepoints, or abrupt baseline shifts, in your data.

Now, some R code:

library(changepoint)
library(ggplot2)
library(zoo)

himym = read.csv("himym.csv")
mean2.pelt = cpt.mean(himym$Rating, method="PELT")
plot(mean2.pelt, type='l', cpt.col='red',xlab="Episode",
     ylab='IMDB Avg Rating', cpt.width=2, main="Changepoints in IMDB Ratings of 'How I Met Your Mother'",ylim=c(0,10), xlim=c(0,200))

Changepoints in IMDB Ratings of HIMYM
Above we see that the changepoint method I used has detected two baselines in the rating data: one with a value of 8.1, and the other with a value of 7.6. The first thing you have to admit here is that despite a downward shift in the average rating, How I Met Your Mother still seems to be pretty popular and held in high regard. That being said, I do feel as though the earlier episodes were better, and apparently the masses of IMDB users rating each episode generally agree! So, what’s the first episode that marked the abrupt downward shift in average rating for the show?
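
For reference, those changepoint locations and segment means can be pulled straight out of the fitted object; a small sketch using the model trained above:

# Location of the detected changepoint(s) (last episode of the preceding segment)
cpts(mean2.pelt)
# Estimated mean IMDB rating within each segment
param.est(mean2.pelt)$mean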

himym[161,1:2]
    Ep.Num   Ep.Title
161    8.1 Farhampton

Ah, Farhampton, Season 8, Episode 1. That seems like a natural breakpoint! But the low score doesn’t even come until Season 8, Episode 2. Let’s see what people had to say about episodes 1 and 2 on IMDB (forgive their punctuation, spelling, and grammar):

“Boring return I thought as once again the writing was horrible, the show seems to want to go on forever and also some of the actors in the show deserve better then what they are getting week to week. Even at the end it seems we finally see the mother of the show but we do not see her face.”

“The first episode however.was not funny at all, the thrill of maybe finding who is the mother made me think that it was not too late, but then again the joke is on me. This second episode (The Pre-Nup) only made me think that this show will not find a good finale, using all the clichés on the book of the sitcom. I just don’t care about who is the mother anymore, I just want a honesty last season, showing the characters as human beings that learn with their mistakes and move on with their lifes.”

So it appears that some people were expecting the pace to quicken in Season 8, but were disappointed. That being said, if you look back at the graph at the beginning of my post, you’ll see that although Season 8 was the start of abruptly lower ratings, the average ratings had been dropping very slowly from well before that point. Given that the rate of change is so slow, perhaps detecting the “beginning of the end” using the changepoint package is not sensitive enough for our purpose.

That’s why I decided to bin all How I Met Your Mother episodes into groups of 10 successive episodes each, and then calculate the proportion of episodes in each bin with an average rating under 8.0. You can see the graph below:

# Proportion of episodes rated below 8.0 within each group of 10 successive episodes
eps.under.8 = rollapply(himym$Rating, width=10, function (x) mean(x < 8), by=10, partial=TRUE)
# Labels for each group of 10 episodes (the final group only has 9 episodes)
ep.nums = paste(((seq(1,19)-1)*10)+1, ((seq(1,19)-1)*10)+10, sep="-")
ep.nums[19] = "181-189"
ep.grouped.ratings = data.frame(ep.nums, eps.under.8)

ggplot(ep.grouped.ratings, aes(x=reorder(ep.nums, seq(1,19)), y=eps.under.8, group=1)) + geom_line(size=3, colour="dark red") + scale_y_continuous(name="Proportion of Episodes/10 With a Below 8.0 Avg Rating") + scale_x_discrete(name="Episode Number Grouping") + theme(axis.text.x=element_text(angle=-90, vjust=.5, colour="black", size=14), axis.text.y=element_text(colour="black", size=14), axis.title.x=element_text(size=16, face='bold'), axis.title.y=element_text(size=16, face="bold"), plot.title=element_text(size=20,face="bold")) + geom_vline(xintercept = 10,linetype='longdash') + ggtitle("How I Met Your Mother Ratings, Every 10 Episodes")

Grouped HIMYM Ratings

In the above graph, you can see that because I’ve set such a high threshold (8.0), the results look much more dramatic. Until episodes 71-80, the show enjoyed a very high proportion of ratings of 8.0 or over (80% for 6 of the episode groups, 90% in one, and 100% in the first!). It looks as though it wasn’t until episodes 91-100 that you can really notice the ratings failing to stay above 8.0. Which season and episodes were those?

himym[91:100,1:2]
    Ep.Num                           Ep.Title
91    5.30                          Robin 101
92    5.40              The Sexless Innkeeper
93    5.50                   Duel Citizenship
94    5.60                           Bagpipes
95    5.70                    The Rough Patch
96    5.80                       The Playbook
97    5.90 Slapsgiving 2: Revenge of the Slap
98    5.10                         The Window
99    5.11                Last Cigarette Ever
100   5.12                    Girls Vs. Suits

Well, now we have some evidence that quality started to go down earlier in the show (although bear in mind we are still talking about a highly rated show here!). Again, let’s go to the IMDB comments to see what we can glean about these episodes:

(in reference to “The Sexless Innkeeper”) “So after waiting 9/10 months for Robin and Barney to finally stop looking for reasons not to admit their feelings to each other, Lily is euphoric when the two finally do concede that they love each other. And how does she celebrate this?, she invites them to an incredibly awkward, incredibly forward night that displays EVERYTHING that Barney and Robin hate about relationships and gives the couple a massive reason NOT to be together. Lily and Marshall then react, whence realising Robin and The Barnicle didn’t enjoy the evening, as if they are the couple that has been mistreated. Lily then dumps Robin and Barney without so much as a phone call and plays right into the pair’s issues of abandonment.”

(in reference to “Last Cigarette Ever”) “… “Last Cigarette Ever” is the worst one. There are many reasons why I feel this way, but the two big reasons are: 1. It made no sense. Up to this point, I believe only Robin had ever been shown smoking a cigarette. Now all of a sudden, they all are smokers. I’m not going to pretend that the writers haven’t made mistakes before, but this was a little ridiculous. 2. It felt like a half-hour Marlboro ad. I think the intent was to discourage smoking, but I personally felt like it did the opposite. Awful episode in an otherwise very good series.”

I’m finding there don’t tend to be many user reviews of episodes on IMDB, so this next one is from metacritic.com:

“How i met your mother is starting to annoy everyone as the jokes are repetitive, Ted, Marshall & Lilly have started to get boring. There were only 5-6 good episodes in the season which includes the PLAYBOOK. The story line and the characters seem to be awfully close to NBC’s ”F.R.I.E.N.D.S”, but friends steals the show with better writing and acting which HIMYM seems to lack. Only Neil’s role as Barney has helped CBS to keep this show on air for 5 seasons.”

In the end, it’s hard to attribute shifts in average rating to these few comments, but at least they do highlight some of the same complaints I’ve had or heard about the show (although I didn’t care so much about the cigarette episode!!). All we can say here is that the ratings started to decline around the start of season 5, and suffered a sharper (although still not grand-scale) decline at the beginning of season 8.

Who uses E-Bikes in Toronto? Fun with Recursive Partitioning Trees and Toronto Open Data

I found a fun survey released on the Toronto Open Data website that investigates the travel/commuting behaviour of Torontonians, with a special focus on E-bikes. When I opened up the file, I found various demographic information, in addition to a question asking people their most frequently used mode of transportation. Exactly 2,238 people responded to this survey, of which 194 were frequent E-bike users. I figured that’s enough to do some data mining, and an especially fun opportunity to use recursive partitioning trees to do it!

Following is the code I used (notice in the model statements that I focus specifically on E-bike users versus everyone else):


library(rpart)
library(plyr)
library(rpart.plot)
ebike = read.csv("E-Bike_Survey_Responses.csv")
# This next part is strictly to change any blank responses into NAs
ebike[,2:10][ebike[,2:10] == ''] = NA
# In this section we use mapvalues from the plyr package to get rid of blanks, but also
# to reduce the number of values in each factor that we use.
# (Note: mapvalues needs its 'from' and 'to' vectors to be the same length)
ebike$Sex = mapvalues(ebike$Sex, c('',levels(ebike$Sex)[c(1,2,6)]), rep("Other", 4))
ebike$Health = mapvalues(ebike$How.would.you.describe.your.level.of.physical.health., c('', levels(ebike$How.would.you.describe.your.level.of.physical.health.)[c(1,4,5,6,12,13)]), c(NA, rep("Other", 6)))
ebike$Edu = mapvalues(ebike[,5], c('', levels(ebike[,5])[c(1,4,8,14,23)]), c(NA, rep('Other', 5)))
ebike$Income = mapvalues(ebike[,6], '', NA)
ebike$Age = mapvalues(ebike[,2], '', NA)
# People put a lot of varying answers in here, but the categories I've chosen here can be found in most of them.
ebike$transport = factor(ifelse(grepl("bicycle",ebike[,11]),"Bicycle",
ifelse(grepl("e-bike", ebike[,11]), "E-bike",
ifelse(grepl("car", ebike[,11]), "Car",
ifelse(grepl("transit", ebike[,11]), "Transit","Other")))))
# Here we ask R to make two trees based first on Sex Health and Age (they seem like they go together)
# then based on education and income. You can try to put them together, but you will find that only some are
# chosen as the most significant for the classification. Therefore, keeping them apart describes for us
# E-bike users on separate dimensions.
b = rpart(transport == "E-bike"~ Sex + Health + Age, data=ebike)
c = rpart(transport == "E-bike" ~ Edu + Income, data=ebike)
# And here we plot the two Partition Tree models. I like seeing the factor label
# values in their entirety, so I've chosen a large enough number for the 'faclen' argument
# in each call to rpart.plot
rpart.plot(b, type=1,extra=1, varlen=0, faclen=10)
rpart.plot(c, type=1,extra=1, varlen=0, faclen=20)


Here is the first tree, based on Sex, Health, and Age (remember that the factor levels shown are not the only ones; when you look at the “no” side of the tree, you are examining the proportion of E-bike users described by the factor levels not shown):

Health and Age Tree
As you can see, only Health and Age came out as significantly discriminating between E-bike users and everyone else. What this tree is telling us is that it’s not the people in “Excellent, Good, Very good” health who are likely to use E-bikes, but rather those in the un-shown part of the Health spectrum: “Other, Fairly good, Poor”. That’s interesting in and of itself. It also seems that people in Excellent or Very Good health are more likely (44%) to be riding Bicycles than people at other levels of health (23%). That makes sense! You’re not going to choose something effortful if your health isn’t great.

We also see a very interesting finding that it is in fact the 50 – 64 year olds (whose health isn’t great) who are more likely to be riding an E-bike compared to people of all other age groups!

Here’s the second tree based on Education and Income:

Education and Income Tree
Here we see that it is in fact not the university-educated folks who are more likely to ride E-bikes, but rather people with a “College or trade school diploma, High school diploma”. Interesting!! Further, we see that amongst those who aren’t university educated, it’s those who say they make less than $80,000 in income who are more likely to ride E-bikes.

So now we have an interesting picture emerging, with two parallel descriptions of who is most likely to ride E-bikes:

1) 50 – 64 year olds in not the greatest of health
and
2) Non-university-educated folks with less than $80,000 in income.

Toronto, these are your E-bike users!

Do Torontonians Want a New Casino? Survey Analysis Part 1

Toronto City Council is in the midst of a very lengthy process of considering whether or not to allow the OLG to build a new casino in Toronto, and if so, where. The process started in November of 2012, and set out to answer this question through many and varied consultations with the public and key stakeholders in the city.

One of the methods of public consultation that they used was a “Casino Feedback Form”, or survey, that was distributed online and in person. By the time the deadline to collect responses had passed (January 25, 11:59pm), they had collected a whopping 17,780 responses. The agency seemingly responsible for the survey is called DPRA, and from what I can tell they did a pretty decent job of creating and distributing it.

In a surprisingly modern and democratic move, Toronto City Council made the response data for the survey available on the Toronto Open Data website, which I couldn’t help but download and analyze for myself (with R, of course!).

For a relatively small survey, it’s very rich in information. I love having hobby data sets to work with from time to time, so I’m going to dedicate a few posts to the analysis of this response data file. This post won’t show too much that’s different from the report that DPRA has already released, as it contains largely univariate analyses. In later posts, however, I will get around to asking and answering questions of a more multivariate nature! To preserve the flow of the post, I will put the R code at the end, instead of interspersing it throughout like I normally do. Unless otherwise specified, all numerical axes represent the % of people who selected a particular response on the survey.

Without further ado, I will start with some key findings:

Key Findings

  1. With 17,780 responses, Toronto City Council obtained a hefty data set with pretty decent geographical coverage of the core areas of Toronto (Toronto, North York, Scarborough, Etobicoke, East York, Mississauga). This is much better than the Ipsos Reid Casino Survey’s response data set of 906 respondents.
  2. Only 25.7% of respondents were somewhat or strongly in favour of having a new casino in Toronto. I’d say that’s overwhelmingly negative!
  3. Ratings of the suitability of a casino at three different locations, by type of casino, indicate that people are more favourable towards an Integrated Entertainment Complex (basically a casino with extra amenities) than towards a standalone casino.
  4. Of the three locations, people were most favourable towards an Integrated Entertainment Complex at Exhibition Place. However, bear in mind that only 27.4% of respondents thought it was suitable at all. This is a ‘best of the worst’ result!
  5. When asked to rate the importance of a list of issues surrounding the building of a new casino in Toronto, respondents rated the following as most important: safety, health, addiction, public space, traffic, and integration with surrounding areas.

Geographic Distribution of Responses

In a relatively short time, City Council managed to collect many responses to their survey, and I wanted to look at the geographic distribution of those responses. Luckily, the survey included a question that asked for the first 3 characters of the respondent’s postal code (their FSA). If you have a file containing geocoded postal codes, you can then plot the respondents on a map. I managed to find such a file on a website called geocoder.ca, with latitude and longitude coordinates for over 90,000 postal codes. Once I got the file into R, I made sure that all FSA codes in the survey data were capitalized, created an FSA column in the geocoded file, and then merged the geocoded dataset into the survey dataset. This isn’t a completely valid approach (many postal codes share a single FSA), but when looking at a broad area, I don’t think the errors in plotting points on the map are going to look that serious.
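
The merge itself boils down to just a few lines. Below is a minimal sketch of the idea, using the casino and pcodes objects loaded in the R code at the end of this post; note that PostalCode, Latitude, Longitude and the survey’s FSA column are placeholder names (the geocoder.ca file is read in without headers, so the real names will differ):

# Sketch only: build a 3-character FSA key and attach one lat/long per FSA to each response
pcodes.df = pcodes[,]                                              # pull the ffdf into an ordinary data frame
pcodes.df$FSA = toupper(substr(pcodes.df$PostalCode, 1, 3))        # 'PostalCode' is a placeholder column name
fsa.coords = pcodes.df[!duplicated(pcodes.df$FSA), c("FSA", "Latitude", "Longitude")]
casino$FSA = toupper(casino$FSA)                                   # 'FSA' is a placeholder column name
casino = merge(casino, fsa.coords, by = "FSA", all.x = TRUE)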

For a survey about Toronto, the geographic distribution was actually pretty wide.  Have a look at the complete distribution:

Total Geo Distribution of Responses

Obviously there are a whole lot of responses in Southern Ontario, but we even see a smattering of responses in neighbouring provinces as well. However, let’s look at a way of zooming in on the large cluster of Southern Ontario cities. From the postal codes, I was able to get the city in which each response was made, and from that I pulled out what looked like a good cluster of top Southern Ontario cities:

          City	 # Responses
       Toronto	8389
    North York	1553
   Scarborough	1145
     Etobicoke	936
     East York	462
   Mississauga	201
       Markham	149
      Brampton	111
 Richmond Hill	79
     Thornhill	62
          York	59
         Maple	58
        Milton	30
      Oakville	30
    Woodbridge	30
    Burlington	28
        Oshawa	25
     Pickering	22
        Whitby	19
      Hamilton	17
        Bolton	14
        Guelph	13
      Nobleton	12
        Aurora	11
          Ajax	10
       Caledon	10
   Stouffville	10
        Barrie	9

Lots of people in Toronto, obviously, a fair number in North York, Scarborough, and Etobicoke, and then the frequencies drop off sharply from there. However, these city labels come from the geocoding, and who knows whether some people it places in Toronto are actually from North York (the city proper versus one of its boroughs, so to speak). So, I filtered the latitude and longitude coordinates based on this top list to get the zoomed-in map you’ll see below.
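
For reference, the filtering step itself is just a subset call; here’s a quick sketch of the idea, where City and top.cities are placeholder names (the real column name depends on how the geocoded city ends up stored after the merge):

# Keep only responses geocoded to the Southern Ontario cluster listed above
top.cities = c("Toronto", "North York", "Scarborough", "Etobicoke", "East York",
               "Mississauga", "Markham", "Brampton", "Richmond Hill", "Thornhill")
casino.gta = subset(casino, City %in% top.cities)   # 'City' is a placeholder column name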

Toronto Geo Distribution of Responses

Much better than a table! I used transparency on the colours of the circles to help better distinguish dense clusters of responses from sparse ones. Based on the map, I can see 3 patterns:

1) It looks like a huge cluster of responses came from an area of Toronto approximately bounded by Dufferin on the west, Highway 404 on the east, the Gardiner on the south, and the 401 on the north.

2) There’s also an interesting vertical cluster that seems to go from well south of highway 400 and the 401, and travels north to the 407.

3) I’m not sure I would call this a cluster per se, but there definitely seems to be a pattern where you find responses all the way along the Gardiner Expressway/Queen Elizabeth Way/Kingston Road/401 from Burlington to Oshawa.

Now for the survey results!

Demographics of Respondents


As you can see, about 80% of the respondents disclosed their gender, with a noticeable bias towards men. Also, most of the respondents who disclosed their age were between 25 and 64 years of age. This might be a disadvantage, according to a recent report by Statistics Canada on gambling. If you look at page 6 of the report, you will see that of all age groups of female gamblers, those 65 and older spend the most money on Casinos, Slot Machines, and VLTs per 1 person spending household. However, I guess it’s better to have some information than none.

Feelings about the new casino

Feelings about the new casino

Well, isn’t that something? Only about a quarter of all people surveyed actually have positive feelings about a new casino! I have to say this is pretty telling. You would think this would be damning information, but here’s where we fall into the trap of whether or not to trust a survey result.

Here we have this telling response, but then again, Ipsos Reid conducted a poll of 906 respondents that concluded that 52% of Torontonians “either strongly or somewhat support a new gambling venue within its borders”. In that poll, people were asked about their support of a new casino at the beginning and at the end: after supplying people with the various arguments made by both sides of the debate, they asked the question again, and apparently the proportion supporting the casino was 54% on the second asking. They don’t even link to the original question form, so I’m left to wonder exactly how it was phrased, and what preceded it. The only hint is in this phrase: “if a vote were held tomorrow on the idea of building a casino in the city of Toronto…”. Does that seem comparable to you?

A Casino for “Toronto The Good”?

Casino fit image of toronto

This question seems pretty similar to the first one. If a new casino fits your image of Toronto perfectly, then you’re probably going to be strongly in favour of one! Obviously, most people seem pretty sure that a new casino just isn’t the kind of thing that fits in with their image of “Toronto the Good”.

Where to build a new casino

Where casino built

In the response pattern here, we seem to see a kind of ‘not in/near my backyard’ mentality going on. A slight majority of respondents seem to be saying that if a new casino is to be built, it should be somewhere decently far away from Toronto, perhaps so that they don’t have to deal with the consequences. I’ll eat my sock if the “Neither” folks aren’t the same people who were strongly opposed to the casino.

Casino Suitability in Downtown Area

Casino suitability at Exhibition Place

Casino Suitability at Port Lands

They also asked respondents to rate the suitability of a new casino in three different locations:

  1. A downtown area (bounded by Spadina Avenue, King Street, Jarvis Street and Queens Quay)
  2. Exhibition Place (bounded by Gardiner Expressway, Lake Shore Boulevard, Dufferin Street and Strachan Avenue)
  3. Port Lands (located south of the Don Valley and Gardiner/Lake Shore, east of the downtown core)

Looking at the above 3 graphs, you see right away that a kind of casino called an Integrated Entertainment Complex (a smorgasbord of casino, restaurant, theatre, hotel, etc.) is viewed more favourably than a standalone casino at any location. That being said, the responses are still largely negative! Of the 3 options for the location of an Integrated Entertainment Complex (IEC), Exhibition Place was rated the most positively, by a small margin (18.1% said highly suitable, vs. 16.2% for downtown Toronto). There are definitely those at Exhibition Place who want the Toronto landmark to be chosen!

Desired Features of IEC by Location


These charts indicate that those who can imagine an Integrated Entertainment Complex in any of the 3 locations would like features that allow them to sit or stand and enjoy themselves. Restaurants, Cultural and Arts Facilities, and Theatre are tops in all 3 locations (but still bear in mind that fewer than half opted for those choices). A quick Google search reveals that restaurants and theatres are mentioned in a large number of search results. An article in the Toronto Sun boasts that an Integrated Entertainment Complex would catapult Toronto into the stars as even more of a tourist destination. Interestingly, the article also mentions the high monetary value of convention visitors and how much they would add to the revenues generated for the city. I find it funny that the popularity of having convention centre space in this survey is at its highest when people are asked about Exhibition Place. Exhibition Place already has convention centre space!! I don’t understand the logic, but maybe someone will explain it to me.

Issues of Importance Surrounding a New Casino

Issues of Importance re the New Casino

Unlike the previous graphs, this one charts the % of respondents who gave a particular response for each item; in this case, the % who answered “Very Important” when asked to rate each issue surrounding the new casino. Unlike some of the previous questions, this one did not include a “No Casino” option, so more people could contribute positively to the response distribution. You can already see that people are pretty riled up about some serious social and environmental issues. They’re worried about safety, health, addiction, public space (which sounds like a worry about clutter to me), traffic, and integration with surrounding areas. I’ll bet that the people worried about these top issues are the people most likely to say that they don’t want a casino anywhere. It will be interesting to uncover some factor structure here and then find out what the pro- and anti-casino folks are each concerned with.
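
As a teaser for that factor structure idea, here’s a minimal sketch of one possible starting point, assuming (as in the recoding loop in the R code below) that the numerically recoded importance items sit in columns 8 through 24 of the casino data frame:

# Exploratory factor analysis of the importance ratings (complete cases only)
importance.items = na.omit(casino[, 8:24])
factanal(importance.items, factors = 2)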

For my next post, I have a few simple questions in mind to investigate:

  1. Who exactly wants or doesn’t want a new casino, and where are they? What are those who do and don’t want a casino most concerned with?
  2. Is there a “not in my backyard” effect going on, where those who live closest to the proposed casino spots are the least likely to want a casino there, but more likely to want one elsewhere? I have latitude/longitude coordinates, and can convert them into distances from the proposed casino spots. I think that will be interesting to look at!


library(ff)
library(ffbase)
library(stringr)
library(ggplot2)
library(ggthemes)
library(reshape2)
library(RgoogleMaps)
# Loading 2 copies of the same data set so that I can convert one and have the original for its text values
casino = read.csv("/home/inkhorn/Downloads/casino_survey_results20130325.csv")
casino.orig = read.csv("/home/inkhorn/Downloads/casino_survey_results20130325.csv")
# Here's the dataset of canadian postal codes and latitude/longitude coordinates
pcodes = read.csv.ffdf(file="/home/inkhorn/Downloads/zipcodeset.txt", first.rows=50000, next.rows=50000, colClasses=NA, header=FALSE)
# I'm doing some numerical recoding here. If you can tell me a cleaner way of doing this
# then by all means please do. I found this process really annoyingly tedious.
casino$Q1_A = ifelse(casino.orig$Q1_A == "Neutral or Mixed Feelings", 3,
ifelse(casino.orig$Q1_A == "Somewhat in Favour", 4,
ifelse(casino.orig$Q1_A == "Somewhat Opposed", 2,
ifelse(casino.orig$Q1_A == "Strongly in Favour", 5,
ifelse(casino.orig$Q1_A == "Strongly Opposed", 1,NA)))))
casino$Q2_A = ifelse(casino.orig$Q2_A == "Does Not Fit My Image At All", 1,
ifelse(casino.orig$Q2_A == "Neutral / I am Not Sure",2,
ifelse(casino.orig$Q2_A == "Fits Image Somewhat", 3,
ifelse(casino.orig$Q2_A == "Fits Image Perfectly", 4, NA))))
for (i in 8:24) {
casino[,i] = ifelse(casino.orig[,i] == "Not Important At All", 1,
ifelse(casino.orig[,i] == "Somewhat Important", 2,
ifelse(casino.orig[,i] == "Very Important", 3,NA)))}
for (i in c(31:32,47,48,63,64)) {
casino[,i] = ifelse(casino.orig[,i] == "Highly Suitable",5,
ifelse(casino.orig[,i] == "Neutral or Mixed Feelings",3,
ifelse(casino.orig[,i] == "Somewhat Suitable",4,
ifelse(casino.orig[,i] == "Somewhat Unsuitable",2,
ifelse(casino.orig[,i] == "Strongly Unsuitable",1,NA)))))}
# There tended to be blank responses in the original dataset. When seeking to
# plot the responses in their original text option format, I got rid of them in some cases,
# or coded them as "Did not disclose" in others.
casino.orig$Q1_A[casino.orig$Q1_A == ""] = NA
casino.orig$Q1_A = factor(casino.orig$Q1_A, levels=c("Strongly Opposed","Somewhat Opposed","Neutral or Mixed Feelings","Somewhat in Favour","Strongly in Favour"))
# Here's the graph showing how people feel about a new casino
ggplot(subset(casino.orig, !is.na(Q1_A)), aes(x=Q1_A,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("How do you feel about having a new casino in Toronto?") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)), geom="text") + scale_y_continuous(labels=percent)
# How does the casino fit into your image of toronto…
ggplot(subset(casino.orig, Q2_A!= ''), aes(x=Q2_A,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("How does a new casino in Toronto fit your image of the City of Toronto?") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)),geom="text") + scale_y_continuous(labels=percent)
# Where you'd prefer to see it located
ggplot(subset(casino.orig, Q6!= ''), aes(x=Q6,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("If a casino is built, where would you prefer to see it located?") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)), geom="text") + scale_y_continuous(labels=percent)
# Here I reorder the text labels from the questions asking about suitability of the downtown location
casino.orig$Q7_A_StandAlone = reorder(casino.orig$Q7_A_StandAlone, casino$Q7_A_StandAlone)
casino.orig$Q7_A_Integrated = reorder(casino.orig$Q7_A_Integrated, casino$Q7_A_Integrated)
# Reshaping the downtown ratings data for graphing..
stand.and.integrated.ratings.downtown = cbind(prop.table(as.matrix(table(casino.orig$Q7_A_StandAlone)[1:5])),
prop.table(as.matrix(table(casino.orig$Q7_A_Integrated)[1:5])))
colnames(stand.and.integrated.ratings.downtown) = c("Standalone Casino","Integrated Entertainment Complex")
stand.and.integrated.ratings.downtown.long = melt(stand.and.integrated.ratings.downtown, varnames=c("Rating","Casino Type"), value.name="Percentage")
# Graphing ratings of casino suitability for the downtown location
ggplot(stand.and.integrated.ratings.downtown.long, aes(x=stand.and.integrated.ratings.downtown.long$"Casino Type", fill=Rating, y=Percentage,label=sprintf("%.02f %%", Percentage*100))) + geom_bar(stat="identity", position="dodge") + coord_flip() + ggtitle("Ratings of Casino Suitability \nin Downtown Toronto by Casino Type") + scale_x_discrete(name="") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent) + geom_text(aes(x=stand.and.integrated.ratings.downtown.long$"Casino Type", y=Percentage, ymax=Percentage, label=sprintf("%.01f%%",Percentage*100), hjust=.75),position = position_dodge(width=1),size=4) + scale_fill_few(palette="light") + theme_wsj()
# Reshaping the exhibition place ratings for graphing
stand.and.integrated.ratings.exhibition = cbind(prop.table(as.matrix(table(casino.orig$Q7_B_StandAlone)[2:6])),
prop.table(as.matrix(table(casino.orig$Q7_B_Integrated)[2:6])))
colnames(stand.and.integrated.ratings.exhibition) = c("Standalone Casino","Integrated Entertainment Complex")
stand.and.integrated.ratings.exhibition.long = melt(stand.and.integrated.ratings.exhibition, varnames=c("Rating","Casino Type"), value.name="Percentage")
# Reordering the rating text labels for the graphing.
stand.and.integrated.ratings.exhibition.long$Rating = factor(stand.and.integrated.ratings.exhibition.long$Rating, levels=levels(casino.orig$Q7_A_StandAlone)[1:5])
# Graphing ratings of casino suitability for the exhibition place location
ggplot(stand.and.integrated.ratings.exhibition.long, aes(x=stand.and.integrated.ratings.exhibition.long$"Casino Type", fill=Rating, y=Percentage,label=sprintf("%.02f %%", Percentage*100))) + geom_bar(stat="identity", position="dodge") + coord_flip() + ggtitle("Ratings of Casino Suitability \nat Exhibition Place by Casino Type") + scale_x_discrete(name="") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent) + geom_text(aes(x=stand.and.integrated.ratings.exhibition.long$"Casino Type", y=Percentage, ymax=Percentage, label=sprintf("%.01f%%",Percentage*100), hjust=.75), position = position_dodge(width=1),size=4) + scale_fill_few(palette="light") + theme_wsj()
# Reshaping the Port Lands ratings for graphing
stand.and.integrated.ratings.portlands = cbind(prop.table(as.matrix(table(casino.orig$Q7_C_StandAlone)[2:6])),
prop.table(as.matrix(table(casino.orig$Q7_C_Integrated)[2:6])))
colnames(stand.and.integrated.ratings.portlands) = c("Standalone Casino", "Integrated Entertainment Complex")
stand.and.integrated.ratings.portlands.long = melt(stand.and.integrated.ratings.portlands, varnames=c("Rating","Casino Type"), value.name="Percentage")
# Reordering the rating text labels for the graphing.
stand.and.integrated.ratings.portlands.long$Rating = factor(stand.and.integrated.ratings.portlands.long$Rating, levels=levels(casino.orig$Q7_A_StandAlone)[1:5])
# Graphing ratings of casino suitability for the port lands location
ggplot(stand.and.integrated.ratings.portlands.long, aes(x=stand.and.integrated.ratings.portlands.long$"Casino Type", fill=Rating, y=Percentage,label=sprintf("%.02f %%", Percentage*100))) + geom_bar(stat="identity", position="dodge") + coord_flip() + ggtitle("Ratings of Casino Suitability \nat Port Lands by Casino Type") + scale_x_discrete(name="") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent) + geom_text(aes(x=stand.and.integrated.ratings.portlands.long$"Casino Type", y=Percentage, ymax=Percentage, label=sprintf("%.01f%%",Percentage*100), hjust=.75), position = position_dodge(width=1),size=4) + scale_fill_few(palette="light") + theme_wsj()
# This was the part in my analysis where I looked at postal codes (FSAs really) and their coordinates
# Sorry I'm not more linear in how I do my analysis vs. write about it 🙂
# You'll notice that I've imported the geocode file as an ffdf. This led to faster merging with the
# original casino data set, which meant that I had to coerce the casino.orig data frame into ffdf format.
# But I work with ff every day at work, so I'm used to it by now, despite its idiosyncrasies.
casino.orig$PostalCode = toupper(casino.orig$PostalCode)
names(pcodes) = c("Postal","Lat","Long","City","Prov")
pcodes$FSA = as.ff(as.factor(toupper(substr(pcodes[,"Postal"], 1,3))))
casino.orig = as.ffdf(casino.orig)
casino.orig$PostalCode = as.ff(as.factor(toupper(casino.orig[,"PostalCode"])))
casino.orig = merge(casino.orig, pcodes, by.x="PostalCode", by.y="FSA", all.x=TRUE)
# This is the code for the full map I generated
casino.gc = casino.orig[which(!is.na(casino.orig[,"Lat"])),] # making sure only records with coordinates are included…
mymap = MapBackground(lat=casino.gc$Lat, lon=casino.gc$Long)
PlotOnStaticMap(mymap, casino.gc$Lat, casino.gc$Long, cex=1.5, pch=21, bg="orange")
# Here I'm getting a list of cities, winnowing it down, and using it to filter the
# geocode coordinates to zoom in on the map I generated.
cities = data.frame(table(casino.orig[,"City"]))
cities = cities[cities$Freq > 0,]
cities = cities[order(cities$Freq, decreasing=TRUE),]
cities = cities[cities$Var1 != '',]
cities.filter = cities[1:28,] # Here's my top cities variable (i set an arbitrary dividing line…)
names(cities.filter) = c("City","# Responses")
# Here's where I filtered the original casino ffdf so that it only contained the cities
# that I wanted to see in Southern Ontario
casino.top.so = casino.orig[which(casino.orig[,"City"] %in% cities.filter$Var1),]
# here's a transparency function that I used for the southern ontario map
addTrans <- function(color,trans)
{
# This function adds transparency to a color.
# Define transparency with an integer between 0 and 255:
# 0 being fully transparent and 255 being fully opaque.
# Works with either color and trans as vectors of equal length,
# or one of the two of length 1.
if (length(color)!=length(trans)&!any(c(length(color),length(trans))==1)) stop("Vector lengths not correct")
if (length(color)==1 & length(trans)>1) color <- rep(color,length(trans))
if (length(trans)==1 & length(color)>1) trans <- rep(trans,length(color))
num2hex <- function(x)
{
hex <- unlist(strsplit("0123456789ABCDEF",split=""))
return(paste(hex[(x-x%%16)/16+1],hex[x%%16+1],sep=""))
}
rgb <- rbind(col2rgb(color),trans)
res <- paste("#",apply(apply(rgb,2,num2hex),2,paste,collapse=""),sep="")
return(res)
}
# Finally here's the southern ontario map code
mymap = MapBackground(lat=casino.top.so$Lat, lon=casino.top.so$Long)
PlotOnStaticMap(mymap, casino.top.so$Lat, casino.top.so$Long, cex=1.5, pch=21, bg=addTrans("orange",10))
# Here's some code for summarizing and plotting the response data to the question
# around issues of importance regarding the new casino (question 3)
q3.summary = matrix(NA, 16,1,dimnames=list(c("Design of the facility",
"Employment opportunities","Entertainment and cultural activities",
"Expanded convention facilities", "Integration with surrounding areas",
"New hotel accommodations","Problem gambling & health concerns",
"Public safety and social concerns","Public space",
"Restaurants","Retail","Revenue for the City","Support for local businesses",
"Tourist attraction","Traffic concerns","Training and career development"),c("% Very Important")))
for (i in 8:23) {
q3.summary[i-7] = mean(casino[,i] == 3, na.rm=TRUE)}
q3.summary = as.data.frame(q3.summary[order(q3.summary[,1], decreasing = FALSE),])
names(q3.summary)[1] = "% Very Important"
q3.summary$Concern = rownames(q3.summary)
q3.summary = q3.summary[order(q3.summary$"% Very Important", decreasing=FALSE),]
q3.summary$Concern = factor(q3.summary$Concern, levels=q3.summary$Concern)
ggplot(q3.summary, aes(x=Concern, y=q3.summary$"% Very Important")) + geom_point(size=5, colour="forest green") + coord_flip() + ggtitle("Issues of Importance Surrounding\nthe New Casino") + scale_x_discrete(name="Issues of Importance") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent) + theme_wsj()
# This chunk of code deals with summarizing and plotting the questions surrounding
# what features people might want if a new Integrated Entertainment Complex is built
q7a.summary = matrix(NA, 9,1, dimnames=list(c("No Casino","Casino Only", "Convention Centre Space", "Cultural and Arts Facilities",
"Hotel","Nightclubs","Restaurants","Retail","Theatre"),c("% Include")))
for (i in 36:44) {
q7a.summary[i-35] = mean(casino[,i], na.rm=TRUE)}
q7a.summary = as.data.frame(q7a.summary[order(q7a.summary[,1], decreasing = FALSE),])
names(q7a.summary)[1] = "% Include"
q7a.summary$feature = rownames(q7a.summary)
q7a.summary$feature = factor(q7a.summary$feature, levels=q7a.summary$feature)
ggplot(q7a.summary, aes(x=feature, y=q7a.summary$"% Include")) + geom_point(size=5, colour="forest green") + coord_flip() + ggtitle("What People Would Want in an Integrated\nEntertainment Complex in Downtown Toronto") + scale_x_discrete(name="Features") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent,name="% Wanting the Feature") + theme_wsj()
q7b.summary = matrix(NA, 9,1, dimnames=list(c("No Casino","Casino Only", "Convention Centre Space", "Cultural and Arts Facilities",
"Hotel","Nightclubs","Restaurants","Retail","Theatre"),c("% Include")))
for (i in 52:60) {
q7b.summary[i-51] = mean(casino[,i], na.rm=TRUE)}
q7b.summary = as.data.frame(q7b.summary[order(q7b.summary[,1], decreasing = FALSE),])
names(q7b.summary)[1] = "% Include"
q7b.summary$feature = rownames(q7b.summary)
q7b.summary$feature = factor(q7b.summary$feature, levels=q7b.summary$feature)
ggplot(q7b.summary, aes(x=feature, y=q7b.summary$"% Include")) + geom_point(size=5, colour="forest green") + coord_flip() + ggtitle("What People Would Want in an Integrated\nEntertainment Complex at the Exhibition Place") + scale_x_discrete(name="Features") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent,name="% Wanting the Feature") + theme_wsj()
q7c.summary = matrix(NA, 9,1, dimnames=list(c("No Casino","Casino Only", "Convention Centre Space", "Cultural and Arts Facilities",
"Hotel","Nightclubs","Restaurants","Retail","Theatre"),c("% Include")))
for (i in 68:76) {
q7c.summary[i-67] = mean(casino[,i], na.rm=TRUE)}
q7c.summary = as.data.frame(q7c.summary[order(q7c.summary[,1], decreasing = FALSE),])
names(q7c.summary)[1] = "% Include"
q7c.summary$feature = rownames(q7c.summary)
q7c.summary$feature = factor(q7c.summary$feature, levels=q7c.summary$feature)
ggplot(q7c.summary, aes(x=feature, y=q7c.summary$"% Include")) + geom_point(size=5, colour="forest green") + coord_flip() + ggtitle("What People Would Want in an Integrated\nEntertainment Complex in Port Lands") + scale_x_discrete(name="Features") + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + scale_y_continuous(labels=percent,name="% Wanting the Feature") + theme_wsj()
# It sucks, but I imported yet another version of the casino dataset so that I wouldn't have to use
# the annoying ffdf indexing notation (e.g. df[,"variable1"])
casino.orig2 = read.csv("/home/inkhorn/Downloads/casino_survey_results20130325.csv")
# Finally, here's some code where I processed and plotted the Gender and Age demographic variables
casino$Gender = ifelse(!(casino.orig2$Gender %in% c("Female","Male","Transgendered")), "Did not disclose",
ifelse(casino.orig2$Gender == "Female","Female",
ifelse(casino.orig2$Gender == "Male","Male","Transgendered")))
casino$Gender = factor(casino$Gender, levels=c("Transgendered","Did not disclose","Female","Male"))
ggplot(casino, aes(x=Gender,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("Gender Distribution of Respondents") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)),
geom="text") + scale_y_continuous(labels=percent)
casino$Age = ifelse(casino.orig2$Age == "", "Did not disclose",
ifelse(casino.orig2$Age == "Under 15", "Under 15",
ifelse(casino.orig2$Age == "15-24", "15-24",
ifelse(casino.orig2$Age == "25-34", "25-34",
ifelse(casino.orig2$Age == "35-44", "35-44",
ifelse(casino.orig2$Age == "45-54","45-54",
ifelse(casino.orig2$Age == "55-64","55-64",
ifelse(casino.orig2$Age == "65 or older", "65 or older","Did not disclose"))))))))
casino$Age = factor(casino$Age, levels=c("Did not disclose","Under 15","15-24","25-34","35-44","45-54","55-64","65 or older"))
ggplot(casino, aes(x=Age,y=..count../sum(..count..))) + geom_bar(fill="forest green") + coord_flip() + ggtitle("Age Distribution of Respondents") + scale_x_discrete(name="") + theme_wsj() + theme(title=element_text(size=22),plot.title=element_text(hjust=.8)) + stat_bin(aes(label = sprintf("%.02f %%", ..count../sum(..count..)*100)), geom="text") + scale_y_continuous(labels=percent)

ggplot2: Creating a custom plot with two different geoms

This past week for work I had to create some plots to show the max, min, and median of a measure across the levels of a qualitative variable, and show the max and min of the same variable within a subgroup of the dataset.  To illustrate what I mean, I took a fun dataset from the data and story library and recreated the plot that I made at work.
The data are supposed to represent the results of a study done on how long it took subjects to complete a pen and paper maze while they were either smelling a floral scent or not.  Other than the time to completion (the main variable of interest), some other variables are recorded, like sex and whether or not the subject is a smoker.
I represent the min, max, and median completion time that it took men and women to complete the maze with the “crossbar” geom, which shows obvious overlap in performance across the genders (however, it’s apparent that men were a bit more variable in their performance than women). The seemingly interesting result is represented by the dots. The dots show the min and max values for the smokers within each gender. While smoking doesn’t appear to have an effect on performance amongst the women, it seems to be correlated with much slower performance amongst the men in the study. Then again, I just re-checked the data, which show that there are only 2 male smokers and 6 female smokers, so the comparison seems pretty unreliable.
Stepping away from the study itself, what I like here is that you can call up several geoms in the same plot, passing different data subsets to each geom. It allows for useful customization so that you can tackle problems that aren’t so cut and dried.
Finally, I realize that before putting this kind of graph into a report, I really should create a little legend that shows the reader that the ends of the boxes are the max and min, the middle lines represent the medians, and the dots represent the max and min of the subset (one way to coax that legend out of ggplot is sketched after the code below).
Following is the code and the plot I created:


library(plyr)      # for ddply
library(reshape2)  # for melt
library(ggplot2)
scents = read.table("clipboard",header=TRUE,sep="\t")
strial3.by.sex.wide = ddply(scents, 'Sex', function (x) quantile(x$S.Trial.3, c(0,.5,1), na.rm=TRUE))
strial3.by.sex.smokers = melt(ddply(subset(scents,Smoker == "Y") , 'Sex', function (x) quantile(x$S.Trial.3, c(0,1), na.rm=TRUE)),variable.name="Percentile",value.name="Time")
ggplot() + geom_crossbar(data=strial3.by.sex.wide, aes(x=Sex, y=`50%`, ymin=`0%`, ymax=`100%`),fill="#bcc927",width=.75) +
geom_point(data=strial3.by.sex.smokers, aes(x=Sex, y=Time), size=3) +
theme(legend.title = element_text(size=10, face="bold"), legend.text = element_text(size=10),
axis.text.x=element_text(size=10), axis.text.y=element_text(size=10,hjust=1), axis.title.x=element_text(size=12,face="bold"), axis.title.y=element_text(size=12, angle=90, face="bold")) +
scale_y_continuous(name="Time to Completion")

Are scatterplots too complex for lay folks?

Usually, I like to write about the solutions to problems I’ve had, but today I only have a problem to write about.

This is the second research job I’ve had outside of academia, and in both cases I’ve met with resistance when I’ve tried to display bivariate relations using a scatterplot. For example, a colleague came past my work computer yesterday while I had a scatterplot with a linear trend line on screen. She looked at the plot and blurted out, “Wow, that looks complicated!”

One option that I’ve tried in the past to display bivariate relations where I’d normally use a scatterplot is to discretize, or bin, the x variable and show the average y value for each level of the binned x variable (a quick sketch of what I mean is below). This can certainly help to show the direction of the relation, but it hides the number of data points that go into the averages. I like being able to see the diagonal, stick-like orientation of a dot cloud signalling a strong correlation, or the loose, circular orientation that signals no apparent correlation.
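
Here’s a quick sketch of the two versions side by side, using made-up toy data purely to illustrate (none of this comes from a real dataset):

library(ggplot2)
# Toy data standing in for any bivariate relation I'd normally show as a scatterplot
set.seed(1)
df = data.frame(x = rnorm(200))
df$y = 0.6 * df$x + rnorm(200)
# The "lay friendly" version: bin x into quintiles and plot the mean of y per bin
df$x.bin = cut(df$x, breaks = quantile(df$x, probs = seq(0, 1, .2)), include.lowest = TRUE)
bin.means = aggregate(y ~ x.bin, data = df, FUN = mean)
ggplot(bin.means, aes(x = x.bin, y = y)) + geom_bar(stat = "identity")
# Versus the scatterplot with a linear trend line that gets called "complicated"
ggplot(df, aes(x = x, y = y)) + geom_point() + geom_smooth(method = "lm")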

Do I have to say goodbye to the scatterplot now that I’m outside of academia? Will I ever be able to fruitfully use it again to communicate results to lay folks? Are the bar graph and line graph my only viable tools for data visualization now? Is there something I’m missing that could help people see scatterplots as helpful representations of bivariate relations? I’d appreciate answers to any of my questions.