Which Torontonians Want a Casino? Survey Analysis Part 2

In my last post I said that I would try to investigate the question of who actually does want a casino, and whether place of residence is a factor in where they want the casino to be built.  So, here goes something:

The first line of attack in this blog post is to distinguish between people based on their responses to the third question on the survey, the one asking people to rate the importance of a long list of issues.  When I looked at this list originally, I knew that I would want to reduce the dimensionality using PCA.

# Principal components analysis (varimax-rotated) on the 16 issue-importance items
library(psych)
issues.pca = principal(casino[,8:23], nfactors=3, rotate="varimax", scores=TRUE)
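If you want a data-driven check on the choice of three components, psych's parallel analysis is one option; a quick sketch:

# Scree / parallel analysis to sanity-check the number of components to retain
fa.parallel(casino[,8:23], fa="pc")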

The PCA resulted in the three components listed in the table below.  The variables loading onto the first component seemed to relate to the casino being a big attraction with lots of features, so I named it “Go Big or Go Home”.  The variables loading onto the second component related to technical details, while the third component picked up variables dealing with social or environmental issues.

       Go Big or   Technical   Soc./Env.
       Go Home     Details     Issues      Issue/Concern
Q3_A     0.181       0.751                 Design of the facility
Q3_B     0.366       0.738                 Employment opportunities
Q3_C     0.440       0.659                 Entertainment and cultural activities
Q3_D     0.695       0.361                 Expanded convention facilities
Q3_E     0.701       0.346                 Integration with surrounding areas
Q3_F     0.808       0.266                 New hotel accommodations
Q3_G    -0.117                   0.885     Problem gambling & health concerns
Q3_H                             0.904     Public safety and social concerns
Q3_I     0.254       0.716                 Public space
Q3_J     0.864       0.218                 Restaurants
Q3_K     0.877       0.157                 Retail
Q3_L     0.423       0.676      -0.100     Revenue for the city
Q3_M     0.218       0.703       0.227     Support for local businesses
Q3_N     0.647       0.487      -0.221     Tourist attraction
Q3_O     0.118                   0.731     Traffic concerns
Q3_P     0.497       0.536       0.124     Training and career development

(Blank cells are loadings too small to be displayed.)
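By the way, if you want to print a loadings table like this at the console, the print method for loadings can hide small values and sort the items by component; something like:

print(issues.pca$loadings, cutoff=0.1, sort=TRUE)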

Once I was satisfied that I had a decent understanding of what the PCA was telling me, I loaded the component scores into the original dataframe.

# Append the three component scores to the survey dataframe
casino[,110:112] = issues.pca$scores
names(casino)[110:112] = c("GoBigorGoHome","TechnicalDetails","Soc.Env.Issues")

In order to investigate the question of who wants a casino and where, I decided to use question 6 as a dependent variable (the one asking where they would want it built, if one were to be built) and the PCA components as independent variables.  This is a good question to use, because the answer options, if you remember, are “Toronto”, “Adjacent Municipality” and “Neither”.  My approach was to model each response individually using logistic regression.

# Treat blank Q6 responses as missing, then fit one logistic regression per response
casino$Q6[casino$Q6 == ""] = NA
casino$Q6 = factor(casino$Q6, levels=c("Adjacent Municipality","City of Toronto","Neither"))

adj.mun = glm(casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
toronto = glm(casino$Q6 == "City of Toronto" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
neither = glm(casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
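As an aside, since the three answer options are mutually exclusive, the same question could also be posed as a single multinomial model rather than three separate binary ones; a minimal sketch, assuming the nnet package:

library(nnet)
q6.multinom = multinom(Q6 ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino)
summary(q6.multinom)

I stuck with the separate binary GLMs, though, because each summary is easy to read on its own.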

Following are the summaries of each GLM:
Toronto:

Call:
glm(formula = casino$Q6 == "City of Toronto" ~ GoBigorGoHome +
    TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
    data = casino)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6426  -0.4745  -0.1156   0.4236   3.4835

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      -1.58707    0.04234  -37.48   <2e-16 ***
GoBigorGoHome     1.76021    0.03765   46.75   <2e-16 ***
TechnicalDetails  1.77155    0.05173   34.24   <2e-16 ***
Soc.Env.Issues   -1.63057    0.04262  -38.26   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13537  on 10365  degrees of freedom
Residual deviance:  6818  on 10362  degrees of freedom
  (7400 observations deleted due to missingness)
AIC: 6826

Number of Fisher Scoring iterations: 6

Adjacent municipality:

Call:
glm(formula = casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome +
    TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
    data = casino)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0633  -0.7248  -0.5722  -0.3264   2.7136

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      -1.45398    0.02673 -54.394  < 2e-16 ***
GoBigorGoHome    -0.41989    0.02586 -16.239  < 2e-16 ***
TechnicalDetails  0.18764    0.02612   7.183 6.82e-13 ***
Soc.Env.Issues    0.52325    0.03221  16.243  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10431.8  on 10365  degrees of freedom
Residual deviance:  9756.4  on 10362  degrees of freedom
  (7400 observations deleted due to missingness)
AIC: 9764.4

Number of Fisher Scoring iterations: 5

Neither location:

Call:
glm(formula = casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails +
    Soc.Env.Issues, family = binomial(logit), data = casino)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4090  -0.7344  -0.3934   0.8966   2.7194

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      -0.22987    0.02415  -9.517   <2e-16 ***
GoBigorGoHome    -0.85050    0.02462 -34.549   <2e-16 ***
TechnicalDetails -1.00182    0.02737 -36.597   <2e-16 ***
Soc.Env.Issues    0.69707    0.02584  26.972   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 14215  on 10365  degrees of freedom
Residual deviance: 10557  on 10362  degrees of freedom
  (7400 observations deleted due to missingness)
AIC: 10565

Number of Fisher Scoring iterations: 4

And here is a quick summary of the above GLM information:
[Table: Summary of Casino GLMs]

Judging from these results, it looks like those who want a casino in Toronto don’t focus on the big social/environmental issues surrounding the casino, but do focus on the flashy and non-flashy details and benefits alike.  Those who want a casino outside of Toronto do care about the social/environmental issues, don’t care as much about the flashy details, but do have a focus on some of the non-flashy details.  Finally, those not wanting a casino in either location care about the social/environmental issues, but don’t care about any of the details.
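If odds ratios are easier for you to read than log-odds coefficients, exponentiating the estimates (here, for the Toronto model) is a quick translation:

# Odds ratios with Wald confidence intervals for the Toronto model
exp(cbind(OR = coef(toronto), confint.default(toronto)))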

Here’s where the issue of location comes into play.  When I look at the summary for the GLM that predicts who wants a casino in an adjacent municipality, I get the feeling that it’s picking up people living in the downtown core who just don’t think the area can handle a casino.  In other words, I think there might be a “not in my backyard!” effect.

The first inkling that this might be the case comes from an article from the Martin Prosperity Institute (MPI), which analyzed the same data set and produced a very nice-looking heat map of the responses to the first question on the survey, asking people how they feel about having a new casino in Toronto.  From that map, it does look like people in downtown Toronto feel pretty negative about a new casino, whereas those in the far east and west of Toronto feel better about it.

My next piece of evidence comes from the cities uncovered by geocoding the responses in the data set.  I decided to create a very simple indicator variable, distinguishing those for whom the “City” is Toronto from those for whom the city is anything else.  I like this better than the MPI analysis, because it looks at people’s attitudes towards a casino both inside and outside of Toronto (rather than towards the concept of a new casino in Toronto).  If there really is a “not in my backyard!” effect, I would expect to see evidence that those in Toronto are more disposed towards a casino in an adjacent municipality, and that those from outside of Toronto are more disposed towards a casino inside Toronto!  Here we go:


# ff keeps the big geocoded dataframe on disk; the rest are for the chart below
library(ff)
library(ggplot2)
library(ggthemes)
library(reshape2)
library(scales)

# Reload the stored ff objects, then flag respondents living outside the City of Toronto
ffload(file="casino", overwrite=TRUE)
casino.orig$Outside.of.Toronto = as.ff(ifelse(casino.orig[,"City"] == "Toronto", 0, 1))

# Predict each Q6 response from city of residence
casino.in.toronto = glm(casino.orig[,"Q6"] == "City of Toronto" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
casino.outside.toronto = glm(casino.orig[,"Q6"] == "Adjacent Municipality" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
summary(casino.in.toronto)
Call:
glm(formula = casino.orig[, "Q6"] == "City of Toronto" ~ Outside.of.Toronto,
    family = binomial(logit), data = casino.orig)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9132  -0.9132  -0.7205   1.4669   1.7179

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)        -1.21605    0.02600  -46.77   <2e-16 ***
Outside.of.Toronto  0.55712    0.03855   14.45   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16278  on 13881  degrees of freedom
Residual deviance: 16070  on 13880  degrees of freedom
  (3884 observations deleted due to missingness)
AIC: 16074

Number of Fisher Scoring iterations: 4
————————————————–
summary(casino.outside.toronto)
Call:
glm(formula = casino.orig[, "Q6"] == "Adjacent Municipality" ~
    Outside.of.Toronto, family = binomial(logit), data = casino.orig)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7280  -0.7280  -0.5554  -0.5554   1.9726

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)        -1.19254    0.02583  -46.16   <2e-16 ***
Outside.of.Toronto -0.59879    0.04641  -12.90   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13786  on 13881  degrees of freedom
Residual deviance: 13611  on 13880  degrees of freedom
  (3884 observations deleted due to missingness)
AIC: 13615

Number of Fisher Scoring iterations: 4
——————————————-
# Cross-tab Q6 response by residence, convert to within-row proportions, and plot
casino.loc.by.city.res = as.matrix(table(casino.orig[,"Outside.of.Toronto"], casino.orig[,"Q6"])[,2:4])
rownames(casino.loc.by.city.res) = c("Those Living Inside the City of Toronto", "Those Living Outside the City of Toronto")
casino.loc.by.city.res = melt(prop.table(casino.loc.by.city.res, 1))
names(casino.loc.by.city.res) = c("City of Residence", "Where to Build the Casino", "value")
ggplot(casino.loc.by.city.res, aes(x=`Where to Build the Casino`, y=value, fill=`City of Residence`)) +
  geom_bar(position="dodge", stat="identity") +
  scale_y_continuous(labels=percent) +
  theme_wsj() +
  scale_fill_discrete(name="City of Residence") +
  ggtitle("If a casino is built, where would you prefer to see it located?")


[Figure: Where to build the casino, by city of residence]

As you can see here, those from outside of Toronto are more likely to suggest building a casino in Toronto than those from inside the city, and less likely to suggest building one in an adjacent municipality (with the reverse being true of those living inside Toronto).

That being said, when you do the comparison within city of residence (instead of across it, like I just did), those from inside Toronto seem equally likely to suggest that the casino be built in or outside of the city, whereas those outside are much more likely to suggest building the casino inside Toronto than outside.  So, depending on how you view this graph, you might only say there’s evidence for a “not in my backyard!” effect for those living outside of Toronto.

As a final note, I’ll remind you that although these analyses point to which Torontonians do want a new casino, the fact remains that about 71% of respondents to this survey are unsupportive of a casino in Toronto, and 53% don’t want a casino built in either Toronto or an adjacent municipality.  I really have to wonder if they’re still going to go ahead with it!

Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees

Now that I’m on my winter break, I’ve been taking a little bit of time to read up on some modelling techniques that I’ve never used before. Two such techniques are Random Forests and Conditional Trees.  Since both can be used for classification, I decided to see how they compare against a simple binomial logistic regression (something I’ve worked with a lot) for binary classification.

The dataset I used contains records of the survival of Titanic Passengers and such information as sex, age, fare each person paid, number of parents/children aboard, number of siblings or spouses aboard, passenger class and other fields (The titanic dataset can be retrieved from a page on Vanderbilt’s website replete with lots of datasets; look for “titanic3”).

I took one part of the dataset to train my models, and another part to test them.  The factors that I focused on were passenger class, sex, age, and number of siblings/spouses aboard.

First, let’s look at the GLM:

titanic.survival.train = glm(survived ~ pclass + sex + pclass:sex + age + sibsp,
family = binomial(logit), data = titanic.train)

As you can see, I worked in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women than amongst the men (i.e. higher-class women were much more likely to survive than lower-class women, whereas first-class men were more likely to survive than 2nd or 3rd class men, but not by the same margin as amongst the women).  Following is the model summary output, if you’re interested:

> summary(titanic.survival.train)

Call:
glm(formula = survived ~ pclass + sex + pclass:sex + age + sibsp, 
    family = binomial(logit), data = titanic.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9342  -0.6851  -0.5481   0.5633   2.3164  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     6.556807   0.878331   7.465 8.33e-14 ***
pclass         -1.928538   0.278324  -6.929 4.24e-12 ***
sexmale        -4.905710   0.785142  -6.248 4.15e-10 ***
age            -0.036462   0.009627  -3.787 0.000152 ***
sibsp          -0.264190   0.106684  -2.476 0.013272 *  
pclass:sexmale  1.124111   0.299638   3.752 0.000176 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 858.22  on 649  degrees of freedom
Residual deviance: 618.29  on 644  degrees of freedom
AIC: 630.29

Number of Fisher Scoring iterations: 5
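As for the interaction I mentioned above, a quick two-way table of survival rates by class and sex makes the pattern visible:

> with(titanic.train, tapply(survived, list(pclass, sex), mean))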

So, after I used my model to predict survival probabilities on the testing portion of the dataset, I checked to see how many records showed a probability of over .5 (or 50%), and then how many of those records were actual survivors.  For the GLM, 146/164 (89%) of those records scored at 50% or higher were actual survivors.  Not bad!
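If you’re curious, that scoring step might look something like the following (I’m assuming here that the holdout portion is named titanic.test):

> glm.probs = predict(titanic.survival.train, newdata=titanic.test, type="response")
> table(predicted = glm.probs > .5, actual = titanic.test$survived)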

Now let’s move on to Random Forests, via the randomForest package:

> library(randomForest)
> titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train, ntree=5000, importance=TRUE)

> titanic.survival.train.rf

Call:
 randomForest(formula = as.factor(survived) ~ pclass + sex + age +      sibsp, data = titanic.train, ntree = 5000, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 5000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 22.62%
Confusion matrix:
    0   1 class.error
0 370  38  0.09313725
1 109 133  0.45041322

> importance(titanic.survival.train.rf)
               0          1 MeanDecreaseAccuracy MeanDecreaseGini
pclass  67.26795 125.166721            126.40379         34.69266
sex    160.52060 221.803515            224.89038         62.82490
age     70.35831  50.568619             92.67281         53.41834
sibsp   60.84056   3.343251             52.82503         14.01936

It seems to me that the output indicates that the Random Forests model is better at identifying non-survivors (true negatives) than survivors (true positives).  But when I asked for the predicted survival categories in the testing portion of my dataset, it appeared to do a pretty decent job predicting who would survive and who wouldn’t:

For the Random Forests model, 155/184 (84%) of those records predicted to survive actually did survive!  Again, not bad.
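For reference, pulling the predicted categories and checking them against reality goes roughly like this (titanic.test assumed, as before):

> rf.pred = predict(titanic.survival.train.rf, newdata=titanic.test)
> table(predicted = rf.pred, actual = titanic.test$survived)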

Finally, let’s move on to the Conditional Tree model, from the party package:

> library(party)
> titanic.survival.train.ctree = ctree(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train)

> titanic.survival.train.ctree

	 Conditional inference tree with 7 terminal nodes

Response:  as.factor(survived) 
Inputs:  pclass, sex, age, sibsp 
Number of observations:  650 

1) sex == {female}; criterion = 1, statistic = 141.436
  2) pclass <= 2; criterion = 1, statistic = 55.831
    3) pclass <= 1; criterion = 0.988, statistic = 8.817
      4)*  weights = 74
    3) pclass > 1
      5)*  weights = 47
  2) pclass > 2
    6)*  weights = 105
1) sex == {male}
  7) pclass <= 1; criterion = 1, statistic = 15.095
    8)*  weights = 88
  7) pclass > 1
    9) age <= 12; criterion = 1, statistic = 14.851
      10) sibsp <= 1; criterion = 0.998, statistic = 12.062
        11)*  weights = 18
      10) sibsp > 1
        12)*  weights = 12
    9) age > 12
      13)*  weights = 306

[Figure: Titanic conditional tree]

I really like the graph you get from plotting the conditional tree model; I find it pretty easy to understand.  As you can see, the model first split the data on sex, which it found to be most significant, then on pclass, age, and finally siblings/spouses.  That being said, let’s look at how the prediction went for comparison purposes:
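A sketch of the plot call and that prediction check (again assuming titanic.test):

> plot(titanic.survival.train.ctree)
> ctree.pred = predict(titanic.survival.train.ctree, newdata=titanic.test)
> table(predicted = ctree.pred, actual = titanic.test$survived)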

134/142 (94%) of records predicted to survive actually did survive!  Great!!  Now let’s bring all those prediction stats into one table for a final look:

                     glm randomForests ctree
Actually Survived    146           155   134
Predicted to Survive 164           184   142
% Survived           89%           84%   94%

So, in terms of finding the highest raw number of folks who actually did survive, the Random Forests model won, but in terms of having the highest true positive rate, the Conditional Tree model takes the cake!  Neat exercise!

Notes:

(1) There were numerous records without an age value.  I estimated ages using a regression model taken from those who did have ages.  Following are the coefficients for that model:

44.59238 + (-5.98582*pclass) + (.15971*fare) + (-.14141*pclass*fare)
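In code, that imputation might have looked roughly like this (the object names are my assumption):

age.model = lm(age ~ pclass * fare, data = titanic[!is.na(titanic$age),])
missing.age = is.na(titanic$age)
titanic$age[missing.age] = predict(age.model, newdata = titanic[missing.age,])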

(2) Although I used GLM scores of over .5 to classify records as survived, I could have played with that cutoff to get different results.  I really just wanted to have some fun here, and not try all the possibilities. :)
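A sketch of what playing with that cutoff could look like, reusing the glm.probs scores from above:

> for (cut in c(.4, .5, .6, .7)) {
+   pred = glm.probs > cut
+   cat("cutoff:", cut, " precision:", round(mean(titanic.test$survived[pred] == 1), 3), "\n")
+ }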

(3) I hope you enjoyed this post!  If you have anything to teach me about Random Forests or Conditional Trees, please leave a comment and be nice (I’m trying to expand my analysis skills here!!).

EDIT:
I realized that I should also post, for each model, the proportion of predicted non-survivors who actually did survive (what I’m loosely calling a false negative rate below), because it’s important to know how many survivors each model missed. So, here are the stats:

                           glm    rf ctree   gbm
Actually Survived          112   103   124    74
Predicted Not to Survive   495   475   517   429
False Negative Rate      22.6% 21.7%   24% 17.2%

As you can see, the gbm model shows the lowest false negative rate, which is pretty nice! Compare that against the finding that 184/230 (80%) of the records it scored with a probability of 50% or higher actually survived, and it makes for a pretty decent model.
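Since the gbm model only shows up in this edit, here’s roughly what fitting one looks like; the tuning values below are illustrative placeholders, not necessarily what I used:

> library(gbm)
> titanic.survival.train.gbm = gbm(survived ~ pclass + sex + age + sibsp,
+     distribution = "bernoulli", data = titanic.train,
+     n.trees = 3000, interaction.depth = 2, shrinkage = 0.01)
> gbm.probs = predict(titanic.survival.train.gbm, newdata = titanic.test,
+     n.trees = 3000, type = "response")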

Memory Management in R, and SOAR

The more I’ve worked with my really large data set, the more cumbersome the work has become for my work computer, and keep in mind I’ve got a quad core with 8 gigs of RAM.  With growing irritation at how slow my computer becomes at times while working with these data, I took to finding better ways of managing memory in R.

The best/easiest solution I’ve found so far is a package called SOAR.  To put it simply, it allows you to store specific R objects (data frames being the most important, for me) as RData files on your hard drive, and gives you the ability to “analyze” them in R without having them loaded into your RAM.  I put “analyze” in quotes because every time I try to add variables to a data frame that I’ve stored, the data frame comes back into RAM and once again slows me down.

An example might suffice:

> r = data.frame(a=rnorm(10,2,.5), b=rnorm(10,3,.5))
> r
          a        b
1  1.914092 3.074571
2  2.694049 3.479486
3  1.684653 3.491395
4  1.318480 3.816738
5  2.025016 3.107468
6  1.851811 3.708318
7  2.767788 2.636712
8  1.952930 3.164896
9  2.658366 3.973425
10 1.809752 2.599830
> library(SOAR)
> Sys.setenv(R_LOCAL_CACHE="testsession")
> ls()
[1] "r"
> Store(r)
> ls()
character(0)
> mean(r[,1])
[1] 2.067694
> r$c = rnorm(10,4,.5)
> ls()
[1] "r"

So, the first thing I did was to make a data frame with some columns, which got stored in my workspace, and thus loaded into RAM.  Then, I initialized the SOAR library, and set my local cache to “testsession”.  The practical implication of that is that a directory gets created within the current directory that R is working out of (in my case, “/home/inkhorn/testsession”), and that any objects passed to the Store command get saved as RData files in that directory.

Sure enough, you see my workspace before and after I store the r object.  Now you see the object, now you don’t!  But then, as I show, even though the object is not in the workspace, you can still analyze it (in my case, calculate a mean from one of the columns).  However, as soon as I try to make a new column in the data frame… voila … it’s back in my workspace, and thus RAM!

So, unless I’m missing something about how the package is used, it doesn’t function exactly as I would like, but it’s still an improvement.  Every time I’m done making new columns in the data frame, I just pass the object to the Store command, and away it goes to the hard disk, out of my RAM.  It’s quite liberating not having a stupendously heavy workspace: with a heavy one, leaving or entering R takes forever while the workspace is saved or loaded, but with the heavy stuff sitting on the hard disk, leaving and entering R go by a lot faster.

Another thing I noticed is that keeping the GLMs I’ve generated in my workspace also takes up a lot of RAM and slows things down.  So, with the main dataframe written to disk and the GLMs kept out of memory, R is flying again!
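So my working pattern now looks roughly like this (a sketch with made-up object names):

# Fit, then immediately banish the heavy objects to the disk cache
big.glm = glm(Q6 == "Neither" ~ GoBigorGoHome, data = casino, family = binomial(logit))
Store(big.glm)                         # the model now lives on disk, not in RAM
casino$new.col = rnorm(nrow(casino))   # adding a column pulls the data frame back into RAM...
Store(casino)                          # ...so Store it again once the new columns are in place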