Predicting Mobile Phone Prices

Recently a colleague of mine showed me a nauseating interactive scatterplot that plots mobile phones along two dimensions of the user’s choice from a list of possible dimensions.  Although the interactive visualization was offensive to my tastes, the JSON data behind it was intriguing.  It was easy enough to get the data (see this link if you want an up-to-date copy, and be sure to take out the “data=” from the start of the file; I pulled this data around noon on March 23rd) so that I could start asking a simple question: which of the available factors provided in the dataset were the most predictive of full mobile phone price?

I’ll present the graphs and the predictive model first, and the code later on:

Price by OS and Brand:

Often when investigating a topic using data, we confirm things that we already knew to be true.  This is certainly the case here with price by OS and brand.  From the below boxplots we see that the bulk of iOS devices tend to be the most expensive, and that brand-wise Apple, Google, and Samsung seem to stick out.

Mobile Phone Price by Operating System

Mobile Phone Prices by Brand

Price by Storage Capacity, RAM, and SD Card Capacity:

Storage capacity is perhaps the least surprising to find as having such a sharply positive correlation with price. What is more surprising to me is that there aren’t more gradations of storage capacity in the higher range past 50 gigabytes.  I’m guessing this is because the bulk of these phones (bearing in mind roughly 90% of these phones are in fact smart phones) are geared towards lower-income folks.  Can you guess which phones occupy the top-right-most position on the first graph?  If your answer involved the iPhone 6 then you’re right on two counts!

As you can see, the correlation between RAM and price is pretty linear (with phones costing $171.54 more for each additional gigabyte of RAM) and that between SD Card capacity and price is linear past the large group of phones with 0 SD Card capacity (with phones costing $3.64 more for each additional gigabyte of SD Card Capacity).
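
Slope figures like these can be pulled from simple linear fits; here’s a minimal sketch of how they could be reproduced (not the original code, and it assumes the cp data frame built in the code at the end of this post):

# Sketch: simple linear fits of price on RAM and on SD card capacity
ram.fit = lm(price ~ att_ram, data=cp)
sd.fit = lm(price ~ att_sd_card, data=cp)
coef(ram.fit)["att_ram"]       # dollars of price per additional GB of RAM
coef(sd.fit)["att_sd_card"]    # dollars of price per additional GB of SD card capacity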

Price by Storage Capacity

Price by RAM

Price by SD Card

Price by Screen Size, Battery, and Weight:

The next factors that I think one would naturally think of when considering the price of a mobile phone are all related to how big the thing is. Smart phones these days have a lot of physical presence just by dint of their screen size alone. Add to the large screen size the batteries that are used to support such generous displays and you also get an impressive variety of weights to these phones.

In fact, for every additional inch of screen size on these phones, you can expect an additional 0.81504 ounces of weight and 565.11 mAh of battery capacity. My own humble little smartphone (an HTC Desire 601) happens to be on the smaller and lighter side of the spectrum as far as screen size and weight go (4.5 inch screen, or 33rd percentile; 4.59 ounces, or 26th percentile) but happens to have a pretty generous battery capacity all things considered (2100 mAh, or 56th percentile).
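
Slopes and percentiles like these can be computed along the same lines; a minimal sketch (again assuming the cp data frame from the code at the end of the post, not the original code):

# Weight and battery capacity gained per extra inch of screen
coef(lm(att_weight ~ att_screen_size, data=cp))["att_screen_size"]
coef(lm(att_battery_mah ~ att_screen_size, data=cp))["att_screen_size"]
# Percentiles of my own phone's specs via the empirical CDF
ecdf(cp$att_screen_size)(4.5)
ecdf(cp$att_weight)(4.59)
ecdf(cp$att_battery_mah)(2100)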

While positive correlations can be seen between price and all three of these factors, battery was the most strongly correlated with price, followed by screen size and then weight.  There’s obviously a lot of variability in price when you look at the phones with the bigger screen sizes, as they probably tend to come packed with a variety of premium extra features that can be used to jack up the price.
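
A quick way to check that ordering (again assuming the cp data frame built in the code at the end of the post) is just a few correlations:

# Pairwise correlations of price with battery, screen size, and weight
cor(cp$price, cp$att_battery_mah, use="complete.obs")
cor(cp$price, cp$att_screen_size, use="complete.obs")
cor(cp$price, cp$att_weight, use="complete.obs")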

Price by Screen Size

Price by Battery

Price by Weight

Putting it all together in a model:
Finally, let’s lump all of the factors provided in the data set into a model, and see how well it performs on a testing sample. I decided on an 80/20 training/testing split, and am of course using Max Kuhn’s fabulous caret package to do the dirty work. I ran a gbm model, shown below, and managed to get an R squared of 60.4% in the training sample, so not bad.

Stochastic Gradient Boosting 

257 samples
 23 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 173, 173, 171, 171, 172, 171, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE      Rsquared   RMSE SD   Rsquared SD
  1                   50      150.1219  0.5441107  45.36781  0.1546993  
  1                  100      147.5400  0.5676971  46.03555  0.1528225  
  1                  150      146.3710  0.5803005  45.00296  0.1575795  
  2                   50      144.0657  0.5927624  45.46212  0.1736994  
  2                  100      143.7181  0.6036983  44.80662  0.1787351  
  2                  150      143.4850  0.6041207  45.57357  0.1760428  
  3                   50      148.4914  0.5729182  45.27579  0.1903465  
  3                  100      148.5363  0.5735842  43.41793  0.1746064  
  3                  150      148.8497  0.5785677  43.39338  0.1781990  

Tuning parameter 'shrinkage' was held constant at a value of 0.1
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were n.trees = 150, interaction.depth = 2 and shrinkage = 0.1.

Now let’s look at the terms that came out as the most significant in the chosen model.  Below we see some unsurprising findings! Storage, battery, weight, RAM, and whether or not the phone uses iOS as the top 5. I guess I’m surprised that screen size was not higher up in the priority list, but at least it got in 6th place!

gbm variable importance

  only 20 most important variables shown (out of 41)

                  Overall
att_storage      100.0000
att_battery_mah   59.7597
att_weight        46.5410
att_ram           27.5871
att_osiOS         26.9977
att_screen_size   21.1106
att_sd_card       20.1130
att_brandSamsung   9.1220

Finally, let’s look at how our model did in the testing sample. Below I’ve shown you a plot of actual versus predicted price values. The straight line is what we would expect to see if there were a perfect correlation between the two (obviously not!!) while the smoothed line is the trend that we actually do see in the scatter plot. Considering the high R squared in the testing sample of 57% (not too far off from the training sample) it’s of course a nice confirmation of the utility of this model to see the smooth line following that perfect prediction line, but I won’t be calling up Rogers Wireless with the magical model just yet!

Price by Predicted Price

In fact, before I close off this post, it would be remiss of me not to investigate a couple of cases in this final graph that look like outliers. The one on the bottom right, and the one on the top left.

The one on the bottom right happens to be a Sony Xperia Z3v Black with 32GB of storage space. What I learned from checking into this is that since the pricing data on the source website is pulled from amazon.com, sometimes instead of pulling the full regular price, it happens to pull the data on a day when a special sale or service agreement price is listed. When I pulled the data, the Xperia was listed at a price of $29.99. Today, on April 6th, the price that you would get if you looked it up through the source website is .99! Interestingly, my model had predicted a full price of $632.17, which was not very far off from the full price of $599.99 that you can see if you go on the listing on amazon.com. Not bad!

Now, how about the phone that cost so much but that the model said shouldn’t? This phone was none other than the Black LG 3960 Google Nexus 4 Unlocked GSM Phone with 16GB of Storage space. The price I pulled that day was a whopping $699.99 but the model only predicted a price of $241.86! Considering the specs on this phone, the only features that really seem to measure up are the storage (16GB is roughly in the 85th percentile for smart phones) and the RAM (2 GB is roughly in the 93rd percentile for smart phones). Overall though, the model can’t account for any other qualities that Google might have imbued into this phone that were not measured by the source website. Hence, this is a formidable model outlier!

If you take out the Sony Xperia that I mentioned first, the Adjusted R squared value goes up from 57% to 74%, and the Residual Standard Error decreases from $156 to $121. That’s a lot of influence for just one outlier that we found to be based on data quality alone. Wow!
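
Fit statistics like these can be pulled from a simple linear model of actual on predicted price in the testing set; a minimal sketch, assuming the cp.test data frame and pred.price column from the code below, and using my own hypothetical name filter to drop the Xperia:

# Test-set fit, with and without the mispriced Xperia (a sketch, not the original code)
fit.all = lm(price ~ pred.price, data=cp.test)
summary(fit.all)$adj.r.squared   # ~0.57 with the outlier included
summary(fit.all)$sigma           # residual standard error, ~$156
# Hypothetical way to drop the $29.99 Xperia Z3v listing by name
fit.trimmed = lm(price ~ pred.price, data=subset(cp.test, !grepl("Xperia Z3v", name)))
summary(fit.trimmed)$adj.r.squared   # ~0.74 once the outlier is dropped
summary(fit.trimmed)$sigma           # ~$121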

Reflecting on this exercise, the one factor that I wished were collected is processor speed.  I’m curious how much that would factor into pricing decisions, but alas this information was unavailable.

Anyway, this was fun, and I hope not too boring for you, the readers. Thanks for reading!!


library(jsonlite)
cp = fromJSON(txt = "Cell Phone Data.txt", simplifyDataFrame = TRUE)
# Columns holding numeric attributes (price, storage, RAM, etc.) come in as text, so convert them
num.atts = c(4,9,11,12,13,14,15,16,18,22)
cp[,num.atts] = sapply(cp[,num.atts], function (x) as.numeric(x))
# Derive an aspect ratio and a rough smart-phone flag from the name and screen size
cp$aspect.ratio = cp$att_pixels_y / cp$att_pixels_x
cp$isSmartPhone = ifelse(grepl("smart|iphone|blackberry", cp$name, ignore.case=TRUE) | cp$att_screen_size >= 4, "Yes", "No")
library(ggplot2)
library(ggthemr)
library(scales)
ggthemr("camoflauge")
ggplot(cp, aes(x=att_brand, y=price)) + geom_boxplot() + ggtitle("Mobile Phone Price by Brand") + theme(axis.text.x=element_text(angle=90, size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_discrete("Brand")
ggplot(cp, aes(x=att_weight, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Weight") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Weight (oz)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=att_screen_size, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Screen Size") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Screen Size (in)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=att_ram, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Amount of RAM") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("RAM (gb)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=att_sd_card, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by SD Card Capacity") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("SD Card Capacity (gb)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=ifelse(cp$att_dual_sim == 1, "Yes", "No"), y=price)) + geom_boxplot() + ggtitle("Mobile Phone Price by Dual Sim") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_discrete("Has Dual Sim Card?")
ggplot(cp, aes(x=att_storage, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Storage Capacity") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Storage Capacity (gb)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=att_battery_mah, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Battery Capacity") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Battery Capacity (mAh)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=aspect.ratio, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Aspect Ratio") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Aspect Ratio (Y Pixels / X Pixels)") + stat_smooth(se=FALSE)
ggplot(cp, aes(x=isSmartPhone, y=price)) + geom_boxplot() + ggtitle("Mobile Phone Price by Smart Phone Status") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_discrete("Is it a Smart Phone?")
ggplot(cp, aes(x=att_os, y=price)) + geom_boxplot() + ggtitle("Mobile Phone Price by Operating System") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_discrete("Operating System")
library(caret)
control = trainControl(method="cv")  # 10-fold cross-validation for tuning
in_train = createDataPartition(cp$price, p=.8, list=FALSE)  # 80/20 training/testing split
model.gbm = train(price ~ att_brand + att_weight + att_screen_size +
                    att_ram + att_sd_card + att_dual_sim +
                    att_storage + att_battery_mah + att_os, data=cp,
                  method="gbm", trControl=control, verbose=FALSE, subset=in_train)
cp$att_brand = factor(cp$att_brand)  # re-encode brand as a factor
cp.test = cp[-in_train,]             # hold-out (testing) rows
cp.test = subset(cp.test, att_brand != 'TOTO')  # exclude the 'TOTO' brand before predicting
cp.test = na.omit(cp.test)
cp.test$pred.price = predict(model.gbm, cp.test)
ggplot(cp.test, aes(x=pred.price, y=price)) + geom_point(size=3) + ggtitle("Mobile Phone Price by Predicted Price") + theme(axis.text.x=element_text(size=14, vjust=0.5), axis.text.y=element_text(size=14), axis.title.x=element_text(size=15), axis.title.y=element_text(size=15), plot.title=element_text(size=17)) + scale_y_continuous(labels=dollar, name="Price (USD?)") + scale_x_continuous("Predicted Price", labels=dollar) + geom_abline(intercept=0, slope=1, colour="yellow") + stat_smooth(se=FALSE)

Data Until I Die: My blog title and statement of values

When I started keeping this Blog, my intent was to write about and keep helpful snippets of R code that I used in my line of work.  It was the start of my second job after grad school and I was really excited about getting to use R on a regular basis outside of academia!  Well, time went on and so did the number of posts I put on here.  After a while the posts related to work started to dip considerably.  Then, I found my third post-grad-school job, which shifted the things I needed R for at work.  I still use R at work, but not for exactly the same things.

After I got my third job, I noticed that all my blog posts were for fun, and not for work.  That’s when I fiddled a little bit with my blog title to incorporate the concept of ‘fun’ in there.  Now that I’ve carried on these ‘for fun’ analyses, which I’ve been posting every 1-2 months, I realize that this is an obsession of mine that’s not going away.  Data, I realize, is a need of mine (probably more than sex, to be honest).  I work with data in the office, and I play with data at home whenever I can find a fun data set.

Data and data analysis are things that I can comfortably see myself doing for the rest of my life.  That’s why I decided to call this blog “Data Until I Die!”.  Sometimes I’ll post boring analyses, and sometimes I’ll post really interesting analyses.  The main thing is, this Blog is a good excuse to fuel my Data drive 🙂

Thanks for reading!

Who uses E-Bikes in Toronto? Fun with Recursive Partitioning Trees and Toronto Open Data

I found a fun survey released to the Toronto Open Data website that investigates the travel/commuting behaviour of Torontonians, but with a special focus on E-bikes.  When I opened up the file, I found various demographic information, in addition to a question asking people their most frequently used mode of transportation.  Exactly 2,238 people responded to this survey, of which 194 were frequent E-bike users.  I figured that’s enough to do some data mining and an especially fun opportunity to use recursive partitioning trees to do that data mining!

Following is the code I used (notice in the model statements that I focus specifically on E-bike users versus everyone else):


library(rpart)
library(plyr)
library(rpart.plot)
ebike = read.csv("E-Bike_Survey_Responses.csv")
# This next part is strictly to change any blank responses into NAs
ebike[,2:10][ebike[,2:10] == ''] = NA
# In this section we use mapvalues from the plyr package to get rid of blanks, but also
# to reduce the number of values in each factor that we use.
ebike$Sex = mapvalues(ebike$Sex, c('',levels(ebike$Sex)[c(1,2,6)]), rep("Other", 4))
ebike$Health = mapvalues(ebike$How.would.you.describe.your.level.of.physical.health., c('', levels(ebike$How.would.you.describe.your.level.of.physical.health.)[c(1,4,5,6,12,13)]), c(NA, rep("Other", 6)))
ebike$Edu = mapvalues(ebike[,5], c('', levels(ebike[,5])[c(1,4,8,14,23)]), c(NA, rep('Other', 5)))
ebike$Income = mapvalues(ebike[,6], '', NA)
ebike$Age = mapvalues(ebike[,2], '', NA)
# People put a lot of varying answers in here, but the categories I've chosen here can be found in most of them.
ebike$transport = factor(ifelse(grepl("bicycle",ebike[,11]),"Bicycle",
ifelse(grepl("e-bike", ebike[,11]), "E-bike",
ifelse(grepl("car", ebike[,11]), "Car",
ifelse(grepl("transit", ebike[,11]), "Transit","Other")))))
# Here we ask R to make two trees based first on Sex Health and Age (they seem like they go together)
# then based on education and income. You can try to put them together, but you will find that only some are
# chosen as the most significant for the classification. Therefore, keeping them apart describes for us
# E-bike users on separate dimensions.
b = rpart(transport == "E-bike"~ Sex + Health + Age, data=ebike)
c = rpart(transport == "E-bike" ~ Edu + Income, data=ebike)
# And here we plot the two Partition Tree models. I like seeing the factor label
# values in their entirety, so I've chosen a large enough number for the 'faclen' argument
# in each call to rpart.plot
rpart.plot(b, type=1,extra=1, varlen=0, faclen=10)
rpart.plot(c, type=1,extra=1, varlen=0, faclen=20)


Here is the first tree based on Sex, Health, and Age (Remember that the factor levels shown are not the only ones.  When you look on the “no” side of the tree, it means that you are examining the proportion of ebike users who are described by factor levels not shown):

Health and Age Tree
As you can see, only Health and Age came out as significantly discriminating between E-bike users and everyone else.  What this tree is telling us is that it’s not the people in “Excellent, Good, Very good” health who are likely to use E-bikes, but rather those in the un-shown part of the Health spectrum: “Other, Fairly good, Poor”.  That’s interesting in and of itself.  It also seems that people in Excellent or Very Good health are more likely (44%) to be riding Bicycles than people at other levels of health (23%).  That makes sense!  You’re not going to choose something effortful if your health isn’t great.

We also see a very interesting finding that it is in fact the 50 – 64 year olds (whose health isn’t great) who are more likely to be riding an E-bike compared to people of all other age groups!

Here’s the second tree based on Education and Income:

Education and Income Tree
Here we see that it is in fact not the university-educated who are more likely to ride E-bikes, but rather people with a “College or trade school diploma” or “High school diploma”.  Interesting!!  Further, we see that amongst those who aren’t university educated, it’s those who say they make less than $80,000 in income who are more likely to ride E-bikes.

So now we have an interesting picture emerging, with two parallel descriptions of who is most likely to ride E-bikes:

1) 50 – 64 year olds in not the greatest of health
and
2) Non-university-educated folks with less than $80,000 in income.

Toronto, these are your E-bike users!

Estimating Ages from First Names Part 2 – Using Some Morbid Test Data

In my last post, I wrote about how I compiled a US Social Security Agency data set into something usable in R, and mentioned some issues scaling it up to be usable for bigger datasets.  I also mentioned the need for data to test out the accuracy of my estimates.  First, I’ll show you how I prepped the dataset so that it became more scalable (for the code that got us here, see my last post):

library(plyr)
# Weighted average of Year for each name, weighted by that name's relative popularity in each year
name_data_wavgpop_unisex = ddply(name_data, .(Name), function (x) sum(x$Rel_Pop*as.numeric(x$Year))/sum(x$Rel_Pop))
name_data_wavgpop_unisex$V1 = round(name_data_wavgpop_unisex$V1, 0)

Above I’ve taken a different approach to predicting expected year of birth from a name than I started out with in my last post. Here I’m using the relative popularity of the names in each year as weights for each year value. Multiplying the weights by the years, I get a weighted average of Year that gives me a predicted year of birth. Then I round off the predictions to the nearest integer and continue on my way. Also, because test data doesn’t seem to come packaged with gender info, I’ve constructed the weighted averages using all relative popularity values for each name, regardless of whether or not that name has been used for both sexes (e.g. “Jordan”).

Now enter the test data. I’ve discovered that the easiest way of getting real names and ages off the internet is by looking for lists of victims of some horrible tragedy. The biggest such list of victims I could find was a list of 9/11 victims.  It’s not exactly formatted for easy analysis, and I was too lazy to get the data programmatically, so I just copy-pasted the names and ages from the first 4 lists on the page (all from either American Airlines or United Airlines flights) into LibreOffice Calc, for a total of 285 observations.  I then extracted the first names, and imported the first names and ages into R.

library(sqldf)
worldtrade = read.csv("world trade.csv")
worldtrade.ages = sqldf("SELECT a.*, b.V1 as Year FROM [worldtrade] AS a LEFT JOIN name_data_wavgpop_unisex AS b ON a.Name == b.Name")
worldtrade.ages$Pred.Age = 2001 - as.numeric(worldtrade.ages$Year)

As you can see, I opted to use sqldf to append the appropriate predicted birth years for each name on the list I imported. I then got the predicted ages by subtracting each predicted birth year from 2001. Finally, let’s have a look at the resulting model fit (showing how close each predicted age was to the real age of the victim at the time of death):

Predicted vs Real Ages for 911 Victims

As you can see, it’s not a tight prediction in the least.  According to model fit statistics, there’s an adjusted r-squared of 14.6% and a residual standard error of 15.58 years.  You can also see from the scatter plot that the prediction doesn’t become reasonably linear until about age 30 onwards.  Overall, I’d say it’s not too impressive, and I’d imagine it’s even worse for predicting who’s under 10 years old!
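
Those fit statistics come from regressing actual age on predicted age; a minimal sketch, assuming the age column in worldtrade.ages is simply named "Age":

# Fit of actual vs. predicted age ("Age" is an assumed column name)
age.fit = lm(Age ~ Pred.Age, data=worldtrade.ages)
summary(age.fit)$adj.r.squared   # ~0.146
summary(age.fit)$sigma           # ~15.58 years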

Well, this was fun (if only a little disappointing).  That’s statistics for you – sometimes it confirms, and sometimes it humbles you.  If you think you can show me a better way of using this name trending data to better predict ages than what I’ve done here, feel free to show me!

Which Torontonians Want a Casino? Survey Analysis Part 2

In my last post I said that I would try to investigate the question of who actually does want a casino, and whether place of residence is a factor in where they want the casino to be built.  So, here goes something:

The first line of attack in this blog post is to distinguish between people based on their responses to the third question on the survey, the one asking people to rate the importance of a long list of issues.  When I looked at this list originally, I knew that I would want to reduce the dimensionality using PCA.

library(psych)
# Reduce the 16 importance-rating items from question 3 to 3 varimax-rotated components
issues.pca = principal(casino[,8:23], 3, rotate="varimax", scores=TRUE)

The PCA resulted in the 3 components listed in the table below.  The first component had variables loading on to it that seemed to relate to the casino being a big attraction with lots of features, so I named it “Go big or Go Home”.  On the second component there seemed to be variables loading on to it that related to technical details, while the third component seemed to have variables loading on to it that dealt with social or environmental issues.

Columns: Go Big or Go Home | Concerned with Technical Details | Concerned with Social/Environmental Issues or not | Issue/Concern
Q3_A 0.181 0.751 Design of the facility
Q3_B 0.366 0.738 Employment Opportunities
Q3_C 0.44 0.659 Entertainment and cultural activities
Q3_D 0.695 0.361 Expanded convention facilities
Q3_E 0.701 0.346 Integration with surrounding areas
Q3_F 0.808 0.266 New hotel accommodations
Q3_G -0.117 0.885 Problem gambling & health concerns
Q3_H 0.904 Public safety and social concerns
Q3_I 0.254 0.716 Public space
Q3_J 0.864 0.218 Restaurants
Q3_K 0.877 0.157 Retail
Q3_L 0.423 0.676 -0.1 Revenue for the city
Q3_M 0.218 0.703 0.227 Support for local businesses
Q3_N 0.647 0.487 -0.221 Tourist attraction
Q3_O 0.118 0.731 Traffic concerns
Q3_P 0.497 0.536 0.124 Training and career development

Once I was satisfied that I had a decent understanding of what the PCA was telling me, I loaded the component scores into the original dataframe.

casino[,110:112] = issues.pca$scores
names(casino)[110:112] = c("GoBigorGoHome","TechnicalDetails","Soc.Env.Issues")

In order to investigate the question of who wants a casino and where, I decided to use question 6 as a dependent variable (the one asking where they would want it built, if one were to be built) and the PCA components as independent variables.  This is a good question to use, because the answer options, if you remember, are “Toronto”, “Adjacent Municipality” and “Neither”.  My approach was to model each response individually using logistic regression.

casino$Q6[casino$Q6 == ""] = NA
casino$Q6 = factor(casino$Q6, levels=c("Adjacent Municipality","City of Toronto","Neither"))

adj.mun = glm(casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
toronto = glm(casino$Q6 == "City of Toronto" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))
neither = glm(casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails + Soc.Env.Issues, data=casino, family=binomial(logit))

Following are the summaries of each GLM:
Toronto:


Call:
glm(formula = casino$Q6 == "City of Toronto" ~ GoBigorGoHome +
TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
data = casino)
Deviance Residuals:
Min 1Q Median 3Q Max
3.6426 0.4745 0.1156 0.4236 3.4835
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.58707 0.04234 37.48 <2e-16 ***
GoBigorGoHome 1.76021 0.03765 46.75 <2e-16 ***
TechnicalDetails 1.77155 0.05173 34.24 <2e-16 ***
Soc.Env.Issues 1.63057 0.04262 38.26 <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13537 on 10365 degrees of freedom
Residual deviance: 6818 on 10362 degrees of freedom
(7400 observations deleted due to missingness)
AIC: 6826
Number of Fisher Scoring iterations: 6

Adjacent municipality:


Call:
glm(formula = casino$Q6 == "Adjacent Municipality" ~ GoBigorGoHome +
TechnicalDetails + Soc.Env.Issues, family = binomial(logit),
data = casino)
Deviance Residuals:
Min 1Q Median 3Q Max
1.0633 0.7248 0.5722 0.3264 2.7136
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.45398 0.02673 54.394 < 2e-16 ***
GoBigorGoHome 0.41989 0.02586 16.239 < 2e-16 ***
TechnicalDetails 0.18764 0.02612 7.183 6.82e-13 ***
Soc.Env.Issues 0.52325 0.03221 16.243 < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10431.8 on 10365 degrees of freedom
Residual deviance: 9756.4 on 10362 degrees of freedom
(7400 observations deleted due to missingness)
AIC: 9764.4
Number of Fisher Scoring iterations: 5

Neither location:


Call:
glm(formula = casino$Q6 == "Neither" ~ GoBigorGoHome + TechnicalDetails +
Soc.Env.Issues, family = binomial(logit), data = casino)
Deviance Residuals:
Min 1Q Median 3Q Max
2.4090 0.7344 0.3934 0.8966 2.7194
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.22987 0.02415 9.517 <2e-16 ***
GoBigorGoHome 0.85050 0.02462 34.549 <2e-16 ***
TechnicalDetails 1.00182 0.02737 36.597 <2e-16 ***
Soc.Env.Issues 0.69707 0.02584 26.972 <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 14215 on 10365 degrees of freedom
Residual deviance: 10557 on 10362 degrees of freedom
(7400 observations deleted due to missingness)
AIC: 10565
Number of Fisher Scoring iterations: 4

And here is a quick summary of the above GLM information:
Summary of Casino GLMs

Judging from these results, it looks like those who want a casino in Toronto don’t focus on the big social/environmental issues surrounding the casino, but do focus on the flashy and non-flashy details and benefits alike.  Those who want a casino outside of Toronto do care about the social/environmental issues, don’t care as much about the flashy details, but do have a focus on some of the non-flashy details.  Finally, those not wanting a casino in either location care about the social/environmental issues, but don’t care about any of the details.

Here’s where the issue of location comes into play.  When I look at the summary for the GLM that predicts who wants a casino in an adjacent municipality, I get the feeling that it’s picking up people living in the downtown core who just don’t think the area can handle a casino.  In other words, I think there might be a “not in my backyard!” effect.

The first inkling that this might be the case comes from an article from the Martin Prosperity Institute (MPI), who analyzed the same data set, and managed to get a very nice looking heat map of the responses to the first question on the survey, asking people how they feel about having a new casino in Toronto.  From this map, it does look like people in Downtown Toronto are feeling pretty negative about a new casino, whereas those in the far east and west of Toronto are feeling better about it.

My next evidence comes from the cities uncovered by geocoding the responses in the data set.  I decided to create a very simple indicator variable, distinguishing those for whom the “City” is Toronto, and those for whom the city is anything else.  I like this better than the MPI analysis, because it looks at peoples’ attitudes towards a casino both inside and outside of Toronto (rather than towards the concept of a new Casino in Toronto).  If there really is a “not in my backyard!” effect, I would expect to see evidence that those in Toronto are more disposed towards a casino in an adjacent municipality, and that those from outside of Toronto are more disposed towards a casino inside Toronto!  Here we go:


library(ff)
library(ggplot2)
library(ggthemes)
library(reshape2)  # for melt()
library(scales)    # for percent()
ffload(file="casino", overwrite=TRUE)
casino.orig$Outside.of.Toronto = as.ff(ifelse(casino.orig[,"City"] == "Toronto",0,1))
casino.in.toronto = glm(casino.orig[,"Q6"] == "City of Toronto" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
casino.outside.toronto = glm(casino.orig[,"Q6"] == "Adjacent Municipality" ~ Outside.of.Toronto, data=casino.orig, family=binomial(logit))
summary(casino.in.toronto)
Call:
glm(formula = casino.orig[, "Q6"] == "City of Toronto" ~ Outside.of.Toronto,
family = binomial(logit), data = casino.orig)
Deviance Residuals:
Min 1Q Median 3Q Max
0.9132 0.9132 0.7205 1.4669 1.7179
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.21605 0.02600 46.77 <2e-16 ***
Outside.of.Toronto 0.55712 0.03855 14.45 <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16278 on 13881 degrees of freedom
Residual deviance: 16070 on 13880 degrees of freedom
(3884 observations deleted due to missingness)
AIC: 16074
Number of Fisher Scoring iterations: 4
————————————————–
summary(casino.outside.toronto)
Call:
glm(formula = casino.orig[, "Q6"] == "Adjacent Municipality" ~
Outside.of.Toronto, family = binomial(logit), data = casino.orig)
Deviance Residuals:
Min 1Q Median 3Q Max
0.7280 0.7280 0.5554 0.5554 1.9726
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19254 0.02583 46.16 <2e-16 ***
Outside.of.Toronto 0.59879 0.04641 12.90 <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13786 on 13881 degrees of freedom
Residual deviance: 13611 on 13880 degrees of freedom
(3884 observations deleted due to missingness)
AIC: 13615
Number of Fisher Scoring iterations: 4
——————————————-
casino.loc.by.city.res = as.matrix(table(casino.orig[,"Outside.of.Toronto"], casino.orig[,"Q6"])[,2:4])
rownames(casino.loc.by.city.res) = c("Those Living Inside the City of Toronto", "Those Living Outside the City of Toronto")
casino.loc.by.city.res = melt(prop.table(casino.loc.by.city.res,1))
names(casino.loc.by.city.res) = c("City of Residence","Where to Build the Casino","value")
ggplot(casino.loc.by.city.res, aes(x=`Where to Build the Casino`, y=value, fill=`City of Residence`)) + geom_bar(stat="identity", position="dodge") + scale_y_continuous(labels=percent) + theme_wsj() + scale_fill_discrete(name="City of Residence") + ggtitle("If a casino is built, where would you prefer to see it located?")


Where located by city of residence

As you can see here, those from outside of Toronto are more likely to suggest building a casino in Toronto compared with those from inside of it, and less likely to suggest building a casino in an adjacent municipality (with the reverse being true of those from inside of Toronto).

That being said, when you do the comparison within city of residence (instead of across it like I just did), those from inside of Toronto seem equally likely to suggest that the casino be built in or outside of the city, whereas those from outside are much more likely to suggest building the casino inside Toronto than outside.  So, depending on how you view this graph, you might only say there’s evidence for a “not in my backyard!” effect for those living outside of Toronto.

As a final note, I’ll remind you that although these analyses point to which Torontonians do want a new casino, the fact from this survey remains that about 71% of respondents are unsupportive of a casino in Toronto, and 53% don’t want a casino built in either Toronto or an adjacent municipality.  I really have to wonder if they’re still going to go ahead with it!

Split, Apply, and Combine for ffdf

Call me incompetent, but I just can’t get ffdfdply to work with my ffdf dataframes.  I’ve tried repeatedly and it just doesn’t seem to work!  I’ve seen numerous examples on stackoverflow, but maybe I’m applying them incorrectly.  Wanting, yet again, to do some split-apply-combine on an ffdf, I finally broke down and made my own function that seems to do the job!  It’s still crude, I think, and it will probably break down when there are NA values in the vector that you want to split on, but here it is:

mtapply = function (dvar, ivar, funlist) {
  # Apply each function named in funlist to dvar, split by the levels of ivar,
  # and collect the results as the columns of a matrix (one row per level of ivar)
  lenlist = length(funlist)
  outtable = matrix(NA, dim(table(ivar)), lenlist, dimnames=list(names(table(ivar)), funlist))
  c = 1
  for (f in funlist) {
    outtable[,c] = as.matrix(tapply(dvar, ivar, eval(parse(text=f))))
    c = c + 1
  }
  return (outtable)
}

As you can see, I’ve made it so that the result is a bunch of tapply vectors inserted into a matrix.  “dvar”, unsurprisingly, is your dependent variable.  “ivar”, your independent variable.  “funlist” is a vector of function names typed in as strings (e.g. c(“median”, “mean”, “sd”)).  I’ve wasted so much of my time trying to get ddply or ffdfdply to work on an ffdf that I’m happy to now have anything that does the job for me.
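
Here’s a quick toy example of the mechanics, on a plain data frame (the call is the same when the columns come from an ffdf):

# Toy example: three summary functions applied to y within each level of group
set.seed(1)
toy = data.frame(group = rep(c("a", "b", "c"), each = 20), y = rnorm(60))
mtapply(toy$y, toy$group, c("mean", "median", "sd"))
# Returns a 3 x 3 matrix: one row per group level, one column per function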

Now that I think about it, this will fall short if you ask it to output more than one quantile for each of your split levels.  If you can improve this function, please be my guest!

Finding Patterns Amongst Binary Variables with the homals Package

It’s survey analysis season for me at work!  When analyzing survey data, the one kind of analysis I have realized that I’m not used to doing is finding patterns in binary data.  In other words, if I have a question to which multiple, non-mutually exclusive (checkbox) answers apply, how do I find the patterns in peoples’ responses to this question?

I tried applying PCA and Factor Analysis alternately, but they really don’t seem well suited to the analysis of data consisting of only binary columns (1s and 0s). In searching for something that works, I came across the homals package.  While the main function is described as a “homogeneity analysis”, the one ability of it that interests me is called “non-linear PCA”.  This is supposed to be able to reduce the dimensionality of your dataset even when the variables are all binary.

Well, here’s an example using some real survey data (with masked variable names).  First we start off with the purpose of the data and some simple summary stats:

It’s a group of 6 variables (answer choices) showing people’s check-box responses to a question asking them why they donated to a particular charity.  Following are the numbers of responses to each answer choice:

mapply(whydonate, FUN=sum, 1)
 V1  V2  V3  V4  V5  V6 
201  79 183 117 288 199

With the possible exception of answer choice V2, there are some pretty healthy numbers in each answer choice.  Next, let’s load up the homals package and run our non-linear PCA on the data.

library(homals)
fit = homals(whydonate)

fit
Call: homals(data = whydonate)

Loss: 0.0003248596 

Eigenvalues:
    D1     D2 
0.0267 0.0156 

Variable Loadings:
           D1          D2
V1 0.28440348 -0.10010355
V2 0.07512143 -0.10188037
V3 0.09897585  0.32713745
V4 0.20464762  0.21866432
V5 0.26782837 -0.09600215
V6 0.33198532 -0.04843107

As you can see, it extracts 2 dimensions by default (it can be changed using the “ndim” argument in the function), and it gives you what looks very much like a regular PCA loadings table.

Reading it naively, the pattern I see in the first dimension goes something like this: People tended to answer affirmatively to answer choices 1,4,5, and 6 as a group (obviously not all the time and altogether though!), but those answers didn’t tend to be used alongside choices 2 and 3.

In the second  dimension I see: People tended to answer affirmatively to answer choices 3 and 4 as a group.  Okay, now as a simple check, let’s look at the correlation matrix for these binary variables:

cor(whydonate)

           V1            V2            V3         V4          V5         V6
V1 1.00000000  0.0943477325  0.0205241732 0.16409945 0.254854574 0.45612458
V2 0.09434773  1.0000000000 -0.0008474402 0.01941461 0.038161091 0.08661938
V3 0.02052417 -0.0008474402  1.0000000000 0.21479291 0.007465142 0.11416164
V4 0.16409945  0.0194146144  0.2147929137 1.00000000 0.158325383 0.22777471
V5 0.25485457  0.0381610906  0.0074651417 0.15832538 1.000000000 0.41749064
V6 0.45612458  0.0866193754  0.1141616374 0.22777471 0.417490642 1.00000000

The first dimension is easy to spot in the “V1” column above. Also, we can see the second dimension in the “V3” column above – both check out! I find that neat and easy. Does anyone use anything else to find patterns in binary data like this? Feel free to tell me in the comments!

Multiple Classification and Authorship of the Hebrew Bible

Sitting in my synagogue this past Saturday, I started thinking about the authorship analysis that I did using function word counts from texts authored by Shakespeare, Austen, etc.  I started to wonder if I could do something similar with the component books of the Torah (Hebrew bible).

A very cursory reading of the Documentary Hypothesis indicates that the only books of the Torah supposed to be authored by one person each were Vayikra (Leviticus) and Devarim (Deuteronomy).  The remaining three appear to be a confusing hodgepodge compilation from multiple authors.  I figured that if I submitted the books of the Torah to a similar analysis, and if the Documentary Hypothesis is spot-on, then the analysis should be able to accurately classify only Vayikra and Devarim.

The theory with the following analysis (taken from the English textual world, of course) seems to be this: When someone writes a book, they write with a very particular style.  If you are going to be able to detect that style statistically, it is convenient to detect it using function words.  Function words (“and”, “also”, “if”, “with”, etc.) need to be used regardless of content, and therefore should show up throughout the text being analyzed.  Each author uses a distinct number/proportion of each of these function words, and is therefore distinguishable based on their profile of usage.

With that in mind, I started my journey.  The first steps were to find an online source of Torah text that I could easily scrape for word counts, and then to figure out which hebrew function words to look for.  For the Torah text, I relied on the inimitable Chabad.org.  They hired good rational web developer(s) to make their website, and so looping through each perek (chapter) of the Torah was a matter of copying and pasting html page numbers from their source code.

Several people told me that I’d be wanting for function words in Hebrew, as there are not as many as in English.  However, I found a good 32 of them, as listed below:

Transliteration Hebrew Function Word Rough Translation Word Count
al עַל Upon 1262
el אֶל To 1380
asher אֲשֶׁר That 1908
ca_asher כַּאֲשֶׁר As 202
et אֶת (Direct object marker) 3214
ki כִּי For/Because 1030
col וְכָל + כָּל + לְכָל + בְּכָל + כֹּל All 1344
ken כֵּן Yes/So 124
lachen לָכֵן Therefore 6
hayah_and_variants תִּהְיֶינָה + תִּהְיֶה + וְהָיוּ + הָיוּ + יִהְיֶה + וַתְּהִי + יִּהְיוּ + וַיְהִי + הָיָה Be 819
ach אַךְ But 64
byad בְּיַד By 32
gam גַם Also/Too 72
mehmah מֶה + מָה What 458
haloh הֲלֹא Was not? 17
rak רַק Only 59
b_ad בְּעַד For the sake of 5
loh לֹא No/Not 1338
im אִם If 332
al2 אַל Do not 217
ele אֵלֶּה These 264
haheehoo הַהִוא + הַהוּא That 121
ad עַד Until 324
hazehzot הַזֶּה + הַזֹּאת + זֶה + זֹאת This 474
min מִן From 274
eem עִם With 80
mi מִי Who 703
oh אוֹ Or 231
maduah מַדּוּעַ Why 10
etzel אֵצֶל Beside 6
heehoo הִוא + הוּא + הִיא Him/Her/It 653
az אָז Thus 49

This list is not exhaustive, but definitely not small!  My one hesitation when coming up with this list surrounds the Hebrew word for “and”.  “And” takes the form of a single letter that attaches to the beginning of a word (a “vav” marked with a different vowel sound depending on its context), which I was afraid to try to extract because I worried that if I tried to count it, I would mistakenly count other vav’s that were a valid part of a word with a different meaning.  It’s a very frequent word, as you can imagine, and its absence might very well affect my subsequent analyses.

Anyhow, following is the structure of Torah:

‘Chumash’ / Book Number of Chapters
‘Bereishit’ / Genesis 50
‘Shemot’ / Exodus 40
‘Vayikra’ / Leviticus 27
‘Bamidbar’ / Numbers 36
‘Devarim’ / Deuteronomy 34

Additionally, I’ve included a faceted histogram below showing the distribution of word-counts per chapter by chumash/book of the Torah:

> m = ggplot(torah, aes(x=wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

Word Count Dist by Chumash

You can see that the books are not entirely different in terms of word counts of the component chapters, except for the possibility of Vayikra, which seems to tend towards the shorter chapters.

After making a Python script to count the above words within each chapter of each book, I loaded it up into R and split it into a training and testing sample:

torah$randu = runif(187, 0, 1)
torah.train = torah[torah$randu <= .4,]
torah.test = torah[torah$randu > .4,]

For this analysis, it seemed that using Random Forests made the most sense.  However, I wasn’t quite sure if I should use the raw counts, or proportions, so I tried both. After whittling down the variables in both models, here are the final training model definitions:

library(randomForest)
torah.rf = randomForest(chumash ~ al + el + asher + caasher + et + ki + hayah + gam + mah + loh + haheehoo + oh + heehoo, data=torah.train, ntree=5000, importance=TRUE, mtry=8)

torah.rf.props = randomForest(chumash ~ al_1 + el_1 + asher_1 + caasher_1 + col_1 + hayah_1 + gam_1 + mah_1 + loh_1 + im_1 + ele_1 + mi_1 + oh_1 + heehoo_1, data=torah.train, ntree=5000, importance=TRUE, mtry=8)
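
The “_1” proportion variables aren’t constructed in the snippets above; here’s a minimal sketch of how they could be derived, assuming each raw count column shares its name with the transliteration in the table above and that wordcount holds each chapter’s total word count:

# Hypothetical construction of the proportion columns (e.g. al_1 = al / wordcount)
count.cols = c("al", "el", "asher", "caasher", "col", "hayah", "gam", "mah",
               "loh", "im", "ele", "mi", "oh", "heehoo")
for (w in count.cols) {
  torah[, paste0(w, "_1")] = torah[, w] / torah$wordcount
}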

As you can see, the final models were mostly the same, but with a few differences. Following are the variable importances from each Random Forests model:

> importance(torah.rf)

 Word MeanDecreaseAccuracy MeanDecreaseGini
hayah 31.05139 5.979567
heehoo 20.041149 4.805793
loh 18.861843 6.244709
mah 18.798385 4.316487
al 16.85064 5.038302
caasher 15.101464 3.256955
et 14.708421 6.30228
asher 14.554665 5.866929
oh 13.585169 2.38928
el 13.010169 5.605561
gam 5.770484 1.652031
ki 5.489 4.005724
haheehoo 2.330776 1.375457

> importance(torah.rf.props)

Word MeanDecreaseAccuracy MeanDecreaseGini
asher_1 37.074235 6.791851
heehoo_1 29.87541 5.544782
al_1 26.18609 5.365927
el_1 17.498034 5.003144
col_1 17.051121 4.530049
hayah_1 16.512206 5.220164
loh_1 15.761723 5.157562
ele_1 14.795885 3.492814
mi_1 12.391427 4.380047
gam_1 12.209273 1.671199
im_1 11.386682 2.651689
oh_1 11.336546 1.370932
mah_1 9.133418 3.58483
caasher_1 5.135583 2.059358

It’s funny that, from a raw-numbers perspective, the results show hayah, the Hebrew verb for “to be”, at the top of the list.  That’s the same result as in the Shakespeare et al. analysis!  Having established that all variables in each model had some kind of an effect on the classification, the next task was to test each model on the testing sample, and see how well each chumash/book of the Torah could be classified by that model:

> torah.test$pred.chumash = predict(torah.rf, torah.test, type="response")
> torah.test$pred.chumash.props = predict(torah.rf.props, torah.test, type="response")

> xtabs(~torah.test$chumash + torah.test$pred.chumash)
                  torah.test$pred.chumash
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            4            5          2         8          7
       'Bereishit'           1           14          1        14          2
       'Devarim'             1            2         17         0          1
       'Shemot'              2            4          4         9          2
       'Vayikra'             5            0          4         0          5

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash),1)
                  torah.test$pred.chumash
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.15384615   0.19230769 0.07692308 0.30769231 0.26923077
       'Bereishit'  0.03125000   0.43750000 0.03125000 0.43750000 0.06250000
       'Devarim'    0.04761905   0.09523810 0.80952381 0.00000000 0.04761905
       'Shemot'     0.09523810   0.19047619 0.19047619 0.42857143 0.09523810
       'Vayikra'    0.35714286   0.00000000 0.28571429 0.00000000 0.35714286

> xtabs(~torah.test$chumash + torah.test$pred.chumash.props)
                  torah.test$pred.chumash.props
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            0            5          0        13          8
       'Bereishit'           1           16          0        13          2
       'Devarim'             0            2         11         4          4
       'Shemot'              1            4          2        13          1
       'Vayikra'             3            3          0         0          8

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash.props),1)
                  torah.test$pred.chumash.props
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.00000000   0.19230769 0.00000000 0.50000000 0.30769231
       'Bereishit'  0.03125000   0.50000000 0.00000000 0.40625000 0.06250000
       'Devarim'    0.00000000   0.09523810 0.52380952 0.19047619 0.19047619
       'Shemot'     0.04761905   0.19047619 0.09523810 0.61904762 0.04761905
       'Vayikra'    0.21428571   0.21428571 0.00000000 0.00000000 0.57142857

So, from the perspective of the raw number of times each function word was used, Devarim, or Deuteronomy, seemed to be the most internally consistent, with 81% of the chapters in the testing sample correctly classified. Interestingly, from the perspective of the proportion of times each function word was used, we see that Devarim, Shemot, and Vayikra (Deuteronomy, Exodus, and Leviticus) had over 50% of their chapters correctly classified in the testing sample.

I’m tempted to say here, at the least, that there is evidence that at least Devarim was written with one stylistic framework in mind, and potentially one distinct author. From a proportion point of view, it appears that Shemot and Vayikra also show an internal consistency suggestive of close to one stylistic framework, or possibly a distinct author for each book. I’m definitely skeptical of my own analysis, but what do you think?

The last part of this analysis comes from a suggestion given to me by a friend, which was that once I had modelled the unique profiles of function words within each book of the Torah, I should use that model on some later Biblical texts.  Apparently one idea is that the “Deuteronomist Source” was also behind the writing of Joshua, Judges, and Kings.  If the same author was behind all four books, then when I run my Torah-trained model on these books, their chapters should tend to be classified as Devarim/Deuteronomy more so than as the other books.

As above, below I show the distribution of word count by book, for comparison’s sake:

> m = ggplot(neviim, aes(wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

Word Count Dist by Prophets Book

Interestingly, the chapters in these particular books seem to be a bit longer, on average, than those in the Torah.

Next, I gathered counts of the same function words in Joshua, Judges, and Kings as I had for the 5 books of the Torah, and tested my Random Forests Torah models on them (the prediction step is sketched below).  As you can see from the tables that follow, the result was anything but clear on the matter:
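
A minimal sketch of that prediction step, assuming the neviim data frame of counts uses the same column names as torah:

# Hypothetical prediction step for the Joshua/Judges/Kings chapters
neviim$pred.chumash = predict(torah.rf, neviim, type="response")
neviim$pred.chumash.props = predict(torah.rf.props, neviim, type="response")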

> xtabs(~neviim$chumash + neviim$pred.chumash)
              neviim$pred.chumash
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           3            7          7         6          1
      'Judges'           2           11          2         6          0
      'Kings'            0            8          3        10          1

> xtabs(~neviim$chumash + neviim$pred.chumash.props)
              neviim$pred.chumash.props
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           2            8          6         7          1
      'Judges'           0            9          2         9          1
      'Kings'            0            7          6         7          2

I didn’t even bother to re-express this table as fractions, because it’s quite clear that the books of the Prophets that I analyzed didn’t seem to be clearly classified into any one category.  Looking at these tables, there doesn’t yet seem to me to be any evidence, from this analysis, that whoever authored Devarim/Deuteronomy also authored these later texts.

Conclusion

I don’t think that this has been a full enough analysis.  There are a few things in it that bother me, or make me wonder, that I’d love input on.  Let me list those things:

  1. As mentioned above, I’m missing the inclusion of the Hebrew “and” in this analysis.  I’d like to know how to extract counts of the Hebrew “and” without extracting counts of the Hebrew letter “vav” where it doesn’t signify “and”.
  2. Similar to my exclusion of “and”, there are a few one letter prepositions that I have not included as individual predictor variables.  Namely ל, ב, כ, מ, signifying “to”, “in”/”with”, “like”, and “from”.  How do I count these without counting the same letters that begin a different word and don’t mean the same thing?
  3. Is it more valid to consider the results of my analyses that were done on the word frequencies as proportions (individual word count divided by total number of words in the chapter), or are both valid?
  4. Does a list exist somewhere that details, chapter-by-chapter, which source is believed to have written the Torah text, according to the Documentary Hypothesis, or some more modern incarnation of the Hypothesis?  I feel that if I were able to categorize the chapters specifically, rather than just attributing them to the whole book (as a proxy of authorship), then the analysis might be a lot more interesting.

All that being said, I’m intrigued that when you look at the raw number of how often the function words were used, Devarim/Deuteronomy seems to be the most internally consistent.  If you’d like, you can look at the python code that I used to scrape the chabad.org website here: python code for scraping, although please forgive the rudimentary coding!  You can get the dataset that I collected for the Torah word counts here: Torah Data Set, and the data set that I collected for the Prophetic text word counts here: Neviim data set.  By all means, do the analysis yourself and tell me how to do it better 🙂

My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

After the work I did for my last post, I wanted to practice doing multiple classification.  I first thought of using the famous iris dataset, but felt that was a little boring.  Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors.  Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.

The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books.  In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also”, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!!  The following table describes the numerical breakdown of books and chapters from each author:

# Books # Chapters/Scenes
Austen 5 317
London 6 296
Milton 2 55
Shakespeare 12 173

Overall, the dataset contains 841 rows and 71 columns.  69 of those columns are the counted stop words (wow!), 1 is for what’s called the “BookID”, and the last is for the Author.  I hope that the word counts are the number of times each word shows up per 100 words, or something that makes the counts comparable between authors/books.

The first thing I did after loading up the data into R was to create a training and testing set:

> authorship$randu = runif(841, 0,1)
> authorship.train = authorship[authorship$randu < .4,]
> authorship.test = authorship[authorship$randu >= .4,]

Then I set out to try to predict authorship in the testing data set using a Random Forests model, a Conditional Inference Tree model, and a Linear Discriminant Analysis model.

Random Forests Model

Here’s the code I used to train the Random Forests model (after finding out that the word “one” didn’t seem to be too important for the classification):

authorship.model.rf = randomForest(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have + her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train, ntree=5000, mtry=15, importance=TRUE)

It seemed to me that the mtry argument shouldn’t be too low, as we are trying to discriminate between authors!  Following is a graph showing the Mean Decrease in Accuracy for each of the words in the Random Forests Model:
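
In case you’d like to produce a similar graph yourself, the randomForest package has a built-in helper for it; here’s a minimal sketch (this may not be exactly how the plot below was made):

> # type = 1 plots the Mean Decrease in Accuracy for each predictor
> varImpPlot(authorship.model.rf, type=1, sort=TRUE)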

Discriminating Words - RF

As you can see, some of the most important words for classification in this model were “was”, “be”, “to”, “the”, “her”, and “had”.  At this point I can’t help but think of the ever-famous “to be, or not to be” line, and wonder if these are the sort of words that would show up more often in Shakespeare texts.  I don’t have the original texts to re-analyze, so I can only rely on what I have in this dataset to try to answer that question.  Doing a simple average of how often the word “be” shows up per chapter per author, I see that it shows up an average of 12.9 times per scene in Shakespeare, and an average of 20.2 times per chapter in Austen!  Shakespeare doesn’t win that game!!
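
For the curious, that per-author average is a one-liner; a sketch assuming the full data frame is called authorship and the stop word columns are named after the words themselves (as in the model formula above):

> aggregate(be ~ Author, data=authorship, FUN=mean)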

Anyway, the Random Forests model does pretty well in classifying authors in the test set, as you can see in the counts and proportions tables below:

> authorship.test$pred.author.rf = predict(authorship.model.rf, authorship.test, type="response")
> table(authorship.test$Author, authorship.test$pred.author.rf)
> prop.table(table(authorship.test$Author, authorship.test$pred.author.rf),1)

              Austen London Milton Shakespeare
  Austen         182      4      0           0
  London           1    179      0           1
  Milton           0      1     33           2
  Shakespeare      1      2      0         102

                   Austen      London      Milton Shakespeare
  Austen      0.978494624 0.021505376 0.000000000 0.000000000
  London      0.005524862 0.988950276 0.000000000 0.005524862
  Milton      0.000000000 0.027777778 0.916666667 0.055555556
  Shakespeare 0.009523810 0.019047619 0.000000000 0.971428571

If you look on the diagonal, you can see that the model performs pretty well across authors.  It seems to do the worst with Milton (although still pretty darned well), but I think that should be expected due to the low number of books and chapters from him.
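
If you’d rather have a single summary number than eyeball the diagonal, overall test-set accuracy can be computed with a quick line like this (not part of the original output):

> # proportion of test records whose predicted author matches the actual author
> mean(as.character(authorship.test$Author) == as.character(authorship.test$pred.author.rf))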

Conditional Inference Tree

Here’s the code I used to train the Conditional Inference Tree model:

> authorship.model.ctree = ctree(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have +  her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train)

Following is the plot that shows the significant splits done by the Conditional Inference Tree:

Discriminating Words - CTree

As is painfully obvious at first glance, there are so many terminal nodes that reading them off this graph is out of the question.  Still useful, however, are the ovals indicating which words formed the significant splits.  Similar to the Random Forests model, we see “be” and “was” showing up as the most significant words in discriminating between authors.  Other words it splits on don’t seem to be as high up on the list in the Random Forests model, but the end goal is prediction, right?
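
If you want a less crowded version of that plot, party’s plot method has a simpler mode; a small sketch (I don’t show its output here):

> # type="simple" suppresses the per-node plots, which helps when there are many terminal nodes
> plot(authorship.model.ctree, type="simple")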

Here is how the Conditional Inference Tree model did in predicting authorship in the test set:

> authorship.test$pred.author.ctree = predict(authorship.model.ctree, authorship.test, type="response")
> table(authorship.test$Author, authorship.test$pred.author.ctree)
> prop.table(table(authorship.test$Author, authorship.test$pred.author.ctree),1)

              Austen London Milton Shakespeare
  Austen         173      8      1           4
  London          18    148      3          12
  Milton           0      6     27           3
  Shakespeare      6     10      5          84

                   Austen      London      Milton Shakespeare
  Austen      0.930107527 0.043010753 0.005376344 0.021505376
  London      0.099447514 0.817679558 0.016574586 0.066298343
  Milton      0.000000000 0.166666667 0.750000000 0.083333333
  Shakespeare 0.057142857 0.095238095 0.047619048 0.800000000

Overall it looks like the Conditional Inference Tree model is doing a worse job predicting authorship compared with the Random Forests model (again, looking at the diagonal).  Again we see the Milton records popping up as having the lowest hit rate for classification, but I think it’s interesting/sad that only 80% of Shakespeare records were correctly classified.  Sorry old man, it looks like this model thinks you’re a little bit like Jack London, and somewhat like Jane Austen and John Milton.

Linear Discriminant Analysis

Finally we come to the real star of this particular show.  Here’s the code I used to train the model:

> authorship.model.lda = lda(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have +  her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train)

There’s a lot of summary information that the lda function spits out by default, so I won’t post it here, but I thought the matrix scatterplot of the records plotted along the 3 linear discriminants looked pretty great, so here it is:

linear discriminants for authorship
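
In case you’d like to recreate a similar matrix scatterplot, here’s a minimal sketch (this may not be exactly how the plot above was produced):

> # project the training records onto the 3 linear discriminants and plot every pair, coloured by author
> authorship.lda.values = predict(authorship.model.lda, authorship.train)
> pairs(authorship.lda.values$x, col=as.integer(authorship.train$Author))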

From this plot you can see that putting all the words in the linear discriminant model seems to have led to pretty good discrimination between authors.  However, it’s in the prediction that you see this model shine:

> authorship.test$pred.author.lda = predict(authorship.model.lda, authorship.test)$class
> table(authorship.test$Author, authorship.test$pred.author.lda)
> prop.table(table(authorship.test$Author, authorship.test$pred.author.lda),1)

              Austen London Milton Shakespeare
  Austen         185      1      0           0
  London           1    180      0           0
  Milton           0      0     36           0
  Shakespeare      0      0      0         105

                   Austen      London      Milton Shakespeare
  Austen      0.994623656 0.005376344 0.000000000 0.000000000
  London      0.005524862 0.994475138 0.000000000 0.000000000
  Milton      0.000000000 0.000000000 1.000000000 0.000000000
  Shakespeare 0.000000000 0.000000000 0.000000000 1.000000000

As you can see, it performed flawlessly with Milton and Shakespeare, and almost flawlessly with Austen and London.

Looking at an explanation of LDA from dtreg, I’m thinking that LDA is performing best here because the discriminant functions created are more sensitive to the boundaries between authors (defined here by stop word counts) than the binary splits made on the predictor variables in the decision tree models.  Does this hold for all cases of classification where the predictor variables are numeric, or does it break down if the normality of the predictors is grossly violated?  Feel free to give me a more experienced/learned opinion in the comments!

Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees

Now that I’m on my winter break, I’ve been taking a little bit of time to read up on some modelling techniques that I’ve never used before. Two such techniques are Random Forests and Conditional Trees.  Since both can be used for classification, I decided to see how they compare against a simple binomial logistic regression (something I’ve worked with a lot) for binary classification.

The dataset I used contains records of the survival of Titanic Passengers and such information as sex, age, fare each person paid, number of parents/children aboard, number of siblings or spouses aboard, passenger class and other fields (The titanic dataset can be retrieved from a page on Vanderbilt’s website replete with lots of datasets; look for “titanic3”).

I took one part of the dataset to train my models, and another part to test them.  The factors that I focused on were passenger class, sex, age, and number of siblings/spouses aboard.
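
For reference, here’s a minimal sketch of one way to do such a split (the titanic data frame name and the rough 50/50 proportion are assumptions on my part; the ctree output further down shows about 650 training records):

> titanic$randu = runif(nrow(titanic), 0, 1)
> titanic.train = titanic[titanic$randu < .5,]
> titanic.test = titanic[titanic$randu >= .5,]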

First, let’s look at the GLM:

titanic.survival.train = glm(survived ~ pclass + sex + pclass:sex + age + sibsp,
family = binomial(logit), data = titanic.train)

As you can see, I worked in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women than amongst the men (i.e. higher-class women were much more likely to survive than lower-class women, whereas first-class men were more likely to survive than 2nd or 3rd class men, but not by the same margin as amongst the women).  Following is the model summary output, if you’re interested:

> summary(titanic.survival.train)

Call:
glm(formula = survived ~ pclass + sex + pclass:sex + age + sibsp, 
    family = binomial(logit), data = titanic.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9342  -0.6851  -0.5481   0.5633   2.3164  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     6.556807   0.878331   7.465 8.33e-14 ***
pclass         -1.928538   0.278324  -6.929 4.24e-12 ***
sexmale        -4.905710   0.785142  -6.248 4.15e-10 ***
age            -0.036462   0.009627  -3.787 0.000152 ***
sibsp          -0.264190   0.106684  -2.476 0.013272 *  
pclass:sexmale  1.124111   0.299638   3.752 0.000176 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 858.22  on 649  degrees of freedom
Residual deviance: 618.29  on 644  degrees of freedom
AIC: 630.29

Number of Fisher Scoring iterations: 5

So, after I used my model to predict survival probabilities on the testing portion of the dataset, I checked to see how many records showed a predicted probability over .5 (or 50%), and then how many of those records were actual survivors.  For the GLM, 146 of the 164 records (89%) scored over 50% were actual survivors.  Not bad!
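
That check looks roughly like the following sketch (the pred.glm column name is just for illustration):

> titanic.test$pred.glm = predict(titanic.survival.train, titanic.test, type="response")
> sum(titanic.test$pred.glm > .5)                               # how many records scored over 50%
> sum(titanic.test$pred.glm > .5 & titanic.test$survived == 1)  # ...and how many of those actually survived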

Now let’s move on to Random Forests:

> titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train,ntree=5000, importance=TRUE)

> titanic.survival.train.rf

Call:
 randomForest(formula = as.factor(survived) ~ pclass + sex + age +      sibsp, data = titanic.train, ntree = 5000, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 5000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 22.62%
Confusion matrix:
    0   1 class.error
0 370  38  0.09313725
1 109 133  0.45041322

> importance(titanic.survival.train.rf)
               0          1 MeanDecreaseAccuracy MeanDecreaseGini
pclass  67.26795 125.166721            126.40379         34.69266
sex    160.52060 221.803515            224.89038         62.82490
age     70.35831  50.568619             92.67281         53.41834
sibsp   60.84056   3.343251             52.82503         14.01936

It seems to me that the output indicates that the Random Forests model is better at identifying the non-survivors (true negatives) than the survivors (true positives), but when I asked for the predicted survival categories in the testing portion of my dataset, it appeared to do a pretty decent job predicting who would survive and who wouldn’t:
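
A sketch of that prediction step (the pred.rf column name is just for illustration):

> titanic.test$pred.rf = predict(titanic.survival.train.rf, titanic.test, type="response")
> table(titanic.test$survived, titanic.test$pred.rf)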

For the Random Forests model, 155/184 (84%) of those records predicted to survive actually did survive!  Again, not bad.

Finally, let’s move on to the Conditional Tree model:

> titanic.survival.train.ctree = ctree(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train)

> titanic.survival.train.ctree

	 Conditional inference tree with 7 terminal nodes

Response:  as.factor(survived) 
Inputs:  pclass, sex, age, sibsp 
Number of observations:  650 

1) sex == {female}; criterion = 1, statistic = 141.436
  2) pclass <= 2; criterion = 1, statistic = 55.831
    3) pclass <= 1; criterion = 0.988, statistic = 8.817
      4)*  weights = 74 
    3) pclass > 1
      5)*  weights = 47 
  2) pclass > 2
    6)*  weights = 105 
1) sex == {male}
  7) pclass <= 1; criterion = 1, statistic = 15.095
    8)*  weights = 88 
  7) pclass > 1
    9) age <= 12; criterion = 1, statistic = 14.851
      10) sibsp <= 1; criterion = 0.998, statistic = 12.062
        11)*  weights = 18 
      10) sibsp > 1
        12)*  weights = 12 
    9) age > 12
      13)*  weights = 306 
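
The plot of the fitted tree, shown below, presumably comes straight from the package’s plot method; a guess at the call:

> plot(titanic.survival.train.ctree)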

Titanic Conditional Tree

I really happen to like the graph output by plotting the conditional tree model; I find it pretty easy to understand.  As you can see, the model split the data first according to sex, which it found to be the most significant factor, then pclass, then age, and then siblings/spouses.  That being said, let’s look at how the prediction went, for comparison purposes:

134/142 (94%) of records predicted to survive actually did survive!  Great!!  Now let’s bring all those prediction stats into one table for a final look:

                     glm randomForests ctree
Actually Survived    146           155   134
Predicted to Survive 164           184   142
% Survived           89%           84%   94%

So, in terms of finding the highest raw number of folks who actually did survive, the Random Forests model won, but in terms of having the highest proportion of predicted survivors who actually did survive (the “% Survived” row above), the Conditional Tree model takes the cake!  Neat exercise!

Notes:

(1) There were numerous records without an age value.  I estimated ages for those records using a regression model fit on the records that did have age values.  Following are the coefficients for that model (a short sketch of how the formula might be applied appears after these notes):

44.59238 + (-5.98582*pclass) + (.15971*fare) + (-.14141*pclass:fare)

(2) Although I used GLM scores of over .5 to classify records as survived, I could have played with that threshold to get different results.  I really just wanted to have some fun, and not try all the possibilities here :)

(3) I hope you enjoyed this post!  If you have anything to teach me about Random Forests or Conditional Trees, please leave a comment and be nice (I’m trying to expand my analysis skills here!!).
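
As mentioned in note (1), here’s a minimal sketch of how that age-imputation formula might be applied (column names follow the titanic3 dataset; the pclass:fare term is just the product of pclass and fare):

> # fill in missing ages using the fitted coefficients from note (1)
> missing.age = is.na(titanic$age)
> titanic$age[missing.age] = 44.59238 - 5.98582*titanic$pclass[missing.age] +
+     0.15971*titanic$fare[missing.age] - 0.14141*titanic$pclass[missing.age]*titanic$fare[missing.age]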

EDIT:
I realized that I should also post the false negative rate for each model, because it’s important to know how many actual survivors each model missed (i.e. predicted not to survive).  So, here are the stats (the rate shown is the number of missed survivors as a proportion of all records predicted not to survive):

                           glm    rf ctree   gbm
Actually Survived          112   103   124    74
Predicted Not to Survive   495   475   517   429
False Negative Rate      22.6% 21.7%   24% 17.2%

As you can see, the gbm model shows the lowest false negative rate, which is pretty nice!  Compare that against the finding that it correctly predicted the survival of 184/230 (80%) of the records it scored with a probability of 50% or higher, and that makes for a pretty decent model.