Now that I’m on my winter break, I’ve been taking a little bit of time to read up on some modelling techniques that I’ve never used before. Two such techniques are Random Forests and Conditional Trees. Since both can be used for classification, I decided to see how they compare against a simple binomial logistic regression (something I’ve worked with a lot) for binary classification.
The dataset I used contains records of the survival of Titanic Passengers and such information as sex, age, fare each person paid, number of parents/children aboard, number of siblings or spouses aboard, passenger class and other fields (The titanic dataset can be retrieved from a page on Vanderbilt’s website replete with lots of datasets; look for “titanic3”).
I took one part of the dataset to train my models, and another part to test them. The factors that I focused on were passenger class, sex, age, and number of siblings/spouses aboard.
First, let’s look at the GLM:
titanic.survival.train = glm(survived ~ pclass + sex + pclass:sex + age + sibsp, family = binomial(logit), data = titanic.train)
As you can see, I worked in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women compared to the men (i.e. Higher class women were much more likely to survive than lower class women, whereas first class Men were more likely to survive than 2nd or 3rd class men, but not by the same margin as amongst the women). Following is the model summary output, if you’re interested:
> summary(titantic.survival.train) Call: glm(formula = survived ~ pclass + sex + pclass:sex + age + sibsp, family = binomial(logit), data = titanic.train) Deviance Residuals: Min 1Q Median 3Q Max -2.9342 -0.6851 -0.5481 0.5633 2.3164 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 6.556807 0.878331 7.465 8.33e-14 *** pclass -1.928538 0.278324 -6.929 4.24e-12 *** sexmale -4.905710 0.785142 -6.248 4.15e-10 *** age -0.036462 0.009627 -3.787 0.000152 *** sibsp -0.264190 0.106684 -2.476 0.013272 * pclass:sexmale 1.124111 0.299638 3.752 0.000176 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 858.22 on 649 degrees of freedom Residual deviance: 618.29 on 644 degrees of freedom AIC: 630.29 Number of Fisher Scoring iterations: 5
So, after I used my model to predict survival probabilities on the testing portion of the dataset, I checked to see how many records showed a probability of over .5 (or 50%), and then how many of those records were actual survivors. For the GLM, 146/164 (89%) of those records scored at 50% or higher were actual survivors. Not bad!
Now let’s move on to Random Forests:
> titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train,ntree=5000, importance=TRUE) > titanic.survival.train.rf Call: randomForest(formula = as.factor(survived) ~ pclass + sex + age + sibsp, data = titanic.train, ntree = 5000, importance = TRUE) Type of random forest: classification Number of trees: 5000 No. of variables tried at each split: 2 OOB estimate of error rate: 22.62% Confusion matrix: 0 1 class.error 0 370 38 0.09313725 1 109 133 0.45041322 > importance(titanic.survival.train.rf) 0 1 MeanDecreaseAccuracy MeanDecreaseGini pclass 67.26795 125.166721 126.40379 34.69266 sex 160.52060 221.803515 224.89038 62.82490 age 70.35831 50.568619 92.67281 53.41834 sibsp 60.84056 3.343251 52.82503 14.01936
It seems to me that the output indicates that the Random Forests model is better at creating true negatives than true positives, with regards to survival of the passengers, but when I asked for the predicted survival categories in the testing portion of my dataset, it appeared to do a pretty decent job predicting who would survive and who wouldn’t:
For the Random Forests model, 155/184 (84%) of those records predicted to survive actually did survive! Again, not bad.
Finally, lets move on to the Conditional Tree model:
> titanic.survival.train.ctree = ctree(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train) > titanic.survival.train.ctree Conditional inference tree with 7 terminal nodes Response: as.factor(survived) Inputs: pclass, sex, age, sibsp Number of observations: 650 1) sex == {female}; criterion = 1, statistic = 141.436 2) pclass <= 2; criterion = 1, statistic = 55.831 3) pclass <= 1; criterion = 0.988, statistic = 8.817 4)* weights = 74 3) pclass > 1 5)* weights = 47 2) pclass > 2 6)* weights = 105 1) sex == {male} 7) pclass <= 1; criterion = 1, statistic = 15.095 8)* weights = 88 7) pclass > 1 9) age <= 12; criterion = 1, statistic = 14.851 10) sibsp <= 1; criterion = 0.998, statistic = 12.062 11)* weights = 18 10) sibsp > 1 12)* weights = 12 9) age > 12 13)* weights = 306
I really happen to like the graph output by plotting the conditional tree model. I find it pretty easy to understand. As you can see, the model started the split of the data according to sex, which it found to be most significant, then pclass, age, and then siblings/spouses. That being said, let’s look at how the prediction went for comparison purposes:
134/142 (94%) of records predicted to survive actually did survive! Great!! Now let’s bring all those prediction stats into one table for a final look:
glm randomForests ctree Actually Survived 146 155 134 Predicted to Survive 164 184 142 % Survived 89% 84% 94%
So, in terms of finding the highest raw number of folks who actually did survive, the Random Forests model won, but in terms of having the highest true positive rate, the Conditional Tree model takes the cake! Neat exercise!
Notes:
(1) There were numerous records without an age value. I estimated ages using a regression model taken from those who did have ages. Following are the coefficients for that model:
44.59238 + (-5.98582*pclass) + (.15971*fare) + (-.14141*pclass:fare)
(2) Although I used GLM scores of over .5 to classify records as survived, I could have played with that to get different results. I really just wanted to have some fun, and not try all possibilities here:)
(3) I hope you enjoyed this post! If you have anything to teach me about Random Forests or Conditional Trees, please leave a comment and be nice (I’m trying to expand my analysis skills here!!).
EDIT:
I realized that I should also post the false negative rate for each model, because it’s important to know how many people each model missed coding as survived. So, here are the stats:
glm rf ctree gbm Actually Survived 112 103 124 74 Predicted Not to Survive 495 475 517 429 False Negative Rate 22.6% 21.7% 24% 17.2%
As you can see, the gbm model shows the lowest false negative rate, which is pretty nice! Compare that against the finding that it correctly predicted the survival of 184/230 (80%) records that were scored with a probability of 50% or higher, and that makes a pretty decent model.
Nice trick, i would recommend to also look at http://en.wikipedia.org/wiki/Gradient_boosting
In short it will try to classify survivors, but add the mistakes to the earlier population. This way the complicated cases will be added into the mix, and the regression trees will pay more attention to them.
I think it would fit right up with the random forest and the conditional tress, but your take would be interesting 🙂
Okay I checked out the gbm package and attempted to fit my training data using it. Here we go:
titanic.survival.train.gbm = gbm(survived ~ pclass + sex + age + sibsp, data=titanic.train, distribution=”bernoulli”,n.trees=5000, verbose=TRUE)
Using this model, I found the following summary stats (I really like that it breaks down relative influence into percents!):
var rel.inf
1 sex 57.567932
2 pclass 29.718241
3 age 8.550585
4 sibsp 4.163243
Then finally, 184/230 (80%) of records scored as over 50% likely to survive actually did survive. Neat!!
For the GLM model:
titanic.survival.train = glm(survived ~ pclass + sex + pclass:sex + age + sibsp,
family = binomial(logit), data = titanic.train)
What does pclass:sex mean? Are they multiplied together to create a cross term?
Thanks!
Hi there,
pclass by sex is an interaction term, whereby pclass is multiplied by sex. Pclass has 3 values (1,2,3) and sex has two (female/male, or 0,1). So when you multiply them together, you get 4 values: 0, 1, 2, 3. The values above 0 only show up when a record is marked as male, hence the variable title “pclass:sexmale” in the GLM output above.
If there were no special effect for passenger class for males, then I think pclass:sexmale should pretty much look like pclass and therefore wouldn’t be significant. However there is a special effect for passenger class for males (it’s much smaller than that of the females), so therefore it is significant!
Can you share the code you used to generate the random forest and conditional tree models?
Hi Jon,
Thanks for the input. I edited my post to include the calls I made to the randomForest and ctree functions!
Why are you giving your Logistic Regression the benefit of the interaction term, but none of the other models? Wouldn’t it be fair to either remove that interaction, or give it as a term for the other models as well?
I tried to put the interaction term in the other models as well, but they just ignore it. I think that by their nature, decision tree type models already discover interactions between variables without you having to specify them!
I also tried to put the interaction term in the gbm model I posted in my comment above, but it came out as having 0 relative importance!
That is correct, decision tree type models do discover their own interactions, as would a support vector machine (another interesting one to try on this data…).
Very interesting post! I’m going to play around with this data some myself, now.
Having the interaction in the GLM/LR model means you’re applying ‘expert’ knowledge – something that isn’t possible or as necessary in the tree models. How does your GLM model compare to the trees if you don’t include the expert knowledge of the interaction term?
Believe it or not, taking out the interaction term actually improved the prediction quite a bit:
188/232 (81%) records with a score of over 50% were actual survivors, whereas 70/427 (16%) records with a score of 50% or below were actual survivors.
So it has a pretty good true positive rate, and a pretty low false negative rate!
Pingback: Who Survived on the Titanic? Predictive Classification with Parametric and Non-parametric Models « Political Methodology
For the GLM, can you please provide a piece of code where you checked how many records showed a probability of over .5 (or 50%), and then how many of those records were actual survivors.
Hi there,
First I had to predict the probabilities for the test sample:
titanic.test2$probs.glm = predict(titantic.survival.train, titanic.test2, type=”response”)
Once I had those, then I could check how many showed a probability over .5:
sum(titanic.test2$probs.glm > .5)
And then I could get the number of those who were actual survivors:
sum(titanic.test2$probs.glm > .5 & titanic.test2$survived == 1)
same can be computed as the Classification table for Logistic Regression using ClassLog() from QuantPsyc package.
Pingback: My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis | Data and Analysis with R, at Work
Great post! Question: Why didn’t you treat pclass as a categorical variable in your logistic regression? Would results change? Any additional commentary regarding this detail would be much appreciated.
Hola chilito,
If I modify the GLM I made for “thebigjc” (the one without the interaction term that was making my survival estimates too conservative) so that passenger class is a factor rather than numeric variable, I don’t get too much improvement in the stats:
193/240 (80.4%) of records with a score of over 50% actually survived, whereas 65/419 (16%) of records with a score of 50% or below were survivors. The true positive and false negative rates basically stay the same, even though the raw numbers change a little bit.
I think that you can reasonably assume here that passenger class is an ordinal variable with a linear effect on survival rate, and therefore it’s not so bad to model it as numeric!
Pingback: My Intro to Multiple Classification… « Another Word For It
Pingback: Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees | Process Improvement Stuff I want to Know | Scoop.it
Hi,
I saw this on the R-Blogger’s email list back in December, and have finally gotten around to playing with this data!
I’ll apologize in advance for the total noob question…after your ran the model on your training set, how did you run (or cross-validate) the model on your test set?
FWIW, I found the “caret” package to be helpful with taking care of the mundate work of dividing up a sample into a training and test subset, and found the “pscl” package and its “hitmiss” function to easily compute the percentage correctly classified.
Thanks again for such an interesting blog post!
Hi there,
Thanks for your comment! I’ve never tried the packages you mentioned before. Breaking up datasets into train/test samples though is important to me. I’ll be sure to try the caret package next time I do some modelling!
As we all know: So many packages! So little time…
hi!
can you please provide me a code for “confusion matrix” (logistic regression) in R in order to finish finally my homework 😦
Thanks,
nicole
Install the R package “caret” and then type in
?confusionMatrix
When you are done reading that, then just enter the command in a two step process–first, obtain a prediction model, then use the confusionMatrix command on the model to get the stats on accuracy of the model. So something like
your.Model <- predict(model.developed.using.training.set, newdata=your.training.set) #that should be one line
confusionMatrix(your.Model, true.state) #true.state would be something like "titantic$Survived").
As always, Google is your friend.
A few thinks on your Random Forest model.
1) Did you check the error convergence on your RF model by plotting the model object? It is a misnomer that you cannot overfit a RF model. One sure way is to over-correlate the ensemble and 5000 trees seems like a bit overkill that may lead to an overfit problem.
2) We have observed that model selection is important (Murphy et al., 2010; Evans et al., 2011). This is because RF does not have to account for noise in node splits making plurality in the votes matrix much cleaner.
3) The standard probability cutoff for a RF model (for the positive class) is p=0.65 (Evans & Cushman (2008). I tested this critical value, via simulation when we came up with the posterior distribution method for deriving class probabilities and found that p=0.65 is very stable and consistent with the Breiman (2001) votes threshold.
4) Random Forest (and Conditional Tree) is non-linear and accounts for high-dimensional interactions. As such, it is a very bad idea to withhold data. This is because you have no idea how the data is being truncated in p-dimensional space. A nonlinear relationship can occur in a small portion of the distribution and you may unintentional remove the portion of the variable that is driving the process or defining the interaction. This is one of the problems in comparing linear and nonlinear models and should be a major consideration in interpreting results. Absolute accuracy can only tell you so much when modeling complex nonlinear data.
4) How did the models compare in relation to the ROC/AUC and Kappa?
Hi Jeffrey,
Thank you for your post/comment.
You have stated that , ‘it is a very bad idea to withhold data’, especially for non-linear models. Do you suggest we use 100% of the data in the training set? OR do you have other recommendations on obtaining the test data?
I would also add that you have a bit of a class balance (zero inflation) problem with 64% of observations in the “survived” column representing the negative class (n=1313, n[1]=449, n[0]=846).
is there anyway we can extract coefficients of each significant variables from ctree methods?