2.3 Logistic regression
We will first build a binary logistic regression model on the t_train data and then assess its performance on the t_test data. For this we will use the caret package, although the glm() function in base R would also suffice.
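For reference, a base R fit would look roughly like this (just a sketch, assuming the same column drops as below; the m0 name is only illustrative):

m0 <- glm(Survived ~ .,                # all remaining columns as predictors
          data = t_train[, -c(8:10)],  # drop Name, Ticket, and Cabin
          family = binomial())         # binary logistic regression

We will stick with caret::train() here, since it gives us a consistent interface for prediction later.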
m1 <- caret::train(Survived ~ .,
                   data = t_train[, -c(8:10)], # Drop Name, Ticket, and Cabin
                   method = "glm",
                   family = binomial())
summary(m1)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6753 -0.6524 -0.4155 0.6583 2.4333
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.484932 0.649749 8.442 < 2e-16 ***
## Pclass -1.170724 0.166886 -7.015 2.30e-12 ***
## Sexmale -2.485338 0.219876 -11.303 < 2e-16 ***
## Age -0.045550 0.008486 -5.367 7.99e-08 ***
## SibSp -0.418137 0.119379 -3.503 0.000461 ***
## Parch -0.099356 0.133415 -0.745 0.456445
## Fare 0.002014 0.002562 0.786 0.431715
## EmbarkedQ 0.105943 0.415548 0.255 0.798764
## EmbarkedS -0.361276 0.262003 -1.379 0.167925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 644.08 on 705 degrees of freedom
## AIC: 662.08
##
## Number of Fisher Scoring iterations: 5
If you have seen the movie Titanic, perhaps you know that the ship’s captain followed certain rules for evacuation. Women and children got to go first, and therefore, had a very high chance of survival. On the other hand, men from 3rd class had almost no chance of survival.
We get to see that playing out in the data. As Pclass increases, the probability of survival drops; we will compute the odds ratios to quantify this. Similarly, males on average had a much smaller chance of survival than females, and younger passengers had a much higher chance of survival than older ones. Interestingly, people with siblings and/or a spouse on the Titanic had a lower probability of survival! I don’t know the reason for this.
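To put numbers on that, here is a quick sketch (assuming the m1 object fitted above): the underlying glm fit lives in m1$finalModel, and exponentiating its coefficients gives the odds ratios.

# Odds ratios: multiplicative change in the odds of survival per one-unit
# increase in each predictor (values below 1 mean lower odds of survival)
exp(coef(m1$finalModel))

# Wald confidence intervals on the odds-ratio scale
exp(confint.default(m1$finalModel))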
2.3.1 Model performance
Let’s check the out-of-sample performance of the model on the t_test data set. For this, we will use the confusionMatrix() function from the caret package.
caret::confusionMatrix(predict(m1, subset(t_test, select = -Survived)),
                       reference = t_test$Survived,
                       positive = "Survived")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Diseased Survived
## Diseased 95 15
## Survived 14 53
##
## Accuracy : 0.8362
## 95% CI : (0.7732, 0.8874)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.38e-10
##
## Kappa : 0.6528
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7794
## Specificity : 0.8716
## Pos Pred Value : 0.7910
## Neg Pred Value : 0.8636
## Prevalence : 0.3842
## Detection Rate : 0.2994
## Detection Prevalence : 0.3785
## Balanced Accuracy : 0.8255
##
## 'Positive' Class : Survived
##
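If you want to see where the headline numbers come from, here is a minimal sketch (assuming the same m1 and t_test objects, and the class labels shown in the output above) that rebuilds the 2x2 table by hand:

# Predicted classes on the test set
pred <- predict(m1, subset(t_test, select = -Survived))

# Cross-tabulate predictions against the truth
tab <- table(Predicted = pred, Actual = t_test$Survived)

# Accuracy = correct predictions / all predictions
sum(diag(tab)) / sum(tab)

# Sensitivity = correctly predicted survivors / all actual survivors
tab["Survived", "Survived"] / sum(tab[, "Survived"])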
It’s a simple model and yet quite good! In most cases, people were getting accuracies in the low 80%s, so we are not doing badly at all.7
Extension for you to try: Using interactions between variables (e.g., Sex and Age), check whether you can improve the accuracy of the model.↩
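For the syntax side of that exercise, a minimal sketch (assuming the same t_train columns as before; m2 is only an illustrative name) of adding a Sex-by-Age interaction:

# Hypothetical extension: keep all main effects and add a Sex:Age interaction
m2 <- caret::train(Survived ~ . + Sex:Age,
                   data = t_train[, -c(8:10)], # Drop Name, Ticket, and Cabin
                   method = "glm",
                   family = binomial())

You can then rerun the confusionMatrix() comparison above to check whether the interaction improves out-of-sample accuracy.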