2.3 Logistic regression

We first fit a binary logistic regression model on the t_train data and then assess its performance on the t_test data. We use the caret package for this, although the glm() function in base R would also suffice.

m1 <- caret::train(Survived ~ .,
                   data = t_train[, -c(8:10)], # Drop Name, Ticket, and Cabin
                   method = "glm",
                   family = binomial()) 

summary(m1)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6753  -0.6524  -0.4155   0.6583   2.4333  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.484932   0.649749   8.442  < 2e-16 ***
## Pclass      -1.170724   0.166886  -7.015 2.30e-12 ***
## Sexmale     -2.485338   0.219876 -11.303  < 2e-16 ***
## Age         -0.045550   0.008486  -5.367 7.99e-08 ***
## SibSp       -0.418137   0.119379  -3.503 0.000461 ***
## Parch       -0.099356   0.133415  -0.745 0.456445    
## Fare         0.002014   0.002562   0.786 0.431715    
## EmbarkedQ    0.105943   0.415548   0.255 0.798764    
## EmbarkedS   -0.361276   0.262003  -1.379 0.167925    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.86  on 713  degrees of freedom
## Residual deviance: 644.08  on 705  degrees of freedom
## AIC: 662.08
## 
## Number of Fisher Scoring iterations: 5
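
The base R route mentioned earlier can be sketched roughly as follows. This is only an illustration: toy_train is a small simulated data frame made up for this example (it is not the Titanic data), and the effect sizes used to simulate it are arbitrary assumptions.

```r
# Toy data standing in for t_train (hypothetical values, NOT the Titanic data)
set.seed(42)
n <- 200
toy_train <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70))
)

# Simulate survival with lower odds for males and for older passengers
# (the numbers 2, -2.5 and -0.03 are arbitrary assumptions)
p <- plogis(2 - 2.5 * (toy_train$Sex == "male") - 0.03 * toy_train$Age)
toy_train$Survived <- factor(ifelse(runif(n) < p, "Survived", "Diseased"),
                             levels = c("Diseased", "Survived"))

# Base R fit: with a factor response, glm() treats the first level as
# "failure" and models the probability of the remaining level
m_glm <- glm(Survived ~ ., data = toy_train, family = binomial())
summary(m_glm)
```

With the real data, the same call would presumably take t_train[, -c(8:10)] in place of toy_train.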

If you have seen the movie Titanic, perhaps you know that the ship’s captain followed certain rules for evacuation. Women and children got to go first and therefore had a very high chance of survival. On the other hand, men in 3rd class had almost no chance of survival.

We see that playing out in the data. As Pclass increases (that is, moving from 1st class toward 3rd class), the probability of survival drops. We will have to compute the odds ratio to quantify this. Similarly, males on average had a much smaller chance of survival than females. Next, younger passengers had a much higher chance of survival than older passengers. Interestingly, people with siblings and/or a spouse on the Titanic had a lower probability of survival! I don’t know the reason for this.
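
As a rough sketch of how the odds ratios could be obtained, the snippet below exponentiates the significant coefficients. The values are copied by hand from the summary(m1) output above, so that the example runs without the fitted model.

```r
# Coefficients copied from the summary(m1) output (log-odds scale)
log_odds <- c(Pclass  = -1.170724,
              Sexmale = -2.485338,
              Age     = -0.045550,
              SibSp   = -0.418137)

# Exponentiating a logistic regression coefficient gives an odds ratio:
# the multiplicative change in the odds of survival for a one-unit
# increase in the predictor, holding the other predictors fixed
odds_ratio <- exp(log_odds)
round(odds_ratio, 3)

# With the fitted caret object available, the same idea would be
# exp(coef(m1$finalModel))
```

For example, the odds ratio for Pclass is about 0.31, so each one-step increase in Pclass multiplies the odds of survival by roughly 0.31. For Sexmale it is about 0.08, so the odds of survival for a male are roughly one twelfth of those for a comparable female.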

2.3.1 Model performance

Let’s check the out-of-sample performance of the model using the t_test data set. For this, we will use the confusionMatrix() function from the caret package.

caret::confusionMatrix(predict(m1, subset(t_test, select = -Survived)),
                       reference = t_test$Survived,
                       positive = "Survived")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Diseased Survived
##   Diseased       95       15
##   Survived       14       53
##                                           
##                Accuracy : 0.8362          
##                  95% CI : (0.7732, 0.8874)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.38e-10        
##                                           
##                   Kappa : 0.6528          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7794          
##             Specificity : 0.8716          
##          Pos Pred Value : 0.7910          
##          Neg Pred Value : 0.8636          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2994          
##    Detection Prevalence : 0.3785          
##       Balanced Accuracy : 0.8255          
##                                           
##        'Positive' Class : Survived        
## 
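
To see where the headline numbers come from, the following sketch recomputes accuracy, sensitivity, and specificity by hand from the four counts in the confusion matrix above. The counts are typed in manually, so the snippet does not need caret or the test data.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(95, 14, 15, 53), nrow = 2,
             dimnames = list(Prediction = c("Diseased", "Survived"),
                             Reference  = c("Diseased", "Survived")))

tp <- cm["Survived", "Survived"]  # predicted Survived, actually Survived
tn <- cm["Diseased", "Diseased"]  # predicted Diseased, actually Diseased
fp <- cm["Survived", "Diseased"]  # predicted Survived, actually Diseased
fn <- cm["Diseased", "Survived"]  # predicted Diseased, actually Survived

accuracy    <- (tp + tn) / sum(cm)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)       # share of actual survivors identified
specificity <- tn / (tn + fp)       # share of actual non-survivors identified

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

These reproduce the Accuracy, Sensitivity, and Specificity lines reported by confusionMatrix(), with "Survived" as the positive class.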

It’s a simple model and yet quite good! In most cases, people get accuracies in the low 80s (percent), so we are not doing badly at all.

Extension for you to try: Using interactions between variables (e.g., Sex and Age), check whether you can improve the accuracy of the model.
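
As a starting hint for this extension, here is a minimal sketch of how an interaction is written in an R model formula. It uses a simulated data frame (toy, invented for this example, with an arbitrary built-in interaction) rather than the Titanic data, so it shows only the syntax, not the answer.

```r
# Simulated data (hypothetical, NOT the Titanic data)
set.seed(1)
n <- 300
toy <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70))
)

# Assumed data-generating process: the age effect is steeper for males
lp <- 1.5 - 0.01 * toy$Age - ifelse(toy$Sex == "male", 1 + 0.04 * toy$Age, 0)
toy$Survived <- factor(ifelse(runif(n) < plogis(lp), "Survived", "Diseased"),
                       levels = c("Diseased", "Survived"))

# Additive model: Sex and Age enter separately
m_add <- glm(Survived ~ Sex + Age, data = toy, family = binomial())

# Interaction model: Sex * Age expands to Sex + Age + Sex:Age
m_int <- glm(Survived ~ Sex * Age, data = toy, family = binomial())

# Likelihood-ratio test for the added interaction term
anova(m_add, m_int, test = "Chisq")
```

On the real data, the analogous step would be to add a term such as Sex:Age to the formula used for m1 and compare the resulting confusion matrix on t_test with the one above.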