2.1 Data
Let’s start by loading the required packages.
library(dplyr)
library(ggplot2)
library(caret)
library(mice)
library(psych)
library(doParallel)
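Read the Titanic training and test data files.
# stringsAsFactors = TRUE reproduces the factor columns shown below (R >= 4.0 defaults to FALSE).
titanic_train <- read.csv("http://bit.ly/2DVwM0d", stringsAsFactors = TRUE)
titanic_test <- read.csv("http://bit.ly/2Jn7USt", stringsAsFactors = TRUE)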
We will use titanic_train for model building and then, if you wish, you can test the efficacy of the model using titanic_test. However, titanic_test doesn’t have the true values of the dependent variable, Survived. To know whether your classification is good, you will have to submit it on Kaggle and get the score.
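A Kaggle submission for this contest is a CSV file with two columns, PassengerId and Survived. The following is a minimal sketch of writing such a file. The IDs and predictions are hypothetical placeholders rather than output from any model in this chapter, and the file is written to a temporary directory purely for illustration.

```r
# Hypothetical placeholders: three test passengers and made-up 0/1 predictions.
test_ids <- c(892, 893, 894)
pred <- c(0, 1, 0)

# Kaggle expects exactly these two column names.
submission <- data.frame(PassengerId = test_ids, Survived = pred)

# row.names = FALSE keeps R from adding an extra index column.
out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE)
```

In practice, test_ids would come from titanic_test$PassengerId and pred from your fitted model.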
Let’s find the structure of the data and what it contains.
str(titanic_train, vec.len = 2)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 ...
## $ Survived : int 0 1 1 1 0 ...
## $ Pclass : int 3 1 3 1 3 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 ...
## $ Parch : int 0 0 0 0 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 ...
## $ Fare : num 7.25 71.28 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 ...
Note that PassengerId, Name, Ticket, and Cabin seem like variables that we will not use in modeling. However, as we will see below, there is a possibility of using the information contained in Name. Also notice that Embarked has 4 levels but one of them is blank. Take a look at its distribution:
table(titanic_train$Embarked)
##
##       C   Q   S
##   2 168  77 644
Because only 2 values are missing, we can either drop these observations or impute them. An easy fix is to replace them with the mode of the distribution, which is S.
titanic_train <- titanic_train %>%
  mutate(Embarked = factor(ifelse(Embarked == "", "S", as.character(Embarked))))
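The code above hard-codes "S" as the mode. As a hedged alternative, the mode can be computed from the non-blank values so the replacement adapts to the data. The sketch below uses a small made-up vector (embarked) instead of the real column, and the names observed, mode_level, and imputed are introduced only for illustration.

```r
# Toy stand-in for titanic_train$Embarked, with "" marking a missing port.
embarked <- factor(c("S", "C", "", "S", "Q", "S", ""))

# Find the most frequent level among the non-blank values.
observed <- droplevels(embarked[embarked != ""])
mode_level <- names(which.max(table(observed)))

# Replace blanks with the mode and rebuild the factor without the "" level.
imputed <- factor(ifelse(embarked == "", mode_level, as.character(embarked)))
table(imputed)
```

Note that which.max() returns the first level in case of a tie, so a tie would be resolved alphabetically.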
The variable description from Kaggle is shown in Table 2.1.
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Also, Kaggle provides more information on the variables as follows:
Variable Notes
Pclass: A proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower.
Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
Sibsp: The dataset defines family relations in this way. Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife (mistresses and fiancés were ignored).
Parch: The dataset defines family relations in this way. Parent = mother, father. Child = daughter, son, stepdaughter, stepson. Some children traveled only with a nanny, therefore parch = 0 for them.
2.1.1 Missing values
psych::describe(titanic_train) %>%
  select(-vars, -trimmed, -mad, -range, -se) %>%
  knitr::kable(digits = 2,
               align = "c",
               caption = "Summary Statistics",
               booktabs = TRUE) # kable prints nice-looking tables.
Variable | n | mean | sd | median | min | max | skew | kurtosis |
---|---|---|---|---|---|---|---|---|
PassengerId | 891 | 446.00 | 257.35 | 446.00 | 1.00 | 891.00 | 0.00 | -1.20 |
Survived | 891 | 0.38 | 0.49 | 0.00 | 0.00 | 1.00 | 0.48 | -1.77 |
Pclass | 891 | 2.31 | 0.84 | 3.00 | 1.00 | 3.00 | -0.63 | -1.28 |
Name* | 891 | 446.00 | 257.35 | 446.00 | 1.00 | 891.00 | 0.00 | -1.20 |
Sex* | 891 | 1.65 | 0.48 | 2.00 | 1.00 | 2.00 | -0.62 | -1.62 |
Age | 714 | 29.70 | 14.53 | 28.00 | 0.42 | 80.00 | 0.39 | 0.16 |
SibSp | 891 | 0.52 | 1.10 | 0.00 | 0.00 | 8.00 | 3.68 | 17.73 |
Parch | 891 | 0.38 | 0.81 | 0.00 | 0.00 | 6.00 | 2.74 | 9.69 |
Ticket* | 891 | 339.52 | 200.83 | 338.00 | 1.00 | 681.00 | 0.00 | -1.28 |
Fare | 891 | 32.20 | 49.69 | 14.45 | 0.00 | 512.33 | 4.77 | 33.12 |
Cabin* | 891 | 18.63 | 38.14 | 1.00 | 1.00 | 148.00 | 2.09 | 3.07 |
Embarked* | 891 | 2.54 | 0.79 | 3.00 | 1.00 | 3.00 | -1.26 | -0.22 |
Only Age has missing values: it has 714 non-missing observations out of 891. This makes our job quite easy. We will not throw away the observations with missing values. Instead, we will impute them using random forest. We don’t have to do it manually; we will use the mice() function from the mice package.
set.seed(9009)
miceMod <- mice::mice(select(titanic_train,
                             -c(Survived, PassengerId, Name, Cabin, Ticket)),
                      method = "rf") # perform mice imputation based on random forest.
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 1 4 Age
## 1 5 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 2 4 Age
## 2 5 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 3 4 Age
## 3 5 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 4 4 Age
## 4 5 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
## 5 4 Age
## 5 5 Age
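The log shows 5 iterations for each of 5 imputations, which matches the mice() defaults of m = 5 imputed data sets and maxit = 5 iterations. As a rough sketch of how to inspect the candidate values, the example below runs mice() on nhanes, a small data set bundled with the mice package, using the default imputation method. Both the data set and the method are stand-ins chosen to keep the example self-contained; they are not the Titanic data or the random forest method used in this section.

```r
library(mice)

# nhanes ships with mice; it is used here only as a small stand-in data set.
imp <- mice(nhanes, m = 5, maxit = 5, printFlag = FALSE, seed = 9009)

# One row per missing bmi value, one column per imputation.
head(imp$imp$bmi)

# complete() returns the first imputed data set unless action says otherwise.
first_set <- complete(imp)
third_set <- complete(imp, action = 3)
```

For the model in this section, the analogous object to look at would be miceMod$imp$Age.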
Build a complete data set and add back four of the variables that we removed previously. Also convert Survived into a factor with more explicit labels.
titanic_train2 <- mice::complete(miceMod) %>%
  mutate(Name = titanic_train$Name,
         Cabin = titanic_train$Cabin,
         Ticket = titanic_train$Ticket,
         Survived = factor(ifelse(titanic_train$Survived == 1,
                                  "Survived", "Diseased")))
Check whether there are any missing values:
anyNA(titanic_train2)
## [1] FALSE
There are no missing values any more.
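Keep in mind that anyNA() only detects NA. The str() output earlier shows that Cabin has an empty-string level (""), and such blanks are not counted as missing. A minimal sketch of checking both kinds of gaps per column is given below. It uses a made-up three-row data frame (toy), and the names na_count and blank_count are introduced only for illustration.

```r
# Toy data frame mimicking three Titanic columns.
toy <- data.frame(
  Age = c(22, NA, 26),
  Cabin = factor(c("", "C85", "")),
  Embarked = factor(c("S", "C", "S"))
)

# Number of NA values in each column.
na_count <- colSums(is.na(toy))

# Number of empty strings in each column; NA comparisons are ignored.
blank_count <- sapply(toy, function(x) sum(as.character(x) == "", na.rm = TRUE))

na_count
blank_count
```

The same two lines can be applied to titanic_train2 by replacing toy with that data frame.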
Note that we did not use Name and Cabin to impute the missing Age values because there is likely to be little information in these variables. Interestingly, however, Name also contains the person’s title, which can be extracted and used for model building. It can be a relevant variable, particularly if it contains information that is not captured by other variables. I am refraining from doing it in order to keep this exercise short. Furthermore, it seems that adding these variables doesn’t materially improve prediction accuracy. This is probably because these variables are associated with Pclass, Sex, and Fare.[6]
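For readers who want to try it, here is a minimal sketch of extracting the title with a regular expression. It assumes names follow the "Surname, Title. Given names" layout seen in the str() output (for example "Abbing, Mr. Anthony"). The vector passenger_names and the variable title are illustrative, and this is not necessarily how published solutions do it.

```r
# Illustrative names in the "Surname, Title. Given names" layout.
passenger_names <- c("Abbing, Mr. Anthony",
                     "Heikkinen, Miss. Laina",
                     "Cumings, Mrs. John Bradley (Florence Briggs Thayer)")

# Keep the text between the first comma and the first period after it.
title <- sub("^[^,]*,\\s*([^.]+)\\..*$", "\\1", passenger_names)

table(title)
```

On the real data this would be applied to titanic_train$Name after converting it with as.character(). Rare titles would likely need to be grouped before modeling.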
[6] There are several solutions to the Titanic contest online. You can check their code to see how they used these variables in their models.