2.1 Data

Let’s start by loading the required packages.

library(dplyr)
library(ggplot2)
library(caret)
library(mice)
library(psych)
library(doParallel)

Read the Titanic training and test data files.

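# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0), matching the output below.
titanic_train <- read.csv("http://bit.ly/2DVwM0d", stringsAsFactors = TRUE)
titanic_test <- read.csv("http://bit.ly/2Jn7USt", stringsAsFactors = TRUE)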

We will use titanic_train for model building and then, if you wish, you can test the efficacy of the model using titanic_test. However, titanic_test doesn’t contain the true values of the dependent variable Survived. To find out how good your classification is, you will have to submit your predictions on Kaggle and get the score.

Let’s find the structure of the data and what it contains.

str(titanic_train, vec.len = 2)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 ...
##  $ Survived   : int  0 1 1 1 0 ...
##  $ Pclass     : int  3 1 3 1 3 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 ...
##  $ Age        : num  22 38 26 35 35 ...
##  $ SibSp      : int  1 1 0 1 0 ...
##  $ Parch      : int  0 0 0 0 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 ...
##  $ Fare       : num  7.25 71.28 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 ...

Note that PassengerId, Name, Ticket, and Cabin look like variables that we will not use in modeling. However, as discussed at the end of this section, the information contained in Name could be put to use. Also notice that Embarked has 4 levels, one of which is blank. Take a look at its distribution:

table(titanic_train$Embarked)

## 
##       C   Q   S 
##   2 168  77 644

As only 2 values are blank, we can either drop these observations or impute them. An easy fix is to replace them with the mode of the distribution, which is S.

titanic_train <- titanic_train %>% 
  mutate(Embarked = factor(ifelse(Embarked == "", "S", as.character(Embarked))))
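The mode can also be computed rather than typed in by hand. The following is a minimal sketch on a toy vector; the values and the names `embarked`, `mode_value`, and `embarked_filled` are illustrative assumptions, not part of the Titanic data.

```r
# Toy vector standing in for titanic_train$Embarked (illustrative values only).
embarked <- c("S", "C", "", "S", "Q", "", "S")

# Tabulate the non-blank values and pick the most frequent one.
counts <- table(embarked[embarked != ""])
mode_value <- names(counts)[which.max(counts)]

# Replace blanks with the mode; rebuilding the factor drops the empty level.
embarked_filled <- factor(ifelse(embarked == "", mode_value, embarked))

table(embarked_filled)
```

Computing the mode this way avoids hard-coding "S" if the same step is later applied to a different data set.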

The variable descriptions from Kaggle are shown in Table 2.1.

Table 2.1: Variables Description
Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
Age	Age in years
sibsp	# of siblings / spouses aboard the Titanic
parch	# of parents / children aboard the Titanic
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Also, Kaggle provides more information on the variables as follows:

Variable Notes

Pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

Sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

Parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children traveled only with a nanny, therefore parch = 0 for them.
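The note on Age implies that infants and estimated ages can be told apart from the value alone. Here is a small sketch of that rule on a toy vector; the ages and the names `is_infant` and `is_estimated` are made up for illustration.

```r
# Toy ages (illustrative values, not taken from the data set).
age <- c(0.42, 22, 28.5, 35, 40.5)

# Ages below 1 are fractional because they describe infants.
is_infant <- age < 1

# Ages of 1 or more that end in .5 are estimates, according to Kaggle's note.
is_estimated <- age >= 1 & (age %% 1 == 0.5)

data.frame(age, is_infant, is_estimated)
```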

2.1.1 Missing values

psych::describe(titanic_train) %>%
  select(-vars, -trimmed, -mad, -range, -se) %>% 
  knitr::kable(digits = 2,
             align = "c",
             caption = "Summary Statistics",
             booktabs = TRUE) # kable prints nice-looking tables.

Table 2.2: Summary Statistics
	n	mean	sd	median	min	max	skew	kurtosis
PassengerId	891	446.00	257.35	446.00	1.00	891.00	0.00	-1.20
Survived	891	0.38	0.49	0.00	0.00	1.00	0.48	-1.77
Pclass	891	2.31	0.84	3.00	1.00	3.00	-0.63	-1.28
Name*	891	446.00	257.35	446.00	1.00	891.00	0.00	-1.20
Sex*	891	1.65	0.48	2.00	1.00	2.00	-0.62	-1.62
Age	714	29.70	14.53	28.00	0.42	80.00	0.39	0.16
SibSp	891	0.52	1.10	0.00	0.00	8.00	3.68	17.73
Parch	891	0.38	0.81	0.00	0.00	6.00	2.74	9.69
Ticket*	891	339.52	200.83	338.00	1.00	681.00	0.00	-1.28
Fare	891	32.20	49.69	14.45	0.00	512.33	4.77	33.12
Cabin*	891	18.63	38.14	1.00	1.00	148.00	2.09	3.07
Embarked*	891	2.54	0.79	3.00	1.00	3.00	-1.26	-0.22
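The n column in Table 2.2 counts the non-missing values of each variable. A more direct check is to count NA values per column, as in the sketch below. It uses a toy data frame with made-up values, and the names `toy`, `na_counts`, and `blank_counts` are assumptions for illustration. Note that blank strings, such as the empty level of Cabin seen in the str() output, are not counted as NA.

```r
# Toy data frame (illustrative values only).
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Fare  = c(7.25, 71.28, 7.92, 53.10),
  Cabin = c("", "C85", "", "C123"),
  stringsAsFactors = TRUE
)

# Number of NA values in each column.
na_counts <- colSums(is.na(toy))

# Number of blank strings in each column; these do not register as NA.
blank_counts <- sapply(toy, function(col) sum(as.character(col) == "", na.rm = TRUE))

na_counts
blank_counts
```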

Only Age has missing values. This makes our job quite easy. We will not throw away the observations with missing values. Instead, we will impute them using a random forest. We don’t have to do this manually: we will use the mice() function from the mice package.

set.seed(9009)

miceMod <- mice::mice(select(titanic_train, 
                             -c(Survived, PassengerId, Name, Cabin, Ticket)), 
                method = "rf")  # perform mice imputation based on random forest.

## 
##  iter imp variable
##   1   1  Age
##   1   2  Age
##   1   3  Age
##   1   4  Age
##   1   5  Age
##   2   1  Age
##   2   2  Age
##   2   3  Age
##   2   4  Age
##   2   5  Age
##   3   1  Age
##   3   2  Age
##   3   3  Age
##   3   4  Age
##   3   5  Age
##   4   1  Age
##   4   2  Age
##   4   3  Age
##   4   4  Age
##   4   5  Age
##   5   1  Age
##   5   2  Age
##   5   3  Age
##   5   4  Age
##   5   5  Age

Build a complete data set and add back four of the five variables that we removed previously (PassengerId is left out). Also convert Survived into a factor with more explicit labels.

titanic_train2 <- mice::complete(miceMod) %>% 
  mutate(Name = titanic_train$Name,
         Cabin = titanic_train$Cabin,
         Ticket = titanic_train$Ticket,
         Survived = factor(ifelse(titanic_train$Survived == 1, 
                                  "Survived", "Deceased")))

Check whether there are any missing values:

anyNA(titanic_train2)

## [1] FALSE

There are no missing values any more.
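Keep in mind that mice() creates several imputed data sets (5 by default), and complete() returns only the first of them unless told otherwise. The sketch below illustrates this on nhanes, a small example data set that ships with mice. It uses the default imputation methods rather than "rf" to avoid extra dependencies, and the object names are assumptions for illustration.

```r
library(mice)

# nhanes is a small example data set shipped with mice; it contains NAs.
# Default imputation methods are used here instead of "rf" (an assumption
# made to keep the sketch free of extra package dependencies).
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 9009)

# Imputed values for one variable: one row per missing case,
# one column per imputed data set.
head(imp$imp$bmi)

# complete() returns the first imputed data set by default (action = 1).
first_set  <- complete(imp)
second_set <- complete(imp, action = 2)

anyNA(first_set)
```

Comparing `first_set` and `second_set` gives a feel for how much the imputed values vary from one imputation to the next.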

Note that we did not use Name and Cabin to impute missing age because there is likely to be little information in these variables. But, interestingly, Name also contains the person’s title, which can be extracted and used for model building. It can be a relevant variable in particular if it contains information that is not captured by other variables. I am refraining from doing it in order to keep this exercise short. Furthermore, it seems that adding these variables doesn’t materially improve prediction accuracy. This is probably because these variables are associated with Pclass, Sex, and Fare.⁶
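For readers who want to try it, here is one possible way to pull the title out of a name. It is only a sketch: it assumes names follow the "Surname, Title. Given names" pattern seen in the data, and the regular expression and the variable names are illustrative choices rather than a method used in this exercise.

```r
# A few names in the format used by the Titanic data.
passenger_names <- c("Braund, Mr. Owen Harris",
                     "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                     "Heikkinen, Miss. Laina")

# Keep the text between the first comma and the following period.
titles <- sub("^[^,]*,\\s*([^.]+)\\..*$", "\\1", passenger_names)

table(titles)
```

Names that do not follow this pattern are returned unchanged by sub(), so it is worth inspecting the resulting table before using the titles in a model.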

⁶ There are several solutions to the Titanic contest online. You can check their code to see how they used these variables in their models.