2.2 Training and test sets

Although Kaggle provides both a training set and a test set, its test set does not include the Survived labels, so we can't use it for model evaluation. We must therefore create our own test set from the training data. We will use createDataPartition() from the caret package to create an index of row numbers to keep in the training set. The remaining rows go in the test set.

set.seed(5555)
index <- caret::createDataPartition(
  titanic_train2$Survived, 
  p = 0.8,
  list = FALSE # return a matrix of row numbers (the default, TRUE, returns a list)
)

t_train <- titanic_train2[index,]
t_test <- titanic_train2[-index,]
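
As a quick sanity check on this kind of split, the sketch below repeats the same steps on a made-up data frame and confirms that every row ends up in exactly one subset. The names `toy`, `id`, `outcome`, `toy_train` and `toy_test` are invented for illustration and are not part of the Titanic data.

```r
# Toy data for illustration only; `toy` and its columns are made up.
set.seed(5555)
toy <- data.frame(
  id = 1:100,
  outcome = factor(rep(c("no", "yes"), times = c(60, 40)))
)

# list = FALSE gives a one-column matrix of row numbers,
# which can be used directly to index the rows of a data frame.
idx <- caret::createDataPartition(toy$outcome, p = 0.8, list = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Every row lands in exactly one of the two subsets.
nrow(toy_train) + nrow(toy_test) == nrow(toy)
length(intersect(toy_train$id, toy_test$id)) == 0
```

Both comparisons should evaluate to TRUE. The same two checks can be run on t_train and t_test, provided the data has a column that uniquely identifies each row.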

A useful property of createDataPartition() is that, for a factor outcome, it samples within each class, so the class proportions of the specified variable stay nearly the same in the two data sets. Let’s take a look:

table(titanic_train2$Survived) / length(titanic_train2$Survived)

## 
##  Diseased  Survived 
## 0.6161616 0.3838384

table(t_train$Survived) / length(t_train$Survived)

## 
##  Diseased  Survived 
## 0.6162465 0.3837535

table(t_test$Survived) / length(t_test$Survived)

## 
##  Diseased  Survived 
## 0.6158192 0.3841808
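
To see why the proportions line up so closely, here is a minimal base R sketch of a stratified split. It is a simplified illustration of the idea, not caret's actual implementation, and the data frame `toy` with its `id` and `outcome` columns is made up for the example.

```r
# Simplified stratified split in base R; NOT caret's actual implementation.
# `toy` is a made-up data frame, not the Titanic data.
set.seed(5555)
toy <- data.frame(
  id = 1:100,
  outcome = factor(rep(c("no", "yes"), times = c(60, 40)))
)

# Group the row numbers by class, then sample 80% of each group.
rows_by_class <- split(seq_len(nrow(toy)), toy$outcome)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Class proportions in the full data and in each subset.
prop.table(table(toy$outcome))
prop.table(table(toy_train$outcome))
prop.table(table(toy_test$outcome))
```

Because 80% is drawn from each class separately, the class shares in the subsets can differ from the full data only through rounding. With these toy counts (60 and 40) there is no rounding at all, so all three tables show the same proportions. A plain random sample of 80% of all rows gives no such guarantee.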