2.2 Training and test sets
Although Kaggle provides both a training set and a test set, the Kaggle test set does not include the Survived outcome, so we can't use it for model evaluation. Therefore, we must create our own test set from the training data. We will use createDataPartition() from the caret package to create an index of the row numbers to keep in the training set (80% of the rows). The remaining rows will go in the test set.
set.seed(5555)
index <- caret::createDataPartition(
  titanic_train2$Survived,
  p = 0.8,
  list = FALSE # Caret returns a list by default
)
t_train <- titanic_train2[index,]
t_test <- titanic_train2[-index,]
A useful property of createDataPartition() is that it samples within each class of the specified variable, so the class proportions stay nearly the same in the two data sets. Let's take a look:
table(titanic_train2$Survived) / length(titanic_train2$Survived)
##
## Diseased Survived
## 0.6161616 0.3838384
table(t_train$Survived) / length(t_train$Survived)
##
## Diseased Survived
## 0.6162465 0.3837535
table(t_test$Survived) / length(t_test$Survived)
##
## Diseased Survived
## 0.6158192 0.3841808
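To see why the proportions line up so closely, here is a minimal base-R sketch of stratified sampling. It is not caret's actual implementation, only an illustration of the idea: the toy `outcome` vector, the class labels "A" and "B", and the helper `stratified_index()` are all made up for this example.

```r
# Toy outcome with a 60/40 class balance (made-up data for illustration).
set.seed(1)
outcome <- factor(rep(c("A", "B"), times = c(60, 40)))

# Hypothetical helper: pick a fraction p of the row indices within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # sample.int() avoids the surprise of sample(x) when x has length 1
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(outcome, p = 0.8)

# Class proportions in the full data, the training part, and the held-out part.
prop.table(table(outcome))
prop.table(table(outcome[train_idx]))
prop.table(table(outcome[-train_idx]))
```

Because the sampling happens separately inside each class, both parts inherit the class balance of the full data. A plain random sample of row numbers would only match it on average, which can matter when one class is rare.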