2.4 XGBoost

The next model we will consider is XGBoost. We will first set the training controls. For more information, please read the caret documentation.

We will opt for 5-fold cross-validation. Feel free to change the number of folds to 10 if you want. Note that we are requesting class probabilities by setting classProbs to TRUE. This will return class probabilities along with the predictions. Although we won’t use these probabilities in this example, you can take a look at the predicted probabilities.8 Setting allowParallel to TRUE will enable us to make use of parallel processing if you have a multi-core processor.

trControl <- trainControl(method = "cv",
                          number = 5,
                          verboseIter = FALSE,
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary,
                          savePredictions = TRUE,
                          allowParallel = TRUE)

2.4.1 Hyperparameter tuning

Next, we will create a hyperparameter tuning grid. Hyperparameters are specific to a model. For instance, in logistic regression there are no hyperparameters to tune. However, most machine learning techniques do have hyperparameters, and we have to find their optimal values. The usual method for doing so is grid search, because the relationship between the hyperparameters and model performance is far too complex to solve in closed form; instead, we try many combinations and keep the best one.
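
To see what a tuning grid looks like, here is a toy example using just two of the hyperparameters described below, with made-up candidate values purely for illustration. expand.grid() creates one row per combination, and grid search fits the model once for each row:

expand.grid(eta = c(0.1, 0.3), max_depth = c(2, 4))
##   eta max_depth
## 1 0.1         2
## 2 0.3         2
## 3 0.1         4
## 4 0.3         4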

For the XGBoost tree model, the important hyperparameters to tune are as follows:

eta (\(\small \eta\)): This is also known as the learning rate. eta shrinks the weights associated with features/variables after each boosting step, so it acts as a regularization parameter. \(\small \eta \in [0, 1]\)

gamma (\(\small \gamma\)): This is the minimum loss reduction required to further partition a leaf node. Thus, larger values of gamma make the model more conservative. \(\small \gamma \in [0, \infty]\)

max_depth: Maximum depth of a tree. A deeper tree is more complex and might overfit.

min_child_weight: From XGBoost documentation9 - Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression task, this simply corresponds to minimum number of instances needed to be in each node. The larger min_child_weight is, the more conservative the algorithm will be. min_child_weight \(\small \in [0, \infty]\)

colsample_bytree: The fraction of variables (columns) sampled when constructing each tree. colsample_bytree \(\small \in [0, 1]\)

subsample: The fraction of observations used for training in each boosting iteration. The default is 1 (use all observations).

nrounds: This controls the maximum number of iterations. For classification, this is equivalent to the number of trees to grow.

tuneGrid <- expand.grid(nrounds = seq(10, 100, 10),
                        max_depth = seq(2, 8, 1),
                        eta = c(0.1, 0.2, 0.3),
                        gamma = 10^c(-1:3),
                        colsample_bytree = seq(0, 1, 0.2),
                        min_child_weight = 1,
                        subsample = 1)
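
Since expand.grid() creates one row for every combination of the values above (10 nrounds values × 7 max_depth values × 3 eta values × 5 gamma values × 6 colsample_bytree values), you can verify the size of the grid yourself:

nrow(tuneGrid)
## [1] 6300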

2.4.2 Model training

The next piece of code will do the model training using the controls and grid we created. Note that the tuneGrid object has 6,300 rows, meaning that the model will be estimated 6,300 times. However, that’s not the end of it. We also specified 5-fold cross-validation, which means the model will actually be estimated 31,500 times! This will likely take a lot of time, so I strongly recommend not running it in the class. To speed up the model execution, we will opt for parallel processing using the doParallel package. In the code below, input the number of cores you want to use for parallel processing; this will depend on your computer.
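
If you are not sure how many cores your machine has, the parallel package (loaded along with doParallel) can report it; it is a good idea to leave at least one core free for the rest of your system:

parallel::detectCores()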

Warning: This code might take several minutes to execute depending on your computer!

# Don't run this code in the class

library(doParallel)        # loads foreach and parallel, provides registerDoParallel()

cl <- makePSOCKcluster(6)  # create a cluster of 6 worker processes; adjust to your machine
registerDoParallel(cl)     # register the cluster so caret can use it (allowParallel = TRUE)

set.seed(888)

m2 <- train(Survived ~ .,
            data = t_train[, -c(8:10)], # Drop Name, Ticket, and Cabin
            method = 'xgbTree',
            trControl = trControl,
            tuneGrid = tuneGrid)

stopCluster(cl) # Turn off parallel processing and free up the cores.
registerDoSEQ()

The best hyperparameters for this model and data appear to be as follows:

print(m2$bestTune)
Table 2.3: Best Hyperparameters

     nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
2153      30         2 0.2   0.1                1                1         1

Next we will use these parameters to build the model in the class.

m3 <- train(Survived ~ .,
            data = t_train[, -c(8:10)], # Drop Name, Ticket, and Cabin
            method = 'xgbTree',
            trControl = trControl,
            tuneGrid = data.frame(nrounds = 30, 
                                  max_depth = 2, 
                                  eta = 0.2, 
                                  gamma = 0.1,
                                  colsample_bytree = 1, 
                                  min_child_weight = 1, 
                                  subsample = 1))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.
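
The warning simply means that, because we supplied twoClassSummary in trainControl, train() optimizes the ROC metric rather than Accuracy; it is harmless here. If you are curious which variables the boosted trees rely on most, caret's varImp() function should also work with xgbTree models. The exact scores will depend on your run, so treat this as a sketch rather than expected output:

varImp(m3)        # importance scores, scaled so the top variable is 100
plot(varImp(m3))  # dot plot of the same importances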

2.4.3 Model performance

confusionMatrix(predict(m3, subset(t_test, select = -Survived)),
                reference = t_test$Survived,
                positive = "Survived")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Diseased Survived
##   Diseased      103       19
##   Survived        6       49
##                                           
##                Accuracy : 0.8588          
##                  95% CI : (0.7986, 0.9065)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 9.459e-13       
##                                           
##                   Kappa : 0.6904          
##                                           
##  Mcnemar's Test P-Value : 0.0164          
##                                           
##             Sensitivity : 0.7206          
##             Specificity : 0.9450          
##          Pos Pred Value : 0.8909          
##          Neg Pred Value : 0.8443          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2768          
##    Detection Prevalence : 0.3107          
##       Balanced Accuracy : 0.8328          
##                                           
##        'Positive' Class : Survived        
## 
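
If you want to look at the class probabilities we requested with classProbs = TRUE, predict() can return them instead of hard class labels. A minimal sketch using the same test set:

# one column of predicted probabilities per class, one row per passenger
head(predict(m3, subset(t_test, select = -Survived), type = "prob"))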

It turns out that XGBoost did only about as well as logistic regression. This just goes to show that, in many cases, logistic regression is still a good algorithm to use.10


  8. We use the probabilities in the Insurance Calls example: 4.8

  9. https://xgboost.readthedocs.io/en/latest/parameter.html

  10. What changes can you make to your logistic regression model so that it produces better predictions?