4.6 Tweaking the model (and data!)

We have several tools at our disposal to improve model performance. My first advice is to use grid search to tune mtry. If you have time or a powerful computer at your disposal, tune the number of trees as well; so far we have fixed it at 1,000.
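
If you do want to experiment with the number of trees, note that caret's "rf" method tunes only mtry; ntree is passed straight through to randomForest(), so you have to vary it by hand. Below is a minimal sketch, assuming the dt4_train, trControl, and tuneGrid objects defined earlier in the chapter (the name cv_by_ntree is just for illustration):

# Refit the model for a few candidate ntree values and compare the
# cross-validated accuracy. caret tunes only mtry for method = "rf",
# so ntree has to be looped over manually.
cv_by_ntree <- sapply(c(500, 1000, 1500), function(nt) {
  set.seed(9999)
  fit <- train(CarInsurance ~ . ,
               data = select(dt4_train, -starts_with("Call")),
               method = "rf",
               metric = "Accuracy",
               tuneGrid = tuneGrid,
               trControl = trControl,
               ntree = nt)
  max(fit$results$Accuracy)
})

data.frame(ntree = c(500, 1000, 1500), Accuracy = cv_by_ntree)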

4.6.1 Dropping irrelevant variables

We can drop some of the less important variables from our model, as they might be adding noise. Let's keep only the variables with a scaled importance greater than 10.

# Keep the names of variables whose scaled importance exceeds 10
impvar <- varImp(modelRF2, scale = TRUE)[[1]] %>% 
  tibble::rownames_to_column() %>%
  filter(Overall > 10) %>% 
  pull(rowname)
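
To sanity-check the cut-off of 10 before dropping anything, you can plot the scaled importance scores with caret's plot() method for varImp objects and count how many variables survive the filter. This is an optional check:

plot(varImp(modelRF2, scale = TRUE))
length(impvar)   # 9 variables clear the cut-off in this run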

Now, impvar is a character vector holding the names of the important variables. The next part will take some time to finish running because I am going to try multiple values of mtry. However, as the number of variables is small, it will be much quicker than the larger model above.

set.seed(9999)

modelRF3 <- train(CarInsurance ~ . , 
                 data = select(dt4_train, CarInsurance, impvar), 
                 method = "rf", 
                 metric = "Accuracy",
                 tuneGrid = expand.grid(mtry = c(1:9)),
                 trControl = trControl,
                 ntree = 1000)
print(modelRF3)
## Random Forest 
## 
## 3201 samples
##    9 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2880, 2881, 2881, 2881, 2881, 2880, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.6769974  0.2468218
##   2     0.7019848  0.3267983
##   3     0.7016714  0.3375243
##   4     0.6897875  0.3214448
##   5     0.6826000  0.3125865
##   6     0.6816596  0.3136671
##   7     0.6757269  0.3008956
##   8     0.6766556  0.3027233
##   9     0.6747855  0.2987287
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(modelRF3)

The model with mtry = 2 gives the highest cross-validated accuracy in this scenario (with mtry = 3 a close second). Let's see how the model performs out of sample on the test set.

confusionMatrix(predict(modelRF3, select(dt4_test, impvar)), 
                reference = dt4_test$CarInsurance, 
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  470 162
##        Yes   9 158
##                                           
##                Accuracy : 0.786           
##                  95% CI : (0.7559, 0.8139)
##     No Information Rate : 0.5995          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5159          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4938          
##             Specificity : 0.9812          
##          Pos Pred Value : 0.9461          
##          Neg Pred Value : 0.7437          
##              Prevalence : 0.4005          
##          Detection Rate : 0.1977          
##    Detection Prevalence : 0.2090          
##       Balanced Accuracy : 0.7375          
##                                           
##        'Positive' Class : Yes             
## 

Dropping the variables did not increase the accuracy of the model; in fact, the sensitivity is now lower. Therefore, we will not use this model further.

4.6.2 Balancing classes

Note that the proportion of “Yes” and “No” in our training set is not 50:50.

table(dt4_train$CarInsurance)
## 
##   No  Yes 
## 1917 1284
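
Expressed as proportions, that is roughly 60% “No” to 40% “Yes”:

# Class shares in the training set (1917/3201 and 1284/3201 from the table above)
prop.table(table(dt4_train$CarInsurance))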

We can balance these classes and hope to improve classification accuracy. For this we will use the ROSE() function from the ROSE package.

The ROSE() function creates a synthetic sample in order to balance the classes. Below, I keep the sample size the same, so to balance the two classes ROSE will undersample from “No” and oversample from “Yes”. As ROSE() returns a list, we retain only the data frame that is relevant for us. Also note that we can specify a random number seed in the function call.

dt4_train2 <- ROSE::ROSE(CarInsurance ~ .,
                   data = select(dt4_train, -starts_with("Call")),
                   N = 3201,
                   p = 0.5,
                   seed = 305)$data

Check the class balance.

table(dt4_train2$CarInsurance)
## 
##   No  Yes 
## 1590 1611

Now the two classes are almost equally balanced. Let’s use the new synthetic sample.

set.seed(9999)

modelRF4 <- train(CarInsurance ~ . , 
                 data = dt4_train2, 
                 method = "rf", 
                 metric = "Accuracy",
                 tuneGrid = tuneGrid,
                 trControl = trControl,
                 ntree = 1000)
print(modelRF4)
## Random Forest 
## 
## 3201 samples
##   39 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2881, 2881, 2880, 2881, 2881, 2881, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9222099  0.8443877
## 
## Tuning parameter 'mtry' was held constant at a value of 19

Wow, look at that! By using the synthetic sample, we increased the cross-validated accuracy of our model to 92.2%. But does that also improve the out-of-sample performance?

confusionMatrix(predict(modelRF4, select(dt4_test, -CarInsurance, -starts_with("Call"))), 
                reference = dt4_test$CarInsurance, 
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  466 234
##        Yes  13  86
##                                           
##                Accuracy : 0.6909          
##                  95% CI : (0.6575, 0.7228)
##     No Information Rate : 0.5995          
##     P-Value [Acc > NIR] : 5.315e-08       
##                                           
##                   Kappa : 0.2729          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.2687          
##             Specificity : 0.9729          
##          Pos Pred Value : 0.8687          
##          Neg Pred Value : 0.6657          
##              Prevalence : 0.4005          
##          Detection Rate : 0.1076          
##    Detection Prevalence : 0.1239          
##       Balanced Accuracy : 0.6208          
##                                           
##        'Positive' Class : Yes             
## 

Looks like our new model has worse out-of-sample performance. :(

This situation, where in-sample performance improves but out-of-sample performance gets worse, is known as overfitting.
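
You can see the gap side by side by pulling the resampled accuracy out of the train object and comparing it with the test-set accuracy. A quick sketch, reusing the objects created above (the two numbers should match the roughly 92% and 69% figures printed earlier):

# Cross-validated accuracy on the synthetic training sample ("in sample")
max(modelRF4$results$Accuracy)

# Accuracy on the held-out test set ("out of sample")
mean(predict(modelRF4, select(dt4_test, -CarInsurance, -starts_with("Call"))) ==
       dt4_test$CarInsurance)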

Given the poor predictive power of the model built on the synthetic sample, let's go back to our original model, modelRF2, for the rest of the analysis.
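
As an aside, if you still want to address the class imbalance, caret can apply the balancing inside each cross-validation fold through the sampling argument of trainControl(), so the resampled performance is always measured on real rather than synthetic observations. A minimal sketch of that approach; the names trControl_bal and modelRF5 are just for illustration:

# Balance the classes within each CV fold instead of balancing the whole
# training set up front; "down", "up", and "smote" are alternatives to "rose".
trControl_bal <- trainControl(method = "cv",
                              number = 10,
                              sampling = "rose")

set.seed(9999)
modelRF5 <- train(CarInsurance ~ . ,
                  data = select(dt4_train, -starts_with("Call")),
                  method = "rf",
                  metric = "Accuracy",
                  tuneGrid = tuneGrid,
                  trControl = trControl_bal,
                  ntree = 1000)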