8.13 Random forest for segment description

Finally, we will use a random forest to describe the segments. If you want to read more about building a predictive model with random forests, please see Section 4.3.

8.13.1 Prepare the data

We will make minor changes to the data set. First, we drop irrelevant variables. Next we recode gender, married, income, and km_cluster so that their levels are character values rather than numbers, and then reclassify these four variables as factors. This is essential because R internally creates dummy variables for factors and uses the factor levels to build the dummy-variable names; in addition, caret requires the outcome's factor levels to be valid R variable names when class probabilities are computed, which numeric levels such as 1 are not.
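
As a minimal illustration (the toy data frame below is invented for this example), model.matrix() shows how factor levels end up in dummy-variable names, and make.names() shows how R would have to mangle numeric levels:

# Hypothetical two-row data set, just to show the naming behavior
toy <- data.frame(y = factor(c("c1", "c2")),
                  income = factor(c("i1", "i2")))
model.matrix(y ~ income, data = toy)  # contains a column named "incomei2"

make.names(c("1", "2"))  # numeric levels would become "X1" "X2"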

cluster_dt_rf <- cluster_data %>% 
  # Drop variables not used for segment description
  select(-c(customer, frequency, monetary_value, recency)) %>% 
  # Recode the numeric codes to character labels
  mutate(gender = plyr::mapvalues(
                  gender,
                  from = c(0, 1),
                  to = c("f", "m")),
         married = plyr::mapvalues(married,
                   from = c(0, 1),
                   to = c("no", "yes")),
         income = plyr::mapvalues(income,
                  from = c(1:6),
                  to = c("i1", "i2", "i3", 
                         "i4", "i5", "i6")),
         km_cluster = plyr::mapvalues(km_cluster,
                      from = c(1:4),
                      to = c("c1", "c2", "c3", "c4"))) %>% 
  # Turn the recoded variables into factors
  mutate_at(vars(gender, married, income, km_cluster), as.factor)
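
A quick, optional sanity check: the four recoded variables should now be factors, and their levels should survive make.names() unchanged, i.e., they are valid R variable names.

str(cluster_dt_rf)  # gender, married, income, km_cluster should be factors

# TRUE when all km_cluster levels are valid R variable names
all(levels(cluster_dt_rf$km_cluster) ==
      make.names(levels(cluster_dt_rf$km_cluster)))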

8.13.2 Create train and test data sets

We use caret's createDataPartition() to create a stratified 80/20 split on km_cluster, so that each segment is proportionally represented in both sets.

index <- createDataPartition(cluster_dt_rf$km_cluster, 
                             p = 0.8,
                             list = FALSE)
train_dt <- cluster_dt_rf[index, ]
test_dt <- cluster_dt_rf[-index, ]
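
Optionally, we can verify the stratification; createDataPartition() samples within each level of km_cluster, so the cluster proportions should be nearly identical in both sets (the exact output depends on the split):

round(prop.table(table(train_dt$km_cluster)), 2)
round(prop.table(table(test_dt$km_cluster)), 2)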

8.13.3 Set up train control

We use 10-fold cross-validation and ask caret to compute class probabilities; the latter is why the factor levels had to be valid R variable names.

trControl <- trainControl(method = "cv",  # cross-validation
                          number = 10,    # 10 folds
                          search = "grid",
                          classProbs = TRUE  # compute class probabilities
                          ) 

tuneGrid_large holds the candidate values of mtry, the number of predictors randomly sampled at each split. train_dt has 7 predictors plus the outcome, so ncol(train_dt) - 2 caps the grid at 6.

tuneGrid_large <- expand.grid(mtry = 1:(ncol(train_dt) - 2))

8.13.4 Train the model

Training will take a few minutes.

set.seed(2222)  # make the resampling and the random forests reproducible

modelRF_large <- train(km_cluster ~ ., 
                 data = train_dt, 
                 method = "rf", 
                 metric = "Accuracy",
                 tuneGrid = tuneGrid_large,
                 trControl = trControl,
                 ntree = 1000)  # grow 1,000 trees per forest

Check which value of mtry gives the best result.

print(modelRF_large)
## Random Forest 
## 
## 402 samples
##   7 predictor
##   4 classes: 'c1', 'c2', 'c3', 'c4' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 362, 363, 361, 361, 362, 363, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.6138886  0.4144991
##   2     0.7857429  0.6930965
##   3     0.7855629  0.6944846
##   4     0.7779925  0.6853990
##   5     0.7830597  0.6925895
##   6     0.7806207  0.6891800
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

It turns out that the model with mtry = 2 is best, with a cross-validated accuracy of about 0.79.
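
As a further check, we can evaluate the final model on the held-out test set (a sketch; the output is omitted since it depends on the particular split). Because we set classProbs = TRUE, class probabilities are available as well.

# Predicted segment for each held-out customer
pred <- predict(modelRF_large, newdata = test_dt)
confusionMatrix(pred, test_dt$km_cluster)

# Segment membership probabilities (one column per cluster)
head(predict(modelRF_large, newdata = test_dt, type = "prob"))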

8.13.5 Get the variable importance

varImp(modelRF_large, scale = TRUE)
## rf variable importance
## 
##                 Overall
## clv            100.0000
## sow             74.0256
## first_purchase  20.1671
## incomei5         4.0889
## genderm          1.7341
## loyalty          1.4986
## marriedyes       1.4867
## incomei6         0.6776
## incomei2         0.6415
## incomei4         0.3877
## incomei3         0.0000
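
caret also provides a plot method for varImp objects, which makes the ranking easier to scan:

plot(varImp(modelRF_large, scale = TRUE))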

clv and sow are by far the most important predictors. The cluster analysis effectively grouped customers by their customer lifetime value and share of wallet!49


  49. If this is going to be used to identify new customers, where do we get data on CLV and SOW?