8.13 Random forest for segment description
Finally, we will use a random forest to create segment descriptions. If you want to read more about building a predictive model with random forests, please see Section 4.3.
8.13.1 Prepare the data
We will make minor changes to the data set. First, we drop irrelevant variables. Next, we recode gender, married, income, and km_cluster so that their levels are character strings rather than numbers, and then reclassify these four variables as factors. This is essential because R will internally create dummy variables for factors: the factor levels are used to build the internal variable names, which only works when the levels are character values.
cluster_dt_rf <- cluster_data %>%
  # drop variables that are not needed for the classifier
  select(-c(customer, frequency, monetary_value, recency)) %>%
  # recode numeric codes to character labels
  mutate(gender = plyr::mapvalues(gender,
                                  from = c(0, 1),
                                  to = c("f", "m")),
         married = plyr::mapvalues(married,
                                   from = c(0, 1),
                                   to = c("no", "yes")),
         income = plyr::mapvalues(income,
                                  from = 1:6,
                                  to = c("i1", "i2", "i3",
                                         "i4", "i5", "i6")),
         km_cluster = plyr::mapvalues(km_cluster,
                                      from = 1:4,
                                      to = c("c1", "c2", "c3", "c4"))) %>%
  # convert the recoded variables to factors
  mutate_at(vars(gender, married, income, km_cluster), as.factor)
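As a quick, optional sanity check (not part of the original code), we can confirm that the recoded columns are now factors with the intended character levels:

# optional check: list the levels of each recoded factor
sapply(cluster_dt_rf[c("gender", "married", "income", "km_cluster")], levels)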
8.13.2 Create train and test data sets
index <- createDataPartition(cluster_dt_rf$km_cluster,
                             p = 0.8,
                             list = FALSE)

train_dt <- cluster_dt_rf[index, ]
test_dt  <- cluster_dt_rf[-index, ]
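Because createDataPartition() samples within each level of km_cluster, the cluster proportions should be roughly the same in the training and test sets. A quick optional check:

# optional check: class proportions should be similar in both sets
round(prop.table(table(train_dt$km_cluster)), 2)
round(prop.table(table(test_dt$km_cluster)), 2)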
8.13.3 Set up train control
We are using 10-fold cross-validation. We also define a grid of candidate values for mtry, the number of predictors randomly sampled at each split, which caret will tune over.
trControl <- trainControl(method = "cv",      # cross-validation
                          number = 10,        # 10 folds
                          search = "grid",
                          classProbs = TRUE)  # compute class probabilities
tuneGrid_large <- expand.grid(mtry = c(1:(ncol(train_dt) - 2)))
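Printing the grid is a quick, optional way to confirm the candidate values before the (slow) training step:

# the candidate mtry values; the upper bound depends on ncol(train_dt)
tuneGrid_large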
8.13.4 Train the model
This will take a few minutes.
set.seed(2222)
modelRF_large <- train(km_cluster ~ .,
                       data = train_dt,
                       method = "rf",
                       metric = "Accuracy",
                       tuneGrid = tuneGrid_large,
                       trControl = trControl,
                       ntree = 1000)
Check which mtry gives us the best result.
print(modelRF_large)
## Random Forest
##
## 402 samples
## 7 predictor
## 4 classes: 'c1', 'c2', 'c3', 'c4'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 362, 363, 361, 361, 362, 363, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.6138886 0.4144991
## 2 0.7857429 0.6930965
## 3 0.7855629 0.6944846
## 4 0.7779925 0.6853990
## 5 0.7830597 0.6925895
## 6 0.7806207 0.6891800
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
It turns out that the model with mtry = 2 is the best.
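We can also plot the cross-validated accuracy against mtry and, since we held out a test set earlier, check how the final model performs on unseen data. The following is a sketch of that step, which is not shown in the original text:

# optional: plot cross-validated accuracy against mtry
plot(modelRF_large)

# optional: evaluate the final model on the held-out test set
test_pred <- predict(modelRF_large, newdata = test_dt)
confusionMatrix(data = test_pred, reference = test_dt$km_cluster)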
8.13.5 Get the variable importance
varImp(modelRF_large, scale = TRUE)
## rf variable importance
##
## Overall
## clv 100.0000
## sow 74.0256
## first_purchase 20.1671
## incomei5 4.0889
## genderm 1.7341
## loyalty 1.4986
## marriedyes 1.4867
## incomei6 0.6776
## incomei2 0.6415
## incomei4 0.3877
## incomei3 0.0000
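A dot plot of the same importances (optional) makes the ranking easier to see at a glance:

# optional: plot the variable importances
plot(varImp(modelRF_large, scale = TRUE))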
clv and sow are the most important predictors. The cluster analysis effectively clustered people based on their customer lifetime value and share of wallet!⁴⁹
⁴⁹ If this is going to be used to identify new customers, where do we get data on CLV and SOW?