8.5 Scaling variables
Cluster analysis can use any numeric variable. Because the algorithms rely on a distance metric, a variable measured in large units would dominate the distance calculation, so it is preferable to put all the variables on the same scale first. The simplest approach is to convert each variable to z scores: mean center the variable and then divide it by its standard deviation. The scale() function from base R performs centering and scaling in one step.
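library(dplyr)  # provides select() and the %>% pipe

# scale() returns a matrix, so convert the result back to a data frame
cluster_data_pro <- cluster_data %>%
  select(recency, frequency, monetary_value) %>%
  scale() %>%
  as.data.frame()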
Verify that each scaled variable has a mean of 0 and a standard deviation of 1.
print("Means")
## [1] "Means"
sapply(cluster_data_pro, mean) %>% round(4)
## recency frequency monetary_value
## 0 0 0
print("Standard Deviations")
## [1] "Standard Deviations"
sapply(cluster_data_pro, sd)
## recency frequency monetary_value
## 1 1 1
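To make the z score transformation concrete, here is a minimal sketch that computes it by hand and compares the result with scale(). The toy data frame and the names toy, z_manual, z_scale, and max_diff are hypothetical, made up purely for illustration; only the three column names come from the data used above.

```r
# Hypothetical toy data; these values are made up for illustration only
toy <- data.frame(
  recency        = c(10, 20, 30, 40),
  frequency      = c(1, 2, 3, 6),
  monetary_value = c(100, 2500, 900, 4000)
)

# z score by hand: subtract the column mean, then divide by the column sd
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() with its defaults (center = TRUE, scale = TRUE) should match
z_scale <- as.data.frame(scale(toy))

# Largest absolute difference between the two versions;
# expected to be zero up to floating point error
max_diff <- max(abs(as.matrix(z_manual) - as.matrix(z_scale)))
print(max_diff)

# Euclidean distances between rows, before and after scaling.
# Before scaling, monetary_value has the largest units and
# contributes most to the distances.
print(dist(toy))
print(dist(z_scale))
```

Comparing the two dist() outputs shows why scaling matters: after scaling, all three variables contribute to the distance on comparable terms rather than the one with the largest units driving the result.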