8.5 Scaling variables

Cluster analysis can use any numeric variable. Because the algorithms rely on a distance metric, a variable measured in large units would otherwise dominate the distance calculation, so it is preferable to bring all variables onto a common scale. The simplest approach is to convert each variable to z scores: mean-center the variable, then divide it by its standard deviation. The scale() function from base R performs both centering and scaling in one step.
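As a minimal sketch of what the z score conversion does, the snippet below standardizes a toy vector by hand and compares the result with scale(). The vector x and its values are made up for illustration and are not taken from cluster_data.

```r
# Toy vector (hypothetical values, not from cluster_data)
x <- c(10, 20, 30, 40, 50)

# Manual z scores: subtract the mean, then divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it to a plain vector
z_scale <- as.numeric(scale(x))

# Both approaches should agree up to floating point tolerance
all.equal(z_manual, z_scale)
```

Note that scale() returns a matrix rather than a data frame, which matters when the result is passed to functions that expect columns.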

cluster_data_pro <- cluster_data %>%
  select(recency, frequency, monetary_value) %>%
  scale() %>%
  as.data.frame()  # scale() returns a matrix; convert back so sapply() works column-wise

Verify that each scaled variable has a mean of 0 and a standard deviation of 1.

print("Means")
## [1] "Means"
sapply(cluster_data_pro, mean) %>% round(4)
##        recency      frequency monetary_value 
##              0              0              0
print("Standard Deviations")
## [1] "Standard Deviations"
sapply(cluster_data_pro, sd)
##        recency      frequency monetary_value 
##              1              1              1
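To illustrate why a common scale matters for a distance-based algorithm, the sketch below compares Euclidean distances before and after scaling. The data frame toy and its three customers are hypothetical values chosen for illustration. Only the column names mirror the ones used above.

```r
# Three hypothetical customers (illustrative values only)
toy <- data.frame(
  recency        = c(5, 200, 10),
  frequency      = c(12, 1, 11),
  monetary_value = c(1000, 1050, 4000)
)

# Euclidean distances on the raw variables:
# monetary_value has the largest units, so it drives the result
d_raw <- as.matrix(dist(toy))

# Euclidean distances after converting every column to z scores
d_scaled <- as.matrix(dist(scale(toy)))

# Distances from customer 1 to customers 2 and 3, raw versus scaled
round(d_raw[1, 2:3], 2)
round(d_scaled[1, 2:3], 2)
```

On the raw data, customer 1 appears closest to customer 2 because their monetary values are similar, even though their recency and frequency differ sharply. After scaling, all three variables contribute comparably and customer 3 becomes the nearer neighbor.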