8.5 Scaling variables

Cluster analysis can use any numeric variable. Because the algorithms rely on a distance metric, a variable measured on a larger scale would dominate the distance calculation, so it is preferable to put all the variables on the same scale first. The simplest approach is to convert each variable to z scores: center it by subtracting its mean, then divide by its standard deviation. The scale() function from base R performs centering and scaling in one step. Because scale() returns a matrix, we convert the result back to a data frame.
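Written out, the z score transformation described above looks like this. The symbols are our own notation, not taken from the data set: $x_{ij}$ is the value of variable $j$ for observation $i$, $\bar{x}_j$ is the mean of variable $j$, and $s_j$ is its sample standard deviation (the $n - 1$ denominator, which is what sd() and scale() use).

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\qquad
\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},
\qquad
s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}
$$

After this transformation every variable has mean 0 and standard deviation 1, so each one contributes on a comparable scale to the distance between two observations.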

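# Keep only the clustering variables, then standardize each one to z scores
cluster_data_pro <- cluster_data %>%
  select(recency, frequency, monetary_value) %>%
  scale() %>%        # centers and scales each column; returns a matrix
  as.data.frame()    # convert the matrix back to a data frame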

Verify that each variable now has a mean of 0 and a standard deviation of 1:

print("Means")

## [1] "Means"

sapply(cluster_data_pro, mean) %>% round(4)

##        recency      frequency monetary_value 
##              0              0              0

print("Standard Deviations")

## [1] "Standard Deviations"

sapply(cluster_data_pro, sd)

##        recency      frequency monetary_value 
##              1              1              1
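To make the connection between scale() and the z score definition concrete, here is a minimal sketch that computes z scores by hand and compares them with the output of scale(). It assumes a small made-up data frame, toy, standing in for cluster_data; the values and the names toy, manual, scaled_matrix and scaled are hypothetical and used only for illustration.

```r
# Toy data standing in for cluster_data; the values are made up for illustration
toy <- data.frame(
  recency        = c(10, 45, 3, 120, 60),
  frequency      = c(5, 2, 12, 1, 3),
  monetary_value = c(250, 80, 900, 20, 150)
)

# Manual z scores: subtract each column's mean, divide by its sample sd
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() does the same in one call but returns a matrix
scaled_matrix <- scale(toy)
scaled <- as.data.frame(scaled_matrix)

# Compare the two results numerically, ignoring attributes
all.equal(as.matrix(manual), as.matrix(scaled), check.attributes = FALSE)

# The means and sds used by scale() are kept as attributes of the matrix
attr(scaled_matrix, "scaled:center")
attr(scaled_matrix, "scaled:scale")

# Euclidean distances between rows before and after scaling
dist(toy)
dist(scaled)
```

Comparing the two dist() outputs shows how the pairwise distances change once no single variable's units dominate. The "scaled:center" and "scaled:scale" attributes may also be useful if the same transformation later needs to be applied to new observations, although that step is not covered here.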