8 Cluster Analysis

Most of the code in this chapter is thanks to Pallav Routh, who is a PhD student in my department.

Imagine that a marketer wants to group customers into identifiable segments. The marketer has demographic as well as purchase behavior data on these customers. The objective is to create groups such that customers within a group are similar to one another (homogeneous) and customers in different groups are dissimilar (heterogeneous). As these groups are not yet formed, there is no “target” variable that the marketer can use to build a predictive model. All the marketer has is the data on customer characteristics and purchase behavior. In machine learning, a modeling problem without a labelled target variable is called an unsupervised learning problem. For customer segmentation specifically, cluster analysis is a highly popular unsupervised learning method.

Cluster analysis, or clustering, is a task that can be carried out with many different algorithms. In this exercise, we will use k-means clustering, which partitions the observations into k clusters (or groups), where k is chosen by the analyst, so that variability within each cluster is low and variability between clusters is high.
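
To make the idea concrete, here is a minimal sketch of the standard k-means iteration (often called Lloyd's algorithm) in plain Python. It is only an illustration and not the code used in the rest of this chapter. The function names (`k_means`, `within_cluster_ss`), the toy `customers` data, and the choice of random starting centroids are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=100, seed=0):
    """Illustrative k-means: returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct observations picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Total within-cluster sum of squares (lower means tighter clusters)."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical customers: (age, annual spend in hundreds of dollars).
customers = [(22, 5), (25, 6), (24, 4), (51, 30), (55, 28), (53, 33)]

labels, centroids = k_means(customers, k=2)
print("cluster labels:", labels)
print("within-cluster SS:", within_cluster_ss(customers, labels, centroids))
```

The within-cluster sum of squares is the "variability within a cluster" that k-means tries to keep low. In practice the variables are usually put on a comparable scale first, because a variable with a large range would otherwise dominate the distance calculation.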