Building Customer Clusters Using Unsupervised Machine Learning

Applying Unsupervised Machine Learning in Business

Samson Afolabi
6 min read · Apr 13, 2022

According to IBM, “Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention.” One of the most common unsupervised learning techniques is clustering.

Clustering involves discovering natural groupings in data. Clustering algorithms interpret the input data and find natural groups or clusters in the feature space.

Problem Definition

‘SARAH’ is a bike sharing company with locations in Berlin and Hamburg. Our data contains information on 1,000 customers and their bike rentals between January and December 2020.

To improve SARAH’s marketing strategies and customer relations, we will build a clustering algorithm to segment customers into different groups.

Link to the GitHub repo: https://bit.ly/3KJVb9r

Data Exploration

We start by exploring the customers and rentals dataframes.
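
The exploration code itself lives in the repo as embedded gists; here is a minimal sketch of the idea, assuming the data ships as customers.csv and rentals.csv (the file names are my assumption):

```python
import pandas as pd

# Load both datasets (file names are assumed for illustration).
customers = pd.read_csv("customers.csv")
rentals = pd.read_csv("rentals.csv")

# Quick look at structure, dtypes and summary statistics.
print(customers.head())
customers.info()
print(rentals.head())
print(rentals.describe())
```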

I built a simple dashboard to explore the data using Google Data Analytics. Link here: https://bit.ly/3xpVFxE

Feature Engineering

Using the information from the rentals dataframe, we would like to create a few columns that could be informative for our clustering model.

First, we convert ‘rental_start_date’ to a datetime column; from it we can extract the month and year to create the rental_start_month and rental_start_year columns respectively.

We can then merge the rentals and customers dataframes on the customer_ID column present in both. From the new dataframe we can calculate ‘monthly_rent_average’, ‘mean_monthly_rental_revenue’ and ‘mean_monthly_rental_duration’.

Lastly, from the customers dataframe we can also create ‘registration_to_rental_days’, which is simply the number of days between the customer’s ‘registration_date’ and ‘first_rent_date’.
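
Here is a minimal sketch of these steps, reusing the customers and rentals dataframes from above. The revenue and duration column names are assumptions, and the exact aggregation logic in the repo may differ:

```python
import pandas as pd

# Parse the rental start date and derive month and year columns.
rentals["rental_start_date"] = pd.to_datetime(rentals["rental_start_date"])
rentals["rental_start_month"] = rentals["rental_start_date"].dt.month
rentals["rental_start_year"] = rentals["rental_start_date"].dt.year

# Merge rentals with customers on the shared customer_ID column.
merged = rentals.merge(customers, on="customer_ID")

# Per-customer, per-month aggregates ("revenue" and "duration" column
# names are assumptions; adjust to the actual rentals schema).
monthly = merged.groupby(["customer_ID", "rental_start_month"]).agg(
    rides=("rental_start_date", "count"),
    revenue=("revenue", "sum"),
    duration=("duration", "mean"),
)

# Average the monthly figures per customer.
features = monthly.groupby("customer_ID").mean()
features.columns = ["monthly_rent_average",
                    "mean_monthly_rental_revenue",
                    "mean_monthly_rental_duration"]

# Days between registration and first rental.
customers["registration_to_rental_days"] = (
    pd.to_datetime(customers["first_rent_date"])
    - pd.to_datetime(customers["registration_date"])
).dt.days

# Combine everything into a single modeling dataframe.
df = features.reset_index().merge(customers, on="customer_ID")
```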

Feature Engineering: Creating new features

Model Development

Here we will build two clustering models: K-Means and Mini-Batch K-Means.

K-Means clustering may be the most widely known clustering algorithm; it involves assigning examples to clusters in an effort to minimize the variance within each cluster.

Mini-Batch K-Means is a modified version of K-Means that updates the cluster centroids using mini-batches of samples rather than the entire dataset, which can reduce the computation cost for large datasets, and it is perhaps more robust to statistical noise. You can read about more clustering techniques here: https://bit.ly/3M0dEib

For both models, we will set the value of k to be 5. (more about deciding this value later 😉)

First, we encode the sex column using the ordinal encoder and also normalize the customer_age, mean_monthly_rental_revenue, mean_monthly_rental_duration and registration_to_rental_days columns. Normalizing these columns ensures that they all contribute equally when assigning each customer to a cluster.
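
A sketch of this preprocessing step, assuming the modeling dataframe df from the feature engineering sketch; the article does not name a specific scaler, so min-max normalization is used here as one plausible choice:

```python
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Encode the categorical sex column as integers.
df["sex"] = OrdinalEncoder().fit_transform(df[["sex"]]).ravel()

# Normalize the numeric features so they contribute equally.
num_cols = ["customer_age", "mean_monthly_rental_revenue",
            "mean_monthly_rental_duration", "registration_to_rental_days"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```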

From scikit-learn we can simply import the KMeans and MiniBatchKMeans classes. The main configuration to tune in both models is the “n_clusters” hyperparameter, which sets the number of clusters to find in the data.
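
A minimal sketch of both models with k set to 5 (the random_state and batch_size values are my additions, for reproducibility):

```python
from sklearn.cluster import KMeans, MiniBatchKMeans

# Feature matrix: encoded sex plus the normalized numeric columns.
X = df[["sex"] + num_cols]

# K-Means: assign each customer to one of k = 5 clusters.
kmeans = KMeans(n_clusters=5, random_state=42)
df["kmeans_cluster"] = kmeans.fit_predict(X)

# Mini-Batch K-Means: same idea, centroids updated on small batches.
mbk = MiniBatchKMeans(n_clusters=5, random_state=42, batch_size=100)
df["mbk_cluster"] = mbk.fit_predict(X)
```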

K-Means and Mini-Batch K-Means model pipelines

Explore Model Results

Going forward, I will explore the results of the K-Means model pipeline.

Note: the preferred model should be the one whose clusters give the most meaningful customer segmentation from a business point of view.

Some observations from the K-means clustering:

  1. Group 0 contains customers with lower ride frequency but the highest mean monthly rental revenue.
  2. Group 4 contains customers who ride often (e.g. to work or school) but possibly for short distances.
  3. Groups 1 and 3 seem similar in both mean_monthly_rental_revenue and monthly_rent_average.
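
These observations come from comparing per-cluster averages of the engineered features; a quick sketch of how to reproduce such a summary, reusing df and num_cols from the earlier steps:

```python
# Mean feature value per K-Means cluster: one row per customer group.
cluster_profile = df.groupby("kmeans_cluster")[num_cols].mean()
print(cluster_profile)
```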

Choosing the Optimal Number of Clusters

The number of clusters can be selected from a business or a mathematical point of view. We can choose the number of customer clusters based on what would serve a certain business target, or rely on mathematical techniques.

In this section, I will reference the brilliant article by Indraneel Dutta Baruah on “Cheat sheet for implementing 7 methods for selecting the optimal number of clusters in Python”.

From a mathematical standpoint, there are a couple of popular techniques for selecting the optimal number of clusters. In this article we will consider:

  1. Elbow Method
  2. Silhouette Coefficient
  3. Calinski-Harabasz Index

The Elbow Method is the most popular method for determining the optimal number of clusters. According to Indraneel, “The idea behind the elbow method is that the explained variation changes rapidly for a small number of clusters and then it slows down leading to an elbow.”

According to the Yellowbrick documentation, the KElbowVisualizer implements the “elbow” method by fitting the model with a range of values for K. If the line chart resembles an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.

Using the Yellowbrick library, we implement the elbow method to determine the optimal number of clusters for the K-Means clustering.
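
A minimal sketch of the call, reusing the feature matrix X from the modeling step:

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Fit K-Means for k = 2..10 and plot distortion against k.
visualizer = KElbowVisualizer(KMeans(random_state=42), k=(2, 11))
visualizer.fit(X)
visualizer.show()
```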

Elbow Method for K-Means Clustering.

According to the elbow method, the optimal number of clusters we should implement is 3.

We can also use the KElbowVisualizer to implement the Silhouette coefficient. The silhouette score calculates the mean Silhouette Coefficient of all samples. According to Indraneel, “The Silhouette Coefficient for a point i is defined as follows:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where b(i) is the smallest average distance of point i to all points in any other cluster and a(i) is the average distance of i from all other points in its cluster”. Implementing this for our K-means model, we get:
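
The same visualizer accepts a metric argument; a sketch for the silhouette score:

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Plot the mean silhouette coefficient for k = 2..10.
visualizer = KElbowVisualizer(KMeans(random_state=42), k=(2, 11),
                              metric="silhouette", timings=False)
visualizer.fit(X)
visualizer.show()
```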

Silhouette Coefficient for k-means clustering.

The Silhouette Coefficient gives an optimal cluster count of 2.

Lastly, we consider the Calinski-Harabasz Index. The calinski_harabasz score computes the ratio of between-cluster dispersion to within-cluster dispersion.

According to Indraneel, “The index is calculated by dividing the variance of the sums of squares of the distances of individual objects to their cluster center by the sum of squares of the distance between the cluster centers. Higher the Calinski-Harabasz Index value, better the clustering model.” Implementing this for our K-Means model we get:
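
And the corresponding sketch for the Calinski-Harabasz index:

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Plot the Calinski-Harabasz score for k = 2..10.
visualizer = KElbowVisualizer(KMeans(random_state=42), k=(2, 11),
                              metric="calinski_harabasz", timings=False)
visualizer.fit(X)
visualizer.show()
```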

Calinski Harabasz Score for KMeans Clustering

Read more about selecting the number of clusters here: https://bit.ly/3JvMwpE

How can this Clustering Help SARAH’s Marketing?

Building the right customer clusters can help SARAH in many ways. Some are:

  1. With an optimal customer clustering, SARAH can create and share the right ride packages with its customers.
  2. A good understanding of the customer clusters and location-specific demand can help with ride location planning, i.e. having the right number of bicycles in SARAH’s operational locations at any given time.
  3. The right clustering would help in understanding SARAH’s customer base and its possible needs.

Once again, the GitHub link to the code is: https://bit.ly/3KRjaTO

I hope this was helpful for you. I look forward to your comments and questions here; meanwhile, you can also follow me on Twitter and LinkedIn.

If you enjoyed this article, you may consider buying me a coffee ☕️.

Many thanks 😊

References

  1. “Cheat sheet for implementing 7 methods for selecting the optimal number of clusters in Python” by Indraneel Dutta Baruah
  2. “Clustering Algorithms With Python” by Machine Learning Mastery

