Clustering algorithms are a powerful machine learning technique for unsupervised data.
As you may have guessed, clustering algorithms group data points together based on their features. The most common algorithms in machine learning are hierarchical clustering and K-Means clustering.
These clusters are used in a variety of different ways. For example, if you are clustering data on Customers, you can use the clusters to target marketing campaigns.
Clustering algorithms are also used by search engines to group together similar news stories.
These articles surface to you in groups depending on what you searched for.
There are many ways you can use clustering algorithms, so let’s dive into how to get set up.
What is unsupervised learning?
Before we get started, let me first introduce the concept of unsupervised learning.
Unsupervised learning is where you train a machine learning algorithm, but you don’t give it the answer to the problem.
To give you an example, imagine you feed the model customer data and you are looking to split these customers into groups.
You don’t know what the groups will look like before you start; hence it is an unsupervised learning problem.
What are the top different types of clustering algorithms?
Today we are going to introduce the top 2 clustering algorithms.
1) K-means clustering algorithm
The K-Means clustering algorithm is an iterative process where you try to minimize the distance between each data point and the mean (average) data point of its cluster.
Put simply, you are trying to create the tightest possible clusters of data.
To start the algorithm, you define the number of clusters you need and choose the starting mean data points.
The algorithm will then cluster the data points around the means depending on distance.
There are many different ways that you can calculate the distance, yet it’s not always just ‘distance.’ I know – confusing! I’m not going to go into this in depth now, but you can read more about the distance measures and the algorithm here.
Once the data points have been assigned, you recalculate each mean as the average of the data points in its cluster.
When you recalculate the distances, any data point that is now closer to a different cluster’s mean moves to that cluster.
Are you still with me?
You then just repeat the process of reassigning points and recalculating means until the assignments stop changing.
When this happens, you have your clusters!
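The loop described above can be sketched in plain Python. This is a minimal illustration on 2-D points, not a production implementation – real libraries use smarter initialisation (such as k-means++) rather than simply taking the first k points as starting means:

```python
from statistics import mean

def kmeans(points, k, n_iters=100):
    """Minimal 2-D K-Means sketch: assign each point to its nearest mean,
    then move each mean to the average of its assigned points."""
    # Naive initialisation: use the first k points as the starting means
    # (real implementations pick them at random or with k-means++)
    means = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster with the closest mean
        labels = [min(range(k), key=lambda i: (p[0] - means[i][0]) ** 2
                                            + (p[1] - means[i][1]) ** 2)
                  for p in points]
        # Update step: each mean becomes the average of its assigned points
        new_means = []
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            new_means.append([mean(c) for c in zip(*members)] if members else means[i])
        if new_means == means:  # stop once the means no longer move
            break
        means = new_means
    return labels, means
```

For example, six points forming two obvious groups are split cleanly: `kmeans([(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)], 2)` assigns the points near the origin to one cluster and the points near (10, 10) to the other.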
How do you determine how many clusters are optimal in K-Means clustering algorithms?
It’s all very well, I hear you say, but how do you decide how many clusters you need in the first place?
The most common approach is the elbow method, which looks at the within-cluster variance for different numbers of clusters and uses this to determine how many you need.
It is easy to implement and visualize using Python.
The name comes from the bend in the graph, or elbow, which indicates the number of clusters.
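Here is a sketch of the elbow method, assuming scikit-learn and matplotlib are installed (`make_blobs` just generates toy data with three natural groups, so the elbow should appear around k=3):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 natural groups (stands in for your real dataset)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-Means for a range of k and record the within-cluster variance,
# which scikit-learn calls inertia_
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# Plot k against inertia; the bend ('elbow') marks the number of clusters
plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster variance (inertia)")
plt.show()
```

The inertia always drops as you add clusters; the point where the drop flattens out is the elbow you are looking for.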
2) Hierarchical clustering
Hierarchical clustering algorithms seek to create a hierarchy of clustered data points.
The algorithm reduces the number of clusters by repeatedly merging those closest to one another, using a distance measurement such as Euclidean distance for numeric data or Hamming distance for text.
This continues until you have the minimum number of clusters of similar data points.
The algorithm you use to do this, in most cases, is the hierarchical agglomerative clustering algorithm.
This hierarchy is again based on distance and can be visualized using a dendrogram.
The height of the bar in the dendrogram indicates how different the clusters are. This is where your distance metric comes in as this is the y-axis of the dendrogram.
You keep merging the clusters you have made, moving from the closest pair to the furthest apart, until all clusters are linked.
How do you determine how many clusters are optimal in Hierarchical clustering algorithms?
The way you identify how many clusters are optimal for hierarchical clustering algorithms is by drawing a horizontal line across the dendrogram, through the tallest vertical lines, at a height where it does not cross any of the horizontal merge lines.
Don’t worry if that sentence is confusing – I’ve made a diagram below!
Once you have your line, all you need to do is count the number of vertical lines it crosses, and that gives you the optimal number of clusters.
How do you choose your clustering algorithm models?
Both of the clustering algorithm methods we have discussed today are very useful for data scientists.
So how do you choose which to use?
Well, to help you decide, here’s a comparison table.

| Algorithm | Pros | Cons |
| --- | --- | --- |
| K-Means | Works on large datasets; easy to understand and implement | Needs an additional step (the elbow method) to calculate the number of clusters |
| Hierarchical | Only one step needed to create and calculate the number of clusters | Doesn’t work on large datasets; less intuitive (in my opinion :)) |
How do you implement clustering algorithms using Python?
Before you do any type of data analysis using clustering algorithms, however, you need to clean your data.
This process is called data pre-processing and is essential for ensuring you get a good output from your algorithm.
Some steps to follow are:
- Check for outliers in the data that could skew the results
- Replace missing values with the average value for that variable (this is one option, generally seen as better than removing the data point entirely)
- Feature scaling: if your input variables are on very different scales, you may need to scale them so that one variable doesn’t dominate the distance calculations
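Those pre-processing steps might look like this, assuming scikit-learn is installed (the customer numbers here are made up purely for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy customer data: age and yearly spend, on very different scales,
# with one missing value (np.nan)
X = np.array([[25, 40_000.0],
              [32, np.nan],
              [47, 150_000.0],
              [51, 120_000.0]])

# Replace missing values with the average of that column
X = SimpleImputer(strategy="mean").fit_transform(X)

# Scale each feature to zero mean and unit variance so that
# 'spend' doesn't dominate 'age' in the distance calculations
X = StandardScaler().fit_transform(X)
```

After these two steps every feature contributes on an equal footing, which matters because both algorithms above are driven entirely by distances.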
The Youtube tutorial videos #37, 38 and 39 cover some techniques to do this here (link)
Implementing K-Means clustering algorithms in Python using the Scikit-Learn module:
- Import the KMeans class from the cluster module
- Find the number of clusters using the elbow method
- Create your K-Means clusters
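Putting those steps together, a minimal sketch might look like this, assuming scikit-learn is installed. `make_blobs` stands in for your real (pre-processed) data, and `n_clusters=3` is assumed to have come out of the elbow method:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for your pre-processed customer features
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# n_clusters would come from the elbow method; 3 is assumed here
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster index (0, 1 or 2) for each point

print(kmeans.cluster_centers_)   # the final mean data point of each cluster
```

The `labels` array is what you would join back onto your customer table to target marketing campaigns per cluster.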
Implementing Hierarchical Clustering algorithms in Python using the SciPy module:
- Import the cluster module
- Create a dendrogram
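A minimal sketch of those two steps, assuming SciPy and matplotlib are installed (scikit-learn’s `make_blobs` again provides toy data):

```python
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Agglomerative clustering: repeatedly merge the two closest clusters
# ('ward' linkage, which uses Euclidean distance)
Z = hierarchy.linkage(X, method="ward")

# The dendrogram's y-axis is the distance at which clusters were merged
hierarchy.dendrogram(Z)
plt.ylabel("merge distance")
plt.show()

# Cut the tree into a flat clustering once you've read the optimal
# number of clusters off the dendrogram (3 is assumed here)
labels = hierarchy.fcluster(Z, t=3, criterion="maxclust")
```

Reading the number of clusters off the dendrogram and then calling `fcluster` is the one-step cluster-count advantage mentioned in the comparison table above.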
Now you know how to implement clustering algorithms. Where will you use them?