Self-Organising Maps (SOMs) are an unsupervised deep learning technique.
Used to cluster together data points with similar features, SOMs are often described as one of deep learning’s equivalents to K-Means Clustering.
In fact, I will use K-Means Clustering to explain how a self-organizing map works.
When I was first introduced to them, I instantly fell in love as they were:
- Highly visual
- From a theoretical perspective, relatively easy to understand
Without any further ado, let’s dive into the colorful world of self-organizing maps!
What is a Self-Organising Map?
The output of a SOM is a representation of the input data on a grid. The grid is where the map idea comes in.
It is deemed self-organizing because the data itself determines, via the SOM algorithm, where each point will sit on the map.
The image below is an example of a SOM.
In the process of creating the output map, the algorithm compares the input vectors against the grid, so that similar inputs end up close together on the map.
Looking at K-Means Clustering
To help me explain this further let’s take a look at K-Means Clustering.
This description of K-Means is taken from my previous post on clustering algorithms.
The K-Means clustering algorithm is an iterative process where you are trying to minimize the distance between each data point and the mean of its cluster.
Put simply, you are trying to create the tightest possible clusters of data.
The K-Means algorithm clusters the data points around the means, depending on their distance from each mean.
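To make that loop concrete, here is a minimal sketch of K-Means in NumPy. It is not taken from any particular library: the function name `k_means`, its parameters and the toy data are all made up for illustration.

```python
import numpy as np

def k_means(data, k, n_iter=100, seed=0):
    """Cluster the rows of `data` into k groups (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    means = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every mean: shape (n_points, k).
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        # Assign each point to its nearest mean.
        labels = dists.argmin(axis=1)
        # Move each mean to the average of its points (keep it if it has none).
        new_means = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break  # the means have stopped moving
        means = new_means
    return means, labels

# Toy data (an assumption): two well-separated blobs of 20 points each.
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
means, labels = k_means(toy, k=2)
```

On this toy data, every point in the first blob ends up with one label and every point in the second blob with the other.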
How does K-Means relate to the SOM algorithm?
In the SOM algorithm, you are essentially following the same process as in K-Means.
However, the process differs in two distinct ways:
- Instead of mapping clusters, you map each input to a coordinate on the map
- In K-Means the data points are clustered relative to the cluster mean, whereas in a SOM they are also positioned relative to one another to gain representation on the map.
So how does a SOM work?
The first thing you need to do is initialize the weights of your neurons.
(Oh yeah, you still have neurons. This is a deep learning technique after all.)
As you run each input data vector through the algorithm, the point on the grid that most closely matches this data (the best matching unit) is pulled towards it, along with its neighbours on the grid.
Then it’s the next input vector’s turn. The grid is again pulled towards the data point where it is most relevant. During this process, the weights are updated.
This pulling and updating process continues, looping over the input data until you end up with a fully converged map.
The graphic below from Wikimedia shows the process.
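If you prefer code to pictures, the pulling process can be sketched in a few lines of NumPy. Treat this as a simplified illustration rather than a reference implementation: the names `train_som` and `best_matching_unit`, the exponential decay schedule, the Gaussian neighbourhood and the random colour data are all assumptions made for this example.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weights are closest to x."""
    dists = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(dists.argmin(), dists.shape)

def train_som(data, rows, cols, n_iter=1000, lr0=0.5, sigma0=None, seed=0):
    """Train a small SOM on `data` (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Initialise the weights randomly: one weight vector per grid neuron.
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every neuron, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iter):
        # Shrink the learning rate and neighbourhood radius over time.
        decay = np.exp(-t / n_iter)
        lr, sigma = lr0 * decay, sigma0 * decay
        # Pick one input vector at random.
        x = data[rng.integers(len(data))]
        bmu = best_matching_unit(weights, x)
        # Neurons near the best matching unit on the grid are pulled harder.
        grid_dist_sq = np.sum((grid - np.array(bmu)) ** 2, axis=2)
        influence = np.exp(-grid_dist_sq / (2 * sigma ** 2))
        weights += lr * influence[:, :, None] * (x - weights)
    return weights

# Toy data (an assumption): 200 random RGB colours with values in [0, 1).
colours = np.random.default_rng(1).random((200, 3))
weights = train_som(colours, rows=5, cols=5)
bmu = best_matching_unit(weights, colours[0])
```

Each colour can then be placed on the map by looking up its best matching unit, which is how the final grid picture gets built.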
And that’s it!
Now that you have some idea of the theory, let’s look at the step-by-step implementation of SOMs using Python.
How to implement a SOM in Python – step by step
I am probably going to sound like a broken record here if you have read any of my other tutorials buuuttttt…..
Step 1 of the implementation is to clean your data!
Step 1: Clean the data
You need to follow the standard process of data wrangling to make sure that the output of your algorithm makes sense.
This includes, but is not limited to:
- Check for outliers in the data that could skew the results. To help you spot outliers, use the describe function in pandas. This will give you summary statistics for the data you are using, so you can see whether it looks sensible.
- Remove unnecessary columns from the data using iloc and pandas
- Standardize column values
- Replace missing values with the average value for that column (you could remove the row instead, but filling in the average is generally seen as better because it avoids losing the rest of that row’s data)
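As a rough sketch, the cleaning steps above might look like this in pandas. The DataFrame and its column names are invented for illustration, so swap in your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: an ID column plus two features with gaps.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age": [25.0, 32.0, np.nan, 41.0],
    "income": [30000.0, 45000.0, 52000.0, np.nan],
})

# Summary statistics help you spot implausible values and outliers.
summary = df.describe()

# Keep only the feature columns (here, everything after the ID column).
features = df.iloc[:, 1:]

# Replace missing values with the mean of their column.
features = features.fillna(features.mean())
```

Printing `summary` shows the count, mean, standard deviation, min, max and quartiles of each numeric column, which is usually enough to notice a value that looks wrong.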
You will then need to apply feature scaling to the data so that no single variable biases the output.
What is Feature scaling?
If you have input variables on very different scales, you may need to scale them so that one variable does not dominate the result. Not all algorithms need this, so check the documentation to see whether yours does.
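One common way to do this is min-max scaling, which squeezes every column into the range 0 to 1. Here is a minimal NumPy sketch; the helper name `min_max_scale` and the example numbers are assumptions for illustration, and other scaling methods such as standardization would work too.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the range [0, 1] (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # A constant column has zero range; avoid dividing by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Hypothetical example: age in years next to income in pounds.
X = np.array([[25, 30000], [32, 45000], [41, 52000]])
X_scaled = min_max_scale(X)
```

Before scaling, income would swamp age in any distance calculation. After scaling, both columns run from 0 to 1 and carry equal weight.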
Step 2: Download the code for implementing SOMs
Good news! Some beautiful human has created source code that you can use and download for free.
This starter code is wonderful because it means you don’t need to code all of the complex background mathematics behind the SOM algorithm.
Step 3: Initialise the weights and run
Once you have downloaded the code and set it up in your working directory, it’s time to start creating your SOM.
As we discussed in the theory section, the first step is initializing the weights of the neurons.
Once that’s done, you can train the algorithm.
Step 4: Creating your grid
To create the grid, you will first need to import some plotting functions from pyplot.
Once imported, you then use these to create your map.
First plot the distance between neighbouring neurons as a matrix, then add any relevant markers, color code as you wish, and you’re done!
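The source code works out this distance matrix for you, but to show what is going on underneath, here is a hand-rolled sketch in NumPy. The function name `distance_map`, the choice of four neighbours per neuron and the tiny weight grid are assumptions for this example, so the library you download may calculate it slightly differently.

```python
import numpy as np

def distance_map(weights):
    """Mean distance from each neuron to its grid neighbours, scaled to [0, 1]."""
    rows, cols, _ = weights.shape
    result = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            # Look at the neighbours above, below, left and right.
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(np.linalg.norm(weights[i, j] - weights[ni, nj]))
            if dists:
                result[i, j] = np.mean(dists)
    # Scale so the largest value is 1 (skip if every distance is zero).
    peak = result.max()
    return result / peak if peak > 0 else result

# Hypothetical 2x3 grid with one feature: the right-hand column stands apart.
weights = np.array([[[0.0], [0.0], [1.0]],
                    [[0.0], [0.0], [1.0]]])
umatrix = distance_map(weights)
```

High values mark neurons that sit far from their neighbours in feature space, which is where the borders between clusters show up. The resulting matrix can be drawn with pyplot’s pcolor function.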
Step 5: Visualising the data
The final step is to use pyplot’s show function to display the grid.
And that’s everything!
I could have grouped steps 4 and 5, to be fair, but I like the number 5.
Using the source code, it’s very simple to create your own self-organizing map.
This technique can be great for quickly analyzing large datasets with multiple variables.
However, it is not without its limitations.
Want some more tips on creating graphs with python? This post on classification algorithms has a couple of cool plots to try.
Limitations of Self Organising Maps
The limitations of the SOM have been much discussed in the literature.
While I love the ease with which a SOM can be understood, and the visualization, there are issues.
Some of these issues I have outlined below:
- The algorithm does not build a generative model for the data
- The whole system relies on a predefined distance in feature space
- Though the code itself is quick to write, the algorithm is slow to train
- Sorry to tell you this, but it is not as intuitive as you think. While the neurons may appear close on the map (topological proximity), in reality, they may be far away in feature space
- It is challenging to process categorical or mixed data effectively
- Does not work well on small data sets – this could be said for many machine learning algorithms but it is especially true with SOMs
Despite the SOM’s drawbacks, though, I still enjoy using this algorithm to help me understand the big picture of my dataset.
An overview of SOMs
To conclude, here is a summary of everything we have discussed today about self-organizing maps:
- Unsupervised deep learning
- The name comes from the output of the algorithm which maps multidimensional data onto a 2D grid
- Works by iterating over data and pulling the grid towards the most relevant data
- Use the source code found online to implement with ease in Python
- Has some limitations, including slow training time and reliance on predefined distances in feature space