How to Generate Powerful Self-Organizing Maps Using Python

Self-Organising Maps (SOMs) are an unsupervised deep learning technique.

Used to cluster together data points with similar features, SOMs are often described as deep learning’s equivalent of K-Means Clustering.

In fact, I will use K-Means Clustering to explain how a self-organizing map works.

When I was first introduced to them, I instantly fell in love as they were:

Highly visual
From a theoretical perspective, relatively easy to understand

Without any further ado, let’s dive into the colorful world of self-organizing maps!

What is a Self-Organising Map?

A self-organizing map is a 2D representation of a multidimensional dataset.

The output of the SOM gives each data input a position on a grid. The grid is where the map idea comes in.

It is deemed self-organizing because the SOM algorithm lets the data itself determine where each point sits on the map.

The image below is an example of a SOM.

Source: https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/

While creating the output map, the algorithm compares each input vector with the neurons on the grid, so that similar inputs end up close together on the map.

Looking at K-Means Clustering

To help me explain this further let’s take a look at K-Means Clustering.

This description of K-Means is taken from my previous post on clustering algorithms.

The K-Means clustering algorithm is an iterative process where you are trying to minimize the distance of the data point from the average data point in the cluster.

Put simply, you are trying to create the tightest possible clusters of data.

The K-Means algorithm clusters the data points around the means, depending on their distance from each mean.
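To make that loop concrete, here is a minimal NumPy sketch of K-Means. It is an illustration only, not the code from my earlier post: the function name `kmeans`, the toy data and the parameter values are all assumptions.

```python
import numpy as np

def kmeans(data, k, n_iterations=20, seed=0):
    """Minimal K-Means sketch: returns (means, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as the initial means.
    means = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iterations):
        # Assign every point to its nearest mean (Euclidean distance).
        distances = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each mean to the average of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                means[j] = data[labels == j].mean(axis=0)
    return means, labels

# Toy data (made up): two blobs of 50 points each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(50, 2))
data = np.vstack([blob_a, blob_b])

means, labels = kmeans(data, k=2)
print(means)   # one row per cluster mean
```

The two steps inside the loop (assign to the nearest mean, then recompute the means) are the whole algorithm. Keep them in mind, because the SOM does something very similar.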

How does K-Means relate to the SOM algorithm?

In the SOM algorithm, you are essentially following the same process as in K-Means.

However, the process differs in two distinct ways:

Instead of assigning data points to clusters, you assign each data point to a coordinate on the map
In K-Means the data points are clustered relative to the cluster mean, whereas in a SOM they are positioned relative to one another to gain representation on the map.

Cluster around a mean

So how does a SOM work?

The first thing you need to do is initialize the weights of your neurons.

(Oh yeah, you still have neurons – this is a deep learning technique after all)

As you run each input data vector through the algorithm, the point on the grid most aligned to this data (often called the best matching unit) is pulled towards the data, along with its neighbours on the grid.

Then it’s the next input vector’s turn. The grid is again pulled towards the data point where most relevant. During this process, the weights are updated.

This pulling, updating process continues, looping over the input data until you end up with a fully converged map.

This graphic below from Wikimedia shows the process.

Source: https://en.wikipedia.org/wiki/Self-organizing_map#/media/File:TrainSOM.gif
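If you prefer code to pictures, the pull-and-update loop described above can be sketched in plain NumPy. Treat this as a rough illustration under my own assumptions: the function name `train_som`, the linear decay schedule, the Gaussian neighbourhood and the default values are choices I made for the sketch, not the only way to do it.

```python
import numpy as np

def train_som(data, grid_rows, grid_cols, n_iterations=1000,
              learning_rate=0.5, sigma=1.0, seed=0):
    """Minimal SOM sketch: returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # Initialise the neuron weights randomly.
    weights = rng.random((grid_rows, grid_cols, n_features))
    # Grid coordinates of every neuron, used to measure distance on the map.
    rows, cols = np.indices((grid_rows, grid_cols))
    for t in range(n_iterations):
        # Shrink the learning rate and the neighbourhood radius over time.
        decay = 1.0 - t / n_iterations
        lr = learning_rate * decay
        radius = max(sigma * decay, 0.1)
        # Pick one input vector at random.
        x = data[rng.integers(len(data))]
        # Find the neuron most aligned with it (the best matching unit).
        dists = np.linalg.norm(weights - x, axis=2)
        bmu_row, bmu_col = np.unravel_index(dists.argmin(), dists.shape)
        # Pull the best matching unit and its grid neighbours towards the input.
        grid_dist_sq = (rows - bmu_row) ** 2 + (cols - bmu_col) ** 2
        influence = np.exp(-grid_dist_sq / (2 * radius ** 2))
        weights += lr * influence[:, :, None] * (x - weights)
    return weights

# Toy data (made up): 200 vectors with 3 features, already in [0, 1].
rng = np.random.default_rng(1)
data = rng.random((200, 3))
weights = train_som(data, grid_rows=5, grid_cols=5)
print(weights.shape)
```

The `influence` term is what makes the map organise itself: neurons near the winner on the grid move a lot, distant ones barely move at all.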

And that’s it!

Now you have some idea of the theory, let’s look at the step-by-step implementation of SOMs using Python.

How to implement a SOM in Python – step by step

I am probably going to sound like a broken record here if you have read any of my other tutorials buuuttttt…..

Step 1 of the implementation is to clean your data!

Step 1: Clean the data

You need to follow the standard process of data wrangling to make sure that the output of your algorithm makes sense.

The steps include, but are not limited to:

Check for outliers in the data that could skew the results. The describe function in pandas helps here: it gives you summary statistics for the data you are using, so you can see whether the values look sensible.
Remove unnecessary columns from the data using iloc and pandas
Standardize column values
Replace missing data points with the average value for that column (you could remove the row instead, but imputing is generally seen as better because it avoids losing data)
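Here is a small pandas sketch of the describe, iloc and mean-imputation steps from the list above. The dataset is entirely made up for illustration (the column names `customer_id`, `age` and `income` are hypothetical), so swap in your own DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names and values are made up.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "age": [25.0, 32.0, np.nan, 41.0, 29.0],
    "income": [30000.0, 42000.0, 39000.0, 1000000.0, np.nan],
})

# Summary statistics: a max far above the 75% quartile hints at an outlier.
print(df.describe())

# Drop the identifier column by position with iloc (keep columns 1 onwards).
features = df.iloc[:, 1:]

# Replace missing values with the mean of their column.
features = features.fillna(features.mean())

print(features)
```

Notice that the suspicious income value drags the column mean up, and that mean is then used to fill the gap. In practice you would probably want to deal with outliers before imputing.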

You will then need to apply feature scaling to the data so that no single feature biases the output.

What is Feature scaling?

If your input variables are on very different scales, you may need to rescale them so that one variable does not dominate the result. Not all algorithms need this, so check the documentation to see if yours does.
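One common option is min-max scaling, which squeezes every column into the range 0 to 1. The sketch below does it by hand in NumPy so you can see the arithmetic. The helper name `min_max_scale` and the sample numbers are assumptions for illustration; scikit-learn's `MinMaxScaler` does the same job if you already use that library.

```python
import numpy as np

def min_max_scale(values):
    """Rescale each column of a 2D array to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Avoid dividing by zero for columns where every value is the same.
    col_range[col_range == 0] = 1.0
    return (values - col_min) / col_range

# Made-up example: an age column and an income column on very different scales.
raw = np.array([[25, 30000],
                [32, 42000],
                [41, 90000]])
scaled = min_max_scale(raw)
print(scaled)
```

After scaling, a difference of 16 years and a difference of 60,000 in income both span the full 0 to 1 range, so neither dominates the distance calculation.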

Step 2: Download the code for implementing SOMs

Good news! Some beautiful human has created source code that you can use and download for free.

This starter code is wonderful because it means you don’t need to code all of the complex background mathematics behind the SOM algorithm.

Step 3: Initialise the weights and run

Once you have downloaded the code and set it up in your working directory, then it’s time to start creating your SOM.

As we discussed in the theory section, the first step is initializing the weights of the neurons.

Once that’s done, you can train the algorithm.

code to import mini SOM and train self organising map

Step 4: Creating your grid

To create the grid you will first need to import some plotting functions from pyplot.

Once imported, you then use these to create your map.

First plot the matrix of distances between neighbouring neurons, then add any relevant markers, color code as you wish and done!

SOM Visualisation — Plotting the SOM grid
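A plot along these lines can be sketched with matplotlib as follows. The helper name `plot_som_grid`, the two-class labels and the random stand-in arrays are assumptions for illustration. With MiniSom you would pass `som.distance_map()` as the matrix and `som.winner(row)` for each data row as the winners.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_som_grid(distance_map, winners, labels):
    """Draw the distance matrix and mark each sample on its winning neuron."""
    markers = ["o", "s"]   # one marker shape per class label (0 or 1)
    colors = ["r", "g"]    # one edge colour per class label
    fig, ax = plt.subplots()
    # Background: each cell shows how far a neuron is from its neighbours.
    mesh = ax.pcolor(distance_map.T, cmap="bone")
    fig.colorbar(mesh, ax=ax)
    for (i, j), label in zip(winners, labels):
        # Offset by 0.5 so the marker sits in the centre of the cell.
        ax.plot(i + 0.5, j + 0.5, markers[label],
                markeredgecolor=colors[label], markerfacecolor="None",
                markersize=10, markeredgewidth=2)
    return fig

# Stand-in inputs (made up) so the sketch runs on its own.
rng = np.random.default_rng(0)
distance_map = rng.random((6, 6))
winners = [(0, 0), (2, 3), (5, 5)]
labels = [0, 1, 0]

fig = plot_som_grid(distance_map, winners, labels)
```

Cells with a high distance value are neurons that sit far from their neighbours in feature space, which is often where the unusual data points land.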

Step 5: Visualising the data

This final step is to use pyplot’s show function to see the grid.

show()

And that’s everything!

I could have grouped steps 4 and 5 to be fair, but I like the number 5.

Using the source code it’s very simple to create your own self-organizing map.

This technique can be great for quickly analyzing large datasets with multiple variables.

However, it is not without its limitations.

Want some more tips on creating graphs with python? This post on classification algorithms has a couple of cool plots to try.


Limitations of Self Organising Maps

The limitations of the SOM have been much discussed in the literature.

While I love the ease with which a SOM can be understood, and the visualization, there are issues.

Some of these issues I have outlined below:

The algorithm does not build a generative model for the data
The whole system relies on a predefined distance in feature space
Though the code itself is quick to write, the algorithm is slow to train
Sorry to tell you this, but it is not as intuitive as you think. While neurons may appear close on the map (topological proximity), in reality they may be far apart in feature space
It is challenging to process categorical or mixed data effectively
Does not work well on small data sets – this could be said for many machine learning algorithms but it is especially true with SOMs

Despite the SOM’s drawbacks, I still enjoy using this algorithm to help me understand some of the big-picture analysis around my dataset.

An overview of SOMs

To conclude, here is a summary of everything we have discussed today around self-organizing maps:

Unsupervised deep learning
The name comes from the output of the algorithm which maps multidimensional data onto a 2D grid
Works by iterating over data and pulling the grid towards the most relevant data
Use the source code found online to implement with ease in python
Has some limitations, including slow training time and reliance on a predefined distance in the feature space
