Classification algorithms are a powerful tool in any machine learning engineer’s arsenal.
Algorithms developed using classification machine learning have been shown to be a robust way to solve many real-world problems. Notably, classification is used in research to predict the likelihood of a person developing a disease such as cancer.
There are many different types of classification machine learning algorithms to choose from.
This article outlines the different types of classification algorithms, how they work, and how to implement them using Python.
But first, when to use classification analysis in machine learning.
When is Classification analysis used in machine learning?
Classification algorithms, like regression algorithms, are used in supervised machine learning — but on discrete (categorical) output data rather than continuous data.
When we talk about supervised learning in ML, what we mean is that we have a specific set of training data for the algorithm to learn from. This training data contains all the inputs as well as the output value of an actual incident in the data.
For example, will this person develop diabetes? Yes or No. In classification analysis, the labeled training data set will have a sample set of people and their characteristics alongside whether or not they developed diabetes.
This training data is used to teach the machine how different characteristics of a person’s genetics or lifestyle contribute to whether or not they would get diabetes. Based on these inputs (features), the model will then predict the probability they will get diabetes.
Depending on the type of classification algorithm used, this probability will be calculated differently.
Classification analysis is an example of a problem relying on a non-continuous target, otherwise known as discrete data.
It is the opposite of problems involving continuous data, for which you would use regression analysis.
Want to understand regression analysis in more detail? Check out this post.
What are the different types of classification algorithms?
There are many different types of analysis you can run using classification algorithms.
1/ Logistic Regression
Despite its name, logistic regression is a classification algorithm used to assign a probability of a point being in one class or the other.
The logistic regression algorithm plots a curve of probabilities that depends on the input variables. This S-shaped curve runs between the true and false points. Each probability is calculated as the output of the logistic (sigmoid) function, hence the name logistic regression.
Logistic regression is a linear algorithm. Points that fall above the 0.5 probability are assigned a true classification, and those below are assigned false.
Source – wikipedia logistic regression plot
When you plot logistic regression predictions, you will see a split between the true and false zones in your dataset. These areas are separated by a straight line. Hence logistic regression is a linear algorithm despite plotting a curve.
Now that took me a little while to get my head around!
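As a quick sketch, here is logistic regression in scikit-learn on a made-up, one-feature dataset (the hours-studied numbers and pass/fail labels are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (feature) vs. passed the exam (0/1 label)
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] for each input point;
# predict applies the 0.5 probability cut-off described above
probs = model.predict_proba([[2.5]])
print(probs)
print(model.predict([[2.5]]))
```

Points well inside either cluster get a confident prediction, while a point near the middle sits close to the 0.5 threshold.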
2/ K-Nearest Neighbours (k-NN)
The K-nearest neighbours algorithm uses the points closest to the data point you are trying to predict to assign a probability to it being true or false.
You have an option to choose the number of neighbors you analyze for each new data point. However, typically people look at 5.
Source – wikipedia 5k-NN plots
The algorithm will assign the likelihood of an item fitting into each class. Once this is complete, you will obtain a decision boundary as shown in the diagram above.
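A minimal k-NN sketch with scikit-learn, using the typical 5 neighbours on two made-up clusters:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two invented clusters: class 0 near the origin, class 1 near (8, 8)
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# n_neighbors is the number of nearest points polled for each prediction
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# A point near the first cluster: 3 of its 5 nearest neighbours are class 0
pred = knn.predict([[2, 2]])
print(pred)
```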
3/ Support Vector Machine Learning
Support vector machine (SVM) classification attempts to maximize the margin: the distance between the decision boundary and the closest points in each class.
Those closest points are called the support vectors, and the decision boundary is placed midway between them.
As with logistic regression, this creates a linear decision boundary.
Source – wikipedia Linear SVM plot
The graph above shows 3 options for decision boundaries using SVM.
H3 is the best option from the 3 lines shown because:
- It maximises the distance between the data points and the line
- It has the fewest incorrect assignments
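Here is a hedged sketch of a linear SVM in scikit-learn on toy data; after fitting, `support_vectors_` exposes the points that define the margin:

```python
import numpy as np
from sklearn.svm import SVC

# Two invented, linearly separable clusters
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# kernel="linear" gives the straight-line decision boundary discussed above
clf = SVC(kernel="linear")
clf.fit(X, y)

# The points closest to the boundary on each side
print(clf.support_vectors_)
```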
3.1/ Kernel Support Vector Learning
But what happens if your data set cannot be split using a straight line? Well, then you need to look into Kernel SVM.
Hold up, what on earth is a kernel?
Essentially, a kernel is a mathematical trick used to transform your data points into a higher dimension.
When I talk about dimensions I am using the word in the mathematical sense – we’re not sending your data out into the space-time continuum!
As an example, if you have data on 1 axis, it is in 1 dimension. Data plotted on 2 axes (i.e. x and y) is 2-dimensional, on 3 axes (x, y and z) is 3-dimensional, and so on.
Transforming the data into a higher dimension is useful because, in that higher dimension, it may become possible to separate the classes using a straight line or flat plane (a hyperplane).
Source: wikipedia SVM
Depending on the shape of your data you can use different kernel calculations to separate the points. Examples of the different kernels include the Gaussian (as shown above), sigmoid and the polynomial kernels.
After you have created the plane, you project the dataset back to the original dimension, where the plane becomes a (curved) decision boundary for prediction.
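To illustrate, here is a Gaussian (RBF) kernel SVM in scikit-learn on synthetic data that no straight line can split: a small inner cluster surrounded by a ring (the data is invented for this sketch):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Class 0: a tight cluster at the origin
inner = rng.normal(0, 0.3, size=(50, 2))

# Class 1: a ring of radius ~3 around it — not linearly separable from class 0
angles = rng.uniform(0, 2 * np.pi, 50)
ring = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(0, 0.2, (50, 2))

X = np.vstack([inner, ring])
y = np.array([0] * 50 + [1] * 50)

# kernel="rbf" is the Gaussian kernel mentioned above
clf = SVC(kernel="rbf")
clf.fit(X, y)
```

Swapping `kernel="rbf"` for `"poly"` or `"sigmoid"` selects the other kernels named above.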
4/ Naive Bayes
The Naive Bayes algorithm assigns a class to a data point by applying Bayes’ theorem to the features of that point.
It is known as naive because it assumes the features are independent of one another, which is rarely strictly true in practice.
To calculate the probability, the Naive Bayes classifier looks at:
- The probability of observing the features of the data point across the whole data set (the evidence)
- The proportion of the total data set that falls into each class (the prior)
- The probability of observing those features among points of a given class (the likelihood)
Once the Bayes theorem calculation is computed, you can assign the data point into a class.
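A minimal Gaussian Naive Bayes sketch in scikit-learn (the two clusters below are invented purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two invented clusters of 2-feature points
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# GaussianNB models each feature per class as an independent normal
# distribution, then combines them via Bayes' theorem
nb = GaussianNB()
nb.fit(X, y)

# Per-class posterior probabilities for a new point
print(nb.predict_proba([[1.0, 1.0]]))
```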
5/ Decision Tree Classification
Decision tree classification splits the data into discrete sections that are arrived at following a series of binary decisions.
A prediction is then made by taking the majority class of all training data in the section where the new data point lands.
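As a sketch, here is a decision tree classifier in scikit-learn on a made-up age/BMI dataset (the diabetes labels are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: [age, BMI] -> 1 if the person developed diabetes, else 0
X = np.array([[25, 20], [30, 22], [35, 21], [50, 31], [55, 33], [60, 35]])
y = np.array([0, 0, 0, 1, 1, 1])

# The tree learns a series of binary splits (e.g. "is BMI <= some threshold?")
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

print(tree.predict([[28, 21]]))
```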
6/ Random Forest Classification
Random forest classification uses the same process as decision tree classification above, but creates many decision trees (each trained on a random subset of the data) and then predicts the class for your data point by majority vote across all trees in the forest.
You have the option to choose the number of trees you create; scikit-learn’s default is 100.
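The same toy problem with a random forest in scikit-learn; `n_estimators` controls the number of trees (100 here, matching scikit-learn’s default):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same invented [age, BMI] -> diabetes data as the decision tree sketch
X = np.array([[25, 20], [30, 22], [35, 21], [50, 31], [55, 33], [60, 35]])
y = np.array([0, 0, 0, 1, 1, 1])

# 100 trees vote; the majority class wins
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

print(rf.predict([[28, 21]]))
```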
What is overfitting? And why is it an issue?
Overfitting occurs when you fit an algorithm so tightly that it classifies all of the training data correctly, noise included.
It is a problem because it forces the classifier to account for outliers in your dataset that might not be relevant when you use the model on new data.
When you overfit to the training set, the model will not predict real-world data as well.
Below is an example of a Random Forest classifier that is overfitted to the training data set.
How do you evaluate which model to use?
There are two ways to evaluate your classification model:
- Confusion Matrix
- CAP curve analysis
The confusion matrix analyses the number of points at which your model prediction is ‘confused’ or predicts incorrectly.
Fewer incorrect predictions = better model
Example of a confusion matrix showing 9 total incorrect predictions
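For illustration, scikit-learn’s `confusion_matrix` builds this table directly; the true and predicted labels below are made up:

```python
from sklearn.metrics import confusion_matrix

# Invented labels: what actually happened vs. what the model predicted
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes;
# the diagonal counts correct predictions, off-diagonal counts mistakes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Total incorrect predictions = everything off the diagonal
incorrect = cm.sum() - cm.trace()
print(incorrect)
```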
CAP Curve Analysis
CAP (cumulative accuracy profile) curve analysis looks at the difference between the results you would expect from a perfect model and from a random model. The model you have created is then evaluated relative to the perfect model.
There are two ways to complete CAP curve analysis.
- Assess the area under the curve of your model vs. a perfect model – this can be quite time-consuming to compute
- Look at the 50% line on the X-axis and see where it equates to on the Y-axis.
Image Source = Machine Learning A-Z: Hands on Python and R Machine Learning; Cap Curve Analysis
The value on the Y-axis in CAP curve analysis using the 50% technique is then used to evaluate the classifier:
- <60% – model is rubbish, try again
- 60-70% – model is poor
- 70-80% – model is good
- 80-90% – model is excellent
- 90%+ – model is too good to be true, and you probably have overfitting – try again
What are the differences between classification and regression analysis?
| | Classification | Regression |
|---|---|---|
| Problems used on | Identifying the likelihood a data point sits in one class or another | Predicting a value based on a number of features |
| How to evaluate the model? | Confusion matrix and CAP curve analysis | P-values and adjusted R-squared |
How do you implement classification algorithms using python?
In this section, I have provided links to the documentation in Scikit-Learn for implementing classification.
Before you do any type of data analysis using classification algorithms however you need to clean your data.
This process is called data pre-processing and is essential for ensuring you get a good output from your algorithm.
Some steps to follow are:
- Check for outliers in the data that could skew the results
- Replace missing data points with the average value for that feature (this is one option, generally seen as better than removing the data point entirely)
- Feature scaling: if you have input variables on very different scales, you may need to scale them so that one variable does not dominate simply because of its scale
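The missing-value and scaling steps above can be sketched with scikit-learn’s SimpleImputer and StandardScaler (the toy matrix is invented):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, with one missing value
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 600.0]])

# Replace missing values with the column (feature) mean
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# Rescale each feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)
print(X_scaled)
```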
The Youtube tutorial videos #37, 38 and 39 cover some techniques to do this here (link)
Implementing classification algorithms in python using the Scikit-Learn module:
The first step is to import the classification module you need, e.g. Naive Bayes.
Then depending on the type of classification algorithm you need below are links to the documentation:
- Logistic Regression
- K-Nearest Neighbours
- Support Vector Machine Learning
- Naive Bayes
- Decision Tree Classification
- Random Forest Classification
To create the graphs of the decision boundary, you can use the following code:
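A common sketch for this (assuming any fitted 2-feature scikit-learn classifier; the data here is made up) is to classify every point on a grid and shade the regions with matplotlib:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Toy 2-feature data; swap in whichever fitted classifier you are using
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)

# Build a fine grid covering the feature space and classify every grid point
xx, yy = np.meshgrid(np.arange(0, 7, 0.05), np.arange(0, 7, 0.05))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shade the predicted class regions, then overlay the training points
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.savefig("decision_boundary.png")
```

The boundary between the shaded regions is the classifier’s decision boundary.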
So there you have it. That is how to implement classification algorithms using python.
This is a powerful analysis technique that can be used on multiple machine learning problems. What problems will you use it to solve?