They say a picture is worth a thousand words, so I have created an infographic to summarize regression analysis in machine learning.
The regression algorithms infographic is designed as a quick reminder of the basics of each algorithm.
Before I share it with you, however, I will give an overview of regression analysis.
When is regression analysis used in machine learning?
Regression algorithms are used in supervised machine learning on continuous data.
When we talk about supervised learning in ML, we mean that we have a set of training data for the algorithm to learn from. Each example in this training data contains the inputs together with the output value that was actually observed for them.
For example, the number of rooms a house has (input) and the price of the house (output).
This training data is used to teach the machine how the number of rooms and the price are related, allowing it to predict the output (the price of a house) from the input (the number of rooms).
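As a minimal sketch of this idea, the snippet below fits a straight line to a handful of room counts and prices and then predicts the price of a house it has not seen. The numbers are made up for illustration (they are not real market data), and `predict_price` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical training data: number of rooms (input) and price (output).
rooms = np.array([2, 3, 4, 5, 6], dtype=float)
price = np.array([150_000, 200_000, 250_000, 300_000, 350_000], dtype=float)

# Learn the relationship by fitting a straight line (degree-1 polynomial)
# to the training data using least squares.
slope, intercept = np.polyfit(rooms, price, deg=1)


def predict_price(n_rooms):
    """Predict a price from the line learned above (illustrative helper)."""
    return slope * n_rooms + intercept


# Predict the output for an input that was not in the training data.
predicted = predict_price(7)
print(f"Predicted price for 7 rooms: {predicted:,.0f}")
```

Because the made-up prices lie exactly on a line, the prediction simply extends that line. Real data would be noisier.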
Continuous data is where the predicted output could, in theory, hold any numeric value. For example, house price data is classed as continuous.
Non-continuous data, also known as discrete data, can by comparison only take one of a limited set of values. An example is the answer to a yes-or-no question such as "Is this item a chair?". Predicting that kind of output is a classification problem, not a regression problem.
Regression algorithms allow you to estimate the following, and to test whether the results are statistically significant:
- Impact of different variables on the outputs
- The output of a set of data inputs
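Both uses can be sketched with a linear model in scikit-learn. In the example below, the two input variables (rooms and age) and the pricing rule that generates the data are assumptions chosen so the result is easy to check. Note that this sketch only reports the fitted coefficients and a prediction; it does not compute p-values or any other significance test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: [number of rooms, age of the house in years].
X = np.array([[2, 30], [3, 10], [4, 25], [5, 5], [6, 15]], dtype=float)

# Hypothetical prices generated from a known rule so the fit can be verified:
# price = 50_000 * rooms - 1_000 * age + 20_000
y = 50_000 * X[:, 0] - 1_000 * X[:, 1] + 20_000

model = LinearRegression().fit(X, y)

# Impact of each variable: the change in predicted price per unit change
# in that input, holding the other input fixed.
rooms_effect, age_effect = model.coef_
print(f"Per extra room: {rooms_effect:,.0f}, per extra year of age: {age_effect:,.0f}")

# Output for a new set of inputs: a 4-room house that is 20 years old.
new_price = model.predict(np.array([[4.0, 20.0]]))[0]
print(f"Predicted price: {new_price:,.0f}")
```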
Now let’s take a deeper look at regression algorithms.
What are the different types of regression analysis?
There are many different types of analysis you can run using regression algorithms. These are shared in the infographic below, but I have also summarized them here.
- Linear Regression: Models the relationship between two variables as a straight line, i.e. as one changes, the other changes at a constant rate. It fits the line that minimizes the sum of the squared vertical distances between the data points and the line, a method known as ordinary least squares.
- Multivariable Regression: Similar to linear regression, except that you evaluate multiple input variables at once. The relationship does not always have to be linear.
- Polynomial Regression: Models a non-linear relationship between two variables by fitting a polynomial curve, e.g. a cubic relationship.
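- SVR: Support Vector Regression fits a function so that as many training points as possible fall within a set margin (epsilon) of its predictions. Errors inside that margin are ignored, and only the points on or outside it, the support vectors, shape the fit.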
- Decision Tree: Decision tree regression splits the data into discrete sections through a sequence of binary decisions on the input values. A prediction is then made by taking the average output value of all training data in the section where the new data point lands.
- Random Forest: Random forest regression uses the same process as decision tree regression above but builds multiple decision trees, each on a random sample of the training data. The prediction for your data point is the average of the predictions of all trees in the forest.
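To make the list above more concrete, here is a rough sketch that fits several of these model types to the same data with scikit-learn. The noisy cubic dataset and every hyperparameter value (`degree`, `C`, `epsilon`, `max_depth`, `n_estimators`) are assumptions picked for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy cubic curve.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(scale=1.0, size=200)

models = {
    "linear": LinearRegression(),
    # Polynomial regression: expand x into [1, x, x^2, x^3], then fit linearly.
    "polynomial (degree 3)": make_pipeline(
        PolynomialFeatures(degree=3), LinearRegression()
    ),
    # SVR is sensitive to feature scale, so the input is standardized first.
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.1)),
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit every model on the same data and record its R^2 score on that data.
scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)
    print(f"{name:>22}: R^2 = {scores[name]:.3f}")
```

These scores are measured on the training data itself, which flatters flexible models such as the tree and the forest. A fair comparison would score each model on data held out from training.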
For more details, I recommend Wikipedia as a great resource (link)
Now for the fun stuff – implementing regression analysis!
How do you implement regression algorithms using Python?
In this section, I have provided links to the documentation in Scikit-Learn for implementing regression.
Before you do any type of data analysis using regression algorithms, however, you need to clean your data.
This process is called data pre-processing and is essential for ensuring you get a good output from your algorithm.
Some steps to follow are:
- Check for outliers in the data that could skew the results
- Replace missing data points with the average value of that variable (this is one option, generally seen as better than removing the data point entirely)
- Feature scaling: If you have input variables on very different scales, you may need to scale them so that the variables with larger values do not dominate the result
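The three steps above can be sketched as follows. The small table of values is made up, and the interquartile-range rule used to flag outliers is just one common heuristic that I have assumed here; the right check depends on your data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: [number of rooms, floor area in square feet].
# np.nan marks a missing value; 40 rooms is a deliberately implausible entry.
X = np.array([
    [3.0, 1200.0],
    [4.0, np.nan],
    [2.0, 800.0],
    [40.0, 1500.0],
    [5.0, 2000.0],
])

# 1. Flag outliers per column with the interquartile-range (IQR) rule.
#    nanpercentile ignores the missing value when computing the quartiles.
q1, q3 = np.nanpercentile(X, [25, 75], axis=0)
iqr = q3 - q1
outlier_mask = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
outlier_rows = np.where(outlier_mask.any(axis=1))[0]
print("Rows containing an outlier:", outlier_rows)

# 2. Replace each missing value with the mean of its column.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# 3. Rescale every column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)
print(X_scaled.round(2))
```

This sketch only flags the outlier and leaves it in place. Whether to remove, cap, or keep such a row is a judgment call that depends on why the value is unusual.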
The YouTube tutorial videos #37, #38 and #39 cover some techniques for doing this here (link)
Implementing regression algorithms in Python using the Scikit-Learn library:
The first step is to import the relevant regression module, e.g. the one for linear regression.
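As a rough sketch of what that looks like for linear regression, the example below imports the estimator, fits it on a training split, and scores it on held-out data. The synthetic room and price data, the 80/20 split, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): price rises roughly linearly with rooms.
rng = np.random.default_rng(42)
rooms = rng.integers(1, 9, size=100).reshape(-1, 1).astype(float)
price = 50_000 * rooms[:, 0] + 50_000 + rng.normal(scale=10_000, size=100)

# Hold back 20% of the data to check the model on examples it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    rooms, price, test_size=0.2, random_state=0
)

# Create the model, learn from the training data, then predict.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# R^2 on the held-out data: 1.0 would be a perfect fit.
r2 = regressor.score(X_test, y_test)
print(f"R^2 on test data: {r2:.3f}")
```

Other scikit-learn regressors follow the same `fit` / `predict` / `score` pattern, so swapping in a different algorithm mostly means changing the import and the line that creates the model.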
Want to learn more about machine learning in Python? Check out my previous post on choosing a Python machine learning course.
Then, depending on the type of regression algorithm you need, the links to the documentation are below: