Everything You Need To Know For Data Preprocessing

Otherwise known as, Machine Learning Admin

There is a lot of data preprocessing admin involved in deploying machine learning algorithms.

This article covers the essential concepts you need to understand to be effective in machine learning. You will learn how to structure your data and pre-process it so that you don’t get caught out by simple errors.

If you are a beginner programmer, machine learning admin can really impact your overall efficiency when implementing algorithms.

There are a lot of different topics to master. The technical considerations you have to deal with before you can implement a machine learning algorithm can stop you from running experiments at all.

But don’t worry, all is not lost.

Before you throw away your code and move to live the rest of your life in a cave away from syntax errors, read this article.

Time to learn how to organize data using Python.

Number 1: Understand Data types

The first thing to get your head around is the different data types.

There are two classes of data you want to think about:

Mutable Data: Data that can be updated or changed once it has been created
Immutable Data: Data that cannot be edited once it has been created
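
As a quick illustration of the difference, here is a minimal sketch using a list (mutable) and a tuple (immutable). The variable names are made up for the example.

```python
# A list is mutable: you can change it in place after it is created.
scores = [1, 2, 3]
scores[0] = 99
print(scores)

# A tuple is immutable: trying to change it raises a TypeError.
point = (1, 2, 3)
try:
    point[0] = 99
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(tuple_changed)
```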

Not too complicated so far, is it?

In addition to understanding the difference between mutable and immutable data, there are four other types of data to know about.

Data type	Description
Integer	An integer is a whole number, i.e. one that doesn’t have a decimal point
Float	A float is a number that does have a decimal point, or one that is not whole. You can choose how many decimal places to round it to
String	A string is text-based data, for example a sentence. In Python, you surround a string with quotation marks to use it.
Image	You can use an image as data for machine learning experiments. By using CNNs to work out the important features of the image, you can then use images as a data source in other machine learning experiments.

Number 2: Understand Data structures

Data can also take on many different structural forms.

I have summarised some of the most common data structures used in machine learning below.

Data Structure	Definition	Example
List	Mutable Python objects defined in square brackets. Can be iterated over in sequence, sliced and added to.	['hello', 1, 'tree', 1, 1989]
Tuple	Immutable Python objects defined in round brackets. Similar to lists in that you can iterate over them as a sequence; however, they cannot be changed	('hello', 1, 'tree', 1, 1989)
Sets	An unordered collection, similar to a list but with duplicate entries automatically removed	{'hello', 1, 'tree', 1989}
Dictionary	A mutable data structure made up of keys and values (it keeps insertion order from Python 3.7 onwards). You use the keys to look up the values	{'Hello': 1, 'Tree': 1989}
Numpy Array	Can be thought of as numpy’s version of a list. It is a grid of values, all of one type, that can have multiple dimensions. The shape of the array is defined by the size along each dimension.	Tutorial
Pandas Series	A one-dimensional array of data of any type. Can be indexed	Tutorial
Pandas Dataframe	A two-dimensional data structure in pandas. A mutable data type that can feature any type of data. Described in the pandas documentation as a ‘dict-like container for Series objects’.	Tutorial
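
To make the table concrete, here is a small sketch that creates one of each structure. The values in the numpy and pandas examples are invented for illustration.

```python
import numpy as np
import pandas as pd

my_list = ['hello', 1, 'tree', 1, 1989]      # mutable, keeps duplicates
my_tuple = ('hello', 1, 'tree', 1, 1989)     # immutable
my_set = {'hello', 1, 'tree', 1, 1989}       # the duplicate 1 is dropped
my_dict = {'Hello': 1, 'Tree': 1989}         # look up values by key

my_array = np.array([[1, 2, 3], [4, 5, 6]])  # 2 rows x 3 columns, one type
my_series = pd.Series([10, 20, 30], name='sales')
my_frame = pd.DataFrame({'year': [2019, 2020], 'sales': [100, 150]})

print(len(my_set))
print(my_dict['Tree'])
print(my_array.shape)
print(my_frame.shape)
```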

Structuring data for data preprocessing

Once you know the different data structures used in machine learning, you can begin to learn how to manipulate them for analysis.

To ensure that your algorithm runs efficiently, you may wish to remove certain columns or rows.


In some cases, grouping the data by a certain variable can help you get a better visualisation.

It is also useful to understand how to split and combine your data so that you can create effective input variables for your algorithm.

1. Slicing data with pandas in python

Slicing data is what it sounds like.

During the slicing process, you are cutting the data up. Then, sometimes, you stick it back together. To slice your data you can use two pandas indexers, loc and iloc.

Note: to use these functions your data must be structured in a pandas dataframe. If it is not, you will need a different set of functions for modifying and shaping arrays. To get awesome tips on how to do this, try this article.

loc and iloc are the selection operations pandas gives you for picking out particular rows and columns of a dataframe.

The below table outlines the differences between loc and iloc.

Operation Used	How it works
loc	Selects data based on labels
iloc	Selects data based on integer position (the index number)

Here are some examples of how you can slice and rearrange data using these functions.

loc example: here you use loc to select columns by their column header.
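
A minimal sketch of label-based selection is shown here. The dataframe, its column names and its row labels are all made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'name': ['Ann', 'Bob', 'Cat'],
        'age': [31, 45, 27],
        'city': ['Leeds', 'York', 'Bath'],
    },
    index=['a', 'b', 'c'],
)

# All rows, but only the 'name' and 'age' columns, selected by label.
name_age = df.loc[:, ['name', 'age']]

# Rows labelled 'a' through 'b'. With loc the end label is included.
first_two_ages = df.loc['a':'b', 'age']

print(name_age)
print(first_two_ages)
```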

iloc example: here you select rows and columns by their index number rather than by their label, therefore you use iloc.
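
A matching sketch for position-based selection, again on an invented dataframe:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'name': ['Ann', 'Bob', 'Cat'],
        'age': [31, 45, 27],
        'city': ['Leeds', 'York', 'Bath'],
    },
    index=['a', 'b', 'c'],
)

# First two rows and first two columns by position.
# With iloc the end position is excluded, as in normal Python slicing.
top_left = df.iloc[0:2, 0:2]

# The last row, whatever its label happens to be.
last_row = df.iloc[-1]

print(top_left)
print(last_row)
```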

You will find that the majority of your algorithms take their data in as a NumPy array. If you want to convert your data frames into the correct format for a NumPy array input, then you can add .values onto the end of your code. The pandas documentation now recommends the equivalent .to_numpy() method for the same job.
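
Here is a short sketch of the conversion on a made-up dataframe, showing that both spellings return the same NumPy array.

```python
import pandas as pd

df = pd.DataFrame({'height': [1.6, 1.8], 'weight': [60.0, 82.5]})

X = df.values          # the approach described above
X_alt = df.to_numpy()  # the newer, recommended equivalent

print(type(X))
print(X.shape)
```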

Grouping data with groupby

There are many cases where you may wish to group your data by a certain characteristic before analyzing or visualizing.

If you want to group data in a pandas dataframe by a certain variable, you can use the groupby function.


For example, assume you have a sales dataset in which one of your variables is the year. If you want to run an analysis looking at total sales by year, you can do this with a groupby function on the Year column.

When you use the groupby function you create a groupby object of your grouped data.

This groupby object can then be used in a variety of different operations. For example, you can calculate the mean, sum, and count of the grouped values by the group.
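
The sales-by-year example can be sketched like this. The figures are invented, and only the Year column name comes from the description above.

```python
import pandas as pd

sales = pd.DataFrame({
    'Year': [2018, 2018, 2019, 2019, 2019],
    'Sales': [100, 150, 200, 50, 250],
})

# groupby returns a groupby object; nothing is calculated yet.
grouped = sales.groupby('Year')

# Aggregations are then run per group.
total_by_year = grouped['Sales'].sum()
mean_by_year = grouped['Sales'].mean()
count_by_year = grouped['Sales'].count()

print(total_by_year)
```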

Combining data using concat, merge and append

You can also combine your data to form a new data frame. This combination process is useful if you want to pull out the relevant columns and rows to run a smaller analysis or visualization on certain data.

You can also use combining to help you run experiments on data from different tables.

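You can combine data using the concat, merge and append functions on pandas data frames.

Concat:

Stacking two or more sets of data along an axis, either on top of each other (rows) or next to each other (columns)

Merge:

Combining two data sets by matching values in shared key columns, much like a database join or a spreadsheet lookup.

Append:

Adding more data to the end of a list or array. Note that DataFrame.append was removed in pandas 2.0, so use concat to add rows to a data frame.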
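
Here is a minimal sketch of concat and merge on three small, made-up tables. The table and column names are assumptions for the example.

```python
import pandas as pd

jan = pd.DataFrame({'customer_id': [1, 2], 'sales': [100, 200]})
feb = pd.DataFrame({'customer_id': [2, 3], 'sales': [150, 300]})
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'country': ['UK', 'FR', 'DE'],
})

# concat along rows: feb is stacked underneath jan.
stacked = pd.concat([jan, feb], ignore_index=True)

# concat along columns: the two tables sit next to each other.
side_by_side = pd.concat([jan, feb], axis=1)

# merge: look up each customer's country using the shared key column.
merged = stacked.merge(customers, on='customer_id', how='left')

print(merged)
```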

Splitting a corpus of data

Another preprocessing step you may wish to complete if you are working on a natural language processing problem is splitting your data.

In this scenario, you will use the split function within your data preprocessing step to make it easier to process the text you want to analyze.

The split function breaks down the text into individual words that can then be used in the algorithm.

Here is an example of how to use split to separate out words in a tweet during sentiment analysis.
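
A minimal sketch using Python's built-in string split on an invented tweet. Note that split only breaks on whitespace, so punctuation stays attached to the words and usually needs a separate cleaning step.

```python
tweet = "Loving the new phone, best purchase this year!"

# Lower-case the text first so 'Loving' and 'loving' count as the same word,
# then split on whitespace.
words = tweet.lower().split()

print(words)
```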

You can see more examples of the data preprocessing steps involved in natural language processing in this article.

Creating the correct Data shape

Once you have managed to get your data ready for your preprocessing you might be feeling pretty good.

You confidently move forward into using functions in sklearn to finish your data preprocessing and then BAM

You have been hit with an error – ‘data input, not in expected shape, was expecting 2D array and input is 1D array’

Or something equivalent.

Do not panic. Just as you mastered slicing and dicing, you can manage shaping.

Shaping your data is an important part of any machine learning project.

Different machine learning algorithms and operations take data formatted into different, multi-dimensional shapes. This is because the algorithms are designed to process data of a certain structure.

Each variable is expected to be found in its correct place within your overall matrix of data. If it is not, then your algorithm (or function) cannot work.

Don’t worry though, this is pretty easy to fix once you get the hang of it.

You can understand the shape of your data and update it as required using the shape and reshape functions.

Ready to dive in? Let’s go!

How to use the Shape function

The shape function is used to help you find out the shape of your data in pandas and numpy. Strictly speaking it is an attribute, so you write it without brackets.

That’s right – it is the same for both!

When you get the shape of your data, the output you receive will be in one, two or three dimensions.

The first dimension is equivalent to the number of values (i.e. the number of users or rows of data).

The second dimension relates to the number of features (i.e. numbers of columns or features).

The third dimension relates, as I visualize it, to the number of layers in the data (i.e. for an RNN the third dimension is the layers of data for your different timesteps).

If the number of features (columns) is 1 then you may get an output that does not show a second dimension. This 1 featured data shape can cause errors in processing when it is read by different functions.
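
The sketch below checks the shape of some made-up data in one, two and three dimensions. The column names and sizes are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [31, 45, 27],
    'income': [30000, 52000, 41000],
})

# Two dimensions: 3 rows and 2 features.
print(df.shape)

# A single feature pulled out on its own has no second dimension.
one_feature = df['age'].to_numpy()
print(one_feature.shape)

# A three-dimensional array also reports three sizes.
stacked = np.zeros((3, 5, 2))
print(stacked.shape)
```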

You can combat this issue using the reshape function.

How to use the Reshape function

Once you know the shape of your data you can shift it around and reshape it to get it into the right format.

The reshape function takes your data and moves it around within the dimensions without altering the data itself.

Here are some examples of how to use it.

Checking the shape and then changing it with reshape
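
A minimal sketch of that check-then-reshape step on a made-up one-feature array. Passing -1 asks numpy to work out that dimension for you.

```python
import numpy as np

ages = np.array([31, 45, 27])
print(ages.shape)                 # one dimension only

# Turn the 1D array into a single column: 3 rows, 1 feature.
ages_column = ages.reshape(-1, 1)
print(ages_column.shape)

# Or into a single row: 1 row, 3 features.
ages_row = ages.reshape(1, -1)
print(ages_row.shape)
```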

Dealing with NaN values – very important in data preprocessing!

Wonderful news, you are powering through the data preprocessing section of setting up machine learning algorithms.

Now it is time to start to deal with missing data.

The challenge with missing data is that it can:

Skew your results
Cause errors in your algorithms

The way I deal with missing data is to follow this simple process.

Missing data decision tree: some things to look at when you have missing data

Identify the missing data
Make a judgment call
Implement the solution
Check for missing data again

To identify the amount of missing data in each of your columns, you can count the null values per column, for example with isnull().sum().
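
Here is a short sketch of that count on an invented dataframe, both as a raw number and as a percentage of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [31, np.nan, 27, 52],
    'city': ['Leeds', None, None, 'Bath'],
    'income': [30000, 52000, 41000, 38000],
})

# Number of missing values in each column.
missing_count = df.isnull().sum()

# Share of missing values in each column, as a percentage.
missing_pct = df.isnull().mean() * 100

print(missing_count)
print(missing_pct)
```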

Once you know how much missing data there is in each column, you can make a call on what to do with that column or data point.

Judgment Call 1: Delete the column

If there is a large amount of missing data you may choose to remove the column entirely.

For example, if one of your variables has >20% missing data and you have not got any indicators that it is an important variable, it may be easier to remove it.

One of the benefits of removing certain columns is that having fewer input variables can make your analysis more efficient.

The challenge is that you don’t want to remove important data.

This is where the judgment call comes in.

To summarise, however: if there is a lot of data missing in a column, you won’t get much from it.

To remove a column you can just use the slicing functions (loc and iloc) we discussed previously.
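
As a sketch, the 20% rule of thumb from above can be combined with loc like this. The dataframe is made up, and the threshold is a judgment call rather than a fixed rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [np.nan, np.nan, 3, 4, 5],   # 40% missing
    'c': [10, 20, 30, 40, 50],
})

# Share of missing values per column.
missing_share = df.isnull().mean()

# Keep only the columns with no more than 20% missing data.
columns_to_keep = missing_share[missing_share <= 0.20].index
reduced = df.loc[:, columns_to_keep]

print(list(reduced.columns))
```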

Judgment Call 2: Drop the values

The next option you have is to just remove the rows with missing data from the dataset.

You may want to do this if you have just a few missing data points within a large dataset. In this scenario removing some of your instances should not impact your overall analysis.

To do this quickly and effectively you can use dropna.
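
A minimal sketch of dropna on invented data. By default it removes any row containing a missing value; the subset argument limits the check to the columns you name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [31, np.nan, 27],
    'income': [30000, 52000, np.nan],
})

# Drop every row that has at least one missing value.
complete_rows = df.dropna()

# Drop rows only when 'age' is missing.
age_known = df.dropna(subset=['age'])

print(len(complete_rows), len(age_known))
```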

Judgment Call 3: Replace with the average or another value

The third and final option is to replace missing values with alternatives.

One reason you want to do this is if you have only a few missing values and you have a smaller dataset that is more likely to be impacted by removing values.

Depending on the data missing from the column you can look at replacing it using:

New data name i.e. replace NaN with ‘Unknown’ for categorical data
Replace with the mean value for numeric data. The advantage of this is that the average of the column stays the same, although the spread of the values becomes a little smaller
Replace with an alternative numeric value
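
The three replacement options can be sketched with fillna as follows. The columns and the fill values are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['Leeds', None, 'Bath'],
    'age': [30.0, np.nan, 40.0],
    'visits': [2.0, np.nan, 5.0],
})

# Categorical data: replace NaN with a new category name.
df['city'] = df['city'].fillna('Unknown')

# Numeric data: replace NaN with the column mean.
df['age'] = df['age'].fillna(df['age'].mean())

# Numeric data: replace NaN with an alternative value of your choice.
df['visits'] = df['visits'].fillna(0)

# Check for missing data again.
print(df.isnull().sum())
```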

Dealing with outliers

Outliers in your data can cause big problems in your analysis.

The main issue with outliers in data is that they can distort the true results or predictions of your algorithm.

One of the best ways to deal with outliers is just to remove them.

This is a piece of code I found in a Hacker Noon tutorial that I have adapted and used for removing outliers ever since.
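
The snippet below is not the adapted code mentioned above. It is a minimal sketch of one common approach, the interquartile range (IQR) rule, which is an assumption about the kind of filter you might use. The data and the 1.5 multiplier are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 12, 11, 13, 12, 11, 500]})

# Work out the middle 50% of the data.
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles counts as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows inside the limits.
filtered = df[df['price'].between(lower, upper)]

print(filtered)
```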

Splitting the training and test set data

Ok, we are moving on to the machine learning specific sections of data preprocessing.

To ensure that your algorithm is training correctly and can work on real-world data, you need to split it into a training and a test set.

These training and test sets both hold a set of X and Y data for you to experiment with.

There are 3 main ways you can split out your training and test set data.

All three are detailed in the table below.

Method	What it does
TrainTestSplit (train_test_split)	Splits the data randomly into one training set and one test set. You can choose the percentage of test data
KFold	Splits the data into k folds. Each fold takes one turn as the test set while the rest is used for training, so every row is tested once
CrossValidation (cross_val_score)	Runs the same process as KFold, including training and scoring the algorithm, in one code step
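
Here is a minimal sketch of all three in scikit-learn. The random data, the 20% test size, the 5 folds and the choice of LogisticRegression are all assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Made-up data: 100 rows, 3 features, and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# 1. A single random split with 20% of the rows held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. KFold: five folds, each used once as the test set.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in kfold.split(X)]

# 3. cross_val_score: splits, fits and scores in one call.
scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)

print(X_train.shape, X_test.shape)
print(fold_sizes)
print(scores.mean())
```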

Now time to move onto feature scaling.

Feature scaling

Now, depending on the type of training/test set split you implemented, you may need to run feature scaling before you split out the data.

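If you are using KFold or Cross-Validation, the algorithm is fitted at the same time as the data is split, so you cannot scale between the two steps in the usual way. Scaling before you split is the quick route, but be aware that it lets information from the test folds leak into training. Wrapping the scaler and the algorithm together in a scikit-learn Pipeline avoids this.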

But wait, we’re getting ahead of ourselves, what is feature scaling in data preprocessing?

Feature scaling is used when you have input variables on different scales.

For example, you might have one column where the values are between 1 and 100, and another where the values are between 1200 and 8899.

To prevent the algorithm over biasing for one of the variables you need to scale them. This ensures that the different scales in the original data set do not impact the results of your experiment.

To scale the variables you can use two different scalers.

MinMaxScaler:
- Scales data to a fixed range, which is 0 to 1 by default
- Used when the data doesn’t follow a normal distribution
StandardScaler:
- Scales the data to zero mean and unit variance by subtracting the mean and dividing by the standard deviation
- Used when data follows a Gaussian distribution

Below are examples of how to code the different scaling options.
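
As a sketch, here are both scalers applied to made-up data on the two scales mentioned earlier. The scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales.
X_train = np.array([[1.0, 1200.0], [50.0, 5000.0], [100.0, 8899.0]])
X_test = np.array([[25.0, 3000.0]])

# MinMaxScaler: each training column ends up between 0 and 1.
minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)

# StandardScaler: each training column ends up with mean 0 and std 1.
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)
X_test_std = standard.transform(X_test)

print(X_train_mm)
print(X_train_std)
```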

Encoding categorical variables

The final data preprocessing step you need to know about is how to encode categorical variables.

Categorical variables include data like gender, colour, country etc etc etc.

The reason you need to do this is that some algorithms cannot handle categorical data.

There are many different ways to encode categorical data, however my favorite is get_dummies.

This is because get_dummies only uses one step while others use multiple.
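
A minimal sketch of that one step with pandas get_dummies, on an invented dataframe. Each category becomes its own indicator column.

```python
import pandas as pd

df = pd.DataFrame({
    'colour': ['red', 'blue', 'red'],
    'size': [1, 2, 3],
})

# Replace the 'colour' column with one indicator column per category.
encoded = pd.get_dummies(df, columns=['colour'])

print(encoded)
```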

This article covers the other ways to encode categorical data and how to use them in case you are interested.

Which is the best way to pre-process data?

So the way you complete the data preprocessing steps really depends upon the methods you are using.

We have already touched on the order of feature scaling versus splitting the data.

As a general rule I follow the process:

Structure data
Create the data structures I need
Remove NaNs
Encode categorical data
Split the data
Scale the features

What happens if you’re still getting errors?

If you are still getting errors and you’re not sure how to fix them then it’s time to turn to our beloved search engines and StackOverflow.

I normally follow this process to find answers to errors that come up in my code.

Copy the error message
Paste the error message into Google (or your chosen search engine)
Find an answer in the results that looks like it could be correct – usually, you’ll get them on StackOverflow
Review the options presented and test the various options

And that’s it! You are a data preprocessing master!

Summary of data structures and preprocessing

So now we have covered all of the major stages of data preprocessing.

To summarise the different processes:

Get your data structures down
Slice the data
Shape the data
Encode categorical variables
Scale the data
Deal with missing data
Decide what to do about outliers
BOSS YOUR MACHINE LEARNING PROJECTS!

I hope you now feel more confident in tackling data preprocessing! Good luck with your machine learning admin.

