Finding good data sets for machine learning is a nightmare. When surveyed, 22% of data scientists say that access to good quality data is their biggest challenge.
So when you’re trying to get practice with machine learning, how do you find data? When you do find data, how do you know if it is good?
Why data is so important
High-quality data sets for machine learning are the most important part of your project. Without good data, any predictions your algorithm makes will be of poor quality.
As the old saying goes, shit in, shit out.
You want to make sure that the data you are using in your machine learning algorithms reflects the real world. If you don’t do this, you run the risk of your predictions not being useful.
You need to make sure that you are using data sets for machine learning that are of high quality. The data needs to be an accurate representation of the problem that you are trying to solve.
How to think about data sets for machine learning
Below are some pointers for how to think about identifying data sets for machine learning and data collection.
To ensure you are producing good predictions, think about:
- The problems you want to solve – while you are learning, think strategically about what problems people are actually facing and how you could solve them.
- What kind of data would you get if you were in that situation?
- Where do you find this data?
- How is the data structured?
- What do you need in place to process this data?
Put yourself in the shoes of someone who needs to solve this problem.
When you do this exercise, think about:
- What resources do they have?
- Where can they get data?
- What is that likely to look like?
Data science isn’t just about knowing the core skills. You need to be able to work in the real world with the data you can access.
Where to find good data
There are plenty of sources out there where you can find data. But not all of this data is good data.
Make sure that you read the information that comes with the dataset before you start using it. Take a look at the columns that are provided and understand some of the features of your dataset before you put it into a machine learning algorithm.
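As a rough illustration of what "taking a look at the columns" can mean in practice, here is a minimal sketch that counts missing values, distinct values, and whether a column looks numeric. It assumes the data arrives as CSV text; the function name `summarize_columns` and the sample rows are made up for this example and do not come from any particular dataset.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize_columns(csv_text):
    """Return simple per-column stats for a CSV given as a string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {}
    for name in reader.fieldnames or []:
        # Treat empty or whitespace-only cells as missing.
        values = [(row.get(name) or "").strip() for row in rows]
        present = [v for v in values if v != ""]
        summary[name] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "numeric": bool(present) and all(_is_number(v) for v in present),
        }
    return summary


# Hypothetical sample data, invented for illustration only.
SAMPLE = """age,city,income
34,Leeds,42000
,York,38000
51,Leeds,
29,,51000
"""

if __name__ == "__main__":
    for column, stats in summarize_columns(SAMPLE).items():
        print(column, stats)
```

A summary like this will not tell you whether the data is good, but it quickly shows which columns have gaps or unexpected types before you train anything on them.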
Some sources of data sets for machine learning include:
- Papers that have been published with links to data – this is a great option but try to run your own experiments on the data and not just copy what the author did.
- Open-source providers like ImageNet
- Kaggle data sets that are free to use
- Machine learning repositories, such as the UCI Machine Learning Repository
You could also look at scraping data from websites or a document that you find online.
However, be careful when you do this. Make sure that you are not infringing any copyright, that you are allowed to use the data, and that using it does not violate anyone’s privacy.
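One small, practical check before scraping is to see what the site's robots.txt says. The sketch below uses Python's standard `urllib.robotparser` module; the robots.txt content, bot name, and URLs are hypothetical. Keep in mind that robots.txt only expresses the site's crawling preferences. It does not answer copyright or privacy questions, so you still need to read the site's terms of use.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check whether the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, invented for illustration only.
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    bot = "my-practice-bot"  # made-up user agent name
    print(allowed_by_robots(EXAMPLE_ROBOTS, bot, "https://example.com/data/table.html"))
    print(allowed_by_robots(EXAMPLE_ROBOTS, bot, "https://example.com/private/users.html"))
```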
There are certain techniques used by data scientists that become particularly helpful if you have just a small dataset. You can use pre-trained models to help you make better predictions on your new data.
We are of course talking about transfer learning!
Transfer learning when you have smaller datasets
Transfer learning allows you to take advantage of models that have been trained on large datasets. Using these models as a base, you can then fine-tune them to gain insights from the new data sets for machine learning that you have collected.
Some popular pre-trained models and embeddings for transfer learning include:
- Word2Vec – NLP
- GloVe – NLP
- Universal Sentence Encoder by Google – NLP
- ResNet34 – Computer Vision
- ResNet50 – Computer Vision
- VGG-16 – Computer Vision
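To make the idea concrete, here is a toy sketch of transfer learning with word embeddings. The "pre-trained" vectors below are invented two-dimensional stand-ins for something like Word2Vec or GloVe, and the four-sentence dataset is made up. The pattern is what matters: the base (the embeddings) stays frozen, and only a small classifier on top is trained on your own small dataset.

```python
import math

# Stand-in for pre-trained word vectors such as Word2Vec or GloVe.
# These 2-d values are invented for illustration; real vectors have
# hundreds of dimensions and are loaded from published files.
PRETRAINED = {
    "great": (0.9, 0.1), "good": (0.8, 0.2), "love": (0.9, 0.0),
    "bad": (-0.8, 0.1), "awful": (-0.9, 0.2), "hate": (-0.9, 0.0),
    "film": (0.0, 0.9), "book": (0.0, 0.8),
}


def embed(text):
    """Frozen base: average the pre-trained vectors of the known words."""
    vectors = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def train_head(examples, epochs=200, lr=0.5):
    """Train only a small logistic-regression head on top of embed()."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = embed(text)
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - label
            # Gradient step on the head; PRETRAINED is never modified.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(model, text):
    """Return 1 for positive, 0 for negative."""
    weights, bias = model
    z = sum(w * xi for w, xi in zip(weights, embed(text))) + bias
    return 1 if z > 0 else 0


# A deliberately tiny labelled set: 1 = positive, 0 = negative.
SMALL_DATASET = [
    ("great film", 1), ("good book", 1),
    ("bad film", 0), ("awful book", 0),
]

if __name__ == "__main__":
    model = train_head(SMALL_DATASET)
    # "love" and "hate" never appear in SMALL_DATASET; the head can still
    # score them because the frozen vectors place them near words it has seen.
    print(predict(model, "love film"), predict(model, "hate book"))
```

With real pre-trained models the same split applies: you keep most of the large model fixed and train a small part of it on the data you collected.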
What to do once you have your data sets for machine learning
Once you have your data and are ready to start experimenting, it’s time to clean it up.
If you don’t clean up your data sets for machine learning, no matter how comprehensive they are, you will likely run into issues. These are amplified when you start trying to use the data for machine learning.
There are several standard data processing steps that you can use to prepare your data for a machine learning algorithm. You can find details of the steps in this article (link).
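This post does not list the steps itself, so the following is only a sketch under my own assumptions about what a first cleaning pass often includes: trimming stray whitespace, dropping rows that are missing a required field, and removing exact duplicates. The function name `clean_rows` and the sample records are hypothetical.

```python
def clean_rows(rows, required):
    """Trim whitespace, drop rows missing a required field, remove duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        # Normalise stray whitespace in every text field.
        tidy = {key: value.strip() if isinstance(value, str) else value
                for key, value in row.items()}
        # Skip rows where a required field is empty or absent.
        if any(tidy.get(field) in ("", None) for field in required):
            continue
        # Skip exact duplicates of a row that was already kept.
        fingerprint = tuple(sorted(tidy.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(tidy)
    return cleaned


# Hypothetical raw records, invented for illustration only.
RAW = [
    {"name": " Ada ", "score": "91"},
    {"name": "Ada", "score": "91"},      # duplicate once whitespace is trimmed
    {"name": "Grace", "score": ""},      # missing a required value
    {"name": "Linus", "score": "78 "},
]

if __name__ == "__main__":
    for row in clean_rows(RAW, required=["name", "score"]):
        print(row)
```

Whether dropping incomplete rows is the right call depends on your data. Sometimes filling in a default or an average is the better option.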
Good luck with your experiments in machine learning and happy processing!