Bored of Kaggle and FiveThirtyEight? Here are the alternative strategies I use for getting high-quality and unique datasets
The key to a great data science project is a great dataset, but finding great data is much easier said than done.
I remember back when I was studying for my master’s in Data Science, a little over a year ago. Throughout the course, I found that coming up with project ideas was the easy part — it was finding good datasets that I struggled with the most. I would spend hours scouring the internet, pulling my hair out trying to find juicy data sources and getting nowhere.
Since then, I’ve come a long way in my approach, and in this article I want to share with you the 5 strategies that I use to find datasets. If you’re bored of standard sources like Kaggle and FiveThirtyEight, these strategies will enable you to get data that are unique and much more tailored to the specific use cases you have in mind.
Yep, believe it or not, this is actually a legit strategy. It’s even got a fancy technical name (“synthetic data generation”).
If you’re trying out a new idea or have very specific data requirements, making synthetic data is a fantastic way to get original and tailored datasets.
For example, let’s say that you’re trying to build a churn prediction model (a model that can predict how likely a customer is to leave a company). Churn is a pretty common “operational problem” faced by many companies, and tackling a problem like this is a great way to show recruiters that you can use ML to solve commercially-relevant problems, as I’ve argued previously.
However, if you search online for “churn datasets,” you’ll find that there are (at the time of writing) only two main datasets obviously available to the public: the Bank Customer Churn Dataset, and the Telecom Churn Dataset. These datasets are a fantastic place to start, but might not reflect the kind of data required for modelling churn in other industries.
Instead, you could try creating synthetic data that’s more tailored to your requirements.
If this sounds too good to be true, here’s an example dataset which I created with just a short prompt to that old chestnut, ChatGPT:
Of course, ChatGPT is limited in the speed and size of the datasets it can create, so if you want to scale this technique up I’d recommend using either the Python library faker or scikit-learn’s sklearn.datasets.make_regression function. These tools are a fantastic way to programmatically generate huge datasets in the blink of an eye, and they are perfect for building proof-of-concept models without having to spend ages searching for the perfect dataset.
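To give a flavour of what programmatic generation looks like, here is a minimal sketch that uses only Python's standard library rather than faker or scikit-learn. The column names, value ranges and the formula linking tenure and support tickets to churn are all assumptions I made up for illustration; in a real project you would swap in whatever fields and relationships fit your use case.

```python
import csv
import io
import math
import random


def make_churn_rows(n_rows, seed=0):
    """Generate hypothetical subscription-customer records with a churn label."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for customer_id in range(1, n_rows + 1):
        tenure_months = rng.randint(1, 60)
        monthly_spend = round(rng.uniform(5.0, 120.0), 2)
        support_tickets = rng.randint(0, 8)
        # Assumed relationship: many tickets raise churn risk, long tenure lowers it.
        score = 0.4 * support_tickets - 0.05 * tenure_months
        churn_probability = 1.0 / (1.0 + math.exp(-score))
        rows.append({
            "customer_id": customer_id,
            "tenure_months": tenure_months,
            "monthly_spend": monthly_spend,
            "support_tickets": support_tickets,
            "churned": int(rng.random() < churn_probability),
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text so they can be saved or loaded elsewhere."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


dataset = make_churn_rows(1000, seed=42)
csv_text = rows_to_csv(dataset)
print(csv_text.splitlines()[0])  # header row
```

Because the generator is seeded, re-running it with the same seed gives the same rows, which is handy when you want a proof-of-concept model to be reproducible.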
In practice, I have rarely needed to use synthetic data creation techniques to generate entire datasets (and, as I will explain later, you’d be wise to exercise caution if you intend to do this). Instead, I find this is a really neat technique for generating adversarial examples or adding noise to your datasets, enabling me to test my models’ weaknesses and build more robust versions. But, regardless of how you use this technique, it’s an incredibly useful tool to have at your disposal.
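As a rough illustration of the noise idea, the sketch below perturbs the numeric columns of a tiny hand-written dataset. The helper name add_gaussian_noise, the relative_sigma parameter and the choice of Gaussian noise scaled to each value are my own assumptions for this example, not a standard recipe.

```python
import random


def add_gaussian_noise(rows, columns, relative_sigma=0.1, seed=0):
    """Return a copy of rows with Gaussian noise added to the named numeric columns."""
    rng = random.Random(seed)
    noisy_rows = []
    for row in rows:
        noisy = dict(row)  # copy so the clean data stay untouched
        for column in columns:
            # Noise scale is proportional to the magnitude of the original value.
            sigma = relative_sigma * abs(row[column])
            noisy[column] = row[column] + rng.gauss(0.0, sigma)
        noisy_rows.append(noisy)
    return noisy_rows


# Hypothetical clean records, invented for this example.
clean = [
    {"tenure_months": 12.0, "monthly_spend": 40.0, "churned": 0},
    {"tenure_months": 3.0, "monthly_spend": 95.0, "churned": 1},
]
noisy = add_gaussian_noise(
    clean, ["tenure_months", "monthly_spend"], relative_sigma=0.05, seed=1
)
```

You could then score a trained model on both clean and noisy and compare the results: a large drop in accuracy on the noisy copy suggests the model is leaning too heavily on precise feature values.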
Creating synthetic data is a nice workaround for situations when you can’t find the type of data you’re looking for, but the obvious problem is that you’ve got no guarantee that the data are good representations of real-life populations.
If you want to guarantee that your data are realistic, the best way to do that is, surprise surprise…
… to actually go and find some real data.
One way of doing this is to reach out to companies that might hold such data and ask if they’d be interested in sharing some with you. At risk of stating the obvious, no company is going to hand over highly sensitive data, or any data at all if you plan to use them for commercial or unethical purposes. That would just be plain stupid.
However, if you intend to use the data for research (e.g., for a university project), you might well find that companies are open to providing data if it’s in the context of a quid pro quo joint research agreement.
What do I mean by this? It’s actually pretty simple: I mean an arrangement whereby they provide you with some (anonymised/de-sensitised) data and you use the data to conduct research which is of some benefit to them. For example, if you’re interested in studying churn modelling, you could put together a proposal for comparing different churn prediction techniques. Then, share the proposal with some companies and ask whether there’s potential to work together. If you’re persistent and cast a wide net, you will likely find a company that is willing to provide data for your project as long as you share your findings with them so that they can get a benefit out of the research.
If that sounds too good to be true, you might be surprised to hear that this is exactly what I did during my master’s degree. I reached out to a couple of companies with a proposal for how I could use their data for research that would benefit them, signed some paperwork to confirm that I wouldn’t use the data for any other purpose, and conducted a really fun project using some real-world data. It really can be done.
The other thing I particularly like about this strategy is that it provides a way to exercise and develop quite a broad set of skills which are important in Data Science. You have to communicate well, show commercial awareness, and become a pro at managing stakeholder expectations — all of which are essential skills in the day-to-day life of a Data Scientist.
Lots of datasets used in academic studies aren’t published on platforms like Kaggle, but are still publicly available for use by other researchers.
One of the best ways to find datasets like these is by looking in the repositories associated with academic journal articles. Why? Because lots of journals require their contributors to make the underlying data publicly available. For example, two of the data sources I used during my master’s degree (the Fragile Families dataset and the Hate Speech Data website) weren’t available on Kaggle; I found them through academic papers and their associated code repositories.
How can you find these repositories? It’s actually surprisingly simple — I start by opening up paperswithcode.com, search for papers in the area I’m interested in, and look at the available datasets until I find something that looks interesting. In my experience, this is a really neat way to find datasets which haven’t been done-to-death by the masses on Kaggle.
Honestly, I’ve no idea why more people don’t make use of BigQuery Public Datasets. There are literally hundreds of datasets covering everything from Google Search Trends to London Bicycle Hires to Genomic Sequencing of Cannabis.
One of the things I especially like about this source is that lots of these datasets are incredibly commercially relevant. You can kiss goodbye to niche academic topics like flower classification and digit prediction; in BigQuery, there are datasets on real-world business issues like ad performance, website visits and economic forecasts.
Lots of people shy away from these datasets because they require SQL skills to load them. But, even if you don’t know SQL and only know a language like Python or R, I’d still encourage you to take an hour or two to learn some basic SQL and then start querying these datasets. It doesn’t take long to get up and running, and this truly is a treasure trove of high-value data assets.
To use the datasets in BigQuery Public Datasets, you can sign up for a completely free account and create a sandbox project by following the instructions here. You don’t need to enter your credit card details or anything like that — just your name, your email, a bit of info about the project, and you’re good to go. If you need more computing power at a later date, you can upgrade the project to a paid one and access GCP’s compute resources and advanced BigQuery features, but I’ve personally never needed to do this and have found the sandbox to be more than adequate.
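Once the sandbox project exists, a first query might look something like the sketch below, which uses the London Bicycle Hires data mentioned earlier. Treat the table path and column names as assumptions to check against the schema shown in the BigQuery console. The run_query helper also assumes you have installed the google-cloud-bigquery package and set up default credentials pointing at your sandbox project.

```python
# Assumed table and column names; check the schema in the BigQuery console first.
TABLE = "bigquery-public-data.london_bicycles.cycle_hire"


def build_top_stations_query(limit=10):
    """Build a SQL query for the busiest starting stations and their mean hire duration."""
    return f"""
        SELECT
          start_station_name,
          COUNT(*) AS hires,
          AVG(duration) AS mean_duration
        FROM `{TABLE}`
        GROUP BY start_station_name
        ORDER BY hires DESC
        LIMIT {int(limit)}
    """


def run_query(sql):
    """Run the query and return the result rows as a list of dicts."""
    # Imported lazily so the query can be built without the client installed.
    # Needs `pip install google-cloud-bigquery` and default credentials.
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]


sql = build_top_stations_query(limit=5)
print(sql)
# rows = run_query(sql)  # uncomment once credentials are configured
```

The SQL itself is deliberately basic (SELECT, GROUP BY, ORDER BY, LIMIT), which is about as much as you need to start exploring these datasets from Python.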
My final tip is to try using a dataset search engine. These are incredibly useful tools that have only emerged in the last few years, and they make it very easy to quickly see what’s out there. Three of my favourites are:
In my experience, searching with these tools can be a much more effective strategy than using generic search engines as you’re often provided with metadata about the datasets and you have the ability to rank them by how often they’ve been used and the publication date. Quite a nifty approach, if you ask me.