When it comes to training models, data is of the essence. A model cannot perform well unless it is trained on an appropriate dataset, no matter how much you tune it. And while there are countless great real-world datasets, they can be tedious to find when all you need is to demonstrate a simple concept.
For example, imagine you need to demonstrate class imbalance in practice. You first need to find a dataset that has a class imbalance. A few minutes of browsing Kaggle might be enough, but you won't always get lucky. Now imagine if you could just make up your own dataset with the required imbalance. Great, right?
So, to avoid spending hours hunting for a suitable dataset, today we will learn how to make our very own synthetic datasets. We will use scikit-learn to do this and play around with its parameters, so you know how to handle different scenarios.
Note: The notebook for this tutorial can be found here
We’ll use scikit-learn’s make_classification() function to create the synthetic dataset. Let’s import it along with the other required libraries.
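The imports might look like the following. Only make_classification() is named in the text; NumPy and Matplotlib are assumed here as the "other required libraries" for inspecting and plotting the data.

```python
# Dataset generator from scikit-learn (named in the text)
from sklearn.datasets import make_classification

# Assumed helpers: NumPy for array work, Matplotlib for plotting
import numpy as np
import matplotlib.pyplot as plt
```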
Now, let’s create a dataset with 1000 samples. We will have just two classes, with a perfect ratio of 1:1. Also, the samples of each class will be centered around a single cluster. Here’s how this can be done:
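A minimal sketch of such a call is shown below. The sample count, the two classes, the 1:1 ratio, and the single cluster per class come from the text; the two-feature setup, flip_y=0, and the random_state value are assumptions made so the data is reproducible and easy to plot.

```python
from sklearn.datasets import make_classification

# Values marked "assumed" are illustrative and not taken from the article.
X, y = make_classification(
    n_samples=1000,          # 1000 samples
    n_features=2,            # assumed: two features so the data plots in 2D
    n_informative=2,         # assumed: both features carry class information
    n_redundant=0,           # assumed: no derived features
    n_classes=2,             # two classes
    n_clusters_per_class=1,  # each class centered around one cluster
    weights=None,            # None gives balanced classes (1:1)
    flip_y=0,                # assumed: no label noise at this stage
    random_state=42,         # assumed: fixed seed for reproducibility
)

print(X.shape, y.shape)      # feature matrix and label vector sizes
print(X[:5])                 # first five feature rows
print(y[:5])                 # their class labels
```

X holds the feature values and y holds the class label (0 or 1) of each row.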
Here are five data points from the dataset.
Let’s create a plot to visualize the data more effectively. Here’s the code for the function:
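One possible version of such a helper is sketched below, assuming Matplotlib and a two-feature dataset. The name plot_dataset and its styling choices are hypothetical and are not taken from the original notebook.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification


def plot_dataset(X, y, title="Synthetic dataset"):
    """Scatter-plot the first two features, colored by class label."""
    fig, ax = plt.subplots(figsize=(6, 5))
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=15, alpha=0.7)
    ax.set_xlabel("Feature 1")
    ax.set_ylabel("Feature 2")
    ax.set_title(title)
    # One legend entry per distinct class label
    ax.legend(*scatter.legend_elements(), title="Class")
    return ax


# Assumed example data: two features, two balanced classes
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42,
)
```

Calling plot_dataset(X, y) followed by plt.show() displays the figure.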
Here’s how the plot looks:
That’s pretty much it. You have just created your own dataset to use. However, let’s tweak it a little in the next section, like adding noise and playing around with the imbalance.
The make_classification() function comes with a flip_y parameter for adding noise. It assigns a random class to a fraction of the samples, which in turn introduces label noise into the dataset.
Here’s how we can use it:
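A sketch of the call, assuming a two-feature, two-class dataset; flip_y=0.1 and the random_state value are arbitrary illustrative choices rather than the article's settings.

```python
import numpy as np
from sklearn.datasets import make_classification

X_noisy, y_noisy = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    flip_y=0.1,        # assumed value: about 10% of labels are reassigned at random
    random_state=42,   # assumed seed
)

# Class counts after the random reassignment
labels, counts = np.unique(y_noisy, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

Because the reassigned labels are drawn at random, some of them land back on their original class, so the share of labels that actually change is smaller than flip_y.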
Let’s view the plot now:
It’s pretty evident from the plot that we have successfully added the noise to the dataset. Try comparing it with the previous plot if you can’t see the noise added. You can further play around with the parameter and increase or decrease the noise as it suits you.
Let’s move forward with the class imbalance now.
There’s hardly a practical dataset in which you can’t spot some degree of class imbalance. In some use cases, like fraud detection, the imbalance can go as high as a 1000:1 ratio. We will use the weights parameter to control the imbalance. Let’s try it out:
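A sketch of how weights could be set, assuming a two-feature dataset with label noise turned off (flip_y=0) so the class counts follow the requested proportions closely. The 0.95 value matches the text; the other settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification

X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95],    # class 0 gets 95%; the remainder goes to class 1
    flip_y=0,          # assumed: no label noise, so proportions are not blurred
    random_state=42,   # assumed seed
)

# Fraction of samples in each class
fractions = np.bincount(y_imb) / len(y_imb)
print(fractions)
```

When weights has one entry fewer than the number of classes, scikit-learn infers the last class weight from the rest.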
Here’s how it looks:
Since we set the weights parameter to 0.95, you can see that 95% of the samples now belong to class 0, and the remaining 5% belong to class 1. Let’s turn the case around and make class 1 account for 95% of the samples and class 0 for 5%.
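The reversed case can be sketched by listing both class weights explicitly. As before, the two-feature setup, flip_y=0, and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

X_rev, y_rev = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.05, 0.95],  # class 0 gets 5%, class 1 gets 95%
    flip_y=0,              # assumed: no label noise
    random_state=42,       # assumed seed
)

# Number of samples per class, indexed by class label
class_counts = np.bincount(y_rev)
print(class_counts)
```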
Here’s the output:
That’s how easy it is to play around with the class imbalance. Let’s move on to the next section.
Class separation has a large effect on how well a model can be trained. The greater the class separation, the easier it is to fit a model and classify samples correctly. We will use the class_sep parameter to control the class separation. Let’s see it in the code below:
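A sketch that compares a few class_sep values by measuring the distance between the two class means. The helper name centroid_gap, the specific values tried, and the dataset settings are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification


def centroid_gap(class_sep):
    """Distance between the mean points of class 0 and class 1."""
    X, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_clusters_per_class=1,
        class_sep=class_sep,  # larger values push the class clusters apart
        flip_y=0,             # assumed: no label noise
        random_state=42,      # assumed seed
    )
    mean_0 = X[y == 0].mean(axis=0)
    mean_1 = X[y == 1].mean(axis=0)
    return float(np.linalg.norm(mean_0 - mean_1))


# 1.0 is the scikit-learn default; 0.1 and 3.0 are arbitrary low and high values
for sep in (0.1, 1.0, 3.0):
    print(f"class_sep={sep}: distance between class means = {centroid_gap(sep):.2f}")
```

A small distance between the class means corresponds to heavily overlapping classes in the scatter plot.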
Here’s the output:
As you can see, the two classes now overlap heavily and are much harder to tell apart than in the earlier visualizations. The higher the value you set for class_sep, the more class separation you will achieve.
Real-world datasets are very useful for training machine learning models, but there are often cases where we just want to explore some models and need a particular type of data to train them. In such cases, a suitable dataset can be hard to find. To solve this, we have seen how to easily make synthetic datasets that fit our needs.
In this article, we made a custom dataset and played around with noise, class imbalance, and class separation, which covers most of what you need to adjust a dataset to your requirements.
That’s it for today. Feel free to dive into the documentation to explore further. Happy coding!