Introducing nbsynthetic: a simple but powerful tabular data generation open source library for small datasets.
In this article we introduce nbsynthetic, an open source project created by NextBrain.ml: a simple and robust Python library for unsupervised synthetic tabular data generation.
- Simple: Designed with a simple and stable unsupervised GAN (Generative Adversarial Network) architecture based on Keras.
- Robust: With specific hyperparameter tuning to ensure training stability while minimizing computational cost.
- Because it is based on an unsupervised architecture, users do not need to have a predefined target.
- It’s intended primarily for small datasets with both continuous and categorical features.
- Because of their simplicity, models can be run on a CPU.
- Modules for quick input data preparation and feature engineering are included.
- Modules for running statistical tests and comparing real and synthetic data (we don’t like the term “fake data”) are included. It also includes a special statistical test (Maximum Mean Discrepancy — MMD) that measures the distance between the means of two samples mapped into a reproducing kernel Hilbert space (RKHS).
- Plotting utilities are included for comparing the probability distributions of original and synthetic data.
Here you can find the nbsynthetic project (library, documentation, and examples).
Synthetic data is experiencing its glory days, with several applications for image, video, and speech generation. Recently, there has been a growing interest in generative models for applications such as creating new types of art or simulating video sequences. However, developments in tabular data seem to be less ambitious, despite being the most frequent type of data available in the world. Synthetic tabular data is disrupting industries like autonomous vehicles, healthcare, and financial services. The healthcare business embraces this novel idea, especially for addressing patients’ privacy concerns, but also for simulating synthetic genomic datasets or patient medical records in research projects.
Spreadsheets are used by nearly 700 million people worldwide every day to deal with small samples of data presented as tabular data. This information is regularly used to make decisions and gain insights. However, it is commonly considered “poor quality” data, either because records are incomplete or because the sample is small (lack of statistical significance). Machine Learning could be highly valuable in these applications. But, as any data scientist is aware, the current state of the art in ML is focused on large datasets, excluding a substantial number of potential ML users. Furthermore, we must address modern statistics requirements that warn about the low reliability of ML algorithms when applied to small sample size data.
As an example, we are helping a large psychiatric hospital with a data analysis project. They came to us with comprehensive research based on data collected over the last ten years. Psychiatric hospitalizations are critical, and this research began with the goal of improving early alerts and prevention protocols. We got the results in the form of a spreadsheet with thirty-eight columns and three hundred rows. There were numerous empty values (only seven rows contained all thirty-eight feature values). Certainly, that is a small amount of data for any data scientist, and even less for a statistician. It was, however, a challenging effort for them to collect this data. With this data, the validity of any statistical method would be questioned. By creating synthetic datasets, we were able to provide reliable information with statistical validity and also address the privacy issue, a critical point for patient record management.
Generative Adversarial Networks, or GANs, are the technology at the heart of these generative applications. GANs were introduced by Ian Goodfellow in 2014. The idea was to engineer two separate neural networks and pit them against each other. The first neural network starts out by generating new data that is statistically similar to the input data. The second neural network is tasked with identifying which data is artificially created and which is not. Both networks continuously compete with one another: the first tries to trick the second, and the second tries to figure out what the first is doing. The game ends when the second network is not able to ‘discriminate’ whether data is coming from the first network’s output or from the original data. We call the first network the generator and the second network the discriminator.
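The game between the two networks is usually written as the minimax objective of the original GAN formulation. Here G is the generator, D the discriminator, p_data the distribution of the original data, and p_ξ the distribution of the generator's random input (written ξ to match the notation used further down in this article):

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{\xi \sim p_{\xi}(\xi)}\big[\log\big(1 - D(G(\xi))\big)\big]
```

D tries to make both terms large (assigning high scores to real rows and low scores to generated ones), while G tries to make the second term small.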
Training generator and discriminator models at the same time is generally unstable by definition, so the main drawbacks of GANs are unstable training and mode collapse. The evolution of GANs has brought interesting ideas to solve this issue, such as introducing extra information to the discriminator in order to get better accuracy and give more stability to the framework (conditional GANs or cGANs). This variant method requires a ‘target’ or reference class that conditions the GAN output with additional information. But when we are addressing the aforementioned target users, we find that many datasets do not have a single target feature because users want to make predictions on different features in order to get more insights about their data.
Conditioning the synthetic data on a single feature will also introduce a bias in the generated data if the user wants to, for example, solve an ML problem using another feature as a target. This is why a non-conditional GAN or non-supervised GAN (also called vanilla GAN) is interesting for such problems, because it is not necessary to choose a target.
Although the accuracy we get may be increased by providing the GAN with a reliable target class (an ‘extra’ condition in a cGAN), an unsupervised GAN is a versatile tool for these active spreadsheet users who have small to medium-sized data sets, with poor data and the need to gain general actionable insights. But it also has some limitations.
The reason why GANs are unstable during their training is that when the generator (G) and discriminator (D) are trained simultaneously, one model’s improvements are made at the cost of the other model in a non-cooperative game. nbsynthetic uses the Keras open-source software library. Its architecture is based on a linear topology using a basic Sequential architecture with three hidden layers in both the generator and discriminator. Our GAN model has a sequential architecture where G and D are connected.
This model has been designed with the following considerations in mind:
1. Initialization
Initialization is the process of defining the starting point for the optimization (learning or training) of the neural network model. When this process is not optimal, training can fail due to its instability. We break the symmetry of our GAN by randomly initializing the weights. The idea is to avoid having all neurons in a layer learn the same information. Then we use batch normalization, which stabilizes learning by normalizing the input to each unit to have zero mean and unit variance.
2. Convolutions
As we have seen, as a result of the GAN game, both models (G and D) can fail to converge. In image generation, the main strategy to avoid these convergence failures has been using convolutional nets. The convolution layer maps input features to a higher-level representation while preserving their resolution by discarding irrelevant information over several downsampling steps. But this strategy, which has shown relevant advances for image and speech generation, does not seem ideal for small sample size tabular data because we can lose information in each step. Actually, when working with small sample size tabular datasets, we realized that many of the improvements made to boost the accuracy of GANs on images are unfavorable. We chose a simple and dense architecture as the optimal approach for nbsynthetic.
3. Activation functions
LeakyReLU (an activation function based on ReLU that has a small slope for negative values instead of a flat one) is common in GAN architectures. However, G and D have different missions: G has to generate a data representation as close as possible to the original data, whereas D has to decide (classify) whether the data it receives comes from the original input or not. For the hidden layers of both the generator and discriminator sequential models, we use the LeakyReLU activation function. For the output layers, we use a tanh activation function (with a range of -1 to 1) for G and a sigmoid function (with a range of 0 to 1) for D, because D simply has to ‘decide’ if the data is valid or not.
tanh and sigmoid are continuous, bounded functions with continuous inverses. So the tanh layer (generator) preserves topological properties in the output layer, while having a much steeper gradient than the sigmoid function (at most 1, versus at most 0.25). The idea is to help the network converge faster and to avoid confusing it when we have a mixture of continuous and categorical features.
We can’t use linear activation functions because categorical features would confuse the GAN. The discriminator’s task is to classify each sample as belonging to the input data or not. Classification with a sigmoid unit (discriminator) is equivalent to trying to find a hyperplane that separates the classes in the final layer representation. In this case, using a sigmoid function with a lower gradient should improve classification accuracy.
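To make the gradient comparison concrete, here is a small NumPy sketch that evaluates the derivatives of tanh and sigmoid on a grid. The LeakyReLU slope of 0.3 is only an illustrative value, not necessarily the one nbsynthetic uses.

```python
import numpy as np


def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)


def leaky_relu(x, slope=0.3):
    # Identity for non-negative inputs, small linear slope otherwise.
    # The slope value is an assumption chosen for illustration.
    return np.where(x >= 0, x, slope * x)


x = np.linspace(-4.0, 4.0, 801)
max_tanh_grad = tanh_grad(x).max()        # reached at x = 0
max_sigmoid_grad = sigmoid_grad(x).max()  # reached at x = 0

print(f"max tanh gradient:    {max_tanh_grad:.4f}")
print(f"max sigmoid gradient: {max_sigmoid_grad:.4f}")
```

The steepest point of tanh is four times steeper than the steepest point of the sigmoid, which is the difference the paragraph above relies on.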
4. Optimization
Both G and D are trained with stochastic gradient descent using the Adam (Adaptive Moment Estimation) optimizer, which computes adaptive learning rates for each parameter. We use a small learning rate (lr = 0.0002) and a reduced momentum term (the decay rate of the running mean of the gradient), β1 = 0.4, which is below the default value of 0.9, with the aim of reducing instability.
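The design points so far (dense Sequential models with three hidden layers, random weight initialization, batch normalization, LeakyReLU in the hidden layers, tanh and sigmoid outputs, and Adam with lr = 0.0002 and β1 = 0.4) can be put together in a minimal Keras sketch. This is not the nbsynthetic source code: the layer widths, the Glorot initializer, the placement of batch normalization, the binary cross-entropy loss, and all function names are assumptions made for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def hidden_block(units):
    # Dense layer with random (Glorot) initialization, LeakyReLU, batch norm.
    return [
        layers.Dense(units, kernel_initializer="glorot_uniform"),
        layers.LeakyReLU(),
        layers.BatchNormalization(),
    ]


def make_optimizer():
    # Small learning rate and reduced momentum term, as described in the text.
    return keras.optimizers.Adam(learning_rate=0.0002, beta_1=0.4)


def build_generator(latent_dim, n_features):
    # Three hidden layers; widths are illustrative assumptions.
    return keras.Sequential(
        [keras.Input(shape=(latent_dim,))]
        + hidden_block(32) + hidden_block(64) + hidden_block(128)
        + [layers.Dense(n_features, activation="tanh")],  # output in [-1, 1]
        name="generator",
    )


def build_discriminator(n_features):
    model = keras.Sequential(
        [keras.Input(shape=(n_features,))]
        + hidden_block(128) + hidden_block(64) + hidden_block(32)
        + [layers.Dense(1, activation="sigmoid")],  # score in [0, 1]
        name="discriminator",
    )
    model.compile(loss="binary_crossentropy", optimizer=make_optimizer())
    return model


def build_gan(generator, discriminator):
    # G and D connected: D is frozen while G is trained through the stack.
    discriminator.trainable = False
    gan = keras.Sequential([generator, discriminator], name="gan")
    gan.compile(loss="binary_crossentropy", optimizer=make_optimizer())
    return gan


latent_dim, n_features = 16, 10  # hypothetical sizes
generator = build_generator(latent_dim, n_features)
discriminator = build_discriminator(n_features)
gan = build_gan(generator, discriminator)

noise = np.random.uniform(-1.0, 1.0, size=(8, latent_dim)).astype("float32")
fake_rows = generator.predict(noise, verbose=0)
scores = discriminator.predict(fake_rows, verbose=0)
print(fake_rows.shape, scores.shape)
```

In a training loop one would alternate `discriminator.train_on_batch` on real and generated rows with `gan.train_on_batch` on fresh noise labeled as real; that loop is omitted here to keep the sketch short.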
5. Noise injection
G creates a sample using a fixed-length random vector ξ as input. This vector is drawn from the latent space. After training, points in this multidimensional vector space correspond to points in the input data, so the latent space becomes a compressed representation of the data distribution. ξ is usually sampled from a Gaussian distribution and should be able to improve the numerical stability of GANs. This process is also known as noise injection. The error function when training with noise is similar to a regularization function where the coefficient λ of the regularizer is controlled by the noise variance.
A uniform distribution can loosely be seen as the limiting case of a (truncated) normal distribution as its standard deviation grows. So, by using a uniform distribution, we increase the variance and, with it, the coefficient λ. With a simple network structure and small sample tabular input data, increasing λ allows us to reduce overfitting during training. Our experiments have supported this hypothesis.
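As a minimal sketch, sampling the latent vector ξ for a batch looks like this in NumPy. The function name, the [-1, 1] range of the uniform noise, and the unit-variance normal alternative are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_latent(n_samples, latent_dim, distribution="uniform"):
    """Draw a batch of fixed-length random vectors to feed the generator."""
    if distribution == "uniform":
        # Assumed range, chosen to match the tanh output range.
        return rng.uniform(-1.0, 1.0, size=(n_samples, latent_dim))
    if distribution == "normal":
        return rng.normal(0.0, 1.0, size=(n_samples, latent_dim))
    raise ValueError(f"unknown distribution: {distribution}")


xi_uniform = sample_latent(64, 16, "uniform")
xi_normal = sample_latent(64, 16, "normal")
print(xi_uniform.shape, xi_normal.shape)
```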
6. Input data preparation
Input data preparation is perhaps the most important element for a non-supervised GAN. This network expects to receive small and medium sample size data (up to 100 features and fewer than 1,000 instances). Furthermore, data can have both continuous and categorical columns. Continuous columns do not necessarily follow normal distributions and may contain outliers. Categorical columns can be boolean or multi-class.
The most important decision point in data preparation is to identify both data types so that they can be treated differently. Categorical columns only need to be scaled from -1 to 1 (because we are using tanh activation functions). Continuous data, however, must be transformed in order to be robust to different input probability distributions and to outliers. We map all the different types of input probability distributions to uniform distributions using a quantile transformation. As a result, because the injected noise (latent space) also follows a uniform distribution, generator G only processes data with continuous uniform distributions as inputs.
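The two treatments can be sketched with scikit-learn. This is an illustration rather than the library's own preparation module: the toy columns, the ordinal encoding of categories, and the final rescaling of the quantile-transformed column to [-1, 1] are assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, QuantileTransformer

rng = np.random.default_rng(0)
n_rows = 300

# Toy data: one skewed continuous column and one three-class categorical column.
continuous = rng.lognormal(mean=0.0, sigma=1.0, size=(n_rows, 1))
categorical = rng.choice(["a", "b", "c"], size=(n_rows, 1))

# Continuous: map any input distribution to a uniform one (values in [0, 1]),
# then rescale to [-1, 1] so it matches the tanh output range (assumption).
quantile = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=0
)
continuous_uniform = quantile.fit_transform(continuous)
continuous_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(continuous_uniform)

# Categorical: encode classes as integers, then scale them to [-1, 1].
codes = OrdinalEncoder().fit_transform(categorical)
categorical_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(codes)

prepared = np.hstack([continuous_scaled, categorical_scaled])
print(prepared.shape, prepared.min(), prepared.max())
```

Keeping the fitted transformers around matters in practice, because their `inverse_transform` methods are what map generated rows back to the original units and category labels.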
An important challenge in generating synthetic data is making sure that the new data is very “close” to the original data. There are several statistical tests to find out if two samples belong to the same probability distribution. The Student’s t-test, the Wilcoxon signed-rank test (a nonparametric variation of the paired Student’s t-test), and the Kolmogorov-Smirnov test for numerical features are included in this library. These tests compare the probability distributions of each feature in the input dataset and the synthetic data in a one-to-one manner (called the “two-sample test”, also known as the “homogeneity problem”).
Modern statistical tests are quite powerful, but certain assumptions must be made.
These assumptions make it difficult to apply these tests to common datasets, where some features may conflict with the hypotheses while others don’t. Such data also contains a combination of distinct types of features with different probability distributions, so a single test is not valid for all of them at the same time. Some tests, for example, rely on the normality assumption (data follows a normal distribution), yet we can have data with features of almost any distribution. Student’s t-test compares means and relies on the assumption that samples are normally distributed, while Wilcoxon’s test is based on the ordering of the data, so it may be more appropriate when analyzing data with many outliers.
On the other hand, the Wilcoxon test is only valid for continuous values. The Kolmogorov-Smirnov goodness-of-fit test is used to decide if a sample comes from a population with a specific distribution. Like Wilcoxon’s test, it only applies to continuous distributions, and it is likely to be more robust with distributions that depart from normality.
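For a single numerical feature, the three tests can be run with SciPy as sketched below. This uses SciPy directly, not the nbsynthetic test module, and the `compare_feature` helper and the toy samples are hypothetical. Note that the Wilcoxon signed-rank test is a paired test, so it needs both samples to have the same length.

```python
import numpy as np
from scipy import stats


def compare_feature(real, synthetic):
    """Return the p-value of each two-sample test for one numerical feature."""
    _, t_p = stats.ttest_ind(real, synthetic, equal_var=False)  # compares means
    _, w_p = stats.wilcoxon(real, synthetic)                    # paired, rank-based
    _, ks_p = stats.ks_2samp(real, synthetic)                   # compares full CDFs
    return {"t_test": t_p, "wilcoxon": w_p, "kolmogorov_smirnov": ks_p}


rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=200)       # toy "original" feature
synthetic = rng.normal(loc=0.0, scale=1.0, size=200)  # toy "synthetic" feature

p_values = compare_feature(real, synthetic)
for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```

A large p-value only means the test found no evidence of a difference; as discussed above, whether the test is appropriate at all depends on the distribution of the feature.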
To address this issue, we propose quite a different solution: the Maximum Mean Discrepancy (MMD) test. MMD is a statistical test that checks whether two samples come from different distributions. It calculates the distance between the means of the two samples after they have been mapped onto a reproducing kernel Hilbert space (RKHS). The Maximum Mean Discrepancy has been extensively used in machine learning and nonparametric testing.
Based on samples drawn from each of them, MMD evaluates whether two distributions p and q differ by finding a smooth function that is large on points drawn from p and small on points drawn from q. The test statistic is the difference between the mean function values on the two samples; when this difference is large, the samples are most likely from different distributions. Another reason we chose this test is that it performs better with small sample sizes (a large sample size is a very common assumption in the majority of statistical tests).
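A minimal NumPy sketch of the idea is shown below. It computes the biased estimate of the squared MMD with a Gaussian (RBF) kernel, which is one common choice. The kernel, its bandwidth `gamma`, and the function names are assumptions, so the values it returns are not directly comparable with the MMD values reported by nbsynthetic.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    # Pairwise squared Euclidean distances, then the Gaussian kernel.
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd_squared(x, y, gamma=1.0):
    """Biased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


rng = np.random.default_rng(0)
n_features = 5
original = rng.normal(size=(200, n_features))
similar = rng.normal(size=(200, n_features))            # same distribution
different = rng.normal(loc=2.0, size=(200, n_features))  # shifted mean

gamma = 1.0 / n_features  # bandwidth heuristic, an assumption
mmd_same = mmd_squared(original, similar, gamma)
mmd_shifted = mmd_squared(original, different, gamma)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

The statistic is close to zero when both samples come from the same distribution and grows as the distributions move apart, which is the behavior used to judge how close synthetic data is to the original.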
Topological Data Analysis, or TDA, is a new approach to dealing with data from a different perspective. There are various advantages to applying this cutting-edge approach when comparing original with synthetic data:
- Quantitative analysis ignores essential information hidden in data. Also, in many data representations, it is unclear how much value actual data distances have, therefore measures are not always justified.
- TDA is concerned with distances and clusters in order to represent data in topological spaces. To build a topological space, we have to transform data points into simplicial complexes. These are representations of space as a union of points, intervals, triangles, and other higher-dimensional analogues formed by connecting points. Growing a radius around each point and connecting the points whose neighborhoods meet (a process known as filtration) creates geometric objects called simplices (this is why they are called simplicial complexes).
Steps in persistent homology are illustrated in the above figure. First, we define a point cloud. Then we use a filtration method to create simplicial complexes and, finally, we identify the topological signatures of the data (this is what we call links and loops) and represent them in a persistence diagram. These diagrams provide a useful way to summarize the topological structure of a point cloud or a function. Data topological spaces are homotopy equivalent to input datasets.
A comparison of persistence diagrams from an input dataset and a synthetic one generated from it using nbsynthetic is shown in the image below. As we can see, both diagrams have very similar topological signatures: links (red points, H0) have a similar distribution in both. In the synthetic dataset (ten times the length of the original dataset), there appears to be a loop (green point, H1) and even a void (H2), but there also appears to be noise.
We can also apply a quantitative test to check if both diagrams are equivalent. We can use the Mann-Whitney U test, which tests whether two samples are likely to derive from the same population. For the data used in the figure, the p-value is 1 for links and 1 for loops, so we cannot reject the null hypothesis that both diagrams come from the same population. That is, we find no evidence that the generated synthetic data has a different topology from the original input data.
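For links (H0), this comparison can be sketched with SciPy alone, without a dedicated TDA library. In a Vietoris-Rips filtration where an edge appears at its length, the finite H0 death times of a point cloud are exactly the edge lengths of its minimum spanning tree. The function name and the toy point clouds are hypothetical, and loops (H1) would need a persistent homology package, which is not shown here.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from scipy.stats import mannwhitneyu


def h0_lifetimes(points):
    """Finite H0 lifetimes: components are born at 0 and die at MST edge lengths."""
    distances = squareform(pdist(points))   # dense pairwise distance matrix
    mst = minimum_spanning_tree(distances)  # sparse matrix with n - 1 edges
    return np.sort(mst.data)


rng = np.random.default_rng(0)
original_cloud = rng.normal(size=(100, 3))   # toy "original" rows
synthetic_cloud = rng.normal(size=(100, 3))  # toy "synthetic" rows

original_links = h0_lifetimes(original_cloud)
synthetic_links = h0_lifetimes(synthetic_cloud)

# Null hypothesis: both sets of lifetimes come from the same population.
statistic, p_value = mannwhitneyu(
    original_links, synthetic_links, alternative="two-sided"
)
print(f"U = {statistic:.1f}, p = {p_value:.3f}")
```

One caveat of this sketch: the lifetimes depend on how densely the cloud is sampled, so comparing clouds with very different numbers of rows can change the result even when both come from the same distribution.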
The image below illustrates nbsynthetic data created from a three-dimensional tree point cloud. We can see that the “synthetic tree” shares very little with the “real tree.” If we run the MMD test as previously described, the MMD value is 0.12. As a measure of ‘closeness’ between original and synthetic data, we generally accept MMD values of less than 0.05 (common values in our experiments have been between 0.001 and 0.02). This example was chosen because it clearly demonstrates the limitations of our synthetic data generation: dealing with low-dimensional input data (three dimensions) and with data containing only continuous columns.
Low-dimensional inputs confuse the GAN during the regularization process, resulting in incomprehensible outputs. Synthetic data is collapsing on all axes, meaning that the GAN’s discriminator hasn’t been able to distinguish between real and generated data. As a result, the GAN is unable to generate a suitable representation of the input data when only continuous features are fed into it. But when we add an extra feature with a categorical dtype to the input data, MMD values fall, meaning that the input data is more accurately represented. We must keep in mind that the input latent variable (or noise injection) of our generator has a uniform distribution. If we switch to a normal distribution, the accuracy improves as well (although it is still not very high). It appears as if categorical features operate as a “reference” or “conditional” input, as an external class does in conditional GANs or cGANs. This limitation helps us to understand a bit more about how our GAN works.
We must remember that nbsynthetic does not fix input data drawbacks such as imbalanced data or data distributions with severe skewness. Synthetic data does not necessarily have to learn the exact distribution of input data features, but it will be close. What we demand from the GAN is to understand how these features are connected to each other; that is, to understand patterns. It may then be necessary to perform additional transformations on the synthetic data in order to address problems like heteroscedasticity or to reduce the bias generated by imbalanced target data.
A future step for this project is to include a module that transforms input data in order to avoid these limitations.
We have introduced a library for synthetic tabular data generation for use with small (and medium-sized) sample datasets. We have used an unsupervised GAN with a simple linear topology in order to reduce its complexity and computational cost. To make it reliable, we have carefully analyzed hyperparameter tuning to generate synthetic data as “close” to the original data as possible. We have also explored the best way to quantify this closeness with statistical tools. In the tutorial in the nbsynthetic GitHub repository, you will find additional methods such as transfer learning or topological data analysis (also introduced here) that are not yet available in the library.
We want to continue improving the library in two directions:
- Exploring alternative GAN architectures for the above-mentioned case where only continuous features are available in the input data.
- Exploring more methods to quantify how different original and generated data are. Topological data analysis is the most promising one.