A series of mechanisms and tests that one can use to evaluate any tabular synthetic dataset, from resemblance and utility perspectives
Some of the main benefits of synthetic data are:
- Scalability: Whereas real training data is collected in a linear way, synthetic data can be generated in massive quantities, which alleviates the problem of needing to acquire a huge amount of high-quality training data for complex ML tasks.
- Privacy: Data availability remains a challenge due to complex privacy requirements, especially in healthcare. It is often difficult for healthcare researchers to get access to high-quality individual-level data for secondary purposes (e.g. testing new hypotheses, building statistical and ML models). This challenge can potentially be addressed using synthetic data.
However, please note that simply making data “synthetic” does not guarantee privacy. It has been shown that synthetic data generation does not always fully preserve each individual’s privacy, especially with non-differentially-private generation techniques. It is also extremely hard to empirically quantify the privacy performance of a synthetic dataset. A single privacy metric only addresses a particular attack strategy (for example, a re-identification, reconstruction, or tracing attack). A dataset that does well against one attack is not necessarily protected against a different one.
Furthermore, it is hard to estimate the scope of knowledge available to attackers: we must make assumptions about what information is available to them and what secrets they are trying to learn. And these are only the known strategies; there is an unlimited number of unknown strategies an attacker might employ. This means that empirically quantifying one (or even multiple) privacy metrics for a synthetic dataset is not sufficient to claim that the dataset is completely private.
That said, this article solely focuses on the utility performance of a synthetic dataset.
Gaussian Copula is a statistical modeling technique for data synthesis.
A copula allows us to decompose a joint probability distribution into the variables’ marginals (which individually carry no correlation information) and a function that “couples” these marginals together. In other words, the copula is that “coupling” function: a multivariate distribution that embeds the correlation information. A Gaussian copula is therefore a multivariate normal distribution with learned correlations.
A high-level process of this data generator is as follows:
- Learn the probability distribution of each column
- Transform each column into a standard normal variable (apply the column’s CDF to get a uniform variable, then apply the standard normal inverse CDF)
- Learn the correlations of those newly generated random variables to build a copula model
- Sample from the multivariate standard normal distribution with the learned correlations, then invert the per-column transformation to map the samples back to the original marginals
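The steps above can be sketched with NumPy and SciPy on a toy two-column table. The data and helper below are illustrative assumptions of mine, not the SDV implementation, which additionally fits parametric marginal distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy "real" data: two correlated, non-normal columns.
x = rng.exponential(2.0, size=5000)
y = 0.5 * x + rng.gamma(2.0, 1.0, size=5000)
real = np.column_stack([x, y])

# Steps 1-2: map each column to a standard normal via its empirical CDF
# (rank transform followed by the standard normal inverse CDF).
def to_normal(col):
    ranks = stats.rankdata(col) / (len(col) + 1)   # empirical CDF values in (0, 1)
    return stats.norm.ppf(ranks)

z = np.column_stack([to_normal(real[:, i]) for i in range(real.shape[1])])

# Step 3: learn the correlation structure of the transformed variables.
corr = np.corrcoef(z, rowvar=False)

# Step 4: sample from a multivariate standard normal with that correlation,
# then map back through each column's empirical quantile function.
samples = rng.multivariate_normal(np.zeros(2), corr, size=5000)
u = stats.norm.cdf(samples)                         # back to uniforms in (0, 1)
synthetic = np.column_stack([
    np.quantile(real[:, i], u[:, i]) for i in range(real.shape[1])
])
```

The synthetic table reproduces both the per-column marginals (via the quantile inversion) and the pairwise correlation (via the copula).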
Conditional Tabular Generative Adversarial Network (CTGAN) is a deep learning data synthesis technique. As the name suggests, this is a GAN-based method.
A vanilla GAN consists of two neural networks: a generator, which takes random input and produces synthetic data from it, and a discriminator, which tries to tell the difference between the real and the synthetic data. The discriminator’s output is fed back to the generator to help it produce better synthetic outputs.
What differentiates CTGAN from a vanilla GAN:
- Conditional: Randomly sampling training rows to feed into the generator might not sufficiently represent the minority categories of highly imbalanced discrete columns. Instead, CTGAN introduces a conditional generator that generates rows conditioned on one of the discrete columns, and samples training data according to the log-frequency of each category of that column. This helps the model evenly (though not necessarily uniformly) explore all possible discrete values.
- Tabular: Unlike pixel values in images, which tend to follow Gaussian-like distributions, continuous variables in tabular data can follow non-Gaussian and/or complex multi-modal distributions. To address this, CTGAN represents continuous columns with mode-specific normalization.
For a detailed description of how CTGAN works, refer to the published paper.
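Mode-specific normalization can be illustrated with a Gaussian mixture on a bimodal column. This is a simplified sketch with made-up data: CTGAN actually fits a variational Gaussian mixture to choose the number of modes, but the per-mode normalization idea is the same:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# A bimodal column, e.g. an age-like variable with two clusters.
col = np.concatenate([rng.normal(30, 5, 2000), rng.normal(70, 5, 2000)])

# Fit a two-component mixture; each value is assigned to one mode and then
# normalized by that mode's mean and standard deviation.
gmm = GaussianMixture(n_components=2, random_state=0).fit(col.reshape(-1, 1))
modes = gmm.predict(col.reshape(-1, 1))
means = gmm.means_.ravel()[modes]
stds = np.sqrt(gmm.covariances_.ravel())[modes]
normalized = (col - means) / (4 * stds)   # CTGAN scales by 4 standard deviations

# Each value is then represented as (normalized scalar, one-hot mode indicator);
# generation inverts the normalization using the sampled mode.
recovered = normalized * (4 * stds) + means
```

The one-hot mode indicator lets the network model which mode a value came from, instead of smearing a single Gaussian across both modes.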
The natural question after generating synthetic data is: how good is it? Specifically:
- Does the synthetic data maintain statistical similarity with the real data? Does it preserve univariate and multivariate distributions? (i.e. resemblance metrics)
- Is it possible to obtain similar results and come to the same conclusions with the synthetic data as we would with the real data for some planned tasks? (i.e. utility metrics)
This article presents a series of mechanisms and tests that one can use to evaluate any tabular synthetic data, mostly from the perspective of preserving the original data’s statistical properties (resemblance) and ML efficacy (utility).
All evaluations are done on the Sepsis Survival Minimal Clinical Records Dataset (110,204 instances x 4 attributes) from the UCI Machine Learning Repository. We generate synthetic data using Gaussian Copula and CTGAN models from the Synthetic Data Vault implementation.
Given that the age distribution has multiple modes in the original dataset, CTGAN did a better job of maintaining this property, whereas Gaussian Copula collapses the distribution into a single mode.
On the other hand, Gaussian Copula maintains the proportions between categories for the gender and target variables better than CTGAN does. The distribution of episode_number is better replicated by CTGAN than by Gaussian Copula.
No variable had missing values in the original dataset, and the synthetic datasets from both CTGAN and Gaussian Copula reproduce that.
The two-sample Kolmogorov–Smirnov (K-S) test checks whether two samples come from the same distribution. We run this test on every variable, comparing the real data against the Gaussian Copula data and against the CTGAN data. Since the K-S statistic is the maximum distance between the two empirical CDFs, lower is better for our use case.
Overall, the mean KS statistic across all variables is slightly smaller for CTGAN compared with Gaussian Copula.
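The test is a one-liner with SciPy. The columns below are stand-ins of my own for a real/synthetic column pair, not the sepsis data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-ins for one column of the real and synthetic datasets.
real_col = rng.normal(50, 10, 5000)
synth_col = rng.normal(52, 11, 5000)      # a slightly-off synthetic copy

# The two-sample K-S statistic is the maximum vertical distance between
# the two empirical CDFs; lower means more similar distributions.
stat, p_value = ks_2samp(real_col, synth_col)

# To compare generators, average the statistic across all columns.
def mean_ks(real_cols, synth_cols):
    return np.mean([ks_2samp(r, s).statistic
                    for r, s in zip(real_cols, synth_cols)])
```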
So far, we have investigated the datasets column by column. Now, let’s take a look at the pairwise relationship.
The heat map of pairwise Pearson correlations for Gaussian Copula appears to bear more resemblance to that of the original data. To verify this, we compute the correlation accuracy. First, discretize the correlation coefficients into seven levels: [-1, -0.5) (strong negative), [-0.5, -0.3) (middle negative), [-0.3, -0.1) (low negative), [-0.1, 0.1) (no correlation), [0.1, 0.3) (low positive), [0.3, 0.5) (middle positive), [0.5, 1] (strong positive). Then, calculate the percentage of variable pairs for which the synthetic and original datasets assign the same correlation level.
The correlation accuracy of Gaussian Copula is much larger than that of CTGAN (83% vs 67%).
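Correlation accuracy takes only a few lines of NumPy. The helper name and the toy matrices are my own; the bin edges follow the discretization above:

```python
import numpy as np

# Bin edges for the correlation levels: strong/middle/low negative,
# no correlation, low/middle/strong positive.
EDGES = [-1.0, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 1.0]

def correlation_accuracy(real_corr, synth_corr):
    """Share of off-diagonal pairs whose coefficients fall in the same level."""
    iu = np.triu_indices_from(real_corr, k=1)        # unique pairs only
    real_levels = np.digitize(real_corr[iu], EDGES)
    synth_levels = np.digitize(synth_corr[iu], EDGES)
    return np.mean(real_levels == synth_levels)

real_corr = np.array([[1.0, 0.4], [0.4, 1.0]])
synth_corr = np.array([[1.0, 0.35], [0.35, 1.0]])
print(correlation_accuracy(real_corr, synth_corr))   # → 1.0 (both "middle positive")
```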
Given a party who does not have access to the original dataset, can they solve a machine learning problem with the synthetic data and draw insights as close as possible to what they would have obtained with the real data? To answer that question, we train an XGBoost classifier on the synthetic dataset and use it to make predictions on the original data. Then, we compare this score with the one achieved by training XGBoost on the original data.
CTGAN is able to achieve a predictive performance that is closer to what we would have achieved with real data, compared with Gaussian Copula.
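A minimal sketch of this “train on synthetic, test on real” evaluation, using scikit-learn’s GradientBoostingClassifier as a stand-in for XGBoost and simulated tables of my own in place of the sepsis data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_table(n, noise):
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, noise, n) > 0).astype(int)
    return X, y

# Stand-ins: a "real" table and a noisier "synthetic" copy of it.
X_real, y_real = make_table(4000, noise=0.5)
X_synth, y_synth = make_table(4000, noise=0.8)

# Hold out part of the real data for evaluation only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_real, y_real, test_size=0.25, random_state=0)

# Train on synthetic, test on real; compare with train on real, test on real.
auc_synth = roc_auc_score(
    y_te,
    GradientBoostingClassifier(random_state=0)
    .fit(X_synth, y_synth).predict_proba(X_te)[:, 1])
auc_real = roc_auc_score(
    y_te,
    GradientBoostingClassifier(random_state=0)
    .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
```

The closer `auc_synth` is to `auc_real`, the better the synthetic data’s utility for this task.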
To evaluate how hard it is to distinguish real from synthetic instances, we shuffle both datasets together, with a flag indicating whether each row is real or synthetic, and train an ML model to predict this flag. The easier the flag is to predict, the more distinguishable the real and synthetic data are.
For this test, we train XGBoost and logistic regression as the detectors. CTGAN’s synthetic data is harder for both detectors to distinguish than the Gaussian Copula data, given that the corresponding detectors’ AUROC scores are lower.
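A sketch of such a detection test with a logistic-regression detector. The Gaussian samples here are illustrative assumptions standing in for the real and generated tables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

real = rng.normal(0.0, 1.0, size=(2000, 4))
good_synth = rng.normal(0.0, 1.0, size=(2000, 4))   # matches real well
bad_synth = rng.normal(0.5, 1.0, size=(2000, 4))    # visibly shifted

def detection_auroc(real_rows, synth_rows):
    """AUROC of a detector trying to tell real rows from synthetic ones.
    0.5 means indistinguishable; 1.0 means trivially separable."""
    X = np.vstack([real_rows, synth_rows])
    y = np.concatenate([np.zeros(len(real_rows)), np.ones(len(synth_rows))])
    # Out-of-fold probabilities so the detector is scored on unseen rows.
    probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, probs)
```

A good generator should drive the detector’s AUROC toward 0.5; a shifted or otherwise flawed one is flagged by a score well above that.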
For this particular public sepsis dataset, from the individual-variable standpoint, CTGAN and Gaussian Copula are neck and neck. However, Gaussian Copula has a surprisingly better pairwise correlation accuracy, whereas CTGAN achieves better ML efficacy and is less likely to be detected.