Synthetic data is, to put it bluntly, fake data. As in, data that’s not actually from the population you’re interested in. (Population is a technical term in data science, which I explain here.) It’s data that you’re planning to treat as if it came from the place/group you wish it came from. (It didn’t.)
Artificial data, synthetic data, fake data, and simulated data are all synonyms with slightly different heydays as the term du jour, so they carry poetic connotations from different eras. These days, the cool kids prefer the synthetic data buzzword, perhaps because investors need to be convinced that something new has been invented, rather than rediscovered. And there is something slightly new in play here, but (in my opinion) not new enough for all the old ideas to be irrelevant.
Let’s dive in!
(Note: the links in this post take you to explainers by the same author.)
If you’ve suffered through a graduate course on advanced probability and measure theory like I have (my therapist and I are still working through it over a decade later), you’ll be superfluously aware that there are infinitely many real numbers. Among other things, infinite means that if you try to enumerate them all, I can swoop in like a jerk and find you a new one, for example by adding 1 to your largest number, taking the average of your two closest numbers, or popping a digit on the back of the number with the longest series of digits after the decimal point.
This also means that if you give me the list of all the numbers ever recorded by humans over the history of humankind, I can still make a brand new one. Boom! The power.
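If you'd like to see the jerk move in action, here's a minimal sketch of the three tricks in Python. The function names and the sample numbers are made up for illustration, and I'm using Decimal so the digits behave exactly instead of picking up floating-point fuzz.

```python
from decimal import Decimal


def new_by_adding_one(numbers):
    """Trick 1: add 1 to the largest number in the list."""
    return max(numbers) + 1


def new_by_averaging_closest(numbers):
    """Trick 2: average the two closest distinct numbers.

    Assumes the list holds at least two distinct values.
    """
    ordered = sorted(set(numbers))
    # Adjacent values in sorted order are the only candidates for "closest".
    low, high = min(zip(ordered, ordered[1:]), key=lambda pair: pair[1] - pair[0])
    return (low + high) / 2


def new_by_appending_digit(numbers):
    """Trick 3: pop a digit on the back of the number with the most decimals."""
    # The most negative exponent marks the longest run of digits after the point.
    longest = min(numbers, key=lambda d: d.as_tuple().exponent)
    # Adding 1 one decimal place further along appends the digit "1".
    one_place_further = Decimal(1).scaleb(longest.as_tuple().exponent - 1)
    return longest + one_place_further


# Hypothetical "every number ever recorded" list, kept tiny on purpose.
recorded = [Decimal("173"), Decimal("174"), Decimal("173.5"), Decimal("180.25")]

candidates = [
    new_by_adding_one(recorded),
    new_by_averaging_closest(recorded),
    new_by_appending_digit(recorded),
]

for candidate in candidates:
    print(candidate, "already recorded?", candidate in recorded)
```

Every candidate comes back as not already recorded, and you could append it to the list and run the tricks again as many times as your patience allows.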
Where am I going with this, besides providing fodder for your next beery debate on whether there’s such a thing as true originality (ugh)?
Let’s say you have a dataset full of human heights. Between any two measurements (say 173cm and 174cm, the interval wherein you’ll find my height) there are infinite…