How fake data can help machine learning

September 7, 2022

When experimenting with new machine learning models, a major hurdle is finding data. hyperexponential senior model developer Jonathan Bowden has explained how ‘fake data’ can be used to validate ideas for potential models.

Bowden said, “I have encountered problems with the lack of data on several occasions, and I am frequently prevented from developing ideas for potential models as a result.” Sometimes the obstacle is data privacy and the risk of causing a breach; at other times the available datasets are too small to draw meaningful conclusions from.

To overcome these issues, users can either anonymise existing datasets or use the statistical properties of a dataset to create new, semi-believable, fictitious data. Bowden has explored the latter route.

Bowden outlined a couple of ways to generate fake data. One is a tool like Faker, which generates fake names, addresses and phone numbers. Such tools, however, are too basic for complex datasets.

The second option is statistical simulation, which generates convincing values for each feature by sampling from a normal, lognormal or gamma distribution. However, Bowden explained that this method fails when features are correlated. For example, a fictitious plane that is 100m long but has a wingspan of only 2m is not believable.
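
The weakness of sampling each feature on its own can be seen in a minimal sketch. The gamma parameters, column names and the "implausible" threshold below are illustrative assumptions, not values from Bowden's post.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000

# Hypothetical marginals for a fictitious aircraft dataset (metres).
length_m = rng.gamma(shape=9.0, scale=5.0, size=n)    # mean of about 45 m
wingspan_m = rng.gamma(shape=9.0, scale=4.5, size=n)  # mean of about 40.5 m

# Each column is drawn without reference to the other, so the sample
# correlation between them is close to zero.
independent_corr = np.corrcoef(length_m, wingspan_m)[0, 1]
print(f"correlation between length and wingspan: {independent_corr:.3f}")

# Share of rows where the wingspan is under a quarter of the length,
# an assumed rule of thumb for an unbelievable plane.
implausible_share = np.mean(wingspan_m < 0.25 * length_m)
print(f"share of implausible planes: {implausible_share:.3%}")
```

Each column looks reasonable in isolation; it is only when the rows are read together that the mismatched planes show up.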

“The issue is that each of these features is unaware of the others, and there is no design manager overseeing how these features should relate to each other.”

One way to solve this is Cholesky decomposition, which can be used to force a set of independently sampled distributions to follow a fixed correlation structure. The drawback is that the resulting distributions might not keep all the properties of the original inputs: if a gamma distribution is used as the input, the output is unlikely to be gamma distributed. The data will still be random, but at least it is more believable, he said.
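
A rough sketch of the idea, assuming a target correlation of 0.8 between two features and gamma-distributed inputs (both choices are illustrative, not taken from the post): the Cholesky factor of the target correlation matrix is used to mix independent, standardised samples.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000

# Assumed target correlation between the two features.
target_corr = np.array([[1.0, 0.8],
                        [0.8, 1.0]])
# Lower-triangular factor such that chol @ chol.T equals target_corr.
chol = np.linalg.cholesky(target_corr)

# Independent gamma draws, standardised to zero mean and unit variance
# (a gamma with shape k and scale 1 has mean k and variance k).
shape = 2.0
raw = rng.gamma(shape=shape, scale=1.0, size=(n, 2))
standardised = (raw - shape) / np.sqrt(shape)

# Mixing the independent columns with the Cholesky factor induces the correlation.
mixed = standardised @ chol.T
achieved_corr = np.corrcoef(mixed, rowvar=False)[0, 1]


def skewness(x):
    """Sample skewness: third central moment over variance to the power 1.5."""
    centred = x - x.mean()
    return np.mean(centred**3) / np.mean(centred**2) ** 1.5


# The first column is untouched; the second is a blend of both inputs.
skew_first = skewness(mixed[:, 0])
skew_second = skewness(mixed[:, 1])
print(f"achieved correlation: {achieved_corr:.3f}")
print(f"skewness, column 1: {skew_first:.2f}; column 2: {skew_second:.2f}")
```

The second column is a weighted sum of two gamma variables, so its skewness drops and it no longer follows the gamma distribution it started from, which is the loss of input properties described above.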

To make “genuinely believable fake data”, firms would then need to add copulas, which model the dependence between features separately from each feature's own distribution. However, Bowden sees an easier way than all this.
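
For readers who do want to try the copula route, the following is a simplified sketch rather than Bowden's method. It uses a rank-reordering shortcut: correlated normal draws supply the dependence structure (a Gaussian copula), and independently sampled gamma columns are rearranged to follow the same rank order. A textbook version would instead map the normals through the normal CDF and each marginal's inverse CDF, for example with `scipy.stats.norm.cdf` and `scipy.stats.gamma.ppf`. All parameters and the `reorder_like` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000

# Step 1: correlated standard normals carry the dependence structure.
target_corr = np.array([[1.0, 0.8],
                        [0.8, 1.0]])
z = rng.standard_normal(size=(n, 2)) @ np.linalg.cholesky(target_corr).T

# Step 2: sample each marginal independently (hypothetical plane sizes, metres).
length_m = rng.gamma(shape=9.0, scale=5.0, size=n)
wingspan_m = rng.gamma(shape=9.0, scale=4.5, size=n)


def reorder_like(values, reference):
    """Rearrange `values` so that their rank order matches that of `reference`."""
    ranks = np.argsort(np.argsort(reference))  # rank of each reference entry, 0..n-1
    return np.sort(values)[ranks]


# Step 3: impose the ranks of the correlated normals on the gamma samples.
fake_length = reorder_like(length_m, z[:, 0])
fake_wingspan = reorder_like(wingspan_m, z[:, 1])

copula_corr = np.corrcoef(fake_length, fake_wingspan)[0, 1]
print(f"correlation after reordering: {copula_corr:.3f}")
```

Because the values are only reordered, each column keeps exactly the gamma sample it started with, while long planes now tend to be paired with wide wingspans.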

“If, like me, your answer is ‘I don’t have time for that’, we can explore the world of Generative Adversarial Networks (GANs). GANs are typically very popular in image generation; a good example is the AI-generated artworks by DALL·E 2. This is undoubtedly a more complex field than the use case for tabular, structured datasets, but the same principles of generators and discriminators apply.”

To see an experiment with GANs, read the full post here.