Any ongoing research for multi-table CTGAN solutions? #268

Open
wilhelmagren opened this issue Feb 14, 2023 · 0 comments
Labels
pending review This issue needs to be further reviewed, so work cannot be started question General question about the software

Comments

@wilhelmagren

TL;DR: any ideas, thoughts, or insights about multi-table solutions using the CTGAN model? Yay or nay?


Hi,

Let me start off by saying how much I enjoy this repository. You truly managed to make the CTGAN model easily digestible, both in your paper and in the implemented code.

I am wondering: is there ongoing research into GAN-based multi-table synthetic data solutions (e.g. extending CTGAN to be hierarchical, which Hazy supposedly can do, ref)? Or is it not worth exploring?

If it is not worth exploring multi-table CTGAN, could someone offer me some insight as to why? Does it have to do with difficulties capturing long-term primary/foreign key relations? Maintaining referential integrity? Model complexity? Are Gaussian Copulas just the better alternative for encoding the statistical properties of table relations?

I understand that CTGAN is designed to condition on discrete columns during training, for a single table. But could one not extend the model to, e.g., sample the latent noise vector $z \sim \mathcal{N}(\mu_r, \sigma_r)$ from a prior whose parameters $\mu_r$ and $\sigma_r$ are statistics of the related table, aggregated over all of its columns? This way you would, again, condition your prior on information that is relevant to the table being synthesized.
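
A minimal sketch of that prior, to make the idea concrete. It is not part of CTGAN or SDV: the function names (`related_stats`, `sample_latent`), the example table, and the `min_sigma` floor are all hypothetical, and it assumes the related table's numeric columns are already normalized to a comparable scale so that pooling them into a single $\mu_r$ and $\sigma_r$ is meaningful.

```python
import random
import statistics
from collections import defaultdict


def related_stats(related_rows, fk, numeric_cols):
    """Pool all numeric values of the related rows per foreign key.

    Returns {key: (mu_r, sigma_r)}, aggregated over all given columns.
    Assumes the columns are already normalized to a comparable scale.
    """
    pooled = defaultdict(list)
    for row in related_rows:
        pooled[row[fk]].extend(float(row[col]) for col in numeric_cols)
    return {
        key: (statistics.fmean(values), statistics.pstdev(values))
        for key, values in pooled.items()
    }


def sample_latent(mu_r, sigma_r, dim, rng, min_sigma=1e-3):
    """Draw z ~ N(mu_r, sigma_r) element-wise, instead of z ~ N(0, 1).

    min_sigma is an assumed floor so the prior does not collapse to a
    single point when the related rows have zero variance.
    """
    sigma = max(sigma_r, min_sigma)
    return [rng.gauss(mu_r, sigma) for _ in range(dim)]


# Hypothetical related table, keyed by the row being synthesized.
orders = [
    {"customer_id": 1, "amount": 0.2, "items": 0.4},
    {"customer_id": 1, "amount": 0.6, "items": 0.8},
    {"customer_id": 2, "amount": -1.0, "items": -1.0},
]

stats = related_stats(orders, "customer_id", ["amount", "items"])
rng = random.Random(0)
z = sample_latent(*stats[1], dim=8, rng=rng)  # noise vector for customer 1
```

The resulting `z` would replace the standard-normal noise fed to the generator, alongside the usual conditional vector. Whether a generator can actually exploit a shifted and scaled prior like this is exactly the open question.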

Nevertheless, I think synthetic data is a very interesting area of research, and I'm eager to read anyone's opinions, insights, or comments on the questions which I pose above.

Regards,

@wilhelmagren wilhelmagren added pending review This issue needs to be further reviewed, so work cannot be started question General question about the software labels Feb 14, 2023