Save fitted transformer object #141

Open
belcheva opened this issue Mar 16, 2021 · 6 comments
belcheva commented Mar 16, 2021

Problem Description

When working with large datasets, fitting the transformer to the data takes a long time (for example, a sample of 500,000 rows and 1,500 columns takes around 10 hours on an NVIDIA Quadro RTX 6000). Currently we use a self-built extension to the .fit() method of the CTGANSynthesizer class to save and load a fitted transformer object.

Maybe these adjustments could be useful for other users working on large-scale data?
I could prepare a PR if this would be useful.


fealho commented Mar 17, 2021

Hi @belcheva, could you elaborate a little more? Do you mean that you're saving and restoring the DataTransformer so that you can re-use it between multiple training runs and/or between multiple calls of fit?

Also, note that the transformer does not run on the GPU, only the main model does.

belcheva (Author) commented:

Hi @fealho, sorry; the GPU was indeed irrelevant to this question.

The idea is to be able to save and restore the DataTransformer between multiple calls of CTGANSynthesizer.fit() when using the same data for training different models.

When you train a new model with different parameters, you don't have to call DataTransformer.fit() again. Instead, you can load the fitted DataTransformer you saved the first time and transform the data by calling DataTransformer.transform(). Of course, this only works when you use the exact same data.

As far as I understand, saving and loading only the transformed data itself is not enough, because the DataTransformer's properties and methods are also used in CTGANSynthesizer.sample().
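The save-and-restore pattern I mean can be sketched with pickle. Note that the DataTransformer class below is a hypothetical minimal stand-in for illustration only; any fitted, picklable object (including CTGAN's real DataTransformer, assuming it is picklable) works the same way.

```python
import pickle

# Hypothetical stand-in for CTGAN's DataTransformer, for illustration:
# fit() learns the column layout, transform() applies it.
class DataTransformer:
    def fit(self, data):
        self.columns = sorted(set(c for row in data for c in row))

    def transform(self, data):
        return [[row.get(c, 0) for c in self.columns] for row in data]

data = [{"a": 1, "b": 2}, {"a": 3}]

# First run: fit once (the expensive step) and persist to disk.
transformer = DataTransformer()
transformer.fit(data)
with open("transformer.pkl", "wb") as f:
    pickle.dump(transformer, f)

# Later runs: reload instead of refitting, then transform the same data.
with open("transformer.pkl", "rb") as f:
    restored = pickle.load(f)
transformed = restored.transform(data)
```

Because the restored object keeps all its fitted properties and methods, it could also serve the calls that CTGANSynthesizer.sample() makes on it.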

I hope this is clearer; I am happy to answer any questions!


fealho commented Mar 23, 2021

@csala what do you think? I can see the use of this for CTGAN, but I’m not sure if it’s applicable to SDV as well.


csala commented Mar 23, 2021

> @csala what do you think? I can see the use of this for CTGAN, but I’m not sure if it’s applicable to SDV as well.

Well, the interesting thing is that this functionality is actually already implemented in SDV! In SDV, all the tabular models fit a Table object, which you can later export and re-use across multiple models without having to refit it. The problem here is that the DataTransformer is not used at the Table level, but rather as part of the CTGAN core, which means that CTGAN cannot take advantage of this (yet).

So my conclusion is: yes, this is useful functionality. On the other hand, I do not think it is worth implementing as a special feature, or even as part of CTGAN. Instead, what makes more sense is to eventually decouple the DataTransformer from CTGAN and move it to RDT, so that not only is reusing the fitted transformer possible, but also changing the transformer parameters or even trying other combinations.

@belcheva I'm also curious, if possible: would you mind explaining what you are doing with CTGAN and what is the use case in which you are fitting multiple models with the same transformer?


belcheva commented Mar 28, 2021

@csala We use this to speed up hyperparameter tuning and find out which architecture gives the best quality for our dataset.
From my understanding, to test different layer architectures and learning parameters we need to instantiate a new CTGANSynthesizer object each time. With that, the transformer also gets instantiated again and needs to be refitted to the data. With a saved transformer we can instead load the fitted transformer and use its output for training. Hopefully I am not mistaken that transforming the data is independent of the model parameters.
I would be very happy about any hint on how to run efficient hyperparameter tuning for CTGAN.
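The tuning loop we have in mind looks roughly like the sketch below. TinyModel and its fit_transformed() method are hypothetical stand-ins; the real CTGANSynthesizer does not currently expose an entry point that accepts pre-transformed data, which is exactly what this issue is about.

```python
# Hypothetical sketch of hyperparameter tuning with a reused transformer:
# the expensive transformer fit happens once, before this loop, and each
# candidate model trains on the already-transformed data.
class TinyModel:  # stand-in for CTGANSynthesizer
    def __init__(self, embedding_dim):
        self.embedding_dim = embedding_dim

    def fit_transformed(self, transformed):
        # Real training would go here; we just record a dummy score.
        self.score = sum(sum(row) for row in transformed) / self.embedding_dim

# Output of the saved, already-fitted transformer (illustrative values).
transformed = [[1.0, 2.0], [3.0, 0.0]]

results = {}
for dim in (64, 128, 256):  # the hyperparameter grid
    model = TinyModel(dim)
    model.fit_transformed(transformed)  # no multi-hour refit per run
    results[dim] = model.score
```

The key point is that the transformer fit sits outside the loop, so its cost is paid once regardless of how many hyperparameter combinations are tried.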

johnhalloran321 commented:

Hi @belcheva, I've been interested in exactly the feature you've described, i.e., loading the fitted transformer data for later use. Could you make your code available?
