Save fitted transformer object #141

Open
belcheva opened this issue Mar 16, 2021 · 6 comments
belcheva commented Mar 16, 2021

Problem Description

When working with large datasets, fitting the transformer to the data takes a long time (for example, a sample of 500,000 rows and 1,500 columns takes around 10 hours on an NVIDIA Quadro RTX 6000). Currently we use a self-built extension to the .fit() method of the CTGANSynthesizer class to save and load a fitted transformer object.

Maybe these adjustments could be useful for other users working on large-scale data?
I could prepare a PR if this would be useful.


fealho commented Mar 17, 2021

Hi @belcheva, could you elaborate a little more? Do you mean that you're saving and restoring the DataTransformer so that you can re-use it between multiple training runs and/or between multiple calls of fit?

Also, note that the transformer does not run on the GPU, only the main model does.

belcheva (Author) commented:

Hi @fealho, sorry; the GPU was indeed irrelevant to this question.

The idea is to be able to save and restore the DataTransformer between multiple calls of CTGANSynthesizer.fit() when using the same data for training different models.

When you train a new model with different parameters, you don't have to call DataTransformer.fit() again. Instead, you can load the fitted DataTransformer you saved the first time and transform the data by calling DataTransformer.transform(). Of course, this only works when you use the exact same data.

As far as I understand, saving and loading only the transformed data itself is not enough, because the DataTransformer's properties and methods are also used in CTGANSynthesizer.sample().
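The save-and-restore pattern I mean can be sketched with pickle. Note that the DataTransformer class below is a hypothetical minimal stand-in for illustration only; any fitted, picklable object (including CTGAN's real DataTransformer, assuming it is picklable) works the same way.

```python
import pickle

# Hypothetical stand-in for CTGAN's DataTransformer, for illustration:
# fit() learns the column layout, transform() applies it.
class DataTransformer:
    def fit(self, data):
        self.columns = sorted(set(c for row in data for c in row))

    def transform(self, data):
        return [[row.get(c, 0) for c in self.columns] for row in data]

data = [{"a": 1, "b": 2}, {"a": 3}]

# First run: fit once (the expensive step) and persist to disk.
transformer = DataTransformer()
transformer.fit(data)
with open("transformer.pkl", "wb") as f:
    pickle.dump(transformer, f)

# Later runs: reload instead of refitting, then transform the same data.
with open("transformer.pkl", "rb") as f:
    restored = pickle.load(f)
transformed = restored.transform(data)
```

Because the restored object keeps all its fitted properties and methods, it could also serve the calls that CTGANSynthesizer.sample() makes on it.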

I hope this is clearer; I am happy to answer any questions!


fealho commented Mar 23, 2021

@csala what do you think? I can see the use of this for CTGAN, but I’m not sure if it’s applicable to SDV as well.


csala commented Mar 23, 2021

> @csala what do you think? I can see the use of this for CTGAN, but I’m not sure if it’s applicable to SDV as well.

Well, the interesting thing is that this functionality is actually already implemented in SDV! In SDV, all the tabular models fit a Table object, which you can later export and re-use across multiple models without having to refit it. The problem here is that the DataTransformer is not used at the Table level, but rather as part of the CTGAN core, which means that CTGAN cannot take advantage of this (yet).

So my conclusion is: yes, this is useful functionality. On the other hand, I do not think it is worth implementing as a special feature, or even as part of CTGAN. Instead, what makes more sense is to eventually decouple the DataTransformer from CTGAN and move it to RDT, so that not only is reusing the fitted transformer possible, but also changing the transformer parameters or even trying other combinations.

@belcheva I'm also curious, if possible: would you mind explaining what you are doing with CTGAN and what is the use case in which you are fitting multiple models with the same transformer?


belcheva commented Mar 28, 2021

@csala We use this to speed up hyperparameter tuning and find out which architecture gives the best quality for our dataset.
From my understanding, to test different layer architectures and learning parameters we need to instantiate a new CTGANSynthesizer object each time. With that, the transformer also gets instantiated again and needs to be refitted to the data. With a saved transformer we can instead load the fitted transformer and use its output for training. Hopefully I am not mistaken that transforming the data is independent of the model parameters.
I would be very happy about any hint on how to run efficient hyperparameter tuning for CTGAN.
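The tuning loop we have in mind looks roughly like the sketch below. TinyModel and its fit_transformed() method are hypothetical stand-ins; the real CTGANSynthesizer does not currently expose an entry point that accepts pre-transformed data, which is exactly what this issue is about.

```python
# Hypothetical sketch of hyperparameter tuning with a reused transformer:
# the expensive transformer fit happens once, before this loop, and each
# candidate model trains on the already-transformed data.
class TinyModel:  # stand-in for CTGANSynthesizer
    def __init__(self, embedding_dim):
        self.embedding_dim = embedding_dim

    def fit_transformed(self, transformed):
        # Real training would go here; we just record a dummy score.
        self.score = sum(sum(row) for row in transformed) / self.embedding_dim

# Output of the saved, already-fitted transformer (illustrative values).
transformed = [[1.0, 2.0], [3.0, 0.0]]

results = {}
for dim in (64, 128, 256):  # the hyperparameter grid
    model = TinyModel(dim)
    model.fit_transformed(transformed)  # no multi-hour refit per run
    results[dim] = model.score
```

The key point is that the transformer fit sits outside the loop, so its cost is paid once regardless of how many hyperparameter combinations are tried.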

johnhalloran321 commented:

Hi @belcheva, I've been interested in exactly the feature you've described, i.e., loading the fitted transformer data for later use. Could you make your code available?
