[HELP] CTGAN has Reproducibility? #380

Closed
limhasic opened this issue May 8, 2024 · 8 comments
Labels
resolution:WAI The software is working as intended

Comments

limhasic commented May 8, 2024

Environment details

If you are already running CTGAN, please indicate the following details about the environment in
which you are running it:

  • CTGAN version: 0.10.0
  • Python version: 3.9.5
  • Operating System: ubuntu 20.04

Problem description

import numpy as np
import torch

from ctgan import CTGAN
from ctgan import load_demo

real_data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGAN(epochs=1, verbose = True)
ctgan.set_random_state(123)

ctgan.fit(real_data, discrete_columns)

# set seed for the global NumPy and PyTorch generators
seed = 42

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Create synthetic data
#ctgan.set_random_state(123)
synthetic_data1 = ctgan.sample(1000)
#ctgan.set_random_state(123)
synthetic_data2 = ctgan.sample(1000)
# ctgan.set_random_state(123) 

# Compare synthetic_data1 and synthetic_data2
if np.array_equal(synthetic_data1, synthetic_data2):
    print("synthetic_data1 and synthetic_data2 are equal.")
else:
    print("synthetic_data1 and synthetic_data2 are not equal.")

I tried this a thousand times, but synthetic_data1 and synthetic_data2 are still not equal.

srinify commented May 9, 2024

Hi there @limhasic, I'm not able to reproduce this. With both 1 and 10 epochs, I was able to generate exactly the same data from 2 different CTGAN models.

from ctgan import CTGAN
from ctgan import load_demo

real_data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGAN(epochs=1, verbose = True)
ctgan.set_random_state(123)
ctgan.fit(real_data, discrete_columns)

ctgan2 = CTGAN(epochs=1, verbose = True)
ctgan2.set_random_state(123)
ctgan2.fit(real_data, discrete_columns)

a = ctgan.sample(100)
b = ctgan2.sample(100)

a.equals(b)

^ The last line returns True and you can also visually inspect and see that the datasets are the same.

limhasic (Author) commented

Could you share your environment? I got False again.

I ran it on:

python 3.8.10
ctgan 0.9.1
numpy 1.24.4
torch 1.10.1+cu111
ubuntu 20.04...

srinify commented May 13, 2024

I ran my code in Google Colab: https://colab.research.google.com/

Python 3.10.12
ctgan 0.10.0
numpy 1.25.2
torch 2.2.1
Ubuntu 18.04.3 LTS (I believe, based on what Google said for Colab)

A few things to consider:

  • Have you tried this with SDV's CTGANSynthesizer instead of using CTGAN directly?
  • When you inspect both dataframes, where are the differences? Specific rows? Specific column? Number of rows? Etc
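
One possible way to answer the second question is a small pandas helper that reports whether the shapes match, which columns contain mismatches, and which rows differ. This is only a sketch: `summarize_differences` is a hypothetical name, and the two tiny tables are made-up stand-ins for the output of two `sample()` calls.

```python
import pandas as pd


def summarize_differences(a: pd.DataFrame, b: pd.DataFrame) -> dict:
    """Report where two sampled tables differ (hypothetical helper)."""
    summary = {
        "same_shape": a.shape == b.shape,
        "same_columns": list(a.columns) == list(b.columns),
    }
    if not (summary["same_shape"] and summary["same_columns"]):
        return summary  # a cell-level comparison needs a matching layout

    # Compare by position so that a differing index does not affect the result.
    a = a.reset_index(drop=True)
    b = b.reset_index(drop=True)

    # A cell differs when the values are unequal, unless both are missing.
    differs = a.ne(b) & ~(a.isna() & b.isna())

    per_column = differs.sum()
    summary["mismatches_per_column"] = {
        col: int(count) for col, count in per_column.items() if count > 0
    }
    row_has_difference = differs.any(axis=1)
    summary["mismatching_rows"] = row_has_difference[row_has_difference].index.tolist()
    return summary


# Made-up tables standing in for two calls to sample().
sample_1 = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Male", "Female"]})
sample_2 = pd.DataFrame({"age": [39, 51, 38], "sex": ["Male", "Male", "Male"]})
report = summarize_differences(sample_1, sample_2)
```

If only a few rows or a single column show up in the report, that points to a different cause than a mismatch in the number of rows.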

srinify commented May 13, 2024

@limhasic after some more investigation, it turns out we actually don't support reproducibility when fitting a synthesizer. The reproducibility we do support right now is only during sampling (generating 2 samples from the same synthesizer with the same random state).
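
A toy illustration of that sampling behavior, assuming the synthesizer keeps its own random state and that this state moves forward with every call. `ToySampler` is a hypothetical stand-in built on Python's `random` module, not CTGAN's code:

```python
import random


class ToySampler:
    """Hypothetical stand-in for a synthesizer; not CTGAN's actual code."""

    def __init__(self):
        self._rng = None  # private generator, unset until set_random_state()

    def set_random_state(self, seed):
        # Assumption: the model keeps its own generator, separate from
        # the global one that random.seed() controls.
        self._rng = random.Random(seed)

    def sample(self, n):
        # Each call consumes numbers from the private generator, so its
        # state moves forward and the next call continues from there.
        rng = self._rng if self._rng is not None else random
        return [rng.random() for _ in range(n)]


model = ToySampler()
model.set_random_state(123)

random.seed(42)  # global seed: has no effect on the private generator
first = model.sample(5)
second = model.sample(5)
consecutive_equal = first == second  # False: the state advanced

model.set_random_state(123)
random.seed(7)  # a different global seed still changes nothing
replay = model.sample(5)
replay_equal = first == replay  # True: same starting state, same draws
```

Under that assumption, global calls such as `np.random.seed` or `torch.manual_seed` would not make two consecutive `sample()` calls match, while calling `set_random_state` with the same value before each `sample()` call would.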

Out of curiosity, what's the motivation to have reproducibility during model fitting itself?

limhasic commented May 14, 2024

@srinify I am working on synthetic data.

Because of that, I am very interested in evaluation metrics and generation methods for comparing original and synthetic data.

However, when I generated data with CTGAN for evaluation, I got different results each time.

Since sampling did not appear to be reproducible, I started looking into seed control for fitting.

It is still morning here, so I will test it in the Colab environment you sent.

Also, to answer your questions:

  1. Have you tried this with SDV's CTGANSynthesizer instead of using CTGAN directly?
    -> I tried both, in different environments.

  2. When you inspect both dataframes, where are the differences? Specific rows? Specific column? Number of rows? Etc
    -> To begin with, I consider the two outputs different if specific rows differ.

limhasic (Author) commented

Closing this after confirming that sampling is reproducible in the latest version of CTGANSynthesizer.

limhasic (Author) commented

Sampling is reproducible on simple data, but once the number of columns exceeds 25, reproducibility is lost. When I wake up, I see the generator emitting different data.

srinify commented May 21, 2024

Thanks for sharing context on your use case, @limhasic. Based on it, I've opened a feature request to add reproducibility at the model fitting level: sdv-dev/SDV#2022

DataCebo is a very small team and we use community interest to help us prioritize what to work on! So we hope more people will add their use cases to that issue over time.

Closing this issue out as software is working as intended right now.

srinify closed this as completed May 21, 2024