Conditional sampling and cross-entropy loss #235

AndresAlgaba · 2022-07-19T11:53:43Z

Hi everyone! I have a question/problem regarding the conditional sampling in the sample method of the CTGANSynthesizer using the condition_column and condition_value arguments. For example, derived from the Usage Example in the README:

samples = ctgan.sample(1000, condition_column='sex', condition_value=' Male')

Note that the whitespace in condition_value=' Male' is intentional, see #233 and #234.

Environment Details

CTGAN version: latest (0.5.2.dev1)
Python version: 3.9.7
Operating System: Windows

Problem description

Intuitively, it seems that when a model is sufficiently trained, the conditional sampling should (almost) only generate examples satisfying the criteria given by the conditional vector. To monitor whether this is happening during training, I've printed the cross-entropy loss as follows:

if self._verbose:
    # Print the generator, discriminator and conditional (cross-entropy) losses per epoch.
    print(f'Epoch {i+1}, Loss G: {loss_g.detach().cpu(): .4f}, '  # noqa: T001
          f'Loss D: {loss_d.detach().cpu(): .4f}, '
          f'Cross Entropy: {cross_entropy.detach().cpu(): .4f}',
          flush=True)

CTGAN/ctgan/synthesizers/ctgan.py

Line 419 in 5358af7

if self._verbose:

The cross-entropy loss rapidly approaches zero, indicating that the generated examples satisfy the conditional vector criteria during training.

However, when sampling with the sample method, the generated samples do not satisfy the criterion noticeably more often than samples drawn without any condition (in which case the empirical distribution is used). I could not find any issues in the code and am wondering whether my intuition is wrong.

What I already tried

from ctgan import CTGANSynthesizer
from ctgan import load_demo

data = load_demo()

discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGANSynthesizer(epochs=100, verbose=True)
ctgan.fit(data, discrete_columns)

# conditional
samples = ctgan.sample(1000, condition_column='sex', condition_value=' Male')
samples["sex"].value_counts().plot(kind='bar')

# unconditional
samples = ctgan.sample(1000)
samples["sex"].value_counts().plot(kind='bar')
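The two bar charts are hard to compare by eye. As a minimal sketch, the share of rows that match the requested value can be computed directly. `condition_hit_rate` is a hypothetical helper, not part of CTGAN, and the small DataFrame only stands in for the output of ctgan.sample:

```python
import pandas as pd


def condition_hit_rate(samples, column, value):
    """Return the fraction of rows in `samples` where `column` equals `value`."""
    if len(samples) == 0:
        return 0.0
    return float((samples[column] == value).mean())


# Stand-in for the output of ctgan.sample(...); note the leading whitespace
# in the category values, as in the demo data.
demo = pd.DataFrame({'sex': [' Male', ' Female', ' Male', ' Male']})
print(condition_hit_rate(demo, 'sex', ' Male'))
```

Calling it on the conditional and on the unconditional samples gives two numbers. If conditioning works, the first should be clearly higher than the second.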

I have also done a similar analysis using the test example:

CTGAN/tests/integration/synthesizer/test_ctgan.py

Line 123 in 5358af7

def test_synthesizer_sample():

but reached similar results.


AndresAlgaba · 2022-07-19T14:11:35Z

Upon further inspection, I believe there may be a problem in:

if condition_column is not None and condition_value is not None:
    condition_info = self._transformer.convert_column_name_value_to_id(
        condition_column, condition_value)
    global_condition_vec = self._data_sampler.generate_cond_from_condition_column_info(
        condition_info, self._batch_size)

CTGAN/ctgan/synthesizers/ctgan.py

Line 443 in 5358af7

if condition_column is not None and condition_value is not None:

Whichever condition_column is chosen, its categories are always mapped to the first n columns of global_condition_vec. For example, condition_column='sex', condition_value=' Male' and condition_column='workclass', condition_value=' State-gov' produce the same global_condition_vec, because each value is the first category of its respective discrete variable. I will look into this further.

AndresAlgaba · 2022-07-19T14:39:38Z

Update: I believe a problem may reside in the generate_cond_from_condition_column_info of the DataSampler.

def generate_cond_from_condition_column_info(self, condition_info, batch):
    """Generate the condition vector."""
    vec = np.zeros((batch, self._n_categories), dtype='float32')
    id_ = self._discrete_column_matrix_st[condition_info['discrete_column_id']]
    id_ += condition_info['value_id']
    vec[:, id_] = 1
    return vec

CTGAN/ctgan/data_sampler.py

Line 153 in 5358af7

id_ = self._discrete_column_matrix_st[condition_info['discrete_column_id']]

Specifically, the _discrete_column_matrix_st attribute is initialized as:

self._discrete_column_matrix_st = np.zeros(n_discrete_columns, dtype='int32')

It does not seem to be changed afterward. Therefore,

id_ = self._discrete_column_matrix_st[condition_info['discrete_column_id']]

will always set id_ to zero.

I believe _discrete_column_cond_st has to be used instead of _discrete_column_matrix_st. This seems to generate the appropriate global_condition_vec.

However, this does not seem to solve the initial issue. I will look further into the conditional generation part, which was my main issue.

I noticed that matrix_st = self._discrete_column_matrix_st[col_idx] is also used here:

CTGAN/ctgan/data_sampler.py

Line 123 in 5358af7

matrix_st = self._discrete_column_matrix_st[col_idx]

Again, I believe that matrix_st will always be zero there. I am not sure whether this causes any unwanted behavior.
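The offset problem described above can be sketched in isolation with NumPy. This is an illustration under assumptions, not the DataSampler code: the two-column layout, `make_cond_vec`, `zero_st` and `cond_st` are hypothetical names chosen for the example.

```python
import numpy as np

# Hypothetical layout: two discrete columns with 2 and 3 categories.
categories_per_column = [2, 3]
n_categories = sum(categories_per_column)

# Start offset of each column inside the condition vector: [0, 2].
cond_st = np.concatenate(([0], np.cumsum(categories_per_column)[:-1]))
# An offset array that is allocated but never filled stays at [0, 0].
zero_st = np.zeros(len(categories_per_column), dtype='int32')


def make_cond_vec(offsets, discrete_column_id, value_id, batch):
    """Build a one-hot condition vector for one (column, value) pair."""
    vec = np.zeros((batch, n_categories), dtype='float32')
    vec[:, offsets[discrete_column_id] + value_id] = 1
    return vec


# First category of column 0 versus first category of column 1.
print(make_cond_vec(zero_st, 0, 0, 1), make_cond_vec(zero_st, 1, 0, 1))
print(make_cond_vec(cond_st, 0, 0, 1), make_cond_vec(cond_st, 1, 0, 1))
```

With the all-zero offsets, both requests set the same position. With cumulative offsets, each column gets its own slice of the vector.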

npatki · 2022-07-19T21:28:50Z

Hi @AndresAlgaba, thanks for filing and looking into this. I just wanted to confirm that we've seen this.

We can update this issue when we have more bandwidth to debug. If you do end up finding the root cause, please let us know!

BTW, what is your overall use case for conditional sampling / synthetic data? Even if this conditional vector manipulation is not working as intended, you can still use a rejection-sampling approach (synthesizing data without any conditions and then throwing away the rows you don't need). The SDV library provides convenience wrappers around CTGAN to help you do exactly this. This User Guide may be helpful, particularly the conditional sampling section.
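For readers who want to try the rejection-sampling idea without SDV, a minimal sketch follows. `rejection_sample` and `fake_sampler` are hypothetical names for this example and are not the SDV wrappers mentioned above. The only assumption about the sampler is that it takes a row count and returns a DataFrame, as ctgan.sample does in the examples earlier in this thread.

```python
import pandas as pd


def rejection_sample(sample_fn, column, value, n_rows, batch_size=1000, max_batches=100):
    """Draw unconditional batches and keep only rows where `column` equals `value`.

    Returns at most `n_rows` rows; fewer if `max_batches` is exhausted first.
    """
    kept, n_kept = [], 0
    for _ in range(max_batches):
        batch = sample_fn(batch_size)
        matches = batch[batch[column] == value]
        kept.append(matches)
        n_kept += len(matches)
        if n_kept >= n_rows:
            break
    return pd.concat(kept, ignore_index=True).head(n_rows)


def fake_sampler(n):
    """Stand-in for an unconditional sampler such as ctgan.sample."""
    return pd.DataFrame({'sex': [' Male', ' Female'] * (n // 2)})


males = rejection_sample(fake_sampler, 'sex', ' Male', n_rows=5, batch_size=4)
print(len(males))
```

With a fitted model, the call would look like rejection_sample(ctgan.sample, 'sex', ' Male', 1000). The cost grows as the requested value becomes rarer, since most generated rows are discarded.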

AndresAlgaba · 2022-07-20T12:30:59Z

Hi @npatki, no problem, and thanks for the confirmation!

Besides the change from _discrete_column_matrix_st to _discrete_column_cond_st (as mentioned above):

CTGAN/ctgan/data_sampler.py

Line 153 in 5358af7

id_ = self._discrete_column_matrix_st[condition_info['discrete_column_id']]

(By the way, I found issue #169 talking about a similar issue with _discrete_column_matrix_st).

I have found that proper sampling requires the generator to be put in evaluation mode, because batch normalization is used in the generator:

self._generator.eval()
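One plausible reason the mode matters, offered as an assumption rather than a confirmed diagnosis of CTGAN: in training mode, batch normalization subtracts the mean of the current batch, so anything that is identical across all rows of a batch (such as the effect of a single fixed condition) is cancelled in the normalized output. The sketch below uses a simplified NumPy stand-in for torch.nn.BatchNorm1d without learnable parameters. `batch_norm`, `shift` and the running statistics are hypothetical values chosen for illustration.

```python
import numpy as np


def batch_norm(x, running_mean, running_var, training, eps=1e-5):
    """Simplified batch norm without learnable scale and shift."""
    if training:
        # Training mode: normalize with the statistics of the current batch.
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        # Evaluation mode: normalize with statistics stored during training.
        mean, var = running_mean, running_var
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
noise = rng.normal(size=(8, 3))
# Hypothetical contribution of a fixed condition, identical for every row.
shift = np.array([5.0, 0.0, -5.0])
running_mean, running_var = np.zeros(3), np.ones(3)

# Training mode: the constant shift is removed together with the batch mean.
train_same = np.allclose(batch_norm(noise, running_mean, running_var, True),
                         batch_norm(noise + shift, running_mean, running_var, True))
# Evaluation mode: the shift survives because stored statistics are used.
eval_same = np.allclose(batch_norm(noise, running_mean, running_var, False),
                        batch_norm(noise + shift, running_mean, running_var, False))
print(train_same, eval_same)
```

In this toy setting the shifted and unshifted inputs are indistinguishable after training-mode normalization, while evaluation mode keeps them apart.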

I have opened a PR with the proposed changes #236.

Thank you for the suggestion on the SDV library! An issue (sdv-dev/SDV#623) brought me to examine the conditional sampling :).

AndresAlgaba added the "pending review" and "question" labels Jul 19, 2022

npatki added the "under discussion" label and removed the "pending review" label Jul 19, 2022

AndresAlgaba mentioned this issue Jul 20, 2022

Fix bugs for conditional sampling #236

Open
