Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discrete_column_matrix_st from data_sampler class is always 0 #169

Open
tejuafonja opened this issue Aug 27, 2021 · 4 comments
Open

discrete_column_matrix_st from data_sampler class is always 0 #169

tejuafonja opened this issue Aug 27, 2021 · 4 comments
Labels
question General question about the software

Comments

@tejuafonja
Copy link
Contributor

Hello,

I noticed that discrete_column_matrix_st from the data_sampler class is not updated and it's always 0. If this is the case, then the function sample_original_condvec(self, batch) would not be picking the correct category of the discrete column. I'm not sure if this is the desired behavior. I'm of the opinion that the discrete_column_matrix_st is supposed to help keep track of the discrete column start position in the data matrix. If my assumption were to be correct, updating the matrix with self._discrete_column_matrix_st[current_id] = st as done below should solve the issue. Please feel free to correct my assumption if otherwise and thank you for open-sourcing your code.

# excerpt: data_sampler.py, line 60-78
for column_info in output_info:
            if is_discrete_column(column_info):
                span_info = column_info[0]
                ed = st + span_info.dim
                category_freq = np.sum(data[:, st:ed], axis=0)
                if log_frequency:
                    category_freq = np.log(category_freq + 1)
                category_prob = category_freq / np.sum(category_freq)
                self._discrete_column_category_prob[current_id, :span_info.dim] = (
                    category_prob)
                self._discrete_column_cond_st[current_id] = current_cond_st
                self._discrete_column_n_category[current_id] = span_info.dim

                # **updated self._discrete_column_matrix_st**
                self._discrete_column_matrix_st[current_id] = st

                current_cond_st += span_info.dim
                current_id += 1
                st = ed
            else:
                st += sum([span_info.dim for span_info in column_info])
@pvk-developer pvk-developer added needs discussion question General question about the software labels Sep 7, 2021
@JonathanDZiegler
Copy link

I agree that this needs attention. Besides sample_original_condvec, generate_cond_from_condition_column_info is also affected. In the current state, the condition vectors created are not correct. Thanks @tejuafonja for suggesting a fix!

@JonathanDZiegler
Copy link

The fix seems to have an issue with mixed tables. Your code also counts continuous columns, which is not correct. A simple expansion to two counters (one overall and one for discrete columns) seems to do the trick:

st = 0
discrete_st = 0
current_id = 0
current_cond_st = 0
for column_info in output_info:
    if is_discrete_column(column_info):
        span_info = column_info[0]
        ed = st + span_info.dim
        discrete_ed = discrete_st + span_info.dim
        category_freq = np.sum(data[:, st:ed], axis=0)
        if log_frequency:
            category_freq = np.log(category_freq + 1)
        category_prob = category_freq / np.sum(category_freq)
        self._discrete_column_category_prob[current_id, :span_info.dim] = category_prob
        self._discrete_column_cond_st[current_id] = current_cond_st
        self._discrete_column_n_category[current_id] = span_info.dim

        # **updated self._discrete_column_matrix_st**
        self._discrete_column_matrix_st[current_id] = discrete_st

        current_cond_st += span_info.dim
        current_id += 1
        st = ed
        discrete_st = discrete_ed
    else:
        st += sum([span_info.dim for span_info in column_info])

@tejuafonja
Copy link
Contributor Author

Thanks for improving the suggested fix!

AndresAlgaba added a commit to AndresAlgaba/CTGAN that referenced this issue Jul 21, 2022
AndresAlgaba added a commit to AndresAlgaba/CTGAN that referenced this issue Jul 25, 2022
AndresAlgaba added a commit to AndresAlgaba/CTGAN that referenced this issue Jul 25, 2022
@AndresAlgaba
Copy link

AndresAlgaba commented Jul 25, 2022

Hi everyone, I believe that the initial solution from @tejuafonja is correct. See #236 for more details on this, and feel free to give some feedback there!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question General question about the software
Projects
None yet
Development

No branches or pull requests

5 participants