What is the domain size in adult_domain.json? #5

Open
bigdaronlee163 opened this issue Jan 16, 2022 · 3 comments

Comments

@bigdaronlee163

No description provided.

@bigdaronlee163
Author

I guess the domain size means the number of classes each attribute can take? But when I tested this idea, I found that some of the values do not match.

This is my result:

for c in col:
    print(max(data.df[c]))

74
8
99
15
6
14
5
4
1
99
87
98
41
1

But in adult_domain.json, shape = (85, 9, 100, 16, 7, 15, 6, 5, 2, 100, 100, 99, 42, 2). Why is that?

Another question: will the domain size affect the accuracy of the synthetic data?

@bigdaronlee163
Author

Can methods like MST and MWEM+PGM only be applied to discrete data?

@ryan112358
Owner

@bigdaronlee163 great questions.

  1. All we require is that 0 <= max(data.df[col]) < data.domain[col], which you'll see is true in your example above.

  2. It's better to have smaller domains in general. If you have one attribute that can take 10000 possible values, the code may work, but its scalability and/or accuracy may suffer. It's best to keep these small; even 100 is probably larger than it needs to be for the discretized numeric attributes in the adult dataset.

  3. Yes, all mechanisms that use Private-PGM expect discrete data, but it's an interesting open problem to develop approaches that can handle numeric data as well!
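
As a rough illustration of points 1 and 2, the sketch below checks the condition from point 1 against the maxima and the shape posted above, then shows one possible way to bin a numeric attribute into a small number of integer codes. The `discretize` helper, its equal-width binning, and the sample `hours` values are assumptions made for this example; they are not part of Private-PGM and not necessarily how the adult dataset was preprocessed.

```python
# Column maxima printed in the question and the shape from adult_domain.json.
maxes = [74, 8, 99, 15, 6, 14, 5, 4, 1, 99, 87, 98, 41, 1]
shape = (85, 9, 100, 16, 7, 15, 6, 5, 2, 100, 100, 99, 42, 2)

# Point 1: every observed maximum must be non-negative and below the domain size.
domain_ok = all(0 <= m < s for m, s in zip(maxes, shape))
print(domain_ok)


def discretize(values, low, high, bins):
    """Hypothetical helper: map numbers in [low, high] to codes 0..bins-1.

    Uses equal-width bins and clamps out-of-range values to the first or
    last bin, so the result always fits a domain of size `bins`.
    """
    width = (high - low) / bins
    codes = []
    for v in values:
        idx = int((v - low) / width)
        codes.append(min(max(idx, 0), bins - 1))
    return codes


# Point 2: a numeric attribute (made-up values) squeezed into a domain of size 10.
hours = [1.0, 40.0, 99.0]
codes = discretize(hours, low=1.0, high=99.0, bins=10)
print(codes)
```

With a pandas DataFrame, the same check for a single column would be `0 <= df[col].min()` and `df[col].max() < domain[col]`. Fewer bins give a smaller domain at the cost of coarser values, which is the trade-off point 2 describes.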
