
Questions about parity5_plus_5 #179

Open
amueller opened this issue Oct 10, 2023 · 5 comments

amueller commented Oct 10, 2023

Would it be possible to get a description of the parity5_plus_5 dataset? There are several things about it that confuse me.
First, there are some duplicate rows, which seems odd: the rows enumerate 0 to 1023 in binary, yet the dataset has 1124 rows, so 100 of them must be duplicates.

Also, I'm not sure I understand the name of the dataset. The equation for the class label seems to be

data['class'] == data[['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']].sum(axis=1) % 2

but I'm not sure what the intuition behind this is or how it relates to the name. I assume there's some simple binary formula behind this, but I don't immediately see it.
Or is it just referring to the fact that the other five bits don't influence the outcome?
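For what it's worth, that reading can be checked by brute force. A minimal sketch, assuming the label is simply the parity of five designated bits out of ten (the bit positions below are illustrative indices into a 10-bit row, standing in for the Bit_2/Bit_3/Bit_4/Bit_6/Bit_8 columns):

```python
from itertools import product

# Assumed rule from the expression above: the class label is the
# parity (sum mod 2) of five designated bits; the other five bits
# are ignored. Positions here are illustrative.
RELEVANT = (2, 3, 4, 6, 8)

def label(bits):
    """Parity of the five relevant bits of a 10-bit row."""
    return sum(bits[i] for i in RELEVANT) % 2

# Enumerate all 2**10 = 1024 distinct rows. Flipping any of the
# five irrelevant bits never changes the label, and exactly half
# the rows end up in each class.
rows = list(product((0, 1), repeat=10))
labels = [label(r) for r in rows]
print(len(rows), sum(labels))  # 1024 512
```

Under this rule the five unused bits carry no signal at all, which would match the "other five bits don't influence the outcome" reading.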


lacava commented Oct 23, 2023

@ryanurbs do you happen to know the equation for this dataset?

@amueller

I think the explanation is actually just that there's a subset of 5 bits whose parity is computed and the other bits are ignored. But I'm still confused by the duplication of some rows.

@ryanurbs

@lacava @amueller I'm looking into getting a definitive answer to your question. We received this dataset from a colleague.

@ryanurbs

@lacava @amueller I found a published description of the parity5+5 problem here: https://sci2s.ugr.es/keel/pdf/algorithm/congreso/liu-3.pdf

You are indeed correct that only 5 of the features are relevant (Bits 2, 3, 4, 6, 8) and the other 5 are randomly generated. The underlying predictive pattern is that if there is an even number of zeros across those five features, the outcome is 1; otherwise it is 0.

I'm not sure why there are extra redundant rows in this dataset; there should be 1024 unique rows, as described in the paper above. I'm not certain of the exact origins of this particular copy, so it may not be possible to track down where the extra rows came from, but you could simply remove the redundant rows, depending on the experiment you want to run.

The name parity5+5 comes from the fact that this dataset is essentially the original parity5 problem with 5 irrelevant features added to it.


amueller commented Oct 23, 2023

@ryanurbs thank you for the explanation. Interesting to know that the published version only has 1024 rows, so the extras may have come from some processing mix-up along the way. Feel free to close. I was asking on behalf of openml.org, where we may drop the duplicate rows in a new version of the dataset.
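For anyone preparing such a cleaned version, dropping exact duplicate rows is a one-liner in pandas. A minimal sketch with an illustrative stand-in frame (in practice the frame would hold the 1124 rows of parity5_plus_5, loaded from PMLB or OpenML):

```python
import pandas as pd

# Illustrative stand-in: three rows, one of which is an exact
# duplicate. The real frame would have 10 Bit_* columns plus 'class'.
df = pd.DataFrame({
    'Bit_0': [0, 0, 1],
    'class': [0, 0, 1],
})

# drop_duplicates() keeps the first occurrence of each identical row;
# on the real dataset this should take 1124 rows back down to 1024.
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```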
