Duplicate datasets. #167

Open
36 of 45 tasks
alexzwanenburg opened this issue Oct 13, 2022 · 8 comments

Comments

@alexzwanenburg
Contributor

alexzwanenburg commented Oct 13, 2022

While trying to identify which data sets from the modeldata R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.

  • cmc and contraceptive are the same. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate cmc and contraceptive datasets.
  • 195_auto_price and 207_autoPrice. The symboling feature is shifted between the two datasets. Note: the underlying dataset seems to be the same as the one used for auto. The datasets differ in the target, which is price for 195_auto_price and 207_autoPrice and symboling for auto, and in how missing values were removed. The original dataset may be found on the UCI ML repository.
    • Parse data from the original into the expected format with price as target.
    • Parse data from the original into the expected format with symboling as target.
    • Ensure that Description of each new dataset references the other.
    • Deprecate 195_auto_price, 207_autoPrice and auto datasets.
  • glass and prnn_fglass. The target class levels seem to be switched between datasets. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate glass and prnn_fglass datasets.
  • heart_c, cleve, cleveland_nominal and cleveland. The cleve and heart_c data sets have a binarized target (vs. ordinal in the other two datasets); the cleveland_nominal data set contains only a feature subset. The original can be found on the UCI ML repository.
  • heart_statlog is a subset of the cleve data set.
  • heart_h and hungarian appear to be the same.
    • Parse Cleveland data from the original into the expected format.
    • Parse Hungarian data from the original into the expected format.
    • Parse Switzerland data (currently missing) from the original into the expected format.
    • Parse VA Long beach data (currently missing) from the original into the expected format.
    • Deprecate heart_c, cleve, cleveland_nominal, cleveland, heart_statlog, heart_h and hungarian datasets.
  • colic and horse_colic appear to be the same. The original can be found on the UCI ML repository. This issue was also mentioned in "horse-colic class labels are in [1,2]" (#75).
    • Parse data from the original into the expected format.
    • Deprecate colic and horse_colic datasets.
  • vote and house_votes_84 are identical.
    • Identify original source.
    • Parse data from the original into the expected format.
    • Deprecate vote and house_votes_84 datasets.
  • breast_cancer_wisconsin and wdbc are the same. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate breast_cancer_wisconsin and wdbc datasets.
  • australian, buggyCrx, credit_a and crx are identical or based on the same data.
    • Identify original source.
    • Parse data from the original into the expected format.
    • Deprecate australian, buggyCrx, credit_a and crx datasets.
  • breast_w and breast are based on the same data. The breast dataset has a Sample code number feature that is not present in breast_w. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate breast_w and breast datasets.
  • diabetes and pima appear to be identical.
    • Identify original source. This dataset appears to have been hosted at the UCI ML repository. However, the original owner seems to have withdrawn permission to use this dataset.
    • Parse data from the original into the expected format.
    • Deprecate diabetes and pima datasets.
  • credit_g and german appear to be identical.
    • Identify original source. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate credit_g and german datasets.
  • solar_flare_2 and flare derive from the same data, but differ in the way the target is formulated. solar_flare_2 also contains two additional features.
    • Identify original source. The original can be found on the UCI ML repository. There are three targets, of which one is useful for ML prediction. The additional features in solar_flare_2 are in fact the other two targets.
    • Parse data from the original into the expected format.
    • Deprecate solar_flare_2 and flare datasets.
  • car and car_evaluation are based on the same dataset. In the car_evaluation dataset several categorical (ordinal) features from car are one-hot-encoded. The original can be found on the UCI ML repository. This issue was also mentioned in "car and car_evaluation seem to be identical" (#84).
    • Parse data from the original into the expected format.
    • Deprecate car and car_evaluation datasets.
  • chess and kr_vs_kp are identical. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate chess and kr_vs_kp datasets.
  • satimage and 294_satellite_image are the same, with the exception that 294_satellite_image incorrectly specifies a regression problem. The original can be found on the UCI ML repository, and has multiple (6) classes as target.
    • Parse data from the original into the expected format.
    • Deprecate satimage and 294_satellite_image datasets.
  • 197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act are based on the same dataset, with the difference being that 227_cpu_small and 562_cpu_small have fewer features.
    • Identify original source.
    • Parse data from the original into the expected format.
    • Deprecate 197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act datasets.
  • poker and 1595_poker are identical except for the target specification. The original can be found on the UCI ML repository, and suggests that the target is ordinal.
    • Parse data from the original into the expected format.
    • Deprecate poker and 1595_poker datasets.
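
The checks for "identical" and "subset" described in the list above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used for this issue: the function names and toy rows are made up, and it assumes both datasets share the same column order and encoding. Cases such as a shifted symboling feature, switched class levels, or one-hot-encoded features would need normalization before a comparison like this.

```python
from collections import Counter


def row_counts(rows):
    """Count each row, so that row order does not affect the comparison."""
    return Counter(tuple(row) for row in rows)


def compare_datasets(rows_a, rows_b):
    """Classify the relationship between two datasets given as lists of rows."""
    a, b = row_counts(rows_a), row_counts(rows_b)
    if a == b:
        return "identical"
    # Every row of the first occurs at least as often in the second.
    if all(b[row] >= n for row, n in a.items()):
        return "first is a subset of second"
    if all(a[row] >= n for row, n in b.items()):
        return "second is a subset of first"
    return "different"


# Toy rows for illustration only; these are not taken from PMLB.
full = [(63, 1, 145), (67, 1, 160), (41, 0, 130)]
shuffled = [full[2], full[0], full[1]]
subset = full[:2]

print(compare_datasets(full, shuffled))
print(compare_datasets(subset, full))
```

Counting rows rather than comparing them in order means that two copies of a dataset that were merely re-sorted are still reported as identical.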

My proposal is to remove duplicates, using an original dataset where this can be found. This might also address the following issues:

@trangdata
Collaborator

Thank you so much for your detailed investigation of the dataset collection @alexzwanenburg! Would you have the bandwidth to make a PR to address (even part of) the duplications?

@alexzwanenburg
Contributor Author

alexzwanenburg commented Oct 17, 2022

Yes, I can create the PR to address this issue, though it may take a few weeks to address everything fully.

I have two questions:

  • What should I do with the duplicate datasets? Issue "Backwards compatibility" (#119) was not fully addressed. I propose adding a deprecated tag to the metadata YAML file that, if present, refers to the new dataset. I would expect this to have the following behaviour:
    • If the deprecated tag is present and not empty (i.e. not ~ or null), the dataset will no longer be visible on the PMLB GitHub Pages.
    • If the deprecated tag is present and not empty (i.e. not ~ or null), fetching the dataset will produce a warning.
    • Deprecated data sets are fully removed with the next major release (v2.0.0).
  • Some datasets have a known license. Can I add a license tag to the metadata to document this, e.g. license: CC-BY-4.0?
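
The proposed behaviour of the deprecated tag could look roughly like the sketch below. It is only an illustration under stated assumptions: `check_deprecated` and `visible_datasets` are hypothetical helpers, not part of PMLB, the metadata is represented as plain dictionaries standing in for the parsed YAML files, and `contraceptive_uci` is a made-up name for a replacement dataset.

```python
import warnings


def check_deprecated(name, metadata):
    """Warn if the metadata marks the dataset as deprecated.

    `metadata` stands in for the parsed metadata YAML file. A missing or
    empty `deprecated` entry (`~` or `null` in YAML, None here) means the
    dataset is still current.
    """
    replacement = metadata.get("deprecated")
    if replacement is None:
        return False
    warnings.warn(
        f"Dataset '{name}' is deprecated; use '{replacement}' instead.",
        FutureWarning,
        stacklevel=2,
    )
    return True


def visible_datasets(all_metadata):
    """Names that would still be listed, i.e. those not marked deprecated."""
    return [n for n, m in all_metadata.items() if m.get("deprecated") is None]


# Hypothetical metadata; the replacement name and its license are made up.
catalog = {
    "cmc": {"deprecated": "contraceptive_uci"},
    "contraceptive_uci": {"deprecated": None, "license": "CC-BY-4.0"},
}

check_deprecated("cmc", catalog["cmc"])  # emits a FutureWarning
print(visible_datasets(catalog))
```

`FutureWarning` is used here because Python shows it to end users by default; whether PMLB would use that category or another one is an open design choice.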

@lacava
Collaborator

lacava commented Oct 17, 2022

all those suggestions look good to me.

@lacava
Collaborator

lacava commented Apr 27, 2023

Hi @alexzwanenburg, thanks again for your work spearheading this. Do you still plan to make a PR for these changes? 🙏

@alexzwanenburg
Contributor Author

Yes, but I still need to update the four final datasets. I can create a PR for the work I have already done.

@lacava
Collaborator

lacava commented Oct 23, 2023

Ping on this, @alexzwanenburg. Hopefully we can pick up where you left off if you create a PR.

@gkronber
Contributor

@alexzwanenburg I'm ready to help finish this PR. Is your fork up-to-date with your changes documented in this issue?

@alexzwanenburg
Contributor Author

I made a PR. I haven't addressed the last four datasets.


4 participants