Duplicate datasets. #167

alexzwanenburg · 2022-10-13T13:11:37Z

While trying to identify which data sets from the modeldata R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.
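A comparison of this kind could be sketched roughly as follows. This is only an illustration, not the procedure actually used for the list below: it assumes each dataset is loaded as a mapping from column name to a list of values, and it ignores column names and row order by fingerprinting every column by its sorted values. The function names are hypothetical.

```python
from collections import Counter


def column_fingerprints(table):
    """Count order-insensitive fingerprints of all columns in a table.

    `table` maps a column name to a list of values. Names are ignored,
    so renamed columns still match.
    """
    return Counter(tuple(sorted(map(str, values))) for values in table.values())


def compare_tables(a, b):
    """Heuristically classify how the columns of `a` and `b` relate."""
    fa, fb = column_fingerprints(a), column_fingerprints(b)
    if fa == fb:
        return "same columns"
    if not (fa - fb):  # every column of a also occurs in b
        return "a is a column subset of b"
    if not (fb - fa):  # every column of b also occurs in a
        return "b is a column subset of a"
    return "different"


# Toy data: `small` has the same columns as `large`, minus an id column.
small = {"x": [1, 2, 3], "y": [0, 1, 0]}
large = {"f1": [3, 1, 2], "target": [0, 0, 1], "id": [10, 11, 12]}
print(compare_tables(small, large))
```

A match found this way is only a hint; recoded values, shifted features or dropped rows (as in several of the cases below) still need to be checked by hand.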

cmc and contraceptive are the same. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate cmc and contraceptive datasets.
195_auto_price and 207_autoPrice. The symboling feature is shifted between the two datasets. Note: the underlying dataset seems to be the same as the one used for auto. The datasets differ in the target, which is price for 195_auto_price and 207_autoPrice and symboling for auto, and in how missing values were removed. The original dataset can be found on the UCI ML repository.
- Parse data from the original into the expected format with price as target.
- Parse data from the original into the expected format with symboling as target.
- Ensure that Description of each new dataset references the other.
- Deprecate 195_auto_price, 207_autoPrice and auto datasets.
glass and prnn_fglass. The target class levels seem to be switched between datasets. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate glass and prnn_fglass datasets.
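Whether two targets differ only by switched class levels can be checked with a small sketch like the one below. It assumes the rows of both datasets are in the same order, which may not hold for glass and prnn_fglass; the function name is hypothetical.

```python
def label_permutation(labels_a, labels_b):
    """Return the one-to-one relabeling from labels_a to labels_b, or None.

    Assumes both label sequences describe the same rows in the same order.
    """
    if len(labels_a) != len(labels_b):
        return None
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # setdefault stores the first pairing seen; any later conflict
        # means the labels are not a pure relabeling of each other.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return None
    return forward


print(label_permutation([1, 2, 1, 3], [2, 1, 2, 3]))
```

A returned mapping means the two targets carry the same information under different codes; `None` means they genuinely differ, or that the rows are not aligned.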
heart_c, cleve, cleveland_nominal and cleveland. The cleve and heart_c data sets have a binarized target (vs. ordinal in the other two datasets); the cleveland_nominal data set contains only a feature subset. The original can be found on the UCI ML repository.
heart_statlog is a subset of the cleve data set.
heart_h and hungarian appear to be the same.
- Parse Cleveland data from the original into the expected format.
- Parse Hungarian data from the original into the expected format.
- Parse Switzerland data (currently missing) from the original into the expected format.
- Parse VA Long Beach data (currently missing) from the original into the expected format.
- Deprecate heart_c, cleve, cleveland_nominal, cleveland, heart_statlog, heart_h and hungarian datasets.
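The relation between a binarized and an ordinal target can be verified with a sketch like this one. It assumes that rows are aligned and that the binarized target marks every ordinal level above a threshold (here 0) as positive; both are assumptions for illustration, not something confirmed for heart_c or cleve.

```python
def is_binarized_version(ordinal_target, binary_target, threshold=0):
    """Check whether binary_target equals (ordinal_target > threshold), row by row."""
    if len(ordinal_target) != len(binary_target):
        return False
    return all(
        int(o > threshold) == b
        for o, b in zip(ordinal_target, binary_target)
    )


# Toy values only, not taken from any of the datasets above.
print(is_binarized_version([0, 2, 4, 0, 1], [0, 1, 1, 0, 1]))
```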
colic and horse_colic appear to be the same. The original can be found on the UCI ML repository. This issue was also mentioned in #75 ("horse-colic class labels are in [1,2]").
- Parse data from the original into the expected format.
- Deprecate colic and horse_colic datasets.
vote and house_votes_84 are identical.
- Identify original source.
- Parse data from the original into the expected format.
- Deprecate vote and house_votes_84 datasets.
breast_cancer_wisconsin and wdbc are the same. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate breast_cancer_wisconsin and wdbc datasets.
australian, buggyCrx, credit_a and crx are identical or based on the same data.
- Identify original source.
- Parse data from the original into the expected format.
- Deprecate australian, buggyCrx, credit_a and crx datasets.
breast_w and breast are based on the same data. The breast dataset has a Sample code number feature that is not present in breast_w. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate breast_w and breast datasets.
diabetes and pima appear to be identical.
- Identify original source. This dataset appears to have been hosted at the UCI ML repository. However, the original owner seems to have withdrawn permission to use this dataset.
- ~~Parse data from the original into the expected format.~~
- Deprecate diabetes and pima datasets.
credit_g and german appear to be identical.
- Identify original source. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate credit_g and german datasets.
solar_flare_2 and flare derive from the same data, but differ in the way the target is formulated. solar_flare_2 also contains two additional features.
- Identify original source. The original can be found on the UCI ML repository. There are three targets, of which one is useful for ML prediction. The additional features in solar_flare_2 are in fact the other two targets.
- Parse data from the original into the expected format.
- Deprecate solar_flare_2 and flare datasets.
car and car_evaluation are based on the same dataset. In the car_evaluation dataset, several categorical (ordinal) features from car are one-hot-encoded. The original can be found on the UCI ML repository. This issue was also mentioned in #84 ("car and car_evaluation seem to be identical").
- Parse data from the original into the expected format.
- Deprecate car and car_evaluation datasets.
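To compare a one-hot-encoded dataset against its ordinal counterpart, the indicator columns can be collapsed back into a single ordinal code. The sketch below assumes indicator columns named `<feature>_<level>`, which is a hypothetical naming scheme and not necessarily the one used in car_evaluation.

```python
def collapse_one_hot(rows, feature, levels):
    """Turn one-hot indicator columns back into ordinal codes.

    `rows` is a list of dicts; `levels` lists the category levels from
    lowest to highest. The code of a row is the index of its active level.
    """
    codes = []
    for row in rows:
        active = [i for i, level in enumerate(levels) if row[f"{feature}_{level}"] == 1]
        if len(active) != 1:
            raise ValueError(f"expected exactly one active level, got {len(active)}")
        codes.append(active[0])
    return codes


levels = ["low", "med", "high", "vhigh"]
rows = [
    {"buying_low": 1, "buying_med": 0, "buying_high": 0, "buying_vhigh": 0},
    {"buying_low": 0, "buying_med": 0, "buying_high": 0, "buying_vhigh": 1},
]
print(collapse_one_hot(rows, "buying", levels))
```

Once both datasets use the same ordinal coding, their features can be compared column by column.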
chess and kr_vs_kp are identical. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate chess and kr_vs_kp datasets.
satimage and 294_satellite_image are the same, with the exception that 294_satellite_image incorrectly specifies a regression problem. The original can be found on the UCI ML repository, and has multiple (6) classes as target.
- Parse data from the original into the expected format.
- Deprecate satimage and 294_satellite_image datasets.
197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act are based on the same dataset, with the difference being that 227_cpu_small and 562_cpu_small have fewer features.
- Identify original source.
- Parse data from the original into the expected format.
- Deprecate 197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act datasets.
poker and 1595_poker are identical except for the target specification. The original can be found on the UCI ML repository, which suggests that the target is ordinal.
- Parse data from the original into the expected format.
- Deprecate poker and 1595_poker datasets.

My proposal is to remove duplicates, using an original dataset where this can be found. This might also address the following issues:


trangdata · 2022-10-14T13:38:49Z

Thank you so much for your detailed investigation of the dataset collection @alexzwanenburg! Would you have the bandwidth to make a PR to address (even part of) the duplications?

alexzwanenburg · 2022-10-17T06:49:53Z

Yes, I can create a PR to address this issue, though it may take a few weeks to address everything fully.

I have two questions:

What should I do with the duplicate datasets? Issue #119 ("Backwards compatibility") was not fully addressed. I propose adding a deprecated tag to the metadata yaml file that, if present, refers to the new dataset. I would expect this to have the following behaviour:
- If the deprecated tag is present, and not empty (~ or null), the dataset will no longer be visible on the PMLB GitHub Pages.
- If the deprecated tag is present, and not empty (~ or null), fetching the dataset will produce a warning.
- Deprecated data sets are fully removed with the next major release (v2.0.0).
Some datasets have a known license. Can I add a license tag to the metadata to document this, e.g. license: CC-BY-4.0?
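The proposed behaviour could look roughly like the sketch below. It is not PMLB code: the function names and the replacement dataset name `cmc_uci` are hypothetical, the license value only repeats the example above, and the metadata is given as plain dicts standing in for parsed yaml files (where `~` and `null` both load as `None`).

```python
import warnings

# Hypothetical parsed metadata, keyed by dataset name.
EXAMPLE_METADATA = {
    "cmc": {"deprecated": "cmc_uci"},
    "contraceptive": {"deprecated": "cmc_uci"},
    "cmc_uci": {"deprecated": None, "license": "CC-BY-4.0"},
}


def is_deprecated(metadata):
    """A dataset counts as deprecated if the tag is present and not null."""
    return metadata.get("deprecated") is not None


def visible_datasets(all_metadata):
    """Names that would still be listed, e.g. on the GitHub Pages site."""
    return sorted(name for name, meta in all_metadata.items() if not is_deprecated(meta))


def fetch_metadata(name, all_metadata):
    """Return a dataset's metadata, warning if the dataset is deprecated."""
    meta = all_metadata[name]
    if is_deprecated(meta):
        warnings.warn(
            f"Dataset '{name}' is deprecated; use '{meta['deprecated']}' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    return meta


print(visible_datasets(EXAMPLE_METADATA))
```

Storing the replacement name in the tag itself lets the warning tell users where to go, and a missing tag behaves the same as an empty one, so existing metadata files would not need to change.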

lacava · 2022-10-17T18:59:28Z

all those suggestions look good to me.

lacava · 2023-04-27T21:50:33Z

Hi @alexzwanenburg, thanks again for your work spearheading this. Do you still plan to make a PR for these changes? 🙏

alexzwanenburg · 2023-04-28T06:35:53Z

Yes, but I still need to update the four final datasets. I can create a PR for the work I have already done.

lacava · 2023-10-23T13:20:40Z

Ping on this @alexzwanenburg. Hopefully we can pick up where you left off if you create a PR.

gkronber · 2023-11-11T18:34:11Z

@alexzwanenburg I'm ready to help finish this PR. Is your fork up-to-date with your changes documented in this issue?

alexzwanenburg · 2023-11-26T09:49:37Z

I made a PR. I haven't addressed the last four datasets.

alexzwanenburg mentioned this issue Nov 26, 2023

Duplicate datsets #180

Open
