Question regarding loading UEA public datasets #92

michael-ychen · 2021-02-17T23:31:43Z

Hi, first of all, I really appreciate this wonderful library for working with time series. I am using pyts to load UEA datasets, but I found that when I load a binary-class dataset, the loaded labels are not binary. After debugging, I suspect there might be an issue with the line below.

pyts/pyts/datasets/uea.py

Line 297 in 1aa4558

y.append(X_data[i][1])

I think the last element, X_data[i][-1], should be the label, either -1 or 1, instead of X_data[i][1]. Moreover, X should drop the last column, which holds the labels.

I am not sure whether my interpretation is correct. I look forward to hearing from you, and I hope this powerful tool keeps getting better. Thanks so much. :)

johannfaouzi · 2021-02-18T09:18:34Z

Hi,

Thanks for your kind words.

Regarding your question, I tried with SelfRegulationSCP1 and the labels are indeed strings ('negativity' and 'positivity') and not -1 and +1. The labels are taken directly from the files, so I'm not sure that the best solution is to change the labels in this function, as it could be confusing for other users who are familiar with the datasets as they are. Maybe a better solution would be to change the labels directly in the original files. I think that you can raise an issue on this repository to do so: https://github.com/uea-machine-learning/tsml_repo

X_data[i] is a numpy.void object with 2 elements, so X_data[i][1] and X_data[i][-1] are equivalent. The ARFF structure is not really intuitive (I didn't know of its existence before working on this project), so it's not as simple as a CSV file where all but the last column hold the input data and the last column holds the target data.
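To illustrate the point about numpy.void, here is a minimal sketch using only NumPy. The structured dtype, the field names (`series`, `label`) and the values are assumptions made up for illustration; they are not what pyts reads from the ARFF files. It only shows that a record with 2 fields returns the same element for index 1 and index -1.

```python
import numpy as np

# Hypothetical stand-in for rows read from a multivariate ARFF file:
# field 0 holds the nested series, field 1 holds the class label.
row_dtype = np.dtype([("series", np.float64, (2, 3)), ("label", "U12")])
X_data = np.array(
    [
        ([[0.1, 0.2, 0.3], [1.0, 1.1, 1.2]], "negativity"),
        ([[0.4, 0.5, 0.6], [1.3, 1.4, 1.5]], "positivity"),
    ],
    dtype=row_dtype,
)

row = X_data[0]                # a single record is a numpy.void scalar
n_fields = len(row)            # number of fields in the record
same_label = row[1] == row[-1]  # index 1 and index -1 hit the same field

print(type(row).__name__, n_fields, same_label)
```

With only two fields per record, the label can be read with either index, so the choice between `[1]` and `[-1]` does not change the result.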

I don't think that there is an issue with the current code, but I may be wrong. Could you give me more details about what raised your doubts (which dataset, etc.)?

Best,
Johann

michael-ychen · 2021-02-19T01:59:46Z

Hi Johann,

I really appreciate your quick reply. The dataset I was working on is Wafer, which is a binary-class dataset. However, when I use the provided function, fetch_uea_dataset(), to load this dataset, data['target_train'] contains more than two unique values. This finding led me to trace how the function loads the data and splits it into features and labels. Please feel free to let me know if I made any mistakes in using the library. Thanks a lot. :)

Best,
Michael


michael-ychen · 2021-02-19T02:03:41Z

By the way, Wafer is the second dataset I tried before tracing the fetch_uea_dataset() function. The first dataset was SharePriceIncrease, which should also be a binary-class dataset, but its labels contain more than two unique values.

Thanks,

johannfaouzi · 2021-02-19T08:30:08Z

I think that I understood the issue.

UCR usually refers to the univariate time series classification archive, while UEA refers to the multivariate time series classification archive. Both datasets (Wafer and SharePriceIncrease) are univariate time series classification datasets, and should thus be loaded using pyts.datasets.fetch_ucr_dataset. fetch_uea_dataset will give unexpected results in this case. I should probably add a check in this function to make sure that the dataset is multivariate, so that it raises an error when trying to load a univariate dataset.
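A check of this kind could look roughly like the sketch below. Everything in it is hypothetical: the function name `check_multivariate` and the two name sets are made up for illustration and do not exist in pyts, whose lists of available datasets are longer and may be organised differently.

```python
# Hypothetical name sets, for illustration only.
UNIVARIATE_DATASETS = {"Wafer", "SharePriceIncrease"}
MULTIVARIATE_DATASETS = {"SelfRegulationSCP1"}


def check_multivariate(dataset):
    """Raise ValueError unless `dataset` is a known multivariate dataset."""
    if dataset in UNIVARIATE_DATASETS:
        raise ValueError(
            f"'{dataset}' is a univariate dataset; "
            "load it with fetch_ucr_dataset instead."
        )
    if dataset not in MULTIVARIATE_DATASETS:
        raise ValueError(f"'{dataset}' is not a known multivariate dataset.")


check_multivariate("SelfRegulationSCP1")  # passes silently

try:
    check_multivariate("Wafer")
except ValueError as error:
    message = str(error)
    print(message)
```

Failing early with a message that names the right loader would avoid the silent mislabelling described in this thread.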

That being said, you should get an error when using pyts.datasets.fetch_ucr_dataset or pyts.datasets.fetch_uea_dataset if you don't provide the folder (data_home parameter), because these datasets are not in the list of available datasets (I haven't updated this list for a while, and I should definitely do so).

With pyts.datasets.fetch_ucr_dataset I can load a local version of Wafer. I can't with SharePriceIncrease because of this line:

pyts/pyts/datasets/ucr.py

Line 283 in 1aa4558

except IndexError:

because an OSError is raised and not an IndexError. When replacing this line with except (IndexError, OSError):, it works as intended. I think that I only caught IndexError because all the previous univariate datasets had TXT files, but that does not seem to be the case for the more recent ones.
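The effect of the broader except clause can be seen in the self-contained sketch below. It is not the pyts loader: `load_train_split`, the file layout and the `"arff"` marker are assumptions for illustration. It only shows that a missing TXT file raises FileNotFoundError, which is a subclass of OSError and so is not caught by `except IndexError:` alone.

```python
import os
import tempfile


def load_train_split(folder, dataset):
    """Read <dataset>_TRAIN.txt, or signal that a fallback is needed."""
    txt_path = os.path.join(folder, dataset + "_TRAIN.txt")
    try:
        with open(txt_path, encoding="utf-8") as f:
            rows = [[float(v) for v in line.split()] for line in f if line.strip()]
        # First value of each row is the label, the rest is the series.
        labels = [row[0] for row in rows]
        series = [row[1:] for row in rows]
        return series, labels, "txt"
    except (IndexError, OSError):
        # A missing TXT file raises FileNotFoundError (an OSError subclass).
        # A real loader would read the ARFF file here instead.
        return None, None, "arff"


with tempfile.TemporaryDirectory() as folder:
    # Only Wafer gets a TXT file; SharePriceIncrease has none.
    with open(os.path.join(folder, "Wafer_TRAIN.txt"), "w", encoding="utf-8") as f:
        f.write("1 0.5 0.6\n-1 0.1 0.2\n")
    txt_result = load_train_split(folder, "Wafer")
    missing_result = load_train_split(folder, "SharePriceIncrease")

print(txt_result[2], missing_result[2])
```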

I hope this helps you a bit. I have some work to do to update these functions and I'm a bit busy right now, so I would suggest applying these fixes yourself in your local version of pyts, but I will try to add them to the repository as soon as possible.

Best,
Johann

michael-ychen · 2021-02-19T22:14:00Z

Hi Johann,

Yes, you are absolutely right. I think there are two reasons: (1) I was working on univariate time series, and (2) I had commented out the line that checks the dataset list, since my target datasets are not in that list. By the way, I have solved the problem by customizing fetch_uea_dataset(). Thank you so much! :)

Best regards,
Michael

