Question regarding loading UEA public datasets #92

Open
michael-ychen opened this issue Feb 17, 2021 · 5 comments

Comments

@michael-ychen

michael-ychen commented Feb 17, 2021

Hi, first of all, I really appreciate this wonderful library for working with time series. I am using pyts to load UEA datasets, and I found that when I load a binary-class dataset, the loaded labels are not binary. After debugging, I suspect there may be an issue with the line below.

y.append(X_data[i][1])

I think the last element, X_data[i][-1], should be the label, either -1 or 1, instead of X_data[i][1]. Moreover, X should drop the last column, which holds the labels.

I am not sure whether my interpretation is correct. I look forward to hearing from you, and I hope this powerful tool keeps getting better. Thanks so much. :)

@johannfaouzi
Owner

Hi,

Thanks for your kind words.

Regarding your question, I tried with SelfRegulationSCP1, and the labels are indeed strings ('negativity' and 'positivity') rather than -1 and +1. The labels are taken directly from the files, so I'm not sure that the best solution is to change the labels in this function, as it could be confusing for other users who are familiar with the datasets as they are. A better solution might be to change the labels directly in the original files. I think that you can raise an issue on this repository to do so: https://github.com/uea-machine-learning/tsml_repo
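If numeric labels are needed in the meantime, they can be remapped on the user side after loading. This is only a minimal sketch: the array `y` stands in for labels returned by the loader, and the choice of which string maps to -1 and which to +1 is an assumption, not something defined by the dataset or by pyts.

```python
import numpy as np

# Stand-in for the string labels returned when loading SelfRegulationSCP1.
y = np.array(["negativity", "positivity", "positivity", "negativity"])

# Assumed mapping; the sign given to each class is an arbitrary choice here.
label_map = {"negativity": -1, "positivity": 1}

# Look up each string label and build an integer array.
y_numeric = np.array([label_map[label] for label in y])
```

Keeping the mapping outside the loader leaves the labels in the files untouched for other users.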

X_data[i] is a numpy.void object with 2 elements, so X_data[i][1] and X_data[i][-1] are equivalent. The ARFF structure is not really intuitive (I didn't know it existed before working on this project), so it's not as simple as a CSV file in which all but the last column hold the input data and the last column holds the target data.

I don't think that there is an issue with the current code, but I may be wrong. Could you give me more details about what made you have doubts (which dataset, etc.)?
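The equivalence of the two indices can be checked on a small structured array. This is a sketch with a made-up record layout (the field names `series` and `label` and the shapes are assumptions for illustration), not the actual structure produced when pyts reads an ARFF file.

```python
import numpy as np

# Hypothetical record layout: one nested series field and one label field.
record_dtype = np.dtype([("series", "f8", (2, 3)), ("label", "U12")])
X_data = np.zeros(2, dtype=record_dtype)
X_data["series"][0] = np.arange(6.0).reshape(2, 3)
X_data["label"][0] = "negativity"
X_data["series"][1] = np.ones((2, 3))
X_data["label"][1] = "positivity"

row = X_data[0]                     # a numpy.void record
n_fields = len(row)                 # number of fields in the record
same_label = row[1] == row[-1]      # both indices point at the label field
```

With exactly two fields per record, index 1 and index -1 refer to the same field, so either spelling retrieves the label.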

Best,
Johann

@michael-ychen
Author

michael-ychen commented Feb 19, 2021 via email

@michael-ychen
Author

By the way, Wafer was the second dataset I tried before tracing through the fetch_uea_dataset() function. The first was SharePriceIncrease, which should also be a binary-class dataset, but its loaded labels contain more than two unique values.

Thanks,

@johannfaouzi
Owner

johannfaouzi commented Feb 19, 2021

I think that I understood the issue.

UCR usually refers to the univariate time series classification archive, while UEA refers to the multivariate time series classification archive. Both datasets (Wafer and SharePriceIncrease) are univariate time series classification datasets, and should thus be loaded using pyts.datasets.fetch_ucr_dataset. fetch_uea_dataset will give unexpected results in this case. I should probably add a check in this function that makes sure the dataset is multivariate and raises an error when trying to load a univariate dataset.

That being said, you should get an error when using pyts.datasets.fetch_ucr_dataset or pyts.datasets.fetch_uea_dataset if you don't provide the folder (data_home parameter), because these datasets are not in the list of available datasets (I haven't updated that list for a while, and I should definitely do it).
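A check of the kind mentioned above could look roughly like the following. It is a sketch only: `check_multivariate` is a hypothetical helper, not part of pyts, and it assumes that multivariate data is stored as a 3D array of shape (n_samples, n_features, n_timestamps).

```python
import numpy as np


def check_multivariate(X):
    """Hypothetical check: raise if X does not look like a multivariate dataset."""
    X = np.asarray(X)
    # Assumed layout: (n_samples, n_features, n_timestamps).
    if X.ndim != 3:
        raise ValueError(
            "Expected a 3D array for a multivariate dataset, "
            f"got {X.ndim} dimension(s). For univariate datasets, "
            "consider pyts.datasets.fetch_ucr_dataset instead."
        )
    return X


X_multivariate = np.zeros((4, 2, 10))   # 4 samples, 2 features, 10 timestamps
X_univariate = np.zeros((4, 10))        # 4 samples, 10 timestamps

checked = check_multivariate(X_multivariate)  # passes and returns the array

try:
    check_multivariate(X_univariate)
    raised = False
except ValueError:
    raised = True
```

Failing early with an explicit message is preferable to returning labels that silently look wrong.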

With pyts.datasets.fetch_ucr_dataset I can load a local version of Wafer. I can't with SharePriceIncrease because of this line:

except IndexError:

because an OSError is raised and not an IndexError. When replacing this line with except (IndexError, OSError):, it works as intended. I think that I only used an IndexError because all the previous univariate datasets always had TXT files, but it seems not to be the case for more recent ones.
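The effect of widening the except clause can be illustrated with a small stand-alone sketch. The helper `read_train_file` and the file naming below are hypothetical and only mimic the "try TXT, fall back to ARFF" pattern; they are not the actual pyts code.

```python
import os
import tempfile


def read_train_file(folder, dataset):
    """Hypothetical helper: prefer the TXT file, fall back to the ARFF file."""
    txt_path = os.path.join(folder, dataset + "_TRAIN.txt")
    arff_path = os.path.join(folder, dataset + "_TRAIN.arff")
    try:
        with open(txt_path) as f:
            return "txt", f.read()
    except (IndexError, OSError):
        # A missing file raises FileNotFoundError, a subclass of OSError,
        # so catching IndexError alone would let the error propagate.
        with open(arff_path) as f:
            return "arff", f.read()


with tempfile.TemporaryDirectory() as folder:
    # Only an ARFF file exists, as for the more recent datasets.
    with open(os.path.join(folder, "Demo_TRAIN.arff"), "w") as f:
        f.write("@relation Demo\n")
    kind, content = read_train_file(folder, "Demo")
```

With only `except IndexError:`, the same call would fail with a FileNotFoundError instead of reaching the ARFF fallback.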

I hope this helps you a bit. I have some work to do to update these functions and I'm a bit busy right now, so I would suggest applying these fixes yourself in your local version of pyts, but I will try to add them to the repository as soon as possible.

Best,
Johann

@michael-ychen
Author

michael-ychen commented Feb 19, 2021 via email
