Add an easier way of adding datasets #1507

Open
george-gca opened this issue Aug 7, 2023 · 5 comments
Labels
Discussion enhancement New feature or request
Milestone

Comments

@george-gca

george-gca commented Aug 7, 2023

Feature Request

Is your feature request related to a problem? Please describe.
Currently, adding a dataset for other users requires starting from a given template and then pip installing the resulting package.

Describe the solution you'd like
An easier way of adding an (un)labeled dataset to be used in ASReview.

Describe alternatives you've considered
Maybe something like adding from a URL, but with a link to a JSON file like the one used in BenchmarkDataGroup.

Teachability, Documentation, Adoption, Migration Strategy
After reading the info from the JSON file, ASReview could display information like that shown in the Benchmark datasets panel, but also for unlabeled datasets:

[Screenshot: Benchmark datasets panel]

@Rensvandeschoot
Member

Thank you for your feature request and for championing open science. We wholeheartedly share your commitment to transparency, as reflected in our open-source pipeline. Our overarching goal is to harness collective data to continually refine and improve models.

To start the discussion, this is what is already available:

  • Importing via DOI: We have incorporated this feature in a recent pull request. DOIs offer a robust foundation for crafting reproducible workflows. After publishing a dataset on reputable data repositories like the Open Science Framework, users can effortlessly add it to ASReview using its DOI.

  • Importing via URL: The existing user interface for URL imports remains mostly intact. What's new is our two-step approach: the URL undergoes a validation process, returning a filename, followed by the dataset upload. This not only expands our URL compatibility but also guarantees preliminary validation.

  • Benchmark data via SYNERGY: Regarding benchmark data, our team, led by @J535D165, is in the process of integrating the SYNERGY datasets. Users interested in contributing datasets can utilize the SYNERGY pipeline for this purpose.
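
As an illustration of the two-step URL import described in the list above (validate first and get a filename back, then fetch the data), here is a minimal sketch using only the Python standard library. The function name `validate_dataset_url` and the set of accepted file extensions are assumptions made for this example; they are not ASReview's actual API.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# Assumed set of accepted file types; illustrative only.
ALLOWED_SUFFIXES = {".csv", ".tsv", ".ris", ".xlsx"}


def validate_dataset_url(url: str) -> str:
    """Step 1: check the URL and return the filename it points to.

    Raises ValueError if the URL is not usable. No network access happens here.
    """
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an http(s) URL: {url!r}")
    # Take the last path component and decode percent-escapes such as %20.
    filename = PurePosixPath(unquote(parts.path)).name
    if PurePosixPath(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {filename!r}")
    return filename


# Step 2 (the actual download) would run only after step 1 succeeds.
filename = validate_dataset_url("https://example.org/data/my_dataset.csv")
print(filename)
```

Keeping validation separate from the download means a bad link can be rejected with a clear message before any data is transferred.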

We're always open to more suggestions and input, so let's explore even more ways to interact with our users' data!

@george-gca
Author

george-gca commented Aug 10, 2023

What I meant is that I created some datasets and would like to make them easily accessible to other researchers. Importing via URL is fine, but it doesn't display much information about the dataset. Having the option to import something like a JSON file with some metadata to be displayed would be informative for the user, much like in the screenshot above.

Also, implementing something like the benchmark datasets part, but for community datasets, would be great. To be clearer: a tab with Community datasets, much like the one in the screenshot above for the Benchmark datasets, but usable in Oracle mode. This way the user could select from a list of community-curated datasets, download one, and start using it without ever leaving ASReview. Most of these datasets would be unlabeled, serving as the starting point for someone's review.

These datasets would be useful not only for creating a systematic review in the strict sense, but also for finding new papers in an area of interest that were just published at a conference, which is closer to my use case. For example, when a new conference makes its papers available, someone could create a dataset with information from that conference, and users could readily access it and start reviewing new papers of interest.

@Rensvandeschoot
Member

Yes, I totally understand! I just wanted to create an overview of what is already possible :-)
Adding a dataset to the software is also possible via this template: https://github.com/asreview/template-extension-new-dataset. You need to host the data on a server, and when you start ASReview you will see this dataset as part of the software (it has been a while since I tried this template myself, but it should still work; if not, let us know!). Perhaps this solution is already closer to what you have in mind?

@george-gca
Author

I already cited the template in the issue description.

My problem with this approach is having to create a package that needs to be installed with pip. I don't think a dataset has enough information or functionality to need a package of its own.

It should be something smoother, like creating a YAML or JSON file with metadata that points to where the actual data should be downloaded from. Note that this solution is already implemented for JSON files in the BenchmarkDataGroup. I think it just needs to be made available for Oracle mode and documented.
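
To make the idea concrete, here is a rough sketch of what reading such a metadata file could look like. The field names (`title`, `url`, `labeled`, `description`) and the `DatasetDescriptor` class are hypothetical and chosen only for illustration; they are not the schema that BenchmarkDataGroup actually uses.

```python
import json
from dataclasses import dataclass


@dataclass
class DatasetDescriptor:
    """Hypothetical metadata for one community dataset."""
    title: str
    url: str  # where the actual records would be downloaded from
    labeled: bool = False
    description: str = ""


def load_descriptor(text: str) -> DatasetDescriptor:
    """Parse a JSON metadata document and check the required fields."""
    raw = json.loads(text)
    missing = [key for key in ("title", "url") if key not in raw]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    return DatasetDescriptor(
        title=raw["title"],
        url=raw["url"],
        labeled=bool(raw.get("labeled", False)),
        description=raw.get("description", ""),
    )


# Example metadata document; all values are made up.
example = """
{
  "title": "Example conference 2023 papers",
  "url": "https://example.org/conference2023.csv",
  "labeled": false,
  "description": "Titles and abstracts, unlabeled."
}
"""
descriptor = load_descriptor(example)
print(f"{descriptor.title} ({'labeled' if descriptor.labeled else 'unlabeled'})")
```

With something along these lines, the interface would have enough to show a title, a description, and a labeled/unlabeled badge before downloading anything, and no pip package would be needed per dataset.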

@J535D165
Member

This is a great idea. Somehow we never thought about this. We welcome contributions here, and our team is also interested in implementing it.

J535D165 added this to the v2.0 milestone Mar 14, 2024
3 participants