Add an easier way of adding datasets #1507

george-gca · 2023-08-07T19:41:09Z

Feature Request

Is your feature request related to a problem? Please describe.
Currently, adding a dataset to be used by other users requires starting from a given template and then pip installing the dataset.

Describe the solution you'd like
An easier way of adding an (un)labeled dataset to be used in ASReview.

Describe alternatives you've considered
Maybe something like adding from a URL, but giving a link to a JSON file like the one used in BenchmarkDataGroup.

Teachability, Documentation, Adoption, Migration Strategy
After reading the info from the json file, it could display some information like the one exhibited in the Benchmarks dataset panel, but also for unlabeled datasets:

Rensvandeschoot · 2023-08-08T06:09:52Z

Thank you for your feature request and for championing open science. We wholeheartedly share your commitment to transparency, as reflected in our open-source pipeline. Our overarching goal is to harness collective data to continually refine and improve models.

To start the discussion, this is what is already available:

Importing via DOI: We have incorporated this feature in a recent pull request. DOIs offer a robust foundation for crafting reproducible workflows. After publishing a dataset on reputable data repositories like the Open Science Framework, users can effortlessly add it to ASReview using its DOI.
Importing via URL: The existing user interface for URL imports remains mostly intact. What's new is our two-step approach: the URL undergoes a validation process, returning a filename, followed by the dataset upload. This not only expands our URL compatibility but also guarantees preliminary validation.
Benchmark data via SYNERGY: Regarding benchmark data, our team, led by @J535D165, is in the process of integrating the SYNERGY datasets. Users interested in contributing datasets can utilize the SYNERGY pipeline for this purpose.

We're always open to more suggestions and input, so let's explore even more ways to interact with our users' data!

george-gca · 2023-08-10T15:23:12Z

What I meant is, I created some datasets and would like to make them easily accessible to other researchers. Importing via URL is fine, but it doesn't display much information about the dataset. Having the option to import something like a JSON file with some metadata to be displayed would be informative for the user, much like in the screenshot above.

Also, implementing something like the benchmark datasets part but for community datasets would be great. To be clearer: a tab with Community datasets, much like the one in the screenshot above for the Benchmark datasets, but to be used in Oracle mode. This way the user could select from a list of community-curated datasets, download one, and start using it without ever leaving ASReview. Most of these datasets would be unlabeled, serving as the starting point for someone's review.

This would be useful not only for systematic reviews in the strict sense, but also for finding newly published conference papers in an area of interest, which is closer to my use case. For example, when a new conference makes its papers available, someone could create a dataset from them, and users could readily access it and start reviewing the papers of interest.

Rensvandeschoot · 2023-08-11T13:38:07Z

Yes, I totally understand! I just wanted to create an overview of what is already possible :-)
Adding a dataset to the software is also possible via this template: https://github.com/asreview/template-extension-new-dataset. You need to host the data on a server, and when you start ASReview you will see this dataset as part of the software (it has been some time since I tried this template myself, but it should still work... if not, let us know!). Perhaps this solution is already closer to what you have in mind...?

george-gca · 2023-08-11T17:16:06Z

I already cited the template in the issue description.

My problem with this approach is having to create a package that needs to be installed with pip. I don't think a dataset has enough information or functionality to need a package of its own.

It should be something smoother, like creating a YAML or JSON file with metadata that points to where the real data should be downloaded from. Note that this solution is already implemented for JSON files in the BenchmarkDataGroup. I think it just needs to be made available for Oracle mode and documented.
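As a rough illustration of the metadata-file idea, here is a minimal sketch in Python. It is only an assumption of what such a file could look like: the field names (`dataset_id`, `title`, `url`, `description`, `labeled`), the `DatasetEntry` class, and the `parse_dataset_index` helper are all hypothetical and are not taken from ASReview or from the actual BenchmarkDataGroup format.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: these field names are illustrative only and are
# NOT the format ASReview or BenchmarkDataGroup actually uses.
REQUIRED_FIELDS = ("dataset_id", "title", "url")


@dataclass
class DatasetEntry:
    dataset_id: str
    title: str
    url: str  # where the actual records would be downloaded from
    description: str = ""
    labeled: bool = False


def parse_dataset_index(text: str) -> list[DatasetEntry]:
    """Parse a JSON array of dataset descriptions into DatasetEntry objects."""
    entries = []
    for raw in json.loads(text):
        # Reject entries that lack any of the required fields.
        missing = [name for name in REQUIRED_FIELDS if name not in raw]
        if missing:
            raise ValueError(f"dataset entry is missing fields: {missing}")
        entries.append(
            DatasetEntry(
                dataset_id=raw["dataset_id"],
                title=raw["title"],
                url=raw["url"],
                description=raw.get("description", ""),
                labeled=bool(raw.get("labeled", False)),
            )
        )
    return entries


# Made-up example of a community dataset index with a single entry.
EXAMPLE_INDEX = """
[
  {
    "dataset_id": "example-conference-2023",
    "title": "Example Conference 2023 papers",
    "url": "https://example.org/datasets/example-conference-2023.csv",
    "description": "Titles and abstracts, unlabeled.",
    "labeled": false
  }
]
"""

if __name__ == "__main__":
    for entry in parse_dataset_index(EXAMPLE_INDEX):
        kind = "labeled" if entry.labeled else "unlabeled"
        print(f"{entry.title} ({kind}): {entry.url}")
```

A file like this could be hosted anywhere and shared by link; the interface would only need the metadata to show a dataset card, and would fetch the real data from `url` once the user picks it.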

J535D165 · 2023-08-14T10:07:42Z

This is a great idea. Never thought about this somehow. We are welcoming contributions to this, and our team is also interested in implementing this.

george-gca assigned J535D165 Aug 7, 2023

Rensvandeschoot added the enhancement and Discussion labels Aug 8, 2023

Rensvandeschoot unassigned J535D165 Aug 8, 2023

J535D165 added this to the v2.0 milestone Mar 14, 2024
