
NIH Common Fund - SPARC Dataset Structure #8

Open
jgrethe opened this issue Jan 16, 2021 · 8 comments

Comments

jgrethe commented Jan 16, 2021

There is an effort within the SPARC program to develop such a structure:
https://sparc.science/help/3FXikFXC8shPRd8xZqhjVT

A white paper is about to be published. There is also some tooling being developed to assist researchers in migrating files to the structure as well as tools for validation.

SylvainTakerkart (Collaborator) commented Feb 1, 2021

Thanks @jgrethe for this post!

I think this is the same initiative as the one @tgbugs described, e.g. here: #4 (comment), correct? (If so, I propose closing this issue to keep things centralized in a single thread; ok with you @jgrethe?)

Cheers,

Sylvain

yarikoptic (Member) commented

.xlsx files for seemingly trivial tabular data - yikes! Does anyone know what the motivation was for going with that beast instead of a simple .tsv?

jbpoline (Member) commented Feb 1, 2021

👍
This sounds like something worth correcting.

tgbugs commented Feb 1, 2021

.tsv, .csv, and .json are all also supported. The reason .xlsx is supported is that it was originally easier for non-technical users to work with in their existing workflows, and because we needed an additional layer in the files to be able to communicate required vs. optional fields. Over time another reason has emerged: it is next to impossible to get non-technical users to fix bad file encodings (e.g. latin-1). We have also found that users struggle with tsv vs. csv vs. semicolon-separated files, so letting them use their defaults avoids many layers of confusion.
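
As an illustration of the encoding problem only (a minimal sketch, not the SPARC tooling; it assumes Python and the standard csv module), this is the kind of guessing a .tsv reader is forced into when depositors hand over files in unknown encodings:

```python
# Minimal sketch (not the SPARC tooling): a .tsv with an unknown encoding
# forces the reader to guess; a wrong guess does not fail loudly, it just
# yields mojibake that a non-technical depositor cannot easily be asked to fix.
import csv

def read_tsv_rows(path):
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, encoding=encoding, newline="") as f:
                return list(csv.reader(f, delimiter="\t")), encoding
        except UnicodeDecodeError:
            continue  # latin-1 never raises, so the fallback always "succeeds"
    raise ValueError(f"could not decode {path}")
```

The .xlsx container largely sidesteps this class of problem, since the text inside it is stored as UTF-8 XML rather than in whatever locale the depositor's machine happened to use.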

There is some tension between the deposition format (xlsx) and the more interoperable formats that we might like to publish with the dataset. Right now we have only implemented functions that go from xlsx -> json, but we plan to implement the other direction as well, so that the xlsx file could serve purely as a user interface and never actually appear in the published dataset.
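
For concreteness, a hypothetical minimal sketch of what an xlsx -> json step can look like, assuming Python, openpyxl, and a single header row (the actual SPARC conversion functions are more involved):

```python
# Hypothetical sketch of an xlsx -> json step; assumes openpyxl and a single
# header row. A real pipeline would also handle required/optional fields,
# multiple sheets, and schema validation on top of this.
import json
from openpyxl import load_workbook

def xlsx_sheet_to_json(path, sheet_name=None):
    wb = load_workbook(path, read_only=True, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active
    rows = ws.iter_rows(values_only=True)
    header = [str(c) if c is not None else "" for c in next(rows)]
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, default=str, indent=2)
```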

tgbugs commented Feb 1, 2021

@SylvainTakerkart yes, same one I mention in #4 (comment).

yarikoptic (Member) commented

So every tool supporting this format for output needs to be able to write xlsx and also ensure consistent dumping in all the other formats? In other words: a multiplicity of possible data representations IMHO just brings possible inconsistency and difficulty in I/O, for unclear benefit, since Excel etc. open tsv just fine.

tgbugs commented Feb 2, 2021

@yarikoptic no. Writing xlsx is only needed to make the life of the user easier if they are depositing data in xlsx format. In the minimal case writing xlsx would not be required, and for publication we might replace the xlsx files with tsv or json so that people who wanted to use the dataset did not have to deal with parsing the xlsx files.

In the minimal case a validator would just read the xlsx file in and tell the user "this is malformed." That validation is implemented at three levels: xlsx -> generic tabular, tabular -> json, and json. Only the xlsx -> generic tabular step needs additional work beyond csv/tsv. In the maximal case it can be easier to show users the malformed fields by writing another xlsx file with all the bad cells marked in red. If you were doing this via a web interface there are other options, and of course the user might never interact with the underlying json structure at all.
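
To make the maximal case concrete, a hypothetical sketch of writing back an annotated copy of the deposited file with the offending cells filled in red, assuming Python, openpyxl, and a precomputed list of error coordinates (not the actual SPARC implementation):

```python
# Hypothetical sketch: copy the deposited xlsx and mark bad cells in red so a
# non-technical user can see exactly what to fix. Assumes openpyxl and that the
# validator has already produced 1-based (row, column) error coordinates.
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

RED = PatternFill(start_color="FFFF0000", end_color="FFFF0000", fill_type="solid")

def mark_errors(in_path, out_path, error_cells):
    wb = load_workbook(in_path)
    ws = wb.active
    for row, col in error_cells:
        ws.cell(row=row, column=col).fill = RED
    wb.save(out_path)
```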

edit: with regard to possible inconsistency, we have found that the more steps away from their defaults a user has to take, the more likely they are to produce inconsistent data. By supporting the defaults that 90% of our data depositors already work with, we cut out a lot of steps that they can screw up.

In short, there are more human errors that can happen when using tsv and csv, and they are significantly harder to fix than any of the implementation issues that might or might not be encountered when using xlsx. I think this is true despite the fact that the current implementation of the validation pipelines always runs two parsers on every xlsx file so that we can catch different sets of errors. Better to do that than to try to get 20 different labs to change how they save their files across 3 operating systems and 5 different localization defaults (probably more operating systems, actually, since some labs are likely still running Windows XP on some of their data acquisition computers).
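
As an aside on the two-parser point, a hypothetical sketch of what running two independent readers over the same xlsx file and pooling their complaints might look like, assuming Python, zipfile, and openpyxl (not necessarily the two parsers the SPARC pipeline actually uses):

```python
# Hypothetical sketch: check the same xlsx with two independent readers and
# collect whatever each one complains about, since they tend to surface
# different classes of malformed files.
import zipfile
from openpyxl import load_workbook

def parse_with_both(path):
    errors = []
    # Reader 1: cheap structural check; xlsx is a zip archive of XML parts.
    try:
        with zipfile.ZipFile(path) as zf:
            bad = zf.testzip()
            if bad is not None:
                errors.append(f"zip: corrupt member {bad}")
    except zipfile.BadZipFile as e:
        errors.append(f"zip: {e}")
    # Reader 2: full parse of the workbook contents.
    try:
        load_workbook(path, read_only=True, data_only=True)
    except Exception as e:
        errors.append(f"openpyxl: {e}")
    return errors  # empty list means both readers accepted the file
```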

oruebel commented Feb 17, 2021

The paper on the SPARC Data Structure is here https://doi.org/10.1101/2021.02.10.430563
