Include doi as required field in meta.yaml #158

Open
jbusecke opened this issue Jan 13, 2023 · 8 comments

Comments

@jbusecke
Contributor

I am working with @rabernat on documentation that outlines what a proper citation for pangeo-forge data should look like. We noticed that the catalog page does not display the doi, which is needed to cite the original data source in a paper.

I propose adding a required field to meta.yaml that contains the doi (or possibly a list of dois) for a given dataset. This could then be used to provide a 'copy citation' button on each catalog entry.
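To make the "doi or list of dois" idea concrete, here is a minimal Python sketch of how such a field could be read and normalized. It assumes a hypothetical top-level `doi` key in the parsed meta.yaml; the key name, the function name `normalize_dois`, and the loose pattern check are illustrative assumptions, not part of the existing schema.

```python
import re

# Loose shape check for a DOI ("10.<registrant>/<suffix>").
# This is an assumption for illustration, not an official validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
DOI_URL_PREFIX = "https://doi.org/"


def normalize_dois(meta):
    """Return a list of DOI URLs from a parsed meta.yaml dict.

    Assumes a hypothetical top-level `doi` key holding either a single
    string or a list of strings.
    """
    raw = meta.get("doi")
    if raw is None:
        raise ValueError("meta.yaml is missing the 'doi' field")
    if isinstance(raw, str):
        dois = [raw]
    elif isinstance(raw, (list, tuple)):
        dois = list(raw)
    else:
        raise ValueError("'doi' must be a string or a list of strings")
    if not dois:
        raise ValueError("'doi' must contain at least one entry")

    urls = []
    for doi in dois:
        if not isinstance(doi, str):
            raise ValueError(f"doi entries must be strings, got {doi!r}")
        # Accept either a bare DOI or a full https://doi.org/ URL.
        bare = doi[len(DOI_URL_PREFIX):] if doi.startswith(DOI_URL_PREFIX) else doi
        if not DOI_PATTERN.match(bare):
            raise ValueError(f"not a valid-looking doi: {doi!r}")
        urls.append(DOI_URL_PREFIX + bare)
    return urls
```

A catalog page could then render the returned URLs next to each entry, which is the piece a 'copy citation' button would need.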

@rabernat

More broadly, we could think about how we want recipes to be cited. Here is the example we came up with today for this dataset.

The data in this study originated from the NASA "GPM IMERG Late Precipitation L3 1 day 0.1 degree x 0.1 degree V06 (GPM_3IMERGDL)" dataset (Huffman et al., 2019)
The data were accessed via the Pangeo Forge ARCO data repository (Stern et al., 2022) on Jan. 13, 2023.
The Pangeo Forge recipe that generated the data is located at https://pangeo-forge.org/dashboard/feedstock/81

Huffman, G.J., E.F. Stocker, D.T. Bolvin, E.J. Nelkin, Jackson Tan (2019), GPM IMERG Late Precipitation L3 1 day 0.1 degree x 0.1 degree V06, Edited by Andrey Savtchenko, Greenbelt, MD, Goddard Earth Sciences Data and Information Services Center (GES DISC). https://doi.org/10.5067/GPM/IMERGDL/DAY/06

Stern, Charles, R. Abernathey, J. Hamman, R. Wegener, C. Lepore, S. Harkins, A. Merose.
Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data Production.
Frontiers in Climate, 10 February 2022
https://doi.org/10.3389/fclim.2021.782909

What's missing here is a good citation for the recipe author (in this case, @briannapagan). Brianna, I'm curious, what sort of acknowledgement of your role would make sense here?
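For a 'copy citation' button, the three-sentence acknowledgement in this comment could be filled in from a template. The Python sketch below is only an illustration: the function name `acknowledgement` and all of its parameters are hypothetical and do not correspond to fields in any existing schema. It leaves out any recipe author credit, since that question is still open.

```python
# Short citation for the Pangeo Forge paper, as used in the example in this thread.
PANGEO_FORGE_CITE = "Stern et al., 2022"


def acknowledgement(dataset_title, dataset_cite, feedstock_url, access_date):
    """Fill in the three-sentence acknowledgement from this thread.

    All parameter names are hypothetical placeholders.
    """
    lines = [
        f"The data in this study originated from the {dataset_title} "
        f"dataset ({dataset_cite}).",
        f"The data were accessed via the Pangeo Forge ARCO data repository "
        f"({PANGEO_FORGE_CITE}) on {access_date}.",
        f"The Pangeo Forge recipe that generated the data is located at "
        f"{feedstock_url}",
    ]
    return "\n".join(lines)


text = acknowledgement(
    dataset_title=(
        'NASA "GPM IMERG Late Precipitation L3 1 day '
        '0.1 degree x 0.1 degree V06 (GPM_3IMERGDL)"'
    ),
    dataset_cite="Huffman et al., 2019",
    feedstock_url="https://pangeo-forge.org/dashboard/feedstock/81",
    access_date="Jan. 13, 2023",
)
print(text)
```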

@briannapagan

I would second showing the doi/list of dois. Recipe authors have the responsibility of properly citing the original dataset. As for acknowledging the recipe authors: is adding a doi for the recipe itself overdoing it? Do we need recipe author acknowledgement at all? Using NASA as an example, the archivers and others working at the data centers do not get acknowledgement for maintaining the data collections themselves, just potentially a nod to the data center itself.

@jbusecke
Contributor Author

jbusecke commented Jan 18, 2023

Thanks for that @briannapagan!

I personally think we have the opportunity to change the status quo for the better here. I would personally advocate for a doi per recipe, which I think will acknowledge the important work which will be the foundation of how climate data science might be done in the future?

We cannot assume that recipe maintainers are financially compensated for their work (as is the case at NASA?), so I think providing an easy way to acknowledge their efforts would be fair, and might create a needed incentive for a diverse group of people to contribute recipes?

A practical consideration for reproducibility: if we, for example, decide to implement a Zenodo webhook for feedstocks, we could get a doi plus a secondary archived location for the code. This would increase the chance that future researchers can actually reproduce a given dataset with a particular version of the recipe (even if it has to be run on a local computer).

@briannapagan

Along the same lines, is the recipe maintainer who receives the acknowledgement also responsible for maintaining the recipe in perpetuity? I am going to sound like a broken record, but data archives are very much alive. If a reprocessing error is caught and the original data source is republished, the zarr store must be updated. Is the onus on the maintainer to always ensure the zarr store is accurate? How do we connect the upstream data providers to this?

@briannapagan

> I personally think we have the opportunity to change the status quo for the better here. I would personally advocate for a doi per recipe, which I think will acknowledge the important work which will be the foundation of how climate data science might be done in the future?

Also great! +2 for doi per recipe.

@jbusecke
Contributor Author

jbusecke commented Jan 18, 2023

> Is the onus on the maintainer to always ensure the zarr store is accurate? How do we connect the upstream data providers to this?

Excellent point! Naively I'd think we should aim to involve them as feedstock maintainers/contributors, but I realize this might be hard.

@cisaacstern transferred this issue from pangeo-forge/pangeo-forge.org on Dec 7, 2023
@cisaacstern
Member

👋 all, I've moved this issue here to pangeo-forge-runner because as of #134, the schema for meta.yaml lives here.

@jbusecke
Contributor Author

I think there are several questions mixed in this discussion:

  1. Should we require the doi as part of meta.yaml? (Clearly a runner issue.)
  2. How do we cite the code of a recipe itself? (To me this is not related to runner; I'm not sure where it belongs, but it's more of a meta/docs question, I guess.)

Any suggestions on where to move the discussion of question 2?

Moving forward here: I am strongly in favor of enforcing dois in meta.yaml by default! Perhaps we can have some sort of opt-out option for testing, though?
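One way the "enforced by default, with an opt-out for testing" behavior could look is sketched below in Python. This is a sketch under stated assumptions, not how pangeo-forge-runner validates meta.yaml: the `doi` key, the function name `check_doi`, and the `allow_missing_doi` flag are all hypothetical.

```python
def check_doi(meta, allow_missing_doi=False):
    """Return a list of problem messages; an empty list means the check passed.

    `doi` and `allow_missing_doi` are hypothetical names used for illustration;
    they are not part of the actual meta.yaml schema or runner configuration.
    """
    value = meta.get("doi")

    # Missing or empty field: an error by default, tolerated only on opt-out.
    if not value:
        if allow_missing_doi:
            return []
        return ["meta.yaml must define 'doi' (a string or a list of strings)"]

    # Present: require a string or a list of non-empty strings.
    entries = [value] if isinstance(value, str) else value
    if not isinstance(entries, list) or not all(
        isinstance(entry, str) and entry for entry in entries
    ):
        return ["'doi' must be a string or a list of non-empty strings"]
    return []


# Default behavior: a recipe without a doi is rejected.
print(check_doi({"title": "example"}))
# Opt-out (e.g. for test recipes): the same metadata passes.
print(check_doi({"title": "example"}, allow_missing_doi=True))
```

Keeping the opt-out as an explicit flag, rather than silently accepting a missing field, means production recipes fail loudly while test runs can still proceed.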
