Add support for Software Heritage Identifiers (SWHID) as source of repository #219

Open
douardda opened this issue Nov 17, 2020 · 9 comments

Comments

@douardda

Add support for SWHID as provider of jupyter notebooks

The Software Heritage project aims at collecting, preserving, and sharing all software that is publicly available in source code form (see the Software Heritage Mission).

To be able to do so, each software source code artifact must be identified by an intrinsic persistent identifier, the SWHID (see also this document).

As a result, once a Jupyter notebook has been harvested and stored in the Software Heritage Archive (whether by SWH's regular crawling process or through a software deposit on an open archive repository like HAL), SWHID support would make it possible to use Binder to run the notebook directly, even if the original source for the code has disappeared.

@welcome

welcome bot commented Nov 17, 2020

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@manics
Member

manics commented Nov 17, 2020

Hi! This definitely sounds interesting. Thanks for linking to the background documents. For the benefit of anyone short on time could you perhaps say a bit about how mature SWHID is? E.g. Is it quite new, well established, who's supporting it long term, how much uptake has it had in the community (and which communities), etc?

Is this something you want to work on, or are you just drawing this to our attention to stimulate some discussion? If you want some general comments it'd be worth posting on the Jupyter community forum https://discourse.jupyter.org/ where a much wider audience hangs out.

@douardda
Author

douardda commented Nov 18, 2020

Hi @manics,

The Software Heritage project is now pretty well established, and adoption of SWHIDs for long-term identification of source code artifacts is growing. The HAL repository already automatically stores source code that comes with papers deposited there (see this one for example), and recently eLife and IPOL have also started using the Software Heritage Archive as a backend for long-term preservation and identification of the software used in their papers. We are working on getting more communities involved. We are also working on having SWHIDs used for software citation in academic papers: eLife and IPOL are starting to use them, and JTCAM now recommends their use. @rdicosmo even wrote a biblatex style for SWHID!

There is also this paper by the RDA/Force11 Software Source Code Identification Working Group that may be interesting in this regard.

About doing it on our side or not, it's something I need to discuss with other SWH team members. One possibility might be to use the Sloan Foundation grant to finance this work (not completely sure we can). I expect it should not be a very long task, but as we all know, it's always more complicated than expected.

@manics
Member

manics commented Nov 18, 2020

@douardda Thanks for the update! Before putting in a grant proposal let's make sure there's a consensus from the maintainers here that it should be added.

@douardda
Author

Yes indeed. For the record, I believe there was a discussion on a similar subject a few years ago between @rdicosmo and @minrk (and maybe others).

Now, what is the proper way of reaching such a consensus? Should I create a discussion on discourse?

@manics
Member

manics commented Nov 18, 2020

We don't have a formal process for accepting new content providers in repo2docker, though since you mention it I think it's something we should consider.

We've got our monthly JupyterHub team meeting tomorrow:
jupyterhub/team-compass#346
I'll add this issue to the agenda, just so everyone's aware of it. If you're free, you're more than welcome to join the meeting and say a few words about this; just add it to the agenda.

@betatim
Member

betatim commented Nov 19, 2020

I tried it out yesterday to get a feeling for it. I went to https://www.softwareheritage.org/ and scrolled down to the search box. Typed "binder-examples" which took me to https://archive.softwareheritage.org/browse/search/?q=binder-examples&with_visit=true&with_content=true. I selected the first binder-examples/requirements result and ended up here https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/binder-examples/requirements. There is a download button there which would get me that version. I clicked the download button but that showed me a message "Archive cooking service is currently experiencing issues. Please try again later.".

Then I looked at the API docs to find out how we could automate this. https://archive.softwareheritage.org/api/1/ has a list of all the endpoints. I think https://archive.softwareheritage.org/api/1/vault/directory/doc/ followed by a call to https://archive.softwareheritage.org/api/1/vault/directory/raw/doc/ would be what we need.
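A minimal sketch of that two-step vault flow, assuming the cooking endpoint is `/api/1/vault/directory/<dir_id>/` (POST to request a bundle, GET to poll) and the download endpoint is the same path with `raw/` appended. The `status` field and its values (`done`, `failed`) are assumptions taken from my reading of the vault docs, and `request_json` is a hypothetical caller-supplied HTTP helper, not part of any library.

```python
import time
from typing import Callable

API = "https://archive.softwareheritage.org/api/1"


def vault_status_url(dir_id: str) -> str:
    # Endpoint used both to request cooking (POST) and to poll it (GET).
    return f"{API}/vault/directory/{dir_id}/"


def vault_raw_url(dir_id: str) -> str:
    # Endpoint that serves the cooked bundle once it is ready.
    return f"{API}/vault/directory/{dir_id}/raw/"


def wait_for_bundle(
    dir_id: str,
    request_json: Callable[[str, str], dict],
    max_tries: int = 10,
    delay: float = 5.0,
) -> str:
    """Request a directory bundle and poll until it can be downloaded.

    request_json(method, url) is a hypothetical helper that performs the
    HTTP call and returns the decoded JSON body as a dict.
    """
    status = request_json("POST", vault_status_url(dir_id))
    for _ in range(max_tries):
        state = status.get("status")  # assumed field name
        if state == "done":
            return vault_raw_url(dir_id)
        if state == "failed":
            raise RuntimeError(f"vault could not cook directory {dir_id}")
        time.sleep(delay)
        status = request_json("GET", vault_status_url(dir_id))
    raise TimeoutError(f"bundle for {dir_id} not ready after {max_tries} polls")
```

The helper is injected so the polling logic can be exercised without network access; in practice it could wrap any HTTP client.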

Another example I found is https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/binder-examples/r which is the archived version of https://github.com/binder-examples/r. However it looks like the last time it was archived was in 2018. What is the process for getting things archived? Is there a crawler that constantly checks for new things? Do people submit requests for archiving?

One thing I was wondering is who uses SWHIDs right now and how. It would be good to talk to people who use it to retrieve files to learn more about how they use it and how they expect things to work.

Overall it reminds me of archive.org but for software.

@douardda
Author

(For the record, since we discussed these points at the last monthly JupyterHub team meeting.)

The idea is indeed to use the SWH public API to let a user use a SWHID as source of repository (the same way one can currently use a DOI).
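As a rough illustration of what recognising a SWHID as a source could look like, here is a small sketch. It assumes the core SWHID syntax `swh:1:<type>:<40 hex digits>` with optional `;key=value` qualifiers; the function name `detect_swhid` and the returned dict layout are hypothetical and not taken from repo2docker.

```python
import re
from typing import Optional

# Core SWHID: scheme "swh", version 1, object type, 40-char hex id,
# optionally followed by ";"-separated qualifiers such as origin=...
SWHID_RE = re.compile(
    r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})(?:;(.*))?$"
)


def detect_swhid(source: str) -> Optional[dict]:
    """Return the parsed parts of a SWHID, or None if source is not one."""
    match = SWHID_RE.match(source.strip())
    if match is None:
        return None
    obj_type, obj_id, raw_qualifiers = match.groups()
    qualifiers = {}
    if raw_qualifiers:
        for item in raw_qualifiers.split(";"):
            # partition() tolerates a qualifier with no "=" in it.
            key, _, value = item.partition("=")
            qualifiers[key] = value
    return {"type": obj_type, "id": obj_id, "qualifiers": qualifiers}
```

Returning `None` for anything that does not match lets other sources (a git URL, a DOI) fall through to their own handlers.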

> Then I looked at the API docs to find out how we could automate this. https://archive.softwareheritage.org/api/1/ has a list of all the endpoints. I think https://archive.softwareheritage.org/api/1/vault/directory/doc/ followed by a call to https://archive.softwareheritage.org/api/1/vault/directory/raw/doc/ would be what we need.

The vault may not be the ideal way of retrieving the directory needed to build the Binder execution environment, because it's an asynchronous service.
In this case, I think it would be better to use the API to list the content of the directory the given SWHID refers to: either directly, if the SWHID is a reference to a directory (swh:dir:), or via the directory linked to the revision, if it's a revision (swh:rev:). (For other SWHID types, it will depend on other aspects, such as whether there is enough context to get an unambiguous directory that can be retrieved from that SWHID.)
Then retrieve the directory content using API calls.
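A sketch of that synchronous approach, under stated assumptions: the endpoints `/api/1/revision/<id>/`, `/api/1/directory/<id>/` and `/api/1/content/sha1_git:<id>/raw/`, and the JSON fields `directory`, `name`, `type` and `target`, reflect my understanding of the public API and should be checked against its documentation. `get_json` is a hypothetical caller-supplied function that fetches a URL and returns decoded JSON.

```python
from typing import Callable, Iterator, Tuple

API = "https://archive.softwareheritage.org/api/1"


def directory_id_for(swhid: str, get_json: Callable[[str], object]) -> str:
    """Resolve a dir or rev SWHID to the id of the directory to fetch."""
    core = swhid.split(";")[0]  # drop qualifiers, if any
    _, _, obj_type, obj_id = core.split(":")
    if obj_type == "dir":
        return obj_id
    if obj_type == "rev":
        # A revision points at exactly one root directory (assumed field).
        return get_json(f"{API}/revision/{obj_id}/")["directory"]
    raise ValueError(f"cannot resolve SWHID type {obj_type!r} to a directory")


def walk_directory(
    dir_id: str, get_json: Callable[[str], object], prefix: str = ""
) -> Iterator[Tuple[str, str]]:
    """Yield (relative path, raw content URL) for every file under dir_id."""
    for entry in get_json(f"{API}/directory/{dir_id}/"):
        path = prefix + entry["name"]
        if entry["type"] == "dir":
            yield from walk_directory(entry["target"], get_json, path + "/")
        elif entry["type"] == "file":
            yield path, f"{API}/content/sha1_git:{entry['target']}/raw/"
        # Other entry types (e.g. submodule references) are skipped here.
```

Walking the tree this way costs one request per directory plus one per file, which is the trade-off against the single but asynchronous vault bundle.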

About delays in archival: yes, they happen. We do our best to keep the lag as small as possible, but we cannot guarantee that a git revision pushed to GitHub will be ingested into the SWH Archive within a given amount of time.

To me, the typical use case is more like this: a user finds the SWHID of a piece of code (a Jupyter notebook) in a scientific paper and wants to try this notebook.

@douardda
Author

I wrote a quick PR to add support for SWHID in repo2docker, see jupyterhub/repo2docker#988


3 participants