
Implement git submodules for specs #182

Open: wants to merge 14 commits from 181-git-submodules into add-github-workflow
Conversation

@anatoly-scherbakov (Contributor) commented Oct 18, 2023

Alleviate the step of manually downloading specifications for tests to use them. Instead, put specifications into a dedicated directory using Git submodules mechanism.

@anatoly-scherbakov (Contributor, Author)

closes #181

@BigBlueHat (Contributor)

@anatoly-scherbakov let's give this a try. Can you also expand the README in this PR to explain what the pre-testing steps would now include (e.g. git submodule init etc.)?

It would be great to simplify testing.

Thanks for exploring this idea!
🎩

@anatoly-scherbakov marked this pull request as ready for review on November 12, 2023 16:37
@anatoly-scherbakov (Contributor, Author)

@BigBlueHat @davidlehn My apologies for the delay; I think I have something to show here now.

@BigBlueHat (Contributor)

Hey @anatoly-scherbakov, sorry for the delay here. I continue to feel a bit squeamish about using git submodules. It's not a pattern we typically use in our other implementations, and it's a rarely used (and oft-derided...) git feature. I'm also concerned about the overhead of keeping the SHAs pinned accurately.

However, I did find that the jsonld.js implementation has a nice approach to this same problem using npm run commands: https://github.com/digitalbazaar/jsonld.js/blob/main/package.json#L96-L101

My thought is that we could do something similar here by providing a simple Python script that would run the git clone --depth 1 commands, and ideally also provide environment variables to use (as the JS implementation does) in case someone already has the test suites on hand but in different locations (as we often do because we work on those also).
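A minimal sketch of such a script (the suite list, file layout, and environment-variable convention here are illustrative assumptions, not something settled in this PR):

```python
"""Hypothetical fetch_test_suites.py; suite names and env vars are assumptions."""
import os
import subprocess
from pathlib import Path

# Illustrative suite list; the real set of repositories may differ.
SUITES = {
    "json-ld-api": "https://github.com/w3c/json-ld-api.git",
}


def suite_dir(name: str, base: Path = Path("test-suites")) -> Path:
    """Resolve a suite location, honoring an env override like JSON_LD_API_TESTS."""
    override = os.environ.get(name.replace("-", "_").upper() + "_TESTS")
    return Path(override) if override else base / name


def fetch(name: str, url: str) -> None:
    """Shallow-clone a suite only if it is not already present (mirrors the npm script)."""
    target = suite_dir(name)
    if not target.exists():
        subprocess.run(["git", "clone", "--depth", "1", url, str(target)], check=True)
```

Calling fetch() for each entry in SUITES would then replicate the jsonld.js npm-script behavior, while the env override covers the "suites already on hand elsewhere" case.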

Would you be up for taking a crack at that?

@anatoly-scherbakov (Contributor, Author)

@BigBlueHat thank you for your review. I understand the concern, but I disagree in certain regards; let me try to elaborate.

Let's analyze the approach taken at jsonld.js.

    "fetch-json-ld-org-test-suite": "if [ ! -e test-suites/json-ld.org ]; then git clone --depth 1 https://github.com/json-ld/json-ld.org.git test-suites/json-ld.org; fi",

What happens if the JSON-LD test suite is updated?

  1. A developer already has a downloaded version of JSON-LD test suite locally.
    • The developer is not notified about the change
    • If they realize that the change has occurred, to refresh the spec version, they have to manually remove the spec directory
    • Different developers will inevitably debug the implementation against different versions of the test suite, which leads to issues potentially complicated to debug
    • It is hard to track which exact version of the test suite one has, and Git doesn't help with that
  2. A developer submits their changes to jsonld-js to CI, and CI fails
    • That's because the spec has been updated and CI downloaded the latest version
    • First of all, the developer needs to realize that this has happened
    • Then, they have to download the latest spec and debug what has changed,
    • Which wasn't in the original scope of their pull request.

I feel these cases can be very annoying and might hurt developer productivity.

Updating the spec version is not tied to an explicit human decision, which goes against normal dependency-management practice in many ecosystems.

For instance, with Python's Poetry, you have to run poetry update to update your dependencies; otherwise, their versions are explicitly pinned in poetry.lock. The same applies to cargo (the Rust dependency manager), to npm (with its package-lock.json), and others.

Git submodules essentially provide the same level of control over the version of dependencies. It is easy to provide a make command which will upgrade dependencies, say

make upgrade-submodules

I'll add it in a moment in another commit. This will upgrade each submodule to its latest version; afterwards, the developer can run tests, make sure everything is 🟢, and submit to CI.
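To illustrate the pinning behavior the comment above relies on, here is a throwaway local demo (assuming a reasonably recent git; all paths and repo names are made up for the example):

```shell
set -e
# Throwaway repos to demonstrate how a superproject pins a submodule commit.
spec=$(mktemp -d) main=$(mktemp -d)

# A stand-in "specification" repository with one commit.
git -C "$spec" init -q -b main
git -C "$spec" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "spec v1"

# The superproject records the spec's current commit as a submodule.
git -C "$main" init -q -b main
git -C "$main" -c protocol.file.allow=always \
    submodule add -q "$spec" specifications/spec
git -C "$main" -c user.email=demo@example.com -c user.name=demo \
    commit -q -m "Pin spec submodule"

# The pinned commit sha is recorded (and versioned) in the superproject:
git -C "$main" submodule status

# After the spec gains a new commit, `git submodule update --remote`
# (what an upgrade-submodules make target would run) moves the pin forward,
# and the superproject sees the submodule as modified until re-committed:
git -C "$spec" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "spec v2"
git -C "$main" -c protocol.file.allow=always \
    submodule update -q --remote --init --recursive specifications/spec
git -C "$main" status --short
```

The point of the demo: the version bump only lands when someone commits the moved gitlink, which is exactly the explicit human decision described above.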

As for the popularity of git submodules: a quick search on GitHub returns a few popular projects relying on them: pytorch, ghc, draw.io, mono, ceph, obs-studio, micropython, qemu, and more.

Let me know what you think!

@anatoly-scherbakov (Contributor, Author)

@BigBlueHat

P.S. To address the concern about testing pyld against a test suite located elsewhere, this can be done by:

python tests/runtests.py ~/projects/json-ld-api

…as documented in README.rst. I don't think git submodules would be an obstacle for local development in this fashion.

@BigBlueHat (Contributor)

@anatoly-scherbakov coming back to this again--sorry for the wait.

The experience of using the submodules was fine--not much different than the experience of using the approach we took in jsonld.js, so I'm fine with it.

@davidlehn unless you're diametrically opposed to this, I'd like to get it merged sooner than later.

@BigBlueHat (Contributor) left a review:
Small tweak which isn't a blocker--and could be done later--so approving.

@anatoly-scherbakov (Contributor, Author)

Thanks for the review @BigBlueHat! I will write up an issue about the Makefile shortly, but I do not believe I have the rights to merge this PR.

This might also be related to my being unable to create a PR in this repository that has another PR as its base.

@@ -0,0 +1,11 @@
[submodule "specifications/json-ld-api"]
Member

These are the specs, but since their sole purpose here is for testing, could this be named test-suites/? That's what the scripts in jsonld.js use.

Contributor (Author)

Thanks for reviewing!

I can rename those. A few things concern me a bit though:

  • Each repository we're checking out contains not just tests but the whole specification, which might be confusing and violate the principle of least surprise;
  • We will likely add more tests to the tests directory which will not be using the specifications, and the existence of both tests and test-suites will be confusing;
  • At some point, we might want to use the specifications for other purposes than tests, for instance, to aid documentation generation, or something.

What do you think?

Contributor

I prefer the specifications/ folder name, as I do hope/expect we'll get more unit tests added to tests/ at some point, and test-suites/ would certainly be confusing.

@@ -0,0 +1,2 @@
upgrade-submodules:
git submodule update --remote --init --recursive
Member

How do you remove all the checked out dirs without leaving git thinking there are unstaged changes? Put another way, how do you undo this command?

Contributor (Author)

If you want to undo the version upgrade of the submodules, you can do:

git reset --hard

If you've modified the submodules locally, after having them upgraded (for debugging purposes perhaps) then you do:

git submodule foreach git reset --hard

]


def default_test_targets() -> List[Path]:
Member

How about changing this so it works something like the jsonld.js test runner: check for the local test suite dir, then for the sibling dir if it isn't found. That allows for alternate workflows.
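A sketch of that fallback (the SUITES list and directory names here are assumptions; the real runtests.py may differ):

```python
from pathlib import Path
from typing import List

# Illustrative suite names, not necessarily what pyld uses.
SUITES = ["json-ld-api", "json-ld-framing"]


def default_test_targets(repo_root: Path) -> List[Path]:
    """Prefer the checked-out submodule; fall back to a sibling checkout."""
    targets = []
    for name in SUITES:
        local = repo_root / "specifications" / name
        sibling = repo_root.parent / name
        # An un-initialized submodule leaves an empty directory behind,
        # so also require the local dir to be non-empty.
        if local.is_dir() and any(local.iterdir()):
            targets.append(local)
        elif sibling.is_dir():
            targets.append(sibling)
    return targets
```

This keeps the submodule workflow as the default while still honoring the shared-sibling-checkout workflow mentioned later in this thread.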

Member

It didn't consider missing test suites an error before. I'm not sure what should happen in that case. It never seemed to be a problem in practice.

Contributor (Author)

Do you believe there is a use case for it? Probably the main goal of turning the specifications into submodules is to automate the connection to them, standardizing it across developers' clones of the pyld repo.

Having the specifications cloned into the root directory of the repo, for instance, would require .gitignore rules (otherwise it's way too easy to commit them, which is likely not what one wants).

We can later add an error message to this function. If the default test locations are used and the tests are missing, then probably something is wrong; for instance, the developer forgot to check out the submodules. Their tests will be incorrectly green locally, and then, after a push, the failing CI will be an unpleasant surprise.

@davidlehn (Member)

If this PR is going to go all-in, then consider updating docs to handle what #157 was doing to address #156.

@davidlehn (Member)

  • I don't necessarily disagree with the reasoning to use submodules. As a counterpoint, pyld and jsonld.js are over a decade old and have gotten by without using that feature. It's a matter of tradeoffs between using different commands and dealing with different problems. I'm not sure what the best approach is. Certainly better docs and tooling would help with whatever approach is used.
  • Adding this feature forces the maintainers to learn to use it. Help us (me) out with a few more snippets of docs perhaps in the README and/or CONTRIBUTING. Maybe even more script/makefile targets to run. Specifically, what do we (I) need to type to keep the test suites synced up all the time.
  • I have concerns about forgetting to update the test suite repo refs and not noticing that newly added tests are failing. I can imagine complex processes to address this, but what is the theory here? I guess the idea is to trade off possibly running with updated tests unintentionally against having outdated tests unintentionally.
  • I put in some comments, partly directed towards making the former workflow possible with minimal hassle. I found that worked well for me. Especially when I was working on js and python at the same time and explicitly wanted to use shared sibling test suites. That's a particular use case for sure, but a real one.
  • I suppose I'm willing to try this change if pressed. To me it seems either method is fine for any potential contributors. There is almost no input here, so I'm not sure others feel strongly at all.

@davidlehn (Member)

@anatoly-scherbakov Could you also change the commit message with the fancy "+", and the one with ellipses, arrow, and math symbol, to use simple ASCII? It's nicer to use chars people can easily type. Feel free to argue. :-)

@anatoly-scherbakov force-pushed the 181-git-submodules branch 2 times, most recently from 5798f0b to 440a251, on February 6, 2024 13:36
@anatoly-scherbakov (Contributor, Author)

@davidlehn thanks again for the review!

I have applied a few requested changes.

Help us (me) out with a few more snippets of docs perhaps in the README and/or CONTRIBUTING. Maybe even more script/makefile targets to run.

Yes! I feel that less documentation and more automation is for the better :)

I have concerns about forgetting to update the test suite repo refs and not noticing newly added tests are failing.

Indeed, this is the other side of the coin. Most dependency management tools (poetry, npm, ...) seem to be okay with that, staying on the safe side to make builds predictable.

Tools like Dependabot, which post notifications about out-of-date dependencies and even create PRs automatically, are often leveraged. I think I can easily imagine a rather simple scheduled GitHub actions workflow that will update the specs submodules every week, run the tests, and create a PR with that upgrade. Do you think we should create an issue for that?

I put in some comments, partly directed towards making the former workflow possible with minimal hassle. I found that worked well for me. Especially when I was working on js and python at the same time and explicitly wanted to use shared sibling test suites. That's a particular use case for sure, but a real one.

I see. The runtests.py script allows specifying the path to the specification. Say you have a development version of a spec somewhere on your local disk; you can easily run the tests against that version using this command.

Could you also change the commit message with the fancy "+" and one with ellipses, arrow, and math symbol to use simple ASCII? Nicer to use chars people can easily type. Feel free to argue. :-)

No argument here: I just like those fancy Unicode characters, especially putting them into commit messages 🙂, and have set up my Linux to easily type them, but this is purely a matter of taste. Removed ✅

@BigBlueHat modified the milestones: v2.0.4, v2.0.5 on Feb 6, 2024
@BigBlueHat (Contributor)

@davidlehn the key thing I'm beginning to appreciate about the submodule approach is that it provides git-stored evidence of which tests were run--based on the submodule SHAs pinned in the repository. We don't have that now if developers typically just grab the latest revision--or their own fork.

It should also provide a means to send updates to the submodule repos--though the porcelain for that feels more like plumbing, so some more commands (as you note) would be helpful.

@BigBlueHat (Contributor)

@anatoly-scherbakov can you take rdf-canon out and go back to the old tests? We should do the rdf-canon change in a separate PR...and I believe it'll need more code integration so as not to be completely skipped.

@anatoly-scherbakov (Contributor, Author)

@BigBlueHat the previous address was https://github.com/json-ld/normalization/tests. Right now, https://github.com/json-ld/normalization redirects to https://github.com/w3c-ccg/rdf-dataset-canonicalization. Is that the link we should use?

@anatoly-scherbakov changed the base branch from master to add-github-workflow on February 6, 2024 17:56
@anatoly-scherbakov (Contributor, Author)

I've rebased this branch on top of #185.

@BigBlueHat (Contributor)

@BigBlueHat the previous address was https://github.com/json-ld/normalization/tests. Right now, https://github.com/json-ld/normalization redirects to https://github.com/w3c-ccg/rdf-dataset-canonicalization. Is that the link we should use?

This should be the correct location of the old tests suites:
https://github.com/w3c-ccg/rdf-dataset-canonicalization/tree/main/tests

So, it still needs an updated URL in this (or another) PR. The rdf-canon ones, though, introduce a whole new host of concerns.

Sorry I didn't see that earlier!

@anatoly-scherbakov (Contributor, Author)

  • Added rdf-dataset-canonicalization spec submodule,
  • Reconfigured the tests to use it and to ignore rdf-canon submodule,
  • Added make test command.

Should we replace rdf-canon references everywhere in the docs as well, given that we're going to get them back anyway later?

@BigBlueHat (Contributor)

  • Added rdf-dataset-canonicalization spec submodule,
  • Reconfigured the tests to use it and to ignore rdf-canon submodule,
  • Added make test command.

Should we replace rdf-canon references everywhere in the docs as well, given that we're going to get them back anyway later?

Yeah...I think we need to add rdf-canon stuff as a separate PR. Otherwise, we're pulling all this code and content down, and then not using it at all...which would be very confusing.
