metadata-only resolve with `pip download --dry-run --report`! #10748

cosmicexplorer · 2021-12-25T08:29:13Z

Hello! This is the beginning of an attempt to follow up on the gratuitous speedup promised in #7819. While I still need to spend more time on parallelizing downloads in another branch, I'd like to propose this change first as it's totally orthogonal to downloading files!

Problem

With this change, users can use pip's powerful new resolver to extract metadata from a package resolution, such as what dependencies each package requires. This is currently not possible in pip or any other tool I am aware of.

Getting more specific, this result can then be processed by external tools to:

generate a lockfile.
check whether transitive dependencies would introduce a conflict.
download packages directly from download URLs without having to traverse PyPI again.

#53 has a discussion in greater depth on this proposed feature.

Solution

Add a pip download --report flag which can produce a JSON output file to examine the relationships between dependencies during pip download. This PR also implements PEP 658 support, allowing pip download --report to test PEP 658 compliance.

cosmicexplorer · 2021-12-25T08:30:58Z

I think this needs:

test cases
bikeshedding over the flag names / whether two separate flags are appropriate (I think so)

uranusjr · 2021-12-25T12:21:04Z

This should use PEP 658 unless there are explicit reasons otherwise.

pradyunsg · 2021-12-25T13:24:07Z

We can also utilise that in a follow up, which adds support for PEP 658 across the board (i.e. allowing regular installations to also benefit from that). I don't think that PEP's use should be a blocking concern for this PR -- but it's definitely a really good idea to also utilise that.

cosmicexplorer · 2021-12-25T17:09:54Z

@uranusjr: absolutely will do! I assumed there was prior work on how to frame this change and am ready to conform to it. Looking that up right now. Thank you @pradyunsg for help on framing that.

cosmicexplorer · 2021-12-25T17:16:42Z

Ok, remembering the content of that PEP and checking out https://www.python.org/dev/peps/pep-0658/#specification -- @uranusjr I'm thinking that it might be appropriate to:

Determine whether the data-dist-info-metadata attribute is present.
If so, append .metadata to the download URL to retrieve the metadata file and add that to the info printed out by --print-download-urls.
If the hash of the metadata file is provided, add that to the info that is printed out by --print-download-urls.

Separately, I've realized I shouldn't be keying the output of --print-download-urls by req.name, but rather by the stringification of the req.

@pradyunsg does that sound like a reasonable approach? I plan to stick with this work until done, so I can promise to do that in a followup as well if appropriate.

cosmicexplorer · 2021-12-26T10:37:31Z

@uranusjr: made a ton of progress on PEP 658 support and I think it's feasible to include in this PR! There are a few hacks I made in the most recent commit, so I just converted this to a draft PR.

I don't believe too many changes should be necessary to remove the hacks, and I think in fact that we will actually end up converting some of the existing Link attribute parsing to a more correct form in the process! Will ping again when I think this is ready for review.

cosmicexplorer · 2021-12-28T14:20:21Z

Ok, I've marked this PR as ready for review again. I think the bits that:

manage mapping Requirements to Links, and
implement PEP 658 support

are pretty readable and don't modify the surrounding code too much.

You'll note I moved some functions from collector.py to link.py, since I was able to move a lot of the Link parsing into the Link.from_element() function, which is then called exactly once from collector.py. I also found that a method _clean_link() from collector.py was only used in testing, so I moved it into test_collector.py. Please let me know if that seems reasonable.

The two issues I'd like some advice on relate to testing:

test_from_link_vcs_with_source_dir_obtains_commit_id(), test_from_link_vcs_without_source_dir(), and test_should_cache_git_sha() are failing in CI on multiple python versions including 3.10, which I can't reproduce locally with nox -s test-3.10. I will try using an ubuntu docker image to repro the issue next--is there anything else I should try?
It's not clear to me how to add tests for the final user-facing change made by this PR, namely the actual flags --avoid-wheel-downloads and --print-download-urls, since there is currently no testing for the download command. Is there any command test that checks files output by pip that I can use as a reference, or should I try to write that part myself? I am totally ready to do so if necessary.

Thanks for any input! I will try to address the two issues I mentioned above presently.

pfmoore · 2021-12-28T14:50:11Z

Can the types of situation where this PR would be useful be summarised in a general way, without reference to specific examples? As someone who's never needed anything like this, I'm a bit unclear when (and how) someone would find this useful. What I'd like to be clear about is how we evaluate the benefit of a change like this, which has a visible cost in terms of (code and maintenance) complexity.

I'm also concerned about the user interface here. Having a pip download --but-dont-actually-download command seems conceptually confusing, and gives me the sense that the code is being added wherever it can be made to fit, rather than trying to achieve an intuitive and understandable user interface. (Note that a lot of pip has been developed this way, so I'm not suggesting this PR is uniquely bad in this regard, but I do think that at some point we need to stop and re-think how we integrate new features like this into our UI).

cosmicexplorer · 2021-12-28T15:21:16Z

@pfmoore:

Can the types of situation where this PR would be useful be summarised in a general way, without reference to specific examples? As someone who's never needed anything like this, I'm a bit unclear when (and how) someone would find this useful. What I'd like to be clear about is how we evaluate the benefit of a change like this, which has a visible cost in terms of (code and maintenance) complexity.

With this change, users can use pip's powerful new resolver to extract metadata from a package resolution, such as what dependencies each package requires. This is currently not possible in pip or any other tool I am aware of.

Getting more specific, this result can then be processed by external tools to:

generate a lockfile.
check whether transitive dependencies would introduce a conflict.
download packages directly from download URLs without having to traverse PyPI again.

#53 has a discussion in greater depth on this proposed feature.

I'm also concerned about the user interface here. Having a pip download --but-dont-actually-download command seems conceptually confusing, and gives me the sense that the code is being added wherever it can be made to fit, rather than trying to achieve an intuitive and understandable user interface.

This totally makes sense. I think shoving this into download was a hack. In #53 (comment) I propose moving this functionality into a new command pip resolve, and after some work cleaning up the implementation here, it's actually pretty easy to see how I would implement it in a new command. I think a strong argument for this is that --avoid-wheel-downloads without --print-download-urls doesn't seem to make any sense.

So I think if I move this into a new pip resolve command, that would avoid introducing confusing new flags, and probably require no changes to any other files except download.py.

What are your thoughts on that approach?

pradyunsg · 2021-12-28T15:28:53Z

That sounds like something to put into #53's resolve command; which would be what I'd prefer.

pradyunsg · 2021-12-28T15:30:41Z

The new command would need:

code changes (register the command, implement it in a new resolve.py file, and test it to the extent that you want.
documentation pages added for it (html and man)
more?

cosmicexplorer · 2021-12-28T16:01:53Z

@pradyunsg:

register the command, implement it in a new resolve.py file

This was just done in c78bb3d!

test it to the extent that you want.

documentation pages added for it (html and man)

more?

I will look into doing the above (as well as covering other goals of #53's resolve command that I may have missed) presently!

pfmoore · 2021-12-28T16:47:11Z

Thanks, @cosmicexplorer that was exactly the explanation I needed. I agree this would be good, and I like @pradyunsg's suggestion that this should go under pip resolve.

Personally, I wish that functionality like this could be separated out into more modular tools, because I don't see this as being a core piece of pip functionality, rather it's a piece of functionality that needs to be in pip just to re-use chunks of pip's machinery. But that's a whole other discussion, and not for now.

sbidoul · 2022-05-02T08:19:16Z

Thanks for your feedback @groodt ! I certainly hope this does not come across as any kind of competition.
My PR started merely as a way to illustrate my review comments - which I hope were not too heavy.

Now if I get to continue this work (which I'm happy to, as I sincerely believe this is going to be very valuable for the community), I'll definitely reuse part of the very valuable work done here (with proper credits of course).

There might also be use cases of @cosmicexplorer or others that the slightly different format I propose may not address, and I'm eager to get feedback about that too.

- create LinkWithSource to retain attrs from InstallRequirement - add tests for report output for top-level requirements - add tests for more of the report JSON format - add passing tests for JSON report output including PEP 658! - add docstrings to several classes and functions, including tests! - move the --report implementation into resolvelib - use an abstract base class instead of a Union for InfoType - use frozen dataclasses for InfoType subclasses

these two objects are intended to be fungible, but play different roles, since LinkHash actually parses its input and ArchiveInfo does not. ArchiveInfo's JSON (de)?serialization does not employ hash name or value validation, while LinkHash does not offer a JSON serialization.

cosmicexplorer · 2022-05-10T07:12:31Z

Thanks so much to @ofek for pinging me on twitter about this!! I absolutely have the time to get this mergeable in the next few weeks. I just fixed the merge conflicts and tests are passing again locally, so I will now review the thoughtful proposals from others in this thread.

cosmicexplorer · 2022-05-11T07:49:58Z

@sbidoul:

At first sight, I'd say direct_url.py and ArchiveInfo in particular can be extended to do hash validation if needed.
...
As a general note, if the report composes existing standards as much as possible (such as PEP 610, PEP 566 json metadata), it will be easier to consume downstream.

I'll look to do this. I believe that some of the reason for my decision to add functionality to link.py was to avoid an import cycle, and I think the other part of it was just wanting to get a working implementation of PEP 658 before changing something else. However, since I'm thinking of basing this PR off of #10771 (see below), we may be able to avoid this conundrum entirely.

As I mentioned above, I think it will be easier to review if you could at some point split this PR in smaller ones each addressing a distinct concern (i.e. the report, the dry-run part, PEP 658) which are mostly orthogonal to each other. Thinking about them independently may also help achieving the best possible design.

This seems absolutely correct, and I was going to immediately agree to it, before realizing that there is another possible split we could do in this work which relies on #10771 (where we would split out dist-info-metadata extraction instead). See below.

@groodt:

After carefully reading the comments from @cosmicexplorer @sbidoul and @pfmoore I think moving this behaviour to install makes the most sense.

I really appreciate your thoughtful comment. I now agree with both of you, and have acknowledged this in #10771 (comment). The below section came from thinking about how to reformulate this PR to augment the much smaller and nicer implementation in #10771, instead of being in conflict.

Comparison of Approaches to `--dry-run`

In that regard, have you had a chance to look at #10771? This is just an experiment but I can't help thinking the implementation approach is simpler. But I also can't help thinking I'm missing something with that approach.

I would absolutely prefer the much smaller implementation in that PR (which wouldn't have caused (minor) merge conflicts as in this one, for example). When I looked it over to see if you were missing anything, I realized there just seems to be a difference between our conception of --dry-run, which might motivate using both of our solutions:

My goals from A pip resolve command to convert to transitive == requirements very fast by scanning wheels for static dependency info (WORKING PROTOTYPE!) #7819 and for pip download --dry-run were to avoid downloading any wheels at all when generating the report with --report, allowing for the maximum effect of --use-feature=fast-deps.
- See extend fast-deps to sdists (and allow downloading foreign sdists) (working prototype!) #8929 for the extension to sdists, which would be a bit of a hack (like LazyZipOverHTTP) that PEP 658 would also allow us to avoid.
The benefits of pip install --report from Installation/resolution report (aka pip install --dry-run --report) #10771 are that the implementation is extremely small, because it can rely on having an actual dist to extract metadata from, and therefore doesn't need to intervene in the actual resolution process.
- It seems that if people are looking to rely on this report for e.g. generating lockfiles, that my goal of ensuring pip doesn't download any dists might be orthogonal to that. In particular, the meaning of --dry-run in Installation/resolution report (aka pip install --dry-run --report) #10771 refers to not actually modifying site-packages/, instead of not actually downloading anything.
- I also believe that Installation/resolution report (aka pip install --dry-run --report) #10771 may require some/a lot of the changes in this PR if it were to support PEP 658 directly, but it may not actually need to bother with that, since it relies upon dists definitely having been downloaded, so it can just extract the metadata that way instead of looking for dist-info-metadata. This corresponds to your suggestion in metadata-only resolve with pip download --dry-run --report! #10748 (comment) to remove dist_info_metadata from the report.

Proposal: first `install --report`, then make `--dry-run` faster

So my proposal here is:

Switch our focus to use Installation/resolution report (aka pip install --dry-run --report) #10771 as the PR that initially implements the --report json output, keeping the --dry-run and --ignore-installed options. This means we will have a PR that is much easier to review for maintainers, and which will expose most of the same functionality for end users as this one.
- The InstallationReport class and msg_to_json() method in Installation/resolution report (aka pip install --dry-run --report) #10771 are implemented more nicely than the equivalent in this PR too.
Make a new PR that moves dist-info-metadata extraction into the implementation of --use-feature=fast-deps. PEP 658 support would then be totally orthogonal to the resolution report.
- We would use dist-info-metadata whenever available, and fall back to LazyZipOverHTTP otherwise.
Make this PR branch off of Installation/resolution report (aka pip install --dry-run --report) #10771, and use the code in this PR to make pip install --dry-run avoid downloading any dists.

@uranusjr: is it appropriate to consider dist-info-metadata purely a way to speed up resolves? Or is there a reason we'd want to use it outside of --use-feature=fast-deps? It seems like the strategy of relying on PEP 658, but falling back to a heuristic like LazyZipOverHTTP or #8929 if not available, would mean --use-feature=fast-deps could probably be stabilized sooner.

cosmicexplorer · 2022-05-11T07:54:45Z

So to be clear on next steps after that lengthy comment, I'm planning to:

split out PEP 658 / dist-info-metadata support into a new PR, which uses it to implement --use-feature=fast-deps.
rebase this PR off of Installation/resolution report (aka pip install --dry-run --report) #10771, so that pip install --dry-run avoids downloading any dists, and pip install --report also includes download URLs along with each distribution's metadata.

sbidoul · 2022-05-11T09:39:23Z

Thanks for your detailed feedback @cosmicexplorer. It looks like we have a path forward !

Regarding PEP 658 I totally agree it can and should be implemented independently. In my mind it is an optimization of the metadata preparation step when the index provides dist-info-metadata, and then delay the actual downloading of the sdist or wheel. That makes me think it will imply some refactoring of the download process (some kind of "download on demand" mechanism, not sure).

cosmicexplorer · 2022-05-11T21:37:21Z

That makes me think it will imply some refactoring of the download process (some kind of "download on demand" mechanism, not sure).

It turns out that this refactoring was already done for the purpose of --use-feature=fast-deps by @McSinyx! I still want to implement parallel downloads on top of that in a separate change soon.

pradyunsg · 2022-09-30T13:26:14Z

@cosmicexplorer What's the next piece here, with #11111 resolved? IIUC, we need to now make --dry-run faster?

hrfuller · 2022-10-04T21:45:24Z

I was talking to @jjhelmus about a way to make --dry-run faster...apparently he discovered that pip install --dry-run --report downloads wheels after the pep658 resolve to construct the report. I think he may have a patch to generate the report from the resolve metadata only.

cosmicexplorer · 2022-10-05T17:38:13Z

I think he may have a patch to generate the report from the resolve metadata only.

@jjhelmus @hrfuller I would love to see that patch! That's definitely what I wanted to accomplish here.

What's the next piece here, with #11111 resolved? IIUC, we need to now make --dry-run faster?

@pradyunsg Sorry for writing such a sprawling comment above! Yes, I think that would be the way to finish up this workstream.

I suspect we may find a speedup if we can implement some form of pipelining, where we pull down wheel metadata or execute setup.py for an sdist in parallel with the resolution. I can vaguely see a way to make this work nicely with coroutines, but that might end up being a big project.

I think a conclusive win would be to avoid downloading wheels as well with --dry-run, and we can think about bigger changes separately from that.

jjhelmus · 2022-10-13T20:27:06Z

As @hrfuller mentioned I have been exploring metadata only resolves. What I found was that with the changes from #11111 metadata only resolves are possible but at the end of the resolve the prepare_linked_requirements_more method triggers a download of wheels for the resolved distribution. This allows the existing logic to report information (names, version, metadata, and download URLs and hashes) about the resolved distributions.

With the changes in #11512, this information is provided by the .metadata files directly and no wheels need to be downloaded.

Thanks for your work on this @cosmicexplorer . I was really excited when I discovered that this was possible.

sbidoul · 2023-01-22T12:40:16Z

I this we can close this, as it is covered by pip install --dry-run --report? Let me know if not.

cosmicexplorer force-pushed the metadata-only-resolve branch from 401ba1d to 6360d8d Compare December 25, 2021 08:29

cosmicexplorer marked this pull request as draft December 26, 2021 10:35

cosmicexplorer force-pushed the metadata-only-resolve branch 12 times, most recently from 72d8357 to 8c755e8 Compare December 28, 2021 13:01

cosmicexplorer changed the title ~~Metadata only resolve~~ metadata-only resolve with --print-download-urls and --avoid-wheel-downloads Dec 28, 2021

cosmicexplorer marked this pull request as ready for review December 28, 2021 14:09

cosmicexplorer force-pushed the metadata-only-resolve branch from 3765b8d to c78bb3d Compare December 28, 2021 16:02

sbidoul mentioned this pull request May 8, 2022

Add --dry-run option to pip install #11096

Merged

implement --dry-run

39ccd3f

cosmicexplorer force-pushed the metadata-only-resolve branch from 7cb145a to 18bf0f8 Compare May 10, 2022 07:03

pypa-bot removed the needs rebase or merge PR has conflicts with current master label May 10, 2022

cosmicexplorer force-pushed the metadata-only-resolve branch from 18bf0f8 to dc36f0e Compare May 10, 2022 07:04

cosmicexplorer added 3 commits May 10, 2022 03:06

Add the JSON report to html/man documentation and add NEWS entry!

76520a5

cosmicexplorer force-pushed the metadata-only-resolve branch from dc36f0e to c068eb5 Compare May 10, 2022 07:06

cosmicexplorer mentioned this pull request May 11, 2022

Use data-dist-info-metadata (PEP 658) to decouple resolution from downloading #11111

Merged

groodt mentioned this pull request May 14, 2022

Init: pypa/installer bazelbuild/rules_python#700

Merged

12 tasks

github-actions bot added the needs rebase or merge PR has conflicts with current master label Jun 1, 2022

trevormorgan mentioned this pull request Jun 8, 2022

Modernize setup and build infrastructure QCoDeS/broadbean#109

Merged

5 tasks

mherrmann3 mentioned this pull request Jun 10, 2022

Dealing with dependencies in pyCSEP SCECcode/pycsep#192

Open

casassg mentioned this pull request Oct 5, 2022

Remove the experimental fast-deps feature #11478

Open

sbidoul closed this Jan 22, 2023

This was referenced Jan 25, 2023

Feature request: Support dependency hashing pypa/hatch#716

Open

Don't deepfreeze on each environmental action sbidoul/hatch-pip-deepfreeze#2

Open

github-actions bot locked as resolved and limited conversation to collaborators Feb 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metadata-only resolve with `pip download --dry-run --report`! #10748

metadata-only resolve with `pip download --dry-run --report`! #10748

cosmicexplorer commented Dec 25, 2021 •

edited

cosmicexplorer commented Dec 25, 2021

uranusjr commented Dec 25, 2021

pradyunsg commented Dec 25, 2021 •

edited

cosmicexplorer commented Dec 25, 2021

cosmicexplorer commented Dec 25, 2021

cosmicexplorer commented Dec 26, 2021

cosmicexplorer commented Dec 28, 2021 •

edited

pfmoore commented Dec 28, 2021

cosmicexplorer commented Dec 28, 2021 •

edited

pradyunsg commented Dec 28, 2021 •

edited

pradyunsg commented Dec 28, 2021

cosmicexplorer commented Dec 28, 2021 •

edited

pfmoore commented Dec 28, 2021

sbidoul commented May 2, 2022 •

edited

cosmicexplorer commented May 10, 2022

cosmicexplorer commented May 11, 2022

cosmicexplorer commented May 11, 2022

sbidoul commented May 11, 2022

cosmicexplorer commented May 11, 2022

pradyunsg commented Sep 30, 2022

hrfuller commented Oct 4, 2022

cosmicexplorer commented Oct 5, 2022

jjhelmus commented Oct 13, 2022

sbidoul commented Jan 22, 2023

metadata-only resolve with pip download --dry-run --report! #10748

metadata-only resolve with pip download --dry-run --report! #10748

Conversation

cosmicexplorer commented Dec 25, 2021 • edited

Problem

Solution

cosmicexplorer commented Dec 25, 2021

uranusjr commented Dec 25, 2021

pradyunsg commented Dec 25, 2021 • edited

cosmicexplorer commented Dec 25, 2021

cosmicexplorer commented Dec 25, 2021

cosmicexplorer commented Dec 26, 2021

cosmicexplorer commented Dec 28, 2021 • edited

pfmoore commented Dec 28, 2021

cosmicexplorer commented Dec 28, 2021 • edited

pradyunsg commented Dec 28, 2021 • edited

pradyunsg commented Dec 28, 2021

cosmicexplorer commented Dec 28, 2021 • edited

pfmoore commented Dec 28, 2021

sbidoul commented May 2, 2022 • edited

cosmicexplorer commented May 10, 2022

cosmicexplorer commented May 11, 2022

Comparison of Approaches to --dry-run

Proposal: first install --report, then make --dry-run faster

cosmicexplorer commented May 11, 2022

sbidoul commented May 11, 2022

cosmicexplorer commented May 11, 2022

pradyunsg commented Sep 30, 2022

hrfuller commented Oct 4, 2022

cosmicexplorer commented Oct 5, 2022

jjhelmus commented Oct 13, 2022

sbidoul commented Jan 22, 2023

metadata-only resolve with `pip download --dry-run --report`! #10748

metadata-only resolve with `pip download --dry-run --report`! #10748

cosmicexplorer commented Dec 25, 2021 •

edited

pradyunsg commented Dec 25, 2021 •

edited

cosmicexplorer commented Dec 28, 2021 •

edited

cosmicexplorer commented Dec 28, 2021 •

edited

pradyunsg commented Dec 28, 2021 •

edited

cosmicexplorer commented Dec 28, 2021 •

edited

sbidoul commented May 2, 2022 •

edited

Comparison of Approaches to `--dry-run`

Proposal: first `install --report`, then make `--dry-run` faster