
pull preparer logic out of the resolver to consume metadata-only dists in commands #12186

Open
wants to merge 7 commits into main from metadata_only_resolve_no_whl

Conversation

cosmicexplorer
Contributor

@cosmicexplorer commented Jul 28, 2023

Continuation of #11512, now with test cases and a lot of comments.

Problem

The use case proposed in #53 uses pip just to execute its resolution logic, in order to generate e.g. a lockfile, without pulling down any dists (which may be quite large). We eventually arrived at pip install --report --dry-run as the way to enable this. However, the workstream to improve pip performance has been somewhat fractured and distributed across multiple efforts (as often occurs in open source projects), and by the time PyPI had managed to enable PEP 658 metadata, it turned out we had failed to implement the caching behavior we wanted, as described in #11512.

Solution

  1. Avoid finalizing the requirement preparer at all in the resolver itself, and require each command to explicitly instruct the requirement preparer to download anything further.
  2. Rename several methods of the requirement preparer in order to clarify their purpose.
    • In particular, remove InstallRequirement#needs_more_preparation, and introduce the two methods .cache_concrete_dist() and .cache_virtual_metadata_only_dist() to manage a Distribution object attached to the InstallRequirement object itself (this iterates on .dist_from_metadata from "use .metadata distribution info when possible" #11512); see the sketch after this list.
    • Execute .cache_concrete_dist() from within the AbstractDistribution#get_metadata_distribution() implementations to centralize the single place where we instantiate a dist.
  3. Move the test_download_metadata() tests into a new file test_install_metadata.py, because the metadata tests were supposed to be done against the install command anyway.
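
A minimal sketch (not the actual diff) of how the two caching methods described above might hang together on InstallRequirement; the is_concrete flag comes from the commit list further down, and everything else here is illustrative:

```python
# Rough sketch only: the real InstallRequirement lives in
# pip._internal.req.req_install and carries far more state. Only the method
# names cache_concrete_dist()/cache_virtual_metadata_only_dist() and the
# is_concrete flag come from this PR; the rest is illustrative.
from typing import Optional


class BaseDistribution:  # stand-in for pip._internal.metadata.BaseDistribution
    pass


class InstallRequirement:
    def __init__(self) -> None:
        # The single Distribution object attached to the requirement.
        self._dist: Optional[BaseDistribution] = None
        # True once the dist is backed by an actual artifact on disk.
        self.is_concrete: bool = False

    def cache_virtual_metadata_only_dist(self, dist: BaseDistribution) -> None:
        """Record a dist synthesized from a PEP 658 .metadata file (or fast-deps).

        No wheel or sdist has been downloaded yet, so the dist is "virtual".
        """
        self._dist = dist
        self.is_concrete = False

    def cache_concrete_dist(self, dist: BaseDistribution) -> None:
        """Record a dist backed by a real artifact on the local disk.

        Intended to be called from AbstractDistribution.get_metadata_distribution()
        implementations, the single place where a dist is instantiated.
        """
        self._dist = dist
        self.is_concrete = True
```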

@cosmicexplorer
Contributor Author

cc @pfmoore @uranusjr @jjhelmus @ddelange

@cosmicexplorer force-pushed the metadata_only_resolve_no_whl branch 2 times, most recently from 720b0c2 to 6f4da4c on July 28, 2023 02:55
@cosmicexplorer
Contributor Author

Question for @pradyunsg, @sbidoul, and anyone else: as I note in the OP, this PR is less of a "bugfix", since we had not agreed on the exact behavior to commit to, and can almost be viewed as new functionality: we are now augmenting the guarantee of install --dry-run to include the new behavior of avoiding downloading any dists at all when metadata is available. Is that a useful framing, or is it easier to keep this as a bugfix in the NEWS entry?

@uranusjr
Member

Question, would this be easier if we remove fast-deps first?

@cosmicexplorer
Contributor Author

cosmicexplorer commented Jul 28, 2023

I don't believe so. As I discussed at length in #11478 (comment), the new logic we added to implement fast-deps (especially the changes to the requirement preparer to enable metadata-only dists) is still necessary to make PEP 658 support work at all. At this point, removing fast-deps itself is actually rather easy, since we've now compartmentalized it into just one of multiple ways to construct a requirement without a backing dist on the local disk.

The part that has proven error-prone, imo, is the intrinsic difficulty of retrofitting these "virtual" dists into a codebase that wasn't architected for lazy computation (because it didn't need to be). I believe this change improves that situation by rearranging the methods of the requirement preparer to make it very explicit where the "virtual" dists are finally hydrated into real installed dists. I am aware the diff looks quite large, but the vast majority of that comes from transplanting the metadata tests to use install instead of download, which was the biggest source of confusion surrounding these changes.

Personally, I think that fast-deps is still worth keeping at this point, because from what I've seen, the error-prone code here has been entirely within the requirement preparer, having nothing to do with the zip parsing. Right now I see fast-deps as an alternate way to test our code paths for "virtual" (metadata-only) dists outside of the specifics of PEP 658, and I think pip will benefit from hammering away at the "virtual" dist code paths in the future (see #12184).

...however, the fantastic work done to decouple the actual zip operations from the requirement preparer also means that it wouldn't be difficult to remove fast-deps entirely. I am just absolutely positive that fast-deps is orthogonal to the difficulty I have seen with implementing "virtual" dists, and I think that difficulty is either intrinsic to the problem of efficiently resolving dependencies or just normal tech debt we've been paying back over time.

I also believe the "virtual" dists workstream has been made more difficult than usual because it has inspired a lot of excitement about improving pip, so pip maintainers have had to piece together work written by many different people over time. I created this as a separate PR because I recognized there were several things to fix at once in the PR it was based on, and I thought it would be easier for a single person with more context on the issue (me) to take ownership.

@cosmicexplorer force-pushed the metadata_only_resolve_no_whl branch 2 times, most recently from 240414f to b9d851c on July 30, 2023 18:17
@pradyunsg
Member

Did a quick skim over the PR, as I'm wrapping up to head to bed -- not taking a closer look yet since CI isn't particularly happy with this right now.

One thing about fast-deps -- there seems to be some low-hanging fruit there, based on the final few things stated in #8670 that I've not looked into.

@uranusjr
Member

There are likely some details this isn't getting correct yet, but the general direction looks right to me. Also, thanks for the "finalize" rename; that needed to be done. Should we also rename the needs_more_preparation flag?

@cosmicexplorer force-pushed the metadata_only_resolve_no_whl branch 9 times, most recently from 21b77fc to d7d15be on July 31, 2023 11:10
Member

@sbidoul left a comment

I think @pradyunsg addressed the feeling I had but could not find the time to write down properly.

Another thing I have in mind in this area is the possibility of working with partial metadata (name, version, dependencies, optional dependencies) when it is declared statically in pyproject.toml or in PKG-INFO 2.2 in an sdist.

This should not be part of this PR, but maybe having this in mind could influence or orient the refactoring being done here.
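
To make the idea concrete, here is a small, hypothetical sketch (standard library only, not pip code) of how static metadata in an sdist's PKG-INFO could be recognized under PEP 643 / Metadata-Version 2.2, where any field not listed in Dynamic may be trusted as-is:

```python
# Hypothetical helper, not pip's implementation: decide whether an sdist's
# PKG-INFO declares enough static metadata to resolve against without
# building the sdist (PEP 643 / Metadata-Version 2.2).
from email.parser import Parser
from typing import Optional


def static_dependency_info(pkg_info_text: str) -> Optional[dict]:
    msg = Parser().parsestr(pkg_info_text)
    try:
        major, minor = (int(p) for p in msg.get("Metadata-Version", "").split(".")[:2])
    except ValueError:
        return None
    if (major, minor) < (2, 2):
        return None  # pre-2.2 metadata has no static/dynamic distinction
    dynamic = {field.lower() for field in msg.get_all("Dynamic", [])}
    if "requires-dist" in dynamic:
        return None  # dependencies may still change at build time
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        "requires_dist": msg.get_all("Requires-Dist", []),
        "provides_extra": msg.get_all("Provides-Extra", []),
    }
```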

src/pip/_internal/commands/install.py (outdated, resolved)
is called for requirements already found in the wheel cache, so we need to
synthesize it for uncached results. Luckily, a DirectUrl can be parsed directly
from a url without any other context. However, this also means the download info
will only contain a hash if the link itself declares the hash.
Member

This is the change I was calling out in another comment. I don't know if it can be considered a real breaking change, but it's worth thinking about.

Dry-run without actually downloading the wheels is definitely important. I'm wondering, though, if there are use cases for a dry-run where the packages are downloaded (and cached) and hashes (possibly with different algorithms) are computed and reported. For users wanting a "lock then install" workflow, this may be desirable.

Contributor Author

I think you're absolutely right! In fact, the functionality you describe is probably exactly what I would need for use in pex e.g. pex-tool/pex#2210 (because pex's pipelined installation process currently invokes pip download).

Let's say we add another flag --ensure-downloaded-hashes[=sha256,sha512]?

However, thinking about this use case also gave me another idea:

For users wanting a "lock then install" workflow, this may be desirable.

I would love for pip to support this more directly. I'm wondering whether you would accept a separate PR to add:

  • pip install --ensure-downloaded-hashes[=sha256,sha512] (used with --report to ensure .download_info is populated)
  • pip install --from-report <file.json> (can be used with --require-hashes)

I am not sure how most users are currently processing --report output, though. I'm also not sure whether there's another lockfile format we would want to convert to instead of using the --report output as a lockfile directly. If so, we could maybe put it somewhere like pip freeze --format=[requirements,lockfile] --from-report <file.json>?

I really like the idea of explicitly supporting this "lock then install" workflow, and I would like to know your thoughts on whether pip could do that. Please also let me know if this workflow is already supported. cc @uranusjr @pradyunsg @pfmoore

Member

I really like the idea of explicitly supporting this "lock then install" workflow, and I would like to know your thoughts on whether pip could do that. Please also let me know if this workflow is already supported.

I don't think it's something pip needs to support directly, but it's something we can provide components for. Lock-then-install workflows are available in tools like Poetry and PDM, and I don't think we need to compete with them in this area.

pip install --from-report <file.json> (can be used with --require-hashes)

-1 on this. Processing the JSON to produce a requirements file (for now) or a standard lockfile (once such a format gets standardised) should, IMO, be the way forward here. We should focus on providing the low-level capabilities that higher-level workflow tools can be built from, not on pip becoming a workflow tool itself.

pip install --ensure-downloaded-hashes[=sha256,sha512] (used with --report to ensure .download_info is populated)

I'm not particularly comfortable with this for the same reason - if pip has the hashes, it'll add them to the report. If it doesn't have them, then a tool wrapping pip can just as easily add them for itself by downloading the file and computing the hash. Again, if pip were managing the whole workflow, I might have a different view, but I don't see that as pip's role.
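
As a concrete illustration of the kind of wrapper being described here, a hedged sketch (field names follow pip's documented installation-report format, but treat this as illustrative rather than a supported tool) that turns a --report JSON file into a hash-pinned requirements file, downloading and hashing anything the report does not already carry a hash for:

```python
# Hypothetical wrapper around pip's --report output; not part of pip itself.
import hashlib
import json
import urllib.request


def report_to_requirements(report_path: str) -> str:
    """Convert `pip install --dry-run --report <file>` output to requirements text."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    lines = []
    for item in report.get("install", []):
        meta = item["metadata"]
        info = item.get("download_info", {})
        sha256 = info.get("archive_info", {}).get("hashes", {}).get("sha256")
        if sha256 is None and "url" in info:
            # Fall back to downloading the artifact and hashing it ourselves,
            # exactly the "wrapping tool computes the hash" case above.
            with urllib.request.urlopen(info["url"]) as resp:
                sha256 = hashlib.sha256(resp.read()).hexdigest()
        line = f"{meta['name']}=={meta['version']}"
        if sha256:
            line += f" --hash=sha256:{sha256}"
        lines.append(line)
    return "\n".join(lines) + "\n"
```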

See #12503 for more details on pip's role here.

cosmicexplorer and others added 6 commits January 16, 2024 23:05
When performing `install --dry-run` and PEP 658 .metadata files are
available to guide the resolve, do not download the associated wheels.

Rather use the distribution information directly from the .metadata
files when reporting the results on the CLI and in the --report file.

- describe the new --dry-run behavior
- finalize linked requirements immediately after resolve
- introduce is_concrete
- funnel InstalledDistribution through _get_prepared_distribution() too
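
For reference, a minimal sketch of exercising the behavior the commit messages above describe: resolve and report without downloading wheels when PEP 658 metadata is available. The --dry-run and --report flags already exist; the package name is only an example.

```python
# Resolve without installing (and, with this PR, without downloading wheels
# when .metadata files are available), then read back the report.
import json
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "pip", "install", "--dry-run",
     "--report", "report.json", "requests"],  # "requests" is only an example
    check=True,
)

with open("report.json", encoding="utf-8") as f:
    report = json.load(f)

for item in report["install"]:
    meta = item["metadata"]
    print(meta["name"], meta["version"])
```
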
Contributor

@edmorley left a comment

I'm not a pip maintainer, so I'm not able to approve the PR; however, I took a look in the hope that another pair of eyes might help increase the confidence level of whoever approves the PR later.

This change looks great to me - the only nit I found (when reviewing commit by commit) ended up being addressed by the typo fix in d6da334, so I couldn't actually find anything to leave a comment on 😄

The thorough code comments, PR/commit descriptions and neatly split up commit sequence made reviewing this much easier too, thank you!

I'd love to see this land so it can then unblock some of your other performance improvement PRs (#12256, #12257).

@edmorley
Contributor

I don't suppose one of the pip maintainers has a chance to review this soon? I've reviewed it myself (above) fairly carefully, and the PR already had some review passes earlier from pip maintainers, so hopefully a final review now shouldn't be too time consuming :-)

@pfmoore
Member

pfmoore commented Feb 13, 2024

I'll be honest, I've avoided looking at this (in spite of the fact that I'm generally in favour of the idea) because it (and @cosmicexplorer's other PRs) seemed to be changing very rapidly - there were a lot of force-pushed changes in a short period, meaning that things were changing as I looked at them 🙁 In addition, this is a big change affecting a lot of files, so reviewing isn't a fast process (especially when you have limited time, as I do).

I'm concerned that the preparer is one of the least "obvious" parts of the resolution process - it's full of generic terms (not least of which is "prepare" itself!) which don't really give much clue as to what precisely is going on. Every time I need to deal with the preparer, I have to do a bunch of research to remind myself what's happening before I can proceed. I'm worried that if we delegate chunks of the "prepare" task to individual commands, we will end up making that problem worse, with everyone needing to keep track of what's happening.

To give an example, install.py now calls finalize_linked_requirements. It's not obvious to me why that call is needed. What exactly wasn't "finalized", and what's so special about linked requirements here? This is all stuff that was just as obscure before this PR (as I said, the preparer was never the most obvious bit of code!) but previously, the install command (and more importantly, maintainers working on that command) didn't need to care. Now it does. Why? What changed? The wheel and download commands used to call the (impressively uselessly named!) prepare_linked_requirements_more method, but install never did. I'm guessing that the point here is to pull out the stuff that everyone did into an explicit method, but then allow install to pass the "dry run" flag to say "don't download actual dists". But I'm working that out from the comments in this PR and seeing the call in the context of this PR. If I were just looking at the install command for some reason, long after this was merged, I wouldn't have that information.
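
For what it's worth, one plausible reading of the flow being described is sketched below; only finalize_linked_requirements() is named in this PR, and every other name and parameter here is a guess rather than the actual diff:

```python
# Guess at the calling pattern, not the PR's real code: the resolver no longer
# finalizes requirements itself, and each command tells the preparer whether
# concrete artifacts are actually needed.
def run_install_command(preparer, resolver, requirements, dry_run: bool) -> None:
    # Resolution may complete using only PEP 658 metadata ("virtual" dists).
    requirement_set = resolver.resolve(requirements, check_supported_wheels=True)

    # The command, not the resolver, decides whether still-virtual requirements
    # must be hydrated into real dists on disk before proceeding.
    preparer.finalize_linked_requirements(
        requirement_set.requirements.values(),
        require_concrete=not dry_run,  # hypothetical flag name
    )
```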

I'm sorry, I know that I'm making this PR somehow responsible for the big maintainability mess that is the preparer, and that's not entirely fair. But one of the reasons this PR isn't getting eyes on it for review is because of that maintainability mess, and I honestly think we need to start paying back some of that technical debt if we're to have any chance of progressing on bigger (and important!) work like this. In particular, performance improvements tend to increase technical debt (you're explicitly trading speed against maintainability) and I think we need to be cautious here, given the current struggle we have with maintainer time.

I'd love to see this land so it can then unblock some of your other performance improvement PRs

I'm not sure this will unblock things as much as we'd hope. The problem is simply that all of these changes are huge - 21 files for this one, 38 and 43 for the two you linked. Getting time to review changes on this sort of scale is a problem in itself (for me, at least - I can't speak for the other maintainers). A better approach might be to refactor the PRs into simpler, step-by-step changes, each of which has a small, manageable, scope and which can be accepted individually, both reducing the technical debt and incrementally moving towards these end goals. For example a change that renamed and documented prepare_linked_requirements_more would be much easier to review and accept, and would pave the way for a follow-up that replaced it with better-factored components. And so on.

@edmorley
Contributor

edmorley commented Feb 13, 2024

I'm not sure this will unblock things as much as we'd hope. The problem is simply that all of these changes are huge - 21 files for this one, 38 and 43 for the two you linked

The other two PRs are stacked PRs (so they don't change as many files as that) - see their PR descriptions for notes like:

This PR is on top of #12186; see the +392/-154 diff against it at https://github.com/cosmicexplorer/pip/compare/metadata_only_resolve_no_whl...cosmicexplorer:pip:link-metadata-cache?expand=1.

@notatallshaw
Contributor

@pfmoore I've read through your last comment a few times and I'm not sure I understand whether there is a way forward or not.

Would you prefer the PRs to be further broken down into something smaller or would you like something bigger that fixes a lot of the genericness of the preparer logic? Or are you saying that there's too much technical debt in this part of pip for anyone other than a maintainer to touch it?

I had some optimization ideas that are orthogonal to the ones created by this series of PRs. But they would touch a lot of the same logic and I didn't want the PRs to step on each other's toes.

@pfmoore
Member

pfmoore commented Feb 19, 2024

Would you prefer the PRs to be further broken down into something smaller or would you like something bigger that fixes a lot of the genericness of the preparer logic?

Neither. I'd prefer a series of smaller commits that make the preparer more maintainable. Is that what you mean by "fixes a lot of the genericness"? I was thinking about steps (each in its own PR) like renaming functions to be more meaningful, adding/improving comments and docstrings to explain the logic, factoring out complex code into (named, easier to understand) standalone functions, encapsulating conditionals in functions so that the calling code needs less branching, etc. Standard "refactoring" types of improvements.
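
As a generic illustration of the last of those steps (hypothetical code, not from pip): encapsulating a branching condition in a well-named helper so the calling code needs less inline logic.

```python
# Before: every caller has to know which combination of attributes means
# "resolved from metadata alone".
#   if req.link is not None and req._dist is not None and not req.is_concrete:
#       ...

def is_metadata_only(req) -> bool:
    """True if the requirement was resolved without downloading an artifact."""
    return req.link is not None and req._dist is not None and not req.is_concrete


def report_requirement(req) -> None:
    if is_metadata_only(req):
        print(f"{req.name}: resolved from metadata only")
    else:
        print(f"{req.name}: artifact available on disk")
```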

Once those have been done, I'd hope that this PR can be refactored into a number of similar smaller PRs, each tackling a part of the task without needing the whole thing to be done in a "big bang".

Or are you saying that there's too much technical debt in this part of pip for anyone other than a maintainer to touch it?

No, I'm saying that there's so much technical debt, it's hard for anyone to touch it. Getting reviews is hard because it needs a second person to put the effort into understanding, and that's hard because of the technical debt. If we were in a better situation, committer reviews would be easier because the committers would be sufficiently comfortable with the code to do a review without needing to re-learn the underlying logic.

In fact, I think refactoring to reduce technical debt might be easier for an external committer, because they are coming at the code with no preconceptions, and with a fresh viewpoint. When I look at the code, all I think is "ugh, what a mess" 🙁

But it would touch a lot of the same logic and I didn't want the PRs to step on each others toes.

That's precisely the sort of over-coupling that the complexity of the existing code results in.

I don't want to make "fix the technical debt" into some sort of prerequisite or "tax" that must be paid by someone like yourself who simply wants to contribute improvements to pip. We definitely can't afford to discourage that sort of work! But I appreciate the frustration you feel with how hard it is to get reviews, and this is the best way I can think of for moving the "getting a review" problem forward. If you'd prefer not to get caught up in the technical debt issue, by all means ignore those parts of my comments, and simply wait until one of the core maintainers (which might well be me) gets time to do a proper review of the code here.
