
pull preparer logic out of the resolver to consume metadata-only dists in commands #12186

Open
wants to merge 7 commits into main from metadata_only_resolve_no_whl

Conversation

cosmicexplorer
Contributor

@cosmicexplorer commented Jul 28, 2023

Continuation of #11512, now with test cases and a lot of comments.

Problem

The use case proposed in #53 uses pip just to execute its resolution logic, in order to generate e.g. a lockfile, without pulling down any dists (which may be quite large). We eventually arrived at pip install --report --dry-run as the way to enable this. However, the workstream to improve pip performance has been somewhat fractured and distributed across multiple efforts (as often occurs in open source projects), and by the time PyPI had managed to enable PEP 658 metadata, it turned out we had failed to implement the caching behavior we wanted, as described in #11512.

Solution

  1. Avoid finalizing the requirement preparer at all in the resolver itself, and require each command to explicitly instruct the requirement preparer to download anything further.
  2. Rename several methods of the requirement preparer in order to clarify their purpose.
    • In particular, remove InstallRequirement#needs_more_preparation, and introduce the two methods .cache_concrete_dist() and .cache_virtual_metadata_only_dist() to manage a Distribution object attached to the InstallRequirement object itself (this iterates on .dist_from_metadata from "use .metadata distribution info when possible" #11512); see the sketch after this list.
    • Execute .cache_concrete_dist() from within the AbstractDistribution#get_metadata_distribution() implementations to centralize the single place where we instantiate a dist.
  3. Move the test_download_metadata() tests into a new file test_install_metadata.py, because the metadata tests were supposed to be done against the install command anyway.
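
A minimal sketch (not the actual diff) of how the two caching methods described above might hang together on InstallRequirement; the is_concrete flag comes from the commit list further down, and everything else here is illustrative:

```python
# Rough sketch only: the real InstallRequirement lives in
# pip._internal.req.req_install and carries far more state. Only the method
# names cache_concrete_dist()/cache_virtual_metadata_only_dist() and the
# is_concrete flag come from this PR; the rest is illustrative.
from typing import Optional


class BaseDistribution:  # stand-in for pip._internal.metadata.BaseDistribution
    pass


class InstallRequirement:
    def __init__(self) -> None:
        # The single Distribution object attached to the requirement.
        self._dist: Optional[BaseDistribution] = None
        # True once the dist is backed by an actual artifact on disk.
        self.is_concrete: bool = False

    def cache_virtual_metadata_only_dist(self, dist: BaseDistribution) -> None:
        """Record a dist synthesized from a PEP 658 .metadata file (or fast-deps).

        No wheel or sdist has been downloaded yet, so the dist is "virtual".
        """
        self._dist = dist
        self.is_concrete = False

    def cache_concrete_dist(self, dist: BaseDistribution) -> None:
        """Record a dist backed by a real artifact on the local disk.

        Intended to be called from AbstractDistribution.get_metadata_distribution()
        implementations, the single place where a dist is instantiated.
        """
        self._dist = dist
        self.is_concrete = True
```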

@cosmicexplorer
Contributor Author

cc @pfmoore @uranusjr @jjhelmus @ddelange

@cosmicexplorer force-pushed the metadata_only_resolve_no_whl branch 2 times, most recently from 720b0c2 to 6f4da4c on July 28, 2023 02:55
@cosmicexplorer
Contributor Author

Question for @pradyunsg, @sbidoul, and anyone else: as I note in the OP, this PR is less of a "bugfix", since we had not agreed on the exact behavior to commit to, and can almost be viewed as new functionality: we are now augmenting the guarantee of install --dry-run to include the new behavior of avoiding downloading any dists at all when metadata is available. Is that a useful framing, or is it easier to keep this as a bugfix in the NEWS entry?

@uranusjr
Member

Question, would this be easier if we remove fast-deps first?

@cosmicexplorer
Contributor Author

cosmicexplorer commented Jul 28, 2023

I don't believe so. As I discussed at length in #11478 (comment), the new logic we added to implement fast-deps (especially the changes to the requirement preparer to enable metadata-only dists) is still necessary to make PEP 658 support work at all. At this point, removing fast-deps itself is actually rather easy, since we've now compartmentalized it into just one of multiple ways to construct a requirement without a backing dist on the local disk.

The part that has proven error-prone, imo, is the intrinsic difficulty of retrofitting these "virtual" dists into a codebase that wasn't architected for lazy computation (because it didn't need to be). I believe this change improves that situation by rearranging the methods of the requirement preparer to make it very explicit where the "virtual" dists are finally hydrated into real installed dists. I am aware the diff looks quite large, but the vast majority of that comes from transplanting the metadata tests to use install instead of download, which was the biggest source of confusion surrounding these changes.

Personally, I think that fast-deps is still worth keeping at this point, because from what I've seen, the error-prone code here has been entirely within the requirement preparer, having nothing to do with the zip parsing. Right now I see fast-deps as an alternate way to test our code paths for "virtual" (metadata-only) dists outside of the specifics of PEP 658, and I think pip will benefit from hammering away at the "virtual" dist code paths in the future (see #12184).

...however, the fantastic work done to decouple the actual zip operations from the requirement preparer also means that it wouldn't be difficult to remove fast-deps entirely. I am just absolutely positive that fast-deps is orthogonal to the difficulty I have seen with implementing "virtual" dists, and I think that difficulty is either intrinsic to the problem of efficiently resolving dependencies or just normal tech debt we've been paying back over time.

I also believe the "virtual" dists workstream has been made more difficult than usual because it has inspired a lot of excitement about improving pip, so pip maintainers have had to piece together work written by many different people over time. I created this as a separate PR because I recognized there were several things to fix at once in the PR it was based on, and I thought it would be easier for a single person with more context on the issue (me) to take ownership.

@cosmicexplorer force-pushed the metadata_only_resolve_no_whl branch 2 times, most recently from 240414f to b9d851c on July 30, 2023 18:17
@pradyunsg
Member

Did a quick skim over the PR, as I'm wrapping up to head to bed -- not taking a closer look yet since CI isn't particularly happy with this right now.

One thing about fast-deps -- there seems to be some low-hanging fruit there, based on the final few things stated in #8670 that I've not looked into.

@uranusjr
Member

There are likely some details this isn't getting correct yet, but the general direction looks right to me. Also, thanks for the "finalize" rename; that needed to be done. Should we also rename the needs_more_preparation flag?

@cosmicexplorer force-pushed the metadata_only_resolve_no_whl branch 9 times, most recently from 21b77fc to d7d15be on July 31, 2023 11:10
Member

@sbidoul left a comment

I think @pradyunsg addressed the feeling I had but could not find the time to write down properly.

Another thing I have in mind in this area is the possibility of working with partial metadata (name, version, dependencies, optional dependencies) when it is declared statically in pyproject.toml or in PKG-INFO 2.2 in an sdist.

This should not be part of this PR, but maybe having this in mind could influence or orient the refactoring being done here.
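
To make the idea concrete, here is a small, hypothetical sketch (standard library only, not pip code) of how static metadata in an sdist's PKG-INFO could be recognized under PEP 643 / Metadata-Version 2.2, where any field not listed in Dynamic may be trusted as-is:

```python
# Hypothetical helper, not pip's implementation: decide whether an sdist's
# PKG-INFO declares enough static metadata to resolve against without
# building the sdist (PEP 643 / Metadata-Version 2.2).
from email.parser import Parser
from typing import Optional


def static_dependency_info(pkg_info_text: str) -> Optional[dict]:
    msg = Parser().parsestr(pkg_info_text)
    try:
        major, minor = (int(p) for p in msg.get("Metadata-Version", "").split(".")[:2])
    except ValueError:
        return None
    if (major, minor) < (2, 2):
        return None  # pre-2.2 metadata has no static/dynamic distinction
    dynamic = {field.lower() for field in msg.get_all("Dynamic", [])}
    if "requires-dist" in dynamic:
        return None  # dependencies may still change at build time
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        "requires_dist": msg.get_all("Requires-Dist", []),
        "provides_extra": msg.get_all("Provides-Extra", []),
    }
```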

src/pip/_internal/commands/install.py (outdated, resolved)
is called for requirements already found in the wheel cache, so we need to
synthesize it for uncached results. Luckily, a DirectUrl can be parsed directly
from a url without any other context. However, this also means the download info
will only contain a hash if the link itself declares the hash.
Member

This is the change I was calling out in another comment. I don't know if it can be considered a real breaking change, but it's worth thinking about.

Dry-run without actually downloading the wheels is definitely important. I'm wondering, though, if there are use cases for a dry-run where the packages are downloaded (and cached) and hashes (possibly with different algorithms) are computed and reported. For users wanting a "lock then install" workflow, this may be desirable.

Contributor Author

I think you're absolutely right! In fact, the functionality you describe is probably exactly what I would need for use in pex e.g. pex-tool/pex#2210 (because pex's pipelined installation process currently invokes pip download).

Let's say we add another flag --ensure-downloaded-hashes[=sha256,sha512]?

However, thinking about this use case also gave me another idea:

For users wanting a "lock then install" workflow, this may be desirable.

I would love for pip to support this more directly. I'm wondering whether you would accept a separate PR to add:

  • pip install --ensure-downloaded-hashes[=sha256,sha512] (used with --report to ensure .download_info is populated)
  • pip install --from-report <file.json> (can be used with --require-hashes)

I am not sure how most users are currently processing --report output, though. I'm also not sure whether there's another lockfile format we would want to convert to instead of using the --report output as a lockfile directly. If so, we could maybe put it somewhere like pip freeze --format=[requirements,lockfile] --from-report <file.json>?

I really like the idea of explicitly supporting this "lock then install" workflow, and I would like to know your thoughts on whether pip could do that. Please also let me know if this workflow is already supported. cc @uranusjr @pradyunsg @pfmoore

Member

I really like the idea of explicitly supporting this "lock then install" workflow, and I would like to know your thoughts on whether pip could do that. Please also let me know if this workflow is already supported.

I don't think it's something pip needs to support directly, but it's something we can provide components for. Lock-then-install workflows are available in tools like Poetry and PDM, and I don't think we need to compete with them in this area.

pip install --from-report <file.json> (can be used with --require-hashes)

-1 on this. Processing the JSON to produce a requirements file (for now) or a standard lockfile (once such a format gets standardised) should, IMO, be the way forward here. We should focus on providing the low-level capabilities that higher-level workflow tools can be built from, not on pip becoming a workflow tool itself.

pip install --ensure-downloaded-hashes[=sha256,sha512] (used with --report to ensure .download_info is populated)

I'm not particularly comfortable with this for the same reason - if pip has the hashes, it'll add them to the report. If it doesn't have them, then a tool wrapping pip can just as easily add them for itself by downloading the file and computing the hash. Again, if pip were managing the whole workflow, I might have a different view, but I don't see that as pip's role.
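
As a concrete illustration of the kind of wrapper being described here, a hedged sketch (field names follow pip's documented installation-report format, but treat this as illustrative rather than a supported tool) that turns a --report JSON file into a hash-pinned requirements file, downloading and hashing anything the report does not already carry a hash for:

```python
# Hypothetical wrapper around pip's --report output; not part of pip itself.
import hashlib
import json
import urllib.request


def report_to_requirements(report_path: str) -> str:
    """Convert `pip install --dry-run --report <file>` output to requirements text."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    lines = []
    for item in report.get("install", []):
        meta = item["metadata"]
        info = item.get("download_info", {})
        sha256 = info.get("archive_info", {}).get("hashes", {}).get("sha256")
        if sha256 is None and "url" in info:
            # Fall back to downloading the artifact and hashing it ourselves,
            # exactly the "wrapping tool computes the hash" case above.
            with urllib.request.urlopen(info["url"]) as resp:
                sha256 = hashlib.sha256(resp.read()).hexdigest()
        line = f"{meta['name']}=={meta['version']}"
        if sha256:
            line += f" --hash=sha256:{sha256}"
        lines.append(line)
    return "\n".join(lines) + "\n"
```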

See #12503 for more details on pip's role here.

cosmicexplorer and others added 6 commits January 16, 2024 23:05
When performing `install --dry-run` and PEP 658 .metadata files are
available to guide the resolve, do not download the associated wheels.

Rather use the distribution information directly from the .metadata
files when reporting the results on the CLI and in the --report file.

- describe the new --dry-run behavior
- finalize linked requirements immediately after resolve
- introduce is_concrete
- funnel InstalledDistribution through _get_prepared_distribution() too
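
For reference, a minimal sketch of exercising the behavior the commit messages above describe: resolve and report without downloading wheels when PEP 658 metadata is available. The --dry-run and --report flags already exist; the package name is only an example.

```python
# Resolve without installing (and, with this PR, without downloading wheels
# when .metadata files are available), then read back the report.
import json
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "pip", "install", "--dry-run",
     "--report", "report.json", "requests"],  # "requests" is only an example
    check=True,
)

with open("report.json", encoding="utf-8") as f:
    report = json.load(f)

for item in report["install"]:
    meta = item["metadata"]
    print(meta["name"], meta["version"])
```
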
Contributor

@edmorley left a comment

I'm not a pip maintainer, so I'm not able to approve the PR; however, I took a look in the hope that another pair of eyes might help increase the confidence level of whoever approves the PR later.

This change looks great to me - the only nit I found (when reviewing commit by commit) ended up being addressed by the typo fix in d6da334, so I couldn't actually find anything to leave a comment on 😄

The thorough code comments, PR/commit descriptions and neatly split up commit sequence made reviewing this much easier too, thank you!

I'd love to see this land so it can then unblock some of your other performance improvement PRs (#12256, #12257).

@edmorley
Contributor

I don't suppose one of the pip maintainers has a chance to review this soon? I've reviewed it myself (above) fairly carefully, and the PR already had some review passes earlier from pip maintainers, so hopefully a final review now shouldn't be too time consuming :-)

@pfmoore
Member

pfmoore commented Feb 13, 2024

I'll be honest, I've avoided looking at this (in spite of the fact that I'm generally in favour of the idea) because it (and @cosmicexplorer's other PRs) seemed to be changing very rapidly - there were a lot of force-pushed changes in a short period, meaning that things were changing as I looked at them 🙁 In addition, this is a big change affecting a lot of files, so reviewing isn't a fast process (especially when you have limited time, as I do).

I'm concerned that the preparer is one of the least "obvious" parts of the resolution process - it's full of generic terms (not least of which is "prepare" itself!) which don't really give much clue as to what precisely is going on. Every time I need to deal with the preparer, I have to do a bunch of research to remind myself what's happening before I can proceed. I'm worried that if we delegate chunks of the "prepare" task to individual commands, we will end up making that problem worse, with everyone needing to keep track of what's happening.

To give an example, install.py now calls finalize_linked_requirements. It's not obvious to me why that call is needed. What exactly wasn't "finalized", and what's so special about linked requirements here? This is all stuff that was just as obscure before this PR (as I said, the preparer was never the most obvious bit of code!) but previously, the install command (and more importantly, maintainers working on that command) didn't need to care. Now it does. Why? What changed? The wheel and download commands used to call the (impressively uselessly named!) prepare_linked_requirements_more method, but install never did. I'm guessing that the point here is to pull out the stuff that everyone did into an explicit method, but then allow install to pass the "dry run" flag to say "don't download actual dists". But I'm working that out from the comments in this PR and seeing the call in the context of this PR. If I were just looking at the install command for some reason, long after this was merged, I wouldn't have that information.
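
For what it's worth, one plausible reading of the flow being described is sketched below; only finalize_linked_requirements() is named in this PR, and every other name and parameter here is a guess rather than the actual diff:

```python
# Guess at the calling pattern, not the PR's real code: the resolver no longer
# finalizes requirements itself, and each command tells the preparer whether
# concrete artifacts are actually needed.
def run_install_command(preparer, resolver, requirements, dry_run: bool) -> None:
    # Resolution may complete using only PEP 658 metadata ("virtual" dists).
    requirement_set = resolver.resolve(requirements, check_supported_wheels=True)

    # The command, not the resolver, decides whether still-virtual requirements
    # must be hydrated into real dists on disk before proceeding.
    preparer.finalize_linked_requirements(
        requirement_set.requirements.values(),
        require_concrete=not dry_run,  # hypothetical flag name
    )
```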

I'm sorry, I know that I'm making this PR somehow responsible for the big maintainability mess that is the preparer, and that's not entirely fair. But one of the reasons this PR isn't getting eyes on it for review is because of that maintainability mess, and I honestly think we need to start paying back some of that technical debt if we're to have any chance of progressing on bigger (and important!) work like this. In particular, performance improvements tend to increase technical debt (you're explicitly trading speed against maintainability) and I think we need to be cautious here, given the current struggle we have with maintainer time.

I'd love to see this land so it can then unblock some of your other performance improvement PRs

I'm not sure this will unblock things as much as we'd hope. The problem is simply that all of these changes are huge - 21 files for this one, 38 and 43 for the two you linked. Getting time to review changes on this sort of scale is a problem in itself (for me, at least - I can't speak for the other maintainers). A better approach might be to refactor the PRs into simpler, step-by-step changes, each of which has a small, manageable, scope and which can be accepted individually, both reducing the technical debt and incrementally moving towards these end goals. For example a change that renamed and documented prepare_linked_requirements_more would be much easier to review and accept, and would pave the way for a follow-up that replaced it with better-factored components. And so on.

@edmorley
Contributor

edmorley commented Feb 13, 2024

I'm not sure this will unblock things as much as we'd hope. The problem is simply that all of these changes are huge - 21 files for this one, 38 and 43 for the two you linked

The other two PRs are stacked PRs (so they don't change as many files as that) - see their PR descriptions for notes like:

This PR is on top of #12186; see the +392/-154 diff against it at https://github.com/cosmicexplorer/pip/compare/metadata_only_resolve_no_whl...cosmicexplorer:pip:link-metadata-cache?expand=1.

@notatallshaw
Contributor

@pfmoore I've read through your last comment a few times and I'm not sure I understand whether there is a way forward or not.

Would you prefer the PRs to be further broken down into something smaller or would you like something bigger that fixes a lot of the genericness of the preparer logic? Or are you saying that there's too much technical debt in this part of pip for anyone other than a maintainer to touch it?

I had some optimization ideas that are orthogonal to the ones created by this series of PRs. But they would touch a lot of the same logic and I didn't want the PRs to step on each other's toes.

@pfmoore
Member

pfmoore commented Feb 19, 2024

Would you prefer the PRs to be further broken down into something smaller or would you like something bigger that fixes a lot of the genericness of the preparer logic?

Neither. I'd prefer a series of smaller commits that make the preparer more maintainable. Is that what you mean by "fixes a lot of the genericness"? I was thinking about steps (each in its own PR) like renaming functions to be more meaningful, adding/improving comments and docstrings to explain the logic, factoring out complex code into (named, easier to understand) standalone functions, encapsulating conditionals in functions so that the calling code needs less branching, etc. Standard "refactoring" types of improvements.
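
As a generic illustration of the last of those steps (hypothetical code, not from pip): encapsulating a branching condition in a well-named helper so the calling code needs less inline logic.

```python
# Before: every caller has to know which combination of attributes means
# "resolved from metadata alone".
#   if req.link is not None and req._dist is not None and not req.is_concrete:
#       ...

def is_metadata_only(req) -> bool:
    """True if the requirement was resolved without downloading an artifact."""
    return req.link is not None and req._dist is not None and not req.is_concrete


def report_requirement(req) -> None:
    if is_metadata_only(req):
        print(f"{req.name}: resolved from metadata only")
    else:
        print(f"{req.name}: artifact available on disk")
```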

Once those have been done, I'd hope that this PR can be refactored into a number of similar smaller PRs, each tackling a part of the task without needing the whole thing to be done in a "big bang".

Or are you saying that there's too much technical debt in this part of pip for anyone other than a maintainer to touch it?

No, I'm saying that there's so much technical debt, it's hard for anyone to touch it. Getting reviews is hard because it needs a second person to put the effort into understanding, and that's hard because of the technical debt. If we were in a better situation, committer reviews would be easier because the committers would be sufficiently comfortable with the code to do a review without needing to re-learn the underlying logic.

In fact, I think refactoring to reduce technical debt might be easier for an external committer, because they are coming at the code with no preconceptions, and with a fresh viewpoint. When I look at the code, all I think is "ugh, what a mess" 🙁

But it would touch a lot of the same logic and I didn't want the PRs to step on each others toes.

That's precisely the sort of over-coupling that the complexity of the existing code results in.

I don't want to make "fix the technical debt" into some sort of prerequisite or "tax" that must be paid by someone like yourself who simply wants to contribute improvements to pip. We definitely can't afford to discourage that sort of work! But I appreciate the frustration you feel with how hard it is to get reviews, and this is the best way I can think of for moving the "getting a review" problem forward. If you'd prefer not to get caught up in the technical debt issue, by all means ignore those parts of my comments, and simply wait until one of the core maintainers (which might well be me) gets time to do a proper review of the code here.
