
conda fetch_latest_path interface may still read repodata into memory #440

Open
dholth opened this issue Feb 15, 2024 · 2 comments
Labels
type::bug describes erroneous operation, use severity::* to classify the type

Comments

@dholth
Contributor

dholth commented Feb 15, 2024

Checklist

  • I added a descriptive title
  • I searched open reports and couldn't find a duplicate

What happened?

This will require collaboration with the conda repository; I'm reporting it here because this project can benefit from lower in-Python memory usage, whereas classic-solver conda will always load repodata into Python anyway.

We see that fetch_latest_path can still load repodata into memory here: https://github.com/conda/conda/blob/main/conda/gateways/repodata/__init__.py#L862C1-L862C68

Could this function be patched with a "don't load" flag?

The logic that saves the cache state is part of that convoluted function, so an attempt to bypass it and call a lower-level API directly might not correctly fetch from the local cache on a second request.
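As a rough illustration of the idea, a "don't load" flag might look something like the sketch below; the class and parameter names here are hypothetical, not conda's actual API:

```python
import json
from pathlib import Path


class RepodataFetchSketch:
    """Hypothetical illustration of a "don't load" flag; the names and
    signature here are not conda's actual API."""

    def __init__(self, cache_path: Path):
        self.cache_path = cache_path

    def fetch_latest_path(self, *, parse: bool = True):
        # In the real method, the cache-state bookkeeping (ETag headers,
        # conditional fetch, refreshing cache-control data) happens here.
        path = self.cache_path
        if not parse:
            # Proposed behavior: hand back only the path, never reading
            # the repodata file into memory.
            return path, None
        # Current behavior: the cached repodata is parsed into Python objects.
        return path, json.loads(path.read_text())
```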

Conda Info

No response

Conda Config

No response

Conda list

No response

Additional Context

No response

@dholth dholth added the type::bug describes erroneous operation, use severity::* to classify the type label Feb 15, 2024
@jaimergp
Contributor

jaimergp commented Feb 15, 2024

Thanks @dholth, yes, I'd love a method or a flag that guarantees the minimal amount of I/O needed to obtain the JSON cache path while guaranteeing it is up to date.

I thought that fetch_latest_path() would do that, but it ends up calling fetch_latest(), which does return raw_repodata, so... maybe we just need a separate code path? I'm not sure how much this would increase the maintenance burden, or whether it's a simple change.

I see that one of the reasons to load the JSON is to refresh the cache size, but... if it's unchanged, I don't see how the cache info would be different. Even if we really want to do that, the size info could come from the stat payload, no?

Ideally we would have a function that minimally achieves the following (a rough sketch follows the list):

  • If the remote reports changes, stream it to disk, update the cache, return the path
  • If no changes are reported, the cache is valid: update the last modification time and return the path
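A minimal sketch of that shape, assuming a requests-based conditional fetch keyed on an ETag (illustrative only, not conda's internals):

```python
import os
import requests


def fetch_repodata_path(url: str, cache_path: str, etag: str | None = None) -> str:
    """Illustrative sketch only, not conda's API: return the path to an
    up-to-date repodata.json without parsing it into Python objects."""
    headers = {"If-None-Match": etag} if etag else {}
    response = requests.get(url, headers=headers, stream=True)

    if response.status_code == 304:
        # Remote reports no changes: the cache is valid, just refresh its mtime.
        os.utime(cache_path)
        return cache_path

    response.raise_for_status()
    # Remote changed: stream the body straight to disk, never building the
    # full JSON document in memory.
    with open(cache_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)
    return cache_path
```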

@dholth
Contributor Author

dholth commented Feb 15, 2024

It's a complicated method that is also supposed to maintain backwards compatibility. Hopefully, now that we've found a reasonable lower-level API (necessary for more advanced repodata fetching), this can be simplified.

Yes, the necessary cache information can be found with a stat call. I would also like to move the cache metadata down a layer; right now the old RepoInterface knows less about the cache, and the new RepoInterface knows more about the cache, as is necessary for jlap support.
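As a small sketch of that point (not tied to conda's actual cache classes), the size and modification time can come straight from filesystem metadata:

```python
import os


def cache_stat_info(cache_path: str) -> tuple[int, float]:
    """Sketch: read the cached repodata.json size and mtime from filesystem
    metadata instead of loading the file into memory."""
    st = os.stat(cache_path)
    return st.st_size, st.st_mtime
```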
