
Added /prune/ endpoint to remove "old" RPMs from a Repository. #3536

Open

wants to merge 1 commit into main from 2909_prune
Conversation

@ggainey ggainey commented May 14, 2024

closes #2909.

@ggainey ggainey force-pushed the 2909_prune branch 2 times, most recently from 4d8ad7c to edf4989 Compare May 14, 2024 01:32
@ggainey ggainey marked this pull request as draft May 14, 2024 04:28
ggainey commented May 14, 2024

Draft until I figure out what is up w/ permissions - can't reproduce locally?!?

@ggainey ggainey marked this pull request as ready for review May 14, 2024 16:23
@ggainey ggainey requested review from dralley and ipanova May 14, 2024 17:44
ggainey commented May 14, 2024

Some notes on performance:

Some unscientific performance results follow. The repositories are a selection of "real" repos, of small (zoo), medium (baseos-9) and large (appstream-9, epel) amounts of content.

All runs used concurrency == 1, in oci-env, on my workstation (Intel® Core™ i7-6700K × 8), with info-level logging only.
Repositories were created with --no-autopublish.

Repository List:

| name                    |  size |  dups |
| ----------------------- | ----: | ----: |
| zoo                     |    35 |     4 |
| rhel_9_x86_64_appstream | 17524 | 11834 |
| rhel_9_x86_64_baseos    |  6185 |  4993 |
| epel9                   | 21236 |     0 |

How "prune" was run:

```shell
$ TGROUP=`http -b POST :5001/pulp/api/v3/rpm/prune/ repo_hrefs:='["*"]' repo_concurrency:=1 keep_days=0 dry_run=True | jq -r .task_group`
```

How timings were evaluated:

```shell
for t in `http ":5001${TGROUP}" | jq -r .tasks[].pulp_href`
do
  pulp task show --href "${t}" | jq '.started_at, .finished_at, .progress_reports[0].message, .progress_reports[0].total'
done
```
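(Not part of the PR — a small hypothetical helper, for reference, showing how the per-task elapsed time can be computed from the two ISO-8601 timestamps the loop above prints.)

```python
from datetime import datetime


def elapsed_ms(started_at: str, finished_at: str) -> float:
    """Wall-clock duration in milliseconds between two ISO-8601 timestamps,
    e.g. the started_at/finished_at fields of a Pulp task."""
    start = datetime.fromisoformat(started_at)
    end = datetime.fromisoformat(finished_at)
    return (end - start).total_seconds() * 1000.0
```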

Performance Runs

- dry-run, keep=0 (max deletes, discovery only)
  - zoo : 46ms
  - epel : 850ms
  - base : 380ms
  - apps : 912ms
- prune, keep=1000 (no deletes discovered)
  - zoo : 49ms
  - epel : 849ms
  - base : 392ms
  - apps : 929ms
- prune, keep=0 (max actual deletes)
  - zoo : 163ms
  - epel : 2214ms
  - base : 1823ms
  - apps : 4698ms
- prune, keep=0, post-max-prune (0 actual deletes)
  - zoo : 206ms
  - epel : 1884ms
  - base : 628ms
  - apps : 1365ms

```python
log = getLogger(__name__)


def prune_repo_nevras(repo_pk, keep_days, dry_run):
```
Contributor:

`prune_repo_packages`

Contributor:

Should we make both n-to-keep and the time threshold configurable?

Member:

I'd probably start with the simplest approach and always keep n=1; we could make it configurable later if we get requests for it.

Contributor Author:

We already have `retain_package_versions` if you want "how many nevras to keep", and `retain_repo_versions` to prune old RepositoryVersions. This is "get rid of nevras older than X", regardless of those settings. Do we really want to make this even more complicated? Even if someone somewhere thinks it would be nifty, I think we're already at the edge of "too many choices".

The `keep_days` argument allows the user to specify the number of days to allow "old" content to remain in the
repository. The default is 14 days.

The `repo_concurrency` argument allows the user to control how many `pulpcore-workers` can be operating on
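To make the `keep_days` semantics concrete, here is an illustrative, self-contained sketch — not the PR's actual implementation; `Pkg`, the simplified `evr` tuple, and grouping by bare name are stand-ins for the real NEVRA handling. The idea: the newest version of each package is always retained, and older copies become prune candidates only once they entered the repository more than `keep_days` ago.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Pkg:
    name: str
    evr: tuple          # simplified (epoch, version, release) ordering stand-in
    created: datetime   # when this copy entered the repository


def prune_candidates(packages, keep_days, now=None):
    """Return the packages that would be pruned: for each name, keep the
    newest EVR unconditionally; older copies are pruned only if they are
    older than the keep_days cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    by_name = {}
    for p in packages:
        by_name.setdefault(p.name, []).append(p)
    prune = []
    for versions in by_name.values():
        versions.sort(key=lambda p: p.evr, reverse=True)
        # versions[0] is the newest and is always retained
        prune.extend(p for p in versions[1:] if p.created < cutoff)
    return prune
```

With keep_days=0 this marks every non-newest copy (the "max deletes" case in the performance runs above); a large keep_days marks nothing.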
Member:

I wonder whether it's right to expose this decision to the repos' users; if someone sets it to 50, it will do no good for the rest of the Pulp installation. Should we instead take the same approach as with workers in the import process? It would then be an admin-set/controlled setting, and maybe expressing it as a percentage is better.

Contributor Author:

There are already a ton of ways for a single user to stuff up the dispatch process. I can, right now, create 1000 repos all pointing to the same remote, and then sync all of them in an endless loop.

I considered (and at one point even implemented) a similar percentage setting. The problem with a setting is that it requires restarting all the workers to pick up a change, and it can't be adjusted for just a specific run.

I also toyed with making this an admin-only command. That would answer the immediate COPR need, but our other future-looking service needs "feel" like they require it to be accessible to end users.

I think "a single user can consume All The Workers" is a very valid issue - but it needs to be fixed at a more foundational level than in individual calls. I'd rather see that addressed as its own feature for the entire system.

pulp_rpm/app/viewsets/prune.py (resolved)
pulp_rpm/app/tasks/prune.py (outdated, resolved)
pulp_rpm/app/serializers/prune.py (outdated, resolved)
```python
# Validate that they are for RpmRepositories.
hrefs_to_return = []
for href in value:
    hrefs_to_return.append(NamedModelViewSet.get_resource(href, RpmRepository))
```
Member:

If a non-RPM href is provided, `get_resource` will fail. Should we catch and ignore it, or rather fail the validation?

Contributor Author:

Um, hm. I have to refresh my memory on how `get_resource()` behaves if it can't find the resource. In any event, my take is to fail the call. But it would be "more polite" to check all the provided HREFs, determine which ones aren't RPM repos, and fail with a message specifying all the incorrect repos, so the user can clean up their usage. WDYT?
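The "collect every bad href, then fail once" behavior proposed above could look like this hypothetical sketch. Here `resolve` stands in for `NamedModelViewSet.get_resource`, assumed to raise `LookupError` for unknown or non-RPM hrefs; a real serializer would raise DRF's `ValidationError` rather than `ValueError`.

```python
def validate_repo_hrefs(hrefs, resolve):
    """Resolve every href; raise a single error naming ALL hrefs that are not
    RPM repositories, rather than failing on the first bad one."""
    resolved, bad = [], []
    for href in hrefs:
        try:
            resolved.append(resolve(href))
        except LookupError:
            bad.append(href)
    if bad:
        raise ValueError(f"Not RPM repository hrefs: {', '.join(bad)}")
    return resolved
```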

pulp_rpm/app/serializers/prune.py (outdated, resolved)
staging_docs/user/guides/06-prune.md (resolved)

Successfully merging this pull request may close these issues.

As an administrator I want to set prune policy based on datetime
3 participants