
Added /prune/ endpoint to remove "old" RPMs from a Repository. #3536

Open

wants to merge 1 commit into main from 2909_prune
Conversation

@ggainey ggainey commented May 14, 2024

closes #2909.

@ggainey ggainey force-pushed the 2909_prune branch 2 times, most recently from 4d8ad7c to edf4989 Compare May 14, 2024 01:32
@ggainey ggainey marked this pull request as draft May 14, 2024 04:28
ggainey commented May 14, 2024

Draft until I figure out what is up w/ permissions - can't reproduce locally?!?

@ggainey ggainey marked this pull request as ready for review May 14, 2024 16:23
@ggainey ggainey requested review from dralley and ipanova May 14, 2024 17:44
ggainey commented May 14, 2024

Some notes on performance:

Some unscientific performance results follow. The repositories are a selection of "real" repos, of small (zoo), medium (baseos-9) and large (appstream-9, epel) amounts of content.

All runs used concurrency == 1, in oci-env, on my workstation (Intel® Core™ i7-6700K × 8), with info-level logging only.
Repositories were created with --no-autopublish.

Repository List:

| name                    |  size |  dups |
| ----------------------- | ----: | ----: |
| zoo                     |    35 |     4 |
| rhel_9_x86_64_appstream | 17524 | 11834 |
| rhel_9_x86_64_baseos    |  6185 |  4993 |
| epel9                   | 21236 |     0 |

How "prune" was run:

```shell
$ TGROUP=`http -b POST :5001/pulp/api/v3/rpm/prune/ repo_hrefs:='["*"]' repo_concurrency:=1 keep_days=0 dry_run=True | jq -r .task_group`
```

How timings were evaluated:

```shell
for t in `http ":5001${TGROUP}" | jq -r .tasks[].pulp_href`
do
  pulp task show --href "${t}" | jq '.started_at, .finished_at, .progress_reports[0].message, .progress_reports[0].total'
done
```
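(Not part of the PR — a small hypothetical helper, for reference, showing how the per-task elapsed time can be computed from the two ISO-8601 timestamps the loop above prints.)

```python
from datetime import datetime


def elapsed_ms(started_at: str, finished_at: str) -> float:
    """Wall-clock duration in milliseconds between two ISO-8601 timestamps,
    e.g. the started_at/finished_at fields of a Pulp task."""
    start = datetime.fromisoformat(started_at)
    end = datetime.fromisoformat(finished_at)
    return (end - start).total_seconds() * 1000.0
```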

Performance Runs

- dry-run, keep=0 (max deletes, discovery only)
  - zoo : 46ms
  - epel : 850ms
  - base : 380ms
  - apps : 912ms
- prune, keep=1000 (no deletes discovered)
  - zoo : 49ms
  - epel : 849ms
  - base : 392ms
  - apps : 929ms
- prune, keep=0 (max actual deletes)
  - zoo : 163ms
  - epel : 2214ms
  - base : 1823ms
  - apps : 4698ms
- prune, keep=0, post-max-prune (0 actual deletes)
  - zoo : 206ms
  - epel : 1884ms
  - base : 628ms
  - apps : 1365ms

```python
log = getLogger(__name__)


def prune_repo_nevras(repo_pk, keep_days, dry_run):
```
Contributor:

`prune_repo_packages`

Contributor:

Should we make both n-to-keep and the time threshold configurable?

Member:

I'd probably start with the simplest approach and always keep n=1; we could make it configurable later if we get requests for it.

Contributor Author:

We already have `retain_package_versions` if you want "how many nevras to keep", and `retain_repo_versions` to prune old RepositoryVersions. This is "get rid of nevras older than X", regardless of those settings. Do we really want to make this even more complicated? Even if someone somewhere thinks it would be nifty, I think we're already at the edge of "too many choices".

The `keep_days` argument allows the user to specify the number of days to allow "old" content to remain in the
repository. The default is 14 days.

The `repo_concurrency` argument allows the user to control how many `pulpcore-workers` can be operating on
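To make the `keep_days` semantics concrete, here is an illustrative, self-contained sketch — not the PR's actual implementation; `Pkg`, the simplified `evr` tuple, and grouping by bare name are stand-ins for the real NEVRA handling. The idea: the newest version of each package is always retained, and older copies become prune candidates only once they entered the repository more than `keep_days` ago.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Pkg:
    name: str
    evr: tuple          # simplified (epoch, version, release) ordering stand-in
    created: datetime   # when this copy entered the repository


def prune_candidates(packages, keep_days, now=None):
    """Return the packages that would be pruned: for each name, keep the
    newest EVR unconditionally; older copies are pruned only if they are
    older than the keep_days cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    by_name = {}
    for p in packages:
        by_name.setdefault(p.name, []).append(p)
    prune = []
    for versions in by_name.values():
        versions.sort(key=lambda p: p.evr, reverse=True)
        # versions[0] is the newest and is always retained
        prune.extend(p for p in versions[1:] if p.created < cutoff)
    return prune
```

With keep_days=0 this marks every non-newest copy (the "max deletes" case in the performance runs above); a large keep_days marks nothing.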
Member:

I wonder whether it's right to expose this decision to the repos' users; if someone sets it to 50, it will do no good for the rest of the Pulp installation. Should we instead take the same approach as with workers in the import process? It would then be an admin-set/controlled setting, and maybe expressing it as a percentage is better.

Contributor Author:

There are already a ton of ways for a single user to stuff up the dispatch process. I can, right now, create 1000 repos all pointing to the same remote, and then sync all of them in an endless loop.

I considered (and at one point even implemented) a similar percentage setting. The problem with a setting is that it requires restarting all the workers to pick up a change, and it can't be adjusted for just a specific run.

I also toyed with making this an admin-only command. That would answer the immediate COPR need, but our other future-looking service needs "feel" like they require it to be accessible to end users.

I think "a single user can consume All The Workers" is a very valid issue - but it needs to be fixed at a more foundational level than in individual calls. I'd rather see that addressed as its own feature for the entire system.

pulp_rpm/app/viewsets/prune.py (resolved)
pulp_rpm/app/tasks/prune.py (outdated, resolved)
pulp_rpm/app/serializers/prune.py (outdated, resolved)
```python
# Validate that they are for RpmRepositories.
hrefs_to_return = []
for href in value:
    hrefs_to_return.append(NamedModelViewSet.get_resource(href, RpmRepository))
```
Member:

If a non-RPM href is provided, `get_resource` will fail. Should we catch and ignore it, or rather fail the validation?

Contributor Author:

Um, hm. I have to refresh my memory on how `get_resource()` behaves if it can't find the resource. In any event, my take is to fail the call. But it would be "more polite" to check all the provided HREFs, determine which ones aren't RPM repos, and fail with a message specifying all the incorrect repos, so the user can clean up their usage. WDYT?
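The "collect every bad href, then fail once" behavior proposed above could look like this hypothetical sketch. Here `resolve` stands in for `NamedModelViewSet.get_resource`, assumed to raise `LookupError` for unknown or non-RPM hrefs; a real serializer would raise DRF's `ValidationError` rather than `ValueError`.

```python
def validate_repo_hrefs(hrefs, resolve):
    """Resolve every href; raise a single error naming ALL hrefs that are not
    RPM repositories, rather than failing on the first bad one."""
    resolved, bad = [], []
    for href in hrefs:
        try:
            resolved.append(resolve(href))
        except LookupError:
            bad.append(href)
    if bad:
        raise ValueError(f"Not RPM repository hrefs: {', '.join(bad)}")
    return resolved
```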

pulp_rpm/app/serializers/prune.py (outdated, resolved)
staging_docs/user/guides/06-prune.md (resolved)

Successfully merging this pull request may close these issues.

As an administrator I want to set prune policy based on datetime
3 participants