This repository has been archived by the owner on Apr 2, 2024. It is now read-only.

Metrics retention job performance improved by recycling connections #1813

Open
mikberg opened this issue Dec 22, 2022 · 0 comments

mikberg (Contributor) commented Dec 22, 2022

This is more a potentially useful finding than a bug report.

My Promscale installation recently reached the point where metric data retention policies kicked in and the maintenance jobs began deleting chunks. At the same time, the maintenance jobs started taking much longer to complete and struggled to keep up with the expiring chunks. Their performance would often seemingly crawl to a halt, and they would consume large amounts of memory. This led to knock-on effects, such as high latencies, failing backup processes and Postgres going into recovery mode.

I discovered that the maintenance jobs' performance would start out pretty good, handling one chunk every ~3-4 seconds. After a while, the time spent per chunk increased steadily to several minutes. At the same time, top showed the memory use of the postgres processes corresponding to the maintenance job PIDs growing steadily, into the gigabytes.

Killing and restarting the maintenance jobs seemed to help – they would start out again fresh, with high performance and throughput. After about 5 minutes, their performance would start to noticeably degrade.

I found this answer on DBA Stack Exchange, which offered a hypothesis for what could be happening: the per-connection cache growing as the maintenance jobs touched more objects.

I tested out this hypothesis by writing this custom metric data retention job, executed as a Kubernetes CronJob. The job has a connection pool and a worker pool of the same size, and each database connection is recycled every 3 minutes. (It also tries to back off if performance drops, e.g. while backup processes are running.)

(The compression part of the maintenance job is still scheduled through TimescaleDB's jobs feature, with the retention part commented out.)
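
For reference, here is a minimal Go sketch of the recycling idea, not the actual job linked above: a database/sql pool and a worker pool of the same size, with SetConnMaxLifetime forcing every connection to be replaced after roughly three minutes. The connection string, worker count, backoff threshold and the prom_api.execute_maintenance() call are placeholder assumptions; a real job would do its own per-chunk retention work in that spot.

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"sync"
	"time"

	_ "github.com/lib/pq" // Postgres driver registered as "postgres"
)

func main() {
	// Hypothetical connection string; adjust for your installation.
	db, err := sql.Open("postgres", "postgres://promscale@db:5432/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const workers = 4                      // assumed worker count
	db.SetMaxOpenConns(workers)            // connection pool the same size as the worker pool
	db.SetMaxIdleConns(workers)
	db.SetConnMaxLifetime(3 * time.Minute) // recycle every connection after ~3 minutes

	// Give the whole CronJob run an upper bound so it always terminates.
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for ctx.Err() == nil {
				start := time.Now()
				// Placeholder for the actual retention work; a real job would
				// drop one expired chunk (or invoke Promscale's maintenance
				// routine) per iteration instead.
				if _, err := db.ExecContext(ctx, "CALL prom_api.execute_maintenance()"); err != nil {
					log.Printf("worker %d: %v", id, err)
					return
				}
				// Crude backoff: if a call was slow (e.g. a backup is running),
				// sleep for as long as the call took before trying again.
				if elapsed := time.Since(start); elapsed > 30*time.Second {
					time.Sleep(elapsed)
				}
			}
		}(i)
	}
	wg.Wait()
}
```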

This workaround/custom job does indeed seem to sustain consistently high performance, the same performance the maintenance jobs had at the start of each run. It has solved my problems with a high and increasing number of expired metric chunks, and it makes the installation more performant. The maintenance jobs would previously run for many hours; the custom job often completes within a few minutes.

I'm unsure whether this would apply more generally and could speed up metrics retention jobs for others, or whether my installation is somehow misconfigured, causing it to need this workaround.

Before:
[screenshot, 2022-12-22 14:24]

After:
[screenshot, 2022-12-22 14:27]

(The time range on the after-screenshot is shorter to avoid some unrelated problems.)

Edit: Promscale 0.16.0, TimescaleDB 2.8.1 and promscale_extension 0.7.0.
