Metrics scaling #148

Open
crabique opened this issue Oct 4, 2022 · 5 comments

crabique commented Oct 4, 2022

Problem

At the moment, the benji-{backup,restore}-pvc scripts push metrics to the pushgateway immediately when the wrapped benji process exits, which is likely good enough for many use cases.

In our case, however, I back up ~20k volumes in parallel with something like this (simplified):

kubectl get pvc -n xyz -o=custom-columns=:.metadata.name --no-headers \
  | xargs -r -P32 -I{} benji-backup-pvc --field-selector="metadata.name={}"

This runs as a cronjob pod on a dedicated k8s worker and speeds up the backup process greatly. However, with 32 parallel threads it completely overwhelms the pushgateway, no matter how vertically scaled it is: eventually all pushes begin to time out even with a high timeout set, so even though no metrics get through, the pushgateway still becomes the performance bottleneck for the backup process.

Our first idea was to scale pushgateway horizontally, but unfortunately, this is not really an option because of fragmentation and the fact that it uses memory-backed storage for metrics. Furthermore, scaling it horizontally is an anti-pattern, according to the developers.

Ideas

To work around that, we have a couple of ideas that could be feasible here (in no particular order of preference):

  • Make the parallel execution thread count configurable as a chart value, collect metrics internally for the entire run of the wrapper script, and then submit all of the metrics to the pushgateway at once; alternatively, an internal rate-limiter/aggregator could do this every few minutes
  • Allow users to set a flag so that metrics are never submitted directly but are instead buffered into a temporary file, and implement something like benji-push-metrics to PUT the metrics under the same label group, refreshing the entire state. This helper could then be called at the end of the run, or continuously as a background process at a fixed interval, to keep the state on the pushgateway's export endpoint up to date
  • Allow users to configure a custom exporter to a file instead of the pushgateway, e.g. with a file:// schema in the pushgateway configuration; it's pretty easy to submit such a file using curl (see the sketch after this list)
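For illustration, a file of metrics in the Prometheus text exposition format can be submitted to the pushgateway with a single curl call, and a PUT replaces all metrics previously pushed under the same grouping key, which matches the "refreshing the entire state" behaviour described above. The file path, job name and labels below are made up:

# Hypothetical push of a buffered metrics file (path and labels are illustrative).
# PUT replaces every metric previously pushed under this grouping key,
# so repeated calls refresh the whole state for the group.
curl -X PUT --data-binary @/tmp/benji-metrics.prom \
  http://pushgateway:9091/metrics/job/benji/instance/backup-run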

Any of these options would help me scale this better and marvel at the metrics at the same time 🙂

We are not developers per se; however, if this is not something you plan to work on yourselves and you like one of these options more than the others, we could handle the implementation, provided the PR is not going to collect dust.

Please let me know what you think or if you want any more information, I would be glad to help.

@elemental-lf elemental-lf self-assigned this Oct 9, 2022
@elemental-lf elemental-lf changed the title Mertics scaling Metrics scaling Oct 9, 2022
@elemental-lf
Owner

Thank you for reaching out; that's quite a lot of volumes you have there. So I gather that submitting several large requests to the pushgateway works, but it is overwhelmed by many small ones. At first glance I like your first option the most, but it is probably also the most involved one. What I don't like about the other options is that with an external file we might run into issues with concurrent access. On the other hand, I've been toying with the idea of using a slimmed-down version of https://argoproj.github.io/argo-workflows/ for automating backups instead of simple cronjobs; such a file could then be passed through as an artifact, or there could be separate files which are aggregated at the end of the workflow and pushed to the gateway. I will need to think about this a bit.

@elemental-lf
Owner

We also need to consider that aggregating metrics makes it more likely that all or some of them are lost due to uncaught exceptions or other unhandled errors.

@elemental-lf
Owner

@crabique I've extended benji-backup-pvc to accept a list of PVCs, which might help with your use case. A PVC can be specified as <name>, in which case the namespace given via --namespace is used as a default, or as <namespace>/<name>.
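For example, assuming positional arguments as described above, an invocation might look like this (the PVC and namespace names are made up):

# Bare names resolve against --namespace; qualified names override it.
benji-backup-pvc --namespace xyz pvc-a pvc-b other-ns/pvc-c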

@crabique
Author

Hi @elemental-lf ! Thanks for the update and sorry for the radio silence.

Unfortunately, this doesn't address the parallel execution aspect, but I think it could be a good workaround to combine this with xargs, passing batches of PVC names to benji-backup-pvc so that metrics are pushed in batches and less frequently. How many PVCs per batch would be a sensible number in your opinion?

@elemental-lf
Owner

Apart from the maximum command-line length, which xargs probably takes into account, I have no recommendation as to the number of PVCs per benji call. Specifying ten PVCs per call, for example, should reduce the number of calls to the pushgateway by the same factor. I think you'd have to experiment to find how much batching you need to avoid overwhelming the pushgateway.
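Concretely, the earlier pipeline could be adapted along these lines; the batch size and parallelism here are illustrative and would need tuning:

# xargs appends up to 10 PVC names per benji-backup-pvc invocation
# and keeps 8 invocations in flight; bare names resolve via --namespace.
kubectl get pvc -n xyz -o=custom-columns=:.metadata.name --no-headers \
  | xargs -r -P8 -n10 benji-backup-pvc --namespace xyz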
