
Improvements to batching #1739

Draft · wants to merge 6 commits into master from batch_reso

Conversation

cevian
Contributor

@cevian commented Nov 3, 2022

Description

Merge requirements

Please take into account the following non-code changes that you may need to make with your PR:

  • CHANGELOG entry for user-facing changes
  • Updated the relevant documentation

This change adds a new reservation system for metric batching.
The reservation system forms a queue of metric batchers for
copiers to pick up. The queue is ordered by the time a request
hits Promscale. That ordering lets copiers combine the data
belonging to a single request into batches more effectively.
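
To make the idea concrete, here is a minimal sketch of such a reservation queue as a min-heap keyed by request arrival time. The identifiers (`reservation`, `startTime`, `pendingSamples`, `Peek`) are illustrative, not the actual Promscale types, and a real implementation would also need locking since batchers and copiers touch the queue concurrently.

```go
// Illustrative only: a reservation queue ordered by the time the
// originating request hit the connector (not actual Promscale code).
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// reservation represents one metric batcher waiting for a copier.
type reservation struct {
	metric         string
	startTime      time.Time // when the oldest request feeding this batcher hit the connector
	pendingSamples int       // samples currently waiting in the batcher (used in later sketches)
}

// reservationQueue is a min-heap ordered by startTime.
type reservationQueue []*reservation

func (q reservationQueue) Len() int           { return len(q) }
func (q reservationQueue) Less(i, j int) bool { return q[i].startTime.Before(q[j].startTime) }
func (q reservationQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }

func (q *reservationQueue) Push(x interface{}) { *q = append(*q, x.(*reservation)) }

func (q *reservationQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

// Peek returns the oldest reservation without removing it.
func (q reservationQueue) Peek() *reservation {
	if len(q) == 0 {
		return nil
	}
	return q[0]
}

func main() {
	q := &reservationQueue{}
	heap.Init(q)
	now := time.Now()
	heap.Push(q, &reservation{metric: "http_requests_total", startTime: now.Add(-20 * time.Millisecond), pendingSamples: 400})
	heap.Push(q, &reservation{metric: "node_cpu_seconds_total", startTime: now.Add(-50 * time.Millisecond), pendingSamples: 900})

	// A copier picks up the reservation whose originating request is oldest.
	next := heap.Pop(q).(*reservation)
	fmt.Println("next metric to copy:", next.metric)
}
```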

This solves a few problems:
1. Previously, reservations were ordered by a channel, which meant they were
ordered by when each batcher first made its reservation. While that may
correlate with the time the request hit the connector, it is a less exact
proxy for what we actually want. This change solves that problem by ordering
explicitly on request start time.

2. It allows for Peeking into the info associated with the first reservation.
That is not possible with a channel-based approach and is used here to
ensure that we don't stop batching until a minimum duration has passed since the
oldest request in the batch hit the system. That operates as an alternative
backpressure mechanism to the regular mechanism of copiers becoming
available. In most cases it shouldn't be necessary, because of the
sharding adjustment Prometheus does, but it can be useful in some
cases. Note that the minimum latency is measured from the time the request
hits the connector and NOT the time spent in the batcher. This better controls
the latency penalty of this alternate backpressure mechanism.
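
A rough sketch of that minimum-duration check, reusing the hypothetical `reservationQueue` and `Peek` from the sketch above; `minBatchDuration` is an assumed knob, not a real Promscale setting:

```go
// A copier only stops batching and flushes once the oldest request in the
// queue has been in the system for at least minBatchDuration. Latency is
// measured from when the request hit the connector, not from when the
// batcher made its reservation.
func readyToFlush(q reservationQueue, minBatchDuration time.Duration) bool {
	oldest := q.Peek()
	if oldest == nil {
		return false // nothing reserved yet, nothing to flush
	}
	return time.Since(oldest.startTime) >= minBatchDuration
}
```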

Peeking can also be used for more advanced batch-sizing logic, and should give
us better observability. For example, we may want to batch based on the number
of samples. Without peeking, you'd have to pop the queue before learning the
number of samples, at which point it is too late.
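
For instance, still using the hypothetical types above (with the assumed `pendingSamples` counter), a copier could inspect the size of the next batch before deciding whether to pop it:

```go
// Inspect the next batch's sample count without popping it off the queue.
func nextBatchSamples(q reservationQueue) int {
	if r := q.Peek(); r != nil {
		return r.pendingSamples
	}
	return 0
}
```
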
- Significantly increase the number of samples batched per metric. Heavy
  metrics are very likely to come from many requests, so flushing
  them sooner is better; having to wait multiple times to flush is
  no bueno.
- To prevent copy batches from getting too large in the face of potentially
  more samples per metric, change the batch criteria to limit not just
  the number of metrics but also the number of samples (see the sketch
  after this list).
- Change the preference from trying to flush before batching to trying to
  batch before flushing. This should increase batch sizes.
- For two metrics with the same initial request time, flush the bigger one
  first, hoping the smaller one can become bigger before it is flushed.
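
A sketch of what the dual limit could look like, again using the hypothetical queue above; `maxMetrics` and `maxSamples` are assumed limits, and the same-start-time tie-break from the last bullet would live in the queue's ordering function:

```go
// Build one copy batch, stopping at whichever limit is hit first:
// the number of metrics or the total number of samples.
func buildCopyBatch(q *reservationQueue, maxMetrics, maxSamples int) []*reservation {
	var batch []*reservation
	samples := 0
	for len(batch) < maxMetrics {
		next := q.Peek()
		if next == nil {
			break
		}
		// Don't let the batch blow past the sample budget, but still
		// take a single oversized reservation on its own.
		if len(batch) > 0 && samples+next.pendingSamples > maxSamples {
			break
		}
		r := heap.Pop(q).(*reservation)
		batch = append(batch, r)
		samples += r.pendingSamples
	}
	return batch
}
```
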
Previously, when a new set of requests came in, we often
flushed a very small first batch because we flush as soon
as any data becomes available via Peek. Since small batches
are inefficient, we fix this by waiting until the entire first
request is batched before flushing (with a timeout).
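
One way this could be sketched, assuming the batcher exposes a channel that is closed once the whole first request has been batched (both the channel and `flushTimeout` are assumptions, not the real API):

```go
// Hold off the flush until the entire first request has been batched,
// or until flushTimeout expires, whichever comes first.
func waitForFirstRequest(firstRequestBatched <-chan struct{}, flushTimeout time.Duration) {
	select {
	case <-firstRequestBatched: // batcher signals the whole first request is in the batch
	case <-time.After(flushTimeout): // don't delay the flush indefinitely
	}
}
```
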
@cevian force-pushed the batch_reso branch 2 times, most recently from 05df804 to 2c36b75 on December 13, 2022 21:44