
Sharded compaction #3537

Status: Draft. Wants to merge 17 commits into base branch main.
Conversation

@mdisibio (Contributor) commented Apr 3, 2024:

What this PR does:
This is the largest change to compaction in years. It is a new sharding compactor that splits blocks up by trace ID divisions. For example, instead of having 2 blocks that each cover the entire trace ID range 00..FF, the sharding compactor can combine and split them into 2 blocks covering the ranges 00..80 and 80..FF, so that each block contains half of the range it did before.
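For illustration, here is a minimal Go sketch of mapping a trace ID to one of N equal divisions of the ID space by its leading byte. The function name and the single-byte shortcut are hypothetical and not code from this PR:

```go
// Hypothetical sketch: map a trace ID to one of shardCount equal divisions
// of the ID space 00..FF by its leading byte.
package main

import "fmt"

func shardOf(traceID []byte, shardCount int) int {
	// For illustration only the first byte is inspected, which is enough to
	// pick among up to 256 divisions.
	return int(traceID[0]) * shardCount / 256
}

func main() {
	// With 2 shards, IDs beginning 00..7F fall in shard 0 and 80..FF in shard 1.
	fmt.Println(shardOf([]byte{0x7f, 0x01}, 2)) // 0
	fmt.Println(shardOf([]byte{0x80, 0x01}, 2)) // 1
}
```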

There are several benefits from this approach:

  • The read path is more efficient. Trace lookup can filter out more blocks faster by checking the Min/Max ID in the block meta, without having to resort to the bloom filter (a sketch follows this list). Metrics queries inspect fewer blocks to find the traces in the job shard.
  • Deduplication of data in the backend is more effective. Because the sharding compactor clusters traces with nearby IDs together, it is better at finding and deduplicating the copies of a trace created by the replication factor. A recent internal analysis measured that compactors were only bringing the replication factor from 3 down to ~2.5. The sharding compactor should achieve ~1.5 or better.
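As a concrete illustration of the read-path benefit, this is a minimal sketch (hypothetical blockMeta type and function, not Tempo's actual BlockMeta API) of rejecting a block by its Min/Max ID before ever consulting its bloom filter:

```go
package main

import (
	"bytes"
	"fmt"
)

// blockMeta is a hypothetical stand-in for per-block metadata holding the
// minimum and maximum trace ID stored in the block.
type blockMeta struct {
	MinID, MaxID []byte
}

// blockMayContain reports whether traceID falls inside the block's ID range.
// When it returns false the block can be skipped without touching its bloom filter.
func blockMayContain(meta blockMeta, traceID []byte) bool {
	return bytes.Compare(traceID, meta.MinID) >= 0 &&
		bytes.Compare(traceID, meta.MaxID) <= 0
}

func main() {
	m := blockMeta{MinID: []byte{0x00}, MaxID: []byte{0x80}}
	fmt.Println(blockMayContain(m, []byte{0x40})) // true: inside 00..80
	fmt.Println(blockMayContain(m, []byte{0xA0})) // false: block skipped
}
```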

The shard count is configurable globally and per-tenant. Setting it to 2 or higher activates the new sharding compactor; 0 or 1 falls back to the existing compactor. Useful values are expected to be 2 through 8. Values higher than 8 could be interesting, but at the cost of greatly increasing the size of the blocklist. A sketch of the activation rule follows.
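A minimal sketch of the activation rule, assuming a hypothetical config shape rather than Tempo's real overrides API:

```go
package main

import "fmt"

// compactionConfig is a hypothetical shape for the global and per-tenant settings.
type compactionConfig struct {
	GlobalShardCount    int
	PerTenantShardCount map[string]int
}

// shardCountFor resolves the effective shard count: a per-tenant override wins
// over the global default.
func (c compactionConfig) shardCountFor(tenant string) int {
	if v, ok := c.PerTenantShardCount[tenant]; ok {
		return v
	}
	return c.GlobalShardCount
}

// shardingEnabledFor applies the rule described above: 2 or higher enables the
// sharding compactor, 0 or 1 keeps the existing compactor.
func (c compactionConfig) shardingEnabledFor(tenant string) bool {
	return c.shardCountFor(tenant) >= 2
}

func main() {
	cfg := compactionConfig{
		GlobalShardCount:    1,
		PerTenantShardCount: map[string]int{"tenant-a": 4},
	}
	fmt.Println(cfg.shardingEnabledFor("tenant-a")) // true
	fmt.Println(cfg.shardingEnabledFor("tenant-b")) // false
}
```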

Description
How the sharding compactor works can be broken down into 3 phases:

  1. The first phase takes fresh blocks from the ingesters and splits them. Ingesters are not changed in this PR and continue to flush blocks covering the full range 00..FF. The sharding compactor identifies these by CompactionLevel=0 and splits them. This is a combination of combine and split as described above, so that up to 4 (maxInputBlocks) ingester blocks are rewritten as N sharded blocks. This step feeds work to the next phase.
  2. The second phase is the typical compaction/reduction of sharded blocks with other blocks in the same shard. This is the same as the existing compaction, but shard-aware. The compactor identifies these blocks by CompactionLevel>0 and by their min/max trace IDs falling within the same shard (i.e. the block is "well-sharded").
  3. A third phase upgrades/resplits older blocks too, but it has the lowest priority and runs only when the first 2 phases have no outstanding work (see the priority sketch after this list).
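A minimal sketch of the phase prioritization described above; the names and counters are hypothetical and do not reproduce the PR's actual switch statement:

```go
package main

import "fmt"

// phase identifies which group of blocks the compactor should work on next.
type phase int

const (
	phaseNone    phase = iota
	phaseSplit         // phase 1: split fresh CompactionLevel=0 ingester blocks
	phaseCompact       // phase 2: shard-aware compaction of well-sharded blocks
	phaseResplit       // phase 3: upgrade/resplit older, unsharded blocks
)

// nextPhase picks work in priority order: splitting feeds compaction, and
// resplitting older blocks only happens when the first two phases are idle.
func nextPhase(level0, wellSharded, legacy int) phase {
	switch {
	case level0 > 0:
		return phaseSplit
	case wellSharded > 0:
		return phaseCompact
	case legacy > 0:
		return phaseResplit
	default:
		return phaseNone
	}
}

func main() {
	fmt.Println(nextPhase(3, 10, 50)) // 1: splitting wins even with other work pending
	fmt.Println(nextPhase(0, 0, 50))  // 3: resplit only when nothing else is outstanding
}
```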

Which issue(s) this PR fixes:
Fixes n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@@ -75,6 +75,7 @@ func (i *rawIterator) Next(context.Context) (common.ID, parquet.Row, error) {
	}

	if errors.Is(err, io.EOF) {
		i.pool.Put(rows[0])

@mdisibio (Contributor, Author) commented on this diff:

This fixes a "memory leak" where the buffer wasn't returned to the pool. It actually has a significant effect on memory because it isn't happening just once at the end, but on every loop of the multiblock iterator: once 1 block is exhausted, it begins leaking a pooled buffer on every call. The better fix would be to make the multiblock iterator stop calling Next() once an iterator is exhausted, but that was going to be a lot more involved.
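A minimal sketch of the pattern the fix restores, using a hypothetical iterator rather than the PR's rawIterator: a buffer borrowed from the pool must also be returned on the EOF path, otherwise each call made after exhaustion leaks one buffer.

```go
package multiblock

import (
	"errors"
	"io"
	"sync"
)

// iter is a hypothetical iterator that reads into buffers borrowed from a pool.
type iter struct {
	r    io.Reader
	pool *sync.Pool // pool of []byte buffers
}

// Next borrows a buffer on every call. On the EOF path the buffer is unused,
// so it must be handed back to the pool; otherwise each call made after the
// underlying source is exhausted leaks one pooled buffer.
func (i *iter) Next() ([]byte, error) {
	buf := i.pool.Get().([]byte)
	n, err := i.r.Read(buf)
	if errors.Is(err, io.EOF) && n == 0 {
		i.pool.Put(buf) // the essence of the fix: return the unused buffer
		return nil, io.EOF
	}
	return buf[:n], err
}
```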

@mapno (Member) left a comment:

Nice work!

Review threads:
  • modules/overrides/config_legacy.go
  • tempodb/encoding/common/interfaces.go (outdated)
  • tempodb/compactor_test.go (outdated)
  • tempodb/sharding_block_selector.go (outdated)
  • tempodb/tempodb.go (outdated)
@knylander-grafana (Contributor) left a comment:

Thank you for adding some information to the configuration docs. Will we need additional information in the docs about when to use this configuration?

@joe-elliott (Member) left a comment:

long overdue work. looks great. added some Qs but fine with things as is.

if you need my approval please ask. the only reason i'm not approving is b/c you and @mapno have been handling these PRs

Review threads:
  • modules/overrides/config_legacy.go
  • modules/querier/querier_query_range.go
  • tempodb/encoding/common/interfaces.go
// ship block to backend if done
if currentBlock != nil && cmd.CutBlock(currentBlock.meta, lowestID) {
	currentBlockPtrCopy := currentBlock
	currentBlockPtrCopy.meta.StartTime = minBlockStart

A reviewer (Member) commented:

won't this adjust the original "currentBlock" since they point to the same thing?

mdisibio (Contributor, Author) replied:

Hmm, this block was moved from the bottom of the loop to the top so that the CutBlock callback can inspect the next trace (lowestID). You're right, it does seem to be out of date and the copy is no longer needed. I'll take a look.
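A standalone example of the aliasing the reviewer points out, with hypothetical stand-in types: copying a pointer variable copies only the pointer, so a write through the copy is visible through the original.

```go
package main

import "fmt"

type blockMeta struct{ StartTime int64 }

type compactedBlock struct{ meta *blockMeta }

func main() {
	currentBlock := &compactedBlock{meta: &blockMeta{StartTime: 1}}

	// Copying the pointer does not copy the struct: both variables point at the
	// same block, so mutating through the copy mutates the original too.
	currentBlockPtrCopy := currentBlock
	currentBlockPtrCopy.meta.StartTime = 42

	fmt.Println(currentBlock.meta.StartTime) // 42
}
```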

Review threads:
  • tempodb/sharding_block_selector.go (two threads)
@mdisibio mdisibio marked this pull request as draft April 16, 2024 11:17
@mdisibio (Contributor, Author) commented:

Hi everyone: Putting this back to draft because it is not yet ready for large scale. Testing in a large cluster shows it is difficult to achieve the right balance between splitting and combining blocks, even with tuning. I would like to revisit the switch statement and the priorities of each group of blocks. When there is an imbalance, compactors are less effective and the blocklist grows more than is acceptable: if the compactor spends too much time splitting blocks, the backend ends up with (too) many small blocks, and if it spends too much time combining, then ingester level-0 blocks are neglected.
