tiered compaction: duplicated L1 layer error in test_deletion_queue_recovery #7707

Open
1 of 3 tasks
Tracked by #7554
arpad-m opened this issue May 10, 2024 · 4 comments
Labels
c/storage/pageserver Component: storage: pageserver

Comments

@arpad-m
Member

arpad-m commented May 10, 2024

Running the test_deletion_queue_recovery or test_uploads_and_deletions tests with tiered compaction enabled gives "duplicated L1 layer" errors:

2024-05-10T22:25:57.275644Z ERROR request{method=PUT path=/v1/tenant/b8c7c9b6739fed9060bfdf938ec9e9dc/timeline/5ffef4897699e4ff0fd68add218821ea/checkpoint request_id=227fdbbe-247c-4aee-8545-524487816dc4}:manual_checkpoint{tenant_id=b8c7c9b6739fed9060bfdf938ec9e9dc shard_id=0000 timeline_id=5ffef4897699e4ff0fd68add218821ea}: duplicated L1 layer layer=000000067F00000005000000000000000001-030000000000000000000000000000000002__0000000001535489-000000000154E229-00000001
2024-05-10T22:25:57.275660Z ERROR request{method=PUT path=/v1/tenant/b8c7c9b6739fed9060bfdf938ec9e9dc/timeline/5ffef4897699e4ff0fd68add218821ea/checkpoint request_id=227fdbbe-247c-4aee-8545-524487816dc4}:manual_checkpoint{tenant_id=b8c7c9b6739fed9060bfdf938ec9e9dc shard_id=0000 timeline_id=5ffef4897699e4ff0fd68add218821ea}: duplicated L1 layer layer=000000067F00000005000040000000000001-030000000000000000000000000000000002__0000000001535489-000000000154E229-00000001

visible with the following diff of test_deletion_queue_recovery:

-    env = neon_env_builder.init_start(initial_tenant_conf=TENANT_CONF)
+    tenant_conf = TENANT_CONF
+    tenant_conf["compaction_algorithm"] = '{{"kind": "Tiered"}}'
+    env = neon_env_builder.init_start(initial_tenant_conf=tenant_conf)
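A side note on the diff: `tenant_conf = TENANT_CONF` binds a second name to the same dict, so the override is also written into the shared constant. Whether that matters for this test is not clear from the issue. A minimal sketch of a copy-based variant follows; the `TENANT_CONF` contents and the helper name `with_tiered_compaction` are made up for illustration, and the override value is kept verbatim from the diff.

```python
import copy

# Hypothetical stand-in for the module-level TENANT_CONF of the test file;
# the real keys and values are not shown in this issue.
TENANT_CONF = {"checkpoint_distance": "16384"}


def with_tiered_compaction(base_conf):
    """Return a copy of base_conf that selects tiered compaction."""
    conf = copy.deepcopy(base_conf)
    # Value copied verbatim from the diff above.
    conf["compaction_algorithm"] = '{{"kind": "Tiered"}}'
    return conf


tenant_conf = with_tiered_compaction(TENANT_CONF)
```

With this shape the shared `TENANT_CONF` stays unchanged, so other tests importing it would not pick up the override.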

The test_deletion_queue_recovery test ran into all the important issues: previously, it ran into #7244 and #7296.

part of #7554
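
To compare the two reported layers, it helps to split the `layer=` field of the log lines into its parts. The sketch below assumes the layout `<key start>-<key end>__<LSN start>-<LSN end>-<generation>` with all fields in hex; that layout is read off the log lines above and is not confirmed anywhere in this issue. `LayerName` and `parse_layer` are hypothetical names, and the log lines are abbreviated to the relevant tail.

```python
import re
from typing import NamedTuple, Optional


class LayerName(NamedTuple):
    key_start: str
    key_end: str
    lsn_start: int
    lsn_end: int
    generation: int


# Assumed layout: <key start>-<key end>__<LSN start>-<LSN end>-<generation>
LAYER_RE = re.compile(
    r"layer=([0-9A-Fa-f]+)-([0-9A-Fa-f]+)__([0-9A-Fa-f]+)-([0-9A-Fa-f]+)-([0-9A-Fa-f]+)"
)


def parse_layer(line: str) -> Optional[LayerName]:
    """Extract the layer name fields from one log line, or None if absent."""
    m = LAYER_RE.search(line)
    if m is None:
        return None
    key_start, key_end, lsn_start, lsn_end, generation = m.groups()
    return LayerName(
        key_start, key_end, int(lsn_start, 16), int(lsn_end, 16), int(generation, 16)
    )


# Tails of the two ERROR lines from this issue.
LOG_LINES = [
    "duplicated L1 layer layer=000000067F00000005000000000000000001-030000000000000000000000000000000002__0000000001535489-000000000154E229-00000001",
    "duplicated L1 layer layer=000000067F00000005000040000000000001-030000000000000000000000000000000002__0000000001535489-000000000154E229-00000001",
]

layers = [parse_layer(line) for line in LOG_LINES]
same_lsn_range = (layers[0].lsn_start, layers[0].lsn_end) == (
    layers[1].lsn_start,
    layers[1].lsn_end,
)
```

Under that assumed layout, the two layers cover the same LSN range and end at the same key, and differ only in their start key.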


@arpad-m arpad-m added the c/storage/pageserver Component: storage: pageserver label May 10, 2024
@jcsp
Contributor

jcsp commented May 13, 2024

The test_deletion_queue_recovery test ran into all the important issues: previously, it ran into #7244 and #7296.

Can we lift the subset of this test that reproduces these issues into a dedicated compaction test, perhaps as part of the PR fixing this issue?

@arpad-m
Member Author

arpad-m commented May 13, 2024

Can we lift the subset of this test that reproduces these issues into a dedicated compaction test

I could file a PR and then just allow the duplicated L1 layer errors.

@arpad-m
Member Author

arpad-m commented May 14, 2024

I could file a PR and then just allow the duplicated L1 layer errors.

Done: #7758
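
For readers unfamiliar with the idea of "allowing" an error in a test: a log-checking harness typically matches ERROR lines against a list of regexes and only fails on lines that match none of them. The following is a generic sketch of that mechanism, not the code from #7758. `ALLOWED_ERRORS` and `unexpected_errors` are hypothetical names, and the pattern is only a guess at what such an entry could look like.

```python
import re

# Hypothetical allow-list entry; the pattern used in #7758 is not shown here.
ALLOWED_ERRORS = [r".*duplicated L1 layer.*"]


def unexpected_errors(log_lines, allowed=ALLOWED_ERRORS):
    """Return the ERROR lines that match none of the allowed patterns."""
    compiled = [re.compile(pattern) for pattern in allowed]
    return [
        line
        for line in log_lines
        if " ERROR " in line and not any(p.match(line) for p in compiled)
    ]


sample_log = [
    "2024-05-10T22:25:57Z ERROR manual_checkpoint: duplicated L1 layer layer=abc",
    "2024-05-10T22:25:58Z ERROR something else went wrong",
    "2024-05-10T22:25:59Z INFO compaction finished",
]
remaining = unexpected_errors(sample_log)
```

The known error is filtered out, so the test can act as a reproducer while other errors still fail it.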

arpad-m added a commit that referenced this issue May 15, 2024
Adds a test that is a reproducer for many tiered compaction bugs,
both ones that have since been fixed as well as still unfixed ones:
* (now fixed) #7296 
* #7707 
* #7759
* Likely also #7244 but I haven't tried that.

The key ordering bug can be reproduced by switching to
`merge_delta_keys` instead of `merge_delta_keys_buffered`, so reverting
a big part of #7661, although it only sometimes reproduces (30-50% of
cases).

part of #7554
@problame
Contributor

problame commented May 15, 2024

Meeting notes:

  • quite a serious condition: we throw away the second struct Layer but we overwrote the on-disk file
  • => PS PageCache incoherency if the bit pattern of the new file is not identical
  • need to investigate & fix this, it's a potential problem right now (if compact legacy is not bitpattern-deterministic).
  • But let's not get distracted => Goal for tiered compaction: never get into this situation in the first place.
  • Arpad: assumes root cause is too many loop iterations (just speculation though)

Action item: arpad & heikki to understand why it happens.
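
The incoherency described in the notes can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class and its fields are hypothetical, the cache is keyed by file name and offset, and none of it mirrors the real pageserver types.

```python
class ToyPageserver:
    """Toy model of the scenario in the notes; all names are hypothetical."""

    def __init__(self):
        self.disk = {}        # file name -> bytes currently on disk
        self.layer_map = {}   # file name -> first registered layer object
        self.page_cache = {}  # (file name, offset) -> cached byte

    def write_layer(self, name, contents):
        # The on-disk file is overwritten unconditionally ...
        self.disk[name] = contents
        if name in self.layer_map:
            # ... but the second layer object is thrown away
            # ("duplicated L1 layer").
            return False
        self.layer_map[name] = object()
        return True

    def read(self, name, offset):
        key = (name, offset)
        if key not in self.page_cache:
            self.page_cache[key] = self.disk[name][offset:offset + 1]
        return self.page_cache[key]


ps = ToyPageserver()
ps.write_layer("layer-a", b"AAAA")
first_read = ps.read("layer-a", 0)           # caches a byte of the old file
accepted = ps.write_layer("layer-a", b"BBBB")  # rejected, file overwritten
stale = ps.read("layer-a", 0)                # served from cache: old contents
fresh = ps.read("layer-a", 1)                # read from disk: new contents
```

In this model a reader sees a mix of old and new bytes unless both writes produce identical contents, which is why determinism of the output matters.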

@arpad-m arpad-m self-assigned this May 20, 2024
a-masterov pushed a commit that referenced this issue May 20, 2024
Adds a test that is a reproducer for many tiered compaction bugs,
both ones that have since been fixed as well as still unfixed ones:
* (now fixed) #7296 
* #7707 
* #7759
* Likely also #7244 but I haven't tried that.

The key ordering bug can be reproduced by switching to
`merge_delta_keys` instead of `merge_delta_keys_buffered`, so reverting
a big part of #7661, although it only sometimes reproduces (30-50% of
cases).

part of #7554