Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tenant Index Failures for deleted tenants #2878

Closed
zalegrala opened this issue Aug 31, 2023 · 5 comments · Fixed by #3611
Closed

Tenant Index Failures for deleted tenants #2878

zalegrala opened this issue Aug 31, 2023 · 5 comments · Fixed by #3611
Labels
component/compactor keepalive Label to exempt Issues / PRs from stale workflow type/bug Something isn't working

Comments

@zalegrala
Copy link
Contributor

Describe the bug

A change in #2781 was an attempt to fix this, but is incomplete. Also related to #2754

Currently when Tempo attempts to write a tenant index with 0 blocks and 0 compacted blocks, the tenant index is deleted, and errors to delete the already deleted index are now handled since #2781. However, some deleted tenants may still show up in the tenant list if any objects are left within the tenant object path. This results in the poller attempting to read the tenant index and failing (because the index does not exist), and then falling back to a full poll after incrementing the error counter and resulting in incorrect alerts.

Expected behavior

I viewed the change in #2781 as a stop gap, but since it didn't achieve the goal, I think we should discuss. My preferred approach would be to delete all tenant objects left over when we delete the tenant index.

  1. WriteTenantIndex)_ is called for a tenant with zero blocks and zero compacted blocks
  2. RawWriter.Delete() is called for the tenant index
  3. List() at the tenant path to find all remaining objects
  4. Loop over remaining objects and call delete.

This will result in a full tenant deletion, instead of only deleting the index. Note that even if we address the incorrect metric counting for a deleted tenant, we still make the calls to the backend, so my preference would be to delete all the objects from the backend.

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: jsonnet

Additional Context
#2754

@joe-elliott
Copy link
Member

My preferred approach would be to delete all tenant objects left over when we delete the tenant index.

There's a dangerous race condition here where we could be creating a valid block which should not be deleted at the moment we decide to delete the tenant index.

This results in the poller attempting to read the tenant index and failing (because the index does not exist)

Won't this be fixed by the changes to polling you are currently working on? Where the "list" command starts to list meta.json's and meta.compacted.json's directly?

@zalegrala
Copy link
Contributor Author

One of my hopes for the Find() method in the polling improvements PR was to find files that are older than a certain time to avoid such a race. For example, if we were trying to write a 0 block index and instead attempt to delete, the Find() could say "older than 1h" and then we would have a little more safety around race conditions.

What the polling PR doesn't solve for is when a tenant should actually be delete. Left over blocks would still mean the tenant path would show up and would still try to poll for that tenant. There may be a small change we can make to the tenant listing to help here, but perhaps outside of the (growing) polling PR.

Copy link
Contributor

github-actions bot commented Nov 5, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale Used for stale issues / PRs label Nov 5, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 20, 2023
@logamanig
Copy link

any update on this issue?

@joe-elliott joe-elliott added type/bug Something isn't working component/compactor keepalive Label to exempt Issues / PRs from stale workflow and removed stale Used for stale issues / PRs labels Jan 22, 2024
@joe-elliott joe-elliott reopened this Jan 22, 2024
@zalegrala
Copy link
Contributor Author

@logamanig Try disabling blocklist_poll_fallback. If you do, can you let us know how that goes for you?

zalegrala added a commit to zalegrala/tempo that referenced this issue Apr 10, 2024
The tenant index deletion was originally put in as TCO win, but did not
have the desired effect and surfaced other issues in the system.

Related grafana#2678
Related grafana#2754
Related grafana#2781
Related grafana#2878
Related grafana#3115
Related grafana#3223

Due to the number of issues here, and causing considerable noise on the
pager, perhaps the right thing to do is back out the tenant deletion.

Raising here for discussion.
@zalegrala zalegrala mentioned this issue Apr 10, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/compactor keepalive Label to exempt Issues / PRs from stale workflow type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants