Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/catalog/lease: high priority reads in lease acquisition of DROP descs may prevent RENAME/DROP #86324

Open
ajwerner opened this issue Aug 17, 2022 · 3 comments
Labels
A-schema-catalog Related to the schema descriptors collection and the catalog API in general. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Aug 17, 2022

Describe the problem

This is effectively the same problem as #61798. If an application drops a table and then continually attempts to access that table, the drop will not be able to make progress. This is classic live-lock due to mismatched priorities. This is because the code will attempt to perform a lease acquisition and that lease acquisition will push the writer attempting to drop the table. The lease acquisition will ultimately fail, but that's not relevant as the writer will have been pushed. In a cluster without high throughput, many nodes, and a long RTT between those nodes, it's probably the case that the drop will eventually find a moment to commit. It's geo-distributed clusters which will find this issue most problematic. As soon as the attempts to lease the descriptor stop, the drop will be able to proceed.

To Reproduce

Create a geo-distributed cluster with high latency. Run an application which attempts to access a table at high frequency. Drop that table. The drop will hang

Expected behavior

The drop should proceed.

Proposed solution

A possibility is to avoid initially using high priority when leasing and instead set a timeout on locks or something and then switch to high priority after an initial period that is longer than some multiple of the latency diameter of the cluster. Maybe 5s.

Another would be to stop doing this high priority dance in the first place.

Additional Context

There's some related discussion in #54633 and #46414.

Jira issue: CRDB-18697

gz#16237

@ajwerner ajwerner added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 17, 2022
@ajwerner ajwerner added this to Triage in SQL Foundations via automation Aug 17, 2022
@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Aug 17, 2022
@ajwerner
Copy link
Contributor Author

The lease manager does, I believe, have enough information to avoid leasing this descriptor.

@postamar postamar moved this from Triage to Backlog in SQL Foundations Aug 23, 2022
@postamar postamar added the A-schema-catalog Related to the schema descriptors collection and the catalog API in general. label Nov 10, 2022
@postamar postamar moved this from Backlog to Tech Debt That Nobody Outside Of The Team Cares About in SQL Foundations Nov 10, 2022
@ajwerner ajwerner changed the title sql/catalog/lease: high priority reads in lease acquisition of DROP descs may prevent dropping sql/catalog/lease: high priority reads in lease acquisition of DROP descs may prevent RENAME/DROP Feb 8, 2023
@ajwerner
Copy link
Contributor Author

It turns out this is realistically exactly the same problem as #96840. I'm closing that issue as a duplicate but I encourage a reader to read that issue. It has good details.

@ajwerner
Copy link
Contributor Author

The problem here ends up being the namespace lookups here. These are bad in part because they don't use a singleflight -- they should. Even if they did, we'd still have a tight loop resolving a name that does not exist with a high priority transaction; that will starve a writer. This comment is repeating (#96840 (comment))

The solution to starving the writer is a combination of The combination of #95225
and #95227. The proposed band-aid is to keep a cache of missing names and do some backoff when resolving them, gated on a cluster setting.

@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-schema-catalog Related to the schema descriptors collection and the catalog API in general. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
SQL Foundations
  
Tech Debt That Nobody Outside Of The ...
Development

No branches or pull requests

2 participants