New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: deadlock between DDL in high-prio txns and descriptor leases #46414
Comments
See issue cockroachdb#46414 Because lease acquisition must operate at a high prio than schema changes, and SQL "HIGH PRIORITY" operates at the same prio as lease acquisition, we can't let ROLLBACK TO SAVEPOINT after schema change in HIGH PRIORITY txns. Release justification: fixes for high-priority or high-severity bugs in existing functionality Release note (sql change): ROLLBACK TO SAVEPOINT (for either regular savepoint or "restart savepoints" defined with `cockroach_restart`) causes a "feature not supported" error after a DDL statement in a HIGH PRIORITY transaction, in order to avoid a transaction deadlock. See issue cockroachdb#46414 for details.
See issue cockroachdb#46414 Because lease acquisition must operate at a high prio than schema changes, and SQL "HIGH PRIORITY" operates at the same prio as lease acquisition, we can't let ROLLBACK TO SAVEPOINT after schema change in HIGH PRIORITY txns. Release justification: fixes for high-priority or high-severity bugs in existing functionality Release note (sql change): ROLLBACK TO SAVEPOINT (for either regular savepoint or "restart savepoints" defined with `cockroach_restart`) causes a "feature not supported" error after a DDL statement in a HIGH PRIORITY transaction, in order to avoid a transaction deadlock. See issue cockroachdb#46414 for details.
@knz or @andreimatei, can either of you please provide a realistic example of when this deadlock would occur and a recommended workaround? |
Actually, I think I can use the example from one of the linked PRs. Does this look ok?
|
@jseldess I changed the behavior in PR #46415 - the deadlock is now avoided and reported as "unsupported feature". The relevant release note is this:
|
Your text in #46414 (comment) is thus correct and adequate. |
Thanks, @knz. |
@knz as I understand it, this mostly requires adding a new priority level for transactions for lease acquisition to use, so I'm moving it from SQL Schema to KV; let me know if you disagree. |
IT's both adding the new priority and teaching SQL how to use it. It's cross-team if you will. The question is "who should drive this". The current behavior is only visible to SQL users from the fact that certain operations will now be impossible (rejected) when issued within SQL transactions. Who's going to see that pain firsthand? I suppose it's going to be the SQL PM, not the KV PM. Maybe you can assign this to the PM nearest to you to see what they want to do with it. |
For the sake of the recording, one thing we've floated around is for KV to recognize the notion of dependent transactions, and incorporate that knowledge in the pushing protocol: If txn A depends on B and now B is trying to wait on A, don't actually wait. This seemed like a useful concept in general because we have other examples in the system of dependent transactions, and we've hit similar deadlocking issues with them too (e.g. resolving ranges, or any sort of "semi-sync" work). |
summary: ROLLBACK TO SAVEPOINT can cause a deadlock in a high-priority txn that contains DDL, so CockroachDB is set to disallow it altogether for the time being.
Ideally, a better solution would enable lease acquisition to operate at an ever higher priority.
Details
Prior to #46170, we had a deadlock when rolling back a txn for a restart / for a savepoint if the txn contained DDL, because subsequent lease acquisitions would wait on the main SQL txn which encloses it.
(This is because lease acquisition uses a separate kv txn.)
With #46170 committed, the lease acquisition occurs at a higher priority, which guarnatees deadlock avoidance in the common case where DDL is ran at normal priority.
However the deadlock still exists when the DDL is executed at high priority too.
Solution
We need two "high priority" levels, where lease acquisition is at a level higher than anything a SQL txn can be set to.
Jira issue: CRDB-5089
The text was updated successfully, but these errors were encountered: