New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: possible race condition on EXCHANGE TABLES
query
#61135
base: master
Are you sure you want to change the base?
Conversation
This is an automated comment for commit f0d55ea with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
Add a test that has a chance to reproduce the problem at least occasionally. |
src/Storages/IStorage_fwd.h
Outdated
@@ -12,6 +12,7 @@ class IStorage; | |||
|
|||
using ConstStoragePtr = std::shared_ptr<const IStorage>; | |||
using StoragePtr = std::shared_ptr<IStorage>; | |||
using StorageWeakPtr = std::weak_ptr<IStorage>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not allowed to use std::weak_ptr<IStorage>
because we rely on std::shared_ptr<IStorage>::unique()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of saving StoragePtr, we can save StorageID with resolved UUID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's exactly the point to not keep shared_ptr here - since on HTTP requests session can be preserved it would block database from DROP - making it weak_ptr doesn't block it and we don't need to keep it cached beyond session so it should be safe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a race condition
src/Interpreters/Context.cpp
Outdated
@@ -912,6 +912,23 @@ String Context::getFilesystemCacheUser() const | |||
return shared->filesystem_cache_user; | |||
} | |||
|
|||
DatabaseAndTable Context::getOrCacheStorage(const StorageID & id, std::function<DatabaseAndTable()> f, std::optional<Exception> * exception) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could cache it in Context::resolveStorageID
, but a new method is fine too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to enforce this cache everywhere where we call to Context::resolveStorageID
- at least having a query context is a condition we need to use this cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can check this condition in resolveStorageID
src/Interpreters/Context.cpp
Outdated
return storage; | ||
} | ||
|
||
auto storage = f(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider a better name
{ | ||
DatabaseAndTable storage = DatabaseCatalog::instance().tryGetByUUID(storage_id->uuid); | ||
if (exception && !storage.second) | ||
exception->emplace(Exception(ErrorCodes::UNKNOWN_TABLE, "Table {} does not exist anymore - maybe it was dropped", id.getNameForLogs())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, a query will fail if it requests a table the second time, but the table was dropped after the first get. But we can improve it later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to improve it - if table was dropped concurrently then query is better to be interrupted than trying to reposes the table from drop queue - at this point we don't even know who is holding it from physical deletion...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do know, see addStorageHolder
@yakov-olkhovskiy, restarting CI by merging master is pointless. It will not fix the failures. For two reasons: |
@tavplubix Mergeable check was failing because build failures like this:
I'm not sure I see anything related... could you point it out please? |
@yakov-olkhovskiy please check everything, and you will see. If you don't see it - try to debug every failure until you see. Just checking everything (and trying to debug) is always a good idea because we should not ignore failures even if they are unrelated (but in this case some of them are). |
@tavplubix found it... "Tests are not finished" is a confusing status though... |
...fast test is getting hung and logs don't show proper stack traces - I added "force tests" label to check it with thread sanitizer... |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fixed possible race condition when
EXCHANGE TABLES
executed in parallel.closes #61072