[Bug]: Running concurrent tests on a TimescaleDB instance #6848
Comments
We've had a few reports of similar behavior before. I'm looking into the locking to see if we can pinpoint the issue. In the meanwhile, maybe this can help: we add some safeguards in our tests to make sure we can drop databases without issues like this. Will let you know if I have something.
@antekresic thanks for sharing the approach! May I also ask about the migrations situation? It looks like we solved it by locking the `_timescaledb_config.bgw_job` table. I also see there's some test utility being used in the code you shared.
I could also suggest trying the internal functions for stopping and restarting background workers. They should stop all the workers during migrations, and you can re-enable them afterwards.
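For reference, a minimal sketch of what that suggestion looks like in practice. The exact function names were not quoted in the thread, so the ones below are an assumption based on the internal helpers present in TimescaleDB 2.10.x (a later comment confirms they live in the internal schema in 2.10.2):

```sql
-- Stop all TimescaleDB background workers, e.g. before running migrations.
SELECT _timescaledb_internal.stop_background_workers();

-- ... run migrations here, free of contention from background workers ...

-- Re-enable background workers once migrations are done.
SELECT _timescaledb_internal.start_background_workers();
```

Note that internal functions are not a stable API; in more recent TimescaleDB releases internal helpers were reorganized into a different schema, so the qualified names may differ from version to version.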
@antekresic These functions seem to work great! Gave it a couple of runs in CI without any issues. Just a note that it seems (at least in version 2.10.2 of TimescaleDB) that the functions reside in the `_timescaledb_internal` schema. Our flow now on test setup is to stop the background workers and then run migrations.
We then run the custom test code and finally, on test teardown, drop the database. We also experimented with replicating your approach of terminating workers and removing connections from the database before dropping it. Maybe worth mentioning: the test framework we use will drop orphan databases that could not be deleted on a previous test's teardown, so that's another reason why we just do a blind attempt to drop them; if it doesn't work right away, we can rest assured it will be re-attempted.
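The "terminating workers and removing connections before dropping" step mentioned above can be done with stock PostgreSQL facilities. A minimal sketch, assuming the test database is named `testdb` (a placeholder) and that this runs while connected to a different database:

```sql
-- Terminate every remaining session connected to the test database
-- (background workers included), except our own backend.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'testdb'
  AND pid <> pg_backend_pid();

-- With no sessions left, the drop should no longer be blocked.
DROP DATABASE testdb;
```

On PostgreSQL 13 and later (the reporter is on 15.2), `DROP DATABASE testdb WITH (FORCE);` combines both steps into one statement.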
Issue seems to be resolved. Feel free to reopen if you still have problems.
What type of bug is this?
Locking issue, Unexpected error
What subsystems and features are affected?
Background worker, Continuous aggregate
What happened?
We use TimescaleDB and want some of our tests to run against a live database; we use the Docker image for that. Initially our tests ran sequentially, but we would like them to run in isolation so they can run concurrently.
In other databases such as MySQL or vanilla PostgreSQL we would create a new database per test, run migrations on it to get it up to speed, run our test and then finally drop the database.
However, this approach does not seem to play well with TimescaleDB because of its background workers. When running migrations in constrained environments (like the GitHub CI runner) we would hit deadlocks. Reproducing this locally could be done by limiting the memory and CPU of the Docker container.
Reading more about TimescaleDB seems to indicate that it's essentially meant to work with a single database per instance, though I'd love some clarification on whether this is true; the resources I found about this seemed to suggest so.
In any case, to switch approaches we implemented schema-based isolation between tests, but to no avail. Upon looking at the deadlocks (which would now happen on the same database objects every time), we identified the culprit to be `_timescaledb_config.bgw_job`. Issuing

```sql
LOCK TABLE _timescaledb_config.bgw_job IN SHARE ROW EXCLUSIVE MODE;
```

in a transaction before running migrations initially seemed to solve our issues, as there was no longer contention between migrations and background workers. However, we then started seeing occasional deadlocks when issuing

```sql
DROP SCHEMA IF EXISTS {schema_name} CASCADE;
```

Solving this does not seem as simple, though. The issue only happens in CI and I could not reproduce it locally. Logs indicate contention on advisory locks between running background jobs and the job deletions caused by dropping the schema. The relation that surfaced in the logs this time is `_timescaledb_internal.bgw_job_stat`. Taking locks on both `_timescaledb_config.bgw_job` and `_timescaledb_internal.bgw_job_stat` before dropping the schema does not fix the intermittent issue; deadlocks still happen. So at this point we're back to using a database per test: locking `_timescaledb_config.bgw_job` before running migrations, running the test code, and then dropping the database entirely.

I'm reporting this as a bug because I don't think dropping a schema should ever deadlock with running jobs. Apart from that, I'm really curious whether there's a better/suggested/recommended way of doing this (running concurrent tests on a single instance).
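The database-per-test flow described above can be sketched as plain SQL. The database name below is a placeholder, and the migration step stands in for whatever your migration tool runs:

```sql
-- Per-test setup: one throwaway database per test.
CREATE DATABASE test_db_1;  -- hypothetical name

-- Connected to test_db_1: serialize against background workers while
-- migrations touch TimescaleDB catalog tables, by taking the table lock
-- in the same transaction that runs the migrations.
BEGIN;
LOCK TABLE _timescaledb_config.bgw_job IN SHARE ROW EXCLUSIVE MODE;
-- ... run migrations here ...
COMMIT;

-- Run the test code, then on teardown (connected to another database):
DROP DATABASE test_db_1;
```

SHARE ROW EXCLUSIVE mode conflicts with itself and with row-modifying locks, so concurrent migrations and the background workers' job-table updates queue up behind the lock instead of interleaving into a deadlock.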
Any help would be appreciated!
TimescaleDB version affected
2.10.2
PostgreSQL version used
15.2
What operating system did you use?
Ubuntu 22.04 x64
What installation method did you use?
Docker
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
How can we reproduce the bug?