[Bug]: Running concurrent tests on a TimescaleDB instance #6848

Closed
bobozaur opened this issue Apr 22, 2024 · 5 comments
@bobozaur

bobozaur commented Apr 22, 2024

What type of bug is this?

Locking issue, Unexpected error

What subsystems and features are affected?

Background worker, Continuous aggregate

What happened?

We use TimescaleDB and want to have some of our tests run against a live database, using the Docker image for that. Initially we had tests run sequentially, but we would like to get them to run in isolation so they can run concurrently.

In other databases such as MySQL or vanilla PostgreSQL we would create a new database per test, run migrations on it to bring it up to date, run the test, and finally drop the database.
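
For illustration, a minimal sketch of that per-test lifecycle (the database name is hypothetical):

-- from a maintenance connection; CREATE/DROP DATABASE cannot run inside a transaction
CREATE DATABASE test_4f2a;
-- connect to test_4f2a, apply migrations, run the test, then tear down:
DROP DATABASE test_4f2a;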

However, this approach does not seem to play well with TimescaleDB because of its background workers. When running migrations in constrained environments (like the GitHub CI runner) we would hit deadlocks. This can be reproduced locally by limiting the memory and CPU of the Docker container.

Reading more about TimescaleDB seems to indicate that it's essentially meant to work with a single database, though I'd love some clarification on whether this is true. Resources I found about this:

In any case, to switch approaches we implemented schema-based isolation between tests, but to no avail. Upon looking at the deadlocks (which would now happen on the same database objects every time), we identified the culprit to be _timescaledb_config.bgw_job.

Issuing LOCK TABLE _timescaledb_config.bgw_job IN SHARE ROW EXCLUSIVE MODE; in a transaction before running migrations initially seemed to solve our issues, as there was no longer contention between migrations and background workers. However, we then started seeing occasional deadlocks when issuing DROP SCHEMA IF EXISTS {schema_name} CASCADE;.
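
Concretely, the migration step looked roughly like this (a sketch; the actual migration statements are elided):

BEGIN;
-- keep background workers from touching the jobs table mid-migration
LOCK TABLE _timescaledb_config.bgw_job IN SHARE ROW EXCLUSIVE MODE;
-- ... migration statements ...
COMMIT; -- the lock is released when the transaction ends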

Solving this does not seem as simple, though. The issue only happens in CI and I could not reproduce it locally. Logs indicate contention on advisory locks between running background jobs and the job deletions triggered by dropping the schema. The relation that now surfaces in the logs is _timescaledb_internal.bgw_job_stat.

Taking locks on both _timescaledb_config.bgw_job and _timescaledb_internal.bgw_job_stat before dropping the schema does not fix the intermittent issue either; deadlocks still happen. So at this point we're back to using a database per test: locking _timescaledb_config.bgw_job before running migrations, running the test code, and then dropping the database entirely.
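
The drop-side attempt looked something like this (a sketch; the schema name is taken from the logs below):

BEGIN;
LOCK TABLE _timescaledb_config.bgw_job IN SHARE ROW EXCLUSIVE MODE;
LOCK TABLE _timescaledb_internal.bgw_job_stat IN SHARE ROW EXCLUSIVE MODE;
DROP SCHEMA IF EXISTS "harness_2" CASCADE;
COMMIT;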

I'm reporting this as a bug because I don't think that dropping a schema should ever deadlock with running jobs. Beyond that, I'm really curious whether there is a better/suggested/recommended way of running concurrent tests on a single instance.

Any help would be appreciated!

TimescaleDB version affected

2.10.2

PostgreSQL version used

15.2

What operating system did you use?

Ubuntu 22.04 x64

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

Issues running migrations without locks:
Process 240 waits for AccessExclusiveLock on advisory lock [5,1104,0,29749]; blocked by process 244.
Process 244 waits for ShareLock on relation 16567 of database 5; blocked by process 240.

Issues dropping schema without locks:
2024-04-22 15:13:04.651 UTC [157] LOG:  refreshing continuous aggregate "counters_continuous" in window [ 4714-11-24 00:00:00+00 BC, 294247-01-02 00:00:00+00 ]
2024-04-22 15:13:05.588 UTC [89] LOG:  process 89 still waiting for RowShareLock on advisory lock [5,1013,0,29749] after 1000.077 ms
2024-04-22 15:13:05.588 UTC [89] DETAIL:  Process holding the lock: 158. Wait queue: 89.
2024-04-22 15:13:05.649 UTC [158] LOG:  process 158 detected deadlock while waiting for AccessExclusiveLock on advisory lock [5,1014,0,29749] after 1000.085 ms
2024-04-22 15:13:05.649 UTC [158] DETAIL:  Process holding the lock: 157. Wait queue: .
2024-04-22 15:13:05.649 UTC [158] STATEMENT:  DROP SCHEMA IF EXISTS "harness_2" CASCADE;
2024-04-22 15:13:05.649 UTC [158] ERROR:  deadlock detected
2024-04-22 15:13:05.649 UTC [158] DETAIL:  Process 158 waits for AccessExclusiveLock on advisory lock [5,1014,0,29749]; blocked by process 157.
    Process 157 waits for ShareRowExclusiveLock on relation 16584 of database 5; blocked by process 158.
    Process 158: DROP SCHEMA IF EXISTS "harness_2" CASCADE;
    Process 157: CALL _timescaledb_internal.policy_refresh_continuous_aggregate()

Other run logs for the above:
Process 227 waits for AccessExclusiveLock on advisory lock [5,1104,0,29749]; blocked by process 231.
Process 231 waits for ShareRowExclusiveLock on relation 16584 of database 5; blocked by process 227.

Issues dropping schema with locks:
Process 401 waits for AccessExclusiveLock on advisory lock [5,1038,0,29749]; blocked by process 396.
Process 396 waits for ShareLock on relation 16567 of database 5; blocked by process 401.

How can we reproduce the bug?

Running concurrent tests against a constrained Docker container is probably the most reliable way of observing this behavior, although it takes numerous runs for the issue to occur:

docker run --rm -p 5432:5432 -e POSTGRES_PASSWORD="password" --memory="256M" --cpus="0.25" timescale/timescaledb-ha:pg15-latest
@bobozaur bobozaur added the bug label Apr 22, 2024
@antekresic
Contributor

We had a few reports of similar behavior before. I'm looking into the locking to see if we can pinpoint the issue.

In the meantime, maybe this can help. We take some precautions in our tests to make sure we can drop databases without issues like this:
https://github.com/timescale/timescaledb/blob/main/test/sql/pg_dump.sql#L235C1-L243C0
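
For reference, the general shape of that idea, clearing anything still attached to the database before dropping it, is something like the following plain-PostgreSQL sketch (not necessarily identical to the linked test; the database name is hypothetical):

-- run from another database, e.g. postgres
SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
 WHERE datname = 'test_4f2a'
   AND pid <> pg_backend_pid();
DROP DATABASE test_4f2a;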

Will let you know if I have something.

@antekresic antekresic self-assigned this Apr 23, 2024
@bobozaur
Author

@antekresic thanks for sharing the approach!

May I also ask about the migrations situation? We seem to have solved it by locking the _timescaledb_config.bgw_job table, and initially this looked fine, but I can't help wondering now whether there is a better way.

I see there's a test utility being used in the code you shared, ts_bgw_wait. It would be ideal to have a supported mechanism that blocks background workers until the migration SQL has run and only then enables them. Or is locking the background jobs table the recommended way to go about this?

@antekresic
Contributor

I could also suggest trying:

SELECT _timescaledb_functions.stop_background_workers();
SELECT _timescaledb_functions.start_background_workers();

They should stop all the workers during the migration, and you can re-enable them afterwards.

@bobozaur
Author

@antekresic These functions seem to work great! I gave them a couple of runs in CI without any issues. One note: at least in TimescaleDB 2.10.2, the functions seem to reside in the _timescaledb_internal schema.

Our flow now on test setup is:

  • create the test database
  • call _timescaledb_internal.stop_background_workers()
  • run migrations
  • call _timescaledb_internal.start_background_workers()

We then run the custom test code and finally, on test teardown, drop the database. Attempting to replicate your approach of terminating workers and removing connections from the database, or even just calling _timescaledb_internal.stop_background_workers() before dropping the database, seems to cause some timeouts. So we just go ahead and drop it, especially since we know that only a single test works on a given database.

It may also be worth mentioning that the test framework we use drops orphan databases that could not be deleted during a previous test's teardown. That's another reason we make a blind attempt to drop them: if it doesn't work right away, we can rest assured it will be re-attempted.
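
Put together, the whole flow looks roughly like this (a sketch; the database name is hypothetical):

-- setup, from a maintenance connection
CREATE DATABASE test_4f2a;
-- then, connected to test_4f2a:
SELECT _timescaledb_internal.stop_background_workers();
-- ... run migrations ...
SELECT _timescaledb_internal.start_background_workers();
-- ... run the test code ...
-- teardown, back on the maintenance connection; a failed drop is simply
-- retried later by the orphan-database cleanup mentioned above
DROP DATABASE test_4f2a;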

@svenklemm
Member

Issue seems to be resolved. Feel free to reopen if you still have problems.
