Visibility data can become inconsistent #5643

Open
aromanovich opened this issue Apr 2, 2024 · 0 comments

Expected Behavior

Visibility data is eventually consistent.

Actual Behavior

In some of our self-hosted clusters we observed occasional inconsistency between the primary and the advanced visibility storages, e.g. completed workflows being listed as running in ListWorkflowExecutions responses until they are deleted by the retention policy.
The same issue appears to have been reported by other Temporal users on the community forum and in Slack.

Steps to Reproduce the Problem

As far as I understand, the immediate queue implementation and its FIFO scheduler do not guarantee any particular order of task execution. A queue reader selects tasks in batches, and the scheduler dispatches them across multiple workers (512 by default for the visibility processor). If the processing of a single task is delayed for any reason (for example, the database or the workflow context being busy), tasks that come after it can be executed and completed in the meantime.
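
A minimal, self-contained sketch of this effect (not Temporal code; the task names are only labels) shows how in-order dispatch to a worker pool still produces out-of-order completion:

// Minimal sketch (not Temporal code): a FIFO reader dispatches tasks in order
// to a pool of workers, but each worker finishes independently, so a task that
// hits a delay completes after tasks that were read later.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	tasks := []struct {
		name  string
		delay time.Duration
	}{
		{"UpsertExecutionVisibilityTask", time.Second}, // imagine a persistence hiccup here
		{"CloseExecutionVisibilityTask", 0},
	}

	var wg sync.WaitGroup
	for _, t := range tasks { // dispatched in FIFO order...
		wg.Add(1)
		go func(name string, d time.Duration) { // ...but executed concurrently
			defer wg.Done()
			time.Sleep(d)
			fmt.Println("completed:", name) // the close task completes before the delayed upsert
		}(t.name, t.delay)
	}
	wg.Wait()
}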

For a single execution, transfer tasks are serialized by mutable state conditions and database concurrency-control mechanisms; visibility tasks, however, appear to allow an arbitrary order of execution, and the visibility storage does not implement any concurrency control, so a stale write can overwrite a newer one.
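
A toy illustration of what the missing concurrency control means in practice (hypothetical code, not Temporal's storage layer; the record shape and version field are made up):

package main

import "fmt"

// visibilityRecord is a made-up stand-in for a row in the visibility store.
type visibilityRecord struct {
	Status  string
	Version int64 // hypothetical monotonic version, e.g. a task ID or state transition count
}

var store = map[string]visibilityRecord{}

// upsertUnconditional models a store without concurrency control:
// whatever write arrives last wins, however stale it is.
func upsertUnconditional(id string, rec visibilityRecord) {
	store[id] = rec
}

// upsertConditional sketches the kind of check that would reject a stale write.
func upsertConditional(id string, rec visibilityRecord) {
	if cur, ok := store[id]; ok && cur.Version >= rec.Version {
		return // stale write, ignored
	}
	store[id] = rec
}

func main() {
	// The close task lands first, then the delayed upsert arrives with older data.
	upsertUnconditional("wf-1", visibilityRecord{Status: "Completed", Version: 2})
	upsertUnconditional("wf-1", visibilityRecord{Status: "Running", Version: 1})
	fmt.Println("without concurrency control:", store["wf-1"].Status) // Running

	delete(store, "wf-1")
	upsertConditional("wf-1", visibilityRecord{Status: "Completed", Version: 2})
	upsertConditional("wf-1", visibilityRecord{Status: "Running", Version: 1}) // rejected as stale
	fmt.Println("with a version check:", store["wf-1"].Status) // Completed
}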


The branch main...aromanovich:temporal:visibility-race shows how this becomes an issue:
go test . -persistenceType=sql -persistenceDriver=postgres12 -run 'AdvancedVisibilitySuite/TestVisibilityRace' -v fails when the SLOWDOWN_VISIBILITY_UPSERT=1 environment variable is set and succeeds otherwise.

The test is very simple: it starts an execution, changes its search attributes, and terminates it without any delays; it then waits for some time and checks that the visibility data is consistent with the workflow state (a rough sketch of this scenario follows the log excerpt below).
SLOWDOWN_VISIBILITY_UPSERT=1 slows down UpsertExecutionVisibilityTask by one second, imitating an occasional persistence delay.
This delay causes UpsertExecutionVisibilityTask to complete after CloseExecutionVisibilityTask, which leaves the visibility data in an inconsistent state:

2024-04-02T00:19:01.581+0200	info	Sleeping for 1 second	{"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "logging-call-at": "visibility_queue_task_executor.go:237"}
2024-04-02T00:19:01.602+0200	info	processCloseExecution	{"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "startedAt": "2024-04-01T22:19:01.600Z", "finishedAt": "2024-04-01T22:19:01.602Z", "logging-call-at": "visibility_queue_task_executor.go:256"}
2024-04-02T00:19:02.581+0200	info	Finished sleeping for 1 second	{"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "logging-call-at": "visibility_queue_task_executor.go:239"}
2024-04-02T00:19:02.585+0200	info	processUpsertExecution	{"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "startedAt": "2024-04-01T22:19:01.580Z", "finishedAt": "2024-04-01T22:19:02.585Z", "logging-call-at": "visibility_queue_task_executor.go:200"}
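
For reference, the scenario the test exercises looks roughly like this. This is not the functional-test code from the branch, only a sketch against the public Go SDK client API; it assumes a locally running server and worker (worker registration is omitted), and the workflow, task queue, workflow ID, and CustomKeywordField search attribute are placeholders:

package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
)

// RaceWorkflow upserts a search attribute right after starting and then blocks,
// so UpsertExecutionVisibilityTask and (after termination) CloseExecutionVisibilityTask
// are generated close together.
func RaceWorkflow(ctx workflow.Context) error {
	if err := workflow.UpsertSearchAttributes(ctx, map[string]interface{}{
		"CustomKeywordField": "updated",
	}); err != nil {
		return err
	}
	return workflow.Sleep(ctx, time.Hour) // wait to be terminated from the outside
}

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	ctx := context.Background()
	run, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
		ID:        "visibility-race",
		TaskQueue: "visibility-race-tq",
	}, RaceWorkflow)
	if err != nil {
		log.Fatal(err)
	}

	// Terminate without any delays (the real test makes sure the attribute
	// upsert has happened before terminating).
	if err := c.TerminateWorkflow(ctx, run.GetID(), run.GetRunID(), "visibility race test"); err != nil {
		log.Fatal(err)
	}

	// Give the visibility queue time to process both tasks, then check that the
	// listed status matches the actual (terminated) state of the execution.
	time.Sleep(5 * time.Second)
	resp, err := c.ListWorkflow(ctx, &workflowservice.ListWorkflowExecutionsRequest{
		Namespace: "default",
		Query:     `WorkflowId = "visibility-race"`,
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range resp.Executions {
		log.Printf("status in visibility: %v", e.Status) // reports Running when the race is hit
	}
}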

Specifications

  • Version: 1.22.5
  • Platform: All