Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backfill mutex not unlocked when context is canceled in CleanupBackfills #1576

Open
glindstedt opened this issue Jun 5, 2023 · 0 comments · Fixed by #1587
Open

Backfill mutex not unlocked when context is canceled in CleanupBackfills #1576

glindstedt opened this issue Jun 5, 2023 · 0 comments · Fixed by #1587
Labels
kind/bug Something isn't working

Comments

@glindstedt
Copy link

What happened:

A backfill stopped being acknowledged since it was no longer needed. After a while it expired and was scheduled for cleanup. The CleanupBackfills routine was interrupted through the context being cancelled. This led to a mutex lock on the backfill not being unlocked, until the backfillLockTimeout was reached.

Within this period the game server needed backfilling again started trying to re-acknowledge the backfill, but ran into continuous errors of rpc error: code = Unknown desc = redsync: failed to acquire lock until the lock timeout was reached and we finally got a response that it had expired. Only then could it react to that state and trigger creation of a new backfill. This is in part an issue with our current implementation, which is quite early stage, and we don't implement deleting of backfills right now and just let them expire when they're not needed.

However I do believe there's a bug with the mutex handling in DeleteBackfillsCompletely where the defer func to release the lock is using the same context that is provided, which in the case of a cancellation will be canceled, so I think it will never properly release the lock in that case: https://github.com/googleforgames/open-match/blob/main/internal/statestore/backfill.go#L245

It might be better to give it context.Background() or a new context with a timeout.

Logs:
image
image

What you expected to happen:

Cancelling the context should not leave mutexes in a locked state.

How to reproduce it (as minimally and precisely as possible):

Not quite sure what caused the context cancellation in the first place, looks like it happens when all the callers (director FetchMatches calls) are done.

Anything else we need to know?:

Output of kubectl version:

Client Version: v1.27.1
Kustomize Version: v5.0.1
Server Version: v1.24.11-gke.1000

Cloud Provider/Platform (AKS, GKE, Minikube etc.):

GKE

Open Match Release Version:

1.5.0

Install Method(yaml/helm):

Helm to generate yaml applied with kubectl.

@glindstedt glindstedt added the kind/bug Something isn't working label Jun 5, 2023
@joeholley joeholley linked a pull request Aug 17, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
1 participant