feat(manager) block synchronizer on cache readiness #5483
+13
−7
What this PR does / why we need it:
Removes the time-based startup delay in favor of obtaining the manager cache and calling its wait function during synchronizer start. In situations where the cache has not fully synced, the dataplane synchronizer will block until it has. If the cache never syncs within its grace period, a cache sync timeout will terminate the controller, preventing it from ever syncing config.
This better addresses situations where the controller may incorrectly delete configuration upon gaining leadership because it cannot yet see all resources on the cluster. Configuration previously added by another controller replica should remain intact.
The previous naive time-based delay addressed some of these cases, but relied on an arbitrary "good enough" wait time. As far as I know, that delay did work in practice during normal startup. However, it did not help when the cache sync got stuck, such as when the controller role was broken or when API resources were not defined.
Which issue this PR fixes:
Fix #2315
Special notes for your reviewer:
I could have sworn that we had no means of getting the manager cache, and that it was reserved for the manager's internal use. I then saw a GetCache() call in one of the controllers and re-checked: apparently it has (maybe always?) been accessible despite being declared in an internal.go.

Manual testing supports the notion that this works in problem scenarios. I deployed a test image to a normal install and confirmed successful start and config population. While watching the admin API for configuration changes, I deleted the Ingress permissions from the controller role, scaled to 0, and scaled back to 1. The new replica started and blocked without syncing configuration, before eventually hitting a cache sync timeout and dying.
PR Readiness Checklist:
Complete these before marking the PR as ready to review:

- CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR