Pod stuck in ContainerCreating after too many range is full errors #4218

Open
freedge opened this issue Mar 12, 2024 · 2 comments
Labels
kind/support, pods

Comments

@freedge

freedge commented Mar 12, 2024

If a pod repeatedly fails to get an IP because of an "err: range is full" error, it stays stuck in ContainerCreating forever.

As implemented in 906a598, the retries continue for around 15 minutes until the final attempt; after that the pod hangs until ovnkube-controller is restarted or the pod is deleted.
Since the pod status says "Creating", I believe ovnk should keep trying.

@tssurya
Member

tssurya commented Mar 12, 2024

@freedge: thanks for the issue!
I agree that in an ideal k8s world we should probably keep retrying. In ovnkube today, however, retries are capped at 15: a retry is triggered every 30 seconds and there is a backoff algorithm as well, so in total it amounts to many minutes of retrying, which can flood large environments. That is why we added the cap and made that fix to suppress those logs instead of retrying indefinitely. cc @ricky-rav PTAL

So in this case, if the range really is full and we can't do anything about it, I think the admin should react to the triggered "subnet full alert" and take the necessary action, which would retrigger events.

However, we can revisit this cap and explore making this truly level-driven if it needs fine-tuning.
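
For readers following along, the capped retry described above can be sketched roughly as follows. This is a minimal illustration, not ovn-kubernetes code: `retryWithCap` and `errRangeFull` are hypothetical names, the cap of 15 is taken from this comment, and the 30-second interval and backoff are left out so the example runs instantly.

```go
package main

import (
	"errors"
	"fmt"
)

// errRangeFull stands in for the "range is full" allocation error.
// The variable name is hypothetical.
var errRangeFull = errors.New("range is full")

// retryWithCap calls op until it succeeds or maxAttempts attempts have failed.
// It returns the number of attempts made and, on failure, the last error
// wrapped with a "giving up" message. It assumes maxAttempts >= 1.
// A real retry loop would wait between attempts; that is omitted here.
func retryWithCap(op func() error, maxAttempts int) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = op(); lastErr == nil {
			return attempt, nil
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	// A subnet that never frees an address: every attempt fails.
	attempts, err := retryWithCap(func() error { return errRangeFull }, 15)
	fmt.Println(attempts, err)

	// A subnet that frees an address before the third attempt.
	calls := 0
	attempts, err = retryWithCap(func() error {
		calls++
		if calls < 3 {
			return errRangeFull
		}
		return nil
	}, 15)
	fmt.Println(attempts, err)
}
```

Once the cap is hit nothing schedules another attempt, which matches the behavior reported in this issue: the pod stays in ContainerCreating until something else triggers a new event.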

@tssurya added the services/endpoints, pods, and kind/support labels and removed the services/endpoints label on Mar 12, 2024
@ricky-rav
Contributor

Even in level-driven controllers there's usually a cap on the number of retries (re-queues). We can revisit our max of 15 if there's a specific need, but I think it's a lot more complicated to handle things if we /never/ give up on an add/update/delete operation than if we give up after n failed attempts.
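
A rough sketch of the capped re-queue pattern mentioned here, using only the standard library. The `requeueTracker` type and its method are hypothetical and are not the ovn-kubernetes retry framework; the cap of 15 is the value discussed in this thread. The idea is that a key is re-queued while its failure count is below the cap, and dropped afterwards, so only a fresh event gives it a new budget.

```go
package main

import (
	"errors"
	"fmt"
)

// maxRequeues is the cap discussed in this thread.
const maxRequeues = 15

// errRangeFull stands in for the allocation failure (hypothetical name).
var errRangeFull = errors.New("range is full")

// requeueTracker counts consecutive failed syncs per object key.
type requeueTracker struct {
	failures map[string]int
}

func newRequeueTracker() *requeueTracker {
	return &requeueTracker{failures: make(map[string]int)}
}

// handleResult records the outcome of one sync of key and reports whether
// the key should be put back on the queue.
func (r *requeueTracker) handleResult(key string, err error) bool {
	if err == nil {
		delete(r.failures, key) // success: forget past failures
		return false
	}
	r.failures[key]++
	if r.failures[key] >= maxRequeues {
		// Give up: drop the counter. Only a new event for this key
		// will start another round of attempts.
		delete(r.failures, key)
		return false
	}
	return true
}

func main() {
	tracker := newRequeueTracker()
	syncs := 0
	for {
		syncs++
		if !tracker.handleResult("default/mypod", errRangeFull) {
			break
		}
	}
	fmt.Println("syncs before giving up:", syncs)
}
```

Never giving up would mean removing the `maxRequeues` branch, at which point every permanently failing key stays in the queue indefinitely; that is the extra complexity referred to above.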
