
The spec.template.spec.terminationGracePeriodSeconds: 3600 setting has no effect #28

Open
eugen-nw opened this issue Dec 11, 2019 · 17 comments

@eugen-nw

eugen-nw commented Dec 11, 2019

My container runs a Windows console application in an Azure Kubernetes Service instance. I subscribe via SetConsoleCtrlHandler, catch the CTRL_SHUTDOWN_EVENT (6), and call Thread.Sleep(TimeSpan.FromSeconds(3600)); in the handler so that SIGKILL is not sent to the container. The container does receive the CTRL_SHUTDOWN_EVENT and, on a separate thread, logs one message per second to show how long it keeps waiting.
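For reference, the handler subscription looks roughly like this (a minimal sketch assuming a .NET console application with a P/Invoke declaration; the names and structure of the real application may differ):

using System;
using System.Runtime.InteropServices;
using System.Threading;

class Program
{
    // P/Invoke declaration for the Win32 console control handler API.
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool SetConsoleCtrlHandler(ConsoleCtrlDelegate handler, bool add);

    delegate bool ConsoleCtrlDelegate(int ctrlType);

    const int CTRL_SHUTDOWN_EVENT = 6;

    // Keep a reference to the delegate so it is not garbage collected.
    static readonly ConsoleCtrlDelegate Handler = OnConsoleCtrl;

    static bool OnConsoleCtrl(int ctrlType)
    {
        if (ctrlType == CTRL_SHUTDOWN_EVENT)
        {
            // Hold the shutdown notification so the process keeps running
            // while in-flight work finishes (up to one hour here).
            Thread.Sleep(TimeSpan.FromSeconds(3600));
        }
        return true; // report the event as handled
    }

    static void Main()
    {
        SetConsoleCtrlHandler(Handler, true);
        // ... application work ...
    }
}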

I'm adding the required registry settings:

USER ContainerAdministrator
RUN reg add hklm\system\currentcontrolset\services\cexecsvc /v ProcessShutdownTimeoutSeconds /t REG_DWORD /d 3600 && \
    reg add hklm\system\currentcontrolset\control /v WaitToKillServiceTimeout /t REG_SZ /d 3600000 /f
ADD publish/ /

I verified this by running the container on my computer; 'docker stop -t <seconds>' achieves the delayed shutdown.
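(As an illustration, with an hour-long timeout the local verification would look something like 'docker stop -t 3600 <container-name>'; the container's per-second log messages then keep appearing until the handler returns or the timeout elapses.)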

The relevant fragment of the .yaml deployment file:

spec:
  replicas: 1
  selector:
    matchLabels:
      app: aks-aci-boldiq-external-solver-runner
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: aks-aci-boldiq-external-solver-runner
    spec:
      terminationGracePeriodSeconds: 3600
      containers:
      - image: ...
        imagePullPolicy: Always
        name: boldiq-external-solver-runner
        resources:
          requests:
            memory: 8G
            cpu: 1
      imagePullSecrets:
        - name: docker-registry-secret-official
      nodeName: virtual-kubelet-aci-connector-windows-windows-westus

After deployment I ran the 'kubectl get pod aks-aci-boldiq-external-solver-runner-69bf9cd949-njzz2 -o yaml' command and verified that the setting below is present in the output:

  terminationGracePeriodSeconds: 3600

If I do 'kubectl delete pod', the container stays alive only for the default 30 seconds instead of the 1 hour that I want. Could the problem be in the VK, or could this behavior be caused by AKS?
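(For comparison, on a regular VM-based node the grace period can also be overridden per deletion, e.g. 'kubectl delete pod <pod-name> --grace-period=3600'; the value in the pod spec above should already cover this, so the flag is mentioned only as an illustration.)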

macolso added the kind/bug (Something isn't working) label on Dec 19, 2019
@ibabou
Contributor

ibabou commented Dec 20, 2019

@eugen-nw, this is actually not supported today, on two levels:
1. The VK package itself, up to v1.1 (the version currently used by the Azure virtual kubelet), didn't honor that setting. It simply calls delete pod on the provider and always sets 30 seconds. This got updated in v1.2, so we should be able to utilize it in the future. The other thing is that our provider doesn't send any updates about the pod after the delete call, so the pod actually gets deleted immediately even though 30 seconds is shown on the K8s side. This latter point is going to be fixed shortly; I'm currently working on an update for that.
2. This is the main problem: ACI doesn't provide a way to configure how termination should be handled, or what grace period to use if one is specified. The delete operation is synchronous too, so the actual resource is removed regardless of the pod cleanup that gets triggered on ACI's backend. We're aware of the limitations on ACI, but until these are supported, the fixes mentioned in (1) won't make a difference. @macolso, the async deletion is coming with the new API, but I remember you/Deep mentioning termination handling. Can you please elaborate on whether it is planned for next semester?

@eugen-nw
Author

Thanks very much for having looked into this! When will this issue be fixed, please? Our major customer is not pleased that some of their long-running computations get killed midway through and need to be restarted on a different container.

@eugen-nw
Author

@ibabou, your answer 2 above implies that even if we used Linux containers running on virtual-node-aci-linux we would run into exactly the same problem. I assume that virtual-node-aci-linux is the equivalent Linux ACI connector. Are both of these statements correct?

@ibabou
Contributor

ibabou commented Jan 11, 2020

@eugen-nw if you mean the grace period and the wait on container termination on ACI's side, yeah, that's not currently supported for either Linux or Windows.

ibabou added the kind/enhancement (New feature or request) label and removed the kind/bug (Something isn't working) label on Jan 11, 2020
@eugen-nw
Author

Thanks very much, that's what I was asking about. That's very bad behavior on ACI's side. Do they plan to fix it?

@ibabou
Contributor

ibabou commented Jan 11, 2020

Our team owns both the ACI service and the AKS-VK integration, but I don't have an ETA for that feature. I'll let @dkkapur @macolso elaborate more.

@dkkapur

dkkapur commented Jan 13, 2020

@eugen-nw indeed :( we're looking into fixing this in the coming months on ACI's side. Hope to have an update for you in terms of a concrete timeline shortly.

@eugen-nw
Author

@dkkapur: THANKS VERY MUCH for planning to address this problem soon! This is a major issue for our largest customer.

We scale our processing on demand, based on workload sent to the containers through a Service Bus queue. There are two distinct types of processing: 1) under 2 minutes (the majority) and 2) over 40 minutes (occurs now and then). Whenever the AKS HPA scales down, it kills the containers that it spun up during scale-up. If any of the long processing operations happens to land on one of those scale-up containers, it gets aborted, and currently we have no way of avoiding that. We've designed the solution so that the processing restarts on another container, but our customer is definitely not happy that the 40-minute processing may occasionally end up running for much longer.

@macolso
Contributor

macolso commented Jan 13, 2020

Ya - I've been working on enabling graceful termination / lifecycle hooks for ACI. If you want to talk more about your use case, I'd love to set up some time - shoot me an email at macolso@microsoft.com

@AlexeyRaga

Bumping into the same issue with the auto scaler.


Four months have passed; are there any known workarounds, or an ETA for the fix?

@AlexeyRaga

@dkkapur @macolso @ibabou Sorry for bumping this again; it hurts us quite a lot here. Any news on this front?

@eugen-nw
Author

eugen-nw commented Jul 9, 2020

Probably customer focus is no longer trendy these days? I’ll check out the AWS offerings and will report back.

@macolso
Contributor

macolso commented Jul 9, 2020

Hi @AlexeyRaga, unfortunately there is no concrete ETA we can share at this point. We're happy to hop on a call and talk a bit about the product roadmap though - email shared above ^^

@asipras

asipras commented May 3, 2021

It is a big drawback that pods scheduled on a virtual node do not support Pod Lifecycle Hooks or terminationGracePeriodSeconds. This functionality is needed to stop pods from getting terminated mid-work during scale-in.

Is there any timeline for implementing this? @macolso
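For context, on regular VM-based nodes the combination mentioned above is expressed in the pod spec roughly like this (a hedged sketch; the sleep-based preStop hook assumes a Linux image with a shell, and the container name and values are illustrative):

spec:
  terminationGracePeriodSeconds: 3600
  containers:
  - name: worker
    image: ...
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 3600"]

On virtual nodes backed by ACI, neither field currently has any effect, which is the gap this issue tracks.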

@rustlingwind

rustlingwind commented Jul 15, 2021

Does the terminationGracePeriodSeconds work for aws eks pods on fargate ? Fargate nodes also looks like a kind of virtual nodes.

dkkapur assigned macolso and unassigned dkkapur and ibabou on Jul 16, 2021
helayoty added this to Needs triage in Bug Triage on Apr 18, 2022
@Andycharalambous

Any progress on this at all yet? It's over 2 years since the last update.

@helayoty
Member

Hey @Andycharalambous, we will start working on it soon; no ETA yet.
