KPA and cluster autoscaler compatibility #14939

Open
hyde404 opened this issue Feb 23, 2024 · 1 comment
Labels
kind/question Further information is requested

Comments

@hyde404

hyde404 commented Feb 23, 2024

Ask your question here:

Hello,

I'm setting up an infrastructure based on scale-to-zero, and therefore scale-from-zero as well.
To do this, we're using the now-familiar cluster autoscaler, coupled with Cluster API (specifically the MachineDeployment resource with some annotations).
The node scaling is working fine.
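
For context, the MachineDeployment is annotated roughly as follows (an illustrative sketch rather than the exact manifest; names and values are placeholders):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: gpu-nodes
  annotations:
    # node-group bounds for the cluster autoscaler; a min size of 0 is what enables scale-from-zero
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "30"
    # capacity hints so pending pods can be matched against a node group that currently has 0 nodes
    capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
    capacity.cluster-autoscaler.kubernetes.io/memory: 96Gi
    capacity.cluster-autoscaler.kubernetes.io/gpu-count: "1"
    capacity.cluster-autoscaler.kubernetes.io/gpu-type: nvidia.com/gpu
# (spec omitted)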

For the moment, I'm just trying to create an "autoscale-go" Knative service on a cluster where no node is currently available.
The pod then sits in "Pending", which is expected.

NAME                                             READY   STATUS    RESTARTS   AGE
user-service-00001-deployment-6f6d577c45-rtjvz   0/2     Pending   0          1m32s

Here is the configuration I used to create the service:

apiVersion: v1
kind: Namespace
metadata:
  name: 6d2ef157
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: user-service
  namespace: 6d2ef157
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/max-scale: "10"
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/scale-down-delay: "15m"
        autoscaling.knative.dev/window: "240s"
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1800s"
      creationTimestamp: null
    spec:
      containerConcurrency: 50
      containers:
      - env:
        - name: TARGET
          value: Sample
        image: ghcr.io/knative/autoscale-go:latest
        name: app
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources:
          limits:
            cpu: "12"
            memory: 78Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "12"
            memory: 78Gi
            nvidia.com/gpu: "1"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - CAP_SYS_ADMIN
          runAsNonRoot: true
          runAsUser: 1000
          seccompProfile:
            type: RuntimeDefault
      enableServiceLinks: false
      nodeSelector:
        nvidia.com/gpu.count: "1"
        nvidia.com/gpu.product: NVIDIA-GeForce-RTX-2080-Ti
      runtimeClassName: nvidia
      timeoutSeconds: 1800
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
  traffic:
  - latestRevision: true
    percent: 100

After a few minutes, the pod is still pending, but we get an event that says the cluster autoscaler has been triggered.

Normal   TriggeredScaleUp  2m16s  cluster-autoscaler  pod triggered scale-up: [{MachineDeployment/gpu-nodes 0->1 (max: 30)}]
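
For reference, an event like this can be retrieved with something along these lines (namespace taken from the manifest above):

kubectl describe pod user-service-00001-deployment-6f6d577c45-rtjvz -n 6d2ef157
kubectl get events -n 6d2ef157 --sort-by=.lastTimestamp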

Once the node is available, the pod is scheduled and ends up running.

NAME                                             READY   STATUS    RESTARTS   AGE
user-service-00001-deployment-6f6d577c45-rtjvz   2/2     Running   0          6m7s

However, the service is not ready, and the revision never becomes ready (the service reports RevisionMissing).

NAME           URL                               LATESTCREATED        LATESTREADY   READY   REASON
user-service   http://6d2ef157.some.domain.net   user-service-00001                 False   RevisionMissing

NAME                 CONFIG NAME    K8S SERVICE NAME   GENERATION   READY   REASON          ACTUAL REPLICAS   DESIRED REPLICAS
user-service-00001   user-service                      1            False   Unschedulable   1                 0
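
These outputs come from roughly the following commands (exact flags may differ):

kubectl get ksvc -n 6d2ef157
kubectl get revision -n 6d2ef157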

These are the events I get from the revision:

Warning  InternalError  7m29s  revision-controller  failed to update deployment "user-service-00001-deployment": Operation cannot be fulfilled on deployments.apps "user-service-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
Warning  InternalError  7m29s  revision-controller  failed to update PA "user-service-00001": Operation cannot be fulfilled on podautoscalers.autoscaling.internal.knative.dev "user-service-00001": the object has been modified; please apply your changes to the latest version and try again  

The PodAutoscaler resource is not ready, and the DesiredScale is 0.

NAME                 DESIREDSCALE   ACTUALSCALE   READY   REASON
user-service-00001   0              1             False   NoTraffic

And here is the status of the PodAutoscaler resource:

Status:
  Actual Scale:  1
  Conditions:
    Last Transition Time:  2024-02-23T16:32:02Z
    Message:               The target is not receiving traffic.
    Reason:                NoTraffic
    Status:                False
    Type:                  Active
    Last Transition Time:  2024-02-23T16:32:02Z
    Message:               The target is not receiving traffic.
    Reason:                NoTraffic
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-02-23T16:38:03Z
    Status:                True
    Type:                  SKSReady
    Last Transition Time:  2024-02-23T16:32:02Z
    Status:                True
    Type:                  ScaleTargetInitialized
  Desired Scale:           0
  Metrics Service Name:    user-service-00001-private
  Observed Generation:     2
  Service Name:            user-service-00001

I also got error logs from the autoscaler pod:

{"severity":"ERROR","timestamp":"2024-02-23T15:55:24.847361414Z","logger":"autoscaler","caller":"podautoscaler/reconciler.go:314","message":"Returned an error","commit":"239b73e","knative.dev/controller":"knative.dev.serving.pkg.reconciler.autoscaling.kpa.Reconciler","knative.dev/kind":"autoscaling.internal.knative.dev.PodAutoscaler","knative.dev/traceid":"2c39855d-329c-43a0-99a9-204f4944e4af","knative.dev/key":"3010eb09/user-service-00001","targetMethod":"ReconcileKind","error":"error scaling target: failed to get scale target {Deployment  user-service-00001-deployment  apps/v1  }: error fetching Pod Scalable 3010eb09/user-service-00001-deployment: deployments.apps \"user-service-00001-deployment\" not found","stacktrace":"knative.dev/serving/pkg/client/injection/reconciler/autoscaling/v1alpha1/podautoscaler.(*reconcilerImpl).Reconcile\n\tknative.dev/serving/pkg/client/injection/reconciler/autoscaling/v1alpha1/podautoscaler/reconciler.go:314\nmain.(*leaderAware).Reconcile\n\tknative.dev/serving/cmd/autoscaler/leaderelection.go:44\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:491"}
{"severity":"ERROR","timestamp":"2024-02-23T15:55:24.847442144Z","logger":"autoscaler","caller":"controller/controller.go:566","message":"Reconcile error","commit":"239b73e","knative.dev/controller":"knative.dev.serving.pkg.reconciler.autoscaling.kpa.Reconciler","knative.dev/kind":"autoscaling.internal.knative.dev.PodAutoscaler","knative.dev/traceid":"2c39855d-329c-43a0-99a9-204f4944e4af","knative.dev/key":"3010eb09/user-service-00001","duration":"787.035µs","error":"error scaling target: failed to get scale target {Deployment  user-service-00001-deployment  apps/v1  }: error fetching Pod Scalable 3010eb09/user-service-00001-deployment: deployments.apps \"user-service-00001-deployment\" not found","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:491"}

The PodAutoscaler resource spec:

spec:
  containerConcurrency: 50
  protocolType: http1
  reachability: Unreachable
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service-00001-deployment

I then manually changed reachability from "Unreachable" to "" and desiredScale from "0" to "1", after which everything became ready:

NAME                 CONFIG NAME    K8S SERVICE NAME   GENERATION   READY   REASON   ACTUAL REPLICAS   DESIRED REPLICAS
user-service-00001   user-service                      1            True             1                 1

NAME           URL                               LATESTCREATED        LATESTREADY          READY   REASON
user-service   http://6d2ef157.some.domain.net   user-service-00001   user-service-00001   True
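
For reference, the equivalent of that manual edit would be roughly the following patches (hypothetical commands; desiredScale lives in the status subresource, so a recent kubectl with --subresource support is needed):

kubectl patch podautoscaler user-service-00001 -n 6d2ef157 \
  --type merge -p '{"spec":{"reachability":""}}'
kubectl patch podautoscaler user-service-00001 -n 6d2ef157 --subresource=status \
  --type merge -p '{"status":{"desiredScale":1}}'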

The configuration I tried

I started playing with the configuration in an attempt to find the parameter that would unlock everything, but without success. Please note that the values are intentionally exaggerated to highlight a pattern.

config-autoscaler:

apiVersion: v1
data:
  allow-zero-initial-scale: "true"
  enable-scale-to-zero: "true"
  initial-scale: "0"
  scale-down-delay: 15m
  scale-to-zero-grace-period: 1800s
  scale-to-zero-pod-retention-period: 1800s
  stable-window: 360s
  target-burst-capacity: "211"
  window: 240s
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving

config-deployment:

apiVersion: v1
data:
  progress-deadline: 3600s
  queue-sidecar-image: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:d569f30abd31cbe105ba32b512a321dd82431b0a8e205bebf14538fddb4dfa54
  queueSidecarImage: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:9b8dad0630029dfcab124e6b4fa7c8e39b453249f0b31282c48e008bfc16faa3
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving

config-defaults:

apiVersion: v1
data:
  max-revision-timeout-seconds: "3600"
  revision-response-start-timeout-seconds: "1800"
  revision-timeout-seconds: "1800"
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: knative-serving

The problem I'm facing

I'm not sure what I'm doing wrong. It looks as though the revision is never reconciled into a ready state, but I can't tell for certain.
The pod is running and the service is created, but the revision never becomes ready, which is why the service is not ready, and that is a bit of a mystery.

Could you please help me understand what is wrong with my configuration?

hyde404 added the kind/question label on Feb 23, 2024
@JunfeiZhang

Hi @hyde404, we are facing the same issue. Have you resolved it?
