Increase backoff limit in K8s jobs #5032

Closed
dcahillm1 opened this issue May 6, 2024 · 4 comments · Fixed by #5044
Labels: enhancement (Polish or UX improvements)

Comments

dcahillm1 commented May 6, 2024

Is your feature request related to a problem? Please describe.
Some of our pipelines have been failing with: "Job has reached the specified backoff limit".

Hi, we are trying to add a field like ttlSecondsAfterFinished: 100 to the jobs generated by the pipelines, since currently they are not automatically deleted when they fail. Last week we hit the maximum number of jobs in the namespace and new pipelines could not be launched. Is there any way you plan to handle this? Or is there already something and we're just not doing it right?

Describe the solution you'd like
Is it possible to increase the backoff limit?
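For reference, the knobs involved are standard fields on the Kubernetes Job spec itself. A minimal plain-Job sketch (the name and image are placeholders, not Mage-specific configuration):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-pipeline-job        # placeholder name
spec:
  backoffLimit: 3                   # retries before the Job is marked failed
  ttlSecondsAfterFinished: 100      # delete the Job this many seconds after it finishes
  activeDeadlineSeconds: 120        # hard wall-clock limit for the whole Job
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pipeline
          image: mageai/mageai:latest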

@wangxiaoyou1993 added the enhancement (Polish or UX improvements) label on May 6, 2024
edulodgify commented May 7, 2024

I've been trying to configure the backoff limit in the k8s executor template, but every time I add the field it is ignored. I tried it as shown in the screenshot, and also inside the pod and container fields. I also tried backoff restart policies, but I got the same result with all of them.

(screenshot: attempted backoff configuration in the k8s executor template)

reference: https://kubernetes.io/docs/concepts/workloads/controllers/job/
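For illustration, these are roughly the placements that were tried and ignored (the field name is a guess mirroring the executor template shown later in this thread, not a documented option):

# at the top level of the executor template (ignored)
backoff_limit: 3

# under the pod section (ignored)
pod:
  backoff_limit: 3

# under the container section (ignored)
container:
  backoff_limit: 3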

artche commented May 8, 2024

Hi @wangxiaoyou1993,
I prepared a simple solution in PR #5044.
Feel free to use it.

artche commented May 10, 2024

@dcahillm1 @edulodgify
Check out this configuration in the next release:

k8s_executor_config:
  ...
  job_config:
    active_deadline_seconds: 120
    backoff_limit: 3
    ttl_seconds_after_finished: 86400

https://docs.mage.ai/production/configuring-production-settings/compute-resource#kubernetes-executor
Note that only active_deadline_seconds, backoff_limit, and ttl_seconds_after_finished are supported.
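For anyone landing here, a slightly fuller sketch of how this might look, assuming it lives under k8s_executor_config in the project's metadata.yaml as in the linked docs (values are placeholders; the comments show the Job spec field each key presumably maps to):

k8s_executor_config:
  job_config:
    active_deadline_seconds: 120        # -> spec.activeDeadlineSeconds on the generated Job
    backoff_limit: 3                    # -> spec.backoffLimit
    ttl_seconds_after_finished: 86400   # -> spec.ttlSecondsAfterFinished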

@edulodgify

Hi @artche, we have been trying to implement the fix, but it doesn't seem to be working, at least for us. I don't know if it could be because we are not using this kind of configuration:

k8s_executor_config:
  ...
  job_config:
    active_deadline_seconds: 120
    backoff_limit: 3
    ttl_seconds_after_finished: 86400

We are using a k8s configuration template like this:

# Kubernetes Configuration Template
metadata:
  annotations:
    application: "mage"
    composant: "executor"
  labels:
    application: "mage"
    type: "spark"
  namespace: "default"
pod:
  service_account_name: ""
  image_pull_secrets: "secret"
  volumes:
  - name: data-pvc
    persistent_volume_claim:
      claim_name: pvc-name
container:
  name: "mage-data"
  env:
    - name: "KUBE_NAMESPACE"
      value: "default"
    - name: "secret_key"
      value: "somesecret"
  image: "mageai/mageai:latest"
  image_pull_policy: "IfNotPresent"
  resources:
    limits:
      cpu: "1"
      memory: "1Gi"
    requests:
      cpu: "0.1"
      memory: "0.5Gi"
  volume_mounts:
    - mount_path: "/tmp/data"
      name: "data-pvc"

We have tried putting job_config inside the container field, inside the pod field, and outside all of the fields, as it appears in what you shared.
If we put it inside container or pod, those lines are completely ignored, and if we leave it outside, as you shared, the jobs are not even created in k8s.
Any clue about what could be happening, or how we should implement it?
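Purely as a guess from the snippets above: if job_config belongs to k8s_executor_config (in the project's metadata.yaml) rather than to this per-pod template file, the template would stay as-is and the block would go alongside the existing executor settings instead, e.g.:

k8s_executor_config:
  ...                                   # your existing executor settings
  job_config:
    active_deadline_seconds: 120
    backoff_limit: 3
    ttl_seconds_after_finished: 86400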
