
Dkron can't be safely used in k8s at the moment #1442

Open
ivan-kripakov-m10 opened this issue Dec 15, 2023 · 11 comments
Comments

ivan-kripakov-m10 commented Dec 15, 2023

hi!

Is your feature request related to a problem? Please describe.
At the moment, Dkron cannot be safely used in k8s because Dkron servers cannot handle IP changes.
To reproduce, deploy Dkron using the current Helm chart, shut down the cluster, and redeploy it.
Nodes will try to reconnect to each other using their old IPs, and this process never succeeds.

Describe the solution you'd like
I think the consul-like approach can be used: hashicorp/consul#3403
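To make the idea concrete, here is a minimal sketch (in Go, since that is Dkron's language) of what such a consul-like retry-join could look like. All names here are hypothetical, not Dkron's actual API; the key point is that peer hostnames are re-resolved through DNS on every attempt instead of caching the first answer, so restarted pods with new IPs can still be found.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// resolvePeers turns a list of peer hostnames into fresh IP addresses.
// The point: never cache the first answer, ask DNS again on every retry.
func resolvePeers(hosts []string) []string {
	var addrs []string
	for _, h := range hosts {
		ips, err := net.LookupHost(h)
		if err != nil {
			continue // peer temporarily unresolvable, try again next round
		}
		addrs = append(addrs, ips...)
	}
	return addrs
}

// retryJoin keeps trying to join the cluster, re-resolving names each attempt.
func retryJoin(hosts []string, join func([]string) error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = join(resolvePeers(hosts)); err == nil {
			return nil
		}
		time.Sleep(time.Second) // back off before the next DNS lookup
	}
	return err
}

func main() {
	// "localhost" stands in for a headless-service DNS name like tasks.server.
	err := retryJoin([]string{"localhost"}, func(addrs []string) error {
		fmt.Println("joining", addrs)
		return nil
	}, 3)
	fmt.Println("err:", err)
}
```

In Kubernetes this matters because a headless Service's DNS name keeps returning the current pod IPs, so a fresh lookup always sees the cluster as it is now.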

Additional context
I'm not sure whether this is the only problem with Dkron in k8s (there is a hypothesis that you also need to resolve TODOs one and two, but I'm not sure; I'll share updates if any appear).
If there are other known problems, I would suggest making a series of improvements aimed at supporting Dkron in k8s.
I think many people would welcome this (I have seen many issues related to it in one way or another).

@ivan-kripakov-m10 changed the title from "Dkron can't be used safely used in k8s at the moment" to "Dkron can't be safely used in k8s at the moment" on Dec 15, 2023
@vcastellm (Member) commented:
Possibly fixed in #1446

@fopina (Contributor) commented Feb 5, 2024:

This looks similar to #1253. Is it also fixed by #1446?
Looking forward to updating to v4 and testing it!

@vcastellm (Member) commented:
Hey, can you try with v4.0.0-beta? This should be fixed by #1446.

@ivan-kripakov-m10 (Author) commented Feb 9, 2024:

Hey, I have already tested #1446 (as I wrote in my PR).
If anybody else is able to set up Dkron in a k8s cluster, that would be even better: we would then have at least two pieces of evidence that #1446 is a correct change.

@ivan-kripakov-m10 (Author) commented Feb 9, 2024:

There is also a significant change to the Dkron k8s Helm chart:
distribworks/dkron-helm#7
I used it to test a Dkron 3.2.6 build with the commits from #1446.

@vcastellm, are you going to merge it too?

@ivan-kripakov-m10 (Author) commented:
We are certainly waiting for Dkron v4, but wouldn't it be a good idea to release a patch version of Dkron 3.2.x (with #1446) to make it possible to use Dkron in k8s now?

@vcastellm (Member) commented:
@ivan-kripakov-m10 it would be possible to release a patch version for v3, but I don't see any advantage in it. Can you elaborate on possible use cases of v3 vs v4?

@fopina (Contributor) commented Feb 11, 2024:

> Hey can you try with v4.0.0-beta? this should be fixed by #1446

@vcastellm I'm not sure whether I'm supposed to use any extra flags, but 4.0.0-beta3 does not fix my issue #1253 (which I believe to be similar to this one).

After killing the server (to make it restart), the agents report logs like these:

## initial join, all good
time="2024-02-11T13:36:59Z" level=info msg="Adding LAN adding server" node=sfpi4 server=dkron1
time="2024-02-11T13:36:59Z" level=info msg="agent: Received event" event=member-update node=pi4

## server (dkron1) killed and removed from the list, never retried
time="2024-02-11T13:49:49Z" level=info msg="agent: Received event" event=member-update node=pi4
time="2024-02-11T13:49:49Z" level=info msg="agent: Received event" event=member-failed node=pi4
time="2024-02-11T13:49:49Z" level=info msg="removing server dkron1 (Addr: 10.0.2.35:6868) (DC: dc1)" node=pi4

Docker Swarm compose file (to illustrate the configuration):

services:
  server:
    image: dkron/dkron:4.0.0-beta3
    command: agent 
    environment:
      #DKRON_NODE_NAME: "{{.Node.Hostname}}"
      DKRON_NODE_NAME: dkron1
      DKRON_DATA_DIR: /ext/data
      DKRON_SERVER: 1
      DKRON_BIND_ADDR: tasks.server:8946
      DKRON_BOOTSTRAP_EXPECT: 1
    deploy:
      mode: replicated
      replicas: 1
  agents:
    image: dkron/dkron:4.0.0-beta3
    command: agent
    environment:
      DKRON_NODE_NAME: "{{.Node.Hostname}}"
      DKRON_RETRY_JOIN: tasks.server
      DKRON_BIND_ADDR: '{{`{{ GetInterfaceIP "eth0" }}:8946`}}'
      DKRON_TAG: 'arch={{.Node.Platform.Architecture}} server=false'
    deploy:
      mode: global
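For comparison, here is a rough sketch of how the same setup might look on Kubernetes. This is an untested illustration, not an official manifest; the names (dkron-server) and image tag are assumptions. The important part is the headless Service, whose DNS name resolves to the current pod IPs and can therefore be used as the retry-join target even after pods restart with new addresses.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: dkron-server
spec:
  clusterIP: None          # headless: DNS returns pod IPs directly
  selector:
    app: dkron-server
  ports:
    - name: serf
      port: 8946
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dkron-server
spec:
  serviceName: dkron-server
  replicas: 3
  selector:
    matchLabels:
      app: dkron-server
  template:
    metadata:
      labels:
        app: dkron-server
    spec:
      containers:
        - name: dkron
          image: dkron/dkron:4.0.0-beta4
          args: ["agent"]
          env:
            - name: DKRON_SERVER
              value: "true"
            - name: DKRON_BOOTSTRAP_EXPECT
              value: "3"
            # the service name is re-resolved on every retry,
            # so new pod IPs are picked up after restarts
            - name: DKRON_RETRY_JOIN
              value: "dkron-server"
```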

@ivan-kripakov-m10 (Author) commented:
@vcastellm For me, speed is the primary consideration. From what I gather, v4 will bring numerous changes to both the user interface and the backend. Applying #1446 and rolling out a release so that users can run Dkron in k8s seems more straightforward and quicker than the extensive v4 update.

@raebbar commented Feb 12, 2024:

> If anybody else is able to set up Dkron in some k8s cluster, I think it will be more sufficient as we will have at least two evidence that #1446 is a correct change.

I converted a Dkron test instance with 3 servers and 2 agents to version 4.0.0-beta4. After that I deleted various pods several times, restarted the servers' StatefulSet, and so on. In all cases, the new pods reconnected correctly to the Dkron cluster, IP changes were handled, and leader election worked.

@jaccky commented Feb 15, 2024:

Hi,
we tried dkron/dkron:4.0.0-beta4 on an AKS cluster with 3 server nodes.
Various restarts of the nodes always resulted in a working cluster with an elected leader, so the issue finally seems to be solved!
Thanks to @ivan-kripakov-m10 for his work. I hope we see this released in a stable version soon, and I also hope a patch will be available for version 3.
