I am experiencing difficulties connecting to Kafka in a distributed environment. #109

Open
5win opened this issue Jan 3, 2024 · 3 comments

5win commented Jan 3, 2024

Hello,
I am trying to run kafka-ml on a k8s cluster and execute the mnist example code. However, I am facing the following error and having difficulty connecting to Kafka.

[Image: screenshot of the error message, not preserved]

This happens after cloning the repository, creating the pods with 'kubectl apply -k kustomize/local', and changing bootstrap_servers in mnist_dataset_training_example.py to the IP and NodePort of the node where the Kafka pod is running.

I appreciate your assistance, even though it might be inconvenient. Thank you.

@Altair-Bueno
Member

Hi @5win, thanks for reaching out. I've just tried Kafka-ML on an empty cluster (K3s v1.28.3-rc2+k3s2), and unfortunately we could not reproduce your issue. It looks like a name resolution error, which is usually caused by one of two things:

  • Your cluster lacks support for name resolution
  • Your CoreDNS* service is down

In either case, check your kube-system namespace. There should be a pod with a name similar to CoreDNS, and it should be up and running. For more information, see this support article from Kubernetes: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

* Your cluster may not be using CoreDNS, but it is the most widely used option out there.
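
To narrow down a name resolution error, it can help to check whether the hostname from the error message resolves at all from wherever the client script runs. The following is only a minimal sketch using Python's standard library; the helper name `resolves` is hypothetical and not part of Kafka-ML.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the local resolver can map hostname to an address."""
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be resolved
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    return True
```

For example, calling `resolves("kafka")` (assuming `kafka` is the hostname shown in your error) from the machine or pod that runs the training script tells you whether DNS is the problem there. A `False` result from a pod points at cluster DNS; a `False` result from your laptop usually just means the name only exists inside the cluster.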


PS: The steps I used to deploy the local version of Kafka-ML are:

# Make sure no Kafka-ML is running
kubectl delete namespace kafkaml
# Clone the repo
git clone https://github.com/ertis-research/kafka-ml
# Deploy Kafka-ML
kubectl apply -k kafka-ml/kustomize/local 

Member

Altair-Bueno commented Jan 19, 2024

We've just noticed that this error results from the bootstrap process in Kafka (see https://www.confluent.io/blog/kafka-client-cannot-connect-to-broker-on-aws-on-docker-etc/): after the initial connection, the client reconnects to the address the broker advertises. You are probably using our local deployment on a remote cluster, so the localhost addresses do not match up. There are three ways to resolve your issue:

Port forward Kafka to your local machine

Something along these lines should work. Keep the following command running in a background shell:

kubectl port-forward --namespace=kafkaml service/kafka 9094:9094

Then set the Kafka address (bootstrap_servers) to localhost:9094.
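
Before rerunning the example, you may want to confirm that the forwarded port is actually reachable. This is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper, and a successful TCP connection only shows that the port-forward is up, not that the broker's advertised listeners are correct.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the port-forward command running, `can_connect("localhost", 9094)` should return `True`. If it returns `False`, the forward is probably not running or is bound to a different port.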

Modify KAFKA_CFG_ADVERTISED_LISTENERS

You can add PLAINTEXT://{{ your_cluster_ip }}:9094 to the listener list in the Kafka deployment.
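
The value of KAFKA_CFG_ADVERTISED_LISTENERS is a comma-separated list, so adding an entry amounts to appending to that string. The sketch below illustrates the edit; the existing value and the IP address are made-up placeholders, and `add_advertised_listener` is a hypothetical helper. It does not check whether the listener name is already in use, and Kafka may reject a list with conflicting names, so look at the current value in your deployment first.

```python
def add_advertised_listener(current: str, listener: str) -> str:
    """Append listener to a comma-separated listener list, skipping exact duplicates."""
    listeners = [item.strip() for item in current.split(",") if item.strip()]
    if listener not in listeners:
        listeners.append(listener)
    return ",".join(listeners)


# Hypothetical existing value and cluster IP, for illustration only.
existing = "INTERNAL://kafka:9092"
print(add_advertised_listener(existing, "PLAINTEXT://203.0.113.10:9094"))
```

The resulting string is what you would set as the environment variable's value in the Kafka deployment before restarting the pod.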

Deploy the normal Kafka-ML version

This would unlock GPU usage if your cluster supports it and you deploy the -gpu versions of Kafka-ML. The only downside is that you need to deploy your own Kafka cluster, which might be a bit tricky. Internally we use Bitnami's Helm chart, but there are other options out there that might be easier to get started with.

@GURPREET-WEB

Dear authors,
Thanks for the great work.
I am also experiencing errors while running the provided code after enabling Kubernetes in Docker Desktop. Kindly resolve the issues.

Thanks
