Describe the bug
I have an EKS cluster with public and private API access, on a VPC with public and private subnets. I've set up my ALB in the public subnets on port 80 (internet-facing, target type ip) and installed the AWS Load Balancer Controller following the AWS docs and the 2048 deployment example. I am using GPU nodes and have also set up the Kubernetes GPU operator. I have a Deployment and a Service for a Flask REST API.
After getting everything set up, I expected the instances running in my EKS cluster to register in the target group, but it is empty and no targets ever appear.
Here is a screenshot of the ALB and the empty target group from the AWS console:
I'm struggling to find an answer as to why this is happening. I've been going back and forth on my Ingress and Deployment YAML files and thought it might be a selector/label issue, but that doesn't seem to be the case. My Deployment runs a Flask API on port 5000, and I've set /health as the health-check path so the ALB hits the Flask server's /health endpoint and gets a response.
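For context, the Service and Ingress wiring looks roughly like this. This is a minimal sketch reconstructed around the names visible in the outputs below; the pod selector label and the exact annotation set are assumptions, not a copy of my actual ingress.yaml:
apiVersion: v1
kind: Service
metadata:
  name: flask-api-app-service
  namespace: flask-api-app
spec:
  type: NodePort
  selector:
    app: flask-api        # assumed pod label; must match the Deployment's pod template
  ports:
    - port: 80            # Service port referenced by the Ingress and the TargetGroupBinding
      targetPort: 5000    # Gunicorn/Flask container port
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flask-ingress-3
  namespace: flask-api-app
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /health
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: flask-api-app-service
                port:
                  number: 80
The intent is that the ALB forwards to Service port 80, which maps to the container's port 5000.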
This is the Dockerfile that I built for the deployment:
# Start from the official Python 3.9 image
FROM python:3.9
# Copy the requirements file into the image
COPY ./requirements.txt /app/requirements.txt
# Switch the working directory
WORKDIR /app
# Install the dependencies listed in the requirements file (Gunicorn must be listed there)
RUN pip install -r requirements.txt
# Copy the application code into the image
COPY . /app
# Expose port 5000 for Gunicorn
EXPOSE 5000
# Run the app with Gunicorn, binding to all interfaces on port 5000
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "main:app"]
I also ran the command kubectl describe targetgroupbindings -n flask-api-app and this was the result:
Name:         k8s-flaskapi-flaskapi-c99c751836
Namespace:    flask-api-app
Labels:       ingress.k8s.aws/stack-name=flask-ingress-3
              ingress.k8s.aws/stack-namespace=flask-api-app
Annotations:  <none>
API Version:  elbv2.k8s.aws/v1beta1
Kind:         TargetGroupBinding
Metadata:
  Creation Timestamp:  xxxxxxxxxxxxxxxx
  Finalizers:
    elbv2.k8s.aws/resources
  Generation:        1
  Resource Version:  1802318
  UID:               xxxxxxxxxxxxxxxxxxxxxxxxx
Spec:
  Ip Address Type:  ipv4
  Networking:
    Ingress:
      From:
        Security Group:
          Group ID:  xxxxxxxxxxxxxxxxxxxx
      Ports:
        Port:      5000
        Protocol:  TCP
  Service Ref:
    Name:  flask-api-app-service
    Port:  80
  Target Group ARN:  xxxxxxxxxxxxxxxxxxxxxxxx
  Target Type:       ip
Status:
  Observed Generation:  1
Events:
  Type    Reason                  Age                From                Message
  ----    ------                  ----               ----                -------
  Normal  SuccessfullyReconciled  10m (x2 over 10m)  targetGroupBinding  Successfully reconciled
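Since the Target Type is ip, my understanding is that the controller registers pod IPs taken from the Service's endpoints, so an empty endpoints list would explain the empty target group. I checked that with:
kubectl get endpoints flask-api-app-service -n flask-api-app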
Namespaces:
kubectl get namespaces
NAME              STATUS   AGE
default           Active   8d
flask-api-app     Active   40m
gpu-operator      Active   7d7h
kube-node-lease   Active   8d
kube-public       Active   8d
kube-system       Active   8d
kubectl get all -n flask-api-app
NAME                                        READY   STATUS    RESTARTS   AGE
pod/flask-api-deployment-59c668dcf8-wzl6p   0/1     Pending   0          44m

NAME                            TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
service/flask-api-app-service   NodePort   172.20.201.77   <none>        80:32235/TCP   44m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/flask-api-deployment   0/1     1            0           44m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/flask-api-deployment-59c668dcf8   1         1         0       44m
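The pod stuck in Pending looks suspicious, since a pod that never becomes Ready never produces endpoints for the controller to register. The scheduler events should say why it can't be placed:
kubectl describe pod flask-api-deployment-59c668dcf8-wzl6p -n flask-api-app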
Environment
OS: Amazon Linux 2 / Ubuntu
AWS Load Balancer controller version
v2.7.2 (per the controller deployment output below)
This is the output of kubectl describe deployment -n kube-system aws-load-balancer-controller:
Name:                   aws-load-balancer-controller
Namespace:              kube-system
CreationTimestamp:      Thu, 09 May 2024 17:00:58 -0400
Labels:                 app.kubernetes.io/instance=aws-load-balancer-controller
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=aws-load-balancer-controller
                        app.kubernetes.io/version=v2.7.2
                        helm.sh/chart=aws-load-balancer-controller-1.7.2
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: aws-load-balancer-controller
                        meta.helm.sh/release-namespace: kube-system
Selector:               app.kubernetes.io/instance=aws-load-balancer-controller,app.kubernetes.io/name=aws-load-balancer-controller
Replicas:               2 desired | 2 updated | 2 total | 0 available | 2 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=aws-load-balancer-controller
                    app.kubernetes.io/name=aws-load-balancer-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  aws-load-balancer-controller
  Containers:
   aws-load-balancer-controller:
    Image:       public.ecr.aws/eks/aws-load-balancer-controller:v2.7.2
    Ports:       9443/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --cluster-name=EKS-Test-Cluster
      --ingress-class=alb
    Liveness:     http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Readiness:    http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
  Volumes:
   cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aws-load-balancer-tls
    Optional:    false
  Priority Class Name:  system-cluster-critical
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      False   MinimumReplicasUnavailable
OldReplicaSets:  <none>
NewReplicaSet:   aws-load-balancer-controller-6bf4b948d6 (2/2 replicas created)
Events:          <none>
Kubernetes version
1.29
Using EKS (yes/no), if so version?
yes, EKS platform version eks.6
Hello, the issue was unrelated to the ALB. I was using GPU nodes on Amazon Linux 2 and tried to install a driver tagged for AL2 that does not exist. I have since moved the OS to Bottlerocket NVIDIA and everything is working.
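For anyone landing here with the same symptom: the fix was at the node group level, not in the controller. As a rough sketch of what the replacement node group looks like in eksctl terms (the region, node group name, instance type, and size below are illustrative, and my understanding is that eksctl picks the NVIDIA Bottlerocket variant automatically for GPU instance types):
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: EKS-Test-Cluster
  region: us-east-1              # placeholder region
managedNodeGroups:
  - name: gpu-nodes              # placeholder node group name
    amiFamily: Bottlerocket      # Bottlerocket ships the NVIDIA driver, so no manual AL2 driver install
    instanceType: g4dn.xlarge    # illustrative GPU instance type
    desiredCapacity: 1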