add ELB tags for custom ingresses #1391

Merged
merged 11 commits into from Feb 14, 2019

Conversation

xh3b4sd
Contributor

@xh3b4sd xh3b4sd commented Feb 4, 2019

Fixes https://github.com/giantswarm/giantswarm/issues/5220. I checked b8z5x on seal. See https://console.aws.amazon.com/vpc/home?region=us-east-1#subnets:search=b8z5x;sort=SubnetId. Only the private subnet is tagged there. I simply added the tags to the CF template. I am not sure what the value should be, but given the unusual Kubernetes tag format I guess it might not matter at all. 1 is already set in AWS for a couple of clusters.
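For reference, a quick sketch of checking which subnets carry the ELB role tag from the CLI; this is just an illustration, not something the operator runs. The kubernetes.io/role/elb key matches the one discussed further down, and the region is the one from the console link above:

$ aws ec2 describe-subnets \
    --region us-east-1 \
    --filters "Name=tag-key,Values=kubernetes.io/role/elb" \
    --query "Subnets[].[SubnetId,Tags]" \
    --output json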

@xh3b4sd xh3b4sd self-assigned this Feb 4, 2019
@xh3b4sd xh3b4sd requested review from pipo02mix and a team February 4, 2019 15:09
Member

@calvix calvix left a comment


Is an empty string a valid tag value? That would make more sense to me.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 4, 2019

You can set the value of a tag to an empty string:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html
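A minimal sketch of applying such a tag with an empty value by hand; the subnet ID is a placeholder and the kubernetes.io/role/elb key is assumed:

$ aws ec2 create-tags \
    --resources subnet-0123456789abcdef0 \
    --tags Key=kubernetes.io/role/elb,Value=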

Contributor

@pipo02mix pipo02mix left a comment


LGTM, apart from the value used for the tag.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 4, 2019

Set the empty string. Going to test that.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 4, 2019

When a cluster is updated, do we know how CF treats the changed subnets? Will it simply add the tags and keep the networking functional? What would be the best way to verify that?

@pipo02mix
Contributor

Can you try to update from a previous version to a WIP version with this change?

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 4, 2019

Sure, but what behaviour should I look for? How do I verify that the subnets behave properly?

@pipo02mix
Contributor

Yeah, I have no idea. I remember that for security groups, adding a tag deletes and recreates the SG, which obviously has an impact on customer traffic. Maybe you can ping an application there and see if there is downtime during the upgrade. I can help if you need it.

@xh3b4sd xh3b4sd temporarily deployed to ginger February 4, 2019 17:29 Inactive
@mazzy89
Contributor

mazzy89 commented Feb 4, 2019

In the case of a tag it should not recreate the entire SG, because the change is at the level of the SG, not at the level of a single rule. But apologies in advance if I have this wrong; maybe I remember incorrectly.

The case of changes at the rule level is different. There the connection drops, because the SG is destroyed and recreated. 😭😬😱

Reference:
https://jlordiales.me/2018/01/30/cf-gotchas-sg/

Member

@calvix calvix left a comment


lgtm

@calvix
Member

calvix commented Feb 5, 2019

For testing, I would do the upgrade and look at CF: check that the CF state is green, check that the subnets have the proper tags after the update, and then check cluster functionality, which means accessing the k8s API, trying to SSH into one of the nodes, and pinging each node from master/worker.

If all of this works, then the subnets are fine. Testing this on a multi-AZ cluster makes sense, as you will have more subnets and inter-subnet communication.
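A rough sketch of those checks as commands; the stack name, subnet filter, and node IPs below are placeholders, not the actual names used by the operator:

$ aws cloudformation describe-stacks \
    --stack-name <cf-stack-name> \
    --query "Stacks[0].StackStatus"
$ aws ec2 describe-subnets \
    --filters "Name=tag-key,Values=kubernetes.io/role/elb" \
    --query "Subnets[].[SubnetId,Tags]"
$ kubectl get nodes -o wide
$ ssh <worker-ip>            # from the master or a bastion
$ ping -c 3 <other-node-ip>  # from master and from a worker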

@xh3b4sd xh3b4sd temporarily deployed to ginger February 5, 2019 09:54 Inactive
@xh3b4sd
Contributor Author

xh3b4sd commented Feb 5, 2019

The easy things first. New clusters get created with the tags. Cool.

(two screenshots from 2019-02-05, 11:01, showing the tags on the new subnets)

@teemow
Member

teemow commented Feb 5, 2019

@xh3b4sd you could also create an old cluster, then modify the CloudFormation stack and see what it does when you execute the change, if that helps.

@xh3b4sd xh3b4sd temporarily deployed to ginger February 5, 2019 19:32 Inactive
@xh3b4sd
Contributor Author

xh3b4sd commented Feb 5, 2019

The new cluster which has the tags apparently has connection issues. I deleted the tags manually and restarted the kubelets, but fooling around did not lead anywhere. I would like to pair with somebody to figure out which problems we have there. The cluster 4ntau is on ginger. My hunch is that the tags screw up the routing, but I don't understand how this works or where to look now.

$ kubectl get node -o wide
Unable to connect to the server: EOF

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 5, 2019

I updated sp0eq from 6.2.0 to 6.3.0. During the process the subnets get updated quickly and it seems only the tags are added. Pinging workers from workers and workers from the master works. After the master node got rolled the API is functional again. I will check tomorrow how the updated cluster behaves after a couple of hours and whether the update goes through at all.

@mazzy89
Contributor

mazzy89 commented Feb 5, 2019

@xh3b4sd Did you try to create those tags with the value 1?

For instance, kops creates utility (public) and private subnets with the tags kubernetes.io/role/elb: 1 and kubernetes.io/role/internal-elb: 1 respectively.

See also here:

But it seems that you already solved the issue and it works, so ignore this, as said 🤓

@mazzy89 mazzy89 temporarily deployed to ginger February 6, 2019 20:45 Inactive
@mazzy89 mazzy89 temporarily deployed to ginger February 6, 2019 22:09 Inactive
@mazzy89
Contributor

mazzy89 commented Feb 6, 2019

I've run a test cluster and reproduced the error perfectly. I also made some further progress towards figuring out the root cause.

It is crucial here to check out the timeline.

23.20: the tenant cluster is created.

23.22: it is not possible to reach the cluster on the first attempt, because it returns the following error:

➜ kgno
Unable to connect to the server: EOF

and from the apiserver logs we get this:

I0206 22:24:58.788349       1 log.go:172] http: TLS handshake error from 10.1.0.215:63042: EOF
I0206 22:25:03.388177       1 log.go:172] http: TLS handshake error from 10.1.0.214:42568: EOF
I0206 22:25:03.788940       1 log.go:172] http: TLS handshake error from 10.1.0.215:63056: EOF
I0206 22:25:08.176294       1 log.go:172] http: TLS handshake error from 10.1.0.104:51114: EOF
I0206 22:25:08.387920       1 log.go:172] http: TLS handshake error from 10.1.0.214:42574: EOF
I0206 22:25:08.788566       1 log.go:172] http: TLS handshake error from 10.1.0.215:63062: EOF
I0206 22:25:13.388652       1 log.go:172] http: TLS handshake error from 10.1.0.214:42576: EOF
I0206 22:25:13.788126       1 log.go:172] http: TLS handshake error from 10.1.0.215:63064: EOF

The EC2 instance (master) linked to the api ELB is marked InService.

The EC2 instances (worker) linked to the ingress ELB are marked OutOfService.

23.38: the cluster becomes healthy and consequently the ingress ELB instances are marked InService.

Using these tags introduces a delay somewhere in the creation of the cluster. In general the cluster takes more time, which is visible from two things:

  • the apiserver component is not ready → kubectl commands fail
  • ingress ELB does not get healthy immediately

Where to go next?

I'd recommend debugging the creation of the cluster, in particular each component, because for some reason these tags introduce some computational overhead somewhere, perhaps right in this part: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L3150
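If the suspicion is that the cloud provider's ELB subnet selection reacts to these tags, one simple way to see what it has to pick from is to dump the subnets and their tags for the cluster VPC; the VPC ID below is a placeholder:

$ aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
    --query "Subnets[].[SubnetId,Tags]" \
    --output json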

@calvix
Member

calvix commented Feb 7, 2019

The cluster takes time to create; that is nothing new and we are fine with it. The problem, AFAIK, was that it did not resolve itself: the connection stayed broken for a day after creation. If you can access the server 20 minutes after creation, as you said, then you did not hit the bug.

@mazzy89
Contributor

mazzy89 commented Feb 7, 2019

20 minutes after creation I can run kubectl get nodes (or any other kubectl command) and successfully connect to the API.

It did not resolve itself: the connection stayed broken for a day after creation.

Could you please point out how to reproduce this? Maybe I haven't tested something.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 7, 2019

The issue works as follows. You create a cluster. You wait until it is created. The API works. You wait some more time, maybe an hour or two. Then the API stops working and only returns EOF from that point on. The apiserver itself seems to be fine though, so it is something between the ELB and the EC2 instance.

@mazzy89
Contributor

mazzy89 commented Feb 7, 2019

I see. Thank you @xh3b4sd for the clarification. So we need to observe this over a longer period. When you say the API is not working anymore, how does the api ELB look? Is it InService?

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 7, 2019

ELB appears to be healthy and the master node is also in service.
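For completeness, this kind of health check can also be read from the classic ELB API; the load balancer name below is a placeholder, not the actual one:

$ aws elb describe-instance-health \
    --load-balancer-name <api-elb-name> \
    --query "InstanceStates[].[InstanceId,State,Description]"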

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 7, 2019

Your cluster mzv35 on ginger shows the symptoms already.

$ kubectl get node -o wide --show-labels
Unable to connect to the server: EOF

@mazzy89
Contributor

mazzy89 commented Feb 7, 2019

I see. Then I think we should improve one thing: if the API returns EOF, the load balancer should be marked unhealthy. The health check at that level should detect that the API is not working. The second thing I propose is to enable VPC flow logs. If the problem is between the ELB and EC2, it should be easily visible in the VPC flow logs.
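A sketch of enabling VPC flow logs to CloudWatch Logs; the VPC ID, log group name, and IAM role ARN below are placeholders:

$ aws ec2 create-flow-logs \
    --resource-type VPC \
    --resource-ids vpc-0123456789abcdef0 \
    --traffic-type ALL \
    --log-group-name <flow-logs-group> \
    --deliver-logs-permission-arn arn:aws:iam::<account-id>:role/<flow-logs-role>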

@calvix
Member

calvix commented Feb 7, 2019

I see. Then I think we should improve one thing: if the API returns EOF, the load balancer should be marked unhealthy. The health check at that level should detect that the API is not working. The second thing I propose is to enable VPC flow logs. If the problem is between the ELB and EC2, it should be easily visible in the VPC flow logs.

The API does not return EOF; the API is working. The ELB returns EOF, and the health check for the API has worked fine so far.

Maybe just try to debug it yourself :p

@mazzy89
Contributor

mazzy89 commented Feb 7, 2019

OK @calvix, but do you agree that the ELB returning EOF is a symptom that something is not working properly and that we should be alerted about it? So we need to introduce/scrape some kind of metrics at the ELB level to know what is going on. Do we agree that VPC flow logs could be the way to dig further into this?

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 7, 2019

Vaclav created an issue in SIG Monitoring about this yesterday: https://github.com/giantswarm/giantswarm/issues/5294.

@mazzy89
Contributor

mazzy89 commented Feb 8, 2019

I'm still on this and trying to understand more. I enabled the VPC flow logs, but it seems they didn't help that much; I haven't seen any suspicious logs. I'll continue to check.

@calvix
Member

calvix commented Feb 8, 2019

OK @calvix, but do you agree that the ELB returning EOF is a symptom that something is not working properly and that we should be alerted about it? So we need to introduce/scrape some kind of metrics at the ELB level to know what is going on. Do we agree that VPC flow logs could be the way to dig further into this?

Yes, but we should first find out what is wrong before we start scraping random metrics. We already scrape some of them in aws-operator and alert on them.

What I had in mind is that you could debug the problem, find out what is wrong, and then figure out a way we can detect such a situation.

@calvix calvix temporarily deployed to ginger February 12, 2019 09:01 Inactive
@calvix calvix temporarily deployed to ginger February 13, 2019 12:11 Inactive
@calvix
Member

calvix commented Feb 14, 2019

@calvix calvix temporarily deployed to ginger February 14, 2019 06:13 Inactive
@xh3b4sd
Contributor Author

xh3b4sd commented Feb 14, 2019

@calvix is there anything left to test? Is it good enough for us that you fixed and verified the issue and CI is green? I am moving the tags to the new WIP bundle as well. Just checking when we can merge this PR and get the PM sorted.

@calvix calvix deployed to gauss February 14, 2019 11:28 Active
@calvix
Member

calvix commented Feb 14, 2019

@xh3b4sd should be fine, I tested this a few times on ginger and gauss and was not able to reproduce it again.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 14, 2019

Should the changes in v22 stay now? It would mean we backport the tags to clusters that may already have been created. Is that fine as well, or should we only introduce them with the new version v23?

@pipo02mix
Contributor

It would be great to have this available for current versions.

@xh3b4sd xh3b4sd merged commit e43bc75 into master Feb 14, 2019
@xh3b4sd xh3b4sd deleted the elb-tags branch February 14, 2019 14:54