add ELB tags for custom ingresses #1391

Merged
merged 11 commits into from Feb 14, 2019

Conversation

xh3b4sd
Contributor

@xh3b4sd xh3b4sd commented Feb 4, 2019

Fixes https://github.com/giantswarm/giantswarm/issues/5220. I checked b8z5x on seal. See https://console.aws.amazon.com/vpc/home?region=us-east-1#subnets:search=b8z5x;sort=SubnetId. Only the private subnet is tagged there. I simply added the tags to the CF template. I am not sure what the value should be, but given the unusual Kubernetes tag format I guess it might not matter at all. 1 is already set in AWS for a couple of clusters.
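For reference, a quick sketch of checking which subnets carry the ELB role tag from the CLI; this is just an illustration, not something the operator runs. The kubernetes.io/role/elb key matches the one discussed further down, and the region is the one from the console link above:

$ aws ec2 describe-subnets \
    --region us-east-1 \
    --filters "Name=tag-key,Values=kubernetes.io/role/elb" \
    --query "Subnets[].[SubnetId,Tags]" \
    --output json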

@xh3b4sd xh3b4sd self-assigned this Feb 4, 2019
@xh3b4sd xh3b4sd requested review from pipo02mix and a team February 4, 2019 15:09
Member

@calvix calvix left a comment


Is an empty string a valid tag value? That would make more sense to me.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 4, 2019

You can set the value of a tag to an empty string:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html
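A minimal sketch of applying such a tag with an empty value by hand; the subnet ID is a placeholder and the kubernetes.io/role/elb key is assumed:

$ aws ec2 create-tags \
    --resources subnet-0123456789abcdef0 \
    --tags Key=kubernetes.io/role/elb,Value=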

Contributor

@pipo02mix pipo02mix left a comment


LGTM, apart from the value used for the tag.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 4, 2019

Set the empty string. Going to test that.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 4, 2019

When a cluster is updated, do we know how CF treats the changed subnets? Will it simply add the tags and keep the networking functional? What would be the best way to verify that?

@pipo02mix
Contributor

Can you try to update from a previous version to a WIP version with this change?

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 4, 2019

Sure, but what behaviour should I look for? How do I verify that the subnets behave properly?

@pipo02mix
Contributor

Yeah, I have no idea. I remember that for security groups, adding a tag deletes and recreates the SG, which obviously has an impact on customer traffic. Maybe you can ping an application there and see if there is downtime during the upgrade. I can help if you need it.

@xh3b4sd xh3b4sd temporarily deployed to ginger February 4, 2019 17:29 Inactive
@mazzy89
Contributor

mazzy89 commented Feb 4, 2019

In the case of a tag it should not recreate the entire SG, because the change is at the level of the SG, not at the level of a single rule. But apologies in advance if I have this wrong; maybe I remember incorrectly.

The case of changes at the rule level is different. There the connection drops, because the SG is destroyed and recreated. 😭😬😱

Reference:
https://jlordiales.me/2018/01/30/cf-gotchas-sg/

Member

@calvix calvix left a comment


lgtm

@calvix
Member

calvix commented Feb 5, 2019

For testing, I would do the upgrade and look at CF: check that the CF state is green, check that the subnets have the proper tags after the update, and then check cluster functionality, which means accessing the k8s API, trying to SSH into one of the nodes, and pinging each node from master/worker.

If all of this works, then the subnets are fine. Testing this on a multi-AZ cluster makes sense, as you will have more subnets and inter-subnet communication.
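A rough sketch of those checks as commands; the stack name, subnet filter, and node IPs below are placeholders, not the actual names used by the operator:

$ aws cloudformation describe-stacks \
    --stack-name <cf-stack-name> \
    --query "Stacks[0].StackStatus"
$ aws ec2 describe-subnets \
    --filters "Name=tag-key,Values=kubernetes.io/role/elb" \
    --query "Subnets[].[SubnetId,Tags]"
$ kubectl get nodes -o wide
$ ssh <worker-ip>            # from the master or a bastion
$ ping -c 3 <other-node-ip>  # from master and from a worker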

@xh3b4sd xh3b4sd temporarily deployed to ginger February 5, 2019 09:54 Inactive
@xh3b4sd
Contributor Author

xh3b4sd commented Feb 5, 2019

The easy things first. New clusters get created with the tags. Cool.

(two screenshots from 2019-02-05, 11:01, showing the tags on the new subnets)

@teemow
Member

teemow commented Feb 5, 2019

@xh3b4sd you could also create an old cluster, then modify the CloudFormation stack and see what it does when you execute the change, if that helps.

@xh3b4sd xh3b4sd temporarily deployed to ginger February 5, 2019 19:32 Inactive
@xh3b4sd
Contributor Author

xh3b4sd commented Feb 5, 2019

The new cluster which has the tags apparently has connection issues. I deleted the tags manually and restarted the kubelets, but fooling around did not lead anywhere. I would like to pair with somebody to figure out which problems we have there. The cluster 4ntau is on ginger. My hunch is that the tags screw up the routing, but I don't understand how this works or where to look now.

$ kubectl get node -o wide
Unable to connect to the server: EOF

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 5, 2019

I updated sp0eq from 6.2.0 to 6.3.0. During the process the subnets get updated quickly and it seems only the tags are added. Pinging workers from workers and workers from the master works. After the master node got rolled the API is functional again. I will check tomorrow how the updated cluster behaves after a couple of hours and whether the update goes through at all.

@mazzy89
Contributor

mazzy89 commented Feb 5, 2019

@xh3b4sd Did you try to create those tags with the value 1?

For instance, kops creates utility (public) and private subnets with the tags kubernetes.io/role/elb: 1 and kubernetes.io/role/internal-elb: 1 respectively.

See also here:

But it seems that you already solved the issue and it works, so ignore this, as said 🤓

@mazzy89 mazzy89 temporarily deployed to ginger February 6, 2019 20:45 Inactive
@mazzy89 mazzy89 temporarily deployed to ginger February 6, 2019 22:09 Inactive
@mazzy89
Contributor

mazzy89 commented Feb 6, 2019

I've run a test cluster and reproduced the error perfectly. I also made some further progress towards figuring out the root cause.

It is crucial here to check out the timeline.

23.20: the tenant cluster is created.

23.22: it is not possible to reach the cluster on the first attempt, because it returns the following error:

➜ kgno
Unable to connect to the server: EOF

and from the apiserver logs we get this:

I0206 22:24:58.788349       1 log.go:172] http: TLS handshake error from 10.1.0.215:63042: EOF
I0206 22:25:03.388177       1 log.go:172] http: TLS handshake error from 10.1.0.214:42568: EOF
I0206 22:25:03.788940       1 log.go:172] http: TLS handshake error from 10.1.0.215:63056: EOF
I0206 22:25:08.176294       1 log.go:172] http: TLS handshake error from 10.1.0.104:51114: EOF
I0206 22:25:08.387920       1 log.go:172] http: TLS handshake error from 10.1.0.214:42574: EOF
I0206 22:25:08.788566       1 log.go:172] http: TLS handshake error from 10.1.0.215:63062: EOF
I0206 22:25:13.388652       1 log.go:172] http: TLS handshake error from 10.1.0.214:42576: EOF
I0206 22:25:13.788126       1 log.go:172] http: TLS handshake error from 10.1.0.215:63064: EOF

The EC2 instance (master) linked to the api ELB is marked InService.

The EC2 instances (worker) linked to the ingress ELB are marked OutOfService.

23.38: the cluster becomes healthy and consequently the ingress ELB instances are marked InService.

Using these tags introduces a delay somewhere in the creation of the cluster. In general the cluster takes more time, which is visible from two things:

  • the apiserver component is not ready → kubectl commands fail
  • ingress ELB does not get healthy immediately

Where to go next?

I'd recommend debugging the creation of the cluster, in particular each component, because for some reason these tags introduce some computational overhead somewhere, perhaps right in this part: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L3150
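If the suspicion is that the cloud provider's ELB subnet selection reacts to these tags, one simple way to see what it has to pick from is to dump the subnets and their tags for the cluster VPC; the VPC ID below is a placeholder:

$ aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
    --query "Subnets[].[SubnetId,Tags]" \
    --output json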

@calvix
Member

calvix commented Feb 7, 2019

The cluster takes time to create; that is nothing new and we are fine with it. The problem, AFAIK, was that it did not resolve itself: the connection stayed broken for a day after creation. If you can access the server 20 minutes after creation, as you said, then you did not hit the bug.

@mazzy89
Contributor

mazzy89 commented Feb 7, 2019

20 minutes after creation I can run kubectl get nodes (or any other kubectl command) and successfully connect to the API.

It did not resolve itself: the connection stayed broken for a day after creation.

Could you please point out how to reproduce this? Maybe I haven't tested something.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 7, 2019

The issue works as follows. You create a cluster. You wait until it is created. The API works. You wait some more time, maybe an hour or two. Then the API stops working and only returns EOF from that point on. The apiserver itself seems to be fine though, so it is something between the ELB and the EC2 instance.

@mazzy89
Contributor

mazzy89 commented Feb 7, 2019

I see. Thank you @xh3b4sd for the clarification. So we need to observe this over a longer period. When you say the API is not working anymore, how does the api ELB look? Is it InService?

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 7, 2019

ELB appears to be healthy and the master node is also in service.
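For completeness, this kind of health check can also be read from the classic ELB API; the load balancer name below is a placeholder, not the actual one:

$ aws elb describe-instance-health \
    --load-balancer-name <api-elb-name> \
    --query "InstanceStates[].[InstanceId,State,Description]"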

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 7, 2019

Your cluster mzv35 on ginger shows the symptoms already.

$ kubectl get node -o wide --show-labels
Unable to connect to the server: EOF

@mazzy89
Contributor

mazzy89 commented Feb 7, 2019

I see. Then I think we should improve one thing: if the API returns EOF, the load balancer should be marked unhealthy. The health check at that level should detect that the API is not working. The second thing I propose is to enable VPC flow logs. If the problem is between the ELB and EC2, it should be easily visible in the VPC flow logs.
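A sketch of enabling VPC flow logs to CloudWatch Logs; the VPC ID, log group name, and IAM role ARN below are placeholders:

$ aws ec2 create-flow-logs \
    --resource-type VPC \
    --resource-ids vpc-0123456789abcdef0 \
    --traffic-type ALL \
    --log-group-name <flow-logs-group> \
    --deliver-logs-permission-arn arn:aws:iam::<account-id>:role/<flow-logs-role>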

@calvix
Member

calvix commented Feb 7, 2019

I see. Then I think we should improve one thing: if the API returns EOF, the load balancer should be marked unhealthy. The health check at that level should detect that the API is not working. The second thing I propose is to enable VPC flow logs. If the problem is between the ELB and EC2, it should be easily visible in the VPC flow logs.

The API does not return EOF; the API is working. The ELB returns EOF, and the health check for the API has worked fine so far.

Maybe just try to debug it yourself :p

@mazzy89
Contributor

mazzy89 commented Feb 7, 2019

OK @calvix, but do you agree that the ELB returning EOF is a symptom that something is not working properly and that we should be alerted about it? So we need to introduce/scrape some kind of metrics at the ELB level to know what is going on. Do we agree that VPC flow logs could be the way to dig further into this?

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 7, 2019

Vaclav created an issue in SIG Monitoring about this yesterday: https://github.com/giantswarm/giantswarm/issues/5294.

@mazzy89
Contributor

mazzy89 commented Feb 8, 2019

I'm still on this and trying to understand more. I enabled the VPC flow logs, but it seems they didn't help that much; I haven't seen any suspicious logs. I'll continue to check.

@calvix
Member

calvix commented Feb 8, 2019

OK @calvix, but do you agree that the ELB returning EOF is a symptom that something is not working properly and that we should be alerted about it? So we need to introduce/scrape some kind of metrics at the ELB level to know what is going on. Do we agree that VPC flow logs could be the way to dig further into this?

Yes, but we should first find out what is wrong before we start scraping random metrics. We already scrape some of them in aws-operator and alert on them.

What I had in mind is that you could debug the problem, find out what is wrong, and then figure out a way we can detect such a situation.

@calvix calvix temporarily deployed to ginger February 12, 2019 09:01 Inactive
@calvix calvix temporarily deployed to ginger February 13, 2019 12:11 Inactive
@calvix
Member

calvix commented Feb 14, 2019

@calvix calvix temporarily deployed to ginger February 14, 2019 06:13 Inactive
@xh3b4sd
Contributor Author

xh3b4sd commented Feb 14, 2019

@calvix is there anything left to test? Is it good enough for us that you fixed and verified the issue and CI is green? I am moving the tags to the new WIP bundle as well. Just checking when we can merge this PR and get the PM sorted.

@calvix calvix deployed to gauss February 14, 2019 11:28 Active
@calvix
Member

calvix commented Feb 14, 2019

@xh3b4sd should be fine, I tested this a few times on ginger and gauss and was not able to reproduce it again.

@xh3b4sd
Contributor Author

xh3b4sd commented Feb 14, 2019

Should the changes in v22 stay now? It would mean we backport the tags to clusters that may already have been created. Is that fine as well, or should we only introduce them with the new version v23?

@pipo02mix
Contributor

It would be great to have this available for current versions.

@xh3b4sd xh3b4sd merged commit e43bc75 into master Feb 14, 2019
@xh3b4sd xh3b4sd deleted the elb-tags branch February 14, 2019 14:54