CI: Conformance E2E: client-egress-l7-named-port/pod-to-pod: command terminated with exit code 28 (timeout) #27762
Hit another one here. Matrix entry: (5, 5.15-20230810.091425, iptables, true, disabled, dsr, true, true, true)
Sysdump: cilium-sysdump-5-final.zip
Another hit here. Matrix entry: (7, bpf-next-20230810.091425, none, true, {eth0,eth1}, true, disabled, snat, true, testing-only)
The sysdump with the Hubble flows is way too big, but the final one is:
I've hit this three times this week, also on the v1.14 branch (see references in the PRs above).
Looking through the connectivity test source, drops should be expected for any HTTP path other than "/". The L7 policy from the sysdump shows that only "/" is allowed, and the test checks that a drop is observed whenever the path is not "/": https://github.com/cilium/cilium-cli/blob/50abf22438009b121ba819086d53954f01aa0bd6/connectivity/suite.go#L891-L916. Given this, I'm confused as to why we even see flows for "/public" being forwarded. My current hypothesis is that there is a race condition in updating the Envoy config and that it is being updated while the test is executing.
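To make the expectation concrete, here's a rough sketch in Go of the check as I read it from suite.go (hypothetical names, not the actual cilium-cli code):

```go
package sketch

// allowedPath reflects the L7 policy from the sysdump: only "/" is allowed.
func allowedPath(httpPath string) bool {
	return httpPath == "/"
}

// expectDrop mirrors my reading of the test's assertion in suite.go:
// any request to a path other than "/" should be observed as a drop.
// A forwarded flow for "/public" contradicts this, hence the suspicion
// of a race in the Envoy config update.
func expectDrop(httpPath string) bool {
	return !allowedPath(httpPath)
}
```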
Looks like we use the callback in https://github.com/cilium/cilium/blob/main/pkg/envoy/xds_server.go#L1684-L1692.
@learnitall Good find! Shall we add a proper wait?
I think that would make sense. I'm going to do some more digging through the sysdumps to double-check that this would help, and then get started on an implementation.
I looked back through the sysdumps and I think it makes sense to give the proxy policy revision waiting a shot. I've been brainstorming a solution, and the tricky part is that the proxy policy revision is specific to each endpoint, not per cilium-agent pod. This means that we'd have to:
This might take me a while to implement, but I'll continue cracking away at it next week.
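For illustration, a minimal sketch of what that wait could look like, assuming a hypothetical per-endpoint `ProxyPolicyRevision` field (none of these names are the real cilium API):

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// Endpoint is a hypothetical stand-in for a cilium endpoint; the real
// object would track a per-endpoint revision of the policy that has
// actually been applied to its proxy redirects.
type Endpoint struct {
	ID                  uint64
	ProxyPolicyRevision uint64 // hypothetical field, not the real API
}

// waitForProxyPolicyRevision polls until every endpoint reports a proxy
// policy revision >= target, or the context expires.
func waitForProxyPolicyRevision(ctx context.Context, eps []*Endpoint, target uint64) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		behind := 0
		for _, ep := range eps {
			if ep.ProxyPolicyRevision < target {
				behind++
			}
		}
		if behind == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%d endpoints still below proxy policy revision %d: %w", behind, target, ctx.Err())
		case <-ticker.C:
		}
	}
}
```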
@learnitall We already have an idea of a policy revision, so it might make sense to roll up the proxy policy revision into the same number. I'm not sure we need a completely separate value. |
If we want to roll these up together, we'd have to be a bit careful to make sure we don't break any existing assumptions (i.e., adding a policy bumps the revision by 1). I think that might be a larger discussion outside the scope of this issue? Anyway, I dug deeper into how we handle proxies in the datapath code, and I think we actually wait for proxy updates to be completed before updating the policy revision (see lines 572 to 577 in 5e993cb).
We then call the function at line 502 in 5e993cb.
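Condensed, that flow looks roughly like the following sketch; I'm using a plain `sync.WaitGroup` as a stand-in for cilium's context-aware completion WaitGroup, and all names here are illustrative:

```go
package sketch

import "sync"

type endpoint struct{ policyRevision uint64 }

func (e *endpoint) setPolicyRevision(rev uint64) { e.policyRevision = rev }

// updateProxyRedirects stands in for pushing the L7 redirect config to
// Envoy; each in-flight update holds the WaitGroup until Envoy ACKs it.
func updateProxyRedirects(wg *sync.WaitGroup) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		// ... push xDS update, wait for the ACK ...
	}()
}

// regenerate mirrors the order of operations described above: wait for
// the proxy updates to complete, and only then bump the policy revision.
func regenerate(e *endpoint, rev uint64) {
	var wg sync.WaitGroup
	updateProxyRedirects(&wg)
	wg.Wait()                // blocks until the proxy ACKed the new config
	e.setPolicyRevision(rev) // revision is only bumped afterwards
}
```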
I don't know where to go from here. We could try bumping the timeout value in the curl command, but I don't know if that would help. I've been looking through the sysdump from @bimmlerd and I can't find much in the Envoy logs. All status codes seem normal:
And connections to
Looks like this might be related, but the test doesn't deploy Envoy as a DaemonSet: #28057
Looking through how the
This commit adds retries to the curl commands in the PodToPod scenario. The goal of adding this change is to try and address the Cilium CI flake cilium/cilium#27762. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
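In practice the retry behavior boils down to curl's own retry flags; a hedged sketch of the argument list as I would build it (the values are illustrative, though `--retry`, `--retry-delay` and `--retry-all-errors` are real curl options):

```go
package sketch

import "strconv"

// curlArgs builds a retrying curl invocation; --retry-all-errors makes
// curl retry on connect errors and timeouts too, not just transient
// HTTP errors.
func curlArgs(url string, retries int) []string {
	return []string{
		"curl",
		"--silent", "--fail", "--show-error",
		"--connect-timeout", "5",
		"--max-time", "10",
		"--retry", strconv.Itoa(retries),
		"--retry-delay", "1",
		"--retry-all-errors",
		url,
	}
}
```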
Removing myself as assignee for ci-health tophat handoff.
@learnitall So we no longer believe that this is an issue with policy revision increments not taking Envoy policy changes into account?
@learnitall Took another look at the endpoint code; I believe your initial thought might be correct: the endpoint regeneration code only handles updating the endpoint policy revision. To test this, I ran an experiment: I added a 30-second sleep in front of the Envoy proxy waitgroup.Completion() ACK and made L7 policy changes on a Pod.
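Roughly where the sleep went, as a sketch (the real code is in the xDS server's ACK handling; the names here are illustrative):

```go
package sketch

import (
	"sync"
	"time"
)

// onEnvoyAck stands in for the xDS ACK callback that completes the
// waitgroup the endpoint regeneration is blocked on; the sleep simulates
// a slow proxy update.
func onEnvoyAck(wg *sync.WaitGroup) {
	time.Sleep(30 * time.Second) // artificial delay for the experiment
	wg.Done()                    // regeneration can proceed only after this
}
```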
As far as I know (need to double-check this), cilium-cli only uses the
[ upstream commit e2f41ce ] Compute socket option hashKey over the original source address and port when available. This will cause different downstream connections to no longer share upstream connections. This may fix connectivity issues where proxy responses do not get back to the proxy due to the connection with the original source port being cleared from BPF conntrack. Related: cilium/cilium#27762 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
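Conceptually the change looks like this sketch (the real implementation lives in the cilium/proxy C++ code; this Go version just shows the idea of keying the pool hash on the original source address and port):

```go
package sketch

import (
	"encoding/binary"
	"hash/fnv"
	"net/netip"
)

// upstreamPoolHash keys the upstream connection pool on the original
// downstream source address and port, so two distinct downstream
// connections can never be multiplexed onto the same upstream connection.
func upstreamPoolHash(src netip.AddrPort) uint64 {
	h := fnv.New64a()
	addr := src.Addr().As16()
	h.Write(addr[:])
	var port [2]byte
	binary.BigEndian.PutUint16(port[:], src.Port())
	h.Write(port[:])
	return h.Sum64()
}
```

With this, clearing one downstream's conntrack entry can no longer strand the response of a different downstream that happened to share the same upstream connection.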
Update Envoy image to a version that includes the source port in upstream connection pool hash, so that each unique downstream connection gets a dedicated upstream connection. Fixes: cilium#27762 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Confirming the hypothesis that CT GC causes the response to be lost here:
#32270 updates
[ upstream commit d5efd28 ] Update Envoy image to a version that includes the source port in upstream connection pool hash, so that each unique downstream connection gets a dedicated upstream connection. Fixes: cilium#27762 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Unfortunately this has come back in https://github.com/cilium/cilium/actions/runs/9112279753/job/25051211214:
Relevant Envoy logs:
This time there is a CT entry, but it is missing the proxy_redirect flag:
Without the proxy_redirect flag, the return packets are not redirected to the proxy. Here are all the Hubble flows with the port:
Then, 1.5 minutes later there is another flow with the same 5-tuple. Note the
Then the connections
And finally a
So it looks like when a CT entry is re-opened, the proxy_redirect flag is not updated. First, the successful flow that has the proxy redirect on the destination node:
Note how the connection 1.5 minutes later also
Indeed, the CT entry in the destination node has the proxy_redirect flag set:
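My mental model of the failure, as a Go sketch (the real logic is in the BPF conntrack/datapath code; the type and names are illustrative):

```go
package sketch

// ctEntry models just the flag relevant here; the real entry lives in
// the BPF conntrack map.
type ctEntry struct {
	ProxyRedirect bool // set when the connection was redirected to Envoy
}

// handleReply mimics the datapath decision for a reply packet: without
// the proxy_redirect flag the reply bypasses the proxy, which is left
// waiting until curl times out with exit code 28.
func handleReply(e *ctEntry) string {
	if e.ProxyRedirect {
		return "redirect reply to proxy"
	}
	return "deliver reply directly to endpoint" // proxy never sees it
}
```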
Reset the CT entry when an old entry has been found when looking up an entry for a TCP SYN packet (CT_REOPENED). This way the old entry has no effect on the entry for the new connection and security identity, and proxy- and nodeport-related flags get set correctly for the new connection. This prevents a failing proxy redirect when the CT entry of a non-redirected connection would have been reused for a redirected connection, which has happened in CI where an egress L7 policy was applied and an old CT entry from the time without the L7 policy existed. Fixes: cilium#27762 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
CT_REOPENED was originally added in cilium#13340 to emit policy verdicts for apparently re-opened TCP connections, which are in fact more likely to be newly opened TCP connections rather than re-opened ones, as the CT entries may live minutes after the TCP state from the endpoints has already timed out. This added complexity to call sites, forcing differentiation between CT_NEW and CT_REOPENED. In all cases some CT entry field values were left stale, e.g., 'proxy_redirect' after a policy change. Instead of adjusting each call site to behave properly for CT_REOPENED, return CT_NEW instead, and make the observable CT lookup behavior the same as for CT_NEW in that case, most notably by not updating the passed-in `*ct_state'. This change fixes a proxy redirection bug where return packets are not redirected to an L7 proxy when (a stale) CT entry is missing the 'proxy_redirect' flag. Fixes: cilium#27762 Fixes: cilium#13340 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
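Read together, the two commits amount to something like this conceptual Go stand-in for the BPF C logic (all types and names here are illustrative):

```go
package sketch

type ctAction int

const (
	ctNew ctAction = iota
	ctEstablished
)

// ctEntry models the per-connection state kept in the BPF conntrack map.
type ctEntry struct {
	ProxyRedirect bool
	// ... other per-connection flags ...
}

// lookupTCP sketches the fixed lookup behavior: a SYN that lands on an
// existing (stale) entry resets the entry and is reported as a brand new
// connection (CT_NEW) instead of the old CT_REOPENED result, so callers
// never see half-stale flags such as proxy_redirect.
func lookupTCP(entry *ctEntry, syn bool) ctAction {
	if entry == nil {
		return ctNew
	}
	if syn {
		*entry = ctEntry{} // reset stale flags for the new connection
		return ctNew       // handled exactly like a fresh entry
	}
	return ctEstablished
}
```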
CI failure
Matrix entry: (5, 5.15-20230420.212204, iptables, true, disabled, dsr, true, true, true)
Looks similar to #27672 but not related to multicluster.
From a quick sysdump check, there seems to be something wrong with Envoy and the HTTP connection:
cilium-sysdump-5-final.zip