
Multi datacenter communication failure due to missing ACL policy #3539

Open
Lord-Y opened this issue Feb 2, 2024 · 2 comments
Labels
type/bug Something isn't working

Comments

@Lord-Y
Contributor

Lord-Y commented Feb 2, 2024

Hello guys,

With @rrondeau, we've been having many issues between our secondary datacenter and our primary one in production, but NOT in our test environment.
Our setup:

  • 5 Consul server VMs (primary datacenter)
  • services deployed on Nomad clients connected to the Consul primary datacenter
  • 5 Consul server VMs (secondary datacenter)
  • a GKE cluster with consul-k8s connected to the Consul secondary datacenter

Nothing special in our setup, but here were the errors:

Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.902Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=health-services error="rpc error making call: Permission denied" index=0
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.902Z [ERROR] agent.proxycfg: Failed to handle update from watch: kind=ingress-gateway proxy=default/ingress/consul-prd-front-ingress-gateway-prd-867c495c46-swdnd service_id=default/ingress/consul-prd-front-ingress-gateway-prd-867c495c46-swdnd id=upstream-target:core-front-api.production.default.primary-dc:default/ingress/REDACTED error="error filling agent cache: rpc error making call: Permission denied"
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.903Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.58:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.905Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.59:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.907Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.60:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
Jan 29 18:44:48 REDACTED consul[2704047]: 2024-01-29T18:44:48.648Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.61:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"

Communication between services:

# non production
consul catalog services -namespace=staging -datacenter secondary-dc -token e9722fb8-*** => OK
consul catalog services -namespace=staging -datacenter primary-dc -token e9722fb8-*** => OK 
# production
consul catalog services -namespace=production -datacenter secondary-dc -token 99b8194d-*** => OK
consul catalog services -namespace=production -datacenter primary-dc -token 99b8194d-*** => Error listing services: Unexpected response code: 403 (rpc error making call: rpc error making call: ACL not found)

After almost a week of config checking/diffing/debugging, it turned out everything was fine except that there was NO policy attached to the anonymous token used to communicate between datacenters.
So in the primary datacenter, we recreated the anonymous-token-policy and attached it to the anonymous token 00000002.
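For reference, here is a sketch of the commands for that fix, run against the primary datacenter. The policy rules shown are an assumption based on the defaults the consul-k8s server-acl-init job normally creates, and `primary-dc` stands in for your actual primary datacenter name; adjust both to your environment.

```shell
# Rules granting the read access needed for cross-DC catalog/health queries
# (assumed defaults; adapt to your own security requirements).
cat > anonymous-token-policy.hcl <<'EOF'
node_prefix "" {
  policy = "read"
}
service_prefix "" {
  policy = "read"
}
EOF

# Create the policy in the primary datacenter.
consul acl policy create \
  -name=anonymous-token-policy \
  -datacenter=primary-dc \
  -rules=@anonymous-token-policy.hcl

# Attach it to the anonymous token (well-known accessor ID 00000002-...).
consul acl token update \
  -id=00000002-0002-0002-0002-000000000002 \
  -policy-name=anonymous-token-policy
```

These commands need a Consul ACL management token in the environment (e.g. `CONSUL_HTTP_TOKEN`) to succeed.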

The fix that needs to be done is to make sure that the server-acl-init job deployed with consul-k8s enforces the creation of this policy in the primary datacenter and attaches it to the anonymous token.
In previous versions of Consul, improvements were made to ACL logging, except for this case, which just says Permission denied. Since maybe version 1.15.x we are used to getting Permission denied with accessorID xxxxx, so let's hope for some new improvements here too.

As we are using the Consul Enterprise version, we notified support, but we fixed the setup before receiving their feedback.

Stack:

consul: 1.17.2+ent
consul-k8s: 1.3.1
@Lord-Y Lord-Y added the type/bug Something isn't working label Feb 2, 2024
@Lord-Y Lord-Y changed the title Multi datacenter communication failure du to missing ACL policy Multi datacenter communication failure due to missing ACL policy Feb 2, 2024
@david-yu
Contributor

david-yu commented Feb 2, 2024

@Lord-Y It might be best to work through this with support to see how we can resolve this issue. If you have already filed a support ticket but haven't seen a response yet, apologies. We can ask support to look into this issue.

@Lord-Y
Contributor Author

Lord-Y commented Feb 3, 2024

@david-yu thanks. As said:
As we are using the Consul Enterprise version, we notified support, but we fixed the setup before receiving their feedback.
We had a Zoom call to explain what was wrong and how we fixed it.
I created this issue so others know what to do, and hopefully you guys fix it in the code :).
