
Multi datacenter communication failure due to missing ACL policy #3539

Open
Lord-Y opened this issue Feb 2, 2024 · 2 comments
Labels
type/bug Something isn't working

Comments

@Lord-Y
Contributor

Lord-Y commented Feb 2, 2024

Hello guys,

With @rrondeau, we've been having many issues between our secondary datacenter and our primary one in production, but NOT in our test environment.
Our setup:

  • 5 Consul server VMs (primary datacenter)
  • services deployed on Nomad clients connected to the Consul primary datacenter
  • 5 Consul server VMs (secondary datacenter)
  • a GKE cluster with consul-k8s connected to the Consul secondary datacenter

Nothing special in our setup, but here were the errors:

Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.902Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=health-services error="rpc error making call: Permission denied" index=0
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.902Z [ERROR] agent.proxycfg: Failed to handle update from watch: kind=ingress-gateway proxy=default/ingress/consul-prd-front-ingress-gateway-prd-867c495c46-swdnd service_id=default/ingress/consul-prd-front-ingress-gateway-prd-867c495c46-swdnd id=upstream-target:core-front-api.production.default.primary-dc:default/ingress/REDACTED error="error filling agent cache: rpc error making call: Permission denied"
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.903Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.58:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.905Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.59:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.907Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.60:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
Jan 29 18:44:48 REDACTED consul[2704047]: 2024-01-29T18:44:48.648Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.61:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"

Communication between services:

# non production
consul catalog services -namespace=staging -datacenter secondary-dc -token e9722fb8-*** => OK
consul catalog services -namespace=staging -datacenter primary-dc -token e9722fb8-*** => OK 
# production
consul catalog services -namespace=production -datacenter secondary-dc -token 99b8194d-*** => OK
consul catalog services -namespace=production -datacenter primary-dc -token 99b8194d-*** => Error listing services: Unexpected response code: 403 (rpc error making call: rpc error making call: ACL not found)

After almost a week of config checking/diffing/debugging, it turned out everything was fine except that there was NO policy attached to the anonymous token used to communicate between datacenters.
So in the primary datacenter, we recreated the anonymous-token-policy and attached it to the anonymous token 00000002.
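For reference, here is a sketch of the commands for that fix, run against the primary datacenter. The policy rules shown are an assumption based on the defaults the consul-k8s server-acl-init job normally creates, and `primary-dc` stands in for your actual primary datacenter name; adjust both to your environment.

```shell
# Rules granting the read access needed for cross-DC catalog/health queries
# (assumed defaults; adapt to your own security requirements).
cat > anonymous-token-policy.hcl <<'EOF'
node_prefix "" {
  policy = "read"
}
service_prefix "" {
  policy = "read"
}
EOF

# Create the policy in the primary datacenter.
consul acl policy create \
  -name=anonymous-token-policy \
  -datacenter=primary-dc \
  -rules=@anonymous-token-policy.hcl

# Attach it to the anonymous token (well-known accessor ID 00000002-...).
consul acl token update \
  -id=00000002-0002-0002-0002-000000000002 \
  -policy-name=anonymous-token-policy
```

These commands need a Consul ACL management token in the environment (e.g. `CONSUL_HTTP_TOKEN`) to succeed.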

The fix that needs to be done is to make sure that the server-acl-init job deployed with consul-k8s enforces the creation of this policy in the primary datacenter and attaches it to the anonymous token.
In previous versions of Consul, improvements were made to ACL logging, except for this case, which just says Permission denied. Since maybe version 1.15.x we are used to getting Permission denied with accessorID xxxxx, so let's hope for some new improvements here too.

As we are using the Consul Enterprise version, we notified support, but we fixed the setup before receiving their feedback.

Stack:

consul: 1.17.2+ent
consul-k8s: 1.3.1
@Lord-Y Lord-Y added the type/bug Something isn't working label Feb 2, 2024
@Lord-Y Lord-Y changed the title Multi datacenter communication failure du to missing ACL policy Multi datacenter communication failure due to missing ACL policy Feb 2, 2024
@david-yu
Contributor

david-yu commented Feb 2, 2024

@Lord-Y It might be best to work through this with support to see how we can resolve this issue. If you have already filed a support ticket but haven't seen a response yet, apologies. We can ask support to look into this issue.

@Lord-Y
Contributor Author

Lord-Y commented Feb 3, 2024

@david-yu thanks. As said:
As we are using the Consul Enterprise version, we notified support, but we fixed the setup before receiving their feedback.
We had a Zoom call to explain what was wrong and how we fixed it.
I created this issue so others know what to do, and hopefully you guys fix it in the code :).
