With @rrondeau, we've been having many issues between our secondary datacenter and our primary one in production, but NOT in our test environment.
Our setup:

- 5 VMs as Consul servers (primary datacenter)
- services deployed on Nomad clients connected to the Consul primary datacenter
- 5 VMs as Consul servers (secondary datacenter)
- GKE cluster with consul-k8s connected to the Consul secondary datacenter
Nothing special in our setup, but here were the errors:
```
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.902Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=health-services error="rpc error making call: Permission denied" index=0
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.902Z [ERROR] agent.proxycfg: Failed to handle update from watch: kind=ingress-gateway proxy=default/ingress/consul-prd-front-ingress-gateway-prd-867c495c46-swdnd service_id=default/ingress/consul-prd-front-ingress-gateway-prd-867c495c46-swdnd id=upstream-target:core-front-api.production.default.primary-dc:default/ingress/REDACTED error="error filling agent cache: rpc error making call: Permission denied"
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.903Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.58:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.905Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.59:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
Jan 29 18:44:44 REDACTED consul[2704047]: 2024-01-29T18:44:44.907Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.60:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
Jan 29 18:44:48 REDACTED consul[2704047]: 2024-01-29T18:44:48.648Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=xx.xx.xx.61:8300 datacenter=primary-dc method=Health.ServiceNodes error="rpc error making call: Permission denied"
```
Communication between services:
```shell
# non production
consul catalog services -namespace=staging -datacenter secondary-dc -token e9722fb8-***     # => OK
consul catalog services -namespace=staging -datacenter primary-dc -token e9722fb8-***       # => OK

# production
consul catalog services -namespace=production -datacenter secondary-dc -token 99b8194d-***  # => OK
consul catalog services -namespace=production -datacenter primary-dc -token 99b8194d-***    # => Error listing services: Unexpected response code: 403 (rpc error making call: rpc error making call: ACL not found)
```
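To confirm this kind of misconfiguration, you can read the built-in anonymous token in the primary datacenter and check whether any policies are attached (a sketch based on our setup; the datacenter name and the management token placeholder are assumptions, the anonymous token's accessor ID is the well-known built-in one):

```shell
# Read the built-in anonymous token from the primary datacenter.
# In our broken setup, the "Policies" section of the output was empty.
consul acl token read \
  -id 00000002-0002-0002-0002-000000000002 \
  -datacenter primary-dc \
  -token <management-token>
```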
After almost a week of config checks/diffs/debugging, it turned out everything was fine except that there was NO policy attached to the anonymous token used to communicate between datacenters.
So in the primary datacenter, we recreated the anonymous-token-policy and attached it to the anonymous token 00000002.
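Roughly, the fix looks like the following sketch. The policy rules here are an assumption (read access on services and nodes, which is what cross-datacenter service discovery needs when requests fall back to the anonymous token); adjust them, the datacenter name, and the management token placeholder to your setup:

```shell
# Hypothetical minimal rules for the anonymous token policy.
cat > anonymous-token-policy.hcl <<'EOF'
service_prefix "" {
  policy = "read"
}
node_prefix "" {
  policy = "read"
}
EOF

# Recreate the policy in the primary datacenter...
consul acl policy create \
  -name anonymous-token-policy \
  -rules @anonymous-token-policy.hcl \
  -datacenter primary-dc \
  -token <management-token>

# ...and attach it to the built-in anonymous token.
consul acl token update \
  -id 00000002-0002-0002-0002-000000000002 \
  -policy-name anonymous-token-policy \
  -token <management-token>
```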
The fix that needs to be made is to ensure that the server-acl-init job deployed with consul-k8s enforces the creation of this policy in the primary datacenter and attaches it to the anonymous token.
In previous versions of Consul, improvements were made to ACL logging, except for this case, which logs just `Permission denied`. Since version 1.15.x or so, we are used to seeing `Permission denied` with an accessorID, so let's hope for further improvements.
As we are using the Consul Enterprise version, we notified support, but fixed the setup ourselves before their feedback.
Stack:

- consul: 1.17.2+ent
- consul-k8s: 1.3.1
Lord-Y changed the title from "Multi datacenter communication failure du to missing ACL policy" to "Multi datacenter communication failure due to missing ACL policy" on Feb 2, 2024.
@Lord-Y It might be best to work through this with support to see how we can resolve this issue. If you have already filed a support ticket but are not seeing a response yet, apologies; we can ask the team to look into it.
@david-yu thx. As said above: since we are using the Consul Enterprise version, we notified support, but fixed the setup ourselves before their feedback.
We had a zoom call to explain what was wrong and how we fixed it.
I created this issue so others know what to do, and hopefully you fix it in the code :).