
Provider produced inconsistent result after apply when interacting with remote secondary datacenters #249

danieleva opened this issue Mar 26, 2021 · 13 comments · May be fixed by #385

@danieleva
Terraform Version

Terraform v0.14.7
registry.terraform.io/hashicorp/consul v2.11.0
consul 1.9.4

Affected Resource(s)

  • consul_acl_policy

Reproducing the issue requires some setup.
I have 2 Consul datacenters, WAN federated with ACL replication enabled. The primary is in the US, the secondary in Asia/Pacific.
There is ~200ms latency on the WAN connection used for federation.
If Terraform is configured to connect to the Consul API in the remote (secondary) datacenter, ACL policy creation fails with:

consul_acl_policy.test: Creating...

Error: Provider produced inconsistent result after apply

When applying changes to consul_acl_policy.test, provider
"registry.terraform.io/hashicorp/consul" produced an unexpected new value:
Root resource was present, but now absent.

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

This fails:

provider "consul" {
    address = "secondary-dc:8500"
    datacenter = "secondary"
    token = "...."
}

resource "consul_acl_policy" "test" {
  name        = "my_policy"
  datacenters = ["secondary"]
  rules       = <<-RULE
    node_prefix "" {
      policy = "read"
    }
    RULE
}

If I force the provider to use the primary datacenter, the resource is created correctly:

provider "consul" {
    address = "secondary-dc:8500"
    datacenter = "primary"
    token = "...."
}

resource "consul_acl_policy" "test" {
  name        = "my_policy"
  datacenters = ["secondary"]
  rules       = <<-RULE
    node_prefix "" {
      policy = "read"
    }
    RULE
}
consul_acl_policy.test: Creating...
consul_acl_policy.test: Creation complete after 1s [id=d9e929b5-28e3-95ac-615f-abf1453b52a2]
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Debug logs on Consul show the issue. In both cases the provider is connected to a server in the secondary datacenter.
When provider is configured with datacenter=secondary:

consul[17468]: 2021-03-26T14:39:15.852Z [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=4
consul[17468]: 2021-03-26T14:39:15.852Z [DEBUG] agent.server.replication.acl.policy: acl replication: local=3 remote=4
consul[17468]: 2021-03-26T14:39:15.852Z [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
consul[17468]: 2021-03-26T14:39:15.852Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/acl/policy?dc=secondary from=127.0.0.1:41096 latency=211.389717ms  <--- Terraform request to create policy, forwarded to primary
consul[17468]: 2021-03-26T14:39:16.037Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/policy/62cf9c85-4d1b-e1b8-87b0-b37e0559a6bf?dc=secondary from=127.0.0.1:41096 error="ACL not found"
consul[17468]: 2021-03-26T14:39:16.038Z [DEBUG] agent.http: Request finished: method=GET url=/v1/acl/policy/62cf9c85-4d1b-e1b8-87b0-b37e0559a6bf?dc=secondary from=127.0.0.1:41096 latency=264.415µs  <--- Terraform request to read the policy back, response from local agent
consul[17468]: 2021-03-26T14:39:16.059Z [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
consul[17468]: 2021-03-26T14:39:16.059Z [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
consul[17468]: 2021-03-26T14:39:16.063Z [DEBUG] agent.server.replication.acl.policy: acl replication - upserted batch: number_upserted=1 batch_size=111
consul[17468]: 2021-03-26T14:39:16.063Z [DEBUG] agent.server.replication.acl.policy: acl replication - finished updates
consul[17468]: 2021-03-26T14:39:16.063Z [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=198

When provider is configured with datacenter=primary:

consul[17468]: 2021-03-26T14:40:07.664Z [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=5
consul[17468]: 2021-03-26T14:40:07.664Z [DEBUG] agent.server.replication.acl.policy: acl replication: local=4 remote=5
consul[17468]: 2021-03-26T14:40:07.664Z [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
consul[17468]: 2021-03-26T14:40:07.664Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/acl/policy?dc=primary from=127.0.0.1:41114 latency=208.652671ms  <--- Terraform request to create policy, forwarded to primary
consul[17468]: 2021-03-26T14:40:07.870Z [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
consul[17468]: 2021-03-26T14:40:07.870Z [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
consul[17468]: 2021-03-26T14:40:07.872Z [DEBUG] agent.server.replication.acl.policy: acl replication - upserted batch: number_upserted=1 batch_size=111
consul[17468]: 2021-03-26T14:40:07.872Z [DEBUG] agent.server.replication.acl.policy: acl replication - finished updates
consul[17468]: 2021-03-26T14:40:07.872Z [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=202
consul[17468]: 2021-03-26T14:40:08.059Z [DEBUG] agent.http: Request finished: method=GET url=/v1/acl/policy/76661435-a972-ffa4-eeed-3c8658b89f09?dc=primary from=127.0.0.1:41114 latency=206.995291ms  <--- Terraform request to read the policy back, forwarded to primary

In both cases the first part of the flow is identical; the behaviour changes when reading the policy back from Consul:

  1. terraform provider sends a PUT to /v1/acl/policy
  2. local consul server forwards the PUT to primary datacenter
  3. primary datacenter creates the policy and triggers a sync to the secondary
  4. terraform provider sends a GET to /v1/acl/policy/<policy_id>
    1. if datacenter=secondary, the local agent replies, and since the replication is not completed yet, the provider gets an ACL not found error and breaks
    2. if datacenter=primary, the request is forwarded to the primary and the provider completes correctly

A naive workaround, adding time.Sleep(10 * time.Second) before the return in resourceConsulACLPolicyCreate to allow ACL replication to complete, fixes the problem, but I don't think that's the proper way to address this.

The provider documentation is not clear on how it should be configured when dealing with federated datacenters.
If the datacenter parameter in the provider must point at the primary, that should be explicit in the documentation, in addition to ensuring all the resources specify the datacenter they refer to if it's not the primary.
IMHO a better option would be to add some retry logic in the resources, to account for the delay and the eventually consistent nature of ACL federation. In my tests replication is still very fast, usually under 1s, so a configurable retry with exponential backoff would handle it nicely.
If you agree on the retry solution, I'm happy to provide a PR for it.

[GH-167] partially addressed this, but didn't add any retry logic.

Thanks :)

@remilapeyre
Collaborator

Hi @danieleva, thanks for reporting this issue. I did not test much with federated datacenters; the provider certainly behaves weirdly in these cases and is probably not coherent across resources. I will have a look in the coming days to find the best way to proceed. The retry solution looks appropriate for ACLs, but I would like to make sure it is.

@rrijkse

rrijkse commented Oct 13, 2021

@remilapeyre Any updates on this? It is still an issue with the latest version of Terraform/Consul provider.

@remilapeyre
Collaborator

Hi @rrijkse, I made some tests and found the way I want to implement this. I will come back to this issue and try to fix all the resources that currently show this behaviour.

@erisnar

erisnar commented Feb 16, 2022

We experienced the same issue and solved it by configuring the provider to point at the primary datacenter.

@next-jesusmanuelnavarro

next-jesusmanuelnavarro commented Jun 28, 2023

I also found what seems to be related behaviour when creating intentions in a federated secondary datacenter.
Terraform successfully creates intentions when pointing at the primary but fails when pointing at the secondary.

Note the intention is in fact created.

2023-06-28T15:42:37.308+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 17
2023-06-28T15:42:37.308+0200 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate.d/[redacted]
2023-06-28T15:42:37.315+0200 [ERROR] vertex "consul_config_entry.intentions["[redacted]"]" error: failed to read config entry after setting it.

terraform apply against the secondary fails the first time (while the intention is in fact created) and succeeds when applied a second time.

Given the error provided, "failed to read config entry after setting it.", it seems a workaround may be to catch that error and retry a few times with increasing waiting periods (e.g. 1 second, then 2, then 4, then 8) before finally giving up.

@jmnavarrol

jmnavarrol commented Aug 15, 2023

Hi @rrijkse, I made some tests and found the way I want to implement this. I will come back to this issue and try to fix all the resources that currently show this behaviour.

Hi @remilapeyre: did you manage to advance on this issue?

At least you might apply @danieleva's suggested workaround, "adding time.Sleep(10 * time.Second) before the return in resourceConsulACLPolicyCreate to allow for acl replication to complete" (quite possibly a lower wait time would do the trick, as I also saw replication times in the 1-3 second range), until you find the time/inspiration for a better solution.

TIA

7fELF added a commit to 7fELF/terraform-provider-consul that referenced this issue Nov 28, 2023
@7fELF 7fELF linked a pull request Nov 28, 2023 that will close this issue
@7fELF

7fELF commented Dec 4, 2023

I opened a PR to fix this.
@remilapeyre can you take a look? #385

@7fELF

7fELF commented Dec 6, 2023

I opened a PR to fix this. @remilapeyre can you take a look? #385

I published the patched version to the registry to make it easier to validate/test: https://registry.terraform.io/providers/7fELF/consul/latest

@next-jesusmanuelnavarro

I opened a PR to fix this. @remilapeyre can you take a look? #385

I published the patched version to the registry to make it easier to validate/test: https://registry.terraform.io/providers/7fELF/consul/latest

I could test it today with the following version constraints:

terraform {
  required_version = "= 1.4.6"

  required_providers {
    consul = {
      source  = "7fELF/consul"
      version = "= 2.20.1"
    }
    null   = "= 3.2.1"
  }
}

I can still reproduce the bug: upon first terraform apply I get the following error:

consul_config_entry.intentions["REDACTED"]: Creating...
╷
│ Error: failed to read config entry after setting it.
│ This may happen when some attributes have an unexpected value.
│ Read the documentation at https://www.consul.io/docs/agent/config-entries/service-intentions.html
│ to see what values are expected
│ 
│   with consul_config_entry.intentions["REDACTED"],
│   on main.tf line 30, in resource "consul_config_entry" "intentions":
│   30: resource "consul_config_entry" "intentions" {
│ 
╵

The intention is nevertheless properly created and I can see it on Consul webui. A second terraform apply finishes successfully and terraform destroy works as expected on first run.

This is exactly the same behaviour I got with consul = "= 2.17.0".

Also, the relevant code in main.tf (Consul access variables come from the shell environment and point at a remote secondary datacenter):

# Loops through intentions
resource "consul_config_entry" "intentions" {
  for_each = {
    for intention in local.intentions:
      intention.name => intention
  }

  name = each.value.name
  kind = "service-intentions"

  config_json = jsonencode({
    Sources = [
      for source in each.value.sources: {
        Name       = source
        Type       = "consul"
        Action     = "allow"
        Namespace  = "default"
        Partition  = "default"
        Precedence = 9
      }
    ]
  })
}

@7fELF

7fELF commented Dec 11, 2023

Thanks for testing my patch @next-jesusmanuelnavarro
My patch currently fixes the following resources:

  • auth_method
  • binding_rule
  • policy
  • role
  • role_policy_attachment
  • token
  • token_policy_attachment
  • token_role_attachment

I'm not a service mesh user, but according to the docs, setting a replication token also enables service mesh data replication.

So to fix that as well, I need to figure out:

  • Which resources (referred to as "service mesh data") are replicated
  • Which replication index each of those resources increases:
(redacted)@(redacted):~$ curl http://localhost:8500/v1/acl/replication | jq
{
  "Enabled": true,
  "Running": true,
  "SourceDatacenter": "(redacted)-preprod",
  "ReplicationType": "tokens",
  "ReplicatedIndex": 133549002,
  "ReplicatedRoleIndex": 133549003,
  "ReplicatedTokenIndex": 133547011,
  "LastSuccess": "2023-12-11T12:51:53Z",
  "LastError": "2023-12-07T12:52:24Z",
  "LastErrorMessage": "failed to retrieve remote ACL tokens: rpc error making call: ACL not found"
}

@next-jesusmanuelnavarro

next-jesusmanuelnavarro commented Dec 11, 2023

Thanks for testing my patch @next-jesusmanuelnavarro My patch currently fixes the following resources:
So to also fix it, I need to figure out:

  • Which resources (referred to as "service mesh data") are replicated
  • Which replication index each of those resources increases

On this, I can be of little help as I don't admin my Consul cluster; I'm just a user of it (in fact, I can't even list policies with my credentials).

All I can say, if that's what you mean, is that my use case covers service-intentions, service-defaults, service-resolver and service-splitter config entries: https://developer.hashicorp.com/consul/docs/connect/config-entries/service-intentions

@remilapeyre
Collaborator

Hi, this is a long-standing issue and the patch from @7fELF looks like the right way forward to fix it. I wish this could be handled automatically by the Consul Go client, but we should move forward with the current approach first and improve the situation for all users of the Go client later. Regarding the inconsistency with the config entry, I'm not sure the same fix is applicable, but I will look into that as well.

@danihuerta

Any update on this? I'm experiencing the same issue in my federated clusters when pointing at the secondary DC.
Will the fix be included in the next release?
