resource.consul_service "async" changes after creation #347

Open
marceloboeira opened this issue Jun 23, 2023 · 1 comment

Comments

@marceloboeira
Contributor

marceloboeira commented Jun 23, 2023

I'm not 100% sure whether this is the TF provider's fault or simply "the way Consul works", but almost every time I create a Consul service (with checks), the terraform plan that follows the terraform apply includes a change with the service check information, even though it was already "published" to Consul in the first plan/apply.

Terraform Version

Terraform v1.5.1 (but it's also problematic on 1.3.x, 1.4.x)
on darwin_amd64

Affected Resource(s)

  • consul_service

Terraform Configuration Files

resource "consul_service" "service" {
  service_id = "example"
  name       = "cache-foo"
  node       = "cache-foo-node"
  address    = "10.9.9.99"
  port       = 6379

  check {
    check_id = "service:${var.name}"
    name     = "cache-foo"
    notes    = "Service check for service:cache-foo"
    ...
  }
}

Expected Behavior

Nothing should show up in the next plan after the first plan/apply, since the service check and the service itself should have been created by the above code.

Actual Behavior

After the first plan/apply (possibly due to some async process on consul's side?) the next terraform plan shows:

 resource "consul_service" "service" {
       ....
+        check {
+           check_id                          = "service:cache-foo"
+           deregister_critical_service_after = "30s"
+           interval                          = "30s"
...
        }
    }

Steps to Reproduce

  1. terraform plan
  2. terraform apply
  3. wait a few minutes to be sure
  4. terraform plan (without any .TF code change)
  5. See weird "already applied" changes

Important Factoids

What I'm unsure of is which of these is happening:

  • Consul persists the check asynchronously: when the provider performs the create, it then performs a read, and by that time (due to some eventual consistency) the check is not there yet, so the apply doesn't persist that part to the TF state. Are there parameters that control whether this is persisted asynchronously? Is that the default?
  • The TF provider code doesn't properly read the changes, forgetting to update this block, and then in the next read somehow it shows as a
  • The TF provider doesn't "wait" for the change to be fully persisted at apply time, so the read happens while the service check is not yet returned (or is ignored).

Checking the code for the create part, I don't see any major issues:

registration, ident, err := getCatalogRegistration(d, meta)
if err != nil {
    return err
}
if _, err := catalog.Register(registration, wOpts); err != nil {
    return fmt.Errorf("failed to register service (dc: '%s'): %v", wOpts.Datacenter, err)
}
// Retrieve the service again to get the canonical service ID. We can't
// get this back from the register call or through
service, err := retrieveService(client, name, ident, node, qOpts)
if err != nil {
    return fmt.Errorf("failed to retrieve service '%s' after registration. This may mean that the service should be manually deregistered. %v", ident, err)
}
d.SetId(service.ServiceID)
return resourceConsulServiceRead(d, meta)
}

Then, checking how it is read, there is nothing notable other than that it relies on those values being there in the first place:

https://github.com/hashicorp/terraform-provider-consul/blob/9c5772f607ad26325c6bab96917fb41f875dd621/consul/resource_consul_service.go#L271C1-L344

My money would be on service.Checks being empty in the first "read" during the apply but populated later on further reads:

for _, check := range service.Checks {

Finally, what leads me to believe it is a Consul "problem" is that the tests do not have this issue. Possibly a slight replication delay, combined with different nodes receiving the "write" and "read" requests, could cause it. The weird part is why the service itself would be replicated but not the service check...

If that is the case, is there anything specific that can be done to perhaps reduce the likelihood of that happening?

@remilapeyre
Collaborator

Hello @marceloboeira, thanks for the detailed write-up. As you mentioned, this situation does not happen in the tests or in a single-node cluster. The tests are also a bit peculiar here, as none of them test an actual running service.

It is possible that the diff occurs after the Consul agent on the node running the service updates the check in Consul for the first time, which would be an async operation happening after the service is registered in the Consul catalog.

The diff is probably benign but we may be able to use a diff suppress function to hide the changes when this happens, if we can detect it reliably (we wouldn't want to hide actual changes by mistake).

I will run additional tests on my end. Can you please post the complete diff if it happens to you again? It would help to understand which attributes are changing.
