
Healthchecks created via Terraform do not work #187

Open · vector623 opened this issue Mar 25, 2020 · 5 comments

vector623 commented Mar 25, 2020

Terraform Version

> terraform -v
Terraform v0.12.24
+ provider.consul v2.6.1

> consul -v
Consul v1.7.2
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Affected Resource(s)

  • consul_service

Terraform Configuration Files

provider "consul" {
  address    = "https://srv-pro-svrg-01"
  datacenter = "dc1"
  version = "~> 2.6"
}

data "consul_nodes" "read-dc1-nodes" {
  query_options {
    # Optional parameter: implicitly uses the current datacenter of the agent
    datacenter = "dc1"
  }
}

resource "consul_service" "redis" {
  name = "redis"
  node    = "srv-pro-svrg-01"
  port = 6379

  check {
    check_id                          = "service:redis1"
    name                              = "Redis health check"
    status                            = "passing"
    http                              = "https://www.hashicorptest.com"
    tls_skip_verify                   = false
    method                            = "PUT"
    interval                          = "5s"
    timeout                           = "1s"
    deregister_critical_service_after = "90m"

    header {
      name  = "foo"
      value = ["test"]
    }

    header {
      name  = "bar"
      value = ["test"]
    }
  }
}

Debug Output

https://gist.github.com/vector623/d193f3292790bf7f1119c57bafd4e561

Expected Behavior

Health check should execute successfully. If it fails, it should not deregister for 90 minutes.

Actual Behavior

Health check fails and deregisters within a minute.

Steps to Reproduce


  1. terraform init
  2. terraform apply -auto-approve

Important Factoids

  • Running on-prem on Ubuntu 18.04 VMWare guest.
  • Experienced same issue with Ubuntu 18.04 in GCP.
  • Health checks created via HTTP/curl work fine.

@remilapeyre (Collaborator)

Hi @vector623, thanks for opening this issue. It seems that the health check may not be the problem, as the issue still appears when I try without it.

I will investigate and let you know what I find.

@remilapeyre (Collaborator)

Hi @vector623, we've looked into it, and I think you are trying to register a service on a node where a Consul agent is running (an internal service). The consul_service resource was created to register external services: it adds the service to the Consul catalog but not to the local catalog of the agent. When the agent performs its anti-entropy sync, it finds a service in the catalog that it knows nothing about and removes it:

Mar 25 18:57:23 srv-pro-schd-05 consul[32576]:     2020-03-25T18:57:23.181Z [DEBUG] agent: Node info in sync
Mar 25 18:57:23 srv-pro-schd-05 consul[32576]:     2020-03-25T18:57:23.183Z [INFO]  agent: Deregistered service: service=redis
Mar 25 18:57:23 srv-pro-schd-05 consul[32576]:     2020-03-25T18:57:23.184Z [INFO]  agent: Deregistered check: check=service:redis1

The documentation of the provider (https://www.terraform.io/docs/providers/consul/r/service.html) mentions this briefly:

If the Consul agent is running on the node where this service is registered, it is not recommended to use this resource.

This is not related to the health check; you should see the same behaviour when registering the service without any health checks.

You mentioned that the same service created using cURL works. I think you are creating it with the /v1/agent/service/register endpoint rather than the /v1/catalog/register endpoint that consul_service uses. Could you confirm that?
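
For illustration (the payloads and addresses below are placeholders, not taken from the debug output), the two registration paths look roughly like this:

# Agent-level registration: the local agent owns the service and runs the check itself.
curl -X PUT http://127.0.0.1:8500/v1/agent/service/register -d '{
  "Name": "redis",
  "Port": 6379,
  "Check": {
    "HTTP": "https://www.hashicorptest.com",
    "Interval": "5s",
    "Timeout": "1s"
  }
}'

# Catalog-level registration (what consul_service does): the entry goes straight into the
# catalog, so an agent already running on that node removes it during its anti-entropy sync.
curl -X PUT http://127.0.0.1:8500/v1/catalog/register -d '{
  "Node": "srv-pro-svrg-01",
  "Address": "10.0.0.10",
  "Service": {
    "Service": "redis",
    "Port": 6379
  }
}'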

The consul_agent_service resource can be used to register an internal service, but it was marked as deprecated and does not support health checks at the moment. I'm wondering if we should roll back this deprecation.
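
For reference, a minimal consul_agent_service definition is just a sketch like the following (note that it takes no check block):

resource "consul_agent_service" "redis" {
  name = "redis"
  port = 6379
  tags = ["primary"]
}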

@chris-aeviator

@remilapeyre wouldn't it be possible to combine the two and abstract that complexity away from users? Healthchecks ftw!!

mbrav commented Jul 11, 2023

I still cannot get TCP health checks working, let alone HTTP health checks. Let's take two services as an example: Prometheus, which has to be checked with a TCP check on port 9090, and Grafana, which can be checked with a GET /api/health request on port 3000.

Tested on Consul v1.15.3

Curl Checks

I have Prometheus running on IP 192.168.55.120:

$ curl -i 192.168.55.120:9090
HTTP/1.1 302 Found
Content-Type: text/html; charset=utf-8
Location: /graph
Date: Tue, 11 Jul 2023 18:15:55 GMT
Content-Length: 29

<a href="/graph">Found</a>.

If Prometheus is answering over HTTP, it is certainly accepting TCP connections as well.

I have Grafana running on IP 192.168.55.121:

$ curl -i 192.168.55.121:3000/api/health
HTTP/1.1 200 OK
Cache-Control: no-store
Content-Type: application/json; charset=UTF-8
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-Xss-Protection: 1; mode=block
Date: Tue, 11 Jul 2023 18:17:30 GMT
Content-Length: 71

{
  "commit": "5a30620b85",
  "database": "ok",
  "version": "10.0.1"
}

Grafana is working as well.

Configuring healthchecks with terraform-provider-consul

Now let's create the necessary service and health check resources.

Prometheus configuration

Configuring Health checks for Prometheus

resource "consul_node" "node" {
  count      = 1
  datacenter = "dc1"
  address    = "192.168.55.120"
  name       = "prometheus01"
}

resource "consul_service" "svc" {
  count = 1

  name       = "prometheus01"
  node       = "prometheus01"
  address    = "192.168.55.120"
  datacenter = "dc1"
  port       = 9090

  check {
    check_id                          = "service:prometheus01"
    name                              = "Prometheus Health Check"
    notes                             = "Checks for a TCP connection on port 9090"
    tcp                               = "192.168.55.120:9090"
    interval                          = "10s"
    timeout                           = "2s"
    deregister_critical_service_after = "60s"
  }
}

Prometheus results

[Screenshot: Prometheus check results in the Consul UI]

Grafana configuration

Configuring Health checks for Grafana

resource "consul_node" "node" {
  datacenter = "dc1"
  address    = "192.168.55.121"
  name       = "grafana01"
}

resource "consul_service" "svc" {
  name       = "grafana01"
  node       = "grafana01"
  address    = "192.168.55.121"
  datacenter = "dc1"
  port       = 3000

  check {
    check_id                          = "service:grafana01"
    name                              = "Grafana Health Check"
    http                              = "/api/health"
    notes                             = "Checks for a GET /api/health request on port 3000"
    tls_skip_verify                   = true
    method                            = "GET"
    interval                          = "10s"
    timeout                           = "2s"
    deregister_critical_service_after = "30s"

    header {
      name  = "Accept"
      value = ["application/json"]
    }
  }
}

Grafana results

[Screenshot: Grafana check results in the Consul UI]

Conclusion

With what has been demonstrated above, I have three questions:

  1. How is this still an issue after 3 years?
  2. As our company is in the process of adopting Enterprise versions of HashiCorp products, what will support look like, taking the first question into account?
  3. Is question number two rhetorical?

Relevant issues: #124

@brucellino

Hi @mbrav. Not sure if I'm doing archaeology here, but I just struggled through this myself. This looks like a non-issue to me, although it didn't at first. It's a non-issue because, although the service and its health check are declared, there is no external service monitor to actually perform the health checks.

I run consul_esm on my Nomad cluster to perform the health checks.

So, registered services start off critical, but are updated to healthy as they are discovered by consul-esm and their health checks are performed.
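
As a rough sketch of that setup: by default consul-esm only watches catalog nodes that carry the node meta "external-node" = "true" (and "external-probe" = "true" additionally enables node-level ping probes), so the nodes registered via Terraform need matching meta, for example:

resource "consul_node" "prometheus" {
  datacenter = "dc1"
  address    = "192.168.55.120"
  name       = "prometheus01"

  # consul-esm's default node-meta filter; only nodes tagged like this are monitored externally.
  meta = {
    "external-node"  = "true"
    "external-probe" = "true"
  }
}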
