Autoscaler completely idle after nomad-client restart #642

Open
bernardoVale opened this issue May 18, 2023 · 3 comments

Comments

@bernardoVale

It looks like #514 but we're using the latest version v0.3.7:

# nomad-autoscaler version
Nomad Autoscaler v0.3.7 (90ad44d)

Steps to Reproduce

Configure the Autoscaler with network_mode = "host".

Use 127.0.0.1:5656 as the Nomad address:

nomad {
  address = "http://127.0.0.1:5656"
}

telemetry {
  prometheus_metrics = true
  disable_hostname   = true
}

strategy "target-value" {
  driver = "target-value"
}

strategy "pass-through" {
  driver = "pass-through"
}

Wait for a few evaluations and restart nomad-client. You should see logs like this:

2023-05-18T12:19:17.564Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=18c3ce32-3247-4673-c49e-dd12719200b0 error="failed to get policy: Unexpected response code: 404 (policy not found)"
2023-05-18T12:19:17.568Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=fa573b85-9ab6-952c-8deb-611a59f64d32 error="failed to get policy: Unexpected response code: 404 (policy not found)"
2023-05-18T12:19:17.689Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Get "http://127.0.0.1:5656/v1/scaling/policies?index=2558031&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:17.795Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=fa573b85-9ab6-952c-8deb-611a59f64d32 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/fa573b85-9ab6-952c-8deb-611a59f64d32?index=2555795&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:17.886Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=3c10001b-0402-743b-5bfd-73577527d9ac error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/3c10001b-0402-743b-5bfd-73577527d9ac?index=2426098&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:17.993Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=e88ba06c-9631-2a48-b28d-050f2b7d15bb error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/e88ba06c-9631-2a48-b28d-050f2b7d15bb?index=2343116&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:17.996Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=39524417-0f87-cd42-3e1a-01374abc50fa error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/39524417-0f87-cd42-3e1a-01374abc50fa?index=2378430&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:18.095Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=06610fc9-966d-8e19-3055-b0ef4d585440 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/06610fc9-966d-8e19-3055-b0ef4d585440?index=2176899&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:18.194Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=8bf225c1-5a57-7ce1-d13e-ce63f7de45de error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/8bf225c1-5a57-7ce1-d13e-ce63f7de45de?index=2408453&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:18.294Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=854225b7-34cd-e54c-7f0f-c06fd8be8903 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/854225b7-34cd-e54c-7f0f-c06fd8be8903?index=2495493&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:18.495Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=18e2c333-3585-9ec5-aa23-bd149f04d0eb error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/18e2c333-3585-9ec5-aa23-bd149f04d0eb?index=2343094&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:18.497Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=56ab1c1d-81c7-f52a-7902-0792c8e3d657 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/56ab1c1d-81c7-f52a-7902-0792c8e3d657?index=2453803&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:18.591Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=d3d5e9c4-cb88-a87b-5fb3-a3b4dfeaa882 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/d3d5e9c4-cb88-a87b-5fb3-a3b4dfeaa882?index=2453232&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:18.694Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=3f8bafc8-7d7d-ad99-a73f-2a76042e8922 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/3f8bafc8-7d7d-ad99-a73f-2a76042e8922?index=2081893&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:18.890Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=7bc11475-e607-a45a-d0f1-a335362f76ce error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/7bc11475-e607-a45a-d0f1-a335362f76ce?index=2548812&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:18.996Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=42c70066-c068-9adb-3b25-ab8dda559a5d error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/42c70066-c068-9adb-3b25-ab8dda559a5d?index=2160200&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:19.192Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=b2ac7fa7-c2ae-9b71-8498-eea022b84a2a error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/b2ac7fa7-c2ae-9b71-8498-eea022b84a2a?index=2036167&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:19.296Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=9a25ef5d-5264-08f4-2df2-df8a2a50e7a7 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/9a25ef5d-5264-08f4-2df2-df8a2a50e7a7?index=2325794&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:19.592Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=14634d85-f6fc-8b26-1748-cf4d36ef2a60 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/14634d85-f6fc-8b26-1748-cf4d36ef2a60?index=2426100&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:19.594Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=3e2681aa-68d2-41ae-f4e0-cd1dcd0b8ba3 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/3e2681aa-68d2-41ae-f4e0-cd1dcd0b8ba3?index=2555707&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:19.896Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=093e54aa-d0ef-1bf0-b1cd-456147e1e2a6 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/093e54aa-d0ef-1bf0-b1cd-456147e1e2a6?index=2272762&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:19.995Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=4e5789ba-f1de-df74-5161-7fbea431a50d error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/4e5789ba-f1de-df74-5161-7fbea431a50d?index=2548836&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:20.190Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=18c3ce32-3247-4673-c49e-dd12719200b0 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/18c3ce32-3247-4673-c49e-dd12719200b0?index=2558031&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:20.192Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=41a62628-3374-f7dc-e904-8dbec6dffaa7 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/41a62628-3374-f7dc-e904-8dbec6dffaa7?index=2007902&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:20.195Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=c99b6168-66a0-6aaa-ec0c-da482ba586ed error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/c99b6168-66a0-6aaa-ec0c-da482ba586ed?index=2426096&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:20.395Z [ERROR] policy_manager.policy_handler: encountered an error monitoring policy: policy_id=af2adcdd-ffc6-8c54-e71d-dbbf55543a77 error="failed to get policy: Get "http://127.0.0.1:5656/v1/scaling/policy/af2adcdd-ffc6-8c54-e71d-dbbf55543a77?index=2523637&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"

Then finally, these lines:

2023-05-18T12:19:27.690Z [INFO]  policy_manager: starting policy source: source=nomad
2023-05-18T12:19:27.690Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Get "http://127.0.0.1:5656/v1/scaling/policies?index=1&namespace=default&region=global&wait=300000ms": dial tcp 127.0.0.1:5656: connect: connection refused"
2023-05-18T12:19:37.691Z [INFO]  policy_manager: starting policy source: source=nomad

After the lines above, it reports zero active policies:
[Screenshot: Screen Shot 2023-05-18 at 09 35 11]

SIGABRT dump
nomad-autoscaler-sigabrt-dev.log

@lgfa29
Contributor

lgfa29 commented Jul 15, 2023

Hi @bernardoVale 👋

Could you verify if you can access that endpoint manually with something like curl from within the Autoscaler container?

If this works, then I wonder if the pool of connections in the Nomad client SDK needs to be refreshed somehow 🤔

@bernardoVale
Author

> Could you verify if you can access that endpoint manually with something like curl from within the Autoscaler container?

Yes, it works fine. The problem is that the Autoscaler is not handling these temporary network interruptions gracefully.

@lgfa29
Contributor

lgfa29 commented Jul 19, 2023

I suspected as much, so thank you for the confirmation.

We'll need some time to investigate this further and we will let you know if we need more information.
