
Autoscaler Nomad Plugin Doesn't Take Into Consideration CPU or Memory requested by blocked evals #584

Open
rorylshanks opened this issue May 24, 2022 · 2 comments


@rorylshanks

Hey everyone, this is an awesome project! However, while using it we found a small issue with the Nomad APM plugin.

Nomad now exposes the following metrics:

nomad.nomad.blocked_evals.cpu
nomad.nomad.blocked_evals.memory

These represent the amount of CPU and memory requested by blocked and unplaced evals. Currently the Nomad APM plugin only reads the nodes and the allocs that are already placed, but it should also take into account whether the cluster needs to scale up because of unplaced evals.
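For context, a cluster scaling check against the Nomad APM plugin today looks roughly like the sketch below. This is only illustrative: the check name and target are made up, and percentage-allocated_cpu is the kind of allocation-based query the plugin answers purely from node and placed-alloc data.

    check "cpu_allocated_percentage" {
      source = "nomad-apm"
      query  = "percentage-allocated_cpu"

      strategy "target-value" {
        target = 70
      }
    }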

Currently we can use Prometheus to get around this, but we found that using the Nomad API directly was significantly more robust for cluster autoscaling, so ideally the Nomad APM plugin would take this into account.
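For reference, the Prometheus workaround is roughly a policy check shaped like the one below. This is a sketch under assumptions: it presumes every agent exposes Prometheus-format telemetry that is scraped into a Prometheus server the Autoscaler can query, and that the gauge is exported as nomad_nomad_blocked_evals_cpu; the check name and target value are purely illustrative.

    # Sketch only: metric name, check name, and target are assumptions.
    check "scale_up_on_blocked_cpu" {
      source       = "prometheus"
      query_window = "5m"
      query        = "sum(nomad_nomad_blocked_evals_cpu)"

      strategy "target-value" {
        target = 0.9
      }
    }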

Thanks!

@lgfa29
Contributor

lgfa29 commented May 26, 2022

Hi @megalan247 👋

Those would definitely be useful metrics to have, but unfortunately there isn't a good way to read them from the Nomad API.

The Nomad APM plugin uses Nomad's REST API to retrieve values, not client metrics. We have to do it this way because metrics are exposed per agent: when you query /v1/metrics, you only receive the values for the Nomad agent you sent the request to. The Autoscaler would therefore need access to, and be able to scrape metrics from, every agent in your cluster, which may not be feasible.

To make things worse, these metrics are only emitted by the cluster leader, so the Autoscaler would need access to the Nomad server APIs, which some environments don't allow, especially if the Autoscaler is running in Nomad itself as an allocation.

From Nomad's perspective, the problem is that these metrics are not persisted in the state store; they are only available as in-memory metrics, so it is not possible to query them like the rest of the information.

So the only solution for now is to use an APM that is able to scrape/receive and aggregate metrics from all agents in your cluster; otherwise you will only have, at best, partial data.

I will keep this open in case things change in the future, but unfortunately I think it will take a while before we're able to get to it.

@Cbeck527

Cbeck527 commented Jul 8, 2022

For anyone stumbling upon this looking for more info, my team managed to come up with something that we think works for us, inspired by the config posted in a seemingly unrelated issue.

Prerequisite: you have all of your agents configured to send metrics. We use DataDog, so our telemetry {} block looks something like this:

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  datadog_address            = "localhost:8125"
  disable_hostname           = true
  collection_interval        = "10s"
}

For a given AWS ASG that we want to scale, we have two checks based on the metrics @megalan247 mentioned:
(note: the {{ }} are variables populated by our config management)

    check "scale_up_on_exhausted_cpu" {
      source = "datadog"
      query_window = "5m"
      query = "default_zero(default_zero(sum:nomad.nomad.blocked_evals.cpu{environment:{{ environment }},node_class:{{ node_class }}})/default_zero(sum:nomad.nomad.blocked_evals.cpu{environment:{{ environment }},node_class:{{ node_class }}}))"

      strategy "target-value" {
        target = 0.9
      }
    }

    check "scale_up_on_exhausted_memory" {
      source = "datadog"
      query_window = "5m"
      query = "default_zero(default_zero(sum:nomad.nomad.blocked_evals.memory{environment:{{ environment }},node_class:{{ node_class }}})/default_zero(sum:nomad.nomad.blocked_evals.memory{environment:{{ environment }},node_class:{{ node_class }}}))"

      strategy "target-value" {
        target = 0.9
      }
    }

Initial testing looks good. If a deployment is blocked because of resource exhaustion, our metric jumps and the autoscaler reacts appropriately:


2022-07-07T19:23:52.099Z [INFO]  policy_eval.worker: scaling target: id=e12f7c10-4292-ace8-6872-833b95344800 policy_id=cf12159c-94b3-6156-6b93-19e1f2e7d87f queue=cluster target=aws-asg from=9 to=10 reason="scaling up because factor is 1.111111" meta=map[nomad_policy_id:cf12159c-94b3-6156-6b93-19e1f2e7d87f]
2022-07-07T19:24:12.933Z [INFO]  internal_plugin.aws-asg: successfully performed and verified scaling out: action=scale_out asg_name=workers desired_count=10

Totally open to any and all feedback on this approach from the maintainers or other folks who have successfully solved this! And last, credit where credit is due: thank you @baxor! 🎉
