
Frequently Asked Questions

How does this differ from the old collectd agent?

Upon its initial release, the new agent, called the SignalFx Smart Agent, was essentially a wrapper application around collectd that added service discovery and automatic configuration of collectd based on the discovered services. Most system metrics are generated by collectd, as are most application metrics. Configuration of collectd monitors is largely a passthrough to collectd config options, but in YAML format instead of collectd's custom syntax.

The first main foray outside of collectd was the Kubernetes integration, which uses monitors and observers written purely in Go that run completely independently of collectd. New monitors can now be written independently of collectd to overcome some of the limitations we have with that tool.

What if I am currently using the old collectd agent?

The Smart Agent comes with all of its dependencies bundled, so you will not need a prior collectd installation. If you are currently using the old collectd agent, uninstall it before installing the Smart Agent. To minimize the load on your host, make sure the old collectd instance does not run alongside the new agent; running both uses unnecessary resources.

If you have your own homegrown collectd plugins, you can still use them with the Smart Agent via the collectd/custom monitor. You can reuse the configuration files from your original collectd managed_config directory by adding the following monitor:

monitors:
  - type: collectd/custom
    templates:
    - {"#from": "/etc/collectd/managed_config/*.conf", flatten: true, raw: true}

We run collectd-python linked against Python 2.7, so any Python plugins must be compatible with Python 2.7.

How can I see the datapoints emitted by the agent to troubleshoot issues?

There are two ways: you can either set a config option in the agent to dump datapoints to the agent logs, or you can use the signalfx-agent tap-dps subcommand to stream them to a separate console.

Log Dump

To dump datapoints to the logs, set the following config in the agent.yaml config file:

logging:
  level: debug
writer:
  logDatapoints: true

Datapoint Tap

You can also dump a stream of datapoints to a separate console by running the signalfx-agent tap-dps command on the same host as the running agent. Run signalfx-agent tap-dps -h for more information.
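For example, on a host install you might run the following (as with the status command below, sudo is not needed for the containerized agent):

$ sudo signalfx-agent tap-dps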

How can I see what services the agent has discovered?

Run the following command on the host with the agent. (If you are using the containerized agent, you don't need to use sudo.)

$ sudo signalfx-agent status endpoints

This command prints a list of the discovered service endpoints that the agent knows about.

Why do other pods in my Kubernetes cluster get stuck terminating?

When running the agent in K8s, we have seen issues where the prescribed host filesystem mount to /hostfs inside the agent pod prevents termination of other pods on the same node. It appears to be the same issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1437952 with fluentd containers. The best thing to do in this case is to unmount the Docker/K8s-related mounts inside the agent container by using the following container command in the agent's DaemonSet instead of the default /bin/signalfx-agent, as well as by adding the SYS_ADMIN capability to the agent container:

...
      containers:
      - command:
        - /bin/bash
        - -c
        - /bin/umount-hostfs-non-persistent; exec /bin/signalfx-agent
        name: signalfx-agent
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
        ...
    ...
...

The source for the /bin/umount-hostfs-non-persistent script can be found here; it simply runs umount on all of the potentially problematic mounts that we know of. You can add arguments to the script invocation for any additional directories that need to be unmounted.
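For example, to also unmount an extra directory (the /hostfs/var/lib/example path below is purely illustrative), you could extend the command from the DaemonSet snippet above like this:

      - command:
        - /bin/bash
        - -c
        - /bin/umount-hostfs-non-persistent /hostfs/var/lib/example; exec /bin/signalfx-agent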

Note that in order to unmount filesystems, you must have the SYS_ADMIN capability. Because it requires such a broad capability, we don't do the unmounting by default in order to keep the agent's permissions limited.

We need to mount the host filesystem into the agent pod in order to get disk usage metrics for each individual disk on the node, but unfortunately K8s offers no way to be more selective about what gets mounted.

How do I monitor CPU usage for Kubernetes pods that have CPU limits?

CPU usage in Kubernetes (or really any environment where process/container CPU throttling is active) can be a bit tricky to monitor, since the usual metrics for a container's CPU utilization are absolute values of CPU consumed (e.g. the Docker cpu.percent metric or the container_cpu_utilization metric from cAdvisor), without regard for the cgroup limits set by K8s and Docker.

See Resource Quality of Service in Kubernetes for an explanation of requests and limits and how they work in K8s. See CFS Bandwidth Control for more low-level information on how K8s limits are imposed via the Linux kernel.

The primary metrics for container CPU limits are:

  • container_cpu_cfs_throttled_time: The amount of time (in nanoseconds) that a container's processes have spent throttled
  • container_cpu_usage_seconds_total: The total amount of time (in nanoseconds) that a container's processes have spent executing -- this metric is equivalent to container_cpu_utilization * 10,000,000.
  • container_spec_cpu_period: The CFS period length (in microseconds) -- the length of time for which the CFS scheduler considers process usage. This is typically 100,000 microseconds or 0.1 seconds. This value cannot exceed 1 second.
  • container_spec_cpu_quota: The CFS quota (in microseconds) -- a process can run for this amount of time within a given CFS period. The value for a given container is derived by dividing the millicore limit by 1000 and multiplying by the CFS period (e.g. a K8s limit of 500m translates to a quota of 50,000 microseconds, assuming a period of 100,000 microseconds).

The first two metrics are cumulative counters that keep growing, so the easiest way to use them is to look at how much they change per second by using the rate rollup (the default rollup is delta when you look at these metrics in SignalFx). The last two are gauges and generally don't change for the lifetime of the container.

The maximum share of each second that a process can spend executing is equal to container_spec_cpu_quota/container_spec_cpu_period. For example, a process with a quota of 50,000µs and a period of 100,000µs could execute for no more than half a second, every second. More specifically, within each discrete 100ms window in that second, the process can execute for no more than 50ms. In other words, the rate/sec rollup of container_cpu_usage_seconds_total should never exceed 500,000,000 nanoseconds with such a limit. Note that the quota can be larger than the period, which means that a process could consume more than an entire core's worth of execution per period.
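As a compact worked example, assuming the typical 100,000µs period and a K8s limit of 500m (a 50,000µs quota):

container_spec_cpu_quota/container_spec_cpu_period = 50,000/100,000 = 0.5

so such a container can use at most 0.5 seconds (500,000,000 nanoseconds) of CPU time per second of wall time.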

There are two ways that a container process might be exceeding its limit:

  1. The process is being throttled continually and does not have enough CPU to accomplish everything it needs to. The rate of container_cpu_usage_seconds_total stays maxed out at the level allowed by the quota/period formula above for long periods of time. This is a starving process.

  2. The process is bursty and needs a lot of CPU for short periods, so it might get throttled within a short time window but is always able to complete execution without backing up indefinitely. The process could do things faster if it had a higher limit, but is not starving for CPU.

Case #1 is almost always a bad situation that should be remedied by some combination of 1) optimizing the application, 2) launching more instances of it if the workload can be distributed (horizontal scaling), or 3) increasing the CPU limit (and potentially the CPU request) on the container (vertical scaling). Case #2 may or may not be a problem, depending on how time-sensitive the workload is.

To monitor case #1, you can use the formula (taking container_cpu_usage_seconds_total as its rate/sec rollup)

(container_cpu_usage_seconds_total/10000000)/(container_spec_cpu_quota/container_spec_cpu_period)

to get the percentage of CPU used relative to the limit (0 to 100+). This value can actually exceed 100 because the agent's sampling does not happen on a perfectly exact interval.
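For example, with the assumed 500m limit from above (a 50,000µs quota and 100,000µs period) and an assumed usage rate of 400,000,000 nanoseconds per second, this works out to

(400,000,000/10,000,000)/(50,000/100,000) = 40/0.5 = 80

i.e. the container is using roughly 80% of its CPU limit.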

For case #2 you need to factor in the container_cpu_cfs_throttled_time metric. The usage-relative-to-limit value above will be under 100 in this case, but that doesn't mean throttling isn't happening. You can simply look at container_cpu_cfs_throttled_time using the rate rollup, which tells you the raw amount of time a container is spending throttled. If you have many processes/threads in a container, this number could be very high. You can compare throttle time to usage time with the formula

container_cpu_cfs_throttled_time/container_cpu_usage_seconds_total

or the equivalent

container_cpu_cfs_throttled_time/(container_cpu_utilization*10000000)

which gives you the ratio of time the container's processes spent waiting to execute vs. the time they spent actually executing. Anything over 1 means that the process is spending more time waiting than actually executing.
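For example, assuming a throttled-time rate of 200,000,000 nanoseconds per second together with the 400,000,000 nanoseconds per second usage rate assumed above, the ratio is

200,000,000/400,000,000 = 0.5

i.e. the container's processes spent half as much time waiting to run as they spent actually executing -- under the threshold of 1, but still a notable amount of throttling.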