Could Not Expand Globs - Context Cancelled #479

Open
gkramer opened this issue Jul 18, 2022 · 30 comments

@gkramer

gkramer commented Jul 18, 2022

Error in Log:
[2022-07-18T20:59:06.441Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=protobuf&from=1658156341&target=MyTarget%24JmxTimer.update_session_exception_none.999thPercentile&until=1658177941", "peer": "10.128.27.189:50684", "carbonapi_uuid": "e5737414-2d9a-4d85-b89c-a253b13380dc", "format": "carbonapi_v2_pb", "targets": ["MyTarget$JmxTimer.update_session_exception_none.999thPercentile"], "runtime_seconds": 0.000098906, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}

I'm also seeing 'find failed' with the following reasons:
"reason": "Internal error while processing request", "error": "could not expand globs - context canceled", "http_code": 500
"reason": "Internal error while processing request", "http_code": 500

It should be noted that some of these queries are trying to merge 80+ graphs at runtime, which may be contributing to the issue.

Please also note that we've turned off indexing, as we have >30TB of whisper data which results in enormous amounts of RAM utilisation.

Any assistance in resolving these issues would be really appreciated!

@gkramer
Author

gkramer commented Jul 20, 2022

@bom-d-van Any chance you could shed some light on this? :)

@bom-d-van
Member

it's most likely hitting a timeout. how many whisper files do you have on the server? are you able to ls the whisper files?

@gkramer
Author

gkramer commented Jul 25, 2022

@bom-d-van It's not exhausting IOps, CPU or RAM. There are 30+ TB of whisper files, so a lot! Any suggestions for improving performance?

@Civil
Member

Civil commented Jul 25, 2022

@gkramer the problem with the no-index approach is that you rely on the OS being able to give you a list of files fast.

However, depending on your filesystem and other factors, a stat syscall on a folder that has a lot of entries might take time.

That is also the reason why @bom-d-van asked you to measure how fast ls works in the same whisper tree. If it's also slow, then there is not much you can do about performance (you can try to split your whisper tree into multiple sub-trees, for example, but that way you'll add more problems to solve, like load balancing for ingestion on top of multiple go-carbons).
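For reference, a rough way to check that from the shell might look like this (the data-dir path is taken from the go-carbon config shared further down; adjust to your layout):

time ls /data/graphite/storage/whisper > /dev/null
time find /data/graphite/storage/whisper -maxdepth 2 -type d | wc -l

If those take seconds rather than milliseconds, directory traversal itself is likely the bottleneck.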

@gkramer
Author

gkramer commented Jul 26, 2022

@Civil Thank you for the response. So the most efficient approach is to add more RAM [in order to enable indexing] to the system? (I find myself wondering about the impact of using swap, if only to validate the indexing approach temporarily - this is the primary service running on the box)

side points:

  • Are there any timeouts that can be bumped up to reduce the likelihood of errors caused in this regard? [I've pushed up all the relevant timeouts I could find in the config, all with little/no benefit]
  • Surely the logic could be improved to only run syscalls on the relevant child directories/namespaces, thereby reducing the importance of the index?
  • We're currently using (AWS) EBS storage with gp3, and I've considered using io2, but I doubt it will improve performance sufficiently to resolve the problem above: seems like increased RAM is the best approach, albeit an expensive one... [me: ponders using swap a little more]
  • How efficient is the in-memory indexing? If I were to swap to disk, would I still see a net benefit in response times?

@gkramer
Author

gkramer commented Aug 1, 2022

@Civil @bom-d-van The problem is back, and indexing is turned on - both! Any ideas? Should I look to upgrade the disks to host-based SSD? It wouldn't appear that this is the bottleneck, as IOps never spike above 4k, and the allocation is 8k. I'm also seeing upstream issues with carbonapi, but I'd prefer to resolve the go-carbon issues first.

Do you know if the developers have stress-tested the daemon for parallel queries - as in aggregating 50+ wsp files? I'd really like to get this issue resolved and behind me :)

Thanks for all your advice and insight to date!

@bom-d-van
Member

The problem is back, and indexing is turned on - both! Any ideas?

hi, can you share the config for go-carbon for us to have more context?

Do you know if the developers have stress-tested the daemon for parallel queries - as in aggregating 50+ wsp files? I'd really like to get this issue resolved and behind me :)

50+ metrics should not be a problem in my experience. but usually the servers that I have seen running go-carbon have 96-128GB of memory and 16+ CPU cores. so it feels like you might have to either scale out or scale up.

What values do you have for the carbonserver timeouts below?

[carbonserver]
# Read and Write timeouts for HTTP server
read-timeout = "60s"
write-timeout = "60s"
# Request timeout for each API call
request-timeout = "60s"

# request-timeout would override the global request-timeout.
#
# [[carbonserver.api-per-path-rate-limiters]]
# path = "/metrics/render/"
# max-inflight-requests = 1
# request-timeout = "60s"

@gkramer
Author

gkramer commented Aug 1, 2022

@bom-d-van

go-carbon.conf

[common]
user = "graphite"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 8

[whisper]
data-dir = "/data/graphite/storage/whisper"
schemas-file = "/data/graphite/conf/storage-schemas.conf"
aggregation-file = "/data/graphite/conf/storage-aggregation.conf"
workers = 8
max-updates-per-second = 0
max-creates-per-second = 0
hard-max-creates-per-second = false
sparse-create = false
flock = true
enabled = true
hash-filenames = true
compressed = false
remove-empty-file = false

[cache]
max-size = 1000000    # 1000000
write-strategy = "max"

[udp]
listen = ":2503"
enabled = true
buffer-size = 0

[tcp]
listen = ":2503"
enabled = true
buffer-size = 0

[pickle]
listen = ":2504"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7502"
enabled = true
read-timeout = "30s"

[grpc]
listen = "127.0.0.1:7503"
enabled = true

[tags]
enabled = false
tagdb-url = "http://127.0.0.1:8000"
tagdb-chunk-size = 32
tagdb-update-interval = 100
local-dir = "/var/lib/graphite/tagging/"
tagdb-timeout = "1s"

[carbonserver]
listen = "0.0.0.0:8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "120s"
write-timeout = "60s"
query-cache-enabled = true
query-cache-size-mb = 0  # FIXME: Was: 0 / 1000
find-cache-enabled = true
trigram-index = true # FIXME: true
scan-frequency = "5m0s"
trie-index = true # FIXME: false
file-list-cache = ""
concurrent-index = false
realtime-index = 0
max-inflight-requests = 0
no-service-when-index-is-not-ready = false
cache-scan = false
max-globs = 100
fail-on-max-globs = false
max-metrics-globbed  = 1000000 # 20000
max-metrics-rendered = 500000
graphite-web-10-strict-mode = true
empty-result-ok = true
internal-stats-dir = ""
stats-percentiles = [99, 98, 95, 75, 50]

[dump]
enabled = false
path = "/var/lib/graphite/dump/"
restore-per-second = 0

[pprof]
listen = "localhost:7007"
enabled = false

[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "debug"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

carbonapi.conf

listeners:
        - address: "0.0.0.0:8081"
listen: ":8081"

prefix: ""
useCachingDNSResolver: false
cachingDNSRefreshTime: "1m"
expvar:
  enabled: true
  pprofEnabled: false
  listen: ""
headersToPass:
  - "X-Dashboard-Id"
  - "X-Grafana-Org-Id"
  - "X-Panel-Id"
headersToLog:
  - "X-Dashboard-Id"
  - "X-Grafana-Org-Id"
  - "X-Panel-Id"
define:
  -
    name: "perMinute"
    template: "perSecond({{.argString}})|scale(60)"
notFoundStatusCode: 200
concurency: 1000
cache:
   # Type of caching. Valid: "mem", "memcache", "null"
   type: "mem"
   # Cache limit in megabytes
   size_mb: 0
   # Default cache timeout value. Identical to DEFAULT_CACHE_DURATION in graphite-web.
   defaultTimeoutSec: 60
   # Only used by memcache type of cache. List of memcache servers.
   memcachedServers:
       - "127.0.0.1:1234"
       - "127.0.0.2:1235"
cpus: 0
tz: ""

functionsConfig:
    graphiteWeb: /etc/go-carbon/graphTemplates.yaml
    timeShift: /etc/go-carbon/timeShift.yaml

maxBatchSize: 100
graphite:
    # Host:port where to send internal metrics
    # Empty = disabled
    host: "127.0.0.1:2123"
    interval: "60s"
    prefix: "carbon.api"
    # rules on how to construct metric name. For now only {prefix} and {fqdn} is supported.
    # {prefix} will be replaced with the content of {prefix}
    # {fqdn} will be replaced with fqdn
    pattern: "{prefix}.{fqdn}"
idleConnections: 10
pidFile: ""
upstreams:
    # Use TLD Cache. Useful when you have multiple backends that could contain
    # different TLDs.
    #
    # For example whenever you have multiple top level metric namespaces, like:
    #   one_min.some.metric
    #   ten_min.some_metric
    #   one_hour.some_metric
    #
    # `one_min`, `ten_min` and `one_hour` are considered to be TLDs
    # carbonapi by default will probe all backends and cache the responses
    # and will know which backends would contain the prefix of the request
    #
    # This option allows to disable that, which could be helpful for backends like
    # `clickhouse` or other backends where all metrics are part of the same cluster
    tldCacheDisabled: false

    # Number of 100ms buckets to track request distribution in. Used to build
    # 'carbon.zipper.hostname.requests_in_0ms_to_100ms' metric and friends.
    # Requests beyond the last bucket are logged as slow (default of 10 implies
    # "slow" is >1 second).
    # The last bucket is _not_ called 'requests_in_Xms_to_inf' on purpose, so
    # we can change our minds about how many buckets we want to have and have
    # their names remain consistent.
    buckets: 10

    # If request took more than specified amount of time, it will be logged as a slow request as well
    slowLogThreshold: "10s"

    timeouts:
        # Maximum backend request time for find requests.
        find: "120s"
        # Maximum backend request time for render requests. This is total one and doesn't take into account in-flight requests
        render: "120s"
        # Timeout to connect to the server
        connect: "500ms"

    # Number of concurrent requests to any given backend - default is no limit.
    # If set, you likely want >= MaxIdleConnsPerHost
    concurrencyLimitPerServer: 0

    # Configures how often keep alive packets will be sent out
    keepAliveInterval: "30s"

    # Control http.MaxIdleConnsPerHost. Large values can lead to more idle
    # connections on the backend servers which may bump into limits; tune with care.
    maxIdleConnsPerHost: 100

    # Only affects cases with maxBatchSize > 0. If set to `false` requests after split will be sent out one by one, otherwise in parallel
    doMultipleRequestsIfSplit: false

    # "http://host:port" array of instances of carbonserver stores
    # It MUST be specified.
    backends:

    #backends section will override this one!
    backendsv2:
        backends:
          -
            groupName: "group1"
            # supported:
            #    carbonapi_v2_pb - carbonapi 0.11 or earlier version of protocol.
            #    carbonapi_v3_pb - new protocol, http interface (native)
            #    carbonapi_v3_grpc - new protocol, gRPC interface (native)
            #    protobuf, pb, pb3 - same as carbonapi_v2_pb
            #    msgpack - protocol used by graphite-web 1.1 and metrictank
            #    auto - carbonapi will do its best to guess if it's carbonapi_v3_pb or carbonapi_v2_pb
            #
            #  non-native protocols will be internally converted to new protocol, which will increase memory consumption
            protocol: "carbonapi_v3_pb"
            # supported:
            #    "broadcast" - send request to all backends in group and merge responses. This was default behavior for carbonapi 0.11 or earlier
            #    "roundrobin" - send request to one backend.
            #    "all - same as "broadcast"
            #    "rr" - same as "roundrobin"
            lbMethod: "broadcast"
            # amount of retries in case of unsuccessful request
            maxTries: 3
            # amount of metrics per fetch request. Default: 0 - unlimited. If not specified, global will be used
            maxBatchSize: 0
            # interval for keep-alive http packets. If not specified, global will be used
            keepAliveInterval: "10s"
            # override for global concurrencyLimit.
            concurrencyLimit: 0
            # override for global maxIdleConnsPerHost
            maxIdleConnsPerHost: 1000
            # force attempt to establish HTTP2 connection, instead of http1.1. Default: false
            # Backends must use https for this to take any effect
            forceAttemptHTTP2: false
            # Only affects cases with maxBatchSize > 0. If set to `false` requests after split will be sent out one by one, otherwise in parallel
            doMultipleRequestsIfSplit: false
            # per-group timeout override. If not specified, global will be used.
            # Please note that ONLY min(global, local) will be used.
            timeouts:
                # Maximum backend request time for find requests.
                find: "300s"
                # Maximum backend request time for render requests. This is total one and doesn't take into account in-flight requests.
                render: "120s"
                # Timeout to connect to the server
                connect: "500ms"
            servers:
                - "http://127.0.0.1:8080"



    # carbonsearch is not used if empty
    carbonsearch:
        # Instance of carbonsearch backend
        # carbonsearch prefix to reserve/register
        # carbonsearch is not used if empty
    # carbonsearch section will override this one!
    carbonsearchv2:
        # Carbonsearch instances. Follows the same syntax as backendsv2
        backends:
            -
              groupName: "group1"
              protocol: "carbonapi_v3_pb"
              lbMethod: "broadcast"
              servers:
                  - "http://127.0.0.1:8080"
        # carbonsearch prefix to reserve/register
        prefix: "virt.v1.*"

    # Enable compatibility with graphite-web 0.9
    # This will affect graphite-web 1.0+ with multiple cluster_servers
    # Default: disabled
    graphite09compat: false


graphTemplates: /etc/go-carbon/graphTemplates.yaml
expireDelaySec: 10
logger:
    - logger: ""
      file: "stderr"
      level: "debug"
      encoding: "console"
      encodingTime: "iso8601"
      encodingDuration: "seconds"
    - logger: ""
      file: "/var/log/go-carbon/carbonapi.log"
      level: "info"
      encoding: "json"

@bom-d-van
Member

bom-d-van commented Aug 1, 2022

try the configuration below to see if it helps; maybe you can also generate some Go profiling results like cpu, heap, and goroutine profiles if you are still seeing problems:

trigram-index = false
scan-frequency = "60m"
trie-index = true 
file-list-cache = "/data/graphite/storage/flc.bin"
file-list-cache-version = 2
concurrent-index = true
realtime-index = 0

that said, the chance is high that you might need to scale the servers for your load.
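If you do collect profiles, a minimal sketch of how to grab them (assuming the [pprof] section in your go-carbon config is flipped to enabled = true, which exposes the standard Go net/http/pprof handlers on localhost:7007):

# 30s CPU profile, a heap snapshot, and a goroutine dump
go tool pprof 'http://localhost:7007/debug/pprof/profile?seconds=30'
go tool pprof http://localhost:7007/debug/pprof/heap
curl -s 'http://localhost:7007/debug/pprof/goroutine?debug=2' > goroutines.txt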

@Civil
Member

Civil commented Aug 1, 2022

I think the key to understanding what the bottleneck is in your case is gathering metrics and some debug information from your server.

For example, you might suspect that it's I/O (indeed, whisper doesn't like slow I/O and thrives on SSDs; that's unfortunately by design).

I would collect some basic metrics, like I/O performance, iowait time, etc., and check the CPU usage with a breakdown by user, I/O and sys, of course; that would also give you some clues.

And in case sys usage is somewhat high, you can try to record a perf dump and maybe look at flamegraphs based on it (https://www.brendangregg.com/flamegraphs.html).

Doing so should show you where the problem is. For example, if you see high I/O and the top time-consuming syscalls in perf are disk-related (stat*, read*, write*, ...), then that is likely your bottleneck.
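A sketch of the usual perf/flamegraph workflow (the two scripts come from the FlameGraph repository linked above; the sampling rate and duration are just examples):

# sample stacks system-wide at 99 Hz for 60 seconds
perf record -F 99 -a -g -- sleep 60
perf script > out.perf
./stackcollapse-perf.pl out.perf | ./flamegraph.pl > go-carbon.svg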

@gkramer
Author

gkramer commented Aug 2, 2022

@bom-d-van I've run queries against the box whilst running iostat, and iowait barely ticks above 7%. carbonapi also seems to flatline. I see intermittent success in graph creation, but don't fully understand why it works at some times and not at others. I've also verified FD and file-max limits, and all seem fine. I'd be happy to scale the box, but I'm having trouble identifying what's under sufficient load to justify the work...

@bom-d-van
Member

bom-d-van commented Aug 2, 2022

Yeah, that sounds strange. Can you try figuring out the following questions and share the answers if possible? As Vladimir shared above, it's more of a generic debugging or root-cause-finding process.

  • How about cpu iowait? (never mind, you already answered it.)
  • Are there any errors in /var/log/go-carbon/go-carbon.log? (never mind, you have mentioned it in the description already.)
  • Is it happening to all queries or just to some queries?
  • How many metrics are matched by the query if you use the find api? Can you share the result?
  • What does the schema look like for the matching metrics? And what's the time range that you are trying to read? How many data points per metric are you trying to fetch?
  • Do you have any Grafana dashboards that contain the metrics reported by Go-Carbon? Can you share a screenshot? Things like cache.metrics, carbonserver.* and persister.* could be interesting to look around.
  • Maybe also creating a dashboard that has the system metrics from the servers that are running Go-Carbon and sharing a screenshot of it would be helpful too. (In my experience, having a single Grafana dashboard that contains all the critical system metrics is extremely valuable; Diamond collectors worked well for my production journey.)

Off-topic: We should create some sharable/open-source-able Grafana (and friends) dashboard formats for common metrics exported by different systems and tools. This way we don't have to re-create dashboards for every system, and we can all speak the same language and look at the same things.

@bom-d-van
Member

[2022-07-18T20:59:06.441Z] ERROR [access] fetch failed 
{
    "handler": "render",
    "url": "/render/?format=protobuf&from=1658156341&target=MyTarget%24JmxTimer.update_session_exception_none.999thPercentile&until=1658177941",
    "peer": "10.128.27.189:50684",
    "carbonapi_uuid": "e5737414-2d9a-4d85-b89c-a253b13380dc",
    "format": "carbonapi_v2_pb",
    "targets":
    [
        "MyTarget$JmxTimer.update_session_exception_none.999thPercentile"
    ],
    "runtime_seconds": 0.000098906,
    "reason": "failed to read data",
    "http_code": 400,
    "error": "could not expand globs - context canceled"
}

It seems that you are only trying to read one metric in this API call, and it fails immediately ("runtime_seconds": 0.000098906). Are you able to read the data directly with python whisper-dump.py or go-whisper/dump.go like this?

whisper-dump.py '/data/graphite/storage/whisper/MyTarget$JmxTimer/update_session_exception_none/999thPercentile.wsp'

@Civil
Member

Civil commented Aug 2, 2022

@bom-d-van I think the important part here is that it's context canceled - which means either a timeout set to too small a value, or an upstream that gave up immediately (AFAIR go-carbon, in case of a closed connection, would propagate the context and cancel the request). So that's not a data problem at all.

But it could also be that the request was stuck in the OS TCP socket backlog for too long (for example), and by the time it could be accepted by go-carbon it was already too late.

Based on that I would suggest trying to increase the backlog. That would be the net.ipv4.tcp_max_syn_backlog sysctl, and I would set it to at least 30000, or if it's already higher - maybe double it.

And you can check the backlog for the socket by using ss -lt, looking at the value in the Send-Q column, and comparing that with what is set in the sysctl.

If that doesn't help, it would be important to understand where the timeout actually came from and why. The key might be to find the request with the same UUID in carbonapi and check what happened there from carbonapi's point of view.
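As a concrete sketch of those checks (8080 is the carbonserver listen port from the config posted above; 30000 is just the value suggested here):

sysctl net.ipv4.tcp_max_syn_backlog
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=30000
ss -ltn 'sport = :8080'    # compare Send-Q against the sysctl value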

@bom-d-van
Member

bom-d-van commented Aug 2, 2022

But that potentially could be that request was stuck in OS TCP Socket backlog for too long (for example) and by the time it can be accepted by go-carbon it was already too late.

[carbonserver]
read-timeout = "120s"
write-timeout = "60s"

@Civil these timeouts are used when initiating the http.Server here. Based on my understanding, it should be a pure Go std library/runtime thing. Is there any TCP magic by which the kernel can tell user space that a request should be timed out?

which means either timeout set for too small value or that upstream gave up immediately (AFAIR go-carbon in case of closed connection would pass context and cancel the request). So that's not a data problem at all.

@Civil yep, I think the chance is low that it's a data/whisper file issue, but it's good to confirm it.
And it's less likely to be a too-small timeout value issue, as only some requests are failing, randomly (I hope).

@gkramer you can also try using bpftrace, or just hacking go-carbon at the place where the error is reported, just to see why the request is failing.

Just to double check, is it always the same request failing, or different requests failing at different times?

@Civil
Member

Civil commented Aug 2, 2022

@bom-d-van it depends on what you do; you can pass the deadline from upstream and reuse it for a specific request, for example. So it can be somewhat implied. And in the code nothing forbids you from redefining the timeout.

As I've said, there is some chance that if you run out of the kernel backlog buffer (see the sysctl above), you might have too many requests enqueued for too long. That might mean that by the time Go has a chance to process the connection, the deadline has already passed and the connection was closed by the upstream, which would immediately result in the context being canceled.

Increasing the backlog won't fix the underlying slowness but might reduce the amount of timeouts. It is a runtime sysctl, so increasing the value should be a relatively safe and easy test. Oh, and because of that, you might have some errors in the logs that are just red herrings and steer you away from the actual problematic queries.

@bom-d-van
Member

Just to keep exploring the idea of stuck TCP connections, @gkramer can you also try to find and share the runtime/latency/timeout for the failed requests in the carbonapi log? If it's indeed caused by TCP being stuck, we should see runtimes above 120s in your example. In Go net/http, the read timeout starts counting when it tries to read the header.

What's more, in your carbonapi config you have cpus: 0. It seems carbonapi should default that to the number of cores on the server, but maybe you can try setting it to 8 like go-carbon, or more, just to be sure.

@Civil
Member

Civil commented Aug 2, 2022

@bom-d-van if they are stuck - it can be that the runtime will be 0.0 from go-carbon's point of view, but it will be the timeout setting from carbonapi's standpoint.

However, the problem is that if some of carbonapi's concurrency settings are misconfigured, the effect will be the same, but the requests will be stuck waiting for an available slot inside carbonapi's code. So it would be relatively hard to distinguish, unless you check the Send-Q column in the output of ss -lt or netstat -lt for the listener port.

@gkramer
Author

gkramer commented Aug 4, 2022

OK, so apologies for merging information from go-graphite and carbonapi, but I'm going to put everything here:

Go-Graphite:

[2022-08-04T14:22:32.260Z] ERROR [tcp] read error {"error": "read tcp 127.0.0.1:2503->127.0.0.1:56536: i/o timeout"}
[2022-08-04T14:23:01.526Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=carbonapi_v3_pb", "peer": "127.0.0.1:42012", "carbonapi_uuid": "6c2a3831-2aca-40f0-9570-39f0de24c17a", "carbonzipper_uuid": "6c2a3831-2aca-40f0-9570-39f0de24c17a", "format": "carbonapi_v3_pb", "targets": ["servers.X.Y.current.*X-[0-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_codahale_metrics_jmx_JmxReporter$JmxTimer.*_none.999thPercentile"], "runtime_seconds": 9.925694688, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}
[2022-08-04T14:23:05.905Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=carbonapi_v3_pb", "peer": "127.0.0.1:33140", "carbonapi_uuid": "2a84cc1f-7ac1-4c40-aa27-3c49a55e1729", "carbonzipper_uuid": "2a84cc1f-7ac1-4c40-aa27-3c49a55e1729", "format": "carbonapi_v3_pb", "targets": ["servers.X.Y.current.*X-[0-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_codahale_metrics_jmx_JmxReporter$JmxTimer.*Spin*.98thPercentile"], "runtime_seconds": 9.950427383, "reason": "failed to read data", "http_code": 400, "error": "invalid cache record for the request"}

CarbonApi:

{"level":"WARN","timestamp":"2022-08-04T14:23:18.745Z","logger":"slow","message":"Slow Request","time":6.752854819,"slowLogThreshold":1,"url":"/render","referer":""}
{"level":"WARN","timestamp":"2022-08-04T14:23:21.221Z","logger":"zipper","message":"timeout waiting for more responses","type":"broadcastGroup","groupName":"root","function":"prober","no_answers_from":["group1"]}
{"level":"ERROR","timestamp":"2022-08-04T14:23:21.222Z","logger":"zipper","message":"failed to probe tlds","type":"probe","errors":"timeout while fetching Response"}
{"level":"ERROR","timestamp":"2022-08-04T14:23:35.989Z","logger":"render","message":"panic during eval:","carbonapi_uuid":"b654fbdb-ce12-4e38-af46-90fd3e08f984","username":"","request_headers":{"X-Dashboard-Id":"2","X-Grafana-Org-Id":"1","X-Panel-Id":"19"},"cache_key":"format=json&from=-7d&maxDataPoints=1330&target=sortByMaxima%28derivative%28groupByNode%28servers.X.Y.current.%2AX-%5B1-9%5D%2A%7B79f6967d57%2C7ddb99d49%2C8585cd9d8b%2C5d5dd59984%2C75ccb5875c%2C5c88bbc88b%2C8bd4f7bd7%2C74446b56b4%2C7b8945bdd8%2C9f868c77c%2Cc4b8c4b78%2C5db45cb6b5%2C6c9cd4b98f%2C6f47dc4fc%2C75f5bf5c45%2C8595df68dd%2C564489c58d%2C669b9fbf85%2C75f8859849%2C7d499bd94%2C86f864cb55%2C664bbd8d5d%2C6d74dcf8dc%2C6cffcf9d79%2C7fcd567c66%2C798d9568cd%2Cc57665cd%2C5ccbf999f4%2C6fd6bb8bd%2C7d666f98b9%2C5c97bb55cf%2Cc4c54b8db%2C6b44bf7878%2C798855fffd%2C5469467466%2C5ccdc977fb%2C57757cf96b%2C7ddb5985bb%2C6bb59d45f4%2Cd7f5949f6%2C5b6dd7dd84%2C69cdbf48cd%2C5948c5bdc6%2C6b5fcb6898%2Cf9cc559c8%2C74c8c459d5%2C8d7885854%2C5dd95c6f7b%2C645bc9cd44%2C5bb9b575f6%2C6f5c8bd7d8%2C84f488c49f%2C6d7bb87bff%2C8667c597fb%2C579c89665b%2C6cfb8b7494%2C64444c85c8%2C769df9d9db%2C5865d44c8d%2Cf8565559c%2C6499658d6b%2C64f5dbcb9b%2C6c5645d74%2C7c9c9cc74d%2C7fd48c79d8%2C977898486%2Cdb4f57d69%2Cd87c5788b%2Cd8858b94%2C64c85cf95d%2C7c7f776b96%2C7d5dc75ff8%2C85b47844f4%2C5659c8864d%2C6c8566cfc8%2C56f869dd46%2C649b758644%2C5955b5f44b%2C97666577f%2C6c847bdd6b%2Cccbf6cbb6%2C5c56c8c56%2C7d968975d%2C77dc6b6494%2C848d89d8dd%2C55f4b77588%2Cc7978cb79%2C7bccbc98b4%2C8b77fbdbf%2C6d77c78c5f%2Cd57bfc7cb%2C55cf45f88d%2C6c9f6b88df%7D%2A.com_codahale_metrics_jmx_JmxReporter%24JmxTimer.%2AERROR%2A.Count%2C+6%2C+%27sum%27%29%29%29&until=now","reason":"runtime error: index out of range [6] with length 2","stack":"github.com/go-graphite/carbonapi/cmd/carbonapi/http.renderHandler.func2\n\t/root/go/src/github.com/go-graphite/carbonapi/cmd/carbonapi/http/render_handler.go:252\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:844\nruntime.goPanicIndex\n\t/usr/local/go/src/runtime/panic.go:89\ngithub.com/go-graphite/carbonapi/expr/functions/groupByNode.(*groupByNode).Do\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/functions/groupByNode/function.go:81\ngithub.com/go-graphite/carbonapi/expr.EvalExpr\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:160\ngithub.com/go-graphite/carbonapi/expr.evaluator.Eval\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:119\ngithub.com/go-graphite/carbonapi/expr/helper.GetSeriesArg\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/helper/helper.go:41\ngithub.com/go-graphite/carbonapi/expr/helper.ForEachSeriesDo\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/helper/helper.go:124\ngithub.com/go-graphite/carbonapi/expr/functions/derivative.(*derivative).Do\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/functions/derivative/function.go:33\ngithub.com/go-graphite/carbonapi/expr.EvalExpr\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:160\ngithub.com/go-graphite/carbonapi/expr.evaluator.Eval\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:119\ngithub.com/go-graphite/carbonapi/expr/helper.GetSeriesArg\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/helper/helper.go:41\ngithub.com/go-graphite/carbonapi/expr/functions/sortBy.(*sortBy).Do\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/functions/sortBy/function.go:35\ngithub.com/go-graphite/carbonapi/expr.EvalExpr\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:160\ngithub.com/go-graphite/carbonapi
/expr.evaluator.Eval\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:119\ngithub.com/go-graphite/carbonapi/expr.evaluator.FetchAndEvalExp\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:96\ngithub.com/go-graphite/carbonapi/expr.FetchAndEvalExp\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:131\ngithub.com/go-graphite/carbonapi/cmd/carbonapi/http.renderHandler\n\t/root/go/src/github.com/go-graphite/carbonapi/cmd/carbonapi/http/render_handler.go:293\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2084\ngithub.com/go-graphite/carbonapi/util/ctx.ParseCtx.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/util/ctx/ctx.go:90\ngithub.com/go-graphite/carbonapi/cmd/carbonapi/http.enrichContextWithHeaders.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/cmd/carbonapi/http/enrichcontext.go:32\ngithub.com/dgryski/httputil.TimeHandler.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/dgryski/httputil/times.go:26\ngithub.com/dgryski/httputil.TrackConnections.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/dgryski/httputil/track.go:40\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2084\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2462\ngithub.com/gorilla/handlers.CompressHandlerLevel.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/gorilla/handlers/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2084\ngithub.com/gorilla/handlers.(*cors).ServeHTTP\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/gorilla/handlers/cors.go:54\ngithub.com/gorilla/handlers.ProxyHeaders.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/gorilla/handlers/proxy_headers.go:59\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2084\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2916\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1966"}
{"level":"ERROR","timestamp":"2022-08-04T14:23:35.993Z","logger":"access","message":"request failed","data":{"handler":"render","carbonapi_uuid":"b654fbdb-ce12-4e38-af46-90fd3e08f984","url":"/render","peer_ip":"192.168.50.178","host":"10.128.27.7:8081","format":"json","use_cache":true,"targets":["sortByMaxima(derivative(groupByNode(servers.X.Y.current.*X-[1-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_codahale_metrics_jmx_JmxReporter$JmxTimer.*ERROR*.Count, 6, 'sum')))"],"cache_timeout":60,"runtime":9.592144806,"http_code":500,"reason":"runtime error: index out of range [6] with length 2\nStack trace: github.com/go-graphite/carbonapi/cmd/carbonapi/http.renderHandler.func2\n\t/root/go/src/github.com/go-graphite/carbonapi/cmd/carbonapi/http/render_handler.go:257\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:844\nruntime.goPanicIndex\n\t/usr/local/go/src/runtime/panic.go:89\ngithub.com/go-graphite/carbonapi/expr/functions/groupByNode.(*groupByNode).Do\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/functions/groupByNode/function.go:81\ngithub.com/go-graphite/carbonapi/expr.EvalExpr\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:160\ngithub.com/go-graphite/carbonapi/expr.evaluator.Eval\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:119\ngithub.com/go-graphite/carbonapi/expr/helper.GetSeriesArg\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/helper/helper.go:41\ngithub.com/go-graphite/carbonapi/expr/helper.ForEachSeriesDo\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/helper/helper.go:124\ngithub.com/go-graphite/carbonapi/expr/functions/derivative.(*derivative).Do\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/functions/derivative/function.go:33\ngithub.com/go-graphite/carbonapi/expr.EvalExpr\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:160\ngithub.com/go-graphite/carbonapi/expr.evaluator.Eval\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:119\ngithub.com/go-graphite/carbonapi/expr/helper.GetSeriesArg\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/helper/helper.go:41\ngithub.com/go-graphite/carbonapi/expr/functions/sortBy.(*sortBy).Do\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/functions/sortBy/function.go:35\ngithub.com/go-graphite/carbonapi/expr.EvalExpr\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:160\ngithub.com/go-graphite/carbonapi/expr.evaluator.Eval\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:119\ngithub.com/go-graphite/carbonapi/expr.evaluator.FetchAndEvalExp\n\t/root/go/src/github.com/go-graphite/
carbonapi/expr/expr.go:96\ngithub.com/go-graphite/carbonapi/expr.FetchAndEvalExp\n\t/root/go/src/github.com/go-graphite/carbonapi/expr/expr.go:131\ngithub.com/go-graphite/carbonapi/cmd/carbonapi/http.renderHandler\n\t/root/go/src/github.com/go-graphite/carbonapi/cmd/carbonapi/http/render_handler.go:293\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2084\ngithub.com/go-graphite/carbonapi/util/ctx.ParseCtx.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/util/ctx/ctx.go:90\ngithub.com/go-graphite/carbonapi/cmd/carbonapi/http.enrichContextWithHeaders.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/cmd/carbonapi/http/enrichcontext.go:32\ngithub.com/dgryski/httputil.TimeHandler.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/dgryski/httputil/times.go:26\ngithub.com/dgryski/httputil.TrackConnections.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/dgryski/httputil/track.go:40\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2084\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2462\ngithub.com/gorilla/handlers.CompressHandlerLevel.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/gorilla/handlers/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2084\ngithub.com/gorilla/handlers.(*cors).ServeHTTP\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/gorilla/handlers/cors.go:54\ngithub.com/gorilla/handlers.ProxyHeaders.func1\n\t/root/go/src/github.com/go-graphite/carbonapi/vendor/github.com/gorilla/handlers/proxy_headers.go:59\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2084\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2916\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1966","from":1659018206,"until":1659623006,"from_raw":"-7d","until_raw":"now","uri":"/render","from_cache":false,"used_backend_cache":false,"request_headers":{"X-Dashboard-Id":"2","X-Grafana-Org-Id":"1","X-Panel-Id":"19"}}}
{"level":"ERROR","timestamp":"2022-08-04T14:23:45.238Z","logger":"access","message":"request failed","data":{"handler":"render","carbonapi_uuid":"f5f0219b-b146-4945-968b-d8e6af80b2bd","url":"/render","peer_ip":"192.168.50.178","host":"10.128.27.7:8081","format":"json","use_cache":true,"targets":["scale(sortByMaxima(groupByNode(servers.X.Y.current.*X-[0-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_codahale_metrics_jmx_JmxReporter$JmxTimer.*react*.OneMinuteRate, 6, 'sum')), 60)","scale(sortByMaxima(groupByNode(servers.X.Y.current.*X-[0-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_codahale_metrics_jmx_JmxReporter$JmxTimer.*update*.OneMinuteRate, 6, 'sum')), 
60)","scale(sortByMaxima(groupByNode(servers.X.Y.current.*X-[0-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_codahale_metrics_jmx_JmxReporter$JmxTimer.*grant*.OneMinuteRate, 6, 'sum')), 60)","scale(sortByMaxima(groupByNode(servers.X.Y.current.*X-[0-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_codahale_metrics_jmx_JmxReporter$JmxTimer.*purchase*.OneMinuteRate, 6, 'sum')), 60)"],"cache_timeout":60,"runtime":31.135724624,"http_code":503,"reason":"failed to fetch data from server/group,failed to fetch data from server/group,failed to fetch data from server/group","from":1659018194,"until":1659622994,"from_raw":"-7d","until_raw":"now","uri":"/render","from_cache":false,"used_backend_cache":false,"request_headers":{"X-Dashboard-Id":"2","X-Grafana-Org-Id":"1","X-Panel-Id":"30"}}}
{"level":"WARN","timestamp":"2022-08-04T14:23:46.153Z","logger":"zipper","message":"errors occurred while getting results","type":"protoV3Group","name":"http://127.0.0.1:8080","type":"fetch","request":"&MultiFetchRequest{Metrics:[]FetchRequest{FetchRequest{Name:servers.X.Y.*.*X-[1-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_mongodb_management_ConnectionPoolStatistics.Size,StartTime:1659018216,StopTime:1659623016,HighPrecisionTimestamps:false,PathExpression:servers.X.Y.*.*X-[1-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_mongodb_management_ConnectionPoolStatistics.Size,FilterFunctions:[]*FilteringFunction{},MaxDataPoints:1330,},},}","errors":"max tries exceeded","errorsVerbose":"max tries exceeded\nHTTP Code: 504\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:25\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6222\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6199\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571\n\nCaused By: failed to fetch data from server/group\nHTTP Code: 
504\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:27\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6222\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6199\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571\n\nCaused By: timeout while fetching Response\nHTTP Code: 504\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:20\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6222\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6199\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571\n\nCaused By: Get \"http://127.0.0.1:8080/render/?format=carbonapi_v3_pb\": context deadline exceeded"}

GrafanaFE:

t=2022-08-04T14:36:19+0000 lvl=eror msg="Data proxy error" logger=data-proxy-log userId=1 orgId=1 uname=admin path=/api/datasources/proxy/2/render remote_addr=192.168.50.178 referer="http://10.128.27.189:3000/d/q8xEIsz4k/spring-boot-apps-carbonapi?orgId=1&from=now-7d&to=now" error="http: proxy error: context canceled"

Carbon-relay-ng (CRNG):

2022-08-04 14:58:01.770 [WARNING] plain handler for 10.128.219.48:53092 returned: read tcp 10.128.27.7:2123->10.128.219.48:53092: i/o timeout. closing conn <-- seeing quite a few of these. Just to clarify, we've got an AWS NLB passing traffic to CRNG, and this may very well be due to upstream connections not properly closing the socket... will spend some time investigating this at another time. I'm assuming CRNG wouldn't benefit from longer timeouts from upstream sources?

Re:
ss -ltn: Nothing above 128 in Send-Q.
netstat -ltn: Queues are empty

General:

  • I've just recently seen go-graphite OOM-killed by the kernel, which hasn't happened in some time... are you guys able to provide some estimates re what kind of RAM I ought to throw at 30-40TB of wsp files if indexing is enabled?
  • So in summary:
    • lots of timeouts across the board (any possibility of getting GG to more gracefully handle this by enabling keepalives; or some kind of polling for long-running renders?)
    • runtime error: index out of range [6] <-- that's an exciting one :)
    • net.ipv4.tcp_max_syn_backlog <-- bumped up to 30k with little to no impact.
    • how many requests can be safely enqueued before the system starts falling over?

@gkramer
Author

gkramer commented Aug 4, 2022

So I cranked up the time range to 30 days, and that totally annihilated the machine - iowait shoots up to 70%, RES/VIRT mem shoots past 120GB+, and the kernel then terminates go-graphite. So... this brings me back to my original questions:

  • What sort of resource should we be throwing at 30TB+ of whisper logs, with indexing enabled?
  • What kind of performance improvement could I expect from host-based SSD vs EBS gp3 with 8k IOps?
  • Splitting wsp across multiple systems has some benefits, obviously, but what kind of improvement should I reasonably expect, and at what point might one need to scale across more hosts?

@gkramer
Author

gkramer commented Aug 4, 2022

@Civil @bom-d-van It does seem to me that go-graphite could benefit from optimisations re RAM and indexing...

  • Does the daemon load only those metrics relevant to the query at the time?
  • Could it not load and crunch smaller/more manageable batches, to keep RAM utilisation under control? Would this require a change in the behaviour of the comms between go-graphite and carbonapi?
  • Clearly as the number of metrics grows, the system will have more difficulty keeping up, which seems to motivate a more thoughtful approach to batched rendering so as to not overload the machine -- although I also recognise that some responsibility needs to be taken by the dashboard administrator to ensure a query doesn't crash the daemon, but this risk could be minimised by setting some sane limits [which may, or may not, break the protocol]

Interested to hear your thinking, and greatly value all your assistance to date!!!

@bom-d-van
Member

bom-d-van commented Aug 5, 2022

What sort of resource should we be throwing at 30TB+ of whisper logs, with indexing enabled?

Have you tried my recommendation config here: #479 (comment)

It does seem to me that go-graphite could benefit from optimisations re RAM and indexing
What sort of resource should we be throwing at 30TB+ of whisper logs, with indexing enabled?

It's not really about how big your whisper files are in total, more about how many unique metrics you have on the server or in the system. And it seems you haven't figured that out yet? There are metrics reported by go-carbon (carbonserver.metrics_known) and you can also use find to count them (find /data/graphite/storage/whisper -type f).
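For example, something like this to get the count (limiting it to .wsp files is just an assumption; drop the -name filter if you want everything):

find /data/graphite/storage/whisper -type f -name '*.wsp' | wc -l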

For general scaling question:

  • Can you also describe how you are deploying your services?
  • What versions of carbonapi, carbonzipper and go-carbon are you using, how many servers are there, and how are they deployed?
  • How many servers do you have, and how much memory, CPU and disk space does each have?

What kind of performance improvement could I expect from host-based SSD vs EBS gp3 with 8k IOps?

This I'm not certain about. You might have to benchmark it for your production load, because whisper schemas and write loads vary.

Splitting wsp across multiple systems has some benefits, obviously, but what kind of improvement should I reasonably expect, and at what point might one need to scale across more hosts?

I usually look at the cache.size metric reported by go-carbon. It tends to be an indicator that the go-carbon instance is overloaded. I think you could expect fewer (or no) timeouts for your issue after expanding your cluster, and maybe less memory and cpu usage.

runtime error: index out of range [6] <-- thats an exciting one :)

For this, we would appreciate it if you could file a bug report in the carbonapi repo.


As for the logs, they're not very helpful because you didn't retrieve the ones that are connected via the carbonapi_uuid field. For example, if you notice a failed request in go-carbon like the one below, you should get the carbonapi_uuid from it (in this case 6c2a3831-2aca-40f0-9570-39f0de24c17a) and use it to grep against the carbonapi and carbonzipper logs.

[2022-08-04T14:23:01.526Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=carbonapi_v3_pb", "peer": "127.0.0.1:42012", "carbonapi_uuid": "6c2a3831-2aca-40f0-9570-39f0de24c17a", "carbonzipper_uuid": "6c2a3831-2aca-40f0-9570-39f0de24c17a", "format": "carbonapi_v3_pb", "targets": ["servers.X.Y.current.*X-[0-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_codahale_metrics_jmx_JmxReporter$JmxTimer.*_none.999thPercentile"], "runtime_seconds": 9.925694688, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}
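e.g. something like this (the log paths come from the go-carbon and carbonapi logger configs posted earlier in the thread):

grep 6c2a3831-2aca-40f0-9570-39f0de24c17a /var/log/go-carbon/carbonapi.log /var/log/go-carbon/go-carbon.log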

@bom-d-van
Member

one more small tip: using the code-quoting markdown syntax to format your logs, config files, and code would make the comments easier to read. (I have tweaked your comment above.)

@gkramer
Author

gkramer commented Aug 7, 2022

Have you tried my recommendation config here: #479 (comment)

Yes, this was enabled immediately after your suggestion. It doesn't seem to have made a difference.

It's not really about how big your whisper files are in total. More like how many uniq metrics you have for the server or the system. And it seems you haven't figured it out yet? There are metrics reported by go-carbon (carbonserver.metrics_known) and you can also use find to count them (find /data/graphite/storage/whisper -type f).

[2022-08-07T09:47:48.319Z] INFO [stat] collect {"endpoint": "local", "metric": "carbon.agents.stats01.carbonserver.metrics_known", "value": 29247434}

Find: 29,257,864; yes, the number of metrics is growing daily. This is the first time find has completed successfully in a while.

For the general scaling question:

  • Can you also describe how you are deploying your services?
    WRT GoGraphite:
  • Single r4.4xlarge (16x CPU, 128GB RAM)
  • RAID0 across 4x 10TB drives (EBS, gp3), each with 8k IOps
  • 1x AWS NLB sitting in front of two carbon-relay-ng daemons [1.2-1-g375f430] on the same box
  • single GoCarbon daemon [v0.16.2]
  • single CarbonApi daemon [v1.1.0]

Cache size reported as:
[2022-08-07T10:07:48.318Z] INFO [stat] collect {"endpoint": "local", "metric": "carbon.agents.stats01.cache.size", "value": 1000001}

... which is above my threshold of 1m. What is the appropriate thing to do here: increase the cache size (seems unreasonable considering how much RAM is already consumed), reduce the cache size (I assume to the detriment of performance, but would it stabilise memory consumption?), or split the instance out into multiple systems to better utilise the zipper? And if the latter, how should I calculate the number of instances -- or, put differently, how many metrics per instance should I be aiming for?

@gkramer
Copy link
Author

gkramer commented Aug 7, 2022

Apologies, I'll aim to properly quote logs in code blocks in future.

@bom-d-van
Copy link
Member

Yep, your server is certainly under heavy load as it's already dropping data based on the cache.size value. How about cache.metrics values?

Find: 29,257,864; yes, number of metrics are growing daily. This is the first time find has completed successfully in a while.

Have you considered removing obsolete metrics if they no longer receive updates? Booking.com production would remove metrics that hadn't been updated for 3-30 days. It's fairly easy to achieve with a find command.
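
A minimal sketch of that cleanup, assuming the whisper root used earlier in this thread and a 30-day window (both are assumptions; adjust to your layout and retention policy):

# delete whisper files not written to in 30 days, then prune empty directories
find /data/graphite/storage/whisper -type f -name '*.wsp' -mtime +30 -delete
find /data/graphite/storage/whisper -type d -empty -delete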

Single r4.4xlarge (16x CPU, 128GB RAM)
RAID0 across 4x 10TB drives (EBS, gp3), each with 8k IOps
1x AWS NLB sitting in front of two carbon-relay-ng daemons [1.2-1-g375f430] on the same box
single GoCarbon daemon [v0.16.2]
single CarbonApi daemon [v1.1.0]

With 128GB of RAM, almost 30m metrics per instance, and the timeout issues you are having, in my experience you would have to scale out, or consider reducing the load by removing old/stale metrics, producing fewer new metrics, or both.

and if this is the case, how should I calculate the number of instances -- or put differently, how many metrics per instance should I be aiming for?

It's hard for us to give you a definitive number; you would have to experiment based on your production load. As a simple starting point, if you are running just one server, consider making it 2 or 3 servers. I would recommend seeking inspiration in the Google SRE books.


You should create some relatively sophisticated Grafana dashboards using the go-carbon metrics and system metrics, so that you know what your servers look like now and what they look like after the expansion.


Also, from this query example, it seems you are producing metrics with UUIDs or similar. That certainly generates lots of metrics, and if they are something like k8s pod IDs, the metrics remain after the pod is removed. If that's the case, you will certainly need to remove the obsolete metrics manually after some time.

servers.X.Y.current.*X-[0-9]*{79f6967d57,7ddb99d49,8585cd9d8b,5d5dd59984,75ccb5875c,5c88bbc88b,8bd4f7bd7,74446b56b4,7b8945bdd8,9f868c77c,c4b8c4b78,5db45cb6b5,6c9cd4b98f,6f47dc4fc,75f5bf5c45,8595df68dd,564489c58d,669b9fbf85,75f8859849,7d499bd94,86f864cb55,664bbd8d5d,6d74dcf8dc,6cffcf9d79,7fcd567c66,798d9568cd,c57665cd,5ccbf999f4,6fd6bb8bd,7d666f98b9,5c97bb55cf,c4c54b8db,6b44bf7878,798855fffd,5469467466,5ccdc977fb,57757cf96b,7ddb5985bb,6bb59d45f4,d7f5949f6,5b6dd7dd84,69cdbf48cd,5948c5bdc6,6b5fcb6898,f9cc559c8,74c8c459d5,8d7885854,5dd95c6f7b,645bc9cd44,5bb9b575f6,6f5c8bd7d8,84f488c49f,6d7bb87bff,8667c597fb,579c89665b,6cfb8b7494,64444c85c8,769df9d9db,5865d44c8d,f8565559c,6499658d6b,64f5dbcb9b,6c5645d74,7c9c9cc74d,7fd48c79d8,977898486,db4f57d69,d87c5788b,d8858b94,64c85cf95d,7c7f776b96,7d5dc75ff8,85b47844f4,5659c8864d,6c8566cfc8,56f869dd46,649b758644,5955b5f44b,97666577f,6c847bdd6b,ccbf6cbb6,5c56c8c56,7d968975d,77dc6b6494,848d89d8dd,55f4b77588,c7978cb79,7bccbc98b4,8b77fbdbf,6d77c78c5f,d57bfc7cb,55cf45f88d,6c9f6b88df}*.com_codahale_metrics_jmx_JmxReporter$JmxTimer.*_none.999thPercentile

@gkramer
Copy link
Author

gkramer commented Aug 8, 2022

@bom-d-van We are absolutely removing old metrics, but not quite as aggressively as you mention, i.e. after months, as opposed to less than a month. We're currently in discussion with the team about retaining at most 30 days of metrics, which will help in a material way, but I suspect we'll still see issues due to the number of metrics, albeit not their depth. I'm also trying to make the case for splitting out the GoGraphite daemon on a per-service/k8s basis, but I don't want to arrive at the same point we're at now in N months - I suspect that taking this route without a better feel for what to expect out of the daemon (per IOps/GB RAM/GHz) up front may cause problems down the line.

@bom-d-van
Copy link
Member

bom-d-van commented Aug 8, 2022

I'm also trying to motivate for splitting out the GG daemon on a per service/k8s basis

You can consider enabling the quota sub-system in go-carbon to produce per-namespace usage metrics: #420

This is the approach we proposed at my ex-employer to achieve multi-tenancy.

With the usage and quota metrics, you can see and control how many resources a prefix/namespace/pattern consumes. And when a namespace grows too big, you can relocate it to its own dedicated go-carbon instances or cluster.

However, the quota sub-system itself also produces something like 16 metrics per namespace, so it isn't free either. It's a good idea to have a dedicated instance/cluster for storing go-carbon's own metrics.

That said, it's probably better to try it out after you have resolved the scaling challenges for your current instances.

but I don't want to arrive at the same point we're at now in N months

It's a never-ending struggle if your company keeps growing. That's why we get paid. ;)

Also it's a common SRE/devops practice to have capacity predictions from time to time and expand or shrink the cluster, or throttle and reduce the usage.

I suspect that taking this route without having a better feel for what to expect out of the daemon (per IOps/GB RAM/GHz) up front may cause problems down the line.

For whisper-based Graphite storage systems, the capacity limits vary with schemas and loads; for example, a minutely metric is much less expensive than a secondly one. But you can use cache.metrics to get some understanding of how many active metrics the instance is receiving.

@bom-d-van
Copy link
Member

[cache]
max-size = 1000000    # 1000000
write-strategy = "max"

Also, this value is relatively low; the cache (ingestion queue) may saturate too easily, which leads to data loss on ingestion. You might want to consider going all the way to 200m or more for a server with 128GB of RAM.
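
For reference, a sketch of the corresponding change in the same config block (the exact number is illustrative, not a tuned recommendation for your workload):

[cache]
max-size = 200000000    # ~200m entries, per the suggestion above for a 128GB host
write-strategy = "max"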

@gkramer
Copy link
Author

gkramer commented Aug 15, 2022

@bom-d-van @Civil I've since done a major cleanup, and we've seen significant improvements in performance. Some of the steps taken:

  1. Change retention period to 60:30d - this cut our storage by 50%
  2. Delete all wsp files in the abovementioned directories that are >30d old.

We've since seen 'carbonserver.metrics_known' fall from > 29.9M to 4M. I've also bumped max-size to 50m for now, and will keep an eye on cache size over the next week.

[An architecture built around business needs, rather than nice-to-haves, seems to have significantly simplified the project!]

Will keep you guys updated, but we're now in a far better position thanks to all your help. Thank you both!
