
Metric queries return null data until whisper file exists #629

Closed
bitprophet opened this issue Feb 20, 2014 · 25 comments · May be fixed by graphite-project/carbon#888

@bitprophet
Member

Scenario

  • You have a large Graphite install and/or are just stuck with slow disks, such that data enters carbon-cache and then doesn't reach disk for a number of minutes.
  • You add new metric paths to your system that didn't exist before - new hosts or services came online, you renamed some metrics in your collectors, whatever.
  • You are querying Graphite-web for these new metrics directly (aka bypassing the 'finder') via direct HTTP calls, or a dashboard, etc.
  • Those queries result in no data until the above to-disk lag period passes.
  • Users get sad because they made all the changes on their end (re: data collection) but then they have to wait a long time for validation/usefulness.

Cause

Querying the webapp for a specific metric path returns no data until it hits disk even if data is in the cache, due to the short-circuit linked here.

Solution

Update readers.py such that it performs a best-effort check of the cache for the requested metric path before giving up and returning None.
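
A minimal sketch of that best-effort check, assuming a CarbonLink-style pool whose `query()` returns raw `(timestamp, value)` pairs as graphite-web's existing cache-merge code does; the bucketing below is illustrative, not the project's actual merge logic:

```python
# Hypothetical sketch: consult carbon-cache before returning None for a
# metric whose whisper file doesn't exist yet. `carbonlink` stands in for
# graphite-web's CarbonLink pool; everything else is illustrative.
def fetch_with_cache_fallback(carbonlink, metric_path, start, end, step=60):
    cached = carbonlink.query(metric_path)  # list of (timestamp, value)
    if not cached:
        return None  # truly-bogus path: behave exactly as before
    values = [None] * ((end - start) // step)
    for ts, value in cached:
        if start <= ts < end:
            values[(ts - start) // step] = value  # bucket by step
    return (start, end, step), values
```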

Downsides/challenges

  • This may incur a (hopefully minor!) performance hit compared to existing behavior, in the case of truly-bogus metric path queries, but feels quite worth it, at least in my use case where lag is 20-30 minutes at times.
  • Unless Carbon/CarbonLink is capable of servicing glob-expression requests (guessing not, given the above-linked code), this will only solve things for fully explicit queries. Still useful some of the time, but not a full solution.
    • EDIT: Yup, the query is straight-up keying into the carbon-cache's cache dict. Could probably shoehorn in a way to handle globs, however, if the core team thinks that's a reasonable thing to do (see the sketch below).
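
For illustration, the glob handling mentioned above could be a linear scan over the cache's keys; this hypothetical sketch (where `cache` stands in for carbon's in-memory metric dict; nothing like this exists in carbon today) also shows why a plain hashtable makes glob lookups expensive:

```python
# Hypothetical sketch: expand a glob pattern against carbon-cache's
# metric dict. Note this is O(n) over every cached metric name per query.
import fnmatch

def expand_glob_in_cache(cache, pattern):
    """Return cached metric paths matching a glob pattern."""
    return [path for path in cache if fnmatch.fnmatchcase(path, pattern)]
```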
@obfuscurity
Member

I'm really surprised to see that we don't attempt to hit the cache before disk. I'd like to get @mleinart's feedback on this proposed change and why we started returning None in 764bdf7 without checking the cache too.

@g76r
Contributor

g76r commented Dec 17, 2014

@obfuscurity: I didn't talk to @mleinart, but I had a look at the code to change that behaviour myself.

IMHO it would be interesting to search the cache for series not yet present on disk, but it needs some changes: since carbon-cache stores series in a hashtable, resolving a glob expression would need another memory structure in addition to the hashtable (or, less likely, instead of it). This is probably why Graphite-web does not query carbon-cache first.

@ocervell

Any updates on this? I am running into the same problem.

@deniszh
Member

deniszh commented Jul 11, 2016

@ocervell - unfortunately, not yet.
We still need to tackle this issue.

@deniszh deniszh added this to the 1.1.0 milestone Sep 8, 2016
@deniszh
Member

deniszh commented Sep 8, 2016

It completely slipped off our radar. Developers at my work are hitting that constantly; I'm even using a dirty hack for 0.9.x.
I'll try to make a fix for master, but I'm adding the 1.1.0 milestone just in case.

@ocervell

Issue is 3 years old. Any update?
My company is moving away from Graphite because of this.
I was still hoping this issue would get resolved, since I'm using Graphite in my own projects.

@DanCech
Member

DanCech commented Oct 12, 2017

The big thing that would be needed for this to work is the ability for carbon to respond to a find request: right now, if the whisper file doesn't exist, the standard finder is never going to find the series, and the code won't even get as far as calling the whisper reader.

If carbon were to provide a find method, then it seems like it should be possible to move all the cache functionality out into a finder and reader for the cache, and handle merging the cached data into the final results via the new MultiReader mechanism. The biggest issue there seems to be aligning the data from the cache, since it's stored with the raw timestamps and not aligned to a step.
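
For illustration, that alignment step might look like the following sketch; the function name and the last-write-wins policy are assumptions, not the project's implementation:

```python
# Hypothetical sketch: bucket raw cached (timestamp, value) pairs into a
# step-aligned series, as merging cache data into results would require.
def align_to_step(datapoints, start, end, step):
    series = [None] * ((end - start) // step)
    for ts, value in datapoints:
        if start <= ts < end:
            series[(ts - start) // step] = value  # last write wins
    return series
```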

@deniszh
Member

deniszh commented Oct 14, 2017

@ocervell: there are two main reasons why this issue is not in active development yet. First, that logic is buried quite deep in whisper itself, and the fix for it is quite big, as @DanCech said.
On the other hand, if you use SSDs and increase the number of your caches, the delay can be quite low and acceptable. I'm not saying that we shouldn't fix it, just explaining why even third-party Graphite implementations still have this issue. :(

@Exocomp

Exocomp commented Oct 14, 2017

You have a large Graphite install and/or are just stuck with slow disks, such that data enters carbon-cache and then doesn't reach disk for a number of minutes.

I'm hitting the above scenario. I have to lower MAX_UPDATES_PER_SECOND because it kills the disk's throughput, but that causes the cache to grow. Then queries start returning null for a few minutes because the cache is not being checked. ¯\_(ツ)_/¯
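
For reference, the knobs involved live in carbon.conf's `[cache]` section; the values below are illustrative, not recommendations:

```ini
[cache]
# Throttling disk writes protects IO throughput, but the lower this is,
# the longer datapoints sit only in the cache (invisible to queries here).
MAX_UPDATES_PER_SECOND = 500
# Cap cache growth so a long write backlog cannot exhaust memory.
MAX_CACHE_SIZE = 2000000
```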

@deniszh deniszh modified the milestones: 1.1.0, 1.2.0 Nov 19, 2017
@AsenZahariev

@deniszh, I saw that you included this issue in milestones 1.1.0 and 1.2.0; can you confirm in which version it will definitely be included? It seems we are struggling with the same problem.
Thank you!

@deniszh
Member

deniszh commented Dec 15, 2017

Milestone 1.2.0 means only 'not now'. Currently we have no solution for that problem, and we are not actively working on it.
Sorry.

@mwtzzz-zz

I just wanted to add my two cents. We also run a large Graphite installation. This problem has been apparent for a while, but it was tolerable because we were using NVMe-backed storage. But I recently converted to IOPS-provisioned EBS volumes (because we were losing historical data every time an NVMe drive bombed out), and the problem has been exacerbated.

It's problematic for us because a lot of our monitoring and auto-scaling relies on being able to retrieve timely metrics from the Graphite front-end.

I should note that for us, some metrics' current values are retrievable from the cache, while others aren't. I don't know if this is because of a hashing problem or because of the sheer number of metrics in the cache at any given time. But in either case, the fact that it retrieves some of them means it shouldn't be a big deal to update the code to retrieve any of them, right? In our case we're not using wildcards, if that makes a difference.

@piotr1212
Member

I should note that for us, some metrics' current values are retrievable from the cache, while others aren't. I don't know if this is because of a hashing problem or because of the sheer number of metrics in the cache at any given time. But in either case, the fact that it retrieves some of them means it shouldn't be a big deal to update the code to retrieve any of them, right? In our case we're not using wildcards, if that makes a difference.

This issue is about metrics for which a .wsp file does not exist yet. They never return results. I don't get why it would sometimes return data for you. In graphite-project/carbon#782 (comment) you indicated that the metrics you are having trouble with already have a .wsp file. Is this another issue you are experiencing?

@mwtzzz-zz

@piotr1212 Ok, if this issue #629 is about metrics for which a .wsp file does not exist yet, then my problem is different. Sorry to bother.

@stale

stale bot commented Apr 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 13, 2020
@twm

twm commented Apr 14, 2020

This is still a thing AFAIK.

@stale stale bot removed the stale label Apr 14, 2020
@piotr1212
Member

There is no easy way to fix this without changing the whole architecture.

@g76r
Contributor

g76r commented Apr 14, 2020

@piotr1212 the whole architecture? I don't think so, but it would need a kind of index in addition to the hashtable in the write cache, in order to be able to find new series names in the cache with a globbing pattern, and not only by their exact names after globbing on the filesystem.

@piotr1212
Member

@g76r Graphite-web needs to figure out which cache to query; as long as globbing is not done, it just doesn't know which cache has that metric. Querying all caches on every metric with a glob does not scale at all.

So you would need an index daemon which keeps a list (or rather a tree, or some faster data structure) of all metrics. When a cache receives a metric, it would first have to check whether it already exists on disk, instead of just pushing it to the cache and worrying about creating it later at write time. It can check on disk, which is slow, or keep an internal index; with an internal index you have to reindex at startup or save it to disk.
If the metric doesn't exist, it has to inform the index daemon. Graphite-web would have to query the index daemon on every request that has a glob; the index daemon would return a list of caches which have the metric, or just metric names so that web can do the carbon hash lookup itself.
Then this index daemon needs to be distributed in some way, because otherwise it will become a huge performance bottleneck.
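
For illustration only, the index such a daemon might keep could be a trie over dot-separated path components; nothing like this exists in carbon, and all names here are hypothetical:

```python
# Hypothetical sketch: a metric-name trie supporting glob lookups, the
# kind of structure an index daemon could maintain alongside the cache.
import fnmatch

class MetricIndex:
    def __init__(self):
        self.root = {}

    def insert(self, metric):
        """Record a metric path, e.g. 'servers.web01.cpu.load'."""
        node = self.root
        for part in metric.split("."):
            node = node.setdefault(part, {})

    def find(self, pattern):
        """Return indexed paths matching a dot-separated glob pattern."""
        matches = [("", self.root)]
        for part in pattern.split("."):
            matches = [
                (prefix + "." + name if prefix else name, child)
                for prefix, node in matches
                for name, child in node.items()
                if fnmatch.fnmatchcase(name, part)
            ]
        return [path for path, _ in matches]
```

A cache would insert() each metric on first sight; web would call find() to expand a glob, then hash each resulting name to locate its cache.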

@g76r
Contributor

g76r commented Apr 15, 2020

@piotr1212: graphite-web already queries every carbon instance, to complete on-disk data with its still-in-memory data, because graphite-web already has no way to know which cache contains still-in-memory data.
At least that's the way it worked when I wrote my comment above, on 17 Dec 2014.

@g76r
Contributor

g76r commented Apr 15, 2020

PS:
I'm not saying "it's easy, it should be done"; I'm only saying "it's probably easier, and with less impact, than you seem to think".
And to be frank, maybe in 2014 I would have coded it, because I needed Graphite in those days, but I'm no longer using it, so I won't contribute; therefore I wouldn't be offended if this issue was closed...

@g76r
Contributor

g76r commented Apr 15, 2020

@piotr1212 IMHO what you call a distributed index is actually the carbon-cache daemon.

@piotr1212
Member

@piotr1212: graphite-web already queries every carbon instance, to complete on-disk data with its still-in-memory data, because graphite-web already has no way to know which cache contains still-in-memory data.
At least that's the way it worked when I wrote my comment above, on 17 Dec 2014.

No, it uses carbon consistent hashing to determine which cache to query. I don't know if it ever worked differently.

basically:

  • web receives a metric with globs
  • web checks disk to expand the globs
  • web runs the carbon consistent hash function on the expanded metrics to determine which cache to query
  • web queries the caches which have the in-memory data (see the sketch below)
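
A minimal sketch of that hashing step, assuming carbon's ConsistentHashRing from carbon.hashing; the node list is made up:

```python
# Sketch: map an expanded (glob-free) metric name to the carbon-cache
# instance that holds its in-memory datapoints.
from carbon.hashing import ConsistentHashRing

ring = ConsistentHashRing([("127.0.0.1", "a"), ("127.0.0.1", "b")])
node = ring.get_node("servers.web01.cpu.load")
print(node)  # e.g. ("127.0.0.1", "a"): the cache instance to query
```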

@piotr1212
Member

Another option that would be easy to implement would be to create the new files earlier.

piotr1212 added a commit to piotr1212/carbon that referenced this issue Apr 16, 2020
In large Graphite installations the queues can get really long; it can take
an hour for Graphite to write all metrics in the queue. New db files are created
when the metric is written, which can take too long. This separates the creation
of metrics from writing data to them and moves the creation to an earlier moment.

Whenever a new metric is received, its name is pushed to a new_metric list. The
first step in the writer loop is to check whether any new metrics were received,
and to create them if they don't exist on disk yet. After the creation, the writer
continues as usual with writing metrics from the queue, but it does not check
whether the file already exists, to prevent the check occurring twice and
impacting IO. If the file does not exist at this point, it is logged.

Fixes: graphite-project/graphite-web#629
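
A hedged sketch of the shape of that change; the helper names (`new_metrics`, `path_for`) and the retention are illustrative stand-ins, and the real implementation is in the referenced carbon commit:

```python
# Hypothetical sketch: create whisper files for newly seen metrics before
# the normal write pass, so queries can find them while data is queued.
import os
import whisper

def create_new_metrics(new_metrics, path_for, archives=None):
    archives = archives or [(60, 1440)]  # retention is illustrative
    while new_metrics:
        metric = new_metrics.pop()
        path = path_for(metric)
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            whisper.create(path, archives)
```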
@stale

stale bot commented Jun 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 14, 2020
@stale stale bot closed this as completed Jun 21, 2020