Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics with NULL values are not processed correctly #178

Open
PhantomPhreak opened this issue Jan 14, 2022 · 4 comments
Open

Metrics with NULL values are not processed correctly #178

PhantomPhreak opened this issue Jan 14, 2022 · 4 comments

Comments

@PhantomPhreak
Copy link

PhantomPhreak commented Jan 14, 2022

Recently we migrated from graphite-web + whisper to graphite-web + graphite-clickhouse + clickhouse schema, and there are some issues with processing NULL values.

Whisper's *.wsp files has pre-defined resolution:

aggregationMethod: average
maxRetention: 157680000
xFilesFactor: 0.5
fileSize: 12614440

Archive 0
offset: 40
secondsPerPoint: 60
points: 525600
retention: 31536000
size: 6307200

which means that value for each timestamp may be NULL or some actual value, like following:

# whisper-fetch metric.wsp
1642054380      0.000977
1642054440      0.000977
1642054500      0.000977
1642054560      None
1642054620      0.000977
1642054680      0.000977
1642054740      None

In this case, graphite + whisper always return some value for each datapoint, even if all of them are NULLs. This may be used by transformNull() graphite functions to convert them into specific values and fill the gaps if necessary.

graphite-clickhouse + clickhouse has different behavior:
in case if at least 1 datapoint has non-NULL value, this value + NULL for other datapoints will be returned. This is query result from ClickHouse and how data is presented in Grafana (exported CSV)

SELECT
    Path,
    groupArray(Time),
    groupArray(Value),
    groupArray(Timestamp)
FROM graphite.data
PREWHERE (Date >= toDate(1642020420)) AND (Date <= toDate(1642023719))
WHERE (Path IN ('metric')) AND ((Time >= 1642020420) AND (Time <= 1642023719))
GROUP BY Path
FORMAT Vertical

Query id: 1eccd72b-9a8d-4825-b7f1-46bc21e5fde7

Row 1:
──────
Path:                  metric
groupArray(Time):      [1642022160]
groupArray(Value):     [0.0009765625]
groupArray(Timestamp): [1642022217]

1 rows in set. Elapsed: 0.020 sec. Processed 40.96 thousand rows, 3.68 MB (2.02 million rows/s., 181.21 MB/s.)
Grafana (exported CSV)

"Time","metric"
2022-01-13 03:47:00,
2022-01-13 03:48:00,
2022-01-13 03:49:00,
2022-01-13 03:50:00,
2022-01-13 03:51:00,
2022-01-13 03:52:00,
2022-01-13 03:53:00,
2022-01-13 03:54:00,
2022-01-13 03:55:00,
2022-01-13 03:56:00,
2022-01-13 03:57:00,
2022-01-13 03:58:00,
2022-01-13 03:59:00,
2022-01-13 04:00:00,
2022-01-13 04:01:00,
2022-01-13 04:02:00,
2022-01-13 04:03:00,
2022-01-13 04:04:00,
2022-01-13 04:05:00,
2022-01-13 04:06:00,
2022-01-13 04:07:00,
2022-01-13 04:08:00,
2022-01-13 04:09:00,
2022-01-13 04:10:00,
2022-01-13 04:11:00,
2022-01-13 04:12:00,
2022-01-13 04:13:00,
2022-01-13 04:14:00,
2022-01-13 04:15:00,
2022-01-13 04:16:00,0.00098
2022-01-13 04:17:00,
2022-01-13 04:18:00,
2022-01-13 04:19:00,
2022-01-13 04:20:00,
2022-01-13 04:21:00,
2022-01-13 04:22:00,
2022-01-13 04:23:00,
2022-01-13 04:24:00,
2022-01-13 04:25:00,
2022-01-13 04:26:00,
2022-01-13 04:27:00,
2022-01-13 04:28:00,
2022-01-13 04:29:00,
2022-01-13 04:30:00,
2022-01-13 04:31:00,
2022-01-13 04:32:00,
2022-01-13 04:33:00,
2022-01-13 04:34:00,
2022-01-13 04:35:00,
2022-01-13 04:36:00,
2022-01-13 04:37:00,
2022-01-13 04:38:00,
2022-01-13 04:39:00,
2022-01-13 04:40:00,
2022-01-13 04:41:00,

In case if selected interval has only NULL values, nothing is returned from ClickHouse to graphite-clickhouse, and from graphite-web to Grafana:

SELECT
    'metric' AS Path,
    groupArray(Time),
    groupArray(Value),
    groupArray(Timestamp)
FROM graphite.data
PREWHERE (Date >= toDate(1642020420)) AND (Date <= toDate(1642023719))
WHERE (Path IN ('metric)) AND ((Time >= 1642020420) AND (Time <= 1642023719))
GROUP BY Path
FORMAT Vertical

Query id: c1d5a623-0d5e-4826-9303-f5af306a7e2c

Ok.

0 rows in set. Elapsed: 0.012 sec. 

Grafana displaying "No data" for this time period, but "No data" and "NULL" values are completely different, and expected behavior in this case - return NULL values for all datapoints.

Real-world problem: we have a metric representing request processing quantiles, and if there are no requests (night time), there are no values for quantiles is calculated. In the monitoring system, tracking this quantiles, we can replace NULL values with 0 using transformNull() , and differentiate the situation, when metric is broken and when there are just no requests served. This logic is broken now, because we may have a sequence of NULL values for the metric, exceeding selected time-window (5 minutes in our case).

I suppose it can be possible to send NULLs for all datapoints from graphite-clickhouse even if there are empty result returned from the ClickHouse, if we're using rollup config and already know the metric's precision for selected time interval.

We're using "dummy" rollup configuration for now:

SELECT *
FROM system.graphite_retentions

Query id: ce6cd594-5498-47fd-a3a4-d54adade8d1f

┌─config_name─────┬─regexp─┬─function─┬─age─┬─precision─┬─priority─┬─is_default─┬─Tables.database─┬─Tables.table───┐
│ graphite_rollup │        │ any      │   0 │        60 │    65535 │          1 │ ['graphite']    │ ['data_local'] │
└─────────────────┴────────┴──────────┴─────┴───────────┴──────────┴────────────┴─────────────────┴────────────────┘

1 rows in set. Elapsed: 1.548 sec.

Or, if there are any other way for workaround - i'd be happy to know.

And thank you for the amazing go-graphite stack :)

@Felixoid
Copy link
Collaborator

Hi. To have proper null-points you need to have a valid retention config for the GraphiteMergeTree table. Then graphite-clickhouse returns valid None-points for any format.

@PhantomPhreak
Copy link
Author

Humm. Possibly i don't understand correctly what is "valid" retention config in this case.
Here is my configuration

SELECT *
FROM system.graphite_retentions

Query id: b3b2f340-71fe-4b5d-851c-f4ebd664beb7

┌─config_name─────┬─regexp─┬─function─┬─age─┬─precision─┬─priority─┬─is_default─┬─Tables.database─┬─Tables.table───┐                                                                                       
│ graphite_rollup │        │ any      │   0 │        60 │    65535 │          1 │ ['graphite']    │ ['data_local'] │                                                                                       
└─────────────────┴────────┴──────────┴─────┴───────────┴──────────┴────────────┴─────────────────┴────────────────┘                                                                                       

1 rows in set. Elapsed: 0.002 sec.

rollup config in /etc/clickhouse-server/config.d

<yandex>
        <graphite_rollup>
        <!-- default -->
                <default>
                        <function>any</function>
                        <retention>
                                <age>0</age>
                                <precision>60</precision>
                        </retention>
                </default>
        </graphite_rollup>
</yandex>

graphite-clickhouse configured to use table with Distributed engine (since it allow to avoid seding query request to the stale replica)

[clickhouse]
...
 index-table = "graphite.index"
...
 internal-aggregation = false

[[data-table]]
 table = "graphite.data"
...
 # custom rollup.xml file for table, 'auto' and 'none' are allowed as well
 rollup-conf = "auto"
 # custom table for 'rollup-conf=auto', useful for Distributed or MatView
 rollup-auto-table = "graphite.data_local"
 # is used when none of rules match
 rollup-default-precision = 0
 # is used when none of rules match
 rollup-default-function = ""
 # should be set to true if you don't have reverted regexps in rollup-conf for reversed tables
 rollup-use-reverted = false

With this configuration, graphite-clickhouse issuing following queries:

SELECT
    regexp,
    function,
    age,
    precision,
    is_default
FROM system.graphite_retentions
ARRAY JOIN Tables AS table
WHERE (table.database = 'graphite') AND (table.table = 'data_local')
ORDER BY
    is_default ASC,
    priority ASC,
    regexp ASC,
    age ASC
FORMAT JSON

Query id: 82e5d1c5-2178-473b-bfab-ffe6c5eae3e6

{
        "meta":
        [
                {
                        "name": "regexp",
                        "type": "String"
                },
                {
                        "name": "function",
                        "type": "String"
                },
                {
                        "name": "age",
                        "type": "UInt64"
                },
                {
                        "name": "precision",
                        "type": "UInt64"
                },
                {
                        "name": "is_default",
                        "type": "UInt8"
                }
        ],

        "data":
        [
                {
                        "regexp": "",
                        "function": "any",
                        "age": "0",
                        "precision": "60",
                        "is_default": 1
                }

        ],

        "rows": 1,

        "statistics":
        {
                "elapsed": 0.000710489,
                "rows_read": 1,
                "bytes_read": 116
        }
}

1 rows in set. Elapsed: 0.023 sec.

Now, related to https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/graphitemergetree/#rollup-configuration

age – The minimum age of the data in seconds.
precision– How precisely to define the age of the data in seconds. Should be a divisor for 86400 (seconds in a day).
function – The name of the aggregating function to apply to data whose age falls within the range [age, age + precision]. Accepted functions: min / max / any / avg. The average is calculated imprecisely, like the average of the averages.

age defining minumum age of data, in my case it's 0 which means "all data"
precision - i'd like to keep one value for each 60s
function - i use any

purpose of this config - keep only 1 value for each minute, if more that 1 value received within a 60s interval, and keep this data with 1-minute resolution without re-sampling (rollup).

Just in case,
CH version: 21.8.12.29
graphite-clickhouse: 0.13.2

table's schema

CREATE TABLE graphite.data_local
(
    `Path` String CODEC(ZSTD(16)),
    `Value` Float64 CODEC(Gorilla, ZSTD(16)),
    `Time` UInt32 CODEC(DoubleDelta),
    `Date` Date CODEC(DoubleDelta, LZ4HC(9)),
    `Timestamp` UInt32 CODEC(DoubleDelta)
)
ENGINE = ReplicatedGraphiteMergeTree('/clickhouse/tables/401246e0-d1af-4835-8012-46e0d1afa835/{shard}', '{replica}', 'graphite_rollup')
PARTITION BY toYYYYMM(Date)
PRIMARY KEY (Path, round(Time, -5))
ORDER BY (Path, round(Time, -5), Time)
SETTINGS index_granularity = 8192

Maybe i missing something or misunderstanding the documentation?
I'd be grateful for any hint :)

@Felixoid
Copy link
Collaborator

Ok, sorry for my previous comment, I haven't slept enough today and got the issue differently, because you have a distributed table graphite.data, but the table in graphite_rollup is graphite.data_local.

So, the case is legal, although it would very significantly change the default behavior, and would be unexpected for the most of users. Currently even is old metrics are in index, they aren't listed in a metrics list #83

It's possible to add a setting and fill-up missing metrics, though.

Since the project doesn't have an active maintainer currently, it makes sense to consider a PR. We can discuss it in https://t.me/ru_go_graphite 😉

@PhantomPhreak
Copy link
Author

No problems, i haven't sleep enough as well and thought i was missing something 😅
Yeah, for backward compatibility this logic can be added by additional setting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants