Occasional "duplicate sample for timestamp" errors #11

Closed
Vlad-Stryapko opened this issue Dec 23, 2021 · 12 comments
Labels
bug, evaluation needed

Comments

@Vlad-Stryapko

Vlad-Stryapko commented Dec 23, 2021

I'm using the latest version of the extension with k6 0.35.0. Occasionally during a test I get the following error:

ERRO[0370] Failed to store timeseries. error="server returned HTTP status 400 Bad Request: duplicate sample for timestamp"

The errors are not terribly frequent, but not rare either - I've got about 25 errors after 10 minutes of a test run with 255 VUs generating load.

Is it possible to somehow mitigate the issue so that the samples are not dropped? Also, is it possible to assess the number of duplicate samples, i.e. how many of them were not written because of the error?

@yorugac
Collaborator

yorugac commented Dec 23, 2021

Hi @Vlad-Stryapko
Thanks for opening the issue!

Currently I'd say it isn't about "mitigating this issue" but figuring out when it happens in the first place. I briefly looked at this issue with duplicate timestamps a few months ago, but it didn't seem to behave as an easily reproducible bug (at least back then). Did you happen to notice any specific pattern when these errors happen? Maybe it's a specific k6 script, metric / tag, or something in the Prometheus setup, etc.

We might end up adding hacks in the extension as a workaround for this error, like you suggest, but first I think it makes sense to gather more info about it and, additionally, to try out this recent improvement on the Prometheus side, prometheus/prometheus#9894, and see if it helps to debug the origin of the issue.

As for the amount of such errors, I don't think Prometheus has a separate metric for duplicate timestamps in remote write 😞 It does have the metric prometheus_target_scrapes_sample_duplicate_timestamp_total but, judging by its name, it's probably specific to scraping.

@yorugac changed the title from Occasional "ERRO[0263] Failed to store timeseries." errors to Occasional "duplicate sample for timestamp" errors Dec 23, 2021
@yorugac added the bug and evaluation needed labels Dec 23, 2021
@Vlad-Stryapko
Author

I don't think there is anything specific I've noticed. My test algorithm is pretty straightforward: I send a message to a websocket channel and, when I receive a response, I add the elapsed time between the request and the response to a trend.

What I think I can do is as follows: try to repro the bug on my local machine with the latest Prometheus installed; try to reproduce it with a very basic k6 test which would simply write to a trend as quickly as possible.

As to finding out the amount of dropped data, as part of the repro I can also introduce a counter of how many times trend.Add was called and then compare the counter to the number of samples in Prometheus. I would assume that generally the percentage lost should be the same for the same configuration, e.g. 10% of the samples dropped.

@Vlad-Stryapko
Author

I was able to repro the issue with the latest Prom and k6 locally.

docker run -p 9090:9090 -v <path_to_local_prom_file>:/etc/prometheus/prometheus.yml prom/prometheus --enable-feature=remote-write-receiver --config.file=/etc/prometheus/prometheus.yml

The local file looks like this:

global:
  scrape_interval: 10s
  evaluation_interval: 10s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

The k6 test is as follows:

import { sleep } from 'k6';

export let options = {
    stages: [
        { duration: '10s', target: 100 },
        { duration: '180s', target: 100 },
        { duration: '0s', target: 0 },
    ]
}

export default function () {
    sleep(1);
}

The first error occurs ~15 seconds after the test starts, as the VUs ramp up.

@yorugac
Collaborator

yorugac commented Jan 5, 2022

@Vlad-Stryapko Sorry for the delay and thanks for confirming the bug!

I've had some time to look into this issue and wanted to share the results.

With the new logging in Prometheus, it is indeed possible to get more info: I've tested it with different setups and it is at least clear now how the issue originates. It seems to happen only with the first value in a batch, when it is compared against the last value in the previous batch. Prometheus expects duplicate timestamps to have the same value:
https://github.com/prometheus/prometheus/blob/dfa5cb7462b7b93b53eebec6197da594c6f78aa2/tsdb/head_append.go#L308-L312
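For reference, here is a paraphrased Go sketch of that check; the identifiers are illustrative, not the exact ones in tsdb/head_append.go.

// Paraphrased sketch of the Prometheus head appender check referenced above.
// Identifiers are illustrative; see tsdb/head_append.go for the real code.
package sketch

import (
    "errors"
    "math"
)

var errDuplicateSampleForTimestamp = errors.New("duplicate sample for timestamp")

// memSeries holds the last sample appended for one time series.
type memSeries struct {
    maxTime   int64   // timestamp (ms) of the newest sample in the series
    lastValue float64 // value of that sample
}

// appendable rejects a sample that reuses the newest timestamp with a
// different value; an exact duplicate (same timestamp, bit-identical value)
// is accepted without an error.
func (s *memSeries) appendable(t int64, v float64) error {
    if t == s.maxTime && math.Float64bits(v) != math.Float64bits(s.lastValue) {
        return errDuplicateSampleForTimestamp
    }
    return nil
}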

With k6 metrics, they usually won't have the same values. What remains strange is that this check can fail only once per batch: there can be duplicate timestamps with different values in the middle of the batch and Prometheus will allow them without an error. Currently, I'm thinking it is connected to the fact that Commit is called once per write request:
https://github.com/prometheus/prometheus/blob/dfa5cb7462b7b93b53eebec6197da594c6f78aa2/storage/remote/write_handler.go#L82-L92

There might be another reason, but either way the rationale for this logic is not clear to me at the moment. It might be beneficial to clarify it before implementing any fixes here.

As for potential solutions: the straightforward way is to omit such "duplicate" values at the beginning of the batch. It would mean losing some data points, but it would preserve the rest of the batch. Anything else would probably be too complicated and would potentially contradict whatever changes k6 makes after the metrics refactoring. E.g. so far I haven't seen this error happen with the raw mapping, only with aggregates in the prometheus mapping, and aggregations will be modified during the metrics refactoring. This error is the result of the way k6 handles time granularity and aggregation of its metrics, and that is likely to change in the near future.
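To illustrate that first option, here is a rough, hypothetical Go sketch of dropping a leading "duplicate" sample before sending a batch. It is not the extension's actual code, and lastSentTimestamp would have to be tracked per time series by the remote-write client.

// Hypothetical sketch of the "omit leading duplicates" workaround discussed
// above, not the extension's actual implementation.
package sketch

// sample is a single (timestamp, value) point of one time series.
type sample struct {
    Timestamp int64 // milliseconds since epoch
    Value     float64
}

// dropLeadingDuplicate discards the first sample of a batch when its timestamp
// collides with the last sample already written for the same series, losing
// one data point but letting Prometheus accept the rest of the batch.
func dropLeadingDuplicate(batch []sample, lastSentTimestamp int64) []sample {
    if len(batch) > 0 && batch[0].Timestamp == lastSentTimestamp {
        return batch[1:]
    }
    return batch
}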

@Vlad-Stryapko
Author

Thanks for looking into it.

I think the best course of action would be to simply wait until the k6 metrics changes are implemented and see how it works after that. I'm not well versed in k6 internals, so I don't think I'll be able to implement the changes you mentioned, and the impact seems tolerable since it's a loss of only a small portion of metrics.

@yorugac
Collaborator

yorugac commented Jan 5, 2022

Yes, I definitely wasn't suggesting trying to fix those metric refactoring issues in k6 😄 They are indeed heavy on understanding the internals and will be worked on by the k6 team. Some of them are already among the current priorities, so hopefully changes will start appearing soon.

the impact seems tolerable since it's a loss of only a small portion of metrics.

Good to know!

I think this issue should stay open, since this error might be encountered by other people too and it's easier to keep all the related info in one place. Once those metric changes start getting merged into k6, this issue can be revisited and perhaps closed, or reviewed again as needed.

@rverma-dev

@Vlad-Stryapko is the dependent issue grafana/k6#1831?

codebien added a commit that referenced this issue Jun 6, 2022
The specific hack is not required with the new local model.
Regarding the original issue we need to get the real problem and apply a complete solution.
We expect to fix with a PR for github.com//issues/11
@robholland
Contributor

FWIW, I have various tests that easily replicate this duplicate sample warning as well. Please let me know if I can help debug/test any fixes once you get to a point where that would be useful.

@codebien
Contributor

codebien commented Jul 13, 2022

Hi @robholland, if they are deterministic tests then they may help in creating a quick fix directly against the current state of main. So if you can share them now, that would be helpful.

@robholland
Contributor

https://github.com/temporalio/xk6-temporal/blob/main/examples/signal-waiter.js is an example test that always produces several of these warnings for me, generally within seconds of starting. It requires our (WIP) extension and a running Temporal instance, though (suitable docker compose files for Postgres- or SQLite-backed Temporal are included in the repo).

@perrinj3

We get frequent errors like

ts=2022-08-18T00:50:19.293Z caller=write_handler.go:102 level=error component=web msg="Out of order sample from remote write" err="duplicate sample for timestamp" series="

when using the k6 remote write extension to remote-write to Prometheus. It seems to be worse at high sample rates, i.e. 700-800 TPS.
Is anyone actively looking at this issue? Happy to provide any logs and k6 scripts; we can reproduce it with a simple script.

@codebien
Contributor

codebien commented Sep 6, 2022

This issue should now be fixed in the main branch; we will release a new tag in the next few days. In the meantime, it's possible to get a build with the following command:

xk6 build --with github.com/grafana/xk6-output-prometheus-remote@main
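
Assuming the setup from the extension's README at that time (the output name and environment variable below may have changed in newer releases, so treat this as an example rather than a reference), the resulting binary can then be pointed at the remote write endpoint:

K6_PROMETHEUS_REMOTE_URL=http://localhost:9090/api/v1/write ./k6 run -o output-prometheus-remote script.js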

@codebien closed this as completed Sep 6, 2022