Occasional "duplicate sample for timestamp" errors #11

Closed
Vlad-Stryapko opened this issue Dec 23, 2021 · 12 comments
Labels
bug, evaluation needed

Comments

@Vlad-Stryapko

Vlad-Stryapko commented Dec 23, 2021

I'm using the latest version of the extension with k6 0.35.0. Occasionally during a test I get the following error:

ERRO[0370] Failed to store timeseries. error="server returned HTTP status 400 Bad Request: duplicate sample for timestamp"

The errors are not terribly frequent, but not rare either - I've got about 25 errors after 10 minutes of a test run with 255 VUs generating load.

Is it possible to somehow mitigate the issue so that the samples are not dropped? Also, is it possible to assess the number of duplicate samples, i.e. how many of them were not written because of the error?

@yorugac
Collaborator

yorugac commented Dec 23, 2021

Hi @Vlad-Stryapko
Thanks for opening the issue!

Currently I'd say it isn't about "mitigating this issue" but figuring out when it happens in the first place. I briefly looked at this issue with duplicate timestamps a few months ago, but it didn't seem to behave as an easily reproducible bug (at least back then). Did you happen to notice any specific pattern when these errors happen? Maybe it's a specific k6 script, metric / tag, or something in the Prometheus setup, etc.

We might end up adding hacks in the extension as a workaround for this error, like you suggest, but first I think it makes sense to gather more info about it and, additionally, to try out this recent improvement on the Prometheus side, prometheus/prometheus#9894, and see if it helps to debug the origin of the issue.

As for the amount of such errors, I don't think Prometheus has a separate metric for duplicate timestamps in remote write 😞 It does have the metric prometheus_target_scrapes_sample_duplicate_timestamp_total but, judging by its name, it's probably specific to scraping.

@yorugac changed the title from Occasional "ERRO[0263] Failed to store timeseries." errors to Occasional "duplicate sample for timestamp" errors Dec 23, 2021
@yorugac added the bug and evaluation needed labels Dec 23, 2021
@Vlad-Stryapko
Author

I don't think there is anything specific I've noticed. My test algorithm is pretty straightforward: I send a message to a websocket channel and, when I receive a response, I add the elapsed time between the request and the response to a trend.

What I think I can do is as follows: try to repro the bug on my local machine with the latest Prometheus installed; try to reproduce it with a very basic k6 test which would simply write to a trend as quickly as possible.

As to finding out the amount of dropped data, as part of the repro I can also introduce a counter of how many times trend.Add was called and then compare the counter to the number of samples in Prometheus. I would assume that generally the percentage lost should be the same for the same configuration, e.g. 10% of the samples dropped.

@Vlad-Stryapko
Author

I was able to repro the issue with the latest Prom and k6 locally.

docker run -p 9090:9090 -v <path_to_local_prom_file>:/etc/prometheus/prometheus.yml prom/prometheus --enable-feature=remote-write-receiver --config.file=/etc/prometheus/prometheus.yml

The local file looks like this:

global:
  scrape_interval: 10s
  evaluation_interval: 10s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

The k6 test is as follows:

import { sleep } from 'k6';

export let options = {
    stages: [
        { duration: '10s', target: 100 },
        { duration: '180s', target: 100 },
        { duration: '0s', target: 0 },
    ]
}

export default function () {
    sleep(1);
}

The first error occurs ~15 seconds after the test starts, as the VUs ramp up.

@yorugac
Collaborator

yorugac commented Jan 5, 2022

@Vlad-Stryapko Sorry for the delay and thanks for confirming the bug!

I've had some time to look into this issue and wanted to share the results.

With the new logging in Prometheus, it is indeed possible to get more info: I've tested it with different setups and it is at least clear now how the issue originates. It seems to happen only with the first value in a batch, when it is compared against the last value in the previous batch. Prometheus expects duplicate timestamps to have the same value:
https://github.com/prometheus/prometheus/blob/dfa5cb7462b7b93b53eebec6197da594c6f78aa2/tsdb/head_append.go#L308-L312
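For reference, here is a paraphrased Go sketch of that check; the identifiers are illustrative, not the exact ones in tsdb/head_append.go.

// Paraphrased sketch of the Prometheus head appender check referenced above.
// Identifiers are illustrative; see tsdb/head_append.go for the real code.
package sketch

import (
    "errors"
    "math"
)

var errDuplicateSampleForTimestamp = errors.New("duplicate sample for timestamp")

// memSeries holds the last sample appended for one time series.
type memSeries struct {
    maxTime   int64   // timestamp (ms) of the newest sample in the series
    lastValue float64 // value of that sample
}

// appendable rejects a sample that reuses the newest timestamp with a
// different value; an exact duplicate (same timestamp, bit-identical value)
// is accepted without an error.
func (s *memSeries) appendable(t int64, v float64) error {
    if t == s.maxTime && math.Float64bits(v) != math.Float64bits(s.lastValue) {
        return errDuplicateSampleForTimestamp
    }
    return nil
}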

With k6 metrics, they usually won't have the same values. What remains strange is that this check can fail only once per batch: there can be duplicate timestamps with different values in the middle of the batch and Prometheus will allow them without an error. Currently, I'm thinking it is connected to the fact that Commit is called once per write request:
https://github.com/prometheus/prometheus/blob/dfa5cb7462b7b93b53eebec6197da594c6f78aa2/storage/remote/write_handler.go#L82-L92

There might be another reason, but either way the rationale for this logic is not clear to me at the moment. It might be beneficial to clarify it before implementing any fixes here.

As for potential solutions: the straightforward way is to omit such "duplicate" values at the beginning of the batch. It would mean losing some data points, but it would preserve the rest of the batch. Anything else would probably be too complicated and would potentially contradict whatever changes k6 makes after the metrics refactoring. E.g. so far I haven't seen this error happen with the raw mapping, only with aggregates in the prometheus mapping, and aggregations will be modified during the metrics refactoring. This error is the result of the way k6 handles time granularity and aggregation of its metrics, and that is likely to change in the near future.
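To illustrate that first option, here is a rough, hypothetical Go sketch of dropping a leading "duplicate" sample before sending a batch. It is not the extension's actual code, and lastSentTimestamp would have to be tracked per time series by the remote-write client.

// Hypothetical sketch of the "omit leading duplicates" workaround discussed
// above, not the extension's actual implementation.
package sketch

// sample is a single (timestamp, value) point of one time series.
type sample struct {
    Timestamp int64 // milliseconds since epoch
    Value     float64
}

// dropLeadingDuplicate discards the first sample of a batch when its timestamp
// collides with the last sample already written for the same series, losing
// one data point but letting Prometheus accept the rest of the batch.
func dropLeadingDuplicate(batch []sample, lastSentTimestamp int64) []sample {
    if len(batch) > 0 && batch[0].Timestamp == lastSentTimestamp {
        return batch[1:]
    }
    return batch
}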

@Vlad-Stryapko
Author

Thanks for looking into it.

I think the best course of action would be to simply wait until the k6 metrics changes are implemented and see how it works after that. I'm not well versed in k6 internals, so I don't think I'll be able to implement the changes you mentioned, and the impact seems tolerable since it's a loss of only a small portion of metrics.

@yorugac
Collaborator

yorugac commented Jan 5, 2022

Yes, I definitely wasn't suggesting trying to fix those metric refactoring issues in k6 😄 They are indeed heavy on understanding the internals and will be worked on by the k6 team. Some of them are already among the current priorities, so hopefully changes will start appearing soon.

the impact seems tolerable since it's a loss of only a small portion of metrics.

Good to know!

I think this issue should stay open, since this error might be encountered by other people too and it's easier to keep all the related info in one place. Once those metric changes start getting merged into k6, this issue can be revisited and perhaps closed, or reviewed again as needed.

@rverma-dev

@Vlad-Stryapko is the dependent issue grafana/k6#1831?

codebien added a commit that referenced this issue Jun 6, 2022
The specific hack is not required with the new local model.
Regarding the original issue we need to get the real problem and apply a complete solution.
We expect to fix with a PR for github.com//issues/11
@robholland
Contributor

FWIW, I have various tests that easily replicate this duplicate sample warning as well. Please let me know if I can help debug/test any fixes once you get to a point where that would be useful.

@codebien
Contributor

codebien commented Jul 13, 2022

Hi @robholland, if they are deterministic tests then they may help in creating a quick fix directly against the current state of main. So if you can share them now, that would be helpful.

@robholland
Contributor

https://github.com/temporalio/xk6-temporal/blob/main/examples/signal-waiter.js is an example test that always produces several of these warnings for me, generally within seconds of starting. It requires our (WIP) extension and a running Temporal instance, though (suitable docker compose files for Postgres- or SQLite-backed Temporal are included in the repo).

@perrinj3

We get frequent errors like

ts=2022-08-18T00:50:19.293Z caller=write_handler.go:102 level=error component=web msg="Out of order sample from remote write" err="duplicate sample for timestamp" series="

when using the k6 remote write extension to remote-write to Prometheus. It seems to be worse at high sample rates, i.e. 700-800 TPS.
Is anyone actively looking at this issue? Happy to provide any logs and k6 scripts; we can reproduce it with a simple script.

@codebien
Contributor

codebien commented Sep 6, 2022

This issue should now be fixed in the main branch; we will release a new tag in the next few days. In the meantime, it's possible to get a build with the following command:

xk6 build --with github.com/grafana/xk6-output-prometheus-remote@main
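
Assuming the setup from the extension's README at that time (the output name and environment variable below may have changed in newer releases, so treat this as an example rather than a reference), the resulting binary can then be pointed at the remote write endpoint:

K6_PROMETHEUS_REMOTE_URL=http://localhost:9090/api/v1/write ./k6 run -o output-prometheus-remote script.js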

@codebien closed this as completed Sep 6, 2022