Large jumps in MAIN.s_resp_bodybytes output #4055

Open · delthas opened this issue Feb 13, 2024 · 5 comments
@delthas (Contributor) commented Feb 13, 2024

Expected Behavior

The rate of MAIN.s_resp_bodybytes is fairly stable and corresponds to the output seen on the NIC.

Current Behavior

MAIN.s_resp_bodybytes shows large jumps in its output, up to +10GB in a single second.

Context

I have three servers running Varnish 7.3.0 with a similar configuration.

I have plotted the output of varnishstat -1, run every second on each server and grepped for MAIN.s_resp_bodybytes. First graph: the total value (with the value at t0 subtracted). Second graph: the derivative, obtained by taking the difference between each data point and the previous one (in other words, the variation during that second).
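
For reference, the per-second sampling loop was roughly of this shape (a minimal sketch, not the exact script; only the counter name is taken from above):

# Print MAIN.s_resp_bodybytes and its per-second delta, once per second.
prev=""
while sleep 1; do
    cur=$(varnishstat -1 | grep MAIN.s_resp_bodybytes | awk '{print $2}')
    [ -n "$prev" ] && echo "$(date +%s) $cur $((cur - prev))"
    prev=$cur
done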

On server 1, the output looks stable, and corresponds to what I see on the NIC.

Graph 1: MAIN.s_resp_bodybytes - t0. Graph 2: rate of MAIN.s_resp_bodybytes.

On servers 2 and 3 (I've only shown server 2 here, but the behavior is similar on server 3), the output has huge jumps from one line to the next (seen as the spikes in the second graph, up to 1e10, i.e. +10 GB from one second to the next, far more than the average throughput of ~0.3 GB/s). This does not reflect what the NIC sees: the NIC reports a fairly stable throughput with no jumps. As an example, from one second to the next I got 52766514824 -> 61136417222, an ~8.4 GB difference.

Graph 1: MAIN.s_resp_bodybytes - t0. Graph 2: rate of MAIN.s_resp_bodybytes.

I'm serving small files (under ~1 MB) with no pipes, so I see no particular reason for such a sudden, huge change in the accounted throughput.

Is this a known issue?

Varnish Cache version

Varnish 7.3.0

@nigoroll (Member) commented:

We have an optimization in varnish-cache to push statistics of worker threads only when they are idle. Thus, the behavior you describe is expected if they are (mostly) fully loaded. The statistics should even out if the rate is calculated over longer time frames.

There is the busy_stats_rate feature flag to also push statistics of busy threads. Could you please check whether the observed behavior changes with this flag enabled?
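
For example (a minimal sketch; both forms use the standard parameter interface):

# Enable the flag on a running instance
varnishadm param.set feature +busy_stats_rate

# Or pass it at startup
varnishd ... -p feature=+busy_stats_rate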

@delthas (Contributor, Author) commented Feb 13, 2024

OK, will try that. Thanks for the quick reply.

@delthas (Contributor, Author) commented Feb 14, 2024

Issue not fixed after varnishadm param.set feature +busy_stats_rate; do I need to restart or is this applied on the fly?

@nigoroll (Member) commented:

bugwash: @walid-git noticed that busy_stats_rate is not implemented for H2 and wants to work on this feature.

@delthas confirmed that the "second server" showing the big jumps might actually be serving mostly H2.

@delthas also asked about the performance impact of busy_stats_rate and I promised an answer: The main implication is that one of the central locks, the worker pool mutex, needs to be acquired after every request. A measure of the impact is LCK.wq.dbg_busy with the debug=+lck bit set: Compare the rate of this statistic with and without busy_stats_rate. Using more pools should lower contention on this lock.
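
A sketch of that measurement (the counter, debug bit and parameter names are as described above; the sampling loop and pool count are illustrative):

# Enable lock instrumentation so dbg_busy is counted
varnishadm param.set debug +lck

# Sample how often the worker pool mutex is found busy;
# compare the rate with and without busy_stats_rate enabled
while sleep 1; do varnishstat -1 | grep LCK.wq.dbg_busy; done

# If contention grows, more worker pools spread the load (illustrative value)
varnishadm param.set thread_pools 4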

walid-git added a commit to walid-git/varnish-cache that referenced this issue Feb 21, 2024
@walid-git (Member) commented:

@delthas can you give #4059 a try?
Note that this has only been tested locally, and I did not measure the performance impact.
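
In case it helps, one way to test the PR branch locally (a sketch; the build steps and branch name are assumptions, the pull ref is GitHub's standard pull/<id>/head):

# Fetch and build the proposed change from PR #4059
git clone https://github.com/varnishcache/varnish-cache.git
cd varnish-cache
git fetch origin pull/4059/head:pr-4059
git checkout pr-4059
./autogen.sh && ./configure && make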

While working on this, I found an interesting case for H1 where even having +busy_stats_rate enabled won't change anything. It can very easily be reproduced with wrk generating traffic against varnish running with default parameter values:

./wrk -t1 -c1 -d30s http://127.0.0.1:1234/

This generates traffic on a single connection for 30 seconds, and if you observe varnishstat during that time, no counter updates until the 30 seconds are done, regardless of whether +busy_stats_rate is enabled.
I did some investigation and found that in this case the stats are frequently pushed to the pool stats as expected; however, since there is only one active worker in the pool and it stays busy for the whole duration, the pool stats never get pushed to the global stats.
While I think this is unlikely to happen with real server traffic, since other workers on the same pool would push the pool stats when they finish their work, I find the behavior a bit unexpected, that is to say: waiting for a very busy worker to finish its job before pool statistics are pushed, while plenty of idle workers do nothing.
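
To see this from the outside (a sketch; the port matches the wrk example above, the counter choice is illustrative):

# Terminal 1: single-connection load for 30 seconds
./wrk -t1 -c1 -d30s http://127.0.0.1:1234/

# Terminal 2: the counter stays flat until wrk finishes
while sleep 1; do varnishstat -1 | grep MAIN.client_req; done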

walid-git added a commit that referenced this issue Feb 26, 2024