Large jumps in MAIN.s_resp_bodybytes output #4055

Open · delthas opened this issue Feb 13, 2024 · 5 comments
@delthas (Contributor) commented Feb 13, 2024

Expected Behavior

The rate of MAIN.s_resp_bodybytes is fairly stable and corresponds to the output seen on the NIC.

Current Behavior

MAIN.s_resp_bodybytes shows large jumps in its output, up to +10GB in a single second.

Context

I have three servers running Varnish 7.3.0 with a similar configuration.

I have plotted the output of varnishstat -1, run every second on each server and grepped for MAIN.s_resp_bodybytes. First graph: the total value (with the value at t0 subtracted). Second graph: the derivative, obtained by taking the difference between each data point and the previous one (in other words, the variation during that second).
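
For reference, the per-second sampling loop was roughly of this shape (a minimal sketch, not the exact script; only the counter name is taken from above):

# Print MAIN.s_resp_bodybytes and its per-second delta, once per second.
prev=""
while sleep 1; do
    cur=$(varnishstat -1 | grep MAIN.s_resp_bodybytes | awk '{print $2}')
    [ -n "$prev" ] && echo "$(date +%s) $cur $((cur - prev))"
    prev=$cur
done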

On server 1, the output looks stable, and corresponds to what I see on the NIC.

Graph 1: MAIN.s_resp_bodybytes - t0. Graph 2: rate of MAIN.s_resp_bodybytes.

On servers 2 and 3 (I've only shown server 2 here, but the behavior is similar on server 3), the output has huge jumps from one line to the next (seen as the spikes in the second graph, up to 1e10, i.e. +10 GB from one second to the next, far more than the average throughput of ~0.3 GB/s). This does not reflect what the NIC sees: the NIC reports a fairly stable throughput with no jumps. As an example, from one second to the next I got 52766514824 -> 61136417222, an ~8.4 GB difference.

Graph 1: MAIN.s_resp_bodybytes - t0. Graph 2: rate of MAIN.s_resp_bodybytes.

I'm serving small files (under ~1 MB) with no pipes, so I see no particular reason for such a sudden, huge change in the accounted throughput.

Is this a known issue?

Varnish Cache version

Varnish 7.3.0

@nigoroll (Member) commented:

We have an optimization in varnish-cache to push statistics of worker threads only when they are idle. Thus, the behavior you describe is expected if they are (mostly) fully loaded. The statistics should even out if the rate is calculated over longer time frames.

There is the busy_stats_rate feature flag to also push statistics of busy threads. Could you please check whether the observed behavior changes with this flag enabled?
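
For example (a minimal sketch; both forms use the standard parameter interface):

# Enable the flag on a running instance
varnishadm param.set feature +busy_stats_rate

# Or pass it at startup
varnishd ... -p feature=+busy_stats_rate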

@delthas (Contributor, Author) commented Feb 13, 2024

OK, will try that. Thanks for the quick reply.

@delthas (Contributor, Author) commented Feb 14, 2024

Issue not fixed after varnishadm param.set feature +busy_stats_rate; do I need to restart or is this applied on the fly?

@nigoroll (Member) commented:

bugwash: @walid-git noticed that busy_stats_rate is not implemented for H2 and wants to work on this feature.

@delthas confirmed that the "second server" showing the big jumps might actually be serving mostly H2.

@delthas also asked about the performance impact of busy_stats_rate and I promised an answer: The main implication is that one of the central locks, the worker pool mutex, needs to be acquired after every request. A measure of the impact is LCK.wq.dbg_busy with the debug=+lck bit set: Compare the rate of this statistic with and without busy_stats_rate. Using more pools should lower contention on this lock.
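
A sketch of that measurement (the counter, debug bit and parameter names are as described above; the sampling loop and pool count are illustrative):

# Enable lock instrumentation so dbg_busy is counted
varnishadm param.set debug +lck

# Sample how often the worker pool mutex is found busy;
# compare the rate with and without busy_stats_rate enabled
while sleep 1; do varnishstat -1 | grep LCK.wq.dbg_busy; done

# If contention grows, more worker pools spread the load (illustrative value)
varnishadm param.set thread_pools 4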

walid-git added a commit to walid-git/varnish-cache that referenced this issue Feb 21, 2024
@walid-git (Member) commented:

@delthas can you give #4059 a try?
Note that this has only been tested locally, and I did not measure the performance impact.
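
In case it helps, one way to test the PR branch locally (a sketch; the build steps and branch name are assumptions, the pull ref is GitHub's standard pull/<id>/head):

# Fetch and build the proposed change from PR #4059
git clone https://github.com/varnishcache/varnish-cache.git
cd varnish-cache
git fetch origin pull/4059/head:pr-4059
git checkout pr-4059
./autogen.sh && ./configure && make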

While working on this, I found an interesting case for H1 where even having +busy_stats_rate enabled won't change anything. It can very easily be reproduced with wrk generating traffic against varnish running with default parameter values:

./wrk -t1 -c1 -d30s http://127.0.0.1:1234/

This generates traffic on a single connection for 30 seconds, and if you observe varnishstat during that time, no counter updates until the 30 seconds are done, regardless of whether +busy_stats_rate is enabled.
I did some investigation and found that in this case the stats are frequently pushed to the pool stats as expected; however, since there is only one active worker in the pool and it stays busy for the whole duration, the pool stats never get pushed to the global stats.
While I think this is unlikely to happen with real server traffic, since other workers on the same pool would push the pool stats when they finish their work, I find the behavior a bit unexpected, that is to say: waiting for a very busy worker to finish its job before pool statistics are pushed, while plenty of idle workers do nothing.
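
To see this from the outside (a sketch; the port matches the wrk example above, the counter choice is illustrative):

# Terminal 1: single-connection load for 30 seconds
./wrk -t1 -c1 -d30s http://127.0.0.1:1234/

# Terminal 2: the counter stays flat until wrk finishes
while sleep 1; do varnishstat -1 | grep MAIN.client_req; done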

walid-git added a commit that referenced this issue Feb 26, 2024