Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for waiting clients count #913

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

franciscoreinolds
Copy link

@franciscoreinolds franciscoreinolds commented Aug 5, 2023

Hello 馃憢馃徑

In this PR, I'm trying to add support to a new logging variable that would track how many clients have had to wait in order to connect to the server.

I feel like this makes sense when taking into account that there already exist {query,xact}_{count,time} / wait_time variables.

I tried looking into how these were calculated and follow the same approach for this one, but I'm of course open to feedback.

@franciscoreinolds franciscoreinolds force-pushed the fr/add-support-for-waiting-clients-count branch from bf14f6a to 8b20b7f Compare August 5, 2023 18:03
@franciscoreinolds franciscoreinolds force-pushed the fr/add-support-for-waiting-clients-count branch from 6c640f7 to 75f6e77 Compare August 6, 2023 07:16
@franciscoreinolds
Copy link
Author

franciscoreinolds commented Aug 6, 2023

Hiya @JelteF

I managed to fix most checks but there are 2 windows ones failing due to connection_count assertions failing. Since I didn't touch any code regarding that I find it weird, have you seen such behaviour before?

Edit: Apparently #923 fixed this 馃帀

@franciscoreinolds franciscoreinolds marked this pull request as ready for review August 16, 2023 17:11
@eulerto
Copy link
Member

eulerto commented Aug 22, 2023

IMO this counter doesn't help. If I obtain 1,000 wait_count that waits for 80ms and 50,000 wait_count that waits for 1ms, it seems the later indicates a bad scenario but indeed the former is worse. This counter without the time information is useless. That's why there is only wait_time available. Let's not add another counter just because other metrics already provide a counter.

@franciscoreinolds
Copy link
Author

Well, feel free to close then but I don't really understand the argument to be honest.

If I obtain 1,000 wait_count that waits for 80ms and 50,000 wait_count that waits for 1ms, it seems the later indicates a bad scenario but indeed the former is worse. This counter without the time information is useless.

But we do have the time, so what's the problem?

That's why there is only wait_time available.

What's the problem with giving people information about what's going on in their system? Just document what it does and let people decide what (if anything) they want to do with said information. Not giving people the information because one's afraid that they might misinterpret it seems like a strange concept to me.

Regardless, this is just my 2cents, if you feel like this adds no value to the project then feel free to close.

Comment on lines +78 to +79
if (wait_count > 0)
avg->wait_time = (cur->wait_time - old->wait_time) / wait_count;
Copy link
Member

@JelteF JelteF Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this counter doesn't help. If I obtain 1,000 wait_count that waits for 80ms and 50,000 wait_count that waits for 1ms, it seems the later indicates a bad scenario but indeed the former is worse. This counter without the time information is useless. That's why there is only wait_time available. Let's not add another counter just because other metrics already provide a counter.

But we do have the time, so what's the problem?

I think adding the wait_count isn't a huge problem, and to me the metric seems useful. But what I think could be considered questionable is that this changed line changes the meaning of the avg_wait_time column significantly. The docs state that the following is its meaning:

The docs currently state that avg_wait_time means this:

avg_wait_time
Average time spent by clients waiting for a server that were assigned a backend connection within the current stats_period, in microseconds (averaged per second within that period).

Honestly I think, while that seems like quite a useful metric, but I think its name is really confusing (and I'm not alone: #316). I think a better name would be per_second_wait_time. And I also think the new calculation for avg_wait_time makes much more sense given it's in line with averages calculations for the others. However, breaking backwards compatibility by changing its current meaning is not an easy thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I spoke to @jcoleman at PGConf NYC and he was also annoyed by the problem that avg_wait_time is a really strange metric, and not very useful when debugging performance issues (because you only know how much clients are waiting in total but not on average). So he had made a similar patch a while back: #727.

I prefer #727 over this PR, because it averages the wait time across all backend assignments, not just the assignments where a wait took place. Including the 0 second wait times in the average makes sure we don't over inflate the avg_wait_time when many clients are not waiting at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants