Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add load_balance_hosts parameter to db config #736

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

dpirotte
Copy link

@dpirotte dpirotte commented Jul 5, 2022

PGBouncer's default behavior is to round-robin between comma-separated
hosts on each new connection attempt. This is desirable when all hosts
are similar, such as load balancing over multiple replicas, but
undesirable when only one host out of the list is expected to
successfully connect and login. In this scenario, every other connection
attempt would hit an invalid host and fail.

The host_strategy parameter controls this host selection behavior. The
default behavior is labeled round_robin and a new strategy called
last_successful is introduced. This new strategy instructs PGBouncer
to prefer the host with most recent successful connections. When that
host fails, new connections will round-robin through the list until a
successful connection occurs.

src/loader.c Outdated
int pool_size = -1;
int min_pool_size = -1;
int res_pool_size = -1;
int max_db_connections = -1;
int dbname_ofs;
int pool_mode = POOL_INHERIT;
int host_strategy = ROUND_ROBIN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int host_strategy = ROUND_ROBIN;
enum HostStrategy host_strategy = ROUND_ROBIN;

Copy link
Member

@JelteF JelteF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution, it looks useful, postgres connection strings work in a similar way so it makes sense to support that as well.

I left some comments. And can you please rebase/merge this PR because there's a merge conflict.

src/loader.c Outdated
@@ -236,6 +241,11 @@ bool parse_database(void *base, const char *name, const char *connstr)
res_pool_size = atoi(val);
} else if (strcmp("max_db_connections", key) == 0) {
max_db_connections = atoi(val);
} else if (strcmp("host_strategy", key) == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an entry in config.md to describe its usage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation is in. Feedback welcome.

src/objects.c Outdated
Comment on lines 1072 to 1077
if (!server->pool->last_connect_failed && server->pool->db->host_strategy == LAST_SUCCESSFUL)
server->pool->rrcounter = server->pool->last_successful_rrcounter;
Copy link
Member

@JelteF JelteF Aug 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current PR seems to have the desired effect based on the tests you wrote, but I have the feeling the same could probably be achieved without the two new fields you added: server->rrcounter and pool->last_successful_rrcounter.

How about we simply increase the pool->rr_counter whenever a connection attempt fails.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I simplified the logic as you suggest and it looks better to me.

@dpirotte
Copy link
Author

@JelteF Touching base. Happy to incorporate any additional feedback on the new implementation.

src/loader.c Outdated
@@ -299,6 +309,7 @@ bool parse_database(void *base, const char *name, const char *connstr)
db->res_pool_size = res_pool_size;
db->pool_mode = pool_mode;
db->max_db_connections = max_db_connections;
db->host_strategy = host_strategy;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the host_strategy of a database can change, should we tag it as changed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good call. Yes, I think we should tag it so behavior changes after a reload. I will make that change. Thanks for the feedback!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host_strategy is now reloadable.

show databases/show pools also now include host_strategy in their output, which makes this testable and is useful on its own.

@dpirotte
Copy link
Author

dpirotte commented Nov 7, 2022

@petere @JelteF I rebased this PR and made @goksgie's suggested change above.

Please let me know if anything else needs to happen to get this into a mergeable state.

@JelteF
Copy link
Member

JelteF commented Nov 17, 2022

I think the code in the PR is fine now (apart from the merge conflict that occured due to another merged PR). But after playing with these changes a bit I found one other issue, and I'm wondering how useful this feature is without also addressing this issue:
server_login_retry creates a significant pause anytime a single node cannot be connected to.

@dpirotte How are you using this change, that it is useful in its current form?

@JelteF
Copy link
Member

JelteF commented Jun 5, 2023

Hi @dpirotte, github notified me that you rebased this on a branch of your fork. I think a rebased version of this can be merged pretty easily. Could you still explain how you use this in practice? Because when I played with it, another bug in pgbouncer was causing serious issues for me.

@dpirotte
Copy link
Author

dpirotte commented Jun 5, 2023

Hi @JelteF. Yeah, I’m working on rebasing this and another patch on top of 1.19. The code works but I haven’t updated to your new (much improved!) test framework.

The purpose here is to support faster failovers in environments where 1/N hosts in the list will be the primary and (N-1)/N hosts will be replicas and where DNS updates are potentially too slow. RDS/Aurora is one such environment, and I believe some Patroni configurations are as well.

Given the following configurations:

; pg-primary is a CNAME to whichever of pg-0 or pg-1 is the primary
current = host=pg-primary.cluster.local connect_query=‘select disconnect_self() if pg_is_in_recovery()’…
proposed = host=pg-0.cluster.local,pg-1.cluster.local host_strategy=last_successful connect_query=‘select disconnect_self() if pg_is_in_recovery()’…

After a failover, current may take some time to point to the correct new primary, i.e. due to DNS caching. To speed up failovers, proposed takes the full list of potential primaries so pgbouncer can iterate through until it finds the primary. (And, to your earlier comment, this does encourage a low value for server_login_retry.)

A key part here is the connect_query, which asserts that pgbouncer is only ever talking to a writable database. Without host_strategy, using the full list of potential primaries means that (N-1)/N new connections will fail and wait for server_login_retry before trying another host. With host_strategy=last_successful, we only fail connections until a new primary is found, and direct all new connections there until the next failure. This minimizes failed connection noise and time waiting for server_login_retry.

I have another patch that adds lightweight target_session_attrs support albeit only on PG 14+ hosts where the server informs the client of its status as a primary or standby. This replaces the connect_query above, but is not a replacement for host_strategy. One could imagine having three hosts (primary and two replicas) where you want host_strategy=last_successful for the primary db connection string but still want to load balance on the replica connection string with host_strategy=round_robin.

Another option would be to keep the round-robin behavior and make target_session_attrs loop through hosts and gracefully disconnect until it finds one matching the requested state. I prefer having both configuration options so that target_session_attrs=primary host_strategy=last_successful is opening/closing as few connections on replicas as possible.

I’ll bump once the tests are ported to pytest and we’ll go from there.

dpirotte and others added 5 commits June 7, 2023 14:57
PGBouncer's default behavior is to round-robin between comma-separated
hosts on each new connection attempt. This is desirable when all hosts
are similar, such as load balancing over multiple replicas, but
undesirable when only one host out of the list is expected to
successfully connect and login. In this scenario, every other connection
attempt would hit an invalid host and fail.

The `host_strategy` parameter controls this host selection behavior. The
default behavior is labeled `round_robin` and a new strategy called
`last_successful` is introduced. This new strategy instructs PGBouncer
to prefer the host with most recent successful connections. When that
host fails, new connections will round-robin through the list until a
successful connection occurs.
In order to maintain behavior that connections always start from the
left-most host in the host list, increment rrcounter either before
choosing a host (in the case of last_successful) or after choosing a
host (in the case of round_robin).

(This incorporates PR feedback from @JelteF.)
Previously, host_strategy could only be set at pgbouncer startup, which
is unintuitive and inconsistent with other per-database configuration.

Also, add host_strategy to both `show databases` and `show pools` output
for visibility into the current setting, and to facilitate testing that
the configuration properly reloads.
@dpirotte
Copy link
Author

dpirotte commented Jun 7, 2023

@JelteF Rebased host_strategy with pytests if you want to take a look while I sort out the macOS build.

Previously, the host_strategy tests used 127.0.0.2,127.0.0.1 had a "bad
IP" (127.0.0.2) as the first entry to force the first backend server
connection to fail and verify that `host_strategy=last_successful`
directs new connections to the second "good" entry.

MacOS doesn't treat 127.0.0.2 in the same manner as Linux by default, so
we need a different way to fail the first host in the list. Switching to
an unresolvable DNS hostname works fine on both Linux and MacOS.
doc/config.md Outdated
@@ -1060,6 +1060,20 @@ they are logged but ignored otherwise.
Set the pool mode specific to this database. If not set,
the default `pool_mode` is used.

### host_strategy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to change the name of the option to load_balance_hosts. This option got added to libpq for PG16 (by me). It seems nice to use consistently named options across projects. The last_successful option name should then be changed to disable. And lets change round_robin to round-robin, to be consistent with other arguments like verify-full and read-write.

PS. I especially think consistency is nice here, because you mentioned you also want to add target_session_attrs to PgBouncer. Since target_session_attrs and load_balance_hosts can be used nicely together, I think it's especially useful to have the same names for both Postgres and PgBouncer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, adding a random as an option would be a nice future improvement to make sure PgBouncer supports the same as libpq. But that should definitely be another follow up PR. We can merge this PR easily without the random functionality.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to change the name of the option to load_balance_hosts. This option got added to libpq for PG16 (by me).

Ha, good call. I noticed that you contributed this feature in the release notes. +100 to consistent naming with libpq. I'll take care of that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the initial naming. It seems clear. load_balance_hosts seems to be a misnomer, since no "load" and no "balancing" is involved?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the initial naming. It seems clear.

I think the initial name was a fine name too. But I think being in line with naming of libpq, jdbc and npgsql is a big advantage of the load_balance_hosts naming.

load_balance_hosts seems to be a misnomer, since no "load" and no "balancing" is involved?

The "load" is the number of connection that are made by pgbouncer to the database, and those connections are "balanced" across the different servers in a round-robin fashion (or random fashion when implemented in the future). And you can enable this balancing by using disable which will select a working host and send all traffic there.

IMHO, the old naming describes how pgbouncer chooses hosts. The new naming describes what the user will see happening to their setup by changing the setting. I think I personally like names that describe the effects instead of the method.

I incorrectly ported these tests from bash to pytest and so they were
not actually verifying that subsequent connections were reusing the last
successful host in the list. The tests now verify behavior correctly.
This feature does almost the same thing as the upcoming libpq param
called `load_balance_hosts`, so use that name for consistency and rename
the configuration options accordingly (`last_successful` => `disable`)
@dpirotte dpirotte changed the title Add host_strategy parameter to db config Add load_balance_hosts parameter to db config Jun 8, 2023
@dpirotte
Copy link
Author

dpirotte commented Jun 8, 2023

@JelteF Updated naming per discussion above.

The macOS build passes now. The mingw{32,64} builds failed a couple times but seem to pass now, so those might have been random failures. The most recent build failures look like Cirrus issues: Failed to start an instance: INTERNAL: API response null(0): An error has occurred. Failed to connect to /34.122.100.68:443.

@@ -30,6 +30,9 @@ authdb = port=6666 host=127.0.0.1 dbname=p1 auth_user=pswcheck

hostlist1 = port=6666 host=127.0.0.1,::1 dbname=p0 user=bouncer
hostlist2 = port=6666 host=127.0.0.1,127.0.0.1 dbname=p0 user=bouncer
hostlist_good_first = port=6666 host=127.0.0.1,127.0.0.3 dbname=p0 user=bouncer load_balance_hosts=disable
hostlist_bad_first = port=6666 host=unresolvable-hostname,127.0.0.1 dbname=p0 user=bouncer load_balance_hosts=disable
load_balance_hosts_update = port=6666 host=127.0.0.1,127.0.0.3 dbname=p0 user=bouncer load_balance_hosts=disable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load_balance_hosts_update database isn't used (afaict). Did you have a test in mind for it?

Lost in translation from bash tests to pytest
Comment on lines +1065 to +1066
When a comma-separated list is specified in `host`, `load_balance_hosts` controls
which entry is chosen for a new connection.
Copy link
Member

@JelteF JelteF Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if this same logic would be used when receiving multiple IP adresses from DNS. The easiest way to test this is to create lines in /etc/host. Libpq its load_balance_hosts version does this too. So if we don't implement this, we should at least make it explicit in the docs that this setting currently doesn't impact DNS based load balancing (but that this might change in the future).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I'm wondering if this feature also includes #448 or at least to have it in mind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I didn't see that PR before (I guess I didn't go more than 2 years into PR history when joining the project). Right now #448 is completely distinct from this PR:

  1. This PR does not implement the disable mode for DNS load balancing. It still uses round robin even if disable is used.
  2. This PR does not implement the random mode. PR add server_shuffle_hosts #448 implements random mode for DNS load balancing only, not for host lists (certainly because host lists were not available in PgBouncer in the time that PR was written).

Th design of the config option would allow for implementing both 1 and 2 though. I think 2 isn't required to implement for this PR. A completely missing option isn't very confusing. But I think 1 is quite confusing because the existing option does not apply to all the places that you would expect, so preferably this PR would implement 1 too. If that's hard for some reason, then I think a note in the docs is also sufficient.


disable
: A new connection continues using the same host entry until a connection
fails, after which the next host entry is chosen.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to mention that it is recommended to set server_login_retry lower than the default to ensure fast retries, because of issue #872

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants